SITIS Archives - Topic Details
Program:  SBIR
Topic Num:  N102-176 (Navy)
Title:  Disambiguation of Entity Association Statements
Research & Technical Areas:  Information Systems, Human Systems

Acquisition Program:  PM Intel
  Objective:  Advances have been made in our ability to express large, disparate unstructured data sources (e.g. text, images, audio) as connected entity graphs in the Resource Description Framework (RDF). There remain practical problems, however, in working with the large RDF data stores that can easily be generated from even modest-sized data stores. Due to entity and association uncertainty, current implementations of RDF data stores become filled with redundant statements, preventing the expression of a large data corpus as one connected graph. The objective of this topic is to develop algorithms and techniques for level 1 fusion of association statements in a large RDF data store.
  Description:  A number of technologies and systems today perform named entity disambiguation in support of entity extraction on structured and unstructured data. Reference sets are used to resolve ambiguity and return results. Other implementations may be as basic as providing a list of possible results and leaving the information seeker to resolve the ambiguity, as is done in Wikipedia. It has proved beneficial to anchor terms in a recognized vocabulary and ontology. For instance, WordNet offers a lexical database of words for the English language. This topic seeks to tailor or develop disambiguation algorithms that can be effectively applied to large RDF graphs. The objective of a large RDF data store is to enable the warfighter or analyst to find everything known about a specific entity (person, group, place, object, event/behavior) rapidly and accurately. The search strategies depend on a level of clustering of related RDF that has not been demonstrated to date. Redundant entities and associations cause broken connections. The goal of this topic is to develop and demonstrate a new class of level 1 fusion (disambiguation) algorithms that can be applied to large RDF data stores. Offerors may examine tagging RDF if that is found to support the overall objective. The offeror can also use information contained within the triple itself. Research is needed to expand entity disambiguation concepts into the domain of large association (RDF) data stores. The offeror needs to assume that an RDF data store of interest is populated with disparate statements derived from a wide variety of data stores. The goal of the topic is to generate a single connected graph from large RDF that contains no redundant entities and no missed connections.
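As an illustration only, the fusion goal described above (collapsing redundant entities so that their statements join into one graph) can be sketched with a union-find structure. This is a minimal sketch under stated assumptions, not a method prescribed by the topic: triples are modeled as plain string tuples, the `ex:` identifiers are hypothetical, and the `same_as` pairs are assumed to come from some upstream matcher that is out of scope here.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings


def find(parent: Dict[str, str], x: str) -> str:
    """Return the canonical representative of x, compressing the path."""
    parent.setdefault(x, x)
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def fuse(triples: Iterable[Triple],
         same_as: Iterable[Tuple[str, str]]) -> Set[Triple]:
    """Rewrite every entity to its canonical ID and drop duplicate statements.

    Assumption: same_as holds pairs of IDs already judged to denote the
    same real-world entity. Objects that are not in any pair (including
    literals) are left unchanged.
    """
    parent: Dict[str, str] = {}
    for a, b in same_as:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            # Keep the lexicographically smaller ID so the result is deterministic.
            keep, drop = sorted((ra, rb))
            parent[drop] = keep
    # A set removes statements that become identical after rewriting.
    return {(find(parent, s), p, find(parent, o)) for s, p, o in triples}


# Hypothetical data: ex:p1 and ex:p2 are two records of the same person.
triples = [
    ("ex:p1", "ex:memberOf", "ex:g1"),
    ("ex:p2", "ex:memberOf", "ex:g1"),
    ("ex:p2", "ex:seenAt", "ex:loc9"),
]
fused = fuse(triples, [("ex:p1", "ex:p2")])
```

After fusion the two `ex:memberOf` statements collapse into one, and the `ex:seenAt` statement attaches to the surviving entity, which is the "no redundant entities, no missed connections" outcome in miniature.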
In order to achieve the necessary information refinement on a topic and support evolution of RDF knowledge bases, examples of areas which need to be addressed include: 1) dealing with entity uncertainty, 2) dealing with entity information from different knowledge bases that results in a contradiction, 3) creation and updating of statements regarding an entity in a knowledge base or common feature space that do not contradict existing statements on that entity, and 4) deletion of an entity or entity statements without breaking other associations that may refer to that entity. The Navy is interested in innovative R&D that involves technical risk. Proposed work should have technical and scientific merit. Creative solutions are desired.
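One of the areas listed above, contradictory entity information, can be illustrated with a simple check. The sketch below assumes that some predicates are declared single-valued (in the spirit of a functional property) and flags any entity that carries more than one value for such a predicate. The predicate names and data are hypothetical, and real contradictions may be far subtler than this.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def find_contradictions(
    triples: Iterable[Triple],
    functional_predicates: Set[str],
) -> Dict[Tuple[str, str], Set[str]]:
    """Return {(subject, predicate): values} where a single-valued
    predicate has more than one distinct value for the same subject."""
    values: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for s, p, o in triples:
        if p in functional_predicates:
            values[(s, p)].add(o)
    return {key: vals for key, vals in values.items() if len(vals) > 1}


# Hypothetical statements merged from two knowledge bases.
statements = [
    ("ex:p1", "ex:birthYear", "1970"),
    ("ex:p1", "ex:birthYear", "1972"),   # conflicts with the line above
    ("ex:p1", "ex:alias", "Al"),
    ("ex:p1", "ex:alias", "Alex"),       # multi-valued, so not a conflict
]
conflicts = find_contradictions(statements, {"ex:birthYear"})
```

The same check could be run before inserting or updating a statement, so that a new value is rejected or queued for review rather than silently contradicting an existing one.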

  PHASE I: Develop algorithms that can identify redundant statements and missed connections in a large RDF data store. Measure and show clear progress in RDF statement disambiguation and in fixing missed connections. Perform a proof of concept against a data store containing tens of thousands of statements. Results from the model development and tests are to be documented in a technical report and presented at a selected conference.
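One possible way to measure progress on missed connections, offered here as an assumption rather than a required metric, is to count the connected components of the entity graph before and after disambiguation; fewer components means the corpus is closer to a single connected graph. The sketch treats each triple as an undirected edge between subject and object and uses hypothetical `ex:` identifiers.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def count_components(triples: Iterable[Triple]) -> int:
    """Count connected components, treating each triple as an undirected
    edge between its subject and its object."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for s, _, o in triples:
        adjacency[s].add(o)
        adjacency[o].add(s)

    seen: Set[str] = set()
    components = 0
    for node in adjacency:
        if node in seen:
            continue
        components += 1
        # Breadth-first search marks everything reachable from this node.
        queue = deque([node])
        seen.add(node)
        while queue:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return components


# Hypothetical store before and after ex:p2 is recognized as ex:p1.
before = [
    ("ex:p1", "ex:memberOf", "ex:g1"),
    ("ex:p2", "ex:seenAt", "ex:loc9"),
]
after = [
    ("ex:p1", "ex:memberOf", "ex:g1"),
    ("ex:p1", "ex:seenAt", "ex:loc9"),
]
components_before = count_components(before)
components_after = count_components(after)
```

A component count alone does not show whether the merges were correct, so it would normally be paired with precision and recall against a hand-labeled sample.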
  PHASE II: Produce a prototype system that is scalable to very large data stores. The prototype system will be able to automatically process and display/catalog information on numerous user-defined topics in near real-time. The model(s) and techniques are to include forms of data other than text, including audio and image sources. Context-based tie points that can be developed on text, audio, and images will be demonstrated in the prototype. The prototype should be a software application that is compatible with a service oriented architecture and demonstrated against real tactical data sources (secret level).

  PHASE III: Produce a system capable of deployment and operational evaluation. The system should address topics or themes that are specific to developing a terrorist threat assessment or identification of techniques, tactics, and procedures based on system developed tie points. Tie points will be presented in human understandable form. The system should be modified to operate in accordance with guidelines provided by a program of record.

  PRIVATE SECTOR COMMERCIAL POTENTIAL/DUAL-USE APPLICATIONS: There are many commercial applications including credit card fraud detection, business activity monitoring, and security monitoring that would benefit from advanced data enterprise library services. Presently, there is a strong need to protect military and civilian personnel from terrorist attack by analyzing large data stores. To facilitate interoperability, the systems should operate in a net-centric environment and provide reliable performance. Commercial value and cost savings are achieved by operation in a distributed service oriented architecture with other applications.

  References:
  1. Paolo Bouquet, Luciano Serafini, and Heiko Stoermer. “Introducing Context into RDF Knowledge Bases”, in Proceedings of SWAP 2005, the 2nd Italian Semantic Web Workshop, Trento, Italy, December 14-16, 2005. CEUR Workshop Proceedings, ISSN 1613-0073. http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-166/70.pdf
  2. Barbara Bazzanella, Paolo Bouquet, and Heiko Stoermer. “A Cognitive Contribution to Entity Representation and Matching”, Technical Report DISI-09-004, Ingegneria e Scienza dell'Informazione, University of Trento. 2009. http://eprints.biblio.unitn.it/archive/00001540/
  3. Deepak P, Jyothi John, Sandeep Parameswaran. “Context Disambiguation in Web Search Results”, in Proceedings of the IEEE International Conference on Web Services 2004 (ICWS’04) 0-7695-2167-3/04.
  4. Barry Smith, Lowell Viznor, and James Schoening. “Universal Core Semantic Layer”, OIC2009. http://c4i.gmu.edu/OIC09/papers/OIC2009_5_SmithEtAll.pdf

Keywords:  correlation; data fusion; terrorist threats; human language; entity disambiguation; entity extraction

Questions and Answers:
Q: 1. Does the disambiguation process work with any type of RDF statements, or OWL/RDF only?
2. May the application services using the resulting disambiguated RDF store query it via SPARQL?
A: 1. An exclusively OWL/RDF solution would be preferable.
2. Yes, SPARQL is acceptable.
Q: 1. How far behind can the disambiguation process fall as new data is inserted into the RDF store?
2. Does the algorithm have to work in NRT?
A: 1. It would be most valuable to have disambiguation occur within an "actionable" amount of time.
2. Real-time may not be computationally or architecturally feasible. Again, "actionable" is perhaps a more applicable term vs. NRT. The proposed solution should make every effort to avoid unnecessary lag, e.g. human in the loop.
Q: 1. Are you looking for a specific domain solution or a cross-domain solution? For example, cargo, cyber threats, individuals, ship tracks, etc.
2. Also, is there general knowledge sample data available, or can you provide sample data?
A: 1. The solution should be cross-domain within reason. For instance, disambiguation could occur between an entity identified by biometric profile, and an entity identified in unstructured text.
2. We would prefer you to identify your own sample data for Phase I.
Q: In the Objective statement, you refer to level 1 fusion of association statements. Does fusion refer to multiple, disparate data sources, or multiple entities within a single data source?
A: Disambiguation should occur both across disparate sources and within a single source.

Q: To what degree are the sources of RDF statements characterized in the RDF graph, and is this provenance data for statements included in the graph?
A: The provenance of RDF statements may be explicitly described with reification as you select or create your data set.
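For readers unfamiliar with reification, the sketch below shows how a single triple can be described by a statement node using the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`). The use of `dcterms:source` as the provenance predicate, the statement ID, and the source name are assumptions made for illustration; the answer above does not specify them.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def reify(triple: Triple, statement_id: str, source: str) -> List[Triple]:
    """Describe `triple` with a reified statement node and attach a source.

    The four rdf:* terms are the standard RDF reification vocabulary.
    dcterms:source is an assumed choice of provenance predicate.
    """
    s, p, o = triple
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", s),
        (statement_id, "rdf:predicate", p),
        (statement_id, "rdf:object", o),
        (statement_id, "dcterms:source", source),
    ]


# Hypothetical statement extracted from a hypothetical report.
provenance = reify(
    ("ex:p1", "ex:memberOf", "ex:g1"),
    statement_id="ex:stmt42",
    source="ex:report_2010_07",
)
```

Keeping the source on the statement node rather than on the entity means that, when two entities are merged, each surviving statement still records where it came from.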
Q: What, if any, upper level or domain specific ontologies are being used to ground the terms in the RDF?
A: Something similar to UCORE-SL would be a reasonable upper-level ontology to begin with.
Q: Is there a need to maintain traceability for additions, deletions and modifications to the original RDF data? For example, why an entity was removed, what if anything replaced it, and when.
A: You may want to consider versioning of the RDF datastore as external to this problem statement. The focus should be on entity disambiguation, but you are welcome to describe an approach for traceability.
