SITIS Archives - Topic Details
Program:  SBIR
Topic Num:  AF071-091 (AirForce)
Title:  Customizable Text Extraction for Warfighters
Research & Technical Areas:  Information Systems

  STATEMENT OF INTENT: This capability will give analysts the flexibility to find actionable intelligence from unstructured textual data sources (e.g. HUMINT) in dynamically changing situations, as information needs change.
  Objective:  Develop automated techniques enabling AF users to rapidly customize open-domain Text Extraction capabilities to warfighters’ new/changing domains, with high accuracy.
  Description:  Domain customization is an important research area in Text Extraction. A domain is simply an area of interest, such as “the biomedical domain” or “land-based target identification”. Extraction systems are typically trained ahead of time to extract specific types of information objects (i.e., specific types of entities, relationships and/or events) known to be relevant to a particular domain. So for the biomedical domain, an extractor might be trained ahead of time to extract entity types like “protein names” and relationship types like “protein-protein interactions”. For the land-based target ID domain, the extractor may be trained to extract entity types like “equipment names” and relationships like “located at”. Traditional extractors that use supervised learning algorithms require training data. This consists of a few hundred documents with the predefined entities and relationships manually marked up (“annotated”). Once the system has been trained on this data, it will recognize and extract previously unseen mentions of these entity and relationship types. While this method of training an extractor is effective, it is time-consuming and expensive, and usually requires assistance from people with expertise in computational linguistics. The problem is that time-consuming, costly domain customization is not acceptable for our AF users. AF users’ information requirements are not static, nor can they always be anticipated ahead of time. A rapid, easy way to customize Text Extraction systems to new and changing information requirements is needed, without the assistance of linguistics experts. For example, intelligence analysts assessing information pertinent to Improvised Explosive Devices (IEDs) may suddenly face a new biological threat, and need the ability to rapidly customize their extractor to find information (entities, relationships, events) pertinent to that new domain.
The newly customized system must be able to perform at a high level of accuracy as well, in order to support operational capabilities. Another customization problem is that the text the extractor is applied to is not always the same type it was trained on. AF users also need a fast and easy way to customize the Text Extraction system to new types of input text. In addition, many current Text Extraction systems do not take advantage of users’ domain knowledge. This knowledge could be exploited to improve performance of extraction systems. While some systems do utilize domain knowledge, they can be overly cumbersome; they are not “light-weight” and agile enough for a real-world system. Given this background, the goals of this SBIR Topic are as follows: Develop user-centric techniques and tools to customize Text Extraction systems to new/changing domains. Customization MUST be able to be performed simply, easily, and rapidly by a user rather than a linguist. Accuracy of the resultant system must be near the state-of-the-practice. Leverage light-weight techniques to exploit users' domain knowledge to achieve even higher accuracy extraction results. At a minimum, the capability will support customizable entity and relationship extraction. Customizable entity attribute (e.g., “tall”) and event extraction are also desirable, as is customizability to new text input types, such as various types of Intelligence message traffic and open source intelligence (OSINT).

  PHASE I: Feasibility concept. Research and develop innovative techniques for rapid, easy customization of Text Extraction systems to users’ new and changing information needs, per the requirements of the SBIR Topic Description. Assess the feasibility of the various techniques explored, and determine which techniques are the most promising. Based on these results, develop the initial design for a prototype Domain-Customizable Text Extraction system, and demonstrate its application.
  
  PHASE II: Perform in-depth research and development of innovative techniques and tools for rapid, easy customization of Text Extraction systems to users’ new and changing information needs, per the requirements of the SBIR Topic Description. Focus on those domain-customization techniques found to be most promising during Phase I. Develop a prototype Domain-Customizable Text Extraction System utilizing these tools and techniques, per the initial Phase I design. Demonstrate the prototype’s capabilities using candidate actual data from operational systems. One potential operational system that this capability could support is AFRL’s Advanced Text Exploitation Assistant being developed for the Air and Space Operations Center (ATEA-AOC). ATEA-AOC may also support other operational sites such as CENTCOM, AOC Reachback, and the AF Distributed Common Ground System (DCGS). ATEA-AOC is currently being developed for domains such as Operational Assessment and Target ID. However, there is considerable interest in customizing its extraction capabilities to additional new domains, such as exploiting information pertinent to IEDs.

  DUAL USE: Military application: Rapid customization of domain-independent Text Extraction capabilities to a warfighter’s specific domain (Area of Responsibility), enabling more dynamic Battlespace Awareness in the AOC and DCGS. Commercial application: Jobs that monitor events over time (e.g., stock market analysis of buy-sell trends) or analyze relationships between events over time (e.g., political strategists, criminal investigators, researchers).

  References:
  1. Mihai Surdeanu and Sanda Harabagiu. Infrastructure for Open-Domain Information Extraction. Proceedings of the Human Language Technology Conference (HLT 2002), March 2002, San Diego, CA.
  2. Cheng Niu, Wei Li, and Rohini K. Srihari. A Bootstrapping Approach to Information Extraction Domain Porting. Adaptive Text Extraction and Mining (ATEM-04): Papers from the 2004 AAAI Workshop. The AAAI Press, Technical Report WS-04-01, 2004.

Keywords:  Information Extraction Domain Porting, Information Extraction Domain Customization

Additional Information, Corrections, References, etc:
Ref #1: available at: http://www.languagecomputer.com/papers/hlt2002.pdf
Ref #2: available at: http://www.ai.sri.com/~muslea/atem-04/niu.pdf

Questions and Answers:
Q: The system described in Ref #2 appears to already address the major requirements of this project. What capabilities should a new system focus on for this effort, beyond what seems to have been already demonstrated in the previous project?
A: Let me generalize my response to address not only the contractor/capability that you reference in your question (as I am not sure if that is appropriate), but to include all of the domain-customization capabilities that we know of through Text Extraction R&D that we are either sponsoring or familiar with. I would agree that yes, there is some very good R&D being performed in this area, by various researchers. And we see different researchers taking different approaches, so there are multiple R&D paths working towards achieving this goal. But the problem is by no means solved or a done deal. If it were, we would not be pursuing this topic.

However, the fact that there are various productive paths of research being conducted towards addressing this problem/need is a good thing. The R&D community gets to see the strengths and weaknesses of the different approaches, and can further advance those approaches that seem to be panning out well. Or, if they perceive that a given approach does not go far enough to solve the problem, they can try a totally different approach that has the potential to meet domain customization needs even better. So the point is, domain customization is still a very active area of Text Extraction research. In summary, I would say that the SBIR Topic write-up really does specify the problems that we believe need to be addressed and the goals that we want to achieve. Yes, some R&D is further along in addressing those problems and achieving those goals than others. But since the overall problem is not yet solved, we want all researchers to understand the problems that need to be addressed, the problems/the limitations we see in various existing capabilities, and the goals that we want to achieve.
Q: 1. What granularity is meant by "users' domain knowledge"?

2. Are there specific operating constraints the system must adhere to, such as computational time, space, system training time, etc.?
A: 1. I could interpret your question a couple of different ways, so hopefully I am answering your question as you meant it. If you mean, do I expect the system to incorporate the analyst's knowledge of the world to the point that the system can do in-depth complex reasoning (like an expert system or the analyst himself might), no, that's not what is meant. I mean "users' domain knowledge" more in terms of specific/explicit knowledge that can help the extraction system correctly interpret how to process text for a specialized domain, that a system trained for more generalized processing may not otherwise be able to interpret correctly. E.g., knowing that entities X, Y, and Z are all types of aircraft, or that A and B are key parts of a missile, or that u, v and w are all strains of a deadly virus of interest. Another example could be capturing known spelling variants of non-Western names, and capturing known aliases for person names (e.g., the fact that Person A has also been known to go by the totally different names B and C). Now these examples may be more simplistic than what researchers pushing the state-of-the-art of text extraction are capable of doing (or that they think may be a reasonable goal that they could accomplish as the result of the SBIR R&D). If that is the case, offerors should by all means feel free to propose more advanced capabilities for leveraging users' domain knowledge. Just keep in mind that you will need to convince us that what you propose is feasible (that the R&D has a legitimate chance of succeeding), and that if there is added risk, that there is additional pay-off that makes the added risk worth considering.

2. We haven't explicitly defined such performance requirements or constraints at this time. But our goal is to push the state-of-the-art in terms of reducing the overall time and effort that it takes to customize a system (without a significant negative impact to extraction performance). So that would include reducing both training time, and the work that has to be done up-front to develop training data to customize the system (e.g., having to manually annotate and adjudicate hundreds of documents prior to training is a real slow-down to customization). So the faster and easier, the better. Also: getting feedback from the analyst is acceptable. The key is to do it in such a way that they don't perceive it as overly burdensome or painful.
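
As a rough illustration of the kind of explicit, light-weight domain knowledge described in answer 1 (lists of terms belonging to an entity type, plus known aliases or spelling variants), the following is a minimal Python sketch. It is not taken from the topic or from any referenced system. The names `DomainKnowledge`, `add_terms`, `add_aliases`, and `extract_entities` are hypothetical, and simple case-insensitive string matching stands in for whatever extraction technique an offeror might actually propose.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class DomainKnowledge:
    """Hypothetical container for knowledge an analyst supplies directly."""
    # entity type -> known canonical names, e.g. "AIRCRAFT" -> {"Aircraft X"}
    gazetteer: Dict[str, Set[str]] = field(default_factory=dict)
    # lowercased alias or spelling variant -> canonical name
    aliases: Dict[str, str] = field(default_factory=dict)

    def add_terms(self, entity_type: str, terms: List[str]) -> None:
        self.gazetteer.setdefault(entity_type, set()).update(terms)

    def add_aliases(self, canonical: str, variants: List[str]) -> None:
        for variant in variants:
            self.aliases[variant.lower()] = canonical


def extract_entities(text: str, knowledge: DomainKnowledge) -> List[Tuple[int, str, str]]:
    """Return (offset, canonical name, entity type) for each mention found."""
    found = []
    for entity_type, terms in knowledge.gazetteer.items():
        for term in terms:
            # Search for the canonical name and every alias recorded for it.
            surfaces = [term] + [a for a, c in knowledge.aliases.items() if c == term]
            for surface in surfaces:
                pattern = r"\b" + re.escape(surface) + r"\b"
                for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                    found.append((match.start(), term, entity_type))
    return sorted(found)


# Placeholder names mirror the X/Y/A/B/C style of the answer above.
knowledge = DomainKnowledge()
knowledge.add_terms("AIRCRAFT", ["Aircraft X", "Aircraft Y"])
knowledge.add_terms("PERSON", ["Person A"])
knowledge.add_aliases("Person A", ["Person B", "Person C"])

for offset, name, entity_type in extract_entities("Person B was seen near Aircraft Y.", knowledge):
    print(offset, name, entity_type)
```

In this sketch the analyst only adds terms and aliases, with no document annotation or retraining, and a mention of "Person B" is reported under the canonical name "Person A". A real system would presumably combine such knowledge with a trained extractor rather than rely on string matching alone.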
