SITIS Archives - Topic Details
Program:  SBIR
Topic Num:  AF071-079 (Air Force)
Title:  Non-Language Speech Sound Detection
Research & Technical Areas:  Information Systems

  Objective:  Develop and test algorithms to identify and eliminate non-language speech sounds as a pre-processing stage to improve audio processing.
  Description:  Non-language speech sounds (NLSS) (laughter, coughing, grunting, sighing, breathing, and clicking, as well as back-channel sounds such as mhm, hmm, and unhuh) make up a very large part of natural human language use. The current state of the art in speech preprocessing has focused almost entirely on segmenting human speech from noise. However, these human non-speech sounds also cause significant problems for automatic processing of speech, such as speech recognition, language identification, and speaker recognition. In fact, contemporary speech recognition training data preparation requires hand-labelling of NLSS, a very time-consuming and costly process. The focus of this effort will be to develop and test algorithms that automatically identify and eliminate non-language speech sounds as a pre-processing stage to improve audio processing. The algorithms must be able to accommodate multiple channel conditions and speakers.

  PHASE I: Develop focused approaches to identifying and removing NLSS that go beyond current speech segmentation algorithms, which do not accommodate interfering human noise. Develop and provide a conceptual design and breadboard a demonstration of the technology.
  
  PHASE II: Based on the Phase I research, develop and expand various approaches to increasing the accuracy of the NLSS algorithm. Capitalize on all data from the earlier research to finalize a design and characterize the software to be built. Define the test requirements to evaluate the accuracy and suitability of the prototype for the identification of NLSS in audio.

  DUAL USE COMMERCIALIZATION: Military application: This technology provides a value-added benefit to speech recognition algorithms, speech-to-speech machine translation, audio data mining and information extraction, and speaker and language recognition. Commercial application: This technology provides the same value-added benefit to commercial speech recognition, speech-to-speech machine translation, audio data mining and information extraction, and speaker and language recognition systems.

  References:  1. Jakobson, R. (1995). On Language. Ed. L. R. Waugh & M. Monville-Burston. Cambridge, MA: Harvard University Press. 2. Barlow, A. R. (1993). Language-Specific and Universal Aspects of Vowel Production and Perception: A Cross-Linguistic Study of Vowel Inventories. Ithaca, NY: CLC Publications. 3. Kornai, A. (1999). Extended Finite State Models of Language. Cambridge: Cambridge University Press.

Keywords:  Speech processing, human language technologies, audio processing, speech segmentation

Additional Information, Corrections, References, etc:
Ref #1: Available through interlibrary loan or online document delivery services.
Ref #2: Available through online document delivery services.
Ref #3: Available through interlibrary loan or online document delivery services.

Questions and Answers:
Q: Is there an audio database in mind, or should the firm generate its own for prototype testing?
A: AFRL will provide the representative data for all training and testing. It will be provided as audio files with phrase-level transcripts.
Q: 1. Do you want a system that segments language speech sounds (LSS) from NLSS and background noise, or a system that just detects NLSS?

2. Should the NLSS detected be categorized (cough, breath, backchannel, etc.) or do you just want a binary decision on whether a segment corresponds to an NLSS or not?
A: 1. AFRL is looking for a system that accomplishes the segmentation as well as the detection of NLSS. Determining the times (either in sample numbers or in seconds) that the beginning and end of the segment occurred would be appropriate.

2. No, it does not need to be categorized. Just the binary decision is adequate.
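The answer above asks for segment boundaries reported either in sample numbers or in seconds. As an illustration only (this is not part of the topic specification), the following sketch shows one way a detector's frame-level binary NLSS decisions could be merged into (start, end) sample offsets; the function name, the 10 ms frame size, and the 16 kHz rate (160 samples per frame) are all assumptions.

```python
# Hypothetical post-processing step: merge per-frame binary NLSS decisions
# into (start_sample, end_sample) segments, as suggested in the Q&A above.
# The 160-samples-per-frame default assumes 10 ms frames at 16 kHz audio.

def frames_to_segments(decisions, samples_per_frame=160):
    """Merge runs of consecutive True frames into sample-offset segments.

    decisions         : sequence of bool, one NLSS decision per frame
    samples_per_frame : frame length in samples (assumed, not specified by AFRL)
    returns           : list of (start_sample, end_sample) tuples
    """
    segments = []
    start = None
    for i, is_nlss in enumerate(decisions):
        if is_nlss and start is None:
            start = i                      # a new NLSS run begins here
        elif not is_nlss and start is not None:
            segments.append((start * samples_per_frame, i * samples_per_frame))
            start = None                   # run ended on the previous frame
    if start is not None:                  # run extends to the final frame
        segments.append((start * samples_per_frame,
                         len(decisions) * samples_per_frame))
    return segments

# Example: a cough spanning frames 2-4 and a breath at frames 7-8
decisions = [False, False, True, True, True, False, False, True, True, False]
print(frames_to_segments(decisions))
# -> [(320, 800), (1120, 1440)]
```

Dividing each offset by the sample rate would yield the equivalent boundaries in seconds, the alternative unit the answer allows.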
