SITIS Topic Details

Proposals Accepted:  
Program:  SBIR
Topic Number:  A10-159 (Army)
Title:  Software Tool for Complex Biomarker Discovery
Research & Technical Areas:  Biomedical

Acquisition Program:  Office of the Principal Assistant for Acquisition
 The technology within this topic is restricted under the International Traffic in Arms Regulation (ITAR), which controls the export and import of defense-related material and services. Offerors must disclose any proposed use of foreign nationals, their country of origin, and what tasks each would accomplish in the statement of work in accordance with section 3.5.b.(7) of the solicitation.
  Objective:  Develop an innovative software tool to identify complex biomarker signatures (e.g., using multiple genes or multiple proteins) of toxicity from microarray, proteomic, and other high dimensional data.
  Description:  Many laboratories have begun screening for novel measurable molecular or biochemical alterations (biomarkers) in biological matrices, such as fluids, cells, or tissues, that occur in response to hazardous chemical exposures, other insults, and diseases. The US Army Center for Environmental Health Research (USACEHR) is using toxicogenomic (whole genome microarray assays) and toxicoproteomic [whole proteome protein mass spectrometry (MS)] methods to discover novel biomarkers of environmental and industrial toxicant exposure in model systems, including animals and cultured mammalian cells. Since biomarkers can indicate the degree of exposure, biological effects, and susceptibility to disease from toxic hazards, they have many potential applications in Force Health Protection and health surveillance. Both functional genomic and proteomic experiments generate very large amounts (gigabyte to terabyte range) of highly multivariate data, and the complexity of these data can be an impediment to their use. Moreover, in practice, the “biomarker” of toxic insult or of a complex disease state may be a protein or gene expression “signature” or constellation of responses rather than an alteration in the abundance of a single protein, RNA, or small molecule. There is a lack of efficient and robust methods, algorithms, and workflows for reliably identifying, in genomic and proteomic data, multivariate changes in RNA, small molecule, and protein abundance that may prove to be good biomarkers. The methods for complex biomarker identification that do exist are principally targeted at biomarkers for binary classification, such as cancerous/non-cancerous cells, rather than at biomarkers that reflect continuous host responses to continuous stimuli or insults, such as toxicant exposures.
An additional difficulty with the analysis of data from functional genomic and proteomic experiments arises from the large number of variables but low level of replication characteristic of such studies; this condition tends to lead to overfitting of models and lack of reproducibility in complex biomarker discovery efforts.
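
The overfitting risk described above can be illustrated with a small simulation. The following is a minimal sketch, not part of the solicitation: it assumes a hypothetical dose vector and pure-noise "genes" drawn from a standard normal distribution, and the function names are illustrative only. It shows that when there are thousands of variables and only a handful of samples, some noise variable will correlate strongly with dose by chance alone.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant sequence carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def best_noise_correlation(n_samples, n_features, seed=0):
    """Largest |r| between a dose vector and features that are pure noise.

    The dose vector (0, 1, 2, ...) and the Gaussian noise model are
    assumptions made for illustration only.
    """
    rng = random.Random(seed)
    dose = list(range(n_samples))
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, dose)))
    return best


if __name__ == "__main__":
    # Same number of noise features, different levels of replication.
    print("6 samples: ", best_noise_correlation(6, 2000))
    print("60 samples:", best_noise_correlation(60, 2000))
```

With few samples the best chance correlation is far higher than with many, which is why a candidate signature selected and evaluated on the same small data set tends not to reproduce.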

  PHASE I: USACEHR will provide genomic and/or proteomic data sets from toxicological exposures in model systems, drawn either from open-access databases or from USACEHR’s own work, and the performer will provide a preliminary prototype of a generically applicable set of computational and bioinformatic tools or an analytical pipeline for identifying biomarker constellations that show a continuous response to toxicant exposure and that distinguish between different insults (e.g., toxicants with distinct modes of action). The analytical workflow provided by the tool need not consist entirely, or even partly, of new algorithms or methods, but it must provide significantly heightened functionality. Thus, the tools may include conventional feature selection and machine-learning approaches (such as pattern recognition, artificial neural networks, and support vector machines), but must significantly extend the capabilities of such methods through the workflow, and be able to provide a read-out of continuous response.
The performer will develop improved analytical methods/pipelines for:
a. Extracting biological signals from continuous transcriptomic and proteomic datasets.
b. Selecting and ranking biological features (potential biomarkers).
c. Validating and verifying complex biomarkers.
Software developed for the Phase I period may run in any environment convenient for development.
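
As one possible illustration of item b (selecting and ranking features) combined with the requirement for a continuous read-out, features could be ranked by how monotonically they track dose. The sketch below uses Spearman rank correlation, which is a conventional choice made here for illustration and not a method named by the topic. The function names, gene names, and data values are all hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_features_by_dose_response(expression, dose):
    """Return (feature, rho) pairs sorted by decreasing |rho|.

    expression maps a feature name to its values across samples, in the
    same sample order as dose.
    """
    scored = [(name, spearman(values, dose)) for name, values in expression.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    dose = [0, 0, 1, 1, 5, 5, 10, 10]  # hypothetical dose levels
    expression = {  # hypothetical expression values
        "gene_up": [1.0, 1.1, 1.9, 2.2, 4.8, 5.1, 9.0, 9.5],
        "gene_flat": [3.0, 3.3, 3.1, 2.9, 3.2, 2.8, 3.05, 3.15],
        "gene_down": [9.0, 8.5, 7.0, 7.2, 4.0, 3.8, 1.0, 1.2],
    }
    for name, rho in rank_features_by_dose_response(expression, dose):
        print(name, round(rho, 3))
```

A ranking of this kind addresses only single features. A complex biomarker would combine several of them, and, given the low replication noted in the Description, any selection step would need to be validated on samples that were not used for the ranking.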

  PHASE II: The prototype will be further developed and validated using data provided by USACEHR from either USACEHR’s work or from public databases. It is expected that the product will distinguish between host responses to distinct experimental insults with high specificity and sensitivity, and additionally provide a numerical indication of the severity of the response. The finished Phase II prototype will meet all Army requirements for obtaining an Army Certificate of Networthiness (CON) as well as specifications for software development and documentation detailed in the Army Directorate of Information Assurance Security Technical Implementation Guides or comparable standards prevailing at the time of Phase II completion.
The contractor will:
a. Provide a reproducible copy of the software with all documentation required above.
b. Participate, if necessary, in the CON process.
c. Provide USACEHR personnel with sufficient training or instructions, in printed or digital format, for utilizing the software on USACEHR equipment.
d. Provide a description of the performance and function of the software sufficiently detailed to meet the requirements for publication of the biological analysis of the datasets in peer-reviewed scientific journals.
e. Provide the results obtained by the performer from analyzing the test datasets in a format appropriate for manipulation for manuscript preparation and archiving.
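
For reference, the sensitivity and specificity named in the Phase II expectation can be computed per insult in a one-versus-rest fashion. The sketch below assumes hypothetical insult labels and a hypothetical function name; it is not a prescribed evaluation protocol.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive):
    """One-versus-rest sensitivity and specificity for the class `positive`.

    Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    Returns NaN for a ratio whose denominator is zero.
    """
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical insult labels for six samples.
    truth = ["A", "A", "A", "B", "B", "C"]
    predicted = ["A", "A", "B", "B", "A", "C"]
    print(sensitivity_specificity(truth, predicted, positive="A"))
```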

  PHASE III: The tool for biomarker identification and validation will be applicable to a wide array of biomarker discovery problems. The tool will be useful for developing panels of biomarkers for environmental risk assessment for civilian and military chemicals, for screening pharmaceuticals, potentially for identifying biomarkers of pathogen exposure and infection, and for occupational health surveillance. Moreover, the biomarkers developed using the tool would be expected to have civilian applications in homeland security and disaster response, as chemical spills and accidents often require biomonitoring of first responders. The application of the tool will depend chiefly on the particular collection of input data. The number of possible applications for the technology is limited only by the reliability of the input data and the design of the validation studies.

  References:  
1. Anderson, N.L. The roles of multiple proteomic platforms in a pipeline for new diagnostics. Molecular and Cellular Proteomics. 4:1441-1444 (2005).

2. Conrads, T.P. et al. High-resolution serum proteomic features for ovarian cancer detection. Endocrine-Related Cancer. 11:163-78 (2004).

3. Edelman, L.B. et al. Two-transcript gene expression classifiers in the diagnosis and prognosis of human diseases. BMC Genomics. 10:583 (2009).

4. Li, L. et al. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics. 17:1131-42 (2001).

5. Ryan, P.B. et al. Using biomarkers to inform cumulative risk assessment. Environmental Health Perspectives. 115:833-40 (2007).

6. Saeys, Y. et al. A review of feature selection techniques in bioinformatics. Bioinformatics. 23:2507-17 (2007).

7. Vissers, J.P.C. et al. Analysis and Quantification of Diagnostic Serum Markers and Protein Signatures for Gaucher Disease. Molecular and Cellular Proteomics. 6:755-766 (2007).

8. Xu, M. et al. A stable iterative method for refining discriminative gene clusters. BMC Genomics. 9(Suppl 2):S18. doi:10.1186/1471-2164-9-S2-S18 (2008).

Keywords:  bioinformatics, software, biomarkers, proteomics, genomics

Questions and Answers:
Q: NOTE: Clarification from TPOC in response to FAQs about topic A10-159:

Q: What does the data set for training/algorithm development look like?

A. We propose to use Gene Expression Omnibus data set accession # GSE8858. A relevant open-access article may be found at:
http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0008126
Q: Additional information from TPOC in response to FAQs received during Pre-Release:

Q1. Can you speak to the number of dimensions that some sample data sets may have?

A1. The sample data sets referenced have approximately 10,000 genes/microarray. The software tool is intended for identifying complex biomarkers in "high content" data; data sets could contain a few hundred variables to several tens of thousands of variables.

Q2. You mention that prior methods are binary, e.g. either a cell is cancerous or it is not. It seems that looking for the presence or lack of a particular biomarker is also binary. Are you stating that you would be looking at tracking a biomarker that may change over time, but you want to maintain a reference to the same biomarker? Can you speak more to how the biomarker might change, or what changes in response you want to track?

A2. Organisms typically present with graded responses to toxic insults over time and dose; hence, patterns of gene or protein or small metabolic molecule expression (complex biomarkers) will be expected to vary with time and dose. The ability to detect subclinical responses is of particular interest.

Q3. You mention the propensity for over-fitting models due to low level of replication characteristic of the studies you're doing. Does the nature of the data preclude the use of dimension reduction techniques such as principal components analysis or multidimensional scaling?

A3. Both principal components analysis and multidimensional scaling are commonly used with this type of data.
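
As a minimal sketch of the first technique mentioned in A3, the code below computes the first principal component of a samples-by-features table. It is an illustration under stated assumptions and not a recommended implementation: the function name is hypothetical, and power iteration on the small samples-by-samples Gram matrix stands in for a full eigendecomposition, which keeps the computation cheap when there are far more features than samples.

```python
import math


def first_principal_component(samples, n_iter=200):
    """Return (scores, explained_fraction) for the first principal component.

    samples is a list of rows, one per sample, each a list of feature values.
    The sign of the returned scores is arbitrary, as in any PCA.
    """
    n = len(samples)
    p = len(samples[0])
    # Center every feature (column) across samples.
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    # Gram matrix of inner products between centered samples (n x n).
    gram = [[sum(a * b for a, b in zip(ri, rk)) for rk in centered]
            for ri in centered]
    total = sum(gram[i][i] for i in range(n))  # total variance (unnormalized)
    if total == 0:
        return [0.0] * n, 0.0
    # Start from a non-constant unit vector: the constant vector is always
    # in the null space of the Gram matrix of centered data.
    v = [float(i + 1) for i in range(n)]
    length = math.sqrt(sum(x * x for x in v))
    v = [x / length for x in v]
    eigenvalue = 0.0
    for _ in range(n_iter):
        w = [sum(g * x for g, x in zip(row, v)) for row in gram]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
        eigenvalue = norm  # approaches the largest eigenvalue as v converges
    scores = [x * math.sqrt(eigenvalue) for x in v]
    return scores, eigenvalue / total


if __name__ == "__main__":
    # Hypothetical data: three features that all scale with one dose-like factor.
    data = [[d, 2 * d, -d] for d in range(5)]
    scores, fraction = first_principal_component(data)
    print(scores, fraction)
```

Dimension reduction of this kind is unsupervised, so it does not by itself guard against overfitting in any subsequent supervised selection step.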

Q4. Interactive methods can be employed so that the user is really directing machine learning. Would such methods and the use of visualization be of interest, or is the interest to utilize a completely automated and non-visual process?

A4. We imagined a highly automated process.

Q: Are there any issues using open source software (such as the software packages R or WEKA) for the development of algorithms and/or the implementation of the deliverables for the Phase I/II?
A: There are no constraints on the platform used for development of the algorithm/software. However, the Phase II prototype software must meet Army standards for deployment on networked Army computers. In general, that requirement will preclude using open source software for the prototype.
Please see http://iase.disa.mil/stigs/checklist/index.html for more details.

As of midnight September 1, questions for solicitations SBIR 10.3 and STTR 10.B will no longer be accepted.
