1
|
Boulesteix AL, Wilson R, Hapfelmeier A. Towards evidence-based computational statistics: lessons from clinical research on the role and design of real-data benchmark studies. BMC Med Res Methodol 2017; 17:138. [PMID: 28888225 PMCID: PMC5591542 DOI: 10.1186/s12874-017-0417-2] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Received: 05/18/2017] [Accepted: 08/31/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The goal of medical research is to develop interventions that are in some sense superior, with respect to patient outcome, to interventions currently in use. Similarly, the goal of research in methodological computational statistics is to develop data analysis tools that are themselves superior to the existing tools. The methodology of the evaluation of medical interventions continues to be discussed extensively in the literature and it is now well accepted that medicine should be at least partly "evidence-based". Although we statisticians are convinced of the importance of unbiased, well-thought-out study designs and evidence-based approaches in the context of clinical research, we tend to ignore these principles when designing our own studies for evaluating statistical methods in the context of our methodological research. MAIN MESSAGE In this paper, we draw an analogy between clinical trials and real-data-based benchmarking experiments in methodological statistical science, with datasets playing the role of patients and methods playing the role of medical interventions. Through this analogy, we suggest directions for improvement in the design and interpretation of studies which use real data to evaluate statistical methods, in particular with respect to dataset inclusion criteria and the reduction of various forms of bias. More generally, we discuss the concept of "evidence-based" statistical research, its limitations and its impact on the design and interpretation of real-data-based benchmark experiments. CONCLUSION We suggest that benchmark studies, a method of assessment of statistical methods using real-world datasets, might benefit from adopting (some) concepts from evidence-based medicine towards the goal of more evidence-based statistical research.
Affiliation(s)
- Anne-Laure Boulesteix
- Institute for Medical Information Processing, Biometry and Epidemiology, LMU Munich, Marchioninistr. 15, Munich, 81377, Germany.
- Rory Wilson
- Institute for Medical Information Processing, Biometry and Epidemiology, LMU Munich, Marchioninistr. 15, Munich, 81377, Germany
- Alexander Hapfelmeier
- Institute of Medical Statistics and Epidemiology, Technical University Munich, Ismaninger Str. 22, Munich, 81675, Germany
|
2
|
Liang Y, Kelemen A, Tayo B. Model-based or algorithm-based? Statistical evidence for diabetes and treatments using gene expression. Stat Methods Med Res 2016; 16:139-53. [PMID: 17484297 DOI: 10.1177/0962280206071927] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/16/2022]
Abstract
Gene expression profiles obtained from samples of diabetic and normal rats with and without treatments can be used to identify genes that distinguish normal and diabetic individuals and also to evaluate the effectiveness of drug treatments. This study examines changes in global gene expression in rat muscle caused by streptozotocin-induced diabetes and vanadyl sulfate treatment. We explored model-based and algorithm-based methods with gene screening measures for microarray gene expression data to classify and predict individuals with high risk of diabetes. Results show that the mixed ANOVA model-based approach provides an efficient way to conduct an investigation of the inherent variability in gene expression data and to estimate the effects of experimental factors such as treatments and diseases and their interactions. The algorithm-based weighted voting and neural network classifiers show good classification performance for the diabetes and treatment groups. Although neural network performs better than weighted voting with higher classification rate, the interpretation of weighted voting is more straightforward. The study indicates that the choice of the gene selection procedure is at least as important as the choice of the classification procedure. We conclude that both mixed model-based and algorithm-based approaches provide the statistical evidence of the biological hypotheses that vanadyl sulfate treatment of diabetic animals restores gene expression patterns to normal. Although model-based and algorithm-based methods provide different strengths and perspective for the analysis of the same set of data, in general both can be considered and developed for analyzing factorial design experiments with multiple groups and factors. This study represents a major step towards the discovery of responsible genes related to diabetes and its treatment.
Affiliation(s)
- Yulan Liang
- Department of Biostatistics, The State University of New York, Buffalo 14214, USA.
|
3
|
A Review: Proteomics in Nasopharyngeal Carcinoma. Int J Mol Sci 2015; 16:15497-530. [PMID: 26184160 PMCID: PMC4519910 DOI: 10.3390/ijms160715497] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Received: 05/17/2015] [Revised: 06/08/2015] [Accepted: 07/01/2015] [Indexed: 12/24/2022] Open
Abstract
Although radiotherapy is generally effective in the treatment of nasopharyngeal carcinoma (NPC), approximately 20% of patients are radioresistant. The identification of blood or biopsy biomarkers that can predict radioresistance and diagnose early stages of NPC would therefore be highly useful. Proteomics is widely used in NPC to search for biomarkers and to compare differentially expressed proteins. This review gives an overview of proteomics studies on different NPC-related samples and of common proteomics methods. In conclusion, identical proteins are ranked as follows: keratin ranks highest, followed by annexin, heat shock protein, 14-3-3σ, nm-23 protein, cathepsin, heterogeneous nuclear ribonucleoproteins, enolase, triosephosphate isomerase, stathmin, prohibitin, and vimentin. This ranking indicates that these proteins may be NPC-related and have potential value for further studies.
|
4
|
Matamala N, Vargas MT, González-Cámpora R, Miñambres R, Arias JI, Menéndez P, Andrés-León E, Gómez-López G, Yanowsky K, Calvete-Candenas J, Inglada-Pérez L, Martínez-Delgado B, Benítez J. Tumor microRNA expression profiling identifies circulating microRNAs for early breast cancer detection. Clin Chem 2015; 61:1098-106. [PMID: 26056355 DOI: 10.1373/clinchem.2015.238691] [Citation(s) in RCA: 152] [Impact Index Per Article: 15.2] [Received: 01/28/2015] [Accepted: 05/07/2015] [Indexed: 01/20/2023]
Abstract
BACKGROUND The identification of novel biomarkers for early breast cancer detection would be a great advance. Because of their role in tumorigenesis and stability in body fluids, microRNAs (miRNAs) are emerging as a promising diagnostic tool. Our aim was to identify miRNAs deregulated in breast tumors and evaluate the potential of circulating miRNAs in breast cancer detection. METHODS We conducted miRNA expression profiling of 1919 human miRNAs in paraffin-embedded tissue from 122 breast tumors and 11 healthy breast tissue samples. Differential expression analysis was performed, and a microarray classifier was generated. The most relevant miRNAs were analyzed in plasma from 26 healthy individuals and 83 patients with breast cancer (36 before and 47 after treatment) and validated in 116 healthy individuals and 114 patients before treatment. RESULTS We identified a large number of miRNAs deregulated in breast cancer and generated a 25-miRNA microarray classifier that discriminated breast tumors with high diagnostic sensitivity and specificity. Ten miRNAs were selected for further investigation, of which 4 (miR-505-5p, miR-125b-5p, miR-21-5p, and miR-96-5p) were significantly overexpressed in pretreated patients with breast cancer compared with healthy individuals in 2 different series of plasma. MiR-505-5p and miR-96-5p were the most valuable biomarkers (area under the curve 0.72). Moreover, the expression levels of miR-3656, miR-505-5p, and miR-21-5p were decreased in a group of treated patients. CONCLUSIONS Circulating miRNAs reflect the presence of breast tumors. The identification of deregulated miRNAs in plasma of patients with breast cancer supports the use of circulating miRNAs as a method for early breast cancer detection.
Affiliation(s)
- Eduardo Andrés-León
- Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Gonzalo Gómez-López
- Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Lucía Inglada-Pérez
- Human Cancer Genetics Programme and Spanish Network in Rare Diseases (CIBERER), Madrid, Spain
- Beatriz Martínez-Delgado
- Molecular Genetics Unit, Research Institute of Rare Diseases (IIER), Instituto de Salud Carlos III (ISCIII), Madrid, Spain
- Javier Benítez
- Human Cancer Genetics Programme and Spanish Network in Rare Diseases (CIBERER), Madrid, Spain;
|
5
|
Xiong XD, Jung HJ, Gombar S, Park JY, Zhang CL, Zheng H, Ruan J, Li JB, Kaeberlein M, Kennedy BK, Zhou Z, Liu X, Suh Y. MicroRNA transcriptome analysis identifies miR-365 as a novel negative regulator of cell proliferation in Zmpste24-deficient mouse embryonic fibroblasts. Mutat Res 2015; 777:69-78. [PMID: 25983189 DOI: 10.1016/j.mrfmmm.2015.04.010] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 03/27/2015] [Revised: 04/08/2015] [Accepted: 04/16/2015] [Indexed: 02/01/2023]
Abstract
Zmpste24 is a metalloproteinase responsible for the posttranslational processing and cleavage of prelamin A into mature lamin A. Zmpste24(-/-) mice display a range of progeroid phenotypes overlapping with mice expressing progerin, an altered version of lamin A associated with Hutchinson-Gilford progeria syndrome (HGPS). Increasing evidence has demonstrated that miRNAs contribute to the regulation of the normal aging process, but their roles in progeroid disorders remain poorly understood. Here we report the miRNA transcriptomes of mouse embryonic fibroblasts (MEFs) established from wild type (WT) and Zmpste24(-/-) progeroid mice using a massively parallel sequencing technology. With data from 19.5 × 10^6 reads from WT MEFs and 16.5 × 10^6 reads from Zmpste24(-/-) MEFs, we discovered a total of 306 known miRNAs expressed in MEFs with a wide dynamic range of read counts ranging from 10 to over 1 million. A total of 8 miRNAs were found to be significantly down-regulated, with only 2 miRNAs upregulated, in Zmpste24(-/-) MEFs as compared to WT MEFs. Functional studies revealed that miR-365, a significantly down-regulated miRNA in Zmpste24(-/-) MEFs, modulates cellular growth phenotypes in MEFs. Overexpression of miR-365 in Zmpste24(-/-) MEFs increased cellular proliferation and decreased the percentage of SA-β-gal-positive cells, while inhibition of miR-365 function led to an increase of SA-β-gal-positive cells in WT MEFs. Furthermore, we identified Rasd1, a member of the Ras superfamily of small GTPases, as a functional target of miR-365. While expression of miR-365 suppressed Rasd1 3' UTR luciferase-reporter activity, this effect was lost with mutations in the putative 3' UTR target-site. Consistently, expression levels of miR-365 were found to inversely correlate with endogenous Rasd1 levels. These findings suggest that miR-365 is down-regulated in Zmpste24(-/-) MEFs and acts as a novel negative regulator of Rasd1. Our comprehensive miRNA data provide a resource to study gene regulatory networks in MEFs.
Affiliation(s)
- Xing-dong Xiong
- Institute of Aging Research, Guangdong Medical College, Xin Cheng Avenue 1#, Songshan Lake, Dongguan, Guangdong 523808, PR China; Institute of Biochemistry & Molecular Biology, Guangdong Medical College, Zhanjiang 524023, PR China; Key Laboratory for Medical Molecular Diagnostics of Guangdong Province, Dongguan 523808, PR China; Institute of Laboratory Medicine, Guangdong Medical College, Dongguan, Guangdong 523808, PR China
- Hwa Jin Jung
- Departments of Genetics, Albert Einstein College of Medicine, Bronx, NY 10461, USA
- Saurabh Gombar
- Departments of Systems Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA
- Jung Yoon Park
- Departments of Genetics, Albert Einstein College of Medicine, Bronx, NY 10461, USA
- Chun-long Zhang
- Institute of Aging Research, Guangdong Medical College, Xin Cheng Avenue 1#, Songshan Lake, Dongguan, Guangdong 523808, PR China; Institute of Biochemistry & Molecular Biology, Guangdong Medical College, Zhanjiang 524023, PR China; Key Laboratory for Medical Molecular Diagnostics of Guangdong Province, Dongguan 523808, PR China
- Huiling Zheng
- Institute of Aging Research, Guangdong Medical College, Xin Cheng Avenue 1#, Songshan Lake, Dongguan, Guangdong 523808, PR China; Institute of Biochemistry & Molecular Biology, Guangdong Medical College, Zhanjiang 524023, PR China; Key Laboratory for Medical Molecular Diagnostics of Guangdong Province, Dongguan 523808, PR China
- Jie Ruan
- Institute of Aging Research, Guangdong Medical College, Xin Cheng Avenue 1#, Songshan Lake, Dongguan, Guangdong 523808, PR China; Institute of Biochemistry & Molecular Biology, Guangdong Medical College, Zhanjiang 524023, PR China; Key Laboratory for Medical Molecular Diagnostics of Guangdong Province, Dongguan 523808, PR China; Institute of Laboratory Medicine, Guangdong Medical College, Dongguan, Guangdong 523808, PR China
- Jiang-bin Li
- Institute of Aging Research, Guangdong Medical College, Xin Cheng Avenue 1#, Songshan Lake, Dongguan, Guangdong 523808, PR China; Institute of Biochemistry & Molecular Biology, Guangdong Medical College, Zhanjiang 524023, PR China; Key Laboratory for Medical Molecular Diagnostics of Guangdong Province, Dongguan 523808, PR China; Institute of Laboratory Medicine, Guangdong Medical College, Dongguan, Guangdong 523808, PR China
- Matt Kaeberlein
- Institute of Aging Research, Guangdong Medical College, Xin Cheng Avenue 1#, Songshan Lake, Dongguan, Guangdong 523808, PR China; Department of Pathology, University of Washington, Seattle, WA 98195, USA
- Brian K Kennedy
- Institute of Aging Research, Guangdong Medical College, Xin Cheng Avenue 1#, Songshan Lake, Dongguan, Guangdong 523808, PR China; The Buck Institute for Research on Aging, Novato, CA 94945, USA
- Zhongjun Zhou
- Institute of Aging Research, Guangdong Medical College, Xin Cheng Avenue 1#, Songshan Lake, Dongguan, Guangdong 523808, PR China; Department of Biochemistry, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong, PR China
- Xinguang Liu
- Institute of Aging Research, Guangdong Medical College, Xin Cheng Avenue 1#, Songshan Lake, Dongguan, Guangdong 523808, PR China; Institute of Biochemistry & Molecular Biology, Guangdong Medical College, Zhanjiang 524023, PR China; Key Laboratory for Medical Molecular Diagnostics of Guangdong Province, Dongguan 523808, PR China; Institute of Laboratory Medicine, Guangdong Medical College, Dongguan, Guangdong 523808, PR China.
- Yousin Suh
- Institute of Aging Research, Guangdong Medical College, Xin Cheng Avenue 1#, Songshan Lake, Dongguan, Guangdong 523808, PR China; Departments of Genetics, Albert Einstein College of Medicine, Bronx, NY 10461, USA.
|
6
|
Önskog J, Freyhult E, Landfors M, Rydén P, Hvidsten TR. Classification of microarrays; synergistic effects between normalization, gene selection and machine learning. BMC Bioinformatics 2011; 12:390. [PMID: 21982277 PMCID: PMC3229535 DOI: 10.1186/1471-2105-12-390] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 12/20/2010] [Accepted: 10/07/2011] [Indexed: 11/29/2022] Open
Abstract
Background Machine learning is a powerful approach for describing and predicting classes in microarray data. Although several comparative studies have investigated the relative performance of various machine learning methods, these often do not account for the fact that performance (e.g. error rate) is a result of a series of analysis steps of which the most important are data normalization, gene selection and machine learning. Results In this study, we used seven previously published cancer-related microarray data sets to compare the effects on classification performance of five normalization methods, three gene selection methods with 21 different numbers of selected genes and eight machine learning methods. Performance in terms of error rate was rigorously estimated by repeatedly employing a double cross-validation approach. Since performance varies greatly between data sets, we devised an analysis method that first compares methods within individual data sets and then visualizes the comparisons across data sets. We discovered both well-performing individual methods and synergies between different methods. Conclusion Support Vector Machines with a radial basis kernel, linear kernel or polynomial kernel of degree 2 all performed consistently well across data sets. We show that there is a synergistic relationship between these methods and gene selection based on the t-test and the selection of a relatively high number of genes. Also, we find that these methods benefit significantly from using normalized data, although it is hard to draw general conclusions about the relative performance of different normalization procedures.
Affiliation(s)
- Jenny Önskog
- Umeå Plant Science Center, Department of Plant Physiology, Umeå University, 901 87 Umeå, Sweden
|
7
|
Forest classification trees and forest support vector machines algorithms: Demonstration using microarray data. Comput Biol Med 2010; 40:519-24. [DOI: 10.1016/j.compbiomed.2010.03.006] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 06/27/2009] [Revised: 01/09/2010] [Accepted: 03/22/2010] [Indexed: 11/22/2022]
|
8
|
López-Pintado S, Romo J, Torrente A. Robust depth-based tools for the analysis of gene expression data. Biostatistics 2010; 11:254-64. [PMID: 20064844 DOI: 10.1093/biostatistics/kxp056] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/14/2022] Open
Abstract
Microarray experiments provide data on the expression levels of thousands of genes and, therefore, statistical methods applicable to the analysis of such high-dimensional data are needed. In this paper, we propose robust nonparametric tools for the description and analysis of microarray data based on the concept of functional depth, which measures the centrality of an observation within a sample. We show that this concept can be easily adapted to high-dimensional observations and, in particular, to gene expression data. This allows the development of the following depth-based inference tools: (1) a scale curve for measuring and visualizing the dispersion of a set of points, (2) a rank test for deciding if 2 groups of multidimensional observations come from the same population, and (3) supervised classification techniques for assigning a new sample to one of G given groups. We apply these methods to microarray data, and to simulated data including contaminated models, and show that they are robust, efficient, and competitive with other procedures proposed in the literature, outperforming them in some situations.
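As a rough illustration of what "depth" means here, the sketch below computes one common notion, the band depth with bands formed by pairs of curves. The abstract does not say which depth variant the authors use, so the definition, the function name `band_depth`, and the toy data are all assumptions made for illustration only.

```python
from itertools import combinations


def band_depth(curve, sample):
    """Fraction of pairs of sample curves whose band contains `curve`.

    The band of two curves a, b is the region between their pointwise
    minimum and maximum. This is an assumed variant (bands of 2 curves).
    """
    pairs = list(combinations(sample, 2))
    inside = 0
    for a, b in pairs:
        # The curve counts only if it stays inside the band at every coordinate.
        if all(min(p, q) <= v <= max(p, q) for v, p, q in zip(curve, a, b)):
            inside += 1
    return inside / len(pairs)


# Toy "expression profiles": three roughly parallel curves and one outlier.
sample = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
    [0.0, 6.0, 1.0],
]
depths = [band_depth(c, sample) for c in sample]
deepest = max(range(len(sample)), key=lambda i: depths[i])
```

In this toy sample the middle curve is the most central one, which is the sense in which depth orders observations from the center outward and can support rank tests and classification rules.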
Affiliation(s)
- Sara López-Pintado
- Departamento de Economía, Métodos Cuantitativos e Historia Económica, Universidad Pablo de Olavide, Seville, Spain.
|
9
|
Romualdi C, Giuliani A, Millino C, Celegato B, Benigni R, Lanfranchi G. Correlation between gene expression and clinical data through linear and nonlinear principal components analyses: muscular dystrophies as case studies. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2009; 13:173-84. [PMID: 19405797 DOI: 10.1089/omi.2009.0003] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022]
Abstract
The large dimension of microarray data and the complex dependence structure among genes make data analysis extremely challenging. In the last decade several statistical techniques have been proposed to tackle genome-wide expression data; however, clinical and molecular data associated with pathologies have often been considered as separate dimensions of the same phenomenon, especially when clinical variables lie in a multidimensional space. A better comprehension of the relationships between clinical and molecular data can be obtained if both data types are combined and integrated. In this work we adopt a multidimensional correlation strategy, together with linear and nonlinear principal component analysis, to integrate genetic and clinical information obtained from two sets of dystrophic patients. With this approach we decompose different aspects of clinical manifestations and correlate these features with the corresponding patterns of differential gene expression.
Affiliation(s)
- Chiara Romualdi
- CRIBI Biotechnology Centre and Dipartimento di Biologia, Università degli Studi di Padova, Padova, Italy.
|
10
|
Boulesteix AL, Strobl C, Augustin T, Daumer M. Evaluating microarray-based classifiers: an overview. Cancer Inform 2008; 6:77-97. [PMID: 19259405 PMCID: PMC2623308 DOI: 10.4137/cin.s408] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.4] [Indexed: 01/23/2023] Open
Abstract
For the last eight years, microarray-based class prediction has been the subject of numerous publications in medicine, bioinformatics and statistics journals. However, in many articles, the assessment of classification accuracy is carried out using suboptimal procedures and is not paid much attention. In this paper, we carefully review various statistical aspects of classifier evaluation and validation from a practical point of view. The main topics addressed are accuracy measures, error rate estimation procedures, variable selection, choice of classifiers and validation strategy.
Affiliation(s)
- A-L Boulesteix
- Sylvia Lawry Centre for MS Research (SLC), Hohenlindenerstr. 1, Munich, Germany
|
11
|
|
12
|
Díaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006. [PMID: 16398926 DOI: 10.1186/1471-2105-7-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022] Open
Abstract
Background
Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.
Results
We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.
Conclusion
Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.
|
13
|
Medina I, Montaner D, Tárraga J, Dopazo J. Prophet, a web-based tool for class prediction using microarray data. Bioinformatics 2006; 23:390-1. [PMID: 17138587 DOI: 10.1093/bioinformatics/btl602] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.7] [Indexed: 11/13/2022] Open
Abstract
UNLABELLED Sample classification and class prediction is the aim of many gene expression studies. We present a web-based application, Prophet, which builds prediction rules and allows using them for further sample classification. Prophet automatically chooses the best classifier, along with the optimal selection of genes, using a strategy that renders unbiased cross-validated errors. Prophet is linked to different microarray data analysis modules, and includes a unique feature: the possibility of performing the functional interpretation of the molecular signature found. AVAILABILITY Prophet can be found at the URL http://prophet.bioinfo.cipf.es/ or within the GEPAS package at http://www.gepas.org/ SUPPLEMENTARY INFORMATION http://gepas.bioinfo.cipf.es/tutorial/prophet.html.
Affiliation(s)
- Ignacio Medina
- Department of Bioinformatics, Centro de Investigación Príncipe Felipe, Valencia, E46013, Spain
|
14
|
Jaumot J, Tauler R, Gargallo R. Exploratory data analysis of DNA microarrays by multivariate curve resolution. Anal Biochem 2006; 358:76-89. [PMID: 16962983 DOI: 10.1016/j.ab.2006.07.028] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.5] [Received: 05/04/2006] [Revised: 07/27/2006] [Accepted: 07/27/2006] [Indexed: 11/18/2022]
Abstract
In this work, the application of a multivariate curve resolution procedure based on alternating least squares optimization (MCR-ALS) for the analysis of data from DNA microarrays is proposed. For this purpose, simulated and publicly available experimental data sets have been analyzed. Application of MCR-ALS, a method that operates without the use of any training set, has enabled the resolution of the relevant information about different cancer lines classification using a set of few components; each of these defined by a sample and a pure gene expression profile. From resolved sample profiles, a classification of samples according to their origin is proposed. From the resolved pure gene expression profiles, a set of over- or underexpressed genes that could be related to the development of cancer diseases has been selected. Advantages of the MCR-ALS procedure in relation to other previously proposed procedures such as principal component analysis are discussed.
Affiliation(s)
- Joaquim Jaumot
- Department of Analytical Chemistry, Universitat de Barcelona, Diagonal 647, E-08028 Barcelona, Spain
|
15
|
Díaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006; 7:3. [PMID: 16398926 PMCID: PMC1363357 DOI: 10.1186/1471-2105-7-3] [Citation(s) in RCA: 1186] [Impact Index Per Article: 62.4] [Received: 07/08/2005] [Accepted: 01/06/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection. RESULTS We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy. CONCLUSION Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.
Affiliation(s)
- Ramón Díaz-Uriarte
- Bioinformatics Unit, Biotechnology Programme, Spanish National Cancer Centre (CNIO), Melchor Fernandez Almagro 3, Madrid, 28029, Spain
- Sara Alvarez de Andrés
- Cytogenetics Unit, Biotechnology Programme, Spanish National Cancer Centre (CNIO), Melchor Fernández Almagro 3, Madrid, 28029, Spain
|
16
|
Díaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006. [PMID: 16398926 DOI: 10.1186/1471-2105-7-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection. RESULTS We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy. CONCLUSION Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.
Affiliation(s)
- Ramón Díaz-Uriarte
- Bioinformatics Unit, Biotechnology Programme, Spanish National Cancer Centre (CNIO), Melchor Fernandez Almagro 3, Madrid, 28029, Spain.
|
17
|
Ancona N, Maglietta R, D'Addabbo A, Liuni S, Pesole G. Regularized Least Squares Cancer classifiers from DNA microarray data. BMC Bioinformatics 2005; 6 Suppl 4:S2. [PMID: 16351746 PMCID: PMC1866388 DOI: 10.1186/1471-2105-6-s4-s2] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The advent of DNA microarray technology constitutes an epochal change in the classification and discovery of different types of cancer, because the information provided by DNA microarrays allows the problem of cancer analysis to be approached from a quantitative rather than qualitative point of view. Cancer classification requires well-founded mathematical methods that are able to predict the status of new specimens with high significance levels starting from a limited amount of data. In this paper we assess the performance of Regularized Least Squares (RLS) classifiers, originally proposed in regularization theory, by comparing them with Support Vector Machines (SVM), the state-of-the-art supervised learning technique for cancer classification from DNA microarray data. The performance of both approaches has also been investigated with respect to the number of selected genes and different gene selection strategies. RESULTS We show that RLS classifiers perform comparably to SVM classifiers, as shown by the Leave-One-Out (LOO) error evaluated on three different data sets. The main advantage of RLS machines is that, to solve a classification problem, they use a linear system of order equal to either the number of features or the number of training examples. Moreover, RLS machines yield an exact measure of the LOO error with just one training. CONCLUSION RLS classifiers are a valuable alternative to SVM classifiers for the problem of cancer classification from gene expression data, due to their simplicity and low computational complexity. Moreover, RLS classifiers show generalization ability comparable to that of SVM classifiers even when the classification of new specimens involves very few gene expression levels.
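The statement that RLS gives an exact LOO error from a single training can be illustrated with the standard closed-form identity for linear regularized least squares: the leave-one-out residual of sample i equals its full-fit residual divided by 1 − H_ii, where H is the hat matrix. The following is a minimal sketch in Python with NumPy, assuming a linear model without intercept and a fixed regularization parameter `lam`; the function names and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np


def rls_loo_residuals(X, y, lam):
    """Exact leave-one-out residuals of linear RLS (ridge, no intercept) from one fit."""
    n, d = X.shape
    # Hat matrix H = X (X^T X + lam I)^{-1} X^T maps labels to fitted values.
    A = X.T @ X + lam * np.eye(d)
    H = X @ np.linalg.solve(A, X.T)
    residuals = y - H @ y
    # Closed-form identity: LOO residual_i = residual_i / (1 - H_ii).
    return residuals / (1.0 - np.diag(H))


def brute_force_loo_residuals(X, y, lam):
    """Reference check: refit n times, each time holding out one sample."""
    n, d = X.shape
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        A = X[keep].T @ X[keep] + lam * np.eye(d)
        w = np.linalg.solve(A, X[keep].T @ y[keep])
        out[i] = y[i] - X[i] @ w
    return out


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))        # 20 synthetic samples, 5 features (assumed sizes)
y = np.sign(rng.normal(size=20))    # synthetic +/-1 class labels
fast = rls_loo_residuals(X, y, lam=1.0)
slow = brute_force_loo_residuals(X, y, lam=1.0)
# A sample counts as a LOO error when its held-out prediction has the wrong sign.
loo_error = np.mean(np.sign(y - fast) != y)
```

The one-fit version solves a single linear system, whereas the brute-force check solves one per held-out sample; both should agree up to floating-point error.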
Affiliation(s)
- Nicola Ancona
- Istituto di Studi sui Sistemi Intelligenti per l'Automazione, CNR, Via Amendola 122/D-I, 70126 Bari, Italy
- Rosalia Maglietta
- Istituto di Studi sui Sistemi Intelligenti per l'Automazione, CNR, Via Amendola 122/D-I, 70126 Bari, Italy
- Annarita D'Addabbo
- Istituto di Studi sui Sistemi Intelligenti per l'Automazione, CNR, Via Amendola 122/D-I, 70126 Bari, Italy
- Sabino Liuni
- Istituto di Tecnologie Biomediche-Sezione di Bari, CNR, Via Amendola 122/D, 70126 Bari, Italy
- Graziano Pesole
- Istituto di Tecnologie Biomediche-Sezione di Bari, CNR, Via Amendola 122/D, 70126 Bari, Italy
- Dipartimento Scienze Biomolecolari e Biotecnologie, Università di Milano, Via Celoria 26, 20133 Milano, Italy
|
18
|
Liang Y, Kelemen A. Associating phenotypes with molecular events: recent statistical advances and challenges underpinning microarray experiments. Funct Integr Genomics 2005; 6:1-13. [PMID: 16292543 DOI: 10.1007/s10142-005-0006-z] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.0] [Received: 04/18/2005] [Revised: 06/22/2005] [Accepted: 08/16/2005] [Indexed: 10/25/2022]
Abstract
Progress in mapping the genome and developments in array technologies have provided large amounts of information for delineating the roles of genes involved in complex diseases and quantitative traits. Since complex phenotypes are determined by a network of interrelated biological traits typically involving multiple inter-correlated genetic and environmental factors that interact in a hierarchical fashion, microarrays hold tremendous latent information. The analysis of microarray data is, however, still a bottleneck. In this paper, we review the recent advances in statistical analyses for associating phenotypes with molecular events underpinning microarray experiments. Classical statistical procedures to analyze phenotypes in genetics are reviewed first, followed by descriptions of the statistical procedures for linking molecular events to measured gene expression phenotypes (microarray-based gene expression) and observed phenotypes such as disease status. These statistical procedures include (1) prior analysis, such as data quality controls and normalization analyses, for minimizing the effects of experimental artifacts and random noise; (2) gene selection and differentiation procedures based on inferential statistics for class comparisons; (3) analysis of dynamic temporal patterns through exploratory statistics such as unsupervised clustering and supervised classification and prediction; (4) assessment of the reliability of microarray studies using real-time PCR, and of reproducibility across many studies and multiple platforms. In addition, post-analysis procedures that associate the discovered patterns of gene expression with pathway and functional analysis for selected genes are also considered, in order to increase our understanding of interconnected gene processes.
Affiliation(s)
- Yulan Liang
- Department of Biostatistics, The State University of New York at Buffalo, Buffalo, NY 14214, USA.
|
19
|
Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF. GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int J Med Inform 2005; 74:491-503. [PMID: 15967710 DOI: 10.1016/j.ijmedinf.2005.05.002] [Citation(s) in RCA: 95] [Impact Index Per Article: 4.8] [Received: 10/30/2004] [Accepted: 05/02/2005] [Indexed: 10/25/2022]
Abstract
The success of treatment of patients with cancer depends on establishing an accurate diagnosis. To this end, we have built a system called GEMS (gene expression model selector) for the automated development and evaluation of high-quality cancer diagnostic models and biomarker discovery from microarray gene expression data. In order to determine and equip the system with the best performing diagnostic methodologies in this domain, we first conducted a comprehensive evaluation of classification algorithms using 11 cancer microarray datasets. In this paper we present a preliminary evaluation of the system with five new datasets. The performance of the models produced automatically by GEMS is comparable to or better than the results obtained by human analysts. Additionally, we performed a cross-dataset evaluation of the system. This involved using one dataset to build a diagnostic model and to estimate its future performance, then applying this model and evaluating its performance on a different dataset. We found that models produced by GEMS indeed perform well in independent samples and, furthermore, that the cross-validation performance estimates output by the system closely approximate the error obtained by independent validation. GEMS is freely available for download for non-commercial use from http://www.gems-system.org.
Affiliation(s)
- Alexander Statnikov
- Discovery Systems Laboratory, Department of Biomedical Informatics, Vanderbilt University, 2209 Garland Avenue, Nashville, TN 37232, USA
|
20
|
Chen YD, Zheng S, Yu JK, Hu X. Artificial neural networks analysis of surface-enhanced laser desorption/ionization mass spectra of serum protein pattern distinguishes colorectal cancer from healthy population. Clin Cancer Res 2005; 10:8380-5. [PMID: 15623616 DOI: 10.1158/1078-0432.ccr-1162-03] [Citation(s) in RCA: 117] [Impact Index Per Article: 5.9] [Indexed: 12/19/2022]
Abstract
PURPOSE The low specificity and sensitivity of the carcinoembryonic antigen test make it a less than ideal biomarker for the detection of colorectal cancer. We developed and evaluated a proteomic approach for the simultaneous detection and analysis of multiple proteins for distinguishing individuals with colorectal cancer from healthy individuals. EXPERIMENTAL DESIGN We analyzed serum samples from 147 individuals (55 colorectal cancer patients and 92 age- and sex-matched healthy individuals) by surface-enhanced laser desorption/ionization (SELDI) mass spectrometry. Peaks were detected with Ciphergen SELDI software version 3.0. Using a multilayer artificial neural network with a back propagation algorithm, we developed a classifier for separating the colorectal cancer group from the healthy group. RESULTS The artificial neural network classifier separated the colorectal cancer samples from the healthy samples with a sensitivity of 91% and a specificity of 93%. Four top-scored peaks, at m/z of 5,911, 8,930, 8,817, and 4,476, were finally selected as the potential "fingerprints" for detection of colorectal cancer. CONCLUSIONS The combination of SELDI-TOF mass spectrometry with artificial neural networks in the analysis of serum protein yields significantly higher sensitivity and specificity values for the detection and diagnosis of colorectal cancer.
Affiliation(s)
- Yi-ding Chen
- Department of Oncology, Second Affiliated Hospital, Zhejiang University, HangZhou, Zhejiang, People's Republic of China
|
21
|
Liang Y, Tayo B, Cai X, Kelemen A. Differential and trajectory methods for time course gene expression data. Bioinformatics 2005; 21:3009-16. [PMID: 15886280 PMCID: PMC2574001 DOI: 10.1093/bioinformatics/bti465] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION The issue of high dimensionality in microarray data has been, and remains, a hot topic in statistical and computational analysis. Efficient gene filtering and differentiation approaches can reduce the dimensions of data, help to remove redundant genes and noise, and highlight the most relevant genes that are major players in the development of certain diseases or the effect of drug treatment. The purpose of this study is to investigate the efficiency of parametric (including Bayesian and non-Bayesian, linear and non-linear), non-parametric and semi-parametric gene filtering methods through their application to time course microarray data from multiple sclerosis patients being treated with interferon-beta-1a. The analysis of variance with bootstrapping (parametric), class dispersion (semi-parametric) and Pareto (non-parametric) with permutation methods are presented and compared for filtering and finding differentially expressed genes. The Bayesian linear correlated model, the Bayesian non-linear model and the non-Bayesian mixed effects model with bootstrap were also developed to characterize the differential expression patterns. Furthermore, trajectory-clustering approaches were developed in order to investigate the dynamic patterns and inter-dependency of drug treatment effects on gene expression. RESULTS Results show that the presented methods performed significantly differently, but all were adequate in capturing a small number of the genes potentially relevant to the disease. The parametric methods, such as the mixed model and the two Bayesian approaches, proved to be more conservative. This may be because these methods are based on overall variation in expression across all time points. The semi-parametric (class dispersion) and non-parametric (Pareto) methods were appropriate in capturing variation in expression from time point to time point, thereby making them more suitable for investigating significant monotonic changes and trajectories of changes in gene expression in time course microarray data. Also, the non-linear Bayesian model proved to be less conservative than the linear Bayesian correlated growth models in filtering out the redundant genes, although the linear model showed a better fit than the non-linear model (smaller DIC). We also report the trajectories of significant genes, since we have been able to isolate trajectories of genes whose regulation appears to be inter-dependent.
Affiliation(s)
- Yulan Liang
- Department of Biostatistics, University at Buffalo, Buffalo, NY 14226, USA.
|
22
|
Wessels LFA, Reinders MJT, Hart AAM, Veenman CJ, Dai H, He YD, van't Veer LJ. A protocol for building and evaluating predictors of disease state based on microarray data. Bioinformatics 2005; 21:3755-62. [PMID: 15817694 DOI: 10.1093/bioinformatics/bti429] [Citation(s) in RCA: 91] [Impact Index Per Article: 4.6] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Microarray gene expression data are increasingly employed to identify sets of marker genes that accurately predict disease development and outcome in cancer. Many computational approaches have been proposed to construct such predictors. However, there is, as yet, no objective way to evaluate whether a new approach truly improves on the current state of the art. In addition no 'standard' computational approach has emerged which enables robust outcome prediction. RESULTS An important contribution of this work is the description of a principled training and validation protocol, which allows objective evaluation of the complete methodology for constructing a predictor. We review the possible choices of computational approaches, with specific emphasis on predictor choice and reporter selection strategies. Employing this training-validation protocol, we evaluated different reporter selection strategies and predictors on six gene expression datasets of varying degrees of difficulty. We demonstrate that simple reporter selection strategies (forward filtering and shrunken centroids) work surprisingly well and outperform partial least squares in four of the six datasets. Similarly, simple predictors, such as the nearest mean classifier, outperform more complex classifiers. Our training-validation protocol provides a robust methodology to evaluate the performance of new computational approaches and to objectively compare outcome predictions on different datasets.
Affiliation(s)
- Lodewyk F A Wessels
- Department of Mediamatics, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands.
|
23
|
Azuaje F, Wang H, Chesneau A. Non-linear mapping for exploratory data analysis in functional genomics. BMC Bioinformatics 2005; 6:13. [PMID: 15661072 PMCID: PMC548129 DOI: 10.1186/1471-2105-6-13] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 08/09/2004] [Accepted: 01/20/2005] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Several supervised and unsupervised learning tools are available to classify functional genomics data. However, relatively less attention has been given to exploratory, visualisation-driven approaches. Such approaches should satisfy the following factors: Support for intuitive cluster visualisation, user-friendly and robust application, computational efficiency and generation of biologically meaningful outcomes. This research assesses a relaxation method for non-linear mapping that addresses these concerns. Its applications to gene expression and protein-protein interaction data analyses are investigated. RESULTS Publicly available expression data originating from leukaemia, round blue-cell tumours and Parkinson disease studies were analysed. The method distinguished relevant clusters and critical analysis areas. The system does not require assumptions about the inherent class structure of the data, its mapping process is controlled by only one parameter and the resulting transformations offer intuitive, meaningful visual displays. Comparisons with traditional mapping models are presented. As a way of promoting potential, alternative applications of the methodology presented, an example of exploratory data analysis of interactome networks is illustrated. Data from the C. elegans interactome were analysed. Results suggest that this method might represent an effective solution for detecting key network hubs and for clustering biologically meaningful groups of proteins. CONCLUSION A relaxation method for non-linear mapping provided the basis for visualisation-driven analyses using different types of data. This study indicates that such a system may represent a user-friendly and robust approach to exploratory data analysis. It may allow users to gain better insights into the underlying data structure, detect potential outliers and assess assumptions about the cluster composition of the data.
Affiliation(s)
- Francisco Azuaje
- School of Computing and Mathematics, University of Ulster, BT37 0QB, UK
- Haiying Wang
- School of Computing and Mathematics, University of Ulster, BT37 0QB, UK
- Alban Chesneau
- Molecular Genetics Institute, CNRS UMR5535, 1919, route de Mende, 34293 Montpellier, France
|
24
|
Yu JK, Chen YD, Zheng S. An integrated approach to the detection of colorectal cancer utilizing proteomics and bioinformatics. World J Gastroenterol 2004; 10:3127-31. [PMID: 15457557 PMCID: PMC4611255 DOI: 10.3748/wjg.v10.i21.3127] [Citation(s) in RCA: 59] [Impact Index Per Article: 2.8] [Indexed: 12/15/2022] Open
Abstract
AIM: To find new potential biomarkers and to establish patterns for early detection of colorectal cancer.
METHODS: One hundred and eighty-two serum samples including 55 from colorectal cancer (CRC) patients, 35 from colorectal adenoma (CRA) patients and 92 from healthy persons (HP) were detected by surface-enhanced laser desorption/ionization mass spectrometry (SELDI-MS). The data of spectra were analyzed by bioinformatics tools like artificial neural network (ANN) and support vector machine (SVM).
RESULTS: The diagnostic pattern combined with 7 potential biomarkers could differentiate CRC patients from CRA patients with a specificity of 83%, sensitivity of 89% and positive predictive value of 89%. The diagnostic pattern combined with 4 potential biomarkers could differentiate CRC patients from HP with a specificity of 92%, sensitivity of 89% and positive predictive value of 86%.
CONCLUSION: The combination of SELDI with bioinformatics tools could help find new biomarkers and establish patterns with high sensitivity and specificity for the detection of CRC.
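The sensitivity, specificity and positive predictive value quoted in the results are all ratios of confusion-matrix counts. A minimal sketch of the arithmetic follows; the function name is illustrative and the counts are hypothetical, chosen only for demonstration, since the abstract does not report the underlying confusion matrices.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, positive predictive value) from counts."""
    sensitivity = tp / (tp + fn)   # fraction of true cases that are detected
    specificity = tn / (tn + fp)   # fraction of non-cases correctly ruled out
    ppv = tp / (tp + fp)           # fraction of positive calls that are true cases
    return sensitivity, specificity, ppv


# Hypothetical counts (not from the study): 50 cases and 100 controls.
sens, spec, ppv = diagnostic_metrics(tp=45, fn=5, tn=90, fp=10)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} PPV={ppv:.2f}")
```

Unlike sensitivity and specificity, the positive predictive value depends on the ratio of cases to non-cases in the sample, so it does not transfer directly to a population with a different prevalence.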
Affiliation(s)
- Jie-Kai Yu
- Cancer Institute, Zhejiang University, Hangzhou 310009, Zhejiang Province, China
|
25
|
Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2004; 21:631-43. [PMID: 15374862 DOI: 10.1093/bioinformatics/bti033] [Citation(s) in RCA: 597] [Impact Index Per Article: 28.4] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Cancer diagnosis is one of the most important emerging clinical applications of gene expression microarray technology. We are seeking to develop a computer system for powerful and reliable cancer diagnostic model creation based on microarray data. To keep a realistic perspective on clinical applications we focus on multicategory diagnosis. To equip the system with the optimum combination of classifier, gene selection and cross-validation methods, we performed a systematic and comprehensive evaluation of several major algorithms for multicategory classification, several gene selection methods, multiple ensemble classifier methods and two cross-validation designs using 11 datasets spanning 74 diagnostic categories and 41 cancer types and 12 normal tissue types. RESULTS Multicategory support vector machines (MC-SVMs) are the most effective classifiers in performing accurate cancer diagnosis from gene expression data. The MC-SVM techniques by Crammer and Singer, Weston and Watkins and one-versus-rest were found to be the best methods in this domain. MC-SVMs outperform other popular machine learning algorithms, such as k-nearest neighbors, backpropagation and probabilistic neural networks, often to a remarkable degree. Gene selection techniques can significantly improve the classification performance of both MC-SVMs and other non-SVM learning algorithms. Ensemble classifiers do not generally improve performance of the best non-ensemble models. These results guided the construction of a software system GEMS (Gene Expression Model Selector) that automates high-quality model construction and enforces sound optimization and performance estimation procedures. This is the first such system to be informed by a rigorous comparative analysis of the available algorithms and datasets. AVAILABILITY The software system GEMS is available for download from http://www.gems-system.org for non-commercial use. CONTACT alexander.statnikov@vanderbilt.edu.
Affiliation(s)
- Alexander Statnikov
- Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA.
|
26
|
Herrero J, Vaquerizas JM, Al-Shahrour F, Conde L, Mateos A, Santoyo J, Díaz-Uriarte R, Dopazo J. New challenges in gene expression data analysis and the extended GEPAS. Nucleic Acids Res 2004; 32:W485-91. [PMID: 15215434 PMCID: PMC441559 DOI: 10.1093/nar/gkh421] [Citation(s) in RCA: 39] [Impact Index Per Article: 1.9] [Received: 02/13/2004] [Revised: 04/07/2004] [Accepted: 04/07/2004] [Indexed: 01/30/2023] Open
Abstract
Since the first papers published in the late nineties, including, for the first time, a comprehensive analysis of microarray data, the number of questions that have been addressed through this technique have both increased and diversified. Initially, interest focussed on genes coexpressing across sets of experimental conditions, implying, essentially, the use of clustering techniques. Recently, however, interest has focussed more on finding genes differentially expressed among distinct classes of experiments, or correlated to diverse clinical outcomes, as well as in building predictors. In addition to this, the availability of accurate genomic data and the recent implementation of CGH arrays has made mapping expression and genomic data on the chromosomes possible. There is also a clear demand for methods that allow the automatic transfer of biological information to the results of microarray experiments. Different initiatives, such as the Gene Ontology (GO) consortium, pathways databases, protein functional motifs, etc., provide curated annotations for genes. Whereas many resources on the web focus mainly on clustering methods, GEPAS has evolved to cope with the aforementioned new challenges that have recently arisen in the field of microarray data analysis. The web-based pipeline for microarray gene expression data, GEPAS, is available at http://gepas.bioinfo.cnio.es.
Affiliation(s)
- Javier Herrero
- Bioinformatics Unit, Biotechnology Programme, Centro Nacional de Investigaciones Oncológicas, Melchor Fernández Almagro, 3, E-28029 Madrid, Spain
|
27
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2003; 4. [PMCID: PMC2447311 DOI: 10.1002/cfg.231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022] Open
|