Triphasic DeepBRCA A Deep Learning Based Framework for Identification of Biomarkers for Breast Cance

Abstract:

Breast cancer is a leading cause of cancer death and demands close attention. Recently, next-generation sequencing techniques capable of capturing gene expression data have been used successfully for the detection of breast cancer. This work identifies a small set of biomarker genes for molecular stratification of breast cancer subtypes. We propose Triphasic DeepBRCA, a novel deep learning framework for breast cancer subtype detection and biomarker discovery. In the first phase, an autoencoder extracts a compact representation of the gene expression data. In the second phase, this representation is provided as input to a supervised feed-forward neural network that classifies breast cancer subtypes. In the third phase, the proposed Biomarker Gene Discovery Algorithm (BGDA) leverages the neural network classifier of the second phase to estimate the relevance of individual genes. Next, the Wilcoxon rank-sum test with False Discovery Rate (FDR) correction is applied to identify the most differentiating genes. Using the TCGA BRCA RNASeq data, the proposed framework enabled us to discover a set of 54 most-variant genes. Using 10-fold cross-validation, we obtained a mean accuracy of 0.899 ± 0.04 (95% confidence interval). We also validated our results on the METABRIC dataset. Gene set analysis revealed statistically enriched pathways. A heatmap of the expression levels and a t-SNE visualization reveal that these genes have an aggregated capability to distinguish among the different breast cancer subtypes. Further, prognostic evaluation of the 54 biomarkers revealed that over 30 of them are significantly linked to prognostic outcome.
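
The statistical filtering step named in the abstract (Wilcoxon rank-sum test followed by FDR correction) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not state which FDR procedure, significance threshold, or group comparison was used, so the Benjamini-Hochberg procedure, the `alpha` value of 0.05, the two-group comparison, and all function names below are assumptions made for illustration. The p-value uses the normal approximation without tie or continuity correction.

```python
import math


def midranks(values):
    """Return 1-based ranks of `values`, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:n1])  # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean) / sd
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_genes(group_a, group_b, alpha=0.05):
    """Return names of genes whose adjusted p-value is below `alpha`.

    `group_a` and `group_b` map a gene name to its list of expression
    values in each of two sample groups (hypothetical input layout).
    """
    genes = sorted(group_a)
    pvalues = [rank_sum_pvalue(group_a[g], group_b[g]) for g in genes]
    adjusted = benjamini_hochberg(pvalues)
    return [g for g, q in zip(genes, adjusted) if q < alpha]
```

In practice, `scipy.stats.ranksums` and `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` provide equivalent, well-tested routines; the standard-library version above is only meant to make the two steps explicit.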