Sparse Canonical Correlation Analyses of Multimodal Omics Data


  • Kejun He
  • Xiaoning Qian
  • Jianhua Huang
  • Sharon M. Donovan
  • Robert S. Chapkin
  • Ivan Ivanov*, Texas A&M University



Applications of sparse Canonical Correlation Analysis (sCCA) to genomic data have increased over the past several years. Most research in this area has focused on the relationships between gene expression levels and phenotypic variation. However, as multimodal omics data become available, there is a need to integrate these data modalities into a framework that allows simultaneous analysis, thereby providing novel insight for various fields in the life sciences. The pioneering work of Schwartz et al. (2012) used classical Canonical Correlation Analysis (CCA) to provide an integrative analysis of host gene expression and microbiota composition data from neonates with different feeding types. Although promising, that approach has serious deficiencies. First, its statistical interpretation is problematic: the two-stage analysis makes the results sensitive to variation in the data, and the original interpretation of CCA is lost. Second, the associated computational cost is very high, O(n^3), where n is the number of variables involved in the analysis. We therefore developed a methodology based on sCCA to overcome these problems. We compare the performance of our approach with that of Schwartz et al. (2012) and with sparse Principal Component Analysis (sPCA) on a large synthetic data set, and then apply it to a multimodal omics data set (gene expression, microbiota composition, and metabolites) from neonates with two different feeding types.
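To make the idea of sCCA concrete, the following is a minimal sketch of one common rank-one formulation: alternating soft-thresholded power iterations on the sample cross-correlation matrix, in the style of a penalized matrix decomposition. It is not the methodology of this abstract, which does not specify its algorithm. The function names (soft_threshold, sparse_cca_rank1), the relative thresholds frac_u and frac_v, the simplification that within-set covariances are treated as identity, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(a, lam):
    """Shrink each entry of `a` toward zero by `lam` (the L1 proximal step)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def sparse_cca_rank1(X, Y, frac_u=0.5, frac_v=0.5, n_iter=200, tol=1e-10):
    """Hypothetical helper: one pair of sparse canonical weight vectors.

    Assumptions: no constant columns, a nonzero cross-correlation matrix,
    and thresholds given as fractions (< 1) of the largest absolute entry,
    so at least one weight always survives.
    """
    # Standardize columns so that K holds sample cross-correlations.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    K = Xs.T @ Ys / X.shape[0]  # shape (p, q)

    # Start from the leading right singular vector of K.
    v = np.linalg.svd(K, full_matrices=False)[2][0]
    u = np.zeros(K.shape[0])
    for _ in range(n_iter):
        # Update u with v fixed: threshold, then rescale to unit length.
        a = K @ v
        u_new = soft_threshold(a, frac_u * np.abs(a).max())
        u_new /= np.linalg.norm(u_new)
        # Update v with u fixed, in the same way.
        b = K.T @ u_new
        v_new = soft_threshold(b, frac_v * np.abs(b).max())
        v_new /= np.linalg.norm(v_new)
        converged = np.linalg.norm(u_new - u) + np.linalg.norm(v_new - v) < tol
        u, v = u_new, v_new
        if converged:
            break
    return u, v


# Synthetic example: one shared latent factor drives the first 5 columns
# of X and the first 4 columns of Y; all other columns are pure noise.
rng = np.random.default_rng(0)
n, p, q = 200, 50, 40
z = rng.standard_normal(n)
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))
X[:, :5] = z[:, None] + 0.5 * X[:, :5]
Y[:, :4] = z[:, None] + 0.5 * Y[:, :4]

u, v = sparse_cca_rank1(X, Y)
print("selected X variables:", np.flatnonzero(u))
print("selected Y variables:", np.flatnonzero(v))
print("canonical correlation:", np.corrcoef(X @ u, Y @ v)[0, 1])
```

Each update costs one matrix-vector product with the p x q cross-correlation matrix, and the zero entries of u and v indicate which variables in each modality are excluded from the canonical pair. In practice the sparsity levels would be tuned, for example by cross-validation or permutation, rather than fixed as they are here.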






Conference Contributions