Title: High Dimensional Compositional Data Analysis in Microbiome Studies
Time: December 21, 2016, 3:30-4:30 PM
Venue: Room 1030, Mingde Main Building
Speaker: Hongzhe Li (University of Pennsylvania)
Speaker bio:
Hongzhe Li, PhD, is a Professor of Biostatistics at the Perelman School of Medicine at the University of Pennsylvania. He is the Chair of the Graduate Group in Biostatistics at Penn and Director of the Center of Statistical Methods in Big Data. He is also a faculty member in the graduate groups of Genomics and Computational Biology and Applied Mathematics and Computational Science at Penn. He has a secondary appointment in the Department of Statistics at the Wharton School. Dr. Li’s research focuses on developing powerful statistical and computational methods for the analysis of large-scale genetic, genomic, and metagenomic data; high-dimensional statistics with applications in genomics and biomedical big data; methods for estimating heterogeneous treatment effects; and methods for analyzing mobile health data. He has been elected a fellow of the American Statistical Association and a fellow of the Institute of Mathematical Statistics. Dr. Li served on the Board of Scientific Counselors, Clinical Sciences and Epidemiology, of the National Cancer Institute.
Abstract:
In microbiome studies, the taxa composition is often estimated from the sequencing read counts in order to account for the large variability in the total number of observed reads across samples. Because sequencing depth is limited, some rare microbial taxa might not be captured in the metagenomic sequencing, which results in many zero read counts. Naive composition estimation using count normalization therefore produces many zeros, which underestimates the underlying compositions, especially for the rare taxa. Such an estimate of the composition can further lead to biased estimates of taxa diversity, and can also cause difficulty in downstream log-ratio based analysis for compositional data. In the first part of this talk, the observed counts are assumed to be sampled from a multinomial distribution, with the unknown composition being the probability parameter in a high-dimensional positive simplex space. Under the assumption that the composition matrix is approximately low rank, a nuclear norm regularization-based likelihood estimation is developed to estimate the underlying compositions of the samples. The second part of the talk focuses on the problem of covariance estimation for high-dimensional compositional data, where a composition-adjusted thresholding (COAT) method is introduced under the assumption that the basis covariance matrix is sparse. Our method is based on a decomposition relating the compositional covariance to the basis covariance, which is approximately identifiable as the dimensionality tends to infinity. The resulting procedure can be viewed as thresholding the sample centered log-ratio covariance matrix and hence is scalable for large covariance matrices. The methods are applied to the analysis of a human gut microbiome dataset.
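
To make the pipeline in the abstract concrete, the following is a minimal Python sketch, not the speaker's implementation. It simulates multinomial read counts, forms the naive normalized composition, computes the sample centered log-ratio (CLR) covariance, and soft-thresholds its off-diagonal entries. Several parts are assumptions made only for illustration: the function names `clr_covariance` and `soft_threshold_offdiag`, the pseudocount of 0.5 used to avoid taking the log of zero (the talk instead proposes a nuclear norm regularized likelihood estimate), and the single fixed threshold `tau` (the abstract does not specify how COAT chooses its thresholds).

```python
import numpy as np


def clr_covariance(counts, pseudocount=0.5):
    """Sample covariance of centered log-ratio (CLR) transformed compositions.

    The pseudocount is an illustrative zero-replacement assumption,
    not the low-rank estimator described in the talk.
    """
    counts = np.asarray(counts, dtype=float)
    adjusted = counts + pseudocount
    comp = adjusted / adjusted.sum(axis=1, keepdims=True)
    log_comp = np.log(comp)
    # CLR: subtract each sample's mean log-proportion.
    clr = log_comp - log_comp.mean(axis=1, keepdims=True)
    return np.cov(clr, rowvar=False)


def soft_threshold_offdiag(cov, tau):
    """Soft-threshold off-diagonal entries with one fixed level tau.

    A simplified stand-in for a thresholding step; the diagonal is kept as-is.
    """
    shrunk = np.sign(cov) * np.maximum(np.abs(cov) - tau, 0.0)
    np.fill_diagonal(shrunk, np.diag(cov))
    return shrunk


# Simulated data: 50 samples, 20 taxa, varying sequencing depth.
rng = np.random.default_rng(0)
true_comp = rng.dirichlet(np.ones(20), size=50)
depths = rng.integers(500, 5000, size=50)
counts = np.array([rng.multinomial(n, p) for n, p in zip(depths, true_comp)])

# Naive estimate: counts divided by total reads (zeros stay zero).
naive = counts / counts.sum(axis=1, keepdims=True)

cov_hat = clr_covariance(counts)
sparse_cov = soft_threshold_offdiag(cov_hat, tau=0.1)
```

Because each CLR vector sums to zero, every row of `cov_hat` also sums to zero up to rounding, which is one way to see why the compositional covariance differs from the basis covariance that the talk aims to recover.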