Missing Data and Technical Variability in Single Cell RNA-Sequencing Experiments

Single-cell RNA-Seq (scRNA-seq) is the most widely used high-throughput technology to measure genome-wide gene expression at the single-cell level. Unlike bulk RNA-Seq, the majority of reported expression levels in scRNA-seq are zeros, and the proportion of genes with a reported expression level of zero varies substantially across cells. However, it remains unclear to what extent this cell-to-cell variation is driven by technical rather than biological variation. Here, we use an assessment experiment to examine data from published studies. We present evidence that some of these zeros are driven by technical variation by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lowly expressed genes. This missing data problem is exacerbated by the fact that technical variation varies from cell to cell, which can be mistaken for novel biological results. Finally, we propose a varying-censoring aware matrix factorization model (VAMF), which accounts for cell-specific censoring, for dimensionality reduction that permits the identification of factors in the presence of the systematic bias described above.
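
To make the quantity discussed above concrete, the per-cell proportion of zeros can be computed directly from a count matrix. The following is a minimal sketch in plain Python, assuming a genes-by-cells layout; the toy counts and the function name `zero_fraction_per_cell` are illustrative assumptions and are not taken from the talk or from any package.

```python
# Toy count matrix (made-up values): rows are genes, columns are cells.
counts = [
    [0, 3, 0, 1],
    [5, 0, 0, 2],
    [0, 0, 0, 4],
    [2, 1, 0, 0],
    [0, 7, 1, 3],
]


def zero_fraction_per_cell(matrix):
    """Return, for each cell (column), the fraction of genes with a zero count."""
    n_genes = len(matrix)
    n_cells = len(matrix[0])
    return [
        sum(1 for g in range(n_genes) if matrix[g][c] == 0) / n_genes
        for c in range(n_cells)
    ]


fractions = zero_fraction_per_cell(counts)
for cell, fraction in enumerate(fractions):
    print(f"cell {cell}: {fraction:.2f} of genes report zero")
```

In this toy matrix the fraction differs from cell to cell, which is the kind of variation the abstract refers to; the sketch by itself says nothing about whether that variation is technical or biological.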

Workshop: Statistical Analysis and Comprehension of Single Cell RNA-Sequencing Data in R / Bioconductor

The volume and rich complexity of single cell RNA-Sequencing (scRNA-seq) data require sophisticated computational tools for integrative statistical analysis and comprehension. Bioconductor is the pre-eminent resource for this purpose, with over 1400 R packages, a broad developer community (>900 maintainers), extensive documentation, active support, and mature software development practices. In this workshop, we will illustrate some questions that can be answered using scRNA-seq data and demonstrate through case studies and tutorials how to answer those questions using R / Bioconductor packages. This will include examples of accessing and loading data, pre-processing and applying quality control, and normalization. We will also demonstrate some common analyses performed with scRNA-seq data, such as unsupervised clustering, pseudotime analysis, testing for differentially expressed genes, and combining multiple scRNA-seq datasets.
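
As a rough illustration of the quality control and normalization steps named above, here is a minimal sketch in plain Python rather than R. It assumes a genes-by-cells count matrix, a simple total-count QC threshold, and library-size scaling followed by log(1 + x); the function names, the threshold, and the scale factor of 10,000 are illustrative assumptions, not the Bioconductor API or the workshop's actual material.

```python
import math


def library_sizes(matrix):
    """Total count per cell (column) of a genes-by-cells matrix."""
    return [sum(column) for column in zip(*matrix)]


def filter_cells(matrix, min_total):
    """Drop cells whose total count is below min_total (a simple QC rule)."""
    sizes = library_sizes(matrix)
    keep = [c for c, size in enumerate(sizes) if size >= min_total]
    return [[row[c] for c in keep] for row in matrix]


def log_normalize(matrix, scale=10_000):
    """Scale each cell to a common total, then apply log(1 + x).

    Assumes every remaining cell has a non-zero total (true after filtering
    with min_total >= 1).
    """
    sizes = library_sizes(matrix)
    return [
        [math.log1p(value / sizes[c] * scale) for c, value in enumerate(row)]
        for row in matrix
    ]


# Toy counts (made-up values): rows are genes, columns are cells.
counts = [
    [0, 3, 0, 1],
    [5, 0, 0, 2],
    [0, 0, 0, 4],
    [2, 1, 0, 0],
    [0, 7, 1, 3],
]
filtered = filter_cells(counts, min_total=5)  # removes the near-empty third cell
normalized = log_normalize(filtered)
```

Zero counts stay at zero under this transformation, so normalization alone does not address the excess zeros discussed in the talk.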

Assistant Professor Stephanie Hicks

Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
Co-founder, R-Ladies Baltimore

Stephanie Hicks is an Assistant Professor in the Department of Biostatistics at Johns Hopkins Bloomberg School of Public Health. She is also a faculty member of the Johns Hopkins Data Science Lab and co-founder of R-Ladies Baltimore. Her research focuses on developing statistical methods, tools and software for the analysis of genomics data. Specifically, her research addresses statistical challenges in epigenomics, functional genomics and single-cell genomics, such as the pre-processing, normalization, and analysis of noisy high-throughput data (microarray and next-generation sequencing), leading to improved quantification and understanding of biological variability. This work led to a K99/R00 award from the National Human Genome Research Institute (NHGRI) at the National Institutes of Health (NIH). She actively contributes software packages to the Bioconductor project and teaches courses in data science and the analysis of genomics data. Most recently, she became involved in one of the 85 one-year projects to develop Collaborative Computational Tools in a partnership between the Chan Zuckerberg Initiative (CZI) and the Human Cell Atlas (HCA). With other Bioconductor developers, she will create the infrastructure and tools needed to analyze potentially billions of single cells in the HCA within Bioconductor. For more information, please see http://www.stephaniehicks.com/.
