Chapter 23 Mouse brain dataset

23.1 Introduction

Here, we examine a heterogeneous dataset from a study of cell types in the mouse brain (Zeisel et al. 2015). The dataset contains approximately 3000 cells of varying types, such as oligodendrocytes, microglia and neurons. Individual cells were isolated using the Fluidigm C1 microfluidics system (Pollen et al. 2014), and library preparation was performed on each cell using a UMI-based protocol. After sequencing, expression was quantified by counting the number of unique molecular identifiers (UMIs) mapped to each gene.
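As a rough illustration of UMI-based quantification, the sketch below counts distinct UMIs per cell and gene, so that reads sharing a UMI (amplified copies of one molecule) contribute a single count. It is written in Python for brevity. The record layout (cell, gene, UMI) and the example values are assumptions made for illustration and are not the format used by the original study; real pipelines also need to handle sequencing errors within UMIs.

```python
from collections import defaultdict

def count_umis(reads):
    """Count distinct UMIs per (cell, gene) from (cell, gene, umi) records."""
    seen = defaultdict(set)
    for cell, gene, umi in reads:
        # Reads with the same UMI collapse into a single set entry.
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}

# Hypothetical mapped reads: the first two share a UMI and count once.
reads = [
    ("cell1", "Gad1", "AACGT"),
    ("cell1", "Gad1", "AACGT"),
    ("cell1", "Gad1", "TTGCA"),
    ("cell2", "Gad1", "AACGT"),
]
umi_counts = count_umis(reads)
print(umi_counts)
```

Using a set per (cell, gene) pair keeps the deduplication implicit, which is why the count is insensitive to how many times each molecule was amplified.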

23.2 Analysis

23.2.1 Data loading

We remove repeats and merge redundant rows that correspond to alternative genomic locations for the same gene. Specifically, we sum the counts across all rows for a given gene for ease of interpretation, and create a new SingleCellExperiment from these aggregated counts.
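The aggregation step can be sketched as follows, in Python rather than the R used for the analysis itself. The sketch assumes that every row has already been mapped to a gene symbol; the function name `aggregate_by_gene` and the toy matrix are hypothetical, and this is not the code used to produce the results in this chapter.

```python
def aggregate_by_gene(row_genes, counts):
    """Sum count rows that share a gene symbol.

    row_genes: gene symbol for each row of `counts`.
    counts: gene-by-cell matrix as a list of rows.
    Returns (genes, summed_rows), with genes in order of first appearance.
    """
    summed = {}
    for gene, row in zip(row_genes, counts):
        if gene in summed:
            summed[gene] = [a + b for a, b in zip(summed[gene], row)]
        else:
            summed[gene] = list(row)
    genes = list(summed)  # dicts preserve insertion order
    return genes, [summed[g] for g in genes]

# Toy example: "GeneA" appears twice (e.g. two genomic locations).
row_genes = ["GeneA", "GeneB", "GeneA"]
counts = [
    [1, 0, 2],
    [0, 3, 1],
    [4, 1, 0],
]
genes, aggregated = aggregate_by_gene(row_genes, counts)
print(genes)
print(aggregated)
```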

23.2.3 Quality control

The original authors of the study have already removed low-quality cells prior to data publication. Nonetheless, we compute some quality control metrics to check whether the remaining cells are satisfactory.
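To make the idea of per-cell quality control metrics concrete, the Python sketch below computes three commonly used ones: the total count, the number of detected genes, and the percentage of counts assigned to a chosen gene subset. Which metrics to compute, the subset used (a single mitochondrial gene here) and the toy counts are all assumptions for illustration; the text above does not specify them.

```python
def per_cell_qc(counts, genes, subset):
    """Compute simple QC metrics for each cell.

    counts: gene-by-cell matrix as a list of rows.
    genes: gene name for each row.
    subset: set of gene names whose share of the total count is reported.
    """
    n_cells = len(counts[0])
    metrics = []
    for j in range(n_cells):
        column = [row[j] for row in counts]
        total = sum(column)
        detected = sum(1 for x in column if x > 0)
        subset_total = sum(x for g, x in zip(genes, column) if g in subset)
        # Guard against empty cells to avoid division by zero.
        percent = 100.0 * subset_total / total if total > 0 else float("nan")
        metrics.append(
            {"sum": total, "detected": detected, "subset_percent": percent}
        )
    return metrics

# Toy gene-by-cell matrix with two cells (hypothetical values).
genes = ["GeneA", "GeneB", "mt-Co1"]
counts = [
    [10, 0],
    [5, 2],
    [5, 8],
]
qc = per_cell_qc(counts, genes, subset={"mt-Co1"})
for cell_metrics in qc:
    print(cell_metrics)
```

Cells with unusually low totals, few detected genes or a high subset percentage would be the ones to inspect more closely.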

23.2.5 Variance modelling

In theory, we should block on the plate of origin for each cell. However, only 20-40 cells are available on each plate, and the population is also highly heterogeneous. This means that we cannot assume that the distribution of sampled cell types on each plate is the same. Thus, to avoid regressing out potential biology, we will not block on any factors in this analysis.

23.3 Results

23.3.3 Variance modelling

The technical and total variances are much smaller than those in the read-based datasets. This is due to the use of UMIs, which reduces the noise caused by variable PCR amplification. Furthermore, the spike-in trend is consistently lower than the variances of the endogenous genes. This reflects the heterogeneity in gene expression across cells of different types.
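The relationship between the spike-in trend and the endogenous variances can be illustrated with a small Python sketch. It treats the trend through the spike-in (mean, variance) points as the technical component and subtracts it from each gene's total variance to obtain a biological component. Piecewise-linear interpolation is swapped in here for the smooth trend fitted in a real analysis, and all numbers are made up for illustration.

```python
from bisect import bisect_left

def make_trend(spike_means, spike_vars):
    """Piecewise-linear trend through spike-in (mean, variance) points."""
    points = sorted(zip(spike_means, spike_vars))
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    def trend(x):
        # Outside the observed range, hold the boundary value constant.
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        i = bisect_left(xs, x)  # xs[i-1] < x <= xs[i]
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    return trend

def decompose(gene_means, gene_vars, trend):
    """Split each gene's total variance into (technical, biological)."""
    result = []
    for m, total in zip(gene_means, gene_vars):
        tech = trend(m)
        result.append((tech, total - tech))
    return result

# Hypothetical spike-in and endogenous statistics on the log scale.
trend = make_trend([0.0, 2.0, 4.0], [0.0, 0.5, 0.25])
components = decompose([1.0, 3.0], [1.25, 0.375], trend)
for tech, bio in components:
    print(tech, bio)
```

A gene whose total variance sits well above the trend has a large biological component, which is the pattern expected when cell types with different expression profiles are mixed.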

23.3.4 Dimensionality reduction

## [1] 50

23.3.5 Clustering

## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
## 283 451 114 143 599 167 191 128 350  70 199  58  39  24

23.3.6 Interpretation

We focus on upregulated marker genes as these can quickly provide positive identification of cell type in a heterogeneous population. We examine the table for cluster 1, in which log-fold changes are reported between cluster 1 and every other cluster. The same output is provided for each cluster in order to identify genes that discriminate between clusters.

## DataFrame with 10 rows and 8 columns
##                Top               p.value                   FDR
##          <integer>             <numeric>             <numeric>
## Atp1a3           1 1.45982186911338e-282 7.24035151533507e-279
## Celf4            1 2.27029547482426e-246  4.5040391925039e-243
## Gad1             1 7.44924918209853e-232 1.34350595021507e-228
## Gad2             1 2.88085688916001e-207 3.57208248900282e-204
## Mllt11           1 1.72981654896447e-249 3.81309227943405e-246
## Ndrg4            1                     0                     0
## Slc32a1          1 2.38275637971764e-110 4.04029947155706e-108
## Syngr3           1 3.68256675481535e-143 1.30461503301398e-140
## Atp6v1g2         2 3.04451168636047e-204 3.55294513798266e-201
## Napb             2 1.10402212607901e-231 1.82522457994002e-228
##                     logFC.2            logFC.3           logFC.4
##                   <numeric>          <numeric>         <numeric>
## Atp1a3   0.0398567870629432 0.0893942907040342  1.24138772287709
## Celf4     0.388671603033847  0.614502330656712 0.869334226406005
## Gad1       4.53927506566819   4.30032796959812  4.05030508822797
## Gad2       4.23224868544476   3.88846541684484  3.76955593590045
## Mllt11    0.578271855775141   1.49331277467142 0.951649432078093
## Ndrg4     0.888723903511497   1.01834080644559  1.14004100784735
## Slc32a1    1.91961732735247    1.8252062286114  1.80431073251461
## Syngr3     1.09812580310356   1.19947934287253  1.18885646587966
## Atp6v1g2 0.0981705572226326  0.520376039817737 0.616391117156833
## Napb        0.1774507863199  0.304690110897742 0.673772065505896
##                   logFC.5          logFC.6
##                 <numeric>        <numeric>
## Atp1a3   3.45668791973593 3.30955878767438
## Celf4    3.10464945550287 2.84820071288112
## Gad1     4.47235956786919 4.45592750650999
## Gad2     4.16902040416753 4.19188953367348
## Mllt11   2.88362685529244 3.00152878819561
## Ndrg4    3.84336948000034 3.70201252599173
## Slc32a1  1.92426053663399 1.95712425188146
## Syngr3   2.47695590144558 2.42819171398985
## Atp6v1g2 2.50875223076368 2.47036803963657
## Napb      2.8153253451666 2.58430358875952
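For intuition about the logFC columns in the table above, the Python sketch below computes, for one gene, the difference in mean log-expression between a target cluster and each other cluster. It assumes that log-fold changes are defined as differences of mean log-expression values; the function name `pairwise_logfc` and the toy data are hypothetical, and no p-values or rankings are computed.

```python
from statistics import mean

def pairwise_logfc(logexpr, clusters, target):
    """Log-fold change of `target` against every other cluster for one gene.

    logexpr: log-expression value of the gene in each cell.
    clusters: cluster label of each cell.
    Returns a dict such as {"logFC.2": ..., "logFC.3": ...}.
    """
    by_cluster = {}
    for value, label in zip(logexpr, clusters):
        by_cluster.setdefault(label, []).append(value)
    target_mean = mean(by_cluster[target])
    return {
        "logFC.%s" % other: target_mean - mean(values)
        for other, values in sorted(by_cluster.items())
        if other != target
    }

# Toy data: six cells in three clusters (hypothetical values).
logexpr = [5.0, 4.0, 0.5, 0.5, 1.0, 2.0]
clusters = [1, 1, 2, 2, 3, 3]
logfc = pairwise_logfc(logexpr, clusters, target=1)
print(logfc)
```

A positive value in every column indicates a gene that is upregulated in the target cluster relative to all others, which is the pattern sought for positive identification of a cell type.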

The heatmap below indicates that most of the top markers are strongly differentially expressed (DE) in cells of cluster 1 compared to some or all of the other clusters. We can use these markers to identify cells from cluster 1 in validation studies with an independent population of cells. A quick look at the markers suggests that cluster 1 represents interneurons, based on expression of Gad1 and Slc6a1 (Zeng et al. 2012).

An alternative visualization approach is to plot the log-fold changes to all other clusters directly. This is more concise and is useful in situations involving many clusters that contain different numbers of cells.

Session Info

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS:   /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRblas.so
LAPACK: /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] pheatmap_1.0.12             BiocSingular_1.2.0         
 [3] scran_1.14.0                org.Mm.eg.db_3.10.0        
 [5] AnnotationDbi_1.48.0        scater_1.14.0              
 [7] ggplot2_3.2.1               scRNAseq_1.99.8            
 [9] SingleCellExperiment_1.8.0  SummarizedExperiment_1.16.0
[11] DelayedArray_0.12.0         BiocParallel_1.20.0        
[13] matrixStats_0.55.0          Biobase_2.46.0             
[15] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0        
[17] IRanges_2.20.0              S4Vectors_0.24.0           
[19] BiocGenerics_0.32.0         Cairo_1.5-10               
[21] BiocStyle_2.14.0            OSCAUtils_0.0.1            

loaded via a namespace (and not attached):
 [1] bitops_1.0-6                  bit64_0.9-7                  
 [3] RColorBrewer_1.1-2            httr_1.4.1                   
 [5] tools_3.6.1                   backports_1.1.5              
 [7] R6_2.4.0                      irlba_2.3.3                  
 [9] vipor_0.4.5                   DBI_1.0.0                    
[11] lazyeval_0.2.2                colorspace_1.4-1             
[13] withr_2.1.2                   tidyselect_0.2.5             
[15] gridExtra_2.3                 bit_1.1-14                   
[17] curl_4.2                      compiler_3.6.1               
[19] BiocNeighbors_1.4.0           labeling_0.3                 
[21] bookdown_0.14                 scales_1.0.0                 
[23] rappdirs_0.3.1                stringr_1.4.0                
[25] digest_0.6.22                 rmarkdown_1.16               
[27] XVector_0.26.0                pkgconfig_2.0.3              
[29] htmltools_0.4.0               limma_3.42.0                 
[31] dbplyr_1.4.2                  fastmap_1.0.1                
[33] rlang_0.4.1                   RSQLite_2.1.2                
[35] shiny_1.4.0                   DelayedMatrixStats_1.8.0     
[37] dplyr_0.8.3                   RCurl_1.95-4.12              
[39] magrittr_1.5                  GenomeInfoDbData_1.2.2       
[41] Matrix_1.2-17                 Rcpp_1.0.2                   
[43] ggbeeswarm_0.6.0              munsell_0.5.0                
[45] viridis_0.5.1                 edgeR_3.28.0                 
[47] stringi_1.4.3                 yaml_2.2.0                   
[49] zlibbioc_1.32.0               Rtsne_0.15                   
[51] BiocFileCache_1.10.0          AnnotationHub_2.18.0         
[53] grid_3.6.1                    blob_1.2.0                   
[55] dqrng_0.2.1                   promises_1.1.0               
[57] ExperimentHub_1.12.0          crayon_1.3.4                 
[59] lattice_0.20-38               cowplot_1.0.0                
[61] locfit_1.5-9.1                zeallot_0.1.0                
[63] knitr_1.25                    pillar_1.4.2                 
[65] igraph_1.2.4.1                codetools_0.2-16             
[67] glue_1.3.1                    BiocVersion_3.10.1           
[69] evaluate_0.14                 BiocManager_1.30.9           
[71] vctrs_0.2.0                   httpuv_1.5.2                 
[73] gtable_0.3.0                  purrr_0.3.3                  
[75] assertthat_0.2.1              xfun_0.10                    
[77] rsvd_1.0.2                    mime_0.7                     
[79] xtable_1.8-4                  later_1.0.0                  
[81] viridisLite_0.3.0             tibble_2.1.3                 
[83] beeswarm_0.2.3                memoise_1.1.0                
[85] statmod_1.4.32                interactiveDisplayBase_1.24.0

Bibliography

Pollen, A. A., T. J. Nowakowski, J. Shuga, X. Wang, A. A. Leyrat, J. H. Lui, N. Li, et al. 2014. “Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex.” Nat. Biotechnol. 32 (10):1053–8.

Zeisel, A., A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, et al. 2015. “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.” Science 347 (6226):1138–42.

Zeng, H., E. H. Shen, J. G. Hohmann, S. W. Oh, A. Bernard, J. J. Royall, K. J. Glattfelder, et al. 2012. “Large-scale cellular-resolution gene profiling in human neocortex reveals species-specific molecular signatures.” Cell 149 (2):483–96.