Chapter 22 416B Smart-seq2 dataset

22.1 Introduction

The Lun et al. (2017) dataset contains two 96-well plates of 416B cells (an immortalized mouse myeloid progenitor cell line), processed using the Smart-seq2 protocol (Picelli et al. 2014). A constant amount of spike-in RNA from the External RNA Controls Consortium (ERCC) was also added to each cell’s lysate prior to library preparation. High-throughput sequencing was performed and the expression of each gene was quantified by counting the total number of reads mapped to its exonic regions. Similarly, the quantity of each spike-in transcript was measured by counting the number of reads mapped to the spike-in reference sequences.

22.2 Analysis code

22.2.2 Gene annotation

Ensembl identifiers are stable but harder to interpret than gene symbols. Thus, we obtain the symbols corresponding to each row using the relevant annotation package. We also rename the rows of our SingleCellExperiment with the symbols, reverting to Ensembl identifiers for missing or duplicate symbols.
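As a minimal sketch of the renaming rule described above, the following base R snippet falls back to the Ensembl identifier whenever a symbol is missing or shared by more than one row. The identifiers and symbols are invented for illustration, and the variable names are assumptions rather than the code used to produce this chapter.

```r
# Toy annotation; identifiers and symbols are hypothetical.
ensembl <- c("ENSMUSG_TOY_0001", "ENSMUSG_TOY_0002",
             "ENSMUSG_TOY_0003", "ENSMUSG_TOY_0004")
symbol <- c("GeneA", NA, "GeneB", "GeneB")

# A symbol is unusable if it is missing or appears more than once.
duplicated_symbols <- unique(symbol[duplicated(symbol)])
use_id <- is.na(symbol) | symbol %in% duplicated_symbols

# Prefer the symbol, reverting to the Ensembl identifier otherwise.
new_names <- ifelse(use_id, ensembl, symbol)
print(new_names)
```

The resulting names are unique, which is what allows them to be used as row names.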

22.2.3 Quality control

We save an unfiltered copy of the SingleCellExperiment for later use.

Technically, we do not need to use the mitochondrial proportions as we already have the spike-in proportions (which serve a similar purpose) for this dataset. However, including them is unlikely to do any harm.
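The way the two proportions work together can be illustrated with a small base R sketch. It assumes that a cell is discarded when either proportion lies more than 3 median absolute deviations above the median. That threshold, the helper name is_high_outlier and the percentages themselves are assumptions for illustration, not values taken from this dataset.

```r
# Hypothetical per-cell percentages of reads assigned to spike-ins
# and to mitochondrial genes.
spike_pct <- c(5, 6, 7, 5.5, 6.5, 40)
mito_pct  <- c(2, 3, 2.5, 30, 2.8, 3.1)

# Flag values that are unusually high relative to the other cells.
is_high_outlier <- function(x, nmads = 3) {
  x > median(x) + nmads * mad(x)
}

# A cell is discarded if it fails on either metric.
discard <- is_high_outlier(spike_pct) | is_high_outlier(mito_pct)
print(which(discard))
```

In this toy example each metric catches a different cell, which is the situation in which using both proportions adds something over using one alone.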

22.2.4 Normalization

No pre-clustering is performed here, as the dataset is small and all cells are derived from the same cell line anyway.
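For intuition about what a size factor does, the sketch below uses plain library-size factors scaled to a mean of 1, followed by a log-transformation. This is a deliberately simple stand-in that may differ from the method used to produce the results in this chapter. The count matrix is a toy example and the pseudo-count of 1 is an assumption.

```r
# Toy count matrix (genes in rows, cells in columns), filled column by column.
counts <- matrix(c(10, 0, 5,
                   20, 2, 8,
                    5, 1, 0,
                   40, 4, 16),
                 nrow = 3,
                 dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))

# Library-size factors, scaled so that their mean is 1.
lib_size <- colSums(counts)
size_factors <- lib_size / mean(lib_size)

# Divide each cell's counts by its size factor, then log-transform.
norm <- sweep(counts, 2, size_factors, "/")
logcounts <- log2(norm + 1)
print(summary(size_factors))
```

After scaling, every cell has the same total normalized count, so remaining differences between cells are no longer driven by sequencing depth.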

22.2.5 Variance modelling

We block on the plate of origin to minimize plate effects, and then we take the top 10% of genes with the largest biological components.
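The selection step can be sketched in base R as follows. The per-plate biological components are invented, and combining them with a simple average across plates is an assumption made for illustration; the variable names are likewise hypothetical.

```r
# Hypothetical biological components estimated separately on each plate.
genes <- paste0("gene", 1:20)
bio_by_plate <- cbind(plate1 = (1:20 - 10) / 10,
                      plate2 = (1:20 - 9) / 10)
rownames(bio_by_plate) <- genes

# Combine the per-plate estimates so that differences between plates
# do not inflate the variance attributed to each gene.
bio <- rowMeans(bio_by_plate)

# Keep the top 10% of genes with the largest biological components.
prop <- 0.1
n_top <- max(1, round(prop * length(bio)))
top_hvgs <- names(sort(bio, decreasing = TRUE))[seq_len(n_top)]
print(top_hvgs)
```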

22.2.6 Batch correction

The composition of cells is expected to be the same across the two plates, hence the use of removeBatchEffect() rather than more complex methods. For larger datasets, consider using regressBatches() from the batchelor package.
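The following base R snippet is a minimal sketch of the linear-regression idea behind this kind of correction: fit a per-gene linear model that includes a plate term, then subtract only the fitted plate effect. It is not a reimplementation of removeBatchEffect(). The treatment-coded design, the decision to protect the phenotype term and the toy expression values are all assumptions.

```r
# Toy design: two plates, each containing both phenotypes.
batch <- factor(rep(c("p1", "p2"), each = 4))
pheno <- factor(rep(c("wt", "ind"), times = 4))

# One gene: baseline of 1, induction adds 2, plate p2 adds a shift of 0.5.
y <- 1 + 2 * (pheno == "ind") + 0.5 * (batch == "p2")

# Fit a linear model with both phenotype and plate terms.
X <- model.matrix(~ pheno + batch)
fit <- lm.fit(X, y)

# Subtract the fitted plate effect, leaving the phenotype effect intact.
batch_cols <- grep("^batch", colnames(X))
corrected <- y - X[, batch_cols, drop = FALSE] %*% fit$coefficients[batch_cols]
print(cbind(observed = y, corrected = as.numeric(corrected)))
```

This only works cleanly because both phenotypes are present on both plates, which mirrors the assumption of equal composition stated above.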

22.2.7 Dimensionality reduction

We do not expect a great deal of heterogeneity in this dataset, so we only request 10 PCs. We use an exact SVD to avoid warnings from irlba about handling small datasets.
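As a rough illustration of what an exact SVD-based PCA involves, the sketch below centers a simulated matrix and keeps the first 10 components using base R's svd(). The simulated data and variable names are assumptions; the real analysis would operate on the batch-corrected expression values for the selected genes.

```r
set.seed(100)
expr <- matrix(rnorm(50 * 30), nrow = 50)  # 50 genes x 30 cells (simulated)

# PCA operates on cells as observations, with each gene centered.
centered <- scale(t(expr), center = TRUE, scale = FALSE)

# Exact SVD, keeping only the first 10 left and right singular vectors.
decomp <- svd(centered, nu = 10, nv = 10)

# PC scores are the left singular vectors scaled by the singular values.
pcs <- sweep(decomp$u, 2, decomp$d[1:10], "*")
print(dim(pcs))
```

For a matrix this small the exact decomposition is cheap, so an approximate algorithm offers no real benefit.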

22.3 Results

22.3.2 Normalization

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.347   0.711   0.921   1.000   1.152   3.604

We see that the induced cells have size factors that are systematically shifted from the uninduced cells, consistent with the presence of a composition bias.

22.3.4 Dimensionality reduction

## [1] 10

22.3.5 Clustering

We compare the clusters to the plate of origin. Each cluster comprises cells from both batches, indicating that the clustering is not driven by a batch effect.

##        Plate
## Cluster 20160113 20160325
##       1       40       38
##       2       37       32
##       3       10       14
##       4        6        8

We compare the clusters to the oncogene induction status. We observe differences in the composition of each cluster, consistent with a biological effect of oncogene induction.

##        Oncogene
## Cluster induced CBFB-MYH11 oncogene expression wild type phenotype
##       1                                     78                   0
##       2                                      0                  69
##       3                                      1                  23
##       4                                     14                   0

Most cells have relatively small positive widths in the silhouette plot below, indicating that the separation between clusters is weak. This may be symptomatic of over-clustering where clusters that are clearly defined on oncogene induction status are further split into subsets that are less well separated. Nonetheless, we will proceed with the current clustering scheme as it provides reasonable partitions for further characterization of heterogeneity.
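The effect of over-clustering on silhouette widths can be reproduced on a toy example with silhouette() from the cluster package, which appears among the attached packages in the session information. The one-dimensional coordinates and both cluster assignments below are invented for illustration.

```r
library(cluster)

# Two well-separated groups of points on a line.
x <- c(0, 1, 2, 3, 20, 21, 22, 23)
d <- dist(x)

# A clustering that matches the two groups, and one that needlessly
# splits the first group in half.
good <- c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L)
over <- c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L)

# Per-observation silhouette widths under each clustering.
w_good <- silhouette(good, d)[, "sil_width"]
w_over <- silhouette(over, d)[, "sil_width"]
print(round(rbind(good = w_good, over = w_over), 2))
```

Splitting a coherent group lowers the widths of its members while leaving them positive, which is the pattern described above.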

22.3.6 Interpretation

## DataFrame with 10 rows and 6 columns
##             Top              p.value                  FDR
##       <integer>            <numeric>            <numeric>
## Ccna2         1 9.85421668598772e-67  4.5924591443377e-62
## Cdca8         1 1.01448996794399e-41 1.52513840213101e-38
## Pirb          1 4.16554663077011e-33 1.95515769781954e-30
## Cks1b         2 2.98232994474815e-40 3.23229080802427e-37
## Aurkb         2 2.41435546776197e-64 5.62593111097891e-60
## Myh11         2 1.28865409950081e-46 3.75352722832101e-43
## Mcm6          3 1.15877343533364e-28 3.69886830001981e-26
## Cdca3         3 5.02047431193528e-45  1.2314430780707e-41
## Top2a         3 7.25965326958986e-61 1.12776293658655e-56
## Mcm2          4 1.50853796372546e-33 7.98907991607512e-31
##                 logFC.2           logFC.3            logFC.4
##               <numeric>         <numeric>          <numeric>
## Ccna2 -7.13310245267633 -2.20631703724532  -7.34510519589326
## Cdca8 -6.00378219751815 -2.03840963970867   -7.2617478169573
## Pirb   5.28148687893909  5.87819540547819 0.0352848702834082
## Cks1b -6.43381466749891 -4.15385484057129  -6.43853226584289
## Aurkb -6.94062780450867  -1.6553391915966  -6.41621256181855
## Myh11  4.38181921578171  4.29289578732512  0.941049889491085
## Mcm6  -5.44557545692518 -5.82130006100653  -3.58049727055086
## Cdca3 -6.22179351867583 -2.10502332118743  -7.05395104385302
## Top2a  -7.0781109799137 -2.39123404088975  -6.82973432115931
## Mcm2  -5.54196734535358 -6.09177602490117  -3.82381029847084
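The Top column is easier to read with a small example. The sketch below assumes that Top is the minimum rank of a gene across the pairwise comparisons against each other cluster, as described in the documentation for findMarkers() in scran. The p-values and gene names are invented.

```r
# Hypothetical p-values for four genes in cluster 1 versus clusters 2, 3 and 4.
pvals <- matrix(c(1e-10, 1e-2, 1e-3,
                  1e-3,  1e-8, 1e-2,
                  1e-5,  1e-4, 1e-9,
                  1e-2,  1e-3, 1e-4),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("gene", LETTERS[1:4]),
                                c("vs2", "vs3", "vs4")))

# Rank the genes within each pairwise comparison (1 = smallest p-value).
ranks <- apply(pvals, 2, rank)

# A gene's Top value is its best rank in any comparison.
top <- apply(ranks, 1, min)
print(top)
```

Under this definition, the set of genes with Top equal to 1 contains the best marker from each pairwise comparison, which would explain why rows sharing a Top value are not ordered by their combined p-value.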

We visualize the expression profiles of the top candidates in the heatmap below to verify that the DE signature is robust. Most of the top markers have strong and consistent up- or downregulation in cells of cluster 1 compared to some or all of the other clusters. A cursory examination of the heatmap indicates that cluster 1 contains oncogene-induced cells with strong downregulation of DNA replication and cell cycle genes. This is consistent with the potential induction of senescence as an anti-tumorigenic response (Wajapeyee et al. 2010).

Session Info

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS:   /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/
LAPACK: /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] cluster_2.1.0               dynamicTreeCut_1.63-1      
 [3] limma_3.42.0                scran_1.14.0               
 [5] scater_1.14.0               ggplot2_3.2.1              
 [7] ensembldb_2.10.0            AnnotationFilter_1.10.0    
 [9] GenomicFeatures_1.38.0      AnnotationDbi_1.48.0       
[11] AnnotationHub_2.18.0        BiocFileCache_1.10.0       
[13] dbplyr_1.4.2                scRNAseq_1.99.8            
[15] SingleCellExperiment_1.8.0  SummarizedExperiment_1.16.0
[17] DelayedArray_0.12.0         BiocParallel_1.20.0        
[19] matrixStats_0.55.0          Biobase_2.46.0             
[21] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0        
[23] IRanges_2.20.0              S4Vectors_0.24.0           
[25] BiocGenerics_0.32.0         Cairo_1.5-10               
[27] BiocStyle_2.14.0            OSCAUtils_0.0.1            

loaded via a namespace (and not attached):
 [1] Rtsne_0.15                    ggbeeswarm_0.6.0             
 [3] colorspace_1.4-1              XVector_0.26.0               
 [5] BiocNeighbors_1.4.0           bit64_0.9-7                  
 [7] interactiveDisplayBase_1.24.0 codetools_0.2-16             
 [9] knitr_1.25                    zeallot_0.1.0                
[11] Rsamtools_2.2.0               pheatmap_1.0.12              
[13] shiny_1.4.0                   BiocManager_1.30.9           
[15] compiler_3.6.1                httr_1.4.1                   
[17] dqrng_0.2.1                   backports_1.1.5              
[19] assertthat_0.2.1              Matrix_1.2-17                
[21] fastmap_1.0.1                 lazyeval_0.2.2               
[23] later_1.0.0                   BiocSingular_1.2.0           
[25] htmltools_0.4.0               prettyunits_1.0.2            
[27] tools_3.6.1                   rsvd_1.0.2                   
[29] igraph_1.2.4.1                gtable_0.3.0                 
[31] glue_1.3.1                    GenomeInfoDbData_1.2.2       
[33] dplyr_0.8.3                   rappdirs_0.3.1               
[35] Rcpp_1.0.2                    vctrs_0.2.0                  
[37] Biostrings_2.54.0             ExperimentHub_1.12.0         
[39] rtracklayer_1.46.0            DelayedMatrixStats_1.8.0     
[41] xfun_0.10                     stringr_1.4.0                
[43] mime_0.7                      irlba_2.3.3                  
[45] statmod_1.4.32                XML_3.98-1.20                
[47] edgeR_3.28.0                  zlibbioc_1.32.0              
[49] scales_1.0.0                  hms_0.5.2                    
[51] promises_1.1.0                ProtGenerics_1.18.0          
[53] RColorBrewer_1.1-2            yaml_2.2.0                   
[55] curl_4.2                      memoise_1.1.0                
[57] gridExtra_2.3                 biomaRt_2.42.0               
[59] stringi_1.4.3                 RSQLite_2.1.2                
[61] BiocVersion_3.10.1            rlang_0.4.1                  
[63] pkgconfig_2.0.3               bitops_1.0-6                 
[65] evaluate_0.14                 lattice_0.20-38              
[67] purrr_0.3.3                   labeling_0.3                 
[69] GenomicAlignments_1.22.0      cowplot_1.0.0                
[71] bit_1.1-14                    tidyselect_0.2.5             
[73] magrittr_1.5                  bookdown_0.14                
[75] R6_2.4.0                      DBI_1.0.0                    
[77] pillar_1.4.2                  withr_2.1.2                  
[79] RCurl_1.95-4.12               tibble_2.1.3                 
[81] crayon_1.3.4                  rmarkdown_1.16               
[83] viridis_0.5.1                 progress_1.2.2               
[85] locfit_1.5-9.1                grid_3.6.1                   
[87] blob_1.2.0                    digest_0.6.22                
[89] xtable_1.8-4                  httpuv_1.5.2                 
[91] openssl_1.4.1                 munsell_0.5.0                
[93] beeswarm_0.2.3                viridisLite_0.3.0            
[95] vipor_0.4.5                   askpass_1.1                  


Lun, A. T. L., F. J. Calero-Nieto, L. Haim-Vilmovsky, B. Gottgens, and J. C. Marioni. 2017. “Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.” Genome Res. 27 (11):1795–1806.

Picelli, S., O. R. Faridani, A. K. Bjorklund, G. Winberg, S. Sagasser, and R. Sandberg. 2014. “Full-length RNA-seq from single cells using Smart-seq2.” Nat Protoc 9 (1):171–81.

Wajapeyee, N., S. Z. Wang, R. W. Serra, P. D. Solomon, A. Nagarajan, X. Zhu, and M. R. Green. 2010. “Senescence induction in human fibroblasts and hematopoietic progenitors by leukemogenic fusion proteins.” Blood 115 (24):5057–60.