Chapter 16 Cell cycle assignment

16.1 Motivation

On occasion, it can be desirable to determine cell cycle activity from scRNA-seq data. In and of itself, the distribution of cells across phases of the cell cycle is not usually informative, but we can use this to determine if there are differences in cell cycle progression between subpopulations or across treatment conditions. Many of the key events in the cell cycle (e.g., passage through checkpoints) are post-translational and thus not directly visible in transcriptomic data; nonetheless, there are enough changes in expression that can be exploited to determine cell cycle phase. We demonstrate using the 416B dataset, which is known to contain actively cycling cells after oncogene induction.

### setup ###
library(OSCAUtils)
chapterPreamble(use_cache = TRUE)

### loading ###
library(scRNAseq)
sce.416b <- LunSpikeInData(which="416b") 
sce.416b$block <- factor(sce.416b$block)

### gene-annotation ###
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
rowData(sce.416b)$ENSEMBL <- rownames(sce.416b)
rowData(sce.416b)$SYMBOL <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
    keytype="GENEID", column="SYMBOL")
rowData(sce.416b)$SEQNAME <- mapIds(ens.mm.v97, keys=rownames(sce.416b),
    keytype="GENEID", column="SEQNAME")

library(scater)
rownames(sce.416b) <- uniquifyFeatureNames(rowData(sce.416b)$ENSEMBL, 
    rowData(sce.416b)$SYMBOL)

### quality-control ###
mito <- which(rowData(sce.416b)$SEQNAME=="MT")
stats <- perCellQCMetrics(sce.416b, subsets=list(Mt=mito))
qc <- quickPerCellQC(stats, percent_subsets=c("subsets_Mt_percent",
    "altexps_ERCC_percent"), nmads=3, batch=sce.416b$block)
sce.416b <- sce.416b[,!qc$discard]

### normalization ###
library(scran)
sce.416b <- computeSumFactors(sce.416b)
sce.416b <- logNormCounts(sce.416b)

### variance-modelling ###
dec.416b <- modelGeneVarWithSpikes(sce.416b, "ERCC", block=sce.416b$block)

### batch-correction ###
library(limma)
assay(sce.416b, "corrected") <- removeBatchEffect(logcounts(sce.416b), 
    design=model.matrix(~sce.416b$phenotype), batch=sce.416b$block)

### dimensionality-reduction ###
sce.416b <- denoisePCA(sce.416b, technical=dec.416b, 
    assay.type="corrected", BSPARAM=BiocSingular::ExactParam())

set.seed(1010)
sce.416b <- runTSNE(sce.416b, dimred="PCA", perplexity=10)

### clustering ###
my.dist <- dist(reducedDim(sce.416b, "PCA"))
my.tree <- hclust(my.dist, method="ward.D2")

library(dynamicTreeCut)
my.clusters <- unname(cutreeDynamic(my.tree, distM=as.matrix(my.dist),
    minClusterSize=10, verbose=0))
sce.416b$cluster <- factor(my.clusters)
## class: SingleCellExperiment 
## dim: 46604 185 
## metadata(0):
## assays(3): counts logcounts corrected
## rownames(46604): 4933401J01Rik Gm26206 ... CAAA01147332.1
##   CBFB-MYH11-mcherry
## rowData names(4): Length ENSEMBL SYMBOL SEQNAME
## colnames(185): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1
##   SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ...
##   SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1
##   SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1
## colData names(10): Source Name cell line ... block cluster
## reducedDimNames(2): PCA TSNE
## spikeNames(0):
## altExpNames(2): ERCC SIRV

16.2 Using the cyclins

The cyclins control progression through the cell cycle and have well-characterized patterns of expression across cell cycle phases. Cyclin D is expressed throughout but peaks at G1; cyclin E is expressed highest in the G1/S transition; cyclin A is expressed across S and G2; and cyclin B is expressed highest in late G2 and mitosis. Inspection of the relative expression of cyclins across the population can often be sufficient to determine the relative cell cycle activity in each cluster (Figure 16.1). For example, cluster 1 is likely to be in G1 while the other clusters are scattered across the later phases.

##  [1] "Ccnb3" "Ccna2" "Ccna1" "Ccne2" "Ccnd2" "Ccne1" "Ccnd1" "Ccnb2"
##  [9] "Ccnb1" "Ccnd3"
Heatmap of the log-normalized expression values of the cyclin genes in the 416B dataset.

Figure 16.1: Heatmap of the log-normalized expression values of the cyclin genes in the 416B dataset.

The 416B dataset is somewhat unusual as each cluster maps clealy onto a distinct phase of the cell cycle. Such separation is not typical in more heterogeneous datasets where the cell cycle is only a secondary factor of variation. However, it is not strictly necessary to know the exact phase of each cell or cluster to answer most cycle-related questions. For example, we can use standard DE methods (Chapter 32.2.8) to look for upregulation of each cyclin, allowing us to determine if there are more cells in the corresponding phase of the cell cycle between subpopulations. The same logic applies to comparisons between treatment conditions, as described in Chapter 14.

## DataFrame with 10 rows and 7 columns
##             Top              p.value                  FDR
##       <integer>            <numeric>            <numeric>
## Ccnb1         1 2.37957533903471e-17 2.37957533903471e-16
## Ccnb2         1 1.34341210453291e-12 4.47804034844303e-12
## Ccnd2         1   0.0178162563867722   0.0356325127735445
## Ccna2         2 7.90317047962163e-17 3.95158523981083e-16
## Ccnd3         2   0.0381754305787045   0.0636257176311741
## Ccnd1         3   0.0723589725759802    0.103369960822829
## Ccna1         4  0.00143225677223033  0.00358064193057581
## Ccne1         5    0.152911439222281    0.191139299027851
## Ccne2         7    0.821459489933298    0.912732766592554
## Ccnb3        10                    1                    1
##                   AUC.1             AUC.3             AUC.4
##               <numeric>         <numeric>         <numeric>
## Ccnb1 0.996180555555556 0.924479166666667 0.998842592592593
## Ccnb2 0.918055555555556 0.936631944444444 0.796296296296296
## Ccnd2 0.229861111111111         0.5859375  0.52662037037037
## Ccna2 0.987847222222222 0.598958333333333 0.939814814814815
## Ccnd3 0.561458333333333 0.495659722222222 0.472222222222222
## Ccnd1 0.197569444444444         0.6484375 0.527777777777778
## Ccna1 0.569444444444444 0.551649305555556 0.569444444444444
## Ccne1           0.59375 0.161458333333333 0.482638888888889
## Ccne2 0.540104166666667 0.207465277777778 0.357638888888889
## Ccnb3               0.5               0.5               0.5
##                   AUC.5
##               <numeric>
## Ccnb1 0.670940170940171
## Ccnb2 0.354700854700855
## Ccnd2 0.747863247863248
## Ccna2 0.427350427350427
## Ccnd3 0.722222222222222
## Ccnd1 0.183760683760684
## Ccna1 0.525641025641026
## Ccne1 0.492521367521368
## Ccne2 0.366452991452991
## Ccnb3               0.5

Direct examination of cyclin expression is easily understood, interpreted and validated with other technologies. However, it is best suited for statements about relative cell cycle activity; for example, we would find it difficult to assign cell cycle phase in Figure 16.1 without the presence of clusters spanning all phases to provide benchmarks for “high” and “low” expression of each cyclin. We also assume that cyclin expression is not affected by biological processes other than the cell cycle, which may be a strong assumption in some cases, e.g., malignant cells. On a practical note, we rely on having good coverage of the cyclins (if they are present), which is not a problem for the whole-of-transcriptome methods described below.

16.3 Using reference profiles

Cell cycle assignment can be considered a specialized case of cell annotation, which suggests that the strategies described in Chapter 12 can be applied here. For example, given a reference dataset containing mouse ESCs with known cell cycle phases (Buettner et al. 2015), we could use SingleR to determine the phase of each cell in a test dataset.

We identify phase-specific markers from the reference and use them to assign labels to the 416B data. Cluster 1 mostly consists of G1 cells while the other clusters are assigned to G2M, which is broadly consistent with our conclusions from the cyclin-based analysis. Unlike the cyclin-based analysis, this approach yields “absolute” assignments of cell cycle phase that do not need to be interpreted relative to other cells in the same dataset.

##      
##        1  2  3  4  5
##   G1  67  0  3 11  0
##   G2M 13 36 29 13 13

The key assumption here is that the cell cycle is orthogonal to cell type and other aspects of cell behavior. This justifies the use of a reference involving cell types that are quite different from the cells in the test dataset, provided that the cell cycle transcriptional program is conserved across datasets (Bertoli, Skotheim, and Bruin 2013; Conboy et al. 2007). However, it is not difficult to find routine violations of this assumption - for example, Dppa3 is detected as one of the top markers to distinguish between G1 from G2/M in the reference but has no detectable expression in the 416B dataset (Figure 16.2).

Distribution of log-normalized expression values for _Dppa3_ in the reference dataset (left) and in the 416B dataset (right).

Figure 16.2: Distribution of log-normalized expression values for Dppa3 in the reference dataset (left) and in the 416B dataset (right).

Thus, a healthy dose of skepticism is required when interpreting these assignments. Our hope is that any systematic assignment error is consistent across clusters and conditions such that they cancel out in comparisons of phase frequencies, which is the more interesting analysis anyway. Indeed, while the availability of absolute phase calls may be more appealing, it may not make much practical difference to the conclusions if the frequencies are ultimately interpreted in a relative sense (e.g., using a chi-squared test).

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab[, 1:2]
## X-squared = 68, df = 1, p-value <2e-16

16.4 Using the cyclone() classifier

The prediction method described by Scialdone et al. (2015) is another approach for classifying cells into cell cycle phases. Using a reference dataset, we first compute the sign of the difference in expression between each pair of genes. Pairs with changes in the sign across cell cycle phases are chosen as markers. Cells in a test dataset can then be classified into the appropriate phase, based on whether the observed sign for each marker pair is consistent with one phase or another. This approach is implemented in the cyclone() function from the scran package, which also contains pre-trained set of marker pairs for mouse and human data.

The phase assignment result for each cell in the 416B dataset is shown in Figure 16.3. For each cell, a higher score for a phase corresponds to a higher probability that the cell is in that phase. We focus on the G1 and G2/M scores as these are the most informative for classification.

Cell cycle phase scores from applying the pair-based classifier on the 416B dataset. Each point represents a cell, plotted according to its scores for G1 and G2/M phases.

Figure 16.3: Cell cycle phase scores from applying the pair-based classifier on the 416B dataset. Each point represents a cell, plotted according to its scores for G1 and G2/M phases.

Cells are classified as being in G1 phase if the G1 score is above 0.5 and greater than the G2/M score; in G2/M phase if the G2/M score is above 0.5 and greater than the G1 score; and in S phase if neither score is above 0.5. Here, the majority of cells in cluster 1 are again classified as being in G1 phase. The main difference from the SingleR assignments is the distinction between G2/M and S phases, which is slightly more consistent with our cyclin-based results. (This is attributable to some additional manual curation, performed during the definition of marker pairs to focus on genes with known roles in cell cycle progression.)

##      
##        1  2  3  4  5
##   G1  75  0  8 19  0
##   G2M  2 36 12  0 12
##   S    3  0 12  5  1

The same considerations and caveats described for the previous SingleR-based cell approach are also applicable here. From a practical perspective, cyclone() takes longer but does not require an explicit reference as the marker pairs are already computed.

16.5 Regressing out cell cycle phase

For some time, it was popular to regress out the cell cycle phase prior to downstream analyses. The aim was to remove uninteresting variation due to cell cycle, thus improving resolution of other biological processes of interest. We could implement this by performing cell cycle phase assignment as described above, treating each phase as a separate batch and applying any of the batch correction strategies described in Chapter 26.2.6. The most common approach is to use a linear model to simply regress out the phase effect, e.g., via removeBatchEffect().

That said, we do not consider adjusting for cell cycle to be a necessary step in routine scRNA-seq analyses. In most applications, the cell cycle is a minor factor of variation, secondary to differences between cell types. Any attempt at removal would also need to assume that the cell cycle effect is orthogonal to other biological processes. For example, regression would potentially remove interesting signal if cell cycle activity varied across clusters or conditions, with a prime example being the increased proliferation of activated T cells (Richard et al. 2018). We suggest only performing cell cycle adjustment on an as-needed basis in populations with clear cell cycle effects.

Alternatively, users may consider just excluding cell cycle-related genes from downstream analysis. This should remove most of the cell cycle effect without making strong assumptions about orthogonality. Of course, this will not remove the effect of the cell cycle in genes with no annotated role in the cell cycle, but in such cases, there is ambiguity over whether that effect is truly due to the cell cycle or from some other (interesting) biological process that happens to be correlated with the cell cycle.

Session Info

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS:   /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRblas.so
LAPACK: /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] org.Mm.eg.db_3.8.2          AnnotationDbi_1.47.1       
 [3] SingleR_0.99.13             BiocFileCache_1.9.1        
 [5] dbplyr_1.4.2                scran_1.13.26              
 [7] scater_1.13.25              ggplot2_3.2.1              
 [9] SingleCellExperiment_1.7.11 SummarizedExperiment_1.15.9
[11] DelayedArray_0.11.8         BiocParallel_1.19.3        
[13] matrixStats_0.55.0          Biobase_2.45.1             
[15] GenomicRanges_1.37.16       GenomeInfoDb_1.21.2        
[17] IRanges_2.19.16             S4Vectors_0.23.25          
[19] BiocGenerics_0.31.6         Cairo_1.5-10               
[21] BiocStyle_2.13.2            OSCAUtils_0.0.1            

loaded via a namespace (and not attached):
 [1] bitops_1.0-6                  bit64_0.9-7                  
 [3] RColorBrewer_1.1-2            httr_1.4.1                   
 [5] tools_3.6.1                   backports_1.1.5              
 [7] R6_2.4.0                      irlba_2.3.3                  
 [9] vipor_0.4.5                   DBI_1.0.0                    
[11] lazyeval_0.2.2                colorspace_1.4-1             
[13] withr_2.1.2                   tidyselect_0.2.5             
[15] gridExtra_2.3                 curl_4.2                     
[17] bit_1.1-14                    compiler_3.6.1               
[19] BiocNeighbors_1.3.5           labeling_0.3                 
[21] bookdown_0.14                 scales_1.0.0                 
[23] rappdirs_0.3.1                stringr_1.4.0                
[25] digest_0.6.21                 rmarkdown_1.16               
[27] XVector_0.25.0                pkgconfig_2.0.3              
[29] htmltools_0.4.0               fastmap_1.0.1                
[31] limma_3.41.17                 highr_0.8                    
[33] rlang_0.4.0                   RSQLite_2.1.2                
[35] shiny_1.4.0                   DelayedMatrixStats_1.7.2     
[37] dplyr_0.8.3                   RCurl_1.95-4.12              
[39] magrittr_1.5                  BiocSingular_1.1.7           
[41] GenomeInfoDbData_1.2.1        Matrix_1.2-17                
[43] Rcpp_1.0.2                    ggbeeswarm_0.6.0             
[45] munsell_0.5.0                 viridis_0.5.1                
[47] stringi_1.4.3                 yaml_2.2.0                   
[49] edgeR_3.27.13                 zlibbioc_1.31.0              
[51] AnnotationHub_2.17.10         grid_3.6.1                   
[53] blob_1.2.0                    promises_1.1.0               
[55] dqrng_0.2.1                   ExperimentHub_1.11.6         
[57] crayon_1.3.4                  lattice_0.20-38              
[59] cowplot_1.0.0                 locfit_1.5-9.1               
[61] zeallot_0.1.0                 knitr_1.25                   
[63] pillar_1.4.2                  igraph_1.2.4.1               
[65] glue_1.3.1                    evaluate_0.14                
[67] BiocManager_1.30.7            httpuv_1.5.2                 
[69] vctrs_0.2.0                   gtable_0.3.0                 
[71] purrr_0.3.2                   assertthat_0.2.1             
[73] xfun_0.10                     mime_0.7                     
[75] rsvd_1.0.2                    xtable_1.8-4                 
[77] later_1.0.0                   viridisLite_0.3.0            
[79] tibble_2.1.3                  pheatmap_1.0.12              
[81] beeswarm_0.2.3                memoise_1.1.0                
[83] statmod_1.4.32                interactiveDisplayBase_1.23.0

Bibliography

Bertoli, C., J. M. Skotheim, and R. A. de Bruin. 2013. “Control of cell cycle transcription during G1 and S phases.” Nat. Rev. Mol. Cell Biol. 14 (8):518–28.

Buettner, F., K. N. Natarajan, F. P. Casale, V. Proserpio, A. Scialdone, F. J. Theis, S. A. Teichmann, J. C. Marioni, and O. Stegle. 2015. “Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells.” Nat. Biotechnol. 33 (2):155–60.

Conboy, C. M., C. Spyrou, N. P. Thorne, E. J. Wade, N. L. Barbosa-Morais, M. D. Wilson, A. Bhattacharjee, et al. 2007. “Cell cycle genes are the evolutionarily conserved targets of the E2F4 transcription factor.” PLoS ONE 2 (10):e1061.

Richard, A. C., A. T. L. Lun, W. W. Y. Lau, B. Gottgens, J. C. Marioni, and G. M. Griffiths. 2018. “T cell cytolytic capacity is independent of initial stimulation strength.” Nat. Immunol. 19 (8):849–58.

Scialdone, A., K. N. Natarajan, L. R. Saraiva, V. Proserpio, S. A. Teichmann, O. Stegle, J. C. Marioni, and F. Buettner. 2015. “Computational assignment of cell-cycle stage from single-cell transcriptome data.” Methods 85 (September):54–61.