Chapter 30 Merged human pancreas datasets

30.1 Introduction

For a period in 2016, there was a great deal of interest in using scRNA-seq to profile the human pancreas at cellular resolution (Muraro et al. 2016; Grun et al. 2016; Lawlor et al. 2017; ???). As a consequence, we have a surplus of human pancreas datasets generated by different authors with different technologies, which provides an ideal use case for demonstrating more complex data integration strategies. This represents a more challenging application than the PBMC dataset in Chapter 13 as it involves different sequencing protocols and different patients, most likely with differences in cell type composition.

30.2 The good

We start by considering only two datasets from Muraro et al. (2016) and Grun et al. (2016). This is a relatively simple scenario involving very similar protocols (CEL-seq and CEL-seq2) and a similar set of authors.

## class: SingleCellExperiment 
## dim: 17692 1290 
## metadata(0):
## assays(2): counts logcounts
## rownames(17692): ENSG00000268895 ENSG00000121410 ...
##   ENSG00000074755 ENSG00000036549
## rowData names(2): symbol chr
## colnames(1290): D2ex_1 D2ex_2 ... D17TGFB_94 D17TGFB_95
## colData names(2): donor sample
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(1): ERCC
## class: SingleCellExperiment 
## dim: 16940 2346 
## metadata(0):
## assays(2): counts logcounts
## rownames(16940): ENSG00000268895 ENSG00000121410 ...
##   ENSG00000159840 ENSG00000074755
## rowData names(2): symbol chr
## colnames(2346): D28-1_1 D28-1_2 ... D30-8_93 D30-8_94
## colData names(3): label donor plate
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(1): ERCC

We subset both batches to their common universe of genes; adjust their scaling to equalize sequencing coverage (not really necessary in this case, as the coverage is already similar, but we will do so anyway for consistency); and select those genes with positive average biological components for further use.

We observe that rescaleBatches() is unable to align cells from different batches in Figure 30.1. This is attributable to differences in population composition between batches, with additional complications from non-linearities in the batch effect, e.g., when the magnitude or direction of the batch effect differs between cell types.

$t$-SNE plot of the two pancreas datasets after correction with `rescaleBatches()`. Each point represents a cell and is colored according to the batch of origin.

Figure 30.1: \(t\)-SNE plot of the two pancreas datasets after correction with rescaleBatches(). Each point represents a cell and is colored according to the batch of origin.

Here, we use fastMNN() to merge together the two human pancreas datasets described earlier. Clustering on the merged datasets yields fewer batch-specific clusters, which is recapitulated as greater intermingling between batches in Figure 30.2. This improvement over Figure 30.1 represents the ability of fastMNN() to adapt to more complex situations involving differences in population composition between batches.

##        Batch
## Cluster   1   2
##      1  320 279
##      2  343 265
##      3  209 847
##      4   61 197
##      5  161 399
##      6   42   1
##      7   25 108
##      8   22 127
##      9   59  80
##      10  33   0
##      11   0  18
##      12   8   4
##      13   7  21
$t$-SNE plot of the two pancreas datasets after correction with `fastMNN()`. Each point represents a cell and is colored according to the batch of origin.

Figure 30.2: \(t\)-SNE plot of the two pancreas datasets after correction with fastMNN(). Each point represents a cell and is colored according to the batch of origin.

30.3 The bad

Flushed with our previous success, we now attempt to merge the other datasets from Lawlor et al. (2017) and (???). This is a more challenging task as it involves different technologies, mixtures of UMI and read count data and a more diverse set of authors (presumably with greater differences in the patient population).

## class: SingleCellExperiment 
## dim: 26616 615 
## metadata(0):
## assays(2): counts logcounts
## rownames(26616): ENSG00000229483 ENSG00000232849 ...
##   ENSG00000251576 ENSG00000082898
## rowData names(2): SYMBOL SEQNAME
## colnames(615): 10th_C11_S96 10th_C13_S61 ... 9th-C96_S81
##   9th-C9_S13
## colData names(8): title age ... race Sex
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):
### setup ###
library(OSCAUtils)
chapterPreamble(use_cache = TRUE)

### loading ###
library(scRNAseq)
sce.seger <- SegerstolpePancreasData()

### gene-annotation ###
library(AnnotationHub)
edb <- AnnotationHub()[["AH73881"]]
symbols <- rowData(sce.seger)$symbol
ens.id <- mapIds(edb, keys=symbols, keytype="SYMBOL", column="GENEID")
ens.id <- ifelse(is.na(ens.id), symbols, ens.id)

# Removing duplicated rows.
keep <- !duplicated(ens.id)
sce.seger <- sce.seger[keep,]
rownames(sce.seger) <- ens.id[keep]

### sample-annotation ###
emtab.meta <- colData(sce.seger)[,c("cell type", 
    "individual", "single cell well quality")]
colnames(emtab.meta) <- c("CellType", "Donor", "Quality")
colData(sce.seger) <- emtab.meta

sce.seger$CellType <- gsub(" cell", "", sce.seger$CellType)
sce.seger$CellType <- paste0(
    toupper(substr(sce.seger$CellType, 1, 1)),
    substring(sce.seger$CellType, 2))

### quality-control ###
low.qual <- sce.seger$Quality == "low quality cell"

library(scater)
stats <- perCellQCMetrics(sce.seger)
qc <- quickPerCellQC(stats, nmads=3, percent_subsets="altexps_ERCC_percent")
sce.seger <- sce.seger[,!(qc$discard | low.qual)]

### normalization ###
library(scran)
clusters <- quickCluster(sce.seger)
sce.seger <- computeSumFactors(sce.seger, clusters=clusters)
sce.seger <- logNormCounts(sce.seger, use_altexps=FALSE)

### variance-modelling ###
for.hvg <- sce.seger[,librarySizeFactors(altExp(sce.seger)) > 0
    & sce.seger$Donor!="AZ"]
dec.seger <- modelGeneVarWithSpikes(for.hvg, "ERCC", block=for.hvg$Donor)
chosen.hvgs <- head(order(dec.seger$bio, decreasing=TRUE), 2000)
## class: SingleCellExperiment 
## dim: 25454 2169 
## metadata(0):
## assays(2): counts logcounts
## rownames(25454): ENSG00000118473 ENSG00000142920 ...
##   ENSG00000278306 eGFP
## rowData names(2): symbol refseq
## colnames(2169): HP1502401_H13 HP1502401_J14 ... HP1526901T2D_N8
##   HP1526901T2D_A8
## colData names(3): CellType Donor Quality
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(1): ERCC

We perform the usual routine to obtain re-normalized values and a set of HVGs.

We observe that the merge is generally successful, with many clusters containing contributions from each batch (Figure 30.3). There are few clusters that are specific to the Segerstolpe dataset, and if we were naive, we might consider them to represent interesting subpopulations that are not present in the other datasets.

##        Batch
## Cluster Grun Lawlor Muraro Seger
##      1   338     26    256   391
##      2   243    242    858   344
##      3    39     12     53    59
##      4   197     10    229   129
##      5    62     20    197   112
##      6     0      0      0    53
##      7    12      2     25     7
##      8   106      0      1     0
##      9   123    246    397   206
##      10    0      0      0   219
##      11   23     19    127   195
##      12   54      0      0     0
##      13   25     19    108    56
##      14   59     11     74    13
##      15    0      0      0    27
##      16    2      0      0   203
##      17    7      8     21    17
##      18    0      0      0   107
##      19    0      0      0    31
$t$-SNE plots of the four pancreas datasets after correction with `fastMNN()`. Each point represents a cell and is colored according to the batch of origin (left) or the assigned cluster (right). The cluster label is shown at the median location across all cells in the cluster.

Figure 30.3: \(t\)-SNE plots of the four pancreas datasets after correction with fastMNN(). Each point represents a cell and is colored according to the batch of origin (left) or the assigned cluster (right). The cluster label is shown at the median location across all cells in the cluster.

Fortunately, we are battle-hardened and cynical, so we are sure to check for other sources of variation. The most obvious candidate is the donor of origin for each cell (Figure 30.4), which correlates strongly to these Segerstolpe-only clusters. This is not surprising given the large differences between humans in the wild, but donor-level variation is not interesting for the purposes of cell type characterization. (That said, preservation of the within-dataset donor effects is the technically correct course of action here, as a batch correction method should try to avoid removing heterogeneity within each of its defined batches.)

$t$-SNE plots of the four pancreas datasets after correction with `fastMNN()`. Each point represents a cell and is colored according to the donor of origin for the Segerstolpe dataset.

Figure 30.4: \(t\)-SNE plots of the four pancreas datasets after correction with fastMNN(). Each point represents a cell and is colored according to the donor of origin for the Segerstolpe dataset.

30.4 The ugly

Given these results, the most prudent course of action is to remove the donor effects within each dataset in addition to the batch effects across datasets. This involves a bit more work to properly specify the two levels of unwanted heterogeneity. To make it a bit easier, we use the noCorrect() utility to combine all batches into a single SingleCellExperiment object.

For each donor, we determine the number of donors from the same dataset. This is used to weight the PCA step to ensure that each dataset contributes equally to the definition of the PC space, regardless of the number of donors in that dataset. (Similarly, the default behavior of fastMNN() ensures that each batch contributes equally to the PCA, regardless of the number of cells.)

## $Grun
## [1] "D2"  "D3"  "D7"  "D10" "D17"
## 
## $Lawlor
## [1] "ACIW009"  "ACJV399"  "ACCG268"  "ACCR015A" "ACEK420A" "ACEL337" 
## [7] "ACHY057"  "ACIB065" 
## 
## $Muraro
## [1] "D28" "D29" "D31" "D30"
## 
## $Seger
##  [1] "HP1502401"    "HP1504101T2D" "AZ"           "HP1508501T2D"
##  [5] "HP1506401"    "HP1507101"    "HP1509101"    "HP1504901"   
##  [9] "HP1525301T2D" "HP1526901T2D"
##           D2           D3           D7          D10          D17 
##            5            5            5            5            5 
##      ACIW009      ACJV399      ACCG268     ACCR015A     ACEK420A 
##            8            8            8            8            8 
##      ACEL337      ACHY057      ACIB065          D28          D29 
##            8            8            8            4            4 
##          D31          D30    HP1502401 HP1504101T2D           AZ 
##            4            4           10           10           10 
## HP1508501T2D    HP1506401    HP1507101    HP1509101    HP1504901 
##           10           10           10           10           10 
## HP1525301T2D HP1526901T2D 
##           10           10

We then call fastMNN() on the combined object, using the batch= argument to specify which cells belong to which donors. This will progressively merge cells from each donor until all cells are mapped onto a common coordinate space.

With this approach, we see that the Segerstolpe-only clusters have disappeared (Figure 30.5). Visually, there also seems to be much greater mixing between cells from different Segerstolpe donors. This suggests that we have removed most of the donor effect, which simplifies the interpretation of our clusters.

##         
## clusters Grun Lawlor Muraro Seger
##       1   343     21    285   188
##       2   203    254    465   292
##       3   350     27    256   392
##       4   221    247    863   911
##       5    65     20    197   115
##       6    10      1     24     7
##       7    26     18    108    56
##       8     7      9     21    17
##       9    24     18    127   191
##       10   41      0      0     0
$t$-SNE plots of the four pancreas datasets after donor-level correction with `fastMNN()`. Each point represents a cell and is colored according to the batch of origin (left) or the donor of origin for the Segerstolpe-derived cells (right). The cluster label is shown at the median location across all cells in the cluster.

Figure 30.5: \(t\)-SNE plots of the four pancreas datasets after donor-level correction with fastMNN(). Each point represents a cell and is colored according to the batch of origin (left) or the donor of origin for the Segerstolpe-derived cells (right). The cluster label is shown at the median location across all cells in the cluster.

Our clusters compare well to the published annotations, indicating that we did not inadvertently discard important factors of variation during correction. (Though in this case, the cell types are so well defined, it would be quite a feat to fail to separate them!)

##                         clusters
## proposed                    1    2    3    4    5    6    7    8    9   10
##   acinar                  423    0    2    0    0    0    0    1    0    0
##   alpha                     2    8    6 1890    1    0    0    1    0    0
##   beta                      3  943    5    5    0    1    2    1    7    0
##   co-expression             0   18    0   21    0    0    0    0    0    0
##   delta                     0    3    2    1  317    0    0    1    0    0
##   ductal                    9    4  615    1    0   13    5    0    1    0
##   endothelial               0    0    0    0    0    0    1   35    0    0
##   epsilon                   0    0    0    7    0    0    0    0    3    0
##   gamma                     2    0    0    1    0    0    0    0  302    0
##   mast                      0    0    0    0    0    2    0    0    0    0
##   mesenchymal               0    0    1    0    0    0   79    0    0    0
##   mhc class ii              0    0    0    0    0    5    0    0    0    0
##   nana                      2    1    9   17    1    0    0    1    1    0
##   none/other                0    1    3    5    0    1    0    4    0    0
##   stellate                  0    0    0    0    0    0   72    1    0    0
##   unclassified              0    0    0    0    0    0    2    0    0    0
##   unclassified endocrine    0    6    1    4    3    0    0    0    1    0
##   unclear                   0    0    4    0    0    0    0    0    0    0

Session Info

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS:   /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRblas.so
LAPACK: /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] fossil_0.3.7                shapefiles_0.7             
 [3] foreign_0.8-72              maps_3.3.0                 
 [5] sp_1.3-1                    scater_1.13.25             
 [7] ggplot2_3.2.1               scran_1.13.26              
 [9] batchelor_1.1.17            SingleCellExperiment_1.7.11
[11] SummarizedExperiment_1.15.9 DelayedArray_0.11.8        
[13] BiocParallel_1.19.3         matrixStats_0.55.0         
[15] Biobase_2.45.1              GenomicRanges_1.37.16      
[17] GenomeInfoDb_1.21.2         IRanges_2.19.16            
[19] S4Vectors_0.23.25           BiocGenerics_0.31.6        
[21] Cairo_1.5-10                BiocStyle_2.13.2           
[23] OSCAUtils_0.0.1            

loaded via a namespace (and not attached):
 [1] viridis_0.5.1            edgeR_3.27.13           
 [3] BiocSingular_1.1.7       viridisLite_0.3.0       
 [5] DelayedMatrixStats_1.7.2 assertthat_0.2.1        
 [7] statmod_1.4.32           highr_0.8               
 [9] BiocManager_1.30.7       dqrng_0.2.1             
[11] GenomeInfoDbData_1.2.1   vipor_0.4.5             
[13] yaml_2.2.0               pillar_1.4.2            
[15] lattice_0.20-38          glue_1.3.1              
[17] limma_3.41.17            digest_0.6.21           
[19] XVector_0.25.0           colorspace_1.4-1        
[21] cowplot_1.0.0            htmltools_0.4.0         
[23] Matrix_1.2-17            pkgconfig_2.0.3         
[25] bookdown_0.14            zlibbioc_1.31.0         
[27] purrr_0.3.2              scales_1.0.0            
[29] Rtsne_0.15               tibble_2.1.3            
[31] withr_2.1.2              lazyeval_0.2.2          
[33] magrittr_1.5             crayon_1.3.4            
[35] evaluate_0.14            beeswarm_0.2.3          
[37] tools_3.6.1              stringr_1.4.0           
[39] munsell_0.5.0            locfit_1.5-9.1          
[41] irlba_2.3.3              compiler_3.6.1          
[43] rsvd_1.0.2               rlang_0.4.0             
[45] grid_3.6.1               RCurl_1.95-4.12         
[47] BiocNeighbors_1.3.5      igraph_1.2.4.1          
[49] labeling_0.3             bitops_1.0-6            
[51] rmarkdown_1.16           gtable_0.3.0            
[53] codetools_0.2-16         R6_2.4.0                
[55] gridExtra_2.3            knitr_1.25              
[57] dplyr_0.8.3              stringi_1.4.3           
[59] ggbeeswarm_0.6.0         Rcpp_1.0.2              
[61] tidyselect_0.2.5         xfun_0.10               

Bibliography

Grun, D., M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, et al. 2016. “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell 19 (2):266–77.

Lawlor, N., J. George, M. Bolisetty, R. Kursawe, L. Sun, V. Sivakamasundari, I. Kycia, P. Robson, and M. L. Stitzel. 2017. “Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes.” Genome Res. 27 (2):208–22.

Muraro, M. J., G. Dharmadhikari, D. Grun, N. Groen, T. Dielen, E. Jansen, L. van Gurp, et al. 2016. “A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Syst 3 (4):385–94.