# Chapter 13 Integrating Datasets

## 13.1 Motivation

Large single-cell RNA sequencing (scRNA-seq) projects usually need to generate data across multiple batches due to logistical constraints. However, the processing of different batches is often subject to uncontrollable differences, e.g., changes in operator, differences in reagent quality. This results in systematic differences in the observed expression in cells from different batches, which we refer to as “batch effects”. Batch effects are problematic as they can be major drivers of heterogeneity in the data, masking the relevant biological differences and complicating interpretation of the results.

Computational correction of these effects is critical for eliminating batch-to-batch variation, allowing data across multiple batches to be combined for common downstream analysis. However, existing methods based on linear models (Ritchie et al. 2015; Leek et al. 2012) assume that the composition of cell populations are either known or the same across batches. To overcome these limitations, bespoke methods have been developed for batch correction of single-cell data (Haghverdi et al. 2018; Butler et al. 2018; Lin et al. 2019) that do not require a priori knowledge about the composition of the population. This allows them to be used in workflows for exploratory analyses of scRNA-seq data where such knowledge is usually unavailable.

## 13.2 Setting up the data

To demonstrate, we will use two separate 10X Genomics PBMC datasets generated in two different batches. Each dataset was obtained from the TENxPBMCData package and separately subjected to basic processing steps. Separate processing prior to the batch correction step is more convenient, scalable and (on occasion) more reliable. For example, outlier-based QC on the cells is more effective when performed within a batch (Section 6.3.2.3). The same can also be said for trend fitting when modelling the mean-variance relationship (Section 8.2.4.1).

### loading ###
library(TENxPBMCData)
pbmc3k <- TENxPBMCData('pbmc3k')

### quality-control ###
is.mito <- grep("MT", rowData(pbmc3k)$Symbol_TENx) library(scater) stats <- perCellQCMetrics(pbmc3k, subsets=list(Mito=is.mito)) high.mito <- isOutlier(stats$subsets_Mito_percent, nmads=3, type="higher")
pbmc3k <- pbmc3k[,!high.mito]

### normalization ###
pbmc3k <- logNormCounts(pbmc3k)

### variance-modelling ###
library(scran)
dec3k <- modelGeneVar(pbmc3k)

### feature-selection ###
chosen.hvgs <- which(dec3k$bio > 0) ### dimensionality-reduction ### # Using randomized SVD, which is more efficient for file-backed matrices. set.seed(10000) pbmc3k <- runPCA(pbmc3k, subset_row=chosen.hvgs, ncomponents=25, BSPARAM=BiocSingular::RandomParam()) set.seed(100000) pbmc3k <- runTSNE(pbmc3k, dimred="PCA") set.seed(1000000) pbmc3k <- runUMAP(pbmc3k, dimred="PCA") ### clustering ### g <- buildSNNGraph(pbmc3k, k=10, use.dimred = 'PCA') clust <- igraph::cluster_walktrap(g)$membership
pbmc3k$cluster <- factor(clust) pbmc3k ## class: SingleCellExperiment ## dim: 32738 2609 ## metadata(0): ## assays(2): counts logcounts ## rownames(32738): ENSG00000243485 ENSG00000237613 ... ## ENSG00000215616 ENSG00000215611 ## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol ## colnames: NULL ## colData names(12): Sample Barcode ... Date_published cluster ## reducedDimNames(3): PCA TSNE UMAP ## spikeNames(0): ## altExpNames(0): ### loading ### library(TENxPBMCData) pbmc4k <- TENxPBMCData('pbmc4k') ### quality-control ### is.mito <- grep("MT", rowData(pbmc4k)$Symbol_TENx)

library(scater)
stats <- perCellQCMetrics(pbmc4k, subsets=list(Mito=is.mito))
high.mito <- isOutlier(stats$subsets_Mito_percent, nmads=3, type="higher") pbmc4k <- pbmc4k[,!high.mito] ### normalization ### pbmc4k <- logNormCounts(pbmc4k) ### variance-modelling ### library(scran) dec4k <- modelGeneVar(pbmc4k) ### feature-selection ### chosen.hvgs <- which(dec4k$bio > 0)

### dimensionality-reduction ###
# Using randomized SVD, which is more efficient for file-backed matrices.
set.seed(10000)
pbmc4k <- runPCA(pbmc4k, subset_row=chosen.hvgs, ncomponents=25,
BSPARAM=BiocSingular::RandomParam())

set.seed(100000)
pbmc4k <- runTSNE(pbmc4k, dimred="PCA")

set.seed(1000000)
pbmc4k <- runUMAP(pbmc4k, dimred="PCA")

### clustering ###
g <- buildSNNGraph(pbmc4k, k=10, use.dimred = 'PCA')
clust <- igraph::cluster_walktrap(g)$membership pbmc4k$cluster <- factor(clust)
pbmc4k
## class: SingleCellExperiment
## dim: 33694 4182
## metadata(0):
## assays(2): counts logcounts
## rownames(33694): ENSG00000243485 ENSG00000237613 ...
##   ENSG00000277475 ENSG00000268674
## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
## colnames: NULL
## colData names(12): Sample Barcode ... Date_published cluster
## reducedDimNames(3): PCA TSNE UMAP
## spikeNames(0):
## altExpNames(0):

To prepare for the batch correction:

1. We subset all batches to the common “universe” of features. In this case, it is straightforward as both batches use Ensembl gene annotation16.

universe <- intersect(rownames(pbmc3k), rownames(pbmc4k))
length(universe)
## [1] 31232
2. We rescale each batch to adjust for differences in sequencing depth between batches. The multiBatchNorm() function recomputes log-normalized expression values after adjusting the size factors for systematic differences in coverage between SingleCellExperiment objects. (Size factors only remove biases between cells within a single batch.) This improves the quality of the correction by removing one aspect of the technical differences between batches.

library(batchelor)
rescaled <- multiBatchNorm(pbmc3k[universe,], pbmc4k[universe,])
pbmc3k <- rescaled[[1]]
pbmc4k <- rescaled[[2]]
3. We perform feature selection by averaging the variance components across all batches with the combineVar() function. We use the average as it is responsive to batch-specific HVGs while still preserving the within-batch ranking of genes. This allows us to use the same strategies described in Section 8.3 to select genes of interest. For example, if we were to retain all genes with positive biological components, convergence of the average towards zero for non-HVGs ensures that the expected number of retained genes is stable with more batches. In contrast, approaches based on taking the intersection or union of HVGs across batches become increasingly conservative or liberal, respectively, with an increasing number of batches.

library(scran)
combined.dec <- combineVar(dec3k[universe,], dec4k[universe,])
chosen.hvgs <- combined.dec$bio > 0 combined.dec[,1:6] ## DataFrame with 31232 rows and 6 columns ## mean total ## <numeric> <numeric> ## ENSG00000243485 0 0 ## ENSG00000237613 0 0 ## ENSG00000186092 0 0 ## ENSG00000238009 0.00104386992482179 0.00106138029098745 ## ENSG00000239945 0.000258798374840956 0.000281948894897105 ## ... ... ... ## ENSG00000212907 0.535233858583492 0.437804768744041 ## ENSG00000198886 3.01692549039158 0.617031726421764 ## ENSG00000198786 1.54198723959989 0.617834240701208 ## ENSG00000198695 0.127831201202011 0.129776187258376 ## ENSG00000198727 2.56240219438806 0.717304616025944 ## tech bio ## <numeric> <numeric> ## ENSG00000243485 0 0 ## ENSG00000237613 0 0 ## ENSG00000186092 0 0 ## ENSG00000238009 0.00106590149926527 -4.52120827782246e-06 ## ENSG00000239945 0.000264260488048504 1.76884068486008e-05 ## ... ... ... ## ENSG00000212907 0.408040681424972 0.0297640873190686 ## ENSG00000198886 0.596769333970983 0.0202623924507815 ## ENSG00000198786 0.628483569144462 -0.010649328443254 ## ENSG00000198695 0.129278188931492 0.000497998326884352 ## ENSG00000198727 0.691490878206059 0.0258137378198841 ## p.value FDR ## <numeric> <numeric> ## ENSG00000243485 NA NA ## ENSG00000237613 NA NA ## ENSG00000186092 NA NA ## ENSG00000238009 0.512246221415639 0.834032705754633 ## ENSG00000239945 0.314021259910558 0.834032705754633 ## ... ... ... ## ENSG00000212907 0.404607233389663 0.834032705754633 ## ENSG00000198886 0.215888743417311 0.834032705754633 ## ENSG00000198786 0.642327056813382 0.834072156711655 ## ENSG00000198695 0.585074623751064 0.834032705754633 ## ENSG00000198727 0.419767773645846 0.834032705754633 ## <numeric> ## 0 ## 0 ## 0 ## 0.00104386992482179 ## 0.000258798374840956 ## ... ## 0.535233858583492 ## 3.01692549039158 ## 1.54198723959989 ## 0.127831201202011 ## 2.56240219438806 ## <numeric> ## 0 ## 0 ## 0 ## 0.00106138029098745 ## 0.000281948894897105 ## ... ## 0.437804768744041 ## 0.617031726421764 ## 0.617834240701208 ## 0.129776187258376 ## 0.717304616025944 ## <numeric> ## 0 ## 0 ## 0 ## 0.00106590149926527 ## 0.000264260488048504 ## ... ## 0.408040681424972 ## 0.596769333970983 ## 0.628483569144462 ## 0.129278188931492 ## 0.691490878206059 ## <numeric> ## 0 ## 0 ## 0 ## -4.52120827782246e-06 ## 1.76884068486008e-05 ## ... ## 0.0297640873190686 ## 0.0202623924507815 ## -0.010649328443254 ## 0.000497998326884352 ## 0.0258137378198841 ## <numeric> ## NA ## NA ## NA ## 0.512246221415639 ## 0.314021259910558 ## ... ## 0.404607233389663 ## 0.215888743417311 ## 0.642327056813382 ## 0.585074623751064 ## 0.419767773645846 ## <numeric> ## NA ## NA ## NA ## 0.834032705754633 ## 0.834032705754633 ## ... ## 0.834032705754633 ## 0.834032705754633 ## 0.834072156711655 ## 0.834032705754633 ## 0.834032705754633 ## 13.3 Diagnosing batch effects Before we actually perform any correction, it is worth examining whether there is any batch effect in this dataset. We combine the two SingleCellExperiments and perform a PCA on the log-expression values for all genes with positive (average) biological components. # Synchronizing the metadata for cbind()ing. rowData(pbmc3k) <- rowData(pbmc4k) pbmc3k$batch <- "3k"
pbmc4k$batch <- "4k" combined <- cbind(pbmc3k, pbmc4k) # Using RandomParam() as it is more efficient for file-backed matrices. library(scater) set.seed(0010101010) combined <- runPCA(combined, subset_row=chosen.hvgs, BSPARAM=BiocSingular::RandomParam()) We use graph-based clustering on the components to obtain a summary of the population structure. As our two PBMC populations should be replicates, each cluster should ideally consist of cells from both batches. However, we instead see clusters that are comprised of cells from a single batch. This indicates that cells of the same type are artificially separated due to technical differences between batches. library(scran) snn.gr <- buildSNNGraph(combined, use.dimred="PCA") clusters <- igraph::cluster_walktrap(snn.gr)$membership
tab <- table(Cluster=clusters, Batch=combined$batch) tab ## Batch ## Cluster 3k 4k ## 1 0 126 ## 2 12 459 ## 3 1 776 ## 4 0 1310 ## 5 500 0 ## 6 0 536 ## 7 0 606 ## 8 1296 0 ## 9 0 176 ## 10 0 54 ## 11 149 0 ## 12 30 1 ## 13 0 89 ## 14 131 0 ## 15 342 0 ## 16 1 10 ## 17 134 0 ## 18 11 3 ## 19 2 36 We can also visualize the corrected coordinates using a $$t$$-SNE plot (Figure 13.1). The strong separation between cells from different batches is consistent with the clustering results. combined <- runTSNE(combined, dimred="PCA") plotTSNE(combined, colour_by="batch") Of course, the other explanation for batch-specific clusters is that there are cell types that are unique to each batch. The degree of intermingling of cells from different batches is not an effective diagnostic when the batches involved might actually contain unique cell subpopulations (which is not a consideration in the PBMC dataset, but the same cannot be said in general). If a cluster only contains cells from a single batch, one can always debate whether that is caused by a failure of the correction method or if there is truly a batch-specific subpopulation. For example, do batch-specific metabolic or differentiation states represent distinct subpopulations? Or should they be merged together? We will not attempt to answer this here, only noting that each batch correction algorithm will make different (and possibly inappropriate) decisions on what constitutes “shared” and “unique” populations. ## 13.4 Linear regression Batch effects in bulk RNA sequencing studies are commonly removed with linear regression. This involves fitting a linear model to each gene’s expression profile, setting the undesirable batch term to zero and recomputing the observations sans the batch effect, yielding a set of corrected expression values for downstream analyses. Linear modelling is the basis of the removeBatchEffect() function from the limma package (Ritchie et al. 2015) as well the comBat() function from the sva package (Leek et al. 2012). To use this approach in a scRNA-seq context, we assume that the composition of cell subpopulations is the same across batches. We also assume that the batch effect is additive, i.e., any batch-induced fold-change in expression is the same across different cell subpopulations for any given gene. These are strong assumptions as batches derived from different individuals will naturally exhibit variation in cell type abundances and expression. Nonetheless, they may be acceptable when dealing with batches that are technical replicates generated from the same population of cells. Linear modelling can also accommodate situations where the composition is known a priori by including the cell type as a factor in the linear model, but this situation is even less common17. We use the rescaleBatches() function from the batchelor package to remove the batch effect. This is roughly equivalent to applying a linear regression to the log-expression values per gene, with some adjustments to improve performance and efficiency. For each gene, the mean expression in each batch is scaled down until it is equal to the lowest mean across all batches. We deliberately choose to scale all expression values down as this mitigates differences in variance when batches lie at different positions on the mean-variance trend. (Specifically, the shrinkage effect of the pseudo-count is greater for smaller counts, suppressing any differences in variance across batches.) An additional feature of rescaleBatches() is that it will preserve sparsity in the input matrix for greater efficiency, whereas other methods like removeBatchEffect() will always return a dense matrix. library(batchelor) rescaled <- rescaleBatches(pbmc3k, pbmc4k) rescaled ## class: SingleCellExperiment ## dim: 31232 6791 ## metadata(0): ## assays(1): corrected ## rownames(31232): ENSG00000243485 ENSG00000237613 ... ## ENSG00000198695 ENSG00000198727 ## rowData names(0): ## colnames: NULL ## colData names(1): batch ## reducedDimNames(0): ## spikeNames(0): ## altExpNames(0): After clustering, we observe that most clusters consist of mixtures of cells from the two replicate batches, consistent with the removal of the batch effect. This conclusion is supported by the apparent mixing of cells from different batches in Figure 13.2. However, at least one batch-specific cluster is still present, indicating that the correction is not entirely complete. This is attributable to violation of one of the aforementioned assumptions, even in this simple case involving replicated batches. set.seed(1010101010) rescaled <- runPCA(rescaled, subset_row=chosen.hvgs, BSPARAM=BiocSingular::IrlbaParam(), exprs_values="corrected") snn.gr <- buildSNNGraph(rescaled, use.dimred="PCA") clusters.resc <- igraph::cluster_walktrap(snn.gr)$membership
tab.resc <- table(Cluster=clusters.resc, Batch=rescaled$batch) tab.resc ## Batch ## Cluster 1 2 ## 1 272 523 ## 2 336 606 ## 3 126 266 ## 4 643 560 ## 5 19 47 ## 6 12 3 ## 7 313 0 ## 8 8 50 ## 9 19 58 ## 10 15 70 ## 11 131 154 ## 12 37 511 ## 13 10 83 ## 14 100 207 ## 15 137 8 ## 16 16 25 ## 17 397 964 ## 18 3 36 ## 19 4 8 ## 20 11 3 rescaled <- runTSNE(rescaled, dimred="PCA") rescaled$batch <- factor(rescaled$batch) plotTSNE(rescaled, colour_by="batch") ## 13.5 Performing MNN correction ### 13.5.1 Application to the PBMC data Consider a cell $$a$$ in batch $$A$$, and identify the cells in batch $$B$$ that are nearest neighbours to $$a$$ in the expression space defined by the selected features. Repeat this for a cell $$b$$ in batch $$B$$, identifying its nearest neighbours in $$A$$. Mutual nearest neighbours are pairs of cells from different batches that belong in each other’s set of nearest neighbours. The reasoning is that MNN pairs represent cells from the same biological state prior to the application of a batch effect - see Haghverdi et al. (2018) for full theoretical details. Thus, the difference between cells in MNN pairs can be used as an estimate of the batch effect, the subtraction of which yields batch-corrected values. The batchelor package provides an implementation of the MNN approach via the fastMNN() function. (Unlike the MNN method described by Haghverdi et al. (2018), the fastMNN() function performs PCA to reduce the dimensions beforehand and speed up the downstream neighbor detection steps.) We apply it to our two PBMC batches to remove the batch effect across the highly variable genes in chosen. To reduce computational work and technical noise, all cells in all batches are projected into the low-dimensional space defined by the top d principal components. Identification of MNNs and calculation of correction vectors are then performed in this low-dimensional space. # Using randomized SVD here, as this is faster than # irlba for file-backed matrices. set.seed(1000101001) mnn.out <- fastMNN(pbmc3k, pbmc4k, d=50, k=20, BSPARAM=BiocSingular::RandomParam(deferred=TRUE)) mnn.out ## class: SingleCellExperiment ## dim: 31232 6791 ## metadata(1): merge.info ## assays(1): reconstructed ## rownames(31232): ENSG00000243485 ENSG00000237613 ... ## ENSG00000198695 ENSG00000198727 ## rowData names(1): rotation ## colnames: NULL ## colData names(1): batch ## reducedDimNames(1): corrected ## spikeNames(0): ## altExpNames(0): The function returns a SingleCellExperiment object containing corrected values for downstream analyses like clustering or visualization. Each column of mnn.out corresponds to a cell in one of the batches, while each row corresponds to an input gene in chosen. The batch field in the column metadata contains a vector specifying the batch of origin of each cell. head(mnn.out$batch) 
## [1] 1 1 1 1 1 1

The corrected matrix in the reducedDims() contains the low-dimensional corrected coordinates for all cells, which we will use in place of the PCs in our downstream analyses.

dim(reducedDim(mnn.out, "corrected"))
## [1] 6791   50

The most relevant parameter for tuning fastMNN() is k, which specifies the number of nearest neighbours to consider when defining MNN pairs. This can be interpreted as the minimum anticipated frequency of any shared cell type or state in each batch. Increasing k will generally result in more aggressive merging as the algorithm is more generous in matching subpopulations across batches. It can occasionally be desirable to increase k if one clearly sees that the same cell types are not being adequately merged across batches.

### 13.5.2 Correction diagnostics

We cluster on the low-dimensional corrected coordinates to obtain a partitioning of the cells that serves as a proxy for the population structure. If the batch effect is successfully corrected, clusters corresponding to shared cell types or states should contain cells from multiple batches. We see that all clusters contain contributions from each batch after correction, consistent with our expectation that the two batches are replicates of each other.

library(scran)
snn.gr <- buildSNNGraph(mnn.out, use.dimred="corrected")
clusters.mnn <- igraph::cluster_walktrap(snn.gr)$membership tab.mnn <- table(Cluster=clusters.mnn, Batch=mnn.out$batch)
tab.mnn
##        Batch
## Cluster   1   2
##      1  282 507
##      2  331 588
##      3  301 633
##      4  210 163
##      5  675 622
##      6   12  17
##      7   19  74
##      8    9  52
##      9    7  18
##      10  28  50
##      11  14  47
##      12  59 174
##      13 410 982
##      14 122 132
##      15   4  36
##      16 111  76
##      17   4   8
##      18  11   3

We can also visualize the corrected coordinates using a $$t$$-SNE plot (Figure 13.3). The presence of visual clusters containing cells from both batches provides a comforting illusion that the correction was successful.

library(scater)
set.seed(0010101010)
mnn.out <- runTSNE(mnn.out, use_dimred="corrected")

mnn.out$batch <- factor(mnn.out$batch)
plotTSNE(mnn.out, colour_by="batch")

For fastMNN(), one useful diagnostic is the proportion of variance within each batch that is lost during MNN correction. Specifically, this refers to the within-batch variance that is removed during orthogonalization with respect to the average correction vector at each merge step. This is returned via the lost.var field in the metadata of mnn.out, which contains a matrix of the variance lost in each batch (column) at each merge step (row).

metadata(mnn.out)$merge.info$lost.var
##          [,1]    [,2]
## [1,] 0.004513 0.00327

Large proportions of lost variance suggest that correction is removing genuine biological heterogeneity. This would occur due to violations of the assumption of orthogonality between the batch effect and the biological subspace (Haghverdi et al. 2018). In this case, the proportion of lost variance is small, indicating that non-orthogonality is not a major concern.

## 13.6 Preserving biological heterogeneity

Another useful diagnostic check is to compare the clustering within each batch to the clustering of the merged data. Accurate data integration should preserve variance within each batch as there should be nothing to remove between cells in the same batch. This check complements the previously mentioned diagnostics that only focus on the removal of differences between batches. Specifically, it protects us against cases where the correction method simply aggregates all cells together, which would achieve perfect mixing but also discard the biological heterogeneity of interest.

Ideally, we should see a many-to-1 mapping where the across-batch clustering is nested inside the within-batch clusterings. This indicates that any within-batch structure was preserved after correction while acknowledging that greater resolution is possible with more cells. In practice, more discrepancies can be expected even when the correction is perfect, due to the existence of closely related clusters that were arbitrarily separated in the within-batch clustering. As a general rule, we can be satisfied with the correction if the vast majority of entries of the table()s below are zero, though this may depend on whether specific clusters of interest are gained or lost.

# For the first batch.
table(New=clusters.mnn[rescaled$batch==1], Old=pbmc3k$cluster)
##     Old
## New    1   2   3   4   5   6   7   8   9
##   1    0 275   0   3   0   4   0   0   0
##   2    0   0   3   0   0   0   0 328   0
##   3  300   0   0   0   0   0   1   0   0
##   4  162   0   0   0   0   0  48   0   0
##   5    0  47 487 141   0   0   0   0   0
##   6    0   0  12   0   0   0   0   0   0
##   7    0   0   0   0  19   0   0   0   0
##   8    9   0   0   0   0   0   0   0   0
##   9    0   1   1   1   0   0   0   4   0
##   10   0   2   0   0   0  26   0   0   0
##   11   0   0   0   0  14   0   0   0   0
##   12   0   0   0  59   0   0   0   0   0
##   13   0   4   4 402   0   0   0   0   0
##   14   0   0   0   0   0 122   0   0   0
##   15   0   0   3   0   1   0   0   0   0
##   16   0   0   0   0   0   0 111   0   0
##   17   0   0   0   4   0   0   0   0   0
##   18   0   0   0   0   0   0   0   0  11
# For the second batch.
table(New=clusters.mnn[rescaled$batch==2], Old=pbmc4k$cluster)
##     Old
## New    1   2   3   4   5   6   7   8   9  10  11  12
##   1    1   1   0 501   0   4   0   0   0   0   0   0
##   2    0   0   0   0 221   0 367   0   0   0   0   0
##   3    0 632   0   0   0   0   0   1   0   0   0   0
##   4    0 146   8   0   0   0   0   2   0   0   7   0
##   5  464   0   0  60   0   0   0   0  77  21   0   0
##   6   14   0   0   1   1   0   1   0   0   0   0   0
##   7    0   0  74   0   0   0   0   0   0   0   0   0
##   8    0   6   0   0   0   0   0  46   0   0   0   0
##   9    0   0   0   1   4   1  12   0   0   0   0   0
##   10   0   0   0   0   0  50   0   0   0   0   0   0
##   11   0   2  45   0   0   0   0   0   0   0   0   0
##   12   3   0   0   0   0   0   0   0  13 158   0   0
##   13  10   0   0   4   0   0   0   0 946  22   0   0
##   14   0   0   0   0   0 132   0   0   0   0   0   0
##   15   0   0   0   0   0   0   0   0   0   0   0  36
##   16   0   6   0   0   0   0   0   0   0   0  70   0
##  [ reached getOption("max.print") -- omitted 2 rows ]

We can summarize the agreement between clusterings by computing the Rand index. This provides a simple metric that we can use to assess the preservation of variation by different correction methods. Larger Rand indices are more desirable, though this must be balanced against the ability of each method to actually remove the batch effect.

library(fossil)
rand.index(as.integer(clusters.mnn[rescaled$batch==1]), as.integer(pbmc3k$cluster))
## [1] 0.9129
rand.index(as.integer(clusters.resc[rescaled$batch==1]), as.integer(pbmc3k$cluster))
## [1] 0.9102

## 13.7 Application to a pancreas dataset

We perform another demonstration with two human pancreas CEL-seq(2) datasets (Muraro et al. 2016; Grun et al. 2016). This is a more challenging application than the PBMC dataset as it involves different patients and protocols.

### loading ###
library(scRNAseq)
sce.grun <- GrunPancreasData()

### gene-annotation ###
library(org.Hs.eg.db)
gene.ids <- mapIds(org.Hs.eg.db, keys=rowData(sce.grun)$symbol, keytype="SYMBOL", column="ENSEMBL") keep <- !is.na(gene.ids) & !duplicated(gene.ids) sce.grun <- sce.grun[keep,] rownames(sce.grun) <- gene.ids[keep] ### quality-control ### library(scater) stats <- perCellQCMetrics(sce.grun) qc <- quickCellQC(stats, percent_subsets="altexps_ERCC_percent", nmads=3) sce.grun <- sce.grun[,!qc$discard]

### normalization ###
library(scran)
set.seed(1000) # for irlba.
clusters <- quickCluster(sce.grun)
sce.grun <- computeSumFactors(sce.grun, min.mean=0.1, clusters=clusters)
sce.grun <- logNormCounts(sce.grun)

### variance-modelling ###
block <- paste0(sce.grun$sample, "_", sce.grun$donor)
dec.grun <- modelGeneVarWithSpikes(sce.grun, spikes="ERCC", block=block)
sce.grun
## class: SingleCellExperiment
## dim: 17692 1290
## metadata(0):
## assays(2): counts logcounts
## rownames(17692): ENSG00000268895 ENSG00000121410 ...
##   ENSG00000074755 ENSG00000036549
## rowData names(2): symbol chr
## colnames(1290): D2ex_1 D2ex_2 ... D17TGFB_94 D17TGFB_95
## colData names(2): donor sample
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(1): ERCC
### loading ###
library(scRNAseq)
sce.muraro <- MuraroPancreasData()

### gene-annotation ###
library(AnnotationHub)
edb <- AnnotationHub()[["AH73881"]]
gene.symb <- sub("__chr.*$", "", rownames(sce.muraro)) gene.ids <- mapIds(edb, keys=gene.symb, keytype="SYMBOL", column="GENEID") # Removing duplicated genes or genes without Ensembl IDs. keep <- !is.na(gene.ids) & !duplicated(gene.ids) sce.muraro <- sce.muraro[keep,] rownames(sce.muraro) <- gene.ids[keep] ### quality-control ### library(scater) stats <- perCellQCMetrics(sce.muraro) qc <- quickCellQC(stats, nmads=3, percent_subsets="altexps_ERCC_percent") sce.muraro <- sce.muraro[,!qc$discard]

### normalization ###
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.muraro)
sce.muraro <- computeSumFactors(sce.muraro, min.mean=0.1, clusters=clusters)
sce.muraro <- logNormCounts(sce.muraro)

### variance-modelling ###
block <- paste0(sce.muraro$plate, "_", sce.muraro$donor)
dec.muraro <- modelGeneVarWithSpikes(sce.muraro, "ERCC", block=block)
sce.muraro
## class: SingleCellExperiment
## dim: 16940 2346
## metadata(0):
## assays(2): counts logcounts
## rownames(16940): ENSG00000268895 ENSG00000121410 ...
##   ENSG00000159840 ENSG00000074755
## rowData names(2): symbol chr
## colnames(2346): D28-1_1 D28-1_2 ... D30-8_93 D30-8_94
## colData names(3): label donor plate
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(1): ERCC

We subset both batches to their common universe of genes; adjust their scaling to equalize sequencing coverage (not really necessary in this case, as the coverage is already similar, but we will do so anyway for consistency); and select those genes with positive average biological components for further use.

universe <- intersect(rownames(sce.grun), rownames(sce.muraro))
universe <- universe[!grepl("^ERCC", universe)]

normed.pancreas <- multiBatchNorm(sce.grun[universe,], sce.muraro[universe,])
sce.grun <- normed.pancreas[[1]]
sce.muraro <- normed.pancreas[[2]]

combined.pan <- combineVar(dec.grun[universe,], dec.muraro[universe,])
chosen.genes <- universe[combined.panbio > 0] We observe that rescaleBatches() is unable to align cells from different batches in Figure 13.4. This is attributable to differences in population composition between batches, with additional complications from non-linearities in the batch effect, e.g., when the magnitude or direction of the batch effect differs between cell types. rescaled.pancreas <- rescaleBatches(sce.grun, sce.muraro) rescaled.pancreas <- runPCA(rescaled.pancreas, subset_row=chosen.genes, BSPARAM=BiocSingular::IrlbaParam(), exprs_values="corrected") rescaled.pancreas <- runTSNE(rescaled.pancreas, dimred="PCA") plotTSNE(rescaled.pancreas, colour_by="batch") Here, we use fastMNN() to merge together the two human pancreas datasets described earlier. Clustering on the merged datasets yields fewer batch-specific clusters, which is recapitulated as greater intermingling between batches in Figure 13.5. This improvement over Figure 13.4 represents the ability of fastMNN() to adapt to more complex situations involving differences in population composition between batches. mnn.pancreas <- fastMNN(sce.grun, sce.muraro, subset.row=chosen.genes) snn.gr <- buildSNNGraph(mnn.pancreas, use.dimred="corrected") clusters <- igraph::cluster_walktrap(snn.gr)membership
tab <- table(Cluster=clusters, Batch=mnn.pancreas\$batch)
tab
##        Batch
## Cluster   1   2
##      1  320 279
##      2  343 265
##      3  209 847
##      4   61 197
##      5  161 399
##      6   42   1
##      7   25 108
##      8   22 127
##      9   59  80
##      10  33   0
##      11   0  18
##      12   8   4
##      13   7  21
mnn.pancreas <- runTSNE(mnn.pancreas, dimred="corrected")
plotTSNE(mnn.pancreas, colour_by="batch")

## 13.8 Using the corrected values

The greatest value of batch correction lies in facilitating cell-based analysis of population heterogeneity in a consistent manner across batches. Cluster 1 in batch A is the same as cluster 1 in batch B when the clustering is performed on the merged data. There is no need to identify mappings between separate clusterings, which might not even be possible when the clusters are not well-separated. The burden of interpretation is consolidated by generating a single set of clusters for all batches, rather than requiring separate examination of each batch’s clusters. Another benefit is that the available number of cells is increased when all batches are combined, which allows for greater resolution of population structure in downstream analyses18. We previously demonstrated the application of clustering methods to the batch-corrected data, but the same principles apply for other analyses like trajectory reconstruction.

At this point, it is also tempting to use the batch-corrected values for gene-based analyses like DE-based marker gene detection. This is not generally recommended as an arbitrary correction algorithm is not obliged to preserve the magnitude (or even direction) of differences in per-gene expression when attempting to align multiple batches. For example, cosine normalization in fastMNN() shrinks the magnitude of the expression values so that the computed log-fold changes have no obvious interpretation. Of greater concern is the possibility that the correction introduces artificial agreement across batches. To illustrate:

1. Consider a dataset (first batch) with two cell types, $$A$$ and $$B$$. Consider a second batch with the same cell types, denoted as $$A'$$ and $$B'$$. Assume that, for some reason, gene $$X$$ is expressed in $$A$$ but not in $$A'$$, $$B$$ or $$B'$$ - possibly due to some difference in how the cells were treated, or maybe due to a donor effect.
2. We then merge the batches together based on the shared cell types. This yields a result where $$A$$ and $$A'$$ cells are intermingled and the difference due to $$X$$ is eliminated. One can debate whether this should be the case, but in general, it is necessary for batch correction methods to smooth over small biological differences (as discussed in Section 13.3).
3. Now, if we corrected the second batch to the first, we must have coerced the expression values of $$X$$ in $$A'$$ to non-zero values to align with those of $$A$$, while leaving the expression of $$X$$ in $$B'$$ and $$B$$ at zero. Thus, we have artificially introduced DE between $$A'$$ and $$B'$$ for $$X$$ in the second batch to align with the DE between $$A$$ and $$B$$ in the first batch. (The converse is also possible where DE in the first batch is artificially removed to align with the second batch, depending on the order of merges.)
4. The artificial DE has implications for the identification of the cell types and interpretation of the results. We would be misled into believing that both $$A$$ and $$A'$$ are $$X$$-positive, when in fact this is only true for $$A$$. At best, this is only a minor error - after all, we do actually have $$X$$-positive cells of that overall type, we simply do not see that $$A'$$ is $$X$$-negative. At worst, this can compromise the conclusions, e.g., if the first batch was drug treated and the second batch was a control, we might mistakenly think that a $$X$$-positive population exists in the latter and conclude that our drug has no effect.

Rather, it is preferable to perform DE analyses using the uncorrected expression values with blocking on the batch, as discussed in Section 11.4. This strategy is based on the expectation that any genuine DE between clusters should still be present in a within-batch comparison where batch effects are absent. It penalizes genes that exhibit inconsistent DE across batches, thus protecting against misleading conclusions when a population in one batch is aligned to a similar-but-not-identical population in another batch.

## Session Info

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS:   /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRblas.so
LAPACK: /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] fossil_0.3.7                shapefiles_0.7
[3] foreign_0.8-72              maps_3.3.0
[5] sp_1.3-1                    scater_1.13.18
[7] ggplot2_3.2.1               scran_1.13.18
[9] batchelor_1.1.11            SingleCellExperiment_1.7.8
[11] SummarizedExperiment_1.15.9 Biobase_2.45.1
[13] GenomicRanges_1.37.15       GenomeInfoDb_1.21.1
[15] HDF5Array_1.13.8            rhdf5_2.29.3
[17] DelayedArray_0.11.4         BiocParallel_1.19.2
[19] IRanges_2.19.14             S4Vectors_0.23.21
[21] BiocGenerics_0.31.5         matrixStats_0.55.0
[23] Cairo_1.5-10                BiocStyle_2.13.2
[25] OSCAUtils_0.0.1

loaded via a namespace (and not attached):
[1] bitops_1.0-6             tools_3.6.1
[3] R6_2.4.0                 irlba_2.3.3
[5] vipor_0.4.5              lazyeval_0.2.2
[7] colorspace_1.4-1         withr_2.1.2
[9] tidyselect_0.2.5         gridExtra_2.3
[11] compiler_3.6.1           BiocNeighbors_1.3.3
[13] labeling_0.3             bookdown_0.13
[15] scales_1.0.0             stringr_1.4.0
[17] digest_0.6.20            rmarkdown_1.15
[19] XVector_0.25.0           pkgconfig_2.0.2
[21] htmltools_0.3.6          limma_3.41.16
[23] highr_0.8                rlang_0.4.0
[25] DelayedMatrixStats_1.7.2 dplyr_0.8.3
[27] RCurl_1.95-4.12          magrittr_1.5
[29] BiocSingular_1.1.5       GenomeInfoDbData_1.2.1
[31] Matrix_1.2-17            Rcpp_1.0.2
[33] ggbeeswarm_0.6.0         munsell_0.5.0
[35] Rhdf5lib_1.7.5           viridis_0.5.1
[37] stringi_1.4.3            yaml_2.2.0
[39] edgeR_3.27.13            zlibbioc_1.31.0
[41] Rtsne_0.15               grid_3.6.1
[43] dqrng_0.2.1              crayon_1.3.4
[45] lattice_0.20-38          cowplot_1.0.0
[47] beachmat_2.1.2           locfit_1.5-9.1
[49] knitr_1.24               pillar_1.4.2
[51] igraph_1.2.4.1           glue_1.3.1
[53] evaluate_0.14            BiocManager_1.30.4
[55] gtable_0.3.0             purrr_0.3.2
[57] assertthat_0.2.1         xfun_0.9
[59] rsvd_1.0.2               viridisLite_0.3.0
[61] tibble_2.1.3             beeswarm_0.2.3
[63] statmod_1.4.32          

### Bibliography

Butler, A., P. Hoffman, P. Smibert, E. Papalexi, and R. Satija. 2018. “Integrating single-cell transcriptomic data across different conditions, technologies, and species.” Nat. Biotechnol. 36 (5):411–20.

Grun, D., M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, et al. 2016. “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell 19 (2):266–77.

Haghverdi, L., A. T. L. Lun, M. D. Morgan, and J. C. Marioni. 2018. “Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.” Nat. Biotechnol. 36 (5):421–27.

Leek, J. T., W. E. Johnson, H. S. Parker, A. E. Jaffe, and J. D. Storey. 2012. “The sva package for removing batch effects and other unwanted variation in high-throughput experiments.” Bioinformatics 28 (6):882–83.

Lin, Y., S. Ghazanfar, K. Y. X. Wang, J. A. Gagnon-Bartsch, K. K. Lo, X. Su, Z. G. Han, et al. 2019. “scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets.” Proc. Natl. Acad. Sci. U.S.A. 116 (20):9775–84.

Muraro, M. J., G. Dharmadhikari, D. Grun, N. Groen, T. Dielen, E. Jansen, L. van Gurp, et al. 2016. “A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Syst 3 (4):385–94.

Ritchie, M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. 2015. “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Res. 43 (7):e47.

1. This step can be much, much, much more painful. As is often said, biologists would rather share a toothbrush than nomenclature.

2. If I already knew the type and state of each cell, why would I waste money sequencing them?

3. And a nice $$t$$-SNE plot for Figure 1. Hey, those atlas papers almost write themselves!