Chapter 19 Dealing with big data

19.1 Motivation

Advances in scRNA-seq technologies have increased the number of cells that can be assayed in routine experiments. Public databases such as GEO are continually expanding with more scRNA-seq studies, while large-scale projects such as the Human Cell Atlas are expected to generate data for billions of cells. For effective data analysis, the computational methods need to scale with the increasing size of scRNA-seq data sets. This section discusses how we can use various aspects of the Bioconductor ecosystem to tune our analysis pipelines for greater speed and efficiency.

19.2 Fast approximations

19.2.1 Nearest neighbor searching

Identification of neighbouring cells in PC or expression space is a common procedure that is used in many functions, e.g., buildSNNGraph(), doubletCells(). The default is to favour accuracy over speed by using an exact nearest neighbour (NN) search, implemented with the \(k\)-means for \(k\)-nearest neighbours algorithm (Wang 2012). However, for large data sets, it may be preferable to use a faster approximate approach. The BiocNeighbors framework makes it easy to switch between search options by simply changing the BNPARAM= argument in compatible functions. To demonstrate, we will use the 10X PBMC data:

### setup ###
library(OSCAUtils)
chapterPreamble(use_cache = TRUE)

### loading ###
library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
raw.path <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples",
    "cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices.tar.gz"))
untar(raw.path, exdir=file.path(tempdir(), "pbmc4k"))

library(DropletUtils)
fname <- file.path(tempdir(), "pbmc4k/raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)

### gene-annotation ###
library(scater)
rownames(sce.pbmc) <- uniquifyFeatureNames(
    rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)

library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, 
    column="SEQNAME", keytype="GENEID")

### cell-detection ###
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]

### quality-control ###
stats <- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent, nmads=3, type="higher")
sce.pbmc <- sce.pbmc[,!high.mito]

### normalization ###
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)
sce.pbmc <- logNormCounts(sce.pbmc)

### variance-modelling ###
set.seed(1001)
dec.pbmc <- modelGeneVarByPoisson(sce.pbmc)

### dimensionality-reduction ###
set.seed(10000)
sce.pbmc <- denoisePCA(sce.pbmc, technical=dec.pbmc)

set.seed(100000)
sce.pbmc <- runTSNE(sce.pbmc, use_dimred="PCA")

set.seed(1000000)
sce.pbmc <- runUMAP(sce.pbmc, use_dimred="PCA")

### clustering ###
g <- buildSNNGraph(sce.pbmc, k=10, use.dimred = 'PCA')
clust <- igraph::cluster_walktrap(g)$membership
sce.pbmc$cluster <- factor(clust)
## class: SingleCellExperiment 
## dim: 33694 3922 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(2): ID Symbol
## colnames(3922): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCACAGGTCCAC-1 TTTGTCATCCCAAGAT-1
## colData names(3): Sample Barcode cluster
## reducedDimNames(3): PCA TSNE UMAP
## spikeNames(0):
## altExpNames(0):

We had previously clustered on a shared nearest neighbor graph generated with an exact neighbour search (Section 10.3). We repeat this below using an approximate search, implemented using the Annoy algorithm. This involves constructing a AnnoyParam object to specify the search algorithm and then passing it to the buildSNNGraph() function. The results from the exact and approximate searches are consistent with most clusters from the former re-appearing in the latter. This suggests that the inaccuracy from the approximation can be largely ignored.

##      Approx
## Exact   1   2   3   4   5   6   7   8   9  10  11  12  13
##    1  783   0   0   1   0   0   0   0   1   0   0   0   0
##    2    0   0 198   0   0   0   0   0   0   0   0   0   0
##    3    5  49   0   0   0   0   0   0   0   0   2   0   0
##    4    0   0   0 512  29   0   0   0   0   0   0   0   0
##    5    0   0   0   5 511   0   0   1   0  12   0   0   0
##    6    0   0   0   0   0 516   0   0   0   0   0   0   0
##    7    0   0   0   0   0   0 128   0   0   0   0   0   0
##    8    0   0   0   0   2   0   0 822   0   0   0   0   0
##    9    0   0   0   0   0   0   0   0  45   0   0   0   0
##    10   0   0   0   0   2   0   0   0   0 149   0   0   0
##    11   0   0   0   0   0   0   0   0   0   0  92   0   0
##    12   0   0   0   0   0   0   0   0   0   0   0  21   0
##    13   0   0   0   0   0   0   0   0   0   0   0   0  36

Note that Annoy writes the NN index to disk prior to performing the search. Thus, it may not actually be faster than the default exact algorithm for small datasets, depending on whether the overhead of disk write is offset by the computational complexity of the search. It is also not difficult to find situations where the approximation deteriorates, especially at high dimensions, though this may not have an appreciable impact on the biological conclusions.

## [1] 0.5619

19.2.2 Singular value decomposition

The singular value decomposition (SVD) underlies the PCA used throughout our analyses, e.g., in denoisePCA(), fastMNN(), doubletCells(). (Briefly, the right singular vectors are the eigenvectors of the gene-gene covariance matrix, where each eigenvector represents the axis of maximum remaining variation in the PCA.) The default base::svd() function performs an exact SVD that is not performant for large datasets. Instead, we use fast approximate methods from the irlba and rsvd packages, conveniently wrapped into the BiocSingular package for ease of use and package development. Specifically, we can change the SVD algorithm used in any of these functions by simply specifying an alternative value for the BSPARAM= argument.

##  num [1:3922, 1:20] 15.29 13.36 -8.84 -7.83 6.36 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3922] "AAACCTGAGAAGGCCT-1" "AAACCTGAGACAGACC-1" "AAACCTGAGGCATGGT-1" "AAACCTGCAAGGTTCT-1" ...
##   ..$ : chr [1:20] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "percentVar")= num [1:20] 20.27 10.04 5.34 2.19 1.41 ...
##  num [1:3922, 1:20] 15.29 13.36 -8.84 -7.83 6.36 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3922] "AAACCTGAGAAGGCCT-1" "AAACCTGAGACAGACC-1" "AAACCTGAGGCATGGT-1" "AAACCTGCAAGGTTCT-1" ...
##   ..$ : chr [1:20] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "percentVar")= num [1:20] 20.27 10.04 5.34 2.19 1.41 ...

Both IRLBA and randomized SVD (RSVD) are much faster than the exact SVD with negligible loss of accuracy. This motivates their default use in many scran and scater functions, at the cost of requiring users to set the seed to guarantee reproducibility. IRLBA can occasionally fail to converge and require more iterations (passed via maxit= in IrlbaParam()), while RSVD involves an explicit trade-off between accuracy and speed based on its oversampling parameter (p=) and number of power iterations (q=). We tend to prefer IRLBA as its default behavior is more accurate, though RSVD is much faster for file-backed matrices (Section 26.2.6).

19.3 Parallelization

Parallelization of calculations across genes or cells is an obvious strategy for speeding up scRNA-seq analysis workflows. The BiocParallel package provides a common interface for parallel computing throughout the Bioconductor ecosystem, manifesting as a BPPARAM= argument in compatible functions. We can pick from a diverse range of parallelization backends depending on the available hardware and operating system. For example, we might use forking across 2 cores to parallelize the variance calculations on a Unix system:

## DataFrame with 33694 rows and 6 columns
##                              mean                total
##                         <numeric>            <numeric>
## RP11-34P13.3                    0                    0
## FAM138A                         0                    0
## OR4F5                           0                    0
## RP11-34P13.7  0.00223806192235176  0.00234711395627082
## RP11-34P13.8 0.000562049544395219 0.000628030706791975
## ...                           ...                  ...
## AC233755.2                      0                    0
## AC233755.1                      0                    0
## AC240274.1     0.0101803017459632   0.0120232469216998
## AC213203.1                      0                    0
## FAM231B                         0                    0
##                              tech                  bio           p.value
##                         <numeric>            <numeric>         <numeric>
## RP11-34P13.3                    0                    0               NaN
## FAM138A                         0                    0               NaN
## OR4F5                           0                    0               NaN
## RP11-34P13.7  0.00230122682234757  4.5887133923255e-05 0.446738657416478
## RP11-34P13.8 0.000577912671042726 5.01180357492494e-05 0.280158884966049
## ...                           ...                  ...               ...
## AC233755.2                      0                    0               NaN
## AC233755.1                      0                    0               NaN
## AC240274.1     0.0104674572614225  0.00155578966027733 0.159114051883997
## AC213203.1                      0                    0               NaN
## FAM231B                         0                    0               NaN
##                            FDR
##                      <numeric>
## RP11-34P13.3               NaN
## FAM138A                    NaN
## OR4F5                      NaN
## RP11-34P13.7 0.747547252788604
## RP11-34P13.8 0.747547252788604
## ...                        ...
## AC233755.2                 NaN
## AC233755.1                 NaN
## AC240274.1   0.747547252788604
## AC213203.1                 NaN
## FAM231B                    NaN

Another approach would be to distribute jobs across a network of computers, which yields the same result:

## DataFrame with 33694 rows and 6 columns
##                              mean                total
##                         <numeric>            <numeric>
## RP11-34P13.3                    0                    0
## FAM138A                         0                    0
## OR4F5                           0                    0
## RP11-34P13.7  0.00223806192235176  0.00234711395627082
## RP11-34P13.8 0.000562049544395219 0.000628030706791975
## ...                           ...                  ...
## AC233755.2                      0                    0
## AC233755.1                      0                    0
## AC240274.1     0.0101803017459632   0.0120232469216998
## AC213203.1                      0                    0
## FAM231B                         0                    0
##                              tech                  bio           p.value
##                         <numeric>            <numeric>         <numeric>
## RP11-34P13.3                    0                    0               NaN
## FAM138A                         0                    0               NaN
## OR4F5                           0                    0               NaN
## RP11-34P13.7  0.00230122682234757  4.5887133923255e-05 0.446738657416478
## RP11-34P13.8 0.000577912671042726 5.01180357492494e-05 0.280158884966049
## ...                           ...                  ...               ...
## AC233755.2                      0                    0               NaN
## AC233755.1                      0                    0               NaN
## AC240274.1     0.0104674572614225  0.00155578966027733 0.159114051883997
## AC213203.1                      0                    0               NaN
## FAM231B                         0                    0               NaN
##                            FDR
##                      <numeric>
## RP11-34P13.3               NaN
## FAM138A                    NaN
## OR4F5                      NaN
## RP11-34P13.7 0.747547252788604
## RP11-34P13.8 0.747547252788604
## ...                        ...
## AC233755.2                 NaN
## AC233755.1                 NaN
## AC240274.1   0.747547252788604
## AC213203.1                 NaN
## FAM231B                    NaN

For high-performance computing (HPC) systems with a cluster of compute nodes, we can distribute jobs via the job scheduler using the BatchtoolsParam class. The example below assumes a SLURM cluster, though the settings can be easily configured for a particular system (see here for details).

Parallelization is best suited for CPU-intensive calculations where the division of labor results in a concomitant reduction in compute time. It is not suited for tasks that are bounded by other compute resources, e.g., memory or file I/O (though the latter is less of an issue on HPC systems with parallel read/write). In particular, R itself is inherently single-core, so many of the parallelization backends involve (i) setting up one or more separate R sessions, (ii) loading the relevant packages and (iii) transmitting the data to that session. Depending on the nature and size of the task, this overhead may outweigh any benefit from parallel computing.

19.4 Out of memory representations

The count matrix is the central structure around which our analyses are based. In most of the previous chapters, this has been held fully in memory as a dense matrix or as a sparse dgCMatrix. Howevever, in-memory representations may not be feasible for very large data sets, especially on machines with limited memory. For example, the 1.3 million brain cell data set from 10X Genomics (Zheng et al. 2017) would require over 100 GB of RAM to hold as a matrix and around 30 GB as a dgCMatrix. This makes it challenging to explore the data on anything less than a HPC system.

The obvious solution is to use a file-backed matrix representation where the data are held on disk and subsets are retrieved into memory as requested. While a number of implementations of file-backed matrices are available (e.g., bigmemory, matter), we will be using the implementation from the HDF5Array package. This uses the popular HDF5 format as the underlying data store, which provides a measure of standardization and portability across systems. We demonstrate with a subset of 20,000 cells from the 1.3 million brain cell data set, as provided by the TENxBrainData package.

## class: SingleCellExperiment 
## dim: 27998 20000 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(2): Ensembl Symbol
## colnames: NULL
## colData names(4): Barcode Sequence Library Mouse
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):

Examination of the SingleCellExperiment object indicates that the count matrix is a HDF5Matrix. From a comparison of the memory usage, it is clear that this matrix object is simply a stub that points to the much larger HDF5 file that actually contains the data. This avoids the need for large RAM availability during analyses.

## <27998 x 20000> matrix of class HDF5Matrix and type "integer":
##              [,1]     [,2]     [,3]     [,4] ... [,19997] [,19998]
##     [1,]        0        0        0        0   .        0        0
##     [2,]        0        0        0        0   .        0        0
##     [3,]        0        0        0        0   .        0        0
##     [4,]        0        0        0        0   .        0        0
##     [5,]        0        0        0        0   .        0        0
##      ...        .        .        .        .   .        .        .
## [27994,]        0        0        0        0   .        0        0
## [27995,]        0        0        0        1   .        0        2
## [27996,]        0        0        0        0   .        0        1
## [27997,]        0        0        0        0   .        0        0
## [27998,]        0        0        0        0   .        0        0
##          [,19999] [,20000]
##     [1,]        0        0
##     [2,]        0        0
##     [3,]        0        0
##     [4,]        0        0
##     [5,]        0        0
##      ...        .        .
## [27994,]        0        0
## [27995,]        0        0
## [27996,]        0        0
## [27997,]        0        0
## [27998,]        0        0
## 2160 bytes
## [1] 76264332

Manipulation of the count matrix will generally result in the creation of a DelayedArray object from the DelayedArray package. This remembers the operations to be applied to the counts and stores them in the object, to be executed when the modified matrix values are realized for use in calculations. The use of delayed operations avoids the need to write the modified values to a new file at every operation, which would unnecessarily require time-consuming disk I/O.

## <27998 x 20000> matrix of class DelayedMatrix and type "double":
##              [,1]     [,2]     [,3] ... [,19999] [,20000]
##     [1,]        0        0        0   .        0        0
##     [2,]        0        0        0   .        0        0
##     [3,]        0        0        0   .        0        0
##     [4,]        0        0        0   .        0        0
##     [5,]        0        0        0   .        0        0
##      ...        .        .        .   .        .        .
## [27994,]        0        0        0   .        0        0
## [27995,]        0        0        0   .        0        0
## [27996,]        0        0        0   .        0        0
## [27997,]        0        0        0   .        0        0
## [27998,]        0        0        0   .        0        0

Many functions described in the previous workflows are capable of accepting HDF5Matrix objects. This is powered by the availability of common methods for all matrix representations (e.g., subsetting, combining, methods from DelayedMatrixStats) as well as representation-agnostic C++ code using beachmat (A. T. L. Lun, Pages, and Smith 2018). For example, we compute QC metrics below with the same calculateQCMetrics() function that we used in the other workflows.

## DataFrame with 20000 rows and 10 columns
##             sum  detected   percent_top_50  percent_top_100
##       <integer> <integer>        <numeric>        <numeric>
## 1          3060      1546 24.1830065359477 34.5751633986928
## 2          3500      1694 22.7428571428571 33.2285714285714
## 3          3092      1613 22.3479948253558 33.8292367399741
## 4          4420      2050 24.7511312217195 33.7782805429864
## 5          3771      1813 23.0442853354548 33.1742243436754
## ...         ...       ...              ...              ...
## 19996      4431      2050  23.019634394042 32.7916948770029
## 19997      6988      2704 18.6605609616485 28.5632512879222
## 19998      8749      2988 23.9113041490456 33.6267001943079
## 19999      3842      1711 24.7267048412285 36.8037480478917
## 20000      1775       945 29.8591549295775 40.9014084507042
##        percent_top_200  percent_top_500 subsets_Mt_sum subsets_Mt_detected
##              <numeric>        <numeric>      <integer>           <integer>
## 1     46.5032679738562 65.8169934640523            123                  10
## 2     45.6571428571429 64.8285714285714            118                  11
## 3     45.7309184993532 64.0038809831824             58                   9
## 4     44.8190045248869 61.4705882352941            131                  10
## 5     45.0543622381331 63.1397507292495            100                   8
## ...                ...              ...            ...                 ...
## 19996 44.5046264951478 60.9794628751975            127                   9
## 19997 40.5838580423583 58.2856325128792             60                   9
## 19998 44.4965138873014 60.9783975311464            305                  11
## 19999  48.750650702759 66.6840187402395            129                   8
## 20000 54.0845070422535 74.9295774647887             26                   6
##       subsets_Mt_percent     total
##                <numeric> <integer>
## 1       4.01960784313725      3060
## 2       3.37142857142857      3500
## 3         1.875808538163      3092
## 4       2.96380090497738      4420
## 5       2.65181649429859      3771
## ...                  ...       ...
## 19996   2.86617016474836      4431
## 19997  0.858614768174013      6988
## 19998   3.48611269859413      8749
## 19999   3.35762623633524      3842
## 20000   1.46478873239437      1775

Needless to say, data access from file-backed representations is slower than that from in-memory representations. The time spent retrieving data from disk is an unavoidable cost of reducing memory usage. Whether this is tolerable depends on the application. One example usage pattern involves performing the heavy computing quickly with in-memory representations on HPC systems with plentiful memory, and then distributing file-backed counterparts to individual users for exploration and visualization on their personal machines.

Session Info

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS:   /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRblas.so
LAPACK: /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] TENxBrainData_1.5.0         HDF5Array_1.13.9           
 [3] rhdf5_2.29.3                BiocSingular_1.1.7         
 [5] scater_1.13.25              ggplot2_3.2.1              
 [7] BiocNeighbors_1.3.5         scran_1.13.26              
 [9] SingleCellExperiment_1.7.11 SummarizedExperiment_1.15.9
[11] DelayedArray_0.11.8         BiocParallel_1.19.3        
[13] matrixStats_0.55.0          Biobase_2.45.1             
[15] GenomicRanges_1.37.16       GenomeInfoDb_1.21.2        
[17] IRanges_2.19.16             S4Vectors_0.23.25          
[19] BiocGenerics_0.31.6         Cairo_1.5-10               
[21] BiocStyle_2.13.2            OSCAUtils_0.0.1            

loaded via a namespace (and not attached):
 [1] bitops_1.0-6                  bit64_0.9-7                  
 [3] httr_1.4.1                    tools_3.6.1                  
 [5] backports_1.1.5               R6_2.4.0                     
 [7] irlba_2.3.3                   vipor_0.4.5                  
 [9] DBI_1.0.0                     lazyeval_0.2.2               
[11] colorspace_1.4-1              withr_2.1.2                  
[13] tidyselect_0.2.5              gridExtra_2.3                
[15] curl_4.2                      bit_1.1-14                   
[17] compiler_3.6.1                bookdown_0.14                
[19] scales_1.0.0                  rappdirs_0.3.1               
[21] stringr_1.4.0                 digest_0.6.21                
[23] rmarkdown_1.16                XVector_0.25.0               
[25] pkgconfig_2.0.3               htmltools_0.4.0              
[27] fastmap_1.0.1                 dbplyr_1.4.2                 
[29] limma_3.41.17                 rlang_0.4.0                  
[31] RSQLite_2.1.2                 shiny_1.4.0                  
[33] DelayedMatrixStats_1.7.2      dplyr_0.8.3                  
[35] RCurl_1.95-4.12               magrittr_1.5                 
[37] GenomeInfoDbData_1.2.1        Matrix_1.2-17                
[39] Rcpp_1.0.2                    ggbeeswarm_0.6.0             
[41] munsell_0.5.0                 Rhdf5lib_1.7.5               
[43] viridis_0.5.1                 stringi_1.4.3                
[45] yaml_2.2.0                    edgeR_3.27.13                
[47] zlibbioc_1.31.0               BiocFileCache_1.9.1          
[49] AnnotationHub_2.17.10         grid_3.6.1                   
[51] blob_1.2.0                    promises_1.1.0               
[53] dqrng_0.2.1                   ExperimentHub_1.11.6         
[55] crayon_1.3.4                  lattice_0.20-38              
[57] beachmat_2.1.2                locfit_1.5-9.1               
[59] zeallot_0.1.0                 knitr_1.25                   
[61] pillar_1.4.2                  igraph_1.2.4.1               
[63] glue_1.3.1                    evaluate_0.14                
[65] BiocManager_1.30.7            httpuv_1.5.2                 
[67] vctrs_0.2.0                   gtable_0.3.0                 
[69] purrr_0.3.2                   assertthat_0.2.1             
[71] xfun_0.10                     mime_0.7                     
[73] rsvd_1.0.2                    xtable_1.8-4                 
[75] later_1.0.0                   viridisLite_0.3.0            
[77] tibble_2.1.3                  snow_0.4-3                   
[79] AnnotationDbi_1.47.1          beeswarm_0.2.3               
[81] memoise_1.1.0                 statmod_1.4.32               
[83] interactiveDisplayBase_1.23.0

Bibliography

Lun, A. T. L., H. Pages, and M. L. Smith. 2018. “beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types.” PLoS Comput. Biol. 14 (5):e1006135.

Wang, X. 2012. “A Fast Exact K-Nearest Neighbors Algorithm for High Dimensional Search Using K-Means Clustering and Triangle Inequality.” Proc Int Jt Conf Neural Netw 43 (6):2351–8.

Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January):14049.