Chapter 4 Data Infrastructure

One of the advantages of using Bioconductor packages is that they share common data infrastructure, which makes analyses interoperable across packages. Furthermore, much engineering effort is put into making this infrastructure robust and scalable. Here, we describe the SingleCellExperiment object (or sce in shorthand) in detail: how it is constructed, how it is used in downstream analysis, and how it stores various types of primary data and metadata.

4.1 Prerequisites

The Bioconductor package SingleCellExperiment provides the SingleCellExperiment class. While the package is implicitly installed and loaded when using any package that depends on the SingleCellExperiment class, it can be explicitly installed (and loaded) as follows:
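
A minimal sketch of the installation step, assuming BiocManager is used as the installer (as the next paragraph suggests). The requireNamespace() guards are our addition so that the commands can be re-run without reinstalling anything:

```r
# Install BiocManager from CRAN if it is missing, then use it to
# install SingleCellExperiment from Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("SingleCellExperiment", quietly = TRUE)) {
  BiocManager::install("SingleCellExperiment")
}
```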

Additionally, we use some functions from the scater and scran packages, as well as the CRAN package uwot (which conveniently can also be installed through BiocManager). These functions will be accessed through the <package>::<function> convention as needed.

For this session, all we will need loaded is the SingleCellExperiment package:
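
Loading the package is a single call; attaching it also makes the accessors used throughout this chapter (assay(), colData(), and so on) available:

```r
# Attach SingleCellExperiment (and the packages it depends on).
library(SingleCellExperiment)
```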

4.2 The SingleCellExperiment Class

Overview of the SingleCellExperiment class object

4.2.1 Primary Data: The assays Slot

The SingleCellExperiment (sce) object is the basis of single-cell analytical applications in Bioconductor. The sce object is an S4 object, which in essence provides a more formalized approach to constructing and accessing data than other methods available in R. The utility of S4 comes from validity checks that ensure safe data manipulation and, most importantly for our discussion, from its extensibility through slots.

If we imagine the sce object to be a ship, the slots of sce can be thought of as individual cargo boxes - each exists as a separate entity within the sce object. Furthermore, each slot contains data that arrives in its own format. To extend the metaphor, we can imagine that different variations of cargo boxes are required for fruits versus bricks. In the case of sce, certain slots expect numeric matrices, whereas others may expect data frames.

To construct a rudimentary sce object, all we need is a single slot:

  • assays slot: contains primary data such as counts in a list, where each entry of the list is a matrix in which rows correspond to features (genes) and columns correspond to samples (cells) (Figure 1A, blue box)

Let’s start simple by generating three cells’ worth of count data across ten genes.

From this, we can now construct our first SingleCellExperiment object, using the defined constructor, SingleCellExperiment(). Note that we provide our data as a named list, and each entry of the list is a matrix. Here, we name the counts_matrix entry as simply counts within the list.
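
The two steps above can be sketched as follows. The values in counts_matrix are typed in from the count table printed later in this section, so that the sketch matches the output shown in this chapter; how they were originally generated is not assumed here.

```r
library(SingleCellExperiment)

# Three cells (columns) by ten genes (rows); values copied from the
# count table printed later in this section.
counts_matrix <- cbind(
  cell_1 = c(5, 7, 5, 10, 8, 5, 10, 11, 12, 8),
  cell_2 = c(7, 9, 8, 11, 12, 8, 9, 10, 12, 12),
  cell_3 = c(26, 34, 25, 30, 23, 32, 29, 37, 27, 28)
)
rownames(counts_matrix) <- paste0("gene_", 1:10)

# The assays argument takes a named list of matrices; here the single
# entry is named "counts".
sce <- SingleCellExperiment(assays = list(counts = counts_matrix))
sce
```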

To inspect the object, we can type sce into the console, which displays an overview of the various slots available to us (which may or may not hold any data).

## class: SingleCellExperiment 
## dim: 10 3 
## metadata(0):
## assays(1): counts
## rownames(10): gene_1 gene_2 ... gene_9 gene_10
## rowData names(0):
## colnames(3): cell_1 cell_2 cell_3
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):

To access the count data we just supplied, we can do any one of the following:

  • assay(sce, "counts") - this is the most general method, where we can supply the name of the assay as the second argument.
  • counts(sce) - this is the same as the above, but only works for assays with the special name "counts".
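
As a short sketch, both accessors can be called on a freshly built object and compared. The object is rebuilt here so the example runs on its own, using the same count values as the table printed in this section:

```r
library(SingleCellExperiment)

counts_matrix <- cbind(
  cell_1 = c(5, 7, 5, 10, 8, 5, 10, 11, 12, 8),
  cell_2 = c(7, 9, 8, 11, 12, 8, 9, 10, 12, 12),
  cell_3 = c(26, 34, 25, 30, 23, 32, 29, 37, 27, 28)
)
rownames(counts_matrix) <- paste0("gene_", 1:10)
sce <- SingleCellExperiment(assays = list(counts = counts_matrix))

assay(sce, "counts")   # general accessor: assay name as second argument
counts(sce)            # shortcut for the assay named "counts"

# Both routes return the same matrix.
same_matrix <- identical(assay(sce, "counts"), counts(sce))
same_matrix
```
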
##         cell_1 cell_2 cell_3
## gene_1       5      7     26
## gene_2       7      9     34
## gene_3       5      8     25
## gene_4      10     11     30
## gene_5       8     12     23
## gene_6       5      8     32
## gene_7      10      9     29
## gene_8      11     10     37
## gene_9      12     12     27
## gene_10      8     12     28

4.2.2 Extending the assays Slot

What makes the assays slot especially powerful is that it can hold multiple representations of the primary data. This is especially useful for storing a normalized version of the data. We can do just that as shown below, using the scran and scater packages to compute a log-normalized representation of the initial primary data.
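
The package calls themselves are not reproduced in this sketch. Instead, as an illustration of what ends up in the object, we compute the same quantities by hand, assuming library-size scaling (each cell's total count divided by the mean total) and a log2 transform with a pseudocount of 1. With the counts printed in section 4.2.1, these assumptions reproduce the size factors and logcounts values shown in this chapter, but the function calls in scran and scater may do more than this on larger datasets.

```r
library(SingleCellExperiment)

counts_matrix <- cbind(
  cell_1 = c(5, 7, 5, 10, 8, 5, 10, 11, 12, 8),
  cell_2 = c(7, 9, 8, 11, 12, 8, 9, 10, 12, 12),
  cell_3 = c(26, 34, 25, 30, 23, 32, 29, 37, 27, 28)
)
rownames(counts_matrix) <- paste0("gene_", 1:10)
sce <- SingleCellExperiment(assays = list(counts = counts_matrix))

# Size factors: per-cell totals scaled to have a mean of 1
# (an assumption of this sketch, not the packages' general method).
lib_sizes <- colSums(counts(sce))
sizeFactors(sce) <- lib_sizes / mean(lib_sizes)

# Divide each column (cell) by its size factor, then log2-transform
# with a pseudocount of 1 and store the result as "logcounts".
logcounts(sce) <- log2(t(t(counts(sce)) / sizeFactors(sce)) + 1)

round(sizeFactors(sce), 3)
round(logcounts(sce), 2)
```

Because the result is written back with logcounts(sce) <- ..., the object now carries both the raw and the normalized representation.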

Note that here, we overwrite our previous sce upon reassigning the results to sce - this is because these functions return a SingleCellExperiment object. Some functions - especially those outside of single-cell oriented Bioconductor packages - do not, in which case you will need to append your results to the sce object (see below).

Viewing the object again, we see that these functions added some new entries:

## class: SingleCellExperiment 
## dim: 10 3 
## metadata(1): log.exprs.offset
## assays(2): counts logcounts
## rownames(10): gene_1 gene_2 ... gene_9 gene_10
## rowData names(0):
## colnames(3): cell_1 cell_2 cell_3
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):

Specifically, we see that the assays slot now contains two entries: counts (our initial data) and logcounts (the normalized data). Like counts, logcounts is a special name, which lets us access it simply by typing logcounts(sce), although the longhand version, assay(sce, "logcounts"), works just as well.

##         cell_1 cell_2 cell_3
## gene_1    3.42   3.61   3.91
## gene_2    3.86   3.94   4.27
## gene_3    3.42   3.79   3.85
## gene_4    4.35   4.22   4.10
## gene_5    4.04   4.34   3.74
## gene_6    3.42   3.79   4.19
## gene_7    4.35   3.94   4.05
## gene_8    4.48   4.09   4.39
## gene_9    4.60   4.34   3.96
## gene_10   4.04   4.34   4.01

Notice that the data before had a severe discrepancy in counts between cells 1/2 versus 3, and that normalization has ameliorated this difference.

To look at all the available assays within sce, we can type:

## List of length 2
## names(2): counts logcounts

While the functions above demonstrate automatic addition of assays to our sce object, there may be cases where we want to perform our own calculations and save the result into the assays slot. In particular, this is important for using functions that do not return your SingleCellExperiment object.

Let’s append a new version of the data that has been offset by +100.
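
A minimal sketch of this manual route follows. The toy counts are simulated with rpois() and a fixed seed, which is our assumption for illustration, so the values will differ from those printed elsewhere in this chapter:

```r
library(SingleCellExperiment)

set.seed(1)
counts_matrix <- matrix(rpois(30, lambda = 10), nrow = 10,
                        dimnames = list(paste0("gene_", 1:10),
                                        paste0("cell_", 1:3)))
sce <- SingleCellExperiment(assays = list(counts = counts_matrix))

# Compute a derived matrix and store it under a new assay name.
counts_100 <- assay(sce, "counts") + 100
assay(sce, "counts_100") <- counts_100

assays(sce)       # the full list of assays
assayNames(sce)   # just their names
```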

We can then use the accessor assays() (notice this is plural!) to see all the entries we have made in the assays slot so far. The plural assays() accessor lists every assay, whereas the singular assay() accessor retrieves a single entry (as a matrix) given its name, as above.

## List of length 3
## names(3): counts logcounts counts_100

These entries are also seen on the default view of sce:

## class: SingleCellExperiment 
## dim: 10 3 
## metadata(1): log.exprs.offset
## assays(3): counts logcounts counts_100
## rownames(10): gene_1 gene_2 ... gene_9 gene_10
## rowData names(0):
## colnames(3): cell_1 cell_2 cell_3
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):

This sort of extension of the assays slot is represented graphically in Figure 1B (dark blue box), showing the addition of the logcounts matrix into the assays slot.

In a similar manner, many of the slots of sce are extendable through assignment as shown above, thus allowing for myriad custom functionality as needed for interoperability with functions outside of single-cell oriented Bioconductor packages.

4.2.3 Column (Meta)Data: colData Slot

To further annotate our sce object, one of the first and most useful steps is to add metadata that describes the columns of our primary data, e.g. the samples or cells of our experiment. This data is entered into the colData slot:

  • colData slot: metadata that describes the samples (cells), provided as a data.frame (or a DataFrame if appending), where rows correspond to cells and columns correspond to the sample (cell) metadata features (e.g. id, batch, author, etc.) (Figure 1A, orange box).

So, let’s come up with some metadata for the cells, starting with a batch variable, where cells 1 and 2 are in batch 1, and cell 3 is from batch 2.

Now, we can take two approaches: either append the cell_metadata to our existing sce, or start from scratch via the SingleCellExperiment() constructor and provide it from the outset. We’ll start from scratch for now, but will also show how to append the data:
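
Both approaches can be sketched as below. For simplicity the metadata is built directly as a DataFrame, which works for both routes; the simulated toy counts and the name sce_appended are assumptions of this sketch.

```r
library(SingleCellExperiment)

set.seed(1)
counts_matrix <- matrix(rpois(30, lambda = 10), nrow = 10,
                        dimnames = list(paste0("gene_", 1:10),
                                        paste0("cell_", 1:3)))

# One row of metadata per cell, with row names set to the cell names.
cell_metadata <- DataFrame(batch = c(1, 1, 2),
                           row.names = colnames(counts_matrix))

# Approach 1: supply colData when constructing the object.
sce <- SingleCellExperiment(assays = list(counts = counts_matrix),
                            colData = cell_metadata)

# Approach 2: append the metadata to an existing object.
sce_appended <- SingleCellExperiment(assays = list(counts = counts_matrix))
colData(sce_appended) <- cell_metadata

colData(sce)
```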

Similar to assays, we can see our colData is now populated from the default view of sce:

## class: SingleCellExperiment 
## dim: 10 3 
## metadata(0):
## assays(1): counts
## rownames(10): gene_1 gene_2 ... gene_9 gene_10
## rowData names(0):
## colnames(3): cell_1 cell_2 cell_3
## colData names(1): batch
## reducedDimNames(0):
## spikeNames(0):

We can also access our column (meta)data with the accessor, colData():

## DataFrame with 3 rows and 1 column
##            batch
##        <numeric>
## cell_1         1
## cell_2         1
## cell_3         2

Finally, some packages automatically add to the colData slot. For example, the scater package features a function, calculateQCMetrics(), which appends a number of quality control metrics. Here we show the first five columns of colData(sce) with the quality control metrics appended to it.

## DataFrame with 3 rows and 5 columns
##            batch is_cell_control total_features_by_counts
##        <numeric>       <logical>                <integer>
## cell_1         1           FALSE                       10
## cell_2         1           FALSE                       10
## cell_3         2           FALSE                       10
##        log10_total_features_by_counts total_counts
##                             <numeric>    <integer>
## cell_1               1.04139268515822           81
## cell_2               1.04139268515822           98
## cell_3               1.04139268515822          291

4.2.3.1 Using colData for Subsetting

A common operation with colData is its use in subsetting. One simple way to access colData is through the use of the $ operator, which is a shortcut for accessing a variable within the colData slot:

## [1] 1 1 2

If we only wanted cells within batch 1, we could subset our sce object as follows (remember, we subset on the columns in this case because we are filtering by cells/samples here).
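
A short sketch of both operations, the $ shortcut and the column subset, on a small object with simulated toy counts (the name sce_batch1 is ours):

```r
library(SingleCellExperiment)

set.seed(1)
counts_matrix <- matrix(rpois(30, lambda = 10), nrow = 10,
                        dimnames = list(paste0("gene_", 1:10),
                                        paste0("cell_", 1:3)))
sce <- SingleCellExperiment(
  assays = list(counts = counts_matrix),
  colData = DataFrame(batch = c(1, 1, 2),
                      row.names = colnames(counts_matrix))
)

sce$batch   # shortcut for colData(sce)$batch

# Keep all rows (genes) but only the columns (cells) in batch 1.
sce_batch1 <- sce[, sce$batch == 1]
dim(sce_batch1)
colnames(sce_batch1)
```

Subsetting the object keeps the assays and the colData in sync, so sce_batch1 carries only the metadata rows of the retained cells.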

## class: SingleCellExperiment 
## dim: 10 2 
## metadata(0):
## assays(1): counts
## rownames(10): gene_1 gene_2 ... gene_9 gene_10
## rowData names(7): is_feature_control mean_counts ... total_counts
##   log10_total_counts
## colnames(2): cell_1 cell_2
## colData names(10): batch is_cell_control ...
##   pct_counts_in_top_200_features pct_counts_in_top_500_features
## reducedDimNames(0):
## spikeNames(0):

4.2.4 Feature Metadata: rowData/rowRanges

Lastly, the rows also have their own metadata slot to store information that pertains to the features of the sce object:

  • rowData slot: contains data in a data.frame (DataFrame) format that describes aspects of the data corresponding to the rows of the primary data (Figure 1A, green box).

Furthermore, there is a special slot which pertains to features with genomic coordinates:

  • rowRanges slot: contains data in a GRangesList (where each entry is a GRanges object) that describes the chromosome, start, and end coordinates of the features (genes, genomic regions).

Both of these can be accessed via their respective accessors, rowData() and rowRanges(). In our case, we supplied no genomic coordinates, so rowRanges(sce) produces a list of empty ranges, one per gene:
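
As a brief sketch (with simulated toy counts, an assumption of this example), an object built without genomic coordinates still returns one entry per gene from rowRanges():

```r
library(SingleCellExperiment)

set.seed(1)
counts_matrix <- matrix(rpois(30, lambda = 10), nrow = 10,
                        dimnames = list(paste0("gene_", 1:10),
                                        paste0("cell_", 1:3)))
sce <- SingleCellExperiment(assays = list(counts = counts_matrix))

# No coordinates were supplied, so each gene maps to an empty set of ranges.
gene_ranges <- rowRanges(sce)
gene_ranges
length(gene_ranges)
```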

## GRangesList object of length 10:
## $gene_1
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: no sequences
## 
## $gene_2
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: no sequences
## 
## $gene_3
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: no sequences
## 
## ...
## <7 more elements>

However, our call to calculateQCMetrics(sce) in the prior section filled in the rowData slot of our sce object, as we can see below (only the first three columns are shown for brevity):
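
Because quality control helpers have changed across scater releases, the sketch below does not call them. Instead it fills in two of the columns by hand, assuming mean_counts is the per-gene mean and log10_mean_counts is log10 of that mean plus one; with the counts from section 4.2.1 these assumptions match the values printed in the table that follows.

```r
library(SingleCellExperiment)

counts_matrix <- cbind(
  cell_1 = c(5, 7, 5, 10, 8, 5, 10, 11, 12, 8),
  cell_2 = c(7, 9, 8, 11, 12, 8, 9, 10, 12, 12),
  cell_3 = c(26, 34, 25, 30, 23, 32, 29, 37, 27, 28)
)
rownames(counts_matrix) <- paste0("gene_", 1:10)
sce <- SingleCellExperiment(assays = list(counts = counts_matrix))

# Append per-gene summaries to the rowData slot, one column at a time.
gene_means <- rowMeans(counts(sce))
rowData(sce)$mean_counts <- gene_means
rowData(sce)$log10_mean_counts <- log10(gene_means + 1)

rowData(sce)
```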

## DataFrame with 10 rows and 3 columns
##         is_feature_control      mean_counts log10_mean_counts
##                  <logical>        <numeric>         <numeric>
## gene_1               FALSE 12.6666666666667  1.13566260200007
## gene_2               FALSE 16.6666666666667  1.24715461488113
## gene_3               FALSE 12.6666666666667  1.13566260200007
## gene_4               FALSE               17  1.25527250510331
## gene_5               FALSE 14.3333333333333  1.18563657696191
## gene_6               FALSE               15  1.20411998265592
## gene_7               FALSE               16  1.23044892137827
## gene_8               FALSE 19.3333333333333   1.3082085802911
## gene_9               FALSE               17  1.25527250510331
## gene_10              FALSE               16  1.23044892137827

In a similar fashion to the colData slot, such feature metadata could be provided at the outset when creating the SingleCellExperiment object, which we leave up to the reader as an exercise.

4.2.4.1 Subsetting on Rows

To subset an sce object at the feature/gene level, we can perform row subsetting as with other R objects, supplying either numeric indices or a vector of names:
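
Both forms are sketched below on an object with simulated toy counts; the names by_name and by_index are ours.

```r
library(SingleCellExperiment)

set.seed(1)
counts_matrix <- matrix(rpois(30, lambda = 10), nrow = 10,
                        dimnames = list(paste0("gene_", 1:10),
                                        paste0("cell_", 1:3)))
sce <- SingleCellExperiment(assays = list(counts = counts_matrix))

# Rows come first in the brackets, so the column position is left empty.
by_name  <- sce[c("gene_1", "gene_4"), ]
by_index <- sce[c(1, 4), ]

by_name
```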

## class: SingleCellExperiment 
## dim: 2 3 
## metadata(0):
## assays(1): counts
## rownames(2): gene_1 gene_4
## rowData names(7): is_feature_control mean_counts ... total_counts
##   log10_total_counts
## colnames(3): cell_1 cell_2 cell_3
## colData names(10): batch is_cell_control ...
##   pct_counts_in_top_200_features pct_counts_in_top_500_features
## reducedDimNames(0):
## spikeNames(0):

4.2.5 Size Factors Slot: sizeFactors

Briefly, we already encountered this via the scran::computeSumFactors(sce) call, which adds a sizeFactors slot:

  • sizeFactors slot: contains a numeric vector of the sample/cell normalization factors used to produce a normalized data representation (Figure 1B, brown box)

## [1] 0.517 0.626 1.857

4.3 A Brief Recap: From se to sce

So far, we have covered the assays (primary data), colData (sample metadata), rowData/rowRanges (feature metadata), and sizeFactors slots of SingleCellExperiment.

What is important to note is that the SingleCellExperiment class derives from the SummarizedExperiment (se) class, its predecessor, and in particular inherits the aforementioned slots. As such, much of the SummarizedExperiment functionality is retained in SingleCellExperiment. This allows existing methods that work with SummarizedExperiment to work similarly on SingleCellExperiment objects.

So what’s new about the SingleCellExperiment class then? For our discussion, the most important change is the addition of a new slot called reducedDims.

4.4 The reducedDims Slot

The reducedDims slot is a new addition which is specially designed to store the reduced dimensionality representations of primary data, such as PCA, t-SNE, UMAP, and others.

  • reducedDims slot: contains a list of numeric matrix entries which describe dimensionality reduced representations of the primary data, such that rows represent the columns of the primary data (aka the samples/cells), and columns represent the dimensions

Most importantly, just like the assays slot, the reducedDims slot can hold a list of many entries. So, it can hold PCA, t-SNE, and UMAP representations of a given dataset, all within the reducedDims slot.

In our example, we can calculate a PCA representation of our data as follows using the scater package function runPCA(). We see that the sce now shows a new reducedDim and that the accessor reducedDim() produces the results of running PCA on the normalized data from logcounts(sce).

##            PC1   PC2      PC3
## cell_1 -2.5559  1.08 3.17e-15
## cell_2 -0.0014 -2.15 3.17e-15
## cell_3  2.5573  1.07 3.17e-15
## attr(,"percentVar")
## [1] 6.54e-01 3.46e-01 1.51e-30

From this, we can also calculate a t-SNE representation using the scater package function runTSNE(), and see that it appears both in the default view of sce and via accession:

## Perplexity should be lower than K!
##         [,1]  [,2]
## cell_1 -1427  5513
## cell_2  5488 -1521
## cell_3 -4061 -3992

We can view the names of all our entries in the reducedDims slot via the accessor, reducedDims() (notice that this is plural, and thus not the same as reducedDim()):

## List of length 2
## names(2): PCA TSNE

Now, say we have a different dimensionality reduction approach which has not yet been implemented with SingleCellExperiment objects in mind. For example, let’s say we want to try the umap() function as implemented in the uwot package (which is a much faster version of the default umap implementation currently in scater).

Similar to how we extended the assays slot with our own custom entry of counts_100, we can do similarly for the reducedDims slot:
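
The assignment pattern can be sketched as follows. To keep the example free of extra dependencies, base R's prcomp() stands in for uwot::umap() here, and the entry name "PCA_manual", the simulated counts, and the simple log2 transform are all assumptions of this sketch. The same reducedDim(sce, name) <- value assignment applies to any matrix with one row per cell.

```r
library(SingleCellExperiment)

set.seed(1)
counts_matrix <- matrix(rpois(30, lambda = 10), nrow = 10,
                        dimnames = list(paste0("gene_", 1:10),
                                        paste0("cell_", 1:3)))
sce <- SingleCellExperiment(assays = list(counts = counts_matrix))
logcounts(sce) <- log2(counts(sce) + 1)   # simple transform for illustration

# Reduced dimensions have cells as rows, so the expression matrix is
# transposed before the reduction. prcomp() is a stand-in for uwot::umap().
cell_embedding <- prcomp(t(logcounts(sce)))$x[, 1:2]

# Store the custom result under a name of our choosing.
reducedDim(sce, "PCA_manual") <- cell_embedding

reducedDim(sce, "PCA_manual")
reducedDimNames(sce)
```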

##           [,1]    [,2]
## cell_1 -0.5711  0.3196
## cell_2 -0.0906 -0.3988
## cell_3  0.6617  0.0792
## attr(,"scaled:center")
## [1] -1.09 -8.79

And we can also see its entry when we look at the reducedDims() accessor output:

## List of length 3
## names(3): PCA TSNE UMAP_uwot

4.5 One More Thing: metadata Slot

Some analyses produce results that do not fit into the aforementioned slots. Thankfully, there is a slot just for this type of messy data; in fact, it can accommodate any type of data, so long as it is in a named list:

  • metadata slot: a named list of entries, where each entry in the list can be anything you want it to be

For example, say we have some favorite genes, such as highly variable genes, that we want to save inside of sce for use at a later point in our analysis. We can do this simply by appending to the metadata slot as follows:

## $favorite_genes
## [1] "gene_1" "gene_5"

Similarly, we can append more information via the $ operator:
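
Both steps, the initial assignment of a named list and the later addition via $, can be sketched on a small object with simulated toy counts:

```r
library(SingleCellExperiment)

set.seed(1)
counts_matrix <- matrix(rpois(30, lambda = 10), nrow = 10,
                        dimnames = list(paste0("gene_", 1:10),
                                        paste0("cell_", 1:3)))
sce <- SingleCellExperiment(assays = list(counts = counts_matrix))

# Replace the (empty) metadata with a named list holding one entry.
my_genes <- c("gene_1", "gene_5")
metadata(sce) <- list(favorite_genes = my_genes)

# Add a second entry without disturbing the first.
your_genes <- c("gene_4", "gene_8")
metadata(sce)$your_genes <- your_genes

metadata(sce)
```

Note that assigning a whole list with metadata(sce) <- ... overwrites anything already stored there, whereas the $ form only adds or replaces a single entry.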

## $favorite_genes
## [1] "gene_1" "gene_5"
## 
## $your_genes
## [1] "gene_4" "gene_8"

4.6 About Spike-Ins

You might have noticed that the sce default view produces an entry with spikeNames. The SingleCellExperiment object contains some special considerations for experiments with spike-in (ERCC) controls. We refer the interested reader to the SingleCellExperiment introductory vignette to learn more.

4.7 Working with SingleCellExperiment

In subsequent sections, we will show an example workflow that uses the SingleCellExperiment object as its base, and similar to our walkthrough of the SingleCellExperiment class above, continually appends new entries to save the results of the analysis. The SingleCellExperiment thus can serve as a record of analysis in this manner. This makes it especially useful for collaboration, as the object can be transferred and then visualized via graphical user interfaces such as iSEE.

4.8 The Centrality of SingleCellExperiment

Graph network of package dependencies linking to the SingleCellExperiment package (class). Packages are filtered by biocView “singleCell”.

To emphasize its importance, here we show the centrality of the SingleCellExperiment class to the Bioc-verse single-cell related packages. It is this connection to SingleCellExperiment that makes many of these packages easily interoperable and modular over the span of an scRNA-seq analysis.

4.9 Multimodal Data: MultiAssayExperiment

Recent advances in technology and protocols allow the simultaneous collection of DNA and RNA from the same cells, enabling single-cell multi-modal analysis. These data present new challenges in the complexity of statistical analyses, which are addressed in Bioconductor through the MultiAssayExperiment container. The MultiAssayExperiment class integrates all major Bioconductor experimental data containers, and any containers derived from those, including SingleCellExperiment. It provides harmonized data management for heterogeneous assays, including subsetting by genomic identifiers, genomic coordinates, or sample/cell attributes such as cell type. The user interface mimics that of SingleCellExperiment, with comparable actions working across all assays.

Multi-modal profiling is an emergent area of single-cell biology with many exciting technologies coming online, such as gene expression profiling in tandem with protein via CITE-seq/REAP-seq and adaptive repertoire sequencing. While we won’t cover multimodal data analysis further in this online book as of this writing, we anticipate infrastructure and statistical methodology advances in this area in the near future.
