# Chapter 4 Data Infrastructure

One of the advantages of using Bioconductor packages is that they utilize common data infrastructures, which makes analyses interoperable across packages. Furthermore, much engineering effort is put into making this infrastructure robust and scalable. Here, we describe the SingleCellExperiment object (or sce for short) in detail: how it is constructed, how it is used in downstream analysis, and how it stores various types of primary data and metadata.

## 4.1 Prerequisites

The Bioconductor package SingleCellExperiment provides the SingleCellExperiment class for usage. While the package is implicitly installed and loaded when using any package that depends on the SingleCellExperiment class, it can be explicitly installed (and loaded) as follows:

BiocManager::install('SingleCellExperiment')

Additionally, we use some functions from the scater and scran packages, as well as the CRAN package uwot (which conveniently can also be installed through BiocManager). These functions will be accessed through the <package>::<function> convention as needed.

BiocManager::install(c('scater', 'scran', 'uwot'))

For this session, all we will need loaded is the SingleCellExperiment package:

library(SingleCellExperiment)

## 4.2 The SingleCellExperiment Class

### 4.2.1 Primary Data: The assays Slot

The SingleCellExperiment (sce) object is the basis of single-cell analytical applications based in Bioconductor. The sce object is an S4 object, which in essence provides a more formalized approach towards construction and accession of data compared to other methods available in R. The utility of S4 comes from validity checks that ensure safe data manipulation and, most importantly for our discussion, from its extensibility through slots.

If we imagine the sce object to be a ship, the slots of sce can be thought of as individual cargo boxes - each exists as a separate entity within the sce object. Furthermore, each slot contains data that arrives in its own format. To extend the metaphor, we can imagine that different variations of cargo boxes are required for fruits versus bricks. In the case of sce, certain slots expect numeric matrices, whereas others may expect data frames.

To construct a rudimentary sce object, all we need is a single slot:

• assays slot: contains primary data such as counts in a list, where each entry of the list is in a matrix format, where rows correspond to features (genes) and columns correspond to samples (cells) (Figure 1A, blue box)

Let’s start simple by generating three cells worth of count data across ten genes.

counts_matrix <- data.frame(cell_1 = rpois(10, 10),
                            cell_2 = rpois(10, 10),
                            cell_3 = rpois(10, 30))
rownames(counts_matrix) <- paste0("gene_", 1:10)
counts_matrix <- as.matrix(counts_matrix) # must be a matrix object!

From this, we can now construct our first SingleCellExperiment object, using the defined constructor, SingleCellExperiment(). Note that we provide our data as a named list, and each entry of the list is a matrix. Here, we name the counts_matrix entry as simply counts within the list.

sce <- SingleCellExperiment(assays = list(counts = counts_matrix))
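
To make the role of the validity checks more concrete, here is a minimal sketch of what happens when the entries of the assays list do not share the same dimensions. The object names (ok, bad, sce_ok, result) and the seed are ours and chosen only for illustration.

```r
library(SingleCellExperiment)

set.seed(42)  # arbitrary seed, only to make the simulated counts reproducible
ok <- matrix(rpois(30, 10), nrow = 10,
             dimnames = list(paste0("gene_", 1:10), paste0("cell_", 1:3)))
bad <- ok[1:5, ]  # deliberately has too few rows

# Two assays with matching dimensions are accepted
sce_ok <- SingleCellExperiment(assays = list(counts = ok, doubled = ok * 2))
assayNames(sce_ok)

# Assays with mismatched dimensions are rejected at construction time
result <- tryCatch(
  SingleCellExperiment(assays = list(counts = ok, truncated = bad)),
  error = function(e) "rejected"
)
result
```

Here result holds the string "rejected", because the constructor stops with an error rather than silently building an inconsistent object.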

To inspect the object, we can simply type sce into the console to see some pertinent information, which will display an overview of the various slots available to us (which may or may not have any data).

sce
## class: SingleCellExperiment
## dim: 10 3
## assays(1): counts
## rownames(10): gene_1 gene_2 ... gene_9 gene_10
## rowData names(0):
## colnames(3): cell_1 cell_2 cell_3
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):

To access the count data we just supplied, we can do any one of the following:

• assay(sce, "counts") - this is the most general method, where we can supply the name of the assay as the second argument.
• counts(sce) - this is the same as the above, but only works for assays with the special name "counts".
counts(sce)
##         cell_1 cell_2 cell_3
## gene_1      12     14     30
## gene_2      16     10     40
## gene_3       9     11     31
## gene_4      11      7     37
## gene_5       7      7     32
## gene_6      12     12     19
## gene_7       5      9     25
## gene_8      13     13     34
## gene_9       7     14     27
## gene_10      6      7     28
## assay(sce, "counts") ## same as above in this special case
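
As a quick check that the two accessors really return the same thing, the sketch below rebuilds the object from the counts printed above (so that it runs on its own) and compares the results directly.

```r
library(SingleCellExperiment)

# Counts copied from the counts(sce) printout above
counts_matrix <- cbind(
  cell_1 = c(12, 16, 9, 11, 7, 12, 5, 13, 7, 6),
  cell_2 = c(14, 10, 11, 7, 7, 12, 9, 13, 14, 7),
  cell_3 = c(30, 40, 31, 37, 32, 19, 25, 34, 27, 28)
)
rownames(counts_matrix) <- paste0("gene_", 1:10)
sce <- SingleCellExperiment(assays = list(counts = counts_matrix))

# The shorthand and the general accessor give identical matrices
same_by_name <- identical(counts(sce), assay(sce, "counts"))

# Without a name, assay() falls back to the first entry of the assays slot
same_by_position <- identical(assay(sce), assay(sce, "counts"))

# The result is an ordinary matrix, so the usual indexing applies
counts(sce)["gene_2", "cell_3"]
```

Both comparisons return TRUE, and the final line extracts the single value 40.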

### 4.2.2 Extending the assays Slot

What makes the assays slot especially powerful is that it can hold multiple representations of the primary data. This is especially useful for storing the raw as well as a normalized version of the data. We can do just that as shown below, using the scran and scater packages to compute a normalized and log-transformed representation of the initial primary data.

Note that here, we overwrite our previous sce upon reassigning the results to sce - this is because these functions return a SingleCellExperiment object. Some functions - especially those outside of single-cell oriented Bioconductor packages - do not, in which case you will need to append your results to the sce object (see below).

sce <- scran::computeSumFactors(sce)
sce <- scater::normalize(sce)

Viewing the object again, we see that these functions added some new entries:

sce
## class: SingleCellExperiment
## dim: 10 3
## assays(2): counts logcounts
## rownames(10): gene_1 gene_2 ... gene_9 gene_10
## rowData names(0):
## colnames(3): cell_1 cell_2 cell_3
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):

Specifically, we see that the assays slot has grown to comprise two entries: counts (our initial data) and logcounts (the normalized data). Similar to counts, the logcounts name is a special name which lets us access it simply by typing logcounts(sce), although the longhand version works just as well.

logcounts(sce)
##         cell_1 cell_2 cell_3
## gene_1    4.43   4.56   4.14
## gene_2    4.83   4.10   4.54
## gene_3    4.04   4.23   4.19
## gene_4    4.31   3.62   4.43
## gene_5    3.70   3.62   4.23
## gene_6    4.43   4.35   3.53
## gene_7    3.26   3.96   3.90
## gene_8    4.54   4.46   4.31
## gene_9    3.70   4.56   4.00
## gene_10   3.50   3.62   4.05
## assay(sce, "logcounts") ## same as above

Notice that the data before had a severe discrepancy in counts between cells 1/2 versus 3, and that normalization has ameliorated this difference.
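
To make the normalization step less of a black box, here is a minimal sketch of one simple way to compute size factors and log-transformed values by hand. It assumes library-size factors (each cell's total count divided by the mean total across cells), which is a simpler stand-in for the deconvolution approach of scran::computeSumFactors(), followed by a log2 transform with a pseudocount of 1. The assay name logcounts_manual is ours, and the counts are copied from the counts(sce) printout above.

```r
library(SingleCellExperiment)

# Counts copied from the counts(sce) printout above
counts_matrix <- cbind(
  cell_1 = c(12, 16, 9, 11, 7, 12, 5, 13, 7, 6),
  cell_2 = c(14, 10, 11, 7, 7, 12, 9, 13, 14, 7),
  cell_3 = c(30, 40, 31, 37, 32, 19, 25, 34, 27, 28)
)
rownames(counts_matrix) <- paste0("gene_", 1:10)
sce <- SingleCellExperiment(assays = list(counts = counts_matrix))

# Library-size factors: total counts per cell, scaled to have a mean of 1
lib_sizes <- colSums(counts(sce))            # 98, 104, 303
size_factors <- lib_sizes / mean(lib_sizes)

# Divide each column by its size factor, then log2-transform with a pseudocount
normalized <- sweep(counts(sce), 2, size_factors, "/")
manual_logcounts <- log2(normalized + 1)

# Store the result alongside the raw counts under a name of our choosing
assay(sce, "logcounts_manual") <- manual_logcounts

round(size_factors, 3)
round(manual_logcounts["gene_1", ], 2)
```

For this three-cell example, the hand-computed values for gene_1 round to 4.43, 4.56 and 4.14, which agrees with the logcounts(sce) printout above.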

To look at all the available assays within sce, we can type:

assays(sce)
## List of length 2
## names(2): counts logcounts

While the functions above demonstrate automatic addition of assays to our sce object, there may be cases where we want to perform our own calculations and save the result into the assays slot. In particular, this is important for using functions that do not return a SingleCellExperiment object.

Let’s append a new version of the data that has been offset by +100.

counts_100 <- assay(sce, "counts") + 100
assay(sce, "counts_100") <- counts_100 # assign a new entry to assays slot

We can then use the plural accessor assays() to list every entry we have made in the assays slot so far. To retrieve a single entry (as a matrix), we use the singular accessor assay(), providing the name of the assay we wish to retrieve as above.

assays(sce)
## List of length 3
## names(3): counts logcounts counts_100
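
The same pattern applies to any function that returns a plain matrix rather than a SingleCellExperiment. As a sketch, the hypothetical helper cpm_matrix() below (our own illustration, not part of any package) computes counts per million with base R, and its result is stored as a new assay. The last step shows one way to drop entries again by subsetting the list returned by assays().

```r
library(SingleCellExperiment)

# Counts copied from the counts(sce) printout earlier in this chapter
counts_matrix <- cbind(
  cell_1 = c(12, 16, 9, 11, 7, 12, 5, 13, 7, 6),
  cell_2 = c(14, 10, 11, 7, 7, 12, 9, 13, 14, 7),
  cell_3 = c(30, 40, 31, 37, 32, 19, 25, 34, 27, 28)
)
rownames(counts_matrix) <- paste0("gene_", 1:10)
sce <- SingleCellExperiment(assays = list(counts = counts_matrix))

# Hypothetical helper: takes a matrix, returns a matrix (no sce involved)
cpm_matrix <- function(m) sweep(m, 2, colSums(m), "/") * 1e6

# Append the result to the assays slot ourselves
assay(sce, "cpm") <- cpm_matrix(counts(sce))
names_after_adding <- assayNames(sce)
cpm_totals <- colSums(assay(sce, "cpm"))  # every cell now sums to one million

# Keep only the counts entry again by subsetting the list of assays
assays(sce) <- assays(sce)["counts"]
names_after_dropping <- assayNames(sce)
```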

These entries are also seen on the default view of sce:

sce
## class: SingleCellExperiment
## dim: 10 3
## assays(3): counts logcounts counts_100
## rownames(10): gene_1 gene_2 ... gene_9 gene_10
## rowData names(0):
## colnames(3): cell_1 cell_2 cell_3
## colData names(0):
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):

This sort of extension of the assays slot is represented graphically in Figure 1B (dark blue box), showing the addition of the logcounts matrix into the assays slot.

In a similar manner, many of the slots of sce are extendable through assignment as shown above, thus allowing for myriad custom functionality as needed for interoperability with functions outside of single-cell oriented Bioconductor packages.

### 4.2.3 Column (Meta)Data: colData Slot

To further annotate our sce object, one of the first and most useful pieces of information is adding metadata that describes the columns of our primary data, e.g., the samples or cells of our experiment. This data is entered into the colData slot:

• colData slot: metadata that describes the samples (cells), provided as a data.frame (or DataFrame if appending), where rows correspond to cells, and columns correspond to the sample (cell) metadata features (e.g. id, batch, author, etc.) (Figure 1A, orange box).

So, let’s come up with some metadata for the cells, starting with a batch variable, where cells 1 and 2 are in batch 1, and cell 3 is from batch 2.

cell_metadata <- data.frame(batch = c(1, 1, 2))
rownames(cell_metadata) <- paste0("cell_", 1:3)

Now, we can take two approaches - either append the cell_metadata to our existing sce, or start from scratch via the SingleCellExperiment() constructor and provide it from the get go. We’ll start from scratch for now, but will also show how to append the data:

## From scratch:
sce <- SingleCellExperiment(assays = list(counts = counts_matrix),
                            colData = cell_metadata)

## Appending to existing object (requires DataFrame() coercion)
## colData(sce) <- DataFrame(cell_metadata)

Similar to assays, we can see our colData is now populated from the default view of sce:

sce
## class: SingleCellExperiment
## dim: 10 3
## assays(1): counts
## rownames(10): gene_1 gene_2 ... gene_9 gene_10
## rowData names(0):
## colnames(3): cell_1 cell_2 cell_3
## colData names(1): batch
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):

And furthermore access our column (meta)data with the accessor, colData():

colData(sce)
## DataFrame with 3 rows and 1 column
##            batch
##        <numeric>
## cell_1         1
## cell_2         1
## cell_3         2
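
Individual columns can also be added to colData after the object has been created. The sketch below assumes a made-up treatment annotation and a total column that we compute ourselves; both names are ours and serve only as an illustration.

```r
library(SingleCellExperiment)

# Counts copied from the counts(sce) printout earlier in this chapter
counts_matrix <- cbind(
  cell_1 = c(12, 16, 9, 11, 7, 12, 5, 13, 7, 6),
  cell_2 = c(14, 10, 11, 7, 7, 12, 9, 13, 14, 7),
  cell_3 = c(30, 40, 31, 37, 32, 19, 25, 34, 27, 28)
)
rownames(counts_matrix) <- paste0("gene_", 1:10)

cell_metadata <- DataFrame(batch = c(1, 1, 2),
                           row.names = colnames(counts_matrix))
sce <- SingleCellExperiment(assays = list(counts = counts_matrix),
                            colData = cell_metadata)

# Add a column with the $ shortcut (hypothetical annotation)
sce$treatment <- c("control", "control", "treated")

# Or assign through the colData() accessor, here a value derived from the counts
colData(sce)$total <- colSums(counts(sce))

colnames(colData(sce))
```

The colData now has three columns: batch, treatment and total.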

Finally, some packages automatically add to the colData slot. For example, the scater package features a function, calculateQCMetrics(), which appends a range of quality control metrics. Here we show the first five columns of colData(sce) with the quality control metrics appended to it.

sce <- scater::calculateQCMetrics(sce)
colData(sce)[, 1:5]
## DataFrame with 3 rows and 5 columns
##            batch is_cell_control total_features_by_counts
##        <numeric>       <logical>                <integer>
## cell_1         1           FALSE                       10
## cell_2         1           FALSE                       10
## cell_3         2           FALSE                       10
##        log10_total_features_by_counts total_counts
##                             <numeric>    <integer>
## cell_1               1.04139268515822           98
## cell_2               1.04139268515822          104
## cell_3               1.04139268515822          303
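
Two of these metrics are simple enough to compute by hand, which may help clarify what the columns mean. The sketch below uses base R only and column names of our own choosing (total_counts_manual, detected_manual). As far as we are aware, recent Bioconductor releases provide addPerCellQC() in the scuttle package for this purpose, with different column names, so treat the function shown in the text as specific to the package versions used here.

```r
library(SingleCellExperiment)

# Counts copied from the counts(sce) printout earlier in this chapter
counts_matrix <- cbind(
  cell_1 = c(12, 16, 9, 11, 7, 12, 5, 13, 7, 6),
  cell_2 = c(14, 10, 11, 7, 7, 12, 9, 13, 14, 7),
  cell_3 = c(30, 40, 31, 37, 32, 19, 25, 34, 27, 28)
)
rownames(counts_matrix) <- paste0("gene_", 1:10)
sce <- SingleCellExperiment(
  assays = list(counts = counts_matrix),
  colData = DataFrame(batch = c(1, 1, 2), row.names = colnames(counts_matrix))
)

# Total counts per cell: the column sums of the count matrix
sce$total_counts_manual <- colSums(counts(sce))

# Number of features with a non-zero count in each cell
sce$detected_manual <- colSums(counts(sce) > 0)

colData(sce)
```

The totals come out as 98, 104 and 303, and all ten genes are detected in every cell, in line with the total_counts and total_features_by_counts columns shown above.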

#### 4.2.3.1 Using colData for Subsetting

A common operation with colData is its use in subsetting. One simple way to access colData is through the use of the $ operator, which is a shortcut for accessing a variable within the colData slot:

sce$batch
## [1] 1 1 2
## colData(sce)$batch # same as above

If we only wanted cells within batch 1, we could subset our sce object as follows (remember, we subset on the columns in this case because we are filtering by cells/samples here).

sce[, sce$batch == 1]
## class: SingleCellExperiment
## dim: 10 2
## assays(1): counts
## rownames(10): gene_1 gene_2 ... gene_9 gene_10
## rowData names(7): is_feature_control mean_counts ... total_counts
##   log10_total_counts
## colnames(2): cell_1 cell_2
## colData names(10): batch is_cell_control ...
##   pct_counts_in_top_200_features pct_counts_in_top_500_features
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):
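
Because the column index is just a logical vector, conditions can be combined freely. The following sketch filters on batch, on a value computed from the counts, and on both at once; the threshold of 100 total counts is arbitrary and the object names are ours.

```r
library(SingleCellExperiment)

# Counts copied from the counts(sce) printout earlier in this chapter
counts_matrix <- cbind(
  cell_1 = c(12, 16, 9, 11, 7, 12, 5, 13, 7, 6),
  cell_2 = c(14, 10, 11, 7, 7, 12, 9, 13, 14, 7),
  cell_3 = c(30, 40, 31, 37, 32, 19, 25, 34, 27, 28)
)
rownames(counts_matrix) <- paste0("gene_", 1:10)
sce <- SingleCellExperiment(
  assays = list(counts = counts_matrix),
  colData = DataFrame(batch = c(1, 1, 2), row.names = colnames(counts_matrix))
)

# Cells in batch 1
batch1 <- sce[, sce$batch == 1]

# Cells with more than 100 total counts (arbitrary threshold)
deep <- sce[, colSums(counts(sce)) > 100]

# Both conditions at once
both <- sce[, sce$batch == 1 & colSums(counts(sce)) > 100]

colnames(batch1)  # "cell_1" "cell_2"
colnames(deep)    # "cell_2" "cell_3"
colnames(both)    # "cell_2"

# The assays and colData are subset together
dim(counts(both))
both$batch
```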

### 4.2.4 Feature Metadata: rowData/rowRanges

Lastly, the rows also have their own metadata slot to store information that pertains to the features of the sce object:

• rowData slot: contains data in a data.frame (DataFrame) format that describes aspects of the data corresponding to the rows of the primary data (Figure 1A, green box).

Furthermore, there is a special slot which pertains to features with genomic coordinates:

• rowRanges slot: contains data in a GRangesList (where each entry is a GenomicRanges format) that describes the chromosome, start, and end coordinates of the features (genes, genomic regions).
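
As a sketch of how these two slots can be filled by hand, the example below supplies rowData when the object is created and then attaches genomic coordinates. The gene_type annotation and the coordinates on chr1 are invented purely for illustration, and the variable names are ours.

```r
library(SingleCellExperiment)

# Counts copied from the counts(sce) printout earlier in this chapter
counts_matrix <- cbind(
  cell_1 = c(12, 16, 9, 11, 7, 12, 5, 13, 7, 6),
  cell_2 = c(14, 10, 11, 7, 7, 12, 9, 13, 14, 7),
  cell_3 = c(30, 40, 31, 37, 32, 19, 25, 34, 27, 28)
)
rownames(counts_matrix) <- paste0("gene_", 1:10)

# Feature metadata supplied at construction (hypothetical annotation)
gene_info <- DataFrame(gene_type = rep(c("coding", "noncoding"), each = 5),
                       row.names = rownames(counts_matrix))
sce <- SingleCellExperiment(assays = list(counts = counts_matrix),
                            rowData = gene_info)

# Made-up coordinates: ten 500 bp features spaced along chr1
gr <- GRanges(seqnames = "chr1",
              ranges = IRanges(start = seq(1000, by = 1000, length.out = 10),
                               width = 500))
names(gr) <- rownames(sce)

# Carry the existing annotation along as a metadata column of the ranges
gr$gene_type <- rowData(sce)$gene_type
rowRanges(sce) <- gr

rowRanges(sce)
rowData(sce)
```

We copy gene_type onto the ranges before assigning them because, for objects with genomic ranges, rowData() is read from the metadata columns of rowRanges().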

Both of these can be accessed via their respective accessors, rowRanges() and rowData(). In our case, rowRanges(sce) produces an empty list:

rowRanges(sce) # empty
## GRangesList object of length 10:
## $gene_1
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: no sequences
##
## $gene_2
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: no sequences
##
## $gene_3
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: no sequences
##
## ...
## <7 more elements>

However, our call to calculateQCMetrics(sce) in the prior section filled in the rowData slot of our sce object, as we can see below (only the first three columns are shown for brevity):

rowData(sce)[, 1:3]
## DataFrame with 10 rows and 3 columns
##         is_feature_control      mean_counts log10_mean_counts
##                  <logical>        <numeric>         <numeric>
## gene_1               FALSE 18.6666666666667  1.29373075692248
## gene_2               FALSE               22  1.36172783601759
## gene_3               FALSE               17  1.25527250510331
## gene_4               FALSE 18.3333333333333  1.28630673884327
## gene_5               FALSE 15.3333333333333  1.21307482530885
## gene_6               FALSE 14.3333333333333  1.18563657696191
## gene_7               FALSE               13  1.14612803567824
## gene_8               FALSE               20  1.32221929473392
## gene_9               FALSE               16  1.23044892137827
## gene_10              FALSE 13.6666666666667  1.16633142176653

In a similar fashion to the colData slot, such feature metadata could be provided at the onset when creating the SingleCellExperiment object, which we leave up to the reader as an exercise.

#### 4.2.4.1 Subsetting by Rows

To subset an sce object down at the feature/gene level, we can do a row subsetting operation similar to other R objects, by supplying either numeric indices or a vector of names:

sce[c("gene_1", "gene_4"), ]
## class: SingleCellExperiment
## dim: 2 3
## metadata(0):
## assays(1): counts
## rownames(2): gene_1 gene_4
## rowData names(7): is_feature_control mean_counts ... total_counts
##   log10_total_counts
## colnames(3): cell_1 cell_2 cell_3
## colData names(10): batch is_cell_control ...
##   pct_counts_in_top_200_features pct_counts_in_top_500_features
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):
## sce[c(1, 4), ] # same as above in this case

### 4.2.5 Size Factors Slot: sizeFactors

Briefly, we already encountered this via the scran::computeSumFactors(sce) call, which adds a sizeFactors slot:

• sizeFactors slot: contains information in a numeric vector regarding the sample/cell normalization factors used to produce a normalized data representation (Figure 1B, brown box)

sce <- scran::computeSumFactors(sce)
sce <- scater::normalize(sce)
sizeFactors(sce)
## [1] 0.582 0.618 1.800

## 4.3 A Brief Recap: From se to sce

So far, we have covered the assays (primary data), colData (sample metadata), rowData/rowRanges (feature metadata), and sizeFactors slots of SingleCellExperiment. What is important to note is that the SingleCellExperiment class derives from the SummarizedExperiment (se) class, its predecessor, and in particular inherits the assays, colData and rowData/rowRanges slots. As such, much of the SummarizedExperiment functionality is retained in SingleCellExperiment. This allows existing methods that work with SummarizedExperiment to work similarly on SingleCellExperiment objects.

So what’s new about the SingleCellExperiment class then? For our discussion, the most important change is the addition of a new slot called reducedDims.

## 4.4 The reducedDims Slot

The reducedDims slot is a new addition which is specially designed to store reduced dimensionality representations of primary data, obtained by methods such as PCA, tSNE, UMAP, and others.

• reducedDims slot: contains a list of numeric matrix entries which describe dimensionality reduced representations of the primary data, such that rows represent the columns of the primary data (aka the samples/cells), and columns represent the dimensions

Most importantly, just like the assays slot, the reducedDims slot can hold a list of many entries. So, it can hold a PCA, TSNE, and UMAP representation of a given dataset all within the reducedDims slot.

In our example, we can calculate a PCA representation of our data as follows using the scater package function runPCA(). We see that the sce now shows a new reducedDim and that the accessor reducedDim() produces the results of running PCA on the normalized data from logcounts(sce).

sce <- scater::runPCA(sce)
reducedDim(sce, "PCA")
##           PC1    PC2
## cell_1  0.194  0.818
## cell_2 -0.887 -0.258
## cell_3  0.693 -0.560
## attr(,"percentVar")
## [1] 55.4 44.6

From this, we can also calculate a tSNE representation using the scater package function runTSNE(), and see that it can be seen both in the default view of sce and via accession:

sce <- scater::runTSNE(sce, perplexity = 0.1)
## Perplexity should be lower than K!
reducedDim(sce, "TSNE")
##         [,1]  [,2]
## cell_1  1341  5533
## cell_2 -5458 -1597
## cell_3  4117 -3936

We can view the names of all our entries in the reducedDims slot via the accessor, reducedDims() (notice that this is plural, and thus not the same as reducedDim()):

reducedDims(sce)
## List of length 2
## names(2): PCA TSNE

Now, say we have a different dimensionality reduction algorithm we’d like to try that is not yet implemented with direct compatibility for SingleCellExperiment, e.g. with an associated wrapper in packages such as scater. To accommodate this case (or alternately, when we want to directly run the dimensionality reduction method on the data ourselves rather than through a wrapper), we can directly add to the reducedDims slot. This is similar to how we previously extended the assays slot with our own custom entry of counts_100.

Below, we show how to run umap() as implemented in the uwot package directly, rather than through the associated wrapper function implemented in scater via runUMAP(), saving the intermediate results, and then adding them into the sce object we had previously.

u <- uwot::umap(t(logcounts(sce)), n_neighbors = 2)
reducedDim(sce, "UMAP_uwot") <- u
reducedDim(sce, "UMAP_uwot")
##          [,1]    [,2]
## cell_1 -0.413  0.0368
## cell_2  0.243  0.5494
## cell_3  0.170 -0.5862
## attr(,"scaled:center")
## [1] -2.32 -1.64

And now we can also see its entry when we look at the reducedDims() accessor output:

reducedDims(sce)
## List of length 3
## names(3): PCA TSNE UMAP_uwot

## 4.5 The metadata Slot

Some analyses produce results that do not fit into the aforementioned slots. Thankfully, there is a slot just for this type of messy data, and in fact, can accommodate any type of data, so long as it is in a named list:

• metadata slot: a named list of entries, where each entry in the list can be anything you want it to be

For example, say we have some favorite genes, such as highly variable genes, we want to save inside of sce for use in our analysis at a later point. We can do this simply by appending to the metadata slot as follows:

my_genes <- c("gene_1", "gene_5")
metadata(sce) <- list(favorite_genes = my_genes)
metadata(sce)
## $favorite_genes
## [1] "gene_1" "gene_5"

Similarly, we can append more information via the $ operator:

your_genes <- c("gene_4", "gene_8")
metadata(sce)$your_genes <- your_genes
metadata(sce)
## $favorite_genes
## [1] "gene_1" "gene_5"
##
## $your_genes
## [1] "gene_4" "gene_8"
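
Entries stored in metadata can be reused directly in later steps. As a minimal sketch, the example below stores a gene list together with a nested list of made-up analysis settings (the params entry and its contents are hypothetical), and then uses the stored gene names to subset the object by rows.

```r
library(SingleCellExperiment)

# Counts copied from the counts(sce) printout earlier in this chapter
counts_matrix <- cbind(
  cell_1 = c(12, 16, 9, 11, 7, 12, 5, 13, 7, 6),
  cell_2 = c(14, 10, 11, 7, 7, 12, 9, 13, 14, 7),
  cell_3 = c(30, 40, 31, 37, 32, 19, 25, 34, 27, 28)
)
rownames(counts_matrix) <- paste0("gene_", 1:10)
sce <- SingleCellExperiment(assays = list(counts = counts_matrix))

# Store a gene list, then append a nested list of hypothetical settings
metadata(sce) <- list(favorite_genes = c("gene_1", "gene_5"))
metadata(sce)$params <- list(n_neighbors = 2, note = "hypothetical settings")

# Reuse the stored gene names to subset the object by rows
favorite <- sce[metadata(sce)$favorite_genes, ]
rownames(favorite)

# The metadata travels with the subsetted object
names(metadata(favorite))
```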

## 4.6 About Spike-Ins

You might have noticed that the sce default view produces an entry with spikeNames. The SingleCellExperiment object contains some special considerations for experiments with spike-in (ERCC) controls. Interested readers can learn more about these in the SingleCellExperiment introductory vignette.

## 4.7 Recording Analyses in SingleCellExperiment

In subsequent sections, we will show an example workflow that uses the SingleCellExperiment object as its base, and similar to our walkthrough of the SingleCellExperiment class above, continually appends new entries to save the results of the analysis. The SingleCellExperiment thus can serve as a record of analysis in this manner. This makes it especially useful for collaboration, as the object can be transferred, analyzed, and even visualized via graphical user interfaces such as iSEE.
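
To illustrate the idea of the object as a record, here is a minimal sketch that accumulates a few results and then saves and restores the object. It uses only base R for the computations: library-size normalization as a simple stand-in for the scran and scater functions, and stats::prcomp() in place of scater::runPCA(). The entry names PCA_prcomp and pca_sdev are ours.

```r
library(SingleCellExperiment)

# Counts copied from the counts(sce) printout earlier in this chapter
counts_matrix <- cbind(
  cell_1 = c(12, 16, 9, 11, 7, 12, 5, 13, 7, 6),
  cell_2 = c(14, 10, 11, 7, 7, 12, 9, 13, 14, 7),
  cell_3 = c(30, 40, 31, 37, 32, 19, 25, 34, 27, 28)
)
rownames(counts_matrix) <- paste0("gene_", 1:10)
sce <- SingleCellExperiment(assays = list(counts = counts_matrix))

# Step 1: a simple log-normalization, stored as a second assay
sf <- colSums(counts(sce)) / mean(colSums(counts(sce)))
assay(sce, "logcounts") <- log2(sweep(counts(sce), 2, sf, "/") + 1)

# Step 2: PCA with cells as rows, keeping the first two components
pca <- stats::prcomp(t(logcounts(sce)))
reducedDim(sce, "PCA_prcomp") <- pca$x[, 1:2]

# Step 3: anything that fits nowhere else goes into metadata
metadata(sce)$pca_sdev <- pca$sdev

# Save the object and read it back, e.g. to hand it to a collaborator
path <- tempfile(fileext = ".rds")
saveRDS(sce, path)
restored <- readRDS(path)

assayNames(restored)
reducedDimNames(restored)
names(metadata(restored))
```

Everything added along the way (two assays, one reduced dimension entry and one metadata entry) is still present in the restored object.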

## 4.8 Conclusion

The SingleCellExperiment class is central to the Bioc-verse single-cell related packages. It is this connection to SingleCellExperiment that makes many of these packages easily interoperable and modular over the span of an scRNA-seq analysis. Furthermore, it allows anybody to implement their own analytical tools for scRNA-seq that build on the SingleCellExperiment, enabling decentralized scientific developments.