Chapter 8 Feature selection

8.1 Motivation

We often use scRNA-seq data in exploratory analyses to characterize heterogeneity across cells. Procedures like clustering and dimensionality reduction compare cells based on their gene expression profiles, which involves aggregating per-gene differences into a single (dis)similarity metric between a pair of cells. The choice of genes to use in this calculation has a major impact on the behavior of the metric and the performance of downstream methods. We want to select genes that contain useful information about the biology of the system while removing genes that contain random noise. This aims to preserve interesting biological structure without the variance that obscures that structure. It also reduces the size of the dataset to improve computational efficiency of later steps.

The simplest approach to feature selection is to select the most variable genes based on their expression across the population. This assumes that genuine biological differences will manifest as increased variation in the affected genes, compared to other genes that are only affected by technical noise or a baseline level of “uninteresting” biological variation (e.g., from transcriptional bursting). Several methods are available to quantify the variation per gene and to select the highly variable genes (HVGs), which we will discuss below. For demonstration, we will use the 10X PBMC dataset:

## class: SingleCellExperiment 
## dim: 33694 3922 
## metadata(0):
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(2): ID Symbol
## colnames(3922): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCACAGGTCCAC-1 TTTGTCATCCCAAGAT-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(0):

As well as the 416B dataset:

## class: SingleCellExperiment 
## dim: 46604 185 
## metadata(0):
## assays(2): counts logcounts
## rownames(46604): 4933401J01Rik Gm26206 ... CAAA01147332.1
##   CBFB-MYH11-mcherry
## rowData names(4): Length ENSEMBL SYMBOL SEQNAME
## colnames(185): SLX-9555.N701_S502.C89V9ANXX.s_1.r_1
##   SLX-9555.N701_S503.C89V9ANXX.s_1.r_1 ...
##   SLX-11312.N712_S507.H5H5YBBXX.s_8.r_1
##   SLX-11312.N712_S517.H5H5YBBXX.s_8.r_1
## colData names(9): Source Name cell line ... spike-in addition
##   block
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(2): ERCC SIRV

8.2 Quantifying per-gene variation

8.2.1 Variance of the log-counts

The simplest approach to quantifying per-gene variation is to simply compute the variance of the log-normalized expression values (referred to as “log-counts” for simplicity) for each gene across all cells in the population (???). This has an advantage in that the feature selection is based on the same log-values that are used for later downstream steps. Genes with the largest variances in log-values will contribute the most to the Euclidean distances between cells. By using log-values here, we ensure that our quantitative definition of heterogeneity is consistent throughout the entire analysis.

Calculation of the per-gene variance is simple but feature selection requires modelling of the mean-variance relationship. As discussed briefly in the last chapter, the log-transformation does not achieve perfect variance stabilization. This means that the variance of a gene is more affected by its abundance than the underlying biological heterogeneity. To account for this effect, we use the modelGeneVar() function to fit a trend to the variance with respect to abundance across all genes (Figure 8.1).

Variance in the PBMC data set as a function of the mean. Each point represents a gene while the blue line represents the trend fitted to all genes.

Figure 8.1: Variance in the PBMC data set as a function of the mean. Each point represents a gene while the blue line represents the trend fitted to all genes.

At any given abundance, we assume that the expression profiles of most genes are dominated by random technical noise (see 8.2.3 for details). Under this assumption, our trend represents an estimate of the technical noise as a function of abundance. We then break down the total variance of each gene into the technical component, i.e., the fitted value of the trend at that gene’s abundance; and the biological component, defined as the difference between the total variance and the technical component. This biological component represents the “interesting” variation for each gene and can be used as the metric for HVG selection.

## DataFrame with 33694 rows and 6 columns
##                     mean             total              tech
##                <numeric>         <numeric>         <numeric>
## LYZ     1.97720774557909   5.1093189779025 0.829103127339641
## S100A9  1.94951728004377  4.58067979752237 0.829355453833642
## S100A8  1.71813080049067  4.45055534363201 0.821139887638155
## HLA-DRA 2.09809035649378  3.73632552542341 0.825379685977392
## CD74    2.89964299605066  3.32217806233448 0.794847540253829
## ...                  ...               ...               ...
## PTMA    3.83104778025389 0.473973817102769 0.742787426970617
## HLA-B   4.50257580785005 0.477758219250799 0.758429131374342
## EIF1    3.24339798459787  0.47749233228004 0.773251495255292
## TMSB4X  6.08574497419255 0.405491648922873 0.748371851921361
## B2M     5.95574224289044   0.3039647664231 0.720380624981018
##                        bio               p.value                   FDR
##                  <numeric>             <numeric>             <numeric>
## LYZ       4.28021585056286 1.21236560942303e-263 1.19114921125812e-259
## S100A9    3.75132434368872 5.94533311061492e-203 3.89419318745278e-199
## S100A8    3.62941545599385 6.64673132537092e-194 3.26520676358846e-190
## HLA-DRA   2.91094583944602 2.66256736867182e-124 4.02457298418472e-121
## CD74      2.52733052208066 1.85532049884549e-101  1.73604989534828e-98
## ...                    ...                   ...                   ...
## PTMA    -0.268813609867848     0.992456096325914     0.999999999334435
## HLA-B   -0.280670912123543     0.993524806169525     0.999999999334435
## EIF1    -0.295759162975252     0.994893424392943     0.999999999334435
## TMSB4X  -0.342880202998488     0.998953636189745     0.999999999334435
## B2M     -0.416415858557918     0.999948154865494     0.999999999334435
## <numeric>

## 1.97720774557909

## 1.94951728004377

## 1.71813080049067

## 2.09809035649378

## 2.89964299605066

## ...

## 3.83104778025389

## 4.50257580785005

## 3.24339798459787

## 6.08574497419255

## 5.95574224289044

## <numeric>

## 5.1093189779025

## 4.58067979752237

## 4.45055534363201

## 3.73632552542341

## 3.32217806233448

## ...

## 0.473973817102769

## 0.477758219250799

## 0.47749233228004

## 0.405491648922873

## 0.3039647664231

## <numeric>

## 0.829103127339641

## 0.829355453833642

## 0.821139887638155

## 0.825379685977392

## 0.794847540253829

## ...

## 0.742787426970617

## 0.758429131374342

## 0.773251495255292

## 0.748371851921361

## 0.720380624981018

## <numeric>

## 4.28021585056286

## 3.75132434368872

## 3.62941545599385

## 2.91094583944602

## 2.52733052208066

## ...

## -0.268813609867848

## -0.280670912123543

## -0.295759162975252

## -0.342880202998488

## -0.416415858557918

## <numeric>

## 1.21236560942303e-263

## 5.94533311061492e-203

## 6.64673132537092e-194

## 2.66256736867182e-124

## 1.85532049884549e-101

## ...

## 0.992456096325914

## 0.993524806169525

## 0.994893424392943

## 0.998953636189745

## 0.999948154865494

## <numeric>

## 1.19114921125812e-259

## 3.89419318745278e-199

## 3.26520676358846e-190

## 4.02457298418472e-121

## 1.73604989534828e-98

## ...

## 0.999999999334435

## 0.999999999334435

## 0.999999999334435

## 0.999999999334435

## 0.999999999334435

(Careful readers will notice that some genes have negative biological components. Negative components of variation have no obvious interpretation and can be ignored for most applications. They are inevitable when fitting a trend to the per-gene variances, as approximately half of the genes will lie below the trend.)

8.2.2 Coefficient of variation

An alternative approach to quantification uses the squared coefficient of variation (CV2) of the normalized expression values prior to log-transformation. The CV2 is a widely used metric for describing variation in non-negative data and is closely related to the dispersion parameter of the negative binomial distribution in packages like edgeR and DESeq2. We compute the CV2 for each gene in the PBMC dataset using the modelGeneCV2() function, which provides a robust implementation of the approach described by Brennecke et al. (2013).

This allows us to model the mean-variance relationship when considering the relevance of each gene (Figure 8.2). Again, our assumption is that most genes contain random noise and that the trend captures mostly technical variation. Large CV\(^2\) values that deviate strongly from the trend are likely to represent genes affected by biological structure.

CV^2^ in the PBMC data set as a function of the mean. Each point represents a gene while the blue line represents the fitted trend.

Figure 8.2: CV2 in the PBMC data set as a function of the mean. Each point represents a gene while the blue line represents the fitted trend.

For each gene, we quantify the deviation from the trend in terms of the ratio of its CV2 to the fitted value of trend at its abundance. This is more appropriate than the directly subtracting the trend from the CV2, as the magnitude of the ratio is not affected by the mean.

## DataFrame with 33694 rows and 5 columns
##                          mean            total            trend
##                     <numeric>        <numeric>        <numeric>
## HIST1H2AC   0.903832087554492 267.529464650432  1.5624934717354
## GNG11       0.689915788093486 219.246474356749 1.98426985647823
## PRTFDC1    0.0411712190890846 3039.89950832652 30.0648545663972
## TNNC2       0.102005431273827  1211.4778002681  12.255298794817
## PF4           1.1070844889201 128.819231749223 1.31275273667162
## ...                       ...              ...              ...
## AC023491.2                  0              NaN              Inf
## AC233755.2                  0              NaN              Inf
## AC233755.1                  0              NaN              Inf
## AC213203.1                  0              NaN              Inf
## FAM231B                     0              NaN              Inf
##                       ratio   p.value
##                   <numeric> <numeric>
## HIST1H2AC  171.219572746948         0
## GNG11      110.492266785667         0
## PRTFDC1     101.11139907938         0
## TNNC2      98.8533874653847         0
## PF4        98.1290902320522         0
## ...                     ...       ...
## AC023491.2              NaN       NaN
## AC233755.2              NaN       NaN
## AC233755.1              NaN       NaN
## AC213203.1              NaN       NaN
## FAM231B                 NaN       NaN
## <numeric>

## 0.903832087554492

## 0.689915788093486

## 0.0411712190890846

## 0.102005431273827

## 1.1070844889201

## ...

## 0

## 0

## 0

## 0

## 0

## <numeric>

## 267.529464650432

## 219.246474356749

## 3039.89950832652

## 1211.4778002681

## 128.819231749223

## ...

## NaN

## NaN

## NaN

## NaN

## NaN

## <numeric>

## 1.5624934717354

## 1.98426985647823

## 30.0648545663972

## 12.255298794817

## 1.31275273667162

## ...

## Inf

## Inf

## Inf

## Inf

## Inf

## <numeric>

## 171.219572746948

## 110.492266785667

## 101.11139907938

## 98.8533874653847

## 98.1290902320522

## ...

## NaN

## NaN

## NaN

## NaN

## NaN

## <numeric>

## 0

## 0

## 0

## 0

## 0

## ...

## NaN

## NaN

## NaN

## NaN

## NaN

Both the CV2 and the variance of log-counts are effective metrics for quantifying variation in gene expression. The CV2 tends to give higher rank to low-abundance HVGs driven by upregulation in rare subpopulations, for which the increase in variance on the raw scale is stronger than that on the log-scale. However, the variation described by the CV2 is less directly relevant to downstream procedures operating on the log-counts, and the reliance on the ratio can assign high rank to uninteresting genes with low absolute variance. We generally prefer the use of the variance of log-counts and will use it in the following sections, though the many of the same principles apply to procedures based on the CV2.

8.2.3 Quantifying technical noise

The use of a trend fitted to endogenous genes assumes that the expression profiles of most genes are dominated by random technical noise. In practice, all expressed genes will exhibit some non-zero level of biological variability due to events like transcriptional bursting. This suggests that our estimates of the technical component are likely to be inflated. It would be more appropriate to consider these estimates as technical noise plus “uninteresting” biological variation, under the assumption that most genes are unaffected by the relevant heterogeneity in the population.

This assumption is generally reasonable but may be problematic in some scenarios where many genes at a particular abundance are affected by a biological process. For example, strong upregulation of cell type-specific genes may result in an enrichment of HVGs at high abundances. This would inflate the fitted trend at high abundances and compromise the detection of the affected genes. We can avoid this problem by fitting a mean-dependent trend to the variance of the spike-in transcripts, if they are available (Figure 8.3). The premise here is that spike-ins should not be affected by biological variation, so the fitted value of the spike-in trend should represent a better estimate of the technical component for each gene.

## DataFrame with 46604 rows and 6 columns
##                      mean             total             tech
##                 <numeric>         <numeric>        <numeric>
## Lyz2     6.61096803891878  13.8497201762179 1.57130530159628
## Ccl9     6.67845998360554  13.1869005745724 1.50034784951575
## Top2a    5.81023942468506  14.1787245984925 2.54775542207223
## Cd200r3  4.83179821635761  15.5612599938701 4.22984247867865
## Ccnb2    5.97776034107469  13.1392817339087 2.30176764474006
## ...                   ...               ...              ...
## Rpl5-ps2 3.60625401733281 0.612623344818866 6.32852660729456
## Gm11942  3.38767658515201 0.798570319939665 6.51473011137326
## Gm12816  2.91276213414496  0.83866968433397 6.57363560699119
## Gm13623  2.72843841213055 0.708071335002565 6.45448378131385
## Rps12l1  3.15419890541469 0.746615468755298   6.593316320265
##                        bio               p.value                   FDR
##                  <numeric>             <numeric>             <numeric>
## Lyz2      12.2784148746216 1.48992982137285e-186 1.54155912866129e-183
## Ccl9      11.6865527250566 2.21855297819158e-185 2.19978771758437e-182
## Top2a     11.6309691764202  3.80015396614872e-65  1.13040329915551e-62
## Cd200r3   11.3314175151915   9.4622094003004e-24  6.08573505672834e-22
## Ccnb2     10.8375140891687  3.68706024719105e-69  1.20193113290966e-66
## ...                    ...                   ...                   ...
## Rpl5-ps2 -5.71590326247569     0.999616247009411     0.999726145748132
## Gm11942   -5.7161597914336     0.999458916367196     0.999726145748132
## Gm12816  -5.73496592265722     0.999422192516362     0.999726145748132
## Gm13623  -5.74641244631128      0.99954376258778     0.999726145748132
## Rps12l1   -5.8467008515097     0.999521783342308     0.999726145748132
## <numeric>

## 6.61096803891878

## 6.67845998360554

## 5.81023942468506

## 4.83179821635761

## 5.97776034107469

## ...

## 3.60625401733281

## 3.38767658515201

## 2.91276213414496

## 2.72843841213055

## 3.15419890541469

## <numeric>

## 13.8497201762179

## 13.1869005745724

## 14.1787245984925

## 15.5612599938701

## 13.1392817339087

## ...

## 0.612623344818866

## 0.798570319939665

## 0.83866968433397

## 0.708071335002565

## 0.746615468755298

## <numeric>

## 1.57130530159628

## 1.50034784951575

## 2.54775542207223

## 4.22984247867865

## 2.30176764474006

## ...

## 6.32852660729456

## 6.51473011137326

## 6.57363560699119

## 6.45448378131385

## 6.593316320265

## <numeric>

## 12.2784148746216

## 11.6865527250566

## 11.6309691764202

## 11.3314175151915

## 10.8375140891687

## ...

## -5.71590326247569

## -5.7161597914336

## -5.73496592265722

## -5.74641244631128

## -5.8467008515097

## <numeric>

## 1.48992982137285e-186

## 2.21855297819158e-185

## 3.80015396614872e-65

## 9.4622094003004e-24

## 3.68706024719105e-69

## ...

## 0.999616247009411

## 0.999458916367196

## 0.999422192516362

## 0.99954376258778

## 0.999521783342308

## <numeric>

## 1.54155912866129e-183

## 2.19978771758437e-182

## 1.13040329915551e-62

## 6.08573505672834e-22

## 1.20193113290966e-66

## ...

## 0.999726145748132

## 0.999726145748132

## 0.999726145748132

## 0.999726145748132

## 0.999726145748132
Variance in the 416B data set as a function of the mean. Each point represents a gene (black) or spike-in transcript (red) and the blue line represents the trend fitted to all spike-ins.

Figure 8.3: Variance in the 416B data set as a function of the mean. Each point represents a gene (black) or spike-in transcript (red) and the blue line represents the trend fitted to all spike-ins.

In the absence of spike-in data, one can attempt to create a trend by making some distributional assumptions about the noise. For example, UMI counts typically exhibit near-Poisson noise when only technical effects are considered. This can be used to construct a mean-variance trend in the log-counts (Figure 8.4) with the modelGeneVarByPoisson() function. Note the increased residuals of the high-abundance genes, which can be interpreted as the amount of biological variation that was assumed to be “uninteresting” when fitting the gene-based trend in Figure 8.1.

## DataFrame with 6 rows and 6 columns
##                     mean            total              tech
##                <numeric>        <numeric>         <numeric>
## LYZ     1.97720774557909  5.1093189779025 0.625920315506784
## S100A9  1.94951728004377 4.58067979752237 0.631472547598291
## S100A8  1.71813080049067 4.45055534363201 0.672535358554566
## HLA-DRA 2.09809035649378 3.73632552542341 0.600680256888961
## CD74    2.89964299605066 3.32217806233448 0.425592152366567
## CST3     1.4921009978511  2.9688331724004 0.697923385536007
##                      bio   p.value       FDR
##                <numeric> <numeric> <numeric>
## LYZ     4.48339866239571         0         0
## S100A9  3.94920724992408         0         0
## S100A8  3.77801998507744         0         0
## HLA-DRA 3.13564526853445         0         0
## CD74    2.89658590996792         0         0
## CST3    2.27090978686439         0         0
## <numeric>

## 1.97720774557909

## 1.94951728004377

## 1.71813080049067

## 2.09809035649378

## 2.89964299605066

## 1.4921009978511

## <numeric>

## 5.1093189779025

## 4.58067979752237

## 4.45055534363201

## 3.73632552542341

## 3.32217806233448

## 2.9688331724004

## <numeric>

## 0.625920315506784

## 0.631472547598291

## 0.672535358554566

## 0.600680256888961

## 0.425592152366567

## 0.697923385536007

## <numeric>

## 4.48339866239571

## 3.94920724992408

## 3.77801998507744

## 3.13564526853445

## 2.89658590996792

## 2.27090978686439

## <numeric>

## 0

## 0

## 0

## 0

## 0

## 0

## <numeric>

## 0

## 0

## 0

## 0

## 0

## 0
Variance of normalized log-expression values for each gene in the PBMC dataset, plotted against the mean log-expression. The blue line represents represents the mean-variance relationship corresponding to Poisson noise.

Figure 8.4: Variance of normalized log-expression values for each gene in the PBMC dataset, plotted against the mean log-expression. The blue line represents represents the mean-variance relationship corresponding to Poisson noise.

8.2.4 Accounting for blocking factors

8.2.4.1 Fitting block-specific trends

Data containing multiple batches will often exhibit batch effects (see Chapter 18.2.6 for more details). We are (usually) not interested in HVGs that are driven by batch effects. Rather, we want to focus on genes that are highly variable within each batch. This is naturally achieved by performing trend fitting and variance decomposition separately for each batch. We demonstrate this approach by treating each plate (block) in the 416B dataset as a different batch, using the modelGeneVarWithSpikes() function. (The same argument is available in all other variance-modelling functions.)

## DataFrame with 6 rows and 6 columns
##                     mean            total             tech
##                <numeric>        <numeric>        <numeric>
## Lyz2    6.61235092956153 13.8618988024144 1.58416440878876
## Ccl9    6.67841214065115 13.2598761518269 1.44553397965825
## Top2a   5.81274666129111 14.0191605357462 2.74571164328693
## Cd200r3 4.83305175110888 15.5908569933105 4.31892122926251
## Ccnb2   5.97999269432625 13.0256084334992 2.46646680409343
## Hbb-bt  4.91682531222784 14.6538670496416 4.12156477107562
##                      bio               p.value                   FDR
##                <numeric>             <numeric>             <numeric>
## Lyz2    12.2777343936257                     0                     0
## Ccl9    11.8143421721686                     0                     0
## Top2a   11.2734488924592 3.89854825869685e-137 8.43397753747354e-135
## Cd200r3  11.271935764048  1.17783174428153e-54  7.00721550466689e-53
## Ccnb2   10.5591416294057 1.20380000061177e-151 2.98404464734982e-149
## Hbb-bt   10.532302278566   2.5263862540857e-49  1.34197351983209e-47
## <numeric>

## 6.61235092956153

## 6.67841214065115

## 5.81274666129111

## 4.83305175110888

## 5.97999269432625

## 4.91682531222784

## <numeric>

## 13.8618988024144

## 13.2598761518269

## 14.0191605357462

## 15.5908569933105

## 13.0256084334992

## 14.6538670496416

## <numeric>

## 1.58416440878876

## 1.44553397965825

## 2.74571164328693

## 4.31892122926251

## 2.46646680409343

## 4.12156477107562

## <numeric>

## 12.2777343936257

## 11.8143421721686

## 11.2734488924592

## 11.271935764048

## 10.5591416294057

## 10.532302278566

## <numeric>

## 0

## 0

## 3.89854825869685e-137

## 1.17783174428153e-54

## 1.20380000061177e-151

## 2.5263862540857e-49

## <numeric>

## 0

## 0

## 8.43397753747354e-135

## 7.00721550466689e-53

## 2.98404464734982e-149

## 1.34197351983209e-47

The use of a batch-specific trend fit is useful as it accommodates differences in the mean-variance trends between batches. This is especially important if batches exhibit systematic technical differences, e.g., differences in coverage or in the amount of spike-in RNA added. In this case, there are only minor differences between the trends in Figure 8.5, which indicates that the experiment was tightly replicated across plates. The analysis of each plate yields estimates of the biological and technical components for each gene, which are averaged across plates to take advantage of information from multiple batches.

Variance in the 416B data set as a function of the mean after blocking on the plate of origin. Each plot represents the results for a single plate, each point represents a gene (black) or spike-in transcript (red) and the blue line represents the trend fitted to all spike-ins.

Figure 8.5: Variance in the 416B data set as a function of the mean after blocking on the plate of origin. Each plot represents the results for a single plate, each point represents a gene (black) or spike-in transcript (red) and the blue line represents the trend fitted to all spike-ins.

8.2.4.2 Using a design matrix

The use of block-specific trends is the recommended approach for experiments with a single blocking factor. However, this is not practical for studies involving a large number of blocking factors and/or covariates. In such cases, we can use the design= argument to specify a design matrix with uninteresting factors of variation. We illustrate again with the 416B data set, blocking on the plate of origin and oncogene induction. (The same argument is available in modelGeneVar() when spike-ins are not available.)

## DataFrame with 46604 rows and 6 columns
##                      mean             total             tech
##                 <numeric>         <numeric>        <numeric>
## Lyz2     6.61096803891878  8.90512635193215 1.50405193582095
## Ccnb2    5.97776034107469  9.54372612545439 2.24180423717101
## Gem      5.90224546535998  9.54358133115793 2.35175324142717
## Cenpa    5.81348980307671  8.65621880001792    2.48791803297
## Idh1     5.99343230026856  8.32112833403297 2.21964586481642
## ...                   ...               ...              ...
## Gm5054   2.90433859612123 0.463698403028673 6.76999876266688
## Gm12191  3.55920216241565 0.170709357430238 6.53284597878746
## Gm7429   3.45394451046348 0.248350912621566 6.63457982170964
## Gm16378  2.83987048016898 0.208215208521281 6.74662681328272
## Rps2-ps2 3.11324156219337 0.202307294410046 6.78483536857142
##                        bio               p.value                   FDR
##                  <numeric>             <numeric>             <numeric>
## Lyz2      7.40107441611121 1.78185058463941e-172 1.28493025341406e-169
## Ccnb2     7.30192188828338  7.77223320238053e-77  1.44496744935195e-74
## Gem       7.19182808973075  5.49586828733217e-68  8.12330295861141e-66
## Cenpa     6.16830076704792  2.08034584949453e-45  1.52796265988954e-43
## Idh1      6.10148246921656  2.42818909044647e-55  2.41772450984748e-53
## ...                    ...                   ...                   ...
## Gm5054   -6.30630035963821     0.999999940536335     0.999999984861416
## Gm12191  -6.36213662135722     0.999999984522185     0.999999984861416
## Gm7429   -6.38622890908807     0.999999977712902     0.999999984861416
## Gm16378  -6.53841160476144     0.999999981961201     0.999999984861416
## Rps2-ps2 -6.58252807416138      0.99999998255976     0.999999984861416
## <numeric>

## 6.61096803891878

## 5.97776034107469

## 5.90224546535998

## 5.81348980307671

## 5.99343230026856

## ...

## 2.90433859612123

## 3.55920216241565

## 3.45394451046348

## 2.83987048016898

## 3.11324156219337

## <numeric>

## 8.90512635193215

## 9.54372612545439

## 9.54358133115793

## 8.65621880001792

## 8.32112833403297

## ...

## 0.463698403028673

## 0.170709357430238

## 0.248350912621566

## 0.208215208521281

## 0.202307294410046

## <numeric>

## 1.50405193582095

## 2.24180423717101

## 2.35175324142717

## 2.48791803297

## 2.21964586481642

## ...

## 6.76999876266688

## 6.53284597878746

## 6.63457982170964

## 6.74662681328272

## 6.78483536857142

## <numeric>

## 7.40107441611121

## 7.30192188828338

## 7.19182808973075

## 6.16830076704792

## 6.10148246921656

## ...

## -6.30630035963821

## -6.36213662135722

## -6.38622890908807

## -6.53841160476144

## -6.58252807416138

## <numeric>

## 1.78185058463941e-172

## 7.77223320238053e-77

## 5.49586828733217e-68

## 2.08034584949453e-45

## 2.42818909044647e-55

## ...

## 0.999999940536335

## 0.999999984522185

## 0.999999977712902

## 0.999999981961201

## 0.99999998255976

## <numeric>

## 1.28493025341406e-169

## 1.44496744935195e-74

## 8.12330295861141e-66

## 1.52796265988954e-43

## 2.41772450984748e-53

## ...

## 0.999999984861416

## 0.999999984861416

## 0.999999984861416

## 0.999999984861416

## 0.999999984861416

This strategy is simple but somewhat inaccurate as it does not consider the mean expression in each blocking level. Briefly, the technical component is estimated as the fitted value of the trend at the average abundance for each gene. However, the true technical component is the average of the fitted values at the per-block means, which may be quite different for strong batch effects and non-linear mean-variance relationships. The block= approach is safer and should be preferred in all situations where it is applicable.

8.3 Selecting highly variable genes

8.3.1 Based on the largest metrics

Once we have quantified the per-gene variation, the next step is to select the subset of HVGs to use in downstream analyses. The simplest approach is to simply take the top \(X\) genes with the largest values for the relevant measure of variation. For modelGeneVar() and modelGeneVarWithSpikes(), this would be the genes with the largest biological components:

For modelGeneCV2() (and its relative, modelGeneCV2WithSpikes()), this would instead be the genes with the largest ratios:

The main advantage of this approach is that the user can easily control the number of genes retained. This ensures that the computational complexity of downstream calculations is easily predicted. It is also fairly easy to translate the choice of \(X\) into a biological statement. Recall our trend-fitting assumption that most genes are not differentially expressed between cell types or states in our population. If we quantify this assumption into a statement that, e.g., no more than 5% of genes are differentially expressed, we can simply take the top 5% of genes with the largest variance.

The main disadvantage of this approach that it turns HVG selection into a competition between genes, whereby a subset of very highly variable genes can push other informative genes out of the top set. This can be problematic if a single subpopulation is very different from the others. In such cases, the top set will be dominated by differentially expressed genes involving the outlier subpopulation, compromising resolution of heterogeneity between the other populations. Similar problems are encountered when the magnitude of the chosen variance measure varies with abundance.

The choice of \(X\) is also fairly arbitrary, with any value from 500 to 5000 being considered “reasonable”. We have chosen \(X=1000\) in the code above though there is no particular a priori reason for doing so. A larger \(X\) will reduce the risk of discarding signal at the cost of increasing noise that obscures signal. Our recommendation is to simply pick an arbitrary \(X\) and proceed with the rest of the analysis, with the intention of testing other choices later, rather than spending much time worrying about obtaining the “optimal” value7.

8.3.2 Based on a fixed threshold

Another approach to feature selection is to set a fixed threshold of one of the metrics. This is most commonly done with the (adjusted) \(p\)-value reported by each of the above methods. The \(p\)-value for each gene is generated by testing against the null hypothesis that the variance is equal to the trend. For example, we might define our HVGs as all genes that have adjusted \(p\)-values below 0.05.

## [1] 642

This approach is simple to implement and - if the test holds its size - it controls the false discovery rate (FDR). That is, it returns a subset of genes where the proportion of false positives is expected to be below the specified threshold. This can occasionally be useful in applications where the HVGs themselves are of interest. For example, if a collaborator were to take a list of HVGs back to the bench to verify the existence of heterogeneous expression for some of the genes, we would want to control the FDR in that list.

The downside of this approach is that it is less predictable than the top \(X\) strategy. The number of genes returned depends on the type II error rate of the test and the severity of the multiple testing correction. One might obtain no genes or every gene at a given FDR threshold, depending on the circumstances. Moreover, control of the FDR is usually not helpful at this stage of the analysis. We are not interpreting the individual HVGs themselves but are only using them for feature selection prior to downstream steps. There is no reason to think that a 5% threshold on the FDR yields a more suitable compromise between bias and noise.

8.3.3 Keeping all genes above the trend

As the title suggests, this involves keeping all genes above the trend. The rationale is to only remove the obviously uninteresting genes with variances below the trend. By doing so, we avoid the need to make any judgement calls regarding what level of variation is interesting enough to retain. This approach represents one extreme of the bias-variance trade-off where bias is minimized at the cost of maximizing noise. For decomposeVar(), it equates to keeping all positive biological components:

## [1] 12775

For improvedCV2(), this involves keeping all ratios above 1:

## [1] 9390

This strategy is the most conservative as it does not discard any potential biological signal. Weak or secondary population structure is given the chance to manifest as the affected genes are retained. This makes it useful for reliable automated processing of diverse data sets where the primary factor of variation in one data set is a secondary factor in another data set (and thus overlooked by the top \(X\) approach). The obvious cost is that more noise is also captured, which can reduce the resolution of otherwise well-separated populations. From a practical perspective, the use of more genes involves more computational work in each downstream step.

8.4 Putting it all together

The few lines of code below will select the top 10% of genes with the highest biological components.

## [1] 3369

We can then subset the SingleCellExperiment to only retain our selection of HVGs. This ensures that downstream methods will only use these genes for their calculations.

## [1] 3369 3922

Alternatively, some methods may allow users to pass in the full SingleCellExperiment object and specify the genes to use via an extra argument like subset.row=. This may be more convenient in the context of the overall analysis, where genes outside of this subset may still be of interest during DE analyses or for visualization.

Session Info

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS:   /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRblas.so
LAPACK: /home/ramezqui/Rbuild/danbuild/R-3.6.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] scran_1.13.18               SingleCellExperiment_1.7.8 
 [3] SummarizedExperiment_1.15.9 DelayedArray_0.11.4        
 [5] BiocParallel_1.19.2         matrixStats_0.55.0         
 [7] Biobase_2.45.1              GenomicRanges_1.37.15      
 [9] GenomeInfoDb_1.21.1         IRanges_2.19.14            
[11] S4Vectors_0.23.21           BiocGenerics_0.31.5        
[13] Cairo_1.5-10                BiocStyle_2.13.2           
[15] OSCAUtils_0.0.1            

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2               rsvd_1.0.2              
 [3] locfit_1.5-9.1           lattice_0.20-38         
 [5] assertthat_0.2.1         digest_0.6.20           
 [7] R6_2.4.0                 evaluate_0.14           
 [9] highr_0.8                ggplot2_3.2.1           
[11] pillar_1.4.2             zlibbioc_1.31.0         
[13] rlang_0.4.0              lazyeval_0.2.2          
[15] irlba_2.3.3              Matrix_1.2-17           
[17] rmarkdown_1.15           BiocNeighbors_1.3.3     
[19] statmod_1.4.32           stringr_1.4.0           
[21] igraph_1.2.4.1           RCurl_1.95-4.12         
[23] munsell_0.5.0            vipor_0.4.5             
[25] compiler_3.6.1           BiocSingular_1.1.5      
[27] xfun_0.9                 pkgconfig_2.0.2         
[29] ggbeeswarm_0.6.0         htmltools_0.3.6         
[31] tidyselect_0.2.5         gridExtra_2.3           
[33] tibble_2.1.3             GenomeInfoDbData_1.2.1  
[35] bookdown_0.13            edgeR_3.27.13           
[37] viridisLite_0.3.0        crayon_1.3.4            
[39] dplyr_0.8.3              bitops_1.0-6            
[41] grid_3.6.1               gtable_0.3.0            
[43] magrittr_1.5             scales_1.0.0            
[45] dqrng_0.2.1              stringi_1.4.3           
[47] XVector_0.25.0           viridis_0.5.1           
[49] limma_3.41.16            scater_1.13.18          
[51] DelayedMatrixStats_1.7.2 tools_3.6.1             
[53] beeswarm_0.2.3           glue_1.3.1              
[55] purrr_0.3.2              yaml_2.2.0              
[57] colorspace_1.4-1         BiocManager_1.30.4      
[59] knitr_1.24              

Bibliography

Brennecke, P., S. Anders, J. K. Kim, A. A. Kołodziejczyk, X. Zhang, V. Proserpio, B. Baying, et al. 2013. “Accounting for technical noise in single-cell RNA-seq experiments.” Nat. Methods 10 (11):1093–5.


  1. When everything’s said and done, the “optimal” value is one that gets you a figure for your manuscript.