Hashing-based deconvolution workflow#

Cell hashing is a sample processing technique in which each sample is processed individually to “tag” the cell membrane or the nuclei with a unique oligonucleotide barcode. The cells are then washed, or the reaction is quenched, so that the samples can be safely mixed and processed with the standard library preparation procedure. This yields two libraries, one for the scRNA and one for the hashing oligos (HTO), which are sequenced independently. Each library produces its own single-cell count matrix: one for RNA and one for HTO. The hashtag counts are then processed bioinformatically to determine each cell’s sample of origin.

hash_demulti in hadge#

Quick start#

nextflow run ${hadge_project_dir}/main.nf -profile test,conda --mode hashing

Example case#

Case 1: Run the entire hashing-based mode:

nextflow run ${hadge_project_dir}/main.nf -profile conda --outputdir ${output_dir} --mode hashing --hto_matrix_raw ${hto_raw_dir} --hto_matrix_filtered ${hto_filtered_dir} --rna_matrix_raw ${rna_raw_dir} --rna_matrix_filtered ${rna_filtered_dir}

Case 2: Run Multiseq with raw counts:

nextflow run ${hadge_project_dir}/main.nf -profile conda --outputdir ${output_dir} --mode hashing --rna_matrix_multiseq raw --hto_matrix_multiseq raw # plus the additional parameters from Case 1

Case 3: Run the pipeline with different combinations of parameter values. This is only available in single-sample mode. The values must be separated by semicolons and enclosed in double quotes.

nextflow run ${hadge_project_dir}/main.nf -profile conda --mode hashing --quantile_multi "0.5;0.7" # plus the additional parameters from Case 1
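When the candidate values come from a script, the semicolon-separated string can be assembled before it is passed to the pipeline. The following is a minimal sketch, assuming the values are known in advance. The variable name quantile_arg is illustrative, and the option is only printed, not executed:

```shell
# Join candidate quantile values with ';' as required for parameter combinations.
quantile_arg=$(printf '%s\n' 0.5 0.7 | paste -sd ';' -)

# Print the resulting option. The double quotes keep the shell from
# treating ';' as a command separator.
echo "--quantile_multi \"${quantile_arg}\""
```

The quoting matters here: without the double quotes, the shell would end the nextflow command at the first semicolon.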

Input data preparation#

The input data depends heavily on the deconvolution tools. In the following table, you will find the minimal input data required by different tools.

| Deconvolution method | Input data |
| --- | --- |
| HTODemux | UMI and hashing count matrix |
| Multiseq | UMI and hashing count matrix |
| HashSolo | Required: hashing count matrix. Optional: UMI count matrix |
| HashedDrops | Hashing count matrix |
| Demuxem | Both UMI and hashing count matrix |

As with the genotype-based deconvolution methods, the hashing methods share some of their input. We therefore use the common input parameters params.[rna/hto]_matrix_[raw/filtered] to store the count matrices, and params.[rna/hto]_matrix_[method] to specify whether each method uses the raw or the filtered counts. For example, hto_matrix_hashedDrops = "raw" means that the raw HTO count matrix is used as input for HashedDrops.

| Input data | Parameter |
| --- | --- |
| Raw scRNAseq count matrix | params.rna_matrix_raw |
| Filtered scRNAseq count matrix | params.rna_matrix_filtered |
| Raw HTO count matrix | params.hto_matrix_raw |
| Filtered HTO count matrix | params.hto_matrix_filtered |

Pre-processing#

As in the genetic demultiplexing workflow, we provide a pre-processing step that is required before running HTODemux and Multiseq. In this step, the count matrices are loaded from the parameters set above into a Seurat object.

Output#

By default, the pipeline runs on a single sample, and all pipeline output is saved in the folder $projectDir/$params.outdir/hashing/hash_demulti. When the pipeline runs on multiple samples, the output is saved in the folder $projectDir/$params.outdir/$sampleId/hashing/hash_demulti. For simplicity, we refer to this folder as $pipeline_output_folder from now on.
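The two folder layouts can be illustrated with plain shell variables. This is only a sketch: the values assigned to projectDir, outdir and sampleId are made-up placeholders, whereas in a real run they are provided by Nextflow and the pipeline parameters.

```shell
# Illustrative values (assumptions); in a real run they come from Nextflow.
projectDir="/work/hadge"
outdir="result"
sampleId="sample1"

# Single-sample run: no sample level in the path.
single_out="${projectDir}/${outdir}/hashing/hash_demulti"

# Multi-sample run: one sub-folder per sample.
multi_out="${projectDir}/${outdir}/${sampleId}/hashing/hash_demulti"

echo "$single_out"
echo "$multi_out"
```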

Pre-processing#

output directory: $pipeline_output_folder/preprocess/preprocess_[task_ID/sampleId]

  • ${params.preprocessOut}.rds: pre-processed data in an RDS object

  • params.csv: specified parameters in the hashing pre-processing task

HTODemux#

output directory: $pipeline_output_folder/htodemux/htodemux_[task_ID/sampleId]

  • ${params.assignmentOutHTO}_assignment_htodemux.csv: the assignment of HTODemux

  • ${params.assignmentOutHTO}_classification_htodemux.csv: the classification of HTODemux as singlet, doublet and negative droplets

  • ${params.objectOutHTO}.rds: the result of HTODemux in an RDS object

  • params.csv: specified parameters in the HTODemux task

Optionally:

  • ridge.jpeg: a ridge plot showing the enrichment of selected HTOs

  • featureScatter.jpeg: a scatter plot showing the signal of two selected HTOs

  • violinPlot.jpeg: a violin plot showing selected features

  • tSNE.jpeg: a 2D plot based on tSNE embedding of HTOs

  • heatMap.jpeg: a heatmap of hashtag oligo signals across singlets, doublets and negative cells

  • visual_params.csv: specified parameters for visualisation of the HTODemux result

Multiseq#

output directory: $pipeline_output_folder/multiseq/multiseq_[task_ID/sampleId]

  • ${params.assignmentOutMulti}_res.csv: the assignment of Multiseq

  • ${params.objectOutMulti}.rds: the result of Multiseq in an RDS object

  • params.csv: specified parameters in the Multiseq task

Demuxem#

output directory: $pipeline_output_folder/demuxem/demuxem_[task_ID/sampleId]

  • ${params.objectOutDemuxem}_demux.zarr.zip: RNA expression matrix with demultiplexed sample identities in Zarr format

  • ${params.objectOutDemuxem}.out.demuxEM.zarr.zip: DemuxEM-calculated results in Zarr format, containing two datasets, one for HTO and one for RNA

  • ${params.objectOutDemuxem}.ambient_hashtag.hist.pdf: A histogram plot depicting hashtag distributions of empty droplets and non-empty droplets

  • ${params.objectOutDemuxem}.background_probabilities.bar.pdf: A bar plot visualizing the estimated hashtag background probability distribution

  • ${params.objectOutDemuxem}.real_content.hist.pdf: A histogram plot depicting hashtag distributions of not-real-cells and real-cells as defined by total number of expressed genes in the RNA assay

  • ${params.objectOutDemuxem}.rna_demux.hist.pdf: This figure consists of two plots. The first one is a horizontal bar plot depicting the percentage of RNA barcodes with at least one HTO count. The second plot is a histogram plot depicting RNA UMI distribution for singlets, doublets and unknown cells.

  • ${params.objectOutDemuxem}.gene_name.violin.pdf: Violin plots depicting gender-specific gene expression across samples.

  • ${params.objectOutDemuxem}_summary.csv: the classification of Demuxem

  • ${params.objectOutDemuxem}_obs.csv: the assignment of Demuxem

  • params.csv: specified parameters in the Demuxem task

Optionally:

  • ${params.objectOutDemuxem}.{gene_name}.violin.pdf: violin plots using the specified gender-specific gene
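To see how the file name templates above expand, the following sketch prints the regular Demuxem output names for the default prefix demuxem_res (the documented default of objectOutDemuxem). It only builds strings and does not inspect a real output folder. The optional violin plots are left out because their names depend on the genes supplied.

```shell
# Default prefix taken from the objectOutDemuxem parameter.
prefix="demuxem_res"

# Suffixes of the regular output files listed above.
expected=""
for suffix in \
  _demux.zarr.zip \
  .out.demuxEM.zarr.zip \
  .ambient_hashtag.hist.pdf \
  .background_probabilities.bar.pdf \
  .real_content.hist.pdf \
  .rna_demux.hist.pdf \
  _summary.csv \
  _obs.csv
do
  # Append one file name per line.
  expected="${expected}${prefix}${suffix}
"
done

printf '%s' "$expected"
```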

HashSolo#

output directory: $pipeline_output_folder/hashsolo/hashsolo_[task_ID/sampleId]

  • ${params.assignmentOutHashSolo}_res.csv: the assignment of HashSolo

  • ${params.plotOutHashSolo}.jpg: plot of HashSolo demultiplexing results for QC checks

  • params.csv: specified parameters in the HashSolo task

HashedDrops#

output directory: $pipeline_output_folder/hashedDrops/hashedDrops_[task_ID/sampleId]

  • ${params.objectOutEmptyDrops}.rds: the result of emptyDrops in an RDS object

  • ${params.assignmentOutEmptyDrops}.csv: the result of emptyDrops in a csv file

  • plot_emptyDrops.png: a diagnostic plot comparing the total count against the negative log-probability

  • ${params.objectOutHashedDrops}.rds: the result of hashedDrops in an RDS object

  • ${params.assignmentOutHashedDrops}_res.csv: the assignment of HashedDrops

  • ${params.objectOutHashedDrops}_LogFC.png: a diagnostic plot comparing the log-fold change between the second HTO’s abundance and the ambient contamination

  • params.csv: specified parameters in the HashedDrops task

GMM-Demux#

output directory: $pipeline_output_folder/gmm_demux/gmm_demux_[task_ID/sampleId]

  • features.tsv.gz: default content in the output folder are the non-MSM droplets (SSDs), stored in MTX format.

  • barcodes.tsv.gz: default content in the output folder are the non-MSM droplets (SSDs), stored in MTX format.

  • matrix.mtx.gz: default content in the output folder are the non-MSM droplets (SSDs), stored in MTX format.

  • GMM_full.csv: The classification file containing the label of each droplet as well as the probability of the classification.

  • GMM_full.config: Used to assign each classification to a donor using the numbers listed in the config file

  • gmm_demux_${task.index}_report.txt: the summary report, produced only if GMM-Demux finds a viable solution that satisfies the droplet formation model

  • params.csv: specified parameters in the GMM-Demux task

BFF#

output directory: $pipeline_output_folder/bff/bff_[task_ID/sampleId]

  • ${params.assignmentOutBff}_assignment_bff.csv: the assignment and classification results produced by BFF

  • params.csv: specified parameters in the BFF task

Parameter#

Preprocessing#

ndelim

Delimiter in the cell’s column name that is used to derive the initial identity class of each cell. Default: _

sel_method

The selection method used to choose top variable features. Default: mean.var.plot

n_features

Number of features to be used when finding variable features. Default: 2000

assay

Assay name for HTO modality. Default: HTO

norm_method

Method for normalization of HTO data. Default: CLR

margin

If performing CLR normalization, normalize across features (1) or cells (2). Default: 2

preprocessOut

Name of the output Seurat object. Default: preprocessed

HTODemux#

htodemux

Whether to perform HTODemux. Default: True

rna_matrix_htodemux

Whether to use raw or filtered scRNA-seq count matrix. Default: filtered

hto_matrix_htodemux

Whether to use raw or filtered HTO count matrix. Default: filtered

assay

Name of the hashtag assay. Default: HTO

quantile_htodemux

The quantile of inferred ‘negative’ distribution for each hashtag, over which the cell is considered ‘positive’. Default: 0.99

kfunc

Clustering function for initial hashtag grouping. Default: clara.

nstarts

nstarts value for k-means clustering when kfunc=kmeans. Default: 100

nsamples_clustering

Number of samples to be drawn from the dataset used for clustering when kfunc=clara. Default: 100

seed

Sets the random seed. Default: 42

init

Initial number of clusters for hashtags. Default: None, which means the number of hashtag oligo names + 1 to account for negatives.

objectOutHTO

Name of the output Seurat object. Default: htodemux

assignmentOutHTO

Prefix of the output CSV files. Default: htodemux

ridgePlot

Whether to generate a ridge plot to visualize enrichment for all HTOs. Default: True

ridgeNCol

Number of columns in the ridge plot. Default: 3

featureScatter

Whether to generate a scatter plot to visualize pairs of HTO signals. Default: False

scatterFeat1

First feature to plot. Default: None

scatterFeat2

Second feature to plot. Default: None

vlnplot

Whether to generate a violin plot, e.g. to compare number of UMIs for singlets, doublets and negative cells. Default: True

vlnFeatures

Features to plot. Default: nCount_RNA

vlnLog

Whether to plot the feature axis on log scale. Default: True

tsne

Whether to generate a 2D tSNE embedding for HTOs. Default: True

tsneIdents

Subset Seurat object based on identity class. Default: Negative

tsneInvert

Whether to keep or remove the identity class. Default: True

tsneVerbose

Whether to print the top genes associated with high/low loadings for the PCs when running PCA. Default: False

tsneApprox

Whether to use truncated singular value decomposition to approximate PCA. Default: False

tsneDimMax

Number of dimensions to use as input features when running t-SNE dimensionality reduction. Default: 2

tsnePerplexity

Perplexity when running t-SNE dimensionality reduction. Default: 100

heatmap

Whether to generate an HTO heatmap. Default: True

heatmapNcells

Number of cells to plot. Default: 5000

Multiseq#

multiseq

Whether to perform Multiseq. Default: True

rna_matrix_multiseq

Whether to use raw or filtered scRNA-seq count matrix. Default: filtered

hto_matrix_multiseq

Whether to use raw or filtered HTO count matrix. Default: filtered

assay

Name of the hashtag assay, same as used for HTODemux. Default: HTO

quantile_multi

The quantile to use for classification. Default: 0.7

autoThresh

Whether to perform automated threshold finding to define the best quantile. Default: True

maxiter

Maximum number of iterations if autoThresh=True. Default: 100

qrangeFrom

The minimal possible quantile value to try if autoThresh=True. Default: 0.1

qrangeTo

The maximal possible quantile value to try if autoThresh=True. Default: 0.9

qrangeBy

The constant difference of a range of possible quantile values to try if autoThresh=True. Default: 0.05

verbose_multiseq

Whether to print the output. Default: True

assignmentOutMulti

Prefix of the output CSV files. Default: multiseq

objectOutMulti

Name of the output Seurat object. Default: multiseq
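The qrangeFrom, qrangeTo and qrangeBy parameters above describe a grid of candidate quantiles. As a sketch, assuming the grid is an evenly spaced sequence from qrangeFrom to qrangeTo in steps of qrangeBy (as in the seq(from, to, by) default of Seurat's MULTIseqDemux), the default settings correspond to the following values:

```shell
# Enumerate the candidate quantiles for the default range
# (qrangeFrom=0.1, qrangeTo=0.9, qrangeBy=0.05).
# Integer hundredths avoid floating-point drift in the loop counter.
quantile_grid=$(awk 'BEGIN { for (q = 10; q <= 90; q += 5) printf "%.2f\n", q / 100 }')

echo "$quantile_grid"
echo "number of candidates: $(echo "$quantile_grid" | wc -l | tr -d ' ')"
```

A smaller qrangeBy gives a finer grid, at the cost of more candidate quantiles to evaluate when autoThresh=True.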

HashSolo#

hashsolo

Whether to perform HashSolo. Default: True

use_rna_data

Whether to use RNA counts for deconvolution. Default: False

rna_matrix_hashsolo

Whether to use raw or filtered scRNA-seq count matrix if use_rna_data is set to True. Default: raw

hto_matrix_hashsolo

Whether to use raw or filtered HTO count matrix. Default: raw

priors_negative

Prior for the negative hypothesis. Default: 1/3

priors_singlet

Prior for the singlet hypothesis. Default: 1/3

priors_doublet

Prior for the doublet hypothesis. Default: 1/3

pre_existing_clusters

Column in the input data for how to break up demultiplexing. Default: None

number_of_noise_barcodes

Number of barcodes to use to create noise distribution. Default: None

assignmentOutHashSolo

Prefix of the output CSV files. Default: hashsolo

plotOutHashSolo

Prefix of the output figures. Default: hashsolo

DemuxEm#

demuxem

Whether to perform Demuxem. Default: True

rna_matrix_demuxem

Whether to use raw or filtered scRNA-seq count matrix. Default: raw

hto_matrix_demuxem

Whether to use raw or filtered HTO count matrix. Default: raw

threads_demuxem

Number of threads to use. Must be a positive integer. Default: 1

alpha_demuxem

The Dirichlet prior concentration parameter (alpha) on samples. An alpha value < 1.0 will make the prior sparse. Default: 0.0

alpha_noise

The Dirichlet prior concentration parameter on the background noise. Default: 1.0

min_num_genes

Only demultiplex cells/nuclei with at least the specified number of expressed genes. Default: 100

min_num_umis

Only demultiplex cells/nuclei with at least the specified number of UMIs. Default: 100

min_signal

Any cell/nucleus with fewer than min_signal hashtags from the signal will be marked as unknown. Default: 10

tol

Threshold used for the EM convergence. Default: 1e-6

generate_gender_plot

Generate violin plots using gender-specific genes (e.g. Xist). Value is a comma-separated list of gene names. Default: None

random_state

Random seed set for reproducing results. Default: 0

objectOutDemuxem

Prefix of the output files. Default: demuxem_res

HashedDrops#

hashedDrops

Whether to perform hashedDrops. Default: True

hto_matrix_hashedDrops

Whether to use raw or filtered HTO count matrix. Default: raw

lower

The lower bound on the total UMI count, at or below which all barcodes are assumed to correspond to empty droplets. Default: 100

niters

The number of iterations to use for the Monte Carlo p-value calculations. Default: 10000

testAmbient

Whether results should be returned for barcodes with totals less than or equal to lower. Default: True

ignore_hashedDrops

The lower bound on the total UMI count, at or below which barcodes will be ignored. Default: None

alpha_hashedDrops

The scaling parameter for the Dirichlet-multinomial sampling scheme. Default: None

round

Whether to check for non-integer values in m and, if present, round them for ambient profile estimation. Default: True

byRank

If set, this is used to redefine lower and any specified value for lower is ignored. Default: None

isCellFDR

FDR Threshold to filter the cells for empty droplet detection. Default: 0.01

objectOutEmptyDrops

Prefix of the emptyDroplets output RDS object. Default: emptyDroplets

assignmentOutEmptyDrops

Prefix of the emptyDroplets output CSV file. Default: emptyDroplets

ambient

Whether to use the relative abundance of each HTO in the ambient solution from emptyDrops, set True only when testAmbient=True. Default: False

minProp

Numeric scalar used to infer the ambient profile when ambient=None. Default: 0.05

pseudoCount

The minimum pseudo-count when computing log-fold changes. Default: 5

constantAmbient

Whether a constant level of ambient contamination should be used to estimate LogFC2 for all cells. Default: False

doubletNmads

The number of median absolute deviations (MADs) to use to identify doublets. Default: 3

doubletMin

The minimum threshold on the log-fold change to use to identify doublets. Default: 2

doubletMixture

Whether to use a 2-component mixture model to identify doublets. Default: False

confidentNmads

The number of MADs to use to identify confidently assigned singlets. Default: 3

confidenMin

The minimum threshold on the log-fold change to use to identify singlets. Default: 2

combinations

An integer matrix specifying valid combinations of HTOs. Each row corresponds to a single sample and specifies the indices of rows in x corresponding to the HTOs used to label that sample. Default: None

objectOutHashedDrops

Prefix of the hashedDrops output RDS object. Default: hashedDrops

assignmentOutHashedDrops

Prefix of the hashedDrops output CSV file. Default: hashedDrops

GMM-Demux#

gmmDemux

Whether to perform GMMDemux. Default: True

hto_matrix_gmm_demux

Whether to use raw or filtered HTO count matrix. Default: filtered

assignmentOutGmmDemux

Name for the folder output. Default: gmm_demux

hto_name_gmm

List of sample tags (HTOs) separated by ‘,’ without whitespace. Default: None

summary

The estimated total count of cells in the single cell assay. Default: 2000

report_gmm

Name for the file generated by the summary. Default: report.txt

mode_GMM

Format of the input, either tsv or csv. Default: tsv

extract

Names of the sample barcoding tag(s) to extract, separated by ‘,’. Joint tags are linked with ‘+’. Default: None

threshold_gmm

Provide the confidence threshold value. Requires a float in (0,1). Default: 0.8

ambiguous

The estimated chance of having a phony GEM getting included in a pure type GEM cluster by the clustering algorithm. Default: 0.5.

plotOutHashSolo

Prefix of the output figures. Default: hashsolo

BFF#

BFF

Whether to perform BFF. Default: False

hto_matrix_bff

Whether to use raw or filtered HTO count matrix. Default: raw

rna_matrix_bff

Whether to use raw or filtered scRNA-seq count matrix. Default: raw

assignmentOutBff

Name for the folder output. Default: bff

methods

Method or list of methods to be used. Default: combined_bff

methodsForConsensus

A consensus call will be generated using all methods specified. Default: NULL

cellbarcodeWhitelist

A vector of expected cell barcodes. Default: NULL

metricsFile

Summary metrics will be written to this file. Default: metrics_bff.cvs

doTSNE

If true, tSNE will be run on the resulting hashing calls after each caller. Default: True

doHeatmap

If true, Seurat::HTOHeatmap will be run on the results of each caller. Default: True

perCellSaturation

An optional dataframe with the columns cellbarcode and saturation. Default: NULL

majorityConsensusThreshold

This applies to calculating a consensus call when multiple algorithms are used. Default: NULL

chemistry

This string is passed to EstimateMultipletRate. Should be either 10xV2 or 10xV3. Default: 10xV3

callerDisagreementThreshold

If provided, the agreement rate will be calculated between each caller and the simple majority call, ignoring discordant and no-call cells. Default: NULL

preprocess_bff

When True, the data is preprocessed using the method ProcessCountMatrix from CellHashR. Default: False

barcodeWhitelist

A vector of barcode names to retain. This parameter is used only when the pre-processing step is executed. Default: NULL

General Use#

Single sample use#

Running the pipeline on a single sample requires certain parameters to be defined so that the tools can run with their default configuration. The parameter --mode hashing must be included to run only the hashing tools.

GMM-Demux#

The names of the hashtags must be given as a list of strings, separated by ‘,’. This list is passed with the parameter --hto_name_gmm.
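Because this list must not contain whitespace, it can help to build and check it before launching the pipeline. The following is a minimal sketch with hypothetical hashtag names. The variable hto_names and the whitespace check are illustrative and not part of hadge, and the option is only printed:

```shell
# Hypothetical hashtag names; replace with the names used in the HTO matrix.
hto_names=$(printf '%s\n' hto_name_1 hto_name_2 hto_name_3 | paste -sd ',' -)

# Refuse values containing whitespace, since the list must not contain any.
case "$hto_names" in
  *[[:space:]]*) echo "error: whitespace in HTO list" >&2; exit 1 ;;
esac

echo "--hto_name_gmm \"${hto_names}\""
```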

BFF#

The demultiplexing method for the experiment must be given with the parameter --methods. Multiple methods can be given as a list, separated by ‘,’. In addition, the method or methods used for the consensus call must be given with the parameter --methodsForConsensus.

nextflow run main.nf --mode hashing --match_donor False \
  --hto_matrix_raw /data_folder/raw_hto_data \
  --hto_matrix_filtered /data_folder/filtered_hto_data \
  --barcodes /data_folder/filtered_hto_data/barcodes.tsv.gz \
  --rna_matrix_raw /data_folder/raw_rna_data \
  --rna_matrix_filtered /data_folder/filtered_rna_data \
  --hto_name_gmm "hto_name_1,hto_name_2,hto_name_3" \
  --methods bff_cluster \
  --methodsForConsensus bff_cluster