Hashing-based deconvolution workflow
Cell hashing is a sample processing technique in which each sample is processed individually to “tag” the cell membrane or the nuclei with a unique oligonucleotide barcode. The cells are then washed or the reaction is quenched, after which the samples can be safely mixed and processed following the standard library preparation procedure. This process generates two libraries, one for the scRNA and one for the hashing oligos (HTO). The libraries are sequenced independently, and each yields a single-cell count matrix: one for the RNA library and one for the HTO library. The hashtag counts are then processed bioinformatically to deconvolve each cell’s source sample.
hash_demulti in hadge
Quick start
nextflow run ${hadge_project_dir}/main.nf -profile test,conda --mode hashing
Example cases
Case 1: Run the entire hashing-based mode:
nextflow run ${hadge_project_dir}/main.nf -profile conda --outputdir ${output_dir} --mode hashing --hto_matrix_raw ${hto_raw_dir} --hto_matrix_filtered ${hto_filtered_dir} --rna_matrix_raw ${rna_raw_dir} --rna_matrix_filtered ${rna_filtered_dir}
Case 2: Run Multiseq with raw counts:
nextflow run ${hadge_project_dir}/main.nf -profile conda --outputdir ${output_dir} --mode hashing --rna_matrix_multiseq raw --hto_matrix_multiseq raw # plus the additional parameters from case 1
Case 3: Run the pipeline with different combinations of parameters. This is only available in single-sample mode. The values must be separated by semicolons and enclosed in double quotes.
nextflow run ${hadge_project_dir}/main.nf -profile conda --mode hashing --quantile_multi "0.5;0.7" # plus the additional parameters from case 1
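The double quotes in case 3 matter because an unquoted semicolon is a command separator in the shell, so everything after it would be run as a separate command. The sketch below only demonstrates this shell behaviour and does not call the pipeline; `count_args` is a hypothetical helper introduced for illustration.

```shell
# Hypothetical helper: report how many arguments it received.
count_args() {
  echo "$#"
}

# Quoted: the flag and its value arrive as exactly two arguments.
quoted_count=$(count_args --quantile_multi "0.5;0.7")
echo "arguments received: ${quoted_count}"

# Splitting the quoted value on ';' lists the individual settings.
candidates=$(printf '%s\n' "0.5;0.7" | tr ';' '\n')
candidate_count=$(printf '%s\n' "$candidates" | wc -l | tr -d ' ')
printf 'quantile_multi candidates:\n%s\n' "$candidates"
```

How hadge itself splits the value is not shown here; the sketch only illustrates why the whole list must reach the pipeline as a single quoted argument.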
Input data preparation
The required input data depends on the deconvolution tool. The following table lists the minimal input data required by each tool.
Deconvolution method | Input data |
---|---|
HTODemux | UMI and hashing count matrix |
Multiseq | UMI and hashing count matrix |
HashSolo | Required: hashing count matrix. Optional: UMI count matrix |
HashedDrops | Hashing count matrix |
Demuxem | UMI and hashing count matrix |
As with the genotype-based deconvolution methods, the hashing methods share some of their input. The pipeline therefore uses the common input parameters params.[rna/hto]_matrix_[raw/filtered] to store the count matrices, while params.[rna/hto]_matrix_[method] specifies whether a given method uses the raw or the filtered counts. For example, hto_matrix_hashedDrops = "raw" means that the raw HTO count matrix is used as input for HashedDrops.
Input data | Parameter |
---|---|
Raw scRNAseq count matrix | params.rna_matrix_raw |
Filtered scRNAseq count matrix | params.rna_matrix_filtered |
Raw HTO count matrix | params.hto_matrix_raw |
Filtered HTO count matrix | params.hto_matrix_filtered |
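As a minimal sketch of how the shared and per-method parameters combine on the command line, the block below supplies all four count matrices once and then selects the raw counts for HashedDrops and Multiseq. All paths are placeholder assumptions. The command is assembled and printed rather than executed, so it can be inspected first.

```shell
# Placeholder locations (assumptions); replace them with real paths.
hadge_project_dir="/path/to/hadge"
hto_raw_dir="/path/to/hto_raw"
hto_filtered_dir="/path/to/hto_filtered"
rna_raw_dir="/path/to/rna_raw"
rna_filtered_dir="/path/to/rna_filtered"

# Provide the matrices once, then choose per method which counts to use.
set -- nextflow run "${hadge_project_dir}/main.nf" -profile conda --mode hashing \
  --hto_matrix_raw "${hto_raw_dir}" \
  --hto_matrix_filtered "${hto_filtered_dir}" \
  --rna_matrix_raw "${rna_raw_dir}" \
  --rna_matrix_filtered "${rna_filtered_dir}" \
  --hto_matrix_hashedDrops raw \
  --hto_matrix_multiseq raw

# Print the command for inspection; run it with "$@" once the paths are real.
hadge_cmd="$*"
echo "${hadge_cmd}"
```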
Pre-processing
As in the genetic demultiplexing workflow, a pre-processing step is provided, which is required before running HTODemux and Multiseq. It loads the count matrices given by the parameters above into a Seurat object.
Output
By default, the pipeline is run on a single sample. In this case, all pipeline output is saved in the folder $projectDir/$params.outdir/hashing/hash_demulti. When the pipeline is run on multiple samples, the output is saved in the folder $projectDir/$params.outdir/$sampleId/hashing/hash_demulti. For simplicity, this folder is referred to as $pipeline_output_folder from now on.
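The two folder layouts can be sketched as a small shell function. The function name and the example values (`/work/hadge`, `result`, `sample1`) are hypothetical; only the path pattern comes from the description above.

```shell
# Hypothetical helper: print the hash_demulti output folder.
# Arguments: project dir, output dir name, optional sample ID.
pipeline_output_folder() {
  if [ -n "${3:-}" ]; then
    # Multiple samples: one sub-folder per sample ID.
    printf '%s/%s/%s/hashing/hash_demulti\n' "$1" "$2" "$3"
  else
    # Single sample: no sample ID level.
    printf '%s/%s/hashing/hash_demulti\n' "$1" "$2"
  fi
}

single=$(pipeline_output_folder /work/hadge result)
multi=$(pipeline_output_folder /work/hadge result sample1)
echo "${single}"
echo "${multi}"
```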
Pre-processing
output directory: $pipeline_output_folder/preprocess/preprocess_[task_ID/sampleId]
- ${params.preprocessOut}.rds: pre-processed data in an RDS object
- params.csv: specified parameters in the hashing pre-processing task
HTODemux
output directory: $pipeline_output_folder/htodemux/htodemux_[task_ID/sampleId]
- ${params.assignmentOutHTO}_assignment_htodemux.csv: the assignment of HTODemux
- ${params.assignmentOutHTO}_classification_htodemux.csv: the classification of HTODemux as singlet, doublet and negative droplets
- ${params.objectOutHTO}.rds: the result of HTODemux in an RDS object
- params.csv: specified parameters in the HTODemux task
Optionally:
- ridge.jpeg: a ridge plot showing the enrichment of selected HTOs
- featureScatter.jpeg: a scatter plot showing the signal of two selected HTOs
- violinPlot.jpeg: a violin plot showing selected features
- tSNE.jpeg: a 2D plot based on the tSNE embedding of HTOs
- heatMap.jpeg: a heatmap of hashtag oligo signals across singlets, doublets and negative cells
- visual_params.csv: specified parameters for the visualisation of the HTODemux result
Multiseq
output directory: $pipeline_output_folder/multiseq/multiseq_[task_ID/sampleId]
- ${params.assignmentOutMulti}_res.csv: the assignment of Multiseq
- ${params.objectOutMulti}.rds: the result of Multiseq in an RDS object
- params.csv: specified parameters in the Multiseq task
Demuxem
output directory: $pipeline_output_folder/demuxem/demuxem_[task_ID/sampleId]
- ${params.objectOutDemuxem}_demux.zarr.zip: RNA expression matrix with demultiplexed sample identities in Zarr format
- ${params.objectOutDemuxem}.out.demuxEM.zarr.zip: DemuxEM-calculated results in Zarr format, containing two datasets, one for HTO and one for RNA
- ${params.objectOutDemuxem}.ambient_hashtag.hist.pdf: a histogram depicting the hashtag distributions of empty droplets and non-empty droplets
- ${params.objectOutDemuxem}.background_probabilities.bar.pdf: a bar plot visualizing the estimated hashtag background probability distribution
- ${params.objectOutDemuxem}.real_content.hist.pdf: a histogram depicting the hashtag distributions of not-real-cells and real-cells, as defined by the total number of expressed genes in the RNA assay
- ${params.objectOutDemuxem}.rna_demux.hist.pdf: a figure consisting of two plots. The first is a horizontal bar plot depicting the percentage of RNA barcodes with at least one HTO count. The second is a histogram depicting the RNA UMI distribution for singlets, doublets and unknown cells
- ${params.objectOutDemuxem}_summary.csv: the classification of Demuxem
- ${params.objectOutDemuxem}_obs.csv: the assignment of Demuxem
- params.csv: specified parameters in the Demuxem task
Optionally:
- ${params.objectOutDemuxem}.{gene_name}.violin.pdf: violin plots depicting the expression of the specified gender-specific genes across samples
HashSolo
output directory: $pipeline_output_folder/hashsolo/hashsolo_[task_ID/sampleId]
- ${params.assignmentOutHashSolo}_res.csv: the assignment of HashSolo
- ${params.plotOutHashSolo}.jpg: plot of the HashSolo demultiplexing results for QC checks
- params.csv: specified parameters in the HashSolo task
HashedDrops
output directory: $pipeline_output_folder/hashedDrops/hashedDrops_[task_ID/sampleId]
- ${params.objectOutEmptyDrops}.rds: the result of emptyDrops in an RDS object
- ${params.assignmentOutEmptyDrops}.csv: the result of emptyDrops in a CSV file
- plot_emptyDrops.png: a diagnostic plot comparing the total count against the negative log-probability
- ${params.objectOutHashedDrops}.rds: the result of hashedDrops in an RDS object
- ${params.assignmentOutHashedDrops}_res.csv: the assignment of HashedDrops
- ${params.objectOutHashedDrops}_LogFC.png: a diagnostic plot comparing the log-fold change between the second HTO’s abundance and the ambient contamination
- params.csv: specified parameters in the HashedDrops task
GMM-Demux
output directory: $pipeline_output_folder/gmm_demux/gmm_demux_[task_ID/sampleId]
- features.tsv.gz, barcodes.tsv.gz, matrix.mtx.gz: the default content of the output folder, namely the non-MSM droplets (SSDs), stored in MTX format
- GMM_full.csv: the classification file containing the label of each droplet as well as the probability of the classification
- GMM_full.config: used to assign each classification to a donor using the numbers listed in the config file
- gmm_demux_${task.index}_report.txt: the summary report, produced only if GMM-Demux finds a viable solution that satisfies the droplet formation model
- params.csv: specified parameters in the GMM-Demux task
BFF
output directory: $pipeline_output_folder/bff/bff_[task_ID/sampleId]
- ${params.assignmentOutBff}_assignment_bff.csv: the assignment and classification results produced by BFF
- params.csv: specified parameters in the BFF task
Parameters
Pre-processing
ndelim | For the initial identity class of each cell, the delimiter for the cell’s column name. Default: _ |
sel_method | The selection method used to choose the top variable features. Default: mean.var.plot |
n_features | Number of features to use when finding variable features. Default: 2000 |
assay | Assay name for the HTO modality. Default: HTO |
norm_method | Method for normalization of HTO data. Default: CLR |
margin | If performing CLR normalization, normalize across features (1) or cells (2). Default: 2 |
preprocessOut | Name of the output Seurat object. Default: preprocessed |
HTODemux
htodemux | Whether to perform HTODemux. Default: True |
rna_matrix_htodemux | Whether to use raw or filtered scRNA-seq count matrix. Default: filtered |
hto_matrix_htodemux | Whether to use raw or filtered HTO count matrix. Default: filtered |
assay | Name of the hashtag assay. Default: HTO |
quantile_htodemux | The quantile of the inferred ‘negative’ distribution for each hashtag, over which the cell is considered ‘positive’. Default: 0.99 |
kfunc | Clustering function for initial hashtag grouping. Default: clara |
nstarts | nstarts value for k-means clustering when kfunc=kmeans. Default: 100 |
nsamples_clustering | Number of samples to be drawn from the dataset used for clustering when kfunc=clara. Default: 100 |
seed | Sets the random seed. Default: 42 |
init | Initial number of clusters for hashtags. Default: None, which means the number of hashtag oligo names + 1 to account for negatives |
objectOutHTO | Name of the output Seurat object. Default: htodemux |
assignmentOutHTO | Prefix of the output CSV files. Default: htodemux |
ridgePlot | Whether to generate a ridge plot to visualize enrichment for all HTOs. Default: True |
ridgeNCol | Number of columns in the ridge plot. Default: 3 |
featureScatter | Whether to generate a scatter plot to visualize pairs of HTO signals. Default: False |
scatterFeat1 | First feature to plot. Default: None |
scatterFeat2 | Second feature to plot. Default: None |
vlnplot | Whether to generate a violin plot, e.g. to compare the number of UMIs for singlets, doublets and negative cells. Default: True |
vlnFeatures | Features to plot. Default: nCount_RNA |
vlnLog | Whether to plot the feature axis on log scale. Default: True |
tsne | Whether to generate a 2D tSNE embedding for HTOs. Default: True |
tsneIdents | Subset the Seurat object based on identity class. Default: Negative |
tsneInvert | Whether to keep or remove the identity class. Default: True |
tsneVerbose | Whether to print the top genes associated with high/low loadings for the PCs when running PCA. Default: False |
tsneApprox | Whether to use truncated singular value decomposition to approximate PCA. Default: False |
tsneDimMax | Number of dimensions to use as input features when running t-SNE dimensionality reduction. Default: 2 |
tsnePerplexity | Perplexity when running t-SNE dimensionality reduction. Default: 100 |
heatmap | Whether to generate an HTO heatmap. Default: True |
heatmapNcells | Number of cells to plot. Default: 5000 |
Multiseq
multiseq | Whether to perform Multiseq. Default: True |
rna_matrix_multiseq | Whether to use raw or filtered scRNA-seq count matrix. Default: filtered |
hto_matrix_multiseq | Whether to use raw or filtered HTO count matrix. Default: filtered |
assay | Name of the hashtag assay, same as used for HTODemux. Default: HTO |
quantile_multi | The quantile to use for classification. Default: 0.7 |
autoThresh | Whether to perform automated threshold finding to define the best quantile. Default: True |
maxiter | Maximum number of iterations if autoThresh=True. Default: 100 |
qrangeFrom | The minimal possible quantile value to try if autoThresh=True. Default: 0.1 |
qrangeTo | The maximal possible quantile value to try if autoThresh=True. Default: 0.9 |
qrangeBy | The constant difference between consecutive quantile values to try if autoThresh=True. Default: 0.05 |
verbose_multiseq | Whether to print the output. Default: True |
assignmentOutMulti | Prefix of the output CSV files. Default: multiseq |
objectOutMulti | Name of the output Seurat object. Default: multiseq |
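To make the role of qrangeFrom, qrangeTo and qrangeBy concrete, the sketch below lists the candidate quantiles that the default values describe. It assumes that both endpoints are included, as in an R `seq(from, to, by)` call, and that the range divides evenly by the step; how hadge forwards these values internally is not shown here.

```shell
# Defaults taken from the parameter table above.
qrangeFrom=0.1
qrangeTo=0.9
qrangeBy=0.05

# Enumerate the candidates with an integer step counter to avoid
# accumulating floating-point error.
quantile_candidates=$(awk -v from="$qrangeFrom" -v to="$qrangeTo" -v by="$qrangeBy" '
  BEGIN {
    steps = int((to - from) / by + 0.5)
    for (i = 0; i <= steps; i++) printf "%.2f\n", from + i * by
  }')

echo "$quantile_candidates"
candidate_count=$(printf '%s\n' "$quantile_candidates" | wc -l | tr -d ' ')
echo "number of candidates: ${candidate_count}"
```

With the defaults this yields 17 candidates, so narrowing the range or increasing qrangeBy reduces the number of quantiles that automated threshold finding has to evaluate.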
HashSolo
hashsolo | Whether to perform HashSolo. Default: True |
use_rna_data | Whether to use RNA counts for deconvolution. Default: False |
rna_matrix_hashsolo | Whether to use raw or filtered scRNA-seq count matrix if use_rna_data is set to True. Default: raw |
hto_matrix_hashsolo | Whether to use raw or filtered HTO count matrix. Default: raw |
priors_negative | Prior for the negative hypothesis. Default: 1/3 |
priors_singlet | Prior for the singlet hypothesis. Default: 1/3 |
priors_doublet | Prior for the doublet hypothesis. Default: 1/3 |
pre_existing_clusters | Column in the input data for how to break up demultiplexing. Default: None |
number_of_noise_barcodes | Number of barcodes to use to create the noise distribution. Default: None |
assignmentOutHashSolo | Prefix of the output CSV files. Default: hashsolo |
plotOutHashSolo | Prefix of the output figures. Default: hashsolo |
Demuxem
demuxem | Whether to perform Demuxem. Default: True |
rna_matrix_demuxem | Whether to use raw or filtered scRNA-seq count matrix. Default: raw |
hto_matrix_demuxem | Whether to use raw or filtered HTO count matrix. Default: raw |
threads_demuxem | Number of threads to use. Must be a positive integer. Default: 1 |
alpha_demuxem | The Dirichlet prior concentration parameter (alpha) on samples. An alpha value < 1.0 will make the prior sparse. Default: 0.0 |
alpha_noise | The Dirichlet prior concentration parameter on the background noise. Default: 1.0 |
min_num_genes | Keep only cells/nuclei with at least the specified number of expressed genes. Default: 100 |
min_num_umis | Keep only cells/nuclei with at least the specified number of UMIs. Default: 100 |
min_signal | Any cell/nucleus with fewer than min_signal hashtags from the signal will be marked as unknown. Default: 10 |
tol | Threshold used for the EM convergence. Default: 1e-6 |
generate_gender_plot | Generate violin plots using gender-specific genes (e.g. Xist). The value is a comma-separated list of gene names. Default: None |
random_state | Random seed set for reproducing results. Default: 0 |
objectOutDemuxem | Prefix of the output files. Default: demuxem_res |
HashedDrops
hashedDrops | Whether to perform hashedDrops. Default: True |
hto_matrix_hashedDrops | Whether to use raw or filtered HTO count matrix. Default: raw |
lower | The lower bound on the total UMI count, at or below which all barcodes are assumed to correspond to empty droplets. Default: 100 |
niters | The number of iterations to use for the Monte Carlo p-value calculations. Default: 10000 |
testAmbient | Whether results should be returned for barcodes with totals less than or equal to lower. Default: True |
ignore_hashedDrops | The lower bound on the total UMI count, at or below which barcodes will be ignored. Default: None |
alpha_hashedDrops | The scaling parameter for the Dirichlet-multinomial sampling scheme. Default: None |
round | Whether to check for non-integer values in the count matrix (m) and, if present, round them for ambient profile estimation. Default: True |
byRank | If set, this is used to redefine lower, and any specified value for lower is ignored. Default: None |
isCellFDR | FDR threshold to filter the cells for empty droplet detection. Default: 0.01 |
objectOutEmptyDrops | Prefix of the emptyDroplets output RDS object. Default: emptyDroplets |
assignmentOutEmptyDrops | Prefix of the emptyDroplets output CSV file. Default: emptyDroplets |
ambient | Whether to use the relative abundance of each HTO in the ambient solution from emptyDrops. Set to True only when testAmbient=True. Default: False |
minProp | Value used to infer the ambient profile when no ambient profile is provided. Default: 0.05 |
pseudoCount | The minimum pseudo-count when computing log-fold changes. Default: 5 |
constantAmbient | Whether a constant level of ambient contamination should be used to estimate LogFC2 for all cells. Default: False |
doubletNmads | The number of median absolute deviations (MADs) to use to identify doublets. Default: 3 |
doubletMin | The minimum threshold on the log-fold change to use to identify doublets. Default: 2 |
doubletMixture | Whether to use a 2-component mixture model to identify doublets. Default: False |
confidentNmads | The number of MADs to use to identify confidently assigned singlets. Default: 3 |
confidenMin | The minimum threshold on the log-fold change to use to identify singlets. Default: 2 |
combinations | An integer matrix specifying valid combinations of HTOs. Each row corresponds to a single sample and specifies the indices of rows in x corresponding to the HTOs used to label that sample. Default: None |
objectOutHashedDrops | Prefix of the hashedDrops output RDS object. Default: hashedDrops |
assignmentOutHashedDrops | Prefix of the hashedDrops output CSV file. Default: hashedDrops |
GMM-Demux
gmmDemux | Whether to perform GMM-Demux. Default: True |
hto_matrix_gmm_demux | Whether to use raw or filtered HTO count matrix. Default: filtered |
assignmentOutGmmDemux | Name of the output folder. Default: gmm_demux |
hto_name_gmm | List of sample tags (HTOs) separated by ‘,’ without whitespace. Default: None |
summary | The estimated total count of cells in the single cell assay. Default: 2000 |
report_gmm | Name of the file generated by the summary. Default: report.txt |
mode_GMM | Format of the input, either tsv or csv. Default: tsv |
extract | Names of the sample barcoding tag(s) to extract, separated by ‘,’. Joint tags are linked with ‘+’. Default: None |
threshold_gmm | The confidence threshold value. Requires a float in (0,1). Default: 0.8 |
ambiguous | The estimated chance of having a phony GEM getting included in a pure type GEM cluster by the clustering algorithm. Default: 0.5 |
plotOutHashSolo | Prefix of the output figures. Default: hashsolo |
BFF
BFF | Whether to perform BFF. Default: False |
hto_matrix_bff | Whether to use raw or filtered HTO count matrix. Default: raw |
rna_matrix_bff | Whether to use raw or filtered scRNA-seq count matrix. Default: raw |
assignmentOutBff | Name of the output folder. Default: bff |
methods | Method or list of methods to be used. Default: combined_bff |
methodsForConsensus | A consensus call will be generated using all methods specified. Default: NULL |
cellbarcodeWhitelist | A vector of expected cell barcodes. Default: NULL |
metricsFile | Summary metrics will be written to this file. Default: metrics_bff.cvs |
doTSNE | If True, tSNE will be run on the resulting hashing calls after each caller. Default: True |
doHeatmap | If True, Seurat::HTOHeatmap will be run on the results of each caller. Default: True |
perCellSaturation | An optional dataframe with the columns cellbarcode and saturation. Default: NULL |
majorityConsensusThreshold | This applies to calculating a consensus call when multiple algorithms are used. Default: NULL |
chemistry | This string is passed to EstimateMultipletRate. Should be either 10xV2 or 10xV3. Default: 10xV3 |
callerDisagreementThreshold | If provided, the agreement rate will be calculated between each caller and the simple majority call, ignoring discordant and no-call cells. Default: NULL |
preprocess_bff | When True, the data is pre-processed using the method ProcessCountMatrix from CellHashR. Default: False |
barcodeWhitelist | A vector of barcode names to retain. This parameter is used only when the pre-processing step is executed. Default: NULL |
General use
Single sample use
Running the pipeline on a single sample requires certain parameters to be defined so that the tools run under their default configuration.
The parameter --mode hashing must be included to run only the hashing tools.
GMM-Demux
The names of the hashtags must be given as a list of strings separated by ‘,’, under the parameter --hto_name_gmm.
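Because the hashtag list must not contain whitespace, it can be worth checking the value before launching a run. The following is a minimal sketch of such a check; the hashtag names are hypothetical placeholders, and the check is not part of hadge itself.

```shell
# Hypothetical hashtag names; use the names from your own HTO matrix.
hto_names="hto_name_1,hto_name_2,hto_name_3"

# Reject values containing whitespace, which would break the list.
case "$hto_names" in
  *[[:space:]]*) hto_list_ok=no ;;
  *) hto_list_ok=yes ;;
esac

# Count the comma-separated entries as a quick sanity check.
hto_count=$(printf '%s\n' "$hto_names" | tr ',' '\n' | wc -l | tr -d ' ')
echo "valid: ${hto_list_ok}, hashtags: ${hto_count}"
echo "--hto_name_gmm \"${hto_names}\""
```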
BFF
The demultiplexing method for the experiment must be given under the parameter --methods. Multiple methods can be given as a list, separated by ‘,’.
In addition, the method or methods used for the consensus call must be given under the parameter --methodsForConsensus.
nextflow run main.nf --mode hashing --match_donor False \
  --hto_matrix_raw /data_folder/raw_hto_data \
  --hto_matrix_filtered /data_folder/filtered_hto_data \
  --barcodes /data_folder/filtered_hto_data/barcodes.tsv.gz \
  --rna_matrix_raw /data_folder/raw_rna_data \
  --rna_matrix_filtered /data_folder/filtered_rna_data \
  --hto_name_gmm "hto_name_1,hto_name_2,hto_name_3" \
  --methods bff_cluster --methodsForConsensus bff_cluster