General

hadge: a comprehensive pipeline for donor deconvolution in single cell

A preprint is available here.


Pipeline overview

The pipeline mode is set by params.mode. hadge provides four modes in total: genetic, hashing, rescue and donor_match.
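For example, to launch the pipeline in hashing mode, params.mode can be overridden on the command line (as with any params.<name> option, via a double-dash flag):

nextflow run main.nf -profile conda --mode hashing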

Pipeline configuration

The pipeline provides some pre-defined profiles for different use cases.

Test dataset

To get familiar with hadge, we provide the test profile for a quick start. Use the provided bash script to download the test data into the hadge project directory, then run the pipeline locally:

sh ${hadge_project_dir}/hadge/test_data/download_data.sh
nextflow run ${hadge_project_dir}/hadge/main.nf -profile test,conda_singularity

Conda environments

With the -profile conda option, the pipeline executes each process within the Conda environment specified in its conda directive. Alternatively, you can add a new profile to the nextflow.config file to run processes in local Conda environments:

profiles {
    my_conda_profile {
        // don't forget to enable Conda
        conda.enabled = true
        process {
            // Use Conda environment files
            withName:scSplit {
                conda = './conda/scsplit.yml'
            }
            // Use Conda package names
            withName:cellSNP {
                conda = 'bioconda::cellsnp-lite'
            }
            // Use existing Conda environments
            withName:summary {
                conda = '/path/to/an/existing/env/directory'
            }
        }
    }
}
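A profile defined this way is then selected by name at launch time, e.g.:

nextflow run main.nf -profile my_conda_profile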

Containers

Nextflow also supports a variety of container runtimes, e.g. Docker. To specify a different Docker image for each process:

profiles {
    my_docker_profile {
        // don't forget to enable Docker
        docker.enabled = true
        process {
            withName:foo {
                container = 'image_name_1'
            }
        }
    }
}

In general, it is recommended to run with -profile conda when you don’t need to run Souporcell. If Souporcell is necessary, please run the pipeline with -profile conda,singularity, where both Conda and Singularity are enabled.
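For example, a run that includes Souporcell could be launched as follows:

nextflow run main.nf -profile conda,singularity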

Executor and resource specifications

  • The standard profile is used by default when no profile is specified. In this profile, the pipeline runs locally and all processes annotated with the big_mem label are assigned 4 CPUs and 16 GB of memory.

  • The pipeline can also be run on an HPC. You can set the executor by running the pipeline with -profile cluster.

profiles {
    standard {
        process {
            executor = 'local'
            withLabel: big_mem {
                cpus = 4
                memory = 16.GB
            }
            withLabel: small_mem {
                cpus = 2
                memory = 8.GB
            }
        }
    }
    cluster {
        process {
            executor = 'slurm'
            // queue = ...
            withLabel: big_mem {
                cpus = 32
                memory = 64.GB
            }
            withLabel: small_mem {
                cpus = 16
                memory = 32.GB
            }
        }
    }
}
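With the cluster profile above, processes are submitted to Slurm with the resource limits defined per label, e.g.:

nextflow run main.nf -profile cluster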

Customized profiles

Configuration files can contain the definition of one or more profiles. Multiple configuration profiles can be specified by separating the profile names with a comma (no whitespace), for example:

nextflow run main.nf -profile standard,conda

scverse compatibility

To ensure scverse compatibility, the pipeline provides the option to generate AnnData or MuData objects after demultiplexing, specified by params.generate_anndata and params.generate_mudata. The objects contain the scRNA-seq counts from params.rna_matrix_filtered and store the assignment of each demultiplexing method in the assignment column of obs. Additionally, if match_donor is True, the pipeline also produces an AnnData object containing the assignment of the best-matched method pair after donor matching.
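For example, both output formats can be requested on the command line (a sketch; combine with the profile and mode of your choice):

nextflow run main.nf -profile conda --mode hashing --generate_anndata true --generate_mudata true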

Pipeline output

The output directory of the pipeline is set by params.outdir. By default, the pipeline is run on a single sample; in this case, all pipeline output is saved in the folder $projectDir/$params.outdir/$params.mode. When running the pipeline on multiple samples, the output of each sample is found in the folder $projectDir/$params.outdir/$sampleId/$params.mode. To simplify this, we refer to this folder as $pipeline_output_folder from now on.
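As a concrete illustration, with hypothetical values params.outdir = "result", params.mode = "hashing" and a sample named sampleA, the output would land in:

$projectDir/result/hashing            # single sample
$projectDir/result/sampleA/hashing    # multiple samples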

The demultiplexing workflow saves its output in $pipeline_output_folder/[gene/hash]_demulti. The pipeline will also generate some CSV files to summarize the results in the folder [gene/hash]_summary under this directory.

  • [method]_classification.csv: classification of all trials for a given method

    Barcode     multiseq_1   multiseq_2
    barcode-1   singlet      singlet
    barcode-2   doublet      negative

  • [method]_assignment.csv: assignment of all trials for a given method

    Barcode     multiseq_1   multiseq_2
    barcode-1   donor-1      donor-2
    barcode-2   doublet      negative

  • [method]_params.csv: specified parameters of all trials for a given method

    Argument           Value
    seuratObjectPath   Path
    quantile           0.7

  • [workflow]_classification_all.csv: classification of all trials across different methods

    Barcode   multiseq_1   htodemux_1

  • [workflow]_assignment_all.csv: assignment of all trials across different methods

    Barcode   multiseq_1   htodemux_1

  • adata folder: stores an AnnData object with the filtered scRNA-seq read counts and the assignment of each deconvolution method if params.generate_anndata is True. For details, see the section “scverse compatibility” above.

  • mudata folder: stores a MuData object with the filtered scRNA-seq and HTO read counts and the assignment of each deconvolution method if params.generate_mudata is True. For details, see the section “scverse compatibility” above.

  • In rescue mode, the pipeline generates some additional output files; for details, please check Combining results: rescue mode.