Genetics-based deconvolution workflow#

Genotype-based deconvolution leverages the unique genetic composition of individual samples to separate the cells of a pooled mixture by donor. It can be run with the known genotypes of the donors, or in a genotype-free mode that uses a genomic reference from unmatched donors, for example the 1000 Genomes Project genotypes. The result is a table of SNP assignments to cells, from which the donors can be inferred computationally. One limitation of the genotype-aware mode is that additional data must be produced to genotype the individual donors so that cells can be assigned to them correctly.

gene_demulti in hadge#

Quick start#

# use test dataset
nextflow run ${hadge_project_dir}/main.nf -profile test,conda_singularity --mode genetic

Example case#

Case 1: Run the entire genotype-based mode without known donor genotype:

nextflow run ${hadge_project_dir}/main.nf -profile conda_singularity --outputdir ${output_dir} --mode genetic --bam ${bam_dir} --bai ${bai_dir} --barcodes ${barcodes_dir}  --nsamples_genetic ${nsamples} --fasta ${fasta_dir} --fasta_index ${fasta_index_dir} --common_variants_scSplit ${common_variant_scsplit} --common_variants_souporcell ${common_variant_souporcell} --common_variants_freemuxlet ${common_variant_freemuxlet}  --common_variants_cellsnp ${common_variant_cellsnp} --demuxlet False

Case 2: Skip cellSNP and run Vireo with available cell genotype file in VCF format:

nextflow run ${hadge_project_dir}/main.nf -profile conda --mode genetic --vireo_variant False --celldata ${cell_data_dir}

Case 3: Run Demuxlet with donor genotype:

nextflow run ${hadge_project_dir}/main.nf -profile conda --mode genetic --outputdir ${output_dir} --bam ${bam_dir} --bai ${bai_dir} --barcodes ${barcodes_dir} --vcf_donor ${donor_genotype_dir}

Case 4: Run scSplit without data pre-processing:

nextflow run ${hadge_project_dir}/main.nf -profile conda --mode genetic --scSplit_preprocess False # additional parameters as in Case 1

Case 5: Run the pipeline with different combinations of parameters. This is only available in single-sample mode. The values should be separated by semicolons and enclosed in double quotes.

nextflow run ${hadge_project_dir}/main.nf -profile conda_singularity --mode genetic --alpha "0.1;0.3;0.5" # additional parameters as in Case 1
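The semicolon-separated format from Case 5 can be sketched in Python. This is a minimal illustration, not hadge code: the helper names are hypothetical, and expanding the values into a Cartesian product is only one reading of "different combinations of parameters" that the text does not confirm.

```python
from itertools import product


def to_hadge_value(values):
    """Join several candidate values into one semicolon-separated string."""
    return ";".join(str(v) for v in values)


def expand_grid(grid):
    """Enumerate every combination of the semicolon-separated values.

    Assumption: the combinations form a Cartesian product of all value lists.
    """
    names = list(grid)
    choices = [grid[name].split(";") for name in names]
    return [dict(zip(names, combo)) for combo in product(*choices)]


grid = {
    "alpha": to_hadge_value([0.1, 0.3, 0.5]),
    "doublet_prior": to_hadge_value([0.5]),
}

# Each value is double quoted so the shell does not treat ";" as a separator.
cli_args = " ".join(f'--{name} "{value}"' for name, value in grid.items())
combinations = expand_grid(grid)

print(cli_args)
print(len(combinations))
```

Quoting matters here because an unquoted semicolon would end the shell command early.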

Input data preparation#

The required input data depends on the deconvolution tool. The following table lists the minimal input data required by each tool.

| Deconvolution method | Input data |
| --- | --- |
| Demuxlet | Alignment (BAM), Barcode (TSV), Genotype reference per sample (VCF) |
| Freemuxlet | Alignment (BAM), Barcode (TSV) |
| Vireo | Genotype per cell (VCF or cellSNP folder) |
| Souporcell | Alignment (BAM), Barcode (TSV), Reference genome (FASTA) |
| scSplit | Alignment (BAM), Barcode (TSV), Genotype per pool (VCF) |

Some tools share input data, so each shared input is set through a single parameter. This keeps the input identical across tools for benchmarking.

| Input data | Parameter |
| --- | --- |
| Alignment (BAM) | params.bam, params.bai |
| Barcode (TSV) | params.barcodes |
| Genotype reference per sample (VCF) | params.vcf_donor |
| Genotype per pool (VCF) | params.vcf_mixed |
| Reference genome (FASTA) | params.fasta, params.fasta_index |
| Genotype per cell (VCF or cellSNP folder) | params.celldata |
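The two tables above can be combined into a small pre-flight check. The sketch below is illustrative only and is not part of hadge: the dictionary is transcribed from the tables, while the function name and the example file names are hypothetical.

```python
# Minimal inputs per tool, transcribed from the two tables above.
REQUIRED_PARAMS = {
    "demuxlet": ["bam", "bai", "barcodes", "vcf_donor"],
    "freemuxlet": ["bam", "bai", "barcodes"],
    "vireo": ["celldata"],
    "souporcell": ["bam", "bai", "barcodes", "fasta", "fasta_index"],
    "scSplit": ["bam", "bai", "barcodes", "vcf_mixed"],
}


def missing_params(tool, provided):
    """Return the required parameters of `tool` that are absent or empty."""
    return [name for name in REQUIRED_PARAMS[tool] if not provided.get(name)]


# Hypothetical file names, for illustration only.
provided = {
    "bam": "pool.bam",
    "bai": "pool.bam.bai",
    "barcodes": "barcodes.tsv",
}

for tool in REQUIRED_PARAMS:
    print(tool, missing_params(tool, provided))
```

A missing celldata or vcf_mixed is not necessarily an error, because the variant calling steps described later can produce these inputs from the BAM file.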

Note that this is only the minimal input data set. You may also need common variants from the population to run genotype-based deconvolution methods without a genotype reference. The following table collects the sources of common variants for GRCh38 recommended by the different methods.

| Method | Parameter | Source |
| --- | --- | --- |
| scSplit | common_variants_scSplit | https://melbourne.figshare.com/articles/dataset/Common_SNVS_hg38/17032163 |
| Souporcell | common_variants_souporcell | https://github.com/wheaton5/souporcell |
| Freemuxlet | common_variants_freemuxlet | https://sourceforge.net/projects/cellsnp/files/SNPlist/ |
| cellSNP-lite | common_variants_cellsnp | https://sourceforge.net/projects/cellsnp/files/SNPlist/ |

Pre-processing#

If you want to perform genotype-based deconvolution on pre-processed data, we provide a pre-processing step that follows the instructions of scSplit. It only requires the alignment (BAM) file as input. To choose which methods run on the pre-processed data, set [method]_preprocess = True.

Variant calling#

In case you don’t have any cell genotypes or variants called from mixed samples yet, we provide two processes for variant calling.

| Variant calling method | Input data | Parameter | Output |
| --- | --- | --- | --- |
| freebayes | Alignment (BAM), Reference genome (FASTA) | params.bam, params.bai, params.fasta, params.fasta_index | Variants from mixed samples |
| cellsnp-lite | Alignment (BAM), Barcode (TSV), Common SNPs (VCF) | params.bam, params.bai, params.barcodes, params.regionsVCF | Cell genotypes |

The following options are available for scSplit_variant:

  • True: run freebayes

  • Otherwise: skip variant calling and take the input data from params.vcf_mixed

The following options are available for vireo_variant:

  • True: run cellsnp-lite

  • Otherwise: skip variant calling and take the input data from params.celldata
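The two switches follow the same pattern, which can be sketched as below. This is a hypothetical helper rather than the pipeline's own logic. In particular, comparing against the string "True" is an assumption about how the value arrives from the command line, and the file name is made up.

```python
def resolve_variant_input(params, switch, caller, fallback):
    """Return ("call", caller) when the switch is True, else ("file", path).

    Assumption: the switch value is compared by its string form, "True".
    """
    if str(params.get(switch)) == "True":
        return ("call", caller)
    return ("file", params[fallback])


params = {
    "scSplit_variant": "True",
    "vireo_variant": "False",
    "celldata": "cells.vcf.gz",  # hypothetical file name
}

scsplit_input = resolve_variant_input(
    params, "scSplit_variant", "freebayes", "vcf_mixed"
)
vireo_input = resolve_variant_input(
    params, "vireo_variant", "cellsnp-lite", "celldata"
)

print(scsplit_input)
print(vireo_input)
```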

Output#

By default, the pipeline is run on a single sample. In this case, all pipeline output is saved in the folder $projectDir/$params.outdir/genetic/gene_demulti. When the pipeline is run on multiple samples, the output is saved in the folder $projectDir/$params.outdir/$sampleId/genetic/gene_demulti. For brevity, this folder is referred to as $pipeline_output_folder from now on.
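The folder layout described above can be expressed as a short helper. This is a sketch only: the function name and the example paths are hypothetical, and only the path pattern comes from the text.

```python
from pathlib import PurePosixPath


def pipeline_output_folder(project_dir, outdir, sample_id=None):
    """Build the output folder for single-sample or multi-sample runs."""
    base = PurePosixPath(project_dir) / outdir
    if sample_id is not None:
        # Multi-sample runs add one level named after the sample.
        base = base / sample_id
    return base / "genetic" / "gene_demulti"


# Hypothetical project directory, output directory, and sample ID.
single = pipeline_output_folder("/work/hadge", "result")
multi = pipeline_output_folder("/work/hadge", "result", sample_id="sample1")

print(single)
print(multi)
```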

Samtools#

output directory: $pipeline_output_folder/samtools/samtools_[task_ID/sampleId]

  • filtered.bam: BAM from which reads matching any of the following patterns have been removed: read quality lower than 10, unmapped segment, secondary alignment, not passing filters, PCR or optical duplicate, or supplementary alignment

  • filtered.bam.bai: index of filtered bam

  • no_dup.bam: processed BAM after removing duplicated reads based on UMI

  • sorted.bam: sorted BAM

  • sorted.bam.bai: index of sorted BAM
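The read categories removed for filtered.bam correspond to standard SAM flag bits. The sketch below decodes them. The bit values come from the SAM specification, and their sum, 3844, matches the excl_flag default listed for dsc-pileup in the parameter section. Whether the samtools step in hadge uses exactly this mask is an assumption.

```python
# SAM flag bits for the categories listed for filtered.bam.
SAM_FLAGS = {
    0x4: "read unmapped",
    0x100: "secondary alignment",
    0x200: "not passing filters",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flags(value):
    """List the names of the known flag bits that are set in `value`."""
    return [name for bit, name in SAM_FLAGS.items() if value & bit]


def is_excluded(read_flag, exclude_mask=3844):
    """A read is dropped when it shares at least one bit with the mask."""
    return (read_flag & exclude_mask) != 0


print(sum(SAM_FLAGS))      # combined mask of the five bits
print(decode_flags(3844))  # names of the bits in the mask
print(is_excluded(99))     # properly paired first mate on the forward strand
print(is_excluded(1123))   # the same read, marked as a duplicate
```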

cellSNP-lite#

output directory: $pipeline_output_folder/cellsnp/cellsnp_[task_ID/sampleId]

  • cellSNP.base.vcf.gz: a VCF file listing genotyped SNPs and aggregated AD & DP information (without GT)

  • cellSNP.samples.tsv: a TSV file listing cell barcodes or sample IDs

  • cellSNP.tag.AD.mtx: a file in mtx format, containing the allele depths of the alternative (ALT) alleles

  • cellSNP.tag.DP.mtx: a file in mtx format, containing the sum of allele depths of the reference and alternative alleles (REF + ALT)

  • cellSNP.tag.OTH.mtx: a file in mtx format, containing the sum of allele depths of all the alleles other than REF and ALT.

  • cellSNP.cells.vcf.gz: a VCF file listing genotyped SNPs and AD & DP & genotype (GT) information for each cell or sample

  • params.csv: specified parameters in the cellsnp-lite task

Freebayes#

output directory: $pipeline_output_folder/freebayes/freebayes_[task_ID/sampleId]

  • ${region}_${vcf_freebayes}: a VCF file containing variants called from mixed samples in the given chromosome region

Bcftools#

output directory: $pipeline_output_folder/bcftools/bcftools_[task_ID/sampleId]

  • total_chroms.vcf: a VCF containing variants from all chromosomes

  • sorted_total_chroms.vcf: sorted VCF file

  • filtered_sorted_total_chroms.vcf: sorted VCF file containing variants with a quality score > 30
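The quality filter behind filtered_sorted_total_chroms.vcf can be illustrated in plain Python. This sketch only mirrors the stated rule (quality score > 30) on the QUAL column. It does not reproduce the bcftools command used by the pipeline, which is not given here, and the records are invented.

```python
def filter_vcf_by_qual(lines, min_qual=30.0):
    """Keep header lines and records whose QUAL is strictly above min_qual."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # header lines are always preserved
            continue
        qual = line.rstrip("\n").split("\t")[5]  # QUAL is the 6th VCF column
        if qual != "." and float(qual) > min_qual:
            kept.append(line)
    return kept


# Invented records, for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t45.2\t.\t.",
    "chr1\t2000\t.\tC\tT\t12.0\t.\t.",
    "chr2\t3000\t.\tG\tA\t30\t.\t.",
]

filtered = filter_vcf_by_qual(example_vcf)
print(len(filtered))
```

The record with QUAL exactly 30 is dropped because the rule is a strict inequality.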

Demuxlet#

output directory: $pipeline_output_folder/demuxlet/demuxlet_[task_ID/sampleId]

  • {demuxlet_out}.best: result of demuxlet containing the best guess of the sample identity, with the detailed statistics used to reach the best guess

  • params.csv: specified parameters in the Demuxlet task

Optionally:

  • {demuxlet_out}.cel: contains the mapping between the numeric barcode ID and the barcode, as well as the number of SNPs and the number of UMIs for each barcoded droplet.

  • {demuxlet_out}.plp: contains the overlapping SNPs and the corresponding read and base quality for each barcode ID.

  • {demuxlet_out}.umi: contains the positions covered by each UMI

  • {demuxlet_out}.var: contains the position, reference allele and allele frequency for each SNP.

Freemuxlet#

output directory: $pipeline_output_folder/freemuxlet/freemuxlet_[task_ID/sampleId]

  • {freemuxlet_out}.clust1.samples.gz: contains the best guess of the sample identity, with the detailed statistics used to reach the best guess.

  • {freemuxlet_out}.clust1.vcf.gz: VCF file for each sample inferred and clustered from freemuxlet

  • {freemuxlet_out}.lmix: contains basic statistics for each barcode

  • params.csv: specified parameters in the Freemuxlet task

Optionally:

  • {freemuxlet_out}.cel: contains the mapping between the numeric barcode ID and the barcode, as well as the number of SNPs and the number of UMIs for each barcoded droplet.

  • {freemuxlet_out}.plp: contains the overlapping SNPs and the corresponding read and base quality for each barcode ID.

  • {freemuxlet_out}.umi: contains the positions covered by each UMI

  • {freemuxlet_out}.var: contains the position, reference allele and allele frequency for each SNP.

  • {freemuxlet_out}.clust0.samples.gz: contains the best sample identity assuming all droplets are singlets

  • {freemuxlet_out}.clust0.vcf.gz: VCF file for each sample inferred and clustered from freemuxlet assuming all droplets are singlets

  • {freemuxlet_out}.ldist.gz: contains the pairwise Bayes factor for each possible pair of droplets

Vireo#

output directory: $pipeline_output_folder/vireo/vireo_[task_ID/sampleId]

  • donor_ids.tsv: assignment of Vireo with detailed statistics

  • summary.tsv: summary of assignment

  • prob_singlet.tsv.gz: contains the probability of classifying singlets

  • prob_doublet.tsv.gz: contains the probability of classifying doublets

  • GT_donors.vireo.vcf.gz: contains estimated donor genotypes

  • filtered_variants.tsv: a minimal set of discriminatory variants

  • GT_barcodes.png: a figure for the identified genotype barcodes

  • fig_GT_distance_estimated.pdf: a plot showing the estimated genotype distance

  • _log.txt: vireo log file

  • params.csv: specified parameters in the Vireo task
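A quick way to inspect the Vireo assignment is to count cells per donor label in donor_ids.tsv. The sketch below assumes a tab-separated file with a donor_id column, which is not stated in this document and should be checked against your own output. The barcodes and labels are invented.

```python
import csv
import io
from collections import Counter


def count_assignments(handle, column="donor_id"):
    """Count how many cells carry each label in the given column."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[column] for row in reader)


# Invented in-memory stand-in for donor_ids.tsv (column names are assumed).
example = io.StringIO(
    "cell\tdonor_id\n"
    "AAACCTGAGAAACCAT-1\tdonor0\n"
    "AAACCTGAGAAACCGC-1\tdonor1\n"
    "AAACCTGAGAAACCTA-1\tdoublet\n"
    "AAACCTGAGAAACGAG-1\tdonor0\n"
    "AAACCTGAGAAACGCC-1\tunassigned\n"
)

counts = count_assignments(example)
for label, n_cells in sorted(counts.items()):
    print(label, n_cells)
```

For a real run, pass an open file handle for donor_ids.tsv from the Vireo output directory instead of the in-memory example.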

scSplit#

output directory: $pipeline_output_folder/scSplit/scsplit_[task_ID/sampleId]

  • alt_filtered.csv: count matrix of alternative alleles

  • ref_filtered.csv: count matrix of reference alleles

  • scSplit_result.csv: barcodes assigned to each of the N+1 clusters (N singlet clusters and 1 doublet cluster). The doublet cluster is marked as DBL-n, where n stands for the cluster number; e.g. SNG-0 means that cluster 0 is a singlet cluster.

  • scSplit_dist_matrix.csv: the ALT allele Presence/Absence (P/A) matrix on distinguishing variants for all samples, used as a reference when assigning samples to clusters. It does NOT include the doublet cluster, whose sequence number can differ between runs (pay attention to this when comparing runs)

  • scSplit_dist_variants.txt: the distinguishing variants that can be used to genotype and assign samples to clusters

  • scSplit_PA_matrix.csv: the full ALT allele Presence/Absence (P/A) matrix for all samples. It does NOT include the doublet cluster, whose sequence number can differ between runs (pay attention to this when comparing runs)

  • scSplit_P_s_c.csv: the probability of each cell belonging to each sample

  • scSplit.log: log file containing information for current run, iterations, and final Maximum Likelihood and doublet sample

  • params.csv: specified parameters in the scSplit task
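The cluster labels in scSplit_result.csv follow the SNG-n / DBL-n pattern described above, so they can be split into a droplet type and a cluster number. This is a small sketch with a hypothetical function name that covers only the label format, not the file layout.

```python
def parse_cluster_label(label):
    """Split a label such as "SNG-0" or "DBL-2" into (type, cluster number)."""
    kind, _, number = label.partition("-")
    if kind not in ("SNG", "DBL"):
        raise ValueError(f"unexpected cluster label: {label!r}")
    return ("singlet" if kind == "SNG" else "doublet", int(number))


print(parse_cluster_label("SNG-0"))
print(parse_cluster_label("DBL-2"))
```

Because the number of the doublet cluster can change between runs, comparing runs by droplet type is safer than comparing raw cluster numbers.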

Souporcell#

output directory: $pipeline_output_folder/souporcell/souporcell_[task_ID/sampleId]

  • alt.mtx: count matrix of alternative alleles

  • ref.mtx: count matrix of reference alleles

  • clusters.tsv: assignment of Souporcell with the cell barcode, singlet/doublet status, cluster, log_loss_singleton, log_loss_doublet, followed by log loss for each cluster.

  • cluster_genotypes.vcf: VCF with genotypes for each cluster for each variant in the input vcf from freebayes

  • ambient_rna.txt: contains the ambient RNA percentage detected

  • params.csv: specified parameters in the Souporcell task

Parameter#

Demuxlet and dsc-pileup#

demuxlet

Whether to run Demuxlet. Default: False

demuxlet_preprocess

Whether to perform pre-processing on the input params.bam for demuxlet. True: Perform pre-processing. Otherwise pre-processing is not called. Default: False

bam

Input SAM/BAM/CRAM file. Must be sorted by coordinates and indexed.

bai

Index of Input SAM/BAM/CRAM file.

barcodes

List of cell barcodes to consider.

tag_group

Tag representing readgroup or cell barcodes, used to partition the BAM file into multiple groups. For 10x Genomics, use CB. Default: CB

tag_UMI

Tag representing UMIs. For 10x Genomics, use UB. Default: UB

sm

List of sample IDs to compare to. Default: None (use all)

vcf_donor

Input VCF/BCF file, containing GT, GP or PL for donors. It also requires the AC and AN field if plp_freemuxlet=True.

sm_list

File containing the list of sample IDs to compare. Default: None

sam_verbose

Verbose message frequency for SAM/BAM/CRAM. Default: 1000000

vcf_verbose

Verbose message frequency for VCF/BCF. Default: 10000

skip_umi

Do not generate [prefix].umi.gz file, which stores the regions covered by each barcode/UMI pair. Default: False

cap_BQ

Maximum base quality (higher BQ will be capped). Default: 40

min_BQ

Minimum base quality to consider (lower BQ will be skipped). Default: 13

min_MQ

Minimum mapping quality to consider (lower MQ will be ignored). Default: 20

min_TD

Minimum distance to the tail (lower will be ignored). Default: 0

excl_flag

SAM/BAM FLAGs to be excluded. Default: 3844

min_total

Minimum number of total reads for a droplet/cell to be considered. Default: 0

min_uniq

Minimum number of unique reads (determined by UMI/SNP pair) for a droplet/cell to be considered. Default: 0

min_snp

Minimum number of SNPs with coverage for a droplet/cell to be considered. Default: 0

min_umi

Minimum number of UMIs for a droplet/cell to be considered. Default: 0

plp

Whether to call dsc-pileup. If set to True, dsc-pileup will be called. If set to False, the SAM file is used directly to call Demuxlet. Default: False

field

FORMAT field to extract the genotype, likelihood, or posterior from. Default: GT

geno_error_offset

Offset of genotype error rate. [error] = [offset] + [1-offset]*[coeff]*[1-r2]. Default: 0.1

geno_error_coeff

Slope of genotype error rate. [error] = [offset] + [1-offset]*[coeff]*[1-r2]. Default: 0.0

r2_info

INFO field name representing R2 value. Used for representing imputation quality. Default: R2

min_mac

Minimum minor allele count. Default: 1

min_callrate

Minimum call rate. Default: 0.5

alpha

Grid of alpha to search for. Default: 0.5

doublet_prior

Prior of doublet. Default: 0.5

demuxlet_out

Prefix of the demuxlet and dsc-pileup output files. Default: demuxlet_res
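The genotype error rate described for geno_error_offset and geno_error_coeff can be worked through numerically. The sketch below reads the bracketed terms of the formula as a product, error = offset + (1 - offset) * coeff * (1 - r2). That reading, the function name, and the example r2 and coeff values are assumptions for illustration.

```python
def genotype_error(offset=0.1, coeff=0.0, r2=1.0):
    """Genotype error rate: offset + (1 - offset) * coeff * (1 - r2)."""
    return offset + (1.0 - offset) * coeff * (1.0 - r2)


# With the documented defaults the slope is 0, so r2 has no effect.
default_error = genotype_error(r2=0.8)

# A non-zero slope raises the error for poorly imputed variants (low r2).
scaled_error = genotype_error(coeff=0.5, r2=0.8)

print(default_error)
print(round(scaled_error, 4))
```

With the default settings the error rate stays at the offset for every variant, so the R2 INFO field only matters once geno_error_coeff is set above zero.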

Freemuxlet and dsc-pileup#

freemuxlet

Whether to run Freemuxlet. Default: True

freemuxlet_preprocess

Whether to perform pre-processing on the input params.bam for Freemuxlet. True: Perform pre-processing. Otherwise pre-processing is not called. Default: False

bam

Input SAM/BAM/CRAM file. Must be sorted by coordinates and indexed.

bai

Index of Input SAM/BAM/CRAM file.

barcodes

List of cell barcodes to consider.

nsamples_genetic

Number of samples multiplexed together

tag_group

Tag representing readgroup or cell barcodes, used to partition the BAM file into multiple groups. For 10x Genomics, use CB. Default: CB

tag_UMI

Tag representing UMIs. For 10x Genomics, use UB. Default: UB

common_variants_freemuxlet

Input VCF/BCF file for dsc-pileup, containing the AC and AN field.

sm

List of sample IDs to compare to. Default: None (use all)

sm_list

File containing the list of sample IDs to compare. Default: None

sam_verbose

Verbose message frequency for SAM/BAM/CRAM. Default: 1000000

vcf_verbose

Verbose message frequency for VCF/BCF. Default: 10000

skip_umi

Do not generate [prefix].umi.gz file, which stores the regions covered by each barcode/UMI pair. Default: False

cap_BQ

Maximum base quality (higher BQ will be capped). Default: 40

min_BQ

Minimum base quality to consider (lower BQ will be skipped). Default: 13

min_MQ

Minimum mapping quality to consider (lower MQ will be ignored). Default: 20

min_TD

Minimum distance to the tail (lower will be ignored). Default: 0

excl_flag

SAM/BAM FLAGs to be excluded. Default: 3844

min_total

Minimum number of total reads for a droplet/cell to be considered. Default: 0

min_uniq

Minimum number of unique reads (determined by UMI/SNP pair) for a droplet/cell to be considered. Default: 0

min_umi

Minimum number of UMIs for a droplet/cell to be considered. Default: 0

min_snp

Minimum number of SNPs with coverage for a droplet/cell to be considered. Default: 0

init_cluster

Input file containing the initial cluster information. Default: None

aux_files

Turn on writing auxiliary output files. Default: False

verbose

Turn on verbose mode with a specific verbosity threshold. 0: fully verbose, 100: no verbose messages. Default: 100

doublet_prior

Prior of doublet. Default: 0.5

bf_thres

Bayes Factor Threshold used in the initial clustering. Default: 5.41

frac_init_clust

Fraction of droplets to be clustered in the very first round of initial clustering procedure. Default: 0.5

iter_init

Iteration for initial cluster assignment (set to zero to skip the iterations). Default: 10

keep_init_missing

Keep missing cluster assignment as missing in the initial iteration. Default: False

freemuxlet_out

Prefix of the freemuxlet and dsc-pileup output files. Default: freemuxlet_out

Vireo#

vireo

Whether to run Vireo. Default: True

vireo_preprocess

Whether to perform pre-processing on the input params.bam for cellSNP-lite. True: Perform pre-processing. Otherwise pre-processing is not called. Default: False

vireo_variant

Whether to perform cellSNP-lite before running Vireo. True: Run cellSNP-lite. Otherwise cellSNP-lite is not called and params.celldata is used as input. Default: True

celldata

The cell genotype file in VCF format or cellSNP folder with sparse matrices.

nsamples_genetic

Number of donors to demultiplex; can be larger than provided in vcf_donor

vartrixData

The cell genotype files in vartrix outputs (three/four files, comma separated): alt.mtx,ref.mtx,barcodes.tsv,SNPs.vcf.gz. This will suppress the celldata argument. Default: None

vcf_donor

The donor genotype file in VCF format. Default: None

genoTag

The tag for donor genotype: GT, GP, PL. Default: GT

noDoublet

If set, doublets are not checked. Default: False

nInit

Number of random initializations, when GT needs to learn. Default: 50

extraDonor

Number of extra donor in pre-cluster, when GT needs to learn. Default: 0

extraDonorMode

Method for searching from extra donors. size: n_cell per donor; distance: GT distance between donor. Default: distance

forceLearnGT

If set, treat donor GT as prior only. Default: False

ASEmode

If set, turn on SNP-specific allelic ratio. Default: False

noPlot

If set, turn off plotting GT distance. Default: False

randSeed

Seed for random initialization. Default: None

cellRange

Range of cells to process, eg. 0-10000. Default: all

callAmbientRNAs

If set, detect ambient RNAs in each cell. Default: False

nproc

Number of subprocesses for computing, sacrifices memory for speedups. Default: 4

vireo_out

Directory for output files. Default: vireo_out

scSplit#

scSplit

Whether to run scSplit. Default: True

scSplit_preprocess

Whether to perform pre-processing on the input params.bam for Freebayes and scSplit. True: Perform pre-processing. Otherwise pre-processing is not called. Default: True

scSplit_variant

Whether to perform Freebayes before running scSplit. True: run Freebayes. Otherwise freebayes is not called and params.vcf_mixed is used as input. Default: True

vcf_mixed

VCF from mixed BAM. Default: None

bam

Input Mixed sample BAM.

bai

Index of mixed sample BAM.

barcodes

Barcodes whitelist.

tag_group

Tag for barcode. Default: CB

common_variants_scSplit

Common SNVs for scSplit.

nsamples_genetic

Expected number of mixed samples.

refscSplit

Output Ref count matrix. Default: ref_filtered.csv

altscSplit

Output Alt count matrix. Default: alt_filtered.csv

subscSplit

The maximum number of subpopulations in autodetect mode. Default: 10

emsscSplit

Number of EM repeats to avoid local maximum. Default: 30

dblscSplit

Correction for doublets, Setting to 0 means you would expect no doublets. There will be no refinement on the results if this optional parameter is not specified or specified percentage is less than doublet rates detected during the run. Default: None

vcf_donor

Known individual genotypes to limit distinguishing variants to available variants, so that users do not need to redo genotyping on selected variants.

sample_geno

Whether to generate sample genotypes based on the split result. Default: True

scsplit_out

Directory for scSplit output files. Default: scsplit_out

Souporcell#

souporcell

Whether to run Souporcell. Default: True

souporcell_preprocess

Whether to perform pre-processing on the input params.bam for Souporcell. True: Perform pre-processing. Otherwise pre-processing is not called. Default: False

bam

Cellranger bam.

bai

Index of cellranger bam.

barcodes

Barcodes.tsv from cellranger

fasta

Reference fasta file.

fasta_index

Index of reference fasta file.

nsamples_genetic

Number of clusters in the BAM file.

threads

Max threads to use. Default: 5

ploidy

Ploidy, must be 1 or 2. Default: 2

min_alt

Min alt to use locus. Default: 10

min_ref

Min ref to use locus. Default: 10

max_loci

Max loci per cell, affects speed. Default: 2048

restarts

Number of restarts in clustering, when there are > 12 clusters we recommend increasing this to avoid local minima. Default: None

common_variants_souporcell

Common variant loci or known variant loci vcf, must be vs same reference fasta.

use_known_genotype

Whether to use known donor genotype. Default: True

vcf_donor

Known variants per clone in population vcf mode, must be VCF file.

known_genotypes_sample_names

Which samples in population vcf from known genotypes option represent the donors in your sample. Default: None

skip_remap

Don’t remap with minimap2; not recommended unless used in conjunction with common variants. Default: True

ignore

Set to True to ignore data error assertions. Default: False

souporcell_out

Directory for Souporcell output files. Default: souporcell_out

cellSNP-lite#

bam

An indexed SAM/BAM file; multiple samples are comma separated.

barcodes

A plain-text file listing all effective cell barcodes.

common_variants_cellsnp

A VCF file listing all candidate SNPs to fetch.

targetsVCF

Similar as regionsVCF, but the next position is accessed by streaming rather than indexing/jumping. Default: None

sampleList

A list file containing sample IDs, each per line. Default: None

sampleIDs

Comma separated sample ids. Default: None

genotype_cellSNP

If set, do genotyping in addition to counting. Default: True

gzip_cellSNP

If set, the output files will be zipped into BGZF format. Default: True

printSkipSNPs

If set, the SNPs skipped when loading VCF will be printed. Default: False

nproc_cellSNP

Number of subprocesses. Default: 10

refseq_cellSNP

Faidx indexed reference sequence file. If set, the real (genomic) ref extracted from this file would be used for Mode 2 or for the missing REFs in the input VCF for Mode 1. Default: None.

chrom

The chromosomes to use, comma separated. Default: None (1-22)

cellTAG

Tag for cell barcodes, turn off with None. Default: CB

UMItag

Tag for UMI: UB, Auto, None. For Auto mode, use UB if barcodes are inputted, otherwise use None. None mode means no UMI but read counts. Default: Auto

minCOUNT

Minimum aggregated count. Default: 20

minMAF

Minimum minor allele frequency. Default: 0.0

doubletGL

If set, keep doublet GT likelihood. Default: False

inclFLAG

Required flags: skip reads with all mask bits unset []. Default: None

exclFLAG

Filter flags: skip reads with any mask bits set [UNMAP,SECONDARY,QCFAIL (when use UMI) or UNMAP,SECONDARY,QCFAIL,DUP (otherwise)]. Default: None

minLEN

Minimum mapped length for read filtering. Default: 30

minMAPQ

Minimum MAPQ for read filtering. Default: 20

maxDEPTH

Maximum depth for one site of one file (excluding those filtered reads), avoids excessive memory usage; 0 means highest possible value. Default: 0

countORPHAN

If set, do not skip anomalous read pairs. Default: False

cellsnp_out

Directory for cellSNP-lite output files. Default: cellSNP_out

Freebayes#

bam

Input BAM file to be analyzed.

bai

Index of input BAM file to be analyzed.

fasta

A reference sequence for analysis.

fasta_index

The index of the reference sequence for analysis.

stdin

Read BAM input on stdin. Default: False

targets

Limit analysis to targets listed in the BED-format file. Default: None

region

Limit analysis to the specified chromosome region, 0-based coordinates. If set to None, all chromosomes are considered. Default: None.

samples

Limit analysis to samples listed (one per line) in the file. By default FreeBayes will analyze all samples in its input BAM files. Default: None

populations

Each line of FILE should list a sample and a population which it is part of. The population-based bayesian inference model will then be partitioned on the basis of the populations. Default: None

cnv_map

Read a copy number map from the BED file. Default: None

vcf_freebayes

Name of the output VCF file; must end with .vcf. Default: vcf_freebayes_output.vcf

gvcf

Write gVCF output, which indicates coverage in uncalled regions. Default: False

gvcf_chunk

When writing gVCF output emit a record for every specified number of bases. Default: None

gvcf_dont_use_chunk

When writing gVCF output don’t emit a record for every specified number of bases. Default: None

variant_input

Use variants reported in the VCF file as input to the algorithm. Variants in this file will be included in the output even if there is not enough support in the data to pass input filters. Default: None

only_use_input_alleles

Only provide variant calls and genotype likelihoods for sites and alleles which are provided in the VCF input, and provide output in the VCF for all input alleles, not just those which have support in the data. Default: False

haplotype_basis_alleles

When specified, only variant alleles provided in this input VCF will be used for the construction of complex or haplotype alleles. Default: None

report_all_haplotype_alleles

At sites where genotypes are made over haplotype alleles, provide information about all alleles in output, not only those which are called. Default: False

report_monomorphic

Report even loci which appear to be monomorphic, and report all considered alleles, even those which are not in called genotypes. Default: False

pvar

Report sites if the probability that there is a polymorphism at the site is greater than N. Default: 0.0

strict_vcf

Generate strict VCF format (FORMAT/GQ will be an int). Default: False

theta

The expected mutation rate or pairwise nucleotide diversity among the population under analysis. This serves as the single parameter to the Ewens Sampling Formula prior model. Default: 0.001

ploidy

Sets the default ploidy for the analysis. Default: 2

pooled_discrete

Assume that samples result from pooled sequencing. Model pooled samples using discrete genotypes across pools. When using this flag, set --ploidy to the number of alleles in each sample or use the --cnv-map to define per-sample ploidy. Default: False

pooled_continuous

Output all alleles which pass input filters, regardless of genotyping outcome or model. Default: False

use_reference_allele

This flag includes the reference allele in the analysis as if it is another sample from the same population. Default: False

reference_quality

Assign mapping quality to the reference allele at each site and base quality. Default: 100,60

no_snps

Ignore SNP alleles. Default: False

no_indels

Ignore insertion and deletion alleles. Default: True

no_mnps

Ignore multi-nucleotide polymorphisms, MNPs. Default: True

no_complex

Ignore complex events (composites of other classes). Default: True

use_best_n_alleles

Evaluate only the best N SNP alleles, ranked by sum of supporting quality scores. Set to 0 to use all. Default: 0

haplotype_length

Allow haplotype calls with contiguous embedded matches of up to this length. Set N=-1 to disable clumping. Default: 3

min_repeat_size

When assembling observations across repeats, require the total repeat length at least this many bp. Default: 5

min_repeat_entropy

To detect interrupted repeats, build across sequence until it has entropy > N bits per bp. Set to 0 to turn off. Default: 1

no_partial_observations

Exclude observations which do not fully span the dynamically-determined detection window. Default: None, to use all observations, dividing partial support across matching haplotypes when generating haplotypes.

dont_left_align_indels

Turn off left-alignment of indels, which is enabled by default. Default: False

use_duplicate_reads

Include duplicate-marked alignments in the analysis. Default: False, to exclude duplicates marked as such in alignments

min_mapping_quality

Exclude alignments from analysis if they have a mapping quality less than Q. Default: 1

min_base_quality

Exclude alleles from analysis if their supporting base quality is less than Q. Default: 1

min_supporting_allele_qsum

Consider any allele in which the sum of qualities of supporting observations is at least Q. Default: 0

min_supporting_mapping_qsum

Consider any allele in which the sum of mapping qualities of supporting reads is at least Q. Default: 0

mismatch_base_quality_threshold

Count mismatches toward --read-mismatch-limit if the base quality of the mismatch is >= Q. Default: 10

read_mismatch_limit

Exclude reads with more than N mismatches where each mismatch has base quality >= mismatch-base-quality-threshold. Default: None, ~unbounded

read_max_mismatch_fraction

Exclude reads with more than N [0,1] fraction of mismatches where each mismatch has base quality >= mismatch-base-quality-threshold. Default: 1.0

read_snp_limit

Exclude reads with more than N base mismatches, ignoring gaps with quality >= mismatch-base-quality-threshold. Default: None, ~unbounded

read_indel_limit

Exclude reads with more than N separate gaps. Default: None, ~unbounded

standard_filters

Use stringent input base and mapping quality filters equivalent to -m 30 -q 20 -R 0 -S 0. Default: False

min_alternate_fraction

Require at least this fraction of observations supporting an alternate allele within a single individual in order to evaluate the position. Default: 0.05

min_alternate_count

Require at least this count of observations supporting an alternate allele within a single individual in order to evaluate the position. Default: 2

min_alternate_qsum

Require at least this sum of quality of observations supporting an alternate allele within a single individual in order to evaluate the position. Default: 0

min_alternate_total

Require at least this count of observations supporting an alternate allele within the total population in order to use the allele in analysis. Default: 1

min_coverage

Require at least this coverage to process a site. Default: 0

max_coverage

Do not process sites with greater than this coverage. Default: None, no limit

no_population_priors

Equivalent to --pooled-discrete --hwe-priors-off and removal of Ewens Sampling Formula component of priors. Default: False

hwe_priors_off

Disable estimation of the probability of the combination arising under HWE given the allele frequency as estimated by observation frequency. Default: False

binomial_obs_priors_off

Disable incorporation of prior expectations about observations. Uses read placement probability, strand balance probability, and read position (5’-3’) probability. Default: False

allele_balance_priors_off

Disable use of aggregate probability of observation balance between alleles as a component of the priors. Default: False

observation_bias

Read length-dependent allele observation biases from the file. Default: None

base_quality_cap

Limit estimated observation quality by capping base quality at Q. Default: None

prob_contamination

An estimate of contamination to use for all samples. Default: 10e-9

legacy_gls

Use legacy (polybayes equivalent) genotype likelihood calculations. Default: False

contamination_estimates

A file containing per-sample estimates of contamination, such as those generated by VerifyBamID. Default: None

report_genotype_likelihood_max

Report genotypes using the maximum-likelihood estimate provided from genotype likelihoods. Default: False

genotyping_max_iterations

Iterate no more than N times during genotyping step. Default: 1000

genotyping_max_banddepth

Integrate no deeper than the Nth best genotype by likelihood when genotyping. Default: 6

posterior_integration_limits

Integrate all genotype combinations in our posterior space which include no more than N samples with their Mth best data likelihood. Default: 1,3

exclude_unobserved_genotypes

Skip sample genotypings for which the sample has no supporting reads. Default: False

genotype_variant_threshold

Limit posterior integration to samples where the second-best genotype likelihood is no more than log(N) from the highest genotype likelihood for the sample. Default: None, ~unbounded

use_mapping_quality

Use mapping quality of alleles when calculating data likelihoods. Default: False

harmonic_indel_quality

Use a weighted sum of base qualities around an indel, scaled by the distance from the indel. Default: False, use a minimum BQ in flanking sequence.

read_dependence_factor

Incorporate non-independence of reads by scaling successive observations by this factor during data likelihood calculations. Default: 0.9

genotype_qualities

Calculate the marginal probability of genotypes and report as GQ in each sample field in the VCF output. Default: False

debug

Print debugging output. Default: False

dd

Print more verbose debugging output (requires “make DEBUG”). Default: False