## Installation

We now provide a ready-to-run Docker container that includes the package and all prerequisites (see section Docker). Alternatively, you can follow the installation procedure below.

Numbat uses cellsnp-lite for generating SNP pileup data and eagle2 for phasing. Please follow their installation instructions and make sure their binary executables can be found in your \$PATH.

Additionally, Numbat needs a common SNP VCF and phasing reference panel. You can use the 1000 Genome reference below:

1. 1000G SNP VCF

# hg38
wget https://sourceforge.net/projects/cellsnp/files/SNPlist/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf.gz
# hg19
wget https://sourceforge.net/projects/cellsnp/files/SNPlist/genome1K.phase3.SNP_AF5e2.chr1toX.hg19.vcf.gz
2. 1000G Reference Panel

# hg38
wget http://pklab.med.harvard.edu/teng/data/1000G_hg38.zip
# hg19
wget http://pklab.med.harvard.edu/teng/data/1000G_hg19.zip

You can install the numbat R package via CRAN:

install.packages('numbat', dependencies = TRUE)

Alternatively, you can install the GitHub version:

devtools::install_github("https://github.com/kharchenkolab/numbat")

### Docker

We provide a ready-to-run Docker container that includes the Numbat R package and all of its dependencies. You can launch it as follows:

docker run -v /work:/mnt/mydata -it pkharchenkolab/numbat-rbase:latest /bin/bash

where /work is a local data folder you would like to access and write to. Within the container, cellsnp-lite, eagle2 and samtools are available in path and the 1000 Genome SNP VCF and phasing panel files (hg38) are stored under /data. You can also launch R and run interactive analysis using Numbat (or run an R script).

## Preparing data

1. Prepare the allele data. Run the preprocessing script (numbat/inst/bin/pileup_and_phase.R) to count alleles and phase SNPs.
usage: pileup_and_phase.R [-h] --label LABEL --samples SAMPLES --bams BAMS
[--barcodes BARCODES] --gmap GMAP [--eagle EAGLE]
--snpvcf SNPVCF --paneldir PANELDIR --outdir OUTDIR
--ncores NCORES [--UMItag UMITAG]
[--cellTAG CELLTAG] [--smartseq] [--bulk]

Run SNP pileup and phasing with 1000G

Arguments:
-h, --help           show this help message and exit
--label LABEL        Individual label. One per run.
--samples SAMPLES    Sample name(s); comma delimited if multiple.
All samples must belong to the same individual.
--bams BAMS          BAM file(s); one per sample, comma delimited if multiple.
--barcodes BARCODES  Cell barcode file(s); one per sample,
comma delimited if multiple.
--gmap GMAP          Path to genetic map provided by Eagle2 (e.g.
Eagle_v2.4.1/tables/genetic_map_hg38_withX.txt.gz)
--eagle EAGLE        Path to Eagle2 binary file
--snpvcf SNPVCF      SNP VCF for pileup
--paneldir PANELDIR  Directory to phasing reference panel (BCF files)
--outdir OUTDIR      Output directory
--ncores NCORES      Number of cores
--smartseq           Running with SMART-seq mode; Supply a txt file containing
directories of BAM files to --bams and a txt file
containing cell names to --barcodes (each entry on its
own line for both; ordering must match).

For example, within the Numbat Docker container you can run the preprocessing script like this:

Rscript /numbat/inst/bin/pileup_and_phase.R \
--label {sample} \
--samples {sample} \
--bams /mnt/mydata/{sample}.bam \
--barcodes /mnt/mydata/{sample}_barcodes.tsv \
--outdir /mnt/mydata/{sample} \
--gmap /Eagle_v2.4.1/tables/genetic_map_hg38_withX.txt.gz \
--snpvcf /data/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf \
--paneldir /data/1000G_hg38 \
--ncores ncores

Important: If your 10x bam is multiplexed (containing cells from multiple individuals), please only provide cell barcodes for a single individual in each genotyping run.

This will produce a file named {sample}_allele_counts.tsv.gz under the specified output directory, which contains cell-level phased allele counts. If multiple samples from the same individual were provided, there will be one allele count file for each sample. Other outputs include phased vcfs under phasing/ folder and raw pileup counts under pileup/.

1. Prepare the expression data. Numbat takes a gene by cell integer UMI count matrix as input. You can directly use results from upstream transcriptome quantification pipelines such as 10x CellRanger.

2. Prepare the expression reference, which is a gene by cell type matrix of normalized expression values (raw gene counts divided by total counts). For a quick start, you may use a our HCA collection (ref_hca) that ships with the package. If you have matched normal cells (ideally, of various cell type) from the same patient or dataset and would like to make your own references, you may use this utility function:

# count_mat is a gene x cell raw count matrices
# cell_annot is a dataframe with columns "cell" and "group"
ref_internal = aggregate_counts(count_mat, cell_annot)

## Running Numbat

In this example (ATC2 from Gao et al), the gene expression count matrix and allele dataframe are already prepared for you.

library(numbat)

# run
out = run_numbat(
count_mat_ATC2, # gene x cell integer UMI count matrix
ref_hca, # reference expression profile, a gene x cell type normalized expression level matrix
df_allele_ATC2, # allele dataframe generated by pileup_and_phase script
genome = "hg38",
t = 1e-5,
ncores = 4,
plot = TRUE,
out_dir = './test'
)

Note: If you wish to use your own custom reference, please use the aggregate_counts function as per the example in preparing data. You do not need to include the reference cells in count_mat or df_allele; only provide them as lambdas_ref.

### Run parameters

There are a few parameters you can consider tuning to a specific dataset.

CNV detection

• t: the transition probability used in the HMM. A lower t is more appropriate for tumors with more complex copy number landscapes (from which you can expect more breakpoints) and is sometimes better for detecting subclonal events. A higher t is more effective for controlling false-positive rates of CNV calls.
• gamma: Overdispersion in allele counts (default: 20). For 10x data, 20 is recommended. Non-UMI protocols (e.g. SMART-Seq) usually produce noisier allele data, and a smaller value of gamma is recommended (e.g. 5).
• min_cells: minimum number of cells for which an pseudobulk HMM will be run. If the allele coverage per cell is very sparse for a dataset, then I would consider setting this threshold to be higher.
• multi_allelic: Whether to enable calling of multiallelic CNVs

CNV filtering

• min_LLR: minimum log-likelihood ratio threshold to filter CNVs (default: 5). To ensure quality of phylogeny inference, we only use confident CNVs to reconstruct the phylogeny.
• max_entropy: another criteria that we use to filter CNVs before phylogeny construction (default: 0.5). The entropy of the binary posterior quantifies the uncertainty of an event across single cells. The value can be from 0 to 1 where 1 is the least stringent.

Phylogeny

• tau: Stringency to simplify the mutational history (0-1). A higher tau produces less clones and a more simplified evolutionary history.

Iterative optimization

• init_k: initial number of subclusters to use for the hclust initialization. Numbat by default uses hierarchical clustering (hclust) of smoothed expression values to approximate an initial phylogeny. This will cut the initial tree into k clusters. More clusters means more resolution at the initial stage for subclonal CNV detection. By default, we set init_k to be 3.
• max_iter: maximum number of iterations. Numbat iteratively optimizes the phylogeny and copy number estimations. In practice, we find that results after 2 iterations are usually stable.
• check_convergence: stop iterating if the results have converged (based on consensus CNV segments).

Parallelization

• ncores: number of cores to use for single-cell CNV testing
• ncores_nni: number of cores to use for phylogeny inference

### Detecting clonal LOH

For cell line data and high-purity tumors, we recommend running the below procedure to identify and exclude regions of clonal deletions/LOH before running Numbat. To do so, aggregate all cells into a pseudobulk, and run the SNP-density HMM:

bulk = get_bulk(
count_mat = count_mat,
lambdas_ref = ref_hca,
df_allele = df_allele,
gtf = gtf_hg38
)

segs_loh = bulk %>% detect_clonal_loh(t = 1e-4)

Then pass segs_loh to the Numbat run for those regions to be excluded from baseline during analysis.

out = run_numbat(
....,
segs_loh = segs_loh
)

## Understanding results

A detailed vignette on how to interpret and visualize Numbat results is available:
- Interpreting Numbat results

Numbat generates a number of files in the output folder. The file names are post-fixed with the ith iteration of phylogeny optimization. Here is a detailed list:

• gexp_roll_wide.tsv.gz: window-smoothed normalized expression profiles of single cells
• hc.rds: initial hierarchical clustering result based on smoothed expression
• exp_roll_clust.png: visualization of single-cell smoothed gene expression profiles
• bulk_subtrees_{i}.tsv.gz: subtree pseudobulk data based on current cell lineage tree
• bulk_subtrees_{i}.png: visualization of subtree pseudobulk CNV profiles
• segs_consensus_{i}.tsv.gz: consensus segments from subtree pseudobulk HMMs
• bulk_clones_{i}.tsv.gz: clone-leve pseudobulk data based on current cell lineage tree
• bulk_clones_{i}.png: visualization of clone pseudobulk CNV profiles
• bulk_clones_final.tsv.gz: clone-leve pseudobulk data based on final cell lineage tree
• bulk_clones_final.png: visualization of final clone pseudobulk CNV profiles
• exp_post_{i}.tsv: single-cell expression posteriors
• allele_post_{i}.tsv: single-cell allele posteriors
• joint_post_{i}.tsv: single-cell joint posteriors
• clone_post_{i}.tsv: single-cell clone assignment and tumor versus normal classification posteriors
• tree_list_{i}.rds: list of candidate phylogeneies in the maximum likelihood tree search
• panel_{i}.png: integrated visualization of single-cell phylogeny and CNV landscape

## Phasing options

The CNV detection power of Numbat can be further enhanced by conducting population-based phasing with a larger and more diverse reference panel (i.e. reference haplotypes). The default pipeline above uses the 1000 Genome panel, which contains genotypes from 2,504 individuals. Larger reference panels include:

• gnomAD HGDP + 1KG panel (n=4,099). You can download the reference files using gsutil: gs://gcp-public-data--gnomad/resources/hgdp_1kg/phased_haplotypes.

• TOPMed panel (n=97,256). You can upload your VCF to the TOPMed imputation server.

## FAQ

Q: We expect a certain CNV to be in a particular sample, but it is not detected by Numbat.

A: In general, there are three scenarios that a CNV is not called.

1. The CNV is not called in pseudobulk HMM. You can check bulk_subtrees_*.png to see if the event was called. If it was not, I would suggest trying to find a better expression reference to reduce the noise in logFC, or changing the parameters used to configure the HMM (gamma, t, etc). Sometimes the CNV is very subclonal and was not isolated as a pseudobulk by the initial clustering. You can see if this was the case by looking at the exp_roll_clust.png and if so, try increasing k_init.

2. CNV is called in pseudobulks but did not appear in segs_consensus_*.tsv and joint_post_*tsv. This is because the event is filtered out due to low LLR. Lowering min_LLR threshold will help.

3. CNV is called in pseudobulks, is present in segs_consensus_*.tsv and joint_post_*tsv, but did not appear in phylogeny. This is because the event is filtered out due to high entropy (weak evidence) in single cells. Raising max_entropy threshold will help. You can check the entropy of specific events in joint_post_*tsv in the column avg_entropy.

Q: Numbat predicts most of the genome to be amplified, and there are gaps in the pseudobulk allele profile with no heterozygous SNPs.

A: This is a sign that the sample consists mostly of tumor (with very few normal cells), and clonal deletions are present. The heterozygous SNPs cannot be identified in these regions, therefore leaving gaps. You can identify these clonally deleted regions via the SNP-density HMM prior to running the main run_numbat workflow (see section Detecting Clonal LOH).

Q: Does Numbat run on mouse data?

A: In principle, Numbat should work for a hybrid F1 mouse bred from two known lab mice strains. Since the parental haplotypes are known, you can just supply a VCF containing heterozygous sites and run cell-snp-lite to create the allele count dataframe.