
Preparing Inputs for Numbat-multiome
Ruitong Li, Tina Keshavarzian and Teng Gao
2025-06-25
Source:vignettes/numbat-multiome.Rmd
numbat-multiome.Rmd
Introduction
This vignette provides a guide to preparing input data for running Numbat-multiome, which extends the capabilities of Numbat to perform CNV inference from accessible chromatin profiles (through scATAC-seq data) alone, as well as mixed modality (separate scRNA-seq and scATAC-seq datasets of the same sample) in an integrated manner. Note that joint multiome assays such as the 10x Multiome (direct linkage of gene expression and open chromatin readouts from the same nucleus) is not yet fully supported (currently under development).
The core inputs required are:
- Reference feature matrix: feature (gene or genomic bin) by cell‑type normalized count matrix.
- GTF file: annotation mapping features to genomic locations, used to connect SNPs and features.
- Feature count matrix: raw count matrix (RNA: cell-by-gene/bin matrix filled with integer counts of UMI barcodes; ATAC: cell-by-bin matrix filled with aggregated integer counts of individual Tn5-accessible chromatin fragments, defined by unique combination of chromosome:start-end genomic coordinates).
- Phased SNP genotype data: single‑cell phased genotype of SNP (TSV.GZ).
This document covers two preparation modes:
-
Combined-bin
: Joint RNA and ATAC CNV analysis using binned inputs. -
ATAC-bin
: ATAC‑based CNV analysis using binned accessibility data.
Combined-bin: Combined RNA and ATAC CNV analysis
1. Prepare SNP allele data for both modalities
nc=8 # number of cores to use
sample="MM1"
### default to path in numbat image
phase_panel="/data/1000G_hg38"
vcf_genome1k="/data/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf"
gma_gz="/Eagle_v2.4.1/tables/genetic_map_hg38_withX.txt.gz"
Rscript pileup_and_phase.R \
--label ${sample} \
--samples ${sample} \
--bams ${sample}_rna.bam,${sample}_atac.bam \
--barcodes ${sample}_rna_barcodes.tsv,${sample}_atac_barcodes.tsv \
--gmap ${gma_gz} \
--snpvcf ${vcf_genome1k} \
--paneldir ${phase_panel} \
--ncores ${nc} \
--cellTAG CB \
--UMItag Auto,None \
--outdir ${sample}/${sample}_comb_allele_counts.tsv.gz
2. Prepare binned RNA & ATAC inputs
Generate gene-to-bin mapping
You can download a premade genomic bin object here: var220kb.rds
binGR="var220kb.rds" # path to grange object containing genomic bins
gtfF="gtf_hg38.gtf" # any version of gtf files you use or hg38, hg19 or mm10
Rscript get_gene_binned_intersections.R \
--numbatGTFname ${gtfF} \
--binGR ${binGR} \
--outfile gene2bin_map.csv
Here, you can either supply a gtf-format file if you have a
customized version, if you prefer using the default numbat gtf files,
you can simply use hg38
, hg19
or
mm10
as the numbatGTFname
argument.
Generate binned expression counts
Rscript get_binned_rna.R \
--rnaCountsFile ${sample}_seu.rds \
--outFile ${sample}/${sample}_rna_bin.rds \
--barcodesKeep ${sample}_barcodes.tsv \
--geneBinMapCSVFile gene2bin_map.csv
Argument | Accepted types / format |
---|---|
--rnaCountsFile |
.rds file containing either: • a Seurat object (extracts @assays[[1]]@counts )• a raw counts matrix ( dgCMatrix etc.)• .h5 file from 10x (uses Read10X_h5 ) |
--geneBinMapCSVFile |
CSV with at least columns
gene_name and bingrp (e.g. output of your
get_gene_binned_intersections.R ). |
--barcodesKeep |
- A TSV/CSV text file with one barcode
per line (first column). - Or an R data.frame containing one column of barcodes. |
--outFile |
Path for saving the binned-by-cell matrix (always as an
RDS of a sparse dgCMatrix ). |
--generateAggRef |
(flag) if included, also creates a Numbat-style aggregated reference matrix. |
Generate binned ATAC counts
binGR="var220kb.rds"
Rscript get_binned_atac.R \
--CB ${sample}_atac_barcodes.tsv \
--frag ${sample}_fragments.tsv.gz \
--binGR ${binGR} \
--outFile ${sample}/${sample}_atac_bin.rds
Argument | Accepted types / format |
---|---|
--CB |
TSV/CSV text file listing cell barcodes (one per line, first column). |
--frag |
Fragments file in BED-style TSV or
.gz , with columns: chrom, start, end, barcode,
duplicate-count. |
--binGR |
RDS file containing a GRanges object
of genomic bins (e.g. var220kb.rds ). |
--outFile |
Output path—if .tsv , writes a
tab-delimited table; if .rds , saves as an RDS matrix. |
--generateAggRef |
(flag) if included, also runs Numbat’s
aggregate_counts on the bin×cell matrix and writes that
result. |
Generate combined count matrix
source("input_prep.R")
saveRDS(binCnt(c(glue("${sample}/{sample}_rna_bin.rds"),
glue("${sample}/{sample}_atac_bin.rds")),seed=123,maxCB=10000),
glue("{sample}/{sample}_comb_bincnt.rds"))
You can load in an interactive R session in your terminal or open
other IDEs to apply function binCnt()
which combines the
binned RNA and ATAC count matrices into a single matrix which also did a
subsampling of cells to a maximum of 10,000 cells per modality, you can
specify the seed and maximum allowed cell number in the parameters.
3. Generate combined reference
We generated binned counts from scRNA-seq and scATAC-seq data of normal samples.
binGR="var220kb.rds"
refsample="normal1"
Rscript get_binned_rna.R \
--rnaCountsFile ref_seu.rds \
--outFile Reference/lambdas_RNA_bincnt.rds \
--barcodesKeep ref_barcodes.tsv \
--geneBinMapCSVFile gene2bin_map.csv \
--generateAggRef
Rscript get_binned_atac.R \
--CB ${refsample}_atac_barcodes.tsv \
--frag ${refsample}_fragments.tsv.gz \
--binGR $binGR \
--outFile Reference/lambdas_RNA_bincnt.rds \
--generateAggRef
If you have relatively high number of cells for multiple normal
samples, you may consider running agg_refs
, a wrapper
function of the original aggrerate_count()
function in
numbat but with subsampling max number of cells from each normal
sample.
First we can set up an array of normal samples to iterate through bash array
# Define array of normal sample names
normal_samples=("normal1" "normal2" "normal3") # <-- Replace with your sample names
# Path to shared binGR file
binGR="var220kb.rds"
# Loop through each sample
for refsample in "${normal_samples[@]}"; do
echo "Processing $refsample..."
Rscript get_binned_atac.R \
--CB "${refsample}_atac_barcodes.tsv" \
--frag "${refsample}_fragments.tsv.gz" \
--binGR "$binGR" \
--outFile "${refsample}/${refsample}_atac_bin.rds"
done
Specify refsamples
as a vector of normal sample names,
e.g., refsamples <- c("normal1", "normal2", "normal3")
.
Then run the following R code to aggregate the references:
source("input_prep.R")
ref_atac <- agg_refs(
paste0(refsamples,"/",refsamples,"_atac_bin.rds") %>%
set_names(refsamples)) %>%
saveRDS("Reference/lambdas_ATAC_bincnt.rds")
After separately generating binned RNA and ATAC reference, we can combine them into a single reference.
Run numbat inference with prepared inputs
binGR="var220kb.rds"
parL="par_numbatm.rds" # a list of any run_numbat parameters you would like to optimize
Rscript run_numbat_multiome.R \
--countmat ${sample}/${sample}_comb_bincnt.rds \
--alleledf ${sample}/${sample}_comb_allele_counts.tsv.gz \
--out_dir ${sample}/paired/ \
--ref Reference/lambdas_comb_bincnt.rds \
--gtf ${binGR}\
--parL ${parL}
ATAC-bin and RNA-bin modes
If you want to run Numbat-multiome in RNA bin and ATAC bin modes, you
can simply use lambdas_RNA_bincnt.rds
and
lambdas_ATAC_bincnt.rds
as the reference files.
${sample}_rna_bin.rds
and
${sample}_atac_bin.tsv
as the count matrix.
ATAC-bin mode
In ATAC-bin mode you perform CNV inference using only the binned ATAC counts and the ATAC reference.
-
Inputs
-
Count matrix:
${sample}_atac_bin.rds
(or.tsv
) fromget_binned_atac.R
-
Reference:
Reference/lambdas_ATAC_bincnt.rds
(generated viaget_binned_atac.R --generateAggRef
oragg_refs()
)
-
Allele VCF:
${sample}/${sample}_comb_allele_counts.tsv.gz
-
Genomic bins:
var220kb.rds
GRanges object
-
Count matrix:
-
Command
Rscript run_numbat_multiome.R \ --countmat ${sample}/${sample}_atac_bin.rds \ --alleledf ${sample}/${sample}_comb_allele_counts.tsv.gz \ --out_dir ${sample}/atac_only/ \ --ref Reference/lambdas_ATAC_bincnt.rds \ --gtf var220kb.rds \ --parL par_numbatm.rds
-
--mode ATAC-bin
skips RNA likelihoods and relies solely on ATAC signals.
-
-
Interpretation
- CNV calls are driven by chromatin-accessibility patterns.
- Ideal when RNA data is unavailable or as a comparison to Combined-bin.
- CNV calls are driven by chromatin-accessibility patterns.
RNA-bin mode
In RNA-bin mode you leverage only the binned RNA counts plus the RNA reference:
Rscript run_numbat_multiome.R \
--countmat ${sample}/${sample}_rna_bin.rds \
--alleledf ${sample}/${sample}_comb_allele_counts.tsv.gz \
--out_dir ${sample}/rna_only/ \
--ref Reference/lambdas_RNA_bincnt.rds \
--gtf var220kb.rds \
--parL par_numbatm.rds
- Use when ATAC coverage is low or to benchmark RNA-only CNV calls.