MitoDrift reconstructs single-cell lineage trees from mitochondrial DNA (mtDNA) mutations by modeling heteroplasmy drift and measurement noise with a Wright–Fisher hidden Markov Tree (WF-HMT). It applies population genetics principles (genetic drift) to model mtDNA heteroplasmy in single cells in order to reconstruct high-precision lineage trees from single-cell genomics/multiome data. MitoDrift uses expectation-maximization (EM) to obtain maximum-likelihood estimates of drift, mutation, and error rates, then performs phylogenetic MCMC to quantify the uncertainty in tree topology. The primary output is a phylogeny with posterior clade supports and refined tree topologies at different levels of confidence. Inputs can be mtDNA allele counts from any single-cell genomics assays that capture mtDNA variation (e.g., ReDeeM, mtscATAC-seq, MAESTER).

Preprint: Teng Gao, Chen Weng, Isaac Johnson, Michael Poeschla, Jonas Gudera, Emily King, Christopher Rouya, Adriana Donovan, Lauren Bourke, Ying Shao, Eladio Marquez, Rahul Tyagi, Leonard I Zon, Jonathan S Weissman, Vijay G Sankaran. Modeling mitochondrial inheritance enables high-precision single-cell lineage tracing in humans. bioRxiv (2026).

Installation

# from a local checkout
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("sankaranlab/mitodrift")

For faster MCMC trace IO, install qs2 with TBB:

remotes::install_cran("qs2", type = "source", configure.args = "--with-TBB --with-simd=AVX2")

Quick start

This uses a pL1000 subset dataset packaged under inst/extdata/ (200 cells, 186 variants).

You can run this from a terminal in the package root:

Rscript inst/bin/run_mitodrift_em.R \
  --mut_dat inst/extdata/pL1000_mut_dat.csv \
  --outdir mitodrift_demo \
  --tree_mcmc_iter 5000 \
  --tree_mcmc_chains 4 \
  --tree_mcmc_burnin 1000

Outputs in mitodrift_demo/ include mitodrift_object.rds plus tree files and diagnostics.

Once the run finishes:

library(mitodrift)
mut_dat <- read.csv(
  system.file("extdata", "pL1000_mut_dat.csv", package = "mitodrift")
)
md <- readRDS("mitodrift_demo/mitodrift_object.rds")
phy_trim <- trim_tree(md$tree_annot, conf = 0.5)

plot_phylo_heatmap2(
  phy_trim,
  mut_dat,
  dot_size = 1,
  branch_width = 0.3,
  branch_length = FALSE,
  node_conf = TRUE,
  het_max = 1
)

Input data requirements

Long format (`mut_dat`)

One row per cell–variant pair, with counts for alternate allele (a) and total depth (d).

cell	variant	a	d
PD45534aj	MT_10448_T_C	0	2348
PD45534aj	MT_11787_T_C	1462	2000
PD45534aj	MT_1244_T_C	2	1500
PD45534ak	MT_10448_T_C	5	2100

IMPORTANT: Include rows where a = 0 so that every cell × variant combination is represented (the observation model uses total depth).

Matrix pair format (`amat`, `dmat`)

Wide matrices with variant in the first column and cells in subsequent columns. Use the same layout for amat (alt counts) and dmat (total depth).

Inference settings & diagnostics

Full lineage inference pipeline is run with inst/bin/run_mitodrift_em.R.

Model parameters (EM fitting)

--fit_params: Enable automatic parameter fitting via EM (default: TRUE)
--fit_param_max_iter: Maximum EM iterations (default: 100)
--fit_param_epsilon: EM convergence threshold on parameter changes (default: 1e-3)
--k: Number of VAF bins for discretizing heteroplasmy levels (default: 20)
--npop: Population size for Wright-Fisher model (default: 600)
--eps: Mutation rate per branch (default: 0.001; auto-fitted if fit_params=TRUE)
--err: Variant detection error rate (default: 0; auto-fitted if fit_params=TRUE)
--ngen: Number of WF generations (default: 100; auto-fitted if fit_params=TRUE)

Phylogenetic MCMC settings

--tree_mcmc_chains: Number of independent MCMC runs (default: 1; recommended 10-50 for robust inference. For large trees e.g., >8000 cells, use less chains to avoid memory limits)
--tree_mcmc_iter: Maximum iterations per chain (default: 10000; can be overridden by automatic termination via ASDSF convergence check)
--conv_thres: ASDSF threshold for MCMC termination (default: NULL; e.g., 0.05-0.1 for auto-convergence). ASDSF (Average Standard Deviation of Split Frequencies) summarizes topology agreement across chains. A lower value indicates good mixing. Values < 0.05 indicate good convergence; < 0.1 is acceptable for exploratory analyses or large trees.
--tree_mcmc_batch_size: Iteration interval for saving traces and checking convergence (default: 1000)
--tree_mcmc_burnin: Initial samples to discard from each chain (default: 0; recommended 10-20% of iterations)

Parallelization options

--ncores: Default number of cores for likelihood computation during tree optimization (default: 1)
--ncores_em: Cores for parallel EM parameter fitting across variants (default: 1)
--ncores_nj: Cores for initial NJ tree distance matrix computation via parallelDist (default: 1)
- NOTE: Different ncores_nj values may yield slightly different NJ topologies due to floating-point rounding
--ncores_qs: Cores for parallel file I/O (qs2 compression) during MCMC and annotation (default: 1)
--ncores_annot: Cores for computing clade frequencies from MCMC traces (default: same as --ncores)

Resuming interrupted job

Use --resume TRUE to continue interrupted runs from saved traces

Analysis workflow

For a complete walkthrough, see the Analysis Workflow tutorial.

MitoDrift