Performance & Tuning

RustQC runs all RNA-seq QC analyses in a single pass over the BAM file. Because every tool shares the same BAM decode and per-read dispatch, the cost of each tool is not simply additive — there is a shared baseline cost for reading and decoding the BAM that all tools amortise together.

This page documents how much each individual tool contributes to the total runtime, how disabling tools affects performance, and how to tune thread counts for your hardware.

  • Leave all tools enabled unless you have a specific reason not to. The full run completes in ~10 minutes for a large BAM — far faster than running the equivalent traditional tools separately.
  • Disabling TIN provides the single largest speedup. TIN’s large per-transcript data structures (~1 GB+) cause cache thrashing that slows every other tool in the pipeline. Removing it alone saves ~4 minutes (~38% of the total runtime).
  • Disabling TIN + read_duplication together brings the runtime under 2.5 minutes.
  • Thread count: -t 4 to -t 8 is the productive range for a single human BAM. Beyond ~8 threads, runtime plateaus at roughly 2x speedup with increasing memory usage (see CPU scaling benchmarks below).
  • Multiple BAMs: the useful thread count scales with the number of input files (e.g. 4 human BAMs can productively use ~32 threads).
  • Ensure a BAM index exists (.bai/.csi). Without one, only a single counting thread is used regardless of -t.

The per-tool benchmarks were measured with the following setup:

| Parameter | Value |
| --- | --- |
| BAM file | GM12878 REP1, paired-end RNA-seq, GRCh38 (~186M reads, 10 GB) |
| Annotation | GENCODE v46 GTF (63,677 genes) |
| Hardware | Apple M1 Pro, 10 cores, 16 GB RAM |
| Threads | 8 (`-t 8`) |
| Build | `cargo build --release` (LTO, codegen-units=1, opt-level=3) |
| Strandedness | Reverse (`-s reverse`) |
| Mode | Paired-end (`-p`), all tools enabled by default |

The baseline is a run with every tool disabled (BAM decode + dispatch overhead only): 34 seconds.

Each tool was benchmarked in isolation — a run with only that single tool enabled, all others disabled. The “marginal cost” is the difference between the single-tool run and the no-tools baseline.

| Tool | Run time (mm:ss) | Marginal cost (mm:ss) |
| --- | --- | --- |
| read_duplication | 1:35 | 1:01 |
| infer_experiment | 0:53 | 0:19 |
| bam_stat | 0:52 | 0:18 |
| samtools_stats | 0:51 | 0:17 |
| idxstats | 0:50 | 0:16 |
| flagstat | 0:50 | 0:16 |
| qualimap | 0:48 | 0:14 |
| junction_annotation | 0:47 | 0:13 |
| inner_distance | 0:45 | 0:11 |
| preseq | 0:40 | 0:06 |
| tin | 0:39 | 0:05 |
| read_distribution | 0:35 | 0:01 |
| junction_saturation | 0:35 | 0:01 |
| dupradar | 0:35 | 0:01 |
| featurecounts | 0:33 | ~0:00 |
| All tools enabled | 10:37 | — |

Why don’t the marginal costs add up to 10:37? Two reasons. First, the shared BAM decode baseline (34s) is paid once regardless of how many tools are active. Second, and more significant, TIN has per-transcript data structures (coverage arrays and unique_starts hash sets for ~230K transcripts) that consume over 1 GB of RAM. When combined with the data structures of all other tools, the aggregate working set exceeds CPU cache capacity, causing cache thrashing that slows the per-read processing of every tool. TIN’s apparent marginal cost when run alone (5s) is low because TIN’s data structures alone still fit reasonably in cache. But in a full run with all tools active, adding TIN pushes the combined working set over a performance cliff.
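The gap can be made concrete with a quick back-of-the-envelope calculation from the numbers above: the shared baseline plus the sum of every marginal cost accounts for under four minutes, while the measured full run takes 10:37. The remainder is the cache-interaction cost.

```python
# Back-of-the-envelope: why marginal costs don't sum to the full runtime.
# All figures are taken from the benchmark tables on this page.
baseline_s = 34  # BAM decode + dispatch, no tools enabled

# Marginal cost of each tool in seconds (single-tool run minus baseline)
marginal_s = {
    "read_duplication": 61, "infer_experiment": 19, "bam_stat": 18,
    "samtools_stats": 17, "idxstats": 16, "flagstat": 16, "qualimap": 14,
    "junction_annotation": 13, "inner_distance": 11, "preseq": 6,
    "tin": 5, "read_distribution": 1, "junction_saturation": 1,
    "dupradar": 1, "featurecounts": 0,
}

additive_estimate = baseline_s + sum(marginal_s.values())  # 233 s ~ 3:53
measured_full_run = 10 * 60 + 37                           # 637 s
interaction_cost = measured_full_run - additive_estimate   # 404 s ~ 6:44

print(f"additive estimate: {additive_estimate}s, measured: {measured_full_run}s")
print(f"unexplained by additivity (cache interaction): {interaction_cost}s")
```

Roughly two thirds of the full-run wall time is interaction cost rather than per-tool compute, which is why removing the one tool with the largest working set (TIN) recovers so much more time than its 5-second marginal cost suggests.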

With every tool active, the total runtime is 10:37 with a peak RSS of 4.5 GB.

Cumulative removal: stripping tools by impact

Starting from a full run with all 15 tools, tools were removed one at a time. TIN and read_duplication are removed first (the two highest-impact tools), then remaining tools in order of marginal cost. Both wall-clock time and peak memory (RSS) are shown.

| Step | Tool removed | Tools remaining | Run time | Peak RSS | Delta |
| --- | --- | --- | --- | --- | --- |
| 0 | (none: all tools) | 15 | 10:37 | 4.5 GB | — |
| 1 | tin | 14 | 6:37 | 4.5 GB | -4:00 |
| 2 | read_duplication | 13 | 2:22 | 2.7 GB | -4:16 |
| 3 | infer_experiment | 12 | 1:57 | 2.7 GB | -0:25 |
| 4 | bam_stat | 11 | 1:59 | 2.2 GB | +0:02 |
| 5 | samtools_stats | 10 | 2:01 | 1.8 GB | +0:02 |
| 6 | flagstat | 9 | 2:03 | 2.4 GB | +0:03 |
| 7 | idxstats | 8 | 1:39 | 2.2 GB | -0:25 |
| 8 | qualimap | 7 | 1:13 | 2.0 GB | -0:25 |
| 9 | junction_annotation | 6 | 1:02 | 1.8 GB | -0:11 |
| 10 | inner_distance | 5 | 0:49 | 1.4 GB | -0:13 |
| 11 | preseq | 4 | 0:40 | 1.0 GB | -0:09 |
| 12 | read_distribution | 3 | 0:39 | 1.0 GB | 0:00 |
| 13 | junction_saturation | 2 | 0:36 | 0.7 GB | -0:03 |
| 14 | dupradar | 1 | 0:37 | 0.7 GB | 0:00 |
| 15 | featurecounts | 0 | 0:35 | 0.6 GB | -0:01 |

TIN’s impact is driven by memory pressure, not computation. When TIN is the only tool enabled, it adds just 5 seconds of marginal compute time. But in a full run, removing TIN saves 4:00. As explained above, TIN’s ~1 GB of per-transcript data structures push the combined working set past CPU L3 cache capacity, causing cache misses that slow every other tool’s per-read processing. Removing TIN shrinks the working set enough for the remaining tools to run efficiently.
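The order of magnitude is easy to sanity-check. The ~230K transcript count and ~1 GB total come from this page; the per-transcript sizes below are assumptions chosen purely for illustration, not RustQC's actual layout.

```python
# Rough, illustrative estimate of TIN's working set. The ~230K transcript
# count and the ~1 GB total are from this page; the per-transcript sizes
# below are assumed values chosen only to show the order of magnitude.
transcripts = 230_000

# Hypothetical per-transcript footprint:
coverage_positions = 600                   # sampled positions (assumed)
coverage_bytes = coverage_positions * 4    # u32 coverage counters
hashset_bytes = coverage_positions * 4     # unique_starts set, assumed similar

per_transcript = coverage_bytes + hashset_bytes
total_gb = transcripts * per_transcript / 1e9
print(f"~{total_gb:.1f} GB across {transcripts:,} transcripts")  # ~1.1 GB
```

Even a few kilobytes per transcript, multiplied by hundreds of thousands of transcripts, lands far beyond any CPU's L3 cache (tens of megabytes), which is why TIN's presence degrades the per-read hot path of every co-resident tool.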

read_duplication is the most expensive counting-phase tool. Removing it saves 4:16 (step 2), cutting the runtime from 6:37 to 2:22. This tool maintains per-read sequence hashing (SipHash over every base) for duplication detection, which is expensive in both CPU and memory.

The “samtools” group (bam_stat, flagstat, idxstats, samtools_stats) is cheap. These tools add minimal overhead individually because they compute simple per-read counters with no complex data structures.

Memory scales with enabled tools. Peak RSS drops from 4.5 GB (all tools) to 0.6 GB (no tools). The largest single memory consumers are TIN (~1 GB for per-transcript hash sets and sampled-position indices) and read_duplication (~1 GB for sequence hash tables).

Individual tools can be disabled in the YAML config file. See the Configuration page for the full list of toggles.

When a tool is disabled, its accumulators are never constructed, its per-read processing is skipped entirely, and no output files are written. Disabled tools therefore incur zero overhead; they are genuinely removed from the hot path, not merely silenced in the output.
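The zero-overhead pattern can be sketched as follows. The names and structure here are hypothetical, chosen only to illustrate the idea; RustQC's actual internals are not shown on this page.

```python
# Minimal sketch of zero-overhead tool toggling (hypothetical structure,
# not RustQC's actual internals). A disabled tool's accumulator is never
# constructed, so it costs nothing in the per-read loop and writes nothing.
class DupAccumulator:
    def __init__(self):
        self.seen = set()          # per-read state exists only if enabled

    def process_read(self, read):
        self.seen.add(read)

    def write_output(self):
        return f"{len(self.seen)} unique reads"

def run(reads, config):
    # Accumulators are built only for enabled tools.
    tools = []
    if config.get("read_duplication", True):
        tools.append(DupAccumulator())

    for read in reads:             # single pass over the BAM
        for tool in tools:         # disabled tools: loop body never runs
            tool.process_read(read)

    return [t.write_output() for t in tools]

print(run(["r1", "r2", "r1"], {"read_duplication": False}))  # []
print(run(["r1", "r2", "r1"], {"read_duplication": True}))   # ['2 unique reads']
```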

RustQC speed is typically bound by CPU and disk I/O. It parallelises heavily across multiple CPUs to reduce wall-clock time.

RustQC parallelises by assigning chromosomes to worker threads — each worker opens its own BAM reader and processes its assigned chromosomes independently.

The number of chromosomes sets the theoretical ceiling for counting threads, but in practice returns diminish well before that limit: chromosomes aren't split and vary widely in length, so the longest chromosome becomes the parallelisation bottleneck.
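The critical-path effect can be demonstrated with a small scheduling sketch. The greedy assignment below is illustrative (the page mentions greedy bin-packing but not its exact form), and the chromosome lengths are approximate GRCh38 sizes in megabases.

```python
# Sketch: greedy bin-packing of chromosomes onto workers, and why the
# longest chromosome bounds the achievable speedup. Lengths are approximate
# GRCh38 sizes in Mb; the scheduling heuristic is illustrative only.
chrom_mb = {
    "chr1": 249, "chr2": 242, "chr3": 198, "chr4": 190, "chr5": 182,
    "chr6": 171, "chr7": 159, "chr8": 145, "chr9": 138, "chr10": 134,
    "chr11": 135, "chr12": 133, "chr13": 114, "chr14": 107, "chr15": 102,
    "chr16": 90, "chr17": 83, "chr18": 80, "chr19": 59, "chr20": 64,
    "chr21": 47, "chr22": 51, "chrX": 156, "chrY": 57,
}

def makespan(n_workers):
    # Greedy: give the largest remaining chromosome to the least-loaded worker.
    loads = [0] * n_workers
    for size in sorted(chrom_mb.values(), reverse=True):
        loads[loads.index(min(loads))] += size
    return max(loads)

total = sum(chrom_mb.values())
for n in (1, 4, 8, 16, 32):
    print(f"{n:2d} workers: makespan {makespan(n)} Mb, "
          f"ideal speedup {total / makespan(n):.1f}x")
```

However many workers there are, the makespan never drops below chr1's length, so even under this idealised model the speedup is capped well short of linear, and the real ceiling is lower still once shared decompression and I/O are included.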

Performance plateaus at roughly 4-8 threads (~2x speedup). Additional threads beyond this point are used for BAM decompression and add little benefit while increasing memory usage (each worker maintains its own per-chromosome accumulators).

When multiple BAM files are provided, the thread budget is split across files, so the practical ceiling scales with the number of inputs.

A BAM index (.bai/.csi) is required for parallel mode. Without one, RustQC falls back to a single counting thread.

Measured on AWS using the same GM12878 BAM (186M reads, 10 GB) with all tools enabled. Three replicates per thread count; the table shows the median run time (robust to cloud scheduling jitter).

CPU scaling benchmark chart (data in the table below)
| Threads | Run time | Peak RSS | Speedup vs 1 thread |
| --- | --- | --- | --- |
| 1 | 39:56 | 8.9 GB | 1.0x |
| 2 | 26:08 | 11.2 GB | 1.5x |
| 3 | 20:42 | 11.2 GB | 1.9x |
| 4 | 21:54 | 11.8 GB | 1.8x |
| 5 | 20:16 | 11.9 GB | 2.0x |
| 6 | 21:30 | 12.1 GB | 1.9x |
| 7 | 19:29 | 12.5 GB | 2.1x |
| 8 | 18:42 | 12.8 GB | 2.1x |
| 10 | 18:42 | 13.2 GB | 2.1x |
| 12 | 19:00 | 13.7 GB | 2.1x |
| 14 | 18:18 | 14.2 GB | 2.2x |
| 16 | 18:14 | 14.5 GB | 2.2x |
| 20 | 18:26 | 15.9 GB | 2.2x |
| 24 | 17:54 | 16.7 GB | 2.2x |
| 28 | 18:32 | 16.9 GB | 2.2x |
| 32 | 18:32 | 16.9 GB | 2.2x |

Performance scales well up to ~8 threads, where runtime drops from ~40 minutes (single-threaded) to ~19 minutes (2.1x speedup).

Beyond 8 threads, gains plateau because each chromosome is processed by exactly one worker — despite greedy bin-packing to balance load, the largest chromosome (human chr1, ~8% of the genome) becomes the critical path and cannot be subdivided. Additional threads beyond this point serve only as BAM decompression helpers, yielding diminishing returns. Peak RSS grows linearly with thread count as each worker maintains its own per-chromosome accumulators.
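The plateau is consistent with a simple Amdahl's-law model. A back-of-the-envelope fit using only the 1-thread and 8-thread measurements from the table above (not a claim about RustQC internals) implies that roughly 40% of the work behaves as if serial, which is what the chr1 critical path plus shared decode would produce.

```python
# Back-of-the-envelope Amdahl's-law fit to the measured scaling.
# S(n) = 1 / (f + (1 - f) / n), where f is the effectively serial fraction.
t1 = 39 * 60 + 56   # 1 thread:  2396 s (from the table above)
t8 = 18 * 60 + 42   # 8 threads: 1122 s
speedup = t1 / t8   # ~2.14x, matching the table

# Solve S(8) = 1 / (f + (1 - f) / 8) for f:
n = 8
f = (1 / speedup - 1 / n) / (1 - 1 / n)
print(f"measured speedup at {n} threads: {speedup:.2f}x")
print(f"implied serial fraction: {f:.0%}")

# With f ~ 0.39, even unlimited threads cap out near 1/f ~ 2.5x,
# which matches the observed plateau around 2.2x.
```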