Performance
reducers is built for fast reductions over NumPy arrays.
Short Version
| workload | local result |
|---|---|
| Small (≲ 100) 1-D arrays | Faster than NumPy for most cases, but Bottleneck can still win for nan* cases. |
| Plain axis=0 scans (mean / sum / min / max) on shallow, small-output stacks | Mixed: often near NumPy, sometimes slower. See Parallel Threads for the crossover. |
| NaN-aware axis=0 nanmean / nanaverage / nansum / nanvar on stacks | Faster than NumPy; also faster than Bottleneck except the shallowest stacks. |
| Axis-0 stack medians and [nan]var / std on larger outputs | Usually much faster locally, because these paths avoid building one temporary slice per output pixel. |
| All other cases (order statistics, larger arrays, axis=-1) | reducers is usually much faster than NumPy and Bottleneck locally. |
- NumPy functions such as `mean` and `sum` are extremely optimized and fast when the workload is tiny, so `reducers` does not try to force parallelism for shallow, small-output stacks.
- However, `reducers` is much faster than NumPy and Bottleneck for larger arrays, NaN-aware reductions (`nan*` functions), and especially for larger stacks.
- `std` and `var` are common bottlenecks in data processing pipelines. Axis-0 variance and standard deviation stream contiguous chunks from each stack level directly, while still using the stable two-pass formula described below.
  - For some use cases, `return_mean` can even skip an additional call to `mean` and save computation time.
- Weighted `sum` and `nansum` follow the same idea: `return_sum_weights=True` and `return_unweighted_sum=True` expose values already accumulated by the fused weighted scan, avoiding separate passes when those totals are needed together.
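To make the `return_mean` idea concrete, here is a minimal NumPy sketch of a two-pass axis-0 variance that walks the output in contiguous chunks and hands back the mean it already computed. It is an illustration only, not the `reducers` implementation: the function name `stack_var_with_mean`, its `chunk` parameter, and the tuple return are all hypothetical.

```python
import numpy as np


def stack_var_with_mean(stack, ddof=0, chunk=4096):
    """Two-pass variance along axis 0 that also returns the mean.

    Illustrative sketch only; names and signature are hypothetical.
    """
    stack = np.asarray(stack, dtype=np.float64)
    n = stack.shape[0]
    flat = stack.reshape(n, -1)  # one row per stack level
    mean = np.empty(flat.shape[1])
    var = np.empty(flat.shape[1])

    # Walk the output in contiguous chunks so every stack level is read as
    # a contiguous run rather than one strided slice per output pixel.
    for start in range(0, flat.shape[1], chunk):
        block = flat[:, start:start + chunk]
        m = block.sum(axis=0) / n                 # pass 1: mean
        d = block - m                             # pass 2: squared deviations
        var[start:start + chunk] = (d * d).sum(axis=0) / (n - ddof)
        mean[start:start + chunk] = m             # already available, no extra pass

    out_shape = stack.shape[1:]
    return var.reshape(out_shape), mean.reshape(out_shape)


rng = np.random.default_rng(0)
stack = rng.normal(size=(8, 32, 32))
var, mean = stack_var_with_mean(stack, chunk=256)
```

The mean falls out of the first pass for free, which is why a caller that needs both values can avoid a second full scan of the stack.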
How To Read The Numbers
Benchmark ratios are reported as np/rd (NumPy time divided by reducers time) and bn/rd (Bottleneck time divided by reducers time); values above 1 mean reducers is faster.
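As a rough sketch of how such a ratio can be computed, the helper below divides the best baseline time by the best candidate time using the standard `timeit` module. The helper names `best_time` and `speed_ratio` are hypothetical and the benchmark scripts may measure differently. Both callables are NumPy here so the example runs without reducers installed.

```python
import timeit

import numpy as np


def best_time(func, *args, repeat=5, number=10):
    """Best per-call wall time in seconds over several repeats."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def speed_ratio(baseline, candidate, *args):
    """Baseline time / candidate time; above 1 means the candidate is faster."""
    return best_time(baseline, *args) / best_time(candidate, *args)


x = np.random.default_rng(0).normal(size=(16, 64, 64))

# Placeholder candidate: swap the second callable for the function under test.
ratio = speed_ratio(lambda a: a.mean(axis=0), lambda a: np.mean(a, axis=0), x)
print(f"baseline/candidate ratio: {ratio:.2f}")
```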
The benchmark scripts check correctness before timing each reducers row. They compare against NumPy or masked NumPy with np.testing.assert_allclose(..., equal_nan=True), using tight tolerances for float64 and tolerances suitable for float32. If that check fails, the script stops instead of printing a speed number.
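The check-before-timing step might look roughly like the sketch below. `np.testing.assert_allclose` and its `equal_nan` argument are real NumPy API, but the tolerance values and the helper name `check_matches_reference` are assumptions made for illustration; the scripts' actual tolerances are not listed here.

```python
import numpy as np

# Assumed tolerances for illustration only.
RTOL = {np.dtype("float64"): 1e-12, np.dtype("float32"): 1e-5}


def check_matches_reference(result, expected):
    """Raise AssertionError if result deviates from the reference."""
    result = np.asarray(result)
    rtol = RTOL.get(result.dtype, 1e-7)
    np.testing.assert_allclose(result, expected, rtol=rtol, equal_nan=True)


x = np.array([[1.0, np.nan], [3.0, np.nan]])
expected = np.array([2.0, np.nan])  # NaN-aware mean along axis 0

# Hand-rolled candidate standing in for the function under test.
count = (~np.isnan(x)).sum(axis=0)
with np.errstate(invalid="ignore", divide="ignore"):
    candidate = np.nansum(x, axis=0) / count

# Timing would only start after this call returns without raising.
check_matches_reference(candidate, expected)
```

With `equal_nan=True`, a NaN in the same position on both sides counts as a match, so all-NaN columns do not cause false failures.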
The tables are recorded local runs from the environment below. They are useful for the shape of the performance story, but exact timings should be regenerated on the target machine when a small difference matters.
Recorded Environment
The benchmark tables in these docs were generated with:
python: 3.13.11 (CPython)
reducers: 0.2.1
numpy: 2.4.6
bottleneck: 1.6.0
os: macOS-26.5-arm64-arm-64bit-Mach-O
kernel: Darwin 25.5.0 Darwin Kernel Version 25.5.0: Mon Apr 27 20:41:15 PDT 2026; root:xnu-12377.121.6~2/RELEASE_ARM64_T6041
machine: arm64
processor: arm
reducers threads: 14
REDUCERS_AXIS_SCAN_PLAIN_GRAIN=262144
REDUCERS_AXIS_SCAN_NAN_GRAIN=262144
REDUCERS_AXIS_SCAN_VAR_GRAIN=4096
REDUCERS_AXIS_WEIGHTED_GRAIN=262144
REDUCERS_AXIS_ORDER_MEDIAN_GRAIN=1024
REDUCERS_AXIS_ORDER_PERCENTILE_GRAIN=1024
REDUCERS_MINMAX_1D_GRAIN=16384
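When regenerating the tables on another machine, a similar record can be captured with the standard library. This is a sketch under two assumptions: that the packages are installed under the distribution names `reducers`, `numpy`, and `bottleneck`, and that the `REDUCERS_*` settings are exposed as environment variables. It does not report the thread count, since how to query that is not described here.

```python
import os
import platform
from importlib import metadata


def package_version(name):
    """Installed version of a distribution, or a placeholder if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report():
    """Collect interpreter, package, and platform details as text lines."""
    lines = [
        f"python: {platform.python_version()} ({platform.python_implementation()})",
    ]
    for name in ("reducers", "numpy", "bottleneck"):
        lines.append(f"{name}: {package_version(name)}")
    lines.append(f"os: {platform.platform()}")
    lines.append(f"machine: {platform.machine()}")
    # Assumption: grain settings are read from REDUCERS_* environment variables.
    for key in sorted(k for k in os.environ if k.startswith("REDUCERS_")):
        lines.append(f"{key}={os.environ[key]}")
    return lines


print("\n".join(environment_report()))
```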