Representative stack reductions.
Run locally with:
uv run --extra bench python benchmarks/benchmark_axis.py
uv run --extra bench python benchmarks/benchmark_axis.py --ops average
np, bn, and rd are trimmed median elapsed times in ms for NumPy, Bottleneck, and reducers. np/rd and bn/rd above 1 mean reducers is faster; a ratio is bolded when it exceeds 0.95.
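As a rough illustration of how a timing column and a ratio column could be produced, the sketch below times a callable repeatedly, discards warm-up calls and the slowest samples, and takes the median of the rest. The helper name `trimmed_median_ms` and its `repeats`, `warmup`, and `trim` values are assumptions made for this example; the benchmark script may trim differently. Both timed candidates are plain NumPy expressions standing in for the libraries being compared.

```python
import statistics
import time

import numpy as np


def trimmed_median_ms(fn, repeats=21, warmup=2, trim=0.2):
    """Median elapsed ms after dropping warm-up runs and the slowest samples.

    Hypothetical helper: repeats, warmup, and trim are illustrative values,
    not the settings used by benchmarks/benchmark_axis.py.
    """
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)  # seconds -> ms
    samples.sort()
    keep = max(1, int(len(samples) * (1.0 - trim)))  # drop the slowest tail
    return statistics.median(samples[:keep])


cube = np.random.default_rng(0).random((100, 100, 100))

# Two NumPy ways of computing the same mean, used only as stand-in candidates.
baseline_ms = trimmed_median_ms(lambda: np.mean(cube, axis=0))
candidate_ms = trimmed_median_ms(lambda: cube.sum(axis=0) / cube.shape[0])

# A ratio above 1 means the candidate in the denominator is faster.
ratio = baseline_ms / candidate_ms
print(f"baseline={baseline_ms:.2f} ms  candidate={candidate_ms:.2f} ms  ratio={ratio:.2f}x")
```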
std and nanstd use ddof=1.
average and nanaverage use 1-D weights whose length matches the reduced axis (100 for the cube; 11 or 31 for the short stacks). percentile/quantile use [16, 50, 84].
nanaverage compares against masked np.average; Bottleneck has no weighted-average reducer.
nanpercentile/nanquantile are omitted because NumPy’s axis path is extremely slow here.
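For reference, the NumPy side of the settings above might look like the following minimal sketch. It uses only NumPy calls; reducing along `axis=0` and passing the quantile levels as the fractions 0.16, 0.50, 0.84 are assumptions, since the notes list only `[16, 50, 84]`.

```python
import numpy as np

rng = np.random.default_rng(0)
cube = rng.random((100, 100, 100))

# Sample standard deviation along the reduced axis (ddof=1).
std0 = np.std(cube, axis=0, ddof=1)

# 1-D weights whose length equals the reduced axis (100 for the cube).
weights = rng.random(cube.shape[0])
avg0 = np.average(cube, axis=0, weights=weights)

# Three percentile levels in one call; the result gains a leading axis of length 3.
pct0 = np.percentile(cube, [16, 50, 84], axis=0)

# Assumed quantile form of the same levels, expressed as fractions.
qnt0 = np.quantile(cube, [0.16, 0.50, 0.84], axis=0)

print(std0.shape, avg0.shape, pct0.shape, qnt0.shape)
```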
Plain Results
General cube (100, 100, 100)
At this size, reducers wins most cases.
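| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| mean | 0.18 | - | 0.10 | 1.81x | - |
| average | 0.32 | - | 0.44 | 0.72x | - |
| median | 9.35 | 8.22 | 0.59 | 15.96x | 14.02x |
| var | 0.65 | - | 0.17 | 3.77x | - |
| std | 0.70 | - | 0.37 | 1.89x | - |
| min | 0.18 | - | 0.12 | 1.48x | - |
| max | 0.18 | - | 0.12 | 1.45x | - |
| minmax | 0.35 | - | 0.42 | 0.84x | - |
| sum | 0.17 | - | 0.11 | 1.61x | - |
| percentile | 15.90 | - | 1.68 | 9.48x | - |
| quantile | 15.84 | - | 1.53 | 10.38x | - |

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| mean | 0.18 | - | 0.07 | 2.60x | - |
| average | 0.47 | - | 0.14 | 3.47x | - |
| median | 9.03 | 7.92 | 0.46 | 19.44x | 17.06x |
| var | 0.72 | - | 0.12 | 5.84x | - |
| std | 0.71 | - | 0.13 | 5.54x | - |
| min | 0.13 | - | 0.10 | 1.26x | - |
| max | 0.12 | - | 0.10 | 1.20x | - |
| minmax | 0.28 | - | 0.21 | 1.34x | - |
| sum | 0.18 | - | 0.07 | 2.63x | - |
| percentile | 14.80 | - | 1.42 | 10.40x | - |
| quantile | 15.11 | - | 1.44 | 10.50x | - |

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| mean | 0.07 | - | 0.09 | 0.81x | - |
| average | 0.13 | - | 0.37 | 0.36x | - |
| median | 9.54 | 8.44 | 0.57 | 16.65x | 14.75x |
| var | 0.27 | - | 0.14 | 1.90x | - |
| std | 0.28 | - | 0.12 | 2.32x | - |
| min | 0.07 | - | 0.08 | 0.91x | - |
| max | 0.07 | - | 0.10 | 0.67x | - |
| minmax | 0.13 | - | 0.18 | 0.72x | - |
| sum | 0.07 | - | 0.09 | 0.83x | - |
| percentile | 15.79 | - | 1.77 | 8.94x | - |
| quantile | 16.59 | - | 1.74 | 9.54x | - |

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| mean | 0.17 | - | 0.12 | 1.50x | - |
| average | 0.31 | - | 0.20 | 1.59x | - |
| median | 9.19 | 8.45 | 0.65 | 14.22x | 13.07x |
| var | 0.52 | - | 0.18 | 2.84x | - |
| std | 0.53 | - | 0.21 | 2.52x | - |
| min | 0.13 | - | 0.11 | 1.16x | - |
| max | 0.10 | - | 0.11 | 0.91x | - |
| minmax | 0.20 | - | 0.23 | 0.87x | - |
| sum | 0.17 | - | 0.11 | 1.56x | - |
| percentile | 15.21 | - | 2.23 | 6.81x | - |
| quantile | 15.29 | - | 1.80 | 8.47x | - |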
Short axis=0 stacks
When shape[0] is small and axis=0, reducers may lose to NumPy/Bottleneck on cheap reductions (such as min, max, mean, sum).
A larger shape[0] or a larger output size (prod(shape[1:])) tips the balance toward reducers.
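To make the two quantities concrete, the short sketch below computes the stack depth and the number of output elements for the shapes benchmarked here, using only the standard library. The variable names are illustrative.

```python
import math

shapes = [(100, 100, 100), (11, 100, 100), (31, 100, 100), (11, 512, 512)]

for shape in shapes:
    depth = shape[0]                 # length of the reduced axis (axis=0)
    outputs = math.prod(shape[1:])   # number of output elements
    print(f"{shape}: depth={depth}, outputs={outputs}")

# Output-count ratio between the largest and smallest short stacks.
ratio = math.prod((11, 512, 512)[1:]) / math.prod((11, 100, 100)[1:])
print(f"{ratio:.1f}x more outputs")  # 26.2x more outputs
```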
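(11, 100, 100), float64

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| mean | 0.02 | - | 0.02 | 1.13x | - |
| average | 0.04 | - | 0.06 | 0.76x | - |
| median | 0.89 | 0.68 | 0.25 | 3.52x | 2.69x |
| var | 0.08 | - | 0.05 | 1.64x | - |
| std | 0.07 | - | 0.05 | 1.57x | - |
| min | 0.02 | - | 0.03 | 0.73x | - |
| max | 0.02 | - | 0.03 | 0.76x | - |
| minmax | 0.04 | - | 0.05 | 0.78x | - |
| sum | 0.02 | - | 0.03 | 0.84x | - |
| percentile | 1.27 | - | 0.76 | 1.68x | - |
| quantile | 1.29 | - | 0.72 | 1.80x | - |

(31, 100, 100), float64

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| mean | 0.06 | - | 0.05 | 1.13x | - |
| average | 0.11 | - | 0.14 | 0.77x | - |
| median | 3.19 | 2.65 | 0.30 | 10.82x | 8.98x |
| var | 0.20 | - | 0.05 | 3.83x | - |
| std | 0.21 | - | 0.05 | 3.94x | - |
| min | 0.06 | - | 0.07 | 0.79x | - |
| max | 0.06 | - | 0.06 | 0.89x | - |
| minmax | 0.11 | - | 0.13 | 0.88x | - |
| sum | 0.06 | - | 0.06 | 0.93x | - |
| percentile | 5.39 | - | 0.81 | 6.67x | - |
| quantile | 5.15 | - | 0.80 | 6.48x | - |

(11, 512, 512), float64

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| mean | 0.53 | - | 0.24 | 2.20x | - |
| average | 0.92 | - | 1.03 | 0.90x | - |
| median | 30.48 | 24.01 | 4.80 | 6.35x | 5.01x |
| var | 2.01 | - | 0.49 | 4.07x | - |
| std | 2.04 | - | 0.47 | 4.30x | - |
| min | 0.46 | - | 0.30 | 1.56x | - |
| max | 0.49 | - | 0.31 | 1.57x | - |
| minmax | 0.93 | - | 0.52 | 1.78x | - |
| sum | 0.51 | - | 0.26 | 1.98x | - |
| percentile | 37.15 | - | 16.88 | 2.20x | - |
| quantile | 38.25 | - | 16.09 | 2.38x | - |

(11, 100, 100), float32

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| mean | 0.01 | - | 0.02 | 0.83x | - |
| average | 0.03 | - | 0.06 | 0.45x | - |
| median | 1.04 | 0.83 | 0.22 | 4.68x | 3.74x |
| var | 0.04 | - | 0.04 | 1.07x | - |
| std | 0.05 | - | 0.04 | 1.12x | - |
| min | 0.01 | - | 0.01 | 0.73x | - |
| max | 0.01 | - | 0.01 | 0.69x | - |
| minmax | 0.02 | - | 0.03 | 0.69x | - |
| sum | 0.01 | - | 0.02 | 0.60x | - |
| percentile | 1.51 | - | 0.59 | 2.57x | - |
| quantile | 1.37 | - | 0.60 | 2.27x | - |

(31, 100, 100), float32

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| mean | 0.03 | - | 0.04 | 0.73x | - |
| average | 0.05 | - | 0.13 | 0.40x | - |
| median | 3.41 | 2.90 | 0.33 | 10.17x | 8.67x |
| var | 0.10 | - | 0.06 | 1.58x | - |
| std | 0.10 | - | 0.05 | 1.88x | - |
| min | 0.02 | - | 0.03 | 0.73x | - |
| max | 0.03 | - | 0.03 | 0.82x | - |
| minmax | 0.05 | - | 0.06 | 0.85x | - |
| sum | 0.02 | - | 0.04 | 0.61x | - |
| percentile | 5.63 | - | 0.71 | 7.89x | - |
| quantile | 5.49 | - | 0.66 | 8.26x | - |

(11, 512, 512), float32

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| mean | 0.38 | - | 0.19 | 1.96x | - |
| average | 0.46 | - | 1.04 | 0.44x | - |
| median | 31.85 | 25.16 | 3.24 | 9.84x | 7.77x |
| var | 1.20 | - | 0.37 | 3.27x | - |
| std | 1.23 | - | 0.40 | 3.09x | - |
| min | 0.24 | - | 0.19 | 1.29x | - |
| max | 0.25 | - | 0.19 | 1.28x | - |
| minmax | 0.49 | - | 0.36 | 1.36x | - |
| sum | 0.25 | - | 0.20 | 1.27x | - |
| percentile | 39.39 | - | 12.00 | 3.28x | - |
| quantile | 38.40 | - | 11.47 | 3.35x | - |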
NaN-aware Results
Same shapes with about 1% NaNs.
np.nanpercentile / np.nanquantile are omitted: locally they were extremely slow (often more than 100x slower than rd) and dominated benchmark runtime.
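A minimal sketch of this kind of input, and of a masked weighted average that could serve as the NumPy baseline for nanaverage, is shown below. How the benchmark injects NaNs and which masked form of np.average it calls are not stated here, so the random mask, the seed, and the use of `np.ma.average` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
stack = rng.random((11, 100, 100))

# Replace roughly 1% of the entries with NaN (assumed injection scheme).
stack[rng.random(stack.shape) < 0.01] = np.nan
nan_fraction = np.isnan(stack).mean()

# 1-D weights matching the reduced axis.
weights = rng.random(stack.shape[0])

# Masked-array route: NaNs are masked, so their weights drop out of the sum.
masked = np.ma.masked_invalid(stack)
avg_masked = np.ma.average(masked, axis=0, weights=weights).filled(np.nan)

# Equivalent explicit form: zero the weight wherever the value is NaN.
finite = np.isfinite(stack)
w = weights[:, None, None] * finite
avg_manual = (np.where(finite, stack, 0.0) * w).sum(axis=0) / w.sum(axis=0)

print(f"NaN fraction: {nan_fraction:.4f}")
print("routes agree:", np.allclose(avg_masked, avg_manual))
```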
General cube (100, 100, 100)
At this size, reducers wins most cases.
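| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| nanmean | 0.99 | 0.74 | 0.18 | 5.39x | 4.02x |
| nanaverage | 3.89 | - | 0.47 | 8.24x | - |
| nanmedian | 30.50 | 8.36 | 0.47 | 65.20x | 17.87x |
| nanvar | 2.23 | 1.93 | 0.20 | 10.98x | 9.49x |
| nanstd | 2.25 | 1.92 | 0.24 | 9.40x | 8.02x |
| nanmin | 0.18 | 0.71 | 0.10 | 1.78x | 7.14x |
| nanmax | 0.17 | 0.72 | 0.11 | 1.63x | 6.76x |
| nanminmax | 0.34 | 1.45 | 0.22 | 1.58x | 6.68x |
| nansum | 0.53 | 0.72 | 0.17 | 3.02x | 4.14x |

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| nanmean | 0.75 | 0.72 | 0.15 | 5.05x | 4.86x |
| nanaverage | 4.83 | - | 0.20 | 24.72x | - |
| nanmedian | 27.40 | 8.37 | 0.54 | 50.57x | 15.45x |
| nanvar | 2.06 | 1.80 | 0.17 | 12.17x | 10.59x |
| nanstd | 2.05 | 1.90 | 0.18 | 11.58x | 10.71x |
| nanmin | 0.13 | 0.72 | 0.15 | 0.86x | 4.94x |
| nanmax | 0.13 | 0.70 | 0.10 | 1.22x | 6.84x |
| nanminmax | 0.25 | 1.41 | 0.19 | 1.33x | 7.41x |
| nansum | 0.52 | 0.71 | 0.17 | 3.12x | 4.28x |

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| nanmean | 0.75 | 0.74 | 0.20 | 3.80x | 3.73x |
| nanaverage | 3.26 | - | 0.49 | 6.68x | - |
| nanmedian | 28.53 | 8.30 | 0.51 | 55.97x | 16.28x |
| nanvar | 1.69 | 1.62 | 0.23 | 7.31x | 6.99x |
| nanstd | 1.68 | 1.88 | 0.21 | 7.96x | 8.90x |
| nanmin | 0.07 | 0.68 | 0.09 | 0.76x | 7.72x |
| nanmax | 0.07 | 0.67 | 0.08 | 0.82x | 8.22x |
| nanminmax | 0.14 | 1.41 | 0.17 | 0.81x | 8.27x |
| nansum | 0.31 | 0.73 | 0.21 | 1.52x | 3.54x |

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| nanmean | 0.65 | 0.72 | 0.16 | 4.08x | 4.48x |
| nanaverage | 3.57 | - | 0.24 | 15.16x | - |
| nanmedian | 26.70 | 8.36 | 0.64 | 41.86x | 13.10x |
| nanvar | 1.75 | 1.81 | 0.21 | 8.40x | 8.68x |
| nanstd | 1.70 | 1.84 | 0.21 | 8.31x | 8.98x |
| nanmin | 0.11 | 0.70 | 0.10 | 1.10x | 7.25x |
| nanmax | 0.10 | 0.68 | 0.09 | 1.10x | 7.22x |
| nanminmax | 0.21 | 1.44 | 0.21 | 1.03x | 6.96x |
| nansum | 0.78 | 0.71 | 0.16 | 4.84x | 4.42x |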
Short axis=0 stacks
When shape[0] is small and axis=0, reducers may lose to NumPy/Bottleneck on cheap reductions (such as min, max, mean, sum).
A larger shape[0] or a larger output size (prod(shape[1:])) tips the balance toward reducers.
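(11, 100, 100), float64

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| nanmean | 0.12 | 0.03 | 0.05 | 2.19x | 0.64x |
| nanaverage | 0.50 | - | 0.08 | 6.32x | - |
| nanmedian | 1.90 | 0.75 | 0.26 | 7.25x | 2.86x |
| nanvar | 0.26 | 0.10 | 0.08 | 3.31x | 1.29x |
| nanstd | 0.27 | 0.10 | 0.05 | 5.36x | 2.06x |
| nanmin | 0.02 | 0.04 | 0.02 | 1.05x | 2.07x |
| nanmax | 0.02 | 0.04 | 0.02 | 1.04x | 1.95x |
| nanminmax | 0.04 | 0.09 | 0.04 | 1.06x | 2.12x |
| nansum | 0.06 | 0.04 | 0.05 | 1.08x | 0.71x |

(31, 100, 100), float64

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| nanmean | 0.31 | 0.14 | 0.14 | 2.25x | 1.00x |
| nanaverage | 1.20 | - | 0.20 | 6.02x | - |
| nanmedian | 7.54 | 2.85 | 0.64 | 11.73x | 4.44x |
| nanvar | 0.69 | 0.40 | 0.09 | 7.54x | 4.42x |
| nanstd | 0.67 | 0.41 | 0.08 | 8.54x | 5.25x |
| nanmin | 0.06 | 0.14 | 0.05 | 1.03x | 2.48x |
| nanmax | 0.06 | 0.14 | 0.06 | 1.02x | 2.44x |
| nanminmax | 0.11 | 0.27 | 0.11 | 1.03x | 2.48x |
| nansum | 0.17 | 0.14 | 0.14 | 1.20x | 0.98x |

(11, 512, 512), float64

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| nanmean | 3.09 | 1.48 | 0.40 | 7.64x | 3.66x |
| nanaverage | 12.13 | - | 1.18 | 10.29x | - |
| nanmedian | 58.18 | 26.32 | 5.77 | 10.09x | 4.56x |
| nanvar | 6.67 | 4.92 | 0.57 | 11.68x | 8.63x |
| nanstd | 6.68 | 4.71 | 0.54 | 12.32x | 8.69x |
| nanmin | 0.48 | 1.37 | 0.21 | 2.32x | 6.61x |
| nanmax | 0.49 | 1.41 | 0.23 | 2.12x | 6.13x |
| nanminmax | 0.99 | 2.81 | 0.49 | 2.03x | 5.73x |
| nansum | 1.63 | 1.30 | 0.39 | 4.16x | 3.34x |

(11, 100, 100), float32

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| nanmean | 0.10 | 0.04 | 0.06 | 1.63x | 0.69x |
| nanaverage | 0.46 | - | 0.08 | 5.70x | - |
| nanmedian | 1.85 | 0.75 | 0.25 | 7.42x | 3.02x |
| nanvar | 0.21 | 0.10 | 0.06 | 3.36x | 1.51x |
| nanstd | 0.21 | 0.10 | 0.05 | 4.31x | 2.08x |
| nanmin | 0.01 | 0.04 | 0.01 | 1.15x | 4.77x |
| nanmax | 0.01 | 0.04 | 0.01 | 1.14x | 4.79x |
| nanminmax | 0.02 | 0.09 | 0.02 | 1.16x | 5.02x |
| nansum | 0.04 | 0.04 | 0.06 | 0.65x | 0.66x |

(31, 100, 100), float32

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| nanmean | 0.25 | 0.13 | 0.16 | 1.58x | 0.85x |
| nanaverage | 1.10 | - | 0.20 | 5.40x | - |
| nanmedian | 6.87 | 2.92 | 0.29 | 24.04x | 10.22x |
| nanvar | 0.54 | 0.39 | 0.08 | 6.37x | 4.62x |
| nanstd | 0.55 | 0.40 | 0.09 | 5.88x | 4.21x |
| nanmin | 0.02 | 0.13 | 0.02 | 1.07x | 5.95x |
| nanmax | 0.02 | 0.13 | 0.02 | 1.03x | 6.02x |
| nanminmax | 0.05 | 0.27 | 0.04 | 1.07x | 6.13x |
| nansum | 0.10 | 0.13 | 0.16 | 0.65x | 0.84x |

(11, 512, 512), float32

| op | np (ms) | bn (ms) | rd (ms) | np/rd | bn/rd |
| --- | --- | --- | --- | --- | --- |
| nanmean | 2.53 | 1.49 | 0.41 | 6.21x | 3.66x |
| nanaverage | 10.74 | - | 1.16 | 9.27x | - |
| nanmedian | 53.31 | 26.11 | 3.37 | 15.81x | 7.74x |
| nanvar | 5.51 | 4.99 | 0.55 | 9.96x | 9.02x |
| nanstd | 5.40 | 4.98 | 0.59 | 9.21x | 8.49x |
| nanmin | 0.25 | 1.39 | 0.14 | 1.73x | 9.76x |
| nanmax | 0.25 | 1.38 | 0.16 | 1.57x | 8.76x |
| nanminmax | 0.51 | 2.75 | 0.29 | 1.74x | 9.43x |
| nansum | 1.01 | 1.33 | 0.41 | 2.45x | 3.23x |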
Reading The Axis-0 Rows
Two factors explain most axis=0 rows:
shape[0] drives the order-statistic win. NumPy’s median uses partition, but still pays broader axis/output overhead. The gap grows with stack depth and is largest for NaN-aware quantiles.
Output size (prod(arr.shape[1:])) matters. reducers parallelizes across output elements. NumPy’s cheap strided loops can win on small outputs, but (11, 512, 512) has ~26x more outputs than (11, 100, 100), and there reducers wins most cases.
Some reducers stream along the non-stacking axes. For var, std, and count_finite, reducers walks each stack level as contiguous memory over the non-stacking axes and updates many output positions at once. For finite-only medians, it skips non-finite values while reading along the stacking axis, so the scratch buffer contains only useful values. In image-stack terms: it scans each stack level over its output positions, rather than repeatedly gathering one output position from every stack level into a tiny temporary list.
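The two access patterns in the last point can be illustrated with plain NumPy. The sketch below is not the reducers implementation: the Welford-style update and both function names are assumptions chosen to show the idea of walking one stack level at a time versus gathering one output position from every level.

```python
import numpy as np


def var_streaming(stack, ddof=1):
    """Variance over axis 0, updating all output positions per stack level."""
    count = 0
    mean = np.zeros(stack.shape[1:], dtype=np.float64)
    m2 = np.zeros(stack.shape[1:], dtype=np.float64)
    for level in stack:  # each level is one contiguous block in C order
        count += 1
        delta = level - mean
        mean += delta / count
        m2 += delta * (level - mean)  # Welford update of the squared deviations
    return m2 / (count - ddof)


def var_gather(stack, ddof=1):
    """Same result, gathering one output position from every stack level."""
    out = np.empty(stack.shape[1:], dtype=np.float64)
    for idx in np.ndindex(*stack.shape[1:]):
        column = stack[(slice(None),) + idx]  # strided read across levels
        out[idx] = np.var(column, ddof=ddof)
    return out


stack = np.random.default_rng(0).random((11, 64, 64))
reference = np.var(stack, axis=0, ddof=1)
print(np.allclose(var_streaming(stack), reference))
print(np.allclose(var_gather(stack), reference))
```

Both functions return the same values; the streaming version touches memory in long contiguous runs, while the gather version builds a tiny temporary per output position.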
Practical rule: for tiny shallow stacks and cheap scans, NumPy/Bottleneck may be better. For larger outputs or expensive reducers (median, percentile, var/std), reducers is usually the better path.