import numpy as np
def prepare_workspace(stack):
dtype = np.float64 if stack.dtype == np.dtype(np.int32) else np.float32
return np.ascontiguousarray(stack, dtype=dtype)How To Get Maximum Performance
This page collects the performance model, implementation notes, thread tuning, and optimized-workspace guidance for imcombiners. Exact timings vary with shape, dtype, rejection density, stack depth, memory layout, CPU, and thread settings, so treat printed numbers as local measurements rather than universal constants. See 2-D Image Benchmarks and 1-D Array Benchmarks for recorded benchmark tables.
1. Implementation Tricks
imcombiners is built around image-stack reductions where both algorithmic work and memory traffic matter. The following implementation choices are intended to change runtime and memory traffic, not mathematical results. Here, “vector” is a Rust term for a contiguous 1-D array.
- Fuse output-only rejection and final reduction when it is exact. IRAF avoids duplicate work when the rejection path already has the retained samples needed for the requested final value.
imcapplies that work-elimination principle through supportedmean/medianoutput-only Rust paths. - Axis-first public layout.
(N, *spatial)matches the cheapnp.stack(images, axis=0)construction path. Building a C-order(H, W, N)cube from separate images can cost a lot; see Why The Stack Axis Is First. - Prepared float workspaces. Public calls validate and promote dtypes at the boundary; hot loops can prepare a C-contiguous
float32/float64workspace once and then usevalidate=Falseon already-checked data. - Trusted 1-D direct calls. The
_1dreduction wrappers sendvalidate=Falsecalls directly into Rust instead of re-running Python array preparation. This path assumes the caller already supplied a contiguousfloat32orfloat64vector. - Fused output-only kernels. When
diagnostics=None, mean/median rejection calls combine rejection and final reduction in one Rust pass. If it were not for this, the code should make a full rejection mask, copying a NaN-filled stack, and calling a second combine kernel. - Selective median instead of full sort. Median-style combine kernels compact finite values for one output pixel and select only the needed order statistic instead of sorting the whole per-pixel stack.
- Per-worker scratch reuse. Each Rayon worker keeps reusable temporary buffers for one per-output stack: copied pixel values, boolean masks, sorted rank pairs, and statistic work arrays. As the worker moves from one output pixel to the next, those buffers are cleared and refilled instead of newly allocated.
- Scratch reuse in diagnostic rejection. Full diagnostic
sigclip,minmax, andpclippaths also reuse per-worker stack-axis buffer, mask, rank-pair, residual, and statistic buffers; only the final per-output mask/diagnostic payload is retained. - Raw contiguous slice indexing. After validation, the Rust kernels require C-contiguous NumPy arrays. They view the stack as one flat memory slice and compute offsets directly, avoiding repeated multidimensional ndarray indexing and shape/stride bookkeeping in inner loops.
- Unrolled reductions with separate accumulators. Sum, mean, and variance scans process several stack samples per loop and keep separate running sums/counts before merging them at the end. This reduces loop overhead and short dependency chains in the CPU pipeline. For axis-0 image stacks, the Rust loop advances a flat slice index by
H * Wto readarr[:, y, x]without repeated multidimensional indexing. - Branch-light min/max scans. NaN-aware min/max reductions use the same numeric min/max semantics as the desired
nanmin/nanmaxbehavior, so the inner loop does not need a separate branch just to ask whether the current output is stillNaN. This is not “Rust is faster by default”; it is a specific way to avoid an unnecessary branch in a very small hot loop. Large contiguous 1-D min/max vectors also use Rayon chunk reductions. - CCD clip without a full gain copy.
ccdclip,ccdclip_mask, and output-onlyccdclip_combinepassgaininto Rust and evaluate rejection on gain-corrected per-pixel scratch values instead of first materializing a fullarr / gainstack. Output-onlyccdclipstill combines the original source values. - Mask-aware finite handling. Non-finite samples are folded into masks once, so later inner loops can avoid redundant finite checks where correctness permits.
- Squared-spread sigma clipping.
sigmais the user-supplied dimensionless multiplier, not the measured data spread. The ordinary test isabs(x - center) > sigma * std, wherestdis the standard deviation forstdfunc="std", the scaled MAD estimate forstdfunc="mad", or the CCD noise sigma forccdclip. The Rust loop squares both sides and tests(x - center)^2 > sigma^2 * std^2; forstdfunc="std",std^2is variance. That lets output-only and mask-only paths avoid computing square roots (sqrt) on every iteration; the square root is only needed when returning diagnosticstd. - MAD sigma without a full sort.
sigclip(stdfunc="mad")is primarily a robustness option, not a reason to choose MAD for speed. Its implementation still avoids a full sort: it computes1.4826 * median(abs(x - clip_cen))using the same selective-median scratch buffer strategy as other median-style kernels. - Rank-pair final combine. Output-only
minmaxandpclipmean/median paths compute the final value from the already-built retained rank pairs instead of copying a stack-axis source values and scanning a second mask. - Reusable pclip residual scores. The
pclipimplementation follows IRAF-style rank-offset clipping. During that calculation it builds a temporary vector of normalized distance scores used to decide which sorted values to keep whennkeeprestoration is needed. That vector is reused from per-worker scratch memory instead of allocated fresh for every output pixel.
Why The Stack Axis Is First
imcombiners uses (N, *spatial) stacks, so a normal image cube is shaped (N, H, W) and the combine/reject axis is axis 0. This is deliberate.
At first glance, (H, W, N) looks attractive because the per-pixel stack would be contiguous in memory for C-order NumPy arrays. That can make the inner Rust kernel read pattern slightly better when the axis-last workspace has already been prepared. However, building that workspace is not free:
stack0 = np.stack(images, axis=0) # shape (N, H, W)
stackn = np.stack(images, axis=-1) # shape (H, W, N)Both calls allocate and copy the full cube. The difference is where each loaded 2-D image lands. With axis=0, each image is copied into one contiguous slab of the output array. With axis=-1, each image is copied into a strided [..., i] plane of a C-order array. For common FITS-style workflows where images are loaded one at a time and combined once, that strided construction cost can be comparable to a whole rejection/combine kernel call.
From local test, I found the prepared axis-last Rust kernels were only slightly faster for representative sigclip, ccdclip, minmax, and pclip mean/median cases, while np.stack(..., axis=-1) could be much slower than np.stack(..., axis=0) for large image stacks. Therefore the public convention keeps N on axis 0: it matches the cheapest natural NumPy stacking pattern and avoids making one-off reductions pay a large layout-construction cost for a small gain. It also keeps per-frame masks, weights, zero/scale values, growth behavior, and full rejection diagnostics aligned with one documented stack axis.
- NOTE:
np.moveaxis(stack, 0, -1)is cheap as a view, but it does not make the data contiguous in the axis-last order. A Rust kernel requiring contiguous(H, W, N)storage would still need a full copy such asnp.ascontiguousarray(np.moveaxis(stack, 0, -1)).
Dependency probes
Several Rust statistics crates (ndarray-stats, argminmax, scirs2-stats) were checked making kernels in this repo. A small benchmark on (N, 384, 384) float32stacks measured the median combine core and a simple min/max scan:
stack N |
imc’s old median |
imc’s new selective median |
ndarray-stats median |
imc’s scalar min/max scan |
using argminmax |
|---|---|---|---|---|---|
| 31 | 39.1 ms | 19.0 ms | 98.2 ms | 5.21 ms | 8.82 ms |
| 5 | 7.80 ms | 3.26 ms | 33.0 ms | 0.11 ms | 3.42 ms |
These numbers are not a claim that the external crates are slow in general.
ndarray-statsuses a generic mutable axis-quantile path (plus version differences fromimc);argminmaxis useful for contiguous vectors, butimcfirst has to gather each axis-0 pixel stack from strided(N, H, W)storage;scirs2-statshas a broader SciPy-like dependency surface thanimcneeds.
2. API Choice
The shortest rule is:
- Use Standard Combiner or
ndcombine()described in tutorials when you only need the final image. - Use Chained Combiner or
diagnostics="simple"/"full"when you needlast_reject,mask_rej,low,upp,nit, oroutput_flags, plussample_flags. - Use direct kernel calls (
import imcombiners.kernels as imck) only for trusted hot loops where the input layout is already guaranteed.
Direct imck calls are not a second algorithm. For output-only rejection, they use the same Rust kernels as the fast public routes. Their advantage is smaller Python-boundary work: no object construction, no repeated validation, no default workspace copy, and no retained diagnostics unless the called kernel returns them. Also users can use *_1d kernels for their 1-D array calculator - they are designed to be very fast and useful. Once the input is already a contiguous floating workspace, imck.sigclip_combine(..., validate=False) and Combiner(..., copy=False, validate=False).combine(..., diagnostics=None) are near-tie comparisons around the same core kernel.
Workspace Requirements
The trusted Rust paths require:
- C-contiguous layout —
arr.flags['C_CONTIGUOUS']must beTrue. - Float workspace dtype —
float32foruint8,uint16,int16, andfloat32inputs;float64forint32andfloat64inputs.
This function shows one example:
int32→float64- All other (in many astronomical contexts, the only other is
uint16) →float32
3. Public API Baseline
The baseline for output-only timing is the default Standard Combiner: Combiner(...).combine(..., diagnostics=None). It keeps the public object API, returns only the final image, and can use the fused reject-and-combine Rust path for supported mean/median cases.
This benchmark compares that baseline with three nearby public routes:
SC-None: Standard Combiner,Combiner(...).combine(..., diagnostics=None), output-only reference row.nd-None: compactndcombine(..., diagnostics=None), output only.CC: Chained Combiner,Combiner(...).reject(...).combine(...), retained rejection state.nd-simple: compactndcombine(..., diagnostics="simple")[0], final image plus rejection diagnostics.
Diagnostic rows ask for different work, so they are grouped separately and compared with Chained Combiner. Output-only rows are compared with Standard Combiner.
import numpy as np
import imcombiners as imc
import timedef time_call(func, warmups=5, repeat=100):
result = None
for _ in range(warmups):
result = func()
times = []
for _ in range(repeat):
t0 = time.perf_counter()
result = func()
times.append(time.perf_counter() - t0)
return result, float(np.median(times))
def comparison_text(seconds, reference_seconds):
ratio = reference_seconds / seconds
if np.isclose(ratio, 1.0, rtol=0.05):
return "Near-tie"
if ratio > 1.0:
return f"{ratio:.2f}x faster"
return f"{1.0 / ratio:.2f}x slower"
def print_timing_table(rows, reference):
print(f"{'name':<10} {'work':<12} {'ms':>9} comparison")
ref_seconds = rows[reference]["seconds"]
for name, row in rows.items():
seconds = row["seconds"]
if name == reference:
comparison = ""
else:
comparison = f"{comparison_text(seconds, ref_seconds)} vs {reference}"
print(f"{name:<10} {row['work']:<12} {seconds * 1e3:9.1f} {comparison}")rng = np.random.default_rng(20250311)
stack = np.clip(
rng.normal(1000.0, 30.0, size=(31, 512, 512)),
0,
np.iinfo(np.uint16).max,
).astype("uint16")
kw_sc = dict(combine="median", reject="sigclip", sigma=(3.0, 3.0), maxiters=5, ddof=0)
sigc = imc.SigClip(sigma=(3.0, 3.0), maxiters=5, ddof=0)
timings = {}
timings["SC-None"] = {}
timings["SC-None"]["result"], timings["SC-None"]["seconds"] = time_call(
lambda: imc.Combiner(stack).combine("median", rejectors=sigc, diagnostics=None)
)
timings["SC-None"]["work"] = "output-only"
timings["nd-None"] = {}
timings["nd-None"]["result"], timings["nd-None"]["seconds"] = time_call(
lambda: imc.ndcombine(stack, **kw_sc, diagnostics=None)
)
timings["nd-None"]["work"] = "output-only"
timings["CC"] = {}
timings["CC"]["result"], timings["CC"]["seconds"] = time_call(
lambda: imc.Combiner(stack).reject(sigc).combine("median")
)
timings["CC"]["work"] = "diagnostic"
timings["nd-simple"] = {}
timings["nd-simple"]["result"], timings["nd-simple"]["seconds"] = time_call(
lambda: imc.ndcombine(stack, **kw_sc, diagnostics="simple")[0]
)
timings["nd-simple"]["work"] = "diagnostic"
np.testing.assert_allclose(timings["SC-None"]["result"], timings["nd-None"]["result"], rtol=1e-5)
np.testing.assert_allclose(timings["SC-None"]["result"], timings["CC"]["result"], rtol=1e-5)
np.testing.assert_allclose(timings["SC-None"]["result"], timings["nd-simple"]["result"], rtol=1e-5)
out_diagnostic_one_call = timings["nd-simple"]["result"]
t_diagnostic_one_call = timings["nd-simple"]["seconds"]
out_output_only_one_call = timings["nd-None"]["result"]
t_output_only_one_call = timings["nd-None"]["seconds"]
out_output_only_combiner = timings["SC-None"]["result"]
t_output_only_combiner = timings["SC-None"]["seconds"]
out_diagnostic_chained = timings["CC"]["result"]
t_diagnostic_chained = timings["CC"]["seconds"]
print_timing_table(
{k: timings[k] for k in ("SC-None", "nd-None")},
reference="SC-None",
)
print()
print_timing_table(
{k: timings[k] for k in ("CC", "nd-simple")},
reference="CC",
)name work ms comparison
SC-None output-only 99.5
nd-None output-only 84.1 1.18x faster vs SC-None
name work ms comparison
CC diagnostic 144.8
nd-simple diagnostic 147.4 Near-tie vs CC
nd-simple and CC both compute complicated diagnostic products. Their timings should be same-order and slow; small differences are very likely noise. The diagnostics=None will return only the final image and can use a fused reject-and-combine optimization as mentioned above, so they are faster. nd-None may be faster than SC-None because while Combiner() keeps its default copy=True and makes an owned workspace copy after validation. Setting copy=False maybe dangerous but can be beneficial when the user does not care about the input stack after the call (e.g., loading image files from disk and combine - no need to preserve loaded stack). Use SC-None Combiner().combine() approach when you want that fast path with an object-oriented API; use CC only when you want retained rejection state.
4. Prepared Data & validate=False
Prepare the workspace once, then skip repeated validation inside the hot loop. This section compares each validate=False call against the same public API shape with validation left on. That makes the ratio answer one clear question: how much did preparing the workspace once save?
- CC-prep: CC-None with
validate=False - SC-prep: SC-None with
validate=False&©=False
stack_ws = prepare_workspace(stack)
def output_only_combiner_no_validation(arr):
return imc.Combiner(arr, copy=False, validate=False).combine(
"median",
rejectors=sigc,
diagnostics=None,
)
prepared = {}
prepared["CC"] = timings["CC"]
prepared["CC-prep"] = {}
prepared["CC-prep"]["result"], prepared["CC-prep"]["seconds"] = time_call(
lambda: (
imc.Combiner(stack_ws, validate=False)
.reject(sigc)
.combine("median")
)
)
prepared["CC-prep"]["work"] = "diagnostic"
prepared["SC-None"] = timings["SC-None"]
prepared["SC-prep"] = {}
prepared["SC-prep"]["result"], prepared["SC-prep"]["seconds"] = time_call(
lambda: output_only_combiner_no_validation(stack_ws)
)
prepared["SC-prep"]["work"] = "output-only"
np.testing.assert_allclose(prepared["CC"]["result"], prepared["CC-prep"]["result"], rtol=1e-5)
np.testing.assert_allclose(prepared["SC-None"]["result"], prepared["SC-prep"]["result"], rtol=1e-5)
out_chained_prepared = prepared["CC-prep"]["result"]
t_chained_prepared = prepared["CC-prep"]["seconds"]
out_output_only_prepared = prepared["SC-prep"]["result"]
t_output_only_prepared = prepared["SC-prep"]["seconds"]
print_timing_table(
{k: prepared[k] for k in ("CC", "CC-prep")},
reference="CC",
)
print()
print_timing_table(
{k: prepared[k] for k in ("SC-None", "SC-prep")},
reference="SC-None",
)name work ms comparison
CC diagnostic 144.8
CC-prep diagnostic 167.7 1.16x slower vs CC
name work ms comparison
SC-None output-only 99.5
SC-prep output-only 82.5 1.21x faster vs SC-None
SC-prep keeps the output-only Standard Combiner work, skips validation, and uses copy=False because the workspace is already owned by the caller. Due to copy=False, SC-prep may be a bit faster than SC-None.
5. Stack Number (0-th Axis)
Say stack has shape (N, H, W) where N is the number of images. Computation time is roughly
Total time ≈ fixed overhead + per-pixel kernel time(∝ N) * total num of pixels (H * W).
Skipping validation removes the first term: shape/dtype checks, contiguity handling, and possible workspace conversion. The fixed overhead can be a larger fraction of the total runtime when N is small, and skipping validation can show a larger percentage improvement.
rng_small = np.random.default_rng(20250311)
stack_small = np.clip(
rng_small.normal(1000.0, 30.0, size=(5, 512, 512)),
0,
np.iinfo(np.uint16).max,
).astype("uint16")
stack_small_ws = prepare_workspace(stack_small)
out_output_only_small, t_output_only_small = time_call(
lambda: imc.Combiner(stack_small).combine(
"median",
rejectors=sigc,
diagnostics=None,
)
)
out_output_only_small_prepared, t_output_only_small_prepared = time_call(
lambda: output_only_combiner_no_validation(stack_small_ws)
)
np.testing.assert_allclose(
out_output_only_small,
out_output_only_small_prepared,
atol=1e-5,
)
depth_rows = {
"N5-SC": {
"work": "output-only",
"seconds": t_output_only_small,
},
"N5-prep": {
"work": "output-only",
"seconds": t_output_only_small_prepared,
},
"N31-SC": {
"work": "output-only",
"seconds": t_output_only_combiner,
},
"N31-prep": {
"work": "output-only",
"seconds": t_output_only_prepared,
},
}
print_timing_table(
{k: depth_rows[k] for k in ("N5-SC", "N5-prep")},
reference="N5-SC",
)
print()
print_timing_table(
{k: depth_rows[k] for k in ("N31-SC", "N31-prep")},
reference="N31-SC",
)name work ms comparison
N5-SC output-only 14.5
N5-prep output-only 13.7 1.06x faster vs N5-SC
name work ms comparison
N31-SC output-only 99.5
N31-prep output-only 82.5 1.21x faster vs N31-SC
The n=5 and n=31 ratios should not be ranked as a universal rule.
Reserve validate=False for workloads where profiling shows a real bottleneck; on shallow stacks, milliseconds-scale differences may be within timing noise.
6. Thread Tuning
The default settings are good enough for most use cases.
There are three parameters that control imc parallelism:
RAYON_NUM_THREADS— how many threads Rayon uses inside the Rust kernels. Set it before starting Python, or right after importingimcand before any Rayon work; Rayon reads it once at thread-pool initialization and ignores later changes.IMCOMBINERS_PARALLEL_THRESHOLD— the minimum output-element count required to use Rayon at all. Below this thresholdimcruns a serial loop instead. The default is10000, so images smaller than roughly 100x100 run serially.IMCOMBINERS_1D_MINMAX_PARALLEL_THRESHOLD— the vector length wheremin_1dandmax_1dswitch from serial scan to Rayon chunk reduction. The default is1000000. This is separate fromIMCOMBINERS_PARALLEL_THRESHOLDbecause 1-D min/max has one output value, while image-stack combine/rejection parallelizes over many output pixels.
Both knobs can be set from the shell before starting Python:
export RAYON_NUM_THREADS=8
export IMCOMBINERS_PARALLEL_THRESHOLD=4096
export IMCOMBINERS_1D_MINMAX_PARALLEL_THRESHOLD=1000000
python your_pipeline.pyThey can also be set from Python before heavy imc work:
import imcombiners as imc
imc.set_rayon_num_threads(8)
imc.set_parallel_threshold(4096)
imc.set_minmax_1d_parallel_threshold(1_000_000)
print(
imc.get_rayon_num_threads(),
imc.get_parallel_threshold(),
imc.get_minmax_1d_parallel_threshold(),
)For a stack shaped (N, H, W), the threshold is compared with H * W; for arbitrary N-D stacks, it is compared with np.prod(stack.shape[1:]). For min_1d and max_1d, the 1-D min/max threshold is compared with values.size.
Thread Benchmark Script
benchmark_threads.py measures both the serial path and a range of Rayon thread counts in fresh subprocesses, so every candidate gets a clean thread pool. By default it runs sigclip_median on both (5, 512, 512) and (31, 512, 512) stacks and reports separate recommendations:
uv run python benchmarks/benchmark_threads.pyThe reported values are a starting point for the measured workload, not a package-wide optimum. To tune for your own shape or operation:
uv run python benchmarks/benchmark_threads.py \
--op median --dtype float32 --n 31 --height 1024 --width 1024Pass multiple --n values to cover several stack sizes in one run:
uv run python benchmarks/benchmark_threads.py --n 5 31 63Default thread-count candidates. On machines with <= 32 logical CPUs the script tests every parallel thread count: 2, 3, …, os.cpu_count(). A dedicated serial baseline run covers the non-Rayon case, so single-thread Rayon is only tested if you pass it explicitly with --threads.
On machines with > 32 logical CPUs, os.cpu_count() may not match your scheduler allocation. Pass --threads with explicit candidates, for example based on SLURM_CPUS_PER_TASK, or --force-test to get the sparse {2, 4, 8, ..., ncpu} list anyway.
The benchmark forces IMCOMBINERS_PARALLEL_THRESHOLD=1 for every parallel candidate so Rayon is exercised regardless of image size. This avoids a small image silently running the serial path at every thread count. The forced value is a measurement tool, not a production recommendation.
More threads does not always mean faster. Repeated runs can recommend different near-best values because timing varies with operation, dtype, number of images, output shape, CPU allocation, memory bandwidth, and machine load. For example, on an MBP 14” [2024, macOS 26.4.1, M4Pro(8P+4E/G20c/N16c/48G)], RAYON_NUM_THREADS=6 was optimal for n=5 while RAYON_NUM_THREADS=12 was optimal for n=31, and both changed by about two threads after the machine load shifted. The tuner treats median times within --tie-tolerance of the fastest as near-ties and recommends the lower thread count. For a more stable estimate, increase --repeats and --warmups:
uv run python benchmarks/benchmark_threads.py --repeats 15 --warmups 3If your pipeline also calls threaded NumPy/BLAS code, consider pinning OPENBLAS_NUM_THREADS, OMP_NUM_THREADS, MKL_NUM_THREADS, and VECLIB_MAXIMUM_THREADS to 1 while benchmarking to avoid measuring nested oversubscription.