How To Get Maximum Performance

This page collects the performance model, implementation notes, thread tuning, and optimized-workspace guidance for imcombiners. Exact timings vary with shape, dtype, rejection density, stack depth, memory layout, CPU, and thread settings, so treat printed numbers as local measurements rather than universal constants. See 2-D Image Benchmarks for recorded imcombiners benchmark tables.

1. Implementation Tricks

imcombiners is built around image-stack reductions where both algorithmic work and memory traffic matter. The following implementation choices are intended to change runtime and memory traffic, not mathematical results. Here, “vector” is a Rust term for a contiguous 1-D array.

  • Fuse output-only rejection and final reduction when it is exact. IRAF avoids duplicate work when the rejection path already has the retained samples needed for the requested final value. imc applies that work-elimination principle through supported mean/median output-only Rust paths.
  • Hybrid reducer ownership. General-purpose reducers live in reducers; imc calls them for generic stack reductions where their axis kernels are competitive. imc keeps image-stack-specific median traversal and fused reject+combine loops when per-output scratch reuse, mask handling, or rejection state makes the generic reducer shape slower.
  • Axis-first public layout. (N, *spatial) matches the cheap np.stack(images, axis=0) construction path. Building a C-order (H, W, N) cube from separate images can cost a lot; see Why The Stack Axis Is First.
  • Prepared float workspaces. Public calls validate and promote dtypes at the boundary; hot loops can prepare a C-contiguous float32/float64 workspace once and then use validate=False on already-checked data.
  • Fused output-only kernels. When diagnostics=None, mean/median rejection calls combine rejection and final reduction in one Rust pass. If it were not for this, the code should make a full rejection mask, copying a NaN-filled stack, and calling a second combine kernel.
  • Selective order statistics instead of full sort. Median/percentile-style combine kernels compact finite values for one output pixel and select only the needed order statistic instead of sorting the whole per-pixel stack. imc owns the image-stack traversal and scratch reuse; reducers supplies the valid-buffer in-place selection primitives used after compaction.
  • Fused variance/mean only when both are requested. Plain variance uses reducers’ axis scan directly. When return_mean=True, imc compacts each finite output stack into reused scratch and calls reducers’ borrowed-slice (variance, mean) primitive so the mean is not computed by a second full axis reduction.
  • Per-worker scratch reuse. Each Rayon worker keeps reusable temporary buffers for one per-output stack: copied pixel values, boolean masks, sorted rank pairs, and statistic work arrays. As the worker moves from one output pixel to the next, those buffers are cleared and refilled instead of newly allocated.
  • Scratch reuse in diagnostic rejection. Full diagnostic sigclip, minmax, and pclip paths also reuse per-worker stack-axis buffer, mask, rank-pair, residual, and statistic buffers; only the final per-output mask/diagnostic payload is retained.
  • Raw contiguous slice indexing. After validation, the Rust kernels require C-contiguous NumPy arrays. They view the stack as one flat memory slice and compute offsets directly, avoiding repeated multidimensional ndarray indexing and shape/stride bookkeeping in inner loops.
  • Unrolled reductions with separate accumulators. Sum, mean, and variance scans process several stack samples per loop and keep separate running sums/counts before merging them at the end. This reduces loop overhead and short dependency chains in the CPU pipeline. For axis-0 image stacks, the Rust loop advances a flat slice index by H * W to read arr[:, y, x] without repeated multidimensional indexing.
  • Branch-light min/max scans. NaN-aware stack min/max reductions use the same numeric min/max semantics as the desired nanmin/nanmax behavior, so the inner loop does not need a separate branch just to ask whether the current output is still NaN. This is not “Rust is faster by default”; it is a specific way to avoid an unnecessary branch in a very small hot loop.
  • CCD clip without a full gain copy. ccdclip, ccdclip_mask, and output-only ccdclip_combine pass gain into Rust and evaluate rejection on gain-corrected per-pixel scratch values instead of first materializing a full arr / gain stack. Output-only ccdclip still combines the original source values.
  • Mask-aware finite handling. Non-finite samples are folded into masks once, so later inner loops can avoid redundant finite checks where correctness permits.
  • Squared-spread sigma clipping. sigma is the user-supplied dimensionless multiplier, not the measured data spread. The ordinary test is abs(x - center) > sigma * std, where std is the standard deviation for stdfunc="std", the scaled MAD estimate for stdfunc="mad", or the CCD noise sigma for ccdclip. The Rust loop squares both sides and tests (x - center)^2 > sigma^2 * std^2; for stdfunc="std", std^2 is variance. That lets output-only and mask-only paths avoid computing square roots (sqrt) on every iteration; the square root is only needed when returning diagnostic std.
  • MAD sigma without a full sort. sigclip(stdfunc="mad") is primarily a robustness option, not a reason to choose MAD for speed. Its implementation still avoids a full sort: it computes 1.4826 * median(abs(x - clip_cen)) using the same selective-median scratch buffer strategy as other median-style kernels.
  • Rank-pair final combine. Output-only minmax and pclip mean/median paths compute the final value from the already-built retained rank pairs instead of copying a stack-axis source values and scanning a second mask.
  • Reusable pclip residual scores. The pclip implementation follows IRAF-style rank-offset clipping. During that calculation it builds a temporary vector of normalized distance scores used to decide which sorted values to keep when nkeep restoration is needed. That vector is reused from per-worker scratch memory instead of allocated fresh for every output pixel.

Why The Stack Axis Is First

imcombiners uses (N, *spatial) stacks, so a normal image cube is shaped (N, H, W) and the combine/reject axis is axis 0. This is deliberate.

At first glance, (H, W, N) looks attractive because the per-pixel stack would be contiguous in memory for C-order NumPy arrays. That can make the inner Rust kernel read pattern slightly better when the axis-last workspace has already been prepared. However, building that workspace is not free:

stack0 = np.stack(images, axis=0)   # shape (N, H, W)
stackn = np.stack(images, axis=-1)  # shape (H, W, N)

Both calls allocate and copy the full cube. The difference is where each loaded 2-D image lands. With axis=0, each image is copied into one contiguous slab of the output array. With axis=-1, each image is copied into a strided [..., i] plane of a C-order array. For common FITS-style workflows where images are loaded one at a time and combined once, that strided construction cost can be comparable to a whole rejection/combine kernel call.

From local test, I found the prepared axis-last Rust kernels were only slightly faster for representative sigclip, ccdclip, minmax, and pclip mean/median cases, while np.stack(..., axis=-1) could be much slower than np.stack(..., axis=0) for large image stacks. Therefore the public convention keeps N on axis 0: it matches the cheapest natural NumPy stacking pattern and avoids making one-off reductions pay a large layout-construction cost for a small gain. It also keeps per-frame masks, weights, zero/scale values, growth behavior, and full rejection diagnostics aligned with one documented stack axis.

  • NOTE: np.moveaxis(stack, 0, -1) is cheap as a view, but it does not make the data contiguous in the axis-last order. A Rust kernel requiring contiguous (H, W, N) storage would still need a full copy such as np.ascontiguousarray(np.moveaxis(stack, 0, -1)).

2. API Choice

The shortest rule is:

  • Use Standard Combiner or ndcombine() described in tutorials when you only need the final image.
  • Use Chained Combiner or diagnostics="simple"/"full" when you need last_reject, mask_rej, low, upp, nit, or output_flags, plus sample_flags.
  • Use direct kernel calls (import imcombiners.kernels as imck) only for trusted hot loops where the input layout is already guaranteed.

Direct imck calls are not a second algorithm. For output-only rejection, they use the same Rust kernels as the fast public routes. Their advantage is smaller Python-boundary work: no object construction, no repeated validation, no default workspace copy, and no retained diagnostics unless the called kernel returns them. Once the input is already a contiguous floating workspace, imck.sigclip_combine(..., validate=False) and Combiner(..., copy=False, validate=False).combine(..., diagnostics=None) are near-tie comparisons around the same core kernel. Use reducers directly for standalone 1-D reductions.

Workspace Requirements

The trusted Rust paths require:

  1. C-contiguous layoutarr.flags['C_CONTIGUOUS'] must be True.
  2. Float workspace dtypefloat32 for uint8, uint16, int16, and float32 inputs; float64 for int32 and float64 inputs.

This function shows one example:

import numpy as np

def prepare_workspace(stack):
    dtype = np.float64 if stack.dtype == np.dtype(np.int32) else np.float32
    return np.ascontiguousarray(stack, dtype=dtype)
  • int32float64
  • All other (in many astronomical contexts, the only other is uint16) → float32

3. Public API Baseline

The baseline for output-only timing is the default Standard Combiner: Combiner(...).combine(..., diagnostics=None). It keeps the public object API, returns only the final image, and can use the fused reject-and-combine Rust path for supported mean/median cases.

This benchmark compares that baseline with three nearby public routes:

  • SC-None: Standard Combiner, Combiner(...).combine(..., diagnostics=None), output-only reference row.
  • nd-None: compact ndcombine(..., diagnostics=None), output only.
  • CC: Chained Combiner, Combiner(...).reject(...).combine(...), retained rejection state.
  • nd-simple: compact ndcombine(..., diagnostics="simple")[0], final image plus rejection diagnostics.

Diagnostic rows ask for different work, so they are grouped separately and compared with Chained Combiner. Output-only rows are compared with Standard Combiner.

import numpy as np
import imcombiners as imc
import time
def time_call(func, warmups=5, repeat=100):
    result = None
    for _ in range(warmups):
        result = func()

    times = []
    for _ in range(repeat):
        t0 = time.perf_counter()
        result = func()
        times.append(time.perf_counter() - t0)
    return result, float(np.median(times))


def comparison_text(seconds, reference_seconds):
    ratio = reference_seconds / seconds
    if np.isclose(ratio, 1.0, rtol=0.05):
        return "Near-tie"
    if ratio > 1.0:
        return f"{ratio:.2f}x faster"
    return f"{1.0 / ratio:.2f}x slower"


def print_timing_table(rows, reference):
    print(f"{'name':<10} {'work':<12} {'ms':>9}  comparison")
    ref_seconds = rows[reference]["seconds"]
    for name, row in rows.items():
        seconds = row["seconds"]
        if name == reference:
            comparison = ""
        else:
            comparison = f"{comparison_text(seconds, ref_seconds)} vs {reference}"
        print(f"{name:<10} {row['work']:<12} {seconds * 1e3:9.1f}  {comparison}")
rng = np.random.default_rng(20250311)
stack = np.clip(
    rng.normal(1000.0, 30.0, size=(31, 512, 512)),
    0,
    np.iinfo(np.uint16).max,
).astype("uint16")

kw_sc = dict(combine="median", reject="sigclip", sigma=(3.0, 3.0), maxiters=5, ddof=0)
sigc = imc.SigClip(sigma=(3.0, 3.0), maxiters=5, ddof=0)

timings = {}

timings["SC-None"] = {}
timings["SC-None"]["result"], timings["SC-None"]["seconds"] = time_call(
    lambda: imc.Combiner(stack).combine("median", rejectors=sigc, diagnostics=None)
)
timings["SC-None"]["work"] = "output-only"

timings["nd-None"] = {}
timings["nd-None"]["result"], timings["nd-None"]["seconds"] = time_call(
    lambda: imc.ndcombine(stack, **kw_sc, diagnostics=None)
)
timings["nd-None"]["work"] = "output-only"

timings["CC"] = {}
timings["CC"]["result"], timings["CC"]["seconds"] = time_call(
    lambda: imc.Combiner(stack).reject(sigc).combine("median")
)
timings["CC"]["work"] = "diagnostic"

timings["nd-simple"] = {}
timings["nd-simple"]["result"], timings["nd-simple"]["seconds"] = time_call(
    lambda: imc.ndcombine(stack, **kw_sc, diagnostics="simple")[0]
)
timings["nd-simple"]["work"] = "diagnostic"

np.testing.assert_allclose(timings["SC-None"]["result"], timings["nd-None"]["result"], rtol=1e-5)
np.testing.assert_allclose(timings["SC-None"]["result"], timings["CC"]["result"], rtol=1e-5)
np.testing.assert_allclose(timings["SC-None"]["result"], timings["nd-simple"]["result"], rtol=1e-5)

out_diagnostic_one_call = timings["nd-simple"]["result"]
t_diagnostic_one_call = timings["nd-simple"]["seconds"]
out_output_only_one_call = timings["nd-None"]["result"]
t_output_only_one_call = timings["nd-None"]["seconds"]
out_output_only_combiner = timings["SC-None"]["result"]
t_output_only_combiner = timings["SC-None"]["seconds"]
out_diagnostic_chained = timings["CC"]["result"]
t_diagnostic_chained = timings["CC"]["seconds"]

print_timing_table(
    {k: timings[k] for k in ("SC-None", "nd-None")},
    reference="SC-None",
)
print()
print_timing_table(
    {k: timings[k] for k in ("CC", "nd-simple")},
    reference="CC",
)
name       work                ms  comparison
SC-None    output-only       85.2  
nd-None    output-only       80.6  1.06x faster vs SC-None

name       work                ms  comparison
CC         diagnostic       142.6  
nd-simple  diagnostic       152.0  1.07x slower vs CC

nd-simple and CC both compute complicated diagnostic products. Their timings should be same-order and slow; small differences are very likely noise. The diagnostics=None will return only the final image and can use a fused reject-and-combine optimization as mentioned above, so they are faster. nd-None may be faster than SC-None because while Combiner() keeps its default copy=True and makes an owned workspace copy after validation. Setting copy=False maybe dangerous but can be beneficial when the user does not care about the input stack after the call (e.g., loading image files from disk and combine - no need to preserve loaded stack). Use SC-None Combiner().combine() approach when you want that fast path with an object-oriented API; use CC only when you want retained rejection state.

4. Prepared Data & validate=False

Prepare the workspace once, then skip repeated validation inside the hot loop. This section compares each validate=False call against the same public API shape with validation left on. That makes the ratio answer one clear question: how much did preparing the workspace once save?

  • CC-prep: CC-None with validate=False
  • SC-prep: SC-None with validate=False && copy=False
stack_ws = prepare_workspace(stack)

def output_only_combiner_no_validation(arr):
    return imc.Combiner(arr, copy=False, validate=False).combine(
        "median",
        rejectors=sigc,
        diagnostics=None,
    )


prepared = {}

prepared["CC"] = timings["CC"]
prepared["CC-prep"] = {}
prepared["CC-prep"]["result"], prepared["CC-prep"]["seconds"] = time_call(
    lambda: (
        imc.Combiner(stack_ws, validate=False)
        .reject(sigc)
        .combine("median")
    )
)
prepared["CC-prep"]["work"] = "diagnostic"

prepared["SC-None"] = timings["SC-None"]
prepared["SC-prep"] = {}
prepared["SC-prep"]["result"], prepared["SC-prep"]["seconds"] = time_call(
    lambda: output_only_combiner_no_validation(stack_ws)
)
prepared["SC-prep"]["work"] = "output-only"

np.testing.assert_allclose(prepared["CC"]["result"], prepared["CC-prep"]["result"], rtol=1e-5)
np.testing.assert_allclose(prepared["SC-None"]["result"], prepared["SC-prep"]["result"], rtol=1e-5)

out_chained_prepared = prepared["CC-prep"]["result"]
t_chained_prepared = prepared["CC-prep"]["seconds"]
out_output_only_prepared = prepared["SC-prep"]["result"]
t_output_only_prepared = prepared["SC-prep"]["seconds"]

print_timing_table(
    {k: prepared[k] for k in ("CC", "CC-prep")},
    reference="CC",
)
print()
print_timing_table(
    {k: prepared[k] for k in ("SC-None", "SC-prep")},
    reference="SC-None",
)
name       work                ms  comparison
CC         diagnostic       142.6  
CC-prep    diagnostic       157.6  1.10x slower vs CC

name       work                ms  comparison
SC-None    output-only       85.2  
SC-prep    output-only       78.7  1.08x faster vs SC-None

SC-prep keeps the output-only Standard Combiner work, skips validation, and uses copy=False because the workspace is already owned by the caller. Due to copy=False, SC-prep may be a bit faster than SC-None.

5. Stack Number (0-th Axis)

Say stack has shape (N, H, W) where N is the number of images. Computation time is roughly

Total time ≈ fixed overhead + per-pixel kernel time(∝ N) * total num of pixels (H * W).

Skipping validation removes the first term: shape/dtype checks, contiguity handling, and possible workspace conversion. The fixed overhead can be a larger fraction of the total runtime when N is small, and skipping validation can show a larger percentage improvement.

rng_small = np.random.default_rng(20250311)
stack_small = np.clip(
    rng_small.normal(1000.0, 30.0, size=(5, 512, 512)),
    0,
    np.iinfo(np.uint16).max,
).astype("uint16")
stack_small_ws = prepare_workspace(stack_small)

out_output_only_small, t_output_only_small = time_call(
    lambda: imc.Combiner(stack_small).combine(
        "median",
        rejectors=sigc,
        diagnostics=None,
    )
)

out_output_only_small_prepared, t_output_only_small_prepared = time_call(
    lambda: output_only_combiner_no_validation(stack_small_ws)
)

np.testing.assert_allclose(
    out_output_only_small,
    out_output_only_small_prepared,
    atol=1e-5,
)

depth_rows = {
    "N5-SC": {
        "work": "output-only",
        "seconds": t_output_only_small,
    },
    "N5-prep": {
        "work": "output-only",
        "seconds": t_output_only_small_prepared,
    },
    "N31-SC": {
        "work": "output-only",
        "seconds": t_output_only_combiner,
    },
    "N31-prep": {
        "work": "output-only",
        "seconds": t_output_only_prepared,
    },
}

print_timing_table(
    {k: depth_rows[k] for k in ("N5-SC", "N5-prep")},
    reference="N5-SC",
)
print()
print_timing_table(
    {k: depth_rows[k] for k in ("N31-SC", "N31-prep")},
    reference="N31-SC",
)
name       work                ms  comparison
N5-SC      output-only       13.5  
N5-prep    output-only       13.0  Near-tie vs N5-SC

name       work                ms  comparison
N31-SC     output-only       85.2  
N31-prep   output-only       78.7  1.08x faster vs N31-SC

The n=5 and n=31 ratios should not be ranked as a universal rule.

Reserve validate=False for workloads where profiling shows a real bottleneck; on shallow stacks, milliseconds-scale differences may be within timing noise.

6. Thread Tuning

TipTL;DR

The default settings are good enough for most use cases.

imcombiners uses the same Rayon worker pool as reducers, but rejection kernels keep an imc-owned internal parallel policy.

  • RAYON_NUM_THREADS — how many threads Rayon uses inside Rust kernels. Set it before starting Python; Rayon reads it once at thread-pool initialization.
  • reducers parallel grains — per-kernel work grains for generic reducers axis reductions. Use reducers’ built-in defaults for ordinary use, or run its autotuner on the target machine.
  • imc rejection policy — private heuristics for iterative rejection, rank-pair masks, diagnostics, and fused reject+combine paths. These are benchmarked in imc because their cost model is not the same as a generic reducers median or percentile.

Set the Rayon thread count from the shell before starting Python:

export RAYON_NUM_THREADS=8
python your_pipeline.py

Inspect reducers settings for generic reducer workloads:

import reducers as rd

print(rd.get_num_threads())
print(rd.get_parallel_grains())

# Optional machine/workload-specific tuning:
# python -m reducers.autotuner

There is no separate imcombiners parallel-control API. This is intentional: general reductions are implemented by reducers, while imc rejection kernels own their private crossover policy. If a rejection benchmark regresses, tune the imc rejection policy and only propose a reducers API when the pattern is generic outside imcombiners.

Thread Benchmark Script

benchmark_threads.py measures a single-thread baseline and a range of Rayon thread counts in fresh subprocesses, so every candidate gets a clean thread pool. By default it runs sigclip_median on both (5, 512, 512) and (31, 512, 512) stacks and reports separate recommendations:

uv run python benchmarks/benchmark_threads.py

The script prints an environment table before the timing results, including Python, package versions, OS/kernel details, machine type, and logical CPU count. Keep that block with any copied benchmark table so results can be compared later.

The reported values are a starting point for the measured workload, not a package-wide optimum. To tune for your own shape or operation:

uv run python benchmarks/benchmark_threads.py \
  --op median --dtype float32 --n 31 --height 1024 --width 1024

Pass multiple --n values to cover several stack sizes in one run:

uv run python benchmarks/benchmark_threads.py --n 5 31 63
Note

Default thread-count candidates. On machines with <= 32 logical CPUs the script tests every parallel thread count: 2, 3, …, os.cpu_count(). A dedicated serial baseline run covers the non-Rayon case, so single-thread Rayon is only tested if you pass it explicitly with --threads.

On machines with > 32 logical CPUs, os.cpu_count() may not match your scheduler allocation. Pass --threads with explicit candidates, for example based on SLURM_CPUS_PER_TASK, or --force-test to get the sparse {2, 4, 8, ..., ncpu} list anyway.

Note

The benchmark uses the active reducers grain settings for generic reducer operations and imc’s internal policy for rejection kernels. It does not force reducers grains to make rejection kernels parallel; rejection crossover tuning belongs in imc.

More threads does not always mean faster. Repeated runs can recommend different near-best values because timing varies with operation, dtype, number of images, output shape, CPU allocation, memory bandwidth, and machine load. For example, on an MBP 14” [2024, macOS 26.4.1, M4Pro(8P+4E/G20c/N16c/48G)], RAYON_NUM_THREADS=6 was optimal for n=5 while RAYON_NUM_THREADS=12 was optimal for n=31, and both changed by about two threads after the machine load shifted. The tuner treats median times within --tie-tolerance of the fastest as near-ties and recommends the lower thread count. For a more stable estimate, increase --repeats and --warmups:

uv run python benchmarks/benchmark_threads.py --repeats 15 --warmups 3

If your pipeline also calls threaded NumPy/BLAS code, consider pinning OPENBLAS_NUM_THREADS, OMP_NUM_THREADS, MKL_NUM_THREADS, and VECLIB_MAXIMUM_THREADS to 1 while benchmarking to avoid measuring nested oversubscription.