How To Get Maximum Performance

This page collects the performance model, implementation notes, thread tuning, and optimized-workspace guidance for imcombiners. Exact timings vary with shape, dtype, rejection density, stack depth, memory layout, CPU, and thread settings, so treat printed numbers as local measurements rather than universal constants. See 2-D Image Benchmarks and 1-D Array Benchmarks for recorded benchmark tables.

1. Implementation Tricks

imcombiners is built around image-stack reductions where both algorithmic work and memory traffic matter. The following implementation choices are intended to change runtime and memory traffic, not mathematical results. Here, “vector” is a Rust term for a contiguous 1-D array.

  • Fuse output-only rejection and final reduction when it is exact. IRAF avoids duplicate work when the rejection path already has the retained samples needed for the requested final value. imc applies that work-elimination principle through supported mean/median output-only Rust paths.
  • Axis-first public layout. (N, *spatial) matches the cheap np.stack(images, axis=0) construction path. Building a C-order (H, W, N) cube from separate images can cost a lot; see Why The Stack Axis Is First.
  • Prepared float workspaces. Public calls validate and promote dtypes at the boundary; hot loops can prepare a C-contiguous float32/float64 workspace once and then use validate=False on already-checked data.
  • Trusted 1-D direct calls. The _1d reduction wrappers send validate=False calls directly into Rust instead of re-running Python array preparation. This path assumes the caller already supplied a contiguous float32 or float64 vector.
  • Fused output-only kernels. When diagnostics=None, mean/median rejection calls combine rejection and final reduction in one Rust pass. If it were not for this, the code should make a full rejection mask, copying a NaN-filled stack, and calling a second combine kernel.
  • Selective median instead of full sort. Median-style combine kernels compact finite values for one output pixel and select only the needed order statistic instead of sorting the whole per-pixel stack.
  • Per-worker scratch reuse. Each Rayon worker keeps reusable temporary buffers for one per-output stack: copied pixel values, boolean masks, sorted rank pairs, and statistic work arrays. As the worker moves from one output pixel to the next, those buffers are cleared and refilled instead of newly allocated.
  • Scratch reuse in diagnostic rejection. Full diagnostic sigclip, minmax, and pclip paths also reuse per-worker stack-axis buffer, mask, rank-pair, residual, and statistic buffers; only the final per-output mask/diagnostic payload is retained.
  • Raw contiguous slice indexing. After validation, the Rust kernels require C-contiguous NumPy arrays. They view the stack as one flat memory slice and compute offsets directly, avoiding repeated multidimensional ndarray indexing and shape/stride bookkeeping in inner loops.
  • Unrolled reductions with separate accumulators. Sum, mean, and variance scans process several stack samples per loop and keep separate running sums/counts before merging them at the end. This reduces loop overhead and short dependency chains in the CPU pipeline. For axis-0 image stacks, the Rust loop advances a flat slice index by H * W to read arr[:, y, x] without repeated multidimensional indexing.
  • Branch-light min/max scans. NaN-aware min/max reductions use the same numeric min/max semantics as the desired nanmin/nanmax behavior, so the inner loop does not need a separate branch just to ask whether the current output is still NaN. This is not “Rust is faster by default”; it is a specific way to avoid an unnecessary branch in a very small hot loop. Large contiguous 1-D min/max vectors also use Rayon chunk reductions.
  • CCD clip without a full gain copy. ccdclip, ccdclip_mask, and output-only ccdclip_combine pass gain into Rust and evaluate rejection on gain-corrected per-pixel scratch values instead of first materializing a full arr / gain stack. Output-only ccdclip still combines the original source values.
  • Mask-aware finite handling. Non-finite samples are folded into masks once, so later inner loops can avoid redundant finite checks where correctness permits.
  • Squared-spread sigma clipping. sigma is the user-supplied dimensionless multiplier, not the measured data spread. The ordinary test is abs(x - center) > sigma * std, where std is the standard deviation for stdfunc="std", the scaled MAD estimate for stdfunc="mad", or the CCD noise sigma for ccdclip. The Rust loop squares both sides and tests (x - center)^2 > sigma^2 * std^2; for stdfunc="std", std^2 is variance. That lets output-only and mask-only paths avoid computing square roots (sqrt) on every iteration; the square root is only needed when returning diagnostic std.
  • MAD sigma without a full sort. sigclip(stdfunc="mad") is primarily a robustness option, not a reason to choose MAD for speed. Its implementation still avoids a full sort: it computes 1.4826 * median(abs(x - clip_cen)) using the same selective-median scratch buffer strategy as other median-style kernels.
  • Rank-pair final combine. Output-only minmax and pclip mean/median paths compute the final value from the already-built retained rank pairs instead of copying a stack-axis source values and scanning a second mask.
  • Reusable pclip residual scores. The pclip implementation follows IRAF-style rank-offset clipping. During that calculation it builds a temporary vector of normalized distance scores used to decide which sorted values to keep when nkeep restoration is needed. That vector is reused from per-worker scratch memory instead of allocated fresh for every output pixel.

Why The Stack Axis Is First

imcombiners uses (N, *spatial) stacks, so a normal image cube is shaped (N, H, W) and the combine/reject axis is axis 0. This is deliberate.

At first glance, (H, W, N) looks attractive because the per-pixel stack would be contiguous in memory for C-order NumPy arrays. That can make the inner Rust kernel read pattern slightly better when the axis-last workspace has already been prepared. However, building that workspace is not free:

stack0 = np.stack(images, axis=0)   # shape (N, H, W)
stackn = np.stack(images, axis=-1)  # shape (H, W, N)

Both calls allocate and copy the full cube. The difference is where each loaded 2-D image lands. With axis=0, each image is copied into one contiguous slab of the output array. With axis=-1, each image is copied into a strided [..., i] plane of a C-order array. For common FITS-style workflows where images are loaded one at a time and combined once, that strided construction cost can be comparable to a whole rejection/combine kernel call.

From local test, I found the prepared axis-last Rust kernels were only slightly faster for representative sigclip, ccdclip, minmax, and pclip mean/median cases, while np.stack(..., axis=-1) could be much slower than np.stack(..., axis=0) for large image stacks. Therefore the public convention keeps N on axis 0: it matches the cheapest natural NumPy stacking pattern and avoids making one-off reductions pay a large layout-construction cost for a small gain. It also keeps per-frame masks, weights, zero/scale values, growth behavior, and full rejection diagnostics aligned with one documented stack axis.

  • NOTE: np.moveaxis(stack, 0, -1) is cheap as a view, but it does not make the data contiguous in the axis-last order. A Rust kernel requiring contiguous (H, W, N) storage would still need a full copy such as np.ascontiguousarray(np.moveaxis(stack, 0, -1)).

Dependency probes

Several Rust statistics crates (ndarray-stats, argminmax, scirs2-stats) were checked making kernels in this repo. A small benchmark on (N, 384, 384) float32stacks measured the median combine core and a simple min/max scan:

stack N imc’s old median imc’s new selective median ndarray-stats median imc’s scalar min/max scan using argminmax
31 39.1 ms 19.0 ms 98.2 ms 5.21 ms 8.82 ms
5 7.80 ms 3.26 ms 33.0 ms 0.11 ms 3.42 ms

These numbers are not a claim that the external crates are slow in general.

  • ndarray-stats uses a generic mutable axis-quantile path (plus version differences from imc);
  • argminmax is useful for contiguous vectors, but imc first has to gather each axis-0 pixel stack from strided (N, H, W) storage;
  • scirs2-stats has a broader SciPy-like dependency surface than imc needs.

2. API Choice

The shortest rule is:

  • Use Standard Combiner or ndcombine() described in tutorials when you only need the final image.
  • Use Chained Combiner or diagnostics="simple"/"full" when you need last_reject, mask_rej, low, upp, nit, or output_flags, plus sample_flags.
  • Use direct kernel calls (import imcombiners.kernels as imck) only for trusted hot loops where the input layout is already guaranteed.

Direct imck calls are not a second algorithm. For output-only rejection, they use the same Rust kernels as the fast public routes. Their advantage is smaller Python-boundary work: no object construction, no repeated validation, no default workspace copy, and no retained diagnostics unless the called kernel returns them. Also users can use *_1d kernels for their 1-D array calculator - they are designed to be very fast and useful. Once the input is already a contiguous floating workspace, imck.sigclip_combine(..., validate=False) and Combiner(..., copy=False, validate=False).combine(..., diagnostics=None) are near-tie comparisons around the same core kernel.

Workspace Requirements

The trusted Rust paths require:

  1. C-contiguous layoutarr.flags['C_CONTIGUOUS'] must be True.
  2. Float workspace dtypefloat32 for uint8, uint16, int16, and float32 inputs; float64 for int32 and float64 inputs.

This function shows one example:

import numpy as np

def prepare_workspace(stack):
    dtype = np.float64 if stack.dtype == np.dtype(np.int32) else np.float32
    return np.ascontiguousarray(stack, dtype=dtype)
  • int32float64
  • All other (in many astronomical contexts, the only other is uint16) → float32

3. Public API Baseline

The baseline for output-only timing is the default Standard Combiner: Combiner(...).combine(..., diagnostics=None). It keeps the public object API, returns only the final image, and can use the fused reject-and-combine Rust path for supported mean/median cases.

This benchmark compares that baseline with three nearby public routes:

  • SC-None: Standard Combiner, Combiner(...).combine(..., diagnostics=None), output-only reference row.
  • nd-None: compact ndcombine(..., diagnostics=None), output only.
  • CC: Chained Combiner, Combiner(...).reject(...).combine(...), retained rejection state.
  • nd-simple: compact ndcombine(..., diagnostics="simple")[0], final image plus rejection diagnostics.

Diagnostic rows ask for different work, so they are grouped separately and compared with Chained Combiner. Output-only rows are compared with Standard Combiner.

import numpy as np
import imcombiners as imc
import time
def time_call(func, warmups=5, repeat=100):
    result = None
    for _ in range(warmups):
        result = func()

    times = []
    for _ in range(repeat):
        t0 = time.perf_counter()
        result = func()
        times.append(time.perf_counter() - t0)
    return result, float(np.median(times))


def comparison_text(seconds, reference_seconds):
    ratio = reference_seconds / seconds
    if np.isclose(ratio, 1.0, rtol=0.05):
        return "Near-tie"
    if ratio > 1.0:
        return f"{ratio:.2f}x faster"
    return f"{1.0 / ratio:.2f}x slower"


def print_timing_table(rows, reference):
    print(f"{'name':<10} {'work':<12} {'ms':>9}  comparison")
    ref_seconds = rows[reference]["seconds"]
    for name, row in rows.items():
        seconds = row["seconds"]
        if name == reference:
            comparison = ""
        else:
            comparison = f"{comparison_text(seconds, ref_seconds)} vs {reference}"
        print(f"{name:<10} {row['work']:<12} {seconds * 1e3:9.1f}  {comparison}")
rng = np.random.default_rng(20250311)
stack = np.clip(
    rng.normal(1000.0, 30.0, size=(31, 512, 512)),
    0,
    np.iinfo(np.uint16).max,
).astype("uint16")

kw_sc = dict(combine="median", reject="sigclip", sigma=(3.0, 3.0), maxiters=5, ddof=0)
sigc = imc.SigClip(sigma=(3.0, 3.0), maxiters=5, ddof=0)

timings = {}

timings["SC-None"] = {}
timings["SC-None"]["result"], timings["SC-None"]["seconds"] = time_call(
    lambda: imc.Combiner(stack).combine("median", rejectors=sigc, diagnostics=None)
)
timings["SC-None"]["work"] = "output-only"

timings["nd-None"] = {}
timings["nd-None"]["result"], timings["nd-None"]["seconds"] = time_call(
    lambda: imc.ndcombine(stack, **kw_sc, diagnostics=None)
)
timings["nd-None"]["work"] = "output-only"

timings["CC"] = {}
timings["CC"]["result"], timings["CC"]["seconds"] = time_call(
    lambda: imc.Combiner(stack).reject(sigc).combine("median")
)
timings["CC"]["work"] = "diagnostic"

timings["nd-simple"] = {}
timings["nd-simple"]["result"], timings["nd-simple"]["seconds"] = time_call(
    lambda: imc.ndcombine(stack, **kw_sc, diagnostics="simple")[0]
)
timings["nd-simple"]["work"] = "diagnostic"

np.testing.assert_allclose(timings["SC-None"]["result"], timings["nd-None"]["result"], rtol=1e-5)
np.testing.assert_allclose(timings["SC-None"]["result"], timings["CC"]["result"], rtol=1e-5)
np.testing.assert_allclose(timings["SC-None"]["result"], timings["nd-simple"]["result"], rtol=1e-5)

out_diagnostic_one_call = timings["nd-simple"]["result"]
t_diagnostic_one_call = timings["nd-simple"]["seconds"]
out_output_only_one_call = timings["nd-None"]["result"]
t_output_only_one_call = timings["nd-None"]["seconds"]
out_output_only_combiner = timings["SC-None"]["result"]
t_output_only_combiner = timings["SC-None"]["seconds"]
out_diagnostic_chained = timings["CC"]["result"]
t_diagnostic_chained = timings["CC"]["seconds"]

print_timing_table(
    {k: timings[k] for k in ("SC-None", "nd-None")},
    reference="SC-None",
)
print()
print_timing_table(
    {k: timings[k] for k in ("CC", "nd-simple")},
    reference="CC",
)
name       work                ms  comparison
SC-None    output-only       99.5  
nd-None    output-only       84.1  1.18x faster vs SC-None

name       work                ms  comparison
CC         diagnostic       144.8  
nd-simple  diagnostic       147.4  Near-tie vs CC

nd-simple and CC both compute complicated diagnostic products. Their timings should be same-order and slow; small differences are very likely noise. The diagnostics=None will return only the final image and can use a fused reject-and-combine optimization as mentioned above, so they are faster. nd-None may be faster than SC-None because while Combiner() keeps its default copy=True and makes an owned workspace copy after validation. Setting copy=False maybe dangerous but can be beneficial when the user does not care about the input stack after the call (e.g., loading image files from disk and combine - no need to preserve loaded stack). Use SC-None Combiner().combine() approach when you want that fast path with an object-oriented API; use CC only when you want retained rejection state.

4. Prepared Data & validate=False

Prepare the workspace once, then skip repeated validation inside the hot loop. This section compares each validate=False call against the same public API shape with validation left on. That makes the ratio answer one clear question: how much did preparing the workspace once save?

  • CC-prep: CC-None with validate=False
  • SC-prep: SC-None with validate=False && copy=False
stack_ws = prepare_workspace(stack)

def output_only_combiner_no_validation(arr):
    return imc.Combiner(arr, copy=False, validate=False).combine(
        "median",
        rejectors=sigc,
        diagnostics=None,
    )


prepared = {}

prepared["CC"] = timings["CC"]
prepared["CC-prep"] = {}
prepared["CC-prep"]["result"], prepared["CC-prep"]["seconds"] = time_call(
    lambda: (
        imc.Combiner(stack_ws, validate=False)
        .reject(sigc)
        .combine("median")
    )
)
prepared["CC-prep"]["work"] = "diagnostic"

prepared["SC-None"] = timings["SC-None"]
prepared["SC-prep"] = {}
prepared["SC-prep"]["result"], prepared["SC-prep"]["seconds"] = time_call(
    lambda: output_only_combiner_no_validation(stack_ws)
)
prepared["SC-prep"]["work"] = "output-only"

np.testing.assert_allclose(prepared["CC"]["result"], prepared["CC-prep"]["result"], rtol=1e-5)
np.testing.assert_allclose(prepared["SC-None"]["result"], prepared["SC-prep"]["result"], rtol=1e-5)

out_chained_prepared = prepared["CC-prep"]["result"]
t_chained_prepared = prepared["CC-prep"]["seconds"]
out_output_only_prepared = prepared["SC-prep"]["result"]
t_output_only_prepared = prepared["SC-prep"]["seconds"]

print_timing_table(
    {k: prepared[k] for k in ("CC", "CC-prep")},
    reference="CC",
)
print()
print_timing_table(
    {k: prepared[k] for k in ("SC-None", "SC-prep")},
    reference="SC-None",
)
name       work                ms  comparison
CC         diagnostic       144.8  
CC-prep    diagnostic       167.7  1.16x slower vs CC

name       work                ms  comparison
SC-None    output-only       99.5  
SC-prep    output-only       82.5  1.21x faster vs SC-None

SC-prep keeps the output-only Standard Combiner work, skips validation, and uses copy=False because the workspace is already owned by the caller. Due to copy=False, SC-prep may be a bit faster than SC-None.

5. Stack Number (0-th Axis)

Say stack has shape (N, H, W) where N is the number of images. Computation time is roughly

Total time ≈ fixed overhead + per-pixel kernel time(∝ N) * total num of pixels (H * W).

Skipping validation removes the first term: shape/dtype checks, contiguity handling, and possible workspace conversion. The fixed overhead can be a larger fraction of the total runtime when N is small, and skipping validation can show a larger percentage improvement.

rng_small = np.random.default_rng(20250311)
stack_small = np.clip(
    rng_small.normal(1000.0, 30.0, size=(5, 512, 512)),
    0,
    np.iinfo(np.uint16).max,
).astype("uint16")
stack_small_ws = prepare_workspace(stack_small)

out_output_only_small, t_output_only_small = time_call(
    lambda: imc.Combiner(stack_small).combine(
        "median",
        rejectors=sigc,
        diagnostics=None,
    )
)

out_output_only_small_prepared, t_output_only_small_prepared = time_call(
    lambda: output_only_combiner_no_validation(stack_small_ws)
)

np.testing.assert_allclose(
    out_output_only_small,
    out_output_only_small_prepared,
    atol=1e-5,
)

depth_rows = {
    "N5-SC": {
        "work": "output-only",
        "seconds": t_output_only_small,
    },
    "N5-prep": {
        "work": "output-only",
        "seconds": t_output_only_small_prepared,
    },
    "N31-SC": {
        "work": "output-only",
        "seconds": t_output_only_combiner,
    },
    "N31-prep": {
        "work": "output-only",
        "seconds": t_output_only_prepared,
    },
}

print_timing_table(
    {k: depth_rows[k] for k in ("N5-SC", "N5-prep")},
    reference="N5-SC",
)
print()
print_timing_table(
    {k: depth_rows[k] for k in ("N31-SC", "N31-prep")},
    reference="N31-SC",
)
name       work                ms  comparison
N5-SC      output-only       14.5  
N5-prep    output-only       13.7  1.06x faster vs N5-SC

name       work                ms  comparison
N31-SC     output-only       99.5  
N31-prep   output-only       82.5  1.21x faster vs N31-SC

The n=5 and n=31 ratios should not be ranked as a universal rule.

Reserve validate=False for workloads where profiling shows a real bottleneck; on shallow stacks, milliseconds-scale differences may be within timing noise.

6. Thread Tuning

TipTL;DR

The default settings are good enough for most use cases.

There are three parameters that control imc parallelism:

  • RAYON_NUM_THREADS — how many threads Rayon uses inside the Rust kernels. Set it before starting Python, or right after importing imc and before any Rayon work; Rayon reads it once at thread-pool initialization and ignores later changes.
  • IMCOMBINERS_PARALLEL_THRESHOLD — the minimum output-element count required to use Rayon at all. Below this threshold imc runs a serial loop instead. The default is 10000, so images smaller than roughly 100x100 run serially.
  • IMCOMBINERS_1D_MINMAX_PARALLEL_THRESHOLD — the vector length where min_1d and max_1d switch from serial scan to Rayon chunk reduction. The default is 1000000. This is separate from IMCOMBINERS_PARALLEL_THRESHOLD because 1-D min/max has one output value, while image-stack combine/rejection parallelizes over many output pixels.

Both knobs can be set from the shell before starting Python:

export RAYON_NUM_THREADS=8
export IMCOMBINERS_PARALLEL_THRESHOLD=4096
export IMCOMBINERS_1D_MINMAX_PARALLEL_THRESHOLD=1000000
python your_pipeline.py

They can also be set from Python before heavy imc work:

import imcombiners as imc

imc.set_rayon_num_threads(8)
imc.set_parallel_threshold(4096)
imc.set_minmax_1d_parallel_threshold(1_000_000)
print(
    imc.get_rayon_num_threads(),
    imc.get_parallel_threshold(),
    imc.get_minmax_1d_parallel_threshold(),
)

For a stack shaped (N, H, W), the threshold is compared with H * W; for arbitrary N-D stacks, it is compared with np.prod(stack.shape[1:]). For min_1d and max_1d, the 1-D min/max threshold is compared with values.size.

Thread Benchmark Script

benchmark_threads.py measures both the serial path and a range of Rayon thread counts in fresh subprocesses, so every candidate gets a clean thread pool. By default it runs sigclip_median on both (5, 512, 512) and (31, 512, 512) stacks and reports separate recommendations:

uv run python benchmarks/benchmark_threads.py

The reported values are a starting point for the measured workload, not a package-wide optimum. To tune for your own shape or operation:

uv run python benchmarks/benchmark_threads.py \
  --op median --dtype float32 --n 31 --height 1024 --width 1024

Pass multiple --n values to cover several stack sizes in one run:

uv run python benchmarks/benchmark_threads.py --n 5 31 63
Note

Default thread-count candidates. On machines with <= 32 logical CPUs the script tests every parallel thread count: 2, 3, …, os.cpu_count(). A dedicated serial baseline run covers the non-Rayon case, so single-thread Rayon is only tested if you pass it explicitly with --threads.

On machines with > 32 logical CPUs, os.cpu_count() may not match your scheduler allocation. Pass --threads with explicit candidates, for example based on SLURM_CPUS_PER_TASK, or --force-test to get the sparse {2, 4, 8, ..., ncpu} list anyway.

Note

The benchmark forces IMCOMBINERS_PARALLEL_THRESHOLD=1 for every parallel candidate so Rayon is exercised regardless of image size. This avoids a small image silently running the serial path at every thread count. The forced value is a measurement tool, not a production recommendation.

More threads does not always mean faster. Repeated runs can recommend different near-best values because timing varies with operation, dtype, number of images, output shape, CPU allocation, memory bandwidth, and machine load. For example, on an MBP 14” [2024, macOS 26.4.1, M4Pro(8P+4E/G20c/N16c/48G)], RAYON_NUM_THREADS=6 was optimal for n=5 while RAYON_NUM_THREADS=12 was optimal for n=31, and both changed by about two threads after the machine load shifted. The tuner treats median times within --tie-tolerance of the fastest as near-ties and recommends the lower thread count. For a more stable estimate, increase --repeats and --warmups:

uv run python benchmarks/benchmark_threads.py --repeats 15 --warmups 3

If your pipeline also calls threaded NumPy/BLAS code, consider pinning OPENBLAS_NUM_THREADS, OMP_NUM_THREADS, MKL_NUM_THREADS, and VECLIB_MAXIMUM_THREADS to 1 while benchmarking to avoid measuring nested oversubscription.