Benchmark Tests

This guide covers the performance and scale testing infrastructure used to validate the RWA calculator at scales from 10K to 10M counterparties.

Overview

Benchmark tests validate that the RWA calculator meets performance requirements at various scales. The tests cover:

  • Hierarchy Resolution - Building counterparty and facility hierarchies
  • Pipeline Execution - End-to-end RWA calculation
  • Memory Usage - Peak memory consumption at scale
  • Component Performance - Individual calculator components

Test Structure

tests/benchmarks/
├── test_hierarchy_benchmark.py   # HierarchyResolver performance
└── test_pipeline_benchmark.py    # End-to-end pipeline performance

Running Benchmarks

Benchmark tests are marked with @pytest.mark.benchmark and are skipped by default (--benchmark-skip in pyproject.toml). Use --benchmark-only or override addopts to run them.
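
The skip-by-default behaviour comes from the pytest configuration. A sketch of the relevant pyproject.toml entries, assuming a common pytest-benchmark setup (illustrative; check the project's actual pyproject.toml for the authoritative values):

[tool.pytest.ini_options]
addopts = "--benchmark-skip"
markers = [
    "benchmark: performance/memory benchmark tests",
    "slow: long-running tests (1M+ scale)",
    "scale_10k: tests at 10K counterparties",
    "scale_100k: tests at 100K counterparties",
    "scale_1m: tests at 1M counterparties",
    "scale_10m: tests at 10M counterparties",
]

Passing -o "addopts=" on the command line clears the --benchmark-skip default so that --benchmark-only takes effect.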

All Benchmarks (10K + 100K)

# Run all benchmarks except 1M/10M (recommended)
uv run pytest tests/benchmarks/ -m "benchmark and not slow" -k "not 1m and not 1M" -o "addopts=" --benchmark-only -v

# With detailed timing on failure
uv run pytest tests/benchmarks/ -m "benchmark and not slow" -k "not 1m and not 1M" -o "addopts=" --benchmark-only -v --tb=short

By Scale

# Quick tests (10K counterparties)
uv run pytest tests/benchmarks/ -m scale_10k -o "addopts=" --benchmark-only -v

# Standard benchmarks (100K counterparties)
uv run pytest tests/benchmarks/ -m scale_100k -o "addopts=" --benchmark-only -v

# Large scale (1M counterparties) - requires significant memory
uv run pytest tests/benchmarks/ -m scale_1m -o "addopts=" --benchmark-only -v

# Enterprise scale (10M counterparties) - very slow
uv run pytest tests/benchmarks/ -m scale_10m -o "addopts=" --benchmark-only -v

Skip Slow Tests

# Skip 1M+ scale tests (default recommendation)
uv run pytest tests/benchmarks/ -m "benchmark and not slow" -o "addopts=" --benchmark-only -v

Profiling Scripts

In addition to pytest benchmarks, standalone profiling scripts provide stage-by-stage breakdowns:

# Full pipeline stage breakdown (hierarchy → classifier → CRM → calculators)
uv run python -m tests.benchmarks.profile_stage_breakdown

# Hierarchy and classifier sub-stage profiling
uv run python -m tests.benchmarks.profile_hierarchy_classifier
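
Both scripts follow the same timing pattern. A minimal sketch of the idea (illustrative only; the actual scripts may structure this differently):

import time

def timed_stage(label, fn, *args, **kwargs):
    """Run one pipeline stage and print its wall-clock duration."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{label:<30} {elapsed_ms:10.0f} ms")
    return result

# Hypothetical usage, mirroring the stage order in the profiler output:
# resolved = timed_stage("Hierarchy", resolver.resolve, raw_data)
# classified = timed_stage("Classifier", classifier.classify, resolved)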

Test Markers

| Marker | Description | Typical Duration |
| --- | --- | --- |
| @pytest.mark.scale_10k | 10K counterparty tests | < 5 seconds |
| @pytest.mark.scale_100k | 100K counterparty tests | < 30 seconds |
| @pytest.mark.scale_1m | 1M counterparty tests | < 5 minutes |
| @pytest.mark.scale_10m | 10M counterparty tests | < 30 minutes |
| @pytest.mark.slow | Long-running tests (1M+) | Minutes |
| @pytest.mark.benchmark | Memory/performance benchmarks | Varies |

Hierarchy Benchmarks

Tests for HierarchyResolver performance at scale.

Test Classes

TestHierarchyBenchmark10K

Quick validation tests at 10K scale:

| Test | Target | Description |
| --- | --- | --- |
| test_full_resolve_10k | < 1 sec | Full hierarchy resolution |
| test_counterparty_lookup_10k | - | Counterparty lookup building |
| test_exposure_unification_10k | - | Exposure unification |

TestHierarchyBenchmark100K

Standard benchmark at 100K scale:

| Test | Target | Description |
| --- | --- | --- |
| test_full_resolve_100k | < 5 sec | Full hierarchy resolution |
| test_counterparty_lookup_100k | < 2 sec | Counterparty lookup building |
| test_org_hierarchy_depth_100k | - | Verify hierarchy depth >= 2 |
| test_facility_hierarchy_depth_100k | - | Verify facility depth >= 2 |

TestHierarchyBenchmark1M

Large scale tests (marked @pytest.mark.slow):

| Test | Target | Description |
| --- | --- | --- |
| test_full_resolve_1m | < 60 sec | Full hierarchy resolution |

TestHierarchyBenchmark10M

Enterprise scale tests (marked @pytest.mark.slow):

| Test | Target | Description |
| --- | --- | --- |
| test_full_resolve_10m | < 10 min | Full hierarchy resolution |

TestHierarchyMemoryBenchmark

Memory consumption tests:

| Test | Target | Description |
| --- | --- | --- |
| test_memory_usage_10k | < 100 MB | Peak memory at 10K |
| test_memory_usage_100k | < 500 MB | Peak memory at 100K |

Pipeline Benchmarks

End-to-end RWA calculation pipeline performance.

Test Classes

TestPipelineBenchmark10K

Quick pipeline validation:

| Test | Target | Description |
| --- | --- | --- |
| test_full_pipeline_sa_10k | < 2 sec | SA-only calculation |
| test_full_pipeline_crr_10k | < 3 sec | SA + IRB calculation |

TestPipelineBenchmark100K

Standard pipeline benchmarks:

| Test | Target | Description |
| --- | --- | --- |
| test_full_pipeline_sa_100k | < 10 sec | SA-only calculation |
| test_full_pipeline_crr_100k | < 15 sec | SA + IRB calculation |
| test_pipeline_throughput_100k | - | Measures exposures/second |

TestPipelineBenchmark1M

Large scale pipeline tests:

| Test | Target | Description |
| --- | --- | --- |
| test_full_pipeline_sa_1m | < 120 sec | SA-only at 1M scale |

TestPipelineBenchmark10M

Enterprise scale pipeline tests:

| Test | Target | Description |
| --- | --- | --- |
| test_full_pipeline_sa_10m | < 20 min | SA-only at 10M scale |

Approach-Specific Benchmarks

Tests at 100K scale for different calculation approaches:

TestApproachBenchmarks100K

| Test | Description |
| --- | --- |
| test_sa_only_100k | All exposures use SA (no IRB) |
| test_full_irb_100k | All eligible exposures use IRB |
| test_irb_with_slotting_100k | IRB + Slotting approach |
| test_partial_irb_corporate_only_100k | Corporate-only IRB |
| test_basel_3_1_with_output_floor_100k | Basel 3.1 with output floor |

TestApproachBenchmarks1M

| Test | Description |
| --- | --- |
| test_sa_only_1m | SA-only at 1M scale |
| test_full_irb_1m | Full IRB at 1M scale |
| test_irb_with_slotting_1m | IRB + Slotting at 1M scale |

Component Benchmarks

Individual component performance at 100K scale:

TestComponentBenchmarks100K

| Test | Description |
| --- | --- |
| test_classifier_100k | Exposure classifier performance |
| test_sa_calculator_100k | SA calculator performance |

IRB Formula Benchmarks

The IRB formula implementation uses pure Polars expressions with polars-normal-stats for statistical functions, enabling full lazy evaluation.

Key benefits of the pure Polars implementation:

  • Full lazy evaluation: Query optimization preserved throughout
  • No data conversion: No NumPy/SciPy overhead
  • Sub-second for 1M rows: 1 million IRB exposures processed in ~300ms
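
To illustrate the approach, here is a minimal sketch of the Basel IRB capital requirement K written as a single Polars expression. The norm_cdf/norm_ppf helpers are assumptions standing in for whatever polars-normal-stats actually exposes, and the asset correlation R is taken as a precomputed column:

import polars as pl

# Assumed helper names; the actual polars-normal-stats API may differ
from polars_normal_stats import norm_cdf, norm_ppf

PPF_999 = 3.090232  # N^-1(0.999), a constant in the Basel formula

def irb_capital_expr() -> pl.Expr:
    """K = LGD * [N((N^-1(PD) + sqrt(R) * N^-1(0.999)) / sqrt(1 - R)) - PD]."""
    pd_, lgd, r = pl.col("pd"), pl.col("lgd"), pl.col("correlation")
    return lgd * (
        norm_cdf((norm_ppf(pd_) + r.sqrt() * PPF_999) / (1 - r).sqrt()) - pd_
    )

# Stays lazy end to end: no NumPy round-trip, and the optimizer
# sees the whole plan before anything is computed.
# capital = exposures_lf.with_columns(k=irb_capital_expr()).collect()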

Memory Benchmarks

TestPipelineMemoryBenchmark

| Test | Target | Description |
| --- | --- | --- |
| test_pipeline_memory_100k | < 2 GB | Peak memory during pipeline |

Performance Targets Summary

Hierarchy Resolution

| Scale | Target Time | Memory |
| --- | --- | --- |
| 10K | < 1 sec | < 100 MB |
| 100K | < 5 sec | < 500 MB |
| 1M | < 60 sec | - |
| 10M | < 10 min | - |

Pipeline Execution

| Scale | SA Only | SA + IRB |
| --- | --- | --- |
| 10K | < 2 sec | < 3 sec |
| 100K | < 10 sec | < 15 sec |
| 1M | < 120 sec | - |
| 10M | < 20 min | - |

Measured Results (100K, v0.1.28)

Results from pytest-benchmark on a typical development machine (100K counterparties, ~365K exposures):

| Test | Min (ms) | Mean (ms) |
| --- | --- | --- |
| Hierarchy resolve | 67 | 72 |
| Counterparty lookup | 45 | 51 |
| Exposure unification | 19 | 22 |
| Classifier | 730 | 757 |
| SA calculator | 271 | 310 |
| Full pipeline (SA only) | 1,611 | 1,710 |
| Full pipeline (CRR) | 1,848 | 1,931 |
| Full pipeline (IRB + slotting) | 2,092 | 2,210 |
| Basel 3.1 with output floor | 2,058 | 2,110 |

Pipeline Stage Breakdown (from profiler)

| Stage | Best (ms) | Mean (ms) |
| --- | --- | --- |
| Hierarchy | 383 | 400 |
| Classifier | 212 | 230 |
| CRM | 669 | 710 |
| Calculators (SA+IRB+Slotting) | 309 | 340 |
| Total (stages) | 1,634 | 1,774 |

Writing Benchmark Tests

Basic Structure

Tests use the pytest-benchmark fixture for accurate timing with multiple rounds:

from datetime import date

import pytest

# Project-specific imports (CalculationConfig, create_raw_data_bundle,
# my_component) are elided here; see the actual test modules.

@pytest.mark.benchmark
@pytest.mark.scale_100k
class TestMyBenchmark:
    """Benchmark tests for MyComponent."""

    def test_my_component_100k(self, benchmark, dataset_100k):
        """Benchmark MyComponent at 100K scale."""
        raw_data = create_raw_data_bundle(dataset_100k)
        config = CalculationConfig.crr(date(2026, 1, 1))

        def run():
            result = my_component.process(raw_data, config)
            _ = result.collect()  # Force lazy evaluation
            return result

        result = benchmark(run)
        assert result is not None
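
When setup is expensive and you need explicit control over timing, pytest-benchmark's pedantic mode lets you set rounds and iterations directly:

result = benchmark.pedantic(run, rounds=5, iterations=1)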

Using Dataset Generators

The benchmark tests use dataset generators with cached parquet files for fast loading. Datasets are cached in tests/benchmarks/data/ and regenerated only when --benchmark-regenerate is passed:

# Session-scoped fixture — generates once, cached to parquet
@pytest.fixture(scope="session")
def dataset_100k():
    """Load or generate 100K counterparty dataset."""
    return get_or_create_dataset(
        scale="100k",
        n_counterparties=100_000,
        hierarchy_depth=3,
        seed=42,
    )
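
Internally, get_or_create_dataset follows a scan-or-generate caching pattern along these lines (a sketch; generate_dataset is a hypothetical stand-in for the real generator):

from pathlib import Path

import polars as pl

DATA_DIR = Path("tests/benchmarks/data")

def get_or_create_dataset(scale, n_counterparties, hierarchy_depth, seed, regenerate=False):
    """Return the cached dataset, generating and caching it on first use."""
    cache = DATA_DIR / f"dataset_{scale}.parquet"
    if cache.exists() and not regenerate:
        return pl.read_parquet(cache)
    df = generate_dataset(n_counterparties, hierarchy_depth, seed)  # hypothetical
    DATA_DIR.mkdir(parents=True, exist_ok=True)
    df.write_parquet(cache)
    return df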

To regenerate cached datasets:

# Regenerate all cached datasets
uv run pytest tests/benchmarks/ -o "addopts=" --benchmark-only --benchmark-regenerate -v

# Regenerate specific scale only
uv run pytest tests/benchmarks/ -o "addopts=" --benchmark-only --benchmark-regenerate-scale=100k -v

Memory Testing

import pytest
import tracemalloc

@pytest.mark.benchmark
def test_memory_usage(dataset):
    """Test memory consumption."""
    tracemalloc.start()

    # Run the operation under measurement
    result = component.process(dataset)

    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    # Note: tracemalloc only sees allocations made through Python's
    # allocator; native buffers (e.g. Polars) may not be fully counted.
    peak_mb = peak / 1024 / 1024
    assert peak_mb < 500, f"Expected < 500 MB, got {peak_mb:.1f} MB"

CI/CD Integration

# Run quick benchmarks on every PR (10K + 100K, excludes 1M/10M)
benchmark-quick:
  script:
    - uv run pytest tests/benchmarks/ -m "benchmark and not slow" -k "not 1m" -o "addopts=" --benchmark-only -v

# Run full benchmarks nightly (includes 1M)
benchmark-full:
  schedule: "0 2 * * *"  # 2 AM daily
  script:
    - uv run pytest tests/benchmarks/ -m "benchmark and not slow" -o "addopts=" --benchmark-only -v --tb=short

Performance Regression Detection

Monitor benchmark results over time to detect regressions:

# Generate benchmark report
uv run pytest tests/benchmarks/ -m "benchmark and not slow" -k "not 1m" -o "addopts=" --benchmark-only --benchmark-json=benchmark.json

# Compare with baseline
uv run pytest-benchmark compare benchmark.json baseline.json
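
pytest-benchmark can also enforce regression thresholds directly, failing the run when a saved baseline is exceeded:

# Save each run and fail if the mean regresses by more than 10%
uv run pytest tests/benchmarks/ -m "benchmark and not slow" -k "not 1m" -o "addopts=" --benchmark-only --benchmark-autosave --benchmark-compare --benchmark-compare-fail=mean:10%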
