Test Catalog

Every test in the suite, what it verifies, and why its area exists. Generated by tools/gen_test_catalog.py — regenerate after adding tests.

Total test functions: 321 (run them with python -m pytest tests/ -q).

tests/active/

Active Mode is the opt-in interventional path (expressibility probing). These tests check that the probe records correctly and labels the trace as active, and that analyzers needing interventional data refuse passive traces.

test_active_mode.py

Tests for Active Mode probing and the kl_expressibility analyzer it feeds. Active Mode runs real circuits, so these tests use small ansätze and modest sample counts. The headline physics check is directional: a rigid ansatz must score a larger Haar KL divergence than an expressive one.

TestProbeCore

Test What it verifies
test_records_active_mode_trace Records active mode trace
test_trace_verifies Trace verifies
test_seed_reproducible Seed reproducible — Same seed → identical sampled parameters → identical statevectors
test_statevector_round_trips Statevector round trips

TestQiskitProbe

Test What it verifies
test_qiskit_probe_stores_circuit_qasm Qiskit probe stores circuit qasm
test_qiskit_circuit_structure_visible Qiskit circuit structure visible — decomposed → real gates visible, depth > 1

TestPennyLaneProbe

Test What it verifies
test_pennylane_probe Pennylane probe

TestExpressibility

Test What it verifies
test_rigid_more_than_expressive Directional physics: rigid ansatz has larger Haar KL than expressive.
test_expressive_ansatz_low_kl StronglyEntanglingLayers is known to be highly expressive.
test_num_qubits_inferred Num qubits inferred
test_passive_trace_guard Expressibility on a passive trace returns a guard, not a number.
test_insufficient_states Insufficient states
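
The directional check can be illustrated without any quantum framework. The sketch below assumes the common definition of expressibility as the KL divergence between a histogram of pairwise state fidelities and the Haar fidelity distribution, whose CDF is 1 - (1 - F)^(N - 1) for Hilbert-space dimension N. All function names are illustrative and are not taken from hilbertbench; the "rigid" ansatz is a single RZ rotation on |0>, which only changes a global phase.

```python
import cmath
import math
import random


def haar_bin_probs(dim, n_bins):
    """Probability mass of each fidelity bin under the Haar measure."""
    edges = [i / n_bins for i in range(n_bins + 1)]
    # Difference of the Haar CDF 1 - (1 - F)**(dim - 1) at the bin edges.
    return [(1 - a) ** (dim - 1) - (1 - b) ** (dim - 1)
            for a, b in zip(edges, edges[1:])]


def kl_to_haar(fidelities, dim, n_bins=50):
    """KL(empirical fidelity histogram || Haar histogram)."""
    counts = [0] * n_bins
    for f in fidelities:
        counts[min(int(f * n_bins), n_bins - 1)] += 1
    total = len(fidelities)
    kl = 0.0
    for c, q in zip(counts, haar_bin_probs(dim, n_bins)):
        if c:  # empty bins contribute nothing
            p = c / total
            kl += p * math.log(p / q)
    return kl


def fidelity(a, b):
    return abs(sum(x.conjugate() * y for x, y in zip(a, b))) ** 2


def rigid_state(rng):
    # RZ(theta)|0> differs from |0> only by a global phase.
    theta = rng.uniform(0, 2 * math.pi)
    return [cmath.exp(-0.5j * theta), 0j]


def haar_state(rng, dim=2):
    # Normalised complex Gaussian vector: a Haar-random pure state.
    amps = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]
    norm = math.sqrt(sum(abs(a) ** 2 for a in amps))
    return [a / norm for a in amps]


def sample_fidelities(sampler, n_pairs, seed):
    rng = random.Random(seed)  # same seed, same samples
    return [fidelity(sampler(rng), sampler(rng)) for _ in range(n_pairs)]


kl_rigid = kl_to_haar(sample_fidelities(rigid_state, 2000, seed=0), dim=2)
kl_expressive = kl_to_haar(sample_fidelities(haar_state, 2000, seed=0), dim=2)
print(kl_rigid > kl_expressive)
```

Every rigid fidelity lands in the last bin, so its KL is log(n_bins), while the Haar sampler stays close to zero; only the ordering is asserted, mirroring the directional test.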

tests/analysis/

Each analyzer is tested against constructed ground-truth traces: plant a known condition, assert the verdict and the quantitative evidence (variance, SNR, KL, fidelity) with their confidence intervals. Includes regression tests for hardware-format (ISA) circuits and active-qubit calibration scoping.

test_builtin_analyzers.py

Tests for the function-based analysis layer (hilbertbench.analysis). Traces are built deterministically with the tape; no quantum execution.

TestBarrenPlateau

Test What it verifies
test_trainable Trainable
test_barren_plateau Barren plateau
test_insufficient_data Insufficient data — span with no numeric outcome (counts dict)
test_custom_threshold Custom threshold
test_accepts_path_and_trace_object Accepts path and trace object
test_variance_matches_numpy Variance matches numpy
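
As a rough illustration of what these tests pin down, the sketch below classifies a list of recorded outcomes by their variance. It assumes population variance (the NumPy default) and reuses the 0.005 threshold that the confidence tests mention; the function name and the verdict strings are hypothetical, not the hilbertbench API.

```python
import statistics


def barren_plateau_verdict(outcomes, threshold=0.005):
    """Classify a trajectory of outcomes by its variance (illustrative only)."""
    # Non-numeric outcomes (e.g. a counts dict) carry no usable value here.
    numeric = [x for x in outcomes
               if isinstance(x, (int, float)) and not isinstance(x, bool)]
    if len(numeric) < 2:
        return {"verdict": "insufficient_data", "variance": None}
    variance = statistics.pvariance(numeric)  # population variance, ddof=0
    verdict = "barren_plateau" if variance < threshold else "trainable"
    return {"verdict": verdict, "variance": variance}


print(barren_plateau_verdict([0.5, -0.4, 0.3, -0.6]))
print(barren_plateau_verdict([0.0010, 0.0020, 0.0015]))
print(barren_plateau_verdict([{"00": 10}]))
```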

TestShotNoise

Test What it verifies
test_with_recorded_shots With recorded shots — high trajectory variance, low shots → signal clear
test_shot_noise_dominated Shot noise dominated — tiny trajectory variance, many shots → buried in noise
test_no_shots_recorded No shots recorded
test_default_shots_fallback Default shots fallback
test_precision_fallback Precision fallback — estimator runs record target precision, not shots; the floor
test_recorded_shots_win_over_precision Recorded shots win over precision
test_insufficient_data Insufficient data
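
The fallback order these tests describe can be sketched as a small signal-to-noise check. The version below is an assumption-laden illustration: it takes 1/shots as the worst-case variance of a ±1-valued expectation estimate, treats a recorded target precision as a standard error (so the floor is its square), and places the default-shots fallback last. None of the names or verdict strings come from hilbertbench.

```python
import statistics


def shot_noise_floor(shots=None, precision=None, default_shots=None):
    """Variance floor of one expectation-value estimate, or None if unknown."""
    if shots:
        return 1.0 / shots          # recorded shots take priority
    if precision:
        return precision ** 2       # assumed: precision is a standard error
    if default_shots:
        return 1.0 / default_shots  # assumed ordering of the last fallback
    return None


def shot_noise_verdict(outcomes, shots=None, precision=None, default_shots=None):
    if len(outcomes) < 2:
        return "insufficient_data"
    floor = shot_noise_floor(shots, precision, default_shots)
    if floor is None:
        return "no_shots_recorded"
    snr = statistics.pvariance(outcomes) / floor
    return "signal_clear" if snr > 1.0 else "shot_noise_dominated"


print(shot_noise_verdict([0.9, -0.8, 0.7, -0.9], shots=100))
print(shot_noise_verdict([0.5001, 0.5002, 0.5000, 0.5001], shots=10_000))
```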

TestSummary

Test What it verifies
test_combined_report Combined report
test_summary_accepts_path Summary accepts path

TestCustomAnalysis

Test What it verifies
test_user_can_write_own_analysis A user composes their own diagnostic on the same trace API.

test_confidence.py

Tests for the statistical-uncertainty measures added to the analyzers (proposal Section 2.6: "reported with statistical uncertainty and confidence measures, emphasizing transparency over definitive attribution").

TestBootstrapCI

Test What it verifies
test_ci_brackets_statistic Ci brackets statistic
test_degenerate_inputs_return_none Degenerate inputs return none
test_n_boot_zero_disables N boot zero disables
test_reproducible_with_seed Reproducible with seed
test_wider_ci_for_higher_level Wider ci for higher level
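
The behaviours listed above (seeded reproducibility, a disabled path when n_boot is zero, None for degenerate input, wider intervals at higher levels) are typical of a percentile bootstrap. A minimal sketch follows; the signature is hypothetical and treating constant input as degenerate is an assumption.

```python
import random
import statistics


def bootstrap_ci(values, statistic=statistics.pvariance,
                 level=0.95, n_boot=1000, seed=0):
    """Percentile bootstrap interval for `statistic`, or None if not computable."""
    if n_boot <= 0 or len(values) < 2 or len(set(values)) < 2:
        return None
    rng = random.Random(seed)  # fixed seed gives identical resamples
    n = len(values)
    stats = sorted(
        statistic([rng.choice(values) for _ in range(n)])
        for _ in range(n_boot)
    )
    alpha = (1.0 - level) / 2.0
    lo = stats[int(alpha * (n_boot - 1))]
    hi = stats[int(round((1.0 - alpha) * (n_boot - 1)))]
    return lo, hi


data = [i / 10 for i in range(20)]
print(bootstrap_ci(data, level=0.95))
print(bootstrap_ci(data, level=0.99))
```

Because both calls share a seed, they sort the same resampled statistics, so the 99% interval can only be as wide as or wider than the 95% one.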

TestBarrenPlateauConfidence

Test What it verifies
test_ci_brackets_variance Ci brackets variance
test_clear_trainable_high_confidence Clear trainable high confidence
test_clear_barren_high_confidence Clear barren high confidence
test_near_threshold_low_confidence Near threshold low confidence — variance engineered to sit right at the 0.005 threshold
test_n_boot_zero_skips_ci N boot zero skips ci

TestShotNoiseConfidence

Test What it verifies
test_empirical_variance_ci_present Empirical variance ci present
test_ci_present_even_without_shots Ci present even without shots

test_noise.py

Tests for the noise-profile analyzer (Diagnostic Axis: Noise). Calibration data only exists on real/fake hardware backends, so these use Qiskit's FakeManilaV2 (which ships realistic T1/T2/readout/gate-error data) and assert ideal simulators degrade gracefully to a no-calibration result.

TestDeviceSummary

Test What it verifies
test_reports_calibration_stats Reports calibration stats
test_gate_errors_present Gate errors present
test_estimated_fidelity_in_unit_interval Estimated fidelity in unit interval

TestIdealSimulator

Test What it verifies
test_no_calibration_status No calibration status

TestDepthInteraction

Test What it verifies
test_fidelity_decreases_with_depth Fidelity decreases with depth
test_dominant_error_shifts_to_two_qubit_gates Dominant error shifts to two qubit gates — shallow circuit: readout dominates the small infidelity

TestCalibrationScoping

Test What it verifies
test_stats_scoped_to_active_qubits Stats scoped to active qubits — the awful qubit 2 must not contaminate any statistic
test_fidelity_uses_scoped_errors Fidelity uses scoped errors — 1 sx + 1 cx on good qubits: fidelity must stay high; the
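
To make the scoping idea concrete, here is a sketch of a product-of-success-probabilities fidelity estimate restricted to the qubits a circuit actually touches. This is one common heuristic and an assumption here; the analyzer may use a different model. The calibration numbers are invented for illustration, with qubit 2 deliberately bad.

```python
def estimated_fidelity(gates, gate_errors, readout_errors):
    """Multiply (1 - error) over the gates used and the active qubits' readout.

    gates: list of (name, qubits) in program order.
    gate_errors: {(name, qubits): error}; readout_errors: {qubit: error}.
    Returns (fidelity, sorted list of active qubits).
    """
    active = sorted({q for _, qubits in gates for q in qubits})
    fidelity = 1.0
    for name, qubits in gates:
        fidelity *= 1.0 - gate_errors[(name, tuple(qubits))]
    for q in active:  # idle qubits never enter the estimate
        fidelity *= 1.0 - readout_errors[q]
    return fidelity, active


# Invented calibration data: qubit 2 is awful but unused.
gate_errors = {
    ("sx", (0,)): 0.001,
    ("sx", (1,)): 0.001,
    ("sx", (2,)): 0.3,
    ("cx", (0, 1)): 0.01,
}
readout_errors = {0: 0.02, 1: 0.02, 2: 0.5}

shallow = [("sx", (0,)), ("cx", (0, 1))]
deep = shallow * 10

f_shallow, active = estimated_fidelity(shallow, gate_errors, readout_errors)
f_deep, _ = estimated_fidelity(deep, gate_errors, readout_errors)
print(active, f_shallow > f_deep)
```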

test_optimization_and_circuit.py

Tests for the optimization-loop (Axis 4) and circuit-structure analyzers.

TestOptimizationConvergence

Test What it verifies
test_converging_trajectory Converging trajectory
test_constant_step_still_improving Constant step still improving
test_converging_path_length_positive Converging path length positive
test_insufficient_data Insufficient data
test_outcome_envelope_reported Outcome envelope reported — cost should drop from start to finish
test_accepts_path_and_trace Accepts path and trace

TestCircuitStructure

Test What it verifies
test_bell_circuit Bell circuit
test_bell_depth Bell depth — h on q0 (layer 1) then cx q0,q1 (layer 2) → depth 2
test_parametric_circuit Parametric circuit
test_no_qasm_circuit No qasm circuit — Trace with only inline outcomes, no circuit_qasm artifact
test_entangling_fraction_bounds Entangling fraction bounds
test_isa_circuit_physical_qubits Isa circuit physical qubits — Regression: hardware ISA circuits previously parsed as empty
test_isa_circuit_depth Isa circuit depth — sx, rz, sx, rz stack on $0 (layers 1-4), cx joins $0/$1 → depth 5
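
The depth figures quoted for the Bell and ISA cases follow from a simple layering rule: a gate sits one layer above the deepest gate already placed on any of its qubits. The sketch below works on a plain list of (name, qubits) tuples rather than parsed QASM, and the function name is illustrative.

```python
def circuit_depth(gates):
    """Depth of a circuit given as [(gate_name, (qubit, ...)), ...]."""
    frontier = {}  # qubit label -> layer of the last gate placed on it
    depth = 0
    for _name, qubits in gates:
        layer = 1 + max((frontier.get(q, 0) for q in qubits), default=0)
        for q in qubits:
            frontier[q] = layer
        depth = max(depth, layer)
    return depth


bell = [("h", ("q0",)), ("cx", ("q0", "q1"))]
# Hardware-format circuit addressing physical qubits $0 and $1.
isa = [("sx", ("$0",)), ("rz", ("$0",)), ("sx", ("$0",)),
       ("rz", ("$0",)), ("cx", ("$0", "$1"))]

print(circuit_depth(bell), circuit_depth(isa))
```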

tests/compliance/

Architecture-level checks that the documented invariants (INV-001 and friends) hold end to end — the contract the paper and the docs promise.

test_backward_compatibility.py

These fixtures are FROZEN. They represent real v1.0 traces that must remain parseable for the lifetime of the project. Never update these dicts — if a model change breaks them, that is a BREAKING CHANGE and requires a schema version bump.

TestGoldenRecords — These must NEVER fail. Failure = breaking change.

Test What it verifies
test_golden_trace_always_parses Golden trace always parses
test_golden_span_always_parses Golden span always parses
test_golden_artifact_always_parses Golden artifact always parses

test_execution_parity.py

Validates the proposal's central "1:1 Execution Parity" claim (§2.1): wrapping a primitive with HilbertBench must NOT change what reaches the backend — same number of executions, same circuits, same shots — and must NOT change the results. This is what makes the recorder a non-confounding observer.

TestEstimatorParity

Test What it verifies
test_backend_called_once_per_user_call Backend called once per user call — No silent extra executions: the backend saw exactly the same calls.
test_pubs_unchanged Pubs unchanged — The parameter bindings submitted to the backend are bit-identical.
test_results_identical Results identical

TestSamplerParity

Test What it verifies
test_backend_called_once_per_user_call Backend called once per user call
test_shots_unchanged Shots unchanged — No silent shot inflation: every backend call kept the requested 256.

TestPennyLaneParity

Test What it verifies
test_device_executed_same_number_of_times Device executed same number of times

TestOverhead

Test What it verifies
test_estimator_overhead_under_budget Per-call recording overhead must stay small (proposal target: <5ms). We assert a generous CI-safe ceiling; the demo reports the real number (typically well under 1ms on a workstation).
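
The parity claim amounts to comparing what a backend sees with and without the recorder in the path. The sketch below uses a hypothetical counting backend and a hypothetical recording proxy (neither is a hilbertbench or Qiskit class) to show the shape of such a check.

```python
class CountingBackend:
    """Stand-in backend that logs every call it receives."""

    def __init__(self):
        self.calls = []

    def run(self, circuit, shots):
        self.calls.append((circuit, shots))
        return {"0": shots}  # deterministic fake result


class RecordingProxy:
    """Forwards calls unchanged and records a span per call."""

    def __init__(self, backend):
        self._backend = backend
        self.spans = []

    def run(self, circuit, shots):
        result = self._backend.run(circuit, shots)  # exactly one forward
        self.spans.append({"circuit": circuit, "shots": shots})
        return result  # returned untouched


direct = CountingBackend()
wrapped = CountingBackend()
proxy = RecordingProxy(wrapped)

direct_results = [direct.run("bell", shots=256) for _ in range(3)]
proxied_results = [proxy.run("bell", shots=256) for _ in range(3)]

print(direct.calls == wrapped.calls, direct_results == proxied_results)
```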

test_invariants.py

TestINV003 — INV-003: models are auto-generated, never manually edited.

Test What it verifies
test_all_generated_files_have_header All generated files have header

TestINV004 — INV-004: models must only import stdlib + pydantic.

Test What it verifies
test_no_forbidden_imports No forbidden imports

test_schema_roundtrip.py

Validates that all Pydantic v2 models correctly enforce schema constraints. Tests are organized by model. Negative tests confirm bad data is rejected.

TestTraceManifest

Test What it verifies
test_valid_construction Valid construction
test_roundtrip Roundtrip
test_mode_enum Mode enum
test_status_enum Status enum
test_all_status_values_valid All status values valid
test_all_mode_values_valid All mode values valid
test_null_timestamp_end_allowed Null timestamp end allowed
test_null_integrity_seal_allowed Null integrity seal allowed
test_optional_fields_absent Optional fields absent — timestamp_end, integrity_seal, tags are all optional
test_rejects_wrong_version Rejects wrong version
test_rejects_extra_fields Rejects extra fields
test_rejects_invalid_mode Rejects invalid mode
test_rejects_invalid_status Rejects invalid status
test_requires_client_environment Requires client environment
test_client_environment_requires_version Client environment requires version
test_tags_arbitrary_strings Tags arbitrary strings

TestSpan

Test What it verifies
test_valid_construction Valid construction
test_roundtrip Roundtrip
test_events_preserved Events preserved
test_trace_id_matches_parent Trace id matches parent
test_sequence_number_zero_allowed Sequence number zero allowed
test_sequence_number_large_value Sequence number large value
test_all_status_values_valid All status values valid
test_null_outcome_ref_allowed Null outcome ref allowed
test_null_parent_span_id_allowed Null parent span id allowed — Null parent_span_id = root span
test_event_type_open_pattern Event type open pattern — event_type is open pattern ^[A-Z_]+$ — custom types must be allowed
test_event_attributes_allow_arbitrary_scalars Event attributes allow arbitrary scalars
test_event_null_attributes_allowed Event null attributes allowed
test_rejects_negative_sequence_number Rejects negative sequence number
test_rejects_empty_events Rejects empty events — minItems: 1 — a span with no events is invalid
test_rejects_lowercase_event_type Rejects lowercase event type — event_type pattern is ^[A-Z_]+$ — lowercase must be rejected
test_rejects_extra_fields Rejects extra fields

TestArtifact

Test What it verifies
test_valid_construction Valid construction
test_roundtrip Roundtrip
test_all_kind_values_valid All kind values valid
test_all_encoding_values_valid All encoding values valid
test_all_compression_values_valid All compression values valid
test_compression_null_allowed Compression null allowed
test_size_bytes_zero_allowed Size bytes zero allowed
test_producer_null_allowed Producer null allowed
test_hash_pattern_enforced Hash pattern enforced
test_hash_wrong_length_rejected Hash wrong length rejected
test_rejects_negative_size Rejects negative size
test_rejects_ref_count_zero Rejects ref count zero — minimum: 1 — an artifact with zero references is orphaned
test_rejects_invalid_kind Rejects invalid kind

TestCatalog

Test What it verifies
test_valid_construction Valid construction
test_roundtrip Roundtrip
test_multiple_artifacts Multiple artifacts
test_empty_artifacts_allowed Empty artifacts allowed
test_artifact_values_are_validated Artifact values are validated — Even though the key is not pattern-validated by Pydantic (see below),
test_artifact_key_format_not_enforced_by_pydantic catalog.json uses additionalProperties (not patternProperties) so Pydantic accepts any string key at runtime. Key format and key==artifact_hash integrity is the responsibility of reader/verify.py, not the Pydantic model. This is a deliberate design tradeoff — see design_decisions/0003. This test documents and pins the behaviour. If it starts raising, the schema was changed back to patternProperties.
test_rejects_wrong_version Rejects wrong version
test_rejects_extra_fields Rejects extra fields

tests/e2e/

Full journeys: record a realistic workload, seal, reopen, analyze — the integration surface a real user touches.

test_full_algorithms.py

Tier 3 end-to-end regression tests. Each test runs a complete algorithm for a small number of steps and verifies:
- The trace is sealed and complete
- The expected number of spans was recorded
- Inline artifacts contain the correct data kinds

TestVQERegression

Test What it verifies
test_vqe_trace_complete Runs 5 steps of gradient-based VQE on the simple H = Z⊗Z Hamiltonian. Verifies: spans created, outcome + parameters + observables captured.

TestQAOARegression

Test What it verifies
test_qaoa_bitstrings_recorded Runs a 2-qubit QAOA-like circuit for 3 angles and records bitstring outcomes. Verifies that counts are captured inline with proper structure.
test_qaoa_multiple_angle_sets Multiple parameter sets in one PUB → one span total.

TestQNNRegression

Test What it verifies
test_qnn_training_trace_complete Trains a 2-qubit QNN for 5 steps on a 4-point dataset. Verifies: spans recorded, outcomes + params captured, trace sealed.

TestCrossFramework

Test What it verifies
test_both_frameworks_produce_valid_traces Evaluate Z⊗Z expectation value using both frameworks. Both traces should be valid, sealed, and contain outcome data.

test_phenomenology.py

Phenomenological validation (proposal Section 2.6): plant a known QML phenomenon in synthetic ground-truth circuits and confirm the detector attributes it correctly from trace evidence alone.

TestBarrenPlateauValidation

Test What it verifies
test_wide_deep_circuit_flagged_barren Wide deep circuit flagged barren
test_shallow_control_is_trainable Shallow control is trainable
test_variance_collapses_with_width The planted property: variance must shrink as width grows.
test_trace_is_active_mode Landscape probing is a controlled, opt-in active diagnostic.

test_qiskit_aer.py

End-to-End verification using a REAL Qiskit Aer simulator. Proves the transparent proxy works with actual quantum execution without breaking standard QML workflows.

Test What it verifies
test_real_qml_parameterized_circuit Runs a real parameterized circuit (typical of QML/VQE) through the AerSimulator, verifying the proxy handles real Qiskit objects.

tests/integrations/

The proxies (Qiskit Estimator/Sampler, backend.run, PennyLane) must be perfectly transparent to the wrapped framework (1:1 execution parity, INV-001) while recording faithfully. This area also covers calibration-snapshot capture across all three backend-access conventions found in the wild, drift refresh, and shot/precision evidence.

test_pennylane.py

Verifies the dynamic proxy integration for PennyLane. Tests that strict ML type-checks are preserved and synchronous executions are correctly logged as single, unified spans.

TestPennyLaneProxyTransparency

Test What it verifies
test_dynamic_inheritance Crucial for PennyLane QNodes: The proxy must pass isinstance().
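
One way to keep isinstance() checks passing is to build the proxy class at runtime as a subclass of whatever device type is being wrapped. The sketch below shows that pattern on a toy device; it is an assumption about the general technique, not a description of how HilbertPennyLaneDeviceProxy is implemented, and real devices may need more careful state handling.

```python
class ToyDevice:
    """Stand-in for a framework device class."""

    def __init__(self, wires):
        self.wires = wires

    def execute(self, tape):
        return f"ran {tape} on {self.wires} wires"


def wrap(device, log):
    """Return a recording object that is still an instance of type(device)."""
    base = type(device)

    def execute(self, *args, **kwargs):
        log.append(("execute", args))          # record, then delegate
        return base.execute(self, *args, **kwargs)

    proxy_cls = type(f"Recorded{base.__name__}", (base,), {"execute": execute})
    proxy = proxy_cls.__new__(proxy_cls)       # skip __init__ on purpose
    proxy.__dict__.update(device.__dict__)     # share the device's state
    return proxy


log = []
device = ToyDevice(wires=2)
proxy = wrap(device, log)
result = proxy.execute("tape-0")
print(isinstance(proxy, ToyDevice), result, len(log))
```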

TestPennyLaneExecutionLifecycle

Test What it verifies
test_synchronous_execution_span PennyLane evaluates synchronously. We should get exactly ONE span.

TestPennyLaneExceptionVisibility

Test What it verifies
test_synchronous_exception_handling Verifies INV-007 for synchronous failures.

test_pennylane_measurements.py

Tier 2 integration tests for HilbertPennyLaneDeviceProxy covering all common PennyLane measurement types: expval, probs, counts, sample, state. Also verifies exception handling and backend_id propagation.

TestMeasurementTypes

Test What it verifies
test_expval_inline Expval inline
test_probs_inline Probs inline
test_counts_inline Counts inline — counts is a bitstring → int dict; "00" should dominate
test_sample_inline Sample inline
test_state_inline_as_complex_pairs State inline as complex pairs — Stored as [[real, imag], ...] — first amplitude should be [1.0, 0.0]

TestSpanStructure

Test What it verifies
test_four_events_per_span EXECUTION_REQUEST + DEVICE_EXECUTE_STARTED + EXECUTION_COMPLETED + EXECUTION_RESULT
test_device_started_event_has_num_tapes Device started event has num tapes
test_parameters_captured_per_span Parameters captured per span
test_observables_captured Observables captured
test_payload_ref_resolves_to_circuit_qasm The circuit is now a templated QASM in the file store (deduplicated), so payload_ref must resolve from the catalog, not inline.
test_circuit_qasm_deduplicates_across_steps Many evaluations of the same circuit structure produce one QASM file.
test_backend_id_set Backend id set
test_span_status_completed Span status completed

TestPennyLaneExceptions

Test What it verifies
test_device_exception_creates_failed_span When the device raises, a FAILED span with ERROR event is recorded.
test_exception_propagates_to_caller Exception propagates to caller

TestNoFilePollution

Test What it verifies
test_all_measurements_stay_inline expval, probs, counts, sample — none should write .npy files.

test_pennylane_qasm_reproducibility.py

Proves that the templated OpenQASM stored for PennyLane traces is useful: template + recorded parameters reconstructs a valid circuit whose outcome matches what was recorded — verified by re-executing through Qiskit (a different framework), proving the QASM is portable and complete.

TestTemplateHelper

Test What it verifies
test_placeholders_replace_numeric_literals Placeholders replace numeric literals
test_wire_indices_untouched Wire indices untouched — wire indices live in [...] not (...) — must not become placeholders
test_multi_param_gate Multi param gate
test_template_stable_across_values Template stable across values

TestQASMRoundTrip

Test What it verifies
test_template_plus_params_reproduces_outcome The core guarantee: bind(template, params) reproduces the recorded expval.
test_single_qubit_rotation_round_trip Minimal case: one RY rotation, check exact reproduction.
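
The templating rule the helper tests describe (numeric literals inside parentheses become placeholders, wire indices inside square brackets are left alone) can be sketched with two regular expressions. The placeholder syntax `{pN}` and the function names are assumptions made for this example.

```python
import re

_NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")
_ARGS = re.compile(r"\(([^()]*)\)")          # gate argument lists only
_PLACEHOLDER = re.compile(r"\{p(\d+)\}")


def to_template(qasm):
    """Replace numeric gate arguments with placeholders; return (template, params)."""
    params = []

    def replace_args(match):
        out = []
        for arg in match.group(1).split(","):
            arg = arg.strip()
            if _NUMBER.fullmatch(arg):
                out.append("{p%d}" % len(params))
                params.append(float(arg))
            else:
                out.append(arg)  # symbolic arguments are kept verbatim
        return "(" + ",".join(out) + ")"

    return _ARGS.sub(replace_args, qasm), params


def bind(template, params):
    """Substitute recorded parameter values back into a template."""
    return _PLACEHOLDER.sub(lambda m: repr(params[int(m.group(1))]), template)


qasm_a = "ry(0.5) q[0];\nu3(0.1,0.2,0.3) q[1];\ncx q[0],q[1];"
qasm_b = "ry(1.25) q[0];\nu3(0.4,0.5,0.6) q[1];\ncx q[0],q[1];"
template_a, params_a = to_template(qasm_a)
template_b, params_b = to_template(qasm_b)
print(template_a)
print(template_a == template_b)
```

Because the template is identical for both parameter sets, a content-addressed store would keep a single QASM file for the two evaluations.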

test_qiskit.py

Verifies the transparent proxy integration for Qiskit. Tests that circuits are serialized, spans are split (async mirroring), and all underlying framework exceptions are properly propagated (INV-007).

TestProxyTransparency

Test What it verifies
test_backend_proxy_passthrough Backend proxy passthrough — The proxy must perfectly imitate the underlying backend properties
test_job_proxy_passthrough Job proxy passthrough

TestAsyncLifecycle

Test What it verifies
test_successful_run_and_result Successful run and result — 1. Trigger the run (SUBMIT SPAN)

TestExceptionVisibility

Test What it verifies
test_run_exception_visibility Run exception visibility — Simulate a crash during circuit translation/submission
test_result_exception_visibility Result exception visibility — Simulate a timeout while waiting for an IBM cloud job

test_qiskit_calibration.py

Tests for calibration-snapshot capture. Calibration data (T1, T2, readout error, gate errors) only exists on real/fake hardware backends, never on ideal simulators — so these tests use Qiskit's FakeManilaV2 which ships realistic calibration data, and assert that ideal simulators produce no snapshot.

TestSerializeCalibration

Test What it verifies
test_extracts_t1_t2_readout Extracts t1 t2 readout
test_none_backend_returns_none None backend returns none
test_backend_without_properties_returns_none Backend without properties returns none
test_backend_raising_properties_returns_none Backend raising properties returns none

TestEstimatorCalibrationCapture

Test What it verifies
test_snapshot_captured Snapshot captured
test_snapshot_captured_once_across_runs Snapshot captured once across runs — Content-addressed: identical calibration → one artifact regardless of run count
test_ideal_simulator_produces_no_snapshot Ideal simulator produces no snapshot

TestSamplerCalibrationCapture

Test What it verifies
test_snapshot_captured Snapshot captured
test_ideal_sampler_produces_no_snapshot Ideal sampler produces no snapshot

TestResolveBackend — The three conventions in the wild: qiskit BackendEstimatorV2 exposes .backend as a property, qiskit-ibm-runtime primitives expose backend() as a bound method, and qiskit-aer primitives only hold ._backend.

Test What it verifies
test_property_style Property style
test_method_style_runtime_convention Method style runtime convention
test_private_attr_aer_convention Private attr aer convention
test_backend_passed_directly Backend passed directly
test_statevector_primitive_resolves_to_none Statevector primitive resolves to none
test_none_resolves_to_none None resolves to none
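
The three access conventions can be handled by one small resolver. The sketch below uses toy classes in place of the real primitives, and it assumes an object counts as a backend when it exposes a `target` attribute; that detection rule and the function name are illustrative guesses, not the project's code.

```python
class FakeBackend:
    target = "fake-target"  # assumed marker of a backend object


class PropertyStyle:
    @property
    def backend(self):
        return FakeBackend()


class MethodStyle:
    def backend(self):
        return FakeBackend()


class PrivateAttrStyle:
    def __init__(self):
        self._backend = FakeBackend()


class StatevectorStyle:
    """Ideal primitive with no backend at all."""


def resolve_backend(obj):
    """Best-effort lookup of the backend behind a primitive, else None."""
    if obj is None:
        return None
    if hasattr(obj, "target"):          # a backend was passed directly
        return obj
    candidate = getattr(obj, "backend", None)
    if callable(candidate):             # bound-method convention
        candidate = candidate()
    if candidate is None:               # private-attribute convention
        candidate = getattr(obj, "_backend", None)
    return candidate


for primitive in (PropertyStyle(), MethodStyle(), PrivateAttrStyle(),
                  FakeBackend(), StatevectorStyle(), None):
    print(type(primitive).__name__, type(resolve_backend(primitive)).__name__)
```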

TestCalibrationRefresh

Test What it verifies
test_drift_yields_snapshot_history Drift yields snapshot history
test_stable_calibration_attaches_once Stable calibration attaches once
test_rate_limit_skips_query_inside_window Rate limit skips query inside window

TestCalibrationHistory

Test What it verifies
test_single_snapshot_history Single snapshot history
test_drift_history_is_chronological Drift history is chronological — calibration() returns the newest snapshot
test_ideal_trace_has_empty_history Ideal trace has empty history

test_qiskit_sampler.py

Tier 2 integration tests for HilbertSamplerProxy. Uses real Qiskit circuits but keeps them minimal (1–2 qubits, few shots).

TestSamplerBasic

Test What it verifies
test_one_span_per_pub Each PUB produces exactly one span.
test_outcome_inline_with_counts Bitstring counts are stored inline as JSON, not as files.
test_no_outcome_files_on_disk All data is inline — artifacts/ holds only QASM, not outcomes.
test_circuit_deduplication Same circuit template across many shots produces only one QASM file.
test_span_status_completed Span status completed

TestSamplerParametric

Test What it verifies
test_parameter_bindings_captured Parameter bindings captured — Should contain the parameter array flattened
test_different_params_different_outcomes Different params different outcomes — theta=0: should be almost all '0'

TestSamplerTransparency

Test What it verifies
test_job_result_unchanged The job returned by the proxy produces the same result as unproxied.
test_shots_in_execution_completed_event EXECUTION_COMPLETED event carries the actual shot count.
test_tape_closed_skips_recording After tape closes, proxy forwards calls but records nothing.
test_deepcopy_preserves_tape Deepcopy preserves tape

tests/reader/

Verification: trace.verify() must pass on honest traces and fail loudly on any tampering — the property the blinded validation protocol depends on.

test_verify.py

Proves the cryptographic and causal verification engine. Guarantees that tampered data, missing files, or out-of-order execution spans are strictly rejected.

Test What it verifies
test_verify_valid_trace_passes A perfectly clean trace should pass with True.
test_verify_detects_tampered_artifact Simulates a malicious user altering their result file after the run to make their quantum benchmark look better.
test_verify_detects_missing_events_file If events.jsonl is deleted, the trace is invalid.
test_integrity_seal_present_and_valid A sealed trace carries an integrity_seal that matches events.jsonl.
test_verify_detects_event_stream_tampering Modifying events.jsonl in a way that still passes causal/reference checks (e.g. flipping a backend_id that no check inspects) must still be caught by the integrity seal's byte-level checksum.
test_verify_detects_causal_sequence_violation Simulates a logging error where sequence numbers are duplicated, or a user copy-pasting spans to fake execution data.
test_verify_detects_dangling_artifact_references Simulates a span pointing to an artifact hash that doesn't exist in the catalog.
test_verify_detects_child_before_parent_violation A child span cannot legally finish and flush to the logs BEFORE its parent span has been created. Causal arrows flow one way.
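
Two of the checks above are easy to illustrate in isolation: a byte-level checksum over the event stream and a uniqueness check on sequence numbers. The sketch below assumes the seal is a plain SHA-256 of the events.jsonl bytes and uses hypothetical function names; the real verifier also covers artifacts, references, and parent/child ordering.

```python
import hashlib
import json


def seal(events_bytes):
    """Checksum of the raw event stream (assumed to be SHA-256)."""
    return hashlib.sha256(events_bytes).hexdigest()


def verify(events_bytes, expected_seal):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if seal(events_bytes) != expected_seal:
        problems.append("seal mismatch")
    spans = [json.loads(line)
             for line in events_bytes.decode().splitlines() if line.strip()]
    seqs = [span["sequence_number"] for span in spans]
    if len(set(seqs)) != len(seqs):
        problems.append("duplicate sequence numbers")
    return problems


lines = [json.dumps({"sequence_number": i, "backend_id": "aer"})
         for i in range(3)]
events = ("\n".join(lines) + "\n").encode()
expected = seal(events)

# Flip a field that no structural check inspects: only the seal notices.
tampered = events.replace(b"aer", b"ibm")
# Copy-paste a span and re-seal: only the sequence check notices.
duplicated = (lines[0] + "\n" + lines[0] + "\n").encode()

print(verify(events, expected))
print(verify(tampered, expected))
print(verify(duplicated, seal(duplicated)))
```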

tests/recorder/

The recorder is the write path: HilbertTape, spans, events, and the content-addressed artifact store. These tests protect the append-only discipline (INV-002), atomic sealing, and the guarantee that every initiated span terminates explicitly (INV-007).

test_inline_artifacts.py

Tier 1 unit tests for the two-tier storage system:
- SpanHandle.attach_inline() correctness
- Hash integrity (key == sha256 of data)
- Routing enforcement (structural kinds rejected inline)
- Inline data appears in JSONL, not in the file store
- Parquet writer preserves inline_artifacts column

TestAttachInlineBasics

Test What it verifies
test_returns_sha256_hash Returns sha256 hash
test_hash_matches_data Hash matches data
test_size_bytes_matches_data Size bytes matches data
test_all_fields_present All fields present
test_same_data_same_hash_idempotent Same data same hash idempotent — dict is keyed by hash — still just one entry
test_raises_after_tape_closed Raises after tape closed

TestInlineKindEnforcement

Test What it verifies
test_circuit_qasm_rejected Circuit qasm rejected
test_calibration_snapshot_rejected Calibration snapshot rejected
test_execution_outcome_allowed Execution outcome allowed
test_parameters_allowed Parameters allowed
test_observables_allowed Observables allowed
test_generic_blob_allowed Generic blob allowed
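
The inline tier described by these tests can be pictured as a dictionary keyed by content hash, with structural kinds refused. The sketch below is a simplified stand-in for SpanHandle.attach_inline(): the class name, the canonical JSON serialization, and the error type are all assumptions; only the kind names come from the test list above.

```python
import hashlib
import json

# Kinds that the tests show being rejected inline.
STRUCTURAL_KINDS = {"circuit_qasm", "calibration_snapshot"}


class SpanSketch:
    """Toy span holding inline artifacts keyed by SHA-256 of their payload."""

    def __init__(self):
        self.inline_artifacts = {}

    def attach_inline(self, kind, data):
        if kind in STRUCTURAL_KINDS:
            raise ValueError(f"{kind} must go to the file store, not inline")
        # Assumed canonical form: sorted keys, compact separators.
        payload = json.dumps(data, sort_keys=True,
                             separators=(",", ":")).encode()
        digest = hashlib.sha256(payload).hexdigest()
        self.inline_artifacts[digest] = {
            "kind": kind,
            "size_bytes": len(payload),
            "data": data,
        }
        return digest


span = SpanSketch()
first = span.attach_inline("execution_outcome", {"00": 130, "11": 126})
second = span.attach_inline("execution_outcome", {"11": 126, "00": 130})
print(first == second, len(span.inline_artifacts))
```

Attaching the same data twice yields the same key, so the dictionary still holds one entry.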

TestStorageRouting

Test What it verifies
test_inline_artifact_not_written_to_disk Inline artifact not written to disk — artifacts directory must be empty (no files, not even shard dirs with files)
test_inline_artifact_not_in_catalog Inline artifact not in catalog
test_inline_appears_in_jsonl Inline appears in jsonl
test_outcome_ref_resolves_from_inline Outcome ref resolves from inline
test_structural_artifact_still_uses_file_store Structural artifact still uses file store

TestParquetWriterInline

Test What it verifies
test_inline_artifacts_column_written Inline artifacts column written
test_inline_artifacts_data_round_trips Inline artifacts data round trips
test_spans_without_inline_have_null_column Spans without inline have null column — a null cell reads back as None or NaN depending on pandas version

test_invariants.py

Tier 4 property / invariant tests. These test the eight architectural invariants stated in docs/architecture/001_invariants.md, plus hash integrity and storage-triage consistency properties.

TestINV001ObserverEffect

Test What it verifies
test_qiskit_estimator_does_not_alter_result Proxy result must be bitwise identical to unproxied result.
test_pennylane_proxy_does_not_alter_result PennyLane proxy result must match direct device result.

TestINV002TraceImmutability

Test What it verifies
test_write_after_close_raises Write after close raises
test_attach_artifact_after_close_raises Attach artifact after close raises
test_close_idempotent Close idempotent
test_events_jsonl_append_only Span data is flushed immediately — JSONL length only grows.

TestPROP007IntegritySeal

Test What it verifies
test_seal_present_after_seal Seal present after seal
test_seal_checksum_matches_events_file Seal checksum matches events file
test_artifact_count_includes_inline Artifact count includes inline
test_artifact_count_includes_filestore Artifact count includes filestore
test_seal_absent_while_in_flight Before sealing, trace.json is CRASHED_IN_FLIGHT with no seal.

TestINV007FailureVisibility

Test What it verifies
test_exception_span_has_error_event Exception span has error event
test_error_event_captures_exception_type Error event captures exception type
test_exception_propagates_to_caller Exception propagates to caller
test_tape_sealed_with_errors_on_outer_exception Tape sealed with errors on outer exception

TestPROP001HashIntegrity

Test What it verifies
test_inline_artifact_keys_match_sha256 Inline artifact keys match sha256
test_all_spans_in_jsonl_pass_hash_check All spans in jsonl pass hash check

TestPROP002FileStoreIntegrity

Test What it verifies
test_attached_file_hash_matches_disk Attached file hash matches disk — Verify file on disk

TestPROP003CatalogConsistency

Test What it verifies
test_catalog_entries_match_file_count Catalog entries match file count — Files on disk may be < catalog entries if circuits are identical (dedup)
test_inline_artifacts_not_counted_in_catalog Inline artifacts not counted in catalog

TestPROP004SequenceNumbers

Test What it verifies
test_sequence_numbers_monotonic_and_unique Sequence numbers monotonic and unique
test_nested_spans_both_get_unique_sequences Nested spans both get unique sequences

TestPROP005NoCircuitInline

Test What it verifies
test_circuit_qasm_always_in_file_store Circuit qasm always in file store

TestPROP006OutcomeRefResolves

Test What it verifies
test_inline_outcome_ref_resolves Inline outcome ref resolves
test_file_store_outcome_ref_resolves File store outcome ref resolves

test_storage.py

Verifies the PyArrow Parquet conversion engine. Ensures columnar arrays maintain strict integrity against the JSON schema.

Test What it verifies
test_parquet_conversion_creates_file Parquet conversion creates file
test_parquet_schema_and_data_integrity Parquet schema and data integrity — Read it back into memory to verify column types
test_missing_jsonl_raises_error Missing jsonl raises error

test_tape.py

All I/O is isolated to tmp_path. All model types are imported from the hilbertbench.models public interface only, never from v1_0 directly. Adheres strictly to INV-001, INV-003, INV-004, and INV-007.

TestOpen

Test What it verifies
test_creates_run_directory Creates run directory
test_dir_name_format Dir name format
test_artifacts_subdir_exists Artifacts subdir exists
test_trace_json_written_on_open Trace json written on open
test_events_jsonl_created_on_open Events jsonl created on open

TestTraceLifecycle

Test What it verifies
test_sealed_success_on_clean_exit Sealed success on clean exit
test_sealed_with_errors_on_exception Sealed with errors on exception
test_timestamp_end_absent_while_open Timestamp end absent while open
test_timestamp_end_present_after_close Timestamp end present after close
test_tags_persisted Tags persisted

TestSpans

Test What it verifies
test_span_flushed_immediately_on_close Span flushed immediately on close
test_span_fields_present Span fields present
test_span_nesting_parent_id Span nesting parent id — Inner span is closed and flushed first, so it is at index 0
test_root_span_has_no_parent Root span has no parent
test_sequence_numbers_monotonic_and_unique Sequence numbers monotonic and unique
test_span_event_recorded Span event recorded — Should have REQUEST, CALIBRATION_CHECK, RESULT

TestThreadSafety

Test What it verifies
test_parallel_spans_do_not_cross_nest Per-thread span stack must be independent (threading.local check).
test_events_jsonl_valid_under_concurrency Every line must be valid JSON after 10 concurrent span writers.
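
The cross-nesting test refers to per-thread span stacks via threading.local. A minimal sketch of that pattern is shown below; the SpanStack class is hypothetical and only demonstrates why a span opened in one thread never becomes the parent of a span opened in another.

```python
import threading


class SpanStack:
    """Open-span stack that is independent per thread."""

    def __init__(self):
        self._local = threading.local()

    def _stack(self):
        if not hasattr(self._local, "stack"):
            self._local.stack = []  # created lazily, once per thread
        return self._local.stack

    def push(self, span_id):
        """Open a span and return the id of its parent (None for a root)."""
        stack = self._stack()
        parent = stack[-1] if stack else None
        stack.append(span_id)
        return parent

    def pop(self):
        return self._stack().pop()


def run_writers(n_threads=10):
    stacks = SpanStack()
    barrier = threading.Barrier(n_threads)
    observed = {}

    def worker(i):
        root_parent = stacks.push(f"root-{i}")
        barrier.wait()  # every thread now holds an open root span
        child_parent = stacks.push(f"child-{i}")
        observed[i] = (root_parent, child_parent)
        stacks.pop()
        stacks.pop()

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed


observed = run_writers()
print(all(observed[i] == (None, f"root-{i}") for i in observed))
```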

TestAttach

Test What it verifies
test_artifact_copied_to_artifacts_dir Artifact copied to artifacts dir — Artifacts use 2-char sharding: artifacts//.
test_catalog_json_written_on_close Catalog json written on close
test_sha256_correct Sha256 correct
test_size_bytes_correct Size bytes correct
test_missing_file_raises Missing file raises
test_compression_stored Compression stored

TestFreezeOnClose

Test What it verifies
test_span_after_close_raises Span after close raises
test_attach_after_close_raises Attach after close raises
test_close_idempotent Close idempotent
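
A minimal sketch of the freeze-on-close behavior these tests describe, assuming a closed flag that every mutating call checks. MiniRun is a hypothetical stand-in, and the use of RuntimeError is an assumption; the exception type the real tape raises is not stated here.

```python
class MiniRun:
    """Hypothetical run handle that freezes once closed."""

    def __init__(self):
        self._closed = False
        self.seal_count = 0  # how many times the run was actually sealed

    def _require_open(self):
        if self._closed:
            raise RuntimeError("run is closed; no further writes allowed")

    def span(self, name):
        self._require_open()
        return name

    def attach(self, path):
        self._require_open()
        return path

    def close(self):
        # Idempotent: a second close is a no-op and does not reseal.
        if self._closed:
            return
        self._closed = True
        self.seal_count += 1


run = MiniRun()
run.span("before-close")  # allowed while the run is open
run.close()
run.close()               # harmless repeat
```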

TestExceptionPath

Test What it verifies
test_exception_span_written Exception span written — The exception occurred inside the span, so it should be FAILED
test_original_exception_propagates Original exception propagates
test_exception_attributes_captured Exception attributes captured
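
The exception path can be sketched with a context manager that records the failure and then re-raises. This is an illustration under assumptions: the span function, the record field names, and the "COMPLETED" status string are hypothetical; only the FAILED status is taken from the table above.

```python
from contextlib import contextmanager

records = []  # stands in for the flushed span records


@contextmanager
def span(name):
    record = {"name": name, "status": "COMPLETED"}
    try:
        yield record
    except Exception as exc:
        # Capture what went wrong, then let the original exception continue.
        record["status"] = "FAILED"
        record["exception_type"] = type(exc).__name__
        record["exception_message"] = str(exc)
        raise
    finally:
        # The span is written whether or not the body raised.
        records.append(record)


propagated = None
try:
    with span("execute"):
        raise ValueError("backend rejected job")
except ValueError as exc:
    propagated = exc
```

The bare raise keeps the caller's exception and traceback intact, so recording the failure never masks it.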

tests/tools/

The blinded-corpus protocol tool: leakage auditing, verbatim blinding with random IDs, SHA-256 answer-key commitments, and confusion-matrix scoring with Wilson intervals.

test_blind_corpus.py

Tests for the blinded-corpus protocol tool (tools/blind_corpus.py): leakage audit, blinding round-trip, commitment verification, and confusion-matrix scoring.
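
As a rough illustration of a SHA-256 answer-key commitment, the sketch below hashes a canonical JSON encoding of the key and later checks a presented key against that digest. The function names, the canonical-JSON choice, and the run IDs and labels are all hypothetical; tools/blind_corpus.py may serialize and store its commitment differently.

```python
import hashlib
import json


def commit_key(answer_key: dict) -> str:
    """Return a SHA-256 hex digest over a canonical JSON encoding of the key."""
    # sort_keys and fixed separators make the encoding independent of dict order.
    canonical = json.dumps(answer_key, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_key(answer_key: dict, commitment: str) -> bool:
    """True only if the presented key hashes to the published commitment."""
    return commit_key(answer_key) == commitment


# Hypothetical blinded IDs mapped to hypothetical labels.
key = {"run-7f3a": "label_a", "run-c410": "label_b"}
commitment = commit_key(key)

# Flipping one label changes the content, so the commitment no longer matches.
tampered = dict(key)
tampered["run-7f3a"] = "label_b"
```

Publishing the digest before scoring lets anyone confirm afterwards that the key was not edited to fit the diagnoses.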

TestAudit

Test What it verifies
test_clean_run_passes Clean run passes
test_label_in_tags_is_flagged Label in tags is flagged
test_label_in_dirname_is_flagged Label in dirname is flagged

TestBlind

Test What it verifies
test_blinding_roundtrip Blinding roundtrip — blinded copies + key + commitment + sheet all exist
test_leaky_corpus_is_refused Leaky corpus is refused
test_invalid_label_is_refused Invalid label is refused

TestScore

Test What it verifies
test_perfect_diagnosis_scores_one Perfect diagnosis scores one
test_wrong_diagnosis_scores_zero Wrong diagnosis scores zero
test_secondary_label_counts_in_top2 Secondary label counts in top2 — primary wrong, but the true label is given as the secondary
test_missing_primary_is_refused Missing primary is refused
test_tampered_key_fails_commitment Tampered key fails commitment — flip the label so the tamper is guaranteed to change content
test_wilson_interval_sane Wilson interval sane
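
For reference, the Wilson score interval for k successes out of n trials can be computed as below. This is a sketch of the standard formula with a default z of 1.96 (roughly 95% coverage); the function name and signature are assumptions, not the tool's own.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion, clamped to [0, 1]."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


all_right = wilson_interval(10, 10)
all_wrong = wilson_interval(0, 10)
half_right = wilson_interval(5, 10)
```

Unlike the plain normal approximation, the interval stays inside [0, 1] and keeps a nonzero width even when every diagnosis is right or every one is wrong, which matters for small blinded corpora.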

tests/trace/

HilbertTrace is the public read API. These tests guarantee that whatever the recorder wrote, the reader resolves back exactly — spans, outcomes, parameters, circuits, calibration history — without the caller knowing about storage details.

test_hilberttrace.py

Tests for the HilbertTrace unified data API. Traces are built with the tape directly (no quantum execution) so the resolution logic — inline vs file-store, scalar vs array vs dict outcomes — is exercised deterministically.

TestConstruction

Test What it verifies
test_missing_directory_raises Missing directory raises
test_directory_without_events_raises Directory without events raises
test_repr Repr

TestMetadata

Test What it verifies
test_status_mode_tags Status mode tags
test_integrity_seal_present Integrity seal present
test_environment Environment

TestSpanAccess

Test What it verifies
test_len_and_iteration Len and iteration
test_completed_filter Completed filter
test_filter_by_backend Filter by backend
test_dataframe_view Dataframe view

TestInlineResolution

Test What it verifies
test_outcome_resolves Outcome resolves
test_parameters_resolve Parameters resolve
test_observables_resolve Observables resolve
test_missing_parameters_returns_none Missing parameters returns none

TestFileStoreResolution

Test What it verifies
test_circuit_resolves_from_file Circuit resolves from file
test_outcome_inline_circuit_filestore_same_span A span may mix storage tiers: inline outcome + file-store circuit.

TestNumericOutcomes

Test What it verifies
test_scalars Scalars
test_arrays_flattened Arrays flattened
test_counts_dict_skipped Sampler-style counts dicts are not numeric outcomes.
test_variance_matches_manual Variance matches manual
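
The rules in this table (scalars kept, arrays flattened, counts dicts skipped) can be sketched as follows. numeric_outcomes is a hypothetical helper, not the HilbertTrace method, and the use of population variance is an assumption; the catalog does not say which estimator the real API uses.

```python
import statistics
from numbers import Real


def numeric_outcomes(outcomes):
    """Collect numeric values: keep scalars, flatten sequences, skip dicts."""
    values = []
    for outcome in outcomes:
        if isinstance(outcome, dict):
            continue  # sampler-style counts are not numeric outcomes
        if isinstance(outcome, Real):
            values.append(float(outcome))
        elif isinstance(outcome, (list, tuple)):
            values.extend(numeric_outcomes(outcome))  # flatten nested arrays
    return values


# Hypothetical outcomes: two scalars, one nested array, one counts dict.
outcomes = [0.5, [1.0, [2.0, 3.0]], {"00": 512, "11": 512}, 1.5]
values = numeric_outcomes(outcomes)

# Population variance computed by hand, for comparison with the library call.
mean = sum(values) / len(values)
manual_variance = sum((v - mean) ** 2 for v in values) / len(values)
library_variance = statistics.pvariance(values)
```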

TestCalibrationAndVerify

Test What it verifies
test_calibration_none_when_absent Calibration none when absent
test_calibration_resolves_when_present Calibration resolves when present
test_verify_passes_on_clean_trace Verify passes on clean trace

TestLazyImport

Test What it verifies
test_top_level_import Top level import
test_unknown_attribute_raises Unknown attribute raises
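
One common way to get lazy top-level imports with a clean error for unknown names is a module-level __getattr__ (PEP 562). The sketch below shows the pattern on a throwaway module; whether hilbertbench implements it this way is an assumption, and demo_pkg and the stand-in targets are hypothetical.

```python
import importlib
import types

# Public name -> module that really defines it. Stand-ins from the stdlib.
_LAZY = {"sqrt": "math", "OrderedDict": "collections"}


def _lazy_getattr(name):
    # Called only when normal attribute lookup on the module fails.
    if name in _LAZY:
        module = importlib.import_module(_LAZY[name])  # imported on first use
        return getattr(module, name)
    raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}")


# In a real package this function would be defined as __getattr__ in __init__.py.
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__getattr__ = _lazy_getattr

sqrt = demo_pkg.sqrt  # resolved lazily through _lazy_getattr

error_message = None
try:
    demo_pkg.no_such_name
except AttributeError as exc:
    error_message = str(exc)
```

Raising AttributeError for unknown names keeps hasattr and from-imports behaving as they would for an eagerly populated module.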