Test Catalog

Every test in the suite, what it verifies, and why its area exists. Generated by tools/gen_test_catalog.py — regenerate after adding tests.

Total test functions: 321 (run them with python -m pytest tests/ -q).

`tests/active/`

Active Mode is the opt-in interventional path (expressibility probing). These tests check that the probe records correctly, that it labels the trace as active, and that analyzers needing interventional data refuse passive traces.

test_active_mode.py

Tests for Active Mode probing and the kl_expressibility analyzer it feeds. Active Mode runs real circuits, so these tests use small ansätze and modest sample counts. The headline physics check is directional: a rigid ansatz must score a larger Haar KL divergence than an expressive one.

TestProbeCore

Test	What it verifies
`test_records_active_mode_trace`	Records active mode trace
`test_trace_verifies`	Trace verifies
`test_seed_reproducible`	Seed reproducible — Same seed → identical sampled parameters → identical statevectors
`test_statevector_round_trips`	Statevector round trips

TestQiskitProbe

Test	What it verifies
`test_qiskit_probe_stores_circuit_qasm`	Qiskit probe stores circuit QASM
`test_qiskit_circuit_structure_visible`	Qiskit circuit structure visible — decomposed → real gates visible, depth > 1

TestPennyLaneProbe

Test	What it verifies
`test_pennylane_probe`	PennyLane probe

TestExpressibility

Test	What it verifies
`test_rigid_more_than_expressive`	Directional physics: rigid ansatz has larger Haar KL than expressive.
`test_expressive_ansatz_low_kl`	StronglyEntanglingLayers is known to be highly expressive.
`test_num_qubits_inferred`	Num qubits inferred
`test_passive_trace_guard`	Expressibility on a passive trace returns a guard, not a number.
`test_insufficient_states`	Insufficient states
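
The directional check can be illustrated with a small NumPy sketch. It assumes the common fidelity-histogram definition of expressibility (pairwise state fidelities compared against the Haar distribution). `haar_kl`, `random_states`, and the bin count are hypothetical names and choices for illustration, not the `kl_expressibility` implementation.

```python
import numpy as np


def haar_kl(states, n_bins=50):
    """KL divergence of the pairwise-fidelity histogram from the Haar prediction.

    Hypothetical helper: `states` is an (n_states, dim) array of normalized
    statevectors sampled from one ansatz. Intended for small dimensions only.
    """
    states = np.asarray(states, dtype=complex)
    if states.ndim != 2 or len(states) < 2 or states.shape[1] < 2:
        raise ValueError("need at least two statevectors of dimension >= 2")
    dim = states.shape[1]

    # Fidelities |<psi_i|psi_j>|^2 over all distinct pairs.
    overlaps = np.abs(states.conj() @ states.T) ** 2
    rows, cols = np.triu_indices(len(states), k=1)
    fidelities = np.clip(overlaps[rows, cols], 0.0, 1.0)

    counts, edges = np.histogram(fidelities, bins=n_bins, range=(0.0, 1.0))
    p = counts / counts.sum()

    # Haar probability mass per bin, from the CDF 1 - (1 - F)^(dim - 1).
    q = np.diff(1.0 - (1.0 - edges) ** (dim - 1))

    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def random_states(rng, n_states, dim):
    """Haar-random states: normalized complex Gaussian vectors."""
    raw = rng.normal(size=(n_states, dim)) + 1j * rng.normal(size=(n_states, dim))
    return raw / np.linalg.norm(raw, axis=1, keepdims=True)
```

A rigid ansatz that always returns the same state puts every fidelity in the last bin, so its divergence is large. Haar-random states should score close to zero.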

`tests/analysis/`

Each analyzer is tested against constructed ground-truth traces: plant a known condition, assert the verdict and the quantitative evidence (variance, SNR, KL, fidelity) with their confidence intervals. Includes regression tests for hardware-format (ISA) circuits and active-qubit calibration scoping.

test_builtin_analyzers.py

Tests for the function-based analysis layer (hilbertbench.analysis). Traces are built deterministically with the tape; no quantum execution.

TestBarrenPlateau

Test	What it verifies
`test_trainable`	Trainable
`test_barren_plateau`	Barren plateau
`test_insufficient_data`	Insufficient data — span with no numeric outcome (counts dict)
`test_custom_threshold`	Custom threshold
`test_accepts_path_and_trace_object`	Accepts path and trace object
`test_variance_matches_numpy`	Variance matches numpy

TestShotNoise

Test	What it verifies
`test_with_recorded_shots`	With recorded shots — high trajectory variance, low shots → signal clear
`test_shot_noise_dominated`	Shot noise dominated — tiny trajectory variance, many shots → buried in noise
`test_no_shots_recorded`	No shots recorded
`test_default_shots_fallback`	Default shots fallback
`test_precision_fallback`	Precision fallback — estimator runs record target precision, not shots; the floor
`test_recorded_shots_win_over_precision`	Recorded shots win over precision
`test_insufficient_data`	Insufficient data
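
The fallback order these tests exercise can be sketched as follows. This assumes a variance floor of `1/shots` for a bounded observable and `precision ** 2` when only a target precision was recorded. Both formulas and the names `noise_floor` and `signal_to_noise` are illustrative assumptions, not the analyzer's actual code.

```python
import statistics


def noise_floor(shots=None, precision=None, default_shots=None):
    """Variance floor of one expectation estimate; None if nothing is known."""
    if shots:
        return 1.0 / shots          # recorded shots take priority
    if precision:
        return precision ** 2       # target standard error, squared
    if default_shots:
        return 1.0 / default_shots  # caller-supplied fallback
    return None


def signal_to_noise(outcomes, **evidence):
    """Trajectory variance divided by the shot-noise floor."""
    if len(outcomes) < 2:
        return None                 # insufficient data
    floor = noise_floor(**evidence)
    if floor is None:
        return None                 # no shot or precision evidence
    return statistics.pvariance(outcomes) / floor
```

A ratio well above 1 means the trajectory moves more than shot noise alone would explain. A ratio well below 1 means the signal is buried in noise.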

TestSummary

Test	What it verifies
`test_combined_report`	Combined report
`test_summary_accepts_path`	Summary accepts path

TestCustomAnalysis

Test	What it verifies
`test_user_can_write_own_analysis`	A user composes their own diagnostic on the same trace API.

test_confidence.py

Tests for the statistical-uncertainty measures added to the analyzers (proposal Section 2.6: "reported with statistical uncertainty and confidence measures, emphasizing transparency over definitive attribution").

TestBootstrapCI

Test	What it verifies
`test_ci_brackets_statistic`	CI brackets statistic
`test_degenerate_inputs_return_none`	Degenerate inputs return none
`test_n_boot_zero_disables`	N boot zero disables
`test_reproducible_with_seed`	Reproducible with seed
`test_wider_ci_for_higher_level`	Wider CI for higher level
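
The properties listed in this table map onto a percentile bootstrap. The sketch below is a minimal NumPy version under that assumption. `bootstrap_ci` and its parameter names are hypothetical and may not match the helper in the analysis package.

```python
import numpy as np


def bootstrap_ci(values, statistic=np.var, n_boot=1000, level=0.95, seed=None):
    """Percentile-bootstrap confidence interval, or None when it cannot be computed."""
    values = np.asarray(values, dtype=float)
    if n_boot <= 0 or values.size < 2:
        return None  # disabled, or degenerate input

    rng = np.random.default_rng(seed)
    # Each row is one resample (with replacement) of the original indices.
    idx = rng.integers(0, values.size, size=(n_boot, values.size))
    stats = np.array([statistic(values[row]) for row in idx])

    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(stats, [alpha, 1.0 - alpha])
    return float(lo), float(hi)
```

With a fixed seed the resamples are identical across calls, so the interval is reproducible and a higher level can only widen it.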

TestBarrenPlateauConfidence

Test	What it verifies
`test_ci_brackets_variance`	CI brackets variance
`test_clear_trainable_high_confidence`	Clear trainable high confidence
`test_clear_barren_high_confidence`	Clear barren high confidence
`test_near_threshold_low_confidence`	Near threshold low confidence — variance engineered to sit right at the 0.005 threshold
`test_n_boot_zero_skips_ci`	N boot zero skips CI

TestShotNoiseConfidence

Test	What it verifies
`test_empirical_variance_ci_present`	Empirical variance CI present
`test_ci_present_even_without_shots`	CI present even without shots

test_noise.py

Tests for the noise-profile analyzer (Diagnostic Axis: Noise). Calibration data only exists on real/fake hardware backends, so these use Qiskit's FakeManilaV2 (which ships realistic T1/T2/readout/gate-error data) and assert ideal simulators degrade gracefully to a no-calibration result.

TestDeviceSummary

Test	What it verifies
`test_reports_calibration_stats`	Reports calibration stats
`test_gate_errors_present`	Gate errors present
`test_estimated_fidelity_in_unit_interval`	Estimated fidelity in unit interval

TestIdealSimulator

Test	What it verifies
`test_no_calibration_status`	No calibration status

TestDepthInteraction

Test	What it verifies
`test_fidelity_decreases_with_depth`	Fidelity decreases with depth
`test_dominant_error_shifts_to_two_qubit_gates`	Dominant error shifts to two qubit gates — shallow circuit: readout dominates the small infidelity

TestCalibrationScoping

Test	What it verifies
`test_stats_scoped_to_active_qubits`	Stats scoped to active qubits — the awful qubit 2 must not contaminate any statistic
`test_fidelity_uses_scoped_errors`	Fidelity uses scoped errors — 1 sx + 1 cx on good qubits: fidelity must stay high; the
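
Calibration scoping can be pictured with a simple product-of-success-probabilities model. This is an assumption made for illustration. The analyzer's actual fidelity estimate may differ, and `estimated_fidelity` and its argument layout are hypothetical.

```python
def estimated_fidelity(ops, gate_error, readout_error):
    """Product of per-operation success probabilities on the qubits actually used.

    ops: list of (gate_name, qubits) tuples from the circuit.
    gate_error: dict mapping (gate_name, qubits) -> error rate.
    readout_error: dict mapping qubit -> readout error rate.
    """
    fidelity = 1.0
    active = set()
    for name, qubits in ops:
        fidelity *= 1.0 - gate_error[(name, qubits)]
        active.update(qubits)
    # Only qubits touched by the circuit contribute readout error,
    # so a bad but idle qubit cannot drag the estimate down.
    for qubit in active:
        fidelity *= 1.0 - readout_error[qubit]
    return fidelity
```

Because every factor lies in the unit interval, the estimate stays in [0, 1] and can only fall as more gates are appended.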

test_optimization_and_circuit.py

Tests for the optimization-loop (Axis 4) and circuit-structure analyzers.

TestOptimizationConvergence

Test	What it verifies
`test_converging_trajectory`	Converging trajectory
`test_constant_step_still_improving`	Constant step still improving
`test_converging_path_length_positive`	Converging path length positive
`test_insufficient_data`	Insufficient data
`test_outcome_envelope_reported`	Outcome envelope reported — cost should drop from start to finish
`test_accepts_path_and_trace`	Accepts path and trace

TestCircuitStructure

Test	What it verifies
`test_bell_circuit`	Bell circuit
`test_bell_depth`	Bell depth — h on q0 (layer 1) then cx q0,q1 (layer 2) → depth 2
`test_parametric_circuit`	Parametric circuit
`test_no_qasm_circuit`	No qasm circuit — Trace with only inline outcomes, no circuit_qasm artifact
`test_entangling_fraction_bounds`	Entangling fraction bounds
`test_isa_circuit_physical_qubits`	ISA circuit physical qubits — Regression: hardware ISA circuits previously parsed as empty
`test_isa_circuit_depth`	ISA circuit depth — sx, rz, sx, rz stack on $0 (layers 1-4), cx joins $0/$1 → 5
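
The layer counts quoted for the Bell and ISA cases follow from the usual greedy layering rule: a gate lands one layer after the latest layer on any qubit it touches. The sketch below works on a plain list of qubit tuples rather than parsed QASM. `circuit_depth` and `entangling_fraction` are hypothetical helper names.

```python
def circuit_depth(gates):
    """Depth of a circuit given as qubit tuples in program order."""
    frontier = {}  # qubit -> index of the last layer that used it
    depth = 0
    for qubits in gates:
        layer = 1 + max((frontier.get(q, 0) for q in qubits), default=0)
        for q in qubits:
            frontier[q] = layer
        depth = max(depth, layer)
    return depth


def entangling_fraction(gates):
    """Share of gates acting on more than one qubit; 0.0 for an empty circuit."""
    if not gates:
        return 0.0
    return sum(len(qubits) > 1 for qubits in gates) / len(gates)


bell = [(0,), (0, 1)]                      # h q0; cx q0,q1
isa = [(0,), (0,), (0,), (0,), (0, 1)]     # sx, rz, sx, rz on $0; cx $0,$1
```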

`tests/compliance/`

Architecture-level checks that the documented invariants (INV-001 and friends) hold end to end — the contract the paper and the docs promise.

test_backward_compatibility.py

These fixtures are FROZEN. They represent real v1.0 traces that must remain parseable for the lifetime of the project. Never update these dicts — if a model change breaks them, that is a BREAKING CHANGE and requires a schema version bump.

TestGoldenRecords — These must NEVER fail. Failure = breaking change.

Test	What it verifies
`test_golden_trace_always_parses`	Golden trace always parses
`test_golden_span_always_parses`	Golden span always parses
`test_golden_artifact_always_parses`	Golden artifact always parses

test_execution_parity.py

Validates the proposal's central "1:1 Execution Parity" claim (§2.1): wrapping a primitive with HilbertBench must NOT change what reaches the backend — same number of executions, same circuits, same shots — and must NOT change the results. This is what makes the recorder a non-confounding observer.

TestEstimatorParity

Test	What it verifies
`test_backend_called_once_per_user_call`	Backend called once per user call — No silent extra executions: the backend saw exactly the same calls.
`test_pubs_unchanged`	Pubs unchanged — The parameter bindings submitted to the backend are bit-identical.
`test_results_identical`	Results identical

TestSamplerParity

Test	What it verifies
`test_backend_called_once_per_user_call`	Backend called once per user call
`test_shots_unchanged`	Shots unchanged — No silent shot inflation: every backend call kept the requested 256.

TestPennyLaneParity

Test	What it verifies
`test_device_executed_same_number_of_times`	Device executed same number of times

TestOverhead

Test	What it verifies
`test_estimator_overhead_under_budget`	Per-call recording overhead must stay small (proposal target: <5ms). We assert a generous CI-safe ceiling; the demo reports the real number (typically well under 1ms on a workstation).
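
The parity idea can be shown with a toy backend that logs what reaches it. `CountingBackend` and `RecordingProxy` are stand-ins invented for this sketch. They are not HilbertBench or Qiskit classes.

```python
import copy


class CountingBackend:
    """Stand-in backend that logs every call it receives."""

    def __init__(self):
        self.calls = []

    def run(self, circuit, shots):
        self.calls.append((circuit, shots))
        return {"circuit": circuit, "shots": shots}


class RecordingProxy:
    """Forwards each call exactly once, then records a copy of what it saw."""

    def __init__(self, backend):
        self._backend = backend
        self.spans = []

    def run(self, circuit, shots):
        result = self._backend.run(circuit, shots)  # one call in, one call out
        self.spans.append({"shots": shots, "result": copy.deepcopy(result)})
        return result                               # returned untouched
```

Comparing the call log of a wrapped backend with that of an unwrapped one gives the three parity assertions: same number of calls, same arguments, same results.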

test_invariants.py

TestINV003 — INV-003: models are auto-generated, never manually edited.

Test	What it verifies
`test_all_generated_files_have_header`	All generated files have header

TestINV004 — INV-004: models must only import stdlib + pydantic.

Test	What it verifies
`test_no_forbidden_imports`	No forbidden imports

test_schema_roundtrip.py

Validates that all Pydantic v2 models correctly enforce schema constraints. Tests are organized by model. Negative tests confirm bad data is rejected.

TestTraceManifest

Test	What it verifies
`test_valid_construction`	Valid construction
`test_roundtrip`	Roundtrip
`test_mode_enum`	Mode enum
`test_status_enum`	Status enum
`test_all_status_values_valid`	All status values valid
`test_all_mode_values_valid`	All mode values valid
`test_null_timestamp_end_allowed`	Null timestamp end allowed
`test_null_integrity_seal_allowed`	Null integrity seal allowed
`test_optional_fields_absent`	Optional fields absent — timestamp_end, integrity_seal, tags are all optional
`test_rejects_wrong_version`	Rejects wrong version
`test_rejects_extra_fields`	Rejects extra fields
`test_rejects_invalid_mode`	Rejects invalid mode
`test_rejects_invalid_status`	Rejects invalid status
`test_requires_client_environment`	Requires client environment
`test_client_environment_requires_version`	Client environment requires version
`test_tags_arbitrary_strings`	Tags arbitrary strings

TestSpan

Test	What it verifies
`test_valid_construction`	Valid construction
`test_roundtrip`	Roundtrip
`test_events_preserved`	Events preserved
`test_trace_id_matches_parent`	Trace id matches parent
`test_sequence_number_zero_allowed`	Sequence number zero allowed
`test_sequence_number_large_value`	Sequence number large value
`test_all_status_values_valid`	All status values valid
`test_null_outcome_ref_allowed`	Null outcome ref allowed
`test_null_parent_span_id_allowed`	Null parent span id allowed — Null parent_span_id = root span
`test_event_type_open_pattern`	Event type open pattern — event_type is open pattern ^[A-Z_]+$ — custom types must be allowed
`test_event_attributes_allow_arbitrary_scalars`	Event attributes allow arbitrary scalars
`test_event_null_attributes_allowed`	Event null attributes allowed
`test_rejects_negative_sequence_number`	Rejects negative sequence number
`test_rejects_empty_events`	Rejects empty events — minItems: 1 — a span with no events is invalid
`test_rejects_lowercase_event_type`	Rejects lowercase event type — event_type pattern is ^[A-Z_]+$ — lowercase must be rejected
`test_rejects_extra_fields`	Rejects extra fields
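
Three of the span constraints named above (non-negative sequence number, at least one event, uppercase event types) can be expressed in a few lines of standard-library Python. The real models are Pydantic classes. `validate_span` is a hypothetical stand-in that only mirrors those three rules.

```python
import re

# Same character class as the schema pattern ^[A-Z_]+$; fullmatch anchors both ends.
EVENT_TYPE = re.compile(r"[A-Z_]+")


def validate_span(span):
    """Raise ValueError if the span dict breaks one of the three rules."""
    if span["sequence_number"] < 0:
        raise ValueError("sequence_number must be >= 0")
    if not span["events"]:
        raise ValueError("a span needs at least one event")
    for event in span["events"]:
        if not EVENT_TYPE.fullmatch(event["event_type"]):
            raise ValueError(f"bad event_type: {event['event_type']!r}")
```

Because the pattern is open, a custom type such as `MY_CUSTOM_EVENT` passes while `execution_request` does not.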

TestArtifact

Test	What it verifies
`test_valid_construction`	Valid construction
`test_roundtrip`	Roundtrip
`test_all_kind_values_valid`	All kind values valid
`test_all_encoding_values_valid`	All encoding values valid
`test_all_compression_values_valid`	All compression values valid
`test_compression_null_allowed`	Compression null allowed
`test_size_bytes_zero_allowed`	Size bytes zero allowed
`test_producer_null_allowed`	Producer null allowed
`test_hash_pattern_enforced`	Hash pattern enforced
`test_hash_wrong_length_rejected`	Hash wrong length rejected
`test_rejects_negative_size`	Rejects negative size
`test_rejects_ref_count_zero`	Rejects ref count zero — minimum: 1 — an artifact with zero references is orphaned
`test_rejects_invalid_kind`	Rejects invalid kind

TestCatalog

Test	What it verifies
`test_valid_construction`	Valid construction
`test_roundtrip`	Roundtrip
`test_multiple_artifacts`	Multiple artifacts
`test_empty_artifacts_allowed`	Empty artifacts allowed
`test_artifact_values_are_validated`	Artifact values are validated — Even though the key is not pattern-validated by Pydantic (see below),
`test_artifact_key_format_not_enforced_by_pydantic`	catalog.json uses additionalProperties (not patternProperties), so Pydantic accepts any string key at runtime. Key format and key==artifact_hash integrity are the responsibility of reader/verify.py, not the Pydantic model. This is a deliberate design tradeoff (see design_decisions/0003). This test documents and pins the behaviour. If it starts raising, the schema was changed back to patternProperties.
`test_rejects_wrong_version`	Rejects wrong version
`test_rejects_extra_fields`	Rejects extra fields

`tests/e2e/`

Full journeys: record a realistic workload, seal, reopen, analyze — the integration surface a real user touches.

test_full_algorithms.py

Tier 3 end-to-end regression tests. Each test runs a complete algorithm for a small number of steps and verifies:

- The trace is sealed and complete
- The expected number of spans was recorded
- Inline artifacts contain the correct data kinds

TestVQERegression

Test	What it verifies
`test_vqe_trace_complete`	Runs 5 steps of gradient-based VQE on the simple H = Z⊗Z Hamiltonian. Verifies: spans created, outcome + parameters + observables captured.

TestQAOARegression

Test	What it verifies
`test_qaoa_bitstrings_recorded`	Runs a 2-qubit QAOA-like circuit for 3 angles and records bitstring outcomes. Verifies that counts are captured inline with proper structure.
`test_qaoa_multiple_angle_sets`	Multiple parameter sets in one PUB → one span total.

TestQNNRegression

Test	What it verifies
`test_qnn_training_trace_complete`	Trains a 2-qubit QNN for 5 steps on a 4-point dataset. Verifies: spans recorded, outcomes + params captured, trace sealed.

TestCrossFramework

Test	What it verifies
`test_both_frameworks_produce_valid_traces`	Evaluate Z⊗Z expectation value using both frameworks. Both traces should be valid, sealed, and contain outcome data.

test_phenomenology.py

Phenomenological validation (proposal Section 2.6): plant a known QML phenomenon in synthetic ground-truth circuits and confirm the detector attributes it correctly from trace evidence alone.

TestBarrenPlateauValidation

Test	What it verifies
`test_wide_deep_circuit_flagged_barren`	Wide deep circuit flagged barren
`test_shallow_control_is_trainable`	Shallow control is trainable
`test_variance_collapses_with_width`	The planted property: variance must shrink as width grows.
`test_trace_is_active_mode`	Landscape probing is a controlled, opt-in active diagnostic.

test_qiskit_aer.py

End-to-End verification using a REAL Qiskit Aer simulator. Proves the transparent proxy works with actual quantum execution without breaking standard QML workflows.

Test	What it verifies
`test_real_qml_parameterized_circuit`	Runs a real parameterized circuit (typical of QML/VQE) through the AerSimulator, verifying the proxy handles real Qiskit objects.

`tests/integrations/`

The proxies (Qiskit Estimator/Sampler, backend.run, PennyLane) must be perfectly transparent to the wrapped framework (1:1 execution parity, INV-001) while recording faithfully. This area also covers calibration-snapshot capture across all three backend-access conventions found in the wild, drift refresh, and shot/precision evidence.

test_pennylane.py

Verifies the dynamic proxy integration for PennyLane. Tests that strict ML type-checks are preserved and synchronous executions are correctly logged as single, unified spans.

TestPennyLaneProxyTransparency

Test	What it verifies
`test_dynamic_inheritance`	Crucial for PennyLane QNodes: The proxy must pass isinstance().
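
One way to make a proxy pass `isinstance()` is to build a subclass of the wrapped object's class at runtime. The sketch below shows that general technique on a toy class. `make_recording_proxy` and `ToyDevice` are hypothetical, and the real proxy may be built differently.

```python
def make_recording_proxy(device, on_execute):
    """Return an object that is an instance of type(device) but records calls."""
    base = type(device)

    def execute(self, *args, **kwargs):
        on_execute(args, kwargs)                    # record first
        return base.execute(self, *args, **kwargs)  # then delegate unchanged

    # Dynamic subclass: inherits everything, overrides only execute().
    proxy_cls = type(f"Recorded{base.__name__}", (base,), {"execute": execute})
    proxy = proxy_cls.__new__(proxy_cls)  # skip __init__; reuse existing state
    proxy.__dict__.update(vars(device))
    return proxy


class ToyDevice:
    """Stand-in for a framework device class."""

    def __init__(self, wires):
        self.wires = wires

    def execute(self, tape):
        return f"ran {tape} on {self.wires} wires"
```

This sketch copies instance attributes shallowly and assumes the wrapped class does not use `__slots__`.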

TestPennyLaneExecutionLifecycle

Test	What it verifies
`test_synchronous_execution_span`	PennyLane evaluates synchronously. We should get exactly ONE span.

TestPennyLaneExceptionVisibility

Test	What it verifies
`test_synchronous_exception_handling`	Verifies INV-007 for synchronous failures.

test_pennylane_measurements.py

Tier 2 integration tests for HilbertPennyLaneDeviceProxy covering all common PennyLane measurement types: expval, probs, counts, sample, state. Also verifies exception handling and backend_id propagation.

TestMeasurementTypes

Test	What it verifies
`test_expval_inline`	Expval inline
`test_probs_inline`	Probs inline
`test_counts_inline`	Counts inline — counts is a bitstring → int dict; "00" should dominate
`test_sample_inline`	Sample inline
`test_state_inline_as_complex_pairs`	State inline as complex pairs — Stored as [[real, imag], ...] — first amplitude should be [1.0, 0.0]

TestSpanStructure

Test	What it verifies
`test_four_events_per_span`	EXECUTION_REQUEST + DEVICE_EXECUTE_STARTED + EXECUTION_COMPLETED + EXECUTION_RESULT
`test_device_started_event_has_num_tapes`	Device started event has num tapes
`test_parameters_captured_per_span`	Parameters captured per span
`test_observables_captured`	Observables captured
`test_payload_ref_resolves_to_circuit_qasm`	The circuit is now a templated QASM in the file store (deduplicated), so payload_ref must resolve from the catalog, not inline.
`test_circuit_qasm_deduplicates_across_steps`	Many evaluations of the same circuit structure produce one QASM file.
`test_backend_id_set`	Backend id set
`test_span_status_completed`	Span status completed

TestPennyLaneExceptions

Test	What it verifies
`test_device_exception_creates_failed_span`	When the device raises, a FAILED span with ERROR event is recorded.
`test_exception_propagates_to_caller`	Exception propagates to caller

TestNoFilePollution

Test	What it verifies
`test_all_measurements_stay_inline`	expval, probs, counts, sample — none should write .npy files.

test_pennylane_qasm_reproducibility.py

Proves that the templated OpenQASM stored for PennyLane traces is useful: template + recorded parameters reconstructs a valid circuit whose outcome matches what was recorded — verified by re-executing through Qiskit (a different framework), proving the QASM is portable and complete.

TestTemplateHelper

Test	What it verifies
`test_placeholders_replace_numeric_literals`	Placeholders replace numeric literals
`test_wire_indices_untouched`	Wire indices untouched — wire indices live in [...] not (...) — must not become placeholders
`test_multi_param_gate`	Multi param gate
`test_template_stable_across_values`	Template stable across values

TestQASMRoundTrip

Test	What it verifies
`test_template_plus_params_reproduces_outcome`	The core guarantee: bind(template, params) reproduces the recorded expval.
`test_single_qubit_rotation_round_trip`	Minimal case: one RY rotation, check exact reproduction.
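
The template idea can be sketched with two regular expressions: numeric literals inside `(...)` become placeholders, while wire indices inside `[...]` are left alone. The `{pN}` placeholder syntax and the names `make_template` and `bind` are assumptions made for this example. The sketch also assumes gate arguments are plain numeric literals.

```python
import re

_PAREN = re.compile(r"\(([^()]*)\)")
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")
_PLACEHOLDER = re.compile(r"\{p(\d+)\}")


def make_template(qasm):
    """Return (template, params) with gate-argument literals replaced by {pN}."""
    params = []

    def replace_number(match):
        params.append(float(match.group(0)))
        return f"{{p{len(params) - 1}}}"

    def replace_args(match):
        # Only text inside parentheses is rewritten; q[0] stays untouched.
        return "(" + _NUMBER.sub(replace_number, match.group(1)) + ")"

    return _PAREN.sub(replace_args, qasm), params


def bind(template, params):
    """Substitute recorded parameter values back into the template."""
    return _PLACEHOLDER.sub(lambda m: repr(float(params[int(m.group(1))])), template)
```

Because the template no longer depends on parameter values, every evaluation of the same circuit structure yields the same text. That is the property that makes deduplication possible.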

test_qiskit.py

Verifies the transparent proxy integration for Qiskit. Tests that circuits are serialized, spans are split (async mirroring), and all underlying framework exceptions are properly propagated (INV-007).

TestProxyTransparency

Test	What it verifies
`test_backend_proxy_passthrough`	Backend proxy passthrough — The proxy must perfectly imitate the underlying backend properties
`test_job_proxy_passthrough`	Job proxy passthrough

TestAsyncLifecycle

Test	What it verifies
`test_successful_run_and_result`	Successful run and result — 1. Trigger the run (SUBMIT SPAN)

TestExceptionVisibility

Test	What it verifies
`test_run_exception_visibility`	Run exception visibility — Simulate a crash during circuit translation/submission
`test_result_exception_visibility`	Result exception visibility — Simulate a timeout while waiting for an IBM cloud job

test_qiskit_calibration.py

Tests for calibration-snapshot capture. Calibration data (T1, T2, readout error, gate errors) only exists on real/fake hardware backends, never on ideal simulators — so these tests use Qiskit's FakeManilaV2 which ships realistic calibration data, and assert that ideal simulators produce no snapshot.

TestSerializeCalibration

Test	What it verifies
`test_extracts_t1_t2_readout`	Extracts t1 t2 readout
`test_none_backend_returns_none`	None backend returns none
`test_backend_without_properties_returns_none`	Backend without properties returns none
`test_backend_raising_properties_returns_none`	Backend raising properties returns none

TestEstimatorCalibrationCapture

Test	What it verifies
`test_snapshot_captured`	Snapshot captured
`test_snapshot_captured_once_across_runs`	Snapshot captured once across runs — Content-addressed: identical calibration → one artifact regardless of run count
`test_ideal_simulator_produces_no_snapshot`	Ideal simulator produces no snapshot

TestSamplerCalibrationCapture

Test	What it verifies
`test_snapshot_captured`	Snapshot captured
`test_ideal_sampler_produces_no_snapshot`	Ideal sampler produces no snapshot

TestResolveBackend — The three conventions in the wild: qiskit BackendEstimatorV2 exposes .backend as a property, qiskit-ibm-runtime primitives expose backend() as a bound method, and qiskit-aer primitives only hold ._backend.

Test	What it verifies
`test_property_style`	Property style
`test_method_style_runtime_convention`	Method style runtime convention
`test_private_attr_aer_convention`	Private attr aer convention
`test_backend_passed_directly`	Backend passed directly
`test_statevector_primitive_resolves_to_none`	Statevector primitive resolves to none
`test_none_resolves_to_none`	None resolves to none

TestCalibrationRefresh

Test	What it verifies
`test_drift_yields_snapshot_history`	Drift yields snapshot history
`test_stable_calibration_attaches_once`	Stable calibration attaches once
`test_rate_limit_skips_query_inside_window`	Rate limit skips query inside window

TestCalibrationHistory

Test	What it verifies
`test_single_snapshot_history`	Single snapshot history
`test_drift_history_is_chronological`	Drift history is chronological — calibration() returns the newest snapshot
`test_ideal_trace_has_empty_history`	Ideal trace has empty history

test_qiskit_sampler.py

Tier 2 integration tests for HilbertSamplerProxy. Uses real Qiskit circuits but keeps them minimal (1–2 qubits, few shots).

TestSamplerBasic

Test	What it verifies
`test_one_span_per_pub`	Each PUB produces exactly one span.
`test_outcome_inline_with_counts`	Bitstring counts are stored inline as JSON, not as files.
`test_no_outcome_files_on_disk`	All data is inline — artifacts/ holds only QASM, not outcomes.
`test_circuit_deduplication`	Same circuit template across many shots produces only one QASM file.
`test_span_status_completed`	Span status completed

TestSamplerParametric

Test	What it verifies
`test_parameter_bindings_captured`	Parameter bindings captured — Should contain the parameter array flattened
`test_different_params_different_outcomes`	Different params different outcomes — theta=0: should be almost all '0'

TestSamplerTransparency

Test	What it verifies
`test_job_result_unchanged`	The job returned by the proxy produces the same result as unproxied.
`test_shots_in_execution_completed_event`	EXECUTION_COMPLETED event carries the actual shot count.
`test_tape_closed_skips_recording`	After tape closes, proxy forwards calls but records nothing.
`test_deepcopy_preserves_tape`	Deepcopy preserves tape

`tests/reader/`

Verification: trace.verify() must pass on honest traces and fail loudly on any tampering — the property the blinded validation protocol depends on.

test_verify.py

Proves the cryptographic and causal verification engine. Guarantees that tampered data, missing files, or out-of-order execution spans are strictly rejected.

Test	What it verifies
`test_verify_valid_trace_passes`	A perfectly clean trace should pass with True.
`test_verify_detects_tampered_artifact`	Simulates a malicious user altering their result file after the run to make their quantum benchmark look better.
`test_verify_detects_missing_events_file`	If events.jsonl is deleted, the trace is invalid.
`test_integrity_seal_present_and_valid`	A sealed trace carries an integrity_seal that matches events.jsonl.
`test_verify_detects_event_stream_tampering`	Modifying events.jsonl in a way that still passes causal/reference checks (e.g. flipping a backend_id that no check inspects) must still be caught by the integrity seal's byte-level checksum.
`test_verify_detects_causal_sequence_violation`	Simulates a logging error where sequence numbers are duplicated, or a user copy-pasting spans to fake execution data.
`test_verify_detects_dangling_artifact_references`	Simulates a span pointing to an artifact hash that doesn't exist in the catalog.
`test_verify_detects_child_before_parent_violation`	A child span cannot legally finish and flush to the logs BEFORE its parent span has been created. Causal arrows flow one way.
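
Three of these checks (byte-level seal, duplicate sequence numbers, dangling references) fit in a short standard-library sketch. It assumes the seal is a SHA-256 digest of the raw events.jsonl bytes. The function names and the flat span layout are hypothetical, and parent/child ordering is deliberately left out.

```python
import hashlib
import json


def seal_events(events_bytes):
    """Checksum over the exact bytes of the event stream."""
    return hashlib.sha256(events_bytes).hexdigest()


def verify(events_bytes, integrity_seal, catalog_hashes):
    """Return a list of problems; an empty list means the trace verifies."""
    problems = []
    if seal_events(events_bytes) != integrity_seal:
        problems.append("integrity seal mismatch")

    seen = set()
    for line in events_bytes.decode("utf-8").splitlines():
        if not line.strip():
            continue
        span = json.loads(line)
        seq = span["sequence_number"]
        if seq in seen:
            problems.append(f"duplicate sequence number {seq}")
        seen.add(seq)

        # A reference must resolve either inline or through the catalog.
        ref = span.get("outcome_ref")
        inline = span.get("inline_artifacts") or {}
        if ref is not None and ref not in inline and ref not in catalog_hashes:
            problems.append(f"dangling artifact reference {ref}")
    return problems
```

The seal catches edits that every structural check would miss, such as a changed `backend_id`, because any byte change alters the digest.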

`tests/recorder/`

The recorder is the write path: HilbertTape, spans, events, and the content-addressed artifact store. These tests protect the append-only discipline (INV-002), atomic sealing, and the guarantee that every initiated span terminates explicitly (INV-007).

test_inline_artifacts.py

Tier 1 unit tests for the two-tier storage system:

- SpanHandle.attach_inline() correctness
- Hash integrity (key == sha256 of data)
- Routing enforcement (structural kinds rejected inline)
- Inline data appears in JSONL, not in the file store
- Parquet writer preserves inline_artifacts column

TestAttachInlineBasics

Test	What it verifies
`test_returns_sha256_hash`	Returns sha256 hash
`test_hash_matches_data`	Hash matches data
`test_size_bytes_matches_data`	Size bytes matches data
`test_all_fields_present`	All fields present
`test_same_data_same_hash_idempotent`	Same data same hash idempotent — dict is keyed by hash — still just one entry
`test_raises_after_tape_closed`	Raises after tape closed

TestInlineKindEnforcement

Test	What it verifies
`test_circuit_qasm_rejected`	Circuit qasm rejected
`test_calibration_snapshot_rejected`	Calibration snapshot rejected
`test_execution_outcome_allowed`	Execution outcome allowed
`test_parameters_allowed`	Parameters allowed
`test_observables_allowed`	Observables allowed
`test_generic_blob_allowed`	Generic blob allowed
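
Hash-keyed inline storage with kind routing can be sketched as below. The canonical-JSON serialization, the record layout, and the free function `attach_inline` are assumptions for illustration. The real method lives on `SpanHandle` and may serialize differently.

```python
import hashlib
import json

# Kinds that must go to the file store, per the enforcement tests above.
STRUCTURAL_KINDS = {"circuit_qasm", "calibration_snapshot"}


def attach_inline(inline_artifacts, kind, data):
    """Store `data` under its SHA-256 digest and return that digest."""
    if kind in STRUCTURAL_KINDS:
        raise ValueError(f"{kind} must be stored in the file store, not inline")

    # Canonical bytes so that equal data always hashes to the same key.
    payload = json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    inline_artifacts[digest] = {
        "kind": kind,
        "size_bytes": len(payload),
        "data": data,
    }
    return digest
```

Keying the dict by digest makes repeated attachment of the same data idempotent: the second call overwrites an identical entry instead of adding a new one.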

TestStorageRouting

Test	What it verifies
`test_inline_artifact_not_written_to_disk`	Inline artifact not written to disk — artifacts directory must be empty (no files, not even shard dirs with files)
`test_inline_artifact_not_in_catalog`	Inline artifact not in catalog
`test_inline_appears_in_jsonl`	Inline appears in jsonl
`test_outcome_ref_resolves_from_inline`	Outcome ref resolves from inline
`test_structural_artifact_still_uses_file_store`	Structural artifact still uses file store

TestParquetWriterInline

Test	What it verifies
`test_inline_artifacts_column_written`	Inline artifacts column written
`test_inline_artifacts_data_round_trips`	Inline artifacts data round trips
`test_spans_without_inline_have_null_column`	Spans without inline have null column — a null cell reads back as None or NaN depending on pandas version

test_invariants.py

Tier 4 property / invariant tests. These test the eight architectural invariants stated in docs/architecture/001_invariants.md, plus hash integrity and storage-triage consistency properties.

TestINV001ObserverEffect

Test	What it verifies
`test_qiskit_estimator_does_not_alter_result`	Proxy result must be bitwise identical to unproxied result.
`test_pennylane_proxy_does_not_alter_result`	PennyLane proxy result must match direct device result.

TestINV002TraceImmutability

Test	What it verifies
`test_write_after_close_raises`	Write after close raises
`test_attach_artifact_after_close_raises`	Attach artifact after close raises
`test_close_idempotent`	Close idempotent
`test_events_jsonl_append_only`	Span data is flushed immediately — JSONL length only grows.

TestPROP007IntegritySeal

Test	What it verifies
`test_seal_present_after_seal`	Seal present after seal
`test_seal_checksum_matches_events_file`	Seal checksum matches events file
`test_artifact_count_includes_inline`	Artifact count includes inline
`test_artifact_count_includes_filestore`	Artifact count includes filestore
`test_seal_absent_while_in_flight`	Before sealing, trace.json is CRASHED_IN_FLIGHT with no seal.

TestINV007FailureVisibility

Test	What it verifies
`test_exception_span_has_error_event`	Exception span has error event
`test_error_event_captures_exception_type`	Error event captures exception type
`test_exception_propagates_to_caller`	Exception propagates to caller
`test_tape_sealed_with_errors_on_outer_exception`	Tape sealed with errors on outer exception

TestPROP001HashIntegrity

Test	What it verifies
`test_inline_artifact_keys_match_sha256`	Inline artifact keys match sha256
`test_all_spans_in_jsonl_pass_hash_check`	All spans in jsonl pass hash check

TestPROP002FileStoreIntegrity

Test	What it verifies
`test_attached_file_hash_matches_disk`	Attached file hash matches disk — Verify file on disk

TestPROP003CatalogConsistency

Test	What it verifies
`test_catalog_entries_match_file_count`	Catalog entries match file count — Files on disk may be < catalog entries if circuits are identical (dedup)
`test_inline_artifacts_not_counted_in_catalog`	Inline artifacts not counted in catalog

TestPROP004SequenceNumbers

Test	What it verifies
`test_sequence_numbers_monotonic_and_unique`	Sequence numbers monotonic and unique
`test_nested_spans_both_get_unique_sequences`	Nested spans both get unique sequences

TestPROP005NoCircuitInline

Test	What it verifies
`test_circuit_qasm_always_in_file_store`	Circuit qasm always in file store

TestPROP006OutcomeRefResolves

Test	What it verifies
`test_inline_outcome_ref_resolves`	Inline outcome ref resolves
`test_file_store_outcome_ref_resolves`	File store outcome ref resolves

test_storage.py

Verifies the PyArrow Parquet conversion engine. Ensures columnar arrays maintain strict integrity against the JSON schema.

Test	What it verifies
`test_parquet_conversion_creates_file`	Parquet conversion creates file
`test_parquet_schema_and_data_integrity`	Parquet schema and data integrity — Read it back into memory to verify column types
`test_missing_jsonl_raises_error`	Missing jsonl raises error

test_tape.py

All I/O is isolated to tmp_path. All model types are imported only from the hilbertbench.models public interface, never from v1_0 directly. The tests adhere strictly to INV-001, INV-003, INV-004, and INV-007.

TestOpen

Test	What it verifies
`test_creates_run_directory`	Creates run directory
`test_dir_name_format`	Dir name format
`test_artifacts_subdir_exists`	Artifacts subdir exists
`test_trace_json_written_on_open`	Trace json written on open
`test_events_jsonl_created_on_open`	Events jsonl created on open

TestTraceLifecycle

Test	What it verifies
`test_sealed_success_on_clean_exit`	Sealed success on clean exit
`test_sealed_with_errors_on_exception`	Sealed with errors on exception
`test_timestamp_end_absent_while_open`	Timestamp end absent while open
`test_timestamp_end_present_after_close`	Timestamp end present after close
`test_tags_persisted`	Tags persisted

TestSpans

Test	What it verifies
`test_span_flushed_immediately_on_close`	Span flushed immediately on close
`test_span_fields_present`	Span fields present
`test_span_nesting_parent_id`	Span nesting parent id — Inner span is closed and flushed first, so it is at index 0
`test_root_span_has_no_parent`	Root span has no parent
`test_sequence_numbers_monotonic_and_unique`	Sequence numbers monotonic and unique
`test_span_event_recorded`	Span event recorded — Should have REQUEST, CALIBRATION_CHECK, RESULT

TestThreadSafety

Test	What it verifies
`test_parallel_spans_do_not_cross_nest`	Per-thread span stack must be independent (threading.local check).
`test_events_jsonl_valid_under_concurrency`	Every line must be valid JSON after 10 concurrent span writers.
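
The two thread-safety properties above can be illustrated with a minimal sketch. It assumes a per-thread span stack held in `threading.local` and a single lock around sequence numbering and file appends. `MiniTape`, `span`, and `run_writers` are hypothetical names for illustration only and are not the hilbertbench recorder API.

```python
import json
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager
from pathlib import Path


class MiniTape:
    """Hypothetical stand-in for the recorder, not the hilbertbench API."""

    def __init__(self, events_path):
        self._path = Path(events_path)
        self._local = threading.local()  # each thread gets its own span stack
        self._lock = threading.Lock()    # serializes sequence numbers and appends
        self._seq = 0

    def _stack(self):
        if not hasattr(self._local, "stack"):
            self._local.stack = []
        return self._local.stack

    @contextmanager
    def span(self, name):
        stack = self._stack()
        parent = stack[-1] if stack else None  # parent comes from this thread only
        stack.append(name)
        try:
            yield
        finally:
            stack.pop()
            with self._lock:
                # Numbering and writing share one lock, so file order matches seq order.
                self._seq += 1
                record = {"seq": self._seq, "name": name, "parent": parent}
                with self._path.open("a", encoding="utf-8") as fh:
                    fh.write(json.dumps(record) + "\n")


def run_writers(n_threads=10):
    """Run n_threads nested-span writers and parse every line that was written."""
    def work(i):
        with tape.span(f"outer-{i}"):
            with tape.span(f"inner-{i}"):
                pass

    with tempfile.TemporaryDirectory() as tmp:
        events = Path(tmp) / "events.jsonl"
        tape = MiniTape(events)
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            list(pool.map(work, range(n_threads)))
        lines = events.read_text(encoding="utf-8").splitlines()
        return [json.loads(line) for line in lines]


records = run_writers()
```

If the stack were shared across threads instead of thread-local, an inner span could pick up another thread's outer span as its parent, which is the cross-nesting failure the first test guards against.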

TestAttach

Test	What it verifies
`test_artifact_copied_to_artifacts_dir`	Artifact copied to artifacts dir — Artifacts use 2-char sharding: each file is stored in a 2-character subdirectory of artifacts/.
`test_catalog_json_written_on_close`	Catalog json written on close
`test_sha256_correct`	Sha256 correct
`test_size_bytes_correct`	Size bytes correct
`test_missing_file_raises`	Missing file raises
`test_compression_stored`	Compression stored
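
As a rough illustration of what the attach tests check, the sketch below copies a file into a sharded directory and records its SHA-256 digest and size. The layout (first two hex characters of the digest as the shard, full digest as the file name) and the `attach` function with its returned fields are assumptions made for this example. The catalog does not state the real path scheme.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def attach(src, artifacts_dir):
    """Copy src into a sharded artifacts directory and return a catalog entry."""
    src = Path(src)
    if not src.is_file():
        raise FileNotFoundError(f"artifact not found: {src}")
    data = src.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    # Assumed layout: <artifacts_dir>/<first two hex chars of digest>/<digest>
    dest = Path(artifacts_dir) / digest[:2] / digest
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dest)
    return {
        "sha256": digest,
        "size_bytes": len(data),
        "shard": digest[:2],
        "stored": dest.is_file(),
    }


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "circuit.qasm"
    source.write_bytes(b"OPENQASM 3.0;\n")
    entry = attach(source, Path(tmp) / "artifacts")

    try:
        attach(Path(tmp) / "does_not_exist.bin", Path(tmp) / "artifacts")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

Sharding by a short prefix keeps any single directory from accumulating every artifact of a long run.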

TestFreezeOnClose

Test	What it verifies
`test_span_after_close_raises`	Span after close raises
`test_attach_after_close_raises`	Attach after close raises
`test_close_idempotent`	Close idempotent
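
Freeze-on-close can be sketched as a closed flag that every write path checks and that `close` sets exactly once. The class below is a hypothetical illustration. The exception type (`RuntimeError`) and attribute names are assumptions, not the recorder's actual behavior.

```python
class MiniTape:
    """Hypothetical recorder illustrating freeze-on-close semantics."""

    def __init__(self):
        self.closed = False
        self.seal_count = 0  # how many times the trace was actually sealed
        self.spans = []

    def span(self, name):
        if self.closed:
            raise RuntimeError("tape is closed; no further spans can be recorded")
        self.spans.append(name)

    def close(self):
        if self.closed:  # second and later calls are no-ops
            return
        self.closed = True
        self.seal_count += 1


tape = MiniTape()
tape.span("execute")
tape.close()
tape.close()  # idempotent: does not seal again

try:
    tape.span("late")
    late_write_rejected = False
except RuntimeError:
    late_write_rejected = True
```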

TestExceptionPath

Test	What it verifies
`test_exception_span_written`	Exception span written — The exception occurred inside the span, so it should be FAILED
`test_original_exception_propagates`	Original exception propagates
`test_exception_attributes_captured`	Exception attributes captured
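
The exception path can be pictured as a context manager that marks the span FAILED, captures the exception details, writes the record, and then re-raises the original error. This is a sketch under assumptions: the `span` helper, the `COMPLETED` default status, and the `exception_type` and `exception_message` field names are illustrative and not taken from the recorder.

```python
from contextlib import contextmanager

events = []  # stands in for events.jsonl


@contextmanager
def span(name):
    record = {"name": name, "status": "COMPLETED"}
    try:
        yield record
    except Exception as exc:
        record["status"] = "FAILED"
        record["exception_type"] = type(exc).__name__
        record["exception_message"] = str(exc)
        raise  # re-raise the original exception unchanged
    finally:
        events.append(record)  # the span is written on both paths


propagated = None
try:
    with span("execute"):
        raise ValueError("backend timeout")
except ValueError as exc:
    propagated = exc
```

Using a bare `raise` inside the `except` block preserves the original exception object and traceback, so callers see the same error they would have seen without the recorder.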

`tests/tools/`¶

The blinded-corpus protocol tool: leakage auditing, verbatim blinding with random IDs, SHA-256 answer-key commitments, and confusion-matrix scoring with Wilson intervals.

test_blind_corpus.py¶

Tests for the blinded-corpus protocol tool (tools/blind_corpus.py): leakage audit, blinding round-trip, commitment verification, and confusion-matrix scoring.

TestAudit

Test	What it verifies
`test_clean_run_passes`	Clean run passes
`test_label_in_tags_is_flagged`	Label in tags is flagged
`test_label_in_dirname_is_flagged`	Label in dirname is flagged

TestBlind

Test	What it verifies
`test_blinding_roundtrip`	Blinding roundtrip — blinded copies + key + commitment + sheet all exist
`test_leaky_corpus_is_refused`	Leaky corpus is refused
`test_invalid_label_is_refused`	Invalid label is refused
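
A minimal sketch of blinding with random IDs plus a SHA-256 commitment follows. It assumes the commitment is a hash over a canonical JSON encoding of the answer key. The functions `blind`, `commit`, and `verify`, the key layout, and the labels `label_x` and `label_y` are hypothetical and do not describe tools/blind_corpus.py itself.

```python
import hashlib
import json
import secrets


def commit(key):
    """SHA-256 over a canonical JSON encoding of the answer key."""
    canonical = json.dumps(key, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def blind(labels):
    """Map each run to a random ID; return the answer key and its commitment."""
    key = {}
    for run_name, label in labels.items():
        blind_id = secrets.token_hex(8)  # random, so it carries no label information
        key[blind_id] = {"run": run_name, "label": label}
    return key, commit(key)


def verify(key, commitment):
    """True only if the key still hashes to the published commitment."""
    return commit(key) == commitment


key, commitment = blind({"run_a": "label_x", "run_b": "label_y"})

# Tamper with a deep copy: flip one label so the content is guaranteed to change.
tampered = json.loads(json.dumps(key))
first_id = next(iter(tampered))
old = tampered[first_id]["label"]
tampered[first_id]["label"] = "label_y" if old == "label_x" else "label_x"
```

Publishing the commitment before diagnosis lets a scorer later show that the answer key was not edited after the fact.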

TestScore

Test	What it verifies
`test_perfect_diagnosis_scores_one`	Perfect diagnosis scores one
`test_wrong_diagnosis_scores_zero`	Wrong diagnosis scores zero
`test_secondary_label_counts_in_top2`	Secondary label counts in top2 — primary wrong, but the true label is given as the secondary
`test_missing_primary_is_refused`	Missing primary is refused
`test_tampered_key_fails_commitment`	Tampered key fails commitment — flip the label so the tamper is guaranteed to change content
`test_wilson_interval_sane`	Wilson interval sane
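
For reference, the Wilson score interval and a top-1/top-2 scoring rule can be sketched as below. The interval formula is the standard one. The `score` function, the `primary` and `secondary` field names, and the labels are assumptions for illustration, not the tool's actual schema.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def score(truth, diagnoses):
    """Top-1 and top-2 accuracy; a missing primary label is an error."""
    top1 = top2 = 0
    for blind_id, true_label in truth.items():
        diagnosis = diagnoses[blind_id]
        if not diagnosis.get("primary"):
            raise ValueError(f"{blind_id}: primary label is required")
        hit1 = diagnosis["primary"] == true_label
        top1 += hit1
        top2 += hit1 or diagnosis.get("secondary") == true_label
    n = len(truth)
    return {"top1": top1 / n, "top2": top2 / n, "top1_ci": wilson_interval(top1, n)}


truth = {"id1": "label_x", "id2": "label_y"}
diagnoses = {
    "id1": {"primary": "label_x"},
    "id2": {"primary": "label_x", "secondary": "label_y"},  # primary wrong, secondary right
}
result = score(truth, diagnoses)
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and remains informative at accuracies of exactly 0 or 1, which matters for small corpora.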

`tests/trace/`¶

HilbertTrace is the public read API. These tests guarantee that whatever the recorder wrote, the reader resolves back exactly — spans, outcomes, parameters, circuits, calibration history — without the caller knowing about storage details.

test_hilberttrace.py¶

Tests for the HilbertTrace unified data API. Traces are built with the tape directly (no quantum execution) so the resolution logic — inline vs file-store, scalar vs array vs dict outcomes — is exercised deterministically.

TestConstruction

Test	What it verifies
`test_missing_directory_raises`	Missing directory raises
`test_directory_without_events_raises`	Directory without events raises
`test_repr`	Repr

TestMetadata

Test	What it verifies
`test_status_mode_tags`	Status mode tags
`test_integrity_seal_present`	Integrity seal present
`test_environment`	Environment

TestSpanAccess

Test	What it verifies
`test_len_and_iteration`	Len and iteration
`test_completed_filter`	Completed filter
`test_filter_by_backend`	Filter by backend
`test_dataframe_view`	Dataframe view

TestInlineResolution

Test	What it verifies
`test_outcome_resolves`	Outcome resolves
`test_parameters_resolve`	Parameters resolve
`test_observables_resolve`	Observables resolve
`test_missing_parameters_returns_none`	Missing parameters returns none

TestFileStoreResolution

Test	What it verifies
`test_circuit_resolves_from_file`	Circuit resolves from file
`test_outcome_inline_circuit_filestore_same_span`	A span may mix storage tiers: inline outcome + file-store circuit.
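
The inline versus file-store distinction can be illustrated with a small resolver. The ref schema here (`tier`, `value`, `path`) and the `resolve` function are hypothetical. The catalog only says that a reader resolves both tiers without the caller knowing which one was used.

```python
import json
import tempfile
from pathlib import Path


def resolve(ref, run_dir):
    """Return the payload behind a ref, or None when the span has no such ref."""
    if ref is None:
        return None
    if ref["tier"] == "inline":
        return ref["value"]  # payload is embedded in the span record
    if ref["tier"] == "file":
        path = Path(run_dir) / ref["path"]
        return json.loads(path.read_text(encoding="utf-8"))
    raise ValueError(f"unknown storage tier: {ref['tier']!r}")


with tempfile.TemporaryDirectory() as run_dir:
    circuit_file = Path(run_dir) / "circuit.json"
    circuit_file.write_text(json.dumps({"qasm": "OPENQASM 3.0;"}), encoding="utf-8")

    # One span mixing both tiers: inline outcome, file-store circuit.
    span = {
        "outcome_ref": {"tier": "inline", "value": 0.42},
        "circuit_ref": {"tier": "file", "path": "circuit.json"},
    }
    outcome = resolve(span["outcome_ref"], run_dir)
    circuit = resolve(span["circuit_ref"], run_dir)
    parameters = resolve(span.get("parameters_ref"), run_dir)
```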

TestNumericOutcomes

Test	What it verifies
`test_scalars`	Scalars
`test_arrays_flattened`	Arrays flattened
`test_counts_dict_skipped`	Sampler-style counts dicts are not numeric outcomes.
`test_variance_matches_manual`	Variance matches manual
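
One way to picture the numeric-outcome rules is a flattener that keeps scalars, expands arrays, and skips counts dicts. This is a sketch only: `numeric_outcomes` is a hypothetical helper, and the use of population variance is an assumption, since the catalog does not say which variance convention the reader applies.

```python
import statistics


def numeric_outcomes(outcomes):
    """Flatten scalars and (nested) arrays into floats; skip dict outcomes."""
    values = []
    for item in outcomes:
        if isinstance(item, dict):
            continue  # sampler-style counts dicts are not numeric outcomes
        if isinstance(item, (list, tuple)):
            values.extend(numeric_outcomes(item))
        elif isinstance(item, (int, float)) and not isinstance(item, bool):
            values.append(float(item))
    return values


raw = [0.5, [1.0, 1.5], {"00": 512, "11": 512}, 2.0]
values = numeric_outcomes(raw)

# Manual population variance, compared against the library result.
mean = sum(values) / len(values)
manual_variance = sum((v - mean) ** 2 for v in values) / len(values)
library_variance = statistics.pvariance(values)
```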

TestCalibrationAndVerify

Test	What it verifies
`test_calibration_none_when_absent`	Calibration none when absent
`test_calibration_resolves_when_present`	Calibration resolves when present
`test_verify_passes_on_clean_trace`	Verify passes on clean trace

TestLazyImport

Test	What it verifies
`test_top_level_import`	Top level import
`test_unknown_attribute_raises`	Unknown attribute raises
