LIS AI Validation Framework

Auditable, workflow-level validation artifacts for AI agents in Laboratory Information Systems

The Challenge

Traditional LIS validation assumes deterministic, rule-based systems. AI agents introduce emergent capabilities that escape change-based validation.

CAP GEN.43875 requires validation "based on changes made."
But you can't validate changes you don't know exist.

When you update your LIS AI from GPT-4 to Claude Sonnet 4.5:

  • Documented: "Improved reasoning"
  • What emerged: Proactive aliquot swap detection
  • Validation scope: unknown, because the new behavior was never documented

AI agents in regulated industries need workflow-level validation, not just threshold accuracy.

Our Approach

We are building a library of Terminal Bench validation tasks that provide auditable, reproducible validation artifacts grounded in real laboratory practice.

  • Workflow-Level Testing: Test reasoning across analytes and workflows, not just individual thresholds
  • Auditable Artifacts: Versioned, reproducible tasks for regulatory compliance
  • Real Failure Modes: Grounded in established laboratory practices and actual safety risks
  • Terminal Bench Standard: Standardized evaluation with the Harbor execution framework

First Validated Artifact

LIS Swap & Contamination Triage is the first auditable, reproducible, validated task in a growing library.

What It Tests

This Terminal Bench task evaluates whether AI agents can correctly triage laboratory specimens for:

  • EDTA contamination: Elevated potassium (K) and depressed calcium (Ca) caused by EDTA carryover into the sample tube
  • Identity swaps: Specimens assigned to the wrong patients
  • Normal results: Safe to release
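
The three triage outcomes above can be illustrated with a minimal Python sketch. It is not the task's reference implementation: the label names, the `Specimen` structure, and every numeric cut-off are hypothetical values chosen for illustration, not clinical thresholds. The sketch flags EDTA contamination from the combined potassium/calcium pattern and uses a simple delta check against a patient's prior results as a stand-in for swap detection.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Illustrative cut-offs only (assumptions, not from the task or any guideline).
K_HIGH_MMOL_L = 6.0        # potassium at or above this is treated as elevated
CA_LOW_MMOL_L = 2.0        # calcium at or below this is treated as depressed
DELTA_FRACTION = 0.5       # relative change vs. prior result considered implausible
MIN_SHIFTED_ANALYTES = 2   # implausible shifts needed before suspecting a swap


@dataclass
class Specimen:
    specimen_id: str
    results: Dict[str, float]                  # analyte -> current value
    prior: Optional[Dict[str, float]] = None   # analyte -> patient's previous value


def looks_like_edta(results: Dict[str, float]) -> bool:
    """Cross-analyte pattern: high K together with low Ca in the same specimen."""
    k, ca = results.get("K"), results.get("Ca")
    if k is None or ca is None:
        return False
    return k >= K_HIGH_MMOL_L and ca <= CA_LOW_MMOL_L


def looks_like_swap(spec: Specimen) -> bool:
    """Delta check: several analytes shifted implausibly from the prior results."""
    if not spec.prior:
        return False
    shifted = 0
    for analyte, previous in spec.prior.items():
        current = spec.results.get(analyte)
        if current is None or previous == 0:
            continue
        if abs(current - previous) / abs(previous) >= DELTA_FRACTION:
            shifted += 1
    return shifted >= MIN_SHIFTED_ANALYTES


def triage(spec: Specimen) -> str:
    """Return one of the hypothetical labels CONTAMINATION, SWAP, or NORMAL."""
    if looks_like_edta(spec.results):
        return "CONTAMINATION"
    if looks_like_swap(spec):
        return "SWAP"
    return "NORMAL"
```

Evaluating potassium and calcium jointly, rather than each against its own range, is the point of the example. Any real rule set would need locally validated limits.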

Why This Matters

  • Threshold-only validation can pass: each individual value may fall within its own range
  • Workflow-level reasoning can still fail: agents must detect patterns across analytes
  • Decisions are safety-critical: zero unsafe releases are required

Evaluation Criteria

  • F1 ≥ 0.80: Precision & recall
  • Zero Unsafe Releases: Safety constraint
  • False Hold ≤ 0.34: Minimize false positives
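
One way the three criteria could be combined into a single pass/fail gate is sketched below in Python. The definitions are assumptions, since the criteria above do not spell them out: F1 is taken as the macro average over three hypothetical labels, an unsafe release is a contaminated or swapped specimen labeled `NORMAL`, and the false hold rate is the share of truly normal specimens that were held.

```python
from typing import Dict, Sequence, Union

LABELS = ("CONTAMINATION", "SWAP", "NORMAL")  # hypothetical label names


def f1_for(label: str, truth: Sequence[str], pred: Sequence[str]) -> float:
    """Per-label F1 computed from true positives, false positives, false negatives."""
    tp = sum(t == label and p == label for t, p in zip(truth, pred))
    fp = sum(t != label and p == label for t, p in zip(truth, pred))
    fn = sum(t == label and p != label for t, p in zip(truth, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate(truth: Sequence[str], pred: Sequence[str]) -> Dict[str, Union[float, int, bool]]:
    """Apply the three thresholds from the criteria under the stated assumptions."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")

    # Assumption: macro-averaged F1 across the three labels.
    macro_f1 = sum(f1_for(label, truth, pred) for label in LABELS) / len(LABELS)

    # Assumption: an unsafe release is a problem specimen predicted as NORMAL.
    unsafe_releases = sum(t != "NORMAL" and p == "NORMAL" for t, p in zip(truth, pred))

    # Assumption: false hold rate = held normal specimens / all normal specimens.
    normals = sum(t == "NORMAL" for t in truth)
    false_holds = sum(t == "NORMAL" and p != "NORMAL" for t, p in zip(truth, pred))
    false_hold_rate = false_holds / normals if normals else 0.0

    passed = macro_f1 >= 0.80 and unsafe_releases == 0 and false_hold_rate <= 0.34
    return {
        "macro_f1": macro_f1,
        "unsafe_releases": unsafe_releases,
        "false_hold_rate": false_hold_rate,
        "passed": passed,
    }
```

Treating the unsafe-release count as a hard gate rather than folding it into an average reflects the safety constraint: under this sketch, a single missed contamination or swap fails the run regardless of the F1 score.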

View Task on GitHub

Join the Community

We welcome contributions from the laboratory community to expand this framework
with additional workflow-level validation tasks.