Skip to content

OpenVINO Runtime

OpenMed can export ONNX token-classification graphs to OpenVINO IR for Intel CPU, GPU, and NPU edge deployments:

.venv/bin/python -m openmed.onnx.convert \
  --model OpenMed/example-token-classifier \
  --output dist/example-openvino \
  --profile openvino

The profile writes the ONNX source graph plus an OpenVINO IR directory:

dist/example-openvino/
  model.onnx
  config.json
  id2label.json
  openmed-onnx.json
  openvino/
    model.xml
    model.bin

openmed-onnx.json records the openvino-ir artifact and includes synthetic verification metadata. The verifier runs a synthetic note through ONNX Runtime and the exported OpenVINO graph, compares logits within a fixed tolerance, and checks that decoded token spans match.

Runtime Session

Use OpenVinoTokenClassificationSession when loading an exported IR graph:

from openmed.onnx import OpenVinoTokenClassificationSession

session = OpenVinoTokenClassificationSession(
    "dist/example-openvino/openvino/model.xml",
    device="NPU",
)

logits = session.run(input_ids=input_ids, attention_mask=attention_mask)

Device selection is deterministic. The requested device is used when present; otherwise the runtime falls back through CPU, GPU, then NPU, and records whether fallback was used. If OpenVINO reports no devices, session creation fails instead of guessing.

INT8 Quantization

INT8 export uses NNCF post-training quantization and is fail-closed by the G4 recall-delta gate. Callers must provide calibration samples plus per-family recall evidence from the synthetic eval or a precomputed recall-delta payload:

from openmed.onnx import quantize_openvino_int8

result = quantize_openvino_int8(
    "dist/example-openvino/openvino/model.xml",
    "dist/example-openvino/openvino_int8",
    calibration_data=[{"input_ids": input_ids, "attention_mask": attention_mask}],
    family="bert",
    candidate_recall={"PERSON": 0.990},
    parent_recall={"PERSON": 0.992},
)

If recall evidence is missing, or any evaluated G1/G2 label loses at least the INT8 threshold, OpenVinoQuantizationRejected is raised and no INT8 artifact is accepted.

Benchmark Records

OpenVINO device results are written with the standard BenchmarkReport schema:

from openmed.onnx import OpenVinoBenchmarkRecord, write_openvino_benchmark_report

write_openvino_benchmark_report(
    "dist/example-openvino/openvino-benchmark.report.json",
    model_name="OpenMed/example-token-classifier",
    records=[
        OpenVinoBenchmarkRecord(
            device="CPU",
            precision="float32",
            latency_ms=4.0,
            throughput_items_per_second=250.0,
            sample_count=3,
            sequence_length=128,
        )
    ],
)

Each device record contains latency, throughput, precision, batch size, and optional sequence length under metrics.devices.<device>.