Advanced NER & Output Formatting¶

Post-processing matters as much as the base model. OpenMed codifies the heuristics from the public demos so you can go from noisy token output to high-quality, copy-pasteable spans in a few lines.

Advanced NER processor¶

openmed.processing.advanced_ner.AdvancedNERProcessor applies the same filtering stack used in the OpenMed Gradio app:

Confidence filtering with a configurable threshold (min_confidence).
Punctuation-only and short-span removal.
Regex-based exclusions for common false positives.
Optional edge stripping and gap-aware merging of adjacent entities.
Smart BIO grouping fixes overlapping spans when aggregation_strategy=None.

from openmed.processing.advanced_ner import create_advanced_processor

processor = create_advanced_processor(
    min_confidence=0.65,
    merge_adjacent=True,
    max_merge_gap=8,
)

raw = pipeline(text)  # HF token-classification output
entities = processor.process_pipeline_output(text, raw)

for span in entities:
    print(span.label, span.text, span.score)

Use it when you need deterministic filtering outside of analyze_text or when you operate on raw tokens.

OutputFormatter & PredictionResult¶

openmed.processing.OutputFormatter normalizes predictions into dictionaries, JSON strings, HTML snippets, or CSV rows. The dataclasses in openmed/processing/outputs.py ensure the payload stays type-safe and ready for logging.

from openmed.processing import format_predictions

formatted = format_predictions(
    raw_predictions,
    original_text,
    model_name="Disease Detection",
    include_confidence=True,
    confidence_threshold=0.6,
    group_entities=True,
)

print(formatted.entities[0].to_dict())
print(formatted.to_dict())

HTML output¶

from openmed.processing.outputs import OutputFormatter

formatter = OutputFormatter(group_entities=True)
result = formatter.format_predictions(raw_predictions, text, model_name="Oncology")
html = formatter.to_html(result, tag_colors={"Cancer": "#f97316"})

The HTML helper wraps highlighted spans with semantic tags (data-entity="Cancer") so your dashboards can apply custom styles or tooltips.

CSV output¶

csv_lines = formatter.to_csv(result)
print("\n".join(csv_lines[:5]))

CSV export is handy when you need to feed BI tools or spreadsheets without additional ETL code.

Sentence spans & metadata¶

analyze_text attaches sentence spans (when pySBD is enabled) and forwards metadata objects so each entity can carry extra context (e.g., the originating service, clinical section, or ontological hints).
The formatter ensures confidence, start, and end offsets are normalized to built-in float/int so serializing to JSON never fails due to NumPy/PyTorch dtypes.

Guardrails¶

Pair the formatter with validation helpers from openmed.utils.validation:

from openmed.utils.validation import (
    validate_confidence_threshold,
    validate_output_format,
    validate_batch_size,
)

These guardrails keep API endpoints resilient against out-of-range parameters and malformed payloads.