Analyze Text Helper
openmed.analyze_text is the top-level orchestrator that most users start with. It validates input, spins up a token-classification pipeline, segments sentences with pySBD, and normalizes the output so you can copy dict/JSON/HTML/CSV payloads straight into downstream systems.
Quick reference
from openmed import analyze_text

result = analyze_text(
    text="Patient started on imatinib for chronic myeloid leukemia.",
    model_name="disease_detection_superclinical",
    aggregation_strategy="simple",
    output_format="json",
    include_confidence=True,
    confidence_threshold=0.55,
    group_entities=True,
    metadata={"source": "clinic-note-42"},
)
print(result[:200])  # JSON string in this configuration
Key arguments
- model_name: registry alias, full Hugging Face id, or local model directory. Use openmed.list_models() if you need auto-discovery.
- model_id: alias for model_name, supported for API-style callers that use model-id terminology.
- aggregation_strategy: forwarded to the HF pipeline. simple (default) yields grouped tokens; None keeps raw tokens.
- output_format: "dict" (default, returns PredictionResult), "json", "html", or "csv".
- include_confidence & confidence_threshold: control the final payload; defaults keep all scores.
- group_entities: merge adjacent spans of the same label after formatting.
- formatter_kwargs: forwarded to openmed.processing.format_predictions.
- Sentence options (sentence_detection, sentence_language, sentence_clean, sentence_segmenter) wrap pySBD so each prediction carries the sentence span; disable them if latency matters more than helper metadata.
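To make confidence_threshold and group_entities more concrete, here is a minimal, self-contained sketch of what thresholding followed by adjacent-span merging could look like. It is not OpenMed's implementation: the function name filter_and_group, the dict keys (label, start, end, score), the "at most one character apart" adjacency rule, and the choice to keep the lower score when merging are all assumptions made for illustration.

```python
from typing import Any, Dict, List

Entity = Dict[str, Any]


def filter_and_group(entities: List[Entity], threshold: float = 0.0) -> List[Entity]:
    """Drop low-confidence entities, then merge touching spans that share a label."""
    kept = [e for e in entities if e["score"] >= threshold]
    kept.sort(key=lambda e: e["start"])

    merged: List[Entity] = []
    for ent in kept:
        prev = merged[-1] if merged else None
        # Assumption: "adjacent" means at most one character (e.g. a space) apart.
        if (
            prev is not None
            and prev["label"] == ent["label"]
            and ent["start"] - prev["end"] <= 1
        ):
            prev["end"] = ent["end"]
            prev["score"] = min(prev["score"], ent["score"])  # keep the lower score
        else:
            merged.append(dict(ent))  # copy so the caller's dicts stay untouched
    return merged


text = "chronic myeloid leukemia treated with imatinib"
entities = [
    {"label": "DISEASE", "start": 0, "end": 7, "score": 0.91},
    {"label": "DISEASE", "start": 8, "end": 24, "score": 0.88},
    {"label": "DRUG", "start": 38, "end": 46, "score": 0.40},
]
for ent in filter_and_group(entities, threshold=0.55):
    print(ent["label"], text[ent["start"]:ent["end"]])
```

With a threshold of 0.55 the low-scoring DRUG span is dropped and the two DISEASE spans collapse into one covering "chronic myeloid leukemia". The real helper may order or score these steps differently.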
Chunking & truncation
result = analyze_text(
    text=long_report,
    model_name="pharma_detection_superclinical",
    aggregation_strategy=None,  # work with raw tokens
    max_length=512,             # forwarded to HF pipeline
    truncation=True,            # enforce length (default)
    sentence_detection=False,   # skip pySBD to save ~2ms per note
)
When you need full control over tokenizer behaviour:
- Pass max_length/truncation via pipeline_kwargs. If you skip truncation, the helper sets the tokenizer max length to unlimited (0) so HF pipelines accept longer inputs.
- Provide batch_size or num_workers in pipeline_kwargs and they will be forwarded to the pipeline call but not to the constructor.
- Enable medical token remapping with OpenMedConfig(use_medical_tokenizer=True) to group outputs onto clinical-friendly tokens without changing the model tokenizer.
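The routing of batch_size and num_workers described above can be pictured as a simple dictionary split. The sketch below is an assumption about the general shape of that logic, not the helper's actual code: split_pipeline_kwargs and CALL_TIME_KEYS are hypothetical names, and the "everything else" bucket is simply whatever the caller passed beyond those two keys.

```python
from typing import Any, Dict, Optional, Tuple

# Assumption: only these keys are treated as call-time options in this sketch.
CALL_TIME_KEYS = {"batch_size", "num_workers"}


def split_pipeline_kwargs(
    pipeline_kwargs: Optional[Dict[str, Any]],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (remaining_kwargs, call_kwargs) from one user-supplied dict."""
    remaining = dict(pipeline_kwargs or {})  # copy so the caller's dict is not mutated
    call_kwargs = {
        key: remaining.pop(key) for key in list(remaining) if key in CALL_TIME_KEYS
    }
    return remaining, call_kwargs


remaining, call_kwargs = split_pipeline_kwargs(
    {"max_length": 512, "truncation": True, "batch_size": 8}
)
print(remaining)    # options left for the rest of the helper
print(call_kwargs)  # options applied when the pipeline is invoked
```

Keeping the two groups apart means a call-time option never reaches the pipeline constructor, which is the behaviour the list above describes.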
Loading from a local path
Pass an existing model directory to model_name or model_id when the model files are already present on disk:
import os

from openmed import OpenMedConfig, analyze_text

local_path = os.path.abspath("./models/OpenMed-NER-DiseaseDetect-SuperClinical-434M")
config = OpenMedConfig(device="cpu")

result = analyze_text(
    "Patient presents with chronic myeloid leukemia and Type 2 diabetes.",
    model_id=local_path,
    config=config,
)

for entity in result.entities:
    print(entity.text, entity.label)
When the identifier points to an existing local path, OpenMed asks Transformers to load with local_files_only=True by default. That keeps air-gapped deployments from validating or downloading the model from the Hugging Face Hub. If any required tokenizer, config, or weight file is missing, loading fails locally with the underlying Transformers error.
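The local-path behaviour can be approximated with a directory check that decides whether to request offline loading. This is a sketch under stated assumptions rather than OpenMed's code: load_kwargs_for is a hypothetical helper, and the only Transformers detail relied on is that from_pretrained accepts pretrained_model_name_or_path and local_files_only.

```python
import os
import tempfile
from typing import Any, Dict


def load_kwargs_for(identifier: str) -> Dict[str, Any]:
    """Build keyword arguments for a from_pretrained-style loader."""
    if os.path.isdir(identifier):
        # Existing directory: read files from disk only, never contact the Hub.
        return {
            "pretrained_model_name_or_path": os.path.abspath(identifier),
            "local_files_only": True,
        }
    # Anything else is treated as a registry alias or Hub id.
    return {"pretrained_model_name_or_path": identifier}


# A temporary directory stands in for a real model folder.
with tempfile.TemporaryDirectory() as model_dir:
    local_kwargs = load_kwargs_for(model_dir)

# Placeholder id for illustration only, not a real model.
remote_kwargs = load_kwargs_for("some-org/some-model")

print(local_kwargs["local_files_only"])
print("local_files_only" in remote_kwargs)
```

Note that the directory check says nothing about whether the folder is complete; a missing tokenizer, config, or weight file still surfaces later as a Transformers loading error.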
Streaming multiple texts
analyze_text is optimized for single inputs. For batch jobs, keep a ModelLoader instance around and reuse its pipelines:
from openmed import ModelLoader, format_predictions

# Build the pipeline once, then reuse it for every note.
loader = ModelLoader()
pipeline = loader.create_pipeline("disease_detection_superclinical")

for note in notes:
    raw = pipeline(note, batch_size=16)
    formatted = format_predictions(raw, note, model_name="Disease Detection")
    print(formatted.entities[:3])
See ModelLoader & Pipelines for details on caching, GPU selection, and tokenizer reuse.
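The benefit of keeping one loader around comes from building each pipeline only once. The sketch below illustrates that idea with a stand-in factory so it runs without any model download. PipelineCache and fake_factory are hypothetical names, and how ModelLoader caches internally is not shown here.

```python
from typing import Any, Callable, Dict, List


class PipelineCache:
    """Build each pipeline once and hand back the same object afterwards."""

    def __init__(self, factory: Callable[[str], Any]) -> None:
        self._factory = factory
        self._pipelines: Dict[str, Any] = {}

    def get(self, model_name: str) -> Any:
        if model_name not in self._pipelines:
            self._pipelines[model_name] = self._factory(model_name)
        return self._pipelines[model_name]


# Stand-in factory that counts how often a (pretend) model load happens.
load_count = {"n": 0}


def fake_factory(model_name: str) -> Callable[[str], List[Any]]:
    load_count["n"] += 1
    return lambda note: []  # a real pipeline would return predictions


cache = PipelineCache(fake_factory)
for note in ["note one", "note two", "note three"]:
    pipe = cache.get("disease_detection_superclinical")
    pipe(note)

print(load_count["n"])  # number of times the factory was invoked
```

In practice the factory argument could be something like loader.create_pipeline, so the expensive load runs once per model name rather than once per note.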
HTML/CSV rendering
html = analyze_text(
    text,
    model_name="oncology_detection_superclinical",
    output_format="html",
    formatter_kwargs={
        "html_class": "openmed-highlights",
        "tag_colors": {"CANCER": "#d97706"},
    },
)

csv_rows = analyze_text(
    text,
    model_name="pharma_detection_superclinical",
    output_format="csv",
)
The HTML formatter emits a ready-to-embed snippet for dashboards; CSV mode writes row strings (header + body). Both respect confidence_threshold and group_entities.
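For readers who want to see what a "header + body" CSV payload looks like, here is a small sketch built on the standard csv module. The column names and the entities_to_csv helper are hypothetical, and the columns OpenMed actually emits may differ.

```python
import csv
import io
from typing import Any, Dict, List


def entities_to_csv(entities: List[Dict[str, Any]], threshold: float = 0.0) -> str:
    """Render entities as CSV text: one header row, then one row per kept entity."""
    columns = ["text", "label", "start", "end", "confidence"]  # hypothetical columns
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for ent in entities:
        if ent["confidence"] >= threshold:  # mirror confidence_threshold filtering
            writer.writerow({col: ent[col] for col in columns})
    return buffer.getvalue()


sample = [
    {"text": "imatinib", "label": "DRUG", "start": 19, "end": 27, "confidence": 0.97},
    {"text": "chronic myeloid leukemia", "label": "DISEASE", "start": 32, "end": 56,
     "confidence": 0.30},
]
print(entities_to_csv(sample, threshold=0.55))
```

Using csv.DictWriter rather than manual string joins keeps quoting correct when an entity's text contains commas or quotes.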
Validation behaviours
Behind the scenes analyze_text calls:
- validate_input: trims whitespace and enforces max lengths.
- validate_model_name: normalizes registry aliases.
- Sentence detection (openmed.processing.sentences): optional pySBD segmentation with language hints.
- OutputFormatter: see Advanced NER & Output Formatting for available kwargs.
If you need custom validation or logging, inject your own OpenMedConfig or reuse a configured ModelLoader.
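As a rough picture of the input validation step (trimming plus a length check), the sketch below shows one way such a check could be written. validate_text is a hypothetical stand-in, not the signature of validate_input, and the 10,000-character limit is an arbitrary value chosen for the example.

```python
def validate_text(text: str, max_length: int = 10_000) -> str:
    """Trim surrounding whitespace and reject empty or oversized input."""
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("text is empty after trimming whitespace")
    if len(cleaned) > max_length:
        # Assumption: oversized input is rejected rather than silently cut.
        raise ValueError(f"text exceeds {max_length} characters")
    return cleaned


print(validate_text("  Patient denies chest pain.  "))
```

A wrapper like this is also a convenient place to add logging before handing the cleaned text to analyze_text.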