API Reference
extract_pii
Extract PII entities from text with intelligent entity merging.
Uses token classification models to detect personally identifiable information including names, emails, phone numbers, addresses, and other HIPAA-protected identifiers.
The smart merging feature uses regex patterns to identify semantic units (dates, SSNs, phone numbers, etc.) and merges fragmented model predictions into complete entities, assigning each merged entity the dominant label among its fragments.
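To make the merging idea concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the `DATE_PATTERN` regex, the `merge_fragments` helper, the dict-based entity shape, and the rule of "dominant label = label covering the most characters" are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Hypothetical pattern for one semantic unit (MM/DD/YYYY dates); illustration only.
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


def merge_fragments(text, entities, pattern=DATE_PATTERN):
    """Merge predictions that overlap one regex match into a single entity.

    `entities` is a list of dicts with "start", "end", "label", "confidence".
    """
    merged, used = [], set()
    for match in pattern.finditer(text):
        # Collect every prediction that overlaps this semantic unit.
        inside = [
            (i, e) for i, e in enumerate(entities)
            if e["start"] < match.end() and e["end"] > match.start()
        ]
        if not inside:
            continue
        # Assumed rule: the dominant label covers the most characters of the match.
        coverage = defaultdict(int)
        for _, e in inside:
            overlap = min(e["end"], match.end()) - max(e["start"], match.start())
            coverage[e["label"]] += overlap
        merged.append({
            "text": match.group(),
            "label": max(coverage, key=coverage.get),
            "start": match.start(),
            "end": match.end(),
            # Assumed rule: keep the highest fragment confidence.
            "confidence": max(e["confidence"] for _, e in inside),
        })
        used.update(i for i, _ in inside)
    # Keep predictions that were not absorbed into a merged unit.
    merged.extend(e for i, e in enumerate(entities) if i not in used)
    return sorted(merged, key=lambda e: e["start"])


text = "DOB: 01/15/1970"
fragments = [
    {"start": 5, "end": 7, "label": "ID", "confidence": 0.62},
    {"start": 7, "end": 15, "label": "DATE", "confidence": 0.91},
]
result = merge_fragments(text, fragments)
```

Here the two fragments collapse into one `DATE` entity spanning characters 5 to 15, because `DATE` covers eight of the ten matched characters.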
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text | str | Input text to analyze | required |
model_name | str | PII detection model (registry key or HuggingFace ID). When the default is used and | _DEFAULT_EN_MODEL |
confidence_threshold | float | Minimum confidence score (0-1) | 0.5 |
config | Optional[OpenMedConfig] | Optional configuration override | None |
use_smart_merging | bool | Enable regex-based semantic unit merging (recommended) | True |
lang | str | ISO 639-1 language code (en, fr, de, it, es, nl, hi, te, pt, ar, ja, tr). Controls which default model and regex patterns are used. | 'en' |
normalize_accents | Optional[bool] | Strip diacritical marks before model inference so that models trained on accent-free text still detect accented names. Entity spans in the result reference the original (accented) text. | None |
loader | Optional['ModelLoader'] | Optional shared model loader to reuse warmed pipelines. | None |
custom_recognizer | Any | Optional deny-list/allow-list recognizer config, | None |
cache_results | bool | Whether to cache this result in the in-process LRU cache. Cached results may contain PHI, but are never saved to disk. | False |
max_cache_entries | int | Maximum number of cached results. | 128 |
Returns:
| Type | Description |
|---|---|
PredictionResult | PredictionResult with detected PII entities |
Example

    >>> from unittest.mock import patch
    >>> from openmed.core.pii import extract_pii
    >>> from openmed.processing.outputs import EntityPrediction, PredictionResult
    >>> fake_result = PredictionResult(
    ...     text="Patient Casey Example called.",
    ...     entities=[
    ...         EntityPrediction(
    ...             text="Casey Example",
    ...             label="NAME",
    ...             confidence=0.98,
    ...             start=8,
    ...             end=21,
    ...         )
    ...     ],
    ...     model_name="fixture-pii-model",
    ...     timestamp="2026-01-01T00:00:00",
    ... )
    >>> with patch("openmed.analyze_text", return_value=fake_result):
    ...     result = extract_pii(
    ...         "Patient Casey Example called.",
    ...         model_name="fixture-pii-model",
    ...         use_smart_merging=False,
    ...     )
    >>> next((entity.text, entity.label) for entity in result.entities)
    ('Casey Example', 'NAME')
deidentify
De-identify text by detecting and redacting PII with intelligent merging.
Implements multiple de-identification strategies for HIPAA compliance:
- mask: Replace with placeholders like [NAME], [EMAIL], etc.
- remove: Remove PII text entirely (empty string)
- replace: Replace with fake but realistic data
- hash: Replace with consistent hashed values for entity linking
- format_preserve: Replace structured identifiers with synthetic values that keep shape and separators, masking unsupported labels
- shift_dates: Shift dates by random offset while preserving intervals
Smart merging uses regex patterns to merge fragmented entities (e.g., a date split into '01' and '/15/1970' is merged into the complete '01/15/1970').
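The difference between the simpler strategies can be sketched in a few lines. This is an illustration only, covering mask, remove, and hash: the `redact` helper, the `(start, end, label)` tuple shape, and the `LABEL_digest` token format are assumptions, not the library's output format.

```python
import hashlib


def redact(text, entities, method="mask"):
    """Apply a simple redaction to each (start, end, label) span in `entities`."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(entities, reverse=True):
        original = text[start:end]
        if method == "mask":
            replacement = f"[{label}]"
        elif method == "remove":
            replacement = ""
        elif method == "hash":
            # The same input always yields the same token, which allows linking.
            digest = hashlib.sha256(original.encode("utf-8")).hexdigest()[:8]
            replacement = f"{label}_{digest}"
        else:
            raise ValueError(f"unsupported method: {method}")
        out = out[:start] + replacement + out[end:]
    return out


text = "Patient Casey Example called."
spans = [(8, 21, "NAME")]
masked = redact(text, spans, "mask")     # 'Patient [NAME] called.'
removed = redact(text, spans, "remove")  # 'Patient  called.'
hashed = redact(text, spans, "hash")
```

Processing spans from the end of the text backwards is the key design choice: replacements change the string length, so left-to-right edits would invalidate the remaining offsets.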
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text | str | Input text to de-identify | required |
method | DeidentificationMethod | De-identification method (mask, remove, replace, hash, shift_dates, format_preserve) | 'mask' |
model_name | str | PII detection model | _DEFAULT_EN_MODEL |
confidence_threshold | float | Minimum confidence for redaction (default 0.7 for safety) | 0.7 |
keep_year | bool | For dates, keep the year unchanged | False |
shift_dates | Optional[bool] | Deprecated alias for | None |
date_shift_days | Optional[int] | Specific number of days to shift when | None |
patient_key | Optional[str \| bytes] | Optional stable patient identifier used only to derive a deterministic HMAC date-shift offset. Raw keys are not logged, persisted, or returned. | None |
date_shift_max_days | Optional[int] | Maximum absolute offset for random or patient-keyed date shifting. Defaults to 365 when | None |
date_shift_secret | Optional[str \| bytes] | Required HMAC key material for patient-keyed offsets. Reuse the same value across sessions to keep offsets stable. | None |
keep_mapping | bool | Keep mapping for re-identification | False |
config | Optional[OpenMedConfig] | Optional configuration override | None |
use_smart_merging | bool | Enable regex-based semantic unit merging (recommended) | True |
use_safety_sweep | bool | Run a deterministic structured-identifier sweep after model detection and before redaction. | True |
lang | str | ISO 639-1 language code (en, fr, de, it, es, nl, hi, te, pt, ar, ja, tr). Controls model selection, regex patterns, and fake data for replacement. | 'en' |
normalize_accents | Optional[bool] | Strip diacritical marks before model inference. | None |
loader | Optional['ModelLoader'] | Optional shared model loader to reuse warmed pipelines. | None |
consistent | bool | When | False |
seed | Optional[int] | Optional integer seed for cross-run reproducibility of | None |
locale | Optional[str] | Faker locale override ( | None |
surrogate_vault | Optional['SurrogateVault'] | Optional cross-document surrogate vault. When provided with | None |
policy | Optional[str] | Optional policy profile name controlling arbitration, action selection, mandatory safety sweep behavior, and reversible mapping. | None |
calibration_thresholds_path | Optional[str \| Path] | Optional thresholds.json artifact path or artifact directory. When provided, per-label calibrated thresholds filter model detections and appear in audit output. | None |
custom_recognizer | Any | Optional deny-list/allow-list recognizer config, | None |
audit | bool | Return a deterministic AuditReport instead of the DeidentificationResult. | False |
cache_results | bool | Whether to cache this result in the in-process LRU cache. Cached results may contain PHI, but are never saved to disk. | False |
max_cache_entries | int | Maximum number of cached results. | 128 |
Returns:
| Type | Description |
|---|---|
DeidentificationResult \| 'AuditReport' | DeidentificationResult with original and de-identified text, or AuditReport when audit=True |
Example

    >>> from datetime import datetime
    >>> from types import SimpleNamespace
    >>> from unittest.mock import patch
    >>> from openmed.core.pii import (
    ...     DeidentificationResult,
    ...     PIIEntity,
    ...     deidentify,
    ... )
    >>> fixture = DeidentificationResult(
    ...     original_text="Patient Casey Example",
    ...     deidentified_text="Patient [NAME]",
    ...     pii_entities=[
    ...         PIIEntity(
    ...             text="Casey Example",
    ...             label="NAME",
    ...             start=8,
    ...             end=21,
    ...             confidence=0.98,
    ...             redacted_text="[NAME]",
    ...         )
    ...     ],
    ...     method="mask",
    ...     timestamp=datetime(2026, 1, 1, 0, 0, 0),
    ...     mapping={"[NAME]": "Casey Example"},
    ... )
    >>> with patch("openmed.core.pipeline.Pipeline") as pipeline_cls:
    ...     pipeline_cls.return_value.run.return_value = SimpleNamespace(
    ...         deidentification_result=fixture
    ...     )
    ...     result = deidentify(
    ...         "Patient Casey Example",
    ...         method="mask",
    ...         keep_mapping=True,
    ...     )
    >>> result.deidentified_text
    'Patient [NAME]'
    >>> result.mapping
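The patient_key, date_shift_secret, and date_shift_max_days parameters describe a deterministic, HMAC-derived date offset. One way such an offset could be derived is sketched below with the standard library. The `keyed_offset_days` helper, the use of SHA-256, and the choice to exclude a zero offset are assumptions for illustration, not the library's actual derivation.

```python
import hashlib
import hmac
from datetime import date, timedelta


def keyed_offset_days(patient_key, secret, max_days=365):
    """Derive a stable, non-zero offset in [-max_days, max_days]."""
    if max_days < 1:
        raise ValueError("max_days must be at least 1")
    if isinstance(patient_key, str):
        patient_key = patient_key.encode("utf-8")
    if isinstance(secret, str):
        secret = secret.encode("utf-8")
    digest = hmac.new(secret, patient_key, hashlib.sha256).digest()
    # Map the first 8 bytes onto 1..max_days, then take the sign from another byte.
    magnitude = int.from_bytes(digest[:8], "big") % max_days + 1
    sign = -1 if digest[8] & 1 else 1
    return sign * magnitude


offset = keyed_offset_days("patient-001", "example-secret")
admit, discharge = date(2024, 3, 1), date(2024, 3, 8)
shifted = (admit + timedelta(days=offset), discharge + timedelta(days=offset))
```

Because every date for the same patient moves by the same offset, the seven-day interval between `admit` and `discharge` is preserved, and reusing the same secret reproduces the same offset in a later session.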
reidentify
Re-identify text using stored mapping.
Restores original PII from de-identified text using the mapping created during de-identification. This works only if the text was de-identified with keep_mapping=True.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
deidentified_text | str | De-identified text | required |
mapping | Mapping[str, str] | Mapping from redacted to original text | required |
Returns:
| Type | Description |
|---|---|
str | Re-identified text with original PII restored |
Example

    >>> from openmed.core.pii import reidentify
    >>> reidentify(
    ...     "Patient [NAME] has record [ID]",
    ...     {"[NAME]": "Casey Example", "[ID]": "MRN-0001"},
    ... )
    'Patient Casey Example has record MRN-0001'
Note
Re-identification works only if keep_mapping=True was used during de-identification. In production, it requires proper authorization and audit logging.
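Conceptually, re-identification is a lookup-and-replace over the redacted tokens. The sketch below shows one way to do that in a single pass. The `restore` helper is hypothetical and is not claimed to match how reidentify is implemented.

```python
import re


def restore(deidentified_text, mapping):
    """Replace each redacted token with its original value in one pass."""
    if not mapping:
        return deidentified_text
    # Longest tokens first, so a longer token wins over a shorter prefix of it.
    tokens = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(token) for token in tokens))
    return pattern.sub(lambda match: mapping[match.group()], deidentified_text)


restored = restore(
    "Patient [NAME] has record [ID]",
    {"[NAME]": "Casey Example", "[ID]": "MRN-0001"},
)
```

A single regex pass avoids a pitfall of chained `str.replace` calls, where a restored value that happens to contain another token would be rewritten a second time.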
analyze_text
Run a token-classification model on text and format the predictions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
text | str | Clinical or biomedical text to analyze. | required |
model_name | str | Registry key, fully-qualified Hugging Face model id, or local model path. | 'disease_detection_superclinical' |
model_id | Optional[str] | Alias for | None |
config | Optional[OpenMedConfig] | Optional :class: | None |
loader | Optional[ModelLoader] | Reuse an existing :class: | None |
aggregation_strategy | Optional[str] | Hugging Face aggregation strategy ( | 'simple' |
output_format | str | | 'dict' |
include_confidence | bool | Whether to include confidence scores in formatted output. | True |
confidence_threshold | Optional[float] | Minimum confidence for entities. | 0.0 |
group_entities | bool | Merge adjacent entities of the same label in the formatted output. | False |
formatter_kwargs | Optional[Dict[str, Any]] | Extra keyword arguments forwarded to :func: | None |
metadata | Optional[Dict[str, Any]] | Optional metadata to attach to the result. | None |
use_fast_tokenizer | bool | Prefer fast tokenizers when available. | True |
sentence_detection | bool | Enable pySBD-powered sentence detection (default: True). | True |
sentence_language | str | Language hint for the sentence detector. | 'en' |
sentence_clean | bool | Whether to enable pySBD's cleaning heuristics. | False |
sentence_segmenter | Optional[Any] | Optional preconstructed pySBD segmenter to reuse. | None |
cache_results | bool | Whether to cache this result in the in-process LRU cache. Cached results may contain PHI, but are never saved to disk. | False |
max_cache_entries | int | Maximum number of cached results. | 128 |
**pipeline_kwargs | Any | Additional arguments passed to :meth: | {} |
Returns:
| Type | Description |
|---|---|
Union[AnalyzeResult, str, List[Dict[str, Any]]] | Analyze result for |
Union[AnalyzeResult, str, List[Dict[str, Any]]] | format. |
Example

    >>> from openmed import analyze_text
    >>> class FixtureLoader:
    ...     config = None
    ...
    ...     def create_pipeline(self, model_name, **kwargs):
    ...         def pipeline(text, **call_kwargs):
    ...             return [
    ...                 {
    ...                     "entity_group": "CONDITION",
    ...                     "score": 0.99,
    ...                     "start": 11,
    ...                     "end": 17,
    ...                     "word": "asthma",
    ...                 }
    ...             ]
    ...
    ...         return pipeline
    ...
    ...     def get_max_sequence_length(self, model_name, tokenizer=None):
    ...         return 128
    >>> result = analyze_text(
    ...     "History of asthma.",
    ...     model_name="fixture-ner-model",
    ...     loader=FixtureLoader(),
    ...     sentence_detection=False,
    ... )
    >>> next((entity.text, entity.label) for entity in result.entities)
    ('asthma', 'CONDITION')
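The confidence_threshold and group_entities parameters of analyze_text can be pictured as post-processing over raw pipeline predictions. The sketch below is only an illustration of that idea. The `filter_and_group` helper, the `max_gap` definition of "adjacent", and keeping the lowest score for a merged span are assumptions, not the library's behaviour.

```python
def filter_and_group(entities, threshold=0.0, group=False, max_gap=1):
    """Drop low-confidence entities, then optionally merge same-label neighbours."""
    kept = sorted(
        (e for e in entities if e["score"] >= threshold),
        key=lambda e: e["start"],
    )
    if not group:
        return kept
    grouped = []
    for entity in kept:
        last = grouped[-1] if grouped else None
        adjacent = (
            last is not None
            and last["entity_group"] == entity["entity_group"]
            and entity["start"] - last["end"] <= max_gap
        )
        if adjacent:
            # Extend the previous span; keep the weakest score as a cautious summary.
            last["end"] = entity["end"]
            last["score"] = min(last["score"], entity["score"])
        else:
            grouped.append(dict(entity))  # copy so the input is not mutated
    return grouped


predictions = [
    {"entity_group": "CONDITION", "score": 0.99, "start": 0, "end": 7},
    {"entity_group": "CONDITION", "score": 0.95, "start": 8, "end": 14},
    {"entity_group": "DRUG", "score": 0.30, "start": 20, "end": 27},
]
result = filter_and_group(predictions, threshold=0.5, group=True)
```

With a threshold of 0.5 the low-scoring `DRUG` prediction is dropped, and the two `CONDITION` spans separated by a single character are merged into one span from 0 to 14.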
list_models
Return available OpenMed model identifiers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
include_registry | bool | Include entries from the bundled registry in addition to entries in the committed manifest. | True |
include_remote | bool | Retained for compatibility; no live discovery is performed. | True |
config | Optional[OpenMedConfig] | Optional custom configuration for model discovery. | None |
BatchProcessor
Process multiple texts efficiently with progress tracking.
Example usage

    from openmed import BatchProcessor, OpenMedConfig

    processor = BatchProcessor(model_name="disease_detection_superclinical")
    texts = ["Patient has diabetes.", "No significant findings."]
    result = processor.process_texts(texts)
    print(result.summary())
__init__(model_name='disease_detection_superclinical', operation='analyze_text', batch_size=8, config=None, loader=None, aggregation_strategy='simple', confidence_threshold=None, group_entities=False, continue_on_error=True, **analyze_kwargs)
Initialize batch processor.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
model_name | str | Model registry key or HuggingFace identifier. | 'disease_detection_superclinical' |
operation | BatchOperation | Which function to call per item: | 'analyze_text' |
batch_size | int | Number of documents to process together per batch. | 8 |
config | Optional[Any] | Optional OpenMedConfig instance. | None |
loader | Optional[Any] | Optional ModelLoader instance to reuse. | None |
aggregation_strategy | Optional[str] | HuggingFace aggregation strategy ( | 'simple' |
confidence_threshold | Optional[float] | Minimum confidence for entities. When not provided, defaults match the selected operation: | None |
group_entities | bool | Whether to group adjacent entities ( | False |
continue_on_error | bool | Continue processing on individual item errors. | True |
**analyze_kwargs | Any | Additional arguments passed to the selected function. | {} |
iter_process(texts, ids=None, *, on_progress=None)
Process texts as an iterator, yielding results one at a time.
This is useful for streaming results or processing very large batches where you don't want to hold all results in memory.
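The streaming pattern described above can be sketched with a plain generator. Everything here is a stand-in: `ItemResult` is a hypothetical substitute for BatchItemResult, the "analysis" is just a length count, and the callback receives two integers rather than a BatchProgress record.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional, Sequence


@dataclass
class ItemResult:  # hypothetical stand-in for BatchItemResult
    id: str
    length: int


def iter_results(
    texts: Sequence[str],
    ids: Optional[Sequence[str]] = None,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> Iterator[ItemResult]:
    """Yield one result per text instead of building a full list."""
    total = len(texts)
    for index, text in enumerate(texts):
        item_id = ids[index] if ids is not None else f"item-{index}"
        result = ItemResult(id=item_id, length=len(text))  # placeholder "analysis"
        if on_progress is not None:
            # Only counts are passed along, never the text itself.
            on_progress(index + 1, total)
        yield result


seen = []
for item in iter_results(
    ["Patient has diabetes.", "No significant findings."],
    on_progress=lambda done, total: seen.append((done, total)),
):
    pass  # each item can be written out or discarded before the next is produced
```

Because the generator produces one result at a time, memory use stays flat regardless of how many texts are processed, provided the caller does not collect the results into a list.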
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
texts | Sequence[str] | Sequence of texts to analyze. | required |
ids | Optional[Sequence[str]] | Optional identifiers for each text. | None |
on_progress | Optional[BatchProgressCallback] | Optional PHI-safe callback that receives a BatchProgress record after each completed item. | None |
Yields:
| Type | Description |
|---|---|
BatchItemResult | BatchItemResult for each processed text. |
process_directory(directory, pattern='*.txt', recursive=False, encoding='utf-8', progress_callback=None, *, on_progress=None)
Process all matching files in a directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
directory | Union[str, Path] | Directory path. | required |
pattern | str | Glob pattern for file matching. | '*.txt' |
recursive | bool | Whether to search recursively. | False |
encoding | str | File encoding. | 'utf-8' |
progress_callback | Optional[ProgressCallback] | Optional callback for progress updates. | None |
on_progress | Optional[BatchProgressCallback] | Optional PHI-safe callback that receives a BatchProgress record after each completed item. | None |
Returns:
| Type | Description |
|---|---|
BatchResult | BatchResult with all processing results. |
process_files(file_paths, encoding='utf-8', progress_callback=None, *, on_progress=None)
Process multiple files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
file_paths | Sequence[Union[str, Path]] | Paths to text files. | required |
encoding | str | File encoding. | 'utf-8' |
progress_callback | Optional[ProgressCallback] | Optional callback for progress updates. | None |
on_progress | Optional[BatchProgressCallback] | Optional PHI-safe callback that receives a BatchProgress record after each completed item. | None |
Returns:
| Type | Description |
|---|---|
BatchResult | BatchResult with all processing results. |
process_items(items, progress_callback=None, *, on_progress=None)
Process a sequence of BatchItem objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
items | Sequence[BatchItem] | Sequence of BatchItem objects. | required |
progress_callback | Optional[ProgressCallback] | Optional callback for progress updates. | None |
on_progress | Optional[BatchProgressCallback] | Optional PHI-safe callback that receives a BatchProgress record after each completed item. | None |
Returns:
| Type | Description |
|---|---|
BatchResult | BatchResult with all processing results. |
process_texts(texts, ids=None, progress_callback=None, *, on_progress=None)
Process multiple texts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
texts | Sequence[str] | Sequence of texts to analyze. | required |
ids | Optional[Sequence[str]] | Optional identifiers for each text. | None |
progress_callback | Optional[ProgressCallback] | Optional callback for progress updates. Signature: callback(completed_count, total_count, result) | None |
on_progress | Optional[BatchProgressCallback] | Optional PHI-safe callback that receives a BatchProgress record after each completed item. | None |
Returns:
| Type | Description |
|---|---|
BatchResult | BatchResult with all processing results. |
PIIEntity
Bases: EntityPrediction
Extends EntityPrediction with PII-specific metadata.
Attributes:
| Name | Type | Description |
|---|---|---|
text | str | The entity text span |
label | str | PII category (NAME, EMAIL, PHONE, etc.) |
start | Optional[int] | Character start position |
end | Optional[int] | Character end position |
confidence | float | Model confidence score (0-1) |
entity_type | str | PII category (same as label) |
redacted_text | Optional[str] | Replacement text after de-identification |
original_text | Optional[str] | Original text before redaction |
hash_value | Optional[str] | Consistent hash for entity linking |
reversible_id | Optional[str] | Optional reversible pseudonymization handle |
__post_init__()
Initialize entity_type from label if not set.
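The fallback from label to entity_type is a standard dataclass pattern. The sketch below uses hypothetical stand-in classes (`Entity`, `PiiEntitySketch`) with a reduced set of fields. The real classes may declare their fields and defaults differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:  # hypothetical stand-in for EntityPrediction
    text: str
    label: str
    confidence: float
    start: Optional[int] = None
    end: Optional[int] = None


@dataclass
class PiiEntitySketch(Entity):  # hypothetical stand-in for PIIEntity
    entity_type: str = ""
    redacted_text: Optional[str] = None

    def __post_init__(self):
        # Fall back to the label when no explicit entity_type was given.
        if not self.entity_type:
            self.entity_type = self.label


entity = PiiEntitySketch(text="Casey Example", label="NAME", confidence=0.98)
explicit = PiiEntitySketch(
    text="Casey Example", label="NAME", confidence=0.98, entity_type="PERSON"
)
```

`__post_init__` runs after the generated `__init__` has assigned all fields, so it can derive one field from another without a custom constructor.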
DeidentificationResult
Result of de-identification operation.
Attributes:
| Name | Type | Description |
|---|---|---|
original_text | str | Input text before de-identification |
deidentified_text | str | Output text with PII redacted |
pii_entities | list[PIIEntity] | List of detected and redacted PII entities |
method | str | De-identification method used |
timestamp | datetime | When de-identification was performed |
mapping | Optional[dict[str, str]] | Optional mapping for re-identification (redacted -> original) |
to_dataframe()
Convert detected PII entities to a pandas DataFrame.
Returns:
| Type | Description |
|---|---|
Any | A pandas DataFrame with one row per detected entity and columns |
Raises:
| Type | Description |
|---|---|
ImportError | If pandas is not installed. |
to_dict()
Convert result to dictionary format.
Returns:
| Type | Description |
|---|---|
dict | Dictionary with all result fields and metadata |