API Reference

extract_pii

Extract PII entities from text with intelligent entity merging.

Uses token classification models to detect personally identifiable information including names, emails, phone numbers, addresses, and other HIPAA-protected identifiers.

The smart merging feature uses regex patterns to identify semantic units (dates, SSN, phone numbers, etc.) and merges fragmented model predictions into complete entities with dominant label selection.
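
The merging step described above can be pictured with a small, self-contained sketch. It is not OpenMed's implementation: the pattern list, the merge_fragments helper, and the (start, end, label) tuple layout are all hypothetical, and "dominant label" is interpreted here as the label that covers the most characters inside a regex match.

```python
import re
from collections import Counter

# Hypothetical semantic-unit patterns; the library's real patterns are not
# shown in this reference.
SEMANTIC_PATTERNS = [
    re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),  # dates such as 01/15/1970
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-shaped identifiers
]


def merge_fragments(text, fragments):
    """Merge (start, end, label) fragments that fall inside one regex match."""
    merged, used = [], set()
    for pattern in SEMANTIC_PATTERNS:
        for match in pattern.finditer(text):
            inside = [
                (i, frag)
                for i, frag in enumerate(fragments)
                if i not in used
                and frag[0] >= match.start()
                and frag[1] <= match.end()
            ]
            if not inside:
                continue
            # Dominant label (assumption): the label covering the most characters.
            coverage = Counter()
            for _, (start, end, label) in inside:
                coverage[label] += end - start
            dominant = coverage.most_common(1)[0][0]
            merged.append((match.start(), match.end(), dominant))
            used.update(i for i, _ in inside)
    # Fragments outside every semantic unit pass through unchanged.
    merged.extend(frag for i, frag in enumerate(fragments) if i not in used)
    return sorted(merged)


text = "DOB: 01/15/1970"
# The model split the date in two and mislabelled the shorter piece.
fragments = [(5, 7, "ID"), (7, 15, "DATE")]
print(merge_fragments(text, fragments))
```

Here the two fragments collapse into a single span covering the whole date, labelled DATE because that label covers eight of the ten characters.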

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | Input text to analyze. | required |
| model_name | str | PII detection model (registry key or HuggingFace ID). When the default is used and lang is not "en", the language-appropriate default model is selected automatically. | _DEFAULT_EN_MODEL |
| confidence_threshold | float | Minimum confidence score (0-1). | 0.5 |
| config | Optional[OpenMedConfig] | Optional configuration override. | None |
| use_smart_merging | bool | Enable regex-based semantic unit merging (recommended). | True |
| lang | str | ISO 639-1 language code (en, fr, de, it, es, nl, hi, te, pt, ar, ja, tr). Controls which default model and regex patterns are used. | 'en' |
| normalize_accents | Optional[bool] | Strip diacritical marks before model inference so that models trained on accent-free text still detect accented names. Entity spans in the result reference the original (accented) text. None (default) auto-enables for languages in _ACCENT_NORMALIZE_LANGS (currently Spanish). | None |
| loader | Optional['ModelLoader'] | Optional shared model loader to reuse warmed pipelines. | None |
| custom_recognizer | Any | Optional deny-list/allow-list recognizer config, CustomRecognizer instance, or JSON/YAML config path. Deny-list matches are added with custom:deny provenance; allow-list matches suppress overlapping spans from any detector. | None |
| cache_results | bool | Whether to cache this result in the in-process LRU cache. Cached results may contain PHI, but are never saved to disk. | False |
| max_cache_entries | int | Maximum number of cached results. | 128 |

Returns:

| Type | Description |
| --- | --- |
| PredictionResult | PredictionResult with detected PII entities. |

Example

    >>> from unittest.mock import patch
    >>> from openmed.core.pii import extract_pii
    >>> from openmed.processing.outputs import EntityPrediction, PredictionResult
    >>> fake_result = PredictionResult(
    ...     text="Patient Casey Example called.",
    ...     entities=[
    ...         EntityPrediction(
    ...             text="Casey Example",
    ...             label="NAME",
    ...             confidence=0.98,
    ...             start=8,
    ...             end=21,
    ...         )
    ...     ],
    ...     model_name="fixture-pii-model",
    ...     timestamp="2026-01-01T00:00:00",
    ... )
    >>> with patch("openmed.analyze_text", return_value=fake_result):
    ...     result = extract_pii(
    ...         "Patient Casey Example called.",
    ...         model_name="fixture-pii-model",
    ...         use_smart_merging=False,
    ...     )
    >>> next((entity.text, entity.label) for entity in result.entities)
    ('Casey Example', 'NAME')

deidentify

De-identify text by detecting and redacting PII with intelligent merging.

Implements multiple de-identification strategies for HIPAA compliance:

  • mask: Replace with placeholders like [NAME], [EMAIL], etc.
  • remove: Remove PII text entirely (empty string)
  • replace: Replace with fake but realistic data
  • hash: Replace with consistent hashed values for entity linking
  • format_preserve: Replace structured identifiers with synthetic values that keep shape and separators, masking unsupported labels
  • shift_dates: Shift dates by random offset while preserving intervals

Smart merging uses regex patterns to merge fragmented entities (e.g., dates split into '01' and '/15/1970' are merged into complete '01/15/1970').
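
To make the strategies listed above concrete, here is a minimal sketch of three of them (mask, remove, hash) applied to already-detected spans. The redact helper, the span tuples, and the LABEL_digest output shape are assumptions made for illustration and are not taken from the library; in particular, the unsalted SHA-256 prefix used here only demonstrates consistency and is not a recommendation for production hashing.

```python
import hashlib


def redact(text, spans, method="mask"):
    """Apply one redaction method to (start, end, label) spans."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        original = text[start:end]
        if method == "mask":
            replacement = f"[{label}]"
        elif method == "remove":
            replacement = ""
        elif method == "hash":
            # Same input always yields the same token, which allows linking.
            digest = hashlib.sha256(original.encode("utf-8")).hexdigest()[:8]
            replacement = f"{label}_{digest}"
        else:
            raise ValueError(f"method not covered by this sketch: {method}")
        out = out[:start] + replacement + out[end:]
    return out


text = "Patient Casey Example called."
spans = [(8, 21, "NAME")]
print(redact(text, spans, "mask"))    # Patient [NAME] called.
print(redact(text, spans, "remove"))  # Patient  called.
print(redact(text, spans, "hash"))
```

Note that remove leaves the surrounding whitespace untouched, so the result contains a double space where the name used to be.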

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | Input text to de-identify. | required |
| method | DeidentificationMethod | De-identification method (mask, remove, replace, hash, shift_dates, format_preserve). | 'mask' |
| model_name | str | PII detection model. | _DEFAULT_EN_MODEL |
| confidence_threshold | float | Minimum confidence for redaction (default 0.7 for safety). | 0.7 |
| keep_year | bool | For dates, keep the year unchanged. | False |
| shift_dates | Optional[bool] | Deprecated alias for method="shift_dates". | None |
| date_shift_days | Optional[int] | Specific number of days to shift when patient_key is omitted. When patient_key is supplied, this is treated as a legacy maximum absolute offset bound unless date_shift_max_days is also supplied. | None |
| patient_key | Optional[str \| bytes] | Optional stable patient identifier used only to derive a deterministic HMAC date-shift offset. Raw keys are not logged, persisted, or returned. | None |
| date_shift_max_days | Optional[int] | Maximum absolute offset for random or patient-keyed date shifting. Defaults to 365 when patient_key is supplied and neither this nor date_shift_days is set. | None |
| date_shift_secret | Optional[str \| bytes] | Required HMAC key material for patient-keyed offsets. Reuse the same value across sessions to keep offsets stable. | None |
| keep_mapping | bool | Keep mapping for re-identification. | False |
| config | Optional[OpenMedConfig] | Optional configuration override. | None |
| use_smart_merging | bool | Enable regex-based semantic unit merging (recommended). | True |
| use_safety_sweep | bool | Run a deterministic structured-identifier sweep after model detection and before redaction. | True |
| lang | str | ISO 639-1 language code (en, fr, de, it, es, nl, hi, te, pt, ar, ja, tr). Controls model selection, regex patterns, and fake data for replacement. | 'en' |
| normalize_accents | Optional[bool] | Strip diacritical marks before model inference. None (default) auto-enables for Spanish. | None |
| loader | Optional['ModelLoader'] | Optional shared model loader to reuse warmed pipelines. | None |
| consistent | bool | When method="replace" or method="format_preserve", generate stable surrogates (same input -> same surrogate within the call). Lets repeated mentions of the same name resolve to one fake identity instead of a different one each time. | False |
| seed | Optional[int] | Optional integer seed for cross-run reproducibility of consistent=True replacements. Implies consistent=True. | None |
| locale | Optional[str] | Faker locale override (pt_BR, en_GB, ...) for method="replace" and method="format_preserve". When None, derived from lang. | None |
| surrogate_vault | Optional['SurrogateVault'] | Optional cross-document surrogate vault. When provided with method="replace", OpenMed stores only HMAC source hashes and reuses the same surrogate for the same label/language/source identifier across calls. | None |
| policy | Optional[str] | Optional policy profile name controlling arbitration, action selection, mandatory safety sweep behavior, and reversible mapping. | None |
| calibration_thresholds_path | Optional[str \| Path] | Optional thresholds.json artifact path or artifact directory. When provided, per-label calibrated thresholds filter model detections and appear in audit output. | None |
| custom_recognizer | Any | Optional deny-list/allow-list recognizer config, CustomRecognizer instance, or JSON/YAML config path. Deny-list matches are redacted with custom:deny provenance; allow-list matches suppress overlapping spans from any detector. | None |
| audit | bool | Return a deterministic AuditReport instead of the DeidentificationResult. | False |
| cache_results | bool | Whether to cache this result in the in-process LRU cache. Cached results may contain PHI, but are never saved to disk. | False |
| max_cache_entries | int | Maximum number of cached results. | 128 |

Returns:

| Type | Description |
| --- | --- |
| DeidentificationResult \| 'AuditReport' | DeidentificationResult with original and de-identified text, or AuditReport when audit=True. |
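
The patient_key, date_shift_secret, and date_shift_max_days parameters describe a deterministic, HMAC-derived date offset. One way such an offset could be computed is sketched below. The patient_offset_days and shift helpers are hypothetical, as are the choice of SHA-256 and the way the digest is mapped to a signed, non-zero number of days; the library's actual derivation is not documented here.

```python
import hashlib
import hmac
from datetime import date, timedelta


def patient_offset_days(patient_key: bytes, secret: bytes, max_days: int = 365) -> int:
    """Derive a stable, non-zero offset in [-max_days, max_days] from an HMAC."""
    digest = hmac.new(secret, patient_key, hashlib.sha256).digest()
    # Map the first 8 bytes onto 1..max_days (assumption: zero is excluded).
    magnitude = int.from_bytes(digest[:8], "big") % max_days + 1
    # Take the sign from the low bit of a further digest byte.
    sign = -1 if digest[8] & 1 else 1
    return sign * magnitude


def shift(d: date, offset: int) -> date:
    return d + timedelta(days=offset)


offset = patient_offset_days(b"patient-123", b"example-secret")
admitted, discharged = date(2020, 3, 1), date(2020, 3, 9)
# Both dates move by the same offset, so the 8-day stay is preserved.
print(shift(discharged, offset) - shift(admitted, offset))
```

Because the offset depends only on the key and the secret, reusing the same secret across sessions gives the same patient the same shift, which matches the stability requirement stated for date_shift_secret.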

Example

    >>> from datetime import datetime
    >>> from types import SimpleNamespace
    >>> from unittest.mock import patch
    >>> from openmed.core.pii import (
    ...     DeidentificationResult,
    ...     PIIEntity,
    ...     deidentify,
    ... )
    >>> fixture = DeidentificationResult(
    ...     original_text="Patient Casey Example",
    ...     deidentified_text="Patient [NAME]",
    ...     pii_entities=[
    ...         PIIEntity(
    ...             text="Casey Example",
    ...             label="NAME",
    ...             start=8,
    ...             end=21,
    ...             confidence=0.98,
    ...             redacted_text="[NAME]",
    ...         )
    ...     ],
    ...     method="mask",
    ...     timestamp=datetime(2026, 1, 1, 0, 0, 0),
    ...     mapping={"[NAME]": "Casey Example"},
    ... )
    >>> with patch("openmed.core.pipeline.Pipeline") as pipeline_cls:
    ...     pipeline_cls.return_value.run.return_value = SimpleNamespace(
    ...         deidentification_result=fixture
    ...     )
    ...     result = deidentify(
    ...         "Patient Casey Example",
    ...         method="mask",
    ...         keep_mapping=True,
    ...     )
    >>> result.deidentified_text
    'Patient [NAME]'
    >>> result.mapping
    {'[NAME]': 'Casey Example'}

reidentify

Re-identify text using stored mapping.

Restores original PII from de-identified text using the mapping created during de-identification. Only works if keep_mapping=True was used.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| deidentified_text | str | De-identified text. | required |
| mapping | Mapping[str, str] | Mapping from redacted to original text. | required |

Returns:

| Type | Description |
| --- | --- |
| str | Re-identified text with original PII restored. |

Example

    >>> from openmed.core.pii import reidentify
    >>> reidentify(
    ...     "Patient [NAME] has record [ID]",
    ...     {"[NAME]": "Casey Example", "[ID]": "MRN-0001"},
    ... )
    'Patient Casey Example has record MRN-0001'

Note

Re-identification only works if keep_mapping=True was used during de-identification. In production, it requires proper authorization and audit logging.
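
Conceptually, restoring text from a redacted-to-original mapping is a sequence of string substitutions. The restore function below is a hypothetical stand-in written for illustration, not the body of reidentify; replacing longer keys first is an assumption made here so that a key which is a prefix of another key cannot clobber it.

```python
def restore(deidentified_text, mapping):
    """Replace each redacted token with its original value."""
    restored = deidentified_text
    # Longest keys first, so overlapping keys are handled predictably.
    for redacted in sorted(mapping, key=len, reverse=True):
        restored = restored.replace(redacted, mapping[redacted])
    return restored


restored = restore(
    "Patient [NAME] has record [ID]",
    {"[NAME]": "Casey Example", "[ID]": "MRN-0001"},
)
print(restored)  # Patient Casey Example has record MRN-0001
```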

analyze_text

Run a token-classification model on text and format the predictions.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | Clinical or biomedical text to analyse. | required |
| model_name | str | Registry key, fully-qualified Hugging Face model id, or local model path. | 'disease_detection_superclinical' |
| model_id | Optional[str] | Alias for model_name. Useful for APIs and examples that name model identifiers as model_id. | None |
| config | Optional[OpenMedConfig] | Optional openmed.core.config.OpenMedConfig instance. | None |
| loader | Optional[ModelLoader] | Reuse an existing openmed.core.models.ModelLoader. | None |
| aggregation_strategy | Optional[str] | Hugging Face aggregation strategy ("simple" by default). Set to None to work with raw token outputs. | 'simple' |
| output_format | str | "dict" (default), "json", "html" or "csv". | 'dict' |
| include_confidence | bool | Whether to include confidence scores in formatted output. | True |
| confidence_threshold | Optional[float] | Minimum confidence for entities. None keeps all. | 0.0 |
| group_entities | bool | Merge adjacent entities of the same label in the formatted output. | False |
| formatter_kwargs | Optional[Dict[str, Any]] | Extra keyword arguments forwarded to openmed.processing.format_predictions. | None |
| metadata | Optional[Dict[str, Any]] | Optional metadata to attach to the result. | None |
| use_fast_tokenizer | bool | Prefer fast tokenizers when available. | True |
| sentence_detection | bool | Enable pySBD-powered sentence detection (default: True). | True |
| sentence_language | str | Language hint for the sentence detector. | 'en' |
| sentence_clean | bool | Whether to enable pySBD's cleaning heuristics. | False |
| sentence_segmenter | Optional[Any] | Optional preconstructed pySBD segmenter to reuse. | None |
| cache_results | bool | Whether to cache this result in the in-process LRU cache. Cached results may contain PHI, but are never saved to disk. | False |
| max_cache_entries | int | Maximum number of cached results. | 128 |
| **pipeline_kwargs | Any | Additional arguments passed to openmed.core.models.ModelLoader.create_pipeline. | {} |

Returns:

| Type | Description |
| --- | --- |
| Union[AnalyzeResult, str, List[Dict[str, Any]]] | Analyze result for "dict" output, otherwise the requested rendered format. |

Example

    >>> from openmed import analyze_text
    >>> class FixtureLoader:
    ...     config = None
    ...
    ...     def create_pipeline(self, model_name, **kwargs):
    ...         def pipeline(text, **call_kwargs):
    ...             return [
    ...                 {
    ...                     "entity_group": "CONDITION",
    ...                     "score": 0.99,
    ...                     "start": 11,
    ...                     "end": 17,
    ...                     "word": "asthma",
    ...                 }
    ...             ]
    ...
    ...         return pipeline
    ...
    ...     def get_max_sequence_length(self, model_name, tokenizer=None):
    ...         return 128
    >>> result = analyze_text(
    ...     "History of asthma.",
    ...     model_name="fixture-ner-model",
    ...     loader=FixtureLoader(),
    ...     sentence_detection=False,
    ... )
    >>> next((entity.text, entity.label) for entity in result.entities)
    ('asthma', 'CONDITION')
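
The group_entities option is described as merging adjacent entities of the same label. The sketch below shows one plausible reading of that behaviour. The group_adjacent helper and its max_gap parameter are hypothetical; how the library decides that two entities are "adjacent" is an assumption here (at most one character between them).

```python
def group_adjacent(text, entities, max_gap=1):
    """Merge neighbouring (start, end, label) spans that share a label."""
    grouped = []
    for start, end, label in sorted(entities):
        if grouped and grouped[-1][2] == label and start - grouped[-1][1] <= max_gap:
            prev_start, prev_end, _ = grouped[-1]
            # Extend the previous span instead of starting a new one.
            grouped[-1] = (prev_start, max(prev_end, end), label)
        else:
            grouped.append((start, end, label))
    return [(text[s:e], label) for s, e, label in grouped]


text = "History of chronic asthma and eczema."
entities = [(11, 18, "CONDITION"), (19, 25, "CONDITION"), (30, 36, "CONDITION")]
print(group_adjacent(text, entities))
```

With these inputs, "chronic" and "asthma" are one space apart and merge into a single entity, while "eczema" stays separate because the word "and" lies between it and the previous span.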

list_models

Return available OpenMed model identifiers.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| include_registry | bool | Include entries from the bundled registry in addition to entries in the committed manifest. | True |
| include_remote | bool | Retained for compatibility; no live discovery is performed. | True |
| config | Optional[OpenMedConfig] | Optional custom configuration for model discovery. | None |

BatchProcessor

Process multiple texts efficiently with progress tracking.

Example usage

    from openmed import BatchProcessor, OpenMedConfig

    processor = BatchProcessor(model_name="disease_detection_superclinical")
    texts = ["Patient has diabetes.", "No significant findings."]
    result = processor.process_texts(texts)
    print(result.summary())

__init__(model_name='disease_detection_superclinical', operation='analyze_text', batch_size=8, config=None, loader=None, aggregation_strategy='simple', confidence_threshold=None, group_entities=False, continue_on_error=True, **analyze_kwargs)

Initialize batch processor.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_name | str | Model registry key or HuggingFace identifier. | 'disease_detection_superclinical' |
| operation | BatchOperation | Which function to call per item: "analyze_text" (default), "extract_pii" or "deidentify". Extra kwargs passed via **analyze_kwargs are forwarded to the selected function. | 'analyze_text' |
| batch_size | int | Number of documents to process together per batch. | 8 |
| config | Optional[Any] | Optional OpenMedConfig instance. | None |
| loader | Optional[Any] | Optional ModelLoader instance to reuse. | None |
| aggregation_strategy | Optional[str] | HuggingFace aggregation strategy (analyze_text operation only). | 'simple' |
| confidence_threshold | Optional[float] | Minimum confidence for entities. When not provided, defaults match the selected operation: 0.0 for analyze_text, 0.5 for extract_pii, and 0.7 for deidentify. | None |
| group_entities | bool | Whether to group adjacent entities (analyze_text operation only). | False |
| continue_on_error | bool | Continue processing on individual item errors. | True |
| **analyze_kwargs | Any | Additional arguments passed to the selected function. | {} |
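
The per-operation defaults for confidence_threshold amount to a small lookup, sketched below. The DEFAULT_THRESHOLDS table and resolve_threshold helper are hypothetical names; only the three default values come from the parameter description above.

```python
# Defaults as documented for each operation.
DEFAULT_THRESHOLDS = {
    "analyze_text": 0.0,
    "extract_pii": 0.5,
    "deidentify": 0.7,
}


def resolve_threshold(operation, confidence_threshold=None):
    """Return the explicit threshold, or the operation's documented default."""
    if operation not in DEFAULT_THRESHOLDS:
        raise ValueError(f"unknown operation: {operation}")
    # Compare against None so that an explicit 0.0 is respected.
    if confidence_threshold is None:
        return DEFAULT_THRESHOLDS[operation]
    return confidence_threshold


print(resolve_threshold("deidentify"))        # 0.7
print(resolve_threshold("extract_pii", 0.0))  # 0.0
```

The None check matters: a truthiness test would silently turn an explicit 0.0 into the operation default.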

iter_process(texts, ids=None, *, on_progress=None)

Process texts as an iterator, yielding results one at a time.

This is useful for streaming results or processing very large batches where you don't want to hold all results in memory.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| texts | Sequence[str] | Sequence of texts to analyze. | required |
| ids | Optional[Sequence[str]] | Optional identifiers for each text. | None |
| on_progress | Optional[BatchProgressCallback] | Optional PHI-safe callback that receives a BatchProgress record after each completed item. | None |

Yields:

| Type | Description |
| --- | --- |
| BatchItemResult | BatchItemResult for each processed text. |
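
The streaming pattern behind iter_process can be illustrated with a plain generator. Everything in this sketch is a stand-in: the Progress dataclass is a hypothetical substitute for BatchProgress (whose fields are not listed in this reference), and iter_results takes an arbitrary analyze callable instead of running a model. The point is that each item is processed only when the consumer asks for it, and that the progress record carries counts rather than document text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Progress:
    """Hypothetical progress record: counts only, no document text."""
    completed: int
    total: int
    failed: int


def iter_results(texts, analyze, on_progress=None):
    """Yield one outcome per text, reporting progress after each item."""
    failed = 0
    for index, text in enumerate(texts, start=1):
        try:
            outcome = ("ok", analyze(text))
        except Exception as exc:
            failed += 1
            # Keep only the exception type; messages could echo input text.
            outcome = ("error", type(exc).__name__)
        if on_progress is not None:
            on_progress(Progress(index, len(texts), failed))
        yield outcome


seen = []
results = iter_results(["a", "bb", "ccc"], len, on_progress=seen.append)
first = next(results)        # only the first text has been processed so far
seen_after_first = len(seen)
rest = list(results)         # drains the remaining two items
print(first, seen_after_first, rest)
```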

process_directory(directory, pattern='*.txt', recursive=False, encoding='utf-8', progress_callback=None, *, on_progress=None)

Process all matching files in a directory.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| directory | Union[str, Path] | Directory path. | required |
| pattern | str | Glob pattern for file matching. | '*.txt' |
| recursive | bool | Whether to search recursively. | False |
| encoding | str | File encoding. | 'utf-8' |
| progress_callback | Optional[ProgressCallback] | Optional callback for progress updates. | None |
| on_progress | Optional[BatchProgressCallback] | Optional PHI-safe callback that receives a BatchProgress record after each completed item. | None |

Returns:

| Type | Description |
| --- | --- |
| BatchResult | BatchResult with all processing results. |
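
The pattern and recursive parameters map naturally onto pathlib globbing. The find_files helper below is a hypothetical illustration of that file-selection step, assuming recursive=True behaves like Path.rglob and recursive=False like Path.glob; it does not call OpenMed.

```python
import tempfile
from pathlib import Path


def find_files(directory, pattern="*.txt", recursive=False):
    """Return matching files, optionally descending into subdirectories."""
    root = Path(directory)
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in matches if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("Patient has diabetes.", encoding="utf-8")
    (root / "notes").mkdir()
    (root / "notes" / "b.txt").write_text("No significant findings.", encoding="utf-8")
    (root / "skip.md").write_text("not matched", encoding="utf-8")

    flat = [p.name for p in find_files(root)]
    deep = [p.name for p in find_files(root, recursive=True)]

print(flat)  # ['a.txt']
print(deep)  # ['a.txt', 'b.txt']
```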

process_files(file_paths, encoding='utf-8', progress_callback=None, *, on_progress=None)

Process multiple files.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| file_paths | Sequence[Union[str, Path]] | Paths to text files. | required |
| encoding | str | File encoding. | 'utf-8' |
| progress_callback | Optional[ProgressCallback] | Optional callback for progress updates. | None |
| on_progress | Optional[BatchProgressCallback] | Optional PHI-safe callback that receives a BatchProgress record after each completed item. | None |

Returns:

| Type | Description |
| --- | --- |
| BatchResult | BatchResult with all processing results. |

process_items(items, progress_callback=None, *, on_progress=None)

Process a sequence of BatchItem objects.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| items | Sequence[BatchItem] | Sequence of BatchItem objects. | required |
| progress_callback | Optional[ProgressCallback] | Optional callback for progress updates. | None |
| on_progress | Optional[BatchProgressCallback] | Optional PHI-safe callback that receives a BatchProgress record after each completed item. | None |

Returns:

| Type | Description |
| --- | --- |
| BatchResult | BatchResult with all processing results. |

process_texts(texts, ids=None, progress_callback=None, *, on_progress=None)

Process multiple texts.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| texts | Sequence[str] | Sequence of texts to analyze. | required |
| ids | Optional[Sequence[str]] | Optional identifiers for each text. | None |
| progress_callback | Optional[ProgressCallback] | Optional callback for progress updates. Signature: callback(completed_count, total_count, result). | None |
| on_progress | Optional[BatchProgressCallback] | Optional PHI-safe callback that receives a BatchProgress record after each completed item. | None |

Returns:

| Type | Description |
| --- | --- |
| BatchResult | BatchResult with all processing results. |

PIIEntity

Bases: EntityPrediction

Extended Entity with PII-specific metadata.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| text | str | The entity text span. |
| label | str | PII category (NAME, EMAIL, PHONE, etc.). |
| start | Optional[int] | Character start position. |
| end | Optional[int] | Character end position. |
| confidence | float | Model confidence score (0-1). |
| entity_type | str | PII category (same as label). |
| redacted_text | Optional[str] | Replacement text after de-identification. |
| original_text | Optional[str] | Original text before redaction. |
| hash_value | Optional[str] | Consistent hash for entity linking. |
| reversible_id | Optional[str] | Optional reversible pseudonymization handle. |

__post_init__()

Initialize entity_type from label if not set.
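
The __post_init__ hook is the standard dataclass mechanism for deriving one field from another. The two classes below are simplified, hypothetical stand-ins for EntityPrediction and PIIEntity, showing only the fallback from label to entity_type; the empty-string default for entity_type is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntitySketch:
    """Simplified stand-in for the base entity class."""
    text: str
    label: str
    confidence: float
    start: Optional[int] = None
    end: Optional[int] = None


@dataclass
class PIIEntitySketch(EntitySketch):
    entity_type: str = ""
    redacted_text: Optional[str] = None

    def __post_init__(self):
        # Fall back to the label when no explicit entity_type was given.
        if not self.entity_type:
            self.entity_type = self.label


entity = PIIEntitySketch(
    text="Casey Example", label="NAME", confidence=0.98, start=8, end=21
)
print(entity.entity_type)  # NAME
```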

DeidentificationResult

Result of de-identification operation.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| original_text | str | Input text before de-identification. |
| deidentified_text | str | Output text with PII redacted. |
| pii_entities | list[PIIEntity] | List of detected and redacted PII entities. |
| method | str | De-identification method used. |
| timestamp | datetime | When de-identification was performed. |
| mapping | Optional[dict[str, str]] | Optional mapping for re-identification (redacted -> original). |

to_dataframe()

Convert detected PII entities to a pandas DataFrame.

Returns:

| Type | Description |
| --- | --- |
| Any | A pandas DataFrame with one row per detected entity and columns text, label, entity_type, start, end, confidence, action, and result_id. |

Raises:

| Type | Description |
| --- | --- |
| ImportError | If pandas is not installed. |
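
A conversion like this is typically written with a lazy pandas import, so the dependency stays optional until the method is called. The sketch below is an assumption about the general shape and not the library's code: entity_rows and entities_to_dataframe are hypothetical helpers, and the action and result_id columns are left out because this reference does not say how they are derived.

```python
from types import SimpleNamespace

# Subset of the documented columns that map directly onto entity attributes.
COLUMNS = ["text", "label", "entity_type", "start", "end", "confidence"]


def entity_rows(entities):
    """Build one plain dict per entity; needs no third-party packages."""
    return [{col: getattr(e, col, None) for col in COLUMNS} for e in entities]


def entities_to_dataframe(entities):
    """Convert entities to a DataFrame, importing pandas only when called."""
    try:
        import pandas as pd
    except ImportError as exc:
        raise ImportError("pandas is required to build a DataFrame") from exc
    return pd.DataFrame(entity_rows(entities), columns=COLUMNS)


entity = SimpleNamespace(
    text="Casey Example", label="NAME", entity_type="NAME",
    start=8, end=21, confidence=0.98,
)
rows = entity_rows([entity])
print(rows)
```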

to_dict()

Convert result to dictionary format.

Returns:

| Type | Description |
| --- | --- |
| dict | Dictionary with all result fields and metadata. |