API Reference

extract_pii

Extract PII entities from text with intelligent entity merging.

Uses token classification models to detect personally identifiable information including names, emails, phone numbers, addresses, and other HIPAA-protected identifiers.

The smart merging feature uses regex patterns to identify semantic units (dates, SSNs, phone numbers, etc.) and merges fragmented model predictions into complete entities, each labeled with the dominant label among its fragments.
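The merging idea described above can be sketched in a few lines. This is a minimal illustration and not OpenMed's implementation: the `SEMANTIC_UNITS` table, the `merge_fragments` helper, the dict-based entity shape, and the rule that "dominant" means the label covering the most characters are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical pattern table; the patterns OpenMed actually uses are not shown here.
SEMANTIC_UNITS = {
    "date": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def merge_fragments(text, entities):
    """Merge entity fragments that fall inside one regex-matched unit.

    `entities` is a list of dicts with start, end, label, and confidence keys.
    """
    merged, used = [], set()
    for pattern in SEMANTIC_UNITS.values():
        for match in pattern.finditer(text):
            inside = [
                i for i, e in enumerate(entities)
                if i not in used
                and e["start"] >= match.start()
                and e["end"] <= match.end()
            ]
            if not inside:
                continue
            # Assumed rule: the dominant label is the one covering the most characters.
            coverage = defaultdict(int)
            for i in inside:
                coverage[entities[i]["label"]] += entities[i]["end"] - entities[i]["start"]
            merged.append({
                "text": match.group(),
                "label": max(coverage, key=coverage.get),
                "start": match.start(),
                "end": match.end(),
                "confidence": max(entities[i]["confidence"] for i in inside),
            })
            used.update(inside)
    # Entities outside every matched unit pass through unchanged.
    merged.extend(e for i, e in enumerate(entities) if i not in used)
    return sorted(merged, key=lambda e: e["start"])


text = "DOB: 01/15/1970"
fragments = [
    {"start": 5, "end": 7, "label": "DATE", "confidence": 0.91},
    {"start": 7, "end": 15, "label": "DATE", "confidence": 0.88},
]
merged = merge_fragments(text, fragments)
```

Here the two fragments covering `01` and `/15/1970` collapse into one entity spanning the whole date.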

Parameters:

Name	Type	Description	Default
`text`	`str`	Input text to analyze	required
`model_name`	`str`	PII detection model (registry key or HuggingFace ID). When the default is used and `lang` is not `"en"`, the language-appropriate default model is selected automatically.	`_DEFAULT_EN_MODEL`
`confidence_threshold`	`float`	Minimum confidence score (0-1)	`0.5`
`config`	`Optional[OpenMedConfig]`	Optional configuration override	`None`
`use_smart_merging`	`bool`	Enable regex-based semantic unit merging (recommended)	`True`
`lang`	`str`	ISO 639-1 language code (en, fr, de, it, es, nl, hi, te, pt, ar, ja, tr). Controls which default model and regex patterns are used.	`'en'`
`normalize_accents`	`Optional[bool]`	Strip diacritical marks before model inference so that models trained on accent-free text still detect accented names. Entity spans in the result reference the original (accented) text. `None` (default) auto-enables for languages in `_ACCENT_NORMALIZE_LANGS` (currently Spanish).	`None`
`loader`	`Optional['ModelLoader']`	Optional shared model loader to reuse warmed pipelines.	`None`
`custom_recognizer`	`Any`	Optional deny-list/allow-list recognizer config, `CustomRecognizer` instance, or JSON/YAML config path. Deny-list matches are added with `custom:deny` provenance; allow-list matches suppress overlapping spans from any detector.	`None`
`cache_results`	`bool`	Whether to cache this result in the in-process LRU cache. Cached results may contain PHI, but are never saved to disk.	`False`
`max_cache_entries`	`int`	Maximum number of cached results.	`128`
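The `normalize_accents` behaviour, where the model sees accent-free text but spans still reference the original, relies on the normalized text keeping the same character offsets. One way to get that property is sketched below; the helper name `strip_accents_keep_offsets` is hypothetical and the approach is an assumption, not necessarily how OpenMed does it.

```python
import unicodedata


def strip_accents_keep_offsets(text):
    """Remove diacritics character by character, keeping the string length.

    Each character is decomposed (NFD) and only its base character is kept, so
    index i in the result corresponds to index i in the input.
    """
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Keep the original character if stripping would change the length.
        out.append(base if len(base) == 1 else ch)
    return "".join(out)


# NFC makes sure accented letters are single code points before stripping.
original = unicodedata.normalize("NFC", "Paciente José Núñez llamó.")
normalized = strip_accents_keep_offsets(original)

# A span located in the normalized text can be sliced from the original text.
start = normalized.index("Jose Nunez")
end = start + len("Jose Nunez")
span_text = original[start:end]
```

Because both strings have the same length, `span_text` is the accented name from the original input.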

Returns:

Type	Description
`PredictionResult`	PredictionResult with detected PII entities

Example

    >>> from unittest.mock import patch
    >>> from openmed.core.pii import extract_pii
    >>> from openmed.processing.outputs import EntityPrediction, PredictionResult
    >>> fake_result = PredictionResult(
    ...     text="Patient Casey Example called.",
    ...     entities=[
    ...         EntityPrediction(
    ...             text="Casey Example",
    ...             label="NAME",
    ...             confidence=0.98,
    ...             start=8,
    ...             end=21,
    ...         )
    ...     ],
    ...     model_name="fixture-pii-model",
    ...     timestamp="2026-01-01T00:00:00",
    ... )
    >>> with patch("openmed.analyze_text", return_value=fake_result):
    ...     result = extract_pii(
    ...         "Patient Casey Example called.",
    ...         model_name="fixture-pii-model",
    ...         use_smart_merging=False,
    ...     )
    >>> next((entity.text, entity.label) for entity in result.entities)
    ('Casey Example', 'NAME')

deidentify

De-identify text by detecting and redacting PII with intelligent merging.

Implements multiple de-identification strategies for HIPAA compliance:

mask: Replace with placeholders like [NAME], [EMAIL], etc.
remove: Remove PII text entirely (empty string)
replace: Replace with fake but realistic data
hash: Replace with consistent hashed values for entity linking
format_preserve: Replace structured identifiers with synthetic values that keep shape and separators, masking unsupported labels
shift_dates: Shift dates by a random offset while preserving intervals
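To make the simpler strategies in the list above concrete, here is a small sketch of `mask`, `remove`, and `hash` applied to known entity spans. It is an illustration only: the `redact` helper, the dict-based entity shape, and the `LABEL_digest` output format are assumptions, and the plain SHA-256 used here stands in for whatever hashing OpenMed applies (a keyed hash is generally the safer choice for low-entropy identifiers).

```python
import hashlib


def redact(text, entities, method="mask"):
    """Apply a simple redaction strategy to non-overlapping entity spans."""
    out = text
    # Work from the end of the text so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        original = text[ent["start"]:ent["end"]]
        if method == "mask":
            replacement = f"[{ent['label']}]"
        elif method == "remove":
            replacement = ""
        elif method == "hash":
            # Same input gives the same token, which allows entity linking.
            digest = hashlib.sha256(original.encode("utf-8")).hexdigest()[:8]
            replacement = f"{ent['label']}_{digest}"
        else:
            raise ValueError(f"unsupported method: {method}")
        out = out[:ent["start"]] + replacement + out[ent["end"]:]
    return out


text = "Patient Casey Example called Casey Example."
entities = [
    {"start": 8, "end": 21, "label": "NAME"},
    {"start": 29, "end": 42, "label": "NAME"},
]
masked = redact(text, entities, "mask")
removed = redact(text, entities, "remove")
hashed = redact(text, entities, "hash")
```

With `hash`, both mentions of the same name map to the same token, which is what makes linking across mentions possible.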

Smart merging uses regex patterns to merge fragmented entities. For example, a date split into '01' and '/15/1970' is merged into the complete '01/15/1970'.

Parameters:

Name	Type	Description	Default
`text`	`str`	Input text to de-identify	required
`method`	`DeidentificationMethod`	De-identification method (mask, remove, replace, hash, shift_dates, format_preserve)	`'mask'`
`model_name`	`str`	PII detection model	`_DEFAULT_EN_MODEL`
`confidence_threshold`	`float`	Minimum confidence for redaction (default 0.7 for safety)	`0.7`
`keep_year`	`bool`	For dates, keep the year unchanged	`False`
`shift_dates`	`Optional[bool]`	Deprecated alias for `method="shift_dates"`.	`None`
`date_shift_days`	`Optional[int]`	Specific number of days to shift when `patient_key` is omitted. When `patient_key` is supplied, this is treated as a legacy maximum absolute offset bound unless `date_shift_max_days` is also supplied.	`None`
`patient_key`	`Optional[str \| bytes]`	Optional stable patient identifier used only to derive a deterministic HMAC date-shift offset. Raw keys are not logged, persisted, or returned.	`None`
`date_shift_max_days`	`Optional[int]`	Maximum absolute offset for random or patient-keyed date shifting. Defaults to 365 when `patient_key` is supplied and neither this nor `date_shift_days` is set.	`None`
`date_shift_secret`	`Optional[str \| bytes]`	Required HMAC key material for patient-keyed offsets. Reuse the same value across sessions to keep offsets stable.	`None`
`keep_mapping`	`bool`	Keep mapping for re-identification	`False`
`config`	`Optional[OpenMedConfig]`	Optional configuration override	`None`
`use_smart_merging`	`bool`	Enable regex-based semantic unit merging (recommended)	`True`
`use_safety_sweep`	`bool`	Run a deterministic structured-identifier sweep after model detection and before redaction.	`True`
`lang`	`str`	ISO 639-1 language code (en, fr, de, it, es, nl, hi, te, pt, ar, ja, tr). Controls model selection, regex patterns, and fake data for replacement.	`'en'`
`normalize_accents`	`Optional[bool]`	Strip diacritical marks before model inference. `None` (default) auto-enables for Spanish.	`None`
`loader`	`Optional['ModelLoader']`	Optional shared model loader to reuse warmed pipelines.	`None`
`consistent`	`bool`	When `method="replace"` or `method="format_preserve"`, generate stable surrogates (same input -> same surrogate within the call). Lets repeated mentions of the same name resolve to one fake identity instead of a different one each time.	`False`
`seed`	`Optional[int]`	Optional integer seed for cross-run reproducibility of `consistent=True` replacements. Implies `consistent=True`.	`None`
`locale`	`Optional[str]`	Faker locale override (`pt_BR`, `en_GB`, ...) for `method="replace"` and `method="format_preserve"`. When `None`, derived from `lang`.	`None`
`surrogate_vault`	`Optional['SurrogateVault']`	Optional cross-document surrogate vault. When provided with `method="replace"`, OpenMed stores only HMAC source hashes and reuses the same surrogate for the same label/language/source identifier across calls.	`None`
`policy`	`Optional[str]`	Optional policy profile name controlling arbitration, action selection, mandatory safety sweep behavior, and reversible mapping.	`None`
`calibration_thresholds_path`	`Optional[str \| Path]`	Optional thresholds.json artifact path or artifact directory. When provided, per-label calibrated thresholds filter model detections and appear in audit output.	`None`
`custom_recognizer`	`Any`	Optional deny-list/allow-list recognizer config, `CustomRecognizer` instance, or JSON/YAML config path. Deny-list matches are redacted with `custom:deny` provenance; allow-list matches suppress overlapping spans from any detector.	`None`
`audit`	`bool`	Return a deterministic AuditReport instead of the DeidentificationResult.	`False`
`cache_results`	`bool`	Whether to cache this result in the in-process LRU cache. Cached results may contain PHI, but are never saved to disk.	`False`
`max_cache_entries`	`int`	Maximum number of cached results.	`128`
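The `patient_key`, `date_shift_secret`, and `date_shift_max_days` parameters describe a deterministic, HMAC-derived date offset. The sketch below shows one way such an offset could be derived; the helper names, the choice of SHA-256, the byte-to-offset mapping, and the exclusion of a zero offset are all assumptions for illustration, not OpenMed's actual derivation.

```python
import hashlib
import hmac
from datetime import date, timedelta


def patient_offset_days(patient_key, secret, max_days=365):
    """Derive a stable, non-zero offset in [-max_days, max_days]."""
    if isinstance(patient_key, str):
        patient_key = patient_key.encode("utf-8")
    if isinstance(secret, str):
        secret = secret.encode("utf-8")
    digest = hmac.new(secret, patient_key, hashlib.sha256).digest()
    # Map the first 8 bytes onto 1..max_days, then take the sign from one more bit.
    magnitude = int.from_bytes(digest[:8], "big") % max_days + 1
    sign = -1 if digest[8] & 1 else 1
    return sign * magnitude


def shift(d, offset_days):
    """Shift a date by a whole number of days."""
    return d + timedelta(days=offset_days)


offset = patient_offset_days("patient-001", "example-secret")
admit, discharge = date(2024, 3, 1), date(2024, 3, 8)
shifted_admit = shift(admit, offset)
shifted_discharge = shift(discharge, offset)
```

Because every date for a patient moves by the same offset, the interval between admission and discharge is unchanged, and reusing the same secret reproduces the same offset in a later session.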

Returns:

Type	Description
`DeidentificationResult \| 'AuditReport'`	DeidentificationResult with original and de-identified text, or AuditReport when `audit=True`.

Example

    >>> from datetime import datetime
    >>> from types import SimpleNamespace
    >>> from unittest.mock import patch
    >>> from openmed.core.pii import (
    ...     DeidentificationResult,
    ...     PIIEntity,
    ...     deidentify,
    ... )
    >>> fixture = DeidentificationResult(
    ...     original_text="Patient Casey Example",
    ...     deidentified_text="Patient [NAME]",
    ...     pii_entities=[
    ...         PIIEntity(
    ...             text="Casey Example",
    ...             label="NAME",
    ...             start=8,
    ...             end=21,
    ...             confidence=0.98,
    ...             redacted_text="[NAME]",
    ...         )
    ...     ],
    ...     method="mask",
    ...     timestamp=datetime(2026, 1, 1, 0, 0, 0),
    ...     mapping={"[NAME]": "Casey Example"},
    ... )
    >>> with patch("openmed.core.pipeline.Pipeline") as pipeline_cls:
    ...     pipeline_cls.return_value.run.return_value = SimpleNamespace(
    ...         deidentification_result=fixture
    ...     )
    ...     result = deidentify(
    ...         "Patient Casey Example",
    ...         method="mask",
    ...         keep_mapping=True,
    ...     )
    >>> result.deidentified_text
    'Patient [NAME]'
    >>> result.mapping

reidentify

Re-identify text using stored mapping.

Restores original PII from de-identified text using the mapping created during de-identification. This works only if `keep_mapping=True` was passed to `deidentify`.

Parameters:

Name	Type	Description	Default
`deidentified_text`	`str`	De-identified text	required
`mapping`	`Mapping[str, str]`	Mapping from redacted to original text	required

Returns:

Type	Description
`str`	Re-identified text with original PII restored

Example

    >>> from openmed.core.pii import reidentify
    >>> reidentify(
    ...     "Patient [NAME] has record [ID]",
    ...     {"[NAME]": "Casey Example", "[ID]": "MRN-0001"},
    ... )
    'Patient Casey Example has record MRN-0001'

Note

Re-identification is possible only when `keep_mapping=True` was used during de-identification. In production, restrict it to authorized callers and record each use in an audit log.
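Conceptually, restoring text from a redacted-to-original mapping is a placeholder substitution. The sketch below is a stand-alone illustration of that idea; the `restore` helper is hypothetical and is not the body of `reidentify`.

```python
def restore(deidentified_text, mapping):
    """Replace each placeholder with its original value."""
    restored = deidentified_text
    # Longest placeholders first, so a placeholder that is a prefix of a
    # longer one cannot be substituted inside it.
    for placeholder in sorted(mapping, key=len, reverse=True):
        restored = restored.replace(placeholder, mapping[placeholder])
    return restored


restored = restore(
    "Patient [NAME] has record [ID]",
    {"[NAME]": "Casey Example", "[ID]": "MRN-0001"},
)
```

One consequence of a plain dictionary mapping is that each placeholder can restore only one original value, so a scheme like this depends on placeholders being unique per distinct entity.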

analyze_text

Run a token-classification model on text and format the predictions.

Parameters:

Name	Type	Description	Default
`text`	`str`	Clinical or biomedical text to analyse.	required
`model_name`	`str`	Registry key, fully-qualified Hugging Face model id, or local model path.	`'disease_detection_superclinical'`
`model_id`	`Optional[str]`	Alias for `model_name`. Useful for APIs and examples that name model identifiers as `model_id`.	`None`
`config`	`Optional[OpenMedConfig]`	Optional `openmed.core.config.OpenMedConfig` instance.	`None`
`loader`	`Optional[ModelLoader]`	Reuse an existing `openmed.core.models.ModelLoader`.	`None`
`aggregation_strategy`	`Optional[str]`	Hugging Face aggregation strategy (`"simple"` by default). Set to `None` to work with raw token outputs.	`'simple'`
`output_format`	`str`	`"dict"` (default), `"json"`, `"html"` or `"csv"`.	`'dict'`
`include_confidence`	`bool`	Whether to include confidence scores in formatted output.	`True`
`confidence_threshold`	`Optional[float]`	Minimum confidence for entities. `None` keeps all.	`0.0`
`group_entities`	`bool`	Merge adjacent entities of the same label in the formatted output.	`False`
`formatter_kwargs`	`Optional[Dict[str, Any]]`	Extra keyword arguments forwarded to `openmed.processing.format_predictions`.	`None`
`metadata`	`Optional[Dict[str, Any]]`	Optional metadata to attach to the result.	`None`
`use_fast_tokenizer`	`bool`	Prefer fast tokenizers when available.	`True`
`sentence_detection`	`bool`	Enable pySBD-powered sentence detection (default: True).	`True`
`sentence_language`	`str`	Language hint for the sentence detector.	`'en'`
`sentence_clean`	`bool`	Whether to enable pySBD's cleaning heuristics.	`False`
`sentence_segmenter`	`Optional[Any]`	Optional preconstructed pySBD segmenter to reuse.	`None`
`cache_results`	`bool`	Whether to cache this result in the in-process LRU cache. Cached results may contain PHI, but are never saved to disk.	`False`
`max_cache_entries`	`int`	Maximum number of cached results.	`128`
`**pipeline_kwargs`	`Any`	Additional arguments passed to `openmed.core.models.ModelLoader.create_pipeline`.	`{}`
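The interaction between `confidence_threshold` and `group_entities` can be pictured as a filter followed by a merge of neighbouring spans. The following sketch is an assumption-laden illustration: the `filter_and_group` helper, the `max_gap` parameter, and the choice to keep the lowest score of a merged group are hypothetical and not taken from OpenMed.

```python
def filter_and_group(entities, confidence_threshold=0.0, max_gap=1):
    """Drop low-confidence entities, then merge neighbours with the same label."""
    kept = sorted(
        (e for e in entities if e["score"] >= confidence_threshold),
        key=lambda e: e["start"],
    )
    grouped = []
    for ent in kept:
        prev = grouped[-1] if grouped else None
        adjacent = prev is not None and ent["start"] - prev["end"] <= max_gap
        if adjacent and prev["label"] == ent["label"]:
            # Extend the previous span; keep the most conservative score.
            prev["end"] = max(prev["end"], ent["end"])
            prev["score"] = min(prev["score"], ent["score"])
        else:
            grouped.append(dict(ent))  # copy so the input is not mutated
    return grouped


text = "Type 2 diabetes and asthma."
entities = [
    {"label": "CONDITION", "start": 0, "end": 6, "score": 0.97},
    {"label": "CONDITION", "start": 7, "end": 15, "score": 0.95},
    {"label": "CONDITION", "start": 16, "end": 19, "score": 0.20},
    {"label": "CONDITION", "start": 20, "end": 26, "score": 0.99},
]
grouped = filter_and_group(entities, confidence_threshold=0.5)
spans = [text[e["start"]:e["end"]] for e in grouped]
```

With the low-confidence span removed, the first two fragments merge into one entity while the last stays separate.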

Returns:

Type	Description
`Union[AnalyzeResult, str, List[Dict[str, Any]]]`	Analyze result for `"dict"` output, otherwise the requested rendered format.

Example

    >>> from openmed import analyze_text
    >>> class FixtureLoader:
    ...     config = None
    ...
    ...     def create_pipeline(self, model_name, **kwargs):
    ...         def pipeline(text, **call_kwargs):
    ...             return [
    ...                 {
    ...                     "entity_group": "CONDITION",
    ...                     "score": 0.99,
    ...                     "start": 11,
    ...                     "end": 17,
    ...                     "word": "asthma",
    ...                 }
    ...             ]
    ...
    ...         return pipeline
    ...
    ...     def get_max_sequence_length(self, model_name, tokenizer=None):
    ...         return 128
    >>> result = analyze_text(
    ...     "History of asthma.",
    ...     model_name="fixture-ner-model",
    ...     loader=FixtureLoader(),
    ...     sentence_detection=False,
    ... )
    >>> next((entity.text, entity.label) for entity in result.entities)
    ('asthma', 'CONDITION')

list_models

Return available OpenMed model identifiers.

Parameters:

Name	Type	Description	Default
`include_registry`	`bool`	Include entries from the bundled registry in addition to entries in the committed manifest.	`True`
`include_remote`	`bool`	Retained for compatibility; no live discovery is performed.	`True`
`config`	`Optional[OpenMedConfig]`	Optional custom configuration for model discovery.	`None`

BatchProcessor

Process multiple texts efficiently with progress tracking.

Example usage

    from openmed import BatchProcessor, OpenMedConfig

    processor = BatchProcessor(model_name="disease_detection_superclinical")
    texts = ["Patient has diabetes.", "No significant findings."]
    result = processor.process_texts(texts)
    print(result.summary())

`__init__(model_name='disease_detection_superclinical', operation='analyze_text', batch_size=8, config=None, loader=None, aggregation_strategy='simple', confidence_threshold=None, group_entities=False, continue_on_error=True, **analyze_kwargs)`

Initialize batch processor.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	Model registry key or HuggingFace identifier.	`'disease_detection_superclinical'`
`operation`	`BatchOperation`	Which function to call per item: `"analyze_text"` (default), `"extract_pii"` or `"deidentify"`. Extra kwargs passed via `**analyze_kwargs` are passed to the selected function.	`'analyze_text'`
`batch_size`	`int`	Number of documents to process together per batch.	`8`
`config`	`Optional[Any]`	Optional OpenMedConfig instance.	`None`
`loader`	`Optional[Any]`	Optional ModelLoader instance to reuse.	`None`
`aggregation_strategy`	`Optional[str]`	HuggingFace aggregation strategy (`analyze_text` operation only).	`'simple'`
`confidence_threshold`	`Optional[float]`	Minimum confidence for entities. When not provided, defaults match the selected operation: `0.0` for `analyze_text`, `0.5` for `extract_pii`, and `0.7` for `deidentify`.	`None`
`group_entities`	`bool`	Whether to group adjacent entities (`analyze_text` operation only).	`False`
`continue_on_error`	`bool`	Continue processing on individual item errors.	`True`
`**analyze_kwargs`	`Any`	Additional arguments passed to the selected function.	`{}`

`iter_process(texts, ids=None, *, on_progress=None)`

Process texts as an iterator, yielding results one at a time.

This is useful for streaming results or processing very large batches where you don't want to hold all results in memory.

Parameters:

Name	Type	Description	Default
`texts`	`Sequence[str]`	Sequence of texts to analyze.	required
`ids`	`Optional[Sequence[str]]`	Optional identifiers for each text.	`None`
`on_progress`	`Optional[BatchProgressCallback]`	Optional PHI-safe callback that receives a BatchProgress record after each completed item.	`None`

Yields:

Type	Description
`BatchItemResult`	BatchItemResult for each processed text.
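The streaming behaviour described for `iter_process` follows the usual generator pattern: work through the input in batches and yield each result as soon as it is ready. The sketch below shows that pattern in isolation; `iter_results`, the `worker` callable, and the `Progress` record are hypothetical stand-ins, not OpenMed classes.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional, Sequence


@dataclass
class Progress:
    """Hypothetical stand-in for a BatchProgress record (counts only)."""
    completed: int
    total: int


def iter_results(
    texts: Sequence[str],
    worker: Callable[[str], object],
    batch_size: int = 8,
    on_progress: Optional[Callable[[Progress], None]] = None,
) -> Iterator[object]:
    """Yield one result at a time while walking the input in batches."""
    total = len(texts)
    completed = 0
    for offset in range(0, total, batch_size):
        for text in texts[offset:offset + batch_size]:
            result = worker(text)
            completed += 1
            if on_progress is not None:
                # Only counts are reported, never the text itself.
                on_progress(Progress(completed, total))
            yield result


seen = []
lengths = list(
    iter_results(["a", "bb", "ccc"], len, batch_size=2, on_progress=seen.append)
)
```

A caller that consumes the generator lazily, for example in a `for` loop that writes each result to disk, never holds more than one result in memory at a time.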

`process_directory(directory, pattern='*.txt', recursive=False, encoding='utf-8', progress_callback=None, *, on_progress=None)`

Process all matching files in a directory.

Parameters:

Name	Type	Description	Default
`directory`	`Union[str, Path]`	Directory path.	required
`pattern`	`str`	Glob pattern for file matching.	`'*.txt'`
`recursive`	`bool`	Whether to search recursively.	`False`
`encoding`	`str`	File encoding.	`'utf-8'`
`progress_callback`	`Optional[ProgressCallback]`	Optional callback for progress updates.	`None`
`on_progress`	`Optional[BatchProgressCallback]`	Optional PHI-safe callback that receives a BatchProgress record after each completed item.	`None`

Returns:

Type	Description
`BatchResult`	BatchResult with all processing results.
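The `pattern` and `recursive` parameters map naturally onto `pathlib` globbing. The sketch below shows how file discovery along those lines could work; `find_files` is a hypothetical helper and the sorted ordering is an assumption, since the reference does not state how OpenMed enumerates files.

```python
import tempfile
from pathlib import Path


def find_files(directory, pattern="*.txt", recursive=False):
    """Return matching files under `directory` in sorted order."""
    root = Path(directory)
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in matches if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("Patient has diabetes.", encoding="utf-8")
    (root / "notes.md").write_text("ignored", encoding="utf-8")
    (root / "sub").mkdir()
    (root / "sub" / "b.txt").write_text("No significant findings.", encoding="utf-8")

    top_level = [p.name for p in find_files(root)]
    everything = [p.name for p in find_files(root, recursive=True)]
```

With `recursive=False` only `a.txt` is found; switching to `recursive=True` also picks up `sub/b.txt`.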

`process_files(file_paths, encoding='utf-8', progress_callback=None, *, on_progress=None)`

Process multiple files.

Parameters:

Name	Type	Description	Default
`file_paths`	`Sequence[Union[str, Path]]`	Paths to text files.	required
`encoding`	`str`	File encoding.	`'utf-8'`
`progress_callback`	`Optional[ProgressCallback]`	Optional callback for progress updates.	`None`
`on_progress`	`Optional[BatchProgressCallback]`	Optional PHI-safe callback that receives a BatchProgress record after each completed item.	`None`

Returns:

Type	Description
`BatchResult`	BatchResult with all processing results.

`process_items(items, progress_callback=None, *, on_progress=None)`

Process a sequence of BatchItem objects.

Parameters:

Name	Type	Description	Default
`items`	`Sequence[BatchItem]`	Sequence of BatchItem objects.	required
`progress_callback`	`Optional[ProgressCallback]`	Optional callback for progress updates.	`None`
`on_progress`	`Optional[BatchProgressCallback]`	Optional PHI-safe callback that receives a BatchProgress record after each completed item.	`None`

Returns:

Type	Description
`BatchResult`	BatchResult with all processing results.

`process_texts(texts, ids=None, progress_callback=None, *, on_progress=None)`

Process multiple texts.

Parameters:

Name	Type	Description	Default
`texts`	`Sequence[str]`	Sequence of texts to analyze.	required
`ids`	`Optional[Sequence[str]]`	Optional identifiers for each text.	`None`
`progress_callback`	`Optional[ProgressCallback]`	Optional callback for progress updates. Signature: `callback(completed_count, total_count, result)`.	`None`
`on_progress`	`Optional[BatchProgressCallback]`	Optional PHI-safe callback that receives a BatchProgress record after each completed item.	`None`

Returns:

Type	Description
`BatchResult`	BatchResult with all processing results.

PIIEntity

Bases: EntityPrediction

Extended Entity with PII-specific metadata.

Attributes:

Name	Type	Description
`text`	`str`	The entity text span
`label`	`str`	PII category (NAME, EMAIL, PHONE, etc.)
`start`	`Optional[int]`	Character start position
`end`	`Optional[int]`	Character end position
`confidence`	`float`	Model confidence score (0-1)
`entity_type`	`str`	PII category (same as label)
`redacted_text`	`Optional[str]`	Replacement text after de-identification
`original_text`	`Optional[str]`	Original text before redaction
`hash_value`	`Optional[str]`	Consistent hash for entity linking
`reversible_id`	`Optional[str]`	Optional reversible pseudonymization handle

`__post_init__()`

Initialize entity_type from label if not set.

DeidentificationResult

Result of de-identification operation.

Attributes:

Name	Type	Description
`original_text`	`str`	Input text before de-identification
`deidentified_text`	`str`	Output text with PII redacted
`pii_entities`	`list[PIIEntity]`	List of detected and redacted PII entities
`method`	`str`	De-identification method used
`timestamp`	`datetime`	When de-identification was performed
`mapping`	`Optional[dict[str, str]]`	Optional mapping for re-identification (redacted -> original)

`to_dataframe()`

Convert detected PII entities to a pandas DataFrame.

Returns:

Type	Description
`Any`	A pandas DataFrame with one row per detected entity and columns `text`, `label`, `entity_type`, `start`, `end`, `confidence`, `action`, and `result_id`.

Raises:

Type	Description
`ImportError`	If pandas is not installed.

`to_dict()`

Convert result to dictionary format.

Returns:

Type	Description
`dict`	Dictionary with all result fields and metadata
