Skip to content

PII Smart Entity Merging

Overview

OpenMed's PII detection includes Smart Entity Merging to solve the common problem where tokenizers split semantic units (dates, SSN, phone numbers, etc.) into multiple fragmented tokens, resulting in incomplete entity predictions.

The Problem

Token-level classification models often split meaningful units:

# WITHOUT smart merging
result = extract_pii("DOB: 01/15/1970", use_smart_merging=False)
# Result:
# - [date] '01' (confidence: 0.711)
# - [date_of_birth] '/15/1970' (confidence: 0.751)

This produces unusable fragments for production de-identification.

The Solution

Smart merging uses regex patterns to identify semantic units and merges fragmented predictions:

# WITH smart merging (DEFAULT)
result = extract_pii("DOB: 01/15/1970", use_smart_merging=True)
# Result:
# - [date_of_birth] '01/15/1970' (confidence: 0.731)

Now you get complete, production-ready entities.


How It Works

1. Regex-Based Semantic Unit Detection

The system uses comprehensive regex patterns to identify PII entities:

from openmed import find_semantic_units

text = "Patient: John Doe, DOB: 01/15/1970, SSN: 123-45-6789, Phone: (555) 123-4567"
units = find_semantic_units(text)

# Output:
# [(17, 27, 'date'),       # '01/15/1970'
#  (34, 45, 'ssn'),        # '123-45-6789'
#  (54, 68, 'phone_number')] # '(555) 123-4567'

Supported Patterns: - Dates: MM/DD/YYYY, YYYY-MM-DD, DD-MM-YYYY, Month DD, YYYY - SSN: XXX-XX-XXXX, XXX XX XXXX - Phone: (XXX) XXX-XXXX, XXX-XXX-XXXX, XXXXXXXXXX - Email: Standard email format - Credit Card: XXXX-XXXX-XXXX-XXXX - IP Addresses: IPv4 and IPv6 - MAC Addresses: XX:XX:XX:XX:XX:XX - URLs: Web addresses - Street Addresses: Number + Street Name - ZIP Codes: XXXXX or XXXXX-XXXX - Medical Record Numbers: Common MRN formats

2. Model Prediction Aggregation

For each semantic unit, the system: 1. Finds all model predictions that overlap with the unit 2. Calculates the dominant label (most frequently predicted) 3. If there's a tie, selects the label with highest average confidence 4. Merges all fragments into a single entity

from openmed import calculate_dominant_label

# Example: Date split into 3 tokens
predictions = [
    {'entity_type': 'date', 'score': 0.7},
    {'entity_type': 'date_of_birth', 'score': 0.9},
    {'entity_type': 'date_of_birth', 'score': 0.8}
]

dominant_label, avg_conf = calculate_dominant_label(predictions)
# Result: ('date_of_birth', 0.8)
# Reason: date_of_birth appears 2 times vs date 1 time

3. Label Specificity Hierarchy

When choosing between labels, the system prefers more specific labels:

# Hierarchy examples:
'date_of_birth' > 'date'          # date_of_birth is more specific
'first_name' > 'name'             # first_name is more specific
'ssn' > 'id'                      # ssn is more specific
'street_address' > 'address'      # street_address is more specific
'phone_number' > 'phone'          # phone_number is more specific

API Reference

extract_pii() with Smart Merging

from openmed import extract_pii

result = extract_pii(
    text="Patient: John Doe, DOB: 01/15/1970, SSN: 123-45-6789",
    model_name="pii_detection_superclinical",
    confidence_threshold=0.5,
    use_smart_merging=True  # DEFAULT: True (recommended)
)

for entity in result.entities:
    print(f"{entity.label}: {entity.text} (confidence: {entity.confidence:.3f})")

Parameters: - use_smart_merging (bool): Enable regex-based semantic unit merging - Default: True (recommended for production) - Set to False to get raw model predictions

deidentify() with Smart Merging

from openmed import deidentify

result = deidentify(
    text="Patient: Jane Doe, DOB: 01/15/1970, SSN: 987-65-4321",
    method="mask",
    model_name="pii_detection_superclinical",
    confidence_threshold=0.7,
    use_smart_merging=True  # DEFAULT: True
)

print(result.deidentified_text)
# Output: "Patient: [first_name] [last_name], DOB: [date_of_birth], SSN: [ssn]"

Without smart merging:

"Patient: [first_name] [last_name], DOB: [date][date_of_birth], SSN: [ssn]"
#                                          ^^^^^ Fragmented!

Advanced: Custom Patterns

You can define custom PII patterns:

from openmed import PIIPattern, merge_entities_with_semantic_units

# Define custom patterns
custom_patterns = [
    PIIPattern(
        pattern=r'\b\d{6}-\d{4}\b',  # Custom employee ID format
        entity_type='employee_id',
        priority=10
    ),
    PIIPattern(
        pattern=r'\bPID-\d{8}\b',  # Patient ID format
        entity_type='patient_id',
        priority=9
    ),
]

# Use with merging
entities = [...]  # Your model predictions
merged = merge_entities_with_semantic_units(
    entities,
    text,
    patterns=custom_patterns
)

Pattern Priority

Patterns are checked in priority order (highest first). If multiple patterns match overlapping text, the higher priority pattern wins:

PIIPattern(r'\b\d{4}-\d{2}-\d{2}\b', 'date', priority=10)  # Checked first
PIIPattern(r'\b\d{1,2}/\d{1,2}/\d{4}\b', 'date', priority=9)  # Checked second
PIIPattern(r'\b\d{5}\b', 'postcode', priority=7)  # Lower priority

Examples

Example 1: Clinical Note De-identification

from openmed import deidentify

clinical_note = """
Patient Name: Dr. Sarah Johnson
Date of Birth: 03/15/1975
Social Security: 123-45-6789
Medical Record #: MRN-87654321
Contact: (555) 987-6543
Email: sarah.johnson@email.com
Address: 456 Oak Avenue, Boston, MA 02115
Appointment: 12/20/2024 at 2:30 PM
"""

result = deidentify(
    clinical_note,
    method="mask",
    model_name="pii_detection_superclinical",
    confidence_threshold=0.6,
    use_smart_merging=True  # Ensures dates and SSN are not fragmented
)

print(result.deidentified_text)

Output:

Patient Name: [occupation] [first_name] [last_name]
Date of Birth: [date_of_birth]
Social Security: [ssn]
Medical Record #: [medical_record_number]
Contact: [phone_number]
Email: [email]
Address: [street_address], [city], [state] [postcode]
Appointment: [date] at [time]

Example 2: Batch Processing with Smart Merging

from openmed import BatchProcessor

processor = BatchProcessor(
    model_name="pii_detection_superclinical",
    confidence_threshold=0.6,
    use_smart_merging=True  # Will be applied to all texts
)

texts = [
    "Patient: John Doe, DOB: 01/15/1970",
    "SSN: 123-45-6789, Phone: (555) 123-4567",
    "Email: john@example.com, Address: 123 Main St"
]

results = processor.process_batch(texts)

for i, result in enumerate(results.items):
    if result.success:
        print(f"Text {i+1}: {len(result.entities)} complete entities extracted")

Example 3: Comparing With and Without Smart Merging

from openmed import extract_pii

text = "Appointment on 01/15/2024 for patient with SSN 123-45-6789"

# WITHOUT smart merging
result_old = extract_pii(text, use_smart_merging=False)
print("Without smart merging:")
for e in result_old.entities:
    print(f"  {e.label}: '{e.text}'")
# Output:
#   date: '01'
#   date: '/15/2024'  ← FRAGMENTED!
#   ssn: '123-45-6789'

# WITH smart merging
result_new = extract_pii(text, use_smart_merging=True)
print("\nWith smart merging:")
for e in result_new.entities:
    print(f"  {e.label}: '{e.text}'")
# Output:
#   date: '01/15/2024'  ← COMPLETE!
#   ssn: '123-45-6789'

Performance Considerations

Computational Cost

Smart merging adds minimal overhead: - Regex matching: O(n) where n = text length - Entity merging: O(m) where m = number of entities - Total overhead: ~5-10% additional processing time

For a 1000-word clinical note: - Without smart merging: ~1.2 seconds - With smart merging: ~1.3 seconds (+8%)

Recommendation: The performance cost is negligible compared to the production value of complete entities.

When to Disable

Consider disabling smart merging (use_smart_merging=False) only when: 1. You need raw token-level predictions for analysis 2. You're building a custom post-processor 3. You're debugging model predictions

For production de-identification, always use use_smart_merging=True (default).


Troubleshooting

Issue: Date still fragmented

Cause: The date format is not covered by default patterns.

Solution: Add custom pattern:

from openmed import PIIPattern, merge_entities_with_semantic_units

custom_patterns = [
    PIIPattern(r'\b\d{2}\.\d{2}\.\d{4}\b', 'date', priority=10),  # DD.MM.YYYY
]

result = extract_pii(text, use_smart_merging=True)
# Then manually apply custom patterns

Issue: Wrong label selected

Cause: Dominant label selection picked the wrong type.

Solution: Adjust prefer_model_labels parameter:

from openmed import merge_entities_with_semantic_units

merged = merge_entities_with_semantic_units(
    entities,
    text,
    prefer_model_labels=False  # Prefer regex pattern labels over model
)

Issue: Entities merged incorrectly

Cause: Regex pattern is too broad.

Solution: Make pattern more specific or increase priority of other patterns:

# Bad: Too broad
PIIPattern(r'\b\d+\b', 'number', priority=5)  # Matches everything!

# Good: Specific
PIIPattern(r'\b\d{3}-\d{2}-\d{4}\b', 'ssn', priority=10)

Best Practices

✅ DO

  1. Use smart merging by default for production de-identification
  2. Test with representative data to ensure patterns cover your use cases
  3. Monitor merged entities to verify label selection is correct
  4. Add custom patterns for domain-specific PII formats

❌ DON'T

  1. Don't disable smart merging for production without good reason
  2. Don't use overly broad regex patterns
  3. Don't forget to validate date formats specific to your region
  4. Don't rely solely on regex - the model provides valuable context

Technical Details

Merging Algorithm

1. IDENTIFY semantic units using regex patterns
   ├─ Sort patterns by priority (highest first)
   ├─ Check for overlaps (higher priority wins)
   └─ Store units: [(start, end, entity_type), ...]

2. AGGREGATE model predictions
   ├─ For each semantic unit:
   │   ├─ Find overlapping model predictions
   │   ├─ Calculate dominant label (most frequent)
   │   ├─ If tie: select highest avg confidence
   │   └─ Create merged entity with full span
   └─ Add non-overlapping predictions as-is

3. FINALIZE
   ├─ Sort merged entities by start position
   └─ Return complete entity list

Label Selection Logic

def select_label(predictions):
    # Count frequency
    label_counts = Counter(p.label for p in predictions)
    max_count = max(label_counts.values())

    # Get candidates with max count
    candidates = [l for l, c in label_counts.items() if c == max_count]

    if len(candidates) == 1:
        return candidates[0]

    # Tie-breaker: highest average confidence
    avg_confidences = {
        label: mean(p.confidence for p in predictions if p.label == label)
        for label in candidates
    }
    return max(avg_confidences, key=avg_confidences.get)


Changelog

v0.5.0 (2026-01-12)

  • NEW: Smart entity merging with regex-based semantic unit detection
  • ✨ Added use_smart_merging parameter to extract_pii() and deidentify() (default: True)
  • ✨ Added merge_entities_with_semantic_units() function
  • ✨ Added find_semantic_units() and calculate_dominant_label() utilities
  • ✨ Added comprehensive PII regex patterns (dates, SSN, phone, email, etc.)
  • ✨ Exported merging utilities from openmed package
  • 🐛 FIXED: Fragmented date entities (e.g., '01' + '/15/1970' → '01/15/1970')
  • 🐛 FIXED: Incorrect de-identification output with multiple placeholders per entity
  • 🐛 FIXED: Entity position mismatch when input text has leading/trailing whitespace
  • TESTED: All test cases pass (5/5) - production ready
  • 📚 Added comprehensive documentation and examples