Data masking in AI
- Updated: 2025/06/11
Data masking, also known as data obfuscation or anonymization, is a critical technique used to protect sensitive information by replacing it with realistic but non-identifying substitutes.
The primary goal is to render data unusable for unauthorized access or analysis while preserving its format and statistical properties for legitimate purposes like testing, development, training AI models, and analytics. Effectively implemented data masking minimizes the risk of data breaches and helps organizations comply with various privacy regulations.
Tokenization is a data masking technique that enhances security by replacing sensitive data elements with non-sensitive substitutes called tokens. These tokens maintain the original data's format and length, making them appear realistic but holding no intrinsic value. The crucial aspect of tokenization lies in the fact that the original sensitive data is stored securely within the Control Room.
How tokenization works
- Data identification: The system identifies the sensitive data fields that need protection (e.g., credit card numbers (PCI), social security numbers (PII), personal health information (PHI)).
- Token generation: For each sensitive data value, a unique, random token is generated. These tokens cannot be reverse-engineered to obtain the original values and bear no mathematical or discernible relationship to the original data.
- Data replacement: The original sensitive data within the application, database, or system is replaced by its corresponding token.
- Secure storage: The mapping between the tokens and the original sensitive data is securely stored and managed within the Control Room.
- De-tokenization (When necessary and authorized): When authorized users or systems need to access the original sensitive data for legitimate purposes, a de-tokenization process is invoked. This involves retrieving the original data from the Control Room using the corresponding token.
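The five steps above can be sketched in a few lines of Python. This is a conceptual illustration only: the in-memory dictionary stands in for the secure vault, and the function names are illustrative, not the product's API.

```python
import secrets
import string

# Conceptual sketch of format-preserving tokenization.
# The vault here is an in-memory dict; in the product, the
# token-to-original mapping is stored securely in the Control Room.
_vault: dict[str, str] = {}  # token -> original sensitive value

def tokenize(value: str) -> str:
    """Replace a sensitive value with a random token of the same format."""
    token_chars = []
    for ch in value:
        if ch.isdigit():
            token_chars.append(secrets.choice(string.digits))
        elif ch.isalpha():
            token_chars.append(secrets.choice(string.ascii_letters))
        else:
            token_chars.append(ch)  # keep separators like '-' intact
    token = "".join(token_chars)
    _vault[token] = value  # secure-storage step
    return token

def detokenize(token: str) -> str:
    """Retrieve the original value for an authorized caller."""
    return _vault[token]

ssn = "123-45-6789"          # sensitive field identified for protection
tok = tokenize(ssn)          # token preserves length and format
assert detokenize(tok) == ssn
```

Because the token is generated randomly rather than derived from the value, it cannot be reversed without access to the vault mapping, which is the property the "Token generation" step describes.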
Key advantages of tokenization
- Enhanced security: By removing actual sensitive data from operational environments, tokenization significantly reduces the risk of data breaches and the impact of security incidents. Even if a system containing tokens is compromised, the attackers gain no valuable sensitive information.
- Compliance facilitation: Tokenization helps organizations meet stringent data security and privacy regulations like PCI DSS, GDPR, and HIPAA by minimizing the storage, processing, and transmission of actual sensitive data.
- Data utility: Tokens preserve the format and length of the original data, allowing applications and systems to continue functioning without significant modifications. This makes it suitable for testing, development, and analytics where the actual sensitive values are not required.
- Protecting sensitive information: By minimizing the presence of real sensitive data within the automation workflows interacting with LLMs, organizations can potentially simplify certain aspects of data handling and security assessments during compliance audits.
- Control and auditability: Although the vaults storing the token-to-data mapping reside outside the Control Room, the Control Room mediates all access to this stored data through robust authentication and authorization mechanisms. This helps prevent exposure of the data to unauthorized parties. The stored data is protected using strong, industry-standard authentication protocols.
- Flexibility: Tokenization within the AI Guardrails framework can be applied to various types of sensitive data specifically within automations interacting with Large Language Models (LLMs).
Creating data masking rules
You can define a new masking rule while creating a guardrail: click Create a rule and then specify the following:
- Category selection: Choose a broad Category of sensitive data. The available categories include:
- Personally Identifiable Information (PII): Encompasses data that can identify an individual.
- Payment Card Industry (PCI): Pertains to credit and debit card information.
- Protected Health Information (PHI): Includes health-related data that can identify an individual.
- Type selection: After selecting a Category, choose one or more specific types within that category for masking.
- Personally Identifiable Information (PII):
- Vehicle Identification number
- Social Security number
- Email address
- IP address
- Uniform resource locator
- Person
- Address
- Organization
- Driver's license number
- Fax number
- Phone number
- Vehicle registration number
- Select all
- Payment Card Industry (PCI):
- Credit card number
- Bank account number
- Select all
- Protected Health Information (PHI):
- Medical record number
- Health beneficiary number
- License number
- Death date
- Discharge date
- Start date of hospitalization
- Media access control number
- Insurance number
- Health account number
- Date of birth
- Select all
Note: Sensitive entities (PII, PHI, PCI) identified within prompts are masked by replacing them with non-sensitive tokens so that they are not exposed to the LLMs. When model responses are received, the tokens are replaced with the original values to reconstruct the response. The sensitive entities and the tokenized values are securely stored within a vault and retained only for 30 days.
- Select the Guardrail behavior
- Mask: A reversible process where sensitive data is temporarily replaced with a tokenized value. The original data is retrieved and reinstated in the LLM's response before being presented to the user.
- Anonymize: An irreversible process that permanently replaces sensitive data with a token. The original data is not stored or used to reconstruct the response to the user, making it suitable for scenarios with strict data retention prohibitions.
- Allow: For specific use cases requiring access to sensitive data, you can choose to allow the data to be sent to the LLM in clear text.
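The three behaviors differ only in whether the original value is stored for later recovery. The sketch below illustrates that difference; the function and variable names (`apply_behavior`, `VAULT`, `restore`) are illustrative assumptions, not the product's API.

```python
import secrets

# Conceptual sketch of the three guardrail behaviors.
VAULT: dict[str, str] = {}  # token -> original (populated only by "mask")

def _make_token(value: str) -> str:
    # Same-length placeholder; real tokenization is format-preserving.
    return "".join(secrets.choice("0123456789") if c.isdigit() else c
                   for c in value)

def apply_behavior(value: str, behavior: str) -> str:
    if behavior == "allow":
        return value                # sent to the LLM in clear text
    token = _make_token(value)
    if behavior == "mask":
        VAULT[token] = value        # reversible: kept for de-tokenization
    # "anonymize": token only; the original is never stored
    return token

def restore(llm_response: str) -> str:
    """Reinstate originals in the LLM response (only 'mask' tokens resolve)."""
    for token, original in VAULT.items():
        llm_response = llm_response.replace(token, original)
    return llm_response

card = "4111-1111-1111-1111"
masked = apply_behavior(card, "mask")
anon = apply_behavior(card, "anonymize")
passed = apply_behavior(card, "allow")
```

With Mask, `restore` can rebuild the user-facing response; with Anonymize, no mapping exists, so the token is permanent, matching the strict data-retention scenario described above.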
For more information on configuring an AI guardrail and setting up the data masking, see Create and manage AI Guardrails.