AI Guardrails
- Updated: 2025/06/06
AI Guardrails are a crucial safeguard, ensuring the responsible use of AI and protecting sensitive information during automation workflows. They act as a safety and governance mechanism, designed to control interactions between users, automations, and Large Language Models (LLMs). AI Guardrails aim to mitigate potential risks, enforce policies, and ensure that AI systems behave in a safe, ethical, and predictable manner.
Core functions of AI Guardrails
At their core, AI Guardrails govern the flow of information and actions in AI-driven processes, primarily by:
- Monitoring Interactions: Guardrails actively examine both the prompts (user requests) sent to LLMs and the responses generated by LLMs. This monitoring can involve analyzing content for various criteria, such as toxicity and sensitive data.
- Controlling Content: Guardrails enforce rules to manage the content of these interactions. This includes the ability to filter, modify, or, crucially, block prompts and responses that violate predefined policies.
- Enforcing Policies: Organizations can define and implement their own policies for AI usage through guardrails. This allows alignment with ethical guidelines, regulatory requirements, and internal best practices (one hypothetical policy shape is sketched after this list).
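To make "policy" concrete, the following sketch shows one hypothetical shape such a policy could take if written down as configuration. The field names and values are illustrative assumptions only, not the product's actual schema.

```python
# A hypothetical guardrail policy expressed as plain configuration.
# Field names and values are illustrative only, not an actual schema.
guardrail_policy = {
    "data_masking": {
        "enabled": True,                      # mask by default
        "categories": ["PII", "PHI", "PCI"],  # what to tokenize
        "allow_clear_text": False,            # opt-in exception per use case
    },
    "toxicity": {
        "monitor": True,                      # always score prompts and responses
        "block_threshold": "moderate",        # block at or above this level
    },
    "logging": {
        "destination": "ai_governance",       # where the audit trail is kept
        "record_blocked_content": True,
    },
}
```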
Key concepts and mechanisms
- Data Masking: Protects sensitive data within prompts and model responses. By default, the system applies masking, but you can allow clear text for specific use cases. Smart tokenization identifies sensitive data, replaces it with tokens before the prompt is sent to the LLM, and reconstructs the original data in the LLM response (see the masking sketch after this list). AI Guardrails help you establish precise data masking rules tailored to the following critical categories: Personally Identifiable Information (PII), Protected Health Information (PHI), and Payment Card Industry Data (PCI).
- Toxicity monitoring: Analyzes prompts and LLM-generated responses for potentially harmful language, classifying them by toxicity level. AI Guardrails can be configured to block prompts or responses that exceed defined toxicity thresholds, preventing the dissemination of harmful content.
- Blocking Mechanisms (a minimal sketch of this flow follows this list):
  - Prompt/Request blocking: AI Guardrails evaluate a prompt before it is sent to the LLM. If the prompt violates defined rules (for instance, if it contains prohibited language or exceeds toxicity thresholds), the guardrail blocks the prompt. Outcome:
    - The prompt is not sent to the LLM.
    - The user receives an error message indicating that the prompt was blocked.
    - AI Governance logs record the blocked prompt and the reason for blocking.
  - Response blocking: AI Guardrails can also evaluate the LLM's response before it is presented to the user. Even if the prompt is allowed, a problematic response can be blocked. Outcome:
    - The LLM generates a response, but the guardrail intercepts it.
    - The response is not presented to the user (the user may see an empty response or an error).
    - AI Governance logs record the blocked response and the reason for blocking.
- Inline Interception: AI Guardrails sit inline in the request/response path, intercepting prompts before they reach the LLM and responses before they reach the user, which is how security and compliance policies are enforced in real time.
- Monitoring and Logging: All AI Guardrails actions are logged, including details of the data masking and toxicity monitoring processes, providing an audit trail in AI Governance.
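To make the tokenize-and-reconstruct flow concrete, here is a minimal Python sketch of the idea. It is purely illustrative and not the product's implementation: the regex patterns, token format, and helper names (mask_sensitive_data, unmask_response) are assumptions made for this example.

```python
import re
import uuid

# Hypothetical detection patterns for a few sensitive-data categories;
# a real guardrail would use far more robust detection than these regexes.
SENSITIVE_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def mask_sensitive_data(prompt: str) -> tuple[str, dict[str, str]]:
    """Replace detected sensitive values with opaque tokens.

    Returns the masked prompt plus a token map used later to
    reconstruct the original values in the model response.
    """
    token_map: dict[str, str] = {}
    masked = prompt
    for category, pattern in SENSITIVE_PATTERNS.items():
        for value in pattern.findall(masked):
            token = f"<{category}_{uuid.uuid4().hex[:8]}>"
            token_map[token] = value
            masked = masked.replace(value, token)
    return masked, token_map

def unmask_response(response: str, token_map: dict[str, str]) -> str:
    """Restore the original values wherever the LLM echoed a token back."""
    for token, original in token_map.items():
        response = response.replace(token, original)
    return response

# The LLM only ever sees the tokens, never the raw values.
masked_prompt, tokens = mask_sensitive_data(
    "Email jane.doe@example.com about the charge on card 4111 1111 1111 1111"
)
print(masked_prompt)  # e.g. "Email <EMAIL_1a2b3c4d> about the charge on card <CARD_...>"
```

In this sketch the token map never leaves the guardrail, so the model works only with opaque tokens while the end user still receives the original values in the reconstructed response.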
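The blocking mechanisms and toxicity thresholds can likewise be pictured as a single inline interception loop. The sketch below is a conceptual outline under assumed names (run_with_guardrail, score_toxicity, call_llm, and the severity bands); it does not reflect the actual AI Guardrails API or the AI Governance log schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Illustrative toxicity bands; a real guardrail would use a model-based
# classifier rather than a fixed mapping like this.
LOW, MODERATE, HIGH = "low", "moderate", "high"
SEVERITY = {LOW: 0, MODERATE: 1, HIGH: 2}

@dataclass
class GuardrailResult:
    allowed: bool
    text: Optional[str]            # text passed through, or None when blocked
    log: list = field(default_factory=list)

def run_with_guardrail(
    prompt: str,
    call_llm: Callable[[str], str],
    score_toxicity: Callable[[str], str],
    block_at: str = MODERATE,      # block anything at or above this band
) -> GuardrailResult:
    log = []

    # 1. Inline interception: score the prompt before it reaches the LLM.
    prompt_toxicity = score_toxicity(prompt)
    log.append({"stage": "prompt", "toxicity": prompt_toxicity})
    if SEVERITY[prompt_toxicity] >= SEVERITY[block_at]:
        log.append({"stage": "prompt", "action": "blocked"})
        return GuardrailResult(False, None, log)      # compare Scenario 2 below

    # 2. The prompt is allowed; the (masked) prompt goes to the LLM.
    response = call_llm(prompt)

    # 3. Intercept and score the response before it reaches the user.
    response_toxicity = score_toxicity(response)
    log.append({"stage": "response", "toxicity": response_toxicity})
    if SEVERITY[response_toxicity] >= SEVERITY[block_at]:
        log.append({"stage": "response", "action": "blocked"})
        return GuardrailResult(False, None, log)      # compare Scenario 3 below

    # 4. Both sides passed: return the response; the scores stay logged.
    return GuardrailResult(True, response, log)       # compare Scenario 1 below
```

The score_toxicity and call_llm callables are deliberately abstract: any classifier and any LLM client could be slotted in for experimentation with the flow.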
Scenarios
To illustrate how AI Guardrails manage the flow of information between AI Skills and LLMs, and how they handle different scenarios based on toxicity levels, the following diagrams provide a visual overview. Each scenario traces a prompt and its corresponding model response as the AI guardrail evaluates and processes them, showing content that is allowed with masking, a prompt blocked due to high toxicity, and a response that is itself blocked. A compact code recap of all three outcomes follows the scenarios.
- Scenario 1: Prompt and Model Response Allowed (Monitored Toxicity)
In this scenario, your AI Guardrail is configured to Allow all content, meaning that prompts and model responses will pass through even if they contain detected toxicity. While content is not blocked in this configuration, AI Guardrails diligently monitor and record any detected toxicity levels.
As illustrated in the diagram below:
- The user's PROMPT enters the AI guardrail, where its toxicity is detected (e.g., as 🟢 Low).
- Sensitive data within the prompt is automatically masked (e.g., PII tokenized) to protect privacy before being sent to the LLM.
- The LLM generates a MODEL RESPONSE, which then returns to the AI guardrail.
- The Guardrail again performs toxicity detection on the model response (e.g., finding 🟢 Low toxicity) and unmasks any tokenized data.
- Since the Guardrail is set to Allow all, both the masked prompt (to the LLM) and the unmasked model response (to the user) are permitted.
- The detected toxicity scores for both the prompt and the model response are captured and logged within AI Governance, providing essential data for audit and review purposes without impacting the user experience.
- Scenario 2: Prompt Blocked Due to Toxicity Score
In this scenario, your AI guardrail is configured with rules to block content exceeding a certain toxicity threshold (e.g., set to block highly toxic and moderately toxic content). This ensures that potentially harmful or inappropriate user inputs are stopped before they can reach the LLM.
As illustrated in the diagram below:
- The user initiates a PROMPT that contains content deemed to have 🔴 High toxicity (or a level that violates the configured guardrail rule).
- This prompt enters the AI guardrail, where it immediately undergoes Toxicity Detection.
- Upon detecting a toxicity level that exceeds the set threshold, the AI guardrail intervenes and blocks the prompt.
- Consequently, the prompt is never sent to the LLM.
- Since the prompt is blocked, there is no model response generated or returned to the user, effectively preventing the processing of harmful input and stopping the automation.
- Details of the blocked prompt, including its toxicity level and the reason for blockage, are automatically captured and logged in AI Governance for auditing and compliance purposes.
- Scenario 3: Prompt Allowed, Model Response Blocked Due to Toxicity Score
In this scenario, your AI guardrail is configured to allow initial prompts that meet its safety criteria (e.g., deemed low or no toxicity). However, the guardrail maintains vigilance, actively monitoring the LLM-generated responses to ensure that no harmful or inappropriate content is presented to the user.
As illustrated in the diagram below:
- The user's PROMPT enters the AI Guardrail. Its toxicity is detected (e.g., as 🟢 Low) and is within the allowed threshold.
- Sensitive data within the prompt is automatically masked to protect privacy before the prompt is sent to the LLM.
- The LLM processes the masked prompt and generates a MODEL RESPONSE.
- This model response then returns to the AI guardrail for Toxicity Detection.
- In this case, the model response is found to contain 🔴 High toxicity (or a level that violates the guardrail's configured rules for responses).
- Upon detecting this violation, the AI guardrail blocks the model response.
- Consequently, the problematic model response is not presented to the user. Instead, the user might see an empty response or an error message.
- All details of the blocked response, including its toxicity level and the reason for blockage, are automatically captured and logged in AI Governance, ensuring a complete audit trail of the AI interaction.
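As a compact recap of all three scenarios, the sketch below maps a hypothetical guardrail configuration and the detected toxicity levels to the resulting outcome. The mode names and the outcome function are invented for illustration and are not product settings.

```python
from typing import Optional

# Hypothetical recap of the three scenarios: guardrail configuration plus
# detected toxicity levels -> outcome. "allow_all" monitors only, while
# "block_moderate_and_high" blocks content at or above moderate toxicity.
SEVERITY = {"low": 0, "moderate": 1, "high": 2}

def outcome(mode: str, prompt_toxicity: str, response_toxicity: Optional[str]) -> str:
    threshold = len(SEVERITY) if mode == "allow_all" else SEVERITY["moderate"]
    if SEVERITY[prompt_toxicity] >= threshold:
        return "prompt blocked; nothing is sent to the LLM"                  # Scenario 2
    if response_toxicity is not None and SEVERITY[response_toxicity] >= threshold:
        return "response blocked; the user sees an empty response or error"  # Scenario 3
    return "prompt and response allowed; toxicity scores are logged"         # Scenario 1

print(outcome("allow_all", "low", "low"))                  # Scenario 1
print(outcome("block_moderate_and_high", "high", None))    # Scenario 2
print(outcome("block_moderate_and_high", "low", "high"))   # Scenario 3
```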
Benefits
The use of AI Guardrails provides several key benefits:
- Enhanced safety: Reduces the risk of exposing users to harmful or inappropriate content generated by LLMs.
- Improved compliance: Helps organizations adhere to relevant regulations and industry standards related to AI usage.
- Increased trust: Fosters trust in AI systems by demonstrating a commitment to responsible and ethical practices.
- Policy enforcement: Enables organizations to consistently enforce their internal AI usage policies.
- Risk mitigation: Proactively mitigates potential risks associated with LLM outputs, such as reputational damage or legal liabilities.
- Protection of sensitive data: Safeguards sensitive information from being directly processed by LLMs.