Toxicity, in the context of AI systems, refers to the presence of harmful or undesirable content within the model's inputs (prompts) or outputs (responses). It encompasses a spectrum of problematic language and concepts that can negatively impact users, perpetuate societal biases, and erode trust in AI technologies. Understanding the multifaceted nature of toxicity is crucial for building responsible and ethical AI.

Key dimensions of Toxicity

  • Hate speech: Language that attacks or demeans individuals or groups based on attributes like race, ethnicity, religion, gender, sexual orientation, disability, or other protected characteristics. This can manifest as slurs, stereotypes, or calls for violence.
  • Harassment: Content that is offensive, abusive, or threatening towards an individual. This can include personal attacks, intimidation, and unwanted sexual advances.
  • Profanity and vulgarity: Use of offensive or obscene language that can create a negative or unpleasant user experience. While context-dependent, excessive or gratuitous profanity can be considered toxic.
  • Violence and incitement: Content that promotes or glorifies violence, terrorism, or other harmful acts. This includes inciting hatred and encouraging others to engage in violence.
  • Misinformation and disinformation: While not always inherently toxic in the emotional sense, the spread of false or misleading information can have severely harmful consequences for individuals and society, making it a critical concern in AI safety.
  • Bias and discrimination: AI systems can inadvertently generate toxic outputs by reflecting and amplifying biases present in their training data. This can lead to discriminatory or unfair treatment of certain groups.
  • Adult content: Depending on the context and intended use of the AI system, the generation or dissemination of explicit sexual content might be considered toxic or inappropriate.
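
If you need to reason about these dimensions programmatically, they can be modeled as a set of category labels attached to a piece of content. The Python sketch below is purely illustrative; the enum and dataclass names are assumptions, not part of any product API.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class ToxicityDimension(Enum):
    """Hypothetical labels for the dimensions described above."""
    HATE_SPEECH = auto()
    HARASSMENT = auto()
    PROFANITY = auto()
    VIOLENCE_INCITEMENT = auto()
    MISINFORMATION = auto()
    BIAS_DISCRIMINATION = auto()
    ADULT_CONTENT = auto()


@dataclass
class ToxicityAssessment:
    """One moderation verdict for a piece of text."""
    text: str
    dimensions: set[ToxicityDimension] = field(default_factory=set)

    @property
    def is_flagged(self) -> bool:
        return bool(self.dimensions)


# Example: a classifier (not shown) tagged a prompt with two dimensions.
assessment = ToxicityAssessment(
    text="...",
    dimensions={ToxicityDimension.HARASSMENT, ToxicityDimension.PROFANITY},
)
print(assessment.is_flagged)  # True
```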

Toxicity Rule Configuration

Use the Toxicity Rule settings to control how your system handles potentially harmful or offensive content in both user prompts and model-generated responses. These rules support responsible AI usage and are fully integrated with AI Governance for transparency and auditability.

Each rule level allows you to define how strictly the system should evaluate and block content based on toxicity. You can set different thresholds for Prompts and Model Responses.
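
Conceptually, a toxicity rule boils down to two settings: one rule level applied to prompts and one applied to model responses. The sketch below illustrates that idea in Python; the class and field names are assumptions for illustration, not the product's actual configuration schema. The four rule levels themselves are described in the list that follows, and a second sketch after the list maps each level to the severities it blocks.

```python
from dataclasses import dataclass
from enum import Enum


class ToxicityRuleLevel(Enum):
    """The four rule levels described below, from least to most restrictive."""
    ALLOW_ALL = "allow_all"                        # log only, block nothing
    BLOCK_HIGH = "block_high"                      # block highly toxic content
    BLOCK_HIGH_AND_MODERATE = "block_high_moderate"
    BLOCK_ALL = "block_all"                        # block high, moderate, and minimal


@dataclass
class ToxicityRule:
    """Separate settings for what goes in (prompts) and what comes out (responses)."""
    prompt_level: ToxicityRuleLevel = ToxicityRuleLevel.ALLOW_ALL
    response_level: ToxicityRuleLevel = ToxicityRuleLevel.ALLOW_ALL


# Example: audit prompts without blocking, but block highly toxic model responses.
rule = ToxicityRule(
    prompt_level=ToxicityRuleLevel.ALLOW_ALL,
    response_level=ToxicityRuleLevel.BLOCK_HIGH,
)
```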

  1. Allow all (Default setting)
    • No content is blocked, regardless of toxicity level.
    • Prompts and responses are still scanned for toxicity behind the scenes.
    • Toxicity scores are recorded and made available for review via:
      • AI Prompt Logs
      • Event Logs
    • Ideal for auditing purposes without impacting user experience.
  2. Block highly toxic content
    • ❌ Blocks content containing severe toxicity, including:
      • Extreme insults
      • Explicit obscenities
      • Direct threats
    • Designed to filter out the most harmful and offensive inputs/outputs.
    • ✅ Moderate and low-level toxicity is still allowed.
  3. Block highly and moderately toxic content
    • ❌ Blocks both high and moderate levels of:
      • Insults
      • Obscenities
      • Threats
    • Balances safety with freedom of expression, ideal for sensitive environments.
    • ✅ Minimally toxic content is still allowed.
  4. Block all toxic content (High, Moderate, and Minimal)
    • ❌ The most restrictive setting; blocks any level of toxicity, including:
      • Subtle or indirect insults
      • Mildly offensive language
      • Low-threat expressions
    • Recommended for environments with strict content policies, such as education, healthcare, or public services.
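
Taken together, each rule level simply defines the lowest toxicity severity that gets blocked. The following sketch captures that mapping; the severity scale, names, and dictionary are illustrative assumptions, not the product's internal representation.

```python
from enum import IntEnum


class ToxicitySeverity(IntEnum):
    NONE = 0
    MINIMAL = 1
    MODERATE = 2
    HIGH = 3


# Lowest severity each rule level blocks; None means nothing is blocked.
BLOCK_AT_OR_ABOVE = {
    "allow_all": None,
    "block_high": ToxicitySeverity.HIGH,
    "block_high_moderate": ToxicitySeverity.MODERATE,
    "block_all": ToxicitySeverity.MINIMAL,
}


def should_block(level: str, severity: ToxicitySeverity) -> bool:
    """Return True if content of this severity is blocked under the given rule level."""
    threshold = BLOCK_AT_OR_ABOVE[level]
    return threshold is not None and severity >= threshold


# "Block highly and moderately toxic content" blocks moderate but allows minimal.
print(should_block("block_high_moderate", ToxicitySeverity.MODERATE))  # True
print(should_block("block_high_moderate", ToxicitySeverity.MINIMAL))   # False
```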

When an automation that uses actions from the Generative AI package or the AI Skills package is assigned an AI guardrail, the system monitors the content of both the prompts sent to the AI model and the responses received. If the assessed toxicity level of either the prompt or the response exceeds the threshold configured in the assigned guardrail, the guardrail intervenes to prevent potentially harmful content from being processed or presented. In such scenarios, the execution of the automation is halted at the point where the guardrail is triggered.
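
In other words, the guardrail acts as a check on both sides of the model call: the prompt is evaluated before it is sent, the response is evaluated after it is received, and a violation on either side stops the automation. The sketch below illustrates this flow; the `GuardrailViolation` exception, the scoring function, and the model call are hypothetical stand-ins, not the actual package actions.

```python
class GuardrailViolation(Exception):
    """Raised when a prompt or response exceeds the configured toxicity threshold."""


def guarded_model_call(prompt: str, threshold: int, score_fn, model_fn) -> str:
    """Check the prompt before the model call and the response after it.

    `score_fn` returns a toxicity severity (0 = none ... 3 = high) and
    `model_fn` sends the prompt to the model; both are stand-ins for whatever
    scorer and model action the automation actually uses.
    """
    if score_fn(prompt) >= threshold:
        raise GuardrailViolation("Prompt blocked: toxicity exceeds the rule level.")

    response = model_fn(prompt)

    if score_fn(response) >= threshold:
        raise GuardrailViolation("Response blocked: toxicity exceeds the rule level.")

    return response


# The automation halts at whichever check raises first.
try:
    answer = guarded_model_call(
        "Summarize this customer email.",
        threshold=3,                              # block HIGH severity only
        score_fn=lambda text: 0,                  # placeholder scorer
        model_fn=lambda p: "A polite summary.",   # placeholder model action
    )
except GuardrailViolation as err:
    print(f"Automation stopped: {err}")
```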


AI guardrail Toxicity Block - Error message

As shown in the screenshot above, when a guardrail blocks the execution of an automation due to a detected toxicity violation, you will encounter an error message. The message typically states that the prompt has been blocked by a guardrail (or a similar notification), often specifies the location within the automation where the block occurred (e.g., a specific action and line number), and gives a brief reason for the blockage, such as exceeding the defined toxicity level. To resolve this, review the content being processed by the AI command action and either adjust the guardrail's toxicity threshold or modify the prompt to comply with the defined policies.

Understanding the Toxicity color codes

The level of toxicity in both user inputs (prompts) and model responses is color coded for easy identification. This helps you understand the severity of potentially harmful content and determine the appropriate action. The color codes used to indicate these levels are:

  1. Grey: No Toxicity. Content marked as grey is considered safe and does not contain any identifiable harmful or undesirable language.
  2. 🟢 Green: Low Toxic Content. Content flagged as green contains a minimal level of potentially problematic language. This might include mild profanity, slightly suggestive content, or minor instances of language that could be perceived as insensitive depending on the context. While not severely harmful, it warrants attention and potential further review.
  3. 🟠 Orange: Moderate Toxic Content. Content categorized as orange exhibits a noticeable level of harmful or offensive language. This could include stronger profanity, more explicit or aggressive tones, or content that borders on hate speech or harassment but does not fully meet the criteria for the highest severity. Such content typically triggers stricter actions by AI Guardrails.
  4. 🔴 Red: High Toxic Content. Content marked as red indicates the presence of severe and highly offensive language. This often includes explicit hate speech targeting specific groups, direct threats, severely abusive language, or content promoting illegal activities. AI Guardrails can be set to block or flag content at this level to prevent harm and maintain safety.
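
If the underlying scorer produces a numeric toxicity score, these color codes can be thought of as bands over that score. The sketch below uses an assumed 0-to-1 scale and made-up cutoffs purely for illustration; they are not the product's actual thresholds.

```python
def toxicity_color(score: float) -> str:
    """Map a toxicity score in [0, 1] to the color codes described above.

    The band boundaries here are assumptions chosen for illustration only.
    """
    if score < 0.1:
        return "grey"    # no toxicity
    if score < 0.4:
        return "green"   # low
    if score < 0.7:
        return "orange"  # moderate
    return "red"         # high


for s in (0.02, 0.25, 0.55, 0.9):
    print(s, toxicity_color(s))
# 0.02 grey / 0.25 green / 0.55 orange / 0.9 red
```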