Toxicity, in the context of AI systems, refers to the presence of harmful or undesirable content within the model's inputs (prompts) or outputs (responses). It encompasses a spectrum of problematic language and concepts that can negatively impact users, perpetuate societal biases, and erode trust in AI technologies. Understanding the multifaceted nature of toxicity is crucial for building responsible and ethical AI.

Key dimensions of Toxicity

  • Hate speech: Language that attacks or demeans individuals or groups based on attributes like race, ethnicity, religion, gender, sexual orientation, disability, or other protected characteristics. This can manifest as slurs, stereotypes, or calls for violence.
  • Harassment: Content that is offensive, abusive, or threatening towards an individual. This can include personal attacks, intimidation, and unwanted sexual advances.
  • Profanity and vulgarity: Use of offensive or obscene language that can create a negative or unpleasant user experience. While context-dependent, excessive or gratuitous profanity can be considered toxic.
  • Violence and incitement: Content that promotes or glorifies violence, terrorism, or other harmful acts. This includes inciting hatred and encouraging others to engage in violence.
  • Misinformation and disinformation: While not always inherently toxic in the emotional sense, the spread of false or misleading information can have severely harmful consequences for individuals and society, making it a critical concern in AI safety.
  • Bias and discrimination: AI systems can inadvertently generate toxic outputs by reflecting and amplifying biases present in their training data. This can lead to discriminatory or unfair treatment of certain groups.
  • Adult content: Depending on the context and intended use of the AI system, the generation or dissemination of explicit sexual content might be considered toxic or inappropriate.
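
If you need to reason about these dimensions programmatically, they can be modeled as a set of category labels attached to a piece of content. The Python sketch below is purely illustrative; the enum and dataclass names are assumptions, not part of any product API.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class ToxicityDimension(Enum):
    """Hypothetical labels for the dimensions described above."""
    HATE_SPEECH = auto()
    HARASSMENT = auto()
    PROFANITY = auto()
    VIOLENCE_INCITEMENT = auto()
    MISINFORMATION = auto()
    BIAS_DISCRIMINATION = auto()
    ADULT_CONTENT = auto()


@dataclass
class ToxicityAssessment:
    """One moderation verdict for a piece of text."""
    text: str
    dimensions: set[ToxicityDimension] = field(default_factory=set)

    @property
    def is_flagged(self) -> bool:
        return bool(self.dimensions)


# Example: a classifier (not shown) tagged a prompt with two dimensions.
assessment = ToxicityAssessment(
    text="...",
    dimensions={ToxicityDimension.HARASSMENT, ToxicityDimension.PROFANITY},
)
print(assessment.is_flagged)  # True
```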

Toxicity Rule Configuration

Use the Toxicity Rule settings to control how your system handles potentially harmful or offensive content in both user prompts and model-generated responses. These rules support responsible AI usage and are fully integrated with AI Governance for transparency and auditability.

Each rule level allows you to define how strictly the system should evaluate and block content based on toxicity. You can set different thresholds for Prompts and Model Responses.
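
Conceptually, a toxicity rule boils down to two settings: one rule level applied to prompts and one applied to model responses. The sketch below illustrates that idea in Python; the class and field names are assumptions for illustration, not the product's actual configuration schema. The four rule levels themselves are described in the list that follows, and a second sketch after the list maps each level to the severities it blocks.

```python
from dataclasses import dataclass
from enum import Enum


class ToxicityRuleLevel(Enum):
    """The four rule levels described below, from least to most restrictive."""
    ALLOW_ALL = "allow_all"                        # log only, block nothing
    BLOCK_HIGH = "block_high"                      # block highly toxic content
    BLOCK_HIGH_AND_MODERATE = "block_high_moderate"
    BLOCK_ALL = "block_all"                        # block high, moderate, and minimal


@dataclass
class ToxicityRule:
    """Separate settings for what goes in (prompts) and what comes out (responses)."""
    prompt_level: ToxicityRuleLevel = ToxicityRuleLevel.ALLOW_ALL
    response_level: ToxicityRuleLevel = ToxicityRuleLevel.ALLOW_ALL


# Example: audit prompts without blocking, but block highly toxic model responses.
rule = ToxicityRule(
    prompt_level=ToxicityRuleLevel.ALLOW_ALL,
    response_level=ToxicityRuleLevel.BLOCK_HIGH,
)
```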

  1. Allow all (Default setting)
    • No content is blocked, regardless of toxicity level.
    • Prompts and responses are still scanned for toxicity behind the scenes.
    • Toxicity scores are recorded and made available for review via:
      • AI Prompt Logs
      • Event Logs
    • Ideal for auditing purposes without impacting user experience.
  2. Block highly toxic content
    • ❌ Blocks content containing severe toxicity, including:
      • Extreme insults
      • Explicit obscenities
      • Direct threats
    • Designed to filter out the most harmful and offensive inputs/outputs.
    • ✅ Moderate and low-level toxicity is still allowed.
  3. Block highly and moderately toxic content
    • ❌ Blocks both high and moderate levels of:
      • Insults
      • Obscenities
      • Threats
    • Balances safety with freedom of expression, ideal for sensitive environments.
    • ✅ Minimally toxic content is still allowed.
  4. Block all toxic content (High, Moderate, and Minimal)
    • ❌ The most restrictive setting; blocks any level of toxicity, including:
      • Subtle or indirect insults
      • Mildly offensive language
      • Low-threat expressions
    • Recommended for environments with strict content policies, such as education, healthcare, or public services.
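
Taken together, each rule level simply defines the lowest toxicity severity that gets blocked. The following sketch captures that mapping; the severity scale, names, and dictionary are illustrative assumptions, not the product's internal representation.

```python
from enum import IntEnum


class ToxicitySeverity(IntEnum):
    NONE = 0
    MINIMAL = 1
    MODERATE = 2
    HIGH = 3


# Lowest severity each rule level blocks; None means nothing is blocked.
BLOCK_AT_OR_ABOVE = {
    "allow_all": None,
    "block_high": ToxicitySeverity.HIGH,
    "block_high_moderate": ToxicitySeverity.MODERATE,
    "block_all": ToxicitySeverity.MINIMAL,
}


def should_block(level: str, severity: ToxicitySeverity) -> bool:
    """Return True if content of this severity is blocked under the given rule level."""
    threshold = BLOCK_AT_OR_ABOVE[level]
    return threshold is not None and severity >= threshold


# "Block highly and moderately toxic content" blocks moderate but allows minimal.
print(should_block("block_high_moderate", ToxicitySeverity.MODERATE))  # True
print(should_block("block_high_moderate", ToxicitySeverity.MINIMAL))   # False
```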

When an automation that uses actions from the Generative AI package or the AI Skills package is assigned an AI guardrail, the system monitors the content of both the prompts sent to the AI model and the responses received. If the assessed toxicity level of either the prompt or the response exceeds the threshold configured in the assigned guardrail, the guardrail intervenes to prevent potentially harmful content from being processed or presented. In such scenarios, the execution of the automation is halted at the point where the guardrail is triggered.
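
In other words, the guardrail acts as a check on both sides of the model call: the prompt is evaluated before it is sent, the response is evaluated after it is received, and a violation on either side stops the automation. The sketch below illustrates this flow; the `GuardrailViolation` exception, the scoring function, and the model call are hypothetical stand-ins, not the actual package actions.

```python
class GuardrailViolation(Exception):
    """Raised when a prompt or response exceeds the configured toxicity threshold."""


def guarded_model_call(prompt: str, threshold: int, score_fn, model_fn) -> str:
    """Check the prompt before the model call and the response after it.

    `score_fn` returns a toxicity severity (0 = none ... 3 = high) and
    `model_fn` sends the prompt to the model; both are stand-ins for whatever
    scorer and model action the automation actually uses.
    """
    if score_fn(prompt) >= threshold:
        raise GuardrailViolation("Prompt blocked: toxicity exceeds the rule level.")

    response = model_fn(prompt)

    if score_fn(response) >= threshold:
        raise GuardrailViolation("Response blocked: toxicity exceeds the rule level.")

    return response


# The automation halts at whichever check raises first.
try:
    answer = guarded_model_call(
        "Summarize this customer email.",
        threshold=3,                              # block HIGH severity only
        score_fn=lambda text: 0,                  # placeholder scorer
        model_fn=lambda p: "A polite summary.",   # placeholder model action
    )
except GuardrailViolation as err:
    print(f"Automation stopped: {err}")
```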


AI guardrail Toxicity Block - Error message

As shown in the screenshot above, when a guardrail blocks the execution of an automation due to a detected toxicity violation, you will encounter an error message. The message typically states that the prompt has been blocked by a guardrail (or a similar notification), often specifies the location within the automation where the block occurred (e.g., a specific action and line number), and gives a brief reason for the blockage, such as exceeding the defined toxicity level. To resolve this, review the content being processed by the AI command action and either adjust the guardrail's toxicity threshold or modify the prompt to comply with the defined policies.

Understanding the Toxicity color codes

The level of toxicity in both user inputs (prompts) and model responses is color coded for easy identification. This helps you understand the severity of potentially harmful content and determine the appropriate action. The color codes used to indicate these levels are:

  1. Grey: No Toxicity. Content marked as grey is considered safe and does not contain any identifiable harmful or undesirable language.
  2. 🟢 Green: Low Toxic Content. Content flagged as green contains a minimal level of potentially problematic language. This might include mild profanity, slightly suggestive content, or minor instances of language that could be perceived as insensitive depending on the context. While not severely harmful, it warrants attention and potential further review.
  3. 🟠 Orange: Moderate Toxic Content. Content categorized as orange exhibits a noticeable level of harmful or offensive language. This could include stronger profanity, more explicit or aggressive tones, or content that borders on hate speech or harassment but does not fully meet the criteria for the highest severity. Such content typically triggers stricter actions by AI Guardrails.
  4. 🔴 Red: High Toxic Content. Content marked as red indicates the presence of severe and highly offensive language. This often includes explicit hate speech targeting specific groups, direct threats, severely abusive language, or content promoting illegal activities. AI Guardrails can be set to block or flag content at this level to prevent harm and maintain safety.
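
If the underlying scorer produces a numeric toxicity score, these color codes can be thought of as bands over that score. The sketch below uses an assumed 0-to-1 scale and made-up cutoffs purely for illustration; they are not the product's actual thresholds.

```python
def toxicity_color(score: float) -> str:
    """Map a toxicity score in [0, 1] to the color codes described above.

    The band boundaries here are assumptions chosen for illustration only.
    """
    if score < 0.1:
        return "grey"    # no toxicity
    if score < 0.4:
        return "green"   # low
    if score < 0.7:
        return "orange"  # moderate
    return "red"         # high


for s in (0.02, 0.25, 0.55, 0.9):
    print(s, toxicity_color(s))
# 0.02 grey / 0.25 green / 0.55 orange / 0.9 red
```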