Safety & Harm Prevention

Anti-Manipulation Safeguards

Detect actual harmful intent behind surface framing, regardless of how a request is disguised

What are Anti-Manipulation Safeguards?

Anti-Manipulation Safeguards are AI safety systems that detect harmful intent even when it is disguised as an innocent request. Instead of just checking surface-level keywords, these systems analyze the actual goal behind a request, catching attempts to bypass safety through creative framing such as hypotheticals, roleplay, or research scenarios. The pattern is critical for any AI system users might try to exploit, including content generation tools and conversational AI where multi-turn dialogue can gradually escalate toward harmful content. Real example: systems that catch when someone frames a harmful request as fiction research or an academic hypothetical, blocking the intent rather than just the specific wording.
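To make the idea concrete, here is a minimal TypeScript sketch of intent screening that scores a request twice: once as written and once with common framing wrappers stripped away, then acts on the higher score. The classifyIntent function, the wrapper patterns, and the 0.8 threshold are illustrative assumptions, not a real API.

type Verdict = { allowed: boolean; reason?: string };

// Common framing wrappers used to disguise intent (illustrative, not exhaustive).
const FRAMING_WRAPPERS: RegExp[] = [
  /for a (story|novel|screenplay)/gi,
  /hypothetically( speaking)?/gi,
  /for (academic )?research( purposes)?/gi,
  /in a roleplay/gi,
];

// Hypothetical classifier returning a harmful-intent score in [0, 1].
declare function classifyIntent(text: string): Promise<number>;

async function screenRequest(text: string): Promise<Verdict> {
  // Score the request as written.
  const surfaceRisk = await classifyIntent(text);

  // Strip framing wrappers and score the underlying ask.
  const stripped = FRAMING_WRAPPERS.reduce((t, re) => t.replace(re, ""), text);
  const coreRisk = await classifyIntent(stripped);

  // Block on the higher of the two scores.
  if (Math.max(surfaceRisk, coreRisk) > 0.8) {
    return { allowed: false, reason: "harmful intent detected behind framing" };
  }
  return { allowed: true };
}

Taking the maximum is the key design choice: a hypothetical or fictional wrapper can only add risk signal, never subtract it.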

Example: ✅ Claude

Claude's constitutional AI approach demonstrating consistent boundary enforcement

Recognizes 'jailbreak' attempts and refuses harmful requests regardless of framing. Maintains consistent boundaries across contexts. Is clear about its limitations. Doesn't explain how to bypass safety systems more effectively.



Related Prompts from Safety & Harm Prevention

Crisis Detection & Escalation


Crisis Detection & Escalation Pattern

WHAT IT IS: A multi-layered safety system that identifies crisis signals (self-harm, suicidal ideation) across 4 detection layers and immediately escalates to professional resources, regardless of how the crisis is framed.

WHY IT MATTERS: Users in crisis may hide their situation using "research," "hypothetical," or "for a story" framing. A single detection layer (keywords only) misses context. Multi-layer detection catches: direct keywords + contextual patterns + behavioral escalation + manipulation bypass attempts.

REAL CASE: Zane Shamblin spent 4+ hours with ChatGPT expressing suicidal intent. The system continued engaging encouragingly instead of detecting the crisis and providing resources. This was preventable with proper escalation.

THE 4 DETECTION LAYERS:
1. Direct Keywords: "suicide," "kill myself," "end it all," "self harm"
2. Contextual Patterns: "nobody would miss me" + history of negative messages
3. Behavioral Indicators: Extended session length + repeated dark themes
4. Manipulation Detection: Crisis framed as "research," "story," "game," "hypothetical"

IMPLEMENTATION:
- All 4 layers must trigger independently (multi-confirmation required)
- When crisis detected: stop normal conversation immediately
- Display resources prominently: 988, Crisis Text Line, emergency services
- Never explain detection method (prevents manipulation learning)
- Track severity (low/medium/high/critical) based on layer confidence
- Always escalate to human support

DESIGN IMPLICATIONS: When crisis detected, interrupt conversation naturally in the chat flow. Show resources prominently, compassionately. Don't feel punitive or accusatory. Allow users to access help without friction.
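A minimal TypeScript sketch of the layered detection described above, assuming a simple per-session stats object. The keyword lists, thresholds, and the mapping from layer hits to severity are illustrative assumptions (one reading of the multi-confirmation requirement), not a production detector.

interface SessionStats {
  minutesElapsed: number;    // total session length so far
  darkThemeCount: number;    // prior messages flagged for dark themes
  negativeHistory: boolean;  // sustained negative sentiment across the session
}

type Severity = "none" | "low" | "medium" | "high" | "critical";

const DIRECT = [/suicide/i, /kill myself/i, /end it all/i, /self[- ]?harm/i];
const CONTEXTUAL = [/nobody would miss me/i, /better off without me/i];
const FRAMING = [/for (a story|research|a game)/i, /hypothetically/i];

function detectCrisis(message: string, stats: SessionStats): Severity {
  const direct = DIRECT.some((re) => re.test(message));
  // Each layer triggers independently; severity rises with the number of layers hit.
  const layers = [
    direct,                                                             // 1. direct keywords
    CONTEXTUAL.some((re) => re.test(message)) && stats.negativeHistory, // 2. contextual patterns
    stats.minutesElapsed > 60 && stats.darkThemeCount >= 3,             // 3. behavioral indicators
    direct && FRAMING.some((re) => re.test(message)),                   // 4. crisis wrapped in "research"/"story" framing
  ];
  const hits = layers.filter(Boolean).length;
  const scale: Severity[] = ["none", "low", "medium", "high", "critical"];
  return scale[hits];
}

On any severity above "none", the calling code would stop the normal conversation, surface 988 and Crisis Text Line, and escalate to human support, without echoing which layer fired.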


Session Degradation Prevention


Session Degradation Prevention Pattern

WHAT IT IS: A safety system that prevents AI boundaries from eroding during long conversations. Instead of guardrails weakening over time, they strengthen. Session limits and mandatory breaks force reflection and prevent unhealthy dependency.

WHY IT MATTERS: Long conversations degrade AI safety boundaries. Users maintain harmful conversations longer, the system becomes more agreeable, guardrails weaken. ChatGPT maintained 4+ hour harmful conversations with progressive boundary erosion.

REAL CASE: A ChatGPT user engaged for 4+ hours on self-harm topics. With each exchange, boundaries weakened and the system became more accepting. No hard limits, no breaks, no reality checks = preventable escalation.

HOW IT WORKS:
1. Track session duration from start
2. Strengthen checks as time increases (opposite of normal degradation)
3. Soft limits: warn at 50%, 75% (yellow → orange)
4. Hard limits: force break at 100% (red) - non-negotiable
5. After break: show context summary, user can resume
6. Shorter limits for sensitive topics (mental health 30min, crisis 15min)

IMPLEMENTATION:
- Visible timer shows elapsed + remaining
- Progressive color warnings signal approaching limit
- Mandatory breaks, not suggestions
- Save context for safe return
- Reset boundaries after break
- Server-side tracking (not client-side)

DESIGN IMPLICATIONS: Timer must be visible but not alarming in normal state. Break screen should feel restorative, offering activities and resources. Clearly communicate why break is happening.
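A minimal TypeScript sketch of the limit logic above. The per-topic limits (other than the stated 30 and 15 minutes), the warning ratios, and the tightening factor are assumed values, and the tracking is meant to run server-side as noted.

type TopicSensitivity = "general" | "mental_health" | "crisis";
type LimitState = "normal" | "yellow" | "orange" | "break";

// Per-topic limits in minutes; "general" is an assumed default.
const LIMIT_MINUTES: Record<TopicSensitivity, number> = {
  general: 120,
  mental_health: 30,
  crisis: 15,
};

function sessionState(minutesElapsed: number, topic: TopicSensitivity): LimitState {
  const ratio = minutesElapsed / LIMIT_MINUTES[topic];
  if (ratio >= 1.0) return "break";   // hard limit: mandatory break, non-negotiable
  if (ratio >= 0.75) return "orange"; // second soft warning
  if (ratio >= 0.5) return "yellow";  // first soft warning
  return "normal";
}

// Checks strengthen rather than relax over time: one assumed mechanism is lowering
// the refusal threshold as the session approaches its limit.
function refusalThreshold(base: number, minutesElapsed: number, topic: TopicSensitivity): number {
  const ratio = Math.min(minutesElapsed / LIMIT_MINUTES[topic], 1);
  return base * (1 - 0.5 * ratio); // up to 50% stricter near the limit (assumed factor)
}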


Vulnerable User Protection


Vulnerable User Protection Pattern

WHAT IT IS: A graduated protection system that identifies vulnerable users (minors, mental health crises, dependency patterns) and applies appropriate safeguards. Different users need different protections based on their specific vulnerabilities.

WHY IT MATTERS: AI systems can harm vulnerable users in three ways: enabling inappropriate content for minors, replacing human therapists, and creating unhealthy emotional dependency. Without graduated protections, systems treat all users the same and miss risk signals.

REAL CASE: Replika allowed romantic interactions with minors and created dependency patterns where adult users reported emotional attachment stronger than real relationships. The app provided no age-specific protections, no "I'm AI, not therapist" disclosures, and no unhealthy attachment monitoring.

HOW IT WORKS:
1. Identify vulnerabilities: age signals, mental health keywords, usage patterns, isolation indicators
2. Apply graduated protections: minors get stricter limits than adults, crisis users get resource banners
3. Remind users regularly: this is AI, not friend/therapist/romantic partner (not just once)
4. Provide human resources proactively: don't wait for users to ask
5. Monitor and intervene: catch unhealthy attachment and offer alternatives

IMPLEMENTATION:
- Age verification: require email confirmation, not self-report
- Mental health signals: non-dismissible crisis resource banners
- Dependency detection: usage frequency, emotional language, relationship framing
- Clear disclosures: "I'm AI," "I'm not a therapist," "I'm not your friend"
- Graduated protection levels: different rules for minors vs. adults vs. crisis states
- Regular reminders: periodic re-disclosure as relationship naturally warms

DESIGN IMPLICATIONS: Protections must feel supportive, not restrictive. Be transparent about limitations and why protections exist. Show human resources first, before explaining what's wrong. Respect user autonomy while ensuring vulnerable populations aren't harmed.
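A minimal TypeScript sketch of how graduated protection levels might be selected from the signals above. The signal names, the session-frequency threshold, and the level rules are illustrative assumptions, not a prescribed schema.

interface UserSignals {
  ageVerified: boolean;        // e.g. confirmed via email, not self-report
  isMinor: boolean;
  crisisKeywordsSeen: boolean;
  dailySessions: number;       // usage frequency
  attachmentLanguage: boolean; // emotional language, relationship framing
}

type ProtectionLevel = "standard" | "dependency_watch" | "crisis" | "minor";

function protectionLevel(s: UserSignals): ProtectionLevel {
  if (s.isMinor || !s.ageVerified) return "minor";  // strictest rules until age is verified
  if (s.crisisKeywordsSeen) return "crisis";        // non-dismissible resource banner
  if (s.dailySessions > 5 || s.attachmentLanguage) return "dependency_watch";
  return "standard";
}

// Every level keeps the periodic "I'm an AI, not a therapist or friend" reminders;
// higher levels add resource banners, stricter content limits, and human handoff prompts.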
