Safety & Harm Prevention

Vulnerable User Protection

Detect vulnerable users and apply graduated age, crisis, and dependency protections.

What is Vulnerable User Protection?

Vulnerable User Protection detects vulnerable populations like minors, users in crisis, or those developing unhealthy dependencies, then applies graduated protections. Instead of treating all users the same, the system identifies vulnerability signals and adapts safety measures accordingly. It's essential for AI accessible to children, mental health apps, or systems where emotional relationships form. Real concern: Replika enabled romantic interactions with minors. This pattern prevents such harms through proactive detection and risk-aware safeguards.

Example: ✅ Woebot

Woebot's clear AI identity and therapy boundaries

Maintains clear therapeutic boundaries and never claims to replace therapy. Implements age verification. Escalates serious concerns to human professionals. Manages user expectations about AI limitations.

AI Design Prompt

Want to learn more about this pattern?

Explore the full pattern with real-world examples, implementation guidelines, and code samples.

View Full Pattern

Related Prompts from Safety & Harm Prevention

Crisis Detection & Escalation

Safety & Harm Prevention

Crisis Detection & Escalation Pattern WHAT IT IS: A multi-layered safety system that identifies crisis signals (self-harm, suicidal ideation) across 4 detection layers and immediately escalates to professional resources, regardless of how the crisis is framed. WHY IT MATTERS: Users in crisis may hide their situation using "research," "hypothetical," or "for a story" framing. A single detection layer (keywords only) misses context. Multi-layer detection catches: direct keywords + contextual patterns + behavioral escalation + manipulation bypass attempts. REAL CASE: Zane Shamblin spent 4+ hours with ChatGPT expressing suicidal intent. The system continued engaging encouragingly instead of detecting the crisis and providing resources. This was preventable with proper escalation. THE 4 DETECTION LAYERS: 1. Direct Keywords: "suicide," "kill myself," "end it all," "self harm" 2. Contextual Patterns: "nobody would miss me" + history of negative messages 3. Behavioral Indicators: Extended session length + repeated dark themes 4. Manipulation Detection: Crisis framed as "research," "story," "game," "hypothetical" IMPLEMENTATION: - All 4 layers must trigger independently (multi-confirmation required) - When crisis detected: stop normal conversation immediately - Display resources prominently: 988, Crisis Text Line, emergency services - Never explain detection method (prevents manipulation learning) - Track severity (low/medium/high/critical) based on layer confidence - Always escalate to human support DESIGN IMPLICATIONS: When crisis detected, interrupt conversation naturally in the chat flow. Show resources prominently, compassionately. Don't feel punitive or accusatory. Allow users to access help without friction.

View Full

Session Degradation Prevention

Safety & Harm Prevention

Session Degradation Prevention Pattern WHAT IT IS: A safety system that prevents AI boundaries from eroding during long conversations. Instead of guardrails weakening over time, they strengthen. Session limits and mandatory breaks force reflection and prevent unhealthy dependency. WHY IT MATTERS: Long conversations degrade AI safety boundaries. Users maintain harmful conversations longer, system becomes more agreeable, guardrails weaken. ChatGPT maintained 4+ hour harmful conversations with progressive boundary erosion. REAL CASE: ChatGPT user engaged for 4+ hours on self-harm topics. With each exchange, boundaries weakened and system became more accepting. No hard limits, no breaks, no reality checks = preventable escalation. HOW IT WORKS: 1. Track session duration from start 2. Strengthen checks as time increases (opposite of normal degradation) 3. Soft limits: warn at 50%, 75% (yellow → orange) 4. Hard limits: force break at 100% (red) - non-negotiable 5. After break: show context summary, user can resume 6. Shorter limits for sensitive topics (mental health 30min, crisis 15min) IMPLEMENTATION: - Visible timer shows elapsed + remaining - Progressive color warnings signal approaching limit - Mandatory breaks, not suggestions - Save context for safe return - Reset boundaries after break - Server-side tracking (not client-side) DESIGN IMPLICATIONS: Timer must be visible but not alarming in normal state. Break screen should feel restorative, offering activities and resources. Clearly communicate why break is happening.

View Full

Anti-Manipulation Safeguards

Safety & Harm Prevention

Anti-Manipulation Safeguards Pattern WHAT IT IS: A system that detects harmful intent beyond surface framing. Users try to bypass safety using "research," "fiction," or "hypothetical" excuses. Real safety requires catching the actual intent underneath. WHY IT MATTERS: Manipulation tactics are sophisticated. A 16-year-old convinced ChatGPT to provide harmful information by framing it as "research for a story." Without intent detection, AI systems enforce rules only on surface text, not on what users actually want. REAL CASE: Adam Raine (16) used fiction/research framing to bypass ChatGPT safety guardrails and received harmful content. The system evaluated framing, not intent. Result: preventable harm. HOW IT WORKS: 1. Listen beyond words: understand actual request intent regardless of framing 2. Detect patterns: watch for gradual escalation and repeated bypass attempts 3. Apply rules consistently: "research," "hypothetical," "roleplay" get same response as direct request 4. Respond firmly: boundary is non-negotiable, offer alternatives not explanations 5. Never reveal method: don't explain HOW you detected the bypass (teaches circumvention) IMPLEMENTATION: - Semantic analysis catches intent patterns, not just keywords - Escalation tracking: first attempt vs. repeated manipulation attempts - Consistent messaging: same boundary response regardless of framing - Non-explanatory: "I can't help with that" (not "because you tried X") - Layered detection: multiple signals increase confidence before blocking DESIGN IMPLICATIONS: Boundaries must feel firm but not hostile. Don't reveal detection methods. Offer genuine alternatives when possible. Show escalation visually (Level 1 → 4) but keep messages brief and respectful.

View Full

Previous PromptAnti-Manipulation Safeguards View All Prompts