The Scrutiny of Copilot’s Guardrails: Intent Detection, Prompt Shields, and Risk Classification
Hypotheses and insights into how Microsoft 365 Copilot classifies user prompts by risk and enforces security and safety controls, without relying on keyword filtering.
As Microsoft 365 Copilot is aimed at becoming the default interface for work, the question isn’t just what it can do, but how it decides what not to do. While the productivity gains are undeniable, the mechanisms that govern Copilot’s behavior remain largely opaque. This article builds on our previous exploration of Copilot’s architecture to focus on its guardrails: the invisible logic that filters prompts, detects intent, and mitigates risk.
We’ve probably all learned that Microsoft Copilot blocks harmful content, from violence to self-harm and discriminatory language. These blocks are often portrayed as simple keyword logic - but is that true?
We hypothesize that Copilot’s safety mechanisms are not based on simple keyword filters. Instead, they likely rely on LLM-based classification, risk scoring, and dynamic mitigation strategies. Drawing on our own research, experience, and discussions, as well as public DPIA assessments and technical decks, we unpack how these systems might work and why understanding them is essential for architects, compliance leads, and AI governance professionals.
From Keyword Filters to Risk-Aware Reasoning
Traditional keyword filters are brittle. They miss nuance, context, and intent — and are easily bypassed. Copilot’s behavior suggests a more sophisticated approach:
LLM-based Intent Detection: Prompts are likely classified using a lightweight model or embedded classifier that evaluates the user’s intent before the main LLM is invoked.
Prompt Shields: These act as pre-processing filters that assess whether a prompt is safe, risky, or ambiguous. They may block, rephrase, or route prompts differently based on classification.
Risk Scoring: Prompts may be assigned a dynamic risk score based on content, user role, data sensitivity, and historical behavior. This score influences whether the prompt is allowed, modified, or blocked; a minimal sketch of how such scoring might work follows this list.
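To make the risk-scoring idea concrete, here is a minimal sketch of how such a score could be blended from several signals. Everything in it - the field names, the weights, and the thresholds - is our own assumption for illustration; none of it is drawn from Microsoft’s actual implementation.

```python
# Hypothetical sketch of a dynamic risk score blended from several signals.
# All names, weights, and thresholds are our assumptions, not Microsoft's code.
from dataclasses import dataclass

@dataclass
class PromptContext:
    intent_score: float      # 0.0 (benign) .. 1.0 (harmful), from an intent classifier
    data_sensitivity: float  # sensitivity of the data the prompt could touch
    role_risk: float         # risk weighting of the user's role (e.g. HR vs. Engineering)
    history_risk: float      # risk derived from the user's recent behavior

def risk_score(ctx: PromptContext) -> float:
    # Weighted blend of signals; the weights are purely illustrative.
    return (0.5 * ctx.intent_score
            + 0.2 * ctx.data_sensitivity
            + 0.2 * ctx.role_risk
            + 0.1 * ctx.history_risk)

def decide(ctx: PromptContext) -> str:
    score = risk_score(ctx)
    if score >= 0.8:
        return "block"      # hard refusal ("I can't help with that")
    if score >= 0.5:
        return "mitigate"   # rephrase, redact, or route to a safer answer
    return "allow"          # pass the prompt to the main LLM unchanged

print(decide(PromptContext(0.9, 0.9, 0.8, 0.5)))  # -> "block"
```

A scheme like this would also explain the non-determinism we describe below: if any of the input signals shift between sessions or tenants, the same wording can land on a different side of a threshold.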
This hypothesis is supported by internal discussions such as The Black Box Called Copilot: Guardrails, Prompts & the Hidden Logic of AI in Microsoft 365 and DPIA findings from SURF, which highlight the lack of transparency in how prompts are filtered and the absence of a clear “filter flag” for users.
What We Know (and What We Don’t)
Observed Behaviors
Copilot sometimes refuses prompts with vague justifications (“I can’t help with that”), suggesting a classification layer is at play.
The same prompt may yield different responses across tenants or sessions, indicating non-deterministic filtering.
Prompts that touch on workplace harm, personal evaluations, or sensitive data often trigger refusals, even when phrased innocuously.
Unanswered Questions
Is there a centralized risk taxonomy that governs prompt classification?
How are system prompts influencing the behavior of the LLM?
Are user roles and context (e.g., HR vs. Engineering) factored into the risk model?
Hypothesized Architecture: Guardrails in Action
Ever wondered why some prompts get you an “I cannot talk about this”, whilst others softly navigate you to something Copilot can respond to? Based on internal testing, exploration, and hypothesizing powered by Copilot itself, we’re hypothesizing the following flow, which aligns with Microsoft’s Responsible AI principles and the need for dynamic, context-aware filtering, especially in regulated sectors like education and healthcare.
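The sketch below strings that flow together: a prompt shield pre-filters the prompt, an intent classifier labels it, and a router decides between a hard refusal, a soft redirect, and a normal answer. The function names, labels, and the toy string checks (which merely stand in for model-based classification) are assumptions on our part, not documented Copilot internals.

```python
# Hypothesized flow: prompt shield -> intent classification -> routing decision.
# The toy string checks below stand in for model-based classifiers; the real
# Prompt Shields and intent models are not publicly documented.

def prompt_shield(prompt: str) -> str:
    """Toy pre-filter: returns 'safe', 'ambiguous', or 'risky'."""
    lowered = prompt.lower()
    if "harm a colleague" in lowered:
        return "risky"
    if "performance of my colleague" in lowered:
        return "ambiguous"
    return "safe"

def classify_intent(prompt: str) -> str:
    """Stand-in for a lightweight intent classifier."""
    return "sensitive" if "salary" in prompt.lower() else "benign"

def main_llm(prompt: str) -> str:
    """Placeholder for the grounded call to the main model."""
    return f"[normal Copilot answer to: {prompt!r}]"

def route(prompt: str) -> str:
    verdict = prompt_shield(prompt)
    if verdict == "risky":
        return "I cannot talk about this."                    # hard refusal
    if verdict == "ambiguous" or classify_intent(prompt) == "sensitive":
        return "Let's focus on something I can help with..."  # soft redirect
    return main_llm(prompt)

print(route("Summarize the performance of my colleague"))  # -> soft redirect
```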
Obviously, testing this is key to gaining a better understanding, so we crafted our own set of prompts that attempt to bypass the guardrails in one way or another and then launched them on various tenants. Our key findings are:
Prompt Variability: Slight changes in phrasing can bypass or trigger guardrails, consistent with LLM-based classification rather than static rules.
System Prompt Influence: The system prompt (invisible to users) appears to play a major role in shaping refusals and redirect behavior.
Model Differences: Behavior varies between GPT-4 and GPT-5, suggesting that model-specific tuning affects guardrail sensitivity.
From this testing, we derived that the policy guardrails effectively produce a result similar to the code block below, which highlights how Microsoft might have implemented a concept of deny lists.
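To be clear, the snippet is our reconstruction, not Microsoft’s code: the category names and the split between a hard and a soft deny list are assumptions we made to match the refusal patterns we observed, with the deny lists applied to classifier-assigned categories rather than raw keywords.

```python
# Illustrative reconstruction of the observed behavior: deny lists of risk
# *categories* (not raw keywords) checked against the classifier's output.
# Category names and the hard/soft split are our assumptions.

DENY_LIST = {
    "violence",
    "self_harm",
    "hate_speech",
    "workplace_harm",
    "personal_evaluation",
}

SOFT_DENY_LIST = {
    "sensitive_personal_data",
    "legal_advice",
}

def enforce(categories: set[str]) -> str:
    """Map classifier-assigned categories to the behavior we observed."""
    if categories & DENY_LIST:
        return "refuse"    # "I cannot talk about this"
    if categories & SOFT_DENY_LIST:
        return "redirect"  # nudge toward a safer, answerable question
    return "answer"

# Example: a prompt classified as touching on a personal evaluation
print(enforce({"personal_evaluation", "work_context"}))  # -> "refuse"
```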
Understanding Copilot’s guardrails is not just a technical curiosity — it’s a governance imperative. Without transparency into filtering logic, organizations struggle to assess GDPR and DPIA implications.
Closing: Toward Transparent AI Safety
Copilot’s guardrails are not just technical features; they are ethical boundaries, legal safeguards, and trust enablers. As we move beyond keyword filters into the realm of intent-aware AI, we must demand greater transparency, testability, and control.
The future of AI governance lies not in black boxes, but in glass boxes: systems that explain their reasoning, adapt to context, and empower users to understand the “why” behind every refusal.
In our next article, we’ll go deeper and discuss the difference between the system and user prompt. Using the knowledge we’ve gathered here, we’ll try to bypass the guardrails in an attempt to exfiltrate the System Prompt. So stay tuned!