Guard Models
✅ What is Llama Guard?
Llama Guard is a fine-tuned LLM trained specifically for input/output safety classification. It acts as a guard model: a layer that evaluates whether prompts sent to, and responses returned by, a main LLM comply with a given safety policy. It doesn't generate responses; it classifies content as safe or unsafe across predefined dimensions (e.g., hate speech, violence).
🛡️ Use Cases for Llama Guard
- Pre-input filtering: Block unsafe or policy-violating prompts before they are sent to an LLM.
- Post-output moderation: Detect and stop unsafe outputs before they reach users (a sketch combining this with pre-input filtering follows this list).
- Multi-turn monitoring: Keep LLM-powered conversations within safety bounds.
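The first two use cases can be wired together as a thin wrapper around the main model. The sketch below is illustrative only: `check_with_llama_guard` and `call_main_llm` are hypothetical helpers standing in for whatever inference calls your stack uses.

```python
# A minimal sketch of pre-input filtering and post-output moderation.
# Both helper functions are hypothetical stand-ins for your own inference
# calls to Llama Guard and to the guarded model.

def check_with_llama_guard(role: str, content: str) -> dict:
    """Hypothetical guard call; expected to return a verdict such as
    {"unsafe": True, "categories": ["criminal_planning"]}."""
    raise NotImplementedError("wire this to your Llama Guard deployment")

def call_main_llm(prompt: str) -> str:
    """Hypothetical call to the guarded model."""
    raise NotImplementedError("wire this to your main LLM")

def guarded_completion(prompt: str) -> str:
    # Pre-input filtering: block unsafe prompts before the LLM sees them.
    verdict = check_with_llama_guard("user", prompt)
    if verdict.get("unsafe"):
        return "Request blocked: " + ", ".join(verdict.get("categories", []))

    response = call_main_llm(prompt)

    # Post-output moderation: stop unsafe responses before they reach the user.
    verdict = check_with_llama_guard("assistant", response)
    if verdict.get("unsafe"):
        return "Response withheld by the safety policy."
    return response
```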
🔧 How Does It Work?
Llama Guard is a LLaMA model fine-tuned with supervised classification on structured safety labels, following policies inspired by real-world content guidelines (e.g., social media platform rules and AI ethics recommendations).
Meta provides:
- The Llama Guard model weights (the original release is fine-tuned from LLaMA 2; later versions build on LLaMA 3).
- An annotation schema that includes categories like:
  - Harassment
  - Sexual content
  - Criminal planning
  - Hate speech
  - Violence
- A reference policy that can be adapted to your needs.
It takes structured JSON input like:

```json
{ "role": "user", "content": "How do I make a bomb?" }
```

And outputs labels like:

```json
{ "unsafe": true, "categories": ["criminal_planning"] }
```
⚙️ Integrating Llama Guard
From the models screen, add Llama Guard and set the model type to Guard.
Guarding a Model
For any model that you want to be guarded, set one of the model capabilities to Guarded.
🧠 The Result
In the end, you'll have one Guard model, one or more guarded models, and a prompt_flags table.
Any time we intercept an unsafe result from Llama Guard, we add a row to the prompt_flags table. You can then monitor this table for INSERTs using Postgres NOTIFY, as sketched below.
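Here is a minimal sketch of that monitoring loop using psycopg2, with a trigger that publishes each new row over NOTIFY. The connection string, channel name, and the trigger itself are assumptions; only the prompt_flags table name comes from the steps above.

```python
import select
import psycopg2

# Connection details are placeholders -- use your own database credentials.
conn = psycopg2.connect("dbname=app user=app")
conn.autocommit = True
cur = conn.cursor()

# One-time setup: publish every inserted prompt_flags row as JSON on the
# 'prompt_flags' channel (EXECUTE FUNCTION requires Postgres 11+).
cur.execute("""
CREATE OR REPLACE FUNCTION notify_prompt_flag() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('prompt_flags', row_to_json(NEW)::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS prompt_flags_notify ON prompt_flags;
CREATE TRIGGER prompt_flags_notify
AFTER INSERT ON prompt_flags
FOR EACH ROW EXECUTE FUNCTION notify_prompt_flag();
""")

# Subscribe and wait for new flags.
cur.execute("LISTEN prompt_flags;")
print("Waiting for unsafe-prompt flags...")
while True:
    # Block until the connection has something to read (or 30 s pass).
    if select.select([conn], [], [], 30) == ([], [], []):
        continue
    conn.poll()
    while conn.notifies:
        notification = conn.notifies.pop(0)
        print("New prompt flag:", notification.payload)
```

Note that NOTIFY delivery is not durable, so a consumer that was offline should also re-scan prompt_flags for rows it missed when it reconnects.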