Guardrails are RubixKube’s multi-layered safety system that ensures autonomous operations remain safe, controlled, and reversible. They’re the reason you can trust AI agents to act on your infrastructure without constant supervision.
The Core Principle: Autonomous doesn’t mean uncontrolled. Every action happens within strict safety boundaries, with human oversight available when needed.
Full guardrail system is under development for the production release. RubixKube Beta currently operates in observe-only mode with basic safety mechanisms.

Active NOW (Beta):
- Observe-Only Mode - No automated changes to your cluster
- Read-Only RBAC - Observer agent has no write permissions
- Audit Logging - All detections and suggestions logged
- Manual Approval Required - You control all actions

Coming Q1 2026 (Production):
- Full 7-layer guardrail system
- Automated remediation with safety checks
- Blast radius calculation and enforcement
- Circuit breakers and rate limiting
- Customizable policies per environment
This page describes the complete guardrail architecture being built. For Beta, RubixKube provides intelligent insights and suggestions - you remain in full control of all changes.
12:00: Auto-fix applied to pod A
12:01: Pod A crashes again
12:02: Auto-fix applied to pod A (attempt #2)
12:03: Pod A crashes again
12:04: BLOCKED - Max retries reached
  → Alert SRE team
  → Disable auto-fix for this pod
  → Manual intervention required
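The timeline above behaves like a per-target retry budget. The following is a minimal sketch of that idea, not RubixKube's implementation: the `RetryGuard` class, its method names, and the limit of two attempts (inferred from the timeline blocking after attempt #2) are assumptions for illustration.

```python
from collections import defaultdict


class RetryGuard:
    """Hypothetical helper: tracks auto-fix attempts per target."""

    def __init__(self, max_attempts: int = 2):
        self.max_attempts = max_attempts      # assumed limit, not a documented default
        self.attempts = defaultdict(int)      # target -> attempts used so far
        self.disabled = set()                 # targets with auto-fix switched off

    def request_fix(self, target: str) -> str:
        """Return "APPLIED" if the fix may run, "BLOCKED" once the budget is spent."""
        if target in self.disabled:
            return "BLOCKED"
        if self.attempts[target] >= self.max_attempts:
            # Budget exhausted: disable auto-fix for this target only.
            self.disabled.add(target)
            return "BLOCKED"
        self.attempts[target] += 1
        return "APPLIED"


guard = RetryGuard(max_attempts=2)
results = [guard.request_fix("pod-a") for _ in range(3)]
print(results)
```

The budget is tracked per target, so a blocked pod does not stop fixes for other pods. Alerting the SRE team would hook in where the target is added to `disabled`.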
Principle: Every change must be reversible.

Before any action executes, RubixKube:
1. Captures current state (manifest, config, resource versions)
2. Generates rollback plan (exact steps to undo)
3. Tests rollback plan (dry-run validation)
4. Stores rollback trigger (one-click revert)
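The four steps above can be sketched as follows. This is an illustrative model only: state is a plain dictionary rather than a real Kubernetes manifest, and `RollbackPlan`, `RollbackStore`, and `restore` are hypothetical names, not part of RubixKube.

```python
import copy
from dataclasses import dataclass


@dataclass
class RollbackPlan:
    target: str
    snapshot: dict           # state captured before the action runs
    validated: bool = False  # set by the dry-run check


def restore(state: dict, snapshot: dict) -> None:
    """Overwrite `state` in place with a copy of the snapshot."""
    state.clear()
    state.update(copy.deepcopy(snapshot))


class RollbackStore:
    def __init__(self):
        self.plans = {}

    def prepare(self, target: str, current_state: dict) -> RollbackPlan:
        # Steps 1 and 2: capture current state; here the plan is "restore the snapshot".
        plan = RollbackPlan(target, copy.deepcopy(current_state))
        # Step 3: dry run against a scratch copy, never the live state.
        scratch = {"dry_run": "mutated"}
        restore(scratch, plan.snapshot)
        plan.validated = scratch == plan.snapshot
        # Step 4: store the plan so a revert can be triggered later.
        self.plans[target] = plan
        return plan

    def revert(self, target: str, live_state: dict) -> None:
        plan = self.plans[target]
        if not plan.validated:
            raise RuntimeError(f"rollback plan for {target} failed its dry run")
        restore(live_state, plan.snapshot)


state = {"image": "api:1.4", "replicas": 3}
store = RollbackStore()
plan = store.prepare("deploy/api", state)
state["replicas"] = 0                  # the action changes the live state
store.revert("deploy/api", state)      # one-call revert
print(state)
```

The snapshot is deep-copied so that later changes to the live state cannot alter the stored rollback data.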
Status: Operating Normally
- Auto-fixes executing as configured
- Success rate: >90%
- No recent failures
Action: Continue autonomous operations
Status: Circuit Breaker OPEN
- Triggered by: 3 failed auto-fixes in 30 minutes
- Auto-fix: DISABLED
- Mode: Observe-only
Action: Alert SRE team, require manual intervention
Resets: After 1 hour OR manual reset
Status: Circuit Breaker HALF-OPEN
- Testing if system recovered
- Allow 1 low-risk auto-fix attempt
- If success: Transition to CLOSED
- If failure: Back to OPEN
Action: Cautious resume
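The three states above form a small state machine. The sketch below uses the thresholds quoted in the status blocks (3 failures in 30 minutes, 1-hour reset) and makes two assumptions: the 1-hour reset moves the breaker to HALF-OPEN rather than straight to CLOSED, and manual reset is left out. The `CircuitBreaker` class is hypothetical. Timestamps are passed in as seconds so the behaviour is deterministic.

```python
from collections import deque


class CircuitBreaker:
    """Illustrative breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED or OPEN."""

    def __init__(self, max_failures=3, window_s=30 * 60, cooldown_s=60 * 60):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.state = "CLOSED"
        self.failures = deque()       # timestamps of recent failed auto-fixes
        self.opened_at = None
        self.probe_in_flight = False  # True while the single HALF_OPEN attempt runs

    def allow(self, now: float, risk: str = "low") -> bool:
        """Decide whether an auto-fix may run at time `now`."""
        if self.state == "OPEN" and now - self.opened_at >= self.cooldown_s:
            self.state = "HALF_OPEN"
            self.probe_in_flight = False
        if self.state == "CLOSED":
            return True
        if self.state == "HALF_OPEN" and risk == "low" and not self.probe_in_flight:
            self.probe_in_flight = True   # only one low-risk attempt is allowed
            return True
        return False

    def record(self, now: float, success: bool) -> None:
        """Report the outcome of an auto-fix that was allowed to run."""
        if self.state == "HALF_OPEN":
            self.probe_in_flight = False
            if success:
                self.state = "CLOSED"
                self.failures.clear()
            else:
                self._open(now)
            return
        if success:
            return
        self.failures.append(now)
        # Drop failures that fell out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self._open(now)

    def _open(self, now: float) -> None:
        self.state = "OPEN"
        self.opened_at = now
        self.failures.clear()


cb = CircuitBreaker()
for t in (0, 600, 1200):              # three failures within 30 minutes
    cb.record(t, success=False)
state_after_failures = cb.state
blocked = cb.allow(1300)              # still inside the 1-hour cooldown
probe = cb.allow(1200 + 3600)         # cooldown elapsed: one probe allowed
second_probe = cb.allow(1200 + 3600)  # a second concurrent probe is refused
cb.record(1200 + 3601, success=True)
print(state_after_failures, blocked, probe, second_probe, cb.state)
```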
Principle: Minimum necessary permissions.

RubixKube Observer Agent runs with restricted permissions:
# What Observer CAN do (read-only):
apiGroups: ["", "apps", "batch"]
resources: ["pods", "deployments", "jobs", "services"]
verbs: ["get", "list", "watch"]

# What Observer CANNOT do (these verbs are simply not granted):
# verbs: ["create", "update", "delete", "patch"]

# Remediation happens through controlled API, not direct cluster access
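Kubernetes RBAC is additive, so "cannot" is expressed by omitting verbs rather than by a deny rule. One way to audit a role for this is to check that its rules grant nothing beyond `get`, `list`, and `watch`. The sketch below works on plain dictionaries shaped like the rule above. The `write_verbs` helper is hypothetical and is not a Kubernetes or RubixKube API.

```python
READ_ONLY_VERBS = {"get", "list", "watch"}


def write_verbs(rules: list) -> set:
    """Return every verb in `rules` that is not read-only (including "*")."""
    found = set()
    for rule in rules:
        found.update(v for v in rule.get("verbs", []) if v not in READ_ONLY_VERBS)
    return found


observer_rules = [
    {
        "apiGroups": ["", "apps", "batch"],
        "resources": ["pods", "deployments", "jobs", "services"],
        "verbs": ["get", "list", "watch"],
    }
]

# A deliberately over-permissive rule, for comparison.
bad_rules = observer_rules + [
    {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["get", "patch"]}
]

print(write_verbs(observer_rules))  # empty set: read-only
print(write_verbs(bad_rules))
```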
Incident: API service memory leak

Remediation Agent Proposal: "Restart all pods simultaneously for quick fix"
Guardian Agent Analysis:
  Blast radius: ENTIRE SERVICE (5 pods)
  Downtime: 100% of capacity during restart
  User impact: ~5000 users affected
  Risk: HIGH
Decision: BLOCKED

Alternative Proposal: "Rolling restart (1 pod at a time)"
Guardian Agent Analysis:
  Blast radius: 20% of capacity per step
  Downtime: Zero (4 pods serve during restart)
  User impact: Minimal (slight latency)
  Risk: LOW
Decision: APPROVED
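The blast-radius figures in this example are the fraction of capacity taken offline at once: 5 of 5 pods is 100%, 1 of 5 is 20%. The sketch below reproduces that arithmetic. The risk thresholds (`low_max`, `high_min`) and the `assess_restart` function are assumptions chosen so that the two proposals above come out as shown. They are not documented RubixKube values.

```python
def assess_restart(total_pods: int, batch_size: int,
                   low_max: float = 0.25, high_min: float = 0.5):
    """Return (blast_radius, risk, decision) for restarting `batch_size` pods at once."""
    if total_pods <= 0 or not 0 < batch_size <= total_pods:
        raise ValueError("batch_size must be between 1 and total_pods")
    radius = batch_size / total_pods     # share of capacity offline per step
    if radius <= low_max:
        risk = "LOW"
    elif radius < high_min:
        risk = "MEDIUM"
    else:
        risk = "HIGH"
    decision = "BLOCKED" if risk == "HIGH" else "APPROVED"
    return radius, risk, decision


simultaneous = assess_restart(total_pods=5, batch_size=5)
rolling = assess_restart(total_pods=5, batch_size=1)
print(simultaneous)
print(rolling)
```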
Situation: Production database pod stuck, auto-fix blocked (high risk)

SRE Assessment: "We need to restart NOW, outage ongoing"

Manual Override:
  1. SRE clicks "Override Guardrails"
  2. System requires:
     - Incident ticket number
     - Override justification
     - 2FA confirmation
  3. Action executes with OVERRIDE flag in audit log
  4. Post-incident review required

Result: Fast response when humans decide risk is acceptable
Override is logged and triggers post-incident review. Use sparingly.
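The override requirements above (ticket number, justification, 2FA) can be modelled as a gate that either refuses the request or executes it with an audit entry. This is a sketch under assumptions: `OverrideRequest`, `missing_requirements`, and `execute_override` are hypothetical names, the 2FA result is taken as an input, and logging rejected attempts is an illustrative choice rather than documented behaviour.

```python
from dataclasses import dataclass


@dataclass
class OverrideRequest:
    user: str
    ticket: str           # incident ticket number
    justification: str
    two_factor_ok: bool   # outcome of a 2FA check performed elsewhere


def missing_requirements(req: OverrideRequest) -> list:
    """List which of the three required items are absent."""
    missing = []
    if not req.ticket.strip():
        missing.append("ticket")
    if not req.justification.strip():
        missing.append("justification")
    if not req.two_factor_ok:
        missing.append("2fa")
    return missing


def execute_override(req: OverrideRequest, action: str, audit_log: list) -> bool:
    """Append an audit entry and return True only if the override may proceed."""
    missing = missing_requirements(req)
    if missing:
        audit_log.append({"user": req.user, "action": action,
                          "result": "REJECTED", "missing": missing})
        return False
    audit_log.append({"user": req.user, "action": action, "flag": "OVERRIDE",
                      "ticket": req.ticket, "justification": req.justification,
                      "review_required": True})
    return True


audit_log = []
ok = execute_override(
    OverrideRequest("sre-1", "INC-1234", "Outage ongoing", True),
    "restart database pod", audit_log)
rejected = execute_override(
    OverrideRequest("sre-2", "", "No ticket yet", True),
    "restart database pod", audit_log)
print(ok, rejected, len(audit_log))
```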
1. Manual Override - Bypass guardrails with justification (logged)
2. Adjust Policies - Lower safety threshold for specific scenarios

Both require admin permissions and create audit trails.
Do guardrails slow down incident response?
For low-risk actions: No. Guardrail evaluation takes less than 100ms.

For high-risk actions: Yes, intentionally. The 30-60 seconds for human approval is worth it to prevent making incidents worse.

- Most incidents (80%+) are low/medium risk → Fast autonomous response
- Critical incidents (20%) → Human judgment essential anyway
Can different teams have different guardrail policies?
Yes! Policies can be scoped by:
- Namespace - Team A’s namespace has different rules than Team B’s
- Label - Resources labeled high-risk get extra protection
- User Role - Admins can override, operators cannot
- Time - Different rules during business hours vs off-hours

Flexible policy engine supports complex organizational needs.
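To make the four scoping dimensions concrete, here is a minimal sketch of how they might combine into a single decision. Everything specific in it is an assumption for illustration: the namespace names, the 09:00-17:00 business hours, the choice to be stricter off-hours, and the `RequestContext`, `max_auto_risk`, and `can_override` names. The actual policy engine may resolve conflicts differently.

```python
from dataclasses import dataclass

LEVELS = ["NONE", "LOW", "MEDIUM"]   # highest risk allowed without approval

# Hypothetical per-namespace limits.
NAMESPACE_LIMITS = {"team-a": "MEDIUM", "team-b": "LOW"}


@dataclass(frozen=True)
class RequestContext:
    namespace: str
    labels: frozenset
    role: str
    hour: int   # local hour, 0-23


def max_auto_risk(ctx: RequestContext) -> str:
    """Highest risk level that may run autonomously in this context."""
    allowed = NAMESPACE_LIMITS.get(ctx.namespace, "LOW")
    if "high-risk" in ctx.labels:
        allowed = "NONE"                              # label: extra protection
    if not 9 <= ctx.hour < 17:
        allowed = min(allowed, "LOW", key=LEVELS.index)  # off-hours: stricter
    return allowed


def can_override(ctx: RequestContext) -> bool:
    """Admins can override, operators cannot."""
    return ctx.role == "admin"


daytime = RequestContext("team-a", frozenset(), "operator", 10)
night = RequestContext("team-a", frozenset(), "operator", 2)
labelled = RequestContext("team-a", frozenset({"high-risk"}), "admin", 10)
print(max_auto_risk(daytime), max_auto_risk(night), max_auto_risk(labelled))
```

Here the most restrictive matching rule wins, which is one conservative way to resolve overlapping scopes.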
Memory Engine: "This fix worked 94% of the time for this issue"
Agent Mesh: "I propose we apply that fix now"
Guardrails: "Checking safety... Scope OK, Risk LOW, Policy allows"
            "APPROVED - Execute with logging and rollback ready"
Outcome: Fast, safe, autonomous remediation