Agent Mesh: Distributed Intelligence for Reliability
The Agent Mesh is RubixKube’s core architectural pattern - a network of specialized AI agents that work together like a distributed SRE team, each bringing unique expertise to infrastructure reliability.
Think of it like this: instead of one overworked engineer handling everything, you have a team of specialists - one investigates issues, one proposes fixes, one ensures safety, and one remembers everything. They never sleep, never forget, and always collaborate perfectly.
Why a Mesh of Agents?
The Problem with Monolithic AI
A single “do-everything” AI faces fundamental limitations:
- Jack of all trades, master of none - Can’t specialize deeply
- Single point of failure - If it fails, everything stops
- Slow decision-making - Must consider everything at once
- Hard to trust - Black box with unclear reasoning
The Agent Mesh Solution
Multiple specialized agents working together:
- Deep expertise - Each agent masters one domain
- Distributed resilience - System continues if one agent fails
- Parallel processing - Agents work simultaneously
- Transparent reasoning - See which agent did what and why
Core Agents in the Mesh
RubixKube’s Agent Mesh includes several specialized agents:
RCA Pipeline Agent
Role: Root Cause Analysis
- Investigates incidents systematically
- Builds dependency graphs
- Correlates logs, metrics, and events
- Generates evidence-linked RCA reports
Observer Agent
Role: Infrastructure Monitoring
- Watches cluster resources continuously
- Collects metrics and events
- Detects anomalies and deviations
- Reports to other agents
SRI Agent
Role: Conversational Interface
- Provides natural language interaction
- Translates questions into actions
- Explains system state in plain English
- Guides troubleshooting workflows
Remediation Agent
Role: Fix Proposals & Execution
- Proposes safe remediation actions
- Calculates risk and blast radius
- Applies fixes (with approval)
- Verifies remediation success
Memory Agent
Role: Knowledge Management
- Stores incident history
- Recalls similar past events
- Suggests fixes from memory
- Updates knowledge graph
Guardian Agent
Role: Safety & Policy Enforcement
- Enforces guardrails
- Assesses action risk
- Requires approvals when needed
- Prevents dangerous operations
How Agents Collaborate
Example: Pod Crashes with OOMKilled
Here’s how the Agent Mesh handles a memory overflow incident:
1. Detection (Observer Agent)
2. Investigation (RCA Pipeline Agent)
3. Memory Recall (Memory Agent)
4. Fix Proposal (Remediation Agent)
5. Safety Check (Guardian Agent)
6. Execution & Learning
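The six steps above can be sketched as a simple hand-off chain. This is an illustration only, not RubixKube's implementation: the `Incident` class, the step functions, the `human_approved` flag, and the pod name are all hypothetical, and each step just records what it did so the audit trail is visible.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """Hypothetical incident record handed from agent to agent."""
    pod: str
    reason: str
    trail: list = field(default_factory=list)  # which agent did what, in order
    approved: bool = False


def observer(incident):
    incident.trail.append("Observer: detected " + incident.reason)
    return incident


def rca_pipeline(incident):
    incident.trail.append("RCA Pipeline: investigated root cause")
    return incident


def memory(incident):
    incident.trail.append("Memory: looked up similar past incidents")
    return incident


def remediation(incident):
    incident.trail.append("Remediation: proposed a fix")
    return incident


def guardian(incident, human_approved):
    # Assumption of this sketch: the safety check reduces to one approval flag.
    incident.approved = human_approved
    incident.trail.append("Guardian: " + ("approved" if human_approved else "blocked"))
    return incident


def execute_and_learn(incident):
    # Nothing is applied unless the safety check passed.
    if incident.approved:
        incident.trail.append("Execution & Learning: applied fix and recorded outcome")
    return incident


def handle(incident, human_approved=False):
    """Run the six steps in order and return the incident with its trail."""
    for step in (observer, rca_pipeline, memory, remediation):
        incident = step(incident)
    incident = guardian(incident, human_approved)
    return execute_and_learn(incident)


if __name__ == "__main__":
    result = handle(Incident(pod="example-pod", reason="OOMKilled"), human_approved=True)
    for entry in result.trail:
        print(entry)
```

Because every step appends to `trail`, the result shows which agent did what, and a blocked fix stops before the execution step.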
Agent Communication Protocols
How Agents Talk to Each Other
Agents communicate through:
1. Event Bus - Asynchronous message passing
2. Shared Context - Common understanding of cluster state
3. Priority Queuing - Critical incidents get immediate attention
4. Feedback Loops - Agents learn from each other’s successes
Agent Mesh vs Traditional Approaches
| Feature | Traditional Monitoring | Single AI Bot | Agent Mesh (SRI) |
|---|---|---|---|
| Specialization | Tools per domain | One system does all | Agents per expertise |
| Reasoning | Rule-based | Generic AI | Domain-specific intelligence |
| Collaboration | Manual integration | N/A | Native mesh communication |
| Fault Tolerance | Tool outages break flow | Single point of failure | Distributed resilience |
| Scalability | Add more tools | Scale up one system | Add more agents |
| Transparency | Logs everywhere | Black box | Agent-level audit trails |
| Learning | Humans update rules | Model retraining | Continuous agent learning |
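To make the "native mesh communication" row concrete, here is a minimal sketch of the event bus and priority queuing described under Agent Communication Protocols. It is an in-process toy, not RubixKube's transport: the `EventBus` class, the topic names, and the convention that a lower number means higher urgency are assumptions made for illustration.

```python
import heapq
import itertools
from collections import defaultdict


class EventBus:
    """Hypothetical bus: publish() queues events, dispatch() delivers them by priority."""

    def __init__(self):
        self._queue = []                # heap of (priority, seq, topic, payload)
        self._seq = itertools.count()   # tie-breaker: FIFO order within one priority
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload, priority=5):
        # Assumption of this sketch: lower number = more urgent.
        heapq.heappush(self._queue, (priority, next(self._seq), topic, payload))

    def dispatch(self):
        """Deliver every queued event, most urgent first."""
        while self._queue:
            _, _, topic, payload = heapq.heappop(self._queue)
            for handler in self._subscribers[topic]:
                handler(payload)


if __name__ == "__main__":
    bus = EventBus()
    bus.subscribe("incident", lambda event: print("RCA received:", event))
    bus.publish("incident", "disk usage warning", priority=5)
    bus.publish("incident", "pod OOMKilled", priority=0)
    bus.dispatch()  # the OOMKilled event is delivered before the warning
```

Publishing and dispatching are separate calls, so a publisher never waits on a subscriber. That decoupling is the property the event bus item refers to.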
Agent States and Health
Monitoring Your Agents

- Status - Active, idle, or error
- Last Activity - When it last performed work
- Capabilities - What it can do
- Type - System agent vs custom agent
Agent Health Indicators
Healthy Agent
- Status: Active
- Last seen: within 60 seconds
- Response time: less than 2 seconds
- Error rate: less than 1%
Degraded Agent
- Status: Slow response
- Last seen: 1-5 minutes ago
- Response time: 2-10 seconds
- Error rate: 1-5%
Failed Agent
- Status: Error
- Last seen: more than 5 minutes ago
- No responses
- Error rate: greater than 5%
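The thresholds above can be combined into a single classification. The sketch below assumes the worst indicator wins and treats a response time above 10 seconds as failed. Both rules are assumptions, since the thresholds listed here do not say how the indicators combine. The function name and parameters are hypothetical.

```python
from typing import Optional


def classify_agent(last_seen_s: float,
                   response_time_s: Optional[float],
                   error_rate_pct: float) -> str:
    """Return 'healthy', 'degraded', or 'failed' from the documented thresholds.

    response_time_s is None when the agent is not responding at all.
    Assumption: the worst single indicator decides the overall state.
    """
    if (response_time_s is None      # no responses
            or response_time_s > 10  # assumption: slower than the degraded band
            or last_seen_s > 300     # last seen more than 5 minutes ago
            or error_rate_pct > 5):
        return "failed"
    if last_seen_s > 60 or response_time_s >= 2 or error_rate_pct >= 1:
        return "degraded"
    return "healthy"


if __name__ == "__main__":
    print(classify_agent(last_seen_s=30, response_time_s=0.5, error_rate_pct=0.2))
    print(classify_agent(last_seen_s=120, response_time_s=0.5, error_rate_pct=0.2))
    print(classify_agent(last_seen_s=30, response_time_s=None, error_rate_pct=0.0))
```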
Extensibility: Custom Agents
Adding Your Own Agents
The Agent Mesh is extensible - you can add custom agents for your specific needs.
Example Use Cases:
- Compliance Agent - Ensures changes meet regulatory requirements
- Cost Agent - Optimizes resource usage for budget targets
- Security Agent - Scans for vulnerabilities and misconfigurations
- Performance Agent - Optimizes application response times
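As a rough picture of what a custom agent might look like, here is a sketch of a Cost Agent. Everything in it is hypothetical: the `CustomAgent` base class, the `handle` method, and the `projected_monthly_cost` event field are invented for illustration and are not RubixKube's extension API, which this page does not describe.

```python
from abc import ABC, abstractmethod


class CustomAgent(ABC):
    """Hypothetical base class; the real extension API may look different."""

    agent_type = "custom"        # as opposed to a built-in system agent
    capabilities: tuple = ()     # what the agent can do

    @abstractmethod
    def handle(self, event: dict) -> list:
        """Inspect one event and return a list of proposals (possibly empty)."""


class CostAgent(CustomAgent):
    capabilities = ("cost-optimization",)

    def __init__(self, monthly_budget: float):
        self.monthly_budget = monthly_budget

    def handle(self, event: dict) -> list:
        # 'projected_monthly_cost' is an invented field name for this sketch.
        projected = event.get("projected_monthly_cost", 0.0)
        if projected > self.monthly_budget:
            over = projected - self.monthly_budget
            return [f"Projected spend exceeds budget by {over:.2f}; review resource requests"]
        return []


if __name__ == "__main__":
    agent = CostAgent(monthly_budget=1000.0)
    print(agent.handle({"projected_monthly_cost": 1200.0}))
```

The agent only returns proposals and never applies anything itself, which keeps it compatible with the approval model described in the FAQ.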
Agent Mesh Benefits
Speed
Parallel Processing
Multiple agents work simultaneously, not sequentially. Investigation, planning, and safety checks happen in parallel.
Accuracy
Specialized Knowledge
Each agent is an expert in its domain, leading to better decisions than generalist systems.
Resilience
No Single Point of Failure
If one agent fails, others continue working. System degradation is gradual, not catastrophic.
Evolution
Independent Improvement
Each agent can be upgraded independently without affecting the mesh.
Frequently Asked Questions
How many agents run in my cluster?
In your cluster: Just the Observer Agent runs (lightweight, ~100 MB RAM).
In RubixKube Cloud: All other agents (RCA, Remediation, Memory, Guardian) run centrally - no additional load on your infrastructure.
This hybrid architecture gives you powerful AI without cluster overhead.
Can agents act without permission?
By default: NO.
Agents operate in observe-only mode initially. They will:
- Detect issues
- Investigate root causes
- Propose fixes
- NOT apply changes without approval
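The default behavior above amounts to a two-stage gate: nothing is applied in observe-only mode, and outside that mode nothing is applied without approval. A minimal sketch follows, assuming a hypothetical `ProposedFix` record and `apply_fix` function (neither is a RubixKube API):

```python
from dataclasses import dataclass


@dataclass
class ProposedFix:
    """Hypothetical fix proposal awaiting a decision."""
    description: str
    approved_by: str = ""   # empty until a human approves


def apply_fix(fix: ProposedFix, observe_only: bool = True) -> str:
    """Apply a fix only when the mesh may act and a human has approved it."""
    if observe_only:
        return "proposed only (observe-only mode)"
    if not fix.approved_by:
        return "waiting for approval"
    return f"applied (approved by {fix.approved_by})"


if __name__ == "__main__":
    fix = ProposedFix("raise the container memory limit")
    print(apply_fix(fix))                       # default: observe-only
    print(apply_fix(fix, observe_only=False))   # still needs approval
    fix.approved_by = "alice"
    print(apply_fix(fix, observe_only=False))
```

Defaulting `observe_only` to `True` mirrors the "By default: NO" answer: a caller has to opt in explicitly before any change can be applied.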
What if an agent makes a mistake?
Multiple safety layers:
1. Guardian Agent - Reviews all actions before execution
2. Blast Radius Limits - Actions are scoped, never cluster-wide
3. Dry-Run Mode - Test changes before applying
4. Instant Rollback - Every change is reversible
5. Audit Logs - Complete trail of who did what
Plus: All high-risk actions require human approval.
How do I see what agents are doing?
Three ways:
1. Agents Page - Real-time status of all agents
2. Activity Feed - Stream of agent actions and decisions
3. RCA Reports - Detailed explanation of agent reasoning
Full transparency into the mesh.