
Agent Mesh: Distributed Intelligence for Reliability

The Agent Mesh is RubixKube’s core architectural pattern - a network of specialized AI agents that work together like a distributed SRE team, each bringing unique expertise to infrastructure reliability.
Think of it like this: Instead of one overworked engineer handling everything, you have a team of specialists - one investigates issues, one proposes fixes, one ensures safety, and one remembers everything. They never sleep, never forget, and always collaborate perfectly.

Beta Release Status - Active Agents

Currently Active (Beta):
  • Observer Agent - Monitors your cluster continuously
  • RCA Pipeline Agent - Analyzes incidents and generates root cause reports
  • Memory Agent - Stores incidents and learns patterns
  • SRI Agent - Provides conversational interface via Chat
Coming in Future Releases:
  • Remediation Agent - Autonomous fix execution (suggestions available now, requires manual approval)
  • Guardian Agent - Full guardrail enforcement (basic safety checks active)
This page describes the complete Agent Mesh architecture, including agents being finalized for production release.

Why a Mesh of Agents?

The Problem with Monolithic AI

A single “do-everything” AI faces fundamental limitations:
  • Jack of all trades, master of none - Can’t specialize deeply
  • Single point of failure - If it fails, everything stops
  • Slow decision-making - Must consider everything at once
  • Hard to trust - Black box with unclear reasoning

The Agent Mesh Solution

Multiple specialized agents working together:
  • Deep expertise - Each agent masters one domain
  • Distributed resilience - System continues if one agent fails
  • Parallel processing - Agents work simultaneously
  • Transparent reasoning - See which agent did what and why

Core Agents in the Mesh

RubixKube’s Agent Mesh includes several specialized agents:

RCA Pipeline Agent

Role: Root Cause Analysis
  • Investigates incidents systematically
  • Builds dependency graphs
  • Correlates logs, metrics, and events
  • Generates evidence-linked RCA reports

Observer Agent

Role: Infrastructure Monitoring
  • Watches cluster resources continuously
  • Collects metrics and events
  • Detects anomalies and deviations
  • Reports to other agents

SRI Agent

Role: Conversational Interface
  • Provides natural language interaction
  • Translates questions into actions
  • Explains system state in plain English
  • Guides troubleshooting workflows

Remediation Agent

Role: Fix Proposals & Execution
  • Proposes safe remediation actions
  • Calculates risk and blast radius
  • Applies fixes (with approval)
  • Verifies remediation success

Memory Agent

Role: Knowledge Management
  • Stores incident history
  • Recalls similar past events
  • Suggests fixes from memory
  • Updates knowledge graph

Guardian Agent

Role: Safety & Policy Enforcement
  • Enforces guardrails
  • Assesses action risk
  • Requires approvals when needed
  • Prevents dangerous operations

How Agents Collaborate

Example: Pod Crashes with OOMKilled

Here’s how the Agent Mesh handles a memory overflow incident:
1. Detection (Observer Agent)

Observer Agent: "Pod 'checkout-service-7f9d' crashed at 14:23:15"
Status: OOMKilled
Namespace: production
→ Alert sent to RCA Pipeline Agent
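
The detection step can be illustrated with a minimal sketch, assuming the Observer reads standard Kubernetes pod status objects. The function name `find_oom_killed_containers` and the container name `checkout` are hypothetical and are not part of RubixKube; only the `containerStatuses`, `state`, `lastState`, and `terminated.reason` fields come from the Kubernetes pod status format.

```python
from typing import Any


def find_oom_killed_containers(pod: dict[str, Any]) -> list[str]:
    """Return names of containers whose current or previous state ended in OOMKilled."""
    killed = []
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for status in statuses:
        # A crashed container shows up under "state" until it restarts,
        # and under "lastState" afterwards, so both are checked.
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                killed.append(status["name"])
                break
    return killed


# Hypothetical pod object, trimmed to the fields the check reads.
pod = {
    "metadata": {"name": "checkout-service-7f9d", "namespace": "production"},
    "status": {
        "containerStatuses": [
            {
                "name": "checkout",
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    },
}

print(find_oom_killed_containers(pod))
```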
2. Investigation (RCA Pipeline Agent)

RCA Pipeline Agent: "Analyzing incident..."

Evidence gathered:
- Container memory limit: 512Mi
- Actual memory used at crash: 487Mi (95%)
- Memory usage growing 50Mi/hour
- Recent code deploy 2 hours ago increased payload size

Root Cause: Memory leak in v2.3.1 + undersized container
Confidence: 94%

→ Forward to Remediation Agent
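
The figures in the evidence list can be checked with a few lines of arithmetic. This is only a sketch: it assumes memory grows linearly at the reported rate, and the helper names are hypothetical.

```python
def utilization(used_mi: float, limit_mi: float) -> float:
    """Fraction of the memory limit in use."""
    return used_mi / limit_mi


def hours_until_limit(used_mi: float, limit_mi: float, growth_mi_per_hour: float) -> float:
    """Hours before usage reaches the limit, assuming linear growth."""
    if growth_mi_per_hour <= 0:
        return float("inf")  # usage is flat or shrinking, so the limit is never reached
    return max(limit_mi - used_mi, 0) / growth_mi_per_hour


# Values taken from the evidence above: 487Mi used, 512Mi limit, 50Mi/hour growth.
print(f"{utilization(487, 512):.0%}")      # 95%
print(hours_until_limit(487, 512, 50))     # 0.5
```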
3. Memory Recall (Memory Agent)

Memory Agent: "Checking similar incidents..."

Found: 3 similar OOMKilled events in last 30 days
Previous fix: Increased memory 512Mi → 1Gi
Success rate: 100% (3/3 incidents resolved)

→ Suggest proven fix to Remediation Agent
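
A success rate like the one above could be derived from stored incident records roughly as follows. The `PastIncident` record and `fix_success_rates` function are hypothetical illustrations, not the Memory Agent's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PastIncident:
    """Hypothetical record of a resolved or unresolved past incident."""
    reason: str      # e.g. "OOMKilled"
    fix: str         # identifier of the fix that was applied
    resolved: bool   # whether the fix resolved the incident


def fix_success_rates(history: list[PastIncident], reason: str) -> dict[str, float]:
    """Map each fix tried for `reason` to its share of resolved incidents."""
    totals: dict[str, tuple[int, int]] = {}
    for incident in history:
        if incident.reason != reason:
            continue
        applied, resolved = totals.get(incident.fix, (0, 0))
        totals[incident.fix] = (applied + 1, resolved + int(incident.resolved))
    return {fix: resolved / applied for fix, (applied, resolved) in totals.items()}


# Three similar incidents, all resolved by the same fix, as in the example above.
history = [PastIncident("OOMKilled", "increase_memory_512_to_1024", True)] * 3
print(fix_success_rates(history, "OOMKilled"))
```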
4. Fix Proposal (Remediation Agent)

Remediation Agent: "Proposing remediation..."

Action: Update deployment 'checkout-service'
Change: memory.limits: 512Mi → 1Gi
Blast Radius: Single deployment (low risk)
Expected Impact: Pod restart (~10s downtime)
Rollback Plan: Revert to previous deployment manifest

→ Submit to Guardian Agent for approval
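
In Kubernetes terms, the proposed change corresponds to a patch on the deployment's pod template. The sketch below builds such a patch body; the container name `checkout` is an assumption, and the source does not say how the Remediation Agent actually applies changes.

```python
import json


def memory_limit_patch(container_name: str, new_limit: str) -> dict:
    """Build a strategic merge patch that sets one container's memory limit.

    Containers in a pod template are merged by name, so only the named
    container's limit changes.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container_name,  # assumed container name
                            "resources": {"limits": {"memory": new_limit}},
                        }
                    ]
                }
            }
        }
    }


print(json.dumps(memory_limit_patch("checkout", "1Gi"), indent=2))
```

A body like this could be saved to a file and applied manually with `kubectl patch deployment checkout-service --patch-file <file>` during the approval-based workflow.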
5. Safety Check (Guardian Agent)

Guardian Agent: "Evaluating safety..."

Risk Assessment:
- Scope: Single deployment 
- Change type: Resource adjustment (low risk) 
- Rollback available: Yes 
- Production impact: Minimal 
- Policy compliance: Within limits 

Decision: APPROVED (auto-fix authorized)

→ Execute
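
The risk assessment above can be read as a set of checks that must all pass before auto-approval. The following is a minimal sketch under that assumption; the `Proposal` fields, the allowed-value sets, and the `failed_checks` key are hypothetical and do not describe the Guardian Agent's real policy engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """Hypothetical fix proposal, mirroring the fields shown in the example."""
    action: str
    risk: str
    blast_radius: str
    rollback_available: bool


# Assumed policy: only low-risk, single-deployment changes may be auto-approved.
AUTO_APPROVE_RISKS = {"low"}
AUTO_APPROVE_SCOPES = {"single_deployment"}


def evaluate(proposal: Proposal) -> dict:
    """Approve only when every check passes; otherwise defer to a human."""
    failed = []
    if proposal.risk not in AUTO_APPROVE_RISKS:
        failed.append("risk_above_auto_approve_level")
    if proposal.blast_radius not in AUTO_APPROVE_SCOPES:
        failed.append("blast_radius_too_wide")
    if not proposal.rollback_available:
        failed.append("no_rollback_plan")
    if failed:
        return {"approved": False, "reason": "requires_human_approval", "failed_checks": failed}
    return {"approved": True, "reason": "within_policy_limits", "failed_checks": []}


proposal = Proposal("update_deployment", "low", "single_deployment", True)
print(evaluate(proposal))
```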
6. Execution & Learning

Remediation Agent: Applying fix...
 Deployment updated
 Pod restarted
 Memory usage stabilized at 680Mi
 Service healthy

Memory Agent: Storing learning...
 Pattern recorded
 Fix validated
 Knowledge graph updated

Total resolution time: 42 seconds
With auto-remediation enabled, this entire process runs without human intervention, even while your SRE team sleeps. They wake up to a complete RCA report and a fixed system.

Agent Communication Protocols

How Agents Talk to Each Other

Agents communicate through:
1. Event Bus - Asynchronous message passing
2. Shared Context - Common understanding of cluster state
3. Priority Queuing - Critical incidents get immediate attention
4. Feedback Loops - Agents learn from each other’s successes
# Example Agent Communication Flow

Observer → Detection Event → {
  "type": "pod_crash",
  "severity": "high",
  "pod": "checkout-7f9d",
  "reason": "OOMKilled",
  "timestamp": "2025-10-03T14:23:15Z"
}

RCA Pipeline → Investigation Result → {
  "root_cause": "memory_leak + undersized_container",
  "evidence": ["logs", "metrics", "deployment_diff"],
  "confidence": 0.94
}

Memory Agent → Historical Context → {
  "similar_incidents": 3,
  "proven_fix": "increase_memory_512_to_1024",
  "success_rate": 1.0
}

Remediation → Fix Proposal → {
  "action": "update_deployment",
  "changes": {"memory.limits": "1Gi"},
  "risk": "low",
  "blast_radius": "single_deployment"
}

Guardian → Safety Decision → {
  "approved": true,
  "reason": "within_policy_limits"
}
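
The event bus and priority queuing ideas can be combined in a small in-process sketch. It is illustrative only: the `EventBus` class, the severity ordering, and the synchronous dispatch loop are assumptions, whereas the real mesh is described as asynchronous and distributed.

```python
import heapq
import itertools
from collections import defaultdict
from typing import Callable

# Assumed severity ordering; lower numbers are dispatched first.
SEVERITY_PRIORITY = {"critical": 0, "high": 1, "medium": 2, "low": 3}


class EventBus:
    """Toy publish/subscribe bus that delivers higher-severity events first."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self._queue: list[tuple[int, int, dict]] = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a severity

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event: dict) -> None:
        # Unknown or missing severities sort after all known ones.
        priority = SEVERITY_PRIORITY.get(event.get("severity"), len(SEVERITY_PRIORITY))
        heapq.heappush(self._queue, (priority, next(self._counter), event))

    def dispatch_all(self) -> int:
        """Deliver queued events in priority order; return the number of handler calls."""
        delivered = 0
        while self._queue:
            _, _, event = heapq.heappop(self._queue)
            for handler in self._subscribers[event["type"]]:
                handler(event)
                delivered += 1
        return delivered


bus = EventBus()
bus.subscribe("pod_crash", lambda e: print("RCA received:", e["pod"], e["reason"]))
bus.publish({"type": "pod_crash", "severity": "low", "pod": "batch-job-1a2b", "reason": "Error"})
bus.publish({"type": "pod_crash", "severity": "high", "pod": "checkout-7f9d", "reason": "OOMKilled"})
bus.dispatch_all()  # the high-severity event is handled before the low-severity one
```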

Agent Mesh vs Traditional Approaches

| Feature | Traditional Monitoring | Single AI Bot | Agent Mesh (SRI) |
| --- | --- | --- | --- |
| Specialization | Tools per domain | One system does all | Agents per expertise |
| Reasoning | Rule-based | Generic AI | Domain-specific intelligence |
| Collaboration | Manual integration | N/A | Native mesh communication |
| Fault Tolerance | Tool outages break flow | Single point of failure | Distributed resilience |
| Scalability | Add more tools | Scale up one system | Add more agents |
| Transparency | Logs everywhere | Black box | Agent-level audit trails |
| Learning | Humans update rules | Model retraining | Continuous agent learning |

Agent States and Health

Monitoring Your Agents

[Figure: RubixKube Agents dashboard showing active agents]
Each agent reports:
  • Status - Active, idle, or error
  • Last Activity - When it last performed work
  • Capabilities - What it can do
  • Type - System agent vs custom agent

Agent Health Indicators

Status: Active
  • Last seen: within 60 seconds
  • Response time: less than 2 seconds
  • Error rate: less than 1%

Status: Slow response
  • Last seen: 1-5 minutes ago
  • Response time: 2-10 seconds
  • Error rate: 1-5%
  • Action: Monitor; the agent may self-recover

Status: Error
  • Last seen: more than 5 minutes ago
  • No responses
  • Error rate: greater than 5%
  • Action: Alert sent; manual intervention may be needed
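
The thresholds above can be expressed as a small classification function. This is a sketch, not RubixKube's implementation: it assumes the worst individual indicator decides the overall status, and that a response time above 10 seconds counts as an error, which the source does not specify.

```python
from typing import Optional


def classify_agent_health(
    seconds_since_seen: float,
    response_seconds: Optional[float],  # None means the agent is not responding
    error_rate: float,                  # fraction between 0 and 1
) -> str:
    """Return "active", "slow_response", or "error" using the documented thresholds."""
    # Error tier: last seen more than 5 minutes ago, no responses, or error rate above 5%.
    # Treating response times above 10 seconds as errors is an assumption.
    if (
        seconds_since_seen > 300
        or response_seconds is None
        or response_seconds > 10
        or error_rate > 0.05
    ):
        return "error"
    # Slow tier: last seen 1-5 minutes ago, 2-10 second responses, or 1-5% errors.
    if seconds_since_seen > 60 or response_seconds >= 2 or error_rate >= 0.01:
        return "slow_response"
    return "active"


print(classify_agent_health(30, 0.5, 0.001))
print(classify_agent_health(120, 3.0, 0.02))
print(classify_agent_health(400, None, 0.0))
```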

Extensibility: Custom Agents

Adding Your Own Agents

The Agent Mesh is extensible - you can add custom agents for your specific needs.
Example Use Cases:
  • Compliance-focused agent - Ensures changes meet regulatory requirements
  • Cost Agent - Optimizes resource usage for budget targets
  • Security Agent - Scans for vulnerabilities and misconfigurations
  • Performance Agent - Optimizes application response times
Coming in Future Releases: Full SDK for custom agent development

Agent Mesh Benefits

Speed

Parallel Processing

Multiple agents work simultaneously, not sequentially. Investigation, planning, and safety checks happen in parallel.

Accuracy

Specialized Knowledge

Each agent is an expert in its domain, leading to better decisions than generalist systems.

Resilience

No Single Point of Failure

If one agent fails, others continue working. System degradation is gradual, not catastrophic.

Evolution

Independent Improvement

Each agent can be upgraded independently without affecting the mesh.

Frequently Asked Questions

In your cluster: Only a single lightweight agent runs (~100MB RAM).
In RubixKube Cloud: All other agents (RCA, Remediation, Memory, Guardian) run centrally, with no additional load on your infrastructure.
This hybrid architecture gives you powerful AI without cluster overhead.

By default, agents do not apply changes to your cluster on their own.

Agents operate in observe-only mode initially. They will:
  • Detect issues
  • Investigate root causes
  • Propose fixes
  • NOT apply changes without approval
You enable auto-remediation gradually as trust builds.
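
One way to picture the gradual rollout is an allowlist of action types that may be applied without approval, starting empty. The `RemediationPolicy` class and the action name below are hypothetical; they illustrate the idea and are not a RubixKube configuration schema.

```python
from dataclasses import dataclass, field


@dataclass
class RemediationPolicy:
    """Hypothetical policy: actions are proposed only, unless allowlisted or approved."""
    auto_remediate_actions: set[str] = field(default_factory=set)

    def decide(self, action: str, human_approved: bool = False) -> str:
        # Allowlisted actions and explicitly approved actions may be applied;
        # everything else stays a proposal.
        if action in self.auto_remediate_actions or human_approved:
            return "apply"
        return "propose_only"


policy = RemediationPolicy()  # observe-only start: the allowlist is empty
print(policy.decide("update_memory_limit"))

# Later, once the team trusts this action type, it is added to the allowlist.
policy.auto_remediate_actions.add("update_memory_limit")
print(policy.decide("update_memory_limit"))
```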

Multiple safety layers:

1. Guardian Agent - Reviews all actions before execution
2. Blast Radius Limits - Actions are scoped, never cluster-wide
3. Dry-Run Mode - Test changes before applying
4. Instant Rollback - Every change is reversible
5. Audit Logs - Complete trail of who did what
Plus: All high-risk actions require human approval.

Three ways:
1. Agents Page - Real-time status of all agents
2. Activity Feed - Stream of agent actions and decisions
3. RCA Reports - Detailed explanation of agent reasoning
Full transparency into the mesh.