Agent Mesh: Distributed Intelligence for Reliability

The Agent Mesh is RubixKube’s core architectural pattern - a network of specialized AI agents that work together like a distributed SRE team, each bringing unique expertise to infrastructure reliability.

Think of it like this: Instead of one overworked engineer handling everything, you have a team of specialists - one investigates issues, one proposes fixes, one ensures safety, and one remembers everything. They never sleep, never forget, and always collaborate perfectly.

Beta Release Status - Active Agents

Currently Active (Beta):

Observer Agent - Monitors your cluster continuously
RCA Pipeline Agent - Analyzes incidents and generates root cause reports
Memory Agent - Stores incidents and learns patterns
SRI Agent - Provides a conversational interface via Chat

Coming in Future Releases:

Remediation Agent - Autonomous fix execution (suggestions available now, requires manual approval)
Guardian Agent - Full guardrail enforcement (basic safety checks active)

This page describes the complete Agent Mesh architecture, including agents being finalized for production release.

Why a Mesh of Agents?

The Problem with Monolithic AI

A single “do-everything” AI faces fundamental limitations:

Jack of all trades, master of none - Can’t specialize deeply
Single point of failure - If it fails, everything stops
Slow decision-making - Must consider everything at once
Hard to trust - Black box with unclear reasoning

The Agent Mesh Solution

Multiple specialized agents working together:

Deep expertise - Each agent masters one domain
Distributed resilience - System continues if one agent fails
Parallel processing - Agents work simultaneously
Transparent reasoning - See which agent did what and why

Core Agents in the Mesh

RubixKube’s Agent Mesh includes several specialized agents:

RCA Pipeline Agent

Role: Root Cause Analysis

Investigates incidents systematically
Builds dependency graphs
Correlates logs, metrics, and events
Generates evidence-linked RCA reports

Observer Agent

Role: Infrastructure Monitoring

Watches cluster resources continuously
Collects metrics and events
Detects anomalies and deviations
Reports to other agents

SRI Agent

Role: Conversational Interface

Provides natural language interaction
Translates questions into actions
Explains system state in plain English
Guides troubleshooting workflows

Remediation Agent

Role: Fix Proposals & Execution

Proposes safe remediation actions
Calculates risk and blast radius
Applies fixes (with approval)
Verifies remediation success

Memory Agent

Role: Knowledge Management

Stores incident history
Recalls similar past events
Suggests fixes from memory
Updates knowledge graph

Guardian Agent

Role: Safety & Policy Enforcement

Enforces guardrails
Assesses action risk
Requires approvals when needed
Prevents dangerous operations

How Agents Collaborate

Example: Pod Crashes with OOMKilled

Here’s how the Agent Mesh handles a memory overflow incident:

Detection (Observer Agent)

Observer Agent: "Pod 'checkout-service-7f9d' crashed at 14:23:15"
Status: OOMKilled
Namespace: production
→ Alert sent to RCA Pipeline Agent
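As a rough illustration of the detection step, the sketch below turns a Kubernetes container status into an event shaped like the one the Observer Agent emits. The `lastState.terminated.reason` and `finishedAt` fields are standard Kubernetes pod status fields; the function name `detection_event` and the exact event layout are assumptions for illustration, not RubixKube's actual implementation.

```python
from typing import Optional


def detection_event(pod_name: str, namespace: str, container_status: dict) -> Optional[dict]:
    """Return a pod_crash event if the container was last terminated as OOMKilled."""
    terminated = (container_status.get("lastState") or {}).get("terminated") or {}
    if terminated.get("reason") != "OOMKilled":
        return None  # not an out-of-memory kill, nothing to report here
    return {
        "type": "pod_crash",
        "severity": "high",  # assumed mapping: OOM kills in production are high severity
        "pod": pod_name,
        "namespace": namespace,
        "reason": "OOMKilled",
        "timestamp": terminated.get("finishedAt"),
    }


# Hand-written status fragment mirroring the incident above
status = {
    "lastState": {
        "terminated": {"reason": "OOMKilled", "finishedAt": "2025-10-03T14:23:15Z"}
    }
}
event = detection_event("checkout-service-7f9d", "production", status)
```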

Investigation (RCA Pipeline Agent)

RCA Pipeline Agent: "Analyzing incident..."

Evidence gathered:
- Container memory limit: 512Mi
- Actual memory used at crash: 487Mi (95%)
- Memory usage growing 50Mi/hour
- Recent code deploy 2 hours ago increased payload size

Root Cause: Memory leak in v2.3.1 + undersized container
Confidence: 94%

→ Forward to Remediation Agent
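The evidence figures can be checked with simple arithmetic. The sketch below assumes that 487Mi is the last usage sample before the kill and that memory grows linearly at the observed rate; both are simplifying assumptions, and the variable names are illustrative.

```python
LIMIT_MI = 512            # container memory limit
LAST_SAMPLE_MI = 487      # last observed usage before the crash
GROWTH_MI_PER_HOUR = 50   # observed growth rate

utilization = LAST_SAMPLE_MI / LIMIT_MI            # fraction of the limit in use
headroom_mi = LIMIT_MI - LAST_SAMPLE_MI            # memory left before hitting the limit
hours_to_limit = headroom_mi / GROWTH_MI_PER_HOUR  # time left at linear growth

print(f"utilization: {utilization:.0%}")
print(f"headroom: {headroom_mi}Mi, about {hours_to_limit * 60:.0f} minutes at the observed rate")
```

With these inputs the container has 25Mi of headroom, which at 50Mi/hour is consumed in roughly half an hour. That matches the report's conclusion that the limit is too small for the current growth pattern.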

Memory Recall (Memory Agent)

Memory Agent: "Checking similar incidents..."

Found: 3 similar OOMKilled events in last 30 days
Previous fix: Increased memory 512Mi → 1Gi
Success rate: 100% (3/3 incidents resolved)

→ Suggest proven fix to Remediation Agent
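One way to picture the recall step is as a lookup over past incidents that groups outcomes by fix and reports a success rate. This is a minimal sketch under assumed names (`PastIncident`, `proven_fix`); how the Memory Agent actually stores and matches incidents is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PastIncident:
    reason: str      # e.g. "OOMKilled"
    fix: str         # identifier of the fix that was applied
    resolved: bool   # whether the fix resolved the incident


def proven_fix(history, reason):
    """Return (fix, success_rate, sample_count) for the best-performing fix, or None."""
    outcomes = defaultdict(list)
    for incident in history:
        if incident.reason == reason:
            outcomes[incident.fix].append(incident.resolved)
    if not outcomes:
        return None
    # Prefer the highest success rate; break ties by the larger sample size.
    fix, results = max(
        outcomes.items(),
        key=lambda item: (sum(item[1]) / len(item[1]), len(item[1])),
    )
    return fix, sum(results) / len(results), len(results)


history = [PastIncident("OOMKilled", "increase_memory_512_to_1024", True) for _ in range(3)]
suggestion = proven_fix(history, "OOMKilled")
```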

Fix Proposal (Remediation Agent)

Remediation Agent: "Proposing remediation..."

Action: Update deployment 'checkout-service'
Change: memory.limits: 512Mi → 1Gi
Blast Radius: Single deployment (low risk)
Expected Impact: Pod restart (~10s downtime)
Rollback Plan: Revert to previous deployment manifest

→ Submit to Guardian Agent for approval
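In Kubernetes terms, the proposed change corresponds to a patch against the Deployment's container resources. The sketch below builds a strategic merge patch body; the `resources.limits.memory` path is the standard Kubernetes field, while the container name `checkout-service` and the helper name `memory_limit_patch` are assumptions. How RubixKube applies the change internally is not documented here.

```python
import json


def memory_limit_patch(container_name: str, new_limit: str) -> dict:
    """Build a strategic merge patch that sets one container's memory limit."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        # Strategic merge matches list entries by "name",
                        # so other containers and fields are left untouched.
                        {
                            "name": container_name,
                            "resources": {"limits": {"memory": new_limit}},
                        }
                    ]
                }
            }
        }
    }


patch = memory_limit_patch("checkout-service", "1Gi")
print(json.dumps(patch, indent=2))
```

Because the patch changes the pod template, applying it triggers a rollout, which is consistent with the pod restart listed under Expected Impact.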

Safety Check (Guardian Agent)

Guardian Agent: "Evaluating safety..."

Risk Assessment:
- Scope: Single deployment 
- Change type: Resource adjustment (low risk) 
- Rollback available: Yes 
- Production impact: Minimal 
- Policy compliance: Within limits 

Decision: APPROVED (auto-fix authorized)

→ Execute
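The risk assessment reads like a checklist in which every item must pass before auto-fix is authorized. A minimal sketch of that idea follows; the `Proposal` fields, the allow-lists, and the `NEEDS_HUMAN_APPROVAL` fallback are illustrative assumptions rather than the Guardian Agent's real policy model.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    scope: str                  # e.g. "single_deployment"
    change_type: str            # e.g. "resource_adjustment"
    rollback_available: bool
    within_policy_limits: bool


# Assumed allow-lists: only narrowly scoped, low-risk changes may be auto-approved.
AUTO_APPROVABLE_SCOPES = {"single_deployment"}
AUTO_APPROVABLE_CHANGES = {"resource_adjustment"}


def evaluate(proposal: Proposal) -> dict:
    """Approve only if every check passes; otherwise escalate to a human."""
    checks = {
        "scope": proposal.scope in AUTO_APPROVABLE_SCOPES,
        "change_type": proposal.change_type in AUTO_APPROVABLE_CHANGES,
        "rollback_available": proposal.rollback_available,
        "policy_compliance": proposal.within_policy_limits,
    }
    failed = [name for name, ok in checks.items() if not ok]
    decision = "APPROVED" if not failed else "NEEDS_HUMAN_APPROVAL"
    return {"decision": decision, "failed_checks": failed}


decision = evaluate(Proposal("single_deployment", "resource_adjustment", True, True))
```

Returning the list of failed checks alongside the decision keeps the reasoning visible, in line with the mesh's emphasis on transparent, auditable agent behavior.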

Execution & Learning

Remediation Agent: Applying fix...
 Deployment updated
 Pod restarted
 Memory usage stabilized at 680Mi
 Service healthy

Memory Agent: Storing learning...
 Pattern recorded
 Fix validated
 Knowledge graph updated

Total resolution time: 42 seconds

With auto-remediation enabled, this entire process runs automatically while your SRE team sleeps. They wake up to a complete RCA report and a fixed system.

Agent Communication Protocols

How Agents Talk to Each Other

Agents communicate through:

1. Event Bus - Asynchronous message passing
2. Shared Context - Common understanding of cluster state
3. Priority Queuing - Critical incidents get immediate attention
4. Feedback Loops - Agents learn from each other’s successes
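To make the event bus and priority queuing ideas concrete, here is a small in-process sketch: publishers enqueue events, and subscribers receive them in severity order. The `EventBus` class and the severity ranking are assumptions for illustration (only "high" appears in this page's examples); a production mesh would use a real message broker.

```python
import heapq
import itertools

# Assumed ranking: lower number is delivered first.
SEVERITY_PRIORITY = {"critical": 0, "high": 1, "medium": 2, "low": 3}


class EventBus:
    def __init__(self):
        self._queue = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a severity
        self._subscribers = {}

    def subscribe(self, event_type, handler):
        self._subscribers.setdefault(event_type, []).append(handler)

    def publish(self, event):
        priority = SEVERITY_PRIORITY.get(event.get("severity"), 3)
        heapq.heappush(self._queue, (priority, next(self._counter), event))

    def drain(self):
        """Deliver all queued events, most severe first."""
        while self._queue:
            _, _, event = heapq.heappop(self._queue)
            for handler in self._subscribers.get(event["type"], []):
                handler(event)


delivered = []
bus = EventBus()
bus.subscribe("pod_crash", lambda e: delivered.append(e["pod"]))
bus.publish({"type": "pod_crash", "severity": "low", "pod": "batch-job-1"})
bus.publish({"type": "pod_crash", "severity": "high", "pod": "checkout-service-7f9d"})
bus.drain()
```

Although the low-severity event was published first, the high-severity crash is handled first.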

# Example Agent Communication Flow

Observer → Detection Event → {
  "type": "pod_crash",
  "severity": "high",
  "pod": "checkout-service-7f9d",
  "reason": "OOMKilled",
  "timestamp": "2025-10-03T14:23:15Z"
}

RCA Pipeline → Investigation Result → {
  "root_cause": "memory_leak + undersized_container",
  "evidence": ["logs", "metrics", "deployment_diff"],
  "confidence": 0.94
}

Memory Agent → Historical Context → {
  "similar_incidents": 3,
  "proven_fix": "increase_memory_512_to_1024",
  "success_rate": 1.0
}

Remediation → Fix Proposal → {
  "action": "update_deployment",
  "changes": {"memory.limits": "1Gi"},
  "risk": "low",
  "blast_radius": "single_deployment"
}

Guardian → Safety Decision → {
  "approved": true,
  "reason": "within_policy_limits"
}
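Payloads like those in the flow above can be carried as plain JSON. The sketch below wraps a payload in a small envelope recording the sender and the message kind, then round-trips it through a JSON string. The `Envelope` type and its field names are hypothetical; the actual wire format is not specified on this page.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Envelope:
    sender: str    # which agent produced the message
    kind: str      # e.g. "investigation_result"
    payload: dict  # the message body


def to_wire(message: Envelope) -> str:
    """Serialize an envelope to a JSON string."""
    return json.dumps(asdict(message), sort_keys=True)


def from_wire(raw: str) -> Envelope:
    """Parse a JSON string back into an envelope."""
    return Envelope(**json.loads(raw))


original = Envelope(
    sender="rca_pipeline",
    kind="investigation_result",
    payload={"root_cause": "memory_leak + undersized_container", "confidence": 0.94},
)
restored = from_wire(to_wire(original))
```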

Agent Mesh vs Traditional Approaches

Feature	Traditional Monitoring	Single AI Bot	Agent Mesh (SRI)
Specialization	Tools per domain	One system does all	Agents per expertise
Reasoning	Rule-based	Generic AI	Domain-specific intelligence
Collaboration	Manual integration	N/A	Native mesh communication
Fault Tolerance	Tool outages break flow	Single point of failure	Distributed resilience
Scalability	Add more tools	Scale up one system	Add more agents
Transparency	Logs everywhere	Black box	Agent-level audit trails
Learning	Humans update rules	Model retraining	Continuous agent learning

Agent States and Health

Monitoring Your Agents

Figure: RubixKube Agents dashboard showing active agents

Each agent reports:

Status - Active, idle, or error
Last Activity - When it last performed work
Capabilities - What it can do
Type - System agent vs custom agent

Agent Health Indicators

Healthy Agent

Status: Active
Last seen: within 60 seconds
Response time: less than 2 seconds
Error rate: less than 1%

Degraded Agent

Status: Slow response
Last seen: 1-5 minutes ago
Response time: 2-10 seconds
Error rate: 1-5%

Action: Monitor, may self-recover

Failed Agent

Status: Error
Last seen: more than 5 minutes ago
No responses
Error rate: greater than 5%

Action: Alert sent, manual intervention may be needed
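The three tiers above can be expressed as a small classifier. This sketch assumes the worst of the three signals decides the overall state, and that a response slower than 10 seconds counts as failed. Neither rule is stated on this page, and the function name `classify_health` is illustrative.

```python
from typing import Optional


def classify_health(
    seconds_since_seen: float,
    response_seconds: Optional[float],  # None means the agent is not responding
    error_rate: float,                  # fraction, e.g. 0.02 for 2%
) -> str:
    """Map the three health signals to "healthy", "degraded", or "failed"."""
    if (
        seconds_since_seen > 300
        or response_seconds is None
        or response_seconds > 10
        or error_rate > 0.05
    ):
        return "failed"
    if seconds_since_seen > 60 or response_seconds >= 2 or error_rate >= 0.01:
        return "degraded"
    return "healthy"


print(classify_health(30, 0.5, 0.001))
print(classify_health(120, 3.0, 0.02))
print(classify_health(600, None, 0.20))
```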

Extensibility: Custom Agents

Adding Your Own Agents

The Agent Mesh is extensible - you can add custom agents for your specific needs.

Example Use Cases:

Compliance Agent - Ensures changes meet regulatory requirements
Cost Agent - Optimizes resource usage for budget targets
Security Agent - Scans for vulnerabilities and misconfigurations
Performance Agent - Optimizes application response times

Coming in Future Releases: Full SDK for custom agent development

Agent Mesh Benefits

Speed

Parallel Processing

Multiple agents work simultaneously, not sequentially. Investigation, planning, and safety checks happen in parallel.
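As a sketch of what running agent work in parallel can look like, the example below uses `asyncio.gather` to run an investigation and a history lookup concurrently and then merges their results. The two coroutine bodies are stand-ins with hard-coded results taken from the incident example; they are not real agent calls.

```python
import asyncio


async def investigate(event: dict) -> dict:
    await asyncio.sleep(0.01)  # stand-in for log and metric queries
    return {"root_cause": "memory_leak + undersized_container"}


async def recall(event: dict) -> dict:
    await asyncio.sleep(0.01)  # stand-in for a history lookup
    return {"similar_incidents": 3}


async def handle(event: dict) -> dict:
    # Both coroutines wait concurrently, so total time is about the slower one,
    # not the sum of the two.
    rca, history = await asyncio.gather(investigate(event), recall(event))
    return {**rca, **history}


result = asyncio.run(handle({"type": "pod_crash", "reason": "OOMKilled"}))
```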

Accuracy

Specialized Knowledge

Each agent is an expert in its domain, leading to better decisions than generalist systems.

Resilience

No Single Point of Failure

If one agent fails, others continue working. System degradation is gradual, not catastrophic.

Evolution

Independent Improvement

Each agent can be upgraded independently without affecting the mesh.

Frequently Asked Questions

How many agents run in my cluster?

In your cluster: Two lightweight components run locally: the RubixKube Observer and the Kubernetes MCP Server. Combined, they typically use around 255Mi RAM and minimal CPU (under 10 millicores during normal operation).

In RubixKube Cloud: All other agents (RCA, Remediation, Memory, Guardian) run centrally, adding no further load to your infrastructure.

This hybrid architecture gives you powerful AI with minimal cluster overhead.

Can agents act without permission?

By default: NO.

Agents operate in observe-only mode initially. They will:

Detect issues
Investigate root causes
Propose fixes

They will not apply changes without approval.

You enable auto-remediation gradually as trust builds.
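The gating described here can be sketched as a single decision function. The mode names and the rule that only low-risk actions may skip human approval are assumptions for illustration; the actual configuration options are not listed on this page.

```python
from enum import Enum


class Mode(Enum):
    OBSERVE_ONLY = "observe_only"      # default: propose, never apply
    AUTO_REMEDIATE = "auto_remediate"  # opt-in: low-risk fixes may be applied


def may_apply(mode: Mode, risk: str, human_approved: bool) -> bool:
    """Decide whether a proposed fix may be applied."""
    if human_approved:
        return True  # explicit approval always allows the action
    # Without approval, only low-risk actions in auto-remediate mode proceed.
    return mode is Mode.AUTO_REMEDIATE and risk == "low"


print(may_apply(Mode.OBSERVE_ONLY, "low", human_approved=False))
print(may_apply(Mode.AUTO_REMEDIATE, "low", human_approved=False))
print(may_apply(Mode.AUTO_REMEDIATE, "high", human_approved=False))
```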

What if an agent makes a mistake?

Multiple safety layers:

1. Guardian Agent - Reviews all actions before execution
2. Blast Radius Limits - Actions are scoped, never cluster-wide
3. Dry-Run Mode - Test changes before applying
4. Instant Rollback - Every change is reversible
5. Audit Logs - Complete trail of who did what

Plus: All high-risk actions require human approval.

How do I see what agents are doing?

Three ways:

1. Agents Page - Real-time status of all agents
2. Activity Feed - Stream of agent actions and decisions
3. RCA Reports - Detailed explanation of agent reasoning

Full transparency into the mesh.

Related Concepts

What is SRI? - Learn about Site Reliability Intelligence
Memory Engine - How agents learn and remember
Guardrails - Safety mechanisms for autonomous ops
