The Agent Mesh is how RubixKube gets reliability work done. Rather than one general-purpose AI, the mesh is a set of specialists, each expert in one slice of the loop. They share a Knowledge Graph, communicate through typed events, and coordinate under Guardian policies.

Specialisation matters because reliability work is not one job. Watching an environment is different from investigating an incident, which is different from applying a fix safely. Giving each job to its own agent makes the whole system more accurate and more auditable than a single “do-everything” bot.

The agents at a glance

Observer Agent

Eyes on every connected environment. Discovers topology, collects signals, keeps the graph current.

Memory Agent

Stores incidents, resolutions, and operator context. Surfaces prior art when relevant.

RCA Pipeline Agent

Builds evidence-linked causal chains from raw signals. Produces RCA reports.

SRI Agent

The conversational surface. Answers questions, runs skills, drafts actions.

Remediation Agent

Applies approved fixes inside their scoped blast radius, then watches the environment through the stabilisation window.

Guardian Agent

Enforces policy on everything that could change state. Requires approvals where it matters.

Who does what in the OPEL loop

The mesh maps cleanly onto the four phases of the OPEL loop.

| Phase | Primary agent | Supporting agents |
|---|---|---|
| Observe | Observer Agent | Memory Agent (surfaces relevant prior incidents) |
| Plan | RCA Pipeline Agent | Memory Agent, SRI Agent |
| Execute | SRI Agent (draft) and Remediation Agent (apply) | Guardian Agent (validate) |
| Learn | Memory Agent | Observer Agent (continuing verification) |
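
The phase-to-agent mapping above can be written down as a small lookup table. The sketch below is illustrative only: `Phase`, `PhaseOwners`, `OPEL_OWNERS`, and `phases_for` are hypothetical names chosen for this example, not part of any RubixKube API.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    OBSERVE = "Observe"
    PLAN = "Plan"
    EXECUTE = "Execute"
    LEARN = "Learn"


@dataclass(frozen=True)
class PhaseOwners:
    primary: tuple[str, ...]
    supporting: tuple[str, ...]


# Hypothetical table mirroring the mapping in the text; not a RubixKube API.
OPEL_OWNERS = {
    Phase.OBSERVE: PhaseOwners(("Observer Agent",), ("Memory Agent",)),
    Phase.PLAN: PhaseOwners(("RCA Pipeline Agent",), ("Memory Agent", "SRI Agent")),
    Phase.EXECUTE: PhaseOwners(("SRI Agent", "Remediation Agent"), ("Guardian Agent",)),
    Phase.LEARN: PhaseOwners(("Memory Agent",), ("Observer Agent",)),
}


def phases_for(agent: str) -> list[str]:
    """Return the name of every phase in which the agent has any role."""
    return [
        phase.value
        for phase, owners in OPEL_OWNERS.items()
        if agent in owners.primary or agent in owners.supporting
    ]
```

Asking for `phases_for("Memory Agent")` shows that the Memory Agent takes part in three of the four phases, which is why prior art can surface throughout an investigation.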

How the agents communicate

1. Typed events on a shared bus

Agents publish and subscribe to events: “anomaly detected”, “RCA ready”, “action proposed”, “verification complete”. Every event has a schema and a source.

2. The Knowledge Graph as shared state

Durable state lives in the graph, not in individual agent memory. Any agent that needs a resource, a signal, or a relationship reads it from the graph. Any update is applied there and broadcast.

3. Guardian as the gate

Anything that mutates your environment passes through Guardian policy. Policies are per-environment and versioned. Approvals are auditable.

4. Per-run context

Each incident gets an investigation context: the signals involved, the relevant graph region, the prior art surfaced. Agents work on that context, not on the entire graph.
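
To make the "typed events on a shared bus" idea concrete, here is a minimal in-process sketch. It assumes a great deal: the `Event` and `EventBus` classes, the snake_case event names, and the payload fields in `SCHEMAS` are all invented for illustration, and the real bus is presumably a distributed service rather than a Python object.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Event:
    type: str    # e.g. "anomaly_detected"; names here are illustrative
    source: str  # the agent that published the event
    payload: dict = field(default_factory=dict)


# Hypothetical schemas: the payload keys each event type must carry.
SCHEMAS = {
    "anomaly_detected": {"resource", "signal"},
    "rca_ready": {"incident_id", "report"},
    "action_proposed": {"incident_id", "action"},
    "verification_complete": {"incident_id", "stable"},
}


class EventBus:
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[Event], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event: Event) -> None:
        # Reject events that have no schema or that omit required fields.
        required = SCHEMAS.get(event.type)
        if required is None:
            raise ValueError(f"unknown event type: {event.type}")
        missing = required - set(event.payload)
        if missing:
            raise ValueError(f"{event.type} missing fields: {sorted(missing)}")
        for handler in self._subscribers[event.type]:
            handler(event)
```

Validating at publish time means subscribers can rely on an event's shape, and the `source` field records which agent produced it.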

Why specialisation beats a single generalist

A narrow surface means better prompts, smaller tool sets, cleaner evaluations. The Observer has one job. So does Guardian. The combined system is far more predictable than a single general-purpose agent with dozens of tools.
Every action has exactly one agent accountable for it. When something goes wrong, you can trace which agent produced which output and why.
The Observer upgrades on its own cadence. The SRI Agent upgrades on another. You never rebuild the entire stack to improve one phase of the loop.
Guardian can enforce strict limits on the Remediation Agent without constraining how the RCA Pipeline thinks. Different agents, different policy surfaces.
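
As a rough illustration of separate policy surfaces, the sketch below gates only state-changing actions and leaves read-only work alone. Everything in it is an assumption made for the example: the `Policy` and `ProposedAction` shapes, the action names, and the three-way `allow` / `needs_approval` / `deny` result are hypothetical and do not describe Guardian's actual policy format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    environment: str                 # policies are per-environment
    version: int                     # and versioned
    allowed_actions: frozenset[str]
    needs_approval: frozenset[str]


@dataclass(frozen=True)
class ProposedAction:
    agent: str
    environment: str
    action: str
    mutates_state: bool
    approved_by: Optional[str] = None


def evaluate(action: ProposedAction, policy: Policy) -> str:
    """Return "allow", "needs_approval", or "deny" for a proposed action."""
    if not action.mutates_state:
        return "allow"  # read-only work, such as RCA analysis, is not gated here
    if action.environment != policy.environment:
        return "deny"   # a policy only covers its own environment
    if action.action not in policy.allowed_actions:
        return "deny"
    if action.action in policy.needs_approval and action.approved_by is None:
        return "needs_approval"
    return "allow"
```

Under this sketch, a strict `allowed_actions` set constrains the Remediation Agent without touching anything the RCA Pipeline does, because analysis never sets `mutates_state`.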

Where each agent runs

Observer Agent

Runs in your environment (cluster, cloud VM, or Linux host). Lightweight, read-only.

Every other agent

Runs in RubixKube Cloud. You never install, upgrade, or patch them.

This hybrid placement keeps the local footprint on Kubernetes at around 255Mi of RAM and under 10 millicores of CPU (lower for VM installs). Heavy compute (LLM inference, graph queries, memory indexing) stays on the Cloud side.

Common questions

Custom agents are on the roadmap. In the meantime, custom skills let you extend the SRI Agent with your own runbooks, and custom integrations let you add tools any agent can use.
Agents are stateless relative to the graph, so adding environments scales linearly. The Observer fans out per environment. The Cloud-side agents work against the shared graph and spread load across investigation contexts.
The bus is fault-tolerant. If an agent is unreachable, events queue and retry. The Guardian never permits an action that relies on a missing agent’s output.
All agents see the same graph, but each agent has a scoped view of what matters to its job. The Observer does not need to read RCA reports. The Memory Agent does not need to stream raw metrics. This scoping keeps the system fast and the permissions sensible.
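
The queue-and-retry behaviour described above, together with the rule that Guardian never permits an action relying on a missing agent's output, might look roughly like the following. This is a simplified sketch under stated assumptions: `drain`, `guardian_permits`, and the `attempts` counter are hypothetical names, and retries here happen immediately, whereas a production bus would normally wait between attempts.

```python
from collections import deque
from typing import Callable


def drain(queue: deque, deliver: Callable[[dict], bool], max_attempts: int = 3) -> list[dict]:
    """Try to deliver each queued event, requeueing failures until attempts run out.

    Returns the events that could not be delivered.
    """
    undelivered = []
    while queue:
        event = queue.popleft()
        if deliver(event):
            continue
        event["attempts"] = event.get("attempts", 0) + 1
        if event["attempts"] < max_attempts:
            queue.append(event)  # retry later, behind other queued events
        else:
            undelivered.append(event)
    return undelivered


def guardian_permits(required_outputs: set[str], available_outputs: set[str]) -> bool:
    """Permit an action only if every output it relies on is present."""
    return required_outputs <= available_outputs
```

If an RCA report never arrives, `guardian_permits({"rca_report"}, set())` is `False`, so a dependent remediation stays blocked rather than proceeding on partial evidence.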

The OPEL Loop

The rhythm every agent in the mesh follows.

Safety and Guardrails

The policy layer that bounds what the mesh can do.