The OPEL loop (Observe, Plan, Execute, Learn) is the rhythm RubixKube runs on. Every agent in the mesh participates, and every incident passes through the same four phases. The loop is policy-driven, memory-backed, and always active.
The shape comes from SRE, translated for the AI era. SRE gave us SLIs, error budgets, and runbooks. Site Reliability Intelligence (SRI) folds that discipline into a continuous loop that an AI can run against infrastructure at a speed humans cannot match.
OPEL is defined in detail in the Age of Site Reliability Intelligence essay on the RubixKube blog. This page is the operational summary.

The four phases

1. Observe

The Observer Agent walks every connected environment through read-only APIs. It maps topology, collects metrics, events, logs, and state, and keeps the Knowledge Graph current.
Observation is continuous. Nothing waits on a cron, and nothing is polled less often than its natural rate of change.
2. Plan

When a signal drifts beyond its learned baseline, the RCA Pipeline Agent takes over. It pulls related signals, recent changes, and prior incidents from the Memory Engine, and constructs a causal chain.
The output of Plan is a ranked set of candidate actions, each with expected blast radius and confidence.
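As an illustration only, the two ideas in Plan (a signal drifting past a baseline, and a ranked set of candidates) can be sketched in a few lines of Python. Nothing here is RubixKube's implementation: the z-score test stands in for whatever "learned baseline" means in the product, and `drifted`, `Candidate`, `rank`, the confidence scale, and the blast-radius unit are all hypothetical names and assumptions.

```python
from dataclasses import dataclass
from statistics import mean, stdev


def drifted(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Return True when `latest` is more than `threshold` standard deviations
    from the mean of `history` (a simple stand-in for a learned baseline)."""
    if len(history) < 2:
        return False  # not enough samples to form a baseline
    sigma = stdev(history)
    if sigma == 0:
        return latest != history[0]  # flat baseline: any change counts as drift
    return abs(latest - mean(history)) / sigma > threshold


@dataclass(frozen=True)
class Candidate:
    action: str
    confidence: float  # assumed scale: 0.0 to 1.0
    blast_radius: int  # assumed unit: number of workloads the action touches


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Highest confidence first; smaller blast radius breaks ties."""
    return sorted(candidates, key=lambda c: (-c.confidence, c.blast_radius))
```

Sorting on confidence before blast radius is one possible ordering; a real planner might weight the two differently or fold in policy constraints.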
3. Execute

The SRI Agent surfaces the top candidate to a human through Chat or the Action Center. The Guardian Agent validates that the action is inside policy for the target environment. If approved, the Remediation Agent applies it.
Nothing lands in your environment without an approval step. Policies decide who can approve what.
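The "who can approve what" rule can be pictured as a small policy check. The sketch below is an assumption about the shape of such a check, not the Guardian Agent's actual policy format: `Policy`, `may_apply`, the action names, and the role names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    environment: str
    allowed_actions: frozenset  # hypothetical action identifiers
    approver_roles: frozenset   # hypothetical role names


def may_apply(policy: Policy, environment: str, action: str,
              approver_role: Optional[str]) -> bool:
    """Allow an action only if it targets the policy's environment, is on the
    allow-list, and was approved by someone holding a permitted role."""
    if environment != policy.environment:
        return False
    if action not in policy.allowed_actions:
        return False
    if approver_role is None:  # no approval recorded: never apply
        return False
    return approver_role in policy.approver_roles


# Example policy for a production environment (illustrative values only).
PRODUCTION = Policy(
    environment="production",
    allowed_actions=frozenset({"patch_memory_limit", "rollback"}),
    approver_roles=frozenset({"on-call"}),
)
```

The check denies by default: a missing approval, an unknown action, or a mismatched environment each block the change on its own.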
4. Learn

After execution, the Observer keeps watching. If signals return to baseline, the incident is marked verified. If they do not, the loop rewinds to Plan with new evidence.
Either way, the Memory Engine stores the outcome, the resolution notes, and the operator context. The next matching incident starts with that prior art in scope.
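The branch described in Learn (verify, or rewind to Plan, and record the outcome either way) can be sketched as a single function. This is a minimal model under stated assumptions: `Incident`, `learn`, the plain list standing in for the Memory Engine, and the choice to return to "observe" after verification are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    id: str
    status: str = "open"
    evidence: list = field(default_factory=list)


def learn(incident: Incident, back_to_baseline: bool, note: str, memory: list) -> str:
    """Record the outcome, then return the name of the next phase."""
    # The outcome is stored whether or not the fix worked.
    memory.append({"incident": incident.id, "verified": back_to_baseline, "note": note})
    if back_to_baseline:
        incident.status = "verified"
        return "observe"  # assumption: a verified incident drops back to plain observation
    # The failed attempt becomes new evidence for the next Plan pass.
    incident.evidence.append(note)
    return "plan"
```

A failed fix is treated as information rather than discarded, which is what lets the next Plan pass start from a narrower search.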

What makes the loop different

Continuous, not reactive

The loop is always running. It does not wait for an alert to open.

Policy-driven

Every mutating step is bounded by Guardian policy. No surprises.

Memory-backed

Every pass updates long-term memory. The next pass is sharper.

Evidence-cited

Every claim links back to the signal it came from. Trust is verifiable.

A concrete example

Consider a payments deployment in production, where a rollout introduces a subtle memory leak.
The Observer notices payments-api memory usage ticking up on every pod in the new ReplicaSet. The Knowledge Graph flags the drift.
The RCA Pipeline pulls the deployment event, the memory curves, and the last three incidents involving payments-api. It recognises a shape from a prior RCA resolved three months ago. Recommended action: raise the memory limit from 512Mi to 1Gi, and roll back if usage has not improved within ten minutes.
The SRI Agent posts the recommendation to the #payments-ops Slack channel. On-call approves. Guardian confirms the scope is inside the production policy. Remediation applies the patch. The deployment stabilises.
The Memory Engine records the full thread: signal, causal chain, applied fix, verification. The operator adds a note: “Peak traffic pattern from campaign X repeats every Tuesday”. Next Tuesday’s campaign inherits this context automatically.
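To make the recommended action concrete, here is one way the change could look if expressed as a Kubernetes strategic merge patch for a Deployment, plus the ten-minute rollback rule. This is a sketch, not what the Remediation Agent sends: the container name `payments-api` is assumed to match the workload name, and `memory_limit_patch` and `should_roll_back` are hypothetical helpers.

```python
import json


def memory_limit_patch(container: str, limit: str) -> dict:
    """Build a strategic merge patch body that sets one container's memory limit.
    Containers in a pod template are merged by name, so only this field changes."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": container, "resources": {"limits": {"memory": limit}}}
                    ]
                }
            }
        }
    }


def should_roll_back(minutes_since_patch: float, improved: bool,
                     window_minutes: float = 10.0) -> bool:
    """Roll back only once the window has elapsed without improvement."""
    return minutes_since_patch >= window_minutes and not improved


if __name__ == "__main__":
    # Container name is an assumption; the example above only names the service.
    print(json.dumps(memory_limit_patch("payments-api", "1Gi"), indent=2))
```

A body of this shape is what `kubectl patch deployment <name> --type strategic -p '<json>'` accepts, which makes the change small and easy to review before approval.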

Where each concept fits in the loop

| Phase | Primary agent | Key concepts |
| --- | --- | --- |
| Observe | Observer Agent | Knowledge Graph, Environments |
| Plan | RCA Pipeline Agent | Insights, Memory Engine |
| Execute | SRI Agent and Remediation Agent | Actions, Skills |
| Learn | Memory Agent | Memory Engine, Safety and Guardrails |

Common questions

How does OPEL differ from OODA or PDCA?

OPEL borrows the spirit of continuous improvement loops but is built for machine speed. OODA (observe, orient, decide, act) targets human decision-making; PDCA (plan, do, check, act) targets process improvement. OPEL runs every few minutes against live infrastructure, with the Knowledge Graph and Memory Engine as durable state between passes.

Can the loop change my environment without human approval?

Not by default. The Execute phase always passes through Guardian policies, which are explicit about what is approvable and by whom. You can loosen policies on lab environments and tighten them in production.

How long does each phase take?

Observation is continuous. A Plan pass completes in seconds to a few minutes, depending on how wide the causal search is. Execute latency is bounded by human approval time. Learn is instant once the outcome is confirmed.

What happens when several incidents are open at once?

Each incident has its own loop instance. The Knowledge Graph is shared, so an RCA in one loop can reference signals another loop is investigating. Guardian serialises mutating actions that would conflict.
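Serialising conflicting actions can be pictured as one lock per target resource: loops working on different targets proceed in parallel, while two actions against the same target run one after the other. The sketch below assumes that "conflict" means "same target", which the text does not define; `ActionSerializer` is a hypothetical name, not a RubixKube component.

```python
import threading
from collections import defaultdict
from typing import Callable, TypeVar

T = TypeVar("T")


class ActionSerializer:
    """Runs mutating actions one at a time per target resource."""

    def __init__(self) -> None:
        self._guard = threading.Lock()               # protects the lock table
        self._locks = defaultdict(threading.Lock)    # one lock per target

    def run(self, target: str, action: Callable[[], T]) -> T:
        # Look up (or create) the target's lock under the guard so two
        # threads never create separate locks for the same target.
        with self._guard:
            lock = self._locks[target]
        # Hold the target lock for the whole action; other targets are unaffected.
        with lock:
            return action()
```

Keying the lock on the target keeps unrelated incidents from blocking each other, at the cost of missing conflicts that span more than one resource.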