OPEL is defined in detail in the Age of Site Reliability Intelligence essay on the RubixKube blog. This page is the operational summary.
The four phases
Observe
The Observer Agent walks every connected environment through read-only APIs. It maps topology, collects metrics, events, logs, and state, and keeps the Knowledge Graph current.
Observation is continuous. Nothing waits on a cron, and nothing is polled less often than its natural rate of change.
Plan
When a signal drifts beyond its learned baseline, the RCA Pipeline Agent takes over. It pulls related signals, recent changes, and prior incidents from the Memory Engine, and constructs a causal chain.
The output of Plan is a ranked set of candidate actions, each with expected blast radius and confidence.
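As a rough illustration of what a ranked set of candidates could look like, the sketch below orders actions by confidence and breaks ties with the smaller blast radius. The `Candidate` type, its fields, the ordering rule, and the sample values are all assumptions made for this example; the source does not specify how the RCA Pipeline Agent actually scores or ranks actions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate action produced by the Plan phase."""
    name: str
    confidence: float   # assumed scale: 0.0 (no confidence) to 1.0 (certain)
    blast_radius: int   # assumed unit: number of workloads the action could touch


def rank(candidates):
    """Order by highest confidence first, then by smallest blast radius."""
    return sorted(candidates, key=lambda c: (-c.confidence, c.blast_radius))


# Invented sample data, for illustration only.
candidates = [
    Candidate("restart pods", confidence=0.40, blast_radius=6),
    Candidate("roll back deployment", confidence=0.85, blast_radius=3),
    Candidate("raise memory limit", confidence=0.85, blast_radius=1),
]

ranked = rank(candidates)
for position, candidate in enumerate(ranked, start=1):
    print(position, candidate.name, candidate.confidence, candidate.blast_radius)
```

Breaking ties on blast radius is one possible design choice: when two actions look equally likely to work, the one that touches less of the environment goes first.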
Execute
The SRI Agent surfaces the top candidate to a human through Chat or the Action Center. The Guardian Agent validates that the action is inside policy for the target environment. If approved, the Remediation Agent applies it.
Nothing lands in your environment without an approval step. Policies decide who can approve what.
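The gate described above can be pictured as two checks: is the action inside policy for the environment, and did someone allowed to approve it actually approve it? The following is a minimal sketch under stated assumptions. `Policy`, `Request`, `may_execute`, the role names, and the action names are hypothetical and do not reflect the real Guardian policy format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    """Hypothetical per-environment policy: which actions, approved by whom."""
    allowed_actions: frozenset
    approvers: frozenset


@dataclass(frozen=True)
class Request:
    """Hypothetical request to apply one action in one environment."""
    action: str
    environment: str
    approved_by: Optional[str] = None  # None means nobody has approved yet


# Invented policies: tighter in production, looser in a lab environment.
POLICIES = {
    "production": Policy(frozenset({"patch_memory_limit"}), frozenset({"oncall"})),
    "lab": Policy(
        frozenset({"patch_memory_limit", "restart_pods"}),
        frozenset({"oncall", "developer"}),
    ),
}


def may_execute(request, policies=POLICIES):
    """Allow only in-policy actions approved by a permitted role."""
    policy = policies.get(request.environment)
    if policy is None:
        return False  # unknown environment: deny by default
    if request.action not in policy.allowed_actions:
        return False  # action is outside policy for this environment
    return request.approved_by in policy.approvers


print(may_execute(Request("patch_memory_limit", "production", "oncall")))
print(may_execute(Request("patch_memory_limit", "production")))
```

The sketch denies by default: a missing policy, an out-of-policy action, or a missing approval each block execution.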
Learn
After execution, the Observer keeps watching. If signals return to baseline, the incident is marked verified. If they do not, the loop rewinds to Plan with new evidence.
Either way, the Memory Engine stores the outcome, the resolution notes, and the operator context. The next matching incident starts with that prior art in scope.
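The verify-or-rewind decision can be sketched as a small function: check the post-fix samples against the baseline, record the outcome either way, and return the next phase. Everything here is an assumption for illustration, including the `MemoryStore` stand-in, the fixed tolerance band, and the phase names `"done"` and `"plan"`. The real Memory Engine and its verification rules are not described at this level in the source.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Hypothetical stand-in for the Memory Engine: an append-only list."""
    records: list = field(default_factory=list)

    def store(self, incident_id, outcome, notes):
        self.records.append(
            {"incident": incident_id, "outcome": outcome, "notes": notes}
        )


def learn(incident_id, samples, baseline, tolerance, memory, notes=""):
    """Verify the fix, record the outcome, and return the next phase.

    Returns "done" when every sample is within `tolerance` of `baseline`,
    otherwise "plan" to signal that the loop should rewind.
    """
    # An empty sample list is treated as unverified, not as success.
    verified = bool(samples) and all(
        abs(sample - baseline) <= tolerance for sample in samples
    )
    memory.store(incident_id, "verified" if verified else "unresolved", notes)
    return "done" if verified else "plan"


memory = MemoryStore()
# Invented readings in MiB, for illustration only.
first = learn("inc-1", [410, 405, 398], baseline=400, tolerance=25, memory=memory)
second = learn("inc-2", [410, 520], baseline=400, tolerance=25, memory=memory)
print(first, second, len(memory.records))
```

Note that the outcome is stored on both paths, which mirrors the point that a failed fix is also evidence for the next pass.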
What makes the loop different
Continuous, not reactive
The loop is always running. It does not wait for an alert to open.
Policy-driven
Every mutating step is bounded by Guardian policy. No surprises.
Memory-backed
Every pass updates long-term memory. The next pass is sharper.
Evidence-cited
Every claim links back to the signal it came from. Trust is verifiable.
A concrete example
A payments deployment is running in production. A rollout introduces a subtle memory leak.
Observe
The Observer notices payments-api memory usage ticking up on every pod in the new ReplicaSet. The Knowledge Graph flags the drift.
Plan
The RCA Pipeline pulls the deployment event, the memory curves, and the last three incidents involving payments-api. It recognises a shape from a prior RCA resolved three months ago. Recommended action: raise the memory limit from 512Mi to 1Gi, and roll back if memory usage has not improved within ten minutes.
Execute
The SRI Agent posts the recommendation to the #payments-ops Slack channel. The on-call engineer approves. Guardian confirms the scope is inside the production policy. Remediation applies the patch. The deployment stabilises.
Learn
The Memory Engine records the full thread: signal, causal chain, applied fix, verification. The operator adds a note: “Peak traffic pattern from campaign X repeats every Tuesday”. Next Tuesday’s campaign inherits this context automatically.
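To make the Observe step of this example more concrete, the sketch below flags a ReplicaSet only when every pod's memory reading sits well above a baseline. It uses a simple z-score threshold, which is an assumption chosen for brevity; the source does not say how baselines are learned or how drift is measured. The function names, pod names, and readings are invented.

```python
from statistics import mean, stdev


def drifted(history, current, threshold=3.0):
    """True if `current` is more than `threshold` standard deviations
    above the mean of `history` (a plain z-score check)."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current > mu  # flat history: any increase counts as drift
    return (current - mu) / sigma > threshold


def replicaset_drift(history, current_by_pod, threshold=3.0):
    """Flag the ReplicaSet only when every pod has drifted upward."""
    return bool(current_by_pod) and all(
        drifted(history, value, threshold) for value in current_by_pod.values()
    )


# Invented readings in MiB, for illustration only.
baseline_mib = [300, 310, 295, 305, 298]
leaking = {"payments-api-a": 420, "payments-api-b": 435, "payments-api-c": 410}
one_noisy_pod = {"payments-api-a": 420, "payments-api-b": 305, "payments-api-c": 299}

print(replicaset_drift(baseline_mib, leaking))
print(replicaset_drift(baseline_mib, one_noisy_pod))
```

Requiring drift on every pod is what separates a rollout-wide leak from a single noisy pod in this sketch.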
Where each concept fits in the loop
| Phase | Primary agent | Key concepts |
|---|---|---|
| Observe | Observer Agent | Knowledge Graph, Environments |
| Plan | RCA Pipeline Agent | Insights, Memory Engine |
| Execute | SRI Agent and Remediation Agent | Actions, Skills |
| Learn | Memory Agent | Memory Engine, Safety and Guardrails |
Common questions
How is OPEL different from OODA or PDCA?
OPEL borrows the spirit of continuous improvement loops but is built for machine speed. OODA (observe, orient, decide, act) targets human decision-making; PDCA (plan, do, check, act) targets process improvement. OPEL runs every few minutes against live infrastructure, with the Knowledge Graph and Memory Engine as durable state between passes.
Can the loop run without human approval?
Not by default. The Execute phase always passes through Guardian policies, which are explicit about what is approvable and by whom. You can loosen policies on lab environments and tighten them in production.
How fast is one pass?
Observation is continuous. A Plan pass completes in seconds to a few minutes depending on how wide the causal search is. Execute latency is bounded by human approval time. Learn is instant once the outcome is confirmed.
What happens if two incidents overlap?
Each incident has its own loop instance. The Knowledge Graph is shared, so an RCA in one loop can reference signals another loop is investigating. Guardian serialises mutating actions that would conflict.
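One way to picture the serialisation of conflicting actions is a lock per target resource: two loops that want to mutate the same target run one after the other, while actions on different targets are free to proceed in parallel. This is a sketch under that assumption only. `ActionSerializer`, the target string, and the incident IDs are hypothetical, and the source does not describe how Guardian detects or orders conflicts.

```python
import threading
import time
from collections import defaultdict


class ActionSerializer:
    """Hypothetical sketch: one lock per target, so conflicting actions queue."""

    def __init__(self):
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, target):
        with self._guard:
            return self._locks[target]

    def run(self, target, action):
        """Run `action` while holding the lock for `target`."""
        with self._lock_for(target):
            return action()


log = []


def make_action(incident_id):
    def action():
        log.append((incident_id, "start"))
        time.sleep(0.01)  # simulate the time it takes to apply a change
        log.append((incident_id, "end"))
    return action


serializer = ActionSerializer()
threads = [
    threading.Thread(
        target=serializer.run,
        args=("deployment/payments-api", make_action(incident_id)),
    )
    for incident_id in ("inc-1", "inc-2")
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(log)
```

Because both actions target the same resource, each start entry in the log is immediately followed by its own end entry, whichever incident happens to acquire the lock first.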