Prerequisites
An environment connected
Kubernetes, AWS, GCP, or a VM. Any of them work for this tutorial.
Something worth remediating
A real incident, or a deliberate break (e.g. a misconfigured Deployment on a test cluster).
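One way to stage a deliberate break is sketched below. It is only an illustration, not a RubixKube feature: the Deployment name `remediation-demo`, the `busybox:1.36` image, and the failing command are all assumptions chosen for the example. The script prints a Deployment manifest whose container exits non-zero immediately, which typically puts its pod into CrashLoopBackOff.

```python
import json


def broken_deployment(name: str = "remediation-demo", namespace: str = "default") -> dict:
    """Build a Deployment manifest whose container exits right away.

    The name, namespace, and image are hypothetical placeholders for a test cluster.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicas": 1,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "app",
                            "image": "busybox:1.36",
                            # Deliberate fault: the process exits with a non-zero code,
                            # so the kubelet keeps restarting the container.
                            "command": ["sh", "-c", "echo 'simulated failure'; exit 1"],
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML manifests.
    print(json.dumps(broken_deployment(), indent=2))
```

Saved as `broken_deployment.py` (a hypothetical filename), the output can be piped to `kubectl apply -f -` on a test cluster and cleaned up afterwards with `kubectl delete deployment remediation-demo`. Run it only against a cluster you are free to break.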
The OPEL loop, applied to remediation
RubixKube follows the same rhythm for every incident.
Observe
The Observer spots the anomaly. Common examples: CrashLoopBackOff, OOMKilled, RDS connection saturation, GCE memory pressure, deployment rollback.
Plan
The RCA Pipeline Agent gathers logs, metrics, events, and recent changes across the affected resources, then builds a causal chain and produces an RCA report.
Execute
The report includes recommended actions, each with an expected blast radius. You approve; RubixKube applies.
Step 1: Catch an incident in Insights
Open Magic Insights. Any currently-open anomaly shows as a card. Pick one and click through. A healthy RubixKube setup surfaces an Insight within one or two minutes of the underlying problem starting. If nothing is live, a deliberate break helps you practice.
Step 2: Read the RCA report
When an insight has enough evidence for a full causal chain, the RCA Pipeline Agent produces a report. You will find it in RCA Reports (or linked from the insight card). Every report contains three sections:
Observed conditions
The exact state of the affected resources when the incident started. Snapshots of metrics, events, and recent changes.
Causal chain
Which signal led to which, ending at the root cause. Each link cites the evidence that justifies it, so you can verify the reasoning rather than trust it blind.
Recommended actions
Ranked fixes, each annotated with expected blast radius, estimated recovery time, and a confidence score. High-confidence actions sit at the top.
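The three sections can be pictured as a small data model. The sketch below is an assumption about shape only, not RubixKube's actual report schema: every class and field name is hypothetical. It shows the two properties described above, that each causal link carries its evidence and that actions are ordered with the highest confidence first.

```python
from dataclasses import dataclass, field


@dataclass
class CausalLink:
    cause: str
    effect: str
    evidence: str  # the signal that justifies this link


@dataclass
class RecommendedAction:
    description: str
    blast_radius: str
    est_recovery_minutes: float
    confidence: float  # assumed to be a score between 0.0 and 1.0


@dataclass
class RCAReport:
    observed_conditions: list[str]
    causal_chain: list[CausalLink]
    recommended_actions: list[RecommendedAction] = field(default_factory=list)

    def ranked_actions(self) -> list[RecommendedAction]:
        """Return actions with the highest confidence first."""
        return sorted(self.recommended_actions, key=lambda a: a.confidence, reverse=True)

    def unsupported_links(self) -> list[CausalLink]:
        """Return causal links that cite no evidence, which a reviewer should question."""
        return [link for link in self.causal_chain if not link.evidence.strip()]


if __name__ == "__main__":
    # Illustrative values only; not output from a real report.
    report = RCAReport(
        observed_conditions=["pod restarts rising", "memory at container limit"],
        causal_chain=[CausalLink("memory limit too low", "OOMKilled", "kubelet OOM event")],
        recommended_actions=[
            RecommendedAction("restart pod", "one pod", 1.0, 0.4),
            RecommendedAction("raise memory limit", "one Deployment", 3.0, 0.9),
        ],
    )
    for action in report.ranked_actions():
        print(f"{action.confidence:.1f}  {action.description}  ({action.blast_radius})")
```

Reading a report in this order (conditions, then chain, then actions) makes it easier to check each link's evidence before looking at the proposed fix.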
Step 3: Approve or adapt the action
The recommended action comes with two buttons.
- Approve. RubixKube applies the change within the Guardian policy for that environment. You get a verification notification when the change lands.
- Adapt. Open the action in Chat, adjust the parameters, then approve. Useful when the default scope is too wide or too narrow.
Step 4: Verify the fix
After the action runs, RubixKube watches the affected resources for a stabilisation window. Three outcomes are possible.
Verified
Signals return to baseline. Insight closes. RCA report is marked resolved.
Partial
The primary signal improved but related resources are still off. A follow-up insight opens automatically.
Rollback suggested
Signals worsened. The RCA is re-run with the new state and a revised recommendation appears.
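The three outcomes can be read as a simple decision rule. The sketch below is a deliberately simplified assumption, not RubixKube's verification logic: it treats the primary signal as a single lower-is-better number (for example, restarts per hour), and the function name, parameters, and 5% tolerance are all hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    VERIFIED = "Verified"
    PARTIAL = "Partial"
    ROLLBACK_SUGGESTED = "Rollback suggested"


def classify_outcome(
    primary_before: float,
    primary_after: float,
    baseline: float,
    related_ok: bool,
    tolerance: float = 0.05,
) -> Outcome:
    """Classify the result of a stabilisation window for a lower-is-better signal.

    Assumption: anything that did not worsen but is not fully back to baseline,
    or that still has related resources off, counts as Partial.
    """
    if primary_after > primary_before:
        # The signal got worse after the action was applied.
        return Outcome.ROLLBACK_SUGGESTED
    back_to_baseline = primary_after <= baseline * (1 + tolerance)
    if back_to_baseline and related_ok:
        return Outcome.VERIFIED
    return Outcome.PARTIAL


if __name__ == "__main__":
    print(classify_outcome(40, 2, 2, related_ok=True).value)
    print(classify_outcome(40, 2, 2, related_ok=False).value)
    print(classify_outcome(40, 55, 2, related_ok=True).value)
```

A real check would compare several signals over time rather than two point values; the point here is only how the three labels relate to each other.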
Step 5: Feed the outcome back
The Memory Engine learns from every resolution, but a short note from the operator sharpens future runs. Add it as a comment on the closed RCA report.
What good remediation looks like
RCAs have evidence
Every causal link cites the signal it is based on. You can verify a claim in under a minute.
Actions stay scoped
Guardian rejects anything beyond the intended blast radius. No surprise cascades.
Resolutions compound
Repeat incidents surface the original fix as context. MTTU drops with tenure.
Rollbacks are one click
Every applied action is reversible. Confidence to approve stays high.
Common questions
Can RubixKube apply a fix without my approval?
No. Every change in your environment needs an explicit approval. Guardian policies can narrow what is approvable per environment or per team, but nothing is fully autonomous by default.
What if the recommended action is wrong?
Reject it. The RCA is re-scored using your rejection as signal, and a revised recommendation appears. The rejection also feeds the Memory Engine, so the system learns what your team does not want.
How do I see who approved which action?
Every applied action has an audit entry: actor, timestamp, scope, outcome. Find it on the RCA report or in the Action Center.
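To make the four audit fields concrete, here is a minimal sketch of how such entries could be modelled and filtered once you have them in hand. It assumes nothing about how RubixKube stores or exports audit data; the `AuditEntry` class, the `actions_by` helper, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    actor: str
    timestamp: datetime
    scope: str
    outcome: str


def actions_by(entries: list[AuditEntry], actor: str) -> list[AuditEntry]:
    """Return one actor's entries, oldest first."""
    return sorted(
        (e for e in entries if e.actor == actor),
        key=lambda e: e.timestamp,
    )


if __name__ == "__main__":
    # Illustrative entries only; not real audit data.
    log = [
        AuditEntry("ana", datetime(2024, 5, 2, 9, 30, tzinfo=timezone.utc), "deployment/checkout", "Verified"),
        AuditEntry("ben", datetime(2024, 5, 1, 14, 0, tzinfo=timezone.utc), "deployment/search", "Partial"),
        AuditEntry("ana", datetime(2024, 5, 1, 8, 15, tzinfo=timezone.utc), "deployment/cart", "Verified"),
    ]
    for entry in actions_by(log, "ana"):
        print(entry.timestamp.isoformat(), entry.scope, entry.outcome)
```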
Can I set different guardrails per environment?
Yes. Guardian policies attach per environment. Production can require two approvals, staging can require one, lab can require none. See Safety and Guardrails.
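The per-environment example above (two approvals for production, one for staging, none for lab) can be expressed as a small lookup. This is a sketch of the idea only, not Guardian's policy format: the `REQUIRED_APPROVALS` table and `can_apply` function are hypothetical names, and the choice to reject unknown environments is an assumption.

```python
# Approval counts taken from the example in the text; the structure is hypothetical.
REQUIRED_APPROVALS = {"production": 2, "staging": 1, "lab": 0}


def can_apply(environment: str, approvers: list[str]) -> bool:
    """Return True when enough distinct people have approved for this environment.

    Unknown environments raise an error rather than defaulting to zero approvals,
    so a typo cannot silently bypass the guardrail.
    """
    if environment not in REQUIRED_APPROVALS:
        raise ValueError(f"no approval policy defined for environment {environment!r}")
    # Count each approver once, so one person cannot approve twice.
    distinct_approvers = set(approvers)
    return len(distinct_approvers) >= REQUIRED_APPROVALS[environment]


if __name__ == "__main__":
    print(can_apply("production", ["ana"]))
    print(can_apply("production", ["ana", "ben"]))
    print(can_apply("lab", []))
```

Counting distinct approvers and failing closed on unknown environments are design choices made for this sketch; check Safety and Guardrails for how Guardian actually evaluates policies.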
Related guides
How to Monitor Infrastructure Health
The daily rhythm that feeds this remediation flow.
Safety and Guardrails
Exactly what RubixKube will and will not do on its own.