This tutorial walks through the remediation flow end to end: detection, RCA, recommended actions, approvals, and verification. It takes about twenty minutes in a live environment.

Prerequisites

A connected environment

Kubernetes, AWS, GCP, or a VM. Any of them work for this tutorial.

Something worth remediating

A real incident, or a deliberate break (e.g. a misconfigured Deployment on a test cluster).

The OPEL loop, applied to remediation

RubixKube follows the same rhythm for every incident.
1. Observe

The Observer spots the anomaly. Common examples: CrashLoopBackOff, OOMKilled, RDS connection saturation, GCE memory pressure, deployment rollback.

2. Plan

The RCA Pipeline Agent gathers logs, metrics, events, and recent changes across the affected resources, builds a causal chain, and produces an RCA report.

3. Execute

The report includes recommended actions, each with its expected blast radius. You approve, RubixKube applies.

4. Learn

The Memory Engine stores the resolution. Next time a similar pattern surfaces, the system recognises it faster.

Step 1: Catch an incident in Insights

Open Magic Insights. Any currently-open anomaly shows as a card. Pick one and click through. A healthy RubixKube setup surfaces an Insight within one or two minutes of the underlying problem starting. If nothing is live, a deliberate break helps you practice:
kubectl run brokenpod --image=nginx:doesnotexist -n default
Wait a minute or two, then refresh Magic Insights.
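
Before refreshing, it can help to confirm that the pod is failing the way you expect. The sketch below parses the JSON printed by `kubectl get pod brokenpod -n default -o json` and pulls out each container's waiting reason. For a nonexistent image tag that reason is typically `ErrImagePull` or `ImagePullBackOff`. The function name and the hand-written sample payload are illustrative and not part of RubixKube.

```python
import json


def waiting_reasons(pod_json: str) -> dict[str, str]:
    """Map container name -> waiting reason for containers that are not running."""
    pod = json.loads(pod_json)
    reasons = {}
    # containerStatuses can be absent while the pod is still being scheduled.
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting")
        if waiting:
            reasons[status["name"]] = waiting.get("reason", "Unknown")
    return reasons


# Trimmed, hand-written sample shaped like the Pod object for the broken pod.
SAMPLE = json.dumps({
    "metadata": {"name": "brokenpod", "namespace": "default"},
    "status": {
        "phase": "Pending",
        "containerStatuses": [
            {"name": "brokenpod",
             "state": {"waiting": {"reason": "ImagePullBackOff"}}}
        ],
    },
})

print(waiting_reasons(SAMPLE))
```

When you are done practising, `kubectl delete pod brokenpod -n default` removes the test pod.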

Step 2: Read the RCA report

When an insight has enough evidence for a full causal chain, the RCA Pipeline Agent produces a report. You will find it in RCA Reports (or linked from the insight card). Every report contains three sections:
  • The exact state of the affected resources when the incident started: snapshots of metrics, events, and recent changes.
  • The causal chain: which signal led to which, ending at the root cause. Each link cites the evidence that justifies it, so you can verify the reasoning rather than trust it blind.
  • The recommended actions, each with its expected blast radius.
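
One way to picture an evidence-backed causal chain is as a list of cause-and-effect links, each carrying the evidence it rests on. The sketch below is a hypothetical model for illustration only: the `Link` type, the helper names, and the evidence strings are assumptions, not RubixKube's report format.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    cause: str
    effect: str
    evidence: list[str] = field(default_factory=list)  # references to events, logs, metrics


def unsupported_links(chain: list[Link]) -> list[Link]:
    """Links that cite no evidence and so cannot be checked by a reader."""
    return [link for link in chain if not link.evidence]


def root_cause(chain: list[Link]) -> str:
    """The root cause is the one cause that never appears as an effect."""
    effects = {link.effect for link in chain}
    roots = [link.cause for link in chain if link.cause not in effects]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root cause, found {len(roots)}")
    return roots[0]


# Illustrative chain for the deliberate break from Step 1 (evidence text is made up).
CHAIN = [
    Link("image tag not found in registry", "image pull fails",
         ["event: failed to pull image nginx:doesnotexist"]),
    Link("image pull fails", "pod stays Pending",
         ["event: back-off pulling image"]),
]

print(root_cause(CHAIN))
print(unsupported_links(CHAIN))
```

A link with an empty evidence list is the kind of claim you would want to question before approving anything.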

Step 3: Approve or adapt the action

The recommended action comes with two buttons.
  • Approve. RubixKube applies the change within the Guardian policy for that environment. You get a verification notification when the change lands.
  • Adapt. Open the action in Chat, adjust the parameters, then approve. Useful when the default scope is too wide or too narrow.
Nothing changes in your environment without an explicit human approval. Guardian policies make destructive operations (scale-to-zero, delete, force-restart) require a second pair of eyes by default.
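
The default approval rule described above can be sketched as a small gate. This is a simplified model under stated assumptions, not Guardian's actual policy engine: the `Action` type and function names are hypothetical, and the rule "one approval normally, two for destructive operations" is one reading of "a second pair of eyes".

```python
from dataclasses import dataclass

# Operations the tutorial lists as destructive.
DESTRUCTIVE = {"scale-to-zero", "delete", "force-restart"}


@dataclass(frozen=True)
class Action:
    operation: str
    approvers: frozenset[str]  # distinct humans who approved


def required_approvals(operation: str) -> int:
    """Assumed default: one approval, or two for destructive operations."""
    return 2 if operation in DESTRUCTIVE else 1


def may_apply(action: Action) -> bool:
    # approvers is a set, so the same person approving twice counts once.
    return len(action.approvers) >= required_approvals(action.operation)


print(may_apply(Action("raise-memory-limit", frozenset({"alice"}))))
print(may_apply(Action("delete", frozenset({"alice"}))))
print(may_apply(Action("delete", frozenset({"alice", "bob"}))))
```

Modelling approvers as a set is a deliberate choice here: it makes "second pair of eyes" mean a second person rather than a second click.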

Step 4: Verify the fix

After the action runs, RubixKube watches the affected resources for a stabilisation window. Three outcomes are possible.

Verified

Signals return to baseline. Insight closes. RCA report is marked resolved.

Partial

The primary signal improved but related resources are still off. A follow-up insight opens automatically.

Rollback suggested

Signals worsened. The RCA is re-run with the new state and a revised recommendation appears.
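
The three outcomes can be read as a small decision rule over the primary signal and its related resources. The sketch below is an illustration under assumptions: the signal is "higher is worse" (an error rate, for example), the 10% tolerance around baseline is invented, and the function is not how RubixKube evaluates the stabilisation window.

```python
from enum import Enum


class Outcome(Enum):
    VERIFIED = "Verified"
    PARTIAL = "Partial"
    ROLLBACK_SUGGESTED = "Rollback suggested"


def classify(before: float, after: float, baseline: float,
             related_ok: bool, tolerance: float = 0.10) -> Outcome:
    """Classify a fix from a 'higher is worse' primary signal.

    before/after: signal value before the action and at the end of the window.
    related_ok: whether related resources are back at their own baselines.
    """
    if after > before:
        # The signal got worse after the action.
        return Outcome.ROLLBACK_SUGGESTED
    primary_ok = after <= baseline * (1 + tolerance)
    if primary_ok and related_ok:
        return Outcome.VERIFIED
    # Improved but not fully settled. An unchanged signal also lands here;
    # the tutorial does not define that case, so this is a simplification.
    return Outcome.PARTIAL


print(classify(before=0.20, after=0.01, baseline=0.01, related_ok=True).value)
print(classify(before=0.20, after=0.01, baseline=0.01, related_ok=False).value)
print(classify(before=0.20, after=0.35, baseline=0.01, related_ok=True).value)
```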

Step 5: Feed the outcome back

The Memory Engine learns from every resolution, but a short note from the operator sharpens future runs. On the closed RCA report, add a comment like:
Scaled the memory limit from 512Mi to 1Gi. Holds through peak traffic.
Future similar incidents will surface this note with their recommendations.

What good remediation looks like

RCAs have evidence

Every causal link cites the signal it is based on. You can verify a claim in under a minute.

Actions stay scoped

Guardian rejects anything beyond the intended blast radius. No surprise cascades.

Resolutions compound

Repeat incidents surface the original fix as context. MTTU drops with tenure.

Rollbacks are one click

Every applied action is reversible. Confidence to approve stays high.

Common questions

  • RubixKube does not apply changes on its own. Every change in your environment needs an explicit approval. Guardian policies can narrow what is approvable per environment or per team, but nothing is fully autonomous by default.
  • Every applied action has an audit entry: actor, timestamp, scope, outcome. Find it on the RCA report or in the Action Center.
  • Approval requirements can differ by environment. Guardian policies attach per environment: production can require two approvals, staging can require one, lab can require none. See Safety and Guardrails.
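
To make the per-environment approvals and the audit entry concrete, here is a minimal sketch. The four `AuditEntry` fields come from the text above, as do the approval counts. Everything else is assumed for illustration: the `decide` function, the outcome strings, and the fallback to the strictest policy for unknown environments are not RubixKube's API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Approval counts from the example above; real policies live in Guardian.
REQUIRED_APPROVALS = {"production": 2, "staging": 1, "lab": 0}


@dataclass(frozen=True)
class AuditEntry:
    actor: str
    timestamp: datetime
    scope: str
    outcome: str


def decide(environment: str, actor: str, approvers: set[str], scope: str,
           now: Optional[datetime] = None) -> AuditEntry:
    """Record whether an action would be applied under the environment's policy."""
    # Assumption: an unknown environment falls back to the strictest policy.
    needed = REQUIRED_APPROVALS.get(environment, max(REQUIRED_APPROVALS.values()))
    if len(approvers) >= needed:
        outcome = "applied"
    else:
        outcome = f"blocked: needs {needed} approvals"
    return AuditEntry(actor, now or datetime.now(timezone.utc), scope, outcome)


entry = decide("production", "alice", {"bob"}, "deployment/checkout")
print(entry.outcome)
```

Writing an entry for blocked attempts as well as applied ones is a design choice of this sketch, so the trail shows what was tried and not only what landed.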

How to Monitor Infrastructure Health

The daily rhythm that feeds this remediation flow.

Safety and Guardrails

Exactly what RubixKube will and will not do on its own.