How to automate incident remediation with RubixKube

This tutorial covers the remediation flow end to end: detection, RCA, recommended actions, approvals, and verification. It takes about twenty minutes in a live environment.

Prerequisites

An environment connected

Kubernetes, AWS, GCP, or a VM. Any of them work for this tutorial.

Something worth remediating

A real incident, or a deliberate break (e.g. a misconfigured Deployment on a test cluster).

The OPEL loop, applied to remediation

RubixKube follows the same rhythm for every incident.

Observe

The Observer spots the anomaly. Common examples: CrashLoopBackOff, OOMKilled, RDS connection saturation, GCE memory pressure, deployment rollback.

Plan

The RCA Pipeline Agent gathers logs, metrics, events, and recent changes across the affected resources. It then builds a causal chain and produces an RCA report.

Execute

The report includes recommended actions, each with its expected blast radius. You approve, then RubixKube applies the change.

Learn

The Memory Engine stores the resolution. Next time a similar pattern surfaces, the system recognises it faster.

Step 1: Catch an incident in Insights

Open Magic Insights. Any currently-open anomaly shows as a card. Pick one and click through. A healthy RubixKube setup surfaces an Insight within one or two minutes of the underlying problem starting. If nothing is live, a deliberate break helps you practice:

kubectl run brokenpod --image=nginx:doesnotexist -n default

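The image tag does not exist, so the pod never starts and sits in ErrImagePull or ImagePullBackOff. Wait a minute or two, then refresh Magic Insights. When you have finished the tutorial, remove the pod with kubectl delete pod brokenpod -n default.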

Step 2: Read the RCA report

When an insight has enough evidence for a full causal chain, the RCA Pipeline Agent produces a report. You will find it in RCA Reports (or linked from the insight card). Every report contains three sections:

Observed conditions

The exact state of the affected resources when the incident started. Snapshots of metrics, events, and recent changes.

Causal chain

Which signal led to which, ending at the root cause. Each link cites the evidence that justifies it, so you can verify the reasoning rather than trust it blind.

Recommended actions

Ranked fixes, each annotated with expected blast radius, estimated recovery time, and a confidence score. High-confidence actions sit at the top.
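The ranking described above can be pictured with a small sketch. This is not RubixKube's data model or API: the class, field names, the 0 to 1 confidence scale, and the sample actions are all hypothetical, chosen only to mirror the three annotations the report lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecommendedAction:
    # Hypothetical fields mirroring the annotations named in the text.
    summary: str
    blast_radius: str           # free-text scope, e.g. "one Deployment"
    est_recovery_minutes: int   # estimated recovery time
    confidence: float           # assumed scale: 0.0 (low) to 1.0 (high)


def rank_actions(actions):
    """Return the actions ordered so the highest-confidence fix comes first."""
    return sorted(actions, key=lambda a: a.confidence, reverse=True)


# Made-up sample data for illustration only.
actions = [
    RecommendedAction("Restart the pod", "one pod", 2, 0.40),
    RecommendedAction("Fix the image tag on the Deployment", "one Deployment", 5, 0.90),
    RecommendedAction("Roll back the last release", "one namespace", 10, 0.65),
]

ranked = rank_actions(actions)
for action in ranked:
    print(f"{action.confidence:.2f}  {action.summary}  (blast radius: {action.blast_radius})")
```

Reading the list top down, you would weigh the first entry's blast radius and recovery time before approving it, rather than approving on confidence alone.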

Step 3: Approve or adapt the action

The recommended action comes with two buttons.

Approve. RubixKube applies the change within the Guardian policy for that environment. You get a verification notification when the change lands.
Adapt. Open the action in Chat, adjust the parameters, then approve. Useful when the default scope is too wide or too narrow.

Nothing changes in your environment without an explicit human approval. Guardian policies make destructive operations (scale-to-zero, delete, force-restart) require a second pair of eyes by default.

Step 4: Verify the fix

After the action runs, RubixKube watches the affected resources for a stabilisation window. Three outcomes are possible.

Verified

Signals return to baseline. The insight closes and the RCA report is marked resolved.

Partial

The primary signal improved but related resources are still off. A follow-up insight opens automatically.

Rollback suggested

Signals worsened. The RCA is re-run with the new state and a revised recommendation appears.
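One way to reason about the three outcomes is as a decision over the primary signal and the related resources. The sketch below is an assumption-heavy illustration, not how RubixKube decides: the function name, the single "lower is better" signal, the 10% tolerance, and the choice to treat "improved but not yet at baseline" as Partial are all hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    VERIFIED = "verified"
    PARTIAL = "partial"
    ROLLBACK_SUGGESTED = "rollback suggested"


def classify(primary_before, primary_after, baseline, related_ok, tolerance=0.10):
    """Classify a fix from one primary signal where lower values are healthier.

    primary_before / primary_after: the signal before and after the action.
    baseline: the healthy value of that signal.
    related_ok: True if every related resource is back to normal.
    tolerance: assumed slack around the baseline (hypothetical default).
    """
    # Signals worsened: the action made things worse.
    if primary_after > primary_before:
        return Outcome.ROLLBACK_SUGGESTED
    # Signals returned to baseline and nothing related is still off.
    at_baseline = primary_after <= baseline * (1 + tolerance)
    if at_baseline and related_ok:
        return Outcome.VERIFIED
    # Anything in between is treated as a partial fix (an assumption).
    return Outcome.PARTIAL


# Example: container restarts per hour, with a healthy baseline of 0.
print(classify(12, 0, 0, related_ok=True))
print(classify(12, 0, 0, related_ok=False))
print(classify(12, 20, 0, related_ok=True))
```

A real stabilisation window would look at several signals over time; the single-number comparison here only shows how the three labels relate to each other.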

Step 5: Feed the outcome back

The Memory Engine learns from every resolution, but a short note from the operator sharpens future runs. On the closed RCA report, add a comment like:

Scaled the memory limit from 512Mi to 1Gi. Holds through peak traffic.

Future similar incidents will surface this note with their recommendations.

What good remediation looks like

RCAs have evidence

Every causal link cites the signal it is based on. You can verify a claim in under a minute.

Actions stay scoped

Guardian rejects anything beyond the intended blast radius. No surprise cascades.

Resolutions compound

Repeat incidents surface the original fix as context. MTTU drops with tenure.

Rollbacks are one click

Every applied action is reversible. Confidence to approve stays high.

Common questions

Can RubixKube apply a fix without my approval?

No. Every change in your environment needs an explicit approval. Guardian policies can narrow what is approvable per environment or per team, but nothing is fully autonomous by default.

What if the recommended action is wrong?

Reject it. The RCA is re-scored using your rejection as signal, and a revised recommendation appears. The rejection also feeds the Memory Engine, so the system learns what your team does not want.

How do I see who approved which action?

Every applied action has an audit entry: actor, timestamp, scope, outcome. Find it on the RCA report or in the Action Center.

Can I set different guardrails per environment?

Yes. Guardian policies attach per environment. Production can require two approvals, staging can require one, lab can require none. See Safety and Guardrails.
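To make the per-environment idea concrete, here is a minimal sketch of an approval check. It is not Guardian's policy format or API. The approval counts come from the example above; the function names, the "patch-image" operation, and the rule that destructive operations always need at least two approvers (one reading of "a second pair of eyes") are assumptions.

```python
# Approval counts taken from the example in the text.
REQUIRED_APPROVALS = {"production": 2, "staging": 1, "lab": 0}

# Operations the tutorial calls destructive.
DESTRUCTIVE = {"scale-to-zero", "delete", "force-restart"}


def approvals_needed(environment, operation):
    """Return how many distinct approvers the operation needs."""
    base = REQUIRED_APPROVALS[environment]
    if operation in DESTRUCTIVE:
        # Assumption: destructive operations need two approvers everywhere.
        return max(base, 2)
    return base


def may_apply(environment, operation, approvers):
    """True if enough distinct people approved; duplicates count once."""
    return len(set(approvers)) >= approvals_needed(environment, operation)


# "patch-image" is a hypothetical non-destructive operation name.
print(may_apply("staging", "patch-image", ["ana"]))
print(may_apply("production", "patch-image", ["ana"]))
print(may_apply("lab", "delete", ["ana"]))
```

Counting distinct approvers rather than raw approvals keeps one person from satisfying a two-approval rule alone, which is the point of requiring a second reviewer.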

Related guides

How to Monitor Infrastructure Health

The daily rhythm that feeds this remediation flow.

Safety and Guardrails

Exactly what RubixKube will and will not do on its own.
