RubixKube in Action: Multi-Failure Detection
See RubixKube handle multiple simultaneous failures - just like real production incidents. This tutorial demonstrates the power of the Agent Mesh and Memory Engine working together.

**Real Scenario:** We created 3 different pod failures at once. RubixKube detected all of them, prioritized by severity, and provided specific guidance for each.
The Production-Like Scenario
In our demo, we deployed 3 pods that would fail in different ways (a reproduction sketch follows the table):

| Failure Mode | Pod | What Happens |
|---|---|---|
| ImagePullBackOff | broken-image-demo | Invalid container registry - the image doesn’t exist |
| OOMKilled | memory-hog-demo | Memory limit is 50Mi but the workload needs 100Mi - the container is killed repeatedly |
| CrashLoop | crash-loop-demo | Application exits with code 1 - it crashes on startup |
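If you want to reproduce a similar setup yourself, the commands below create three pods that fail in the same ways. This is a minimal sketch: the pod names match the demo, but the images and exact manifests are illustrative assumptions, not the ones used to produce the data shown here.

```bash
# Illustrative reproduction of the three failure modes (images and specs are assumptions).

# 1. ImagePullBackOff - reference an image that does not exist
kubectl run broken-image-demo --image=registry.invalid/does-not-exist:v1

# 2. OOMKilled - the workload tries to allocate ~100Mi under a 50Mi limit
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog-demo
spec:
  containers:
  - name: memory-hog
    image: polinux/stress            # assumed stress image
    command: ["stress", "--vm", "1", "--vm-bytes", "100M", "--vm-hang", "1"]
    resources:
      limits:
        memory: "50Mi"               # too low for the ~100Mi the process needs
EOF

# 3. CrashLoopBackOff - the container exits with code 1 on startup
kubectl run crash-loop-demo --image=busybox --command -- sh -c "exit 1"
```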
RubixKube’s Response (Real Data)

Within **2 minutes**, RubixKube detected all three failures:
What Happened
Dashboard Metrics Changed:
- **Active Insights:** 0 → 3 (three separate issues detected)
- **Notifications:** 0 → **20+** (detailed event stream)
- **System Health:** 100% (overall cluster still healthy - isolated failures)
- **Agents:** 3/3 Active (all AI agents working)
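You can cross-check the dashboard numbers from the cluster side with plain kubectl. The output below is illustrative of what you would typically see a couple of minutes in; exact STATUS values and restart counts vary (memory-hog-demo alternates between OOMKilled and CrashLoopBackOff).

```bash
kubectl get pods
# Illustrative output:
# NAME                READY   STATUS             RESTARTS   AGE
# broken-image-demo   0/1     ImagePullBackOff   0          2m
# crash-loop-demo     0/1     CrashLoopBackOff   3          2m
# memory-hog-demo     0/1     OOMKilled          3          2m
```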
From the Activity Feed: “Out of memory (OOMKilled) detected on Pod/memory-hog-demo” - Severity: HIGH | 2 minutes ago
Incident #2: OOMKilled in memory-hog-demo
Details:
- Reason: Out of Memory (OOMKilled)
- Restart Count: 3
- Severity: HIGH
- Memory Limit: 50Mi (too low)

Root cause:
- Memory limit set to 50Mi
- Kubernetes killed the container to protect the node

Suggestions (a remediation sketch follows below):
- Monitor for memory leaks
- Consider HPA for auto-scaling
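As a concrete follow-up to those suggestions, here is a minimal remediation sketch. It reuses the assumed pod definition from earlier (image and command are assumptions); because container resources on a bare Pod are immutable, the pod is recreated with a higher limit. The HPA command only applies if the workload is run as a Deployment.

```bash
# Recreate the pod with a memory limit that fits the ~100Mi working set (illustrative values).
kubectl delete pod memory-hog-demo
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog-demo
spec:
  containers:
  - name: memory-hog
    image: polinux/stress            # assumed demo image
    command: ["stress", "--vm", "1", "--vm-bytes", "100M", "--vm-hang", "1"]
    resources:
      requests:
        memory: "100Mi"
      limits:
        memory: "128Mi"              # raised from 50Mi so the workload no longer hits the limit
EOF

# If this were a Deployment, an HPA could absorb load-driven growth instead of a fixed bump:
# kubectl autoscale deployment memory-hog --cpu-percent=80 --min=1 --max=5
```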
Incident #3: CrashLoop in memory-hog-demo
Related Incident:
- This crash loop is caused by the OOMKilled incident above
- RubixKube shows the correlation
- Fixing the memory issue resolves both
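To see the same link by hand that RubixKube surfaces automatically, you can inspect the pod’s last container state: an OOMKilled termination reason on a pod that is also in CrashLoopBackOff ties the two incidents together. This is a generic kubectl check, not a RubixKube feature.

```bash
# The crash loop and the OOM kill are the same underlying problem:
# the container's last termination reason shows OOMKilled.
kubectl describe pod memory-hog-demo | grep -A 5 "Last State"
kubectl get pod memory-hog-demo \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Expected value: OOMKilled
```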
Agent Mesh Collaboration
Behind the scenes, multiple agents worked together:

1. **Observer Agent**
   - Detected all 3 pod failures within 30 seconds of each occurring
   - Sent events to the RCA Pipeline Agent for analysis
2. **RCA Pipeline Agent** - analyzed each incident:
   - Gathered logs, events, pod specs
   - Identified root causes
   - Calculated severity
   - Generated suggestions
3. **Memory Agent**
   - Recalled similar past incidents (if any existed)
   - Stored these new incidents for future pattern matching
4. **SRI Agent** - made insights available via:
   - Dashboard Activity Feed
   - Insights page with details
   - Chat interface for queries
Total analysis time: 60-90 seconds for all 3 incidents
Key Observations
1. Parallel Detection
All 3 failures were detected within seconds of one another - effectively in parallel:

| Incident | Detection Time | Severity | RCA Progress |
|---|---|---|---|
| crash-loop-demo | 01:58:00 AM | Medium | 60% |
| memory-hog-demo (crash) | 01:58:05 AM | Medium | 60% |
| memory-hog-demo (OOM) | 01:58:10 AM | HIGH | 80% |
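If you want to line these detection times up against the raw cluster signal, the Kubernetes event stream shows the same near-simultaneous timestamps. The command is generic kubectl and assumes the demo pods are in the current namespace.

```bash
# Sorted event timeline for the three demo pods
kubectl get events --sort-by=.lastTimestamp \
  | grep -E 'broken-image-demo|memory-hog-demo|crash-loop-demo'
```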
2. Severity Prioritization
RubixKube ranked issues appropriately:
- **HIGH:** OOMKilled (resource exhaustion, node impact risk)
- **Medium:** CrashLoops (isolated pod failures)
3. Event Correlation
RubixKube connected related incidents:
- Recognized memory-hog-demo’s crash loop is caused by the OOMKilled
- Suggested fixing the root cause (memory) would resolve both
Real-World Parallels
This scenario mirrors actual production incidents:

Bad Deployment Scenario
Real example:
- New version deployed with typo in image tag (ImagePullBackOff)
- Same version has memory leak (OOMKilled after 10 minutes)
- Database connection pool exhausted (CrashLoop)

RubixKube would:
- Prioritize by severity
- Show which are related
- Suggest rollback to previous version (see the sketch below)
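In that situation the concrete action is usually a rollback. A minimal sketch with standard kubectl; the deployment name is a placeholder, not part of the demo:

```bash
# Roll the affected Deployment back to its previous revision (name is hypothetical)
kubectl rollout undo deployment/my-app
kubectl rollout status deployment/my-app   # wait for the rollback to complete
```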
Resource Misconfiguration
Real example:
- HPA scaled deployment to 10 replicas
- Namespace quota only allows 8
- 2 pods stuck pending
- Remaining 8 pods OOMKilling due to increased load

RubixKube would:
- Identify the OOM root cause
- Suggest a quota increase OR replica reduction
- Show which pods are affected
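For illustration, this is roughly what that conflicting configuration looks like. All names, namespaces, and numbers here are hypothetical; only the 8-versus-10 mismatch comes from the scenario above.

```bash
# A ResourceQuota capping the namespace at 8 pods while the HPA may request 10 (illustrative).
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    pods: "8"                        # the namespace will never admit more than 8 pods
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: team-a
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10                    # the HPA may ask for more pods than the quota permits
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
EOF
```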
Cascading Failure
Real example:
- Database pod OOMKilled
- API pods start failing (can’t connect to DB)
- Frontend shows 500 errors
- Load balancer marks all backends unhealthy

RubixKube would:
- Identify DB memory as the root cause
- Show the dependency graph
- Suggest fixing the DB first (not the symptoms)
Cleanup
Remove the demo pods:
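Assuming the demo pods live in your current namespace under the names used above:

```bash
# Delete the three demo pods created for this walkthrough
kubectl delete pod broken-image-demo memory-hog-demo crash-loop-demo
```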
After the pods are deleted:
- The Activity Feed shows the cleanup
- Active Insights returns to 0
- Memory retains the learnings for future incidents
What You Learned
- **Multi-Incident Handling** - RubixKube detects and analyzes multiple failures simultaneously
- **Intelligent Prioritization** - Severity ranking helps you focus on critical issues first
- **Event Correlation** - Related incidents are connected - fix root causes, not symptoms
- **Comprehensive Analysis** - Each incident gets detailed RCA with suggestions