
RubixKube in Action: Multi-Failure Detection

See RubixKube handle multiple simultaneous failures - just like real production incidents. This tutorial demonstrates the power of the Agent Mesh and Memory Engine working together.
**Real Scenario:** We created 3 different pod failures at once. RubixKube detected all of them, prioritized them by severity, and provided specific guidance for each.

## The Production-Like Scenario

In our demo, we deployed 3 pods that would fail in different ways:

| Failure | Pod | Cause |
|---|---|---|
| ImagePullBackOff | `broken-image-demo` | Invalid container registry; the image doesn't exist |
| OOMKilled | `memory-hog-demo` | Memory limit is 50Mi but the app needs 100Mi, so it gets killed repeatedly |
| CrashLoop | `crash-loop-demo` | Application exits with code 1; it crashes on startup |
This simulates a real incident where multiple things go wrong at once (bad deployment, resource misconfiguration, application bug).
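
If you want to reproduce a similar scenario yourself, a minimal sketch of the three pods might look like the following. The pod names match the demo above, but the images, commands, and namespace handling are illustrative assumptions, not the exact manifests used in the demo:

```bash
# Hypothetical manifests for the three failure modes (illustrative only)
kubectl create namespace rubixkube-tutorials

cat <<'EOF' | kubectl apply -n rubixkube-tutorials -f -
apiVersion: v1
kind: Pod
metadata:
  name: broken-image-demo
spec:
  containers:
    - name: app
      image: registry.invalid/nonexistent/app:v1   # registry/image does not exist -> ImagePullBackOff
---
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog-demo
spec:
  containers:
    - name: app
      image: polinux/stress
      command: ["stress", "--vm", "1", "--vm-bytes", "100M", "--vm-hang", "0"]  # tries to hold ~100Mi
      resources:
        limits:
          memory: "50Mi"    # limit below what the process needs -> OOMKilled, restart, repeat
---
apiVersion: v1
kind: Pod
metadata:
  name: crash-loop-demo
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "echo 'simulated startup failure'; exit 1"]  # exit code 1 -> CrashLoopBackOff
EOF
```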

## RubixKube's Response (Real Data)

Within **2 minutes**, RubixKube detected all three failures:
*Dashboard showing 3 active insights*

### What Happened

**Dashboard Metrics Changed:**
- **Active Insights:** 0 → **3** (three separate issues detected)
- **Notifications:** 0 → **20+** (detailed event stream)
- **System Health:** 100% (overall cluster still healthy; these are isolated failures)
- **Agents:** 3/3 Active (all AI agents working)

**Activity Feed Populated:**

```
Container experiencing repeated crashes in crash-loop-demo
Severity: medium | 2 minutes ago

Container experiencing repeated crashes in memory-hog-demo
Severity: medium | 2 minutes ago

Out of memory (OOMKilled) detected on Pod/memory-hog-demo
Severity: HIGH | 2 minutes ago
```

<Note>
**Intelligent Prioritization:** RubixKube marked the OOMKilled incident as **HIGH** severity, while the crash loops were marked **medium** because they're isolated pod failures.
</Note>

---

## Detailed Incident Analysis

Navigate to **Insights**  to see comprehensive analysis:

<Frame>
  <img
    style={{ borderRadius: '0.5rem' }}
    src="/images/tutorials/tutorial-03-insights-with-detections.png"
    alt="Insights showing all 3 incidents with analysis"
  />
</Frame>

### Incident #1: CrashLoop in crash-loop-demo

**Details:**
- **Status:** CrashLoopBackOff
- **Restart Count:** 3 (and increasing)
- **Severity:** Medium
- **RCA Status:** IN_PROGRESS (60% analyzed)

**AI Suggestions** (example triage commands below):
- Check container logs for error messages
- Verify application configuration
- Consider increasing resource limits
- Check for external dependencies

**Source Events:**

```
CrashLoopBackOff: container app in pod crash-loop-demo restarted 3 times
Namespace: rubixkube-tutorials
```
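
To act on those suggestions manually, the usual kubectl triage commands apply. These are standard kubectl flags shown as a generic example, not RubixKube-specific output:

```bash
# Standard triage for a CrashLoopBackOff pod
kubectl -n rubixkube-tutorials logs crash-loop-demo --previous   # logs from the last crashed container
kubectl -n rubixkube-tutorials describe pod crash-loop-demo      # events, restart count, last exit code
kubectl -n rubixkube-tutorials get pod crash-loop-demo -o yaml   # full spec: env, config, resource limits
```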

### Incident #2: OOMKilled in memory-hog-demo

**Details:**
- **Status:** Out of Memory (OOMKilled)
- **Restart Count:** 3
- **Severity:** HIGH
- **Memory Limit:** 50Mi (too low)

**Root Cause:**
- Container attempted to allocate ~100Mi
- Memory limit was set to 50Mi
- Kubernetes killed the container to protect the node

**Suggested Fix:**
- Increase the memory limit to 150Mi (a 50% buffer over observed usage; see the sketch below)
- Monitor for memory leaks
- Consider an HPA for auto-scaling
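
Applied to this demo pod, the first suggestion could look like the sketch below. Since resource limits on a bare Pod normally cannot be edited in place, the sketch deletes and recreates it; the image and command are the same illustrative assumptions as in the scenario sketch earlier:

```bash
# Sketch: recreate memory-hog-demo with a 150Mi limit (~50% headroom over ~100Mi usage)
kubectl -n rubixkube-tutorials delete pod memory-hog-demo

cat <<'EOF' | kubectl apply -n rubixkube-tutorials -f -
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog-demo
spec:
  containers:
    - name: app
      image: polinux/stress        # illustrative image, as above
      command: ["stress", "--vm", "1", "--vm-bytes", "100M", "--vm-hang", "0"]
      resources:
        limits:
          memory: "150Mi"          # raised from 50Mi; buffer over the ~100Mi the process allocates
EOF
```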

### Incident #3: CrashLoop in memory-hog-demo

**Related Incident:**
- This crash loop is **caused by** the OOMKilled error
- RubixKube shows the correlation
- Fixing the memory issue resolves both

This is intelligent analysis: RubixKube doesn't just list errors, it understands the relationships between incidents.

## Agent Mesh Collaboration

Behind the scenes, multiple agents worked together:
1. **Observer Agent**: Detected all 3 pod failures within 30 seconds of each occurring, then sent the events to the RCA Pipeline Agent for analysis.
2. **RCA Pipeline Agent**: Analyzed each incident: gathered logs, events, and pod specs; identified root causes; calculated severity; generated suggestions.
3. **Memory Agent**: Recalled similar past incidents (if any existed) and stored these new incidents for future pattern matching.
4. **SRI Agent**: Made the insights available via the Dashboard Activity Feed, the Insights page with details, and the Chat interface for queries.

**Total analysis time:** 60-90 seconds for all 3 incidents.


## Key Observations

### 1. Parallel Detection

All 3 failures were detected within seconds of each other:
| Incident | Detection Time | Severity | RCA Progress |
|---|---|---|---|
| crash-loop-demo | 01:58:00 AM | Medium | 60% |
| memory-hog-demo (crash) | 01:58:05 AM | Medium | 60% |
| memory-hog-demo (OOM) | 01:58:10 AM | HIGH | 80% |
**Takeaway:** RubixKube scales; it handles multiple incidents without degradation.

### 2. Severity Prioritization

RubixKube ranked the issues appropriately:
- **HIGH:** OOMKilled (resource exhaustion, risk of node impact)
- **Medium:** CrashLoops (isolated pod failures)

**In production:** focus on HIGH severity first.

### 3. Event Correlation

RubixKube connected related incidents:
- It recognized that memory-hog-demo's crash loop is **caused by** the OOMKilled error
- It suggested that fixing the root cause (memory) would resolve both

This prevents wasted effort fixing symptoms instead of causes.

## Real-World Parallels

This scenario mirrors actual production incidents:
**Real example: bad deployment**
- A new version is deployed with a typo in the image tag (ImagePullBackOff)
- The same version has a memory leak (OOMKilled after 10 minutes)
- The database connection pool is exhausted (CrashLoop)

**RubixKube would:**
- Detect all 3 issues
- Prioritize by severity
- Show which are related
- Suggest a rollback to the previous version

**Real example: quota exhaustion**
- HPA scaled a deployment to 10 replicas
- The namespace quota only allows 8
- 2 pods are stuck Pending
- The remaining 8 pods are OOMKilling due to the increased load

**RubixKube would:**
- Detect the quota limit
- Identify the OOM root cause
- Suggest a quota increase OR a replica reduction
- Show which pods are affected

**Real example: cascading failure**
- A database pod is OOMKilled
- API pods start failing (they can't connect to the DB)
- The frontend shows 500 errors
- The load balancer marks all backends unhealthy

**RubixKube would:**
- Trace the cascade from DB → API → Frontend
- Identify DB memory as the root cause
- Show the dependency graph
- Suggest fixing the DB first (not the symptoms)

## Cleanup

Remove the demo pods:

```bash
kubectl delete namespace rubixkube-tutorials
```
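
Namespace deletion is asynchronous and can take a minute while the pods terminate. If you want to confirm it finished, something like this works:

```bash
# Reports "NotFound" once the namespace has finished terminating
kubectl get namespace rubixkube-tutorials
# And the demo pods should be gone
kubectl get pods -n rubixkube-tutorials
```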
**What happens in RubixKube:**
- Incidents are marked as "Resolved"
- The Activity Feed shows the cleanup
- Active Insights returns to 0
- Memory retains the learnings for future incidents

## What You Learned

**Multi-Incident Handling**: RubixKube detects and analyzes multiple failures simultaneously.

**Intelligent Prioritization**: Severity ranking helps you focus on critical issues first.

**Event Correlation**: Related incidents are connected, so you fix root causes, not symptoms.

**Comprehensive Analysis**: Each incident gets a detailed RCA with suggestions.

## Next Steps

**Query Your Infrastructure**: Learn to use the Chat interface for natural language queries.

**Understand the Memory Engine**: See how RubixKube learns from these incidents.

**Explore Agent Mesh**: Learn how the agents collaborated to analyze these failures.

**Installation Guide**: Install RubixKube on your own cluster to try this yourself.