
# RubixKube in Action: Multi-Failure Detection

See RubixKube handle multiple simultaneous failures - just like real production incidents. This tutorial demonstrates the power of the Agent Mesh and Memory Engine working together.
**Real Scenario:** We created 3 different pod failures at once. RubixKube detected all of them, prioritized by severity, and provided specific guidance for each.

## The Production-Like Scenario

In our demo, we deployed 3 pods that would fail in different ways:

| Failure | Pod | What goes wrong |
| --- | --- | --- |
| ImagePullBackOff | broken-image-demo | Invalid container registry - the image doesn't exist |
| OOMKilled | memory-hog-demo | Memory limit is 50Mi but the app needs 100Mi - it gets killed repeatedly |
| CrashLoop | crash-loop-demo | Application exits with code 1 - crashes on startup |

This simulates a real incident where multiple things go wrong at once (bad deployment, resource misconfiguration, application bug).
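
If you want to recreate a comparable scenario in your own cluster, the sketch below shows one way to do it. The pod names and namespace match this tutorial; the container images and stress command are illustrative assumptions, not the exact manifests we used.

```bash
# Namespace used throughout this tutorial
kubectl create namespace rubixkube-tutorials

# 1. ImagePullBackOff: point at an image tag that does not exist
kubectl run broken-image-demo -n rubixkube-tutorials \
  --image=registry.example.com/does-not-exist:v0.0.0

# 2. OOMKilled: try to allocate ~100Mi inside a 50Mi memory limit
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog-demo
  namespace: rubixkube-tutorials
spec:
  containers:
    - name: app
      image: polinux/stress            # illustrative stress-test image
      command: ["stress", "--vm", "1", "--vm-bytes", "100M", "--vm-hang", "0"]
      resources:
        limits:
          memory: "50Mi"               # limit below what the app needs
EOF

# 3. CrashLoopBackOff: the container exits with code 1 on startup
kubectl run crash-loop-demo -n rubixkube-tutorials \
  --image=busybox -- sh -c "echo 'simulated startup failure'; exit 1"
```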

## RubixKube’s Response (Real Data)

Within **2 minutes**, RubixKube detected all three failures:
*Dashboard showing 3 active insights*

### What Happened

**Dashboard Metrics Changed:**
- **Active Insights:** 0 → 3 (three separate issues detected)
- **Notifications:** 0 → 20+ (detailed event stream)
- **System Health:** 100% (overall cluster still healthy - isolated failures)
- **Agents:** 3/3 Active (all AI agents working)
**Activity Feed Populated:**

```
Container experiencing repeated crashes in crash-loop-demo
Severity: medium | 2 minutes ago

Container experiencing repeated crashes in memory-hog-demo
Severity: medium | 2 minutes ago

Out of memory (OOMKilled) detected on Pod/memory-hog-demo
Severity: HIGH | 2 minutes ago
```
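
The feed above is RubixKube's distilled view; the raw signal behind it is the Kubernetes event stream, which you can compare against directly:

```bash
# Raw cluster events for the tutorial namespace, newest last
kubectl get events -n rubixkube-tutorials --sort-by=.lastTimestamp

# Only the warnings (BackOff, OOMKilling, Failed, etc.)
kubectl get events -n rubixkube-tutorials --field-selector type=Warning
```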

<Note>
**Intelligent Prioritization:** RubixKube marked the OOMKilled incident as HIGH severity, while the crash loops were kept at medium because they're isolated pod failures.
</Note>

---

## Detailed Incident Analysis

Navigate to **Insights** to see comprehensive analysis:

<Frame>
  <img
    style={{ borderRadius: '0.5rem' }}
    src="/images/tutorials/tutorial-03-insights-with-detections.png"
    alt="Insights showing all 3 incidents with analysis"
  />
</Frame>

### Incident #1: CrashLoop in crash-loop-demo

**Details:**
- **Status:** CrashLoopBackOff
- **Restart Count:** 3 (and increasing)
- **Severity:** Medium
- **RCA Status:** IN_PROGRESS (60% analyzed)

**AI Suggestions:**
- Check container logs for error messages
- Verify application configuration
- Consider increasing resource limits
- Check for external dependencies

**Source Events:**

```
CrashLoopBackOff: container app in pod crash-loop-demo restarted 3 times
Namespace: rubixkube-tutorials
```
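
To follow up on the first suggestion manually, standard kubectl commands surface the same evidence (pod and namespace names from this tutorial):

```bash
# Logs from the currently restarting container
kubectl logs crash-loop-demo -n rubixkube-tutorials

# Logs from the previous, crashed instance - usually where the real error is
kubectl logs crash-loop-demo -n rubixkube-tutorials --previous

# Restart count, last exit code, and recent events for the pod
kubectl describe pod crash-loop-demo -n rubixkube-tutorials
```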

### Incident #2: OOMKilled in memory-hog-demo

**Details:**
- **Status:** OOMKilled (Out of Memory)
- **Restart Count:** 3
- **Severity:** HIGH
- **Memory Limit:** 50Mi (too low)

**Root Cause:**
- Container attempted to allocate ~100Mi
- Memory limit set to 50Mi
- Kubernetes killed the container to protect the node

**Suggested Fix:**
- Increase the memory limit to 150Mi (50% buffer over observed usage)
- Monitor for memory leaks
- Consider HPA for auto-scaling
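
A minimal sketch of applying the suggested fix: a bare pod's resource limits can't be edited in place, so the demo pod has to be recreated (for a real Deployment you would update the pod template instead). The image and command are the same illustrative assumptions as in the setup sketch above.

```bash
# Remove the failing pod, then recreate it with the suggested 150Mi limit
kubectl delete pod memory-hog-demo -n rubixkube-tutorials

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog-demo
  namespace: rubixkube-tutorials
spec:
  containers:
    - name: app
      image: polinux/stress            # illustrative image, as above
      command: ["stress", "--vm", "1", "--vm-bytes", "100M", "--vm-hang", "0"]
      resources:
        requests:
          memory: "150Mi"
        limits:
          memory: "150Mi"              # ~50% buffer over the ~100Mi observed usage
EOF
```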

### Incident #3: CrashLoop in memory-hog-demo

**Related Incident:**
- This crash loop is caused by the OOMKilled condition
- RubixKube shows the correlation
- Fixing the memory issue resolves both
This is intelligent analysis! RubixKube doesn’t just list errors - it understands relationships between incidents.

## Agent Mesh Collaboration

Behind the scenes, multiple agents worked together:
1. **Observer Agent** - Detected all 3 pod failures within 30 seconds of each occurring, then sent events to the RCA Pipeline Agent for analysis.
2. **RCA Pipeline Agent** - Analyzed each incident:
   - Gathered logs, events, and pod specs
   - Identified root causes
   - Calculated severity
   - Generated suggestions
3. **Memory Agent** - Recalled similar past incidents (if any existed) and stored these new incidents for future pattern matching.
4. **SRI Agent** - Made insights available via:
   - The Dashboard Activity Feed
   - The Insights page with details
   - The chat interface for queries

**Total analysis time:** 60-90 seconds for all 3 incidents


## Key Observations

### 1. Parallel Detection

All 3 failures were detected in parallel, within seconds of one another:
| Incident | Detection Time | Severity | RCA Progress |
| --- | --- | --- | --- |
| crash-loop-demo | 01:58:00 AM | Medium | 60% |
| memory-hog-demo (crash) | 01:58:05 AM | Medium | 60% |
| memory-hog-demo (OOM) | 01:58:10 AM | HIGH | 80% |
**Takeaway:** RubixKube scales - it handles multiple incidents without degradation.

### 2. Severity Prioritization

RubixKube ranked the issues appropriately:
- **HIGH:** OOMKilled (resource exhaustion, risk of node impact)
- **Medium:** CrashLoops (isolated pod failures)

**In production:** Focus on HIGH severity first.

### 3. Event Correlation

RubixKube connected related incidents:
- Recognized that memory-hog-demo’s crash loop is caused by the OOMKilled condition
- Suggested that fixing the root cause (memory) would resolve both
This prevents wasted effort fixing symptoms instead of causes.

## Real-World Parallels

This scenario mirrors actual production incidents:
**Scenario 1: A bad deployment**

Real example:
- New version deployed with a typo in the image tag (ImagePullBackOff)
- Same version has a memory leak (OOMKilled after 10 minutes)
- Database connection pool exhausted (CrashLoop)

RubixKube would:
- Detect all 3 issues
- Prioritize by severity
- Show which are related
- Suggest rollback to the previous version
**Scenario 2: A scaling misconfiguration**

Real example:
- HPA scaled the deployment to 10 replicas
- Namespace quota only allows 8
- 2 pods stuck Pending
- Remaining 8 pods OOMKilling due to the increased load

RubixKube would:
- Detect the quota limit
- Identify the OOM root cause
- Suggest a quota increase OR a replica reduction
- Show which pods are affected (see the kubectl sketch below)
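
If you wanted to confirm the same signals by hand in a scenario like this, the usual kubectl checks look roughly like the following (namespace and pod names are placeholders):

```bash
# How much of the namespace quota is consumed versus its hard limit
kubectl describe resourcequota -n <your-namespace>

# Which pods are stuck Pending, and why (look for "exceeded quota" events)
kubectl get pods -n <your-namespace> --field-selector=status.phase=Pending
kubectl describe pod <pending-pod> -n <your-namespace>
```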
**Scenario 3: A cascading failure**

Real example:
- Database pod OOMKilled
- API pods start failing (can’t connect to the DB)
- Frontend shows 500 errors
- Load balancer marks all backends unhealthy

RubixKube would:
- Trace the cascade from DB → API → Frontend
- Identify DB memory as the root cause
- Show the dependency graph
- Suggest fixing the DB first (not the symptoms)

## Cleanup

Remove the demo pods:
```bash
kubectl delete namespace rubixkube-tutorials
```
**What happens in RubixKube:**
- Incidents marked as “Resolved”
- Activity Feed shows the cleanup
- Active Insights returns to 0
- Memory retains the learnings for future incidents

## What You Learned

- **Multi-Incident Handling:** RubixKube detects and analyzes multiple failures simultaneously
- **Intelligent Prioritization:** Severity ranking helps you focus on critical issues first
- **Event Correlation:** Related incidents are connected - fix root causes, not symptoms
- **Comprehensive Analysis:** Each incident gets detailed RCA with suggestions

## Next Steps