
Troubleshooting with Chat: Real Incident Investigation

Now that you know Chat basics, let’s use it for what it does best: investigating real infrastructure problems. This tutorial shows you how to troubleshoot actual failing pods using the SRI Agent.
Real scenarios ahead! We’ll investigate the 3 failing pods we deployed: broken-image-demo, memory-hog-demo, and crash-loop-demo. You’ll see actual Chat responses and learn the troubleshooting workflow.

The Troubleshooting Workflow

When something breaks, follow this Chat-powered workflow:

Identify the Problem

Ask: “What’s failing?” or “Show me errors”

Get Details

Ask: “Why did [pod-name] fail?”

Understand Root Cause

Agent analyzes logs, events, and provides RCA

Get Fix Guidance

Ask: “How do I fix it?”

Verify Resolution

Ask: “Is [pod-name] healthy now?”

Scenario 1: Investigating OOMKilled Pod

The Problem

The memory-hog-demo pod keeps crashing. Let’s use Chat to find out why.

Query: “Tell me about memory-hog-demo”

Complete OOMKilled investigation

What the Agent Did

The SRI Agent automatically:
Function Called: fetch_kubernetes_graph_snapshot
Found:
  • Pod: memory-hog-demo
  • Status: CrashLoopBackOff
  • Restarts: 12 times
  • Labels: scenario: oomkilled
Agent’s Thought: “Analyzing Pod’s Status”
Function Called: fetch_kubernetes_logs
Log Output:
dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
Agent identified: a memory consumption pattern (1 vm = one virtual-memory stress worker, i.e. high memory usage)

Root Cause (from Agent)

The memory-hog-demo pod is configured to use more memory than the limit allows, causing Kubernetes to terminate it to protect the node’s stability.

Solution (from Agent)

  1. Increase the Memory Limit: if the application genuinely needs more memory
  2. Set a Memory Request: helps Kubernetes schedule the pod appropriately

Recommendation

Set a reasonable memory limit and request to prevent cluster impact.
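As a rough sketch, that fix lives in the container's resources block. The values below are illustrative (the tutorial's recommended 150Mi limit plus an assumed 100Mi request), and the container name and image are placeholders for whatever the tutorial manifest already uses:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog-demo
  namespace: rubixkube-tutorials
spec:
  containers:
    - name: memory-hog           # placeholder: keep the container name from the original manifest
      image: polinux/stress      # placeholder: keep the image from the original manifest
      resources:
        requests:
          memory: "100Mi"        # illustrative request so the scheduler has a realistic baseline
        limits:
          memory: "150Mi"        # raised from the 50Mi limit that was triggering OOMKilled
```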
All this from ONE query! The agent:
  • Fetched status (1 function call)
  • Retrieved logs (1 function call)
  • Analyzed the pattern
  • Provided root cause
  • Suggested fixes
Total time: 28 seconds

Scenario 2: ImagePullBackOff Investigation

The Problem

The broken-image-demo pod won’t start.

Typical Questions to Ask

"What's wrong with broken-image-demo?"
"Why can't it pull the image?"
"Show me the events for broken-image-demo"

Expected Agent Response

The agent will:
  1. Check pod status → Finds ImagePullBackOff
  2. Retrieve events → Sees failed pull attempts
  3. Identify issue → Non-existent registry
  4. Suggest fixes:
    • Verify image name and tag
    • Check registry accessibility
    • Ensure image exists
    • Review imagePullSecrets
For ImagePullBackOff: The agent can quickly identify if it’s a typo, auth issue, or network problem by analyzing the error messages.
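If you want to cross-check the agent's findings by hand, two standard kubectl commands surface the same evidence (pod and namespace names are from this tutorial; adjust for your cluster):

```bash
# Pull errors show up in the Events section at the bottom of the output
kubectl describe pod broken-image-demo -n rubixkube-tutorials

# Print the exact image reference the pod is trying to pull (first container)
kubectl get pod broken-image-demo -n rubixkube-tutorials \
  -o jsonpath='{.spec.containers[0].image}{"\n"}'
```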

Scenario 3: CrashLoopBackOff Analysis

The Problem

The crash-loop-demo pod starts, then immediately crashes.

Query: “Why is crash-loop-demo crashing?”

Expected workflow:
  1. Agent fetches pod status → finds CrashLoopBackOff with a high restart count
  2. Agent retrieves logs → looks for error messages in container logs
  3. Agent checks the exit code → exit code 1 means an application error (not OOMKilled or a signal); you can verify this yourself with the sketch after this list
  4. Agent provides guidance:
    • Check application logs for errors
    • Verify configuration
    • Check for missing dependencies
    • Review the startup command
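As referenced in step 3, the exit code and termination reason are recorded on the pod itself, so you can verify the agent's reading with plain kubectl (assumes a single-container pod; adjust the index otherwise):

```bash
# Exit code and reason from the container's most recent termination
kubectl get pod crash-loop-demo -n rubixkube-tutorials \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
kubectl get pod crash-loop-demo -n rubixkube-tutorials \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
```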

Common CrashLoop Causes

The agent can identify:
| Cause | How Agent Detects | Suggested Fix |
| --- | --- | --- |
| Application bug | Exit code 1, errors in logs | Debug application code |
| Missing config | Exit code, “config not found” in logs | Add ConfigMap or Secret |
| Dependency unavailable | Exit code, connection errors | Check service dependencies |
| Insufficient resources | Exit code 137 (OOMKilled) | Increase limits |
| Wrong command | Exit code 127, “command not found” | Fix container command |

Multi-Pod Investigation

Query: “Show me all failing pods in rubixkube-tutorials”

This query attempts to get an overview of ALL problems at once.
Query for all failing pods
Agent’s approach:
  1. Thought: Querying Kubernetes Pods → the agent plans to fetch all pods in the namespace
  2. Function Call: fetch_kubernetes_graph_snapshot → queries the Kubernetes API for pod data
  3. Thought: Identifying Failed Pods → filters for non-Running/non-healthy statuses
  4. Response → lists all failing pods with their statuses
Agent Response: The agent successfully identified all 3 failing pods with details:
  1. broken-image-demo - Status: Pending, Reason: ImagePullBackOff
  2. crash-loop-demo - Status: CrashLoopBackOff, Restarts: 142
  3. memory-hog-demo - Status: CrashLoopBackOff, Restarts: 142 (OOMKilled)
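The manual equivalent of that overview is a quick scan of the namespace; a rough filter (it simply hides rows whose STATUS is Running) looks like this:

```bash
# Everything in the tutorial namespace, with STATUS and RESTARTS columns
kubectl get pods -n rubixkube-tutorials

# Rough filter: hide healthy Running pods so failures stand out
kubectl get pods -n rubixkube-tutorials | grep -v Running
```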

Following Up on Incidents

Once you have the list, drill down. Example conversation flow:
You: "Show me all failing pods in rubixkube-tutorials"
Agent: [Lists 3 failing pods]

You: "Tell me about the memory one"
      ↑ Agent knows you mean memory-hog-demo

Agent: [Provides OOMKilled analysis]

You: "What about the crash-loop one?"
      ↑ Still in same namespace context

Agent: [Analyzes crash-loop-demo]

You: "And the image pull issue?"
      ↑ Agent infers broken-image-demo

Agent: [Explains ImagePullBackOff]
Context is KEY! The agent tracks the conversation. You don’t need to repeat pod names or namespaces.

Asking for Logs

Query: "Show me logs for memory-hog-demo" What the agent does:
  1. Calls fetch_kubernetes_logs
  2. Retrieves recent log lines
  3. Highlights ERROR or WARN messages
  4. Provides context about what logs mean
Example response:
Recent logs from memory-hog-demo:

dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: dbug: [1] using backoff sleep of 3000us
stress: dbug: [1] setting--> mem-stress worker 1 [2] forked

This shows the container is intentionally consuming memory ("1 vm"), 
which is causing it to exceed the 50Mi limit and get OOMKilled.
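The same log lines are available directly from kubectl if you ever need to compare; `--previous` is handy for crash-looping pods because it shows the last crashed container's output:

```bash
# Last 50 lines from the currently running (or waiting) container
kubectl logs memory-hog-demo -n rubixkube-tutorials --tail=50

# Output from the previous container instance, i.e. the one that crashed
kubectl logs memory-hog-demo -n rubixkube-tutorials --previous
```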

Comparing Traditional vs. Chat Troubleshooting

Traditional Approach (Without Chat)

Step-by-step for OOMKilled investigation:
  1. kubectl get pods -n rubixkube-tutorials → Find failing pod
  2. kubectl describe pod memory-hog-demo -n rubixkube-tutorials → See events
  3. kubectl logs memory-hog-demo -n rubixkube-tutorials → Check logs
  4. kubectl get pod memory-hog-demo -n rubixkube-tutorials -o yaml → Review config
  5. Google “OOMKilled kubernetes”
  6. Read Stack Overflow
  7. Try increasing memory limit
  8. kubectl apply -f fixed-pod.yaml
  9. kubectl get pod memory-hog-demo -n rubixkube-tutorials → Verify
Time: 10-15 minutes (if you know what you’re doing)

Chat Approach

Single query: "Why is memory-hog-demo crashing?" Agent does ALL of the above automatically:
  • ✅ Gets pod status
  • ✅ Checks events
  • ✅ Retrieves logs
  • ✅ Reviews configuration
  • ✅ Identifies OOMKilled pattern
  • ✅ Explains root cause
  • ✅ Provides fix with exact kubectl command
Time: 1-2 minutes

Time Saved

85-90% faster with Chat

Steps Saved

9 manual steps → 1 question

When Chat Needs Clarification

Sometimes the agent needs more information. This is GOOD - it means it’s being careful.

Example: Namespace Not Specified

Agent asking for namespace clarification
Query: "Show me failing pods" Agent Response:
“I checked the default namespace and found nothing. Which namespace are your applications in?”
Your follow-up: "Check rubixkube-tutorials" Agent: "Of course! I'll check that namespace for you."
Pro Tip: Include the namespace in your first query to save time: “Show me failing pods in rubixkube-tutorials”

Advanced Troubleshooting Queries

Resource Analysis

"What's the memory usage of memory-hog-demo?"
"Is memory-hog-demo hitting its limit?"
"Compare memory-hog-demo usage to its limits"
Agent provides: current usage vs. limits, the percentage of the limit in use, and whether the pod is being throttled or is at risk of OOMKill
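To pull the raw numbers yourself, `kubectl top` reports live usage (it requires the metrics-server add-on), and the configured limit can be read straight off the pod spec:

```bash
# Live CPU and memory usage for the pod (needs metrics-server installed)
kubectl top pod memory-hog-demo -n rubixkube-tutorials

# The memory limit configured on the first container, for comparison
kubectl get pod memory-hog-demo -n rubixkube-tutorials \
  -o jsonpath='{.spec.containers[0].resources.limits.memory}{"\n"}'
```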

Event Timeline

"What events happened to memory-hog-demo?"
"Show me the timeline for crash-loop-demo"
"When did broken-image-demo start failing?"
Agent provides: Chronological event list with timestamps
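Manually, the same timeline comes from the events API, filtered to the pod and sorted by time:

```bash
# Events involving memory-hog-demo, oldest first
kubectl get events -n rubixkube-tutorials \
  --field-selector involvedObject.name=memory-hog-demo \
  --sort-by=.lastTimestamp
```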

Historical Context

"Has memory-hog-demo failed before?"
"Show me similar OOMKilled incidents"
"What's the pattern of crashes?"
Agent queries Memory Engine for past incidents

Common Troubleshooting Queries

  • Pod Failures
  • Performance Issues
  • Log Analysis
  • Event Investigation
Queries:
"Why is [pod-name] failing?"
"What's wrong with [pod-name]?"
"Diagnose [pod-name]"
"Show me errors for [pod-name]"
You’ll Get:
  • Pod status
  • Error messages
  • Root cause analysis
  • Fix suggestions

Real Example: Complete Investigation

Here’s a REAL conversation troubleshooting memory-hog-demo:

The Full Exchange

Query 1: "What pods are failing in my cluster?" Agent: Asks which namespace to check
Query 2: "Check rubixkube-tutorials namespace" Agent: Investigates that namespace
Query 3: "memory-hog-demo" Agent Response:
Complete investigation showing all steps
What you see:
  • 🧠 Thought: “Examining the Pod”
  • 🔧 Function Call: fetch_kubernetes_graph_snapshot
  • ✅ Found: CrashLoopBackOff, 12 restarts
  • 🔧 Function Call: fetch_kubernetes_logs
  • ✅ Log analysis: memory consumption pattern
  • 📊 Root Cause: Memory limit too low
  • 💡 Solution: Increase limit to 150Mi
Query 4: "How do I fix it?" Agent: Provides exact kubectl commands

Tips for Effective Troubleshooting

Start Broad

“What’s failing?” → get an overview. Then narrow: “Tell me about [specific pod]”

Ask for Evidence

“Show me the logs” or “What events occurred?” → the agent provides proof

Request Root Cause

“Why did this happen?” → the agent analyzes patterns

Get Step-by-Step Fix

“How do I fix it?” → the agent provides kubectl commands

Understanding Error Types

OOMKilled (Out of Memory)

How to ask:
"Why is [pod] OOMKilled?"
"What's the memory issue?"
"Show me memory usage for [pod]"
Agent provides:
  • Current memory limit
  • Actual memory attempted
  • How much over the limit
  • Recommended new limit

ImagePullBackOff

How to ask:
"Why can't [pod] pull its image?"
"What's the image issue?"
"Is the registry accessible?"
Agent provides:
  • Image name being pulled
  • Registry URL
  • Error message (auth, not found, network)
  • Common fixes for each scenario

CrashLoopBackOff

How to ask:
"Why is [pod] crash looping?"
"What's making [pod] crash?"
"Show me crash reasons"
Agent provides:
  • Exit code
  • Log errors
  • Restart count and pattern
  • Likely causes (config, bug, dependency)

Following the RCA

When the agent provides analysis, you can dig deeper: Example conversation:
You: "Why did memory-hog-demo fail?"
Agent: [Provides RCA: OOMKilled due to 50Mi limit]

You: "Why was the limit set to 50Mi?"
Agent: [Checks pod spec: "That's what was configured in the YAML"]

You: "What's a good limit for stress containers?"
Agent: [Suggests 150Mi with 50% buffer]

You: "Show me how to apply that"
Agent: [Provides kubectl command or YAML patch]
The conversation evolves naturally based on your needs.
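A hedged sketch of that last step, assuming the pod came from a local manifest file (the memory-hog-demo.yaml path is hypothetical): edit the memory limit in the file, then replace the pod, since resource limits on an existing bare pod generally can't be changed in place:

```bash
# After editing resources.limits.memory (e.g. to 150Mi) in the manifest:
kubectl delete pod memory-hog-demo -n rubixkube-tutorials
kubectl apply -f memory-hog-demo.yaml
```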

Time Savings: Real Numbers

Based on our testing with 3 different pod failures:
| Failure Type | Traditional Time | Chat Time | Savings |
| --- | --- | --- | --- |
| OOMKilled | 12-15 min | 2 min | 85% |
| ImagePullBackOff | 5-10 min | 1 min | 90% |
| CrashLoopBackOff | 15-20 min | 3 min | 85% |
Average: 87% time saved. And that’s just for detection and diagnosis. Chat also:
  • Provides the fix immediately
  • Requires no Googling
  • Avoids trial and error
  • Helps you learn while you troubleshoot

What You Learned

Troubleshooting Workflow

5-step process from detection to resolution

Real Investigation

Actual OOMKilled pod analysis with agent reasoning

Error Types

How to investigate OOMKilled, ImagePullBackOff, CrashLoop

Context Usage

How to have multi-turn troubleshooting conversations

Time Savings

87% faster than traditional debugging

Function Transparency

What each function call does and why

Next: Advanced Chat Usage

You’ve mastered troubleshooting! Now learn advanced techniques:

Continue: Advanced Chat Features

Learn about personas, workflows, file uploads, and power user tips

Quick Reference

Fastest troubleshooting query:
"Diagnose [pod-name]"
This single query triggers a full RCA.
Most comprehensive query:
"Why is [pod-name] failing and how do I fix it?"
You get problem + solution in one response.

Need Help?
