
Troubleshooting with Chat: Real Incident Investigation

Now that you know Chat basics, let’s use it for what it does best: investigating real infrastructure problems. This tutorial shows you how to troubleshoot actual failing pods using the SRI Agent.
Real scenarios ahead! We’ll investigate the 3 failing pods we deployed: broken-image-demo, memory-hog-demo, and crash-loop-demo. You’ll see actual Chat responses and learn the troubleshooting workflow.

The Troubleshooting Workflow

When something breaks, follow this Chat-powered workflow:

Identify the Problem

Ask: “What’s failing?” or “Show me errors”

Get Details

Ask: “Why did [pod-name] fail?”

Understand Root Cause

Agent analyzes logs, events, and provides RCA

Get Fix Guidance

Ask: “How do I fix it?”

Verify Resolution

Ask: “Is [pod-name] healthy now?”

Scenario 1: Investigating OOMKilled Pod

The Problem

The memory-hog-demo pod keeps crashing. Let’s use Chat to find out why.

Query: “Tell me about memory-hog-demo”

Complete OOMKilled investigation

What the Agent Did

The SRI Agent automatically:
Function Called: fetch_kubernetes_graph_snapshot
Found:
  • Pod: memory-hog-demo
  • Status: CrashLoopBackOff
  • Restarts: 12 times
  • Labels: scenario: oomkilled
Agent’s Thought: “Analyzing Pod’s Status”
Function Called: fetch_kubernetes_logs
Log Output:
dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
Agent identified: a memory consumption pattern (1 vm = one virtual-memory stress worker, i.e. high memory usage)

Root Cause (from Agent)

The memory-hog-demo pod is configured to use more memory than the limit allows, causing Kubernetes to terminate it to protect the node’s stability.

Solution (from Agent)

  1. Increase the Memory Limit: if the application genuinely needs more memory
  2. Set a Memory Request: helps Kubernetes schedule the pod appropriately

Recommendation

Set a reasonable memory limit and request to prevent cluster impact.
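As a rough sketch, that fix lives in the container's resources block. The values below are illustrative (the tutorial's recommended 150Mi limit plus an assumed 100Mi request), and the container name and image are placeholders for whatever the tutorial manifest already uses:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog-demo
  namespace: rubixkube-tutorials
spec:
  containers:
    - name: memory-hog           # placeholder: keep the container name from the original manifest
      image: polinux/stress      # placeholder: keep the image from the original manifest
      resources:
        requests:
          memory: "100Mi"        # illustrative request so the scheduler has a realistic baseline
        limits:
          memory: "150Mi"        # raised from the 50Mi limit that was triggering OOMKilled
```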
All this from ONE query! The agent:
  • Fetched status (1 function call)
  • Retrieved logs (1 function call)
  • Analyzed the pattern
  • Provided root cause
  • Suggested fixes
Total time: 28 seconds

Scenario 2: ImagePullBackOff Investigation

The Problem

The broken-image-demo pod won’t start.

Typical Questions to Ask

"What's wrong with broken-image-demo?"
"Why can't it pull the image?"
"Show me the events for broken-image-demo"

Expected Agent Response

The agent will:
  1. Check pod status → Finds ImagePullBackOff
  2. Retrieve events → Sees failed pull attempts
  3. Identify issue → Non-existent registry
  4. Suggest fixes:
    • Verify image name and tag
    • Check registry accessibility
    • Ensure image exists
    • Review imagePullSecrets
For ImagePullBackOff: The agent can quickly identify if it’s a typo, auth issue, or network problem by analyzing the error messages.
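If you want to cross-check the agent's findings by hand, two standard kubectl commands surface the same evidence (pod and namespace names are from this tutorial; adjust for your cluster):

```bash
# Pull errors show up in the Events section at the bottom of the output
kubectl describe pod broken-image-demo -n rubixkube-tutorials

# Print the exact image reference the pod is trying to pull (first container)
kubectl get pod broken-image-demo -n rubixkube-tutorials \
  -o jsonpath='{.spec.containers[0].image}{"\n"}'
```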

Scenario 3: CrashLoopBackOff Analysis

The Problem

The crash-loop-demo pod starts, then immediately crashes.

Query: “Why is crash-loop-demo crashing?”

Expected workflow:
  1. Agent fetches pod status → finds CrashLoopBackOff with a high restart count
  2. Agent retrieves logs → looks for error messages in container logs
  3. Agent checks the exit code → exit code 1 means an application error (not OOMKilled or a signal); you can verify this yourself with the sketch after this list
  4. Agent provides guidance:
    • Check application logs for errors
    • Verify configuration
    • Check for missing dependencies
    • Review the startup command
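As referenced in step 3, the exit code and termination reason are recorded on the pod itself, so you can verify the agent's reading with plain kubectl (assumes a single-container pod; adjust the index otherwise):

```bash
# Exit code and reason from the container's most recent termination
kubectl get pod crash-loop-demo -n rubixkube-tutorials \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
kubectl get pod crash-loop-demo -n rubixkube-tutorials \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
```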

Common CrashLoop Causes

The agent can identify:
| Cause | How Agent Detects | Suggested Fix |
| --- | --- | --- |
| Application bug | Exit code 1, errors in logs | Debug application code |
| Missing config | Exit code, “config not found” in logs | Add ConfigMap or Secret |
| Dependency unavailable | Exit code, connection errors | Check service dependencies |
| Insufficient resources | Exit code 137 (OOMKilled) | Increase limits |
| Wrong command | Exit code 127, “command not found” | Fix container command |

Multi-Pod Investigation

Query: “Show me all failing pods in rubixkube-tutorials”

This query attempts to get an overview of ALL problems at once.
Query for all failing pods
Agent’s approach:
  1. Thought: Querying Kubernetes Pods → the agent plans to fetch all pods in the namespace
  2. Function Call: fetch_kubernetes_graph_snapshot → queries the Kubernetes API for pod data
  3. Thought: Identifying Failed Pods → filters for non-Running/non-healthy statuses
  4. Response → lists all failing pods with their statuses
Agent Response: The agent successfully identified all 3 failing pods with details:
  1. broken-image-demo - Status: Pending, Reason: ImagePullBackOff
  2. crash-loop-demo - Status: CrashLoopBackOff, Restarts: 142
  3. memory-hog-demo - Status: CrashLoopBackOff, Restarts: 142 (OOMKilled)
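The manual equivalent of that overview is a quick scan of the namespace; a rough filter (it simply hides rows whose STATUS is Running) looks like this:

```bash
# Everything in the tutorial namespace, with STATUS and RESTARTS columns
kubectl get pods -n rubixkube-tutorials

# Rough filter: hide healthy Running pods so failures stand out
kubectl get pods -n rubixkube-tutorials | grep -v Running
```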

Following Up on Incidents

Once you have the list, drill down. Example conversation flow:
You: "Show me all failing pods in rubixkube-tutorials"
Agent: [Lists 3 failing pods]

You: "Tell me about the memory one"
      ↑ Agent knows you mean memory-hog-demo

Agent: [Provides OOMKilled analysis]

You: "What about the crash-loop one?"
      ↑ Still in same namespace context

Agent: [Analyzes crash-loop-demo]

You: "And the image pull issue?"
      ↑ Agent infers broken-image-demo

Agent: [Explains ImagePullBackOff]
Context is KEY! The agent tracks the conversation. You don’t need to repeat pod names or namespaces.

Asking for Logs

Query: "Show me logs for memory-hog-demo" What the agent does:
  1. Calls fetch_kubernetes_logs
  2. Retrieves recent log lines
  3. Highlights ERROR or WARN messages
  4. Provides context about what logs mean
Example response:
Recent logs from memory-hog-demo:

dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: dbug: [1] using backoff sleep of 3000us
stress: dbug: [1] setting--> mem-stress worker 1 [2] forked

This shows the container is intentionally consuming memory ("1 vm"), 
which is causing it to exceed the 50Mi limit and get OOMKilled.
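The same log lines are available directly from kubectl if you ever need to compare; `--previous` is handy for crash-looping pods because it shows the last crashed container's output:

```bash
# Last 50 lines from the currently running (or waiting) container
kubectl logs memory-hog-demo -n rubixkube-tutorials --tail=50

# Output from the previous container instance, i.e. the one that crashed
kubectl logs memory-hog-demo -n rubixkube-tutorials --previous
```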

Comparing Traditional vs. Chat Troubleshooting

Traditional Approach (Without Chat)

Step-by-step for OOMKilled investigation:
  1. kubectl get pods -n rubixkube-tutorials → Find failing pod
  2. kubectl describe pod memory-hog-demo -n rubixkube-tutorials → See events
  3. kubectl logs memory-hog-demo -n rubixkube-tutorials → Check logs
  4. kubectl get pod memory-hog-demo -n rubixkube-tutorials -o yaml → Review config
  5. Google “OOMKilled kubernetes”
  6. Read Stack Overflow
  7. Try increasing memory limit
  8. kubectl apply -f fixed-pod.yaml
  9. kubectl get pod memory-hog-demo -n rubixkube-tutorials → Verify
Time: 10-15 minutes (if you know what you’re doing)

Chat Approach

Single query: "Why is memory-hog-demo crashing?" Agent does ALL of the above automatically:
  • ✅ Gets pod status
  • ✅ Checks events
  • ✅ Retrieves logs
  • ✅ Reviews configuration
  • ✅ Identifies OOMKilled pattern
  • ✅ Explains root cause
  • ✅ Provides fix with exact kubectl command
Time: 1-2 minutes

Time Saved

85-90% faster with Chat

Steps Saved

9 manual steps → 1 question

When Chat Needs Clarification

Sometimes the agent needs more information. This is GOOD - it means it’s being careful.

Example: Namespace Not Specified

Agent asking for namespace clarification
Query: "Show me failing pods" Agent Response:
“I checked the default namespace and found nothing. Which namespace are your applications in?”
Your follow-up: "Check rubixkube-tutorials" Agent: "Of course! I'll check that namespace for you."
Pro Tip: Include the namespace in your first query to save time: “Show me failing pods in rubixkube-tutorials”

Advanced Troubleshooting Queries

Resource Analysis

"What's the memory usage of memory-hog-demo?"
"Is memory-hog-demo hitting its limit?"
"Compare memory-hog-demo usage to its limits"
Agent provides: current usage vs. limits, the percentage of the limit in use, and whether the pod is being throttled or is at risk of OOMKill
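To pull the raw numbers yourself, `kubectl top` reports live usage (it requires the metrics-server add-on), and the configured limit can be read straight off the pod spec:

```bash
# Live CPU and memory usage for the pod (needs metrics-server installed)
kubectl top pod memory-hog-demo -n rubixkube-tutorials

# The memory limit configured on the first container, for comparison
kubectl get pod memory-hog-demo -n rubixkube-tutorials \
  -o jsonpath='{.spec.containers[0].resources.limits.memory}{"\n"}'
```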

Event Timeline

"What events happened to memory-hog-demo?"
"Show me the timeline for crash-loop-demo"
"When did broken-image-demo start failing?"
Agent provides: Chronological event list with timestamps
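Manually, the same timeline comes from the events API, filtered to the pod and sorted by time:

```bash
# Events involving memory-hog-demo, oldest first
kubectl get events -n rubixkube-tutorials \
  --field-selector involvedObject.name=memory-hog-demo \
  --sort-by=.lastTimestamp
```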

Historical Context

"Has memory-hog-demo failed before?"
"Show me similar OOMKilled incidents"
"What's the pattern of crashes?"
Agent queries Memory Engine for past incidents

Common Troubleshooting Queries

  • Pod Failures
  • Performance Issues
  • Log Analysis
  • Event Investigation
Queries:
"Why is [pod-name] failing?"
"What's wrong with [pod-name]?"
"Diagnose [pod-name]"
"Show me errors for [pod-name]"
You’ll Get:
  • Pod status
  • Error messages
  • Root cause analysis
  • Fix suggestions

Real Example: Complete Investigation

Here’s a REAL conversation troubleshooting memory-hog-demo:

The Full Exchange

Query 1: "What pods are failing in my cluster?" Agent: Asks which namespace to check
Query 2: "Check rubixkube-tutorials namespace" Agent: Investigates that namespace
Query 3: "memory-hog-demo" Agent Response:
Complete investigation showing all steps
What you see:
  • 🧠 Thought: “Examining the Pod”
  • 🔧 Function Call: fetch_kubernetes_graph_snapshot
  • ✅ Found: CrashLoopBackOff, 12 restarts
  • 🔧 Function Call: fetch_kubernetes_logs
  • ✅ Log analysis: memory consumption pattern
  • 📊 Root Cause: Memory limit too low
  • 💡 Solution: Increase limit to 150Mi
Query 4: "How do I fix it?" Agent: Provides exact kubectl commands

Tips for Effective Troubleshooting

Start Broad

“What’s failing?” → get an overview. Then narrow: “Tell me about [specific pod]”

Ask for Evidence

“Show me the logs” or “What events occurred?” → the agent provides proof

Request Root Cause

“Why did this happen?” → the agent analyzes patterns

Get Step-by-Step Fix

“How do I fix it?” → the agent provides kubectl commands

Understanding Error Types

OOMKilled (Out of Memory)

How to ask:
"Why is [pod] OOMKilled?"
"What's the memory issue?"
"Show me memory usage for [pod]"
Agent provides:
  • Current memory limit
  • Actual memory attempted
  • How much over the limit
  • Recommended new limit

ImagePullBackOff

How to ask:
"Why can't [pod] pull its image?"
"What's the image issue?"
"Is the registry accessible?"
Agent provides:
  • Image name being pulled
  • Registry URL
  • Error message (auth, not found, network)
  • Common fixes for each scenario

CrashLoopBackOff

How to ask:
"Why is [pod] crash looping?"
"What's making [pod] crash?"
"Show me crash reasons"
Agent provides:
  • Exit code
  • Log errors
  • Restart count and pattern
  • Likely causes (config, bug, dependency)

Following the RCA

When the agent provides analysis, you can dig deeper: Example conversation:
You: "Why did memory-hog-demo fail?"
Agent: [Provides RCA: OOMKilled due to 50Mi limit]

You: "Why was the limit set to 50Mi?"
Agent: [Checks pod spec: "That's what was configured in the YAML"]

You: "What's a good limit for stress containers?"
Agent: [Suggests 150Mi with 50% buffer]

You: "Show me how to apply that"
Agent: [Provides kubectl command or YAML patch]
The conversation evolves naturally based on your needs.
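A hedged sketch of that last step, assuming the pod came from a local manifest file (the memory-hog-demo.yaml path is hypothetical): edit the memory limit in the file, then replace the pod, since resource limits on an existing bare pod generally can't be changed in place:

```bash
# After editing resources.limits.memory (e.g. to 150Mi) in the manifest:
kubectl delete pod memory-hog-demo -n rubixkube-tutorials
kubectl apply -f memory-hog-demo.yaml
```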

Time Savings: Real Numbers

Based on our testing with 3 different pod failures:
| Failure Type | Traditional Time | Chat Time | Savings |
| --- | --- | --- | --- |
| OOMKilled | 12-15 min | 2 min | 85% |
| ImagePullBackOff | 5-10 min | 1 min | 90% |
| CrashLoopBackOff | 15-20 min | 3 min | 85% |
Average: 87% time saved. And that’s just for detection and diagnosis. Chat also:
  • Provides the fix immediately
  • Requires no Googling
  • Avoids trial and error
  • Helps you learn while you troubleshoot

What You Learned

Troubleshooting Workflow

5-step process from detection to resolution

Real Investigation

Actual OOMKilled pod analysis with agent reasoning

Error Types

How to investigate OOMKilled, ImagePullBackOff, CrashLoop

Context Usage

How to have multi-turn troubleshooting conversations

Time Savings

87% faster than traditional debugging

Function Transparency

What each function call does and why

Next: Advanced Chat Usage

You’ve mastered troubleshooting! Now learn advanced techniques:

Continue: Advanced Chat Features

Learn about personas, workflows, file uploads, and power user tips

Quick Reference

Fastest troubleshooting query:
"Diagnose [pod-name]"
This single query triggers a full RCA.
Most comprehensive query:
"Why is [pod-name] failing and how do I fix it?"
You get problem + solution in one response.

Need Help?
