Detecting Memory Issues (OOMKilled)

Out of Memory (OOM) errors are critical failures: when a container exceeds its memory limit, Kubernetes kills it. Let's see how RubixKube detects memory issues, analyzes them, and suggests solutions.
What you’ll learn:
  • Creating a memory-constrained pod
  • How RubixKube detects OOMKilled events
  • Reading memory usage analysis
  • Understanding resource limit recommendations
  • Fixing memory issues
Real Impact: OOMKilled errors cause:
  • Immediate pod termination
  • Service degradation (if all replicas affected)
  • Data loss (if not using persistent storage)
  • Customer-facing errors
RubixKube helps you catch and fix these quickly.

The Scenario: Memory Overflow

We’ll create a pod that intentionally exceeds its memory limit to trigger an OOMKilled event.

Create the Memory Hog Pod

apiVersion: v1
kind: Pod
metadata:
  name: memory-hog-demo
  namespace: rubixkube-tutorials
spec:
  containers:
  - name: stress
    image: polinux/stress
    resources:
      limits:
        memory: "50Mi"       # Only 50MB allowed
      requests:
        memory: "50Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "100M"]  # Try to use 100MB!
What this does:
  • Sets both the memory request and limit to 50Mi
  • Tries to allocate ~100M (roughly double the limit)
  • Kubernetes kills the container as soon as usage exceeds the limit
Save the manifest as memory-hog-pod.yaml, then deploy it:
kubectl apply -f memory-hog-pod.yaml

Watch the OOM Cycle

The pod will go through a cycle:
kubectl get pods -n rubixkube-tutorials -w
You’ll see:
NAME              READY   STATUS              RESTARTS   AGE
memory-hog-demo   0/1     ContainerCreating   0          5s
memory-hog-demo   0/1     OOMKilled           0          10s
memory-hog-demo   0/1     CrashLoopBackOff    1          20s
memory-hog-demo   0/1     OOMKilled           1          35s
memory-hog-demo   0/1     CrashLoopBackOff    2          50s
The pattern:
  1. ContainerCreating - Kubernetes starts the container
  2. OOMKilled - Container exceeds memory, Kubernetes kills it
  3. CrashLoopBackOff - Kubernetes waits (with an increasing back-off delay) before restarting
  4. Repeat - the cycle continues indefinitely
This is a real production problem! Pods stuck in this cycle waste resources and cause service degradation.
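To confirm the kill reason from the CLI, describe the pod; the container's Last State block should report OOMKilled with exit code 137 (output abridged, exact fields vary by Kubernetes version):
kubectl describe pod memory-hog-demo -n rubixkube-tutorials

# Expected (abridged):
#   Last State:     Terminated
#     Reason:       OOMKilled
#     Exit Code:    137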

RubixKube Detection (Within 2 Minutes)

Open the RubixKube Dashboard. What you’ll see:
[Screenshot: Dashboard detecting the OOMKilled pod]

Detection Indicators

Activity Feed shows TWO related incidents:
  1. “Out of memory (OOMKilled) detected on Pod/memory-hog-demo”
    • Severity: HIGH
    • Status: Active
    • Detected: Immediately after first OOM
  2. “Container experiencing repeated crashes in memory-hog-demo”
    • Severity: Medium
    • Status: Active
    • Restart count: Tracked automatically
RubixKube correlates events! It understands that the crashes are CAUSED BY the OOM condition. This is intelligent detection, not just alert noise.
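For comparison, here is how to list the raw, uncorrelated events Kubernetes records for this pod; expect repeated BackOff and OOM-related entries (exact messages vary by cluster version):
kubectl get events -n rubixkube-tutorials \
  --field-selector involvedObject.name=memory-hog-demo \
  --sort-by=.lastTimestamp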

Viewing Detailed Analysis

Navigate to Insights to see the full analysis:

Incident Details

[Screenshot: Insights showing the OOMKilled analysis]
What RubixKube shows:

Root Cause

OOMKilled - Container exceeded memory limits

Affected Resource

Pod/memory-hog-demo in namespace rubixkube-tutorials

Restart Count

3 restarts (and counting) - Problem persists

Confidence

90%+ - RubixKube is highly confident about the diagnosis

Memory Analysis

RubixKube’s analysis includes:
Current Configuration:
  memory.limits: 50Mi
  memory.requests: 50Mi

Actual Usage:
  Attempted allocation: ~100Mi
  Limit exceeded by: ~100% (2x over)

Pattern:
  - Pod starts
  - Memory usage spikes immediately
  - Kubernetes OOMKills at ~50Mi
  - Pod restarts
  - Cycle repeats
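You can cross-check the configured values against the live pod spec with a jsonpath query (shown for the tutorial pod):
kubectl get pod memory-hog-demo -n rubixkube-tutorials \
  -o jsonpath='{.spec.containers[0].resources}{"\n"}'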

AI-Generated Suggestions

Based on the analysis, RubixKube suggests:

1. Increase Memory Limits

Recommended action: Double the memory limit
resources:
  limits:
    memory: "100Mi"  # Was 50Mi
  requests:
    memory: "100Mi"
Why: Application needs ~100Mi based on observed behavior

2. Check for Memory Leaks

If restarts continue after increasing limits:
  • Monitor memory usage over time
  • Look for gradual increase (leak pattern)
  • Check application code for issues
RubixKube helps: Memory Engine will track usage trends
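A simple way to watch usage over time from the CLI, assuming metrics-server is installed (the 30-second interval is arbitrary):
# Sample per-container memory usage every 30 seconds
while true; do
  kubectl top pod memory-hog-demo -n rubixkube-tutorials --containers
  sleep 30
done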

3. Consider Resource Quotas

Verify namespace limits:
kubectl describe resourcequota -n rubixkube-tutorials
Ensure the namespace has enough memory headroom for the increased limits.
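If the namespace has no quota and you want to add one, a minimal ResourceQuota sketch looks like this (the name and values are illustrative, not part of the tutorial):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: memory-quota              # illustrative name
  namespace: rubixkube-tutorials
spec:
  hard:
    requests.memory: 1Gi          # total memory requests allowed in the namespace
    limits.memory: 2Gi            # total memory limits allowed in the namespace
EOF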

Fixing the Memory Issue

Method 1: Delete and Recreate (Easiest)

# Delete the broken pod
kubectl delete pod memory-hog-demo -n rubixkube-tutorials

# Create new pod with correct limits
cat > fixed-memory-pod.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: memory-fixed-demo
  namespace: rubixkube-tutorials
spec:
  containers:
  - name: stress
    image: polinux/stress
    resources:
      limits:
        memory: "150Mi"      # Increased from 50Mi
      requests:
        memory: "150Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "100M"]
EOF

kubectl apply -f fixed-memory-pod.yaml

Method 2: Update Deployment (Production Approach)

If this were a Deployment (not a standalone Pod):
kubectl set resources deployment/my-app \
  --limits=memory=150Mi \
  --requests=memory=150Mi \
  -n rubixkube-tutorials
This triggers a rolling update with new resource limits.
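To watch the rolling update complete (my-app is the example Deployment name from above):
kubectl rollout status deployment/my-app -n rubixkube-tutorials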

Verify the Fix

Check the new pod:
kubectl get pods -n rubixkube-tutorials
Success looks like:
NAME                 READY   STATUS    RESTARTS   AGE
memory-fixed-demo    1/1     Running   0          45s
Monitor memory usage:
kubectl top pod memory-fixed-demo -n rubixkube-tutorials
Expected output:
NAME                 CPU(cores)   MEMORY(bytes)
memory-fixed-demo    50m          98Mi
Memory usage (98Mi) is under the limit (150Mi) - Problem solved!
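Another quick check is the restart counter, which should stay at 0 once the limit is adequate:
kubectl get pod memory-fixed-demo -n rubixkube-tutorials \
  -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'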

RubixKube Confirms Resolution

Back in the Dashboard: Within 1-2 minutes:
  • Active Insights decreases (OOMKilled incident marked resolved)
  • Activity Feed shows “Incident resolved” event
  • System Health improves
  • Memory Engine records: “OOMKilled fixed by increasing memory 50Mi → 150Mi”
[Screenshot: Dashboard after resolution]

What the Memory Engine Learned

RubixKube now knows this pattern:
Pod: memory-hog-demo
Issue: OOMKilled (memory limit too low)
Attempted allocation: ~100Mi
Original limit: 50Mi
Fix applied: Increased to 150Mi
Result: Successful (pod stable)
Time to resolution: 3 minutes

Lesson: If pod needs ~100Mi, set limit to 150Mi (50% buffer)
Next time a similar OOM occurs:
  • RubixKube will recall this pattern
  • Suggest the proven fix immediately
  • Calculate appropriate memory based on actual usage
  • Recommend buffer (typically 30-50% over observed peak)

Memory Best Practices

Guaranteed QoS:
resources:
  requests:
    memory: "1Gi"
  limits:
    memory: "1Gi"  # Same value
Benefit: these pods are the last to be evicted under node memory pressure (note: Guaranteed QoS also requires CPU requests and limits to be set and equal)
Burstable QoS:
resources:
  requests:
    memory: "512Mi"  # Baseline
  limits:
    memory: "1Gi"    # Peak allowed
Benefit: Efficient resource usage, can handle spikes
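To confirm which QoS class Kubernetes actually assigned to a pod (memory-fixed-demo used as the example name):
kubectl get pod memory-fixed-demo -n rubixkube-tutorials \
  -o jsonpath='{.status.qosClass}{"\n"}'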
Right-sizing:
# Check current usage
kubectl top pods -n production

# Set limits to 95th percentile + 30% buffer
RubixKube helps: Memory Engine tracks usage trends over weeks/months
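To spot the heaviest consumers first, kubectl top can sort the output by memory (production is the example namespace from above):
kubectl top pods -n production --sort-by=memory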
Auto-scaling:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
Benefit: Kubernetes adds replicas as average memory utilization rises, relieving pressure before pods approach their limits (note: scaling out does not help if a single pod's own usage exceeds its limit)
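To check whether the autoscaler is tracking its target (my-app-hpa from the example above; the namespace is assumed, since the manifest does not set one):
kubectl get hpa my-app-hpa -n rubixkube-tutorials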

Cleanup

Remove the demo pods:
kubectl delete namespace rubixkube-tutorials
This removes:
  • All demo pods
  • Associated events
  • Namespace resources
RubixKube retains:
  • Incident history in Memory Engine
  • Learned patterns
  • RCA reports for future reference

Key Takeaways

Fast Detection

RubixKube detected OOMKilled within 2 minutes - no configuration needed

Intelligent Analysis

Correlated OOM event with crash loop - showed ROOT CAUSE

Actionable Suggestions

Specific recommendations: increase memory, check for leaks, verify quotas

Learning System

Memory Engine stored the pattern for faster resolution next time

Next Steps