Detecting Memory Issues (OOMKilled)

Out of Memory (OOM) errors are critical failures: when a container exceeds its memory limit, Kubernetes kills it. Let's see how RubixKube detects memory issues, analyzes them, and suggests solutions.
What you’ll learn:
  • Creating a memory-constrained pod
  • How RubixKube detects OOMKilled events
  • Reading memory usage analysis
  • Understanding resource limit recommendations
  • Fixing memory issues
Real Impact: OOMKilled errors cause:
  • Immediate pod termination
  • Service degradation (if all replicas affected)
  • Data loss (if not using persistent storage)
  • Customer-facing errors
RubixKube helps you catch and fix these quickly.

The Scenario: Memory Overflow

We’ll create a pod that intentionally exceeds its memory limit to trigger an OOMKilled event.

Create the Memory Hog Pod

apiVersion: v1
kind: Pod
metadata:
  name: memory-hog-demo
  namespace: rubixkube-tutorials
spec:
  containers:
  - name: stress
    image: polinux/stress
    resources:
      limits:
        memory: "50Mi"       # Only 50MB allowed
      requests:
        memory: "50Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "100M"]  # Try to use 100MB!
What this does:
  • Sets both the memory request and limit to 50Mi
  • Tries to allocate ~100M (roughly double the limit)
  • Kubernetes kills the container as soon as usage exceeds the limit
Save the manifest as memory-hog-pod.yaml, then deploy it:
kubectl apply -f memory-hog-pod.yaml

Watch the OOM Cycle

The pod will go through a cycle:
kubectl get pods -n rubixkube-tutorials -w
You’ll see:
NAME              READY   STATUS              RESTARTS   AGE
memory-hog-demo   0/1     ContainerCreating   0          5s
memory-hog-demo   0/1     OOMKilled           0          10s
memory-hog-demo   0/1     CrashLoopBackOff    1          20s
memory-hog-demo   0/1     OOMKilled           1          35s
memory-hog-demo   0/1     CrashLoopBackOff    2          50s
The pattern:
  1. ContainerCreating - Kubernetes starts the container
  2. OOMKilled - Container exceeds memory, Kubernetes kills it
  3. CrashLoopBackOff - Kubernetes waits (with an increasing back-off delay) before restarting
  4. Repeat - the cycle continues indefinitely
This is a real production problem! Pods stuck in this cycle waste resources and cause service degradation.
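To confirm the kill reason from the CLI, describe the pod; the container's Last State block should report OOMKilled with exit code 137 (output abridged, exact fields vary by Kubernetes version):
kubectl describe pod memory-hog-demo -n rubixkube-tutorials

# Expected (abridged):
#   Last State:     Terminated
#     Reason:       OOMKilled
#     Exit Code:    137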

RubixKube Detection (Within 2 Minutes)

Open the RubixKube Dashboard. What you’ll see:
[Screenshot: Dashboard detecting the OOMKilled pod]

Detection Indicators

Activity Feed shows TWO related incidents:
  1. “Out of memory (OOMKilled) detected on Pod/memory-hog-demo”
    • Severity: HIGH
    • Status: Active
    • Detected: Immediately after first OOM
  2. “Container experiencing repeated crashes in memory-hog-demo”
    • Severity: Medium
    • Status: Active
    • Restart count: Tracked automatically
RubixKube correlates events! It understands that the crashes are CAUSED BY the OOM condition. This is intelligent detection, not just alert noise.
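For comparison, here is how to list the raw, uncorrelated events Kubernetes records for this pod; expect repeated BackOff and OOM-related entries (exact messages vary by cluster version):
kubectl get events -n rubixkube-tutorials \
  --field-selector involvedObject.name=memory-hog-demo \
  --sort-by=.lastTimestamp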

Viewing Detailed Analysis

Navigate to Insights to see the full analysis:

Incident Details

[Screenshot: Insights showing the OOMKilled analysis]
What RubixKube shows:

Root Cause

OOMKilled - Container exceeded memory limits

Affected Resource

Pod/memory-hog-demo in namespace rubixkube-tutorials

Restart Count

3 restarts (and counting) - Problem persists

Confidence

90%+ - RubixKube is highly confident about the diagnosis

Memory Analysis

RubixKube’s analysis includes:
Current Configuration:
  memory.limits: 50Mi
  memory.requests: 50Mi

Actual Usage:
  Attempted allocation: ~100Mi
  Limit exceeded by: ~100% (2x over)

Pattern:
  - Pod starts
  - Memory usage spikes immediately
  - Kubernetes OOMKills at ~50Mi
  - Pod restarts
  - Cycle repeats
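You can cross-check the configured values against the live pod spec with a jsonpath query (shown for the tutorial pod):
kubectl get pod memory-hog-demo -n rubixkube-tutorials \
  -o jsonpath='{.spec.containers[0].resources}{"\n"}'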

AI-Generated Suggestions

Based on the analysis, RubixKube suggests:

1. Increase Memory Limits

Recommended action: Double the memory limit
resources:
  limits:
    memory: "100Mi"  # Was 50Mi
  requests:
    memory: "100Mi"
Why: Application needs ~100Mi based on observed behavior

2. Check for Memory Leaks

If restarts continue after increasing limits:
  • Monitor memory usage over time
  • Look for gradual increase (leak pattern)
  • Check application code for issues
RubixKube helps: Memory Engine will track usage trends
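A simple way to watch usage over time from the CLI, assuming metrics-server is installed (the 30-second interval is arbitrary):
# Sample per-container memory usage every 30 seconds
while true; do
  kubectl top pod memory-hog-demo -n rubixkube-tutorials --containers
  sleep 30
done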

3. Consider Resource Quotas

Verify namespace limits:
kubectl describe resourcequota -n rubixkube-tutorials
Ensure the namespace has enough memory headroom for the increased limits.
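If the namespace has no quota and you want to add one, a minimal ResourceQuota sketch looks like this (the name and values are illustrative, not part of the tutorial):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: memory-quota              # illustrative name
  namespace: rubixkube-tutorials
spec:
  hard:
    requests.memory: 1Gi          # total memory requests allowed in the namespace
    limits.memory: 2Gi            # total memory limits allowed in the namespace
EOF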

Fixing the Memory Issue

Method 1: Delete and Recreate (Easiest)

# Delete the broken pod
kubectl delete pod memory-hog-demo -n rubixkube-tutorials

# Create new pod with correct limits
cat > fixed-memory-pod.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: memory-fixed-demo
  namespace: rubixkube-tutorials
spec:
  containers:
  - name: stress
    image: polinux/stress
    resources:
      limits:
        memory: "150Mi"      # Increased from 50Mi
      requests:
        memory: "150Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "100M"]
EOF

kubectl apply -f fixed-memory-pod.yaml

Method 2: Update Deployment (Production Approach)

If this were a Deployment (not a standalone Pod):
kubectl set resources deployment/my-app \
  --limits=memory=150Mi \
  --requests=memory=150Mi \
  -n rubixkube-tutorials
This triggers a rolling update with new resource limits.
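To watch the rolling update complete (my-app is the example Deployment name from above):
kubectl rollout status deployment/my-app -n rubixkube-tutorials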

Verify the Fix

Check the new pod:
kubectl get pods -n rubixkube-tutorials
Success looks like:
NAME                 READY   STATUS    RESTARTS   AGE
memory-fixed-demo    1/1     Running   0          45s
Monitor memory usage:
kubectl top pod memory-fixed-demo -n rubixkube-tutorials
Expected output:
NAME                 CPU(cores)   MEMORY(bytes)
memory-fixed-demo    50m          98Mi
Memory usage (98Mi) is under the limit (150Mi) - Problem solved!
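Another quick check is the restart counter, which should stay at 0 once the limit is adequate:
kubectl get pod memory-fixed-demo -n rubixkube-tutorials \
  -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'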

RubixKube Confirms Resolution

Back in the Dashboard: Within 1-2 minutes:
  • Active Insights decreases (OOMKilled incident marked resolved)
  • Activity Feed shows “Incident resolved” event
  • System Health improves
  • Memory Engine records: “OOMKilled fixed by increasing memory 50Mi → 150Mi”
[Screenshot: Dashboard after resolution]

What the Memory Engine Learned

RubixKube now knows this pattern:
Pod: memory-hog-demo
Issue: OOMKilled (memory limit too low)
Attempted allocation: ~100Mi
Original limit: 50Mi
Fix applied: Increased to 150Mi
Result: Successful (pod stable)
Time to resolution: 3 minutes

Lesson: If pod needs ~100Mi, set limit to 150Mi (50% buffer)
Next time a similar OOM occurs:
  • RubixKube will recall this pattern
  • Suggest the proven fix immediately
  • Calculate appropriate memory based on actual usage
  • Recommend buffer (typically 30-50% over observed peak)

Memory Best Practices

Guaranteed QoS:
resources:
  requests:
    memory: "1Gi"
  limits:
    memory: "1Gi"  # Same value
Benefit: these pods are the last to be evicted under node memory pressure (note: Guaranteed QoS also requires CPU requests and limits to be set and equal)
Burstable QoS:
resources:
  requests:
    memory: "512Mi"  # Baseline
  limits:
    memory: "1Gi"    # Peak allowed
Benefit: Efficient resource usage, can handle spikes
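To confirm which QoS class Kubernetes actually assigned to a pod (memory-fixed-demo used as the example name):
kubectl get pod memory-fixed-demo -n rubixkube-tutorials \
  -o jsonpath='{.status.qosClass}{"\n"}'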
Right-sizing:
# Check current usage
kubectl top pods -n production

# Set limits to 95th percentile + 30% buffer
RubixKube helps: Memory Engine tracks usage trends over weeks/months
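To spot the heaviest consumers first, kubectl top can sort the output by memory (production is the example namespace from above):
kubectl top pods -n production --sort-by=memory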
Auto-scaling:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
Benefit: Kubernetes adds replicas as average memory utilization rises, relieving pressure before pods approach their limits (note: scaling out does not help if a single pod's own usage exceeds its limit)
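To check whether the autoscaler is tracking its target (my-app-hpa from the example above; the namespace is assumed, since the manifest does not set one):
kubectl get hpa my-app-hpa -n rubixkube-tutorials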

Cleanup

Remove the demo pods:
kubectl delete namespace rubixkube-tutorials
This removes:
  • All demo pods
  • Associated events
  • Namespace resources
RubixKube retains:
  • Incident history in Memory Engine
  • Learned patterns
  • RCA reports for future reference

Key Takeaways

Fast Detection

RubixKube detected OOMKilled within 2 minutes - no configuration needed

Intelligent Analysis

Correlated OOM event with crash loop - showed ROOT CAUSE

Actionable Suggestions

Specific recommendations: increase memory, check for leaks, verify quotas

Learning System

Memory Engine stored the pattern for faster resolution next time

Next Steps