
Since its first commit in June 2014, Kubernetes has evolved into the de facto standard for container orchestration, with over 88,000 contributors from more than 8,000 companies spanning 44 countries. Its self-healing and declarative nature promises effortless scaling and high availability. Yet managing Kubernetes in production is far from straightforward. Just ask any on-call engineer or SRE: Kubernetes troubleshooting in production often spirals into a frustrating cycle of trial and error.
A 2 a.m. alert sends many engineers to the kubectl CLI, only to find the issue has mysteriously "fixed itself." But not for long. Issues like noisy neighbors, misbehaving add-ons, resource starvation, and subtle memory leaks lurk just beneath the surface. Kubernetes errors like CrashLoopBackOff, OOMKilled, and ImagePullBackOff are common, but diagnosing their root cause across a sprawling Kubernetes cluster requires stitching together signals from dozens of sources. Troubleshooting Kubernetes often feels less like solving a puzzle and more like chasing shadows.
What if you could eliminate the strain, guesswork, and manual toil? Imagine an autonomous AI agent that not only assists but proactively investigates and performs root cause analysis across your Kubernetes infrastructure and the applications running on it. That's exactly why we built the AI Production Engineer: to optimize Kubernetes operations, reduce MTTR (mean time to resolve), and make on-call stress-free.
While Kubernetes automates a lot, its dynamic and ephemeral nature brings new challenges for DevOps and SRE teams. Here are the most common challenges we see:
1. Noisy Alerts That Cry Wolf
Kubernetes' control plane tirelessly adjusts workloads to match the desired state. Minor hiccups, like a pod restarting, often trigger alerts that resolve themselves before you even react. The result? Alert fatigue. But buried within that noise, real issues like misconfigured autoscalers or hidden bottlenecks go unnoticed until they snowball into outages.
2. Ephemeral Pods, Lost Context
When pods crash, they take valuable troubleshooting context with them. Running kubectl describe on the pod after the fact often reveals little. There is rarely time to attach a debugger, and the Kubernetes resources and states have already reset. By the time you investigate, critical clues are already gone. It's like arriving at a crime scene after the evidence has been swept away.
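One clue does survive a container restart, as long as the Pod object itself still exists: the previous termination record in the pod status. The sketch below reads it from the JSON that `kubectl get pod <name> -o json` returns. The field names (`containerStatuses`, `lastState.terminated`, `restartCount`) come from the Kubernetes Pod API; the helper name `last_terminations` and the sample pod are hypothetical and only illustrate the idea. This is not a description of how Resolve AI collects context.

```python
def last_terminations(pod: dict) -> list[dict]:
    """Summarize each container's previous termination from a Pod object.

    `pod` is assumed to be the parsed JSON of a single Pod.
    Containers that have never been restarted are skipped.
    """
    summaries = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("lastState", {}).get("terminated")
        if terminated is None:
            continue  # no previous termination recorded for this container
        summaries.append({
            "container": cs["name"],
            "restarts": cs.get("restartCount", 0),
            "reason": terminated.get("reason"),
            "exit_code": terminated.get("exitCode"),
            "finished_at": terminated.get("finishedAt"),
        })
    return summaries


# Hypothetical sample, trimmed to the fields the helper reads.
sample_pod = {
    "metadata": {"name": "api-7d9f", "namespace": "shop"},
    "status": {
        "containerStatuses": [
            {
                "name": "api",
                "restartCount": 4,
                "lastState": {
                    "terminated": {
                        "reason": "OOMKilled",
                        "exitCode": 137,
                        "finishedAt": "2024-05-01T02:03:10Z",
                    }
                },
            },
            {"name": "sidecar", "restartCount": 0, "lastState": {}},
        ]
    },
}

for summary in last_terminations(sample_pod):
    print(summary)
```

Once the pod is deleted or replaced by its controller, this record goes with it, which is why capturing it at alert time matters.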
3. The Observability Data Maze
Logs are scattered across nodes, pods, and containers, turning debugging into a frustrating exercise. Kubernetes generates a flood of metrics and telemetry, but only a small fraction matter for any given alert. Sifting through endless dashboards, running kubectl commands from the CLI, and correlating CPU and memory usage across namespaces to find relevant data wastes time and delays resolution, leaving teams overwhelmed by noise instead of focused on solutions.
Now, imagine a Kubernetes troubleshooting partner that not only pinpoints problems but actively resolves them. Agentic AI from Resolve AI operates as an AI-powered, 24/7 Kubernetes expert that connects the dots, surfaces actionable diagnostics, and automates tedious investigations across your entire Kubernetes cluster.
It removes the need to gather data from multiple sources, coordinate calls with incident managers, or escalate to those who've "seen this before." It understands both unique and recurring issues, streamlines remediation workflows, and minimizes operational overhead. It accelerates your incident response, offers a clear starting point, and instills greater confidence in taking the right actions.
Here’s how it works:
1. Always-On Expertise
Agentic AI doesn't sleep or tire. When an alert fires, it dives into your Kubernetes cluster, navigating the complexity and presenting clear, actionable insights, often before you even reach for your laptop. By monitoring every alert, it handles the flood of noisy issues that usually lead to alert fatigue, ensuring on-call teams focus only on what truly matters.
In the near future, the AI Production Engineer will go a step further, automatically resolving issues within human-approved boundaries through automated remediation pipelines.
2. Knowledge Graphs for Context and Clarity
At the core of Resolve AI is a dynamic knowledge graph that maps your Kubernetes environment. It links pods, nodes, services, ingress controllers, API endpoints, and other Kubernetes resources, revealing patterns you might miss.
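To make the idea of linked resources concrete, here is a minimal sketch of a resource graph in Python. It is purely illustrative: the `ResourceGraph` class, its methods, and the resource names are hypothetical, and nothing here describes Resolve AI's actual knowledge graph. It only shows how linking pods, nodes, services, and ingresses lets you ask "what else is connected to this pod?"

```python
from collections import defaultdict, deque


class ResourceGraph:
    """A tiny undirected graph of cluster resources (illustrative only)."""

    def __init__(self) -> None:
        self._edges: dict[str, set[str]] = defaultdict(set)

    def link(self, a: str, b: str) -> None:
        """Record that two resources are related (e.g. a pod runs on a node)."""
        self._edges[a].add(b)
        self._edges[b].add(a)

    def related(self, start: str, max_hops: int = 2) -> set[str]:
        """Return every resource reachable from `start` within `max_hops`."""
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            resource, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not expand beyond the hop limit
            for neighbor in self._edges.get(resource, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, hops + 1))
        seen.discard(start)
        return seen


# Hypothetical cluster fragment.
graph = ResourceGraph()
graph.link("pod/checkout-7d9f", "node/worker-2")
graph.link("pod/cart-5c4b", "node/worker-2")
graph.link("service/checkout", "pod/checkout-7d9f")
graph.link("ingress/shop", "service/checkout")

print(sorted(graph.related("pod/checkout-7d9f", max_hops=1)))
print(sorted(graph.related("pod/checkout-7d9f", max_hops=2)))
```

With one hop, the query returns the node and service directly attached to the pod. With two hops, it also reaches the neighboring pod on the same node and the ingress in front of the service, the kind of indirect relationship that is easy to miss by hand.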
3. Noise-Free Analysis Across All Telemetry
Resolve AI transforms your observability data into actionable clarity by analyzing data from diverse sources like Prometheus metrics, Datadog logs, Kubernetes events, configuration changes, AWS infrastructure signals, and more. Your data holds immense value, but only when it's relevant. Resolve AI excels at parsing and prioritizing change events, resource states, metrics, dashboards, and logs, pinpointing the entries directly tied to an alert. By filtering out irrelevant noise, it delivers a clear and concise narrative of what's happening, empowering you to focus on solving issues rather than sifting through data.
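As a rough illustration of what "pinpointing the entries directly tied to an alert" can mean, the sketch below keeps only events that concern a given set of resources and fall inside a time window leading up to the alert. The function name `relevant_events`, the event shape (`time`, `resource`, `message`), and the ten-minute default window are all assumptions made for this example, not Resolve AI's filtering logic.

```python
from datetime import datetime, timedelta


def relevant_events(events, alert_time, resources, window=timedelta(minutes=10)):
    """Return events about `resources` within `window` before the alert, oldest first."""
    start = alert_time - window
    matching = [
        e for e in events
        if e["resource"] in resources and start <= e["time"] <= alert_time
    ]
    return sorted(matching, key=lambda e: e["time"])


# Hypothetical events gathered from several sources.
events = [
    {"time": datetime(2024, 5, 1, 1, 40), "resource": "pod/api", "message": "Pulled image"},
    {"time": datetime(2024, 5, 1, 1, 58), "resource": "deployment/api", "message": "Scaled up replica set"},
    {"time": datetime(2024, 5, 1, 2, 4), "resource": "pod/api", "message": "Container terminated: OOMKilled"},
    {"time": datetime(2024, 5, 1, 2, 1), "resource": "pod/api", "message": "Readiness probe failed"},
    {"time": datetime(2024, 5, 1, 2, 2), "resource": "pod/worker", "message": "Started container"},
]

alert_time = datetime(2024, 5, 1, 2, 5)
for event in relevant_events(events, alert_time, {"pod/api", "deployment/api"}):
    print(event["time"].isoformat(), event["resource"], event["message"])
```

Out of five events, three survive: the scale-up, the failed readiness probe, and the termination. The image pull is too old and the unrelated pod is dropped.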
Picture this:
You're alerted about a pod crash. Instead of wrestling with kubectl or parsing endless logs from the command line, the AI Production Engineer steps in:
1. Reconstructs the Event Timeline
It pieces together what led to the crash, be it resource contention, a CrashLoopBackOff, a container image misconfiguration, or external throttling.
2. Correlates Issues Across the Cluster
Using the knowledge graph, it checks for similar anomalies across pods, nodes, or namespaces, identifying whether the issue is isolated or part of a broader Kubernetes cluster problem. It also checks for permissions issues, Docker registry errors, and endpoint misconfigurations that could be contributing factors.
3. Runs Automated Investigations
Agentic AI tests hypotheses like "Was the container OOMKilled because it exceeded its memory limit?" or "Is the pod failing due to a misconfigured startup command?" by executing automated runbooks and analyzing real-time Kubernetes events. Resolve AI's agents don't just surface information. They actually execute workflows across your stack, pulling from observability data, GitHub deployment history, and infrastructure state to build a complete picture.
4. Provides Resolutions
If a root cause is found, the agent suggests remediation steps and is poised to act on it (a capability that's just around the corner). If not, it outlines clear next steps and optimized workflows, saving time and effort.
All this happens while you’re grabbing coffee … or better yet, still asleep.
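The first three steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the AI Production Engineer's implementation: the function names, the event and failure record shapes, and the threshold of two failing pods per node are all hypothetical. The `OOMKilled` reason string is the one Kubernetes reports in a container's termination state.

```python
from collections import Counter
from datetime import datetime


def build_timeline(*sources):
    """Step 1: merge event lists from several sources, ordered by time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: e["time"])


def shared_nodes(pod_failures, threshold=2):
    """Step 2: list nodes hosting at least `threshold` failing pods."""
    counts = Counter(f["node"] for f in pod_failures)
    return sorted(node for node, n in counts.items() if n >= threshold)


def was_oom_killed(termination):
    """Step 3: test the hypothesis that the container was killed for memory."""
    return termination.get("reason") == "OOMKilled"


# Hypothetical inputs.
metrics = [
    {"time": datetime(2024, 5, 1, 2, 2), "detail": "memory at 98% of limit"},
]
cluster_events = [
    {"time": datetime(2024, 5, 1, 2, 3), "detail": "container api terminated"},
    {"time": datetime(2024, 5, 1, 2, 0), "detail": "deployment rollout started"},
]
failures = [
    {"pod": "api-1", "node": "worker-2"},
    {"pod": "cart-1", "node": "worker-2"},
    {"pod": "web-1", "node": "worker-1"},
]

for event in build_timeline(metrics, cluster_events):
    print(event["time"].isoformat(), event["detail"])
print("nodes with multiple failing pods:", shared_nodes(failures))
print("OOM hypothesis holds:", was_oom_killed({"reason": "OOMKilled", "exitCode": 137}))
```

Two failing pods on the same node suggest a node-level problem rather than an isolated crash, and a termination reason of OOMKilled points the investigation at memory limits rather than CPU.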
Kubernetes is complex, but troubleshooting doesn't have to be. Traditional approaches relying on kubectl commands, open source tools like K8sGPT, and manual log correlation can't keep up with the scale and speed of modern Kubernetes environments. From day one, Resolve AI transforms the way you manage Kubernetes by leveraging its built-in expertise to eliminate repetitive firefighting, streamline Kubernetes operations, and give you back your nights and weekends.
Instead of scrambling for answers during an outage, you'll have an AI-powered ally that understands Kubernetes inside and out. It spots patterns, automates investigations using natural language instead of complex queries, and keeps your cluster humming.
The next time Kubernetes throws you a curveball, let an AI Production Engineer handle the heavy lifting. Your future self will thank you.

