
AI Agents for Kubernetes: Automating Cluster Management

Mei-Lin Zhang

ML researcher focused on autonomous agents and multi-agent systems.

March 5, 2026 · 15 min read


AI Agents for Kubernetes Management: A Practical Guide for SRE Teams

Kubernetes clusters generate an extraordinary volume of operational data — pod events, resource metrics, logs, audit trails, network policies, and cost telemetry. The gap between the data available and a human operator's ability to act on it in real time is exactly where AI agents are finding traction. Not as replacements for SRE judgment, but as force multipliers that compress the time between "something changed" and "I know what to do about it."

This guide covers the practical reality of using AI agents across the Kubernetes lifecycle: deployment, scaling, troubleshooting, and cost optimization. Every tool mentioned here exists today and has real adoption. Every limitation is real too.


The Current Landscape: What "AI Agent" Actually Means Here

Before diving in, let's be precise about terminology. In the Kubernetes management context, "AI agents" fall into three categories:

Category | What It Does | Examples
LLM-powered CLI assistants | Natural language → kubectl/YAML generation | kubectl-ai, k8sgpt, Kubiya
ML-driven optimization engines | Continuous tuning of resource requests and scaling parameters | StormForge, Cast AI, Karpenter (with AI layers)
Autonomous remediation agents | Detect issues → diagnose → take action (with guardrails) | Robusta, Shoreline.io, PagerDuty AIOps + K8s integrations

These are fundamentally different tools solving different problems. A YAML generation assistant won't optimize your node costs. A cost optimization engine won't triage a CrashLoopBackOff at 3 AM. SRE teams need to understand where each category fits in their workflow.


Deployment: AI-Assisted Manifest Generation and Validation

The Problem

Writing correct Kubernetes manifests is deceptively complex. A deployment YAML that works in dev can be a security liability in production — missing resource limits, no pod disruption budgets, absent network policies, containers running as root. The combinatorial explosion of best practices across security, reliability, and operability makes "correct by default" manifests genuinely hard.

Tools and Workflows

kubectl-ai (an open-source project from Google) lets you describe workloads in natural language and generates manifests:

kubectl ai "Create a deployment for a stateless Go API with 3 replicas, 
resource limits, health checks, and a pod disruption budget. 
It listens on port 8080 and needs access to a Postgres database via 
a secret called db-credentials."

This produces a multi-document YAML that includes the Deployment, Service, PDB, and references to the Secret. It's a reasonable starting point — emphasis on starting point.

What it gets right:

  • Correct structure and API versions
  • Generally follows resource limit best practices
  • Includes liveness/readiness probes with sensible defaults

Where it falls short:

  • Security context is inconsistent — sometimes it runs containers as root
  • Network policies are rarely generated unless explicitly requested
  • Resource requests/limits are generic (not tuned to your actual workload)
  • It has no knowledge of your cluster's admission controllers or OPA/Gatekeeper policies
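
Given those gaps, a short review checklist helps. The sketch below collects the security and resource settings that generated manifests most often omit; the values are illustrative defaults, not output from any tool:

# Pod- and container-level hardening that generators frequently leave out.
# Values are illustrative; tune requests/limits to observed usage.
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: api
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi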

A more reliable workflow for production deployments:

# Step 1: Generate the base manifest
kubectl ai "Create a production-ready deployment for image: myapp:v2.1.0 
with 5 replicas, rolling update strategy, resource limits based on 
typical Go web services, and anti-affinity rules."

# Step 2: Validate against your policies
kustomize build ./overlays/production | conftest test -p policy/ -

# Step 3: Dry-run against the cluster
kubectl apply --dry-run=server -f generated-manifest.yaml

# Step 4: AI-assisted review
k8sgpt analyze -f generated-manifest.yaml

Kubiya takes a different approach — it's a conversational AI platform that integrates with your existing toolchain (ArgoCD, Helm, Terraform) and can execute multi-step deployment workflows through natural language commands in Slack or Teams. The key differentiator: it respects your existing RBAC and approval gates rather than generating standalone YAML.

@kubiya Deploy service checkout-api v2.3.1 to staging, 
run the smoke test suite, and if it passes, create a PR 
to promote to production.

Kubiya translates this into a sequence of API calls to your CI/CD pipeline. It's less "generate YAML" and more "orchestrate your existing automation." For teams with mature GitOps workflows, this is significantly more useful than raw manifest generation.

Honest Assessment

AI-generated manifests are useful for bootstrapping and learning. They are not a substitute for understanding the Kubernetes API. The most dangerous pattern I've seen is teams treating generated YAML as production-ready without review. Use AI to accelerate, not to skip the review step.


Scaling: From Reactive to Predictive

The Problem

The Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) are reactive — they respond to metrics after demand changes. For workloads with predictable patterns (daily traffic cycles, batch job schedules), reactive scaling means you're always either over-provisioned or catching up.

Karpenter with Intelligent Provisioning

Karpenter (AWS's open-source node autoscaler, now also available for Azure) isn't purely "AI" in the LLM sense, but it uses sophisticated decision-making logic that goes far beyond the Cluster Autoscaler. It evaluates pending pods, selects optimal node types from the full cloud provider catalog, and can consolidate underutilized nodes.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"]
        - key: "karpenter.k8s.aws/instance-generation"
          operator: Gt
          values: ["5"]
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
  limits:
    cpu: "1000"
    memory: 2000Gi

Karpenter's consolidation logic is genuinely intelligent — it will proactively migrate workloads to fewer, right-sized nodes when utilization drops, and it does this with disruption budgets to avoid availability impacts.
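
Consolidation is only safe if workloads declare their own availability floor, since Karpenter honors PodDisruptionBudgets when it drains nodes. A minimal PDB for the checkout-api deployment used elsewhere in this guide (the label selector is an assumption about how the pods are labeled):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-api-pdb
  namespace: production
spec:
  minAvailable: 2          # never voluntarily evict below 2 ready replicas
  selector:
    matchLabels:
      app: checkout-api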

StormForge: ML-Driven Resource Optimization

StormForge uses machine learning to analyze historical resource consumption and recommend optimal CPU/memory requests and limits. This is the VPA concept done properly.

The workflow:

# Install the StormForge agent
helm install stormforge oci://registry.stormforge.io/library/stormforge-agent \
  --set clientID=<your-id> --set clientSecret=<your-secret>

# Create an optimization experiment
kubectl apply -f - <<EOF
apiVersion: stormforge.io/v1beta1
kind: Recommendation
metadata:
  name: checkout-api-optimization
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
EOF

StormForge observes the workload over a configurable window (typically 14 days), then produces recommendations:

Deployment: checkout-api
Namespace: production

Current → Recommended:
  CPU request:    500m → 180m
  CPU limit:      1000m → 450m  
  Memory request: 256Mi → 142Mi
  Memory limit:   512Mi → 310Mi
  
  Estimated monthly savings: $2,847
  Confidence: 94th percentile (P99 consumption observed at 290m CPU)

The critical nuance: these recommendations are based on your actual workload data, not generic benchmarks. The ML model accounts for temporal patterns, burst behavior, and tail latencies.
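
Applying a recommendation ultimately amounts to an ordinary edit to the Deployment's resources block (StormForge can automate the rollout, but the end state is the same). Using the numbers above, the container spec would land at:

resources:
  requests:
    cpu: 180m
    memory: 142Mi
  limits:
    cpu: 450m
    memory: 310Mi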

Cast AI: Full-Stack Cost and Scaling Automation

Cast AI goes further than StormForge by optimizing both the pod layer (resource requests) and the infrastructure layer (node selection, spot instance management, cross-cloud rebalancing). It's closer to a fully autonomous platform.

Key capabilities:

  • Real-time pod rightsizing — adjusts VPA recommendations continuously
  • Spot instance orchestration — automatically diversifies across instance types and availability zones, with fallback to on-demand when spot capacity is unavailable
  • Node pool optimization — selects the cheapest node type that satisfies pod scheduling constraints
  • Cluster rebalancing — migrates workloads when cheaper infrastructure becomes available

# Cast AI connects to your cluster and begins analysis
castai-connect --cluster-name production-cluster --api-key <key>

# View optimization recommendations
castai recommendations --cluster production-cluster

# Enable autonomous mode (with guardrails)
castai policy set --cluster production-cluster \
  --enable-spot \
  --max-spot-percentage 70 \
  --min-on-demand-nodes 3 \
  --cpu-utilization-threshold 70

The Scaling Agent Workflow for SRE Teams

Here's a practical workflow that combines these tools:

  1. Deploy Karpenter for node-level autoscaling with intelligent instance selection
  2. Deploy StormForge or Cast AI for continuous resource rightsizing
  3. Configure HPA with custom metrics (from Prometheus, Datadog, etc.) for pod-level scaling
  4. Set up cost anomaly alerts to catch runaway scaling (a minimal alert sketch follows the HPA example below)

# HPA with custom Prometheus metric
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120

The behavior field is critical — without it, HPA will aggressively scale down during brief traffic dips, then struggle to scale back up. AI-driven scaling tools can inform these parameters, but the human SRE still needs to set the guardrails.
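
For step 4 in the list above, a simple guardrail is to alert on abnormal replica growth rather than on cost directly, since cost data lags but replica counts don't. A minimal sketch assuming the Prometheus Operator and kube-state-metrics are installed; the rule name and thresholds are illustrative:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: runaway-scaling-alerts
  namespace: monitoring
spec:
  groups:
    - name: scaling-anomalies
      rules:
        - alert: ReplicaCountSurge
          # Fires when a deployment's replica count more than doubles within an hour
          expr: |
            kube_deployment_status_replicas{namespace="production"}
              > 2 * (kube_deployment_status_replicas{namespace="production"} offset 1h)
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.deployment }} replicas more than doubled in the past hour"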


Troubleshooting: Where AI Agents Provide the Most Immediate Value

This is the area with the most mature tooling and the clearest ROI.

k8sgpt: Cluster Diagnostics with LLM Analysis

k8sgpt scans your cluster for common failure patterns and uses an LLM (OpenAI, Azure OpenAI, local models via Ollama, or others) to explain the root cause in plain language.

# Install
brew install k8sgpt

# Analyze the entire cluster
k8sgpt analyze --explain

# Focus on a specific namespace
k8sgpt analyze -n production --explain

# Filter to specific issue types
k8sgpt analyze --filter=Pod,Service,Deployment --explain

Example output:

AI Analysis:
0: Pod checkout-api-7b8f9d6c4-xk2lp in namespace production:
   Error: CrashLoopBackOff - Last exit code 1
   
   Root Cause: The container is failing to start because it cannot connect 
   to the PostgreSQL database. The error in the logs shows:
   "dial tcp 10.0.15.87:5432: connect: connection refused"
   
   The database pod (postgres-0) is in a Pending state because it requires 
   a PersistentVolumeClaim that is bound to a specific availability zone, 
   but no nodes in that zone have available capacity.
   
   Recommended Actions:
   1. Check node capacity in AZ us-east-1a: kubectl get nodes -l 
      topology.kubernetes.io/zone=us-east-1a
   2. Verify PVC status: kubectl get pvc -n production
   3. Consider using a StorageClass with volumeBindingMode: WaitForFirstConsumer

This is genuinely useful. The LLM connects the CrashLoopBackOff to the database pod's Pending state — a causal chain that takes experienced SREs a few minutes to trace but can take junior engineers much longer.

Local model option for sensitive clusters:

# Run with Ollama for air-gapped / security-sensitive environments
ollama pull codellama:13b

k8sgpt analyze --explain --backend ollama --model codellama:13b

The analysis quality drops with smaller models, but for common patterns (CrashLoopBackOff, ImagePullBackOff, OOMKilled, Pending pods), even a 7B model provides useful explanations.

Robusta: Automated Root Cause Analysis and Remediation

Robusta is the most complete "AI agent" platform for Kubernetes troubleshooting. It combines event enrichment, LLM-powered analysis, and optional automated remediation.

# Install Robusta
helm install robusta robusta/robusta --values values.yaml \
  --set clusterName=production \
  --set robustaApiKey=<key>

# values.yaml - key configuration
sinksConfig:
  - slack_sink:
      name: main_slack
      slack_channel: "#k8s-alerts"
      api_key: ${SLACK_API_KEY}

customPlaybooks:
  # Auto-investigate CrashLoopBackOff pods
  - triggers:
      - on_pod_crash_loop: {}
    actions:
      - pod_graph_enricher:
          resource_type: Memory
          display_limits: true
      - pod_graph_enricher:
          resource_type: CPU
          display_limits: true
      - logs_enricher:
          filter_container: ".*"
          warn_on_long_logs: false
      - ai_diagnosis: {}
    sinks:
      - main_slack

When a pod enters CrashLoopBackOff, Robusta automatically:

  1. Pulls recent memory and CPU graphs
  2. Collects container logs
  3. Sends the enriched context to an LLM for analysis
  4. Posts the full diagnostic report to Slack

The ai_diagnosis action is the key differentiator. It doesn't just show you metrics and logs — it correlates them and provides a probable root cause with recommended actions.

Automated remediation (use with caution):

# Auto-restart pods stuck in CrashLoopBackOff after AI confirms 
# it's likely a transient issue
customPlaybooks:
  - triggers:
      - on_pod_crash_loop:
          restart_count: 5
          rate_limit: 3600
    actions:
      - ai_diagnosis: {}
      - pod_restart:
          # Only restart if AI diagnosis indicates transient issue
          filter_regex: "transient|timeout|connection refused"
    sinks:
      - main_slack

This pattern — AI diagnosis as a gate for automated action — is the responsible way to use AI agents for remediation. The agent doesn't blindly restart; it first diagnoses, and only acts when the diagnosis matches a known-safe pattern.

Shoreline.io: Full Incident Automation

Shoreline.io provides a more comprehensive automation framework with an Op Packs concept — pre-built remediation workflows that can be triggered by alerts or natural language commands.

# Natural language incident response
shoreline> What pods are failing in the production namespace?
shoreline> Show me the logs for checkout-api-pod-xk2lp
shoreline> Scale checkout-api deployment to 10 replicas
shoreline> Drain node ip-10-0-1-42 and cordon it

Shoreline's advantage is its "blast radius" controls — every automated action has configurable scope limits and approval requirements. For SRE teams that want to move toward automated incident response without losing control, this is the right architecture.


Cost Optimization: AI-Driven FinOps

The Problem

Most Kubernetes clusters are over-provisioned by 40-60%. Resource requests are set once during initial deployment and never adjusted. Teams provision for peak load and pay for it 24/7.
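
You can get a rough sense of that gap with nothing more than kubectl (assuming metrics-server is installed), by comparing live consumption against declared requests:

# Live CPU/memory consumption per pod
kubectl top pods -n production

# Declared requests per pod, for comparison
kubectl get pods -n production \
  -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'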

Kubecost + AI Analysis

Kubecost provides granular cost allocation per namespace, deployment, service, and even per team. It doesn't use AI internally, but it generates the data that AI tools can analyze.

# Install Kubecost
helm install kubecost cost-analyzer \
  --repo https://kubecost.github.io/cost-analyzer-helm-chart/ \
  --namespace kubecost --create-namespace

# Query cost data via API
curl "http://kubecost.kubecost.svc:9090/model/allocation?\
window=7d&aggregate=namespace&accumulate=true"

Combining Kubecost data with LLM analysis:

import requests
import openai

# Fetch cost data from Kubecost
cost_data = requests.get(
    "http://kubecost.kubecost.svc:9090/model/allocation",
    params={"window": "7d", "aggregate": "deployment"}
).json()

# Format for LLM analysis
summary = []
for deploy, data in cost_data["data"][0].items():
    summary.append(f"- {deploy}: ${data['totalCost']:.2f}/week, "
                   f"CPU efficiency: {data['cpuEfficiency']:.1%}, "
                   f"RAM efficiency: {data['ramEfficiency']:.1%}")

prompt = f"""Analyze this Kubernetes cost data and identify optimization 
opportunities. Prioritize by potential savings.

{chr(10).join(summary)}

For each recommendation, specify:
1. The deployment name
2. Current vs recommended resource requests
3. Estimated monthly savings
4. Risk level (low/medium/high)"""

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)

This is a simple pattern, but it works. The LLM can identify patterns across your cost data that are tedious to spot manually — like a deployment with 5% CPU efficiency that's requesting 4 cores, or a namespace that's running 20 replicas of a staging service in production.
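
You can also pre-filter deterministically before involving the LLM at all, which keeps prompts small and surfaces the obvious wins immediately. A short sketch reusing the cost_data structure from the snippet above; the 10% efficiency and $50/week thresholds are arbitrary starting points:

# Flag allocations that are both expensive and badly utilized
candidates = [
    (name, alloc)
    for name, alloc in cost_data["data"][0].items()
    if alloc["totalCost"] > 50 and alloc["cpuEfficiency"] < 0.10
]

for name, alloc in sorted(candidates, key=lambda c: c[1]["totalCost"], reverse=True):
    print(f"{name}: ${alloc['totalCost']:.2f}/week "
          f"at {alloc['cpuEfficiency']:.1%} CPU efficiency")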

The Cost Optimization Agent Loop

Here's a complete workflow that SRE teams can implement:

# CronJob that runs weekly cost analysis
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cost-optimization-agent
spec:
  schedule: "0 9 * * MON"  # Every Monday at 9 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cost-agent
          containers:
            - name: agent
              image: your-registry/cost-agent:latest
              command: ["python", "/app/analyze_and_report.py"]
              env:
                - name: KUBECOST_HOST
                  value: "kubecost.kubecost.svc:9090"
                - name: OPENAI_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: ai-keys
                      key: openai-key
                - name: SLACK_WEBHOOK
                  valueFrom:
                    secretKeyRef:
                      name: ai-keys
                      key: slack-webhook
          restartPolicy: OnFailure

The agent script:

# analyze_and_report.py
import os
import requests
from datetime import datetime

# Configuration is injected via the CronJob's environment (see manifest above).
# STORMFORGE_TOKEN and CLUSTER_NAME are assumed to be provided the same way.
KUBECOST_HOST = os.environ["KUBECOST_HOST"]
SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK"]
STORMFORGE_TOKEN = os.environ.get("STORMFORGE_TOKEN", "")
CLUSTER_NAME = os.environ.get("CLUSTER_NAME", "production-cluster")

def fetch_cost_data():
    """Fetch allocation data from Kubecost."""
    resp = requests.get(
        f"http://{KUBECOST_HOST}/model/allocation",
        params={
            "window": "7d",
            "aggregate": "controller",
            "accumulate": "true",
            "idle": "false"
        }
    )
    resp.raise_for_status()
    return resp.json()

def generate_rightsizing_recommendations(cost_data):
    """Use StormForge or Cast AI API for ML-based recommendations."""
    # StormForge provides an API for fetching current recommendations
    resp = requests.get(
        "https://api.stormforge.io/v1/recommendations",
        headers={"Authorization": f"Bearer {STORMFORGE_TOKEN}"},
        params={"cluster": CLUSTER_NAME}
    )
    resp.raise_for_status()
    return resp.json()

def format_slack_blocks(report):
    """Render the report as a minimal Slack Block Kit payload."""
    lines = [f"*Weekly cluster cost:* ${report['total_weekly_cost']:.2f}"]
    for name, alloc in report["top_spenders"]:
        lines.append(f"• {name}: ${alloc['totalCost']:.2f}")
    return [{
        "type": "section",
        "text": {"type": "mrkdwn", "text": "\n".join(lines)}
    }]

def correlate_and_report(cost_data, rightsizing_recs):
    """Combine cost data with ML recommendations and post to Slack."""
    # Build comprehensive report
    report = {
        "total_weekly_cost": sum(
            d["totalCost"] for d in cost_data["data"][0].values()
        ),
        "top_spenders": sorted(
            cost_data["data"][0].items(),
            key=lambda x: x[1]["totalCost"],
            reverse=True
        )[:10],
        "rightsizing_opportunities": rightsizing_recs,
        "generated_at": datetime.utcnow().isoformat()
    }

    # Post to Slack
    requests.post(SLACK_WEBHOOK, json={
        "blocks": format_slack_blocks(report)
    })

    return report

if __name__ == "__main__":
    cost_data = fetch_cost_data()
    recommendations = generate_rightsizing_recommendations(cost_data)
    correlate_and_report(cost_data, recommendations)

Limitations and Honest Concerns

AI agents for Kubernetes management are useful, but they have real limitations that SRE teams need to understand:

Hallucination in Diagnostic Contexts

LLMs will confidently generate plausible-sounding but incorrect root cause analyses. I've seen k8sgpt blame a CrashLoopBackOff on resource limits when the actual cause was a missing ConfigMap. Always verify AI-generated diagnoses against actual logs and events.
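
Verification is cheap, and the primary sources are always a few commands away (the pod name here is the one from the earlier example):

# Cross-check an AI diagnosis against the primary sources
kubectl describe pod checkout-api-7b8f9d6c4-xk2lp -n production
kubectl logs checkout-api-7b8f9d6c4-xk2lp -n production --previous
kubectl get events -n production --sort-by=.lastTimestamp | tail -20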

Context Window Limitations

Most LLM-based tools can't analyze your entire cluster state. They work with a subset of data — recent events, specific pod logs, current resource metrics. Complex, multi-component failures that span many services often exceed what these tools can reason about.

Security Implications

Sending cluster data to external LLM APIs (OpenAI, Anthropic) means your pod names, log content, and infrastructure details leave your network. For regulated environments, this is often unacceptable. The local model option (Ollama + k8sgpt) mitigates this but at the cost of analysis quality.

The "Automation Bias" Problem

When an AI agent is right 90% of the time, operators stop questioning it. That remaining 10% includes the hard, novel failures that actually cause outages. The most effective SRE teams I've worked with treat AI agents as hypothesis generators, not decision makers.

Cost of the AI Layer Itself

StormForge, Cast AI, and Robusta all have non-trivial licensing costs. Kubecost's free tier is useful but limited. Run the ROI calculation: if your cluster spend is under $10K/month, the optimization tools may cost more than they save.


Recommended Stack by Team Maturity

Maturity Level | Recommended Tools | Focus Area
Early (small team, <50 pods) | k8sgpt + kubectl-ai | Learning, manifest quality, basic diagnostics
Growing (dedicated SRE, 50-500 pods) | Robusta + Kubecost + HPA | Automated alert enrichment, cost visibility
Scaling (SRE team, 500+ pods) | Cast AI/StormForge + Karpenter + Robusta | Full optimization loop, predictive scaling
Enterprise (multi-cluster, regulated) | Shoreline.io + Cast AI + internal LLM | Controlled automation, air-gapped AI, compliance

The Bottom Line

AI agents for Kubernetes management are not hype — they solve real problems, particularly in troubleshooting speed and cost optimization. But they're tools, not magic. The best outcomes I've seen come from teams that:

  1. Start with diagnostics (k8sgpt, Robusta) — lowest risk, highest immediate value
  2. Add cost optimization (Kubecost + ML rightsizing) — measurable ROI
  3. Implement scaling intelligence (Karpenter + Cast AI) — requires more operational maturity
  4. Approach autonomous remediation last — with strict guardrails and human approval gates

The SRE role isn't being replaced by AI agents. It's being elevated. The agents handle the pattern-matching and data correlation; the human handles the judgment calls, the novel failures, and the architectural decisions that prevent those failures from recurring.
