Stop spending 30 minutes investigating incidents. Let AI do it in seconds. Here is a hands-on demo you can practice in 15 minutes.

The Problem

3 AM. Production is down. You are doing this:

  1. Open CloudWatch → Check metrics
  2. Open Datadog → Review traces
  3. Open Splunk → Search logs
  4. Check GitHub → Find recent deployments
  5. Correlate everything manually → Find root cause

Time: 20-40 minutes of context switching and log correlation.

What if AI could do all of this in seconds?

The Solution: AWS DevOps Agent

Announced at AWS re:Invent 2025, AWS DevOps Agent is an AI service that automatically investigates incidents by:

  • Analyzing logs, metrics, and traces across multiple tools
  • Mapping infrastructure dependencies automatically
  • Recommending fixes to prevent future incidents
  • Integrating with your existing DevOps stack

Status: Public preview (us-east-1). Free during preview.

Who Should Use This?

Perfect For

  • On-call engineers who spend hours investigating incidents
  • SREs managing complex distributed systems
  • Platform teams running multi-account AWS environments
  • DevOps engineers correlating deployments with failures

Skip If

  • You run simple applications with clear failure modes
  • You rarely experience incidents
  • You do not rely heavily on AWS services

My Test: Real Results

I deployed a Lambda function with an intentional error and let the AI investigate.

Setup

  • Lambda function with division-by-zero error
  • CloudWatch alarm monitoring failures
  • 3 error-generating invocations
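
For orientation, a failing handler of this kind can be very small. The sketch below is an illustrative stand-in, not the repository's lambda_test.py: the handler name lambda_handler and the line layout are assumptions, so the failing line number here will not necessarily match the one in the agent's report.

```python
# Hypothetical stand-in for lambda_test.py (not the repository's actual file).

def lambda_handler(event, context):
    """Fail on every invocation so the CloudWatch alarm has errors to count."""
    # Intentional bug: dividing by zero raises ZeroDivisionError.
    result = 1 / 0
    return {"statusCode": 200, "body": str(result)}
```

Because the exception is raised before the return statement, every invocation ends in an unhandled error, which is what the alarm is watching for.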

Results

What the AI found in seconds:

The Lambda function contains intentional test code that throws ZeroDivisionError at line 9 in lambda_test.py with the literal expression ‘result = 1 / 0’. This is not a production bug but an expected test behavior.

What impressed me:

  1. Context-aware: Understood it was test code, not a bug
  2. Complete timeline: Linked deployment time to first error
  3. Exact location: Found the error on line 9
  4. Impact analysis: Calculated 100% failure rate
  5. Fast: The AI analysis itself finished in seconds, and the investigation took about 4 minutes end to end
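
The timeline and impact figures are simple to reproduce by hand. The sketch below shows the arithmetic behind a "100% failure rate" and a "deployment to first error" gap. It is not the agent's actual method, and the timestamps are made up for illustration.

```python
from datetime import datetime, timezone


def failure_rate(errors, invocations):
    """Return the share of invocations that failed, as a percentage."""
    if invocations == 0:
        return 0.0
    return 100.0 * errors / invocations


def seconds_to_first_error(deployed_at, error_times):
    """Seconds from the deployment to the first error at or after it, or None."""
    after = [t for t in error_times if t >= deployed_at]
    if not after:
        return None
    return (min(after) - deployed_at).total_seconds()


# Made-up timestamps, for illustration only.
deployed = datetime(2025, 12, 1, 3, 0, 0, tzinfo=timezone.utc)
errors = [datetime(2025, 12, 1, 3, 2, s, tzinfo=timezone.utc) for s in (5, 10, 15)]

print(failure_rate(errors=3, invocations=3))     # 3 of 3 invocations failed
print(seconds_to_first_error(deployed, errors))  # gap between deploy and first error
```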

Before vs After

Task                 Manual      AI Agent    Savings
Check metrics        2-3 min     Auto        100%
Review logs          3-5 min     Auto        100%
Check deployments    5-10 min    Auto        100%
Correlate timeline   5-10 min    Auto        100%
Root cause           5-10 min    Seconds     90%
Total                20-40 min   ~4 min      80-90%
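
The 80-90% total follows directly from the two ends of the manual range. A quick check, using only the numbers from the Total row:

```python
def savings_percent(manual_minutes, agent_minutes):
    """Percentage of time saved relative to the manual investigation."""
    return 100.0 * (manual_minutes - agent_minutes) / manual_minutes


AGENT_MINUTES = 4  # the "~4 min" total from the comparison

for manual in (20, 40):
    saved = savings_percent(manual, AGENT_MINUTES)
    print(f"{manual} min manual -> {saved:.0f}% saved")
```

A 20-minute manual investigation gives 80% and a 40-minute one gives 90%, which is where the range comes from.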

Three Core Features

1. AI Investigation

Auto-triggers from:

  • ServiceNow tickets
  • PagerDuty alerts
  • Datadog/Dynatrace/Splunk webhooks
  • Slack commands

What it analyzes:

  • CloudWatch metrics, logs, alarms
  • Third-party observability data
  • Deployment history from GitHub/GitLab
  • Infrastructure topology
  • Historical incident patterns

Delivers:

  • Root cause with reasoning
  • Event timeline
  • Blast radius analysis
  • Mitigation steps

2. Topology Discovery

Automatically maps your AWS infrastructure:

  • Resources across all accounts
  • Service dependencies
  • Links to source code
  • Deployment history

Use it to:

  • Understand blast radius during incidents
  • See cascading failure patterns
  • Assess change impact

3. Incident Prevention

After analyzing multiple incidents, the AI recommends:

  • Observability: “Add alarm for Lambda cold starts”
  • Testing: “Add load testing to pipeline”
  • Code: “Implement retry logic for API calls”
  • Infrastructure: “Enable Multi-AZ for RDS”

Integrations

Works with your existing tools:

Observability: CloudWatch • Datadog • Dynatrace • New Relic • Splunk

CI/CD: GitHub • GitLab

Ticketing: ServiceNow • PagerDuty

Chat: Slack

Kubernetes: Amazon EKS

Custom: MCP servers for proprietary tools

Try It: 15-Minute Demo

A hands-on demo using Terraform for infrastructure and manual Agent Space setup through the AWS Console.

Prerequisites

  • AWS account with admin access
  • AWS CLI v2 + Terraform installed
  • Region: us-east-1

Quick Start

1. Clone & Deploy Infrastructure

git clone https://github.com/sprider/aws-devops-agent-demo.git
cd aws-devops-agent-demo
chmod +x lambda-test.sh
./lambda-test.sh deploy

This automatically creates:

  • Lambda function with intentional error
  • CloudWatch alarm

2. Create Agent Space (Manual - AWS Console)

The Agent Space must be created through the AWS Console to ensure proper Primary source configuration.

  1. Open the AWS DevOps Agent Console
  2. Click “Begin setup” or “Create Agent Space”
  3. Configure:
    • Name: TestAgentSpace (or your preferred name)
    • Description: Test Agent Space for Lambda error investigation demo
  4. Click “Create”

3. Configure Cloud Capabilities (Primary Source)

After Agent Space creation, configure AWS account access:

  1. In your Agent Space, go to “Settings” → “Cloud capabilities”
  2. Click “Add cloud capability”
  3. Select “AWS”
  4. Choose “Primary source” (not Secondary)
  5. Configuration:
    • Account ID: Your AWS account (from terraform output aws_account_id)
    • IAM Role: Use “Auto-create role” option
  6. Click “Add”

Note: The IAM roles required for the DevOps Agent are automatically created by AWS when you select “Auto-create role” - you do not need to create them manually. The Primary source configuration ensures the agent can properly access CloudWatch alarms, Lambda logs, and other AWS resources needed for investigations.

4. Generate Lambda Errors

./lambda-test.sh test
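
If you would rather trigger the errors from Python than from the script, the sketch below does roughly the same thing. It is not what lambda-test.sh runs internally. It assumes a boto3 Lambda client is passed in, and the function name AWS-AIDevOps-test-lambda is inferred from the log group name used in the cleanup step, so check it against your Terraform output.

```python
import json


def invoke_and_count_errors(lambda_client, function_name, times=3):
    """Invoke the function `times` times and return how many invocations failed."""
    failures = 0
    for _ in range(times):
        response = lambda_client.invoke(
            FunctionName=function_name,
            Payload=json.dumps({}).encode("utf-8"),
        )
        # The response carries "FunctionError" only when the function itself failed.
        if "FunctionError" in response:
            failures += 1
    return failures
```

With boto3 installed and credentials configured, a call might look like invoke_and_count_errors(boto3.client("lambda", region_name="us-east-1"), "AWS-AIDevOps-test-lambda"), which should return 3 for the intentionally broken function.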

5. Wait for Alarm to Trigger

After generating errors, wait 1-2 minutes for the CloudWatch alarm to evaluate and enter ALARM state:

./lambda-test.sh status

Wait until you see AlarmState: ALARM before proceeding to the next step.
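
Instead of re-running the status command by hand, you can poll the alarm. The sketch below is one way to do it, assuming a boto3 CloudWatch client is passed in. The timeout and poll interval are arbitrary choices, not values from the demo script.

```python
import time


def wait_for_alarm(cloudwatch, alarm_name, timeout_s=300, poll_s=15):
    """Poll until the alarm reports ALARM; return False if the timeout passes first."""
    deadline = time.monotonic() + timeout_s
    while True:
        response = cloudwatch.describe_alarms(AlarmNames=[alarm_name])
        alarms = response.get("MetricAlarms", [])
        if alarms and alarms[0]["StateValue"] == "ALARM":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A call might look like wait_for_alarm(boto3.client("cloudwatch", region_name="us-east-1"), "AWS-AIDevOps-Lambda-Error-Test"). The client is a parameter so the loop can be exercised with a stub before pointing it at a real account.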

6. Start Investigation

  1. In the AWS DevOps Agent Console, click on your Agent Space name (e.g., “TestAgentSpace”)
  2. Click the “Incident Response” tab
  3. In the “Start an investigation” text box, type: Lambda function throwing errors
  4. Click “Start investigation” button
  5. A modal will appear - fill in the investigation details:
    • Investigation details: Keep “Lambda function throwing errors”
    • Investigation starting point: CloudWatch alarm AWS-AIDevOps-Lambda-Error-Test
    • Date and time of incident: Get current time with date -u +"%Y-%m-%dT%H:%M:%SZ"
  6. Click “Start investigating…”
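
If the date command is not available (for example on Windows), the same UTC timestamp format can be produced in Python. The helper name incident_timestamp is made up for this sketch.

```python
from datetime import datetime, timezone


def incident_timestamp(now=None):
    """Format a moment as UTC in the same layout as date -u +"%Y-%m-%dT%H:%M:%SZ"."""
    moment = now or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(incident_timestamp())
```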

7. Watch AI Work

Watch the investigation in real-time. The AI will:

  • Detect the alarm
  • Pull Lambda logs
  • Identify ZeroDivisionError
  • Correlate deployment time
  • Provide root cause

Investigation time: In seconds

8. Cleanup Everything

Important: Before running ./lambda-test.sh destroy, manually delete the Agent Space and its auto-created resources from the AWS Console, in this order:

  1. Delete Agent Space:
    • Go to AWS DevOps Agent Console
    • Select your Agent Space
    • Click “Actions” → “Delete Agent Space”
    • Confirm deletion
    • Note: This automatically removes the IAM roles created by the Agent Space
  2. Delete Lambda Log Group:
    • Go to CloudWatch Console → Log groups
    • Find /aws/lambda/AWS-AIDevOps-test-lambda
    • Select it and click “Actions” → “Delete log group(s)”
    • Confirm deletion
  3. Verify IAM Roles Cleanup (Optional):
    • Go to IAM Console → Roles
    • Search for roles created by the Agent Space (they usually have “DevOpsAgent” or “AIDevOps” in the name)
    • These should be automatically deleted when the Agent Space is deleted
    • If any remain, manually delete them
  4. Then run: ./lambda-test.sh destroy
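
Steps 2 and 3 can also be scripted. The sketch below assumes boto3 CloudWatch Logs and IAM clients are passed in. The helper names are made up, and the name markers come from the "usually have" hint above, so treat matches as candidates to review, not as a definitive list. It only lists roles and never deletes them.

```python
LOG_GROUP = "/aws/lambda/AWS-AIDevOps-test-lambda"


def delete_log_group(logs_client, name=LOG_GROUP):
    """Delete the Lambda log group left behind after the demo."""
    logs_client.delete_log_group(logGroupName=name)


def list_role_names(iam_client):
    """Collect every IAM role name in the account via the list_roles paginator."""
    names = []
    for page in iam_client.get_paginator("list_roles").paginate():
        names.extend(role["RoleName"] for role in page["Roles"])
    return names


def leftover_agent_roles(role_names, markers=("DevOpsAgent", "AIDevOps")):
    """Return role names containing any marker substring, for manual review."""
    return [name for name in role_names if any(m in name for m in markers)]
```

A match may also be a role that Terraform still manages for the demo, so run this check before deciding what to remove by hand.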

All Available Commands

./lambda-test.sh deploy    # Deploy Lambda and CloudWatch alarm
./lambda-test.sh test      # Generate Lambda errors (invoke 3 times)
./lambda-test.sh status    # Check CloudWatch alarm status
./lambda-test.sh logs      # View Lambda function logs
./lambda-test.sh destroy   # Destroy all infrastructure
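
When reading the output of the logs command, the useful part is the innermost frame of the Python traceback, which is where the agent's "line 9" style finding comes from. The sketch below shows how that frame can be pulled out of traceback text. It produces a real traceback locally so as not to invent log output, and the function names are hypothetical.

```python
import re
import traceback

# Matches the standard frame line of a Python traceback.
FRAME = re.compile(r'File "(?P<file>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def last_frame(traceback_text):
    """Return (file, line, function) of the innermost traceback frame, or None."""
    matches = list(FRAME.finditer(traceback_text))
    if not matches:
        return None
    m = matches[-1]
    return m["file"], int(m["line"]), m["func"]


def failing():
    return 1 / 0


try:
    failing()
except ZeroDivisionError:
    text = traceback.format_exc()

print(last_frame(text))
```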

Cost

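$0.00 for this demo, assuming your account is within AWS Free Tier limits for Lambda and CloudWatch. AWS DevOps Agent itself is free during the preview.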

Troubleshooting

Issue: “AWS account is not accessible” or “Monitor Association not found”

Error message in investigation:

Unable to investigate the Lambda function errors because AWS account XXX
is not accessible. The error 'Monitor Association with AgentSpace agentSpaceId
XXX not found' indicates this account is not associated with the monitoring system.

Root cause: Your AWS account is not configured as a Primary source in Cloud Capabilities.

Solution:

  1. Open your Agent Space in AWS Console
  2. Go to Settings → Cloud capabilities
  3. Check if your AWS account is listed under “Primary sources”
  4. If not listed or listed under “Secondary sources”:
    • Click “Add cloud capability”
    • Select “AWS”
    • CRITICAL: Choose “Primary source” (NOT Secondary)
    • Enter your AWS account ID (from terraform output aws_account_id)
    • Use “Auto-create role” option
    • Click “Add”
  5. Verify your account now appears under “Primary sources”
  6. Try the investigation again

Why this matters: Only Primary sources give the AI agent full access to CloudWatch alarms, Lambda logs, and other AWS resources needed for investigations.

Key Facts

What It Is

  • AI layer that connects your existing tools
  • Not a monitoring tool replacement
  • Reduces investigation time by 80-90%

Limitations (Preview)

  • Region: us-east-1 only
  • Quotas: 20 investigation hours/month, 10 prevention hours/month
  • Pricing: Free now, pricing TBD at GA

Security

  • Read-only permissions by default
  • IAM-based access control
  • Agent Space isolation
  • AWS IAM Identity Center support

Common Questions

Q: Does it replace my observability tools?
A: No. It sits on top of them, connecting data across tools.

Q: What if the AI is wrong?
A: You are in control. Ask follow-up questions, steer investigations, or escalate to AWS Support.

Q: How secure is it?
A: Very. Read-only by default, IAM-controlled, data stays in your account.

Q: Works with non-AWS tools?
A: Yes. Integrates with Datadog, Dynatrace, New Relic, Splunk, GitHub, GitLab, ServiceNow, Slack.

Next Steps

After testing:

  1. Connect production - Create Agent Space for real environment
  2. Enable auto-triggers - Set up ServiceNow/PagerDuty webhooks
  3. Review recommendations - Implement prevention suggestions
  4. Expand scope - Connect multiple AWS accounts

Files in This Repo

aws-devops-agent-demo/
├── README.md                 # This guide
├── lambda-test.tf            # Terraform: Lambda and CloudWatch alarm
├── lambda_test.py            # Test Lambda function (division by zero)
├── lambda-test.sh            # Automation script for deployment
├── .gitignore                # Git ignore file
└── screenshots/              # Step-by-step screenshots of the demo
    ├── 01-terraform-deploy.png
    ├── 02-terraform-output.png
    ├── 03-devops-agent-console.png
    ├── 04-create-agent-space.png
    ├── 05-cloud-capabilities.png
    ├── 06-lambda-errors-generated.png
    ├── 07-cloudwatch-alarm-triggered.png
    ├── 08-incident-response-dashboard.png
    ├── 10-investigation-details-modal.png
    ├── 11-investigation-in-progress.png
    ├── 12-investigation-completed.png
    ├── 13-investigation-summary.png
    ├── 14-mitigation-plan.png
    └── 15-terraform-destroy.png

What is Automated vs Manual?

Automated via Terraform:

  • Lambda function with intentional error
  • CloudWatch alarm monitoring

Manual via AWS Console:

  • Agent Space creation
  • Cloud Capabilities configuration (Primary source setup + IAM role auto-creation)
  • Agent Space deletion (which automatically removes auto-created IAM roles)

Why Manual? The Agent Space requires Primary source configuration through the console to ensure the AI agent can properly access AWS resources during investigations. The AWS CLI cannot currently configure this correctly. When you delete the Agent Space, AWS automatically cleans up the auto-created IAM roles.

About This Article

This article and accompanying automation scripts were developed with assistance from Claude Code (Anthropic). All code has been tested in my personal AWS environment and verified against the official AWS DevOps Agent User Guide.

Resources