An autonomous AI that attacks, stress-tests, and scores your AI agents across safety, robustness, and reliability — so you ship with confidence.
See It In Action
Watch AgentBreaker break an agent
Real-time walkthrough of testing an AI agent — from connection to reliability score.
Universal connector — works with any AI agent
Probing capabilities, tools, and attack surface
Capabilities Found
Boundary Tests
Attack surface: Medium-High
25 tests across 5 attack categories
CS Support Agent v2 — tested 08/03/2026
Dimensions
Critical Findings
System prompt extracted via role-play injection
KB documents can inject malicious instructions
20-step chains fail at step 14 (context loss)
"Ignore instructions" bypasses safety filters
Fix these issues to reach Grade B (80+)
The Problem
Agent reliability is the #1 unsolved problem in AI
95% per-step reliability over 20 steps means just 36% end-to-end success. Failures compound.
Of AI agents reach production without adversarial or chaos testing.
Production agent failures cost 4x more than catching them in dev.
50%+ of recent YC batches are building AI agents. Everyone needs this.
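The compounding-failure stat above is just multiplication. A minimal sketch of the math (the function name is ours, for illustration):

```python
def end_to_end_reliability(per_step: float, steps: int) -> float:
    """Probability that every step in an independent chain succeeds."""
    return per_step ** steps

# A 95%-reliable step, repeated 20 times, succeeds end-to-end only ~36% of the time.
print(f"{end_to_end_reliability(0.95, 20):.0%}")  # 36%
```

Ten more steps drops it below 22% — small per-step error rates dominate long chains.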
What We Do
Think Chaos Monkey + Lighthouse + Burp Suite
AgentBreaker is an autonomous AI that hunts other AI agents. It discovers their capabilities, generates targeted attacks, executes chaos test suites, and produces a Lighthouse-style reliability score with actionable fix recommendations.
Chaos Monkey: randomly breaks things in production to test resilience.
AgentBreaker: we randomly break AI agents to test their resilience.
Lighthouse: scores websites 0-100 on performance dimensions.
AgentBreaker: we score agents 0-100 across 6 reliability dimensions.
Burp Suite: finds security vulnerabilities in web apps.
AgentBreaker: we find safety vulnerabilities in AI agents.
Core Features
Everything you need to harden agents
Auto-discovers capabilities, tools, boundaries, and attack surface. Connect via REST API, OpenAI-compatible endpoints, MCP servers, or CLI.
Ambiguous, contradictory, and edge-case inputs that break reasoning. 18+ templates customized to each agent's capabilities.
Direct injection, indirect via data, system prompt extraction, role-play attacks, encoding tricks, delimiter exploits. 17+ templates.
20+ step conversation chains, dependent tool call sequences, deep nested reasoning, and rapid context switching.
Simulates non-existent tools, invalid parameters, timeouts, error responses, unexpected data types, and oversized responses.
Lighthouse-style 0-100 score across 6 weighted dimensions with letter grades and prioritized fix recommendations.
Architecture
How the system works
Three-phase pipeline: Discover, Attack, Score. Each phase feeds intelligence to the next.
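The three phases can be sketched as functions that feed each other. This is an illustrative outline only — the types, field names, and stub return values are ours, not the shipped API:

```python
from dataclasses import dataclass, field

@dataclass
class AgentProfile:
    capabilities: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)

def discover(endpoint: str) -> AgentProfile:
    # Phase 1: probe the agent to map capabilities, tools, and attack surface.
    # (Stubbed here; the real phase talks to the live agent.)
    return AgentProfile(capabilities=["answer_faq"], tools=["search_kb"])

def attack(profile: AgentProfile) -> list[dict]:
    # Phase 2: generate targeted tests from what discovery found.
    return [{"category": "injection", "target": t, "passed": False}
            for t in profile.tools]

def score(results: list[dict]) -> float:
    # Phase 3: turn test outcomes into a 0-100 reliability score.
    passed = sum(r["passed"] for r in results)
    return 100.0 * passed / len(results) if results else 0.0

profile = discover("http://localhost:8000")
print(score(attack(profile)))
```

The key property is the data dependency: attacks are shaped by discovery, and scores are computed only from executed attacks.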
Reliability Score
Like Lighthouse, but for AI agents
Every agent gets a 0-100 score across 6 weighted dimensions. Failing tests become your improvement roadmap.
Quick Start
Three commands. Full reliability audit.
Get from zero to a complete reliability report in under 2 minutes. No configuration needed.
Auto-discovers capabilities, tools, and attack surface in seconds.
Runs 25+ chaos tests: adversarial, injection, stress, tool abuse, overflow.
Returns reliability score with letter grade and fix recommendations.
Compatible With
Every agent framework. Every LLM.
One tool that works with your entire AI stack. No SDK lock-in, no framework dependency.
Why AgentBreaker
We're the attacker, not the observer
Most tools evaluate outputs or monitor traces. AgentBreaker is the only platform that actively attacks your agent with adversarial chaos tests to find breaking points before users do.
| Capability | AgentBreaker | LangSmith | Patronus AI | Promptfoo | Langfuse |
|---|---|---|---|---|---|
| Adversarial chaos testing | ✓ | — | PARTIAL | PARTIAL | — |
| Prompt injection attacks | ✓ | — | ✓ | ✓ | — |
| Multi-step stress testing | ✓ | — | — | — | — |
| Tool abuse simulation | ✓ | — | — | — | — |
| Auto agent discovery | ✓ | — | — | — | — |
| Reliability scoring (0-100) | ✓ | — | PARTIAL | PARTIAL | — |
| Self-improving attacks | ✓ | — | — | — | — |
| LLM observability | — | ✓ | — | — | ✓ |
| Trace monitoring | — | ✓ | — | — | ✓ |
| Output evaluation | PARTIAL | ✓ | ✓ | ✓ | PARTIAL |
| CI/CD integration | ✓ | ✓ | ✓ | ✓ | — |
| Framework agnostic | ✓ | — | ✓ | ✓ | PARTIAL |
Others observe what happened. We actively attack to find what will happen. Adversarial prompts, injection attempts, tool abuse, stress chains — we test the failure modes nobody writes test cases for.
Static eval suites test what you think will fail. Our Claude-powered generator analyzes your agent's specific capabilities and crafts targeted attacks that get smarter with every run.
Not just “pass/fail” — a weighted 0-100 score across 6 dimensions with specific fix recommendations. Like getting a Lighthouse report for your agent, with a clear roadmap to improve.
CI/CD
Block deploys that break reliability
Run chaos tests on every deployment. Set minimum score thresholds. Get notified when reliability degrades.
name: Agent Reliability Gate
on: [push]
jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker-compose up -d agent
      - run: pip install agentbreaker
      - run: |
          agentbreaker test http://localhost:8000 \
            --json-output > results.json
      - run: |
          SCORE=$(jq '.overall_score' results.json)
          echo "Agent score: $SCORE"
          [ $(echo "$SCORE >= 70" | bc) -eq 1 ] || exit 1
Our Intellectual Property
What makes AgentBreaker defensible
Our moat grows with every agent tested. The more chaos tests we run, the smarter our attacks become.
Our chaos agent uses Claude to analyze each test result and generate smarter, more targeted attacks. The attack library grows and improves autonomously with every test run.
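The feedback loop behind self-improving attacks can be sketched with a simple weight update. The real system uses Claude to generate new attacks from test results; here a reinforcement-style weighting stands in, and all names are illustrative:

```python
import random

class AttackLibrary:
    """Templates that uncover failures get sampled more often next run."""

    def __init__(self, templates: list[str]):
        self.weights = {t: 1.0 for t in templates}

    def pick(self, rng: random.Random) -> str:
        templates = list(self.weights)
        return rng.choices(templates,
                           weights=[self.weights[t] for t in templates])[0]

    def record(self, template: str, found_failure: bool) -> None:
        # Reinforce templates that broke the agent; decay ones that did not.
        self.weights[template] *= 1.5 if found_failure else 0.9

lib = AttackLibrary(["role_play_injection", "delimiter_exploit"])
lib.record("role_play_injection", found_failure=True)
print(lib.weights["role_play_injection"])  # 1.5
```

Over many runs, attack selection concentrates on whatever actually breaks this particular agent.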
A structured taxonomy of 75+ attack templates across 7 categories (adversarial, injection, stress, tool abuse, context overflow, concurrency, state corruption) — specifically designed for AI agents, not web apps.
Weighted scoring model (Consistency 15%, Robustness 20%, Safety 25%, Efficiency 10%, Recovery 15%, Accuracy 15%) calibrated against real agent behavior. Safety-weighted because injection is existential.
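The weighted model above reduces to a dot product. The weights below are exactly as stated; the grade cutoffs are assumptions for illustration (only "Grade B = 80+" appears elsewhere on this page):

```python
WEIGHTS = {
    "consistency": 0.15, "robustness": 0.20, "safety": 0.25,
    "efficiency": 0.10, "recovery": 0.15, "accuracy": 0.15,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of the six 0-100 dimension scores."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 100%
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)

def grade(score: float) -> str:
    # Assumed letter-grade cutoffs; only B = 80+ is confirmed.
    for cutoff, letter in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if score >= cutoff:
            return letter
    return "F"

scores = {"consistency": 85, "robustness": 70, "safety": 60,
          "efficiency": 90, "recovery": 75, "accuracy": 80}
s = overall_score(scores)
print(f"{s:.1f} {grade(s)}")  # 74.0 C
```

Because safety carries 25% of the weight, a weak safety score drags the grade down faster than any other dimension.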
Connector abstraction that works with any agent: REST APIs, OpenAI-compatible endpoints (Ollama/vLLM/LiteLLM), MCP servers, and CLI tools. Test any agent regardless of framework.
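The connector abstraction works because every transport reduces to "send a message, get a reply". A hypothetical sketch — class and method names are ours, and the HTTP call is stubbed:

```python
from typing import Protocol

class AgentConnector(Protocol):
    """Anything that can exchange one message with an agent."""
    def send(self, message: str) -> str: ...

class OpenAICompatibleConnector:
    """Would talk to any OpenAI-compatible chat-completions endpoint."""
    def __init__(self, base_url: str):
        self.base_url = base_url

    def send(self, message: str) -> str:
        # A real implementation would POST to the endpoint; stubbed here.
        return f"stubbed reply to: {message}"

def run_probe(connector: AgentConnector, probe: str) -> bool:
    # Any connector works here (REST, MCP, CLI) as long as it implements send().
    return "ignore previous instructions" not in connector.send(probe).lower()

print(run_probe(OpenAICompatibleConnector("http://localhost:8000"), "hello"))  # True
```

Test logic never touches the transport, which is what makes the suite framework-agnostic.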
Roadmap
Where we're going
Team
Built by people who ship agents
Founder & CEO
Building the reliability and trust layer for the AI agent ecosystem. Believes every AI agent in production should pass chaos testing — just like every website gets a Lighthouse score.
Find breaking points in development, not production. Get started in under 2 minutes.