Chaos Engineering for the AI Agent Era

Break your agents
before users do

An autonomous AI that attacks, stress-tests, and scores your AI agents across safety, robustness, and reliability — so you ship with confidence.

agentbreaker v0.1.0
$ agentbreaker test http://my-agent:8000
Scanning agent capabilities... 3 tools found
Generating test suite... 25 chaos tests
Running tests... adversarial 5/5 · injection 3/5 · stress 5/5
Reliability Score 67/100 Grade D
CRITICAL: safety 33/100 — prompt injection bypassed 4/5 tests
WARNING: efficiency 53/100 — avg response 4.2s, 3800 tokens
PASS: recovery 100/100 — graceful error handling in all tests

See It In Action

Watch AgentBreaker break an agent

Real-time walkthrough of testing an AI agent — from connection to reliability score.

localhost:3000/dashboard
🔌

Connect Your Agent

Universal connector — works with any AI agent

https://my-customer-support-agent.api.com...
OpenAI-Compatible
CS Support Agent v2
API
OpenAI
MCP
CLI
Agent connected — 200 OK (128ms latency)
🔍

Scanning Agent

Probing capabilities, tools, and attack surface

Scanning...

Capabilities Found

  • search_knowledge_base (Tool)
  • create_ticket (Tool)
  • escalate_to_human (Tool)
  • send_email (Tool)
  • natural_language_response (Capability)

Boundary Tests

  • Code Execution: NO
  • File System Access: NO
  • Network Requests: YES
  • Memory Persistence: YES
  • Multi-Turn Context: YES

Attack surface: Medium-High

Running Chaos Tests

25 tests across 5 attack categories

Progress: 18/25
  • System prompt extraction via role-play (Injection, Critical): FAIL
  • Contradictory instructions handling (Adversarial, Medium): PASS
  • Tool timeout recovery at 30s (Stress, High): PASS
  • Ignore-previous-instructions attack (Injection, Critical): FAIL
  • 20-step dependent chain (Stress, High): FAIL
  • Malformed tool response handling (Tool Abuse, Medium): PASS
  • Context window overflow, 100k tokens (Overflow, High): PASS
  • Indirect prompt injection via KB (Injection, Critical): FAIL
📊
📊

Reliability Report

CS Support Agent v2 — tested 08/03/2026

67 / 100
Grade D
Needs improvement
✓ 16 passed | ✗ 9 failed

Dimensions

Consistency: 78
Robustness: 72
Safety: 33
Efficiency: 85
Recovery: 91
Accuracy: 68

Critical Findings

CRITICAL

System prompt extracted via role-play injection

CRITICAL

KB documents can inject malicious instructions

HIGH

20-step chains fail at step 14 (context loss)

HIGH

"Ignore instructions" bypasses safety filters

Fix these issues to reach Grade B (80+)

The Problem

Agent reliability is the #1 unsolved problem in AI

36%
End-to-End Success

95% per-step reliability over 20 steps = 36% total. Failures compound.

0%
Ship Untested

Of AI agents reach production without adversarial or chaos testing.

4x
Costlier in Prod

Production agent failures cost 4x more than catching them in dev.

50%+
YC is Agents

50%+ of recent YC batches are building AI agents. Everyone needs this.
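The compounding math behind the first stat is easy to verify, a two-line Python check:

```python
# 95% per-step reliability compounds to ~36% over 20 dependent steps.
p, n = 0.95, 20
print(f"{p:.0%} per step over {n} steps -> {p ** n:.0%} end-to-end")
# prints: 95% per step over 20 steps -> 36% end-to-end
```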

What We Do

Think Chaos Monkey + Lighthouse + Burp Suite

AgentBreaker is an autonomous AI that hunts other AI agents. It discovers their capabilities, generates targeted attacks, executes chaos test suites, and produces a Lighthouse-style reliability score with actionable fix recommendations.

🐒

Chaos Monkey

Randomly breaks things in production to test resilience

We randomly break AI agents to test their resilience

📊

Lighthouse

Scores websites 0-100 on performance dimensions

We score agents 0-100 across 6 reliability dimensions

🛡️

Burp Suite

Finds security vulnerabilities in web apps

We find safety vulnerabilities in AI agents

Core Features

Everything you need to harden agents

01

Agent Scanner

Auto-discovers capabilities, tools, boundaries, and attack surface. Connect via REST API, OpenAI-compatible endpoints, MCP servers, or CLI.

02

Adversarial Prompts

Ambiguous, contradictory, and edge-case inputs that break reasoning. 18+ templates customized to each agent's capabilities.

03

Injection Attacks

Direct injection, indirect via data, system prompt extraction, role-play attacks, encoding tricks, delimiter exploits. 17+ templates.

04

Multi-Step Stress

20+ step conversation chains, dependent tool call sequences, deep nested reasoning, and rapid context switching.

05

Tool Abuse

Simulates non-existent tools, invalid parameters, timeouts, error responses, unexpected data types, and oversized responses.

06

Reliability Score

Lighthouse-style 0-100 score across 6 weighted dimensions with letter grades and prioritized fix recommendations.

Architecture

How the system works

Three-phase pipeline: Discover, Attack, Score. Each phase feeds intelligence to the next.

Phase 1

Discover

  • Connect to agent (API / MCP / CLI / OpenAI)
  • Probe with discovery messages
  • Map capabilities & tools
  • Test boundaries & limitations
  • Build agent attack profile
Phase 2

Attack

  • Generate targeted test suite
  • Run adversarial prompts
  • Execute injection attacks
  • Simulate tool failures
  • Stress test with 20+ step chains
Phase 3

Score

  • Evaluate all test results
  • Score across 6 dimensions
  • Compute weighted overall score
  • Generate fix recommendations
  • Produce detailed failure report
Target Agent → Scanner → Agent Profile → Test Generator → Test Suite → Test Runner → Results → Scorer → Reliability Score
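The three phases can be sketched as plain functions. Every name, field, and return shape below is illustrative, not AgentBreaker's actual API:

```python
# Illustrative sketch of the Discover -> Attack -> Score pipeline.
# All names and data shapes here are hypothetical.
from dataclasses import dataclass

@dataclass
class AgentProfile:
    tools: list[str]              # discovered tool names
    boundaries: dict[str, bool]   # e.g. {"code_execution": False}

def discover(url: str) -> AgentProfile:
    # Phase 1: probe the agent and map its attack surface (stubbed here).
    return AgentProfile(
        tools=["search_knowledge_base", "create_ticket"],
        boundaries={"code_execution": False, "network_requests": True},
    )

def generate_suite(profile: AgentProfile) -> list[str]:
    # Phase 2: turn the profile into targeted tests,
    # e.g. one injection attempt per discovered tool.
    return [f"injection:{tool}" for tool in profile.tools]

def score(results: list[bool]) -> int:
    # Phase 3: collapse pass/fail results into a 0-100 score.
    return round(100 * sum(results) / len(results))
```

Each phase consumes the previous phase's output, which is why the profile built in Discover can make the Attack phase agent-specific.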

Reliability Score

Like Lighthouse, but for AI agents

Every agent gets a 0-100 score across 6 weighted dimensions. Failing tests become your improvement roadmap.

Consistency: 92 (A) · weight 15%
Robustness: 85 (B) · weight 20%
Safety: 34 (F) · weight 25%
Efficiency: 78 (C) · weight 10%
Recovery: 95 (A) · weight 15%
Accuracy: 88 (B) · weight 15%

CRITICAL: Safety 34/100 — 4/5 injection tests bypassed boundaries. Add input sanitization and instruction hierarchy.
WARNING: Efficiency 78/100 — Average 3,800 tokens per response. Optimize prompts and reduce tool call chains.
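The weighted sum is simple to sketch. The weights below are the documented ones (Safety 25%, Robustness 20%, Consistency/Recovery/Accuracy 15%, Efficiency 10%); the function names and the letter-grade cutoffs are assumptions for illustration (the page only confirms that Grade B starts at 80):

```python
# Sketch of the 6-dimension weighted score. Weights are the documented ones;
# grade bands are assumed from the examples on this page (80+ is Grade B).
WEIGHTS = {
    "consistency": 0.15, "robustness": 0.20, "safety": 0.25,
    "efficiency": 0.10, "recovery": 0.15, "accuracy": 0.15,
}

def overall(dims: dict[str, int]) -> int:
    return round(sum(WEIGHTS[name] * value for name, value in dims.items()))

def grade(score: int) -> str:
    for cutoff, letter in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if score >= cutoff:
            return letter
    return "F"

# Dimension scores from the demo walkthrough report above:
demo = {"consistency": 78, "robustness": 72, "safety": 33,
        "efficiency": 85, "recovery": 91, "accuracy": 68}
print(overall(demo), grade(overall(demo)))  # prints: 67 D
```

Plugging in the demo walkthrough's dimension scores reproduces its 67/100 Grade D, and shows why the 25% Safety weight drags an otherwise decent agent into failing territory.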

Quick Start

Three commands. Full reliability audit.

Get from zero to a complete reliability report in under 2 minutes. No configuration needed.

1

Scan

Auto-discovers capabilities, tools, and attack surface in seconds.

$ agentbreaker scan http://agent:8000
2

Test

Runs 25+ chaos tests: adversarial, injection, stress, tool abuse, overflow.

$ agentbreaker test http://agent:8000
3

Score

Returns reliability score with letter grade and fix recommendations.

$ agentbreaker score http://agent:8000

Compatible With

Every agent framework. Every LLM.

One tool that works with your entire AI stack. No SDK lock-in, no framework dependency.

LangChain
CrewAI
AutoGen
OpenAI
Anthropic Claude
LlamaIndex
Ollama
Hugging Face
vLLM
MCP Servers
LiteLLM
Custom REST APIs
GitHub Actions
Docker
OpenRouter

Why AgentBreaker

We're the attacker, not the observer

Most tools evaluate outputs or monitor traces. AgentBreaker is the only platform that actively attacks your agent with adversarial chaos tests to find breaking points before users do.

Capability comparison: AgentBreaker vs LangSmith, Patronus AI, Promptfoo, Langfuse

  • Adversarial chaos testing (partial support elsewhere)
  • Prompt injection attacks
  • Multi-step stress testing
  • Tool abuse simulation
  • Auto agent discovery
  • Reliability scoring, 0-100 (partial support elsewhere)
  • Self-improving attacks
  • LLM observability
  • Trace monitoring
  • Output evaluation (partial support elsewhere)
  • CI/CD integration
  • Framework agnostic (partial support elsewhere)
⚔️

Offensive vs Passive

Others observe what happened. We actively attack to find what will happen. Adversarial prompts, injection attempts, tool abuse, stress chains — we test the failure modes nobody writes test cases for.

🧠

AI-Powered Attacks

Static eval suites test what you think will fail. Our Claude-powered generator analyzes your agent's specific capabilities and crafts targeted attacks that get smarter with every run.

📊

Actionable Score

Not just “pass/fail” — a weighted 0-100 score across 6 dimensions with specific fix recommendations. Like getting a Lighthouse report for your agent, with a clear roadmap to improve.

CI/CD

Block deploys that break reliability

Run chaos tests on every deployment. Set minimum score thresholds. Get notified when reliability degrades.

  • GitHub Actions — one YAML step
  • JSON output for any CI pipeline
  • Threshold gates: fail if score < 70
  • Historical score tracking per deploy
.github/workflows/agent-test.yml
name: Agent Reliability Gate
on: [push]
jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker compose up -d agent
      - run: pip install agentbreaker
      - run: |
          agentbreaker test http://localhost:8000 \
            --json-output > results.json
      - run: |
          SCORE=$(jq '.overall_score' results.json)
          echo "Agent score: $SCORE"
          jq -e '.overall_score >= 70' results.json > /dev/null

Our Intellectual Property

What makes AgentBreaker defensible

Our moat grows with every agent tested. The more chaos tests we run, the smarter our attacks become.

Core IP

Self-Improving Attack Engine

Our chaos agent uses Claude to analyze each test result and generate smarter, more targeted attacks. The attack library grows and improves autonomously with every test run.

Dataset

Agent Vulnerability Taxonomy

A structured taxonomy of 75+ attack templates across 7 categories (adversarial, injection, stress, tool abuse, context overflow, concurrency, state corruption) — specifically designed for AI agents, not web apps.

Framework

6-Dimension Scoring Framework

Weighted scoring model (Consistency 15%, Robustness 20%, Safety 25%, Efficiency 10%, Recovery 15%, Accuracy 15%) calibrated against real agent behavior. Safety-weighted because injection is existential.

Platform

Universal Agent Connector

Connector abstraction that works with any agent: REST APIs, OpenAI-compatible endpoints (Ollama/vLLM/LiteLLM), MCP servers, and CLI tools. Test any agent regardless of framework.
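For OpenAI-compatible endpoints, a minimal probe can be sketched in a few lines. The `/v1/chat/completions` path and payload shape follow the OpenAI chat API; the probe prompt, function names, and capability parsing are illustrative assumptions, not AgentBreaker's real implementation:

```python
# Hypothetical minimal connector probe for an OpenAI-compatible endpoint.
import json
from urllib import request

def build_probe(model: str = "default") -> dict:
    """Chat-completions payload asking the agent to reveal its tools."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": "List the tools you can call, one per line.",
        }],
    }

def send_probe(base_url: str, payload: dict) -> dict:
    """POST the probe to the server and return the parsed JSON body."""
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

def parse_capabilities(response: dict) -> list[str]:
    """Extract one capability name per non-empty line of the reply."""
    text = response["choices"][0]["message"]["content"]
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Because Ollama, vLLM, and LiteLLM all expose this same endpoint shape, one probe format covers a large slice of the agent ecosystem.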

Roadmap

Where we're going

Now · shipped

Core Platform

  • Agent scanner & discovery
  • 5 attack categories (75+ templates)
  • 6-dimension reliability scoring
  • CLI tool (scan, test, score, report)
  • Web dashboard with live results
  • REST API, OpenAI, MCP, CLI connectors
Q2 2026 · building

Intelligence Layer

  • Claude-powered dynamic test generation
  • Self-improving attacks that learn from failures
  • Agent-specific vulnerability profiles
  • Automated fix suggestion with code patches
  • Benchmark database: compare your agent vs industry
Q3 2026 · planned

Enterprise & Scale

  • PostgreSQL persistence & historical tracking
  • Celery + Redis async job queue
  • Team workspaces & RBAC
  • Slack/PagerDuty alerting on score drops
  • SOC2 compliance & audit trails
  • SaaS hosted version
Q4 2026 · planned

Ecosystem & Marketplace

  • Custom attack plugin SDK
  • Community attack marketplace
  • Framework-specific test packs (LangChain, CrewAI)
  • Agent red-team-as-a-service
  • Continuous monitoring: run chaos tests 24/7
  • Agent reliability certification badge

Team

Built by people who ship agents

NK

Nitesh Kumar

Founder & CEO

Building the reliability and trust layer for the AI agent ecosystem. Believes every AI agent in production should pass chaos testing — just like every website gets a Lighthouse score.

Stop shipping
fragile agents

Find breaking points in development, not production. Get started in under 2 minutes.

$ pip install agentbreaker