Phase A is free — no credit card required

Learn SRE by doing.

An AI-mentored lab where you take a bare trading platform from "runs on my laptop" to production-ready on AWS. No videos. No hand-holding. Real infrastructure, real tools.


br-mentor — zsh

❯ br-mentor start

BLAST RADIUS LAB - Phase E: Chaos Engineering

⚠ INCIDENT INJECTED: Latency spike on /api/orders

Your Grafana dashboard is live at localhost:3000

Traces are flowing to Tempo. Logs to Loki.

MENTOR: P99 latency just crossed your SLO threshold.

MENTOR: What's the first thing you check?

❯ _

The Problem

Tutorial hell doesn't prepare you
for a 3am production incident.

Most SRE training teaches you tools. We teach you instincts.

Traditional Learning

Watch someone else's terminal
Copy-paste from a guide
Toy infrastructure that vanishes
"Congratulations!" after every step

Blast Radius

Build your own infrastructure from zero
An AI mentor that pushes back and quizzes you
Real AWS, real Terraform, real incidents
Controlled chaos that tests your instincts

Curriculum

Six phases. One production-ready system.

Each phase builds on the last. No skipping ahead. No shortcuts.

Containerization

Multi-stage Docker builds, non-root users, health checks, Compose networking. You'll containerize a Python trading API and React frontend from scratch.

Docker, Compose, FastAPI

CI Pipeline

Build a real CI pipeline: workflows, caching, secrets management, build matrices, and image scanning with Trivy. Every push runs the gauntlet.

GitHub Actions, Trivy, GHCR

Observability

Instrument the stack with OpenTelemetry. Metrics flow to Prometheus, traces to Tempo, logs to Loki. Grafana ties it all together. Then something breaks.

OpenTelemetry, Prometheus, Grafana, Tempo, Loki
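
A tail-latency percentile such as P99 is one of the signals a stack like this surfaces. As a rough illustration of what that number means, here is a minimal sketch that computes a nearest-rank percentile from raw latency samples and compares it to a threshold. The function names and the 250 ms threshold are hypothetical, chosen only for this example; in the lab the value would come from your own Prometheus metrics, not from a script like this.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Rank is 1-based; clamp to 1 so pct=0 returns the minimum.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def breaches_threshold(samples_ms, threshold_ms, pct=99):
    """True when the chosen percentile exceeds the latency threshold."""
    return percentile(samples_ms, pct) > threshold_ms


# Hypothetical data: 98 fast requests and 2 slow ones.
latencies = [40] * 98 + [900, 1200]
p99 = percentile(latencies, 99)
breached = breaches_threshold(latencies, threshold_ms=250)
```

Two slow requests out of a hundred are enough to push the P99 past the threshold even though the median stays at 40 ms, which is why tail percentiles are usually preferred over averages for latency objectives.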

SLI / SLO

Define what "reliable" means for your service. Build service level indicators, set objectives, calculate error budgets, and wire up burn-rate alerts that actually work.

Prometheus, Alertmanager, Error Budgets
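
The arithmetic behind error budgets and burn rates is short enough to sketch. The example below assumes an availability SLO expressed as a fraction (0.999 for 99.9%) and a 30-day window; the function names and the paging threshold are illustrative assumptions, not the lab's actual alert rules, which would normally be written as Prometheus expressions.

```python
def error_budget(slo_target):
    """Fraction of requests (or time) allowed to fail under the SLO."""
    return 1.0 - slo_target


def budget_minutes(slo_target, window_days=30):
    """Error budget expressed as minutes of full downtime in the window."""
    return error_budget(slo_target) * window_days * 24 * 60


def burn_rate(observed_error_ratio, slo_target):
    """How fast the budget is being spent. 1.0 means the budget would be
    used up exactly at the end of the window; higher means sooner."""
    return observed_error_ratio / error_budget(slo_target)


def should_page(long_window_ratio, short_window_ratio, slo_target, threshold):
    """Multi-window check: alert only if both a long and a short window
    are burning above the threshold, which filters out brief blips and
    stops alerting soon after recovery."""
    return (
        burn_rate(long_window_ratio, slo_target) >= threshold
        and burn_rate(short_window_ratio, slo_target) >= threshold
    )


# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of budget.
minutes = budget_minutes(0.999)
# A 1.44% error ratio against that SLO is a burn rate of about 14.4.
rate = burn_rate(0.0144, 0.999)
```

A burn rate of 14.4 sustained for one hour consumes about 2% of a 30-day budget (14.4 / 720 hours), which is why that figure is a commonly used starting point for a fast-burn page. Treat the threshold as something to tune for your own service.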

Chaos Engineering

Live incidents are injected into your running stack. Latency spikes, cascading failures, resource exhaustion. Triage under pressure.

Fault Injection, Incident Response, Runbooks
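
To give a feel for what a latency fault looks like at the code level, here is a minimal sketch of a decorator that delays a fraction of calls. It is an assumption-laden illustration, not a description of how the lab's mentor injects incidents: `inject_latency`, `place_order`, and the parameter values are all hypothetical.

```python
import random
import time
from functools import wraps


def inject_latency(probability, delay_s, rng=random.random, sleep=time.sleep):
    """Wrap a function so that roughly `probability` of calls are delayed
    by `delay_s` seconds. `rng` and `sleep` are injectable so the fault
    can be made deterministic in tests."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < probability:
                sleep(delay_s)  # the injected fault: added latency only
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical handler: record the delay instead of actually sleeping.
recorded_delays = []


@inject_latency(probability=1.0, delay_s=0.5, sleep=recorded_delays.append)
def place_order(symbol, qty):
    return {"symbol": symbol, "qty": qty, "status": "accepted"}


result = place_order("ACME", 10)
```

The wrapped function still returns the correct result, only later, so the fault shows up in latency percentiles and traces rather than in error counts.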

CD to AWS

Take everything to production. Terraform for infrastructure, ECS Fargate for workloads, OIDC federation for auth, rolling deploys with health gates.

Terraform, AWS ECS, OIDC, Route 53

How it Works

Not a course. A proving ground.

Real Infrastructure, Not Simulations

You deploy to actual AWS accounts with real Terraform state, real ECS services, and real network topology. Everything you build persists across sessions.

AI Mentor

An AI that acts like a senior SRE — it pushes back on bad answers, quizzes you after each task, and refuses to give you the answer.

Full Observability Stack

Prometheus, Grafana, Tempo, and Loki running locally. You instrument the code, build dashboards, and debug from real telemetry data.

Controlled Chaos

The mentor injects real failures into your running system — latency spikes, cascading errors, resource exhaustion. You triage under pressure with your dashboards and logs.
