Phase A is free — no credit card required

Learn SRE by doing.

An AI-mentored lab where you take a bare trading platform from "runs on my laptop" to production-ready on AWS. No videos. No hand-holding. Real infrastructure, real tools.

br-mentor — zsh
br-mentor start
 
BLAST RADIUS LAB - Phase E: Chaos Engineering
 
⚠ INCIDENT INJECTED: Latency spike on /api/orders
Your Grafana dashboard is live at localhost:3000
Traces are flowing to Tempo. Logs to Loki.
 
MENTOR: P99 latency just crossed your SLO threshold.
MENTOR: What's the first thing you check?
 

Tutorial hell doesn't prepare you for a 3am production incident.

Most SRE training teaches you tools. We teach you instincts.

Traditional Learning

  • Watch someone else's terminal
  • Copy-paste from a guide
  • Toy infrastructure that vanishes
  • "Congratulations!" after every step

Blast Radius

  • Build your own infrastructure from zero
  • An AI mentor that pushes back and quizzes you
  • Real AWS, real Terraform, real incidents
  • Controlled chaos that tests your instincts

Six phases. One production-ready system.

Each phase builds on the last. No skipping ahead. No shortcuts.

A

Containerization

Multi-stage Docker builds, non-root users, health checks, Compose networking. You'll containerize a Python trading API and a React frontend from scratch.

Docker, Compose, FastAPI
B

CI Pipeline

Build a real CI pipeline: workflows, caching, secrets management, build matrices, and image scanning with Trivy. Every push runs the gauntlet.

GitHub Actions, Trivy, GHCR
C

Observability

Instrument the stack with OpenTelemetry. Metrics flow to Prometheus, traces to Tempo, logs to Loki. Grafana ties it all together. Then something breaks.

OpenTelemetry, Prometheus, Grafana, Tempo, Loki
D

SLI / SLO

Define what "reliable" means for your service. Build service level indicators, set objectives, calculate error budgets, and wire up burn-rate alerts that actually work.

Prometheus, Alertmanager, Error Budgets
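
To make the error-budget and burn-rate ideas concrete, here is a minimal Python sketch of the arithmetic. It is not the lab's own code: the `SLO` class, the function names, and the 14.4 paging threshold are illustrative assumptions (14.4 is the rate that would spend 2% of a 30-day budget in one hour). In the lab itself, alerts like this would live in Prometheus and Alertmanager rules rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Availability objective over a rolling window (hypothetical helper type)."""
    target: float          # e.g. 0.999 means 99.9% of requests must succeed
    window_days: int = 30  # assumed window length, for documentation only

    @property
    def error_budget(self) -> float:
        """Fraction of requests that may fail within the window."""
        return 1.0 - self.target


def burn_rate(slo: SLO, failed: int, total: int) -> float:
    """Observed error ratio divided by the budget; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    return (failed / total) / slo.error_budget


def budget_remaining(slo: SLO, failed: int, total: int) -> float:
    """Fraction of the error budget left, given counts over the whole window."""
    if total == 0:
        return 1.0
    allowed_failures = slo.error_budget * total
    return 1.0 - failed / allowed_failures


def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Multi-window rule: page only when both windows burn faster than threshold.

    The threshold default is an assumption, not a value taken from the lab.
    """
    return short_window_rate >= threshold and long_window_rate >= threshold
```

Requiring both a short and a long window to exceed the threshold is one common way to keep a brief blip from paging anyone while still catching a sustained burn.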
E

Chaos Engineering

Live incidents are injected into your running stack. Latency spikes, cascading failures, resource exhaustion. Triage under pressure.

Fault Injection, Incident Response, Runbooks
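
As a rough illustration of what a latency-spike injection and the resulting P99 check might look like, here is a small Python sketch. Everything in it is hypothetical: `with_injected_latency` and `p99` are names invented for this example, the percentile uses the simple nearest-rank method, and the lab's actual fault-injection mechanism is not described on this page.

```python
import random
import time
from typing import Any, Callable, Optional, Sequence


def with_injected_latency(
    handler: Callable[..., Any],
    delay_s: float,
    probability: float,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., Any]:
    """Wrap a handler so a fraction of calls are delayed before running."""
    rng = rng or random.Random()

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        # rng.random() is in [0.0, 1.0), so probability=1.0 always delays
        # and probability=0.0 never does.
        if rng.random() < probability:
            sleep(delay_s)
        return handler(*args, **kwargs)

    return wrapped


def p99(samples: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sequence of latencies."""
    if not samples:
        raise ValueError("p99 requires at least one sample")
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]
```

Passing `sleep` and `rng` as parameters keeps the wrapper testable without real delays, which is a design choice of this sketch rather than a requirement.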
F

CD to AWS

Take everything to production. Terraform for infrastructure, ECS Fargate for workloads, OIDC federation for auth, rolling deploys with health gates.

Terraform, AWS ECS, OIDC, Route 53
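
The idea behind a health gate in a rolling deploy can be sketched in a few lines of Python. This is a conceptual illustration only, not how ECS implements deployments: `wait_for_healthy` and its parameters are hypothetical, and the `check` callable stands in for whatever health probe a real pipeline would use.

```python
import time
from typing import Callable


def wait_for_healthy(
    check: Callable[[], bool],
    required_successes: int = 3,
    max_attempts: int = 10,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `check` passes `required_successes` times in a row.

    Returns False if that streak is not reached within `max_attempts` probes,
    which a deploy script could treat as the signal to stop or roll back.
    """
    streak = 0
    for attempt in range(max_attempts):
        if check():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0  # any failure resets the consecutive-success count
        if attempt < max_attempts - 1:
            sleep(interval_s)
    return False
```

Requiring consecutive successes, rather than a single passing probe, guards against promoting a task that is only intermittently healthy.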

Not a course. A proving ground.

Real Infrastructure, Not Simulations

You deploy to actual AWS accounts with real Terraform state, real ECS services, and real network topology. Everything you build persists across sessions.

AI Mentor

An AI that acts like a senior SRE — it pushes back on bad answers, quizzes you after each task, and refuses to give you the answer.

Full Observability Stack

Prometheus, Grafana, Tempo, and Loki running locally. You instrument the code, build dashboards, and debug from real telemetry data.

Controlled Chaos

The mentor injects real failures into your running system — latency spikes, cascading errors, resource exhaustion. You triage under pressure with your dashboards and logs.

6 Phases
12+ Tools & Technologies
1 Production System
0 Videos to Watch

Ready to build something real?

Phase A is free — no credit card required. Just you, a terminal, and an AI mentor that won't let you take shortcuts.
