An AI-mentored lab where you take a bare trading platform from "runs on my laptop" to production-ready on AWS. No videos. No hand-holding. Real infrastructure, real tools.
Most SRE training teaches you tools. We teach you instincts.
Each phase builds on the last. No skipping ahead. No shortcuts.
Multi-stage Docker builds, non-root users, health checks, Compose networking. You'll containerize a Python trading API and React frontend from scratch.
Build a real CI pipeline: workflows, caching, secrets management, build matrices, and image scanning with Trivy. Every push runs the gauntlet.
Instrument the stack with OpenTelemetry. Metrics flow to Prometheus, traces to Tempo, logs to Loki. Grafana ties it all together. Then something breaks.
Define what "reliable" means for your service. Build service level indicators, set objectives, calculate error budgets, and wire up burn-rate alerts that actually work.
Live incidents are injected into your running stack. Latency spikes, cascading failures, resource exhaustion. Triage under pressure.
Take everything to production. Terraform for infrastructure, ECS Fargate for workloads, OIDC federation for auth, rolling deploys with health gates.
You deploy to actual AWS accounts with real Terraform state, real ECS services, and real network topology. Everything you build persists across sessions.
An AI that acts like a senior SRE — it pushes back on bad answers, quizzes you after each task, and refuses to give you the answer.
Prometheus, Grafana, Tempo, and Loki running locally. You instrument the code, build dashboards, and debug from real telemetry data.
The mentor injects real failures into your running system — latency spikes, cascading errors, resource exhaustion. You triage under pressure with your dashboards and logs.
Phase A is free — no credit card required. Just you, a terminal, and an AI mentor that won't let you take shortcuts.
Start the Lab