Site Reliability Engineering

Keeping the show running.
Always.

We build and maintain the infrastructure that powers millions of entertainment experiences — from movie nights to sold-out concerts.

99.99% Availability
50M+ Monthly Users
<200ms P95 Latency

Engineering reliability
at entertainment scale

The SRE team at BookMyShow owns the reliability, scalability, and performance of one of India's largest entertainment platforms. We bridge software engineering and operations to ensure that every ticket booking, every movie search, and every live event streaming experience is fast, available, and delightful.

From Coldplay concerts selling out in minutes to IPL matches drawing millions of users at once, we design systems that hold up when it matters most.

☁️
Cloud-Native on AWS

Multi-region architecture with auto-scaling, fault isolation, and disaster recovery built in.

📊
Observability First

Full-stack monitoring with Grafana, Prometheus, and distributed tracing across every service.

🤖
Automate Everything

Toil reduction through Ansible, Terraform, and intelligent runbooks — humans for decisions, not repetition.

Four pillars of reliability

Reliability Engineering

We define and track SLIs, SLOs, and error budgets across all critical user journeys — from search to seat selection to payment.

  • SLO definition & tracking
  • Error budget management
  • Chaos engineering
  • Capacity planning
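
As a rough illustration of how an availability SLO turns into an error budget, here is a minimal Python sketch. The function names, the 30-day window, and the request counts are assumptions made for the example; they do not describe BookMyShow's actual tooling or traffic.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over a window, in minutes.

    slo_target is a fraction, e.g. 0.9999 for "99.99%".
    The 30-day default window is an assumption for illustration.
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been breached.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.99% target over 30 days leaves roughly 4.3 minutes of downtime.
    print(f"{error_budget_minutes(0.9999):.2f} minutes")
    # Hypothetical traffic: 1,000,000 requests with 25 failures.
    print(f"{budget_remaining(0.9999, 1_000_000, 25):.0%} of budget left")
```

A team could compare the remaining budget against a threshold to decide when to slow down releases; the threshold itself is a policy choice, not something the code can decide.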

Observability

End-to-end visibility into our systems through metrics, logs, and traces — so we know about problems before users do.

  • Prometheus + Grafana dashboards
  • Distributed tracing (Jaeger)
  • Centralized log aggregation
  • Anomaly detection & alerting

Automation & CI/CD

Reducing toil through intelligent automation of deployments, scaling, and incident response so engineers can focus on what matters.

  • Terraform & Ansible IaC
  • Jenkins / GitHub Actions CI/CD
  • Kubernetes orchestration
  • GitOps workflows

Incident Management

Structured on-call rotations, fast incident response, and blameless post-mortems to continuously improve our resilience.

  • PagerDuty on-call management
  • Runbooks & playbooks
  • Blameless post-mortems
  • MTTR optimization
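
To make the MTTR bullet concrete, the sketch below computes mean time to recovery from a list of incident timestamps. The function name and the incident data are invented for illustration; in practice these values would come from an incident tracker's export.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) across incidents.

    Each incident is a (detected_at, resolved_at) pair.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    # Made-up incidents, not real data: one lasting 30 minutes, one lasting 60.
    sample = [
        (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
        (datetime(2024, 1, 9, 22, 15), datetime(2024, 1, 9, 23, 15)),
    ]
    print(mean_time_to_recovery(sample))
```

Tracking this figure per quarter, alongside post-mortem action items, is one simple way to see whether recovery is getting faster.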

The stack behind the scenes

Cloud & Infrastructure

AWS EC2, AWS EKS, AWS RDS, AWS S3, Route 53, VPC, CloudFront

Orchestration & Automation

Kubernetes, Docker, Helm, Terraform, Ansible, ArgoCD

CI/CD

Jenkins, GitHub Actions, Bamboo, SonarQube

Observability

Prometheus, Grafana, Jaeger, ELK Stack, PagerDuty, Datadog

Languages & Frameworks

Python, Go, Java, Bash, Spring Boot

Databases & Messaging

MySQL, Redis, Kafka, Elasticsearch, MongoDB

People keeping the lights on

Viraj Bharvada, Head of SRE
Avani Singhal, Principal Engineer, DevOps/SRE

SRE best practices

A foundational overview of Site Reliability Engineering — what it is, how it works, and why it matters at scale.

IBM Technology · YouTube

What is Site Reliability Engineering (SRE)?

Covers SRE fundamentals including SLIs, SLOs, error budgets, monitoring, and how SRE bridges the gap between development and operations — directly applicable to how we work at BookMyShow.

SLOs & SLIs, Error Budgets, Monitoring, Incident Response