Site Reliability Engineering

Keeping the show running.
Always.

We build and maintain the infrastructure that powers millions of entertainment experiences — from movie nights to sold-out concerts.

99.99% Availability

50M+ Monthly Users

<200ms P95 Latency

Who We Are

Engineering reliability
at entertainment scale

The SRE team at BookMyShow owns the reliability, scalability, and performance of one of India's largest entertainment platforms. We bridge software engineering and operations to ensure that every ticket booking, every movie search, and every live event streaming experience is fast, available, and delightful.

From Coldplay concerts selling out in minutes to IPL matches drawing millions simultaneously — we design systems that hold up when it matters most.

☁️

Cloud-Native on AWS

Multi-region architecture with auto-scaling, fault isolation, and disaster recovery built in.

📊

Observability First

Full-stack monitoring with Grafana, Prometheus, and distributed tracing across every service.

🤖

Automate Everything

Toil reduction through Ansible, Terraform, and intelligent runbooks — humans for decisions, not repetition.

What We Do

Four pillars of reliability

Reliability Engineering

We define and track SLIs, SLOs, and error budgets across all critical user journeys — from search to seat selection to payment.

SLO definition & tracking
Error budget management
Chaos engineering
Capacity planning
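
As a rough illustration of how an error budget follows from an SLO, the sketch below converts an availability target into allowed downtime and reports how much of the budget is left. The function names, the 30-day window, and the sample numbers are assumptions made for illustration; they are not BookMyShow's actual tooling or data.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    if total_events == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = total_events * (1.0 - slo)
    actual_failures = total_events - good_events
    if allowed_failures == 0:
        # A 100% SLO leaves no budget at all.
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # A 99.99% target over 30 days allows about 4.32 minutes of downtime.
    print(round(allowed_downtime_minutes(0.9999), 2))
    # Hypothetical journey: 500 failed requests out of 1,000,000 at a 99.9% SLO.
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 2))
```

Under these assumptions, half of the budget is still available in the second case, which is the kind of signal a team might use to decide whether to keep shipping or slow down.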

Observability

End-to-end visibility into our systems through metrics, logs, and traces — so we know about problems before users do.

Prometheus + Grafana dashboards
Distributed tracing (Jaeger)
Centralized log aggregation
Anomaly detection & alerting
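
To make a latency target such as the P95 figure quoted at the top of this page concrete, here is a minimal sketch of a nearest-rank percentile check. The helper names, the sample data, and the 200 ms threshold used as an alert condition are illustrative assumptions; production setups would typically compute this from histogram metrics in the monitoring system instead.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def breaches_latency_target(samples_ms, threshold_ms=200, pct=95):
    """True when the chosen percentile exceeds the (assumed) threshold."""
    return percentile_nearest_rank(samples_ms, pct) > threshold_ms


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds.
    latencies = [120] * 18 + [180, 950]
    print(percentile_nearest_rank(latencies, 95))
    print(breaches_latency_target(latencies))
```

A percentile is used here instead of a mean because a single slow outlier (950 ms in the sample) would distort an average while leaving most users' experience unchanged.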

Automation & CI/CD

Reducing toil through intelligent automation of deployments, scaling, and incident response so engineers can focus on what matters.

Terraform & Ansible IaC
Jenkins / GitHub Actions CI/CD
Kubernetes orchestration
GitOps workflows

Incident Management

Structured on-call rotations, fast incident response, and blameless post-mortems to continuously improve our resilience.

PagerDuty on-call management
Runbooks & playbooks
Blameless post-mortems
MTTR optimization
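
One simple way to track MTTR is to average the time between detection and resolution across incidents. The sketch below assumes each incident is recorded as a pair of timestamps; the record shape and the sample incidents are hypothetical and do not reflect any real incident data.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average (resolved_at - detected_at) over a list of datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so that timedeltas can be summed
    )
    return total / len(incidents)


if __name__ == "__main__":
    # Hypothetical incidents: one resolved in 30 minutes, one in 60 minutes.
    sample = [
        (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
        (datetime(2024, 1, 9, 22, 15), datetime(2024, 1, 9, 23, 15)),
    ]
    print(mean_time_to_recovery(sample))
```

Tracking this value per quarter or per service is one way to check whether runbooks and on-call changes are actually shortening recovery.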

Our Tools

The stack behind the scenes

Cloud & Infrastructure

AWS EC2, AWS EKS, AWS RDS, AWS S3, Route 53, VPC, CloudFront

Orchestration & Automation

Kubernetes, Docker, Helm, Terraform, Ansible, ArgoCD

CI/CD

Jenkins, GitHub Actions, Bamboo, SonarQube

Observability

Prometheus, Grafana, Jaeger, ELK Stack, PagerDuty, Datadog

Languages & Frameworks

Python, Go, Java, Bash, Spring Boot

Databases & Messaging

MySQL, Redis, Kafka, Elasticsearch, MongoDB

The Team

People keeping the lights on

Viraj Bharvada, Head of SRE

Avani Singhal, Principal Engineer, DevOps/SRE

Riyaz Manihar, SRE - 2

Learn SRE

SRE best practices

A foundational overview of Site Reliability Engineering — what it is, how it works, and why it matters at scale.

IBM Technology · YouTube

What is Site Reliability Engineering (SRE)?

Covers SRE fundamentals including SLIs, SLOs, error budgets, monitoring, and how SRE bridges the gap between development and operations — directly applicable to how we work at BookMyShow.

SLOs & SLIs, Error Budgets, Monitoring, Incident Response

Register for SRE Training
