On‑call

What is on‑call?

On‑call is a work arrangement in which a person or team is scheduled to be reachable, often outside normal business hours, to respond to urgent technical problems such as system outages or critical bugs. When an incident occurs, the on‑call engineer is the first point of contact and is responsible for investigating, troubleshooting, and restoring service.

Let's break it down

Schedule: A calendar that assigns who is on‑call and when, often rotating weekly or bi‑weekly.
Pager/Alert system: A tool (like PagerDuty, Opsgenie, or VictorOps, now Splunk On‑Call) that sends notifications (SMS, phone call, app push) when an incident is detected.
Incident: Any event that disrupts normal operation, such as a server crash, high latency, or security breach.
Escalation: If the first responder doesn’t acknowledge the alert or can’t fix the issue within a set time, the alert is passed to a backup responder, a higher‑level engineer, or a manager.
Post‑mortem: After the incident is resolved, the team writes a short report on what happened, why it happened, and how to prevent it from recurring.
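
To make the schedule and escalation ideas concrete, here is a minimal sketch in Python. Everything specific in it is an assumption for illustration: the roster names, the rotation start date, the weekly shift length, and the 15- and 30-minute escalation delays are all hypothetical, and escalation is assumed to be triggered by time elapsed without an acknowledgement. Real alerting tools have their own configuration formats and do not necessarily work this way.

```python
from datetime import date

# Hypothetical roster and rotation start; neither comes from a real system.
ENGINEERS = ["alice", "bob", "carol"]
ROTATION_START = date(2024, 1, 1)  # a Monday
SHIFT_DAYS = 7                     # weekly rotation (assumed)


def on_call_for(day: date) -> str:
    """Return who is on-call on `day` under a simple round-robin rotation."""
    if day < ROTATION_START:
        raise ValueError("day is before the rotation starts")
    # Count how many full shifts have passed, then wrap around the roster.
    shift_index = (day - ROTATION_START).days // SHIFT_DAYS
    return ENGINEERS[shift_index % len(ENGINEERS)]


# Hypothetical escalation policy:
# (minutes without acknowledgement, who gets notified at that point)
ESCALATION_STEPS = [
    (0, "primary"),
    (15, "secondary"),
    (30, "engineering manager"),
]


def notified_after(minutes_unacknowledged: int) -> list[str]:
    """Return everyone notified once an alert has gone unacknowledged this long."""
    return [who for delay, who in ESCALATION_STEPS
            if minutes_unacknowledged >= delay]
```

With this roster, on_call_for(date(2024, 1, 8)) returns "bob" because one full week has passed since the rotation started, and notified_after(20) returns ["primary", "secondary"]. Keeping the policy as plain data rather than branching logic is only one possible design; it makes the delays easy to review and change.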

Why does it matter?

Reliability: Quick response reduces downtime, keeping services available for users.
Customer trust: Faster fixes mean happier customers and less damage to a company’s reputation.
Business impact: Every minute of outage can cost money; on‑call helps limit those losses.
Learning: Engineers gain real‑world experience with production systems, improving their skills.

Where is it used?

Site Reliability Engineering (SRE) teams at large internet companies.
DevOps and infrastructure groups that manage servers, databases, and cloud resources.
IT support departments handling internal corporate applications.
Managed service providers who monitor client systems 24/7.
Any organization that runs critical online services, from startups to enterprises.

Good things about it

Fast issue resolution keeps services running smoothly.
Shared responsibility spreads the workload across many engineers, so no single person carries the whole burden.
Skill development exposes engineers to production problems they wouldn’t see in development.
Clear process (alerts, escalation, post‑mortems) creates accountability and continuous improvement.
Customer confidence grows when customers know there’s a team ready to act at any hour.

Not-so-good things

Interruptions: On‑call can break sleep, family time, and personal routines.
Burnout risk: Rotations that are too long or incidents that are too frequent can wear engineers down.
On‑call fatigue: Tired responders may react more slowly or make mistakes.
Knowledge gaps: If the on‑call engineer isn’t familiar with a system, resolution can take longer.
Hand‑off challenges: Poor documentation can make it hard to transfer knowledge after a shift ends.