Mid-air airplane repair: troubleshooting at WhatsApp

Simple, reliable messaging. It takes a lot to live up to this statement. For 10 years WhatsApp has demonstrated unprecedented reliability and availability while serving over 1.5B users. There is no way to reproduce the interactions between all of them in a cluster spanning over 10,000 nodes and multiple datacenters. Investigations must be done on a live system without disturbing connected users. If repairs are needed, they have to be made on the fly.

This talk will guide you through the debugging and troubleshooting techniques used at WhatsApp. Maxim will share a few case studies and explain monitoring, introspection, performance analysis, and tools.

Some knowledge of Erlang and C is necessary.

OBJECTIVES

Share processes, best practices, tools and war stories from 10 years of running a reliable messaging service.

TARGET AUDIENCE

Software developers, DevOps, Site Reliability Engineers, System Administrators and everyone else interested in troubleshooting live production systems.

ARTICLES: 1

How to serve 1.5 billion active users at the same time - scaling Erlang cluster to 10,000 nodes

Article by Maxim Fedorov

With the user population growing beyond 1.7B and new capabilities being added at the same time, there is little chance of keeping the server footprint as small as it used to be.
