How to serve 1.5 billion active users at the same time - scaling an Erlang cluster to 10,000 nodes

Maxim Fedorov 23.10.2018

With a user population growing beyond 1.5B and new capabilities being added at the same time, there is little chance of keeping the server footprint as small as it used to be. When ten servers became too few, WhatsApp scaled their cluster to a hundred. When a hundred got too tight, they expanded it to 1,000. What’s next? 10,000? How is that possible, given the current scalability limits of a single Erlang cluster?

Maxim Fedorov will be giving a talk about the challenges the team at WhatsApp faced when they had to migrate from a hardware and software stack they had complete control over (SoftLayer: a relatively small fleet of powerful servers, an OS of their choice, Erlang R16) to infrastructure provided by Facebook: a large array of tightly packed machines running Linux.

Maxim and the team had already had some experience in creating large Erlang clusters, but they never expected to expand the fleet to over 10,000 nodes. It is known that an Erlang cluster can successfully scale to 50 machines, but how about more?

They managed to squeeze more than 1,500 nodes into a single distribution cluster, and more than 10,000 using their own transport.

In his talk at Code Mesh LDN 2018, Maxim will also be explaining the paradigm shift they experienced throughout this process. It changed the way they deploy code and monitor their systems, track bugs, prevent potential outages, and plan capacity requirements.

“Erlang scaling limits are not as tight as it is generally thought.”

Maxim will ask: what are the true limits of Erlang clusters?

The truth is they don’t yet know, and they’re already planning to push those limits as far as they can go. WhatsApp already serves 1.5B monthly active users and hopes to grow further, whilst still providing a reliable service and new features.

Breaking News

The team at WhatsApp are making their patches available to the Erlang community! Some have already been merged upstream, with more to follow.

Maxim Fedorov - Software Engineer at WhatsApp

Maxim Fedorov is a software engineer at WhatsApp whose work focuses on server-side performance and scalability.

