You know what is cool? A Million Users. Here is how to scale an application to support millions of users.
Legend speaks of a time when engineers had to set up data centers and handle all the networking themselves to run a business on the internet. I am not a product of such times, and the cloud services I can access from my home computer would have sounded like nothing more than wishful thinking to computer engineers a couple of decades ago.
This is how we build products that can serve millions of users. We shall start by learning the concepts of scaling, and follow it by seeing how one can leverage services provided by a cloud service provider such as AWS.
Some sections of this article reference each other, so I recommend going through it twice if you find yourself not knowing some of the terms used within the text. I have done my best to at least briefly explain terms when they first appear. If not, it is likely they are covered somewhere further in the article. Going through the text a second time should help you grasp what you might have missed the first time.
On the other side of this, we shall know how to build highly available, resilient architectures that can serve tens of millions of users and beyond.
We start with a single server system.
User
Our user is a web or mobile client that accesses our web server. Think of this as the front-end of the application on the client device.
DNS
DNS is a third-party service that resolves domain names for us and our users.
Web Server
This is where everything about our application that does not happen on the client happens.
The next step in scaling our application is separating our Web tier and Data tier. We do this so we can scale them independently of each other.
Data Tier
Data Tier of our application is where we store our database.
We will have to choose between two types of database solutions: SQL and NoSQL. Which database to use can be answered by the following general rules.
- Use an RDBMS SQL solution if you have structured data that is related to each other.
- MySQL, Oracle, PostgreSQL are all good options if your data can be populated into well defined rows and columns.
- Use NoSQL options like Cassandra, HBase, DynamoDB when
- you do not want to lean on join operations.
- you want low latency.
- data is unstructured.
- data volume is huge.
- only serialization and deserialization of data is needed.
- Associate NoSQL with Key-Value pairs, Graph databases, Column databases, Blob stores, Document stores.
You will most likely be using both SQL and NoSQL for different use cases. You can always start your journey with SQL databases though. SQL databases have a very long legacy of serving use cases across the industry.
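To make the contrast concrete, here is a toy sketch, with Python's built-in sqlite3 standing in for an RDBMS and a plain dict standing in for a key-value NoSQL store (the table, keys, and values are all made up for illustration):

```python
import sqlite3

# SQL: structured, related data in well defined rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Alice', 'alice@example.com')")
row = conn.execute("SELECT name, email FROM users WHERE id = 1").fetchone()

# NoSQL (key-value style): schema-free items, only serialized and deserialized.
kv_store = {}
kv_store["user#1"] = {"name": "Alice", "email": "alice@example.com", "tags": ["admin"]}
```

The relational side pays for its structure with schemas and join costs; the key-value side gives up joins in exchange for simple, low-latency get/put access.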
The next level of evolution for our application would look like:
Load Balancer
A load balancer is used to distribute incoming traffic among a set of web servers. What we are representing here is an Application Load Balancer. An ALB is a Layer 7 (see: OSI model) load balancer, and as such we can use it to route traffic based on host name, path, query string parameters, source IP, HTTP headers, HTTP method, and port.
Our ALB can have session affinity so that the same users are always directed to the same server instances. This makes sure users don't lose their sessions because every request would otherwise be served by a different server instance. We are still in the realm of stateful architecture; we will cover stateless architecture a bit later down the line.
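To illustrate the idea of session affinity (not how an ALB actually implements it; real load balancers typically use cookies for stickiness), here is a toy sketch where a deterministic hash of the session id keeps each user pinned to one server:

```python
import hashlib

SERVERS = ["web-1", "web-2", "web-3"]  # hypothetical server pool

def route(session_id: str) -> str:
    """Sticky routing: hash the session id so the same session always
    lands on the same server while the pool is unchanged."""
    digest = int(hashlib.md5(session_id.encode()).hexdigest(), 16)
    return SERVERS[digest % len(SERVERS)]

# The same session is always routed to the same instance.
assert route("session-abc") == route("session-abc")
```

Note the weakness this toy shares with real sticky sessions: if a server leaves the pool, the sessions it held are lost, which is exactly what the stateless architecture later in the article fixes.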
Database Replication
Most applications are read heavy. So, we perform data-modifying operations like INSERT, DELETE, and UPDATE on the master database; read operations are performed on slave replicas. This effectively takes the burden of a lot of read requests away from the master database.
Parallel processing of queries achieved in this way significantly improves the performance of our data tier.
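A minimal sketch of the read/write split, with dicts standing in for the master and its replicas (real replication is asynchronous; this toy copies writes synchronously for simplicity):

```python
class ReplicatedStore:
    """Toy read/write splitting: writes go to the master and are copied
    to read replicas; reads are spread round-robin across replicas."""
    def __init__(self, n_replicas=2):
        self.master = {}
        self.replicas = [{} for _ in range(n_replicas)]
        self._next = 0

    def write(self, key, value):
        # INSERT / UPDATE path: master first, then replicate.
        self.master[key] = value
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # SELECT path: round-robin over the replica pool.
        replica = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return replica.get(key)

store = ReplicatedStore()
store.write("user#1", "Alice")
assert store.read("user#1") == "Alice"  # served by a replica, not the master
```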
We can also have cross data center replication for our replica instances. This will make our data tier fault tolerant and highly available(works even if a data center is temporarily down). An entire data center can be lost to a disaster and all we will need to do is promote a replica as the new master(a new replica will be created for reads) and we will have recovered from a disaster fairly quickly. Note that there may be some data loss because our database is replicated across regions in an asynchronous manner. [This is only to illustrate a point, we are not yet at the stage where we could lose an entire data center.]
This however may not be enough for the scale that we are aiming towards. We can still optimize our data tier by reducing the number of read requests to our database.
The upgraded design of our system will look like:
Cache
A cache tier is added to our system to store the results of expensive database queries. Caches are super fast in-memory stores for data. Before checking for an answer in the database, we check for the data in the cache; if there is a cache hit, the response can be sent to the client directly from the cache without having to query the database. Only on a cache miss is the database queried, and then, in addition to sending the response to the client, the data is stored in the cache for any subsequent requests.
In-memory databases like Redis are extremely fast, with sub-millisecond latency and very high IOPS (input/output operations per second). Watch: RedisConf 2020: Achieving a Half-Billion IOPs in a 1U REDIS server with FPGA Acceleration.
Read more about: Persistence, TTL and consistency for caches.
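The read path described above is the classic cache-aside pattern. Here is a minimal sketch, using plain dicts as stand-ins for the database and for a Redis-style cache (keys like `user:1` and the TTL value are illustrative):

```python
import time

db = {"user:1": "Alice"}   # stand-in for the database
cache = {}                 # stand-in for Redis: {key: (value, expires_at)}
TTL_SECONDS = 300

def get(key):
    """Cache-aside read: try the cache first, fall back to the database,
    then populate the cache for subsequent requests."""
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]                   # cache hit, entry not expired
    value = db.get(key)                   # cache miss: query the database
    cache[key] = (value, time.time() + TTL_SECONDS)
    return value
```

The first call for a key pays the database round trip; every call after that, until the TTL expires, is served from memory.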
We also use multiple cache servers across different data centers to avoid any SPOF (single point of failure). Caching can also be done at the web server layer and network layer, but we shall leave that for a later time.
CDN
A Content Delivery Network is a network of servers distributed across the planet in a great number of edge locations. A CDN can serve as a cache for static content like images, videos, CSS, and JavaScript for our webpages. CDN servers connected to each other through dedicated high-speed connections can also accelerate transfers significantly.
We have now gone global with our application. We will have servers running 24x7, 365 days a year in data centers around the globe. Users will be served by the "best" data center as per geography. If a data center/server is down, failover will automatically happen to the next "best" available data center.
Imagine a global ecommerce site. Even though the available stock for a product might be subject to frequent change, it is unlikely that the images for said product would change frequently. For a user in the Philippines, we do not need to send those static images from a data center in Mumbai; serving the assets from Singapore would be much faster. That is how CDNs increase the speed of our applications.
Solutions like AWS S3, with 99.99% availability (four nines), 99.999999999% durability (eleven nines), and virtually unlimited scalability, along with CloudFront (AWS's CDN), are a match made in heaven for such use cases.
Sidetracking a bit here: I shall allocate a little time to discuss some specifics of caching. TTL, flushing, and exempting data should be carefully considered when implementing a CDN/cache.

In any cache we do not want our data to be outdated; we want the cache to have the latest version of the data from our data store. Updating data in the cache too frequently defeats the purpose of a cache, with the cache itself sending too many requests to the database. Not updating often enough might mean users are served data from the cache while, in the backend database, the data has already changed.

Flushing policies should also be carefully considered. If the cache is full, should the oldest data be deleted, or the least frequently used data, or the least recently accessed data, or something else?

TTL (time to live) is the time for which data is stored in a caching system before it is deleted. Too low a TTL might not take enough load away from the database; too high a TTL might mean the data in the cache has gone stale. Consider this: you lower the TTL for an object to 1800 sec (30 mins), the object is updated, and now you know that within 30 minutes all caches will have the latest version of the object. After 30 mins you increase the TTL back to 86400 sec (a day). It can also be a good idea to remove data from the cache whenever a data-modifying operation (UPDATE, DELETE, etc.) is performed on entries that are cached.

Caching can be tricky, and often there are no right answers. I hope I have provided enough context for you to determine the best caching strategy for your use case.
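One common answer to the eviction question is least-recently-used (LRU): when the cache is full, drop the entry that has gone longest without being accessed. A minimal sketch using Python's OrderedDict (capacity and keys are illustrative):

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU eviction: the OrderedDict keeps entries in access order,
    so the least recently used entry is always at the front."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                    # cache miss
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
```

Redis offers this behavior (among others) via its `maxmemory` eviction policies; the sketch only shows the principle.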
State Store
As you may have noticed, we have also made our Web tier stateless. To outgrow the limitations of scalability, availability, and cost, we must make our application stateless. Stateless architecture basically means that our web tier servers do not need to remember session data for users. Requests from users can be directed to any of the web servers, because state data is now kept outside the web servers in a shared storage that all web servers can access.
We have chosen NoSQL data storage for our session data because it is easily scalable and can hold a lot of data while delivering low latency. Effectively, this means easy auto scaling based on traffic volumes.
Previously, if our load balancer did not have session affinity (sticky sessions), it would direct users to a different web server/instance every time and they would lose their session data. Now it does not matter (for simplicity's sake) which web server serves our client, because our web tier is stateless. We can scale our web tier both vertically and horizontally (up-down and in-out), adding or removing servers without affecting users' session data.
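A toy sketch of the stateless pattern: session state lives in a shared store that every web server can reach (a dict stands in for the NoSQL store; the function names are made up for illustration):

```python
import uuid

session_store = {}  # stand-in for a shared NoSQL store reachable by all web servers

def handle_login(username):
    """Any web server can create a session; the state lives outside the server."""
    session_id = str(uuid.uuid4())
    session_store[session_id] = {"user": username, "cart": []}
    return session_id  # returned to the client, e.g. as a cookie

def handle_request(session_id):
    """Any other web server can serve the next request by looking the session up."""
    return session_store.get(session_id)

sid = handle_login("alice")
assert handle_request(sid)["user"] == "alice"
```

Because no server holds the session locally, the load balancer is free to send each request anywhere, and servers can be added or removed at will.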
We have been scaling outwards(horizontally) all this time; but worry not, I shall explicitly explain these terms in the next section.
In-memory database solutions are very fast, but data persistence is compromised: session data will be lost when the database server is restarted. Our NoSQL shared storage is the happy medium we will move ahead with.
We may have to explore other solutions like Redis with persistence to disk if we need to handle very high traffic volumes. I am not a developer but I have on good authority that developers love Redis. Redis clusters will help us in scaling our applications to the largest of scales.
Multiple Data Centers
Our Web tier and Data tier now live in multiple data centers with a shared data store for session data. At this stage we can survive the loss of a complete data center: our load balancer will start directing all users to a healthy data center if one is lost. With a multi-data-center system design, if a data center goes down and automatic failover is made to another, the other data center may not have the most recent data for users who were previously served by the first. There may be some data loss because of asynchronous data replication across data centers. Data synchronization should be studied; there is an excellent article on the Netflix Christmas Eve 2012 outage, and another on active-active for multi-region resiliency.
Netflix regularly tests its Infrastructure by shutting off services, even entire data centers at random and seeing how their system responds to this chaos. AND THEY’RE CRAZY ENOUGH TO DO THIS IN PRODUCTION.
Let’s make our system even more scalable
As I mentioned before, we can scale vertically and horizontally.
Vertical scaling is having better hardware run our application. You increase your servers' RAM, CPU, storage, networking, etc. to make them able to handle greater workloads. Moving from HDD to SSD to effectively increase your IOPS is also vertical scaling. Cloud service providers like AWS also allow you to rent servers that are CPU optimized, disk optimized, IO optimized, memory optimized, or GPU optimized to handle different types of workloads.
Sidetrack: Correct placement of servers is also important, as having your servers in the same cluster provides very high network performance and very low latency. So, say you're doing some extreme computing job for NASA that requires your server instances to communicate a lot of information with each other very frequently; you will want your instances to be in a cluster. If you're doing a job that is not as compute intensive but you want built-in fault tolerance, you will want to spread your server instances across different racks or even different data centers. Also, as a rule of thumb, we do not want different components of our software talking to each other over the public internet. It is expensive, slow, and not secure. We always want to communicate using private IPs over our cloud service provider's dedicated private connections.
AWS has a 24TB RAM server called u-24tb1.metal that we could rent to store a 24TB database in-memory. But these servers are very expensive.
We navigate this by scaling our system horizontally (or outwards). Horizontal scaling is the act of increasing the number of servers we use to do our jobs. We have until now scaled our web tier horizontally by having the ability to add (scale out) or remove (scale in) web servers.
We want both scale out and scale in to happen automatically as per demand. This makes sure our users are happy because all their requests are served, and that we only use and pay for the cloud capacity we need.
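The scaling decision itself can be as simple as target tracking: pick a target for some load metric and size the fleet to move the metric back toward it. A toy sketch (the target, bounds, and metric are illustrative, not AWS's actual algorithm):

```python
import math

def desired_capacity(current, avg_cpu, target=60.0, lo=2, hi=20):
    """Toy target-tracking rule: choose a server count that would bring
    average CPU back toward the target, clamped to a min/max fleet size."""
    if avg_cpu <= 0:
        return lo                              # idle fleet: shrink to the floor
    wanted = math.ceil(current * avg_cpu / target)
    return max(lo, min(hi, wanted))
```

With 4 servers at 90% average CPU this asks for 6 servers; at 30% it shrinks back to the floor of 2, which is exactly the scale-out/scale-in behavior described above.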
We can scale our data tier horizontally with database sharding and/or database federation. Both involve dividing our database into smaller "parts". The idea is to vertically scale only specific "parts" of our database so that we do not have to pay for very expensive servers. Even if cost were no object, it is entirely possible for a database to grow so huge that it goes beyond the capacity of a u-24tb1.metal. As a principle, we also want to eliminate SPOFs from our design. So we must scale our database horizontally if we want our application to outgrow these limitations.
Database Federation
Database federation is the act of splitting a database by function. Think of a database for an ecommerce site. Now think of separate databases for products and users. You have effectively federated your database.
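A sketch of the idea, with dicts standing in for the per-function databases (the function names and keys are made up):

```python
# Hypothetical connection handles for function-split databases.
databases = {"users": {}, "products": {}, "orders": {}}

def db_for(function):
    """Federation: each functional area gets its own database,
    so each can be sized, tuned, and scaled independently."""
    return databases[function]

# The application routes each query to the database for that function.
db_for("users")["user#1"] = {"name": "Alice"}
db_for("products")["sku#9"] = {"title": "Keyboard"}
```

The trade-off is that queries joining data across functions (e.g. a user with their orders) now have to be assembled in the application instead of in the database.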
Database Sharding
Sharding is the act of horizontally scaling our database by splitting it into different shards. We divide a denormalized database into smaller parts (shards) that hold unique data but share the same schema.
Database scaling can be tricky. We can partition databases both horizontally and vertically. We denormalize the database so that queries can be performed on a single table.

Resharding is required if sharding is done in such a way that different shards grow at different rates. Some shards may experience exhaustion (exceed their resources) sooner than others, and we may be required to change our hashing function and server pool size. This normally requires a lot of data movement. To avoid that, we practice consistent hashing, which is basically making sure minimal data movement is required when the pool size changes.

Sharding is not as straightforward as storing all data for some users in one shard and all data for other users in a different shard. Say you have divided into 10 shards, you have an application like YouTube, and all data for Led Zeppelin is stored on the same shard. Now they decide to release a new song (a man can dream). Your system will collapse because of the number of requests made on a single shard. This is commonly referred to as the celebrity problem or hotspot key problem. I encourage you to read more about consistent hashing.
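A minimal consistent-hashing ring, with virtual nodes to smooth the key distribution; this is a sketch of the idea, not a production implementation:

```python
import bisect
import hashlib

def _hash(key):
    # Stable hash so key placement survives process restarts.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Keys and servers are placed on a ring; a key belongs to the first
    server clockwise from it. Adding or removing a server only moves the
    keys in that server's arc, not the whole dataset."""
    def __init__(self, servers, vnodes=100):
        self.ring = []                         # sorted list of (hash, server)
        for server in servers:
            for v in range(vnodes):            # virtual nodes smooth distribution
                self.ring.append((_hash(f"{server}#{v}"), server))
        self.ring.sort()

    def server_for(self, key):
        h = _hash(key)
        idx = bisect.bisect(self.ring, (h,))   # first ring point clockwise of key
        return self.ring[idx % len(self.ring)][1]
```

Note that consistent hashing solves the resharding cost, not the hotspot problem: a single celebrity key still lands on one shard, which is typically handled by further splitting hot keys or caching them.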
Message Queuing
Message queues are an inter-process communication (IPC) mechanism for services within your application to talk to each other asynchronously. They are a very common technique in serverless and microservice architectures, and a great way to decouple services and smooth out spikes in workloads by introducing buffers. A larger service is decoupled by attaching two smaller, independent building blocks to either side of a message queue, as a producer and a consumer of "messages" within the queue.
With message queuing, we can scale both producer and consumer servers independently. Also, if our consumer servers go down, our producer/publisher servers will still be able to publish messages to the message queue. Similarly, if producers go down, consumers can keep on reading messages from the queue.
Consider a photo processing application. Users want to crop, blur, apply filters, etc. to photos. These are all time-consuming jobs. We can decouple our system into producers, which publish what jobs need to be done to the message queue, and consumers (workers), which asynchronously pull those messages and actually perform said jobs. We can scale our system to have multiple queues and fan out messages to different kinds of servers: compute-heavy jobs can be published for consumption by CPU-optimized servers, and IO-heavy jobs for consumption by a different type of server. The size of the message queues is regularly monitored, so that when a queue grows, more workers are added to counterbalance. Similarly, if our queue is empty most of the time, we can reduce the number of workers.
I also recommend reading about the visibility timeout: the time for which a message, once picked up by a worker, is hidden so that no other worker picks it up. Also look into setting a maximum number of times workers may try processing a message before it is sent to a dead letter queue, so that developers can examine why the message is not being processed.
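A toy sketch of the retry-then-dead-letter flow (a list stands in for the queue, and the receive limit is illustrative; a real queue would hide an in-flight message for a visibility timeout rather than popping it):

```python
MAX_RECEIVES = 3  # illustrative limit, like SQS's maxReceiveCount

queue = [{"body": "resize photo 42", "receives": 0}]
dead_letter_queue = []

def receive():
    """Pull a message; if it has already been attempted too many times,
    move it to the dead letter queue for developers to inspect."""
    while queue:
        msg = queue.pop(0)
        msg["receives"] += 1
        if msg["receives"] > MAX_RECEIVES:
            dead_letter_queue.append(msg)   # poison message: park it for inspection
            continue
        return msg
    return None

def nack(msg):
    """Processing failed: return the message to the queue for another attempt."""
    queue.append(msg)
```

A message that keeps failing stops clogging the queue after MAX_RECEIVES attempts and lands in the dead letter queue instead.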
By now our application has been scaled to serve tens of millions of users. We can scale out and optimize our system even more by decoupling our system into smaller services. Conceptually we have covered a lot of ground. In addition to decoupling of services, we have made our web tier stateless, we have built redundancy at every level, we exported static content serving to a CDN, we have scaled our data tier and introduced caching to improve performance. To work at this scale we will likely have an organization with an army of engineers. To ensure smooth operations we must add a few systems to support and maintain all this infrastructure.
Logging
Our architecture is not monolithic; we have scaled in such a way that we do not have any SPOF. But now errors can occur at many points in our system. To effectively diagnose errors and issues, we must have a logging system that records the activity happening within our system. We must also aggregate all the logs into a centralized system if we hope to make any sense of them.
Metrics
We need a central place to view metrics associated with our system for two broadly categorized purposes.
- Metrics like daily active users, inactive users, user retention, users leaning towards a particular service/feature etc. to make better business decisions.
- Metrics like CPU usage, memory, networking, when scale-outs and scale-ins happen, etc. to measure the overall health of our system. Multiple metrics can be aggregated to give an idea of the performance of web tier, data tier, cache tier, etc.
We will see AWS services like CloudWatch, CloudTrail, etc. in Part 2 of our exercise, where serverless services make all of this easy to implement within our system.
Automation
Because our system has grown fairly big, we need automation to improve productivity. If, say, our web tier needs to be scaled out, we cannot be expected to define the configuration of every new server instance manually and then manually load it with all the necessary software every time instances are launched. With appropriate automated checks, problems can be detected early. Automating our build, test, and deploy process is also necessary, so that developers do not spend all their time working on ancillary support systems.
We have scaled our system to tens of millions of users. By refining these techniques, we can scale our application to 100 million users.
But, you know what is cool? A Billion Users.