Introduction
One of the big promises of a microservice-based architecture is the ability to independently add resources to the components that need to scale up when spikes in load arrive.
There are some important differences between scaling compute resource and scaling stateful systems that should be kept in mind when designing how a system will maintain state.
Examples of autoscaling policies
Some examples of scaling strategies for microservices that I have worked on include:
- increase the number of EC2 instances of the service when the average CPU load exceeds 70% for 5 minutes
- increase the number of EC2 instances of the service when the minimum CPU credit balance falls below a threshold
- increase the number of instances of an ECS task when the percentage of available connections is below a threshold (via a custom CloudWatch metric)
These examples should be familiar to anyone who has developed scalable microservices on AWS. They attempt to detect when a resource is approaching its capacity and anticipate that by adding some additional compute resource the system can continue to cope without requiring any manual intervention by system operators / administrators.
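As a rough illustration of the first policy, the check below decides whether to scale out from a list of recent CPU readings. It is a minimal sketch rather than how CloudWatch alarms are actually evaluated: the function name `should_scale_out` is hypothetical, and it assumes one fleet-wide average CPU sample per minute, with "exceeds 70% for 5 minutes" read as "each of the last five samples is above 70".

```python
def should_scale_out(cpu_samples, threshold=70.0, periods=5):
    """Return True when the most recent `periods` samples all exceed `threshold`.

    cpu_samples: per-minute average CPU percentages, oldest first (assumed shape).
    """
    if len(cpu_samples) < periods:
        # Not enough history yet to claim a sustained breach.
        return False
    recent = cpu_samples[-periods:]
    return all(sample > threshold for sample in recent)


if __name__ == "__main__":
    # Sustained load above the threshold triggers a scale-out.
    print(should_scale_out([55, 72, 81, 90, 77, 74]))
    # A single dip inside the window does not.
    print(should_scale_out([72, 81, 60, 77, 74]))
```

Requiring every sample in the window to breach the threshold is what stops a single short spike from adding instances.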
Alongside the logic for scaling up we should also include some way of detecting when the allocated resources can be scaled down towards the predefined steady state level.
Scaling down
For the first two scaling strategies we had a fairly typical approach to scaling down - select a candidate for termination, stop new incoming requests from being directed to that instance, wait a fixed amount of time for existing connections to have their requests processed (connection draining), then terminate the instance / task.
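The scale-down sequence above can be sketched as a small simulation. Everything here is hypothetical (the `Instance` class, its fields and `drain_and_terminate`); in practice the load balancer and the autoscaling service carry out these steps, so treat this only as a model of the ordering: stop new traffic, wait up to a fixed time, then terminate regardless.

```python
import time


class Instance:
    """Toy stand-in for a service instance behind a load balancer."""

    def __init__(self, name, in_flight):
        self.name = name
        self.in_flight = in_flight  # requests still being processed
        self.accepting = True       # whether new requests are routed here
        self.terminated = False


def drain_and_terminate(instance, timeout_s, poll_s=0.01, sleep=time.sleep):
    """Stop new traffic, wait up to `timeout_s` for in-flight work, then terminate.

    Returns True if the instance drained fully before the deadline.
    """
    instance.accepting = False
    deadline = time.monotonic() + timeout_s
    while instance.in_flight > 0 and time.monotonic() < deadline:
        sleep(poll_s)
    # Termination happens whether or not draining finished in time.
    instance.terminated = True
    return instance.in_flight == 0


if __name__ == "__main__":
    idle = Instance("svc-1", in_flight=0)
    print(drain_and_terminate(idle, timeout_s=1))
    stuck = Instance("svc-2", in_flight=1)
    print(drain_and_terminate(stuck, timeout_s=0.05))
```

The second call shows the weakness of a fixed wait: a connection that never finishes is simply cut off when the timeout expires.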
For the third example on my list, the connections in use by the service were being kept open to enable notifications to be transmitted to the client as part of a "realtime" synchronisation mechanism. So, unlike the earlier examples, connection draining would not be expected to occur in a timely manner. We had to build logic into the clients for them to re-establish the connection when the service instance that they had been using dropped out.
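A client-side reconnect loop might look something like the sketch below. It is not the logic we shipped: the function name `reconnect_with_backoff`, the `connect` callable and the exponential backoff with jitter are all assumptions, chosen because spreading out retries is a common way to avoid every dropped client hitting the remaining instances at once.

```python
import random
import time


def reconnect_with_backoff(connect, max_attempts=5, base_delay_s=0.5,
                           max_delay_s=30.0, sleep=time.sleep):
    """Call `connect()` until it succeeds or `max_attempts` is exhausted.

    `connect` is a hypothetical callable that returns a connection object
    or raises ConnectionError when no instance accepts the client.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up and let the caller decide what to do
            # Exponential backoff capped at max_delay_s, with full jitter.
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, delay))


if __name__ == "__main__":
    calls = []

    def flaky_connect():
        calls.append(1)
        if len(calls) < 3:
            raise ConnectionError("instance was scaled in")
        return "connected"

    print(reconnect_with_backoff(flaky_connect, sleep=lambda s: None))
```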
Scaling with state - a relational database example
Behind the client-facing and middle-tier microservices we inevitably have some databases, which will have their own limits on how many incoming requests they can cope with. The maximum number of connections is the limit I would most expect to become a bottleneck as the number of service instances scales up.
Typically each microservice instance will hold its own connection pool for sending requests to read, write, update or delete data in the backing database. If our database has a maximum connections limit of 200 connections, and each instance of the microservice has a connection pool of 50 connections, then four instances can use up every available connection (4 x 50 = 200), and we will need to consider deeper scaling implications before the microservice's scaling logic goes beyond that.
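The arithmetic can be turned into a small helper for choosing the autoscaling group's maximum size. This is a sketch under assumptions: `max_instances` is a hypothetical name, every instance is assumed to be able to fill its pool, and the optional `reserved` parameter (connections held back for admin tools or other services) is my addition rather than something from the example above.

```python
def max_instances(db_max_connections, pool_size, reserved=0):
    """Largest number of instances whose full pools fit within the database limit."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    usable = db_max_connections - reserved
    return max(0, usable // pool_size)


if __name__ == "__main__":
    # The example from the text: 200 connections, pools of 50.
    print(max_instances(200, 50))               # 4
    # Holding back 10 connections for other users leaves room for 3 full pools.
    print(max_instances(200, 50, reserved=10))  # 3
```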
In an environment where most database interactions involve reads there are some options to scale the database horizontally by setting up read replicas. The gotcha is that you need to know in advance that the replicas will be worthwhile. Because the primary database is the source of the data that has to be copied to any new replica, creating a replica adds load of its own, so it is best to create the replica(s) when the database is not under much load.
For systems that have many writes to the database the scaling options involve either partitioning the data across multiple databases, or vertically scaling by moving to a larger instance type.
Summary
- Scaling of compute resource can be done quickly in response to recently measured load.
- Some down-scaling approaches need a protocol in place for clients to reconnect and stay active when their instance is removed.
- Where the compute resource involves interacting with databases there should be some restrictions in place to prevent the compute resources from flooding the databases or being unable to connect.
- Scaling of relational databases needs to take place before it is needed.