Monday, 28 October 2019

Ensuring DNS does not limit your microservices' ability to scale

Introduction
Before Kubernetes and service meshes became the mainstream way for deploying microservices some organizations would set up their microservice infrastructure in much the same way as monolithic applications.  For the part that I am interested in the top layer for receiving requests over HTTP is of interest - a load balancer sitting in front of a cluster of servers.

Simple deployments
The load balancer infrastructure can be manipulated to seamlessly upgrade a service by adding servers running the new version of the service to the cluster and removing the old versions.  The load balancer transparently forwards traffic to the live instances and passes responses back to the clients.  Services only reach eachother via the receiving service's load balancer, so there is no need for additional service discovery systems.

Managed infrastructure implications
Unlike data centre setups with physical load balancers, in cloud environments such as AWS the environment will dynamically resize the loadbalancer - including the possibility of distributing it across multiple machines and multiple IP addresses.
To take advantage of the capacity that the load balancer offers, clients need to regularly lookup the IP addresses of the load balancer.
 
Client DNS lookup performance
Historically some application runtimes would cache the IP addresses resolved from a request and continue to use that for prolonged periods of time - ignoring any hints on offer from TTL of the DNS record.  Here are some of the potential problems of caching DNS for too long:
  • Load balancer capacity is not evenly distributed across the load balancer's provisioned hosts, making the load balancer bottleneck even though it may have scaled up.
  • When the load balancer capacity is scaled down or moved to different addresses the client continues to attempt to reach it on a stale address. 
Overriding the timeout for DNS lookups can reduce the risks of striking these types of issues, but brings with it some performance cost of the lookup time when making additional calls to resolve the load balancer IP addresses.
One way to prevent lookup calls from increasing latency for service calls is to have the address resolution activity run asynchronously such as on a dedicated scheduled thread that updates a local name resolution cache - provided that the refresh rate is aligned with the DNS TTL of course.

No comments:

Post a Comment

How should we set up a relational database for microservices?

Introduction Over the years I've provisioned and maintained relational databases to support a range of web based applications, starting ...