Monday 28 October 2019

Ensuring DNS does not limit your microservices' ability to scale

Before Kubernetes and service meshes became the mainstream way of deploying microservices, some organizations set up their microservice infrastructure in much the same way as monolithic applications.  The part I'm interested in here is the top layer for receiving requests over HTTP: a load balancer sitting in front of a cluster of servers.

Simple deployments
The load balancer infrastructure can be manipulated to seamlessly upgrade a service by adding servers running the new version of the service to the cluster and removing the old versions.  The load balancer transparently forwards traffic to the live instances and passes responses back to the clients.  Services only reach each other via the receiving service's load balancer, so there is no need for additional service discovery systems.

Managed infrastructure implications
Unlike data centre setups with physical load balancers, in cloud environments such as AWS the load balancer will be dynamically resized - including the possibility of distributing it across multiple machines and multiple IP addresses.
To take advantage of the capacity that the load balancer offers, clients need to regularly look up the load balancer's IP addresses.
Client DNS lookup performance
Historically some application runtimes would cache the IP addresses resolved for a request and continue to use them for prolonged periods of time - ignoring any hints on offer from the TTL of the DNS record.  Here are some of the potential problems of caching DNS for too long:
  • Traffic is not evenly distributed across the load balancer's provisioned hosts, making the load balancer a bottleneck even though it may have scaled up.
  • When the load balancer capacity is scaled down or moved to different addresses the client continues to attempt to reach it on a stale address. 
Overriding the cache duration for DNS lookups can reduce the risk of hitting these types of issues, but brings with it the performance cost of the additional lookups made to resolve the load balancer's IP addresses.
One way to prevent lookup calls from increasing latency for service calls is to have the address resolution activity run asynchronously, such as on a dedicated scheduled thread that updates a local name resolution cache - provided that the refresh rate is aligned with the DNS TTL, of course.
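To make that idea concrete, here's a minimal sketch of such a background refresher in Python.  The class name, refresh interval, and injectable resolver are all hypothetical; a real implementation would derive the refresh rate from the record's TTL rather than a fixed constant.

```python
import socket
import threading
import time


class DnsRefreshCache:
    """Resolves a hostname on a dedicated background thread so that
    service calls read from a local cache and never block on DNS."""

    def __init__(self, hostname, refresh_seconds=30.0, resolver=None):
        self._hostname = hostname
        self._refresh_seconds = refresh_seconds
        # The resolver is injectable for testing; defaults to a real lookup.
        self._resolver = resolver or self._system_lookup
        self._addresses = self._resolver(hostname)  # resolve once up front
        self._lock = threading.Lock()
        self._stop = threading.Event()
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    @staticmethod
    def _system_lookup(hostname):
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    def _refresh_loop(self):
        # wait() returns False on timeout, True once close() is called.
        while not self._stop.wait(self._refresh_seconds):
            try:
                addresses = self._resolver(self._hostname)
            except OSError:
                continue  # keep serving the last known-good addresses
            with self._lock:
                self._addresses = addresses

    def addresses(self):
        with self._lock:
            return list(self._addresses)

    def close(self):
        self._stop.set()
```

Service calls then pick an address from `addresses()` on each request, spreading traffic across whatever hosts the load balancer currently occupies.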

Friday 25 October 2019

Removing all the things - a good time to do things asynchronously

Microservices can be great for isolating segments of functionality and data, but there are some situations when all of the various segments need to be prodded.  Deletion of an account or some other core item in the business domain is the classic example and the topic of this post.

Adding or modifying an item is a relatively straightforward process involving a single call to an endpoint.  It either completes successfully or returns a response that provides enough context for the caller to know whether to retry or give up.  This can usually be done synchronously, or in a short enough timeframe for the calling system to establish that the state has been saved.
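As a trivial illustration of that retry-or-give-up decision, the caller can often decide from the HTTP status code alone.  This helper is a hypothetical sketch of one common convention, not a prescription for any particular API:

```python
def should_retry(status_code):
    """Decide whether a failed write is worth retrying.

    429 (rate limited) and 5xx responses suggest a transient fault on
    the server side, so a retry may succeed; other 4xx responses mean
    the request itself is at fault, so the caller should give up.
    """
    if status_code == 429:
        return True
    if 500 <= status_code <= 599:
        return True
    return False
```

In practice the response body or a `Retry-After` header can refine this decision, but the status code is usually enough context to choose a path.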

Similarly, a single item or group of items that fall under a common area of functionality can be deleted while the caller is waiting.

I'm going to be lazy and not offer the hypothetical business domain here, but imagine having a dozen microservices, each with their own data store and type of content.  When a request comes in to delete the central item that each of those content types is ultimately associated with we can no longer afford to wait around for each service to check whether it holds anything related and successfully complete the deletion activity.

There isn't a graceful degradation fallback option to respond with when a deletion request times out, so rather than exposing the caller to the possibility of any one of a dozen or more systems hitting a temporary issue, we should persist the relevant identifying characteristics of the deletion request and provide a quick response to let the caller know that the deletion process has begun.
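A minimal sketch of that accept-then-process-later shape might look like the following.  The class and field names are hypothetical, and the in-memory queue stands in for whatever durable store the deletion requests would actually be persisted to:

```python
import uuid
from collections import deque


class DeletionRequests:
    """Accepts deletion requests and acknowledges them immediately.

    Only the identifying characteristics of the request are persisted;
    a background worker (not shown here) drains the queue and drives
    each downstream service's clean-up later.
    """

    def __init__(self):
        self._queue = deque()  # stands in for a durable store

    def request_deletion(self, account_id):
        request_id = str(uuid.uuid4())
        self._queue.append({"request_id": request_id,
                            "account_id": account_id})
        # The caller gets a quick acknowledgement, not the final outcome.
        return {"status": "accepted", "request_id": request_id}

    def pending(self):
        return list(self._queue)
```

The `request_id` gives the caller something to poll or correlate against later, which is about all the certainty we can honestly offer at this point in the process.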

I'll save the description of possible mechanisms for each sub-system to detect and process any relevant deletion from its data store for another post.

In summary, the distributed nature of data in a microservice architecture should lead us to handling broad actions asynchronously.  We're no longer in a situation of having everything in one central relational database with deletions automatically propagating across tables.  It's part of the trade-off of the flexibility to scale and develop new functionality in isolation rather than as a monolith.

A new beginning

I was so impressed and/or amused by my choice of title for a recent blog post that I decided to spin up a new blog dedicated to my experiences and contemplations around microservices.

The "New Adventures in Microservices" title was partially inspired by the R.E.M. album "New Adventures in Hi-Fi" from the mid-nineties.  "New adventures in wi-fi" would have a nicer ring to it, but would be a bit off topic.

I feel doubly unoriginal in my choice of title, as IBM used to have a series of continuous integration related posts on their DeveloperWorks site under the title "Automation for the People" which I am 99.9% certain took inspiration from the R.E.M. album "Automatic for the People".  Just to add another "nine" to my confidence - at the time of writing this post I see Paul Duvall - who produced the content for that DeveloperWorks series - has a tweet pinned from a Stelligent presentation, "The waiting is the hardest part" - a reference to the lyrics of a Tom Petty song.

Anyway...  I've managed to resist the temptation to give each post the title of an R.E.M. song, so from here forward you may find this blog more informative than amusing.
