Tuesday 12 November 2019

Cyclic Dependencies in Microservices

What are they?
Simplest case: Service A calls on a resource of Service B, and Service B calls on a resource of Service A - BANG!  We have a cycle.

How can they happen?
Imagine a website that includes a feature for sharing images with the public or among groups of individuals.  The characteristics involved in uploading images are very different from the characteristics for managing permissions related to those images, so the responsibilities naturally get split out into separate services - one for the slower, longer-running requests that upload the multi-megabyte data of images, another for managing whether the images are public, or private, or only visible to some specified groups.
Users can only share their images to groups that they are a member of, so there is a need for a service for managing groups.

Now we can consider two scenarios that approach the system from different directions and result in a cycle existing between two services.
On the one side we have the flow of someone uploading an image and applying some permissions for the groups that they want to share the image with.
On the other side is the situation of another user choosing to delete a group that they have admin rights to.
- Before creating permissions the system needs to validate that the user is a member of the groups.
- As part of cleanly deleting a group the system needs to ensure that there are no left over references to the group.

When might this be problematic?
As part of deploying a new service - or more commonly a new version of a service - it is sensible to verify that the service is ready to operate before it can start accepting incoming traffic.  This is typically achieved by having the service expose a readiness endpoint that only produces a positive response once the service has successfully completed initialisation and confirmed that it can reach its dependencies.

With a cycle between services we would face a deadlock situation as each would be waiting for the other to become available before declaring itself ready.

This is probably the most basic reason for avoiding having cycles in the call graph of a collection of microservices.  I can imagine a few scenarios where a team could find out that they have a cycle problem:
 - Starting up a full copy of the system for a new region
 - Re-starting after scheduled maintenance
 - Deploying a new version of two or more of the microservices at the same time.
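
One way to find out earlier is to check the call graph for cycles before anything is deployed.  The following is a minimal sketch rather than a description of any real tooling: it assumes the team can produce a map of each service name to the services it calls, and the class and method names (CallGraphCycles, findCycle) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: looks for a cycle in a map of service name -> services it calls. */
final class CallGraphCycles {

    /** Returns the services forming a cycle (first service repeated at the end), or an empty list. */
    static List<String> findCycle(Map<String, List<String>> calls) {
        Set<String> finished = new HashSet<>();   // services already fully explored
        List<String> path = new ArrayList<>();    // services on the current call chain
        for (String service : calls.keySet()) {
            List<String> cycle = visit(service, calls, finished, path);
            if (!cycle.isEmpty()) {
                return cycle;
            }
        }
        return List.of();
    }

    private static List<String> visit(String service, Map<String, List<String>> calls,
                                      Set<String> finished, List<String> path) {
        if (finished.contains(service)) {
            return List.of();
        }
        int seenAt = path.indexOf(service);
        if (seenAt >= 0) {
            // The service is already on the current chain, so the chain loops back on itself.
            List<String> cycle = new ArrayList<>(path.subList(seenAt, path.size()));
            cycle.add(service);
            return cycle;
        }
        path.add(service);
        for (String dependency : calls.getOrDefault(service, List.of())) {
            List<String> cycle = visit(dependency, calls, finished, path);
            if (!cycle.isEmpty()) {
                return cycle;
            }
        }
        path.remove(path.size() - 1);
        finished.add(service);
        return List.of();
    }
}
```

For the image sharing example, a map in which permissions calls groups and groups calls permissions would come back as a three element list that starts and ends with the same service.  A check like this could run as part of a build, assuming the call map is kept up to date.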

The ideal approach is to avoid getting to the cycle situation in the first place, but "I told you so" isn't helpful advice, so let's also consider ways to reduce the difficulty and / or buy some time for adjustments.

A Strategy for avoiding cycles
De-couple the least time sensitive link in the call chain.

In the scenario outlined above we might consider deleting the permissions associated with a group as being a lower priority task.  The end users shouldn't see any impact from leftover permissions, so there is no need for them to wait for that processing to successfully complete.

Instead of groups calling permissions and requiring permissions to be up, we could introduce a dedicated notification topic which the groups service uses to announce deletion events.  The permissions service could subscribe to that topic via a queue, allowing events to accumulate for processing without permissions having to be available at the point in time that the notification occurs.  Now from the groups service's perspective any secondary aspects of deleting a group become a fire and forget concern.  Any other service that references groups, and so needs to be aware when a group is deleted, can apply the same subscription approach.
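
To make the shape of that de-coupling concrete, here is a rough sketch.  It uses an in-memory BlockingQueue purely as a stand-in for the topic and queue that a real messaging system would provide, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical groups service: announces deletions instead of calling permissions. */
final class GroupsService {
    private final BlockingQueue<String> deletedGroupIds;

    GroupsService(BlockingQueue<String> deletedGroupIds) {
        this.deletedGroupIds = deletedGroupIds;
    }

    void deleteGroup(String groupId) {
        // The group's own data would be deleted here.
        // Fire and forget: no call to the permissions service, so it does not need to be up.
        deletedGroupIds.offer(groupId);
    }
}

/** Hypothetical permissions service: cleans up whenever it gets round to reading the queue. */
final class PermissionsService {
    private final BlockingQueue<String> deletedGroupIds;
    private final Set<String> referencedGroupIds = new HashSet<>();

    PermissionsService(BlockingQueue<String> deletedGroupIds) {
        this.deletedGroupIds = deletedGroupIds;
    }

    void shareWithGroup(String groupId) {
        referencedGroupIds.add(groupId);
    }

    boolean references(String groupId) {
        return referencedGroupIds.contains(groupId);
    }

    /** Removes references to every group whose deletion event has accumulated; returns the count. */
    int processPendingDeletions() {
        List<String> pending = new ArrayList<>();
        deletedGroupIds.drainTo(pending);
        for (String groupId : pending) {
            referencedGroupIds.remove(groupId);
        }
        return pending.size();
    }
}
```

The point of the sketch is the direction of the dependency: the groups service only knows about the queue, so it can start up and declare itself ready without the permissions service being reachable.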

Friday 8 November 2019

Moving to lambdas - not just a "lift and shift"

A while ago I came across a lambda that had been set up to run in AWS that had previously existed as an endpoint within a microservice running in ECS (Docker).  I needed to dig into the code to try to understand why an external cache was behaving strangely - but I can save that for another post.

The focus of this post is that what may have made sense in one environment doesn't necessarily hold true for AWS lambdas.

Lambdas are intended to be short-lived and are charged at a fine level of time granularity so the most cost efficient way of running is to start quickly, perform the desired action and then stop.

Some code that I came across in this particular lambda involved setting up a large in memory cache capable of containing millions of items.  This would take a little time to set up, and based on my knowledge of the data involved it would not achieve a good hit rate for a typical batch of data being processed.

Another aspect of the code that had been carried over was the use of the external cache.  At initialisation time a cache client was set up to enable checking for the existence of records in Redis.  The client was only being closed by a shutdown hook of the Java application - which AWS Lambdas do not reach.  This resulted in client connections being kept around even after the lambda had completed, resulting in the underlying Docker container running out of file handles and having to be destroyed mid-processing.
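
One possible way to avoid depending on a shutdown hook is to tie the client's lifetime to the invocation itself.  The sketch below is an assumption about how that could look, not the fix that was actually applied: FakeCacheClient is a made-up stand-in for a real Redis client, and RecordHandler is a plain class rather than a real AWS handler.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a Redis client; a real client would hold a socket and a file handle. */
final class FakeCacheClient implements AutoCloseable {
    static final AtomicInteger openConnections = new AtomicInteger();

    FakeCacheClient() {
        openConnections.incrementAndGet();
    }

    /** Pretends that keys starting with "known-" already exist in the cache. */
    boolean exists(String key) {
        return key.startsWith("known-");
    }

    @Override
    public void close() {
        openConnections.decrementAndGet();
    }
}

/** Hypothetical handler body: the client is closed before the invocation returns. */
final class RecordHandler {

    /** Counts how many of the given keys are already present in the cache. */
    static long handle(List<String> keys) {
        // try-with-resources closes the client even if the lookup throws,
        // so nothing relies on a shutdown hook ever running.
        try (FakeCacheClient cache = new FakeCacheClient()) {
            return keys.stream().filter(cache::exists).count();
        }
    }
}
```

Opening a connection on every invocation has its own cost, so whether this trade-off is the right one would depend on the batch sizes and the invocation rate involved.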

Upgrading the in-house microservice framework

This is a bit of a "note to self" but if you find it interesting let me know.

This isn't intended to be a deep dive into any particular technologies, but it might touch on some situations familiar to developers with a few years of experience in managing microservices.

Setting the scene
The main Java-based microservices had some common needs:
 - authorization
 - logging
 - service discovery
 - api documentation generation
 - accessing objects from the cloud provider

Naturally it made sense for all of this common functionality to be bundled together in one common framework and used everywhere - which is fine.

Unfortunately some awkward shortcuts were taken to achieve some of the functionality, which made upgrading the underlying open source framework impossible to achieve without introducing breaking changes.

Before I had joined the organisation a couple of attempts had already been made to get things back into shape, but they ended up being aborted as the developers came to the realisation that they could not make the necessary updates without breaking the build for most of the existing services.

I helped to persuade the management team that this inability to upgrade had to be addressed to enable us to avoid security issues and take advantage of performance improvements, so a team was formed.

My main coding contributions were:
 - migrating between JAX-RS versions
 - updating logging to continue to preserve correlation Ids since the underlying framework had changed significantly
 - migrating away from using the very out of date S3 client
 - repetitive small changes that didn't make sense for every team to learn about and apply themselves

Dependencies for tests should be scoped as "test"
Something that looked like a minor oversight turned into a couple of weeks of work for me.  Some dependencies that were needed for running unit tests had been specified with an incorrect scope, so instead of just being available in the framework during build time they were actually being bundled up and included in most of our microservices.

Changing the scope of a handful of dependencies only to realise that a dozen or more projects had been relying on that to bring in test support libraries for their builds made me a little unpopular with developers who had just completed work on new features and found that their builds were broken so they could not deploy to production.

This led to one of my first time sinks in the project.  The majority of the broken builds were for services that were not under active development, so the teams that owned them could not be expected to down tools on what they were actively working on to fix their services' dependencies.  Fortunately a migration of one of the test dependencies was a recommended part of preparing to upgrade the underlying framework, so I was able to get that change out of the way more quickly than if the individual teams had done this themselves.

Service discovery client changes
The existing service discovery mechanism involved service instances registering themselves with Zookeeper during deployment.  This allowed Zookeeper to know which instances were available, and allowed services to keep up to date about the available instances of services that they needed to call.
Some aspect of this setup was not working properly, so each time a new version of a service was deployed we would see a spike of errors due to clients sending requests to instances that were not yet ready.

We had an alternative mechanism for services to reach each other by introducing fully qualified domain names pointing to the relevant load balancers, so removing the Zookeeper dependency and updating the various http clients to have some new configuration was in scope for this upgrade project.

A colleague from another team contributed by using his team's services as a proof of concept for migrating to the fully qualified domain names.

The initial approach was fine for those services as the clients could all be updated at the same time.  When it came time to apply the same type of change to other services we struck an issue whereby not all client libraries could be updated at the same time - so I had to introduce a bridging interface for the internal API client to allow old style clients and fully qualified domain name clients to co-exist.  This became my second time sink of the project, as once more we could not rely on every team having time made available by their product owners to address this migration work.
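
The bridging idea can be sketched roughly as follows.  This is not the actual in-house code: the interface and class names are invented, the registry lookup is reduced to a plain map standing in for Zookeeper, and the domain name is a placeholder.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;

/** Hypothetical bridge: API clients ask for a base URI without knowing how it was found. */
interface ServiceLocator {
    URI baseUriFor(String serviceName);
}

/** New style: every service sits behind a load balancer with a predictable domain name. */
final class FqdnServiceLocator implements ServiceLocator {
    private final String domain;

    FqdnServiceLocator(String domain) {
        this.domain = domain;
    }

    @Override
    public URI baseUriFor(String serviceName) {
        return URI.create("https://" + serviceName + "." + domain);
    }
}

/** Old style: picks an instance from a registry; the map stands in for a Zookeeper lookup. */
final class RegistryServiceLocator implements ServiceLocator {
    private final Map<String, List<String>> instancesByService;

    RegistryServiceLocator(Map<String, List<String>> instancesByService) {
        this.instancesByService = instancesByService;
    }

    @Override
    public URI baseUriFor(String serviceName) {
        List<String> instances = instancesByService.getOrDefault(serviceName, List.of());
        if (instances.isEmpty()) {
            throw new IllegalStateException("No registered instances for " + serviceName);
        }
        return URI.create("http://" + instances.get(0));
    }
}

/** The internal API client depends only on the bridge, so both styles can co-exist. */
final class InternalApiClient {
    private final ServiceLocator locator;

    InternalApiClient(ServiceLocator locator) {
        this.locator = locator;
    }

    URI resolve(String serviceName, String path) {
        return locator.baseUriFor(serviceName).resolve(path);
    }
}
```

With something like this in place, a client library that has not yet been migrated keeps being constructed with the registry-backed locator, while migrated ones are handed the domain name version, and neither needs to change its request code.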

I saw the client service discovery migration work as being an area where specialization can speed things up.  Having one or two individuals applying the same mechanisms to make the necessary changes in dozens of places is much more time efficient than having dozens of individuals learn what is required and apply the change in two or three places each.  A couple of teams that did attempt to apply the changes themselves missed applying changes to the bundled default configuration for their client libraries - meaning additional configuration changes would need to be set up in each of the services that used their client libraries.

Not all services were created equal
Some services' build phases were more stable than others.  One particularly complex system had not been deployed for a couple of months before we came to apply the service discovery client changes.  The flaky nature of some of its integration tests left us in the dark about some broken builds for a few weeks.  It didn't help that the test suite would take over half an hour to run.  Eventually I realised that a client that I had configured in a hurry was missing one crucial property and we were able to unblock that service and its associated client for release.

Libraries and versioning
Several services had a common type of sub-system to interact with - the obvious example being a relational database - so over the years some common additional functionality had been introduced into a supporting library.  Due to the lack of modularity in the underlying framework we found ourselves tying these support libraries to the framework - even for just being able to specify the database's connection properties as a configurable object.

I took the decision to treat the new version of the framework as a completely new artifact, which meant that each library also had to diverge from the existing versioning so that we would not have a situation of a service automatically picking up a new library version and bringing along a completely incompatible framework as a transitive dependency.

This got a bit of push-back from some developers as they started to look into what had been involved in successfully migrating services so far.  "Are we going to end up with a new artifact for the next upgrade as well?" was a very fair question.  Like most things in technology the short answer is, "It depends."  My hope is that the current stable version of the underlying open source framework will only have minor upgrades for the next year or two.  Alongside this my expectation is that many of the existing microservices will be migrated to a different type of environment, where some of the functionality provided by the in-house framework will be dealt with in ways that are external to the service - e.g. sidecars in systems such as Kubernetes.

How should we set up a relational database for microservices?

Introduction Over the years I've provisioned and maintained relational databases to support a range of web based applications, starting ...