Friday, 8 October 2021

How should we set up a relational database for microservices?

Introduction

Over the years I've provisioned and maintained relational databases to support a range of web based applications, starting off with monoliths running on a single server and, more recently, microservices that run across multiple servers and scale up and down dynamically based on the load shared across their instances.

Here I'd like to evaluate whether we can bring all of these capabilities together when deploying into a cloud environment, using AWS since that is the environment that I am most familiar with.

Database setup and structure updates

I would like the provisioning of the database and all subsequent structural updates - such as the creation of tables - to be handled independently of the microservice runtime.

The initial provisioning of the database should be a one-off task handled by something like CloudFormation, Terraform, or Ansible. These technologies should all also be capable of updating the configuration, such as resizing the database instance, cluster, or storage space.  This isn't anything particularly novel so I won't delve into it here.

As a starting assumption, we can reasonably expect that the microservice that owns the data stored in the database will be deployed by a continuous deployment service or something like GitHub Actions.

The deployment lifecycle for a microservice may be split into something like the following stages and phases:

Stage One - making something that can be deployed

  • Build: bring in dependencies, compile and run unit tests, assemble the service into its deployable shape - e.g. runnable jar or Docker image.
  • Integration tests: Assemble mocks and / or real integration points with appropriate data and run the microservice with well established integration tests to verify behaviour meets expectations.
  • Static checking of the unit of deployment, such as whether the underlying OS of the Docker image has known security issues.
  • Phase zero of deployment: Push the deployable microservice artifact into the artifact repository / registry that will hold it as a deployable unit with a version or tag that will be uniquely identifiable and traceable back to this stage of building for this microservice.

Stage Two - deploying the artifact into the runtime environment

  • Phase 1 of deployment: Applying structural database updates to the target deployment environment (staging / production) - see the migration sketch after this list
  • Phase 2 of deployment: Taking a copy of the deployable artifact and launching it into the target environment - e.g. staging or production
  • Phase 3 of deployment: Smoke testing to verify that the newly released version of the microservice is fit for purpose and ready to be rolled out to replace the previous version
  • Phase 4: Roll forward or roll back, the "go" / "no go" decision point
    • Roll forward
      • Notify any existing previous version of the microservice to finish processing what it has in flight, and shut down. For services driven by requests such as over http, load balancers should drain connections and stop sending requests to the old version of the running microservice.
      • Scale up the instances of the microservice to cope with the currently established load.
      • Finish when all instances of the microservice are now running the specified version
    • Roll backward
      • If the new version of the microservice fails smoke tests or becomes unhealthy then we should back out of this release and revert to the setup that was in place for the current live version. Terminate all of the new instances (most often we will have reached this decision point when only one has been in place)
      • Roll back the database changes (this requires discipline to ensure that such changes are reversible).
      • Verify that the health of the microservice has recovered now that the previous known good version is back in full control of processing.
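
I haven't tied Phase 1 to a particular tool above. As one sketch, assuming Flyway is used for the structural updates (the environment variable names and migration locations are placeholders), the Phase 1 step can be as small as:

    import org.flywaydb.core.Flyway;

    // Sketch of Phase 1: apply versioned SQL migrations before launching the new
    // version of the microservice. The credentials here belong to a schema-owner
    // account, not the service's runtime user.
    public class ApplyStructuralUpdates {
        public static void main(String[] args) {
            Flyway flyway = Flyway.configure()
                    .dataSource(System.getenv("DB_URL"),
                                System.getenv("DB_ADMIN_USER"),
                                System.getenv("DB_ADMIN_PASSWORD"))
                    .locations("classpath:db/migration") // V1__create_tables.sql, V2__..., etc.
                    .load();
            flyway.migrate(); // a no-op when the schema is already at the latest version
        }
    }

Keeping each migration paired with a corresponding "down" script (or using a tool that supports undo migrations) is what makes the Phase 4 roll-back discipline practical rather than aspirational.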

Connection Security

We want to be able to have confidence that the communication between our service and the database meets the following security requirements:

  • the system that we are connected to is genuine and not some man in the middle
  • the data that we receive back has not been corrupted, intentionally or unintentionally
  • the database must not permit other parties to read from or write to it
  • the data being transferred to and from the database must not be readable as plaintext

Basically, we need encryption in transit plus some way for the sender of data to signify that it is authentic in a way that the receiver can verify - preferably with minimal overhead.
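
As a minimal sketch of what that looks like from the service side, assuming a PostgreSQL-compatible database accessed over JDBC (the host name, database name and certificate path are placeholders), most of these requirements can be pushed down to the driver by insisting on TLS with full certificate verification:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.util.Properties;

    public class SecureDbConnection {
        // Sketch: open a JDBC connection that requires TLS and verifies the server
        // certificate against a trusted CA bundle, so a man in the middle cannot
        // impersonate the database and the traffic is not readable as plaintext.
        static Connection open() throws SQLException {
            Properties props = new Properties();
            props.setProperty("user", System.getenv("DB_USER"));
            props.setProperty("password", System.getenv("DB_PASSWORD"));
            props.setProperty("ssl", "true");
            props.setProperty("sslmode", "verify-full"); // fail if the server cannot prove its identity
            props.setProperty("sslrootcert", "/etc/ssl/rds-ca-bundle.pem");
            return DriverManager.getConnection(
                    "jdbc:postgresql://orders-db.example.internal:5432/orders", props);
        }
    }

The third requirement - stopping other parties from reading or writing - is mostly a matter of network rules and database permissions rather than the connection itself, which leads into the next section.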

Data Security - Access Rights to Tables and Stored Procedures

The tables in the database will contain information that the service should only be permitted to perform specific types of actions on. For example, there may be a table containing a list of locations that the service needs to be able to read but has no requirement to write to or delete from.

My preference here ties back to the database setup and structure updates section above, as that stage would also set up the permissions required for the service's data access needs. Historically those permissions may have been granted directly to the user that the service connects as, but I'm hopeful that they can now be attached to a role, which can in turn be granted to the service's database access user.
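
To make that concrete, here is a hypothetical slice of those structural-update scripts expressed through JDBC, assuming PostgreSQL-style roles (the role, table and user names are invented for illustration):

    import java.sql.Connection;
    import java.sql.Statement;

    public class GrantServicePermissions {
        // Sketch: create a read-only role for a "locations" table and attach it
        // to the user that the service connects as. In practice this would live
        // in the same versioned migration scripts as the table definitions.
        static void apply(Connection adminConnection) throws Exception {
            try (Statement stmt = adminConnection.createStatement()) {
                stmt.execute("CREATE ROLE locations_reader");
                stmt.execute("GRANT SELECT ON locations TO locations_reader");
                // deliberately no INSERT / UPDATE / DELETE granted
                stmt.execute("GRANT locations_reader TO booking_service_user");
            }
        }
    }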

Why would I prefer role-based permissions over user-based permissions? Surely the service is only connecting as one user?

Yes, and no - see the next section.

Credential Rotation

Another level of security around connectivity between the service and the database can be introduced by periodically changing the user that the service is connecting as. This ensures that if a third party somehow gets in and successfully guesses or otherwise obtains a set of credentials for accessing the database then those credentials will only be active and useful for a limited amount of time. So if there is an attack involving guessing credentials, phoning home, and then manual probing, then the credentials should no longer be useful by the time the attacker is ready to poke around.

Credential rotation can be a tricky feature to get right, as it involves coordination between:

  • the system that creates and applies the credentials into the database
  • the database connections in your application (possibly a connection pool)
  • the database system itself

The main timing problem that I have seen in an implementation of credential rotation involved the service failing to obtain fresh credentials before the previous ones had expired, resulting in data access errors around "permission denied". This situation was made worse by the fact that the query being used to check the health of the database connectivity was querying something that didn't involve permission checks (if I recall correctly).

If there is a choice between a third party solution and an implementation that your cloud provider provides, then my advice would be to go with the cloud provider's implementation unless it is orders of magnitude more expensive or known to be deficient.
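
On AWS the provider-supplied building block here is Secrets Manager with its managed rotation for RDS. A minimal sketch of the service-side half - fetching whatever credentials are currently valid instead of baking them into configuration - might look like the following, using the AWS SDK for Java v2 (the secret name is a placeholder, and I'm assuming the secret holds the usual "username" and "password" JSON fields):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient;
    import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest;

    public class RotatingDbCredentials {
        private static final ObjectMapper MAPPER = new ObjectMapper();
        private static final SecretsManagerClient SECRETS = SecretsManagerClient.create();

        // Sketch: read the current username/password whenever the connection pool
        // needs to build a new connection, so rotated credentials are picked up
        // without a redeploy or restart.
        static String[] currentCredentials() throws Exception {
            String secretJson = SECRETS.getSecretValue(
                    GetSecretValueRequest.builder()
                            .secretId("prod/booking-service/db")
                            .build())
                    .secretString();
            JsonNode node = MAPPER.readTree(secretJson);
            return new String[] { node.get("username").asText(), node.get("password").asText() };
        }
    }

The other lesson from the timing problem above: make the connectivity health check query something that actually exercises the service user's permissions, so that expired credentials show up in the health check rather than in live traffic.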

Database Connection Scalability

One thing that microservices are intended to be good for is scaling to cope with demand. If there is a big announcement or marketing campaign for your website then there may be a surge of user sessions that would exceed the normally provisioned resource capacity. With a bit of foresight and autoscaling configuration in place, the service(s) can have their underlying execution resources scaled up - be they Lambdas, or EC2 instances, or ECS tasks.

In the real world we only have finite resources, and that applies to relational database servers too, so if each service runtime environment operates with X database connections, then some multiple of X will ultimately exceed the maximum available connections that a database server or cluster of database servers will allow. If this happens then we will start to see new instances of our service fail to operate because they cannot obtain the desired resource.

A compromise can be reached in the scaling up situation by having a database connection proxy sit in between the database client - our microservice instances - and the database server. The proxy hides the fact that there is a limit to the maximum available connections on the server side by sharing the connections. The usual rules and common sense should apply: when a transaction is active on a connection it will not be available for sharing.

So far I haven't had any real world experience with using something like Amazon RDS Proxy, so I can't vouch for whether it can achieve much beyond buying us time before we need to scale up the underlying RDS sizing.

Saturday, 7 December 2019

Handling database connections while scaling up - RDS Proxy

Earlier this week I posted about some aspects to consider when designing the mechanisms to scale microservices up and down.  One of the key takeaways was that an underlying database could impose a limit on how far compute resources can be scaled up due to its connection limits.

This afternoon I noticed a tweet referring to a new feature that AWS have announced that could simplify the management of database connections.  Amazon RDS Proxy is currently in preview and only available for RDS MySQL and Aurora MySQL in a limited number of regions.

I believe that it will go some way towards solving two problems that I have encountered:
  1. Prevent reaching maximum connection limit of database.
  2. Reduce warm up time for lambdas that connect to a database.
According to the documentation it is also intended to detect and handle failover situations for clustered databases.

I'll wait until some examples show up for Java before I decide whether this will work for the types of applications that I am used to developing.

Friday, 6 December 2019

Considerations when scaling microservices - up and down

Introduction

One of the big promises of applying a microservice based architecture is the ability to independently provide additional resources to components that need to scale up when spikes or peaks of load arrive.

There are some important differences between scaling compute resource and scaling stateful systems that should be kept in mind when designing how a system will maintain state.

Examples of autoscaling policies

Some examples of scaling strategies for microservices that I have worked on include:
  • increase the number of EC2 instances of the service when the average CPU load exceeds 70% for 5 minutes
  • increase the number of EC2 instances of the service when the minimum CPU credits falls below a threshold
  • increase the number of instances of an ECS task when the percentage of available connections is below a threshold (via a custom CloudWatch metric - see the sketch below)
These examples should be familiar to anyone who has developed scalable microservices on AWS.  They attempt to detect when a resource is approaching its capacity and anticipate that by adding some additional compute resource the system can continue to cope without requiring any manual intervention by system operators / administrators.
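
For the third policy the metric has to come from the service itself. Here is a hedged sketch of publishing it with the AWS SDK for Java v2 - the namespace, metric name and the way the percentage is derived are all assumptions:

    import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
    import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
    import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
    import software.amazon.awssdk.services.cloudwatch.model.StandardUnit;

    public class ConnectionHeadroomMetric {
        private final CloudWatchClient cloudWatch = CloudWatchClient.create();

        // Sketch: periodically publish the percentage of connection slots still
        // available on this instance so that an ECS scaling policy can alarm on it.
        void publish(int connectionsInUse, int maxConnections) {
            double availablePercent = 100.0 * (maxConnections - connectionsInUse) / maxConnections;
            cloudWatch.putMetricData(PutMetricDataRequest.builder()
                    .namespace("BookingService")
                    .metricData(MetricDatum.builder()
                            .metricName("AvailableConnectionsPercent")
                            .unit(StandardUnit.PERCENT)
                            .value(availablePercent)
                            .build())
                    .build());
        }
    }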

Alongside the logic for scaling up we should also include some way of detecting when the allocated resources can be scaled down towards the predefined steady state level.

Scaling down

For the first two scaling strategies we had a fairly typical approach to scaling down - select a candidate for termination, stop new incoming requests from being directed to that instance, wait a fixed amount of time for existing connections to have their request processed (connection draining) then terminate the instance / task.

For the third example on my list, the connections in use by the service were being kept open to enable notifications to be transmitted to the client as part of a "realtime" synchronisation mechanism.  So, unlike the earlier examples, connection draining would not be expected to occur in a timely manner.  We had to build logic into the clients for them to re-establish the connection when the service instance that they had been utilising dropped out.

Scaling with state - a relational database example

Behind the client facing and middle tier microservices we inevitably have some databases which will have their own limitations for coping with incoming requests.  Maximum connections is the most direct one that I would expect to be a bottleneck for coping with scaled up numbers of services.

Typically each microservice instance will hold its own connection pool for sending requests to read, write, update or delete data in the backing database.  If our database has a maximum connections limit of 200 connections, and each instance of the microservice has a connection pool for 50 connections, then we will need to consider deeper scaling implications if the microservice's scaling logic approaches more than four instances (4 x 50 = 200).
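
That arithmetic is worth pinning down in configuration rather than leaving implicit. Assuming HikariCP as the connection pool (I haven't named one above, and the values and JDBC URL are illustrative), the per-instance limit is a single setting that needs to be agreed alongside the autoscaling maximum:

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    public class PooledDataSource {
        // Sketch: with a database max_connections of 200 and an autoscaling cap
        // of 4 instances, each instance can hold at most 50 connections (4 x 50 = 200).
        static HikariDataSource create() {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:postgresql://orders-db.example.internal:5432/orders");
            config.setUsername(System.getenv("DB_USER"));
            config.setPassword(System.getenv("DB_PASSWORD"));
            config.setMaximumPoolSize(50); // database max_connections / maximum service instances
            config.setMinimumIdle(5);      // keep a few warm connections for steady-state load
            return new HikariDataSource(config);
        }
    }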

In an environment where most database interactions involve reads there are some options to scale the database horizontally by setting up read replicas.  The gotcha to this is that you need to know in advance that the replicas will be worthwhile.  Due to the nature of the database being the source of the data required for any replication, it is best to create the replica(s) when the database is not under much load.

For systems that have many writes to the database the scaling options involve either partitioning the data across multiple databases, or vertically scaling by moving to a larger instance type.

Summary

  • Scaling of compute resource can be done quickly in response to recently measured load.
  • Some down-scaling approaches need to have a protocol in place for clients to stay active.
  • Where the compute resource involves interacting with databases there should be some restrictions in place to prevent the compute resources from flooding the databases or being unable to connect.
  • Scaling of relational databases needs to take place before it is needed.

Tuesday, 12 November 2019

Cyclic Dependencies in Microservices


What are they?
Simplest case: Service A calls on a resource of Service B, and Service B calls on a resource of Service A - BANG!  We have a cycle.

How can they happen?
Imagine a website that includes a feature for sharing images with the public or among groups of individuals.  The characteristics involved in uploading images are very different from the characteristics for managing permissions related to those images, so the responsibilities naturally get split out into separate services - one for the slower, longer running requests that upload the multi-megabyte data of images, another for managing whether the images are public, or private, or only visible to some specified groups.
Users can only share their images to groups that they are a member of, so there is a need for a service for managing groups.

Now we can consider two scenarios that approach the system from different directions and result in a cycle existing between two services.
On the one side we have the flow of someone uploading an image and applying some permissions for the groups that they want to share the image with.
On the other side is the situation of another user choosing to delete a group that they have admin rights to.
- Before creating permissions the system needs to validate that the user is a member of the groups.
- As part of cleanly deleting a group the system needs to ensure that there are no left over references to the group.



When might this be problematic?
As part of deploying a new service - or more commonly a new version of a service - it is sensible to verify that the service is ready to operate before it can start accepting incoming traffic.  This is typically achieved by having the service expose a readiness endpoint that only produces a positive response once the service has successfully completed initialisation and confirmed that it can reach its dependencies.

With a cycle between services we would face a deadlock situation as each would be waiting for the other to become available before declaring itself ready.
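
To make the deadlock concrete, here is a sketch of the kind of readiness endpoint I have in mind, using JAX-RS (the dependency client is invented for illustration). If the permissions service and the groups service each expose something like this and check each other, neither ever reports ready:

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.core.Response;

    @Path("/health/ready")
    public class ReadinessResource {

        interface GroupsServiceClient { boolean isHealthy(); } // hypothetical dependency client

        private final GroupsServiceClient groupsClient;

        public ReadinessResource(GroupsServiceClient groupsClient) {
            this.groupsClient = groupsClient;
        }

        // Sketch: only report ready once the dependency answers its own health
        // check. Two services doing this to each other can never become ready.
        @GET
        public Response ready() {
            if (groupsClient.isHealthy()) {
                return Response.ok("READY").build();
            }
            return Response.status(Response.Status.SERVICE_UNAVAILABLE).build();
        }
    }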

This is probably the most basic reason for avoiding having cycles in the call graph of a collection of microservices.  I can imagine a few scenarios where a team could find out that they have a cycle problem:
 - Starting up a full copy of the system for a new region
 - Re-starting after scheduled maintenance
 - Deploying a new version of two or more of the microservices at the same time.

The ideal approach is to avoid getting into the cycle situation in the first place, but "I told you so" isn't helpful advice, so let's also consider ways to reduce the difficulty and / or buy some time for adjustments.

A Strategy for avoiding cycles
De-couple the least time sensitive link in the call chain.

In the scenario outlined above we might consider deleting the permissions associated with a group as being a lower priority task.  The end users shouldn't see any impact from leftover permissions, so there is no need for them to wait for that processing to successfully complete.

Instead of groups calling permissions and requiring permissions to be up, we could introduce a dedicated notification topic which the groups service uses to announce deletion events.  The permissions service could subscribe to that topic via a queue to allow it to accumulate events for processing without having to be available at the point in time that the notification occurs.  Now, from the groups service's perspective, any secondary aspects of deleting a group become a fire and forget concern.  Any other service that is introduced that references groups and in turn needs to be aware when a group is deleted can apply the same subscription approach.
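
As a sketch of the groups service's side of that, assuming the topic is an SNS topic with queue subscriptions (I haven't committed to a particular messaging technology above) and using the AWS SDK for Java v2 - the topic ARN environment variable and payload shape are made up:

    import software.amazon.awssdk.services.sns.SnsClient;
    import software.amazon.awssdk.services.sns.model.PublishRequest;

    public class GroupDeletionNotifier {
        private final SnsClient sns = SnsClient.create();
        private final String topicArn = System.getenv("GROUP_EVENTS_TOPIC_ARN");

        // Sketch: fire-and-forget announcement that a group has been deleted.
        // The permissions service (and any future subscriber) consumes this from
        // its own queue, so it does not need to be up when the event is published.
        void groupDeleted(String groupId) {
            sns.publish(PublishRequest.builder()
                    .topicArn(topicArn)
                    .subject("GROUP_DELETED")
                    .message("{\"groupId\":\"" + groupId + "\"}")
                    .build());
        }
    }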

Friday, 8 November 2019

Moving to lambdas - not just a "lift and shift"

A while ago I came across a lambda that had been set up to run in AWS that had previously existed as an endpoint within a microservice running in ECS (Docker).  I needed to dig into the code to try to understand why an external cache was behaving strangely - but I can save that for another post.

The focus of this post is that what may have made sense in one environment doesn't necessarily hold true for AWS Lambdas.

Lambdas are intended to be short-lived and are charged at a fine level of time granularity so the most cost efficient way of running is to start quickly, perform the desired action and then stop.

Some code that I came across in this particular lambda involved setting up a large in memory cache capable of containing millions of items.  This would take a little time to set up, and based on my knowledge of the data involved it would not achieve a good hit rate for a typical batch of data being processed.

Another aspect of the code that had been carried over was the use of the external cache.  At initialisation time a cache client was set up to enable checking for the existence of records in Redis.  The client was only being closed by a shutdown hook of the Java application - which AWS Lambdas do not reach.  This resulted in client connections being kept around even after the lambda had completed, eventually causing the underlying Docker container to run out of file handles and be destroyed mid-processing.
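
For contrast, here is a hedged sketch of the shape that avoids that leak, assuming a Jedis client for Redis and the standard Lambda Java runtime interface (the handler types and pool settings are illustrative): build the expensive client once per execution environment and always return borrowed connections.

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.JedisPool;

    public class RecordCheckHandler implements RequestHandler<String, Boolean> {

        // Created once when the execution environment starts and reused across
        // invocations - not recreated (and never closed) on every call.
        private static final JedisPool POOL =
                new JedisPool(System.getenv("REDIS_HOST"), 6379);

        @Override
        public Boolean handleRequest(String recordKey, Context context) {
            // try-with-resources hands the connection back to the pool even on
            // failure, so file handles are not left dangling after the invocation.
            try (Jedis jedis = POOL.getResource()) {
                return jedis.exists(recordKey);
            }
        }
    }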

Upgrading the in-house microservice framework


This is a bit of a "note to self" but if you find it interesting let me know.

This isn't intended to be a deep dive into any particular technologies, but it might touch on some situations familiar to developers with a few years of experience in managing microservices.

Setting the scene
The main Java-based microservices had some common needs:
 - authorization
 - logging
 - service discovery
 - api documentation generation
 - accessing objects from the cloud provider

Naturally it made sense for all of this common functionality to be bundled together in one common framework and used everywhere - which is fine.

Unfortunately some awkward shortcuts were taken to achieve some of the functionality, which made upgrading the underlying open source framework impossible to achieve without introducing breaking changes.

Before I had joined the organisation a couple of attempts had already been made to get things back into shape, but they ended up being aborted as the developers came to the realisation that they could not make the necessary updates without breaking the build for most of the existing services.

I helped to persuade the management team that this inability to upgrade had to be addressed to enable us to avoid security issues and take advantage of performance improvements, so a team was formed.

My main coding contributions were:
 - migrating between JAX-RS versions
 - updating logging to continue to preserve correlation Ids since the underlying framework had changed significantly
 - migrating away from using the very out of date S3 client
 - repetitive small changes that didn't make sense for every team to learn about and apply themselves

Dependencies for tests should be scoped as "test"
Something that looked like a minor oversight turned into a couple of weeks of work for me.  Some dependencies that were needed for running unit tests had been specified with an incorrect scope, so instead of just being available in the framework during build time they were actually being bundled up and included in most of our microservices.

Changing the scope of a handful of dependencies only to realise that a dozen or more projects had been relying on that to bring in test support libraries for their builds made me a little unpopular with developers who had just completed work on new features and found that their builds were broken so they could not deploy to production.

This led to one of my first time sinks in the project.  The majority of the broken builds were for services that were not under active development, so the teams that owned them could not be expected to down tools on what they were actively working on to fix their services' dependencies.  Fortunately a migration of one of the test dependencies was a recommended part of preparing to upgrade the underlying framework, so I was able to get that change out of the way more quickly than if the individual teams had done this themselves.

Service discovery client changes
The existing service discovery mechanism involved service instances registering themselves with Zookeeper during deployment.  This allowed Zookeeper to know which instances were available, and allowed services to keep up to date about the available instances of services that they needed to call.
Some aspect of this setup was not working properly, so as a result each time a new version of a service was deployed we would see a spike of errors due to clients sending requests to instances that were not yet ready.

We had an alternative mechanism for services to reach each other by introducing fully qualified domain names pointing to the relevant load balancers, so removing the Zookeeper dependency and updating the various http clients to have some new configuration was in scope for this upgrade project.

A colleague from another team contributed by using his team's services as a proof of concept for migrating to the fully qualified domain names.

The initial approach was fine for those services as the clients could all be updated at the same time.  When it came time to apply the same type of change to other services we struck an issue whereby not all client libraries could be updated at the same time - so I had to introduce a bridging interface for the internal API client to allow old style clients and fully qualified domain name clients to co-exist.  This became my second time sink of the project, as once more we could not rely on every team having time made available by their product owners to address this migration work.

I saw the client service discovery migration work as being an area where specialization can speed things up.  Having one or two individuals applying the same mechanisms to make the necessary changes in dozens of places is much more time efficient than having dozens of individuals learn what is required and apply the change in two or three places each.  A couple of teams that did attempt to apply the changes themselves missed applying changes to the bundled default configuration for their client libraries - meaning additional configuration changes would need to be set up in each of the services that used their client libraries.

Not all services were created equal
Some services' build phases were more stable than others.  One particularly complex system had not been deployed for a couple of months before we came to apply the service discovery client changes.  The flaky nature of some of its integration tests left us in the dark about some broken builds for a few weeks.  It didn't help that the test suite would take over half an hour to run.  Eventually I realised that a client that I had configured in a hurry was missing one crucial property, and we were able to unblock that service and its associated client for release.

Libraries and versioning
Several services had a common type of sub-system to interact with - the obvious example being a relational database - so over the years some common additional functionality had been introduced into a supporting library.  Due to the lack of modularity in the underlying framework we found ourselves tying these support libraries to the framework - even for just being able to specify the database's connection properties as a configurable object.

I took the decision to treat the new version of the framework as a completely new artifact, which meant that each library also had to diverge from the existing versioning so that we would not have a situation of a service automatically picking up a new library version and bringing along a completely incompatible framework as a transitive dependency.

This got a bit of push-back from some developers as they started to look into what had been involved in successfully migrating services so far.  "Are we going to end up with a new artifact for the next  upgrade as well?" was a very fair question.  Like most things in technology the short answer is, "It depends."  My hope is that the current stable version of the underlying open source framework will only have minor upgrades for the next year or two.  Alongside this my expectation is that many of the existing microservices will be migrated to a different type of environment, where some of the functionality provided by the in-house framework will be dealt with in ways that are external to the service - e.g. sidecars in systems such as Kubernetes.

Monday, 28 October 2019

Ensuring DNS does not limit your microservices' ability to scale

Introduction
Before Kubernetes and service meshes became the mainstream way of deploying microservices, some organizations would set up their microservice infrastructure in much the same way as monolithic applications.  The part that I am interested in here is the top layer for receiving requests over HTTP - a load balancer sitting in front of a cluster of servers.

Simple deployments
The load balancer infrastructure can be manipulated to seamlessly upgrade a service by adding servers running the new version of the service to the cluster and removing the old versions.  The load balancer transparently forwards traffic to the live instances and passes responses back to the clients.  Services only reach each other via the receiving service's load balancer, so there is no need for additional service discovery systems.

Managed infrastructure implications
Unlike data centre setups with physical load balancers, in cloud environments such as AWS the environment will dynamically resize the load balancer - including the possibility of distributing it across multiple machines and multiple IP addresses.
To take advantage of the capacity that the load balancer offers, clients need to regularly look up the IP addresses of the load balancer.
 
Client DNS lookup performance
Historically some application runtimes would cache the IP addresses resolved for a hostname and continue to use them for prolonged periods of time - ignoring any hints on offer from the TTL of the DNS record.  Here are some of the potential problems of caching DNS for too long:
  • Load is not evenly distributed across the load balancer's provisioned hosts, making the load balancer a bottleneck even though it may have scaled up.
  • When the load balancer capacity is scaled down or moved to different addresses the client continues to attempt to reach it on a stale address.
Overriding the timeout for DNS lookups can reduce the risk of striking these types of issues, but brings with it the performance cost of additional calls to resolve the load balancer IP addresses.
One way to prevent lookup calls from increasing latency for service calls is to have the address resolution run asynchronously, such as on a dedicated scheduled thread that updates a local name resolution cache - provided that the refresh rate is aligned with the DNS TTL of course.
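
For the JVM specifically, the DNS cache lifetime is controlled by security properties rather than regular system properties. A minimal sketch of capping it - the 60 second value is an assumption and should be aligned with the load balancer's DNS TTL:

    import java.security.Security;

    public class DnsCacheSettings {
        // Sketch: cap the JVM's positive DNS cache so that clients re-resolve the
        // load balancer's addresses as it scales or moves. This needs to run
        // before any name resolution takes place, e.g. at the top of main().
        static void applyDnsTtl() {
            Security.setProperty("networkaddress.cache.ttl", "60");
            Security.setProperty("networkaddress.cache.negative.ttl", "10");
        }
    }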
