Introduction
Over the years I've provisioned and maintained relational databases to support a range of web based applications, starting off with monoliths that would run on a single server, and more recently for microservices that run across multiple servers and scale up and down dynamically based on the capacity and load of interactions shared across the instances.
Here I'd like to evaluate whether we can bring together all of the capabilities when deploying into a cloud environment, using AWS as that is the environment that I am most familiar with.
Database setup and structure updates
I would like the provisioning of the database and all subsequent structural updates - such as the creation of tables - to be handled independently of the microservice runtime.
The initial provisioning of the database should be a one-off task actionable using something like CloudFormation, or Terraform, or Ansible. These technologies should all also be capable of updating the configuration such as resizing of the database instance, or cluster, or storage space. This isn't anything particularly novel so I won't delve into this.
As a starting assumption, we can reasonably expect that the microservice that owns the data to be stored in the database will be deployed from a continuous deployment service or something like GitHub Actions.
The deployment lifecycle for a microservice may be split into something like the following stages and phases:
Stage One - making something that can be deployed
- Build: bring in dependencies, compile and run unit tests, assemble the service into its deployable shape - e.g. runnable jar or Docker image.
- Integration tests: Assemble mocks and / or real integration points with appropriate data and run the microservice with well established integration tests to verify behaviour meets expectations.
- Static checking of the unit of deployment, such as whether the underlying OS of the Docker image has known security issues.
- Phase zero of deployment: Push the deployable microservice artifact into the artifact repository / registry that will hold it as a deployable unit with a version or tag that will be uniquely identifiable and traceable back to this stage of building for this microservice.
Stage Two - deploying the artifact into the runtime environment
- Phase 1 of deployment: Applying structural database updates to the target deployment environment (staging / production)
- Phase 2 of deployment: Taking a copy of the deployable artifact and launching it into the target environment - e.g. staging or production
- Phase 3 of deployment: Smoke testing to verify that the newly released version of the microservice is fit for purpose and ready to be rolled out to replace the previous version
- Phase 4: Roll forward or roll back, the "go" / "no go" decision point
- Roll forward
- Notify any existing previous version of the microservice to finish processing what it has in flight, and shut down. For services driven by requests such as over http, load balancers should drain connections and stop sending requests to the old version of the running microservice.
- Scale up the instances of the microservice to cope with current established suitable load handling.
- Finish when all instances of the microservice are now running the specified version
- Roll backward
- If the new version of the microservice fails smoke tests or becomes unhealthy then we should back out of this release and revert back to the setup that was in place for the current live version. Terminate all of the new instances (most often we will reached this decision point whenonly one has been in place)
- Roll back the database changes (this requires discipline to ensure that such changes are reversible).
- Verify that the health of the microservice has recovered now that the previous known good version is back in full control of processing.
Connection Security
We want to be able to have confidence that the communication between our service and the database meets the following security requirements:
- the system that we are connected to is genuine and not some man in the middle
- the data that we receive back has not been corrupted, intentionally or unintentionally
- the database must not permit other parties to read from or write to it
- the data being transferred to and from the database must not be readable as plaintext
Basically, we need some encryption and some way of having the sender of data signify that it is authentic in a way that the receiver can verify it - preferably with minimal overhead.
Data Security - Access Rights to Tables and Stored Procedures
The tables in the database will contain information that the service only needs to be permitted to perform specific types of actions on, e.g. there may be a table containing a list of locations that the service needs to be able to read but has no requirement to be able to write or delete.
My preference here ties back to the database setup and structure updates section mentioned above, as that would also contain the setup of permissions that are required to be permitted for the service's data access needs. Historically that may have been associated with the user associated with the service, but I'm hopeful that this would now be associated with a role which can in turn be associated with the service's database access user.
Why would I prefer role-based permissions over user-based permissions? Surely the service is only connecting as one user?
Yes, and no - see the next section.
Credential Rotation
Another level of security around connectivity between the service and the database can be introduced by periodically changing the user that the service is connecting as. This ensures that if a third party somehow gets in and successfully guesses or otherwise obtains a set of credentials for accessing the database then those credentials will only be active and useful for a limited amount of time. So if there is an attack involving guessing credentials, phoning home, and then manual probing, then the credentials should no longer be useful by the time the attacker is ready to poke around.
Credential rotation can be a tricky feature to get right, as it involved coordination between:
- the system that creates and applies the credentials into the database
- the database connections in your application (possibly a connection pool)
- the database system itself
The main timing problem that I have seen in an implementation of credential rotation involved the service failing to obtain fresh credentials before the previous ones had expired, resulting in data access errors around "permission denied". This situation was made worse by the fact that a query being used to check the health of the database connectivity was querying something that didn't involve permission checks (if I recall correctly).
If there is a choice between a third party solution and an implementation that your cloud provider provides, then my advice would be to go with the cloud provider's implementation unless it is orders of magnitude more expensive or known to be deficient.
Database Connection Scalability
One thing that microservices are intended to be good for is scaling to cope with demand. If there is a big announcement or marketing campaign for your website then there may be a surge of user sessions that would exceed the normally provisioned resource capacity. With a bit of foresight and autoscaling configuration in place, the service(s) can have their underlying execution resources scaled up - be they Lambdas, or EC2 instances, or ECS tasks.
In the real world we only have finite resources, and that applies to relational database servers too, so if each service runtime environment operates with X database connections, then some multiple of X will ultimately exceed the maximum available connections that a database server of cluster of database servers will allow. If this happens then we will start to see new instances of our service fail to operate because they cannot obtain the desired resource.
A compromise can be reached in the scaling up situation, by having a database connection proxy sit in between the database client - our microservice instances - and the database server. The proxy hides the fact that there is a limitation to the maximum available connections on the server side by sharing the connections. The usual rules and common sense should apply, when a transaction is active on a connection it will not be available for sharing.
So far I haven't had any real world experience with using something like Amazon RDS proxy, so I can't vouch for whether it can achieve much beyond buying us time before we need to scale up the underlying RDS sizing.