How can they happen?
Imagine a website that includes a feature for sharing images with the public or among groups of individuals. The characteristics involved in uploading images are very different from those involved in managing permissions for those images, so the responsibilities naturally get split out into separate services - one for the slower, longer-running requests that upload the multi-megabyte image data, another for managing whether the images are public, private, or only visible to some specified groups.
Users can only share their images to groups that they are a member of, so there is a need for a service for managing groups.
Now we can consider two scenarios that approach the system from different directions and result in a cycle existing between two services.
On the one side we have the flow of someone uploading an image and applying some permissions for the groups that they want to share the image with.
On the other side is the situation of another user choosing to delete a group that they have admin rights to.
- Before creating permissions, the permissions service needs to ask the groups service to validate that the user is a member of the groups.
- As part of cleanly deleting a group, the groups service needs to ask the permissions service to ensure that there are no leftover references to the group.
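To make the cycle concrete, here is a minimal sketch that writes the two flows above as a call graph and walks it looking for a loop. The service names and the `find_cycle` helper are my own assumptions for illustration, not part of any real system.

```python
from typing import Dict, List, Optional

# Hypothetical call graph: each key calls the services in its list.
CALLS: Dict[str, List[str]] = {
    "permissions": ["groups"],   # validate group membership before creating permissions
    "groups": ["permissions"],   # clean up references when a group is deleted
}


def find_cycle(calls: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of service names, or None if there is none."""
    path: List[str] = []   # services on the current depth-first path
    done = set()           # services already fully explored

    def visit(service: str) -> Optional[List[str]]:
        if service in done:
            return None
        if service in path:
            # We came back to a service we are still exploring: that is a cycle.
            return path[path.index(service):] + [service]
        path.append(service)
        for callee in calls.get(service, []):
            cycle = visit(callee)
            if cycle:
                return cycle
        path.pop()
        done.add(service)
        return None

    for service in calls:
        cycle = visit(service)
        if cycle:
            return cycle
    return None
```

A check along these lines could run against whatever record of service dependencies a team already keeps, assuming such a record exists.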
When might this be problematic?
As part of deploying a new service - or more commonly a new version of a service - it is sensible to verify that the service is ready to operate before it can start accepting incoming traffic. This is typically achieved by having the service expose a readiness endpoint that only produces a positive response once the service has successfully completed initialisation and confirmed that it can reach its dependencies.
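As a rough sketch of the decision such an endpoint makes, the function below returns a status code and body for a readiness request. The function name, the shape of `dependency_checks`, and the choice of 200 and 503 as the positive and negative responses are assumptions for illustration; a real service would wire this into its own HTTP framework.

```python
from typing import Callable, Dict, Tuple


def readiness(
    initialised: bool,
    dependency_checks: Dict[str, Callable[[], bool]],
) -> Tuple[int, dict]:
    """Return an (HTTP status, body) pair for a hypothetical readiness endpoint."""
    if not initialised:
        # Still starting up: do not accept traffic yet.
        return 503, {"ready": False, "reason": "initialising"}
    # Each check is assumed to return True when the dependency can be reached.
    unreachable = sorted(
        name for name, check in dependency_checks.items() if not check()
    )
    if unreachable:
        return 503, {"ready": False, "unreachable": unreachable}
    return 200, {"ready": True}
```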
With a cycle between services we would face a deadlock situation as each would be waiting for the other to become available before declaring itself ready.
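The deadlock can be shown with a small simulation, assuming the simplified rule that a service reports ready only once every service it depends on is already ready. The `ready_services` helper and the dependency maps are hypothetical.

```python
from typing import Dict, List, Set


def ready_services(dependencies: Dict[str, List[str]]) -> Set[str]:
    """Simulate startup and return the services that ever become ready."""
    ready: Set[str] = set()
    progressed = True
    while progressed:
        progressed = False
        for service, deps in dependencies.items():
            if service not in ready and all(dep in ready for dep in deps):
                ready.add(service)
                progressed = True
    return ready


# With the cycle, each service waits on the other and nothing becomes ready.
cyclic = {"permissions": ["groups"], "groups": ["permissions"]}
stuck = ready_services(cyclic)

# With one direction of the dependency removed, both services start.
acyclic = {"permissions": ["groups"], "groups": []}
started = ready_services(acyclic)
```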
This is probably the most basic reason for avoiding having cycles in the call graph of a collection of microservices. I can imagine a few scenarios where a team could find out that they have a cycle problem:
- Starting up a full copy of the system for a new region
- Re-starting after scheduled maintenance
- Deploying a new version of two or more of the microservices at the same time
The ideal approach is to avoid getting into the cycle situation in the first place, but "I told you so" isn't helpful advice, so let's also consider ways to reduce the difficulty and/or buy some time for adjustments.
A strategy for avoiding cycles
Decouple the least time-sensitive link in the call chain.
In the scenario outlined above we might consider deleting the permissions associated with a group as being a lower-priority task. The end users shouldn't see any impact from leftover permissions, so there is no need for them to wait for that processing to successfully complete.
Instead of groups calling permissions and requiring permissions to be up, we could introduce a dedicated notification topic which the groups service uses to announce deletion events. The permissions service could subscribe to that topic via a queue, allowing events to accumulate for processing without the service having to be available at the moment the notification occurs. Now, from the groups service's perspective, any secondary aspects of deleting a group become a fire-and-forget concern. Any other service introduced later that references groups, and so needs to know when a group is deleted, can apply the same subscription approach.
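Here is a minimal in-memory sketch of that arrangement. The `Topic`, `GroupsService` and `PermissionsService` classes, the event shape, and the sample data are all hypothetical stand-ins; a real system would use a managed messaging service rather than a `deque`.

```python
from collections import deque
from typing import Deque, Dict, List, Set


class Topic:
    """Hypothetical notification topic that copies each event to every subscriber queue."""

    def __init__(self) -> None:
        self._queues: List[Deque[dict]] = []

    def subscribe(self) -> Deque[dict]:
        queue: Deque[dict] = deque()
        self._queues.append(queue)
        return queue

    def publish(self, event: dict) -> None:
        for queue in self._queues:
            queue.append(event)


class GroupsService:
    def __init__(self, group_deleted: Topic) -> None:
        self._group_deleted = group_deleted
        self.groups: Set[str] = {"hiking-club", "family"}

    def delete_group(self, group_id: str) -> None:
        self.groups.discard(group_id)
        # Fire and forget: announce the deletion, no call to permissions.
        self._group_deleted.publish({"type": "group_deleted", "group_id": group_id})


class PermissionsService:
    def __init__(self, inbox: Deque[dict]) -> None:
        self._inbox = inbox
        # Image id -> groups allowed to view it (sample data).
        self.image_groups: Dict[str, Set[str]] = {"img-1": {"hiking-club", "family"}}

    def drain_inbox(self) -> int:
        """Process queued deletion events and return how many were handled."""
        handled = 0
        while self._inbox:
            event = self._inbox.popleft()
            for groups in self.image_groups.values():
                groups.discard(event["group_id"])
            handled += 1
        return handled


topic = Topic()
inbox = topic.subscribe()             # the queue outlives any permissions instance
groups = GroupsService(topic)
groups.delete_group("hiking-club")    # succeeds while permissions is not running
queued_while_down = len(inbox)

permissions = PermissionsService(inbox)   # permissions starts later and catches up
handled = permissions.drain_inbox()
```

The point of the design is that the queue, not the permissions service, has to exist when the group is deleted, so the groups service no longer needs permissions to be up in order to start or to do its work.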