This blog post briefly explains canary deployments. The content doesn't assume the use of specific technologies, although the examples reference Kubernetes concepts.
What are canary deployments?
In essence, canary deployments refer to a gradual rollout of a new software version. The term canary deployment is derived from the use of canaries as a warning signal in mines. Dangerous gases would kill the canaries before the miners, thereby alerting the miners. With canary deployments, the canary is the subset of users that gets exposed to the new version before it is rolled out to all users.
The diagram below illustrates a canary deployment that gradually replaces the stable version in three steps.
Some percentage of the service traffic is routed to the canary deployment. If problems are detected via automated/manual testing or monitoring, the deployment is rolled back. Otherwise, the traffic percentage sent to the canary deployment is gradually increased until the stable version is completely replaced. At this point, the canary version can be promoted to the stable version. The underlying mechanisms used for canary deployments can also be applied to concerns other than the rollout of a new version, for example A/B testing.
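The control flow described above can be sketched in a few lines of Python. This is only an illustration, not a real deployment tool: the function name, the step percentages and the `is_healthy` hook (a stand-in for testing or monitoring) are all assumptions made for this example.

```python
from typing import Callable, Sequence


def run_canary_rollout(
    steps: Sequence[int],
    is_healthy: Callable[[int], bool],
) -> tuple[str, list[int]]:
    """Walk through the canary traffic percentages in `steps`.

    `is_healthy(percent)` is a hypothetical hook standing in for
    automated/manual testing or monitoring at that traffic share.
    Returns the outcome and the percentages that were applied.
    """
    applied: list[int] = []
    for percent in steps:
        applied.append(percent)  # route `percent` % of traffic to the canary
        if not is_healthy(percent):
            applied.append(0)  # roll back: all traffic returns to stable
            return "rolled_back", applied
    # All steps passed; with a final step of 100 the canary can be promoted.
    return "promoted", applied


print(run_canary_rollout([10, 50, 100], lambda p: True))
# ('promoted', [10, 50, 100])
print(run_canary_rollout([10, 50, 100], lambda p: p < 50))
# ('rolled_back', [10, 50, 0])
```

The three steps mirror the three-step rollout mentioned earlier; the concrete percentages are arbitrary.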
Requirements for canary deployments
The use of canary deployments requires the overall system to be able to deal with the stable and canary versions at the same time. This is also the case if only rolling updates are used. Database migrations, breaking changes in interfaces etc. have to be handled with care. For example, a common approach to the rollout of breaking changes in interfaces is splitting the change into multiple steps according to the "expand and contract" pattern. First, the service is expanded in such a way that the service consumers are able to function without changes. Next, the consumers are migrated to the new interface. Lastly, the old interface is removed.
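To make the "expand and contract" pattern more concrete, here is a minimal sketch that renames a response field. The field names `name` and `full_name` as well as all function names are invented for this example.

```python
def response_before(user: dict) -> dict:
    """Original interface: consumers read `name`."""
    return {"name": user["name"]}


def response_expanded(user: dict) -> dict:
    """Expand: serve the old and the new field side by side."""
    return {"name": user["name"], "full_name": user["name"]}


def response_contracted(user: dict) -> dict:
    """Contract: drop the old field once no consumer reads it anymore."""
    return {"full_name": user["name"]}


def old_consumer(response: dict) -> str:
    return response["name"]


def new_consumer(response: dict) -> str:
    return response["full_name"]


user = {"name": "Ada"}
# During the expand phase, old and migrated consumers both work ...
expanded = response_expanded(user)
assert old_consumer(expanded) == new_consumer(expanded) == "Ada"
# ... after the contract phase, only migrated consumers do.
assert new_consumer(response_contracted(user)) == "Ada"
```

While the expanded version is live, a stable and a canary version of the service can coexist regardless of which interface a given consumer already uses.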
The second major requirement is the ability to route a configurable percentage of traffic to the canary version; without it, exposure to the new version can neither be limited nor increased step by step. Keep in mind that requests from the same user could hit canary instances sometimes and stable instances at other times. This behaviour can be mitigated with suitable load balancing strategies such as sticky sessions.
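One way to keep a user on the same version is to derive the routing decision from a stable attribute such as the user ID. The following sketch assumes that such an ID is available on every request; it illustrates the idea and is not how any particular load balancer implements sticky sessions.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically map a user to a bucket in [0, 100).

    The same user ID always lands in the same bucket, so a user only
    moves from stable to canary when `canary_percent` is raised past
    their bucket, never back and forth between requests.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_percent


# Repeated requests from one user get the same answer.
first = routes_to_canary("user-42", 10)
assert all(routes_to_canary("user-42", 10) == first for _ in range(5))
```

Because buckets are compared against a threshold, raising the percentage only ever adds users to the canary group.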
Implementing canary deployments
Kubernetes does a great job with rolling updates. Canary deployments, on the other hand, are a bit harder as they aren't treated as first-class citizens. One possible approach looks like this:
- Create a new Kubernetes deployment for the canary version
- Gradually increase the number of canary replicas while gradually decreasing the number of stable replicas
If the canary deployment and the stable deployment have common labels, the corresponding Kubernetes service will round-robin traffic across all canary and stable replicas.
As a consequence, the amount of traffic sent to the canary replicas is controlled by the ratio of canary pods to stable pods, which causes several issues. Routing a small share of traffic, e.g. 1%, to the canary pods requires a large number of pods. Furthermore, autoscaling is problematic because the two Kubernetes deployments have separate autoscalers. Dynamic scaling changes the canary-to-stable pod ratio and therefore the share of requests that are sent to the canary deployment.
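A quick calculation shows both problems. The sketch assumes that traffic is spread evenly across all pods; the function names and pod counts are made up for illustration.

```python
import math


def min_total_pods(canary_percent: int) -> int:
    """Smallest total pod count at which a single canary pod receives
    at most `canary_percent` % of evenly distributed traffic."""
    return math.ceil(100 / canary_percent)


def canary_share(canary_pods: int, stable_pods: int) -> float:
    """Fraction of evenly distributed traffic that hits canary pods."""
    return canary_pods / (canary_pods + stable_pods)


print(min_total_pods(1))  # 100 pods are needed for a 1% canary
print(min_total_pods(5))  # 20

# Assumed scenario: the stable autoscaler scales in from 99 to 19 pods
# while the canary keeps a single pod.
print(f"{canary_share(1, 99):.0%}")  # 1%
print(f"{canary_share(1, 19):.0%}")  # 5%
```

In the second scenario nobody changed the rollout, yet the canary's traffic share increased fivefold.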
The solution is decoupling the traffic distribution across service versions from the replica ratio. Service meshes such as Istio provide this capability. With a service mesh, advanced traffic routing based on some criteria (e.g. user ID) is easily possible, too.
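The effect of this decoupling can be illustrated with a toy router that first picks a version by weight and only then picks a pod within that version. This is a simplified model and not Istio's API: the `WeightedRouter` class, the pod names and the deterministic counter (real proxies typically pick at random according to the weights) are assumptions of this sketch.

```python
from itertools import count


class WeightedRouter:
    """Picks a version by weight first, then a pod within that version."""

    def __init__(self, canary_weight: int, pods: dict[str, list[str]]):
        self.canary_weight = canary_weight  # percent of requests for the canary
        self.pods = pods                    # version name -> pod names
        self._requests = count()
        self._next_pod = {version: 0 for version in pods}

    def route(self) -> tuple[str, str]:
        n = next(self._requests)
        # Out of every 100 requests, the first `canary_weight` go to the canary.
        version = "canary" if n % 100 < self.canary_weight else "stable"
        replicas = self.pods[version]
        # Round-robin across the replicas of the chosen version only.
        index = self._next_pod[version] % len(replicas)
        self._next_pod[version] += 1
        return version, replicas[index]


def canary_requests(router: WeightedRouter, total: int) -> int:
    return sum(router.route()[0] == "canary" for _ in range(total))


few = WeightedRouter(1, {"canary": ["c-0"], "stable": ["s-0", "s-1"]})
many = WeightedRouter(1, {"canary": ["c-0"], "stable": [f"s-{i}" for i in range(50)]})
print(canary_requests(few, 1000), canary_requests(many, 1000))  # 10 10
```

The canary receives 1% of the requests with two stable pods as well as with fifty, so autoscaling either deployment no longer shifts the traffic split.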
Canary deployments increase confidence when updating services and reduce risks. Performance issues and bugs can be detected before all users are impacted. To implement canary deployments, the overall system needs to be evolved in a way that allows running multiple versions at once and allows performing rollbacks.
Decoupling the traffic distribution across service versions from the ratio of canary to stable replicas is essential for reliable canary deployments, especially if autoscaling is used. Service meshes like Istio can be used to achieve this decoupling and additionally enable advanced traffic routing based on some criteria (e.g. user ID).