Why spend innovation tokens on a service mesh?

Service meshes are a confusing and somewhat hyped-up technology, but they have real benefits. Let's enumerate them concisely.

A service mesh like Istio, Kuma, or Linkerd will improve the observability, security, and traffic management of your cluster. I believe traffic management is where this particular technology really shines.

But real quick—

What's a service mesh?

In short, it is a fleet of proxies in front of all of your services, plus a few processes that manage them.

They're almost always implemented via sidecar proxies. A sidecar container is injected into each pod in Kubernetes, where it acts as the gatekeeper for all network traffic: it intercepts every request before it enters and after it leaves a service.

Together, the sidecar proxies are called the "data plane". A separate piece, the "control plane", communicates with the data plane and manages it all.


As for observability, service meshes generate a lot of useful metrics.

Think about it: a proxy wraps every request before it enters and after it leaves a service, so it can record request counts, durations, response codes, and open connections. That's at least 3.5 of the 4 golden signals (traffic, latency, errors, and a little bit of saturation).
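
To make that concrete, here is a minimal Python sketch of how three of those signals could be derived from per-request records. The `RequestRecord` shape and the `golden_signals` function are hypothetical names for illustration; real meshes export these as metrics rather than exposing records like this.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One request as a sidecar proxy might record it (hypothetical shape)."""
    duration_ms: float
    status_code: int


def golden_signals(records: list[RequestRecord], window_seconds: float) -> dict[str, float]:
    """Derive traffic, latency, and error signals from raw request records."""
    if not records:
        return {"requests_per_second": 0.0, "p95_latency_ms": 0.0, "error_rate": 0.0}
    durations = sorted(r.duration_ms for r in records)
    # Nearest-rank p95: the smallest duration that covers at least 95% of requests.
    rank = (95 * len(durations) + 99) // 100
    errors = sum(1 for r in records if r.status_code >= 500)
    return {
        "requests_per_second": len(records) / window_seconds,  # traffic
        "p95_latency_ms": durations[rank - 1],                 # latency
        "error_rate": errors / len(records),                   # errors
    }
```

Saturation is the signal this sketch leaves out, since request records alone only hint at it (for example, through open connection counts).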

You can also get distributed traces, which give a detailed view of everything a single call into your system entails.

Just to give you an idea of the cool stuff you can derive, check out these network maps.


As for security, a service mesh lets you shut down all east-west traffic by default, and then explicitly approve certain pathways (e.g. the reservation service is allowed to call the restaurant service, but nothing else).
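
The default-deny idea can be sketched in a few lines of Python. This is only an illustration of the decision a proxy makes, not how any mesh implements it (Istio, for instance, expresses such rules as `AuthorizationPolicy` resources). The service names and the `is_allowed` helper are hypothetical.

```python
# Explicitly approved (source, destination) pairs; service names are hypothetical.
ALLOWED_CALLS = {
    ("reservation", "restaurant"),
}


def is_allowed(source: str, destination: str) -> bool:
    """Default deny: a call passes only if its pair was explicitly approved."""
    return (source, destination) in ALLOWED_CALLS
```

Note that approval is directional: allowing `reservation` to call `restaurant` does not allow the reverse.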

You can also enable mTLS, so east-west traffic is not only encrypted but also authenticated.

Traffic Management

Timeouts, retries, and circuit breaking

You can set request timeouts, have retries happen outside of your application code (this alone might be worth the innovation tokens), and put circuit breaking in place (i.e. if one instance of a service is returning a lot of 500s, stop routing traffic to it).
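
For a feel of what the mesh takes off your hands, here is a rough Python sketch of retries and circuit breaking as you might otherwise write them in application code. It is a simplified illustration under assumed defaults, not how any particular proxy implements these features; `CircuitBreaker`, `with_retries`, and their parameters are hypothetical names.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to send a call."""


class CircuitBreaker:
    """Stops calling a failing target after too many consecutive failures."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open, skipping call")
            # Cool-down elapsed: allow one trial call; a single failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success resets the failure count
        return result


def with_retries(func, attempts=3, backoff_s=0.0):
    """Call func up to `attempts` times, re-raising the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)  # linear backoff between tries
```

With a mesh, the equivalent behavior is declared as configuration and enforced by the sidecar, so every service gets it without carrying code like this.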

Versioning enables advanced deployment patterns

Personally, this is why I'm so excited about service meshes.

With a service mesh, you can intelligently route traffic to multiple versions of a service.

So you could route based on a header. Or a query string parameter. Or you could make it weight-based, sending 99% of traffic to the normal version and 1% to the canary version.
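
As a sketch of the routing decision itself, the Python below combines a header rule with a weight-based rule. The `x-canary` header, the `pick_version` function, and the version names are assumptions made for illustration; in a real mesh you would declare these rules as configuration instead of writing code.

```python
import random


def pick_version(headers: dict[str, str], canary_weight: float = 0.01, rng=random.random) -> str:
    """Choose which version of a service should receive a request."""
    # Header-based rule: an explicit opt-in always goes to the canary.
    if headers.get("x-canary") == "true":
        return "canary"
    # Weight-based rule: a random draw sends roughly canary_weight of traffic to the canary.
    return "canary" if rng() < canary_weight else "stable"
```

The `rng` parameter exists only so the random draw can be replaced with a fixed value when testing.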

The service mesh gives us the right abstractions (I'm thinking of Istio's VirtualService) to experiment with different forms of canary deployments within the same cluster.

If you want to see these patterns taken to their logical conclusion, take a look at Flagger. This project provides features on top of a service mesh, so you can do things like progressive traffic shifting.
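
Progressive traffic shifting can be pictured as a loop that raises the canary's share of traffic while its metrics stay healthy. The Python below is a toy model of that idea, not Flagger's implementation or configuration; `progressive_shift`, the `error_rate_at` callback, and the default thresholds are all hypothetical.

```python
def progressive_shift(error_rate_at, step=10, max_weight=50, max_error_rate=0.01):
    """Raise the canary's traffic share step by step, rolling back on bad metrics.

    error_rate_at(weight) is a hypothetical callback returning the canary's
    observed error rate while it receives `weight` percent of traffic.
    Returns (outcome, history_of_weights).
    """
    history = []
    weight = step
    while weight <= max_weight:
        history.append(weight)
        if error_rate_at(weight) > max_error_rate:
            history.append(0)  # send all traffic back to the stable version
            return "rolled_back", history
        weight += step
    return "promoted", history
```

A healthy canary walks through every step and is promoted. One that starts failing partway through is cut back to zero traffic.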


When you spend innovation tokens on a service mesh, you get more visibility into what's happening in your distributed system. You get a more secure system. You enable less risky deployments, so going to production can be a non-event, and you can release faster.