Cover photo: unsplash.com

How to Elegantly Handle Volatility in Microservices?

Sagar Rao
4 min read · Sep 1, 2018

Transient faults are one of the biggest problems in microservices. The distributed nature of microservices makes them susceptible to temporary blips (transient faults). These failures can happen at any time, disturbing developers' sleep at night and demanding their attention in the morning. In most cases they are false alarms, and they can cause alert fatigue for developers. Chances are that real issues get misinterpreted as transient faults and ignored completely, leading to performance and speed issues.

What is a Transient Fault?

According to Microsoft Docs, “Transient faults include the momentary loss of network connectivity to components and services, the temporary unavailability of a service, or timeouts that arise when a service is busy.”

A transient fault is a bit like diabetes. There is no cure for diabetes; it can only be managed. A diabetic person requires frequent check-ups and dosage adjustments. Depending on the severity of the case, some patients are prescribed insulin injections and others tablets.

Similarly, transient faults have no cure. They can only be managed. A distributed microservice requires frequent fault checks, and the faults must be handled accordingly. Depending on the severity of the case, some faults are handled reactively and others proactively. Reactive handling means retrying a transient fault a couple of times until it disappears. Proactive handling means breaking the circuit before it leads to cascading failures.

Why do Transient Faults happen?

In the microservices world, we divide and conquer our domains into functional pieces. In some cases these microservices are deployed within the same virtual machines and data centers. In other cases they are deployed in different virtual machines, data centers, or even countries. The cloud is the best example of such a distributed environment.

In such cases, communication among microservices happens over the internet. Data is divided into smaller packets at the source computer, and internet routers send those packets along various paths to the destination. All the packets meet up at the destination and are reassembled into meaningful data.

But transient faults occur because the paths these packets follow are subject to the forces of nature. Like any other system, internet routers are not perfect. They undergo routine checks and upgrades, or at worst sit under a storm or an earthquake. This leads to lost or corrupt packets, and thus to a corrupted HTTP request that cannot be understood by the destination server.

Transient faults can also happen because destination servers are subject to the same forces of nature. Like any other system, servers are not perfect. They undergo routine checks and upgrades, or at worst sit under a storm or an earthquake, leaving them too busy to process requests.

How to manage Transient Faults?

Polly is a transient-fault-handling library for .NET applications. It improves the resilience of microservices by healing faults with pre-defined policies. First it identifies the type of fault, and then it heals it with retry or circuit-breaker patterns.

Polly is the .NET equivalent of Hystrix for Java. At its core, it is a wrapper around try-catch with the math around how often, and how many times, to retry. It is better than a plain try-catch because you can apply advanced retry techniques without having to re-invent the wheel.
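To make that concrete, here is a minimal sketch of a basic retry policy in C#. The endpoint URL, method name, and retry count are placeholders for illustration, not recommendations.

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;

class RetryExample
{
    static readonly HttpClient Http = new HttpClient();

    // Handle HttpRequestException and retry the call up to 3 times.
    // The URL stands in for any flaky downstream dependency.
    static Task<string> GetOrdersAsync()
    {
        var retryPolicy = Policy
            .Handle<HttpRequestException>()
            .RetryAsync(3);

        return retryPolicy.ExecuteAsync(() =>
            Http.GetStringAsync("https://example.com/api/orders"));
    }
}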

Reactive policies:

You can retry reactively by backing off exponentially. In doing so we spread our retries over 1, 2, 4, 8, and 16 seconds to allow the destination server to cool down before we hammer it again. A hand-rolled try-catch retries within a couple of milliseconds of a failure, but Polly spreads the attempts out more evenly.
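A rough sketch of such an exponential back-off using Polly's WaitAndRetryAsync; the exception type and retry count are illustrative assumptions.

using System;
using System.Net.Http;
using Polly;

// Retry 5 times, waiting 1, 2, 4, 8, and 16 seconds between attempts,
// so the destination server gets time to cool down.
var backOffPolicy = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(
        retryCount: 5,
        sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));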

Proactive policies:

Circuit Breaker is a proactive policy that measures the number of calls to a dependency and checks for exceptions. When a failure is detected, the calling code will not even try again for a pre-configured period of time. The calling microservice can abort the call and inform its upstream applications, letting the system follow a fail-over strategy and do something else instead. In doing so, microservices fail fast and give the underlying system time to recover from the failure instead of banging it with more retries. By degrading gracefully, we let our users know early that something has gone wrong instead of flooding the infrastructure with retries, and we stop cascading failures from spreading further.
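A minimal circuit-breaker sketch; the thresholds and exception type here are assumptions chosen for illustration.

using System;
using System.Net.Http;
using Polly;
using Polly.CircuitBreaker;

// Open the circuit after 3 consecutive failures and keep it open for 30 seconds.
var breakerPolicy = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(
        exceptionsAllowedBeforeBreaking: 3,
        durationOfBreak: TimeSpan.FromSeconds(30));

// While the circuit is open, ExecuteAsync throws BrokenCircuitException
// immediately, without calling the dependency, so the caller can fail fast
// or fall back to a degraded response.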

What are the Benefits of Polly?

Polly is a free, open-source library.

Polly can wrap HTTP calls, and also database, WSDL, and messaging calls. Polly can manage pretty much any external dependency.

Policies are defined in one place and used everywhere in the project. You can register the policies in the DI container for reuse everywhere, as sketched below.
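For example, in an ASP.NET Core project a policy can be defined once and attached to a named HttpClient through the Microsoft.Extensions.Http.Polly package. This is a sketch under that assumption; the client name "orders" and the policy values are placeholders.

using System;
using Microsoft.Extensions.DependencyInjection;
using Polly;
using Polly.Extensions.Http;

public static class PollyRegistration
{
    public static void AddResilientClients(IServiceCollection services)
    {
        // Defined once: handle 5xx, 408, and HttpRequestException with back-off retries.
        var retryPolicy = HttpPolicyExtensions
            .HandleTransientHttpError()
            .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));

        // Reused everywhere: any consumer that resolves this named client gets the policy.
        services.AddHttpClient("orders")
                .AddPolicyHandler(retryPolicy);
    }
}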

Reactive and proactive policies can also be combined. For example, a Retry policy can be combined with a Circuit Breaker policy: we retry a couple of times with back-off, and back out immediately once persistent failures are detected.
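A sketch of wrapping the two together with Policy.WrapAsync; the values are illustrative, and the outermost policy is listed first.

using System;
using System.Net.Http;
using Polly;

// Inner policy: retry with exponential back-off.
var retry = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));

// Outer policy: break the circuit once failures keep getting through the retries.
var breaker = Policy
    .Handle<HttpRequestException>()
    .CircuitBreakerAsync(exceptionsAllowedBeforeBreaking: 3, durationOfBreak: TimeSpan.FromSeconds(30));

// WrapAsync takes the outermost policy first: breaker around retry.
var resilientPolicy = Policy.WrapAsync(breaker, retry);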

How to get started with Polly?

The best way to get started with Polly is to run the sample demo project locally. The samples have elaborate code comments and intuitive examples, and the sample code runs against a faulting server that simulates failures for the sake of the demo.

If you are interested in tweaking the Polly source code, you can download it from the Polly GitHub project.

You can also consume the binaries from the Polly NuGet package. For detailed discussions of transient-fault handling and further Polly patterns, see the Polly wiki.
