I have written code without tests that ran in production without defects, and I have written buggy code with TDD (Test-Driven Development). Time to look back at 35 years of coding and ask when tests help, when there is something better, and especially, what these better things are.
In this post, we look at what we can do to recover well even if a defect finds its way into production.
Make it easier to recover from a defect
Slicing
In case we still have a defect, we want to limit its blast radius, the amount of code that could be affected by a needed fix. Slicing our code into mostly independent vertical slices introduces boundaries that prevent defect fixes from affecting large portions of code. A fix should typically stay within a single slice. Not always, but most of the time.

In a bigger system, slices are, however, not as simple as shown above. It typically makes more sense to build the system out of fractal slices – slices within slices:

Of course, there are other approaches to modularity with the same goal. I find that fractal slices best match our needs. Choose the best option for your solution, though.
Please note that my use of the word “slice” differs from that in Vertical Slice Architecture (VSA). I use fractal slices; VSA does not.
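The slicing idea can be sketched in code. A minimal sketch, assuming two hypothetical features (all class and method names are illustrative, not from the original post): each slice bundles its handling logic and its data access behind one boundary and shares nothing with the other, so a defect fix in one slice cannot ripple into the next.

```python
# Two independent vertical slices (hypothetical names). Each slice owns its
# logic and its storage; a fix stays inside the slice's boundary.

class RegisterUserSlice:
    """Everything the 'register user' feature needs lives in this slice."""

    def __init__(self) -> None:
        self._users: dict[str, str] = {}  # storage private to this slice

    def handle(self, user_id: str, name: str) -> str:
        self._users[user_id] = name
        return f"registered {name}"


class PlaceOrderSlice:
    """A second feature: no code or state shared with the first slice."""

    def __init__(self) -> None:
        self._orders: list[tuple[str, str]] = []

    def handle(self, user_id: str, article: str) -> str:
        self._orders.append((user_id, article))
        return f"order for {article} placed"
```

A bug fix in `PlaceOrderSlice` can, by construction, not break `RegisterUserSlice` – that is the blast-radius limit the boundary buys us.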
Often, when slices get really thin, a single test is enough to ensure a slice works correctly. Maybe we can extract some small-scope tests for an algorithmic part, but otherwise the whole slice is covered well enough by a single test. We often even have tests that exercise several slices (the smallest kind) to further reduce the number of tests. A single test may create an entity, update it, and finally delete it. These so-called lifetime tests reduce duplication of test code, making tests easier to maintain.
Recover from defects with Event Sourcing
One of the worst things a defect can damage is data. Once data is corrupted, it becomes very difficult to recover.
One approach to lower this risk is to store not (only) the latest state, but the whole series of (immutable) events that led up to the current state – by using Event Sourcing. Events are appended to the event stream and (almost*) never updated, so there is no chance of corrupting them with a poorly done update. When the projected state becomes corrupt, we can replay the events to restore it.
*Almost, because we do change existing events, for example with migrations, so that we don’t have to support multiple versions of the same event.
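The append-and-replay idea can be sketched in a few lines. A minimal sketch, assuming hypothetical `Deposited`/`Withdrawn` events for a bank-account stream (my example, not the post's): events are immutable and only ever appended; the current state is a projection that can always be rebuilt from the stream.

```python
# A minimal event-sourcing sketch: immutable events, an append-only stream,
# and a projection that is rebuilt by replaying the events.

from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: events never change once appended
class Deposited:
    amount: int


@dataclass(frozen=True)
class Withdrawn:
    amount: int


def project_balance(events: list) -> int:
    """Replay the full event stream to (re)build the current state."""
    balance = 0
    for event in events:
        if isinstance(event, Deposited):
            balance += event.amount
        elif isinstance(event, Withdrawn):
            balance -= event.amount
    return balance


stream = []                    # the append-only event stream
stream.append(Deposited(100))
stream.append(Withdrawn(30))

balance = project_balance(stream)
# If the projected state ever becomes corrupt, we simply replay the stream
# and get the same state back:
assert project_balance(stream) == balance
```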

Telemetry – what, how, why
When something goes wrong, telemetry helps you find the cause quickly: it shows what went wrong and how. With this information, you can hopefully recover your system swiftly.
A common problem with telemetry, however, is that we can quickly become overwhelmed by too much data and struggle to find the relevant information.
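One common remedy for that flood is structured telemetry with a correlation id. A minimal sketch using only the Python standard library (the field names and the `handle_request` function are my illustrative assumptions): every log line of one request carries the same id, so you can filter the mass of telemetry down to the single failing request.

```python
# Structured logging sketch: JSON log lines with a per-request correlation
# id, so relevant telemetry for one failing request is easy to filter out.

import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })


logger = logging.getLogger("shop")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def handle_request(article: str) -> None:
    correlation_id = str(uuid.uuid4())  # one id per request
    extra = {"correlation_id": correlation_id}
    logger.info("request received", extra=extra)
    try:
        raise RuntimeError("out of stock")  # simulated defect
    except RuntimeError:
        # Same id on the failure line: filter by it to see everything
        # this one request did, and nothing else.
        logger.error("request failed", extra=extra)


handle_request("book")
```

Filtering by one `correlation_id` turns "too much data" into the handful of lines that actually explain the failure.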
Next time
The next post will cover how to make manual testing on a local machine simple and why this is important.