Event Sourcing: compensation – the simple way out when things go wrong

In the fifth part of this event sourcing series, I’ll show you how we use compensation of events to handle failed commands and events that should never have happened.

Sometimes, things go wrong – a command fails because the database is overloaded, there is a bug in the code for some edge case, the system is out of memory, the infrastructure misbehaves, etc. Or a user did something that should never have happened, like importing the wrong data set. When this happens, we want the system to return to a consistent, well-defined state.

Transactions and compensation are the two classic options to solve this problem. Let me explain why we went with compensation and how we implemented it.

Compensation vs Transactions

When executing a command, we want either for everything that needs to be updated, notified, persisted, etc., to be completed, or for nothing to be done. The typical way to achieve this is to use transactions. Start a transaction, persist all data, and if everything goes well, commit the transaction; otherwise, roll back and report an error.

The problem with transactions is that everything involved in the command needs to support the transaction. This gets tricky when several different databases or remote services are involved.
In addition, once a transaction is committed, it can’t be rolled back later (for example, if the user notices a mistake in the meantime).

An alternative approach is to use compensation. When something goes wrong, we compensate everything that we had already persisted or sent to another service. There are different ways to compensate: delete the data, mark it as compensated, persist an adjustment entry and probably more.

Compensation is generally harder to implement than transactions, but they provide a more flexible solution to more problems.

Local transactions, global compensation

We use transactions for a single interaction with a single database (for databases supporting transactions). We may execute several SQL queries in a single transaction to ensure that they all succeed or that no data is changed. And that nobody else reads inconsistent, uncommitted data concurrently.

There are, however, commands that interact with different kinds of storage (SQL, Redis, table store in our case). Or commands from one subsystem that trigger other subsystems to do additional work. In these cases, we prefer to use compensation so we don’t need to rely on a distributed transaction coordinator or introduce a technical dependency (the transaction) between our subsystems, since not all storage systems even support transactions.

Compensate an event

The nice thing about event sourcing is that we can see the whole history of what happened. We want to retain this property even when we compensate an event. So, instead of just deleting the event, we mark it as compensated.

You may wonder why we don’t simply use an additional event that compensates the prior one and “restores” the previous value. There are two reasons: First, compensation occurs very seldom in our system, so we don’t want to introduce dedicated compensation events for every “real” event. This would double the number of event types, and we couldn’t use a generic compensation implementation. Secondly, this approach doesn’t work for bi-temporal event streams. I’ll show you in a later post in this series how we implemented an “undo” for bi-temporal event streams. So, it’s possible, but it can quickly lead to problems. Marking events as compensated is a much simpler solution for us. And it makes compensated events obvious, and we can filter them out directly when querying data from the database for better performance (in case there are a lot of compensated events – if, for example, a data import was compensated).

So when a command fails, we compensate all the things the command already changed. We mark already-persisted events as compensated so they are ignored in the future, and we notify other systems to compensate the thing we just told them to do (or we perform some action that compensates what we did in case the other system does not support compensation).

What if compensation fails?

We compensate when something goes wrong. And, of course, when things go wrong, they often go terribly wrong. So we want to make sure that a compensation operation can’t get lost. That’s why we use our message bus to send a compensation message. When things can get compensated, great; otherwise, we get a dead-letter message (and a notification in our chat tool) and can fix it manually.

Summary

Compensation is a flexible solution to deal with errors in our system – to return the system to a consistent prior state. Compensation is more flexible than transactions, but harder to implement. Make your own trade-off for your system.

Deep Dive

Time for a deep dive. The following example shows the command that resigns an employee from a company. I simplified it a bit by removing things that aren’t relevant to compensation, like validation and some features. The red lines on the left indicate omitted parts.

After obtaining existing employee data, especially their last employment range, we create two events: The first states that the employment range will end on the resignation date. The second states that the employee cannot use the application anymore after the “usage end” date. These are two different dates because people typically have some vacation days left, and the company does not want them accessing and changing data after the accounting closes.

Then we tell the operation runner to perform the following steps:

  1. Mark this command as started so that when everything fails, we see that we started a command that needs compensation.
  2. Persist the two new events. We also log the events to update the audit trail.
  3. We execute the necessary side effects to all dates after the resignation date.
  4. We update the permissions (nobody is allowed to do something on this employee after the resignation date)
  5. We tell the operation runner that we are finished. This marks the operation as successful.

If anything goes wrong, the operation runner catches the exception and initiates a compensation.

The functions inside the operation runner use the execute function to handle compensation in case of an exception:

The context value holds everything we passed to the operation runner is passed from function call to function call in the pipe from start to finish.

The logAndPersistEvent'' function adds the event to the audit trail and calls the function that really persists the event.

The logAndPersistEvent' function uses the execute function for the compensation mechanism.

If the action fails, we mark the command/operation as failed and send the compensation message to the service bus. Yes, I know that it is called RollbackOperation, not CompensateOperation. That’s for some historic reasons and some old code that still does rollbacks 😅

The return after throwing the exception with raise is necessary to make the compiler happy.🤷‍♂️

If a compensation message is sent to the service bus, a handler will react to it and make some calls that will ultimately end up back in the employment subsystem. It really is boring code, so I don’t show it here. It’s a function living in the composition root that knows the whole system and just “broadcasts” that we need to compensate a command with a given ID.

Since compensation is very rare, we just try to delete all the events in every event stream that was created by the given operation ID. That’s a lot of UPDATE statements that do nothing, but it’s simple, and it works great so far.

In our case, there are two UPDATE statements that actually do something. The one for the resignation event, and the one for the employment range end event (shown here).

Note that we use the term triggerId, not operationId, here because these events might be created by something different from commands/operations.

To make SQL queries faster, we typically define an index that ignores the compensated events:

CREATE NONCLUSTERED INDEX [IX_employment_employementRangeEvents_employeeId] ON [employement].[employementRangeEvents](
[tenantId] ASC,
[employeeId] ASC
)
WHERE ([compensatedAt] IS NULL)
WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]GO

And we have to make sure that we exclude the compensated events in our queries (to get the right results and so that the index is used):

I think that is enough deep diving for today. As always, feel free to ask questions.

Updates

Improved usage of raise in the execute function (thanks, Ruben).

About the author

Urs Enzler

Add comment

Recent Posts