Our Experience with Bi-temporal Event Sourcing

Bi-temporal event sourcing combines two ideas. First, data is stored as a sequence of events that tell what has happened to it. Second, the data has two associated points in time: one when it entered the system and one when it takes effect.

This post is about our 8+ years of experience with bi-temporal event sourcing, along with code samples showing how we implemented it.

Feel free to skip the code blocks and just read the conceptual parts. But you’ll miss the beauty of F# 😂

This post is part of the F# Advent Calendar.

We use bi-temporal event sourcing in our time-tracking application called TimeRocket. The time-tracking domain is full of bi-temporal data: I record now that I started working two hours earlier, I book a vacation today for next month, and so on. We chose event sourcing because it allows us to keep the history of our data along with the reasons for its changes. This helps us explain why the numbers shown are what they are and correct manipulation errors made by our users – the system never loses data.

Event Sourcing Basics

With event sourcing, data is stored as a sequence of events. The example in the diagram shows the data for an address. First, the address is entered – for example, when an account is created. Then, the address changes because the person moves. Then, a user corrects a typo in the address. The events describe what happened. Compared to simple CRUD (create, read, update, delete), where we would only ever have the latest state of the address, we know how the current address came to be. We can also distinguish between a person moving and a typo being fixed, which enables business cases that would not otherwise be possible.

The events of a single address form a so-called event stream. To get the current address, we project the events of an event stream by applying the events in order. We call the result a projection.
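To make this concrete, here is a minimal F# sketch of projecting an address event stream with a fold. The event names (`AddressEntered`, `PersonMoved`, `TypoCorrected`) and the `Address` record are hypothetical, chosen to mirror the example above rather than taken from TimeRocket.

```fsharp
// Hypothetical types mirroring the address example; not TimeRocket code.
type Address = { Street: string; City: string }

type AddressEvent =
    | AddressEntered of Address
    | PersonMoved of Address
    | TypoCorrected of Address

// Applies one event to the state projected so far.
let apply (state: Address option) (event: AddressEvent) : Address option =
    match event with
    | AddressEntered address -> Some address
    | PersonMoved newAddress -> Some newAddress
    // A correction only makes sense for an address that already exists.
    | TypoCorrected corrected -> state |> Option.map (fun _ -> corrected)

// Projects an event stream by applying its events in order.
let project (events: AddressEvent list) : Address option =
    List.fold apply None events

// Because moves and corrections are distinct events, we can count real moves.
let numberOfMoves (events: AddressEvent list) : int =
    events
    |> List.filter (fun event ->
        match event with
        | PersonMoved _ -> true
        | _ -> false)
    |> List.length

let stream =
    [ AddressEntered { Street = "Main Street 1"; City = "Zurich" }
      PersonMoved { Street = "Lake Raod 7"; City = "Bern" }
      TypoCorrected { Street = "Lake Road 7"; City = "Bern" } ]
```

Projecting `stream` yields the corrected Bern address, while `numberOfMoves stream` still tells us that the person moved exactly once; CRUD would only keep the final address.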

Bi-temporal Data

Bi-temporal data has two points in time associated with it. In our system, one is typically the point in time when the data enters the system; we call this the application time. The other is the point in time at which the data takes effect; we call this the effective time. It’s like paying a bill: I enter the payment into my bill-paying application now (application time), specifying that the bill should be paid at the end of the month (effective time).

An interesting case arises when I enter multiple pieces of data about the same thing (I don’t like the term entity) with different application and effective times. In the diagram, you can see the numbers 1 to 3. First, I enter data at 1 (early on both the application and the effective timeline), then at 2 (later on both), and finally at 3 (later application than 2, but earlier effective than 2). That means that 3 was entered into the system later but takes effect earlier than what was specified at 2. Let’s look at an example of how to deal with this potential conflict.

In the above example, I enter when I want to go on vacation. First, I tell the system that I want to go on vacation for seven days starting on the 20th of October. Later, I change my mind: I want to go a week earlier, but for two weeks. In this scenario, it is clear that I want the new data to overrule the old data. The old data should be ignored.
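A minimal sketch of this "overrule" behaviour: each entry carries both points in time, and the entry with the latest application time wins, regardless of its effective time. The `Bitemporal` record, the `overrule` function and the concrete dates are assumptions for illustration only.

```fsharp
open System

// A value with the two points in time described above (hypothetical type).
type Bitemporal<'value> =
    { Application: DateTime // when the data entered the system
      Effective: DateTime   // when the data takes effect
      Value: 'value }

// "Overrule" semantics: the entry entered last wins; older entries are
// ignored, regardless of their effective time.
let overrule (entries: Bitemporal<'value> list) : Bitemporal<'value> option =
    entries
    |> List.sortBy (fun entry -> entry.Application)
    |> List.tryLast

// Vacation requests; the value is the number of vacation days (made-up dates).
let requests =
    [ { Application = DateTime(2023, 10, 1); Effective = DateTime(2023, 10, 20); Value = 7 }
      { Application = DateTime(2023, 10, 5); Effective = DateTime(2023, 10, 13); Value = 14 } ]
```

`overrule requests` returns the second request (14 days from the 13th of October), even though it takes effect earlier than the first one.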

In the following example, it’s not so easy anymore:

Many things in our system can be configured using rules, for instance, rules that specify when one should or must not work, or the maximum time one may work.

In the above example, I first tell the system to use rules A and B (green). Later (on the application axis), I tell the system that at a later point in time (on the effective axis), rules B and C should be applied. So, I want the system to switch from A and B to B and C. Now the system uses no rules before the green point on the effective timeline, A and B from the green until the yellow point, and B and C from then on. Easy so far 😊.

But then I enter that the system should use D from an effective point in time that lies between green and yellow. That can now have two different meanings. Either the change to D should overrule the change to B and C completely. Or D should be inserted, and the change to B and C still applies. You can see the resulting timelines for both variants in the diagram on the right.

In our use case, inserting is the correct behaviour, but the right choice differs from use case to use case.
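The two interpretations can be sketched side by side. This is a simplified illustration, not our production code: times are plain integers (think day numbers), rules are strings, and the names `RuleChange`, `RulePhase`, `insert` and `overrule` are hypothetical.

```fsharp
// Hypothetical types; times are plain ints (e.g. day numbers) for brevity.
type RuleChange =
    { Application: int
      Effective: int
      Rules: Set<string> }

// From `From` on (effective axis), the rules in `Active` apply.
type RulePhase = { From: int; Active: Set<string> }

// Insert: every change is kept and applies until the next change on the
// effective axis.
let insert (changes: RuleChange list) : RulePhase list =
    changes
    |> List.sortBy (fun change -> change.Effective, change.Application)
    |> List.map (fun change -> { From = change.Effective; Active = change.Rules })

// Overrule: a change entered later discards every earlier-entered change
// that takes effect at or after its own effective time.
let overrule (changes: RuleChange list) : RulePhase list =
    changes
    |> List.sortBy (fun change -> change.Application)
    |> List.fold
        (fun phases change ->
            (phases |> List.filter (fun phase -> phase.From < change.Effective))
            @ [ { From = change.Effective; Active = change.Rules } ])
        []

let changes =
    [ { Application = 1; Effective = 1; Rules = set [ "A"; "B" ] }  // green
      { Application = 2; Effective = 10; Rules = set [ "B"; "C" ] } // yellow
      { Application = 3; Effective = 5; Rules = set [ "D" ] } ]     // D, between green and yellow
```

With `insert`, the phases start at 1 (A, B), 5 (D) and 10 (B, C); with `overrule`, the change to B and C disappears and only the phases at 1 and 5 remain.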

Timelines

Our codebase uses the type Timeline to represent how a value changes over time.

We distinguish two kinds of timelines. A lifetime timeline represents a value that can be created, updated and deleted, for example, an address. A set-based timeline represents a value that can only be set or removed, for example, a set of rules that should be applied. These two kinds of timelines differ subtly in how events are projected. In this blog post, we stick with lifetime timelines.

F# basics

We need to cover some basics before I can show you our code. Skip this section if you know F# already. Don’t be scared if you don’t – other non-F#-ers successfully survived this presentation* and took the concepts with them. Here are all the F# things you need to know to understand the rest of the code in this blog post (and probably most F# code there is).

* this blog post is the written form of my presentation about this topic.

A value can be defined with let. A tuple can be created with a simple ,.
A list can be defined with [ ]. A value can be added to the front of a list with ::.
Functions, like the sum function, are also defined with let, and arguments are separated by spaces.
F# supports pipes |> to pass a single value to the function after the pipe. In my silly example (f), value is passed to the function String.toUpper and the result is then passed to String.length. This is the same as in function f', but much easier to read because we don’t have to nest the function calls.
Records can be defined with type. In the example, I defined a record with the fields Name and Age. A record value can then be defined with let. Yeah, yeah, I’m not 42 anymore, but it’s a nice number.
F# also knows types that say that a value is one of a fixed set of values – so-called discriminated unions. In the example, I define a type Fruit that says that a fruit is either a banana, apple or pear. No other fruit is possible. An apple also has a colour; the others don’t. My daughter only eats red apples, and my son only green apples, so this is very important information!
Discriminated unions are especially nice when used in pattern matching. The toString function matches on the kind of fruit and returns a string representing it.
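Since the original example is not reproduced here, the following is a reconstruction of the basics just described; the concrete names and values are illustrative. FSharp.Core has no `String.toUpper`, so this sketch defines a small `toUpper` helper instead.

```fsharp
let value = "hello"
let tuple = 1, "one"

let numbers = [ 1; 2; 3 ]
let moreNumbers = 0 :: numbers // [ 0; 1; 2; 3 ]

let sum x y = x + y

// Helper standing in for String.toUpper, which FSharp.Core does not provide.
let toUpper (s: string) = s.ToUpperInvariant()

// With pipes, the value flows from left to right.
let f (value: string) = value |> toUpper |> String.length
// The same computation, written with nested calls.
let f' (value: string) = String.length (toUpper value)

type Person = { Name: string; Age: int }
let person = { Name = "Urs"; Age = 42 }

type Colour =
    | Red
    | Green

type Fruit =
    | Banana
    | Apple of Colour // only apples have a colour
    | Pear

let toString (fruit: Fruit) =
    match fruit with
    | Banana -> "banana"
    | Apple Red -> "red apple"
    | Apple Green -> "green apple"
    | Pear -> "pear"
```

The compiler checks that `toString` handles every case of `Fruit`; leaving out `Pear`, for example, triggers an incomplete-match warning.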

If you want to dig into the upcoming code examples, you should also know how partial application works: calling a function with only some of its arguments.
In the above example, we provide only a single argument to the sum function. As a result, plusTwo is still a function, one that needs a single argument because we haven’t specified y yet. Once we pass y, we get the resulting value.
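A minimal sketch of partial application, using a `sum` function like the one from the basics:

```fsharp
let sum x y = x + y

// Only x is given, so plusTwo is still a function that waits for y.
let plusTwo = sum 2

let five = plusTwo 3                          // 5
let shifted = [ 1; 2; 3 ] |> List.map plusTwo // [ 3; 4; 5 ]
```

This is the same trick used later in the post: a function that takes "the old projection" as its last argument can be partially applied with all other arguments and handed over as an update.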

That’s all you need to know about F# for this presentation.

A typical event

In our system, one can define so-called organisation forms. Think of your company’s organisational chart showing departments, teams, and bosses as an example.

A typical event in our system has an event ID, the ID of the thing it belongs to (here, the organisation form), a data field containing the details of the event, and an application date-time.

The data is one of the following:

The organisation form can be created, removed or renamed. Units in the organisation form can be added, shifted, renamed or deleted. Most cases have an associated record that contains the details. Below are two examples. The form removed event only needs the effective date-time.

The organisation form created event contains the label, the sub-units and the effective date, which is a workday, a special abstraction representing a day without a calendar*. Forms are typically created or changed when the company reorganises, and a reorganisation takes effect on a certain date. We now have an application date-time and an effective date, so the event is bi-temporal.

* dealing with time is difficult because time zones are involved.

The organisation form renamed event only needs the new label for the form. This is a uni-temporal event because it lacks an effective date. In our domain, it would confuse people if we showed old labels for historical data. So we have a mix of uni- and bi-temporal data in a single event stream – fun!
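The shape of such events can be sketched as follows. This is an approximation under stated assumptions, not TimeRocket's actual definitions: the field names, the use of `Guid` for IDs and the `Workday` wrapper (here just a day number) are hypothetical, and the shifted, renamed and deleted unit cases are left out.

```fsharp
open System

// Hypothetical stand-in for the workday abstraction: a day without time zone
// information, represented here as a plain day number.
type Workday = Workday of int

// The envelope shared by all events; the field names are assumptions.
type EventEnvelope<'data> =
    { EventId: Guid
      StreamId: Guid                // the thing the event belongs to, e.g. the organisation form
      Data: 'data                   // the details of the event
      Application: DateTimeOffset } // when the event entered the system

type OrganisationUnit = { UnitId: Guid; UnitLabel: string }

// Bi-temporal: carries an effective date.
type OrganisationFormCreated =
    { Label: string
      SubUnits: OrganisationUnit list
      Effective: Workday }

// Uni-temporal: no effective date, the new label applies to the whole lifetime.
type OrganisationFormRenamed = { NewLabel: string }

type OrganisationUnitAdded =
    { AddedUnit: OrganisationUnit
      ParentId: Guid option
      Effective: Workday }

type OrganisationFormData =
    | OrganisationFormCreated of OrganisationFormCreated
    | OrganisationFormRemoved of effective: Workday
    | OrganisationFormRenamed of OrganisationFormRenamed
    | OrganisationUnitAdded of OrganisationUnitAdded
    // The shifted, renamed and deleted unit cases are left out of this sketch.
```

Because uni- and bi-temporal cases live in the same union, a single event stream can mix both, exactly as described above.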

Reality is worse

While talking to users and implementing real use cases, we quickly found out that there are even more possibilities for events in our domain.

Some events are rather obvious: things get created, updated, and deleted. But we found some more.

Our system supports workflows. For example, when I want to go on vacation, I send a request to my boss. He can then either accept or reject it. In case of rejection, we need to get rid of the vacation entry in my calendar. But it’s not a simple delete. It’s what we call a thing whose creation was rejected. Distinguishing these two cases allows us to make better decisions in business logic.

Then there are things that were deleted but should come back. That is a recreate event – kind of an undo of the delete.

Earlier, we saw that labels of organisation forms have no effective date. To be able to mix uni- and bi-temporal events, we have modify perpetually events. They mean that the label should be changed over the whole lifetime of the organisation form.

Finally, we sometimes project a single event stream in different ways to enable different use cases. Not every event is relevant for every projection, so events can be skipped.

In code, we can specify what each event means – it creates something, or deletes it, or is skipped and so on. The following discriminated union shows the possibilities explained above:

'projection: the resulting type of projecting events.
'granularity: the measure* of the timeline, typically workdays or UTC ticks.

As you can see, the events specified as RejectCreation, ModifyPerpetually, and Skip are uni-temporal, and the others are bi-temporal.

* not in the sense of an F# measure!
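A sketch of what such a union could look like. The case names follow the text; the exact fields of each case are assumptions, as is the helper `tryEffective`, which shows the split between bi-temporal actions (with an effective time) and uni-temporal ones (without).

```fsharp
//   'projection:  the resulting type of projecting events
//   'granularity: the measure of the timeline, e.g. workdays or UTC ticks
type Action<'projection, 'granularity> =
    | Create of effective: 'granularity * projection: 'projection
    | Update of effective: 'granularity * update: ('projection -> 'projection)
    | Delete of effective: 'granularity
    | Recreate of effective: 'granularity
    | RejectCreation                                            // uni-temporal
    | ModifyPerpetually of modify: ('projection -> 'projection) // uni-temporal
    | Skip                                                      // uni-temporal

// Bi-temporal actions carry an effective time; uni-temporal ones do not.
let tryEffective (action: Action<'projection, 'granularity>) : 'granularity option =
    match action with
    | Create (effective, _)
    | Update (effective, _) -> Some effective
    | Delete effective
    | Recreate effective -> Some effective
    | RejectCreation
    | ModifyPerpetually _
    | Skip -> None
```

Updates are modelled as functions from the old projection to the new one in this sketch, which is why partial application comes in handy when mapping events to actions.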

The projection

So, back to our events’ data for the organisation form. Two are shown below, together with the type for the projected organisation form that results when projecting the events for a single organisation form.

Timelines revisited

When we request the timeline of a thing, it is one of these three things:

  • If there is data available, then we get an Existent timeline.
  • If no data is found, we get a NonExistent timeline.
  • If the creation of the thing was rejected, we get a Rejected timeline. A rejected timeline provides the value that was rejected, which we use, for example, to show which vacation request was rejected.

In code, a timeline is defined as follows:

An existent timeline consists of a list of phases. A phase either has a value or not. Both kinds of phases have a start, but only a phase with a value also carries that value. So whenever a thing changes, a new phase is added.
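A sketch of these types, with a small hypothetical helper `valueAt` that looks up the value valid at a given point on the effective axis. The case and field names are assumptions based on the description above.

```fsharp
type Phase<'projection, 'granularity> =
    | PhaseWithValue of start: 'granularity * value: 'projection
    | PhaseWithoutValue of start: 'granularity

type Timeline<'projection, 'granularity> =
    | Existent of phases: Phase<'projection, 'granularity> list
    | NonExistent
    | Rejected of rejected: 'projection

// Returns the value valid at a point on the effective axis, assuming the
// phases are sorted by their start.
let valueAt (point: 'granularity) (timeline: Timeline<'projection, 'granularity>) : 'projection option =
    match timeline with
    | Existent phases ->
        phases
        |> List.filter (fun phase ->
            match phase with
            | PhaseWithValue (start, _)
            | PhaseWithoutValue start -> start <= point)
        |> List.tryLast
        |> Option.bind (fun phase ->
            match phase with
            | PhaseWithValue (_, value) -> Some value
            | PhaseWithoutValue _ -> None)
    | NonExistent
    | Rejected _ -> None
```

For a timeline with phases starting at 1 ("A"), 5 ("B") and a phase without value at 9, `valueAt 6` returns `Some "B"`, while points before 1 or from 9 on return `None`.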

Specifying how to project each event

Before we can project an event stream into a timeline, we need to specify how each event must be treated, whether it creates, updates, deletes, or rejects the projection or should be skipped. We need to define a function that takes an event and returns the action to be taken:

I just saw that the indentation on the line | OrganisationUnitAdded is wrong 🙈. Copy & paste error.

In the above example, you can see that when we have an OrganisationFormCreated event, then we specify that we want to treat that event as an event that creates a projection. We have to pass the effective date and the created projection.

If it is an OrganisationUnitAdded event, then we want to update the projection. Here, I call a function that takes care of changing the organisation form. This function takes the old projection and returns the updated one (remember partial application?).

If we have an OrganisationUnitRenamed event, we specify that we want to modify the projection perpetually (over the whole lifetime). Here, I didn’t use partial application*.

* This is how our code looks in reality – including the inconsistencies!

We have to specify the behaviour for every kind of event data – otherwise, the compiler is not happy and will tell us. Yeah, exhaustive pattern matching!
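Since the original snippet is not reproduced here, a self-contained sketch of such a mapping function follows. The simplified event data, the `OrganisationForm` projection and the helper `addUnit` are hypothetical; only a subset of the cases is shown, and the action type repeats the shape assumed earlier.

```fsharp
type Workday = Workday of int

type Action<'projection, 'granularity> =
    | Create of effective: 'granularity * projection: 'projection
    | Update of effective: 'granularity * update: ('projection -> 'projection)
    | Delete of effective: 'granularity
    | Recreate of effective: 'granularity
    | RejectCreation
    | ModifyPerpetually of modify: ('projection -> 'projection)
    | Skip

// Hypothetical projection: a label plus units keyed by ID.
type OrganisationForm = { Label: string; Units: Map<int, string> }

// Simplified event data (a subset of the real cases).
type OrganisationFormData =
    | OrganisationFormCreated of label: string * effective: Workday
    | OrganisationFormRemoved of effective: Workday
    | OrganisationFormRenamed of newLabel: string
    | OrganisationUnitAdded of unitId: int * unitLabel: string * effective: Workday
    | OrganisationUnitRenamed of unitId: int * newLabel: string

// Written to be partially applied: the old projection comes last.
let addUnit unitId unitLabel (form: OrganisationForm) =
    { form with Units = form.Units |> Map.add unitId unitLabel }

let toAction (data: OrganisationFormData) : Action<OrganisationForm, Workday> =
    match data with
    | OrganisationFormCreated (label, effective) ->
        Create (effective, { Label = label; Units = Map.empty })
    | OrganisationFormRemoved effective -> Delete effective
    | OrganisationFormRenamed newLabel ->
        ModifyPerpetually (fun form -> { form with Label = newLabel })
    | OrganisationUnitAdded (unitId, unitLabel, effective) ->
        Update (effective, addUnit unitId unitLabel) // partial application
    | OrganisationUnitRenamed (unitId, newLabel) ->
        ModifyPerpetually (fun form ->
            { form with Units = form.Units |> Map.add unitId newLabel })
```

Adding a new case to `OrganisationFormData` without extending `toAction` makes the compiler warn about an incomplete match.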

Meet the projection algorithm

In this section, I’ll first explain the algorithm of the projector conceptually and afterwards I’ll show our code that achieves this. If you are happy with just the concepts, then skip the code. The code is a bit hard to understand. One reason is that it is (a bit) optimized for performance, which doesn’t help readability too much. 🙈

The projection algorithm first loops over all events we pass to it and augments the events with some metadata. The metadata contains the associated action describing what we should do with the event (creates, updates, deletes etc.).

The second step is to partition the events into two buckets. Bucket one contains the bi-temporal events – the events with an effective date(-time). Bucket two contains the uni-temporal events (Skip, ModifyPerpetually, RejectCreation).

The bi-temporal events are then sorted, first by their effective time, then by their application time and, if both are the same, by the kind of action. Creates come before Deletes, for example.

The uni-temporal events are sorted by their application time.
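The partitioning and sorting steps can be sketched like this, assuming each augmented event carries its effective time (`None` for uni-temporal events), its application time and the kind of its action. The names are hypothetical, and apart from "Create before Delete" the order of the kinds is an assumption.

```fsharp
open System

// F# compares union cases in declaration order, so this order decides ties:
// Create before Update before Delete before Recreate.
type Kind =
    | Create
    | Update
    | Delete
    | Recreate
    | RejectCreation
    | ModifyPerpetually
    | Skip

type AugmentedEvent =
    { Name: string
      Effective: int option // None for uni-temporal events
      Application: DateTime
      Kind: Kind }

let isBitemporal (kind: Kind) =
    match kind with
    | Create | Update | Delete | Recreate -> true
    | RejectCreation | ModifyPerpetually | Skip -> false

// Partition into bi- and uni-temporal events, then sort each bucket.
let partitionAndSort (events: AugmentedEvent list) =
    let biTemporal, uniTemporal =
        events |> List.partition (fun e -> isBitemporal e.Kind)
    let biSorted =
        biTemporal |> List.sortBy (fun e -> e.Effective, e.Application, e.Kind)
    let uniSorted = uniTemporal |> List.sortBy (fun e -> e.Application)
    biSorted, uniSorted
```

Sorting by a tuple gives the "first by effective, then by application, then by kind" order for free, because F# compares tuples element by element.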

Then, we loop over the bi-temporal events and modify the current projection according to the associated action. And we add a new phase to the list of phases. We also apply all uni-temporal events in this step. For details, see the code below – it’s a bit complicated.

Now, we switch to the code. Feel free to skip these details and continue with the next section. But also feel free to enjoy the nice syntax of F#. The gray areas represent code currently not shown.

I won’t describe every detail of the code; otherwise, this blog post will be even longer. If you have questions, ask them in the comments.

The TimelineProjector provides the project function that takes a configuration and the events to be projected.

We can either have a SetRemove or a CreateUpdateDelete timeline. I will only show the code for the latter because it is the more common scenario.

As shown in the conceptual part of this section, we augment the events with some metadata. Then, we partition the events into two buckets for bi- and uni-temporal events.

Next, we sort the uni- and bi-temporal events.

Then, we loop – or fold – over the bi-temporal events to get the phases and the resulting projection. The folder will be shown further below. We start the fold with an empty list of phases and a NeverExisted projection. I won’t explain the third argument; it’s a riddle for the interested reader.

The folder function takes the fold’s current state and the current element. The state consists of the phases created so far, the previous projection and the previous metadata. The current element consists of the event to be handled and its metadata.

We get the application date-time of the event and then match on the metadata, which describes what we should do, and the previous projection (the projection that resulted from the previous element).

If, for example, the metadata tells us to create a projection and the previous projection is NeverExisted, then this is a valid combination, and we can create the projection. The same is true when we deleted the thing, and now it is recreated. This is a rare case, but it happens. Everything happens when the system runs long enough!

We have to first apply the uni-temporal events. This can end in two ways. Either we should continue with creating the timeline, or we should stop because we found a CreationRejected event. If we can continue, we create a new phase with the returned potentially modified projection. Otherwise, we set the current projection to CreationRejected and empty the list of phases.

We then handle all the other cases. Updates make sense when we already have an existing current projection:

Deletes also make sense when we have a current projection. In this case, we add a phase without a value.

The Recreates action only makes sense when the current projection is deleted and the effective dates of the delete and recreate events match. That is important because a single timeline can contain multiple delete events; the effective date tells us which delete to undo.

And for all other cases ( _ -> ), we simply keep what we currently have.

Finally, we build the resulting timeline from the phases we collected and the resulting projection.
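For readers who want to see the whole flow in one place, here is a much simplified sketch of a CreateUpdateDelete projector following the steps above. It is not our production code: it sorts by effective time only (relying on a stable sort and on the input already being in application order), skips the metadata and the set-based timelines, and all type and function names are assumptions.

```fsharp
type Action<'p, 'g> =
    | Create of effective: 'g * projection: 'p
    | Update of effective: 'g * update: ('p -> 'p)
    | Delete of effective: 'g
    | Recreate of effective: 'g
    | RejectCreation
    | ModifyPerpetually of modify: ('p -> 'p)
    | Skip

type Phase<'p, 'g> =
    | PhaseWithValue of start: 'g * value: 'p
    | PhaseWithoutValue of start: 'g

type Timeline<'p, 'g> =
    | Existent of Phase<'p, 'g> list
    | NonExistent
    | Rejected of 'p

// The projection state carried through the fold.
type Current<'p, 'g> =
    | NeverExisted
    | Existing of 'p
    | Deleted of lastValue: 'p * deletedAt: 'g
    | CreationRejected of 'p

let tryEffective action =
    match action with
    | Create (effective, _) | Update (effective, _) -> Some effective
    | Delete effective | Recreate effective -> Some effective
    | RejectCreation | ModifyPerpetually _ | Skip -> None

let project (actions: Action<'p, 'g> list) : Timeline<'p, 'g> =
    // Partition into bi- and uni-temporal actions; sort the bi-temporal ones.
    let biTemporal, uniTemporal =
        actions |> List.partition (fun action -> Option.isSome (tryEffective action))
    let biTemporal = biTemporal |> List.sortBy tryEffective

    let folder (phases, current) action =
        match action, current with
        | Create (effective, projection), (NeverExisted | Deleted _) ->
            // Apply the uni-temporal actions first; a rejection stops everything.
            if uniTemporal |> List.exists (function RejectCreation -> true | _ -> false) then
                ([], CreationRejected projection)
            else
                let modified =
                    uniTemporal
                    |> List.fold
                        (fun value a ->
                            match a with
                            | ModifyPerpetually modify -> modify value
                            | _ -> value)
                        projection
                (PhaseWithValue (effective, modified) :: phases, Existing modified)
        | Update (effective, update), Existing value ->
            let updated = update value
            (PhaseWithValue (effective, updated) :: phases, Existing updated)
        | Delete effective, Existing value ->
            (PhaseWithoutValue effective :: phases, Deleted (value, effective))
        | Recreate effective, Deleted (value, deletedAt) when effective = deletedAt ->
            // Undo the matching delete by dropping its phase without a value.
            let phases =
                match phases with
                | PhaseWithoutValue start :: rest when start = deletedAt -> rest
                | _ -> phases
            (phases, Existing value)
        // All other combinations keep what we currently have.
        | _ -> (phases, current)

    let phases, final = biTemporal |> List.fold folder ([], NeverExisted)
    match final with
    | NeverExisted -> NonExistent
    | CreationRejected rejected -> Rejected rejected
    | Existing _
    | Deleted _ -> Existent (List.rev phases)
```

Phases are prepended during the fold and reversed once at the end, which keeps each step cheap on F# lists.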

If you missed code handling the overrule vs. insert problem shown above, that is because it only has to be considered in set-based projections, which I didn’t show here.

Experience of 8+ Years of (bi-temporal) Event Sourcing

Using both bi-temporal and “normal” uni-temporal event sourcing has brought us many advantages. The data is much more expressive because we know why it has changed. This makes operating on historical data much easier and more powerful. We also have the history of all changes and never lose data. This is great when customers call support because they accidentally “deleted” some data. Having all events available makes debugging easier because we see how data was changed over time and when it was changed (application time). And thus, it’s easier to fix errors made by the system or the users.

But we also found many challenges. Modelling a business domain with events is difficult, especially events for undo (“Oops, I accidentally changed the wrong value”) and similar cases: events that are not true domain events but reflect things that happen anyway. Existing business processes are often not straightforward either, and they have many special cases and exceptions. We can’t change the business processes of our several hundred customers, so we need to be flexible.

Furthermore, migrations can be challenging as well. Sometimes, we migrate the events themselves – we change the historical data. This is sometimes simpler in the long run than having to deal with different versions of events. Or there is a technical reason for it. We switched from Newtonsoft JSON to System.Text.Json because it’s much faster. But we had to migrate all the event data – stored as JSON – because the JSON didn’t look exactly the same (representation of discriminated unions).

We can migrate data when we deploy our system. That’s the simpler approach, but it leads to downtime, so it is only feasible for very fast migrations. For long-running migrations, we do a multi-phase release:

  1. Add the new capability to the system but still support the old events. Events are always written in the new format, but can be read in the old and new format.
  2. Migrate the old events to the new format with the help of a console app that we can run locally. This app converts a couple of events at a time to prevent overload of the database.
  3. Remove support for the old version of the event. This keeps our system simple and maintainable.
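The three phases can be sketched for a hypothetical event whose stored format changed from a single full name to separate name parts. `StoredEvent`, `readStored`, `writeNew` and `migrateInBatches` are illustrative names, not part of our system, and a real migration would also have to replace the old events by their IDs, which this sketch leaves to the `save` function.

```fsharp
// Hypothetical versioned storage format.
type StoredEvent =
    | V1 of fullName: string                     // old format
    | V2 of firstName: string * lastName: string // new format

type PersonNamed = { FirstName: string; LastName: string }

let splitFullName (fullName: string) =
    match fullName.Split([| ' ' |], 2) with
    | [| first; last |] -> first, last
    | _ -> fullName, ""

// Phase 1: the system reads both formats.
let readStored (stored: StoredEvent) : PersonNamed =
    match stored with
    | V1 fullName ->
        let first, last = splitFullName fullName
        { FirstName = first; LastName = last }
    | V2 (first, last) -> { FirstName = first; LastName = last }

// Phase 1, continued: the system always writes the new format.
let writeNew (event: PersonNamed) : StoredEvent = V2 (event.FirstName, event.LastName)

// Phase 2: convert old events a few at a time to avoid overloading the
// database; save stands in for whatever persists a converted batch.
let migrateInBatches (batchSize: int) (save: StoredEvent list -> unit) (events: StoredEvent list) =
    events
    |> List.filter (fun stored ->
        match stored with
        | V1 _ -> true
        | V2 _ -> false)
    |> List.chunkBySize batchSize
    |> List.iter (fun batch -> batch |> List.map (readStored >> writeNew) |> save)

// Phase 3: once no V1 events remain, remove the V1 case and its conversion.
```

Keeping the conversion in one function (`readStored >> writeNew`) means phases 1 and 2 share the same logic, so the console app and the running system cannot disagree about the new format.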

Conclusions

Dealing with bi-temporal data is difficult. Combining it with event sourcing is even trickier. However, the decision to go with bi-temporal event sourcing was good for us. Once we had the infrastructure code for dealing with bi-temporal events, we were capable of implementing difficult use cases very quickly, which would not have been possible with CRUD (or tedious with CRUD-with-history).

One last Thing

The code I showed you here is our real production code. But we didn’t write it in a single go. To get to this code, we iterated on it at least five times, and iterated here means that we completely rewrote it. The first version was in C# and dealt with rejects completely differently, which was even more difficult and much less performant. The first rewrite in F# was nice and small but could not handle all the use cases we needed. The next version was complete but slow. Finally, we ended up with the solution described in this blog post.

Thanks for reading!

About the author

Urs Enzler
