A not-so-gentle follow-up on bitemporal data challenges

Not long ago we met Bob, in an introduction to bitemporal data delivery and how a well-designed data solution can provide this feature out of the box.

When delivering data from the integration layer (e.g. a Data Vault model) to the presentation layer (anything, but usually a dimensional model or wide table), a key requirement is re-organising data to the selected ‘business’ timeline for delivery. During this process, we leave the safety of the assertion (technical) timeline behind and start using the real-world state timeline for delivery.

In Bob’s example, this real-world timeline is standardised as the state timeline. This column is available as a standard column in the integration layer model, where the appropriate data values have already been mapped to it.

During Bob’s ordeal, we only had to worry about a single time-variant data set containing the two standard timelines: the assertion and state timelines. Complexity ensues when you combine more than one of these time-variant data sets into a single data delivery – a common scenario when delivering a dimensional model.

In this post, we explore the behaviour of data when multiple time-variant, bitemporal data sets are combined. In a subsequent post, we will revisit what effect this has on issuing (deterministic) dimension keys.

Update 2023-03-29 – don’t forget to check out an effective visualisation of this concept here.

From Data Vault to Dimensional

In order to demonstrate what happens when we merge two (or more) time-variant data sets, let’s consider the Data Vault example below:

Data Vault tables loading into a dimension.

The ‘Sat Customer’ and ‘Sat Customer Contact Details’ both contain regular context: descriptive columns that describe the ‘Hub Customer’ business concept. Together with the hub, they are merged into a ‘Dim Customer’.

Note that, conceptually, the satellite table should also contain a uniquified state timestamp value to guarantee that joins don’t return too many rows caused by duplicate values. I add a ‘State Timestamp Key’ for this purpose, but will elaborate on this in another post. To illustrate the concept, we should for now assume the state timestamp values are unique. Also, the inscription record id is part of the key to guarantee uniqueness for time of arrival – but I won’t go into this in detail, as it is covered in other posts and discussions.

Now, imagine that the two satellite tables contain the following data:

In this example, ‘Inscription’ refers to the Inscription Timestamp and ‘State’ to the State Timestamp. For brevity, only the business key (‘ID’) value from the hub is displayed and the rest of the columns are omitted. Each satellite has one descriptive property: ‘Name’ for Sat Customer and ‘Number’ for Sat Customer Contact Details.

The question to ask is: what would the resulting dimension look like? Remember, we have to re-arrange the data along the state timeline.

The typical answer looks like this:

This is what the data looks like when re-arranged corresponding to the state timeline, with the wisdom of ‘now’.
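This re-arrangement can be sketched in a few lines of Python. The satellite contents below are a hypothetical reconstruction using the names, numbers and dates mentioned in this example; dates are ISO strings so that lexical comparison matches chronological order:

```python
# Hypothetical reconstruction of the two satellites from the example.
sat_customer = [
    {"state": "2011-01-01", "name": "Jonathan"},
    {"state": "2012-01-01", "name": "John"},
]
sat_contact_details = [
    {"state": "2011-01-01", "number": "123"},
    {"state": "2011-12-01", "number": "345"},  # the backdated record
]

def merge_on_state(names, numbers):
    """Re-arrange along the state timeline with the 'wisdom of now':
    one dimension row per state change point, where each column takes
    the latest value at or before that point."""
    def at(pool, col, point):
        hits = [r for r in pool if r["state"] <= point]
        return max(hits, key=lambda r: r["state"])[col] if hits else None

    points = sorted({r["state"] for r in names + numbers})
    return [{"state": p,
             "name": at(names, "name", p),
             "number": at(numbers, "number", p)} for p in points]
```

Calling `merge_on_state(sat_customer, sat_contact_details)` produces one row per state change point (2011-01-01, 2011-12-01 and 2012-01-01), which is the ‘typical answer’ above – correct only when seen from today’s perspective.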

But, this is not entirely correct.

What we really should consider is evaluating each record against the data that was available (for joining) at that point in time – excluding data that had not yet arrived. Even when looking at the history from today’s perspective, records that were still ‘future’ at the time should be excluded, even if this data now exists in the table.

Loading incrementally

When loading each transaction incrementally, as it comes in, we can produce a data set that shows what should have happened. Simply because future data is not yet physically available when each transaction arrives, it cannot be taken into account.

To do so, we replay the transactions across the two satellites and add the results to the dimension one by one.

The inscription timestamp represents the order of arrival, but to (try to) make it a bit clearer I have also added an ‘arrival number’ that corresponds with the inscription timestamp. In addition, the order of arrival according to the inscription timestamp is shown to the left of the table, and the order of arrival according to the state timestamp on the right side of the table.

Creating a dimension from the first five arriving records is relatively straightforward. So far, the inscription and state timestamps follow the same order:

It is with the arrival of the sixth record in the Customer Contact Details satellite (below) that things get interesting. This record arrives on 2014-02-01, but applies to 2011-12-01. This means that the corresponding descriptive properties from the other satellite must be matched as per that earlier point in time, in the state timeline.

The process (join, lookup) therefore checks what the active value is in the other data set, for that point in time according to the state timeline.
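A minimal sketch of this lookup, in Python, using the Name history from the example (an assumed reconstruction; dates are ISO strings so string comparison follows chronological order):

```python
def active_value(records, state_point, column):
    """Return the value active at state_point on the state timeline:
    the record with the greatest state timestamp <= state_point."""
    hits = [r for r in records if r["state"] <= state_point]
    return max(hits, key=lambda r: r["state"])[column] if hits else None

# Name history from 'Sat Customer', as described in the example.
sat_customer = [
    {"state": "2011-01-01", "name": "Jonathan"},
    {"state": "2012-01-01", "name": "John"},
]

print(active_value(sat_customer, "2011-12-01", "name"))  # prints Jonathan
```

Note that the lookup deliberately ignores the inscription timestamp: it answers “what was the value at this point on the state timeline”, given whatever records are physically present.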

A late-arriving record appears

So, for the Number value ‘345’ from record 6, the corresponding Name value would be ‘Jonathan’. This was the last known value as per 2011-12-01. At 2012-01-01, the value would change to ‘John’ but that has not happened yet at 2011-12-01.

But we’re not there yet. In the Customer Contact Details satellite, the order of records has changed. The record that arrives (inscription timestamp) on 2014-02-01 causes a reshuffle in the history of the state timeline. A record was ‘inserted’ between the state timestamp values of 2011-01-01 and 2012-01-01.

This causes a ripple effect in the history.

The dimension already contains the changes over time up to the point the backdated adjustment was received. The backdated adjustment then forces a re-evaluation of the changes that would have occurred if the record had been received in the correct order.

In the state timeline, the changes that would have happened (if the record was received in order) are recalculated up to the point another change (still in the state timeline) would have happened anyway.

These records can only be identified when the backdated adjustment is received (inscription timestamp), and thus inserted with the 2014-02-01 inscription timestamp value. After all, before that record arrived, we were blissfully ignorant. The result is two competing ‘histories’ depending on when data was received.

This is illustrated below:

This is where the additional highlighted records are coming from. As per 2014-02-01, we know that there is an alternate history which is only received now. These records are created as per the inscription timestamp, because it couldn’t be known in the past that this would happen – which is exactly what occurs during incremental loading.

Interestingly, these additional rows are the ones you would recognise in the initial assumption at the start of this post – seen from today’s perspective, where the changes have already happened.
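The incremental replay described above can be sketched as follows. This is a simplified illustration, not actual Data Vault loading code; the satellite contents are a hypothetical reconstruction of the example, and dates are ISO strings:

```python
sat_name = [
    {"inscription": "2011-01-01", "state": "2011-01-01", "name": "Jonathan"},
    {"inscription": "2012-01-01", "state": "2012-01-01", "name": "John"},
]
sat_number = [
    {"inscription": "2011-01-01", "state": "2011-01-01", "number": "123"},
    # the backdated adjustment: arrives 2014-02-01, applies to 2011-12-01
    {"inscription": "2014-02-01", "state": "2011-12-01", "number": "345"},
]

def replay(sat_name, sat_number):
    """Replay both satellites in arrival (inscription) order and build the
    dimension row by row, joining only on the state timeline."""
    dim, pools = [], {"name": [], "number": []}

    def active(pool, point):
        hits = [r for r in pool if r["state"] <= point]
        return max(hits, key=lambda r: r["state"]) if hits else None

    events = sorted(
        [("name", r) for r in sat_name] + [("number", r) for r in sat_number],
        key=lambda e: e[1]["inscription"],
    )
    for side, rec in events:
        other_side = "number" if side == "name" else "name"
        pools[side].append(rec)
        # Row for the arriving record, matched as-of its own state point.
        match = active(pools[other_side], rec["state"])
        dim.append({
            "inscription": rec["inscription"], "state": rec["state"],
            side: rec[side],
            other_side: match[other_side] if match else None,
        })
        # Ripple effect: changes in the other satellite after this state
        # point (up to the next change in this satellite) are re-evaluated
        # and re-emitted with the inscription timestamp of the new record.
        nxt = min((r["state"] for r in pools[side] if r["state"] > rec["state"]),
                  default=None)
        for o in pools[other_side]:
            if o["state"] > rec["state"] and (nxt is None or o["state"] < nxt):
                dim.append({
                    "inscription": rec["inscription"], "state": o["state"],
                    side: rec[side], other_side: o[other_side],
                })
    return dim
```

Replaying this sample pairs the backdated Number ‘345’ with ‘Jonathan’ (the Name active at state point 2011-12-01), and then re-emits the state point 2012-01-01 with Number ‘345’ under the 2014-02-01 inscription timestamp – the additional ‘ripple’ record.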

To complete the example, a final row arrives which adds another regular history record to the dimension.

It’s time to evaluate the results.

Ensuring an incremental view when loading today

The virtual data warehouse mindset would dictate that, no matter when or how you process data, you should always be able to reproduce the data as it was. This means that if you would reload your dimension today, you would make sure that the additional records are present.

The result must be deterministic.

And, if you go ‘full virtual’, the view should be able to do this (hint: this is possible) too. For consideration, below are the original dimension and the incrementally loaded dimension for comparison. In the incrementally loaded dimension, the red highlighted rows show the ripple effect of the backdated adjustment, as additional records.

Interestingly, as a result of the incremental load there are multiple values ‘active’ at the same point of arrival (inscription timestamp). This is the result of the ‘ripple effect’ in history caused by the backdated adjustment, which becomes known – as a record set – at the arrival of the record (inscription timestamp) that contains the backdated adjustment.

Plotting these records on a timeline looks like below:

At point ‘3’ and ‘4’, two different and competing values are active at the same state timestamp. How can the Number be ‘123’ and ‘345’ at 2012-01-01?

This is where time-travel comes in again. Time-travel means showing ‘as of’ values following the assertion timeline (inscription timestamp). By filtering the records to a point before 2014-02-01, we can reproduce the data set as it was at that point in time. We travel back in time along the assertion timeline to an earlier point.

At all times, joins between data sets are done using the state timestamp values. The inscription timestamp is only ever used for filtering.

Are you looking at the data from today’s perspective? Then the Number value would be ‘345’, but if you time-travel back to before 2014-02-01, the value would be ‘123’.
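This time-travel filter can be sketched as below, using hypothetical rows for the two competing histories at state point 2012-01-01:

```python
def time_travel(dim, assertion_point):
    """Filter along the assertion timeline: keep only rows asserted at or
    before assertion_point, then keep the latest assertion per state
    point. The state timestamp is never used for filtering here."""
    visible = [r for r in dim if r["inscription"] <= assertion_point]
    latest = {}
    for r in sorted(visible, key=lambda r: r["inscription"]):
        latest[r["state"]] = r  # later assertions overwrite earlier ones
    return sorted(latest.values(), key=lambda r: r["state"])

# The two competing histories for state point 2012-01-01 (example values).
dim = [
    {"inscription": "2012-01-01", "state": "2012-01-01", "number": "123"},
    {"inscription": "2014-02-01", "state": "2012-01-01", "number": "345"},
]
```

With this, `time_travel(dim, "2013-01-01")` returns the Number ‘123’, while any assertion point at or after 2014-02-01 returns ‘345’ – the same state point, two answers, depending on where you stand on the assertion timeline.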

In Data Vault, the witty remark is often made that late-arriving data cannot occur. This is true, of course, for the inscription timestamp / assertion timeline, as embodied by the ‘load date timestamp’ column in Data Vault. Because this is a technical timeline controlled by the data solution, it will always be sequential.

However, this doesn’t remove the necessity to think about what timeline is used for data delivery. Ideally, this should never be the assertion timeline. There are many situations where this will not yield the expected result for the consumers of the data.

A better way is to only ever use the assertion timeline for filtering (time-travelling), and to ensure that all joins and other steps that bring data together use the state timeline. And this is where ‘late-arriving’ does happen, even in Data Vault.

In the next post, we’ll visualise this example using the Cartesian plane from Bob’s example. After that, it’s time to revisit what impact additional records such as those covered in this example will have on generating deterministic dimension keys.

After all, they do all share the same inscription timestamp…

Roelant Vos
