I’m working on a new demo to explain the conceptual differences between time-series databases and XTDB (not discussing internals/performance here at all). Thought I’d share my notes…
- Let’s initially imagine there are no built-in time features, and pretend that XTDB is just another update-in-place SQL database - this is what a typical time-series schema might look like:
INSERT INTO instruments (_id, symbol, ts, val) VALUES (1, ‘USD’, NOW, 100); - note that by default we see all the versions, so let’s limit to seeing the latest value - Now let’s use dates for the examples to make reading the
ts
column values a little easier, and also convey the idea of importing externally produced timestamps INSERT INTO instruments (_id, symbol, ts, val) VALUES (1, ‘USD’, DATE ‘2024-01-01’, 101); - For the rest of the demo, we’ll query for all versions in the time series at once
- Add a
reported_at
column to convey the idea that the results may change each time we query for the time series, and we may need to track the ‘version’ of the results produced by the database - Instead of INSERTing the 3rd value at the end, what if it is actually meant to sit between the first two entries based on an out-of-order
ts
- What if we then discover some sort of mistake or error in that retroactively INSERTed entry, e.g. the value was wrong - we have to delete the old entry and insert a new one (with a new unique ID)
- At this point, a typical SQL / time series system would be losing any sense of auditability with such changes going on (retroactivity + corrections), and every time you run such a query the results may be wildly different - luckily XTDB retains old versions, so even with this userspace time series model we can always see old versions and access a full audit history
- But now let’s see what the best way to model this data in XTDB actually is, given time is first-class. Instead of using separate rows and IDs for the same time series, we can use the built-in versioning against the _id column, along with the ability to control _valid_from and _valid_to
- Adding
FOR ALL SYSTEM_TIME
shows us the number of versions required to correctly decompose the full history of the data AS OF
syntax allows us to query the relevant versions - this can similarly be applied across many tables or even the entire query- We can use
SYSTEM_TIME AS OF
to avoid the need for worrying about versioning the results (e.g. viareported_at
) and the repeatability/consistency across queries more generally
As a concluding remark, it’s worth appreciating that time series databases have evolved out from a mindset of ingestion performance, whereas the bitemporal model is all about correctness. To achieve the same level of correctness, consistency and overall flexibility in a time series database would likely require it to run a lot more slowly.