(V2) Best way to handle frequent updates that might not contain any changes?

Hi all.

We’re looking at XTDB to power a system that ingests data from various (hospitality) industry partners. Aside from its Clojure roots, we’re excited because the built-in ability to “go back in time” to see e.g. the price of a hotel room last Wednesday is crucial.

My question is this: we receive hourly data dumps. However, in many cases, the data hasn’t changed. When we perform an UPDATE with values that are identical to existing values, it appears that XTDB still creates a “new” historical record:

Fiddle showing multiple identical UPDATEs

Is this something we need to deal with on the application level, or are we missing a DB switch/flag/feature that would take care of this automatically?

Thanks!

A “hybrid” approach can be taken by adding a “hash” data member to your document or row. Then using asserts (via xtql) or an extra WHERE clause in the SQL version, the update can be elided in the case the data is the same for the hash.

SQL Example
XTQL Example

Note that with the XQTL version, you’ll need to catch on the transaction, given it seems like the function will throw. In the fiddle, the query doesn’t run, given the second transaction fails on the assert, which signals that an “insert” will only take place when the hash is different. In this example, the hash was computed on a hash map like so:

(hash-unordered-coll
  {:xt/id 1, :name "Pendleton Bikes", :price 340})

Additionally, you could also use a function for the Clojure API to jump out of the transaction if the hash was the same, but the above more/less can solve the issue.

Hope this perspective help! Not sure if it is “idiomatic”, but how I’d approach the problem at a cursory glance :slight_smile:


Additionally, out of curiosity: how come storing all the data, even if it is redundant, is undesirable?

@bnert thanks for the helpful perspective!

From the bottom up:

Additionally, out of curiosity: how come storing all the data, even if it is redundant, is undesirable?

Ergonomically, storing all the data makes a lot of sense, and would make for a great DX. The reason I’m asking is because I suspect that approach (in our specific situation) would quickly “bloat” the database.

In a single market (city) in a single province, we’re receiving ~10,000 data files every day (one file per “item”) that range from 20KB to 400KB. Many of those items don’t see any changes for days/weeks at a time.

Unless I’m missing something, there are currently no optimizations for such cases, and the full document would be stored every time? (which seems less than ideal!)

Sidebar: Additionally, we’re looking at breaking out attributes that do update more frequently (think aggregated attributes, like ratings) into “side tables” to keep storage from going out of control.


Sadly, I wasn’t able to see the fiddles you linked to; looks like an issue with the server:

Additionally, you could also use a function for the Clojure API to jump out of the transaction if the hash was the same, but the above more/less can solve the issue.

Checking my assumptions: you’re talking about a Clojure transaction function?

Not sure if this is what you meant, but I might kick the tires on a function that accepts a full doc, then inserts (doesn’t exist), updates (hash-unordered-coll doesn’t match), or returns false. (hash-unordered-coll matches)

That seems like it would obviate the need for a dedicated “hash” data member?

Really appreciate your thoughts and feedback. Thanks again!

Sadly, I wasn’t able to see the fiddles you linked to

Weird, I ran into the same issue with those links as well…

Hopefully this SQL fiddle works. Was running into issues using the “Save as URL” with XTQL examples and the resulting query being munged.

Therefore, the transactions + queries for Clojure could look like:

(def document
  {:xt/id 1
   :name "Pendleton Electric Bicycle"
   :price 340})

(doseq [d (repeat 3 document)]
  ; If the document "d" already had the :hash member
  ; it'd need to be filtered out
  (let [hash (hash-unordered-coll d)]
    (xt/submit-tx node
      [[:assert-not-exists
        '(from :products [{:xt/id $id, :hash $hash}])
        {:id (:xt/id d), :hash hash}]
       [:put-docs :products (assoc d :hash hash)]

(xt/q node
  '(-> (from :products
         {:bind [xt/id xt/valid-from name price hash]
          :for-valid-time :all-time})
       (order-by xt/valid-from))

Checking my assumptions: you’re talking about a Clojure transaction function?

Yes, that is what I was intending to point toward.

Not sure if this is what you meant, but I might kick the tires on a function that accepts a full doc, then inserts (doesn’t exist), updates (hash-unordered-coll doesn’t match), or returns false. (hash-unordered-coll matches)

That seems like it would obviate the need for a dedicated “hash” data member?

I think so! Something along the lines of:

[:put-fn :put-when-distinct
  '(fn [x]
    (let [d  (first (q (from :products [{:xt/id $id} *])
                       {:args {:id (:xt/id x)}}))
          d' (or d {})]
      (if-not (= (hash-unordered-coll d') (hash-unordered-coll x))
        [:put-docs :products x]
        false)))]
 
[:call :put-when-distinct {:xt/id 1, :name "..."}]

Haven’t run the above, but it seems like it gets the example across :slight_smile:


Unless I’m missing something, there are currently no optimizations for such cases, and the full document would be stored every time? (which seems less than ideal!!)

I may be missing something as well, so maybe one of the team members from Juxt may need to hop in here. That said, I am under the assumption that columnar oriented formats can be efficient when it comes to space, especially in the immutable case. Hypothetically, if the data doesn’t change, then my intuition is only metdata around system/valid time will be created/updated, preventing the need for a new values to be stored. I may be wrong though, I have a hunch there is some nuance here given the HTAP messaging. In the case a new document is created/serialized, some sort of computed hash and updating only when need be seems to be a good direction!

FYI, we currently do store the full document insert/update.
As @bnert mentioned there are some encoding techniques we could take advantage of in the future to help mitigate this.

It doesn’t look like it will help in this case, but another option mentioned here is to use a “fast changing” table to limit duplication to columns that are slower to change.


As for the broken fiddle link @bnert would you mind sharing your OS + browser?
I’ve figured out why the link is broken but I’m struggling to replicate the creation of the link via the save button :thinking:

@osm thanks for confirming that full documents are stored on updates; glad to know we weren’t going crazy!

Also appreciate the link to the discussion around fast changing tables; that’s what we were planning to do, but it’s nice to get some external validation.

@bnert thanks for being an amazing internet stranger; your :put-when-distinct example function really helped get us moving in the right direction faster than expected!

1 Like

@osm, thanks for the clarification on how documents are stored. Good to know my mental model was off :slight_smile: From your knowledge, is that primarily due to the hybrid (OLTP + OLAP) design of xtdb v2?

As for the broken fiddle link @bnert would you mind sharing your OS + browser?

OS: MacOS 14.2.1 (23C71)
Browser: Orion, Version 0.99.127.1-beta (WebKit 619.1.1)

1 Like

Glad ya’ll have a promising direction on this! Happy to collaborate :slight_smile:

Additionally, I am curious about the “fast changing” tables, but I currently don’t have a slack account and am strapped for time today. Is there a means to provide a publicly accesible link? If not, I can go through the slack rigamarol, haha.

Short version: use identical :xt/id values plus the addition of tables to separate fast and slow updating attributes.

3 Likes

Thanks for posting the blurb! Helps grok previous suggestions :slightly_smiling_face:

@bnert
Not sure about the reasoning myself but I’ve asked someone who knows more, hopefully they can get back to us :slight_smile:.

On the issue with the fiddle, it turns out it was an interesting bug with the way Orion browser copies urls.
I’ve worked around it for you by making the share button a “copy url” button, which is more useful for everyone anyway :smiley: (should be live in a few mins).

2 Likes

Hey folks - sorry to chime in late on this one!

My simplistic answer to this part of the conversation: storage is cheaper than ever though still not ‘free’, but if you want fast, general-purpose reads & writes (by default) then you probably want to always embrace the tradeoff in storage cost on the basis that inelastic (provisioned) compute is often the most expensive part of the infrastructure. Therefore, v2 uses S3/etc. for primary storage, which is in principle as cheap as commodity storage gets, and the engine currently defaults to naively storing copies of ~redundant updates to data. By contrast, pure-OLTP systems are generally built on an assumption of pre-provisioned/sized local disks where storage usage is carefully managed (…or you have distributed transactions across scale-out nodes, but that’s another conversation…)

Ultimately though, v2 would ideally provide options and possibilities for moving around in the ‘RUM’ design space and employ a range of compression/delta encoding strategies. The RUM paper is a good overview and useful lens for thinking about all these tradeoffs: Designing Access Methods: The RUM Conjecture

2 Likes

Hi folks, resurrecting this thread to capture the following (relevant) guidance for SQL users…

How to avoid superfluous inserting when a record has not changed?

At the moment XTDB performs an INSERT even when the row doesn’t change, which means that there is one additional line in the visible & stored history for every INSERT even when nothing changed in the record. Ideally XTDB could be smart enough to understand that nothing changed.

Response

In general we don’t want to prevent users from re-asserting the same states at different points in time, so the default semantics here won’t be changed. However you have a couple of DML strategies to prevent unnecessary write and storage amplification (at the inevitable expense of indexing latency & throughput).

The first is to re-assert all the columns/values from the client:


UPDATE docs

SET foo = 'bar'

WHERE foo <> 'bar';

This has the downside of needing the client to remember the previous values.

The second strategy is to additionally have the client remember the previous successful _system_to that was written, as well as the previous values. This minimises the work done on the indexing thread, but it relies on the assumption that all other clients are also only ever updating when the values are actually changed (otherwise it will devolve into the same behaviour you are hoping to avoid).

Working example: Play