Using a document with a vector of values vs. splitting them to many documents

Re-posting the original question:

Hukka 5 hours ago

Is there a performance difference between having a document with a vector of values, vs splitting them to many documents that refer to the original document? I suppose that’s assuming it’s a flat list, because if they aren’t then the split to many documents will fill indexes a lot more. I’m thinking if there’s a trade-off between the write amplification caused by changing the elements (or just appending) vs possibly slower queries (if querying for all elements from the parent)

A ~2 year old (but still relevant) Q&A exchange I just found in the Clojurian Slack history:

Q: Is it ok to store frequently changed attributes like :tweet/view-count together with more static and “large” attributes like :tweet/text? Or should I have two documents?
Similarly, is it ok to model arity-many references as attributes of a document? Even if the attribute is expected to have thousands of values? Or should I use arity-one reverse references instead?

A: As it stands today none of XT’s doc-store module implementations attempt to do any clever structural sharing across revisions, so there invariably will be a storage (and network) overhead when frequently submitting large and mostly unchanged documents. Whether this inefficiency is cost-prohibitive for a given use-case would require some analysis. We have no immediate plans to add structural sharing to the existing doc-store implementations but it’s something we could bring attention to fairly quickly if required.
Note that a doc-store only ever stores one physical copy of any given revision, which is achieved by using the content-hash as the key.

Splitting fast-changing attributes into a separate document (and therefore entity) is a useful pattern for performance optimisation and would have benefits regardless of whether the doc-store implements structural sharing. However the pattern shouldn’t be applied too liberally as it introduces new complications when transacting and querying. In the extreme case of attempting to use a distinct document (& entity) per attribute you will almost certainly find performance is a lot worse :slightly_smiling_face:
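To make that splitting pattern concrete, here is a minimal hypothetical sketch (attribute and id names are invented for illustration, and `node` is assumed to be an already-started XTDB node):

```clojure
(require '[xtdb.api :as xt])

;; The large, mostly-static attributes live in one document...
(xt/submit-tx node
  [[::xt/put {:xt/id :tweet/1
              :tweet/author :user/alice
              :tweet/text "A long, rarely changing body of text ..."}]
   ;; ...while the fast-changing counter lives in a separate, tiny document
   ;; that references the tweet.
   [::xt/put {:xt/id :tweet.stats/1
              :tweet.stats/tweet :tweet/1
              :tweet.stats/view-count 42}]])

;; Bumping the view count now only re-submits the tiny stats document,
;; not the full tweet text.
(xt/submit-tx node
  [[::xt/put {:xt/id :tweet.stats/1
              :tweet.stats/tweet :tweet/1
              :tweet.stats/view-count 43}]])
```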

The Rocks/LMDB indexes do benefit from structural sharing though, so at least the storage overheads only accumulate in the one central place (the doc-store).

How all this should impact your modelling is debatable, but certainly there are some advantages in being able to use vectors or sets as forward-reference containers. The disadvantages are mostly performance related when dealing with large documents, as you point out, but also positional information from vectors is effectively transparent to the indexes & Datalog, so the utility is limited (pull and entity will respect the original vector ordering, however). In general though I would usually lean towards modelling with reverse references by default (arity-one or otherwise)
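To make the two modelling shapes concrete, here is a hypothetical sketch (names invented purely for illustration) of a vector of forward references on the parent vs. arity-one reverse references on the children:

```clojure
;; Shape A: the parent holds a vector of forward references. One entity
;; lookup returns all members, but any membership change (or append)
;; re-submits the whole parent document.
{:xt/id :playlist/road-trip
 :playlist/name "Road trip"
 :playlist/tracks [:track/1 :track/2 :track/3]}

;; Shape B: arity-one reverse references - each child points at the parent,
;; so adding or removing a member only touches that child's document.
{:xt/id :track/1 :track/title "..." :track/playlist :playlist/road-trip}
{:xt/id :track/2 :track/title "..." :track/playlist :playlist/road-trip}

;; With shape B, membership is recovered with a query rather than a single
;; entity lookup (and any ordering has to be modelled explicitly):
(xt/q (xt/db node)
      '{:find [?track]
        :where [[?track :track/playlist :playlist/road-trip]]})
```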

What I would add in my response now to the specific question at hand…

I’m thinking if there’s a trade-off between the write amplification caused by changing the elements (or just appending) vs possibly slower queries (if querying for all elements from the parent)

There is definitely going to be such a write vs. read amplification tradeoff, but with some sufficiently small number of values in the list, let’s say 50 (…there is likely a tipping point that can be measured, based on the average size of your scalar values), I would expect any write amplification to appear completely negligible (due to the noise of the basic overheads involved) whilst the read performance may be 50x faster than the splitting approach, because the entity API can retrieve the document in one I/O operation vs. a query needing to perform 50 index lookups. Beyond that tipping point the write amplification could definitely become a bottleneck (both latency & throughput) but that can be measured, and it may not actually prove too expensive in practice.
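In code, the two read paths look something like this (reusing the hypothetical playlist modelling from the sketch above):

```clojure
;; Vector-in-one-document: a single entity lookup returns all the values in
;; (roughly) one I/O operation, preserving the original vector order.
(:playlist/tracks (xt/entity (xt/db node) :playlist/road-trip))

;; Split-document modelling: the same read becomes a query that has to walk
;; one index entry per child document (~50 lookups for 50 members).
(xt/q (xt/db node)
      '{:find [?track]
        :where [[?track :track/playlist :playlist/road-trip]]})
```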

Oh, I was pondering this same question, nice to find it posted already.

Anyway, I was pondering it from an entity-history perspective.

I was wondering if there are any considerations when the use-case involves displaying the history of multiple entities.

Think entity-history, but with the desire to get the history of every entity that matches some triple, such as [eid :user user-id], where we want to find all the eids that have the given user-id.
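(For concreteness, finding those eids is a one-clause Datalog query - `node` and `user-id` are placeholders here:)

```clojure
;; All entity ids that have the given user-id under the :user attribute.
(xt/q (xt/db node)
      '{:find [?eid]
        :in [?user-id]
        :where [[?eid :user ?user-id]]}
      user-id)
```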

Idea #1: query for every document matching the user-id, then call entity-history on each of those entities (with with-corrections?), and then sort all of the results together (there is a rough sketch of this after the list). This seems like it might be a lot of overhead for the DB. It is also, in essence, a k-way merge problem, eh. :sweat_smile:

Idea #2: have a “history” document for each user-id that stores every change. It becomes easy to query for the history, and if more information is needed for a given document then entity-history can provide more details. It might get too large, though.

Idea #3: like idea #2 but maybe have multiple history documents, to put a cap on the max size of the document. Then backlink to older history docs.

Idea #4: probably a bad idea, but one could store all information pertaining to user-id into a single document, such that using entity-history would output a unified and sorted history for this user. The issue is that the document gets super large.

Well, that is what I have thought of so far. I was wondering if I am missing some obvious, simpler solution.
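For what it’s worth, a very rough (and unoptimised) sketch of what idea #1 might look like, assuming the placeholder names above and that user-id is already bound:

```clojure
;; Idea #1, naively: find the eids, pull each entity's full history, then
;; merge everything into one timeline sorted by valid-time.
(let [db   (xt/db node)
      eids (map first
                (xt/q db
                      '{:find [?eid]
                        :in [?user-id]
                        :where [[?eid :user ?user-id]]}
                      user-id))]
  (->> eids
       (mapcat #(xt/entity-history db % :asc {:with-docs? true
                                              :with-corrections? true}))
       (sort-by :xtdb.api/valid-time)))
```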

Thanks for the question @Sleepful - it’s definitely a closely related topic. There’s some general advice I often give for this kind of modelling question: “things that change together should travel together”.

Your list of ideas feels fairly comprehensive, and in your scenario I think I would first lean towards #2 as it relies purely on XT’s efficient handling of many versions without introducing any unnecessary indirection. I may of course be misunderstanding some requirement, but that’s where I would start.
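As a rough sketch of what #2 could look like (ids and attribute names invented for illustration, with a hard-coded user for simplicity - the point is just that each new version of the per-user history document is one more point on that user’s timeline):

```clojure
;; One "history" document per user; appending an entry re-submits this
;; document, and XT's revision history of the document *is* the user's
;; unified timeline.
(xt/submit-tx node
  [[::xt/put {:xt/id :user-history/alice
              :history/user :user/alice
              :history/entries [{:event :order-placed  :ref :order/17}
                                {:event :email-changed :ref :user/alice}]}]])

;; Reading the time-ordered history back is then a single call:
(xt/entity-history (xt/db node) :user-history/alice :asc {:with-docs? true})
```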

I would be surprised if you found a need to consider #3, but as with all these things it’s good to measure.

Also, make sure to use open-entity-history to stream out only the subset of history you need; otherwise you will probably find that #2 scales poorly.
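Something along these lines - open-entity-history returns a closeable cursor, so you only realise the slice of history you actually need (continuing the hypothetical :user-history/alice example):

```clojure
;; Stream the history lazily and stop after the first 20 entries, rather
;; than realising the user's entire timeline in memory.
(with-open [history (xt/open-entity-history (xt/db node)
                                             :user-history/alice
                                             :desc
                                             {:with-docs? true})]
  (->> (iterator-seq history)
       (take 20)
       (mapv :xtdb.api/doc)))
```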