Using a document with a vector of values vs. splitting them to many documents

Re-posting the original question:

Hukka 5 hours ago

Is there a performance difference between having a document with a vector of values, vs splitting them to many documents that refer to the original document? I suppose that’s assuming it’s a flat list, because if they aren’t then the split to many documents will fill indexes a lot more. I’m thinking if there’s a trade-off between the write amplification caused by changing the elements (or just appending) vs possibly slower queries (if querying for all elements from the parent)

A ~2 year old (but still relevant) Q&A exchange I just found in the Clojurians Slack history:

Q: Is it ok to store frequently changed attributes like :tweet/view-count together with more static and “large” attributes like :tweet/text? Or should I have two documents?
Similarly, is it ok to model arity-many references as attributes of a document? Even if the attribute is expected to have thousands of values? Or should I use arity-one reverse references instead?

A: As it stands today none of XT’s doc-store module implementations attempt to do any clever structural sharing across revisions, so there invariably will be a storage (and network) overhead when frequently submitting large and mostly unchanged documents. Whether this inefficiency is cost-prohibitive for a given use-case would require some analysis. We have no immediate plans to add structural sharing to the existing doc-store implementations but it’s something we could bring attention to fairly quickly if required.
Note that a doc-store only ever stores one physical copy of any given revision, which is achieved by using the content-hash as the key.
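
To make the storage overhead described above concrete, here is a rough toy sketch in Python. It is not XT's actual doc-store: the ToyDocStore class, the JSON serialization and the SHA-256 hash are all assumptions chosen for illustration. It only shows that a store keyed by the hash of the whole document keeps one copy of identical revisions, yet keeps a full copy for every revision that differs by a single counter.

```python
import hashlib
import json


class ToyDocStore:
    """Hypothetical content-addressed store: key = hash of the whole document."""

    def __init__(self):
        self.blobs = {}

    def put(self, doc):
        # Serialize deterministically so that equal documents hash identically.
        payload = json.dumps(doc, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        # An identical revision maps to the same key, so it is stored only once.
        self.blobs[key] = payload
        return key

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


def bytes_for_combined(updates, text):
    """One document holding both the large text and the fast-changing counter."""
    store = ToyDocStore()
    for count in range(updates):
        store.put({"id": "tweet-1", "text": text, "view-count": count})
    return store.stored_bytes()


def bytes_for_split(updates, text):
    """Large text stored once; the counter lives in a small separate document."""
    store = ToyDocStore()
    store.put({"id": "tweet-1", "text": text})
    for count in range(updates):
        store.put({"id": "tweet-1-stats", "tweet": "tweet-1", "view-count": count})
    return store.stored_bytes()


text = "x" * 10_000  # stand-in for a large, static attribute
combined = bytes_for_combined(100, text)
split = bytes_for_split(100, text)
print("combined:", combined, "bytes; split:", split, "bytes")
```

With no structural sharing, the combined layout pays for the full text on every counter update, while the split layout pays for it once. Whether that difference matters for a real workload still needs measuring, as the answer says.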

Splitting fast-changing attributes into a separate document (and therefore entity) is a useful pattern for performance optimisation and would have benefits regardless of whether the doc-store implements structural sharing. However, the pattern shouldn’t be applied too liberally as it introduces new complications when transacting and querying. In the extreme case of attempting to use a distinct document (& entity) per attribute you will almost certainly find performance is a lot worse 🙂

The Rocks/LMDB indexes do benefit from structural sharing though, so at least the storage overheads only accumulate in the one central place (the doc-store).

How all this should impact your modelling is debatable, but certainly there are some advantages in being able to use vectors or sets as forward-reference containers. The disadvantages are mostly performance related when dealing with large documents, as you point out, but also positional information from vectors is effectively invisible to the indexes & Datalog, so the utility is limited (pull and entity will respect the original vector ordering, however). In general though I would usually lean towards modelling with reverse references by default (arity-one or otherwise).
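
As a loose illustration of the forward vs. reverse reference point, here is a toy Python sketch. The build_index function and the playlist/track documents are hypothetical and are not how XT's indexes are implemented; the only assumption carried over from the answer is that each element of a vector is indexed as an individual value, so its position is not visible to the index.

```python
from collections import defaultdict


def build_index(docs):
    """Hypothetical attribute-value index: (attribute, value) -> set of doc ids."""
    index = defaultdict(set)
    for doc in docs:
        for attr, value in doc.items():
            if attr == "id":
                continue
            # Each element of a vector is indexed on its own; position is dropped.
            values = value if isinstance(value, list) else [value]
            for v in values:
                index[(attr, v)].add(doc["id"])
    return index


# Forward references: the parent document holds a vector of child ids.
forward_docs = [
    {"id": "playlist-1", "tracks": ["track-c", "track-a", "track-b"]},
]

# Reverse references: each child document points back at its parent.
reverse_docs = [
    {"id": "track-a", "playlist": "playlist-1"},
    {"id": "track-b", "playlist": "playlist-1"},
    {"id": "track-c", "playlist": "playlist-1"},
]

fwd_index = build_index(forward_docs)
rev_index = build_index(reverse_docs)

# The index can say which playlists contain a track, but not at which position.
print(fwd_index[("tracks", "track-a")])

# With reverse references, one index key yields all children, as an unordered set.
print(sorted(rev_index[("playlist", "playlist-1")]))

# The original ordering is only available by reading the parent document itself.
print(forward_docs[0]["tracks"])
```

In this toy model, adding a track under the reverse-reference layout means writing one small document, whereas the forward-reference layout means rewriting the parent and its whole vector.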

What I would add in my response now to the specific question at hand…

I’m thinking if there’s a trade-off between the write amplification caused by changing the elements (or just appending) vs possibly slower queries (if querying for all elements from the parent)

There is definitely going to be such a write vs. read amplification trade-off. With a sufficiently small number of values in the list, let’s say 50 (there is likely a tipping point that can be measured, based on the average size of your scalar values), I would expect any write amplification to be negligible, lost in the noise of the basic per-write overheads. Read performance, meanwhile, may be up to roughly 50x faster than with the splitting approach, because the entity API can retrieve the document in one I/O operation whereas a query needs to perform 50 index lookups. Beyond that tipping point the write amplification could become a bottleneck (for both latency and throughput), but that can be measured, and it may not prove too expensive in practice.
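
A back-of-envelope way to reason about where that tipping point might sit is sketched below in Python. Everything in it is an assumption for illustration: the function names, the fixed per-write overhead, the value size, and the simplification that one index lookup costs about the same as one document fetch. It is no substitute for measuring a real workload.

```python
def write_bytes_embedded(n_values, value_size, overhead):
    """Appending to an embedded vector rewrites the whole document."""
    return overhead + n_values * value_size


def write_bytes_split(value_size, overhead):
    """Appending with split documents writes one small new document."""
    return overhead + value_size


def write_amplification(n_values, value_size, overhead):
    """How many times more bytes the embedded layout writes per append."""
    embedded = write_bytes_embedded(n_values, value_size, overhead)
    return embedded / write_bytes_split(value_size, overhead)


def read_ops_embedded():
    """Fetching the parent document returns every value in one operation."""
    return 1


def read_ops_split(n_values):
    """Assumed: one index lookup per child when querying from the parent."""
    return n_values


# Hypothetical figures: 32-byte values, 4096 bytes of fixed cost per write.
VALUE_SIZE = 32
OVERHEAD = 4096

for n in (10, 50, 500, 5000):
    amplification = write_amplification(n, VALUE_SIZE, OVERHEAD)
    read_ratio = read_ops_split(n) / read_ops_embedded()
    print(f"n={n}: write amplification {amplification:.2f}x, "
          f"read advantage {read_ratio:.0f}x")
```

Under these made-up numbers the write amplification stays close to 1x while the vector is small and grows roughly linearly once the values outweigh the fixed overhead, which is the shape of the trade-off described above. The real crossover depends on the actual overheads and value sizes.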