Storing derived data in the DB for indexing purposes

Some of our entities have a list of PDF documents they “own” which have a couple of attributes (e.g. id, type, name, expiration date, …). In the UI (editing, change history, …) those documents are always part of the owning entity. To avoid having to convert back and forth between a representation where the documents are their own entity and one where they aren’t we decided to make documents part of the owning entity on the DB side as well.

This of course has the issue that we can’t use document attributes in datalog queries because XTDB only indexes entity attributes one layer deep. A workaround we’re considering to implement is to create secondary entities derived from the documents in the primary entities containing the document attributes and a reference to the owning entity. This has the advantage that we can use the indexes created from the secondary entities within our datalog queries (including time-travel, which would be much harder if we created our own external indexes) and disadvantage that we’re polluting our DB with derived data.

What are your thoughts?

in the UI those documents are always part of the owning entity

I might be misunderstanding what is meant by this, but I think this is quite a common pattern. Even if XTDB 1.x indexed nested attributes, there would still be situations wherein you might want to break apart entities for storage and recombine them for display.

disadvantage that we’re polluting our DB with derived data.

As long as you are conscious of the derived data, I don’t see this as a problem. However, I’m a bit confused as to how querying the derived data is any better than storing the secondary entities separately, as far as your query and display is concerned – doesn’t this still introduce a translation step?

If you are using Datalog, it is possible that the pull syntax may save you some grief here?

If you are going to denormalize the document, could you just instead flatten the structure and save the document fields in the same level as the “parent” entity, perhaps using namespaced keys?

That is, instead of having {:field-a a :field-b :document doc :document-prop-a doc-a} have {:field-a a :field-b :doc/field-a doc-a :doc/field-b doc-b} and so on

One could consider using the secondary entities to find the primary entities as a translation step, but since those already contain all the data there would be no recombination needed.

I hadn’t thought about pull syntax as IIRC in terms of “querying power” it’s a subset of datalog but it certainly would make the recombining easier. Something to consider.

This works if there is at most one document per entity, alas that’s not the case for us.