Lucene indexes and storage of fields

xlfe · 5 January 2022 21:55

Just wanted to copy this here from the JUXT Zulip for public reference.

Just wondering, why does the Lucene module store the fields? I can see that the resolve-search-results-a-v{-wildcard} function requires the attribute and value (hence they have to be stored by Lucene), but why is this preferred over storing the document temporal hash? Is it to avoid having to reindex multiple copies of attributes on the same document that haven’t changed?

Answer from @refset

Exactly this. Yep, undoubtedly in some scenarios it’s going to be less than ideal to have massive strings stored many times (i.e. both in the KV indexes and in multiple ways in Lucene), but hopefully overall this is the better trade-off to make (…although we don’t have any hard empirical data to support that claim!)
For others’ context, this is really what we are discussing: https://github.com/xtdb/xtdb/blob/ff0895b0c3cc956940a4784afe1b477094995a88/modules/lucene/src/xtdb/lucene.clj#L240-L243
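
To make the trade-off above concrete, here is a toy model in Python. It is only a sketch and not XTDB's or Lucene's actual code: the function names, the SHA-1 content hash, and the sample documents are all assumptions made for illustration. It compares an index keyed on (attribute, value) with one keyed on a per-version document hash, for a document whose long `bio` string never changes while `city` does.

```python
import hashlib
import json


def content_hash(doc):
    """Hypothetical stand-in for a per-version document hash."""
    payload = json.dumps(doc, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def index_by_attr_value(versions):
    """One entry per distinct (attribute, value) pair across all versions.

    An unchanged attribute produces the same entry in every version,
    so it is only indexed once.
    """
    entries = set()
    for doc in versions:
        for attr, value in doc.items():
            entries.add((attr, value))
    return entries


def index_by_doc_hash(versions):
    """One entry per (document hash, attribute, value).

    Every new version has a new hash, so unchanged attributes are
    indexed again for each version.
    """
    entries = set()
    for doc in versions:
        doc_hash = content_hash(doc)
        for attr, value in doc.items():
            entries.add((doc_hash, attr, value))
    return entries


# Three versions of the same document; only "city" changes.
versions = [
    {"id": "ivan", "bio": "a long biography", "city": "London"},
    {"id": "ivan", "bio": "a long biography", "city": "Paris"},
    {"id": "ivan", "bio": "a long biography", "city": "Berlin"},
]

av_entries = index_by_attr_value(versions)
hash_entries = index_by_doc_hash(versions)
print(len(av_entries), len(hash_entries))
```

In this toy model the (attribute, value) index holds the unchanged `bio` string once, while the hash-keyed index holds it once per version. The cost on the other side is the one mentioned in the answer: the value itself has to live in the search index as well as in the KV indexes.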
