[2.x] Efficiently querying collections

tenpin · 30 June 2024 20:15

Question: what is (or will be) the best/most performant way to query collections?

Use case: we have a “people” table. Each person can have one or more phone numbers with metadata (e.g. label) attached. We will always query by phone number, never metadata.

Here’s our first, preferred (and naive?) attempt:

(xt/q
  my-node
  '(->
     (rel
       [{:name          "John Smith"
         :phone-numbers [{:label "home" :phone "+15551111111"}
                         {:label "work" :phone "+15552222222"}
                         {:label "mobile" :phone "+15553333333"}]}
        {:name          "Jane Smith"
         :phone-numbers [{:label "work" :phone "+15554444444"}
                         {:label "mobile" :phone "+15555555555"}]}]
       [name phone-numbers])
     (unnest {:phone-number phone-numbers})
     (with {:phone (. phone-number phone)})
     (where (= phone "+15554444444"))))

Unless I’m missing something, my current understanding is that every record in the table would need to be examined.

Other options we can see:

storing phone numbers and metadata in separate fields (e.g. {:phone-numbers ["+15551111111"] :metadata {"+15551111111" {:label "home"}}) (this obviates the with step, but doesn’t seem terribly useful)
expanding phone numbers and metadata (e.g. {:phone-0/phone "+15551111111" :phone-0/label "home"}) (keeps queries… interesting!)
storing phone numbers in a separate table (e.g. {:xt/id #uuid … :phone "+15551111111"} (phone numbers aren’t unique, so we can’t use them for IDs)

I believe the last two options would provide for the best performance given the current state of XTDB; my understanding is that the planner can limit the number of “pages” scanned for top-level attributes.

However, our initial data modeling approach strikes me as ergonomically best, and would (could?) work well once secondary indexes make their debut.

Finally, this is a bit hypothetical; we’re not storing production data yet, but we want to be ready to hit the ground running!

Thanks!

refset · 1 July 2024 10:44

Hey @tenpin your understanding and reasoning here all looks sound to me

Unless I’m missing something, my current understanding is that every record in the table would need to be examined.

You’re right that we don’t have real secondary indexes yet (beyond “zone map” pruning), so full table scans will be required to execute the model/query as shown with the nested data.

Given you are most concerned about performance I suspect the “expanding phone numbers and metadata” (second bullet) option is the better choice.

It might be worth a chat soon though! Please reach out this week if you have time: jdt@juxt.pro

tenpin · 1 July 2024 23:01

@refset thanks for confirming my (still limited!) understanding of XTDB’s internals.

The second option is weirdly appealing to me! A downside is the need to track the highest cardinality across all documents in order to perform accurate queries (which might end up having a breathtaking or clause )

We might also look at maintaining our own secondary indexes, since ordered result sets and the ability to query multiple attributes is something we’re looking for.

Will reach out later in the week. I’ve really appreciated your help and feedback!

Topic		Replies	Views
Could Someone Give me Advice on Optimization and Indexing in XTDB for Large-Scale Data Users v1	1	144	25 July 2024
XTDB v1 slow query Users	5	63	10 July 2024
Data modeling question Users	1	240	4 January 2024
Datalog help: Only if every matched item Users	2	367	26 October 2022
Handling high-volume temporal queries in XTDB Users	1	28	6 February 2025

[2.x] Efficiently querying collections

Related topics