[2.x] Efficiently querying collections

Question: what is (or will be) the best/most performant way to query collections?

Use case: we have a “people” table. Each person can have one or more phone numbers with metadata (e.g. label) attached. We will always query by phone number, never metadata.

Here’s our first, preferred (and naive?) attempt:

(xt/q
  my-node
  '(->
     (rel
       [{:name          "John Smith"
         :phone-numbers [{:label "home" :phone "+15551111111"}
                         {:label "work" :phone "+15552222222"}
                         {:label "mobile" :phone "+15553333333"}]}
        {:name          "Jane Smith"
         :phone-numbers [{:label "work" :phone "+15554444444"}
                         {:label "mobile" :phone "+15555555555"}]}]
       [name phone-numbers])
     (unnest {:phone-number phone-numbers})
     (with {:phone (. phone-number phone)})
     (where (= phone "+15554444444"))))

Unless I’m missing something, my current understanding is that every record in the table would need to be examined.

Other options we can see:

  • storing phone numbers and metadata in separate fields (e.g. {:phone-numbers ["+15551111111"] :metadata {"+15551111111" {:label "home"}}) (this obviates the with step, but doesn’t seem terribly useful)
  • expanding phone numbers and metadata (e.g. {:phone-0/phone "+15551111111" :phone-0/label "home"}) (keeps queries… interesting!)
  • storing phone numbers in a separate table (e.g. {:xt/id #uuid … :phone "+15551111111"} (phone numbers aren’t unique, so we can’t use them for IDs)

I believe the last two options would provide for the best performance given the current state of XTDB; my understanding is that the planner can limit the number of “pages” scanned for top-level attributes.

However, our initial data modeling approach strikes me as ergonomically best, and would (could?) work well once secondary indexes make their debut.

Finally, this is a bit hypothetical; we’re not storing production data yet, but we want to be ready to hit the ground running!

Thanks!

Hey @tenpin your understanding and reasoning here all looks sound to me :slightly_smiling_face:

Unless I’m missing something, my current understanding is that every record in the table would need to be examined.

You’re right that we don’t have real secondary indexes yet (beyond “zone map” pruning), so full table scans will be required to execute the model/query as shown with the nested data.

Given you are most concerned about performance I suspect the “expanding phone numbers and metadata” (second bullet) option is the better choice.

It might be worth a chat soon though! Please reach out this week if you have time: jdt@juxt.pro

@refset thanks for confirming my (still limited!) understanding of XTDB’s internals.

The second option is weirdly appealing to me! A downside is the need to track the highest cardinality across all documents in order to perform accurate queries (which might end up having a breathtaking or clause :joy:)

We might also look at maintaining our own secondary indexes, since ordered result sets and the ability to query multiple attributes is something we’re looking for.

Will reach out later in the week. I’ve really appreciated your help and feedback!

1 Like