How can I test document store with new XTDB 2 installation in Azure AKS?

Hello, first of all, thanks for developing a really incredible project. I’m very excited to get XTDB 2 up and running in my test cluster!

I feel like I’m missing something extremely basic, but I can’t seem to find where the document/object store data is located. I see nothing in /var/lib/xtdb/buffers/disk-cache/, and nothing in my xtdb Blob Storage container. Yet I’m still able to perform queries normally, even after deleting/restarting the xtdb-statefulset pod(s). I do see a xtdb-log topic in Kafka, but I was under the impression that would not be used for answering database queries.

Here is a bit more info about my test setup:

  • I have 3 Kafka nodes that can talk to each other, and I can add topics, produce/consume messages, etc.
  • I have created a single XTDB node using helm install, providing the appropriate Blob Storage container credentials, and providing the Kafka headless service DNS name via --set arguments.
  • The XTDB node starts up with no errors (server started on port 5432, Healthz server started on port 8080, HTTP server started on port 3000).
  • I can connect to the database using psql -h PUBLIC_IP -p 5432 xtdb and issue SQL queries.
  • I have issued 500 INSERTs of the form shown in the “Handling Documents” section of the SQL Overview, and confirmed that a subsequent SELECT returns 500 rows.
  • I can then consume from the Kafka nodes and see binary message data on the xtdb-log topic, but I can’t find the actual JSON-like documents anywhere on the persistent disk or in Blob Storage.
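For anyone else checking the same things, the inspection steps above can be sketched as the following commands (pod names, namespaces and the bootstrap address are illustrative guesses for a typical AKS install - substitute your own):

```
# Look for files in the node's local disk cache (pod name is illustrative):
kubectl exec xtdb-statefulset-0 -- ls -la /var/lib/xtdb/buffers/disk-cache/

# Peek at the raw (binary Arrow) records on the transaction log topic
# from inside a Kafka broker pod - output is not human-readable, the
# point is only to confirm that records are arriving:
kubectl exec -n kafka kafka-0 -- \
  kafka-console-consumer.sh \
    --bootstrap-server localhost:9092 \
    --topic xtdb-log \
    --from-beginning \
    --max-messages 5
```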

So I guess my specific questions are:

  1. Am I misunderstanding the distinction between document store (stored in Blob Storage and cached on local disk) and transaction log (stored in a Kafka topic) in XTDB 2?
  2. Is there some time/bytes-based delay between when the transaction log is written and when the document store is written? If so, how can I reduce this delay to confirm that things are working?
  3. How is XTDB 2 able to answer queries using only the transaction log, assuming that is indeed what’s happening here?

Thank you!


Hey @jrmcclurg - thanks for your interest and words of encouragement!

It sounds like you are seeing the local disk indexes being re-populated each time you restart the pods - this is expected, and a consequence of how replication/replay from the xtdb-log topic works.

Currently (in beta6), a file is only written to the object store once around 100MB of data has been INSERTed - anything less than that won’t trigger the background job which creates the files (either in the disk-cache or the remote object store - the data will reside in memory only). Have you tried INSERTing more data than that already?
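If it helps to confirm the flush behaviour, here is one way to generate enough data to cross that ~100MB threshold. The table and column names are made up for illustration (only `_id` is an XTDB convention) - adapt the INSERT shape to your own schema, and replace PUBLIC_IP as in the original post:

```shell
# Generate ~110 MB of INSERT statements (hypothetical table/columns),
# enough to cross the ~100 MB object-store flush threshold.
awk 'BEGIN {
  q = sprintf("%c", 39)             # single-quote character for SQL literals
  payload = sprintf("%01024d", 0)   # ~1 KB zero-padded filler per row
  for (i = 1; i <= 110000; i++)
    printf "INSERT INTO docs (_id, payload) VALUES (%d, %s%s%s);\n", i, q, payload, q
}' > bulk_inserts.sql

# Feed it to the node (uncomment and replace PUBLIC_IP with your address):
# psql -h PUBLIC_IP -p 5432 xtdb -f bulk_inserts.sql
```

After running something like this, files should start appearing in the disk-cache and the Blob Storage container once the background job fires.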

I don’t think so, although ‘document store’ is not a name we’re using with v2 (unlike v1), as it implies document-level granularity, whereas the files in v2 will contain many documents.

Prior to beta6 there was a default ‘stagnant log flusher’ process that would write files to the object store at a fixed time interval (or every 100k rows), regardless of the amount of data in the ‘live’ index - but in beta6 we changed that behaviour to be based purely on the number of bytes accumulated.

The log contains a row-oriented copy of the data, similar to a logical WAL in Postgres. It still uses Arrow though, so it works well with bulk insert parameters.

EDIT: heads up that there will be a few changes to the Azure stack coming in beta7 - see Updates to Azure helm/terraform · Issue #4178 · xtdb/xtdb · GitHub

Hope that helps,

Jeremy