Questions regarding v2.0 storage architecture

Hello!

Congrats on the beta release!

I’ve been watching the development of v2 since the EA announcement, and we’ve been using v1 in production for a couple of years now. So the questions below are in the context of how storage is implemented in v1 and how it differs in v2.

In v1, there were three distinct components: the index store, the document store & the tx log. The document store & tx log were required to be persistent (and stored in Kafka/Postgres), whereas the index store was kept locally on the node and could be recreated by replaying all transactions from the beginning. In our case we use Postgres for the document store & tx log and RocksDB for indexes.

In v2, it seems that the indexes & documents are stored in S3(-compatible?) object storage and the tx log remains on Kafka. Is this correct? If so, I have the questions below.

  1. Is the tx log required to be persistent, or can it be recreated from the information available in the object store (documents + indexes)?
  2. v1 provided several different options for the storage of golden stores. Is such support planned for v2? Will it be possible to store the tx log in Postgres in the future?
  3. Can MinIO be used for the object store + indexes?

For the questions below, assume that the document store + indexes are stored in object storage. In this case:

  1. If the tx log is stored locally, is it impossible to sync correctly between nodes? In other words, is the use of Kafka primarily for consensus across nodes, or does a local tx log have other limitations as well?
  2. Can a locally stored tx log later be moved to Kafka (or Postgres, if support is planned)? If yes, how?

Regards,
Noel

Hey @noelkurian, great to hear from you!

Yes, that’s the case for now. We’re hoping that Kafka and Kafka-compatible services (e.g. Redpanda and WarpStream) are now ubiquitous enough that we don’t need to support further pluggability in the tx log implementation. Anything S3-compatible should be sufficient for object storage.
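
For reference, a minimal node config targeting Kafka + S3 would look something like the sketch below - note the Kafka field names (bootstrapServers, topic) are illustrative assumptions rather than the authoritative shape, so do check the config reference:

    txLog: !Kafka
      bootstrapServers: "broker-1:9092"   # illustrative
      topic: "xtdb-tx-log"                # illustrative

    storage: !Remote
      objectStore: !S3
        bucket: "my-xtdb-bucket"
        prefix: "my-xtdb-node"

      localDiskCache: /var/lib/xtdb/remote-cache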

The tx log is a Write-Ahead Log which means that there will ~always be some novelty stored there which doesn’t yet exist in the object store (and the delay may be minutes or even hours). This means it is essential for the tx log to be as durable as possible (backed up routinely etc.), otherwise you risk breaking the wider strong consistency and durability guarantees that XTDB offers.

We aren’t seeking to pursue such a course currently but we’re always happy to entertain such conversations, especially if the problems XTDB is solving are valuable enough :slight_smile:

Yes, MinIO should work great. We’ve not done any testing explicitly yet, but would be happy to assist.

The file-based tx log is intended for development only and is not suitable for connecting multiple nodes (even if you can technically make it work with shared filesystems).

Theoretically yes, you should be able to migrate tx logs (local → Kafka, or Kafka A → Kafka B), but please discuss this with us directly before factoring it into your project timelines. I suspect it may already ‘just work’ if you simply pause new submissions to the tx log and wait for the log to be flushed (there’s a configurable background process and schedule) - but that would certainly require the system to be unavailable for new writes for a non-trivial amount of time. A cleaner solution should be possible in future.
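
To make that concrete: assuming the flush has completed and all nodes are stopped, the migration would amount to a config swap along these lines (the Kafka field names again being illustrative assumptions):

    # before: development-only local log
    txLog: !Local
      path: "/var/lib/xtdb/log"

    # after: durable Kafka-backed log
    txLog: !Kafka
      bootstrapServers: "broker-1:9092"
      topic: "xtdb-tx-log"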

Thanks for the questions :pray:

Jeremy

Thanks for the answers. It’s good to know that Kafka-compatible services should work for the tx log. I’ll do some testing with Redpanda later this week.

A quick question about the tx-log. Suppose a scenario where the DB is idle (no read/write queries) and the WAL is flushed. In this case, would it matter if the node were taken down, the tx-log configuration changed to use a different (empty) topic, and the node restarted? Would anything break? Would XTDB notice any change at all?

Assuming XTDB has successfully flushed (according to the configured schedule) since the last application write, and that all the XTDB nodes have shut down before the log is killed, I can’t see why anything should break. The only caveat is that I’m not sure how a new epoch of offsets gets handled if you then switch things back on pointing at a fresh log, but that can be tested fairly easily :slightly_smiling_face:
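
In other words, with everything flushed and the nodes stopped, the switch you’re describing should just be a change of topic in the config (illustrative field names again):

    txLog: !Kafka
      bootstrapServers: "broker-1:9092"
      topic: "xtdb-log-2"   # previously "xtdb-log" - a fresh, empty topic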

> Yes, MinIO should work great. We’ve not done any testing explicitly yet, but would be happy to assist.

I did not see a way to set an alternative endpoint URL in the configuration section to point to my local MinIO instance. I might have misunderstood how this works, but the code at modules/aws/src/main/kotlin/xtdb/aws/s3/S3Configurator.kt calls S3AsyncClient.create() rather than creating a .builder()..., so additional configuration parameters cannot be specified.
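
For comparison, with the plain AWS SDK v2 I’d expect to be able to construct the client along these lines (a sketch only - the endpoint and credentials are placeholders):

    import software.amazon.awssdk.auth.credentials.AwsBasicCredentials
    import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider
    import software.amazon.awssdk.services.s3.S3AsyncClient
    import java.net.URI

    // Builder-based construction: unlike S3AsyncClient.create(), this leaves
    // room for extra configuration such as a custom endpoint for MinIO.
    val s3: S3AsyncClient = S3AsyncClient.builder()
        .endpointOverride(URI.create("https://minio.example.com")) // placeholder
        .credentialsProvider(
            StaticCredentialsProvider.create(
                AwsBasicCredentials.create("MYACCESSKEY", "MYSECRETKEY")))
        .forcePathStyle(true) // MinIO typically needs path-style addressing
        .build()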

Do you have an example of how to get this working?

Thanks for creating and sharing your product!

Hey @jr200, thanks for mentioning this - I wasn’t aware - we’ve now opened an issue to make sure we expose that as a configuration option: Users can configure the S3 endpoint URL · Issue #3725 · xtdb/xtdb · GitHub

@jr200 good news! @jarohen just landed the changes for MinIO in add support for configuring S3 endpoint and creds, resolves #3725 · xtdb/xtdb@c37ffef · GitHub (also adding it into our testing setup). This should be available to test via the nightly release once it’s published in the next few hours (e.g. using docker run -p 5432:5432 ghcr.io/xtdb/xtdb:nightly + the edn config example here)

Is there a way to get more information when I have an invalid config? E.g., my local_config.yaml is:

    metrics: !Prometheus
      port: 8080

    modules: 
    - !HttpServer
      port: 3000
    - !PgwireServer
      port: 5432

    txLog: !Local
      path: "/var/lib/xtdb/log"

    storage: !Remote
      objectStore: !S3
        bucket: xtdb
        prefix: my-xtdb-node
        endpoint: https://minio.example.com
        credentials:
          accessKey: MYACCESSKEY
          secretKey: MYSECRETKEY

      localDiskCache: /var/lib/xtdb/remote-cache

and I get a very long stack trace without any pointer to what the issue is:

    Starting XTDB 2.x @ nightly @ c37ffef ...
    {:clojure.main/message
     "Execution error (IllegalArgumentException) at com.charleskorn.kaml.YamlPolymorphicInput/throwIfUnknownPolymorphicTypeException (YamlPolymorphicInput.kt:117).\nCan't get known types for descriptor of kind CLASS\n",
     :clojure.main/triage
     {:clojure.error/class java.lang.IllegalArgumentException,
      :clojure.error/line 117,
      :clojure.error/cause
      "Can't get known types for descriptor of kind CLASS",
    ...

I know the issue is related to !Remote, since when I change it to !Local it works fine.

Hey @jr200 - thanks for mentioning that, I’ve opened an issue to make sure we improve the error handling for yaml config: Improve debugging of yaml config · Issue #3739 · xtdb/xtdb · GitHub

Regarding the specific issue though, did you already try adding a few extra quotation marks, e.g.:

    metrics: !Prometheus
      port: 8080

    modules: 
    - !HttpServer
      port: 3000
    - !PgwireServer
      port: 5432

    txLog: !Local
      path: "/var/lib/xtdb/log"

    storage: !Remote
      objectStore: !S3
        bucket: "xtdb"
        prefix: my-xtdb-node
        endpoint: "https://minio.example.com"
        credentials:
          accessKey: "MYACCESSKEY"
          secretKey: "MYSECRETKEY"

      localDiskCache: /var/lib/xtdb/remote-cache

Yes, I tried with and without quotes, commenting each of the fields in and out (which sometimes gave different stack traces) - but in most cases without any reference to what the actual error is. I had a browse of the commit, and I’m guessing it might be to do with how the credentials are deserialised…?