Is there an established way or pattern of going from "hot" to "cold" storage?

I.e., how can we transition data from hot → cold (e.g. Kafka → Azure Blob Store)?

I’ve been thinking about how this could be done, and I do see XTDB’s Kafka Connect module, but it’s not clear to me how to leverage that with Azure Blob Store. Which leads to two questions:

  • Can we generate the “Transaction Log” and “Index” from the “Document Store”?
  • Or can we query solely against the “Document Store”?

The xtdb-kafka module design currently relies on a transparent model of an infinite-retention + single-partition topic, so you couldn’t truncate or migrate messages from that topic without some redesign of the module. I’m aware that mainline tiered storage for Kafka is incoming (and that Confluent has something similar in-house already), which would avoid the need for changes in XT whilst achieving the same outcome; see https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
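
For context, this is the sort of topic that model implies. Here’s a minimal sketch of creating one via Kafka’s AdminClient from Clojure, purely to illustrate the settings involved (single partition, no retention-based expiry); the topic name, bootstrap address and replication factor are placeholders:

```clojure
(import '(org.apache.kafka.clients.admin AdminClient NewTopic))

;; Sketch only: a topic with the shape the xtdb-kafka module relies on,
;; i.e. a single partition and no time- or size-based retention.
(defn create-infinite-retention-topic! [bootstrap-servers topic-name]
  (with-open [admin (AdminClient/create {"bootstrap.servers" bootstrap-servers})]
    (-> (.createTopics admin
                       [(-> (NewTopic. topic-name (int 1) (short 1)) ; 1 partition, RF 1 (placeholder)
                            (.configs {"retention.ms"    "-1"        ; never expire by time
                                       "retention.bytes" "-1"}))])   ; never expire by size
        .all
        (.get))))

;; e.g. (create-infinite-retention-topic! "localhost:9092" "xtdb-tx-log")
```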

Can we generate the “Transaction Log” and “Index” from the “Document Store”?

No, the tx-log is the primary source of truth as it contains all the hashed IDs which are the references to the documents in the doc-store (as well as lists of transaction operations, timestamps etc.).
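
To make that concrete, here is a minimal sketch (using a throwaway in-memory node purely for illustration) of reading the tx-log back through the public API; the entries carry transaction IDs, timestamps and operations, while the documents themselves live in the doc-store, referenced by content hash:

```clojure
(require '[xtdb.api :as xt])

(with-open [node (xt/start-node {})]                 ; in-memory node, illustration only
  (xt/submit-tx node [[::xt/put {:xt/id :hot-doc, :temp "hot"}]])
  (xt/sync node)                                     ; wait until the tx is indexed
  ;; Each tx-log entry includes ::xt/tx-id, ::xt/tx-time and (with the
  ;; `with-ops?` flag set) the transaction operations, resolved against
  ;; the doc-store; internally the log records the documents' content hashes.
  (with-open [log (xt/open-tx-log node nil true)]
    (doall (iterator-seq log))))
```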

Can we query solely against the “Document Store”?

XT doesn’t officially support querying the doc-store directly without first consulting the indexes, so you always have to query via a node with local indexes (using q/entity/pull etc.). There is an internal Clojure protocol / API available for advanced usage, and as a last resort you can enumerate and inspect the contents of the underlying doc-store storage (e.g. a Postgres table or Kafka topic) directly if needed.
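
For example, a minimal sketch of the supported route (again assuming a throwaway in-memory node), where queries and entity lookups go through the node’s local indexes:

```clojure
(require '[xtdb.api :as xt])

(with-open [node (xt/start-node {})]                 ; in-memory node, illustration only
  (xt/submit-tx node [[::xt/put {:xt/id :order-1, :status :open}]])
  (xt/sync node)
  (let [db (xt/db node)]                             ; snapshot backed by the local indexes
    {:query  (xt/q db '{:find  [?e]
                        :where [[?e :status :open]]}) ; Datalog query via the indexes
     :entity (xt/entity db :order-1)}))              ; direct entity lookup
```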

“The xtdb-kafka module design currently relies on a transparent model of an infinite-retention + single-partition topic, so you couldn’t truncate or migrate messages from that topic without some redesign of the module.”

Ah, ok. So that means that, for a Kafka-backed XTDB, the Kafka topic must have infinite retention.
And a topic retention of, say, 1 year means the XTDB data model doesn’t work.

Assuming that’s the case, then yes, “Tiered Storage” sounds like what I’m looking for. On the project I’m working on, the idea is to write to and query against “hot” storage (a Kafka-backed XTDB), with a Kafka connector module syncing data to “cold” storage (a Postgres-backed XTDB); there’s a rough sketch of that split below. And Confluent’s solution sounds like it eliminates “the need for separate data pipelines to copy the data from Kafka to external stores”.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Status
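
For reference, the split I have in mind would look roughly like this. This is a sketch only, with the module symbols and option keys taken from my reading of the XTDB 1.x docs, and hostnames/credentials as placeholders to be checked against the current documentation:

```clojure
;; "Hot" node: tx-log + document store on Kafka (option keys per my reading
;; of the XTDB 1.x Kafka module docs; verify against the current docs).
(def hot-node-config
  {::kafka-config       {:xtdb/module 'xtdb.kafka/->kafka-config
                         :bootstrap-servers "localhost:9092"}
   :xtdb/tx-log         {:xtdb/module 'xtdb.kafka/->tx-log
                         :kafka-config ::kafka-config}
   :xtdb/document-store {:xtdb/module 'xtdb.kafka/->document-store
                         :kafka-config ::kafka-config}})

;; "Cold" node: tx-log + document store on Postgres via the JDBC module (same caveat).
(def cold-node-config
  {:xtdb.jdbc/connection-pool {:dialect {:xtdb/module 'xtdb.jdbc.psql/->dialect}
                               :db-spec {:host     "cold-postgres" ; placeholder
                                         :dbname   "xtdb"
                                         :user     "xtdb"
                                         :password "..."}}
   :xtdb/tx-log               {:xtdb/module 'xtdb.jdbc/->tx-log
                               :connection-pool :xtdb.jdbc/connection-pool}
   :xtdb/document-store       {:xtdb/module 'xtdb.jdbc/->document-store
                               :connection-pool :xtdb.jdbc/connection-pool}})
```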

And ok on those other points. Thank you very much, @refset. This helps a great deal.

Tim
