Is there an established way or pattern of going from "hot" to "cold" storage?

I.e., how can we transition data from hot → cold (e.g. Kafka → Azure Blob Store)?

I’ve been thinking about how this could be done, and I do see XTDB’s Kafka Connect module, but it’s not clear to me how to leverage that with Azure Blob Store. Which leads to two questions:

  • Can we generate the “Transaction Log” and “Index” from the “Document Store”?
  • Or can we query solely against the “Document Store”?

The xtdb-kafka module design currently relies on a transparent model of an infinite-retention + single-partition topic, so you couldn’t truncate or migrate messages from that topic without some redesign of the module. I’m aware that mainline tiered storage for Kafka is incoming (and that Confluent has something similar in-house already), which would avoid the need for changes in XT whilst achieving the same outcome; see https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
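
For context, this is the sort of topic that model implies. Here’s a minimal sketch of creating one via Kafka’s AdminClient from Clojure, purely to illustrate the settings involved (single partition, no retention-based expiry); the topic name, bootstrap address and replication factor are placeholders:

```clojure
(import '(org.apache.kafka.clients.admin AdminClient NewTopic))

;; Sketch only: a topic with the shape the xtdb-kafka module relies on,
;; i.e. a single partition and no time- or size-based retention.
(defn create-infinite-retention-topic! [bootstrap-servers topic-name]
  (with-open [admin (AdminClient/create {"bootstrap.servers" bootstrap-servers})]
    (-> (.createTopics admin
                       [(-> (NewTopic. topic-name (int 1) (short 1)) ; 1 partition, RF 1 (placeholder)
                            (.configs {"retention.ms"    "-1"        ; never expire by time
                                       "retention.bytes" "-1"}))])   ; never expire by size
        .all
        (.get))))

;; e.g. (create-infinite-retention-topic! "localhost:9092" "xtdb-tx-log")
```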

Can we generate the “Transaction Log” and “Index” from the “Document Store”?

No, the tx-log is the primary source of truth as it contains all the hashed IDs which are the references to the documents in the doc-store (as well as lists of transaction operations, timestamps etc.).
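
To make that concrete, here is a minimal sketch (using a throwaway in-memory node purely for illustration) of reading the tx-log back through the public API; the entries carry transaction IDs, timestamps and operations, while the documents themselves live in the doc-store, referenced by content hash:

```clojure
(require '[xtdb.api :as xt])

(with-open [node (xt/start-node {})]                 ; in-memory node, illustration only
  (xt/submit-tx node [[::xt/put {:xt/id :hot-doc, :temp "hot"}]])
  (xt/sync node)                                     ; wait until the tx is indexed
  ;; Each tx-log entry includes ::xt/tx-id, ::xt/tx-time and (with the
  ;; `with-ops?` flag set) the transaction operations, resolved against
  ;; the doc-store; internally the log records the documents' content hashes.
  (with-open [log (xt/open-tx-log node nil true)]
    (doall (iterator-seq log))))
```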

Can we query solely against the “Document Store”?

XT doesn’t officially support querying the doc-store directly without first consulting the indexes, so you always have to query via a node with local indexes (using q/entity/pull etc.). There is an internal Clojure protocol / API available for advanced usage, and as a last resort you can enumerate and inspect the contents of the underlying doc-store storage (e.g. a Postgres table or Kafka topic) directly if needed.
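
For example, a minimal sketch of the supported route (again assuming a throwaway in-memory node), where queries and entity lookups go through the node’s local indexes:

```clojure
(require '[xtdb.api :as xt])

(with-open [node (xt/start-node {})]                 ; in-memory node, illustration only
  (xt/submit-tx node [[::xt/put {:xt/id :order-1, :status :open}]])
  (xt/sync node)
  (let [db (xt/db node)]                             ; snapshot backed by the local indexes
    {:query  (xt/q db '{:find  [?e]
                        :where [[?e :status :open]]}) ; Datalog query via the indexes
     :entity (xt/entity db :order-1)}))              ; direct entity lookup
```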

“The xtdb-kafka module design currently relies on a transparent model of an infinite-retention + single-partition topic, so you couldn’t truncate or migrate messages from that topic without some redesign of the module.”

Ah, ok. So that means that, for a Kafka-backed XTDB, the Kafka topic must have infinite retention.
And a topic retention of, say, 1 year means the XTDB data model doesn’t work.

Assuming that’s the case, then yes, “Tiered Storage” sounds like what I’m looking for. On the project I’m working on, the idea is to write to and query against “hot” storage (a Kafka-backed XTDB), with a Kafka connector module syncing data to “cold” storage (a Postgres-backed XTDB); there’s a rough sketch of that split below. And Confluent’s solution sounds like it eliminates “the need for separate data pipelines to copy the data from Kafka to external stores”.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Status
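
For reference, the split I have in mind would look roughly like this. This is a sketch only, with the module symbols and option keys taken from my reading of the XTDB 1.x docs, and hostnames/credentials as placeholders to be checked against the current documentation:

```clojure
;; "Hot" node: tx-log + document store on Kafka (option keys per my reading
;; of the XTDB 1.x Kafka module docs; verify against the current docs).
(def hot-node-config
  {::kafka-config       {:xtdb/module 'xtdb.kafka/->kafka-config
                         :bootstrap-servers "localhost:9092"}
   :xtdb/tx-log         {:xtdb/module 'xtdb.kafka/->tx-log
                         :kafka-config ::kafka-config}
   :xtdb/document-store {:xtdb/module 'xtdb.kafka/->document-store
                         :kafka-config ::kafka-config}})

;; "Cold" node: tx-log + document store on Postgres via the JDBC module (same caveat).
(def cold-node-config
  {:xtdb.jdbc/connection-pool {:dialect {:xtdb/module 'xtdb.jdbc.psql/->dialect}
                               :db-spec {:host     "cold-postgres" ; placeholder
                                         :dbname   "xtdb"
                                         :user     "xtdb"
                                         :password "..."}}
   :xtdb/tx-log               {:xtdb/module 'xtdb.jdbc/->tx-log
                               :connection-pool :xtdb.jdbc/connection-pool}
   :xtdb/document-store       {:xtdb/module 'xtdb.jdbc/->document-store
                               :connection-pool :xtdb.jdbc/connection-pool}})
```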

And ok on those other points. Thank you very much, @refset. This helps a great deal.

Tim
