I hope you’re doing well. I’m quite new to using XTDB, so I apologize if my question appears basic. I’m currently working on a project where I have Kafka set up as the transaction log, S3 as the document store, and RocksDB as the index store.
As part of testing, I attempted to delete the transaction logs in Kafka to observe the impact. I noticed that queries started to fail due to the missing logs. Now, I’m unsure about the best approach to ensure data persistence in case Kafka goes down and comes back up, potentially resulting in log loss.
One thought I had was to set the retention policy of Kafka to be infinite. However, I’m not sure if this is the right solution or if there are better practices to handle such situations. I would greatly appreciate any insights or guidance on how to ensure data persistence and protect against data loss in scenarios like this.
Thank you for your help and understanding. I look forward to learning from your experiences and expertise in this area.
As part of testing, I attempted to delete the transaction logs in Kafka to observe the impact. I noticed that queries started to fail due to the missing logs.
This is expected, but the impact is actually much more severe: you no longer have the ability to ‘replay’ the tx-log (so you wouldn’t be able to trivially upgrade XTDB), and any data that is not also duplicated in the indexes could be lost entirely or made unrecoverable.
One thought I had was to set the retention policy of Kafka to be infinite
Unfortunately, infinite retention is the only valid configuration for XTDB currently, as the system relies entirely on the durability guarantees of the tx-log (i.e. the single Kafka topic & partition).
Does your Kafka platform support strong durability levels or infinite retention already? I know many legacy deployments of Kafka are not optimised for this kind of workload, but certainly newer versions of Kafka and the Confluent stack more generally have made this kind of setup much more practical.
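On the retention point above, here is a rough sketch of how you might sanity-check a topic's settings before trusting it as the tx-log. It relies on standard Kafka topic configs: `retention.ms=-1` and `retention.bytes=-1` disable time-based and size-based deletion, and the stock defaults are 7 days and `-1` respectively. The function name `retention_problems` is made up for illustration, and flagging `cleanup.policy=compact` is my own assumption (compaction discards older records that share a key), not something XTDB checks for you. Your cluster's broker-level defaults may also differ from the stock ones used here.

```python
# Stock Kafka topic-level defaults; your brokers may override these.
KAFKA_TOPIC_DEFAULTS = {
    "retention.ms": "604800000",  # 7 days
    "retention.bytes": "-1",      # no size-based limit
    "cleanup.policy": "delete",
}


def retention_problems(topic_config):
    """Return reasons the topic could drop records (hypothetical helper).

    `topic_config` holds only the explicitly overridden topic settings;
    anything missing falls back to KAFKA_TOPIC_DEFAULTS.
    """
    overrides = {key: str(value) for key, value in topic_config.items()}
    effective = {**KAFKA_TOPIC_DEFAULTS, **overrides}

    problems = []
    if effective["retention.ms"] != "-1":
        problems.append("retention.ms is finite: old segments expire by age")
    if effective["retention.bytes"] != "-1":
        problems.append("retention.bytes is finite: old segments expire by size")

    # Assumption: treat compaction as unsafe for a replayable log.
    policies = {p.strip() for p in effective["cleanup.policy"].split(",")}
    if "compact" in policies:
        problems.append("cleanup.policy includes compact: older records per key are discarded")
    return problems


if __name__ == "__main__":
    # A topic left on defaults is reported as unsafe for a tx-log.
    for problem in retention_problems({}):
        print("default topic:", problem)
    # With retention.ms overridden to -1, no problems are reported.
    print("fixed topic:", retention_problems({"retention.ms": -1}))
```

If the check reports a finite `retention.ms`, the setting can be changed on an existing topic with Kafka's own `kafka-configs.sh` tool (`--alter --entity-type topics --entity-name <your-topic> --add-config retention.ms=-1`), where the topic name is whatever your XTDB node was configured with.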
Thank you for your response. I am using an AWS MSK (Kafka) cluster. Is there a different type of setup I could use, perhaps one not involving Kafka, that would make the data more persistent?
MSK should be sufficient, particularly now that it supports tiered storage. If you have it working already, then I would suggest continuing with that path unless it’s causing more problems than you can justifiably handle.
Postgres is the other popular choice for more modest data volumes, and is likely to be more cost-effective at lower scale if you don’t already have a Kafka cluster in play (because a single XT tx-log is very unlikely to saturate a Kafka cluster by itself).