We are running Postgres as the document and transaction (tx-log) store, and, on separate instances, an XTDB/RocksDB node embedded within the backend process. The backend disk (well, the whole node) is ephemeral. At this point we have some tens of thousands, approaching 100k, documents in the stores, and startups (to the point where the XTDB node is in sync with the latest transactions) are taking about three minutes, with what looks like a roughly linear increase with the amount of data we have (though we haven’t really plotted this, just looked at the times after bulk-inserting new datasets).
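For reference, the topology is roughly what’s sketched below (a minimal XTDB 1.x config sketch; the module names are as I understand them from the JDBC and RocksDB docs, and the hosts, credentials and paths are placeholders):

```clojure
;; Sketch: Postgres (via xtdb-jdbc) as tx-log + document store,
;; RocksDB as the local index store on the ephemeral backend disk.
{:xtdb.jdbc/connection-pool {:dialect {:xtdb/module 'xtdb.jdbc.psql/->dialect}
                             :db-spec {:host     "postgres.internal" ; placeholder
                                       :dbname   "xtdb"
                                       :user     "xtdb"
                                       :password "..."}}
 :xtdb/tx-log {:xtdb/module 'xtdb.jdbc/->tx-log
               :connection-pool :xtdb.jdbc/connection-pool}
 :xtdb/document-store {:xtdb/module 'xtdb.jdbc/->document-store
                       :connection-pool :xtdb.jdbc/connection-pool}
 :xtdb/index-store {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
                               :db-dir "/data/xtdb/rocksdb"}}} ; ephemeral disk
```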
We are expecting production datasets to be thousands, if not tens of thousands of times larger than this. At this point it’s clear that we have to do checkpointing sooner rather than later, since it doesn’t seem like the startup time can be improved by allocating more CPU cores or RAM to the node. Postgres doesn’t seem to be loaded at all, with CPU and RAM usage below 10%, and the outbound network traffic while a node is starting peaks at only about 100 kB/s. So it would seem that the bottleneck is something single-threaded on the XTDB node.
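Concretely, what I have in mind is XTDB’s index-store checkpointing, roughly along the lines below (a sketch based on my reading of the checkpointing docs; the filesystem checkpoint store, path and frequency are placeholders, and in practice we’d presumably point it at S3 or similar):

```clojure
(import '(java.time Duration))

;; Index store with a checkpointer attached: the node periodically uploads
;; RocksDB checkpoints, and a fresh node restores the latest one on startup
;; instead of replaying the whole tx-log from Postgres.
{:xtdb/index-store
 {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
             :db-dir "/data/xtdb/rocksdb"
             :checkpointer {:xtdb/module 'xtdb.checkpoint/->checkpointer
                            :store {:xtdb/module 'xtdb.checkpoint/->filesystem-checkpoint-store
                                    :path "/mnt/shared/xtdb-checkpoints"} ; placeholder path
                            :approx-frequency (Duration/ofHours 6)}}}}    ; placeholder frequency
```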
That said, I would be interested in hearing what kind of startup times others are seeing, with how many documents and what kind of setup. For example, does using Kafka make a marked difference in speed, and how large a dataset can checkpoints handle? Based on Slack discussions, even checkpoints are not trivial; for example, S3 fetches can time out when not using a proper S3 client.