A question about RocksDB data files

Are they cross-platform (like, e.g., SQLite data files are)? That is, can files created by an XTDB instance on one platform (e.g., Windows) be used by an instance running on Linux? I can’t seem to find this topic covered on the RocksDB site (maybe I just missed it), so I figured someone here may know.

Sincerely,

Bob

Hi @bobcalco - thanks for the question! I’ve never verified this extensively before, but the RocksDB data directory and file format(s) should be completely robust & portable across all operating systems where RocksDB is able to run. The on-disk format is intended to be very stable even when using very different binaries.

If you observe any evidence to the contrary we would certainly be keen to hear about it 🙂

Jeremy

I will be testing that explicitly today, as I’m iterating toward a load of data spanning a decade.

The reason I ask is I want to pre-load the data into a store that a dockerized app using XTDB can then refer to as a kind of baseline. I also want to be able to return to that baseline after testing/demonstrations.

So one other related question I have (happy to make it a separate topic, and sorry if it’s already answered somewhere) is how to set up local persistence with Docker - i.e., how to use Docker volumes with XTDB. I will go search for that now, but maybe you have some pointers that can short-circuit my search.

Thanks, Jeremy!

Bob

Hi again, sorry for the delay - mounting a volume for XT to use via Docker could vary depending on how you build your container, but something like `docker ... -v xtdata:/var/lib/xtdb ...` could be enough. Hope that helps!
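For reference, a fuller invocation might look something like this - a minimal sketch only, where the image name and port are placeholders for whatever your own build (or the published image) uses:

```sh
# create (or reuse) a named volume and mount it over XTDB's data directory
docker run -d \
  --name xtdb \
  -p 3000:3000 \
  -v xtdata:/var/lib/xtdb \
  <your-xtdb-image>
```

Because `xtdata` is a named volume, the RocksDB files survive container restarts and image rebuilds - which should also give you your “reset to baseline” workflow, since you can snapshot and restore the volume’s contents while the container is stopped.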

So /var/lib/xtdb is the default location for data using the docker image?

Next question - related: How do I configure it so that XTDB can parallelize batch updates? I have 455 files with roughly 30K records in each; 188 of them are from the farthest-back “valid time”, and the rest are weekly updates, sequentially dated. I assume I can parallelize this? Though if batch input of 30K records is fast enough I may not have to. It is kinda important to preserve sequential transaction time, but it’s more of a nice-to-have for my PoC, if parallelizing populates it faster.

The next question is: how do I handle a query result with possibly millions of records? Some kind of laziness and partitioning seems required here. What I’m likely to do is only ask for the IDs filtered by 2 criteria, and then spit out SQL update files.

Big picture: I’m loading XTDB with all history, then loading a few SQL instances with the current state of relevant, filtered records (each one having its own filtered view of current state).

Any additional advice deeply appreciated! This PoC can lead to a big contract, in which case I will have some ability to hire help for the next phase.

So /var/lib/xtdb is the default location for data using the docker image?

That’s right, yep - as defined here: https://github.com/xtdb/xtdb/blob/f7102e542c735e7db00867a3992abf913e930e34/build/docker/Dockerfile#L7

I assume I can parallelize this? Though if batch input of 30K records is fast enough I may not have to. It is kinda important to preserve sequential transaction time, but it’s more of a nice-to-have for my PoC, if parallelizing populates it faster.

As per my reply on the other thread (“Parallelizing data loading, processing large query results”) - XT doesn’t have an explicit mechanism for parallel import, and sequential transaction time can’t be avoided using the public APIs. That said, if you really wanted to explore advanced custom options then there are some interesting possibilities in theory, e.g. see https://rockset.com/blog/optimizing-bulk-load-in-rocksdb/ - but hopefully the default serial performance is sufficient for now.
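To make the serial approach concrete, here’s a minimal sketch using the XTDB 1.x Clojure API - `parse-records` and the shape of `files` are hypothetical placeholders for however you read your 455 input files, and `node` is assumed to be an already-started node:

```clojure
(require '[xtdb.api :as xt])

;; `files` is assumed to be the input files in valid-time order, e.g.
;; [{:file "week-001.csv" :valid-time #inst "2014-01-06"} ...]
;; `parse-records` (hypothetical) turns one file into a seq of maps,
;; each with an :xt/id.

(defn load-file! [node {:keys [file valid-time]}]
  ;; one transaction per file: ~30K puts, each with an explicit valid time
  (xt/submit-tx node
                (vec (for [doc (parse-records file)]
                       [::xt/put doc valid-time]))))

(defn load-all! [node files]
  (doseq [f files]
    (load-file! node f))
  ;; block until the last submitted transaction has been indexed
  (xt/sync node))
```

Each `submit-tx` becomes one transaction on the tx-log, so transaction time stays sequential in file order for free; if a 30K-document transaction turns out to be too large, you could split each file into smaller batches without changing the overall shape.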

how do I handle a query result with possibly millions of records? Some kind of laziness and partitioning seems required here. What I’m likely to do is only ask for the IDs filtered by 2 criteria, and then spit out SQL update files

See again the link about `open-q` that I shared on the other thread 🙂
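For completeness, the shape of that approach with the XTDB 1.x Clojure API is roughly as follows - the two filter criteria and `emit-sql-update!` are hypothetical stand-ins for your own filters and SQL-file output:

```clojure
(require '[xtdb.api :as xt])

;; open-q returns a closeable cursor over the result set, so the full
;; result never has to be realised in memory at once.
(with-open [cursor (xt/open-q (xt/db node)
                              '{:find  [id]
                                :where [[e :xt/id id]
                                        [e :status :active]    ; hypothetical criterion 1
                                        [e :region :emea]]})]  ; hypothetical criterion 2
  (doseq [ids (partition-all 10000 (map first (iterator-seq cursor)))]
    ;; emit one SQL update file per 10K-id chunk (hypothetical helper)
    (emit-sql-update! ids)))
```

The key point is to keep the cursor open only while you consume it lazily - `partition-all` gives you natural chunks to write out without ever holding millions of IDs at once.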

Big picture: I’m loading XTDB with all history, then loading a few SQL instances with the current state of relevant, filtered records (each one having its own filtered view of current state).

Any additional advice deeply appreciated!

Good to know - I think you’re on the right track, but I will have a think about whether there are any useful existing examples to consider. Let me know if I can help accelerate or unblock your evaluation somehow - I would be very happy to get on a call sometime soon if that’s of interest.