Parallelizing data loading, processing large query results

Any advice for someone about to load millions of records and then needing to query and process them?

I have about 500 CSV files with up to 30,000 records in each, and once loaded, I need to iterate over a result set whose size will be in the millions, so that I can use that result to populate a SQL database.

What considerations, resource-wise (memory, CPUs), do I need to focus on?

I’m running on a gaming PC with 24 cores / 32 logical processors and 64 GB of RAM. If needed I can use a second machine, a high-end M1 Mac (from the original batch), also with 64 GB of RAM.

Any advice appreciated. TIA!

Bob

Hey @bobcalco - thanks for your question and for sharing some of the details!

Is this a one-time thing? Or something that will need to run as a batch job many times in the future?

Transaction processing is ultimately single-threaded on the JVM side; however, there are worker threads that will help parallelize some of the work. On the RocksDB side (native threads) the parallelism definitely helps with concurrent compaction - the default configurations should be good enough, but there’s an ~endless number of knobs to tune with Rocks if you need the extra performance.
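For reference, a minimal node config using the RocksDB kv-store looks roughly like this (assuming XTDB 1.x with the xtdb-rocksdb module; the directory paths are placeholders) - the defaults here are usually a reasonable starting point before reaching for the Rocks tuning knobs:

```clojure
(require '[clojure.java.io :as io]
         '[xtdb.api :as xt])

;; Sketch: a node backed by RocksDB for the index store, document store
;; and tx-log. The :db-dir paths below are placeholders.
(def node
  (xt/start-node
   {:xtdb/index-store    {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
                                     :db-dir (io/file "/tmp/xtdb/index")}}
    :xtdb/document-store {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
                                     :db-dir (io/file "/tmp/xtdb/docs")}}
    :xtdb/tx-log         {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
                                     :db-dir (io/file "/tmp/xtdb/tx-log")}}}))
```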

Unfortunately you are unlikely to be able to saturate your disk I/O purely by increasing thread counts, but a good number of threads should still deliver incremental improvements.

In terms of things you can easily control - for good bulk loading performance I would advise creating transactions with batches on the order of 1000 docs per tx (definitely not 1 per tx), and also avoiding any kind of match operation or transaction function usage where possible.
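As a rough sketch of what that batching could look like for your CSV files - the row->doc helper, column handling and :xt/id derivation here are hypothetical, so adapt them to your schema:

```clojure
(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io]
         '[xtdb.api :as xt])

;; Hypothetical helper: turn one CSV row into an XTDB document.
;; Keywordised column names and a random :xt/id are placeholders.
(defn row->doc [header row]
  (-> (zipmap (map keyword header) row)
      (assoc :xt/id (java.util.UUID/randomUUID))))

(defn load-csv! [node csv-file]
  (with-open [rdr (io/reader csv-file)]
    (let [[header & rows] (csv/read-csv rdr)]
      (doseq [batch (partition-all 1000 rows)]          ;; ~1000 docs per tx
        (xt/submit-tx node
                      (mapv (fn [row] [::xt/put (row->doc header row)])
                            batch))))))

;; usage: call (load-csv! node "part-0001.csv") for each file,
;; then (xt/sync node) to wait for indexing to catch up.
```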

When querying data back out, memory is almost always the biggest bottleneck (i.e. unless you’re doing something compute-intensive within the query), followed by disk I/O throughput/latency - although if the caches are cold then disk will dominate. You should also be able to stream results using open-q (at the cost of duplication and not being able to use :order-by), which would decrease the end-to-end running time and limit memory usage.
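To illustrate the open-q point, a streaming consumer looks roughly like this (the query and attribute names are placeholders):

```clojure
(require '[xtdb.api :as xt])

;; Stream results lazily instead of realising the whole result set in memory.
;; The cursor must be closed, hence with-open.
(with-open [cursor (xt/open-q (xt/db node)
                              '{:find [e v]
                                :where [[e :some/attribute v]]})]
  (doseq [[e v] (iterator-seq cursor)]
    ;; e.g. push each row into the SQL database here
    (println e v)))
```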

Individual queries are single-threaded, but you can of course issue multiple queries from multiple threads against a consistent db basis.
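For instance (with hypothetical queries), pinning a single db value and fanning the queries out over futures keeps everything on the same basis:

```clojure
(require '[xtdb.api :as xt])

;; Pin one db snapshot so every parallel query sees the same basis.
(let [db      (xt/db node)
      ;; Hypothetical partitioned queries - replace with your own.
      queries ['{:find [e] :where [[e :entity/type :customer]]}
               '{:find [e] :where [[e :entity/type :order]]}]
      results (->> queries
                   (mapv #(future (xt/q db %)))   ;; one query per thread
                   (mapv deref))]
  (mapv count results))
```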

Hope that helps as a starting point,

Jeremy

The PoC is a one-off, but the learning from it will inform how our product, which builds on XTDB for at least one of its storage use cases, will handle ingestion of enterprise data later. I hope to make this as repeatable a process as possible over time, regardless of the initial data input size.

How quickly would 1000 records per tx be expected to execute, all other things being equal?

Also: Please explain further what you mean by "at the cost of duplication and not being able to use :order-by" when using open-q?

Bob

It will depend greatly on the size of the documents and the number of top-level keys, but for 1000 small (single-digit kB) documents on modern hardware (and with low network overheads) it should safely be <1s end-to-end.

Please explain further what you mean by "at the cost of duplication and not being able to use :order-by" when using open-q?

Apologies - I should have just linked this in the first place:

https://docs.xtdb.com/language-reference/datalog-queries/#streaming-queries

Jeremy