Will large XML blobs get compressed?

refset · 1 February 2023 20:58

The complete question from @arichiardi on the Clojurians Slack:

Is there any recommendation for storing a large (XML) payload in XTDB? Do I need extra compression or nippy will take care of it for me?
I don’t need to query it, it is basically just a blob to hydrate and store as is.

First I’ll answer the hard bit:

Is there any recommendation for storing a large (XML) payload in XTDB?

If you find it convenient and the storage volumes aren’t too extreme (i.e. both the size of an individual document and the database in aggregate are sufficiently small), then XT can definitely be used to store arbitrary blobs usefully and I wouldn’t immediately rule it out. It can be particularly handy when you are prototyping or otherwise trying to minimise the number of systems in your architecture.

However, it is important to understand that at non-trivial scales this kind of usage may not be cost effective compared to using raw blob storage (or other KV storage) and merely holding a pointer (e.g. URL) to the blob in XT. This is partly because XT currently does not employ any structural sharing (i.e. cross-document compression) in the document store, such that small changes to the same large blob values over time will cause a lot of nearly duplicate data to be written into the document store. This duplication could be relatively $expensive if the document store is, for example, backed by Postgres and not S3.
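To make the "pointer" approach concrete, here is a minimal, language-agnostic sketch in Python. Everything in it is an assumption for illustration: `blob_store` is a plain dict standing in for S3 or another KV store, and `put_blob`, `make_pointer_doc` and the `blob-key` / `blob-size` attribute names are hypothetical, not part of any XT API. The blob is stored under its SHA-256 digest, and only a small document holding that key would be submitted to XT.

```python
import hashlib

def put_blob(blob_store: dict, payload: bytes) -> str:
    """Store payload under its SHA-256 hex digest and return that key."""
    key = hashlib.sha256(payload).hexdigest()
    blob_store[key] = payload  # writing identical bytes again is a no-op
    return key

def make_pointer_doc(doc_id: str, blob_key: str, size: int) -> dict:
    """Build the small document that would live in XT instead of the blob.

    The attribute names are hypothetical; use whatever suits your model.
    """
    return {"xt/id": doc_id, "blob-key": blob_key, "blob-size": size}

# A dict stands in for external blob storage (assumption for this sketch).
blob_store = {}
xml = b"<order><id>42</id></order>" * 1000

key = put_blob(blob_store, xml)
doc = make_pointer_doc("order-42", key, len(xml))

# Hydrating later: read the pointer document, then fetch the blob by key.
hydrated = blob_store[doc["blob-key"]]
```

Note that content addressing like this only de-duplicates byte-identical blobs. A slightly edited version of the same XML still produces a new key and a full new copy, so it reduces the cost per copy (by moving it to cheaper storage) rather than removing the near-duplicate problem.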

Perhaps more significantly though, XT’s index-store (e.g. backed by RocksDB as the KV store) is monolithic (all data is local), and storing blob data will bloat its size more quickly. When storing blobs you will therefore want to feel confident that the KV store grows slowly enough that re-provisioning (i.e. upgrading storage capacity in a year’s time) does not become an ongoing concern. The disk used for this KV store may also be $expensive compared to generic object storage, but then it likely also has lower latency and avoids transfer costs…so there are plenty of tradeoffs to consider!

a large (XML) payload […] Do I need extra compression or nippy will take care of it for me?

Whether you store XML as a string or as some opaque binary blob, all that Nippy really does within XT is manage the most essential level of encoding (i.e. handling headers for known types + serialization). XT is not configured to use any of Nippy’s compression options. Instead XT relies entirely on the native compression facilities (if any) available in the underlying KV store (e.g. RocksDB compression) and document store implementations (e.g. Postgres TOAST). Also note that XT does not control the configuration options for these external systems, so activating some of the compression facilities may require extra configuration.
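If you would rather not depend on the storage layer's settings, one option is to compress the payload yourself before it goes into the document. The following is a minimal sketch using Python's standard `gzip` module; `compress_xml` and `decompress_xml` are hypothetical helper names, and the sample XML is made up. The resulting bytes would be stored as the opaque binary blob mentioned above.

```python
import gzip

def compress_xml(xml_text: str) -> bytes:
    """Encode the XML as UTF-8 and gzip it into an opaque byte blob."""
    return gzip.compress(xml_text.encode("utf-8"))

def decompress_xml(blob: bytes) -> str:
    """Reverse compress_xml: gunzip the blob and decode it as UTF-8."""
    return gzip.decompress(blob).decode("utf-8")

# Made-up, highly repetitive XML, which is typical of the format.
xml_text = "<items>" + "<item sku='A1' qty='2'/>" * 5000 + "</items>"

blob = compress_xml(xml_text)       # store this value in the document
restored = decompress_xml(blob)     # hydrate it again after reading

original_size = len(xml_text.encode("utf-8"))
compressed_size = len(blob)
```

Bear in mind that if the KV store or document store also compresses, it will gain little from data that is already gzipped, so it is worth measuring both approaches rather than assuming they stack.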

I hope that helps!

arichiardi · 1 February 2023 21:31

Thank you Jeremy, what you are saying totally makes sense and I am definitely considering using Postgres’ facilities for this blob.

I have also listened to what you mentioned about storing metadata vs. actual data in the first video Meetup, which helped as well.
