Client-side compression, cassandra or both?
Hi there,

We're working on a project which is going to store a lot of JSON objects in Cassandra. A large part of each object (90%) consists of an array of integers, which in many cases contains long runs of zeroes. The average JSON object is 4 KB in size and, once gzipped (default compression level), just under 100 bytes.

My question is: should we compress client-side (literally converting the JSON string to gzip-compressed bytes), let Cassandra do the work, or do both?

From my point of view, Cassandra would be the better choice, as it could compress beyond a single value, using large blocks within a row / SSTable.

Thank you in advance for your help.

Best regards,
Robin Verlangen
Chief Data Architect
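To make the numbers in the question concrete, here is a minimal sketch of the client-side option: serialize a JSON object to a string and gzip it at the default level. The object layout (`id`, `values`), the array length, and the share of non-zero entries are assumptions made up for illustration; the real objects are not shown in the thread, so the sizes printed here will differ from the ones quoted above.

```python
import gzip
import json
import random


def make_sample(n_values=2000, nonzero_fraction=0.02, seed=42):
    """Build a hypothetical ~4 KB JSON object dominated by zeroes."""
    rng = random.Random(seed)
    values = [
        rng.randint(1, 999) if rng.random() < nonzero_fraction else 0
        for _ in range(n_values)
    ]
    return {"id": "example", "values": values}


def gzip_json(obj):
    """Return (raw JSON bytes, gzip-compressed bytes) for obj."""
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return raw, gzip.compress(raw)  # default compression level


def gunzip_json(blob):
    """Inverse of gzip_json: decompress and parse."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


raw, packed = gzip_json(make_sample())
print("raw JSON bytes:", len(raw))
print("gzipped bytes: ", len(packed))
```

The compressed bytes would then be stored in a blob column, and every reader has to call the equivalent of `gunzip_json` before it can use the value.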
Re: Client-side compression, cassandra or both?
Hello Robin,

You have several options for compression in C*:

1) Serialize to bytes instead of JSON, which saves a lot of space compared with string encoding. Of course, the data will then be opaque and not human-readable.

2) Activate client-to-node data compression. In this case, do not forget to ship the LZ4 or Snappy dependency on the client side.

On the server side, data compression is active by default, using LZ4, when you create a new table, so there is pretty much nothing to do.

It's up to you to decide whether, given the difference in compression ratio between Gzip and LZ4, relying on C* compression is worthwhile.

Regards
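As an illustration of option 1, the sketch below packs the integer array into a compact binary form and compares it with the JSON text. The encoding is an assumption chosen for this example (unsigned LEB128-style varints, non-negative integers only); the reply does not name a format. A fixed-width layout such as 4 bytes per integer could actually be larger than the JSON text for an array of mostly zeroes, because `0,` is only 2 characters, which is why a variable-length encoding is used here.

```python
import json


def encode_varints(values):
    """Encode non-negative integers as unsigned LEB128-style varints."""
    out = bytearray()
    for v in values:
        if v < 0:
            raise ValueError("this sketch only handles non-negative integers")
        while True:
            byte = v & 0x7F          # low 7 bits of the remaining value
            v >>= 7
            if v:
                out.append(byte | 0x80)  # high bit set: more bytes follow
            else:
                out.append(byte)         # high bit clear: last byte
                break
    return bytes(out)


def decode_varints(blob):
    """Inverse of encode_varints."""
    values, current, shift = [], 0, 0
    for byte in blob:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


sample = [0] * 50 + [1, 127, 128, 300, 16384] + [0] * 50
as_json = json.dumps(sample, separators=(",", ":")).encode("utf-8")
as_binary = encode_varints(sample)
print("JSON bytes:  ", len(as_json))
print("binary bytes:", len(as_binary))
```

Each zero costs one byte here instead of two in JSON, and the result can still be compressed further by the table's compressor.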
Re: Client-side compression, cassandra or both?
I wouldn't do both. Unless a little server CPU or disk space is an issue, I wouldn't bother compressing the data yourself. You'd have to measure it, but I imagine the difference is probably not significant: as you say, C* has more context, and hopefully most compressors can handle a repeated "0," well. Compression across the wire is good, of course (client-side CPU is a wash, and the server CPU cost is the one already mentioned).

On a side note, perhaps your object model should address the redundancy, though that may be about as complex as doing client-side compression; I don't know.

We do have one table where we keep compressed blobs, but that is because those are natural from an application perspective, so we just turn off C* table compression for that table (there isn't much other data there).

Note: I haven't been tracking it recently, but certainly in the past the compression code path in C* had to do more data copies. This is not likely to be significant unless your case is special. I believe this has been or will be improved in 2.1 or 3.
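Since the advice above is to measure, here is a rough sketch of how the three setups could be compared offline. It is only a stand-in: `zlib` plays the role of both the client-side compressor and the table's chunk compressor (the thread says tables default to LZ4, which is not in the Python standard library), one concatenated buffer stands in for a compression chunk, and the JSON values are made up. Real ratios will differ, so treat the output as a shape to check against your own data.

```python
import json
import zlib


def make_value(i, n=2000):
    """Build one hypothetical JSON value: n integers, all zero except one."""
    numbers = [0] * n
    numbers[(i * 37) % n] = i + 1
    obj = {"id": i, "values": numbers}
    return json.dumps(obj, separators=(",", ":")).encode("utf-8")


values = [make_value(i) for i in range(16)]
raw_total = sum(len(v) for v in values)

# Client-side only: every value is compressed on its own.
per_value = [zlib.compress(v) for v in values]
per_value_total = sum(len(b) for b in per_value)

# Server-side only: raw values share one chunk, so the compressor
# can reuse what it has seen in earlier values.
chunk_of_raw = zlib.compress(b"".join(values))

# Both: the chunk compressor is handed values that are already compressed.
chunk_of_packed = zlib.compress(b"".join(per_value))

print("raw bytes:                ", raw_total)
print("client-side only (sum):   ", per_value_total)
print("chunk compression only:   ", len(chunk_of_raw))
print("both (chunk of compressed):", len(chunk_of_packed))
```

Comparing the last three numbers on a sample of real values gives a first estimate of whether client-side compression, table compression, or the combination is worth its CPU cost.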