Client-side compression, cassandra or both?

2014-11-03 Thread Robin Verlangen
Hi there,

We're working on a project which is going to store a lot of JSON objects in
Cassandra. A large part of this (90%) consists of an array of integers, which
in many cases contains a lot of zeroes.

The average JSON document is 4KB in size and, once gzipped (default
compression), just under 100 bytes.

My question is, should we compress client-side (literally converting JSON
string to compressed gzip bytes), let Cassandra do the work, or do both?

From my point of view I think Cassandra would be better, as it could
compress beyond a single value, using large blocks within a row / SSTable.
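
To get a rough feel for that difference, here is a minimal Python sketch. It
is an illustration only: zlib stands in for Cassandra's LZ4 compressor,
`make_sample` and the sample shape (1000 integers, 20 non-zero) are
hypothetical, and concatenating 16 documents only approximates how values end
up side by side in a compressed SSTable chunk.

```python
import json
import random
import zlib


def make_sample(rng, length=1000, nonzero=20):
    """Build a hypothetical JSON document: an integer array that is mostly zeroes."""
    values = [0] * length
    for _ in range(nonzero):
        values[rng.randrange(length)] = rng.randrange(1, 1000)
    return json.dumps({"values": values}).encode("utf-8")


rng = random.Random(42)
docs = [make_sample(rng) for _ in range(16)]

# Client-side style: every value is compressed on its own.
per_value = sum(len(zlib.compress(doc)) for doc in docs)

# Block style: the values are concatenated and compressed together,
# so the compressor can reuse patterns across neighbouring values.
block = len(zlib.compress(b"".join(docs)))

raw = sum(len(doc) for doc in docs)
print(f"raw: {raw} B, per-value: {per_value} B, one block: {block} B")
```

The absolute numbers depend on the compressor and on the real data, so this is
something to measure on a sample of production documents.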

Thank you in advance for your help.

Best regards,

Robin Verlangen
*Chief Data Architect*

W http://www.robinverlangen.nl
E ro...@us2.nl

http://goo.gl/Lt7BC
*What is CloudPelican? http://goo.gl/HkB3D*

Disclaimer: The information contained in this message and attachments is
intended solely for the attention and use of the named addressee and may be
confidential. If you are not the intended recipient, you are reminded that
the information remains the property of the sender. You must not use,
disclose, distribute, copy, print or rely on this e-mail. If you have
received this message in error, please contact the sender immediately and
irrevocably delete this message and any copies.


Re: Client-side compression, cassandra or both?

2014-11-03 Thread DuyHai Doan
Hello Robin

 You have many options for compression in C*:

1) Serialize the data to bytes instead of JSON, which saves a lot of space
compared to String encoding. Of course, the data will be opaque and not
human-readable.

2) Activate client-node data compression. In this case, do not forget to
ship the LZ4 or Snappy dependency on the client side.
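
As a minimal sketch of option 1, assuming the payload is just the integer
array: the Python below encodes the integers as zigzag varints (one byte for
small values) instead of JSON text. The function names `encode_varints` and
`decode_varints` are hypothetical, and this is only one of many possible
binary layouts.

```python
import json


def encode_varints(values):
    """Encode ints as zigzag + LEB128 varints: small magnitudes take one byte."""
    out = bytearray()
    for v in values:
        # Zigzag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... so the sign is cheap.
        z = (v << 1) if v >= 0 else ((-v << 1) - 1)
        while z >= 0x80:
            out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            z >>= 7
        out.append(z)
    return bytes(out)


def decode_varints(blob):
    """Inverse of encode_varints."""
    values, z, shift = [], 0, 0
    for byte in blob:
        z |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes follow for this integer
        else:
            values.append((z >> 1) if z % 2 == 0 else -((z + 1) >> 1))
            z, shift = 0, 0
    return values


values = [0, 0, 0, 5, 0, 0, 300, 0, -7, 0]
as_json = json.dumps(values).encode("utf-8")
as_bytes = encode_varints(values)

assert decode_varints(as_bytes) == values
print(f"JSON: {len(as_json)} B, varint bytes: {len(as_bytes)} B")
```

A fixed-width layout (4 bytes per integer) would not help here, since a zero
costs only 3 bytes as `0, ` in JSON; a variable-length encoding is what makes
the binary form smaller for mostly-zero data.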

On the server side, data compression with LZ4 is active by default when
you create a new table, so there is pretty much nothing to do.

 It's up to you to decide whether relying on C* compression is worth it,
given the difference in compression ratio between Gzip and LZ4.


Regards




Re: Client-side compression, cassandra or both?

2014-11-03 Thread graham sanderson
I wouldn’t do both.
Unless a little server CPU or disk space is an issue, I wouldn’t bother to
compress yourself. You’d have to measure the disk space difference, but I
imagine it is probably not significant: as you say, C* has more context, and
hopefully most compressors handle a repeated “0, ” well. Compression across
the wire is good of course (client-side CPU is a wash, and server CPU we
already mentioned anyway).
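
A quick way to check the "do both" case is to compress the same payload twice.
This is a minimal Python sketch with hypothetical sample data (1000 integers,
about 30 non-zero), and gzip from the standard library stands in for both the
client-side and the server-side compressor. The second pass has little
redundancy left to remove and mostly adds header overhead.

```python
import gzip
import json
import random

rng = random.Random(7)
values = [0] * 1000
for _ in range(30):
    values[rng.randrange(1000)] = rng.randrange(1, 10000)

raw = json.dumps(values).encode("utf-8")
once = gzip.compress(raw)
twice = gzip.compress(once)  # "do both": compress the already-compressed bytes

# Both layers must be undone on read, which also costs CPU.
assert gzip.decompress(gzip.decompress(twice)) == raw
print(f"raw={len(raw)} B, once={len(once)} B, twice={len(twice)} B")
```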

On a side note, perhaps your object model should address the redundancy, though
of course this may be as complex as doing client-side compression; I don’t
know.
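
One way to address the redundancy in the object model, sketched in Python
under the assumption that the zeroes carry no information beyond their
position: store only the non-zero entries plus the array length. The field
names `length` and `nonzero` and the helpers `to_sparse` and `from_sparse`
are hypothetical.

```python
import json


def to_sparse(values):
    """Keep only the non-zero entries, plus the length needed to rebuild the array."""
    return {
        "length": len(values),
        "nonzero": [[i, v] for i, v in enumerate(values) if v != 0],
    }


def from_sparse(doc):
    """Rebuild the dense array from the sparse form."""
    values = [0] * doc["length"]
    for i, v in doc["nonzero"]:
        values[i] = v
    return values


dense = [0] * 1000
dense[3], dense[250], dense[999] = 17, 4, 123

sparse = to_sparse(dense)
# Round-trip through JSON to confirm nothing is lost.
assert from_sparse(json.loads(json.dumps(sparse))) == dense
print(len(json.dumps(dense)), len(json.dumps(sparse)))
```

This only pays off while most entries are zero; for dense arrays the index
overhead makes the sparse form larger.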

We do have one table where we keep compressed blobs, but that is because those 
are natural from an application perspective, and so we just turn off C* table 
compression for those (there isn’t much other data there).

Note, I haven’t been tracking it recently, but certainly in the past the
compression code path in C* had to do more data copies. That is not likely to
be significant unless your case is special. I believe this has been or will be
improved in 2.1 or 3.



