[protobuf] Re: Serialized Object as a value in Key-Value store.
In my experience, not so much, I'm afraid. Since the PB format is length-prefixed, or actually tag+length prefixed, the compression ratios are not as high as one might expect. I migrated from an in-house XML format to a PB-based format; the performance gain was amazing (both serialize and deserialize), however the resulting BLOB had a lousy compression ratio compared to the XML format. I'm not absolutely sure about the numbers, but I think PB compressed to 50% of its original size, while XML compressed to 16% of its original size; all in all, the compressed PB weighed roughly twice as much as the compressed XML. I had to roll back the change. Fortunately, I came up with a different way to use PB which increased performance while maintaining smaller (compressed) BLOBs.

My understanding of this behavior is that the tag+size prefix on each string or message field simply reduces the compressor's ability to find repeating regions. Nesting messages one inside another (like the XML approach) makes things even worse, as it introduces more 'prefixes' into the BLOB, which further reduces the chance of finding repetitions (finding repetitions being one of the two major techniques compressors rely on). My solution was to arrange the message so that all strings are grouped together in one repeated field (plus some other nasty tricks).

BTW, I am a great fan of PB, and I am pushing its usage in my company, but when it's used as a serialization format (rather than a message-passing protocol) you sometimes have to bend things a bit...

Eyal.

--
You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
To view this discussion on the web visit https://groups.google.com/d/msg/protobuf/-/Yp_mSexivqIJ.
To post to this group, send email to protobuf@googlegroups.com.
To unsubscribe from this group, send email to protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/protobuf?hl=en.
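[Editor's note: the tag+length overhead Eyal describes can be seen by hand-encoding a single field per the documented protobuf wire format (tag varint = field_number << 3 | wire_type; wire type 2 is length-delimited). The helper names below are mine, not from any protobuf library; this is only a sketch of the encoding:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)  # more bytes follow
        else:
            out.append(b)
            return bytes(out)

def encode_string_field(field_number: int, value: bytes) -> bytes:
    """Encode a length-delimited (wire type 2) field: tag varint, length varint, payload."""
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(value)) + value

# Every string field carries at least two prefix bytes (tag + length),
# and every nested message adds its own - these bytes are what sit
# between the compressible string regions.
print(encode_string_field(1, b"hi").hex())  # -> "0a026869"
```

For small field numbers and short strings the overhead is only two bytes per field, but those bytes are interleaved with the string data, which is Eyal's point about broken repetitions.]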
[protobuf] Re: Serialized Object as a value in Key-Value store.
Protocol buffers does nothing to reduce the size of strings, and string data tends to compress very well. In my experience you can expect gains similar to what you would get with a simple text file (very good).

On Sep 22, 3:05 pm, Suraj wrote:
> Hi Marc,
>
> Thanks for the reply.
>
> Ours is basically text data. The record size varies, in the range from around 300 bytes to 3 kB, but most records are more than 1 kB. I serialized one record from production and found that the 3 kB object, when serialized, became a 728-byte string. Our data is structured and kind of hierarchical. CPU is not a limitation in our case either, so I was thinking that if compression further reduces the data size, it will help us save on SSD cost.
>
> Our proto is something similar to:
>
>     message test {
>         optional string key = 1;            // [16 bytes]
>         message inner1 {
>             optional int32 id = 1;
>             message inner2 {
>                 optional int32 inner_id = 1;
>                 optional string entry = 2;  // [size varies a lot, from 16 to 100 bytes]
>             }
>             repeated inner2 rec = 2;
>         }
>         repeated inner1 in = 2;
>     }
>
> But this will change over time, and we plan to push a lot more information (hence changes to the proto) to this store once the basic version is rolled out.
>
> Thanks and Regards,
> Suraj Narkhede
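[Editor's note: the try-compress-and-fall-back trick described in this thread can be sketched in a few lines. The thread's context is .NET's GZipStream; Python's gzip module is the stand-in here, and `maybe_compress` and the `MIN_SIZE` threshold are made-up names. One simplification: the original approach aborts the compression stream the moment it exceeds the source size, whereas this sketch compresses fully and then compares:

```python
import gzip

MIN_SIZE = 64  # hypothetical lower bound below which compression is never attempted

def maybe_compress(payload: bytes) -> tuple[bool, bytes]:
    """Return (compressed?, data): gzip the payload, but fall back to the
    original bytes if it is tiny or if gzip does not actually shrink it."""
    if len(payload) < MIN_SIZE:
        return False, payload
    packed = gzip.compress(payload)
    if len(packed) >= len(payload):
        return False, payload  # compression made it bigger: keep the original
    return True, packed

# Highly repetitive data compresses; short data is passed through untouched.
flag, data = maybe_compress(b"spam and eggs " * 200)
assert flag and len(data) < 200 * len(b"spam and eggs ")
flag, data = maybe_compress(b"tiny")
assert not flag and data == b"tiny"
```

The one-bit flag (compressed or not) would need to be stored alongside the value so the reader knows whether to decompress.]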
[protobuf] Re: Serialized Object as a value in Key-Value store.
Hi Marc,

Thanks for the reply.

Ours is basically text data. The record size varies, in the range from around 300 bytes to 3 kB, but most records are more than 1 kB. I serialized one record from production and found that the 3 kB object, when serialized, became a 728-byte string. Our data is structured and kind of hierarchical. CPU is not a limitation in our case either, so I was thinking that if compression further reduces the data size, it will help us save on SSD cost.

Our proto is something similar to:

    message test {
        optional string key = 1;            // [16 bytes]
        message inner1 {
            optional int32 id = 1;
            message inner2 {
                optional int32 inner_id = 1;
                optional string entry = 2;  // [size varies a lot, from 16 to 100 bytes]
            }
            repeated inner2 rec = 2;
        }
        repeated inner1 in = 2;
    }

But this will change over time, and we plan to push a lot more information (hence changes to the proto) to this store once the basic version is rolled out.

Thanks and Regards,
Suraj Narkhede

On Sep 22, 11:40 pm, Marc Gravell wrote:
> This will depend on many factors:
>
> - How big is each fragment? Very small fragments of *anything* generally get bigger when compressed.
> - What is the data? If it contains a lot of text data you might see benefits; however, many typical fragments will get bigger when compressed - it depends entirely on the content.
>
> In one of our uses, I cheat: if the size is above some nominal lower bound, I *try* GZipStream; the moment this exceeds the original size I kill it and send the uncompressed original. If it turns out to be smaller, I store that.
>
> This works well for us as the tier that is processing this data has plenty of spare CPU to speculatively try both options.
>
> Marc
>
> On 22 Sep 2011, at 11:39, Suraj wrote:
>
> > Hi,
> >
> > We are planning to use protocol buffers to serialize the data before inserting it into the db. Then we will be inserting this serialized string into the db.
> >
> > We will be storing this on SSD so lookup throughput is pretty high. But since SSDs are costly, to save on disk cost I am thinking of compressing the serialized string before inserting it into the DB.
> >
> > Has someone done benchmarking of using GZipStream on the binary serialized string?
> > Also, can you please give me an example of how to do this? I want to compress the serialized string, so the data is in memory and not in a file.
> >
> > Thanks.
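[Editor's note: since the original question asked for an example of compressing an in-memory serialized string (not a file), here is a minimal Python sketch; the question's context was .NET's GZipStream, and Python's gzip module is the stand-in. The record contents are made up. It also illustrates Marc's warning that very small fragments get bigger when compressed, because gzip adds a roughly 10-byte header and 8-byte trailer on top of the deflate stream:

```python
import gzip

record = b"user:42|name:suraj|city:pune"  # a small in-memory record (made up)
packed = gzip.compress(record)

# For a payload this tiny, the "compressed" form is larger than the original:
print(len(record), len(packed))
assert len(packed) > len(record)

# Round trip back to the original bytes, entirely in memory:
assert gzip.decompress(packed) == record
```

This is why the threshold-plus-fallback approach discussed earlier in the thread matters for records in the 300-byte range, while the multi-kilobyte text records should compress well.]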