[protobuf] Re: Serialized Object as a value in Key-Value store.

2011-10-03 Thread Eyal Farago
In my experience, not so much, I'm afraid...
Since the PB format is length-prefixed, or actually tag+length-prefixed,
the compression rates are not as high as one might expect.
I migrated from an in-house XML format to a PB-based format; the
performance gain was amazing (both serializing and deserializing), but the
resulting BLOB had a lousy compression ratio compared to the XML format.
I'm not absolutely sure about the numbers, but I think PB compressed to
50% of its original size, while XML compressed to 16% of its original
size; all in all, the compressed PB weighed roughly twice as much as the
compressed XML.
I had to roll back the change; fortunately, I came up with a different way
to use PB which increased performance while keeping the (compressed) BLOBs
smaller.
My understanding of this behavior is that the tag+size prefix on each
string or message field simply reduces the compressor's ability to find
repeating regions; nesting messages one inside another (like the XML
approach) makes things even worse, as it introduces more 'prefixes' into
the BLOB, which further reduces the chance of finding repetitions (finding
repetitions being basically one of the two major methods compressors use).
My solution was to arrange the message so that all strings are grouped
together in one repeated field (plus some other nasty tricks), roughly as
sketched below.
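
(Eyal's actual schemas are not shown in the thread; the following is a
hypothetical proto2 sketch of the grouping idea only. On the wire, each
string is preceded by a one-byte tag plus a varint length, so in a nested
layout every run of text is interrupted by these prefixes; grouping the
strings keeps the text nearly contiguous.)

// Nested layout: every string is surrounded by its own tag+length
// prefix and by the prefixes of the enclosing messages.
message RecordNested {
  message Item {
    optional int32 id = 1;
    optional string entry = 2;
  }
  repeated Item items = 1;
}

// Grouped layout, the trick Eyal describes: all strings sit back to
// back in one repeated field, giving the compressor long runs of text
// to find repetitions in; parallel fields carry the remaining data,
// matched up by index.
message RecordGrouped {
  repeated string entries = 1;            // entries[i] pairs with ids[i]
  repeated int32 ids = 2 [packed = true]; // packed: no per-element tags
}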

BTW, I am a great fan of PB, and I am pushing its usage in my company, but
when it's used as a serialization format (rather than a message-passing
protocol) you sometimes have to bend things a bit...

Eyal.




[protobuf] Re: Serialized Object as a value in Key-Value store.

2011-09-22 Thread Benjamin Wright

Protocol buffers does nothing to reduce the size of strings, and string
data tends to compress very well. In my experience you can expect
compression gains similar to those of a plain text file (very good).





[protobuf] Re: Serialized Object as a value in Key-Value store.

2011-09-22 Thread Suraj
Hi Marc,

Thanks for the reply.

Ours is basically text data. The record size varies, roughly in the
range from 300 bytes to 3 kB, but most records are more than 1 kB. I
serialized one record from production and found that a 3 kB object
became a 728-byte string when serialized. Our data is structured and
kind of hierarchical. CPU is not a limitation in our case either, so I
was thinking that if compression helps to further reduce the data size,
it will help us save on SSD cost.

Our proto is something similar to:

message Test {
  optional string key = 1;        // ~16 bytes
  message Inner1 {
    optional int32 id = 1;
    message Inner2 {
      optional int32 inner_id = 1;
      optional string entry = 2;  // size varies a lot, from 16 to 100 bytes
    }
    repeated Inner2 rec = 2;
  }
  repeated Inner1 in = 2;
}

But this will change over time, and we plan to push a lot more
information (hence changes to the proto) into this store once the basic
version is rolled out.
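
(Not part of the original message: a minimal sketch of building a record
from the schema above and serializing it to an in-memory byte array with
protobuf's generated Java API, since the question below is about working
with the bytes in memory. It assumes the .proto was compiled with
protoc's Java output; the package and outer class name are illustrative.)

import com.example.TestOuterClass.Test;  // hypothetical generated class

public class SerializeDemo {
    public static void main(String[] args) {
        Test record = Test.newBuilder()
            .setKey("user:1234")  // illustrative key
            .addIn(Test.Inner1.newBuilder()
                .setId(7)
                .addRec(Test.Inner1.Inner2.newBuilder()
                    .setInnerId(1)
                    .setEntry("some entry text")))
            .build();
        // toByteArray() produces the wire-format bytes entirely in
        // memory; no file is involved.
        byte[] bytes = record.toByteArray();
        System.out.println("serialized size: " + bytes.length + " bytes");
    }
}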

Thanks and Regards,
Suraj Narkhede

On Sep 22, 11:40 pm, Marc Gravell  wrote:
> This will depend on many factors:
>
> - how big is each fragment? Very small fragments of *anything* generally get 
> bigger when compressed
> - what is the data? If it contains a lot of text data you might see benefits; 
> however, many typical fragments will get bigger when compressed - it depends 
> entirely on the content
>
> In one of our uses, I cheat: if the size is above some nominal lower
> bound, I *try* GZipStream; the moment the output exceeds the original
> size I kill it and send the uncompressed original. If it turns out to
> be smaller, I store that.
>
> This works well for us as the tier that is processing this data has plenty of 
> spare CPU to speculatively try both options.
>
> Marc
>
> On 22 Sep 2011, at 11:39, Suraj  wrote:
>
> > Hi,
>
> > We are planning to use protocol buffers to serialize the data before
> > inserting it into the db. Then we will insert this serialized
> > string into the db.
>
> > We will be storing this on SSD, so lookup throughput is pretty high.
> > But since SSDs are costly, to save on disk cost I am thinking of
> > compressing the serialized string before inserting it into the DB.
>
> > Has someone done benchmarking of using GZipStream on the
> > binary serialized string?
> > Also, can you please give me an example of how to do this? I want to
> > compress the serialized string; the data is in memory, not in a
> > file.
>
> > Thanks.
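
(Not part of the original thread: a minimal in-memory sketch of Marc's
"try compression, keep whichever is smaller" pattern. Marc mentions
.NET's GZipStream; this sketch uses Java's java.util.zip.GZIPOutputStream
instead, and it compresses fully and then compares sizes rather than
aborting mid-stream as Marc describes. The size threshold is
illustrative.)

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public final class MaybeCompress {
    // Below some small size, gzip's ~20 bytes of header/trailer
    // overhead tends to make the output bigger than the input.
    private static final int MIN_SIZE = 256;  // illustrative threshold

    /** Returns the gzipped bytes if smaller, else the original bytes. */
    static byte[] compressIfSmaller(byte[] serialized) throws IOException {
        if (serialized.length < MIN_SIZE) {
            return serialized;
        }
        ByteArrayOutputStream buf =
            new ByteArrayOutputStream(serialized.length);
        try (GZIPOutputStream gzip = new GZIPOutputStream(buf)) {
            gzip.write(serialized);
        }
        byte[] compressed = buf.toByteArray();
        return compressed.length < serialized.length
            ? compressed : serialized;
    }
}

One practical note: the store then holds a mix of compressed and
uncompressed values, so something (e.g. a flag byte prepended to the
value) has to record which form each value is in, so that reads know
whether to gunzip.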
