Re: [protobuf] Re: Protocol Buffers using Lzip
Hello Chris,

2009/12/10 Christopher Smith cbsm...@gmail.com:
> One compression algo that I thought would be particularly useful with PB's would be LZO. It lines up nicely with PB's goals of being fast and compact. Have you thought about allowing an integrated LZO stream?
> --Chris

My goal is to compress huge amounts (5GB) of small serialized chunks (~150...500 bytes each) into a single stream, while still being able to randomly access each part of it without having to decompress the whole stream. GzipOutputStream (with level 5) reduces the size to about 40% of the uncompressed binary stream, whereas my LzipOutputStream (with level 5) reduces it to about 20%. The difficulty with gzip is finding synchronizing boundaries in the stream during decompression.

If your aim is to exchange small messages, say by RPC, then a fast but less efficient algorithm is the right choice. If, however, you want to store huge amounts of data permanently, your requirements may be different. In my opinion, generic streaming classes such as ZeroCopyIn/OutputStream should offer different compression algorithms for different purposes. LZO has advantages when used for communicating small to medium-sized chunks of data. LZMA, on the other hand, has advantages if you have to store lots of data for the long term. GZIP is somewhere in the middle. Unfortunately, Kenton has a different opinion about adding too many compression streaming classes.

Today I studied the LZO API. From what I have seen, I think one could implement two LzoIn/OutputStream classes. LZO compression, however, has a small drawback. Let me explain why: the LZO API is not intended to be used for streams. Instead, it always compresses and decompresses a whole block. This differs from gzip and lzip, which are intended to compress streams. A compression class has a fixed-size buffer of typically 8 or 64kB. Once this buffer is filled with data, lzip and gzip digest the input, and you can start to fill the buffer from the beginning again. The LZO compressor, on the other hand, has to compress the whole buffer in one step. The next block then has to be concatenated with the already compressed data, which means that during decompression you have to pick these chunks apart again.

If your intention is to compress chunks of data of, say, less than 64kB each and then put them on the wire, LZO is the right solution for you. For my requirements, as you will now understand, LZO does not really fit well. If there is strong interest in an alternative Protocol Buffer compression stream, don't hesitate to contact me.

Jacob

--
You received this message because you are subscribed to the Google Groups Protocol Buffers group.
To post to this group, send email to proto...@googlegroups.com.
To unsubscribe from this group, send email to protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/protobuf?hl=en.
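
The block-at-a-time behaviour described in this message implies that each compressed block needs some framing so the decompressor can separate the blocks again. Below is a minimal sketch of one way to do that, using a length prefix per block. It is an illustration only: `zlib` stands in for an LZO call because the Python standard library has no LZO binding, and the names `compress_blocks`, `decompress_blocks`, and the 4-byte little-endian prefix are assumptions, not part of any LZO or Protocol Buffers API.

```python
import struct
import zlib

BLOCK_SIZE = 64 * 1024  # fixed buffer size, matching the 64kB figure in the message


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Compress data one whole block at a time, prefixing each block with its length."""
    out = bytearray()
    for start in range(0, len(data), block_size):
        # zlib.compress is a stand-in for a whole-block LZO compress call (assumption).
        block = zlib.compress(data[start:start + block_size])
        out += struct.pack("<I", len(block))  # hypothetical 4-byte length prefix
        out += block
    return bytes(out)


def decompress_blocks(framed: bytes) -> bytes:
    """Split the concatenated blocks apart using the length prefixes and decompress each."""
    out = bytearray()
    pos = 0
    while pos < len(framed):
        (length,) = struct.unpack_from("<I", framed, pos)
        pos += 4
        out += zlib.decompress(framed[pos:pos + length])
        pos += length
    return bytes(out)
```

A round trip such as `decompress_blocks(compress_blocks(payload)) == payload` should hold for any byte string. A streaming compressor avoids this explicit framing because it keeps its own state between buffer refills.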
Re: [protobuf] Re: Protocol Buffers using Lzip
As I'm sure you can imagine, we store a lot of data in protocol buffer format at Google, so we often want to store very large files with many serialized protocol buffers. The technique we use is to batch a bunch of records together, compress them, and write the compressed block to the file (with checksumming and positioning markers). Then, if you want to read a specific record, you only need to decompress one block. That doesn't take much longer than the disk seek, so it's not a problem unless you have huge blocks.

The code we use to do this and the file format are honestly a bit of a mess (they have grown slowly over many years), so they are not suitable to be open-sourced. It certainly makes sense to have an open source library to do this, and it sounds similar to what your code is aiming at. But I agree with Kenton that it should not be part of the Protocol Buffer library -- it should be a separate project. It doesn't even need to be directly connected to Protocol Buffers -- you can use the same format for any kind of record.

Daniel

On Fri, Dec 11, 2009 at 2:20 PM, Jacob Rief jacob.r...@gmail.com wrote:
> My goal is to compress huge amounts (5GB) of small serialized chunks (~150...500 bytes each) into a single stream, while still being able to randomly access each part of it without having to decompress the whole stream.
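
The batching technique described in this message (group records, compress the group, write it with a checksum and a position marker, then decompress only one block to read a record) can be sketched roughly as follows. This is not Google's internal format, which the message says is not public. The layout here (a header of compressed length plus CRC-32, length-prefixed records, an in-memory index) and the names `write_blocks` and `read_record` are all assumptions made for illustration.

```python
import struct
import zlib


def write_blocks(records, records_per_block=100):
    """Return (file_bytes, index); index[i] is (offset, first_record_number) of block i."""
    out = bytearray()
    index = []
    for first in range(0, len(records), records_per_block):
        batch = records[first:first + records_per_block]
        # Length-prefix each record so the batch can be split again after decompression.
        raw = b"".join(struct.pack("<I", len(r)) + r for r in batch)
        body = zlib.compress(raw)
        index.append((len(out), first))  # positioning marker for this block
        # Hypothetical block header: compressed length and CRC-32 of the compressed bytes.
        out += struct.pack("<II", len(body), zlib.crc32(body))
        out += body
    return bytes(out), index


def read_record(file_bytes, index, n, records_per_block=100):
    """Fetch record n by verifying and decompressing only the block that contains it."""
    offset, first = index[n // records_per_block]
    length, crc = struct.unpack_from("<II", file_bytes, offset)
    body = file_bytes[offset + 8:offset + 8 + length]
    if zlib.crc32(body) != crc:
        raise ValueError("block checksum mismatch")
    raw = zlib.decompress(body)
    pos = 0
    for _ in range(n - first):  # skip the records that precede the requested one
        (size,) = struct.unpack_from("<I", raw, pos)
        pos += 4 + size
    (size,) = struct.unpack_from("<I", raw, pos)
    return raw[pos + 4:pos + 4 + size]
```

The block size is the main tuning knob in such a design: larger blocks usually compress better, while smaller blocks reduce the work needed to reach a single record.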
Re: [protobuf] Re: Protocol Buffers using Lzip
One compression algo that I thought would be particularly useful with PB's would be LZO. It lines up nicely with PB's goals of being fast and compact. Have you thought about allowing an integrated LZO stream?

--Chris

On Wed, Dec 9, 2009 at 12:21 PM, Kenton Varda ken...@google.com wrote:
> Thanks for writing this code! I'm sure someone will find it useful.
>
> That said, I'm wary of adding random stuff to the protobuf library. gzip made sense because practically everyone has zlib, but lzlib is less universal. Also, if we have lzip support built-in, should we also support bzip, rar, etc.? Also note that libprotobuf is already much larger (in terms of binary footprint) than it should be, so making it bigger is a tricky proposition. And finally, on a non-technical note, adding any code to the protobuf distribution puts maintenance work on me, and I'm overloaded as it is.
>
> Usually I recommend that people set up a googlecode project to host code like this, but lzip_stream may be a bit small to warrant that. Maybe a project whose goal is to provide protobuf adaptors for many different compression formats? Or a project for hosting random protobuf-related utility code in general?
>
> On Tue, Dec 8, 2009 at 4:17 AM, Jacob Rief jacob.r...@gmail.com wrote:
> > Hello Brian, hello Kenton, hello list,
> >
> > As an alternative to GzipInputStream and GzipOutputStream, I have written a compression and a decompression stream class which are stackable into Protocol Buffers streams. They are named LzipInputStream and LzipOutputStream and use the Lempel-Ziv-Markov chain algorithm, as implemented by LZIP: http://www.nongnu.org/lzip/lzip.html
> >
> > An advantage of using Lzip instead of Gzip is that Lzip supports multi-member compression. So one can jump into the stream at any position, skip forward to the next synchronization boundary, and start reading from there. At the default compression level, Lzip has a better compression ratio at the cost of being slower than Gzip, but when Lzip is used with a low compression level, its speed and output size are comparable to those of Gzip.
> >
> > I would like to donate these classes to the ProtoBuf software repository. They will be released under an OSS license compatible with LZIP's and Google's. Could someone please check them and tell me in what kind of repository I can publish them? Google's license agreement contains a passage stating: "Neither the name of Google Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission." Since I have to use the name google in the C++ namespace of LzipIn/OutputStream, I hereby ask for permission to do so.
> >
> > Comments are appreciated,
> > Jacob
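
The multi-member idea from the original announcement (jump to any position, skip forward to the next synchronization boundary, read from there) can be illustrated with the sketch below. It is an analogy, not the lzip format or the LzipInputStream code: Python's standard `lzma` module does not read lzip files, so independent `.xz` streams stand in for lzip members, and the function names `write_members` and `read_from` are hypothetical.

```python
import lzma

# Header magic of an .xz stream, used here as the synchronization marker.
# It stands in for an lzip member header (assumption for illustration).
XZ_MAGIC = b"\xfd7zXZ\x00"


def write_members(chunks):
    """Compress each chunk as an independent member and concatenate the results."""
    return b"".join(lzma.compress(c, format=lzma.FORMAT_XZ) for c in chunks)


def read_from(stream, position):
    """Skip forward from an arbitrary position to the next member and decompress it.

    Returns (member_offset, data), or None if no complete member follows.
    """
    start = stream.find(XZ_MAGIC, position)
    while start != -1:
        try:
            decomp = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
            data = decomp.decompress(stream[start:])
            if decomp.eof:  # a complete member was decoded; later members are ignored
                return start, data
        except lzma.LZMAError:
            pass  # the magic bytes occurred by chance inside compressed data
        start = stream.find(XZ_MAGIC, start + 1)
    return None
```

Because every member is compressed independently, a reader that lands in the middle of one member loses at most that member and can resume at the next boundary.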