Re: [protobuf] Re: Protocol Buffers using Lzip

2009-12-11 Thread Jacob Rief
Hello Chris,

2009/12/10 Christopher Smith cbsm...@gmail.com:
 One compression algo that I thought would be particularly useful with PB's
 would be LZO. It lines up nicely with PB's goals of being fast and compact.
 Have you thought about allowing an integrated LZO stream?

 --Chris

My goal is to compress huge amounts (~5GB) of small serialized chunks
(~150...500 bytes) into a single stream, while still being able to
randomly access each part of it without having to decompress the whole
stream. GzipOutputStream (with level 5) reduces the size to about 40%
of the uncompressed binary stream, whereas my LzipOutputStream (with
level 5) reduces it to about 20%. The difficulty with gzip is finding
synchronizing boundaries in the stream during decompression.

If your aim is to exchange small messages, say by RPC, then a fast but
less efficient algorithm is the right choice. If, however, you want to
store huge amounts of data permanently, your requirements may be
different.
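
To make this concrete, here is a minimal sketch of the write side,
using the GzipOutputStream that ships with libprotobuf. MyRecord is a
hypothetical stand-in for any generated message type; each record gets
a varint length prefix so that a reader can split the stream apart
again:

#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/gzip_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <cstdint>
#include <vector>

using namespace google::protobuf::io;

// Write many small records, length-prefixed, through one gzip stream.
// MyRecord is a hypothetical generated protobuf message type.
void WriteRecords(int fd, const std::vector<MyRecord>& records) {
  FileOutputStream file(fd);
  GzipOutputStream::Options options;
  options.compression_level = 5;            // the level quoted above
  GzipOutputStream gzip(&file, options);
  CodedOutputStream coded(&gzip);
  for (const MyRecord& record : records) {
    coded.WriteVarint32(static_cast<uint32_t>(record.ByteSizeLong()));
    record.SerializeToCodedStream(&coded);  // each chunk ~150...500 bytes
  }
}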

In my opinion, generic streaming classes such as
ZeroCopyIn/OutputStream should offer different compression algorithms
for different purposes. LZO has advantages if used for communication
of small to medium-sized chunks of data. LZMA, on the other hand, has
advantages if you have to store lots of data for the long term. GZIP is
somewhere in the middle. Unfortunately, Kenton is wary of adding too
many compression streaming classes.

Today I studied the API of LZO. From what I have seen, I think one
could implement two LzoIn/OutputStream classes. LZO compression,
however, has a small drawback; let me explain why. The LZO API is not
intended to be used for streams: it always compresses and decompresses
a whole block. This behaviour differs from gzip and lzip, which are
designed to compress streams. A compression class has a fixed-size
buffer of typically 8 or 64kB. When this buffer is filled with data,
lzip and gzip digest the input and you can start filling the buffer
from the beginning again. The LZO compressor, on the other hand, has
to compress the whole buffer in one step. Each compressed block then
has to be concatenated with the already compressed data, which means
that during decompression you have to pick the chunks apart again.
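
To sketch what I mean (this is hypothetical framing, not an existing
LZO API), each block would have to go out with explicit size prefixes
so that the decompressor can separate the chunks:

#include <lzo/lzo1x.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical block framing for an LzoOutputStream; lzo_init() must
// have been called once at startup. Each block is emitted as
// [uncompressed size][compressed size][compressed bytes].
std::vector<unsigned char> CompressBlock(const unsigned char* in,
                                         lzo_uint in_len) {
  static unsigned char wrkmem[LZO1X_1_MEM_COMPRESS];
  // LZO's documented worst-case expansion, plus room for the header.
  std::vector<unsigned char> out(2 * sizeof(uint32_t) +
                                 in_len + in_len / 16 + 64 + 3);
  lzo_uint out_len = 0;
  lzo1x_1_compress(in, in_len, out.data() + 2 * sizeof(uint32_t),
                   &out_len, wrkmem);
  uint32_t header[2] = { static_cast<uint32_t>(in_len),
                         static_cast<uint32_t>(out_len) };
  std::memcpy(out.data(), header, sizeof(header));
  out.resize(2 * sizeof(uint32_t) + out_len);
  return out;
}

The decompressor reads the two prefixes, hands exactly that many bytes
to lzo1x_decompress, and repeats for the next block.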

If your intention is to compress chunks of data of, say, less than
64kB each, and then to put them on the wire, then LZO is the right
solution for you. For my requirements, as you will understand now, LZO
does not fit well.
If there is a strong interest in an alternative Protocol Buffer
compression stream, don't hesitate to contact me.

Jacob





Re: [protobuf] Re: Protocol Buffers using Lzip

2009-12-11 Thread Daniel Wright
As I'm sure you can imagine, we store a lot of data in protocol buffer
format at Google, so we often want to store very large files with many
serialized protocol buffers.  The technique we use is to batch a bunch of
records together, compress them and write the compressed block to the file
(with checksumming and positioning markers).  Then if you want to read a
specific record, you need to decompress one block.  That doesn't take much
longer than the disk seek, so it's not a problem unless you have huge
blocks.

The code we use to do this and the file format is honestly a bit of a
mess (it's grown slowly over many years), so it's not suitable to be
open-sourced.  It certainly makes sense to have an open source library to do
this, and it sounds similar to what your code is aiming at.  But I agree
with Kenton that it should not be part of the Protocol Buffer library -- it
should be a separate project.  It doesn't even need to be directly connected
to Protocol Buffers -- you can use the same format for any kind of record.
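
To give a flavour of the idea (a hypothetical sketch, not our actual
format), a block writer might look like this, using zlib for the
compression:

#include <zlib.h>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Batch serialized records into one block, compress it, and write it
// with a marker and checksum so a reader can seek to a block boundary
// and decompress only that block.
void WriteBlock(std::FILE* f, const std::vector<std::string>& records) {
  std::string batch;
  for (const std::string& r : records) {   // length-prefix each record
    uint32_t n = r.size();
    batch.append(reinterpret_cast<const char*>(&n), sizeof(n));
    batch.append(r);
  }
  uLongf clen = compressBound(batch.size());
  std::vector<Bytef> cbuf(clen);
  compress(cbuf.data(), &clen,
           reinterpret_cast<const Bytef*>(batch.data()), batch.size());
  uint32_t header[3] = { 0x5B10C3A1u,      // arbitrary positioning marker
                         static_cast<uint32_t>(clen),
                         static_cast<uint32_t>(crc32(0L, cbuf.data(), clen)) };
  std::fwrite(header, sizeof(header), 1, f);
  std::fwrite(cbuf.data(), 1, clen, f);
}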

Daniel





Re: [protobuf] Re: Protocol Buffers using Lzip

2009-12-10 Thread Christopher Smith
One compression algo that I thought would be particularly useful with
PBs would be LZO. It lines up nicely with PB's goals of being fast and
compact. Have you thought about allowing an integrated LZO stream?

--Chris

On Wed, Dec 9, 2009 at 12:21 PM, Kenton Varda ken...@google.com wrote:

 Thanks for writing this code!  I'm sure someone will find it useful.

 That said, I'm wary of adding random stuff to the protobuf library.  gzip
 made sense because practically everyone has zlib, but lzlib is less
 universal.  Also, if we have lzip support built-in, should we also support
 bzip, rar, etc.?

 Also note that libprotobuf is already much larger (in terms of binary
 footprint) than it should be, so making it bigger is a tricky proposition.

 And finally, on a non-technical note, adding any code to the protobuf
 distribution puts maintenance work on me, and I'm overloaded as it is.

 Usually I recommend that people set up a googlecode project to host code
 like this, but lzip_stream may be a bit small to warrant that.  Maybe a
 project whose goal is to provide protobuf adaptors for many different
 compression formats?  Or a project for hosting random protobuf-related
 utility code in general?

 On Tue, Dec 8, 2009 at 4:17 AM, Jacob Rief jacob.r...@gmail.com wrote:

 Hello Brian, hello Kenton, hello list,
 as an alternative to GzipInputStream and GzipOutputStream I have
 written a compression and a decompression stream class which are
 stackable into Protocol Buffer streams. They are named
 LzipInputStream and LzipOutputStream and use the Lempel-Ziv-Markov
 chain algorithm, as implemented by Lzip:
 http://www.nongnu.org/lzip/lzip.html

 An advantage of using Lzip instead of Gzip is that Lzip supports
 multi-member compression. So one can jump into the stream at any
 position, scan forward to the next member boundary, and start
 reading from there.
 Using the default compression level, Lzip has a better compression
 ratio at the cost of being slower than Gzip, but when Lzip is used
 with a low compression level, its speed and output size are
 comparable to those of Gzip.
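
 For example, a reader that wants to resynchronize can scan for the
 four magic bytes LZIP that begin every member header. A rough
 sketch (a robust reader would also validate the version byte and the
 member trailer):

 #include <cstddef>
 #include <cstring>

 // Find the next lzip member boundary at or after `pos` by scanning
 // for the 4-byte magic that starts every member header.
 std::size_t NextMemberBoundary(const unsigned char* data,
                                std::size_t size, std::size_t pos) {
   static const unsigned char magic[4] = { 'L', 'Z', 'I', 'P' };
   for (std::size_t i = pos; i + 4 <= size; ++i) {
     if (std::memcmp(data + i, magic, 4) == 0)
       return i;          // decompression can restart here
   }
   return size;           // no further member in this buffer
 }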

 I would like to donate these classes to the Protocol Buffers software
 repository. They will be released under an OSS license compatible
 with Lzip's and Google's. Could someone please check them and tell me
 in what kind of repository I can publish them? Google's license
 agreement contains a passage stating: Neither the name of Google Inc.
 nor the names of its contributors may be used to endorse or promote
 products derived from this software without specific prior written
 permission. Since I have to use the name google in the C++ namespace
 of LzipIn/OutputStream, I hereby ask for permission to do so.

 Comments are appreciated,
 Jacob






-- 
Chris
