Re: [protobuf] Re: Opensourcing LzLib Protocol Buffer Code
Hello Ernest, here is the latest version of the lzip input/output stream classes. I fixed some issues since the last published version. These two classes are, in my humble opinion, stable now. Any code reviews are welcome! Kenton, if there is a repository for external utility classes like this one, please let me know. And once my other library reaches a state where I can publish it, I will create a project, together with its repository, containing the classes above. However, since this is work in progress with lots of changes to my API, I prefer to keep it unpublished for now. Regards, Jacob

-- You received this message because you are subscribed to the Google Groups Protocol Buffers group. To post to this group, send email to proto...@googlegroups.com. To unsubscribe from this group, send email to protobuf+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/protobuf?hl=en.

// This file contains the declaration of classes
// LzipInputStream and LzipOutputStream, used to compress
// and decompress Google's Protocol Buffer streams using
// the Lempel-Ziv-Markov chain algorithm.
//
// Derived from http://protobuf.googlecode.com/svn/tags/2.2.0/src/google/protobuf/io/gzip_stream.h
// Copyright 2010 by Jacob Rief jacob.r...@gmail.com

#ifndef GOOGLE_PROTOBUF_IO_LZIP_STREAM_H__
#define GOOGLE_PROTOBUF_IO_LZIP_STREAM_H__

#include <stdint.h>
#include <lzlib.h>
#include <google/protobuf/io/zero_copy_stream.h>

namespace google {
namespace protobuf {
namespace io {

// A ZeroCopyInputStream that reads compressed data through lzlib.
class LIBPROTOBUF_EXPORT LzipInputStream : public ZeroCopyInputStream {
 public:
  explicit LzipInputStream(ZeroCopyInputStream* sub_stream);
  virtual ~LzipInputStream();

  // Releases the decoder.
  bool Close();

  // Resets the underlying input stream, resetting the decompressor and
  // all counters.
  void Reset();

  // Forwards the underlying InputStream to the beginning of the next
  // compression member.
  // Use this function after repositioning the underlying stream, or in
  // case a stream error occurred.
  bool Forward();

  // In case of an error, check the reason here.
  inline LZ_Errno ErrorCode() const { return errno_; }

  // --- implements ZeroCopyInputStream ---
  bool Next(const void** data, int* size);
  void BackUp(int count);
  bool Skip(int count);
  int64 ByteCount() const;

 private:
  GOOGLE_DISALLOW_EVIL_CONSTRUCTORS(LzipInputStream);

  void Decompress();

  // compressed input stream
  ZeroCopyInputStream* sub_stream_;
  bool finished_;

  // plain text output stream
  const int output_buffer_length_;
  void* const output_buffer_;
  uint8_t* output_position_;
  uint8_t* next_out_;
  int avail_out_;

  // Lzip decoder
  LZ_Decoder* decoder_;
  LZ_Errno errno_;
};

class LIBPROTOBUF_EXPORT LzipOutputStream : public ZeroCopyOutputStream {
 public:
  // Creates a LzipOutputStream with default options.
  explicit LzipOutputStream(ZeroCopyOutputStream* sub_stream,
                            size_t compression_level = 5,
                            int64_t member_size = kint64max);
  virtual ~LzipOutputStream();

  // Flushes data written so far to zipped data in the underlying stream.
  // It is the caller's responsibility to flush the underlying stream if
  // necessary.
  // Compression may be less efficient stopping and starting around flushes.
  // Returns true if no error.
  bool Flush();

  // Flushes data written so far to zipped data in the underlying stream
  // and restarts a new LZIP member. It is the caller's responsibility to
  // flush the underlying stream if necessary.
  // Compression is a lot less efficient when restarting a new member,
  // rather than calling Flush().
  // Returns true if no error.
  bool Restart();

  // Writes out all data and closes the lzip stream.
  // It is the caller's responsibility to close the underlying stream if
  // necessary.
  // Returns true if no error.
  bool Close();

  // --- implements ZeroCopyOutputStream ---
  bool Next(void** data, int* size);
  void BackUp(int count);
  int64 ByteCount() const;
  void Reset();

 private:
  GOOGLE_DISALLOW_EVIL_CONSTRUCTORS(LzipOutputStream);

  void Compress(bool flush = false);

  // plain text input stream
  const int input_buffer_length_;
  void* const input_buffer_;
  uint8_t* input_position_;
  uint8_t* const input_buffer_end_;

  // compressed output stream
  ZeroCopyOutputStream* sub_stream_;
  bool finished_;

  // Lzip encoder
  struct Options {
    int dictionary_size;  // 4KiB..512MiB
    int match_len_limit;  // 5..273
  };
  static const Options options[9];
  LZ_Encoder* encoder_;
  const uint64_t member_size_;
  LZ_Errno errno_;
};

}  // namespace io
}  // namespace protobuf
}  // namespace google

#endif  // GOOGLE_PROTOBUF_IO_LZIP_STREAM_H__

// This file contains the implementation of classes
// LzipInputStream and LzipOutputStream, used to compress
// and decompress Google's Protocol Buffer streams using
// the Lempel-Ziv-Markov chain algorithm.
//
// Derived from http
[protobuf] Re: Opensourcing LzLib Protocol Buffer Code
Hello Ernest, this code is part of a private project in progress, and it works well in that context. Unfortunately the Google guys had no use case for it, therefore they did not want to incorporate it into their code base; maybe they just suffer from the not-invented-here syndrome. When my project is ready to be published, I will add that code there. If I can get write access to a PB-related Google repository, I will use that. The reason I have not published anything yet is that I did not want to start a Google project just to publish two files. Regards, Jacob

2010/3/23 Ernest Lee hellfir...@gmail.com:
> Hello Jacob Rief, I have noticed you haven't publicly said anything
> about your LZMA protobuf storage. Is it dead? Can you post it on your
> Google Code project? What has happened to it? Thanks.
Re: [protobuf] Re: How can I reset a FileInputStream?
Hello Kenton,

2010/1/30 Kenton Varda ken...@google.com:
> We can't add a new virtual method to the ZeroCopyInputStream interface
> because it would break existing implementations.

But only on a binary level. For God's sake, you are not Microsoft :)

> Besides that, it's unclear what the Reset() method means in an abstract
> sense. Yes, you can define it for the particular set of streams that
> you're thinking of, but what does it mean in general?

Put the object into a state equivalent to the state immediately after construction. The Reset() button should be pressed only when you know what you are doing; otherwise you will lose valuable data.

> What should ArrayInputStream::Reset() do? In this case Reset() is
> nonsensical.

The same as its constructor, without reallocation, i.e.:

void ArrayInputStream::Reset() {
  position_ = 0;
  last_returned_size_ = 0;
}

> What should IstreamInputStream::Reset() do? Should it only discard its
> own buffer, or should it also reset the underlying istream?

void IstreamInputStream::Reset() {
  impl_.Reset();
}

since impl_ is a CopyingInputStreamAdaptor, which itself IS-A ZeroCopyInputStream, its Reset() is just another implementation, i.e.:

void CopyingInputStreamAdaptor::Reset() {
  position_ = 0;
  buffer_used_ = 0;
  backup_bytes_ = 0;
}

> If that istream is itself wrapping a file descriptor, and you're trying
> to seek that file descriptor directly, then you need to reset the
> istream. But maybe the user is actually calling
> IstreamInputStream::Reset() because they have seeked the istream itself
> and want IstreamInputStream to acknowledge this. Who knows? But you
> can't say that Reset() is only propagated down the stack by *some*
> implementations and not others.

Since the creator of IstreamInputStream is the owner of the file descriptor, it is his responsibility to seek to whatever location is desired.

> No, we won't be adding a Reset() method because the meaning is unclear.
> Meanwhile, you seem to have made an argument against
> FileInputStream::Seek(): Any streams layered on top of it will be
> broken if you Seek() the stream under them. So you have to have some
> way to reset those streams, and the problem starts again!

Exactly! Therefore, instead of destroying and recreating them, a much simpler Reset() function would do the job.

> The design of the streaming classes is to consider a stream which can
> move only forward. It was not designed for moving backwards or random
> access. Please just don't add anything new. If you are unhappy with
> what ZeroCopy{Input,Output}Stream provide, you can always just create
> your own stream framework to use.

Well, I have to live with that decision. Maybe in the future some other people will have similar use cases. Maybe in version 3? Just out of curiosity: the protobuf code is really easy to read and to understand. The only thing I disliked is the mapping of class names to filenames. Is all the code inside Google written that clearly? Regards, Jacob
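The Reset()-propagation idea argued in this exchange can be sketched with a toy two-layer stream stack. The class names (MiniArrayStream, MiniAdaptor) are invented for this illustration and are not part of the protobuf API; the point is only that each layer restores its own counters and delegates downward, instead of the owner destroying and recreating the stack.

```cpp
#include <cassert>
#include <cstring>

// Lowest layer: mirrors the proposed ArrayInputStream::Reset() -- back to
// the post-construction state without reallocating anything.
class MiniArrayStream {
 public:
  MiniArrayStream(const char* data, int size)
      : data_(data), size_(size), position_(0) {}
  void Reset() { position_ = 0; }
  int Read(char* out, int want) {
    int n = want < size_ - position_ ? want : size_ - position_;
    std::memcpy(out, data_ + position_, n);
    position_ += n;
    return n;
  }
 private:
  const char* data_;
  int size_;
  int position_;
};

// Upper layer: resets only its own counters and forwards Reset() to the
// layer below, the way the thread proposes for CopyingInputStreamAdaptor.
class MiniAdaptor {
 public:
  explicit MiniAdaptor(MiniArrayStream* lower) : lower_(lower), total_(0) {}
  void Reset() { total_ = 0; lower_->Reset(); }
  int Read(char* out, int want) {
    int n = lower_->Read(out, want);
    total_ += n;
    return n;
  }
  int total() const { return total_; }
 private:
  MiniArrayStream* lower_;
  int total_;
};
```

After Reset(), the whole stack behaves as if freshly constructed, with no destructor/constructor round trip and no buffer reallocation.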
[protobuf] Re: How can I reset a FileInputStream?
Hello Kenton,

2010/1/20 Kenton Varda ken...@google.com:
> (1) Normally micro-benchmarks involve running the operation in a loop
> many times so that the total time is closer to 1s or more, not running
> the operation once and trying to time that. System clocks are not very
> accurate at that scale, and depending on what kind of clock it is, it
> may actually take significantly longer to read the clock than it does
> to allocate memory.
> (2) Your benchmark does not include the time spent actually reading the
> file, which is what I asserted would be much slower than re-allocating
> the buffer. Sure, the seek itself is fast, but it is pointless without
> actually reading.

I modified the benchmark; now the code looks like this:

boost::posix_time::ptime time0(boost::posix_time::microsec_clock::local_time());
boost::posix_time::ptime time1(boost::posix_time::microsec_clock::local_time());
for (int i = 0; i < 100; ++i) {
  const void* data;
  int size;
  fileInStream->Seek(offset, whence);
  fileInStream->Next(&data, &size);
}
boost::posix_time::ptime time2(boost::posix_time::microsec_clock::local_time());
for (int i = 0; i < 100; ++i) {
  const void* data;
  int size;
  ::lseek64(fileDescriptor, offset, whence);
  fileInStream.reset(new google::protobuf::io::FileInputStream(fileDescriptor));
  fileInStream->Next(&data, &size);
}
boost::posix_time::ptime time3(boost::posix_time::microsec_clock::local_time());
std::cerr << "t1: " << boost::posix_time::time_period(time1, time2).length()
          << " t2: " << boost::posix_time::time_period(time2, time3).length()
          << std::endl;

The difference now is less significant, but still measurable:

t1: 00:00:02.068949 t2: 00:00:02.389942
t1: 00:00:02.092842 t2: 00:00:02.429206
t1: 00:00:02.080614 t2: 00:00:02.394708
t1: 00:00:02.094289 t2: 00:00:02.429952
t1: 00:00:02.323403 t2: 00:00:03.723459
t1: 00:00:02.151486 t2: 00:00:03.711809
t1: 00:00:02.084442 t2: 00:00:02.416326
t1: 00:00:02.052930 t2: 00:00:02.383500

> (3) What memory allocator are you using?
> With tcmalloc, a malloc/free pair should take around 50ns, two orders
> of magnitude less than your 4us measurement.

The 'new' operator is not overloaded. I use gcc version 4.4.1 20090725 (Red Hat 4.4.1-2). Regards, Jacob

On Wed, Jan 20, 2010 at 2:17 PM, Jacob Rief jacob.r...@gmail.com wrote:
> Hello Kenton, now I did some benchmarks while Seek'ing through a
> FileInputStream. The testing code looks like this:
>
> boost::posix_time::ptime t0(boost::posix_time::microsec_clock::local_time());
> boost::shared_ptr<google::protobuf::io::FileInputStream> fileInStream(
>     new google::protobuf::io::FileInputStream(fileDescriptor));
> boost::posix_time::ptime t1(boost::posix_time::microsec_clock::local_time());
> // using Seek(), the function available through my patch
> fileInStream->Seek(offset, whence);
> boost::posix_time::ptime t2(boost::posix_time::microsec_clock::local_time());
> // this is the default method of achieving the same
> ::lseek64(fileDescriptor, offset, whence);
> fileInStream.reset(new google::protobuf::io::FileInputStream(fileDescriptor));
> boost::posix_time::ptime t3(boost::posix_time::microsec_clock::local_time());
> std::cerr << "t1: " << boost::posix_time::time_period(t1, t2).length()
>           << " t2: " << boost::posix_time::time_period(t2, t3).length()
>           << std::endl;
>
> and on my Intel Core2 Duo CPU E8400 (3.00GHz) with 4GB of RAM, gcc
> version 4.4.1 20090725, compiled with -O2, I get these numbers:
>
> t1: 00:00:00.01 t2: 00:00:00.03
> t1: 00:00:00.01 t2: 00:00:00.03
> t1: 00:00:00.01 t2: 00:00:00.04
> t1: 00:00:00.01 t2: 00:00:00.07
> t1: 00:00:00.01 t2: 00:00:00.02
> t1: 00:00:00.01 t2: 00:00:00.03
> t1: 00:00:00.02 t2: 00:00:00.03
> t1: 00:00:00.01 t2: 00:00:00.04
> t1: 00:00:00.01 t2: 00:00:00.04
> t1: 00:00:00.01 t2: 00:00:00.03
> t1: 00:00:00.01 t2: 00:00:00.04
>
> In absolute numbers, ~1 microsecond compared to 3-4 microseconds is not
> a big difference, but from a relative point of view, direct Seek'ing is
> much faster than object recreation. And since I have to seek a lot in
> the FileInputStream, the measured times will accumulate. Regards, Jacob
>
> 2010/1/19 Kenton Varda ken...@google.com:
>> Did you do any tests to determine if the performance difference is
>> relevant?
>>
>> On Mon, Jan 18, 2010 at 3:14 PM, Jacob Rief jacob.r...@gmail.com wrote:
>>> Hello Kenton,
>>> 2010/1/18 Kenton Varda ken...@google.com:
>>> (...snip...)
>>>> As for code cleanliness, I find the Reset() method awkward since the
>>>> user has to remember to call it at the same time as they do some
>>>> other operation, like seeking the file descriptor. Either calling
>>>> Reset() or seeking the file descriptor alone will put the object in
>>>> an inconsistent state. It might make more sense to offer an actual
>>>> Seek() method which can safely perform both operations together with
>>>> an interface that is not so
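The loop-based timing pattern recommended in this exchange can be written without Boost using std::chrono. This is a generic sketch, not the thread's actual benchmark: the measured operation here is a placeholder lambda standing in for FileInputStream::Seek() versus object recreation.

```cpp
#include <chrono>

// Time an operation that individually runs below the clock's resolution:
// run it many times, measure the whole loop, and divide by the iteration
// count, as suggested in point (1) above.
template <typename Fn>
double MeasureNanosPerCall(Fn op, int iterations) {
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) op();
  auto stop = std::chrono::steady_clock::now();
  std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / iterations;
}
```

steady_clock is used rather than a wall clock so that NTP adjustments cannot distort the measurement; each reported figure is an average, so per-call jitter is smoothed out.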
[protobuf] Re: How can I reset a FileInputStream?
Hello Kenton,

> What makes you think it is inefficient? It does mean the buffer has to
> be re-allocated, but with a decent malloc implementation that shouldn't
> take long. Certainly the actual reading from the file would take
> longer. Have you seen performance problems with this approach?

Well, in order to see any performance penalties, I would have to implement FileInputStream::Reset() and compare the results with the current implementation (I can do that if there is enough interest). I reviewed the implementation and saw that by reinstantiating a FileInputStream object, three destructors and three constructors have to be called, where one (CopyingInputStreamAdaptor) invalidates a buffer which has to be reallocated immediately afterwards in the next call to Next(). A Reset() function would avoid these unnecessary steps.

> If there really is a performance problem with allocating new objects,
> then sure.

From the performance point of view it is certainly not a big issue, but from the code-cleanliness point of view it is. I have written a class named LzipInputStream, which offers a Reset() functionality to randomly access any part of the uncompressed input stream without having to decompress everything. Therefore this Reset() function is called quite often, and it has to destroy and recreate its lower layer, i.e. the FileInputStream. If each stackable ...InputStream offered a Reset() function, the upper layer would only have to call Reset() on the lower layer, instead of keeping track of how to reconstruct the lower-layered FileInputStream object. Regards, Jacob
[protobuf] How can I reset a FileInputStream?
Hello Kenton, currently I have the following problem: I have a very big file with many small messages serialized with protobuf. Each message contains its own separator and thus can be found even in an unsynchronized stream. I move through this file using lseek64, because FileInputStream::Skip only works in the forward direction and FileInputStream::BackUp can move back only up to the current buffer boundary. Since I am the owner of the file descriptor, which is also used by the FileInputStream, I can seek to any random position in the file. However, after seek'ing, my FileInputStream is obviously in an unusable state and has to be reset. Currently the only feasible solution is to replace the current FileInputStream object with a new one, which somehow is quite inefficient! Wouldn't it make sense to add a member function which resets a FileInputStream to the state of a freshly opened and repositioned file descriptor? Or is there any other solution to randomly access the raw content of the file, say by wrapping seek? Regards, Jacob
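The inconsistency described above can be reproduced with a minimal buffered reader over a POSIX file descriptor. BufferedFdReader is an invented name for this sketch, not part of protobuf; it shows why an externally lseek()'ed fd makes the buffer stale, and how a Reset() that merely discards the buffer restores consistency without recreating the object.

```cpp
#include <cassert>
#include <cstdio>
#include <unistd.h>

// A reader that buffers a file descriptor it does not own. If the owner
// repositions the fd with lseek(), the buffered bytes no longer match the
// fd position; Reset() throws the buffer away so the next ReadByte()
// refills from wherever the fd now points.
class BufferedFdReader {
 public:
  explicit BufferedFdReader(int fd) : fd_(fd), pos_(0), len_(0) {}

  // Equivalent of the Reset() proposed in the thread: drop buffered data,
  // no reallocation, no object recreation.
  void Reset() { pos_ = len_ = 0; }

  // Returns the next byte, or -1 on EOF/error.
  int ReadByte() {
    if (pos_ == len_) {
      ssize_t n = read(fd_, buf_, sizeof(buf_));
      if (n <= 0) return -1;
      len_ = static_cast<int>(n);
      pos_ = 0;
    }
    return buf_[pos_++];
  }

 private:
  int fd_;
  unsigned char buf_[64];
  int pos_, len_;
};
```

Without the Reset() call after the owner's lseek(), ReadByte() would keep serving bytes from the stale buffer, which is exactly the "unusable state" the message describes.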
Re: [protobuf] Re: Protocol Buffers using Lzip
Hello Chris,

2009/12/10 Christopher Smith cbsm...@gmail.com:
> One compression algo that I thought would be particularly useful with
> PB's would be LZO. It lines up nicely with PB's goals of being fast and
> compact. Have you thought about allowing an integrated LZO stream?
> --Chris

My goal is to compress huge amounts (>5GB) of small serialized chunks (~150...500 bytes) into a single stream, while still being able to randomly access each part of it without having to decompress the whole stream. GzipOutputStream (with level 5) reduces the size to about 40% compared to the uncompressed binary stream, whereas my LzipOutputStream (with level 5) reduces the size to about 20%. The difficulty with gzip is finding synchronizing boundaries in the stream during decompression. If your aim is to exchange small messages, say by RPC, then a fast but less efficient algorithm is the right choice. If, however, you want to store huge amounts of data permanently, your requirements may be different. In my opinion, generic streaming classes such as ZeroCopyIn/OutputStream should offer different compression algorithms for different purposes. LZO has advantages if used for communication of small to medium-sized chunks of data. LZMA, on the other hand, has advantages if you have to store lots of data for the long term. GZIP is somewhere in the middle. Unfortunately Kenton has another opinion about adding too many compression streaming classes. Today I studied the API of LZO. From what I have seen, I think one could implement two LzoIn/OutputStream classes. LZO compression, however, has a small drawback; let me explain why: the LZO API is not intended to be used for streams. Instead it always compresses and decompresses a whole block. This is different behaviour from gzip and lzip, which are intended to compress streams. A compression class has a fixed-size buffer of typically 8 or 64kB. When this buffer is filled with data, lzip and gzip digest the input and you can start to fill the buffer from the beginning.
The LZO compressor, on the other hand, has to compress the whole buffer in one step. The next block then has to be concatenated with the already compressed data, which means that during decompression you have to fiddle these chunks apart. If your intention is to compress chunks of data of, say, less than 64kB each, and then to put them on the wire, then LZO is the right solution for you. For my requirements, as you will understand now, LZO does not really fit well. If there is strong interest in an alternative Protocol Buffer compression stream, don't hesitate to contact me. Jacob
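The framing problem described for a block compressor like LZO can be sketched with plain length-prefixed blocks. The compression itself is deliberately omitted here (identity placeholder); the point is the extra bookkeeping a block-oriented codec forces on a stream: every block must be written with its size, and the reader must use those sizes to pull the concatenated blocks apart again. Any real LZO integration would apply e.g. its block compress call per block, which is an assumption about how such a stream class would be built, not an existing protobuf API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Writer side: each (hypothetically compressed) block gets a 4-byte
// big-endian length prefix before being appended to the output stream.
void AppendFramedBlock(std::string* out, const std::string& block) {
  uint32_t n = static_cast<uint32_t>(block.size());
  for (int shift = 24; shift >= 0; shift -= 8)
    out->push_back(static_cast<char>((n >> shift) & 0xff));
  out->append(block);
}

// Reader side: "fiddle the chunks apart" by walking the length prefixes.
std::vector<std::string> SplitFramedBlocks(const std::string& in) {
  std::vector<std::string> blocks;
  size_t pos = 0;
  while (pos + 4 <= in.size()) {
    uint32_t n = 0;
    for (int i = 0; i < 4; ++i)
      n = (n << 8) | static_cast<unsigned char>(in[pos + i]);
    pos += 4;
    blocks.push_back(in.substr(pos, n));
    pos += n;
  }
  return blocks;
}
```

Stream-oriented codecs like gzip and lzip avoid this entirely, which is why they stack more naturally under ZeroCopyIn/OutputStream.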
[protobuf] Protocol Buffers using Lzip
Hello Brian, hello Kenton, hello list, as an alternative to GzipInputStream and GzipOutputStream, I have written a compression and a decompression stream class which are stackable into Protocol Buffers streams. They are named LzipInputStream and LzipOutputStream and use the Lempel-Ziv-Markov chain algorithm, as implemented by LZIP: http://www.nongnu.org/lzip/lzip.html
An advantage of using Lzip instead of Gzip is that Lzip supports multi-member compression. So one can jump into the stream at any position, forward up to the next synchronization boundary, and start reading from there. Using the default compression level, Lzip has a better compression ratio at the cost of being slower than Gzip, but when Lzip is used with a low compression level, the speed and output size of Lzip are comparable to those of Gzip. I would like to donate these classes to the protobuf software repository. They will be released under an OSS license compatible with LZIP's and Google's. Could someone please check them and tell me in what kind of repository I can publish them? In Google's license agreement there is a passage saying: "Neither the name of Google Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission." Since I have to use the name "google" in the C++ namespace of LzipIn/OutputStream, I hereby ask for permission to do so. Comments are appreciated, Jacob

// This file contains the implementation of classes
// LzipInputStream and LzipOutputStream, used to compress
// and decompress Google's Protocol Buffer streams using
// the Lempel-Ziv-Markov chain algorithm.
//
// Derived from http://protobuf.googlecode.com/svn/tags/2.2.0/src/google/protobuf/io/gzip_stream.cc
// Copyright 2009 by Jacob Rief jacob.r...@gmail.com
// Evaluation copy - don't use in production code

#include "lzip_stream.h"
#include <google/protobuf/stubs/common.h>

namespace google {
namespace protobuf {
namespace io {

static const int kDefaultBufferSize = 8192;

// === LzipInputStream ===

LzipInputStream::LzipInputStream(ZeroCopyInputStream* sub_stream)
  : sub_stream_(sub_stream),
    finished_(false),
    output_buffer_length_(kDefaultBufferSize),
    output_buffer_(operator new(output_buffer_length_)),
    output_position_(NULL),
    next_out_(NULL),
    avail_out_(0),
    errno_(LZ_ok) {
  GOOGLE_CHECK(output_buffer_ != NULL);
  decoder_ = LZ_decompress_open();
  errno_ = LZ_decompress_errno(decoder_);
  GOOGLE_CHECK(errno_ == LZ_ok);
}

LzipInputStream::~LzipInputStream() {
  if (decoder_ != NULL) {
    Close();
  }
  if (output_buffer_ != NULL) {
    operator delete(output_buffer_);
  }
}

bool LzipInputStream::Close() {
  errno_ = LZ_decompress_errno(decoder_);
  bool ok = LZ_decompress_close(decoder_) == LZ_ok;
  decoder_ = NULL;
  return ok;
}

// --- implements ZeroCopyInputStream ---

bool LzipInputStream::Next(const void** data, int* size) {
  GOOGLE_CHECK_GE(next_out_, output_position_);
  if (next_out_ == output_position_) {
    if (finished_ && LZ_decompress_finished(decoder_))
      return false;
    output_position_ = next_out_ = static_cast<uint8_t*>(output_buffer_);
    avail_out_ = output_buffer_length_;
    Decompress();
  }
  *data = output_position_;
  *size = next_out_ - output_position_;
  output_position_ = next_out_;
  return true;
}

void LzipInputStream::BackUp(int count) {
  GOOGLE_CHECK_GE(output_position_ - static_cast<uint8_t*>(output_buffer_), count);
  output_position_ -= count;
}

bool LzipInputStream::Skip(int count) {
  const void* data;
  int size;
  bool ok = Next(&data, &size);
  while (ok && (size < count)) {
    count -= size;
    ok = Next(&data, &size);
  }
  if (size > count) {
    BackUp(size - count);
  }
  return ok;
}

int64 LzipInputStream::ByteCount() const {
  return LZ_decompress_total_out_size(decoder_);
}

// --- private ---

void LzipInputStream::Decompress() {
  GOOGLE_CHECK_GT(avail_out_, 0);
  if (!finished_) {
    int avail_in;
    const void* next_in;
    if (sub_stream_->Next(&next_in, &avail_in)) {
      int bytes_written = LZ_decompress_write(
          decoder_, static_cast<const uint8_t*>(next_in), avail_in);
      errno_ = LZ_decompress_errno(decoder_);
      GOOGLE_CHECK(errno_ == LZ_ok);
      GOOGLE_CHECK_GE(bytes_written, 0);
      sub_stream_->BackUp(avail_in - bytes_written);
    } else {
      GOOGLE_CHECK(LZ_decompress_finish(decoder_) == LZ_ok);
      finished_ = true;
    }
  }
  int bytes_read = LZ_decompress_read(decoder_, next_out_, avail_out_);
  errno_ = LZ_decompress_errno(decoder_);
  GOOGLE_CHECK(errno_ == LZ_ok);
  GOOGLE_CHECK_GE(bytes_read, 0);
  next_out_ += bytes_read;
  avail_out_ -= bytes_read
Unable to read from concatenated zip's using GzipInputStream in protobuf-2.2.0 with zlib-1.2.3
I use protobuf to write self-delimited messages to a file. When I use FileOutputStream, I can close the stream, reopen it at a later time for writing, close it again, and then parse the whole file. When I try to do the same after writing with GzipOutputStream and then parsing with GzipInputStream, I can read up to the end of the first chunk, but then CodedInputStream::ReadRaw returns false and my application loses its sync. If, however, I first uncompress the written file with gunzip and then use FileInputStream to decode it, everything works fine. Also, if I lseek the file descriptor to the beginning of the second chunk (1f 8b 08 ...) and create a new GzipInputStream object using that file descriptor, I can read everything. I did some debugging and found out that when I use a zipped file with one chunk (the normal case) and hit the EOF, in GzipInputStream::Next, Inflate() returns Z_STREAM_END and zcontext_.avail_in is 0. When I do the same test with a concatenated file, on reaching the end of the first chunk, in GzipInputStream::Next, Inflate() returns Z_STREAM_END and zcontext_.avail_in is 1129, which means that zlib has some unprocessed bytes in the input buffer.
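The workaround mentioned above (lseek'ing to the start of the second chunk at bytes 1f 8b 08) can be sketched as a scan for gzip member boundaries. This is a heuristic, not part of protobuf or zlib: the three-byte sequence can in principle also occur inside compressed data, so robust code should confirm a candidate offset by actually trying to inflate from it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Locate candidate gzip member starts in a buffer by scanning for the
// member header magic: 0x1f 0x8b followed by compression method 0x08
// (deflate). Each returned offset is a position one could lseek() to
// before constructing a fresh GzipInputStream.
std::vector<size_t> FindGzipMemberOffsets(const std::string& data) {
  std::vector<size_t> offsets;
  for (size_t i = 0; i + 2 < data.size(); ++i) {
    if (static_cast<unsigned char>(data[i]) == 0x1f &&
        static_cast<unsigned char>(data[i + 1]) == 0x8b &&
        static_cast<unsigned char>(data[i + 2]) == 0x08) {
      offsets.push_back(i);
    }
  }
  return offsets;
}
```

With the offsets in hand, the reader can recreate a GzipInputStream at each member boundary, mirroring the manual recovery described in the report.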