For compression, I'm also interested in investigating the pure Java compression codecs developed by the Presto project:
https://github.com/airlift/aircompressor

They've implemented LZ4, Snappy, and LZO in pure Java.

On Thu, Jun 23, 2016 at 8:04 PM, Gopal Vijayaraghavan <[email protected]> wrote:

> > Though, I'm also wondering about performance difference between
> > the two. Since they both use native implementations, theoretically they
> > can be close in performance.
>
> ZlibCompressor block compression was extremely slow due to the non-JNI
> bits in Hadoop - <https://issues.apache.org/jira/browse/HADOOP-10681>
>
> When I last benchmarked after that issue was fixed, 86% of CPU samples were
> spent inside zlib.so in the perf traces - irrespective of which mode was
> used.
>
> The result of those profiles went into making ORC fit into Zlib better and
> avoid doing compression work twice - ORC did its own versions of
> dictionary+rle+bit-packing already.
>
> <http://www.slideshare.net/Hadoop_Summit/orc-2015-faster-better-smaller-49481231/22>
>
> For instance, bit-packing 127 bit data into 7 bits and then compressing it
> offered less compression (& cost more CPU) than leaving it at 8 bits
> without reduction. LZ77 worked much better and the Huffman stage
> compressed the data by bit-packing anyway. The impact was more visible at
> higher bit-counts (like 27 bits is way worse than 32 bits).
>
> And then turning off bits of Zlib not necessary for some encoding patterns
> - Z_FILTERED for instance for numeric sequences, Z_TEXT for the string
> dicts etc.
>
> Purely from a performance standpoint, I'm getting more interested in Zstd,
> because it brings a whole new way of fast bit-packing.
>
> <https://issues.apache.org/jira/browse/ORC-45>
>
> Cheers,
> Gopal
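
To make the Z_FILTERED point concrete, here is a minimal sketch of selecting a zlib strategy from Java with java.util.zip.Deflater. This is not ORC's or Hadoop's codec code: the class name, the helper methods, and the sample data (slowly increasing 32-bit integers) are all assumptions made for illustration. As far as I know java.util.zip does not expose the Z_TEXT data-type hint, so only the strategy is shown. Which strategy wins will depend on the data, so the sketch prints sizes rather than promising a result.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateStrategyDemo {

    // Compress data with the given zlib strategy
    // (Deflater.DEFAULT_STRATEGY, Deflater.FILTERED or Deflater.HUFFMAN_ONLY).
    public static byte[] compress(byte[] data, int strategy) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setStrategy(strategy);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inflate back to at most originalLength bytes, to check the round trip.
    public static byte[] decompress(byte[] compressed, int originalLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] result = new byte[originalLength];
        int off = 0;
        try {
            while (off < originalLength && !inflater.finished()) {
                int n = inflater.inflate(result, off, originalLength - off);
                if (n == 0 && inflater.needsInput()) {
                    break; // input exhausted before the expected length
                }
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt deflate stream", e);
        } finally {
            inflater.end();
        }
        return Arrays.copyOf(result, off);
    }

    // Hypothetical sample: a slowly increasing sequence of 32-bit integers,
    // written big-endian, standing in for a numeric column.
    public static byte[] sampleNumericData(int count) {
        ByteBuffer bb = ByteBuffer.allocate(count * Integer.BYTES);
        for (int i = 0; i < count; i++) {
            bb.putInt(1000 + i * 3);
        }
        return bb.array();
    }

    public static void main(String[] args) {
        byte[] data = sampleNumericData(10_000);
        byte[] viaDefault = compress(data, Deflater.DEFAULT_STRATEGY);
        byte[] viaFiltered = compress(data, Deflater.FILTERED);
        System.out.println("raw bytes:        " + data.length);
        System.out.println("DEFAULT_STRATEGY: " + viaDefault.length);
        System.out.println("FILTERED:         " + viaFiltered.length);
    }
}
```

Running main prints the raw size next to the two compressed sizes, which is a cheap way to check the strategy against your own column data before drawing any conclusions.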
