For compression, I'm also interested in investigating the pure Java
compression codecs written by the Presto project:

https://github.com/airlift/aircompressor

They've implemented LZ4, Snappy, and LZO in pure Java.

On Thu, Jun 23, 2016 at 8:04 PM, Gopal Vijayaraghavan <[email protected]>
wrote:

> > Though, I'm also wondering about the performance difference between
> >the two. Since they both use native implementations, theoretically they
> >can be close in performance.
>
> ZlibCompressor block compression was extremely slow due to the non-JNI
> bits in Hadoop - <https://issues.apache.org/jira/browse/HADOOP-10681>
>
> When I last benchmarked, after that issue was fixed, 86% of CPU samples
> were spent inside zlib.so in the perf traces, irrespective of which mode
> was used.
>
> The results of those profiles went into making ORC fit into Zlib better
> and avoid doing compression work twice, since ORC already did its own
> versions of dictionary+rle+bit-packing.
>
> <http://www.slideshare.net/Hadoop_Summit/orc-2015-faster-better-smaller-49481231/22>
>
> For instance, bit-packing data with values up to 127 into 7 bits and then
> compressing it offered less compression (& cost more CPU) than leaving it
> at 8 bits without reduction. LZ77 worked much better on the unpacked
> bytes, and the Huffman stage compresses the data by bit-packing anyway.
> The impact was more visible at higher bit-counts (like 27 bits is way
> worse than 32 bits).
>
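
To make the 7-bit versus 8-bit point concrete, here is a minimal sketch
that deflates the same values once left at one byte each and once
bit-packed to 7 bits. It uses only java.util.zip.Deflater; the class
name, the pack7 helper, and the synthetic data are my own assumptions
for illustration and are not ORC code. The sizes it prints depend
entirely on the data, so treat it as a way to measure rather than as a
result.

```java
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical helper class, not part of ORC or Hadoop.
final class BitPackVsDeflate {

    // Pack the low 7 bits of each value into a contiguous bit stream,
    // most significant bit first.
    static byte[] pack7(byte[] values) {
        byte[] out = new byte[(values.length * 7 + 7) / 8];
        int bitPos = 0;
        for (byte v : values) {
            for (int b = 6; b >= 0; b--) {
                if (((v >> b) & 1) != 0) {
                    out[bitPos >> 3] |= (byte) (0x80 >> (bitPos & 7));
                }
                bitPos++;
            }
        }
        return out;
    }

    // Deflate the input with the given strategy and return the compressed size.
    static int deflatedSize(byte[] input, int strategy) {
        Deflater d = new Deflater(Deflater.DEFAULT_COMPRESSION);
        d.setStrategy(strategy);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!d.finished()) {
            total += d.deflate(buf);
        }
        d.end();
        return total;
    }

    public static void main(String[] args) {
        // Synthetic data (an assumption): a repeating ramp with a little noise,
        // every value below 128 so that 7 bits are enough.
        Random rnd = new Random(42);
        byte[] values = new byte[64 * 1024];
        for (int i = 0; i < values.length; i++) {
            values[i] = (byte) ((i % 50) + rnd.nextInt(4));
        }
        byte[] packed = pack7(values);

        System.out.println("raw 8-bit bytes:      " + values.length);
        System.out.println("raw 7-bit packed:     " + packed.length);
        System.out.println("deflated 8-bit:       "
                + deflatedSize(values, Deflater.DEFAULT_STRATEGY));
        System.out.println("deflated 7-bit packed: "
                + deflatedSize(packed, Deflater.DEFAULT_STRATEGY));
    }
}
```

The packed stream only realigns to a byte boundary every 8 values, so
byte-oriented LZ77 matching sees fewer repeats in it; that is the effect
worth checking on real column data.
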
> And then came turning off the bits of Zlib that are not necessary for
> some encoding patterns: Z_FILTERED, for instance, for numeric sequences,
> Z_TEXT for the string dicts, etc.
>
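
For anyone who wants to try the strategy idea from plain Java: the JDK's
java.util.zip.Deflater exposes DEFAULT_STRATEGY, FILTERED, and
HUFFMAN_ONLY through setStrategy. As far as I know, Z_TEXT is a zlib
data-type hint rather than a strategy and has no setter in
java.util.zip, so the sketch below only compares the three strategies.
The class name and the sample numeric sequence are assumptions made for
illustration; this is not how ORC's own codec is written.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, not part of ORC or Hadoop.
final class DeflaterStrategies {

    // Compress the input with the given Deflater strategy.
    static byte[] compress(byte[] input, int strategy) {
        Deflater d = new Deflater(Deflater.DEFAULT_COMPRESSION);
        d.setStrategy(strategy);
        d.setInput(input);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) {
            int n = d.deflate(buf);
            out.write(buf, 0, n);
        }
        d.end();
        return out.toByteArray();
    }

    // Inflate back to a buffer of known length; the strategy is not needed
    // on the read side because the output is ordinary zlib data.
    static byte[] decompress(byte[] compressed, int originalLength) {
        Inflater inf = new Inflater();
        inf.setInput(compressed);
        byte[] result = new byte[originalLength];
        int off = 0;
        try {
            while (off < originalLength && !inf.finished()) {
                int n = inf.inflate(result, off, originalLength - off);
                if (n == 0) {
                    break; // no progress: truncated or unexpected input
                }
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt zlib stream", e);
        } finally {
            inf.end();
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample numeric sequence (an assumption): slowly increasing ints,
        // stored little-endian, 4 bytes each.
        int count = 16 * 1024;
        ByteBuffer bb = ByteBuffer.allocate(count * 4).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < count; i++) {
            bb.putInt(1000 + i * 3);
        }
        byte[] data = bb.array();

        System.out.println("raw:              " + data.length);
        System.out.println("DEFAULT_STRATEGY: "
                + compress(data, Deflater.DEFAULT_STRATEGY).length);
        System.out.println("FILTERED:         "
                + compress(data, Deflater.FILTERED).length);
        System.out.println("HUFFMAN_ONLY:     "
                + compress(data, Deflater.HUFFMAN_ONLY).length);
    }
}
```

Which strategy wins depends on the encoding that ran before it, which
is why it seems worth choosing per stream rather than per file.
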
> Purely from a performance standpoint, I'm getting more interested in Zstd,
> because it brings a whole new way of fast bit-packing.
>
> <https://issues.apache.org/jira/browse/ORC-45>
>
>
> Cheers,
> Gopal
>
>
>
