Re: [Discussion] Base 128 variable integer encoding is not always good

2018-09-20 Thread Gang Wu
Owen, yes, you are correct. I misunderstood RLEv2, which does not use LEB128. To answer your questions: 1. "RLEv1 + fixed 8 byte" in my experiment means that we don't apply LEB128 encoding to RLE literals and instead write each one directly as a fixed 8 bytes in little-endian order. 2. The data is from our production data which i
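To make the "fixed 8 byte" variant concrete, here is a minimal sketch of writing a literal as 8 little-endian bytes, next to a count of how many bytes an unsigned LEB128 encoding of the same value would take. The helper names `fixed8_le` and `leb128_len` are hypothetical, and treating the literal as a signed 64-bit value is an assumption; the thread does not show the actual writer code.

```python
import struct


def fixed8_le(value: int) -> bytes:
    """Write a 64-bit integer as exactly 8 bytes, little endian.

    Assumption: the literal is treated as a signed 64-bit value ("<q").
    """
    return struct.pack("<q", value)


def leb128_len(value: int) -> int:
    """Number of bytes an unsigned LEB128 encoding of `value` would need."""
    if value < 0:
        raise ValueError("unsigned LEB128 expects a non-negative value")
    n = 1
    while value >= 0x80:  # more than 7 significant bits remain
        value >>= 7
        n += 1
    return n


if __name__ == "__main__":
    for v in (1, 300, 2**35, 2**63 - 1):
        print(v, len(fixed8_le(v)), leb128_len(v))
```

The fixed-width form always costs 8 bytes, while LEB128 ranges from 1 byte for small values up to 10 bytes for a full 64-bit unsigned value, so which one is smaller before compression depends on the value distribution.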

Re: [Discussion] Base 128 variable integer encoding is not always good

2018-09-19 Thread Owen O'Malley
Thanks for the sample data. Just out of curiosity, is the natural data actually sorted like that? I think you have a misunderstanding of RLEv2. It doesn't use LEB128 except for the values in the header. What does RLEv1 + fixed 8 byte mean? Based on the 512 values that you posted, I see: 512 val

Re:Re: [Discussion] Base 128 variable integer encoding is not always good

2018-09-19 Thread Xiening Dai
I think the bigger issue here is the combination of zstd and LEB128, which results in a much lower compression ratio compared to zlib. This is by design for zstd level 1. And according to the answer from the zstd community (see link from Gang), this only gets better at a much higher level (say, 12). I

Re: [Discussion] Base 128 variable integer encoding is not always good

2018-09-18 Thread Gang Wu
Owen, I have put the example data to reproduce the issue in https://github.com/facebook/zstd/issues/1325. It contains 512 unsigned numbers which are already zigzag-encoded using (val << 1) ^ (val >> 63). The low overhead representation of literals is exactly what we need for RLEv3. We should also pa
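For readers who want to reproduce the zigzag step mentioned above, the following is a minimal sketch of the (val << 1) ^ (val >> 63) mapping for 64-bit values. The function names are hypothetical, and the explicit 64-bit mask is only needed because Python integers are unbounded; it is not part of the formula quoted in the message.

```python
MASK64 = (1 << 64) - 1  # keep results within an unsigned 64-bit range


def zigzag_encode(val: int) -> int:
    """Map a signed 64-bit integer to an unsigned one (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    # Python's >> on a negative int is an arithmetic shift, so (val >> 63)
    # is 0 for non-negative values and -1 (all ones) for negative values.
    return ((val << 1) ^ (val >> 63)) & MASK64


def zigzag_decode(enc: int) -> int:
    """Inverse of zigzag_encode for values in the unsigned 64-bit range."""
    return (enc >> 1) ^ -(enc & 1)


if __name__ == "__main__":
    for v in (0, -1, 1, -2, 2, -(2**63), 2**63 - 1):
        e = zigzag_encode(v)
        print(v, e, zigzag_decode(e))
```

Small magnitudes of either sign end up as small unsigned numbers, which is what lets a variable-length encoding keep them short.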

Re: [Discussion] Base 128 variable integer encoding is not always good

2018-09-18 Thread Owen O'Malley
Gang, As you correctly point out, some columns don't work well with RLE. Unfortunately, without being able to look at the data it is hard for me to guess what the right compression strategies are. Based on your description, I would guess that the data doesn't have a lot of patterns to it and cov

Re: [Discussion] Base 128 variable integer encoding is not always good

2018-09-18 Thread Gopal Vijayaraghavan
Hi, > From above observation, we find that it is better to disable LEB128 encoding > while zstd is used. You can enable file size optimizations (automatically recommend better layouts for compression) when "orc.encoding.strategy"="COMPRESSION" There are a bunch of bitpacking loops that's co

[Discussion] Base 128 variable integer encoding is not always good

2018-09-18 Thread Gang Wu
Hi, We are using zstd as the default compressor in production for ORC. Overall the performance is very good. Through our analysis, there is some room for improvement for integers. As we know, all integers use base 128 varint encoding (a.k.a. LEB128) after RLE. This works well for zlib and other com
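As background for the base 128 varint encoding discussed in this thread, here is a minimal sketch of unsigned LEB128: each output byte carries 7 bits of the value, lowest bits first, and the high bit marks that more bytes follow. The function names are hypothetical and this is not the ORC writer's code.

```python
def leb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("unsigned LEB128 expects a non-negative value")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)         # last byte has the high bit clear
            return bytes(out)


def leb128_decode(data: bytes) -> int:
    """Decode one unsigned LEB128 value from the start of `data`."""
    result = 0
    shift = 0
    for b in data:
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result
        shift += 7
    raise ValueError("truncated LEB128 input")


if __name__ == "__main__":
    for v in (0, 127, 128, 300, 2**64 - 1):
        enc = leb128_encode(v)
        print(v, enc.hex(), leb128_decode(enc))
```

Because the encoded length varies per value, consecutive values are not aligned to fixed byte boundaries, which is relevant to the interaction with the downstream compressor that this thread examines.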