Re: RLE encoding limitation

Todd Lipcon Wed, 10 Aug 2016 00:22:53 -0700

On Tue, Aug 9, 2016 at 3:13 PM, James Pirz <[email protected]> wrote:

> Thanks Todd for your prompt reply !
> I will check Bit_Shuffle encoding.
>
> w.r.t the fix, I will be happy to try that. I realized that the initial
> shift in PutValue() :
>
> v &= (1ULL << num_bits) - 1;
>
> potentially makes the results to be incorrect when applied to the buffered
> values longer than 32 bits (simply turns them to become zero).
>
> (If it is on top of your head) can you please give me a bit of context for
> "some of the edge cases" that are pointed out in the TODO comment (in the
> BitWriter::PutValue() )?
>

I think the edge cases are things like the fact that, in the line you
quoted above, if 'num_bits' is 64, then we have the expression '(1ULL <<
64)' which invokes undefined behavior (shifting by >= the width of the
operand). I think this line could be safely replaced with:

v &= ~0ULL >> (64 - num_bits)

what we really want is the BZHI instruction (
http://www.felixcloutier.com/x86/BZHI.html) but that's not available until
very recent processors, so the above should do. From looking at generated
assembly, it's the same number of instructions as the original, just in a
different order, so shouldn't hurt perf.

On the read side, it looks like we're using a utility function
BitUtil::TrailingBits() which implements basically the same as the above,
but in a branchy way. It would be a nice optimization to remove some
branches there and consolidate code with the above.

As for other edge cases, nothing jumps out from reading the code, but the
best way to be sure would be to add some unit tests (maybe randomized tests
which mix up add/get with different bit widths). I just hacked up such a
test here:
https://gist.github.com/a777013736ce480b2062ed42f0657c72

The test fails as is since it generates values up to 64 bits, but passes
with 32-bit values, so clearly there's another bug lurking :)

-Todd

>
> Thanks.
>
> On Tue, Aug 9, 2016 at 2:34 PM, Todd Lipcon <[email protected]> wrote:
>
>> Hi James,
>>
>> Yes, that's correct. Unfortunately right now the RLE encoding code
>> doesn't support int64.
>>
>> I'd suggest trying the "BIT_SHUFFLE" encoding, which can get you a lot of
>> the same benefits as RLE and does work on int64.
>>
>> Since it seems you're good at checking out the code, I'd also be happy to
>> review a patch to fix this if you want to give it a try :)
>>
>> -Todd
>>
>> On Tue, Aug 9, 2016 at 2:19 PM, James Pirz <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I am trying to use RLE-encoding in Kudu with fixed bit-width values. I
>>> am specifically dealing with uint64 values which are supposed to be 64-bit
>>> long. I realized that the code under BitWriter (which is used in flushing
>>> RLE runs) only handles values up to 32-bit values:
>>>
>>> inline void BitWriter::PutValue(uint64_t v, int num_bits) {
>>>         // TODO: revisit this limit if necessary (can be raised to 64 by
>>> fixing some edge cases)
>>>         DCHECK_LE(num_bits, 32);
>>> ...
>>>
>>>
>>> Can you please verify if indeed such a limitation exists in Kudu for RLE
>>> encoding (i.e. it can not be applied to fixed size values longer than 32
>>> bits) and if yes is there a work around for that ?
>>> (I also checked RLE code in Impala and parquet-cpp which share the RLE
>>> code and they seem to have the same limitation).
>>>
>>> Thanks
>>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>

-- 
Todd Lipcon
Software Engineer, Cloudera

Re: RLE encoding limitation

Reply via email to