[jira] [Commented] (KUDU-2854) Short circuit predicates on dictionary-coded columns

2019-06-21 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869715#comment-16869715
 ] 

Todd Lipcon commented on KUDU-2854:
---

bq. But currently we don't have a quick way to judge if there is any delta for 
the whole column (cfile) or the whole data block (part of a cfile). 

I think we could make some changes in DeltaTracker::WrapIterator and 
DeltaTracker::NewDeltaIterator so that, if there are no relevant DeltaFiles 
and the only relevant DMS is empty, we avoid wrapping the base iterator. 
This is the special case of "no deltas at all", which is a little different 
from "no deltas for a specific column". Still, that's a useful optimization 
(it is common to have no deltas). KUDU-2855 would also make this easier to 
implement.
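
The "no deltas at all" fast path described above might look roughly like the 
following. This is only a sketch under stated assumptions: RowIteratorSketch, 
DeltaApplyingIteratorSketch, DeltaTrackerSketch and its two counters are 
hypothetical stand-ins for illustration, not Kudu's actual classes or the real 
signature of DeltaTracker::WrapIterator.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical stand-in for a base-data iterator (not Kudu's real interface).
struct RowIteratorSketch {
  virtual ~RowIteratorSketch() = default;
  virtual bool applies_deltas() const { return false; }
};

// Hypothetical wrapper that would apply deltas on top of the base iterator.
struct DeltaApplyingIteratorSketch : RowIteratorSketch {
  explicit DeltaApplyingIteratorSketch(std::unique_ptr<RowIteratorSketch> base)
      : base_(std::move(base)) {}
  bool applies_deltas() const override { return true; }
  std::unique_ptr<RowIteratorSketch> base_;
};

// Hypothetical tracker state: how many delta files are relevant to the scan,
// and how many entries the in-memory delta store (DMS) currently holds.
struct DeltaTrackerSketch {
  int num_relevant_delta_files = 0;
  std::size_t dms_entry_count = 0;

  bool HasNoDeltas() const {
    return num_relevant_delta_files == 0 && dms_entry_count == 0;
  }

  // Returns the base iterator untouched when there is nothing to apply.
  std::unique_ptr<RowIteratorSketch> WrapIterator(
      std::unique_ptr<RowIteratorSketch> base) const {
    assert(base != nullptr);
    if (HasNoDeltas()) {
      return base;  // Fast path: no wrapper, so decoder-level eval stays possible.
    }
    return std::make_unique<DeltaApplyingIteratorSketch>(std::move(base));
  }
};
```

With no wrapper in the way, the scan never pays for delta bookkeeping on that 
rowset.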

bq. Is there any way we can easily judge if a column contains deltas or if a 
data block contains deltas?

After DeltaIterator::PrepareBatch is called, we can use MayHaveDeltas() to see 
on a per-block basis whether there were any deltas. We can extend this method 
to be MayHaveDeltas(col_idx). Note that we already use this to determine 
whether we can push down predicates into the block decoder here in 
DeltaApplier::MaterializeColumn:

{code}
  // Data with updates cannot be evaluated at the decoder-level.
  if (delta_iter_->MayHaveDeltas()) {
    ctx->SetDecoderEvalNotSupported();
    RETURN_NOT_OK(base_iter_->MaterializeColumn(ctx));
    RETURN_NOT_OK(delta_iter_->ApplyUpdates(ctx->col_idx(), ctx->block(),
                                            *ctx->sel()));
  } else {
    RETURN_NOT_OK(base_iter_->MaterializeColumn(ctx));
  }
{code}
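
A per-column variant of MayHaveDeltas could be sketched as below. 
Everything here is an assumption made for illustration: 
PreparedDeltaBatchSketch, RecordUpdate and RecordRowLevelChange are 
hypothetical names, and treating row-level changes (deletes or reinserts) as 
affecting every column is a deliberately conservative choice, not a statement 
about how Kudu handles them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical summary of the deltas seen while preparing one batch.
class PreparedDeltaBatchSketch {
 public:
  explicit PreparedDeltaBatchSketch(std::size_t num_cols)
      : updated_cols_(num_cols, 0) {}

  // Called for each column touched by an UPDATE in the prepared batch.
  void RecordUpdate(std::size_t col_idx) {
    assert(col_idx < updated_cols_.size());
    updated_cols_[col_idx] = 1;
  }

  // Called for a DELETE or REINSERT; conservatively affects all columns.
  void RecordRowLevelChange() { has_row_level_change_ = true; }

  // Batch-wide answer: true if any kind of delta was recorded.
  bool MayHaveDeltas() const {
    return has_row_level_change_ ||
           std::any_of(updated_cols_.begin(), updated_cols_.end(),
                       [](unsigned char c) { return c != 0; });
  }

  // Proposed extension: answer for a single column only.
  bool MayHaveDeltas(std::size_t col_idx) const {
    assert(col_idx < updated_cols_.size());
    return has_row_level_change_ || updated_cols_[col_idx] != 0;
  }

 private:
  std::vector<unsigned char> updated_cols_;
  bool has_row_level_change_ = false;
};
```

A caller like DeltaApplier::MaterializeColumn could then pass ctx->col_idx() 
and keep decoder-level evaluation for columns that were not updated, even when 
other columns in the same batch were.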

> Short circuit predicates on dictionary-coded columns
> 
>
> Key: KUDU-2854
> URL: https://issues.apache.org/jira/browse/KUDU-2854
> Project: Kudu
>  Issue Type: Improvement
>  Components: cfile, perf, tserver
>Reporter: Todd Lipcon
>Priority: Major
>
> In the common case that a column has no updates in a given DRS, if we see 
> that no entries in the dictionary match the predicate, we can short circuit 
> at a few layers:
> - we can store a flag in the cfile footer that indicates that all blocks are 
> dict-coded (i.e. there are no fallbacks). In that case, we can skip the whole 
> rowset
> - if a cfile is partially dict-encoded, we can skip any dict-coded blocks 
> without decoding the dictionary words



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KUDU-2854) Short circuit predicates on dictionary-coded columns

2019-06-19 Thread ZhangYao (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867570#comment-16867570
 ] 

ZhangYao commented on KUDU-2854:


After reading the related code: as the Description says, if no entries in the 
dictionary match the predicate, we can short circuit at a few layers.

But currently we don't have a quick way to judge whether there is any delta for 
the whole column (cfile) or the whole data block (part of a cfile). Even if the 
base data's dictionary does not match the predicate, that may change after 
applying deltas. The cfile reads data batch by batch, so we can only judge 
whether there are any deltas for the current batch, and short circuit the 
subsequent data copy if no entries match the predicate and there are no deltas 
for this batch. This has been implemented in 
BinaryDictBlockDecoder::CopyNextAndEval.
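
The per-batch short circuit described above can be illustrated with a small 
sketch. It is not the real BinaryDictBlockDecoder::CopyNextAndEval: the 
function EvalDictBatchSketch, the BatchEvalResult type, and the 
batch_may_have_deltas flag are hypothetical names used only to show the 
control flow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical result: which rows stay selected, and whether the decoder
// had to copy values out of the block.
struct BatchEvalResult {
  std::vector<uint8_t> selected;
  bool copied_values;
};

BatchEvalResult EvalDictBatchSketch(
    const std::vector<std::string>& dictionary,
    const std::vector<uint32_t>& codewords,
    const std::function<bool(const std::string&)>& predicate,
    bool batch_may_have_deltas) {
  const std::size_t n = codewords.size();
  if (batch_may_have_deltas) {
    // Updates may change any value, so the predicate cannot be decided here:
    // copy everything and leave all rows selected for later evaluation.
    return {std::vector<uint8_t>(n, 1), true};
  }
  // Evaluate the predicate once per dictionary entry, not once per row.
  std::vector<uint8_t> dict_matches(dictionary.size(), 0);
  bool any_match = false;
  for (std::size_t i = 0; i < dictionary.size(); ++i) {
    dict_matches[i] = predicate(dictionary[i]) ? 1 : 0;
    any_match = any_match || dict_matches[i] != 0;
  }
  if (!any_match) {
    // Short circuit: no row can match, so skip the data copy entirely.
    return {std::vector<uint8_t>(n, 0), false};
  }
  std::vector<uint8_t> selected(n);
  for (std::size_t i = 0; i < n; ++i) {
    assert(codewords[i] < dict_matches.size());
    selected[i] = dict_matches[codewords[i]];
  }
  return {selected, true};
}
```

The limitation raised in this comment is visible in the signature: the 
decision depends on a per-batch delta flag, so nothing larger than one batch 
can be skipped this way.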


Is there any way we can easily judge whether a column contains deltas or 
whether a data block contains deltas?



[jira] [Commented] (KUDU-2854) Short circuit predicates on dictionary-coded columns

2019-06-14 Thread Todd Lipcon (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864309#comment-16864309
 ] 

Todd Lipcon commented on KUDU-2854:
---

I think we should also consider fast-pathing equality checks for dictionary 
predicates. Currently, we evaluate the predicate against the dictionary and 
come up with a bitmap of matching values. Then, for each codeword, we test the 
corresponding bit in the bitmap. That bitmap testing likely requires a few 
cycles and a branch, and can't readily be done with SIMD outside of AVX512 
gather instructions.

In the case that we see that exactly one dictionary value matches the 
predicate, we can transform it into an equality predicate on the codewords, and 
then use the SIMD-optimized equality code path.
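
As a rough illustration of that transformation, the sketch below picks 
between the two paths. SelectRowsSketch is a hypothetical helper, and the 
plain comparison loop only stands in for the SIMD-optimized equality code 
path; no claim is made about how that path is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// dict_matches[i] is nonzero if dictionary entry i satisfies the predicate.
// Returns one selection byte per row (1 = row matches).
std::vector<uint8_t> SelectRowsSketch(const std::vector<uint8_t>& dict_matches,
                                      const std::vector<uint32_t>& codewords) {
  // Count matching dictionary entries and remember the last matching code.
  std::size_t num_matches = 0;
  uint32_t only_code = 0;
  for (std::size_t i = 0; i < dict_matches.size(); ++i) {
    if (dict_matches[i] != 0) {
      ++num_matches;
      only_code = static_cast<uint32_t>(i);
    }
  }

  std::vector<uint8_t> selected(codewords.size());
  if (num_matches == 1) {
    // Fast path: the predicate is equivalent to "codeword == only_code".
    // A branch-free compare like this is a good candidate for SIMD.
    for (std::size_t i = 0; i < codewords.size(); ++i) {
      selected[i] = (codewords[i] == only_code) ? 1 : 0;
    }
  } else {
    // General path: one lookup in the match table per codeword.
    for (std::size_t i = 0; i < codewords.size(); ++i) {
      assert(codewords[i] < dict_matches.size());
      selected[i] = dict_matches[codewords[i]] != 0 ? 1 : 0;
    }
  }
  return selected;
}
```

Both paths return the same selection; only the per-row work differs.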

I don't have perf numbers on hand but I know I often am querying datasets using 
equality predicates on dictionary-coded columns.
