[jira] [Commented] (KUDU-2854) Short circuit predicates on dictionary-coded columns
[ https://issues.apache.org/jira/browse/KUDU-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869715#comment-16869715 ]

Todd Lipcon commented on KUDU-2854:
---

bq. But currently we don't have a quick way to judge if there is any delta for the whole column(cfile) or the whole data block(part of cfile).

I think we could make some changes in DeltaTracker::WrapIterator and DeltaTracker::NewDeltaIterator so that, if there are no relevant DeltaFiles, and the only relevant DMS is empty, we could avoid wrapping the base iterator. This is the special case of "no deltas at all" which is a little different than "no deltas for a specific column". Still, that's a useful optimization (and common that we have no deltas). KUDU-2855 would also make this easier to implement.

bq. Is there any way we can easily judge if a column contain deltas or if a data block contain deltas?

After DeltaIterator::PrepareBatch is called, we can use MayHaveDeltas() to see on a per-block basis whether there were any deltas. We can extend this method to be MayHaveDeltas(col_idx). Note that we already use this to determine whether we can push down predicates into the block decoder here in DeltaApplier::MaterializeColumn:

{code}
// Data with updates cannot be evaluated at the decoder-level.
if (delta_iter_->MayHaveDeltas()) {
  ctx->SetDecoderEvalNotSupported();
  RETURN_NOT_OK(base_iter_->MaterializeColumn(ctx));
  RETURN_NOT_OK(delta_iter_->ApplyUpdates(ctx->col_idx(), ctx->block(), *ctx->sel()));
} else {
  RETURN_NOT_OK(base_iter_->MaterializeColumn(ctx));
}
{code}

> Short circuit predicates on dictionary-coded columns
>
> Key: KUDU-2854
> URL: https://issues.apache.org/jira/browse/KUDU-2854
> Project: Kudu
> Issue Type: Improvement
> Components: cfile, perf, tserver
> Reporter: Todd Lipcon
> Priority: Major
>
> In the common case that a column has no updates in a given DRS, if we see
> that no entries in the dictionary match the predicate, we can short circuit
> at a few layers:
> - we can store a flag in the cfile footer that indicates that all blocks are
>   dict-coded (ie there are no fallbacks). In that case, we can skip the whole
>   rowset
> - if a cfile is partially dict-encoded, we can skip any dict-coded blocks
>   without decoding the dictionary words

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
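To illustrate the MayHaveDeltas(col_idx) extension suggested in the comment above, here is a minimal sketch of per-batch, per-column delta bookkeeping. Everything in it is an assumption made for illustration: the class BatchDeltaSummary and the methods RecordUpdate and RecordRowLevelChange are hypothetical names, not Kudu's API, and treating a row-level change (such as a delete) as affecting every column is a conservative choice of this sketch rather than something the thread states.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical summary of the deltas seen while preparing one batch.
// None of these names come from Kudu; they only illustrate the idea of
// answering "may this column have deltas?" instead of "any deltas at all?".
class BatchDeltaSummary {
 public:
  explicit BatchDeltaSummary(size_t num_cols) : col_updated_(num_cols, false) {}

  // Record an update that touches a single column.
  void RecordUpdate(size_t col_idx) {
    assert(col_idx < col_updated_.size());
    col_updated_[col_idx] = true;
    any_delta_ = true;
  }

  // Record a row-level change (for example a delete). This sketch treats it
  // conservatively as affecting every column.
  void RecordRowLevelChange() {
    row_level_change_ = true;
    any_delta_ = true;
  }

  // Coarse check: were there any deltas at all in this batch?
  bool MayHaveDeltas() const { return any_delta_; }

  // Finer check: could this particular column differ from the base data?
  bool MayHaveDeltas(size_t col_idx) const {
    assert(col_idx < col_updated_.size());
    return row_level_change_ || col_updated_[col_idx];
  }

 private:
  std::vector<bool> col_updated_;
  bool row_level_change_ = false;
  bool any_delta_ = false;
};
```

With a summary like this, a caller could keep decoder-level predicate evaluation enabled for a column whose per-column check returns false, even when other columns in the same batch were updated.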
[jira] [Commented] (KUDU-2854) Short circuit predicates on dictionary-coded columns
[ https://issues.apache.org/jira/browse/KUDU-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867570#comment-16867570 ]

ZhangYao commented on KUDU-2854:

I read the related code. As the Description says, if no entries in the dictionary match the predicate, we can short circuit at a few layers. But currently we don't have a quick way to judge whether there is any delta for the whole column (cfile) or for a whole data block (part of a cfile). The base data's dictionary may not match the predicate, but that can change after deltas are applied.

The cfile reader reads data batch by batch, so we can only judge whether there are any deltas for the current batch. If no entries match the predicate and the batch has no deltas, we can short circuit the subsequent data copy. This has already been implemented in BinaryDictBlockDecoder::CopyNextAndEval.

Is there any way we can easily judge whether a column contains deltas, or whether a data block contains deltas?
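The short-circuit conditions discussed above can be sketched as a small decision function. This is an illustration under stated assumptions only: ColumnScanInfo, SkipDecision, and DecideSkip are hypothetical names, the all_blocks_dict_coded field stands in for the cfile footer flag proposed in the issue description, and may_have_deltas stands in for whatever delta check ends up being available.

```cpp
#include <cassert>
#include <vector>

// Hypothetical outcome of the short-circuit check for one column in one rowset.
enum class SkipDecision {
  kSkipWholeRowset,     // every block is dict-coded and nothing can match
  kSkipDictBlocksOnly,  // dict-coded blocks can be skipped, fallback blocks cannot
  kMustScan             // deltas may exist or some dictionary entry matches
};

// Hypothetical inputs; none of these fields are Kudu data structures.
struct ColumnScanInfo {
  std::vector<bool> dict_matches;  // per dictionary entry: does it satisfy the predicate?
  bool all_blocks_dict_coded;      // stand-in for the proposed cfile footer flag
  bool may_have_deltas;            // could deltas change this column's values?
};

inline SkipDecision DecideSkip(const ColumnScanInfo& info) {
  // Deltas can turn a non-matching base value into a matching one, so the
  // dictionary alone is not enough to rule out matches.
  if (info.may_have_deltas) return SkipDecision::kMustScan;

  // If any dictionary entry matches, rows may match and must be evaluated.
  for (bool matches : info.dict_matches) {
    if (matches) return SkipDecision::kMustScan;
  }

  // No dictionary entry matches and there are no deltas.
  return info.all_blocks_dict_coded ? SkipDecision::kSkipWholeRowset
                                    : SkipDecision::kSkipDictBlocksOnly;
}
```

The delta check comes first because it is the condition that invalidates the other two: once a column may have deltas, a dictionary with no matching entries says nothing about the final values.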
[jira] [Commented] (KUDU-2854) Short circuit predicates on dictionary-coded columns
[ https://issues.apache.org/jira/browse/KUDU-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864309#comment-16864309 ]

Todd Lipcon commented on KUDU-2854:
---

I think we should also consider fast-pathing equality checks for dictionary predicates. Currently, we evaluate the predicate against the dictionary and come up with a bitmap of matching values. Then, for each codeword, we test the corresponding bit in the bitmap. That bitmap testing likely requires a few cycles and a branch, and can't readily be done with SIMD outside of AVX512 gather instructions.

In the case that we see that exactly one dictionary value matches the predicate, we can transform it into an equality predicate on the codewords, and then use the SIMD-optimized equality code path. I don't have perf numbers on hand, but I know I often query datasets using equality predicates on dictionary-coded columns.
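The single-match fast path described above can be sketched as follows. This is a scalar illustration, not Kudu's decoder: the function name EvalOnCodewords, the use of std::vector<bool> as the match bitmap, and the byte-per-row selection output are all assumptions. The equality loop is written in a simple compare-and-store form that compilers can often auto-vectorize, standing in for the SIMD-optimized equality path the comment refers to.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: given which dictionary entries satisfy the predicate and
// the codewords of a block, return one byte per row (1 = row matches).
std::vector<uint8_t> EvalOnCodewords(const std::vector<bool>& dict_matches,
                                     const std::vector<uint32_t>& codewords) {
  // Count matching dictionary entries and remember the last one seen.
  size_t num_matches = 0;
  uint32_t only_match = 0;
  for (size_t i = 0; i < dict_matches.size(); ++i) {
    if (dict_matches[i]) {
      ++num_matches;
      only_match = static_cast<uint32_t>(i);
    }
  }

  std::vector<uint8_t> sel(codewords.size(), 0);

  // No dictionary entry matches: no row can match.
  if (num_matches == 0) return sel;

  // Exactly one entry matches: the predicate becomes "codeword == only_match",
  // which needs no per-row bitmap lookup.
  if (num_matches == 1) {
    for (size_t i = 0; i < codewords.size(); ++i) {
      sel[i] = (codewords[i] == only_match) ? 1 : 0;
    }
    return sel;
  }

  // General case: look up each codeword in the match bitmap.
  for (size_t i = 0; i < codewords.size(); ++i) {
    assert(codewords[i] < dict_matches.size());
    sel[i] = dict_matches[codewords[i]] ? 1 : 0;
  }
  return sel;
}
```

Both paths produce the same selection; the single-match path only changes how each row is tested, replacing a data-dependent lookup with a comparison against a constant.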