[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has abandoned this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. Abandoned Not enough win for the complexity. -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: abandon Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 19 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has uploaded a new patch set (#19). Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. IMPALA-4864 Speed up single slot predicates with dictionaries When dictionaries are present we can pre-evaluate conjuncts against the dictionary values and simply look up the result. Status: Runs with ASAN, runs without crashes on ee tests. Performance results inconclusive; this may not be worth the complexity unless we get really selective or really expensive predicates. Basic idea: since we codegen so early, before we know enough details about the columns to know if they are dict filterable, if we do have dictionary filtering predicates, we codegen a guard around each dictionary filterable predicate evaluation. This guard skips evaluation of the predicate if it has already been evaluated by the dictionary. In this way, we can skip evaluation dynamically for each row group as we learn which columns are dictionary filterable, and then push predicate evaluation into the column reader. Since the branches will remain 100% predictable over the row group, this should give us the fastest way to skip over predicate evaluation without compromising the general case where we may be unable to evaluate against the dictionary. We can even do this with codegen turned off, as a side effect of the way we generate the codegen'd function when dictionary evaluation is enabled. If dictionaries aren't available for some predicates, we automatically fall back to evaluating those predicates in the original order, preserving selectivity. The overhead in this case is a perfectly predictable extra conditional per dictionary predicate. Performance: Hard to get! Simple predicates did not show improvement, in fact regressed. I used a TPC-H scale 30 dataset, duplicated 3x into a 'biglineitem' table. select count(*) from biglineitem WHERE l_returnflag = 'A'; 1.43s -> 1.53s select count(*) from biglineitem WHERE l_shipinstruct = 'DELIVER IN PERSON'; 1.43s -> 1.53s select count(*) from biglineitem WHERE l_quantity > 49; 0.93s -> 0.93s select count(*) from biglineitem WHERE instr(l_shipdate, '1994-11') > 0; 2.33s -> 1.03s So this appears to only be a win for expensive predicates. Update: I added changes to make computed predicate costs visible from the frontend to the backend, and tried a TPC-DS scale 10 dataset, which has better queries (lots of IN groups). Still there is a regression in raw query performance. Update again: reversing one branch to UNLIKELY in the ir compiled code gave these results on TPC-DSx10: Q46 (limit modified to 10): 3.05 -> 2.89 sec Q27 (limit modified to 10): 3.13 -> 3.09 sec Update (6.5.2017): Switched back to bitset evaluation for scratch batch rather than using an extra byte per tuple row. Surprisingly, this is the best performing diff yet for selective predicates. Using TPC-DS-10 with a filter: Query: select count(*) from store_sales, customer, customer_address where store_sales.ss_customer_sk = customer.c_customer_sk and customer.c_current_addr_sk = customer_address.ca_address_sk and customer_address.ca_city IN (... list of ~200 cities) This almost makes parity. I get 3836140 rows in 1.41-1.45s with this diff and in 1.39-1.45s with the same caching optimization for batch_size on top of the last change (Tim's parquest column reader optimizations). So in a totally fair comparison, we are still losing :( Update (6.8.2017): Tried Tim's suggestions. Best I could get on this query was now only 1.54s. My host OS kernel did change during this time so I re-measured baseline and got a best time of 1.53s. So we seem to be making parity, but not showcasing a big win. Maybe runtime filters will pay off better; they already are toggled dynamically so nothing should be lost. Update (6.16.2017): Tried runtime filters, simplified the logic as much as possible. Everything is still a wash: Query: select count(*) from store_sales, customer, customer_address where store_sales.ss_customer_sk = customer.c_customer_sk and customer.c_current_addr_sk = customer_address.ca_address_sk and customer_address.ca_city IN ('Providence', 'Mount Zion', 'Greenville', 'Mount Olive', 'Wildwood', 'Hillcrest', 'Crossroads', 'Belmont', 'Wilson', 'Riverdale', 'Newport', 'Springdale', 'Mountain View', 'Forest Hills', 'Bridgeport', 'White Oak', 'Oakwood', 'Newtown', 'Macedonia', 'Five Forks', 'Edgewood', 'Arlington', 'Unionville', 'Red Hill', 'Clinton', 'Woodville', 'Summit', 'Jamestown', 'Hamilton', 'Clifton', 'Union Hill', 'Sulphur Springs', 'Lincoln', 'Lebanon', 'Farmington', 'Enterprise', 'Brownsville', 'Ashland', 'Woodland', 'Waterloo', 'Valley View', 'Oak Ridge', 'Maple Grove', 'Kingston', 'Jackson', 'Greenfield', 'Green Acres', 'Plainview', 'Marion', 'Florence', 'Qarth', 'R\'lyeh', 'Mos Eisley', 'King\'s Landing', "Q'onos", "Inn
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has posted comments on this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. Patch Set 18: (4 comments) I could remove the conjunct cost propagation to the backend, but there doesn't seem to be much else to try here. I fought hard for this one but was unable to get a clear win. Special purpose benchmarks (filtered out rows which do timestamp / string conversion) might show a benefit, but is it worth the cost in terms of complexity? http://gerrit.cloudera.org:8080/#/c/6726/18/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 738: runtime_filter_position_map_.find(scalar_reader->slot_desc_->id()); use slot_desc->id() instead of scalar_reader->slot_desc Line 913: if (conjuncts == nullptr && runtime_filters.size() == 0) continue; bug (rare but possible if runtime filters get disabled): deferred_dict_init_list.push_back(scalar_reader) http://gerrit.cloudera.org:8080/#/c/6726/18/be/src/exec/parquet-scratch-tuple-batch.h File be/src/exec/parquet-scratch-tuple-batch.h: Line 90: &tuple_mem)); All diffs here could be reverted as using a spare byte in scratch tuple was discontinued. Should not make a difference overall to perf however. http://gerrit.cloudera.org:8080/#/c/6726/18/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: Line 516: 25: optional list conjunct_costs This was a desperate attempt to catch a win which didn't seem to pay off. -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 18 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has uploaded a new patch set (#18). Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. IMPALA-4864 Speed up single slot predicates with dictionaries When dictionaries are present we can pre-evaluate conjuncts against the dictionary values and simply look up the result. Status: Runs with ASAN, runs without crashes on ee tests. Performance results inconclusive; this may not be worth the complexity unless we get really selective or really expensive predicates. Basic idea: since we codegen so early, before we know enough details about the columns to know if they are dict filterable, if we do have dictionary filtering predicates, we codegen a guard around each dictionary filterable predicate evaluation. This guard skips evaluation of the predicate if it has already been evaluated by the dictionary. In this way, we can skip evaluation dynamically for each row group as we learn which columns are dictionary filterable, and then push predicate evaluation into the column reader. Since the branches will remain 100% predictable over the row group, this should give us the fastest way to skip over predicate evaluation without compromising the general case where we may be unable to evaluate against the dictionary. We can even do this with codegen turned off, as a side effect of the way we generate the codegen'd function when dictionary evaluation is enabled. If dictionaries aren't available for some predicates, we automatically fall back to evaluating those predicates in the original order, preserving selectivity. The overhead in this case is a perfectly predictable extra conditional per dictionary predicate. Performance: Hard to get! Simple predicates did not show improvement, in fact regressed. I used a TPC-H scale 30 dataset, duplicated 3x into a 'biglineitem' table. select count(*) from biglineitem WHERE l_returnflag = 'A'; 1.43s -> 1.53s select count(*) from biglineitem WHERE l_shipinstruct = 'DELIVER IN PERSON'; 1.43s -> 1.53s select count(*) from biglineitem WHERE l_quantity > 49; 0.93s -> 0.93s select count(*) from biglineitem WHERE instr(l_shipdate, '1994-11') > 0; 2.33s -> 1.03s So this appears to only be a win for expensive predicates. Update: I added changes to make computed predicate costs visible from the frontend to the backend, and tried a TPC-DS scale 10 dataset, which has better queries (lots of IN groups). Still there is a regression in raw query performance. Update again: reversing one branch to UNLIKELY in the ir compiled code gave these results on TPC-DSx10: Q46 (limit modified to 10): 3.05 -> 2.89 sec Q27 (limit modified to 10): 3.13 -> 3.09 sec Update (6.5.2017): Switched back to bitset evaluation for scratch batch rather than using an extra byte per tuple row. Surprisingly, this is the best performing diff yet for selective predicates. Using TPC-DS-10 with a filter: Query: select count(*) from store_sales, customer, customer_address where store_sales.ss_customer_sk = customer.c_customer_sk and customer.c_current_addr_sk = customer_address.ca_address_sk and customer_address.ca_city IN (... list of ~200 cities) This almost makes parity. I get 3836140 rows in 1.41-1.45s with this diff and in 1.39-1.45s with the same caching optimization for batch_size on top of the last change (Tim's parquest column reader optimizations). So in a totally fair comparison, we are still losing :( Update (6.8.2017): Tried Tim's suggestions. Best I could get on this query was now only 1.54s. My host OS kernel did change during this time so I re-measured baseline and got a best time of 1.53s. So we seem to be making parity, but not showcasing a big win. Maybe runtime filters will pay off better; they already are toggled dynamically so nothing should be lost. Update (6.16.2017): Tried runtime filters, simplified the logic as much as possible. Everything is still a wash: Query: select count(*) from store_sales, customer, customer_address where store_sales.ss_customer_sk = customer.c_customer_sk and customer.c_current_addr_sk = customer_address.ca_address_sk and customer_address.ca_city IN ('Providence', 'Mount Zion', 'Greenville', 'Mount Olive', 'Wildwood', 'Hillcrest', 'Crossroads', 'Belmont', 'Wilson', 'Riverdale', 'Newport', 'Springdale', 'Mountain View', 'Forest Hills', 'Bridgeport', 'White Oak', 'Oakwood', 'Newtown', 'Macedonia', 'Five Forks', 'Edgewood', 'Arlington', 'Unionville', 'Red Hill', 'Clinton', 'Woodville', 'Summit', 'Jamestown', 'Hamilton', 'Clifton', 'Union Hill', 'Sulphur Springs', 'Lincoln', 'Lebanon', 'Farmington', 'Enterprise', 'Brownsville', 'Ashland', 'Woodland', 'Waterloo', 'Valley View', 'Oak Ridge', 'Maple Grove', 'Kingston', 'Jackson', 'Greenfield', 'Green Acres', 'Plainview', 'Marion', 'Florence', 'Qarth', 'R\'lyeh', 'Mos Eisley', 'King\'s Landing', "Q'onos", "Inn
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has uploaded a new patch set (#17). Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. IMPALA-4864 Speed up single slot predicates with dictionaries When dictionaries are present we can pre-evaluate conjuncts against the dictionary values and simply look up the result. Status: Runs with ASAN, runs without crashes on ee tests. Performance results inconclusive; this may not be worth the complexity unless we get really selective or really expensive predicates. Basic idea: since we codegen so early, before we know enough details about the columns to know if they are dict filterable, if we do have dictionary filtering predicates, we codegen a guard around each dictionary filterable predicate evaluation. This guard skips evaluation of the predicate if it has already been evaluated by the dictionary. In this way, we can skip evaluation dynamically for each row group as we learn which columns are dictionary filterable, and then push predicate evaluation into the column reader. Since the branches will remain 100% predictable over the row group, this should give us the fastest way to skip over predicate evaluation without compromising the general case where we may be unable to evaluate against the dictionary. We can even do this with codegen turned off, as a side effect of the way we generate the codegen'd function when dictionary evaluation is enabled. If dictionaries aren't available for some predicates, we automatically fall back to evaluating those predicates in the original order, preserving selectivity. The overhead in this case is a perfectly predictable extra conditional per dictionary predicate. Performance: Hard to get! Simple predicates did not show improvement, in fact regressed. I used a TPC-H scale 30 dataset, duplicated 3x into a 'biglineitem' table. select count(*) from biglineitem WHERE l_returnflag = 'A'; 1.43s -> 1.53s select count(*) from biglineitem WHERE l_shipinstruct = 'DELIVER IN PERSON'; 1.43s -> 1.53s select count(*) from biglineitem WHERE l_quantity > 49; 0.93s -> 0.93s select count(*) from biglineitem WHERE instr(l_shipdate, '1994-11') > 0; 2.33s -> 1.03s So this appears to only be a win for expensive predicates. Update: I added changes to make computed predicate costs visible from the frontend to the backend, and tried a TPC-DS scale 10 dataset, which has better queries (lots of IN groups). Still there is a regression in raw query performance. Update again: reversing one branch to UNLIKELY in the ir compiled code gave these results on TPC-DSx10: Q46 (limit modified to 10): 3.05 -> 2.89 sec Q27 (limit modified to 10): 3.13 -> 3.09 sec Update (6.5.2017): Switched back to bitset evaluation for scratch batch rather than using an extra byte per tuple row. Surprisingly, this is the best performing diff yet for selective predicates. Using TPC-DS-10 with a filter: Query: select count(*) from store_sales, customer, customer_address where store_sales.ss_customer_sk = customer.c_customer_sk and customer.c_current_addr_sk = customer_address.ca_address_sk and customer_address.ca_city IN (... list of ~200 cities) This almost makes parity. I get 3836140 rows in 1.41-1.45s with this diff and in 1.39-1.45s with the same caching optimization for batch_size on top of the last change (Tim's parquest column reader optimizations). So in a totally fair comparison, we are still losing :( Update (6.8.2017): Tried Tim's suggestions. Best I could get on this query was now only 1.54s. My host OS kernel did change during this time so I re-measured baseline and got a best time of 1.53s. So we seem to be making parity, but not showcasing a big win. Maybe runtime filters will pay off better; they already are toggled dynamically so nothing should be lost. Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/exec-node.h M be/src/exec/hdfs-parquet-scanner-ir.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/exec/parquet-scratch-tuple-batch.h M be/src/runtime/descriptors.h M be/src/runtime/row-batch.h M be/src/runtime/tuple.h M be/src/util/bitmap-test.cc M be/src/util/bitmap.h M be/src/util/dict-encoding.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/PlanNode.java 20 files changed, 559 insertions(+), 165 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/6726/17 -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe05
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has posted comments on this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. Patch Set 16: (1 comment) http://gerrit.cloudera.org:8080/#/c/6726/16/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 420: LIKELY(dictionary_results_.num_bits() > 0)) { > I think the predicate evaluation on 40,000 values is probably cheap enough We can certainly try it. I was worried the pre-computation might be expensive if we have, say, string manipulation in predicates, as opposed to inexpensive, simple comparisons. Still, even if we have the same number of predicate evaluations, they end up going through the unoptimized EvalConjuncts() path, as opposed to the codegen'd path. As for IS_FILTERED, that is set when the column reader is created. IS_DICT_ENCODED is determined per page. We are left with no way to remove dictionary_results_.num_bits() on a per-row basis, since we can't unset IS_FILTERED, and IS_DICT_ENCODED may be true even if the encoding did not cover all values. I'll try precomputing all the values and see what happens. -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 16 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. Patch Set 16: (1 comment) http://gerrit.cloudera.org:8080/#/c/6726/16/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 430: if (!dictionary_results_.Get(dict_index)) { I'm thinking that this branch lengthened the critical path through this function for the non-selective case, since the two loads can't necessarily happen in parallel. Hard to know if that's the source of the regression without profiling. We could try to do something like this so that the two loads can happen in parallel: void* slot = tuple->GetSlot(tuple_offset_); T* val_ptr = reinterpret_cast(slot); dict_decoder_.GetValue(dict_index, val_ptr); if (!dictionary_results_.Get(dict_index)) { filtered_rows_->Set(*num_values + val_count, true); } Or maybe even change Set() so that it uses branch-free code and do something like: filtered_rows_->Set(!dictionary_results_.Get(dict_index)); -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 16 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. Patch Set 16: (1 comment) http://gerrit.cloudera.org:8080/#/c/6726/16/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 420: LIKELY(dictionary_results_.num_bits() > 0)) { > Not liking this branch, but it is unavoidable without pre-computation overh I think the predicate evaluation on 40,000 values is probably cheap enough to just do it. In most cases we would have to evaluate the predicates anyway when processing the dictionary-encoded pages. I don't think I fully understand the inexpensive predicate check that you're envisioning. Does it really need to be done per-row instead of per-batch (like the IS_DICT_ENCODED branch). -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 16 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has posted comments on this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. Patch Set 16: (5 comments) http://gerrit.cloudera.org:8080/#/c/6726/16/be/src/exec/hdfs-parquet-scanner-ir.cc File be/src/exec/hdfs-parquet-scanner-ir.cc: Line 89: idx = scratch_batch_.NextValidTuple(idx); Surprising, despite apparently being costlier, the bitmap scan appears to be fast, especially when predicates are selective. http://gerrit.cloudera.org:8080/#/c/6726/16/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 420: LIKELY(dictionary_results_.num_bits() > 0)) { Not liking this branch, but it is unavoidable without pre-computation overhead. When we have a partially dictionary encoding on a filtered column that spills to plain encoding, we don't pre-evaluate the dictionary against predicates. We could do so anyway, but it may end up being an unnecessary up-front cost. However, the number of dictionary entries before spilling to plain is bounded, so the cost should be fixed. Maybe we can afford to do so if the predicates are "inexpensive", but then we are still left with this check for expensive predicate (and have no way to determine cost here without another check). http://gerrit.cloudera.org:8080/#/c/6726/16/be/src/exec/parquet-scratch-tuple-batch.h File be/src/exec/parquet-scratch-tuple-batch.h: Line 172: // pointer. We don't need to transfer the extra filter byte. Comment obsolete with latest diff. http://gerrit.cloudera.org:8080/#/c/6726/16/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: Line 94: NullIndicatorOffset(int byte_offset = 0, int bit_offset = -1) This fixes an apparent bug, where byte_offset = -1 (passed in constructor for ParquetColumnReader as NullIndicatorOffset(-1, -1) ends up converting into code which will execute something like: tuple_mem[-1] |= 0 Which despite not modifying memory, could be an out of bounds memory access. http://gerrit.cloudera.org:8080/#/c/6726/16/be/src/util/bitmap-test.cc File be/src/util/bitmap-test.cc: Line 81: for (index = bm.NextSetBit(); index < size; index = bm.NextSetBit(index)) { test needs updating for behavior change: index -> index + 1 -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 16 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has uploaded a new patch set (#16). Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. IMPALA-4864 Speed up single slot predicates with dictionaries When dictionaries are present we can pre-evaluate conjuncts against the dictionary values and simply look up the result. Status: Runs with ASAN, runs without crashes on ee tests. Performance results inconclusive; this may not be worth the complexity unless we get really selective or really expensive predicates. Basic idea: since we codegen so early, before we know enough details about the columns to know if they are dict filterable, if we do have dictionary filtering predicates, we codegen a guard around each dictionary filterable predicate evaluation. This guard skips evaluation of the predicate if it has already been evaluated by the dictionary. In this way, we can skip evaluation dynamically for each row group as we learn which columns are dictionary filterable, and then push predicate evaluation into the column reader. Since the branches will remain 100% predictable over the row group, this should give us the fastest way to skip over predicate evaluation without compromising the general case where we may be unable to evaluate against the dictionary. We can even do this with codegen turned off, as a side effect of the way we generate the codegen'd function when dictionary evaluation is enabled. If dictionaries aren't available for some predicates, we automatically fall back to evaluating those predicates in the original order, preserving selectivity. The overhead in this case is a perfectly predictable extra conditional per dictionary predicate. Performance: Hard to get! Simple predicates did not show improvement, in fact regressed. I used a TPC-H scale 30 dataset, duplicated 3x into a 'biglineitem' table. select count(*) from biglineitem WHERE l_returnflag = 'A'; 1.43s -> 1.53s select count(*) from biglineitem WHERE l_shipinstruct = 'DELIVER IN PERSON'; 1.43s -> 1.53s select count(*) from biglineitem WHERE l_quantity > 49; 0.93s -> 0.93s select count(*) from biglineitem WHERE instr(l_shipdate, '1994-11') > 0; 2.33s -> 1.03s So this appears to only be a win for expensive predicates. Update: I added changes to make computed predicate costs visible from the frontend to the backend, and tried a TPC-DS scale 10 dataset, which has better queries (lots of IN groups). Still there is a regression in raw query performance. Update again: reversing one branch to UNLIKELY in the ir compiled code gave these results on TPC-DSx10: Q46 (limit modified to 10): 3.05 -> 2.89 sec Q27 (limit modified to 10): 3.13 -> 3.09 sec Update (6.5.2017): Switched back to bitset evaluation for scratch batch rather than using an extra byte per tuple row. Surprisingly, this is the best performing diff yet for selective predicates. Using TPC-DS-10 with a filter: Query: select count(*) from store_sales, customer, customer_address where store_sales.ss_customer_sk = customer.c_customer_sk and customer.c_current_addr_sk = customer_address.ca_address_sk and customer_address.ca_city IN (... list of ~200 cities) This almost makes parity. I get 3836140 rows in 1.41-1.45s with this diff and in 1.39-1.45s with the same caching optimization for batch_size on top of the last change (Tim's parquest column reader optimizations). So in a totally fair comparison, we are still losing :( Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/exec-node.h M be/src/exec/hdfs-parquet-scanner-ir.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/exec/parquet-scratch-tuple-batch.h M be/src/runtime/descriptors.h M be/src/runtime/row-batch.h M be/src/runtime/tuple.h M be/src/util/bitmap-test.cc M be/src/util/bitmap.h M be/src/util/dict-encoding.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/PlanNode.java 20 files changed, 552 insertions(+), 165 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/6726/16 -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 16 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has uploaded a new patch set (#15). Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. IMPALA-4864 Speed up single slot predicates with dictionaries When dictionaries are present we can pre-evaluate conjuncts against the dictionary values and simply look up the result. Status: Runs with ASAN, runs without crashes on ee tests. Performance results inconclusive; this may not be worth the complexity unless we get really selective or really expensive predicates. Basic idea: since we codegen so early, before we know enough details about the columns to know if they are dict filterable, if we do have dictionary filtering predicates, we codegen a guard around each dictionary filterable predicate evaluation. This guard skips evaluation of the predicate if it has already been evaluated by the dictionary. In this way, we can skip evaluation dynamically for each row group as we learn which columns are dictionary filterable, and then push predicate evaluation into the column reader. Since the branches will remain 100% predictable over the row group, this should give us the fastest way to skip over predicate evaluation without compromising the general case where we may be unable to evaluate against the dictionary. We can even do this with codegen turned off, as a side effect of the way we generate the codegen'd function when dictionary evaluation is enabled. If dictionaries aren't available for some predicates, we automatically fall back to evaluating those predicates in the original order, preserving selectivity. The overhead in this case is a perfectly predictable extra conditional per dictionary predicate. Performance: Hard to get! Simple predicates did not show improvement, in fact regressed. I used a TPC-H scale 30 dataset, duplicated 3x into a 'biglineitem' table. select count(*) from biglineitem WHERE l_returnflag = 'A'; 1.43s -> 1.53s select count(*) from biglineitem WHERE l_shipinstruct = 'DELIVER IN PERSON'; 1.43s -> 1.53s select count(*) from biglineitem WHERE l_quantity > 49; 0.93s -> 0.93s select count(*) from biglineitem WHERE instr(l_shipdate, '1994-11') > 0; 2.33s -> 1.03s So this appears to only be a win for expensive predicates. Update: I added changes to make computed predicate costs visible from the frontend to the backend, and tried a TPC-DS scale 10 dataset, which has better queries (lots of IN groups). Still there is a regression in raw query performance. Update again: reversing one branch to UNLIKELY in the ir compiled code gave these results on TPC-DSx10: Q46 (limit modified to 10): 3.05 -> 2.89 sec Q27 (limit modified to 10): 3.13 -> 3.09 sec Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/exec-node.h M be/src/exec/hdfs-parquet-scanner-ir.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/exec/parquet-scratch-tuple-batch.h M be/src/runtime/descriptors.h M be/src/runtime/row-batch.h M be/src/runtime/tuple.h M be/src/util/bitmap-test.cc M be/src/util/bitmap.h M be/src/util/dict-encoding.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/PlanNode.java 20 files changed, 527 insertions(+), 142 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/6726/15 -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 15 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has posted comments on this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. Patch Set 14: I'm getting kind of disheartened by this approach. The biggest obstacle is that we don't know ahead of time what conjuncts will be useful for filtering, as we don't know which columns are fully dictionary encoded. None of the approaches I have tried really make parity with the baseline, and only very expensive conjuncts on sets of small to mid-size distinct values appear to be a win. It is hard to come up with a data set and query that perform much better without artificially making poor queries to begin with. Nothing reasonable I've tried on TPC-Hx30 or TPC-DSx10 have produced great results. -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 14 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Marcel Kornacker Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has uploaded a new patch set (#14). Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. IMPALA-4864 Speed up single slot predicates with dictionaries When dictionaries are present we can pre-evaluate conjuncts against the dictionary values and simply look up the result. Status: Runs with ASAN, runs without crashes on ee tests. Performance results inconclusive; this may not be worth the complexity unless we get really selective or really expensive predicates. Basic idea: since we codegen so early, before we know enough details about the columns to know if they are dict filterable, if we do have dictionary filtering predicates, we codegen a guard around each dictionary filterable predicate evaluation. This guard skips evaluation of the predicate if it has already been evaluated by the dictionary. In this way, we can skip evaluation dynamically for each row group as we learn which columns are dictionary filterable, and then push predicate evaluation into the column reader. Since the branches will remain 100% predictable over the row group, this should give us the fastest way to skip over predicate evaluation without compromising the general case where we may be unable to evaluate against the dictionary. We can even do this with codegen turned off, as a side effect of the way we generate the codegen'd function when dictionary evaluation is enabled. If dictionaries aren't available for some predicates, we automatically fall back to evaluating those predicates in the original order, preserving selectivity. The overhead in this case is a perfectly predictable extra conditional per dictionary predicate. Performance: Hard to get! Simple predicates did not show improvement, in fact regressed. I used a TPC-H scale 30 dataset, duplicated 3x into a 'biglineitem' table. select count(*) from biglineitem WHERE l_returnflag = 'A'; 1.43s -> 1.53s select count(*) from biglineitem WHERE l_shipinstruct = 'DELIVER IN PERSON'; 1.43s -> 1.53s select count(*) from biglineitem WHERE l_quantity > 49; 0.93s -> 0.93s select count(*) from biglineitem WHERE instr(l_shipdate, '1994-11') > 0; 2.33s -> 1.03s So this appears to only be a win for expensive predicates. Update: I added changes to make computed predicate costs visible from the frontend to the backend, and tried a TPC-DS scale 10 dataset, which has better queries (lots of IN groups). Still there is a regression in raw query performance. Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/exec-node.h M be/src/exec/hdfs-parquet-scanner-ir.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/exec/parquet-scratch-tuple-batch.h M be/src/runtime/descriptors.h M be/src/runtime/row-batch.h M be/src/runtime/tuple.h M be/src/util/bitmap-test.cc M be/src/util/bitmap.h M be/src/util/dict-encoding.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/PlanNode.java 20 files changed, 527 insertions(+), 142 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/6726/14 -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 14 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has uploaded a new patch set (#13). Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. IMPALA-4864 Speed up single slot predicates with dictionaries When dictionaries are present we can pre-evaluate conjuncts against the dictionary values and simply look up the result. Status: Ready for review. Runs with ASAN, runs without crashes on ee tests. Basic idea: since we codegen so early, before we know enough details about the columns to know if they are dict filterable, if we do have dictionary filtering predicates, we codegen a guard around each dictionary filterable predicate evaluation. This guard skips evaluation of the predicate if it has already been evaluated by the dictionary. In this way, we can skip evaluation dynamically for each row group as we learn which columns are dictionary filterable, and then push predicate evaluation into the column reader. Since the branches will remain 100% predictable over the row group, this should give us the fastest way to skip over predicate evaluation without compromising the general case where we may be unable to evaluate against the dictionary. We can even do this with codegen turned off, as a side effect of the way we generate the codegen'd function when dictionary evaluation is enabled. If dictionaries aren't available for some predicates, we automatically fall back to evaluating those predicates in the original order, preserving selectivity. The overhead in this case is a perfectly predictable extra conditional per dictionary predicate. Performance: Hard to get! Simple predicates did not show improvement, in fact regressed. I used a TPC-H scale 30 dataset, duplicated 3x into a 'biglineitem' table. select count(*) from biglineitem WHERE l_returnflag = 'A'; 1.43s -> 1.53s select count(*) from biglineitem WHERE l_shipinstruct = 'DELIVER IN PERSON'; 1.43s -> 1.53s select count(*) from biglineitem WHERE l_quantity > 49; 0.93s -> 0.93s select count(*) from biglineitem WHERE instr(l_shipdate, '1994-11') > 0; 2.33s -> 1.03s So this appears to only be a win for expensive predicates. Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/exec-node.h M be/src/exec/hdfs-parquet-scanner-ir.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/exec/parquet-scratch-tuple-batch.h M be/src/runtime/descriptors.h M be/src/runtime/row-batch.h M be/src/runtime/tuple.h M be/src/util/bitmap-test.cc M be/src/util/bitmap.h M be/src/util/dict-encoding.h 18 files changed, 527 insertions(+), 150 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/6726/13 -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 13 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has uploaded a new patch set (#12). Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. IMPALA-4864 Speed up single slot predicates with dictionaries When dictionaries are present we can pre-evaluate conjuncts against the dictionary values and simply look up the result. Status: Ready for review. Seems to be passing all tests, runs with ASAN, and gives expected results. Need to come up with some specific test cases that exercise this functionality and measure performance. Basic idea: since we codegen so early, before we know enough details about the columns to know if they are dict filterable, if we do have dictionary filtering predicates, we codegen a guard around each dictionary filterable predicate evaluation. This guard skips evaluation of the predicate if it has already been evaluated by the dictionary. In this way, we can skip evaluation dynamically for each row group as we learn which columns are dictionary filterable, and then push predicate evaluation into the column reader. Since the branches will remain 100% predictable over the row group, this should give us the fastest way to skip over predicate evaluation without compromising the general case where we may be unable to evaluate against the dictionary. We can even do this with codegen turned off, as a side effect of the way we generate the codegen'd function when dictionary evaluation is enabled. If dictionaries aren't available for some predicates, we automatically fall back to evaluating those predicates in the original order, preserving selectivity. The overhead in this case is a perfectly predictable extra conditional per dictionary predicate. Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/exec-node.h M be/src/exec/hdfs-parquet-scanner-ir.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/exec/parquet-scratch-tuple-batch.h M be/src/runtime/descriptors.h M be/src/runtime/row-batch.h M be/src/runtime/tuple.h M be/src/util/bitmap-test.cc M be/src/util/bitmap.h M be/src/util/dict-encoding.h 18 files changed, 527 insertions(+), 150 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/6726/12 -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 12 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has posted comments on this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. Patch Set 9: (13 comments) http://gerrit.cloudera.org:8080/#/c/6726/9/be/src/exec/exec-node.cc File be/src/exec/exec-node.cc: PS9, Line 567: skippable_map > skippable_map != nullptr Done PS9, Line 586: 0 > huh ? Was diverting for testing PS9, Line 586: (*skippable_map)[i] == true > (*skippable_map)[i] Done PS9, Line 590: bool > skip_val Done http://gerrit.cloudera.org:8080/#/c/6726/9/be/src/exec/hdfs-parquet-scanner-ir.cc File be/src/exec/hdfs-parquet-scanner-ir.cc: PS9, Line 88: LOG(INFO) << "tuple idx: " << tuple_index; > debugging output ? Yeah, removed this http://gerrit.cloudera.org:8080/#/c/6726/9/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: PS9, Line 1050: std::map>& > DictionaryFilterPositionMap I think it will need to be HdfsScanNodeBase::DictionaryFilterPositionMap - not sure even a using declaration will help have the syntax, but I'll try. http://gerrit.cloudera.org:8080/#/c/6726/9/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: PS9, Line 167: typedef std::map> DictFilterConjunctsPositionMap; > Comment what this map stands for ? Done PS9, Line 366: Mapping to original position in conjuncts > Mapping from SlotId to ... Done http://gerrit.cloudera.org:8080/#/c/6726/9/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: PS9, Line 206: IS_FILTERED > Comment what IS_FILTERED stands for. Done PS9, Line 240: (filters == nullptr) ^ IS_FILTERED > Is there a reason to not use DCHECK((filters == nullptr) != IS_FILTERED) ? 1 fewer character? I can change this though. http://gerrit.cloudera.org:8080/#/c/6726/9/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: PS9, Line 390: return false; > Is this hard coded for testing ? If not, please document why it's always fa No, it gets overridden in the derived class. http://gerrit.cloudera.org:8080/#/c/6726/9/be/src/util/bitmap.h File be/src/util/bitmap.h: PS9, Line 86: (mask & buffer_[word_index]) > (mask & buffer[word_index]) == 0 Done PS9, Line 99: buffer_.size() > num_words_ Done -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has uploaded a new patch set (#11). Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. IMPALA-4864 Speed up single slot predicates with dictionaries When dictionaries are present we can pre-evaluate conjuncts against the dictionary values and simply look up the result. Status: Ready for review. Seems to be passing all tests, runs with ASAN, and gives expected results. Need to come up with some specific test cases that exercise this functionality and measure performance. Basic idea: since we codegen so early, before we know enough details about the columns to know if they are dict filterable, if we do have dictionary filtering predicates, we codegen a guard around each dictionary filterable predicate evaluation. This guard skips evaluation of the predicate if it has already been evaluated by the dictionary. In this way, we can skip evaluation dynamically for each row group as we learn which columns are dictionary filterable, and then push predicate evaluation into the column reader. Since the branches will remain 100% predictable over the row group, this should give us the fastest way to skip over predicate evaluation without compromising the general case where we may be unable to evaluate against the dictionary. We can even do this with codegen turned off, as a side effect of the way we generate the codegen'd function when dictionary evaluation is enabled. If dictionaries aren't available for some predicates, we automatically fall back to evaluating those predicates in the original order, preserving selectivity. The overhead in this case is a perfectly predictable extra conditional per dictionary predicate. Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/exec-node.h M be/src/exec/hash-join-node.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-parquet-scanner-ir.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/exec/parquet-scratch-tuple-batch.h M be/src/exec/partitioned-hash-join-node.cc M be/src/runtime/descriptors.h M be/src/runtime/row-batch.h M be/src/runtime/tuple.h M be/src/util/bitmap-test.cc M be/src/util/bitmap.h M be/src/util/dict-encoding.h 21 files changed, 511 insertions(+), 148 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/6726/11 -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has posted comments on this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. Patch Set 10: (3 comments) http://gerrit.cloudera.org:8080/#/c/6726/10/be/src/runtime/query-state.h File be/src/runtime/query-state.h: Line 167: static const int MAX_BATCH_SIZE = 65536; Didn't end up needing this. http://gerrit.cloudera.org:8080/#/c/6726/10/be/src/util/bitmap.h File be/src/util/bitmap.h: Line 79: /// Returns next set bit, or num_bits_ if the end is reached. This function ended up unused as well. What shall we do with it? http://gerrit.cloudera.org:8080/#/c/6726/10/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test: Line 25:parquet dictionary predicates: int_col IS NULL, int_col > 1 Bogus test change needs to be backed out. -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has uploaded a new patch set (#10). Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. IMPALA-4864 Speed up single slot predicates with dictionaries When dictionaries are present we can pre-evaluate conjuncts against the dictionary values and simply look up the result. Status: Ready for review. Seems to be passing all tests, runs with ASAN, and gives expected results. Need to come up with some specific test cases that exercise this functionality and measure performance. Basic idea: since we codegen so early, before we know enough details about the columns to know if they are dict filterable, if we do have dictionary filtering predicates, we codegen a guard around each dictionary filterable predicate evaluation. This guard skips evaluation of the predicate if it has already been evaluated by the dictionary. In this way, we can skip evaluation dynamically for each row group as we learn which columns are dictionary filterable, and then push predicate evaluation into the column reader. Since the branches will remain 100% predictable over the row group, this should give us the fastest way to skip over predicate evaluation without compromising the general case where we may be unable to evaluate against the dictionary. We can even do this with codegen turned off, as a side effect of the way we generate the codegen'd function when dictionary evaluation is enabled. If dictionaries aren't available for some predicates, we automatically fall back to evaluating those predicates in the original order, preserving selectivity. The overhead in this case is a perfectly predictable extra conditional per dictionary predicate. Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/exec-node.h M be/src/exec/hash-join-node.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-parquet-scanner-ir.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/exec/parquet-scratch-tuple-batch.h M be/src/exec/partitioned-hash-join-node.cc M be/src/runtime/descriptors.h M be/src/runtime/query-state.h M be/src/runtime/row-batch.h M be/src/runtime/tuple.h M be/src/util/bitmap-test.cc M be/src/util/bitmap.h M be/src/util/dict-encoding.h M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test 23 files changed, 510 insertions(+), 150 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/6726/10 -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Michael Ho has posted comments on this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. Patch Set 9: (16 comments) http://gerrit.cloudera.org:8080/#/c/6726/9/be/src/exec/exec-node.cc File be/src/exec/exec-node.cc: PS9, Line 567: skippable_map skippable_map != nullptr PS9, Line 586: (*skippable_map)[i] == true (*skippable_map)[i] PS9, Line 586: 0 huh ? PS9, Line 590: bool skip_val http://gerrit.cloudera.org:8080/#/c/6726/9/be/src/exec/exec-node.h File be/src/exec/exec-node.h: PS9, Line 153: Again No need for "Again". PS9, Line 159: /// This function is capable of doing codegen for either of the above functions; simply : /// pass a null value instead of a bool array pointer. If 'skippable_map' is not NULL, this will generate the equivalent of EvalConjunctsPreEval(). Otherwise, this generates EvalConjuncts(). Also describe the content of skippable_map. Alternately, we can keep this function private and expose two interfaces CodegenEvalConjuncts() and CodegenEvalConjunctsPreEval(). The former always passes nullptr for 'skippable_map' argument. PS9, Line 163: skippable_map Will 'skippable_conjuncts' be a more appropriate name ? http://gerrit.cloudera.org:8080/#/c/6726/9/be/src/exec/hdfs-parquet-scanner-ir.cc File be/src/exec/hdfs-parquet-scanner-ir.cc: PS9, Line 88: LOG(INFO) << "tuple idx: " << tuple_index; debugging output ? http://gerrit.cloudera.org:8080/#/c/6726/9/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: PS9, Line 1050: std::map>& DictionaryFilterPositionMap http://gerrit.cloudera.org:8080/#/c/6726/9/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: PS9, Line 167: typedef std::map> DictFilterConjunctsPositionMap; Comment what this map stands for ? Slot Id -> index of the dictionary filter conjuncts in conjunct_ctxs_ ? PS9, Line 366: Mapping to original position in conjuncts Mapping from SlotId to ... http://gerrit.cloudera.org:8080/#/c/6726/9/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: PS9, Line 206: IS_FILTERED Comment what IS_FILTERED stands for. PS9, Line 240: (filters == nullptr) ^ IS_FILTERED Is there a reason to not use DCHECK((filters == nullptr) != IS_FILTERED) ? http://gerrit.cloudera.org:8080/#/c/6726/9/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: PS9, Line 390: return false; Is this hard coded for testing ? If not, please document why it's always false. http://gerrit.cloudera.org:8080/#/c/6726/9/be/src/util/bitmap.h File be/src/util/bitmap.h: PS9, Line 86: (mask & buffer_[word_index]) (mask & buffer[word_index]) == 0 PS9, Line 99: buffer_.size() num_words_ -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has uploaded a new patch set (#9). Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. IMPALA-4864 Speed up single slot predicates with dictionaries When dictionaries are present we can pre-evaluate conjuncts against the dictionary values and simply look up the result. Status: ready for review. Seems to be passing all tests, runs with ASAN, and gives expected results. Need to come up with some specific test cases that exercise this functionality and measure performance. Basic idea: since we codegen so early, before we know enough details about the columns to know if they are dict filterable, if we do have dictionary filtering predicates, we codegen a guard around each dictionary filterable predicate evaluation. This guard skips evaluation of the predicate if it has already been evaluated by the dictionary. In this way, we can skip evaluation dynamically for each row group as we learn which columns are dictionary filterable, and then push predicate evaluation into the column reader. Since the branches will remain 100% predictable over the row group, this should give us the fastest way to skip over predicate evaluation without compromising the general case where we may be unable to evaluate against the dictionary. We can even do this with codegen turned off, as a side effect of the way we generate the codegen'd function when dictionary evaluation is enabled. If dictionaries aren't available for some predicates, we automatically fall back to evaluating those predicates in the original order, preserving selectivity. The overhead in this case is a perfectly predictable extra conditional per dictionary predicate. Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/exec-node.cc M be/src/exec/exec-node.h M be/src/exec/hash-join-node.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-parquet-scanner-ir.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/exec/parquet-scratch-tuple-batch.h M be/src/exec/partitioned-hash-join-node.cc M be/src/runtime/query-state.h M be/src/runtime/row-batch.h M be/src/util/bitmap-test.cc M be/src/util/bitmap.h M be/src/util/dict-encoding.h M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test 20 files changed, 564 insertions(+), 180 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/6726/9 -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. Patch Set 7: (8 comments) http://gerrit.cloudera.org:8080/#/c/6726/7//COMMIT_MSG Commit Message: Line 7: IMPALA-4864 Speed up single slot predicates with dictionaries > There is a bit of a disconnect with the JIRA as specified and the intention Thanks for the clarification. Agree this is the best path forward for the moment. We should make sure to straighten out the JIRAs though so that the JIRA attached to this diff actually reflects the work you're doing, and that we track the other optimisation as separate work. http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner-ir.cc File be/src/exec/hdfs-parquet-scanner-ir.cc: Line 77:tuple_index = scratch_batch_->NextValidTuple()) { > Your intuition and mine are similar. I did my best to make the "simple" ve +1 no point optimising it until the code around it has settled down. http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 876: RETURN_IF_ERROR(scalar_reader->InitDictionary(dict_filter_tuple)); This is orthogonal, but we really should evaluate single-slot runtime filters against the dictionary as well. The payoff there is potentially really big - they can be quite selective and are expensive to evaluate. Do we have a JIRA for that? http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 317: bool ReadValueBatch(ScratchTupleBatch& scratch, int* num_values) { > My guess is that inlining the functions from the header should get about as There's also a design layering question about whether the ColumnReader abstraction should know about ScratchTupleBatch. IMO there's a cleaner separation if ColumnReaders only care about materialising a column into a memory buffer and setting bits based on predicates as fast as possible, not about how we're staging tuples in a scratch batch. Even after inlining it adds an additional pointer indirection via 'scratch'. We can try to coax the compiler into undoing what we want it to do but at some point it's more robust and adds less cognitive load if we just write code that's closer to what we want the compiler to generate. On a lot of queries we spend up to 50% of CPU time in this function so the perf matters a lot (I'm not just being anal about shaving CPU cycles). We've had a couple rounds of attempting to optimise this function and it is *very* sensitive. Even then we left a lot of performance on the floor for subtle reasons: see https://gerrit.cloudera.org/#/c/6950/. The big problem is that we have multiple layers of non-trivial inlined functions inside that loop and it's hard to reason about what the compiler and CPU will actually be doing. PS7, Line 408: IS_FILTERED > The behavior is different as a class, not just at this function level, so I It looks like the only perf-critical place that references IS_FILTERED are in these two templatised functions, so I'm still skeptical that it makes sense to templatise the whole class - otherwise it could just be a member variable. I don't see a reason to treat it differently from IS_DICT_ENCODED. Line 410: return false; I'm going to retract my previous comment about materializing the indices as a separate step. I played around with a prototype of batching the slot materialisation and the best design looks like making the dictionary decoder RLE-aware so that it can only does one dictionary lookup per run. See https://github.com/timarmstrong/incubator-impala/commit/d45d8fae03d2645683f82ece21fce9310f78ac7a#diff-66ee012551a23c734bbe834529dd3b3dR306 (warning - it's pretty rough). I think that would mean pushing the bitmap check down into the dictionary decoder, which is something very different from what I was talking about. That requires rearranging a lot of code in a more fundamental so the short-term change to ReadFilteredSlot() doesn't really matter either way. Line 417: if (IS_FILTERED) batch.InvalidateTuple(val_index); > True. (Was not so in original code). Do you have a repro? If it's a correctness bug we should make sure we're on top of it. I agree we're not strict about validating that the def levels are in the expected range and we could accept a technically-invalid parquet file as valid input. I think it's reasonable that we don't exhaustively validate the file so long as we don't overrun any buffers etc as a result. I don't see how that leads to any further consequences though. Impala assumes that all slots coming out of HDFS scans are nullable, so it can't be non-NULLable from Impala's point of view. If the Parquet file claims that a column is non-NULLable within that file (max_def_level == 0) then I don't think we w
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has uploaded a new patch set (#8). Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. IMPALA-4864 Speed up single slot predicates with dictionaries When dictionaries are present we can pre-evaluate conjuncts against the dictionary values and simply look up the result. Status of this diff: Compiles and starts. Bitmap tests for new functionality pass. Tests are broken due to some inadvertent change in the parquet column reader that causing file decode to break. Needs debugging, and there are definitely some bugs, but the exposition of the concept is now fully formed. Basic idea: since we codegen so early, before we know enough details about the columns to know if they are dict filterable, if we do have dictionary filtering predicates, we codegen a guard around each dictionary filterable predicate evaluation. This guard skips evaluation of the predicate if it has already been evaluated by the dictionary. In this way, we can skip evaluation dynamically for each row group as we learn which columns are dictionary filterable, and then push predicate evaluation into the column reader. Since the branches will remain 100% predictable over the row group, this should give us the fastest way to skip over predicate evaluation without compromising the general case where we may be unable to evaluate against the dictionary. We can even do this with codegen turned off, as a side effect of the way we generate the codegen'd function when dictionary evaluation is enabled. If dictionaries aren't available for some predicates, we automatically fall back to evaluating those predicates in the original order, preserving selectivity. The overhead in this case is a perfectly predictable extra conditional per dictionary predicate. We could codegen another version of the EvalConjuncts function without this overhead, but because of the complexity involved in doing so and the pain involved (ScanNodeBase assumes one codegen'd function per file format, so we would have to simulate a file format or some other awful hack). Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 --- M be/src/codegen/gen_ir_descriptions.py M be/src/common/global-flags.cc M be/src/exec/exec-node.cc M be/src/exec/exec-node.h M be/src/exec/hash-join-node.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-parquet-scanner-ir.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/exec/parquet-scratch-tuple-batch.h M be/src/exec/partitioned-hash-join-node.cc M be/src/runtime/query-state.h M be/src/runtime/row-batch.h M be/src/util/bitmap-test.cc M be/src/util/bitmap.h M be/src/util/dict-encoding.h M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test 21 files changed, 569 insertions(+), 182 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/6726/8 -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has posted comments on this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. Patch Set 7: (17 comments) Thanks for looking at this! http://gerrit.cloudera.org:8080/#/c/6726/7//COMMIT_MSG Commit Message: Line 7: IMPALA-4864 Speed up single slot predicates with dictionaries > The JIRA seems to be talking about doing something slightly different: taki There is a bit of a disconnect with the JIRA as specified and the intention behind the JIRA, which had me initially working down the wrong path. I realized that (x OP k) type predicates can only be evaluated for OP = or <=>, and started working on a way to identify such predicates in sub-expressions of the conjuncts. It all seemed quite speculative and limited in applicability. Once can imagine expanding this to ordering predicates with sorted dictionaries, but as we don't have that yet, it would be building without a foundation. After clarification from Marcel, it became apparent that what was intended was to be able to evaluate arbitrary single slot predicates against the dictionary (as we do this anyway for row-group filtering) and save those results to eliminate the cost of re-evaluating the same conjuncts again later. The central difficulty of this problem is that the set of conjuncts which can be dictionary evaluated is not known until row group time. This is the best proposal that I have come up with so far at attempting to solve that problem. I agree we should get the design right on this one. Doing this wrong could be costly in terms of either performance or complexity. http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner-ir.cc File be/src/exec/hdfs-parquet-scanner-ir.cc: Line 40 > I believe this code was deliberately written like this to avoid the indirec I have no problem reverting this. Originally I had the two functions merged into one, but I didn't like imposing the overhead of checking the bitmap on this version. While the code evolved, I had to put an interface on ScratchBatch just for sanity, as too many pieces of code were playing with its internals. I think the interface is actually broad enough now to go back to the original code (minus the interference with internals). I'll simply do this at the end: scratch_batch_->Advance((scratch_tuple - scratch_tuple_start) / tuple_size); Line 77:tuple_index = scratch_batch_->NextValidTuple()) { > It seems like this might be slow for the case when many or all of the tuple Your intuition and mine are similar. I did my best to make the "simple" version of NextSetBit() as fast as possible, but I believe intrinsics might actually be warranted here. Wanted to get the core ideas sound before going in that direction though. Oh, and for reference, I also tried various other bitmap implementations - boost::dynamic_bitset and std::vector. I didn't like the boost dependency (or the code too much) and the std::vector (specialized bit-vector) implementation compiled to assembler sent me running for the hills. Branches, branches everywhere... http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 935 > I'm not sure, but this hoisting may have been deliberate because the compil It certainly shouldn't hurt to re-hoist Line 208: dict_filters_active_.reset(new bool[scanner_conjunct_ctxs_->size()]); > This probably works fine in practice but afaik scoped_ptr will call free in Will check PS7, Line 1051: _ > nit: extraneous _ Originally a class member, until I realized this function had no access. Will fix. http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 317: bool ReadValueBatch(ScratchTupleBatch& scratch, int* num_values) { > I think it would be best to avoid using the ScratchTupleBatch abstraction i My guess is that inlining the functions from the header should get about as optimal as possible. Maybe hoisting 'max = scratch.capacity()' to tell the compiler it is unchanged throughout the loop, but other than that, I don't see much room for improvement that has been lost. Let's see what perf results look like when that is ready. I don't want to pass the bitmap directly, as I only want the bitmap to be touched (or even known about) by those pieces of code which actually are running in a filtered column reader. Line 408: if (IS_FILTERED && IS_DICT_ENCODED && > This patch pushes the predicate evaluation down into ReadFilteredSlot() so Perhaps there is a way to do this, but I don't see a clear path to that. We have some choices, but they are all problematic: Materialize codewords and values. Where do the codewords go? There is no space in the tuple. Materialize codewords only. What do we do when there is no dictionary? Materialize c
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. Patch Set 7: (19 comments) Looks promising. We're likely to be building more things on top of this, and the potential impact is pretty big if we made the right design decisions, so I want to make sure that we're making the right choices here - I think there are a few decision points we need to think through careful. http://gerrit.cloudera.org:8080/#/c/6726/7//COMMIT_MSG Commit Message: Line 7: IMPALA-4864 Speed up single slot predicates with dictionaries The JIRA seems to be talking about doing something slightly different: taking predicates of form (x OP constant), translating the constant into the codeword, then comparing dict_index directly to the codeword. That avoids any kind of dictionary or bitmap lookup and can be vectorized so can be very effective. There are various papers about this kind of trick - I can find links if you haven't seen them. I think maybe we need to split out this work into a separate JIRA. Your current approach applies to a much larger range of predicates so is also very useful. It would be good to think about whether the infrastructure will support that kind of optimisation in the future. I think doing a more column-oriented approach might make it more natural to do multiple passes over the column. Line 41: function per file format, so we would have to simulate a file format Yeah that codegen'd function map is a bit of a mess. http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner-ir.cc File be/src/exec/hdfs-parquet-scanner-ir.cc: Line 40 I believe this code was deliberately written like this to avoid the indirection via this->scratch_batch_ in the tight loop below - I think we should leave this alone. Line 77:tuple_index = scratch_batch_->NextValidTuple()) { It seems like this might be slow for the case when many or all of the tuples are valid, since NextSetBit() is fairly expensive compared to incrementing a counter. It's probably worth waiting to see performance numbers, but my intuition is that we'll need to make sure that advancing to the next bit is cheaper. http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 935 I'm not sure, but this hoisting may have been deliberate because the compiler couldn't hoist the load via scratch_batch_ out of the loop Line 208: dict_filters_active_.reset(new bool[scanner_conjunct_ctxs_->size()]); This probably works fine in practice but afaik scoped_ptr will call free instead of free[] on the array. I think unique_ptr does the right thing. Line 865: // Legacy impala files cannot be eliminated here, because the only way to I think we're missing an opportunity to apply the optimisation to columns with a mix of plain and dictionary-encoded pages. My understanding is that the pre-existing dictionary filtering optimisation only works when the whole column is dictionary-encoded, but your new optimisation can be applied on a page-by-page basis. I'm not sure how hard it is to restructure things to get it to work in that case. I think it's probably fairly low-impact because most columns will either be predominantly plain-encoded or predominantly dict-encoded, but we should add a comment so that the limitation is explicit. PS7, Line 1051: _ nit: extraneous _ http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: Line 436: /// XXX Why is this a scoped_ptr ? IIRC so we don't need to include the full header. http://gerrit.cloudera.org:8080/#/c/6726/7/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 317: bool ReadValueBatch(ScratchTupleBatch& scratch, int* num_values) { I think it would be best to avoid using the ScratchTupleBatch abstraction in this code - these functions are perf critical and it's easier to reason about performance when the abstraction is stripped away and the code is directly operating over contiguous arrays as much as possible. Maybe just pass in the bitmap directly? Line 408: if (IS_FILTERED && IS_DICT_ENCODED && This patch pushes the predicate evaluation down into ReadFilteredSlot() so that it's done value-by-value as they are materialised. I'm not convinced this is the best approach - I think we should consider peeling predicate evaluation out into a separate loop and doing it columnwise. This would mean that ReadSlot() or some variant would materialize the values and the codewords into an array, then we'd do another pass over the codeword array to evaluate the conjunctions. Or maybe materialize the codewords in one pass, then doing another pass to evaluate conjuncts, then another one to lookup the surviving values in the dictionary. I think t
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has uploaded a new patch set (#7). Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. IMPALA-4864 Speed up single slot predicates with dictionaries When dictionaries are present we can pre-evaluate conjuncts against the dictionary values and simply look up the result. Status of this diff: Compiles and starts. Bitmap tests for new functionality pass. Tests are broken due to some inadvertent change in the parquet column reader that causing file decode to break. Needs debugging, and there are definitely some bugs, but the exposition of the concept is now fully formed. Basic idea: since we codegen so early, before we know enough details about the columns to know if they are dict filterable, if we do have dictionary filtering predicates, we codegen a guard around each dictionary filterable predicate evaluation. This guard skips evaluation of the predicate if it has already been evaluated by the dictionary. In this way, we can skip evaluation dynamically for each row group as we learn which columns are dictionary filterable, and then push predicate evaluation into the column reader. Since the branches will remain 100% predictable over the row group, this should give us the fastest way to skip over predicate evaluation without compromising the general case where we may be unable to evaluate against the dictionary. We can even do this with codegen turned off, as a side effect of the way we generate the codegen'd function when dictionary evaluation is enabled. If dictionaries aren't available for some predicates, we automatically fall back to evaluating those predicates in the original order, preserving selectivity. The overhead in this case is a perfectly predictable extra conditional per dictionary predicate. We could codegen another version of the EvalConjuncts function without this overhead, but because of the complexity involved in doing so and the pain involved (ScanNodeBase assumes one codegen'd function per file format, so we would have to simulate a file format or some other awful hack). Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 --- M be/src/codegen/gen_ir_descriptions.py M be/src/common/global-flags.cc M be/src/exec/exec-node.cc M be/src/exec/exec-node.h M be/src/exec/hash-join-node.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-parquet-scanner-ir.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/exec/parquet-scratch-tuple-batch.h M be/src/exec/partitioned-hash-join-node.cc M be/src/util/bitmap-test.cc M be/src/util/bitmap.h M be/src/util/dict-encoding.h M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test 19 files changed, 561 insertions(+), 188 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/6726/7 -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has uploaded a new patch set (#6). Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. IMPALA-4864 Speed up single slot predicates with dictionaries When dictionaries are present we can pre-evaluate conjuncts against the dictionary values and simply look up the result. Status of this diff: Compiles and starts. Bitmap tests for new functionality pass. Tests are broken due to some inadvertent change in the parquet column reader that causing file decode to break. Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 --- M be/src/codegen/gen_ir_descriptions.py M be/src/common/global-flags.cc M be/src/exec/exec-node.cc M be/src/exec/exec-node.h M be/src/exec/hash-join-node.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-parquet-scanner-ir.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/exec/parquet-scratch-tuple-batch.h M be/src/exec/partitioned-hash-join-node.cc M be/src/util/bitmap-test.cc M be/src/util/bitmap.h M be/src/util/dict-encoding.h M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test 19 files changed, 554 insertions(+), 188 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/6726/6 -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has uploaded a new patch set (#5). Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. IMPALA-4864 Speed up single slot predicates with dictionaries When dictionaries are present we can pre-evaluate conjuncts against the dictionary values and simply look up the result. Status of this diff: Compiles and starts. Bitmap tests for new functionality pass. FE and BE tests passing. Query test now passing (run not finished). Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 --- M be/src/common/global-flags.cc M be/src/exec/hdfs-parquet-scanner-ir.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/exec/parquet-scratch-tuple-batch.h M be/src/util/bitmap-test.cc M be/src/util/bitmap.h M be/src/util/dict-encoding.h M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test 12 files changed, 446 insertions(+), 185 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/6726/5 -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Joe McDonnell has posted comments on this change. Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. Patch Set 4: (2 comments) A couple quick observations http://gerrit.cloudera.org:8080/#/c/6726/4/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: PS4, Line 1454: ); The front end orders conjuncts by selectivity and cost. When we pull them out and attach them to column materialization, the order is not preserved. If the conjunct is evaluated using the dictionary, this should be fine. If the conjunct is not evaluated from the dictionary, then it might result in a more expensive evaluation. To put numbers on it: Suppose there are two conjuncts A and B. A is expensive (cost = 10) and super selective (eliminates 0.99). B is cheap (cost = 1) and moderately selective (eliminates 0.50). The front end might put B first, so if B eliminates 50% of the row, then A is called 50% of the time to eliminate the rest. This has an amortized cost of 1 + 0.50 * 10 = 6, which is cheaper than calling A 100% of the time. We can reorder the materialization of the columns at runtime using knowledge of which columns are dictionary encoded and which aren't. PS4, Line 1467: endif It should be possible to do this up in HdfsScanNode. As an example, see extractKuduConjuncts in KuduScanNode. This pulls out conjuncts that will be evaluated by Kudu. -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden Gerrit-Reviewer: Joe McDonnell Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4864 Speed up single slot predicates with dictionaries
Zach Amsden has uploaded a new patch set (#4). Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries .. IMPALA-4864 Speed up single slot predicates with dictionaries When dictionaries are present we can pre-evaluate conjuncts against the dictionary values and simply look up the result. Status of this diff: Compiles and starts. Bitmap tests for new functionality pass. FE and BE tests passing. Going to rebase to try to get past a bug that is firing consistently on my branch as a few fixes have gone in. Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 --- M be/src/exec/hdfs-parquet-scanner-ir.cc M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/exec/parquet-scratch-tuple-batch.h M be/src/util/bitmap-test.cc M be/src/util/bitmap.h M be/src/util/dict-encoding.h M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test 11 files changed, 422 insertions(+), 182 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/6726/4 -- To view, visit http://gerrit.cloudera.org:8080/6726 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Zach Amsden