[Impala-ASF-CR] IMPALA-9829: Add Write Metrics for Spilling
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16083 ) Change subject: IMPALA-9829: Add Write Metrics for Spilling .. IMPALA-9829: Add Write Metrics for Spilling Currently, we have read metrics for spilling, in this patch, we add support for write metrics. The new metrics could be useful to measure the write operations and target performance issues when involving in spilling to remote disks(S3) (IMPALA-9828). The metrics added record the information includes: 1. write latency of each write operation to the disk, metric kind: HistogramMetric, unit: nanosecond. 2. write size of each write operation to the disk, metric kind: HistogramMetric, unit: Bytes. 3. number of write IO errors when writing to the disk, metric kind: IntCounter. Testing: * added DiskIoMgrTest.MetricsOfWriteSizeAndLatency * added DiskIoMgrTest.MetricsOfWriteIoError Ran unit test disk-io-mgr-test and pre-commit test Change-Id: I152b9c5339cedabe33f8873a2bbf651aa5dbb914 Reviewed-on: http://gerrit.cloudera.org:8080/16083 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins --- M be/src/runtime/io/disk-io-mgr-internal.h M be/src/runtime/io/disk-io-mgr-test.cc M be/src/runtime/io/disk-io-mgr.cc M be/src/runtime/io/disk-io-mgr.h M be/src/util/histogram-metric.h M common/thrift/metrics.json 6 files changed, 258 insertions(+), 15 deletions(-) Approvals: Impala Public Jenkins: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/16083 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I152b9c5339cedabe33f8873a2bbf651aa5dbb914 Gerrit-Change-Number: 16083 Gerrit-PatchSet: 16 Gerrit-Owner: Yida Wu Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Andrew Sherman Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Yida Wu
[Impala-ASF-CR] IMPALA-9829: Add Write Metrics for Spilling
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16083 ) Change subject: IMPALA-9829: Add Write Metrics for Spilling .. Patch Set 15: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/16083 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I152b9c5339cedabe33f8873a2bbf651aa5dbb914 Gerrit-Change-Number: 16083 Gerrit-PatchSet: 15 Gerrit-Owner: Yida Wu Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Andrew Sherman Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Yida Wu Gerrit-Comment-Date: Sat, 27 Jun 2020 03:54:04 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 ) Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans .. Patch Set 10: (9 comments) http://gerrit.cloudera.org:8080/#/c/16098/10//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16098/10//COMMIT_MSG@27 PS10, Line 27: configrations nit: typo http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@a1164 PS10, Line 1164: why remove this? http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1179 PS10, Line 1179: // If all partitions have good stats, return the total row count, contributed : // by all of them, as the row count for the table. is this the intended behavior? so if a user has a table with hundreds of partitions, adding a single partition with corrupt stats basically causes the stats estimation to fall back to the table-size based estimation? wouldn't a more reasonable behavior be if you see a partition with corrupt stats you fallback to the file-size based estimation just for that partition? http://gerrit.cloudera.org:8080/#/c/16098/10/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1220 PS10, Line 1220: if ( numRows > 0 ) : numRows = Math.min(numRows, estNumRows); : else : numRows = estNumRows; for any if statement thats is longer than a single line, you should add curly braces. so it should be: if (numRows > 0) { numRows = Math.min(numRows, estNumRows); } else { numRows = estNumRows; } you don't need the extra space in the if condition either http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py File tests/metadata/test_compute_stats.py: http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@180 PS10, Line 180: impala_9744 I know other methods do this, but I'm generally not a fan of including the JIRA number in a test method name. I think we can come up with a more descriptive name for this method. http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@194 PS10, Line 194: # nit remove http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@199 PS10, Line 199: CREATE TABLE {0}.{1} ( I think you could just use 'create table like parquet' rather than listing out all the columns. You can create the table in Impala and still modify it using the load data local inpath statement in Hive. http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@215 PS10, Line 215: set hive.stats.autogather=true; do you need this? http://gerrit.cloudera.org:8080/#/c/16098/10/tests/metadata/test_compute_stats.py@236 PS10, Line 236: # I think its probably best to create separate test methods for the partitioned and unpartitioned cases -- To view, visit http://gerrit.cloudera.org:8080/16098 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576 Gerrit-Change-Number: 16098 Gerrit-PatchSet: 10 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 27 Jun 2020 03:43:27 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9569: Show progress bar and live summary of the retried query
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16096 ) Change subject: IMPALA-9569: Show progress bar and live_summary of the retried query .. Patch Set 5: (2 comments) http://gerrit.cloudera.org:8080/#/c/16096/5/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/16096/5/be/src/service/impala-server.cc@759 PS5, Line 759: if (query_handle->IsRetriedQuery()) result->__set_error_logs(error_logs); is there a specific reason to set the error logs at the end of the method rather than inside the above if statement that adds the entries to error_logs? i guess one benefit is that if the Coordinator is null, then the error logs can still be exposed? might be good to add a code comment mentioning this because its not immediately obvious give that this if statement condition is the same as the above if statement condition http://gerrit.cloudera.org:8080/#/c/16096/5/shell/impala_shell.py File shell/impala_shell.py: http://gerrit.cloudera.org:8080/#/c/16096/5/shell/impala_shell.py@1100 PS5, Line 1100: if query_id_search and len(query_id_search.groups()) >= 1: the length of the group should only be 1, right? -- To view, visit http://gerrit.cloudera.org:8080/16096 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8f96919f00e0b64d589efd15b6b5ec82fb725d56 Gerrit-Change-Number: 16096 Gerrit-PatchSet: 5 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Sat, 27 Jun 2020 02:21:33 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6692: Trigger sort node run before hitting memory limit.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15963 ) Change subject: IMPALA-6692: Trigger sort node run before hitting memory limit. .. Patch Set 14: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6440/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/15963 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2a0ba7c4bae4f1d300d4d9d7f594f63ced06a240 Gerrit-Change-Number: 15963 Gerrit-PatchSet: 14 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 27 Jun 2020 02:11:36 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6692: Trigger sort node run before hitting memory limit.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15963 ) Change subject: IMPALA-6692: Trigger sort node run before hitting memory limit. .. Patch Set 13: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6439/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/15963 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2a0ba7c4bae4f1d300d4d9d7f594f63ced06a240 Gerrit-Change-Number: 15963 Gerrit-PatchSet: 13 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 27 Jun 2020 02:08:21 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6692: Trigger sort node run before hitting memory limit.
Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/15963 ) Change subject: IMPALA-6692: Trigger sort node run before hitting memory limit. .. Patch Set 14: (13 comments) http://gerrit.cloudera.org:8080/#/c/15963/12/be/src/exec/sort-node.h File be/src/exec/sort-node.h: http://gerrit.cloudera.org:8080/#/c/15963/12/be/src/exec/sort-node.h@77 PS12, Line 77: will > nit: "will go"? Done http://gerrit.cloudera.org:8080/#/c/15963/12/be/src/exec/sort-node.cc File be/src/exec/sort-node.cc: http://gerrit.cloudera.org:8080/#/c/15963/12/be/src/exec/sort-node.cc@89 PS12, Line 89: > Ok, in that case, we should just abandon estimate in case of cardinality is This is now changed to obtain full input estimate from planner. http://gerrit.cloudera.org:8080/#/c/15963/12/be/src/exec/sort-node.cc@90 PS12, Line 90: > I will look at possibility to access that average size data in the backend. I figure out how to get more precise estimate from planner. It is close enough to the actual input size that I observe in my test. http://gerrit.cloudera.org:8080/#/c/15963/12/be/src/exec/sort-node.cc@92 PS12, Line 92: & > I think that VLOG(3) is enough. Done http://gerrit.cloudera.org:8080/#/c/15963/12/be/src/runtime/sorter.h File be/src/runtime/sorter.h: http://gerrit.cloudera.org:8080/#/c/15963/12/be/src/runtime/sorter.h@101 PS12, Line 101: t > "the" would be better Done http://gerrit.cloudera.org:8080/#/c/15963/12/be/src/runtime/sorter.h@101 PS12, Line 101: /// 'estimated_input_size' is the total rows in bytes that are estimated to get added > nit: missing "are" Done http://gerrit.cloudera.org:8080/#/c/15963/12/be/src/runtime/sorter.h@102 PS12, Line 102: /// into this sorter. This is used to decide if sorter needs to proactively spill for > nit: needs Done http://gerrit.cloudera.org:8080/#/c/15963/12/be/src/runtime/sorter.h@102 PS12, Line 102: ively sp > nit: spill Done http://gerrit.cloudera.org:8080/#/c/15963/12/be/src/runtime/sorter.h@223 PS12, Line 223: do > nit: "do an"? Done http://gerrit.cloudera.org:8080/#/c/15963/12/be/src/runtime/sorter.cc File be/src/runtime/sorter.cc: http://gerrit.cloudera.org:8080/#/c/15963/12/be/src/runtime/sorter.cc@816 PS12, Line 816: PrettyPrinter::PrintBytes(estimated_input_size), > I think that VLOG(3) is enough here - this should happen if the cardinality Done http://gerrit.cloudera.org:8080/#/c/15963/12/be/src/runtime/sorter.cc@907 PS12, Line 907: PrettyPrinter::PrintBytes(GetSortRunBytesLimit())); > Same as line 816. Done http://gerrit.cloudera.org:8080/#/c/15963/12/common/thrift/ImpalaInternalService.thrift File common/thrift/ImpalaInternalService.thrift: http://gerrit.cloudera.org:8080/#/c/15963/12/common/thrift/ImpalaInternalService.thrift@645 PS12, Line 645: a join > typo: backends Done http://gerrit.cloudera.org:8080/#/c/15963/12/tests/query_test/test_sort.py File tests/query_test/test_sort.py: http://gerrit.cloudera.org:8080/#/c/15963/12/tests/query_test/test_sort.py@74 PS12, Line 74: """The first sort run is given a privilege to ignore sort_run_bytes_limit, except :when estimate hints that spill is inevitable. The lower sort_run_bytes_limit of :a query is, the more sort runs are likely to be produced and spilled. :Case 1 : 0 SpilledRuns, because all rows fit within the maximum reservation. : sort_run_bytes_limit is not enforced. :Case 2 : 3 SpilledRuns, because the first run hit reservation limit, and the : next 2 runs are capped to 150m. :Case 3 : 4 SpilledRuns, because sort node estimate that spill is inevitable. : So all runs are capped to 130m, including the first one.""" > I will look at that 'query_result.runtime_profile'. Done -- To view, visit http://gerrit.cloudera.org:8080/15963 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2a0ba7c4bae4f1d300d4d9d7f594f63ced06a240 Gerrit-Change-Number: 15963 Gerrit-PatchSet: 14 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 27 Jun 2020 01:50:13 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6692: Trigger sort node run before hitting memory limit.
Hello David Rorke, Tim Armstrong, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15963 to look at the new patch set (#14). Change subject: IMPALA-6692: Trigger sort node run before hitting memory limit. .. IMPALA-6692: Trigger sort node run before hitting memory limit. Sorter node works by adding row batches to a sort run. After all batches added to current unsorted run or memory limit is hit, sorter will immediately start the run. If the latter case happen, sorter will spill the sorted run to disk after sort complete, create new unsorted run object, and continue add the next row batches, and so on. This algorithm try to fit as much rows into memory before start sorting. However, in the case of partitioned sort with large number of row batches, fitting too much rows into memory will cause the sort to be slow and block the sorter node for a long time before it can release some memory and continue accepting the next row batch from exchange node. One slow sorter node can block exchange node from sending row batches to other sorter node that is free. This patch speedup the decision to start the sort without waiting it to hit memory limit first by capping the intermediary quicksort run to lower memory limit, determined by query option 'sort_run_bytes_limit'. If the total used reservation of quicksort has exceed sort_run_bytes_limit, current unsorted_run_ will be wrapped up, sorted, and then spilled. Thus, overlapping the next sort run with spill from previous sort run. To reduce regression for cases where total input size of sort node might be fully fitted into available memory, sort_run_bytes_limit will not be enforced for the first sort run. However, it will stay limited by sort_run_bytes_limit if planner estimates hint that spill is inevitably will happen. We also add new summary counter 'AddBatchTime' to get summary of how much time spent in Sorter::AddBatch. Max of 'AddBatchTime' indicate the longest time spent in Sorter::AddBatch, presumably busy doing intermediary sort. Testing: - Add new e2e test TestQueryFullSort::test_multiple_sort_run_bytes_limits - Run core tests - Run data loading of 3 largest TPC-DS facts table of 300GB scale into real cluster using 5 backends, and 4GB mem_limit. sort_run_bytes_limit is varied between unspecified (not limited) vs 512 MB. The performance result is summarized in the following table. +---+-+--+---+-+ | Insert table | #Rows | Avg | no limit| 512 MB limit | | | | SortDataSize ++--+-+---+ | | | per Node | Query | Max | Query | Max | | | | | Time | AddBatchTime | Time | AddBatchTime | +---+-+--++--+-+---+ | store_sales | 864.00M | 15.29 GB | 30m18s | 53s311ms | 20m | 5s634ms | +---+-+--++--+-+---+ | catalog_sales | 431.97M | 11.34 GB | 23m24s | 31s212ms | 15m27s | 3s603ms | +---+-+--++--+-+---+ | web_sales | 216.01M | 5.67 GB | 8m16s | 29s250ms | 6m41s | 3s856ms | +---+-+--++--+-+---+ Change-Id: I2a0ba7c4bae4f1d300d4d9d7f594f63ced06a240 --- M be/src/exec/sort-node.cc M be/src/exec/sort-node.h M be/src/runtime/coordinator-backend-state.cc M be/src/runtime/query-state.cc M be/src/runtime/query-state.h M be/src/runtime/sorter.cc M be/src/runtime/sorter.h M be/src/service/query-options-test.cc M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/SortNode.java M tests/query_test/test_sort.py 15 files changed, 225 insertions(+), 11 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/15963/14 -- To view, visit http://gerrit.cloudera.org:8080/15963 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2a0ba7c4bae4f1d300d4d9d7f594f63ced06a240 Gerrit-Change-Number: 15963 Gerrit-PatchSet: 14 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-6692: Trigger sort node run before hitting memory limit.
Hello David Rorke, Tim Armstrong, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15963 to look at the new patch set (#13). Change subject: IMPALA-6692: Trigger sort node run before hitting memory limit. .. IMPALA-6692: Trigger sort node run before hitting memory limit. Sorter node works by adding row batches to a sort run. After all batches added to current unsorted run or memory limit is hit, sorter will immediately start the run. If the latter case happen, sorter will spill the sorted run to disk after sort complete, create new unsorted run object, and continue add the next row batches, and so on. This algorithm try to fit as much rows into memory before start sorting. However, in the case of partitioned sort with large number of row batches, fitting too much rows into memory will cause the sort to be slow and block the sorter node for a long time before it can release some memory and continue accepting the next row batch from exchange node. One slow sorter node can block exchange node from sending row batches to other sorter node that is free. This patch speedup the decision to start the sort without waiting it to hit memory limit first by capping the intermediary quicksort run to lower memory limit, determined by query option 'sort_run_bytes_limit'. If the total used reservation of quicksort has exceed sort_run_bytes_limit, current unsorted_run_ will be wrapped up, sorted, and then spilled. Thus, overlapping the next sort run with spill from previous sort run. To reduce regression for cases where total input size of sort node might be fully fitted into available memory, sort_run_bytes_limit will not be enforced for the first sort run. However, it will stay limited by sort_run_bytes_limit if planner estimates hint that spill is inevitably will happen. We also add new summary counter 'AddBatchTime' to get summary of how much time spent in Sorter::AddBatch. Max of 'AddBatchTime' indicate the longest time spent in Sorter::AddBatch, presumably busy doing intermediary sort. Testing: - Add new e2e test TestQueryFullSort::test_multiple_sort_run_bytes_limits - Run core tests - Run data loading of 3 largest TPC-DS facts table of 300GB scale into real cluster using 5 backends, and 4GB mem_limit. sort_run_bytes_limit is varied between unspecified (not limited) vs 512 MB. The performance result is summarized in the following table. +---+-+--+---+-+ | Insert table | #Rows | Avg | no limit| 512 MB limit | | | | SortDataSize ++--+-+---+ | | | per Node | Query | Max | Query | Max | | | | | Time | AddBatchTime | Time | AddBatchTime | +---+-+--++--+-+---+ | store_sales | 864.00M | 15.29 GB | 30m18s | 53s311ms | 20m | 5s634ms | +---+-+--++--+-+---+ | catalog_sales | 431.97M | 11.34 GB | 23m24s | 31s212ms | 15m27s | 3s603ms | +---+-+--++--+-+---+ | web_sales | 216.01M | 5.67 GB | 8m16s | 29s250ms | 6m41s | 3s856ms | +---+-+--++--+-+---+ Change-Id: I2a0ba7c4bae4f1d300d4d9d7f594f63ced06a240 --- M be/src/exec/sort-node.cc M be/src/exec/sort-node.h M be/src/runtime/coordinator-backend-state.cc M be/src/runtime/query-state.cc M be/src/runtime/query-state.h M be/src/runtime/sorter.cc M be/src/runtime/sorter.h M be/src/service/query-options-test.cc M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/SortNode.java M tests/query_test/test_sort.py 15 files changed, 225 insertions(+), 11 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/15963/13 -- To view, visit http://gerrit.cloudera.org:8080/15963 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2a0ba7c4bae4f1d300d4d9d7f594f63ced06a240 Gerrit-Change-Number: 15963 Gerrit-PatchSet: 13 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: David Rorke Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-9902: add rewrite of TPC-DS q38
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16118 ) Change subject: IMPALA-9902: add rewrite of TPC-DS q38 .. Patch Set 1: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6438/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16118 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I81807683aa265a946729e15156bd2e33724103e1 Gerrit-Change-Number: 16118 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Fang-Yu Rao Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Sat, 27 Jun 2020 00:47:33 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9294: Support DATE for min-max runtime filter
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16103 ) Change subject: IMPALA-9294: Support DATE for min-max runtime filter .. Patch Set 4: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6437/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16103 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic2f6e2dc6949735d5f0fcf317361cc2969a5e82c Gerrit-Change-Number: 16103 Gerrit-PatchSet: 4 Gerrit-Owner: Wenzhe Zhou Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Sat, 27 Jun 2020 00:30:25 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9902: add rewrite of TPC-DS q38
Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16118 Change subject: IMPALA-9902: add rewrite of TPC-DS q38 .. IMPALA-9902: add rewrite of TPC-DS q38 I generated the query with dsqgen and then rewrote it to avoid intersect. Testing: Compared results to hive running the original version of the query. Change-Id: I81807683aa265a946729e15156bd2e33724103e1 --- A testdata/workloads/tpcds/queries/tpcds-q38.test M testdata/workloads/tpcds/queries/tpcds-q8.test M tests/query_test/test_tpcds_queries.py 3 files changed, 39 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/18/16118/1 -- To view, visit http://gerrit.cloudera.org:8080/16118 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I81807683aa265a946729e15156bd2e33724103e1 Gerrit-Change-Number: 16118 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong
[Impala-ASF-CR] IMPALA-9692 (part 3): Model QuerySchedule as a protobuf
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/15961 ) Change subject: IMPALA-9692 (part 3): Model QuerySchedule as a protobuf .. Patch Set 8: Code-Review+2 This looks good to me. -- To view, visit http://gerrit.cloudera.org:8080/15961 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1db64e72f84604b1d8ac24e0bdd4ad6bedd6bcd9 Gerrit-Change-Number: 15961 Gerrit-PatchSet: 8 Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Sat, 27 Jun 2020 00:10:24 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9294: Support DATE for min-max runtime filter
Wenzhe Zhou has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/16103 ) Change subject: IMPALA-9294: Support DATE for min-max runtime filter .. IMPALA-9294: Support DATE for min-max runtime filter Implemented Date min-max filter and applied it to Kudu as other min-max runtime filters. Added new test cases for Date min-max filters. Testing: Passed all core tests. Change-Id: Ic2f6e2dc6949735d5f0fcf317361cc2969a5e82c --- M be/src/codegen/gen_ir_descriptions.py M be/src/runtime/date-value.h M be/src/util/min-max-filter-ir.cc M be/src/util/min-max-filter-test.cc M be/src/util/min-max-filter.cc M be/src/util/min-max-filter.h M common/protobuf/common.proto M common/thrift/Data.thrift M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java M testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test M testdata/workloads/functional-query/queries/QueryTest/all_runtime_filters.test M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test 12 files changed, 264 insertions(+), 167 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/16103/4 -- To view, visit http://gerrit.cloudera.org:8080/16103 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic2f6e2dc6949735d5f0fcf317361cc2969a5e82c Gerrit-Change-Number: 16103 Gerrit-PatchSet: 4 Gerrit-Owner: Wenzhe Zhou Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Wenzhe Zhou
[Impala-ASF-CR] IMPALA-9294: Support DATE for min-max runtime filter
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16103 ) Change subject: IMPALA-9294: Support DATE for min-max runtime filter .. Patch Set 3: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6060/ -- To view, visit http://gerrit.cloudera.org:8080/16103 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic2f6e2dc6949735d5f0fcf317361cc2969a5e82c Gerrit-Change-Number: 16103 Gerrit-PatchSet: 3 Gerrit-Owner: Wenzhe Zhou Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Fri, 26 Jun 2020 23:13:11 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9692 (part 3): Model QuerySchedule as a protobuf
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15961 ) Change subject: IMPALA-9692 (part 3): Model QuerySchedule as a protobuf .. Patch Set 8: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/15961 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1db64e72f84604b1d8ac24e0bdd4ad6bedd6bcd9 Gerrit-Change-Number: 15961 Gerrit-PatchSet: 8 Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 26 Jun 2020 23:08:20 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9829: Add Write Metrics for Spilling
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16083 ) Change subject: IMPALA-9829: Add Write Metrics for Spilling .. Patch Set 15: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16083 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I152b9c5339cedabe33f8873a2bbf651aa5dbb914 Gerrit-Change-Number: 16083 Gerrit-PatchSet: 15 Gerrit-Owner: Yida Wu Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Andrew Sherman Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Yida Wu Gerrit-Comment-Date: Fri, 26 Jun 2020 23:06:18 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9829: Add Write Metrics for Spilling
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16083 ) Change subject: IMPALA-9829: Add Write Metrics for Spilling .. Patch Set 15: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6063/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16083 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I152b9c5339cedabe33f8873a2bbf651aa5dbb914 Gerrit-Change-Number: 16083 Gerrit-PatchSet: 15 Gerrit-Owner: Yida Wu Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Andrew Sherman Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Yida Wu Gerrit-Comment-Date: Fri, 26 Jun 2020 23:06:19 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9213: Add query retry info to GetLog result
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16093 ) Change subject: IMPALA-9213: Add query retry info to GetLog result .. Patch Set 6: (5 comments) http://gerrit.cloudera.org:8080/#/c/16093/6/be/src/service/impala-beeswax-server.cc File be/src/service/impala-beeswax-server.cc: http://gerrit.cloudera.org:8080/#/c/16093/6/be/src/service/impala-beeswax-server.cc@327 PS6, Line 327: error_log_ss << Substitute(GET_LOG_QUERY_RETRY_INFO_FORMAT, might be worth adding a DCHECK right before this line asserting that '!original_query_handle->query_status()' http://gerrit.cloudera.org:8080/#/c/16093/6/be/src/service/impala-hs2-server.cc File be/src/service/impala-hs2-server.cc: http://gerrit.cloudera.org:8080/#/c/16093/6/be/src/service/impala-hs2-server.cc@939 PS6, Line 939: ss << Substitute(GET_LOG_QUERY_RETRY_INFO_FORMAT, same comment as the beeswax version http://gerrit.cloudera.org:8080/#/c/16093/6/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/16093/6/be/src/service/impala-server.cc@366 PS6, Line 366: \n don't think this is necessary? it looks like it is adding an extra new line in the error output: WARNINGS: Original query failed: Failed due to unreachable impalad(s): stakiar-desktop:22002 Query has been retried using query id: c84bd698dc98fac2:e79c78d1 Unless that was intentional? It looks like Status::GetDetail --> ErrorMsg::GetFullMessageDetails which always has a newline at the end of the stringstream http://gerrit.cloudera.org:8080/#/c/16093/6/shell/impala_client.py File shell/impala_client.py: http://gerrit.cloudera.org:8080/#/c/16093/6/shell/impala_client.py@337 PS6, Line 337: get_query_link_callback instead of piping this through all these methods as a callback, can we just move the 'get_query_link' method into a util class and just import it and use it directly here and in impala_shell.py? you might be able to just add it as a utility method in impala_client.py and import it in impala_shell.py. I guess there isn't a common util class. http://gerrit.cloudera.org:8080/#/c/16093/6/shell/impala_shell.py File shell/impala_shell.py: http://gerrit.cloudera.org:8080/#/c/16093/6/shell/impala_shell.py@1072 PS6, Line 1072: _get_query_link_callback nit: add some method code comments, rename to '_get_query_link' since it isn't always used as a callback (e.g. in _periodic_wait_callback) -- To view, visit http://gerrit.cloudera.org:8080/16093 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I58cf94f91a0b92eb9a3088bee3894ac157a954dc Gerrit-Change-Number: 16093 Gerrit-PatchSet: 6 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Fri, 26 Jun 2020 22:51:37 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9829: Add Write Metrics for Spilling
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16083 ) Change subject: IMPALA-9829: Add Write Metrics for Spilling .. Patch Set 14: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6058/ -- To view, visit http://gerrit.cloudera.org:8080/16083 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I152b9c5339cedabe33f8873a2bbf651aa5dbb914 Gerrit-Change-Number: 16083 Gerrit-PatchSet: 14 Gerrit-Owner: Yida Wu Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Andrew Sherman Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Yida Wu Gerrit-Comment-Date: Fri, 26 Jun 2020 21:47:17 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9692 (part 3): Model QuerySchedule as a protobuf
Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/15961 ) Change subject: IMPALA-9692 (part 3): Model QuerySchedule as a protobuf .. Patch Set 8: Code-Review+1 (2 comments) http://gerrit.cloudera.org:8080/#/c/15961/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15961/6//COMMIT_MSG@42 PS6, Line 42: get > Seems right to me? Done http://gerrit.cloudera.org:8080/#/c/15961/4/be/src/scheduling/query-schedule.h File be/src/scheduling/query-schedule.h: http://gerrit.cloudera.org:8080/#/c/15961/4/be/src/scheduling/query-schedule.h@58 PS4, Line 58: er > Great, I'll do it in a follow up JIRA? -- To view, visit http://gerrit.cloudera.org:8080/15961 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1db64e72f84604b1d8ac24e0bdd4ad6bedd6bcd9 Gerrit-Change-Number: 15961 Gerrit-PatchSet: 8 Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 26 Jun 2020 21:40:33 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9294: Support DATE for min-max runtime filter
Wenzhe Zhou has posted comments on this change. ( http://gerrit.cloudera.org:8080/16103 ) Change subject: IMPALA-9294: Support DATE for min-max runtime filter .. Patch Set 3: (6 comments) http://gerrit.cloudera.org:8080/#/c/16103/3/be/src/runtime/date-value.h File be/src/runtime/date-value.h: http://gerrit.cloudera.org:8080/#/c/16103/3/be/src/runtime/date-value.h@179 PS3, Line 179: pvalue > Do we need to worry pvalue being nullptr? No. The caller get pvalue by calling function UpdateFilterParamsPB::mutable_min_max_filter(), which will create a impala::MinMaxFilterPB object if the object has not been created yet. http://gerrit.cloudera.org:8080/#/c/16103/3/be/src/util/min-max-filter-test.cc File be/src/util/min-max-filter-test.cc: http://gerrit.cloudera.org:8080/#/c/16103/3/be/src/util/min-max-filter-test.cc@371 PS3, Line 371: DATE_TIME_CHECK_VALS > Maybe rename it as CHECK_NON_EMPTY_FILTER. It's a micro for CheckTimestampVals and CheckDateVals. I try to keep the naming consistent. http://gerrit.cloudera.org:8080/#/c/16103/3/be/src/util/min-max-filter.h File be/src/util/min-max-filter.h: http://gerrit.cloudera.org:8080/#/c/16103/3/be/src/util/min-max-filter.h@220 PS3, Line 220: > min_ and max_ are not inited in the cstr. Don't need to initialize min_ and max_ in constructor since always_false_ is set as true which means no rows have been inserted. min_ and max_ will be overwrote when the first row is inserted. http://gerrit.cloudera.org:8080/#/c/16103/3/be/src/util/min-max-filter.cc File be/src/util/min-max-filter.cc: http://gerrit.cloudera.org:8080/#/c/16103/3/be/src/util/min-max-filter.cc@368 PS3, Line 368: in > I wonder if In can be empty? No. There is just one caller for this function. It's called only if 'in' has the value in MinMaxFilter::Or(). http://gerrit.cloudera.org:8080/#/c/16103/3/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test File testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test: http://gerrit.cloudera.org:8080/#/c/16103/3/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test@167 PS3, Line 167: Date type of column for > Maybe "Query with date-typed join column"? Agree http://gerrit.cloudera.org:8080/#/c/16103/3/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test@168 PS3, Line 168: functional_kudu.date_tbl a : join [BROADCAST] functional_kudu.date_tbl b > It may be useful to include some additional join queries: There is 3-way join test case for integer-typed join columns. Since the min-max filter generation function don't check column type, the integer-typed test case should be enough to cover the logic. Also I don't see 3-way join for other type, like timestamp, string, decimal, etc. -- To view, visit http://gerrit.cloudera.org:8080/16103 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic2f6e2dc6949735d5f0fcf317361cc2969a5e82c Gerrit-Change-Number: 16103 Gerrit-PatchSet: 3 Gerrit-Owner: Wenzhe Zhou Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Fri, 26 Jun 2020 20:34:16 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9515: Full ACID Milestone 3: Read support for "original files"
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16001 ) Change subject: IMPALA-9515: Full ACID Milestone 3: Read support for "original files" .. Patch Set 13: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6056/ -- To view, visit http://gerrit.cloudera.org:8080/16001 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I176497ef9873ed7589bd3dee07d048a42dfad953 Gerrit-Change-Number: 16001 Gerrit-PatchSet: 13 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 26 Jun 2020 19:52:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9790: option to use resolved hostname everywhere
Tim Armstrong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16108 ) Change subject: IMPALA-9790: option to use resolved hostname everywhere .. IMPALA-9790: option to use resolved hostname everywhere This adds a flag --use_resolved_hostname, which replaces --hostname with a resolved IP on startup. This is useful for containerized environments where the hostname -> IP mapping can be very dynamic. This flag is used by default in the dockerized minicluster. This also fixes a bug in the test code that incorrectly identified command line flags. Specifically it only checked the suffix, so it confused use_resolved_hostname and hostname. Change-Id: I0d5cb9c68c60ce8dc838cde9dcf1c590017f5c9a Reviewed-on: http://gerrit.cloudera.org:8080/16108 Tested-by: Impala Public Jenkins Reviewed-by: Andrew Sherman --- M be/src/common/global-flags.cc M be/src/common/init.cc M docker/catalogd/Dockerfile M docker/impalad_coord_exec/Dockerfile M docker/impalad_coordinator/Dockerfile M docker/impalad_executor/Dockerfile M docker/statestored/Dockerfile M tests/common/impala_cluster.py 8 files changed, 23 insertions(+), 6 deletions(-) Approvals: Impala Public Jenkins: Verified Andrew Sherman: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/16108 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I0d5cb9c68c60ce8dc838cde9dcf1c590017f5c9a Gerrit-Change-Number: 16108 Gerrit-PatchSet: 3 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Andrew Sherman Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-9784: Non correlated subqueries in HAVING.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16052 ) Change subject: IMPALA-9784: Non correlated subqueries in HAVING. .. Patch Set 2: (8 comments) I think this makes sense, doing the nested query thing and converting it to a where clause does simplify it a lot. http://gerrit.cloudera.org:8080/#/c/16052/2/fe/src/main/java/org/apache/impala/analysis/StmtRewriter.java File fe/src/main/java/org/apache/impala/analysis/StmtRewriter.java: http://gerrit.cloudera.org:8080/#/c/16052/2/fe/src/main/java/org/apache/impala/analysis/StmtRewriter.java@465 PS2, Line 465: // TODO: Remove this when independent subquery evaluation is implemented. Can we add a simple planner test for this rewrite? With the cardinality check node added. http://gerrit.cloudera.org:8080/#/c/16052/2/fe/src/main/java/org/apache/impala/analysis/StmtRewriter.java@466 PS2, Line 466: // TODO: IMPALA-5100 to cover all cases, we do let through runtime scalars with > I relaxed some of these rules to let through subqueries such as (select cou It might be helpful to file a separate JIRA with an example of this query, just so we have some record that we can run some new queries. Some simple things like: select * from functional.alltypes where id < (select count(bool_col) from functional.alltypes where int_col=1 group by int_col); http://gerrit.cloudera.org:8080/#/c/16052/2/fe/src/main/java/org/apache/impala/analysis/StmtRewriter.java@469 PS2, Line 469: presensce nit: presence http://gerrit.cloudera.org:8080/#/c/16052/2/testdata/workloads/functional-query/queries/QueryTest/subquery.test File testdata/workloads/functional-query/queries/QueryTest/subquery.test: http://gerrit.cloudera.org:8080/#/c/16052/2/testdata/workloads/functional-query/queries/QueryTest/subquery.test@1218 PS2, Line 1218: from functional.alltypestiny group by id nit: remove the functional database name here and below - the test framework should set the current db based on the file format in the test matrix.. A lot of other queries here don't follow the best practice. http://gerrit.cloudera.org:8080/#/c/16052/2/testdata/workloads/tpcds/queries/tpcds-q23-1.test File testdata/workloads/tpcds/queries/tpcds-q23-1.test: http://gerrit.cloudera.org:8080/#/c/16052/2/testdata/workloads/tpcds/queries/tpcds-q23-1.test@52 PS2, Line 52: RESULTS Compared to https://github.com/gregrahn/tpcds-kit/blob/master/answer_sets/23_NULLS_LAST.ans http://gerrit.cloudera.org:8080/#/c/16052/2/testdata/workloads/tpcds/queries/tpcds-q23-2.test File testdata/workloads/tpcds/queries/tpcds-q23-2.test: http://gerrit.cloudera.org:8080/#/c/16052/2/testdata/workloads/tpcds/queries/tpcds-q23-2.test@59 PS2, Line 59: 'Brown','Monika',6031.52 Compared to https://github.com/gregrahn/tpcds-kit/blob/master/answer_sets/23_NULLS_LAST.ans http://gerrit.cloudera.org:8080/#/c/16052/2/testdata/workloads/tpcds/queries/tpcds-q24-2.test File testdata/workloads/tpcds/queries/tpcds-q24-2.test: http://gerrit.cloudera.org:8080/#/c/16052/2/testdata/workloads/tpcds/queries/tpcds-q24-2.test@53 PS2, Line 53: 'Hamlin','Heather','able',149.65 Compared to https://github.com/gregrahn/tpcds-kit/blob/master/answer_sets/24.ans http://gerrit.cloudera.org:8080/#/c/16052/2/testdata/workloads/tpcds/queries/tpcds-q44.test File testdata/workloads/tpcds/queries/tpcds-q44.test: http://gerrit.cloudera.org:8080/#/c/16052/2/testdata/workloads/tpcds/queries/tpcds-q44.test@51 PS2, Line 51: 1,'oughtantiprin st','callyeingbarcallyought' Compared this to https://github.com/gregrahn/tpcds-kit/blob/master/answer_sets/44.ans -- To view, visit http://gerrit.cloudera.org:8080/16052 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I124a58a09a1a47e1222a22d84b54fe7d07844461 Gerrit-Change-Number: 16052 Gerrit-PatchSet: 2 Gerrit-Owner: Shant Hovsepian Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Shant Hovsepian Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 26 Jun 2020 19:35:58 + Gerrit-HasComments: Yes
[Impala-ASF-CR] [WIP] IMPALA-9898: Plan generation and execution for grouping sets
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16115 ) Change subject: [WIP] IMPALA-9898: Plan generation and execution for grouping sets .. Patch Set 1: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6436/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16115 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Id474c5373860b0d8014ee9c844a3fb90092be968 Gerrit-Change-Number: 16115 Gerrit-PatchSet: 1 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Fri, 26 Jun 2020 19:31:32 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9294: Support DATE for min-max runtime filter
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16103 ) Change subject: IMPALA-9294: Support DATE for min-max runtime filter .. Patch Set 3: (6 comments) Looks good to me. Have some comments. Thanks! http://gerrit.cloudera.org:8080/#/c/16103/3/be/src/runtime/date-value.h File be/src/runtime/date-value.h: http://gerrit.cloudera.org:8080/#/c/16103/3/be/src/runtime/date-value.h@179 PS3, Line 179: pvalue Do we need to worry pvalue being nullptr? http://gerrit.cloudera.org:8080/#/c/16103/3/be/src/util/min-max-filter-test.cc File be/src/util/min-max-filter-test.cc: http://gerrit.cloudera.org:8080/#/c/16103/3/be/src/util/min-max-filter-test.cc@371 PS3, Line 371: DATE_TIME_CHECK_VALS Maybe rename it as CHECK_NON_EMPTY_FILTER. http://gerrit.cloudera.org:8080/#/c/16103/3/be/src/util/min-max-filter.h File be/src/util/min-max-filter.h: http://gerrit.cloudera.org:8080/#/c/16103/3/be/src/util/min-max-filter.h@220 PS3, Line 220: min_ and max_ are not inited in the cstr. http://gerrit.cloudera.org:8080/#/c/16103/3/be/src/util/min-max-filter.cc File be/src/util/min-max-filter.cc: http://gerrit.cloudera.org:8080/#/c/16103/3/be/src/util/min-max-filter.cc@368 PS3, Line 368: in I wonder if In can be empty? http://gerrit.cloudera.org:8080/#/c/16103/3/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test File testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test: http://gerrit.cloudera.org:8080/#/c/16103/3/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test@167 PS3, Line 167: Date type of column for Maybe "Query with date-typed join column"? http://gerrit.cloudera.org:8080/#/c/16103/3/testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test@168 PS3, Line 168: functional_kudu.date_tbl a : join [BROADCAST] functional_kudu.date_tbl b It may be useful to include some additional join queries: 1. 3-way join on date-typed join columns; 2. 3-way join, one join column on date, and another join column on a non-date, such as int etc. -- To view, visit http://gerrit.cloudera.org:8080/16103 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic2f6e2dc6949735d5f0fcf317361cc2969a5e82c Gerrit-Change-Number: 16103 Gerrit-PatchSet: 3 Gerrit-Owner: Wenzhe Zhou Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Fri, 26 Jun 2020 19:24:01 + Gerrit-HasComments: Yes
[Impala-ASF-CR] [WIP] IMPALA-9898: Plan generation and execution for grouping sets
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16115 ) Change subject: [WIP] IMPALA-9898: Plan generation and execution for grouping sets .. Patch Set 1: (8 comments) http://gerrit.cloudera.org:8080/#/c/16115/1/fe/src/main/java/org/apache/impala/analysis/MultiAggregateInfo.java File fe/src/main/java/org/apache/impala/analysis/MultiAggregateInfo.java: http://gerrit.cloudera.org:8080/#/c/16115/1/fe/src/main/java/org/apache/impala/analysis/MultiAggregateInfo.java@355 PS1, Line 355: int outputSlotIdx = groupingExprs_.size() + 1; // add 1 to account for the tuple id column line too long (94 > 90) http://gerrit.cloudera.org:8080/#/c/16115/1/fe/src/main/java/org/apache/impala/analysis/MultiAggregateInfo.java@357 PS1, Line 357: outputSmap_.put(aggExprs_.get(i).clone(), new SlotRef(outputSlots.get(outputSlotIdx))); line too long (93 > 90) http://gerrit.cloudera.org:8080/#/c/16115/1/fe/src/main/java/org/apache/impala/analysis/MultiAggregateInfo.java@452 PS1, Line 452:* Note that all classes have the same aggregate output exprs but different grouping exprs. line too long (93 > 90) http://gerrit.cloudera.org:8080/#/c/16115/1/fe/src/main/java/org/apache/impala/analysis/MultiAggregateInfo.java@523 PS1, Line 523: CaseExpr caseExpr = new CaseExpr(new ValidTupleIdExpr(aggTids), caseWhenClauses, null); line too long (93 > 90) http://gerrit.cloudera.org:8080/#/c/16115/1/fe/src/main/java/org/apache/impala/analysis/MultiAggregateInfo.java@551 PS1, Line 551: CaseExpr caseExpr = new CaseExpr(new ValidTupleIdExpr(aggTids), caseWhenClauses, null); line too long (93 > 90) http://gerrit.cloudera.org:8080/#/c/16115/1/fe/src/main/java/org/apache/impala/analysis/MultiAggregateInfo.java@564 PS1, Line 564: CaseExpr caseExprForTids = new CaseExpr(new ValidTupleIdExpr(aggTids), caseWhenClauses, null); line too long (98 > 90) http://gerrit.cloudera.org:8080/#/c/16115/1/fe/src/main/java/org/apache/impala/analysis/MultiAggregateInfo.java@595 PS1, Line 595: CaseExpr caseExpr = new CaseExpr(new ValidTupleIdExpr(aggTids), caseWhenClauses, null); line too long (93 > 90) http://gerrit.cloudera.org:8080/#/c/16115/1/fe/src/main/java/org/apache/impala/analysis/MultiAggregateInfo.java@597 PS1, Line 597: // wrap this in an aggif expr because an AggregateInfo only allows either aggregate exprs line too long (95 > 90) -- To view, visit http://gerrit.cloudera.org:8080/16115 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Id474c5373860b0d8014ee9c844a3fb90092be968 Gerrit-Change-Number: 16115 Gerrit-PatchSet: 1 Gerrit-Owner: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Fri, 26 Jun 2020 18:54:50 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9790: option to use resolved hostname everywhere
Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/16108 ) Change subject: IMPALA-9790: option to use resolved hostname everywhere .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16108 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0d5cb9c68c60ce8dc838cde9dcf1c590017f5c9a Gerrit-Change-Number: 16108 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Andrew Sherman Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Comment-Date: Fri, 26 Jun 2020 18:54:18 + Gerrit-HasComments: No
[Impala-ASF-CR] [WIP] IMPALA-9898: Plan generation and execution for grouping sets
Aman Sinha has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16115 Change subject: [WIP] IMPALA-9898: Plan generation and execution for grouping sets .. [WIP] IMPALA-9898: Plan generation and execution for grouping sets This patch enhances the MultiAggregateInfo to handle grouping sets and rollup (which is converted to grouping sets). This patch does not itself do parsing/validation of grouping sets syntax but rather provides the following supporting functionality: - A separate analyze method that accepts aggregation classes and aggregation info that have been created separately (TODO: this part will be added in another patch). - A modified Transpose phase that uses combination of aggif(), valid_tid() functions and CASE exprs to choose exactly which slots from the underlying aggregate classes need to be output based on the tuple id. - Modified materialization step where all aggregate slots and grouping slots are materialized in case of grouping sets. - Creates grouping_id value for grouping sets. The grouping_id function in SQL describes which expression is grouped-by in a particular row of a query with grouping sets. Testing: This patch is not individually testable but will be tested as part of the overall grouping set support. Change-Id: Id474c5373860b0d8014ee9c844a3fb90092be968 --- M fe/src/main/java/org/apache/impala/analysis/MultiAggregateInfo.java 1 file changed, 260 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/15/16115/1 -- To view, visit http://gerrit.cloudera.org:8080/16115 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Id474c5373860b0d8014ee9c844a3fb90092be968 Gerrit-Change-Number: 16115 Gerrit-PatchSet: 1 Gerrit-Owner: Aman Sinha
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 ) Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans .. Patch Set 10: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6435/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16098 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576 Gerrit-Change-Number: 16098 Gerrit-PatchSet: 10 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 26 Jun 2020 18:45:22 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Qifan Chen has uploaded a new patch set (#10). ( http://gerrit.cloudera.org:8080/16098 ) Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans .. IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans This work addresses the current limitation in computing the total row count for a Hive table. The row count can be incorrectly computed as 0, even though there exists data in partitions of the Hive table. In the fix, as long as no partition in a Hive table exhibits any stats corruptions including the kind described above, the total row count for the table is computed from the row counts in all partitions. Otherwise, Impala estimates the total row count from the total size of the partitions involved and the row width, if feasible. With the fix and in the explain output for some scan node, the cardinality for a table with corrupted stats will be reported as a positive value (the estimated row count), instead of 'unavailable'. The table is still listed in the WARNING section for potentially corrupt table statistics, at the beginning of the explain output. Testing: 1. Ran unit tests with queries documented in the case against Hive tables with the following configrations: a. No stats corruption in any partitions b. Stats corruption in some partitions c. Stats corruption in all partitions 2. Added a new test in test_compute_stats.py: test_corrupted_stats_impala_9744 3. Fixed failures in corrupt-stats.test 4. Ran "core" test Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576 --- M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/workloads/functional-query/queries/QueryTest/corrupt-stats.test M tests/metadata/test_compute_stats.py 3 files changed, 146 insertions(+), 28 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/16098/10 -- To view, visit http://gerrit.cloudera.org:8080/16098 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576 Gerrit-Change-Number: 16098 Gerrit-PatchSet: 10 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 ) Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans .. Patch Set 9: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6434/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16098 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576 Gerrit-Change-Number: 16098 Gerrit-PatchSet: 9 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 26 Jun 2020 18:12:28 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9294: Support DATE for min-max runtime filter
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16103 ) Change subject: IMPALA-9294: Support DATE for min-max runtime filter .. Patch Set 3: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6060/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16103 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic2f6e2dc6949735d5f0fcf317361cc2969a5e82c Gerrit-Change-Number: 16103 Gerrit-PatchSet: 3 Gerrit-Owner: Wenzhe Zhou Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Fri, 26 Jun 2020 18:05:49 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9692 (part 3): Model QuerySchedule as a protobuf
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15961 ) Change subject: IMPALA-9692 (part 3): Model QuerySchedule as a protobuf .. Patch Set 8: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6059/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/15961 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1db64e72f84604b1d8ac24e0bdd4ad6bedd6bcd9 Gerrit-Change-Number: 15961 Gerrit-PatchSet: 8 Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 26 Jun 2020 18:02:47 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9829: Add Write Metrics for Spilling
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16083 ) Change subject: IMPALA-9829: Add Write Metrics for Spilling .. Patch Set 14: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6058/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16083 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I152b9c5339cedabe33f8873a2bbf651aa5dbb914 Gerrit-Change-Number: 16083 Gerrit-PatchSet: 14 Gerrit-Owner: Yida Wu Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Andrew Sherman Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Yida Wu Gerrit-Comment-Date: Fri, 26 Jun 2020 17:57:24 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9829: Add Write Metrics for Spilling
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16083 ) Change subject: IMPALA-9829: Add Write Metrics for Spilling .. Patch Set 14: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16083 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I152b9c5339cedabe33f8873a2bbf651aa5dbb914 Gerrit-Change-Number: 16083 Gerrit-PatchSet: 14 Gerrit-Owner: Yida Wu Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Andrew Sherman Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Yida Wu Gerrit-Comment-Date: Fri, 26 Jun 2020 17:57:23 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9692 (part 3): Model QuerySchedule as a protobuf
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15961 ) Change subject: IMPALA-9692 (part 3): Model QuerySchedule as a protobuf .. Patch Set 8: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6433/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/15961 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1db64e72f84604b1d8ac24e0bdd4ad6bedd6bcd9 Gerrit-Change-Number: 15961 Gerrit-PatchSet: 8 Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 26 Jun 2020 17:51:39 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Qifan Chen has uploaded a new patch set (#9). ( http://gerrit.cloudera.org:8080/16098 ) Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans .. IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans This work addresses the current limitation in computing the total row count for a Hive table. The row count can be incorrectly computed as 0, even though there exists data in partitions of the Hive table. In the fix, as long as no partition in a Hive table exhibits any stats corruptions including the kind described above, the total row count for the table is computed from the row counts in all partitions. Otherwise, Impala estimates the total row count from the total size of the partitions involved and the row width, if feasible. With the fix and in the explain output for some scan node, the cardinality for a table with corrupted stats will be reported as a positive value (the estimated row count), instead of 'unavailable'. The table is still listed in the WARNING section for potentially corrupt table statistics, at the beginning of the explain output. Testing: 1. Ran unit tests with queries documented in the case against Hive tables with the following configrations: a. No stats corruption in any partitions b. Stats corruption in some partitions c. Stats corruption in all partitions 2. Added a new test in test_compute_stats.py: test_corrupted_stats_impala_9744 3. Fixed failures in corrupt-stats.test 4. Ran "core" test Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576 --- M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M testdata/workloads/functional-query/queries/QueryTest/corrupt-stats.test M tests/metadata/test_compute_stats.py 3 files changed, 153 insertions(+), 28 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/16098/9 -- To view, visit http://gerrit.cloudera.org:8080/16098 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576 Gerrit-Change-Number: 16098 Gerrit-PatchSet: 9 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16098 ) Change subject: IMPALA-9744: Treat corrupt table stats as missing to avoid bad plans .. Patch Set 9: (3 comments) http://gerrit.cloudera.org:8080/#/c/16098/9/tests/metadata/test_compute_stats.py File tests/metadata/test_compute_stats.py: http://gerrit.cloudera.org:8080/#/c/16098/9/tests/metadata/test_compute_stats.py@198 PS9, Line 198: flake8: W291 trailing whitespace http://gerrit.cloudera.org:8080/#/c/16098/9/tests/metadata/test_compute_stats.py@198 PS9, Line 198: # autogathering. The partition stats is corrupted: row count=0 and row line has trailing whitespace http://gerrit.cloudera.org:8080/#/c/16098/9/tests/metadata/test_compute_stats.py@240 PS9, Line 240: # flake8: E303 too many blank lines (2) -- To view, visit http://gerrit.cloudera.org:8080/16098 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9f4c64616ff7c0b6d5a48f2b5331325feeff3576 Gerrit-Change-Number: 16098 Gerrit-PatchSet: 9 Gerrit-Owner: Qifan Chen Gerrit-Reviewer: Aman Sinha Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Qifan Chen Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 26 Jun 2020 17:50:26 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9829: Add Write Metrics for Spilling
Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/16083 ) Change subject: IMPALA-9829: Add Write Metrics for Spilling .. Patch Set 13: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16083 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I152b9c5339cedabe33f8873a2bbf651aa5dbb914 Gerrit-Change-Number: 16083 Gerrit-PatchSet: 13 Gerrit-Owner: Yida Wu Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Andrew Sherman Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Yida Wu Gerrit-Comment-Date: Fri, 26 Jun 2020 17:46:16 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9294: Support DATE for min-max runtime filter
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16103 ) Change subject: IMPALA-9294: Support DATE for min-max runtime filter .. Patch Set 3: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6432/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16103 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic2f6e2dc6949735d5f0fcf317361cc2969a5e82c Gerrit-Change-Number: 16103 Gerrit-PatchSet: 3 Gerrit-Owner: Wenzhe Zhou Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Fri, 26 Jun 2020 17:31:12 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9692 (part 3): Model QuerySchedule as a protobuf
Hello Sahil Takiar, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15961 to look at the new patch set (#8). Change subject: IMPALA-9692 (part 3): Model QuerySchedule as a protobuf .. IMPALA-9692 (part 3): Model QuerySchedule as a protobuf In order to support the new admission control service, we need to be able to return the results of an admission attempt, i.e. the query schedule, to the coordinator. To enable this, this patch moves all parts of the QuerySchedule class and related classes that are required by the coordinator into a new message QuerySchedulePB. The main admission control interface, SubmitForAdmission(), now returns a QuerySchedulePB. Some notable changes: - Previously, QuerySchedule was used by Coordinator as a way to pass around references to parts of the TExecRequest to places like Coordinator::ExecSummary and Coordinator::BackendState. This has been replaced with the ExecParams class, which is a container for references to the TExecRequest and QuerySchedulePB along with convenience functions for accessing them. - Similarly, FragmentExecParams, which is part of QuerySchedule, contains references to the associated TPlanFragment, owned by the TExecRequest, which were used by the Coordinator when iterating over the schedule to initiate the query. Since FragmentExecParamsPB can't contain these references, they were replaced by a map between fragment idx and TPlanFragment in ExecParams. - In order to keep payloads reasonable for the eventual RPC interface, AdmissionController::ReleaseQuery() and ReleaseQueryBackend() now take a query id as a parameter instead of a QuerySchedule. To facilitate this, AdmissionController now maintains a map from query ids of running queries to the resources that were allocated for them so that it can look the resources up when releasing them. This map will be necessary when implementing the admission control service to facilitate proper accounting of resouces in cases like coordinator failures. - As scheduling is currently organized, we first construct the FragmentExecParams with the FInstanceExecParams as their children, then we construct the BackendExecParams which get references to their FInstanceExecParams. Since we can't send references like these through an rpc, we now instead Swap() the FInstanceExecParamsPB into the BackendExecParamsPB. Testing: - Updated related tests. - Passed a full run of existing tests. Change-Id: I1db64e72f84604b1d8ac24e0bdd4ad6bedd6bcd9 --- M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator-backend-resource-state.cc M be/src/runtime/coordinator-backend-state-test.cc M be/src/runtime/coordinator-backend-state.cc M be/src/runtime/coordinator-backend-state.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/runtime/fragment-instance-state.cc M be/src/runtime/query-driver.cc A be/src/runtime/query-exec-params.cc A be/src/runtime/query-exec-params.h M be/src/scheduling/CMakeLists.txt M be/src/scheduling/admission-controller-test.cc M be/src/scheduling/admission-controller.cc M be/src/scheduling/admission-controller.h M be/src/scheduling/cluster-membership-mgr-test.cc M be/src/scheduling/cluster-membership-mgr.cc M be/src/scheduling/cluster-membership-mgr.h M be/src/scheduling/cluster-membership-test-util.cc M be/src/scheduling/query-schedule.cc M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h M be/src/util/debug-util.h M be/src/util/uid-util.h M common/protobuf/CMakeLists.txt A common/protobuf/admission_control_service.proto 34 files changed, 1,082 insertions(+), 776 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/61/15961/8 -- To view, visit http://gerrit.cloudera.org:8080/15961 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I1db64e72f84604b1d8ac24e0bdd4ad6bedd6bcd9 Gerrit-Change-Number: 15961 Gerrit-PatchSet: 8 Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall
[Impala-ASF-CR] IMPALA-9294: Support DATE for min-max runtime filter
Wenzhe Zhou has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/16103 ) Change subject: IMPALA-9294: Support DATE for min-max runtime filter .. IMPALA-9294: Support DATE for min-max runtime filter Implemented Date min-max filter and applied it to Kudu as other min-max runtime filters. Added new test cases for Date min-max filters. Testing: Passed all core tests. Change-Id: Ic2f6e2dc6949735d5f0fcf317361cc2969a5e82c --- M be/src/codegen/gen_ir_descriptions.py M be/src/runtime/date-value.h M be/src/util/min-max-filter-ir.cc M be/src/util/min-max-filter-test.cc M be/src/util/min-max-filter.cc M be/src/util/min-max-filter.h M common/protobuf/common.proto M common/thrift/Data.thrift M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java M testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test M testdata/workloads/functional-query/queries/QueryTest/all_runtime_filters.test M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test 12 files changed, 264 insertions(+), 167 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/16103/3 -- To view, visit http://gerrit.cloudera.org:8080/16103 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic2f6e2dc6949735d5f0fcf317361cc2969a5e82c Gerrit-Change-Number: 16103 Gerrit-PatchSet: 3 Gerrit-Owner: Wenzhe Zhou Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Wenzhe Zhou
[Impala-ASF-CR] IMPALA-9692 (part 3): Model QuerySchedule as a protobuf
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15961 ) Change subject: IMPALA-9692 (part 3): Model QuerySchedule as a protobuf .. Patch Set 7: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/6431/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/15961 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1db64e72f84604b1d8ac24e0bdd4ad6bedd6bcd9 Gerrit-Change-Number: 15961 Gerrit-PatchSet: 7 Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 26 Jun 2020 16:07:22 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9692 (part 3): Model QuerySchedule as a protobuf
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/15961 ) Change subject: IMPALA-9692 (part 3): Model QuerySchedule as a protobuf .. Patch Set 7: (10 comments) http://gerrit.cloudera.org:8080/#/c/15961/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15961/6//COMMIT_MSG@31 PS6, Line 31: eventua > nit: typo Done http://gerrit.cloudera.org:8080/#/c/15961/6//COMMIT_MSG@42 PS6, Line 42: get > nit typo Seems right to me? http://gerrit.cloudera.org:8080/#/c/15961/6/be/src/runtime/exec-params.h File be/src/runtime/exec-params.h: http://gerrit.cloudera.org:8080/#/c/15961/6/be/src/runtime/exec-params.h@37 PS6, Line 37: > we might actually want to rename this to something like QueryExecParams, ot Done http://gerrit.cloudera.org:8080/#/c/15961/4/be/src/runtime/exec-params.h File be/src/runtime/exec-params.h: http://gerrit.cloudera.org:8080/#/c/15961/4/be/src/runtime/exec-params.h@73 PS4, Line 73: > What I had in mind is if you know the total number of indices, then you can Done http://gerrit.cloudera.org:8080/#/c/15961/6/be/src/scheduling/admission-controller.h File be/src/scheduling/admission-controller.h: http://gerrit.cloudera.org:8080/#/c/15961/6/be/src/scheduling/admission-controller.h@724 PS6, Line 724: memory > nit: typo Done http://gerrit.cloudera.org:8080/#/c/15961/6/be/src/scheduling/admission-controller.h@745 PS6, Line 745: > mention this is guarded by admission_ctrl_lock_? Done http://gerrit.cloudera.org:8080/#/c/15961/4/be/src/scheduling/cluster-membership-mgr.cc File be/src/scheduling/cluster-membership-mgr.cc: http://gerrit.cloudera.org:8080/#/c/15961/4/be/src/scheduling/cluster-membership-mgr.cc@352 PS4, Line 352: INFO > makes sense, perhaps a better INFO message would be "Did not blacklist ... Done http://gerrit.cloudera.org:8080/#/c/15961/4/be/src/scheduling/query-schedule.h File be/src/scheduling/query-schedule.h: http://gerrit.cloudera.org:8080/#/c/15961/4/be/src/scheduling/query-schedule.h@58 PS4, Line 58: er > I think that makes sense, I like the 'ScheduleState' approach. yeah, its up Great, I'll do it in a follow up http://gerrit.cloudera.org:8080/#/c/15961/4/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/15961/4/be/src/scheduling/scheduler.cc@300 PS4, Line 300: } : } else if (ContainsNode(fragment.plan, TPlanNodeType::UNION_NODE) > okay - might be worth re-factoring into a static helper method to make it a Done http://gerrit.cloudera.org:8080/#/c/15961/6/common/protobuf/admission_control_service.proto File common/protobuf/admission_control_service.proto: http://gerrit.cloudera.org:8080/#/c/15961/6/common/protobuf/admission_control_service.proto@39 PS6, Line 39: // 0-based ordinal number of this particular instance. This is within its fragment, not : // query-wide, so eg. there will be one instance '0' for each fragment. > still a bit confused about this represents, but I think this is basically t Right -- To view, visit http://gerrit.cloudera.org:8080/15961 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1db64e72f84604b1d8ac24e0bdd4ad6bedd6bcd9 Gerrit-Change-Number: 15961 Gerrit-PatchSet: 7 Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Fri, 26 Jun 2020 15:45:20 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9692 (part 3): Model QuerySchedule as a protobuf
Hello Sahil Takiar, Joe McDonnell, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15961 to look at the new patch set (#7). Change subject: IMPALA-9692 (part 3): Model QuerySchedule as a protobuf .. IMPALA-9692 (part 3): Model QuerySchedule as a protobuf In order to support the new admission control service, we need to be able to return the results of an admission attempt, i.e. the query schedule, to the coordinator. To enable this, this patch moves all parts of the QuerySchedule class and related classes that are required by the coordinator into a new message QuerySchedulePB. The main admission control interface, SubmitForAdmission(), now returns a QuerySchedulePB. Some notable changes: - Previously, QuerySchedule was used by Coordinator as a way to pass around references to parts of the TExecRequest to places like Coordinator::ExecSummary and Coordinator::BackendState. This has been replaced with the ExecParams class, which is a container for references to the TExecRequest and QuerySchedulePB along with convenience functions for accessing them. - Similarly, FragmentExecParams, which is part of QuerySchedule, contains references to the associated TPlanFragment, owned by the TExecRequest, which were used by the Coordinator when iterating over the schedule to initiate the query. Since FragmentExecParamsPB can't contain these references, they were replaced by a map between fragment idx and TPlanFragment in ExecParams. - In order to keep payloads reasonable for the eventual RPC interface, AdmissionController::ReleaseQuery() and ReleaseQueryBackend() now take a query id as a parameter instead of a QuerySchedule. To facilitate this, AdmissionController now maintains a map from query ids of running queries to the resources that were allocated for them so that it can look the resources up when releasing them. This map will be necessary when implementing the admission control service to facilitate proper accounting of resouces in cases like coordinator failures. - As scheduling is currently organized, we first construct the FragmentExecParams with the FInstanceExecParams as their children, then we construct the BackendExecParams which get references to their FInstanceExecParams. Since we can't send references like these through an rpc, we now instead Swap() the FInstanceExecParamsPB into the BackendExecParamsPB. Testing: - Updated related tests. - Passed a full run of existing tests. Change-Id: I1db64e72f84604b1d8ac24e0bdd4ad6bedd6bcd9 --- M be/src/runtime/CMakeLists.txt M be/src/runtime/coordinator-backend-resource-state.cc M be/src/runtime/coordinator-backend-state-test.cc M be/src/runtime/coordinator-backend-state.cc M be/src/runtime/coordinator-backend-state.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M be/src/runtime/fragment-instance-state.cc M be/src/runtime/query-driver.cc A be/src/runtime/query-exec-params.cc A be/src/runtime/query-exec-params.h M be/src/scheduling/CMakeLists.txt M be/src/scheduling/admission-controller-test.cc M be/src/scheduling/admission-controller.cc M be/src/scheduling/admission-controller.h M be/src/scheduling/cluster-membership-mgr-test.cc M be/src/scheduling/cluster-membership-mgr.cc M be/src/scheduling/cluster-membership-mgr.h M be/src/scheduling/cluster-membership-test-util.cc M be/src/scheduling/query-schedule.cc M be/src/scheduling/query-schedule.h M be/src/scheduling/scheduler-test-util.cc M be/src/scheduling/scheduler-test-util.h M be/src/scheduling/scheduler.cc M be/src/scheduling/scheduler.h M be/src/service/client-request-state.cc M be/src/service/client-request-state.h M be/src/service/impala-http-handler.cc M be/src/service/impala-server.cc M be/src/service/impala-server.h M be/src/util/debug-util.h M be/src/util/uid-util.h M common/protobuf/CMakeLists.txt A common/protobuf/admission_control_service.proto 34 files changed, 1,082 insertions(+), 776 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/61/15961/7 -- To view, visit http://gerrit.cloudera.org:8080/15961 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I1db64e72f84604b1d8ac24e0bdd4ad6bedd6bcd9 Gerrit-Change-Number: 15961 Gerrit-PatchSet: 7 Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Thomas Tauber-Marshall
[Impala-ASF-CR] IMPALA-9515: Full ACID Milestone 3: Read support for "original files"
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16001 ) Change subject: IMPALA-9515: Full ACID Milestone 3: Read support for "original files" .. Patch Set 13: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6056/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16001 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I176497ef9873ed7589bd3dee07d048a42dfad953 Gerrit-Change-Number: 16001 Gerrit-PatchSet: 13 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 26 Jun 2020 14:52:55 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9515: Full ACID Milestone 3: Read support for "original files"
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16001 ) Change subject: IMPALA-9515: Full ACID Milestone 3: Read support for "original files" .. Patch Set 12: Code-Review+2 Carry +2 -- To view, visit http://gerrit.cloudera.org:8080/16001 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I176497ef9873ed7589bd3dee07d048a42dfad953 Gerrit-Change-Number: 16001 Gerrit-PatchSet: 12 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 26 Jun 2020 14:52:35 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9515: Full ACID Milestone 3: Read support for "original files"
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16001 ) Change subject: IMPALA-9515: Full ACID Milestone 3: Read support for "original files" .. Patch Set 13: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16001 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I176497ef9873ed7589bd3dee07d048a42dfad953 Gerrit-Change-Number: 16001 Gerrit-PatchSet: 13 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 26 Jun 2020 14:52:54 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15866 ) Change subject: IMPALA-9531: Dropped support for dateless timestamps .. Patch Set 6: (2 comments) http://gerrit.cloudera.org:8080/#/c/15866/6/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: http://gerrit.cloudera.org:8080/#/c/15866/6/be/src/exprs/timestamp-functions.cc@154 PS6, Line 154: void UnixAndFromUnixPrepare(FunctionContext* context, This function is not included in the header because you didn't want to include the whole datatime-common for CastDirection, right? Could you mention it here? http://gerrit.cloudera.org:8080/#/c/15866/6/be/src/runtime/datetime-simple-date-format-parser.h File be/src/runtime/datetime-simple-date-format-parser.h: http://gerrit.cloudera.org:8080/#/c/15866/6/be/src/runtime/datetime-simple-date-format-parser.h@87 PS6, Line 87: CastDirection cast_mode = FORMAT This still has a default value. (I saw that you actually added the parameters on the callsites.) -- To view, visit http://gerrit.cloudera.org:8080/15866 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I48c49bf027cc4b917849b3d58518facba372b322 Gerrit-Change-Number: 15866 Gerrit-PatchSet: 6 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Fri, 26 Jun 2020 13:58:38 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 ) Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. Patch Set 10: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6430/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 10 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Fri, 26 Jun 2020 12:55:45 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 ) Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. Patch Set 10: PS10 is rebase with master and conflict resolution. -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 10 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Fri, 26 Jun 2020 12:27:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16000 to look at the new patch set (#10). Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Even though the DataSketches HLL algorithm could be parameterized currently this implementation hard-codes these parameters and use HLL_4 and lg_k=12. For more details about Apache DataSketches' HLL implementation see: https://datasketches.apache.org/docs/HLL/HLL.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe --- M be/src/codegen/impala-ir.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-functions-ir.cc A be/src/exprs/datasketches-functions.h M be/src/exprs/datasketches-test.cc M be/src/exprs/scalar-expr-evaluator.cc M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test A tests/query_test/test_datasketches.py 14 files changed, 477 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/10 -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 10 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 ) Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. Patch Set 9: Code-Review+1 (1 comment) Thanks for the changes! http://gerrit.cloudera.org:8080/#/c/16000/9/fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java File fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java: http://gerrit.cloudera.org:8080/#/c/16000/9/fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java@79 PS9, Line 79: private boolean isUnupported_; For me it seems more logical to move this to base class Function, even if only aggregates use it right now. -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 9 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Fri, 26 Jun 2020 11:13:56 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9515: Full ACID Milestone 3: Read support for "original files"
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16001 ) Change subject: IMPALA-9515: Full ACID Milestone 3: Read support for "original files" .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/16001/11/testdata/workloads/functional-query/queries/QueryTest/full-acid-original-file.test File testdata/workloads/functional-query/queries/QueryTest/full-acid-original-file.test: http://gerrit.cloudera.org:8080/#/c/16001/11/testdata/workloads/functional-query/queries/QueryTest/full-acid-original-file.test@235 PS11, Line 235: invalidate metadata; Ouch, I should have noticed this :/ -- To view, visit http://gerrit.cloudera.org:8080/16001 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I176497ef9873ed7589bd3dee07d048a42dfad953 Gerrit-Change-Number: 16001 Gerrit-PatchSet: 11 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 26 Jun 2020 10:38:28 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9515: Full ACID Milestone 3: Read support for "original files"
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16001 ) Change subject: IMPALA-9515: Full ACID Milestone 3: Read support for "original files" .. Patch Set 12: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6429/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16001 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I176497ef9873ed7589bd3dee07d048a42dfad953 Gerrit-Change-Number: 16001 Gerrit-PatchSet: 12 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 26 Jun 2020 10:21:55 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9697: Support priority based scratch directory selection
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16091 ) Change subject: IMPALA-9697: Support priority based scratch directory selection .. Patch Set 6: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/16091 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I381c3a358e1382e6696325fec74667f1fa18dd17 Gerrit-Change-Number: 16091 Gerrit-PatchSet: 6 Gerrit-Owner: Abhishek Rawat Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 26 Jun 2020 10:20:39 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9697: Support priority based scratch directory selection
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16091 ) Change subject: IMPALA-9697: Support priority based scratch directory selection .. IMPALA-9697: Support priority based scratch directory selection The '--scratch_dirs' configuration option now supports specifying the priority of the scratch direcotry. The lower the numeric value, the higher is the priority. If priority is not specified then default priority with value numeric_limits::max() is used. Valid formats for specifying the priority are: - :: - :: Following formats use default priority: - - : - :: The new logic in TmpFileGroup::AllocateSpace() tries to find a target file using a prioritized round-robin scheme. Files are ordered in decreasing order of their priority. The priority of a file is same as the priority of the related directory. A target file is selected by always searching in the ordered list starting from the file with highest priority. If multiple files have same priority, then the target file is selected in a round robin manner. Testing: - Added unit and e2e tests for priority based spilling logic. Change-Id: I381c3a358e1382e6696325fec74667f1fa18dd17 Reviewed-on: http://gerrit.cloudera.org:8080/16091 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins --- M be/src/runtime/tmp-file-mgr-test.cc M be/src/runtime/tmp-file-mgr.cc M be/src/runtime/tmp-file-mgr.h M tests/custom_cluster/test_scratch_disk.py 4 files changed, 370 insertions(+), 49 deletions(-) Approvals: Impala Public Jenkins: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/16091 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I381c3a358e1382e6696325fec74667f1fa18dd17 Gerrit-Change-Number: 16091 Gerrit-PatchSet: 7 Gerrit-Owner: Abhishek Rawat Gerrit-Reviewer: Abhishek Rawat Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 ) Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. Patch Set 9: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6428/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 9 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Fri, 26 Jun 2020 10:10:17 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9515: Full ACID Milestone 3: Read support for "original files"
Hello Norbert Luksa, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16001 to look at the new patch set (#12). Change subject: IMPALA-9515: Full ACID Milestone 3: Read support for "original files" .. IMPALA-9515: Full ACID Milestone 3: Read support for "original files" "Original files" are files that don't have full ACID schema. We can see such files if we upgrade a non-ACID table to full ACID. Also, the LOAD DATA statement can load non-ACID files into full ACID tables. So such files don't store special ACID columns, that means we need to auto-generate their values. These are (operation, originalTransaction, bucket, rowid, and currentTransaction). With the exception of 'rowid', all of them can be calculated based on the file path, so I add their values to the scanner's template tuple. 'rowid' is the ordinal number of the row inside a bucket inside a directory. For now Impala only allows one file per bucket per directory. Therefore we can generate row ids for each file independently. Multiple files in a single bucket in a directory can only be present if the table was non-transactional earlier and we upgraded it to full ACID table. After the first compaction we should only see one original file per bucket per directory. In HdfsOrcScanner we calculate the first row id for our split then the OrcStructReader fills the rowid slot with the proper values. Testing: * added e2e tests to check if the generated values are correct * added e2e test to reject tables that have multiple files per bucket * added unit tests to the new auxiliary functions Change-Id: I176497ef9873ed7589bd3dee07d048a42dfad953 --- M be/src/exec/acid-metadata-utils-test.cc M be/src/exec/acid-metadata-utils.cc M be/src/exec/acid-metadata-utils.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h M be/src/exec/orc-column-readers.cc M be/src/exec/orc-column-readers.h M be/src/exec/orc-metadata-utils.cc M be/src/exec/orc-metadata-utils.h M testdata/data/README A testdata/data/alltypes_non_acid.orc M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-query/queries/QueryTest/acid-negative.test A testdata/workloads/functional-query/queries/QueryTest/full-acid-original-file.test M tests/query_test/test_acid.py 16 files changed, 715 insertions(+), 80 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/01/16001/12 -- To view, visit http://gerrit.cloudera.org:8080/16001 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I176497ef9873ed7589bd3dee07d048a42dfad953 Gerrit-Change-Number: 16001 Gerrit-PatchSet: 12 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16000 to look at the new patch set (#9). Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Even though the DataSketches HLL algorithm could be parameterized currently this implementation hard-codes these parameters and use HLL_4 and lg_k=12. For more details about Apache DataSketches' HLL implementation see: https://datasketches.apache.org/docs/HLL/HLL.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe --- M be/src/codegen/impala-ir.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-functions-ir.cc A be/src/exprs/datasketches-functions.h M be/src/exprs/datasketches-test.cc M be/src/exprs/scalar-expr-evaluator.cc M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test A tests/query_test/test_datasketches.py 14 files changed, 477 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/9 -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 9 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16095 ) Change subject: IMPALA-9633: Implement ds_hll_union() .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/16095/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test: http://gerrit.cloudera.org:8080/#/c/16095/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@163 PS2, Line 163: QUERY Add a test where the input data set contains valid sketches and nulls. Expectation is that nulls are ignored. -- To view, visit http://gerrit.cloudera.org:8080/16095 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09 Gerrit-Change-Number: 16095 Gerrit-PatchSet: 2 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Fri, 26 Jun 2020 07:04:58 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9691: Support Kudu Timestamp and Date bloom filter
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16094 ) Change subject: IMPALA-9691: Support Kudu Timestamp and Date bloom filter .. Patch Set 6: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/16094 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3c1e9bcc9fd6d79a39f25eaa3396188fc0a52a48 Gerrit-Change-Number: 16094 Gerrit-PatchSet: 6 Gerrit-Owner: Wenzhe Zhou Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Wenzhe Zhou Gerrit-Comment-Date: Fri, 26 Jun 2020 06:56:15 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9691: Support Kudu Timestamp and Date bloom filter
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16094 ) Change subject: IMPALA-9691: Support Kudu Timestamp and Date bloom filter .. IMPALA-9691: Support Kudu Timestamp and Date bloom filter Impala save timestamp as 12 bytes of structure TimestampValue with time in nano seconds. Kudu store timestamp as 8 bytes of Unix Time microseconds. To avoid the data truncation issue in the bloom filter, add FunctionCallExpr with 'utc_to_unix_micros' as the root of source expression of bloom filter to convert timestamp values to microseconds when building timestamp bloom filter for Kudu. Generated functional date_tbl table in Kudu format for unit-test. Added new test cases for Kudu Timestamp and Date bloom filters. Testing: Passed all core tests. Change-Id: I3c1e9bcc9fd6d79a39f25eaa3396188fc0a52a48 Reviewed-on: http://gerrit.cloudera.org:8080/16094 Reviewed-by: Thomas Tauber-Marshall Tested-by: Impala Public Jenkins --- M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-planner/queries/PlannerTest/bloom-filter-assignment.test M testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test M testdata/workloads/functional-planner/queries/PlannerTest/runtime-filter-propagation.test M testdata/workloads/functional-query/queries/QueryTest/all_runtime_filters.test 7 files changed, 311 insertions(+), 30 deletions(-) Approvals: Thomas Tauber-Marshall: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/16094 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I3c1e9bcc9fd6d79a39f25eaa3396188fc0a52a48 Gerrit-Change-Number: 16094 Gerrit-PatchSet: 7 Gerrit-Owner: Wenzhe Zhou Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Wenzhe Zhou