[Impala-ASF-CR] IMPALA-9941: ExprTest.CastExprs fails when running with ASAN fix
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16179 ) Change subject: IMPALA-9941: ExprTest.CastExprs fails when running with ASAN fix .. Patch Set 3: (5 comments) http://gerrit.cloudera.org:8080/#/c/16179/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16179/3//COMMIT_MSG@13 PS3, Line 13: Testxing typo http://gerrit.cloudera.org:8080/#/c/16179/3/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/16179/3/be/src/exprs/expr-test.cc@3311 PS3, Line 3311: TestTimestampValue("cast(' \t\r\n 01:05:01 \t\r\n ' as timestamp)", : TimestampValue::ParseSimpleDateFormat("01:05:01")); Does this produce a valid TS? http://gerrit.cloudera.org:8080/#/c/16179/3/be/src/exprs/expr-test.cc@3313 PS3, Line 3313: // TestIsNull("cast(' \t\r\n 01:05:01.123456789 \t\r\n ' as timestamp)", : // TYPE_TIMESTAMP); Please don't leave commented code. http://gerrit.cloudera.org:8080/#/c/16179/3/be/src/exprs/expr-test.cc@3327 PS3, Line 3327: TestIsNull("cast(' \t\r\n 1:5:1\t\r\n' as timestamp)", TYPE_TIMESTAMP); : TestIsNull( : "cast(' \t\r\n 1:5:1.12345678\t\r\n' as timestamp)", TYPE_TIMESTAMP); These have nothing to do with IMPALA-6995, please move them to a different section of the test file. http://gerrit.cloudera.org:8080/#/c/16179/3/be/src/runtime/datetime-simple-date-format-parser.cc File be/src/runtime/datetime-simple-date-format-parser.cc: http://gerrit.cloudera.org:8080/#/c/16179/3/be/src/runtime/datetime-simple-date-format-parser.cc@346 PS3, Line 346: str[4] For me the root cause is still a bit unclear. The ASAN build said that for certain inputs here we try to read from memory that was actually released. I wonder how come we get to this point when we had released 'str' and where it is actually released? -- To view, visit http://gerrit.cloudera.org:8080/16179 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I75bd47b6627a2e338c46dc354f7e67d34c1abbcf Gerrit-Change-Number: 16179 Gerrit-PatchSet: 3 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Mon, 13 Jul 2020 09:24:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16095 ) Change subject: IMPALA-9633: Implement ds_hll_union() .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/16095/5/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test: http://gerrit.cloudera.org:8080/#/c/16095/5/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@202 PS5, Line 202: > Can you also add a union test with hll_sketches_from_hive? Sure, Done -- To view, visit http://gerrit.cloudera.org:8080/16095 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09 Gerrit-Change-Number: 16095 Gerrit-PatchSet: 7 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Mon, 13 Jul 2020 07:48:39 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16095 to look at the new patch set (#7). Change subject: IMPALA-9633: Implement ds_hll_union() .. IMPALA-9633: Implement ds_hll_union() This function receives a set of sketches produced by ds_hll_sketch() and merges them into a single sketch. An example usage is to create a sketch for each partition of a table, write these sketches to a separate table and based on which partition the user is interested of the relevant sketches can be union-ed together to get an estimate. E.g.: SELECT ds_hll_estimate(ds_hll_union(sketch_col)) FROM sketch_tbl WHERE partition_col=1 OR partition_col=5; Note, currently there is a known limitation of unioning string types where some input sketches come from Impala and some from Hive. In this case if there is an overlap in the input data used by Impala and by Hive this overlapping data is still counted twice due to some string representation difference between Impala and Hive. For more details see: https://issues.apache.org/jira/browse/IMPALA-9939 Testing: - Apart from the automated tests I added to this patch I also tested ds_hll_union() on a bigger dataset to check that serialization, deserialization and merging steps work well. I took TPCH25.linelitem, created a number of sketches with grouping by l_shipdate and called ds_hll_union() on those sketches. Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09 --- M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-common.cc A be/src/exprs/datasketches-common.h M be/src/exprs/datasketches-functions-ir.cc M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M testdata/data/README A testdata/data/hll_sketches_from_impala.parquet M testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test M tests/query_test/test_datasketches.py 11 files changed, 271 insertions(+), 29 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/16095/7 -- To view, visit http://gerrit.cloudera.org:8080/16095 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09 Gerrit-Change-Number: 16095 Gerrit-PatchSet: 7 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16095 to look at the new patch set (#6). Change subject: IMPALA-9633: Implement ds_hll_union() .. IMPALA-9633: Implement ds_hll_union() This function receives a set of sketches produced by ds_hll_sketch() and merges them into a single sketch. An example usage is to create a sketch for each partition of a table, write these sketches to a separate table and based on which partition the user is interested of the relevant sketches can be union-ed together to get an estimate. E.g.: SELECT ds_hll_estimate(ds_hll_union(sketch_col)) FROM sketch_tbl WHERE partition_col=1 OR partition_col=5; Note, currently there is a known limitation of unioning string types where some input sketches come from Impala and some from Hive. In this case if there is an overlap in the input data used by Impala and by Hive this overlapping data is still counted twice due to some string representation difference between Impala and Hive. For more details see: https://issues.apache.org/jira/browse/IMPALA-9939 Testing: - Apart from the automated tests I added to this patch I also tested ds_hll_union() on a bigger dataset to check that serialization, deserialization and merging steps work well. I took TPCH25.linelitem, created a number of sketches with grouping by l_shipdate and called ds_hll_union() on those sketches. Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09 --- M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-common.cc A be/src/exprs/datasketches-common.h M be/src/exprs/datasketches-functions-ir.cc M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M testdata/data/README M testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test M tests/query_test/test_datasketches.py 10 files changed, 271 insertions(+), 29 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/16095/6 -- To view, visit http://gerrit.cloudera.org:8080/16095 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09 Gerrit-Change-Number: 16095 Gerrit-PatchSet: 6 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16095 ) Change subject: IMPALA-9633: Implement ds_hll_union() .. Patch Set 5: (3 comments) http://gerrit.cloudera.org:8080/#/c/16095/4/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/16095/4/be/src/exprs/aggregate-functions-ir.cc@1736 PS4, Line 1736: tten by DeserializeHllSketch( > Do these configs matter, or they are overwritten during deserialization? My These settings are overwritten in DeserializeHllSketch() to repesent the ones coming from the serialized sketch. Added a comment. http://gerrit.cloudera.org:8080/#/c/16095/4/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java: http://gerrit.cloudera.org:8080/#/c/16095/4/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1336 PS4, Line 1336: Lists.newArrayList(Type.STRING), Type.STRING, Type.STRING, : prefix + "14DsHllUnionInitEPN10impala_udf15FunctionContextEPNS1_9StringValE", : prefix + : "16DsHllUnionUpdateEPN10impala_udf15FunctionContextERKNS1_9StringValEPS4_", : prefix + : "15DsHllUnionMergeEPN10impala_udf15FunctionContextERKNS1_9StringValEPS4_", : prefix + > nit: indentation Done http://gerrit.cloudera.org:8080/#/c/16095/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test: http://gerrit.cloudera.org:8080/#/c/16095/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@163 PS2, Line 163: from hll_sketches_from_hive; > Add a test where the input data set contains valid sketches and nulls. Expe Done -- To view, visit http://gerrit.cloudera.org:8080/16095 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09 Gerrit-Change-Number: 16095 Gerrit-PatchSet: 5 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Thu, 09 Jul 2020 09:36:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16095 to look at the new patch set (#5). Change subject: IMPALA-9633: Implement ds_hll_union() .. IMPALA-9633: Implement ds_hll_union() This function receives a set of sketches produced by ds_hll_sketch() and merges them into a single sketch. An example usage is to create a sketch for each partition of a table, write these sketches to a separate table and based on which partition the user is interested of the relevant sketches can be union-ed together to get an estimate. E.g.: SELECT ds_hll_estimate(ds_hll_union(sketch_col)) FROM sketch_tbl WHERE partition_col=1 OR partition_col=5; Testing: - Apart from the automated tests I added to this patch I also tested ds_hll_union() on a bigger dataset to check that serialization, deserialization and merging steps work well. I took TPCH25.linelitem, created a number of sketches with grouping by l_shipdate and called ds_hll_union() on those sketches. Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09 --- M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-common.cc A be/src/exprs/datasketches-common.h M be/src/exprs/datasketches-functions-ir.cc M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test 8 files changed, 240 insertions(+), 29 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/16095/5 -- To view, visit http://gerrit.cloudera.org:8080/16095 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09 Gerrit-Change-Number: 16095 Gerrit-PatchSet: 5 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps
Gabor Kaszab has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/15866 ) Change subject: IMPALA-9531: Dropped support for dateless timestamps .. IMPALA-9531: Dropped support for dateless timestamps Removed the support for dateless timestamps. During dateless timestamp casts if the format doesn't contain date part we get an error during tokenization of the format. If the input str doesn't contain a date part then we get null result. Examples: select cast('01:02:59' as timestamp); This will come back as NULL value. select to_timestamp('01:01:01', 'HH:mm:ss'); select cast('01:02:59' as timestamp format 'HH12:MI:SS'); select cast('12 AM' as timestamp FORMAT 'AM.HH12'); These will come back with a parsing errors. Casting from a table will generate similar results. Testing: Modified the previous tests related to dateless timestamps. Added test to read fromtables which are still containing dateless timestamps and covered timestamp to string path when no date tokens are requested in the output string. Change-Id: I48c49bf027cc4b917849b3d58518facba372b322 Reviewed-on: http://gerrit.cloudera.org:8080/15866 Tested-by: Impala Public Jenkins Reviewed-by: Gabor Kaszab --- M be/src/benchmarks/convert-timestamp-benchmark.cc M be/src/benchmarks/parse-timestamp-benchmark.cc M be/src/exec/text-converter.inline.h M be/src/exprs/cast-functions-ir.cc M be/src/exprs/expr-test.cc M be/src/exprs/scalar-expr-evaluator.cc M be/src/exprs/timestamp-functions-ir.cc M be/src/exprs/timestamp-functions.cc M be/src/exprs/timestamp-functions.h M be/src/runtime/date-parse-util.cc M be/src/runtime/date-test.cc M be/src/runtime/datetime-iso-sql-format-tokenizer.cc M be/src/runtime/datetime-parser-common.cc M be/src/runtime/datetime-parser-common.h M be/src/runtime/datetime-simple-date-format-parser.cc M be/src/runtime/datetime-simple-date-format-parser.h M be/src/runtime/timestamp-parse-util.cc M be/src/runtime/timestamp-test.cc M be/src/runtime/timestamp-value.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/test/java/org/apache/impala/analysis/AnalyzeKuduDDLTest.java M testdata/data/README A testdata/data/dateless_timestamps.parq A testdata/data/dateless_timestamps.txt M testdata/data/lazy_timestamp.csv M testdata/workloads/functional-query/queries/QueryTest/date.test A testdata/workloads/functional-query/queries/QueryTest/dateless_timestamp_parquet.test A testdata/workloads/functional-query/queries/QueryTest/dateless_timestamp_text.test M testdata/workloads/functional-query/queries/QueryTest/exprs.test M testdata/workloads/functional-query/queries/QueryTest/select-lazy-timestamp.test M tests/data_errors/test_data_errors.py M tests/query_test/test_cast_with_format.py M tests/query_test/test_scanners.py 34 files changed, 278 insertions(+), 231 deletions(-) Approvals: Impala Public Jenkins: Verified Gabor Kaszab: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/15866 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I48c49bf027cc4b917849b3d58518facba372b322 Gerrit-Change-Number: 15866 Gerrit-PatchSet: 14 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15866 ) Change subject: IMPALA-9531: Dropped support for dateless timestamps .. Patch Set 13: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/15866 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I48c49bf027cc4b917849b3d58518facba372b322 Gerrit-Change-Number: 15866 Gerrit-PatchSet: 13 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Wed, 08 Jul 2020 19:32:13 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15866 ) Change subject: IMPALA-9531: Dropped support for dateless timestamps .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/15866/11/testdata/workloads/functional-query/queries/QueryTest/dateless_timestamp_text.test File testdata/workloads/functional-query/queries/QueryTest/dateless_timestamp_text.test: http://gerrit.cloudera.org:8080/#/c/15866/11/testdata/workloads/functional-query/queries/QueryTest/dateless_timestamp_text.test@8 PS11, Line 8: localhost:20500 Note, the dockerised tests fail because the hardcoded localhost. -- To view, visit http://gerrit.cloudera.org:8080/15866 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I48c49bf027cc4b917849b3d58518facba372b322 Gerrit-Change-Number: 15866 Gerrit-PatchSet: 11 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Wed, 08 Jul 2020 06:23:30 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16095 to look at the new patch set (#3). Change subject: IMPALA-9633: Implement ds_hll_union() .. IMPALA-9633: Implement ds_hll_union() This function receives a set of sketches produced by ds_hll_sketch() and merges them into a single sketch. An example usage is to create a sketch for each partition of a table, write these sketches to a separate table and based on which partition the user is interested of the relevant sketches can be union-ed together to get an estimate. E.g.: SELECT ds_hll_estimate(ds_hll_union(sketch_col)) FROM sketch_tbl WHERE partition_col=1 OR partition_col=5; Testing: - Apart from the automated tests I added to this patch I also tested ds_hll_union() on a bigger dataset to check that serialization, deserialization and merging steps work well. I took TPCH25.linelitem, created a number of sketches with grouping by l_shipdate and called ds_hll_union() on those sketches. Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09 --- M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-common.cc A be/src/exprs/datasketches-common.h M be/src/exprs/datasketches-functions-ir.cc M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test 8 files changed, 232 insertions(+), 29 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/16095/3 -- To view, visit http://gerrit.cloudera.org:8080/16095 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09 Gerrit-Change-Number: 16095 Gerrit-PatchSet: 3 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Gabor Kaszab has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16000 ) Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Even though the DataSketches HLL algorithm could be parameterized currently this implementation hard-codes these parameters and use HLL_4 and lg_k=12. For more details about Apache DataSketches' HLL implementation see: https://datasketches.apache.org/docs/HLL/HLL.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Reviewed-on: http://gerrit.cloudera.org:8080/16000 Tested-by: Impala Public Jenkins Reviewed-by: Csaba Ringhofer --- M be/src/codegen/impala-ir.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-functions-ir.cc A be/src/exprs/datasketches-functions.h M be/src/exprs/datasketches-test.cc M be/src/exprs/scalar-expr-evaluator.cc M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/Function.java M testdata/data/README A testdata/data/hll_sketches_from_hive.parquet A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test A tests/query_test/test_datasketches.py 17 files changed, 505 insertions(+), 8 deletions(-) Approvals: Impala Public Jenkins: Verified Csaba Ringhofer: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 17 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15866 ) Change subject: IMPALA-9531: Dropped support for dateless timestamps .. Patch Set 11: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/15866 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I48c49bf027cc4b917849b3d58518facba372b322 Gerrit-Change-Number: 15866 Gerrit-PatchSet: 11 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Tue, 07 Jul 2020 13:08:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 ) Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. Patch Set 15: Apparently, test_ddl.py/test_functions_ddl failed with IllegalStateException because for unsupported functions there were some unset members. -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 15 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Tue, 07 Jul 2020 08:57:08 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16000 to look at the new patch set (#15). Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Even though the DataSketches HLL algorithm could be parameterized currently this implementation hard-codes these parameters and use HLL_4 and lg_k=12. For more details about Apache DataSketches' HLL implementation see: https://datasketches.apache.org/docs/HLL/HLL.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe --- M be/src/codegen/impala-ir.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-functions-ir.cc A be/src/exprs/datasketches-functions.h M be/src/exprs/datasketches-test.cc M be/src/exprs/scalar-expr-evaluator.cc M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/Function.java M testdata/data/README A testdata/data/hll_sketches_from_hive.parquet A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test A tests/query_test/test_datasketches.py 17 files changed, 505 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/15 -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 15 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16000 to look at the new patch set (#12). Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Even though the DataSketches HLL algorithm could be parameterized currently this implementation hard-codes these parameters and use HLL_4 and lg_k=12. For more details about Apache DataSketches' HLL implementation see: https://datasketches.apache.org/docs/HLL/HLL.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe --- M be/src/codegen/impala-ir.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-functions-ir.cc A be/src/exprs/datasketches-functions.h M be/src/exprs/datasketches-test.cc M be/src/exprs/scalar-expr-evaluator.cc M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/Function.java M testdata/data/README A testdata/data/hll_sketches_from_hive.parquet A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test A tests/query_test/test_datasketches.py 17 files changed, 503 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/12 -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 12 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16000 to look at the new patch set (#11). Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Even though the DataSketches HLL algorithm could be parameterized currently this implementation hard-codes these parameters and use HLL_4 and lg_k=12. For more details about Apache DataSketches' HLL implementation see: https://datasketches.apache.org/docs/HLL/HLL.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe --- M be/src/codegen/impala-ir.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-functions-ir.cc A be/src/exprs/datasketches-functions.h M be/src/exprs/datasketches-test.cc M be/src/exprs/scalar-expr-evaluator.cc M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/Function.java M testdata/data/README A testdata/data/hll_sketches_from_hive.parquet A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test A tests/query_test/test_datasketches.py 17 files changed, 504 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/11 -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 11 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15866 ) Change subject: IMPALA-9531: Dropped support for dateless timestamps .. Patch Set 8: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/15866 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I48c49bf027cc4b917849b3d58518facba372b322 Gerrit-Change-Number: 15866 Gerrit-PatchSet: 8 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Mon, 06 Jul 2020 05:28:41 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15866 ) Change subject: IMPALA-9531: Dropped support for dateless timestamps .. Patch Set 6: (2 comments) http://gerrit.cloudera.org:8080/#/c/15866/6/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: http://gerrit.cloudera.org:8080/#/c/15866/6/be/src/exprs/timestamp-functions.cc@154 PS6, Line 154: void UnixAndFromUnixPrepare(FunctionContext* context, This function is not included in the header because you didn't want to include the whole datatime-common for CastDirection, right? Could you mention it here? http://gerrit.cloudera.org:8080/#/c/15866/6/be/src/runtime/datetime-simple-date-format-parser.h File be/src/runtime/datetime-simple-date-format-parser.h: http://gerrit.cloudera.org:8080/#/c/15866/6/be/src/runtime/datetime-simple-date-format-parser.h@87 PS6, Line 87: CastDirection cast_mode = FORMAT This still has a default value. (I saw that you actually added the parameters on the callsites.) -- To view, visit http://gerrit.cloudera.org:8080/15866 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I48c49bf027cc4b917849b3d58518facba372b322 Gerrit-Change-Number: 15866 Gerrit-PatchSet: 6 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Fri, 26 Jun 2020 13:58:38 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 ) Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. Patch Set 10: PS10 is rebase with master and conflict resolution. -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 10 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Fri, 26 Jun 2020 12:27:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16000 to look at the new patch set (#10). Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Even though the DataSketches HLL algorithm could be parameterized currently this implementation hard-codes these parameters and use HLL_4 and lg_k=12. For more details about Apache DataSketches' HLL implementation see: https://datasketches.apache.org/docs/HLL/HLL.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe --- M be/src/codegen/impala-ir.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-functions-ir.cc A be/src/exprs/datasketches-functions.h M be/src/exprs/datasketches-test.cc M be/src/exprs/scalar-expr-evaluator.cc M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test A tests/query_test/test_datasketches.py 14 files changed, 477 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/10 -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 10 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16000 to look at the new patch set (#9). Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Even though the DataSketches HLL algorithm could be parameterized currently this implementation hard-codes these parameters and use HLL_4 and lg_k=12. For more details about Apache DataSketches' HLL implementation see: https://datasketches.apache.org/docs/HLL/HLL.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe --- M be/src/codegen/impala-ir.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-functions-ir.cc A be/src/exprs/datasketches-functions.h M be/src/exprs/datasketches-test.cc M be/src/exprs/scalar-expr-evaluator.cc M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test A tests/query_test/test_datasketches.py 14 files changed, 477 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/9 -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 9 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16095 ) Change subject: IMPALA-9633: Implement ds_hll_union() .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/16095/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test: http://gerrit.cloudera.org:8080/#/c/16095/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@163 PS2, Line 163: QUERY Add a test where the input data set contains valid sketches and nulls. Expectation is that nulls are ignored. -- To view, visit http://gerrit.cloudera.org:8080/16095 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09 Gerrit-Change-Number: 16095 Gerrit-PatchSet: 2 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Fri, 26 Jun 2020 07:04:58 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16000 to look at the new patch set (#8). Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Even though the DataSketches HLL algorithm could be parameterized currently this implementation hard-codes these parameters and use HLL_4 and lg_k=12. For more details about Apache DataSketches' HLL implementation see: https://datasketches.apache.org/docs/HLL/HLL.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe --- M be/src/codegen/impala-ir.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-functions-ir.cc A be/src/exprs/datasketches-functions.h M be/src/exprs/datasketches-test.cc M be/src/exprs/scalar-expr-evaluator.cc M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test A tests/query_test/test_datasketches.py 14 files changed, 471 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/8 -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 8 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 ) Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. Patch Set 7: (4 comments) http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc@1578 PS5, Line 1578: void AggregateFunctions::DsHllUpdate( > I have experimented a bit with upstream Hive and it turned out that (apart I have found a way to disable the unsupported parameter types. Done http://gerrit.cloudera.org:8080/#/c/16000/6/be/src/exprs/datasketches-functions-ir.cc File be/src/exprs/datasketches-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/16000/6/be/src/exprs/datasketches-functions-ir.cc@37 PS6, Line 37: ctx->SetError("Unable to deserialize sketch > I would vote on treating this always as error. We can relax this later if i Sure, changed this to give error unconditionally. http://gerrit.cloudera.org:8080/#/c/16000/5/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test: http://gerrit.cloudera.org:8080/#/c/16000/5/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@64 PS5, Line 64: --- > Hive seems to return NULL if there are no non-NULL values. I think that the Apparently it wasn't complicated to take care. Done. http://gerrit.cloudera.org:8080/#/c/16000/6/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test: http://gerrit.cloudera.org:8080/#/c/16000/6/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@121 PS6, Line 121: > typo Dropped this part of the test due to another change. -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 7 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Wed, 24 Jun 2020 15:47:25 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16000 to look at the new patch set (#7). Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Even though the DataSketches HLL algorithm could be parameterized currently this implementation hard-codes these parameters and use HLL_4 and lg_k=12. For more details about Apache DataSketches' HLL implementation see: https://datasketches.apache.org/docs/HLL/HLL.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe --- M be/src/codegen/impala-ir.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-functions-ir.cc A be/src/exprs/datasketches-functions.h M be/src/exprs/datasketches-test.cc M be/src/exprs/scalar-expr-evaluator.cc M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test A tests/query_test/test_datasketches.py 14 files changed, 473 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/7 -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 7 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()
Hello Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16095 to look at the new patch set (#2). Change subject: IMPALA-9633: Implement ds_hll_union() .. IMPALA-9633: Implement ds_hll_union() This function receives a set of sketches produced by ds_hll_sketch() and merges them into a single sketch. An example usage is to create a sketch for each partition of a table, write these sketches to a separate table and based on which partition the user is interested of the relevant sketches can be union-ed together to get an estimate. E.g.: SELECT ds_hll_estimate(ds_hll_union(sketch_col)) FROM sketch_tbl WHERE partition_col=1 OR partition_col=5; Testing: - Apart from the automated tests I added to this patch I also tested ds_hll_union() on a bigger dataset to check that serialization, deserialization and merging steps work well. I took TPCH25.linelitem, created a number of sketches with grouping by l_shipdate and called ds_hll_union() on those sketches. Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09 --- M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-common.cc A be/src/exprs/datasketches-common.h M be/src/exprs/datasketches-functions-ir.cc M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test 8 files changed, 244 insertions(+), 34 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/16095/2 -- To view, visit http://gerrit.cloudera.org:8080/16095 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09 Gerrit-Change-Number: 16095 Gerrit-PatchSet: 2 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()
Gabor Kaszab has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16095 Change subject: IMPALA-9633: Implement ds_hll_union() .. IMPALA-9633: Implement ds_hll_union() This function receives a set of sketches produced by ds_hll_sketch() and merges them into a single sketch. An example usage is to create a sketch for each partition of a table, write these sketches to a separate table and based on which partition the user is interested of the relevant sketches can be union-ed together to get an estimate. E.g.: SELECT ds_hll_estimate(ds_hll_union(sketch_col)) FROM sketch_tbl WHERE partition_col=1 OR partition_col=5; Testing: - Apart from the automated tests I added to this patch I also tested ds_hll_union() on a bigger dataset to check that serialization, deserialization and merging steps work well. I took TPCH25.linelitem, created a number of sketches with grouping by l_shipdate and called ds_hll_union() on those sketches. Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09 --- M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-common.cc A be/src/exprs/datasketches-common.h M be/src/exprs/datasketches-functions-ir.cc M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test 8 files changed, 243 insertions(+), 34 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/16095/1 -- To view, visit http://gerrit.cloudera.org:8080/16095 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09 Gerrit-Change-Number: 16095 Gerrit-PatchSet: 1 Gerrit-Owner: Gabor Kaszab
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 ) Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. Patch Set 6: (3 comments) http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc@1578 PS5, Line 1578: void AggregateFunctions::DsHllUpdate( > Don't support decimal in the first iteration. Open a Jira for further suppo Removed decimal support, however, Impala converts them automatically to bigintval and still executes update(). There isn't much I can do about it in my opinion. http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc@1611 PS5, Line 1611: union_sketch.update(src_sketch); > Don't support timestamp in the first iteration. Open a Jira for further sup Timestamp support removed. http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions.h File be/src/exprs/aggregate-functions.h: http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions.h@210 PS5, Line 210: S_SKETCH_CONFIG = 12 > this is 12 in Hive Changed this to 12. -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 6 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Wed, 17 Jun 2020 11:36:33 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16000 to look at the new patch set (#6). Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Even though the DataSketches HLL algorithm could be parameterized currently this implementation hard-codes these parameters and use HLL_4 and lg_k=12. For more details about Apache DataSketches' HLL implementation see: https://datasketches.apache.org/docs/HLL/HLL.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe --- M be/src/codegen/impala-ir.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-functions-ir.cc A be/src/exprs/datasketches-functions.h M be/src/exprs/datasketches-test.cc M be/src/exprs/scalar-expr-evaluator.cc M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test A tests/query_test/test_datasketches.py 12 files changed, 452 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/6 -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 6 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 ) Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. Patch Set 5: (4 comments) http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc@1578 PS5, Line 1578: void AggregateFunctions::DsHllUpdate( Don't support decimal in the first iteration. Open a Jira for further support. http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc@1611 PS5, Line 1611: void AggregateFunctions::DsHllUpdate( Don't support timestamp in the first iteration. Open a Jira for further support. http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions.h File be/src/exprs/aggregate-functions.h: http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions.h@210 PS5, Line 210: S_SKETCH_CONFIG = 11 this is 12 in Hive http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test: http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@142 PS3, Line 142: > We currently give an error for types not supported for ds_hll_functions. E. Update: Implemented that I return null and a warning when abort_on_error=false and return an error otherwise. -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 5 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Tue, 16 Jun 2020 15:23:55 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15866 ) Change subject: IMPALA-9531: Dropped support for dateless timestamps .. Patch Set 4: (8 comments) Nice work! I feel that this patch is getting into shape. I have some follow-up comments but none of the seems super difficult to address. http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/exprs/timestamp-functions-ir.cc File be/src/exprs/timestamp-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/exprs/timestamp-functions-ir.cc@167 PS3, Line 167: dt_ctx->Reset(reinterpret_casthttp://gerrit.cloudera.org:8080/#/c/15866/4/be/src/exprs/timestamp-functions-ir.cc File be/src/exprs/timestamp-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/15866/4/be/src/exprs/timestamp-functions-ir.cc@168 PS4, Line 168: if (!SimpleDateFormatTokenizer::Tokenize(dt_ctx, true, PARSE)) { Anyway, if you tokenize the format here then you do the tokenization for each row of the table even though it would be enough to do once before Impala starts reading rows. http://gerrit.cloudera.org:8080/#/c/15866/4/be/src/exprs/timestamp-functions.h File be/src/exprs/timestamp-functions.h: http://gerrit.cloudera.org:8080/#/c/15866/4/be/src/exprs/timestamp-functions.h@104 PS4, Line 104: bool format Please use datetime_parse_util::CastDirection http://gerrit.cloudera.org:8080/#/c/15866/4/be/src/exprs/timestamp-functions.cc File be/src/exprs/timestamp-functions.cc: http://gerrit.cloudera.org:8080/#/c/15866/4/be/src/exprs/timestamp-functions.cc@155 PS4, Line 155: bool format = true Why don't you use the parameter type that could be directly passed to Tokenize()? http://gerrit.cloudera.org:8080/#/c/15866/4/be/src/exprs/timestamp-functions.cc@167 PS4, Line 167: if (format) : parse_result = SimpleDateFormatTokenizer::Tokenize(dt_ctx, true, FORMAT); : else : parse_result = SimpleDateFormatTokenizer::Tokenize(dt_ctx, true, PARSE); Please check how to format if-else in Impala http://gerrit.cloudera.org:8080/#/c/15866/4/be/src/runtime/datetime-simple-date-format-parser.h File be/src/runtime/datetime-simple-date-format-parser.h: http://gerrit.cloudera.org:8080/#/c/15866/4/be/src/runtime/datetime-simple-date-format-parser.h@88 PS4, Line 88: CastDirection cast_mode = FORMAT I wouldn't give this a default value unless you have any specific reason making it this way. http://gerrit.cloudera.org:8080/#/c/15866/4/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/15866/4/testdata/data/README@503 PS4, Line 503: Timestamp_parquet_test The file name is not that verbose. After reading it it doesn't give any hints what timestamp tests it is used for. "dateless_timestamps.parq" would be better for this purpose. nit: No need to include "parquet" into the file name as the extension already expresses the format. nit: The file name itself starts with a lowercase letter. http://gerrit.cloudera.org:8080/#/c/15866/4/testdata/data/timestamp_text_test.txt File testdata/data/timestamp_text_test.txt: http://gerrit.cloudera.org:8080/#/c/15866/4/testdata/data/timestamp_text_test.txt@1 PS4, Line 1: 1996-04-22 10:00:00.43210 Same comment for the file name as for the parquet file: "dateless_timestamps.txt" would be more verbose. -- To view, visit http://gerrit.cloudera.org:8080/15866 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I48c49bf027cc4b917849b3d58518facba372b322 Gerrit-Change-Number: 15866 Gerrit-PatchSet: 4 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Tue, 16 Jun 2020 15:12:25 + Gerrit-HasComments: Yes
[Impala-ASF-CR] WIP IMPALA-9482 Support for BINARY columns
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: WIP IMPALA-9482 Support for BINARY columns .. Patch Set 2: (8 comments) I did a go through on this patch and left some minor comments. However, I feel that a second look from someone more competent to introducing new types should take a look. I see that there are a high number of places you had to touch but I can't judge if there is anything else missing. http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@47 PS2, Line 47: Testing Do you plan to do interop tests between Hive and Impala? http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h File be/src/exec/text-converter.h: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h@63 PS2, Line 63: col_desc Could you mention the new param in the comment above? http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h@88 PS2, Line 88: col_desc Could you mention the new param in the comment above? http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc File be/src/runtime/types.cc: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@180 PS2, Line 180: // ODBC driver requires types in lower case nit: I'd move this comment right before the switch to L186. http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc File be/src/service/hs2-util.cc: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc@729 PS2, Line 729: // HiveServer2 does not have a type for invalid, date, datetime or : // fixed_uda_intermediate. Shouldn't you change this comment to reflect that now BINARY can also trigger this branch? http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java File fe/src/main/java/org/apache/impala/analysis/CastExpr.java: http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@341 PS2, Line 341: boolean isStringBinaryCast = This can go after the Preconditions check, L347. http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java File fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java: http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java@263 PS2, Line 263: For tables with unsupported primitive types (e.g., binary) This comment is no longer valid, right? http://gerrit.cloudera.org:8080/#/c/16066/2/tests/common/impala_connection.py File tests/common/impala_connection.py: http://gerrit.cloudera.org:8080/#/c/16066/2/tests/common/impala_connection.py@426 PS2, Line 426: ` nit: typo -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 2 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Tue, 16 Jun 2020 13:39:03 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16000 to look at the new patch set (#5). Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Even though the DataSketches HLL algorithm could be parameterized currently this implementation hard-codes these parameters and use HLL_4 and lg_k=11. For more details about Apache DataSketches' HLL implementation see: https://datasketches.apache.org/docs/HLL/HLL.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe --- M be/src/codegen/impala-ir.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-functions-ir.cc A be/src/exprs/datasketches-functions.h M be/src/exprs/datasketches-test.cc M be/src/exprs/scalar-expr-evaluator.cc M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test A tests/query_test/test_datasketches.py 12 files changed, 493 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/5 -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 5 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 ) Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. Patch Set 4: (11 comments) http://gerrit.cloudera.org:8080/#/c/16000/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16000/3//COMMIT_MSG@30 PS3, Line 30: In terms of correctness DataSketches HLL is most of the time in 2% : range from the correct result but there are occasional spikes where : the difference is bigger but never goes out of the range of 5%. > A link to https://datasketches.apache.org/docs/HLL/HLL.html could be useful Sure, done. http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1593 PS3, Line 1593: sketch_ptr->update((void*)(), sizeof(impala::int128_t)); > We don't need to use the convenient typed update function, we can simply ha Good point! I don't have to loose precision here for int128 as casting it to void* and giving the length to update() would also work. I can also get rid of this warning/error functionality then. http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1595 PS3, Line 1595: } > I think that this can be raised a few time in a query if more than one inst Indeed, I just tried this locally on a bigger dataset and apparently there is one warning per instance. In my opinion we can leave it as it is as the purpose here was not to flood the screen with one error per row. Modified the comment, though. http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1625 PS3, Line 1625: : StringVal AggregateFunctions::DsHllSerialize(FunctionContext* ctx, : const StringVal& src) { > Does Hive also convert to double (as seconds from Unix epoch)? Unless we ar Leave this question open until we find out how Hive tackles the same if it does. http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1650 PS3, Line 1650: union_sketch.update(*dst_sketch_ptr); : > optional: I was looking for ways to avoid the 2 copies that occur here and I found that there is another interface for deserialization: deserialize(const void* bytes, size_t len) I can use this to get rid of the 2 copies. http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1684 PS3, Line 1684: static const uint32 > It would be nice to add a comment somewhere that describes the way hll_sket Sure, added a comment about this to DsHllInit where I first allocate memory for hll_sketch. http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test: http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@18 PS3, Line 18: ds_hll_estimate(ds_hll_sketch(timestamp_col)) > whitespace Done http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@21 PS3, Line 21: 100 > Are you sure that this won't be flaky? The value is the same as with count I've run this a few times and appeared stable. In theory until a certain size of data HLL will pre-populate its cache and during this phase it will give correct results. It will start to approximate only after this certain size. Apparently, alltypessmall is small enough to always give the correct result; http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@90 PS3, Line 90: insert into sketch_store > Do you know if this is also the case in Hive? This error scenario could be dropped using the update(void* bytes, size_t length) interface. http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@115 PS3, Line 115: where date_string_col = "03/01/09"; > Can you mention that this will change in the future when BINARY will be imp Done http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@142 PS3, Line 142: > Do you know how Hive handles this? I would consider returning an error inst We currently give an error for types not supported for ds_hll_functions. E.g. ds_hll_estimate() only accepts strings and for other types it gives an AnalysisException. For strings we can decide if it's a serialized sketch only when we read that particular row so we can't return earlier. What I can do here is to write a warning or give an error depending on how ABORT_ON_ERROR is set. What do you think? -- To view, visit
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16000 to look at the new patch set (#4). Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Even though the DataSketches HLL algorithm could be parameterized currently this implementation hard-codes these parameters and use HLL_4 and lg_k=11. For more details about Apache DataSketches' HLL implementation see: https://datasketches.apache.org/docs/HLL/HLL.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe --- M be/src/codegen/impala-ir.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-functions-ir.cc A be/src/exprs/datasketches-functions.h M be/src/exprs/datasketches-test.cc M be/src/exprs/scalar-expr-evaluator.cc M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test A tests/query_test/test_datasketches.py 12 files changed, 465 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/4 -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 4 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9820: Pull Datasketches-5 HLL MurmurHash fix
Gabor Kaszab has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16026 Change subject: IMPALA-9820: Pull Datasketches-5 HLL MurmurHash fix .. IMPALA-9820: Pull Datasketches-5 HLL MurmurHash fix There is a bug in DataSketches HLL MurmurHash where long strings are over-read resulting a cardinality estimate that is more than 15% off from the correct cardinality number. A recent upstream fix in Apache DataSketches addresses this issue and this patch pulls it to Impala. https://issues.apache.org/jira/browse/DATASKETCHES-5 Testing: - I used ds_hll_sketch() and ds_hll_estimate() functions from IMPALA-9632 to trigger DataSketches HLL functionality. - Ran DataSketches HLL on lineitem.l_comment in TPCH25_parquet to reproduce the issue. The symptom was that the actual result was around 15% off from the correct cardinality result (~69M vs 79M). - After applying this fix re-running the query gives much closer results, usually under 3% error range. Change-Id: I84d73fce1e7a197c1f8fb49404b58ed9bb0b843d --- M be/src/thirdparty/datasketches/MurmurHash3.h 1 file changed, 3 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/16026/1 -- To view, visit http://gerrit.cloudera.org:8080/16026 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I84d73fce1e7a197c1f8fb49404b58ed9bb0b843d Gerrit-Change-Number: 16026 Gerrit-PatchSet: 1 Gerrit-Owner: Gabor Kaszab
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Hello Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16000 to look at the new patch set (#3). Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe --- M be/src/codegen/impala-ir.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-functions-ir.cc A be/src/exprs/datasketches-functions.h M be/src/exprs/datasketches-test.cc M be/src/exprs/scalar-expr-evaluator.cc M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test A tests/query_test/test_datasketches.py 12 files changed, 509 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/3 -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 3 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Hello Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16000 to look at the new patch set (#2). Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe --- M be/src/codegen/impala-ir.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-functions-ir.cc A be/src/exprs/datasketches-functions.h M be/src/exprs/datasketches-test.cc M be/src/exprs/scalar-expr-evaluator.cc M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test A tests/query_test/test_datasketches.py 12 files changed, 492 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/2 -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 2 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()
Gabor Kaszab has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16000 Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() .. IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe --- M be/src/codegen/impala-ir.cc M be/src/exprs/CMakeLists.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/exprs/datasketches-functions-ir.cc A be/src/exprs/datasketches-functions.h M be/src/exprs/datasketches-test.cc M be/src/exprs/scalar-expr-evaluator.cc M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test A tests/query_test/test_datasketches.py 12 files changed, 494 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/1 -- To view, visit http://gerrit.cloudera.org:8080/16000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Gerrit-Change-Number: 16000 Gerrit-PatchSet: 1 Gerrit-Owner: Gabor Kaszab
[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15866 ) Change subject: IMPALA-9531: Dropped support for dateless timestamps .. Patch Set 3: (18 comments) Thanks for taking care of my comments! In general the casting part of this change is getting into shape in my opinion. However, I miss the part where the timestamp is not a result of a cast() or to_timestamp() or such, and it comes from reading a table. e.g. I don't see now any changes that could prevent us reading dateless timestamp when we run e.g. "select ts_col from table_name". We discussed offline that some of the file formats are unable to hold these values because only Hive can write them and there Hive doesn't allow writing dateless TSs but for the ones where Impala can write them it should at leas be covered by tests that a previously written file with dateless timestamps can be read after this changes and the dateless TSs will be nulls. http://gerrit.cloudera.org:8080/#/c/15866/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15866/3//COMMIT_MSG@15 PS3, Line 15: select to_timestamp('01:01:01', 'HH:mm:ss'); This should return parse error because the format itself doesn't contain date just time. http://gerrit.cloudera.org:8080/#/c/15866/3//COMMIT_MSG@17 PS3, Line 17: values nit: dot at the end of a sentence. http://gerrit.cloudera.org:8080/#/c/15866/3//COMMIT_MSG@17 PS3, Line 17: This nit: This should be plural in my opinion as it stands for 2 separate examples. http://gerrit.cloudera.org:8080/#/c/15866/3//COMMIT_MSG@28 PS3, Line 28: Testing I still don't see tests where you use a file populated by dateless timestamps (created by an Impala pre this change) and try to read it with Impala containing your changes. http://gerrit.cloudera.org:8080/#/c/15866/3//COMMIT_MSG@29 PS3, Line 29: modified nit: start a sentence with capital letter. http://gerrit.cloudera.org:8080/#/c/15866/3//COMMIT_MSG@29 PS3, Line 29: test nit: tests http://gerrit.cloudera.org:8080/#/c/15866/3//COMMIT_MSG@30 PS3, Line 30: backwards It's just a matter of your point of view which direction you call backwards so I wouldn't use that word here. Simply "Added test to cover timestamp to string path when no date tokens are requested in the output string" or something like that. http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/exprs/expr-test.cc@4507 PS3, Line 4507: 1.00 Is this related to your change or does it come with a rebase you've made in the background? If the second, then a general comment is to try to push rebase related changes to gerrit separately from your actual changes to make the reviewers life easier. http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/exprs/expr-test.cc@7536 PS3, Line 7536: HH:mm:ss Actually, we should get an error when parsing the format and see that there is no date part. http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/runtime/datetime-simple-date-format-parser.h File be/src/runtime/datetime-simple-date-format-parser.h: http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/runtime/datetime-simple-date-format-parser.h@88 PS3, Line 88: parse Can you re-use the CastDirection type similarly as it is used in e.g. ISO SQL format tokenizer? http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/runtime/datetime-simple-date-format-parser.cc File be/src/runtime/datetime-simple-date-format-parser.cc: http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/runtime/datetime-simple-date-format-parser.cc@179 PS3, Line 179: //return (dt_ctx->has_date_toks || dt_ctx->has_time_toks); Please don't leave commented code http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/runtime/datetime-simple-date-format-parser.cc@180 PS3, Line 180: return parse?(dt_ctx->has_date_toks):(dt_ctx->has_date_toks || dt_ctx->has_time_toks); This seems logically correct but not that readable. Additionally, this is not how we format ternary operators in Impala: - Use space around the question marks - As the condition is simple no need to surround with brackets. - Use space around the colon Anyway, I feel that rewriting to something like this would help a bit: if (parse) return dt_ctx->has_date_toks; return dt_ctx->has_date_toks || dt_ctx->has_time_toks; Or something similar. http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/runtime/timestamp-test.cc File be/src/runtime/timestamp-test.cc: http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/runtime/timestamp-test.cc@610 PS3, Line 610: (TimestampTC("-MM-dd HH:m:ss", "2020-05-11 1:24:34", false, true)) Here the intention was to also cover the case when the minute part of the input is 1 char long, right? Instead this tests that the hour part is 1 long. Additionally, shouldn't you also give params to test the expected result?
[Impala-ASF-CR] Revert "IMPALA-8980: Remove functional*.alltypesinsert from EE tests"
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15902 ) Change subject: Revert "IMPALA-8980: Remove functional*.alltypesinsert from EE tests" .. Patch Set 1: Code-Review-1 Tamas just submitted a patch to fix whatever went wrong by IMPALA-8980. Can we give it a chance and let's see if this fixes the build? Btw, if you go for a revert, these are the follow up changes that went in for IMPALA-8980: https://gerrit.cloudera.org/#/c/15766/ https://gerrit.cloudera.org/#/c/15816/ -- To view, visit http://gerrit.cloudera.org:8080/15902 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3d9acba85b1082502b2bc62acc76116de5750d30 Gerrit-Change-Number: 15902 Gerrit-PatchSet: 1 Gerrit-Owner: Joe McDonnell Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sahil Takiar Gerrit-Comment-Date: Tue, 12 May 2020 05:13:45 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9680: Fixed compressed inserts failing
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15816 ) Change subject: IMPALA-9680: Fixed compressed inserts failing .. Patch Set 5: Code-Review+2 Carry-over +2 from Csaba -- To view, visit http://gerrit.cloudera.org:8080/15816 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib3c7ba02190f57a7ed40311c95a3dd9eca9b474d Gerrit-Change-Number: 15816 Gerrit-PatchSet: 5 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Wed, 06 May 2020 14:15:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9631: Import HLL functionality from DataSketches
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15746 to look at the new patch set (#5). Change subject: IMPALA-9631: Import HLL functionality from DataSketches .. IMPALA-9631: Import HLL functionality from DataSketches This patch imports the functionality needed for HLL approximate algorithm from Apache DataSketches. I decided to copy the necessary files into be/src/thirdparty/datasketches. Note, that the original structure of files was changed during this process as originally hll/ and common/ libraries were both affected but I copied these into the same directory so that Impala can compile them without rewriting the include paths in the files themselves. Also note, that not the whole common/ directory was copied just the files needed for HLL. The git hash of the snapshot I used as a source for the files: a6265b307a03085abe26c20413fdbf7d7a5eaf29 Browse the source files here: https://github.com/apache/incubator-datasketches-cpp Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf --- M be/src/exprs/CMakeLists.txt A be/src/exprs/datasketches-test.cc A be/src/thirdparty/datasketches/AuxHashMap-internal.hpp A be/src/thirdparty/datasketches/AuxHashMap.hpp A be/src/thirdparty/datasketches/CommonUtil.hpp A be/src/thirdparty/datasketches/CompositeInterpolationXTable-internal.hpp A be/src/thirdparty/datasketches/CompositeInterpolationXTable.hpp A be/src/thirdparty/datasketches/CouponHashSet-internal.hpp A be/src/thirdparty/datasketches/CouponHashSet.hpp A be/src/thirdparty/datasketches/CouponList-internal.hpp A be/src/thirdparty/datasketches/CouponList.hpp A be/src/thirdparty/datasketches/CubicInterpolation-internal.hpp A be/src/thirdparty/datasketches/CubicInterpolation.hpp A be/src/thirdparty/datasketches/HarmonicNumbers-internal.hpp A be/src/thirdparty/datasketches/HarmonicNumbers.hpp A be/src/thirdparty/datasketches/Hll4Array-internal.hpp A be/src/thirdparty/datasketches/Hll4Array.hpp A be/src/thirdparty/datasketches/Hll6Array-internal.hpp A be/src/thirdparty/datasketches/Hll6Array.hpp A be/src/thirdparty/datasketches/Hll8Array-internal.hpp A be/src/thirdparty/datasketches/Hll8Array.hpp A be/src/thirdparty/datasketches/HllArray-internal.hpp A be/src/thirdparty/datasketches/HllArray.hpp A be/src/thirdparty/datasketches/HllSketch-internal.hpp A be/src/thirdparty/datasketches/HllSketchImpl-internal.hpp A be/src/thirdparty/datasketches/HllSketchImpl.hpp A be/src/thirdparty/datasketches/HllSketchImplFactory.hpp A be/src/thirdparty/datasketches/HllUnion-internal.hpp A be/src/thirdparty/datasketches/HllUtil.hpp A be/src/thirdparty/datasketches/LICENSE A be/src/thirdparty/datasketches/MurmurHash3.h A be/src/thirdparty/datasketches/README.md A be/src/thirdparty/datasketches/RelativeErrorTables-internal.hpp A be/src/thirdparty/datasketches/RelativeErrorTables.hpp A be/src/thirdparty/datasketches/coupon_iterator-internal.hpp A be/src/thirdparty/datasketches/coupon_iterator.hpp A be/src/thirdparty/datasketches/hll.hpp A be/src/thirdparty/datasketches/hll.private.hpp A be/src/thirdparty/datasketches/inv_pow2_table.hpp M bin/rat_exclude_files.txt M bin/run_clang_tidy.sh 41 files changed, 7,381 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/15746/5 -- To view, visit http://gerrit.cloudera.org:8080/15746 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf Gerrit-Change-Number: 15746 Gerrit-PatchSet: 5 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9631: Import HLL functionality from DataSketches
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15746 to look at the new patch set (#4). Change subject: IMPALA-9631: Import HLL functionality from DataSketches .. IMPALA-9631: Import HLL functionality from DataSketches This patch imports the functionality needed for HLL approximate algorithm from Apache DataSketches. I decided to copy the necessary files into be/src/thirdparty/datasketches. Note, that the original structure of files was changed during this process as originally hll/ and common/ libraries were both affected but I copied these into the same directory so that Impala can compile them without rewriting the include paths in the files themselves. Also note, that not the whole common/ directory was copied just the files needed for HLL. The git hash of the snapshot I used as a source for the files: a6265b307a03085abe26c20413fdbf7d7a5eaf29 Browse the source files here: https://github.com/apache/incubator-datasketches-cpp Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf --- M .clang-tidy M be/src/exprs/CMakeLists.txt A be/src/exprs/datasketches-test.cc A be/src/thirdparty/datasketches/AuxHashMap-internal.hpp A be/src/thirdparty/datasketches/AuxHashMap.hpp A be/src/thirdparty/datasketches/CommonUtil.hpp A be/src/thirdparty/datasketches/CompositeInterpolationXTable-internal.hpp A be/src/thirdparty/datasketches/CompositeInterpolationXTable.hpp A be/src/thirdparty/datasketches/CouponHashSet-internal.hpp A be/src/thirdparty/datasketches/CouponHashSet.hpp A be/src/thirdparty/datasketches/CouponList-internal.hpp A be/src/thirdparty/datasketches/CouponList.hpp A be/src/thirdparty/datasketches/CubicInterpolation-internal.hpp A be/src/thirdparty/datasketches/CubicInterpolation.hpp A be/src/thirdparty/datasketches/HarmonicNumbers-internal.hpp A be/src/thirdparty/datasketches/HarmonicNumbers.hpp A be/src/thirdparty/datasketches/Hll4Array-internal.hpp A be/src/thirdparty/datasketches/Hll4Array.hpp A be/src/thirdparty/datasketches/Hll6Array-internal.hpp A be/src/thirdparty/datasketches/Hll6Array.hpp A be/src/thirdparty/datasketches/Hll8Array-internal.hpp A be/src/thirdparty/datasketches/Hll8Array.hpp A be/src/thirdparty/datasketches/HllArray-internal.hpp A be/src/thirdparty/datasketches/HllArray.hpp A be/src/thirdparty/datasketches/HllSketch-internal.hpp A be/src/thirdparty/datasketches/HllSketchImpl-internal.hpp A be/src/thirdparty/datasketches/HllSketchImpl.hpp A be/src/thirdparty/datasketches/HllSketchImplFactory.hpp A be/src/thirdparty/datasketches/HllUnion-internal.hpp A be/src/thirdparty/datasketches/HllUtil.hpp A be/src/thirdparty/datasketches/LICENSE A be/src/thirdparty/datasketches/MurmurHash3.h A be/src/thirdparty/datasketches/README.md A be/src/thirdparty/datasketches/RelativeErrorTables-internal.hpp A be/src/thirdparty/datasketches/RelativeErrorTables.hpp A be/src/thirdparty/datasketches/coupon_iterator-internal.hpp A be/src/thirdparty/datasketches/coupon_iterator.hpp A be/src/thirdparty/datasketches/hll.hpp A be/src/thirdparty/datasketches/hll.private.hpp A be/src/thirdparty/datasketches/inv_pow2_table.hpp M bin/rat_exclude_files.txt M bin/run_clang_tidy.sh 42 files changed, 7,382 insertions(+), 4 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/15746/4 -- To view, visit http://gerrit.cloudera.org:8080/15746 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf Gerrit-Change-Number: 15746 Gerrit-PatchSet: 4 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9631: Import HLL functionality from DataSketches
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15746 ) Change subject: IMPALA-9631: Import HLL functionality from DataSketches .. Patch Set 3: (5 comments) http://gerrit.cloudera.org:8080/#/c/15746/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15746/2//COMMIT_MSG@7 PS2, Line 7: IMPALA-9631: Import HLL functionality from DataSketches > Do you plan to merge this patch as it is? I would consider it more of a POC Well, I don't see a difference of merging it now of waiting for other development work and start the review then. Btw, this unlock other steps under the umbrella Jira: https://issues.apache.org/jira/browse/IMPALA-9593 Additionally, waiting for the rest of the functionality means that splitting up these tasks has no point and we could have kept a single Jira for everything HLL related. So I'd take this as a non-POC review now and we still can make changes later on if we want to modify the build structure or drop the entire library for some reason. http://gerrit.cloudera.org:8080/#/c/15746/2//COMMIT_MSG@9 PS2, Line 9: This patch imports the functionality needed for HLL approximate > Can you add this information in a README.md? To the datasketches folder? Done http://gerrit.cloudera.org:8080/#/c/15746/2//COMMIT_MSG@11 PS2, Line 11: files into be/src/thirdparty/datasketches. Note, that the original > nit: wrap at 72 Done http://gerrit.cloudera.org:8080/#/c/15746/2/be/src/exprs/datasketches-test.cc File be/src/exprs/datasketches-test.cc: http://gerrit.cloudera.org:8080/#/c/15746/2/be/src/exprs/datasketches-test.cc@27 PS2, Line 27: algorith > typo: algorithm Done http://gerrit.cloudera.org:8080/#/c/15746/2/be/src/exprs/datasketches-test.cc@38 PS2, Line 38: std::stringstream sketch_stream1; : std::stringstream sketch_stream2; > I would prefer to use stringstream instead of files to make this faster/avo Done -- To view, visit http://gerrit.cloudera.org:8080/15746 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf Gerrit-Change-Number: 15746 Gerrit-PatchSet: 3 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Mon, 20 Apr 2020 15:26:40 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9631: Import HLL functionality from DataSketches
Hello Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15746 to look at the new patch set (#3). Change subject: IMPALA-9631: Import HLL functionality from DataSketches .. IMPALA-9631: Import HLL functionality from DataSketches This patch imports the functionality needed for HLL approximate algorithm from Apache DataSketches. I decided to copy the necessary files into be/src/thirdparty/datasketches. Note, that the original structure of files was changed during this process as originally hll/ and common/ libraries were both affected but I copied these into the same directory so that Impala can compile them without rewriting the include paths in the files themselves. Also note, that not the whole common/ directory was copied just the files needed for HLL. The git hash of the snapshot I used as a source for the files: a6265b307a03085abe26c20413fdbf7d7a5eaf29 Browse the source files here: https://github.com/apache/incubator-datasketches-cpp Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf --- M .clang-tidy M be/src/exprs/CMakeLists.txt A be/src/exprs/datasketches-test.cc A be/src/thirdparty/datasketches/AuxHashMap-internal.hpp A be/src/thirdparty/datasketches/AuxHashMap.hpp A be/src/thirdparty/datasketches/CommonUtil.hpp A be/src/thirdparty/datasketches/CompositeInterpolationXTable-internal.hpp A be/src/thirdparty/datasketches/CompositeInterpolationXTable.hpp A be/src/thirdparty/datasketches/CouponHashSet-internal.hpp A be/src/thirdparty/datasketches/CouponHashSet.hpp A be/src/thirdparty/datasketches/CouponList-internal.hpp A be/src/thirdparty/datasketches/CouponList.hpp A be/src/thirdparty/datasketches/CubicInterpolation-internal.hpp A be/src/thirdparty/datasketches/CubicInterpolation.hpp A be/src/thirdparty/datasketches/HarmonicNumbers-internal.hpp A be/src/thirdparty/datasketches/HarmonicNumbers.hpp A be/src/thirdparty/datasketches/Hll4Array-internal.hpp A be/src/thirdparty/datasketches/Hll4Array.hpp A be/src/thirdparty/datasketches/Hll6Array-internal.hpp A be/src/thirdparty/datasketches/Hll6Array.hpp A be/src/thirdparty/datasketches/Hll8Array-internal.hpp A be/src/thirdparty/datasketches/Hll8Array.hpp A be/src/thirdparty/datasketches/HllArray-internal.hpp A be/src/thirdparty/datasketches/HllArray.hpp A be/src/thirdparty/datasketches/HllSketch-internal.hpp A be/src/thirdparty/datasketches/HllSketchImpl-internal.hpp A be/src/thirdparty/datasketches/HllSketchImpl.hpp A be/src/thirdparty/datasketches/HllSketchImplFactory.hpp A be/src/thirdparty/datasketches/HllUnion-internal.hpp A be/src/thirdparty/datasketches/HllUtil.hpp A be/src/thirdparty/datasketches/LICENSE A be/src/thirdparty/datasketches/MurmurHash3.h A be/src/thirdparty/datasketches/README.md A be/src/thirdparty/datasketches/RelativeErrorTables-internal.hpp A be/src/thirdparty/datasketches/RelativeErrorTables.hpp A be/src/thirdparty/datasketches/coupon_iterator-internal.hpp A be/src/thirdparty/datasketches/coupon_iterator.hpp A be/src/thirdparty/datasketches/hll.hpp A be/src/thirdparty/datasketches/hll.private.hpp A be/src/thirdparty/datasketches/inv_pow2_table.hpp M bin/rat_exclude_files.txt 41 files changed, 7,379 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/15746/3 -- To view, visit http://gerrit.cloudera.org:8080/15746 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf Gerrit-Change-Number: 15746 Gerrit-PatchSet: 3 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9631: Import HLL functionality from DataSketches
Hello Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15746 to look at the new patch set (#2). Change subject: IMPALA-9631: Import HLL functionality from DataSketches .. IMPALA-9631: Import HLL functionality from DataSketches This patch imports the functionality needed for HLL approximate algorithm from Apache DataSketches. I decided to copy the necessary files into be/src/thirdparty/datasketches. Note, that the original structure of files was changed during this process as originally hll/ and common/ libraries were both affected but I copied these into the same directory so that Impala can compile them without rewriting the include paths in the files themselves. Also note, that not the whole common/ directory was copied just the files needed for HLL. The git hash of the snapshot I used as a source for the files: a6265b307a03085abe26c20413fdbf7d7a5eaf29 Browse the source files here: https://github.com/apache/incubator-datasketches-cpp Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf --- M be/src/exprs/CMakeLists.txt A be/src/exprs/datasketches-test.cc A be/src/thirdparty/datasketches/AuxHashMap-internal.hpp A be/src/thirdparty/datasketches/AuxHashMap.hpp A be/src/thirdparty/datasketches/CommonUtil.hpp A be/src/thirdparty/datasketches/CompositeInterpolationXTable-internal.hpp A be/src/thirdparty/datasketches/CompositeInterpolationXTable.hpp A be/src/thirdparty/datasketches/CouponHashSet-internal.hpp A be/src/thirdparty/datasketches/CouponHashSet.hpp A be/src/thirdparty/datasketches/CouponList-internal.hpp A be/src/thirdparty/datasketches/CouponList.hpp A be/src/thirdparty/datasketches/CubicInterpolation-internal.hpp A be/src/thirdparty/datasketches/CubicInterpolation.hpp A be/src/thirdparty/datasketches/HarmonicNumbers-internal.hpp A be/src/thirdparty/datasketches/HarmonicNumbers.hpp A be/src/thirdparty/datasketches/Hll4Array-internal.hpp A be/src/thirdparty/datasketches/Hll4Array.hpp A be/src/thirdparty/datasketches/Hll6Array-internal.hpp A be/src/thirdparty/datasketches/Hll6Array.hpp A be/src/thirdparty/datasketches/Hll8Array-internal.hpp A be/src/thirdparty/datasketches/Hll8Array.hpp A be/src/thirdparty/datasketches/HllArray-internal.hpp A be/src/thirdparty/datasketches/HllArray.hpp A be/src/thirdparty/datasketches/HllSketch-internal.hpp A be/src/thirdparty/datasketches/HllSketchImpl-internal.hpp A be/src/thirdparty/datasketches/HllSketchImpl.hpp A be/src/thirdparty/datasketches/HllSketchImplFactory.hpp A be/src/thirdparty/datasketches/HllUnion-internal.hpp A be/src/thirdparty/datasketches/HllUtil.hpp A be/src/thirdparty/datasketches/LICENSE A be/src/thirdparty/datasketches/MurmurHash3.h A be/src/thirdparty/datasketches/RelativeErrorTables-internal.hpp A be/src/thirdparty/datasketches/RelativeErrorTables.hpp A be/src/thirdparty/datasketches/coupon_iterator-internal.hpp A be/src/thirdparty/datasketches/coupon_iterator.hpp A be/src/thirdparty/datasketches/hll.hpp A be/src/thirdparty/datasketches/hll.private.hpp A be/src/thirdparty/datasketches/inv_pow2_table.hpp M bin/rat_exclude_files.txt 39 files changed, 7,368 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/15746/2 -- To view, visit http://gerrit.cloudera.org:8080/15746 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf Gerrit-Change-Number: 15746 Gerrit-PatchSet: 2 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9469: ORC scanner vectorization for collection types
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15730 ) Change subject: IMPALA-9469: ORC scanner vectorization for collection types .. Patch Set 2: Code-Review+2 (5 comments) http://gerrit.cloudera.org:8080/#/c/15730/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15730/1//COMMIT_MSG@32 PS1, Line 32: > Unfortunately there is no 'scale' option for loading the nested workloads. No need to put more efforts into this if there is no convenient way to produce scaled nested workloads. Thx for the explanation! http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.h File be/src/exec/orc-column-readers.h: http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.h@95 PS1, Line 95: ReadValue > In that case we'd need more friend declarations, because both OrcListReader Yeah, agree it's more convenient to leave it here. I don't have strong feelings for either option. http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.cc@451 PS1, Line 451: int OrcListReader::NumElements() const { > Yeah, it's actually the same function. I see. But we still gain something with codegen nested types so it's still a win. Thx for rewriting this. http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.cc@473 PS1, Line 473: > If the list is an array of structs, then we can have many children, but in Indeed. Thx! http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.cc@477 PS1, Line 477: return Status::OK(); > Yeah, it's confusing. slot_desc_ being null is independent of MaterializeTu Hmm, According to my understanding MaterializeTuple() is true when the reader writes directly to a slot and false if the reader delegates the writing to its children. I might have misunderstood then. Still, introducing DirectReader() helps readability, thx. -- To view, visit http://gerrit.cloudera.org:8080/15730 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I477961b427406035a04529c5175dbee8f8a93ad5 Gerrit-Change-Number: 15730 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 17 Apr 2020 13:13:34 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9631: Import HLL functionality from DataSketches
Gabor Kaszab has uploaded this change for review. ( http://gerrit.cloudera.org:8080/15746 Change subject: IMPALA-9631: Import HLL functionality from DataSketches .. IMPALA-9631: Import HLL functionality from DataSketches This patch imports the functionality needed for HLL approximate algorithm from Apache DataSketches. I decided to copy the necessary files into be/src/thirdparty/datasketches. Note, that the original structure of files was changed during this process as originally hll/ and common/ libraries were both affected but I copied these into the same directory so that Impala can compile them without rewriting the include paths in the files themselves. Also note, that not the whole common/ directory was copied just the files needed for HLL. The git hash of the snapshot I used as a source for the files: a6265b307a03085abe26c20413fdbf7d7a5eaf29 Browse the source files here: https://github.com/apache/incubator-datasketches-cpp Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf --- M be/src/exprs/CMakeLists.txt A be/src/exprs/datasketches-test.cc A be/src/thirdparty/datasketches/AuxHashMap-internal.hpp A be/src/thirdparty/datasketches/AuxHashMap.hpp A be/src/thirdparty/datasketches/CommonUtil.hpp A be/src/thirdparty/datasketches/CompositeInterpolationXTable-internal.hpp A be/src/thirdparty/datasketches/CompositeInterpolationXTable.hpp A be/src/thirdparty/datasketches/CouponHashSet-internal.hpp A be/src/thirdparty/datasketches/CouponHashSet.hpp A be/src/thirdparty/datasketches/CouponList-internal.hpp A be/src/thirdparty/datasketches/CouponList.hpp A be/src/thirdparty/datasketches/CubicInterpolation-internal.hpp A be/src/thirdparty/datasketches/CubicInterpolation.hpp A be/src/thirdparty/datasketches/HarmonicNumbers-internal.hpp A be/src/thirdparty/datasketches/HarmonicNumbers.hpp A be/src/thirdparty/datasketches/Hll4Array-internal.hpp A be/src/thirdparty/datasketches/Hll4Array.hpp A be/src/thirdparty/datasketches/Hll6Array-internal.hpp A be/src/thirdparty/datasketches/Hll6Array.hpp A be/src/thirdparty/datasketches/Hll8Array-internal.hpp A be/src/thirdparty/datasketches/Hll8Array.hpp A be/src/thirdparty/datasketches/HllArray-internal.hpp A be/src/thirdparty/datasketches/HllArray.hpp A be/src/thirdparty/datasketches/HllSketch-internal.hpp A be/src/thirdparty/datasketches/HllSketchImpl-internal.hpp A be/src/thirdparty/datasketches/HllSketchImpl.hpp A be/src/thirdparty/datasketches/HllSketchImplFactory.hpp A be/src/thirdparty/datasketches/HllUnion-internal.hpp A be/src/thirdparty/datasketches/HllUtil.hpp A be/src/thirdparty/datasketches/MurmurHash3.h A be/src/thirdparty/datasketches/RelativeErrorTables-internal.hpp A be/src/thirdparty/datasketches/RelativeErrorTables.hpp A be/src/thirdparty/datasketches/coupon_iterator-internal.hpp A be/src/thirdparty/datasketches/coupon_iterator.hpp A be/src/thirdparty/datasketches/hll.hpp A be/src/thirdparty/datasketches/hll.private.hpp A be/src/thirdparty/datasketches/inv_pow2_table.hpp 37 files changed, 7,137 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/15746/1 -- To view, visit http://gerrit.cloudera.org:8080/15746 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf Gerrit-Change-Number: 15746 Gerrit-PatchSet: 1 Gerrit-Owner: Gabor Kaszab
[Impala-ASF-CR] IMPALA-9596: deflake test tpch mem limit single node
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15700 ) Change subject: IMPALA-9596: deflake test_tpch_mem_limit_single_node .. Patch Set 4: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/15700 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4678963c264b7c15fbac6f71721162b38676aa21 Gerrit-Change-Number: 15700 Gerrit-PatchSet: 4 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 16 Apr 2020 12:13:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9469: ORC scanner vectorization for collection types
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15730 ) Change subject: IMPALA-9469: ORC scanner vectorization for collection types .. Patch Set 1: (11 comments) Nice change and cleanup! I took a look but have to admit that the part that touches batch->offsets (mostly in SetPositionSlot()) is not 100% clear for me. http://gerrit.cloudera.org:8080/#/c/15730/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15730/1//COMMIT_MSG@10 PS1, Line 10: now nit: now on http://gerrit.cloudera.org:8080/#/c/15730/1//COMMIT_MSG@13 PS1, Line 13: now same http://gerrit.cloudera.org:8080/#/c/15730/1//COMMIT_MSG@18 PS1, Line 18: I eliminated the OrcComplexColumnReader::TransferTuple() interface : and the related codes. nit: I feel that this can be merged with the next paragraph. http://gerrit.cloudera.org:8080/#/c/15730/1//COMMIT_MSG@32 PS1, Line 32: scale=1 Is there a plan to check performance on a higher scale data, e.g. scale=25 so that we can compare these improvements to the ones we achieved for primitive types? Giving it a second thought we tested primitives on TPCH instead of TPCH Nested, but still I'm wondering if it makes sense to see these numbers on a higher scale. http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.h File be/src/exec/orc-column-readers.h: http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.h@50 PS1, Line 50: alwas typo http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.h@95 PS1, Line 95: ReadValue >From now on ReadValue() is only called from ReadValueBatch(), right? If this >is the case can you make it private? http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.h@108 PS1, Line 108: lists Not just lists but maps as well, right? http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.h@487 PS1, Line 487: int NumElements() const override { : if (MaterializeTuple()) return vbatch_->numElements; : DCHECK_EQ(children().size(), 1); : OrcColumnReader* child = children()[0]; : return child->NumElements(); : } Both OrcListReader and OrcMapReader overrides this so actually this will be only for OrcStructReader, right? Why not move it there? http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.cc@451 PS1, Line 451: Status OrcCollectionReader::ReadValueBatch This looks very much the same as OrcPrimitiveColumnReader::ReadValueBatch() except L462. Do you know a way to avoid this code duplication? http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.cc@473 PS1, Line 473: return children_[0]->NumElements(); Is it worth a DCHECK to verify that this OrcListReader has exactly one child? http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.cc@477 PS1, Line 477: if (slot_desc_ == nullptr) { This check is not that clear for me. Isn't this identical to !MaterializeTuple() check? I see other occurences below as well all for collections. -- To view, visit http://gerrit.cloudera.org:8080/15730 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I477961b427406035a04529c5175dbee8f8a93ad5 Gerrit-Change-Number: 15730 Gerrit-PatchSet: 1 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Quanlong Huang Gerrit-Comment-Date: Wed, 15 Apr 2020 13:34:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9596: deflake test tpch mem limit single node
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15700 ) Change subject: IMPALA-9596: deflake test_tpch_mem_limit_single_node .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/15700 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4678963c264b7c15fbac6f71721162b38676aa21 Gerrit-Change-Number: 15700 Gerrit-PatchSet: 2 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 15 Apr 2020 05:44:55 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9596: deflake test tpch mem limit single node
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15700 ) Change subject: IMPALA-9596: deflake test_tpch_mem_limit_single_node .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/15700/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15700/1//COMMIT_MSG@16 PS1, Line 16: was was nit: typo http://gerrit.cloudera.org:8080/#/c/15700/1/be/src/runtime/collection-value-builder.h File be/src/runtime/collection-value-builder.h: http://gerrit.cloudera.org:8080/#/c/15700/1/be/src/runtime/collection-value-builder.h@65 PS1, Line 65: have_debug_action_ What is the point of introducing a member for this? Can't you do "!state->query_options().debug_action.empty()" here? Is this a perf optimisation? http://gerrit.cloudera.org:8080/#/c/15700/1/tests/query_test/test_nested_types.py File tests/query_test/test_nested_types.py: http://gerrit.cloudera.org:8080/#/c/15700/1/tests/query_test/test_nested_types.py@148 PS1, Line 148: with memory limits tuned for : a 3-node HDFS minicluster This part of the comment is no longer true, right? -- To view, visit http://gerrit.cloudera.org:8080/15700 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4678963c264b7c15fbac6f71721162b38676aa21 Gerrit-Change-Number: 15700 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Comment-Date: Tue, 14 Apr 2020 08:20:05 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6360: Don't show full query statement on Impala WebUI by default
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15288 ) Change subject: IMPALA-6360: Don't show full query statement on Impala WebUI by default .. Patch Set 16: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/15288 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib7109a0be5d1022b4f8d6e72441cf5dc1dc42605 Gerrit-Change-Number: 15288 Gerrit-PatchSet: 16 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Tamas Mate Gerrit-Comment-Date: Mon, 23 Mar 2020 12:54:28 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6360: Don't show full query statement on Impala WebUI by default
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15288 ) Change subject: IMPALA-6360: Don't show full query statement on Impala WebUI by default .. Patch Set 15: Code-Review+1 (2 comments) I'm fine with the change, just left some nits. Will +2 once they are covered. http://gerrit.cloudera.org:8080/#/c/15288/15/tests/custom_cluster/test_web_pages.py File tests/custom_cluster/test_web_pages.py: http://gerrit.cloudera.org:8080/#/c/15288/15/tests/custom_cluster/test_web_pages.py@114 PS15, Line 114: imput nit: typo http://gerrit.cloudera.org:8080/#/c/15288/15/tests/custom_cluster/test_web_pages.py@130 PS15, Line 130: part nit: partial -- To view, visit http://gerrit.cloudera.org:8080/15288 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib7109a0be5d1022b4f8d6e72441cf5dc1dc42605 Gerrit-Change-Number: 15288 Gerrit-PatchSet: 15 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Tamas Mate Gerrit-Comment-Date: Wed, 18 Mar 2020 14:23:00 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6360: Don't show full query statement on Impala WebUI by default
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15288 ) Change subject: IMPALA-6360: Don't show full query statement on Impala WebUI by default .. Patch Set 14: (10 comments) http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py File tests/custom_cluster/test_web_pages.py: http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@113 PS14, Line 113: Create a long select query then check if it is contained in the response json. nit: It would be more beneficial to describe the expected functionality here like: Check if the full query string is displayed in the query list on the WebUI. http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@115 PS14, Line 115: "x " * 450 As this string is used multiple times I'd put this into a variable. http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@115 PS14, Line 115: "select \"{0}\"" Might not work but could you try to put the string within single quotes instead of double quotes so that you don't have to escape the double quotes inside? http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@116 PS14, Line 116: theere typo http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@118 PS14, Line 118: expected = "select \\\"{0}\\\"".format("x " * 450) Same comment for single quotes as above. http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@129 PS14, Line 129: """Create a long select query then check if it is contained in the response json.""" This statements isn't true for this test. Might be a copy-paste error. http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@131 PS14, Line 131: query = "select \"{0}\"".format("x " * 450) Same comment for single quotes as above. http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@133 PS14, Line 133: theere typo http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@135 PS14, Line 135: expected = "select \\\"x ..." Same comment for single quotes as above. http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@139 PS14, Line 139: if unexpected in response_json: : assert False, "The response is containing the full query." Is this needed? L141 will check the expected output anyway that won't be a prefix of the input because of the "..." appended to the end. -- To view, visit http://gerrit.cloudera.org:8080/15288 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib7109a0be5d1022b4f8d6e72441cf5dc1dc42605 Gerrit-Change-Number: 15288 Gerrit-PatchSet: 14 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Tamas Mate Gerrit-Comment-Date: Mon, 16 Mar 2020 12:55:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6360: Don't show full query statement on Impala WebUI by default
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15288 ) Change subject: IMPALA-6360: Don't show full query statement on Impala WebUI by default .. Patch Set 12: (1 comment) http://gerrit.cloudera.org:8080/#/c/15288/12/tests/webserver/test_web_pages.py File tests/webserver/test_web_pages.py: http://gerrit.cloudera.org:8080/#/c/15288/12/tests/webserver/test_web_pages.py@419 PS12, Line 419: I miss a test to check that the length of the query statement is configurable. Could you look around in the tests and check if there is a way to test these kind of configs as well? -- To view, visit http://gerrit.cloudera.org:8080/15288 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib7109a0be5d1022b4f8d6e72441cf5dc1dc42605 Gerrit-Change-Number: 15288 Gerrit-PatchSet: 12 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Tamas Mate Gerrit-Comment-Date: Fri, 06 Mar 2020 13:15:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7686: Allow RANGE() clause before HASH() clause for PARTITION BY
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15332 ) Change subject: IMPALA-7686: Allow RANGE() clause before HASH() clause for PARTITION BY .. Patch Set 7: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/15332 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I914e340e9acfb0f49c7d4f78705dbd9bde0aec8c Gerrit-Change-Number: 15332 Gerrit-PatchSet: 7 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tamas Mate Gerrit-Comment-Date: Wed, 04 Mar 2020 13:51:15 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6506: Codegen in ORC scanner for primitives and struct
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15350 ) Change subject: IMPALA-6506: Codegen in ORC scanner for primitives and struct .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/15350/1/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: http://gerrit.cloudera.org:8080/#/c/15350/1/be/src/exec/hdfs-scan-node-base.cc@437 PS1, Line 437: case THdfsFileFormat::PARQUET: : case THdfsFileFormat::ORC: : status = HdfsColumnarScanner::Codegen(this, state, ); > It would be nice to check what happens if there are Parquet and ORC partiti Did check it manually and this runs without any error scanning partitions well for both formats. Added a comment about this to the commit msg. -- To view, visit http://gerrit.cloudera.org:8080/15350 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2352d0c8fc75ff722e931bc8c866b3e43d3636f4 Gerrit-Change-Number: 15350 Gerrit-PatchSet: 2 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 04 Mar 2020 12:05:42 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6506: Codegen in ORC scanner for primitives and struct
Hello Zoltan Borok-Nagy, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15350 to look at the new patch set (#2). Change subject: IMPALA-6506: Codegen in ORC scanner for primitives and struct .. IMPALA-6506: Codegen in ORC scanner for primitives and struct IMPALA-9228 introduced scratch batch handling for struct and primitive types in the ORC scanner and the existing scratch batch logic already supports Codegen for ProcessScratchBatch() function. This change turns on this Codegen logic for primitives types and structs in the ORC scanner. Note, if the query involves collection types then ProcessScratchBatch() is still codegend but the codegend function isn't used as the regular row-by-row approach is followed in this case without using a scratch batch. Testing: - Re-run the whole test suite to check for regressions. - Checked the performance on a scale 25 TPCH workload in ORC format using single_node_perf_run.py. Comparing the query runtimes it seems that codegen brings a 1-21% improvement for most of the queries. There is a slight decrease in 3 queries that are not scan-heavy where codegen doesn't provide any help for scanning. However, these are short queries where the size of the degradation is in subseconds so I'd say the decrease is negligible. - Did a manual check for a table that contains both Parquet and ORC partitions. Verified that in this case ProcessScratchBatch() is codegend for both formats and the query results are as expected. Change-Id: I2352d0c8fc75ff722e931bc8c866b3e43d3636f4 --- M be/src/exec/hdfs-columnar-scanner.cc M be/src/exec/hdfs-columnar-scanner.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h 6 files changed, 58 insertions(+), 47 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/15350/2 -- To view, visit http://gerrit.cloudera.org:8080/15350 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2352d0c8fc75ff722e931bc8c866b3e43d3636f4 Gerrit-Change-Number: 15350 Gerrit-PatchSet: 2 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-7686: Allow RANGE() clause before HASH() clause for PARTITION BY
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15332 ) Change subject: IMPALA-7686: Allow RANGE() clause before HASH() clause for PARTITION BY .. Patch Set 5: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/15332 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I914e340e9acfb0f49c7d4f78705dbd9bde0aec8c Gerrit-Change-Number: 15332 Gerrit-PatchSet: 5 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tamas Mate Gerrit-Comment-Date: Wed, 04 Mar 2020 09:30:11 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6360: Don't show full query statement on Impala WebUI by default
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15288 ) Change subject: IMPALA-6360: Don't show full query statement on Impala WebUI by default .. Patch Set 10: (4 comments) http://gerrit.cloudera.org:8080/#/c/15288/10/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/15288/10/be/src/service/impala-server.cc@151 PS10, Line 151: display nit: displayed http://gerrit.cloudera.org:8080/#/c/15288/10/tests/webserver/test_web_pages.py File tests/webserver/test_web_pages.py: http://gerrit.cloudera.org:8080/#/c/15288/10/tests/webserver/test_web_pages.py@422 PS10, Line 422: enought nit: typo http://gerrit.cloudera.org:8080/#/c/15288/10/tests/webserver/test_web_pages.py@429 PS10, Line 429: contain nit: contains http://gerrit.cloudera.org:8080/#/c/15288/10/tests/webserver/test_web_pages.py@434 PS10, Line 434: stmt_response The name of this variable is misleading here as it's not storing the response anymore. -- To view, visit http://gerrit.cloudera.org:8080/15288 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib7109a0be5d1022b4f8d6e72441cf5dc1dc42605 Gerrit-Change-Number: 15288 Gerrit-PatchSet: 10 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tamas Mate Gerrit-Comment-Date: Wed, 04 Mar 2020 08:12:27 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-7686: Allow RANGE() clause before HASH() clause for PARTITION BY
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15332 ) Change subject: IMPALA-7686: Allow RANGE() clause before HASH() clause for PARTITION BY .. Patch Set 4: (2 comments) Minor nits again. I'll +2 if these are covered http://gerrit.cloudera.org:8080/#/c/15332/4/fe/src/test/java/org/apache/impala/analysis/AnalyzeKuduDDLTest.java File fe/src/test/java/org/apache/impala/analysis/AnalyzeKuduDDLTest.java: http://gerrit.cloudera.org:8080/#/c/15332/4/fe/src/test/java/org/apache/impala/analysis/AnalyzeKuduDDLTest.java@123 PS4, Line 123: swapped nit: Swapped http://gerrit.cloudera.org:8080/#/c/15332/4/fe/src/test/java/org/apache/impala/analysis/AnalyzeKuduDDLTest.java@128 PS4, Line 128: swapped nit: this word is not needed (and doesn't really makes sense here) -- To view, visit http://gerrit.cloudera.org:8080/15332 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I914e340e9acfb0f49c7d4f78705dbd9bde0aec8c Gerrit-Change-Number: 15332 Gerrit-PatchSet: 4 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tamas Mate Gerrit-Comment-Date: Wed, 04 Mar 2020 07:47:56 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6506: Codegen in ORC scanner for primitives and struct
Gabor Kaszab has uploaded this change for review. ( http://gerrit.cloudera.org:8080/15350 Change subject: IMPALA-6506: Codegen in ORC scanner for primitives and struct .. IMPALA-6506: Codegen in ORC scanner for primitives and struct IMPALA-9228 introduced scratch batch handling for struct and primitive types in the ORC scanner and the existing scratch batch logic already supports Codegen for ProcessScratchBatch() function. This change turns on this Codegen logic for primitives types and structs in the ORC scanner. Note, if the query involves collection types then ProcessScratchBatch() is still codegend but the codegend function isn't used as the regular row-by-row approach is followed in this case without using a scratch batch. Testing: - Re-run the whole test suite to check for regressions. - Checked the performance on a scale 25 TPCH workload in ORC format using single_node_perf_run.py. Comparing the query runtimes it seems that codegen brings a 1-21% improvement for most of the queries. There is a slight decrease in 3 queries that are not scan-heavy where codegen doesn't provide any help for scanning. However, these are short queries where the size of the degradation is in subseconds so I'd say the decrease is negligible. Change-Id: I2352d0c8fc75ff722e931bc8c866b3e43d3636f4 --- M be/src/exec/hdfs-columnar-scanner.cc M be/src/exec/hdfs-columnar-scanner.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h 6 files changed, 58 insertions(+), 47 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/15350/1 -- To view, visit http://gerrit.cloudera.org:8080/15350 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I2352d0c8fc75ff722e931bc8c866b3e43d3636f4 Gerrit-Change-Number: 15350 Gerrit-PatchSet: 1 Gerrit-Owner: Gabor Kaszab
[Impala-ASF-CR] IMPALA-6360: Don't show full query statement on Impala webUI by default
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15288 ) Change subject: IMPALA-6360: Don't show full query statement on Impala webUI by default .. Patch Set 8: (8 comments) http://gerrit.cloudera.org:8080/#/c/15288/8//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15288/8//COMMIT_MSG@7 PS8, Line 7: webUI nit: WebUI http://gerrit.cloudera.org:8080/#/c/15288/8//COMMIT_MSG@9 PS8, Line 9: Added the ‘query_stmt_size’ flag to impala-server.cc with default value : of 250 and modified the ‘ImpalaHttpHandler::QueryStateToJson()’ to : truncate the end of the statements if they are too long. Similarly to my comment in your other commit: I don't see the point of mentioning all these technical details in the commit message as they hide the original meaning. Here you should describe that in the WebUI's query list the query statements are trimmed, but you can see the full query statement in the details page. Additionally mention please that the default statement length is 250 chars but it can be adjusted by a flag that requires an Impala restart to take effect. I'd also describe if there is a special value of this new flag that for example defaults to showing the full query statement. http://gerrit.cloudera.org:8080/#/c/15288/8//COMMIT_MSG@12 PS8, Line 12: It would be useful to add an example how to set the new flag to a custom value. http://gerrit.cloudera.org:8080/#/c/15288/8/be/src/service/impala-http-handler.cc File be/src/service/impala-http-handler.cc: http://gerrit.cloudera.org:8080/#/c/15288/8/be/src/service/impala-http-handler.cc@495 PS8, Line 495: document->AddMember("completed_stmt_size", FLAGS_query_stmt_size, : document->GetAllocator()); Do you need this? If I understand correctly this adds a new variable to the web page document called "completed_stmt_size" but you don't use it anywhere in the page itself. http://gerrit.cloudera.org:8080/#/c/15288/8/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/15288/8/be/src/service/impala-server.cc@151 PS8, Line 151: statements have unbounded size nit: It would be a bit better in my opinion to say like "the full statement is display in the query log without trimming." http://gerrit.cloudera.org:8080/#/c/15288/8/tests/webserver/test_web_pages.py File tests/webserver/test_web_pages.py: http://gerrit.cloudera.org:8080/#/c/15288/8/tests/webserver/test_web_pages.py@421 PS8, Line 421: then search the response json for the : expected, truncated statement. nit: then check if it is truncated in the response json. http://gerrit.cloudera.org:8080/#/c/15288/8/tests/webserver/test_web_pages.py@423 PS8, Line 423: query = """select \"Proin bibendum erat eu libero iaculis pharetra. Duis nit: Pls add a comment how long the input query string is. Also add a similar comment above the expected_result http://gerrit.cloudera.org:8080/#/c/15288/8/tests/webserver/test_web_pages.py@440 PS8, Line 440: assert expected_result in stmt_response, "No matching statement found in the jsons." This part is a bit weird: above you set stmt_response if the expected_result is in a subsection of the json output and once you finish the for loop you do the same check. What if you do an "assert false" in this line and return in L439 instead of the break? Or option2: instead of stmt_response you introduce a bool that initializes to False and you set it to True if you have a match for the expected_result. You can assert for the bool variable being True in the end. Would this make sense? -- To view, visit http://gerrit.cloudera.org:8080/15288 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib7109a0be5d1022b4f8d6e72441cf5dc1dc42605 Gerrit-Change-Number: 15288 Gerrit-PatchSet: 8 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tamas Mate Gerrit-Comment-Date: Tue, 03 Mar 2020 12:45:48 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15104 ) Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch .. Patch Set 9: (4 comments) http://gerrit.cloudera.org:8080/#/c/15104/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15104/5//COMMIT_MSG@20 PS5, Line 20: Note, > typo: testing Done http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/hdfs-columnar-scanner-ir.cc File be/src/exec/hdfs-columnar-scanner-ir.cc: http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/hdfs-columnar-scanner-ir.cc@23 PS5, Line 23: scratch_batch_ > nit: the != nullptr in written out in most of Impala Done http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/hdfs-orc-scanner.cc@702 PS5, Line 702: > nit: the != nullptr in written out in most of Impala Done http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/orc-column-readers.cc@264 PS5, Line 264: >col_id_path_map_[ > Can you mention it in the commit message that no scratch batch is used if t I've already added a note about this to the commit msg. -- To view, visit http://gerrit.cloudera.org:8080/15104 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca Gerrit-Change-Number: 15104 Gerrit-PatchSet: 9 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 03 Mar 2020 10:01:28 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch
Hello Zoltan Borok-Nagy, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15104 to look at the new patch set (#9). Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch .. IMPALA-9228: ORC scanner reads rows into scratch batch Because of performance considerations this change enhances ORC scanner to populate a scratch batch on a column-by-column manner using data from the column readers. Once this is done the parquet code was reused to apply runtime filter and conjuncts and to populate the outgoing row batch. This approach reduces the number of virtual function calls and takes advantage of the columnar orientation of the data to enhance scan performance. Additionally, introducing the scratch batch concept also opens the door for codegen runtime filtering and applying conjuncts. Note, this change doesn't cover collection types just primitive types and struct. Collection types will follow the previous row-by-row approach. Testing: - Re-run the full test suite to verify that no regression is introduced. - Checked the performance impact by running TPCH workload on a scale 25 database using single_node_perf_run.py. The total query runtime is decreased by 0-20% depending on how scan heavy the particular query was. The more scan heavy the query is the more performance gain I observe. Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/impala-ir.cc M be/src/exec/CMakeLists.txt R be/src/exec/hdfs-columnar-scanner-ir.cc A be/src/exec/hdfs-columnar-scanner.cc A be/src/exec/hdfs-columnar-scanner.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h M be/src/exec/hdfs-scanner.h M be/src/exec/orc-column-readers.cc M be/src/exec/orc-column-readers.h M be/src/exec/parquet/CMakeLists.txt M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h R be/src/exec/scratch-tuple-batch.h 15 files changed, 425 insertions(+), 144 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/15104/9 -- To view, visit http://gerrit.cloudera.org:8080/15104 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca Gerrit-Change-Number: 15104 Gerrit-PatchSet: 9 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-7686: Allow RANGE() clause before HASH() clause for PARTITION BY
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15332 ) Change subject: IMPALA-7686: Allow RANGE() clause before HASH() clause for PARTITION BY .. Patch Set 3: Code-Review+1 (1 comment) I just have one nit. Let me give a +1 now and once you covered that I can +2 http://gerrit.cloudera.org:8080/#/c/15332/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15332/3//COMMIT_MSG@12 PS3, Line 12: Added an extra AnalyzesOk into ‘AnalyzeKuduDDLTest()’ -> : ‘testDDlsOnKuduTable()’ in AnalyzeKuduDDLTest.java where the RANGE() and : HASH() pair is swapped. nit: no need to mention the each line of test change in the commit message. A single sentence without this level of technical details is fine. Like "Added and extra analyzer test to cover the case when RANGE() is before HASH() for Kudu tables." -- To view, visit http://gerrit.cloudera.org:8080/15332 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I914e340e9acfb0f49c7d4f78705dbd9bde0aec8c Gerrit-Change-Number: 15332 Gerrit-PatchSet: 3 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tamas Mate Gerrit-Comment-Date: Tue, 03 Mar 2020 09:47:36 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15104 ) Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/15104/7/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: http://gerrit.cloudera.org:8080/#/c/15104/7/be/src/exec/hdfs-scanner.h@176 PS7, Line 176: Not inlined in IR so it can be replaced with a constant. > nit: this comment is applicable to both 'tuple_byte_size' functions, so it Done -- To view, visit http://gerrit.cloudera.org:8080/15104 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca Gerrit-Change-Number: 15104 Gerrit-PatchSet: 7 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 02 Mar 2020 15:26:44 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch
Hello Zoltan Borok-Nagy, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15104 to look at the new patch set (#8). Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch .. IMPALA-9228: ORC scanner reads rows into scratch batch Because of performance considerations this change enhances ORC scanner to populate a scratch batch on a column-by-column manner using data from the column readers. Once this is done the parquet code was reused to apply runtime filter and conjuncts and to populate the outgoing row batch. This approach reduces the number of virtual function calls and takes advantage of the columnar orientation of the data to enhance scan performance. Additionally, introducing the scratch batch concept also opens the door for codegen runtime filtering and applying conjuncts. Note, this change doesn't cover collection types just primitive types and struct. Tesing: - Re-run the full test suite to verify that no regression is introduced. - Checked the performance impact by running TPCH workload on a scale 25 database using single_node_perf_run.py. The total query runtime is decreased by 0-20% depending on how scan heavy the particular query was. The more scan heavy the query is the more performance gain I observe. Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/impala-ir.cc M be/src/exec/CMakeLists.txt R be/src/exec/hdfs-columnar-scanner-ir.cc A be/src/exec/hdfs-columnar-scanner.cc A be/src/exec/hdfs-columnar-scanner.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h M be/src/exec/hdfs-scanner.h M be/src/exec/orc-column-readers.cc M be/src/exec/orc-column-readers.h M be/src/exec/parquet/CMakeLists.txt M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h R be/src/exec/scratch-tuple-batch.h 15 files changed, 425 insertions(+), 144 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/15104/8 -- To view, visit http://gerrit.cloudera.org:8080/15104 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca Gerrit-Change-Number: 15104 Gerrit-PatchSet: 8 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15104 ) Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch .. Patch Set 7: (3 comments) http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/hdfs-scanner.h@a566 PS5, Line 566: : : : : > nit: maybe these could be moved together to public. Done http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/orc-column-readers.h File be/src/exec/orc-column-readers.h: http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/orc-column-readers.h@188 PS5, Line 188: Status ReadValue(int row_idx, Tuple* tuple, MemPool* pool) final WARN_UNUSED_RESULT; > Since ReadValue() is public I think you don't need the friend declarations Friend declarations are still needed for accessing 'derived->batch_' from OrcPrimitiveColumnReader::ReadValueBatch() http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/orc-column-readers.cc@221 PS5, Line 221: ErrorMsg msg(errorCode, scanner_->filename(), orc_column_id_); : return scanner_- > nit: fits single line again Done -- To view, visit http://gerrit.cloudera.org:8080/15104 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca Gerrit-Change-Number: 15104 Gerrit-PatchSet: 7 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 02 Mar 2020 13:14:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch
Hello Zoltan Borok-Nagy, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15104 to look at the new patch set (#7). Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch .. IMPALA-9228: ORC scanner reads rows into scratch batch Because of performance considerations this change enhances ORC scanner to populate a scratch batch on a column-by-column manner using data from the column readers. Once this is done the parquet code was reused to apply runtime filter and conjuncts and to populate the outgoing row batch. This approach reduces the number of virtual function calls and takes advantage of the columnar orientation of the data to enhance scan performance. Additionally, introducing the scratch batch concept also opens the door for codegen runtime filtering and applying conjuncts. Tesing: - Re-run the full test suite to verify that no regression is introduced. - Checked the performance impact by running TPCH workload on a scale 25 database using single_node_perf_run.py. The total query runtime is decreased by 0-20% depending on how scan heavy the particular query was. The more scan heavy the query is the more performance gain I observe. Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/impala-ir.cc M be/src/exec/CMakeLists.txt R be/src/exec/hdfs-columnar-scanner-ir.cc A be/src/exec/hdfs-columnar-scanner.cc A be/src/exec/hdfs-columnar-scanner.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h M be/src/exec/hdfs-scanner.h M be/src/exec/orc-column-readers.cc M be/src/exec/orc-column-readers.h M be/src/exec/parquet/CMakeLists.txt M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h R be/src/exec/scratch-tuple-batch.h 15 files changed, 426 insertions(+), 144 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/15104/7 -- To view, visit http://gerrit.cloudera.org:8080/15104 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca Gerrit-Change-Number: 15104 Gerrit-PatchSet: 7 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15104 ) Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch .. Patch Set 6: PS6 is a rebase with master to resolve conflict with the ORC string allocations enhancement. -- To view, visit http://gerrit.cloudera.org:8080/15104 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca Gerrit-Change-Number: 15104 Gerrit-PatchSet: 6 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 02 Mar 2020 12:57:33 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch
Hello Zoltan Borok-Nagy, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15104 to look at the new patch set (#6). Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch .. IMPALA-9228: ORC scanner reads rows into scratch batch Because of performance considerations this change enhances ORC scanner to populate a scratch batch on a column-by-column manner using data from the column readers. Once this is done the parquet code was reused to apply runtime filter and conjuncts and to populate the outgoing row batch. This approach reduces the number of virtual function calls and takes advantage of the columnar orientation of the data to enhance scan performance. Additionally, introducing the scratch batch concept also opens the door for codegen runtime filtering and applying conjuncts. Tesing: - Re-run the full test suite to verify that no regression is introduced. - Checked the performance impact by running TPCH workload on a scale 25 database using single_node_perf_run.py. The total query runtime is decreased by 0-20% depending on how scan heavy the particular query was. The more scan heavy the query is the more performance gain I observe. Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/impala-ir.cc M be/src/exec/CMakeLists.txt R be/src/exec/hdfs-columnar-scanner-ir.cc A be/src/exec/hdfs-columnar-scanner.cc A be/src/exec/hdfs-columnar-scanner.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h M be/src/exec/hdfs-scanner.h M be/src/exec/orc-column-readers.cc M be/src/exec/orc-column-readers.h M be/src/exec/parquet/CMakeLists.txt M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h R be/src/exec/scratch-tuple-batch.h 15 files changed, 423 insertions(+), 140 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/15104/6 -- To view, visit http://gerrit.cloudera.org:8080/15104 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca Gerrit-Change-Number: 15104 Gerrit-PatchSet: 6 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15104 ) Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch .. Patch Set 5: (7 comments) http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/hdfs-columnar-scanner.h File be/src/exec/hdfs-columnar-scanner.h: http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/hdfs-columnar-scanner.h@40 PS4, Line 40: () > nit: "...in the derived classes." Done http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/hdfs-columnar-scanner.h@45 PS4, Line 45: > nit: you could add "= nullptr" to make the default value of it obvious. And Done http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/hdfs-orc-scanner.h File be/src/exec/hdfs-orc-scanner.h: http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/hdfs-orc-scanner.h@258 PS4, Line 258: has a collection : /// amongst its children. > nit: sounds weird to me, maybe just "has a collection amongst..." Done http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/hdfs-orc-scanner.cc@703 PS4, Line 703: DCHECK(scratch_batch_->AtEnd()); > We hit this DCHECK when we come from the loop of AssembleRows(). We can als Done http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/orc-column-readers.h File be/src/exec/orc-column-readers.h: http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/orc-column-readers.h@130 PS4, Line 130: to implement static polymorphism via the "curiously > nit: to implement static polymorphism via the "curiously recurring template Done http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/orc-column-readers.h@161 PS4, Line 161: > If you'd add the 'final' specifier to the ReadValue() implementations then Done http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/orc-column-readers.h@469 PS4, Line 469: /// Keep row index if we're top level readers > You need to rename this to make clang-tidy happy as it hides virtual functi Done -- To view, visit http://gerrit.cloudera.org:8080/15104 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca Gerrit-Change-Number: 15104 Gerrit-PatchSet: 5 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 02 Mar 2020 09:45:55 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch
Hello Zoltan Borok-Nagy, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15104 to look at the new patch set (#5). Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch .. IMPALA-9228: ORC scanner reads rows into scratch batch Because of performance considerations this change enhances ORC scanner to populate a scratch batch on a column-by-column manner using data from the column readers. Once this is done the parquet code was reused to apply runtime filter and conjuncts and to populate the outgoing row batch. This approach reduces the number of virtual function calls and takes advantage of the columnar orientation of the data to enhance scan performance. Additionally, introducing the scratch batch concept also opens the door for codegen runtime filtering and applying conjuncts. Tesing: - Re-run the full test suite to verify that no regression is introduced. - Checked the performance impact by running TPCH workload on a scale 25 database using single_node_perf_run.py. The total query runtime is decreased by 0-20% depending on how scan heavy the particular query was. The more scan heavy the query is the more performance gain I observe. Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/impala-ir.cc M be/src/exec/CMakeLists.txt R be/src/exec/hdfs-columnar-scanner-ir.cc A be/src/exec/hdfs-columnar-scanner.cc A be/src/exec/hdfs-columnar-scanner.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h M be/src/exec/hdfs-scanner.h M be/src/exec/orc-column-readers.cc M be/src/exec/orc-column-readers.h M be/src/exec/parquet/CMakeLists.txt M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h R be/src/exec/scratch-tuple-batch.h 15 files changed, 423 insertions(+), 140 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/15104/5 -- To view, visit http://gerrit.cloudera.org:8080/15104 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca Gerrit-Change-Number: 15104 Gerrit-PatchSet: 5 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch
Hello Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15104 to look at the new patch set (#4). Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch .. IMPALA-9228: ORC scanner reads rows into scratch batch Because of performance considerations this change enhances ORC scanner to populate a scratch batch on a column-by-column manner using data from the column readers. Once this is done the parquet code was reused to apply runtime filter and conjuncts and to populate the outgoing row batch. This approach reduces the number of virtual function calls and takes advantage of the columnar orientation of the data to enhance scan performance. Additionally, introducing the scratch batch concept also opens the door for codegen runtime filtering and applying conjuncts. Tesing: - Re-run the full test suite to verify that no regression is introduced. - Checked the performance impact by running TPCH workload on a scale 25 database using single_node_perf_run.py. The total query runtime is decreased by 0-20% depending on how scan heavy the particular query was. The more scan heavy the query is the more performance gain I observe. Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/impala-ir.cc M be/src/exec/CMakeLists.txt R be/src/exec/hdfs-columnar-scanner-ir.cc A be/src/exec/hdfs-columnar-scanner.cc A be/src/exec/hdfs-columnar-scanner.h M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h M be/src/exec/hdfs-scanner.h M be/src/exec/orc-column-readers.cc M be/src/exec/orc-column-readers.h M be/src/exec/parquet/CMakeLists.txt M be/src/exec/parquet/hdfs-parquet-scanner.cc M be/src/exec/parquet/hdfs-parquet-scanner.h R be/src/exec/scratch-tuple-batch.h 15 files changed, 464 insertions(+), 140 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/15104/4 -- To view, visit http://gerrit.cloudera.org:8080/15104 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca Gerrit-Change-Number: 15104 Gerrit-PatchSet: 4 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] PoC: IMPALA-9228: ORC scanner reads rows into scratch batch
Gabor Kaszab has restored this change. ( http://gerrit.cloudera.org:8080/15104 ) Change subject: PoC: IMPALA-9228: ORC scanner reads rows into scratch batch .. Restored -- To view, visit http://gerrit.cloudera.org:8080/15104 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: restore Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca Gerrit-Change-Number: 15104 Gerrit-PatchSet: 2 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9226: Bump ORC version to 1.6.2-p7
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15287 ) Change subject: IMPALA-9226: Bump ORC version to 1.6.2-p7 .. Patch Set 1: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/15287 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I444bfac435e5b05eee1ff7c8cf6a32ff5b65 Gerrit-Change-Number: 15287 Gerrit-PatchSet: 1 Gerrit-Owner: Norbert Luksa Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Tue, 25 Feb 2020 08:57:00 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9036: Fix CTRL+C a multiline query in impala-shell
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15233 ) Change subject: IMPALA-9036: Fix CTRL+C a multiline query in impala-shell .. Patch Set 5: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/15233 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Id8d8bdaee929e2655eb66e886ae92a02d3fbd83f Gerrit-Change-Number: 15233 Gerrit-PatchSet: 5 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Mon, 24 Feb 2020 10:41:48 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9036: Fix CTRL+C a multiline query in impala-shell
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15233 ) Change subject: IMPALA-9036: Fix CTRL+C a multiline query in impala-shell .. Patch Set 4: (3 comments) I just left some nits but in general I'm fine with the change. http://gerrit.cloudera.org:8080/#/c/15233/4/tests/shell/test_shell_interactive.py File tests/shell/test_shell_interactive.py: http://gerrit.cloudera.org:8080/#/c/15233/4/tests/shell/test_shell_interactive.py@253 PS4, Line 253: doesn't ends nit: doesn't end http://gerrit.cloudera.org:8080/#/c/15233/4/tests/shell/test_shell_interactive.py@258 PS4, Line 258: "select \"test withouth newline\";\n" nit: I think if you put this string between single quotes then you don't need to escape the double quotes inside. This would make it a bit more readable. http://gerrit.cloudera.org:8080/#/c/15233/4/tests/shell/test_shell_interactive.py@271 PS4, Line 271: "select \"test with newline\";\n" nit: same as above -- To view, visit http://gerrit.cloudera.org:8080/15233 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Id8d8bdaee929e2655eb66e886ae92a02d3fbd83f Gerrit-Change-Number: 15233 Gerrit-PatchSet: 4 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Comment-Date: Wed, 19 Feb 2020 21:10:41 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9036: Fix CTRL+C a multiline query in impala-shell
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15233 ) Change subject: IMPALA-9036: Fix CTRL+C a multiline query in impala-shell .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/15233/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15233/2//COMMIT_MSG@9 PS2, Line 9: `_signal_handler()` nit: instead of backticks please put the function name between single quotes. http://gerrit.cloudera.org:8080/#/c/15233/2/shell/impala_shell.py File shell/impala_shell.py: http://gerrit.cloudera.org:8080/#/c/15233/2/shell/impala_shell.py@562 PS2, Line 562: self.prompt = self.cached_prompt Question: Have you tried to simple set both self.prompt and self.cached_prompt to empty string? I don't understand it 100% what cached_prompt holds in this case that we want to reuse. -- To view, visit http://gerrit.cloudera.org:8080/15233 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Id8d8bdaee929e2655eb66e886ae92a02d3fbd83f Gerrit-Change-Number: 15233 Gerrit-PatchSet: 2 Gerrit-Owner: Adam Tamas Gerrit-Reviewer: Adam Tamas Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Comment-Date: Wed, 19 Feb 2020 08:10:15 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9280: Fix parsing of timestamp with dash before TZH
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15130 ) Change subject: IMPALA-9280: Fix parsing of timestamp with dash before TZH .. Patch Set 5: Code-Review+2 Thanks for the adjustments! -- To view, visit http://gerrit.cloudera.org:8080/15130 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I24c636e75fd380f6ebd091bcb38bb60a274f0e00 Gerrit-Change-Number: 15130 Gerrit-PatchSet: 5 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 05 Feb 2020 10:22:15 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9280: Fix parsing of timestamp with dash before TZH
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15130 ) Change subject: IMPALA-9280: Fix parsing of timestamp with dash before TZH .. Patch Set 4: (3 comments) http://gerrit.cloudera.org:8080/#/c/15130/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15130/4//COMMIT_MSG@11 PS4, Line 11: if the negative sign is the only possible : separator character in the separator substring. I have the same comment here as in the parser: This doesn't cover the whole picture as we also don't include it if the TZH itself has a '+' char as a sign. http://gerrit.cloudera.org:8080/#/c/15130/4/be/src/runtime/datetime-iso-sql-format-parser.cc File be/src/runtime/datetime-iso-sql-format-parser.cc: http://gerrit.cloudera.org:8080/#/c/15130/4/be/src/runtime/datetime-iso-sql-format-parser.cc@296 PS4, Line 296: (*current_pos) nit: no need for the brackets. http://gerrit.cloudera.org:8080/#/c/15130/4/be/src/runtime/datetime-iso-sql-format-parser.cc@323 PS4, Line 323: // The last '-' of a separator sequence might be taken as a sign for timezone hour. : // UNLESS the last '-' is the only possible separator character in the separator : // substring (in which case it is not counted as the timezone hour's negative sign). I'd rephrase this comment a bit as it misses to mention that if the TZH starts with a '+' then we again don't take the last '-' of the separator sequence as the sign. Let's say something like this: "If the TZH token itself doesn't start with a sign and the separator sequence contains more than one separators then the last '-' of the separator sequence is taken as the sign of TZH". What do you think? -- To view, visit http://gerrit.cloudera.org:8080/15130 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I24c636e75fd380f6ebd091bcb38bb60a274f0e00 Gerrit-Change-Number: 15130 Gerrit-PatchSet: 4 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 31 Jan 2020 07:19:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9280: Fix parsing of timestamp with dash before TZH
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15130 ) Change subject: IMPALA-9280: Fix parsing of timestamp with dash before TZH .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/15130/2/be/src/runtime/datetime-iso-sql-format-parser.cc File be/src/runtime/datetime-iso-sql-format-parser.cc: http://gerrit.cloudera.org:8080/#/c/15130/2/be/src/runtime/datetime-iso-sql-format-parser.cc@332 PS2, Line 332: && cur_pos_move_count > cur_tok_move_count && *(*current_pos) != '+' > What I understand from your explanation, the number of separator chars in t Basically separator matching works in a way that if there is at least one separator in a position somewhere in the format then there should be at least one separator in the input string in that section. (By position I don't mean the exact index of that character, rather the relative location to the surrounding tokens). Those examples are correct. For more explanation please check the design doc for this feature (feel free to ask if you still have questions): https://docs.google.com/document/d/1V7k6-lrPGW7_uhqM-FhKl3QsxwCRy69v2KIxPsGjc1k/ -- To view, visit http://gerrit.cloudera.org:8080/15130 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I24c636e75fd380f6ebd091bcb38bb60a274f0e00 Gerrit-Change-Number: 15130 Gerrit-PatchSet: 2 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 30 Jan 2020 18:44:10 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9280: Fix parsing of timestamp with dash before TZH
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15130 ) Change subject: IMPALA-9280: Fix parsing of timestamp with dash before TZH .. Patch Set 2: (3 comments) Thanks for taking care of this! http://gerrit.cloudera.org:8080/#/c/15130/2/be/src/runtime/datetime-iso-sql-format-parser.cc File be/src/runtime/datetime-iso-sql-format-parser.cc: http://gerrit.cloudera.org:8080/#/c/15130/2/be/src/runtime/datetime-iso-sql-format-parser.cc@332 PS2, Line 332: && cur_pos_move_count > cur_tok_move_count && *(*current_pos) != '+' Counting the processed separator tokens doesn't seem the best approach for me here. Also I don't see why it is necessary to have more separators in the input than in the format to use the last minus sign as a part of TZH instead of taking it as a separator. Previously this if condition said "If a TZH token follows the separator sequence and the last separator in the sequence was a minus char then I'll use that minus char as the sign of the TZH". I'd extend this: "If a TZH token follows the separator sequence and the last separator in the sequence was a minus char and the TZH token doesn't start with a plus char then ..." Please make sure that in case the minus char is actually taken as sign of TZH instead of a separator then there is still at least one separator remaining in the input sequence to match the separator sequence in the token list. http://gerrit.cloudera.org:8080/#/c/15130/2/be/src/runtime/timestamp-test.cc File be/src/runtime/timestamp-test.cc: http://gerrit.cloudera.org:8080/#/c/15130/2/be/src/runtime/timestamp-test.cc@1071 PS2, Line 1071: TEST(TimestampTest, IsoSqlWithTZ) { Almost all of these formatting tests are in tests/query_test/test_cast_with_format.py I think we should keep them together. Please consider moving this coverage. http://gerrit.cloudera.org:8080/#/c/15130/2/be/src/runtime/timestamp-test.cc@1078 PS2, Line 1078: HH:MI-TZH As far as I know we don't really support dateless timestamps even though Impala accepts them. Additionally I think we should remove the support of them from Impala at one point (AFAIK JDBC/ODBC connections throw parse error for dateless timestamps currently). So might be safer not to use them even in tests as once we drop the support we have to rewrite less code. -- To view, visit http://gerrit.cloudera.org:8080/15130 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I24c636e75fd380f6ebd091bcb38bb60a274f0e00 Gerrit-Change-Number: 15130 Gerrit-PatchSet: 2 Gerrit-Owner: Riza Suminto Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Riza Suminto Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 30 Jan 2020 09:00:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 ) Change subject: IMPALA-9226: Improve string allocations of the ORC scanner .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/15051/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15051/3//COMMIT_MSG@24 PS3, Line 24: Some results: As discussed offline: These measurement were run with a debug build. Let' see the same with release builds as that is the one we would like to optimize. I also experienced a big difference in performance between debug and perf. http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc@625 PS3, Line 625: nit: this extra new line is not needed -- To view, visit http://gerrit.cloudera.org:8080/15051 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f Gerrit-Change-Number: 15051 Gerrit-PatchSet: 3 Gerrit-Owner: Norbert Luksa Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 24 Jan 2020 14:50:25 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch
Gabor Kaszab has uploaded this change for review. ( http://gerrit.cloudera.org:8080/15104 Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch .. IMPALA-9228: ORC scanner reads rows into scratch batch Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca --- M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h M be/src/exec/hdfs-scanner.h M be/src/exec/orc-column-readers.cc M be/src/exec/orc-column-readers.h 5 files changed, 153 insertions(+), 9 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/15104/1 -- To view, visit http://gerrit.cloudera.org:8080/15104 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca Gerrit-Change-Number: 15104 Gerrit-PatchSet: 1 Gerrit-Owner: Gabor Kaszab
[Impala-ASF-CR] PoC: IMPALA-9228: ORC scanner reads rows into scratch batch
Hello Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/15104 to look at the new patch set (#2). Change subject: PoC: IMPALA-9228: ORC scanner reads rows into scratch batch .. PoC: IMPALA-9228: ORC scanner reads rows into scratch batch Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca --- M be/src/exec/hdfs-orc-scanner.cc M be/src/exec/hdfs-orc-scanner.h M be/src/exec/hdfs-scanner.h M be/src/exec/orc-column-readers.cc M be/src/exec/orc-column-readers.h 5 files changed, 153 insertions(+), 9 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/15104/2 -- To view, visit http://gerrit.cloudera.org:8080/15104 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca Gerrit-Change-Number: 15104 Gerrit-PatchSet: 2 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins
[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 ) Change subject: IMPALA-9226: Improve string allocations of the ORC scanner .. Patch Set 2: (5 comments) http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.h File be/src/exec/orc-column-readers.h: http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.h@207 PS2, Line 207: /*&& !orc_batch->isEncoded*/ nit: pls remove commented code http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc@150 PS2, Line 150: if (blob_ == nullptr) { nit: It is probably just personal preference but I find it more readable to return early in edge cases. This saves some indentation later on. In this case if (blob_ != nullptr) return Status::OK() http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc@151 PS2, Line 151: TODO Is this TODO going to be taken care within this patch or do you expect this to be handled somewhere in the future unrelated to this change? If the second one then opening a Jira and adding the Jira ID to the comment would make sense. (I see that the TODO was added after a review comment requested it, though). http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc@167 PS2, Line 167: nit : too many spaces here and below http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc@176 PS2, Line 176: InitBlob I'm wondering if calling InitBlob() could move into UpdateInputBatch() from here and from L186. This way you don't have to do the "if (blob_ == nullptr)" check in every ReadValue() through the InitBlob() functions just simply you can DCHECK that it's not null. -- To view, visit http://gerrit.cloudera.org:8080/15051 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f Gerrit-Change-Number: 15051 Gerrit-PatchSet: 2 Gerrit-Owner: Norbert Luksa Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 20 Jan 2020 14:19:42 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner
Hello Quanlong Huang, Norbert Luksa, Zoltan Borok-Nagy, Attila Jeges, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14982 to look at the new patch set (#9). Change subject: IMPALA-8801: Date type support for ORC scanner .. IMPALA-8801: Date type support for ORC scanner Implements read path for the date type in ORC scanner. The internal representation of a date is an int32 meaning the number of days since Unix epoch using proleptic Gregorian calendar. Similarly to the Parquet implementation (IMPALA-7370) this representation introduces an interoperability issue between Impala and older versions of Hive (before 3.1). For more details see the commit message of the mentioned Parquet implementation. Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956 --- M be/src/exec/hdfs-orc-scanner.h M be/src/exec/orc-column-readers.cc M be/src/exec/orc-column-readers.h M be/src/exec/orc-metadata-utils.cc M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M testdata/data/README A testdata/data/hive2_pre_gregorian.orc A testdata/data/out_of_range_date.orc M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test M testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test A testdata/workloads/functional-query/queries/QueryTest/hive2-pre-gregorian-date-orc.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-date-orc.test M tests/query_test/test_scanners.py 16 files changed, 170 insertions(+), 29 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/14982/9 -- To view, visit http://gerrit.cloudera.org:8080/14982 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956 Gerrit-Change-Number: 14982 Gerrit-PatchSet: 9 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/14982 ) Change subject: IMPALA-8801: Date type support for ORC scanner .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/14982/8/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/14982/8/tests/query_test/test_scanners.py@1302 PS8, Line 1302: an > nit: the comments could be updated a bit, since not only one illtypes table Done -- To view, visit http://gerrit.cloudera.org:8080/14982 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956 Gerrit-Change-Number: 14982 Gerrit-PatchSet: 9 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 17 Jan 2020 13:06:54 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/14982 ) Change subject: IMPALA-8801: Date type support for ORC scanner .. Patch Set 8: PS8 is a rebase with master to resolve conflict. -- To view, visit http://gerrit.cloudera.org:8080/14982 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956 Gerrit-Change-Number: 14982 Gerrit-PatchSet: 8 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 15 Jan 2020 09:53:16 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner
Hello Quanlong Huang, Norbert Luksa, Zoltan Borok-Nagy, Attila Jeges, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14982 to look at the new patch set (#8). Change subject: IMPALA-8801: Date type support for ORC scanner .. IMPALA-8801: Date type support for ORC scanner Implements read path for the date type in ORC scanner. The internal representation of a date is an int32 meaning the number of days since Unix epoch using proleptic Gregorian calendar. Similarly to the Parquet implementation (IMPALA-7370) this representation introduces an interoperability issue between Impala and older versions of Hive (before 3.1). For more details see the commit message of the mentioned Parquet implementation. Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956 --- M be/src/exec/hdfs-orc-scanner.h M be/src/exec/orc-column-readers.cc M be/src/exec/orc-column-readers.h M be/src/exec/orc-metadata-utils.cc M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M testdata/data/README A testdata/data/hive2_pre_gregorian.orc A testdata/data/out_of_range_date.orc M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test M testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test A testdata/workloads/functional-query/queries/QueryTest/hive2-pre-gregorian-date-orc.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-date-orc.test M tests/query_test/test_scanners.py 16 files changed, 168 insertions(+), 27 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/14982/8 -- To view, visit http://gerrit.cloudera.org:8080/14982 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956 Gerrit-Change-Number: 14982 Gerrit-PatchSet: 8 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner
Hello Quanlong Huang, Norbert Luksa, Zoltan Borok-Nagy, Attila Jeges, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14982 to look at the new patch set (#7). Change subject: IMPALA-8801: Date type support for ORC scanner .. IMPALA-8801: Date type support for ORC scanner Implements read path for the date type in ORC scanner. The internal representation of a date is an int32 meaning the number of days since Unix epoch using proleptic Gregorian calendar. Similarly to the Parquet implementation (IMPALA-7370) this representation introduces an interoperability issue between Impala and older versions of Hive (before 3.1). For more details see the commit message of the mentioned Parquet implementation. Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956 --- M be/src/exec/hdfs-orc-scanner.h M be/src/exec/orc-column-readers.cc M be/src/exec/orc-column-readers.h M be/src/exec/orc-metadata-utils.cc M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M testdata/data/README A testdata/data/hive2_pre_gregorian.orc A testdata/data/out_of_range_date.orc M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test M testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test A testdata/workloads/functional-query/queries/QueryTest/hive2-pre-gregorian-date-orc.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-date-orc.test M tests/query_test/test_scanners.py 16 files changed, 168 insertions(+), 27 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/14982/7 -- To view, visit http://gerrit.cloudera.org:8080/14982 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956 Gerrit-Change-Number: 14982 Gerrit-PatchSet: 7 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/14982 ) Change subject: IMPALA-8801: Date type support for ORC scanner .. Patch Set 7: (3 comments) http://gerrit.cloudera.org:8080/#/c/14982/2/be/src/exec/orc-metadata-utils.cc File be/src/exec/orc-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/14982/2/be/src/exec/orc-metadata-utils.cc@181 PS2, Line 181: case orc::TypeKind::DATE: : if (type.type == TYPE_DATE) return Status::OK(); : break; : d > > If I extend the DATE case to also allow "type.type == TYPE_TIMESTAMP" the Sorry, missed to address the end of the comment. Done http://gerrit.cloudera.org:8080/#/c/14982/6/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/14982/6/tests/query_test/test_scanners.py@306 PS6, Line 306: class TestParquet(ImpalaTestSuite): > nit: merge this into TestOrc below? sure, Done http://gerrit.cloudera.org:8080/#/c/14982/6/tests/query_test/test_scanners.py@1335 PS6, Line 1335: % unique_database) > It'd be better to add test coverage for date and timestamp type compactibil Done -- To view, visit http://gerrit.cloudera.org:8080/14982 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956 Gerrit-Change-Number: 14982 Gerrit-PatchSet: 7 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 15 Jan 2020 09:34:13 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/14982 ) Change subject: IMPALA-8801: Date type support for ORC scanner .. Patch Set 6: (1 comment) http://gerrit.cloudera.org:8080/#/c/14982/5/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/14982/5/tests/query_test/test_scanners.py@305 PS5, Line 305: > flake8: E302 expected 2 blank lines, found 1 Done -- To view, visit http://gerrit.cloudera.org:8080/14982 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956 Gerrit-Change-Number: 14982 Gerrit-PatchSet: 6 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 13 Jan 2020 14:03:07 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner
Hello Quanlong Huang, Norbert Luksa, Zoltan Borok-Nagy, Attila Jeges, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14982 to look at the new patch set (#6). Change subject: IMPALA-8801: Date type support for ORC scanner .. IMPALA-8801: Date type support for ORC scanner Implements read path for the date type in ORC scanner. The internal representation of a date is an int32 meaning the number of days since Unix epoch using proleptic Gregorian calendar. Similarly to the Parquet implementation (IMPALA-7370) this representation introduces an interoperability issue between Impala and older versions of Hive (before 3.1). For more details see the commit message of the mentioned Parquet implementation. Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956 --- M be/src/exec/hdfs-orc-scanner.h M be/src/exec/orc-column-readers.cc M be/src/exec/orc-column-readers.h M be/src/exec/orc-metadata-utils.cc M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M testdata/data/README A testdata/data/hive2_pre_gregorian.orc A testdata/data/out_of_range_date.orc M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test A testdata/workloads/functional-query/queries/QueryTest/hive2-pre-gregorian-date-orc.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-date-orc.test M tests/query_test/test_scanners.py 15 files changed, 154 insertions(+), 26 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/14982/6 -- To view, visit http://gerrit.cloudera.org:8080/14982 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956 Gerrit-Change-Number: 14982 Gerrit-PatchSet: 6 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/14982 ) Change subject: IMPALA-8801: Date type support for ORC scanner .. Patch Set 5: (2 comments) http://gerrit.cloudera.org:8080/#/c/14982/2/be/src/exec/orc-metadata-utils.cc File be/src/exec/orc-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/14982/2/be/src/exec/orc-metadata-utils.cc@181 PS2, Line 181: case orc::TypeKind::DATE: : if (type.type == TYPE_DATE) return Status::OK(); : break; : d > I'm ok to do this in another JIRA. Thanks for the explanation Quanlong! I oepened a separate Jira for this: https://issues.apache.org/jira/browse/IMPALA-9290 One thing I don't get here. If I leave this file intact then I get the "Type mismatch" below. If I extend the DATE case to also allow "type.type == TYPE_TIMESTAMP" then the DCHECK fails in UpdateInputBatch(). http://gerrit.cloudera.org:8080/#/c/14982/4/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/14982/4/tests/query_test/test_scanners.py@352 PS4, Line 352: def test_parquet(self, vector): : self.run_test_case('QueryTest/parquet', vector) : : def test_corrupt_files(self, vector): : new_vector = deepcopy(vector) : del new_vector.get_value('exec_option')['num_nodes'] # .test file sets num_nodes : new_vector.get_value('exec_option')['abort_on_error'] = 0 : self.run_test_case('QueryTest/parquet-continue-on-error', new_vector) : new_vector.get_value('exec_option')['abort_on_error'] = 1 : self.run_test_case('QueryTest/parquet-abort-on-error', new_vector) : : def test_timestamp_out_of_range(self, vector, unique_database): : """IMPALA-4363: Test scanning parquet files with an out of range timestamp. :Also tests IMPALA-7595: Test Parquet timestamp columns where the time part :is out of the valid range [0..24H). : """ : # out of range date part : create_table_from_parquet(self.client, unique_database, "out_of_range_timestamp") : : # out of range time part : create_table_from_parquet(self.client, unique_database, "out_of_range_time_of_day") : > These tests are related to ORC, but we are in class TestParquet. Indeed :) Done -- To view, visit http://gerrit.cloudera.org:8080/14982 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956 Gerrit-Change-Number: 14982 Gerrit-PatchSet: 5 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 13 Jan 2020 13:54:54 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner
Hello Quanlong Huang, Norbert Luksa, Zoltan Borok-Nagy, Attila Jeges, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/14982 to look at the new patch set (#5). Change subject: IMPALA-8801: Date type support for ORC scanner .. IMPALA-8801: Date type support for ORC scanner Implements read path for the date type in ORC scanner. The internal representation of a date is an int32 meaning the number of days since Unix epoch using proleptic Gregorian calendar. Similarly to the Parquet implementation (IMPALA-7370) this representation introduces an interoperability issue between Impala and older versions of Hive (before 3.1). For more details see the commit message of the mentioned Parquet implementation. Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956 --- M be/src/exec/hdfs-orc-scanner.h M be/src/exec/orc-column-readers.cc M be/src/exec/orc-column-readers.h M be/src/exec/orc-metadata-utils.cc M common/thrift/generate_error_codes.py M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M testdata/data/README A testdata/data/hive2_pre_gregorian.orc A testdata/data/out_of_range_date.orc M testdata/datasets/functional/schema_constraints.csv M testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test A testdata/workloads/functional-query/queries/QueryTest/hive2-pre-gregorian-date-orc.test A testdata/workloads/functional-query/queries/QueryTest/out-of-range-date-orc.test M tests/query_test/test_scanners.py 15 files changed, 152 insertions(+), 26 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/14982/5 -- To view, visit http://gerrit.cloudera.org:8080/14982 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956 Gerrit-Change-Number: 14982 Gerrit-PatchSet: 5 Gerrit-Owner: Gabor Kaszab Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Norbert Luksa Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zoltan Borok-Nagy