[Impala-ASF-CR] IMPALA-10463: Implement ds_theta_sketch() and ds_theat_estimate() functions
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/17008 )

Change subject: IMPALA-10463: Implement ds_theta_sketch() and ds_theat_estimate() functions
......................................................................


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/17008/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17008/2//COMMIT_MSG@7
PS2, Line 7: ds_theat_estimate
nit: typo

http://gerrit.cloudera.org:8080/#/c/17008/2//COMMIT_MSG@13
PS2, Line 13: ds_theat_estimate
nit: same typo

http://gerrit.cloudera.org:8080/#/c/17008/2//COMMIT_MSG@28
PS2, Line 28: see IMPALA-10464.
I'd also include some highlights from that perf measurement doc in the commit msg. Probably an additional section would be great for this.

http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/aggregate-functions-ir.cc@1646
PS2, Line 1646: SerializeCompactDsThetaSketch
In contrast with HLL, as far as I can see, Theta doesn't compact the sketch, it just serializes it, so this function name doesn't reflect what actually happens inside the function. Please rename it to SerializeDsThetaSketch().

http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/aggregate-functions-ir.cc@1899
PS2, Line 1899: datasketches::compact_theta_sketch* sketch_ptr =
I'm a bit lost here. Could you help me understand why it is necessary to convert the union_sketch to a compact_theta_sketch? Can't you return the union_sketch?

http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/datasketches-functions-ir.cc
File be/src/exprs/datasketches-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/datasketches-functions-ir.cc@110
PS2, Line 110: return 0;
HLL returns a null here. Have you checked the behaviour in Hive so that the two systems stay in sync?

http://gerrit.cloudera.org:8080/#/c/17008/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test
File testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test:

http://gerrit.cloudera.org:8080/#/c/17008/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test@138
PS2, Line 138: # Check that ds_theta_estimate returns error for strings that are not serialized sketches.
Please add a test where ds_theta_estimate() is used on an HLL sketch. I guess we expect an error there.

--
To view, visit http://gerrit.cloudera.org:8080/17008
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Gerrit-Change-Number: 17008
Gerrit-PatchSet: 2
Gerrit-Owner: Fucun Chu
Gerrit-Reviewer: Gabor Kaszab
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Comment-Date: Tue, 09 Feb 2021 15:13:30 +
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-10463: Implement ds_theta_sketch() and ds_theat_estimate() functions
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17008 )

Change subject: IMPALA-10463: Implement ds_theta_sketch() and ds_theat_estimate() functions
......................................................................


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8102/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.

--
To view, visit http://gerrit.cloudera.org:8080/17008
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Gerrit-Change-Number: 17008
Gerrit-PatchSet: 2
Gerrit-Owner: Fucun Chu
Gerrit-Reviewer: Gabor Kaszab
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Comment-Date: Tue, 09 Feb 2021 10:40:42 +
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10463: Implement ds_theta_sketch() and ds_theat_estimate() functions
Fucun Chu has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/17008 )

Change subject: IMPALA-10463: Implement ds_theta_sketch() and ds_theat_estimate() functions
......................................................................

IMPALA-10463: Implement ds_theta_sketch() and ds_theat_estimate() functions

These functions can be used to get cardinality estimates of data using
the Theta algorithm from Apache DataSketches. ds_theta_sketch() receives
a dataset, e.g. a column from a table, and returns a serialized Theta
sketch in string format. This can be written to a table or be fed
directly to ds_theat_estimate(), which returns the cardinality estimate
for that sketch.

Similar to the HLL sketch, the primary use-case for the Theta sketch is
counting distinct values as a stream, and then merging multiple sketches
together for a total distinct count.

For more details about Apache DataSketches' Theta see:
https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html

Testing:
- Added some tests running estimates for small datasets where the
  amount of data is small enough to get the correct results.
- Ran manual tests on tpch25_parquet.lineitem to compare performance
  with ds_hll_*. HLL and Theta give closer estimate except for string,
  see IMPALA-10464.

Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions-test.cc
M be/src/exprs/aggregate-functions.h
M be/src/exprs/datasketches-functions-ir.cc
M be/src/exprs/datasketches-functions.h
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M testdata/data/README
A testdata/data/theta_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test
M tests/query_test/test_datasketches.py
11 files changed, 399 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/08/17008/2
--
To view, visit http://gerrit.cloudera.org:8080/17008
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Gerrit-Change-Number: 17008
Gerrit-PatchSet: 2
Gerrit-Owner: Fucun Chu
Gerrit-Reviewer: Gabor Kaszab
Gerrit-Reviewer: Impala Public Jenkins
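
The commit message above describes building a sketch from a stream of values and then merging several sketches for a total distinct count. As a rough illustration of that idea only, here is a toy KMV-style ("k minimum values") sketch in Python. It is not the Apache DataSketches algorithm, not its serialized format, and not the code in this change; the class name ToyThetaSketch, the helper _hash64, the use of SHA-256 and the default k=1024 are all assumptions made for the example.

```python
import hashlib

# Hash space is [0, MAX_HASH); theta is kept as an integer threshold in it.
MAX_HASH = 2 ** 64


def _hash64(value):
    """Hypothetical helper: map a value to a 64-bit integer via SHA-256."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyThetaSketch:
    """Toy sketch: retain at most k hashes that fall below a threshold theta."""

    def __init__(self, k=1024):
        self.k = k
        self.theta = MAX_HASH  # no sampling yet: every hash is retained
        self.hashes = set()

    def update(self, value):
        h = _hash64(value)
        if h < self.theta:
            self.hashes.add(h)
            if len(self.hashes) > self.k:
                self._trim()

    def _trim(self):
        # Lower theta to the (k+1)-th smallest hash and keep the k below it.
        ordered = sorted(self.hashes)
        self.theta = ordered[self.k]
        self.hashes = set(ordered[: self.k])

    def estimate(self):
        # Retained count divided by the sampled fraction of the hash space.
        # While theta is still MAX_HASH the count is exact.
        return len(self.hashes) * MAX_HASH / self.theta

    def union(self, other):
        # Merge: take the smaller theta, keep hashes from both sides below it.
        merged = ToyThetaSketch(min(self.k, other.k))
        merged.theta = min(self.theta, other.theta)
        merged.hashes = {
            h for h in self.hashes | other.hashes if h < merged.theta
        }
        if len(merged.hashes) > merged.k:
            merged._trim()
        return merged


if __name__ == "__main__":
    a = ToyThetaSketch()
    b = ToyThetaSketch()
    for v in range(0, 60):
        a.update(v)
    for v in range(40, 100):
        b.update(v)
    print(a.estimate(), b.estimate(), a.union(b).estimate())
```

With fewer than k distinct inputs the toy sketch stays in exact mode, so the two 60-value sketches above each report 60.0 and their union reports 100.0 even though the inputs overlap. Once more than k distinct values arrive, the estimate becomes approximate, with a relative error on the order of 1/sqrt(k) for this toy estimator.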
[Impala-ASF-CR] IMPALA-10463: Implement ds_theta_sketch() and ds_theat_estimate() functions
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17008 )

Change subject: IMPALA-10463: Implement ds_theta_sketch() and ds_theat_estimate() functions
......................................................................


Patch Set 1:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8052/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.

--
To view, visit http://gerrit.cloudera.org:8080/17008
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Gerrit-Change-Number: 17008
Gerrit-PatchSet: 1
Gerrit-Owner: Fucun Chu
Gerrit-Reviewer: Gabor Kaszab
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Comment-Date: Sat, 30 Jan 2021 03:02:54 +
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10463: Implement ds_theta_sketch() and ds_theat_estimate() functions
Fucun Chu has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17008 )

Change subject: IMPALA-10463: Implement ds_theta_sketch() and ds_theat_estimate() functions
......................................................................

IMPALA-10463: Implement ds_theta_sketch() and ds_theat_estimate() functions

These functions can be used to get cardinality estimates of data using
the Theta algorithm from Apache DataSketches. ds_theta_sketch() receives
a dataset, e.g. a column from a table, and returns a serialized Theta
sketch in string format. This can be written to a table or be fed
directly to ds_theat_estimate(), which returns the cardinality estimate
for that sketch.

Similar to the HLL sketch, the primary use-case for the Theta sketch is
counting distinct values as a stream, and then merging multiple sketches
together for a total distinct count.

For more details about Apache DataSketches' Theta see:
https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html

Testing:
- Added some tests running estimates for small datasets where the
  amount of data is small enough to get the correct results.
- Ran manual tests on tpch25_parquet.lineitem to compare performance
  with ds_hll_*. HLL and Theta give closer estimate except for string,
  see IMPALA-10464.

Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions-test.cc
M be/src/exprs/aggregate-functions.h
M be/src/exprs/datasketches-functions-ir.cc
M be/src/exprs/datasketches-functions.h
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M testdata/data/README
A testdata/data/theta_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test
M tests/query_test/test_datasketches.py
11 files changed, 401 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/08/17008/1
--
To view, visit http://gerrit.cloudera.org:8080/17008
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Gerrit-Change-Number: 17008
Gerrit-PatchSet: 1
Gerrit-Owner: Fucun Chu
Gerrit-Reviewer: Gabor Kaszab
Gerrit-Reviewer: Impala Public Jenkins