[Impala-ASF-CR] IMPALA-10463: Implement ds theta sketch() and ds theat estimate() functions

2021-02-09 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17008 )

Change subject: IMPALA-10463: Implement ds_theta_sketch() and 
ds_theat_estimate() functions
..


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/17008/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17008/2//COMMIT_MSG@7
PS2, Line 7: ds_theat_estimate
nit: typo


http://gerrit.cloudera.org:8080/#/c/17008/2//COMMIT_MSG@13
PS2, Line 13: ds_theat_estimate
nit: same typo


http://gerrit.cloudera.org:8080/#/c/17008/2//COMMIT_MSG@28
PS2, Line 28:see IMPALA-10464.
I'd also include some highlights from that perf measurement doc into the commit 
msg. Probably an additional section would be great for this.


http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/aggregate-functions-ir.cc@1646
PS2, Line 1646: SerializeCompactDsThetaSketch
In contrast with HLL as I see Theta doesn't compact the sketch just serializes 
it so this function name is not reflecting well what actually happens inside 
the function. Please rename it to SerializeDsThetaSketch()


http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/aggregate-functions-ir.cc@1899
PS2, Line 1899:   datasketches::compact_theta_sketch* sketch_ptr =
I;m a bit lost here. Could you help me understand why is it needed to convert 
the union_sketch to a compact_theta_sketch? Can't you return the union_sketch?


http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/datasketches-functions-ir.cc
File be/src/exprs/datasketches-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17008/2/be/src/exprs/datasketches-functions-ir.cc@110
PS2, Line 110: return 0;
HLL returns a null here. Have you checked the behaviour in Hive to be in sync 
with the 2 systems?


http://gerrit.cloudera.org:8080/#/c/17008/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test
File 
testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test:

http://gerrit.cloudera.org:8080/#/c/17008/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test@138
PS2, Line 138: # Check that ds_theta_estimate returns error for strings that 
are not serialized sketches.
Please add a test when ds_theta_estimate() is used on an HLL sketch. I guess we 
expect an error there.



--
To view, visit http://gerrit.cloudera.org:8080/17008
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Gerrit-Change-Number: 17008
Gerrit-PatchSet: 2
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 09 Feb 2021 15:13:30 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10463: Implement ds theta sketch() and ds theat estimate() functions

2021-02-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17008 )

Change subject: IMPALA-10463: Implement ds_theta_sketch() and 
ds_theat_estimate() functions
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8102/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17008
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Gerrit-Change-Number: 17008
Gerrit-PatchSet: 2
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 09 Feb 2021 10:40:42 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10463: Implement ds theta sketch() and ds theat estimate() functions

2021-02-09 Thread Fucun Chu (Code Review)
Fucun Chu has uploaded a new patch set (#2). ( 
http://gerrit.cloudera.org:8080/17008 )

Change subject: IMPALA-10463: Implement ds_theta_sketch() and 
ds_theat_estimate() functions
..

IMPALA-10463: Implement ds_theta_sketch() and ds_theat_estimate() functions

These functions can be used to get cardinality estimates of data
using Theta algorithm from Apache DataSketches. ds_theta_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized Theta sketch in string format. This can be written to a
table or be fed directly to ds_theat_estimate() that returns the
cardinality estimate for that sketch.

Similar to the HLL sketch, the primary use-case for the Theta sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.

For more details about Apache DataSketches' Theta see:
https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on tpch25_parquet.lineitem to compare perfomance
   with ds_hll_*. HLL and Theta gives closer estimate except for string,
   see IMPALA-10464.

Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions-test.cc
M be/src/exprs/aggregate-functions.h
M be/src/exprs/datasketches-functions-ir.cc
M be/src/exprs/datasketches-functions.h
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M testdata/data/README
A testdata/data/theta_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test
M tests/query_test/test_datasketches.py
11 files changed, 399 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/08/17008/2
--
To view, visit http://gerrit.cloudera.org:8080/17008
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Gerrit-Change-Number: 17008
Gerrit-PatchSet: 2
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-10463: Implement ds theta sketch() and ds theat estimate() functions

2021-01-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17008 )

Change subject: IMPALA-10463: Implement ds_theta_sketch() and 
ds_theat_estimate() functions
..


Patch Set 1:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8052/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17008
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Gerrit-Change-Number: 17008
Gerrit-PatchSet: 1
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Sat, 30 Jan 2021 03:02:54 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10463: Implement ds theta sketch() and ds theat estimate() functions

2021-01-29 Thread Fucun Chu (Code Review)
Fucun Chu has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/17008


Change subject: IMPALA-10463: Implement ds_theta_sketch() and 
ds_theat_estimate() functions
..

IMPALA-10463: Implement ds_theta_sketch() and ds_theat_estimate() functions

These functions can be used to get cardinality estimates of data
using Theta algorithm from Apache DataSketches. ds_theta_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized Theta sketch in string format. This can be written to a
table or be fed directly to ds_theat_estimate() that returns the
cardinality estimate for that sketch.

Similar to the HLL sketch, the primary use-case for the Theta sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.

For more details about Apache DataSketches' Theta see:
https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on tpch25_parquet.lineitem to compare perfomance
   with ds_hll_*. HLL and Theta gives closer estimate except for string,
   see IMPALA-10464.

Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions-test.cc
M be/src/exprs/aggregate-functions.h
M be/src/exprs/datasketches-functions-ir.cc
M be/src/exprs/datasketches-functions.h
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M testdata/data/README
A testdata/data/theta_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-theta.test
M tests/query_test/test_datasketches.py
11 files changed, 401 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/08/17008/1
--
To view, visit http://gerrit.cloudera.org:8080/17008
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Gerrit-Change-Number: 17008
Gerrit-PatchSet: 1
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins