[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 12: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 12 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Jim Apple Gerrit-Comment-Date: Tue, 12 Dec 2017 22:20:17 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. IMPALA-5310: Part 2: Add SAMPLED_NDV() function. Adds a new SAMPLED_NDV() aggregate function that is intended to be used in COMPUTE STATS TABLESAMPLE. This patch only adds the function itself. Integration with COMPUTE STATS will come in a separate patch. SAMPLED_NDV() estimates the number of distinct values (NDV) based on a sample of data and the corresponding sampling rate. The main idea is to collect several x/y data points where x is the number of rows and y is the corresponding NDV estimate. These data points are used to fit an objective function to the data such that the true NDV can be extrapolated. The aggregate function maintains a fixed number of HyperLogLog intermediates to compute the x/y points. Several objective functions are fit and the best-fit one is used for extrapolation. Adds the MPFIT C library to perform curve fitting: https://www.physics.wisc.edu/~craigm/idl/cmpfit.html The library is a C port from Fortran. Scipy uses the Fortran version of the library for curve fitting. Testing: - added functional tests - core/hdfs run passed Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Reviewed-on: http://gerrit.cloudera.org:8080/8569 Reviewed-by: Alex BehmTested-by: Impala Public Jenkins --- M LICENSE.txt M NOTICE.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/thirdparty/mpfit/DISCLAIMER A be/src/thirdparty/mpfit/README A be/src/thirdparty/mpfit/mpfit.c A be/src/thirdparty/mpfit/mpfit.h M be/src/util/CMakeLists.txt A be/src/util/mpfit-util.cc A be/src/util/mpfit-util.h M bin/rat_exclude_files.txt M bin/run_clang_tidy.sh M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java M tests/query_test/test_aggregation.py 19 files changed, 3,976 insertions(+), 12 deletions(-) Approvals: Alex Behm: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 13 Gerrit-Owner: Alex Behm Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Jim Apple
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 12: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1611/ -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 12 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Jim Apple Gerrit-Comment-Date: Tue, 12 Dec 2017 18:18:16 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 12: Code-Review+2 rebase -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 12 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Jim Apple Gerrit-Comment-Date: Tue, 12 Dec 2017 18:17:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 11: Thanks for the quick and thorough review Jim! Much appreciated. -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 11 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Jim Apple Gerrit-Comment-Date: Tue, 12 Dec 2017 18:17:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt File bin/rat_exclude_files.txt: http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt@24 PS9, Line 24: be/src/thirdparty/mpfit/* > Sorry, I was unclear. I think it has to be copied to the LICENSE.txt file i Done -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 9 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Jim Apple Gerrit-Comment-Date: Tue, 12 Dec 2017 18:07:26 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Jim Apple has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt File bin/rat_exclude_files.txt: http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt@24 PS9, Line 24: be/src/thirdparty/mpfit/* > Done Sorry, I was unclear. I think it has to be copied to the LICENSE.txt file in the root directory. I expect you can leave the file in that directory as-is, but the text should be copied to /LICENSE.txt. If you want you can update the comment on line 22 to clarify. -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 9 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Jim Apple Gerrit-Comment-Date: Tue, 12 Dec 2017 18:01:14 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 9: (2 comments) http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt File bin/rat_exclude_files.txt: http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt@24 PS9, Line 24: be/src/thirdparty/mpfit/* > The license part of the DISCLAIMER file needs to be put in LICENSE.txt in o Done http://gerrit.cloudera.org:8080/#/c/8569/9/bin/run_clang_tidy.sh File bin/run_clang_tidy.sh: http://gerrit.cloudera.org:8080/#/c/8569/9/bin/run_clang_tidy.sh@58 PS9, Line 58: # > I think this is leftovers. Sorry. Yes. -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 9 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Jim Apple Gerrit-Comment-Date: Tue, 12 Dec 2017 17:54:37 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Hello Bharath Vissapragada, Jim Apple, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8569 to look at the new patch set (#9). Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. IMPALA-5310: Part 2: Add SAMPLED_NDV() function. Adds a new SAMPLED_NDV() aggregate function that is intended to be used in COMPUTE STATS TABLESAMPLE. This patch only adds the function itself. Integration with COMPUTE STATS will come in a separate patch. SAMPLED_NDV() estimates the number of distinct values (NDV) based on a sample of data and the corresponding sampling rate. The main idea is to collect several x/y data points where x is the number of rows and y is the corresponding NDV estimate. These data points are used to fit an objective function to the data such that the true NDV can be extrapolated. The aggregate function maintains a fixed number of HyperLogLog intermediates to compute the x/y points. Several objective functions are fit and the best-fit one is used for extrapolation. Adds the MPFIT C library to perform curve fitting: https://www.physics.wisc.edu/~craigm/idl/cmpfit.html The library is a C port from Fortran. Scipy uses the Fortran version of the library for curve fitting. Testing: - added functional tests - core/hdfs run passed Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 --- M NOTICE.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/thirdparty/mpfit/DISCLAIMER A be/src/thirdparty/mpfit/README A be/src/thirdparty/mpfit/mpfit.c A be/src/thirdparty/mpfit/mpfit.h M be/src/util/CMakeLists.txt A be/src/util/mpfit-util.cc A be/src/util/mpfit-util.h M bin/rat_exclude_files.txt M bin/run_clang_tidy.sh M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java M tests/query_test/test_aggregation.py 18 files changed, 3,896 insertions(+), 12 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/9 -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 9 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Jim Apple
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Hello Bharath Vissapragada, Jim Apple, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8569 to look at the new patch set (#7). Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. IMPALA-5310: Part 2: Add SAMPLED_NDV() function. Adds a new SAMPLED_NDV() aggregate function that is intended to be used in COMPUTE STATS TABLESAMPLE. This patch only adds the function itself. Integration with COMPUTE STATS will come in a separate patch. SAMPLED_NDV() estimates the number of distinct values (NDV) based on a sample of data and the corresponding sampling rate. The main idea is to collect several x/y data points where x is the number of rows and y is the corresponding NDV estimate. These data points are used to fit an objective function to the data such that the true NDV can be extrapolated. The aggregate function maintains a fixed number of HyperLogLog intermediates to compute the x/y points. Several objective functions are fit and the best-fit one is used for extrapolation. Adds the MPFIT C library to perform curve fitting: https://www.physics.wisc.edu/~craigm/idl/cmpfit.html The library is a C port from Fortran. Scipy uses the Fortran version of the library for curve fitting. Testing: - added functional tests - core/hdfs run passed Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 --- M NOTICE.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/thirdparty/mpfit/DISCLAIMER A be/src/thirdparty/mpfit/README A be/src/thirdparty/mpfit/mpfit.c A be/src/thirdparty/mpfit/mpfit.h M be/src/util/CMakeLists.txt A be/src/util/mpfit-util.cc A be/src/util/mpfit-util.h M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java M tests/query_test/test_aggregation.py 16 files changed, 3,888 insertions(+), 10 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/7 -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 7 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Jim Apple
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Hello Bharath Vissapragada, Jim Apple, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8569 to look at the new patch set (#8). Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. IMPALA-5310: Part 2: Add SAMPLED_NDV() function. Adds a new SAMPLED_NDV() aggregate function that is intended to be used in COMPUTE STATS TABLESAMPLE. This patch only adds the function itself. Integration with COMPUTE STATS will come in a separate patch. SAMPLED_NDV() estimates the number of distinct values (NDV) based on a sample of data and the corresponding sampling rate. The main idea is to collect several x/y data points where x is the number of rows and y is the corresponding NDV estimate. These data points are used to fit an objective function to the data such that the true NDV can be extrapolated. The aggregate function maintains a fixed number of HyperLogLog intermediates to compute the x/y points. Several objective functions are fit and the best-fit one is used for extrapolation. Adds the MPFIT C library to perform curve fitting: https://www.physics.wisc.edu/~craigm/idl/cmpfit.html The library is a C port from Fortran. Scipy uses the Fortran version of the library for curve fitting. Testing: - added functional tests - core/hdfs run passed Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 --- M NOTICE.txt M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/thirdparty/mpfit/DISCLAIMER A be/src/thirdparty/mpfit/README A be/src/thirdparty/mpfit/mpfit.c A be/src/thirdparty/mpfit/mpfit.h M be/src/util/CMakeLists.txt A be/src/util/mpfit-util.cc A be/src/util/mpfit-util.h M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java M tests/query_test/test_aggregation.py 16 files changed, 3,889 insertions(+), 10 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/8 -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 8 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Jim Apple
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 7: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 7 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Jim Apple Gerrit-Comment-Date: Mon, 11 Dec 2017 23:58:21 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 6: Code-Review+2 Rebase and trivial size fix. Tests passed. Still waiting for legal. -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 6 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Jim Apple Gerrit-Comment-Date: Thu, 07 Dec 2017 19:55:02 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 5: Before merging, I'll run the tests one more time and give the Apache legal team some time to respond regarding including the MpFit library. -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 5 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Jim Apple Gerrit-Comment-Date: Wed, 06 Dec 2017 20:41:35 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 5: Code-Review+2 That's clearer, thanks. -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 5 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Jim Apple Gerrit-Comment-Date: Wed, 06 Dec 2017 20:35:24 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Hello Bharath Vissapragada, Jim Apple, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8569 to look at the new patch set (#5). Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. IMPALA-5310: Part 2: Add SAMPLED_NDV() function. Adds a new SAMPLED_NDV() aggregate function that is intended to be used in COMPUTE STATS TABLESAMPLE. This patch only adds the function itself. Integration with COMPUTE STATS will come in a separate patch. SAMPLED_NDV() estimates the number of distinct values (NDV) based on a sample of data and the corresponding sampling rate. The main idea is to collect several x/y data points where x is the number of rows and y is the corresponding NDV estimate. These data points are used to fit an objective function to the data such that the true NDV can be extrapolated. The aggregate function maintains a fixed number of HyperLogLog intermediates to compute the x/y points. Several objective functions are fit and the best-fit one is used for extrapolation. Adds the MPFIT C library to perform curve fitting: https://www.physics.wisc.edu/~craigm/idl/cmpfit.html The library is a C port from Fortran. Scipy uses the Fortran version of the library for curve fitting. Testing: - added functional tests - core/hdfs run passed Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 --- M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/thirdparty/mpfit/DISCLAIMER A be/src/thirdparty/mpfit/README A be/src/thirdparty/mpfit/mpfit.c A be/src/thirdparty/mpfit/mpfit.h M be/src/util/CMakeLists.txt A be/src/util/mpfit-util.cc A be/src/util/mpfit-util.h M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java M tests/query_test/test_aggregation.py 15 files changed, 3,900 insertions(+), 19 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/5 -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 5 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Jim Apple
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc File be/src/util/mpfit-util.cc: http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc@33 PS3, Line 33: function > I just meant to say "fitting function" in the comment where I highlighted, Clarified in comment as discussed. -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 3 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Jim Apple Gerrit-Comment-Date: Wed, 06 Dec 2017 20:32:30 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 3: (14 comments) http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1443 PS3, Line 1443: bucket_ > I don't understand why we have "bucket" as a prefix here. Both the row_cou Done http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1456 PS3, Line 1456: existing issues with passing constant arguments to all : /// aggregation phases > what does that mean? is there a JIRA we can just reference? Yes. IMPALA-6179. Added. http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1469 PS3, Line 1469: int64_t* GetCountPtr(int bucket_idx) { : return reinterpret_cast(GetBucketPtr(bucket_idx)); : } : : uint8_t* GetHllPtr(int bucket_idx) { : return GetBucketPtr(bucket_idx) + sizeof(int64_t); : } : : uint8_t* GetBucketPtr(int bucket_idx) { : return reinterpret_cast (this) + : sizeof(SampledNdvState) + bucket_idx * BUCKET_SIZE; : } > given that NUM_HLL_BUCKETS is predetermined, why not just use an array: Much better. Done. http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1500 PS3, Line 1500: void AggregateFunctions::SampledNdvUpdate(FunctionContext* ctx, const T& src, > I think a quick comment explaining what we're doing here would be helpful, Done http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1503 PS3, Line 1503: state->total_row_count % SampledNdvState::NUM_HLL_BUCKETS; > I wonder if picking a random bucket is better? We talked a bit about this. Interesting idea that I'll investigate. Ideally I'd not want to hold up this patch since the current simpler approach has shown good results and already took a long time to tune. http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1533 PS3, Line 1533: int64_t counts[num_points]; : memset(counts, 0, num_points * sizeof(int64_t)); > how about: Done http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1542 PS3, Line 1542: diminishing returns from creating additional data points > not sure what that means. do you mean additional to 'num_points'? i.e. are Yes. Rephrased to clarify. http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions.h File be/src/exprs/aggregate-functions.h: http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions.h@209 PS3, Line 209: x/y > I think using a notation like "(x, y)" maybe clearer since this sounds like Done http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.h File be/src/util/mpfit-util.h: http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.h@74 PS3, Line 74: /// Human-readable name of this function. Used for debugging. > should any of these members be private? Oops. Yes, all of them. Done. http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc File be/src/util/mpfit-util.cc: http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc@33 PS3, Line 33: function > but what is this function suppose to do? Does it make sense to call it a " Added comment but did not rename. Not sure if "FittingFunction" would be clearer. This user supplied function is a little weird imo. Happy to rename if you prefer FittingFunction(), lmk. http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java: http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@519 PS3, Line 519: children_.get(1) > samplePerc? Done http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@524 PS3, Line 524: children_.get(1) > samplePerc? Done. Setting child 1 for clarify but it's not necessary since the cast modifies the literal in-place. http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py File tests/query_test/test_aggregation.py: http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py@337 PS3, Line 337: * sample_perc > even with the comment, I'm not sure what's going on here. Tried to clarify in comment. Lmk if still unclear. http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py@339 PS3, Line 339: def __appx_equals(self, a, b, diff_perc): > a quick comment
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 3: (14 comments) http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1443 PS3, Line 1443: bucket_ I don't understand why we have "bucket" as a prefix here. Both the row_count and hll_state are per bucket, right? http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1456 PS3, Line 1456: existing issues with passing constant arguments to all : /// aggregation phases what does that mean? is there a JIRA we can just reference? http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1469 PS3, Line 1469: int64_t* GetCountPtr(int bucket_idx) { : return reinterpret_cast(GetBucketPtr(bucket_idx)); : } : : uint8_t* GetHllPtr(int bucket_idx) { : return GetBucketPtr(bucket_idx) + sizeof(int64_t); : } : : uint8_t* GetBucketPtr(int bucket_idx) { : return reinterpret_cast (this) + : sizeof(SampledNdvState) + bucket_idx * BUCKET_SIZE; : } given that NUM_HLL_BUCKETS is predetermined, why not just use an array: struct { int64_t row_count, uint8_t hll[HLL_LEN] } buckets[NUM_HLL_BUCKETS]; http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1500 PS3, Line 1500: void AggregateFunctions::SampledNdvUpdate(FunctionContext* ctx, const T& src, I think a quick comment explaining what we're doing here would be helpful, something like Incorporate the row into one of the intermediate HLLs, which will be used by Finalize to generate a set of the (x,y) points. http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1503 PS3, Line 1503: state->total_row_count % SampledNdvState::NUM_HLL_BUCKETS; I wonder if picking a random bucket is better? We talked a bit about this. Up to you whether you think it's worth investigating. http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1533 PS3, Line 1533: int64_t counts[num_points]; : memset(counts, 0, num_points * sizeof(int64_t)); how about: int64_t counts[num_points] = { 0 }; http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1542 PS3, Line 1542: diminishing returns from creating additional data points not sure what that means. do you mean additional to 'num_points'? i.e. are you saying that empirically, 'num_points' is a sufficient number of points? http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions.h File be/src/exprs/aggregate-functions.h: http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions.h@209 PS3, Line 209: x/y I think using a notation like "(x, y)" maybe clearer since this sounds like divide. http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.h File be/src/util/mpfit-util.h: http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.h@74 PS3, Line 74: /// Human-readable name of this function. Used for debugging. should any of these members be private? http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc File be/src/util/mpfit-util.cc: http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc@33 PS3, Line 33: function but what is this function suppose to do? Does it make sense to call it a "fitting function"? http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java: http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@519 PS3, Line 519: children_.get(1) samplePerc? http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@524 PS3, Line 524: children_.get(1) samplePerc? why don't we have to do anything with the return value of uncheckedCastTo() (like set it as the children_ at index 1)? http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py File tests/query_test/test_aggregation.py: http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py@337 PS3, Line 337: * sample_perc even with the comment, I'm not sure what's going on here. http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py@339 PS3, Line 339: def __appx_equals(self, a, b, diff_perc): a quick comment for this would be helpful. -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe,
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/thirdparty/mpfit/DISCLAIMER File be/src/thirdparty/mpfit/DISCLAIMER: PS3: > I'm not sure if it counts as a licence under ASF rules. Maybe it'll get the Ok let's wait for the JIRA. The text in here sure makes it sound like the authors intended for anyone to do anything with this code. -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 3 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Jim Apple Gerrit-Comment-Date: Thu, 30 Nov 2017 16:51:01 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Jim Apple has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/thirdparty/mpfit/DISCLAIMER File be/src/thirdparty/mpfit/DISCLAIMER: PS3: > Thanks for bringing this up. Can you be more specific what you are concerne I'm not sure if it counts as a licence under ASF rules. Maybe it'll get the same treatment from the legal team as WTFPL, which is definitely allowed: http://www.wtfpl.net/txt/copying/ -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 3 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Jim Apple Gerrit-Comment-Date: Thu, 30 Nov 2017 14:53:25 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/thirdparty/mpfit/DISCLAIMER File be/src/thirdparty/mpfit/DISCLAIMER: PS3: > I'm not sure this meets ASF rules, in particular the teh translation to C l Thanks for bringing this up. Can you be more specific what you are concerned about? No restrictions on distribution seems like a good thing. I don't see any violation of: https://opensource.org/osd-annotated Regardless I've filed LEGAL-348 to clarify. -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 3 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Jim Apple Gerrit-Comment-Date: Thu, 30 Nov 2017 00:41:30 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Jim Apple has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/thirdparty/mpfit/DISCLAIMER File be/src/thirdparty/mpfit/DISCLAIMER: PS3: I'm not sure this meets ASF rules, in particular the teh translation to C license, which looks to me just "no restrictions placed on distribution". I'd suggest a close reading of https://www.apache.org/legal/resolved.html. If that doesn't settle it, and I suspect it wont, you might file a "LEGAL" JIRA: https://issues.apache.org/jira/browse/LEGAL -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 3 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Reviewer: Jim Apple Gerrit-Comment-Date: Wed, 29 Nov 2017 22:20:03 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 3: Code-Review+1 (3 comments) LGTM. http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1465 PS2, Line 1465: static int64_t GetStateSize() { > Can't do that because sizeof(SampledNdvState) fails since SampledNdvState i oops, yes. http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1552 PS2, Line 1552: tringVal merged_hll(merged_hll_data, HLL_LEN); : int64_t merged_count = 0; : for (int j = 0; j < SampledNdvState::NUM_HLL_BUCKETS; ++j) { : int bucket_idx = (i + j) % SampledNdvState::NUM_HLL_BUCKETS; : merged_count += *state->GetCountPtr(bucket_idx); : counts[pidx] = merged_count; > I expanded the comment here to explain and justify. Ah, I misread the part where we do HllMerge(ctx, hll, _hll). I read it as HllMerge(ctx, hll, _hll[i]). It makes sense now, thanks. I agree the chance of dupes is pretty less and that also explains the max_count invariant. My bad. http://gerrit.cloudera.org:8080/#/c/8569/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java File fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java: http://gerrit.cloudera.org:8080/#/c/8569/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java@2063 PS2, Line 2063: 0.0 > The value passed to this agg fn is independent of how much data is scanned. Makes sense, I too think the same. Thanks for explaining. -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 3 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Comment-Date: Tue, 28 Nov 2017 22:58:29 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Hello Bharath Vissapragada, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8569 to look at the new patch set (#2). Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. IMPALA-5310: Part 2: Add SAMPLED_NDV() function. Adds a new SAMPLED_NDV() aggregate function that is intended to be used in COMPUTE STATS TABLESAMPLE. This patch only adds the function itself. Integration with COMPUTE STATS will come in a separate patch. SAMPLED_NDV() estimates the number of distinct values (NDV) based on a sample of data and the corresponding sampling rate. The main idea is to collect several x/y data points where x is the number of rows and y is the corresponding NDV estimate. These data points are used to fit an objective function to the data such that the true NDV can be extrapolated. The aggregate function maintains a fixed number of HyperLogLog intermediates to compute the x/y points. Several objective functions are fit and the best-fit one is used for extrapolation. Adds the MPFIT C library to perform curve fitting: https://www.physics.wisc.edu/~craigm/idl/cmpfit.html The library is a C port from Fortran. Scipy uses the Fortran version of the library for curve fitting. Testing: - added functional tests - core/hdfs run passed Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 --- M be/src/exprs/aggregate-functions-ir.cc M be/src/exprs/aggregate-functions.h A be/src/thirdparty/mpfit/DISCLAIMER A be/src/thirdparty/mpfit/README A be/src/thirdparty/mpfit/mpfit.c A be/src/thirdparty/mpfit/mpfit.h M be/src/util/CMakeLists.txt A be/src/util/mpfit-util.cc A be/src/util/mpfit-util.h M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java M tests/query_test/test_aggregation.py 15 files changed, 3,892 insertions(+), 19 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/2 -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 2 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 1: (7 comments) http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc@1464 PS1, Line 1464: sizeof(int64_t) + AggregateFunctions::HLL_LEN) > maybe factor this out into a GetSingleBucketSize() or something for readabi Good idea done. I did some more similar cleanup in this class. http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc@1469 PS1, Line 1469: sizeof(SampledNdvState) + bucket_idx * (sizeof(int64_t) + AggregateFunctions::HLL_LEN)); > nit: longline Done http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h File be/src/util/mpfit-util.h: http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@50 PS1, Line 50: bool LmsFit(const double* xs, const double* ys, int num_points); > WARN_UNUSED_RESULT, here and elsewhere Done here. I could not find another place, did I miss one? http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@82 PS1, Line 82: mp_result result_; > Looks like this struct has a bunch of ptrs, whats the lifecycle of those? Added comment. http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@90 PS1, Line 90: const double* ys_; > DISALLOW_COPY_ASSIGN That's needed for putting into std::vector. http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.cc File be/src/util/mpfit-util.cc: http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.cc@29 PS1, Line 29: num_params_(num_params), : fn_(fn), : num_points_(0), : xs_(nullptr), : ys_(nullptr) { > nit: I think we could condense them into fewer lines (other places in the c Done http://gerrit.cloudera.org:8080/#/c/8569/1/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java: http://gerrit.cloudera.org:8080/#/c/8569/1/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@507 PS1, Line 507: && children_.size() == 2 > Isn't this more like a preconditions check inside the if {}? No. We're parsing a generic FunctionCallExpr that can be invoked with any number of args and we want to return a "Could not resolve function" error when SAMPLED_NDV() is invoked with an invalid number. That's done in L527. Added comment. -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 1 Gerrit-Owner: Alex BehmGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Bharath Vissapragada Gerrit-Comment-Date: Tue, 21 Nov 2017 19:03:34 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 ) Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function. .. Patch Set 1: (7 comments) I have a bunch of nits. Although I follow the basic flow of the patch, I think I need to spend more time by running some examples to get a deeper understanding, as I'm not fully familiar with how the curve fitting math works. http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc@1464 PS1, Line 1464: sizeof(int64_t) + AggregateFunctions::HLL_LEN) maybe factor this out into a GetSingleBucketSize() or something for readability? (here and below) http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc@1469 PS1, Line 1469: sizeof(SampledNdvState) + bucket_idx * (sizeof(int64_t) + AggregateFunctions::HLL_LEN)); nit: longline http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h File be/src/util/mpfit-util.h: http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@50 PS1, Line 50: bool LmsFit(const double* xs, const double* ys, int num_points); WARN_UNUSED_RESULT, here and elsewhere http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@82 PS1, Line 82: mp_result result_; Looks like this struct has a bunch of ptrs, whats the lifecycle of those? http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@90 PS1, Line 90: const double* ys_; DISALLOW_COPY_ASSIGN http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.cc File be/src/util/mpfit-util.cc: http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.cc@29 PS1, Line 29: num_params_(num_params), : fn_(fn), : num_points_(0), : xs_(nullptr), : ys_(nullptr) { nit: I think we could condense them into fewer lines (other places in the codebase use multiple args per line) http://gerrit.cloudera.org:8080/#/c/8569/1/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java: http://gerrit.cloudera.org:8080/#/c/8569/1/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@507 PS1, Line 507: && children_.size() == 2 Isn't this more like a preconditions check inside the if {}? -- To view, visit http://gerrit.cloudera.org:8080/8569 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Gerrit-Change-Number: 8569 Gerrit-PatchSet: 1 Gerrit-Owner: Alex BehmGerrit-Reviewer: Bharath Vissapragada Gerrit-Comment-Date: Thu, 16 Nov 2017 08:28:37 + Gerrit-HasComments: Yes