[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 12: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 12
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Jim Apple 
Gerrit-Comment-Date: Tue, 12 Dec 2017 22:20:17 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Reviewed-on: http://gerrit.cloudera.org:8080/8569
Reviewed-by: Alex Behm 
Tested-by: Impala Public Jenkins
---
M LICENSE.txt
M NOTICE.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
19 files changed, 3,976 insertions(+), 12 deletions(-)

Approvals:
  Alex Behm: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 13
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Jim Apple 


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1611/


--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 12
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Jim Apple 
Gerrit-Comment-Date: Tue, 12 Dec 2017 18:18:16 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-12 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 12: Code-Review+2

rebase


--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 12
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 
Gerrit-Comment-Date: Tue, 12 Dec 2017 18:17:31 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-12 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 11:

Thanks for the quick and thorough review Jim! Much appreciated.


--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 11
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 
Gerrit-Comment-Date: Tue, 12 Dec 2017 18:17:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-12 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt
File bin/rat_exclude_files.txt:

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt@24
PS9, Line 24: be/src/thirdparty/mpfit/*
> Sorry, I was unclear. I think it has to be copied to the LICENSE.txt file i
Done



--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 9
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 
Gerrit-Comment-Date: Tue, 12 Dec 2017 18:07:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-12 Thread Jim Apple (Code Review)
Jim Apple has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt
File bin/rat_exclude_files.txt:

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt@24
PS9, Line 24: be/src/thirdparty/mpfit/*
> Done
Sorry, I was unclear. I think it has to be copied to the LICENSE.txt file in 
the root directory. I expect you can leave the file in that directory as-is, 
but the text should be copied to /LICENSE.txt.

If you want you can update the comment on line 22 to clarify.



--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 9
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 
Gerrit-Comment-Date: Tue, 12 Dec 2017 18:01:14 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-12 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 9:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt
File bin/rat_exclude_files.txt:

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt@24
PS9, Line 24: be/src/thirdparty/mpfit/*
> The license part of the DISCLAIMER file needs to be put in LICENSE.txt in o
Done


http://gerrit.cloudera.org:8080/#/c/8569/9/bin/run_clang_tidy.sh
File bin/run_clang_tidy.sh:

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/run_clang_tidy.sh@58
PS9, Line 58: #
> I think this is leftovers.
Sorry. Yes.



--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 9
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 
Gerrit-Comment-Date: Tue, 12 Dec 2017 17:54:37 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-11 Thread Alex Behm (Code Review)
Hello Bharath Vissapragada, Jim Apple, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8569

to look at the new patch set (#9).

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
---
M NOTICE.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
18 files changed, 3,896 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/9
--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 9
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-11 Thread Alex Behm (Code Review)
Hello Bharath Vissapragada, Jim Apple, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8569

to look at the new patch set (#7).

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
---
M NOTICE.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
16 files changed, 3,888 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/7
--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 7
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-11 Thread Alex Behm (Code Review)
Hello Bharath Vissapragada, Jim Apple, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8569

to look at the new patch set (#8).

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
---
M NOTICE.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
16 files changed, 3,889 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/8
--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 8
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-11 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 7: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 7
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 
Gerrit-Comment-Date: Mon, 11 Dec 2017 23:58:21 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-07 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 6: Code-Review+2

Rebase and trivial size fix. Tests passed. Still waiting for legal.


--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 6
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 
Gerrit-Comment-Date: Thu, 07 Dec 2017 19:55:02 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-06 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 5:

Before merging, I'll run the tests one more time and give the Apache legal team 
some time to respond regarding including the MpFit library.


--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 5
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 
Gerrit-Comment-Date: Wed, 06 Dec 2017 20:41:35 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-06 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 5: Code-Review+2

That's clearer, thanks.


--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 5
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 
Gerrit-Comment-Date: Wed, 06 Dec 2017 20:35:24 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-06 Thread Alex Behm (Code Review)
Hello Bharath Vissapragada, Jim Apple, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8569

to look at the new patch set (#5).

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
15 files changed, 3,900 insertions(+), 19 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/5
--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 5
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-06 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc
File be/src/util/mpfit-util.cc:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc@33
PS3, Line 33: function
> I just meant to say "fitting function" in the comment where I highlighted,
Clarified in comment as discussed.



--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Jim Apple 
Gerrit-Comment-Date: Wed, 06 Dec 2017 20:32:30 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-05 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 3:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1443
PS3, Line 1443: bucket_
> I don't understand why we have "bucket" as a prefix here.  Both the row_cou
Done


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1456
PS3, Line 1456: existing issues with passing constant arguments to all
  :   /// aggregation phases
> what does that mean? is there a JIRA we can just reference?
Yes. IMPALA-6179. Added.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1469
PS3, Line 1469:   int64_t* GetCountPtr(int bucket_idx) {
  : return reinterpret_cast(GetBucketPtr(bucket_idx));
  :   }
  :
  :   uint8_t* GetHllPtr(int bucket_idx) {
  : return GetBucketPtr(bucket_idx) + sizeof(int64_t);
  :   }
  :
  :   uint8_t* GetBucketPtr(int bucket_idx) {
  : return reinterpret_cast(this) +
  : sizeof(SampledNdvState) + bucket_idx * BUCKET_SIZE;
  :   }
> given that NUM_HLL_BUCKETS is predetermined, why not just use an array:
Much better. Done.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1500
PS3, Line 1500: void AggregateFunctions::SampledNdvUpdate(FunctionContext* ctx, 
const T& src,
> I think a quick comment explaining what we're doing here would be helpful,
Done


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1503
PS3, Line 1503: state->total_row_count % SampledNdvState::NUM_HLL_BUCKETS;
> I wonder if picking a random bucket is better? We talked a bit about this.
Interesting idea that I'll investigate. Ideally I'd not want to hold up this 
patch since the current simpler approach has shown good results and already 
took a long time to tune.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1533
PS3, Line 1533:   int64_t counts[num_points];
  :   memset(counts, 0, num_points * sizeof(int64_t));
> how about:
Done


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1542
PS3, Line 1542: diminishing returns from creating additional data points
> not sure what that means. do you mean additional to 'num_points'? i.e. are
Yes. Rephrased to clarify.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions.h
File be/src/exprs/aggregate-functions.h:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions.h@209
PS3, Line 209: x/y
> I think using a notation like "(x, y)" maybe clearer since this sounds like
Done


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.h
File be/src/util/mpfit-util.h:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.h@74
PS3, Line 74:   /// Human-readable name of this function. Used for debugging.
> should any of these members be private?
Oops. Yes, all of them. Done.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc
File be/src/util/mpfit-util.cc:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc@33
PS3, Line 33: function
> but what is this function suppose to do?  Does it make sense to call it a "
Added comment but did not rename. Not sure if "FittingFunction" would be 
clearer. This user supplied function is a little weird imo.

Happy to rename if you prefer FittingFunction(), lmk.


http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@519
PS3, Line 519: children_.get(1)
> samplePerc?
Done


http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@524
PS3, Line 524: children_.get(1)
> samplePerc?
Done.

Setting child 1 for clarify but it's not necessary since the cast modifies the 
literal in-place.


http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py@337
PS3, Line 337: * sample_perc
> even with the comment, I'm not sure what's going on here.
Tried to clarify in comment. Lmk if still unclear.


http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py@339
PS3, Line 339:   def __appx_equals(self, a, b, diff_perc):
> a quick comment 

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-12-01 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 3:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1443
PS3, Line 1443: bucket_
I don't understand why we have "bucket" as a prefix here.  Both the row_count 
and hll_state are per bucket, right?


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1456
PS3, Line 1456: existing issues with passing constant arguments to all
  :   /// aggregation phases
what does that mean? is there a JIRA we can just reference?


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1469
PS3, Line 1469:   int64_t* GetCountPtr(int bucket_idx) {
  : return reinterpret_cast(GetBucketPtr(bucket_idx));
  :   }
  :
  :   uint8_t* GetHllPtr(int bucket_idx) {
  : return GetBucketPtr(bucket_idx) + sizeof(int64_t);
  :   }
  :
  :   uint8_t* GetBucketPtr(int bucket_idx) {
  : return reinterpret_cast(this) +
  : sizeof(SampledNdvState) + bucket_idx * BUCKET_SIZE;
  :   }
given that NUM_HLL_BUCKETS is predetermined, why not just use an array:

struct { int64_t row_count, uint8_t hll[HLL_LEN] } buckets[NUM_HLL_BUCKETS];


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1500
PS3, Line 1500: void AggregateFunctions::SampledNdvUpdate(FunctionContext* ctx, 
const T& src,
I think a quick comment explaining what we're doing here would be helpful, 
something like

Incorporate the row into one of the intermediate HLLs, which will be used by 
Finalize to generate a set of the (x,y) points.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1503
PS3, Line 1503: state->total_row_count % SampledNdvState::NUM_HLL_BUCKETS;
I wonder if picking a random bucket is better? We talked a bit about this. Up 
to you whether you think it's worth investigating.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1533
PS3, Line 1533:   int64_t counts[num_points];
  :   memset(counts, 0, num_points * sizeof(int64_t));
how about:

int64_t counts[num_points] = { 0 };


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1542
PS3, Line 1542: diminishing returns from creating additional data points
not sure what that means. do you mean additional to 'num_points'? i.e. are you 
saying that empirically, 'num_points' is a sufficient number of points?


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions.h
File be/src/exprs/aggregate-functions.h:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions.h@209
PS3, Line 209: x/y
I think using a notation like "(x, y)" maybe clearer since this sounds like 
divide.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.h
File be/src/util/mpfit-util.h:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.h@74
PS3, Line 74:   /// Human-readable name of this function. Used for debugging.
should any of these members be private?


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc
File be/src/util/mpfit-util.cc:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc@33
PS3, Line 33: function
but what is this function suppose to do?  Does it make sense to call it a 
"fitting function"?


http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@519
PS3, Line 519: children_.get(1)
samplePerc?


http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@524
PS3, Line 524: children_.get(1)
samplePerc?

why don't we have to do anything with the return value of uncheckedCastTo() 
(like set it as the children_ at index 1)?


http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py@337
PS3, Line 337: * sample_perc
even with the comment, I'm not sure what's going on here.


http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py@339
PS3, Line 339:   def __appx_equals(self, a, b, diff_perc):
a quick comment for this would be helpful.



--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, 

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-11-30 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/thirdparty/mpfit/DISCLAIMER
File be/src/thirdparty/mpfit/DISCLAIMER:

PS3:
> I'm not sure if it counts as a licence under ASF rules. Maybe it'll get the
Ok let's wait for the JIRA. The text in here sure makes it sound like the 
authors intended for anyone to do anything with this code.



--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Jim Apple 
Gerrit-Comment-Date: Thu, 30 Nov 2017 16:51:01 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-11-30 Thread Jim Apple (Code Review)
Jim Apple has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/thirdparty/mpfit/DISCLAIMER
File be/src/thirdparty/mpfit/DISCLAIMER:

PS3:
> Thanks for bringing this up. Can you be more specific what you are concerne
I'm not sure if it counts as a licence under ASF rules. Maybe it'll get the 
same treatment from the legal team as WTFPL, which is definitely allowed:

http://www.wtfpl.net/txt/copying/



--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Jim Apple 
Gerrit-Comment-Date: Thu, 30 Nov 2017 14:53:25 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-11-29 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/thirdparty/mpfit/DISCLAIMER
File be/src/thirdparty/mpfit/DISCLAIMER:

PS3:
> I'm not sure this meets ASF rules, in particular the teh translation to C l
Thanks for bringing this up. Can you be more specific what you are concerned 
about?

No restrictions on distribution seems like a good thing.

I don't see any violation of: https://opensource.org/osd-annotated

Regardless I've filed LEGAL-348 to clarify.



--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Jim Apple 
Gerrit-Comment-Date: Thu, 30 Nov 2017 00:41:30 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-11-29 Thread Jim Apple (Code Review)
Jim Apple has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/thirdparty/mpfit/DISCLAIMER
File be/src/thirdparty/mpfit/DISCLAIMER:

PS3:
I'm not sure this meets ASF rules, in particular the teh translation to C 
license, which looks to me just "no restrictions placed on distribution".

I'd suggest a close reading of https://www.apache.org/legal/resolved.html. If 
that doesn't settle it, and I suspect it wont, you might file a "LEGAL" JIRA: 
https://issues.apache.org/jira/browse/LEGAL



--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Reviewer: Jim Apple 
Gerrit-Comment-Date: Wed, 29 Nov 2017 22:20:03 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-11-28 Thread Bharath Vissapragada (Code Review)
Bharath Vissapragada has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 3: Code-Review+1

(3 comments)

LGTM.

http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1465
PS2, Line 1465:   static int64_t GetStateSize() {
> Can't do that because sizeof(SampledNdvState) fails since SampledNdvState i
oops, yes.


http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1552
PS2, Line 1552: tringVal merged_hll(merged_hll_data, HLL_LEN);
  : int64_t merged_count = 0;
  : for (int j = 0; j < SampledNdvState::NUM_HLL_BUCKETS; ++j) {
  :   int bucket_idx = (i + j) % 
SampledNdvState::NUM_HLL_BUCKETS;
  :   merged_count += *state->GetCountPtr(bucket_idx);
  :   counts[pidx] = merged_count;
> I expanded the comment here to explain and justify.
Ah, I misread the part where we do HllMerge(ctx, hll, _hll). I read it 
as HllMerge(ctx, hll, _hll[i]). It makes sense now, thanks. I agree the 
chance of dupes is pretty less and that also explains the max_count invariant. 
My bad.


http://gerrit.cloudera.org:8080/#/c/8569/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java:

http://gerrit.cloudera.org:8080/#/c/8569/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java@2063
PS2, Line 2063: 0.0
> The value passed to this agg fn is independent of how much data is scanned.
Makes sense, I too think the same. Thanks for explaining.



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Comment-Date: Tue, 28 Nov 2017 22:58:29 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-11-21 Thread Alex Behm (Code Review)
Hello Bharath Vissapragada,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8569

to look at the new patch set (#2).

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
15 files changed, 3,892 insertions(+), 19 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/2
--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 2
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-11-21 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 1:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc@1464
PS1, Line 1464: sizeof(int64_t) + AggregateFunctions::HLL_LEN)
> maybe factor this out into a GetSingleBucketSize() or something for readabi
Good idea done. I did some more similar cleanup in this class.


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc@1469
PS1, Line 1469: sizeof(SampledNdvState) + bucket_idx * (sizeof(int64_t) 
+ AggregateFunctions::HLL_LEN));
> nit: longline
Done


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h
File be/src/util/mpfit-util.h:

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@50
PS1, Line 50:   bool LmsFit(const double* xs, const double* ys, int num_points);
> WARN_UNUSED_RESULT, here and elsewhere
Done here. I could not find another place, did I miss one?


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@82
PS1, Line 82:   mp_result result_;
> Looks like this struct has a bunch of ptrs, whats the lifecycle of those?
Added comment.


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@90
PS1, Line 90:   const double* ys_;
> DISALLOW_COPY_ASSIGN
That's needed for putting into std::vector.


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.cc
File be/src/util/mpfit-util.cc:

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.cc@29
PS1, Line 29: num_params_(num_params),
: fn_(fn),
: num_points_(0),
: xs_(nullptr),
: ys_(nullptr) {
> nit: I think we could condense them into fewer lines (other places in the c
Done


http://gerrit.cloudera.org:8080/#/c/8569/1/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

http://gerrit.cloudera.org:8080/#/c/8569/1/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@507
PS1, Line 507: && children_.size() == 2
> Isn't this more like a preconditions check inside the if {}?
No. We're parsing a generic FunctionCallExpr that can be invoked with any 
number of args and we want to return a "Could not resolve function" error when 
SAMPLED_NDV() is invoked with an invalid number. That's done in L527.

Added comment.



--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 1
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Comment-Date: Tue, 21 Nov 2017 19:03:34 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

2017-11-16 Thread Bharath Vissapragada (Code Review)
Bharath Vissapragada has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
..


Patch Set 1:

(7 comments)

I have a bunch of nits. Although I follow the basic flow of the patch, I think 
I need to spend more time by running some examples to get a deeper 
understanding, as I'm not fully familiar with how the curve fitting math works.

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc@1464
PS1, Line 1464: sizeof(int64_t) + AggregateFunctions::HLL_LEN)
maybe factor this out into a GetSingleBucketSize() or something for 
readability? (here and below)


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc@1469
PS1, Line 1469: sizeof(SampledNdvState) + bucket_idx * (sizeof(int64_t) 
+ AggregateFunctions::HLL_LEN));
nit: longline


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h
File be/src/util/mpfit-util.h:

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@50
PS1, Line 50:   bool LmsFit(const double* xs, const double* ys, int num_points);
WARN_UNUSED_RESULT, here and elsewhere


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@82
PS1, Line 82:   mp_result result_;
Looks like this struct has a bunch of ptrs, whats the lifecycle of those?


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@90
PS1, Line 90:   const double* ys_;
DISALLOW_COPY_ASSIGN


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.cc
File be/src/util/mpfit-util.cc:

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.cc@29
PS1, Line 29: num_params_(num_params),
: fn_(fn),
: num_points_(0),
: xs_(nullptr),
: ys_(nullptr) {
nit: I think we could condense them into fewer lines (other places in the 
codebase use multiple args per line)


http://gerrit.cloudera.org:8080/#/c/8569/1/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

http://gerrit.cloudera.org:8080/#/c/8569/1/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@507
PS1, Line 507: && children_.size() == 2
Isn't this more like a preconditions check inside the if {}?



--
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 1
Gerrit-Owner: Alex Behm 
Gerrit-Reviewer: Bharath Vissapragada 
Gerrit-Comment-Date: Thu, 16 Nov 2017 08:28:37 +
Gerrit-HasComments: Yes