[Impala-ASF-CR] [WIP] IMPALA-2658: Extend the NDV function to accept a precision

2020-06-08 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15997 )

Change subject: [WIP] IMPALA-2658: Extend the NDV function to accept a precision
..


Patch Set 13:

(21 comments)

Done and please review one more time.

http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1441
PS13, Line 1441: ComputeSizeOfIntermediateTypeForNDV
> this needs documentation.
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1452
PS13, Line 1452:   if (ctx->GetNumArgs() == 2) {
> it seems this whole code block is only used to validate the input parameter
Wrapped the entire block in #ifndef NDEBUG.


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1457
PS13, Line 1457: if (int_val)
> is this ever possible? can int_val be null if GetNumArgs() == 2?
Yes. In a parallel plan, this method can be called for the merge() part of the 
expression evaluation, where the actual number of arguments is 1 (the 
intermediate result).


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1466
PS13, Line 1466:   int hll_len = dst->len;
> this copies the int value from dst->len to hll_len, is this necessary? if n
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1468
PS13, Line 1468:   int precision = log2(hll_len);
> is there a reason this needs to be re-computed during each update?
I improved it a little bit, from log2(x) to 
BitUtil::CountTrailingZeros((unsigned int)size, sizeof(size) * CHAR_BIT).

We could cache the precision value during HllInit() call to a replace and 
reuse. But I have not found the place to do so. If we use ctx, does it mean we 
need to add a new data member?


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1481
PS13, Line 1481: // Two input argument version, which calls the single input 
argument version
> same comment as elsewhere
Same answer as above.


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1510
PS13, Line 1510: // Two input argument version which calls the single input 
argument version.
> should be more descriptive that just "Two input argument version"
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1523
PS13, Line 1523:   DCHECK_EQ(dst->len, src.len);
> is there a reason the order of these got switched?
Made the src.len the first argument in the DCHECK_EQ() which reads more natural.


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1557
PS13, Line 1557: actual_harmonic_mean
> it doesn't look like this variable exists anywhere, did you mean 'harmonic_
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1575
PS13, Line 1575:
> nit: extra new line
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@2467
PS13, Line 2467: // Two input argument version.
> this could be more descriptive. what does the two input argument version do
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h
File be/src/exprs/aggregate-functions.h:

http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h@192
PS13, Line 192:   /// 2) HyperLogLog in Practice (paper from google with some 
improvements)
> probably worth mentioning either here or elsewhere that these functions tak
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h@196
PS13, Line 196: HLL_PRECISION = 10; // default precision
> probably worth just renaming this to DEFAULT_HLL_PRECISION to keep it consi
Done.

I hesitated a little bit last on making the change, as this will affect several 
other modules (e.g., sampled_ndv, and test  modules).


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h@203
PS13, Line 203: HLL_LEN
> same here, rename to DEFAULT_HLL_LEN
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java:

http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@65
PS13, Line 65: HashMap
> should be Map instead of HashMap
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@65
PS13, Line 65: ArrayList
> should be List instead of ArrayList: https://stackoverflow.com/questions/22
Done


http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@999
PS13, Line 999: fakedHllIntermediateType
> it might be cleaner to just use the defaultHllIntermediateType here 

[Impala-ASF-CR] [WIP] IMPALA-2658: Extend the NDV function to accept a precision

2020-06-05 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15997 )

Change subject: [WIP] IMPALA-2658: Extend the NDV function to accept a precision
..


Patch Set 13:

(16 comments)

some more comments

http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1452
PS13, Line 1452:   if (ctx->GetNumArgs() == 2) {
it seems this whole code block is only used to validate the input parameters 
via DCHECKS. so it will only be run in debug builds, in which case it should be 
surrounded by a #ifdef NDEBUG ... #endif block.


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1457
PS13, Line 1457: if (int_val)
is this ever possible? can int_val be null if GetNumArgs() == 2?


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1466
PS13, Line 1466:   int hll_len = dst->len;
this copies the int value from dst->len to hll_len, is this necessary? if not, 
then can the int just be a const reference?


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1481
PS13, Line 1481: // Two input argument version, which calls the single input 
argument version
same comment as elsewhere


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1510
PS13, Line 1510: // Two input argument version which calls the single input 
argument version.
should be more descriptive that just "Two input argument version"


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1523
PS13, Line 1523:   DCHECK_EQ(dst->len, src.len);
is there a reason the order of these got switched?


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1557
PS13, Line 1557: actual_harmonic_mean
it doesn't look like this variable exists anywhere, did you mean 'harmonic_mean'


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1575
PS13, Line 1575:
nit: extra new line


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@2467
PS13, Line 2467: // Two input argument version.
this could be more descriptive. what does the two input argument version do? 
saying something like "HLL version that accepts a precision value" would be 
more informative.


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h
File be/src/exprs/aggregate-functions.h:

http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h@192
PS13, Line 192:   /// 2) HyperLogLog in Practice (paper from google with some 
improvements)
probably worth mentioning either here or elsewhere that these functions take in 
an optional precision. there should be some description of what the precision 
means and the range of acceptable values.


http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java:

http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@65
PS13, Line 65: HashMap
should be Map instead of HashMap


http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@999
PS13, Line 999: fakedHllIntermediateType
it might be cleaner to just use the defaultHllIntermediateType here instead of 
creating a fake one


http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1009
PS13, Line 1009: ArrayList
should be declared as a "List".


http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1013
PS13, Line 1013: ndvList.add(AggregateFunction.createBuiltin(db, "ndv",
   : Lists.newArrayList(t, Type.INT), Type.BIGINT, 
hllIntermediateType,
   : prefix + 
"7HllInitEPN10impala_udf15FunctionContextEPNS1_9StringValE",
   : prefix + HLL_UPDATE_SYMBOL_TWO_ARGS.get(t),
   : prefix + 
"8HllMergeEPN10impala_udf15FunctionContextERKNS1_9StringValEPS4_",
   : null,
   : prefix + 
"11HllFinalizeEPN10impala_udf15FunctionContextERKNS1_9StringValE",
   : true, false, true));
it looks like this is pretty much a duplicate of the code on lines 998-1005, 
its probably work abstracting this into a private method that takes the 
intermediate type as a parameter.
even better would be to replace the call on lines 983-990 with this common 
method, you would have to make the update symbol a parameter as well.


http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1022
PS13, Line 1022:   

[Impala-ASF-CR] [WIP] IMPALA-2658: Extend the NDV function to accept a precision

2020-06-05 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15997 )

Change subject: [WIP] IMPALA-2658: Extend the NDV function to accept a precision
..


Patch Set 13:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/15997/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15997/7//COMMIT_MSG@7
PS7, Line 7: [WIP] IMPALA-2658: Extend the NDV function to accept a precision
> overall, I think this commit message is a bit verbose in terms of describin
ping - I think this still needs to be addressed


http://gerrit.cloudera.org:8080/#/c/15997/13//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15997/13//COMMIT_MSG@41
PS13, Line 41: Testing:
since you ran all core tests you should add an entry here that says "Ran core 
tests"


http://gerrit.cloudera.org:8080/#/c/15997/13//COMMIT_MSG@44
PS13, Line 44: 2 Run unit tests against other tables such as tpcds.store_sales
which unit tests are you talking about? if these are already include in "core" 
tests, you don't need to include this.


http://gerrit.cloudera.org:8080/#/c/15997/13//COMMIT_MSG@47
PS13, Line 47: select ndv(c_name, 1) "one", ndv(c_name, 2) two, ndv(c_name, 3) 
three,
 : ndv(c_name, 4) as four, ndv(c_name, 5) as five, ndv(c_name, 6) 
as six,
 : ndv(c_name, 7) as seven, ndv(c_name, 8) as eight, ndv(c_name, 9) 
as nine,
 : ndv(c_name, 10)  as ten
 : from tpch.customer;
 :
 : select ndv(ss_sold_time_sk, 1) "one", ndv(ss_sold_time_sk, 2) 
two,
 : ndv(ss_sold_time_sk, 3) three, ndv(ss_sold_time_sk, 4) as four,
 : ndv(ss_sold_time_sk, 5) as five, ndv(ss_sold_time_sk, 6) as six,
 : ndv(ss_sold_time_sk, 7) as seven, ndv(ss_sold_time_sk, 8) as 
eight,
 : ndv(ss_sold_time_sk, 9) as nine, ndv(ss_sold_time_sk, 10) as ten
 : from tpcds.store_sales;
I think adding these in the commit message makes it too verbose. You just need 
to mention that you ran ndv(column, precision) for all possible values (1-10).


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1441
PS13, Line 1441: ComputeSizeOfIntermediateTypeForNDV
this needs documentation.


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions-ir.cc@1468
PS13, Line 1468:   int precision = log2(hll_len);
is there a reason this needs to be re-computed during each update?


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h
File be/src/exprs/aggregate-functions.h:

http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h@196
PS13, Line 196: HLL_PRECISION = 10; // default precision
probably worth just renaming this to DEFAULT_HLL_PRECISION to keep it 
consistent with the MIN/MAX_HLL_PRECISION variables. you can remove the 
"default precision" comment as well.


http://gerrit.cloudera.org:8080/#/c/15997/13/be/src/exprs/aggregate-functions.h@203
PS13, Line 203: HLL_LEN
same here, rename to DEFAULT_HLL_LEN


http://gerrit.cloudera.org:8080/#/c/15997/9/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

http://gerrit.cloudera.org:8080/#/c/15997/9/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@592
PS9, Line 592:   if (fn_ == null)
> Can you please explain?
it should be:

 if (fn_ == null) {
   throw new AnalysisException(
   "A suitable intermediate data type can not be found for the second 
parameter "
+ children_.get(1).toSql() + " in NDV()");
 }

notice how there are curly braces around the body of the if statement


http://gerrit.cloudera.org:8080/#/c/15997/9/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java:

http://gerrit.cloudera.org:8080/#/c/15997/9/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@352
PS9, Line 352: HLL_UPDATE_SYMBOL_T
> Change to HLL_UPDATE_SYMBOL_TWO_ARGS.
I think it can be more descriptive still. Something like 
HLL_UPDATE_SYMBOL_WITH_PRECISION would be better.


http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java:

http://gerrit.cloudera.org:8080/#/c/15997/13/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@65
PS13, Line 65: ArrayList
should be List instead of ArrayList: 
https://stackoverflow.com/questions/2279030/type-list-vs-type-arraylist-in-java


http://gerrit.cloudera.org:8080/#/c/15997/9/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/15997/9/tests/query_test/test_aggregation.py@318
PS9, Line 318: 0, 1, 2, 

[Impala-ASF-CR] [WIP] IMPALA-2658: Extend the NDV function to accept a precision

2020-06-05 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15997 )

Change subject: [WIP] IMPALA-2658: Extend the NDV function to accept a precision
..


Patch Set 13:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/6222/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/15997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
Gerrit-Change-Number: 15997
Gerrit-PatchSet: 13
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Fri, 05 Jun 2020 15:34:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] [WIP] IMPALA-2658: Extend the NDV function to accept a precision

2020-06-05 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#13). ( 
http://gerrit.cloudera.org:8080/15997 )

Change subject: [WIP] IMPALA-2658: Extend the NDV function to accept a precision
..

[WIP] IMPALA-2658: Extend the NDV function to accept a precision

This work addresses the current limitation in NDV function by
extending the function to take the 2nd integer-typed argument,
which must be an abstract value in the range of 1 to 10. This
abstract value specifies a real precision value used in the HLL
algorithm for the function.

Front end work:
1. Add a new template ndv function in builtin db that takes two
   arguments.
2. Verify that the 2nd argument of a NDV() is an integer literal in
   [1,10];
3. A new method to implement the mapping of the abstract value to the
   hll precision (the real work is TBD);
4. The length of the intermediate data type is computed based on the
   actual hll precision. When the 2nd argument is missing, the length
   is 1024 as in the current implementation;
5. The 2nd argument, if present, will be carried over all the way to
   the BE.

Back end work:
1. Remove the hardcoded precision (10) from these functions:
 AggregateFunctions::HllInit(),
 AggregateFunctions::HllUpdate(),
 AggregateFunctions::HllMerge(),
 AggregateFunctions::HllFinalEstimate(),
 AggregateFunctions::HllFinalize(),
 HllEstimateBias();
2. Instead, the actual precision is computed from the
   length of the intermediate data type as log2(hll_len);
3. Verify that the length of the intermediate data type is
   correct according to the 2nd argument (if present).

Testing:
1 Add a regression test (test_ndv)) in TestAggregationQueries
  section to computes ndv() for every supported Impala data type.
2 Run unit tests against other tables such as tpcds.store_sales
  and tpch.customer in both serial and parallel plan settings.

select ndv(c_name, 1) "one", ndv(c_name, 2) two, ndv(c_name, 3) three,
ndv(c_name, 4) as four, ndv(c_name, 5) as five, ndv(c_name, 6) as six,
ndv(c_name, 7) as seven, ndv(c_name, 8) as eight, ndv(c_name, 9) as nine,
ndv(c_name, 10)  as ten
from tpch.customer;

select ndv(ss_sold_time_sk, 1) "one", ndv(ss_sold_time_sk, 2) two,
ndv(ss_sold_time_sk, 3) three, ndv(ss_sold_time_sk, 4) as four,
ndv(ss_sold_time_sk, 5) as five, ndv(ss_sold_time_sk, 6) as six,
ndv(ss_sold_time_sk, 7) as seven, ndv(ss_sold_time_sk, 8) as eight,
ndv(ss_sold_time_sk, 9) as nine, ndv(ss_sold_time_sk, 10) as ten
from tpcds.store_sales;

Perf: TBD

Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
---
M be/src/common/logging.h
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M tests/query_test/test_aggregation.py
6 files changed, 302 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/15997/13
--
To view, visit http://gerrit.cloudera.org:8080/15997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
Gerrit-Change-Number: 15997
Gerrit-PatchSet: 13
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Sahil Takiar 


[Impala-ASF-CR] [WIP] IMPALA-2658: Extend the NDV function to accept a precision

2020-06-05 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15997 )

Change subject: [WIP] IMPALA-2658: Extend the NDV function to accept a precision
..


Patch Set 11:

(13 comments)

Fix style issues with the newly added test.

http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@87
PS11, Line 87: [2, 9, 96, 988, 980, 1000, 944, 1030, 1020, 990, 1010, 957, 
1030, 1027, 9845, 9898],
> flake8: E122 continuation line missing indentation or outdented
Done


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@88
PS11, Line 88: [2, 9, 97, 957, 1008, 1016, 1005, 963, 994, 993, 1018, 1004, 
963, 1014, 10210, 10280],
> flake8: E122 continuation line missing indentation or outdented
Done


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@89
PS11, Line 89: [2, 9, 98, 977, 1024, 1020, 975, 977, 1002, 991, 994, 1006, 977, 
999, 10118, 9923],
> flake8: E122 continuation line missing indentation or outdented
Done


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@90
PS11, Line 90: [2, 9, 99, 986, 1009, 1011, 994, 980, 997, 994, 1002, 997, 980, 
988, 10148, 9987],
> flake8: E122 continuation line missing indentation or outdented
Done


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@91
PS11, Line 91: [2, 9, 99, 995, 996, 1000, 998, 988, 995, 999, 997, 999, 988, 
979, 9974, 9960],
> flake8: E122 continuation line missing indentation or outdented
Done


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@92
PS11, Line 92: [2, 9, 99, 998, 1005, 999, 1003, 994, 1000, 993, 999, 998, 994, 
992, 9899, 9941],
> flake8: E122 continuation line missing indentation or outdented
Done


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@93
PS11, Line 93: [2, 9, 99, 993, 1001, 1007, 1000, 998, 1002, 997, 999, 998, 998, 
999, 9923, 9931],
> flake8: E122 continuation line missing indentation or outdented
Done


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@94
PS11, Line 94: [2, 9, 99, 994, 998, 1002, 1002, 999, 998, 999, 997, 1000, 999, 
997, 9937, 9973],
> flake8: E122 continuation line missing indentation or outdented
Done


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@95
PS11, Line 95: [2, 9, 99, 995, 997, 998, 1001, 999, 1001, 996, 997, 1000, 999, 
998, 9989, 9981],
> flake8: E122 continuation line missing indentation or outdented
Done


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@96
PS11, Line 96: [2, 9, 99, 998, 998, 997, 999, 998, 1000,  998, 1000, 998, 998, 
1000, 1, 10003]
> flake8: E122 continuation line missing indentation or outdented
Done


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@315
PS11, Line 315:   # Verify that each ndv() value (one per column for a 
total of 11) is identical
> line has trailing whitespace
Done


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@316
PS11, Line 316:   # to the corresponding known value. Since NDV() invokes 
Hash64() hash function
> line has trailing whitespace
Done


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@319
PS11, Line 319: -
> flake8: E226 missing whitespace around arithmetic operator
Done



--
To view, visit http://gerrit.cloudera.org:8080/15997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
Gerrit-Change-Number: 15997
Gerrit-PatchSet: 11
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Fri, 05 Jun 2020 14:29:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] [WIP] IMPALA-2658: Extend the NDV function to accept a precision

2020-06-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15997 )

Change subject: [WIP] IMPALA-2658: Extend the NDV function to accept a precision
..


Patch Set 11:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/6218/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/15997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
Gerrit-Change-Number: 15997
Gerrit-PatchSet: 11
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Fri, 05 Jun 2020 01:16:18 +
Gerrit-HasComments: No


[Impala-ASF-CR] [WIP] IMPALA-2658: Extend the NDV function to accept a precision

2020-06-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15997 )

Change subject: [WIP] IMPALA-2658: Extend the NDV function to accept a precision
..


Patch Set 11:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@87
PS11, Line 87: [2, 9, 96, 988, 980, 1000, 944, 1030, 1020, 990, 1010, 957, 
1030, 1027, 9845, 9898],
flake8: E122 continuation line missing indentation or outdented


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@88
PS11, Line 88: [2, 9, 97, 957, 1008, 1016, 1005, 963, 994, 993, 1018, 1004, 
963, 1014, 10210, 10280],
flake8: E122 continuation line missing indentation or outdented


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@89
PS11, Line 89: [2, 9, 98, 977, 1024, 1020, 975, 977, 1002, 991, 994, 1006, 977, 
999, 10118, 9923],
flake8: E122 continuation line missing indentation or outdented


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@90
PS11, Line 90: [2, 9, 99, 986, 1009, 1011, 994, 980, 997, 994, 1002, 997, 980, 
988, 10148, 9987],
flake8: E122 continuation line missing indentation or outdented


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@91
PS11, Line 91: [2, 9, 99, 995, 996, 1000, 998, 988, 995, 999, 997, 999, 988, 
979, 9974, 9960],
flake8: E122 continuation line missing indentation or outdented


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@92
PS11, Line 92: [2, 9, 99, 998, 1005, 999, 1003, 994, 1000, 993, 999, 998, 994, 
992, 9899, 9941],
flake8: E122 continuation line missing indentation or outdented


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@93
PS11, Line 93: [2, 9, 99, 993, 1001, 1007, 1000, 998, 1002, 997, 999, 998, 998, 
999, 9923, 9931],
flake8: E122 continuation line missing indentation or outdented


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@94
PS11, Line 94: [2, 9, 99, 994, 998, 1002, 1002, 999, 998, 999, 997, 1000, 999, 
997, 9937, 9973],
flake8: E122 continuation line missing indentation or outdented


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@95
PS11, Line 95: [2, 9, 99, 995, 997, 998, 1001, 999, 1001, 996, 997, 1000, 999, 
998, 9989, 9981],
flake8: E122 continuation line missing indentation or outdented


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@96
PS11, Line 96: [2, 9, 99, 998, 998, 997, 999, 998, 1000,  998, 1000, 998, 998, 
1000, 1, 10003]
flake8: E122 continuation line missing indentation or outdented


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@96
PS11, Line 96:  
flake8: E241 multiple spaces after ','


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@315
PS11, Line 315: 
flake8: W291 trailing whitespace


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@315
PS11, Line 315:   # Verify that each ndv() value (one per column for a 
total of 11) is identical
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@316
PS11, Line 316:
flake8: W291 trailing whitespace


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@316
PS11, Line 316:   # to the corresponding known value. Since NDV() invokes 
Hash64() hash function
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/15997/11/tests/query_test/test_aggregation.py@319
PS11, Line 319: -
flake8: E226 missing whitespace around arithmetic operator



--
To view, visit http://gerrit.cloudera.org:8080/15997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
Gerrit-Change-Number: 15997
Gerrit-PatchSet: 11
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Fri, 05 Jun 2020 00:30:47 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] [WIP] IMPALA-2658: Extend the NDV function to accept a precision

2020-06-04 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#11). ( 
http://gerrit.cloudera.org:8080/15997 )

Change subject: [WIP] IMPALA-2658: Extend the NDV function to accept a precision
..

[WIP] IMPALA-2658: Extend the NDV function to accept a precision

This work addresses the current limitation in NDV function by
extending the function to take the 2nd integer-typed argument,
which must be an abstract value in the range of 1 to 10. This
abstract value specifies a real precision value used in the HLL
algorithm for the function.

Front end work:
1. Add a new template ndv function in builtin db that takes two
   arguments.
2. Verify that the 2nd argument of a NDV() is an integer literal in
   [1,10];
3. A new method to implement the mapping of the abstract value to the
   hll precision (the real work is TBD);
4. The length of the intermediate data type is computed based on the
   actual hll precision. When the 2nd argument is missing, the length
   is 1024 as in the current implementation;
5. The 2nd argument, if present, will be carried over all the way to
   the BE.

Back end work:
1. Remove the hardcoded precision (10) from these functions:
 AggregateFunctions::HllInit(),
 AggregateFunctions::HllUpdate(),
 AggregateFunctions::HllMerge(),
 AggregateFunctions::HllFinalEstimate(),
 AggregateFunctions::HllFinalize(),
 HllEstimateBias();
2. Instead, the actual precision is computed from the
   length of the intermediate data type as log2(hll_len);
3. Verify that the length of the intermediate data type is
   correct according to the 2nd argument (if present).

Work TDB:
1. Add unit tests;
2. Determine the final mapping of the 10 abstract values to
   10 possible Hll precisions.

Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
---
M be/src/common/logging.h
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M tests/query_test/test_aggregation.py
6 files changed, 302 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/97/15997/11
--
To view, visit http://gerrit.cloudera.org:8080/15997
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I48a4517bd0959f7021143073d37505a46c551a58
Gerrit-Change-Number: 15997
Gerrit-PatchSet: 11
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Sahil Takiar