[Impala-ASF-CR] IMPALA-9941: ExprTest.CastExprs fails when running with ASAN fix

2020-07-13 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16179 )

Change subject: IMPALA-9941: ExprTest.CastExprs fails when running with ASAN fix
..


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/16179/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16179/3//COMMIT_MSG@13
PS3, Line 13: Testxing
typo


http://gerrit.cloudera.org:8080/#/c/16179/3/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/16179/3/be/src/exprs/expr-test.cc@3311
PS3, Line 3311: TestTimestampValue("cast('  \t\r\n  01:05:01   \t\r\n ' 
as timestamp)",
  :   TimestampValue::ParseSimpleDateFormat("01:05:01"));
Does this produce a valid TS?


http://gerrit.cloudera.org:8080/#/c/16179/3/be/src/exprs/expr-test.cc@3313
PS3, Line 3313: //  TestIsNull("cast(' \t\r\n 01:05:01.123456789   \t\r\n ' 
as timestamp)",
  : //  TYPE_TIMESTAMP);
Please don't leave commented code.


http://gerrit.cloudera.org:8080/#/c/16179/3/be/src/exprs/expr-test.cc@3327
PS3, Line 3327: TestIsNull("cast('  \t\r\n  1:5:1\t\r\n' as 
timestamp)", TYPE_TIMESTAMP);
  :   TestIsNull(
  :   "cast('  \t\r\n  1:5:1.12345678\t\r\n' as 
timestamp)", TYPE_TIMESTAMP);
These have nothing to do with IMPALA-6995, please move them to a different 
section of the test file.


http://gerrit.cloudera.org:8080/#/c/16179/3/be/src/runtime/datetime-simple-date-format-parser.cc
File be/src/runtime/datetime-simple-date-format-parser.cc:

http://gerrit.cloudera.org:8080/#/c/16179/3/be/src/runtime/datetime-simple-date-format-parser.cc@346
PS3, Line 346: str[4]
For me the root cause is still a bit unclear. The ASAN build said that for 
certain inputs here we try to read from memory that was actually released. I 
wonder how come we get to this point when we had released 'str' and where it is 
actually released?



--
To view, visit http://gerrit.cloudera.org:8080/16179
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I75bd47b6627a2e338c46dc354f7e67d34c1abbcf
Gerrit-Change-Number: 16179
Gerrit-PatchSet: 3
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 13 Jul 2020 09:24:18 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()

2020-07-13 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16095 )

Change subject: IMPALA-9633: Implement ds_hll_union()
..


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16095/5/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File 
testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16095/5/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@202
PS5, Line 202: 
> Can you also add a union test with hll_sketches_from_hive?
Sure, Done



--
To view, visit http://gerrit.cloudera.org:8080/16095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
Gerrit-Change-Number: 16095
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 13 Jul 2020 07:48:39 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()

2020-07-13 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16095

to look at the new patch set (#7).

Change subject: IMPALA-9633: Implement ds_hll_union()
..

IMPALA-9633: Implement ds_hll_union()

This function receives a set of sketches produced by ds_hll_sketch()
and merges them into a single sketch.

An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table and based on which partition
the user is interested of the relevant sketches can be union-ed
together to get an estimate. E.g.:
  SELECT
  ds_hll_estimate(ds_hll_union(sketch_col))
  FROM sketch_tbl
  WHERE partition_col=1 OR partition_col=5;

Note, currently there is a known limitation of unioning string types
where some input sketches come from Impala and some from Hive. In
this case if there is an overlap in the input data used by Impala and
by Hive this overlapping data is still counted twice due to some
string representation difference between Impala and Hive.
For more details see:
https://issues.apache.org/jira/browse/IMPALA-9939

Testing:
  - Apart from the automated tests I added to this patch I also
tested ds_hll_union() on a bigger dataset to check that
serialization, deserialization and merging steps work well. I
took TPCH25.linelitem, created a number of sketches with grouping
by l_shipdate and called ds_hll_union() on those sketches.

Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-common.cc
A be/src/exprs/datasketches-common.h
M be/src/exprs/datasketches-functions-ir.cc
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M testdata/data/README
A testdata/data/hll_sketches_from_impala.parquet
M testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
M tests/query_test/test_datasketches.py
11 files changed, 271 insertions(+), 29 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/16095/7
--
To view, visit http://gerrit.cloudera.org:8080/16095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
Gerrit-Change-Number: 16095
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()

2020-07-13 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16095

to look at the new patch set (#6).

Change subject: IMPALA-9633: Implement ds_hll_union()
..

IMPALA-9633: Implement ds_hll_union()

This function receives a set of sketches produced by ds_hll_sketch()
and merges them into a single sketch.

An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table and based on which partition
the user is interested of the relevant sketches can be union-ed
together to get an estimate. E.g.:
  SELECT
  ds_hll_estimate(ds_hll_union(sketch_col))
  FROM sketch_tbl
  WHERE partition_col=1 OR partition_col=5;

Note, currently there is a known limitation of unioning string types
where some input sketches come from Impala and some from Hive. In
this case if there is an overlap in the input data used by Impala and
by Hive this overlapping data is still counted twice due to some
string representation difference between Impala and Hive.
For more details see:
https://issues.apache.org/jira/browse/IMPALA-9939

Testing:
  - Apart from the automated tests I added to this patch I also
tested ds_hll_union() on a bigger dataset to check that
serialization, deserialization and merging steps work well. I
took TPCH25.linelitem, created a number of sketches with grouping
by l_shipdate and called ds_hll_union() on those sketches.

Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-common.cc
A be/src/exprs/datasketches-common.h
M be/src/exprs/datasketches-functions-ir.cc
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M testdata/data/README
M testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
M tests/query_test/test_datasketches.py
10 files changed, 271 insertions(+), 29 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/16095/6
--
To view, visit http://gerrit.cloudera.org:8080/16095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
Gerrit-Change-Number: 16095
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()

2020-07-09 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16095 )

Change subject: IMPALA-9633: Implement ds_hll_union()
..


Patch Set 5:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/16095/4/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16095/4/be/src/exprs/aggregate-functions-ir.cc@1736
PS4, Line 1736: tten by DeserializeHllSketch(
> Do these configs matter, or they are overwritten during deserialization? My
These settings are overwritten in DeserializeHllSketch() to repesent the ones 
coming from the serialized sketch. Added a comment.


http://gerrit.cloudera.org:8080/#/c/16095/4/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java:

http://gerrit.cloudera.org:8080/#/c/16095/4/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@1336
PS4, Line 1336: Lists.newArrayList(Type.STRING), Type.STRING, 
Type.STRING,
  : prefix + 
"14DsHllUnionInitEPN10impala_udf15FunctionContextEPNS1_9StringValE",
  : prefix +
  : 
"16DsHllUnionUpdateEPN10impala_udf15FunctionContextERKNS1_9StringValEPS4_",
  : prefix +
  : 
"15DsHllUnionMergeEPN10impala_udf15FunctionContextERKNS1_9StringValEPS4_",
  : prefix +
> nit: indentation
Done


http://gerrit.cloudera.org:8080/#/c/16095/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File 
testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16095/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@163
PS2, Line 163: from hll_sketches_from_hive;
> Add a test where the input data set contains valid sketches and nulls. Expe
Done



--
To view, visit http://gerrit.cloudera.org:8080/16095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
Gerrit-Change-Number: 16095
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Thu, 09 Jul 2020 09:36:18 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()

2020-07-09 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16095

to look at the new patch set (#5).

Change subject: IMPALA-9633: Implement ds_hll_union()
..

IMPALA-9633: Implement ds_hll_union()

This function receives a set of sketches produced by ds_hll_sketch()
and merges them into a single sketch.

An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table and based on which partition
the user is interested of the relevant sketches can be union-ed
together to get an estimate. E.g.:
  SELECT
  ds_hll_estimate(ds_hll_union(sketch_col))
  FROM sketch_tbl
  WHERE partition_col=1 OR partition_col=5;

Testing:
  - Apart from the automated tests I added to this patch I also
tested ds_hll_union() on a bigger dataset to check that
serialization, deserialization and merging steps work well. I
took TPCH25.linelitem, created a number of sketches with grouping
by l_shipdate and called ds_hll_union() on those sketches.

Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-common.cc
A be/src/exprs/datasketches-common.h
M be/src/exprs/datasketches-functions-ir.cc
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
8 files changed, 240 insertions(+), 29 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/16095/5
--
To view, visit http://gerrit.cloudera.org:8080/16095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
Gerrit-Change-Number: 16095
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps

2020-07-08 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/15866 )

Change subject: IMPALA-9531: Dropped support for dateless timestamps
..

IMPALA-9531: Dropped support for dateless timestamps

Removed the support for dateless timestamps.
During dateless timestamp casts if the format doesn't contain
date part we get an error during tokenization of the format.
If the input str doesn't contain a date part then we get null result.

Examples:
select cast('01:02:59' as timestamp);
This will come back as NULL value.

select to_timestamp('01:01:01', 'HH:mm:ss');
select cast('01:02:59' as timestamp format 'HH12:MI:SS');
select cast('12 AM' as timestamp FORMAT 'AM.HH12');
These will come back with a parsing errors.

Casting from a table will generate similar results.

Testing:
Modified the previous tests related to dateless timestamps.
Added test to read fromtables which are still containing dateless
timestamps and covered timestamp to string path when no date tokens
are requested in the output string.

Change-Id: I48c49bf027cc4b917849b3d58518facba372b322
Reviewed-on: http://gerrit.cloudera.org:8080/15866
Tested-by: Impala Public Jenkins 
Reviewed-by: Gabor Kaszab 
---
M be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/benchmarks/parse-timestamp-benchmark.cc
M be/src/exec/text-converter.inline.h
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/expr-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M be/src/exprs/timestamp-functions-ir.cc
M be/src/exprs/timestamp-functions.cc
M be/src/exprs/timestamp-functions.h
M be/src/runtime/date-parse-util.cc
M be/src/runtime/date-test.cc
M be/src/runtime/datetime-iso-sql-format-tokenizer.cc
M be/src/runtime/datetime-parser-common.cc
M be/src/runtime/datetime-parser-common.h
M be/src/runtime/datetime-simple-date-format-parser.cc
M be/src/runtime/datetime-simple-date-format-parser.h
M be/src/runtime/timestamp-parse-util.cc
M be/src/runtime/timestamp-test.cc
M be/src/runtime/timestamp-value.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/test/java/org/apache/impala/analysis/AnalyzeKuduDDLTest.java
M testdata/data/README
A testdata/data/dateless_timestamps.parq
A testdata/data/dateless_timestamps.txt
M testdata/data/lazy_timestamp.csv
M testdata/workloads/functional-query/queries/QueryTest/date.test
A 
testdata/workloads/functional-query/queries/QueryTest/dateless_timestamp_parquet.test
A 
testdata/workloads/functional-query/queries/QueryTest/dateless_timestamp_text.test
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
M 
testdata/workloads/functional-query/queries/QueryTest/select-lazy-timestamp.test
M tests/data_errors/test_data_errors.py
M tests/query_test/test_cast_with_format.py
M tests/query_test/test_scanners.py
34 files changed, 278 insertions(+), 231 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Gabor Kaszab: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/15866
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I48c49bf027cc4b917849b3d58518facba372b322
Gerrit-Change-Number: 15866
Gerrit-PatchSet: 14
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps

2020-07-08 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15866 )

Change subject: IMPALA-9531: Dropped support for dateless timestamps
..


Patch Set 13: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/15866
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I48c49bf027cc4b917849b3d58518facba372b322
Gerrit-Change-Number: 15866
Gerrit-PatchSet: 13
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Wed, 08 Jul 2020 19:32:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps

2020-07-08 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15866 )

Change subject: IMPALA-9531: Dropped support for dateless timestamps
..


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15866/11/testdata/workloads/functional-query/queries/QueryTest/dateless_timestamp_text.test
File 
testdata/workloads/functional-query/queries/QueryTest/dateless_timestamp_text.test:

http://gerrit.cloudera.org:8080/#/c/15866/11/testdata/workloads/functional-query/queries/QueryTest/dateless_timestamp_text.test@8
PS11, Line 8: localhost:20500
Note, the dockerised tests fail because the hardcoded localhost.



--
To view, visit http://gerrit.cloudera.org:8080/15866
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I48c49bf027cc4b917849b3d58518facba372b322
Gerrit-Change-Number: 15866
Gerrit-PatchSet: 11
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Wed, 08 Jul 2020 06:23:30 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()

2020-07-07 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16095

to look at the new patch set (#3).

Change subject: IMPALA-9633: Implement ds_hll_union()
..

IMPALA-9633: Implement ds_hll_union()

This function receives a set of sketches produced by ds_hll_sketch()
and merges them into a single sketch.

An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table and based on which partition
the user is interested of the relevant sketches can be union-ed
together to get an estimate. E.g.:
  SELECT
  ds_hll_estimate(ds_hll_union(sketch_col))
  FROM sketch_tbl
  WHERE partition_col=1 OR partition_col=5;

Testing:
  - Apart from the automated tests I added to this patch I also
tested ds_hll_union() on a bigger dataset to check that
serialization, deserialization and merging steps work well. I
took TPCH25.linelitem, created a number of sketches with grouping
by l_shipdate and called ds_hll_union() on those sketches.

Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-common.cc
A be/src/exprs/datasketches-common.h
M be/src/exprs/datasketches-functions-ir.cc
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
8 files changed, 232 insertions(+), 29 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/16095/3
--
To view, visit http://gerrit.cloudera.org:8080/16095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
Gerrit-Change-Number: 16095
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-07-07 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Reviewed-on: http://gerrit.cloudera.org:8080/16000
Tested-by: Impala Public Jenkins 
Reviewed-by: Csaba Ringhofer 
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M testdata/data/README
A testdata/data/hll_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
17 files changed, 505 insertions(+), 8 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Csaba Ringhofer: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 17
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps

2020-07-07 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15866 )

Change subject: IMPALA-9531: Dropped support for dateless timestamps
..


Patch Set 11: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/15866
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I48c49bf027cc4b917849b3d58518facba372b322
Gerrit-Change-Number: 15866
Gerrit-PatchSet: 11
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 07 Jul 2020 13:08:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-07-07 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..


Patch Set 15:

Apparently, test_ddl.py/test_functions_ddl failed with IllegalStateException 
because for unsupported functions there were some unset members.


--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 15
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 07 Jul 2020 08:57:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-07-07 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#15).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M testdata/data/README
A testdata/data/hll_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
17 files changed, 505 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/15
--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 15
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-07-06 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#12).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M testdata/data/README
A testdata/data/hll_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
17 files changed, 503 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/12
--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 12
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-07-06 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#11).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M testdata/data/README
A testdata/data/hll_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
17 files changed, 504 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/11
--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 11
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps

2020-07-05 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15866 )

Change subject: IMPALA-9531: Dropped support for dateless timestamps
..


Patch Set 8: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/15866
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I48c49bf027cc4b917849b3d58518facba372b322
Gerrit-Change-Number: 15866
Gerrit-PatchSet: 8
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 06 Jul 2020 05:28:41 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps

2020-06-26 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15866 )

Change subject: IMPALA-9531: Dropped support for dateless timestamps
..


Patch Set 6:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15866/6/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

http://gerrit.cloudera.org:8080/#/c/15866/6/be/src/exprs/timestamp-functions.cc@154
PS6, Line 154: void UnixAndFromUnixPrepare(FunctionContext* context,
This function is not included in the header because you didn't want to include 
the whole datatime-common for CastDirection, right? Could you mention it here?


http://gerrit.cloudera.org:8080/#/c/15866/6/be/src/runtime/datetime-simple-date-format-parser.h
File be/src/runtime/datetime-simple-date-format-parser.h:

http://gerrit.cloudera.org:8080/#/c/15866/6/be/src/runtime/datetime-simple-date-format-parser.h@87
PS6, Line 87: CastDirection cast_mode = FORMAT
This still has a default value. (I saw that you actually added the parameters 
on the callsites.)



--
To view, visit http://gerrit.cloudera.org:8080/15866
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I48c49bf027cc4b917849b3d58518facba372b322
Gerrit-Change-Number: 15866
Gerrit-PatchSet: 6
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Fri, 26 Jun 2020 13:58:38 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-06-26 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..


Patch Set 10:

PS10 is rebase with master and conflict resolution.


--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 10
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Fri, 26 Jun 2020 12:27:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-06-26 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#10).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
14 files changed, 477 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/10
--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 10
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-06-26 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#9).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
14 files changed, 477 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/9
--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()

2020-06-26 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16095 )

Change subject: IMPALA-9633: Implement ds_hll_union()
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16095/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File 
testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16095/2/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@163
PS2, Line 163:  QUERY
Add a test where the input data set contains valid sketches and nulls. 
Expectation is that nulls are ignored.



--
To view, visit http://gerrit.cloudera.org:8080/16095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
Gerrit-Change-Number: 16095
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Fri, 26 Jun 2020 07:04:58 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-06-24 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#8).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
14 files changed, 471 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/8
--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 8
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-06-24 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..


Patch Set 7:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc@1578
PS5, Line 1578: void AggregateFunctions::DsHllUpdate(
> I have experimented a bit with upstream Hive and it turned out that (apart
I have found a way to disable the unsupported parameter types.
Done


http://gerrit.cloudera.org:8080/#/c/16000/6/be/src/exprs/datasketches-functions-ir.cc
File be/src/exprs/datasketches-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16000/6/be/src/exprs/datasketches-functions-ir.cc@37
PS6, Line 37: ctx->SetError("Unable to deserialize sketch
> I would vote on treating this always as error. We can relax this later if i
Sure, changed this to give error unconditionally.


http://gerrit.cloudera.org:8080/#/c/16000/5/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File 
testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16000/5/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@64
PS5, Line 64: ---
> Hive seems to return NULL if there are no non-NULL values. I think that the
Apparently it wasn't complicated to take care.
Done.


http://gerrit.cloudera.org:8080/#/c/16000/6/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File 
testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16000/6/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@121
PS6, Line 121:
> typo
Dropped this part of the test due to another change.



--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Wed, 24 Jun 2020 15:47:25 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-06-24 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#7).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
14 files changed, 473 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/7
--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()

2020-06-18 Thread Gabor Kaszab (Code Review)
Hello Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16095

to look at the new patch set (#2).

Change subject: IMPALA-9633: Implement ds_hll_union()
..

IMPALA-9633: Implement ds_hll_union()

This function receives a set of sketches produced by ds_hll_sketch()
and merges them into a single sketch.

An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table and based on which partition
the user is interested of the relevant sketches can be union-ed
together to get an estimate. E.g.:
  SELECT
  ds_hll_estimate(ds_hll_union(sketch_col))
  FROM sketch_tbl
  WHERE partition_col=1 OR partition_col=5;

Testing:
  - Apart from the automated tests I added to this patch I also
tested ds_hll_union() on a bigger dataset to check that
serialization, deserialization and merging steps work well. I
took TPCH25.linelitem, created a number of sketches with grouping
by l_shipdate and called ds_hll_union() on those sketches.

Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-common.cc
A be/src/exprs/datasketches-common.h
M be/src/exprs/datasketches-functions-ir.cc
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
8 files changed, 244 insertions(+), 34 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/16095/2
--
To view, visit http://gerrit.cloudera.org:8080/16095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
Gerrit-Change-Number: 16095
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9633: Implement ds hll union()

2020-06-18 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/16095


Change subject: IMPALA-9633: Implement ds_hll_union()
..

IMPALA-9633: Implement ds_hll_union()

This function receives a set of sketches produced by ds_hll_sketch()
and merges them into a single sketch.

An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table and based on which partition
the user is interested of the relevant sketches can be union-ed
together to get an estimate. E.g.:
  SELECT
  ds_hll_estimate(ds_hll_union(sketch_col))
  FROM sketch_tbl
  WHERE partition_col=1 OR partition_col=5;

Testing:
  - Apart from the automated tests I added to this patch I also
tested ds_hll_union() on a bigger dataset to check that
serialization, deserialization and merging steps work well. I
took TPCH25.linelitem, created a number of sketches with grouping
by l_shipdate and called ds_hll_union() on those sketches.

Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-common.cc
A be/src/exprs/datasketches-common.h
M be/src/exprs/datasketches-functions-ir.cc
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
8 files changed, 243 insertions(+), 34 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/16095/1
--
To view, visit http://gerrit.cloudera.org:8080/16095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
Gerrit-Change-Number: 16095
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab 


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-06-17 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc@1578
PS5, Line 1578: void AggregateFunctions::DsHllUpdate(
> Don't support decimal in the first iteration. Open a Jira for further suppo
Removed decimal support, however, Impala converts them automatically to 
bigintval and still executes update(). There isn't much I can do about it in my 
opinion.


http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc@1611
PS5, Line 1611:   union_sketch.update(src_sketch);
> Don't support timestamp in the first iteration. Open a Jira for further sup
Timestamp support removed.


http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions.h
File be/src/exprs/aggregate-functions.h:

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions.h@210
PS5, Line 210: S_SKETCH_CONFIG = 12
> this is 12 in Hive
Changed this to 12.



--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Wed, 17 Jun 2020 11:36:33 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-06-17 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#6).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
12 files changed, 452 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/6
--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-06-16 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc@1578
PS5, Line 1578: void AggregateFunctions::DsHllUpdate(
Don't support decimal in the first iteration. Open a Jira for further support.


http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc@1611
PS5, Line 1611: void AggregateFunctions::DsHllUpdate(
Don't support timestamp in the first iteration. Open a Jira for further support.


http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions.h
File be/src/exprs/aggregate-functions.h:

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions.h@210
PS5, Line 210: S_SKETCH_CONFIG = 11
this is 12 in Hive


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File 
testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@142
PS3, Line 142:
> We currently give an error for types not supported for ds_hll_functions. E.
Update: Implemented that I return null and a warning when abort_on_error=false 
and return an error otherwise.



--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 16 Jun 2020 15:23:55 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps

2020-06-16 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15866 )

Change subject: IMPALA-9531: Dropped support for dateless timestamps
..


Patch Set 4:

(8 comments)

Nice work! I feel that this patch is getting into shape. I have some follow-up 
comments but none of the seems super difficult to address.

http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/exprs/timestamp-functions-ir.cc
File be/src/exprs/timestamp-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/exprs/timestamp-functions-ir.cc@167
PS3, Line 167: dt_ctx->Reset(reinterpret_casthttp://gerrit.cloudera.org:8080/#/c/15866/4/be/src/exprs/timestamp-functions-ir.cc
File be/src/exprs/timestamp-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/15866/4/be/src/exprs/timestamp-functions-ir.cc@168
PS4, Line 168: if (!SimpleDateFormatTokenizer::Tokenize(dt_ctx, true, PARSE)) {
Anyway, if you tokenize the format here then you do the tokenization for each 
row of the table even though it would be enough to do once before Impala starts 
reading rows.


http://gerrit.cloudera.org:8080/#/c/15866/4/be/src/exprs/timestamp-functions.h
File be/src/exprs/timestamp-functions.h:

http://gerrit.cloudera.org:8080/#/c/15866/4/be/src/exprs/timestamp-functions.h@104
PS4, Line 104: bool format
Please use datetime_parse_util::CastDirection


http://gerrit.cloudera.org:8080/#/c/15866/4/be/src/exprs/timestamp-functions.cc
File be/src/exprs/timestamp-functions.cc:

http://gerrit.cloudera.org:8080/#/c/15866/4/be/src/exprs/timestamp-functions.cc@155
PS4, Line 155: bool format = true
Why don't you use the parameter type that could be directly passed to 
Tokenize()?


http://gerrit.cloudera.org:8080/#/c/15866/4/be/src/exprs/timestamp-functions.cc@167
PS4, Line 167: if (format)
 :   parse_result = SimpleDateFormatTokenizer::Tokenize(dt_ctx, 
true, FORMAT);
 : else
 :   parse_result = SimpleDateFormatTokenizer::Tokenize(dt_ctx, 
true, PARSE);
Please check how to format if-else in Impala


http://gerrit.cloudera.org:8080/#/c/15866/4/be/src/runtime/datetime-simple-date-format-parser.h
File be/src/runtime/datetime-simple-date-format-parser.h:

http://gerrit.cloudera.org:8080/#/c/15866/4/be/src/runtime/datetime-simple-date-format-parser.h@88
PS4, Line 88: CastDirection cast_mode = FORMAT
I wouldn't give this a default value unless you have any specific reason making 
it this way.


http://gerrit.cloudera.org:8080/#/c/15866/4/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/15866/4/testdata/data/README@503
PS4, Line 503: Timestamp_parquet_test
The file name is not that verbose. After reading it it doesn't give any hints 
what timestamp tests it is used for. "dateless_timestamps.parq" would be better 
for this purpose.

nit: No need to include "parquet" into the file name as the extension already 
expresses the format.
nit: The file name itself starts with a lowercase letter.


http://gerrit.cloudera.org:8080/#/c/15866/4/testdata/data/timestamp_text_test.txt
File testdata/data/timestamp_text_test.txt:

http://gerrit.cloudera.org:8080/#/c/15866/4/testdata/data/timestamp_text_test.txt@1
PS4, Line 1: 1996-04-22 10:00:00.43210
Same comment for the file name as for the parquet file: 
"dateless_timestamps.txt" would be more verbose.



--
To view, visit http://gerrit.cloudera.org:8080/15866
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I48c49bf027cc4b917849b3d58518facba372b322
Gerrit-Change-Number: 15866
Gerrit-PatchSet: 4
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 16 Jun 2020 15:12:25 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9482 Support for BINARY columns

2020-06-16 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: WIP IMPALA-9482 Support for BINARY columns
..


Patch Set 2:

(8 comments)

I did a go through on this patch and left some minor comments. However, I feel 
that a second look from someone more competent to introducing new types should 
take a look. I see that there are a high number of places you had to touch but 
I can't judge if there is anything else missing.

http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@47
PS2, Line 47: Testing
Do you plan to do interop tests between Hive and Impala?


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h
File be/src/exec/text-converter.h:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h@63
PS2, Line 63: col_desc
Could you mention the new param in the comment above?


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h@88
PS2, Line 88: col_desc
Could you mention the new param in the comment above?


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc
File be/src/runtime/types.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@180
PS2, Line 180:   // ODBC driver requires types in lower case
nit: I'd move this comment right before the switch to L186.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/service/hs2-util.cc@729
PS2, Line 729: // HiveServer2 does not have a type for invalid, date, datetime 
or
 :   // fixed_uda_intermediate.
Shouldn't you change this comment to reflect that now BINARY can also trigger 
this branch?


http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
File fe/src/main/java/org/apache/impala/analysis/CastExpr.java:

http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@341
PS2, Line 341: boolean isStringBinaryCast =
This can go after the Preconditions check, L347.


http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/2/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java@263
PS2, Line 263: For tables with unsupported primitive types (e.g., binary)
This comment is no longer valid, right?


http://gerrit.cloudera.org:8080/#/c/16066/2/tests/common/impala_connection.py
File tests/common/impala_connection.py:

http://gerrit.cloudera.org:8080/#/c/16066/2/tests/common/impala_connection.py@426
PS2, Line 426: `
nit: typo



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 16 Jun 2020 13:39:03 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-06-12 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#5).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=11.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
12 files changed, 493 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/5
--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-06-08 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..


Patch Set 4:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/16000/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16000/3//COMMIT_MSG@30
PS3, Line 30: In terms of correctness DataSketches HLL is most of the time in 2%
: range from the correct result but there are occasional spikes 
where
: the difference is bigger but never goes out of the range of 5%.
> A link to https://datasketches.apache.org/docs/HLL/HLL.html could be useful
Sure, done.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1593
PS3, Line 1593: sketch_ptr->update((void*)(), 
sizeof(impala::int128_t));
> We don't need to use the convenient typed update function, we can simply ha
Good point! I don't have to loose precision here for int128 as casting it to 
void* and giving the length to update() would also work.
I can also get rid of this warning/error functionality then.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1595
PS3, Line 1595: }
> I think that this can be raised a few time in a query if more than one inst
Indeed, I just tried this locally on a bigger dataset and apparently there is 
one warning per instance. In my opinion we can leave it as it is as the purpose 
here was not to flood the screen with one error per row.
Modified the comment, though.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1625
PS3, Line 1625:
  : StringVal AggregateFunctions::DsHllSerialize(FunctionContext* 
ctx,
  : const StringVal& src) {
> Does Hive also convert to double (as seconds from Unix epoch)? Unless we ar
Leave this question open until we find out how Hive tackles the same if it does.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1650
PS3, Line 1650:   union_sketch.update(*dst_sketch_ptr);
  :
> optional: I was looking for ways to avoid the 2 copies that occur here and
I found that there is another interface for deserialization:
deserialize(const void* bytes, size_t len)
I can use this to get rid of the 2 copies.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1684
PS3, Line 1684:   static const uint32
> It would be nice to add a comment somewhere that describes the way hll_sket
Sure, added a comment about this to DsHllInit where I first allocate memory for 
hll_sketch.


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File 
testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@18
PS3, Line 18: ds_hll_estimate(ds_hll_sketch(timestamp_col))
> whitespace
Done


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@21
PS3, Line 21: 100
> Are you sure that this won't be flaky? The value is the same as with count
I've run this a few times and appeared stable. In theory until a certain size 
of data HLL will pre-populate its cache and during this phase it will give 
correct results. It will start to approximate only after this certain size.

Apparently, alltypessmall is small enough to always give the correct result;


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@90
PS3, Line 90: insert into sketch_store
> Do you know if this is also the case in Hive?
This error scenario could be dropped using the update(void* bytes, size_t 
length) interface.


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@115
PS3, Line 115: where date_string_col = "03/01/09";
> Can you mention that this will change in the future when BINARY will be imp
Done


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@142
PS3, Line 142:
> Do you know how Hive handles this? I would consider returning an error inst
We currently give an error for types not supported for ds_hll_functions. E.g. 
ds_hll_estimate() only accepts strings and for other types it gives an 
AnalysisException.
For strings we can decide if it's a serialized sketch only when we read that 
particular row so we can't return earlier. What I can do here is to write a 
warning or give an error depending on how ABORT_ON_ERROR is set.
What do you think?



--
To view, visit 

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-06-05 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#4).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=11.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
12 files changed, 465 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/4
--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 4
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9820: Pull Datasketches-5 HLL MurmurHash fix

2020-06-02 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/16026


Change subject: IMPALA-9820: Pull Datasketches-5 HLL MurmurHash fix
..

IMPALA-9820: Pull Datasketches-5 HLL MurmurHash fix

There is a bug in DataSketches HLL MurmurHash where long strings are
over-read resulting a cardinality estimate that is more than 15% off
from the correct cardinality number. A recent upstream fix in Apache
DataSketches addresses this issue and this patch pulls it to Impala.

https://issues.apache.org/jira/browse/DATASKETCHES-5

Testing:
  - I used ds_hll_sketch() and ds_hll_estimate() functions from
IMPALA-9632 to trigger DataSketches HLL functionality.
  - Ran DataSketches HLL on lineitem.l_comment in TPCH25_parquet to
reproduce the issue. The symptom was that the actual result was
around 15% off from the correct cardinality result (~69M vs 79M).
  - After applying this fix re-running the query gives much closer
results, usually under 3% error range.

Change-Id: I84d73fce1e7a197c1f8fb49404b58ed9bb0b843d
---
M be/src/thirdparty/datasketches/MurmurHash3.h
1 file changed, 3 insertions(+), 8 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/16026/1
--
To view, visit http://gerrit.cloudera.org:8080/16026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I84d73fce1e7a197c1f8fb49404b58ed9bb0b843d
Gerrit-Change-Number: 16026
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab 


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-06-02 Thread Gabor Kaszab (Code Review)
Hello Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#3).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
12 files changed, 509 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/3
--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-05-29 Thread Gabor Kaszab (Code Review)
Hello Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#2).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
12 files changed, 492 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/2
--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

2020-05-29 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/16000


Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
..

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
12 files changed, 494 insertions(+), 8 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/1
--
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab 


[Impala-ASF-CR] IMPALA-9531: Dropped support for dateless timestamps

2020-05-27 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15866 )

Change subject: IMPALA-9531: Dropped support for dateless timestamps
..


Patch Set 3:

(18 comments)

Thanks for taking care of my comments! In general the casting part of this 
change is getting into shape in my opinion. However, I miss the part where the 
timestamp is not a result of a cast() or to_timestamp() or such, and it comes 
from reading a table. e.g. I don't see now any changes that could prevent us 
reading dateless timestamp when we run e.g. "select ts_col from table_name".

We discussed offline that some of the file formats are unable to hold these 
values because only Hive can write them and there Hive doesn't allow writing 
dateless TSs but for the ones where Impala can write them it should at leas be 
covered by tests that a previously written file with dateless timestamps can be 
read after this changes and the dateless TSs will be nulls.

http://gerrit.cloudera.org:8080/#/c/15866/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15866/3//COMMIT_MSG@15
PS3, Line 15: select to_timestamp('01:01:01', 'HH:mm:ss');
This should return parse error because the format itself doesn't contain date 
just time.


http://gerrit.cloudera.org:8080/#/c/15866/3//COMMIT_MSG@17
PS3, Line 17: values
nit: dot at the end of a sentence.


http://gerrit.cloudera.org:8080/#/c/15866/3//COMMIT_MSG@17
PS3, Line 17: This
nit: This should be plural in my opinion as it stands for 2 separate examples.


http://gerrit.cloudera.org:8080/#/c/15866/3//COMMIT_MSG@28
PS3, Line 28: Testing
I still don't see tests where you use a file populated by dateless timestamps 
(created by an Impala pre this change) and try to read it with Impala 
containing your changes.


http://gerrit.cloudera.org:8080/#/c/15866/3//COMMIT_MSG@29
PS3, Line 29: modified
nit: start a sentence with capital letter.


http://gerrit.cloudera.org:8080/#/c/15866/3//COMMIT_MSG@29
PS3, Line 29: test
nit: tests


http://gerrit.cloudera.org:8080/#/c/15866/3//COMMIT_MSG@30
PS3, Line 30: backwards
It's just a matter of your point of view which direction you call backwards so 
I wouldn't use that word here. Simply "Added test to cover timestamp to string 
path when no date tokens are requested in the output string" or something like 
that.


http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/exprs/expr-test.cc@4507
PS3, Line 4507: 1.00
Is this related to your change or does it come with a rebase you've made in the 
background?
If the second, then a general comment is to try to push rebase related changes 
to gerrit separately from your actual changes to make the reviewers life easier.


http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/exprs/expr-test.cc@7536
PS3, Line 7536: HH:mm:ss
Actually, we should get an error when parsing the format and see that there is 
no date part.


http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/runtime/datetime-simple-date-format-parser.h
File be/src/runtime/datetime-simple-date-format-parser.h:

http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/runtime/datetime-simple-date-format-parser.h@88
PS3, Line 88: parse
Can you re-use the CastDirection type similarly as it is used in e.g. ISO SQL 
format tokenizer?


http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/runtime/datetime-simple-date-format-parser.cc
File be/src/runtime/datetime-simple-date-format-parser.cc:

http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/runtime/datetime-simple-date-format-parser.cc@179
PS3, Line 179:   //return (dt_ctx->has_date_toks || dt_ctx->has_time_toks);
Please don't leave commented code


http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/runtime/datetime-simple-date-format-parser.cc@180
PS3, Line 180:   return parse?(dt_ctx->has_date_toks):(dt_ctx->has_date_toks || 
dt_ctx->has_time_toks);
This seems logically correct but not that readable. Additionally, this is not 
how we format ternary operators in Impala:
 - Use space around the question marks
 - As the condition is simple no need to surround with brackets.
 - Use space around the colon

Anyway, I feel that rewriting to something like this would help a bit:

if (parse) return dt_ctx->has_date_toks;
return dt_ctx->has_date_toks || dt_ctx->has_time_toks;

Or something similar.


http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/runtime/timestamp-test.cc
File be/src/runtime/timestamp-test.cc:

http://gerrit.cloudera.org:8080/#/c/15866/3/be/src/runtime/timestamp-test.cc@610
PS3, Line 610: (TimestampTC("-MM-dd HH:m:ss", "2020-05-11 1:24:34", 
false, true))
Here the intention was to also cover the case when the minute part of the input 
is 1 char long, right? Instead this tests that the hour part is 1 long.
Additionally, shouldn't you also give params to test the expected result?



[Impala-ASF-CR] Revert "IMPALA-8980: Remove functional*.alltypesinsert from EE tests"

2020-05-11 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15902 )

Change subject: Revert "IMPALA-8980: Remove functional*.alltypesinsert from EE 
tests"
..


Patch Set 1: Code-Review-1

Tamas just submitted a patch to fix whatever went wrong by IMPALA-8980. Can we 
give it a chance and let's see if this fixes the build?

Btw, if you go for a revert, these are the follow up changes that went in for 
IMPALA-8980:
https://gerrit.cloudera.org/#/c/15766/
https://gerrit.cloudera.org/#/c/15816/


--
To view, visit http://gerrit.cloudera.org:8080/15902
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3d9acba85b1082502b2bc62acc76116de5750d30
Gerrit-Change-Number: 15902
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Tue, 12 May 2020 05:13:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9680: Fixed compressed inserts failing

2020-05-06 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15816 )

Change subject: IMPALA-9680: Fixed compressed inserts failing
..


Patch Set 5: Code-Review+2

Carry-over +2 from Csaba


--
To view, visit http://gerrit.cloudera.org:8080/15816
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib3c7ba02190f57a7ed40311c95a3dd9eca9b474d
Gerrit-Change-Number: 15816
Gerrit-PatchSet: 5
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Wed, 06 May 2020 14:15:40 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9631: Import HLL functionality from DataSketches

2020-04-21 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15746

to look at the new patch set (#5).

Change subject: IMPALA-9631: Import HLL functionality from DataSketches
..

IMPALA-9631: Import HLL functionality from DataSketches

This patch imports the functionality needed for HLL approximate
algorithm from Apache DataSketches. I decided to copy the necessary
files into be/src/thirdparty/datasketches. Note, that the original
structure of files was changed during this process as originally hll/
and common/ libraries were both affected but I copied these into the
same directory so that Impala can compile them without rewriting the
include paths in the files themselves. Also note, that not the whole
common/ directory was copied just the files needed for HLL.

The git hash of the snapshot I used as a source for the files:
a6265b307a03085abe26c20413fdbf7d7a5eaf29

Browse the source files here:
https://github.com/apache/incubator-datasketches-cpp

Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf
---
M be/src/exprs/CMakeLists.txt
A be/src/exprs/datasketches-test.cc
A be/src/thirdparty/datasketches/AuxHashMap-internal.hpp
A be/src/thirdparty/datasketches/AuxHashMap.hpp
A be/src/thirdparty/datasketches/CommonUtil.hpp
A be/src/thirdparty/datasketches/CompositeInterpolationXTable-internal.hpp
A be/src/thirdparty/datasketches/CompositeInterpolationXTable.hpp
A be/src/thirdparty/datasketches/CouponHashSet-internal.hpp
A be/src/thirdparty/datasketches/CouponHashSet.hpp
A be/src/thirdparty/datasketches/CouponList-internal.hpp
A be/src/thirdparty/datasketches/CouponList.hpp
A be/src/thirdparty/datasketches/CubicInterpolation-internal.hpp
A be/src/thirdparty/datasketches/CubicInterpolation.hpp
A be/src/thirdparty/datasketches/HarmonicNumbers-internal.hpp
A be/src/thirdparty/datasketches/HarmonicNumbers.hpp
A be/src/thirdparty/datasketches/Hll4Array-internal.hpp
A be/src/thirdparty/datasketches/Hll4Array.hpp
A be/src/thirdparty/datasketches/Hll6Array-internal.hpp
A be/src/thirdparty/datasketches/Hll6Array.hpp
A be/src/thirdparty/datasketches/Hll8Array-internal.hpp
A be/src/thirdparty/datasketches/Hll8Array.hpp
A be/src/thirdparty/datasketches/HllArray-internal.hpp
A be/src/thirdparty/datasketches/HllArray.hpp
A be/src/thirdparty/datasketches/HllSketch-internal.hpp
A be/src/thirdparty/datasketches/HllSketchImpl-internal.hpp
A be/src/thirdparty/datasketches/HllSketchImpl.hpp
A be/src/thirdparty/datasketches/HllSketchImplFactory.hpp
A be/src/thirdparty/datasketches/HllUnion-internal.hpp
A be/src/thirdparty/datasketches/HllUtil.hpp
A be/src/thirdparty/datasketches/LICENSE
A be/src/thirdparty/datasketches/MurmurHash3.h
A be/src/thirdparty/datasketches/README.md
A be/src/thirdparty/datasketches/RelativeErrorTables-internal.hpp
A be/src/thirdparty/datasketches/RelativeErrorTables.hpp
A be/src/thirdparty/datasketches/coupon_iterator-internal.hpp
A be/src/thirdparty/datasketches/coupon_iterator.hpp
A be/src/thirdparty/datasketches/hll.hpp
A be/src/thirdparty/datasketches/hll.private.hpp
A be/src/thirdparty/datasketches/inv_pow2_table.hpp
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
41 files changed, 7,381 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/15746/5
--
To view, visit http://gerrit.cloudera.org:8080/15746
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf
Gerrit-Change-Number: 15746
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9631: Import HLL functionality from DataSketches

2020-04-21 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15746

to look at the new patch set (#4).

Change subject: IMPALA-9631: Import HLL functionality from DataSketches
..

IMPALA-9631: Import HLL functionality from DataSketches

This patch imports the functionality needed for HLL approximate
algorithm from Apache DataSketches. I decided to copy the necessary
files into be/src/thirdparty/datasketches. Note, that the original
structure of files was changed during this process as originally hll/
and common/ libraries were both affected but I copied these into the
same directory so that Impala can compile them without rewriting the
include paths in the files themselves. Also note, that not the whole
common/ directory was copied just the files needed for HLL.

The git hash of the snapshot I used as a source for the files:
a6265b307a03085abe26c20413fdbf7d7a5eaf29

Browse the source files here:
https://github.com/apache/incubator-datasketches-cpp

Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf
---
M .clang-tidy
M be/src/exprs/CMakeLists.txt
A be/src/exprs/datasketches-test.cc
A be/src/thirdparty/datasketches/AuxHashMap-internal.hpp
A be/src/thirdparty/datasketches/AuxHashMap.hpp
A be/src/thirdparty/datasketches/CommonUtil.hpp
A be/src/thirdparty/datasketches/CompositeInterpolationXTable-internal.hpp
A be/src/thirdparty/datasketches/CompositeInterpolationXTable.hpp
A be/src/thirdparty/datasketches/CouponHashSet-internal.hpp
A be/src/thirdparty/datasketches/CouponHashSet.hpp
A be/src/thirdparty/datasketches/CouponList-internal.hpp
A be/src/thirdparty/datasketches/CouponList.hpp
A be/src/thirdparty/datasketches/CubicInterpolation-internal.hpp
A be/src/thirdparty/datasketches/CubicInterpolation.hpp
A be/src/thirdparty/datasketches/HarmonicNumbers-internal.hpp
A be/src/thirdparty/datasketches/HarmonicNumbers.hpp
A be/src/thirdparty/datasketches/Hll4Array-internal.hpp
A be/src/thirdparty/datasketches/Hll4Array.hpp
A be/src/thirdparty/datasketches/Hll6Array-internal.hpp
A be/src/thirdparty/datasketches/Hll6Array.hpp
A be/src/thirdparty/datasketches/Hll8Array-internal.hpp
A be/src/thirdparty/datasketches/Hll8Array.hpp
A be/src/thirdparty/datasketches/HllArray-internal.hpp
A be/src/thirdparty/datasketches/HllArray.hpp
A be/src/thirdparty/datasketches/HllSketch-internal.hpp
A be/src/thirdparty/datasketches/HllSketchImpl-internal.hpp
A be/src/thirdparty/datasketches/HllSketchImpl.hpp
A be/src/thirdparty/datasketches/HllSketchImplFactory.hpp
A be/src/thirdparty/datasketches/HllUnion-internal.hpp
A be/src/thirdparty/datasketches/HllUtil.hpp
A be/src/thirdparty/datasketches/LICENSE
A be/src/thirdparty/datasketches/MurmurHash3.h
A be/src/thirdparty/datasketches/README.md
A be/src/thirdparty/datasketches/RelativeErrorTables-internal.hpp
A be/src/thirdparty/datasketches/RelativeErrorTables.hpp
A be/src/thirdparty/datasketches/coupon_iterator-internal.hpp
A be/src/thirdparty/datasketches/coupon_iterator.hpp
A be/src/thirdparty/datasketches/hll.hpp
A be/src/thirdparty/datasketches/hll.private.hpp
A be/src/thirdparty/datasketches/inv_pow2_table.hpp
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
42 files changed, 7,382 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/15746/4
--
To view, visit http://gerrit.cloudera.org:8080/15746
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf
Gerrit-Change-Number: 15746
Gerrit-PatchSet: 4
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9631: Import HLL functionality from DataSketches

2020-04-20 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15746 )

Change subject: IMPALA-9631: Import HLL functionality from DataSketches
..


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/15746/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15746/2//COMMIT_MSG@7
PS2, Line 7: IMPALA-9631: Import HLL functionality from DataSketches
> Do you plan to merge this patch as it is? I would consider it more of a POC
Well, I don't see a difference of merging it now of waiting for other 
development work and start the review then. Btw, this unlock other steps under 
the umbrella Jira: https://issues.apache.org/jira/browse/IMPALA-9593
Additionally, waiting for the rest of the functionality means that splitting up 
these tasks has no point and we could have kept a single Jira for everything 
HLL related.

So I'd take this as a non-POC review now and we still can make changes later on 
if we want to modify the build structure or drop the entire library for some 
reason.


http://gerrit.cloudera.org:8080/#/c/15746/2//COMMIT_MSG@9
PS2, Line 9: This patch imports the functionality needed for HLL approximate
> Can you add this information in a README.md? To the datasketches folder?
Done


http://gerrit.cloudera.org:8080/#/c/15746/2//COMMIT_MSG@11
PS2, Line 11: files into be/src/thirdparty/datasketches. Note, that the original
> nit: wrap at 72
Done


http://gerrit.cloudera.org:8080/#/c/15746/2/be/src/exprs/datasketches-test.cc
File be/src/exprs/datasketches-test.cc:

http://gerrit.cloudera.org:8080/#/c/15746/2/be/src/exprs/datasketches-test.cc@27
PS2, Line 27: algorith
> typo: algorithm
Done


http://gerrit.cloudera.org:8080/#/c/15746/2/be/src/exprs/datasketches-test.cc@38
PS2, Line 38:   std::stringstream sketch_stream1;
:   std::stringstream sketch_stream2;
> I would prefer to use stringstream instead of files to make this faster/avo
Done



--
To view, visit http://gerrit.cloudera.org:8080/15746
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf
Gerrit-Change-Number: 15746
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 20 Apr 2020 15:26:40 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9631: Import HLL functionality from DataSketches

2020-04-20 Thread Gabor Kaszab (Code Review)
Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15746

to look at the new patch set (#3).

Change subject: IMPALA-9631: Import HLL functionality from DataSketches
..

IMPALA-9631: Import HLL functionality from DataSketches

This patch imports the functionality needed for HLL approximate
algorithm from Apache DataSketches. I decided to copy the necessary
files into be/src/thirdparty/datasketches. Note, that the original
structure of files was changed during this process as originally hll/
and common/ libraries were both affected but I copied these into the
same directory so that Impala can compile them without rewriting the
include paths in the files themselves. Also note, that not the whole
common/ directory was copied just the files needed for HLL.

The git hash of the snapshot I used as a source for the files:
a6265b307a03085abe26c20413fdbf7d7a5eaf29

Browse the source files here:
https://github.com/apache/incubator-datasketches-cpp

Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf
---
M .clang-tidy
M be/src/exprs/CMakeLists.txt
A be/src/exprs/datasketches-test.cc
A be/src/thirdparty/datasketches/AuxHashMap-internal.hpp
A be/src/thirdparty/datasketches/AuxHashMap.hpp
A be/src/thirdparty/datasketches/CommonUtil.hpp
A be/src/thirdparty/datasketches/CompositeInterpolationXTable-internal.hpp
A be/src/thirdparty/datasketches/CompositeInterpolationXTable.hpp
A be/src/thirdparty/datasketches/CouponHashSet-internal.hpp
A be/src/thirdparty/datasketches/CouponHashSet.hpp
A be/src/thirdparty/datasketches/CouponList-internal.hpp
A be/src/thirdparty/datasketches/CouponList.hpp
A be/src/thirdparty/datasketches/CubicInterpolation-internal.hpp
A be/src/thirdparty/datasketches/CubicInterpolation.hpp
A be/src/thirdparty/datasketches/HarmonicNumbers-internal.hpp
A be/src/thirdparty/datasketches/HarmonicNumbers.hpp
A be/src/thirdparty/datasketches/Hll4Array-internal.hpp
A be/src/thirdparty/datasketches/Hll4Array.hpp
A be/src/thirdparty/datasketches/Hll6Array-internal.hpp
A be/src/thirdparty/datasketches/Hll6Array.hpp
A be/src/thirdparty/datasketches/Hll8Array-internal.hpp
A be/src/thirdparty/datasketches/Hll8Array.hpp
A be/src/thirdparty/datasketches/HllArray-internal.hpp
A be/src/thirdparty/datasketches/HllArray.hpp
A be/src/thirdparty/datasketches/HllSketch-internal.hpp
A be/src/thirdparty/datasketches/HllSketchImpl-internal.hpp
A be/src/thirdparty/datasketches/HllSketchImpl.hpp
A be/src/thirdparty/datasketches/HllSketchImplFactory.hpp
A be/src/thirdparty/datasketches/HllUnion-internal.hpp
A be/src/thirdparty/datasketches/HllUtil.hpp
A be/src/thirdparty/datasketches/LICENSE
A be/src/thirdparty/datasketches/MurmurHash3.h
A be/src/thirdparty/datasketches/README.md
A be/src/thirdparty/datasketches/RelativeErrorTables-internal.hpp
A be/src/thirdparty/datasketches/RelativeErrorTables.hpp
A be/src/thirdparty/datasketches/coupon_iterator-internal.hpp
A be/src/thirdparty/datasketches/coupon_iterator.hpp
A be/src/thirdparty/datasketches/hll.hpp
A be/src/thirdparty/datasketches/hll.private.hpp
A be/src/thirdparty/datasketches/inv_pow2_table.hpp
M bin/rat_exclude_files.txt
41 files changed, 7,379 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/15746/3
--
To view, visit http://gerrit.cloudera.org:8080/15746
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf
Gerrit-Change-Number: 15746
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9631: Import HLL functionality from DataSketches

2020-04-17 Thread Gabor Kaszab (Code Review)
Hello Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15746

to look at the new patch set (#2).

Change subject: IMPALA-9631: Import HLL functionality from DataSketches
..

IMPALA-9631: Import HLL functionality from DataSketches

This patch imports the functionality needed for HLL approximate algorithm
from Apache DataSketches. I decided to copy the necessary files into
be/src/thirdparty/datasketches. Note, that the original structure of files
was changed during this process as originally hll/ and common/ libraries were
both affected but I copied these into the same directory so that Impala can
compile them without rewriting the include paths in the files themselves. Also
note, that not the whole common/ directory was copied just the files needed for
HLL.

The git hash of the snapshot I used as a source for the files:
a6265b307a03085abe26c20413fdbf7d7a5eaf29

Browse the source files here:
https://github.com/apache/incubator-datasketches-cpp

Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf
---
M be/src/exprs/CMakeLists.txt
A be/src/exprs/datasketches-test.cc
A be/src/thirdparty/datasketches/AuxHashMap-internal.hpp
A be/src/thirdparty/datasketches/AuxHashMap.hpp
A be/src/thirdparty/datasketches/CommonUtil.hpp
A be/src/thirdparty/datasketches/CompositeInterpolationXTable-internal.hpp
A be/src/thirdparty/datasketches/CompositeInterpolationXTable.hpp
A be/src/thirdparty/datasketches/CouponHashSet-internal.hpp
A be/src/thirdparty/datasketches/CouponHashSet.hpp
A be/src/thirdparty/datasketches/CouponList-internal.hpp
A be/src/thirdparty/datasketches/CouponList.hpp
A be/src/thirdparty/datasketches/CubicInterpolation-internal.hpp
A be/src/thirdparty/datasketches/CubicInterpolation.hpp
A be/src/thirdparty/datasketches/HarmonicNumbers-internal.hpp
A be/src/thirdparty/datasketches/HarmonicNumbers.hpp
A be/src/thirdparty/datasketches/Hll4Array-internal.hpp
A be/src/thirdparty/datasketches/Hll4Array.hpp
A be/src/thirdparty/datasketches/Hll6Array-internal.hpp
A be/src/thirdparty/datasketches/Hll6Array.hpp
A be/src/thirdparty/datasketches/Hll8Array-internal.hpp
A be/src/thirdparty/datasketches/Hll8Array.hpp
A be/src/thirdparty/datasketches/HllArray-internal.hpp
A be/src/thirdparty/datasketches/HllArray.hpp
A be/src/thirdparty/datasketches/HllSketch-internal.hpp
A be/src/thirdparty/datasketches/HllSketchImpl-internal.hpp
A be/src/thirdparty/datasketches/HllSketchImpl.hpp
A be/src/thirdparty/datasketches/HllSketchImplFactory.hpp
A be/src/thirdparty/datasketches/HllUnion-internal.hpp
A be/src/thirdparty/datasketches/HllUtil.hpp
A be/src/thirdparty/datasketches/LICENSE
A be/src/thirdparty/datasketches/MurmurHash3.h
A be/src/thirdparty/datasketches/RelativeErrorTables-internal.hpp
A be/src/thirdparty/datasketches/RelativeErrorTables.hpp
A be/src/thirdparty/datasketches/coupon_iterator-internal.hpp
A be/src/thirdparty/datasketches/coupon_iterator.hpp
A be/src/thirdparty/datasketches/hll.hpp
A be/src/thirdparty/datasketches/hll.private.hpp
A be/src/thirdparty/datasketches/inv_pow2_table.hpp
M bin/rat_exclude_files.txt
39 files changed, 7,368 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/15746/2
--
To view, visit http://gerrit.cloudera.org:8080/15746
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf
Gerrit-Change-Number: 15746
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9469: ORC scanner vectorization for collection types

2020-04-17 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15730 )

Change subject: IMPALA-9469: ORC scanner vectorization for collection types
..


Patch Set 2: Code-Review+2

(5 comments)

http://gerrit.cloudera.org:8080/#/c/15730/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15730/1//COMMIT_MSG@32
PS1, Line 32:
> Unfortunately there is no 'scale' option for loading the nested workloads.
No need to put more efforts into this if there is no convenient way to produce 
scaled nested workloads. Thx for the explanation!


http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.h@95
PS1, Line 95: ReadValue
> In that case we'd need more friend declarations, because both OrcListReader
Yeah, agree it's more convenient to leave it here. I don't have strong feelings 
for either option.


http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.cc@451
PS1, Line 451: int OrcListReader::NumElements() const {
> Yeah, it's actually the same function.
I see. But we still gain something with codegen nested types so it's still a 
win. Thx for rewriting this.


http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.cc@473
PS1, Line 473:
> If the list is an array of structs, then we can have many children, but in
Indeed. Thx!


http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.cc@477
PS1, Line 477:   return Status::OK();
> Yeah, it's confusing. slot_desc_ being null is independent of MaterializeTu
Hmm, According to my understanding MaterializeTuple() is true when the reader 
writes directly to a slot and false if the reader delegates the writing to its 
children. I might have misunderstood then.
Still, introducing DirectReader() helps readability, thx.



--
To view, visit http://gerrit.cloudera.org:8080/15730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I477961b427406035a04529c5175dbee8f8a93ad5
Gerrit-Change-Number: 15730
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 17 Apr 2020 13:13:34 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9631: Import HLL functionality from DataSketches

2020-04-17 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/15746


Change subject: IMPALA-9631: Import HLL functionality from DataSketches
..

IMPALA-9631: Import HLL functionality from DataSketches

This patch imports the functionality needed for HLL approximate algorithm
from Apache DataSketches. I decided to copy the necessary files into
be/src/thirdparty/datasketches. Note, that the original structure of files
was changed during this process as originally hll/ and common/ libraries were
both affected but I copied these into the same directory so that Impala can
compile them without rewriting the include paths in the files themselves. Also
note, that not the whole common/ directory was copied just the files needed for
HLL.

The git hash of the snapshot I used as a source for the files:
a6265b307a03085abe26c20413fdbf7d7a5eaf29

Browse the source files here:
https://github.com/apache/incubator-datasketches-cpp

Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf
---
M be/src/exprs/CMakeLists.txt
A be/src/exprs/datasketches-test.cc
A be/src/thirdparty/datasketches/AuxHashMap-internal.hpp
A be/src/thirdparty/datasketches/AuxHashMap.hpp
A be/src/thirdparty/datasketches/CommonUtil.hpp
A be/src/thirdparty/datasketches/CompositeInterpolationXTable-internal.hpp
A be/src/thirdparty/datasketches/CompositeInterpolationXTable.hpp
A be/src/thirdparty/datasketches/CouponHashSet-internal.hpp
A be/src/thirdparty/datasketches/CouponHashSet.hpp
A be/src/thirdparty/datasketches/CouponList-internal.hpp
A be/src/thirdparty/datasketches/CouponList.hpp
A be/src/thirdparty/datasketches/CubicInterpolation-internal.hpp
A be/src/thirdparty/datasketches/CubicInterpolation.hpp
A be/src/thirdparty/datasketches/HarmonicNumbers-internal.hpp
A be/src/thirdparty/datasketches/HarmonicNumbers.hpp
A be/src/thirdparty/datasketches/Hll4Array-internal.hpp
A be/src/thirdparty/datasketches/Hll4Array.hpp
A be/src/thirdparty/datasketches/Hll6Array-internal.hpp
A be/src/thirdparty/datasketches/Hll6Array.hpp
A be/src/thirdparty/datasketches/Hll8Array-internal.hpp
A be/src/thirdparty/datasketches/Hll8Array.hpp
A be/src/thirdparty/datasketches/HllArray-internal.hpp
A be/src/thirdparty/datasketches/HllArray.hpp
A be/src/thirdparty/datasketches/HllSketch-internal.hpp
A be/src/thirdparty/datasketches/HllSketchImpl-internal.hpp
A be/src/thirdparty/datasketches/HllSketchImpl.hpp
A be/src/thirdparty/datasketches/HllSketchImplFactory.hpp
A be/src/thirdparty/datasketches/HllUnion-internal.hpp
A be/src/thirdparty/datasketches/HllUtil.hpp
A be/src/thirdparty/datasketches/MurmurHash3.h
A be/src/thirdparty/datasketches/RelativeErrorTables-internal.hpp
A be/src/thirdparty/datasketches/RelativeErrorTables.hpp
A be/src/thirdparty/datasketches/coupon_iterator-internal.hpp
A be/src/thirdparty/datasketches/coupon_iterator.hpp
A be/src/thirdparty/datasketches/hll.hpp
A be/src/thirdparty/datasketches/hll.private.hpp
A be/src/thirdparty/datasketches/inv_pow2_table.hpp
37 files changed, 7,137 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/15746/1
--
To view, visit http://gerrit.cloudera.org:8080/15746
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I8ca8e77dcbb6b6c3b1e3bca7ab57cb7d3c018bbf
Gerrit-Change-Number: 15746
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab 


[Impala-ASF-CR] IMPALA-9596: deflake test tpch mem limit single node

2020-04-16 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15700 )

Change subject: IMPALA-9596: deflake test_tpch_mem_limit_single_node
..


Patch Set 4: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/15700
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4678963c264b7c15fbac6f71721162b38676aa21
Gerrit-Change-Number: 15700
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 16 Apr 2020 12:13:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9469: ORC scanner vectorization for collection types

2020-04-15 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15730 )

Change subject: IMPALA-9469: ORC scanner vectorization for collection types
..


Patch Set 1:

(11 comments)

Nice change and cleanup!
I took a look but have to admit that the part that touches batch->offsets 
(mostly in SetPositionSlot()) is not 100% clear for me.

http://gerrit.cloudera.org:8080/#/c/15730/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15730/1//COMMIT_MSG@10
PS1, Line 10: now
nit: now on


http://gerrit.cloudera.org:8080/#/c/15730/1//COMMIT_MSG@13
PS1, Line 13: now
same


http://gerrit.cloudera.org:8080/#/c/15730/1//COMMIT_MSG@18
PS1, Line 18: I eliminated the OrcComplexColumnReader::TransferTuple() interface
: and the related codes.
nit: I feel that this can be merged with the next paragraph.


http://gerrit.cloudera.org:8080/#/c/15730/1//COMMIT_MSG@32
PS1, Line 32: scale=1
Is there a plan to check performance on a higher scale data, e.g. scale=25 so 
that we can compare these improvements to the ones we achieved for primitive 
types?

Giving it a second thought we tested primitives on TPCH instead of TPCH Nested, 
but still I'm wondering if it makes sense to see these numbers on a higher 
scale.


http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.h@50
PS1, Line 50: alwas
typo


http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.h@95
PS1, Line 95: ReadValue
>From now on ReadValue() is only called from ReadValueBatch(), right? If this 
>is the case can you make it private?


http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.h@108
PS1, Line 108: lists
Not just lists but maps as well, right?


http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.h@487
PS1, Line 487: int NumElements() const override {
 : if (MaterializeTuple()) return vbatch_->numElements;
 : DCHECK_EQ(children().size(), 1);
 : OrcColumnReader* child = children()[0];
 : return child->NumElements();
 :   }
Both OrcListReader and OrcMapReader overrides this so actually this will be 
only for OrcStructReader, right? Why not move it there?


http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.cc@451
PS1, Line 451: Status OrcCollectionReader::ReadValueBatch
This looks very much the same as OrcPrimitiveColumnReader::ReadValueBatch() 
except L462. Do you know a way to avoid this code duplication?


http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.cc@473
PS1, Line 473: return children_[0]->NumElements();
Is it worth a DCHECK to verify that this OrcListReader has exactly one child?


http://gerrit.cloudera.org:8080/#/c/15730/1/be/src/exec/orc-column-readers.cc@477
PS1, Line 477:   if (slot_desc_ == nullptr) {
This check is not that clear for me. Isn't this identical to 
!MaterializeTuple() check?
I see other occurences below as well all for collections.



--
To view, visit http://gerrit.cloudera.org:8080/15730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I477961b427406035a04529c5175dbee8f8a93ad5
Gerrit-Change-Number: 15730
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Wed, 15 Apr 2020 13:34:12 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9596: deflake test tpch mem limit single node

2020-04-14 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15700 )

Change subject: IMPALA-9596: deflake test_tpch_mem_limit_single_node
..


Patch Set 2: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/15700
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4678963c264b7c15fbac6f71721162b38676aa21
Gerrit-Change-Number: 15700
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 15 Apr 2020 05:44:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9596: deflake test tpch mem limit single node

2020-04-14 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15700 )

Change subject: IMPALA-9596: deflake test_tpch_mem_limit_single_node
..


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15700/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15700/1//COMMIT_MSG@16
PS1, Line 16: was was
nit: typo


http://gerrit.cloudera.org:8080/#/c/15700/1/be/src/runtime/collection-value-builder.h
File be/src/runtime/collection-value-builder.h:

http://gerrit.cloudera.org:8080/#/c/15700/1/be/src/runtime/collection-value-builder.h@65
PS1, Line 65: have_debug_action_
What is the point of introducing a member for this?
Can't you do "!state->query_options().debug_action.empty()" here?
Is this a perf optimisation?


http://gerrit.cloudera.org:8080/#/c/15700/1/tests/query_test/test_nested_types.py
File tests/query_test/test_nested_types.py:

http://gerrit.cloudera.org:8080/#/c/15700/1/tests/query_test/test_nested_types.py@148
PS1, Line 148: with memory limits tuned for
 : a 3-node HDFS minicluster
This part of the comment is no longer true, right?



--
To view, visit http://gerrit.cloudera.org:8080/15700
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4678963c264b7c15fbac6f71721162b38676aa21
Gerrit-Change-Number: 15700
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Bikramjeet Vig 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Comment-Date: Tue, 14 Apr 2020 08:20:05 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6360: Don't show full query statement on Impala WebUI by default

2020-03-23 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15288 )

Change subject: IMPALA-6360: Don't show full query statement on Impala WebUI by 
default
..


Patch Set 16: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/15288
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib7109a0be5d1022b4f8d6e72441cf5dc1dc42605
Gerrit-Change-Number: 15288
Gerrit-PatchSet: 16
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Comment-Date: Mon, 23 Mar 2020 12:54:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6360: Don't show full query statement on Impala WebUI by default

2020-03-18 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15288 )

Change subject: IMPALA-6360: Don't show full query statement on Impala WebUI by 
default
..


Patch Set 15: Code-Review+1

(2 comments)

I'm fine with the change, just left some nits. Will +2 once they are covered.

http://gerrit.cloudera.org:8080/#/c/15288/15/tests/custom_cluster/test_web_pages.py
File tests/custom_cluster/test_web_pages.py:

http://gerrit.cloudera.org:8080/#/c/15288/15/tests/custom_cluster/test_web_pages.py@114
PS15, Line 114: imput
nit: typo


http://gerrit.cloudera.org:8080/#/c/15288/15/tests/custom_cluster/test_web_pages.py@130
PS15, Line 130: part
nit: partial



--
To view, visit http://gerrit.cloudera.org:8080/15288
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib7109a0be5d1022b4f8d6e72441cf5dc1dc42605
Gerrit-Change-Number: 15288
Gerrit-PatchSet: 15
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Comment-Date: Wed, 18 Mar 2020 14:23:00 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6360: Don't show full query statement on Impala WebUI by default

2020-03-16 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15288 )

Change subject: IMPALA-6360: Don't show full query statement on Impala WebUI by 
default
..


Patch Set 14:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py
File tests/custom_cluster/test_web_pages.py:

http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@113
PS14, Line 113: Create a long select query then check if it is contained in the 
response json.
nit: It would be more beneficial to describe the expected functionality here 
like: Check if the full query string is displayed in the query list on the 
WebUI.


http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@115
PS14, Line 115: "x " * 450
As this string is used multiple times I'd put this into a variable.


http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@115
PS14, Line 115: "select \"{0}\""
Might not work but could you try to put the string within single quotes instead 
of double quotes so that you don't have to escape the double quotes inside?


http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@116
PS14, Line 116: theere
typo


http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@118
PS14, Line 118: expected = "select \\\"{0}\\\"".format("x " * 450)
Same comment for single quotes as above.


http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@129
PS14, Line 129: """Create a long select query then check if it is contained 
in the response json."""
This statements isn't true for this test. Might be a copy-paste error.


http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@131
PS14, Line 131: query = "select \"{0}\"".format("x " * 450)
Same comment for single quotes as above.


http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@133
PS14, Line 133: theere
typo


http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@135
PS14, Line 135: expected = "select \\\"x ..."
Same comment for single quotes as above.


http://gerrit.cloudera.org:8080/#/c/15288/14/tests/custom_cluster/test_web_pages.py@139
PS14, Line 139: if unexpected in response_json:
  :   assert False, "The response is containing the full query."
Is this needed? L141 will check the expected output anyway that won't be a 
prefix of the input because of the "..." appended to the end.



-- 
To view, visit http://gerrit.cloudera.org:8080/15288
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib7109a0be5d1022b4f8d6e72441cf5dc1dc42605
Gerrit-Change-Number: 15288
Gerrit-PatchSet: 14
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Comment-Date: Mon, 16 Mar 2020 12:55:45 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6360: Don't show full query statement on Impala WebUI by default

2020-03-06 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15288 )

Change subject: IMPALA-6360: Don't show full query statement on Impala WebUI by 
default
..


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15288/12/tests/webserver/test_web_pages.py
File tests/webserver/test_web_pages.py:

http://gerrit.cloudera.org:8080/#/c/15288/12/tests/webserver/test_web_pages.py@419
PS12, Line 419:
I miss a test to check that the length of the query statement is configurable.
Could you look around in the tests and check if there is a way to test these 
kind of configs as well?



--
To view, visit http://gerrit.cloudera.org:8080/15288
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib7109a0be5d1022b4f8d6e72441cf5dc1dc42605
Gerrit-Change-Number: 15288
Gerrit-PatchSet: 12
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Comment-Date: Fri, 06 Mar 2020 13:15:18 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-7686: Allow RANGE() clause before HASH() clause for PARTITION BY

2020-03-04 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15332 )

Change subject: IMPALA-7686: Allow RANGE() clause before HASH() clause for 
PARTITION BY
..


Patch Set 7: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/15332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I914e340e9acfb0f49c7d4f78705dbd9bde0aec8c
Gerrit-Change-Number: 15332
Gerrit-PatchSet: 7
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Comment-Date: Wed, 04 Mar 2020 13:51:15 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6506: Codegen in ORC scanner for primitives and struct

2020-03-04 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15350 )

Change subject: IMPALA-6506: Codegen in ORC scanner for primitives and struct
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15350/1/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/15350/1/be/src/exec/hdfs-scan-node-base.cc@437
PS1, Line 437:   case THdfsFileFormat::PARQUET:
 :   case THdfsFileFormat::ORC:
 : status = HdfsColumnarScanner::Codegen(this, state, );
> It would be nice to check what happens if there are Parquet and ORC partiti
Did check it manually and this runs without any error scanning partitions well 
for both formats. Added a comment about this to the commit msg.



--
To view, visit http://gerrit.cloudera.org:8080/15350
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2352d0c8fc75ff722e931bc8c866b3e43d3636f4
Gerrit-Change-Number: 15350
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 04 Mar 2020 12:05:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6506: Codegen in ORC scanner for primitives and struct

2020-03-04 Thread Gabor Kaszab (Code Review)
Hello Zoltan Borok-Nagy, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15350

to look at the new patch set (#2).

Change subject: IMPALA-6506: Codegen in ORC scanner for primitives and struct
..

IMPALA-6506: Codegen in ORC scanner for primitives and struct

IMPALA-9228 introduced scratch batch handling for struct and
primitive types in the ORC scanner and the existing scratch batch
logic already supports Codegen for ProcessScratchBatch() function.
This change turns on this Codegen logic for primitives types and
structs in the ORC scanner.

Note, if the query involves collection types then
ProcessScratchBatch() is still codegend but the codegend function
isn't used as the regular row-by-row approach is followed in this
case without using a scratch batch.

Testing:
  - Re-run the whole test suite to check for regressions.
  - Checked the performance on a scale 25 TPCH workload in ORC format
using single_node_perf_run.py. Comparing the query runtimes it
seems that codegen brings a 1-21% improvement for most of the
queries. There is a slight decrease in 3 queries that are not
scan-heavy where codegen doesn't provide any help for scanning.
However, these are short queries where the size of the
degradation is in subseconds so I'd say the decrease is
negligible.
  - Did a manual check for a table that contains both Parquet and ORC
partitions. Verified that in this case ProcessScratchBatch() is
codegend for both formats and the query results are as expected.

Change-Id: I2352d0c8fc75ff722e931bc8c866b3e43d3636f4
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
6 files changed, 58 insertions(+), 47 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/15350/2
--
To view, visit http://gerrit.cloudera.org:8080/15350
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2352d0c8fc75ff722e931bc8c866b3e43d3636f4
Gerrit-Change-Number: 15350
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-7686: Allow RANGE() clause before HASH() clause for PARTITION BY

2020-03-04 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15332 )

Change subject: IMPALA-7686: Allow RANGE() clause before HASH() clause for 
PARTITION BY
..


Patch Set 5: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/15332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I914e340e9acfb0f49c7d4f78705dbd9bde0aec8c
Gerrit-Change-Number: 15332
Gerrit-PatchSet: 5
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Comment-Date: Wed, 04 Mar 2020 09:30:11 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6360: Don't show full query statement on Impala WebUI by default

2020-03-04 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15288 )

Change subject: IMPALA-6360: Don't show full query statement on Impala WebUI by 
default
..


Patch Set 10:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/15288/10/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/15288/10/be/src/service/impala-server.cc@151
PS10, Line 151: display
nit: displayed


http://gerrit.cloudera.org:8080/#/c/15288/10/tests/webserver/test_web_pages.py
File tests/webserver/test_web_pages.py:

http://gerrit.cloudera.org:8080/#/c/15288/10/tests/webserver/test_web_pages.py@422
PS10, Line 422: enought
nit: typo


http://gerrit.cloudera.org:8080/#/c/15288/10/tests/webserver/test_web_pages.py@429
PS10, Line 429: contain
nit: contains


http://gerrit.cloudera.org:8080/#/c/15288/10/tests/webserver/test_web_pages.py@434
PS10, Line 434: stmt_response
The name of this variable is misleading here as it's not storing the response 
anymore.



--
To view, visit http://gerrit.cloudera.org:8080/15288
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib7109a0be5d1022b4f8d6e72441cf5dc1dc42605
Gerrit-Change-Number: 15288
Gerrit-PatchSet: 10
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Comment-Date: Wed, 04 Mar 2020 08:12:27 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-7686: Allow RANGE() clause before HASH() clause for PARTITION BY

2020-03-03 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15332 )

Change subject: IMPALA-7686: Allow RANGE() clause before HASH() clause for 
PARTITION BY
..


Patch Set 4:

(2 comments)

Minor nits again. I'll +2 if these are covered

http://gerrit.cloudera.org:8080/#/c/15332/4/fe/src/test/java/org/apache/impala/analysis/AnalyzeKuduDDLTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzeKuduDDLTest.java:

http://gerrit.cloudera.org:8080/#/c/15332/4/fe/src/test/java/org/apache/impala/analysis/AnalyzeKuduDDLTest.java@123
PS4, Line 123: swapped
nit: Swapped


http://gerrit.cloudera.org:8080/#/c/15332/4/fe/src/test/java/org/apache/impala/analysis/AnalyzeKuduDDLTest.java@128
PS4, Line 128: swapped
nit: this word is not needed (and doesn't really makes sense here)



-- 
To view, visit http://gerrit.cloudera.org:8080/15332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I914e340e9acfb0f49c7d4f78705dbd9bde0aec8c
Gerrit-Change-Number: 15332
Gerrit-PatchSet: 4
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Comment-Date: Wed, 04 Mar 2020 07:47:56 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6506: Codegen in ORC scanner for primitives and struct

2020-03-03 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/15350


Change subject: IMPALA-6506: Codegen in ORC scanner for primitives and struct
..

IMPALA-6506: Codegen in ORC scanner for primitives and struct

IMPALA-9228 introduced scratch batch handling for struct and
primitive types in the ORC scanner and the existing scratch batch
logic already supports Codegen for ProcessScratchBatch() function.
This change turns on this Codegen logic for primitives types and
structs in the ORC scanner.

Note, if the query involves collection types then
ProcessScratchBatch() is still codegend but the codegend function
isn't used as the regular row-by-row approach is followed in this
case without using a scratch batch.

Testing:
  - Re-run the whole test suite to check for regressions.
  - Checked the performance on a scale 25 TPCH workload in ORC format
using single_node_perf_run.py. Comparing the query runtimes it
seems that codegen brings a 1-21% improvement for most of the
queries. There is a slight decrease in 3 queries that are not
scan-heavy where codegen doesn't provide any help for scanning.
However, these are short queries where the size of the
degradation is in subseconds so I'd say the decrease is
negligible.

Change-Id: I2352d0c8fc75ff722e931bc8c866b3e43d3636f4
---
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
6 files changed, 58 insertions(+), 47 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/15350/1
--
To view, visit http://gerrit.cloudera.org:8080/15350
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I2352d0c8fc75ff722e931bc8c866b3e43d3636f4
Gerrit-Change-Number: 15350
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab 


[Impala-ASF-CR] IMPALA-6360: Don't show full query statement on Impala webUI by default

2020-03-03 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15288 )

Change subject: IMPALA-6360: Don't show full query statement on Impala webUI by 
default
..


Patch Set 8:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/15288/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15288/8//COMMIT_MSG@7
PS8, Line 7: webUI
nit: WebUI


http://gerrit.cloudera.org:8080/#/c/15288/8//COMMIT_MSG@9
PS8, Line 9: Added the ‘query_stmt_size’ flag to impala-server.cc with default 
value
   : of 250 and modified the ‘ImpalaHttpHandler::QueryStateToJson()’ to
   : truncate the end of the statements if they are too long.
Similarly to my comment in your other commit: I don't see the point of 
mentioning all these technical details in the commit message as they hide the 
original meaning.
Here you should describe that in the WebUI's query list the query statements 
are trimmed, but you can see the full query statement in the details page. 
Additionally mention please that the default statement length is 250 chars but 
it can be adjusted by a flag that requires an Impala restart to take effect.
I'd also describe if there is a special value of this new flag that for example 
defaults to showing the full query statement.


http://gerrit.cloudera.org:8080/#/c/15288/8//COMMIT_MSG@12
PS8, Line 12:
It would be useful to add an example how to set the new flag to a custom value.


http://gerrit.cloudera.org:8080/#/c/15288/8/be/src/service/impala-http-handler.cc
File be/src/service/impala-http-handler.cc:

http://gerrit.cloudera.org:8080/#/c/15288/8/be/src/service/impala-http-handler.cc@495
PS8, Line 495: document->AddMember("completed_stmt_size", FLAGS_query_stmt_size,
 :   document->GetAllocator());
Do you need this? If I understand correctly this adds a new variable to the web 
page document called "completed_stmt_size" but you don't use it anywhere in the 
page itself.


http://gerrit.cloudera.org:8080/#/c/15288/8/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

http://gerrit.cloudera.org:8080/#/c/15288/8/be/src/service/impala-server.cc@151
PS8, Line 151: statements have unbounded size
nit: It would be a bit better in my opinion to say like "the full statement is 
display in the query log without trimming."


http://gerrit.cloudera.org:8080/#/c/15288/8/tests/webserver/test_web_pages.py
File tests/webserver/test_web_pages.py:

http://gerrit.cloudera.org:8080/#/c/15288/8/tests/webserver/test_web_pages.py@421
PS8, Line 421: then search the response json for the
 : expected, truncated statement.
nit: then check if it is truncated in the response json.


http://gerrit.cloudera.org:8080/#/c/15288/8/tests/webserver/test_web_pages.py@423
PS8, Line 423: query = """select \"Proin bibendum erat eu libero iaculis 
pharetra. Duis
nit: Pls add a comment how long the input query string is.
Also add a similar comment above the expected_result


http://gerrit.cloudera.org:8080/#/c/15288/8/tests/webserver/test_web_pages.py@440
PS8, Line 440: assert expected_result in stmt_response, "No matching 
statement found in the jsons."
This part is a bit weird: above you set stmt_response if the expected_result is 
in a subsection of the json output and once you finish the for loop you do the 
same check.

What if you do an "assert false" in this line and return in L439 instead of the 
break?
Or option2: instead of stmt_response you introduce a bool that initializes to 
False and you set it to True if you have a match for the expected_result. You 
can assert for the bool variable being True in the end.

Would this make sense?



--
To view, visit http://gerrit.cloudera.org:8080/15288
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib7109a0be5d1022b4f8d6e72441cf5dc1dc42605
Gerrit-Change-Number: 15288
Gerrit-PatchSet: 8
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Comment-Date: Tue, 03 Mar 2020 12:45:48 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch

2020-03-03 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15104 )

Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch
..


Patch Set 9:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/15104/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15104/5//COMMIT_MSG@20
PS5, Line 20: Note,
> typo: testing
Done


http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/hdfs-columnar-scanner-ir.cc
File be/src/exec/hdfs-columnar-scanner-ir.cc:

http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/hdfs-columnar-scanner-ir.cc@23
PS5, Line 23: scratch_batch_
> nit: the != nullptr in written out in most of Impala
Done


http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/hdfs-orc-scanner.cc@702
PS5, Line 702:
> nit: the != nullptr in written out in most of Impala
Done


http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/orc-column-readers.cc@264
PS5, Line 264: >col_id_path_map_[
> Can you mention it in the commit message that no scratch batch is used if t
I've already added a note about this to the commit msg.



--
To view, visit http://gerrit.cloudera.org:8080/15104
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
Gerrit-Change-Number: 15104
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 03 Mar 2020 10:01:28 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch

2020-03-03 Thread Gabor Kaszab (Code Review)
Hello Zoltan Borok-Nagy, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15104

to look at the new patch set (#9).

Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch
..

IMPALA-9228: ORC scanner reads rows into scratch batch

Because of performance considerations this change enhances ORC
scanner to populate a scratch batch on a column-by-column manner
using data from the column readers. Once this is done the parquet
code was reused to apply runtime filter and conjuncts and to
populate the outgoing row batch.

This approach reduces the number of virtual function calls and takes
advantage of the columnar orientation of the data to enhance scan
performance. Additionally, introducing the scratch batch concept also
opens the door for codegen runtime filtering and applying conjuncts.

Note, this change doesn't cover collection types just primitive types
and struct. Collection types will follow the previous row-by-row
approach.

Testing:
  - Re-run the full test suite to verify that no regression is
introduced.
  - Checked the performance impact by running TPCH workload on a
scale 25 database using single_node_perf_run.py. The total query
runtime is decreased by 0-20% depending on how scan heavy the
particular query was. The more scan heavy the query is the more
performance gain I observe.

Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/CMakeLists.txt
R be/src/exec/hdfs-columnar-scanner-ir.cc
A be/src/exec/hdfs-columnar-scanner.cc
A be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
R be/src/exec/scratch-tuple-batch.h
15 files changed, 425 insertions(+), 144 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/15104/9
--
To view, visit http://gerrit.cloudera.org:8080/15104
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
Gerrit-Change-Number: 15104
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-7686: Allow RANGE() clause before HASH() clause for PARTITION BY

2020-03-03 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15332 )

Change subject: IMPALA-7686: Allow RANGE() clause before HASH() clause for 
PARTITION BY
..


Patch Set 3: Code-Review+1

(1 comment)

I just have one nit. Let me give a +1 now and once you covered that I can +2

http://gerrit.cloudera.org:8080/#/c/15332/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15332/3//COMMIT_MSG@12
PS3, Line 12: Added an extra AnalyzesOk into ‘AnalyzeKuduDDLTest()’ ->
: ‘testDDlsOnKuduTable()’ in AnalyzeKuduDDLTest.java where the 
RANGE() and
: HASH() pair is swapped.
nit: no need to mention the each line of test change in the commit message. A 
single sentence without this level of technical details is fine. Like "Added 
and extra analyzer test to cover the case when RANGE() is before HASH() for 
Kudu tables."



--
To view, visit http://gerrit.cloudera.org:8080/15332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I914e340e9acfb0f49c7d4f78705dbd9bde0aec8c
Gerrit-Change-Number: 15332
Gerrit-PatchSet: 3
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Comment-Date: Tue, 03 Mar 2020 09:47:36 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch

2020-03-02 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15104 )

Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch
..


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15104/7/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15104/7/be/src/exec/hdfs-scanner.h@176
PS7, Line 176: Not inlined in IR so it can be replaced with a constant.
> nit: this comment is applicable to both 'tuple_byte_size' functions, so it
Done



--
To view, visit http://gerrit.cloudera.org:8080/15104
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
Gerrit-Change-Number: 15104
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 02 Mar 2020 15:26:44 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch

2020-03-02 Thread Gabor Kaszab (Code Review)
Hello Zoltan Borok-Nagy, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15104

to look at the new patch set (#8).

Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch
..

IMPALA-9228: ORC scanner reads rows into scratch batch

Because of performance considerations this change enhances ORC
scanner to populate a scratch batch on a column-by-column manner
using data from the column readers. Once this is done the parquet
code was reused to apply runtime filter and conjuncts and to
populate the outgoing row batch.

This approach reduces the number of virtual function calls and takes
advantage of the columnar orientation of the data to enhance scan
performance. Additionally, introducing the scratch batch concept also
opens the door for codegen runtime filtering and applying conjuncts.

Note, this change doesn't cover collection types just primitive types
and struct.

Tesing:
  - Re-run the full test suite to verify that no regression is
introduced.
  - Checked the performance impact by running TPCH workload on a
scale 25 database using single_node_perf_run.py. The total query
runtime is decreased by 0-20% depending on how scan heavy the
particular query was. The more scan heavy the query is the more
performance gain I observe.

Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/CMakeLists.txt
R be/src/exec/hdfs-columnar-scanner-ir.cc
A be/src/exec/hdfs-columnar-scanner.cc
A be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
R be/src/exec/scratch-tuple-batch.h
15 files changed, 425 insertions(+), 144 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/15104/8
--
To view, visit http://gerrit.cloudera.org:8080/15104
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
Gerrit-Change-Number: 15104
Gerrit-PatchSet: 8
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch

2020-03-02 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15104 )

Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch
..


Patch Set 7:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/hdfs-scanner.h@a566
PS5, Line 566:
 :
 :
 :
 :
> nit: maybe these could be moved together to public.
Done


http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/orc-column-readers.h@188
PS5, Line 188:   Status ReadValue(int row_idx, Tuple* tuple, MemPool* pool) 
final WARN_UNUSED_RESULT;
> Since ReadValue() is public I think you don't need the friend declarations
Friend declarations are still needed for accessing 'derived->batch_' from 
OrcPrimitiveColumnReader::ReadValueBatch()


http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/15104/5/be/src/exec/orc-column-readers.cc@221
PS5, Line 221: ErrorMsg msg(errorCode, scanner_->filename(), 
orc_column_id_);
 : return scanner_-
> nit: fits single line again
Done



--
To view, visit http://gerrit.cloudera.org:8080/15104
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
Gerrit-Change-Number: 15104
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 02 Mar 2020 13:14:45 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch

2020-03-02 Thread Gabor Kaszab (Code Review)
Hello Zoltan Borok-Nagy, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15104

to look at the new patch set (#7).

Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch
..

IMPALA-9228: ORC scanner reads rows into scratch batch

Because of performance considerations this change enhances ORC
scanner to populate a scratch batch on a column-by-column manner
using data from the column readers. Once this is done the parquet
code was reused to apply runtime filter and conjuncts and to
populate the outgoing row batch.

This approach reduces the number of virtual function calls and takes
advantage of the columnar orientation of the data to enhance scan
performance. Additionally, introducing the scratch batch concept also
opens the door for codegen runtime filtering and applying conjuncts.

Tesing:
  - Re-run the full test suite to verify that no regression is
introduced.
  - Checked the performance impact by running TPCH workload on a
scale 25 database using single_node_perf_run.py. The total query
runtime is decreased by 0-20% depending on how scan heavy the
particular query was. The more scan heavy the query is the more
performance gain I observe.

Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/CMakeLists.txt
R be/src/exec/hdfs-columnar-scanner-ir.cc
A be/src/exec/hdfs-columnar-scanner.cc
A be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
R be/src/exec/scratch-tuple-batch.h
15 files changed, 426 insertions(+), 144 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/15104/7
--
To view, visit http://gerrit.cloudera.org:8080/15104
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
Gerrit-Change-Number: 15104
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch

2020-03-02 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15104 )

Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch
..


Patch Set 6:

PS6 is a rebase with master to resolve conflict with the ORC string allocations 
enhancement.


--
To view, visit http://gerrit.cloudera.org:8080/15104
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
Gerrit-Change-Number: 15104
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 02 Mar 2020 12:57:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch

2020-03-02 Thread Gabor Kaszab (Code Review)
Hello Zoltan Borok-Nagy, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15104

to look at the new patch set (#6).

Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch
..

IMPALA-9228: ORC scanner reads rows into scratch batch

Because of performance considerations this change enhances ORC
scanner to populate a scratch batch on a column-by-column manner
using data from the column readers. Once this is done the parquet
code was reused to apply runtime filter and conjuncts and to
populate the outgoing row batch.

This approach reduces the number of virtual function calls and takes
advantage of the columnar orientation of the data to enhance scan
performance. Additionally, introducing the scratch batch concept also
opens the door for codegen runtime filtering and applying conjuncts.

Tesing:
  - Re-run the full test suite to verify that no regression is
introduced.
  - Checked the performance impact by running TPCH workload on a
scale 25 database using single_node_perf_run.py. The total query
runtime is decreased by 0-20% depending on how scan heavy the
particular query was. The more scan heavy the query is the more
performance gain I observe.

Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/CMakeLists.txt
R be/src/exec/hdfs-columnar-scanner-ir.cc
A be/src/exec/hdfs-columnar-scanner.cc
A be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
R be/src/exec/scratch-tuple-batch.h
15 files changed, 423 insertions(+), 140 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/15104/6
--
To view, visit http://gerrit.cloudera.org:8080/15104
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
Gerrit-Change-Number: 15104
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch

2020-03-02 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15104 )

Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch
..


Patch Set 5:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/hdfs-columnar-scanner.h
File be/src/exec/hdfs-columnar-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/hdfs-columnar-scanner.h@40
PS4, Line 40: ()
> nit: "...in the derived classes."
Done


http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/hdfs-columnar-scanner.h@45
PS4, Line 45:
> nit: you could add "= nullptr" to make the default value of it obvious. And
Done


http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/hdfs-orc-scanner.h@258
PS4, Line 258: has a collection
 :   /// amongst its children.
> nit: sounds weird to me, maybe just "has a collection amongst..."
Done


http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/hdfs-orc-scanner.cc@703
PS4, Line 703:   DCHECK(scratch_batch_->AtEnd());
> We hit this DCHECK when we come from the loop of AssembleRows(). We can als
Done


http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/orc-column-readers.h@130
PS4, Line 130: to implement static polymorphism via the "curiously
> nit: to implement static polymorphism via the "curiously recurring template
Done


http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/orc-column-readers.h@161
PS4, Line 161:
> If you'd add the 'final' specifier to the ReadValue() implementations then
Done


http://gerrit.cloudera.org:8080/#/c/15104/4/be/src/exec/orc-column-readers.h@469
PS4, Line 469:   /// Keep row index if we're top level readers
> You need to rename this to make clang-tidy happy as it hides virtual functi
Done



--
To view, visit http://gerrit.cloudera.org:8080/15104
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
Gerrit-Change-Number: 15104
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 02 Mar 2020 09:45:55 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch

2020-03-02 Thread Gabor Kaszab (Code Review)
Hello Zoltan Borok-Nagy, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15104

to look at the new patch set (#5).

Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch
..

IMPALA-9228: ORC scanner reads rows into scratch batch

Because of performance considerations this change enhances ORC
scanner to populate a scratch batch on a column-by-column manner
using data from the column readers. Once this is done the parquet
code was reused to apply runtime filter and conjuncts and to
populate the outgoing row batch.

This approach reduces the number of virtual function calls and takes
advantage of the columnar orientation of the data to enhance scan
performance. Additionally, introducing the scratch batch concept also
opens the door for codegen runtime filtering and applying conjuncts.

Tesing:
  - Re-run the full test suite to verify that no regression is
introduced.
  - Checked the performance impact by running TPCH workload on a
scale 25 database using single_node_perf_run.py. The total query
runtime is decreased by 0-20% depending on how scan heavy the
particular query was. The more scan heavy the query is the more
performance gain I observe.

Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/CMakeLists.txt
R be/src/exec/hdfs-columnar-scanner-ir.cc
A be/src/exec/hdfs-columnar-scanner.cc
A be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
R be/src/exec/scratch-tuple-batch.h
15 files changed, 423 insertions(+), 140 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/15104/5
--
To view, visit http://gerrit.cloudera.org:8080/15104
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
Gerrit-Change-Number: 15104
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch

2020-02-27 Thread Gabor Kaszab (Code Review)
Hello Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15104

to look at the new patch set (#4).

Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch
..

IMPALA-9228: ORC scanner reads rows into scratch batch

Because of performance considerations this change enhances ORC
scanner to populate a scratch batch on a column-by-column manner
using data from the column readers. Once this is done the parquet
code was reused to apply runtime filter and conjuncts and to
populate the outgoing row batch.

This approach reduces the number of virtual function calls and takes
advantage of the columnar orientation of the data to enhance scan
performance. Additionally, introducing the scratch batch concept also
opens the door for codegen runtime filtering and applying conjuncts.

Tesing:
  - Re-run the full test suite to verify that no regression is
introduced.
  - Checked the performance impact by running TPCH workload on a
scale 25 database using single_node_perf_run.py. The total query
runtime is decreased by 0-20% depending on how scan heavy the
particular query was. The more scan heavy the query is the more
performance gain I observe.

Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/CMakeLists.txt
R be/src/exec/hdfs-columnar-scanner-ir.cc
A be/src/exec/hdfs-columnar-scanner.cc
A be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
R be/src/exec/scratch-tuple-batch.h
15 files changed, 464 insertions(+), 140 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/15104/4
--
To view, visit http://gerrit.cloudera.org:8080/15104
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
Gerrit-Change-Number: 15104
Gerrit-PatchSet: 4
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] PoC: IMPALA-9228: ORC scanner reads rows into scratch batch

2020-02-27 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has restored this change. ( http://gerrit.cloudera.org:8080/15104 )

Change subject: PoC: IMPALA-9228: ORC scanner reads rows into scratch batch
..


Restored
--
To view, visit http://gerrit.cloudera.org:8080/15104
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: restore
Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
Gerrit-Change-Number: 15104
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9226: Bump ORC version to 1.6.2-p7

2020-02-25 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15287 )

Change subject: IMPALA-9226: Bump ORC version to 1.6.2-p7
..


Patch Set 1: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/15287
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I444bfac435e5b05eee1ff7c8cf6a32ff5b65
Gerrit-Change-Number: 15287
Gerrit-PatchSet: 1
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 25 Feb 2020 08:57:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9036: Fix CTRL+C a multiline query in impala-shell

2020-02-24 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15233 )

Change subject: IMPALA-9036: Fix CTRL+C a multiline query in impala-shell
..


Patch Set 5: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/15233
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id8d8bdaee929e2655eb66e886ae92a02d3fbd83f
Gerrit-Change-Number: 15233
Gerrit-PatchSet: 5
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 24 Feb 2020 10:41:48 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9036: Fix CTRL+C a multiline query in impala-shell

2020-02-19 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15233 )

Change subject: IMPALA-9036: Fix CTRL+C a multiline query in impala-shell
..


Patch Set 4:

(3 comments)

I just left some nits but in general I'm fine with the change.

http://gerrit.cloudera.org:8080/#/c/15233/4/tests/shell/test_shell_interactive.py
File tests/shell/test_shell_interactive.py:

http://gerrit.cloudera.org:8080/#/c/15233/4/tests/shell/test_shell_interactive.py@253
PS4, Line 253: doesn't ends
nit: doesn't end


http://gerrit.cloudera.org:8080/#/c/15233/4/tests/shell/test_shell_interactive.py@258
PS4, Line 258: "select \"test withouth newline\";\n"
nit: I think if you put this string between single quotes then you don't need 
to escape the double quotes inside. This would make it a bit more readable.


http://gerrit.cloudera.org:8080/#/c/15233/4/tests/shell/test_shell_interactive.py@271
PS4, Line 271: "select \"test with newline\";\n"
nit: same as above



--
To view, visit http://gerrit.cloudera.org:8080/15233
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id8d8bdaee929e2655eb66e886ae92a02d3fbd83f
Gerrit-Change-Number: 15233
Gerrit-PatchSet: 4
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Wed, 19 Feb 2020 21:10:41 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9036: Fix CTRL+C a multiline query in impala-shell

2020-02-19 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15233 )

Change subject: IMPALA-9036: Fix CTRL+C a multiline query in impala-shell
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15233/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15233/2//COMMIT_MSG@9
PS2, Line 9: `_signal_handler()`
nit: instead of backticks please put the function name between single quotes.


http://gerrit.cloudera.org:8080/#/c/15233/2/shell/impala_shell.py
File shell/impala_shell.py:

http://gerrit.cloudera.org:8080/#/c/15233/2/shell/impala_shell.py@562
PS2, Line 562: self.prompt = self.cached_prompt
Question: Have you tried to simple set both self.prompt and self.cached_prompt 
to empty string? I don't understand it 100% what cached_prompt holds in this 
case that we want to reuse.



--
To view, visit http://gerrit.cloudera.org:8080/15233
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id8d8bdaee929e2655eb66e886ae92a02d3fbd83f
Gerrit-Change-Number: 15233
Gerrit-PatchSet: 2
Gerrit-Owner: Adam Tamas 
Gerrit-Reviewer: Adam Tamas 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Comment-Date: Wed, 19 Feb 2020 08:10:15 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9280: Fix parsing of timestamp with dash before TZH

2020-02-05 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15130 )

Change subject: IMPALA-9280: Fix parsing of timestamp with dash before TZH
..


Patch Set 5: Code-Review+2

Thanks for the adjustments!


--
To view, visit http://gerrit.cloudera.org:8080/15130
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I24c636e75fd380f6ebd091bcb38bb60a274f0e00
Gerrit-Change-Number: 15130
Gerrit-PatchSet: 5
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 05 Feb 2020 10:22:15 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9280: Fix parsing of timestamp with dash before TZH

2020-01-30 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15130 )

Change subject: IMPALA-9280: Fix parsing of timestamp with dash before TZH
..


Patch Set 4:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15130/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15130/4//COMMIT_MSG@11
PS4, Line 11: if the negative sign is the only possible
: separator character in the separator substring.
I have the same comment here as in the parser: This doesn't cover the whole 
picture as we also don't include it if the TZH itself has a '+' char as a sign.


http://gerrit.cloudera.org:8080/#/c/15130/4/be/src/runtime/datetime-iso-sql-format-parser.cc
File be/src/runtime/datetime-iso-sql-format-parser.cc:

http://gerrit.cloudera.org:8080/#/c/15130/4/be/src/runtime/datetime-iso-sql-format-parser.cc@296
PS4, Line 296: (*current_pos)
nit: no need for the brackets.


http://gerrit.cloudera.org:8080/#/c/15130/4/be/src/runtime/datetime-iso-sql-format-parser.cc@323
PS4, Line 323: // The last '-' of a separator sequence might be taken as a sign 
for timezone hour.
 :   // UNLESS the last '-' is the only possible separator 
character in the separator
 :   // substring (in which case it is not counted as the timezone 
hour's negative sign).
I'd rephrase this comment a bit as it misses to mention that if the TZH starts 
with a '+' then we again don't take the last '-' of the separator sequence as 
the sign.

Let's say something like this: "If the TZH token itself doesn't start with a 
sign and the separator sequence contains more than one separators  then the 
last '-' of the separator sequence is taken as the sign of TZH".

What do you think?



--
To view, visit http://gerrit.cloudera.org:8080/15130
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I24c636e75fd380f6ebd091bcb38bb60a274f0e00
Gerrit-Change-Number: 15130
Gerrit-PatchSet: 4
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 31 Jan 2020 07:19:45 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9280: Fix parsing of timestamp with dash before TZH

2020-01-30 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15130 )

Change subject: IMPALA-9280: Fix parsing of timestamp with dash before TZH
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15130/2/be/src/runtime/datetime-iso-sql-format-parser.cc
File be/src/runtime/datetime-iso-sql-format-parser.cc:

http://gerrit.cloudera.org:8080/#/c/15130/2/be/src/runtime/datetime-iso-sql-format-parser.cc@332
PS2, Line 332: && cur_pos_move_count > cur_tok_move_count && *(*current_pos) != 
'+'
> What I understand from your explanation, the number of separator chars in t
Basically separator matching works in a way that if there is at least one 
separator in a position somewhere in the format then there should be at least 
one separator in the input string in that section. (By position I don't mean 
the exact index of that character, rather the relative location to the 
surrounding tokens).

Those examples are correct.

For more explanation please check the design doc for this feature (feel free to 
ask if you still have questions):
https://docs.google.com/document/d/1V7k6-lrPGW7_uhqM-FhKl3QsxwCRy69v2KIxPsGjc1k/



--
To view, visit http://gerrit.cloudera.org:8080/15130
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I24c636e75fd380f6ebd091bcb38bb60a274f0e00
Gerrit-Change-Number: 15130
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 30 Jan 2020 18:44:10 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9280: Fix parsing of timestamp with dash before TZH

2020-01-30 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15130 )

Change subject: IMPALA-9280: Fix parsing of timestamp with dash before TZH
..


Patch Set 2:

(3 comments)

Thanks for taking care of this!

http://gerrit.cloudera.org:8080/#/c/15130/2/be/src/runtime/datetime-iso-sql-format-parser.cc
File be/src/runtime/datetime-iso-sql-format-parser.cc:

http://gerrit.cloudera.org:8080/#/c/15130/2/be/src/runtime/datetime-iso-sql-format-parser.cc@332
PS2, Line 332: && cur_pos_move_count > cur_tok_move_count && *(*current_pos) != 
'+'
Counting the processed separator tokens doesn't seem the best approach for me 
here. Also I don't see why it is necessary to have more separators in the input 
than in the format to use the last minus sign as a part of TZH instead of 
taking it as a separator.

Previously this if condition said "If a TZH token follows the separator 
sequence and the last separator in the sequence was a minus char then I'll use 
that minus char as the sign of the TZH".
I'd extend this: "If a TZH token follows the separator sequence and the last 
separator in the sequence was a minus char and the TZH token doesn't start with 
a plus char then ..."

Please make sure that in case the minus char is actually taken as sign of TZH 
instead of a separator then there is still at least one separator remaining in 
the input sequence to match the separator sequence in the token list.


http://gerrit.cloudera.org:8080/#/c/15130/2/be/src/runtime/timestamp-test.cc
File be/src/runtime/timestamp-test.cc:

http://gerrit.cloudera.org:8080/#/c/15130/2/be/src/runtime/timestamp-test.cc@1071
PS2, Line 1071: TEST(TimestampTest, IsoSqlWithTZ) {
Almost all of these formatting tests are in 
tests/query_test/test_cast_with_format.py
I think we should keep them together. Please consider moving this coverage.


http://gerrit.cloudera.org:8080/#/c/15130/2/be/src/runtime/timestamp-test.cc@1078
PS2, Line 1078: HH:MI-TZH
As far as I know we don't really support dateless timestamps even though Impala 
accepts them. Additionally I think we should remove the support of them from 
Impala at one point (AFAIK JDBC/ODBC connections throw parse error for dateless 
timestamps currently). So might be safer not to use them even in tests as once 
we drop the support we have to rewrite less code.



--
To view, visit http://gerrit.cloudera.org:8080/15130
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I24c636e75fd380f6ebd091bcb38bb60a274f0e00
Gerrit-Change-Number: 15130
Gerrit-PatchSet: 2
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 30 Jan 2020 09:00:02 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

2020-01-24 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15051/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15051/3//COMMIT_MSG@24
PS3, Line 24: Some results:
As discussed offline: These measurement were run with a debug build. Let' see 
the same with release builds as that is the one we would like to optimize.
I also experienced a big difference in performance between debug and perf.


http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc@625
PS3, Line 625:
nit: this extra new line is not needed



--
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 24 Jan 2020 14:50:25 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9228: ORC scanner reads rows into scratch batch

2020-01-24 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/15104


Change subject: IMPALA-9228: ORC scanner reads rows into scratch batch
..

IMPALA-9228: ORC scanner reads rows into scratch batch

Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
5 files changed, 153 insertions(+), 9 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/15104/1
--
To view, visit http://gerrit.cloudera.org:8080/15104
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
Gerrit-Change-Number: 15104
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab 


[Impala-ASF-CR] PoC: IMPALA-9228: ORC scanner reads rows into scratch batch

2020-01-24 Thread Gabor Kaszab (Code Review)
Hello Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15104

to look at the new patch set (#2).

Change subject: PoC: IMPALA-9228: ORC scanner reads rows into scratch batch
..

PoC: IMPALA-9228: ORC scanner reads rows into scratch batch

Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
5 files changed, 153 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/15104/2
--
To view, visit http://gerrit.cloudera.org:8080/15104
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I56db0325dee283d73742ebbae412d19693fac0ca
Gerrit-Change-Number: 15104
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

2020-01-20 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
..


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.h@207
PS2, Line 207: /*&& !orc_batch->isEncoded*/
nit: pls remove commented code


http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc@150
PS2, Line 150:   if (blob_ == nullptr) {
nit: It is probably just personal preference but I find it more readable to 
return early in edge cases. This saves some indentation later on. In this case 
if (blob_ != nullptr) return Status::OK()


http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc@151
PS2, Line 151: TODO
Is this TODO going to be taken care within this patch or do you expect this to 
be handled somewhere in the future unrelated to this change? If the second one 
then opening a Jira and adding the Jira ID to the comment would make sense. (I 
see that the TODO was added after a review comment requested it, though).


http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc@167
PS2, Line 167:
nit : too many spaces here and below


http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc@176
PS2, Line 176: InitBlob
I'm wondering if calling InitBlob() could move into UpdateInputBatch() from 
here and from L186. This way you don't have to do the "if (blob_ == nullptr)" 
check in every ReadValue() through the InitBlob() functions just simply you can 
DCHECK that it's not null.



--
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 2
Gerrit-Owner: Norbert Luksa 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 20 Jan 2020 14:19:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner

2020-01-17 Thread Gabor Kaszab (Code Review)
Hello Quanlong Huang, Norbert Luksa, Zoltan Borok-Nagy, Attila Jeges, Csaba 
Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14982

to look at the new patch set (#9).

Change subject: IMPALA-8801: Date type support for ORC scanner
..

IMPALA-8801: Date type support for ORC scanner

Implements read path for the date type in ORC scanner. The internal
representation of a date is an int32 meaning the number of days since
Unix epoch using proleptic Gregorian calendar.

Similarly to the Parquet implementation (IMPALA-7370) this
representation introduces an interoperability issue between Impala
and older versions of Hive (before 3.1). For more details see the
commit message of the mentioned Parquet implementation.

Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
---
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/orc-metadata-utils.cc
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M testdata/data/README
A testdata/data/hive2_pre_gregorian.orc
A testdata/data/out_of_range_date.orc
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M 
testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test
A 
testdata/workloads/functional-query/queries/QueryTest/hive2-pre-gregorian-date-orc.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-date-orc.test
M tests/query_test/test_scanners.py
16 files changed, 170 insertions(+), 29 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/14982/9
--
To view, visit http://gerrit.cloudera.org:8080/14982
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
Gerrit-Change-Number: 14982
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner

2020-01-17 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14982 )

Change subject: IMPALA-8801: Date type support for ORC scanner
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14982/8/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/14982/8/tests/query_test/test_scanners.py@1302
PS8, Line 1302: an
> nit: the comments could be updated a bit, since not only one illtypes table
Done



--
To view, visit http://gerrit.cloudera.org:8080/14982
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
Gerrit-Change-Number: 14982
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 17 Jan 2020 13:06:54 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner

2020-01-15 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14982 )

Change subject: IMPALA-8801: Date type support for ORC scanner
..


Patch Set 8:

PS8 is a rebase with master to resolve conflict.


--
To view, visit http://gerrit.cloudera.org:8080/14982
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
Gerrit-Change-Number: 14982
Gerrit-PatchSet: 8
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 15 Jan 2020 09:53:16 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner

2020-01-15 Thread Gabor Kaszab (Code Review)
Hello Quanlong Huang, Norbert Luksa, Zoltan Borok-Nagy, Attila Jeges, Csaba 
Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14982

to look at the new patch set (#8).

Change subject: IMPALA-8801: Date type support for ORC scanner
..

IMPALA-8801: Date type support for ORC scanner

Implements read path for the date type in ORC scanner. The internal
representation of a date is an int32 meaning the number of days since
Unix epoch using proleptic Gregorian calendar.

Similarly to the Parquet implementation (IMPALA-7370) this
representation introduces an interoperability issue between Impala
and older versions of Hive (before 3.1). For more details see the
commit message of the mentioned Parquet implementation.

Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
---
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/orc-metadata-utils.cc
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M testdata/data/README
A testdata/data/hive2_pre_gregorian.orc
A testdata/data/out_of_range_date.orc
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M 
testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test
A 
testdata/workloads/functional-query/queries/QueryTest/hive2-pre-gregorian-date-orc.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-date-orc.test
M tests/query_test/test_scanners.py
16 files changed, 168 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/14982/8
--
To view, visit http://gerrit.cloudera.org:8080/14982
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
Gerrit-Change-Number: 14982
Gerrit-PatchSet: 8
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner

2020-01-15 Thread Gabor Kaszab (Code Review)
Hello Quanlong Huang, Norbert Luksa, Zoltan Borok-Nagy, Attila Jeges, Csaba 
Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14982

to look at the new patch set (#7).

Change subject: IMPALA-8801: Date type support for ORC scanner
..

IMPALA-8801: Date type support for ORC scanner

Implements read path for the date type in ORC scanner. The internal
representation of a date is an int32 meaning the number of days since
Unix epoch using proleptic Gregorian calendar.

Similarly to the Parquet implementation (IMPALA-7370) this
representation introduces an interoperability issue between Impala
and older versions of Hive (before 3.1). For more details see the
commit message of the mentioned Parquet implementation.

Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
---
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/orc-metadata-utils.cc
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M testdata/data/README
A testdata/data/hive2_pre_gregorian.orc
A testdata/data/out_of_range_date.orc
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M 
testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test
A 
testdata/workloads/functional-query/queries/QueryTest/hive2-pre-gregorian-date-orc.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-date-orc.test
M tests/query_test/test_scanners.py
16 files changed, 168 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/14982/7
--
To view, visit http://gerrit.cloudera.org:8080/14982
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
Gerrit-Change-Number: 14982
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner

2020-01-15 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14982 )

Change subject: IMPALA-8801: Date type support for ORC scanner
..


Patch Set 7:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/14982/2/be/src/exec/orc-metadata-utils.cc
File be/src/exec/orc-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/14982/2/be/src/exec/orc-metadata-utils.cc@181
PS2, Line 181: case orc::TypeKind::DATE:
 :   if (type.type == TYPE_DATE) return Status::OK();
 :   break;
 : d
> > If I extend the DATE case to also allow "type.type == TYPE_TIMESTAMP" the
Sorry, missed to address the end of the comment. Done


http://gerrit.cloudera.org:8080/#/c/14982/6/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/14982/6/tests/query_test/test_scanners.py@306
PS6, Line 306: class TestParquet(ImpalaTestSuite):
> nit: merge this into TestOrc below?
sure, Done


http://gerrit.cloudera.org:8080/#/c/14982/6/tests/query_test/test_scanners.py@1335
PS6, Line 1335: % unique_database)
> It'd be better to add test coverage for date and timestamp type compactibil
Done



--
To view, visit http://gerrit.cloudera.org:8080/14982
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
Gerrit-Change-Number: 14982
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 15 Jan 2020 09:34:13 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner

2020-01-13 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14982 )

Change subject: IMPALA-8801: Date type support for ORC scanner
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14982/5/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/14982/5/tests/query_test/test_scanners.py@305
PS5, Line 305:
> flake8: E302 expected 2 blank lines, found 1
Done



--
To view, visit http://gerrit.cloudera.org:8080/14982
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
Gerrit-Change-Number: 14982
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 13 Jan 2020 14:03:07 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner

2020-01-13 Thread Gabor Kaszab (Code Review)
Hello Quanlong Huang, Norbert Luksa, Zoltan Borok-Nagy, Attila Jeges, Csaba 
Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14982

to look at the new patch set (#6).

Change subject: IMPALA-8801: Date type support for ORC scanner
..

IMPALA-8801: Date type support for ORC scanner

Implements read path for the date type in ORC scanner. The internal
representation of a date is an int32 meaning the number of days since
Unix epoch using proleptic Gregorian calendar.

Similarly to the Parquet implementation (IMPALA-7370) this
representation introduces an interoperability issue between Impala
and older versions of Hive (before 3.1). For more details see the
commit message of the mentioned Parquet implementation.

Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
---
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/orc-metadata-utils.cc
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M testdata/data/README
A testdata/data/hive2_pre_gregorian.orc
A testdata/data/out_of_range_date.orc
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test
A 
testdata/workloads/functional-query/queries/QueryTest/hive2-pre-gregorian-date-orc.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-date-orc.test
M tests/query_test/test_scanners.py
15 files changed, 154 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/14982/6
--
To view, visit http://gerrit.cloudera.org:8080/14982
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
Gerrit-Change-Number: 14982
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner

2020-01-13 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14982 )

Change subject: IMPALA-8801: Date type support for ORC scanner
..


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14982/2/be/src/exec/orc-metadata-utils.cc
File be/src/exec/orc-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/14982/2/be/src/exec/orc-metadata-utils.cc@181
PS2, Line 181: case orc::TypeKind::DATE:
 :   if (type.type == TYPE_DATE) return Status::OK();
 :   break;
 : d
> I'm ok to do this in another JIRA.
Thanks for the explanation Quanlong! I oepened a separate Jira for this: 
https://issues.apache.org/jira/browse/IMPALA-9290

One thing I don't get here. If I leave this file intact then I get the "Type 
mismatch" below. If I extend the DATE case to also allow "type.type == 
TYPE_TIMESTAMP" then the DCHECK fails in UpdateInputBatch().


http://gerrit.cloudera.org:8080/#/c/14982/4/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/14982/4/tests/query_test/test_scanners.py@352
PS4, Line 352:   def test_parquet(self, vector):
 : self.run_test_case('QueryTest/parquet', vector)
 :
 :   def test_corrupt_files(self, vector):
 : new_vector = deepcopy(vector)
 : del new_vector.get_value('exec_option')['num_nodes']  # 
.test file sets num_nodes
 : new_vector.get_value('exec_option')['abort_on_error'] = 0
 : self.run_test_case('QueryTest/parquet-continue-on-error', 
new_vector)
 : new_vector.get_value('exec_option')['abort_on_error'] = 1
 : self.run_test_case('QueryTest/parquet-abort-on-error', 
new_vector)
 :
 :   def test_timestamp_out_of_range(self, vector, unique_database):
 : """IMPALA-4363: Test scanning parquet files with an out of 
range timestamp.
 :Also tests IMPALA-7595: Test Parquet timestamp columns 
where the time part
 :is out of the valid range [0..24H).
 : """
 : # out of range date part
 : create_table_from_parquet(self.client, unique_database, 
"out_of_range_timestamp")
 :
 : # out of range time part
 : create_table_from_parquet(self.client, unique_database, 
"out_of_range_time_of_day")
 :
> These tests are related to ORC, but we are in class TestParquet.
Indeed :) Done



--
To view, visit http://gerrit.cloudera.org:8080/14982
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
Gerrit-Change-Number: 14982
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 13 Jan 2020 13:54:54 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8801: Date type support for ORC scanner

2020-01-13 Thread Gabor Kaszab (Code Review)
Hello Quanlong Huang, Norbert Luksa, Zoltan Borok-Nagy, Attila Jeges, Csaba 
Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14982

to look at the new patch set (#5).

Change subject: IMPALA-8801: Date type support for ORC scanner
..

IMPALA-8801: Date type support for ORC scanner

Implements read path for the date type in ORC scanner. The internal
representation of a date is an int32 meaning the number of days since
Unix epoch using proleptic Gregorian calendar.

Similarly to the Parquet implementation (IMPALA-7370) this
representation introduces an interoperability issue between Impala
and older versions of Hive (before 3.1). For more details see the
commit message of the mentioned Parquet implementation.

Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
---
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/orc-metadata-utils.cc
M common/thrift/generate_error_codes.py
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M testdata/data/README
A testdata/data/hive2_pre_gregorian.orc
A testdata/data/out_of_range_date.orc
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-query/queries/QueryTest/date-fileformat-support.test
A 
testdata/workloads/functional-query/queries/QueryTest/hive2-pre-gregorian-date-orc.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-date-orc.test
M tests/query_test/test_scanners.py
15 files changed, 152 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/82/14982/5
--
To view, visit http://gerrit.cloudera.org:8080/14982
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
Gerrit-Change-Number: 14982
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Norbert Luksa 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Zoltan Borok-Nagy 


<    2   3   4   5   6   7   8   9   10   11   >