[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages

2021-01-05 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/16893 )

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..

IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages

The encoding is identical to the already-supported PLAIN_DICTIONARY
encoding but the PLAIN enum value is used for the dictionary pages
and the RLE_DICTIONARY enum value is used for the data pages.
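
To illustrate the mapping described above, here is a minimal sketch of how a reader might classify page encodings. It is not Impala's code: the `Encoding` enum copies a subset of the values from the Parquet format definition, and `IsDictionaryDataPage` and `IsValidDictionaryPageEncoding` are hypothetical helper names introduced only for this example.

```cpp
#include <cassert>

// Subset of the Parquet format's Encoding enum (numeric values as in
// parquet.thrift).
enum class Encoding { PLAIN = 0, PLAIN_DICTIONARY = 2, RLE = 3, RLE_DICTIONARY = 8 };

// Hypothetical helper: a data page holds dictionary indices under either the
// legacy or the new enum value; the page payload is laid out the same way.
constexpr bool IsDictionaryDataPage(Encoding e) {
  return e == Encoding::PLAIN_DICTIONARY || e == Encoding::RLE_DICTIONARY;
}

// Hypothetical helper: the dictionary page stores plain values and is tagged
// PLAIN_DICTIONARY by legacy writers or PLAIN by writers using the new values.
constexpr bool IsValidDictionaryPageEncoding(Encoding e) {
  return e == Encoding::PLAIN_DICTIONARY || e == Encoding::PLAIN;
}
```

Under these assumptions only the enum check differs between the two schemes, so the same decoding path can serve both.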

A hidden option -write_new_parquet_dictionary_encodings is
added to turn on writing too, for test purposes only.

Testing:
* Added an automated test using a pregenerated test file.
* Ran core tests.
* Manually tested by writing out TPC-H lineitem with the new encoding
  and reading back in Impala and Hive.

Parquet-tools output for the generated test file:
$ hadoop jar ~/repos/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar meta /test-warehouse/att/824de2afebad009f-6f460ade0003_643159826_data.0.parq
20/12/21 20:28:36 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
20/12/21 20:28:36 INFO hadoop.ParquetFileReader: reading another 1 footers
20/12/21 20:28:36 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
file:    hdfs://localhost:20500/test-warehouse/att/824de2afebad009f-6f460ade0003_643159826_data.0.parq
creator: impala version 4.0.0-SNAPSHOT (build 7b691c5d4249f0cb1ced8ddf01033fbbe10511d9)

file schema: schema

id:              OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
bool_col:        OPTIONAL BOOLEAN R:0 D:1
tinyint_col:     OPTIONAL INT32 L:INTEGER(8,true) R:0 D:1
smallint_col:    OPTIONAL INT32 L:INTEGER(16,true) R:0 D:1
int_col:         OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
bigint_col:      OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
float_col:       OPTIONAL FLOAT R:0 D:1
double_col:      OPTIONAL DOUBLE R:0 D:1
date_string_col: OPTIONAL BINARY R:0 D:1
string_col:      OPTIONAL BINARY R:0 D:1
timestamp_col:   OPTIONAL INT96 R:0 D:1
year:            OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
month:           OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1

row group 1: RC:8 TS:754 OFFSET:4

id:              INT32 SNAPPY DO:4 FPO:48 SZ:74/73/0.99 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 7, num_nulls: 0]
bool_col:        BOOLEAN SNAPPY DO:0 FPO:141 SZ:26/24/0.92 VC:8 ENC:RLE,PLAIN ST:[min: false, max: true, num_nulls: 0]
tinyint_col:     INT32 SNAPPY DO:220 FPO:243 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
smallint_col:    INT32 SNAPPY DO:343 FPO:366 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
int_col:         INT32 SNAPPY DO:467 FPO:490 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
bigint_col:      INT64 SNAPPY DO:586 FPO:617 SZ:59/55/0.93 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 10, num_nulls: 0]
float_col:       FLOAT SNAPPY DO:724 FPO:747 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: -0.0, max: 1.1, num_nulls: 0]
double_col:      DOUBLE SNAPPY DO:845 FPO:876 SZ:59/55/0.93 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: -0.0, max: 10.1, num_nulls: 0]
date_string_col: BINARY SNAPPY DO:983 FPO:1028 SZ:74/88/1.19 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0x30312F30312F3039, max: 0x30342F30312F3039, num_nulls: 0]
string_col:      BINARY SNAPPY DO:1143 FPO:1168 SZ:53/49/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0x30, max: 0x31, num_nulls: 0]
timestamp_col:   INT96 SNAPPY DO:1261 FPO:1329 SZ:98/138/1.41 VC:8 ENC:RLE,RLE_DICTIONARY ST:[num_nulls: 0, min/max not defined]
year:            INT32 SNAPPY DO:1451 FPO:1470 SZ:47/43/0.91 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 2009, max: 2009, num_nulls: 0]
month:           INT32 SNAPPY DO:1563 FPO:1594 SZ:60/56/0.93 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 1, max: 4, num_nulls: 0]
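
The BINARY columns in the output above report their min/max statistics as hex strings. As a reading aid, the sketch below decodes such a string into its raw bytes. `HexToBytes` is a hypothetical helper written for this note; it is not part of parquet-tools or Impala, and it assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes a hex literal such as "0x30312F30312F3039" into the bytes it
// represents. Assumes an even number of valid hex digits after the optional
// "0x" prefix; there is no error handling, this is for illustration only.
std::string HexToBytes(const std::string& hex) {
  // Skip the "0x" prefix if present.
  std::size_t start = (hex.rfind("0x", 0) == 0) ? 2 : 0;
  std::string out;
  for (std::size_t i = start; i + 1 < hex.size(); i += 2) {
    // Each pair of hex digits is one byte.
    out.push_back(static_cast<char>(std::stoi(hex.substr(i, 2), nullptr, 16)));
  }
  return out;
}
```

Applied to the date_string_col statistics, the min 0x30312F30312F3039 decodes to the ASCII string 01/01/09 and the max 0x30342F30312F3039 to 04/01/09; the string_col bounds 0x30 and 0x31 are the characters 0 and 1.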

Parquet-tools output for one of the lineitem files:
$ hadoop jar ~/repos/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar meta /test-warehouse/li2/4b4d9143c575dd71-3f69d3cf0001_1879643220_data.0.parq
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: reading another 1 footers
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
file:    hdfs://localhost:20500/test-warehouse/li2/4b4d9143c575dd71-3f69d3cf0001_1879643220_data.0.parq
creator: impala version 4.0.0-SNAPSHOT (build 7b691c5d4249f0cb1ced8ddf01033fbbe10511d9)

file schema: schema

l_orderkey:  OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
l_partkey:   OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
l_suppkey:   

[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages

2021-01-05 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16893 )

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..


Patch Set 6: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/16893
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a
Gerrit-Change-Number: 16893
Gerrit-PatchSet: 6
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 05 Jan 2021 23:30:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages

2021-01-05 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16893 )

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..


Patch Set 5:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7958/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a
Gerrit-Change-Number: 16893
Gerrit-PatchSet: 5
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 05 Jan 2021 18:00:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages

2021-01-05 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16893 )

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6825/ 
DRY_RUN=false


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a
Gerrit-Change-Number: 16893
Gerrit-PatchSet: 6
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 05 Jan 2021 17:39:03 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages

2021-01-05 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16893 )

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..


Patch Set 6: Code-Review+2


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a
Gerrit-Change-Number: 16893
Gerrit-PatchSet: 6
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 05 Jan 2021 17:39:02 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages

2021-01-05 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16893 )

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..


Patch Set 5: Code-Review+2

Carry +2


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a
Gerrit-Change-Number: 16893
Gerrit-PatchSet: 5
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 05 Jan 2021 17:38:43 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages

2021-01-05 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16893 )

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16893/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16893/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc@893
PS4, Line 893:   if (IsDictionaryEncoding(current_encoding_)
             :       && FLAGS_write_new_parquet_dictionary_encodings) {
             :     header.data_page_header.encoding = parquet::Encoding::RLE_DICTIONARY;
             :   }
> Is this 'if' statement still needed? Now that we use 'DataPageDictionaryEnc
Done



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a
Gerrit-Change-Number: 16893
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 05 Jan 2021 17:38:32 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages

2021-01-05 Thread Tim Armstrong (Code Review)
Hello Zoltan Borok-Nagy, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16893

to look at the new patch set (#5).

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..

IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages

The encoding is identical to the already-supported PLAIN_DICTIONARY
encoding but the PLAIN enum value is used for the dictionary pages
and the RLE_DICTIONARY enum value is used for the data pages.

A hidden option -write_new_parquet_dictionary_encodings is
added to turn on writing too, for test purposes only.

Testing:
* Added an automated test using a pregenerated test file.
* Ran core tests.
* Manually tested by writing out TPC-H lineitem with the new encoding
  and reading back in Impala and Hive.

[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages

2021-01-05 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16893 )

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..


Patch Set 4: Code-Review+2

+2 from my side too


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a
Gerrit-Change-Number: 16893
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 05 Jan 2021 16:00:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages

2021-01-05 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16893 )

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..


Patch Set 4: Code-Review+2

(1 comment)

An 'if' statement might be redundant, but other than that LGTM.

http://gerrit.cloudera.org:8080/#/c/16893/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16893/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc@893
PS4, Line 893:   if (IsDictionaryEncoding(current_encoding_)
             :       && FLAGS_write_new_parquet_dictionary_encodings) {
             :     header.data_page_header.encoding = parquet::Encoding::RLE_DICTIONARY;
             :   }
Is this 'if' statement still needed now that we use
'DataPageDictionaryEncoding()' to set current_encoding_?



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a
Gerrit-Change-Number: 16893
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 05 Jan 2021 10:36:37 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages

2021-01-05 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16893 )

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..


Patch Set 4:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7951/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a
Gerrit-Change-Number: 16893
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 05 Jan 2021 08:19:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages

2021-01-04 Thread Tim Armstrong (Code Review)
Hello Zoltan Borok-Nagy, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16893

to look at the new patch set (#4).

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..

IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages

The encoding is identical to the already-supported PLAIN_DICTIONARY
encoding but the PLAIN enum value is used for the dictionary pages
and the RLE_DICTIONARY enum value is used for the data pages.

A hidden option -write_new_parquet_dictionary_encodings is
added to turn on writing too, for test purposes only.

Testing:
* Added an automated test using a pregenerated test file.
* Ran core tests.
* Manually tested by writing out TPC-H lineitem with the new encoding
  and reading back in Impala and Hive.

[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages

2021-01-04 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16893 )

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16893/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16893/3//COMMIT_MSG@10
PS3, Line 10: PLAIN/
> PLAIN is the new way AFAIK, so we use PLAIN for the dictionary page and RLE
Done. Good point


http://gerrit.cloudera.org:8080/#/c/16893/3//COMMIT_MSG@10
PS3, Line 10: old PLAIN/PLAIN_DICTIONARY values.
> Maybe you could emphasise that the data is still encoded the same way.
Done


http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@92
PS3, Line 92: use
> nit: maybe write_new_parquet_dictionary_encodings to be more explicit?
Done


http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@881
PS3, Line 881: current_encoding_
> I wonder if the code would be cleaner/less error-prone if 'current_encoding
I switched to doing this. I'd initially thought it would be bad to add the
extra branch in ProcessValue(), but on further thought it shouldn't matter,
since it should be a predictable branch.


http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/parquet-column-readers.cc@326
PS3, Line 326: so
> nit: to
Done


http://gerrit.cloudera.org:8080/#/c/16893/3/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/16893/3/testdata/data/README@593
PS3, Line 593:
> is the newline intentional?
Done



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a
Gerrit-Change-Number: 16893
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 05 Jan 2021 07:57:34 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages

2021-01-04 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16893 )

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..


Patch Set 3:

(6 comments)

A few nits, but the code looks good to me overall.

http://gerrit.cloudera.org:8080/#/c/16893/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16893/3//COMMIT_MSG@10
PS3, Line 10: old PLAIN/PLAIN_DICTIONARY values.
Maybe you could emphasise that the data is still encoded the same way.


http://gerrit.cloudera.org:8080/#/c/16893/3//COMMIT_MSG@10
PS3, Line 10: PLAIN/
PLAIN is the new way AFAIK, so we use PLAIN for the dictionary page and
RLE_DICTIONARY for the data pages.

The old way was to use PLAIN_DICTIONARY everywhere, which meant PLAIN encoding
for the dictionary page and RLE-encoded dictionary keys for the data pages.
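
For readers unfamiliar with the layout being discussed, the following is a deliberately simplified sketch of dictionary decoding: a dictionary page holds the distinct values, and a data page holds run-length encoded keys into it. The `Run` struct and `DecodeDictRuns` function are hypothetical, and the sketch ignores the bit widths and bit-packed runs that the real Parquet hybrid RLE format also uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One run of repeated dictionary keys: `key` repeated `count` times.
struct Run {
  uint32_t key;
  uint32_t count;
};

// Expands run-length encoded dictionary keys into values. `dict` plays the
// role of the plain-encoded dictionary page, `runs` the role of a data page.
// Out-of-range keys are skipped here; a real reader would report corruption.
std::vector<std::string> DecodeDictRuns(const std::vector<std::string>& dict,
                                        const std::vector<Run>& runs) {
  std::vector<std::string> out;
  for (const Run& run : runs) {
    if (run.key >= dict.size()) continue;
    // Append `count` copies of the dictionary entry for this key.
    out.insert(out.end(), run.count, dict[run.key]);
  }
  return out;
}
```

Nothing in this decoding step depends on which enum value the page header carries, which is why the old and new schemes can share one decoder.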


http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@92
PS3, Line 92: use
nit: maybe write_new_parquet_dictionary_encodings to be more explicit?


http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@881
PS3, Line 881: current_encoding_
I wonder if the code would be cleaner/less error-prone if 'current_encoding_' 
stored the actual encoding. So probably we could move this 'if' to the place 
where we set 'current_encoding_'.
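
The suggestion above can be sketched roughly as follows. This is a hedged model, not the actual Impala writer: `WriterOptions`, `ColumnWriterSketch`, `DictionaryPageEncoding` and `FallBackToPlain` are hypothetical names, and the signature given to `DataPageDictionaryEncoding` is an assumption. The point is only that if `current_encoding_` stores the enum value that will be written, the page header code can copy it without a second check of the flag.

```cpp
#include <cassert>

// Subset of the Parquet Encoding enum (numeric values as in parquet.thrift).
enum class Encoding { PLAIN = 0, PLAIN_DICTIONARY = 2, RLE_DICTIONARY = 8 };

// Stand-in for the hidden flag that enables writing the new enum values.
struct WriterOptions {
  bool write_new_dictionary_encodings = false;
};

// Enum value to record in the header of a dictionary-encoded data page.
Encoding DataPageDictionaryEncoding(const WriterOptions& opts) {
  return opts.write_new_dictionary_encodings ? Encoding::RLE_DICTIONARY
                                             : Encoding::PLAIN_DICTIONARY;
}

// Enum value to record in the header of the dictionary page itself.
Encoding DictionaryPageEncoding(const WriterOptions& opts) {
  return opts.write_new_dictionary_encodings ? Encoding::PLAIN
                                             : Encoding::PLAIN_DICTIONARY;
}

// Minimal column-writer model: current_encoding_ always holds the value that
// goes into the data page header, so no extra 'if' is needed at that point.
class ColumnWriterSketch {
 public:
  explicit ColumnWriterSketch(const WriterOptions& opts)
      : current_encoding_(DataPageDictionaryEncoding(opts)) {}

  // Models abandoning dictionary encoding for this column; when a real
  // writer would do so is outside the scope of this sketch.
  void FallBackToPlain() { current_encoding_ = Encoding::PLAIN; }

  Encoding DataPageHeaderEncoding() const { return current_encoding_; }

 private:
  Encoding current_encoding_;
};
```

The cost of this design is that any code asking "is this column dictionary encoded?" must accept both dictionary enum values rather than compare against a single one.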


http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/parquet-column-readers.cc@326
PS3, Line 326: so
nit: to


http://gerrit.cloudera.org:8080/#/c/16893/3/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/16893/3/testdata/data/README@593
PS3, Line 593:
is the newline intentional?



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a
Gerrit-Change-Number: 16893
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 04 Jan 2021 19:21:47 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages

2020-12-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16893 )

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..


Patch Set 3:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/7897/ : Initial code 
review checks failed. See linked job for details on the failure.


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a
Gerrit-Change-Number: 16893
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 22 Dec 2020 17:52:49 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages

2020-12-22 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new patch set (#3). ( 
http://gerrit.cloudera.org:8080/16893 )

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..

IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages

This adds support for using this enum value instead of the
old PLAIN/PLAIN_DICTIONARY values.

A hidden option -use_new_parquet_dictionary_encodings is
added to turn on writing too, for test purposes only.

Testing:
* Added an automated test using a pregenerated test file.
* Ran core tests.
* Manually tested by writing out TPC-H lineitem with the new encoding
  and reading back in Impala and Hive.

Parquet-tools output for one of the lineitem files:
$ hadoop jar ~/repos/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar meta /test-warehouse/li2/4b4d9143c575dd71-3f69d3cf0001_1879643220_data.0.parq
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: reading another 1 footers
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
file:    hdfs://localhost:20500/test-warehouse/li2/4b4d9143c575dd71-3f69d3cf0001_1879643220_data.0.parq
creator: impala version 4.0.0-SNAPSHOT (build 7b691c5d4249f0cb1ced8ddf01033fbbe10511d9)

file schema: schema

l_orderkey:  OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
l_partkey:   OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
l_suppkey:   OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
l_linenumber:OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
l_quantity:  O