[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16893 )

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..

IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages

The encoding is identical to the already-supported PLAIN_DICTIONARY encoding but the PLAIN enum value is used for the dictionary pages and the RLE_DICTIONARY enum value is used for the data pages.

A hidden option -write_new_parquet_dictionary_encodings is added to turn on writing too, for test purposes only.

Testing:
* Added an automated test using a pregenerated test file.
* Ran core tests.
* Manually tested by writing out TPC-H lineitem with the new encoding and reading back in Impala and Hive.

Parquet-tools output for the generated test file:
$ hadoop jar ~/repos/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar meta /test-warehouse/att/824de2afebad009f-6f460ade0003_643159826_data.0.parq
20/12/21 20:28:36 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
20/12/21 20:28:36 INFO hadoop.ParquetFileReader: reading another 1 footers
20/12/21 20:28:36 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
file: hdfs://localhost:20500/test-warehouse/att/824de2afebad009f-6f460ade0003_643159826_data.0.parq
creator: impala version 4.0.0-SNAPSHOT (build 7b691c5d4249f0cb1ced8ddf01033fbbe10511d9)

file schema: schema
id: OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
bool_col: OPTIONAL BOOLEAN R:0 D:1
tinyint_col: OPTIONAL INT32 L:INTEGER(8,true) R:0 D:1
smallint_col: OPTIONAL INT32 L:INTEGER(16,true) R:0 D:1
int_col: OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
bigint_col: OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
float_col: OPTIONAL FLOAT R:0 D:1
double_col: OPTIONAL DOUBLE R:0 D:1
date_string_col: OPTIONAL BINARY R:0 D:1
string_col: OPTIONAL BINARY R:0 D:1
timestamp_col: OPTIONAL INT96 R:0 D:1
year: OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
month: OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1

row group 1: RC:8 TS:754 OFFSET:4
id: INT32 SNAPPY DO:4 FPO:48 SZ:74/73/0.99 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 7, num_nulls: 0]
bool_col: BOOLEAN SNAPPY DO:0 FPO:141 SZ:26/24/0.92 VC:8 ENC:RLE,PLAIN ST:[min: false, max: true, num_nulls: 0]
tinyint_col: INT32 SNAPPY DO:220 FPO:243 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
smallint_col: INT32 SNAPPY DO:343 FPO:366 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
int_col: INT32 SNAPPY DO:467 FPO:490 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
bigint_col: INT64 SNAPPY DO:586 FPO:617 SZ:59/55/0.93 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 10, num_nulls: 0]
float_col: FLOAT SNAPPY DO:724 FPO:747 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: -0.0, max: 1.1, num_nulls: 0]
double_col: DOUBLE SNAPPY DO:845 FPO:876 SZ:59/55/0.93 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: -0.0, max: 10.1, num_nulls: 0]
date_string_col: BINARY SNAPPY DO:983 FPO:1028 SZ:74/88/1.19 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0x30312F30312F3039, max: 0x30342F30312F3039, num_nulls: 0]
string_col: BINARY SNAPPY DO:1143 FPO:1168 SZ:53/49/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0x30, max: 0x31, num_nulls: 0]
timestamp_col: INT96 SNAPPY DO:1261 FPO:1329 SZ:98/138/1.41 VC:8 ENC:RLE,RLE_DICTIONARY ST:[num_nulls: 0, min/max not defined]
year: INT32 SNAPPY DO:1451 FPO:1470 SZ:47/43/0.91 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 2009, max: 2009, num_nulls: 0]
month: INT32 SNAPPY DO:1563 FPO:1594 SZ:60/56/0.93 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 1, max: 4, num_nulls: 0]

Parquet-tools output for one of the lineitem files:
$ hadoop jar ~/repos/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar meta /test-warehouse/li2/4b4d9143c575dd71-3f69d3cf0001_1879643220_data.0.parq
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: reading another 1 footers
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
file: hdfs://localhost:20500/test-warehouse/li2/4b4d9143c575dd71-3f69d3cf0001_1879643220_data.0.parq
creator: impala version 4.0.0-SNAPSHOT (build 7b691c5d4249f0cb1ced8ddf01033fbbe10511d9)

file schema: schema
l_orderkey: OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
l_partkey: OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
l_suppkey:
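The commit message's point that the data is encoded identically under both enum values can be illustrated with a small reader-side sketch. This is not Impala's reader code: the `Encoding` enum is a stand-in for the Thrift-generated `parquet::Encoding`, `DecodeDataPage` is a hypothetical helper, and the dictionary indices are assumed to have already been unpacked from their RLE/bit-packed form.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Stand-in for the Thrift-generated parquet::Encoding enum (subset only).
enum class Encoding { PLAIN = 0, PLAIN_DICTIONARY = 2, RLE = 3, RLE_DICTIONARY = 8 };

// Both enum values denote the same on-disk layout for data pages.
inline bool IsDictionaryEncoding(Encoding e) {
  return e == Encoding::PLAIN_DICTIONARY || e == Encoding::RLE_DICTIONARY;
}

// Hypothetical helper: materializes values from already-unpacked dictionary
// indices. Returns nullopt for non-dictionary pages or out-of-range indices.
std::optional<std::vector<int32_t>> DecodeDataPage(
    Encoding page_encoding, const std::vector<int32_t>& dictionary,
    const std::vector<uint32_t>& indices) {
  if (!IsDictionaryEncoding(page_encoding)) return std::nullopt;
  std::vector<int32_t> out;
  out.reserve(indices.size());
  for (uint32_t idx : indices) {
    if (idx >= dictionary.size()) return std::nullopt;  // corrupt page
    out.push_back(dictionary[idx]);
  }
  return out;
}
```

Under these assumptions the only reader change is the predicate: a page tagged RLE_DICTIONARY takes the same decode path as one tagged PLAIN_DICTIONARY.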
[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16893 ) Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages .. Patch Set 6: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/16893 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a Gerrit-Change-Number: 16893 Gerrit-PatchSet: 6 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 05 Jan 2021 23:30:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16893 ) Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages .. Patch Set 5: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7958/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16893 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a Gerrit-Change-Number: 16893 Gerrit-PatchSet: 5 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 05 Jan 2021 18:00:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16893 ) Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages .. Patch Set 6: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6825/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16893 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a Gerrit-Change-Number: 16893 Gerrit-PatchSet: 6 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 05 Jan 2021 17:39:03 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16893 ) Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages .. Patch Set 6: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/16893 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a Gerrit-Change-Number: 16893 Gerrit-PatchSet: 6 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 05 Jan 2021 17:39:02 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16893 ) Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages .. Patch Set 5: Code-Review+2 Carry +2 -- To view, visit http://gerrit.cloudera.org:8080/16893 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a Gerrit-Change-Number: 16893 Gerrit-PatchSet: 5 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 05 Jan 2021 17:38:43 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16893 ) Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/16893/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/16893/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc@893 PS4, Line 893: if (IsDictionaryEncoding(current_encoding_) : && FLAGS_write_new_parquet_dictionary_encodings) { : header.data_page_header.encoding = parquet::Encoding::RLE_DICTIONARY; : } > Is this 'if' statement still needed? Now that we use 'DataPageDictionaryEnc Done -- To view, visit http://gerrit.cloudera.org:8080/16893 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a Gerrit-Change-Number: 16893 Gerrit-PatchSet: 4 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 05 Jan 2021 17:38:32 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages
Hello Zoltan Borok-Nagy, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16893 to look at the new patch set (#5). Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages ..
[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16893 ) Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages .. Patch Set 4: Code-Review+2 +2 from my side too -- To view, visit http://gerrit.cloudera.org:8080/16893 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a Gerrit-Change-Number: 16893 Gerrit-PatchSet: 4 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 05 Jan 2021 16:00:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16893 ) Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages .. Patch Set 4: Code-Review+2 (1 comment) An 'if' statement might be redundant, but other than that LGTM. http://gerrit.cloudera.org:8080/#/c/16893/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/16893/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc@893 PS4, Line 893: if (IsDictionaryEncoding(current_encoding_) : && FLAGS_write_new_parquet_dictionary_encodings) { : header.data_page_header.encoding = parquet::Encoding::RLE_DICTIONARY; : } Is this 'if' statement still needed? Now that we use 'DataPageDictionaryEncoding()' to set current_encoding_. -- To view, visit http://gerrit.cloudera.org:8080/16893 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a Gerrit-Change-Number: 16893 Gerrit-PatchSet: 4 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 05 Jan 2021 10:36:37 + Gerrit-HasComments: Yes
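The redundancy raised in this comment can be sketched as follows. This is an illustration only, not the code in hdfs-parquet-table-writer.cc: the names `DataPageDictionaryEncoding`, `IsDictionaryEncoding` and `current_encoding_` come from the review, but their signatures, the `ColumnWriterSketch` class, and the `bool` parameter standing in for `FLAGS_write_new_parquet_dictionary_encodings` are assumptions.

```cpp
#include <cassert>

// Stand-in for the Thrift-generated parquet::Encoding enum (subset only).
enum class Encoding { PLAIN = 0, PLAIN_DICTIONARY = 2, RLE = 3, RLE_DICTIONARY = 8 };

inline bool IsDictionaryEncoding(Encoding e) {
  return e == Encoding::PLAIN_DICTIONARY || e == Encoding::RLE_DICTIONARY;
}

// Assumed signature: picks the data-page enum value once, based on the flag.
inline Encoding DataPageDictionaryEncoding(bool write_new_encodings) {
  return write_new_encodings ? Encoding::RLE_DICTIONARY
                             : Encoding::PLAIN_DICTIONARY;
}

struct DataPageHeader {
  Encoding encoding = Encoding::PLAIN;
};

// Hypothetical writer: current_encoding_ always holds the actual encoding.
class ColumnWriterSketch {
 public:
  explicit ColumnWriterSketch(bool write_new_encodings)
      : current_encoding_(DataPageDictionaryEncoding(write_new_encodings)) {}

  // Models a writer that gives up on dictionary encoding for later pages.
  void FallBackToPlain() { current_encoding_ = Encoding::PLAIN; }

  // No flag check is needed here: the stored value is already final.
  DataPageHeader FinalizePage() const {
    DataPageHeader header;
    header.encoding = current_encoding_;
    return header;
  }

 private:
  Encoding current_encoding_;
};
```

Because the flag is consulted where `current_encoding_` is set, a later `if (IsDictionaryEncoding(current_encoding_) && flag)` that overwrites the header with RLE_DICTIONARY would assign the value the header already has.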
[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16893 ) Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages .. Patch Set 4: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7951/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16893 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a Gerrit-Change-Number: 16893 Gerrit-PatchSet: 4 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 05 Jan 2021 08:19:13 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages
Hello Zoltan Borok-Nagy, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16893 to look at the new patch set (#4). Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages ..
[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16893 ) Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages .. Patch Set 3: (6 comments) http://gerrit.cloudera.org:8080/#/c/16893/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16893/3//COMMIT_MSG@10 PS3, Line 10: PLAIN/ > PLAIN is the new way AFAIK, so we use PLAIN for the dictionary page and RLE Done. Good point http://gerrit.cloudera.org:8080/#/c/16893/3//COMMIT_MSG@10 PS3, Line 10: old PLAIN/PLAIN_DICTIONARY values. > Maybe you could emphasise that the data is still encoded the same way. Done http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@92 PS3, Line 92: use > nit: maybe write_new_parquet_dictionary_encodings to be more explicit? Done http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@881 PS3, Line 881: current_encoding_ > I wonder if the code would be cleaner/less error-prone if 'current_encoding I made the switch to doing this. I'd initially thought that it would be bad to add the extra branch in ProcessValue() but on further thought it doesn't really make sense that it would matter, it should be a predictable branch. http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/parquet-column-readers.cc@326 PS3, Line 326: so > nit: to Done http://gerrit.cloudera.org:8080/#/c/16893/3/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/16893/3/testdata/data/README@593 PS3, Line 593: > is the newline intentional? 
Done -- To view, visit http://gerrit.cloudera.org:8080/16893 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a Gerrit-Change-Number: 16893 Gerrit-PatchSet: 3 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 05 Jan 2021 07:57:34 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16893 ) Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages .. Patch Set 3: (6 comments) Few nits, but the code looks good to me overall. http://gerrit.cloudera.org:8080/#/c/16893/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16893/3//COMMIT_MSG@10 PS3, Line 10: old PLAIN/PLAIN_DICTIONARY values. Maybe you could emphasise that the data is still encoded the same way. http://gerrit.cloudera.org:8080/#/c/16893/3//COMMIT_MSG@10 PS3, Line 10: PLAIN/ PLAIN is the new way AFAIK, so we use PLAIN for the dictionary page and RLE_DICTIONARY for the data pages. While the old way was to use PLAIN_DICTIONARY everywhere, and it meant PLAIN encoding for the dictionary page and RLE encoded dict keys for the data pages. http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@92 PS3, Line 92: use nit: maybe write_new_parquet_dictionary_encodings to be more explicit? http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@881 PS3, Line 881: current_encoding_ I wonder if the code would be cleaner/less error-prone if 'current_encoding_' stored the actual encoding. So probably we could move this 'if' to the place where we set 'current_encoding_'. http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/16893/3/be/src/exec/parquet/parquet-column-readers.cc@326 PS3, Line 326: so nit: to http://gerrit.cloudera.org:8080/#/c/16893/3/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/16893/3/testdata/data/README@593 PS3, Line 593: is the newline intentional? 
-- To view, visit http://gerrit.cloudera.org:8080/16893 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a Gerrit-Change-Number: 16893 Gerrit-PatchSet: 3 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 04 Jan 2021 19:21:47 + Gerrit-HasComments: Yes
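The old and new conventions described in the commit-message comment above can be summarized as a mapping from page type to header enum value. A minimal sketch, assuming a hypothetical `HeaderEncoding` helper and `PageType` enum that do not appear in the patch; the `Encoding` enum is a stand-in for the Thrift-generated `parquet::Encoding`.

```cpp
#include <cassert>

// Stand-in for the Thrift-generated parquet::Encoding enum (subset only).
enum class Encoding { PLAIN = 0, PLAIN_DICTIONARY = 2, RLE = 3, RLE_DICTIONARY = 8 };

// Hypothetical: the two page kinds of a dictionary-encoded column chunk.
enum class PageType { DICTIONARY_PAGE, DATA_PAGE };

// Returns the enum value recorded in the page header. The bytes on disk are
// the same either way: plain-encoded dictionary values and RLE-encoded keys.
inline Encoding HeaderEncoding(PageType page, bool write_new_encodings) {
  // Old convention: PLAIN_DICTIONARY is used for both page types.
  if (!write_new_encodings) return Encoding::PLAIN_DICTIONARY;
  // New convention: PLAIN for the dictionary page, RLE_DICTIONARY for data.
  return page == PageType::DICTIONARY_PAGE ? Encoding::PLAIN
                                           : Encoding::RLE_DICTIONARY;
}
```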
[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16893 ) Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages .. Patch Set 3: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/7897/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/16893 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I90942022edcd5d96c720a1bde53879e50394660a Gerrit-Change-Number: 16893 Gerrit-PatchSet: 3 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 22 Dec 2020 17:52:49 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6434: Add support to decode RLE DICTIONARY encoded pages
Tim Armstrong has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/16893 )

Change subject: IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages
..

IMPALA-6434: Add support to decode RLE_DICTIONARY encoded pages

This adds support to use this enum value instead of the old PLAIN/PLAIN_DICTIONARY values.

A hidden option -use_new_parquet_dictionary_encodings is added to turn on writing too, for test purposes only.

Testing:
* Added an automated test using a pregenerated test file.
* Ran core tests.
* Manually tested by writing out TPC-H lineitem with the new encoding and reading back in Impala and Hive.

Parquet-tools output for the generated test file:
$ hadoop jar ~/repos/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar meta /test-warehouse/att/824de2afebad009f-6f460ade0003_643159826_data.0.parq
20/12/21 20:28:36 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
20/12/21 20:28:36 INFO hadoop.ParquetFileReader: reading another 1 footers
20/12/21 20:28:36 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
file: hdfs://localhost:20500/test-warehouse/att/824de2afebad009f-6f460ade0003_643159826_data.0.parq
creator: impala version 4.0.0-SNAPSHOT (build 7b691c5d4249f0cb1ced8ddf01033fbbe10511d9)

file schema: schema
id: OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
bool_col: OPTIONAL BOOLEAN R:0 D:1
tinyint_col: OPTIONAL INT32 L:INTEGER(8,true) R:0 D:1
smallint_col: OPTIONAL INT32 L:INTEGER(16,true) R:0 D:1
int_col: OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
bigint_col: OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
float_col: OPTIONAL FLOAT R:0 D:1
double_col: OPTIONAL DOUBLE R:0 D:1
date_string_col: OPTIONAL BINARY R:0 D:1
string_col: OPTIONAL BINARY R:0 D:1
timestamp_col: OPTIONAL INT96 R:0 D:1
year: OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
month: OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1

row group 1: RC:8 TS:754 OFFSET:4
id: INT32 SNAPPY DO:4 FPO:48 SZ:74/73/0.99 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 7, num_nulls: 0]
bool_col: BOOLEAN SNAPPY DO:0 FPO:141 SZ:26/24/0.92 VC:8 ENC:RLE,PLAIN ST:[min: false, max: true, num_nulls: 0]
tinyint_col: INT32 SNAPPY DO:220 FPO:243 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
smallint_col: INT32 SNAPPY DO:343 FPO:366 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
int_col: INT32 SNAPPY DO:467 FPO:490 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 1, num_nulls: 0]
bigint_col: INT64 SNAPPY DO:586 FPO:617 SZ:59/55/0.93 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0, max: 10, num_nulls: 0]
float_col: FLOAT SNAPPY DO:724 FPO:747 SZ:51/47/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: -0.0, max: 1.1, num_nulls: 0]
double_col: DOUBLE SNAPPY DO:845 FPO:876 SZ:59/55/0.93 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: -0.0, max: 10.1, num_nulls: 0]
date_string_col: BINARY SNAPPY DO:983 FPO:1028 SZ:74/88/1.19 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0x30312F30312F3039, max: 0x30342F30312F3039, num_nulls: 0]
string_col: BINARY SNAPPY DO:1143 FPO:1168 SZ:53/49/0.92 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 0x30, max: 0x31, num_nulls: 0]
timestamp_col: INT96 SNAPPY DO:1261 FPO:1329 SZ:98/138/1.41 VC:8 ENC:RLE,RLE_DICTIONARY ST:[num_nulls: 0, min/max not defined]
year: INT32 SNAPPY DO:1451 FPO:1470 SZ:47/43/0.91 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 2009, max: 2009, num_nulls: 0]
month: INT32 SNAPPY DO:1563 FPO:1594 SZ:60/56/0.93 VC:8 ENC:RLE,RLE_DICTIONARY ST:[min: 1, max: 4, num_nulls: 0]

Parquet-tools output for one of the lineitem files:
$ hadoop jar ~/repos/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar meta /test-warehouse/li2/4b4d9143c575dd71-3f69d3cf0001_1879643220_data.0.parq
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: reading another 1 footers
20/12/22 09:39:56 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
file: hdfs://localhost:20500/test-warehouse/li2/4b4d9143c575dd71-3f69d3cf0001_1879643220_data.0.parq
creator: impala version 4.0.0-SNAPSHOT (build 7b691c5d4249f0cb1ced8ddf01033fbbe10511d9)

file schema: schema
l_orderkey: OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
l_partkey: OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
l_suppkey: OPTIONAL INT64 L:INTEGER(64,true) R:0 D:1
l_linenumber: OPTIONAL INT32 L:INTEGER(32,true) R:0 D:1
l_quantity: O