[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 18: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 18 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 01 Nov 2017 04:26:43 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. IMPALA-5307: Part 2: copy out strings in uncompressed Avro The approach is to re-materialize strings in those tuples that survive conjunct evaluation and may reference disk I/O buffers directly. This means that perf should not regress for the following cases: * Compressed Avro files. * Non-string columns. * Selective scans where the majority of tuples are filtered out. This approach will also work for the Sequence and Text scanners. Includes some improvements to Avro codegen to replace more constants to help win back some performance (with limited success): replaced InitTuple() with an optimised version and substituted tuple_byte_size() with a constant. Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed length. Perf: Did microbenchmarks on uncompressed Avro files, one with all columns from lineitem and one with only l_comment. Tests were run with: set num_scanner_threads=1; I ran the query 5 times and extracted MaterializeTupleTime from the profile to measure CPU cost of materialization. Overall string materialization got significantly slower, mainly because of the extra memcpy() calls required. Selecting one string from a table with multiple columns: select min(l_comment) from biglineitem_avro 1.814 -> 2.096 Selecting one string from a table with one column: select min(l_comment) from biglineitem_comment; profile; 1.708 -> 3.7 Selecting one string from a table with one column with predicate: select min(l_comment) from biglineitem_comment where length(l_comment) > 1; 1.691 -> 1.449 Selecting all columns: select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber), min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax), min(l_returnflag), min(l_linestatus), min(l_shipdate), min(l_commitdate), min(l_receiptdate), min(l_shipinstruct), min(l_shipmode), min(l_comment) from biglineitem_avro; profile; 2.335 -> 3.711 Selecting an int column (no strings): select min(l_linenumber) from biglineitem_avro 1.806 -> 1.819 Testing: Ran exhaustive tests. Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Reviewed-on: http://gerrit.cloudera.org:8080/8146 Reviewed-by: Tim ArmstrongTested-by: Impala Public Jenkins --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/impala-ir.cc M be/src/codegen/llvm-codegen.cc M be/src/codegen/llvm-codegen.h M be/src/common/status.cc M be/src/common/status.h M be/src/exec/hdfs-avro-scanner-ir.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-avro-scanner.h M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-scanner.h M be/src/exec/scan-node.h M be/src/runtime/CMakeLists.txt M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h A be/src/runtime/tuple-ir.cc M be/src/runtime/tuple.cc M be/src/runtime/tuple.h M be/src/util/avro-util.h 22 files changed, 401 insertions(+), 114 deletions(-) Approvals: Tim Armstrong: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 19 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 18: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 18 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 01 Nov 2017 00:37:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 17: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/1421/ -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 17 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 01 Nov 2017 00:37:22 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 18: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1422/ -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 18 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 01 Nov 2017 00:37:42 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 17: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1421/ -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 17 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 01 Nov 2017 00:37:05 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 16: (1 comment) http://gerrit.cloudera.org:8080/#/c/8146/16/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: http://gerrit.cloudera.org:8080/#/c/8146/16/be/src/exec/hdfs-scanner.h@405 PS16, Line 405: runtim > runtime Done -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 16 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 31 Oct 2017 04:22:51 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 17: Code-Review+2 carry -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 17 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 31 Oct 2017 04:22:59 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Hello Michael Ho, Thomas Tauber-Marshall, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8146 to look at the new patch set (#17). Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. IMPALA-5307: Part 2: copy out strings in uncompressed Avro The approach is to re-materialize strings in those tuples that survive conjunct evaluation and may reference disk I/O buffers directly. This means that perf should not regress for the following cases: * Compressed Avro files. * Non-string columns. * Selective scans where the majority of tuples are filtered out. This approach will also work for the Sequence and Text scanners. Includes some improvements to Avro codegen to replace more constants to help win back some performance (with limited success): replaced InitTuple() with an optimised version and substituted tuple_byte_size() with a constant. Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed length. Perf: Did microbenchmarks on uncompressed Avro files, one with all columns from lineitem and one with only l_comment. Tests were run with: set num_scanner_threads=1; I ran the query 5 times and extracted MaterializeTupleTime from the profile to measure CPU cost of materialization. Overall string materialization got significantly slower, mainly because of the extra memcpy() calls required. Selecting one string from a table with multiple columns: select min(l_comment) from biglineitem_avro 1.814 -> 2.096 Selecting one string from a table with one column: select min(l_comment) from biglineitem_comment; profile; 1.708 -> 3.7 Selecting one string from a table with one column with predicate: select min(l_comment) from biglineitem_comment where length(l_comment) > 1; 1.691 -> 1.449 Selecting all columns: select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber), min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax), min(l_returnflag), min(l_linestatus), min(l_shipdate), min(l_commitdate), min(l_receiptdate), min(l_shipinstruct), min(l_shipmode), min(l_comment) from biglineitem_avro; profile; 2.335 -> 3.711 Selecting an int column (no strings): select min(l_linenumber) from biglineitem_avro 1.806 -> 1.819 Testing: Ran exhaustive tests. Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/impala-ir.cc M be/src/codegen/llvm-codegen.cc M be/src/codegen/llvm-codegen.h M be/src/common/status.cc M be/src/common/status.h M be/src/exec/hdfs-avro-scanner-ir.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-avro-scanner.h M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-scanner.h M be/src/exec/scan-node.h M be/src/runtime/CMakeLists.txt M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h A be/src/runtime/tuple-ir.cc M be/src/runtime/tuple.cc M be/src/runtime/tuple.h M be/src/util/avro-util.h 22 files changed, 401 insertions(+), 114 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/17 -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 17 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 16: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/8146/16/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: http://gerrit.cloudera.org:8080/#/c/8146/16/be/src/exec/hdfs-scanner.h@405 PS16, Line 405: runtim runtime -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 16 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 31 Oct 2017 02:00:24 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 16: Code-Review+1 carry +1 -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 16 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 30 Oct 2017 22:03:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Hello Michael Ho, Thomas Tauber-Marshall, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8146 to look at the new patch set (#16). Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. IMPALA-5307: Part 2: copy out strings in uncompressed Avro The approach is to re-materialize strings in those tuples that survive conjunct evaluation and may reference disk I/O buffers directly. This means that perf should not regress for the following cases: * Compressed Avro files. * Non-string columns. * Selective scans where the majority of tuples are filtered out. This approach will also work for the Sequence and Text scanners. Includes some improvements to Avro codegen to replace more constants to help win back some performance (with limited success): replaced InitTuple() with an optimised version and substituted tuple_byte_size() with a constant. Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed length. Perf: Did microbenchmarks on uncompressed Avro files, one with all columns from lineitem and one with only l_comment. Tests were run with: set num_scanner_threads=1; I ran the query 5 times and extracted MaterializeTupleTime from the profile to measure CPU cost of materialization. Overall string materialization got significantly slower, mainly because of the extra memcpy() calls required. Selecting one string from a table with multiple columns: select min(l_comment) from biglineitem_avro 1.814 -> 2.096 Selecting one string from a table with one column: select min(l_comment) from biglineitem_comment; profile; 1.708 -> 3.7 Selecting one string from a table with one column with predicate: select min(l_comment) from biglineitem_comment where length(l_comment) > 1; 1.691 -> 1.449 Selecting all columns: select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber), min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax), min(l_returnflag), min(l_linestatus), min(l_shipdate), min(l_commitdate), min(l_receiptdate), min(l_shipinstruct), min(l_shipmode), min(l_comment) from biglineitem_avro; profile; 2.335 -> 3.711 Selecting an int column (no strings): select min(l_linenumber) from biglineitem_avro 1.806 -> 1.819 Testing: Ran exhaustive tests. Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/impala-ir.cc M be/src/codegen/llvm-codegen.cc M be/src/codegen/llvm-codegen.h M be/src/common/status.cc M be/src/common/status.h M be/src/exec/hdfs-avro-scanner-ir.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-avro-scanner.h M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-scanner.h M be/src/exec/scan-node.h M be/src/runtime/CMakeLists.txt M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h A be/src/runtime/tuple-ir.cc M be/src/runtime/tuple.cc M be/src/runtime/tuple.h M be/src/util/avro-util.h 22 files changed, 401 insertions(+), 114 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/16 -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 16 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 15: (15 comments) http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner-ir.cc File be/src/exec/hdfs-avro-scanner-ir.cc: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner-ir.cc@a213 PS15, Line 213: > If I understand it correctly, this if branch was dead code before this chan Yeah I missed cleaning it up in an earlier commit. I mention this in my admittedly gigantic commit message. http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner-ir.cc@51 PS15, Line 51: !tuple->CopyStrings("HdfsAvroScanner::DecodeAvroData()", : state_, string_slot_offsets_.data(), string_slot_offsets_.size(), pool, : _status_)) > nit: tuple->CopyStrings(...) == nullptr It returns a bool though, since it doesn't reallocate the tuple itself. http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner.h File be/src/exec/hdfs-avro-scanner.h: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner.h@133 PS15, Line 133: // > nit: /// to be consistent. Done http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner.cc File be/src/exec/hdfs-avro-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner.cc@1066 PS15, Line 1066: HdfsScanNodeBase* node > Should this be const HdfsScanNodeBase* ? Done. this required propagating the const qualifier a few more places. http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-scanner.h@406 PS15, Line 406: /// Codegen function to replace InitTuple(). The codegen'd version of InitTuple() is : /// stored in 'init_tuple_fn' if codegen was successful. > May help to also state the codegen'd version of the function has some const Done http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-scanner.cc File be/src/exec/hdfs-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-scanner.cc@535 PS15, Line 535: > nit: indent 4 Done http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/descriptors.h@93 PS15, Line 93: if > is Done http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/descriptors.h@109 PS15, Line 109: llvm::Constant* ToIR(LlvmCodeGen* codegen) const; > Comment: This needs to be updated should the layout of this struct change. Done http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple-ir.cc File be/src/runtime/tuple-ir.cc: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple-ir.cc@28 PS15, Line 28: for (int i = 0; i < num_string_slots; ++i) { > Not sure if it will help but have you tried #pragma unroll hint here to see I tried a couple of queries but didn't see a noticeable difference in perf. http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.h File be/src/runtime/tuple.h: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.h@47 PS15, Line 47: /// Generate an LLVM Constant containing the offset values of this SlotOffsets instance. > Please also comment that this function needs to be updated if the layout of Done http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.h@204 PS15, Line 204: // > nit: /// Done http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc File be/src/runtime/tuple.cc: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc@404 PS15, Line 404: materialize_strings_fn > nit: Using the name copy_strings_fn will be more consistent. Thanks for catching this, I missed this one place. http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc@435 PS15, Line 435: Constant* > Not your change but I feel it's generally less confusing to include llvm:: I removed the "using namespace llvm" in this file and added llvm:: to the appropriate places. http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc@435 PS15, Line 435: slot_offset_constants > nit: 'slot_offset_ir_constants' may make it easier to follow. Done http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc@436 PS15, Line 436: for (SlotDescriptor* slot_desc : desc.string_slots()) { : SlotOffsets offsets = {slot_desc->null_indicator_offset(), slot_desc->tuple_offset()}; : slot_offset_constants.push_back(offsets.ToIR(codegen)); : } : : Constant* constant_slot_offsets = codegen->ConstantsToGVArrayPtr( : slot_offsets_type, slot_offset_constants, "slot_offsets"); :
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 15: (3 comments) http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner-ir.cc File be/src/exec/hdfs-avro-scanner-ir.cc: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner-ir.cc@51 PS15, Line 51: !tuple->CopyStrings("HdfsAvroScanner::DecodeAvroData()", : state_, string_slot_offsets_.data(), string_slot_offsets_.size(), pool, : _status_)) nit: tuple->CopyStrings(...) == nullptr http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner.cc File be/src/exec/hdfs-avro-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner.cc@1066 PS15, Line 1066: HdfsScanNodeBase* node Should this be const HdfsScanNodeBase* ? http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple-ir.cc File be/src/runtime/tuple-ir.cc: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple-ir.cc@28 PS15, Line 28: for (int i = 0; i < num_string_slots; ++i) { Not sure if it will help but have you tried #pragma unroll hint here to see if it makes a difference. https://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 15 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 30 Oct 2017 20:44:13 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 15: (12 comments) Sorry for the delay. Some more comments. Close to completion. http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner-ir.cc File be/src/exec/hdfs-avro-scanner-ir.cc: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner-ir.cc@a213 PS15, Line 213: If I understand it correctly, this if branch was dead code before this change, right ? http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner.h File be/src/exec/hdfs-avro-scanner.h: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-avro-scanner.h@133 PS15, Line 133: // nit: /// to be consistent. http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-scanner.h@406 PS15, Line 406: /// Codegen function to replace InitTuple(). The codegen'd version of InitTuple() is : /// stored in 'init_tuple_fn' if codegen was successful. May help to also state the codegen'd version of the function has some constants replaced. http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-scanner.cc File be/src/exec/hdfs-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/exec/hdfs-scanner.cc@535 PS15, Line 535: nit: indent 4 http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/descriptors.h@93 PS15, Line 93: if is http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/descriptors.h@109 PS15, Line 109: llvm::Constant* ToIR(LlvmCodeGen* codegen) const; Comment: This needs to be updated should the layout of this struct change. http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.h File be/src/runtime/tuple.h: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.h@47 PS15, Line 47: /// Generate an LLVM Constant containing the offset values of this SlotOffsets instance. Please also comment that this function needs to be updated if the layout of this struct is changed. http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.h@204 PS15, Line 204: // nit: /// May help to also add that this is replacing some of the arguments passed to CopyString() in an attempt to trigger loop unrolling and constant substitution by LLVM. http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc File be/src/runtime/tuple.cc: http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc@404 PS15, Line 404: materialize_strings_fn nit: Using the name copy_strings_fn will be more consistent. http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc@435 PS15, Line 435: slot_offset_constants nit: 'slot_offset_ir_constants' may make it easier to follow. http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc@435 PS15, Line 435: Constant* Not your change but I feel it's generally less confusing to include llvm:: namespace for class not defined by codegen logic. Don't need to change it in this change. Someone can do it later in a clean-up patch. http://gerrit.cloudera.org:8080/#/c/8146/15/be/src/runtime/tuple.cc@436 PS15, Line 436: for (SlotDescriptor* slot_desc : desc.string_slots()) { : SlotOffsets offsets = {slot_desc->null_indicator_offset(), slot_desc->tuple_offset()}; : slot_offset_constants.push_back(offsets.ToIR(codegen)); : } : : Constant* constant_slot_offsets = codegen->ConstantsToGVArrayPtr( : slot_offsets_type, slot_offset_constants, "slot_offsets"); : Constant* num_string_slots = : ConstantInt::get(codegen->int_type(), desc.string_slots().size()); I think it may be helpful to add a comment on what line 435 - 444 is trying to achieve. In essence, it's converting all string slots' offsets into an array of IR constants which is a global variable named "slot_offsets". -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 15 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 30 Oct 2017 19:33:56 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 15: Code-Review+1 rebase, carry +1 -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 15 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 30 Oct 2017 17:00:53 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Hello Michael Ho, Thomas Tauber-Marshall, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8146 to look at the new patch set (#15). Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. IMPALA-5307: Part 2: copy out strings in uncompressed Avro The approach is to re-materialize strings in those tuples that survive conjunct evaluation and may reference disk I/O buffers directly. This means that perf should not regress for the following cases: * Compressed Avro files. * Non-string columns. * Selective scans where the majority of tuples are filtered out. This approach will also work for the Sequence and Text scanners. Includes some improvements to Avro codegen to replace more constants to help win back some performance (with limited success): replaced InitTuple() with an optimised version and substituted tuple_byte_size() with a constant. Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed length. Perf: Did microbenchmarks on uncompressed Avro files, one with all columns from lineitem and one with only l_comment. Tests were run with: set num_scanner_threads=1; I ran the query 5 times and extracted MaterializeTupleTime from the profile to measure CPU cost of materialization. Overall string materialization got significantly slower, mainly because of the extra memcpy() calls required. Selecting one string from a table with multiple columns: select min(l_comment) from biglineitem_avro 1.814 -> 2.096 Selecting one string from a table with one column: select min(l_comment) from biglineitem_comment; profile; 1.708 -> 3.7 Selecting one string from a table with one column with predicate: select min(l_comment) from biglineitem_comment where length(l_comment) > 1; 1.691 -> 1.449 Selecting all columns: select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber), min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax), min(l_returnflag), min(l_linestatus), min(l_shipdate), min(l_commitdate), min(l_receiptdate), min(l_shipinstruct), min(l_shipmode), min(l_comment) from biglineitem_avro; profile; 2.335 -> 3.711 Selecting an int column (no strings): select min(l_linenumber) from biglineitem_avro 1.806 -> 1.819 Testing: Ran exhaustive tests. Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/impala-ir.cc M be/src/codegen/llvm-codegen.cc M be/src/codegen/llvm-codegen.h M be/src/common/status.cc M be/src/common/status.h M be/src/exec/hdfs-avro-scanner-ir.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-avro-scanner.h M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-scanner.h M be/src/runtime/CMakeLists.txt M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h A be/src/runtime/tuple-ir.cc M be/src/runtime/tuple.cc M be/src/runtime/tuple.h 19 files changed, 337 insertions(+), 57 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/15 -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 15 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 14: Code-Review+1 Thanks. This approach looks okay to me, please see if Michael can do the +2 review. -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 14 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 27 Oct 2017 20:33:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 14: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 14 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 27 Oct 2017 17:33:56 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 13: (1 comment) http://gerrit.cloudera.org:8080/#/c/8146/13/be/src/runtime/tuple.h File be/src/runtime/tuple.h: http://gerrit.cloudera.org:8080/#/c/8146/13/be/src/runtime/tuple.h@175 PS13, Line 175: /// Materialize the var-len string data in this tuple into the provided memory : /// pool. > the name and the comment didn't make it clear that this also rewrites this Renamed to CopyStrings(). Agree it's more accurate. -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 13 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 27 Oct 2017 05:22:44 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Hello Michael Ho, Thomas Tauber-Marshall, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8146 to look at the new patch set (#14). Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. IMPALA-5307: Part 2: copy out strings in uncompressed Avro The approach is to re-materialize strings in those tuples that survive conjunct evaluation and may reference disk I/O buffers directly. This means that perf should not regress for the following cases: * Compressed Avro files. * Non-string columns. * Selective scans where the majority of tuples are filtered out. This approach will also work for the Sequence and Text scanners. Includes some improvements to Avro codegen to replace more constants to help win back some performance (with limited success): replaced InitTuple() with an optimised version and substituted tuple_byte_size() with a constant. Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed length. Perf: Did microbenchmarks on uncompressed Avro files, one with all columns from lineitem and one with only l_comment. Tests were run with: set num_scanner_threads=1; I ran the query 5 times and extracted MaterializeTupleTime from the profile to measure CPU cost of materialization. Overall string materialization got significantly slower, mainly because of the extra memcpy() calls required. Selecting one string from a table with multiple columns: select min(l_comment) from biglineitem_avro 1.814 -> 2.096 Selecting one string from a table with one column: select min(l_comment) from biglineitem_comment; profile; 1.708 -> 3.7 Selecting one string from a table with one column with predicate: select min(l_comment) from biglineitem_comment where length(l_comment) > 1; 1.691 -> 1.449 Selecting all columns: select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber), min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax), min(l_returnflag), min(l_linestatus), min(l_shipdate), min(l_commitdate), min(l_receiptdate), min(l_shipinstruct), min(l_shipmode), min(l_comment) from biglineitem_avro; profile; 2.335 -> 3.711 Selecting an int column (no strings): select min(l_linenumber) from biglineitem_avro 1.806 -> 1.819 Testing: Ran exhaustive tests. Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/impala-ir.cc M be/src/codegen/llvm-codegen.cc M be/src/codegen/llvm-codegen.h M be/src/common/status.cc M be/src/common/status.h M be/src/exec/hdfs-avro-scanner-ir.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-avro-scanner.h M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-scanner.h M be/src/runtime/CMakeLists.txt M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h A be/src/runtime/tuple-ir.cc M be/src/runtime/tuple.cc M be/src/runtime/tuple.h 19 files changed, 337 insertions(+), 57 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/14 -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 14 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 13: (1 comment) http://gerrit.cloudera.org:8080/#/c/8146/13/be/src/runtime/tuple.h File be/src/runtime/tuple.h: http://gerrit.cloudera.org:8080/#/c/8146/13/be/src/runtime/tuple.h@175 PS13, Line 175: /// Materialize the var-len string data in this tuple into the provided memory : /// pool. the name and the comment didn't make it clear that this also rewrites this tuple. i.e. in that sense, this is different than MaterializeExprs(). In this case, the slots (and strings) are already materialized -- it's re-materializing (or copying) the strings, right? -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 13 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 26 Oct 2017 23:46:22 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 11: (2 comments) http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.h File be/src/runtime/tuple.h: http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.h@289 PS11, Line 289: ); > Are those calls to std C++ library functions ? I realise now the comment didn't make sense without context. The invoke came from the call to this function and was fixed by adding a "noexcept" specifier - we've used that in various other places. http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc File be/src/runtime/tuple.cc: http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc@449 PS11, Line 449: Value* result_val = builder.CreateCall(cross_compiled_fn, > Yes. That seems to be helpful. Can we also do something bigger (e.g. 16 col I did a quick microbenchmark and there is a ~10% difference in scan time between the codegen'd and non-codegen'd when scanning a single string column. Another case where it's useful is optimising out the loop and runtime checks entirely when no string cols are materialised. -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 11 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 24 Oct 2017 22:28:45 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 11: (2 comments) http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.h File be/src/runtime/tuple.h: http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.h@289 PS11, Line 289: ); > Looking in the IR I noticed that there were "invoke" instruction emitted wh Are those calls to std C++ library functions ? http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc File be/src/runtime/tuple.cc: http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc@449 PS11, Line 449: Value* result_val = builder.CreateCall(cross_compiled_fn, > I think being conservative is probably a good thing actually. The code is n Yes. That seems to be helpful. Can we also do something bigger (e.g. 16 columns) ? If I understand it correctly, the benefit for codegen'ing here (and all the new code for creating the constants) is for unrolling the loop, right ? May be it's still helpful for the cases in which few number of string columns are materialized. -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 11 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 24 Oct 2017 00:37:51 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc File be/src/runtime/tuple.cc: http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc@449 PS11, Line 449: Value* result_val = builder.CreateCall(cross_compiled_fn, > It's probably because of the inlining overhead. I wonder what the perf is l I think being conservative is probably a good thing actually. The code is never executed in the common case where the Avro files are compressed so taking too much of a codegen time hit would be a bad thing. I could do some further experiments to understand the impact - e.g. compare the overhead of the unrolled and non-unrolled versions on 8 string columns - would that be helpful? -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 11 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 24 Oct 2017 00:23:47 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc File be/src/runtime/tuple.cc: http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc@449 PS11, Line 449: Value* result_val = builder.CreateCall(cross_compiled_fn, > I confirmed that it was unrolled for a handful of string columns. For 100s It's probably because of the inlining overhead. I wonder what the perf is like if we force inlining and let the unroll to occur for your 100 columns example -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 11 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 24 Oct 2017 00:10:55 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Hello Michael Ho, Thomas Tauber-Marshall, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8146 to look at the new patch set (#13). Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. IMPALA-5307: Part 2: copy out strings in uncompressed Avro The approach is to re-materialize strings in those tuples that survive conjunct evaluation and may reference disk I/O buffers directly. This means that perf should not regress for the following cases: * Compressed Avro files. * Non-string columns. * Selective scans where the majority of tuples are filtered out. This approach will also work for the Sequence and Text scanners. Includes some improvements to Avro codegen to replace more constants to help win back some performance (with limited success): replaced InitTuple() with an optimised version and substituted tuple_byte_size() with a constant. Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed length. Perf: Did microbenchmarks on uncompressed Avro files, one with all columns from lineitem and one with only l_comment. Tests were run with: set num_scanner_threads=1; I ran the query 5 times and extracted MaterializeTupleTime from the profile to measure CPU cost of materialization. Overall string materialization got significantly slower, mainly because of the extra memcpy() calls required. Selecting one string from a table with multiple columns: select min(l_comment) from biglineitem_avro 1.814 -> 2.096 Selecting one string from a table with one column: select min(l_comment) from biglineitem_comment; profile; 1.708 -> 3.7 Selecting one string from a table with one column with predicate: select min(l_comment) from biglineitem_comment where length(l_comment) > 1; 1.691 -> 1.449 Selecting all columns: select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber), min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax), min(l_returnflag), min(l_linestatus), min(l_shipdate), min(l_commitdate), min(l_receiptdate), min(l_shipinstruct), min(l_shipmode), min(l_comment) from biglineitem_avro; profile; 2.335 -> 3.711 Selecting an int column (no strings): select min(l_linenumber) from biglineitem_avro 1.806 -> 1.819 Testing: Ran exhaustive tests. Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/impala-ir.cc M be/src/codegen/llvm-codegen.cc M be/src/codegen/llvm-codegen.h M be/src/common/status.cc M be/src/common/status.h M be/src/exec/hdfs-avro-scanner-ir.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-avro-scanner.h M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-scanner.h M be/src/runtime/CMakeLists.txt M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h A be/src/runtime/tuple-ir.cc M be/src/runtime/tuple.cc M be/src/runtime/tuple.h 19 files changed, 337 insertions(+), 57 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/13 -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 13 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 13: Code-Review+1 PS13 is a rebase -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 13 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 23 Oct 2017 23:37:38 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Hello Michael Ho, Thomas Tauber-Marshall, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8146 to look at the new patch set (#12). Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. IMPALA-5307: Part 2: copy out strings in uncompressed Avro The approach is to re-materialize strings in those tuples that survive conjunct evaluation and may reference disk I/O buffers directly. This means that perf should not regress for the following cases: * Compressed Avro files. * Non-string columns. * Selective scans where the majority of tuples are filtered out. This approach will also work for the Sequence and Text scanners. Includes some improvements to Avro codegen to replace more constants to help win back some performance (with limited success): replaced InitTuple() with an optimised version and substituted tuple_byte_size() with a constant. Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed length. Perf: Did microbenchmarks on uncompressed Avro files, one with all columns from lineitem and one with only l_comment. Tests were run with: set num_scanner_threads=1; I ran the query 5 times and extracted MaterializeTupleTime from the profile to measure CPU cost of materialization. Overall string materialization got significantly slower, mainly because of the extra memcpy() calls required. Selecting one string from a table with multiple columns: select min(l_comment) from biglineitem_avro 1.814 -> 2.096 Selecting one string from a table with one column: select min(l_comment) from biglineitem_comment; profile; 1.708 -> 3.7 Selecting one string from a table with one column with predicate: select min(l_comment) from biglineitem_comment where length(l_comment) > 1; 1.691 -> 1.449 Selecting all columns: select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber), min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax), min(l_returnflag), min(l_linestatus), min(l_shipdate), min(l_commitdate), min(l_receiptdate), min(l_shipinstruct), min(l_shipmode), min(l_comment) from biglineitem_avro; profile; 2.335 -> 3.711 Selecting an int column (no strings): select min(l_linenumber) from biglineitem_avro 1.806 -> 1.819 Testing: Ran exhaustive tests. Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/impala-ir.cc M be/src/codegen/llvm-codegen.cc M be/src/codegen/llvm-codegen.h M be/src/common/status.cc M be/src/common/status.h M be/src/exec/hdfs-avro-scanner-ir.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-avro-scanner.h M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-scanner.h M be/src/runtime/CMakeLists.txt M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h A be/src/runtime/tuple-ir.cc M be/src/runtime/tuple.cc M be/src/runtime/tuple.h 19 files changed, 336 insertions(+), 56 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/12 -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 12 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 11: (4 comments) http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/descriptors.cc File be/src/runtime/descriptors.cc: http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/descriptors.cc@87 PS11, Line 87: return ConstantStruct::get(null_indicator_offset_type, : {ConstantInt::get(codegen->int_type(), byte_offset), : ConstantInt::get(codegen->tinyint_type(), bit_mask), : zeroes->getStructElement(2)}); > This needs to be updated when we change the definition of NullIndicatorOffs Added a comment to the class definition. ConstantStruct::get() asserts if the number of elements doesn't match the number of elements in the struct (that's why I had to add the padding explicitly). This behaviour seems better than silently failing. The code in LLVM is: Constant *ConstantStruct::get(StructType *ST, ArrayRefV) { assert((ST->isOpaque() || ST->getNumElements() == V.size()) && "Incorrect # elements specified to ConstantStruct::get"); http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.h File be/src/runtime/tuple.h: http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.h@289 PS11, Line 289: ); Looking in the IR I noticed that there were "invoke" instruction emitted when calling this function. http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc File be/src/runtime/tuple.cc: http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc@397 PS11, Line 397: Status( > Status::Expected() ? I think finalisation failing should be very unusual - either an Impala bug or a corrupt IR UDF. To be honest I don't know how robust our handling of this is, since we'll continue on and eventually try to optimise and compile the module. http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc@449 PS11, Line 449: Value* result_val = builder.CreateCall(cross_compiled_fn, > Just curious if you inspected the output IR to see if it's actually unrolli I confirmed that it was unrolled for a handful of string columns. For 100s of string columns, e.g, select count(*) from (select distinct * from widetable_1000_cols) v;, it is not unrolled. I checked some intermediate values too and it seems pretty conservative - it seemed like the switch-over point is around materialising >=8 string columns. E.g. select * from tpch_avro did not unroll the loop. -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 11 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 23 Oct 2017 23:36:23 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 11: (3 comments) http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/descriptors.cc File be/src/runtime/descriptors.cc: http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/descriptors.cc@87 PS11, Line 87: return ConstantStruct::get(null_indicator_offset_type, : {ConstantInt::get(codegen->int_type(), byte_offset), : ConstantInt::get(codegen->tinyint_type(), bit_mask), : zeroes->getStructElement(2)}); This needs to be updated when we change the definition of NullIndicatorOffset in the future. Please add a remark about it in the header declaration. Alternately, one can consider copying the ConstantAggregateZero struct and selectively populating the fields which we deem useful. In this way, newly added field will have the zero values if we forget to initialize them but this approach may still be error-prone if we update the struct layouts in the future. Not sure if there are easy ways to add DCHECK() to check for consistency of the constant generated. http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc File be/src/runtime/tuple.cc: http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc@397 PS11, Line 397: Status( Status::Expected() ? http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/runtime/tuple.cc@449 PS11, Line 449: Value* result_val = builder.CreateCall(cross_compiled_fn, Just curious if you inspected the output IR to see if it's actually unrolling the loop ? -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 11 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 23 Oct 2017 21:25:07 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 11: OK. Will take a look. -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 11 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 23 Oct 2017 20:25:14 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 11: Michael, this uses some of the constant global value and .ToIR() concepts that you had used in a few places. Would be good to confirm that you think this is a good application of the pattern. -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 11 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 23 Oct 2017 18:37:36 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 11: Code-Review+1 carry -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 11 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 20 Oct 2017 23:36:26 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/exec/hdfs-avro-scanner.h File be/src/exec/hdfs-avro-scanner.h: http://gerrit.cloudera.org:8080/#/c/8146/11/be/src/exec/hdfs-avro-scanner.h@136 PS11, Line 136: // in a simple array of struct simplifies codegen and speeds up interpretation. Addressed Alex's comment on the other CR https://gerrit.cloudera.org/#/c/8172/4/be/src/exec/hdfs-scanner.h@276 -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 11 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 20 Oct 2017 23:36:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Hello Thomas Tauber-Marshall, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8146 to look at the new patch set (#11). Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. IMPALA-5307: Part 2: copy out strings in uncompressed Avro The approach is to re-materialize strings in those tuples that survive conjunct evaluation and may reference disk I/O buffers directly. This means that perf should not regress for the following cases: * Compressed Avro files. * Non-string columns. * Selective scans where the majority of tuples are filtered out. This approach will also work for the Sequence and Text scanners. Includes some improvements to Avro codegen to replace more constants to help win back some performance (with limited success): replaced InitTuple() with an optimised version and substituted tuple_byte_size() with a constant. Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed length. Perf: Did microbenchmarks on uncompressed Avro files, one with all columns from lineitem and one with only l_comment. Tests were run with: set num_scanner_threads=1; I ran the query 5 times and extracted MaterializeTupleTime from the profile to measure CPU cost of materialization. Overall string materialization got significantly slower, mainly because of the extra memcpy() calls required. Selecting one string from a table with multiple columns: select min(l_comment) from biglineitem_avro 1.814 -> 2.096 Selecting one string from a table with one column: select min(l_comment) from biglineitem_comment; profile; 1.708 -> 3.7 Selecting one string from a table with one column with predicate: select min(l_comment) from biglineitem_comment where length(l_comment) > 1; 1.691 -> 1.449 Selecting all columns: select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber), min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax), min(l_returnflag), min(l_linestatus), min(l_shipdate), min(l_commitdate), min(l_receiptdate), min(l_shipinstruct), min(l_shipmode), min(l_comment) from biglineitem_avro; profile; 2.335 -> 3.711 Selecting an int column (no strings): select min(l_linenumber) from biglineitem_avro 1.806 -> 1.819 Testing: Ran exhaustive tests. Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/impala-ir.cc M be/src/codegen/llvm-codegen.cc M be/src/codegen/llvm-codegen.h M be/src/common/status.cc M be/src/common/status.h M be/src/exec/hdfs-avro-scanner-ir.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-avro-scanner.h M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-scanner.h M be/src/runtime/CMakeLists.txt M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h A be/src/runtime/tuple-ir.cc M be/src/runtime/tuple.cc M be/src/runtime/tuple.h 19 files changed, 333 insertions(+), 56 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/11 -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 11 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 10: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 10 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 20 Oct 2017 17:13:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Hello Thomas Tauber-Marshall, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8146 to look at the new patch set (#10). Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. IMPALA-5307: Part 2: copy out strings in uncompressed Avro The approach is to re-materialize strings in those tuples that survive conjunct evaluation and may reference disk I/O buffers directly. This means that perf should not regress for the following cases: * Compressed Avro files. * Non-string columns. * Selective scans where the majority of tuples are filtered out. This approach will also work for the Sequence and Text scanners. Includes some improvements to Avro codegen to replace more constants to help win back some performance (with limited success): replaced InitTuple() with an optimised version and substituted tuple_byte_size() with a constant. Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed length. Perf: Did microbenchmarks on uncompressed Avro files, one with all columns from lineitem and one with only l_comment. Tests were run with: set num_scanner_threads=1; I ran the query 5 times and extracted MaterializeTupleTime from the profile to measure CPU cost of materialization. Overall string materialization got significantly slower, mainly because of the extra memcpy() calls required. Selecting one string from a table with multiple columns: select min(l_comment) from biglineitem_avro 1.814 -> 2.096 Selecting one string from a table with one column: select min(l_comment) from biglineitem_comment; profile; 1.708 -> 3.7 Selecting one string from a table with one column with predicate: select min(l_comment) from biglineitem_comment where length(l_comment) > 1; 1.691 -> 1.449 Selecting all columns: select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber), min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax), min(l_returnflag), min(l_linestatus), min(l_shipdate), min(l_commitdate), min(l_receiptdate), min(l_shipinstruct), min(l_shipmode), min(l_comment) from biglineitem_avro; profile; 2.335 -> 3.711 Selecting an int column (no strings): select min(l_linenumber) from biglineitem_avro 1.806 -> 1.819 Testing: Ran exhaustive tests. Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/impala-ir.cc M be/src/codegen/llvm-codegen.cc M be/src/codegen/llvm-codegen.h M be/src/common/status.cc M be/src/common/status.h M be/src/exec/hdfs-avro-scanner-ir.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-avro-scanner.h M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-scanner.h M be/src/runtime/CMakeLists.txt M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h A be/src/runtime/tuple-ir.cc M be/src/runtime/tuple.cc M be/src/runtime/tuple.h 19 files changed, 331 insertions(+), 56 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/10 -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 10 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 9: (4 comments) http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/exec/hdfs-scanner.cc File be/src/exec/hdfs-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/exec/hdfs-scanner.cc@543 PS9, Line 543: *init_tuple_fn = : codegen->GetFunction(IRFunction::HDFS_SCANNER_INIT_TUPLE, true); > single line? Done http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/descriptors.cc File be/src/runtime/descriptors.cc: http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/descriptors.cc@86 PS9, Line 86: ConstantAggregateZero* zeroes = : ConstantAggregateZero::get(null_indicator_offset_type); > single line? Done http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/tuple.h File be/src/runtime/tuple.h: http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/tuple.h@182 PS9, Line 182: Status* status > Its unusual to have s status as an out parameter instead of returning it. C That's a good point. I discovered for PhjBuilder::AppendRow() a while ago that the extra IR emitted noticeably hurt codegen time. Copied over the explanation from there. http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/tuple.h@184 PS9, Line 184: Helper for MaterializeStrings() > If this isn't intended to be called outside this class, make it private? Good point - done. -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 9 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 20 Oct 2017 00:48:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 9: (4 comments) Looks good http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/exec/hdfs-scanner.cc File be/src/exec/hdfs-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/exec/hdfs-scanner.cc@543 PS9, Line 543: *init_tuple_fn = : codegen->GetFunction(IRFunction::HDFS_SCANNER_INIT_TUPLE, true); single line? http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/descriptors.cc File be/src/runtime/descriptors.cc: http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/descriptors.cc@86 PS9, Line 86: ConstantAggregateZero* zeroes = : ConstantAggregateZero::get(null_indicator_offset_type); single line? http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/tuple.h File be/src/runtime/tuple.h: http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/tuple.h@182 PS9, Line 182: Status* status Its unusual to have s status as an out parameter instead of returning it. Could you document why you did this? http://gerrit.cloudera.org:8080/#/c/8146/9/be/src/runtime/tuple.h@184 PS9, Line 184: Helper for MaterializeStrings() If this isn't intended to be called outside this class, make it private? -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 9 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 19 Oct 2017 17:09:15 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. Patch Set 8: (1 comment) There's a fair bit of codegen logic in this patch - maybe you could take a look Thomas since you've been looking at codegen recently? http://gerrit.cloudera.org:8080/#/c/8146/6/be/src/runtime/tuple.cc File be/src/runtime/tuple.cc: http://gerrit.cloudera.org:8080/#/c/8146/6/be/src/runtime/tuple.cc@448 PS6, Line 448: Value* constant_slot_offsets_first_element_ptr = Weird ordering of statements. -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 8 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 12 Oct 2017 17:26:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5307: Part 2: copy out strings in uncompressed Avro
Tim Armstrong has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/8146 ) Change subject: IMPALA-5307: Part 2: copy out strings in uncompressed Avro .. IMPALA-5307: Part 2: copy out strings in uncompressed Avro The approach is to re-materialize strings in those tuples that survive conjunct evaluation and may reference disk I/O buffers directly. This means that perf should not regress for the following cases: * Compressed Avro files. * Non-string columns. * Selective scans where the majority of tuples are filtered out. This approach will also work for the Sequence and Text scanners. Includes some improvements to Avro codegen to replace more constants to help win back some performance (with limited success): replaced InitTuple() with an optimised version and substituted tuple_byte_size() with a constant. Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed length. Perf: Did microbenchmarks on uncompressed Avro files, one with all columns from lineitem and one with only l_comment. Tests were run with: set num_scanner_threads=1; I ran the query 5 times and extracted MaterializeTupleTime from the profile to measure CPU cost of materialization. Overall string materialization got significantly slower, mainly because of the extra memcpy() calls required. Selecting one string from a table with multiple columns: select min(l_comment) from biglineitem_avro 1.814 -> 2.096 Selecting one string from a table with one column: select min(l_comment) from biglineitem_comment; profile; 1.708 -> 3.7 Selecting one string from a table with one column with predicate: select min(l_comment) from biglineitem_comment where length(l_comment) > 1; 1.691 -> 1.449 Selecting all columns: select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber), min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax), min(l_returnflag), min(l_linestatus), min(l_shipdate), min(l_commitdate), min(l_receiptdate), min(l_shipinstruct), min(l_shipmode), min(l_comment) from biglineitem_avro; profile; 2.335 -> 3.711 Selecting an int column (no strings): select min(l_linenumber) from biglineitem_avro 1.806 -> 1.819 Testing: Ran exhaustive tests. Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 --- M be/src/codegen/gen_ir_descriptions.py M be/src/codegen/impala-ir.cc M be/src/codegen/llvm-codegen.cc M be/src/codegen/llvm-codegen.h M be/src/common/status.cc M be/src/common/status.h M be/src/exec/hdfs-avro-scanner-ir.cc M be/src/exec/hdfs-avro-scanner.cc M be/src/exec/hdfs-avro-scanner.h M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-scanner.h M be/src/runtime/CMakeLists.txt M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/runtime-state.cc M be/src/runtime/runtime-state.h A be/src/runtime/tuple-ir.cc M be/src/runtime/tuple.cc M be/src/runtime/tuple.h 19 files changed, 329 insertions(+), 56 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/8146/7 -- To view, visit http://gerrit.cloudera.org:8080/8146 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233 Gerrit-Change-Number: 8146 Gerrit-PatchSet: 7 Gerrit-Owner: Tim Armstrong