[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. Patch Set 6: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 6 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 16 Feb 2018 20:23:20 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows Tuple pointers in the generated row batches may not be initialized if a tuple has byte size 0. There are some codes which compare these uninitialized pointers against nullptr so having them uninitialized may return wrong (and non-deterministic) results, e.g.: impala::TupleIsNullPredicate::GetBooleanVal() The following query produces non-deterministic results currently: SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN ( SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col) WHERE t3.bool_col = true; The alltypestiny table has 8 records, 4 records of them has the true boolean value for bool_col. Therefore, the above query is a fancy way of saying "8 * 8 * 4", i.e. it should return 256. The solution is that scanners initialize tuple pointers to a non-null value when there are no materialized slots. This non-null value is provided by the new static member Tuple::POISON. I extended QueryTest/scanners.test with the above query. This test executes the query against all table formats. This change has the biggest performance impact on count(*) queries on large kudu tables. For my quick benchmark I copied tpch_kudu.lineitem and doubled its data. The resulting table has 12,002,430 rows. Without this patch 'select count(*) from biglineitem' runs for ~0.12s. With the patch applied, the overhead is around a dozens of ms. I measured the query on my desktop PC using a relase build of Impala. On debug builds, the execution time of the patched version is around 160% of the original version. Without this patch: +--++--+--+++---+---+-+ | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | +--++--+--+++---+---+-+ | 03:AGGREGATE | 1 | 127.50us | 127.50us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE| | 02:EXCHANGE | 1 | 22.32ms | 22.32ms | 3 | 1 | 0 B | 0 B | UNPARTITIONED | | 01:AGGREGATE | 3 | 1.78ms | 1.89ms | 3 | 1 | 16.00 KB | 10.00 MB | | | 00:SCAN KUDU | 3 | 8.00ms | 8.28ms | 12.00M | -1 | 512.00 KB | 0 B | default.biglineitem | +--++--+--+++---+---+-+ With this patch: +--++--+--+++---+---+-+ | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | +--++--+--+++---+---+-+ | 03:AGGREGATE | 1 | 129.01us | 129.01us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE| | 02:EXCHANGE | 1 | 33.00ms | 33.00ms | 3 | 1 | 0 B | 0 B | UNPARTITIONED | | 01:AGGREGATE | 3 | 1.99ms | 2.13ms | 3 | 1 | 16.00 KB | 10.00 MB | | | 00:SCAN KUDU | 3 | 13.13ms | 13.97ms | 12.00M | -1 | 512.00 KB | 0 B | default.biglineitem | +--++--+--+++---+---+-+ Change-Id: I2981227e62eb5971508e0698e189519755de Reviewed-on: http://gerrit.cloudera.org:8080/9239 Reviewed-by: Tim ArmstrongTested-by: Impala Public Jenkins --- M be/src/exec/hdfs-scanner.cc M be/src/exec/kudu-scanner.cc M be/src/runtime/tuple.cc M be/src/runtime/tuple.h M testdata/workloads/functional-query/queries/QueryTest/scanners.test 5 files changed, 27 insertions(+), 2 deletions(-) Approvals: Tim Armstrong: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 7 Gerrit-Owner: Zoltan Borok-Nagy Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. Patch Set 6: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1948/ -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 6 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 16 Feb 2018 16:41:10 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. Patch Set 6: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 6 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 16 Feb 2018 16:40:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. Patch Set 5: Seems like it was hit by IMPALA-6532 -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 5 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 16 Feb 2018 14:25:55 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. Patch Set 5: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/1947/ -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 5 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 16 Feb 2018 09:45:53 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. Patch Set 5: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1947/ -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 5 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 16 Feb 2018 06:03:47 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. Patch Set 5: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 5 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 16 Feb 2018 06:03:35 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Hello Thomas Tauber-Marshall, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9239 to look at the new patch set (#4). Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows Tuple pointers in the generated row batches may not be initialized if a tuple has byte size 0. There are some codes which compare these uninitialized pointers against nullptr so having them uninitialized may return wrong (and non-deterministic) results, e.g.: impala::TupleIsNullPredicate::GetBooleanVal() The following query produces non-deterministic results currently: SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN ( SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col) WHERE t3.bool_col = true; The alltypestiny table has 8 records, 4 records of them has the true boolean value for bool_col. Therefore, the above query is a fancy way of saying "8 * 8 * 4", i.e. it should return 256. The solution is that scanners initialize tuple pointers to a non-null value when there are no materialized slots. This non-null value is provided by the new static member Tuple::POISON. I extended QueryTest/scanners.test with the above query. This test executes the query against all table formats. This change has the biggest performance impact on count(*) queries on large kudu tables. For my quick benchmark I copied tpch_kudu.lineitem and doubled its data. The resulting table has 12,002,430 rows. Without this patch 'select count(*) from biglineitem' runs for ~0.12s. With the patch applied, the overhead is around a dozens of ms. I measured the query on my desktop PC using a relase build of Impala. On debug builds, the execution time of the patched version is around 160% of the original version. Without this patch: +--++--+--+++---+---+-+ | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | +--++--+--+++---+---+-+ | 03:AGGREGATE | 1 | 127.50us | 127.50us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE| | 02:EXCHANGE | 1 | 22.32ms | 22.32ms | 3 | 1 | 0 B | 0 B | UNPARTITIONED | | 01:AGGREGATE | 3 | 1.78ms | 1.89ms | 3 | 1 | 16.00 KB | 10.00 MB | | | 00:SCAN KUDU | 3 | 8.00ms | 8.28ms | 12.00M | -1 | 512.00 KB | 0 B | default.biglineitem | +--++--+--+++---+---+-+ With this patch: +--++--+--+++---+---+-+ | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | +--++--+--+++---+---+-+ | 03:AGGREGATE | 1 | 129.01us | 129.01us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE| | 02:EXCHANGE | 1 | 33.00ms | 33.00ms | 3 | 1 | 0 B | 0 B | UNPARTITIONED | | 01:AGGREGATE | 3 | 1.99ms | 2.13ms | 3 | 1 | 16.00 KB | 10.00 MB | | | 00:SCAN KUDU | 3 | 13.13ms | 13.97ms | 12.00M | -1 | 512.00 KB | 0 B | default.biglineitem | +--++--+--+++---+---+-+ Change-Id: I2981227e62eb5971508e0698e189519755de --- M be/src/exec/hdfs-scanner.cc M be/src/exec/kudu-scanner.cc M be/src/runtime/tuple.cc M be/src/runtime/tuple.h M testdata/workloads/functional-query/queries/QueryTest/scanners.test 5 files changed, 27 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9239/4 -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 4 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. Patch Set 3: (5 comments) http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG@57 PS2, Line 57: | 03:AGGREGATE | 1 | 129.01us | 129.01us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE| > With the new numbers, I think that this seems okay. Done, Tim filed IMPALA-6501 http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/hdfs-scanner.cc File be/src/exec/hdfs-scanner.cc: http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/hdfs-scanner.cc@228 PS3, Line 228: // Let's not leave tuple ptrs uninitialized. > Can you add the JIRA to this comment to better explain the consequences of Added the JIRA number http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/kudu-scanner.cc File be/src/exec/kudu-scanner.cc: http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/kudu-scanner.cc@259 PS3, Line 259: // Initialize tuple ptrs to a dummy non-null value > Can you add the JIRA there too? yes, added it http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h File be/src/runtime/tuple.h: http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h@89 PS3, Line 89: Poison > This should either be lower-case if we're treating it as a variable or uppe Renamed it to all upper-case, I also think it is more of a constant. And we'll never modify the pointed object since it would raise an error instantly. http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h@91 PS3, Line 91: void Init(int size) { > Unnecessary formatting change here. Oops, it became multi-line when I added those DCHECKs everywhere. -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 13 Feb 2018 17:39:20 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. Patch Set 3: > I don't think there's a JIRA. Filed IMPALA-6501 Great, thanks. -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 12 Feb 2018 19:45:25 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. Patch Set 3: (4 comments) Logic of the patch looks good to me, did a final pass for stylistic nits. I'll +2 once those are fixed. http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/hdfs-scanner.cc File be/src/exec/hdfs-scanner.cc: http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/hdfs-scanner.cc@228 PS3, Line 228: // Let's not leave tuple ptrs uninitialized. Can you add the JIRA to this comment to better explain the consequences of not doing this? http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/kudu-scanner.cc File be/src/exec/kudu-scanner.cc: http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/kudu-scanner.cc@259 PS3, Line 259: // Initialize tuple ptrs to a dummy non-null value Can you add the JIRA there too? http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h File be/src/runtime/tuple.h: http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h@89 PS3, Line 89: Poison This should either be lower-case if we're treating it as a variable or upper-case if we're treating it as a constant. It's a weird edge case but it feels like it should be a constant to me. http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h@91 PS3, Line 91: void Init(int size) { Unnecessary formatting change here. -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 12 Feb 2018 19:48:17 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. Patch Set 3: I don't think there's a JIRA. Filed IMPALA-6501 -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 12 Feb 2018 19:41:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG@57 PS2, Line 57: | 03:AGGREGATE | 1 | 129.01us | 129.01us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE| > Yes, I measured it against a debug build... With the new numbers, I think that this seems okay. Is there a JIRA for the count(*) optimization for Kudu? (and if not, could you file one?) -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 12 Feb 2018 17:25:27 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG@57 PS2, Line 57: 00:SCAN KUDU3 90.856ms 107.409ms 6.00M 6.00M 512.00 KB 0 tpch_kudu.lineitem > I tried to do a similar experiment with a larger Kudu scale factor (I creat Yes, I measured it against a debug build... Re-run the query against release versions of Impala and now the difference is much smaller. Updated the commit with the new data. -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 12 Feb 2018 16:26:03 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Hello Thomas Tauber-Marshall, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9239 to look at the new patch set (#3). Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows Tuple pointers in the generated row batches may not be initialized if a tuple has byte size 0. There are some codes which compare these uninitialized pointers against nullptr so having them uninitialized may return wrong (and non-deterministic) results, e.g.: impala::TupleIsNullPredicate::GetBooleanVal() The following query produces non-deterministic results currently: SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN ( SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col) WHERE t3.bool_col = true; The alltypestiny table has 8 records, 4 records of them has the true boolean value for bool_col. Therefore, the above query is a fancy way of saying "8 * 8 * 4", i.e. it should return 256. The solution is that scanners initialize tuple pointers to a non-null value when there are no materialized slots. This non-null value is provided by the new static method called Tuple::Poison(). I extended QueryTest/scanners.test with the above query. This test executes the query against all table formats. This change has the biggest performance impact on count(*) queries on large kudu tables. For my quick benchmark I copied tpch_kudu.lineitem and doubled its data. The resulting table has 12,002,430 rows. Without this patch 'select count(*) from biglineitem' runs for ~0.12s. With the patch applied, the overhead is around a dozens of ms. I measured the query on my desktop PC using a relase build of Impala. On debug builds, the execution time of the patched version is around 160% of the original version. Without this patch: +--++--+--+++---+---+-+ | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | +--++--+--+++---+---+-+ | 03:AGGREGATE | 1 | 127.50us | 127.50us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE| | 02:EXCHANGE | 1 | 22.32ms | 22.32ms | 3 | 1 | 0 B | 0 B | UNPARTITIONED | | 01:AGGREGATE | 3 | 1.78ms | 1.89ms | 3 | 1 | 16.00 KB | 10.00 MB | | | 00:SCAN KUDU | 3 | 8.00ms | 8.28ms | 12.00M | -1 | 512.00 KB | 0 B | default.biglineitem | +--++--+--+++---+---+-+ With this patch: +--++--+--+++---+---+-+ | Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail | +--++--+--+++---+---+-+ | 03:AGGREGATE | 1 | 129.01us | 129.01us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE| | 02:EXCHANGE | 1 | 33.00ms | 33.00ms | 3 | 1 | 0 B | 0 B | UNPARTITIONED | | 01:AGGREGATE | 3 | 1.99ms | 2.13ms | 3 | 1 | 16.00 KB | 10.00 MB | | | 00:SCAN KUDU | 3 | 13.13ms | 13.97ms | 12.00M | -1 | 512.00 KB | 0 B | default.biglineitem | +--++--+--+++---+---+-+ Change-Id: I2981227e62eb5971508e0698e189519755de --- M be/src/exec/hdfs-scanner.cc M be/src/exec/kudu-scanner.cc M be/src/runtime/tuple.cc M be/src/runtime/tuple.h M testdata/workloads/functional-query/queries/QueryTest/scanners.test 5 files changed, 30 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9239/3 -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG@57 PS2, Line 57: 00:SCAN KUDU3 90.856ms 107.409ms 6.00M 6.00M 512.00 KB 0 tpch_kudu.lineitem I tried to do a similar experiment with a larger Kudu scale factor (I created a new Kudu table like lineitem and expanded it by inserting duplicate data): insert into biglineitem select l_orderkey + max_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment from biglineitem, (select max(l_orderkey) as max_orderkey from biglineitem) v I can definitely see some time spent in the HandleEmptyProjection() function in "perf top" but the delta in performance seems smaller than your experiment showed. I saw it around 5% slower. The count(*) optimisation sounds good but not sure if the regression is severe enough to block this going in. Maybe Thomas can weigh in on how important he thinks this is. -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 09 Feb 2018 23:22:38 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. Patch Set 1: (4 comments) Thanks for the comments! http://gerrit.cloudera.org:8080/#/c/9239/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9239/1//COMMIT_MSG@32 PS1, Line 32: > It would be good to do a microbenchmark to make sure we haven't regressed p Extended the commit with a measurement. Do you think we should develop a count(*) optimisation for kudu like we do for Parquet? http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.h File be/src/runtime/tuple.h: http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.h@225 PS1, Line 225: void SetNull(const NullIndicatorOffset& offset) { > I'm not sure about adding DCHECKs to all these functions. They're definitel Removed the DCHECKs, pointer value is 42 now, such a low address should segfault (it does segfault on my system). http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc File be/src/runtime/tuple.cc: http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc@49 PS1, Line 49: Tuple* Tuple::Poison() { > Can we avoid this function call indirection? MemPool just uses a bogus cons Now it is just a pointer. http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc@87 PS1, Line 87: if (UNLIKELY(this == Poison())) return Poison(); > We can skip this case. MemPool::Allocate() actually already returns a poiso Yeah I saw that, just thought maybe it is better to keep our own poison (that sounds wrong:) ). -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 1 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 08 Feb 2018 17:28:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Hello Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9239 to look at the new patch set (#2). Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows Tuple pointers in the generated row batches may not be initialized if a tuple has byte size 0. There are some codes which compare these uninitialized pointers against nullptr so having them uninitialized may return wrong (and non-deterministic) results, e.g.: impala::TupleIsNullPredicate::GetBooleanVal() The following query produces non-deterministic results currently: SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN ( SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col) WHERE t3.bool_col = true; The alltypestiny table has 8 records, 4 records of them has the true boolean value for bool_col. Therefore, the above query is a fancy way of saying "8 * 8 * 4", i.e. it should return 256. The solution is that scanners initialize tuple pointers to a non-null value when there are no materialized slots. This non-null value is provided by the new static method called Tuple::Poison(). I extended QueryTest/scanners.test with the above query. This test executes the query against all table formats. This change has the biggest performance impact on count(*) queries on large kudu tables. For my quick benchmark I chose tpch_kudu.lineitem which has 6001215 rows. Without this patch 'select count(*) from tpch_kudu.lineitem' runs around 0.15s. With the patch applied, it runs around 0.24s. I ran the query on my desktop PC. Without this patch: ExecSummary: Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail - 03:AGGREGATE1 432.840us 432.840us 1 1 28.00 KB 10.00 MB FINALIZE 02:EXCHANGE 1 31.375ms 31.375ms 3 1 0 0 UNPARTITIONED 01:AGGREGATE35.013ms5.502ms 3 1 16.00 KB 10.00 MB 00:SCAN KUDU3 16.581ms 20.199ms 6.00M 6.00M 512.00 KB 0 tpch_kudu.lineitem With this patch: ExecSummary: Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail - 03:AGGREGATE1 472.898us 472.898us 1 1 28.00 KB 10.00 MB FINALIZE 02:EXCHANGE 1 122.700ms 122.700ms 3 1 0 0 UNPARTITIONED 01:AGGREGATE36.104ms6.151ms 3 1 16.00 KB 10.00 MB 00:SCAN KUDU3 90.856ms 107.409ms 6.00M 6.00M 512.00 KB 0 tpch_kudu.lineitem Change-Id: I2981227e62eb5971508e0698e189519755de --- M be/src/exec/hdfs-scanner.cc M be/src/exec/kudu-scanner.cc M be/src/runtime/tuple.cc M be/src/runtime/tuple.h M testdata/workloads/functional-query/queries/QueryTest/scanners.test 5 files changed, 30 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9239/2 -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 ) Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. Patch Set 1: (4 comments) Looks good. Mainly just questions about perf. http://gerrit.cloudera.org:8080/#/c/9239/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9239/1//COMMIT_MSG@32 PS1, Line 32: It would be good to do a microbenchmark to make sure we haven't regressed performance (or if we have, so that we know how much). I was trying to think of the worst possible case. I think it might be count(*) from a large kudu table, since I think that can be served from metadata only and we currently create a bunch of empty row batches. Parquet has a special count(*) optimisation and other file formats have to decode rows to compute the count, so they seem less likely to regress. http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.h File be/src/runtime/tuple.h: http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.h@225 PS1, Line 225: void SetNull(const NullIndicatorOffset& offset) { I'm not sure about adding DCHECKs to all these functions. They're definitely valid but I'm concerned there might be a real impact on the performance of the debug build (which can be annoying for testing, etc). If we set the pointer to a bogus constant we could potentially catch the same set of bugs via the dereference of invalid memory. http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc File be/src/runtime/tuple.cc: http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc@49 PS1, Line 49: Tuple* Tuple::Poison() { Can we avoid this function call indirection? MemPool just uses a bogus constant pointer value for this purpose. http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc@87 PS1, Line 87: if (UNLIKELY(this == Poison())) return Poison(); We can skip this case. MemPool::Allocate() actually already returns a poison value for 0 allocations. -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 1 Gerrit-Owner: Zoltan Borok-NagyGerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 07 Feb 2018 23:53:16 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Zoltan Borok-Nagy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9239 Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows .. IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows Tuple pointers in the generated row batches may not be initialized if a tuple has byte size 0. There are some codes which compare these uninitialized pointers against nullptr so having them uninitialized may return wrong (and non-deterministic) results, e.g.: impala::TupleIsNullPredicate::GetBooleanVal() The following query produces non-deterministic results currently: SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN ( SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col) WHERE t3.bool_col = true; The alltypestiny table has 8 records, 4 records of them has the true boolean value for bool_col. Therefore, the above query is a fancy way of saying "8 * 8 * 4", i.e. it should return 256. The solution is that scanners initialize tuple pointers to a non-null value when there are no materialized slots. This non-null value is provided by the new static method called Tuple::Poison(). I extended QueryTest/scanners.test with the above query. This test executes the query against all table formats. Change-Id: I2981227e62eb5971508e0698e189519755de --- M be/src/exec/hdfs-scanner.cc M be/src/exec/kudu-scanner.cc M be/src/runtime/tuple.cc M be/src/runtime/tuple.h M testdata/workloads/functional-query/queries/QueryTest/scanners.test 5 files changed, 47 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9239/1 -- To view, visit http://gerrit.cloudera.org:8080/9239 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I2981227e62eb5971508e0698e189519755de Gerrit-Change-Number: 9239 Gerrit-PatchSet: 1 Gerrit-Owner: Zoltan Borok-Nagy