[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8631 ) Change subject: IMPALA-6227: deflake admission stress tests .. Patch Set 5: Code-Review+2 Carry -- To view, visit http://gerrit.cloudera.org:8080/8631 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4 Gerrit-Change-Number: 8631 Gerrit-PatchSet: 5 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 22 Nov 2017 23:17:01 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8631 ) Change subject: IMPALA-6227: deflake admission stress tests .. Patch Set 5: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1518/ -- To view, visit http://gerrit.cloudera.org:8080/8631 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4 Gerrit-Change-Number: 8631 Gerrit-PatchSet: 5 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 22 Nov 2017 23:17:14 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8631 ) Change subject: IMPALA-6227: deflake admission stress tests .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/8631/4/tests/custom_cluster/test_admission_controller.py File tests/custom_cluster/test_admission_controller.py: http://gerrit.cloudera.org:8080/#/c/8631/4/tests/custom_cluster/test_admission_controller.py@698 PS4, Line 698: every : # second > shouldn't that say "at least every second" or something? Done -- To view, visit http://gerrit.cloudera.org:8080/8631 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4 Gerrit-Change-Number: 8631 Gerrit-PatchSet: 4 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 22 Nov 2017 23:16:40 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests
Hello Bikramjeet Vig, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8631 to look at the new patch set (#5). Change subject: IMPALA-6227: deflake admission stress tests .. IMPALA-6227: deflake admission stress tests The problem was that, during the initial admission decision phase, some queries were initially queued then dequeued once memory came available. All of the accounting in the test implicitly relies on queries not being dequeued until queries are later explicitly ended, so if this happened, the test broke in multiple subtle ways. This happened because the query only scanned a small number of rows, which could be all buffered on the receiver side of the exchange even before the client fetched any rows from the coordinator. This means that the reserved memory on some backends could increase then decrease during the initial admission phase, resulting in a query being queued then dequeued. The fix is to increase the number of rows returned by the query so that all fragments remain active during the initial admission phase. This increased test execution time somewhat, so I also had to bump the queue wait timeout for the admission stress tests (they assume that queries don't time out in the queue). Testing: Ran the test under debug, release and ASAN builds, i.e. impala-py.test tests/custom_cluster/test_admission_controller.py \ --workload_exploration_strategy="functional-query:exhaustive" I looped the mem_limit test for a while to confirm it didn't reproduce (it reproduced reliably every 2-3 iterations before this fix). It still reproduces every 5-10 runs with exhaustive+release, so I need to do further work to make it more robust. Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4 --- M fe/src/test/resources/llama-site-test2.xml M tests/custom_cluster/test_admission_controller.py 2 files changed, 116 insertions(+), 49 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/8631/5 -- To view, visit http://gerrit.cloudera.org:8080/8631 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4 Gerrit-Change-Number: 8631 Gerrit-PatchSet: 5 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8631 ) Change subject: IMPALA-6227: deflake admission stress tests .. Patch Set 4: Code-Review+2 (1 comment) Seems like an improvement if you want to commit it here even if it doesn't address all flakiness. http://gerrit.cloudera.org:8080/#/c/8631/4/tests/custom_cluster/test_admission_controller.py File tests/custom_cluster/test_admission_controller.py: http://gerrit.cloudera.org:8080/#/c/8631/4/tests/custom_cluster/test_admission_controller.py@698 PS4, Line 698: every : # second shouldn't that say "at least every second" or something? -- To view, visit http://gerrit.cloudera.org:8080/8631 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4 Gerrit-Change-Number: 8631 Gerrit-PatchSet: 4 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 22 Nov 2017 23:14:03 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8631 ) Change subject: IMPALA-6227: deflake admission stress tests .. Patch Set 3: (1 comment) Just to summarise where this is at, it's a lot less flaky but I'm still seeing occasional flakes that I think are the result of the test seeing inconsistent metrics (i.e. admitted, queued, dequeued, etc are out of sync). http://gerrit.cloudera.org:8080/#/c/8631/3/tests/custom_cluster/test_admission_controller.py File tests/custom_cluster/test_admission_controller.py: http://gerrit.cloudera.org:8080/#/c/8631/3/tests/custom_cluster/test_admission_controller.py@706 PS3, Line 706: sleep(0.2) > what's that about? does the QUERY_TIMEOUT comment above need updating? I was experimenting with ways to reduce test runtime and reduce the chances of it hitting various timeouts. Factored out a constant to make the relationship clearer. -- To view, visit http://gerrit.cloudera.org:8080/8631 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4 Gerrit-Change-Number: 8631 Gerrit-PatchSet: 3 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 22 Nov 2017 23:03:27 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests
Hello Bikramjeet Vig, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8631 to look at the new patch set (#4). Change subject: IMPALA-6227: deflake admission stress tests .. IMPALA-6227: deflake admission stress tests The problem was that, during the initial admission decision phase, some queries were initially queued then dequeued once memory came available. All of the accounting in the test implicitly relies on queries not being dequeued until queries are later explicitly ended, so if this happened, the test broke in multiple subtle ways. This happened because the query only scanned a small number of rows, which could be all buffered on the receiver side of the exchange even before the client fetched any rows from the coordinator. This means that the reserved memory on some backends could increase then decrease during the initial admission phase, resulting in a query being queued then dequeued. The fix is to increase the number of rows returned by the query so that all fragments remain active during the initial admission phase. This increased test execution time somewhat, so I also had to bump the queue wait timeout for the admission stress tests (they assume that queries don't time out in the queue). Testing: Ran the test under debug, release and ASAN builds, i.e. impala-py.test tests/custom_cluster/test_admission_controller.py \ --workload_exploration_strategy="functional-query:exhaustive" I looped the mem_limit test for a while to confirm it didn't reproduce (it reproduced reliably every 2-3 iterations before this fix). It still reproduces every 5-10 runs with exhaustive+release, so I need to do further work to make it more robust. Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4 --- M fe/src/test/resources/llama-site-test2.xml M tests/custom_cluster/test_admission_controller.py 2 files changed, 114 insertions(+), 47 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/8631/4 -- To view, visit http://gerrit.cloudera.org:8080/8631 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4 Gerrit-Change-Number: 8631 Gerrit-PatchSet: 4 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests
Hello Bikramjeet Vig, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8631 to look at the new patch set (#2). Change subject: IMPALA-6227: deflake admission stress tests .. IMPALA-6227: deflake admission stress tests The problem was that, during the initial admission decision phase, some queries were initially queued then dequeued once memory came available. All of the accounting in the test implicitly relies on queries not being dequeued until queries are later explicitly ended, so if this happened, the test broke in multiple subtle ways. This happened because the query only scanned a small number of rows, which could be all buffered on the receiver side of the exchange even before the client fetched any rows from the coordinator. This means that the reserved memory on some backends could increase then decrease during the initial admission phase, resulting in a query being queued then dequeued. The fix is to increase the number of rows returned by the query so that all fragments remain active during the initial admission phase. This increased test execution time somewhat, so I also had to bump the queue wait timeout for the admission stress tests (they assume that queries don't time out in the queue). Testing: Ran the test under debug, release and ASAN builds, i.e. impala-py.test tests/custom_cluster/test_admission_controller.py \ --workload_exploration_strategy="functional-query:exhaustive" I looped the mem_limit test for a while to confirm it didn't reproduce (it reproduced reliably every 2-3 iterations before this fix). I will try looping it a bit more under different build types concurrently with the review. Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4 --- M fe/src/test/resources/llama-site-test2.xml M tests/custom_cluster/test_admission_controller.py 2 files changed, 89 insertions(+), 47 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/8631/2 -- To view, visit http://gerrit.cloudera.org:8080/8631 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4 Gerrit-Change-Number: 8631 Gerrit-PatchSet: 2 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht
[Impala-ASF-CR] IMPALA-6227: deflake admission stress tests
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8631 ) Change subject: IMPALA-6227: deflake admission stress tests .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/8631/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8631/1//COMMIT_MSG@11 PS1, Line 11: All of the accounting in the test implicitly relies on queries not being : dequeued until queries are later explicitly ended, so if this happened, : the test broke in multiple subtle ways. > and this assumption is no longer true because of the final change for IMPAL I don't think it was ever true for the mem_limit tests. Even before IMPALA-1575 it depended implicitly on the memory staying reserved on all backends. >From what I can tell, the bug was always there, it was a tweak to the >statestore frequency that threw it out of balance. http://gerrit.cloudera.org:8080/#/c/8631/1/tests/custom_cluster/test_admission_controller.py File tests/custom_cluster/test_admission_controller.py: http://gerrit.cloudera.org:8080/#/c/8631/1/tests/custom_cluster/test_admission_controller.py@705 PS1, Line 705: b4 > where is that used? Leftover debugging code that I missed removing. http://gerrit.cloudera.org:8080/#/c/8631/1/tests/custom_cluster/test_admission_controller.py@792 PS1, Line 792: amount of time > is this actually timing sensitive, or is it that as long as we don't fetch It shouldn't be timing sensitive. Updated the comment. -- To view, visit http://gerrit.cloudera.org:8080/8631 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4 Gerrit-Change-Number: 8631 Gerrit-PatchSet: 1 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Bikramjeet Vig Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 22 Nov 2017 22:45:50 + Gerrit-HasComments: Yes