[Impala-ASF-CR] IMPALA-10258, IMPALA-10109: Fixed flaky test in test_query_retries.py
Wenzhe Zhou has abandoned this change. ( http://gerrit.cloudera.org:8080/16763 ) Change subject: IMPALA-10258, IMPALA-10109: Fixed flaky test in test_query_retries.py .. Abandoned -- To view, visit http://gerrit.cloudera.org:8080/16763 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: abandon Gerrit-Change-Id: Ib89f7b01a0f2a66a97f312e779a4ab04f4f347f3 Gerrit-Change-Number: 16763 Gerrit-PatchSet: 2 Gerrit-Owner: Wenzhe Zhou Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall
[Impala-ASF-CR] IMPALA-10258, IMPALA-10109: Fixed flaky test in test_query_retries.py
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/16763 )

Change subject: IMPALA-10258, IMPALA-10109: Fixed flaky test in test_query_retries.py
..

Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/16763/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16763/2//COMMIT_MSG@7
PS2, Line 7: IMPALA-10258, IMPALA-10109
Since these issues are basically unrelated, could you separate them out into two reviews?

http://gerrit.cloudera.org:8080/#/c/16763/2//COMMIT_MSG@9
PS2, Line 9: When TestQueryRetries.test_original_query_cancel was run on S3
I'm not sure I understand what you're saying the issue is: According to the JIRA, the test was waiting for the query to reach state "RUNNING", but it was already at state "EXCEPTION" (QueryState = 5, see beeswax.thrift). At that point in the test, the query shouldn't have failed, since the impalad hasn't been killed yet, so I'm really not sure what could have happened, and unfortunately it doesn't look like we have the logs for it.

http://gerrit.cloudera.org:8080/#/c/16763/2//COMMIT_MSG@16
PS2, Line 16: For IMPALA-10109, test_retries_from_cancellation_pool did not
I'm not sure I understand what you're saying the issue is: According to the JIRA, the query timed out after ~784s, which is a lot longer than the default statestore time-to-detect-failure of heartbeat_frequency x max_missed = 1000ms x 10 = 10s. So it seems like the coordinator should have had plenty of time to get the statestore message, even under the old values.

Looking through the logs, I'm a little confused by what I see: the coordinator says the query was only scheduled on 2 backends, but I think the test assumes that it gets scheduled on all 3 backends in the minicluster (see __kill_random_impalad()). I also see a reference to CancelFromThreadPool in QueryExecMgr on impalad_node1, but that shouldn't be hit unless the coordinator is killed, which it shouldn't have been.
-- To view, visit http://gerrit.cloudera.org:8080/16763 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib89f7b01a0f2a66a97f312e779a4ab04f4f347f3 Gerrit-Change-Number: 16763 Gerrit-PatchSet: 2 Gerrit-Owner: Wenzhe Zhou Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Tue, 24 Nov 2020 20:36:46 + Gerrit-HasComments: Yes
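The failure-detection arithmetic in the review comment above can be written out as a short sketch. It only restates the numbers quoted in the comment (1000 ms heartbeat frequency, 10 missed heartbeats, a ~784 s query timeout); the function and variable names are hypothetical and are not taken from the Impala code base.

```python
def time_to_detect_failure_ms(heartbeat_frequency_ms, max_missed_heartbeats):
    """Approximate time before a silent subscriber is declared failed.

    Hypothetical helper: it multiplies the two values exactly as the
    review comment does and ignores scheduling and network delays.
    """
    return heartbeat_frequency_ms * max_missed_heartbeats


# Values quoted in the review comment.
default_detect_ms = time_to_detect_failure_ms(1000, 10)
observed_timeout_s = 784

print("detection window: %.1f s" % (default_detect_ms / 1000.0))
print("timeout / detection window: %.1f" %
      (observed_timeout_s / (default_detect_ms / 1000.0)))
```

Under these numbers the query waited dozens of detection windows before timing out, which is why the reviewer doubts that a slow membership update alone explains the failure.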
[Impala-ASF-CR] IMPALA-10258, IMPALA-10109: Fixed flaky test in test_query_retries.py
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16763 ) Change subject: IMPALA-10258, IMPALA-10109: Fixed flaky test in test_query_retries.py .. Patch Set 2: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/7727/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16763 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib89f7b01a0f2a66a97f312e779a4ab04f4f347f3 Gerrit-Change-Number: 16763 Gerrit-PatchSet: 2 Gerrit-Owner: Wenzhe Zhou Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Comment-Date: Tue, 24 Nov 2020 18:46:28 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-10258, IMPALA-10109: Fixed flaky test in test_query_retries.py
Wenzhe Zhou has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/16763 )

Change subject: IMPALA-10258, IMPALA-10109: Fixed flaky test in test_query_retries.py
..

IMPALA-10258, IMPALA-10109: Fixed flaky test in test_query_retries.py

When TestQueryRetries.test_original_query_cancel was run on S3 with the query option spool_query_results enabled, the query timed out before reaching the expected state. This patch doubles the query timeout when the test runs on S3 and doubles the timeout for the query to reach the "FINISHED" state.

For IMPALA-10109, test_retries_from_cancellation_pool did not trigger a query retry when one of the impalads was killed. It seems that the membership update message was not received and processed by the coordinator before reaching a terminated state, hence the query retry was not triggered. This patch reduces heartbeat_frequency and max_missed_heartbeats so that the statestore takes much less time to update membership when an impalad is killed, allowing the coordinator to start the query retry.

Testing:
- Ran the two tests in a loop for more than 3 hours. The test failures did not happen.

Change-Id: Ib89f7b01a0f2a66a97f312e779a4ab04f4f347f3
---
M tests/custom_cluster/test_query_retries.py
1 file changed, 12 insertions(+), 2 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/16763/2
-- To view, visit http://gerrit.cloudera.org:8080/16763 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib89f7b01a0f2a66a97f312e779a4ab04f4f347f3 Gerrit-Change-Number: 16763 Gerrit-PatchSet: 2 Gerrit-Owner: Wenzhe Zhou Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Thomas Tauber-Marshall
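For readers unfamiliar with this kind of test, the "double the timeout" idea in the commit message above might look roughly like the following polling loop. This is a minimal sketch, not the code in test_query_retries.py: wait_for_state, get_state, timeout_multiplier and terminal_states are all hypothetical names, and the early exit on "EXCEPTION" is an added assumption, included because a query that has already failed will not reach the expected state however long the wait.

```python
import time


def wait_for_state(get_state, expected, timeout_s, timeout_multiplier=1,
                   terminal_states=("EXCEPTION",), poll_interval_s=0.05):
    """Poll get_state() until it returns `expected`.

    Hypothetical helper. Raises RuntimeError as soon as a terminal state
    is seen, and TimeoutError once timeout_s * timeout_multiplier seconds
    have elapsed without reaching `expected`.
    """
    deadline = time.monotonic() + timeout_s * timeout_multiplier
    while True:
        state = get_state()
        if state == expected:
            return state
        if state in terminal_states:
            # Waiting longer cannot help once the query has failed.
            raise RuntimeError("query reached %s while waiting for %s"
                               % (state, expected))
        if time.monotonic() >= deadline:
            raise TimeoutError("query still in %s after waiting for %s"
                               % (state, expected))
        time.sleep(poll_interval_s)


# Simulated sequence of query states; a real test would query the server.
states = iter(["COMPILED", "RUNNING", "FINISHED"])
# timeout_multiplier=2 stands in for the doubled timeout on slow storage.
print(wait_for_state(lambda: next(states), "FINISHED",
                     timeout_s=1, timeout_multiplier=2))
```

Separating the base timeout from the multiplier keeps the slow-filesystem adjustment in one place, and the terminal-state check distinguishes "too slow" from "already failed" in the test output.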