[kudu-CR] KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9897 ) Change subject: KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls .. KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls This test was flaky under stress+ASAN with three cases of failures: (1) most commonly, the assertion to check that an RPC had timed out in the queue would run too quickly, just before the actual metric was incremented. Changing it to ASSERT_EVENTUALLY got rid of most of the failures. (2) in a few cases, the 100ms sleeps were processed completely before the later RPC was processed, so the RPC that was supposed to time out in the queue actually had a chance to run. Switching to 1000ms sleeps got rid of the remaining failure case. (3) in a few other cases, the call that was supposed to time out (with a 1ms timeout) timed out before it was even sent to the server. This patch bumps the timeout to 5ms and retries that until it succeeds with a server-side timeout. Ran 5000x with the following command: ./build-support/dist_test.py loop -n 500 build/latest/bin/rpc_stub-test \ --gtest_filter=\*DontHandle\* --gtest_repeat=10 --stress-cpu-threads=8 and got no failures after this patch, whereas I'd usually get 5-10 failures prior. Change-Id: If8cfebb5e9160fb11586abdbba6fc1ae06574a2c Reviewed-on: http://gerrit.cloudera.org:8080/9897 Tested-by: Kudu Jenkins Reviewed-by: Adar Dembo--- M src/kudu/rpc/rpc_stub-test.cc 1 file changed, 23 insertions(+), 9 deletions(-) Approvals: Kudu Jenkins: Verified Adar Dembo: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/9897 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: If8cfebb5e9160fb11586abdbba6fc1ae06574a2c Gerrit-Change-Number: 9897 Gerrit-PatchSet: 4 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/9897 ) Change subject: KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls .. Patch Set 3: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/9897 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If8cfebb5e9160fb11586abdbba6fc1ae06574a2c Gerrit-Change-Number: 9897 Gerrit-PatchSet: 3 Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 17 Apr 2018 17:16:28 + Gerrit-HasComments: No
[kudu-CR] KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls
Hello Andrew Wong, Kudu Jenkins, Andrew Wong, Sailesh Mukil, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9897 to look at the new patch set (#3). Change subject: KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls .. KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls This test was flaky under stress+ASAN with three cases of failures: (1) most commonly, the assertion to check that an RPC had timed out in the queue would run too quickly, just before the actual metric was incremented. Changing it to ASSERT_EVENTUALLY got rid of most of the failures. (2) in a few cases, the 100ms sleeps were processed completely before the later RPC was processed, so the RPC that was supposed to time out in the queue actually had a chance to run. Switching to 1000ms sleeps got rid of the remaining failure case. (3) in a few other cases, the call that was supposed to time out (with a 1ms timeout) timed out before it was even sent to the server. This patch bumps the timeout to 5ms and retries that until it succeeds with a server-side timeout. Ran 5000x with the following command: ./build-support/dist_test.py loop -n 500 build/latest/bin/rpc_stub-test \ --gtest_filter=\*DontHandle\* --gtest_repeat=10 --stress-cpu-threads=8 and got no failures after this patch, whereas I'd usually get 5-10 failures prior. Change-Id: If8cfebb5e9160fb11586abdbba6fc1ae06574a2c --- M src/kudu/rpc/rpc_stub-test.cc 1 file changed, 23 insertions(+), 9 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/97/9897/3 -- To view, visit http://gerrit.cloudera.org:8080/9897 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: If8cfebb5e9160fb11586abdbba6fc1ae06574a2c Gerrit-Change-Number: 9897 Gerrit-PatchSet: 3 Gerrit-Owner: Todd LipconGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/9897 ) Change subject: KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/9897/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9897/1//COMMIT_MSG@9 PS1, Line 9: two > three? Done http://gerrit.cloudera.org:8080/#/c/9897/1/src/kudu/rpc/rpc_stub-test.cc File src/kudu/rpc/rpc_stub-test.cc: http://gerrit.cloudera.org:8080/#/c/9897/1/src/kudu/rpc/rpc_stub-test.cc@460 PS1, Line 460: 1000 > This value doesn't matter, does it? nope. commented. http://gerrit.cloudera.org:8080/#/c/9897/1/src/kudu/rpc/rpc_stub-test.cc@464 PS1, Line 464: // Since our timeout was short, it's possible in rare circumstances : // that we time out the RPC on the outbound queue, in which case : // we won't trigger the desired behavior here. > Hmm, this seems to be explaining the reasoning for the ASSERT_EVENTUALLY, i the comment is referring to the check for 'SENT' below. I clarified it. -- To view, visit http://gerrit.cloudera.org:8080/9897 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If8cfebb5e9160fb11586abdbba6fc1ae06574a2c Gerrit-Change-Number: 9897 Gerrit-PatchSet: 1 Gerrit-Owner: Todd LipconGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Mon, 16 Apr 2018 23:23:55 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls
Hello Andrew Wong, Kudu Jenkins, Andrew Wong, Sailesh Mukil, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9897 to look at the new patch set (#2). Change subject: KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls .. KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls This test was flaky under stress+ASAN with three cases of failures: (1) most commonly, the assertion to check that an RPC had timed out in the queue would run too quickly, just before the actual metric was incremented. Changing it to ASSERT_EVENTUALLY got rid of most of the failures. (2) in a few cases, the 100ms sleeps were processed completely before the later RPC was processed, so the RPC that was supposed to time out in the queue actually had a chance to run. Switching to 1000ms sleeps got rid of the remaining failure case. (3) in a few other cases, the call that was supposed to time out (with a 1ms timeout) timed out before it was even sent to the server. This patch bumps the timeout to 5ms and retries that until it succeeds with a server-side timeout. Ran 5000x with the following command: ./build-support/dist_test.py loop -n 500 build/latest/bin/rpc_stub-test \ --gtest_filter=\*DontHandle\* --gtest_repeat=10 --stress-cpu-threads=8 and got no failures after this patch, whereas I'd usually get 5-10 failures prior. Change-Id: If8cfebb5e9160fb11586abdbba6fc1ae06574a2c --- M src/kudu/rpc/rpc_stub-test.cc 1 file changed, 22 insertions(+), 9 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/97/9897/2 -- To view, visit http://gerrit.cloudera.org:8080/9897 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: If8cfebb5e9160fb11586abdbba6fc1ae06574a2c Gerrit-Change-Number: 9897 Gerrit-PatchSet: 2 Gerrit-Owner: Todd LipconGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls
Sailesh Mukil has posted comments on this change. ( http://gerrit.cloudera.org:8080/9897 ) Change subject: KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls .. Patch Set 1: Code-Review+1 (1 comment) http://gerrit.cloudera.org:8080/#/c/9897/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9897/1//COMMIT_MSG@17 PS1, Line 17: Switching to 1000ms sleeps > The total test time is still just 1 second on average so I didn't think it nvm, I didn't see that you were sending only 'n_worker_threads_' RPCs. If you sent more RPCs than service threads, then the test would take longer, but that's not the case. -- To view, visit http://gerrit.cloudera.org:8080/9897 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If8cfebb5e9160fb11586abdbba6fc1ae06574a2c Gerrit-Change-Number: 9897 Gerrit-PatchSet: 1 Gerrit-Owner: Todd LipconGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 03 Apr 2018 03:24:22 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9897 ) Change subject: KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/9897/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9897/1//COMMIT_MSG@9 PS1, Line 9: two three? http://gerrit.cloudera.org:8080/#/c/9897/1/src/kudu/rpc/rpc_stub-test.cc File src/kudu/rpc/rpc_stub-test.cc: http://gerrit.cloudera.org:8080/#/c/9897/1/src/kudu/rpc/rpc_stub-test.cc@460 PS1, Line 460: 1000 This value doesn't matter, does it? http://gerrit.cloudera.org:8080/#/c/9897/1/src/kudu/rpc/rpc_stub-test.cc@464 PS1, Line 464: // Since our timeout was short, it's possible in rare circumstances : // that we time out the RPC on the outbound queue, in which case : // we won't trigger the desired behavior here. Hmm, this seems to be explaining the reasoning for the ASSERT_EVENTUALLY, in which case it probably belongs up on L456? -- To view, visit http://gerrit.cloudera.org:8080/9897 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If8cfebb5e9160fb11586abdbba6fc1ae06574a2c Gerrit-Change-Number: 9897 Gerrit-PatchSet: 1 Gerrit-Owner: Todd LipconGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 03 Apr 2018 02:40:38 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/9897 ) Change subject: KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/9897/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9897/1//COMMIT_MSG@17 PS1, Line 17: Switching to 1000ms sleeps > Could this make it a slow test on an average case? And if yes, should we ad The total test time is still just 1 second on average so I didn't think it was necessary to make it "slow" -- To view, visit http://gerrit.cloudera.org:8080/9897 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If8cfebb5e9160fb11586abdbba6fc1ae06574a2c Gerrit-Change-Number: 9897 Gerrit-PatchSet: 1 Gerrit-Owner: Todd LipconGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Sailesh Mukil Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 03 Apr 2018 00:06:48 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls
Sailesh Mukil has posted comments on this change. ( http://gerrit.cloudera.org:8080/9897 ) Change subject: KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/9897/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9897/1//COMMIT_MSG@17 PS1, Line 17: Switching to 1000ms sleeps Could this make it a slow test on an average case? And if yes, should we add it to the list of slow tests? -- To view, visit http://gerrit.cloudera.org:8080/9897 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If8cfebb5e9160fb11586abdbba6fc1ae06574a2c Gerrit-Change-Number: 9897 Gerrit-PatchSet: 1 Gerrit-Owner: Todd LipconGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Sailesh Mukil Gerrit-Comment-Date: Mon, 02 Apr 2018 23:54:58 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls
Hello Andrew Wong, Sailesh Mukil, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/9897 to review the following change. Change subject: KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls .. KUDU-2394. Fix flaky RpcStubTest.TestDontHandleTimedOutCalls This test was flaky under stress+ASAN with two cases of failures: (1) most commonly, the assertion to check that an RPC had timed out in the queue would run too quickly, just before the actual metric was incremented. Changing it to ASSERT_EVENTUALLY got rid of most of the failures. (2) in a few cases, the 100ms sleeps were processed completely before the later RPC was processed, so the RPC that was supposed to time out in the queue actually had a chance to run. Switching to 1000ms sleeps got rid of the remaining failure case. (3) in a few other cases, the call that was supposed to time out (with a 1ms timeout) timed out before it was even sent to the server. This patch bumps the timeout to 5ms and retries that until it succeeds with a server-side timeout. Ran 5000x with the following command: ./build-support/dist_test.py loop -n 500 build/latest/bin/rpc_stub-test \ --gtest_filter=\*DontHandle\* --gtest_repeat=10 --stress-cpu-threads=8 and got no failures after this patch, whereas I'd usually get 5-10 failures prior. Change-Id: If8cfebb5e9160fb11586abdbba6fc1ae06574a2c --- M src/kudu/rpc/rpc_stub-test.cc 1 file changed, 19 insertions(+), 9 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/97/9897/1 -- To view, visit http://gerrit.cloudera.org:8080/9897 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: If8cfebb5e9160fb11586abdbba6fc1ae06574a2c Gerrit-Change-Number: 9897 Gerrit-PatchSet: 1 Gerrit-Owner: Todd LipconGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Sailesh Mukil