[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
Hello Sailesh Mukil, Dan Hecht, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9439 to look at the new patch set (#3). Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc() .. IMPALA-6565: Fix dropped status in DequeueDeferredRpc() The local variable 'status' was mistakenly defined twice in DequeueDeferredRpc(): one in the outer scope and one in the inner scope. This causes the error status of AddBatchWork() to be dropped when the inner scope ends. As a result, the error status from AddBatchWork() (e.g. MemLimitExceeded) will not be propagated back to the sender and the receiver will continue to operate with some missing data, leading to wrong query results. This change fixes the problem by removing the redefinition of the status local variable. It also adds some counters in the profile to make diagnostics of failed RPCs or missing EOS easier. This change also fixes a missing check for 'is_cancelled_' flag in TakeOverEarlySenders() before enqueuing an early sender into the deferred_rpc_ queue. The missing check should be benign because KrpcDataStreamRecvr::SenderQueue::TakeOverEarlySenders() is called from KrpcDataStreamMgr::CreateRecvr() in the fragment execution thread context so KrpcDataStreamRecvr::CancelStream() can only be called due to query cancellation while CreateRecvr() is still running. In which case, the sender should also be cancelled. That said, this should be fixed by checking for the 'is_cancelled_' flag before enqueuing an early sender into the 'deferred_rpc_' queue. Testing done: Stress test consistently reproduced the problem before. With this fix, no wrong results have been seen after 2 iterations of stress tests, which translates to about 2 queries being run. Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5 --- M be/src/runtime/krpc-data-stream-recvr.cc M be/src/runtime/krpc-data-stream-recvr.h M be/src/runtime/krpc-data-stream-sender.cc M be/src/runtime/krpc-data-stream-sender.h 4 files changed, 52 insertions(+), 11 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9439/3 -- To view, visit http://gerrit.cloudera.org:8080/9439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5 Gerrit-Change-Number: 9439 Gerrit-PatchSet: 3 Gerrit-Owner: Michael HoGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil
[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
Sailesh Mukil has posted comments on this change. ( http://gerrit.cloudera.org:8080/9439 ) Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc() .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/9439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5 Gerrit-Change-Number: 9439 Gerrit-PatchSet: 2 Gerrit-Owner: Michael HoGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Comment-Date: Sat, 24 Feb 2018 19:13:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/9439 ) Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc() .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/9439/2/be/src/runtime/krpc-data-stream-recvr.cc File be/src/runtime/krpc-data-stream-recvr.cc: http://gerrit.cloudera.org:8080/#/c/9439/2/be/src/runtime/krpc-data-stream-recvr.cc@346 PS2, Line 346: // TODO: Add timers for time spent in this function and queue time in 'batch_queue_'. > is there a counter to determine how many batches made it to this point? If Added a new counter called NumBatchesArrived. So a batch goes from "Arrived" (->"Deferred") -> "Received" -> "Enqueued" http://gerrit.cloudera.org:8080/#/c/9439/2/be/src/runtime/krpc-data-stream-recvr.cc@446 PS2, Line 446: lock_guard l(lock_); There seems to be a missing check for the flag "is_cancelled_" here. Fixed in PS3. This may be benign though as this function is called from CreateRecvr() from the main execution thread so it really cannot call Cancel() until after returning from CreateRecvr(). If the receiver is cancelled due to query cancellation, the sender should also get a cancellation too so it won't be waiting for a reply. That said, it's still nice to fix this to maintain the invariant that the deferred_rpcs_ queue is empty in Close(). -- To view, visit http://gerrit.cloudera.org:8080/9439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5 Gerrit-Change-Number: 9439 Gerrit-PatchSet: 2 Gerrit-Owner: Michael HoGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Comment-Date: Sat, 24 Feb 2018 18:41:04 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/9439 ) Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc() .. Patch Set 2: Code-Review+2 (1 comment) Please let Sailesh finish his review too, of course. http://gerrit.cloudera.org:8080/#/c/9439/2/be/src/runtime/krpc-data-stream-recvr.cc File be/src/runtime/krpc-data-stream-recvr.cc: http://gerrit.cloudera.org:8080/#/c/9439/2/be/src/runtime/krpc-data-stream-recvr.cc@346 PS2, Line 346: // TODO: Add timers for time spent in this function and queue time in 'batch_queue_'. is there a counter to determine how many batches made it to this point? If not, do you think that would be helpful, given the deferred batch path is non-trivial, and that there are error paths that can cause us to not get to AddBatchWork()? -- To view, visit http://gerrit.cloudera.org:8080/9439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5 Gerrit-Change-Number: 9439 Gerrit-PatchSet: 2 Gerrit-Owner: Michael HoGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Comment-Date: Sat, 24 Feb 2018 05:04:09 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/9439 ) Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc() .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/9439/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9439/1//COMMIT_MSG@14 PS1, Line 14: aand > nit: typo Done http://gerrit.cloudera.org:8080/#/c/9439/1/be/src/runtime/krpc-data-stream-sender.cc File be/src/runtime/krpc-data-stream-sender.cc: http://gerrit.cloudera.org:8080/#/c/9439/1/be/src/runtime/krpc-data-stream-sender.cc@309 PS1, Line 309: if (UNLIKELY(!status.ok())) COUNTER_ADD(parent_->rpc_failure_counter_, 1); > Does this mean a RPC failed? Couldn't a bad RPC response also end up increm Yes, this is the sum of RPC failure and number of times remote replies with a non-retryable error. Comments updated. This helps identify if the error from a receiver propagates correctly to the sender. http://gerrit.cloudera.org:8080/#/c/9439/1/be/src/runtime/krpc-data-stream-sender.cc@627 PS1, Line 627: eos_sent_counter_ > What's the use case for counting this? In the good case this should be equa To pinpoint senders which didn't send EOS, leading to hung receivers. -- To view, visit http://gerrit.cloudera.org:8080/9439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5 Gerrit-Change-Number: 9439 Gerrit-PatchSet: 1 Gerrit-Owner: Michael HoGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Comment-Date: Sat, 24 Feb 2018 02:54:13 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
Hello Sailesh Mukil, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9439 to look at the new patch set (#2). Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc() .. IMPALA-6565: Fix dropped status in DequeueDeferredRpc() The local variable 'status' was mistakenly defined twice in DequeueDeferredRpc(): one in the outer scope and one in the inner scope. This causes the error status of AddBatchWork() to be dropped when the inner scope ends. As a result, the error status from AddBatchWork() (e.g. MemLimitExceeded) will not be propagated back to the sender and the receiver will continue to operate with some missing data, leading to wrong query results. This change fixes the problem by removing the redefinition of the status local variable. It also adds some counters in the profile to make diagnostics of failed RPCs or missing EOS easier. Testing done: Stress test consistently reproduced the problem before. With this fix, no wrong results have been seen after 2 iterations of stress tests, which translates to about 2 queries being run. Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5 --- M be/src/runtime/krpc-data-stream-recvr.cc M be/src/runtime/krpc-data-stream-recvr.h M be/src/runtime/krpc-data-stream-sender.cc M be/src/runtime/krpc-data-stream-sender.h 4 files changed, 32 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9439/2 -- To view, visit http://gerrit.cloudera.org:8080/9439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5 Gerrit-Change-Number: 9439 Gerrit-PatchSet: 2 Gerrit-Owner: Michael HoGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil
[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
Sailesh Mukil has posted comments on this change. ( http://gerrit.cloudera.org:8080/9439 ) Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc() .. Patch Set 1: (3 comments) http://gerrit.cloudera.org:8080/#/c/9439/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/9439/1//COMMIT_MSG@14 PS1, Line 14: aand nit: typo http://gerrit.cloudera.org:8080/#/c/9439/1/be/src/runtime/krpc-data-stream-sender.cc File be/src/runtime/krpc-data-stream-sender.cc: http://gerrit.cloudera.org:8080/#/c/9439/1/be/src/runtime/krpc-data-stream-sender.cc@309 PS1, Line 309: if (UNLIKELY(!status.ok())) COUNTER_ADD(parent_->rpc_failure_counter_, 1); Does this mean a RPC failed? Couldn't a bad RPC response also end up incrementing this counter? Eg: L501 doesn't mean that the RPC itself failed, but that the RPC hit some error on the remote side. Is this to include both? If yes, let's make that clear, else it might be confusing as to what exactly this counter is counting. http://gerrit.cloudera.org:8080/#/c/9439/1/be/src/runtime/krpc-data-stream-sender.cc@627 PS1, Line 627: eos_sent_counter_ What's the use case for counting this? In the good case this should be equal to number of channels right? Is it to compare with the number of EOS RPCs received on the receiver side? -- To view, visit http://gerrit.cloudera.org:8080/9439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5 Gerrit-Change-Number: 9439 Gerrit-PatchSet: 1 Gerrit-Owner: Michael HoGerrit-Reviewer: Michael Ho Gerrit-Reviewer: Sailesh Mukil Gerrit-Comment-Date: Sat, 24 Feb 2018 02:17:54 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/9439 ) Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc() .. Patch Set 1: Pushing this fix out to unbreak the build for now. Will follow up with a separate change to implement a targeted test. -- To view, visit http://gerrit.cloudera.org:8080/9439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5 Gerrit-Change-Number: 9439 Gerrit-PatchSet: 1 Gerrit-Owner: Michael HoGerrit-Reviewer: Michael Ho Gerrit-Comment-Date: Sat, 24 Feb 2018 01:21:28 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
Michael Ho has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9439 Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc() .. IMPALA-6565: Fix dropped status in DequeueDeferredRpc() The local variable 'status' was mistakenly defined twice in DequeueDeferredRpc(): one in the outer scope and one in the inner scope. This causes the error status of AddBatchWork() to be dropped when the inner scope ends. As a result, the error status from AddBatchWork() (e.g. MemLimitExceeded) will not be propagated back to the sender aand the receiver will continue to operate with some missing data, leading to wrong query results. This change fixes the problem by removing the redefinition of the status local variable. It also adds some counters in the profile to make diagnostics of failed RPCs or missing EOS easier. Testing done: Stress test consistently reproduced the problem before. With this fix, no wrong results have been seen after 2 iterations of stress tests, which translates to about 2 queries being run. Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5 --- M be/src/runtime/krpc-data-stream-recvr.cc M be/src/runtime/krpc-data-stream-recvr.h M be/src/runtime/krpc-data-stream-sender.cc M be/src/runtime/krpc-data-stream-sender.h 4 files changed, 32 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9439/1 -- To view, visit http://gerrit.cloudera.org:8080/9439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5 Gerrit-Change-Number: 9439 Gerrit-PatchSet: 1 Gerrit-Owner: Michael Ho