[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()

2018-02-24 Thread Michael Ho (Code Review)
Hello Sailesh Mukil, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9439

to look at the new patch set (#3).

Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
..

IMPALA-6565: Fix dropped status in DequeueDeferredRpc()

The local variable 'status' was mistakenly defined twice in
DequeueDeferredRpc(): one in the outer scope and one in the
inner scope. This causes the error status of AddBatchWork()
to be dropped when the inner scope ends. As a result, the error
status from AddBatchWork() (e.g. MemLimitExceeded) will not be
propagated back to the sender and the receiver will continue
to operate with some missing data, leading to wrong query
results.

This change fixes the problem by removing the redefinition of
the status local variable. It also adds some counters in the
profile to make diagnostics of failed RPCs or missing EOS easier.

This change also fixes a missing check for 'is_cancelled_' flag
in TakeOverEarlySenders() before enqueuing an early sender into
the deferred_rpc_ queue. The missing check should be benign
because KrpcDataStreamRecvr::SenderQueue::TakeOverEarlySenders()
is called from KrpcDataStreamMgr::CreateRecvr() in the fragment
execution thread context so KrpcDataStreamRecvr::CancelStream()
can only be called due to query cancellation while CreateRecvr()
is still running. In which case, the sender should also be cancelled.
That said, this should be fixed by checking for the 'is_cancelled_'
flag before enqueuing an early sender into the 'deferred_rpc_' queue.

Testing done: Stress test consistently reproduced the problem before.
With this fix, no wrong results have been seen after 2 iterations
of stress tests, which translates to about 2 queries being run.

Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5
---
M be/src/runtime/krpc-data-stream-recvr.cc
M be/src/runtime/krpc-data-stream-recvr.h
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/runtime/krpc-data-stream-sender.h
4 files changed, 52 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9439/3
--
To view, visit http://gerrit.cloudera.org:8080/9439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5
Gerrit-Change-Number: 9439
Gerrit-PatchSet: 3
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 


[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()

2018-02-24 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9439 )

Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
..


Patch Set 2: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/9439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5
Gerrit-Change-Number: 9439
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Comment-Date: Sat, 24 Feb 2018 19:13:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()

2018-02-24 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9439 )

Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9439/2/be/src/runtime/krpc-data-stream-recvr.cc
File be/src/runtime/krpc-data-stream-recvr.cc:

http://gerrit.cloudera.org:8080/#/c/9439/2/be/src/runtime/krpc-data-stream-recvr.cc@346
PS2, Line 346:   // TODO: Add timers for time spent in this function and queue 
time in 'batch_queue_'.
> is there a counter to determine how many batches made it to this point? If 
Added a new counter called NumBatchesArrived. So a batch goes from "Arrived" 
(->"Deferred") -> "Received" -> "Enqueued"


http://gerrit.cloudera.org:8080/#/c/9439/2/be/src/runtime/krpc-data-stream-recvr.cc@446
PS2, Line 446: lock_guard l(lock_);
There seems to be a missing check for the flag "is_cancelled_" here. Fixed in 
PS3. This may be benign though as this function is called from CreateRecvr() 
from the main execution thread so it really cannot call Cancel() until after 
returning from CreateRecvr().

If the receiver is cancelled due to query cancellation, the sender should also 
get a cancellation too so it won't be waiting for a reply. That said, it's 
still nice to fix this to maintain the invariant that the deferred_rpcs_ queue 
is empty in Close().



--
To view, visit http://gerrit.cloudera.org:8080/9439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5
Gerrit-Change-Number: 9439
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Comment-Date: Sat, 24 Feb 2018 18:41:04 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()

2018-02-23 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9439 )

Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
..


Patch Set 2: Code-Review+2

(1 comment)

Please let Sailesh finish his review too, of course.

http://gerrit.cloudera.org:8080/#/c/9439/2/be/src/runtime/krpc-data-stream-recvr.cc
File be/src/runtime/krpc-data-stream-recvr.cc:

http://gerrit.cloudera.org:8080/#/c/9439/2/be/src/runtime/krpc-data-stream-recvr.cc@346
PS2, Line 346:   // TODO: Add timers for time spent in this function and queue 
time in 'batch_queue_'.
is there a counter to determine how many batches made it to this point? If not, 
do you think that would be helpful, given the deferred batch path is 
non-trivial, and that there are error paths that can cause us to not get to 
AddBatchWork()?



--
To view, visit http://gerrit.cloudera.org:8080/9439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5
Gerrit-Change-Number: 9439
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Comment-Date: Sat, 24 Feb 2018 05:04:09 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()

2018-02-23 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9439 )

Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
..


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/9439/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9439/1//COMMIT_MSG@14
PS1, Line 14: aand
> nit: typo
Done


http://gerrit.cloudera.org:8080/#/c/9439/1/be/src/runtime/krpc-data-stream-sender.cc
File be/src/runtime/krpc-data-stream-sender.cc:

http://gerrit.cloudera.org:8080/#/c/9439/1/be/src/runtime/krpc-data-stream-sender.cc@309
PS1, Line 309: if (UNLIKELY(!status.ok())) 
COUNTER_ADD(parent_->rpc_failure_counter_, 1);
> Does this mean a RPC failed? Couldn't a bad RPC response also end up increm
Yes, this is the sum of RPC failure and number of times remote replies with a 
non-retryable error. Comments updated. This helps identify if the error from a 
receiver propagates correctly to the sender.


http://gerrit.cloudera.org:8080/#/c/9439/1/be/src/runtime/krpc-data-stream-sender.cc@627
PS1, Line 627: eos_sent_counter_
> What's the use case for counting this? In the good case this should be equa
To pinpoint senders which didn't send EOS, leading to hung receivers.



--
To view, visit http://gerrit.cloudera.org:8080/9439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5
Gerrit-Change-Number: 9439
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Comment-Date: Sat, 24 Feb 2018 02:54:13 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()

2018-02-23 Thread Michael Ho (Code Review)
Hello Sailesh Mukil,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9439

to look at the new patch set (#2).

Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
..

IMPALA-6565: Fix dropped status in DequeueDeferredRpc()

The local variable 'status' was mistakenly defined twice in
DequeueDeferredRpc(): one in the outer scope and one in the
inner scope. This causes the error status of AddBatchWork()
to be dropped when the inner scope ends. As a result, the error
status from AddBatchWork() (e.g. MemLimitExceeded) will not be
propagated back to the sender and the receiver will continue
to operate with some missing data, leading to wrong query
results.

This change fixes the problem by removing the redefinition of
the status local variable. It also adds some counters in the
profile to make diagnostics of failed RPCs or missing EOS easier.

Testing done: Stress test consistently reproduced the problem before.
With this fix, no wrong results have been seen after 2 iterations
of stress tests, which translates to about 2 queries being run.

Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5
---
M be/src/runtime/krpc-data-stream-recvr.cc
M be/src/runtime/krpc-data-stream-recvr.h
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/runtime/krpc-data-stream-sender.h
4 files changed, 32 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9439/2
--
To view, visit http://gerrit.cloudera.org:8080/9439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5
Gerrit-Change-Number: 9439
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 


[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()

2018-02-23 Thread Sailesh Mukil (Code Review)
Sailesh Mukil has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9439 )

Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
..


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/9439/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9439/1//COMMIT_MSG@14
PS1, Line 14: aand
nit: typo


http://gerrit.cloudera.org:8080/#/c/9439/1/be/src/runtime/krpc-data-stream-sender.cc
File be/src/runtime/krpc-data-stream-sender.cc:

http://gerrit.cloudera.org:8080/#/c/9439/1/be/src/runtime/krpc-data-stream-sender.cc@309
PS1, Line 309: if (UNLIKELY(!status.ok())) 
COUNTER_ADD(parent_->rpc_failure_counter_, 1);
Does this mean a RPC failed? Couldn't a bad RPC response also end up 
incrementing this counter?

Eg: L501 doesn't mean that the RPC itself failed, but that the RPC hit some 
error on the remote side.

Is this to include both? If yes, let's make that clear, else it might be 
confusing as to what exactly this counter is counting.


http://gerrit.cloudera.org:8080/#/c/9439/1/be/src/runtime/krpc-data-stream-sender.cc@627
PS1, Line 627: eos_sent_counter_
What's the use case for counting this? In the good case this should be equal to 
number of channels right?

Is it to compare with the number of EOS RPCs received on the receiver side?



--
To view, visit http://gerrit.cloudera.org:8080/9439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5
Gerrit-Change-Number: 9439
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sailesh Mukil 
Gerrit-Comment-Date: Sat, 24 Feb 2018 02:17:54 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()

2018-02-23 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9439 )

Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
..


Patch Set 1:

Pushing this fix out to unbreak the build for now. Will follow up with a 
separate change to implement a targeted test.


--
To view, visit http://gerrit.cloudera.org:8080/9439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5
Gerrit-Change-Number: 9439
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Michael Ho 
Gerrit-Comment-Date: Sat, 24 Feb 2018 01:21:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6565: Fix dropped status in DequeueDeferredRpc()

2018-02-23 Thread Michael Ho (Code Review)
Michael Ho has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/9439


Change subject: IMPALA-6565: Fix dropped status in DequeueDeferredRpc()
..

IMPALA-6565: Fix dropped status in DequeueDeferredRpc()

The local variable 'status' was mistakenly defined twice in
DequeueDeferredRpc(): one in the outer scope and one in the
inner scope. This causes the error status of AddBatchWork()
to be dropped when the inner scope ends. As a result, the error
status from AddBatchWork() (e.g. MemLimitExceeded) will not be
propagated back to the sender aand the receiver will continue
to operate with some missing data, leading to wrong query
results.

This change fixes the problem by removing the redefinition of
the status local variable. It also adds some counters in the
profile to make diagnostics of failed RPCs or missing EOS easier.

Testing done: Stress test consistently reproduced the problem before.
With this fix, no wrong results have been seen after 2 iterations
of stress tests, which translates to about 2 queries being run.

Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5
---
M be/src/runtime/krpc-data-stream-recvr.cc
M be/src/runtime/krpc-data-stream-recvr.h
M be/src/runtime/krpc-data-stream-sender.cc
M be/src/runtime/krpc-data-stream-sender.h
4 files changed, 32 insertions(+), 8 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9439/1
--
To view, visit http://gerrit.cloudera.org:8080/9439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I6b2985a47021ebd4a970861040e7474aca7941b5
Gerrit-Change-Number: 9439
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho