[jira] [Comment Edited] (HDFS-14146) Handle exception from internalQueueCall

2018-12-14 Thread Chao Sun (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721675#comment-16721675
 ] 

Chao Sun edited comment on HDFS-14146 at 12/14/18 6:38 PM:
---

Attached patch v3 to address the checkstyle issues and remove the minor 
duplication from HDFS-14138.

The last Jenkins run looks good; the two failed tests do not seem to be 
related.


was (Author: csun):
Attached patch v3 to address the checkstyle issues and remove the minor 
duplication from HDFS-14138.

> Handle exception from internalQueueCall
> ---
>
> Key: HDFS-14146
> URL: https://issues.apache.org/jira/browse/HDFS-14146
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ipc
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Critical
> Attachments: HDFS-14146-HDFS-12943.000.patch, 
> HDFS-14146-HDFS-12943.001.patch, HDFS-14146-HDFS-12943.002.patch, 
> HDFS-14146-HDFS-12943.003.patch
>
>
> When we re-queue an RPC call, {{internalQueueCall}} can throw exceptions 
> (e.g., for RPC backoff) that are then swallowed. This causes the RPC to be 
> silently discarded without any response to the client, which is not good.






[jira] [Comment Edited] (HDFS-14146) Handle exception from internalQueueCall

2018-12-12 Thread Erik Krogen (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719587#comment-16719587
 ] 

Erik Krogen edited comment on HDFS-14146 at 12/13/18 12:28 AM:
---

{quote}
I couldn't come up with an example (it would be great if you have one) that 
causes all the handlers to get stuck yet, but I can certainly see that some 
handlers get blocked when putting the call back, which is undesirable.
{quote}
Let's imagine that the call queue currently is completely full (with many 
requests which have a state ID in the future) and the listen queue is backing 
up. All of your handler threads start to grab items from the queue; for each 
item taken from the queue, the readers add new items, since the listen queue 
has backed up. By the time any of the handler threads try to call 
{{callQueue.put()}}, the call queue is already completely full because the 
readers have been inserting new items. I think the scenario that _all_ of the 
handlers become blocked is pretty unlikely since there are many more handler 
threads than readers, and the readers would have to consistently win races 
against the handlers to add items, but still, a NameNode deadlock is no joke :)
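To make that race concrete, here is a toy, JDK-only sketch (the 
{{RequeueBlockingDemo}} class and its strings are invented for illustration; 
this is not Hadoop code): a small bounded queue stands in for the 
{{callQueue}}, a daemon thread plays a reader that refills it as soon as a 
slot opens, and the main thread plays a handler that takes a call and then 
tries to re-queue it.
{code}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RequeueBlockingDemo {
  public static void main(String[] args) throws InterruptedException {
    // Small bounded queue standing in for the call queue.
    BlockingQueue<String> callQueue = new ArrayBlockingQueue<>(4);
    for (int i = 0; i < 4; i++) {
      callQueue.put("call-" + i);           // the queue starts out full
    }

    // "Reader": refills the queue the moment a slot opens, simulating a
    // backed-up listen queue feeding new requests.
    Thread reader = new Thread(() -> {
      try {
        while (true) {
          callQueue.put("new-call");
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    reader.setDaemon(true);
    reader.start();

    // "Handler": takes a call, decides it must be re-queued, and tries to
    // put it back. Give the reader a moment to win the race first.
    String call = callQueue.take();
    Thread.sleep(100);
    boolean requeued = callQueue.offer(call);  // offer() just reports fullness
    System.out.println("re-queue succeeded? " + requeued
        + " (a blocking put() here would hang until a slot opens)");
  }
}
{code}
With enough handler threads hitting this path at once, a blocking {{put()}} in 
place of the {{offer()}} is what could leave them all stuck.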


was (Author: xkrogen):
{quote}
I couldn't come up with an example (it would be great if you have one) that 
causes all the handlers to get stuck yet, but I can certainly see that some 
handlers get blocked when putting the call back, which is undesirable.
{quote}
Let's imagine that the call queue currently is completely full (with many 
requests which have a state ID in the future) and the listen queue is backing 
up. All of your handler threads start to grab items from the queue; for each 
item taken from the queue, the readers add new items, since the listen queue 
has backed up. By the time any of the handler threads try to call 
{{callQueue.put()}}, the call queue is already completely full because the 
readers have been inserting new items. I think the scenario is pretty unlikely 
since there are many more handler threads than readers, and the readers would 
have to consistently win races against the handlers to add items, but still, a 
NameNode deadlock is no joke :)

> Handle exception from internalQueueCall
> ---
>
> Key: HDFS-14146
> URL: https://issues.apache.org/jira/browse/HDFS-14146
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ipc
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Critical
> Attachments: HDFS-14146-HDFS-12943.000.patch
>
>
> When we re-queue an RPC call, {{internalQueueCall}} can throw exceptions 
> (e.g., for RPC backoff) that are then swallowed. This causes the RPC to be 
> silently discarded without any response to the client, which is not good.






[jira] [Comment Edited] (HDFS-14146) Handle exception from internalQueueCall

2018-12-12 Thread Erik Krogen (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719587#comment-16719587
 ] 

Erik Krogen edited comment on HDFS-14146 at 12/13/18 12:28 AM:
---

{quote}
I couldn't come up with an example (it would be great if you have one) that 
causes all the handlers to get stuck yet, but I can certainly see that some 
handlers get blocked when putting the call back, which is undesirable.
{quote}
Let's imagine that the call queue currently is completely full (with many 
requests which have a state ID in the future) and the listen queue is backing 
up. All of your handler threads start to grab items from the queue; for each 
item taken from the queue, the readers add new items, since the listen queue 
has backed up. By the time any of the handler threads try to call 
{{callQueue.put()}}, the call queue is already completely full because the 
readers have been inserting new items. I think the scenario is pretty unlikely 
since there are many more handler threads than readers, and the readers would 
have to consistently win races against the handlers to add items, but still, a 
NameNode deadlock is no joke :)


was (Author: xkrogen):
{quote}
I couldn't come up with an example (it would be great if you have one) that 
causes all the handlers to get stuck yet, but I can certainly see that some 
handlers get blocked when putting the call back, which is undesirable.
{quote}
Let's imagine that the call queue currently is completely full (with many 
requests which have a state ID in the future) and the listen queue is backing 
up. All of your handler threads start to grab items from the queue; for each 
item taken from the queue, the readers add new items, since the listen queue 
has backed up. By the time any of the handler threads try to call 
{{callQueue.put()}}, the call queue is already completely full because the 
readers have been inserting new items. I think the scenario is pretty unlikely 
since there are many more handler threads than readers, and the readers would 
have to consistently win races against the handlers to add items, but still, a 
NameNode deadlock is no joke :)

{quote}
Yes I compared this code with Call#doResponse - they are doing the same thing.
{quote}
Did you see the edit to my last comment? I believe that the {{RpcStatusProto}} 
returned may be different, which can have implications for client behavior.

> Handle exception from internalQueueCall
> ---
>
> Key: HDFS-14146
> URL: https://issues.apache.org/jira/browse/HDFS-14146
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ipc
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Critical
> Attachments: HDFS-14146-HDFS-12943.000.patch
>
>
> When we re-queue an RPC call, {{internalQueueCall}} can throw exceptions 
> (e.g., for RPC backoff) that are then swallowed. This causes the RPC to be 
> silently discarded without any response to the client, which is not good.






[jira] [Comment Edited] (HDFS-14146) Handle exception from internalQueueCall

2018-12-12 Thread Erik Krogen (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719557#comment-16719557
 ] 

Erik Krogen edited comment on HDFS-14146 at 12/13/18 12:05 AM:
---

Hey [~csun], this is a good find. I actually think this problem is more serious 
than just the issue you raised.

Normally, _reader_ threads attempt to offer a request to the {{callQueue}}. If 
the {{callQueue}} is full, they will block, creating a natural backoff by 
pushing back on the listen queue, which will eventually cause clients to be 
unable to connect. However, now we have the _handler_ threads attempting to 
offer a request to the {{callQueue}} -- but the handler threads are the same 
ones that drain the queue. If the queue became full, and all handler threads 
were waiting to attempt to push a request into the {{callQueue}}, this could 
result in deadlock (all handlers are waiting on the queue because it is full, 
and no handler will drain the queue).
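For reference, the {{java.util.concurrent.BlockingQueue}} contract on a full 
queue, sketched below with a plain {{ArrayBlockingQueue}} rather than the 
actual {{CallQueueManager}}: {{put()}} blocks the calling thread, {{add()}} 
fails fast with an exception, and {{offer()}} returns {{false}}.
{code}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class FullQueueSemantics {
  public static void main(String[] args) {
    BlockingQueue<String> callQueue = new ArrayBlockingQueue<>(1);
    callQueue.add("existing-call");          // the queue is now full

    // add(): never blocks; throws when there is no space, so the caller
    // can turn the overflow into an error response instead of hanging.
    try {
      callQueue.add("requeued-call");
    } catch (IllegalStateException overflow) {
      System.out.println("add() rejected the call: " + overflow);
    }

    // offer(): never blocks; signals failure with a boolean instead.
    System.out.println("offer() accepted? " + callQueue.offer("requeued-call"));

    // put() would block right here until some other thread drained the
    // queue -- which is exactly the handler-thread hazard described above.
  }
}
{code}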

I think that instead of {{callQueue.put()}} we need to use {{callQueue.add()}}, 
which will never block, and instead throw an overflow exception if the queue is 
full. We should also make sure the exception handling is the same as when it 
happens in the reader thread, which currently looks like this (taken from 
{{processOneRpc}}):
{code}
    ...
    } else {
      processRpcRequest(header, buffer);
    }
  } catch (RpcServerException rse) {
    // inform client of error, but do not rethrow else non-fatal
    // exceptions will close connection!
    if (LOG.isDebugEnabled()) {
      LOG.debug(Thread.currentThread().getName() +
          ": processOneRpc from client " + this +
          " threw exception [" + rse + "]");
    }
    // use the wrapped exception if there is one.
    Throwable t = (rse.getCause() != null) ? rse.getCause() : rse;
    final RpcCall call = new RpcCall(this, callId, retry);
    setupResponse(call,
        rse.getRpcStatusProto(), rse.getRpcErrorCodeProto(), null,
        t.getClass().getName(), t.getMessage());
    sendResponse(call);
  }
{code}
It's not clear to me if the logic you added will match this; can you confirm?
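As a rough, stand-alone sketch of that shape (the names {{RequeueSketch}}, 
{{QueuedCall}}, {{requeueCall}}, and {{sendErrorResponse}} are stand-ins, not 
the real {{Server}} internals): re-queue with a non-blocking {{add()}}, and on 
overflow answer the client instead of swallowing the failure, mirroring how 
the reader path above turns an {{RpcServerException}} into a response.
{code}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class RequeueSketch {
  static final class QueuedCall {
    final int callId;
    QueuedCall(int callId) { this.callId = callId; }
  }

  // Small bounded queue standing in for the real callQueue.
  private final BlockingQueue<QueuedCall> callQueue = new ArrayBlockingQueue<>(2);

  void requeueCall(QueuedCall call) {
    try {
      callQueue.add(call);          // fail fast instead of blocking the handler
    } catch (IllegalStateException overflow) {
      // Don't swallow the failure: respond to the client, as the reader path does.
      sendErrorResponse(call, overflow);
    }
  }

  private void sendErrorResponse(QueuedCall call, Throwable cause) {
    System.out.println("call " + call.callId + " rejected: " + cause);
  }

  public static void main(String[] args) {
    RequeueSketch server = new RequeueSketch();
    server.requeueCall(new QueuedCall(1));
    server.requeueCall(new QueuedCall(2));
    server.requeueCall(new QueuedCall(3)); // overflows -> takes the error-response path
  }
}
{code}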

Edit:
I took a look and the error handling from the handler looks like:
{code}
void doResponse(Throwable t) throws IOException {
  RpcCall call = this;
  if (t != null) {
    // clone the call to prevent a race with another thread stomping
    // on the response while being sent.  the original call is
    // effectively discarded since the wait count won't hit zero
    call = new RpcCall(this);
    setupResponse(call,
        RpcStatusProto.FATAL, RpcErrorCodeProto.ERROR_RPC_SERVER,
        null, t.getClass().getName(), StringUtils.stringifyException(t));
  } else {
    setupResponse(call, call.responseParams.returnStatus,
        call.responseParams.detailedErr, call.rv,
        call.responseParams.errorClass,
        call.responseParams.error);
  }
  connection.sendResponse(call);
}
{code}
So here the status is always set to {{FATAL}}, but a 
{{CallQueueOverflowException}} can be either {{FATAL}} or {{ERROR}} (depending 
on whether it is a disconnect or a keepalive backoff) -- IIUC this is important 
for telling the client how it should behave. We may need a way to indicate to 
{{doResponse()}} what the {{RpcStatusProto}} should be, using the 
{{RpcServerException}} thrown by the queue offer.
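One possible shape for that, again as a stand-alone sketch ({{RpcStatus}} and 
{{ServerBusyException}} below are stand-ins for {{RpcStatusProto}} and 
{{RpcServerException}}, not the real types): let the thrown exception carry 
the status it wants the client to see, and have the response path ask the 
exception instead of hard-coding {{FATAL}}.
{code}
class ResponseStatusSketch {
  enum RpcStatus { SUCCESS, ERROR, FATAL }

  // Stand-in for an exception that, like RpcServerException, knows which
  // status the client should see (ERROR keeps the connection alive,
  // FATAL tells the client to disconnect).
  static class ServerBusyException extends RuntimeException {
    final RpcStatus status;
    ServerBusyException(String message, RpcStatus status) {
      super(message);
      this.status = status;
    }
  }

  // What doResponse()-like code could consult instead of always using FATAL.
  static RpcStatus statusFor(Throwable t) {
    if (t instanceof ServerBusyException) {
      return ((ServerBusyException) t).status;   // honor the exception's own status
    }
    return RpcStatus.FATAL;                      // unknown failures stay fatal
  }

  public static void main(String[] args) {
    System.out.println(statusFor(new ServerBusyException("backoff", RpcStatus.ERROR)));
    System.out.println(statusFor(new RuntimeException("unexpected failure")));
  }
}
{code}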


was (Author: xkrogen):
Hey [~csun], this is a good find. I actually think this problem is more serious 
than just the issue you raised.

Normally, _reader_ threads attempt to offer a request to the {{callQueue}}. If 
the {{callQueue}} is full, they will block, creating a natural backoff by 
pushing back on the listen queue, which will eventually cause clients to be 
unable to connect. However, now we have the _handler_ threads attempting to 
offer a request to the {{callQueue}} -- but the handler threads are the same 
ones that drain the queue. If the queue became full, and all handler threads 
were waiting to attempt to push a request into the {{callQueue}}, this could 
result in deadlock (all handlers are waiting on the queue because it is full, 
and no handler will drain the queue).

I think that instead of {{callQueue.put()}} we need to use {{callQueue.add()}}, 
which will never block, and instead throw an overflow exception if the queue is 
full. We should also make sure the exception handling is the same as when it 
happens in the reader thread, which currently looks like this (taken from 
{{processOneRpc}}):
{code}
    ...
    } else {
      processRpcRequest(header, buffer);
    }
  } catch (RpcServerException rse) {
    // inform client of error, but do not rethrow else non-fatal
    // exceptions will close connection!
    if (LOG.isDebugEnabled()) {
      LOG.debug(Thread.currentThread().getName() +
          ": processOneRpc from client " + this +
          " threw exception [" + rse + "]");
    }
    // use the wrapped exception if there is one.
    Throwable t = (rse.getCause() != null) ? rse.getCause() : rse;
    final RpcCall call = new RpcCall(this, callId, retry);
    setupResponse(call,
        rse.getRpcStatusProto(), rse.getRpcErrorCodeProto(), null,
        t.getClass().getName(), t.getMessage());
    sendResponse(call);
  }
{code}
It's not clear to me if the logic you added will match this; can you confirm?