[jira] [Comment Edited] (HIVE-15671) RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout

2017-02-13 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864231#comment-15864231
 ] 

Xuefu Zhang edited comment on HIVE-15671 at 2/13/17 7:22 PM:
-

Committed to master. Thanks to Marcelo, Jimmy, and Rui for the review.
Created HIVE-15893 as a follow-up.



> RPCServer.registerClient() erroneously uses server/client handshake timeout 
> for connection timeout
> --
>
> Key: HIVE-15671
> URL: https://issues.apache.org/jira/browse/HIVE-15671
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 1.1.0
>Reporter: Xuefu Zhang
>Assignee: Xuefu Zhang
> Fix For: 2.2.0
>
> Attachments: HIVE-15671.1.patch, HIVE-15671.patch
>
>
> {code}
>   /**
>* Tells the RPC server to expect a connection from a new client.
>* ...
>*/
>   public Future<Rpc> registerClient(final String clientId, String secret,
>   RpcDispatcher serverDispatcher) {
> return registerClient(clientId, secret, serverDispatcher, 
> config.getServerConnectTimeoutMs());
>   }
> {code}
> {{config.getServerConnectTimeoutMs()}} returns the value of 
> *hive.spark.client.server.connect.timeout*, which is meant to be the timeout for the 
> handshake between the Hive client and the remote Spark driver. Instead, the timeout 
> should be *hive.spark.client.connect.timeout*, which is the timeout for the 
> remote Spark driver to connect back to the Hive client.
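For illustration, a minimal sketch of the change the description suggests is shown below. It is only a sketch of the idea, not necessarily the patch that was eventually committed (which followed [~vanzin]'s later suggestion), and it assumes the {{getConnectTimeoutMs()}} accessor mentioned later in this thread on the same config object.
{code}
  /**
   * Tells the RPC server to expect a connection from a new client.
   * Sketch: register the client with the connection timeout rather than
   * the server/client handshake timeout.
   */
  public Future<Rpc> registerClient(final String clientId, String secret,
      RpcDispatcher serverDispatcher) {
    // hive.spark.client.connect.timeout, i.e. how long the remote Spark driver
    // has to connect back, instead of hive.spark.client.server.connect.timeout.
    return registerClient(clientId, secret, serverDispatcher,
        config.getConnectTimeoutMs());
  }
{code}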





[jira] [Comment Edited] (HIVE-15671) RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout

2017-02-10 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861114#comment-15861114
 ] 

Xuefu Zhang edited comment on HIVE-15671 at 2/10/17 11:10 AM:
--

Hi [~vanzin], to backtrack a little bit, I have a followup question about your 
comment.
{quote}
That's kinda hard to solve, because the server doesn't know which client 
connected until two things happen: first the driver has started, second the 
driver completed the SASL handshake to identify itself. A lot of things can go 
wrong in that time. There's already some code, IIRC, that fails the session if 
the spark-submit job dies with an error, but aside from that, it's kinda hard 
to do more.
{quote}
I was talking about the server detecting a driver problem after the driver has 
connected back to the server. I'm wondering which timeout applies in case of a problem 
on the driver side, such as a long GC pause, a stalled connection between the server 
and the driver, etc. If that timeout is also server.connect.timeout, it's rather long, 
since we increased it to 10m in our case to accommodate a busy cluster. To me it 
doesn't seem that such a timeout exists, in the absence of a heartbeat mechanism.
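To make the question concrete, here is a rough, illustrative sketch of what an idle check could look like on a Netty pipeline (the transport this RPC layer runs on). This is not existing Hive code; the class name and the timeout parameter are placeholders for discussion.
{code}
import java.util.concurrent.TimeUnit;

import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPipeline;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.handler.timeout.IdleStateHandler;

public class DriverIdleMonitor extends ChannelDuplexHandler {

  /** Fires an IdleStateEvent when nothing is read from the driver for readerIdleSec. */
  public static void install(ChannelPipeline pipeline, int readerIdleSec) {
    pipeline.addLast("idleDetector",
        new IdleStateHandler(readerIdleSec, 0, 0, TimeUnit.SECONDS));
    pipeline.addLast("idleMonitor", new DriverIdleMonitor());
  }

  @Override
  public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
    if (evt instanceof IdleStateEvent
        && ((IdleStateEvent) evt).state() == IdleState.READER_IDLE) {
      // No traffic from the driver for the configured period: treat it as dead
      // instead of waiting out server.connect.timeout.
      ctx.close();
    } else {
      super.userEventTriggered(ctx, evt);
    }
  }
}
{code}
A real mechanism would also need the driver to send periodic heartbeats so a healthy but quiet driver is not dropped; that missing piece is what the question above is getting at.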





[jira] [Comment Edited] (HIVE-15671) RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout

2017-02-08 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857950#comment-15857950
 ] 

Xuefu Zhang edited comment on HIVE-15671 at 2/8/17 12:57 PM:
-

[~KaiXu], your case might be different. The connection is explicitly closed due 
to 
{code}
2017-02-08T09:51:04,346 INFO [stderr-redir-1] client.SparkClientImpl: 17/02/08 
09:51:04 WARN rpc.RpcDispatcher: [DriverProtocol] Expected RPC header, got 
org.apache.hive.spark.client.rpc.Rpc$NullMessage instead.
2017-02-08T09:51:04,347 INFO [stderr-redir-1] client.SparkClientImpl: 17/02/08 
09:51:04 INFO rpc.RpcDispatcher: [DriverProtocol] Closing channel due to 
exception in pipeline (null).
2017-02-08T09:51:04,347 INFO [RPC-Handler-3] rpc.RpcDispatcher: 
[ClientProtocol] Closing channel due to exception in pipeline (Connection reset 
by peer).
{code}
I haven't seen this kind of exception, so I'm not sure how it happens. If this can be 
reproduced, complete logs (driver, Hive, YARN) would be helpful.





[jira] [Comment Edited] (HIVE-15671) RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout

2017-01-20 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832582#comment-15832582
 ] 

Xuefu Zhang edited comment on HIVE-15671 at 1/20/17 11:12 PM:
--

Patch #1 followed what [~vanzin] suggested. With it, I observed the following 
behavior:

1. Increasing *server.connect.timeout* makes Hive wait longer for the driver to 
connect back, which solves the busy-cluster problem.
2. Killing the driver while the job is running immediately fails the query on the Hive 
side with the following error:
{code}
2017-01-20 22:01:08,235 Stage-2_0: 7(+3)/685Stage-3_0: 0/1  
2017-01-20 22:01:09,237 Stage-2_0: 16(+6)/685   Stage-3_0: 0/1  
Failed to monitor Job[ 1] with exception 'java.lang.IllegalStateException(RPC 
channel is closed.)'
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.spark.SparkTask
{code}

This meets my expectation.

However, I didn't test the case where the driver dies before connecting back to Hive. 
(It's also hard to construct such a test case.) In that case, I assume Hive will wait 
for *server.connect.timeout* before declaring a failure. I don't think there is much we 
can do about that case, and I don't think the change here has any implication for it.
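For anyone tuning this, the two properties discussed in the thread can be set per session as shown below. The values are illustrative only (10m matches the busy-cluster case mentioned earlier in the thread); the exact defaults should be verified in HiveConf.
{code}
-- Time allowed for the Spark driver to start and complete the SASL handshake
-- (10 minutes, for a busy cluster where containers take a while to come up).
set hive.spark.client.server.connect.timeout=600000ms;

-- Timeout for the remote Spark driver's connection back to the Hive client.
set hive.spark.client.connect.timeout=1000ms;
{code}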





[jira] [Comment Edited] (HIVE-15671) RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout

2017-01-19 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15831178#comment-15831178
 ] 

Xuefu Zhang edited comment on HIVE-15671 at 1/20/17 4:33 AM:
-

Actually my understanding is a little different. Checking the code, I see:
1. On the server side (the RpcServer constructor), the saslHandler is given a timeout 
using {{getServerConnectTimeoutMs()}}.
2. On the client side, in {{Rpc.createClient()}}, the saslHandler is also given a 
timeout using {{getServerConnectTimeoutMs()}}.
These two are consistent, so I don't see any issue there.

On the other hand,
3. On the server side, in {{RpcServer.registerClient()}}, ClientInfo stores 
{{getServerConnectTimeoutMs()}}, and when that timeout fires, the exception is 
TimeoutException("Timed out waiting for client connection.").
4. On the client side, in {{Rpc.createClient()}}, the channel is initialized with 
{{getConnectTimeoutMs()}}.

To me, it seems there is a mismatch between 3 and 4. In 3, the timeout message implies 
a "connection timeout", while the value used is the one that is supposed to cover the 
saslHandler handshake. This is why I think 3 should use {{getConnectTimeoutMs()}} 
instead.

Could you take another look?

I actually ran into issues with this. Our cluster is constantly busy, and it takes 
minutes for Hive to get a YARN container to launch the remote driver. In that case, the 
query fails because it cannot create a Spark session. For such a scenario, I supposed 
we should increase *client.connect.timeout*; however, that is not effective. On the 
other hand, if I increase *server.connect.timeout*, Hive waits longer for the driver to 
come up, which is good. However, doing that has the bad consequence that Hive will also 
wait that long before declaring a failure if the remote driver dies for any reason.

With the patch in place, the problem is solved in both cases. I only need to 
increase *client.connect.timeout* and keep *server.connect.timeout* unchanged.




[jira] [Comment Edited] (HIVE-15671) RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout

2017-01-19 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15831115#comment-15831115
 ] 

Marcelo Vanzin edited comment on HIVE-15671 at 1/20/17 3:30 AM:


Hmm, the options are poorly named (my fault, and they always confuse me when I 
look at them now), but the current use looks correct.

"client.connect.timeout" is for the connection that the Spark driver opens to 
HS2.
"server.connect.timeout" is actually used in two places, but is basically the 
time allowed between HS2 starting the Spark driver, and the SASL handshake to 
finish.


