[jira] [Comment Edited] (HIVE-15671) RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout
[ https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864231#comment-15864231 ] Xuefu Zhang edited comment on HIVE-15671 at 2/13/17 7:22 PM:
-
Committed to master. Thanks to Marcelo, Jimmy, and Rui for the review. Created HIVE-15893 as a followup.

was (Author: xuefuz): Committed to master. Thanks to Marcelo, Jimmy, and Rui for the review.

> RPCServer.registerClient() erroneously uses server/client handshake timeout
> for connection timeout
> --
>
> Key: HIVE-15671
> URL: https://issues.apache.org/jira/browse/HIVE-15671
> Project: Hive
> Issue Type: Bug
> Components: Spark
> Affects Versions: 1.1.0
> Reporter: Xuefu Zhang
> Assignee: Xuefu Zhang
> Fix For: 2.2.0
>
> Attachments: HIVE-15671.1.patch, HIVE-15671.patch
>
> {code}
> /**
>  * Tells the RPC server to expect a connection from a new client.
>  * ...
>  */
> public Future<Rpc> registerClient(final String clientId, String secret,
>     RpcDispatcher serverDispatcher) {
>   return registerClient(clientId, secret, serverDispatcher,
>       config.getServerConnectTimeoutMs());
> }
> {code}
> {{config.getServerConnectTimeoutMs()}} returns the value of
> *hive.spark.client.server.connect.timeout*, which is meant as the timeout for
> the handshake between the Hive client and the remote Spark driver. Instead, the
> timeout should be *hive.spark.client.connect.timeout*, which is the timeout for
> the remote Spark driver connecting back to the Hive client.

-- This message was sent by Atlassian JIRA (v6.3.15#6346)
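The fix described in the issue amounts to delegating with the other config getter. Below is a minimal, self-contained sketch of that behavior change; the class and method bodies here are hypothetical stand-ins for RpcServer/RpcConfiguration (only the getter names come from the issue), with the stock defaults of 1000 ms and 90000 ms assumed for the two properties:

```java
// Hypothetical model of the timeout selection in RpcServer.registerClient();
// it does not reproduce the actual RPC machinery, only which value is used.
public class RegisterClientTimeoutSketch {

    // Stand-ins for the two RpcConfiguration getters discussed in the issue.
    static long getConnectTimeoutMs()       { return 1000L;  } // hive.spark.client.connect.timeout
    static long getServerConnectTimeoutMs() { return 90000L; } // hive.spark.client.server.connect.timeout

    // Overload that arms the "Timed out waiting for client connection"
    // timer with the given value (modeled here by simply returning it).
    static long registerClient(String clientId, long timeoutMs) {
        return timeoutMs;
    }

    // After the patch: the convenience overload delegates with the
    // connection timeout, not the server/client handshake timeout.
    static long registerClient(String clientId) {
        return registerClient(clientId, getConnectTimeoutMs());
    }

    public static void main(String[] args) {
        long used = registerClient("client-1");
        // Raising server.connect.timeout no longer stretches the window in
        // which a dead driver goes undetected during registration.
        if (used != 1000L) throw new AssertionError("expected connect timeout");
        System.out.println("registration timer armed with " + used + " ms");
    }
}
```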
[jira] [Comment Edited] (HIVE-15671) RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout
[ https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861114#comment-15861114 ] Xuefu Zhang edited comment on HIVE-15671 at 2/10/17 11:10 AM:
--
Hi [~vanzin], to backtrack a little bit, I have a followup question about your comment.
{quote}
That's kinda hard to solve, because the server doesn't know which client connected until two things happen: first the driver has started, second the driver completed the SASL handshake to identify itself. A lot of things can go wrong in that time. There's already some code, IIRC, that fails the session if the spark-submit job dies with an error, but aside from that, it's kinda hard to do more.
{quote}
I was talking about the server detecting a driver problem after the driver has connected back to the server. I'm wondering which timeout applies in case of a problem on the driver side, such as a long GC, a stalled connection between the server and the driver, etc. It's rather long if this timeout is also server.connect.timeout, which is increased to 10m in our case to accommodate the busy cluster. To me it doesn't seem that such a timeout exists, in the absence of a heartbeat mechanism.
[jira] [Comment Edited] (HIVE-15671) RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout
[ https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857950#comment-15857950 ] Xuefu Zhang edited comment on HIVE-15671 at 2/8/17 12:57 PM:
-
[~KaiXu], your case might be different. The connection is explicitly closed due to
{code}
2017-02-08T09:51:04,346 INFO [stderr-redir-1] client.SparkClientImpl: 17/02/08 09:51:04 WARN rpc.RpcDispatcher: [DriverProtocol] Expected RPC header, got org.apache.hive.spark.client.rpc.Rpc$NullMessage instead.
2017-02-08T09:51:04,347 INFO [stderr-redir-1] client.SparkClientImpl: 17/02/08 09:51:04 INFO rpc.RpcDispatcher: [DriverProtocol] Closing channel due to exception in pipeline (null).
2017-02-08T09:51:04,347 INFO [RPC-Handler-3] rpc.RpcDispatcher: [ClientProtocol] Closing channel due to exception in pipeline (Connection reset by peer).
{code}
I haven't seen this kind of exception, so I am not sure how it happens. If this can be reproduced, complete logs (driver, hive, yarn) would be helpful.
[jira] [Comment Edited] (HIVE-15671) RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout
[ https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832582#comment-15832582 ] Xuefu Zhang edited comment on HIVE-15671 at 1/20/17 11:12 PM:
--
Patch #1 followed what [~vanzin] suggested. With it, I observed the following behavior:
1. Increasing *server.connect.timeout* makes Hive wait longer for the driver to connect back, which solves the busy-cluster problem.
2. Killing the driver while the job is running immediately fails the query on the Hive side with the following error:
{code}
2017-01-20 22:01:08,235 Stage-2_0: 7(+3)/685 Stage-3_0: 0/1
2017-01-20 22:01:09,237 Stage-2_0: 16(+6)/685 Stage-3_0: 0/1
Failed to monitor Job[ 1] with exception 'java.lang.IllegalStateException(RPC channel is closed.)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
{code}
This meets my expectation. However, I didn't test the case of the driver dying before connecting back to Hive. (It's also hard to construct such a test case.) In that case, I assume Hive will wait for *server.connect.timeout* before declaring a failure. I guess there isn't much we can do for this case, and I don't think the change here has any implication on it.
[jira] [Comment Edited] (HIVE-15671) RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout
[ https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831178#comment-15831178 ] Xuefu Zhang edited comment on HIVE-15671 at 1/20/17 4:33 AM:
-
Actually my understanding is a little different. Checking the code, I see:
1. On the server side (RpcServer constructor), saslHandler is given a timeout using {{getServerConnectTimeoutMs()}}.
2. On the client side, in {{Rpc.createClient()}}, saslHandler is also given a timeout using {{getServerConnectTimeoutMs()}}.
These two are consistent, and I don't see any issue there. On the other hand,
3. On the server side, in {{RpcServer.registerClient()}}, ClientInfo stores {{getServerConnectTimeoutMs()}}. And when the timeout happens, the exception is TimeoutException("Timed out waiting for client connection.").
4. On the client side, in {{Rpc.createClient()}}, the channel is initialized with {{getConnectTimeoutMs()}}.
To me, there seems to be a mismatch between 3 and 4. In 3, the timeout message implies "connection timeout", while the value is the one intended for the saslHandler handshake. This is why I think 3 should use {{getConnectTimeoutMs()}} instead. Could you take another look?
I actually ran into issues with this. Our cluster is constantly busy, and it takes minutes for Hive to get a YARN container to launch the remote driver. In that case, the query fails because a Spark session cannot be created. For such a scenario, I supposed we should increase *client.connect.timeout*; however, that's not effective. On the other hand, if I increase *server.connect.timeout*, Hive waits longer for the driver to come up, which is good. However, doing that has a bad consequence: Hive will wait just as long to declare a failure if the remote driver dies for any reason. With the patch in place, the problem is solved in both cases: I only need to increase *client.connect.timeout* and keep *server.connect.timeout* unchanged.
[jira] [Comment Edited] (HIVE-15671) RPCServer.registerClient() erroneously uses server/client handshake timeout for connection timeout
[ https://issues.apache.org/jira/browse/HIVE-15671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831115#comment-15831115 ] Marcelo Vanzin edited comment on HIVE-15671 at 1/20/17 3:30 AM:
-
Hmm, the options are poorly named (my fault, and they always confuse me when I look at them now), but the current use looks correct. "client.connect.timeout" is for the connection that the Spark driver opens to HS2. "server.connect.timeout" is actually used in two places, but is basically the time allowed between HS2 starting the Spark driver and the SASL handshake finishing.
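The two properties [~vanzin] describes map to Hive configuration entries. An illustrative hive-site.xml fragment follows; the values are examples only (1000ms is the assumed default for the connect timeout, and 10m reflects the busy-cluster tuning mentioned earlier in this thread), and the descriptions paraphrase this discussion rather than the shipped config docs:

```xml
<!-- Illustrative fragment only; values are examples, not recommendations. -->
<property>
  <name>hive.spark.client.connect.timeout</name>
  <value>1000ms</value>
  <description>Timeout for the remote Spark driver connecting back to HS2.</description>
</property>
<property>
  <name>hive.spark.client.server.connect.timeout</name>
  <value>600000ms</value>
  <description>Handshake timeout: the time allowed between HS2 launching the
    Spark driver and the SASL handshake completing (set to 10m here, as on the
    busy cluster discussed above).</description>
</property>
```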