[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148177#comment-15148177 ] JoneZhang commented on HIVE-12650: --
Hi all, I'm sorry to reply so late. Yes, hive.spark.client.server.connect.timeout and spark.yarn.am.waitTime are not related. hive.spark.client.server.connect.timeout is the timeout for the handshake between the RPC server and client. When no container is available, the Hive client will exit after hive.spark.client.server.connect.timeout. spark.yarn.am.waitTime is the time the Spark AM waits for the SparkContext to be created after the AM has been launched.
There are two types of error log:
1. "Client closed before SASL negotiation finished" happened on resubmission. See https://issues.apache.org/jira/browse/HIVE-12649.
2. "Connection refused: /hiveclientip:port" happened when the AM tried to connect back to Hive.
Container: container_1448873753366_113453_01_01 on 10.247.169.134_8041
LogType: stderr
LogLength: 3302
Log Contents:
Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled in the future
Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled in the future
15/12/09 02:11:48 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
15/12/09 02:11:48 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1448873753366_113453_01
15/12/09 02:11:49 INFO spark.SecurityManager: Changing view acls to: mqq
15/12/09 02:11:49 INFO spark.SecurityManager: Changing modify acls to: mqq
15/12/09 02:11:49 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mqq); users with modify permissions: Set(mqq)
15/12/09 02:11:49 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
15/12/09 02:11:49 INFO yarn.ApplicationMaster: Waiting for spark context initialization
15/12/09 02:11:49 INFO yarn.ApplicationMaster: Waiting for spark context initialization
... 15/12/09 02:11:49 INFO client.RemoteDriver: Connecting to: 10.179.12.140:58013 15/12/09 02:11:49 ERROR yarn.ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused: /10.179.12.140:58013 java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused: /10.179.12.140:58013 at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37) at org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:156) at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:483) Caused by: java.net.ConnectException: Connection refused: /10.179.12.140:58013 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) 15/12/09 02:11:49 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.util.concurrent.ExecutionException: 
java.net.ConnectException: Connection refused: /10.179.12.140:58013) 15/12/09 02:11:59 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 15 ms. Please check earlier log output for errors. Failing the application. 15/12/09 02:11:59 INFO util.Utils: Shutdown hook called > Increase default value of hive.spark.client.server.connect.timeout to exceeds > spark.yarn.am.waitTime > > > Key: HIVE-12650 > URL: https://issues.apache.org/jira/browse/HIVE-12650 > Project: Hive > Issue Type: Bug >Affects Versions:
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128913#comment-15128913 ] Marcelo Vanzin commented on HIVE-12650: --- bq. could you please explain a little bit the use of the two timeout? There's nothing complicated about them. - RSC timeout: time between the RSC launching the Spark app and the Spark driver connecting back. - Spark AM timeout: time between Spark AM launching the user's "main" method and a SparkContext being created. Both overlap but one is not necessarily contained in the other. > Increase default value of hive.spark.client.server.connect.timeout to exceeds > spark.yarn.am.waitTime > > > Key: HIVE-12650 > URL: https://issues.apache.org/jira/browse/HIVE-12650 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Xuefu Zhang > > I think hive.spark.client.server.connect.timeout should be set greater than > spark.yarn.am.waitTime. The default value for > spark.yarn.am.waitTime is 100s, and the default value for > hive.spark.client.server.connect.timeout is 90s, which is not good. We can > increase it to a larger value such as 120s. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128949#comment-15128949 ] Xuefu Zhang commented on HIVE-12650: Thanks, [~vanzin]. I guess the question is the difference between the following two (both defined in Hive):
1. hive.spark.client.connect.timeout
2. hive.spark.client.server.connect.timeout
The second question is: what timeout value does spark-submit use when no containers are available?
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128979#comment-15128979 ] Marcelo Vanzin commented on HIVE-12650: ---
* hive.spark.client.connect.timeout
That's the socket connect timeout when the driver connects to the RSC server. Equivalent to this: http://docs.oracle.com/javase/7/docs/api/java/net/Socket.html#connect(java.net.SocketAddress,%20int)
* hive.spark.client.server.connect.timeout
That's the timeout explained in my previous comment.
* what's the timeout value that spark-submit uses in case of no available containers?
I don't believe there is one.
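For readers unfamiliar with the `Socket.connect(SocketAddress, int)` call linked above, here is a minimal sketch of what a socket connect timeout bounds. It is not Hive code: the remote driver uses Netty rather than `java.net.Socket`, and the class and method names (`ConnectTimeoutSketch`, `canConnect`, `demo`) are made up for illustration. The 1000 ms value mirrors the default for hive.spark.client.connect.timeout mentioned elsewhere in this thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical illustration only; not part of Hive or Spark.
final class ConnectTimeoutSketch {

    /** Returns true if a TCP connection to host:port is established within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            // The int argument bounds only the TCP connect itself,
            // not any handshake or later traffic on the connection.
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // SocketTimeoutException (timeout expired) and ConnectException
            // (connection refused) both end up here.
            return false;
        }
    }

    /** Opens a throwaway listener on loopback and connects to it with a 1000 ms timeout. */
    static boolean demo() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            return canConnect(server.getInetAddress().getHostAddress(),
                              server.getLocalPort(), 1000);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the sketch is that this timeout covers a single connect attempt, which is a much narrower window than the server-side wait for the driver to connect back.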
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129392#comment-15129392 ] Xuefu Zhang commented on HIVE-12650: Thanks, [~vanzin]. If there is no timeout in spark-submit (it waits indefinitely), I'm wondering what happens if the cluster is busy. Here is my speculation. Hive will time out first (which also matches Rui's observation), but spark-submit will continue to run. If a container becomes available, the Spark AM will start and connect to Hive. Hive of course refuses. Then the AM will error out. I'm not sure if this is what the user experienced. It would be good if we could cancel the submit. However, it doesn't look too bad even if we decide to live with it. Unless [~joyoungzh...@gmail.com] can provide more info, it doesn't seem we can do much here.
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129595#comment-15129595 ] Rui Li commented on HIVE-12650: ---
bq. Regarding your last question, I tried submitting application when no container is available. Spark-submit will wait until timeout (90s).
Sorry, this comment was misleading. What I actually meant is that Hive will time out after 90s. After that, we interrupt the driver thread:
{code}
try {
  // The RPC server will take care of timeouts here.
  this.driverRpc = rpcServer.registerClient(clientId, secret, protocol).get();
} catch (Throwable e) {
  LOG.warn("Error while waiting for client to connect.", e);
  driverThread.interrupt();
  try {
    driverThread.join();
  } catch (InterruptedException ie) {
    // Give up.
    LOG.debug("Interrupted before driver thread was finished.");
  }
  throw Throwables.propagate(e);
}
{code}
which in turn destroys the SparkSubmit process:
{code}
public void run() {
  try {
    int exitCode = child.waitFor();
    if (exitCode != 0) {
      rpcServer.cancelClient(clientId, "Child process exited before connecting back");
      LOG.warn("Child process exited with code {}.", exitCode);
    }
  } catch (InterruptedException ie) {
    LOG.warn("Waiting thread interrupted, killing child process.");
    Thread.interrupted();
    child.destroy();
  } catch (Exception e) {
    LOG.warn("Exception while waiting for child process.", e);
  }
}
{code}
So on my machine, SparkSubmit is terminated after the timeout. I think the {{Client closed before SASL negotiation finished.}} exception is worth investigating and should be the root cause here.
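The sequence described in this comment (wait for the handshake, interrupt the monitoring thread on failure, and let that thread kill the child) can be reproduced in isolation. The following is a rough sketch under stated assumptions, not the Hive implementation: a never-completed `CompletableFuture` stands in for the RPC registration future, a `CountDownLatch` stands in for `child.waitFor()`, and an `AtomicBoolean` stands in for `child.destroy()`. All names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; not part of Hive or Spark.
final class HandshakeTimeoutSketch {

    /**
     * Waits up to timeoutMs for a handshake that never arrives, then interrupts
     * the monitor thread. Returns true if the stand-in child was "killed".
     */
    static boolean childKilledAfterTimeout(long timeoutMs) {
        // Never completed: models a driver that does not connect back in time.
        CompletableFuture<String> handshake = new CompletableFuture<>();
        AtomicBoolean childKilled = new AtomicBoolean(false);
        CountDownLatch childExit = new CountDownLatch(1); // stand-in for child.waitFor()

        Thread monitor = new Thread(() -> {
            try {
                childExit.await();
            } catch (InterruptedException ie) {
                childKilled.set(true); // stand-in for child.destroy()
            }
        });
        monitor.start();

        try {
            handshake.get(timeoutMs, TimeUnit.MILLISECONDS);
            childExit.countDown(); // handshake succeeded: let the monitor finish normally
        } catch (TimeoutException | ExecutionException e) {
            monitor.interrupt(); // same reaction as driverThread.interrupt() above
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            monitor.interrupt();
        }

        try {
            monitor.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return childKilled.get();
    }
}
```

Note that killing the local child process in this way says nothing about work the child has already handed to another system, which is the gap discussed later in the thread.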
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129761#comment-15129761 ] Xuefu Zhang commented on HIVE-12650: I see. I think that's what [~joyoungzh...@gmail.com] experienced as well. Killing spark-submit doesn't cancel the AM request. When the AM is finally launched, it tries to connect back to Hive and gets refused. As a result, it quickly errors out. (However, on the Spark side, the message saying "spark context initialization times out in xxx seconds" is very confusing.) I'm not sure if we can do anything here. Nevertheless, it seems spark.yarn.am.waitTime isn't relevant after all.
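The "gets refused" step can be seen with plain sockets. This is only a sketch, assuming a loopback listener is a fair stand-in for the Hive-side RPC server: the listener is closed (as if Hive had already timed out and gone away), and a late client then tries to connect to the same port. The class and method names are hypothetical, and on most systems the attempt fails with a `ConnectException`, although the exact failure can depend on the platform.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ConnectException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical illustration only; not part of Hive or Spark.
final class LateConnectSketch {

    /** Closes a loopback listener, then tries to connect to its old port. */
    static String connectAfterListenerClosed() {
        int port;
        // Stand-in for the Hive-side server: it listens briefly, then goes away.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            port = server.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Stand-in for the late AM: it connects after the listener is gone.
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(InetAddress.getLoopbackAddress(), port), 1000);
            return "connected";
        } catch (ConnectException e) {
            return "refused";
        } catch (IOException e) {
            return "failed: " + e;
        }
    }
}
```

The failure is immediate rather than a timeout, which matches the short elapsed time seen in the AM logs posted in this thread.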
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128361#comment-15128361 ] Xuefu Zhang commented on HIVE-12650: [~lirui], thanks for your analysis. Yeah, I saw that the actual elapsed time is very short, while the message says it timed out after 150s, which is very confusing. [~vanzin], could you please explain a little bit about the use of the two timeouts? Also, what timeout value does spark-submit use if the application cannot be submitted? [~joyoungzh...@gmail.com], could you please reproduce the problem and provide more info such as hive.log? Thanks, folks!
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129655#comment-15129655 ] Xuefu Zhang commented on HIVE-12650: Hi [~lirui], thanks for the info. It's good that spark-submit is killed when Hive times out. Now the user's problem seems more interesting, though we cannot do much unless we have more information. "Client closed before SASL negotiation finished" could be caused by the AM trying to connect back to Hive after Hive has already timed out. Even though spark-submit is killed, is it possible that the YARN RM still holds the request, which will eventually be served?
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129682#comment-15129682 ] Rui Li commented on HIVE-12650: --- Thanks Xuefu. Yeah, I tried again and found that the application is served (AM launched) and eventually fails, even after SparkSubmit is killed. However, I didn't get the AM log due to an environment issue.
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126866#comment-15126866 ] Xuefu Zhang commented on HIVE-12650: Thanks for the clarification, [~vanzin]. I agree with you. Do you know what factors (such as a lack of available executors) might make the Spark AM wait for the SparkContext to be initialized for a longer period of time (say, a minute)? The problem seems to be that Hive times out first while the AM still appears to be running, waiting for the context to be initialized. It will eventually fail, whether the context gets initialized or the timeout occurs. This might look a bit confusing. I'm thinking that if we make Hive wait longer than that, we can avoid the scenario. Any further thoughts?
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126881#comment-15126881 ] Marcelo Vanzin commented on HIVE-12650: --- bq. Do you know what factors (such as a lack of available executors) might make Spark AM wait for SparkContext to be initialized for longer period of time (say, a minute)? The only factor is possible problems in the user's {{main}} method, since that's the code that creates the SparkContext. The AM container is *already running* at that time, so it can't really fail for not being able to allocate the container...
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127564#comment-15127564 ] Xuefu Zhang commented on HIVE-12650: I'm especially interested in the case where Hive calls spark-submit to submit the application while there is no container available. I'm not sure if spark-submit will wait. If it does, then Hive can time out before the AM starts to run.
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127492#comment-15127492 ] Rui Li commented on HIVE-12650: --- Thanks guys for your input. My understanding is that {{hive.spark.client.server.connect.timeout}} is the timeout for the handshake between the RPC server and client. In {{RemoteDriver}}, the RPC client is created before the SparkContext. So if {{spark.yarn.am.waitTime}} is the timeout for waiting for the SparkContext to be created, changing it probably won't help here. I mean we can try increasing {{hive.spark.client.server.connect.timeout}}, but the new value should be based on something else. BTW, is it possible that the timeout is caused by a scheduling delay within YARN? Is the issue only encountered with yarn-cluster?
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127507#comment-15127507 ] Xuefu Zhang commented on HIVE-12650: Here is the log provided by the JIRA creator:
{code}
Logs of Application_1448873753366_121022 as follows(same as application_1448873753366_121055):
Container: container_1448873753366_121022_03_01 on 10.226.136.122_8041
LogType: stderr
LogLength: 4664
Log Contents:
Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled in the future
Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled in the future
15/12/09 16:29:45 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
15/12/09 16:29:46 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1448873753366_121022_03
15/12/09 16:29:47 INFO spark.SecurityManager: Changing view acls to: mqq
15/12/09 16:29:47 INFO spark.SecurityManager: Changing modify acls to: mqq
15/12/09 16:29:47 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mqq); users with modify permissions: Set(mqq)
15/12/09 16:29:47 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
15/12/09 16:29:47 INFO yarn.ApplicationMaster: Waiting for spark context initialization
15/12/09 16:29:47 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
15/12/09 16:29:47 INFO client.RemoteDriver: Connecting to: 10.179.12.140:38842
15/12/09 16:29:48 WARN rpc.Rpc: Invalid log level null, reverting to default.
15/12/09 16:29:48 ERROR yarn.ApplicationMaster: User class threw exception: java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: Client closed before SASL negotiation finished.
java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: Client closed before SASL negotiation finished.
at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37) at org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:156) at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:483) Caused by: javax.security.sasl.SaslException: Client closed before SASL negotiation finished. at org.apache.hive.spark.client.rpc.Rpc$SaslClientHandler.dispose(Rpc.java:449) at org.apache.hive.spark.client.rpc.SaslHandler.channelInactive(SaslHandler.java:90) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233) at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219) at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75) at org.apache.hive.spark.client.rpc.KryoMessageCodec.channelInactive(KryoMessageCodec.java:127) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233) at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219) at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233) at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219) at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:769) at io.netty.channel.AbstractChannel$AbstractUnsafe$5.run(AbstractChannel.java:567) at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:745) 15/12/09 16:29:48 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: Client closed before SASL negotiation finished.) 15/12/09 16:29:57 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 15 ms. Please check earlier log output for errors.
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126219#comment-15126219 ] Rui Li commented on HIVE-12650: --- Hi [~vanzin], any idea on this?
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127747#comment-15127747 ] Rui Li commented on HIVE-12650: ---
Hi [~xuefuz], the exception you posted doesn't seem to be a timeout. At least it's not related to {{hive.spark.client.server.connect.timeout}}, because the elapsed time is much less than 90s. I found the code that prints the log you mentioned:
{code}
while (sparkContextRef.get() == null && System.currentTimeMillis < deadline && !finished) {
  logInfo("Waiting for spark context initialization ... ")
  sparkContextRef.wait(1L)
}

val sparkContext = sparkContextRef.get()
if (sparkContext == null) {
  logError(("SparkContext did not initialize after waiting for %d ms. Please check earlier"
    + " log output for errors. Failing the application.").format(totalWaitTime))
}
{code}
You can see that the while loop can exit either on timeout or when {{finished}} is set to true. Since the elapsed time is short, it must be because the user thread (RemoteDriver) has finished abnormally:
{code}
val userThread = new Thread {
  override def run() {
    try {
      mainMethod.invoke(null, userArgs.toArray)
      finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
      logDebug("Done running users class")
    } catch {
      case e: InvocationTargetException =>
        e.getCause match {
          case _: InterruptedException =>
            // Reporter thread can interrupt to stop user class
          case SparkUserAppException(exitCode) =>
            val msg = s"User application exited with status $exitCode"
            logError(msg)
            finish(FinalApplicationStatus.FAILED, exitCode, msg)
          case cause: Throwable =>
            logError("User class threw exception: " + cause, cause)
            finish(FinalApplicationStatus.FAILED,
              ApplicationMaster.EXIT_EXCEPTION_USER_CLASS,
              "User class threw exception: " + cause)
        }
    }
  }
}
{code}
In conclusion, the problem here is not that we timed out creating the SparkContext. My guess is that something goes wrong before we create the SparkContext (you can refer to the constructor of RemoteDriver).
I also found another property, {{hive.spark.client.connect.timeout}}, which defaults to 1000ms. It's used when RemoteDriver creates the RPC client, so it could be related, although I'm a little confused about the difference between the two configurations.
Regarding your last question, I tried submitting an application when no container is available. Spark-submit will wait until timeout (90s).
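To make the point about the misleading message concrete, here is a small Java sketch of the same wait-loop shape. It is an illustration under stated assumptions, not the Spark code: the class name, fields and the 10 ms poll interval are invented, and only the loop condition and the message wording follow the snippet quoted above. It shows that the error text always reports the configured total wait time, even when the loop ended early because the `finished` flag was set.

```java
// Hypothetical illustration only; not part of Hive or Spark.
final class WaitLoopSketch {
    private final Object lock = new Object();
    private Object context;    // stand-in for the SparkContext reference; never set here
    private boolean finished;  // set when the user class ends, normally or not

    /** Marks the user class as finished (for example, it threw) and wakes the waiter. */
    void finish() {
        synchronized (lock) {
            finished = true;
            lock.notifyAll();
        }
    }

    /** Returns the error text if no context appeared, or null if one did. */
    String waitForContext(long totalWaitTimeMs) {
        long deadline = System.currentTimeMillis() + totalWaitTimeMs;
        synchronized (lock) {
            try {
                while (context == null && System.currentTimeMillis() < deadline && !finished) {
                    lock.wait(10L); // poll interval chosen arbitrarily for this sketch
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            if (context == null) {
                // Same wording whether the deadline passed or 'finished' ended the loop early.
                return String.format(
                    "SparkContext did not initialize after waiting for %d ms.", totalWaitTimeMs);
            }
            return null;
        }
    }

    /** Fails the "user class" right away and returns how long the waiter really waited, in ms. */
    static long elapsedWhenUserClassFailsEarly(long totalWaitTimeMs) {
        WaitLoopSketch am = new WaitLoopSketch();
        am.finish(); // user class threw before creating a context
        long start = System.nanoTime();
        String msg = am.waitForContext(totalWaitTimeMs);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        if (msg == null) {
            throw new IllegalStateException("expected an error message");
        }
        return elapsedMs;
    }
}
```

With a configured wait of 100000 ms, the waiter returns almost immediately once `finish()` has been called, yet the message still quotes the full 100000 ms. A message that distinguished the two exit conditions would be less confusing.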
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126779#comment-15126779 ]

Marcelo Vanzin commented on HIVE-12650:
---

{{spark.yarn.am.waitTime}} is not the time Spark waits for the master to launch. It's the time the Spark AM waits for the SparkContext to be created after the AM has been launched.

That being said, it's ok for the Hive timeout to be larger. 90s already seems like a really long time to wait, so I doubt the extra 30s will help, but it won't hurt.
[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime
[ https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126293#comment-15126293 ]

Xuefu Zhang commented on HIVE-12650:

Hi [~lirui], the application master in the context of Hive on Spark takes a container from YARN, so in a busy cluster spark-submit may wait up to spark.yarn.am.waitTime to launch the master. On the other hand, Hive waits for hive.spark.client.server.connect.timeout before declaring that the remote driver is not connecting back. If the latter is less than the former, it's possible that Hive disconnects prematurely, causing an unstable condition. [~joyoungzh...@gmail.com] had a description of the problem in the user list.

I think we need at least to make hive.spark.client.server.connect.timeout greater than spark.yarn.am.waitTime by default. To further guard against the problem, Hive can increase hive.spark.client.server.connect.timeout automatically based on the value of spark.yarn.am.waitTime. [~vanzin], please share your thoughts as well.
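The automatic adjustment suggested in this comment could look roughly like the sketch below. This is only an illustration of the idea, not Hive code: {{TimeoutSketch}}, {{adjustedConnectTimeoutMs}} and the 20000 ms default margin are assumptions, and both inputs are taken as already parsed into milliseconds.

```scala
object TimeoutSketch {
  // Hypothetical helper: returns a connect timeout (in ms) that is at least
  // `amWaitTimeMs + marginMs`, and leaves a larger configured value untouched.
  def adjustedConnectTimeoutMs(connectTimeoutMs: Long,
                               amWaitTimeMs: Long,
                               marginMs: Long = 20000L): Long =
    math.max(connectTimeoutMs, amWaitTimeMs + marginMs)
}
```

For the defaults quoted in the issue (90s and 100s), {{TimeoutSketch.adjustedConnectTimeoutMs(90000L, 100000L)}} would give 120000, which matches the 120s value proposed in the description. A connect timeout that is already larger than the AM wait time plus the margin is returned unchanged.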