[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-15 Thread JoneZhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148177#comment-15148177
 ] 

JoneZhang commented on HIVE-12650:
--

Hi all,
I'm sorry for replying to you so late.

Yes, hive.spark.client.server.connect.timeout and spark.yarn.am.waitTime are not 
related.
hive.spark.client.server.connect.timeout is the timeout for the handshake between 
the RPC server and client. When no container is available, the Hive client will 
exit after hive.spark.client.server.connect.timeout.
spark.yarn.am.waitTime is the time the Spark AM waits for the SparkContext to 
be created after the AM has been launched.
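As a toy illustration (not Hive's actual RSC code) of what such a bounded handshake 
wait looks like, here is a self-contained Java example; the 3-second timeout simply 
stands in for the 90s default discussed in this issue:
{code}
import java.util.concurrent.*;

public class HandshakeTimeoutDemo {
  public static void main(String[] args) throws InterruptedException {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    // Simulate a remote driver that never connects back (e.g. no YARN container available).
    Future<String> handshake = pool.submit(() -> {
      Thread.sleep(Long.MAX_VALUE);
      return "connected";
    });
    try {
      // Give up after a bounded wait, analogous to hive.spark.client.server.connect.timeout.
      System.out.println("Handshake finished: " + handshake.get(3, TimeUnit.SECONDS));
    } catch (TimeoutException te) {
      System.out.println("Client did not connect back in time; failing the session.");
      handshake.cancel(true);
    } catch (ExecutionException e) {
      e.printStackTrace();
    } finally {
      pool.shutdownNow();
    }
  }
}
{code}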

There are two types of error log:
1. "Client closed before SASL negotiation finished" happened on the resubmitted 
application. See https://issues.apache.org/jira/browse/HIVE-12649.
2. "Connection refused: /hiveclientip:port" happened when the AM tries to connect 
back to Hive.

{code}
Container: container_1448873753366_113453_01_01 on 10.247.169.134_8041

LogType: stderr
LogLength: 3302
Log Contents:
Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled in 
the future
Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled in 
the future
15/12/09 02:11:48 INFO yarn.ApplicationMaster: Registered signal handlers for 
[TERM, HUP, INT]
15/12/09 02:11:48 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
appattempt_1448873753366_113453_01
15/12/09 02:11:49 INFO spark.SecurityManager: Changing view acls to: mqq
15/12/09 02:11:49 INFO spark.SecurityManager: Changing modify acls to: mqq
15/12/09 02:11:49 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mqq); users with 
modify permissions: Set(mqq)
15/12/09 02:11:49 INFO yarn.ApplicationMaster: Starting the user application in 
a separate Thread
15/12/09 02:11:49 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization
15/12/09 02:11:49 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization ... 
15/12/09 02:11:49 INFO client.RemoteDriver: Connecting to: 10.179.12.140:58013
15/12/09 02:11:49 ERROR yarn.ApplicationMaster: User class threw exception: 
java.util.concurrent.ExecutionException: java.net.ConnectException: Connection 
refused: /10.179.12.140:58013
java.util.concurrent.ExecutionException: java.net.ConnectException: Connection 
refused: /10.179.12.140:58013
at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
at 
org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:156)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:483)
Caused by: java.net.ConnectException: Connection refused: /10.179.12.140:58013
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
15/12/09 02:11:49 INFO yarn.ApplicationMaster: Final app status: FAILED, 
exitCode: 15, (reason: User class threw exception: 
java.util.concurrent.ExecutionException: java.net.ConnectException: Connection 
refused: /10.179.12.140:58013)
15/12/09 02:11:59 ERROR yarn.ApplicationMaster: SparkContext did not initialize 
after waiting for 150000 ms. Please check earlier log output for errors. 
Failing the application.
15/12/09 02:11:59 INFO util.Utils: Shutdown hook called
{code}


[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-02 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128913#comment-15128913
 ] 

Marcelo Vanzin commented on HIVE-12650:
---

bq. could you please explain a little bit the use of the two timeouts?

There's nothing complicated about them.

- RSC timeout: time between the RSC launching the Spark app and the Spark 
driver connecting back.
- Spark AM timeout: time between Spark AM launching the user's "main" method 
and a SparkContext being created.

Both overlap but one is not necessarily contained in the other.

> Increase default value of hive.spark.client.server.connect.timeout to exceeds 
> spark.yarn.am.waitTime
> -----------------------------------------------------------------------------
>
> Key: HIVE-12650
> URL: https://issues.apache.org/jira/browse/HIVE-12650
> Project: Hive
> Issue Type: Bug
> Affects Versions: 1.1.1, 1.2.1
> Reporter: JoneZhang
> Assignee: Xuefu Zhang
>
> I think hive.spark.client.server.connect.timeout should be set greater than 
> spark.yarn.am.waitTime. The default value for 
> spark.yarn.am.waitTime is 100s, and the default value for 
> hive.spark.client.server.connect.timeout is 90s, which is not good. We can 
> increase it to a larger value such as 120s.





[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-02 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128949#comment-15128949
 ] 

Xuefu Zhang commented on HIVE-12650:


Thanks, [~vanzin]. I guess the question is the difference between the following 
two (both defined in Hive):
1. hive.spark.client.connect.timeout
2. hive.spark.client.server.connect.timeout

The second question is: what's the timeout value that spark-submit uses in case 
of no available containers?



[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-02 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128979#comment-15128979
 ] 

Marcelo Vanzin commented on HIVE-12650:
---

* hive.spark.client.connect.timeout

That's the socket connect timeout when the driver connects to the RSC server, 
equivalent to this (see the sketch after this comment):
http://docs.oracle.com/javase/7/docs/api/java/net/Socket.html#connect(java.net.SocketAddress,%20int)

* hive.spark.client.server.connect.timeout

That's the timeout explained in my previous comment.

* what's the timeout value that spark-submit uses in case of no available 
containers?

I don't believe there is one.
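To make the first point concrete, here is a minimal, self-contained Java sketch of 
that kind of socket-level connect timeout. It is only an analogy for what 
hive.spark.client.connect.timeout bounds, not Hive's actual RSC code; the endpoint 
is just the address from the AM log above, and the timeout value mirrors the 
property's 1000ms default.
{code}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ConnectTimeoutSketch {
  public static void main(String[] args) {
    // Placeholder endpoint; in the RSC case this would be the Hive-side RPC server.
    InetSocketAddress remote = new InetSocketAddress("10.179.12.140", 58013);
    int connectTimeoutMs = 1000; // analogous to hive.spark.client.connect.timeout
    try (Socket socket = new Socket()) {
      // Socket#connect(SocketAddress, int) fails with SocketTimeoutException
      // if the TCP connection is not established within the given timeout.
      socket.connect(remote, connectTimeoutMs);
      System.out.println("Connected to " + remote);
    } catch (SocketTimeoutException e) {
      System.out.println("Connect timed out after " + connectTimeoutMs + " ms");
    } catch (IOException e) {
      // e.g. "Connection refused" if nothing is listening, as in the AM log above.
      System.out.println("Connect failed: " + e);
    }
  }
}
{code}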



[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-02 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129392#comment-15129392
 ] 

Xuefu Zhang commented on HIVE-12650:


Thanks, [~vanzin].

If there is no timeout in spark-submit (it waits indefinitely), I'm wondering what 
happens if the cluster is busy. Here is my speculation. Hive will time out 
first (which also corresponds to Rui's observation), but spark-submit will continue 
to run. If a container becomes available, the Spark AM will start and connect to 
Hive. Hive of course refuses. Then the AM will error out.

I'm not sure if this is what the user experienced. It would be good if we could 
cancel the submit. However, it doesn't look too bad even if we decide to live 
with it.

Unless [~joyoungzh...@gmail.com] can provide more info, it doesn't seem we can 
do much here.



[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-02 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129595#comment-15129595
 ] 

Rui Li commented on HIVE-12650:
---

bq. Regarding your last question, I tried submitting an application when no 
container is available. Spark-submit will wait until timeout (90s).
Sorry, this comment is misleading. Actually I meant that Hive will time out after 
90s. But after that, we'll interrupt the driver thread:
{code}
try {
  // The RPC server will take care of timeouts here.
  this.driverRpc = rpcServer.registerClient(clientId, secret, protocol).get();
} catch (Throwable e) {
  LOG.warn("Error while waiting for client to connect.", e);
  driverThread.interrupt();
  try {
    driverThread.join();
  } catch (InterruptedException ie) {
    // Give up.
    LOG.debug("Interrupted before driver thread was finished.");
  }
  throw Throwables.propagate(e);
}
{code}
which in turn will destroy the SparkSubmit process:
{code}
public void run() {
  try {
    int exitCode = child.waitFor();
    if (exitCode != 0) {
      rpcServer.cancelClient(clientId, "Child process exited before connecting back");
      LOG.warn("Child process exited with code {}.", exitCode);
    }
  } catch (InterruptedException ie) {
    LOG.warn("Waiting thread interrupted, killing child process.");
    Thread.interrupted();
    child.destroy();
  } catch (Exception e) {
    LOG.warn("Exception while waiting for child process.", e);
  }
}
{code}
So on my machine, after the timeout, SparkSubmit is terminated.
I think the {{Client closed before SASL negotiation finished.}} exception is 
worth investigating and should be the root cause here.



[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-02 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129761#comment-15129761
 ] 

Xuefu Zhang commented on HIVE-12650:


I see. I think that's what [~joyoungzh...@gmail.com] experienced as well. 
Killing spark-submit doesn't cancel the AM request. When the AM is finally launched, 
it tries to connect back to Hive and gets refused. As a result, it quickly errors 
out. (However, on the Spark side, the message saying "spark context initialization 
times out in xxx seconds" is very confusing.) I'm not sure if we can do 
anything here.

Nevertheless, it seems spark.yarn.am.waitTime isn't relevant after all.



[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-02 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128361#comment-15128361
 ] 

Xuefu Zhang commented on HIVE-12650:


[~lirui], thanks for your analysis. Yeah, I saw the actual elapsed time is 
very short, while the message says timeout 150s, which is very confusing.

[~vanzin], could you please explain a little bit the use of the two timeouts? 
Also, what timeout value does spark-submit use if the application cannot be 
submitted?

[~joyoungzh...@gmail.com], could you please reproduce the problem and provide 
more info such as hive.log?

Thanks, folks!



[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-02 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129655#comment-15129655
 ] 

Xuefu Zhang commented on HIVE-12650:


Hi [~lirui], thanks for the info. It's good that spark-submit is killed when 
Hive times out. Now the user's problem seems more interesting, though we cannot 
do much unless we have more information.

"Client closed before SASL negotiation finished" could be caused by the fact 
that the AM tries to connect back to Hive, but Hive has already timed out. While 
spark-submit is killed, is it possible that the YARN RM still has the request, 
which will eventually be served?




[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-02 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129682#comment-15129682
 ] 

Rui Li commented on HIVE-12650:
---

Thanks Xuefu. Yeah, I tried again and found the application is served (AM 
launched) and eventually fails, even after SparkSubmit is killed. Although I 
didn't get the AM log due to some env issue.



[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-01 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126866#comment-15126866
 ] 

Xuefu Zhang commented on HIVE-12650:


Thanks for the clarification, [~vanzin]. I agree with you. Do you know what 
factors (such as a lack of available executors) might make the Spark AM wait for 
the SparkContext to be initialized for a longer period of time (say, a minute)? The 
problem seems to be that Hive times out first while the AM still appears to be 
running, waiting for the context to be initialized. It will eventually fail, 
either when the context gets initialized or when the timeout occurs. This might 
look a bit confusing. I'm thinking that if we make Hive wait longer than that, 
then we can avoid this scenario. Any further thoughts?




[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-01 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126881#comment-15126881
 ] 

Marcelo Vanzin commented on HIVE-12650:
---

bq. Do you know what factors (such as a lack of available executors) might 
make the Spark AM wait for the SparkContext to be initialized for a longer period 
of time (say, a minute)?

The only factor is possible problems in the user's {{main}} method, since 
that's the code that creates the SparkContext. The AM container is *already 
running* at that time, so it can't really fail for not being able to allocate 
the container...



[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-01 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127564#comment-15127564
 ] 

Xuefu Zhang commented on HIVE-12650:


I'm especially interested in the case where Hive calls spark-submit to submit the 
application while there is no container available. I'm not sure if spark-submit 
will wait. If it does, then Hive can time out before the AM even starts to run.



[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-01 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127492#comment-15127492
 ] 

Rui Li commented on HIVE-12650:
---

Thanks guys for your inputs. My understanding is that 
{{hive.spark.client.server.connect.timeout}} is the timeout for the handshake 
between the RPC server and client. In {{RemoteDriver}}, the RPC client is created 
before the SparkContext. So if {{spark.yarn.am.waitTime}} is the timeout for 
waiting for the SparkContext to be created, maybe it won't help here. I mean we 
can try increasing {{hive.spark.client.server.connect.timeout}}, but based on 
something other than spark.yarn.am.waitTime.
BTW, is it possible the timeout is caused by scheduling delay within YARN? Is 
the issue only encountered with yarn-cluster mode?



[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-01 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127507#comment-15127507
 ] 

Xuefu Zhang commented on HIVE-12650:


Here is the log provided by the JIRA creator:
{code}
Logs of Application_1448873753366_121022 as follows (same as 
application_1448873753366_121055):
Container: container_1448873753366_121022_03_01 on 10.226.136.122_8041

LogType: stderr
LogLength: 4664
Log Contents:
Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled in 
the future
Please use CMSClassUnloadingEnabled in place of CMSPermGenSweepingEnabled in 
the future
15/12/09 16:29:45 INFO yarn.ApplicationMaster: Registered signal handlers for 
[TERM, HUP, INT]
15/12/09 16:29:46 INFO yarn.ApplicationMaster: ApplicationAttemptId: 
appattempt_1448873753366_121022_03
15/12/09 16:29:47 INFO spark.SecurityManager: Changing view acls to: mqq
15/12/09 16:29:47 INFO spark.SecurityManager: Changing modify acls to: mqq
15/12/09 16:29:47 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(mqq); users with 
modify permissions: Set(mqq)
15/12/09 16:29:47 INFO yarn.ApplicationMaster: Starting the user application in 
a separate Thread
15/12/09 16:29:47 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization
15/12/09 16:29:47 INFO yarn.ApplicationMaster: Waiting for spark context 
initialization ... 
15/12/09 16:29:47 INFO client.RemoteDriver: Connecting to: 10.179.12.140:38842
15/12/09 16:29:48 WARN rpc.Rpc: Invalid log level null, reverting to default.
15/12/09 16:29:48 ERROR yarn.ApplicationMaster: User class threw exception: 
java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: 
Client closed before SASL negotiation finished.
java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: 
Client closed before SASL negotiation finished.
at io.netty.util.concurrent.AbstractFuture.get(AbstractFuture.java:37)
at 
org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:156)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:556)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:483)
Caused by: javax.security.sasl.SaslException: Client closed before SASL 
negotiation finished.
at 
org.apache.hive.spark.client.rpc.Rpc$SaslClientHandler.dispose(Rpc.java:449)
at 
org.apache.hive.spark.client.rpc.SaslHandler.channelInactive(SaslHandler.java:90)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at 
org.apache.hive.spark.client.rpc.KryoMessageCodec.channelInactive(KryoMessageCodec.java:127)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
at 
io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:233)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:219)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:769)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe$5.run(AbstractChannel.java:567)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
15/12/09 16:29:48 INFO yarn.ApplicationMaster: Final app status: FAILED, 
exitCode: 15, (reason: User class threw exception: 
java.util.concurrent.ExecutionException: javax.security.sasl.SaslException: 
Client closed before SASL negotiation finished.)
15/12/09 16:29:57 ERROR yarn.ApplicationMaster: SparkContext did not initialize 
after waiting for 150000 ms. Please check earlier log output for errors. 
{code}

[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-01 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126219#comment-15126219
 ] 

Rui Li commented on HIVE-12650:
---

Hi [~vanzin], any idea on this?



[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-01 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127747#comment-15127747
 ] 

Rui Li commented on HIVE-12650:
---

Hi [~xuefuz], the exception you posted doesn't seem to be a timeout, or at least 
it's not related to {{hive.spark.client.server.connect.timeout}}, because the 
elapsed time is much less than 90s. I found the code that prints the log you 
mentioned:
{code}
while (sparkContextRef.get() == null && System.currentTimeMillis < deadline && !finished) {
  logInfo("Waiting for spark context initialization ... ")
  sparkContextRef.wait(10000L)
}

val sparkContext = sparkContextRef.get()
if (sparkContext == null) {
  logError(("SparkContext did not initialize after waiting for %d ms. Please check earlier"
    + " log output for errors. Failing the application.").format(totalWaitTime))
}
{code}
You can see the while loop can exit either on timeout or on finished being set to 
true. Since the elapsed time is short, it must be because the user thread 
(RemoteDriver) has finished abnormally:
{code}
val userThread = new Thread {
  override def run() {
    try {
      mainMethod.invoke(null, userArgs.toArray)
      finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
      logDebug("Done running users class")
    } catch {
      case e: InvocationTargetException =>
        e.getCause match {
          case _: InterruptedException =>
            // Reporter thread can interrupt to stop user class
          case SparkUserAppException(exitCode) =>
            val msg = s"User application exited with status $exitCode"
            logError(msg)
            finish(FinalApplicationStatus.FAILED, exitCode, msg)
          case cause: Throwable =>
            logError("User class threw exception: " + cause, cause)
            finish(FinalApplicationStatus.FAILED,
              ApplicationMaster.EXIT_EXCEPTION_USER_CLASS,
              "User class threw exception: " + cause)
        }
    }
  }
}
{code}
In conclusion, the problem here is not that we timed out creating the SparkContext. 
My guess is that something goes wrong before we create the SparkContext (you can 
refer to the constructor of RemoteDriver). I also found another property, 
{{hive.spark.client.connect.timeout}}, which defaults to 1000ms. It's used when 
RemoteDriver creates the RPC client, so it could be related, although I'm a little 
confused about the difference between the two configurations.

Regarding your last question, I tried submitting an application when no container 
is available. Spark-submit will wait until timeout (90s).



[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-01 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126779#comment-15126779
 ] 

Marcelo Vanzin commented on HIVE-12650:
---

{{spark.yarn.am.waitTime}} is not the time Spark waits for the master to 
launch. It's the time the Spark AM waits for the SparkContext to be created 
after the AM has been launched.

That being said, it's ok for the Hive timeout to be larger. 90s already seems 
like a really long time to wait, so I doubt the extra 30s will help, but it 
won't hurt.



[jira] [Commented] (HIVE-12650) Increase default value of hive.spark.client.server.connect.timeout to exceeds spark.yarn.am.waitTime

2016-02-01 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126293#comment-15126293
 ] 

Xuefu Zhang commented on HIVE-12650:


Hi [~lirui], the application master in the context of Hive on Spark takes a 
container from YARN. In a busy cluster, spark-submit may wait up to 
spark.yarn.am.waitTime to launch the master. On the other hand, Hive waits for 
hive.spark.client.server.connect.timeout before declaring that the remote 
driver is not connecting back. If the latter is less than the former, it's 
possible that Hive prematurely disconnects, causing an unstable condition. 
[~joyoungzh...@gmail.com] gave a description of the problem on the user list.

I think we need at least to make hive.spark.client.server.connect.timeout 
greater than spark.yarn.am.waitTime by default. To further guard against the 
problem, Hive can increase hive.spark.client.server.connect.timeout 
automatically based on the value of spark.yarn.am.waitTime.

[~vanzin], please share your thoughts as well.
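For concreteness, here is a hedged sketch of that default relationship in 
configuration form, using hive-site.xml and spark-defaults.conf. The 120s figure is 
just the example value from this issue's description, and this assumes 
hive.spark.client.server.connect.timeout accepts a millisecond value with a time 
suffix (its default being 90000ms):
{code}
<!-- hive-site.xml: make the Hive-side handshake timeout exceed spark.yarn.am.waitTime -->
<property>
  <name>hive.spark.client.server.connect.timeout</name>
  <value>120000ms</value>
</property>
{code}
with the Spark side left at its documented default:
{code}
# spark-defaults.conf (default shown for reference)
spark.yarn.am.waitTime 100s
{code}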
