RE: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Oleg Shirokikh
(CoarseGrainedExecutorBackend.scala:163)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: 
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
... 7 more
/***/


When I go into the worker UI from the Master page, I can see the RUNNING executor - 
it's in the LOADING state. Here is its stderr:

/***/
15/02/23 18:15:05 INFO executor.CoarseGrainedExecutorBackend: Registered signal 
handlers for [TERM, HUP, INT]
15/02/23 18:15:06 INFO spark.SecurityManager: Changing view acls to: root,oleg
15/02/23 18:15:06 INFO spark.SecurityManager: Changing modify acls to: root,oleg
15/02/23 18:15:06 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(root, oleg); users 
with modify permissions: Set(root, oleg)
15/02/23 18:15:06 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/02/23 18:15:06 INFO Remoting: Starting remoting
15/02/23 18:15:06 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://driverpropsfetc...@ip-172-31-33-195.us-west-2.compute.internal:34609]
15/02/23 18:15:06 INFO util.Utils: Successfully started service 
'driverPropsFetcher' on port 34609.
/***/


So it seems that there is a problem with starting executors...
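For what it's worth, here is a minimal sketch (assuming the Spark 1.x Scala API) of the 
driver-side pieces that seem to be involved: the executors have to open a connection back 
to the driver to fetch its properties (the driverPropsFetcher step in the log above), and 
that call gives up after about 30 seconds. The master URL, port, and timeout below are 
placeholders/assumptions, not a verified fix.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only - all values are placeholders. The executors on EC2 must be able
// to reach the driver machine on spark.driver.port (and the other ports Spark
// opens); if the driver sits behind NAT or a closed firewall, the property
// fetch above times out exactly like in the stderr log.
val conf = new SparkConf()
  .setAppName("remote-submit-test")
  .setMaster("spark://<ec2-master-public-dns>:7077") // placeholder standalone master URL
  .set("spark.driver.port", "7078")                  // assumed fixed port so it can be opened in the AWS security group
  .set("spark.akka.askTimeout", "120")               // assumed Spark 1.x knob behind the 30-second "Futures timed out"
// spark.driver.host is deliberately left unset here, per the advice below to let Spark resolve it.

val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).sum())              // trivial job just to exercise executor startup
sc.stop()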


Hopefully this clarifies the environment and workflow. I'd be happy to provide 
any additional information.

Again, thanks a lot for your help and the time spent looking into this. Although I know 
the perfectly legitimate way to work with a Spark EC2 cluster (running the driver within 
the cluster), it's extremely interesting to understand how remoting works with 
Spark. And in general it would be very useful to have the ability to submit 
jobs remotely.

Thanks,
Oleg


-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com] 
Sent: Monday, February 23, 2015 1:22 AM
To: Oleg Shirokikh
Cc: user@spark.apache.org
Subject: Re: FW: Submitting jobs to Spark EC2 cluster remotely

What happens if you submit from the master node itself on EC2 (in client mode)? 
Does that work? What about in cluster mode?

It would be helpful if you could print the full command with which the executor is 
failing. That might show that spark.driver.host is being set strangely. IIRC we 
print the launch command before starting the executor.
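A complementary check from the driver side (a sketch - it just assumes a live SparkContext 
named sc) is to dump the resolved configuration, which shows what spark.driver.host and 
spark.driver.port the executors are being told to call back to:

// Prints every explicitly-set Spark property for this application,
// including whatever spark.driver.host ended up resolving to.
println(sc.getConf.toDebugString)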

Overall the standalone cluster mode is not as well tested across environments 
with asymmetric connectivity. I didn't actually realize that Akka (which the 
submission uses) can handle this scenario. But it does seem like the job is 
submitted; it's just not starting correctly.

- Patrick

On Mon, Feb 23, 2015 at 1:13 AM, Oleg Shirokikh o...@solver.com wrote:
 Patrick,

 I haven't changed the configs much. I just ran the EC2 script to create a 
 cluster with 1 master and 2 slaves. Then I try to submit the jobs from a remote machine, 
 leaving everything at the defaults configured by the Spark scripts. I've tried 
 changing configs as suggested in other mailing-list and Stack Overflow threads 
 (such as setting spark.driver.host, etc.), and removed (hopefully) all 
 security/firewall restrictions from AWS, but it didn't help.

 I think that what you are saying is exactly the issue: on my master node UI, 
 at the bottom, I can see the list of Completed Drivers, all in the ERROR 
 state...
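One way to cross-check the same driver status without the web UI (a sketch - it assumes the 
standalone master still serves its status page as JSON at /json on port 8080; treat the path 
and port as assumptions) is to pull the master's JSON view and read it alongside the page:

import scala.io.Source

// Dumps the master's reported state (URL, workers, running/completed applications, ...)
// as JSON; the hostname below is a placeholder.
val masterStatus = Source.fromURL("http://<ec2-master-public-dns>:8080/json", "UTF-8").mkString
println(masterStatus)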

 Thanks,
 Oleg

 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]
 Sent: Monday, February 23, 2015 12:59 AM
 To: Oleg Shirokikh
 Cc: user@spark.apache.org
 Subject: Re: Submitting jobs to Spark EC2 cluster remotely

 Can you list the other configs that you are setting? It looks like the executor 
 can't communicate back to the driver. I'm actually not sure it's a good idea 
 to set spark.driver.host here; you want to let Spark set that automatically.

 - Patrick
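To make that concrete, a minimal sketch of what the advice above amounts to (assumed Spark 1.x 
Scala API; the master URL is a placeholder): spark.driver.host is simply not set at all, and 
Spark fills it in from the machine the driver actually runs on.

import org.apache.spark.{SparkConf, SparkContext}

// Note what is absent: no spark.driver.host. Spark resolves the driver's
// address itself when the SparkContext starts.
val conf = new SparkConf()
  .setAppName("remote-submit-minimal")
  .setMaster("spark://<ec2-master-public-dns>:7077") // placeholder standalone master URL
val sc = new SparkContext(conf)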

 On Mon, Feb 23, 2015 at 12:48 AM, Oleg Shirokikh o...@solver.com wrote:
 Dear

FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Oleg Shirokikh
Patrick,

I haven't changed the configs much. I just ran the EC2 script to create a 
cluster with 1 master and 2 slaves. Then I try to submit the jobs from a remote machine, 
leaving everything at the defaults configured by the Spark scripts. I've tried 
changing configs as suggested in other mailing-list and Stack Overflow threads 
(such as setting spark.driver.host, etc.), and removed (hopefully) all 
security/firewall restrictions from AWS, but it didn't help.

I think that what you are saying is exactly the issue: on my master node UI, at 
the bottom, I can see the list of Completed Drivers, all in the ERROR state...

Thanks,
Oleg

-Original Message-
From: Patrick Wendell [mailto:pwend...@gmail.com] 
Sent: Monday, February 23, 2015 12:59 AM
To: Oleg Shirokikh
Cc: user@spark.apache.org
Subject: Re: Submitting jobs to Spark EC2 cluster remotely

Can you list the other configs that you are setting? It looks like the executor 
can't communicate back to the driver. I'm actually not sure it's a good idea to 
set spark.driver.host here; you want to let Spark set that automatically.

- Patrick

On Mon, Feb 23, 2015 at 12:48 AM, Oleg Shirokikh o...@solver.com wrote:
 Dear Patrick,

 Thanks a lot for your quick response. Indeed, following your advice I've 
 uploaded the jar onto S3; the FileNotFoundException is gone now and the job is 
 submitted in cluster deploy mode.
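As a side note, the same constraint applies to any additional jars the application pulls in at 
runtime: in cluster deploy mode they need to live somewhere every node can fetch them from (S3, 
HDFS, an HTTP server), not on the submitting machine's local disk. A hedged sketch, with a 
made-up bucket and path, assuming the cluster's Hadoop libraries have S3 access configured:

import org.apache.spark.{SparkConf, SparkContext}

// Illustration only: the bucket, path, and jar name are hypothetical.
val sc = new SparkContext(new SparkConf().setAppName("deps-from-s3"))
sc.addJar("s3n://my-bucket/libs/extra-dependency.jar") // cluster-visible location, analogous to the app jar on S3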

 However, now both modes (client and cluster) fail with the following errors in 
 the executors (they keep exiting/being killed, as I can see in the UI):

 15/02/23 08:42:46 ERROR security.UserGroupInformation: 
 PriviledgedActionException as:oleg 
 cause:java.util.concurrent.TimeoutException: Futures timed out after 
 [30 seconds]


 Full log is:

 15/02/23 01:59:11 INFO executor.CoarseGrainedExecutorBackend: 
 Registered signal handlers for [TERM, HUP, INT]
 15/02/23 01:59:12 INFO spark.SecurityManager: Changing view acls to: 
 root,oleg
 15/02/23 01:59:12 INFO spark.SecurityManager: Changing modify acls to: 
 root,oleg
 15/02/23 01:59:12 INFO spark.SecurityManager: SecurityManager: 
 authentication disabled; ui acls disabled; users with view 
 permissions: Set(root, oleg); users with modify permissions: Set(root, 
 oleg)
 15/02/23 01:59:12 INFO slf4j.Slf4jLogger: Slf4jLogger started
 15/02/23 01:59:12 INFO Remoting: Starting remoting
 15/02/23 01:59:13 INFO Remoting: Remoting started; listening on 
 addresses 
 :[akka.tcp://driverpropsfetc...@ip-172-31-33-194.us-west-2.compute.internal:39379]
 15/02/23 01:59:13 INFO util.Utils: Successfully started service 
 'driverPropsFetcher' on port 39379.
 15/02/23 01:59:43 ERROR security.UserGroupInformation: 
 PriviledgedActionException as:oleg 
 cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
 at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
 at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
 Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
 ... 4 more
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 
 seconds]
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
 at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
 ... 7 more




 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]
 Sent: Monday, February 23, 2015 12:17 AM
 To: Oleg Shirokikh
 Subject: Re: Submitting jobs to Spark EC2 cluster remotely

 The reason is that the file needs to be in a globally visible 
 filesystem from which the master node can download it. So it needs to be on 
 S3, for instance, rather than on your local 

Re: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Patrick Wendell
What happens if you submit from the master node itself on EC2 (in
client mode)? Does that work? What about in cluster mode?

It would be helpful if you could print the full command with which the
executor is failing. That might show that spark.driver.host is being
set strangely. IIRC we print the launch command before starting the
executor.

Overall the standalone cluster mode is not as well tested across
environments with asymmetric connectivity. I didn't actually realize
that Akka (which the submission uses) can handle this scenario. But it
does seem like the job is submitted; it's just not starting correctly.

- Patrick

On Mon, Feb 23, 2015 at 1:13 AM, Oleg Shirokikh o...@solver.com wrote:
 Patrick,

 I haven't changed the configs much. I just ran the EC2 script to create a 
 cluster with 1 master and 2 slaves. Then I try to submit the jobs from a remote machine, 
 leaving everything at the defaults configured by the Spark scripts. I've tried 
 changing configs as suggested in other mailing-list and Stack Overflow threads 
 (such as setting spark.driver.host, etc.), and removed (hopefully) all 
 security/firewall restrictions from AWS, but it didn't help.

 I think that what you are saying is exactly the issue: on my master node UI, 
 at the bottom, I can see the list of Completed Drivers, all in the ERROR 
 state...

 Thanks,
 Oleg

 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]
 Sent: Monday, February 23, 2015 12:59 AM
 To: Oleg Shirokikh
 Cc: user@spark.apache.org
 Subject: Re: Submitting jobs to Spark EC2 cluster remotely

 Can you list the other configs that you are setting? It looks like the executor 
 can't communicate back to the driver. I'm actually not sure it's a good idea 
 to set spark.driver.host here; you want to let Spark set that automatically.

 - Patrick

 On Mon, Feb 23, 2015 at 12:48 AM, Oleg Shirokikh o...@solver.com wrote:
 Dear Patrick,

 Thanks a lot for your quick response. Indeed, following your advice I've 
 uploaded the jar onto S3; the FileNotFoundException is gone now and the job is 
 submitted in cluster deploy mode.

 However, now both modes (client and cluster) fail with the following errors in 
 the executors (they keep exiting/being killed, as I can see in the UI):

 15/02/23 08:42:46 ERROR security.UserGroupInformation:
 PriviledgedActionException as:oleg
 cause:java.util.concurrent.TimeoutException: Futures timed out after
 [30 seconds]


 Full log is:

 15/02/23 01:59:11 INFO executor.CoarseGrainedExecutorBackend:
 Registered signal handlers for [TERM, HUP, INT]
 15/02/23 01:59:12 INFO spark.SecurityManager: Changing view acls to:
 root,oleg
 15/02/23 01:59:12 INFO spark.SecurityManager: Changing modify acls to:
 root,oleg
 15/02/23 01:59:12 INFO spark.SecurityManager: SecurityManager:
 authentication disabled; ui acls disabled; users with view
 permissions: Set(root, oleg); users with modify permissions: Set(root,
 oleg)
 15/02/23 01:59:12 INFO slf4j.Slf4jLogger: Slf4jLogger started
 15/02/23 01:59:12 INFO Remoting: Starting remoting
 15/02/23 01:59:13 INFO Remoting: Remoting started; listening on
 addresses
 :[akka.tcp://driverpropsfetc...@ip-172-31-33-194.us-west-2.compute.internal:39379]
 15/02/23 01:59:13 INFO util.Utils: Successfully started service 
 'driverPropsFetcher' on port 39379.
 15/02/23 01:59:43 ERROR security.UserGroupInformation:
 PriviledgedActionException as:oleg 
 cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
 at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
 at 
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
 at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
 Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
 ... 4 more
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
 [30 seconds]
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at 
 

Re: FW: Submitting jobs to Spark EC2 cluster remotely

2015-02-23 Thread Franc Carter
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
 at
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
 at
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
 at
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
 at
 org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
 Caused by: java.security.PrivilegedActionException:
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
 ... 4 more
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after
 [30 seconds]
 at
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at
 scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:127)
 at
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
 at
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
 ... 7 more
 /***/


 When I go into the worker UI from the Master page, I can see the RUNNING
 executor - it's in the LOADING state. Here is its stderr:

 /***/
 15/02/23 18:15:05 INFO executor.CoarseGrainedExecutorBackend: Registered
 signal handlers for [TERM, HUP, INT]
 15/02/23 18:15:06 INFO spark.SecurityManager: Changing view acls to:
 root,oleg
 15/02/23 18:15:06 INFO spark.SecurityManager: Changing modify acls to:
 root,oleg
 15/02/23 18:15:06 INFO spark.SecurityManager: SecurityManager:
 authentication disabled; ui acls disabled; users with view permissions:
 Set(root, oleg); users with modify permissions: Set(root, oleg)
 15/02/23 18:15:06 INFO slf4j.Slf4jLogger: Slf4jLogger started
 15/02/23 18:15:06 INFO Remoting: Starting remoting
 15/02/23 18:15:06 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://driverpropsfetc...@ip-172-31-33-195.us-west-2.compute.internal:34609]
 15/02/23 18:15:06 INFO util.Utils: Successfully started service
 'driverPropsFetcher' on port 34609.
 /***/


 So it seems that there is a problem with starting executors...


 Hopefully this clarifies the environment and workflow. I'd be happy to
 provide any additional information.

 Again, thanks a lot for your help and the time spent looking into this. Although I know
 the perfectly legitimate way to work with a Spark EC2 cluster (running the driver
 within the cluster), it's extremely interesting to understand how remoting
 works with Spark. And in general it would be very useful to have the
 ability to submit jobs remotely.

 Thanks,
 Oleg


 -Original Message-
 From: Patrick Wendell [mailto:pwend...@gmail.com]
 Sent: Monday, February 23, 2015 1:22 AM
 To: Oleg Shirokikh
 Cc: user@spark.apache.org
 Subject: Re: FW: Submitting jobs to Spark EC2 cluster remotely

 What happens if you submit from the master node itself on EC2 (in client
 mode)? Does that work? What about in cluster mode?

 It would be helpful if you could print the full command with which the executor
 is failing. That might show that spark.driver.host is being set strangely.
 IIRC we print the launch command before starting the executor.

 Overall the standalone cluster mode is not as well tested across
 environments with asymmetric connectivity. I didn't actually realize that
 Akka (which the submission uses) can handle this scenario. But it does seem
 like the job is submitted; it's just not starting correctly.

 - Patrick

 On Mon, Feb 23, 2015 at 1:13 AM, Oleg Shirokikh o...@solver.com wrote:
  Patrick,
 
  I haven't changed the configs much. I just ran the EC2 script to create
 a cluster with 1 master and 2 slaves. Then I try to submit the jobs from a remote
 machine, leaving everything at the defaults configured by the Spark scripts. I've
 tried changing configs as suggested in other mailing-list and Stack
 Overflow threads (such as setting spark.driver.host, etc.), and removed
 (hopefully) all security/firewall restrictions from AWS, but it didn't
 help.
 
  I think that what you are saying is exactly the issue: on my master node
 UI, at the bottom, I can see the list of Completed Drivers, all in the ERROR
 state...
 
  Thanks,
  Oleg
 
  -Original Message-
  From: Patrick Wendell [mailto:pwend...@gmail.com]
  Sent: Monday, February 23, 2015 12:59 AM
  To: Oleg Shirokikh
  Cc