Hi Matt,

The security group shouldn't be an issue; the ports listed in
`spark_ec2.py` are only for communication with the outside world.
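
That said, if you'd rather not open a wide range of ports, you can pin the
driver's port instead of letting it pick a random one. A minimal sketch (the
port number 50100 is just an example, and depending on your Spark version some
of the other ephemeral ports, such as the file server and block manager ports,
may not be configurable at all):

    # conf/spark-defaults.conf on the machine that runs the driver
    spark.driver.port    50100

With that set, executors connect back to the driver on a known port that you
can open explicitly in the security group.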

How did you launch your application? I notice you did not launch your driver
from your master node. What happens if you do? Also, there seem to be some
inconsistencies or missing pieces in the logs you posted. After an executor
says "driver disassociated," what happens in the driver logs? Is an exception
thrown, or something else?

It would be useful if you could also post your conf/spark-env.sh.
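
For reference, a typical standalone-mode spark-env.sh looks something like the
sketch below. The hostnames and values are placeholders (only the worker port
7101 is taken from your executor log), not a guess at your actual file:

    # conf/spark-env.sh -- illustrative only
    export SPARK_MASTER_IP=ip-10-0-0-1.ec2.internal   # hypothetical master host
    export SPARK_LOCAL_IP=$(hostname)                 # address this node binds to
    export SPARK_WORKER_PORT=7101                     # fixed worker port, as in your log
    export SPARK_WORKER_MEMORY=4g                     # hypothetical

Anything in there that rewrites hostnames or binds a daemon to the wrong
interface could explain executors failing to reach the driver.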

Andrew


2014-07-17 14:11 GMT-07:00 Marcelo Vanzin <van...@cloudera.com>:

> Hi Matt,
>
> I'm not very familiar with setup on EC2; the closest I can point you
> to is the "launch_cluster" function in ec2/spark_ec2.py, where the
> ports seem to be configured.
>
>
> On Thu, Jul 17, 2014 at 1:29 PM, Matt Work Coarr
> <mattcoarr.w...@gmail.com> wrote:
> > Thanks Marcelo!  This is a huge help!!
> >
> > Looking at the executor logs (in a vanilla Spark install, I'm finding them
> > in $SPARK_HOME/work/*)...
> >
> > It launches the executor, but it looks like the CoarseGrainedExecutorBackend
> > is having trouble talking to the driver (exactly what you said!!!).
> >
> > Do you know what range of random ports is used for the executor-to-driver
> > connection?  Is that range adjustable?  Any config setting or environment
> > variable?
> >
> > I manually set up my EC2 security group to include all the ports that the
> > spark ec2 script ($SPARK_HOME/ec2/spark_ec2.py) sets up in its security
> > groups.  They included (for those listed above 10000):
> > 19999
> > 50060
> > 50070
> > 50075
> > 60060
> > 60070
> > 60075
> >
> > Obviously I'll need to make some adjustments to my EC2 security group!
> > Just need to figure out exactly what should be in there.  To keep things
> > simple, I just have one security group for the master, slaves, and the
> > driver machine.
> >
> > In listing the port ranges in my current security group I looked at the
> > ports that spark_ec2.py sets up as well as the ports listed in the "spark
> > standalone mode" documentation page under "configuring ports for network
> > security":
> >
> > http://spark.apache.org/docs/latest/spark-standalone.html
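
One way to avoid enumerating individual ports is to add a self-referencing
rule so instances in the group can talk to each other freely, which is similar
in spirit to what spark_ec2.py does for the groups it creates. A rough sketch
using the AWS CLI; the group name "my-spark-group" is hypothetical:

    # allow all TCP traffic between members of the same security group
    aws ec2 authorize-security-group-ingress \
      --group-name my-spark-group \
      --protocol tcp --port 0-65535 \
      --source-group my-spark-group

External access (SSH, the web UIs, and so on) still needs its own, narrower
rules.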
> >
> >
> > Here are the relevant fragments from the executor log:
> >
> > Spark Executor Command: "/cask/jdk/bin/java" "-cp"
> > "::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar"
> > "-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100" "-Dspark.akka.frameSize=100"
> > "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend"
> > "akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler"
> > "0" "ip-10-202-8-45.ec2.internal" "8"
> > "akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker"
> > "app-20140717195146-0000"
> >
> > ========================================
> >
> > ...
> >
> > 14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the custom-built
> > native-hadoop library...
> >
> > 14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop with
> > error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
> >
> > 14/07/17 19:51:47 DEBUG NativeCodeLoader:
> > java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
> >
> > 14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop
> > library for your platform... using builtin-java classes where applicable
> >
> > 14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Falling back
> > to shell based
> >
> > 14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Group mapping
> > impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
> >
> > 14/07/17 19:51:48 DEBUG Groups: Group mapping
> > impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback;
> > cacheTimeout=300000
> >
> > 14/07/17 19:51:48 DEBUG SparkHadoopUtil: running as user: ec2-user
> >
> > ...
> >
> >
> > 14/07/17 19:51:48 INFO CoarseGrainedExecutorBackend: Connecting to driver:
> > akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler
> >
> > 14/07/17 19:51:48 INFO WorkerWatcher: Connecting to worker
> > akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
> >
> > 14/07/17 19:51:49 INFO WorkerWatcher: Successfully connected to
> > akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker
> >
> > 14/07/17 19:53:29 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
> > [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:55670] ->
> > [akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787] disassociated!
> > Shutting down.
> >
> >
> > Thanks a bunch!
> > Matt
> >
> >
> > On Thu, Jul 17, 2014 at 1:21 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> >>
> >> When I said the executor log, I meant the log of the process launched
> >> by the worker, not the worker's own log. In my CDH-based Spark install,
> >> those end up in /var/run/spark/work.
> >>
> >> If you look at your worker log, you'll see it's launching the executor
> >> process. So there should be something there.
> >>
> >> Since you say it works when both are run on the same node, that
> >> probably points to a communication issue, since the executor needs
> >> to connect back to the driver. Check whether any firewalls are
> >> blocking the ports Spark tries to use. (That's one of the
> >> non-resource-related cases that will cause that message.)
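
A quick way to test that from the worker side is to probe the driver's port
directly; in the log above the driver is listening on
ip-10-202-11-191.ec2.internal:46787, so something like this (run from a worker
node) should connect if no firewall or security group rule is in the way:

    # -v: verbose, -z: just scan, do not send data
    nc -vz ip-10-202-11-191.ec2.internal 46787

Keep in mind the driver port changes on every run unless spark.driver.port is
pinned, so re-check the actual port in the executor launch command first.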
>
>
>
> --
> Marcelo
>
