Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-21 Thread Matt Work Coarr
I got this working by having our sysadmin update our security group to
allow incoming traffic from the local subnet on ports 1-65535.  I'm not
sure if there's a more specific range I could have used, but so far,
everything is running!
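
For anyone who wants to script that rule rather than add it by hand, here is a rough
sketch using boto 2.x (the same library spark_ec2.py uses); the security group name
and subnet CIDR below are placeholders for your own values:

    import boto.ec2

    # Connect to the region the cluster runs in.
    conn = boto.ec2.connect_to_region("us-east-1")

    # Look up the (placeholder) security group shared by master, slaves, and driver.
    group = conn.get_all_security_groups(groupnames=["my-spark-cluster"])[0]

    # Allow all TCP ports from the local subnet (placeholder CIDR), which is the
    # rule our sysadmin added.
    group.authorize(ip_protocol="tcp", from_port=1, to_port=65535,
                    cidr_ip="10.202.0.0/16")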

Thanks for all the responses Marcelo and Andrew!!

Matt


On Thu, Jul 17, 2014 at 9:10 PM, Andrew Or  wrote:

> Hi Matt,
>
> The security group shouldn't be an issue; the ports listed in
> `spark_ec2.py` are only for communication with the outside world.
>
> How did you launch your application? I notice you did not launch your
> driver from your Master node. What happens if you do? Another thing is
> that there seems to be some inconsistency or missing pieces in the logs you
> posted. After an executor says "driver disassociated," what happens in the
> driver logs? Is an exception thrown or something?
>
> It would be useful if you could also post your conf/spark-env.sh.
>
> Andrew
>
>
> 2014-07-17 14:11 GMT-07:00 Marcelo Vanzin :
>
>> Hi Matt,
>>
>> I'm not very familiar with setup on ec2; the closest I can point you
>> at is to look at the "launch_cluster" in ec2/spark_ec2.py, where the
>> ports seem to be configured.
>>
>>
>> On Thu, Jul 17, 2014 at 1:29 PM, Matt Work Coarr
>>  wrote:
>> > Thanks Marcelo!  This is a huge help!!
>> >
>> > Looking at the executor logs (in a vanilla spark install, I'm finding
>> > them in $SPARK_HOME/work/*)...
>> >
>> > It launches the executor, but it looks like the
>> > CoarseGrainedExecutorBackend is having trouble talking to the driver
>> > (exactly what you said!!!).
>> >
>> > Do you know the range of random ports that is used for the
>> > executor-to-driver connection?  Is that range adjustable?  Is there a
>> > config setting or environment variable?
>> >
>> > I manually set up my EC2 security group to include all the ports that
>> > the spark ec2 script ($SPARK_HOME/ec2/spark_ec2.py) sets up in its
>> > security groups.  They included (for those listed above 1):
>> > 1
>> > 50060
>> > 50070
>> > 50075
>> > 60060
>> > 60070
>> > 60075
>> >
>> > Obviously I'll need to make some adjustments to my EC2 security group!
>> > Just need to figure out exactly what should be in there.  To keep
>> > things simple, I just have one security group for the master, slaves,
>> > and the driver machine.
>> >
>> > In listing the port ranges in my current security group I looked at
>> > the ports that spark_ec2.py sets up as well as the ports listed in the
>> > "spark standalone mode" documentation page under "configuring ports
>> > for network security":
>> >
>> > http://spark.apache.org/docs/latest/spark-standalone.html
>> >
>> >
>> > Here are the relevant fragments from the executor log:
>> >
>> > Spark Executor Command: "/cask/jdk/bin/java" "-cp"
>> > "::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar"
>> > "-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100"
>> > "-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M"
>> > "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>> > "akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler"
>> > "0" "ip-10-202-8-45.ec2.internal" "8"
>> > "akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker"
>> > "app-20140717195146-"
>> >
>> > ...
>> >
>> > 14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...
>> >
>> > 14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
>> >
>> > 14/07/17 19:51:47 DEBUG NativeCodeLoader: java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
>> >
>> > 14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-17 Thread Matt Work Coarr
Thanks Marcelo!  This is a huge help!!

Looking at the executor logs (in a vanilla spark install, I'm finding them
in $SPARK_HOME/work/*)...

It launches the executor, but it looks like the
CoarseGrainedExecutorBackend is having trouble talking to the driver
(exactly what you said!!!).

Do you know the range of random ports that is used for the
executor-to-driver connection?  Is that range adjustable?  Is there a config
setting or environment variable?

I manually set up my EC2 security group to include all the ports that the
spark ec2 script ($SPARK_HOME/ec2/spark_ec2.py) sets up in its security
groups.  They included (for those listed above 1):
1
50060
50070
50075
60060
60070
60075

Obviously I'll need to make some adjustments to my EC2 security group!
Just need to figure out exactly what should be in there.  To keep things
simple, I just have one security group for the master, slaves, and the
driver machine.

In listing the port ranges in my current security group I looked at the
ports that spark_ec2.py sets up as well as the ports listed in the "spark
standalone mode" documentation page under "configuring ports for network
security":

http://spark.apache.org/docs/latest/spark-standalone.html
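
(As an aside: the driver side of that channel can, I believe, be pinned with
spark.driver.port, one of the properties covered on that docs page; depending on
the Spark version, other services may still pick random ports, which is presumably
why opening the whole range ended up being necessary. A rough PySpark sketch,
with a placeholder master URL and port number:

    from pyspark import SparkConf, SparkContext

    # Pin the driver's listening port so a narrow firewall rule can cover it.
    # The master URL, app name, and port number here are placeholders.
    conf = (SparkConf()
            .setMaster("spark://master-host:7077")
            .setAppName("pinned-driver-port")
            .set("spark.driver.port", "51000"))
    sc = SparkContext(conf=conf)
)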


Here are the relevant fragments from the executor log:

Spark Executor Command: "/cask/jdk/bin/java" "-cp"
"::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar"
"-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100"
"-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M"
"org.apache.spark.executor.CoarseGrainedExecutorBackend"
"akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler"
"0" "ip-10-202-8-45.ec2.internal" "8"
"akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker"
"app-20140717195146-"


...

14/07/17 19:51:47 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...

14/07/17 19:51:47 DEBUG NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path

14/07/17 19:51:47 DEBUG NativeCodeLoader: java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib

14/07/17 19:51:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Falling back to shell based

14/07/17 19:51:47 DEBUG JniBasedUnixGroupsMappingWithFallback: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping

14/07/17 19:51:48 DEBUG Groups: Group mapping impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback; cacheTimeout=30

14/07/17 19:51:48 DEBUG SparkHadoopUtil: running as user: ec2-user

...


14/07/17 19:51:48 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787/user/CoarseGrainedScheduler

14/07/17 19:51:48 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker

14/07/17 19:51:49 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker

14/07/17 19:53:29 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:55670] -> [akka.tcp://spark@ip-10-202-11-191.ec2.internal:46787] disassociated! Shutting down.


Thanks a bunch!
Matt


On Thu, Jul 17, 2014 at 1:21 PM, Marcelo Vanzin  wrote:

> When I mentioned the executor log, I meant the log of the process launched
> by the worker, not the worker's own log. In my CDH-based Spark install,
> those end up in /var/run/spark/work.
>
> If you look at your worker log, you'll see it's launching the executor
> process. So there should be something there.
>
> Since you say it works when both are run in the same node, that
> probably points to some communication issue, since the executor needs
> to connect back to the driver. Check to see if you don't have any
> firewalls blocking the ports Spark tries to use. (That's one of the
> non-resource-related cases that will cause that message.)
>
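
One cheap way to test the firewall theory above is to try the driver's advertised
host and port directly from a worker node. A minimal Python sketch (host and port
taken from the executor log in this thread; substitute your own):

    import socket

    # Driver endpoint as advertised in the executor log above.
    driver_host = "ip-10-202-11-191.ec2.internal"
    driver_port = 46787

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(5)
    try:
        sock.connect((driver_host, driver_port))
        print("driver port is reachable from this worker")
    except socket.error as e:
        print("cannot reach driver: %s" % e)
    finally:
        sock.close()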


Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-16 Thread Matt Work Coarr
0-202-8-45.ec2.internal:46848]: Error
[Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]] [

akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]

Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: ip-10-202-8-45.ec2.internal/10.202.8.45:46848

]

14/07/16 19:34:09 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101] ->
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]: Error
[Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]] [

akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]

Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: ip-10-202-8-45.ec2.internal/10.202.8.45:46848

]

14/07/16 19:34:09 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101] ->
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]: Error
[Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]] [

akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@ip-10-202-8-45.ec2.internal:46848]

Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: ip-10-202-8-45.ec2.internal/10.202.8.45:46848

]

Spark assembly has been built with Hive, including Datanucleus jars on
classpath

14/07/16 19:34:10 INFO ExecutorRunner: Launch command: "/cask/jdk/bin/java"
"-cp"
"::/cask/spark/conf:/cask/spark/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/cask/spark/lib/datanucleus-api-jdo-3.2.1.jar:/cask/spark/lib/datanucleus-rdbms-3.2.1.jar:/cask/spark/lib/datanucleus-core-3.2.2.jar"
"-XX:MaxPermSize=128m" "-Dspark.akka.frameSize=100"
"-Dspark.akka.frameSize=100" "-Xms512M" "-Xmx512M"
"org.apache.spark.executor.CoarseGrainedExecutorBackend"
"akka.tcp://spark@ip-10-202-11-191.ec2.internal:47740/user/CoarseGrainedScheduler"
"1" "ip-10-202-8-45.ec2.internal" "8"
"akka.tcp://sparkWorker@ip-10-202-8-45.ec2.internal:7101/user/Worker"
"app-20140716193227-"


Matt


On Tue, Jul 15, 2014 at 5:47 PM, Marcelo Vanzin  wrote:

> Have you looked at the slave machine to see if the process has
> actually launched? If it has, have you tried peeking into its log
> file?
>
> (That error is printed whenever the executors fail to report back to
> the driver. Insufficient resources to launch the executor is the most
> common cause of that, but not the only one.)
>
> On Tue, Jul 15, 2014 at 2:43 PM, Matt Work Coarr
>  wrote:
> > Hello spark folks,
> >
> > I have a simple spark cluster set up but I can't get jobs to run on it.
> > I am using the standalone mode.
> >
> > One master, one slave.  Both machines have 32GB ram and 8 cores.
> >
> > The slave is set up with one worker that has 8 cores and 24GB of memory
> > allocated.
> >
> > My application requires 2 cores and 5GB of memory.
> >
> > However, I'm getting the following error:
> >
> > WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
> > your cluster UI to ensure that workers are registered and have sufficient
> > memory
> >
> >
> > What else should I check for?
> >
> > This is a simplified setup (the real cluster has 20 nodes).  In this
> > simplified setup I am running the master and the slave manually.  The
> > master's web page shows the worker and it shows the application and the
> > memory/core requirements match what I mentioned above.
> >
> > I also tried running the SparkPi example via bin/run-example and get the
> > same result.  It requires 8 cores and 512MB of memory, which is also
> > clearly within the limits of the available worker.
> >
> > Any ideas would be greatly appreciated!!
> >
> > Matt
>
>
>
> --
> Marcelo
>


can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-15 Thread Matt Work Coarr
Hello spark folks,

I have a simple spark cluster set up but I can't get jobs to run on it.  I
am using the standalone mode.

One master, one slave.  Both machines have 32GB ram and 8 cores.

The slave is set up with one worker that has 8 cores and 24GB of memory
allocated.

My application requires 2 cores and 5GB of memory.
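
For reference, a request like that is expressed roughly as follows on the
application side (a PySpark sketch; the master URL and app name are placeholders):

    from pyspark import SparkConf, SparkContext

    # Ask the standalone master for at most 2 cores and 5GB per executor.
    conf = (SparkConf()
            .setMaster("spark://master-host:7077")   # placeholder master URL
            .setAppName("simple-test-app")
            .set("spark.cores.max", "2")
            .set("spark.executor.memory", "5g"))
    sc = SparkContext(conf=conf)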

However, I'm getting the following error:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have sufficient
memory

What else should I check for?

This is a simplified setup (the real cluster has 20 nodes).  In this
simplified setup I am running the master and the slave manually.  The
master's web page shows the worker and it shows the application and the
memory/core requirements match what I mentioned above.

I also tried running the SparkPi example via bin/run-example and get the
same result.  It requires 8 cores and 512MB of memory, which is also
clearly within the limits of the available worker.

Any ideas would be greatly appreciated!!

Matt


Re: creating new ami image for spark ec2 commands

2014-06-06 Thread Matt Work Coarr
Thanks Akhil! I'll give that a try!


Re: creating new ami image for spark ec2 commands

2014-06-06 Thread Matt Work Coarr
Thanks for the response Akhil.  My email may not have been clear, but my
question is about what should be inside the AMI image, not how to pass an
AMI id into the spark_ec2 script.

Should certain packages be installed? Do certain directories need to exist?
etc...


On Fri, Jun 6, 2014 at 4:40 AM, Akhil Das 
wrote:

> You can comment out this function and create a new one that returns
> your AMI id; the rest of the script will run fine.
>
> def get_spark_ami(opts):
>   instance_types = {
>     "m1.small":    "pvm",
>     "m1.medium":   "pvm",
>     "m1.large":    "pvm",
>     "m1.xlarge":   "pvm",
>     "t1.micro":    "pvm",
>     "c1.medium":   "pvm",
>     "c1.xlarge":   "pvm",
>     "m2.xlarge":   "pvm",
>     "m2.2xlarge":  "pvm",
>     "m2.4xlarge":  "pvm",
>     "cc1.4xlarge": "hvm",
>     "cc2.8xlarge": "hvm",
>     "cg1.4xlarge": "hvm",
>     "hs1.8xlarge": "hvm",
>     "hi1.4xlarge": "hvm",
>     "m3.xlarge":   "hvm",
>     "m3.2xlarge":  "hvm",
>     "cr1.8xlarge": "hvm",
>     "i2.xlarge":   "hvm",
>     "i2.2xlarge":  "hvm",
>     "i2.4xlarge":  "hvm",
>     "i2.8xlarge":  "hvm",
>     "c3.large":    "pvm",
>     "c3.xlarge":   "pvm",
>     "c3.2xlarge":  "pvm",
>     "c3.4xlarge":  "pvm",
>     "c3.8xlarge":  "pvm"
>   }
>   if opts.instance_type in instance_types:
>     instance_type = instance_types[opts.instance_type]
>   else:
>     instance_type = "pvm"
>     print >> stderr, \
>         "Don't recognize %s, assuming type is pvm" % opts.instance_type
>
>   ami_path = "%s/%s/%s" % (AMI_PREFIX, opts.region, instance_type)
>   try:
>     ami = urllib2.urlopen(ami_path).read().strip()
>     print "Spark AMI: " + ami
>   except:
>     print >> stderr, "Could not resolve AMI at: " + ami_path
>     sys.exit(1)
>
>   return ami
>
> Thanks
> Best Regards
>
>
> On Fri, Jun 6, 2014 at 2:14 AM, Matt Work Coarr 
> wrote:
>
>> How would I go about creating a new AMI image that I can use with the
>> spark ec2 commands? I can't seem to find any documentation.  I'm looking
>> for a list of steps that I'd need to perform to make an Amazon Linux image
>> ready to be used by the spark ec2 tools.
>>
>> I've been reading through the spark 1.0.0 documentation, looking at the
>> script itself (spark_ec2.py), and looking at the github project
>> mesos/spark-ec2.
>>
>> From what I can tell, the spark_ec2.py script looks up the id of the AMI
>> based on the region and machine type (hvm or pvm) using static content
>> derived from the github repo mesos/spark-ec2.
>>
>> The spark ec2 script loads the AMI id from this base url:
>> https://raw.github.com/mesos/spark-ec2/v2/ami-list
>> (Which presumably comes from https://github.com/mesos/spark-ec2 )
>>
>> For instance, I'm working with us-east-1 and pvm, I'd end up with AMI id:
>> ami-5bb18832
>>
>> Is there a list of instructions for how this AMI was created?  Assuming
>> I'm starting with my own Amazon Linux image, what would I need to do to
>> make it usable where I could pass that AMI id to spark_ec2.py rather than
>> using the default spark-provided AMI?
>>
>> Thanks,
>> Matt
>>
>
>
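
For the record, the kind of replacement Akhil describes can be as small as this
(the AMI id below is a placeholder for your own image id):

    def get_spark_ami(opts):
      # Skip the region/instance-type lookup entirely and return a fixed AMI id.
      ami = "ami-xxxxxxxx"  # placeholder: your own Amazon Linux image id
      print("Spark AMI: " + ami)
      return ami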


creating new ami image for spark ec2 commands

2014-06-05 Thread Matt Work Coarr
How would I go about creating a new AMI image that I can use with the spark
ec2 commands? I can't seem to find any documentation.  I'm looking for a
list of steps that I'd need to perform to make an Amazon Linux image ready
to be used by the spark ec2 tools.

I've been reading through the spark 1.0.0 documentation, looking at the
script itself (spark_ec2.py), and looking at the github project
mesos/spark-ec2.

From what I can tell, the spark_ec2.py script looks up the id of the AMI
based on the region and machine type (hvm or pvm) using static content
derived from the github repo mesos/spark-ec2.

The spark ec2 script loads the AMI id from this base url:
https://raw.github.com/mesos/spark-ec2/v2/ami-list
(Which presumably comes from https://github.com/mesos/spark-ec2 )

For instance, I'm working with us-east-1 and pvm, I'd end up with AMI id:
ami-5bb18832
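
That lookup is small enough to reproduce by hand; here is a sketch of what
spark_ec2.py does with the base URL above, using the region and virtualization
type from my example:

    import urllib2

    AMI_PREFIX = "https://raw.github.com/mesos/spark-ec2/v2/ami-list"

    # Fetch the AMI id for a given region and virtualization type,
    # the same way spark_ec2.py resolves it.
    ami_path = "%s/%s/%s" % (AMI_PREFIX, "us-east-1", "pvm")
    ami = urllib2.urlopen(ami_path).read().strip()
    print(ami)  # e.g. ami-5bb18832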

Is there a list of instructions for how this AMI was created?  Assuming I'm
starting with my own Amazon Linux image, what would I need to do to make it
usable where I could pass that AMI id to spark_ec2.py rather than using the
default spark-provided AMI?

Thanks,
Matt


spark ec2 commandline tool error "VPC security groups may not be used for a non-VPC launch"

2014-05-19 Thread Matt Work Coarr
Hi, I'm attempting to run "spark-ec2 launch" on AWS.  My AWS instances
would be in our EC2 VPC (which seems to be causing a problem).

The two security groups MyClusterName-master and MyClusterName-slaves have
already been setup with the same ports open as the security group that
spark-ec2 tries to create.  (My company has security rules where I don't
have permissions to create security groups, so they have to be created by
someone else ahead of time.)

I'm getting the error "VPC security groups may not be used for a non-VPC
launch" when I try to run "spark-ec2 launch".

Is there something I need to do to make spark-ec2 launch the master and
slave instances within the VPC?

Here's the command-line and the error that I get...

command-line (I've changed the clustername to something generic):

$SPARK_HOME/ec2/spark-ec2 --key-pair=MyKeyPair
'--identity-file=~/.ssh/id_mysshkey' --slaves=2 --instance-type=m3.large
--region=us-east-1 --zone=us-east-1a --ami=myami --spark-version=0.9.1
launch MyClusterName


error:

ERROR:boto:400 Bad Request

ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidParameterCombination</Code><Message>VPC
security groups may not be used for a non-VPC
launch</Message></Error></Errors><RequestID>8374cac5-5869-4f38-a141-2fdaf3b18326</RequestID></Response>

Setting up security groups...

Searching for existing cluster MyClusterName...

Launching instances...

Traceback (most recent call last):

  File "./spark_ec2.py", line 806, in 

main()

  File "./spark_ec2.py", line 799, in main

real_main()

  File "./spark_ec2.py", line 682, in real_main

conn, opts, cluster_name)

  File "./spark_ec2.py", line 344, in launch_cluster

block_device_map = block_map)

  File
"/opt/spark-0.9.1-bin-hadoop1/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/image.py",
line 255, in run

  File
"/opt/spark-0.9.1-bin-hadoop1/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py",
line 678, in run_instances

  File
"/opt/spark-0.9.1-bin-hadoop1/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py",
line 925, in get_object

boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request

<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidParameterCombination</Code><Message>VPC
security groups may not be used for a non-VPC
launch</Message></Error></Errors><RequestID>8374cac5-5869-4f38-a141-2fdaf3b18326</RequestID></Response>
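
For what it's worth, the boto call in that traceback passes security group *names*,
which EC2 only accepts for non-VPC (EC2-Classic) launches; launching into a VPC
needs a subnet id plus security group *ids*. A hedged sketch of the difference,
assuming a boto version with VPC support (all ids below are placeholders):

    import boto.ec2

    conn = boto.ec2.connect_to_region("us-east-1")

    # EC2-Classic style (what spark_ec2.py does here): security group *names*.
    # conn.run_instances("ami-xxxxxxxx", security_groups=["MyClusterName-master"],
    #                    instance_type="m3.large", key_name="MyKeyPair")

    # VPC style: a subnet id and security group *ids* instead of names.
    reservation = conn.run_instances(
        "ami-xxxxxxxx",                        # placeholder AMI id
        instance_type="m3.large",
        key_name="MyKeyPair",
        subnet_id="subnet-xxxxxxxx",           # placeholder VPC subnet
        security_group_ids=["sg-xxxxxxxx"])    # placeholder group id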

Thanks for your help!!

Matt