Re: Can't submit job to stand alone cluster

Daniel Valdivia Tue, 29 Dec 2015 15:37:04 -0800

That makes things more clear! Thanks

Issue resolved



> On Dec 29, 2015, at 2:43 PM, Annabel Melongo <melongo_anna...@yahoo.com> 
> wrote:
> 
> Thanks Andrew for this awesome explanation 
> 
> 
> On Tuesday, December 29, 2015 5:30 PM, Andrew Or <and...@databricks.com> 
> wrote:
> 
> 
> Let me clarify a few things for everyone:
> 
> There are three cluster managers: standalone, YARN, and Mesos. Each cluster 
> manager can run in two deploy modes, client or cluster. In client mode, the 
> driver runs on the machine that submitted the application (the client). In 
> cluster mode, the driver runs on one of the worker machines in the cluster.
> 
> When I say "standalone cluster mode" I am referring to the standalone cluster 
> manager running in cluster deploy mode.
> 
> Here's how the resources are distributed in each mode (omitting Mesos):
> 
> Standalone / YARN client mode. The driver runs on the client machine (i.e. 
> machine that ran Spark submit) so it should already have access to the jars. 
> The executors then pull the jars from an HTTP server started in the driver.
> 
> Standalone cluster mode. Spark submit does not upload your jars to the 
> cluster, so all the resources you need must already be on all of the worker 
> machines. The executors, however, still pull the jars from the driver, as 
> in client mode, instead of finding them in their own local file systems.
> 
> YARN cluster mode. Spark submit does upload your jars to the cluster. In 
> particular, it puts the jars in HDFS so your driver can just read from there. 
> As in other deployments, the executors pull the jars from the driver.
> 
> When the docs say "If your application is launched through Spark submit, then 
> the application jar is automatically distributed to all worker nodes," it is 
> actually saying that your executors get their jars from the driver. This is 
> true whether you're running in client mode or cluster mode.
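
A minimal sketch of the summary above as a lookup table, in case it helps to see the modes side by side. The names `ModeSummary`, `MODES`, and `must_predistribute_jar` are hypothetical and are not part of any Spark API; the strings only paraphrase the explanation in this message, and Mesos is omitted as it is there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeSummary:
    driver_location: str          # where the driver process runs
    driver_jar_source: str        # how the driver gets the application jar
    executor_jar_source: str      # how the executors get the jar
    jar_must_be_on_workers: bool  # True if spark-submit does not ship the jar


# Keys are (cluster manager, deploy mode). Hypothetical structure for illustration.
MODES = {
    ("standalone", "client"): ModeSummary(
        "client machine", "already on the client machine",
        "pulled from the driver", False),
    ("yarn", "client"): ModeSummary(
        "client machine", "already on the client machine",
        "pulled from the driver", False),
    ("standalone", "cluster"): ModeSummary(
        "one of the worker machines", "must already be on every worker",
        "pulled from the driver", True),
    ("yarn", "cluster"): ModeSummary(
        "one of the worker machines", "read from HDFS after spark-submit uploads it",
        "pulled from the driver", False),
}


def must_predistribute_jar(manager: str, deploy_mode: str) -> bool:
    """Return True when the jar has to be placed on the cluster by hand."""
    return MODES[(manager, deploy_mode)].jar_must_be_on_workers
```

Under these assumptions, `must_predistribute_jar("standalone", "cluster")` is the only combination that returns `True`, which matches the case that started this thread.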
> 
> If the docs are unclear (and they seem to be), then we should update them. I 
> have filed SPARK-12565 to track this.
> 
> Please let me know if there's anything else I can help clarify.
> 
> Cheers,
> -Andrew
> 
> 
> 
> 
> 2015-12-29 13:07 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:
> Andrew,
> 
> Now I see where the confusion lies. Standalone cluster mode (your link) is 
> nothing but a combination of client mode and standalone mode (my link), 
> without YARN.
> 
> But I'm confused by this paragraph in your link:
> 
>     If your application is launched through Spark submit, then the
>     application jar is automatically distributed to all worker nodes. For
>     any additional jars that your application depends on, you should
>     specify them through the --jars flag using comma as a delimiter
>     (e.g. --jars jar1,jar2).
> 
> That can't be true; this is only the case when Spark runs on top of YARN. 
> Please correct me if I'm wrong.
> 
> Thanks
>   
> 
> 
> On Tuesday, December 29, 2015 2:54 PM, Andrew Or <and...@databricks.com> 
> wrote:
> 
> 
> http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications
> 
> 2015-12-29 11:48 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:
> Greg,
> 
> Can you please send me a doc describing the standalone cluster mode? 
> Honestly, I've never heard of it.
> 
> The three different modes I've listed appear in the last paragraph of this 
> doc: Running Spark Applications (www.cloudera.com)
> 
> 
> 
> On Tuesday, December 29, 2015 2:42 PM, Andrew Or <and...@databricks.com> 
> wrote:
> 
> 
> > The confusion here is the expression "standalone cluster mode". Either it's 
> > stand-alone or it's cluster mode but it can't be both.
> 
> @Annabel That's not true. There is a standalone cluster mode where the 
> driver runs on one of the workers instead of on the client machine. What 
> you're describing is standalone client mode.
> 
> 2015-12-29 11:32 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:
> Greg,
> 
> The confusion here is the expression "standalone cluster mode". Either it's 
> stand-alone or it's cluster mode but it can't be both.
> 
> With this in mind, here's how jars are uploaded:
>     1. Spark stand-alone mode: client and driver run on the same machine; use
>        the --packages option to submit a jar
>     2. YARN cluster mode: client and driver run on separate machines;
>        additionally, the driver runs as a thread in ApplicationMaster; use the
>        --jars option with a globally visible path to said jar
>     3. YARN client mode: client and driver run on the same machine; the driver
>        is NOT a thread in ApplicationMaster; use --packages to submit a jar
> 
> 
> On Tuesday, December 29, 2015 1:54 PM, Andrew Or <and...@databricks.com> 
> wrote:
> 
> 
> Hi Greg,
> 
> It's actually intentional that standalone cluster mode does not upload jars. 
> One of the reasons YARN takes at least 10 seconds before running even a 
> simple application is that there's a lot of overhead (e.g. putting jars in 
> HDFS). If this missing functionality is not documented somewhere, then we 
> should document it.
> 
> Also, the packages problem seems legitimate. Thanks for reporting it. I have 
> filed https://issues.apache.org/jira/browse/SPARK-12559.
> 
> -Andrew
> 
> 2015-12-29 4:18 GMT-08:00 Greg Hill <greg.h...@rackspace.com>:
> 
> 
> On 12/28/15, 5:16 PM, "Daniel Valdivia" <h...@danielvaldivia.com> wrote:
> 
> >Hi,
> >
> >I'm trying to submit a job to a small Spark cluster running in standalone
> >mode; however, it seems like the jar file I'm submitting to the cluster is
> >"not found" by the worker nodes.
> >
> >I might have misunderstood, but I thought the Driver node would send this
> >jar file to the worker nodes. Or should I manually send this file to each
> >worker node before I submit the job?
> 
> Yes, you have misunderstood, but so did I.  So the problem is that
> --deploy-mode cluster runs the Driver on the cluster as well, and you
> don't know which node it's going to run on, so every node needs access to
> the JAR.  spark-submit does not pass the JAR along to the Driver, but the
> Driver will pass it to the executors.  I ended up putting the JAR in HDFS
> and passing an hdfs:// path to spark-submit.  This is a subtle difference
> from Spark on YARN which does pass the JAR along to the Driver
> automatically, and IMO should probably be fixed in spark-submit.  It's
> really confusing for newcomers.
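
A minimal sketch of the workaround described above, written as a small Python helper that assembles the spark-submit argument list. The helper names, the master host, the class name, and the jar paths are all hypothetical; only the flags `--master`, `--deploy-mode`, `--class`, and `--jars` are real spark-submit options. The check is a rough heuristic: it flags a plain local path in standalone cluster mode, where the jar would have to exist on every worker already.

```python
from urllib.parse import urlparse


def build_submit_args(master, deploy_mode, main_class, app_jar, extra_jars=()):
    """Assemble a spark-submit argument list (options first, application jar last)."""
    args = ["spark-submit",
            "--master", master,
            "--deploy-mode", deploy_mode,
            "--class", main_class]
    if extra_jars:
        # The docs quoted in this thread use a comma as the --jars delimiter.
        args += ["--jars", ",".join(extra_jars)]
    args.append(app_jar)
    return args


def jar_needs_manual_copy(master, deploy_mode, app_jar):
    """Heuristic: standalone master + cluster deploy mode + a local jar path."""
    is_standalone = master.startswith("spark://")
    is_local_path = urlparse(app_jar).scheme in ("", "file")
    return is_standalone and deploy_mode == "cluster" and is_local_path


# Hypothetical values for illustration only.
master = "spark://master-host:7077"
print(jar_needs_manual_copy(master, "cluster", "/home/me/my-job.jar"))
print(" ".join(build_submit_args(master, "cluster", "com.example.MyJob",
                                 "hdfs:///jobs/my-job.jar")))
```

Passing an hdfs:// path, as in the last call, sidesteps the question of which worker ends up running the Driver, since every node can read the jar from the same location.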
> 
> Another problem I ran into, and that you might hit too, is that --packages
> doesn't work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
> 
> Greg
> 
> 
