That makes things clearer! Thanks. Issue resolved.
Sent from my iPhone

> On Dec 29, 2015, at 2:43 PM, Annabel Melongo <melongo_anna...@yahoo.com> wrote:
>
> Thanks Andrew for this awesome explanation
>
> On Tuesday, December 29, 2015 5:30 PM, Andrew Or <and...@databricks.com> wrote:
>
> Let me clarify a few things for everyone:
>
> There are three cluster managers: standalone, YARN, and Mesos. Each cluster
> manager can run in two deploy modes, client or cluster. In client mode, the
> driver runs on the machine that submitted the application (the client). In
> cluster mode, the driver runs on one of the worker machines in the cluster.
>
> When I say "standalone cluster mode" I am referring to the standalone cluster
> manager running in cluster deploy mode.
>
> Here's how the resources are distributed in each mode (omitting Mesos):
>
> Standalone / YARN client mode. The driver runs on the client machine (i.e.
> machine that ran Spark submit) so it should already have access to the jars.
> The executors then pull the jars from an HTTP server started in the driver.
>
> Standalone cluster mode. Spark submit does not upload your jars to the
> cluster, so all the resources you need must already be on all of the worker
> machines. The executors, however, actually just pull the jars from the driver
> as in client mode instead of finding it in their own local file systems.
>
> YARN cluster mode. Spark submit does upload your jars to the cluster. In
> particular, it puts the jars in HDFS so your driver can just read from there.
> As in other deployments, the executors pull the jars from the driver.
>
> When the docs say "If your application is launched through Spark submit, then
> the application jar is automatically distributed to all worker nodes," it is
> actually saying that your executors get their jars from the driver. This is
> true whether you're running in client mode or cluster mode.
>
> If the docs are unclear (and they seem to be), then we should update them. I
> have filed SPARK-12565 to track this.
>
> Please let me know if there's anything else I can help clarify.
>
> Cheers,
> -Andrew
>
> 2015-12-29 13:07 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:
>
> Andrew,
>
> Now I see where the confusion lies. Standalone cluster mode, your link, is
> nothing but a combination of client-mode and standalone mode, my link,
> without YARN.
>
> But I'm confused by this paragraph in your link:
>
>     If your application is launched through Spark submit, then the
>     application jar is automatically distributed to all worker nodes. For any
>     additional jars that your application depends on, you should specify them
>     through the --jars flag using comma as a delimiter (e.g. --jars jar1,jar2).
>
> That can't be true; this is only the case when Spark runs on top of YARN.
> Please correct me if I'm wrong.
>
> Thanks
>
> On Tuesday, December 29, 2015 2:54 PM, Andrew Or <and...@databricks.com> wrote:
>
> http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications
>
> 2015-12-29 11:48 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:
>
> Greg,
>
> Can you please send me a doc describing the standalone cluster mode?
> Honestly, I never heard about it.
>
> The three different modes I've listed appear in the last paragraph of this
> doc: Running Spark Applications (www.cloudera.com)
>
> On Tuesday, December 29, 2015 2:42 PM, Andrew Or <and...@databricks.com> wrote:
>
> The confusion here is the expression "standalone cluster mode". Either it's
> stand-alone or it's cluster mode but it can't be both.
>
> @Annabel That's not true. There is a standalone cluster mode where driver
> runs on one of the workers instead of on the client machine. What you're
> describing is standalone client mode.
>
> 2015-12-29 11:32 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:
>
> Greg,
>
> The confusion here is the expression "standalone cluster mode". Either it's
> stand-alone or it's cluster mode but it can't be both.
>
> With this in mind, here's how jars are uploaded:
> 1. Spark Stand-alone mode: client and driver run on the same machine; use
>    --packages option to submit a jar
> 2. Yarn Cluster-mode: client and driver run on separate machines;
>    additionally driver runs as a thread in ApplicationMaster; use --jars option
>    with a globally visible path to said jar
> 3. Yarn Client-mode: client and driver run on the same machine. driver is
>    NOT a thread in ApplicationMaster; use --packages to submit a jar
>
> On Tuesday, December 29, 2015 1:54 PM, Andrew Or <and...@databricks.com> wrote:
>
> Hi Greg,
>
> It's actually intentional for standalone cluster mode to not upload jars. One
> of the reasons why YARN takes at least 10 seconds before running any simple
> application is because there's a lot of random overhead (e.g. putting jars in
> HDFS). If this missing functionality is not documented somewhere then we
> should add that.
>
> Also, the packages problem seems legitimate. Thanks for reporting it. I have
> filed https://issues.apache.org/jira/browse/SPARK-12559.
>
> -Andrew
>
> 2015-12-29 4:18 GMT-08:00 Greg Hill <greg.h...@rackspace.com>:
>
> On 12/28/15, 5:16 PM, "Daniel Valdivia" <h...@danielvaldivia.com> wrote:
>
> >Hi,
> >
> >I'm trying to submit a job to a small spark cluster running in
> >standalone mode, however it seems like the jar file I'm submitting to the
> >cluster is "not found" by the worker nodes.
> >
> >I might have understood wrong, but I thought the Driver node would send
> >this jar file to the worker nodes, or should I manually send this file to
> >each worker node before I submit the job?
>
> Yes, you have misunderstood, but so did I.
> So the problem is that --deploy-mode cluster runs the Driver on the cluster
> as well, and you don't know which node it's going to run on, so every node
> needs access to the JAR. spark-submit does not pass the JAR along to the
> Driver, but the Driver will pass it to the executors. I ended up putting the
> JAR in HDFS and passing an hdfs:// path to spark-submit. This is a subtle
> difference from Spark on YARN which does pass the JAR along to the Driver
> automatically, and IMO should probably be fixed in spark-submit. It's
> really confusing for newcomers.
>
> Another problem I ran into that you also might is that --packages doesn't
> work with --deploy-mode cluster. It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails. The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
>
> Greg
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
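
For anyone finding this thread later, here is a rough Python sketch that restates Andrew's summary as a lookup table, plus Greg's workaround of handing spark-submit a path every node can read. The names JarFlow, JAR_FLOW, needs_globally_visible_jar and submit_args are made up for illustration and are not part of Spark; the master URL, class name and jar path in the usage note are placeholders. Mesos is left out, as in Andrew's message. Only the --master, --deploy-mode and --class options of spark-submit are used.

```python
from typing import List, NamedTuple


class JarFlow(NamedTuple):
    """Where the driver runs and how jars reach it, per Andrew's summary."""
    driver_runs_on: str            # "client" or "worker"
    submit_uploads_jars: bool      # does spark-submit ship the jar to the cluster?
    executors_pull_from: str       # always the driver, in every mode described


# Keys are (cluster manager, deploy mode). Hypothetical table, not a Spark API.
JAR_FLOW = {
    ("standalone", "client"): JarFlow("client", False, "driver"),
    ("yarn", "client"): JarFlow("client", False, "driver"),
    ("standalone", "cluster"): JarFlow("worker", False, "driver"),
    ("yarn", "cluster"): JarFlow("worker", True, "driver"),  # uploaded to HDFS
}


def needs_globally_visible_jar(manager: str, deploy_mode: str) -> bool:
    """True when the jar must already be reachable from whichever node
    ends up running the driver (e.g. via an hdfs:// path)."""
    flow = JAR_FLOW[(manager.lower(), deploy_mode.lower())]
    # A driver on the client machine already sees the local jar.
    # A driver on a worker only sees it if spark-submit uploaded it.
    return flow.driver_runs_on == "worker" and not flow.submit_uploads_jars


def submit_args(master_url: str, deploy_mode: str,
                main_class: str, app_jar: str) -> List[str]:
    """Build a spark-submit argument list; all values are caller-supplied."""
    return [
        "spark-submit",
        "--master", master_url,
        "--deploy-mode", deploy_mode,
        "--class", main_class,
        app_jar,
    ]
```

With this table, needs_globally_visible_jar("standalone", "cluster") is the only combination that returns True, which matches Daniel's "not found" symptom. In that case one would pass something like submit_args("spark://master-host:7077", "cluster", "com.example.Main", "hdfs:///path/to/app.jar"), where every value is a placeholder.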