I am looking to decide what is best for my production-grade Spark application(s).
YARN
=====
1. YARN supports security. When Spark is run over YARN, communication between processes can use secure authentication through Kerberos.
2. A Spark standalone cluster can only run Spark jobs and nothing else. With YARN you can run different kinds of jobs, such as M/R or Spark, on the same cluster.
3. The YARN scheduler has features such as queues, hierarchical queues with pluggable policies, automatic placement of apps into queues, easy installation, and ACLs for queues. These are missing from the standalone Spark scheduler. Resources are used more intelligently and dynamically.
4. The Spark standalone scheduler runs an executor for each application on every node in the cluster, whereas with YARN you can run executor(s) on a subset of nodes. I haven't tested this.
5. On YARN, Spark supports running the driver on the client machine itself (yarn-client), which requires the client application to keep running for the lifetime of the application. In yarn-cluster mode, the Spark driver runs inside the Application Master, so the client program can exit or do something else. This feature might be missing from the standalone scheduler.
6. YARN provides finer control of resources such as CPU cores. The number of executors per node is configurable with YARN depending on the number of CPUs present on the node; this is missing on Mesos, though it might be available in future releases. (See the configuration sketch at the end of this message.)
7. Standalone mode requires management of the daemon services. It also requires a ZooKeeper setup, since the Spark master node needs to be highly available to avoid a single point of failure.
8. Most existing users of Hadoop clusters have large amounts of data (TBs/PBs) residing on the cluster, so Spark applications can make use of data locality when running on a YARN cluster. Making that data available to a standalone cluster might be a challenge.
9. I do not think there is a performance impact from running a Spark application on a YARN/Mesos/standalone cluster. This might require a test.
10. Mesos and Spark were both developed at AMPLab, so they might be better compatible with each other. However, I do not have any working knowledge of Mesos.

I was wondering what the advantages and disadvantages of running Spark over Mesos versus Spark over a standalone cluster are. This will help me (and others on the verge of adopting Spark) decide which direction to go.

Regards,
Deepak

On Wed, 12 Aug 2015 at 10:28 PM Tim Chen <t...@mesosphere.io> wrote:

> I'm not sure what you're looking for, since you can't really compare
> Standalone with YARN or Mesos, as Standalone is assuming the Spark
> workers/master owns the cluster, and YARN/Mesos is trying to share the
> cluster among different applications/frameworks.
>
> And when you refer to resource utilization, what exactly does it mean to
> you? Is it the ability to maximize the usage of your resources with
> multiple applications in mind, or just how much configuration Spark allows
> you to in each mode?
>
> Tim
>
> On Wed, Aug 12, 2015 at 2:16 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com>
> wrote:
>
>> Do we have any comparisons in terms of resource utilization, scheduling
>> of running Spark in the below three modes
>> 1) Standalone
>> 2) over YARN
>> 3) over Mesos
>>
>> Can some one share resources (thoughts/URLs) on this area.
>>
>>
>> --
>> Deepak
>>
>>
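P.S. For point 6 above, here is a minimal sketch (Scala, Spark 1.x-era properties) of how the resource knobs differ between the cluster managers. The app name and all numbers are placeholders, not recommendations, and the master URL (yarn-client, yarn-cluster, spark://..., mesos://...) would normally be supplied via spark-submit rather than in code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A sketch of the resource settings discussed in point 6; values are placeholders.
val conf = new SparkConf()
  .setAppName("deploy-mode-comparison")
  // On YARN (static allocation): request an explicit number of executors,
  // each with a fixed number of cores and a fixed heap size.
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "4g")
  // On standalone (and coarse-grained Mesos): there is no per-node executor
  // count to set; instead you cap the total cores the application may take.
  .set("spark.cores.max", "8")

val sc = new SparkContext(conf)
// ... run jobs ...
sc.stop()
```

The point being that on YARN the executor count and per-executor cores are first-class settings, while standalone/Mesos mainly give you a total-cores cap for the application.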