On Thu, Jan 2, 2014 at 1:19 PM, Eugen Cepoi <[email protected]> wrote:
> When developing I am using local[2] that launches a local cluster with 2
> workers. In most cases it is fine, I just encountered some strange
> behaviours for broadcasted variables, in local mode no broadcast is done
> (at least in 0.8).

That's not good. This could hide bugs in production.

> You also have access to the ui in that case at localhost:4040.

That server has a short life, it dies when the program exits.

> In dev mode I am directly launching my main class from intellij so no I
> don't need to build the fat jar.

Why is it not possible to work with spark://localhost:7077 while
developing? That would allow monitoring and reviewing the jobs, while
keeping a record of past jobs.

I've never been able to connect to spark://localhost:7077 in development; I get:

WARN cluster.ClusterScheduler: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient memory

The ui says the workers are alive and they do have plenty of memory. I also
tried the exact spark master name given by the ui, with no luck (apparently
akka is too fragile and sensitive to this). Turning off the firewall on OS X
had no effect either.


> 2014/1/2 Aureliano Buendia <[email protected]>
>
>> How about when developing the spark application, do you use "localhost",
>> or "spark://localhost:7077" for the spark context master during
>> development?
>>
>> Using "spark://localhost:7077" is a good way to simulate the production
>> driver, and it provides the web ui. When using "spark://localhost:7077",
>> is it required to create the fat jar? Wouldn't that significantly slow
>> down the development cycle?
>>
>> On Thu, Jan 2, 2014 at 11:38 AM, Eugen Cepoi <[email protected]> wrote:
>>
>>> It depends how you deploy, I don't find it so complicated...
>>>
>>> 1) To build the fat jar I am using maven (as I am not familiar with
>>> sbt).
>>>
>>> Inside I have something like this, saying which libs should be included
>>> in the fat jar (the others won't be present in the final artifact):
>>>
>>> <plugin>
>>>   <groupId>org.apache.maven.plugins</groupId>
>>>   <artifactId>maven-shade-plugin</artifactId>
>>>   <version>2.1</version>
>>>   <executions>
>>>     <execution>
>>>       <phase>package</phase>
>>>       <goals>
>>>         <goal>shade</goal>
>>>       </goals>
>>>       <configuration>
>>>         <minimizeJar>true</minimizeJar>
>>>         <createDependencyReducedPom>false</createDependencyReducedPom>
>>>         <artifactSet>
>>>           <includes>
>>>             <include>org.apache.hbase:*</include>
>>>             <include>org.apache.hadoop:*</include>
>>>             <include>com.typesafe:config</include>
>>>             <include>org.apache.avro:*</include>
>>>             <include>joda-time:*</include>
>>>             <include>org.joda:*</include>
>>>           </includes>
>>>         </artifactSet>
>>>         <filters>
>>>           <filter>
>>>             <artifact>*:*</artifact>
>>>             <excludes>
>>>               <exclude>META-INF/*.SF</exclude>
>>>               <exclude>META-INF/*.DSA</exclude>
>>>               <exclude>META-INF/*.RSA</exclude>
>>>             </excludes>
>>>           </filter>
>>>         </filters>
>>>       </configuration>
>>>     </execution>
>>>   </executions>
>>> </plugin>
>>>
>>> 2) The app is the jar you have built, so you ship it to the driver node
>>> (it depends a lot on how you are planning to use it: debian packaging,
>>> a plain old scp, etc.). To run it you can do something like:
>>>
>>> SPARK_CLASSPATH=PathToYour.jar $SPARK_HOME/spark-class com.myproject.MyJob
>>>
>>> where MyJob is the entry point to your job; it defines a main method.
>>> 3) I don't know what's the "common way", but I am doing things this
>>> way: build the fat jar, provide some launch scripts, make debian
>>> packaging, ship it to a node that plays the role of the driver, run it
>>> over mesos using the launch scripts + some conf.
>>>
>>> 2014/1/2 Aureliano Buendia <[email protected]>
>>>
>>>> I wasn't aware of jarOfClass. I wish there was only one good way of
>>>> deploying in spark, instead of many ambiguous methods. (It seems like
>>>> spark has followed scala in that there is more than one way of
>>>> accomplishing a job, making scala an overcomplicated language.)
>>>>
>>>> 1. Should sbt assembly be used to make the fat jar? If so, which sbt
>>>> should be used? My local sbt, or $SPARK_HOME/sbt/sbt? Why is spark
>>>> shipped with a separate sbt?
>>>>
>>>> 2. Let's say we have the dependencies fat jar which is supposed to be
>>>> shipped to the workers. Now how do we deploy the main app which is
>>>> supposed to be executed on the driver? Make another jar out of it?
>>>> Does sbt assembly also create that jar?
>>>>
>>>> 3. Is calling sc.jarOfClass() the most common way of doing this? I
>>>> cannot find any example by googling. What's the most common way that
>>>> people use?
>>>>
>>>> On Thu, Jan 2, 2014 at 10:58 AM, Eugen Cepoi <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> This is the list of the jars you use in your job; the driver will send
>>>>> all those jars to each worker (otherwise the workers won't have the
>>>>> classes you need in your job). The easy way to go is to build a fat
>>>>> jar with your code and all the libs you depend on, and then use this
>>>>> utility to get the path: SparkContext.jarOfClass(YourJob.getClass)
>>>>>
>>>>> 2014/1/2 Aureliano Buendia <[email protected]>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I do not understand why spark context has an option for loading jars
>>>>>> at runtime.
>>>>>>
>>>>>> As an example, consider this
>>>>>> <https://github.com/apache/incubator-spark/blob/50fd8d98c00f7db6aa34183705c9269098c62486/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala#L36>:
>>>>>>
>>>>>> object BroadcastTest {
>>>>>>   def main(args: Array[String]) {
>>>>>>     val sc = new SparkContext(args(0), "Broadcast Test",
>>>>>>       System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> This is *the* example, or *the* application that we want to run; what
>>>>>> is SPARK_EXAMPLES_JAR supposed to be? In this particular case, the
>>>>>> BroadcastTest example is self-contained, so why would it want to load
>>>>>> other unrelated example jars?
>>>>>>
>>>>>> Finally, how does this help a real world spark application?
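
Putting Eugen's points 2) and 3) and the jarOfClass suggestion together, a
minimal sketch of such an entry point might look like the following. This
assumes the Spark 0.8-era SparkContext constructor (master, app name, spark
home, jars) and jarOfClass returning a Seq of paths; the package, object,
and app names are placeholders, not anything from the thread.

package com.myproject

import org.apache.spark.SparkContext

// Hypothetical entry point, along the lines discussed above.
object MyJob {
  def main(args: Array[String]) {
    // "spark://localhost:7077" to go through a standalone master,
    // or "local[2]" while iterating from the IDE.
    val master = if (args.nonEmpty) args(0) else "local[2]"

    // jarOfClass returns the jar(s) containing this class -- the fat jar
    // when launched from it, possibly empty when run from the IDE, which
    // is fine for local mode.
    val jars = SparkContext.jarOfClass(this.getClass)

    // SPARK_HOME may be null/unset when running purely locally.
    val sc = new SparkContext(master, "My Job",
      System.getenv("SPARK_HOME"), jars)

    // Tiny sanity check that broadcast behaves as it would on a cluster.
    val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))
    val n = sc.parallelize(1 to 10).filter(i => lookup.value.contains(i)).count()
    println("matched " + n + " elements")

    sc.stop()
  }
}

Launched with the fat jar on SPARK_CLASSPATH via spark-class, the same
binary can be pointed at spark://localhost:7077 or at local[2], since the
master is just the first argument.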

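On the sbt side of question 1: your regular, locally installed sbt is fine
for building your own job; the sbt script under $SPARK_HOME is a bundled
launcher used to build Spark itself. A rough sketch of a build using the
sbt-assembly plugin of that era follows; the plugin version, project name,
and the choice to mark spark-core as "provided" are assumptions to adapt,
not something prescribed in this thread.

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.10.2")  // adjust to the release you use

// build.sbt
import AssemblyKeys._

assemblySettings

name := "my-spark-job"

version := "0.1.0"

scalaVersion := "2.9.3"  // Spark 0.8.x is built against Scala 2.9.3

// "provided" keeps spark-core out of the fat jar, since spark-class
// already puts Spark on the classpath at launch time.
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.8.1-incubating" % "provided"

On question 2: running "sbt assembly" produces a single jar under target/
containing your classes plus the non-provided dependencies, so that one jar
is both "the app" and "the dependencies"; there is no second jar to build
for the driver.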