When developing I use local[2], which launches a local "cluster" with 2
worker threads. In most cases this is fine; I just ran into some strange
behaviour with broadcast variables, since in local mode no actual broadcast
is done (at least in 0.8). You also have access to the web UI in that case,
at localhost:4040.

In dev mode I launch my main class directly from IntelliJ, so no, I don't
need to build the fat jar.
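
For reference, here is a minimal sketch of that kind of dev setup (the object
and app names are placeholders, not from this thread). Because everything runs
in a single JVM, it can be launched straight from the IDE with no jar:

import org.apache.spark.SparkContext

object MyDevJob {
  def main(args: Array[String]) {
    // "local[2]" runs the driver plus 2 worker threads in the same JVM.
    val sc = new SparkContext("local[2]", "My Dev Job")

    // Trivial computation, just to have something to run.
    val evens = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()
    println("evens = " + evens)

    sc.stop()
  }
}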


2014/1/2 Aureliano Buendia <[email protected]>

> How about when developing the spark application, do you use "localhost",
> or "spark://localhost:7077" for spark context master during development?
>
> Using "spark://localhost:7077" is a good way to simulate the production
> driver and it provides the web ui. When using "spark://localhost:7077", is
> it required to create the fat jar? Wouldn't that significantly slow down
> the development cycle?
>
>
> On Thu, Jan 2, 2014 at 11:38 AM, Eugen Cepoi <[email protected]> wrote:
>
>> It depends on how you deploy; I don't find it so complicated...
>>
>> 1) To build the fat jar I am using Maven (as I am not familiar with sbt).
>>
>> In the pom I have something like the following, which says which libs should
>> be included in the fat jar (the others won't be present in the final artifact).
>>
>> <plugin>
>>   <groupId>org.apache.maven.plugins</groupId>
>>   <artifactId>maven-shade-plugin</artifactId>
>>   <version>2.1</version>
>>   <executions>
>>     <execution>
>>       <phase>package</phase>
>>       <goals>
>>         <goal>shade</goal>
>>       </goals>
>>       <configuration>
>>         <minimizeJar>true</minimizeJar>
>>         <createDependencyReducedPom>false</createDependencyReducedPom>
>>         <artifactSet>
>>           <includes>
>>             <include>org.apache.hbase:*</include>
>>             <include>org.apache.hadoop:*</include>
>>             <include>com.typesafe:config</include>
>>             <include>org.apache.avro:*</include>
>>             <include>joda-time:*</include>
>>             <include>org.joda:*</include>
>>           </includes>
>>         </artifactSet>
>>         <filters>
>>           <filter>
>>             <artifact>*:*</artifact>
>>             <excludes>
>>               <exclude>META-INF/*.SF</exclude>
>>               <exclude>META-INF/*.DSA</exclude>
>>               <exclude>META-INF/*.RSA</exclude>
>>             </excludes>
>>           </filter>
>>         </filters>
>>       </configuration>
>>     </execution>
>>   </executions>
>> </plugin>
>>
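>> (Since the shade goal is bound to the package phase, a plain "mvn package"
>> then produces the shaded fat jar under target/.)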
>>
>> 2) The app is the jar you have built, so you ship it to the driver node
>> (how exactly depends a lot on how you are planning to use it: Debian
>> packaging, a plain old scp, etc.). To run it you can do something like:
>>
>> SPARK_CLASSPATH=PathToYour.jar $SPARK_HOME/spark-class com.myproject.MyJob
>>
>> where MyJob is the entry point to your job; it defines a main method.
>>
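>> A rough sketch of what such an entry point could look like (the package,
>> class name, and arguments are placeholders, not something prescribed by
>> Spark):
>>
>> package com.myproject
>>
>> import org.apache.spark.SparkContext
>>
>> object MyJob {
>>   def main(args: Array[String]) {
>>     val master = args(0) // e.g. "mesos://host:5050" or "spark://host:7077"
>>     val jar    = args(1) // path to the fat jar, so workers get your classes
>>     val sc = new SparkContext(master, "My Job",
>>       System.getenv("SPARK_HOME"), Seq(jar))
>>     // ... the actual job logic goes here ...
>>     sc.stop()
>>   }
>> }
>>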
>> 3) I don't know what the "common way" is, but I am doing things this way:
>> build the fat jar, provide some launch scripts, do the Debian packaging, ship
>> it to a node that plays the role of the driver, and run it over Mesos using
>> the launch scripts + some conf.
>>
>>
>> 2014/1/2 Aureliano Buendia <[email protected]>
>>
>>> I wasn't aware of jarOfClass. I wish there were only one good way of
>>> deploying in Spark, instead of many ambiguous methods. (It seems Spark
>>> has followed Scala in that there is more than one way of accomplishing a
>>> job, which makes Scala an overcomplicated language.)
>>>
>>> 1. Should sbt assembly be used to make the fat jar? If so, which sbt
>>> should be used: my local sbt or $SPARK_HOME/sbt/sbt? Why is it that Spark
>>> is shipped with a separate sbt?
>>>
>>> 2. Let's say we have the dependencies fat jar, which is supposed to be
>>> shipped to the workers. Now how do we deploy the main app, which is supposed
>>> to be executed on the driver? Make another jar out of it? Does sbt
>>> assembly also create that jar?
>>>
>>> 3. Is calling sc.jarOfClass() the most common way of doing this? I
>>> cannot find any example by googling. What's the most common way that people
>>> use?
>>>
>>>
>>>
>>> On Thu, Jan 2, 2014 at 10:58 AM, Eugen Cepoi <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> This is the list of the jars you use in your job; the driver will send
>>>> all those jars to each worker (otherwise the workers won't have the classes
>>>> you need in your job). The easy way to go is to build a fat jar with your
>>>> code and all the libs you depend on, and then use this utility to get the
>>>> path: SparkContext.jarOfClass(YourJob.getClass)
>>>>
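>>>> A small sketch of that approach (YourJob, the app name, and the master
>>>> URL are placeholders):
>>>>
>>>> import org.apache.spark.SparkContext
>>>>
>>>> object YourJob {
>>>>   def main(args: Array[String]) {
>>>>     // jarOfClass locates the jar containing YourJob, i.e. your fat jar;
>>>>     // .toSeq just keeps the result as a Seq for the constructor below.
>>>>     val jars = SparkContext.jarOfClass(YourJob.getClass).toSeq
>>>>     val sc = new SparkContext("spark://master:7077", "Your Job",
>>>>       System.getenv("SPARK_HOME"), jars)
>>>>     // ... job logic ...
>>>>     sc.stop()
>>>>   }
>>>> }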
>>>>
>>>> 2014/1/2 Aureliano Buendia <[email protected]>
>>>>
>>>>> Hi,
>>>>>
>>>>> I do not understand why SparkContext has an option for loading jars
>>>>> at runtime.
>>>>>
>>>>> As an example, consider this:
>>>>> <https://github.com/apache/incubator-spark/blob/50fd8d98c00f7db6aa34183705c9269098c62486/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala#L36>
>>>>>
>>>>> object BroadcastTest {
>>>>>   def main(args: Array[String]) {
>>>>>     val sc = new SparkContext(args(0), "Broadcast Test",
>>>>>       System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
>>>>>   }
>>>>> }
>>>>>
>>>>>
>>>>> This is *the* example, i.e. *the* application that we want to run, so what
>>>>> is SPARK_EXAMPLES_JAR supposed to be?
>>>>> In this particular case, the BroadcastTest example is self-contained, so why
>>>>> would it want to load other, unrelated example jars?
>>>>>
>>>>> Finally, how does this help a real-world Spark application?
>>>>>
>>>>>
>>>>
>>>
>>
>
