Aureliano,

It doesn't matter much, actually. Specifying "local" as your spark master simply runs the whole application inside a single JVM. Setting up a cluster and then specifying "spark://localhost:7077" runs it on a set of machines. Running spark in local mode is helpful for debugging, but it will be much slower than running on a cluster of 3, 4, or n machines. If you do not have a set of machines, you can use your own machine as a slave and start both the master and a slave on it. Go through the Apache Spark home page to learn more about starting the various nodes.

Thx.
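For illustration, here is a rough sketch of switching between the two masters without changing the job code. The job and the names in it are made up, and it assumes the 0.8/0.9-style SparkContext constructor and the jarOfClass helper that come up further down this thread:

import org.apache.spark.SparkContext

object MyDevJob {
  def main(args: Array[String]) {
    // "local[2]": driver and executors all run inside this one JVM with 2 threads --
    // fastest turnaround for debugging, no cluster needed.
    // "spark://localhost:7077": submit to a standalone master on this machine, which
    // exercises the real deployment path and gives you the web UI.
    val master = if (args.nonEmpty) args(0) else "local[2]"

    // Against a real master the workers need our classes, so ship the fat jar;
    // in local mode nothing has to be shipped. (jarOfClass returns a Seq[String]
    // in Spark 0.8/0.9, which is what the jars parameter expects.)
    val jars = if (master.startsWith("spark://"))
      SparkContext.jarOfClass(this.getClass)
    else
      Seq.empty[String]

    val sc = new SparkContext(master, "My Dev Job", System.getenv("SPARK_HOME"), jars)

    // A trivial computation, just to show the same code runs unchanged under either master.
    val total = sc.parallelize(1 to 1000).map(_ * 2).reduce(_ + _)
    println("total = " + total)

    sc.stop()
  }
}

Run it with no argument while iterating, and pass spark://localhost:7077 (after building the fat jar) when you want a production-like run with the web UI.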
On Thu, Jan 2, 2014 at 5:21 PM, Aureliano Buendia <[email protected]> wrote:

> How about when developing the spark application: do you use "localhost"
> or "spark://localhost:7077" for the spark context master during development?
>
> Using "spark://localhost:7077" is a good way to simulate the production
> driver, and it provides the web UI. When using "spark://localhost:7077", is
> it required to create the fat jar? Wouldn't that significantly slow down
> the development cycle?
>
>
> On Thu, Jan 2, 2014 at 11:38 AM, Eugen Cepoi <[email protected]> wrote:
>
>> It depends how you deploy; I don't find it so complicated...
>>
>> 1) To build the fat jar I am using maven (as I am not familiar with sbt).
>>
>> Inside I have something like this, saying which libs should be included in
>> the fat jar (the others won't be present in the final artifact):
>>
>> <plugin>
>>   <groupId>org.apache.maven.plugins</groupId>
>>   <artifactId>maven-shade-plugin</artifactId>
>>   <version>2.1</version>
>>   <executions>
>>     <execution>
>>       <phase>package</phase>
>>       <goals>
>>         <goal>shade</goal>
>>       </goals>
>>       <configuration>
>>         <minimizeJar>true</minimizeJar>
>>         <createDependencyReducedPom>false</createDependencyReducedPom>
>>         <artifactSet>
>>           <includes>
>>             <include>org.apache.hbase:*</include>
>>             <include>org.apache.hadoop:*</include>
>>             <include>com.typesafe:config</include>
>>             <include>org.apache.avro:*</include>
>>             <include>joda-time:*</include>
>>             <include>org.joda:*</include>
>>           </includes>
>>         </artifactSet>
>>         <filters>
>>           <filter>
>>             <artifact>*:*</artifact>
>>             <excludes>
>>               <exclude>META-INF/*.SF</exclude>
>>               <exclude>META-INF/*.DSA</exclude>
>>               <exclude>META-INF/*.RSA</exclude>
>>             </excludes>
>>           </filter>
>>         </filters>
>>       </configuration>
>>     </execution>
>>   </executions>
>> </plugin>
>>
>>
>> 2) The app is the jar you have built, so you ship it to the driver node
>> (it depends a lot on how you are planning to use it: debian packaging, a
>> plain old scp, etc). To run it you can do something like:
>>
>> SPARK_CLASSPATH=PathToYour.jar $SPARK_HOME/spark-class com.myproject.MyJob
>>
>> where MyJob is the entry point to your job; it defines a main method.
>>
>> 3) I don't know what the "common way" is, but I am doing things this way:
>> build the fat jar, provide some launch scripts, make debian packaging, ship
>> it to a node that plays the role of the driver, and run it over mesos using
>> the launch scripts + some conf.
>>
>>
>> 2014/1/2 Aureliano Buendia <[email protected]>
>>
>>> I wasn't aware of jarOfClass. I wish there were only one good way of
>>> deploying in spark, instead of many ambiguous methods. (It seems spark
>>> has followed scala in that there is more than one way of accomplishing a
>>> job, making scala an overcomplicated language.)
>>>
>>> 1. Should sbt assembly be used to make the fat jar? If so, which sbt
>>> should be used: my local sbt or $SPARK_HOME/sbt/sbt? Why is spark
>>> shipped with a separate sbt?
>>>
>>> 2. Let's say we have the dependencies fat jar which is supposed to be
>>> shipped to the workers. Now how do we deploy the main app, which is supposed
>>> to be executed on the driver? Make another jar out of it? Does sbt
>>> assembly also create that jar?
>>>
>>> 3. Is calling sc.jarOfClass() the most common way of doing this? I
>>> cannot find any example by googling. What's the most common way that
>>> people use?
>>>
>>>
>>> On Thu, Jan 2, 2014 at 10:58 AM, Eugen Cepoi <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> This is the list of the jars you use in your job; the driver will send
>>>> all those jars to each worker (otherwise the workers won't have the classes
>>>> you need in your job). The easy way to go is to build a fat jar with your
>>>> code and all the libs you depend on, and then use this utility to get the
>>>> path: SparkContext.jarOfClass(YourJob.getClass)
>>>>
>>>>
>>>> 2014/1/2 Aureliano Buendia <[email protected]>
>>>>
>>>>> Hi,
>>>>>
>>>>> I do not understand why the spark context has an option for loading jars
>>>>> at runtime.
>>>>>
>>>>> As an example, consider this
>>>>> <https://github.com/apache/incubator-spark/blob/50fd8d98c00f7db6aa34183705c9269098c62486/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala#L36>:
>>>>>
>>>>> object BroadcastTest {
>>>>>   def main(args: Array[String]) {
>>>>>     val sc = new SparkContext(args(0), "Broadcast Test",
>>>>>       System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
>>>>>   }
>>>>> }
>>>>>
>>>>> This is *the* example, or *the* application, that we want to run. What
>>>>> is SPARK_EXAMPLES_JAR supposed to be?
>>>>> In this particular case, the BroadcastTest example is self-contained; why
>>>>> would it want to load other unrelated example jars?
>>>>>
>>>>> Finally, how does this help a real world spark application?
>>>>>
>>>>
>>>
>>
>
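Regarding the SPARK_EXAMPLES_JAR question at the bottom of the thread: as far as I can tell, it simply points at the examples assembly jar so that the workers can fetch the compiled example classes; it is not loading unrelated jars. A real-world application typically passes its own fat jar instead, located via jarOfClass. Here is a rough sketch (the names are hypothetical, and it again assumes the 0.8/0.9-era API where jarOfClass returns a Seq):

import org.apache.spark.SparkContext

object MyBroadcastJob {
  def main(args: Array[String]) {
    // Instead of an env var like SPARK_EXAMPLES_JAR, locate the jar that contains
    // this class and hand it to the SparkContext so it is shipped to the workers.
    val jars = SparkContext.jarOfClass(this.getClass)

    val sc = new SparkContext(args(0), "My Broadcast Job",
      System.getenv("SPARK_HOME"), jars)

    // Broadcast a small lookup table once, then read it from every task.
    val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))
    val described = sc.parallelize(1 to 3).map(i => lookup.value.getOrElse(i, "?"))
    described.collect().foreach(println)

    sc.stop()
  }
}

This way nothing about the build changes between development and production; the only thing that varies is the master URL you pass in args(0).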
