On Thu, Jan 2, 2014 at 1:19 PM, Eugen Cepoi <[email protected]> wrote:
> When developing I am using local[2] that launches a local cluster with 2
> workers. In most cases it is fine, I just encountered some strange
> behaviours for broadcasted variables, in local mode no broadcast is done
> (at least in 0.8).

That's not good. This could hide bugs in production.

> You also have access to the ui in that case at localhost:4040.

That server has a short life, it dies when the program exits.

> In dev mode I am directly launching my main class from intellij so no I
> don't need to build the fat jar.

Why is it not possible to work with spark://localhost:7077 while
developing? That would allow monitoring and reviewing the jobs, while
keeping a record of past jobs.

I've never been able to connect to spark://localhost:7077 in development; I get:

WARN cluster.ClusterScheduler: Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient memory

The ui says the workers are alive and they do have plenty of memory. I also
tried the exact spark master name given by the ui, with no luck (apparently
akka is too fragile and sensitive to this). Turning off the firewall on OS X
had no effect either.


> 2014/1/2 Aureliano Buendia <[email protected]>
>
>> How about when developing the spark application, do you use "localhost",
>> or "spark://localhost:7077" for the spark context master during
>> development?
>>
>> Using "spark://localhost:7077" is a good way to simulate the production
>> driver, and it provides the web ui. When using "spark://localhost:7077",
>> is it required to create the fat jar? Wouldn't that significantly slow
>> down the development cycle?
>>
>> On Thu, Jan 2, 2014 at 11:38 AM, Eugen Cepoi <[email protected]> wrote:
>>
>>> It depends how you deploy, I don't find it so complicated...
>>>
>>> 1) To build the fat jar I am using maven (as I am not familiar with
>>> sbt).
>>>
>>> Inside I have something like this, saying which libs should be included
>>> in the fat jar (the others won't be present in the final artifact):
>>>
>>> <plugin>
>>>   <groupId>org.apache.maven.plugins</groupId>
>>>   <artifactId>maven-shade-plugin</artifactId>
>>>   <version>2.1</version>
>>>   <executions>
>>>     <execution>
>>>       <phase>package</phase>
>>>       <goals>
>>>         <goal>shade</goal>
>>>       </goals>
>>>       <configuration>
>>>         <minimizeJar>true</minimizeJar>
>>>         <createDependencyReducedPom>false</createDependencyReducedPom>
>>>         <artifactSet>
>>>           <includes>
>>>             <include>org.apache.hbase:*</include>
>>>             <include>org.apache.hadoop:*</include>
>>>             <include>com.typesafe:config</include>
>>>             <include>org.apache.avro:*</include>
>>>             <include>joda-time:*</include>
>>>             <include>org.joda:*</include>
>>>           </includes>
>>>         </artifactSet>
>>>         <filters>
>>>           <filter>
>>>             <artifact>*:*</artifact>
>>>             <excludes>
>>>               <exclude>META-INF/*.SF</exclude>
>>>               <exclude>META-INF/*.DSA</exclude>
>>>               <exclude>META-INF/*.RSA</exclude>
>>>             </excludes>
>>>           </filter>
>>>         </filters>
>>>       </configuration>
>>>     </execution>
>>>   </executions>
>>> </plugin>
>>>
>>> 2) The app is the jar you have built, so you ship it to the driver node
>>> (it depends a lot on how you are planning to use it: debian packaging,
>>> a plain old scp, etc.). To run it you can do something like:
>>>
>>> SPARK_CLASSPATH=PathToYour.jar $SPARK_HOME/spark-class com.myproject.MyJob
>>>
>>> where MyJob is the entry point to your job; it defines a main method.
>>> 3) I don't know what's the "common way", but I am doing things this
>>> way: build the fat jar, provide some launch scripts, make debian
>>> packaging, ship it to a node that plays the role of the driver, run it
>>> over mesos using the launch scripts + some conf.
>>>
>>> 2014/1/2 Aureliano Buendia <[email protected]>
>>>
>>>> I wasn't aware of jarOfClass. I wish there was only one good way of
>>>> deploying in spark, instead of many ambiguous methods. (It seems like
>>>> spark has followed scala in that there is more than one way of
>>>> accomplishing a job, making scala an overcomplicated language.)
>>>>
>>>> 1. Should sbt assembly be used to make the fat jar? If so, which sbt
>>>> should be used? My local sbt, or $SPARK_HOME/sbt/sbt? Why is spark
>>>> shipped with a separate sbt?
>>>>
>>>> 2. Let's say we have the dependencies fat jar which is supposed to be
>>>> shipped to the workers. Now how do we deploy the main app which is
>>>> supposed to be executed on the driver? Make another jar out of it?
>>>> Does sbt assembly also create that jar?
>>>>
>>>> 3. Is calling sc.jarOfClass() the most common way of doing this? I
>>>> cannot find any example by googling. What's the most common way that
>>>> people use?
>>>>
>>>> On Thu, Jan 2, 2014 at 10:58 AM, Eugen Cepoi <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> This is the list of the jars you use in your job; the driver will send
>>>>> all those jars to each worker (otherwise the workers won't have the
>>>>> classes you need in your job). The easy way to go is to build a fat
>>>>> jar with your code and all the libs you depend on, and then use this
>>>>> utility to get the path: SparkContext.jarOfClass(YourJob.getClass)
>>>>>
>>>>> 2014/1/2 Aureliano Buendia <[email protected]>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I do not understand why spark context has an option for loading jars
>>>>>> at runtime.
>>>>>>
>>>>>> As an example, consider this
>>>>>> <https://github.com/apache/incubator-spark/blob/50fd8d98c00f7db6aa34183705c9269098c62486/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala#L36>:
>>>>>>
>>>>>> object BroadcastTest {
>>>>>>   def main(args: Array[String]) {
>>>>>>     val sc = new SparkContext(args(0), "Broadcast Test",
>>>>>>       System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> This is *the* example, or *the* application that we want to run; what
>>>>>> is SPARK_EXAMPLES_JAR supposed to be? In this particular case, the
>>>>>> BroadcastTest example is self-contained, so why would it want to load
>>>>>> other unrelated example jars?
>>>>>>
>>>>>> Finally, how does this help a real world spark application?
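
Putting Eugen's points 2) and 3) and the jarOfClass suggestion together, a
minimal sketch of such an entry point might look like the following. This
assumes the Spark 0.8-era SparkContext constructor (master, app name, spark
home, jars) and jarOfClass returning a Seq of paths; the package, object,
and app names are placeholders, not anything from the thread.

package com.myproject

import org.apache.spark.SparkContext

// Hypothetical entry point, along the lines discussed above.
object MyJob {
  def main(args: Array[String]) {
    // "spark://localhost:7077" to go through a standalone master,
    // or "local[2]" while iterating from the IDE.
    val master = if (args.nonEmpty) args(0) else "local[2]"

    // jarOfClass returns the jar(s) containing this class -- the fat jar
    // when launched from it, possibly empty when run from the IDE, which
    // is fine for local mode.
    val jars = SparkContext.jarOfClass(this.getClass)

    // SPARK_HOME may be null/unset when running purely locally.
    val sc = new SparkContext(master, "My Job",
      System.getenv("SPARK_HOME"), jars)

    // Tiny sanity check that broadcast behaves as it would on a cluster.
    val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))
    val n = sc.parallelize(1 to 10).filter(i => lookup.value.contains(i)).count()
    println("matched " + n + " elements")

    sc.stop()
  }
}

Launched with the fat jar on SPARK_CLASSPATH via spark-class, the same
binary can be pointed at spark://localhost:7077 or at local[2], since the
master is just the first argument.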

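On the sbt side of question 1: your regular, locally installed sbt is fine
for building your own job; the sbt script under $SPARK_HOME is a bundled
launcher used to build Spark itself. A rough sketch of a build using the
sbt-assembly plugin of that era follows; the plugin version, project name,
and the choice to mark spark-core as "provided" are assumptions to adapt,
not something prescribed in this thread.

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.10.2")  // adjust to the release you use

// build.sbt
import AssemblyKeys._

assemblySettings

name := "my-spark-job"

version := "0.1.0"

scalaVersion := "2.9.3"  // Spark 0.8.x is built against Scala 2.9.3

// "provided" keeps spark-core out of the fat jar, since spark-class
// already puts Spark on the classpath at launch time.
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.8.1-incubating" % "provided"

On question 2: running "sbt assembly" produces a single jar under target/
containing your classes plus the non-provided dependencies, so that one jar
is both "the app" and "the dependencies"; there is no second jar to build
for the driver.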