Here is the Oozie workflow that I got as a reference from Yahoo's Tom
Graves. It's a bit old, so you'll probably have to update a few things, and
it has been a while since I've tried it, but it should still work. Also,
refer to http://spark.apache.org/docs/latest/running-on-yarn.html for
details regarding the configs.
<workflow-app xmlns="uri:oozie:workflow:0.4" name="spark_oozie_wf">
    <start to="spark-node"/>
    <action name="spark-node">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/${wf:user()}/${wfRoot}/output-data/pig"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
                <property>
                    <name>oozie.launcher.mapred.child.env</name>
                    <value>SPARK_JAR=spark-repl-bin-0.8.0-SNAPSHOT-shaded-hadoop2-yarn.jar</value>
                </property>
            </configuration>
            <main-class>spark.deploy.yarn.Client</main-class>
            <arg>--jar</arg>
            <arg>spark-examples-0.8.0-SNAPSHOT-hadoop2-yarn.jar</arg>
            <arg>--class</arg>
            <arg>spark.examples.SparkHdfsLR</arg>
            <arg>--args</arg>
            <arg>yarn-standalone</arg>
            <arg>--args</arg>
            <arg>${nameNode}/user/testuser/lr_data.txt</arg>
            <arg>--args</arg>
            <arg>3</arg>
            <arg>--num-workers</arg>
            <arg>3</arg>
            <arg>--worker-memory</arg>
            <arg>2g</arg>
            <arg>--worker-cores</arg>
            <arg>2</arg>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
job.properties file - make sure to update for your cluster:
nameNode=hdfs://your_namenode:8020
jobTracker=your_resourcemanager:8032
queueName=default
wfRoot=spark_oozie
oozie.libpath=/user/${user.name}/${wfRoot}/apps/lib
oozie.wf.application.path=${nameNode}/user/${user.name}/${wfRoot}/apps/spark
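Once those two files are in place, submission is just the standard Oozie CLI
flow. Roughly like this - the Oozie URL is a placeholder for your cluster,
and the jar names are whatever your Spark build produces:

  # copy the workflow definition and the Spark jars into the paths from job.properties
  hdfs dfs -mkdir -p /user/$USER/spark_oozie/apps/spark /user/$USER/spark_oozie/apps/lib
  hdfs dfs -put workflow.xml /user/$USER/spark_oozie/apps/spark/
  hdfs dfs -put spark-examples-0.8.0-SNAPSHOT-hadoop2-yarn.jar \
      spark-repl-bin-0.8.0-SNAPSHOT-shaded-hadoop2-yarn.jar \
      /user/$USER/spark_oozie/apps/lib/

  # then kick it off, pointing -oozie at your Oozie server
  oozie job -oozie http://your_oozie_host:11000/oozie -config job.properties -run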
On Fri, Apr 11, 2014 at 9:53 AM, Robert Kanter <[email protected]> wrote:
> Hi,
>
> The Java action is pretty useful for running "driver" programs for MR,
> i.e. Java code that configures and submits a MapReduce job. I'm not super
> familiar with how to submit Spark jobs, but I imagine that you can write a
> similar driver for a Spark Job and give it to the Java action. You'd have
> to make the necessary Spark jars available on the action's classpath.
> Oozie has a number of ways to do that, but the easiest is to put them in a
> directory named "lib" next to your workflow.xml. Other than including the
> jars in some way, nothing "special" should be needed :)
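To make that concrete: the "lib" directory Robert describes is just a
directory next to workflow.xml in HDFS, and Oozie adds everything in it to
the action's classpath. A rough sketch, with placeholder paths and jar names:

  hdfs dfs -put workflow.xml /user/$USER/apps/my-spark-wf/
  hdfs dfs -put spark-assembly.jar my-spark-driver.jar /user/$USER/apps/my-spark-wf/lib/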
>
> In the long run, a Spark action would be a nice convenience for users,
> especially since Spark is becoming more popular. Then Oozie could have a
> Spark sharelib with the necessary jar files and handle that automatically.
> (FYI: almost all of the action types are actually subclasses of the Java
> action where Oozie provides the driver and some integration logic to make
> things easier for the user)
>
> On Wed, Apr 9, 2014 at 4:59 PM, Segerlind, Nathan L <
> [email protected]> wrote:
>
> > Hi All.
> >
> > Is it possible to incorporate Spark jobs into Oozie workflows? I've heard
> > that it is possible to do this as a Java action, but I've not seen an
> > example. If it is possible, does it affect the size of the workflow
> > application zip file - in particular, would all the Spark jars have to be
> > included with the workflow, or could they be distributed about the cluster
> > already? More generally, does anything "special" have to be done to
> > integrate Spark jobs into Oozie?
> >
> > Thanks,
> > Nate
> >
>