Here is the Oozie workflow that I got as a reference from Yahoo's Tom
Graves. It's a bit old, so you'll probably have to update a few things, and
it has been a while since I last tried it, but it should still work. Also,
refer to http://spark.apache.org/docs/latest/running-on-yarn.html for
details on the configs.

<workflow-app xmlns="uri:oozie:workflow:0.4" name="spark_oozie_wf">
    <start to="spark-node"/>
    <action name="spark-node">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/${wf:user()}/${wfRoot}/output-data/pig"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
                <property>
                    <name>oozie.launcher.mapred.child.env</name>
                    <value>SPARK_JAR=spark-repl-bin-0.8.0-SNAPSHOT-shaded-hadoop2-yarn.jar</value>
                </property>
            </configuration>
            <main-class>spark.deploy.yarn.Client</main-class>
            <arg>--jar</arg>
            <arg>spark-examples-0.8.0-SNAPSHOT-hadoop2-yarn.jar</arg>
            <arg>--class</arg>
            <arg>spark.examples.SparkHdfsLR</arg>
            <arg>--args</arg>
            <arg>yarn-standalone</arg>
            <arg>--args</arg>
            <arg>hdfs://${nameNode}/user/testuser/lr_data.txt</arg>
            <arg>--args</arg>
            <arg>3</arg>
            <arg>--num-workers</arg>
            <arg>3</arg>
            <arg>--worker-memory</arg>
            <arg>2g</arg>
            <arg>--worker-cores</arg>
            <arg>2</arg>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
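
For reference, the application directory on HDFS would look roughly like the
layout below. The paths are illustrative, assuming the wfRoot value from the
job.properties further down; swap in whatever Spark jar names match your build:

/user/<your_user>/spark_oozie/apps/spark/workflow.xml
/user/<your_user>/spark_oozie/apps/lib/spark-repl-bin-0.8.0-SNAPSHOT-shaded-hadoop2-yarn.jar
/user/<your_user>/spark_oozie/apps/lib/spark-examples-0.8.0-SNAPSHOT-hadoop2-yarn.jar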


job.properties file - make sure to update for your cluster:

nameNode=hdfs://your_namenode:8020
jobTracker=your_resourcemanager:8032
queueName=default
wfRoot=spark_oozie
oozie.libpath=/user/${user.name}/${wfRoot}/apps/lib
oozie.wf.application.path=${nameNode}/user/${user.name}/${wfRoot}/apps/spark
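
To stage the files and kick off the workflow, something along these lines
should work (the Oozie host/port and local jar names are placeholders for
your own setup):

hadoop fs -mkdir -p /user/$USER/spark_oozie/apps/spark /user/$USER/spark_oozie/apps/lib
hadoop fs -put workflow.xml /user/$USER/spark_oozie/apps/spark/
hadoop fs -put spark-*-yarn.jar /user/$USER/spark_oozie/apps/lib/
oozie job -oozie http://your_oozie_host:11000/oozie -config job.properties -run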





On Fri, Apr 11, 2014 at 9:53 AM, Robert Kanter <[email protected]> wrote:

> Hi,
>
> The Java action is pretty useful for running "driver" programs for MR.
>  i.e. Java code that configures and submits a MapReduce job.  I'm not super
> familiar with how to submit Spark jobs, but I imagine that you can write a
> similar driver for a Spark Job and give it to the Java action.  You'd have
> to make the necessary Spark jars available on the action's classpath.
>  Oozie has a number of ways to do that, but the easiest is to put them in a
> directory named "lib" next to your workflow.xml.  Other than including the
> jars in some way, nothing "special" should be needed :)
>
> In the long run, a Spark action would be a nice convenience for users,
> especially since Spark is becoming more popular.  Then Oozie could have a
> Spark sharelib with the necessary jar files and handle that automatically.
>  (FYI: almost all of the action types are actually subclasses of the Java
> action where Oozie provides the driver and some integration logic to make
> things easier for the user)
>
>
>
>
> On Wed, Apr 9, 2014 at 4:59 PM, Segerlind, Nathan L <
> [email protected]> wrote:
>
> > Hi All.
> >
> > Is it possible to incorporate Spark jobs into Oozie workflows? I've heard
> > that it is possible to do this as a Java action, but I've not seen an
> > example. If it is possible, does it affect the size of the workflow
> > application zip file - in particular, would all the Spark jars have to be
> > included with the workflow, or could they be distributed about the cluster
> > already? More generally, does anything "special" have to be done to
> > integrate Spark jobs into Oozie?
> >
> > Thanks,
> > Nate
> >
>
