Re: Recommended pipeline automation tool? Oozie?
If you're already using Scala for Spark programming and you hate Oozie XML as much as I do ;), you might check out Scoozie, a Scala DSL for Oozie: https://github.com/klout/scoozie

(In reply to Andrei faithlessfri...@gmail.com, Thu, Jul 10, 2014 at 5:52 PM.)

--
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com
Re: Recommended pipeline automation tool? Oozie?
You may look into the new Azkaban - which, while quite heavyweight, is actually quite pleasant to use once set up. You can run Spark jobs (spark-submit) using Azkaban shell commands and pass parameters between jobs. It supports dependencies, simple DAGs, and scheduling with retries. I'm digging deeper, and it may be worthwhile extending it with a Spark job type... It's probably best for mixed Hadoop / Spark clusters...

— Sent from Mailbox

(In reply to Andrei faithlessfri...@gmail.com, Fri, Jul 11, 2014 at 12:52 AM.)
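For concreteness, the Azkaban flow Nick describes is just a set of .job property files, one per job, chained with the dependencies property. Everything below (job names, paths, the master URL, the parameter name) is illustrative, not taken from the thread:

```properties
# spark_etl.job - run a Spark application through an Azkaban "command" job
type=command
command=spark-submit --master spark://master:7077 --class com.example.Etl /jobs/etl.jar ${input.path}

# report.job - a downstream job; Azkaban runs it only after spark_etl succeeds
type=command
command=python /jobs/report.py
dependencies=spark_etl
```

Uploading the two files as a flow gives you the dependency DAG, scheduling, and retries from the Azkaban UI; ${input.path} is substituted from flow parameters at run time.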
Re: Recommended pipeline automation tool? Oozie?
We used Azkaban for a short while and suffered a lot. In the end we almost rewrote it completely. I really don't recommend it.

From: Nick Pentreath nick.pentre...@gmail.com
Reply-To: user@spark.apache.org
Date: Friday, July 11, 2014, 3:18 PM
Subject: Re: Recommended pipeline automation tool? Oozie?
Re: Recommended pipeline automation tool? Oozie?
Did you use the old Azkaban or Azkaban 2.5? It has been completely rewritten. I'm not saying it is the best, but I found it way better than Oozie, for example.

Sent from my iPhone

(In reply to 明风 mingf...@taobao.com, 11 Jul 2014, at 09:24.)
Re: Recommended pipeline automation tool? Oozie?
Just curious: how about using Scala to drive the workflow? I guess if you use other tools (Oozie, etc.) you lose the advantage of reading from an RDD -- you have to read from HDFS.

Best regards, Wei

-
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan

From: k.tham kevins...@gmail.com
To: u...@spark.incubator.apache.org
Date: 07/10/2014 01:20 PM
Subject: Recommended pipeline automation tool? Oozie?
Re: Recommended pipeline automation tool? Oozie?
I like the idea of using Scala to drive the workflow. Spark already comes with a scheduler - why not write a plugin to schedule other types of tasks (copy a file, send an email, etc.)? Scala could handle any logic required by the pipeline, and passing objects (including RDDs) between tasks is also easier. I don't know if this is an overuse of the Spark scheduler, but it sounds like a good tool. The only issue would be releasing resources that are not used at intermediate steps.

(In reply to Wei Tan w...@us.ibm.com, Fri, Jul 11, 2014 at 12:05 PM.)

--
Li
@vrilleup
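Driving the workflow from one driver program, as Wei and Li suggest, makes the branching from the original question ordinary control flow. A minimal sketch, with Job A/B, script C, step D, fallback E, and email F all standing in as placeholders for real Spark jobs and scripts:

```python
# Orchestrating the pipeline directly in the driver program: jobs become
# function calls, failure handling becomes try/except. All the job bodies
# below are illustrative stubs.

def run_job_a(): print("job A")        # e.g. a Spark job sharing this driver's context
def run_job_b(): print("job B")
def run_script_c(): print("script C")
def run_step_d(): raise RuntimeError("D failed")
def run_step_e(): print("fallback E")
def send_email_f(reason): print("email F: " + reason)

def pipeline():
    try:
        run_job_a()
    except Exception as e:
        send_email_f(str(e))           # "if Job A fails, send email F"
        return "failed"
    run_job_b()                        # "then B"
    run_script_c()                     # "then invoke script C"
    try:
        run_step_d()                   # "then do D"
    except Exception:
        run_step_e()                   # "and if D fails, do E"
    return "done"

print(pipeline())
```

The trade-off is that you give up the out-of-process scheduling, retries, and monitoring that Oozie/Azkaban/Luigi provide, but jobs in the same driver can hand RDDs to each other directly.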
Recommended pipeline automation tool? Oozie?
I'm just wondering what's the general recommendation for data pipeline automation. Say, I want to run Spark Job A, then B, then invoke script C, then do D, and if D fails, do E, and if Job A fails, send email F, etc... It looks like Oozie might be the best choice. But I'd like some advice/suggestions. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-pipeline-automation-tool-Oozie-tp9319.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Recommended pipeline automation tool? Oozie?
We use Luigi for this purpose. (Our pipelines are typically on AWS (no EMR) backed by S3 and using combinations of Python jobs, non-Spark Java/Scala, and Spark. We run Spark jobs by connecting drivers/clients to the master, and those are what is invoked from Luigi.)

--
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/

(In reply to k.tham kevins...@gmail.com, Thu, Jul 10, 2014 at 10:20 AM.)
Re: Recommended pipeline automation tool? Oozie?
I used both - Oozie and Luigi - but found them inflexible and still overcomplicated, especially in the presence of Spark.

Oozie has a fixed list of building blocks, which is pretty limiting. For example, you can launch a Hive query, but Impala, Shark/SparkSQL, etc. are out of scope (of course, you can always write a wrapper as a Java or Shell action, but does it really need to be so complicated?). Another issue with Oozie is passing variables between actions. There's the Oozie context, which is suitable for passing key-value pairs (both strings) between actions, but for more complex objects (say, a FileInputStream that should be closed at the last step only) you have to do some advanced kung fu.

Luigi, on the other hand, has its niche - complicated dataflows with many tasks that depend on each other. Basically, there are tasks (this is where you define computations) and targets (something that can exist - a file on disk, an entry in ZooKeeper, etc.). You ask Luigi to get some target, and it creates a plan for achieving it. Luigi really shines when your workflow fits this model, but one step away and you are in trouble. For example, consider a simple pipeline: run an MR job and output temporary data, run another MR job and output final data, clean up the temporary data. You can make a target Clean that depends on target MRJob2, which in its turn depends on MRJob1, right? Not so easy. How do you check that the Clean task is achieved? If you just test whether the temporary directory is empty or not, you catch both cases - when all tasks are done and when they have not even started yet. Luigi allows you to specify all 3 actions - MRJob1, MRJob2, Clean - in a single run() method, but that ruins the entire idea.

And of course, both of these frameworks are optimized for standard MapReduce jobs, which is probably not what you want on a Spark mailing list :) Experience with these frameworks, however, gave me some insights about typical data pipelines.

1. Pipelines are mostly linear. Oozie, Luigi, and a number of other frameworks allow branching, but most pipelines actually consist of moving data from a source to a destination, with possibly some transformations in between (I'll be glad if somebody shares use cases where you really need branching).
2. Transactional logic is important. Either everything, or nothing. Otherwise it's really easy to get into an inconsistent state.
3. Extensibility is important. You never know what you will need in a week or two.

So eventually I decided that it is much easier to create your own pipeline than to try to adapt your code to existing frameworks. My latest pipeline incarnation simply consists of a list of steps that are started sequentially. Each step is a class with at least these methods:

* run() - launch this step
* fail() - what to do if the step fails
* finalize() - (optional) what to do when all steps are done

For example, if you want to add the possibility to run Spark jobs, you just create a SparkStep and configure it with the required code. If you want a Hive query - just create a HiveStep and configure it with Hive connection settings. I use a YAML file to configure the steps and a Context (basically, a Map[String, Any]) to pass variables between them. I also use a configurable Reporter, available to all steps, to report progress.

Hopefully, this will give you some insights about the best pipeline for your specific case.

(In reply to Paul Brown p...@mult.ifario.us, Thu, Jul 10, 2014 at 9:10 PM.)
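The Clean trap Andrei describes can be reproduced with a toy version of the task/target model. This is an illustration of the scheduling idea only, not Luigi's actual API; all class names besides MRJob1/MRJob2/Clean are invented for the sketch:

```python
import os
import tempfile

# Toy task/target model: a task is considered "achieved" when its target
# exists, and the scheduler skips any task whose target already exists.
class Task:
    requires = []                 # tasks that must be achieved first
    def target_exists(self): ...  # the "target": something that can exist
    def run(self): ...

def achieve(task):
    """Like a target-based scheduler: an existing target means 'already done'."""
    if task.target_exists():
        return
    for dep in task.requires:
        achieve(dep)
    task.run()

workdir = tempfile.mkdtemp()
tmp_path = os.path.join(workdir, "intermediate.txt")
out_path = os.path.join(workdir, "final.txt")

class MRJob1(Task):
    def target_exists(self): return os.path.exists(tmp_path)
    def run(self):
        with open(tmp_path, "w") as f: f.write("intermediate")

class MRJob2(Task):
    requires = [MRJob1()]
    def target_exists(self): return os.path.exists(out_path)
    def run(self):
        with open(tmp_path) as f, open(out_path, "w") as g:
            g.write(f.read().upper())

class Clean(Task):
    requires = [MRJob2()]
    # The ambiguity: "temp data is absent" is also true before the
    # pipeline has ever run, so this target cannot tell the two apart.
    def target_exists(self): return not os.path.exists(tmp_path)
    def run(self): os.remove(tmp_path)

achieve(Clean())
print(os.path.exists(out_path))  # False: on a fresh state Clean already looks
                                 # achieved, so MRJob1/MRJob2 never run

achieve(MRJob2())                # asking for a real artifact works fine
with open(out_path) as f:
    print(f.read())              # INTERMEDIATE
```

The model shines when every task produces a durable artifact (MRJob2); it breaks down for side effects like cleanup, whose "target" is the absence of something.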
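The step interface Andrei describes can be sketched in a few lines. The Step/SparkStep names come from his message (his Context is a Scala Map[String, Any]; a dict plays that role here), while the concrete steps and the rollback loop are an illustrative reading of his fail() hook:

```python
# Sequential step pipeline: steps run in order and share a context of named
# values. If any step fails, already-run steps get a chance to undo their
# work, reversed order - "either everything, or nothing".

class Step:
    def run(self, context): ...        # launch this step
    def fail(self, context): ...       # what to do if a step fails
    def finalize(self, context): ...   # optional: after all steps are done

class ProduceStep(Step):               # stands in for e.g. a SparkStep
    def run(self, context): context["data"] = [1, 2, 3]
    def fail(self, context): context.pop("data", None)

class SumStep(Step):
    def run(self, context): context["total"] = sum(context["data"])
    def fail(self, context): context.pop("total", None)

def run_pipeline(steps, context):
    done = []
    try:
        for step in steps:
            step.run(context)
            done.append(step)
    except Exception:
        for s in reversed(done + [step]):   # roll back, latest first
            s.fail(context)
        raise
    for step in done:
        step.finalize(context)
    return context

ctx = run_pipeline([ProduceStep(), SumStep()], {})
print(ctx["total"])  # 6
```

In Andrei's setup the step list and each step's parameters would come from the YAML file rather than being constructed in code, and a Reporter object would be passed alongside the context.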