Re: Recommended pipeline automation tool? Oozie?
If you're already using Scala for Spark programming and you hate Oozie XML as much as I do ;), you might check out Scoozie, a Scala DSL for Oozie: https://github.com/klout/scoozie

(In reply to Andrei faithlessfri...@gmail.com, Thu, Jul 10, 2014 at 5:52 PM.)

--
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com
Re: Recommended pipeline automation tool? Oozie?
You may look into the new Azkaban - which, while quite heavyweight, is actually quite pleasant to use once set up. You can run Spark jobs (spark-submit) using Azkaban shell commands and pass parameters between jobs. It supports dependencies, simple DAGs, and scheduling with retries. I'm digging deeper, and it may be worthwhile extending it with a Spark job type... It's probably best for mixed Hadoop / Spark clusters...

— Sent from Mailbox

(In reply to Andrei faithlessfri...@gmail.com, Fri, Jul 11, 2014 at 12:52 AM.)
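For concreteness, the Azkaban flow Nick describes is just a set of .job property files, one per job, chained with the dependencies property. Everything below (job names, paths, the master URL, the parameter name) is illustrative, not taken from the thread:

```properties
# spark_etl.job - run a Spark application through an Azkaban "command" job
type=command
command=spark-submit --master spark://master:7077 --class com.example.Etl /jobs/etl.jar ${input.path}

# report.job - a downstream job; Azkaban runs it only after spark_etl succeeds
type=command
command=python /jobs/report.py
dependencies=spark_etl
```

Uploading the two files as a flow gives you the dependency DAG, scheduling, and retries from the Azkaban UI; ${input.path} is substituted from flow parameters at run time.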
Re: Recommended pipeline automation tool? Oozie?
We used Azkaban for a short while and suffered a lot. In the end we almost rewrote it completely. I really don't recommend it.

From: Nick Pentreath nick.pentre...@gmail.com
Reply-To: user@spark.apache.org
Date: Friday, July 11, 2014, 3:18 PM
Subject: Re: Recommended pipeline automation tool? Oozie?
Re: Recommended pipeline automation tool? Oozie?
Did you use the old Azkaban or Azkaban 2.5? It has been completely rewritten. I'm not saying it is the best, but I found it way better than Oozie, for example.

Sent from my iPhone

(In reply to 明风 mingf...@taobao.com, 11 Jul 2014, at 09:24.)
Re: Recommended pipeline automation tool? Oozie?
Just curious: how about using Scala to drive the workflow? I guess if you use other tools (Oozie, etc.) you lose the advantage of reading from an RDD -- you have to read from HDFS.

Best regards, Wei

-
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan

From: k.tham kevins...@gmail.com
To: u...@spark.incubator.apache.org
Date: 07/10/2014 01:20 PM
Subject: Recommended pipeline automation tool? Oozie?
Re: Recommended pipeline automation tool? Oozie?
I like the idea of using Scala to drive the workflow. Spark already comes with a scheduler - why not write a plugin to schedule other types of tasks (copy a file, send an email, etc.)? Scala could handle any logic required by the pipeline, and passing objects (including RDDs) between tasks is also easier. I don't know if this is an overuse of the Spark scheduler, but it sounds like a good tool. The only issue would be releasing resources that are not used at intermediate steps.

(In reply to Wei Tan w...@us.ibm.com, Fri, Jul 11, 2014 at 12:05 PM.)

--
Li
@vrilleup
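Driving the workflow from one driver program, as Wei and Li suggest, makes the branching from the original question ordinary control flow. A minimal sketch, with Job A/B, script C, step D, fallback E, and email F all standing in as placeholders for real Spark jobs and scripts:

```python
# Orchestrating the pipeline directly in the driver program: jobs become
# function calls, failure handling becomes try/except. All the job bodies
# below are illustrative stubs.

def run_job_a(): print("job A")        # e.g. a Spark job sharing this driver's context
def run_job_b(): print("job B")
def run_script_c(): print("script C")
def run_step_d(): raise RuntimeError("D failed")
def run_step_e(): print("fallback E")
def send_email_f(reason): print("email F: " + reason)

def pipeline():
    try:
        run_job_a()
    except Exception as e:
        send_email_f(str(e))           # "if Job A fails, send email F"
        return "failed"
    run_job_b()                        # "then B"
    run_script_c()                     # "then invoke script C"
    try:
        run_step_d()                   # "then do D"
    except Exception:
        run_step_e()                   # "and if D fails, do E"
    return "done"

print(pipeline())
```

The trade-off is that you give up the out-of-process scheduling, retries, and monitoring that Oozie/Azkaban/Luigi provide, but jobs in the same driver can hand RDDs to each other directly.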
Recommended pipeline automation tool? Oozie?
I'm just wondering what's the general recommendation for data pipeline automation. Say, I want to run Spark Job A, then B, then invoke script C, then do D, and if D fails, do E, and if Job A fails, send email F, etc... It looks like Oozie might be the best choice. But I'd like some advice/suggestions. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-pipeline-automation-tool-Oozie-tp9319.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Recommended pipeline automation tool? Oozie?
We use Luigi for this purpose. (Our pipelines are typically on AWS (no EMR) backed by S3 and using combinations of Python jobs, non-Spark Java/Scala, and Spark. We run Spark jobs by connecting drivers/clients to the master, and those are what is invoked from Luigi.)

--
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/

(In reply to k.tham kevins...@gmail.com, Thu, Jul 10, 2014 at 10:20 AM.)
Re: Recommended pipeline automation tool? Oozie?
I used both - Oozie and Luigi - but found them inflexible and still overcomplicated, especially in the presence of Spark.

Oozie has a fixed list of building blocks, which is pretty limiting. For example, you can launch a Hive query, but Impala, Shark/SparkSQL, etc. are out of scope (of course, you can always write a wrapper as a Java or Shell action, but does it really need to be so complicated?). Another issue with Oozie is passing variables between actions. There's the Oozie context, which is suitable for passing key-value pairs (both strings) between actions, but for more complex objects (say, a FileInputStream that should be closed at the last step only) you have to do some advanced kung fu.

Luigi, on the other hand, has its niche - complicated dataflows with many tasks that depend on each other. Basically, there are tasks (this is where you define computations) and targets (something that can exist - a file on disk, an entry in ZooKeeper, etc.). You ask Luigi to get some target, and it creates a plan for achieving it. Luigi really shines when your workflow fits this model, but one step away and you are in trouble. For example, consider a simple pipeline: run an MR job and output temporary data, run another MR job and output final data, clean up the temporary data. You can make a target Clean that depends on target MRJob2, which in its turn depends on MRJob1, right? Not so easy. How do you check that the Clean task is achieved? If you just test whether the temporary directory is empty or not, you catch both cases - when all tasks are done and when they have not even started yet. Luigi allows you to specify all 3 actions - MRJob1, MRJob2, Clean - in a single run() method, but that ruins the entire idea.

And of course, both of these frameworks are optimized for standard MapReduce jobs, which is probably not what you want on a Spark mailing list :) Experience with these frameworks, however, gave me some insights about typical data pipelines.

1. Pipelines are mostly linear. Oozie, Luigi, and a number of other frameworks allow branching, but most pipelines actually consist of moving data from a source to a destination, with possibly some transformations in between (I'll be glad if somebody shares use cases where you really need branching).
2. Transactional logic is important. Either everything, or nothing. Otherwise it's really easy to get into an inconsistent state.
3. Extensibility is important. You never know what you will need in a week or two.

So eventually I decided that it is much easier to create your own pipeline than to try to adapt your code to existing frameworks. My latest pipeline incarnation simply consists of a list of steps that are started sequentially. Each step is a class with at least these methods:

* run() - launch this step
* fail() - what to do if the step fails
* finalize() - (optional) what to do when all steps are done

For example, if you want to add the possibility to run Spark jobs, you just create a SparkStep and configure it with the required code. If you want a Hive query - just create a HiveStep and configure it with Hive connection settings. I use a YAML file to configure the steps and a Context (basically, a Map[String, Any]) to pass variables between them. I also use a configurable Reporter, available to all steps, to report progress.

Hopefully, this will give you some insights about the best pipeline for your specific case.

(In reply to Paul Brown p...@mult.ifario.us, Thu, Jul 10, 2014 at 9:10 PM.)
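The Clean trap Andrei describes can be reproduced with a toy version of the task/target model. This is an illustration of the scheduling idea only, not Luigi's actual API; all class names besides MRJob1/MRJob2/Clean are invented for the sketch:

```python
import os
import tempfile

# Toy task/target model: a task is considered "achieved" when its target
# exists, and the scheduler skips any task whose target already exists.
class Task:
    requires = []                 # tasks that must be achieved first
    def target_exists(self): ...  # the "target": something that can exist
    def run(self): ...

def achieve(task):
    """Like a target-based scheduler: an existing target means 'already done'."""
    if task.target_exists():
        return
    for dep in task.requires:
        achieve(dep)
    task.run()

workdir = tempfile.mkdtemp()
tmp_path = os.path.join(workdir, "intermediate.txt")
out_path = os.path.join(workdir, "final.txt")

class MRJob1(Task):
    def target_exists(self): return os.path.exists(tmp_path)
    def run(self):
        with open(tmp_path, "w") as f: f.write("intermediate")

class MRJob2(Task):
    requires = [MRJob1()]
    def target_exists(self): return os.path.exists(out_path)
    def run(self):
        with open(tmp_path) as f, open(out_path, "w") as g:
            g.write(f.read().upper())

class Clean(Task):
    requires = [MRJob2()]
    # The ambiguity: "temp data is absent" is also true before the
    # pipeline has ever run, so this target cannot tell the two apart.
    def target_exists(self): return not os.path.exists(tmp_path)
    def run(self): os.remove(tmp_path)

achieve(Clean())
print(os.path.exists(out_path))  # False: on a fresh state Clean already looks
                                 # achieved, so MRJob1/MRJob2 never run

achieve(MRJob2())                # asking for a real artifact works fine
with open(out_path) as f:
    print(f.read())              # INTERMEDIATE
```

The model shines when every task produces a durable artifact (MRJob2); it breaks down for side effects like cleanup, whose "target" is the absence of something.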
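The step interface Andrei describes can be sketched in a few lines. The Step/SparkStep names come from his message (his Context is a Scala Map[String, Any]; a dict plays that role here), while the concrete steps and the rollback loop are an illustrative reading of his fail() hook:

```python
# Sequential step pipeline: steps run in order and share a context of named
# values. If any step fails, already-run steps get a chance to undo their
# work, reversed order - "either everything, or nothing".

class Step:
    def run(self, context): ...        # launch this step
    def fail(self, context): ...       # what to do if a step fails
    def finalize(self, context): ...   # optional: after all steps are done

class ProduceStep(Step):               # stands in for e.g. a SparkStep
    def run(self, context): context["data"] = [1, 2, 3]
    def fail(self, context): context.pop("data", None)

class SumStep(Step):
    def run(self, context): context["total"] = sum(context["data"])
    def fail(self, context): context.pop("total", None)

def run_pipeline(steps, context):
    done = []
    try:
        for step in steps:
            step.run(context)
            done.append(step)
    except Exception:
        for s in reversed(done + [step]):   # roll back, latest first
            s.fail(context)
        raise
    for step in done:
        step.finalize(context)
    return context

ctx = run_pipeline([ProduceStep(), SumStep()], {})
print(ctx["total"])  # 6
```

In Andrei's setup the step list and each step's parameters would come from the YAML file rather than being constructed in code, and a Reporter object would be passed alongside the context.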