I also tend to agree that Azkaban is somewhat easier to get set up. Though I haven't used the new UI for Oozie that is part of CDH, so perhaps that is another good option.
It's a pity Azkaban is a little rough in terms of documenting its API, and the scalability is an issue. However, it would be possible to have a few different instances running for different use cases / groups within the org perhaps.

— Sent from Mailbox

On Wed, Aug 12, 2015 at 12:14 AM, Vikram Kone <vikramk...@gmail.com> wrote:
> Hi Lars,
> Thanks for the brain dump. All the points you made about target audience,
> degree of high availability, and time based scheduling instead of event
> based scheduling are all valid and make sense. In our case, most of our
> devs are .NET based, so XML or web based scheduling is preferred over
> something written in Java/Scala/Python. Based on my research so far on
> the available workflow managers today, Azkaban is the easiest to adopt,
> since it has no hard dependency on Hadoop and it is easy to onboard and
> schedule jobs. I was able to install and execute some Spark workflows in
> a day. Though the fact that it's being phased out at LinkedIn is
> troubling, I think it's the best suited for our use case today.
>
> Sent from Outlook
>
> On Sun, Aug 9, 2015 at 4:51 PM -0700, "Lars Albertsson"
> <lars.alberts...@gmail.com> wrote:
>
> I used to maintain Luigi at Spotify, and got some insight into workflow
> manager characteristics and production behaviour in the process.
>
> I am evaluating options for my current employer, and the short list is
> basically: Luigi, Azkaban, Pinball, Airflow, and rolling our own. The
> latter is not necessarily more work than adapting an existing tool,
> since existing managers are typically more or less tied to the
> technology used by the company that created them.
>
> Are your users primarily developers building pipelines that drive
> data-intensive products, or are they analysts producing business
> intelligence? These groups tend to have preferences for different types
> of tools and interfaces.
>
> I have a love/hate relationship with Luigi, but given your requirements,
> it is probably the best fit:
>
> * It has support for Spark, and it seems to be used and maintained.
> * It has no builtin support for Cassandra, but Cassandra is heavily used
>   at Spotify. IIRC, the code required to support Cassandra targets is
>   more or less trivial. There is no obvious single definition of a
>   dataset in C*, so you'll have to come up with a convention and encode
>   it as a Target subclass. I guess that is why it never made it outside
>   Spotify.
> * The open source community is active, and it is well tested in
>   production at multiple sites.
> * It is easy to write dependencies, but in a Python DSL. If your users
>   are developers, this is preferable to XML or a web interface. There
>   are always quirks and odd constraints somewhere that require the
>   expressive power of a programming language. It also allows you to
>   create extensions without changing Luigi itself.
> * It does not have recurring scheduling built in. Luigi needs a motor to
>   get going, typically cron, installed on a few machines for redundancy.
>   In a typical pipeline scenario, you give output datasets a time
>   parameter, which arranges for a dataset to be produced each
>   hour/day/week/month.
> * It supports failure notifications.
>
> Pinball and Airflow have a similar architecture to Luigi, with a single
> central scheduler and workers that submit and execute jobs. They seem to
> be more solidly engineered at a glance, but are less battle tested
> outside Pinterest/Airbnb, and they have fewer integrations with the data
> ecosystem.
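[To make Lars's last few Luigi bullets concrete (the time parameter, the requires() dependency DSL, and a home grown Cassandra Target convention), here is a rough, untested sketch. Everything in it is invented for illustration: the "analytics" keyspace, the table names, and especially the dataset_ready marker table convention, which is just one possible way to define what "a dataset" means in C*.

    import luigi


    class CassandraTarget(luigi.Target):
        """Marks one (table, partition) pair as a dataset.

        Convention (invented): a marker table
          dataset_ready(table_name text, partition_date text,
                        PRIMARY KEY (table_name, partition_date))
        gets a row when a producing job has sealed that partition.
        """

        def __init__(self, keyspace, table, partition_date):
            self.keyspace = keyspace
            self.table = table
            self.partition_date = partition_date

        def exists(self):
            from cassandra.cluster import Cluster  # pip install cassandra-driver
            session = Cluster().connect(self.keyspace)
            rows = session.execute(
                "SELECT table_name FROM dataset_ready"
                " WHERE table_name = %s AND partition_date = %s",
                (self.table, str(self.partition_date)))
            return bool(list(rows))


    class CleanEvents(luigi.Task):
        """Job B: produces one clean_events partition per day."""
        date = luigi.DateParameter()

        def output(self):
            return CassandraTarget("analytics", "clean_events", self.date)

        def run(self):
            # ... submit the Spark job for self.date here ...
            # then seal the dataset by inserting the marker row:
            from cassandra.cluster import Cluster
            session = Cluster().connect("analytics")
            session.execute(
                "INSERT INTO dataset_ready (table_name, partition_date)"
                " VALUES (%s, %s)", ("clean_events", str(self.date)))


    class DailyReport(luigi.Task):
        """Job A: depends on job B; Luigi runs B first if needed."""
        date = luigi.DateParameter()

        def requires(self):
            return CleanEvents(date=self.date)

        def output(self):
            return CassandraTarget("analytics", "daily_report", self.date)

        def run(self):
            pass  # same pattern: do the work, then insert the marker row

The "motor" Lars mentions is then just a cron entry on a couple of machines, something like: luigi --module pipeline DailyReport --date $(date +%F). Because the date parameter pins exactly which partition is produced, rerunning a failed day is deterministic.]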
> Azkaban has a different architecture and user interface, and seems more
> geared towards data scientists than developers; it has a good UI for
> controlling jobs, but writing extensions and controlling it
> programmatically seems more difficult than for Luigi.
>
> All of the tools above are centralised, and the central component can
> become a bottleneck and a single point of failure. I am not aware of any
> decentralised open source workflow managers, but you can run multiple
> instances and shard manually.
>
> Regarding recurring jobs, it is typically undesirable to blindly run
> jobs at a certain time. If you run jobs, e.g. with cron, and process
> whatever data is available in your input sources, your jobs become
> nondeterministic and unreliable. If incoming data is late or missing,
> your jobs will fail or create artificial skews in the output data,
> leading to confusing results. Moreover, if jobs fail or have bugs, it
> will be difficult to rerun them and get predictable results. This is why
> I don't think Chronos is a meaningful alternative for scheduling data
> processing.
>
> There are different strategies on this topic, but IMHO it is easiest to
> create predictable and reliable pipelines by bucketing incoming data
> into datasets that you seal off and mark ready for processing, and then
> using the workflow manager's DAG logic to process data when input
> datasets are available, rather than at a certain time. If you use Kafka
> for data collection, Secor can handle this logic for you.
>
> In addition to your requirements, there are IMHO a few more topics one
> needs to consider:
>
> * How are pipelines tested? I.e. if I change job B below, how can I be
>   sure that the new output does not break job A? You need to involve the
>   workflow DAG in testing such scenarios.
> * How do you debug jobs and DAG problems? In case of trouble, can you
>   figure out where the job logs are, or why a particular job does not
>   start?
> * Do you need high availability for job scheduling? That will require
>   additional components.
>
> This became a bit of a brain dump on the topic. I hope that it is
> useful. Don't hesitate to get back if I can help.
>
> Regards,
>
> Lars Albertsson
>
> On Fri, Aug 7, 2015 at 5:43 PM, Vikram Kone wrote:
>> Hi,
>> I'm looking for open source workflow tools/engines that allow us to
>> schedule Spark jobs on a DataStax Cassandra cluster. Since there are
>> tonnes of alternatives out there, like Oozie, Azkaban, Luigi, Chronos,
>> etc., I wanted to check with people here to see what they are using
>> today.
>>
>> Some of the requirements of the workflow engine that I'm looking for:
>>
>> 1. First class support for submitting Spark jobs on Cassandra. Not some
>>    wrapper Java code to submit tasks.
>> 2. Active open source community support, and well tested at production
>>    scale.
>> 3. Should be dead easy to write job dependencies using XML or a web
>>    interface. Ex: job A depends on job B and job C, so run job A after
>>    B and C are finished. We shouldn't need to write full blown Java
>>    applications to specify job parameters and dependencies. Should be
>>    very simple to use.
>> 4. Time based recurrent scheduling. Run the Spark jobs at a given time
>>    every hour, day, week, or month.
>> 5. Job monitoring, alerting on failures, and email notifications on a
>>    daily basis.
>>
>> I have looked at Ooyala's Spark Job Server, which seems to be geared
>> towards making Spark jobs run faster by sharing contexts between the
>> jobs, but isn't a full blown workflow engine per se.
>> A combination of Spark Job Server and a workflow engine would be ideal.
>>
>> Thanks for the inputs.
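[Picking up Lars's point above about sealing off input buckets and letting the DAG trigger on data availability rather than on the clock: here is one more rough, untested sketch in Luigi terms. The paths are invented; the _SUCCESS flag convention is borrowed from Hadoop, and whatever seals the bucket (Secor, or a small collector script) would be responsible for writing it.

    import luigi


    class IncomingEvents(luigi.ExternalTask):
        """An hourly input bucket produced outside this pipeline.

        The collector seals a bucket by writing a _SUCCESS flag; until
        that flag exists, Luigi will not start anything downstream.
        """
        hour = luigi.DateHourParameter()

        def output(self):
            # Invented path convention; HDFS/S3 targets work the same way.
            return luigi.LocalTarget(self.hour.strftime(
                "/data/incoming/events/%Y-%m-%d/%H/_SUCCESS"))


    class HourlyAggregate(luigi.Task):
        """Runs when its input bucket is sealed, not at a wall clock time."""
        hour = luigi.DateHourParameter()

        def requires(self):
            return IncomingEvents(hour=self.hour)

        def output(self):
            return luigi.LocalTarget(self.hour.strftime(
                "/data/aggregates/events/%Y-%m-%d-%H.done"))

        def run(self):
            # Process exactly this hour's sealed bucket (e.g. submit a
            # Spark job), then write the done marker so reruns are
            # idempotent.
            with self.output().open("w") as marker:
                marker.write("ok\n")

Cron still provides the heartbeat, but a given hour's job only runs once its input flag exists, so late data delays the job instead of silently skewing its output, and rerunning with the same hour parameter is deterministic.]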