I also tend to agree that Azkaban is somewhat easier to get set up. Though I haven't used the new UI for Oozie that is part of CDH, so perhaps that is another good option.
It's a pity Azkaban is a little rough in terms of documenting its API, and the scalability is an issue. However, it would be possible to have a few different instances running for different use cases / groups within the org perhaps.

— Sent from Mailbox

On Wed, Aug 12, 2015 at 12:14 AM, Vikram Kone <vikramk...@gmail.com> wrote:
> Hi Lars,
> Thanks for the brain dump. All the points you made about target audience,
> degree of high availability, and time based scheduling instead of event
> based scheduling are all valid and make sense. In our case, most of our
> devs are .NET based, so XML or web based scheduling is preferred over
> something written in Java/Scala/Python. Based on my research so far on
> the available workflow managers today, Azkaban is the easiest to adopt,
> since it has no hard dependency on Hadoop and it is easy to onboard and
> schedule jobs. I was able to install and execute some Spark workflows in
> a day. Though the fact that it's being phased out at LinkedIn is
> troubling, I think it's the best suited for our use case today.
>
> Sent from Outlook
>
> On Sun, Aug 9, 2015 at 4:51 PM -0700, "Lars Albertsson"
> <lars.alberts...@gmail.com> wrote:
>
> I used to maintain Luigi at Spotify, and got some insight into workflow
> manager characteristics and production behaviour in the process.
>
> I am evaluating options for my current employer, and the short list is
> basically: Luigi, Azkaban, Pinball, Airflow, and rolling our own. The
> latter is not necessarily more work than adapting an existing tool,
> since existing managers are typically more or less tied to the
> technology used by the company that created them.
>
> Are your users primarily developers building pipelines that drive
> data-intensive products, or are they analysts producing business
> intelligence? These groups tend to have preferences for different types
> of tools and interfaces.
>
> I have a love/hate relationship with Luigi, but given your requirements,
> it is probably the best fit:
>
> * It has support for Spark, and it seems to be used and maintained.
> * It has no builtin support for Cassandra, but Cassandra is heavily used
>   at Spotify. IIRC, the code required to support Cassandra targets is
>   more or less trivial. There is no obvious single definition of a
>   dataset in C*, so you'll have to come up with a convention and encode
>   it as a Target subclass. I guess that is why it never made it outside
>   Spotify.
> * The open source community is active, and it is well tested in
>   production at multiple sites.
> * It is easy to write dependencies, but in a Python DSL. If your users
>   are developers, this is preferable to XML or a web interface. There
>   are always quirks and odd constraints somewhere that require the
>   expressive power of a programming language. It also allows you to
>   create extensions without changing Luigi itself.
> * It does not have recurring scheduling built in. Luigi needs a motor to
>   get going, typically cron, installed on a few machines for redundancy.
>   In a typical pipeline scenario, you give output datasets a time
>   parameter, which arranges for a dataset to be produced each
>   hour/day/week/month.
> * It supports failure notifications.
>
> Pinball and Airflow have a similar architecture to Luigi, with a single
> central scheduler and workers that submit and execute jobs. They seem to
> be more solidly engineered at a glance, but are less battle tested
> outside Pinterest/Airbnb, and they have fewer integrations with the data
> ecosystem.
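[To make Lars's last few Luigi bullets concrete (the time parameter, the requires() dependency DSL, and a home grown Cassandra Target convention), here is a rough, untested sketch. Everything in it is invented for illustration: the "analytics" keyspace, the table names, and especially the dataset_ready marker table convention, which is just one possible way to define what "a dataset" means in C*.

    import luigi


    class CassandraTarget(luigi.Target):
        """Marks one (table, partition) pair as a dataset.

        Convention (invented): a marker table
          dataset_ready(table_name text, partition_date text,
                        PRIMARY KEY (table_name, partition_date))
        gets a row when a producing job has sealed that partition.
        """

        def __init__(self, keyspace, table, partition_date):
            self.keyspace = keyspace
            self.table = table
            self.partition_date = partition_date

        def exists(self):
            from cassandra.cluster import Cluster  # pip install cassandra-driver
            session = Cluster().connect(self.keyspace)
            rows = session.execute(
                "SELECT table_name FROM dataset_ready"
                " WHERE table_name = %s AND partition_date = %s",
                (self.table, str(self.partition_date)))
            return bool(list(rows))


    class CleanEvents(luigi.Task):
        """Job B: produces one clean_events partition per day."""
        date = luigi.DateParameter()

        def output(self):
            return CassandraTarget("analytics", "clean_events", self.date)

        def run(self):
            # ... submit the Spark job for self.date here ...
            # then seal the dataset by inserting the marker row:
            from cassandra.cluster import Cluster
            session = Cluster().connect("analytics")
            session.execute(
                "INSERT INTO dataset_ready (table_name, partition_date)"
                " VALUES (%s, %s)", ("clean_events", str(self.date)))


    class DailyReport(luigi.Task):
        """Job A: depends on job B; Luigi runs B first if needed."""
        date = luigi.DateParameter()

        def requires(self):
            return CleanEvents(date=self.date)

        def output(self):
            return CassandraTarget("analytics", "daily_report", self.date)

        def run(self):
            pass  # same pattern: do the work, then insert the marker row

The "motor" Lars mentions is then just a cron entry on a couple of machines, something like: luigi --module pipeline DailyReport --date $(date +%F). Because the date parameter pins exactly which partition is produced, rerunning a failed day is deterministic.]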
> Azkaban has a different architecture and user interface, and seems more
> geared towards data scientists than developers; it has a good UI for
> controlling jobs, but writing extensions and controlling it
> programmatically seems more difficult than for Luigi.
>
> All of the tools above are centralised, and the central component can
> become a bottleneck and a single point of failure. I am not aware of any
> decentralised open source workflow managers, but you can run multiple
> instances and shard manually.
>
> Regarding recurring jobs, it is typically undesirable to blindly run
> jobs at a certain time. If you run jobs, e.g. with cron, and process
> whatever data is available in your input sources, your jobs become
> nondeterministic and unreliable. If incoming data is late or missing,
> your jobs will fail or create artificial skews in the output data,
> leading to confusing results. Moreover, if jobs fail or have bugs, it
> will be difficult to rerun them and get predictable results. This is why
> I don't think Chronos is a meaningful alternative for scheduling data
> processing.
>
> There are different strategies on this topic, but IMHO it is easiest to
> create predictable and reliable pipelines by bucketing incoming data
> into datasets that you seal off and mark ready for processing, and then
> using the workflow manager's DAG logic to process data when input
> datasets are available, rather than at a certain time. If you use Kafka
> for data collection, Secor can handle this logic for you.
>
> In addition to your requirements, there are IMHO a few more topics one
> needs to consider:
>
> * How are pipelines tested? I.e. if I change job B below, how can I be
>   sure that the new output does not break job A? You need to involve the
>   workflow DAG in testing such scenarios.
> * How do you debug jobs and DAG problems? In case of trouble, can you
>   figure out where the job logs are, or why a particular job does not
>   start?
> * Do you need high availability for job scheduling? That will require
>   additional components.
>
> This became a bit of a brain dump on the topic. I hope that it is
> useful. Don't hesitate to get back if I can help.
>
> Regards,
>
> Lars Albertsson
>
> On Fri, Aug 7, 2015 at 5:43 PM, Vikram Kone wrote:
>> Hi,
>> I'm looking for open source workflow tools/engines that allow us to
>> schedule Spark jobs on a DataStax Cassandra cluster. Since there are
>> tonnes of alternatives out there, like Oozie, Azkaban, Luigi, Chronos,
>> etc., I wanted to check with people here to see what they are using
>> today.
>>
>> Some of the requirements of the workflow engine that I'm looking for:
>>
>> 1. First class support for submitting Spark jobs on Cassandra. Not some
>>    wrapper Java code to submit tasks.
>> 2. Active open source community support, and well tested at production
>>    scale.
>> 3. Should be dead easy to write job dependencies using XML or a web
>>    interface. Ex: job A depends on job B and job C, so run job A after
>>    B and C are finished. We shouldn't need to write full blown Java
>>    applications to specify job parameters and dependencies. Should be
>>    very simple to use.
>> 4. Time based recurrent scheduling. Run the Spark jobs at a given time
>>    every hour, day, week, or month.
>> 5. Job monitoring, alerting on failures, and email notifications on a
>>    daily basis.
>>
>> I have looked at Ooyala's Spark Job Server, which seems to be geared
>> towards making Spark jobs run faster by sharing contexts between the
>> jobs, but isn't a full blown workflow engine per se.
>> A combination of Spark Job Server and a workflow engine would be ideal.
>>
>> Thanks for the inputs.
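[Picking up Lars's point above about sealing off input buckets and letting the DAG trigger on data availability rather than on the clock: here is one more rough, untested sketch in Luigi terms. The paths are invented; the _SUCCESS flag convention is borrowed from Hadoop, and whatever seals the bucket (Secor, or a small collector script) would be responsible for writing it.

    import luigi


    class IncomingEvents(luigi.ExternalTask):
        """An hourly input bucket produced outside this pipeline.

        The collector seals a bucket by writing a _SUCCESS flag; until
        that flag exists, Luigi will not start anything downstream.
        """
        hour = luigi.DateHourParameter()

        def output(self):
            # Invented path convention; HDFS/S3 targets work the same way.
            return luigi.LocalTarget(self.hour.strftime(
                "/data/incoming/events/%Y-%m-%d/%H/_SUCCESS"))


    class HourlyAggregate(luigi.Task):
        """Runs when its input bucket is sealed, not at a wall clock time."""
        hour = luigi.DateHourParameter()

        def requires(self):
            return IncomingEvents(hour=self.hour)

        def output(self):
            return luigi.LocalTarget(self.hour.strftime(
                "/data/aggregates/events/%Y-%m-%d-%H.done"))

        def run(self):
            # Process exactly this hour's sealed bucket (e.g. submit a
            # Spark job), then write the done marker so reruns are
            # idempotent.
            with self.output().open("w") as marker:
                marker.write("ok\n")

Cron still provides the heartbeat, but a given hour's job only runs once its input flag exists, so late data delays the job instead of silently skewing its output, and rerunning with the same hour parameter is deterministic.]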