Re: [PROPOSAL] Oozie for the Apache Incubator
+1 Spectacular!

On Fri, Jun 24, 2011 at 12:46 PM, Mohammad Islam misla...@yahoo.com wrote:

Hi,

I would like to propose Oozie to be an Apache Incubator project. Oozie is a server-based workflow scheduling and coordination system to manage data processing jobs for Apache Hadoop.

Here's a link to the proposal in the Incubator wiki:
http://wiki.apache.org/incubator/OozieProposal

I've also pasted the initial contents below.

Regards,
Mohammad Islam

Start of Oozie Proposal

Abstract

Oozie is a server-based workflow scheduling and coordination system to manage data processing jobs for Apache Hadoop™.

Proposal

Oozie is an extensible, scalable and reliable system to define, manage, schedule, and execute complex Hadoop workloads via web services. More specifically, this includes:

* XML-based declarative framework to specify a job or a complex workflow of dependent jobs.
* Support for different types of jobs such as Hadoop Map-Reduce, Pipes, Streaming, Pig, Hive and custom Java applications.
* Workflow scheduling based on frequency and/or data availability.
* Monitoring capability, automatic retry and failure handling of jobs.
* Extensible and pluggable architecture to allow arbitrary grid programming paradigms.
* Authentication, authorization, and capacity-aware load throttling to allow multi-tenant software as a service.

Background

Most data processing applications require multiple jobs to achieve their goals, with inherent dependencies among the jobs. A dependency could be sequential, where one job can only start after another job has finished. Or it could be conditional, where the execution of a job depends on the return value or status of another job. In other cases, parallel execution of multiple jobs may be permitted, or desired, to exploit the massive pool of compute nodes provided by Hadoop. These job dependencies are often expressed as a Directed Acyclic Graph, also called a workflow.
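The sequential and parallel dependencies described above can be illustrated with a minimal sketch. This is not Oozie's implementation, nor its XML workflow format; it only shows how a dependency graph determines which jobs may run together. The job names and the `execution_waves` helper are hypothetical, and conditional transitions are not modeled. It assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a job, each value is the set of jobs
# that must finish before it can start.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},              # sequential: starts only after ingest
    "aggregate_pig": {"clean"},       # these two depend only on clean,
    "aggregate_hive": {"clean"},      # so they may run in parallel
    "report": {"aggregate_pig", "aggregate_hive"},
}


def execution_waves(dag):
    """Group jobs into waves; jobs in the same wave have no unmet
    dependencies on each other and could be submitted in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = execution_waves(workflow)
for number, wave in enumerate(waves):
    print(number, wave)
```

Each printed wave lists jobs whose prerequisites are all complete; a real runtime would additionally handle retries, failures, and conditional branches.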
A node in the workflow is typically a job (a computation on the grid) or another type of action such as an email notification. Computations can be expressed in map/reduce, Pig, Hive or any other programming paradigm available on the grid. Edges of the graph represent transitions from one node to the next as the execution of a workflow proceeds.

Describing a workflow in a declarative way has the advantage of decoupling job dependencies and execution control from application logic. Furthermore, the workflow is modularized into jobs that can be reused within the same workflow or across different workflows. Execution of the workflow is then driven by a runtime system that does not need to understand the application logic of the jobs. This runtime system specializes in reliable and predictable execution: it can retry actions that have failed or invoke a cleanup action after termination of the workflow; it can monitor progress, success, or failure of a workflow, and send appropriate alerts to an administrator. The application developer is relieved from implementing these generic procedures.

Furthermore, some applications or workflows need to run at periodic intervals or when dependent data is available. For example, a workflow could be executed every day as soon as output data from the previous 24 instances of another, hourly workflow is available. The workflow coordinator provides such scheduling features, along with prioritization, load balancing and throttling to optimize utilization of resources in the cluster. This makes it easier to maintain, control, and coordinate complex data applications.

Nearly three years ago, a team of Yahoo! developers addressed these critical requirements for Hadoop-based data processing systems by developing a new workflow management and scheduling system called Oozie. While it was initially developed as a Yahoo!-internal project, it was designed and implemented with the intention of open-sourcing.
Oozie was released as a GitHub project in early 2010. Oozie is used in production within Yahoo!, and since it was open-sourced it has been gaining adoption among external developers.

Rationale

Commonly, applications that run on Hadoop require multiple Hadoop jobs in order to obtain the desired results. Furthermore, these Hadoop jobs are commonly a combination of Java map-reduce jobs, Streaming map-reduce jobs, Pipes map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs and shell scripts. Because of this, developers find themselves writing ad-hoc glue programs to combine these Hadoop jobs. These ad-hoc programs are difficult to schedule, manage, monitor and recover.

Workflow management and scheduling is an essential feature for large-scale data processing applications. Such applications could write a customized solution that would require separate development, operational, and
Re: [VOTE] Accept Bigtop for incubation
towards Hadoop or related Apache projects (Alejandro Abdelnur, Konstantin Boudnik, Eli Collins, Alan Gates, Patrick Hunt, Steve Loughran, Owen O'Malley, John Sichi, Michael Stack, Tom White) and are familiar with Apache principles and philosophy for community-driven software development.

== Alignment ==

We expect projects in Bigtop to be drawn from Hadoop and related projects at Apache. Bigtop will complement these projects (Hadoop, Pig, Hive, HBase, etc.) by providing an environment where contributors interested in building more complex data processing pipelines can work together, integrating more than a single project into a well-tested whole.

= Known Risks =

== Orphaned Products ==

The contributors are leading vendors of Hadoop-based technologies and have a long standing in the Hadoop community. There is minimal risk of this work becoming non-strategic, and the contributors are confident that a larger community will form within the project in a relatively short space of time.

== Inexperience with Open Source ==

All code developed for Bigtop has been open sourced under the Apache 2.0 license. Most committers of the Bigtop project are intimately familiar with the Apache model for open-source development and are experienced in working with new contributors.

== Homogeneous Developers ==

The initial set of committers is from a small set of organizations and numerous existing Apache projects. We expect that once approved for incubation, the project will attract new contributors from more organizations and will thus grow organically.

== Reliance on Salaried Developers ==

It is expected that Bigtop will be developed on salaried and volunteer time, although all of the initial developers will work on it mainly on salaried time.

== Relationships with Other Apache Products ==

Bigtop depends upon other Apache projects including Apache Hadoop, Apache HBase, Apache Hive, Apache Pig, Apache ZooKeeper, Apache Thrift, Apache Avro, and Apache Whirr.
The build system uses Apache Ant and Apache Maven.

== An Excessive Fascination with the Apache Brand ==

We would like Bigtop to become an Apache project to further foster a healthy community of contributors and consumers around interoperability, testing and packaging of Hadoop projects. Since Bigtop directly interacts with many Apache Hadoop-related projects and solves important problems of many Hadoop users, residing in the Apache Software Foundation will increase interaction with the larger community.

= Documentation =

* Bigtop will develop its own documentation detailing how to build, test, install, configure and debug.

= Initial Source =

* https://github.com/cloudera/bigtop

== Source and Intellectual Property Submission Plan ==

* The initial source is already licensed under the Apache License, Version 2.0. https://github.com/cloudera/bigtop

== External Dependencies ==

The required external dependencies are all under the Apache License or compatible licenses.

== Cryptography ==

Bigtop doesn't use cryptography itself; however, Hadoop projects use standard APIs and tools for SSH and SSL communication where necessary.

= Required Resources =

== Mailing lists ==

* bigtop-private (with moderated subscriptions)
* bigtop-dev
* bigtop-commits
* bigtop-user

== Subversion Directory ==

https://svn.apache.org/repos/asf/incubator/bigtop

== Issue Tracking ==

JIRA BIGTOP (Bigtop)

== Other Resources ==

The existing code already has unit and integration tests, so we would like a Jenkins instance to run them whenever a new patch is submitted. This can be added after project creation.

To test RPM and deb install/uninstall and upgrade, it is useful to have a set of Virtual Machine images in known states, and servers that can bring them up. It should be possible to use Apache Whirr to choreograph the VM setup/teardown, so these tests could be performed against VMs on developer desktops or large-scale VM-hosting platforms. For the latter, VM hosting time would be appreciated.
= Initial Committers =

* Alejandro Abdelnur (tucu at cloudera dot com)
* Andre Arcilla (arcilla at yahoo-inc dot com)
* Andrew Bayer (abayer at cloudera dot com)
* Konstantin Boudnik (cos at apache dot org)
* Eli Collins (eli at apache dot org)
* Travis Crawford (travis at twitter dot com)
* Bruno Mahé (bruno at cloudera dot com)
* Alan Gates (gates at apache dot org)
* Patrick Hunt (phunt at apache dot org)
* Peter Linnell (plinnell at cloudera dot com)
* Steve Loughran (stevel at apache dot org)
* Owen O'Malley (omalley at apache dot org)
* James Page (James.page at canonical dot com)
* Roman Shaposhnik (rvs at cloudera dot com)
* John Sichi (jvs at apache dot org)
* Michael Stack (stack at apache dot org)
* Tom White (tomwhite at apache dot org)
* Andrei Savu (asavu at apache dot org)
* Edward J. Yoon (edwardyoon at apache dot org)

= Affiliations =

* Alejandro Abdelnur
Re: [PROPOSAL] Bigtop for the Apache Incubator
On Tue, Jun 14, 2011 at 11:43 AM, Konstantin Boudnik c...@apache.org wrote:

On 14/06/11 05:26, Tom White wrote:

Hi,

I would like to propose Bigtop to be an Apache Incubator project. Bigtop is a project for the development of packaging and tests of the Hadoop ecosystem. The goal is to do testing at various levels (packaging, platform, runtime, upgrade, etc.) developed by a community with a focus on the system as a whole, rather than individual projects.

Here's a link to the proposal on the wiki:
http://wiki.apache.org/incubator/BigtopProposal

I've also included the initial contents below.

Cheers,
Tom

I've added my name to the committer list. I won't be working on this in much/any of work time, and am fairly overcommitted, so don't expect that much. I can contribute some of my experience in VM setup/teardown for testing RPM installations, and how to do functional testing of dynamically created Hadoop clusters.

I am going to add my name to the list of the committers too. Considering my other commitments I might not be able to work much on this project, but I guess the fact that I have written like 50% of the underlying system framework might count for something.

Welcome aboard Cos! Glad to have you on. Cos has made a ton of contributions to the test frameworks in Bigtop. Looking forward to your contributions!

Thanks,
Eli