Re: [PROPOSAL] Oozie for the Apache Incubator

2011-06-24 Thread Eli Collins
+1

Spectacular!

On Fri, Jun 24, 2011 at 12:46 PM, Mohammad Islam misla...@yahoo.com wrote:
 Hi,

 I would like to propose Oozie to be an Apache Incubator project.
 Oozie is a server-based workflow scheduling and coordination system to manage
 data processing jobs for Apache Hadoop.


 Here's a link to the proposal in the Incubator wiki
 http://wiki.apache.org/incubator/OozieProposal


 I've also pasted the initial contents below.

 Regards,

 Mohammad Islam


 Start of Oozie Proposal

 Abstract
 Oozie is a server-based workflow scheduling and coordination system to manage
 data processing jobs for Apache Hadoop™.

 Proposal
 Oozie is an extensible, scalable and reliable system to define, manage,
 schedule, and execute complex Hadoop workloads via web services. More
 specifically, this includes:

        * XML-based declarative framework to specify a job or a complex
          workflow of dependent jobs.
        * Support for different types of jobs such as Hadoop Map-Reduce, Pipes,
          Streaming, Pig, Hive and custom Java applications.
        * Workflow scheduling based on frequency and/or data availability.
        * Monitoring capability, automatic retry and failure handling of jobs.
        * Extensible and pluggable architecture to allow arbitrary grid
          programming paradigms.
        * Authentication, authorization, and capacity-aware load throttling to
          allow multi-tenant software as a service.

 Background
 Most data processing applications require multiple jobs to achieve their
 goals, with inherent dependencies among the jobs. A dependency could be
 sequential, where one job can only start after another job has finished. Or
 it could be conditional, where the execution of a job depends on the return
 value or status of another job. In other cases, parallel execution of
 multiple jobs may be permitted – or desired – to exploit the massive pool of
 compute nodes provided by Hadoop.

 These job dependencies are often expressed as a Directed Acyclic Graph, also
 called a workflow. A node in the workflow is typically a job (a computation
 on the grid) or another type of action such as an email notification.
 Computations can be expressed in map/reduce, Pig, Hive or any other
 programming paradigm available on the grid. Edges of the graph represent
 transitions from one node to the next, as the execution of a workflow
 proceeds.
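To make the graph description above concrete, here is a minimal Python sketch of a workflow as a dependency graph. It is not Oozie code: the node names and the `execution_waves` helper are hypothetical, and the standard-library `graphlib` module stands in for a real workflow engine. Nodes returned in the same wave have no dependencies on each other, so they could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a node, each value is the set of
# nodes that must finish before it can start.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"clean"},
    "notify": {"aggregate", "report"},
}


def execution_waves(dag):
    """Group nodes into waves that respect the dependency edges."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # nodes whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(workflow), start=1):
        print(number, wave)
```

Here "aggregate" and "report" land in the same wave because both depend only on "clean", which illustrates the parallel case mentioned in the Background.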

 Describing a workflow in a declarative way has the advantage of decoupling
 job dependencies and execution control from application logic. Furthermore,
 the workflow is modularized into jobs that can be reused within the same
 workflow or across different workflows. Execution of the workflow is then
 driven by a runtime system without understanding the application logic of
 the jobs. This runtime system specializes in reliable and predictable
 execution: it can retry actions that have failed or invoke a cleanup action
 after termination of the workflow; it can monitor progress, success, or
 failure of a workflow, and send appropriate alerts to an administrator. The
 application developer is relieved from implementing these generic
 procedures.
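The retry-and-alert behaviour described above can be sketched in a few lines of Python. This is only an illustration of the generic procedure, not how Oozie implements it: `run_with_retry`, `max_attempts` and `on_failure` are hypothetical names, and a real runtime would add delays, logging and persistent state.

```python
def run_with_retry(action, max_attempts=3, on_failure=None):
    """Run `action`, retrying on any exception up to `max_attempts` times.

    If every attempt fails, call the optional `on_failure` hook with the
    last error (a stand-in for a cleanup action or an administrator
    alert), then re-raise that error.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return action()
        except Exception as exc:  # the application logic stays opaque to the runtime
            last_error = exc
    if on_failure is not None:
        on_failure(last_error)
    raise last_error
```

The point of the sketch is the separation of concerns: the action carries the application logic, while retrying and alerting live in the generic wrapper.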

 Furthermore, some applications or workflows need to run at periodic intervals
 or when dependent data is available. For example, a workflow could be
 executed every day as soon as output data from the previous 24 instances of
 another, hourly workflow is available. The workflow coordinator provides
 such scheduling features, along with prioritization, load balancing and
 throttling to optimize utilization of resources in the cluster. This makes
 it easier to maintain, control, and coordinate complex data applications.
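The daily-after-24-hourly example above amounts to a data-availability check. The following Python sketch shows one way to express it; the function names and the idea of tracking availability as a set of timestamps are assumptions made for illustration, not part of the proposal.

```python
from datetime import datetime, timedelta


def required_hourly_instances(day_start, count=24):
    """Nominal times of the `count` hourly instances that precede `day_start`."""
    return [day_start - timedelta(hours=h) for h in range(count, 0, -1)]


def daily_run_ready(day_start, available):
    """True once output from every required hourly instance is available.

    `available` is a set of datetimes for which hourly output exists.
    """
    return all(t in available for t in required_hourly_instances(day_start))


if __name__ == "__main__":
    day = datetime(2011, 6, 24)
    done = set(required_hourly_instances(day))
    print(daily_run_ready(day, done))
    done.discard(datetime(2011, 6, 23, 12))  # one hourly output is missing
    print(daily_run_ready(day, done))
```

A coordinator would evaluate a check like this repeatedly and start the daily workflow the first time it passes.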

 Nearly three years ago, a team of Yahoo! developers addressed these critical
 requirements for Hadoop-based data processing systems by developing a new
 workflow management and scheduling system called Oozie. While it was
 initially developed as a Yahoo!-internal project, it was designed and
 implemented with the intention of open-sourcing. Oozie was released as a
 GitHub project in early 2010. Oozie is used in production within Yahoo!, and
 since it was open-sourced it has been gaining adoption among external
 developers.

 Rationale
 Commonly, applications that run on Hadoop require multiple Hadoop jobs in
 order to obtain the desired results. Furthermore, these Hadoop jobs are
 commonly a combination of Java map-reduce jobs, Streaming map-reduce jobs,
 Pipes map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs
 and shell scripts.

 Because of this, developers find themselves writing ad-hoc glue programs to
 combine these Hadoop jobs. These ad-hoc programs are difficult to schedule,
 manage, monitor and recover.

 Workflow management and scheduling is an essential feature for large-scale
 data processing applications. Such applications could write a customized
 solution that would require separate development, operational, and

Re: [VOTE] Accept Bigtop for incubation

2011-06-17 Thread Eli Collins
 towards
 Hadoop or related Apache projects (Alejandro Abdelnur, Konstantin
 Boudnik, Eli Collins, Alan Gates, Patrick Hunt, Steve Loughran, Owen
 O'Malley, John Sichi, Michael Stack, Tom White) and are familiar with
 Apache principles and philosophy for community-driven software
 development.

 == Alignment ==

 We expect projects in Bigtop to be drawn from Hadoop and related
 projects at Apache. Bigtop will complement these projects (Hadoop,
 Pig, Hive, HBase, etc...) by providing an environment for contributors
 interested in building more complex data processing pipelines to work
 together integrating more than a single project into a well-tested
 whole.

 = Known Risks =

 == Orphaned Products ==

 The contributors are leading vendors of Hadoop-based technologies and
 have a long standing in the Hadoop community. There is minimal risk of
 this work becoming non-strategic and the contributors are confident
 that a larger community will form within the project in a relatively
 short space of time.

 == Inexperience with Open Source ==

 All code developed for Bigtop has been open sourced under the Apache
 2.0 license. Most committers of the Bigtop project are intimately familiar
 with the Apache model for open-source development and are experienced
 with working with new contributors.

 == Homogeneous Developers ==

 The initial set of committers is from a small set of organizations and
 numerous existing Apache projects. We expect that once approved for
 incubation, the project will attract new contributors from more
 organizations and will thus grow organically.

 == Reliance on Salaried Developers ==

 It is expected that Bigtop will be developed on salaried and volunteer
 time, although all of the initial developers will work on it mainly on
 salaried time.

 == Relationships with Other Apache Products ==

 Bigtop depends upon other Apache Projects including Apache Hadoop,
 Apache HBase, Apache Hive, Apache Pig, Apache Zookeeper, Apache
 Thrift, Apache Avro, Apache Whirr. The build system uses Apache Ant
 and Apache Maven.

 == An Excessive Fascination with the Apache Brand ==

 We would like Bigtop to become an Apache project to further foster a
 healthy community of contributors and consumers around
 interoperability, testing and packaging of Hadoop projects. Since
 Bigtop directly interacts with many Apache Hadoop-related projects and
 solves important problems of many Hadoop users, residing in the
 Apache Software Foundation will increase interaction with the larger
 community.

 = Documentation =

  * Bigtop will develop its own documentation detailing how to build,
 test, install, configure and debug.

 = Initial Source =

  * https://github.com/cloudera/bigtop

 == Source and Intellectual Property Submission Plan ==

  * The initial source is already licensed under the Apache License, Version 2.0.

 https://github.com/cloudera/bigtop

 == External Dependencies ==

 The required external dependencies are all Apache License or
 compatible licenses.

 == Cryptography ==

 Bigtop doesn't use cryptography itself, however Hadoop projects use
 standard APIs and tools for SSH and SSL communication where necessary.

 = Required Resources =

 == Mailing lists ==

  * bigtop-private (with moderated subscriptions)
  * bigtop-dev
  * bigtop-commits
  * bigtop-user

 == Subversion Directory ==

 https://svn.apache.org/repos/asf/incubator/bigtop

 == Issue Tracking ==

 JIRA BIGTOP (Bigtop)

 == Other Resources ==

 The existing code already has unit and integration tests so we would
 like a Jenkins instance to run them whenever a new patch is submitted.
 This can be added after project creation.

 To test RPM and deb install/uninstall and upgrade, it is useful to have
 a set of Virtual Machine images in known states, and servers that can
 bring them up. It should be possible to use Apache Whirr to
 choreograph the VM setup/teardown, so these tests could be performed
 against VMs on developer desktops or large scale VM-hosting platforms.
 For the latter, VM hosting time would be appreciated.

 = Initial Committers =

  * Alejandro Abdelnur (tucu at cloudera dot com)
  * Andre Arcilla (arcilla at yahoo-inc dot com)
  * Andrew Bayer (abayer at cloudera dot com)
  * Konstantin Boudnik (cos at apache dot org)
  * Eli Collins (eli at apache dot org)
  * Travis Crawford (travis at twitter dot com)
  * Bruno Mahé (bruno at cloudera dot com)
  * Alan Gates (gates at apache dot org)
  * Patrick Hunt (phunt at apache dot org)
  * Peter Linnell (plinnell at cloudera dot com)
  * Steve Loughran (stevel at apache dot org)
  * Owen O'Malley (omalley at apache dot org)
  * James Page (James.page at canonical dot com)
  * Roman Shaposhnik (rvs at cloudera dot com)
  * John Sichi (jvs at apache dot org)
  * Michael Stack (stack at apache dot org)
  * Tom White (tomwhite at apache dot org)
  * Andrei Savu (asavu at apache dot org)
  * Edward J. Yoon (edwardyoon at apache dot org)

 = Affiliations =

  * Alejandro Abdelnur

Re: [PROPOSAL] Bigtop for the Apache Incubator

2011-06-14 Thread Eli Collins
On Tue, Jun 14, 2011 at 11:43 AM, Konstantin Boudnik c...@apache.org wrote:
 On 14/06/11 05:26, Tom White wrote:
  Hi,
 
  I would like to propose Bigtop to be an Apache Incubator project.
  Bigtop is a project for the development of packaging and tests of the
  Hadoop ecosystem. The goal is to do testing at various levels
  (packaging, platform, runtime, upgrade, etc...) developed by a
  community with a focus on the system as a whole, rather than
  individual projects.
 
  Here's a link to the proposal on the wiki
  http://wiki.apache.org/incubator/BigtopProposal
 
  I've also included the initial contents below.
 
  Cheers,
  Tom
 

 I've added my name to the committer list. I won't be working on this in
 much, if any, of my work time, and am fairly overcommitted, so don't expect
 that much. I can contribute some of my experience in VM setup/teardown for
 testing RPM installations, and how to do functional testing of
 dynamically created Hadoop clusters.

 I am going to add my name to the list of the committers too. Considering my
 other commitments I might not be able to work much on this project, but I
 guess the fact that I have written like 50% of the underlying system
 framework might count for something.

Welcome aboard Cos!  Glad to have you on.  Cos has made a ton of
contributions to the test frameworks in Bigtop. Looking forward to
your contributions!

Thanks,
Eli
