Re: [RESULT] [VOTE] Accept Hudi into the Apache Incubator

2019-01-18 Thread Vinoth Govindarajan
+1


On 2019/01/17 19:05:08, Thomas Weise  wrote: 
> The vote for accepting Hudi into the Apache Incubator passes with 11
> binding +1 votes, 5 non-binding +1 votes and no other votes.
> 
> Thanks for voting!
> 
> +1 votes:
> 
> Luciano Resende*
> Pierre Smits
> Suneel Marthi*
> Felix Cheung*
> Kenneth Knowles*
> Mohammad Islam
> Mayank Bansal
> Jakob Homan*
> Akira Ajisaka*
> Gosling Von*
> Matt Sicker*
> Brahma Reddy Battula
> Hongtao Gao
> Vinayakumar B*
> Furkan Kamaci*
> Thomas Weise*
> 
> * = binding

[RESULT] [VOTE] Accept Hudi into the Apache Incubator

2019-01-17 Thread Thomas Weise
The vote for accepting Hudi into the Apache Incubator passes with 11
binding +1 votes, 5 non-binding +1 votes and no other votes.

Thanks for voting!

+1 votes:

Luciano Resende*
Pierre Smits
Suneel Marthi*
Felix Cheung*
Kenneth Knowles*
Mohammad Islam
Mayank Bansal
Jakob Homan*
Akira Ajisaka*
Gosling Von*
Matt Sicker*
Brahma Reddy Battula
Hongtao Gao
Vinayakumar B*
Furkan Kamaci*
Thomas Weise*

* = binding


On Sun, Jan 13, 2019 at 2:34 PM Thomas Weise  wrote:

> Hi all,
>
> Following the discussion of the Hudi proposal in [1], this is a vote
> on accepting Hudi into the Apache Incubator,
> per the ASF policy [2] and voting rules [3].
>
> A vote for accepting a new Apache Incubator podling is a
> majority vote. Everyone is welcome to vote, only
> Incubator PMC member votes are binding.
>
> This vote will run for at least 72 hours. Please VOTE as
> follows:
>
> [ ] +1 Accept Hudi into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept Hudi into the Apache Incubator because ...
>
> The proposal is included below, but you can also access it on
> the wiki [4].
>
> Thanks for reviewing and voting,
> Thomas
>
> [1]
> https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
>
> [2]
> https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
>
> [3] http://www.apache.org/foundation/voting.html
>
> [4] https://wiki.apache.org/incubator/HudiProposal
>
>
>
> = Hudi Proposal =
>
> == Abstract ==
>
> Hudi is a big-data storage library that provides atomic upserts and
> incremental data streams.
>
> Hudi manages data stored in Apache Hadoop and other API-compatible
> distributed file systems/cloud stores.
>
> == Proposal ==
>
> Hudi provides the ability to atomically upsert datasets with new values in
> near-real time, making data available quickly to existing query engines
> like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> sequence of changes to a dataset from a given point-in-time to enable
> incremental data pipelines that yield greater efficiency & lower latency
> than their typical batch counterparts. By carefully managing the number &
> size of files, Hudi greatly aids both query engines (e.g., always providing
> well-sized files) and underlying storage (e.g., HDFS NameNode memory
> consumption).
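
The upsert primitive described above can be pictured with a toy model (plain Python, not Hudi's actual API; the record-key field `id` and the precombine/event-time field `ts` are illustrative assumptions):

```python
# Toy model of Hudi-style upserts: records are keyed, and a new batch
# replaces or inserts rows rather than requiring a full dataset rewrite.
def upsert(dataset, batch, key="id", precombine="ts"):
    """Merge `batch` into `dataset`; on key collision keep the row
    with the larger precombine (event-time) value."""
    merged = {row[key]: row for row in dataset}
    for row in batch:
        existing = merged.get(row[key])
        if existing is None or row[precombine] >= existing[precombine]:
            merged[row[key]] = row
    return list(merged.values())

dataset = [{"id": 1, "ts": 1, "val": "a"}, {"id": 2, "ts": 1, "val": "b"}]
batch = [{"id": 2, "ts": 2, "val": "b2"}, {"id": 3, "ts": 1, "val": "c"}]
dataset = upsert(dataset, batch)  # id 2 updated, id 3 inserted, id 1 untouched
```

The real system performs this merge against columnar files at scale, but the per-key "latest event time wins" resolution sketched here is the essence of the primitive.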
>
> Hudi is largely implemented as an Apache Spark library that reads/writes
> data from/to a Hadoop-compatible filesystem. SQL queries on Hudi datasets
> are supported via specialized Apache Hadoop input formats that understand
> Hudi’s storage layout. Currently, Hudi manages datasets using a combination
> of Apache Parquet & Apache Avro file/serialization formats.
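
One way to picture how atomic writes and the specialized input formats interact is a minimal commit-timeline sketch (plain Python, illustrative only, not Hudi's internal design): files written by an in-flight commit stay invisible to queries until the commit is marked complete.

```python
# Toy model of a commit timeline that makes writes atomic: a query engine
# (via the input format) only ever sees files from completed commits.
class Timeline:
    def __init__(self):
        self.commits = []  # each: {"t": commit_time, "files": [...], "done": bool}

    def begin(self, t, files):
        self.commits.append({"t": t, "files": files, "done": False})

    def complete(self, t):
        for c in self.commits:
            if c["t"] == t:
                c["done"] = True

    def visible_files(self):
        # Only files belonging to completed commits are readable.
        return [f for c in self.commits if c["done"] for f in c["files"]]

tl = Timeline()
tl.begin(1, ["part-0001.parquet"])
tl.complete(1)
tl.begin(2, ["part-0002.parquet"])  # in-flight: not yet visible to queries
```

Readers either see a commit's files entirely or not at all, which is what lets writers rewrite data underneath running queries safely.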
>
> == Background ==
>
> Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve as
> longer term analytical storage for thousands of organizations. Typical
> analytical datasets are built by reading data from a source (e.g: upstream
> databases, messaging buses, or other datasets), transforming the data,
> writing results back to storage, & making it available for analytical
> queries--all of this typically accomplished in batch jobs which operate in
> a bulk fashion on partitions of datasets. Such a style of processing
> typically incurs large delays in making data available to queries as well
> as a lot of complexity in carefully partitioning datasets to guarantee
> latency SLAs.
>
> The need for fresher/faster analytics has increased enormously in the past
> few years, as evidenced by the popularity of stream processing systems like
> Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
> using an updateable state store to incrementally compute & instantly reflect
> new results to queries and using a “tailable” messaging bus to publish
> these results to other downstream jobs, such systems employ a different
> approach to building analytical datasets. Even though this approach yields
> low latency, the amount of data managed in such real-time data-marts is
> typically limited in comparison to the aforementioned longer term storage
> options. As a result, the overall data architecture has become more complex
> with more moving parts and specialized systems, leading to duplication of
> data and a strain on usability.
>
> Hudi takes a hybrid approach. Instead of moving vast amounts of batch data
> to streaming systems, we simply add the streaming primitives (upserts &
> incremental consumption) onto existing batch processing technologies. We
> believe that by adding some missing blocks to an existing Hadoop stack, we
> are able to provide similar capabilities right on top of Hadoop at a
> reduced cost and with an increased efficiency, greatly simplifying the
> overall architecture in the process.
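
The incremental-consumption primitive mentioned above can be sketched the same way (toy Python model; commit times and row contents are illustrative): a downstream job remembers the last commit it processed and pulls only the changes written after it.

```python
# Toy model of incremental consumption: commits are ordered by commit time,
# and a consumer pulls only rows written strictly after its checkpoint.
commits = [
    (1, [{"id": 1, "val": "a"}]),
    (2, [{"id": 2, "val": "b"}]),
    (3, [{"id": 1, "val": "a2"}]),   # later upsert of id 1
]

def incremental_pull(commits, since):
    """Return all rows from commits with commit time > `since`."""
    return [row for t, rows in commits if t > since for row in rows]

changes = incremental_pull(commits, since=1)  # resume after commit 1
```

This is how a chain of batch jobs can behave like a streaming pipeline: each stage reprocesses only the delta instead of full partitions.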
>
> Hudi was originally developed at Uber (original name “Hoodie”) to address
> such broad inefficiencies in ingest & ETL & ML pipelines across 

Re: [VOTE] Accept Hudi into the Apache Incubator

2019-01-17 Thread Thomas Weise
+1


On Wed, Jan 16, 2019 at 11:35 AM Matt Sicker  wrote:

> +1
>
> On Wed, 16 Jan 2019 at 01:25, Gosling Von  wrote:
> >
> > +1(binding)
> >
> > Best Regards,
> > Von Gosling
> >
> > > On Jan 14, 2019, at 6:34 AM, Thomas Weise  wrote:
> > >
> > > Hi all,
> > >
> > > Following the discussion of the Hudi proposal in [1], this is a vote
> > > on accepting Hudi into the Apache Incubator,
> > > per the ASF policy [2] and voting rules [3].
> > >
> > > A vote for accepting a new Apache Incubator podling is a
> > > majority vote. Everyone is welcome to vote, only
> > > Incubator PMC member votes are binding.
> > >
> > > This vote will run for at least 72 hours. Please VOTE as
> > > follows:
> > >
> > > [ ] +1 Accept Hudi into the Apache Incubator
> > > [ ] +0 Abstain
> > > [ ] -1 Do not accept Hudi into the Apache Incubator because ...
> > >
> > > The proposal is included below, but you can also access it on
> > > the wiki [4].
> > >
> > > Thanks for reviewing and voting,
> > > Thomas
> > >
> > > [1]
> > >
> https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
> > >
> > > [2]
> > >
> https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
> > >
> > > [3] http://www.apache.org/foundation/voting.html
> > >
> > > [4] https://wiki.apache.org/incubator/HudiProposal
> > >
> > >
> > >
> > > = Hudi Proposal =
> > >
> > > == Abstract ==
> > >
> > > Hudi is a big-data storage library that provides atomic upserts and
> > > incremental data streams.
> > >
> > > Hudi manages data stored in Apache Hadoop and other API compatible
> > > distributed file systems/cloud stores.
> > >
> > > == Proposal ==
> > >
> > > Hudi provides the ability to atomically upsert datasets with new
> values in
> > > near-real time, making data available quickly to existing query engines
> > > like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> > > sequence of changes to a dataset from a given point-in-time to enable
> > > incremental data pipelines that yield greater efficiency & lower latency than
> > > their typical batch counterparts. By carefully managing number of
> files &
> > > sizes, Hudi greatly aids both query engines (e.g: always providing
> > > well-sized files) and underlying storage (e.g: HDFS NameNode memory
> > > consumption).
> > >
> > > Hudi is largely implemented as an Apache Spark library that
> reads/writes
> > > data from/to Hadoop compatible filesystem. SQL queries on Hudi
> datasets are
> > > supported via specialized Apache Hadoop input formats that understand
> > > Hudi’s storage layout. Currently, Hudi manages datasets using a
> combination
> > > of Apache Parquet & Apache Avro file/serialization formats.
> > >
> > > == Background ==
> > >
> > > Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> > > storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve
> as
> > > longer term analytical storage for thousands of organizations. Typical
> > > analytical datasets are built by reading data from a source (e.g:
> upstream
> > > databases, messaging buses, or other datasets), transforming the data,
> > > writing results back to storage, & making it available for analytical
> > > queries--all of this typically accomplished in batch jobs which
> operate in
> > > a bulk fashion on partitions of datasets. Such a style of processing
> > > typically incurs large delays in making data available to queries as
> well
> > > as a lot of complexity in carefully partitioning datasets to guarantee
> > > latency SLAs.
> > >
> > > The need for fresher/faster analytics has increased enormously in the
> past
> > > few years, as evidenced by the popularity of Stream processing systems
> like
> > > Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
> > > using an updateable state store to incrementally compute & instantly
> reflect
> > > new results to queries and using a “tailable” messaging bus to publish
> > > these results to other downstream jobs, such systems employ a different
> > > approach to building analytical datasets. Even though this approach
> yields
> > > low latency, the amount of data managed in such real-time data-marts is
> > > typically limited in comparison to the aforementioned longer term
> storage
> > > options. As a result, the overall data architecture has become more
> complex
> > > with more moving parts and specialized systems, leading to duplication
> of
> > > data and a strain on usability.
> > >
> > > Hudi takes a hybrid approach. Instead of moving vast amounts of batch
> data
> > > to streaming systems, we simply add the streaming primitives (upserts &
> > > incremental consumption) onto existing batch processing technologies.
> We
> > > believe that by adding some missing blocks to an existing Hadoop
> stack, we
> > > are able to provide similar capabilities right on top of Hadoop at a
> > > reduced cost and with an increased efficiency, greatly simplifying 

Re: [VOTE] Accept Hudi into the Apache Incubator

2019-01-16 Thread Matt Sicker
+1

On Wed, 16 Jan 2019 at 01:25, Gosling Von  wrote:
>
> +1(binding)
>
> Best Regards,
> Von Gosling
>
> > On Jan 14, 2019, at 6:34 AM, Thomas Weise  wrote:
> >
> > Hi all,
> >
> > Following the discussion of the Hudi proposal in [1], this is a vote
> > on accepting Hudi into the Apache Incubator,
> > per the ASF policy [2] and voting rules [3].
> >
> > A vote for accepting a new Apache Incubator podling is a
> > majority vote. Everyone is welcome to vote, only
> > Incubator PMC member votes are binding.
> >
> > This vote will run for at least 72 hours. Please VOTE as
> > follows:
> >
> > [ ] +1 Accept Hudi into the Apache Incubator
> > [ ] +0 Abstain
> > [ ] -1 Do not accept Hudi into the Apache Incubator because ...
> >
> > The proposal is included below, but you can also access it on
> > the wiki [4].
> >
> > Thanks for reviewing and voting,
> > Thomas
> >
> > [1]
> > https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
> >
> > [2]
> > https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
> >
> > [3] http://www.apache.org/foundation/voting.html
> >
> > [4] https://wiki.apache.org/incubator/HudiProposal
> >
> >
> >
> > = Hudi Proposal =
> >
> > == Abstract ==
> >
> > Hudi is a big-data storage library that provides atomic upserts and
> > incremental data streams.
> >
> > Hudi manages data stored in Apache Hadoop and other API compatible
> > distributed file systems/cloud stores.
> >
> > == Proposal ==
> >
> > Hudi provides the ability to atomically upsert datasets with new values in
> > near-real time, making data available quickly to existing query engines
> > like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> > sequence of changes to a dataset from a given point-in-time to enable
> > incremental data pipelines that yield greater efficiency & lower latency than
> > their typical batch counterparts. By carefully managing number of files &
> > sizes, Hudi greatly aids both query engines (e.g: always providing
> > well-sized files) and underlying storage (e.g: HDFS NameNode memory
> > consumption).
> >
> > Hudi is largely implemented as an Apache Spark library that reads/writes
> > data from/to Hadoop compatible filesystem. SQL queries on Hudi datasets are
> > supported via specialized Apache Hadoop input formats that understand
> > Hudi’s storage layout. Currently, Hudi manages datasets using a combination
> > of Apache Parquet & Apache Avro file/serialization formats.
> >
> > == Background ==
> >
> > Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> > storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve as
> > longer term analytical storage for thousands of organizations. Typical
> > analytical datasets are built by reading data from a source (e.g: upstream
> > databases, messaging buses, or other datasets), transforming the data,
> > writing results back to storage, & making it available for analytical
> > queries--all of this typically accomplished in batch jobs which operate in
> > a bulk fashion on partitions of datasets. Such a style of processing
> > typically incurs large delays in making data available to queries as well
> > as a lot of complexity in carefully partitioning datasets to guarantee
> > latency SLAs.
> >
> > The need for fresher/faster analytics has increased enormously in the past
> > few years, as evidenced by the popularity of Stream processing systems like
> > Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
> > using an updateable state store to incrementally compute & instantly reflect
> > new results to queries and using a “tailable” messaging bus to publish
> > these results to other downstream jobs, such systems employ a different
> > approach to building analytical datasets. Even though this approach yields
> > low latency, the amount of data managed in such real-time data-marts is
> > typically limited in comparison to the aforementioned longer term storage
> > options. As a result, the overall data architecture has become more complex
> > with more moving parts and specialized systems, leading to duplication of
> > data and a strain on usability.
> >
> > Hudi takes a hybrid approach. Instead of moving vast amounts of batch data
> > to streaming systems, we simply add the streaming primitives (upserts &
> > incremental consumption) onto existing batch processing technologies. We
> > believe that by adding some missing blocks to an existing Hadoop stack, we
> > are able to provide similar capabilities right on top of Hadoop at a
> > reduced cost and with an increased efficiency, greatly simplifying the
> > overall architecture in the process.
> >
> > Hudi was originally developed at Uber (original name “Hoodie”) to address
> > such broad inefficiencies in ingest & ETL & ML pipelines across Uber’s data
> > ecosystem that required the upsert & incremental consumption primitives
> > supported by Hudi.
> >
> > 

Re: [VOTE] Accept Hudi into the Apache Incubator

2019-01-15 Thread Gosling Von
+1(binding)

Best Regards,
Von Gosling

> On Jan 14, 2019, at 6:34 AM, Thomas Weise  wrote:
> 
> Hi all,
> 
> Following the discussion of the Hudi proposal in [1], this is a vote
> on accepting Hudi into the Apache Incubator,
> per the ASF policy [2] and voting rules [3].
> 
> A vote for accepting a new Apache Incubator podling is a
> majority vote. Everyone is welcome to vote, only
> Incubator PMC member votes are binding.
> 
> This vote will run for at least 72 hours. Please VOTE as
> follows:
> 
> [ ] +1 Accept Hudi into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept Hudi into the Apache Incubator because ...
> 
> The proposal is included below, but you can also access it on
> the wiki [4].
> 
> Thanks for reviewing and voting,
> Thomas
> 
> [1]
> https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
> 
> [2]
> https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
> 
> [3] http://www.apache.org/foundation/voting.html
> 
> [4] https://wiki.apache.org/incubator/HudiProposal
> 
> 
> 
> = Hudi Proposal =
> 
> == Abstract ==
> 
> Hudi is a big-data storage library that provides atomic upserts and
> incremental data streams.
> 
> Hudi manages data stored in Apache Hadoop and other API compatible
> distributed file systems/cloud stores.
> 
> == Proposal ==
> 
> Hudi provides the ability to atomically upsert datasets with new values in
> near-real time, making data available quickly to existing query engines
> like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> sequence of changes to a dataset from a given point-in-time to enable
> incremental data pipelines that yield greater efficiency & lower latency than
> their typical batch counterparts. By carefully managing number of files &
> sizes, Hudi greatly aids both query engines (e.g: always providing
> well-sized files) and underlying storage (e.g: HDFS NameNode memory
> consumption).
> 
> Hudi is largely implemented as an Apache Spark library that reads/writes
> data from/to Hadoop compatible filesystem. SQL queries on Hudi datasets are
> supported via specialized Apache Hadoop input formats that understand
> Hudi’s storage layout. Currently, Hudi manages datasets using a combination
> of Apache Parquet & Apache Avro file/serialization formats.
> 
> == Background ==
> 
> Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve as
> longer term analytical storage for thousands of organizations. Typical
> analytical datasets are built by reading data from a source (e.g: upstream
> databases, messaging buses, or other datasets), transforming the data,
> writing results back to storage, & making it available for analytical
> queries--all of this typically accomplished in batch jobs which operate in
> a bulk fashion on partitions of datasets. Such a style of processing
> typically incurs large delays in making data available to queries as well
> as a lot of complexity in carefully partitioning datasets to guarantee
> latency SLAs.
> 
> The need for fresher/faster analytics has increased enormously in the past
> few years, as evidenced by the popularity of Stream processing systems like
> Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
> using an updateable state store to incrementally compute & instantly reflect
> new results to queries and using a “tailable” messaging bus to publish
> these results to other downstream jobs, such systems employ a different
> approach to building analytical datasets. Even though this approach yields
> low latency, the amount of data managed in such real-time data-marts is
> typically limited in comparison to the aforementioned longer term storage
> options. As a result, the overall data architecture has become more complex
> with more moving parts and specialized systems, leading to duplication of
> data and a strain on usability.
> 
> Hudi takes a hybrid approach. Instead of moving vast amounts of batch data
> to streaming systems, we simply add the streaming primitives (upserts &
> incremental consumption) onto existing batch processing technologies. We
> believe that by adding some missing blocks to an existing Hadoop stack, we
> are able to provide similar capabilities right on top of Hadoop at a
> reduced cost and with an increased efficiency, greatly simplifying the
> overall architecture in the process.
> 
> Hudi was originally developed at Uber (original name “Hoodie”) to address
> such broad inefficiencies in ingest & ETL & ML pipelines across Uber’s data
> ecosystem that required the upsert & incremental consumption primitives
> supported by Hudi.
> 
> == Rationale ==
> 
> We truly believe the capabilities supported by Hudi would be increasingly
> useful for big-data ecosystems, as data volumes & need for faster data
> continue to increase. A detailed description of target use-cases can be
> found 

Re: [VOTE] Accept Hudi into the Apache Incubator

2019-01-15 Thread Brahma Reddy Battula
+1 (non-binding).

Best choice for incremental processing.

On Mon, Jan 14, 2019 at 4:04 AM Thomas Weise  wrote:

> Hi all,
>
> Following the discussion of the Hudi proposal in [1], this is a vote
> on accepting Hudi into the Apache Incubator,
> per the ASF policy [2] and voting rules [3].
>
> A vote for accepting a new Apache Incubator podling is a
> majority vote. Everyone is welcome to vote, only
> Incubator PMC member votes are binding.
>
> This vote will run for at least 72 hours. Please VOTE as
> follows:
>
> [ ] +1 Accept Hudi into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept Hudi into the Apache Incubator because ...
>
> The proposal is included below, but you can also access it on
> the wiki [4].
>
> Thanks for reviewing and voting,
> Thomas
>
> [1]
>
> https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
>
> [2]
>
> https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
>
> [3] http://www.apache.org/foundation/voting.html
>
> [4] https://wiki.apache.org/incubator/HudiProposal
>
>
>
> = Hudi Proposal =
>
> == Abstract ==
>
> Hudi is a big-data storage library that provides atomic upserts and
> incremental data streams.
>
> Hudi manages data stored in Apache Hadoop and other API compatible
> distributed file systems/cloud stores.
>
> == Proposal ==
>
> Hudi provides the ability to atomically upsert datasets with new values in
> near-real time, making data available quickly to existing query engines
> like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> sequence of changes to a dataset from a given point-in-time to enable
> incremental data pipelines that yield greater efficiency & lower latency than
> their typical batch counterparts. By carefully managing number of files &
> sizes, Hudi greatly aids both query engines (e.g: always providing
> well-sized files) and underlying storage (e.g: HDFS NameNode memory
> consumption).
>
> Hudi is largely implemented as an Apache Spark library that reads/writes
> data from/to Hadoop compatible filesystem. SQL queries on Hudi datasets are
> supported via specialized Apache Hadoop input formats that understand
> Hudi’s storage layout. Currently, Hudi manages datasets using a combination
> of Apache Parquet & Apache Avro file/serialization formats.
>
> == Background ==
>
> Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve as
> longer term analytical storage for thousands of organizations. Typical
> analytical datasets are built by reading data from a source (e.g: upstream
> databases, messaging buses, or other datasets), transforming the data,
> writing results back to storage, & making it available for analytical
> queries--all of this typically accomplished in batch jobs which operate in
> a bulk fashion on partitions of datasets. Such a style of processing
> typically incurs large delays in making data available to queries as well
> as a lot of complexity in carefully partitioning datasets to guarantee
> latency SLAs.
>
> The need for fresher/faster analytics has increased enormously in the past
> few years, as evidenced by the popularity of Stream processing systems like
> Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
> using an updateable state store to incrementally compute & instantly reflect
> new results to queries and using a “tailable” messaging bus to publish
> these results to other downstream jobs, such systems employ a different
> approach to building analytical datasets. Even though this approach yields
> low latency, the amount of data managed in such real-time data-marts is
> typically limited in comparison to the aforementioned longer term storage
> options. As a result, the overall data architecture has become more complex
> with more moving parts and specialized systems, leading to duplication of
> data and a strain on usability.
>
> Hudi takes a hybrid approach. Instead of moving vast amounts of batch data
> to streaming systems, we simply add the streaming primitives (upserts &
> incremental consumption) onto existing batch processing technologies. We
> believe that by adding some missing blocks to an existing Hadoop stack, we
> are able to provide similar capabilities right on top of Hadoop at a
> reduced cost and with an increased efficiency, greatly simplifying the
> overall architecture in the process.
>
> Hudi was originally developed at Uber (original name “Hoodie”) to address
> such broad inefficiencies in ingest & ETL & ML pipelines across Uber’s data
> ecosystem that required the upsert & incremental consumption primitives
> supported by Hudi.
>
> == Rationale ==
>
> We truly believe the capabilities supported by Hudi would be increasingly
> useful for big-data ecosystems, as data volumes & need for 

Re: [VOTE] Accept Hudi into the Apache Incubator

2019-01-15 Thread Furkan KAMACI
+1

On Wed, 16 Jan 2019 at 01:40, Vinayakumar B  wrote:

> +1
>
> - Vinay
>
> On Tue, 15 Jan 2019, 10:56 am Hongtao Gao  wrote:
> > +1
> >
> > Hongtao Gao
> >
> >
> > On Mon, Jan 14, 2019 at 6:34 AM, Thomas Weise  wrote:
> >
> > > Hi all,
> > >
> > > Following the discussion of the Hudi proposal in [1], this is a vote
> > > on accepting Hudi into the Apache Incubator,
> > > per the ASF policy [2] and voting rules [3].
> > >
> > > A vote for accepting a new Apache Incubator podling is a
> > > majority vote. Everyone is welcome to vote, only
> > > Incubator PMC member votes are binding.
> > >
> > > This vote will run for at least 72 hours. Please VOTE as
> > > follows:
> > >
> > > [ ] +1 Accept Hudi into the Apache Incubator
> > > [ ] +0 Abstain
> > > [ ] -1 Do not accept Hudi into the Apache Incubator because ...
> > >
> > > The proposal is included below, but you can also access it on
> > > the wiki [4].
> > >
> > > Thanks for reviewing and voting,
> > > Thomas
> > >
> > > [1]
> > >
> > >
> >
> https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
> > >
> > > [2]
> > >
> > >
> >
> https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
> > >
> > > [3] http://www.apache.org/foundation/voting.html
> > >
> > > [4] https://wiki.apache.org/incubator/HudiProposal
> > >
> > >
> > >
> > > = Hudi Proposal =
> > >
> > > == Abstract ==
> > >
> > > Hudi is a big-data storage library that provides atomic upserts and
> > > incremental data streams.
> > >
> > > Hudi manages data stored in Apache Hadoop and other API compatible
> > > distributed file systems/cloud stores.
> > >
> > > == Proposal ==
> > >
> > > Hudi provides the ability to atomically upsert datasets with new values
> > in
> > > near-real time, making data available quickly to existing query engines
> > > like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> > > sequence of changes to a dataset from a given point-in-time to enable
> > > incremental data pipelines that yield greater efficiency & latency than
> > > their typical batch counterparts. By carefully managing number of
> files &
> > > sizes, Hudi greatly aids both query engines (e.g: always providing
> > > well-sized files) and underlying storage (e.g: HDFS NameNode memory
> > > consumption).
> > >
> > > Hudi is largely implemented as an Apache Spark library that
> reads/writes
> > > data from/to Hadoop compatible filesystem. SQL queries on Hudi datasets
> > are
> > > supported via specialized Apache Hadoop input formats, that understand
> > > Hudi’s storage layout. Currently, Hudi manages datasets using a
> > combination
> > > of Apache Parquet & Apache Avro file/serialization formats.
> > >
> > > == Background ==
> > >
> > > Apache Hadoop distributed filesystem (HDFS) & other compatible cloud
> > > storage systems (e.g: Amazon S3, Google Cloud, Microsoft Azure) serve
> as
> > > longer term analytical storage for thousands of organizations. Typical
> > > analytical datasets are built by reading data from a source (e.g:
> > upstream
> > > databases, messaging buses, or other datasets), transforming the data,
> > > writing results back to storage, & making it available for analytical
> > > queries--all of this typically accomplished in batch jobs which operate
> > in
> > > a bulk fashion on partitions of datasets. Such a style of processing
> > > typically incurs large delays in making data available to queries as
> well
> > > as lot of complexity in carefully partitioning datasets to guarantee
> > > latency SLAs.
> > >
> > > The need for fresher/faster analytics has increased enormously in the
> > past
> > > few years, as evidenced by the popularity of Stream processing systems
> > like
> > > Apache Spark, Apache Flink, and messaging systems like Apache Kafka. By
> > > using updateable state store to incrementally compute & instantly
> reflect
> > > new results to queries and using a “tailable” messaging bus to publish
> > > these results to other downstream jobs, such systems employ a different
> > > approach to building analytical dataset. Even though this approach
> yields
> > > low latency, the amount of data managed in such real-time data-marts is
> > > typically limited in comparison to the aforementioned longer term
> storage
> > > options. As a result, the overall data architecture has become more
> > complex
> > > with more moving parts and specialized systems, leading to duplication
> of
> > > data and a strain on usability.
> > >
> > > Hudi takes a hybrid approach. Instead of moving vast amounts of batch
> > data
> > > to streaming systems, we simply add the streaming primitives (upserts &
> > > incremental consumption) onto existing batch processing technologies.
> We
> > > believe that by adding some missing blocks to an existing Hadoop stack,
> > we
> > > are able to a provide similar capabilities right on top of Hadoop at a
> > > reduced cost and with an increased 

Re: [VOTE] Accept Hudi into the Apache Incubator

2019-01-15 Thread Vinayakumar B
+1

- Vinay

On Tue, 15 Jan 2019, 10:56 am Hongtao Gao wrote:

> +1
>
> Hongtao Gao

Re: [VOTE] Accept Hudi into the Apache Incubator

2019-01-14 Thread Hongtao Gao
+1

Hongtao Gao


On Mon, Jan 14, 2019 at 6:34 AM, Thomas Weise wrote:

> Hi all,
>
> Following the discussion of the Hudi proposal in [1], this is a vote
> on accepting Hudi into the Apache Incubator,
> per the ASF policy [2] and voting rules [3].
>
> A vote for accepting a new Apache Incubator podling is a
> majority vote. Everyone is welcome to vote, only
> Incubator PMC member votes are binding.
>
> This vote will run for at least 72 hours. Please VOTE as
> follows:
>
> [ ] +1 Accept Hudi into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept Hudi into the Apache Incubator because ...
>
> The proposal is included below, but you can also access it on
> the wiki [4].
>
> Thanks for reviewing and voting,
> Thomas
>
> [1]
>
> https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E
>
> [2]
>
> https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor
>
> [3] http://www.apache.org/foundation/voting.html
>
> [4] https://wiki.apache.org/incubator/HudiProposal
>
>
>
> = Hudi Proposal =
>
> == Abstract ==
>
> Hudi is a big-data storage library that provides atomic upserts and
> incremental data streams.
>
> Hudi manages data stored in Apache Hadoop and other API compatible
> distributed file systems/cloud stores.
>
> == Proposal ==
>
> Hudi provides the ability to atomically upsert datasets with new values in
> near-real time, making data available quickly to existing query engines
> like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
> sequence of changes to a dataset from a given point-in-time to enable
> incremental data pipelines that yield greater efficiency & lower latency
> than their typical batch counterparts. By carefully managing the number &
> size of files, Hudi greatly aids both query engines (e.g., by always
> providing well-sized files) and underlying storage (e.g., HDFS NameNode
> memory consumption).
>
> Hudi is largely implemented as an Apache Spark library that reads/writes
> data from/to a Hadoop-compatible filesystem. SQL queries on Hudi datasets
> are supported via specialized Apache Hadoop input formats that understand
> Hudi’s storage layout. Currently, Hudi manages datasets using a combination
> of the Apache Parquet & Apache Avro file/serialization formats.
>
> == Background ==
>
> The Apache Hadoop Distributed File System (HDFS) & other compatible cloud
> storage systems (e.g., Amazon S3, Google Cloud, Microsoft Azure) serve as
> longer-term analytical storage for thousands of organizations. Typical
> analytical datasets are built by reading data from a source (e.g., upstream
> databases, messaging buses, or other datasets), transforming the data,
> writing results back to storage, & making it available for analytical
> queries--all of this typically accomplished in batch jobs which operate in
> a bulk fashion on partitions of datasets. Such a style of processing
> typically incurs large delays in making data available to queries, as well
> as a lot of complexity in carefully partitioning datasets to guarantee
> latency SLAs.
>
> The need for fresher/faster analytics has increased enormously in the past
> few years, as evidenced by the popularity of stream processing systems like
> Apache Spark & Apache Flink and messaging systems like Apache Kafka. By
> using an updateable state store to incrementally compute & instantly
> reflect new results to queries, and a “tailable” messaging bus to publish
> these results to other downstream jobs, such systems employ a different
> approach to building analytical datasets. Even though this approach yields
> low latency, the amount of data managed in such real-time data-marts is
> typically limited in comparison to the aforementioned longer-term storage
> options. As a result, the overall data architecture has become more complex
> with more moving parts and specialized systems, leading to duplication of
> data and a strain on usability.
>
> Hudi takes a hybrid approach. Instead of moving vast amounts of batch data
> to streaming systems, we simply add the streaming primitives (upserts &
> incremental consumption) onto existing batch processing technologies. We
> believe that by adding some missing blocks to an existing Hadoop stack, we
> are able to provide similar capabilities right on top of Hadoop at a
> reduced cost and with increased efficiency, greatly simplifying the
> overall architecture in the process.
>
> Hudi was originally developed at Uber (original name “Hoodie”) to address
> broad inefficiencies across ingest, ETL, & ML pipelines in Uber’s data
> ecosystem that required the upsert & incremental consumption primitives
> supported by Hudi.
>
> == Rationale ==
>
> We truly believe the capabilities supported by Hudi would be increasingly
> useful for big-data ecosystems, as data volumes & the need for faster data
> continue to increase. A detailed description of target use-cases can be
> found at https://uber.github.io/hudi/use_cases.html.
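The two streaming primitives the proposal centers on -- atomic upserts and incremental consumption -- can be sketched with a small in-memory model. This is an illustrative sketch only, not Hudi's actual API; the `Dataset`, `upsert`, and `incremental_pull` names are invented for this example:

```python
class Dataset:
    """Toy model of an upsertable dataset with commit-time change tracking.
    NOT Hudi's API -- just an illustration of the two primitives."""

    def __init__(self):
        self.records = {}   # record key -> latest value
        self.commits = []   # ordered (commit_time, keys_changed) pairs
        self.clock = 0

    def upsert(self, batch):
        """Atomically insert-or-update a batch of key/value pairs,
        recorded as a single commit; returns the commit time."""
        self.clock += 1
        for key, value in batch.items():
            self.records[key] = value
        self.commits.append((self.clock, set(batch)))
        return self.clock

    def incremental_pull(self, since):
        """Return only records changed by commits after `since`,
        rather than rescanning the whole dataset (a batch job would)."""
        changed = set()
        for commit_time, keys in self.commits:
            if commit_time > since:
                changed |= keys
        return {k: self.records[k] for k in changed}


ds = Dataset()
t1 = ds.upsert({"trip-1": 10, "trip-2": 20})
t2 = ds.upsert({"trip-2": 25, "trip-3": 30})   # updates trip-2, inserts trip-3

# A downstream pipeline that already processed commit t1 pulls only the delta:
delta = ds.incremental_pull(since=t1)          # {'trip-2': 25, 'trip-3': 30}
```

In Hudi itself the commit timeline lives alongside the Parquet/Avro files on the storage layer, which is what lets downstream jobs pull deltas by commit time on top of batch infrastructure, without a separate streaming system.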

Re: [VOTE] Accept Hudi into the Apache Incubator

2019-01-14 Thread Akira Ajisaka
+1 (binding)

-Akira

On Tue, Jan 15, 2019 at 10:25, Jakob Homan wrote:
>
> +1 (binding)
>
> -Jakob
>
> On Mon, Jan 14, 2019 at 5:22 PM Mayank Bansal  wrote:
> >
> > +1
> >
> > On Mon, Jan 14, 2019 at 5:11 PM Mohammad Islam 
> > wrote:
> >
> > >  +1
> > > On Monday, January 14, 2019, 12:46:48 PM PST, Kenneth Knowles <
> > > k...@apache.org> wrote:
> > >
> > >  +1
> > >
> > > On Mon, Jan 14, 2019 at 9:38 AM Felix Cheung 
> > > wrote:
> > >
> > > > +1
> > > >
> > > >
> > > > On Mon, Jan 14, 2019 at 3:20 AM Suneel Marthi
> > > >  wrote:
> > > >
> > > > > +1
> > > > >
> > > > > Sent from my iPhone
> > > > >

Re: [VOTE] Accept Hudi into the Apache Incubator

2019-01-14 Thread Jakob Homan
+1 (binding)

-Jakob

On Mon, Jan 14, 2019 at 5:22 PM Mayank Bansal  wrote:
>
> +1
>
> On Mon, Jan 14, 2019 at 5:11 PM Mohammad Islam 
> wrote:
>
> >  +1
> > On Monday, January 14, 2019, 12:46:48 PM PST, Kenneth Knowles <
> > k...@apache.org> wrote:
> >
> >  +1
> >
> > On Mon, Jan 14, 2019 at 9:38 AM Felix Cheung 
> > wrote:
> >
> > > +1
> > >
> > >
> > > On Mon, Jan 14, 2019 at 3:20 AM Suneel Marthi
> > >  wrote:
> > >
> > > > +1
> > > >
> > > > Sent from my iPhone
> > > >

Re: [VOTE] Accept Hudi into the Apache Incubator

2019-01-14 Thread Mayank Bansal
+1

On Mon, Jan 14, 2019 at 5:11 PM Mohammad Islam 
wrote:

>  +1
> On Monday, January 14, 2019, 12:46:48 PM PST, Kenneth Knowles <
> k...@apache.org> wrote:
>
>  +1
>
> On Mon, Jan 14, 2019 at 9:38 AM Felix Cheung 
> wrote:
>
> > +1
> >
> >
> > On Mon, Jan 14, 2019 at 3:20 AM Suneel Marthi
> >  wrote:
> >
> > > +1
> > >
> > > Sent from my iPhone
> > >

Re: [VOTE] Accept Hudi into the Apache Incubator

2019-01-14 Thread Mohammad Islam
+1

On Monday, January 14, 2019, 12:46:48 PM PST, Kenneth Knowles wrote:

+1

On Mon, Jan 14, 2019 at 9:38 AM Felix Cheung  wrote:

> +1
>
>
> On Mon, Jan 14, 2019 at 3:20 AM Suneel Marthi
>  wrote:
>
> > +1
> >
> > Sent from my iPhone
> >

Re: [VOTE] Accept Hudi into the Apache Incubator

2019-01-14 Thread Kenneth Knowles
+1


Re: [VOTE] Accept Hudi into the Apache Incubator

2019-01-14 Thread Felix Cheung
+1



Re: [VOTE] Accept Hudi into the Apache Incubator

2019-01-14 Thread Suneel Marthi
+1 

Sent from my iPhone


Re: [VOTE] Accept Hudi into the Apache Incubator

2019-01-13 Thread Pierre Smits
+1


Re: [VOTE] Accept Hudi into the Apache Incubator

2019-01-13 Thread Luciano Resende
+1 (binding)


[VOTE] Accept Hudi into the Apache Incubator

2019-01-13 Thread Thomas Weise
Hi all,

Following the discussion of the Hudi proposal in [1], this is a vote
on accepting Hudi into the Apache Incubator,
per the ASF policy [2] and voting rules [3].

A vote for accepting a new Apache Incubator podling is a
majority vote. Everyone is welcome to vote, only
Incubator PMC member votes are binding.

This vote will run for at least 72 hours. Please VOTE as
follows:

[ ] +1 Accept Hudi into the Apache Incubator
[ ] +0 Abstain
[ ] -1 Do not accept Hudi into the Apache Incubator because ...

The proposal is included below, but you can also access it on
the wiki [4].

Thanks for reviewing and voting,
Thomas

[1]
https://lists.apache.org/thread.html/12e2bdaa095d68dae6f8731e473d3d43885783177d1b7e3ff2f65b6d@%3Cgeneral.incubator.apache.org%3E

[2]
https://incubator.apache.org/policy/incubation.html#approval_of_proposal_by_sponsor

[3] http://www.apache.org/foundation/voting.html

[4] https://wiki.apache.org/incubator/HudiProposal



= Hudi Proposal =

== Abstract ==

Hudi is a big-data storage library that provides atomic upserts and
incremental data streams.

Hudi manages data stored in Apache Hadoop and other API-compatible
distributed file systems/cloud stores.

== Proposal ==

Hudi provides the ability to atomically upsert datasets with new values in
near real time, making data available quickly to existing query engines
like Apache Hive, Apache Spark, & Presto. Additionally, Hudi provides a
sequence of changes to a dataset from a given point in time to enable
incremental data pipelines that yield greater efficiency & lower latency
than their typical batch counterparts. By carefully managing the number &
size of files, Hudi greatly aids both query engines (e.g., by always
providing well-sized files) and underlying storage (e.g., HDFS NameNode
memory consumption).
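The two primitives above — atomic upserts stamped by commit time, and an
incremental pull of records changed since a given commit instant — can be
sketched as a toy model. This is an illustrative sketch only, not Hudi's
actual API or implementation; all names here (ToyDataset, upsert,
incremental_pull) are hypothetical:

```python
# Toy model of Hudi-style primitives (illustrative only, not the Hudi API):
# a keyed dataset with atomic upserts stamped by commit time, plus an
# incremental pull of records changed after a given commit instant.

class ToyDataset:
    def __init__(self):
        self.records = {}   # record key -> (value, commit_time)
        self.commits = []   # ordered list of completed commit instants

    def upsert(self, commit_time, rows):
        """Atomically apply a batch of (key, value) rows as one commit."""
        for key, value in rows:
            self.records[key] = (value, commit_time)
        self.commits.append(commit_time)  # commit visible once appended

    def incremental_pull(self, since):
        """Return records changed strictly after commit instant `since`."""
        return {k: v for k, (v, t) in self.records.items() if t > since}

ds = ToyDataset()
ds.upsert(1, [("a", 10), ("b", 20)])
ds.upsert(2, [("b", 21), ("c", 30)])   # updates "b", inserts "c"

snapshot = {k: v for k, (v, _) in ds.records.items()}
changed = ds.incremental_pull(since=1)  # only rows touched by commit 2
```

An incremental pipeline would feed `changed` to the next job instead of
rescanning the whole dataset, which is where the efficiency gain over bulk
batch processing comes from.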

Hudi is largely implemented as an Apache Spark library that reads/writes
data from/to Hadoop-compatible filesystems. SQL queries on Hudi datasets are
supported via specialized Apache Hadoop input formats that understand
Hudi’s storage layout. Currently, Hudi manages datasets using a combination
of the Apache Parquet & Apache Avro file/serialization formats.
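One intuition for pairing a columnar format with a row-based one: bulk
snapshots go into scan-friendly base files (Parquet-like), while later
row-level changes are appended cheaply to a log (Avro-like); readers merge
the two, and a compaction step folds the log back into a new base file.
A toy sketch of that merge-on-read idea, under the assumption that each
"file" is just a key-value dict (not Hudi's real file layout):

```python
# Toy sketch of a columnar "base file" plus a row-based "change log".
# Readers see a merged view; compaction rewrites the base with the log
# applied. Illustrative only.

def read_merged(base, log):
    """Merged view: log entries override base rows with the same key."""
    merged = dict(base)
    merged.update(log)
    return merged

def compact(base, log):
    """Rewrite the base file with the log applied, leaving an empty log."""
    return read_merged(base, log), {}

base = {"a": 10, "b": 20}   # bulk-written columnar snapshot
log = {"b": 21, "c": 30}    # row-level upserts appended since the snapshot

view = read_merged(base, log)
base, log = compact(base, log)
```

The trade-off sketched here is that appends to the log are cheap (fast
upserts) while periodic compaction keeps reads scanning mostly well-sized
columnar files.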

== Background ==

The Apache Hadoop Distributed File System (HDFS) & other compatible cloud
storage systems (e.g., Amazon S3, Google Cloud Storage, Microsoft Azure)
serve as longer-term analytical storage for thousands of organizations.
Typical analytical datasets are built by reading data from a source (e.g.,
upstream databases, messaging buses, or other datasets), transforming the
data, writing results back to storage, & making it available for analytical
queries; all of this is typically accomplished in batch jobs which operate
in a bulk fashion on partitions of datasets. Such a style of processing
typically incurs large delays in making data available to queries, as well
as a lot of complexity in carefully partitioning datasets to guarantee
latency SLAs.

The need for fresher/faster analytics has increased enormously in the past
few years, as evidenced by the popularity of stream processing systems like
Apache Spark & Apache Flink, and messaging systems like Apache Kafka. By
using an updateable state store to incrementally compute & instantly reflect
new results to queries, and a “tailable” messaging bus to publish these
results to other downstream jobs, such systems employ a different approach
to building analytical datasets. Even though this approach yields low
latency, the amount of data managed in such real-time data marts is
typically limited in comparison to the aforementioned longer-term storage
options. As a result, the overall data architecture has become more complex,
with more moving parts and specialized systems, leading to duplication of
data and a strain on usability.

Hudi takes a hybrid approach. Instead of moving vast amounts of batch data
to streaming systems, we simply add the streaming primitives (upserts &
incremental consumption) onto existing batch processing technologies. We
believe that by adding some missing blocks to the existing Hadoop stack, we
are able to provide similar capabilities right on top of Hadoop, at a
reduced cost and with increased efficiency, greatly simplifying the
overall architecture in the process.

Hudi was originally developed at Uber (under the original name “Hoodie”) to
address broad inefficiencies in the ingest, ETL, & ML pipelines across
Uber’s data ecosystem that required the upsert & incremental consumption
primitives supported by Hudi.

== Rationale ==

We truly believe the capabilities supported by Hudi will be increasingly
useful for big-data ecosystems, as data volumes & the need for faster data
continue to increase. A detailed description of target use cases can be
found at https://uber.github.io/hudi/use_cases.html.

Given our reliance on so many great Apache projects, we believe that the
Apache way of open source community driven development will enable us to
evolve Hudi in collaboration with a diverse set of contributors who can
bring new ideas into the project.