[RESULT][VOTE] Accept CarbonData into the Apache Incubator

2016-06-02 Thread Jean-Baptiste Onofré

Hi,

I close this vote with only +1: welcome to Apache CarbonData in the 
Incubator !


I will request the resources creation.

Thanks all for your vote.

Regards
JB

On 05/25/2016 10:24 PM, Jean-Baptiste Onofré wrote:

Hi all,

following the discussion thread, I'm now calling a vote to accept
CarbonData into the Incubator.

[ ] +1 Accept CarbonData into the Apache Incubator
[ ] +0 Abstain
[ ] -1 Do not accept CarbonData into the Apache Incubator, because ...

This vote is open for 72 hours.

The proposal follows; you can also read it on the wiki page:
https://wiki.apache.org/incubator/CarbonDataProposal

Thanks !
Regards
JB

= Apache CarbonData =

== Abstract ==

Apache CarbonData is a new Apache Hadoop native file format for faster
interactive queries. It uses advanced columnar storage, indexing,
compression, and encoding techniques to improve computing efficiency,
which in turn helps speed up queries by an order of magnitude over
petabytes of data.

CarbonData GitHub address: https://github.com/HuaweiBigData/carbondata

== Background ==

Huawei is an ICT solution provider. We are committed to enhancing
customer experiences for telecom carriers, enterprises, and consumers on
big data. To satisfy the following customer requirements, we created a
new Hadoop native file format:

  * Support interactive OLAP-style queries over big data in seconds.
  * Support fast queries on individual records, which require touching
all fields.
  * Load data quickly and support incremental loads at intervals of
minutes.
  * Support HDFS so that customers can leverage existing Hadoop clusters.
  * Support time-based data retention.

Based on these requirements, we investigated existing file formats in
the Hadoop ecosystem, but we could not find a suitable solution that
satisfies all of the requirements at the same time, so we started
designing CarbonData.

== Rationale ==

CarbonData contains multiple modules, which are classified into two
categories:

  1. CarbonData File Format: contains the core implementation of the
file format, such as columnar storage, index, dictionary, encoding and
compression, and the API for reading/writing.
  2. CarbonData integration with big data processing frameworks such as
Apache Spark and Apache Hive. Apache Beam is also planned, to abstract
the execution runtime.

=== CarbonData File Format ===

The CarbonData file format is a columnar store in HDFS. It has many
features that a modern columnar format has, such as splittability,
compression schema, and complex data types. CarbonData also has the
following unique features:

==== Indexing ====

To support fast interactive queries, CarbonData leverages indexing
technology to reduce I/O scans. A CarbonData file stores data along with
its index: the index is not stored separately, but is contained in the
CarbonData file itself. In the current implementation, CarbonData
supports 3 types of indexing:

1. Multi-dimensional Key (B+ Tree index)
  The data blocks are written in sequence to the disk, and within each
data block each column block is written in sequence. Finally, the
metadata block for the file is written with information about the byte
positions of each block in the file, the Min-Max statistics index, and
the start and end MDK (multi-dimensional key) of each data block. Since
the entire data in the file is in sorted order, the start and end MDK of
each data block can be used to construct a B+Tree, and the file can be
logically represented as a B+Tree with the data blocks as leaf nodes (on
disk) and the remaining non-leaf nodes in memory.
2. Inverted index
  Inverted indexes are widely used in search engines. Using this index
helps the processing/query engine to do filtering inside one HDFS block.
Furthermore, query acceleration for count-distinct-like operations is
made possible by combining bitmap and inverted index at query time.
3. MinMax index
  For all columns, a minmax index is created so that the processing/query
engine can skip scans that are not required.
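
To illustrate the three index types, here is a minimal Python sketch. It
is not taken from the CarbonData code base: the BlockMeta class, its
field names, and the sample values are all assumptions made for this
example, and two binary searches over the sorted block list stand in for
a real B+Tree descent.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class BlockMeta:
    """Hypothetical per-block entry of the file's metadata block."""
    offset: int        # byte position of the data block in the file
    start_mdk: tuple   # smallest multi-dimensional key in the block
    end_mdk: tuple     # largest multi-dimensional key in the block
    col_min: dict      # column name -> minimum value in the block
    col_max: dict      # column name -> maximum value in the block


def blocks_for_mdk_range(blocks, lo, hi):
    """Return the blocks whose [start_mdk, end_mdk] overlaps [lo, hi].

    The blocks are sorted by MDK and do not overlap, so binary search
    finds the candidate leaf blocks without reading any data block.
    """
    ends = [b.end_mdk for b in blocks]
    starts = [b.start_mdk for b in blocks]
    first = bisect_left(ends, lo)     # first block with end_mdk >= lo
    last = bisect_right(starts, hi)   # one past last block with start_mdk <= hi
    return blocks[first:last]


def prune_by_minmax(blocks, column, value):
    """Keep only blocks whose min/max range can contain `value`."""
    return [b for b in blocks
            if b.col_min[column] <= value <= b.col_max[column]]


def build_inverted_index(column_values):
    """Map each distinct value to the row ids (within one block) holding it."""
    index = {}
    for row_id, value in enumerate(column_values):
        index.setdefault(value, []).append(row_id)
    return index


# Made-up metadata for three data blocks, sorted by MDK.
blocks = [
    BlockMeta(0,    (1, 1), (1, 9), {"amount": 5},  {"amount": 40}),
    BlockMeta(4096, (2, 0), (2, 7), {"amount": 10}, {"amount": 90}),
    BlockMeta(8192, (3, 2), (4, 4), {"amount": 50}, {"amount": 70}),
]
```

With this data, a key range from (2, 0) to (3, 5) touches only the second
and third blocks, and a filter amount = 45 needs to scan only the second
block. For the inverted index, the number of keys in the returned mapping
is the distinct count of the column within that block.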

==== Global Dictionary ====

Besides I/O reduction, CarbonData accelerates computation by using a
global dictionary, which enables processing/query engines to perform all
processing on encoded data without having to convert the data (Late
Materialization). We have observed dramatic performance improvement for
OLAP analytic scenarios where tables contain many columns of string data
type. The data is converted back to user-readable form just before the
processing/query engine returns results to the user.
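
The idea of processing on encoded data can be sketched as follows. This
is an illustration only, not CarbonData's implementation: the function
names and the sample city column are assumptions, and a real global
dictionary would be shared across files and loads rather than built from
one in-memory list.

```python
def build_global_dictionary(values):
    """Assign each distinct string a small integer code."""
    return {v: code for code, v in enumerate(sorted(set(values)))}


def encode(values, dictionary):
    """Replace every string with its integer code."""
    return [dictionary[v] for v in values]


def count_by_code(codes):
    """Group-by/count that touches integer codes only, never strings."""
    counts = {}
    for code in codes:
        counts[code] = counts.get(code, 0) + 1
    return counts


def materialize(counts, dictionary):
    """Late materialization: decode codes only for the final result."""
    reverse = {code: v for v, code in dictionary.items()}
    return {reverse[code]: n for code, n in counts.items()}


# Made-up string column.
cities = ["Shenzhen", "Paris", "Shenzhen", "Bangalore", "Paris", "Shenzhen"]
dictionary = build_global_dictionary(cities)
codes = encode(cities, dictionary)
result = materialize(count_by_code(codes), dictionary)
```

Here the aggregation runs entirely on small integers, and the strings are
looked up once per distinct group when the result is returned.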

==== Column Group ====

Sometimes users want to perform processing/queries on multiple columns
in one table, for example, scanning for an individual record in a
troubleshooting scenario. In this case, row format is more efficient
than columnar format since all columns will be touched by the workload.
To accelerate this, CarbonData supports storing a group of columns in
row format, so data in a column group is stored together, enabling fast
retrieval.
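
The difference between the two layouts can be sketched like this. The
layout functions and the sample log records are assumptions made for
illustration; they model the access pattern only, not CarbonData's
on-disk encoding.

```python
def to_columnar(rows, columns):
    """Columnar layout: one separate list of values per column."""
    return {c: [r[c] for r in rows] for c in columns}


def to_column_group(rows, group):
    """Column-group layout: the grouped columns stored together per row."""
    return [tuple(r[c] for c in group) for r in rows]


def fetch_record_columnar(store, columns, row_id):
    """Rebuilding one record needs one lookup per column."""
    return tuple(store[c][row_id] for c in columns)


def fetch_record_grouped(group_store, row_id):
    """Rebuilding one record needs a single lookup."""
    return group_store[row_id]


# Made-up troubleshooting records.
columns = ["host", "code", "msg"]
rows = [
    {"host": "a", "code": 200, "msg": "ok"},
    {"host": "b", "code": 500, "msg": "timeout"},
    {"host": "c", "code": 404, "msg": "missing"},
]
columnar = to_columnar(rows, columns)
grouped = to_column_group(rows, columns)
```

Both layouts return the same record for a given row id; the column group
gets it from one contiguous entry, which is the access pattern that
favors row format when every column is touched.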

==== Optimized for multiple use cases ====

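CarbonData indices and dictionary are highly configurable. To make
storage optimized for different use cases, users can configure what to
index, so users can decide and tune the format before loading data into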

Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-31 Thread Sandeep Deshmukh
+1 (non-binding)

Regards,
Sandeep

On Mon, May 30, 2016 at 7:04 PM, lidong <lid...@apache.org> wrote:

> +1 (non-binding)
>
>
> Thanks,
> Dong
> ---
> Apache Kylin - http://kylin.apache.org
> Kyligence Inc. - http://kyligence.io

Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-30 Thread lidong
+1 (non-binding)


Thanks,
Dong
---
Apache Kylin - http://kylin.apache.org
Kyligence Inc. - http://kyligence.io


Original Message
Sender: Jean-Baptiste Onofré j...@nanthrax.net
Recipient: general gene...@incubator.apache.org
Date: Monday, May 30, 2016 14:07
Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator


My own +1 (binding) ;)

Regards
JB

Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-30 Thread Jean-Baptiste Onofré

My own +1 (binding) ;)

Regards
JB

Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-27 Thread Amol Kekre
+1 (non-binding)

Thks
Amol

On Fri, May 27, 2016 at 5:53 AM, Jim Jagielski  wrote:

> Thx for the feedback...
>
> I change my vote to +1 (binding)

Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-27 Thread Jim Jagielski
Thx for the feedback...

I change my vote to +1 (binding)
> On May 27, 2016, at 1:46 AM, Jean-Baptiste Onofré  wrote:
> 
> Hi Jim,
> 
> Good point. Let me try to explain this "gap", based on my discussion with the 
> team:
> 
> 1. Some people have been involved mostly in architecture and design, more 
> than directly in code. That's why they are part of the initial committer list, 
> even though they didn't really provide "visible" code on GitHub.
> 
> 2. Some people are no longer involved in the project. That's why they don't 
> appear on the initial committer list.
> 
> Regards
> JB
> 
> On 05/26/2016 05:45 PM, Jim Jagielski wrote:
>> I am trying to align the list of initial committers with
>> the list of current/active contributors, according to
>> Github, and I am seeing people proposed who have not
>> contributed anything and people NOT proposed who seem
>> to be kinda active...
>> 
>> Sooo. -0

Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-27 Thread Sergio Fernández
+1 (binding)

On Wed, May 25, 2016 at 10:24 PM, Jean-Baptiste Onofré 
wrote:

> Hi all,
>
> following the discussion thread, I'm now calling a vote to accept
> CarbonData into the Incubator.
>
> ​[ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
> This vote is open for 72 hours.
>
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
>
> Thanks !
> Regards
> JB
>

Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-27 Thread Madhawa Kasun Gunasekara
+1

Thanks,
Madhawa

On Fri, May 27, 2016 at 11:16 AM, Jean-Baptiste Onofré 
wrote:

> Hi Jim,
>
> good point. Let me try to explain this "gap" based on my discussion with
> the team:
>
> 1. Some people have been involved mostly in architecture and design rather
> than directly in code. That's why they are part of the initial committer
> list, even though they didn't really provide "visible" code on GitHub.
>
> 2. Some people are no longer involved in the project. That's why they don't
> appear on the initial committer list.
>
> Regards
> JB
>
>
> On 05/26/2016 05:45 PM, Jim Jagielski wrote:
>
>> I am trying to align the list of initial committers with
>> the list of current/active contributors, according to
>> Github, and I am seeing people proposed who have not
>> contributed anything and people NOT proposed who seem
>> to be kinda active...
>>
>> Sooo. -0
>>
>> On May 25, 2016, at 4:24 PM, Jean-Baptiste Onofré wrote:
>>>
>>> Hi all,
>>>
>>> following the discussion thread, I'm now calling a vote to accept
>>> CarbonData into the Incubator.
>>>
>>> [ ] +1 Accept CarbonData into the Apache Incubator
>>> [ ] +0 Abstain
>>> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>>>
>>> This vote is open for 72 hours.
>>>
>>> The proposal follows, you can also access the wiki page:
>>> https://wiki.apache.org/incubator/CarbonDataProposal
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>

Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-26 Thread Jean-Baptiste Onofré

Hi Jim,

good point. Let me try to explain this "gap" based on my discussion 
with the team:

1. Some people have been involved mostly in architecture and design rather 
than directly in code. That's why they are part of the initial committer 
list, even though they didn't really provide "visible" code on GitHub.

2. Some people are no longer involved in the project. That's why they 
don't appear on the initial committer list.


Regards
JB

On 05/26/2016 05:45 PM, Jim Jagielski wrote:

I am trying to align the list of initial committers with
the list of current/active contributors, according to
Github, and I am seeing people proposed who have not
contributed anything and people NOT proposed who seem
to be kinda active...

Sooo. -0


On May 25, 2016, at 4:24 PM, Jean-Baptiste Onofré  wrote:

Hi all,

following the discussion thread, I'm now calling a vote to accept CarbonData 
into the Incubator.

[ ] +1 Accept CarbonData into the Apache Incubator
[ ] +0 Abstain
[ ] -1 Do not accept CarbonData into the Apache Incubator, because ...

This vote is open for 72 hours.

The proposal follows, you can also access the wiki page:
https://wiki.apache.org/incubator/CarbonDataProposal

Thanks !
Regards
JB


RE: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-26 Thread Zheng, Kai
+1 (non-binding)

Regards,
Kai

-----Original Message-----
From: Gangumalla, Uma [mailto:uma.ganguma...@intel.com] 
Sent: Friday, May 27, 2016 1:10 AM
To: general@incubator.apache.org
Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator

+1 (binding)

Regards,
Uma

On 5/25/16, 1:24 PM, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:

>Hi all,
>
>following the discussion thread, I'm now calling a vote to accept 
>CarbonData into the Incubator.
>
>[ ] +1 Accept CarbonData into the Apache Incubator
>[ ] +0 Abstain
>[ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
>This vote is open for 72 hours.
>
>The proposal follows, you can also access the wiki page:
>https://wiki.apache.org/incubator/CarbonDataProposal
>
>Thanks !
>Regards
>JB
>

Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-26 Thread Gangumalla, Uma
+1 (binding)

Regards,
Uma

On 5/25/16, 1:24 PM, "Jean-Baptiste Onofré"  wrote:

>Hi all,
>
>following the discussion thread, I'm now calling a vote to accept
>CarbonData into the Incubator.
>
>[ ] +1 Accept CarbonData into the Apache Incubator
>[ ] +0 Abstain
>[ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
>This vote is open for 72 hours.
>
>The proposal follows, you can also access the wiki page:
>https://wiki.apache.org/incubator/CarbonDataProposal
>
>Thanks !
>Regards
>JB
>
>= Apache CarbonData =
>
>== Abstract ==
>
>Apache CarbonData is a new Apache Hadoop native file format for faster
>interactive queries. It uses advanced columnar storage, indexing,
>compression, and encoding techniques to improve computing efficiency,
>which in turn will help speed up queries by an order of magnitude over
>petabytes of data.
>
>CarbonData GitHub address: https://github.com/HuaweiBigData/carbondata
>
>== Background ==
>
>Huawei is an ICT solution provider committed to enhancing customer
>experiences for telecom carriers, enterprises, and consumers on big
>data. To satisfy the following customer requirements, we created a new
>Hadoop native file format:
>
>  * Support interactive OLAP-style queries over big data in seconds.
>  * Support fast queries on individual records, which require touching
>all fields.
>  * Support fast data loading and incremental loads at intervals of
>minutes.
>  * Support HDFS so that customers can leverage existing Hadoop clusters.
>  * Support time-based data retention.
>
>Based on these requirements, we investigated existing file formats in
>the Hadoop ecosystem, but we could not find a suitable solution that
>satisfies all of the requirements at the same time, so we started
>designing CarbonData.
>
>== Rationale ==
>
>CarbonData contains multiple modules, which fall into two categories:
>
>  1. CarbonData File Format: the core implementation of the file format,
>including columnar storage, index, dictionary, encoding and compression,
>and the API for reading and writing.
>  2. CarbonData integration with big data processing frameworks such as
>Apache Spark and Apache Hive. Apache Beam is also planned, to abstract
>the execution runtime.
>
>=== CarbonData File Format ===
>
>The CarbonData file format is a columnar store in HDFS. It has many of
>the features of a modern columnar format, such as splittability,
>compression schema, and complex data types. In addition, CarbonData has
>the following unique features:
>
>==== Indexing ====
>
>To support fast interactive queries, CarbonData leverages indexing
>technology to reduce I/O scans. A CarbonData file stores data along with
>its index: the index is not stored separately, but is contained in the
>CarbonData file itself. In the current implementation, CarbonData
>supports 3 types of indexing:
>
>1. Multi-dimensional Key (B+ Tree index)
>  The data blocks are written in sequence to the disk, and within each
>data block each column block is written in sequence. Finally, the
>metadata block for the file is written with information about the byte
>positions of each block in the file, the Min-Max statistics index, and
>the start and end MDK (multi-dimensional key) of each data block. Since
>the entire data in the file is in sorted order, the start and end MDK of
>each data block can be used to construct a B+Tree, and the file can be
>logically represented as a B+Tree with the data blocks as leaf nodes (on
>disk) and the remaining non-leaf nodes in memory.
>2. Inverted index
>  Inverted indexes are widely used in search engines. This index helps
>the processing/query engine filter inside a single HDFS block.
>Furthermore, combining bitmaps and the inverted index at query time makes
>it possible to accelerate count-distinct-like operations.
>3. MinMax index
>  A minmax index is created for all columns so that the processing/query
>engine can skip scans that are not required.
>
>==== Global Dictionary ====
>
>Besides I/O reduction, CarbonData accelerates computation by using a
>global dictionary, which enables processing/query engines to perform all
>processing on encoded data without having to convert the data (late
>materialization). We have observed dramatic performance improvements in
>OLAP analytic scenarios where tables contain many columns of string data
>type. The data is converted back to user-readable form just before the
>processing/query engine returns results to the user.
>
>==== Column Group ====
>
>Sometimes users want to run processing/queries on multiple columns in one
>table, for example scanning for an individual record in a
>troubleshooting scenario. In this case, a row format is more efficient
>than a columnar format since all columns will be touched by the workload.
>To accelerate this, CarbonData supports storing a group of columns in row
>format, so the data in a column group is stored together, enabling fast
>retrieval.
>
>==== Optimized for multiple use cases ====
>
>CarbonData indices and dictionary are highly configurable. To make
>storage 

Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-26 Thread Jim Jagielski
I am trying to align the list of initial committers with
the list of current/active contributors, according to
Github, and I am seeing people proposed who have not
contributed anything and people NOT proposed who seem
to be kinda active...

Sooo. -0

> On May 25, 2016, at 4:24 PM, Jean-Baptiste Onofré  wrote:
> 
> Hi all,
> 
> following the discussion thread, I'm now calling a vote to accept CarbonData 
> into the Incubator.
> 
> [ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
> 
> This vote is open for 72 hours.
> 
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
> 
> Thanks !
> Regards
> JB
> 

Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-26 Thread David E Jones

+1

-David (jonesde@a.o)


> On 25 May 2016, at 13:24, Jean-Baptiste Onofré  wrote:
> 
> Hi all,
> 
> following the discussion thread, I'm now calling a vote to accept CarbonData 
> into the Incubator.
> 
> [ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
> 
> This vote is open for 72 hours.
> 
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
> 
> Thanks !
> Regards
> JB
> 
> = Apache CarbonData =
> 
> == Abstract ==
> 
> Apache CarbonData is a new Apache Hadoop native file format for faster 
> interactive
> query using advanced columnar storage, index, compression and encoding 
> techniques
> to improve computing efficiency, in turn it will help speedup queries an 
> order of
> magnitude faster over PetaBytes of data.
> 
> CarbonData github address: https://github.com/HuaweiBigData/carbondata
> 
> == Background ==
> 
> Huawei is an ICT solution provider, we are committed to enhancing customer 
> experiences for telecom carriers, enterprises, and consumers on big data, In 
> order to satisfy the following customer requirements, we created a new Hadoop 
> native file format:
> 
> * Support interactive OLAP-style query over big data in seconds.
> * Support fast query on individual record which require touching all fields.
> * Fast data loading speed and support incremental load in period of minutes.
> * Support HDFS so that customer can leverage existing Hadoop cluster.
> * Support time based data retention.
> 
> Based on these requirements, we investigated existing file formats in the 
> Hadoop eco-system, but we could not find a suitable solution that satisfying 
> requirements all at the same time, so we start designing CarbonData.
> 
> == Rationale ==
> 
> CarbonData contains multiple modules, which are classified into two 
> categories:
> 
> 1. CarbonData File Format: which contains core implementation for file format 
> such as columnar,index,dictionary,encoding+compression,API for 
> reading/writing etc.
> 2. CarbonData integration with big data processing framework such as Apache 
> Spark, Apache Hive etc. Apache Beam is also planned to abstract the execution 
> runtime.
> 
> === CarbonData File Format ===
> 
> CarbonData file format is a columnar store in HDFS, it has many features that 
> a modern columnar format has, such as splittable, compression schema ,complex 
> data type etc. And CarbonData has following unique features:
> 
>  Indexing 
> 
> In order to support fast interactive query, CarbonData leverage indexing 
> technology to reduce I/O scans. CarbonData files stores data along with 
> index, the index is not stored separately but the CarbonData file itself 
> contains the index. In current implementation, CarbonData supports 3 types of 
> indexing:
> 
> 1. Multi-dimensional Key (B+ Tree index)
> The Data block are written in sequence to the disk and within each data 
> blocks each column block is written in sequence. Finally, the metadata block 
> for the file is written with information about byte positions of each block 
> in the file, Min-Max statistics index and the start and end MDK of each data 
> block. Since, the entire data in the file is in sorted order, the start and 
> end MDK of each data block can be used to construct a B+Tree and the file can 
> be logically  represented as a B+Tree with the data blocks as leaf nodes (on 
> disk) and the remaining non-leaf nodes in memory.
> 2. Inverted index
> Inverted index is widely used in search engine. By using this index, it helps 
> processing/query engine to do filtering inside one HDFS block. Furthermore, 
> query acceleration for count distinct like operation is made possible when 
> combining bitmap and inverted index in query time.
> 3. MinMax index
> For all columns, minmax index is created so that processing/query engine can 
> skip scan that is not required.
> 
>  Global Dictionary 
> 
> Besides reducing I/O, CarbonData accelerates computation by using a global
> dictionary, which enables processing/query engines to perform all processing
> on encoded data without having to convert it (late materialization). We have
> observed dramatic performance improvements in OLAP analytic scenarios where
> tables contain many string columns. The data is converted back to
> user-readable form just before the processing/query engine returns results
> to the user.
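
A minimal sketch of dictionary encoding with late materialization, assuming a simple in-memory dictionary. The `GlobalDictionary` class and its methods are hypothetical and only illustrate the idea: the aggregation runs on integer codes, and strings are restored only for the final result.

```python
from collections import Counter


class GlobalDictionary:
    """Hypothetical dictionary that assigns one integer code per distinct value."""

    def __init__(self):
        self._code_of = {}   # value -> code
        self._value_of = []  # code -> value

    def encode(self, value):
        code = self._code_of.get(value)
        if code is None:
            code = len(self._value_of)
            self._code_of[value] = code
            self._value_of.append(value)
        return code

    def decode(self, code):
        return self._value_of[code]


dictionary = GlobalDictionary()
cities = ["Shenzhen", "Beijing", "Shenzhen", "Shanghai", "Beijing", "Shenzhen"]

# At load time the string column is stored as integer codes.
encoded = [dictionary.encode(c) for c in cities]

# A GROUP BY / COUNT runs entirely on the codes.
counts_by_code = Counter(encoded)

# Late materialization: decode only the few result rows returned to the user.
result = {dictionary.decode(code): n for code, n in counts_by_code.items()}
```

Comparing and hashing small integers is cheaper than doing the same on strings, which is consistent with the gain the proposal reports for tables with many string columns.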
> 
> ==== Column Group ====
> 
> Sometimes users want to run processing or queries over many columns of one
> table, for example scanning for an individual record in a troubleshooting
> scenario. In this case a row format is more efficient than a columnar format,
> since the workload touches all columns. To accelerate this, CarbonData
> supports storing a group of columns in row format, so the data in a column
> group is stored together, enabling fast retrieval.
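
The difference between a purely columnar layout and a column group can be sketched as follows. The layout and function names are assumptions made for illustration; the proposal does not describe the on-disk encoding of a column group.

```python
columns = ("id", "day", "status")
rows = [
    (1, "2016-05-25", "timeout"),
    (2, "2016-05-26", "ok"),
    (3, "2016-05-26", "reset"),
]


def to_columnar(rows, columns):
    """Store each column as its own list of values."""
    return {name: [r[i] for r in rows] for i, name in enumerate(columns)}


def to_column_group(rows, columns, group):
    """Store the columns named in `group` together, one tuple per row."""
    positions = [columns.index(name) for name in group]
    return [tuple(r[p] for p in positions) for r in rows]


def fetch_record_columnar(store, columns, row_id):
    """Rebuild one record from a columnar store: one lookup per column."""
    return tuple(store[name][row_id] for name in columns)


def fetch_record_grouped(group_store, row_id):
    """Read one record from a column group: a single lookup."""
    return group_store[row_id]


columnar = to_columnar(rows, columns)
grouped = to_column_group(rows, columns, ("day", "status"))

record = fetch_record_columnar(columnar, columns, 1)
day_and_status = fetch_record_grouped(grouped, 2)
```

Fetching a whole record from the columnar store touches every column, while the grouped store returns the grouped fields in one read. That is the trade-off the paragraph above describes.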

Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-26 Thread Jake Farrell
+1 (binding)

-Jake


Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-25 Thread Luke Han
+1 (binding)


Best Regards!
-

Luke Han


RE: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-25 Thread Wang, Gang1
+1 (non-binding)

Best Regards
+Gary.


RE: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-25 Thread Cheng, Hao
+1


Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-25 Thread Jacques Nadeau
+1 (binding)


Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-25 Thread John D. Ament
+1


Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-25 Thread Henry Saputra
+1 (binding)

On Wednesday, May 25, 2016, Jean-Baptiste Onofré  wrote:

> Hi all,
>
> following the discussion thread, I'm now calling a vote to accept
> CarbonData into the Incubator.
>
> ​[ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
> This vote is open for 72 hours.
>
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
>
> Thanks !
> Regards
> JB
>
> = Apache CarbonData =
>
> == Abstract ==
>
> Apache CarbonData is a new Apache Hadoop native file format for faster
> interactive
> query using advanced columnar storage, index, compression and encoding
> techniques
> to improve computing efficiency, in turn it will help speedup queries an
> order of
> magnitude faster over PetaBytes of data.
>
> CarbonData github address: https://github.com/HuaweiBigData/carbondata

Re: [VOTE] Accept CarbonData into the Apache Incubator

2016-05-25 Thread Julian Hyde
+1 

Julian

> On May 25, 2016, at 1:24 PM, Jean-Baptiste Onofré wrote:
> 
> Hi all,
> 
> following the discussion thread, I'm now calling a vote to accept CarbonData 
> into the Incubator.
> 
> [ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
> 
> This vote is open for 72 hours.
> 
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
> 
> Thanks !
> Regards
> JB
> 

[VOTE] Accept CarbonData into the Apache Incubator

2016-05-25 Thread Jean-Baptiste Onofré

Hi all,

following the discussion thread, I'm now calling a vote to accept 
CarbonData into the Incubator.


[ ] +1 Accept CarbonData into the Apache Incubator
[ ] +0 Abstain
[ ] -1 Do not accept CarbonData into the Apache Incubator, because ...

This vote is open for 72 hours.

The proposal follows; you can also access it on the wiki page:
https://wiki.apache.org/incubator/CarbonDataProposal

Thanks !
Regards
JB

= Apache CarbonData =

== Abstract ==

Apache CarbonData is a new Apache Hadoop native file format for faster
interactive queries. It uses advanced columnar storage, indexing,
compression, and encoding techniques to improve computing efficiency,
which in turn helps speed up queries by an order of magnitude over
petabytes of data.

CarbonData GitHub address: https://github.com/HuaweiBigData/carbondata

== Background ==

Huawei is an ICT solution provider committed to enhancing customer
experiences for telecom carriers, enterprises, and consumers on big
data. To satisfy the following customer requirements, we created a new
Hadoop native file format:

 * Support interactive OLAP-style queries over big data in seconds.
 * Support fast queries on individual records, which require touching
   all fields.
 * Support fast data loading and incremental loads within a period of
   minutes.
 * Support HDFS so that customers can leverage their existing Hadoop
   clusters.
 * Support time-based data retention.

Based on these requirements, we investigated existing file formats in
the Hadoop ecosystem, but could not find a suitable solution that
satisfied all of the requirements at the same time, so we started
designing CarbonData.

== Rationale ==

CarbonData contains multiple modules, which are classified into two
categories:

 1. CarbonData File Format: contains the core implementation of the
file format, such as columnar storage, index, dictionary, encoding and
compression, and the API for reading/writing.
 2. CarbonData integration with big data processing frameworks such as
Apache Spark and Apache Hive. Apache Beam integration is also planned,
to abstract the execution runtime.


=== CarbonData File Format ===

The CarbonData file format is a columnar store in HDFS. It has many
features that a modern columnar format has, such as being splittable,
compression schemas, and complex data types. In addition, CarbonData
has the following unique features:


==== Indexing ====

To support fast interactive queries, CarbonData leverages indexing
technology to reduce I/O scans. A CarbonData file stores data along with
its index: the index is not stored separately; the CarbonData file
itself contains it. In the current implementation, CarbonData supports
three types of indexing:


1. Multi-dimensional Key (B+ Tree index)
 The data blocks are written to disk in sequence, and within each data
block each column block is written in sequence. Finally, the metadata
block for the file is written, with the byte positions of each block in
the file, the Min-Max statistics index, and the start and end MDK
(multi-dimensional key) of each data block. Since all the data in the
file is in sorted order, the start and end MDK of each data block can be
used to construct a B+Tree, and the file can be logically represented as
a B+Tree with the data blocks as leaf nodes (on disk) and the remaining
non-leaf nodes in memory.

2. Inverted index
 Inverted indexes are widely used in search engines. This index helps
the processing/query engine filter inside one HDFS block. Furthermore,
combining bitmaps with the inverted index at query time makes it
possible to accelerate count-distinct-like operations.

3. MinMax index
 A MinMax index is created for all columns so that the processing/query
engine can skip scans that are not required.
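
As an illustration only, the block pruning described above can be
sketched in Python. The BlockMeta layout, the function names, and the
sample data are assumptions made for this sketch, not the actual
CarbonData metadata format or API. A binary search over the sorted
block boundaries stands in for the B+Tree descent, and the min/max
statistics then discard blocks that cannot contain the filter value.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class BlockMeta:
    """Hypothetical per-block metadata, as it might be kept in a file footer."""
    start_mdk: tuple  # smallest multi-dimensional key in the block
    end_mdk: tuple    # largest multi-dimensional key in the block
    min_max: dict     # column name -> (min, max) of that column in the block


def blocks_for_key_range(blocks, lo, hi):
    """Return indexes of blocks whose [start_mdk, end_mdk] overlaps [lo, hi].

    Blocks are sorted and non-overlapping, so binary search over the
    boundaries plays the role of walking the in-memory non-leaf nodes.
    """
    ends = [b.end_mdk for b in blocks]
    starts = [b.start_mdk for b in blocks]
    first = bisect_left(ends, lo)     # first block ending at or after lo
    last = bisect_right(starts, hi)   # one past the last block starting at or before hi
    return list(range(first, last))


def prune_by_min_max(blocks, candidates, column, value):
    """Keep only candidate blocks whose min/max range can contain value."""
    kept = []
    for i in candidates:
        low, high = blocks[i].min_max[column]
        if low <= value <= high:
            kept.append(i)
    return kept


# Three sorted blocks keyed by (region_id, sequence); sample data only.
blocks = [
    BlockMeta((1, 0), (1, 99), {"latency_ms": (5, 40)}),
    BlockMeta((2, 0), (2, 99), {"latency_ms": (10, 900)}),
    BlockMeta((3, 0), (3, 99), {"latency_ms": (1, 30)}),
]

candidates = blocks_for_key_range(blocks, (2, 0), (3, 50))
survivors = prune_by_min_max(blocks, candidates, "latency_ms", 500)
print(candidates, survivors)
```

In this sketch only the surviving blocks would need to be read from
disk; every other block is skipped using metadata alone.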


==== Global Dictionary ====

Besides I/O reduction, CarbonData accelerates computation by using a
global dictionary, which enables processing/query engines to perform all
processing on encoded data without having to convert the data (late
materialization). We have observed dramatic performance improvements in
OLAP analytic scenarios where the table contains many string columns.
The data is converted back to user-readable form just before the
processing/query engine returns results to the user.
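
A minimal sketch of the idea in Python, assuming a simple
value-to-integer mapping (the function name and sample data are
hypothetical and do not reflect the actual CarbonData dictionary
format): filtering and aggregation run on integer codes, and strings
are only looked up again when the result is produced.

```python
def build_dictionary(values):
    """Assign an integer code to each distinct value (sorted for determinism)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


cities = ["Shenzhen", "Paris", "Shenzhen", "Bangalore", "Paris", "Shenzhen"]

dictionary = build_dictionary(cities)                    # string -> code
decode = {code: value for value, code in dictionary.items()}
encoded = [dictionary[value] for value in cities]        # stored column

# Filter: translate the predicate once, then compare integers only.
target = dictionary["Shenzhen"]
matching_rows = [row for row, code in enumerate(encoded) if code == target]

# Group-by count, still entirely on encoded data.
counts = {}
for code in encoded:
    counts[code] = counts.get(code, 0) + 1

# Late materialization: decode only the final, small result.
result = {decode[code]: n for code, n in counts.items()}
print(matching_rows, result)
```

The design point is that the number of dictionary lookups depends on
the size of the result, not on the number of rows scanned.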


==== Column Group ====

Sometimes users want to process or query many columns of one table
together, for example, when scanning for an individual record in a
troubleshooting scenario. In this case, a row format is more efficient
than a columnar format, since the workload touches all columns. To
accelerate this, CarbonData supports storing a group of columns in row
format, so the data in a column group is stored together, enabling fast
retrieval.
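
To make the layout difference concrete, here is a small Python sketch.
The to_storage and fetch_group helpers and the column names are
assumptions for illustration, not CarbonData code. Ungrouped columns
are stored one list per column, while the columns of a group are stored
together as one tuple per row, so fetching a whole record from the
group is a single lookup.

```python
def to_storage(rows, column_groups):
    """Split rows into per-column lists, keeping each column group row-wise.

    rows: list of dicts with identical keys.
    column_groups: list of tuples of column names to store together.
    """
    grouped = {name for group in column_groups for name in group}
    columns = {
        name: [row[name] for row in rows]
        for name in rows[0]
        if name not in grouped
    }
    groups = {
        group: [tuple(row[name] for name in group) for row in rows]
        for group in column_groups
    }
    return columns, groups


def fetch_group(groups, group, row_id):
    """Return all columns of one group for one row with a single lookup."""
    return dict(zip(group, groups[group][row_id]))


# Sample records; column names are made up for the example.
rows = [
    {"subscriber": "A", "cell_id": 17, "error_code": 0, "bytes": 1200},
    {"subscriber": "B", "cell_id": 42, "error_code": 5, "bytes": 300},
]
trouble_group = ("subscriber", "cell_id", "error_code")

columns, groups = to_storage(rows, [trouble_group])
record = fetch_group(groups, trouble_group, 1)
print(columns, record)
```

Analytic scans over the "bytes" column still read one contiguous list,
while the troubleshooting lookup avoids stitching a record together
from three separate columns.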


==== Optimized for multiple use cases ====

CarbonData indices and dictionaries are highly configurable. To
optimize storage for different use cases, users can configure what to
index, so they can decide on and tune the format before loading data
into CarbonData.


For example

|| Use Case