[RESULT][VOTE] Accept CarbonData into the Apache Incubator
Hi,

I close this vote with only +1 votes: welcome to Apache CarbonData in the Incubator!

I will request the creation of the resources.

Thanks all for your votes.

Regards
JB

On 05/25/2016 10:24 PM, Jean-Baptiste Onofré wrote:

Hi all,

following the discussion thread, I'm now calling a vote to accept CarbonData into the Incubator.

[ ] +1 Accept CarbonData into the Apache Incubator
[ ] +0 Abstain
[ ] -1 Do not accept CarbonData into the Apache Incubator, because ...

This vote is open for 72 hours.

The proposal follows, you can also access the wiki page:
https://wiki.apache.org/incubator/CarbonDataProposal

Thanks !
Regards
JB

= Apache CarbonData =

== Abstract ==

Apache CarbonData is a new Apache Hadoop native file format for faster interactive queries. It uses advanced columnar storage, index, compression and encoding techniques to improve computing efficiency, which in turn helps speed up queries by an order of magnitude over petabytes of data.

CarbonData github address: https://github.com/HuaweiBigData/carbondata

== Background ==

Huawei is an ICT solution provider committed to enhancing customer experiences for telecom carriers, enterprises, and consumers on big data. In order to satisfy the following customer requirements, we created a new Hadoop native file format:

* Support interactive OLAP-style queries over big data in seconds.
* Support fast queries on individual records, which require touching all fields.
* Support fast data loading and incremental loads at intervals of minutes.
* Support HDFS so that customers can leverage their existing Hadoop clusters.
* Support time-based data retention.

Based on these requirements, we investigated existing file formats in the Hadoop ecosystem, but we could not find a suitable solution that satisfies all of the requirements at the same time, so we started designing CarbonData.

== Rationale ==

CarbonData contains multiple modules, which are classified into two categories:

1. CarbonData File Format: contains the core implementation of the file format, such as columnar storage, index, dictionary, encoding and compression, and the API for reading/writing.
2. CarbonData integration with big data processing frameworks such as Apache Spark and Apache Hive. Apache Beam is also planned, to abstract the execution runtime.

=== CarbonData File Format ===

The CarbonData file format is a columnar store in HDFS. It has many features that a modern columnar format has, such as being splittable, compression schema and complex data types. CarbonData also has the following unique features:

Indexing

In order to support fast interactive queries, CarbonData leverages indexing technology to reduce I/O scans. CarbonData files store data along with the index: the index is not stored separately, the CarbonData file itself contains it. In the current implementation, CarbonData supports 3 types of indexing:

1. Multi-dimensional Key (B+ Tree index)
The data blocks are written in sequence to the disk, and within each data block each column block is written in sequence. Finally, the metadata block for the file is written with information about the byte positions of each block in the file, the Min-Max statistics index, and the start and end MDK of each data block. Since the entire data in the file is in sorted order, the start and end MDK of each data block can be used to construct a B+Tree, and the file can be logically represented as a B+Tree with the data blocks as leaf nodes (on disk) and the remaining non-leaf nodes in memory.
2. Inverted index
Inverted indexes are widely used in search engines. This index helps the processing/query engine to do filtering inside one HDFS block. Furthermore, count-distinct-like operations can be accelerated by combining bitmaps and the inverted index at query time.
3. MinMax index
A MinMax index is created for all columns, so that the processing/query engine can skip scans that are not required.
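As a rough illustration of how a MinMax index lets an engine skip scans, here is a minimal Python sketch. It is not taken from the CarbonData code base: the `BlockStats` layout, the `blocks_to_scan` helper and the sample values are all hypothetical, and it assumes one integer column with per-block minimum and maximum values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockStats:
    """Min/max statistics for one column of one data block (hypothetical layout)."""
    block_id: int
    min_value: int
    max_value: int


def blocks_to_scan(stats: List[BlockStats], low: int, high: int) -> List[int]:
    """Return ids of blocks whose [min, max] range overlaps the filter range [low, high].

    Blocks that cannot contain a matching value are skipped without reading them.
    """
    return [s.block_id for s in stats if s.max_value >= low and s.min_value <= high]


# Three blocks of a sorted column; the values are made up for the example.
stats = [
    BlockStats(block_id=0, min_value=1, max_value=100),
    BlockStats(block_id=1, min_value=101, max_value=200),
    BlockStats(block_id=2, min_value=201, max_value=300),
]

# A filter such as `col BETWEEN 150 AND 220` only needs blocks 1 and 2.
print(blocks_to_scan(stats, 150, 220))  # prints [1, 2]
```

The same overlap test applies to the start and end MDK of each data block when the data is sorted, which is what makes the B+Tree lookup described in item 1 possible.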
Global Dictionary

Besides I/O reduction, CarbonData accelerates computation by using a global dictionary, which enables processing/query engines to perform all processing on encoded data without having to convert the data (late materialization). We have observed dramatic performance improvements for OLAP analytic scenarios where tables contain many columns of string data type. The data is converted back to user-readable form just before the processing/query engine returns results to the user.

Column Group

Sometimes users want to perform processing/queries on multiple columns of one table, for example scanning an individual record in a troubleshooting scenario. In this case, row format is more efficient than columnar format, since all columns will be touched by the workload. To accelerate this, CarbonData supports storing a group of columns in row format, so the data in a column group is stored together, which enables fast retrieval.

Optimized for multiple use cases

CarbonData indices and dictionary are highly configurable. To make storage optimized for different use cases, users can configure what to index, so they can decide and tune the format before loading data into
Re: [VOTE] Accept CarbonData into the Apache Incubator
+1 (non-binding)

Regards,
Sandeep

On Mon, May 30, 2016 at 7:04 PM, lidong <lid...@apache.org> wrote:

> +1 (non-binding)
>
> Thanks,
> Dong
> ---
> Apache Kylin - http://kylin.apache.org
> Kyligence Inc. - http://kyligence.io
>
> Original Message
> Sender: Jean-Baptiste Onofré j...@nanthrax.net
> Recipient: general gene...@incubator.apache.org
> Date: Monday, May 30, 2016 14:07
> Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator
>
> My own +1 (binding) ;)
>
> Regards
> JB
Re: [VOTE] Accept CarbonData into the Apache Incubator
+1 (non-binding)

Thanks,
Dong
---
Apache Kylin - http://kylin.apache.org
Kyligence Inc. - http://kyligence.io

Original Message
Sender: Jean-Baptiste Onofré j...@nanthrax.net
Recipient: general gene...@incubator.apache.org
Date: Monday, May 30, 2016 14:07
Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator

My own +1 (binding) ;)

Regards
JB
Re: [VOTE] Accept CarbonData into the Apache Incubator
My own +1 (binding) ;)

Regards
JB
Re: [VOTE] Accept CarbonData into the Apache Incubator
+1 (non-binding)

Thks
Amol

On Fri, May 27, 2016 at 5:53 AM, Jim Jagielski wrote:

> Thx for the feedback...
>
> I change my vote to +1 (binding)
>
> > On May 27, 2016, at 1:46 AM, Jean-Baptiste Onofré wrote:
> >
> > Hi Jim,
> >
> > good point. Let me try to explain this "gap" regarding my discussion with the team:
> >
> > 1. Some people have been involved mostly in architecture and design more directly in code. That's why they are part of the initial committer list, whereas they didn't really provide "visible" code on github.
> >
> > 2. Some people are no more involved in the project. That's why they don't appear on the initial committer list.
> >
> > Regards
> > JB
> >
> > On 05/26/2016 05:45 PM, Jim Jagielski wrote:
> >> I am trying to align the list of initial committers with
> >> the list of current/active contributors, according to
> >> Github, and I am seeing people proposed who have not
> >> contributed anything and people NOT proposed who seem
> >> to be kinda active...
> >>
> >> Sooo. -0
Re: [VOTE] Accept CarbonData into the Apache Incubator
Thx for the feedback...

I change my vote to +1 (binding)

> On May 27, 2016, at 1:46 AM, Jean-Baptiste Onofré wrote:
>
> Hi Jim,
>
> good point. Let me try to explain this "gap" regarding my discussion with the team:
>
> 1. Some people have been involved mostly in architecture and design more directly in code. That's why they are part of the initial committer list, whereas they didn't really provide "visible" code on github.
>
> 2. Some people are no more involved in the project. That's why they don't appear on the initial committer list.
>
> Regards
> JB
>
> On 05/26/2016 05:45 PM, Jim Jagielski wrote:
>> I am trying to align the list of initial committers with
>> the list of current/active contributors, according to
>> Github, and I am seeing people proposed who have not
>> contributed anything and people NOT proposed who seem
>> to be kinda active...
>>
>> Sooo. -0
Re: [VOTE] Accept CarbonData into the Apache Incubator
+1 (binding)

On Wed, May 25, 2016 at 10:24 PM, Jean-Baptiste Onofré wrote:

> Hi all,
>
> following the discussion thread, I'm now calling a vote to accept
> CarbonData into the Incubator.
>
> [ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
> This vote is open for 72 hours.
>
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
>
> Thanks !
> Regards
> JB
>
> = Apache CarbonData =
>
> == Abstract ==
>
> Apache CarbonData is a new Apache Hadoop native file format for faster
> interactive query. It uses advanced columnar storage, index, compression
> and encoding techniques to improve computing efficiency, which in turn
> helps speed up queries by an order of magnitude over petabytes of data.
>
> CarbonData github address: https://github.com/HuaweiBigData/carbondata
>
> == Background ==
>
> Huawei is an ICT solution provider. We are committed to enhancing customer
> experiences for telecom carriers, enterprises, and consumers on big data.
> In order to satisfy the following customer requirements, we created a new
> Hadoop native file format:
>
>  * Support interactive OLAP-style queries over big data in seconds.
>  * Support fast queries on individual records, which require touching all
>    fields.
>  * Fast data loading, with support for incremental loads in periods of
>    minutes.
>  * Support HDFS so that customers can leverage their existing Hadoop cluster.
>  * Support time-based data retention.
>
> Based on these requirements, we investigated existing file formats in the
> Hadoop ecosystem, but we could not find a suitable solution that satisfies
> all of the requirements at the same time, so we started designing
> CarbonData.
>
> == Rationale ==
>
> CarbonData contains multiple modules, which are classified into two
> categories:
>
>  1. CarbonData File Format: contains the core implementation of the file
>     format, such as columnar storage, index, dictionary, encoding and
>     compression, and the API for reading/writing.
>  2. CarbonData integration with big data processing frameworks such as
>     Apache Spark and Apache Hive. Apache Beam is also planned to abstract
>     the execution runtime.
>
> === CarbonData File Format ===
>
> The CarbonData file format is a columnar store in HDFS. It has many of the
> features of a modern columnar format, such as splittable files, a
> compression schema and complex data types. CarbonData also has the
> following unique features:
>
> Indexing
>
> In order to support fast interactive query, CarbonData leverages indexing
> technology to reduce I/O scans. CarbonData files store data along with the
> index: the index is not stored separately, the CarbonData file itself
> contains it. In the current implementation, CarbonData supports 3 types
> of indexing:
>
> 1. Multi-dimensional Key (B+ Tree index)
>    The data blocks are written in sequence to the disk, and within each
>    data block each column block is written in sequence. Finally, the
>    metadata block for the file is written with information about the byte
>    positions of each block in the file, the Min-Max statistics index, and
>    the start and end MDK of each data block. Since the entire data in the
>    file is in sorted order, the start and end MDK of each data block can be
>    used to construct a B+Tree, and the file can be logically represented as
>    a B+Tree with the data blocks as leaf nodes (on disk) and the remaining
>    non-leaf nodes in memory.
> 2. Inverted index
>    Inverted indexes are widely used in search engines. Using this index
>    helps the processing/query engine do filtering inside one HDFS block.
>    Furthermore, combining bitmaps and the inverted index at query time
>    makes it possible to accelerate count-distinct-like operations.
> 3. MinMax index
>    A minmax index is created for all columns so that the processing/query
>    engine can skip scans that are not required.
>
> Global Dictionary
>
> Besides I/O reduction, CarbonData accelerates computation by using a global
> dictionary, which enables processing/query engines to perform all
> processing on encoded data without having to convert the data (Late
> Materialization). We have observed dramatic performance improvement for
> OLAP analytic scenarios where the table contains many columns of string
> data type. The data is converted back to user-readable form just before
> the processing/query engine returns results to the user.
>
> Column Group
>
> Sometimes users want to perform processing/query on multiple columns in one
> table, for example, scanning for an individual record in a
> troubleshooting scenario. In this case, row format is more efficient than
> columnar format since all columns will be touched by the workload. To
> accelerate this, CarbonData supports storing a group of columns in row
> format, so data in a column group is stored together, which enables fast
> retrieval.
>
> Optimized for multiple use cases
>
> CarbonData indices and dictionary are highly configurable.
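To make the MinMax index from the quoted proposal concrete, here is a minimal Python sketch of block skipping. It is an illustration under stated assumptions, not CarbonData's implementation: the `Block` class, the `build_blocks` and `scan_equals` helpers, and the block size of 4 are hypothetical names and values chosen for this example only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical data block: one column's values plus its min/max statistics."""
    values: List[int]
    min_value: int
    max_value: int


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks and record min/max for each block."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_equals(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Return (matching values, number of blocks that actually had to be scanned)."""
    matches: List[int] = []
    scanned = 0
    for block in blocks:
        # Skip the block when the target cannot lie inside its [min, max] range.
        if target < block.min_value or target > block.max_value:
            continue
        scanned += 1
        matches.extend(v for v in block.values if v == target)
    return matches, scanned


if __name__ == "__main__":
    # Assumption for illustration: the column is sorted before being split.
    column = sorted([7, 3, 9, 1, 12, 15, 4, 8, 20, 18, 2, 11])
    blocks = build_blocks(column, block_size=4)
    matches, scanned = scan_equals(blocks, target=9)
    print(matches, scanned, len(blocks))  # prints: [9] 1 3
```

Because the sample data is sorted before it is split, the per-block ranges do not overlap, so an equality filter touches at most one block here. On unsorted data the same check still works, but more blocks would typically survive the pruning step.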
Re: [VOTE] Accept CarbonData into the Apache Incubator
+1

Thanks,
Madhawa

On Fri, May 27, 2016 at 11:16 AM, Jean-Baptiste Onofré wrote:

> Hi Jim,
>
> good point. Let me try to explain this "gap", based on my discussion with
> the team:
>
> 1. Some people have been involved mostly in architecture and design rather
> than directly in code. That's why they are part of the initial committer
> list, even though they didn't really provide "visible" code on GitHub.
>
> 2. Some people are no longer involved in the project. That's why they
> don't appear on the initial committer list.
>
> Regards
> JB
>
> On 05/26/2016 05:45 PM, Jim Jagielski wrote:
>
>> I am trying to align the list of initial committers with
>> the list of current/active contributors, according to
>> Github, and I am seeing people proposed who have not
>> contributed anything and people NOT proposed who seem
>> to be kinda active...
>>
>> Sooo. -0
Re: [VOTE] Accept CarbonData into the Apache Incubator
Hi Jim,

good point. Let me try to explain this "gap", based on my discussion with the team:

1. Some people have been involved mostly in architecture and design rather than directly in code. That's why they are part of the initial committer list, even though they didn't really provide "visible" code on GitHub.

2. Some people are no longer involved in the project. That's why they don't appear on the initial committer list.

Regards
JB

On 05/26/2016 05:45 PM, Jim Jagielski wrote:

> I am trying to align the list of initial committers with
> the list of current/active contributors, according to
> Github, and I am seeing people proposed who have not
> contributed anything and people NOT proposed who seem
> to be kinda active...
>
> Sooo. -0
RE: [VOTE] Accept CarbonData into the Apache Incubator
+1 (non-binding)

Regards,
Kai

-----Original Message-----
From: Gangumalla, Uma [mailto:uma.ganguma...@intel.com]
Sent: Friday, May 27, 2016 1:10 AM
To: general@incubator.apache.org
Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator

+1 (binding)

Regards,
Uma

On 5/25/16, 1:24 PM, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:

>Hi all,
>
>following the discussion thread, I'm now calling a vote to accept
>CarbonData into the Incubator.
>
>[ ] +1 Accept CarbonData into the Apache Incubator
>[ ] +0 Abstain
>[ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
>This vote is open for 72 hours.
>
>The proposal follows, you can also access the wiki page:
>https://wiki.apache.org/incubator/CarbonDataProposal
>
>Thanks !
>Regards
>JB
Re: [VOTE] Accept CarbonData into the Apache Incubator
+1 (binding)

Regards,
Uma

On 5/25/16, 1:24 PM, "Jean-Baptiste Onofré" wrote:

>Hi all,
>
>following the discussion thread, I'm now calling a vote to accept
>CarbonData into the Incubator.
>
>[ ] +1 Accept CarbonData into the Apache Incubator
>[ ] +0 Abstain
>[ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
>This vote is open for 72 hours.
>
>The proposal follows, you can also access the wiki page:
>https://wiki.apache.org/incubator/CarbonDataProposal
>
>Thanks !
>Regards
>JB
Re: [VOTE] Accept CarbonData into the Apache Incubator
I am trying to align the list of initial committers with
the list of current/active contributors, according to
Github, and I am seeing people proposed who have not
contributed anything and people NOT proposed who seem
to be kinda active...

Sooo. -0

> On May 25, 2016, at 4:24 PM, Jean-Baptiste Onofré wrote:
>
> Hi all,
>
> following the discussion thread, I'm now calling a vote to accept CarbonData
> into the Incubator.
>
> [ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
> This vote is open for 72 hours.
>
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
>
> Thanks !
> Regards
> JB
Re: [VOTE] Accept CarbonData into the Apache Incubator
+1

-David (jonesde@a.o)

> On 25 May 2016, at 13:24, Jean-Baptiste Onofré wrote:
>
> Hi all,
>
> following the discussion thread, I'm now calling a vote to accept CarbonData
> into the Incubator.
>
> [ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
> This vote is open for 72 hours.
>
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
>
> Thanks !
> Regards
> JB
Re: [VOTE] Accept CarbonData into the Apache Incubator
+1 (binding)

-Jake
Re: [VOTE] Accept CarbonData into the Apache Incubator
+1 (binding)

Best Regards!

- Luke Han

On Wed, May 25, 2016 at 9:44 PM, Wang, Gang1 <gang1.w...@intel.com> wrote:

> +1 (no-binding)
>
> Best Regards
> +Gary.
>
> -Original Message-
> From: Cheng, Hao [mailto:hao.ch...@intel.com]
> Sent: Wednesday, May 25, 2016 7:09 PM
> To: general@incubator.apache.org
> Subject: RE: [VOTE] Accept CarbonData into the Apache Incubator
>
> +1
>
> -Original Message-
> From: Jacques Nadeau [mailto:jacq...@apache.org]
> Sent: Thursday, May 26, 2016 8:26 AM
> To: general@incubator.apache.org
> Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator
>
> +1 (binding)
>
> On Wed, May 25, 2016 at 4:04 PM, John D. Ament <johndam...@apache.org> wrote:
>
> > +1
RE: [VOTE] Accept CarbonData into the Apache Incubator
+1 (no-binding)

Best Regards
+Gary.

-Original Message-
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Wednesday, May 25, 2016 7:09 PM
To: general@incubator.apache.org
Subject: RE: [VOTE] Accept CarbonData into the Apache Incubator

+1

-Original Message-
From: Jacques Nadeau [mailto:jacq...@apache.org]
Sent: Thursday, May 26, 2016 8:26 AM
To: general@incubator.apache.org
Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator

+1 (binding)

On Wed, May 25, 2016 at 4:04 PM, John D. Ament <johndam...@apache.org> wrote:

> +1
RE: [VOTE] Accept CarbonData into the Apache Incubator
+1

-Original Message-
From: Jacques Nadeau [mailto:jacq...@apache.org]
Sent: Thursday, May 26, 2016 8:26 AM
To: general@incubator.apache.org
Subject: Re: [VOTE] Accept CarbonData into the Apache Incubator

+1 (binding)

On Wed, May 25, 2016 at 4:04 PM, John D. Ament <johndam...@apache.org> wrote:

> +1
Re: [VOTE] Accept CarbonData into the Apache Incubator
+1 (binding)

On Wed, May 25, 2016 at 4:04 PM, John D. Ament wrote:

> +1
Re: [VOTE] Accept CarbonData into the Apache Incubator
+1
Re: [VOTE] Accept CarbonData into the Apache Incubator
+1 (binding)
Re: [VOTE] Accept CarbonData into the Apache Incubator
+1

Julian

> On May 25, 2016, at 1:24 PM, Jean-Baptiste Onofré wrote:
>
> Hi all,
>
> following the discussion thread, I'm now calling a vote to accept
> CarbonData into the Incubator.
>
> [ ] +1 Accept CarbonData into the Apache Incubator
> [ ] +0 Abstain
> [ ] -1 Do not accept CarbonData into the Apache Incubator, because ...
>
> This vote is open for 72 hours.
>
> The proposal follows, you can also access the wiki page:
> https://wiki.apache.org/incubator/CarbonDataProposal
>
> Thanks !
> Regards
> JB
>
> = Apache CarbonData =
>
> == Abstract ==
>
> Apache CarbonData is a new Apache Hadoop native file format for faster
> interactive query using advanced columnar storage, index, compression and
> encoding techniques to improve computing efficiency, in turn it will help
> speedup queries an order of magnitude faster over PetaBytes of data.
>
> CarbonData github address: https://github.com/HuaweiBigData/carbondata
>
> == Background ==
>
> Huawei is an ICT solution provider, we are committed to enhancing customer
> experiences for telecom carriers, enterprises, and consumers on big data,
> In order to satisfy the following customer requirements, we created a new
> Hadoop native file format:
>
> * Support interactive OLAP-style query over big data in seconds.
> * Support fast query on individual record which require touching all fields.
> * Fast data loading speed and support incremental load in period of minutes.
> * Support HDFS so that customer can leverage existing Hadoop cluster.
> * Support time based data retention.
>
> Based on these requirements, we investigated existing file formats in the
> Hadoop eco-system, but we could not find a suitable solution that
> satisfying requirements all at the same time, so we start designing
> CarbonData.
>
> == Rationale ==
>
> CarbonData contains multiple modules, which are classified into two
> categories:
>
> 1. CarbonData File Format: which contains core implementation for file
> format such as columnar,index,dictionary,encoding+compression,API for
> reading/writing etc.
> 2. CarbonData integration with big data processing framework such as
> Apache Spark, Apache Hive etc. Apache Beam is also planned to abstract the
> execution runtime.
>
> === CarbonData File Format ===
>
> CarbonData file format is a columnar store in HDFS, it has many features
> that a modern columnar format has, such as splittable, compression schema
> ,complex data type etc. And CarbonData has following unique features:
>
> Indexing
>
> In order to support fast interactive query, CarbonData leverage indexing
> technology to reduce I/O scans. CarbonData files stores data along with
> index, the index is not stored separately but the CarbonData file itself
> contains the index. In current implementation, CarbonData supports 3 types
> of indexing:
>
> 1. Multi-dimensional Key (B+ Tree index)
> The Data block are written in sequence to the disk and within each data
> blocks each column block is written in sequence. Finally, the metadata
> block for the file is written with information about byte positions of
> each block in the file, Min-Max statistics index and the start and end MDK
> of each data block. Since, the entire data in the file is in sorted order,
> the start and end MDK of each data block can be used to construct a B+Tree
> and the file can be logically represented as a B+Tree with the data blocks
> as leaf nodes (on disk) and the remaining non-leaf nodes in memory.
> 2. Inverted index
> Inverted index is widely used in search engine. By using this index, it
> helps processing/query engine to do filtering inside one HDFS block.
> Furthermore, query acceleration for count distinct like operation is made
> possible when combining bitmap and inverted index in query time.
> 3. MinMax index
> For all columns, minmax index is created so that processing/query engine
> can skip scan that is not required.
>
> Global Dictionary
>
> Besides I/O reduction, CarbonData accelerates computation by using global
> dictionary, which enables processing/query engines to perform all
> processing on encoded data without having to convert the data (Late
> Materialization). We have observed dramatic performance improvement for
> OLAP analytic scenario where table contains many columns in string data
> type. The data is converted back to the user readable form just before
> processing/query engine returning results to user.
>
> Column Group
>
> Sometimes users want to perform processing/query on multi-columns in one
> table, for example, performing scan for individual record in
> troubleshooting scenario. In this case, row format is more efficient than
> columnar format since all columns will be touched by the workload. To
> accelerate this, CarbonData supports storing a group of column in row
> format, so data in column group is stored together and enable fast
> retrieval.
[VOTE] Accept CarbonData into the Apache Incubator
Hi all,

following the discussion thread, I'm now calling a vote to accept CarbonData into the Incubator.

[ ] +1 Accept CarbonData into the Apache Incubator
[ ] +0 Abstain
[ ] -1 Do not accept CarbonData into the Apache Incubator, because ...

This vote is open for 72 hours.

The proposal follows; you can also access the wiki page:
https://wiki.apache.org/incubator/CarbonDataProposal

Thanks!
Regards
JB

= Apache CarbonData =

== Abstract ==

Apache CarbonData is a new Apache Hadoop native file format for faster interactive queries. It uses advanced columnar storage, indexing, compression and encoding techniques to improve computing efficiency, which in turn helps speed up queries by an order of magnitude over petabytes of data.

CarbonData GitHub address: https://github.com/HuaweiBigData/carbondata

== Background ==

Huawei is an ICT solution provider. We are committed to enhancing customer experiences for telecom carriers, enterprises, and consumers on big data. To satisfy the following customer requirements, we created a new Hadoop native file format:

* Support interactive OLAP-style queries over big data in seconds.
* Support fast queries on individual records, which require touching all fields.
* Fast data loading, with support for incremental loads at intervals of minutes.
* Support HDFS so that customers can leverage their existing Hadoop clusters.
* Support time-based data retention.

Based on these requirements, we investigated existing file formats in the Hadoop ecosystem, but we could not find a suitable solution that satisfies all of the requirements at the same time, so we started designing CarbonData.

== Rationale ==

CarbonData contains multiple modules, which are classified into two categories:

1. CarbonData File Format: contains the core implementation of the file format, such as columnar storage, index, dictionary, encoding and compression, and the API for reading/writing.
2. CarbonData integration with big data processing frameworks such as Apache Spark and Apache Hive. Apache Beam is also planned to abstract the execution runtime.

=== CarbonData File Format ===

The CarbonData file format is a columnar store in HDFS. It has many of the features of a modern columnar format, such as being splittable, compression schema and complex data types. In addition, CarbonData has the following unique features:

Indexing

To support fast interactive queries, CarbonData leverages indexing technology to reduce I/O scans. CarbonData files store data along with the index: the index is not stored separately, but is contained in the CarbonData file itself. In the current implementation, CarbonData supports 3 types of indexing:

1. Multi-dimensional Key (B+ Tree index)
Data blocks are written to disk in sequence, and within each data block each column block is written in sequence. Finally, the metadata block for the file is written with the byte positions of each block in the file, the Min-Max statistics index, and the start and end multi-dimensional key (MDK) of each data block. Since the entire data in the file is in sorted order, the start and end MDK of each data block can be used to construct a B+Tree, and the file can be logically represented as a B+Tree with the data blocks as leaf nodes (on disk) and the remaining non-leaf nodes in memory.
2. Inverted index
Inverted indexes are widely used in search engines. This index helps the processing/query engine filter inside one HDFS block. Furthermore, count-distinct-like operations can be accelerated by combining bitmaps with the inverted index at query time.
3. MinMax index
A minmax index is created for all columns so that the processing/query engine can skip scans that are not required.

Global Dictionary

Besides I/O reduction, CarbonData accelerates computation by using a global dictionary, which enables processing/query engines to perform all processing on encoded data without having to convert the data (Late Materialization).
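
As a rough illustration of how a global dictionary and a MinMax index can work together, here is a minimal Python sketch. It is not CarbonData's on-disk format or API: the names `build_global_dictionary`, `Block`, `write_blocks` and `group_count` are hypothetical, and the sketch assumes a sorted dictionary so that integer codes preserve the order of the original strings.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


def build_global_dictionary(values: List[str]) -> Tuple[Dict[str, int], List[str]]:
    """Map each distinct string to an integer code (hypothetical helper).

    The dictionary is sorted, so comparing codes is equivalent to
    comparing the original strings.
    """
    decode = sorted(set(values))
    encode = {value: code for code, value in enumerate(decode)}
    return encode, decode


@dataclass
class Block:
    codes: List[int]  # dictionary-encoded column values
    min_code: int     # MinMax statistics kept alongside the block
    max_code: int


def write_blocks(values: List[str], encode: Dict[str, int], block_size: int) -> List[Block]:
    """Encode the column and split it into fixed-size blocks with MinMax stats."""
    blocks = []
    for start in range(0, len(values), block_size):
        codes = [encode[v] for v in values[start:start + block_size]]
        blocks.append(Block(codes, min(codes), max(codes)))
    return blocks


def group_count(blocks: List[Block], lo_code: int, hi_code: int) -> Tuple[Dict[int, int], int]:
    """Count rows per code in [lo_code, hi_code], working only on encoded data.

    Returns the per-code counts and the number of blocks actually scanned.
    """
    counts: Dict[int, int] = {}
    scanned = 0
    for block in blocks:
        if block.max_code < lo_code or block.min_code > hi_code:
            continue  # MinMax index: no row in this block can match
        scanned += 1
        for code in block.codes:
            if lo_code <= code <= hi_code:
                counts[code] = counts.get(code, 0) + 1
    return counts, scanned


cities = ["berlin", "berlin", "cairo", "cairo", "delhi", "delhi", "delhi", "zurich"]
encode, decode = build_global_dictionary(cities)
blocks = write_blocks(cities, encode, block_size=2)

# The filter city == "delhi" is translated to a code once, then evaluated on integers.
wanted = encode["delhi"]
counts, scanned = group_count(blocks, wanted, wanted)

# Late materialization: decode only the final result rows.
result = {decode[code]: n for code, n in counts.items()}
print(result)   # {'delhi': 3}
print(scanned)  # 2 of the 4 blocks were scanned
```

In this toy data set the strings are only touched twice: once to encode the filter value and once to decode the single result row. Everything in between runs on small integers.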
We have observed dramatic performance improvements in OLAP analytic scenarios where tables contain many string columns. The data is converted back to user-readable form just before the processing/query engine returns results to the user.

Column Group

Sometimes users want to run processing/queries on multiple columns of one table, for example scanning for an individual record in a troubleshooting scenario. In this case, row format is more efficient than columnar format, since the workload touches all columns. To accelerate this, CarbonData supports storing a group of columns in row format, so that the data in a column group is stored together, enabling fast retrieval.

Optimized for multiple use cases

CarbonData indices and dictionary are highly configurable. To optimize storage for different use cases, users can configure what to index, so they can decide on and tune the format before loading data into CarbonData.

For example

|| Use Case