Re:Re: Re: [DISCUSS] Columnar storage engine for Apache Kylin
Hi ShaoFeng,

Very good questions; please see my comments, which start with [Gang]:

1) How to bridge the real-time cube with a cube built from Hive? You know, in Kylin the source type is marked at the table level, which means a table is either a Hive table, a JDBC table or a streaming table. To implement the lambda architecture, how do we combine the batch cube with the real-time cube (on the same table)? This does not seem to be mentioned in the design doc.

[Gang] >> There is a sourceType field in TableDesc that indicates the source type. I just added new types for tables that have more than one source; for example, ID_KAFKA_HIVE=21 means the table source can be both Kafka and Hive.

2) How does it work together with the as-is NRT (near real-time) solution introduced in v1.6? Many users are building cubes directly from Kafka, though in mini or micro batches. Can the new streaming solution work together with the NRT cube? E.g., if I don't need to do ETL in Hive, can I use the batch job to fetch data from Kafka, and use the streaming real-time receivers together?

[Gang] >> The new streaming solution is totally new and works separately from the current streaming solution. There is no conflict with the NRT solution, so they can run side by side on the same Kylin platform, but currently they cannot work together in the way you describe.

3) Does the "Build engine" of the real-time solution follow the plug-in architecture, so that it can support non-HBase storage? As you know, we're implementing the parquet storage. Can this solution support other storages without much rework?

[Gang] >> Yes, the "Build engine" follows the plug-in architecture, so it is easy to support non-HBase storage. At eBay we only use InMemCubing, so currently only the InMemCubing algorithm is available, but I think it is easy to extend it to support LayerCubing.

At 2018-09-29 15:08:42, "ShaoFeng Shi" wrote:
>Hi Gang, very good questions, that's why we need to raise such a discussion
>publicly.
>Please check my comments below, which start with [shaofengshi]. Feel free
>to comment.
>
>1. Is it possible to locate a cuboid quickly in a parquet file? How to save
>cuboid metadata info in the parquet's FileMetaData, just in the metadata's
>key/value pair?
>
>[shaofengshi]: There are a couple of ways to achieve this.
>
>A simple way is to organize different cuboids into different files. The
>cuboid ID can be used as a subfolder name (if the number of cuboids is not
>that big) or as the file prefix; this solution may produce many small
>files when the cube is small. Another solution is to combine multiple
>cuboids' data in one file (sharding). In this case, the cuboid ID can be a
>column.
>
>2. I notice that there is a schema field in parquet's FileMetaData, but in a
>cube, different cuboids have different schemas, so do we just save the basic
>cuboid schema in the schema field? Will this cause storage waste?
>
>[shaofengshi]: To keep it simple, we can use the same schema for all
>cuboids. There will be metadata overhead as some columns are empty, but it
>is minor compared with the data size. If using different schemas, the code
>would be a little complicated. This needs some verification I think; it is
>not determined yet.
>
>3. Can parquet be extended easily to add an index, like a bitmap index or
>a B-tree index for each column?
>
>[shaofengshi]: I think it is extensible. Parquet is adding the column page
>index, which is a lightweight index. We could follow that approach to
>implement another type of index page, but that involves many changes and
>should be done very carefully. Or, we can store the index separately in
>another fast storage. Actually, I'm not sure whether the bitmap index is
>worth building because Kylin already builds many cuboids. For large
>cuboids, we can sort the data by the high cardinality columns; in that
>case, the file/rowgroup/page level min/max indices might be enough for
>filtering.
>For small cuboids (no high cardinality columns), as the file is small,
>parquet's dictionary filtering should be good enough. I believe the cube
>planner will play an important role in finding out which cuboids are worth
>calculating. This is a difference between Kylin and other engines: we can
>do more things in the build phase.
>
>4. Do we need to build an RPC server? If we just use YARN to schedule Spark
>tasks to do queries, starting/stopping the JVM may take seconds, so most
>queries will be slower than with HBase. Of course, it is more scalable, and
>some queries may be faster.
>
>[shaofengshi]: We can have a long-running Spark application acting as the
>query engine; when Kylin receives a query, it is submitted to Spark without
>startup overhead. Kyligence has this in their product so there is no risk
>on this.
>
>
>1. Use a customized columnar format. It is more flexible: we can add Kylin
>specific concepts to the storage, like a cuboid, etc. It will also be easy
>to add different types of index as we need. The disadvantage is that it
>needs more effort to define the format and develop it (cannot leverage
>existing lib to
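
A small sketch may help with the answer to question 3 above, where sorting a large cuboid by a high-cardinality column is said to make min/max statistics sufficient for filtering. It is a plain-Python model only, not the Parquet API: `RowGroup`, `build_row_groups` and `scan_equal` are hypothetical names, and the data is made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Row = Tuple[int, int]  # (user_id, measure); user_id is the high-cardinality column


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group that carries min/max statistics."""
    rows: List[Row]
    min_key: int
    max_key: int


def build_row_groups(rows: List[Row], group_size: int) -> List[RowGroup]:
    """Sort rows by the high-cardinality key, then cut them into row groups."""
    ordered = sorted(rows, key=lambda r: r[0])
    groups = []
    for start in range(0, len(ordered), group_size):
        chunk = ordered[start:start + group_size]
        groups.append(RowGroup(chunk, chunk[0][0], chunk[-1][0]))
    return groups


def scan_equal(groups: List[RowGroup], key: int):
    """Return the matching rows and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if key < group.min_key or key > group.max_key:
            continue  # pruned by min/max statistics; its rows are never decoded
        groups_read += 1
        matches.extend(r for r in group.rows if r[0] == key)
    return matches, groups_read


rows = [(user_id, user_id * 10) for user_id in range(1000)]
groups = build_row_groups(rows, group_size=100)
matches, groups_read = scan_equal(groups, key=742)
print(len(groups), groups_read, matches)
```

Because the data is sorted on the filter column, the key ranges of the row groups do not overlap, so a point lookup reads one group out of ten. On unsorted data every group's min/max range would likely cover the key and nothing would be pruned.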
Re: [DISCUSS] Columnar storage engine for Apache Kylin
You are welcome, ShaoFeng! Storage and query engine are inseparable and should be designed together to take full advantage of each other's abilities. And I'm very excited about the upcoming columnar storage and query engine!

--
Regards!
Aron Tao

ShaoFeng Shi wrote on Fri, Oct 26, 2018 at 10:28 PM:
> Exactly; Thank you jiatao for the comments!
Re: [DISCUSS] Columnar storage engine for Apache Kylin
Exactly; thank you, JiaTao, for the comments!

JiaTao Tao wrote on Thu, Oct 25, 2018 at 6:12 PM:
> As far as I'm concerned, using Parquet as Kylin's storage format is pretty
> appropriate. From the aspect of integrating Spark, Spark made a lot of
> optimizations for Parquet, e.g. We can enjoy Spark's vectorized reading and
> lazy dict decoding, etc.
Re: [DISCUSS] Columnar storage engine for Apache Kylin
As far as I'm concerned, using Parquet as Kylin's storage format is quite appropriate. From the perspective of Spark integration, Spark has made a lot of optimizations for Parquet; e.g., we can enjoy Spark's vectorized reading, lazy dict decoding, etc.

And here are my thoughts about integrating Spark and our query engine. As Shaofeng mentioned, a cuboid is a Parquet file; you can think of it as a small table, and we can read this cuboid as a DataFrame directly, which can then be queried by Spark, a bit like this:

ss.read.parquet("path/to/CuboidFile").filter("xxx").agg("xxx").select("xxx")

(We need to implement some of Kylin's advanced aggregations; for Kylin's basic aggregations like sum/min/max, we can use Spark's directly.)

*Compared to our old query engine, the advantages are as follows:*

1. It is distributed! Our old query engine pulls all data into a single query node and then calculates; it is a single point of failure and often leads to OOM with a huge amount of data.

2. It is simple and easy to debug (every step is clear and transparent). You can collect data after every single phase (filter/aggregation/projection, etc.), so you can easily check which operation/phase went wrong. Our old query engine uses Calcite for post-calculation; it is difficult to pinpoint problems there, especially those related to code generation, and you cannot insert your own logic during computation.

3. We can fully enjoy all the efforts Spark has made to optimize performance, e.g. Catalyst/Tungsten, etc.

4. It is easy to unit test: you can test every step separately, which reduces the testing granularity of Kylin's query engine.

5. Thanks to Spark's DataSource API, we can change Parquet to other data formats easily.

6. A lot of upstream tools for Spark, such as many machine learning tools, can be directly integrated with us.

== ==

Hi Kylin developers.
HBase has been Kylin's storage engine since the first day; Kylin on HBase has been verified as a success, supporting low-latency, high-concurrency queries on a very large data scale. Thanks to HBase, most Kylin users get query responses in less than 1 second on average.

But we also see some limitations when putting Cubes into HBase; I shared some of them at HBaseConf Asia 2018[1] this August. The typical limitations include:

- Rowkey is the primary index; no secondary index so far.

  Filtering by the row key's prefix and filtering by its suffix can give very different performance. So the user needs to design the row key well; otherwise, queries will be slow. This is sometimes difficult because the user might not be able to predict the filtering patterns ahead of cube design.

- HBase is a key-value store instead of a columnar storage.

  Kylin combines multiple measures (columns) into fewer column families for smaller data size (the row key size is considerable). This often causes HBase to read more data than requested.

- HBase can't run on YARN.

  This makes deployment and auto-scaling a little complicated, especially in the cloud.

In a word, HBase is complicated to use as Kylin's storage. Maintenance and debugging are also hard for ordinary developers. Now we're planning to seek a simple, lightweight, read-only storage engine for Kylin. The new solution should have the following characteristics:

- Columnar layout with compression for efficient I/O;
- Index by each column for quick filtering and seeking;
- MapReduce / Spark API for parallel processing;
- HDFS compliant for scalability and availability;
- Mature, stable and extensible;

With the plugin architecture[2] introduced in Kylin 1.5, adding multiple storages to Kylin is possible. Some companies, like Kyligence Inc and Meituan.com, have developed their own customized storage engines for Kylin in their products or platforms. In their experience, columnar storage is a good supplement for the HBase engine.
Kaisen Kang from Meituan.com shared their KOD (Kylin on Druid) solution[3] at this August's Kylin meetup in Beijing.

We plan to do a PoC with Apache Parquet + Apache Spark in the next phase. Parquet is a standard columnar file format and is widely supported by many projects like Hive, Impala, Drill, etc. Parquet is adding the page-level column index to support fine-grained filtering. Apache Spark can provide parallel computing over Parquet and can be deployed on YARN/Mesos and Kubernetes. With this combination, data persistence and computation are separated, which makes scaling in/out much easier than before. Benefiting from Spark's flexibility, we can not only push down more
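
The row key limitation described in the proposal above (prefix filters and suffix filters performing very differently) can be illustrated with a small sketch. This is a plain-Python model under stated assumptions, not HBase code: the two-dimension key layout, `filter_by_prefix` and `filter_by_suffix` are hypothetical, and a sorted list stands in for HBase's sorted row keys.

```python
from bisect import bisect_left

# Hypothetical composite row keys (dimension A, dimension B), kept sorted
# the way a key-value store orders its row keys.
row_keys = sorted((a, b) for a in range(100) for b in range(100))


def filter_by_prefix(keys, a):
    """Filter on the leading dimension: matches form one contiguous range."""
    start = bisect_left(keys, (a,))     # first key whose leading part is >= a
    end = bisect_left(keys, (a + 1,))   # first key whose leading part is >= a + 1
    # Second value: keys read from the range (ignoring the few binary-search probes).
    return keys[start:end], end - start


def filter_by_suffix(keys, b):
    """Filter on the trailing dimension: matches are scattered, so scan every key."""
    matches = [k for k in keys if k[1] == b]
    return matches, len(keys)


prefix_matches, prefix_read = filter_by_prefix(row_keys, 42)
suffix_matches, suffix_read = filter_by_suffix(row_keys, 42)
print(len(prefix_matches), prefix_read)
print(len(suffix_matches), suffix_read)
```

Both filters return the same number of rows here, but the suffix filter has to read every key to find them. A per-column index or a columnar layout with statistics is one way to avoid making the result depend on the position of a dimension in the key.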
Re: [DISCUSS] Columnar storage engine for Apache Kylin
Hi guys, I uploaded the initial design document to JIRA; please feel free to comment:
https://issues.apache.org/jira/browse/KYLIN-3621

ShaoFeng Shi wrote on Fri, Oct 12, 2018 at 9:44 AM:
> JIRA and sub-tasks are created for this. Welcome to comment there:
> https://issues.apache.org/jira/browse/KYLIN-3621
Re: [DISCUSS] Columnar storage engine for Apache Kylin
JIRA and sub-tasks have been created for this. You are welcome to comment there:
https://issues.apache.org/jira/browse/KYLIN-3621

ShaoFeng Shi wrote on Mon, Oct 8, 2018 at 2:45 PM:
> I agree; the new storage should be Hadoop/HDFS compliant, and also need be
> cloud storage (like S3, blob storage) friendly, as more and more users are
> running big data analytics in the cloud.
Re: [DISCUSS] Columnar storage engine for Apache Kylin
I agree; the new storage should be Hadoop/HDFS compliant, and it also needs to be cloud storage (like S3, blob storage) friendly, as more and more users are running big data analytics in the cloud.

Luke Han wrote on Sun, Oct 7, 2018 at 7:44 PM:
> It makes sense to bring a better storage option for Kylin.
>
> On Tue, Oct 2, 2018 at 11:20 AM ShaoFeng Shi wrote:
> > Hi Billy,
> >
> > Yes, cloud storage should be considered. The traditional file layouts
> > on HDFS may not work well on cloud storage. Kylin needs to allow
> > extension here. I will add this to the requirements.
> >
> > Billy Liu wrote on Sat, Sep 29, 2018 at 3:22 PM:
> > > Hi Shaofeng,
> > >
> > > I'd like to add one more characteristic: cloud-native storage support.
> > > Quite a few users are using S3 on AWS, or Azure Data Lake Storage on
> > > Azure. If the new storage engine could be more cloud friendly, more
> > > users could benefit from it.
> > >
> > > With Warm regards
> > >
> > > Billy Liu
> > >
> > > ShaoFeng Shi wrote on Fri, Sep 28, 2018 at 2:15 PM:
> > > > Hi Kylin developers.
> > > >
> > > > HBase has been Kylin's storage engine since the first day; Kylin on
> > > > HBase has been verified as a success which can support low latency &
> > > > high concurrency queries on a very large data scale. Thanks to HBase,
> > > > most Kylin users can get on average less than 1-second query response.
Re: [DISCUSS] Columnar storage engine for Apache Kylin
It makes sense to bring a better storage option to Kylin.

The option should be open, so that people can create adaptors for the underlying storage in different ways. Considering that the many Kylin deployments today all run on Hadoop/HDFS, I prefer Parquet, ORC or another HDFS-compatible option at this time. It will be easy for people to upgrade to the next generation and keep consistency.

Looking forward to this feature being rolled out soon.

Thanks.

Best Regards!

Luke Han

On Wed, Oct 3, 2018 at 2:37 PM Li Yang wrote:

> Love this discussion. I'd like to highlight the 3 major roles HBase is
> currently playing, so we don't miss any of them when looking for a
> replacement.
>
> 1) Storage: a high-speed big data storage
> 2) Cache: a distributed storage cache layer (was BlockCache)
> 3) MPP: a distributed computation framework (was Coprocessor)
>
> The "Storage" seems to be at the center of the discussion. Be it Parquet,
> ORC, or a new file format, to me the standard interface is the most
> important thing. As long as we have consensus on the access interface,
> like MapReduce / Spark Dataset, the rest of the debate can easily be
> resolved by a fair benchmark. It also allows people with different
> preferences to keep their own implementations under the standard
> interface without impacting the rest of Kylin.
>
> The "Cache" and the "MPP" were more or less overlooked. I suggest we pay
> more attention to them. Apart from Spark and Alluxio, are there any other
> alternatives? Actually Druid is a well-rounded choice since, like HBase,
> it covers all 3 roles pretty well.
>
> In general, I prefer choosing from the state of the art over
> re-inventing. Indeed, Kylin is not a storage project. A new storage
> format is not Kylin's mission. Any storage innovations we come across
> here would be more beneficial if contributed to the Parquet or ORC
> community.
>
> Regards
> Yang
Re: [DISCUSS] Columnar storage engine for Apache Kylin
Love this discussion. I'd like to highlight the 3 major roles HBase is currently playing, so we don't miss any of them when looking for a replacement.

1) Storage: a high-speed big data storage
2) Cache: a distributed storage cache layer (was BlockCache)
3) MPP: a distributed computation framework (was Coprocessor)

The "Storage" seems to be at the center of the discussion. Be it Parquet, ORC, or a new file format, to me the standard interface is the most important thing. As long as we have consensus on the access interface, like MapReduce / Spark Dataset, the rest of the debate can easily be resolved by a fair benchmark. It also allows people with different preferences to keep their own implementations under the standard interface without impacting the rest of Kylin.

The "Cache" and the "MPP" were more or less overlooked. I suggest we pay more attention to them. Apart from Spark and Alluxio, are there any other alternatives? Actually Druid is a well-rounded choice since, like HBase, it covers all 3 roles pretty well.

In general, I prefer choosing from the state of the art over re-inventing. Indeed, Kylin is not a storage project. A new storage format is not Kylin's mission. Any storage innovations we come across here would be more beneficial if contributed to the Parquet or ORC community.

Regards
Yang

On Tue, Oct 2, 2018 at 11:20 AM ShaoFeng Shi wrote:

> Hi Billy,
>
> Yes, cloud storage should be considered. The traditional file layouts on
> HDFS may not work well on cloud storage. Kylin needs to allow extension
> here. I will add this to the requirements.
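To make the "standard interface" point in the message above a little more concrete, here is a minimal sketch in Python. It is only an illustration: `CubeStorage`, `RowStorage`, `ColumnStorage` and `total_by_filter` are hypothetical names invented for this example and are not part of Kylin's plugin architecture or any real API. The idea it shows is that callers program against one access interface while different physical layouts sit behind it, so the layouts can be swapped and benchmarked fairly.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List, Optional

Row = Dict[str, object]


class CubeStorage(ABC):
    """Hypothetical access interface (an assumption, not Kylin's real API)."""

    @abstractmethod
    def write(self, rows: Iterable[Row]) -> None:
        """Persist rows; every row is assumed to have the same columns."""

    @abstractmethod
    def scan(self, columns: List[str],
             filters: Optional[Dict[str, object]] = None) -> Iterator[Row]:
        """Yield the requested columns of rows matching equality filters."""


class RowStorage(CubeStorage):
    """Row-oriented layout: each record is stored as a whole."""

    def __init__(self) -> None:
        self._rows: List[Row] = []

    def write(self, rows: Iterable[Row]) -> None:
        self._rows.extend(dict(r) for r in rows)

    def scan(self, columns, filters=None):
        filters = filters or {}
        for row in self._rows:
            if all(row[k] == v for k, v in filters.items()):
                yield {c: row[c] for c in columns}


class ColumnStorage(CubeStorage):
    """Column-oriented layout: one list of values per column."""

    def __init__(self) -> None:
        self._cols: Dict[str, List[object]] = {}
        self._count = 0

    def write(self, rows: Iterable[Row]) -> None:
        for row in rows:
            for name, value in row.items():
                self._cols.setdefault(name, []).append(value)
            self._count += 1

    def scan(self, columns, filters=None):
        filters = filters or {}
        for i in range(self._count):
            # Only the filter columns and the requested columns are touched.
            if all(self._cols[k][i] == v for k, v in filters.items()):
                yield {c: self._cols[c][i] for c in columns}


def total_by_filter(storage: CubeStorage, measure: str,
                    filters: Dict[str, object]):
    """Caller code depends only on the interface, not on the layout."""
    return sum(row[measure] for row in storage.scan([measure], filters))
```

Because `total_by_filter` only sees `CubeStorage`, the same query code could be timed against either implementation, which is the kind of benchmark the message suggests for settling the format debate.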
Re: [DISCUSS] Columnar storage engine for Apache Kylin
Hi Billy,

Yes, cloud storage should be considered. The traditional file layouts on HDFS may not work well on cloud storage. Kylin needs to allow extension here. I will add this to the requirements.

On Sat, Sep 29, 2018 at 3:22 PM Billy Liu wrote:

> Hi Shaofeng,
>
> I'd like to add one more characteristic: cloud-native storage support.
> Quite a few users are using S3 on AWS, or Azure Data Lake Storage on
> Azure. If the new storage engine were more cloud friendly, more users
> could benefit from it.
>
> With Warm regards
>
> Billy Liu

--
Best regards,

Shaofeng Shi 史少锋
Re: [DISCUSS] Columnar storage engine for Apache Kylin
Hi Shaofeng,

I'd like to add one more characteristic: cloud-native storage support. Quite a few users are using S3 on AWS, or Azure Data Lake Storage on Azure. If the new storage engine were more cloud friendly, more users could benefit from it.

With Warm regards

Billy Liu

On Fri, Sep 28, 2018 at 2:15 PM ShaoFeng Shi wrote:

> Hi Kylin developers.
>
> HBase has been Kylin's storage engine since the first day; Kylin on HBase
> has proven successful in supporting low-latency, high-concurrency queries
> at a very large data scale. Thanks to HBase, most Kylin users get query
> responses in less than 1 second on average.
>
> But we also see some limitations when putting Cubes into HBase; I shared
> some of them at HBaseCon Asia 2018[1] this August. The typical
> limitations include:
>
> - Rowkey is the primary index; there is no secondary index so far.
>
> Filtering by a row key's prefix and filtering by its suffix can give very
> different performance. So the user needs to design the row key well;
> otherwise, queries will be slow. This is sometimes difficult because the
> user may not be able to predict the filtering patterns ahead of cube
> design.
>
> - HBase is a key-value store rather than a columnar store.
>
> Kylin combines multiple measures (columns) into fewer column families for
> smaller data size (the row key size is significant). This often causes
> HBase to read more data than requested.
>
> - HBase can't run on YARN.
>
> This makes deployment and auto-scaling a little complicated, especially
> in the cloud.
>
> In short, HBase is a complicated choice for Kylin's storage. Maintenance
> and debugging are also hard for ordinary developers. Now we're planning
> to look for a simple, lightweight, read-only storage engine for Kylin.
> The new solution should have the following characteristics:
>
> - Columnar layout with compression for efficient I/O;
> - Index on each column for quick filtering and seeking;
> - MapReduce / Spark API for parallel processing;
> - HDFS compliant for scalability and availability;
> - Mature, stable and extensible;
>
> With the plugin architecture[2] introduced in Kylin 1.5, adding multiple
> storages to Kylin is possible. Some companies, like Kyligence Inc and
> Meituan.com, have developed customized storage engines for Kylin in their
> products or platforms. In their experience, columnar storage is a good
> supplement to the HBase engine. Kaisen Kang from Meituan.com shared their
> KOD (Kylin on Druid) solution[3] at this August's Kylin meetup in
> Beijing.
>
> We plan to do a PoC with Apache Parquet + Apache Spark in the next phase.
> Parquet is a standard columnar file format and is widely supported by
> many projects like Hive, Impala, Drill, etc. Parquet is adding a
> page-level column index to support fine-grained filtering. Apache Spark
> can provide parallel computing over Parquet and can be deployed on YARN,
> Mesos and Kubernetes. With this combination, data persistence and
> computation are separated, which makes scaling in/out much easier than
> before. Benefiting from Spark's flexibility, we can also push down more
> computation from Kylin to the Hadoop cluster. Besides Parquet, Apache ORC
> is also a candidate.
>
> Now I raise this discussion to get your ideas about Kylin's
> next-generation storage engine. If you have good ideas or any related
> data, you are welcome to discuss them in the community.
>
> Thank you!
>
> [1] Apache Kylin on HBase
> https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
> [2] Apache Kylin Plugin Architecture
> https://kylin.apache.org/development/plugin_arch.html
> [3] 基于Druid的Kylin存储引擎实践 (Practice of a Druid-based storage engine for Kylin)
> https://blog.bcmeng.com/post/kylin-on-druid.html
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
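The row key limitation described in the quoted post can be illustrated with a small, self-contained Python sketch. It is a toy model only, assuming a sorted list of composite keys of the form `dimension1|dimension2`; the helper names (`make_rowkey`, `scan_by_prefix`, `scan_by_suffix`) are invented for this example and nothing here uses the HBase client API. A filter on the leading dimension becomes a short range scan over the sorted keys, while a filter on the trailing dimension has to look at every key.

```python
from bisect import bisect_left
from typing import List, Tuple

SEP = "|"  # separator between dimensions in the toy row key


def make_rowkey(*dims: str) -> str:
    """Concatenate dimension values into one composite row key."""
    return SEP.join(dims)


def scan_by_prefix(sorted_keys: List[str], first_dim: str) -> Tuple[List[str], int]:
    """Filter on the leading dimension: seek, then read a contiguous range.

    Returns (matching keys, number of keys touched).
    """
    prefix = first_dim + SEP
    start = bisect_left(sorted_keys, prefix)  # binary search for range start
    hits: List[str] = []
    touched = 0
    for key in sorted_keys[start:]:
        touched += 1
        if not key.startswith(prefix):
            break  # left the contiguous range, stop reading
        hits.append(key)
    return hits, touched


def scan_by_suffix(sorted_keys: List[str], last_dim: str) -> Tuple[List[str], int]:
    """Filter on the trailing dimension: the sort order does not help,
    so every key has to be examined.

    Returns (matching keys, number of keys touched).
    """
    suffix = SEP + last_dim
    hits = [key for key in sorted_keys if key.endswith(suffix)]
    return hits, len(sorted_keys)


if __name__ == "__main__":
    keys = sorted(
        make_rowkey(country, f"d{day:03d}")
        for country in ("CN", "DE", "US")
        for day in range(1, 101)
    )
    print(scan_by_prefix(keys, "DE")[1], "keys touched for the prefix filter")
    print(scan_by_suffix(keys, "d042")[1], "keys touched for the suffix filter")
```

In this toy data set the prefix filter touches roughly a third of the keys and the suffix filter touches all of them, which is why the choice of dimension order in the row key matters so much when there is no secondary index.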
Re: [DISCUSS] Columnar storage engine for Apache Kylin
Hi Yanghong,

Thanks for your question. I don't think other engines are required to know how to read Kylin's storage, but it would be nice to have if possible. We can extend the file format if Parquet or ORC can't meet Kylin's requirements, but it is not necessary to invent a new format.

On Sat, Sep 29, 2018 at 10:59 AM Zhong, Yanghong wrote:

> I have one question about the characteristics of Kylin columnar storage
> files: should the format be a standard or common one? Since the data
> stored in the storage engine is Kylin-specific, is it necessary for other
> engines to know how to build data into and how to read data from the
> storage engine?
>
> In my opinion, it's not necessary, and Kylin columnar storage files
> should be Kylin-specific. We can leverage the advantages of other
> columnar files, like data skip indexes, bloom filters and dictionaries,
> and then create a new file format with Kylin-specific requirements, like
> cuboid info.
>
> --
> Best regards,
> Yanghong Zhong
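Both messages above refer to data skip indexes as a feature worth reusing, whichever format carries them. The following Python sketch shows the general min/max flavour of that technique under simplifying assumptions: values are integers, the column is split into fixed-size blocks, and the per-block statistics live in memory. `Block`, `build_blocks` and `find_equal` are hypothetical names for this example; this is not the actual index layout of Parquet, ORC or any Kylin storage.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A chunk of one column plus its min/max statistics."""
    values: List[int]
    min_value: int
    max_value: int


def build_blocks(values: List[int], block_size: int) -> List[Block]:
    """Split a column into blocks and record min/max for each block."""
    blocks: List[Block] = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def find_equal(blocks: List[Block], target: int) -> Tuple[int, int]:
    """Count values equal to target.

    Returns (number of matches, number of blocks actually read).
    Blocks whose [min, max] range cannot contain the target are
    skipped using the statistics alone.
    """
    matches = 0
    blocks_read = 0
    for block in blocks:
        if target < block.min_value or target > block.max_value:
            continue  # skipped without reading the block's values
        blocks_read += 1
        matches += sum(1 for v in block.values if v == target)
    return matches, blocks_read
```

The skipping is most effective when the column is sorted or clustered, because each block then covers a narrow value range; on randomly ordered data most blocks would overlap the target and still have to be read.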
Re: [DISCUSS] Columnar storage engine for Apache Kylin
I have one question about the characteristics of Kylin columnar storage files: should the format be a standard or common one? Since the data stored in the storage engine is Kylin-specific, is it necessary for other engines to know how to build data into and how to read data from the storage engine?

In my opinion, it's not necessary, and Kylin columnar storage files should be Kylin-specific. We can leverage the advantages of other columnar files, like data skip indexes, bloom filters and dictionaries, and then create a new file format with Kylin-specific requirements, like cuboid info.

--
Best regards,
Yanghong Zhong

On 9/28/18, 2:15 PM, "ShaoFeng Shi" wrote:

> Hi Kylin developers.
>
> HBase has been Kylin's storage engine since the first day; Kylin on HBase
> has proven successful in supporting low-latency, high-concurrency queries
> at a very large data scale. Thanks to HBase, most Kylin users get query
> responses in less than 1 second on average.
>
> But we also see some limitations when putting Cubes into HBase; I shared
> some of them at HBaseCon Asia 2018[1] this August. The typical
> limitations include:
>
> - Rowkey is the primary index; there is no secondary index so far.
>
> Filtering by a row key's prefix and filtering by its suffix can give very
> different performance. So the user needs to design the row key well;
> otherwise, queries will be slow. This is sometimes difficult because the
> user may not be able to predict the filtering patterns ahead of cube
> design.
>
> - HBase is a key-value store rather than a columnar store.
>
> Kylin combines multiple measures (columns) into fewer column families for
> smaller data size (the row key size is significant). This often causes
> HBase to read more data than requested.
>
> - HBase can't run on YARN.
>
> This makes deployment and auto-scaling a little complicated, especially
> in the cloud.
>
> In short, HBase is a complicated choice for Kylin's storage. Maintenance
> and debugging are also hard for ordinary developers. Now we're planning
> to look for a simple, lightweight, read-only storage engine for Kylin.
> The new solution should have the following characteristics:
>
> - Columnar layout with compression for efficient I/O;
> - Index on each column for quick filtering and seeking;
> - MapReduce / Spark API for parallel processing;
> - HDFS compliant for scalability and availability;
> - Mature, stable and extensible;
>
> With the plugin architecture[2] introduced in Kylin 1.5, adding multiple
> storages to Kylin is possible. Some companies, like Kyligence Inc and
> Meituan.com, have developed customized storage engines for Kylin in their
> products or platforms. In their experience, columnar storage is a good
> supplement to the HBase engine. Kaisen Kang from Meituan.com shared their
> KOD (Kylin on Druid) solution[3] at this August's Kylin meetup in
> Beijing.
>
> We plan to do a PoC with Apache Parquet + Apache Spark in the next phase.
> Parquet is a standard columnar file format and is widely supported by
> many projects like Hive, Impala, Drill, etc. Parquet is adding a
> page-level column index to support fine-grained filtering. Apache Spark
> can provide parallel computing over Parquet and can be deployed on YARN,
> Mesos and Kubernetes. With this combination, data persistence and
> computation are separated, which makes scaling in/out much easier than
> before. Benefiting from Spark's flexibility, we can also push down more
> computation from Kylin to the Hadoop cluster. Besides Parquet, Apache ORC
> is also a candidate.
>
> Now I raise this discussion to get your ideas about Kylin's
> next-generation storage engine. If you have good ideas or any related
> data, you are welcome to discuss them in the community.
>
> Thank you!
>
> [1] Apache Kylin on HBase
> https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
> [2] Apache Kylin Plugin Architecture
> https://kylin.apache.org/development/plugin_arch.html
> [3] 基于Druid的Kylin存储引擎实践 (Practice of a Druid-based storage engine for Kylin)