Re: Re: Re: [DISCUSS] Columnar storage engine for Apache Kylin

2018-10-31 Thread Ma Gang
Hi ShaoFeng,
Very good questions; please see my comments below, starting with [Gang]:
1) How to bridge the real-time cube with a cube built from Hive? You know,
in Kylin the source type is marked at the table level, which means a table
is either a Hive table, a JDBC table, or a streaming table. To implement
the lambda architecture, how do we combine the batch cube with the real-time
cube (for the same table)? This doesn't seem to be mentioned in the design doc.
[Gang] >> There is a sourceType field in TableDesc that indicates the source
type; I just added new types for tables that have more than one source. For
example, ID_KAFKA_HIVE = 21 means the table's source can be both Kafka and Hive.

2) How will it work together with the existing NRT (near real-time) solution
introduced in v1.6? Many users build cubes directly from Kafka, albeit in
mini or micro batches. Can the new streaming solution work together with the
NRT cube? E.g., if I don't need to do ETL in Hive, can I use the batch job
to fetch data from Kafka, and use the real-time streaming receivers together?
[Gang] >> The new streaming solution is completely new and works separately
from the current streaming solution. There is no conflict with the NRT
solution, so they can run together on the same Kylin platform, but currently
they cannot work together in the way you describe.
3) Does the "Build engine" of the real-time solution follow the plug-in
architecture, so that it can support non-HBase storage? As you know, we're
implementing the Parquet storage. Can this solution support other storages
without much rework?
[Gang] >> Yes, the "Build engine" follows the plug-in architecture, so it is
easy to support non-HBase storage. At eBay we only use InMemCubing, so
currently that is the only cubing algorithm implemented, but I think it is
easy to extend it to support LayerCubing.




At 2018-09-29 15:08:42, "ShaoFeng Shi"  wrote:
>Hi Gang, very good questions; that's why we need to raise such a discussion
>publicly. Please check my comments below, starting with [shaofengshi]. Feel
>free to comment.
>
>1. Is it possible to locate a cuboid quickly in a Parquet file? Should we
>save the cuboid metadata in Parquet's FileMetaData, i.e., in the metadata's
>key/value pairs?
>
>[shaofengshi]: There are a couple of ways to achieve this.
>
>A simple way is to organize different cuboids into different files. The
>cuboid ID can be used as a subfolder name (if the number of cuboids is not
>too large), or as a file-name prefix; this solution may produce many small
>files when the cube is small. Another solution is to combine multiple
>cuboids' data in one file (sharding). In that case, the cuboid ID can be a
>column.
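The two layouts can be sketched as follows; the paths and helper names are hypothetical, for illustration only:

```java
public class CuboidLayout {
    // Layout (a): one file (or folder) per cuboid; the cuboid ID is part of
    // the path, so a cuboid can be located without reading any data.
    static String cuboidPath(String segmentDir, long cuboidId) {
        return segmentDir + "/cuboid_" + cuboidId + "/part-0.parquet";
    }

    // Layout (b): multiple cuboids share one file; the cuboid ID becomes the
    // leading column of every row, and readers filter on it (cheap when rows
    // are sorted by cuboid ID, thanks to Parquet's row-group min/max stats).
    static long[] shardedRow(long cuboidId, long[] dimensionsAndMeasures) {
        long[] row = new long[dimensionsAndMeasures.length + 1];
        row[0] = cuboidId;
        System.arraycopy(dimensionsAndMeasures, 0, row, 1, dimensionsAndMeasures.length);
        return row;
    }
}
```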
>
>2. I notice that there is a schema field in Parquet's FileMetaData, but in
>a cube, different cuboids have different schemas. Should we just save the
>basic cuboid's schema in that field? Will this cause storage waste?
>
>[shaofengshi]: To keep it simple, we can use the same schema for all
>cuboids. There will be some metadata overhead since some columns are empty,
>but it is minor compared with the data size. Using different schemas would
>make the code a little more complicated. I think this needs some
>verification; it is not determined yet.
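The "one schema for all cuboids" idea can be illustrated like this (column names are hypothetical): every cuboid's rows are emitted in the basic cuboid's column order, with dimensions absent from that cuboid left null, which in Parquet costs only definition-level metadata rather than data pages:

```java
import java.util.Map;

public class UnifiedSchema {
    // Project a cuboid's values onto the basic cuboid's column order.
    // Columns the cuboid does not contain stay null.
    static String[] projectRow(String[] basicColumns, Map<String, String> cuboidValues) {
        String[] row = new String[basicColumns.length];
        for (int i = 0; i < basicColumns.length; i++) {
            row[i] = cuboidValues.get(basicColumns[i]);
        }
        return row;
    }
}
```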
>
>3. Can Parquet be easily extended with indexes, like a bitmap index or a
>B-tree index for each column?
>
>[shaofengshi]: I think it is extensible. Parquet is adding the column page
>index, which is a lightweight index. We could follow that approach to
>implement another type of index page, but that involves many changes and
>should be done very carefully. Alternatively, we can store the index
>separately in another fast storage. Actually, I'm not sure whether a bitmap
>index is worth building, because Kylin already builds many cuboids. For
>large cuboids, we can sort the data by the high-cardinality columns; in
>that case, the file/row-group/page level min/max indices might be enough
>for filtering. For small cuboids (no high-cardinality columns), as the file
>is small, Parquet's dictionary filtering should be good enough. I believe
>the cube planner will play an important role in finding out which cuboids
>are worth calculating. This is a difference between Kylin and other
>engines: we can do more work in the build phase.
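Why sorting by a high-cardinality column makes min/max statistics effective can be sketched quickly. This is a simplified model of Parquet row-group pruning, not Parquet's actual API:

```java
public class MinMaxPruning {
    static final class RowGroup {
        final long min, max;  // per-column statistics kept in the file footer
        RowGroup(long min, long max) { this.min = min; this.max = max; }
    }

    // Number of row groups a reader must open for the predicate `col == value`:
    // a group can be skipped when the value falls outside its [min, max] range.
    static int groupsToScan(RowGroup[] groups, long value) {
        int mustRead = 0;
        for (RowGroup g : groups) {
            if (value >= g.min && value <= g.max) mustRead++;
        }
        return mustRead;
    }
}
```

When the data is sorted on the column, the groups' ranges are disjoint and a point filter touches one group; with unsorted data every group's range tends to overlap the value, so nothing is skipped.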
>
>4. Do we need to build an RPC server? If we just use YARN to schedule
>Spark tasks for queries, starting/stopping JVMs may take seconds, and then
>most queries will be slower than with HBase. Of course, it is more
>scalable, and some queries may be faster.
>[shaofengshi]: We can have a long-running Spark application acting as the
>query engine; when Kylin receives a query, it submits the job to Spark
>without startup overhead. Kyligence has this in their product, so there is
>no risk on this point.
>
>
>1. Use a customized columnar format: it is more flexible, as we can add
>Kylin-specific concepts (like the cuboid) into the storage; it will also be
>easy to add different types of indexes as we need. The disadvantage is
>needing more effort to define the format and for development (cannot
>leverage existing
>lib to 

Re: [DISCUSS] Columnar storage engine for Apache Kylin

2018-10-26 Thread JiaTao Tao
You are welcome, ShaoFeng! Storage and query engines are inseparable and
should be designed together to fully gain each other's abilities. And I'm
very excited about the new upcoming columnar storage and query engine!


-- 


Regards!

Aron Tao



Re: [DISCUSS] Columnar storage engine for Apache Kylin

2018-10-26 Thread ShaoFeng Shi
Exactly; thank you, JiaTao, for the comments!


Re: [DISCUSS] Columnar storage engine for Apache Kylin

2018-10-25 Thread JiaTao Tao
As far as I'm concerned, using Parquet as Kylin's storage format is pretty
appropriate. From the aspect of integrating Spark: Spark has made a lot of
optimizations for Parquet, so we can enjoy its vectorized reading, lazy
dictionary decoding, etc.


And here are my thoughts about integrating Spark and our query engine. As
Shaofeng mentioned, a cuboid is a Parquet file; you can think of it as a
small table, and we can read the cuboid directly as a DataFrame, which
Spark can then query, a bit like this:
ss.read.parquet("path/to/CuboidFile").filter("xxx").agg("xxx").select("xxx")
(We need to implement some of Kylin's advanced aggregations; for Kylin's
basic aggregations like sum/min/max, we can use Spark's directly.)
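On the query side, choosing which cuboid file to read as a DataFrame is essentially a cover check on the grouped dimensions. A minimal sketch of the idea in Java (Kylin's real cuboid matcher is considerably more involved):

```java
public class CuboidMatch {
    // Each cuboid is encoded as a bitmask of the dimensions it contains.
    // The best match is the smallest cuboid whose dimension set covers the
    // dimensions the query groups/filters by; -1 means no match (fall back
    // to the base cuboid).
    static long chooseCuboid(long[] cuboids, long queryDims) {
        long best = -1;
        int bestBits = Integer.MAX_VALUE;
        for (long c : cuboids) {
            if ((c & queryDims) == queryDims && Long.bitCount(c) < bestBits) {
                best = c;
                bestBits = Long.bitCount(c);
            }
        }
        return best;
    }
}
```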



*Compared to our old query engine, the advantages are as follows:*



1. It is distributed! Our old query engine gathers all data onto one query
node and then calculates; that node is a single point of failure and often
OOMs on a huge amount of data.



2. It is simple and easy to debug (every step is clear and transparent):
you can collect data after every single phase (filter/aggregation/
projection, etc.), so you can easily check which operation/phase went
wrong. Our old query engine uses Calcite for post-calculation, which makes
pinpointing problems difficult, especially when code generation is
involved, and you cannot insert your own logic during computation.



3. We can fully enjoy all the efforts Spark has made to optimize
performance, e.g. Catalyst/Tungsten.



4. It is easy to unit test: you can test every step separately, which
reduces the testing granularity of Kylin's query engine.



5. Thanks to Spark's DataSource API, we can switch from Parquet to other
data formats easily.



6. Many tools built on Spark, such as machine-learning tools, can be
integrated with us directly.



==
==

 Hi Kylin developers.



HBase has been Kylin’s storage engine since the first day; Kylin on HBase
has been verified as a success, supporting low-latency, high-concurrency
queries at a very large data scale. Thanks to HBase, most Kylin users get,
on average, less-than-1-second query responses.



But we also see some limitations when putting Cubes into HBase; I shared
some of them at HBaseCon Asia 2018[1] this August. The typical limitations
include:



   - Rowkey is the primary index; there is no secondary index so far;



Filtering by the row key’s prefix versus its suffix can give very different
performance. So the user needs to design the row key well; otherwise, the
query will be slow. This is sometimes difficult because the user might not
be able to predict the filtering patterns ahead of cube design.
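The asymmetry can be sketched with a sorted map standing in for HBase's sorted key space (a simplified model, not HBase's API): a prefix filter narrows the scan to a key range, while a suffix filter has no order to exploit and must touch every row.

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class RowkeyFilter {
    // Prefix filter: a range scan over the sorted keys; only rows inside
    // [prefix, prefix + maxChar) are ever touched.
    static int rowsTouchedByPrefix(TreeMap<String, Long> table, String prefix) {
        SortedMap<String, Long> range = table.subMap(prefix, prefix + Character.MAX_VALUE);
        return range.size();
    }

    // Suffix filter: the sort order does not help, so every row is examined
    // even if only a few match.
    static int rowsTouchedBySuffix(TreeMap<String, Long> table, String suffix) {
        int touched = 0;
        for (String key : table.keySet()) {
            touched++;                               // every row is read...
            boolean matches = key.endsWith(suffix);  // ...just to test the suffix
        }
        return touched;
    }
}
```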



   - HBase is a key-value store, not a columnar storage;



Kylin combines multiple measures (columns) into fewer column families for a
smaller data size (the row key overhead is significant). This causes HBase
to often read more data than requested.



   - HBase cannot run on YARN;



This makes the deployment and auto-scaling a little complicated, especially
in the cloud.



In a word, HBase is complicated as Kylin’s storage; maintenance and
debugging are also hard for normal developers. Now we’re planning to seek a
simple, lightweight, read-only storage engine for Kylin. The new solution
should have the following characteristics:



   - Columnar layout with compression for efficient I/O;

   - Index by each column for quick filtering and seeking;

   - MapReduce / Spark API for parallel processing;

   - HDFS compliant for scalability and availability;

   - Mature, stable and extensible;



With the plugin architecture[2] introduced in Kylin 1.5, adding multiple
storages to Kylin is possible. Some companies, like Kyligence Inc. and
Meituan.com, have developed their own customized storage engines for Kylin
in their products or platforms. In their experience, columnar storage is a
good supplement to the HBase engine. Kaisen Kang from Meituan.com shared
their KOD (Kylin on Druid) solution[3] at this August’s Kylin meetup in
Beijing.



We plan to do a PoC with Apache Parquet + Apache Spark in the next phase.
Parquet is a standard columnar file format and is widely supported by many
projects like Hive, Impala, Drill, etc. Parquet is adding the page-level
column index to support fine-grained filtering. Apache Spark can provide
parallel computing over Parquet and can be deployed on YARN/Mesos and
Kubernetes. With this combination, data persistence and computation are
separated, which makes scaling in/out much easier than before. Benefiting
from Spark's flexibility, we can not only push down more


Re: [DISCUSS] Columnar storage engine for Apache Kylin

2018-10-16 Thread ShaoFeng Shi
Hi guys,

I uploaded the initial design document to JIRA; please feel free to comment:

https://issues.apache.org/jira/browse/KYLIN-3621


ShaoFeng Shi  于2018年10月12日周五 上午9:44写道:

> ShaoFeng Shi  于2018年10月8日周一 下午2:45写道:
>
>> Luke Han  于2018年10月7日周日 下午7:44写道:
>>
>>> It makes sense to bring a better storage option for Kylin.
>>>
>>> The option should be open, and people could have different ways to
>>> create an adaptor for the underlying storage. Considering that huge
>>> adoptions of Kylin today all run on Hadoop/HDFS, I prefer Parquet, ORC,
>>> or another HDFS-compatible option at this time. It will be easy for
>>> people to upgrade to the next generation and keep consistency.
>>>
>>> Looking forward to this feature to be rolled out soon.
>>>
>>> Thanks.
>>>
>>>
>>>
>>> Best Regards!
>>> -
>>>
>>> Luke Han
>>>
>>>
>>> On Wed, Oct 3, 2018 at 2:37 PM Li Yang  wrote:
>>>
>>> > Love this discussion. I'd like to highlight the 3 major roles HBase is
>>> > playing currently, so we don't miss any of them when looking for a
>>> > replacement.
>>> >
>>> > 1) Storage: A high speed big data storage
>>> > 2) Cache: A distributed storage cache layer (was BlockCache)
>>> > 3) MPP: A distributed computation framework (was Coprocessor)
>>> >
>>> > The "Storage" seems to be at the center of the discussion. Be it
>>> > Parquet, ORC, or a new file format, to me the standard interface is
>>> > most important. As long as we have consensus on the access interface,
>>> > like MapReduce / Spark Dataset, the rest of the debate can be easily
>>> > resolved by a fair benchmark. It also allows people with different
>>> > preferences to keep their own implementations under the standard
>>> > interface, without impacting the rest of Kylin.
>>> >
>>> > The "Cache" and the "MPP" roles were more or less overlooked. I
>>> > suggest we pay more attention to them. Apart from Spark and Alluxio,
>>> > are there any other alternatives? Actually, Druid is a well-rounded
>>> > choice: like HBase, it covers all 3 roles pretty well.
>>> >
>>> > In general, I prefer to choose from the state of the art instead of
>>> > re-inventing. Indeed, Kylin is not a storage project; a new storage
>>> > format is not Kylin's mission. Any storage innovations we come across
>>> > here would be more beneficial if contributed to the Parquet or ORC
>>> > community.
>>> >
>>> > Regards
>>> > Yang
>>> >
>>> >
>>> >
>>> > On Tue, Oct 2, 2018 at 11:20 AM ShaoFeng Shi 
>>> > wrote:
>>> >
>>> > > Hi Billy,
>>> > >
>>> > > Yes, the cloud storage should be considered. The traditional file
>>> layouts
>>> > > on HDFS may not work well on cloud storage. Kylin needs to allow
>>> > extension
>>> > > here. I will add this to the requirement.
>>> > >
>>> > > Billy Liu  于2018年9月29日周六 下午3:22写道:
>>> > >
>>> > > > Hi Shaofeng,
>>> > > >
>>> > > > I'd like to add one more characteristic: cloud-native storage
>>> > > > support. Quite a few users are using S3 on AWS or Azure Data Lake
>>> > > > Storage on Azure. If the new storage engine could be more cloud
>>> > > > friendly, more users could benefit from it.
>>> > > >
>>> > > > With Warm regards
>>> > > >
>>> > > > Billy Liu

Re: [DISCUSS] Columnar storage engine for Apache Kylin

2018-10-11 Thread ShaoFeng Shi
JIRA and sub-tasks are created for this. Welcome to comment there:
https://issues.apache.org/jira/browse/KYLIN-3621


Re: [DISCUSS] Columnar storage engine for Apache Kylin

2018-10-08 Thread ShaoFeng Shi
I agree; the new storage should be Hadoop/HDFS compliant, and it also needs
to be cloud-storage friendly (e.g., S3, blob storage), as more and more
users are running big data analytics in the cloud.

> > > > > in the cloud.
> > > > >
> > > > > In one word, HBase is complicated to be Kylin’s storage. The
> > > maintenance,
> > > > > debugging is also hard for normal developers. Now we’re planning to
> > > seek
> > > > a
> > > > > simple, light-weighted, read-only storage engine for Kylin. The new
> > > > > solution should have the following characteristics:
> > > > >
> > > > >- Columnar layout with compression for efficient I/O;
> > > > >- Index by each column for quick filtering and 

Re: [DISCUSS] Columnar storage engine for Apache Kylin

2018-10-07 Thread Luke Han
It makes sense to bring a better storage option for Kylin.

The option should be open, so that people can create adaptors for different
underlying storages.
Considering that most Kylin deployments today run on Hadoop/HDFS, I prefer
Parquet, ORC, or another HDFS-compatible option at this time. That will make it
easy for people to upgrade to the next generation while keeping consistency.

Looking forward to this feature to be rolled out soon.

Thanks.



Best Regards!
-

Luke Han



Re: [DISCUSS] Columnar storage engine for Apache Kylin

2018-10-03 Thread Li Yang
Love this discussion. I'd like to highlight the 3 major roles HBase is
currently playing, so we don't miss any of them when looking for a replacement.

1) Storage: A high speed big data storage
2) Cache: A distributed storage cache layer (was BlockCache)
3) MPP: A distributed computation framework (was Coprocessor)

The "Storage" seems to be at the center of the discussion. Be it Parquet, ORC, or a
new file format, to me the standard interface matters most. As long as
we have consensus on the access interface, like MapReduce / Spark Dataset,
the rest of the debate can easily be resolved by a fair benchmark. It also
allows people with different preferences to keep their own implementations
under the standard interface without impacting the rest of Kylin.
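The standard-interface idea above can be sketched minimally. All names here (`CubeStorage`, `scan`, the stub engines) are hypothetical illustrations; Kylin's real plugin SPI is in Java and differs:

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, Tuple


class CubeStorage(ABC):
    """Hypothetical standard access interface for cube data."""

    @abstractmethod
    def scan(self, cuboid_id: int, filters: Dict[str, int]) -> Iterator[Tuple]:
        """Yield rows of the given cuboid that match the filters."""


class HBaseStorage(CubeStorage):
    def scan(self, cuboid_id, filters):
        # Would translate filters into row-key ranges / coprocessor calls.
        yield from ()


class ParquetStorage(CubeStorage):
    def scan(self, cuboid_id, filters):
        # Would prune row groups via column statistics, then read columns.
        yield from ()


def query(storage: CubeStorage, cuboid_id: int, filters: Dict[str, int]) -> list:
    # The query layer depends only on the interface, so engines can be
    # swapped and benchmarked fairly without touching the rest of Kylin.
    return list(storage.scan(cuboid_id, filters))


print(query(HBaseStorage(), 255, {"year": 2018}))  # both stubs yield nothing
print(query(ParquetStorage(), 255, {"year": 2018}))
```

The point is only the shape: one interface, many implementations, and a fair benchmark can then compare them behind the same `query` call.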

The "Cache" and the "MPP" were more or less overlooked. I suggest we pay
more attention to them. Apart from Spark and Alluxio, are there any other
alternatives? Actually, Druid is a well-rounded choice; like HBase, it
covers all 3 roles pretty well.

In general, I prefer to choose from the state of the art instead of
re-inventing. Indeed, Kylin is not a storage project; a new storage format
is not Kylin's mission. Any storage innovations we come across here would
be more beneficial if contributed to the Parquet or ORC community.

Regards
Yang




Re: [DISCUSS] Columnar storage engine for Apache Kylin

2018-10-01 Thread ShaoFeng Shi
Hi Billy,

Yes, cloud storage should be considered. Traditional file layouts
on HDFS may not work well on cloud storage; Kylin needs to allow extension
here. I will add this to the requirements.



-- 
Best regards,

Shaofeng Shi 史少锋


Re: [DISCUSS] Columnar storage engine for Apache Kylin

2018-09-29 Thread Billy Liu
Hi Shaofeng,

I'd like to add one more characteristic: cloud-native storage support.
Quite a few users are using S3 on AWS, or Azure Data Lake Storage on
Azure. If the new storage engine could be more cloud-friendly, more users
could benefit from it.

With Warm regards

Billy Liu
ShaoFeng Shi wrote on Fri, Sep 28, 2018 at 2:15 PM:
>
> Hi Kylin developers.
>
> HBase has been Kylin’s storage engine since the first day; Kylin on HBase
> has been verified as a success which can support low latency & high
> concurrency queries on a very large data scale. Thanks to HBase, most Kylin
> users can get on average less than 1-second query response.
>
> But we also see some limitations when putting Cubes into HBase; I shared
> some of them in the HBaseConf Asia 2018[1] this August. The typical
> limitations include:
>
>- Rowkey is the primary index, no secondary index so far;
>
> Filtering by a row key’s prefix versus its suffix can yield very different
> performance. So the user needs to design the row key well; otherwise,
> queries will be slow. This is sometimes difficult because the user might
> not be able to predict the filtering patterns ahead of cube design.
>
>- HBase is a key-value instead of a columnar storage
>
> Kylin combines multiple measures (columns) into fewer column families for
> a smaller data size (the row key overhead is significant). This often causes
> HBase to read more data than requested.
>
>- HBase couldn't run on YARN
>
> This makes the deployment and auto-scaling a little complicated, especially
> in the cloud.
>
> In a word, HBase is a complicated storage for Kylin; maintenance and
> debugging are also hard for ordinary developers. Now we’re planning to seek a
> simple, lightweight, read-only storage engine for Kylin. The new
> solution should have the following characteristics:
>
>- Columnar layout with compression for efficient I/O;
>- Index by each column for quick filtering and seeking;
>- MapReduce / Spark API for parallel processing;
>- HDFS compliant for scalability and availability;
>- Mature, stable and extensible;
>
> With the plugin architecture[2] introduced in Kylin 1.5, adding multiple
> storages to Kylin is possible. Some companies, like Kyligence Inc and
> Meituan.com, have developed their own customized storage engines for Kylin
> in their products or platforms. In their experience, columnar storage is a good
> supplement for the HBase engine. Kaisen Kang from Meituan.com has shared
> their KOD (Kylin on Druid) solution[3] in this August’s Kylin meetup in
> Beijing.
>
> We plan to do a PoC with Apache Parquet + Apache Spark in the next phase.
> Parquet is a standard columnar file format and has been widely supported by
> many projects like Hive, Impala, Drill, etc. Parquet is adding a page-level
> column index to support fine-grained filtering. Apache Spark can
> provide parallel computing over Parquet and can be deployed on
> YARN/Mesos and Kubernetes. With this combination, data persistence and
> computation are separated, which makes scaling in/out much easier than
> before. Benefiting from Spark's flexibility, we can also push down more
> computation from Kylin to the Hadoop cluster. Besides Parquet, Apache
> ORC is also a candidate.
>
> Now I raise this discussion to get your ideas about Kylin’s next-generation
> storage engine. If you have good ideas or any related data, you are welcome
> to discuss them in the community.
>
> Thank you!
>
> [1] Apache Kylin on HBase
> https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
> [2] Apache Kylin Plugin Architecture
> https://kylin.apache.org/development/plugin_arch.html
> [3] Kylin storage engine practice based on Druid (基于Druid的Kylin存储引擎实践)
> https://blog.bcmeng.com/post/kylin-on-druid.html
> --
> Best regards,
>
> Shaofeng Shi 史少锋
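The read-amplification and page-level-index points in the quoted proposal can be made concrete with a toy, stdlib-only sketch. Every name and number here is illustrative, not HBase or Parquet internals:

```python
# Toy model: a key-value scan touches every cell of every row, while a
# columnar scan reads only the needed columns and skips pages whose
# min/max statistics exclude the predicate (the "page-level column index").

rows = [{"day": d, "city": d % 3, "gmv": d * 10} for d in range(100)]

# Row/KV layout: answering "sum(gmv) where day >= 90" touches every cell.
kv_cells_read = sum(len(r) for r in rows)  # 100 rows * 3 cells

# Columnar layout: each column stored contiguously, split into pages,
# with min/max statistics kept per page for data skipping.
PAGE = 25
day_col = [r["day"] for r in rows]
gmv_col = [r["gmv"] for r in rows]
pages = [(min(day_col[i:i + PAGE]), max(day_col[i:i + PAGE]), i)
         for i in range(0, len(rows), PAGE)]

col_cells_read = 0
total = 0
for lo, hi, i in pages:
    if hi < 90:  # page statistics exclude the predicate: skip the page
        continue
    for day, gmv in zip(day_col[i:i + PAGE], gmv_col[i:i + PAGE]):
        col_cells_read += 2  # read only the two columns involved
        if day >= 90:
            total += gmv

print(kv_cells_read, col_cells_read, total)  # 300 50 9450
```

In this toy, the columnar reader touches 50 cells instead of 300 for the same answer; the gap is what the proposal calls reading "more data than requested" in HBase.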


Re: [DISCUSS] Columnar storage engine for Apache Kylin

2018-09-28 Thread ShaoFeng Shi
Hi Yanghong,

Thanks for your question. I think it is not required that other engines
know how to read Kylin's storage, but it is nice to have if possible. We
can extend the file format if Parquet or ORC can't match Kylin's
requirements, but it is not necessary to re-invent a new format.


Re: [DISCUSS] Columnar storage engine for Apache Kylin

2018-09-28 Thread Zhong, Yanghong
I have one question about the characteristics of Kylin's columnar storage files:
should the format be a standard, common one? Since the data stored in the
storage engine is Kylin-specific, is it necessary for other engines to know how
to write data into and read data from the storage engine?

In my opinion, it's not necessary, and Kylin's columnar storage files should be
Kylin-specific. We can leverage the advantages of other columnar formats, like
data-skipping indexes, bloom filters, and dictionaries, and then create a new
file format with Kylin-specific requirements, like cuboid info.

--
Best regards,
Yanghong Zhong

