Awesome! Looking forward to the improvement. As for the dictionary: keeping it 
in the query engine is usually not a good idea, since it puts a lot of 
pressure on the Kylin server. Sometimes it does help, though. For example, some 
segments can be pruned very early when the filter value is not in the dictionary, 
and some queries can be answered directly from the dictionary, as described in: 
https://issues.apache.org/jira/browse/KYLIN-3490
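
To make the two benefits concrete, here is a minimal sketch. It is not Kylin 
code: the Segment class and both functions are hypothetical, and the 
distinct-values lookup is only an assumed example of a query that a dictionary 
alone could answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    # Hypothetical stand-in for a cube segment: its name plus the set of
    # distinct values (the "dictionary") of one dimension column.
    name: str
    dictionary: frozenset


def prune_segments(segments, filter_value):
    """Keep only segments whose dictionary contains the filter value."""
    return [s for s in segments if filter_value in s.dictionary]


def distinct_values(segments):
    """Answer a distinct-values query from dictionaries, without scanning data."""
    result = set()
    for s in segments:
        result |= s.dictionary
    return sorted(result)


segments = [
    Segment("2018-12-01", frozenset({"Beijing", "Shanghai"})),
    Segment("2018-12-02", frozenset({"Shanghai", "Shenzhen"})),
]

# Only the first segment can contain rows for city = 'Beijing'.
survivors = prune_segments(segments, "Beijing")
```

A filter value that appears in no dictionary prunes every segment, so such a 
query could return an empty result without reading any cube data.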

At 2018-12-17 15:36:01, "ShaoFeng Shi" <[email protected]> wrote:

I think the dimension dictionary is a legacy design for HBase storage. Because 
HBase has no data types and everything is a byte array, Kylin has to 
encode STRING and other types with some encoding method such as the dictionary. 


Now, with a storage format like Parquet, the storage itself decides how to 
encode the data at the page or block level, so we can drop the dictionary after 
the cube is built. This will relieve the memory pressure on Kylin query nodes 
and also benefit the UHC (ultra-high-cardinality) case.
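
As a rough illustration of what storage-level encoding means, the sketch below 
dictionary-encodes one column chunk. It is a simplification under stated 
assumptions: the function names are made up, and real Parquet writes a 
dictionary page and stores the indices in a compact run-length/bit-packed form 
rather than as a plain list.

```python
def dictionary_encode(values):
    """Encode a column chunk as (dictionary, indices into the dictionary)."""
    dictionary = []   # distinct values in first-seen order
    positions = {}    # value -> index in `dictionary`
    indices = []      # one small integer per row
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Recover the original values; the reader does this transparently."""
    return [dictionary[i] for i in indices]


chunk = ["Beijing", "Shanghai", "Beijing", "Beijing", "Shanghai"]
dictionary, indices = dictionary_encode(chunk)
restored = dictionary_decode(dictionary, indices)
```

Because the dictionary is stored with the data it describes, the query side 
would not need to hold a separate global dictionary in memory to read the 
original values back.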


Best regards,


Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: [email protected]

Kyligence Inc: https://kyligence.io/


Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: [email protected]
Join Kylin dev mail group: [email protected]









Chao Long <[email protected]> wrote on Monday, December 17, 2018 at 1:23 PM:

 In this PoC, we verified that Kylin on Parquet is viable, but the query 
performance still has room to improve. We can improve it in the following aspects:


 1, Minimize result set serialization time
 Since Kylin needs Object[] data to process, we convert the Dataset to an RDD 
and then convert each "Row" to Object[], so Spark has to serialize the Object[] 
before returning it to the driver. This serialization time should be avoided.


 2, Query without dictionary
 In this PoC, to reduce storage use, we keep the dict-encoded values in the 
Parquet file for dict-encoded dimensions, so Kylin must load the dictionary to 
decode those values at query time. If we keep the original values for 
dict-encoded dimensions, the dictionary is unnecessary. And we don't have to 
worry about the storage use, because Parquet will encode the values itself. We 
should remove the dictionary from the query path.


 3, Remove query single-point issue
 In this PoC, we use Spark to read and process cube data, which is distributed, 
but Kylin also needs to process the result data that Spark returns in a single 
JVM. We can try to make that step distributed too.


 4, Upgrade Parquet to 1.11 for page index
 In this PoC, Parquet does not have a page index, so we get poor filter 
performance. We need to upgrade Parquet to version 1.11, which has a page index, 
to improve filter performance.
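
To show why a page index helps filtering (item 4 above), here is a minimal 
sketch. It assumes the index keeps a min and max value per page. The Page class 
and function names are hypothetical and this is not the Parquet API.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical page: per-page min/max statistics plus the page's values.
    min_value: int
    max_value: int
    rows: list


def build_pages(values, page_size):
    """Split a column into fixed-size pages and record min/max for each."""
    pages = []
    for start in range(0, len(values), page_size):
        rows = values[start:start + page_size]
        pages.append(Page(min(rows), max(rows), rows))
    return pages


def filter_equals(pages, target):
    """Return (matching values, number of pages that had to be read)."""
    matches, pages_read = [], 0
    for page in pages:
        if target < page.min_value or target > page.max_value:
            continue  # skipped using the min/max statistics alone
        pages_read += 1
        matches.extend(v for v in page.rows if v == target)
    return matches, pages_read


pages = build_pages(list(range(100)), page_size=10)
matches, pages_read = filter_equals(pages, 42)
```

In this example only one of the ten pages is read. The gain depends on how the 
data is laid out: when a column is sorted or clustered, the min/max ranges 
overlap less and more pages can be skipped.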


------------------
Best Regards,
Chao Long
 
------------------ Original message ------------------
From: "ShaoFeng Shi"<[email protected]>;
Sent: Friday, December 14, 2018, 4:39 PM
To: "dev"<[email protected]>;"user"<[email protected]>;
Subject: Evaluate Kylin on Parquet


Hello Kylin users,


The first version of the Kylin on Parquet [1] feature has been staged in the 
Kylin code repository for public review and evaluation. You can check out the 
"kylin-on-parquet" branch [2] to read the code, and you can also make a binary 
build to run an example. When creating a cube, you can select "Parquet" as the 
storage on the "Advanced setting" page. Both the MapReduce and Spark engines 
support this new storage. A tech blog on the design and implementation is 
being drafted.



Many thanks to the engineers for their hard work: Chao Long and Yichen Zhou!


This is not the final version; there is room to improve in many aspects: 
Parquet, Spark, and Kylin. It can be used for PoC at this moment. Your comments 
are welcome. Let's improve it together.


[1] https://issues.apache.org/jira/browse/KYLIN-3621
[2] https://github.com/apache/kylin/tree/kylin-on-parquet


Best regards,


Shaofeng Shi 史少锋
Apache Kylin PMC



