Re: Re: Evaluate Kylin on Parquet

2019-01-01 Thread ShaoFeng Shi
a very weak filter capability for binary, and the
>> >> basic types can directly use Spark's vectorized read to speed up data
>> >> reading and calculation.
>> >>
>> >> 4. Use Spark to match Parquet
>> >> Spark has already been adapted to Parquet: Spark's filters are
>> >> converted into filters that Parquet can use. Here, we only need to
>> >> upgrade the Parquet version and modify it to provide Parquet's page
>> >> index.
>> >>
>> >> 5. Index server
>> >> As described by JiaTao Tao, the index server is divided into a file
>> >> index and a page index. Dictionary filtering is nothing but a file
>> >> index, so we can insert an index server here.


Re: Re: Evaluate Kylin on Parquet

2018-12-31 Thread Li Yang


Re: Re: Evaluate Kylin on Parquet

2018-12-29 Thread ShaoFeng Shi


Re: Re: Evaluate Kylin on Parquet

2018-12-19 Thread 许益铭


Re: Re: Evaluate Kylin on Parquet

2018-12-19 Thread JiaTao Tao
Hi Gang

In my opinion, segment/partition pruning is actually in the scope of an
"index system". We can have an index system at the storage level that
includes a file index (for segment/partition pruning), a page index (for
page pruning), and so on. We can put all of this in such a system and make
the separation of duties cleaner.
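
A minimal sketch of the two index levels described above, in Python. All names here (`Page`, `Segment`, `prune`) are hypothetical and are not Kylin or Parquet APIs. The file index is modelled, as an assumption, as the set of distinct values per segment (much like a dictionary), and the page index as per-page min/max statistics.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Page:
    # Page index entry: min/max of the column values stored in this page.
    min_value: str
    max_value: str
    rows: List[str]


@dataclass
class Segment:
    name: str
    # File index (hypothetical): distinct values present in the segment.
    distinct_values: Set[str]
    pages: List[Page]


def prune(segments: List[Segment], wanted: str) -> List[str]:
    """Return 'segment/pageN' labels that still must be scanned for column == wanted."""
    to_scan = []
    for seg in segments:
        # File-level pruning: skip the whole segment if the value cannot occur.
        if wanted not in seg.distinct_values:
            continue
        for i, page in enumerate(seg.pages):
            # Page-level pruning: skip pages whose [min, max] range excludes the value.
            if page.min_value <= wanted <= page.max_value:
                to_scan.append(f"{seg.name}/page{i}")
    return to_scan


segments = [
    Segment("seg_2018_11", {"beijing", "shanghai"},
            [Page("beijing", "beijing", ["beijing"] * 3),
             Page("shanghai", "shanghai", ["shanghai"] * 2)]),
    Segment("seg_2018_12", {"hangzhou", "shenzhen"},
            [Page("hangzhou", "shenzhen", ["hangzhou", "shenzhen"])]),
]

print(prune(segments, "shanghai"))
```

In this toy data only one page of one segment remains to be scanned. Keeping both checks behind a single `prune` entry point is one way to read "separation of duties": the query engine asks the index system what to read and does not need to know which level did the pruning.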


Ma Gang wrote on Wed, Dec 19, 2018 at 6:31 AM:

> Awesome! Looking forward to the improvement. As for the dictionary, keeping
> it in the query engine is usually not good, since it puts a lot of pressure
> on the Kylin server. But sometimes it has benefits: for example, some
> segments can be pruned very early when the filter value is not in the
> dictionary, and some queries can be answered directly from the dictionary,
> as described in: https://issues.apache.org/jira/browse/KYLIN-3490
>
> At 2018-12-17 15:36:01, "ShaoFeng Shi"  wrote:
>
> I think the dimension dictionary is a legacy design for HBase storage.
> Because HBase has no data types and everything is a byte array, Kylin has
> to encode STRING and other types with some encoding method, such as the
> dictionary.
>
> With a storage format like Parquet, the storage itself decides how to
> encode the data at the page or block level, so we can drop the dictionary
> after the cube is built. This will relieve the memory pressure on Kylin
> query nodes and also benefit the UHC (ultra-high-cardinality) case.


-- 


Regards!

Aron Tao


Re: Evaluate Kylin on Parquet

2018-12-16 Thread Chao Long
In this PoC, we verified that Kylin on Parquet is viable, but the query
performance still has room to improve. We can improve it in the following
aspects:


 1, Minimize result set serialization time
 Since Kylin needs Object[] data to process, we convert the Dataset to an RDD
and then convert the "Row" type to Object[], so Spark needs to serialize the
Object[] before returning it to the driver. This time needs to be avoided.


 2, Query without dictionary
 In this PoC, to use less storage, we keep the dict-encoded values in the
Parquet file for dict-encoded dimensions, so Kylin must load the dictionary to
convert the dict values at query time. If we keep the original values for
dict-encoded dimensions, the dictionary is unnecessary. And we don't have to
worry about storage use, because Parquet will encode the data. We should
remove the dictionary from the query path.


 3, Remove query single-point issue
 In this PoC, we use Spark to read and process cube data, which is
distributed, but Kylin also needs to process the result data that Spark
returns in a single JVM. We can try to make this distributed too.


 4, Upgrade Parquet to 1.11 for page index
 In this PoC, Parquet doesn't have a page index, so we get poor filter
performance. We need to upgrade Parquet to version 1.11, which has a page
index, to improve filter performance.
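
To illustrate point 2, here is a minimal sketch in Python of why dict-encoded values force the query side to load the dictionary. The helper name `build_dictionary` and the sample column are hypothetical, and this is not Kylin's dictionary implementation; it only shows the encode/decode round trip.

```python
from typing import Dict, List, Tuple


def build_dictionary(values: List[str]) -> Tuple[Dict[str, int], List[str]]:
    """Assign each distinct value an integer id, in sorted order."""
    id_to_value = sorted(set(values))
    value_to_id = {v: i for i, v in enumerate(id_to_value)}
    return value_to_id, id_to_value


column = ["shanghai", "beijing", "shanghai", "hangzhou"]
value_to_id, id_to_value = build_dictionary(column)

# What gets stored for a dict-encoded dimension in this sketch: ids only.
stored_ids = [value_to_id[v] for v in column]

# Query side: the ids are meaningless without the dictionary, so it must be loaded.
decoded = [id_to_value[i] for i in stored_ids]

print(stored_ids)
print(decoded)
```

If the original values were stored instead, the decode step and the dictionary would not be needed at query time, and the storage cost would be left to Parquet's own column encoding, as point 2 argues.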



--
Best Regards,
Chao Long


 

Evaluate Kylin on Parquet

2018-12-14 Thread ShaoFeng Shi
Hello Kylin users,

The first version of the Kylin on Parquet [1] feature has been staged in the
Kylin code repository for public review and evaluation. You can check out the
"kylin-on-parquet" branch [2] to read the code, and you can also make a binary
build to run an example. When creating a cube, you can select "Parquet" as
the storage on the "Advanced setting" page. Both the MapReduce and Spark
engines support this new storage. A tech blog on the design and
implementation is being drafted.

Thanks so much to the engineers, Chao Long and Yichen Zhou, for their hard
work!

This is not the final version; there is room to improve in many aspects:
Parquet, Spark, and Kylin. It can be used for a PoC at this moment. Your
comments are welcome. Let's improve it together.

[1] https://issues.apache.org/jira/browse/KYLIN-3621
[2] https://github.com/apache/kylin/tree/kylin-on-parquet

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: shaofeng@kyligence.io
Kyligence Inc: https://kyligence.io/

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscr...@kylin.apache.org
Join Kylin dev mail group: dev-subscr...@kylin.apache.org