Re: Pinot/Kylin/Druid quick comparison

2023-12-04 Thread Xiaoxiang Yu
A JIRA ticket has been opened and is waiting for INFRA:
https://issues.apache.org/jira/browse/INFRA-25238

With warm regard
Xiaoxiang Yu



On Tue, Dec 5, 2023 at 10:30 AM Nam Đỗ Duy  wrote:

> Thank you Xiaoxiang, please let me know when you have changed the default
> branch. If people are impressed by the numbers, I hope this will turn the
> situation around.
>
> On Tue, Dec 5, 2023 at 9:02 AM Xiaoxiang Yu  wrote:
>
>> The default branch is for 4.X, which is a maintenance branch; the active
>> branch is kylin5.
>> I will change the default branch to kylin5 later.
>>
>> 
>> With warm regard
>> Xiaoxiang Yu
>>
>>
>>
>> On Tue, Dec 5, 2023 at 9:12 AM Nam Đỗ Duy  wrote:
>>
>>> Hi Xiaoxiang, Sirs / Madams
>>>
>>> Can you see the attached photo?
>>>
>>> My boss asked why Druid receives commits regularly while Kylin has had
>>> no commits since July.

Re: Pinot/Kylin/Druid quick comparison

2023-12-04 Thread Xiaoxiang Yu
I think so.

Response time is not the only factor in making a decision. Kylin could be
cheaper when the query pattern suits the Kylin model, and Kylin can
guarantee reasonable query latency. ClickHouse will be quicker in an ad hoc
query scenario.

By the way, Youzan and Kyligence combine the two to provide
unified data analytics services for their customers.


With warm regard
Xiaoxiang Yu
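
The cost trade-off in the reply above (pay once to pre-compute, then answer
many queries cheaply) can be illustrated with a small sketch. This is only a
back-of-the-envelope model, not a pricing formula from Kylin or ClickHouse:
the function names and all cost figures are hypothetical, and costs are in
arbitrary units.

```python
import math


def precompute_cost_per_query(build_cost, lookup_cost, num_queries):
    """Average cost per query when results are built once and then reused."""
    return build_cost / num_queries + lookup_cost


def break_even_queries(build_cost, lookup_cost, adhoc_cost):
    """Smallest query count at which pre-computation is strictly cheaper
    than computing every query from scratch.

    Pre-computation wins when build_cost + n * lookup_cost < n * adhoc_cost.
    Returns None if a lookup is not cheaper than an ad hoc query.
    """
    saving_per_query = adhoc_cost - lookup_cost
    if saving_per_query <= 0:
        return None
    return math.floor(build_cost / saving_per_query) + 1


# Hypothetical figures: building the cube costs 1000 units, a cube lookup
# costs 0.5 units, and an ad hoc scan costs 10.5 units.
print(break_even_queries(1000.0, 0.5, 10.5))
print(precompute_cost_per_query(1000.0, 0.5, 1000))
```

Under these assumed numbers, pre-computation only pays off once the same
query pattern is run often enough; for a handful of ad hoc queries the build
cost is never recovered.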




Re: Pinot/Kylin/Druid quick comparison

2023-12-04 Thread Nam Đỗ Duy via user
Hi Xiaoxiang, thank you.

If my client uses a cloud computing service such as GCP or AWS, which
will cost more: Kylin's precalculation or ClickHouse? My thinking is that
with Kylin the query is executed once and the result is stored in the cube
to be reused many times, so Kylin uses less cloud computation. Is that true?

On Mon, Dec 4, 2023 at 2:46 PM Xiaoxiang Yu  wrote:

> The following text is part of an article
> (https://zhuanlan.zhihu.com/p/343394287).
>
>
> ===
>
> Kylin is suitable for aggregation queries with fixed patterns because of its
> pre-calculation technology, for example when the join, group by, and where
> clauses in SQL are relatively fixed. The larger the data volume, the more
> obvious Kylin's advantages are. Kylin is particularly strong in
> deduplication (count distinct), Top N, Percentile, and similar scenarios,
> and it is widely used for dashboards, all kinds of reports, large-screen
> displays, traffic statistics, and user behavior analysis. Meituan, Aurora,
> Shell Housing, etc. use Kylin to build their data service platforms, serving
> millions to tens of millions of queries per day, and most of the queries
> complete within 2 - 3 seconds. There is no better alternative for such a
> high-concurrency scenario.
>
> ClickHouse, because of its MPP architecture, has high computing power and
> is more suitable when query requests are more flexible, or when
> detail-level queries with low concurrency are needed. Typical scenarios
> include user-label filtering over very many columns with arbitrarily
> combined where conditions, and complex ad hoc queries without high
> concurrency. If the data volume and access volume are large, you need to
> deploy a distributed ClickHouse cluster, which is a greater challenge for
> operation and maintenance.
>
> If some queries are very flexible but infrequent, it is more
> resource-efficient to compute them on the fly. Since the number of queries is
> small, even if each query consumes a lot of computational resources, it is
> still cost-effective overall. If some queries have a fixed pattern and the
> query volume is large, Kylin is the better fit: the results are computed
> once with substantial resources and saved, so the upfront computational
> cost is amortized over every query, which makes it the most economical.
>
> --- Translated with DeepL.com (free version)
>
>
> 
> With warm regard
> Xiaoxiang Yu
>
>
>
> On Mon, Dec 4, 2023 at 3:16 PM Nam Đỗ Duy  wrote:
>
>> Thank you Xiaoxiang for the near real time streaming feature. That's
>> great.
>>
>> This morning my team faced a new challenge: ClickHouse was presented to
>> us as calculating over 8 billion rows in milliseconds, which is faster
>> than my demonstration (I used Kylin to calculate over 1 billion rows in
>> 2.9 seconds).
>>
>> Can you briefly suggest the advantages of Kylin over ClickHouse so that I
>> can defend my demonstration?
>>
>> On Mon, Dec 4, 2023 at 1:55 PM Xiaoxiang Yu  wrote:
>>
>> > 1. "In this important scenario of realtime analytics, the reason here is
>> > that kylin has lag time due to model update of new segment build, is that
>> > correct?"
>> >
>> > You are correct.
>> >
>> > 2. "If that is true, then can you suggest a work-around of combination
>> > of ... "
>> >
>> > Kylin is planning to introduce NRT streaming (coding is completed but not
>> > released), which can reduce the time lag to about 3 minutes (that is my
>> > estimate, but I am quite certain about it).
>> > NRT stands for 'near real-time'; it runs a job that periodically does
>> > micro-batch aggregation and persistence. The price is that you need to
>> > run and monitor a long-running job. This feature is based on Spark
>> > Streaming, so you need knowledge of it.
>> >
>> > I am curious: what is the maximum time lag your customers can tolerate?
>> > Personally, I guess a minute-level time lag is OK for most cases.
>> >
>> > 
>> > With warm regard
>> > Xiaoxiang Yu
>> >
>> >
>> >
>> > On Mon, Dec 4, 2023 at 12:28 PM Nam Đỗ Duy wrote:
>> >
>> > > Druid is better in
>> > > - Have a real-time datasource like Kafka etc.
>> > >
>> > > ==
>> > >
>> > > Hi Xiaoxiang, thank you for your response.
>> > >
>> > > In this important scenario of realtime analytics, the reason here is
>> > > that kylin has lag time due to model update of new segment build, is
>> > > that correct?
>> > >
>> > > If that is true, then can you suggest a work-around of combination of:
>> > >
>> > > (time-lag kylin cube) + (realtime DB update) to provide
>> > > realtime capability?
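
The combination asked about in the earliest message of this thread (a
time-lagged cube plus a real-time store) could look roughly like the sketch
below. It is not a Kylin feature or API: the function names, the tuple layout
(timestamp, key, amount), and the cutoff value are all hypothetical, and the
"cube" is just an in-memory dictionary standing in for pre-aggregated results.

```python
from collections import defaultdict


def build_cube(rows, cutoff):
    """Pre-aggregate SUM(amount) per key for rows with timestamp < cutoff.

    Stands in for the periodic segment build, which lags behind real time.
    """
    cube = defaultdict(int)
    for ts, key, amount in rows:
        if ts < cutoff:
            cube[key] += amount
    return dict(cube)


def hybrid_sum(cube, cutoff, recent_rows):
    """Add rows at or after the cutoff (from the real-time store) to the
    pre-aggregated totals, so the answer covers the full time range."""
    result = defaultdict(int, cube)
    for ts, key, amount in recent_rows:
        if ts >= cutoff:
            result[key] += amount
    return dict(result)


# Hypothetical data: (timestamp, key, amount)
rows = [(1, "a", 10), (2, "b", 5), (3, "a", 7), (4, "b", 1)]
cube = build_cube(rows, cutoff=3)
print(hybrid_sum(cube, cutoff=3, recent_rows=rows))
```

This simple merge only works for additive measures such as SUM and COUNT;
an exact count distinct cannot be combined by adding two partial results.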