Re: Pinot/Kylin/Druid quick comparison

2023-12-05 Thread Nam Đỗ Duy via user
Thank you very much for your prompt response. I still have several
questions and will seek your help later.

Best regards and have a good day



Re: Pinot/Kylin/Druid quick comparison

2023-12-05 Thread Xiaoxiang Yu
Done. Github branch changed to kylin5.


With warm regard
Xiaoxiang Yu



On Tue, Dec 5, 2023 at 11:13 AM Xiaoxiang Yu  wrote:

> A JIRA ticket has been opened, waiting for INFRA :
> https://issues.apache.org/jira/browse/INFRA-25238 .
> 
> With warm regard
> Xiaoxiang Yu
>
>
>
> On Tue, Dec 5, 2023 at 10:30 AM Nam Đỗ Duy  wrote:
>
>> Thank you Xiaoxiang, please update me when you have changed your default
>> branch. If people are judging by the commit activity, I hope this will
>> reverse that impression.
>>
>> On Tue, Dec 5, 2023 at 9:02 AM Xiaoxiang Yu  wrote:
>>
>>> The default branch is for 4.X, which is a maintenance branch; the
>>> active branch is kylin5. I will change the default branch to kylin5
>>> later.
>>>
>>> 
>>> With warm regard
>>> Xiaoxiang Yu
>>>
>>>
>>>
>>> On Tue, Dec 5, 2023 at 9:12 AM Nam Đỗ Duy 
>>> wrote:
>>>
 Hi Xiaoxiang, Sirs/Madams,

 Can you see the attached photo?

 My boss asked why Druid receives commits regularly while Kylin has had
 no commits since July.


 On Mon, 4 Dec 2023 at 15:33 Xiaoxiang Yu  wrote:

> I think so.
>
> Response time is not the only factor in the decision. Kylin can be
> cheaper when the query pattern suits the Kylin model, and Kylin can
> guarantee reasonable query latency. ClickHouse will be quicker in ad
> hoc query scenarios.
>
> By the way, Youzan and Kyligence combine the two to provide unified
> data analytics services for their customers.
>
> 
> With warm regard
> Xiaoxiang Yu
>
>
>
> On Mon, Dec 4, 2023 at 4:01 PM Nam Đỗ Duy 
> wrote:
>
>> Hi Xiaoxiang, thank you.
>>
>> In case my client uses a cloud computing service like GCP or AWS, which
>> will cost more: Kylin's precalculation or ClickHouse? (In Kylin's case,
>> my understanding is that the query computation is done once and stored
>> in the cube to be reused many times, so Kylin consumes less cloud
>> compute. Is that true?)
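[A toy sketch of the idea behind this question, with made-up data; this is
not Kylin's actual implementation, just the compute-once/serve-many pattern:]

```python
# Toy illustration of Kylin-style precalculation: the aggregate is
# computed once at cube-build time, then every query is a cheap lookup.
from collections import defaultdict

fact_rows = [
    {"country": "VN", "year": 2023, "sales": 120},
    {"country": "VN", "year": 2023, "sales": 80},
    {"country": "US", "year": 2023, "sales": 200},
]

def build_cube(rows):
    """One expensive pass over the raw data (the 'cube build')."""
    cube = defaultdict(int)
    for r in rows:
        cube[(r["country"], r["year"])] += r["sales"]
    return dict(cube)

cube = build_cube(fact_rows)  # compute cost is paid once, here

def query(country, year):
    """Every later query is an O(1) lookup, not a table scan."""
    return cube.get((country, year), 0)

print(query("VN", 2023))  # 200
```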
>>
>> On Mon, Dec 4, 2023 at 2:46 PM Xiaoxiang Yu  wrote:
>>
>> > The following text is part of an article
>> > (https://zhuanlan.zhihu.com/p/343394287).
>> >
>> >
>> >
>> ===
>> >
>> > Kylin is suited to aggregation queries with fixed patterns, thanks to
>> > its precalculation technology: for example, queries whose join, group
>> > by, and where clauses are relatively fixed. The larger the data
>> > volume, the more obvious Kylin's advantage. Kylin is especially strong
>> > in deduplication (count distinct), Top N, and percentile scenarios,
>> > and it is widely used for dashboards, all kinds of reports,
>> > large-screen displays, traffic statistics, and user behavior analysis.
>> > Meituan, Aurora, Shell Housing, etc. use Kylin to build their data
>> > service platforms, serving millions to tens of millions of queries per
>> > day, with most queries completing within 2-3 seconds. There is no
>> > better alternative for such high-concurrency scenarios.
>> >
>> > ClickHouse, thanks to its MPP architecture, has high computing power
>> > and is a better fit when queries are more flexible, or when
>> > low-concurrency detail queries are needed. Typical scenarios include
>> > tables with very many columns whose where-conditions are combined
>> > arbitrarily, user-label filtering, and complex ad hoc queries at low
>> > concurrency. If the data volume and traffic are large, you need to
>> > deploy a distributed ClickHouse cluster, which is a bigger challenge
>> > for operations and maintenance.
>> >
>> > If some queries are very flexible but infrequent, on-the-fly
>> > computation is more resource-efficient: since the number of queries is
>> > small, even if each query consumes a lot of computational resources,
>> > it is still cost-effective overall. If some queries have a fixed
>> > pattern and the query volume is large, Kylin is the better fit:
>> > because the query volume is large, spending computational resources up
>> > front to save the results lets the build cost be amortized over each
>> > query, which is the most economical.
>> >
>> > --- 
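[The trade-off the article describes can be sketched with a hypothetical
cost model: assume an ad hoc engine pays a full scan on every query, while
a precomputation engine pays a one-off build cost plus a tiny per-query
lookup cost. All numbers below are invented for illustration only:]

```python
# Hypothetical cloud-cost model: ad hoc scan-per-query vs. one-off
# precomputation plus cheap lookups. Unit costs are made-up examples.
def adhoc_total(n_queries, scan_cost=10.0):
    """Every query rescans the data."""
    return n_queries * scan_cost

def precompute_total(n_queries, build_cost=500.0, lookup_cost=0.1):
    """Pay the build once, then each query is a cheap cube lookup."""
    return build_cost + n_queries * lookup_cost

# Break-even point: build_cost / (scan_cost - lookup_cost) ~= 50.5
# queries with these numbers; above that, precomputation wins.
for n in (10, 100, 10_000):
    cheaper = "precompute" if precompute_total(n) < adhoc_total(n) else "ad hoc"
    print(n, cheaper)
```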

Re: kylin4.0.3: error while building data

2023-12-05 Thread lee
Unsubscribe


kylin4.0.3: error while building data

2023-12-05 Thread 李甜彪
The build fails with an error. The data is fine in Hive, and a build over
empty data succeeds, so I suspected a data problem; but after hand-writing
a few rows of data, the build fails with the same error, which shows the
original data is not the problem. The error message shown on the page is
as follows:
java.io.IOException: OS command error exit with return code: 1, error message: che.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
... 3 more

}
RetryInfo{
   overrideConf : {},
   throwable : java.lang.RuntimeException: Error execute org.apache.kylin.engine.spark.job.CubeBuildJob
at org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:96)
at org.apache.spark.application.JobWorker$$anon$2.run(JobWorker.scala:55)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 74.0 failed 4 times, most recent failure: Lost task 0.3 in stage 74.0 (TID 186) (store2 executor 20): java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hive.conf.HiveConf$ConfVars
at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:103)
at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$makeRDDForTable$3(TableReader.scala:136)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2303)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2252)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2251)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2251)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1124)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1124)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1124)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2490)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2432)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2421)
at
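[The root cause above, `NoClassDefFoundError: Could not initialize class
org.apache.hadoop.hive.conf.HiveConf$ConfVars`, is typically a Hive jar
version conflict on the Spark executor classpath, e.g. two different
hive-common/hive-exec versions shipped by Spark and by the cluster. A quick
way to spot duplicates is to list the Hive jars in the relevant jar
directories; the paths below are placeholders, point them at your own
$SPARK_HOME/jars and $KYLIN_HOME/spark/jars:]

```python
# List Hive jars under candidate directories so duplicate versions are
# easy to spot. Directory paths are examples only; pass your own.
import os
import re
import sys
from collections import defaultdict

def hive_jars(dirs):
    """Map artifact name (e.g. 'hive-common') -> [(version, path), ...]."""
    found = defaultdict(list)
    pat = re.compile(r"^(hive-[a-z-]+)-(\d[\w.]*)\.jar$")
    for d in dirs:
        if not os.path.isdir(d):
            continue
        for name in os.listdir(d):
            m = pat.match(name)
            if m:
                found[m.group(1)].append((m.group(2), os.path.join(d, name)))
    return found

if __name__ == "__main__":
    dirs = sys.argv[1:] or ["/opt/spark/jars", "/opt/kylin/spark/jars"]
    for artifact, hits in sorted(hive_jars(dirs).items()):
        versions = sorted({v for v, _ in hits})
        flag = "  <-- conflicting versions" if len(versions) > 1 else ""
        print(artifact, versions, flag)
```

If the same artifact shows up with two versions, removing or shading one of
them (or aligning Spark's bundled Hive version with the cluster's) is the
usual fix.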