Re: Pinot/Kylin/Druid quick comparison

On Mon, Dec 4, 2023 at 2:46 PM Xiaoxiang Yu wrote:

The following text is part of an article (https://zhuanlan.zhihu.com/p/343394287):

===
Kylin suits aggregation queries with fixed patterns, thanks to its precalculation technology: queries whose join, group by, and where clauses are relatively fixed. The larger the data volume, the more pronounced Kylin's advantage, and it is especially strong in deduplication (count distinct), Top N, and Percentile scenarios. It is used in a large number of scenarios such as dashboards, all kinds of reports, large-screen displays, traffic statistics, and user behavior analysis. Meituan, Aurora, Shell Housing, and others use Kylin to build their data service platforms, serving millions to tens of millions of queries per day, with most queries completing within 2-3 seconds. There is no better alternative for such a high-concurrency scenario.

ClickHouse, because of its MPP architecture, has high computing power and is the better fit when query requests are more flexible, or when there is a need for detail queries at low concurrency. Typical scenarios include: where conditions combined arbitrarily over very many columns (such as user-label filtering), and complex ad-hoc queries without heavy concurrency. If the amount of data and access is large, you need to deploy a distributed ClickHouse cluster, which is a bigger challenge for operations and maintenance.

If some queries are very flexible but infrequent, computing them at query time is more resource-efficient: since the number of queries is small, even if each query consumes a lot of computational resources, it is still cost-effective overall. If some queries have a fixed pattern and the query volume is large, Kylin is the better fit: by spending large computational resources once to save the results, the upfront computational cost is amortized over each query, so it is the most economical.
---

On Mon, Dec 4, 2023 at 4:01 PM Nam Đỗ Duy wrote:

Hi Xiaoxiang, thank you. If my client uses a cloud computing service such as GCP or AWS, which will cost more: Kylin's precalculation or ClickHouse? (In Kylin's case, my understanding is that the query is executed once and the result is stored in the cube to be reused many times, so Kylin uses less cloud computation. Is that true?)

On Mon, 4 Dec 2023 at 15:33 Xiaoxiang Yu wrote:

I think so. Response time is not the only factor in the decision. Kylin can be cheaper when the query pattern suits the Kylin model, and Kylin can guarantee reasonable query latency; ClickHouse will be quicker in an ad-hoc query scenario. By the way, Youzan and Kyligence combine them together to provide unified data analytics services for their customers.

With warm regard
Xiaoxiang Yu

On Tue, Dec 5, 2023 at 9:12 AM Nam Đỗ Duy wrote:

Hi Xiaoxiang, Sirs/Madams, can you see the attached photo? My boss asked why Druid commits code regularly while Kylin has not had a commit since July.

On Tue, Dec 5, 2023 at 9:02 AM Xiaoxiang Yu wrote:

The default branch is for 4.X, which is a maintenance branch; the active branch is kylin5. I will change the default branch to kylin5 later.

On Tue, Dec 5, 2023 at 10:30 AM Nam Đỗ Duy wrote:

Thank you Xiaoxiang, please update me when you have changed your default branch. If people are impressed by the numbers, I hope to turn this situation around.

On Tue, Dec 5, 2023 at 11:13 AM Xiaoxiang Yu wrote:

A JIRA ticket has been opened, waiting for INFRA: https://issues.apache.org/jira/browse/INFRA-25238

On Wed, Dec 6, 2023 at 9:11 AM Xiaoxiang Yu wrote:

Done. GitHub default branch changed to kylin5.

Thank you very much for your prompt response. I still have several questions to ask for your help with later. Best regards and have a good day.
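The amortization argument in the article can be sketched with a toy model (illustrative only; the table, names, and numbers are made up and are not Kylin internals): pay one full scan to precompute an aggregate, then serve every query with the same fixed pattern from the small precomputed result.

```python
from collections import defaultdict

# Toy "fact table": (region, product, amount) rows.
rows = [("EU", "a", 10), ("EU", "b", 5), ("US", "a", 7), ("US", "b", 3)] * 1000

def query_raw(region):
    """Ad-hoc style: scan every row on each query."""
    return sum(amt for r, _, amt in rows if r == region)

# Kylin-style precalculation: one full scan builds a small aggregate
# ("cuboid") keyed by the fixed group-by column.
cuboid = defaultdict(int)
for region, _, amount in rows:
    cuboid[region] += amount

def query_cube(region):
    """Fixed-pattern query served from the precomputed aggregate."""
    return cuboid[region]

assert query_raw("EU") == query_cube("EU")
# The cuboid cost one scan of len(rows) rows; every later query is a
# dictionary lookup, so with many queries the upfront cost is amortized.
```

With one query, the scan-per-query approach is cheaper; with millions of identical-pattern queries per day, the single build dominates neither cost nor latency, which is the trade-off the article describes.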
Re: Error when building data with Kylin 4.0.3

Unsubscribe.

> On Dec 5, 2023, at 19:33, 李甜彪 wrote:
> [the original report, quoted verbatim; see the original message below]
Error when building data with Kylin 4.0.3
The build fails with an error. The data is fine in Hive, and a build over empty data succeeds, so I wondered whether it might be a data problem; but after hand-writing a few rows myself, the build fails with the same error, which shows the problem is not the original data. The error message shown on the page is as follows:

java.io.IOException: OS command error exit with return code: 1, error message: che.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
    ... 3 more
}
RetryInfo{
    overrideConf : {},
    throwable : java.lang.RuntimeException: Error execute org.apache.kylin.engine.spark.job.CubeBuildJob
    at org.apache.kylin.engine.spark.application.SparkApplication.execute(SparkApplication.java:96)
    at org.apache.spark.application.JobWorker$$anon$2.run(JobWorker.scala:55)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 74.0 failed 4 times, most recent failure: Lost task 0.3 in stage 74.0 (TID 186) (store2 executor 20): java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hive.conf.HiveConf$ConfVars
    at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:103)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
    at org.apache.spark.sql.hive.HadoopTableReader.$anonfun$makeRDDForTable$3(TableReader.scala:136)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2303)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2252)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2251)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2251)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1124)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1124)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1124)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2490)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2432)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2421)
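The root error here is `NoClassDefFoundError: Could not initialize class org.apache.hadoop.hive.conf.HiveConf$ConfVars` on the executors. A common cause of this particular failure (a likely diagnosis, not something confirmed in this thread) is mixed Hive jar versions on the Spark classpath, so the same class is loaded from conflicting jars. A small sketch that scans a jar directory and reports which jars provide a given class, so duplicates stand out; the directory path in the example is hypothetical:

```python
import zipfile
from pathlib import Path

def jars_providing(lib_dir: str, class_name: str) -> list:
    """Return the jars under lib_dir that contain class_name.

    class_name uses the JVM dotted form, e.g.
    'org.apache.hadoop.hive.conf.HiveConf'. Inner classes such as
    HiveConf$ConfVars ship in the same jar as their outer class.
    """
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(lib_dir).glob("**/*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(str(jar))
        except zipfile.BadZipFile:
            pass  # skip corrupt files rather than abort the scan
    return hits

# Example usage (path is hypothetical): more than one hit usually means
# a version conflict that can leave HiveConf$ConfVars uninitializable.
# jars_providing("/opt/spark/jars", "org.apache.hadoop.hive.conf.HiveConf")
```

Running this over the Spark jars directory of each executor host (and over any extra jars Kylin submits) and comparing the hits is one way to confirm or rule out the conflict before digging further.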