Hi,

I recently ran some experiments with Hive, Spark, and Presto on the TPC-DS
benchmark, and I'd like to share the results with the community:
http://www.slideshare.net/ssuser6bb12d/hive-presto-and-spark-on-tpcds-benchmark 
I relied entirely on the benchmark kit from Hortonworks:
https://github.com/hortonworks/hive-testbench 

I have a question about query 72.
Hive LLAP outperforms Presto and Spark on most queries, but it performs very
poorly on query 72.
Presto also struggles with query 72, whereas Spark finishes it much faster
than Hive (pages 9 and 10).
I observed an unusual CPU utilization pattern while Presto and Hive were
executing query 72 (page 11).
When I turned off Spark's WholeStageCodeGen, Spark also took a very long time
to finish query 72 (page 12).
Is there a Hive feature I missed that would improve the performance of this
kind of query?
I used the following settings for the Hive experiments:
https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/testbench.settings

Apart from query 72, Hive with LLAP performs very well on both small and
large workloads.

- Dongwon Kim
