By the way, turning off code generation could be an option to try; sometimes 
code generation can introduce slowness.
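
A minimal sketch of how that could be tried from the spark-shell, assuming the 
sqlContext and the query string sql from the mails below; the exact property 
names governing code generation differ between releases, so treat these as 
assumptions to verify against your build:

sqlContext.setConf("spark.sql.codegen", "false")           // codegen switch in the 1.4/1.5 line (assumption: still honored in your build)
sqlContext.setConf("spark.sql.tungsten.enabled", "false")  // 1.5 Tungsten switch, which also covers the generated-code paths
sqlContext.sql(sql).explain(true)                          // re-check the physical plan before re-running the query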




On 2015-09-11 15:58, Cheng, Hao wrote:

Can you confirm whether the query really runs in cluster mode, not local mode? 
Can you print the call stack of the executor while the query is running?
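
One quick way to confirm the deployment mode from the spark-shell (a sketch 
using the sc the shell provides; "local[*]" would indicate local mode, a 
yarn-client/yarn-cluster master a cluster run):

println(sc.master)                                              // e.g. yarn-client vs local[*]
println(sc.getConf.get("spark.executor.instances", "not set"))  // how many executors were actually requested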

 

BTW: spark.shuffle.reduceLocality.enabled is a core Spark configuration, not a 
Spark SQL one.

 

From: Todd [mailto:bit1...@163.com]
Sent: Friday, September 11, 2015 3:39 PM
To: Todd
Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
compared with spark 1.4.1 SQL

 

I added the following two options:
spark.sql.planner.sortMergeJoin=false
spark.shuffle.reduceLocality.enabled=false

But it still performs the same as without setting them.
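
A small sketch (assuming a spark-shell session) to verify that both flags 
actually took effect, since the first is a SQL conf and the second a core 
Spark conf:

println(sqlContext.getConf("spark.sql.planner.sortMergeJoin", "not set"))
println(sc.getConf.get("spark.shuffle.reduceLocality.enabled", "not set"))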

One thing I noticed: on the Spark UI, when I click the SQL tab, it shows an 
empty page with only the header title 'SQL'; there is no table showing the 
queries and their execution plans.







At 2015-09-11 14:39:06, "Todd" <bit1...@163.com> wrote:




Thanks Hao.
Yes, it is still slow as SMJ. Let me try the option you suggested.

 


At 2015-09-11 14:34:46, "Cheng, Hao" <hao.ch...@intel.com> wrote:



You mean the performance is still as slow as with the SMJ in Spark 1.5?

 

Can you set spark.shuffle.reduceLocality.enabled=false when you start 
spark-shell/spark-sql? It’s a new feature in Spark 1.5 and it’s true by 
default, but we found that it can degrade performance dramatically.
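
For a standalone application (rather than spark-shell/spark-sql), a minimal 
sketch of the equivalent would be to set the flag on the SparkConf before the 
context is created; the app name here is only illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("store-sales-join-test")                   // hypothetical app name
  .set("spark.shuffle.reduceLocality.enabled", "false")  // disable the new 1.5 reduce-locality feature
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)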

 

 

From: Todd [mailto:bit1...@163.com]
Sent: Friday, September 11, 2015 2:17 PM
To: Cheng, Hao
Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org
Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with 
spark 1.4.1 SQL

 

Thanks Hao for the reply.
I turned the sort merge join off; the physical plan is below, but the 
performance is roughly the same as with it on...

== Physical Plan ==
TungstenProject 
[ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
 ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
  TungstenExchange hashpartitioning(ss_item_sk#2)
   ConvertToUnsafe
    Scan 
ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
  TungstenExchange hashpartitioning(ss_item_sk#25)
   ConvertToUnsafe
    Scan 
ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]

Code Generation: true






At 2015-09-11 13:48:23, "Cheng, Hao" <hao.ch...@intel.com> wrote:

It is not a big surprise that the SMJ is slower than the HashJoin, as we do not 
fully utilize the sorting yet; more details can be found at 
https://issues.apache.org/jira/browse/SPARK-2926 .

 

Anyway, can you disable the sort merge join with 
“spark.sql.planner.sortMergeJoin=false;” in Spark 1.5 and run the query again? 
In our previous testing, the sort merge join was about 20% slower. I am not 
sure whether anything else is slowing down the performance.
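
From a spark-shell session this could look like the following sketch (the 
trailing semicolon above is the spark-sql CLI SET syntax; setConf is the 
programmatic counterpart), assuming the same sqlContext and query string as in 
the earlier mails:

sqlContext.setConf("spark.sql.planner.sortMergeJoin", "false")
sqlContext.sql(sql).explain(true)   // the join node should now show up as ShuffledHashJoin instead of SortMergeJoin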

 

Hao

 

 

From: Jesse F Chen [mailto:jfc...@us.ibm.com]
Sent: Friday, September 11, 2015 1:18 PM
To: Michael Armbrust
Cc: Todd; user@spark.apache.org
Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 
1.4.1 SQL

 

Could this be a build issue (i.e., sbt package)?

If I run the same jar built for 1.4.1 on 1.5, I am seeing a large regression 
too in queries (all other things identical)...

I am curious: to build for 1.5 (when it isn't released yet), what do I need to 
do with the build.sbt file?

Are there any special parameters I should be using to make sure I load the 
latest Hive dependencies?
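
A minimal build.sbt sketch for compiling against Spark SQL 1.5 with Hive 
support; the project name and Scala version are placeholders, and while 1.5.0 
is unreleased the artifacts would have to come from a staging or snapshot 
repository:

name := "spark-sql-perf-test"          // hypothetical project name
scalaVersion := "2.10.4"               // Spark 1.5 is built against Scala 2.10 by default
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-hive" % "1.5.0" % "provided"  // brings in the Hive dependencies
)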


From: Michael Armbrust <mich...@databricks.com>
To: Todd <bit1...@163.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Date: 09/10/2015 11:07 AM
Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 
1.4.1 SQL




I've been running TPC-DS SF=1500 daily on Spark 1.4.1 and Spark 1.5 on S3, so 
this is surprising. In my experiments Spark 1.5 is either the same or faster 
than 1.4, with only small exceptions. A few thoughts:

 - 600 partitions is probably way too many for 6G of data.
 - Providing the output of explain for both runs would be helpful whenever 
reporting performance changes.
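
A short sketch of both suggestions, assuming the sqlContext and query string 
from the original mail below (100 is only an illustrative partition count, not 
a recommendation):

sqlContext.setConf("spark.sql.shuffle.partitions", "100")   // far fewer shuffle partitions for ~6G of input
val df = sqlContext.sql(sql)
df.explain(true)                                            // capture this output on both 1.4.1 and 1.5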

On Thu, Sep 10, 2015 at 1:24 AM, Todd <bit1...@163.com> wrote:

Hi,

I am using data generated with spark-sql-perf 
(https://github.com/databricks/spark-sql-perf) to test Spark SQL performance 
(Spark on YARN, with 10 nodes) with the following code (the table store_sales 
is about 90 million records, 6G in size):
 
val outputDir = "hdfs://tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales"
val name = "store_sales"

sqlContext.sql(
  s"""
     |CREATE TEMPORARY TABLE ${name}
     |USING org.apache.spark.sql.parquet
     |OPTIONS (
     |  path '${outputDir}'
     |)
   """.stripMargin)

val sql = """
    |select
    |  t1.ss_quantity,
    |  t1.ss_list_price,
    |  t1.ss_coupon_amt,
    |  t1.ss_cdemo_sk,
    |  t1.ss_item_sk,
    |  t1.ss_promo_sk,
    |  t1.ss_sold_date_sk
    |from store_sales t1 join store_sales t2 on t1.ss_item_sk = t2.ss_item_sk
    |where
    |  t1.ss_sold_date_sk between 2450815 and 2451179
  """.stripMargin

val df = sqlContext.sql(sql)
df.rdd.foreach(row => Unit)

With 1.4.1, I can finish the query in 6 minutes, but I need 10+ minutes with 
1.5.

The configuration is basically the same, since I copied the configuration from 
1.4.1 to 1.5:

sparkVersion                                     1.4.1    1.5.0
scaleFactor                                      30       30
spark.sql.shuffle.partitions                     600      600
spark.sql.sources.partitionDiscovery.enabled     true     true
spark.default.parallelism                        200      200
spark.driver.memory                              4G       4G
spark.executor.memory                            4G       4G
spark.executor.instances                         10       10
spark.shuffle.consolidateFiles                   true     true
spark.storage.memoryFraction                     0.4      0.4
spark.executor.cores                             3        3

I am not sure what is going wrong. Any ideas?

 
