I ran a similar benchmark for 1.5: a self join on a fact table whose
join key has many duplicate rows (N rows for the same join key), so
after the join there are N*N rows for each join key. Generating the
joined row is slower in 1.5 than in 1.4 (1.5 needs to copy the left
and right rows together, which 1.4 does not). If the generated row is
accessed after the join, there is not much difference between 1.5 and
1.4, because accessing the joined row is slower in 1.4 than in 1.5.
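The N*N blow-up is easy to reproduce in plain Scala (no Spark needed; the key and row values below are made up purely for illustration):

```scala
// A self-join on a key that appears N times on each side
// produces N*N output rows for that key.
object SelfJoinCardinality {
  def main(args: Array[String]): Unit = {
    val n = 4
    val left  = Seq.fill(n)(("item1", "left"))   // N rows, same join key
    val right = Seq.fill(n)(("item1", "right"))
    val joined = for {
      (k1, l) <- left
      (k2, r) <- right
      if k1 == k2                                // equi-join condition
    } yield (k1, l, r)
    println(joined.size)                         // N*N = 16 for this key
    assert(joined.size == n * n)
  }
}
```

Doubling N quadruples the joined output, which is why the per-row cost of generating joined rows dominates this particular benchmark.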

So, for this particular query, 1.5 is slower than 1.4, and will be even
slower as you increase N. But for real workloads it will not be; 1.5
is usually faster than 1.4.

On Fri, Sep 11, 2015 at 1:31 AM, prosp4300 <prosp4...@163.com> wrote:
>
>
> By the way, turning off code generation could be an option to try; sometimes 
> code generation can introduce slowness
>
>
> On 2015-09-11 15:58, Cheng, Hao wrote:
>
> Can you confirm whether the query really runs in cluster mode, not local 
> mode? Can you print the call stack of the executor while the query is running?
>
>
>
> BTW: spark.shuffle.reduceLocality.enabled is a configuration of Spark core, not 
> Spark SQL.
>
>
>
> From: Todd [mailto:bit1...@163.com]
> Sent: Friday, September 11, 2015 3:39 PM
> To: Todd
> Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
> Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
> compared with spark 1.4.1 SQL
>
>
>
> I add the following two options:
> spark.sql.planner.sortMergeJoin=false
> spark.shuffle.reduceLocality.enabled=false
>
> But it still performs the same as when not setting those two options.
>
> One thing is that in the Spark UI, when I click the SQL tab, it shows an 
> empty page with only the header title 'SQL'; there is no table showing queries and 
> execution plan information.
>
>
>
>
>
> At 2015-09-11 14:39:06, "Todd" <bit1...@163.com> wrote:
>
>
> Thanks Hao.
>  Yes, it is still slow with SMJ. Let me try the option you suggested.
>
>
>
>
> At 2015-09-11 14:34:46, "Cheng, Hao" <hao.ch...@intel.com> wrote:
>
> You mean the performance is still as slow as with SMJ in Spark 1.5?
>
>
>
> Can you set spark.shuffle.reduceLocality.enabled=false when you start the 
> spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true by 
> default, but we found it can cause performance to degrade dramatically.
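> A minimal sketch of setting this at launch time (standard spark-shell --conf usage; the flag value is taken from this thread):

```shell
# Disable the new reduce-task locality scheduling when starting the shell
spark-shell --conf spark.shuffle.reduceLocality.enabled=false
```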
>
>
>
>
>
> From: Todd [mailto:bit1...@163.com]
> Sent: Friday, September 11, 2015 2:17 PM
> To: Cheng, Hao
> Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org
> Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with 
> spark 1.4.1 SQL
>
>
>
> Thanks Hao for the reply.
> I turned the sort merge join off; the physical plan is below, but the 
> performance is roughly the same as with it on...
>
> == Physical Plan ==
> TungstenProject 
> [ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
>  ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
>   TungstenExchange hashpartitioning(ss_item_sk#2)
>    ConvertToUnsafe
>     Scan 
> ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
>   TungstenExchange hashpartitioning(ss_item_sk#25)
>    ConvertToUnsafe
>     Scan 
> ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]
>
> Code Generation: true
>
>
>
>
> At 2015-09-11 13:48:23, "Cheng, Hao" <hao.ch...@intel.com> wrote:
>
> It is not a big surprise that SMJ is slower than HashJoin, as we do not 
> fully utilize the sorting yet; more details can be found at 
> https://issues.apache.org/jira/browse/SPARK-2926 .
>
>
>
> Anyway, can you disable the sort merge join by setting 
> “spark.sql.planner.sortMergeJoin=false;” in Spark 1.5 and run the query 
> again? In our previous testing, sort merge join was about 20% slower. I 
> am not sure if anything else is slowing down the performance.
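> One way to apply this from inside the shell (a sketch using SQLContext.setConf; it assumes a running sqlContext as in the benchmark code later in this thread):

```scala
// Sketch: disable sort merge join for the current session (Spark 1.5),
// forcing the planner to fall back to ShuffledHashJoin.
sqlContext.setConf("spark.sql.planner.sortMergeJoin", "false")
```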
>
>
>
> Hao
>
>
>
>
>
> From: Jesse F Chen [mailto:jfc...@us.ibm.com]
> Sent: Friday, September 11, 2015 1:18 PM
> To: Michael Armbrust
> Cc: Todd; user@spark.apache.org
> Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with 
> spark 1.4.1 SQL
>
>
>
> Could this be a build issue (i.e., sbt package)?
>
> If I run the same jar built for 1.4.1 on 1.5, I am seeing a large regression 
> too in queries (all other things identical)...
>
> I am curious: to build 1.5 (since it isn't released yet), what do I need to do 
> with the build.sbt file?
>
> Any special parameters I should be using to make sure I load the latest Hive 
> dependencies?
>
>
> From: Michael Armbrust <mich...@databricks.com>
> To: Todd <bit1...@163.com>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Date: 09/10/2015 11:07 AM
> Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with 
> spark 1.4.1 SQL
>
> ________________________________
>
>
>
>
> I've been running TPC-DS SF=1500 daily on Spark 1.4.1 and Spark 1.5 on S3, so 
> this is surprising. In my experiments Spark 1.5 is either the same as or faster 
> than 1.4, with only small exceptions. A few thoughts:
>
>  - 600 partitions is probably way too many for 6G of data.
>  - Providing the output of explain for both runs would be helpful whenever 
> reporting performance changes.
>
> On Thu, Sep 10, 2015 at 1:24 AM, Todd <bit1...@163.com> wrote:
>
> Hi,
>
> I am using data generated with 
> spark-sql-perf (https://github.com/databricks/spark-sql-perf) to test Spark 
> SQL performance (Spark on YARN, with 10 nodes) with the following code (the 
> table store_sales is about 90 million records, 6 GB in size):
>
> val 
> outputDir="hdfs://tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales"
> val name="store_sales"
>     sqlContext.sql(
>       s"""
>           |CREATE TEMPORARY TABLE ${name}
>           |USING org.apache.spark.sql.parquet
>           |OPTIONS (
>           |  path '${outputDir}'
>           |)
>         """.stripMargin)
>
> val sql="""
>          |select
>          |  t1.ss_quantity,
>          |  t1.ss_list_price,
>          |  t1.ss_coupon_amt,
>          |  t1.ss_cdemo_sk,
>          |  t1.ss_item_sk,
>          |  t1.ss_promo_sk,
>          |  t1.ss_sold_date_sk
>          |from store_sales t1 join store_sales t2 on t1.ss_item_sk = 
> t2.ss_item_sk
>          |where
>          |  t1.ss_sold_date_sk between 2450815 and 2451179
>        """.stripMargin
>
> val df = sqlContext.sql(sql)
> df.rdd.foreach(row=>Unit)
>
> With 1.4.1, I can finish the query in 6 minutes, but I need 10+ minutes 
> with 1.5.
>
> The configuration is basically the same, since I copied the configuration from 
> 1.4.1 to 1.5:
>
> sparkVersion    1.4.1        1.5.0
> scaleFactor    30        30
> spark.sql.shuffle.partitions    600        600
> spark.sql.sources.partitionDiscovery.enabled    true        true
> spark.default.parallelism    200        200
> spark.driver.memory    4G        4G
> spark.executor.memory    4G        4G
> spark.executor.instances    10        10
> spark.shuffle.consolidateFiles    true        true
> spark.storage.memoryFraction    0.4        0.4
> spark.executor.cores    3        3
>
> I am not sure what is going wrong. Any ideas?
>
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
