Status update after some more tests. I modified a few other parameters and found two that may be relevant: SPARK_WORKER_INSTANCES and spark.sql.shuffle.partitions.

Until today I had used the default settings for both, i.e. 1 worker instance and 200 shuffle partitions. With those, my app stopped making progress at 5/200 tasks. After I changed SPARK_WORKER_INSTANCES to 2, it moved on to about 116/200 tasks, and with 4 instances it got further, to 176/200. However, raising it to 8 or even 12 workers made no difference; it still stops at 176/200.

I then made a new finding while trying different values of spark.sql.shuffle.partitions. With 50, 400, and 800 partitions, the job stops at 26/50, 376/400, and 776/800 tasks respectively, always leaving 24 tasks unable to finish. I am not sure why this happens. I hope this information helps in solving it.
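For reference, the two settings mentioned above can be changed as sketched below. SPARK_WORKER_INSTANCES and spark.sql.shuffle.partitions are the real setting names; the values and the application jar name are illustrative, not taken from this cluster:

```shell
# conf/spark-env.sh -- number of worker processes started per machine
# in standalone mode (default is 1):
export SPARK_WORKER_INSTANCES=4

# Shuffle partition count used by Spark SQL for joins and aggregations
# (default is 200), set per application at submit time:
spark-submit --conf spark.sql.shuffle.partitions=400 my_app.jar
```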
--------------------------------
Thanks & best regards!
罗辉 San.Luo

----- Original Message -----
From: <luohui20...@sina.com>
To: "Cheng, Hao" <hao.ch...@intel.com>, "Wang, Daoyuan" <daoyuan.w...@intel.com>, "Olivier Girardot" <ssab...@gmail.com>, "user" <user@spark.apache.org>
Subject: Re: RE: Re: sparksql running slow while joining 2 tables.
Date: May 6, 2015, 09:51

db has 1.7 million records while sample has 0.6 million. As for the JVM settings, I tried the defaults and also tried to allocate 4g by "export _java_opts 4g", but the app still stops running. BTW, here is some detailed info about GC and the JVM.

----- Original Message -----
From: "Cheng, Hao" <hao.ch...@intel.com>
To: "luohui20...@sina.com" <luohui20...@sina.com>, "Wang, Daoyuan" <daoyuan.w...@intel.com>, Olivier Girardot <ssab...@gmail.com>, user <user@spark.apache.org>
Subject: RE: Re: sparksql running slow while joining 2 tables.
Date: May 5, 2015, 20:50

56 MB / 26 MB is a very small size. Do you observe data skew? More precisely, many records with the same chrname / name? And can you also double-check the JVM settings for the executor process?

From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Tuesday, May 5, 2015 7:50 PM
To: Cheng, Hao; Wang, Daoyuan; Olivier Girardot; user
Subject: Re: Re: sparksql running slow while joining 2 tables.

Hi guys, attached is a picture of the physical plan and the logs. Thanks.

--------------------------------
Thanks & best regards!
罗辉 San.Luo

----- Original Message -----
From: "Cheng, Hao" <hao.ch...@intel.com>
To: "Wang, Daoyuan" <daoyuan.w...@intel.com>, "luohui20...@sina.com" <luohui20...@sina.com>, Olivier Girardot <ssab...@gmail.com>, user <user@spark.apache.org>
Subject: Re: sparksql running slow while joining 2 tables.
Date: May 5, 2015, 13:18

I assume you're using the DataFrame API within your application: sql("SELECT…").explain(true)

From: Wang, Daoyuan
Sent: Tuesday, May 5, 2015 10:16 AM
To: luohui20...@sina.com; Cheng, Hao; Olivier Girardot; user
Subject: RE: Re: RE: Re: sparksql running slow while joining 2 tables.
You can use EXPLAIN EXTENDED SELECT ….

From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Tuesday, May 05, 2015 9:52 AM
To: Cheng, Hao; Olivier Girardot; user
Subject: Re: RE: Re: sparksql running slow while joining 2 tables.

As far as I know, broadcast join is enabled automatically via spark.sql.autoBroadcastJoinThreshold; refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options. And how can I check my app's physical plan, and other things like the optimized plan, executable plan, etc.? Thanks.

--------------------------------
Thanks & best regards!
罗辉 San.Luo

----- Original Message -----
From: "Cheng, Hao" <hao.ch...@intel.com>
To: "Cheng, Hao" <hao.ch...@intel.com>, "luohui20...@sina.com" <luohui20...@sina.com>, Olivier Girardot <ssab...@gmail.com>, user <user@spark.apache.org>
Subject: RE: Re: sparksql running slow while joining 2 tables.
Date: May 5, 2015, 08:38

Or, have you ever tried a broadcast join?

From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Tuesday, May 5, 2015 8:33 AM
To: luohui20...@sina.com; Olivier Girardot; user
Subject: RE: Re: sparksql running slow while joining 2 tables.

Can you print out the physical plan? EXPLAIN SELECT xxx…

From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Monday, May 4, 2015 9:08 PM
To: Olivier Girardot; user
Subject: Re: sparksql running slow while joining 2 tables.

Hi Olivier,

Spark 1.3.1, with Java 1.8.0_45; 2 pictures attached. It seems like a GC issue. I also tried different parameters such as driver & executor memory size, memory fraction, and java opts, but this issue still happens.

--------------------------------
Thanks & best regards!
罗辉 San.Luo

----- Original Message -----
From: Olivier Girardot <ssab...@gmail.com>
To: luohui20...@sina.com, user <user@spark.apache.org>
Subject: Re: sparksql running slow while joining 2 tables.
Date: May 4, 2015, 20:46

Hi,
What is your Spark version?
Regards,
Olivier.

On Mon,
May 4, 2015 at 11:03, <luohui20...@sina.com> wrote:

Hi guys,

When I run a SQL query like "select a.name,a.startpoint,a.endpoint, a.piece from db a join sample b on (a.name = b.name) where (b.startpoint > a.startpoint + 25);", I find Spark SQL runs slowly, taking minutes, which may be caused by very long GC and shuffle times. Table db is created from a txt file of 56 MB, while table sample is 26 MB; both are small. My Spark cluster is a standalone pseudo-distributed cluster with an 8g executor and a 4g driver. Any advice? Thank you, guys.

--------------------------------
Thanks & best regards!
罗辉 San.Luo

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
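The two suggestions made earlier in the thread (print the physical plan; check for data skew on the join key) can both be tried in the spark-sql shell. A sketch, reusing the table and column names from the query in this thread; the LIMIT value is arbitrary:

```sql
-- Print the parsed, analyzed, optimized, and physical plans for the
-- slow join (answers the "how to check my app's physical plan" question):
EXPLAIN EXTENDED
SELECT a.name, a.startpoint, a.endpoint, a.piece
FROM db a JOIN sample b ON (a.name = b.name)
WHERE (b.startpoint > a.startpoint + 25);

-- Check for join-key skew: a few very large counts mean a few shuffle
-- tasks carry most of the data and run far longer than the rest.
SELECT name, COUNT(*) AS cnt
FROM db
GROUP BY name
ORDER BY cnt DESC
LIMIT 20;
```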