Hi Hao,

I tried a broadcast join with the following steps, but my query is still running slowly. I'm not sure whether I'm using broadcast join correctly:

1. Added "spark.sql.autoBroadcastJoinThreshold 104857600" (100 MB) to conf/spark-defaults.conf. 100 MB is larger than either of my two tables.
2. Started bin/spark-sql and confirmed that the setting took effect, both on the Environment page of my Spark cluster web UI and in the spark-sql console.
3. Ran "ANALYZE TABLE db1 COMPUTE STATISTICS noscan" and "ANALYZE TABLE sample3 COMPUTE STATISTICS noscan", then cached both tables.
4. Used an extended EXPLAIN on my query and confirmed that BroadcastHashJoin is used in the physical plan.
5. Ran my query: "select a.chrname,a.startpoint,a.endpoint, a.piece from db1 a join sample3 b on (a.chrname = b.name) where (b.startpoint > a.startpoint + 25) and b.endpoint <= a.endpoint;"

If there are any mistakes in what I did, please point them out. Thanks.
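For reference, the steps above correspond roughly to the spark-sql session below. This is only a sketch: it uses SET in the console as a stand-in for the entry in the conf file, and CACHE TABLE is an assumption about how the tables were cached, since that detail is not stated above.

```sql
-- Raise the broadcast threshold to 100 MB (session-level stand-in for the conf file entry).
SET spark.sql.autoBroadcastJoinThreshold=104857600;
-- Print the current value to confirm that the setting took effect.
SET spark.sql.autoBroadcastJoinThreshold;

-- Collect table-size statistics so the planner can compare them with the threshold.
ANALYZE TABLE db1 COMPUTE STATISTICS noscan;
ANALYZE TABLE sample3 COMPUTE STATISTICS noscan;

-- Cache both tables (assumed caching method).
CACHE TABLE db1;
CACHE TABLE sample3;

-- Inspect the plan; BroadcastHashJoin should appear in the physical plan.
EXPLAIN EXTENDED
select a.chrname, a.startpoint, a.endpoint, a.piece
from db1 a join sample3 b on (a.chrname = b.name)
where (b.startpoint > a.startpoint + 25) and b.endpoint <= a.endpoint;
```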
--------------------------------
Thanks & Best regards!
San.Luo

----- Original Message -----
From: "Cheng, Hao" <[email protected]>
To: "Cheng, Hao" <[email protected]>, "[email protected]" <[email protected]>, Olivier Girardot <[email protected]>, user <[email protected]>
Subject: RE: Reply: Re: sparksql running slow while joining_2_tables.
Date: 2015-05-05 08:38

Or, have you tried a broadcast join?

From: Cheng, Hao [mailto:[email protected]]
Sent: Tuesday, May 5, 2015 8:33 AM
To: [email protected]; Olivier Girardot; user
Subject: RE: Reply: Re: sparksql running slow while joining 2 tables.

Can you print out the physical plan?

EXPLAIN SELECT xxx…

From: [email protected] [mailto:[email protected]]
Sent: Monday, May 4, 2015 9:08 PM
To: Olivier Girardot; user
Subject: Reply: Re: sparksql running slow while joining 2 tables.

Hi Olivier,

Spark 1.3.1 with Java 1.8.0_45. I have attached 2 pictures. It seems like a GC issue. I also tried different parameters, such as the driver and executor memory sizes, memory fraction, and Java opts, but the issue still happens.

--------------------------------
Thanks & Best regards!
罗辉 San.Luo

----- Original Message -----
From: Olivier Girardot <[email protected]>
To: [email protected], user <[email protected]>
Subject: Re: sparksql running slow while joining 2 tables.
Date: 2015-05-04 20:46

Hi,
What is your Spark version?
Regards,
Olivier.

On Mon, May 4, 2015 at 11:03, <[email protected]> wrote:

Hi guys,

When I run a SQL query like "select a.name,a.startpoint,a.endpoint, a.piece from db a join sample b on (a.name = b.name) where (b.startpoint > a.startpoint + 25);" I find that Spark SQL runs slowly, taking minutes, which may be caused by very long GC and shuffle times. Table db is created from a 56 MB txt file, and table sample is 26 MB; both are small. My Spark cluster is a standalone pseudo-distributed cluster with an 8 GB executor and a 4 GB driver. Any advice? Thank you, guys.

--------------------------------
Thanks & Best regards!
罗辉 San.Luo
