Hi Gavin, for the first time someone is responding to this thread with a meaningful conversation - thanks for that.
Okay, I did not tweak the spark.sql.autoBroadcastJoinThreshold parameter and since the cached field was around 75 MB therefore I do not think that broadcast join was used. But I will surely be excited to see if I am going wrong here and post the results of sql.describe(). Thanks a ton once again. Hi Ted, Is there anyway you can throw some light on this before I post this in a blog? Regards, Gourav Sengupta On Fri, Jun 10, 2016 at 7:22 PM, Gavin Yue <yue.yuany...@gmail.com> wrote: > Yes. because in the second query, you did a (select PK from A) A . I > guess it could the the subquery makes the results much smaller and make > the broadcastJoin, so it is much faster. > > you could use sql.describe() to check the execution plan. > > > On Fri, Jun 10, 2016 at 1:41 AM, Gourav Sengupta < > gourav.sengu...@gmail.com> wrote: > >> Hi, >> >> I think if we try to see why is Query 2 faster than Query 1 then all the >> answers will be given without beating around the bush. That is the right >> way to find out what is happening and why. >> >> >> Regards, >> Gourav >> >> On Thu, Jun 9, 2016 at 11:19 PM, Gavin Yue <yue.yuany...@gmail.com> >> wrote: >> >>> Could you print out the sql execution plan? My guess is about broadcast >>> join. >>> >>> >>> >>> On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com> >>> wrote: >>> >>> Hi, >>> >>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening >>> here and is there a way we can optimize the queries in SPARK without the >>> obvious hack in Query2. >>> >>> >>> ----------------------- >>> ENVIRONMENT: >>> ----------------------- >>> >>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3 >>> million rows. Both the files are single gzipped csv file. >>> > Both table A and B are external tables in AWS S3 and created in HIVE >>> accessed through SPARK using HiveContext >>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using >>> allowMaximumResource allocation and node types are c3.4xlarge). >>> >>> -------------- >>> QUERY1: >>> -------------- >>> select A.PK, B.FK >>> from A >>> left outer join B on (A.PK = B.FK) >>> where B.FK is not null; >>> >>> >>> >>> This query takes 4 mins in HIVE and 1.1 hours in SPARK >>> >>> >>> -------------- >>> QUERY 2: >>> -------------- >>> >>> select A.PK, B.FK >>> from (select PK from A) A >>> left outer join B on (A.PK = B.FK) >>> where B.FK is not null; >>> >>> This query takes 4.5 mins in SPARK >>> >>> >>> >>> Regards, >>> Gourav Sengupta >>> >>> >>> >>> >> >