Hi Gavin,

for the first time someone is responding to this thread with a meaningful
conversation - thanks for that.

Okay, I did not tweak the spark.sql.autoBroadcastJoinThreshold parameter
and since the cached field was around 75 MB therefore I do not think that
broadcast join was used.

But I will surely be excited to see if I am going wrong here and post the
results of sql.describe(). Thanks a ton once again.


Hi Ted,

Is there anyway you can throw some light on this before I post this in a
blog?


Regards,
Gourav Sengupta


On Fri, Jun 10, 2016 at 7:22 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

> Yes.  because in the second query, you did a  (select PK from A) A .  I
>  guess it could the the subquery makes the results much smaller and make
> the broadcastJoin, so it is much faster.
>
> you could use sql.describe() to check the execution plan.
>
>
> On Fri, Jun 10, 2016 at 1:41 AM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> I think if we try to see why is Query 2 faster than Query 1 then all the
>> answers will be given without beating around the bush. That is the right
>> way to find out what is happening and why.
>>
>>
>> Regards,
>> Gourav
>>
>> On Thu, Jun 9, 2016 at 11:19 PM, Gavin Yue <yue.yuany...@gmail.com>
>> wrote:
>>
>>> Could you print out the sql execution plan? My guess is about broadcast
>>> join.
>>>
>>>
>>>
>>> On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>>> here and is there a way we can optimize the queries in SPARK without the
>>> obvious hack in Query2.
>>>
>>>
>>> -----------------------
>>> ENVIRONMENT:
>>> -----------------------
>>>
>>> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
>>> million rows. Both the files are single gzipped csv file.
>>> > Both table A and B are external tables in AWS S3 and created in HIVE
>>> accessed through SPARK using HiveContext
>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>>> allowMaximumResource allocation and node types are c3.4xlarge).
>>>
>>> --------------
>>> QUERY1:
>>> --------------
>>> select A.PK, B.FK
>>> from A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;
>>>
>>>
>>>
>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>>>
>>>
>>> --------------
>>> QUERY 2:
>>> --------------
>>>
>>> select A.PK, B.FK
>>> from (select PK from A) A
>>> left outer join B on (A.PK = B.FK)
>>> where B.FK is not null;
>>>
>>> This query takes 4.5 mins in SPARK
>>>
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>>
>>>
>>>
>>
>

Reply via email to