Hi Mich, does not Hive use map-reduce? I thought it to be so. And since I am running the queries in EMR 4.6 therefore HIVE is not using TEZ.
Regards, Gourav On Thu, Jun 9, 2016 at 3:25 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > are you using map-reduce with Hive? > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 9 June 2016 at 15:14, Gourav Sengupta <gourav.sengu...@gmail.com> > wrote: > >> Hi, >> >> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here >> and is there a way we can optimize the queries in SPARK without the obvious >> hack in Query2. >> >> >> ----------------------- >> ENVIRONMENT: >> ----------------------- >> >> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3 >> million rows. Both the files are single gzipped csv file. >> > Both table A and B are external tables in AWS S3 and created in HIVE >> accessed through SPARK using HiveContext >> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using >> allowMaximumResource allocation and node types are c3.4xlarge). >> >> -------------- >> QUERY1: >> -------------- >> select A.PK, B.FK >> from A >> left outer join B on (A.PK = B.FK) >> where B.FK is not null; >> >> >> >> This query takes 4 mins in HIVE and 1.1 hours in SPARK >> >> >> -------------- >> QUERY 2: >> -------------- >> >> select A.PK, B.FK >> from (select PK from A) A >> left outer join B on (A.PK = B.FK) >> where B.FK is not null; >> >> This query takes 4.5 mins in SPARK >> >> >> >> Regards, >> Gourav Sengupta >> >> >> >> >