Sounds like this is a one-off case. Do you have any other use case where Hive on MR outperforms Spark?
I did some tests on a 1 billion row table, getting the selectivity of a column using Hive on MR, Hive on the Spark engine, and Spark running in local mode (to keep it simple). Hive 2, Spark 1.6.1.

Results:

Hive with map-reduce --> 18 minutes
Hive on Spark engine --> 6 minutes
Spark                --> 2 minutes

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 16 June 2016 at 08:43, Jörn Franke <jornfra...@gmail.com> wrote:

> I agree here.
>
> However, it always depends on your use case!
>
> Best regards
>
> On 16 Jun 2016, at 04:58, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi Mahender,
>
> please ensure that for dimension tables you are enabling the broadcast
> method. You should be able to see surprising gains of around 12x.
>
> Overall I think that SPARK cannot figure out whether to scan all the
> columns in a table or just the ones which are being used, which causes
> this issue.
>
> When you start using HIVE with ORC and TEZ you will see some amazing
> results, which leave SPARK way behind. So pretty much you need to have
> your data in memory to match the performance claims of SPARK, and the
> advantage in that case comes not from SPARK's algorithms but simply from
> fast I/O from RAM. The advantage of SPARK is that it brings analytics,
> querying, and streaming frameworks together in an accessible way.
>
> In case you are following the optimisations mentioned in this link, you
> hardly have any reason to use SPARK SQL:
> http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ . And
> imagine being able to do all of that without needing machines with huge
> RAM; in short, you are achieving those performance gains using the
> commodity low-cost systems around which HADOOP was designed.
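[Editor's note: the selectivity measurement Mich describes above (distinct values over total rows) can be sketched in plain SQL. Since Hive is not available here, this is an illustrative stand-in using Python's sqlite3; the table name `events` and column `status` are invented for the example.]

```python
import sqlite3

# In-memory stand-in for the 1-billion-row Hive table (names are invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (status TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [("new",), ("new",), ("open",), ("closed",), ("open",), ("new",)],
)

# Selectivity of a column: number of distinct values / total rows.
# In HiveQL this would be roughly:
#   SELECT COUNT(DISTINCT status) / COUNT(*) FROM events;
ndv, total = conn.execute(
    "SELECT COUNT(DISTINCT status), COUNT(*) FROM events"
).fetchone()
selectivity = ndv / total
print(f"distinct={ndv} rows={total} selectivity={selectivity:.3f}")
```

On Hive the interesting part is not the query itself but how long each engine takes to scan the billion rows to answer it, which is what the timings above compare.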
> I think that Hortonworks is giving stiff competition here :)
>
> Regards,
> Gourav Sengupta
>
> On Wed, Jun 15, 2016 at 11:35 PM, Mahender Sarangam <
> mahender.bigd...@outlook.com> wrote:
>
>> +1
>>
>> We also see performance degradation when comparing Spark SQL with Hive.
>> We have a table of 260 columns, and we executed the same query in Hive
>> and in SPARK. In Hive it takes 66 sec for 1 GB of data, whereas in Spark
>> it takes 4 mins.
>>
>> On 6/9/2016 3:19 PM, Gavin Yue wrote:
>>
>> Could you print out the SQL execution plan? My guess is it is about the
>> broadcast join.
>>
>> On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>
>> Hi,
>>
>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>> here, and is there a way we can optimize the queries in SPARK without
>> the obvious hack in Query2?
>>
>> -----------------------
>> ENVIRONMENT:
>> -----------------------
>>
>> > Table A has 533 columns x 24 million rows and Table B has 2 columns x
>> 3 million rows. Both are single gzipped CSV files.
>> > Both tables A and B are external tables in AWS S3, created in HIVE and
>> accessed through SPARK using HiveContext.
>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>> maximizeResourceAllocation and node type c3.4xlarge).
>>
>> --------------
>> QUERY1:
>> --------------
>>
>> select A.PK, B.FK
>> from A
>> left outer join B on (A.PK = B.FK)
>> where B.FK is not null;
>>
>> This query takes 4 mins in HIVE and 1.1 hours in SPARK.
>>
>> --------------
>> QUERY 2:
>> --------------
>>
>> select A.PK, B.FK
>> from (select PK from A) A
>> left outer join B on (A.PK = B.FK)
>> where B.FK is not null;
>>
>> This query takes 4.5 mins in SPARK.
>>
>> Regards,
>> Gourav Sengupta
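[Editor's note: the Query1 vs Query2 rewrite quoted above (projecting only the join key from the wide table before joining) can be reproduced on any SQL engine. The sketch below uses Python's sqlite3 with tiny invented tables `A` and `B`; it shows only that the two forms return the same rows. The 25x gap Gourav reports is a plan difference, not a semantic one: without the manual pruning, Spark 1.6 materialises all 533 columns of the gzipped CSV through the join.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Tiny stand-ins for the wide table A (533 columns) and 2-column table B.
conn.execute("CREATE TABLE A (PK INTEGER, col1 TEXT, col2 TEXT)")
conn.execute("CREATE TABLE B (FK INTEGER, val TEXT)")
conn.executemany("INSERT INTO A VALUES (?,?,?)",
                 [(1, "a", "x"), (2, "b", "y"), (3, "c", "z")])
conn.executemany("INSERT INTO B VALUES (?,?)", [(2, "m"), (3, "n")])

# Query1 as posted: join the full wide table.
query1 = """
    SELECT A.PK, B.FK
    FROM A
    LEFT OUTER JOIN B ON (A.PK = B.FK)
    WHERE B.FK IS NOT NULL
"""
# Query2 as posted: project only the join key first, so the engine never
# carries the remaining columns of A through the join.
query2 = """
    SELECT A.PK, B.FK
    FROM (SELECT PK FROM A) A
    LEFT OUTER JOIN B ON (A.PK = B.FK)
    WHERE B.FK IS NOT NULL
"""
r1 = sorted(conn.execute(query1).fetchall())
r2 = sorted(conn.execute(query2).fetchall())
assert r1 == r2  # identical result set; only the execution plan differs
print(r1)  # [(2, 2), (3, 3)]
```

Note also that a left outer join followed by `WHERE B.FK IS NOT NULL` is equivalent to an inner join, which is another rewrite an optimizer may or may not apply for you.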