I agree here. However, it always depends on your use case!
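For the broadcast point raised below, here is a minimal sketch of how to do it explicitly on Spark 1.6 (assuming a HiveContext bound to `sqlContext` as in spark-shell, the table names A and B from the thread, and an example threshold value; this is a sketch, not a tested job):

```scala
import org.apache.spark.sql.functions.broadcast

// Let the optimizer broadcast any table smaller than ~100 MB on its own.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
  (100 * 1024 * 1024).toString)

// Or force it: select only the join key from the wide table (mirrors the
// Query2 hack, since a gzipped CSV gives Spark no column pruning for free)
// and broadcast the small dimension table explicitly.
val a = sqlContext.table("A").select("PK")
val b = sqlContext.table("B")
val joined = a.join(broadcast(b), a("PK") === b("FK"), "left_outer")
joined.where(b("FK").isNotNull).show()
```

With the dimension table broadcast, the shuffle of the 24-million-row table disappears, which is usually where the gap against Hive comes from.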
Best regards

> On 16 Jun 2016, at 04:58, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi Mahender,
>
> please ensure that for dimension tables you are enabling the broadcast
> method. You should be able to see surprising gains of around 12x.
>
> Overall I think that SPARK cannot figure out whether to scan all the
> columns in a table or just the ones which are being used, which causes
> this issue.
>
> When you start using HIVE with ORC and TEZ you will see some amazing
> results, and it leaves SPARK way behind. So pretty much you need to have
> your data in memory to match the performance claims of SPARK, and the
> advantage you are getting in that case is not because of SPARK's
> algorithms but just fast I/O from RAM. The advantage of SPARK is that it
> brings accessible analytics, querying, and streaming frameworks together.
>
> In case you are following the optimisations mentioned in the link below,
> you hardly have any reason for using SPARK SQL:
> http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ . And
> imagine being able to do all of that without having machines which
> require huge RAM; in short, you are achieving those performance gains
> using the commodity low-cost systems around which HADOOP was designed.
>
> I think that Hortonworks is giving stiff competition here :)
>
> Regards,
> Gourav Sengupta
>
>> On Wed, Jun 15, 2016 at 11:35 PM, Mahender Sarangam
>> <mahender.bigd...@outlook.com> wrote:
>> +1,
>>
>> We also see performance degradation when comparing Spark SQL with Hive.
>> We have a table of 260 columns and executed the same query in Hive and
>> SPARK. In Hive it takes 66 sec for 1 GB of data, whereas in Spark it
>> takes 4 mins.
>>
>>> On 6/9/2016 3:19 PM, Gavin Yue wrote:
>>> Could you print out the SQL execution plan? My guess is that it is
>>> about the broadcast join.
>>>
>>> On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening
>>>> here, and is there a way we can optimize the queries in SPARK without
>>>> the obvious hack in Query2?
>>>>
>>>> -----------------------
>>>> ENVIRONMENT:
>>>> -----------------------
>>>>
>>>> Table A has 533 columns x 24 million rows and Table B has 2 columns x
>>>> 3 million rows. Both are single gzipped CSV files.
>>>> Both tables A and B are external tables in AWS S3, created in HIVE and
>>>> accessed through SPARK using HiveContext.
>>>> EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
>>>> maximizeResourceAllocation and node type c3.4xlarge).
>>>>
>>>> --------------
>>>> QUERY1:
>>>> --------------
>>>> select A.PK, B.FK
>>>> from A
>>>> left outer join B on (A.PK = B.FK)
>>>> where B.FK is not null;
>>>>
>>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK.
>>>>
>>>> --------------
>>>> QUERY 2:
>>>> --------------
>>>> select A.PK, B.FK
>>>> from (select PK from A) A
>>>> left outer join B on (A.PK = B.FK)
>>>> where B.FK is not null;
>>>>
>>>> This query takes 4.5 mins in SPARK.
>>>>
>>>> Regards,
>>>> Gourav Sengupta