I agree here.

However, it always depends on your use case!

Best regards

> On 16 Jun 2016, at 04:58, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> 
> Hi Mahender, 
> 
> Please make sure that you enable the broadcast join method for dimension 
> tables. You should see surprising gains, around 12x. 
> 
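To make the broadcast advice above concrete, here is a rough sketch of what a broadcast (map-side) hash join does, in plain Python. This is not Spark's implementation; the function name and the toy key lists are made up for illustration. As far as I know, the related setting in Spark SQL is spark.sql.autoBroadcastJoinThreshold, but whether the planner picks a broadcast join for a given table depends on the size statistics it has.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def broadcast_left_outer_join(
    fact_keys: Iterable[int],
    dim_keys: Iterable[int],
) -> Iterator[Tuple[int, Optional[int]]]:
    """Left outer join of a large key stream against a small in-memory table."""
    # Build phase: hash the small (dimension) side once. On a cluster, this
    # lookup table is what would be shipped ("broadcast") to every worker.
    lookup: Dict[int, List[int]] = {}
    for fk in dim_keys:
        lookup.setdefault(fk, []).append(fk)

    # Probe phase: stream the large (fact) side row by row. The large side
    # is never shuffled or sorted, which is where the speed-up comes from.
    for pk in fact_keys:
        matches = lookup.get(pk)
        if matches is None:
            yield (pk, None)  # unmatched row is kept, as in LEFT OUTER JOIN
        else:
            for fk in matches:
                yield (pk, fk)


# Hypothetical toy data standing in for A.PK and B.FK.
table_a = [1, 2, 3, 4]
table_b = [2, 4, 4]

joined = list(broadcast_left_outer_join(table_a, table_b))
# Equivalent of "where B.FK is not null" applied after the join.
matched = [row for row in joined if row[1] is not None]
```

The sketch only works when the small side fits in memory on every worker, which is why this is usually recommended for dimension tables.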
> Overall, I think that SPARK cannot figure out whether to scan all the columns 
> in a table or just the ones that are actually used, which causes this issue. 
> 
> When you start using HIVE with ORC and TEZ (*), you will see some amazing 
> results that leave SPARK way behind. So you pretty much need to have 
> your data in memory to match the performance claims of SPARK, and the 
> advantage you get in that case comes not from SPARK's algorithms but 
> just from fast I/O from RAM. The advantage of SPARK is that it makes 
> analytics, querying, and streaming frameworks accessible together.
> 
> 
> If you follow the optimisations mentioned in the link below, you hardly 
> have any reason to use SPARK SQL: 
> http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ . And 
> imagine being able to do all of that without machines that require 
> huge amounts of RAM. In short, you achieve those performance gains on the 
> low-cost commodity systems around which HADOOP was designed. 
> 
> I think that Hortonworks is giving stiff competition here :)
> 
> Regards,
> Gourav Sengupta
> 
>> On Wed, Jun 15, 2016 at 11:35 PM, Mahender Sarangam 
>> <mahender.bigd...@outlook.com> wrote:
>> +1,
>> 
>> We also see performance degradation when comparing Spark SQL with Hive. 
>> We have a table of 260 columns and ran it in both Hive and Spark. In Hive, 
>> it takes 66 sec for 1 GB of data, whereas in Spark it takes 4 mins. 
>>> On 6/9/2016 3:19 PM, Gavin Yue wrote:
>>> Could you print out the SQL execution plan? My guess is that this is 
>>> about the broadcast join. 
>>> 
>>> 
>>> 
>>> On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here, 
>>>> and is there a way we can optimize the queries in SPARK without the 
>>>> obvious hack in Query2?
>>>> 
>>>> 
>>>> -----------------------
>>>> ENVIRONMENT:
>>>> -----------------------
>>>> 
>>>> > Table A has 533 columns x 24 million rows and Table B has 2 columns x 3 
>>>> > million rows. Each table is stored as a single gzipped CSV file.
>>>> > Both table A and B are external tables in AWS S3 and created in HIVE 
>>>> > accessed through SPARK using HiveContext
>>>> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using 
>>>> > allowMaximumResource allocation and node types are c3.4xlarge).
>>>> 
>>>> --------------
>>>> QUERY1: 
>>>> --------------
>>>> select A.PK, B.FK
>>>> from A 
>>>> left outer join B on (A.PK = B.FK)
>>>> where B.FK is not null;
>>>> 
>>>> 
>>>> 
>>>> This query takes 4 mins in HIVE and 1.1 hours in SPARK 
>>>> 
>>>> 
>>>> --------------
>>>> QUERY 2:
>>>> --------------
>>>> 
>>>> select A.PK, B.FK
>>>> from (select PK from A) A 
>>>> left outer join B on (A.PK = B.FK)
>>>> where B.FK is not null;
>>>> 
>>>> This query takes 4.5 mins in SPARK 
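
One observation on the two queries above: because the WHERE clause throws away every row where B.FK is NULL, the left outer join returns the same rows as a plain inner join. Below is a small check of that using Python's built-in sqlite3 on made-up toy tables (the payload and other columns are placeholders of mine). It only demonstrates that the results are equal; it says nothing about timings, and whether the inner-join form helps the Spark 1.6.1 planner is untested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- payload stands in for the many columns of A that the query never uses
    CREATE TABLE A (PK INTEGER, payload TEXT);
    CREATE TABLE B (FK INTEGER, other INTEGER);
    INSERT INTO A VALUES (1, 'x'), (2, 'y'), (3, 'z'), (4, 'w');
    INSERT INTO B VALUES (2, 10), (4, 20), (4, 30), (NULL, 40);
""")

# Query1 from the thread, unchanged.
query1 = """
    select A.PK, B.FK
    from A
    left outer join B on (A.PK = B.FK)
    where B.FK is not null
"""

# Query2 from the thread: same join, but A is projected down to PK first.
query2 = """
    select A.PK, B.FK
    from (select PK from A) A
    left outer join B on (A.PK = B.FK)
    where B.FK is not null
"""

# Inner-join form: unmatched rows of A are never produced at all.
inner = """
    select A.PK, B.FK
    from A
    join B on (A.PK = B.FK)
"""


def run(sql):
    """Run a query and return its rows in a stable order for comparison."""
    return sorted(conn.execute(sql).fetchall())


rows1, rows2, rows_inner = run(query1), run(query2), run(inner)
print(rows1 == rows2 == rows_inner)
```

If the inner-join form is acceptable for your use case, it may be worth comparing its execution plan with that of Query1.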
>>>> 
>>>> 
>>>> 
>>>> Regards,
>>>> Gourav Sengupta
> 
