It is Spark SQL, and the version used is Spark 1.2.1.

On Mon, Mar 14, 2016 at 2:16 PM, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:
> I believe the OP is using Spark SQL and not Hive on Spark.
>
> Regards
> Sab
>
> On Mon, Mar 14, 2016 at 1:55 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> I think the only version of Spark that works OK with Hive (Hive on Spark
>> engine) is version 1.3.1. I also get OOM from time to time and have to
>> revert to using MR.
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>> On 14 March 2016 at 08:06, Sabarish Sasidharan <sabarish.sasidha...@manthan.com> wrote:
>>
>>> Which version of Spark are you using? The configuration varies by
>>> version.
>>>
>>> Regards
>>> Sab
>>>
>>> On Mon, Mar 14, 2016 at 10:53 AM, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> A Hive join query which runs fine and fast in MapReduce takes a lot of
>>>> time with Spark and finally fails with OOM.
>>>>
>>>> Query: hivejoin.py
>>>>
>>>> from pyspark import SparkContext, SparkConf
>>>> from pyspark.sql import HiveContext
>>>>
>>>> conf = SparkConf().setAppName("Hive_Join")
>>>> sc = SparkContext(conf=conf)
>>>> hiveCtx = HiveContext(sc)
>>>> hiveCtx.hql("INSERT OVERWRITE TABLE D SELECT <80 columns> FROM A a "
>>>>             "INNER JOIN B b ON a.item_id = b.item_id "
>>>>             "LEFT JOIN C c ON c.instance_id = a.instance_id")
>>>> results = hiveCtx.hql("SELECT COUNT(1) FROM D").collect()
>>>> print results
>>>>
>>>> Data study:
>>>>
>>>> Number of rows:
>>>> Table A has 1002093508 rows
>>>> Table B has 5371668 rows
>>>> Table C has 1000 rows
>>>>
>>>> No data skew:
>>>>
>>>> item_id in B is unique and A has multiple rows with the same item_id,
>>>> so after the first INNER JOIN the result set is still 1002093508 rows.
>>>>
>>>> instance_id in C is unique and A has multiple rows with the same
>>>> instance_id (at most 250 rows share one instance_id).
>>>>
>>>> The Spark job runs with 90 executors, each with 2 cores and 6GB of
>>>> memory. YARN allotted all the requested resources immediately and no
>>>> other job is running on the cluster.
>>>>
>>>> spark.storage.memoryFraction 0.6
>>>> spark.shuffle.memoryFraction 0.2
>>>>
>>>> Stage 2 - reads data from Hadoop; tasks have NODE_LOCAL locality and
>>>> shuffle-write 500GB of intermediate data.
>>>>
>>>> Stage 3 - does a shuffle read of that 500GB; tasks have PROCESS_LOCAL
>>>> locality and shuffle-write 400GB of output.
>>>>
>>>> Stage 4 - tasks fail with OOM while reading the shuffled output, after
>>>> only 40GB has been read.
>>>>
>>>> First of all, what kinds of Hive queries perform better on Spark than
>>>> on MapReduce, and which queries won't perform well on Spark?
>>>>
>>>> How do we calculate the optimal executor heap size and the number of
>>>> executors for a given input data size? We don't ask the Spark executors
>>>> to cache any data, so how come the Stage 3 tasks say PROCESS_LOCAL?
>>>> Why is Stage 4 failing immediately when it has read just 40GB of data?
>>>> Is it caching data in memory?
>>>>
>>>> Also, in a Spark job some stages need a lot of memory for shuffle and
>>>> some need a lot of memory for cache. When an executor has plenty of
>>>> memory available for cache but does not use it, and there is instead a
>>>> lot of shuffle to do, will the executor use only the fraction set aside
>>>> for shuffle, or will it use the free cache memory as well?
>>>>
>>>> Thanks,
>>>> Prabhu Joseph
>>>
>>> --
>>> Architect - Big Data
>>> Ph: +91 99805 99458
>>>
>>> Manthan Systems | Company of the year - Analytics (2014 Frost and
>>> Sullivan India ICT)
>>> +++
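The shuffle-vs-cache question above comes down to how Spark 1.x carves up the executor heap: under the legacy (pre-1.6) static memory model, the storage and shuffle regions are fixed and do not borrow from each other, so cache memory that sits unused does not help a shuffle-heavy stage. A rough sketch of the arithmetic for the job in this thread follows; note that the 0.9/0.8 safety fractions are the Spark 1.x defaults as I recall them, not values stated anywhere in the thread, so verify them against your version's configuration docs.

```python
# Back-of-envelope sizing for the legacy (pre-Spark-1.6) memory model.
# storage_fraction / shuffle_fraction are the settings quoted in the
# thread; the safety fractions are assumed 1.x defaults (0.9 and 0.8).

def legacy_memory_regions(executor_heap_gb,
                          storage_fraction=0.6,
                          shuffle_fraction=0.2,
                          storage_safety=0.9,
                          shuffle_safety=0.8):
    """Return (storage_gb, shuffle_gb, other_gb) for one executor."""
    storage = executor_heap_gb * storage_fraction * storage_safety
    shuffle = executor_heap_gb * shuffle_fraction * shuffle_safety
    other = executor_heap_gb - storage - shuffle
    return storage, shuffle, other

# The job in the thread: 90 executors, 6 GB heap, 2 cores each.
storage_gb, shuffle_gb, other_gb = legacy_memory_regions(6.0)
print("storage: %.2f GB, shuffle: %.2f GB, other: %.2f GB"
      % (storage_gb, shuffle_gb, other_gb))
```

With these settings each 6 GB executor reserves roughly 3.2 GB for a cache that is never used, while the two concurrently running tasks split under 1 GB of shuffle memory between them, which is consistent with a large shuffle stage spilling or OOMing; in 1.x the usual workaround is to lower spark.storage.memoryFraction and raise spark.shuffle.memoryFraction when no data is cached.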