Which version of Spark are you using? The configuration varies by version.

Regards
Sab

On Mon, Mar 14, 2016 at 10:53 AM, Prabhu Joseph <prabhujose.ga...@gmail.com>
wrote:

> Hi All,
>
> A Hive join query that runs fine and fast on MapReduce takes a lot of
> time on Spark and finally fails with an OOM.
>
> *Query:  hivejoin.py*
>
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> conf = SparkConf().setAppName("Hive_Join")
> sc = SparkContext(conf=conf)
> hiveCtx = HiveContext(sc)
> hiveCtx.hql("INSERT OVERWRITE TABLE D select <80 columns> from A a INNER
> JOIN B b ON a.item_id = b.item_id LEFT JOIN C c ON c.instance_id =
> a.instance_id")
> results = hiveCtx.hql("SELECT COUNT(1) FROM D").collect()
> print results
>
>
> *Data Study:*
>
> Number of rows:
>
> A table: 1,002,093,508
> B table: 5,371,668
> C table: 1,000
>
> No data skew:
>
> item_id is unique in B, and A has multiple rows with the same item_id, so
> after the first INNER JOIN the result set is still 1,002,093,508 rows.
>
> instance_id is unique in C, and A has multiple rows with the same instance_id
> (at most 250 rows share one instance_id).
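The cardinality claim above can be checked with a toy sketch (hypothetical data; plain Python standing in for the Hive join): when the join key is unique on one side, an inner join cannot multiply rows from the other side.

```python
# Toy data (hypothetical) mirroring the A/B shape: item_id repeats in A
# but is unique in B, so each A row matches at most one B row.
A = [("i1", "a-row-1"), ("i1", "a-row-2"), ("i2", "a-row-3")]
B = {"i1": "b-row-1", "i2": "b-row-2"}  # unique item_id -> row

# Inner join via dict lookup: exactly one output row per matching A row.
joined = [(k, av, B[k]) for k, av in A if k in B]
print(len(joined) == len(A))  # True: A's row count is preserved
```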
>
> The Spark job runs with 90 executors, each with 2 cores and 6GB memory. YARN
> allotted all the requested resources immediately, and no other job is running
> on the cluster.
>
> spark.storage.memoryFraction     0.6
> spark.shuffle.memoryFraction     0.2
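Under the legacy (pre-1.6) memory model those two fractions carve static pools out of each executor heap. A back-of-envelope with the numbers above (the 0.9 and 0.8 safety fractions are assumed here to be the Spark 1.x defaults) suggests why the shuffle read does not fit:

```python
heap_gb = 6.0                        # per-executor memory from the mail
storage_pool = heap_gb * 0.6 * 0.9   # spark.storage.memoryFraction * safety
shuffle_pool = heap_gb * 0.2 * 0.8   # spark.shuffle.memoryFraction * safety
shuffle_per_task = shuffle_pool / 2  # 2 cores -> 2 concurrent tasks

total_tasks = 90 * 2                 # 90 executors x 2 cores
read_per_task = 500.0 / total_tasks  # 500GB shuffle read spread over tasks

print(round(storage_pool, 2))      # 3.24 GB reserved for caching, unused here
print(round(shuffle_per_task, 2))  # 0.48 GB of in-memory shuffle space per task
print(round(read_per_task, 2))     # 2.78 GB of shuffle input per task
```

If those assumptions hold, each task has roughly 0.5GB of shuffle memory to process close to 3GB of shuffle input, so it must spill heavily and can still OOM if a single aggregation buffer or fetched block outgrows the pool.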
>
> Stage 2 - reads data from Hadoop; tasks have NODE_LOCAL locality and
> shuffle-write 500GB of intermediate data.
>
> Stage 3 - shuffle-reads the 500GB of data; tasks have PROCESS_LOCAL locality
> and shuffle-write 400GB of output.
>
> Stage 4 - tasks fail with OOM while reading the shuffled output, after only
> about 40GB has been read.
>
> First of all, what kinds of Hive queries get better performance on Spark than
> on MapReduce, and which Hive queries won't perform well on Spark?
>
> How do we calculate the optimal executor heap size and the number of
> executors for a given input data size? We don't ask the Spark executors to
> cache any data, so how come the Stage 3 tasks report PROCESS_LOCAL? And why
> does Stage 4 fail right after reading just 40GB of data - is it caching data
> in memory?
>
> Also, in a Spark job some stages need a lot of memory for shuffle and some
> need a lot of memory for cache. When an executor has plenty of memory set
> aside for cache but does not use it, and then needs to do a lot of shuffle,
> will it use only the configured shuffle fraction, or will it also use the
> free memory available for cache?
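The answer to that last question differs by Spark version, which is presumably why the version question was asked: before 1.6 the storage and shuffle pools are static and cannot borrow from each other, while the unified model in 1.6+ lets execution and storage share one region. A sketch of the relevant settings (the keys are real Spark config names; the values are illustrative, not recommendations):

```python
# Legacy model (Spark <= 1.5): two static pools; free cache memory is
# NOT available to shuffle.
legacy = {
    "spark.storage.memoryFraction": "0.6",
    "spark.shuffle.memoryFraction": "0.2",
}

# Unified model (Spark >= 1.6): one shared region; execution can evict
# cached blocks and use the whole pool when nothing is cached.
unified = {
    "spark.memory.fraction": "0.75",        # illustrative value
    "spark.memory.storageFraction": "0.5",  # evictable storage portion
}
print(sorted(legacy) + sorted(unified))
```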
>
>
> Thanks,
> Prabhu Joseph


-- 

Architect - Big Data
Ph: +91 99805 99458

Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan
India ICT)*