Rick,

Thank you for the input. The space issue is now resolved.
yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs were filling up.

For 5 GB of data, why should it take 10 minutes to load with 7-8 executors of 2 cores each? I also see the executors' memory usage go up to 7-20 GB.
If 5 GB of data takes this many resources, what will happen when I load 50 GB of data?

I tried reducing the partitions to 64, but it still takes more than 10 minutes.

Is there any configuration that would help me improve the loading process and consume less memory?

Regards,
~Sri

From: Rick Moritz [mailto:rah...@gmail.com]
Sent: Thursday, May 11, 2017 1:34 PM
To: Anantharaman, Srinatha (Contractor) <srinatha_ananthara...@comcast.com>; 
user <user@spark.apache.org>
Subject: Re: Spark consumes more memory

I would try to track down the "no space left on device" error - find out where it originates, since you should be able to allocate 10 executors with 4 cores and 15 GB RAM each quite easily. In that case, you may want to increase the memory overhead, so YARN doesn't kill your executors.
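For example, something along these lines - the same invocation you posted, just with the overhead bumped up (1024 is only a starting point on my part, not a value I have tested for your cluster):

spark-shell --master yarn --deploy-mode client --num-executors 8 \
  --driver-memory 5G --executor-memory 7G --executor-cores 2 \
  --conf spark.yarn.executor.memoryOverhead=1024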
Check that no local drives are filling up with temporary data by running a watch df on all nodes.
Also check that no quotas are being enforced, and that your log partitions aren't overflowing.
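Something like the following on each node, for example - the paths below are only a guess at typical HDP NodeManager directories, so substitute whatever yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs actually point to in your setup:

watch -n 10 df -h /hadoop/yarn/local /hadoop/yarn/log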

Depending on your disk and network speed, as well as the time it takes YARN to allocate resources and Spark to initialize the SparkContext, 10 minutes doesn't sound too bad. Also, I don't think 150 partitions is a helpful partition count if you have 7 GB RAM per executor and aren't doing any joins or other memory-intensive computation. Try again with 64 partitions, and see if the reduced overhead helps.
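In your code that would just be the same statements with the partition count changed, roughly:

val tab1 = sqlContext.sql("select * from xyz").repartition(64).persist(StorageLevel.MEMORY_AND_DISK_SER)
tab1.count()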
Also, track which actions/tasks are running longer than expected in the Spark UI. That should help identify where your bottleneck is located.

On Thu, May 11, 2017 at 5:46 PM, Anantharaman, Srinatha (Contractor) 
<srinatha_ananthara...@comcast.com<mailto:srinatha_ananthara...@comcast.com>> 
wrote:
Hi,

I am reading a Hive ORC table into memory; the StorageLevel is set to StorageLevel.MEMORY_AND_DISK_SER.
The total size of the Hive table is 5 GB.
I started spark-shell as below:

spark-shell --master yarn --deploy-mode client --num-executors 8 \
  --driver-memory 5G --executor-memory 7G --executor-cores 2 \
  --conf spark.yarn.executor.memoryOverhead=512
I have a 10-node cluster, each node with 35 GB memory and 4 cores, running HDP 2.5.
The SPARK_LOCAL_DIRS location has enough space.

My concern is that the simple code below takes approx. 10-12 minutes to load the data into memory.
If I change the values of num-executors/driver-memory/executor-memory/executor-cores from those mentioned above, I get a “No space left on device” error.
While the job runs, each node consumes a varying amount of memory, from 7 GB to 20 GB.

import org.apache.spark.storage.StorageLevel

// Hive context with support for recursive sub-directory input paths
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")

// Read the ORC table, repartition, and cache it serialized in memory/on disk
val tab1 = sqlContext.sql("select * from xyz").repartition(150).persist(StorageLevel.MEMORY_AND_DISK_SER)
tab1.registerTempTable("AUDIT")
tab1.count()  // action that triggers the load and materializes the cache

Kindly advise how to improve the performance of loading the Hive table into Spark memory and how to avoid the space issue.

Regards,
~Sri
