Rick,

Thank you for the input. The space issue is now resolved; yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs were filling up.
For 5 GB of data, why should it take 10 minutes to load with 7-8 executors of 2 cores each? I also see each executor's memory usage go up to 7-20 GB. If 5 GB of data takes this many resources, what will happen when I load 50 GB? I tried reducing the partitions to 64, but it still takes more than 10 minutes. Is there any configuration that would help me improve the loading process and consume less memory?

Regards,
~Sri

From: Rick Moritz [mailto:rah...@gmail.com]
Sent: Thursday, May 11, 2017 1:34 PM
To: Anantharaman, Srinatha (Contractor) <srinatha_ananthara...@comcast.com>; user <user@spark.apache.org>
Subject: Re: Spark consumes more memory

I would try to track down the "no space left on device" error - find out where it originates, since you should be able to allocate 10 executors with 4 cores and 15 GB RAM each quite easily. In that case, you may want to increase the memory overhead so YARN doesn't kill your executors.

Check that no local drives are filling up with temporary data by running a watch df on all nodes. Also check that no quotas are being enforced and that your log partitions aren't overflowing.

Depending on your disk and network speed, as well as the time it takes YARN to allocate resources and Spark to initialize the SparkContext, 10 minutes doesn't sound too bad.

Also, I don't think 150 partitions is a helpful partition count if you have 7 GB RAM per executor and aren't doing any joins or other memory-intensive computation. Try again with 64 partitions and see if the reduced overhead helps. Also, track which actions/tasks run longer than expected in the Spark UI; that should help identify where your bottleneck is located.

On Thu, May 11, 2017 at 5:46 PM, Anantharaman, Srinatha (Contractor) <srinatha_ananthara...@comcast.com> wrote:

Hi,

I am reading a Hive ORC table into memory, with the StorageLevel set to MEMORY_AND_DISK_SER. The total size of the Hive table is 5 GB.

I started spark-shell as below:

spark-shell --master yarn --deploy-mode client --num-executors 8 --driver-memory 5G --executor-memory 7G --executor-cores 2 --conf spark.yarn.executor.memoryOverhead=512

I have a 10-node cluster, each node with 35 GB of memory and 4 cores, running HDP 2.5. The SPARK_LOCAL_DIRS location has enough space.

My concern is that the simple code below takes approx. 10-12 minutes to load the data into memory. If I change the values for num-executors/driver-memory/executor-memory/executor-cores from those mentioned above, I get a "No space left on device" error. While running, each node consumes a varying amount of memory, from 7 GB to 20 GB.

import org.apache.spark.storage.StorageLevel
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
val tab1 = sqlContext.sql("select * from xyz").repartition(150).persist(StorageLevel.MEMORY_AND_DISK_SER)
tab1.registerTempTable("AUDIT")
tab1.count()

Kindly advise how to improve the performance of loading the Hive table into Spark memory and how to avoid the space issue.

Regards,
~Sri
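For reference, a minimal spark-shell sketch of the load discussed above, with the partition count pulled out as a variable so the 150 and 64 settings mentioned in the thread can be compared directly. The table name xyz, the Hive settings, and the storage level are taken from the original mail; the timing wrapper is only illustrative and not part of the original code.

import org.apache.spark.storage.StorageLevel

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")

// Partition count discussed in the thread: 150 originally, 64 suggested by Rick.
val numPartitions = 64

val tab1 = sqlContext.sql("select * from xyz")
  .repartition(numPartitions)
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
tab1.registerTempTable("AUDIT")

// count() materializes the cache; timing it shows how long the load itself takes.
val start = System.nanoTime()
tab1.count()
println(s"Load with $numPartitions partitions took ${(System.nanoTime() - start) / 1e9} s")

After the count completes, the Storage tab in the Spark UI shows how much of the cached table fits in memory versus spills to disk, which helps judge whether the executor memory settings from the thread are actually the bottleneck.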