Hi, thanks. I usually see the following errors in the Spark logs, and because of them I think the executor gets lost. All of this happens because of a huge data shuffle that I can't avoid. I don't know what to do; please guide.
15/08/16 12:26:46 WARN spark.HeartbeatReceiver: Removing executor 10 with no recent heartbeats: 1051638 ms exceeds timeout 1000000 ms

or:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
        at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:384)
        at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:381)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
        at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:380)
        at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:176)
        at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
        at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)

or YARN kills the container because of:

Container [pid=26783,containerID=container_1389136889967_0009_01_000002] is running beyond physical memory limits. Current usage: 30.2 GB of 30 GB physical memory used; Killing container.

On Mon, Oct 5, 2015 at 8:00 AM, Alex Rovner <alex.rov...@magnetic.com> wrote:

> Can you at least copy and paste the error(s) you are seeing when the job
> fails? Without the error message(s), it's hard to even suggest anything.
>
> *Alex Rovner*
> *Director, Data Engineering*
> *o:* 646.759.0052
>
> * <http://www.magnetic.com/>*
>
> On Sat, Oct 3, 2015 at 9:50 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>
>> Hi, thanks. I can't share the YARN logs because of privacy at my company,
>> but I can tell you that I have looked through them and found nothing
>> except YARN killing the container because it exceeds its physical memory
>> capacity.
>>
>> I am using the following command line script. The job launches around
>> 1500 tasks through an ExecutorService on the driver with a thread pool
>> of 15, so 15 jobs run at a time, as shown in the UI.
>>
>> ./spark-submit --class com.xyz.abc.MySparkJob \
>>   --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" \
>>   --driver-java-options -XX:MaxPermSize=512m \
>>   --driver-memory 4g --master yarn-client \
>>   --executor-memory 27G --executor-cores 2 \
>>   --num-executors 40 \
>>   --jars /path/to/others-jars \
>>   /path/to/spark-job.jar
>>
>> On Sat, Oct 3, 2015 at 7:11 PM, Alex Rovner <alex.rov...@magnetic.com> wrote:
>>
>>> Can you send over your YARN logs along with the command you are using
>>> to submit your job?
>>>
>>> *Alex Rovner*
>>> *Director, Data Engineering*
>>> *o:* 646.759.0052
>>>
>>> * <http://www.magnetic.com/>*
>>>
>>> On Sat, Oct 3, 2015 at 9:07 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>>>
>>>> Hi Alex, thanks much for the reply. Please read the following for more
>>>> details about my problem:
>>>>
>>>> http://stackoverflow.com/questions/32317285/spark-executor-oom-issue-on-yarn
>>>>
>>>> Each of my containers has 8 cores and 30 GB max memory, so I am running
>>>> in yarn-client mode with 40 executors at 27 GB / 2 cores each. If I use
>>>> more cores, my job starts losing more executors. I tried setting
>>>> spark.yarn.executor.memoryOverhead to around 2 GB, even 8 GB, but it
>>>> does not help; I lose executors no matter what. The reason is that my
>>>> jobs shuffle lots of data, as much as 20 GB per job, as I have seen in
>>>> the UI. The shuffle happens because of a group by, and I can't avoid it
>>>> in my case.
>>>>
>>>> On Sat, Oct 3, 2015 at 6:27 PM, Alex Rovner <alex.rov...@magnetic.com> wrote:
>>>>
>>>>> This sounds like you need to increase the YARN overhead settings with
>>>>> the "spark.yarn.executor.memoryOverhead" parameter. See
>>>>> http://spark.apache.org/docs/latest/running-on-yarn.html for more
>>>>> information on the setting.
>>>>>
>>>>> If that does not work for you, please provide the error messages and
>>>>> the command line you are using to submit your jobs for further
>>>>> troubleshooting.
>>>>>
>>>>> *Alex Rovner*
>>>>> *Director, Data Engineering*
>>>>> *o:* 646.759.0052
>>>>>
>>>>> * <http://www.magnetic.com/>*
>>>>>
>>>>> On Sat, Oct 3, 2015 at 6:19 AM, unk1102 <umesh.ka...@gmail.com> wrote:
>>>>>
>>>>>> Hi, I have a couple of Spark jobs that use a group-by query fired
>>>>>> from hiveContext.sql(). I know group by is evil, but in my use case
>>>>>> I can't avoid it: I have around 7-8 fields on which I need to group.
>>>>>> I am also using df1.except(df2), which also seems to be a heavy
>>>>>> operation that does lots of shuffling. Please see my UI snapshot:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/file/n24914/IMG_20151003_151830218.jpg
>>>>>>
>>>>>> I have tried almost every optimisation, including Spark 1.5, but
>>>>>> nothing seems to work, and my job fails or hangs because an executor
>>>>>> reaches the physical memory limit and YARN kills it. I have around
>>>>>> 1 TB of data to process, and it is skewed. Please guide.
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-optimize-group-by-query-fired-using-hiveContext-sql-tp24914.html
>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
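For the skewed group-by described in the original post, a common mitigation is two-phase aggregation over a salted key, so that no single reducer has to absorb an entire hot key. Below is a minimal sketch in the Spark 1.5 DataFrame API, not the poster's actual code: the DataFrame df, the column names (group_col1, group_col2, value_col), and the plain sum aggregate are illustrative assumptions, and the trick only applies when the aggregate can be recombined from partial results (sums, counts, min/max).

    import org.apache.spark.sql.functions._

    val SALT_BUCKETS = 32  // how many sub-keys each hot key is spread across (assumed value)

    // Phase 1: tag every row with a random salt and aggregate on (keys, salt),
    // so a hot key is split across up to SALT_BUCKETS reducers.
    val partial = df
      .withColumn("salt", (rand() * SALT_BUCKETS).cast("int"))
      .groupBy(col("group_col1"), col("group_col2"), col("salt"))
      .agg(sum(col("value_col")).as("partial_sum"))

    // Phase 2: drop the salt and fold the partial sums back into one row per key.
    val result = partial
      .groupBy(col("group_col1"), col("group_col2"))
      .agg(sum(col("partial_sum")).as("total"))

Raising spark.sql.shuffle.partitions (it defaults to 200) alongside this spreads the ~20 GB shuffle over more, smaller tasks, which also lowers the peak memory any single executor needs.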
>>>>>> >>>>>> --------------------------------------------------------------------- >>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>>>>> For additional commands, e-mail: user-h...@spark.apache.org >>>>>> >>>>>> >>>>> >>>> >>> >> >