Yes, I've had errors from too many open files before, but that doesn't seem to be the case here.
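For anyone who wants to rule out the open-files theory on a node, here is a minimal sketch of how the descriptor limit and current usage could be checked from Python. It assumes a Linux host with `/proc` mounted; `open_file_usage` is a hypothetical helper name, and it only inspects the process it runs in, not the executor JVM itself.

```python
import os
import resource


def open_file_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process.

    open_fds is None when /proc/self/fd is unavailable (non-Linux hosts).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    fd_dir = "/proc/self/fd"
    if os.path.isdir(fd_dir):
        # One entry per open descriptor; listing the directory briefly
        # opens one itself, so the count can be off by one.
        open_fds = len(os.listdir(fd_dir))
    else:
        open_fds = None
    return open_fds, soft, hard


if __name__ == "__main__":
    open_fds, soft, hard = open_file_usage()
    print("open fds: %s, soft limit: %s, hard limit: %s" % (open_fds, soft, hard))
```

If the open count is far below the soft limit while the job is running, descriptor exhaustion is an unlikely explanation, which would match the absence of the usual "Too many open files" message.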
Hmm, you're right that these errors are different from what I initially reported. I think I assumed that the failure to write caused the worker to crash, which in turn resulted in a failed fetch. I'll try to see if I can make sense of it from the logs.

On Fri, May 22, 2015 at 9:29 PM, Imran Rashid <[email protected]> wrote:

> hmm, sorry I think that disproves my theory. Nothing else is immediately
> coming to mind. It's possible there is more info in the logs from the
> driver, couldn't hurt to send those (though I don't have high hopes of
> finding anything that way). Off chance this could be from too many open
> files or something? Normally there is a different error msg, but I figure
> it's worth asking anyway.
>
> The error you reported here was slightly different from your original
> post. This error is from writing the shuffle map output, while the
> original error you reported was a fetch failed, which is from reading the
> shuffle data on the "reduce" side in the next stage. Does the map stage
> actually finish, even though the tasks are throwing these errors while
> writing the map output? Or do you sometimes get failures on the shuffle
> write side, and sometimes on the shuffle read side? (Not that I think you
> are doing anything wrong, but it may help narrow down the root cause and
> possibly file a bug.)
>
> thanks
>
>
> On Fri, May 22, 2015 at 4:40 AM, Rok Roskar <[email protected]> wrote:
>
>> on the worker/container that fails, the "file not found" is the first
>> error -- the output below is from the yarn log. There were some python
>> worker crashes for another job/stage earlier (see the warning at 18:36) but
>> I expect those to be unrelated to this file not found error.
>>
>>
>> ==================================================================================
>> LogType:stderr
>> Log Upload Time:15-May-2015 18:50:05
>> LogLength:5706
>> Log Contents:
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in [jar:file:/hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/filecache/89/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> 15/05/15 18:33:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> 15/05/15 18:36:37 WARN PythonRDD: Incomplete task interrupted: Attempting to kill Python Worker
>> 15/05/15 18:50:03 ERROR Executor: Exception in task 319.0 in stage 12.0 (TID 995)
>> java.io.FileNotFoundException: /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3-44da-9410-99c872a89489/03/shuffle_4_319_0.data (No such file or directory)
>>     at java.io.FileOutputStream.open(Native Method)
>>     at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
>>     at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:130)
>>     at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:201)
>>     at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5$$anonfun$apply$2.apply(ExternalSorter.scala:759)
>>     at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5$$anonfun$apply$2.apply(ExternalSorter.scala:758)
>>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>     at org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:823)
>>     at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5.apply(ExternalSorter.scala:758)
>>     at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5.apply(ExternalSorter.scala:754)
>>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>     at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:754)
>>     at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>     at java.lang.Thread.run(Thread.java:722)
>> 15/05/15 18:50:04 ERROR DiskBlockManager: Exception while deleting local spark dir: /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3-44da-9410-99c872a89489
>> java.io.IOException: Failed to delete: /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3-44da-9410-99c872a89489
>>     at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:933)
>>     at org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:165)
>>     at org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:162)
>>     at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>     at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:162)
>>     at org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:156)
>>     at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1208)
>>     at org.apache.spark.SparkEnv.stop(SparkEnv.scala:88)
>>     at org.apache.spark.executor.Executor.stop(Executor.scala:146)
>>     at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receiveWithLogging$1.applyOrElse(CoarseGrainedExecutorBackend.scala:105)
>>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>     at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
>>     at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>     at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>>     at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>     at org.apache.spark.executor.CoarseGrainedExecutorBackend.aroundReceive(CoarseGrainedExecutorBackend.scala:38)
>>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>
>> On Tue, May 19, 2015 at 3:38 AM, Imran Rashid <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> can you take a look at the logs and see what the first error you are
>>> getting is? It's possible that the file doesn't exist when that error is
>>> produced, but it shows up later -- I've seen similar things happen, but
>>> only after there have already been some errors. But, if you see that in
>>> the very first error, then I'm not sure what the cause is. Would be
>>> helpful for you to send the logs.
>>>
>>> Imran
>>>
>>> On Fri, May 15, 2015 at 10:07 AM, rok <[email protected]> wrote:
>>>
>>>> I am trying to sort a collection of key,value pairs (from several hundred
>>>> million to a few billion) and have recently been getting lots of
>>>> "FetchFailedException" errors that seem to originate when one of the
>>>> executors doesn't seem to find a temporary shuffle file on disk. E.g.:
>>>>
>>>> org.apache.spark.shuffle.FetchFailedException:
>>>> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>>>> (No such file or directory)
>>>>
>>>> This file actually exists:
>>>>
>>>> > ls -l /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>>>>
>>>> -rw-r--r-- 1 hadoop hadoop 11936 May 15 16:52 /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>>>>
>>>> This error repeats on several executors and is followed by a number of
>>>>
>>>> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
>>>> location for shuffle 0
>>>>
>>>> This results in most tasks being lost and executors dying.
>>>>
>>>> There is plenty of space on all of the appropriate filesystems, so none of
>>>> the executors are running out of disk space. Any idea what might be causing
>>>> this? I am running this via YARN on approximately 100 nodes with 2 cores per
>>>> node. Any thoughts on what might be causing these errors? Thanks!
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/FetchFailedException-and-MetadataFetchFailedException-tp22901.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
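On the earlier suggestion in this thread to find the first error in the executor logs: a rough sketch of how that could be scripted is below. It assumes log lines in the `yy/MM/dd HH:mm:ss LEVEL Logger: message` layout seen in the YARN stderr output quoted above; `first_events` is a hypothetical helper, and the sample lines are copied from that log with the wrapped lines rejoined and the long directory path elided.

```python
import re

# Matches lines such as:
#   15/05/15 18:50:03 ERROR Executor: Exception in task 319.0 in stage 12.0 (TID 995)
LOG_LINE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (WARN|ERROR) (\w+): (.*)$"
)


def first_events(lines):
    """Return {level: (timestamp, logger, message)} for the first WARN and ERROR."""
    found = {}
    for line in lines:
        match = LOG_LINE.match(line.strip())
        # Stack-trace lines and SLF4J banners do not match and are skipped.
        if match and match.group(2) not in found:
            found[match.group(2)] = (match.group(1), match.group(3), match.group(4))
    return found


sample = [
    "SLF4J: Class path contains multiple SLF4J bindings.",
    "15/05/15 18:33:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable",
    "15/05/15 18:36:37 WARN PythonRDD: Incomplete task interrupted: Attempting to kill Python Worker",
    "15/05/15 18:50:03 ERROR Executor: Exception in task 319.0 in stage 12.0 (TID 995)",
    "    at java.io.FileOutputStream.open(Native Method)",
    "15/05/15 18:50:04 ERROR DiskBlockManager: Exception while deleting local spark dir: ...",
]

events = first_events(sample)
for level in ("WARN", "ERROR"):
    if level in events:
        print(level, *events[level])
```

Running this over each container's stderr and comparing the first ERROR timestamps across executors may help show whether the shuffle write failure or the fetch failure comes first.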
