Re: FetchFailedException and MetadataFetchFailedException

Rok Roskar Thu, 28 May 2015 02:20:08 -0700

Yes, I've had errors from too many open files before, but that doesn't seem
to be the case here.
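
For anyone who wants to rule out the open-files theory, here is a minimal sketch. It assumes a Linux node with /proc available, and the helper name open_file_usage is only illustrative. The limits are per-process, so it only says something about the executor if it runs in the same environment (for example inside a PySpark task).

```python
import os
import resource


def open_file_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process.

    open_fds is None when /proc/self/fd is not available (non-Linux systems).
    """
    # RLIMIT_NOFILE is the per-process cap on open file descriptors.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # On Linux, each entry in /proc/self/fd is one open descriptor.
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None
    return open_fds, soft, hard


if __name__ == "__main__":
    fds, soft, hard = open_file_usage()
    print("open fds: %s, soft limit: %s, hard limit: %s" % (fds, soft, hard))
```

If the open count is close to the soft limit while a shuffle is running, the limit is worth raising before looking elsewhere.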


Hmm, you're right that these errors are different from what I initially
stated -- I think I assumed that the failure to write caused the worker
to crash, which in turn resulted in a failed fetch. I'll try to
see if I can make sense of it from the logs.
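
A rough sketch of how the first error could be pulled out of each container's stderr. It assumes the log4j line layout seen in the YARN output quoted below ("yy/MM/dd HH:mm:ss LEVEL Class: message"); the function name first_error is hypothetical.

```python
import re

# Matches lines such as:
#   15/05/15 18:50:03 ERROR Executor: Exception in task 319.0 in stage 12.0
LOG_LINE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+) (\w+): (.*)$"
)


def first_error(log_text):
    """Return (timestamp, logger, message) of the first ERROR line, or None."""
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match and match.group(2) == "ERROR":
            return match.group(1), match.group(3), match.group(4)
    return None
```

Running this over the stderr of every failed container and sorting by timestamp should show whether the shuffle write failure or the fetch failure comes first.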

On Fri, May 22, 2015 at 9:29 PM, Imran Rashid <[email protected]> wrote:

> Hmm, sorry, I think that disproves my theory.  Nothing else is immediately
> coming to mind.  It's possible there is more info in the logs from the
> driver; it couldn't hurt to send those (though I don't have high hopes of
> finding anything that way).  Is there an off chance this could be from too
> many open files or something?  Normally there is a different error message,
> but I figure it's worth asking anyway.
>
> The error you reported here was slightly different from your original
> post.  This error is from writing the shuffle map output, while the
> original error you reported was a fetch failed, which is from reading the
> shuffle data on the "reduce" side in the next stage.  Does the map stage
> actually finish, even though the tasks are throwing these errors while
> writing the map output?  Or do you sometimes get failures on the shuffle
> write side, and sometimes on the shuffle read side?  (Not that I think you
> are doing anything wrong, but it may help narrow down the root cause and
> possibly file a bug.)
>
> thanks
>
>
> On Fri, May 22, 2015 at 4:40 AM, Rok Roskar <[email protected]> wrote:
>
>> on the worker/container that fails, the "file not found" is the first
>> error -- the output below is from the yarn log. There were some python
>> worker crashes for another job/stage earlier (see the warning at 18:36) but
>> I expect those to be unrelated to this file not found error.
>>
>>
>> ==================================================================================
>> LogType:stderr
>> Log Upload Time:15-May-2015 18:50:05
>> LogLength:5706
>> Log Contents:
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in
>> [jar:file:/hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/filecache/89/spark-assembly-1.3.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in
>> [jar:file:/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>> explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> 15/05/15 18:33:09 WARN NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 15/05/15 18:36:37 WARN PythonRDD: Incomplete task interrupted: Attempting
>> to kill Python Worker
>> 15/05/15 18:50:03 ERROR Executor: Exception in task 319.0 in stage 12.0
>> (TID 995)
>> java.io.FileNotFoundException:
>> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3-44da-9410-99c872a89489/03/shuffle_4_319_0.data (No such file or directory)
>>         at java.io.FileOutputStream.open(Native Method)
>>         at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
>>         at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:130)
>>         at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:201)
>>         at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5$$anonfun$apply$2.apply(ExternalSorter.scala:759)
>>         at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5$$anonfun$apply$2.apply(ExternalSorter.scala:758)
>>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>         at org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:823)
>>         at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5.apply(ExternalSorter.scala:758)
>>         at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$5.apply(ExternalSorter.scala:754)
>>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>         at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:754)
>>         at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:71)
>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>         at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>         at java.lang.Thread.run(Thread.java:722)
>> 15/05/15 18:50:04 ERROR DiskBlockManager: Exception while deleting local
>> spark dir:
>> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3-44da-9410-99c872a89489
>> java.io.IOException: Failed to delete:
>> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1047/blockmgr-3c9000cf-11f3-44da-9410-99c872a89489
>>         at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:933)
>>         at org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:165)
>>         at org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:162)
>>         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>         at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:162)
>>         at org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:156)
>>         at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1208)
>>         at org.apache.spark.SparkEnv.stop(SparkEnv.scala:88)
>>         at org.apache.spark.executor.Executor.stop(Executor.scala:146)
>>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receiveWithLogging$1.applyOrElse(CoarseGrainedExecutorBackend.scala:105)
>>         at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>         at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>         at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>         at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53)
>>         at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
>>         at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>         at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
>>         at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>         at org.apache.spark.executor.CoarseGrainedExecutorBackend.aroundReceive(CoarseGrainedExecutorBackend.scala:38)
>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>
>> On Tue, May 19, 2015 at 3:38 AM, Imran Rashid <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> Can you take a look at the logs and see what the first error you are
>>> getting is?  It's possible that the file doesn't exist when that error is
>>> produced, but it shows up later -- I've seen similar things happen, but
>>> only after there have already been some errors.  But if you see that in
>>> the very first error, then I'm not sure what the cause is.  It would be
>>> helpful for you to send the logs.
>>>
>>> Imran
>>>
>>> On Fri, May 15, 2015 at 10:07 AM, rok <[email protected]> wrote:
>>>
>>>> I am trying to sort a collection of key,value pairs (between several
>>>> hundred million and a few billion) and have recently been getting lots of
>>>> "FetchFailedException" errors that seem to originate when one of the
>>>> executors can't find a temporary shuffle file on disk. E.g.:
>>>>
>>>> org.apache.spark.shuffle.FetchFailedException:
>>>>
>>>> /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>>>> (No such file or directory)
>>>>
>>>> This file actually exists:
>>>>
>>>> > ls -l /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>>>> -rw-r--r-- 1 hadoop hadoop 11936 May 15 16:52 /hadoop/tmp/hadoop-hadoop/nm-local-dir/usercache/user/appcache/application_1426230650260_1044/blockmgr-453473e7-76c2-4a94-85d0-d0b75b515ad6/10/shuffle_0_264_0.index
>>>>
>>>> This error repeats on several executors and is followed by a number of
>>>>
>>>> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
>>>> location for shuffle 0
>>>>
>>>> This results in most tasks being lost and executors dying.
>>>>
>>>> There is plenty of space on all of the appropriate filesystems, so none of
>>>> the executors are running out of disk space. I am running this via YARN on
>>>> approximately 100 nodes with 2 cores per node. Any thoughts on what might
>>>> be causing these errors? Thanks!
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
