Shivaram Venkataraman created SPARK-2723:
--------------------------------------------

             Summary: Block Manager should catch exceptions in putValues
                 Key: SPARK-2723
                 URL: https://issues.apache.org/jira/browse/SPARK-2723
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.0.0
            Reporter: Shivaram Venkataraman


The BlockManager should catch exceptions encountered while writing files out to 
disk. Right now these exceptions are counted as user-level task failures, and 
the job is aborted after four failed attempts. We should either fail the 
executor or handle the error more gracefully so the job does not die.
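As a rough illustration of the proposed fix, the sketch below wraps a disk write in a try/catch so that an I/O failure (e.g. a dead disk) surfaces as a storage-level error the caller can react to, instead of propagating as a user-level task failure. The object name, method signature, and Either-based return type are assumptions for illustration, not the actual DiskStore.putValues API:

```scala
import java.io.{File, FileOutputStream, IOException}

// Hypothetical sketch, not the real Spark DiskStore: catch IOException
// (including FileNotFoundException from a bad local directory) during the
// write and report it to the caller, which could then blacklist the
// directory or fail the executor rather than count a task failure.
object DiskStoreSketch {
  def putValues(file: File, bytes: Array[Byte]): Either[Throwable, Long] = {
    try {
      val out = new FileOutputStream(file)
      try {
        out.write(bytes)
        Right(file.length())
      } finally {
        out.close()
      }
    } catch {
      // An IOException here indicates a storage problem, not a user bug.
      case e: IOException => Left(e)
    }
  }
}
```

A caller checking the Left case could then decide between retrying on another local directory or terminating the executor.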

I ran into an issue where one disk on a large EC2 cluster failed, which caused 
a long-running job to terminate. Longer term, we should also look at 
blacklisting local directories when one of them becomes unusable.

Exception pasted below:

14/07/29 00:55:39 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: /mnt2/spark/spark-local-20140728175256-e7cb/28/broadcast_264_piece20 (Input/output error)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
        at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:79)
        at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:66)
        at org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:847)
        at org.apache.spark.storage.MemoryStore$$anonfun$ensureFreeSpace$4.apply(MemoryStore.scala:267)
        at org.apache.spark.storage.MemoryStore$$anonfun$ensureFreeSpace$4.apply(MemoryStore.scala:256)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.storage.MemoryStore.ensureFreeSpace(MemoryStore.scala:256)
        at org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:179)
        at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:76)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:663)
        at org.apache.spark.storage.BlockManager.put(BlockManager.scala:574)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108)



--
This message was sent by Atlassian JIRA
(v6.2#6252)