Hi team,

I have observed the following problem. I have an application running in daemon mode. Within this application I use Spark in local mode, initializing the SparkContext once per application start. Spark jobs, however, can be triggered at very different times - sometimes once per day, sometimes once per week.
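To make the setup concrete, here is a simplified sketch of what the daemon does; the class name, the scheduling mechanism and the job body are illustrative only, not our real code:

    import java.util.concurrent.{Executors, TimeUnit}
    import org.apache.spark.{SparkConf, SparkContext}

    object ReportDaemon {
      // The SparkContext is created once at application start (local mode)
      // and kept alive for the whole lifetime of the daemon process.
      private val sc = new SparkContext(
        new SparkConf().setMaster("local[*]").setAppName("report-daemon"))

      def main(args: Array[String]): Unit = {
        val scheduler = Executors.newSingleThreadScheduledExecutor()
        // Jobs are triggered rarely - sometimes days or a week apart.
        scheduler.scheduleAtFixedRate(() => runJob(), 0, 1, TimeUnit.DAYS)
      }

      private def runJob(): Unit = {
        // Running an action broadcasts the task binary and therefore goes through
        // the block manager and its on-disk blockmgr-* directory under /tmp.
        val total = sc.parallelize(1 to 1000).map(_ * 2).sum()
        println(s"job finished, total = $total")
      }
    }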
When there is a big gap between job runs, the newly triggered job fails with the following error:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.nio.file.NoSuchFileException: /tmp/blockmgr-d8d04f03-ccad-4cae-8db2-cff0caea3ea3/0e
java.nio.file.NoSuchFileException: /tmp/blockmgr-d8d04f03-ccad-4cae-8db2-cff0caea3ea3/0e
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
    at java.nio.file.Files.createDirectory(Files.java:674)
    at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:108)
    at org.apache.spark.storage.DiskStore.remove(DiskStore.scala:131)
    at org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:2008)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1489)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1526)
    at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:1381)
    at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1870)
    at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:154)
    at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:99)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:38)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:78)
    at org.apache.spark.SparkContext.broadcastInternal(SparkContext.scala:1548)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1530)
    at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1535)
    at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1353)
    at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1295)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2931)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)

Why this happens: outdated folders in /tmp (say, a week old) are cleaned up by the OS. It seems that somewhere inside SparkContext the blockId (and its on-disk directory) is remembered and reused, while I don't actually expect a new job run to depend on a previous one. The workaround is to restart the application, but that is not suitable for us.

After analyzing the code I found this in org.apache.spark.storage.BlockManager#removeBlockInternal:

    // Removals are idempotent in disk store and memory store. At worst, we get a warning.
    val removedFromMemory = memoryStore.remove(blockId)
    val removedFromDisk = diskStore.remove(blockId)
    if (!removedFromMemory && !removedFromDisk) {
      logWarning(s"Block $blockId could not be removed as it was not found on disk or in memory")
    }

In case of an unsuccessful removal only a WARN log is expected, but in fact the job fails. This happens because org.apache.spark.storage.DiskBlockManager#getFile uses Files.createDirectory(path), which performs a single-level mkdir, and mkdir cannot create a nested directory (blockmgr-d8d04f03-ccad-4cae-8db2-cff0caea3ea3/0e in this case) once its parent no longer exists. So this operation is not idempotent!
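The non-idempotent behaviour is easy to reproduce outside of Spark. The following standalone sketch (my own test code, not Spark's) shows that Files.createDirectory fails with the same NoSuchFileException once the parent directory has been removed, while Files.createDirectories ("mkdir -p" semantics) would simply recreate it:

    import java.nio.file.{Files, NoSuchFileException}

    object CreateDirectoryDemo {
      def main(args: Array[String]): Unit = {
        // Stand-in for /tmp/blockmgr-<uuid> and one of its sub-directories.
        val parent = Files.createTempDirectory("blockmgr-demo")
        val sub = parent.resolve("0e")

        // Simulate the /tmp cleaner removing the whole blockmgr-* directory.
        Files.delete(parent)

        // Single-level mkdir: fails because the parent is gone - the same
        // NoSuchFileException as in DiskBlockManager.getFile.
        try {
          Files.createDirectory(sub)
        } catch {
          case e: NoSuchFileException =>
            println(s"createDirectory failed: $e")
        }

        // "mkdir -p" semantics: recreates the missing parent and succeeds.
        Files.createDirectories(sub)
        println(s"createDirectories succeeded: ${Files.exists(sub)}")
      }
    }

Presumably, if the missing parent were recreated here (createDirectories-style), the removal in removeBlockInternal would behave as idempotently as its comment promises.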
The only related discussion I found on Stack Overflow is https://stackoverflow.com/questions/41238121/spark-java-ioexception-failed-to-create-local-dir-in-tmp-blockmgr, and there is still no proper explanation or resolution there. I'm using spark-core_2.12-3.4.1.jar.

Can you suggest anything for this issue? Could it be reported as a bug?

Thanks.

Regards,
--------------
Olga Averianova, Senior Software Engineer