Your Spark application running as a daemon probably creates a race condition. If multiple daemons access and modify the same data (say a shared database, a file system, or memory) without proper synchronization, unexpected and unpredictable behavior can occur.
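For example, a single-instance guard can be built with an OS-level file lock. A minimal sketch in Java (the lock path and class name are illustrative, not from the original thread); note that a java.nio.channels.FileLock is released by the OS when the process exits, so a crash does not leave a stale lock behind:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SingleInstanceGuard {

    /** Returns a held lock, or null if another process already holds it. */
    static FileLock tryAcquire(Path lockPath) throws IOException {
        FileChannel channel = FileChannel.open(lockPath,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock lock = channel.tryLock(); // null => locked by another process
        if (lock == null) {
            channel.close();
        }
        return lock;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical lock location; prefer a directory your daemon can
        // write to that is not subject to periodic /tmp cleanup.
        Path lockPath = Path.of(System.getProperty("java.io.tmpdir"),
                "my-daemon.lock");
        FileLock lock = tryAcquire(lockPath);
        if (lock == null) {
            System.out.println("another instance holds the lock; exiting");
            return;
        }
        System.out.println("lock acquired; safe to run the job");
        // ... run the job here ...
        lock.channel().close(); // closing the channel releases the lock
    }
}
```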
As a suggestion, creating a lock file can be an effective way to prevent multiple instances of your daemon application from running concurrently and causing race conditions or resource contention. The job that creates the lock file can delete it after completion.

HTH

Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom
view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Fri, 17 Jan 2025 at 13:31, Olga Averianova <olga.averian...@netcracker.com.invalid> wrote:

> Hi team,
>
> I have observed the following problem.
>
> I have an application running in daemon mode. Within this application I
> use Spark in local mode, initializing SparkContext once per application
> start. But Spark jobs can be triggered at very different times – sometimes
> once per day, sometimes once per week.
> When there’s a big gap between job runs, the newly triggered job fails
> with the following error:
>
> Caused by: org.apache.spark.SparkException: Job aborted due to stage
> failure: Task serialization failed: java.nio.file.NoSuchFileException:
> /tmp/blockmgr-d8d04f03-ccad-4cae-8db2-cff0caea3ea3/0e
> java.nio.file.NoSuchFileException:
> /tmp/blockmgr-d8d04f03-ccad-4cae-8db2-cff0caea3ea3/0e
>     at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>     at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>     at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>     at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
>     at java.nio.file.Files.createDirectory(Files.java:674)
>     at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:108)
>     at org.apache.spark.storage.DiskStore.remove(DiskStore.scala:131)
>     at org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:2008)
>     at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1489)
>     at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1526)
>     at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:1381)
>     at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1870)
>     at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:154)
>     at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:99)
>     at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:38)
>     at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:78)
>     at org.apache.spark.SparkContext.broadcastInternal(SparkContext.scala:1548)
>     at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1530)
>     at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1535)
>     at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1353)
>     at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1295)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2931)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>
> Why this happens: outdated folders in /tmp (say, one week old) are being
> cleaned up, but it seems SparkContext keeps a reference to a blockId
> somewhere and tries to use it, while I don’t expect a new job run to
> depend on a previous one.
>
> The workaround is to restart the application, but this is not suitable
> for us.
>
> After analyzing the code, I found this in
> org.apache.spark.storage.BlockManager#removeBlockInternal:
>
> // Removals are idempotent in disk store and memory store. At worst, we
> // get a warning.
> val removedFromMemory = memoryStore.remove(blockId)
> val removedFromDisk = diskStore.remove(blockId)
> if (!removedFromMemory && !removedFromDisk) {
>   logWarning(s"Block $blockId could not be removed as it was not found
>     on disk or in memory")
> }
>
> In case of an unsuccessful removal you would expect only a WARN log, but
> in fact *the job fails*. This happens because
> org.apache.spark.storage.DiskBlockManager#getFile uses
>
> Files.createDirectory(path)
>
> which in turn uses mkdir, and mkdir does not create missing parent
> directories (*blockmgr-d8d04f03-ccad-4cae-8db2-cff0caea3ea3/0e* in this
> case).
>
> So this operation is not idempotent!
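The non-idempotence described above can be reproduced outside Spark. A minimal sketch (the directory names are illustrative): `Files.createDirectory` fails with NoSuchFileException once the parent has been cleaned up, whereas `Files.createDirectories` would recreate the missing parents:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

public class CreateDirectoryDemo {
    public static void main(String[] args) throws IOException {
        // Simulate a cleaned-up block-manager directory: the parent is gone.
        Path parent = Files.createTempDirectory("blockmgr-demo");
        Files.delete(parent);             // the /tmp cleaner removed it
        Path sub = parent.resolve("0e");  // subdirectory the code tries to create

        try {
            Files.createDirectory(sub);   // plain mkdir: parent must exist
        } catch (NoSuchFileException e) {
            System.out.println("createDirectory failed: " + e.getMessage());
        }

        Files.createDirectories(sub);     // like mkdir -p: recreates parents
        System.out.println("createDirectories succeeded: "
                + Files.isDirectory(sub));
    }
}
```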
> The only related issue I found on Stack Overflow is
> https://stackoverflow.com/questions/41238121/spark-java-ioexception-failed-to-create-local-dir-in-tmp-blockmgr
> and there is still no proper explanation or resolution there.
>
> I’m using spark-core_2.12-3.4.1.jar.
>
> Can you suggest anything for this issue? Could it be reported as a bug?
>
> Thanks
>
> Regards,
> --------------
> Olga Averianova, Senior Software Engineer