Re: Spark Job crash due to File Not found when shuffle intermittently
cool~ Thanks Kang! I will check and let you know. Sorry for the delay; an urgent customer issue came up today.

Best,
Martin

2017-07-24 22:15 GMT-07:00 周康:
> From the FileOutputStream Javadoc:
>
>     If the file exists but is a directory rather than a regular file, does
>     not exist but cannot be created, or cannot be opened for any other
>     reason then a FileNotFoundException is thrown.
>
> After looking into FileOutputStream I saw this note. So check the executor
> node first; there may be no permission, no free space, or not enough file
> descriptors.
>
> 2017-07-25 13:05 GMT+08:00 周康:
>
>> You can also check whether there is enough space left on the executor
>> node to store the shuffle files.
>>
>> 2017-07-25 13:01 GMT+08:00 周康:
>>
>>> First, Spark handles task failures, so if the job ended normally this
>>> error can be ignored.
>>> Second, when using BypassMergeSortShuffleWriter, Spark first writes the
>>> data file and then writes an index file.
>>> You can check for "Failed to delete temporary index file at" or "fail
>>> to rename file" in the related executor node's log file.
>>>
>>> 2017-07-25 0:33 GMT+08:00 Martin Peng:
>>>
>>>> Is there anyone who can share some light on this issue?
>>>>
>>>> Thanks
>>>> Martin
>>>>
>>>> 2017-07-21 18:58 GMT-07:00 Martin Peng:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have several Spark jobs, both batch and streaming, that process
>>>>> and analyze system logs. We are using Kafka as the pipeline to
>>>>> connect the jobs.
>>>>>
>>>>> After upgrading to Spark 2.1.0 + Spark Kafka Streaming 010, I found
>>>>> that some of the jobs (both batch and streaming) throw the exception
>>>>> below at random, either after several hours of running or after just
>>>>> 20 minutes. Can anyone give me some suggestions on how to find the
>>>>> real root cause? (The Google results are not very useful...)
>>>>>
>>>>> [stack trace snipped; the full trace is in the original message at
>>>>> the end of the thread]
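The quoted advice notes that BypassMergeSortShuffleWriter writes the data file first and then the index file, and the missing file in the trace is the UUID-suffixed temporary index (`shuffle_2090_60_0.index.b66235be-...`), which exists only until it is renamed to the final index name. A minimal illustration of that write-then-rename pattern in plain shell (this is a sketch of the pattern, not Spark's actual code; the file names are stand-ins):

```shell
# Illustration only: the index is written under a temporary suffixed
# name, then renamed to its final name, mirroring
# shuffle_2090_60_0.index.<uuid> -> shuffle_2090_60_0.index in the trace.
dir=$(mktemp -d)
tmp="$dir/shuffle_2090_60_0.index.tmp$$"   # stand-in for the UUID suffix
final="$dir/shuffle_2090_60_0.index"

printf 'partition offsets' > "$tmp"        # write the index data
mv "$tmp" "$final"                         # rename into the final name

ls "$dir"                                  # only the final file remains
```

If the temporary file vanishes (cleanup race, full disk, permission change) before the rename, the writer sees exactly a FileNotFoundException on the UUID-suffixed name.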
Re: Spark Job crash due to File Not found when shuffle intermittently
From the FileOutputStream Javadoc:

    If the file exists but is a directory rather than a regular file, does
    not exist but cannot be created, or cannot be opened for any other
    reason then a FileNotFoundException is thrown.

After looking into FileOutputStream I saw this note. So check the executor
node first; there may be no permission, no free space, or not enough file
descriptors.

2017-07-25 13:05 GMT+08:00 周康:
> You can also check whether there is enough space left on the executor node
> to store the shuffle files.
>
> 2017-07-25 13:01 GMT+08:00 周康:
>
>> First, Spark handles task failures, so if the job ended normally this
>> error can be ignored.
>> Second, when using BypassMergeSortShuffleWriter, Spark first writes the
>> data file and then writes an index file.
>> You can check for "Failed to delete temporary index file at" or "fail to
>> rename file" in the related executor node's log file.
>>
>> 2017-07-25 0:33 GMT+08:00 Martin Peng:
>>
>>> Is there anyone who can share some light on this issue?
>>>
>>> Thanks
>>> Martin
>>>
>>> [earlier message and stack trace snipped]
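The executor-node checks suggested above (permissions, free space, file descriptors) can be scripted. A minimal sketch; `WORK_DIR` here defaults to `/tmp` and is an assumption, since the real directory on this cluster would be the Mesos blockmgr path shown in the stack trace:

```shell
# Sketch of the executor-node checks suggested above. WORK_DIR is a
# placeholder; point it at the executor's shuffle/blockmgr directory.
WORK_DIR=${WORK_DIR:-/tmp}

df -h  "$WORK_DIR"    # free space on the shuffle volume
df -hi "$WORK_DIR"    # free inodes (can run out before bytes do)
ulimit -n             # open-file-descriptor limit for this shell

# write permission for the current user?
ls -ld "$WORK_DIR"
touch "$WORK_DIR/.write_test" && rm -f "$WORK_DIR/.write_test" && echo writable
```

Note that `ulimit -n` reports the current shell's limit; on Linux the running executor's actual limits can be read from `/proc/<executor-pid>/limits`.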
Re: Spark Job crash due to File Not found when shuffle intermittently
You can also check whether there is enough space left on the executor node to
store the shuffle files.

2017-07-25 13:01 GMT+08:00 周康:
> First, Spark handles task failures, so if the job ended normally this error
> can be ignored.
> Second, when using BypassMergeSortShuffleWriter, Spark first writes the
> data file and then writes an index file.
> You can check for "Failed to delete temporary index file at" or "fail to
> rename file" in the related executor node's log file.
>
> 2017-07-25 0:33 GMT+08:00 Martin Peng:
>
>> Is there anyone who can share some light on this issue?
>>
>> Thanks
>> Martin
>>
>> [earlier message and stack trace snipped]
Re: Spark Job crash due to File Not found when shuffle intermittently
First, Spark handles task failures, so if the job ended normally this error
can be ignored.
Second, when using BypassMergeSortShuffleWriter, Spark first writes the data
file and then writes an index file.
You can check for "Failed to delete temporary index file at" or "fail to
rename file" in the related executor node's log file.

2017-07-25 0:33 GMT+08:00 Martin Peng:
> Is there anyone who can share some light on this issue?
>
> Thanks
> Martin
>
> [earlier message and stack trace snipped]
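The log check above amounts to a grep over the executor logs. A sketch; `LOG_DIR` is an assumption (on this Mesos cluster the executor sandboxes sit under the work_dir shown in the stack trace, but the layout differs per deployment):

```shell
# Search executor logs for the two messages mentioned above.
# LOG_DIR is a placeholder; point it at the executor's log directory.
LOG_DIR=${LOG_DIR:-/mnt/mesos/work_dir}

grep -rn \
  -e "Failed to delete temporary index file at" \
  -e "fail to rename file" \
  "$LOG_DIR" 2>/dev/null || echo "no matching lines found"
```

A hit on either message points at the index-file commit step (delete temp, rename into place) rather than the initial write.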
Re: Spark Job crash due to File Not found when shuffle intermittently
Is there anyone who can share some light on this issue?

Thanks
Martin

2017-07-21 18:58 GMT-07:00 Martin Peng:
> Hi,
>
> I have several Spark jobs, both batch and streaming, that process and
> analyze system logs. We are using Kafka as the pipeline to connect the
> jobs.
>
> After upgrading to Spark 2.1.0 + Spark Kafka Streaming 010, I found that
> some of the jobs (both batch and streaming) throw the exception below at
> random, either after several hours of running or after just 20 minutes.
> Can anyone give me some suggestions on how to find the real root cause?
> (The Google results are not very useful...)
>
> [stack trace snipped; the full trace is in the original message below]
Spark Job crash due to File Not found when shuffle intermittently
Hi,

I have several Spark jobs, both batch and streaming, that process and analyze
system logs. We are using Kafka as the pipeline to connect the jobs.

After upgrading to Spark 2.1.0 + Spark Kafka Streaming 010, I found that some
of the jobs (both batch and streaming) throw the exception below at random,
either after several hours of running or after just 20 minutes. Can anyone
give me some suggestions on how to find the real root cause? (The Google
results are not very useful...)

Thanks,
Martin

00:30:04,510 WARN - 17/07/22 00:30:04 WARN TaskSetManager: Lost task 60.0 in stage 1518490.0 (TID 338070, 10.133.96.21, executor 0): java.io.FileNotFoundException: /mnt/mesos/work_dir/slaves/20160924-021501-274760970-5050-7646-S2/frameworks/40aeb8e5-e82a-4df9-b034-8815a7a7564b-2543/executors/0/runs/fd15c15d-2511-4f37-a106-27431f583153/blockmgr-a0e0e673-f88b-4d12-a802-c35643e6c6b2/33/shuffle_2090_60_0.index.b66235be-79be-4455-9759-1c7ba70f91f6 (No such file or directory)
00:30:04,510 WARN -     at java.io.FileOutputStream.open0(Native Method)
00:30:04,510 WARN -     at java.io.FileOutputStream.open(FileOutputStream.java:270)
00:30:04,510 WARN -     at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
00:30:04,510 WARN -     at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
00:30:04,510 WARN -     at org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:144)
00:30:04,510 WARN -     at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:128)
00:30:04,510 WARN -     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
00:30:04,510 WARN -     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
00:30:04,510 WARN -     at org.apache.spark.scheduler.Task.run(Task.scala:99)
00:30:04,510 WARN -     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
00:30:04,510 WARN -     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
00:30:04,510 WARN -     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
00:30:04,510 WARN -     at java.lang.Thread.run(Thread.java:748)

00:30:04,580 INFO - Driver stacktrace:
00:30:04,580 INFO - org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
00:30:04,580 INFO - org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
00:30:04,580 INFO - org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
00:30:04,580 INFO - scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
00:30:04,580 INFO - scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
00:30:04,580 INFO - org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
00:30:04,580 INFO - org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
00:30:04,580 INFO - org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
00:30:04,580 INFO - scala.Option.foreach(Option.scala:257)
00:30:04,580 INFO - org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
00:30:04,580 INFO - org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
00:30:04,580 INFO - org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
00:30:04,580 INFO - org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
00:30:04,580 INFO - org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
00:30:04,580 INFO - org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
00:30:04,580 INFO - org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
00:30:04,580 INFO - org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
00:30:04,580 INFO - org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
00:30:04,580 INFO - org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
00:30:04,580 INFO - org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
00:30:04,580 INFO - org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
00:30:04,580 INFO - org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
00:30:04,580 INFO - org.apache.spark.rdd.RDD.take(RDD.scala:1326)
00:30:04,580 INFO - org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1461)
00:30:04,580 INFO - org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1461)
00:30:04,580 INFO - org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1461)
00:30:04,580 INFO - org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
00:30:04,580 INFO -