Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-25 Thread Martin Peng
Cool, thanks Kang! I will check and let you know.
Sorry for the delay; there was an urgent customer issue today.

Best
Martin


Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-24 Thread 周康
* If the file exists but is a directory rather than a regular file, does
* not exist but cannot be created, or cannot be opened for any other
* reason then a FileNotFoundException is thrown.

Looking into FileOutputStream, I saw the Javadoc comment above. So check the
executor node first (it may be a permissions problem, no free space, or not
enough file descriptors).
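
As a rough illustration (the path below is hypothetical), opening a FileOutputStream whose parent directory has been removed fails with exactly this error:

import java.io.{File, FileNotFoundException, FileOutputStream}

// Illustration only: the parent directory does not exist, so open fails
// with the same "(No such file or directory)" message seen in the trace.
val target = new File("/tmp/does-not-exist/shuffle_0_0_0.index") // hypothetical path
try {
  new FileOutputStream(target).close()
} catch {
  case e: FileNotFoundException => println(s"open failed: ${e.getMessage}")
}

If the directory exists but is not writable by the executor's user, the same exception is thrown with "(Permission denied)" instead.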


Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-24 Thread 周康
You can also check whether there is enough space left on the executor node to
store the shuffle files.
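
For a quick check, something like the sketch below reports the usable space on each shuffle directory; the directory list is an assumption, so substitute your actual spark.local.dir entries (or the Mesos sandbox work_dir from the trace):

import java.io.File

// Sketch: report free space on each directory used for shuffle output.
val shuffleDirs = Seq("/mnt/mesos/work_dir") // assumption: use your spark.local.dir entries
shuffleDirs.foreach { d =>
  val dir = new File(d)
  val freeGb = dir.getUsableSpace.toDouble / (1024L * 1024 * 1024)
  println(f"$d%s: $freeGb%.2f GB free")
}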

Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-24 Thread 周康
First, Spark retries failed tasks, so if the job ended normally this error can
be ignored.
Second, when BypassMergeSortShuffleWriter is used, it first writes the data
file and then writes the index file.
You can check for "Failed to delete temporary index file at" or "fail to rename
file" in the related executor node's log file.


Re: Spark Job crash due to File Not found when shuffle intermittently

2017-07-24 Thread Martin Peng
Could anyone shed some light on this issue?

Thanks
Martin


Spark Job crash due to File Not found when shuffle intermittently

2017-07-21 Thread Martin Peng
Hi,

I have several Spark jobs, both batch and streaming, that process and analyze
our system logs. We use Kafka as the pipeline connecting the jobs.

Since upgrading to Spark 2.1.0 + Spark Kafka Streaming 010, some of the jobs
(both batch and streaming) have started throwing the exception below at random
times (sometimes after several hours of running, sometimes after only 20
minutes). Can anyone suggest how I can find the real root cause? (Google
results have not been very helpful...)

Thanks,
Martin

00:30:04,510 WARN  - 17/07/22 00:30:04 WARN TaskSetManager: Lost task 60.0 in stage 1518490.0 (TID 338070, 10.133.96.21, executor 0): java.io.FileNotFoundException: /mnt/mesos/work_dir/slaves/20160924-021501-274760970-5050-7646-S2/frameworks/40aeb8e5-e82a-4df9-b034-8815a7a7564b-2543/executors/0/runs/fd15c15d-2511-4f37-a106-27431f583153/blockmgr-a0e0e673-f88b-4d12-a802-c35643e6c6b2/33/shuffle_2090_60_0.index.b66235be-79be-4455-9759-1c7ba70f91f6 (No such file or directory)
00:30:04,510 WARN  - at java.io.FileOutputStream.open0(Native Method)
00:30:04,510 WARN  - at java.io.FileOutputStream.open(FileOutputStream.java:270)
00:30:04,510 WARN  - at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
00:30:04,510 WARN  - at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
00:30:04,510 WARN  - at org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:144)
00:30:04,510 WARN  - at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:128)
00:30:04,510 WARN  - at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
00:30:04,510 WARN  - at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
00:30:04,510 WARN  - at org.apache.spark.scheduler.Task.run(Task.scala:99)
00:30:04,510 WARN  - at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
00:30:04,510 WARN  - at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
00:30:04,510 WARN  - at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
00:30:04,510 WARN  - at java.lang.Thread.run(Thread.java:748)

00:30:04,580 INFO  - Driver stacktrace:
00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
00:30:04,580 INFO  - scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
00:30:04,580 INFO  - scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
00:30:04,580 INFO  - scala.Option.foreach(Option.scala:257)
00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
00:30:04,580 INFO  - org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
00:30:04,580 INFO  - org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
00:30:04,580 INFO  - org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
00:30:04,580 INFO  - org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
00:30:04,580 INFO  - org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
00:30:04,580 INFO  - org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
00:30:04,580 INFO  - org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
00:30:04,580 INFO  - org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
00:30:04,580 INFO  - org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
00:30:04,580 INFO  - org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
00:30:04,580 INFO  - org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
00:30:04,580 INFO  - org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
00:30:04,580 INFO  - org.apache.spark.rdd.RDD.take(RDD.scala:1326)
00:30:04,580 INFO  - org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1461)
00:30:04,580 INFO  - org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1461)
00:30:04,580 INFO  - org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1461)
00:30:04,580 INFO  - org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
00:30:04,580 INFO  -