Hi,

We are running a Spark Streaming application using the Kafka Direct Stream
API on Spark 1.6.
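
For context, the stream is created with the direct (receiverless) Kafka
integration, roughly like the sketch below; the broker list, topic name,
and batch interval here are placeholders rather than our actual values:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-direct-stream")  // placeholder app name
val ssc = new StreamingContext(conf, Seconds(30))             // placeholder batch interval

// Placeholder broker and topic, not our real configuration
val kafkaParams = Map[String, String]("metadata.broker.list" -> "broker1:9092")
val topics = Set("events")

// Receiverless direct stream: one RDD partition per Kafka partition per batch
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)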

It had run for a few days without any errors or failed tasks, and then there
was an error creating a directory on one machine, as follows:



Job aborted due to stage failure: Task 1 in stage 158757.0 failed 4 times,
most recent failure: Lost task 1.3 in stage 158757.0 (TID 4091253,
dc1hdpd5.gst.gov.in): java.io.IOException: Failed to create local dir in
/hadoop-data-dir4/yarn/nm/usercache/hdfs/appcache/application_1500846219148_0003/blockmgr-9bba3cf8-9c4c-4461-a1d0-8e78d24f679f/23.

When we checked, we found a hardware problem on that machine affecting that
directory. This exact same error appears to be failing many jobs.

In the Streaming tab of the Spark UI, the failed jobs show a status of
'succeeded' even though they did not complete. How should we interpret that?
Have these jobs failed completely?

In the Executors tab, I can see many failed tasks on the defective machine.
Is there a way to recover those failed jobs (or the offsets for which they
failed)? Should we handle this type of error programmatically, or let Spark
handle it?
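
When we say "programmatically", what we have in mind is something like the
sketch below, building on the stream variable from the snippet above:
capturing the offset ranges of each batch so that a batch which failed on
the bad node could, in principle, be re-read from Kafka later. The logging
is only illustrative, and the durable store we would write the ranges to is
hypothetical.

import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  // The direct stream exposes the exact Kafka offsets backing this batch
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the batch ...

  // Hypothetical: persist the ranges somewhere durable so a failed batch
  // could be replayed from Kafka after the hardware issue is fixed
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
  }
}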

In the same tab, I can also see some 'active' tasks on the non-defective
machines, and their count seems to increase along with the failed tasks on
the defective machine. However, the Streaming tab shows no active tasks
running, and the number of active tasks stopped increasing after we killed
the executor process on the defective machine. Why is this happening?


Regards,

Venu.
