The configuration comes from the spark-ec2 setup script, which sets spark.local.dir to /mnt/spark,/mnt2/spark.
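For reference, setting it programmatically would look roughly like this (just an illustrative sketch; spark-ec2 actually writes the value into the cluster's config files rather than into application code):

    import org.apache.spark.SparkConf

    // spark.local.dir takes a comma-separated list of scratch directories;
    // Spark spreads shuffle output and spill files across all of them.
    val conf = new SparkConf()
      .setAppName("shark-example")
      .set("spark.local.dir", "/mnt/spark,/mnt2/spark")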
The setup actually worked for quite some time, and then one of the nodes hit disk errors such as:

mv: cannot remove `/mnt2/spark/spark-local-20140409182103-c775/09/shuffle_1_248_0': Read-only file system
mv: cannot remove `/mnt2/spark/spark-local-20140409182103-c775/24/shuffle_1_260_0': Read-only file system
mv: cannot remove `/mnt2/spark/spark-local-20140409182103-c775/24/shuffle_2_658_0': Read-only file system

I understand the issue is at the hardware level, but thought it would be great if Spark could handle it and avoid the cluster going down.

On Mon, Apr 14, 2014 at 7:58 PM, giive chen <thegi...@gmail.com> wrote:

> Hi Praveen,
>
> What is your config for "spark.local.dir"?
> Do all your workers have this dir, and do they all have the right
> permissions on it?
>
> I think this is the reason for your error.
>
> Wisely Chen
>
>
> On Mon, Apr 14, 2014 at 9:29 PM, Praveen R
> <prav...@sigmoidanalytics.com> wrote:
>
>> Had the below error while running Shark queries on a 30-node cluster and
>> was not able to start the Shark server or run any jobs.
>>
>> 14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor 4
>> (already removed): Failed to create local directory (bad spark.local.dir?)
>> Full log: https://gist.github.com/praveenr019/10647049
>>
>> After spending quite some time, found it was due to disk read errors on
>> one node, and had the cluster working after removing the node.
>>
>> Wanted to know if there is any configuration (like akkaTimeout) which can
>> handle this, or does Mesos help?
>>
>> Shouldn't the worker be marked dead in such a scenario, instead of making
>> the cluster unusable, so the debugging can be done at leisure?
>>
>> Thanks,
>> Praveen R
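Until Spark handles this itself, a pre-flight check on each worker could catch a bad local dir before it takes the cluster down. A minimal sketch (checkLocalDirs is a hypothetical helper, not part of Spark; it assumes the same dirs configured in spark.local.dir):

    import java.io.File

    // Hypothetical pre-flight check (not part of Spark): verify each
    // configured local dir exists and is genuinely writable. Permission
    // bits alone may not reflect a disk that has dropped to read-only,
    // so attempt a real write rather than relying on File.canWrite.
    def checkLocalDirs(dirs: Seq[String]): Unit =
      for (path <- dirs) {
        val dir = new File(path)
        require(dir.isDirectory, s"$path is missing or not a directory")
        // createTempFile throws IOException on a read-only filesystem
        val probe = File.createTempFile("local-dir-check", null, dir)
        probe.delete()
      }

    checkLocalDirs(Seq("/mnt/spark", "/mnt2/spark"))

Running something like this before starting the worker would fail fast on the bad node instead of leaving the whole cluster unusable.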