I hit the error below while running Shark queries on a 30-node cluster, and was
not able to start the Shark server or run any jobs.

14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor 4
(already removed): Failed to create local directory (bad spark.local.dir?)

Full log: https://gist.github.com/praveenr019/10647049

After spending quite some time, I found it was due to disk read errors on one
node; the cluster worked again after removing that node.
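For context, this is the setting the error message points at: spark.local.dir
accepts a comma-separated list of scratch directories, so it can be spread
across several disks (the paths below are just examples, not my actual layout):

```shell
# conf/spark-env.sh -- illustrative paths, adjust for your nodes.
# spark.local.dir takes a comma-separated list of directories; Spark
# spreads shuffle and spill files across all of them.
export SPARK_JAVA_OPTS="-Dspark.local.dir=/mnt/disk1/spark,/mnt/disk2/spark"
```

Even with multiple directories listed, a single failing disk on one node still
triggered the error above for me.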

I wanted to know if there is any configuration (like akkaTimeout) that can
handle this, or whether Mesos helps?

Shouldn't the worker be marked dead in such a scenario, instead of making the
cluster unusable, so the debugging can be done at leisure?

Thanks,
Praveen R
