The configuration comes from the spark-ec2 setup script, which sets spark.local.dir to /mnt/spark,/mnt2/spark.
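For reference, setting it programmatically would look roughly like this (just an illustrative sketch; spark-ec2 actually writes the value into the cluster's config files rather than into application code):

    import org.apache.spark.SparkConf

    // spark.local.dir takes a comma-separated list of scratch directories;
    // Spark spreads shuffle output and spill files across all of them.
    val conf = new SparkConf()
      .setAppName("shark-example")
      .set("spark.local.dir", "/mnt/spark,/mnt2/spark")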
The setup actually worked for quite some time, and then one of the nodes hit disk errors such as:

mv: cannot remove `/mnt2/spark/spark-local-20140409182103-c775/09/shuffle_1_248_0': Read-only file system
mv: cannot remove `/mnt2/spark/spark-local-20140409182103-c775/24/shuffle_1_260_0': Read-only file system
mv: cannot remove `/mnt2/spark/spark-local-20140409182103-c775/24/shuffle_2_658_0': Read-only file system

I understand the issue is at the hardware level, but thought it would be great if Spark could handle it and avoid the cluster going down.

On Mon, Apr 14, 2014 at 7:58 PM, giive chen <thegi...@gmail.com> wrote:

> Hi Praveen,
>
> What is your config for "spark.local.dir"?
> Do all your workers have this dir, and do they all have the right
> permissions on it?
>
> I think this is the reason for your error.
>
> Wisely Chen
>
>
> On Mon, Apr 14, 2014 at 9:29 PM, Praveen R
> <prav...@sigmoidanalytics.com> wrote:
>
>> Had the below error while running Shark queries on a 30-node cluster and
>> was not able to start the Shark server or run any jobs.
>>
>> 14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor 4
>> (already removed): Failed to create local directory (bad spark.local.dir?)
>> Full log: https://gist.github.com/praveenr019/10647049
>>
>> After spending quite some time, found it was due to disk read errors on
>> one node, and had the cluster working after removing the node.
>>
>> Wanted to know if there is any configuration (like akkaTimeout) which can
>> handle this, or does Mesos help?
>>
>> Shouldn't the worker be marked dead in such a scenario, instead of making
>> the cluster unusable, so the debugging can be done at leisure?
>>
>> Thanks,
>> Praveen R
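Until Spark handles this itself, a pre-flight check on each worker could catch a bad local dir before it takes the cluster down. A minimal sketch (checkLocalDirs is a hypothetical helper, not part of Spark; it assumes the same dirs configured in spark.local.dir):

    import java.io.File

    // Hypothetical pre-flight check (not part of Spark): verify each
    // configured local dir exists and is genuinely writable. Permission
    // bits alone may not reflect a disk that has dropped to read-only,
    // so attempt a real write rather than relying on File.canWrite.
    def checkLocalDirs(dirs: Seq[String]): Unit =
      for (path <- dirs) {
        val dir = new File(path)
        require(dir.isDirectory, s"$path is missing or not a directory")
        // createTempFile throws IOException on a read-only filesystem
        val probe = File.createTempFile("local-dir-check", null, dir)
        probe.delete()
      }

    checkLocalDirs(Seq("/mnt/spark", "/mnt2/spark"))

Running something like this before starting the worker would fail fast on the bad node instead of leaving the whole cluster unusable.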