Unfortunately, queries kept failing with SparkTask -101 errors, and they started working again only after the troublesome node was removed:
FAILED: Execution Error, return code -101 from shark.execution.SparkTask

I wish it had been easy to reproduce. I shall try hard-removing write
permissions on one node to see whether the same error happens.

On Tue, Apr 15, 2014 at 9:17 AM, Aaron Davidson <ilike...@gmail.com> wrote:

> Cool! It's pretty rare to actually get logs from a wild hardware failure.
> The problem is, as you said, that the executor keeps failing, but the
> worker doesn't get the hint, so it keeps creating new, bad executors.
>
> However, this issue should not have caused your cluster to fail to start
> up. In the linked logs, for instance, the shark shell started up just fine
> (though the "shark>" prompt was lost in some of the log messages). Queries
> should have been able to execute just fine. Was this not the case?
>
>
> On Mon, Apr 14, 2014 at 7:38 AM, Praveen R <prav...@sigmoidanalytics.com> wrote:
>
>> The configuration comes from the spark-ec2 setup script, which sets
>> spark.local.dir to use /mnt/spark, /mnt2/spark.
>> The setup actually worked for quite some time, and then one of the nodes
>> started showing disk errors such as:
>>
>> mv: cannot remove `/mnt2/spark/spark-local-20140409182103-c775/09/shuffle_1_248_0': Read-only file system
>> mv: cannot remove `/mnt2/spark/spark-local-20140409182103-c775/24/shuffle_1_260_0': Read-only file system
>> mv: cannot remove `/mnt2/spark/spark-local-20140409182103-c775/24/shuffle_2_658_0': Read-only file system
>>
>> I understand the issue is at the hardware level, but I thought it would
>> be great if Spark could handle it and avoid the cluster going down.
>>
>>
>> On Mon, Apr 14, 2014 at 7:58 PM, giive chen <thegi...@gmail.com> wrote:
>>
>>> Hi Praveen,
>>>
>>> What is your config for spark.local.dir?
>>> Does every worker have this directory, and does every worker have the
>>> right permissions on it?
>>>
>>> I think this is the cause of your error.
>>>
>>> Wisely Chen
>>>
>>>
>>> On Mon, Apr 14, 2014 at 9:29 PM, Praveen R <prav...@sigmoidanalytics.com> wrote:
>>>
>>>> I hit the error below while running Shark queries on a 30-node cluster
>>>> and was not able to start the shark server or run any jobs.
>>>>
>>>> 14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor
>>>> 4 (already removed): Failed to create local directory (bad
>>>> spark.local.dir?)
>>>> Full log: https://gist.github.com/praveenr019/10647049
>>>>
>>>> After spending quite some time, I found it was due to disk read errors
>>>> on one node, and had the cluster working after removing that node.
>>>>
>>>> I wanted to know if there is any configuration (like akkaTimeout) that
>>>> can handle this, or does Mesos help?
>>>>
>>>> Shouldn't the worker be marked dead in such a scenario, instead of
>>>> making the cluster unusable, so the debugging can be done at leisure?
>>>>
>>>> Thanks,
>>>> Praveen R
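For context, here is a minimal sketch of the setting under discussion,
assuming a standalone Spark application (the app name is made up; the
directory paths mirror the spark-ec2 defaults mentioned in the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.local.dir takes a comma-separated list of scratch directories.
    // Spark spreads shuffle and spill files across all of them, so a single
    // disk going read-only can fail any task whose files land on it.
    val conf = new SparkConf()
      .setAppName("local-dir-sketch")                    // hypothetical name
      .set("spark.local.dir", "/mnt/spark,/mnt2/spark")  // spark-ec2 defaults per the thread
    val sc = new SparkContext(conf)

Note this only controls where the scratch space lives; as Aaron points out
above, the worker keeps creating new executors on the bad node rather than
blacklisting it, which is why removing the node was the effective fix here.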