Re: Lost an executor error - Jobs fail

2014-04-14 Thread Aaron Davidson
Hmm, interesting. I created https://issues.apache.org/jira/browse/SPARK-1499 to track the issue of Workers continuously spewing bad executors, but the real issue seems to be a combination of that and some other bug in Shark or Spark which fails to handle the situation properly. Please let us know i

Re: Lost an executor error - Jobs fail

2014-04-14 Thread Praveen R
Unfortunately, queries kept failing with SparkTask101 errors, and I got them working after removing the troublesome node. FAILED: Execution Error, return code -101 from shark.execution.SparkTask I wish it were easy to reproduce. I shall try to forcibly remove write permissions on on

Re: Lost an executor error - Jobs fail

2014-04-14 Thread Aaron Davidson
Cool! It's pretty rare to actually get logs from a wild hardware failure. The problem is, as you said, that the executor keeps failing, but the worker doesn't get the hint, so it keeps creating new, bad executors. However, this issue should not have caused your cluster to fail to start up. In the l

Re: Lost an executor error - Jobs fail

2014-04-14 Thread Praveen R
The configuration comes from the spark-ec2 setup script, which sets spark.local.dir to use /mnt/spark, /mnt2/spark. The setup actually worked for quite some time, and then one of the nodes had disk errors such as mv: cannot remove `/mnt2/spark/spark-local-20140409182103-c775/09/shuffle_1_248_0': Read-only fi

Re: Lost an executor error - Jobs fail

2014-04-14 Thread giive chen
Hi Praveen, What is your config for "spark.local.dir"? Do all your workers have this dir, and do all workers have the right permissions on it? I think this is the reason for your error. Wisely Chen On Mon, Apr 14, 2014 at 9:29 PM, Praveen R wrote: > Had below error while running shark queries on
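
The directory and permission check asked about above could be sketched roughly as follows, to be run on each worker. This is only an illustration, not something from the thread: the helper name `check_local_dirs` is hypothetical, and it assumes the spark.local.dir value is passed in as a comma-separated string such as "/mnt/spark,/mnt2/spark".

```python
import os
import tempfile


def check_local_dirs(local_dirs):
    """Map each directory in a comma-separated list to None (usable) or a problem string.

    Hypothetical helper for illustration; not part of Spark or Shark.
    """
    results = {}
    for path in (p.strip() for p in local_dirs.split(",")):
        if not path:
            continue
        if not os.path.isdir(path):
            results[path] = "missing"
            continue
        try:
            # Create and delete a real file, which is the operation an executor
            # needs to succeed; this fails on a read-only file system.
            with tempfile.NamedTemporaryFile(dir=path):
                pass
            results[path] = None
        except OSError as err:
            results[path] = "not writable: {}".format(err.strerror)
    return results


if __name__ == "__main__":
    # Assumed value, taken from the directories mentioned in this thread.
    for path, problem in check_local_dirs("/mnt/spark,/mnt2/spark").items():
        print(path, "OK" if problem is None else problem)
```

A directory on a disk that has been remounted read-only would be expected to show up as "not writable" here, while a directory that was never created shows up as "missing".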

Lost an executor error - Jobs fail

2014-04-14 Thread Praveen R
I hit the error below while running Shark queries on a 30-node cluster and was not able to start the Shark server or run any jobs. 14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor 4 (already removed): Failed to create local directory (bad spark.local.dir?) Full log: https://gist.githu