If this is a useful feature for local mode, we should open a JIRA to document 
the setting or improve it (I’d prefer to add a spark.local.retries property 
instead of a special URL format). We initially disabled it for everything 
except unit tests because 90% of the time an exception in local mode means a 
problem in the application, and we’d rather let the user debug it right 
away than retry the task several times and leave them wondering why 
they’re getting so many errors.
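
For reference, the existing local[N, maxRetries] master URL discussed below already enables retries in local mode today. A minimal sketch, assuming the Spark 1.x SparkConf/SparkContext API (the app name is just a placeholder):

  import org.apache.spark.{SparkConf, SparkContext}

  // Sketch: "local[4, 3]" requests 4 worker threads and allows each task
  // to fail up to 3 times before the job is aborted; plain "local[4]"
  // hard-codes that limit to 1, i.e. no retries.
  val conf = new SparkConf()
    .setAppName("local-retry-sketch")   // placeholder name
    .setMaster("local[4, 3]")
  val sc = new SparkContext(conf)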

Matei

On Jun 9, 2014, at 11:28 AM, Peng Cheng <pc...@uowmail.edu.au> wrote:

> Oh, and to make things worse, they forgot '\*' in their regex.
> Am I the first to encounter this problem?
> 
> On Mon 09 Jun 2014 02:24:43 PM EDT, Peng Cheng wrote:
>> Thanks a lot! That's very responsive. Somebody has definitely
>> encountered the same problem before and added two hidden modes to the
>> master URL:
>> 
>> (from SparkContext.scala, line 1431)
>> 
>>   // Regular expression for local[N, maxRetries], used in tests with
>>   // failing tasks
>>   val LOCAL_N_FAILURES_REGEX = """local\[([0-9]+)\s*,\s*([0-9]+)\]""".r
>>   // Regular expression for simulating a Spark cluster of
>>   // [N, cores, memory] locally
>>   val LOCAL_CLUSTER_REGEX =
>>     """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r
>> 
>> Unfortunately these never made it into the documentation, and you end
>> up with config parameters scattered across two different places (the
>> master URL and spark.task.maxFailures).
>> I'm thinking of adding a new config parameter,
>> spark.task.maxLocalFailures, to override the default of 1. What do you
>> think?
>> 
>> Thanks again buddy.
>> 
>> Yours Peng
>> 
>> On Mon 09 Jun 2014 01:33:45 PM EDT, Aaron Davidson wrote:
>>> Looks like your problem is local mode:
>>> https://github.com/apache/spark/blob/640f9a0efefd42cff86aecd4878a3a57f5ae85fa/core/src/main/scala/org/apache/spark/SparkContext.scala#L1430
>>> 
>>> 
>>> For some reason, someone decided not to do retries when running in
>>> local mode. I'm not exactly sure why; feel free to submit a JIRA on this.
>>> 
>>> 
>>> On Mon, Jun 9, 2014 at 8:59 AM, Peng Cheng <pc...@uow.edu.au
>>> <mailto:pc...@uow.edu.au>> wrote:
>>> 
>>>    I speculate that Spark will only retry on exceptions that are
>>>    registered with TaskSetScheduler, so a definitely-will-fail task
>>>    will fail quickly without taking more resources. However, I haven't
>>>    found any documentation or web page on it.
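
On a cluster, the retry budget is controlled by the documented spark.task.maxFailures setting; a minimal sketch of raising it (the master URL and app name below are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  // Sketch: allow each task up to 8 failures before the job is aborted.
  // Note this setting is ignored in plain local mode, where the limit is
  // hard-coded to 1 unless the local[N, maxRetries] form is used.
  val conf = new SparkConf()
    .setAppName("fault-tolerance-example")   // placeholder
    .setMaster("spark://master-host:7077")   // placeholder cluster master
    .set("spark.task.maxFailures", "8")
  val sc = new SparkContext(conf)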
>>> 
>>> 
>>> 
>>>    --
>>>    View this message in context:
>>>    http://apache-spark-user-list.1001560.n3.nabble.com/How-to-enable-fault-tolerance-tp7250p7255.html
>>>    Sent from the Apache Spark User List mailing list archive at
>>>    Nabble.com.
>>> 
>>> 
