Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-14 Thread Mridul Muralidharan
On Sat, Oct 14, 2023 at 3:49 AM Mridul Muralidharan wrote:
> A reducer-oriented view of shuffle, especially without replication, could
> indeed be susceptible to the issue you described (a single fetch failure
> would require all mappers to be recomputed) - note, not necessarily all

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-14 Thread Mridul Muralidharan
A reducer-oriented view of shuffle, especially without replication, could indeed be susceptible to the issue you described (a single fetch failure would require all mappers to be recomputed) - note, though, not necessarily all reducers to be recomputed. Note that I have not looked much
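The blast-radius difference can be sketched as a toy model (this is illustrative Python, not Spark code; the function names and layout assumptions are mine, not Spark's):

```python
# Toy model of how many mappers must rerun after a single fetch failure,
# under two shuffle layouts. Assumption: no replication of shuffle data.

def mappers_to_rerun_map_oriented(lost_mapper_ids, num_mappers):
    # Map-oriented layout: each mapper writes its own output files, so
    # only the mappers whose outputs were actually lost need to rerun.
    return set(lost_mapper_ids)

def mappers_to_rerun_reduce_oriented(num_mappers):
    # Reducer-oriented layout: a merged per-reducer file holds a slice of
    # every mapper's output, so losing one merged file (with no replica)
    # can force every contributing mapper to rerun - though only the
    # affected reducer partition needs re-fetching, not all reducers.
    return set(range(num_mappers))
```

For example, with 100 mappers, a single lost map output reruns 1 task in the first layout but up to all 100 in the second.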

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-14 Thread Mridul Muralidharan
Hi, Spark will try to minimize the recomputation cost as much as possible. For example, if the parent stage was DETERMINATE, it simply needs to recompute the missing (mapper) partitions (the ones which resulted in the fetch failure). Note, this by itself could require further recomputation in the DAG if the
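A minimal sketch of that decision rule (toy Python, not the actual DAGScheduler; names are illustrative assumptions):

```python
# Which map partitions of the parent stage must be recomputed after a
# fetch failure, depending on the parent stage's determinism level.

def map_partitions_to_recompute(parent_determinate, missing, all_partitions):
    if parent_determinate:
        # DETERMINATE: rerunning just the missing partitions reproduces
        # equivalent output, so the surviving map outputs stay valid.
        return set(missing)
    # INDETERMINATE: a rerun may produce different output, so previous
    # shuffle output cannot be mixed with new output - recompute all.
    return set(all_partitions)
```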

Re: [PROPOSAL] Spark stage resubmission for shuffle fetch failure

2023-10-14 Thread Sungwoo Park
a) If one or more tasks for a stage (and so its shuffle id) are going to be recomputed, and it is an INDETERMINATE stage, all shuffle output will be discarded and the stage will be entirely recomputed (see here
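Because an INDETERMINATE stage may produce different output on rerun, the rollback can cascade to downstream stages that already consumed its shuffle output. A toy sketch of that cascade (illustrative Python under my own assumptions, not Spark's scheduler code):

```python
# Toy rollback rule: rerunning any task of an INDETERMINATE stage
# discards all its shuffle output, which in turn invalidates every
# downstream stage that consumed it.

def stages_to_rollback(stage, children, indeterminate):
    if not indeterminate[stage]:
        return {stage}  # determinate: only missing partitions rerun
    rolled_back, frontier = set(), [stage]
    while frontier:
        s = frontier.pop()
        if s in rolled_back:
            continue
        rolled_back.add(s)
        # children consumed output that is now discarded, so they are
        # stale as well and must be rolled back
        frontier.extend(children.get(s, ()))
    return rolled_back
```

For a linear DAG 1 -> 2 -> 3 where stage 1 is INDETERMINATE, a fetch failure forcing stage 1 to rerun rolls back all three stages; the same failure in a DETERMINATE stage 2 rolls back only stage 2.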