Faiz Halde created SPARK-43407:
----------------------------------

             Summary: Can executors recover/reuse shuffle files upon failure?
                 Key: SPARK-43407
                 URL: https://issues.apache.org/jira/browse/SPARK-43407
             Project: Spark
          Issue Type: Question
          Components: Spark Core
    Affects Versions: 3.3.1
            Reporter: Faiz Halde


Hello,

We've been in touch with a few Spark specialists who suggested a potential 
solution to improve the reliability of our shuffle-heavy jobs.

Here is what our setup looks like:
 * Spark version: 3.3.1
 * Java version: 1.8
 * We do not use external shuffle service
 * We use spot instances

We run Spark jobs on clusters backed by Amazon EBS volumes, and spark.local.dir 
is mounted on such an EBS volume. One of the offerings from the service we use 
is EBS migration: if a host is about to be evicted, a new host is created and 
the EBS volume is re-attached to it.
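
For concreteness, here is a minimal sketch of the relevant part of our 
configuration (the mount path and app name below are illustrative, not our 
actual values):
{noformat}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative mount point for the migrated EBS volume (not our real path).
// spark.local.dir is where executors write shuffle and spill files; on some
// cluster managers it is overridden by SPARK_LOCAL_DIRS, so treat this as a
// sketch rather than the exact way we wire it up.
val conf = new SparkConf()
  .setAppName("shuffle-on-ebs")
  .set("spark.local.dir", "/mnt/ebs/spark-local")

val spark = SparkSession.builder().config(conf).getOrCreate()
{noformat}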

The suggestion is that when Spark assigns a new executor to the newly created 
instance, that executor can recover all the shuffle files already persisted on 
the migrated EBS volume.

Is this how it works? Do executors recover / re-register shuffle files that 
they find on local disk?

So far I have not come across any such recovery mechanism. The only thing I can 
find is
{noformat}
KubernetesLocalDiskShuffleDataIO{noformat}
which has a pre-init step where it tries to register the shuffle files it finds 
on disk with the executor.
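
For reference, my understanding is that this plugin is enabled through the 
shuffle IO plugin config, roughly as in the sketch below. I'm treating the 
exact config key, the fully-qualified class name, and whether this is even 
available in 3.3.1 (it appears to be Kubernetes-specific) as assumptions on my 
part:
{noformat}
import org.apache.spark.SparkConf

// Sketch only: swap the default sort-shuffle IO plugin for the Kubernetes
// variant, whose pre-init step scans the local dirs and tries to re-register
// any shuffle files it finds with the new executor.
val conf = new SparkConf()
  .set("spark.shuffle.sort.io.plugin.class",
       "org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO")
{noformat}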

A natural follow-up to this:

If what they claim is true, then we should also expect that when an executor is 
killed or OOM'd and a new executor is spawned on the same host, the new 
executor registers the existing shuffle files with itself. Is that the case?

Thanks


