Thank you for the detailed explanation.
Please clarify the point below:
 .... If one executor fails, it moves the processing over to another executor. 
However, if the data is lost, it re-executes the processing that generated the 
data, and might have to go back to the source.

Does this mean that only the tasks that the failed executor was running at 
the time need to be rerun to regenerate the lost data? I read somewhere 
that the RDD lineage keeps track of what needs to be re-executed.
best
    On Friday, 25 June 2021, 16:23:32 BST, Lalwani, Jayesh 
<jlalw...@amazon.com.invalid> wrote:  
 
Spark replicates the partitions among multiple nodes. If one executor fails, it 
moves the processing over to another executor. However, if the data is lost, it 
re-executes the processing that generated the data, and might have to go back 
to the source.
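
For illustration, the chain of transformations Spark would re-execute (the RDD 
lineage) can be printed with toDebugString. A minimal sketch in Scala; the input 
path and the transformations are made up:

import org.apache.spark.sql.SparkSession

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lineage-demo").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path, purely for illustration
    val lines  = sc.textFile("hdfs:///tmp/input.txt")
    val words  = lines.flatMap(_.split("\\s+"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // toDebugString prints the lineage: the chain of transformations Spark
    // would replay to rebuild a lost partition of `counts`.
    println(counts.toDebugString)

    counts.count()
    spark.stop()
  }
}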
 
  
 
In case of failure, there will be a delay in getting results. The amount of delay 
depends on how much reprocessing Spark needs to do.
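
One way to bound that reprocessing is to replicate or checkpoint expensive 
intermediate results, so recovery does not have to go all the way back to the 
source. A sketch only; the paths and the "expensive" stage are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object RecoveryDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("recovery-demo").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical paths, for illustration only
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")
    val raw = sc.textFile("hdfs:///tmp/events")

    // Placeholder for some expensive chain of transformations
    val expensive = raw.map(_.trim).filter(_.nonEmpty)

    // The "_2" storage levels keep a second replica on another executor,
    // so losing one executor does not force a recomputation.
    expensive.persist(StorageLevel.MEMORY_AND_DISK_2)

    // Checkpointing writes the RDD to reliable storage and truncates the
    // lineage, so recovery does not go all the way back to the source.
    expensive.checkpoint()

    expensive.count()
    spark.stop()
  }
}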
 
  
 
When the driver executes an action, it submits a job to the Cluster Manager. 
The Cluster Manager starts submitting tasks to executors and monitoring them. 
If an executor dies, the Cluster Manager does the work of reassigning its 
tasks. While all of this is going on, the driver is just sitting there waiting 
for the action to complete. So, the driver does nothing, really. The Cluster 
Manager is doing most of the work of managing the workload.
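
For reference, a sketch of the settings that control how many times a failed 
task or stage is retried before the job is failed; the values shown are the 
defaults, not recommendations:

import org.apache.spark.sql.SparkSession

object RetryConfigDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("retry-config-demo")
      // How many times a single task may fail (it is retried, typically on
      // another executor) before the whole job is failed. Default: 4.
      .config("spark.task.maxFailures", "4")
      // How many consecutive attempts of a stage are allowed before it is
      // aborted. Default: 4.
      .config("spark.stage.maxConsecutiveAttempts", "4")
      .getOrCreate()

    // ... job body ...
    spark.stop()
  }
}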
 
  
 
Spark, by itself, doesn't add executors when executors fail. It just moves the 
tasks to other executors. If you are installing plain vanilla Spark on your own 
cluster, you need to figure out how to bring back executors. Most of the 
popular platforms built on top of Spark (Glue, EMR, Kubernetes) will replace 
failed nodes. You need to look into the capabilities of your chosen platform.
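
As an illustration of the knobs the platform side looks after (YARN shown here; 
the numbers are placeholders, not recommendations):

import org.apache.spark.sql.SparkSession

object ExecutorFailureConfigDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("executor-failure-demo")
      // Target executor count; on YARN or Kubernetes the cluster manager
      // requests replacement containers/pods to get back to this number.
      .config("spark.executor.instances", "12")
      // On YARN, how many executor failures the application tolerates
      // before it is failed outright.
      .config("spark.yarn.max.executor.failures", "24")
      .getOrCreate()

    // ... job body ...
    spark.stop()
  }
}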
 
  
 
If the driver dies, the Spark job dies. There’s no recovering from that. The 
only way to recover is to run the job again. Batch jobs do not have 
checkpointing, so they will need to reprocess everything from the beginning. 
You need to write your jobs to be idempotent; i.e., rerunning them shouldn’t 
change the outcome. Streaming jobs have checkpointing, and they will start from 
the last microbatch. This means that they might have to repeat the last 
microbatch.
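
A minimal Structured Streaming sketch of that restart behaviour; the source, 
sink and paths are made up. The checkpointLocation is what lets a restarted 
query resume from the last committed micro-batch:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object StreamingCheckpointDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("checkpoint-demo").getOrCreate()

    // Toy "rate" source, purely for illustration
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()

    val query = stream.writeStream
      .format("parquet")
      .option("path", "hdfs:///tmp/out")               // hypothetical path
      .option("checkpointLocation", "hdfs:///tmp/chk") // hypothetical path
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()

    query.awaitTermination()
  }
}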
 
  
 
From: "ashok34...@yahoo.com.INVALID" <ashok34...@yahoo.com.INVALID>
Date: Friday, June 25, 2021 at 10:38 AM
To: "user@spark.apache.org" <user@spark.apache.org>

Subject: [EXTERNAL] Recovery when two spark nodes out of 6 fail
  
 


  
 
Greetings,
 
  
 
This is a scenario for which we need to come up with comprehensive answers, 
please.
 
  
 
If we have 6 Spark VMs, each running two executors via spark-submit:
 
  
    
   - we have two VM failures at the H/W level (rack failure)
   - we lose 4 Spark executors out of 12
   - this happens halfway through the spark-submit job
 
So my humble questions are:
 
  
    
   - Will any data be lost from the final result due to the missing nodes?
   - How will RDD lineage handle this?
   - Will there be any delay in getting the final result?
   - How will the driver handle these two node failures?
   - Will additional executors be added to the existing nodes, or will the 
existing executors take over the work of the 4 failed executors?
   - What happens if we are running in client mode and the node holding the 
driver dies?
   - What happens if the same occurs in cluster mode?
 
  
 
I searched Google but found no satisfactory answers, gurus, hence I am turning 
to the forum.
 
  
 
Best
 
  
 
A.K.
   
