Steve Loughran commented on YARN-3668:

I don't see this as being an issue. In Slider we use AM restart with container 
preservation and rebuild state. We know this works because we test it with 
triggered AM failures.

# If the AM fails too often within a predefined window, YARN kills it. But 
that's a sign of the AM being unreliable. You shouldn't keep running it all the 
time, as it means your code is failing at a rate where the app is probably 
unusable. YARN halts and implicitly says "your code is unreliable" (though it 
could be you are asking for too little RAM and exceeding container limits).
# On NM failure with work-preservation, containers keep running. There's a fun 
condition there where if the NM doesn't come up, you now have containers that 
YARN believes are dead. Slider handles this when they report in to the AM: they 
are told they should shut down as they are no longer managed.
# When the AM restarts, its JARs are re-downloaded from HDFS. If you update the 
JAR before the restart, the new version is picked up. This is how we actually 
implement zero-downtime upgrades of Slider-managed clusters.

In Slider we also track windowed failure rates of deployed components (e.g. 
HBase region servers) and nodes; SLIDER-856 tries to differentiate them, so we 
can distinguish "unreliable nodes" from "unreliable components". When a 
component fails too many times in the window, Slider just gives up and says 
"your app or its configuration is broken". This stops it trying to constantly 
restart a failing component and have a 6 GB error log within a few hours.
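
To make the "too many failures in the window" idea concrete, here is a minimal 
sketch of a sliding-window failure counter. This is not Slider's actual code: 
the class name, the method names and the window/limit parameters are all 
hypothetical, and it assumes failure timestamps arrive in non-decreasing order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of a windowed failure tracker (not Slider's real
 * implementation). Assumes timestamps are passed in non-decreasing order.
 */
class WindowedFailureTracker {
    private final long windowMillis;   // length of the sliding window
    private final int maxFailures;     // failures tolerated inside one window
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    WindowedFailureTracker(long windowMillis, int maxFailures) {
        if (windowMillis <= 0 || maxFailures < 0) {
            throw new IllegalArgumentException("window must be > 0 and limit >= 0");
        }
        this.windowMillis = windowMillis;
        this.maxFailures = maxFailures;
    }

    /**
     * Records a failure at the given time.
     * Returns true when the number of failures in the window now exceeds
     * the limit, i.e. when the caller should give up instead of restarting.
     */
    synchronized boolean recordFailure(long nowMillis) {
        failureTimes.addLast(nowMillis);
        return failuresInWindow(nowMillis) > maxFailures;
    }

    /** Drops failures that have aged out of the window and counts the rest. */
    synchronized int failuresInWindow(long nowMillis) {
        long cutoff = nowMillis - windowMillis;
        while (!failureTimes.isEmpty() && failureTimes.peekFirst() <= cutoff) {
            failureTimes.removeFirst();
        }
        return failureTimes.size();
    }
}
```

One tracker per component and one per node would be a way to tell an 
unreliable component from an unreliable node: call recordFailure() on both 
when a container dies and see which one crosses its limit first.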

One thing we don't handle is what's covered in the title: what if YARN itself 

> Long run service shouldn't be killed even if Yarn crashed
> ---------------------------------------------------------
>                 Key: YARN-3668
>                 URL: https://issues.apache.org/jira/browse/YARN-3668
>             Project: Hadoop YARN
>          Issue Type: Wish
>            Reporter: sandflee
> For long running service, it shouldn't be killed even if all yarn component 
> crashed, with RM work preserving and NM restart, yarn could take over 
> applications again.
