Thanks Steve and Rohith,

So yeah, I just realized it is a start counter instead of a restart counter =P
I tried it with a different AppMaster after restarting the RM and it is working.
It seems the problem was with the other AppMaster. I will dig into why it went wrong.

Thanks again for the insights and help!

- Henry

On Thu, Jun 4, 2015 at 2:12 AM, Steve Loughran <[email protected]> wrote:
>
>> On 3 Jun 2015, at 22:01, Henry Saputra <[email protected]> wrote:
>>
>> Hi All,
>>
>> I would like to know if "yarn.resourcemanager.am.max-attempts" config
>> parameter will make the already running ApplicationMaster (AM) to have
>> HA mode in YARN once it is already running?
>>
>
> if you can reconfigure the RM and restart it, the value will be picked up by
> the RM (rolling upgrades and an HA cluster let you do that)
>
> for long-lived services, you should have the cluster set up with a window for
> failures, so that sporadic, intermittent failures don't kill the app.
>
>> Meaning that if the running AM process dies (through permgen, OOM, or
>> kill JVM with kill signal) then ResourceManager (RM) should be able to
>> restart the number of times specified by
>> "yarn.resourcemanager.am.max-attempts" config value ?
>
> yes, though it's a "start counter", not a restart counter. That first run
> counts as attempt #1
>
>>
>> I was trying it and it seems like there was an attempt to restart
>> the AppMaster but it dies immediately.
>>
>
> with a default cluster restart value of 2, two failures in a row is enough to
> kill the app.
>
> In https://issues.apache.org/jira/browse/YARN-2392 I've a patch to give you
> more details on count-exceeded values; global and app limits, plus window
> details.
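
For anyone finding this thread later, the "start counter" behaviour and the failure window can be illustrated with a small sketch. This is not YARN code: the function `am_survives` and its parameters are hypothetical names made up for illustration, and the exact boundary handling of the window is an assumption. It only models the rule discussed above, namely that the first launch counts as attempt #1, so a limit of 2 allows a single restart.

```python
def am_survives(failure_times, max_attempts=2, validity_interval=None):
    """Return True if the app is still alive after the given AM failures.

    failure_times: timestamps (seconds) at which an AM attempt failed.
    max_attempts: total number of AM starts allowed; the first launch
        counts as attempt 1 (a "start counter", not a restart counter).
    validity_interval: optional window in seconds; failures older than
        this, relative to the latest failure, are not counted (assumed
        semantics for the "window for failures").
    """
    counted = []
    for t in sorted(failure_times):
        counted.append(t)
        if validity_interval is not None:
            # Keep only failures that fall inside the window ending at t.
            counted = [f for f in counted if t - f < validity_interval]
        # Each counted failure used up one start; once every allowed
        # start has failed there is nothing left to launch.
        if len(counted) >= max_attempts:
            return False
    return True


if __name__ == "__main__":
    # Limit of 2: one failure is survived, two in a row kill the app.
    print(am_survives([10]))
    print(am_survives([10, 20]))
    # With a 600 s window, two failures far apart no longer add up.
    print(am_survives([10, 5000], validity_interval=600))
```

Without a window, every failure over the lifetime of the app counts toward the limit, which is why a long-lived service can eventually be killed by sporadic, unrelated failures.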
