[ https://issues.apache.org/jira/browse/YARN-10479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226303#comment-17226303 ]
Jim Brennan commented on YARN-10479:
------------------------------------
I believe most of the YARN failures are unrelated to this change. They fail
for me with or without this change.
It looks to me like most of them were caused by [HADOOP-17306].
When I reverted [HADOOP-17306], most of these failures went away.
> RMProxy should retry on SocketTimeout Exceptions
> ------------------------------------------------
>
> Key: YARN-10479
> URL: https://issues.apache.org/jira/browse/YARN-10479
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn
> Affects Versions: 2.10.1, 3.4.1
> Reporter: Jim Brennan
> Assignee: Jim Brennan
> Priority: Major
> Attachments: YARN-10479.001.patch, YARN-10479.002.patch,
> YARN-10479.003.patch
>
>
> During an incident involving a DNS outage, a large number of nodemanagers
> failed to come back into service because they hit a socket timeout when
> trying to re-register with the RM.
> SocketTimeoutException is not currently one of the exceptions that the
> RMProxy will retry. Based on this incident, it seems like it should be. We
> made this change internally about a year ago and it has been running in
> production since.
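
For illustration only, the behavior described above can be sketched in plain Java: a call is retried only when the thrown exception belongs to a set of retriable types, and SocketTimeoutException is added to that set. This is not the attached patch and not Hadoop's RMProxy code; the class name RetryOnTimeoutSketch, its methods, and the attempt and sleep parameters are all hypothetical.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.Set;
import java.util.concurrent.Callable;

// Hypothetical helper, NOT Hadoop's RMProxy: retries only listed exception types.
final class RetryOnTimeoutSketch {
    // Exception types treated as transient. SocketTimeoutException is the
    // addition this issue argues for; the rest of the set is an assumption.
    static final Set<Class<? extends Exception>> RETRIABLE = Set.of(
            ConnectException.class,
            SocketTimeoutException.class);

    // True if the exception is an instance of any retriable type.
    static boolean isRetriable(Exception e) {
        for (Class<? extends Exception> type : RETRIABLE) {
            if (type.isInstance(e)) {
                return true;
            }
        }
        return false;
    }

    // Runs the call up to maxAttempts times, sleeping between attempts.
    // Non-retriable exceptions are rethrown immediately; the last retriable
    // exception is rethrown once the attempts are exhausted.
    static <T> T callWithRetries(Callable<T> call, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (!isRetriable(e)) {
                    throw e;
                }
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw last;
    }
}
```

With a set like this, a nodemanager re-registration that times out during a DNS outage would be attempted again instead of failing on the first SocketTimeoutException.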