Re: stale_status_of_NM_from_standby_RM

Chris Nauroth Tue, 27 Dec 2022 11:44:07 -0800

Every NodeManager registers and heartbeats to the active ResourceManager
instance, which acts as the source of truth for cluster node status. If the
active ResourceManager terminates, then another becomes active, and every
NodeManager will start a new connection to register and heartbeat with that
new active ResourceManager.

As such, a standby ResourceManager cannot satisfy requests for node status
and instead will redirect to the current active:

curl -i 'http://cnauroth-ha-m-2:8088/ws/v1/cluster/nodes/cnauroth-ha-w-0.us-central1-c.c.hadoop-cloud-dev.google.com.internal:8026'
HTTP/1.1 307 Temporary Redirect
Date: Tue, 27 Dec 2022 19:28:38 GMT
Cache-Control: no-cache
Expires: Tue, 27 Dec 2022 19:28:38 GMT
Date: Tue, 27 Dec 2022 19:28:38 GMT
Pragma: no-cache
Content-Type: text/plain;charset=utf-8
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Location: http://cnauroth-ha-m-1.us-central1-c.c.hadoop-cloud-dev.google.com.internal.:8088/ws/v1/cluster/nodes/cnauroth-ha-w-0.us-central1-c.c.hadoop-cloud-dev.google.com.internal:8026
Content-Length: 136

If it looked like you were able to query a standby, then perhaps you were
using a browser or some other client that automatically follows redirects
(e.g. curl -L)?
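
If you want to check this from a script instead of curl, here is a minimal
Python sketch that sends a single GET and does not follow the redirect. The
function names (classify_rm_response, probe_rm) are hypothetical helpers of
mine, not part of Hadoop. Treating a 200 as "answered by the active" and a
307 with a Location header as "answered by a standby" is my reading of the
behavior shown above, so take it as an assumption rather than a documented
contract.

```python
import http.client
from urllib.parse import urlsplit


def classify_rm_response(status, location):
    """Interpret a ResourceManager REST reply that was NOT auto-followed.

    Assumption: 200 means the queried RM answered itself (the active),
    and 307 with a Location header means a standby is redirecting.
    """
    if status == 200:
        return ("active", None)
    if status == 307 and location:
        return ("standby", location.strip())
    return ("unknown", None)


def probe_rm(url, timeout=10):
    """Send one plain-HTTP GET; http.client never follows redirects itself."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    conn = http.client.HTTPConnection(
        parts.hostname, parts.port or 80, timeout=timeout
    )
    try:
        conn.request("GET", target)
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        return classify_rm_response(resp.status, resp.getheader("Location"))
    finally:
        conn.close()
```

Calling probe_rm() with the node URL from your question returns a pair
like ("standby", <redirect target>), which makes it obvious when the RM
you asked was not the one that would actually serve the data.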

In that case, the data really would have come from the active
ResourceManager, so you can trust that it's not stale. The only thing you
might have to consider is that after a failover, it might take a while
before every NodeManager registers with the new active ResourceManager.
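
If you need to wait out that window after a failover, one option is to poll
the active's node list and compare it with the nodes you expect. Here is a
rough Python sketch. It assumes the body of GET /ws/v1/cluster/nodes has
already been parsed into a dict shaped like {"nodes": {"node": [...]}},
with each entry carrying "id" and "state". The helper name and the list of
expected node IDs are hypothetical and would come from your own inventory.

```python
def nodes_not_yet_running(cluster_nodes_json, expected_node_ids):
    """Return the expected node IDs that the RM does not report as RUNNING.

    cluster_nodes_json: parsed JSON from /ws/v1/cluster/nodes (assumed shape).
    expected_node_ids:  iterable of "host:port" IDs you expect to see.
    """
    # "nodes" may be absent or null when the RM knows of no nodes yet.
    nodes = (cluster_nodes_json.get("nodes") or {}).get("node") or []
    running = {n["id"] for n in nodes if n.get("state") == "RUNNING"}
    return sorted(set(expected_node_ids) - running)


# Example with made-up node IDs: w-1 has not re-registered yet.
sample = {"nodes": {"node": [{"id": "w-0:8026", "state": "RUNNING"}]}}
missing = nodes_not_yet_running(sample, ["w-0:8026", "w-1:8026"])
```

An empty result would suggest that every expected NodeManager has
registered with the new active and is reported as RUNNING.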

Separately, if you're concerned about divergence of node include/exclude
files, you can configure them to be stored on a shared file system (e.g.
your preferred cloud object store) so that all ResourceManager instances
read the same files.

Chris Nauroth

On Sat, Dec 24, 2022 at 6:27 PM Dong Ye <yedong...@gmail.com> wrote:

> Hi, All:
>
>     I have some questions about the state of the node manager. If I use
> the rest API
>
>    - http://rm-http-address:port/ws/v1/cluster/nodes/{nodeid}
>
> to get node manager state from a standby RM,
> 1) is it possible that it could be stale?
> 2) If it is possible, how long will the node manager state be updated?
> 3) Is it possible that the NM state returned from standby RM be very
> different from that returned from the active RM? Say one is returning
> RUNNING while the other returns DECOMMISSIONED because the local
> exclude.xml is very different/diverges?
>
> Thanks.
> Have a good holiday.
>
