Hi, Thomas. Thanks for looking into this. We have a couple bugs in the RH bug tracker about this, and while I don't see anything in the upstream tracker, I wouldn't be surprised if there's an issue filed in there too. So, it's great to have someone taking a look.
I am about to leave for a short vacation through Monday, so I do not have the time right now to give your email the consideration it deserves. I will definitely read it more thoroughly when I get back. I am working on a lot of this IPC code right now, in fact (see https://github.com/ClusterLabs/pacemaker/pull/3877 for instance) so hopefully that means we can get this problem figured out.
> I'm not sure about the implications of adding retries to pcmk__send_ipc_request()
On main, there are two things to consider for retrying in that function. The most obvious is the while loop at the end that reads synchronous replies. We're already looping there on EAGAIN, so looping on other error codes is reasonable. The biggest concern is that, for any caller expecting a sync reply, looping will block that caller, which can leave daemons unresponsive (and I think that would cause us to hit this same problem again). The other is crm_ipc_send(), which despite its name can receive, then send, then receive again. Luckily it is already set up to handle timeouts, so we should be able to handle additional error codes there too.
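To make that concrete, here's a minimal sketch of the kind of bounded retry loop I mean (my own illustration, not Pacemaker's actual code; ipc_read_reply() is a hypothetical stand-in for whatever pulls a reply off the connection). The point is that retrying on transient errors has to stay bounded by the caller's timeout, so a sync caller can't block a daemon indefinitely:

    /* Minimal sketch, not Pacemaker's actual code. */
    #include <errno.h>
    #include <stdbool.h>
    #include <time.h>

    /* Hypothetical reply reader: returns 0 on success, -errno on failure */
    extern int ipc_read_reply(void *connection, void **reply);

    static bool
    error_is_transient(int rc)
    {
        /* EAGAIN is already retried today; ENOTCONN and ECONNREFUSED are
         * the additional candidates discussed in this thread */
        return (rc == -EAGAIN) || (rc == -ENOTCONN) || (rc == -ECONNREFUSED);
    }

    int
    read_reply_with_timeout(void *connection, void **reply, long timeout_ms)
    {
        struct timespec start;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (;;) {
            struct timespec now;
            long elapsed_ms;
            int rc = ipc_read_reply(connection, reply);

            if ((rc == 0) || !error_is_transient(rc)) {
                return rc;  /* success, or a permanent error */
            }

            clock_gettime(CLOCK_MONOTONIC, &now);
            elapsed_ms = (now.tv_sec - start.tv_sec) * 1000
                         + (now.tv_nsec - start.tv_nsec) / 1000000L;
            if (elapsed_ms >= timeout_ms) {
                return -ETIMEDOUT;  /* give up rather than block the caller */
            }

            /* brief pause so we don't spin on a busy server */
            nanosleep(&(struct timespec) { .tv_nsec = 10000000L }, NULL);
        }
    }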
> Is the proper fix to add retries on ENOTCONN and ECONNREFUSED to every call site of connect_and_send_attrd_request() in one of the pacemaker daemons?
I would have to think about this more, but instead of adding retries to the callers, maybe we should add them to connect_and_send_attrd_request() itself. That way, if initiating the IPC connection succeeds but sending the request does not, we're not spamming the IPC server with new connections. In general, we want to keep the number of IPC messages to a minimum in order to improve responsiveness.
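As a rough sketch of what I mean (attrd_connect(), attrd_send(), and attrd_disconnect() are hypothetical stand-ins, not the real API): retry the connect when the connection itself fails, retry the send on the same connection for transient send failures, and only reconnect when the connection has actually died:

    /* Rough sketch with hypothetical helper names. */
    #include <errno.h>
    #include <stddef.h>

    #define MAX_ATTEMPTS 3

    extern int attrd_connect(void **conn);               /* hypothetical */
    extern int attrd_send(void *conn, const void *req);  /* hypothetical */
    extern void attrd_disconnect(void *conn);            /* hypothetical */

    int
    connect_and_send_with_retry(const void *request)
    {
        void *conn = NULL;
        int rc = -ENOTCONN;

        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            if (conn == NULL) {
                rc = attrd_connect(&conn);
                if (rc != 0) {
                    conn = NULL;
                    continue;  /* e.g. ECONNREFUSED: daemon not up yet */
                }
            }

            rc = attrd_send(conn, request);
            if (rc == 0) {
                break;                      /* request sent */
            } else if (rc == -EAGAIN) {
                continue;                   /* retry on the SAME connection */
            } else if (rc == -ENOTCONN) {
                attrd_disconnect(conn);     /* connection died under us */
                conn = NULL;                /* reconnect on the next attempt */
            } else {
                break;                      /* permanent error; don't retry */
            }
        }

        if (conn != NULL) {
            attrd_disconnect(conn);
        }
        return rc;
    }

The important property is the EAGAIN branch: a caller-side retry would tear down and rebuild the connection on every attempt, while the helper can keep reusing the one it already has.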
> I am willing to work on a fix myself, but I'm wondering what it should look like to be accepted. The patch I have against 2.1.6 is attached. Ideas for improvements to the fix for that version are also very welcome.
Typically we do patch review on GitHub, which would require you to have an account there, fork the pacemaker repo, write a patch, and send a pull request. That's a fairly high bar, especially for a small-ish patch, so we can just do the review here. I'd be happy to look into all this more next week, but anyone else working on Pacemaker who wants to take a look is welcome to as well.

- Chris