Hi, Thomas. Thanks for looking into this. We have a couple bugs in the RH bug tracker about this, and while I don't see anything in the upstream tracker, I wouldn't be surprised if there's an issue filed in there too. So, it's great to have someone taking a look.
I am about to leave for a short vacation through Monday, so I do not have the time right now to give your email the consideration it deserves. I will definitely read it more thoroughly when I get back. I am working on a lot of this IPC code right now, in fact (see https://github.com/ClusterLabs/pacemaker/pull/3877 for instance) so hopefully that means we can get this problem figured out.
> I'm not sure about the implications of adding retries to pcmk__send_ipc_request()
On main, there are two things to consider for retrying in that function. The most obvious is the while loop at the end that reads synchronous replies. We're already looping there on EAGAIN, so looping on other error codes is reasonable. The biggest concern is that, for any caller expecting a sync reply, looping will block that caller, which can leave daemons unresponsive (and I think that would cause us to hit this same problem again). The other is crm_ipc_send(), which despite its name can receive, then send, then receive again. Luckily it is already set up to handle timeouts, so we should be able to handle additional error codes there too.
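To make that concrete, here's a minimal sketch of the kind of bounded retry loop I mean (my own illustration, not Pacemaker's actual code; ipc_read_reply() is a hypothetical stand-in for whatever pulls a reply off the connection). The point is that retrying on transient errors has to stay bounded by the caller's timeout, so a sync caller can't block a daemon indefinitely:

    /* Minimal sketch, not Pacemaker's actual code. */
    #include <errno.h>
    #include <stdbool.h>
    #include <time.h>

    /* Hypothetical reply reader: returns 0 on success, -errno on failure */
    extern int ipc_read_reply(void *connection, void **reply);

    static bool
    error_is_transient(int rc)
    {
        /* EAGAIN is already retried today; ENOTCONN and ECONNREFUSED are
         * the additional candidates discussed in this thread */
        return (rc == -EAGAIN) || (rc == -ENOTCONN) || (rc == -ECONNREFUSED);
    }

    int
    read_reply_with_timeout(void *connection, void **reply, long timeout_ms)
    {
        struct timespec start;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (;;) {
            struct timespec now;
            long elapsed_ms;
            int rc = ipc_read_reply(connection, reply);

            if ((rc == 0) || !error_is_transient(rc)) {
                return rc;  /* success, or a permanent error */
            }

            clock_gettime(CLOCK_MONOTONIC, &now);
            elapsed_ms = (now.tv_sec - start.tv_sec) * 1000
                         + (now.tv_nsec - start.tv_nsec) / 1000000L;
            if (elapsed_ms >= timeout_ms) {
                return -ETIMEDOUT;  /* give up rather than block the caller */
            }

            /* brief pause so we don't spin on a busy server */
            nanosleep(&(struct timespec) { .tv_nsec = 10000000L }, NULL);
        }
    }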
> Is the proper fix to add retries on ENOTCONN and ECONNREFUSED to every call site of connect_and_send_attrd_request() in one of the pacemaker daemons?
I would have to think about this more, but instead of adding retries to the callers, maybe we should add them to connect_and_send_attrd_request() itself. That way, if initiating the IPC connection succeeds but sending the request does not, we're not spamming the IPC server with new connections. In general, we want to keep the number of IPC messages to a minimum in order to improve responsiveness.
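As a rough sketch of what I mean (attrd_connect(), attrd_send(), and attrd_disconnect() are hypothetical stand-ins, not the real API): retry the connect when the connection itself fails, retry the send on the same connection for transient send failures, and only reconnect when the connection has actually died:

    /* Rough sketch with hypothetical helper names. */
    #include <errno.h>
    #include <stddef.h>

    #define MAX_ATTEMPTS 3

    extern int attrd_connect(void **conn);               /* hypothetical */
    extern int attrd_send(void *conn, const void *req);  /* hypothetical */
    extern void attrd_disconnect(void *conn);            /* hypothetical */

    int
    connect_and_send_with_retry(const void *request)
    {
        void *conn = NULL;
        int rc = -ENOTCONN;

        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            if (conn == NULL) {
                rc = attrd_connect(&conn);
                if (rc != 0) {
                    conn = NULL;
                    continue;  /* e.g. ECONNREFUSED: daemon not up yet */
                }
            }

            rc = attrd_send(conn, request);
            if (rc == 0) {
                break;                      /* request sent */
            } else if (rc == -EAGAIN) {
                continue;                   /* retry on the SAME connection */
            } else if (rc == -ENOTCONN) {
                attrd_disconnect(conn);     /* connection died under us */
                conn = NULL;                /* reconnect on the next attempt */
            } else {
                break;                      /* permanent error; don't retry */
            }
        }

        if (conn != NULL) {
            attrd_disconnect(conn);
        }
        return rc;
    }

The important property is the EAGAIN branch: a caller-side retry would tear down and rebuild the connection on every attempt, while the helper can keep reusing the one it already has.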
> I am willing to work on a fix myself, but I'm wondering what it should look like to be accepted. The patch I have against 2.1.6 is attached. Ideas for improvements to the fix for that version are also very welcome.
Typically we do patch review on GitHub, which would require you to have an account there, fork the pacemaker repo, write a patch, and send a pull request. That's a fairly high bar, especially for a small-ish patch, so we can just do the review here. I'd be happy to look into all this more next week, but anyone else working on Pacemaker who wants to take a look is welcome to as well.

- Chris