Hi all,

In the PostgreSQL Automatic Failover (PAF) project, one of the most frequent pieces of negative feedback we get is how difficult it is to experiment with it, because fencing occurs far too often. I am currently hunting down this kind of unnecessary fencing to make life easier.

It occurs to me that a frequent cause of fencing is the stop action: before trying to stop the resource, we check the status of the PostgreSQL instance using our monitor function. If the function does not return OCF_NOT_RUNNING, OCF_SUCCESS or OCF_RUNNING_MASTER, we just raise an error, which leads to fencing. See:

  https://github.com/dalibo/PAF/blob/d50d0d783cfdf5566c3b7c8bd7ef70b11e4d1043/script/pgsqlms#L1291-L1301

I am considering adding a check to determine whether the instance is stopped even if the monitor action returns an error. The idea would be to parse **all** the local processes looking for at least one pair of "/proc/<PID>/{comm,cwd}" related to the PostgreSQL instance we want to stop. If none is found, we consider the instance not running. Whether it stopped gracefully or not, we just know it is down and we can return OCF_SUCCESS.

Just for completeness, the piece of code would be:

  use File::Basename;  # provides basename()

  my @pids;
  foreach my $f (glob "/proc/[0-9]*") {
      # keep the PID if its executable is "postgres" and its cwd is $pgdata
      push @pids => basename($f)
          if -r $f
              and basename( readlink( "$f/exe" ) ) eq "postgres"
              and readlink( "$f/cwd" ) eq $pgdata;
  }

It feels safe enough to me. The only risk I can think of is a shared-disk cluster with multiple nodes accessing the same data in RW (such a setup can fail in so many ways :)). However, PAF is not supposed to work in such a context, so I can live with this.

Do you guys have any advice? Do you see any drawbacks? Hazards?

Thanks in advance!
-- 
Jehan-Guillaume de Rorthais
Dalibo

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org