> On May 27, 2018, at 2:28 PM, Ken Gaillot <kgail...@redhat.com> wrote:
> 
> Pacemaker isn't fencing because the start failed, at least not
> directly:
> 
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2    pengine:     info:
>> determine_op_status: Operation monitor found resource postgresql-10-
>> main:2 active on d-gp2-dbpg0-2
> 
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2    pengine:   notice:
>> LogActions:  Demote  postgresql-10-main:1    (Master -> Slave d-gp2-
>> dbpg0-1)
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2    pengine:   notice:
>> LogActions:  Recover postgresql-10-main:1    (Master d-gp2-dbpg0-1)
> 
> From the above, we can see that the initial probe after the node
> rejoined found that the resource was already running in master mode
> there (at least, that's what the agent thinks). So, the cluster wants
> to demote it, stop it, and start it again as a slave.

Well, it was running in master mode prior to being power-cycled.  However, my 
understanding was that PAF always tries to initially start PostgreSQL in 
standby mode.  There would be no reason for it to promote node 1 to master, 
since node 2 has already taken over the master role, and there is no location 
constraint that would cause the cluster to try to move that role back to 
node 1 after it rejoins the cluster.

Jehan-Guillaume wrote: "on resource start, PAF will create the 
"PGDATA/recovery.conf" file based on your template anyway. No need to create it 
yourself."  The presence of that recovery.conf file at PostgreSQL startup is 
what makes it start in standby mode.
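For reference, the PAF recovery template (the file it copies into 
PGDATA/recovery.conf on start) typically looks something like the fragment 
below; the host and application_name values here are placeholders, not taken 
from my cluster:

```
standby_mode = on
primary_conninfo = 'host=<master-host-or-vip> application_name=<local-node-name>'
recovery_target_timeline = 'latest'
```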

Since no new output is ever written to the PostgreSQL log file, it does not 
seem that the agent ever actually attempts to start the resource.  The 
recovery.conf never gets copied in, and no postgres process ever appears.  As 
far as I can tell, nothing happens on the rejoined node at all before it gets 
fenced.

How can I tell what the resource agent is trying to do behind the scenes?  Is 
there a way that I can see what command(s) it is trying to run, so that I may 
try them manually?
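(One thing I have found so far: crm_resource can apparently invoke the agent's 
actions directly, outside the normal cluster scheduling, with verbose output. 
A sketch of what I mean, run on the rejoined node, using the resource name 
from the logs above; I have not confirmed which --force-* options my Pacemaker 
build supports:

```shell
# Run the agent's monitor action by hand, verbosely, on this node:
crm_resource --resource postgresql-10-main --force-check -V

# If the check claims the resource is active, exercise the stop path
# the same way to see why stop would fail:
crm_resource --resource postgresql-10-main --force-stop -V
```

Is that the right way to see what the agent is doing?)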

> But the demote failed

I reckon that it probably couldn't demote what was never started.

> But the stop fails too

I guess that it can't stop what is already stopped?  Although, I'm surprised 
that it would report an error in that case, instead of just recognizing that 
the resource was already stopped...

> 
>> May 22 23:57:24 [2196] d-gp2-dbpg0-2    pengine:  warning:
>> pe_fence_node:       Node d-gp2-dbpg0-1 will be fenced because of
>> resource failure(s)
> 
> which is why the cluster then wants to fence the node. (If a resource
> won't stop, the only way to recover it is to kill the entire node.)

But the resource is *never started*!?  There is never any postgres process 
running, and nothing appears in the PostgreSQL log file.  I'm really confused 
as to why Pacemaker thinks it needs to fence something that is never running 
at all...  I guess what I need is to somehow figure out what the resource 
agent is doing that makes it think the resource is already active; is there a 
way to do this?

It would be really helpful if, somewhere within all this verbose logging, 
there were an indication of which commands were actually being run to monitor, 
start, stop, etc.; as it stands, it seems like a black box.
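(Failing that, I suppose the agent can be run by hand as an OCF script. A 
sketch of what I would try; the paths and the pgdata/bindir parameter values 
are guesses for my setup, not something I've verified against the PAF docs:

```shell
# Hypothetical direct invocation of the PAF agent (pgsqlms) outside
# Pacemaker, on the rejoined node.  Paths below are assumptions for a
# Debian-ish PostgreSQL 10 install; adjust to the actual cluster config.
sudo OCF_ROOT=/usr/lib/ocf \
     OCF_RESKEY_pgdata=/var/lib/postgresql/10/main \
     OCF_RESKEY_bindir=/usr/lib/postgresql/10/bin \
     /usr/lib/ocf/resource.d/heartbeat/pgsqlms monitor
echo "exit code: $?"
```

Per the OCF convention, 0 means running, 7 means not running, and 8 means 
running as master; so if this returns 0 or 8 on a node with no postgres 
process, that would pinpoint the bogus detection.)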

I'm wondering whether some stale PID file is being left behind after the hard 
reboot, and whether that is what the resource agent is checking instead of the 
actual running status, but I would hope that the resource agent would be 
smarter than that.
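(A quick way to test that theory by hand, if anyone wants to confirm: check 
whether the PID recorded in PGDATA/postmaster.pid still refers to a live 
process. This is only an illustration of the stale-pidfile scenario; I don't 
actually know whether PAF consults the pidfile at all, as it may inspect 
pg_controldata state instead. The PGDATA path is an assumption:

```shell
# check_pidfile: report whether $1/postmaster.pid refers to a live process.
# Hypothetical helper to diagnose a stale pidfile after a hard reboot.
check_pidfile() {
  pgdata=$1
  if [ ! -f "$pgdata/postmaster.pid" ]; then
    echo "no pidfile"
  elif kill -0 "$(head -n 1 "$pgdata/postmaster.pid")" 2>/dev/null; then
    echo "running"
  else
    echo "stale"
  fi
}

# e.g. on the rejoined node:
check_pidfile /var/lib/postgresql/10/main
```

A "stale" result right after the node rejoins would at least confirm the 
leftover-pidfile theory.)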

Thanks,
-- 
Casey
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
