>>> Marek Grac <[email protected]> schrieb am 03.09.2016 um 14:41 in Nachricht <CA+40=jws_6hjglajcszaqa6o9rqh79oa9aq150z1+5kjst_...@mail.gmail.com>: > Hi, > > There are two problems mentioned in the email. > > 1) power-wait > > Power-wait is a quite advanced option and there are only few fence > devices/agent where it makes sense. And only because the HW/firmware on the > device is somewhat broken. Basically, when we execute power ON/OFF > operation, we wait for power-wait seconds before we send next command. I > don't remember any issue with APC and this kind of problems. > > > 2) the only theory I could come up with was that maybe the fencing > operation was considered complete too quickly? > > That is virtually not possible. Even when power ON/OFF is asynchronous, we > test status of device and fence agent wait until status of the plug/VM/... > matches what user wants.
I can imagine that a powerful power supply can deliver up to one second of
power even after the mains is disconnected. If the cluster reacts very
quickly after fencing, that could be a problem. I'd suggest a 5 to 10 second
delay between the fencing action and the cluster's reaction.

> m,
>
> On Fri, Sep 2, 2016 at 3:14 PM, Dan Swartzendruber <[email protected]>
> wrote:
>
>> So, I was testing my ZFS dual-head JBOD 2-node cluster. Manual failovers
>> worked just fine. I then went to try an acid test by logging in to node A
>> and doing 'systemctl stop network'. Sure enough, pacemaker told the APC
>> fencing agent to power-cycle node A. The ZFS pool moved to node B as
>> expected. As soon as node A was back up, I migrated the pool/IP back to
>> node A. I *thought* all was okay, until a bit later I did 'zpool status'
>> and saw checksum errors on both sides of several of the vdevs. After much
>> digging and poking, the only theory I could come up with was that maybe
>> the fencing operation was considered complete too quickly? I googled for
>> examples using this, and the best tutorial I found used power-wait=5,
>> whereas the default seems to be power-wait=0 (this is CentOS 7, btw). I
>> changed it to use 5 instead of 0 and did several fencing operations while
>> a guest VM (vSphere via NFS) was writing to the pool. So far, no evidence
>> of corruption. BTW, the way I was creating and managing the cluster was
>> with the LCMC Java GUI; possibly the power-wait default of 0 comes from
>> there, I can't really tell. Any thoughts or ideas appreciated :)
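For anyone else hitting this: on CentOS 7 with pcs, the change Dan describes
would look roughly like the following. The device name "apc-fence" is made
up; use whatever 'pcs stonith show' lists for your APC fence device.

    # add 5 seconds of settle time after each power on/off command
    pcs stonith update apc-fence power_wait=5

    # re-test fencing against a node you can afford to reboot
    pcs stonith fence <node-name>

Note the resource parameter is spelled power_wait (with an underscore), even
though the discussion above writes it as power-wait.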
