On Mon, 2019-06-24 at 10:55 -0500, Bryan K. Walton wrote:
> We have a two-node cluster managing two iSCSI targets on top of DRBD.
> Except for the fence agents, the resources are configured to reside on
> only one node at a time. These are CentOS 7 servers running Pacemaker
> version 1.1.19-8.el7_6.4. DRBD is version 9.0.16-1.
>
> We have two iSCSI targets that are served up by Pacemaker. I configured
> these servers so that these targets are 100% controlled by Pacemaker --
> or at least that was my intention.
>
> My iSCSI targets are part of a resource group called ISCSIGroup. I have
> a colocation constraint that I believe restricts this group so that it
> can only run on the same node that hosts my DRBD resource. DRBD is
> configured as primary/secondary.
>
> -----
>
> ISCSIGroup: storageLVM ISCSICentipedeIP ISCSIMillipedeIP targetRHEVM
>   lunRHEVM targetVMStorage lunVMStorage
>
> Colocation Constraints:
>   ISCSIGroup with storageClusterClone (score:INFINITY) (with-rsc-role:Master)
>
> -----
>
> I was doing OS maintenance on the node that was NOT hosting the iSCSI
> targets (storage2). I took the following steps to do the updates:
>
> I put the cluster into maintenance mode and did a yum update on
> storage2. Then I took the cluster out of maintenance mode (which
> succeeded).
>
> Then I put storage2 into standby mode and stopped the cluster on
> storage2. I then rebooted storage2. After storage2 came back online, I
> started the cluster on storage2. All of these steps were successful.
> Then, as a final step, I instructed Pacemaker to unstandby storage2.
> That is when things went horribly wrong.
>
> What happened is that storage1 (the primary node) complained that the
> monitor action for the iSCSI targets failed on storage2. The error
> suggests that they were already running on storage2.
>
> Are my constraints not sufficient? I would really appreciate some input
> on what I've got wrong, or what is missing, that would cause the target
> resources to start on both nodes.
>
> Thanks,
> Bryan
>
> Here are logs from storage1, which was the node hosting the resources:
>
> Jun 20 11:48:35 storage1 crmd[240695]: notice: Initiating monitor operation targetRHEVM_monitor_0 on storage2
> Jun 20 11:48:35 storage1 crmd[240695]: notice: Initiating monitor operation lunRHEVM_monitor_0 on storage2
> Jun 20 11:48:35 storage1 crmd[240695]: notice: Initiating monitor operation targetVMStorage_monitor_0 on storage2
> Jun 20 11:48:35 storage1 crmd[240695]: notice: Initiating monitor operation lunVMStorage_monitor_0 on storage2
> Jun 20 11:48:36 storage1 crmd[240695]: warning: Action 18 (targetRHEVM_monitor_0) on storage2 failed (target: 7 vs. rc: 0): Error
> Jun 20 11:48:36 storage1 crmd[240695]: notice: Transition aborted by operation targetRHEVM_monitor_0 'modify' on storage2: Event failed
> Jun 20 11:48:36 storage1 crmd[240695]: warning: Action 20 (targetVMStorage_monitor_0) on storage2 failed (target: 7 vs. rc: 0): Error
> Jun 20 11:48:36 storage1 crmd[240695]: notice: Transition 1 (Complete=12, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1054.bz2): Complete
> Jun 20 11:48:36 storage1 pengine[240694]: error: Resource targetRHEVM is active on 2 nodes (attempting recovery)

This means that pacemaker found the target active on storage2 *without*
having scheduled it there -- so either something outside pacemaker is
setting up the target, or (less likely) the agent is returning the wrong
status.

> Jun 20 11:48:36 storage1 pengine[240694]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> Jun 20 11:48:36 storage1 pengine[240694]: error: Resource targetVMStorage is active on 2 nodes (attempting recovery)
> Jun 20 11:48:36 storage1 pengine[240694]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> Jun 20 11:48:36 storage1 pengine[240694]: notice: * Move targetRHEVM ( storage2 -> storage1 )
> Jun 20 11:48:36 storage1 pengine[240694]: notice: * Restart lunRHEVM ( storage1 ) due to required targetRHEVM start
> Jun 20 11:48:36 storage1 pengine[240694]: notice: * Move targetVMStorage ( storage2 -> storage1 )
> Jun 20 11:48:36 storage1 pengine[240694]: notice: * Restart lunVMStorage ( storage1 ) due to required targetVMStorage start
> Jun 20 11:48:36 storage1 pengine[240694]: error: Calculated transition 2 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-7.bz2
> Jun 20 11:48:36 storage1 crmd[240695]: notice: Initiating stop operation lunVMStorage_stop_0 locally on storage1
> Jun 20 11:48:36 storage1 iSCSILogicalUnit(lunVMStorage)[41725]: INFO: Deleted LUN 1.
> Jun 20 11:48:37 storage1 iSCSILogicalUnit(lunVMStorage)[41725]: INFO: Deleted storage object lunVMStorage.
> Jun 20 11:48:37 storage1 crmd[240695]: notice: Result of stop operation for lunVMStorage on storage1: 0 (ok)
> Jun 20 11:48:37 storage1 crmd[240695]: notice: Initiating stop operation targetVMStorage_stop_0 locally on storage1
> Jun 20 11:48:37 storage1 crmd[240695]: notice: Initiating stop operation targetVMStorage_stop_0 on storage2
> Jun 20 11:48:37 storage1 iSCSITarget(targetVMStorage)[41806]: INFO: Deleted Target iqn.2019-03.com.leepfrog:storage.vmstorage.
> Jun 20 11:48:37 storage1 crmd[240695]: notice: Result of stop operation for targetVMStorage on storage1: 0 (ok)
> Jun 20 11:48:37 storage1 crmd[240695]: notice: Initiating stop operation lunRHEVM_stop_0 locally on storage1
> Jun 20 11:48:37 storage1 iSCSILogicalUnit(lunRHEVM)[41863]: INFO: Deleted LUN 0.
> Jun 20 11:48:38 storage1 iSCSILogicalUnit(lunRHEVM)[41863]: INFO: Deleted storage object lunRHEVM.
> Jun 20 11:48:38 storage1 crmd[240695]: notice: Result of stop operation for lunRHEVM on storage1: 0 (ok)
> Jun 20 11:48:38 storage1 crmd[240695]: notice: Initiating stop operation targetRHEVM_stop_0 locally on storage1
> Jun 20 11:48:38 storage1 crmd[240695]: notice: Initiating stop operation targetRHEVM_stop_0 on storage2
> Jun 20 11:48:38 storage1 iSCSITarget(targetRHEVM)[42032]: INFO: Deleted Target iqn.2019-02.com.leepfrog:storage.rhevm.
> Jun 20 11:48:38 storage1 crmd[240695]: notice: Result of stop operation for targetRHEVM on storage1: 0 (ok)
> Jun 20 11:48:38 storage1 crmd[240695]: notice: Initiating start operation targetRHEVM_start_0 locally on storage1
> Jun 20 11:48:38 storage1 iSCSITarget(targetRHEVM)[42089]: INFO: Parameter auto_add_default_portal is now 'false'.
> Jun 20 11:48:39 storage1 iSCSITarget(targetRHEVM)[42089]: INFO: Created target iqn.2019-02.com.leepfrog:storage.rhevm. Created TPG 1.
> Jun 20 11:48:39 storage1 iSCSITarget(targetRHEVM)[42089]: INFO: Created Node ACL for iqn.1994-05.com.redhat:3d066d1f423e
> Jun 20 11:48:39 storage1 iSCSITarget(targetRHEVM)[42089]: INFO: Created Node ACL for iqn.1994-05.com.redhat:84f0f7458c58
> Jun 20 11:48:39 storage1 crmd[240695]: notice: Result of start operation for targetRHEVM on storage1: 0 (ok)
> Jun 20 11:48:39 storage1 crmd[240695]: notice: Initiating monitor operation targetRHEVM_monitor_10000 locally on storage1
> Jun 20 11:48:39 storage1 crmd[240695]: notice: Initiating start operation lunRHEVM_start_0 locally on storage1
> Jun 20 11:48:39 storage1 iSCSILogicalUnit(lunRHEVM)[42221]: INFO: Created block storage object lunRHEVM using /dev/storage/rhevm.
> Jun 20 11:48:40 storage1 iSCSILogicalUnit(lunRHEVM)[42221]: INFO: Created LUN 0. Created LUN 0->0 mapping in node ACL iqn.1994-05.com.redhat:84f0f7458c58 Created LUN 0->0 mapping in node ACL iqn.1994-05.com.redhat:3d066d1f423e
> Jun 20 11:48:40 storage1 iSCSILogicalUnit(lunRHEVM)[42221]: ERROR: No such NetworkPortal in configfs: /sys/kernel/config/target/iscsi/iqn.2019-02.com.leepfrog:storage.rhevm/tpgt_1/np/0.0.0.0:3260
> Jun 20 11:48:40 storage1 crmd[240695]: notice: Result of start operation for lunRHEVM on storage1: 0 (ok)
> Jun 20 11:48:40 storage1 crmd[240695]: notice: Initiating monitor operation lunRHEVM_monitor_10000 locally on storage1
> Jun 20 11:48:40 storage1 crmd[240695]: notice: Initiating start operation targetVMStorage_start_0 locally on storage1
> Jun 20 11:48:40 storage1 iSCSITarget(targetVMStorage)[42383]: INFO: Parameter auto_add_default_portal is now 'false'.
> Jun 20 11:48:41 storage1 iSCSITarget(targetVMStorage)[42383]: INFO: Created target iqn.2019-03.com.leepfrog:storage.vmstorage. Created TPG 1.
> Jun 20 11:48:41 storage1 iSCSITarget(targetVMStorage)[42383]: INFO: Created Node ACL for iqn.1994-05.com.redhat:3d066d1f423e
> Jun 20 11:48:41 storage1 iSCSITarget(targetVMStorage)[42383]: INFO: Created Node ACL for iqn.1994-05.com.redhat:84f0f7458c58
> Jun 20 11:48:41 storage1 crmd[240695]: notice: Result of start operation for targetVMStorage on storage1: 0 (ok)
> Jun 20 11:48:41 storage1 crmd[240695]: notice: Initiating monitor operation targetVMStorage_monitor_10000 locally on storage1
> Jun 20 11:48:41 storage1 crmd[240695]: notice: Initiating start operation lunVMStorage_start_0 locally on storage1
> Jun 20 11:48:41 storage1 iSCSILogicalUnit(lunVMStorage)[42555]: INFO: Created block storage object lunVMStorage using /dev/storage/vmstorage.
> Jun 20 11:48:42 storage1 iSCSILogicalUnit(lunVMStorage)[42555]: INFO: Created LUN 1. Created LUN 1->1 mapping in node ACL iqn.1994-05.com.redhat:84f0f7458c58 Created LUN 1->1 mapping in node ACL iqn.1994-05.com.redhat:3d066d1f423e
> Jun 20 11:48:42 storage1 iSCSILogicalUnit(lunVMStorage)[42555]: ERROR: No such NetworkPortal in configfs: /sys/kernel/config/target/iscsi/iqn.2019-03.com.leepfrog:storage.vmstorage/tpgt_1/np/0.0.0.0:3260
> Jun 20 11:48:42 storage1 lrmd[240692]: notice: lunVMStorage_start_0:42555:stderr [ /usr/lib/ocf/resource.d/heartbeat/iSCSILogicalUnit: line 410: /sys/kernel/config/target/core/iblock_0/lunVMStorage/wwn/vpd_unit_serial: No such file or directory ]
> Jun 20 11:48:42 storage1 crmd[240695]: notice: Result of start operation for lunVMStorage on storage1: 0 (ok)
> Jun 20 11:48:42 storage1 crmd[240695]: notice: Initiating monitor operation lunVMStorage_monitor_10000 locally on storage1
> Jun 20 11:48:42 storage1 crmd[240695]: notice: Transition 2 (Complete=18, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-error-7.bz2): Complete
> Jun 20 11:48:42 storage1 crmd[240695]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
>
> Here are the logs from storage2, which was coming back online:
>
> Jun 20 11:48:36 storage2 crmd[22305]: notice: Result of probe operation for ISCSIMillipedeIP on storage2: 7 (not running)
> Jun 20 11:48:36 storage2 crmd[22305]: notice: Result of probe operation for ISCSICentipedeIP on storage2: 7 (not running)
> Jun 20 11:48:36 storage2 crmd[22305]: notice: Result of probe operation for targetRHEVM on storage2: 0 (ok)
> Jun 20 11:48:36 storage2 crmd[22305]: notice: Result of probe operation for targetVMStorage on storage2: 0 (ok)
> Jun 20 11:48:36 storage2 crmd[22305]: notice: Result of probe operation for lunRHEVM on storage2: 7 (not running)
> Jun 20 11:48:36 storage2 crmd[22305]: notice: Result of probe operation for lunVMStorage on storage2: 7 (not running)

--
Ken Gaillot <[email protected]>
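
One way to check the first possibility -- something outside Pacemaker recreating the iSCSI configuration at boot -- is to look at the LIO restore service on storage2 right after a reboot, before taking the node out of standby. A minimal sketch, assuming the stock targetcli/LIO setup on CentOS 7 (unit name and paths may differ):

-----
# On storage2, after reboot but before "unstandby":
systemctl is-enabled target.service   # targetcli's boot-time LIO config restore unit
ls -l /etc/target/saveconfig.json     # saved LIO configuration that target.service would restore
targetcli ls /iscsi                   # targets already present in the kernel target subsystem
-----

If the targets show up here while Pacemaker still considers the node idle, something outside the cluster recreated them at boot, which would match the "0 (ok)" probe results reported on storage2 above.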

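If that does turn out to be the cause, a possible follow-up (again only a sketch, not verified against this exact configuration) is to leave the LIO configuration entirely to Pacemaker and then clear the failed probes:

-----
# On both nodes: stop the boot-time restore of the saved LIO config
systemctl disable target.service
mv /etc/target/saveconfig.json /etc/target/saveconfig.json.bak   # optional: move the saved config aside

# Then let Pacemaker re-probe and clear the failure history
pcs resource cleanup targetRHEVM
pcs resource cleanup targetVMStorage
-----

The colocation constraint itself only governs where Pacemaker places the resources; it cannot stop a service outside the cluster from configuring the same target.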