[ClusterLabs] [Question:pacemaker_remote] About limitation of the placement of the resource to remote node.

2015-08-12 Thread renayama19661014
Hi All,

We have been testing the behavior of 
pacemaker_remote (version: pacemaker-ad1f397a8228a63949f86c96597da5cecc3ed977).

The cluster consists of the following nodes:
 * sl7-01(KVM host)
 * snmp1(Guest on the sl7-01 host)
 * snmp2(Guest on the sl7-01 host)

We prepared the following CLI file to check resource placement on the remote 
nodes.

--
property no-quorum-policy=ignore \
  stonith-enabled=false \
  startup-fencing=false

rsc_defaults resource-stickiness=INFINITY \
  migration-threshold=1

primitive remote-vm2 ocf:pacemaker:remote \
  params server=snmp1 \
  op monitor interval=3 timeout=15

primitive remote-vm3 ocf:pacemaker:remote \
  params server=snmp2 \
  op monitor interval=3 timeout=15

primitive dummy-remote-A Dummy \
  op start interval=0s timeout=60s \
  op monitor interval=30s timeout=60s \
  op stop interval=0s timeout=60s

primitive dummy-remote-B Dummy \
  op start interval=0s timeout=60s \
  op monitor interval=30s timeout=60s \
  op stop interval=0s timeout=60s

location loc1 dummy-remote-A \
  rule 200: #uname eq remote-vm3 \
  rule 100: #uname eq remote-vm2 \
  rule -inf: #uname eq sl7-01
location loc2 dummy-remote-B \
  rule 200: #uname eq remote-vm3 \
  rule 100: #uname eq remote-vm2 \
  rule -inf: #uname eq sl7-01
--

Case 1) When we load the CLI file above, the resources are placed as follows.
 However, the placement of the dummy-remote resources does not match what we 
expect from the location rules:
 dummy-remote-A starts on remote-vm2 instead of remote-vm3.

[root@sl7-01 ~]# crm_mon -1 -Af
Last updated: Thu Aug 13 08:49:09 2015          Last change: Thu Aug 13 
08:41:14 2015 by root via cibadmin on sl7-01
Stack: corosync
Current DC: sl7-01 (version 1.1.13-ad1f397) - partition WITHOUT quorum
3 nodes and 4 resources configured

Online: [ sl7-01 ]
RemoteOnline: [ remote-vm2 remote-vm3 ]

 dummy-remote-A (ocf::heartbeat:Dummy): Started remote-vm2
 dummy-remote-B (ocf::heartbeat:Dummy): Started remote-vm3
 remote-vm2     (ocf::pacemaker:remote):        Started sl7-01
 remote-vm3     (ocf::pacemaker:remote):        Started sl7-01

(snip)

Case 2) When we revise the CLI file as shown below and load it, the resources 
are placed as expected.
 dummy-remote-A starts on remote-vm3.
 dummy-remote-B starts on remote-vm3.


(snip)
location loc1 dummy-remote-A \
  rule 200: #uname eq remote-vm3 \
  rule 100: #uname eq remote-vm2 \
  rule -inf: #uname ne remote-vm2 and #uname ne remote-vm3 \
  rule -inf: #uname eq sl7-01
location loc2 dummy-remote-B \
  rule 200: #uname eq remote-vm3 \
  rule 100: #uname eq remote-vm2 \
  rule -inf: #uname ne remote-vm2 and #uname ne remote-vm3 \
  rule -inf: #uname eq sl7-01
(snip)


[root@sl7-01 ~]# crm_mon -1 -Af
Last updated: Thu Aug 13 08:55:28 2015          Last change: Thu Aug 13 
08:55:22 2015 by root via cibadmin on sl7-01
Stack: corosync
Current DC: sl7-01 (version 1.1.13-ad1f397) - partition WITHOUT quorum
3 nodes and 4 resources configured

Online: [ sl7-01 ]
RemoteOnline: [ remote-vm2 remote-vm3 ]

 dummy-remote-A (ocf::heartbeat:Dummy): Started remote-vm3
 dummy-remote-B (ocf::heartbeat:Dummy): Started remote-vm3
 remote-vm2     (ocf::pacemaker:remote):        Started sl7-01
 remote-vm3     (ocf::pacemaker:remote):        Started sl7-01

(snip)

The placement is wrong with the first CLI file apparently because location 
constraints that refer to a remote node are not evaluated until the 
corresponding remote resource has started.

The placement becomes correct with the revised CLI file, but describing the 
constraints this way is very cumbersome when we build a cluster with more 
nodes.

Shouldn't the cluster delay evaluating the placement constraints for a remote 
node until the remote node has started?

Is there an easier way to describe constraints that restrict a resource to the 
remote nodes? (One idea is sketched after the notes below.)

 * As one workaround, we know that the placement works correctly if we split 
the first CLI file into two.
   * First we load the CLI that starts the remote nodes, then we load the CLI 
that starts the resources on them.
 * However, we would prefer not to split the CLI file into two if possible.
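
One idea we are wondering about: could a rule based on the special #kind node 
attribute remove the need to name every cluster node explicitly? A minimal 
sketch, assuming the running Pacemaker supports #kind for remote nodes (we 
have not tested this configuration):

--
# Score -INFINITY on every node that is not a remote node,
# instead of listing sl7-01 (and any future cluster nodes) by name.
location loc1 dummy-remote-A \
  rule 200: #uname eq remote-vm3 \
  rule 100: #uname eq remote-vm2 \
  rule -inf: #kind ne remote
location loc2 dummy-remote-B \
  rule 200: #uname eq remote-vm3 \
  rule 100: #uname eq remote-vm2 \
  rule -inf: #kind ne remote
--

If #kind is not available, the explicit "-inf: #uname eq <cluster node>" rules 
would still be needed for every cluster node.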

Best Regards,
Hideo Yamauchi.


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Problem] The SNMP trap which has been already started is transmitted.

2015-08-17 Thread renayama19661014
Hi Andrew,


   I used the built-in SNMP.

   I started as a daemon with -d option.
 
 Is it running on both nodes or just snmp1?


On both nodes.

[root@snmp1 ~]# ps -ef |grep crm_mon
root      4923     1  0 09:42 ?        00:00:00 crm_mon -d -S 192.168.40.2 -W 
-p /tmp/ClusterMon-upstart.pid
[root@snmp2 ~]# ps -ef |grep crm_mon
root      4860     1  0 09:42 ?        00:00:00 crm_mon -d -S 192.168.40.2 -W 
-p /tmp/ClusterMon-upstart.pid


 Because there is no logic in crm_mon that would have remapped the monitor 
 to 
 start, so my working theory is that its a duplicate of an old event.
 Can you tell which node the trap is being sent from?


The trap is transmitted by the snmp1 node.

The trap is not sent from the snmp2 node, which is the one that rebooted.
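
If it helps to double-check independently of the snmptrapd log, a capture on 
the SNMP manager should show which node sends each trap (a sketch only; 
162/udp is the default trap port, adjust it if yours differs):

# Show the source address of every incoming SNMP trap on the manager host.
tcpdump -n -i any udp port 162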


Aug 18 09:44:37 SNMP-MANAGER snmptrapd[1334]: 2015-08-18 09:44:37 snmp1 [UDP: 
[192.168.40.100]:59668-[192.168.40.2]]:#012DISMAN-EVENT-MIB::sysUpTimeInstance 
= Timeticks: (1439858677) 166 days, 15:36:26.77#011SNMPv2-MIB::snmpTrapOID.0 = 
OID: 
PACEMAKER-MIB::pacemakerNotification#011PACEMAKER-MIB::pacemakerNotificationResource
 = STRING: prmDummy#011PACEMAKER-MIB::pacemakerNotificationNode = STRING: 
snmp1#011PACEMAKER-MIB::pacemakerNotificationOperation = STRING: 
start#011PACEMAKER-MIB::pacemakerNotificationDescription = STRING: 
OK#011PACEMAKER-MIB::pacemakerNotificationReturnCode = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationTargetReturnCode = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationStatus = INTEGER: 0
Aug 18 09:44:37 SNMP-MANAGER snmptrapd[1334]: 2015-08-18 09:44:37 snmp1 [UDP: 
[192.168.40.100]:59668-[192.168.40.2]]:#012DISMAN-EVENT-MIB::sysUpTimeInstance 
= Timeticks: (1439858677) 166 days, 15:36:26.77#011SNMPv2-MIB::snmpTrapOID.0 = 
OID: 
PACEMAKER-MIB::pacemakerNotification#011PACEMAKER-MIB::pacemakerNotificationResource
 = STRING: prmDummy#011PACEMAKER-MIB::pacemakerNotificationNode = STRING: 
snmp1#011PACEMAKER-MIB::pacemakerNotificationOperation = STRING: 
monitor#011PACEMAKER-MIB::pacemakerNotificationDescription = STRING: 
OK#011PACEMAKER-MIB::pacemakerNotificationReturnCode = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationTargetReturnCode = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationStatus = INTEGER: 0


Best Regards,
Hideo Yamauchi.




- Original Message -
 From: renayama19661...@ybb.ne.jp renayama19661...@ybb.ne.jp
 To: Cluster Labs - All topics related to open-source clustering welcomed 
 users@clusterlabs.org
 Cc: 
 Date: 2015/8/17, Mon 10:05
 Subject: Re: [ClusterLabs] [Problem] The SNMP trap which has been already 
 started is transmitted.
 
 Hi Andrew,
 
 Thank you for comments.
 
 
 I will confirm it tomorrow.
 I am a vacation today.
 
 Best Regards,
 Hideo Yamauchi.
 
 
 - Original Message -
  From: Andrew Beekhof and...@beekhof.net
  To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
 open-source clustering welcomed users@clusterlabs.org
  Cc: 
  Date: 2015/8/17, Mon 09:30
  Subject: Re: [ClusterLabs] [Problem] The SNMP trap which has been already 
 started is transmitted.
 
 
   On 4 Aug 2015, at 7:36 pm, renayama19661...@ybb.ne.jp wrote:
 
   Hi Andrew,
 
   Thank you for comments.
 
   However, a trap of crm_mon is sent to an SNMP manager.
    
   Are you using the built-in SNMP logic or using -E to give crm_mon 
 a 
  script which 
   is then producing the trap?
   (I’m trying to figure out who could be turning the monitor action 
 into 
  a start)
 
 
   I used the built-in SNMP.
   I started as a daemon with -d option.
 
  Is it running on both nodes or just snmp1?
  Because there is no logic in crm_mon that would have remapped the monitor 
 to 
  start, so my working theory is that its a duplicate of an old event.
  Can you tell which node the trap is being sent from?
 
 
 
   Best Regards,
   Hideo Yamauchi.
 
 
   - Original Message -
   From: Andrew Beekhof and...@beekhof.net
   To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related 
 to 
  open-source clustering welcomed users@clusterlabs.org
   Cc: 
   Date: 2015/8/4, Tue 14:15
   Subject: Re: [ClusterLabs] [Problem] The SNMP trap which has been 
  already started is transmitted.
 
 
   On 27 Jul 2015, at 4:18 pm, renayama19661...@ybb.ne.jp wrote:
 
   Hi All,
 
   The transmission of the SNMP trap of crm_mon seems to have a 
  problem.
   I identified a problem on latest Pacemaker and 
 Pacemaker1.1.13.
 
 
   Step 1) I constitute a cluster and send simple CLI file.
 
   [root@snmp1 ~]# crm_mon -1 
   Last updated: Mon Jul 27 14:40:37 2015          Last change: 
 Mon 
  Jul 27 
   14:40:29 2015 by root via cibadmin on snmp1
   Stack: corosync
   Current DC: snmp1 (version 1.1.13-3d781d3) - partition with 
 quorum
   2 nodes and 1 resource configured
 
   Online: [ snmp1 snmp2 ]
 
     prmDummy       (ocf::heartbeat:Dummy): Started snmp1
 
   Step 2) I stop a node of the standby once.
 
   [root@snmp2 ~]# stop pacemaker
   pacemaker stop/waiting
 
 
   Step 3) I start a node of the standby again.
   

Re: [ClusterLabs] [Question:pacemaker_remote] By the operation that remote node cannot carry out a cluster, the resource does not move. (STONITH is not carried out, too)

2015-08-17 Thread renayama19661014
Hi Andrew,


The correction still seems to have a problem.

The cluster is waiting for a demote that cannot run, so the master-group 
resources cannot move.
[root@bl460g8n3 ~]# crm_mon -1 -Af
Last updated: Tue Aug 18 11:13:39 2015          Last change: Tue Aug 18 
11:11:01 2015 by root via crm_resource on bl460g8n4
Stack: corosync
Current DC: bl460g8n3 (version 1.1.13-7d0cac0) - partition with quorum
4 nodes and 10 resources configured

Online: [ bl460g8n3 bl460g8n4 ]
GuestOnline: [ pgsr02@bl460g8n4 ]

 prmDB2 (ocf::heartbeat:VirtualDomain): Started bl460g8n4
 Resource Group: grpStonith1
     prmStonith1-2      (stonith:external/ipmi):        Started bl460g8n4
 Resource Group: grpStonith2
     prmStonith2-2      (stonith:external/ipmi):        Started bl460g8n3
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ pgsr02 ]

Node Attributes:
* Node bl460g8n3:
* Node bl460g8n4:
* Node pgsr02@bl460g8n4:
    + master-pgsql                      : 10        

Migration Summary:
* Node bl460g8n3:
   pgsr01: migration-threshold=1 fail-count=1 last-failure='Tue Aug 18 11:12:03 
2015'
* Node bl460g8n4:
* Node pgsr02@bl460g8n4:

Failed Actions:
* pgsr01_monitor_3 on bl460g8n3 'unknown error' (1): call=2, status=Error, 
exitreason='none',
    last-rc-change='Tue Aug 18 11:12:03 2015', queued=0ms, exec=0ms

(snip)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Container prmDB1 and the 
resources within it have failed 1 times on bl460g8n3
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Forcing prmDB1 away from 
bl460g8n3 after 1 failures (max=1)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: pgsr01 has failed 1 times on 
bl460g8n3
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Forcing pgsr01 away from 
bl460g8n3 after 1 failures (max=1)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: prmDB1: Rolling back scores 
from pgsr01
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Resource prmDB1 cannot run 
anywhere
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Resource pgsr01 cannot run 
anywhere
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: pgsql:0: Rolling back scores 
from vip-master
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Resource pgsql:0 cannot run 
anywhere
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Promoting pgsql:1 (Master 
pgsr02)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: msPostgresql: Promoted 1 
instances of a possible 1 to master
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action vip-master_stop_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info:  Start recurring monitor (10s) 
for vip-master on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action vip-rep_stop_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info:  Start recurring monitor (10s) 
for vip-rep on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info:  Start recurring monitor (9s) 
for pgsql:1 on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_demote_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: warning: Action pgsql:0_stop_0 on 
pgsr01 is unrunnable (offline)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info:  Start recurring monitor (9s) 
for pgsql:1 on pgsr02
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Impliying node pgsr01 is down 
when container prmDB1 is stopped ((nil))
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   prmDB1  (Stopped)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   prmDB2  (Started 
bl460g8n4)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   prmStonith1-2   
(Started bl460g8n4)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   prmStonith2-2   
(Started bl460g8n3)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: notice: Stop    vip-master    
(Started pgsr01 - blocked)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: notice: Stop    vip-rep       
(Started pgsr01 - blocked)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: notice: Demote  pgsql:0       (Master 
- Stopped pgsr01 - blocked)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   pgsql:1 (Master pgsr02)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   pgsr01  (Stopped)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: info: Leave   pgsr02  (Started 
bl460g8n4)
Aug 18 11:12:07 bl460g8n3 pengine[10325]: crit: Cannot shut down node 'pgsr01' 
because of pgsql:0: blocked failed
Aug 

Re: [ClusterLabs] [Problem] The SNMP trap which has been already started is transmitted.

2015-08-16 Thread renayama19661014
Hi Andrew,

Thank you for comments.


I will confirm it tomorrow.
I am on vacation today.

Best Regards,
Hideo Yamauchi.


- Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
 open-source clustering welcomed users@clusterlabs.org
 Cc: 
 Date: 2015/8/17, Mon 09:30
 Subject: Re: [ClusterLabs] [Problem] The SNMP trap which has been already 
 started is transmitted.
 
 
  On 4 Aug 2015, at 7:36 pm, renayama19661...@ybb.ne.jp wrote:
 
  Hi Andrew,
 
  Thank you for comments.
 
  However, a trap of crm_mon is sent to an SNMP manager.
   
  Are you using the built-in SNMP logic or using -E to give crm_mon a 
 script which 
  is then producing the trap?
  (I’m trying to figure out who could be turning the monitor action into 
 a start)
 
 
  I used the built-in SNMP.
  I started as a daemon with -d option.
 
 Is it running on both nodes or just snmp1?
 Because there is no logic in crm_mon that would have remapped the monitor to 
 start, so my working theory is that its a duplicate of an old event.
 Can you tell which node the trap is being sent from?
 
 
 
  Best Regards,
  Hideo Yamauchi.
 
 
  - Original Message -
  From: Andrew Beekhof and...@beekhof.net
  To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
 open-source clustering welcomed users@clusterlabs.org
  Cc: 
  Date: 2015/8/4, Tue 14:15
  Subject: Re: [ClusterLabs] [Problem] The SNMP trap which has been 
 already started is transmitted.
 
 
  On 27 Jul 2015, at 4:18 pm, renayama19661...@ybb.ne.jp wrote:
 
  Hi All,
 
  The transmission of the SNMP trap of crm_mon seems to have a 
 problem.
  I identified a problem on latest Pacemaker and Pacemaker1.1.13.
 
 
  Step 1) I constitute a cluster and send simple CLI file.
 
  [root@snmp1 ~]# crm_mon -1 
  Last updated: Mon Jul 27 14:40:37 2015          Last change: Mon 
 Jul 27 
  14:40:29 2015 by root via cibadmin on snmp1
  Stack: corosync
  Current DC: snmp1 (version 1.1.13-3d781d3) - partition with quorum
  2 nodes and 1 resource configured
 
  Online: [ snmp1 snmp2 ]
 
    prmDummy       (ocf::heartbeat:Dummy): Started snmp1
 
  Step 2) I stop a node of the standby once.
 
  [root@snmp2 ~]# stop pacemaker
  pacemaker stop/waiting
 
 
  Step 3) I start a node of the standby again.
  [root@snmp2 ~]# start pacemaker
  pacemaker start/running, process 2284
 
  Step 4) The indication of crm_mon does not change in particular.
  [root@snmp1 ~]# crm_mon -1
  Last updated: Mon Jul 27 14:45:12 2015          Last change: Mon 
 Jul 27 
  14:40:29 2015 by root via cibadmin on snmp1
  Stack: corosync
  Current DC: snmp1 (version 1.1.13-3d781d3) - partition with quorum
  2 nodes and 1 resource configured
 
  Online: [ snmp1 snmp2 ]
 
    prmDummy       (ocf::heartbeat:Dummy): Started snmp1
 
 
  In addition, as for the resource that started in snmp1 node, 
 nothing 
  changes.
 
  ---
  Jul 27 14:41:39 snmp1 crmd[29116]:   notice: State transition 
 S_IDLE - 
  S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
  origin=abort_transition_graph ]
  Jul 27 14:41:39 snmp1 cib[29111]:     info: Completed cib_modify 
 operation 
  for section status: OK (rc=0, origin=snmp1/attrd/11, version=0.4.20)
  Jul 27 14:41:39 snmp1 attrd[29114]:     info: Update 11 for 
 probe_complete: 
  OK (0)
  Jul 27 14:41:39 snmp1 attrd[29114]:     info: Update 11 for 
  probe_complete[snmp1]=true: OK (0)
  Jul 27 14:41:39 snmp1 attrd[29114]:     info: Update 11 for 
  probe_complete[snmp2]=true: OK (0)
  Jul 27 14:41:39 snmp1 cib[29202]:     info: Wrote version 0.4.0 of 
 the CIB 
  to disk (digest: a1f1920279fe0b1466a79cab09fa77d6)
  Jul 27 14:41:39 snmp1 pengine[29115]:   notice: On loss of CCM 
 Quorum: 
  Ignore
  Jul 27 14:41:39 snmp1 pengine[29115]:     info: Node snmp2 is 
 online
  Jul 27 14:41:39 snmp1 pengine[29115]:     info: Node snmp1 is 
 online
  Jul 27 14:41:39 snmp1 pengine[29115]:     info: 
  prmDummy#011(ocf::heartbeat:Dummy):#011Started snmp1
  Jul 27 14:41:39 snmp1 pengine[29115]:     info: Leave  
  prmDummy#011(Started snmp1)
  ---
 
  However, a trap of crm_mon is sent to an SNMP manager.
 
  Are you using the built-in SNMP logic or using -E to give crm_mon a 
 script which 
  is then producing the trap?
  (I’m trying to figure out who could be turning the monitor action into 
 a start)
 
  The resource does not reboot, but the SNMP trap which a resource 
 started is 
  sent.
 
  ---
  Jul 27 14:41:39 SNMP-MANAGER snmptrapd[4521]: 2015-07-27 14:41:39 
 snmp1 
  [UDP: 
 
 [192.168.40.100]:35265-[192.168.40.2]]:#012DISMAN-EVENT-MIB::sysUpTimeInstance
  
 
  = Timeticks: (1437975699) 166 days, 
 10:22:36.99#011SNMPv2-MIB::snmpTrapOID.0 = 
  OID: 
 
 PACEMAKER-MIB::pacemakerNotification#011PACEMAKER-MIB::pacemakerNotificationResource
  
 
  = STRING: 
 prmDummy#011PACEMAKER-MIB::pacemakerNotificationNode = 
  STRING: 
 snmp1#011PACEMAKER-MIB::pacemakerNotificationOperation = 
  STRING: 
 

[ClusterLabs] [Question] About deletion of SysVStartPriority.

2015-08-06 Thread renayama19661014
Hi All,

We have a question about the following commit:
 * 
https://github.com/ClusterLabs/pacemaker/commit/a97c28d75347aa7be76092aa22459f0f56a220ff

We understand that this change follows a change in systemd.


In Pacemaker 1.1.13, SysVStartPriority=99 is set in the unit file.
It is set in Pacemaker 1.1.12 as well.


Is it a problem to delete SysVStartPriority=99 when we use Pacemaker 1.1.12?
Or must we keep it when we use Pacemaker 1.1.12 and Pacemaker 1.1.13?
Or does the decision depend on the systemd version shipped with the OS?

-
[Unit]
Description=Pacemaker High Availability Cluster Manager

After=basic.target
After=network.target
After=corosync.service

Requires=basic.target
Requires=network.target
Requires=corosync.service

[Install]
WantedBy=multi-user.target

[Service]
Type=simple
KillMode=process
NotifyAccess=main
#SysVStartPriority=99
(snip)
-
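
If the decision does depend on the systemd version, a quick way to check what 
the OS ships (a sketch only; the follow-up in this thread indicates that 
systemd 218 and later no longer honour this option):

# Print the installed systemd version; reportedly systemd 218+ ignores
# SysVStartPriority entirely, so the setting only matters on older systemd.
systemctl --version | head -n 1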

Best Regards,
Hideo Yamauchi.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Question] About deletion of SysVStartPriority.

2015-08-06 Thread renayama19661014
Hi Jan,

Thank you for comments.


 When we use Pacemaker1.1.12, does it have a problem to delete 
 SysVStartPriority=99?
 Or must we not delete it when use Pacemaker1.1.12 and Pacemaker1.1.13?
 Or is it necessary to judge it by a version of systemd of the OS, and to 
 set it?
 
 It was a leftover from times systemd took that value into account
 (definitely not the case with systemd-218+), and yes, systemd version
 is the only deciding factor whether it makes sense to state the
 parameter in the unit file.  I wouldn't expect this change will
 cause any issue regardless since the order at the startup/shutdown
 is pretty clear anyway (After/Requires).


We will delete SysVStartPriority=99 in our Pacemaker 1.1.12 environment and 
run with that change.
If a problem occurs, I will report it.

Many Thanks!
Hideo Yamauchi.



- Original Message -
 From: Jan Pokorný jpoko...@redhat.com
 To: users@clusterlabs.org
 Cc: 
 Date: 2015/8/6, Thu 16:13
 Subject: Re: [ClusterLabs] [Question] About deletion of SysVStartPriority.
 
 On 06/08/15 15:42 +0900, renayama19661...@ybb.ne.jp wrote:
  We have a question for the next correction.
   * 
 https://github.com/ClusterLabs/pacemaker/commit/a97c28d75347aa7be76092aa22459f0f56a220ff
 
  We understand it that this obeyed a correction of systemd.
 
 
  In Pacemaker1.1.13, SysVStartPriority=99 is set.
  Pacemaker1.1.12 is set, too.
 
 
  When we use Pacemaker1.1.12, does it have a problem to delete 
 SysVStartPriority=99?
  Or must we not delete it when use Pacemaker1.1.12 and Pacemaker1.1.13?
  Or is it necessary to judge it by a version of systemd of the OS, and to 
 set it?
 
 It was a leftover from times systemd took that value into account
 (definitely not the case with systemd-218+), and yes, systemd version
 is the only deciding factor whether it makes sense to state the
 parameter in the unit file.  I wouldn't expect this change will
 cause any issue regardless since the order at the startup/shutdown
 is pretty clear anyway (After/Requires).
 
 So if you want, just go for it (and let us know in case of troubles).
 After all, unit files are there to be tweaked by keen sysadmins like
 you via overrides in /etc/systemd/user :-)
 
 -- 
 Jan (Poki)
 
 ___
 Users mailing list: Users@clusterlabs.org
 http://clusterlabs.org/mailman/listinfo/users
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [Enhancement] When STONITH is not completed, a resource moves.

2015-10-28 Thread renayama19661014
Hi All,

We ran into the following problem with Pacemaker 1.1.12:
a resource moved to another node while STONITH had not yet completed.

The following sequence seemed to happen in the cluster:

Step 1) Start the cluster.

Step 2) Node 1 breaks down.

Step 3) Node 1 rejoins the cluster before the STONITH initiated by node 2 
completes.

Step 4) Steps 2 and 3 repeat.

Step 5) The STONITH from node 2 never completes, but a resource moves to node 2 
anyway.



When we examined the pe file for the transition in which the resource moved to 
node 2, it contained no resource information for node 1.
(snip)
  [The node_state XML for node 1 and node 2 appeared here, but the tags were 
stripped by the mail archive.]
(snip)

While STONITH is still incomplete, the node's status information is erased 
from the CIB, so the CIB no longer holds that node's resource information; 
this appears to be what allows the resource to move.

The trigger was that cluster communication became unstable.
However, we consider this behaviour of the cluster itself to be a problem.

We have not reproduced this problem with Pacemaker 1.1.13 so far.
However, as far as we can see in the source code, the processing is the same.

Shouldn't the deletion of the node information be deferred until the new node 
information has been gathered?

 * crmd/callback.c
(snip)
void
peer_update_callback(enum crm_status_type type, crm_node_t * node, const void 
*data)
{
(snip)
     if (down) {
            const char *task = crm_element_value(down->xml, XML_LRM_ATTR_TASK);

            if (alive && safe_str_eq(task, CRM_OP_FENCE)) {
                crm_info("Node return implies stonith of %s (action %d) 
completed", node->uname,
                         down->id);

                st_fail_count_reset(node->uname);

                erase_status_tag(node->uname, XML_CIB_TAG_LRM, cib_scope_local);
                erase_status_tag(node->uname, XML_TAG_TRANSIENT_NODEATTRS, 
cib_scope_local);
                /* down->confirmed = TRUE; Only stonith-ng returning should 
imply completion */
                down->sent_update = TRUE;       /* Prevent 
tengine_stonith_callback() from calling send_stonith_update() */

(snip)


 * We have the logs, but cannot attach them because they contain user-specific 
information.
 * Please contact me by email if you need them.


This issue is registered in Bugzilla:
 * http://bugs.clusterlabs.org/show_bug.cgi?id=5254


Best Regards,
Hideo Yamauchi.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: [Question] Question about mysql RA.

2015-11-12 Thread renayama19661014
Hi Ken,
Hi Ulrich,

Hi All,

I sent a patch.
 * https://github.com/ClusterLabs/resource-agents/pull/698

Please confirm it.

Best Regards,
Hideo Yamauchi.


- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Cc: 
> Date: 2015/11/5, Thu 19:36
> Subject: Re: [ClusterLabs] Antw: Re:  [Question] Question about mysql RA.
> 
> Hi Ken,
> Hi Ulrich,
> 
> Thank you for comment
> 
> The RA of mysql seemed to have a problem somehow or other from the beginning 
> as 
> far as I heard the opinion of Ken and Ulrich.
> 
> I wait for the opinion of other people a little more, and I make a patch.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> - Original Message -
>>  From: Ulrich Windl 
>>  To: users@clusterlabs.org; kgail...@redhat.com
>>  Cc: 
>>  Date: 2015/11/5, Thu 16:11
>>  Subject: [ClusterLabs] Antw: Re:  [Question] Question about mysql RA.
>> 
>   Ken Gaillot  schrieb am 04.11.2015 
> um 
>>  16:44 in Nachricht
>>  <563a27c2.5090...@redhat.com>:
>>>   On 11/04/2015 04:36 AM, renayama19661...@ybb.ne.jp wrote:
>>  [...]
       pid=`cat $OCF_RESKEY_pid 2> /dev/null `
       /bin/kill $pid > /dev/null
>>> 
>>>   I think before this line, the RA should do a "kill -0" to 
> check 
>>  whether
>>>   the PID is running, and return $OCF_SUCCESS if not. That way, we can
>>>   still return an error if the real kill fails.
>> 
>>  And remove the stale PID file if there is no such pid. For very busy 
> systems one 
>>  could use ps for that PID to see whether the PID belongs to the expected 
>>  process. There is a small chance that a PID exists, but does not belong to 
> the 
>>  expected process...
>> 
>>> 
       rc=$?
       if [ $rc != 0 ]; then
           ocf_exit_reason "MySQL couldn't be stopped"
           return $OCF_ERR_GENERIC
       fi
   (snip)
   -
 
   The mysql RA does such a code from old days.
    * http://hg.linux-ha.org/agents/file/67234f982ab7/heartbeat/mysql 
> 
 
   Does mysql RA know the reason becoming this made?
   Possibly is it a factor to be conscious of mysql cluster?
 
   I think about a patch of this movement of mysql RA.
   I want to know the detailed reason.
 
   Best Regards,
   Hideo Yamauchi.
>>> 
>>> 
>>>   ___
>>>   Users mailing list: Users@clusterlabs.org 
>>>   http://clusterlabs.org/mailman/listinfo/users 
>>> 
>>>   Project Home: http://www.clusterlabs.org 
>>>   Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>>>   Bugs: http://bugs.clusterlabs.org 
>> 
>> 
>> 
>> 
>> 
>>  ___
>>  Users mailing list: Users@clusterlabs.org
>>  http://clusterlabs.org/mailman/listinfo/users
>> 
>>  Project Home: http://www.clusterlabs.org
>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  Bugs: http://bugs.clusterlabs.org
>> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: [Question] Question about mysql RA.

2015-11-16 Thread renayama19661014
Hi Dejan,


All right!

Thank you for merging the patch.



Many Thanks!
Hideo Yamauchi.


- Original Message -
> From: Dejan Muhamedagic 
> To: users@clusterlabs.org
> Cc: 
> Date: 2015/11/16, Mon 18:02
> Subject: Re: [ClusterLabs] Antw: Re:  [Question] Question about mysql RA.
> 
> Hi Hideo-san,
> 
> On Thu, Nov 12, 2015 at 06:15:29PM +0900, renayama19661...@ybb.ne.jp wrote:
>>  Hi Ken,
>>  Hi Ulrich,
>> 
>>  Hi All,
>> 
>>  I sent a patch.
>>   * https://github.com/ClusterLabs/resource-agents/pull/698
> 
> Your patch was merged. Many thanks.
> 
> Cheers,
> 
> Dejan
> 
>> 
>>  Please confirm it.
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
>> 
>> 
>>  - Original Message -
>>  > From: "renayama19661...@ybb.ne.jp" 
> 
>>  > To: Cluster Labs - All topics related to open-source clustering 
> welcomed 
>>  > Cc: 
>>  > Date: 2015/11/5, Thu 19:36
>>  > Subject: Re: [ClusterLabs] Antw: Re:  [Question] Question about mysql 
> RA.
>>  > 
>>  > Hi Ken,
>>  > Hi Ulrich,
>>  > 
>>  > Thank you for comment
>>  > 
>>  > The RA of mysql seemed to have a problem somehow or other from the 
> beginning as 
>>  > far as I heard the opinion of Ken and Ulrich.
>>  > 
>>  > I wait for the opinion of other people a little more, and I make a 
> patch.
>>  > 
>>  > Best Regards,
>>  > Hideo Yamauchi.
>>  > 
>>  > 
>>  > - Original Message -
>>  >>  From: Ulrich Windl 
>>  >>  To: users@clusterlabs.org; kgail...@redhat.com
>>  >>  Cc: 
>>  >>  Date: 2015/11/5, Thu 16:11
>>  >>  Subject: [ClusterLabs] Antw: Re:  [Question] Question about mysql 
> RA.
>>  >> 
>>  >   Ken Gaillot  schrieb am 
> 04.11.2015 
>>  > um 
>>  >>  16:44 in Nachricht
>>  >>  <563a27c2.5090...@redhat.com>:
>>  >>>   On 11/04/2015 04:36 AM, renayama19661...@ybb.ne.jp wrote:
>>  >>  [...]
>>         pid=`cat $OCF_RESKEY_pid 2> /dev/null `
>>         /bin/kill $pid > /dev/null
>>  >>> 
>>  >>>   I think before this line, the RA should do a "kill 
> -0" to 
>>  > check 
>>  >>  whether
>>  >>>   the PID is running, and return $OCF_SUCCESS if not. That 
> way, we can
>>  >>>   still return an error if the real kill fails.
>>  >> 
>>  >>  And remove the stale PID file if there is no such pid. For very 
> busy 
>>  > systems one 
>>  >>  could use ps for that PID to see whether the PID belongs to the 
> expected 
>>  >>  process. There is a small chance that a PID exists, but does not 
> belong to 
>>  > the 
>>  >>  expected process...
>>  >> 
>>  >>> 
>>         rc=$?
>>         if [ $rc != 0 ]; then
>>             ocf_exit_reason "MySQL couldn't be 
> stopped"
>>             return $OCF_ERR_GENERIC
>>         fi
>>     (snip)
>>     
> -
>>   
>>     The mysql RA does such a code from old days.
>>      * 
> http://hg.linux-ha.org/agents/file/67234f982ab7/heartbeat/mysql 
>>  > 
>>   
>>     Does mysql RA know the reason becoming this made?
>>     Possibly is it a factor to be conscious of mysql 
> cluster?
>>   
>>     I think about a patch of this movement of mysql RA.
>>     I want to know the detailed reason.
>>   
>>     Best Regards,
>>     Hideo Yamauchi.
>>  >>> 
>>  >>> 
>>  >>>   ___
>>  >>>   Users mailing list: Users@clusterlabs.org 
>>  >>>   http://clusterlabs.org/mailman/listinfo/users 
>>  >>> 
>>  >>>   Project Home: http://www.clusterlabs.org 
>>  >>>   Getting started: 
>>  > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>>  >>>   Bugs: http://bugs.clusterlabs.org 
>>  >> 
>>  >> 
>>  >> 
>>  >> 
>>  >> 
>>  >>  ___
>>  >>  Users mailing list: Users@clusterlabs.org
>>  >>  http://clusterlabs.org/mailman/listinfo/users
>>  >> 
>>  >>  Project Home: http://www.clusterlabs.org
>>  >>  Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  >>  Bugs: http://bugs.clusterlabs.org
>>  >> 
>>  > 
>>  > ___
>>  > Users mailing list: Users@clusterlabs.org
>>  > http://clusterlabs.org/mailman/listinfo/users
>>  > 
>>  > Project Home: http://www.clusterlabs.org
>>  > Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  > Bugs: http://bugs.clusterlabs.org
>>  > 
>> 
>>  ___
>>  Users mailing list: Users@clusterlabs.org
>>  http://clusterlabs.org/mailman/listinfo/users
>> 
>>  Project Home: http://www.clusterlabs.org
>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  Bugs: http://bugs.clusterlabs.org
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: 

Re: [ClusterLabs] [Patch][glue][external/libvirt] Conversion to a lower case of hostlist.

2015-10-29 Thread renayama19661014
Hi Dejan,
Hi All,

What about the patch that I contributed in my earlier email?
I would like your opinions on it.

Best Regards,
Hideo Yamauchi.

- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Cc: 
> Date: 2015/10/14, Wed 09:38
> Subject: Re: [ClusterLabs] [Patch][glue][external/libvirt] Conversion to a 
> lower case of hostlist.
> 
> Hi Dejan,
> Hi All,
> 
> We reconsidered a patch.
> 
> 
> 
> In Pacemaker1.1, node names in STONITH are always small letters.
> When a user uses a capital letter in host name, STONITH of libvirt fails.
> 
> This patch lets STONITH by libvirt succeed in the next setting.
> 
>  * host name(upper case) and hostlist(upper case) and domain_id on 
> libvirt(uppper case)
>  * host name(upper case) and hostlist(lower case) and domain_id on 
> libvirt(lower 
> case)
>  * host name(lower case) and hostlist(upper case) and domain_id on 
> libvirt(uppper case)
>  * host name(lower case) and hostlist(lower case) and domain_id on 
> libvirt(lower 
> case)
> 
> 
> However, in the case of the next setting, STONITH of libvirt causes an error.
> In this case it is necessary for the user to make the size of the letter of 
> the 
> host name to manage of libvirt the same as host name to appoint in hostlist.
> 
>  * host name(upper case) and hostlist(lower case) and domain_id on 
> libvirt(uppper case)
>  * host name(upper case) and hostlist(uppper case) and domain_id on 
> libvirt(lower case)
>  * host name(lower case) and hostlist(lower case) and domain_id on 
> libvirt(uppper case)
>  * host name(lower case) and hostlist(uppper case) and domain_id on 
> libvirt(lower case)
> 
> 
> This patch is effective for letting STONITH by libvirt when host name was set 
> for a capital letter succeed.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> 
> 
> - Original Message -
>>  From: "renayama19661...@ybb.ne.jp" 
> 
>>  To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
>>  Cc: 
>>  Date: 2015/9/15, Tue 03:28
>>  Subject: Re: [ClusterLabs] [Patch][glue][external/libvirt] Conversion to a 
> lower case of hostlist.
>> 
>>  Hi Dejan,
>> 
>>>   I suppose that you'll send another one? I can vaguelly recall
>>>   a problem with non-lower case node names, but not the specifics.
>>>   Is that supposed to be handled within a stonith agent?
>> 
>> 
>>  Yes.
>>  We make a different patch now.
>>  With the patch, I solve a node name of the non-small letter in the range of 
> 
>>  stonith agent.
>>  # But the patch cannot cover all all patterns.
>> 
>>  Please wait a little longer.
>>  I send a patch again.
>>  For a new patch, please tell me your opinion.
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
>> 
>> 
>> 
>>  - Original Message -
>>>   From: Dejan Muhamedagic 
>>>   To: ClusterLabs-ML 
>>>   Cc: 
>>>   Date: 2015/9/14, Mon 22:20
>>>   Subject: Re: [ClusterLabs] [Patch][glue][external/libvirt] Conversion 
> to a 
>>  lower case of hostlist.
>>> 
>>>   Hi Hideo-san,
>>> 
>>>   On Tue, Sep 08, 2015 at 05:28:05PM +0900, renayama19661...@ybb.ne.jp 
> wrote:
    Hi All,
 
    We intend to change some patches.
    We withdraw this patch.
>>> 
>>>   I suppose that you'll send another one? I can vaguelly recall
>>>   a problem with non-lower case node names, but not the specifics.
>>>   Is that supposed to be handled within a stonith agent?
>>> 
>>>   Cheers,
>>> 
>>>   Dejan
>>> 
    Best Regards,
    Hideo Yamauchi.
 
 
    - Original Message -
    > From: "renayama19661...@ybb.ne.jp" 
>>>   
    > To: ClusterLabs-ML 
    > Cc: 
    > Date: 2015/9/7, Mon 09:06
    > Subject: [ClusterLabs] [Patch][glue][external/libvirt] 
> Conversion 
>>  to a 
>>>   lower case of hostlist.
    > 
    > Hi All,
    > 
    > When a cluster carries out stonith, Pacemaker handles host 
> name 
>>  by a 
>>>   small 
    > letter.
    > When a user sets the host name of the OS and host name of 
>>  hostlist of 
    > external/libvrit in capital letters and waits, stonith is 
> not 
>>  carried 
>>>   out.
    > 
    > The external/libvrit to convert host name of hostlist, and 
> to 
>>  compare 
>>>   can assist 
    > a setting error of the user.
    > 
    > Best Regards,
    > Hideo Yamauchi.
    > 
    > ___
    > Users mailing list: Users@clusterlabs.org
    > http://clusterlabs.org/mailman/listinfo/users
    > 
    > Project Home: http://www.clusterlabs.org
    > Getting started: 
>>>   http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
    > Bugs: http://bugs.clusterlabs.org
    > 
 
    

Re: [ClusterLabs] [Enhancement] When STONITH is not completed, a resource moves.

2015-10-29 Thread renayama19661014
Hi Ken,

Thank you for comments.

> The above is the reason for the behavior you're seeing.
> 
> A fenced node can come back up and rejoin the cluster before the fence
> command reports completion. When Pacemaker sees the rejoin, it assumes
> the fence command completed.
> 
> However in this case, the lost node rejoined on its own while fencing
> was still in progress, so that was an incorrect assumption.
> 
> A proper fix will take some investigation. As a workaround in the
> meantime, you could try increasing the corosync token timeout, so the
> node is not declared lost for brief outages.



We think so, too.
We understand that we can avoid the problem by increasing the corosync token 
timeout.
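
For reference, a minimal corosync.conf sketch of that workaround (the value is 
only an illustration and has to be tuned for the environment):

totem {
    version: 2
    # A longer token timeout (in milliseconds) gives a briefly unreachable
    # node more time to rejoin before corosync declares it lost and
    # fencing is requested.
    token: 10000
}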

If you need the logs from when the problem occurred for your investigation, 
please contact me.


Many Thanks!
Hideo Yamauchi.


- Original Message -
> From: Ken Gaillot 
> To: users@clusterlabs.org
> Cc: 
> Date: 2015/10/29, Thu 23:09
> Subject: Re: [ClusterLabs] [Enhancement] When STONITH is not completed, a 
> resource moves.
> 
> On 10/28/2015 08:39 PM, renayama19661...@ybb.ne.jp wrote:
>>  Hi All,
>> 
>>  The following problem produced us in Pacemaker1.1.12.
>>  While STONITH was not completed, a resource moved it.
>> 
>>  The next movement seemed to happen in a cluster.
>> 
>>  Step1) Start a cluster.
>> 
>>  Step2) Node 1 breaks down.
>> 
>>  Step3) Node 1 is reconnected before practice is completed from node 2 
> STONITH.
>> 
>>  Step4) Repeated between Step2 and Step3.
>> 
>>  Step5) STONITH from node 2 is not completed, but a resource moves to node 
> 2.
>> 
>> 
>> 
>>  There was not resource information of node 1 when I saw pe file when a 
> resource moved in node 2.
>>  (snip)
>>    
>>       in_ccm="false" crmd="offline" 
> crm-debug-origin="do_state_transition" join="down" 
> expected="down">
>>        
>>          
>>             name="last-failure-prm_XXX1" value="1441957021"/>
>>             name="default_ping_set" value="300"/>
>>             name="last-failure-prm_XXX2" value="1441956891"/>
>>             name="shutdown" value="0"/>
>>             name="probe_complete" value="true"/>
>>          
>>        
>>      
>>       crmd="online" crm-debug-origin="do_state_transition" 
> uname="node2" join="member" expected="member">
>>        
>>          
>>             name="shutdown" value="0"/>
>>             name="probe_complete" value="true"/>
>>             name="default_ping_set" value="300"/>
>>          
>>        
>>        
>>          
>>  (snip)
>> 
>>  While STONITH is not completed, the information of the node of cib is 
> deleted and seems to be caused by the fact that cib does not have the 
> resource 
> information of the node.
>> 
>>  The cause of the problem was that the communication of the cluster became 
> unstable.
>>  However, an action of this cluster is a problem.
>> 
>>  This problem is not taking place in Pacemaker1.1.13 for the moment.
>>  However, I think that it is the same processing as far as I see a source 
> code.
>> 
>>  Does the deletion of the node information not have to perform it after all 
> new node information gathered?
>> 
>>   * crmd/callback.c
>>  (snip)
>>  void
>>  peer_update_callback(enum crm_status_type type, crm_node_t * node, const 
> void *data)
>>  {
>>  (snip)
>>       if (down) {
>>              const char *task = crm_element_value(down->xml, 
> XML_LRM_ATTR_TASK);
>> 
>>              if (alive && safe_str_eq(task, CRM_OP_FENCE)) {
>>                  crm_info("Node return implies stonith of %s (action 
> %d) completed", node->uname,
>>                           down->id);
> 
> The above is the reason for the behavior you're seeing.
> 
> A fenced node can come back up and rejoin the cluster before the fence
> command reports completion. When Pacemaker sees the rejoin, it assumes
> the fence command completed.
> 
> However in this case, the lost node rejoined on its own while fencing
> was still in progress, so that was an incorrect assumption.
> 
> A proper fix will take some investigation. As a workaround in the
> meantime, you could try increasing the corosync token timeout, so the
> node is not declared lost for brief outages.
> 
>>                  st_fail_count_reset(node->uname);
>> 
>>                  erase_status_tag(node->uname, XML_CIB_TAG_LRM, 
> cib_scope_local);
>>                  erase_status_tag(node->uname, 
> XML_TAG_TRANSIENT_NODEATTRS, cib_scope_local);
>>                  /* down->confirmed = TRUE; Only stonith-ng returning 
> should imply completion */
>>                  down->sent_update = TRUE;       /* Prevent 
> tengine_stonith_callback() from calling send_stonith_update() */
>> 
>>  (snip)
>> 
>> 
>>   * There is the log, but cannot attach it because the information of the 
> user is included.
>>   * Please contact me by an email if you need it.
>> 
>> 
>>  These contents are registered with Bugzilla.
>>   * http://bugs.clusterlabs.org/show_bug.cgi?id=5254
>> 
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
>> 

Re: [ClusterLabs] Antw: Re: [Question] Question about mysql RA.

2015-11-05 Thread renayama19661014
Hi Ken,
Hi Ulrich,

Thank you for your comments.

Judging from Ken's and Ulrich's opinions, the mysql RA seems to have had this 
problem from the beginning.

I will wait a little longer for opinions from other people and then make a 
patch; a rough sketch of the direction is below.
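
A hypothetical sketch of the guard Ken and Ulrich describe (not the final 
patch I will send):

    # Hypothetical sketch only: inside mysql_common_stop(), treat an already
    # dead mysqld as a successful stop instead of returning $OCF_ERR_GENERIC.
    pid=`cat $OCF_RESKEY_pid 2> /dev/null`
    if [ -z "$pid" ] || ! kill -0 $pid > /dev/null 2>&1; then
        ocf_log info "MySQL is not running: removing stale PID file"
        rm -f $OCF_RESKEY_pid
        return $OCF_SUCCESS
    fi

    /bin/kill $pid > /dev/null
    if [ $? != 0 ]; then
        ocf_exit_reason "MySQL couldn't be stopped"
        return $OCF_ERR_GENERIC
    fi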

Best Regards,
Hideo Yamauchi.


- Original Message -
> From: Ulrich Windl 
> To: users@clusterlabs.org; kgail...@redhat.com
> Cc: 
> Date: 2015/11/5, Thu 16:11
> Subject: [ClusterLabs] Antw: Re:  [Question] Question about mysql RA.
> 
  Ken Gaillot  schrieb am 04.11.2015 um 
> 16:44 in Nachricht
> <563a27c2.5090...@redhat.com>:
>>  On 11/04/2015 04:36 AM, renayama19661...@ybb.ne.jp wrote:
> [...]
>>>      pid=`cat $OCF_RESKEY_pid 2> /dev/null `
>>>      /bin/kill $pid > /dev/null
>> 
>>  I think before this line, the RA should do a "kill -0" to check 
> whether
>>  the PID is running, and return $OCF_SUCCESS if not. That way, we can
>>  still return an error if the real kill fails.
> 
> And remove the stale PID file if there is no such pid. For very busy systems 
> one 
> could use ps for that PID to see whether the PID belongs to the expected 
> process. There is a small chance that a PID exists, but does not belong to 
> the 
> expected process...
> 
>> 
>>>      rc=$?
>>>      if [ $rc != 0 ]; then
>>>          ocf_exit_reason "MySQL couldn't be stopped"
>>>          return $OCF_ERR_GENERIC
>>>      fi
>>>  (snip)
>>>  -
>>> 
>>>  The mysql RA does such a code from old days.
>>>   * http://hg.linux-ha.org/agents/file/67234f982ab7/heartbeat/mysql 
>>> 
>>>  Does mysql RA know the reason becoming this made?
>>>  Possibly is it a factor to be conscious of mysql cluster?
>>> 
>>>  I think about a patch of this movement of mysql RA.
>>>  I want to know the detailed reason.
>>> 
>>>  Best Regards,
>>>  Hideo Yamauchi.
>> 
>> 
>>  ___
>>  Users mailing list: Users@clusterlabs.org 
>>  http://clusterlabs.org/mailman/listinfo/users 
>> 
>>  Project Home: http://www.clusterlabs.org 
>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>>  Bugs: http://bugs.clusterlabs.org 
> 
> 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Patch][glue][external/libvirt] Conversion to a lower case of hostlist.

2015-10-30 Thread renayama19661014
Hi Dejan,

Thank you for a reply.

> It somehow slipped.

> 
> I suppose that you tested the patch well and nobody objected so
> far, so lets apply it.
> 
> Many thanks! And sorry about the delay.


I confirmed that the patch was merged:
 * http://hg.linux-ha.org/glue/rev/56f40ec5d37e

Many Thanks!
Hideo Yamauchi.



- Original Message -
> From: Dejan Muhamedagic 
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Cc: 
> Date: 2015/10/30, Fri 16:58
> Subject: Re: [ClusterLabs] [Patch][glue][external/libvirt] Conversion to a 
> lower case of hostlist.
> 
> Hi Hideo-san,
> 
> On Fri, Oct 30, 2015 at 11:41:26AM +0900, renayama19661...@ybb.ne.jp wrote:
>>  Hi Dejan,
>>  Hi All,
>> 
>>  How about the patch which I contributed by a former email?
>>  I would like an opinion.
> 
> It somehow slipped.
> 
> I suppose that you tested the patch well and nobody objected so
> far, so lets apply it.
> 
> Many thanks! And sorry about the delay.
> 
> Cheers,
> 
> Dejan
> 
> 
> 
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
>> 
>>  - Original Message -
>>  > From: "renayama19661...@ybb.ne.jp" 
> 
>>  > To: Cluster Labs - All topics related to open-source clustering 
> welcomed 
>>  > Cc: 
>>  > Date: 2015/10/14, Wed 09:38
>>  > Subject: Re: [ClusterLabs] [Patch][glue][external/libvirt] Conversion 
> to a lower case of hostlist.
>>  > 
>>  > Hi Dejan,
>>  > Hi All,
>>  > 
>>  > We reconsidered a patch.
>>  > 
>>  > 
>>  > 
>>  > In Pacemaker1.1, node names in STONITH are always small letters.
>>  > When a user uses a capital letter in host name, STONITH of libvirt 
> fails.
>>  > 
>>  > This patch lets STONITH by libvirt succeed in the next setting.
>>  > 
>>  >  * host name(upper case) and hostlist(upper case) and domain_id on 
>>  > libvirt(uppper case)
>>  >  * host name(upper case) and hostlist(lower case) and domain_id on 
> libvirt(lower 
>>  > case)
>>  >  * host name(lower case) and hostlist(upper case) and domain_id on 
>>  > libvirt(uppper case)
>>  >  * host name(lower case) and hostlist(lower case) and domain_id on 
> libvirt(lower 
>>  > case)
>>  > 
>>  > 
>>  > However, in the case of the next setting, STONITH of libvirt causes an 
> error.
>>  > In this case it is necessary for the user to make the size of the 
> letter of the 
>>  > host name to manage of libvirt the same as host name to appoint in 
> hostlist.
>>  > 
>>  >  * host name(upper case) and hostlist(lower case) and domain_id on 
>>  > libvirt(uppper case)
>>  >  * host name(upper case) and hostlist(uppper case) and domain_id on 
>>  > libvirt(lower case)
>>  >  * host name(lower case) and hostlist(lower case) and domain_id on 
>>  > libvirt(uppper case)
>>  >  * host name(lower case) and hostlist(uppper case) and domain_id on 
>>  > libvirt(lower case)
>>  > 
>>  > 
>>  > This patch is effective for letting STONITH by libvirt when host name 
> was set 
>>  > for a capital letter succeed.
>>  > 
>>  > Best Regards,
>>  > Hideo Yamauchi.
>>  > 
>>  > 
>>  > 
>>  > 
>>  > - Original Message -
>>  >>  From: "renayama19661...@ybb.ne.jp" 
>>  > 
>>  >>  To: Cluster Labs - All topics related to open-source clustering 
> welcomed 
>>  > 
>>  >>  Cc: 
>>  >>  Date: 2015/9/15, Tue 03:28
>>  >>  Subject: Re: [ClusterLabs] [Patch][glue][external/libvirt] 
> Conversion to a 
>>  > lower case of hostlist.
>>  >> 
>>  >>  Hi Dejan,
>>  >> 
>>  >>>   I suppose that you'll send another one? I can vaguelly 
> recall
>>  >>>   a problem with non-lower case node names, but not the 
> specifics.
>>  >>>   Is that supposed to be handled within a stonith agent?
>>  >> 
>>  >> 
>>  >>  Yes.
>>  >>  We make a different patch now.
>>  >>  With the patch, I solve a node name of the non-small letter in 
> the range of 
>>  > 
>>  >>  stonith agent.
>>  >>  # But the patch cannot cover all all patterns.
>>  >> 
>>  >>  Please wait a little longer.
>>  >>  I send a patch again.
>>  >>  For a new patch, please tell me your opinion.
>>  >> 
>>  >>  Best Regards,
>>  >>  Hideo Yamauchi.
>>  >> 
>>  >> 
>>  >> 
>>  >>  - Original Message -
>>  >>>   From: Dejan Muhamedagic 
>>  >>>   To: ClusterLabs-ML 
>>  >>>   Cc: 
>>  >>>   Date: 2015/9/14, Mon 22:20
>>  >>>   Subject: Re: [ClusterLabs] [Patch][glue][external/libvirt] 
> Conversion 
>>  > to a 
>>  >>  lower case of hostlist.
>>  >>> 
>>  >>>   Hi Hideo-san,
>>  >>> 
>>  >>>   On Tue, Sep 08, 2015 at 05:28:05PM +0900, 
> renayama19661...@ybb.ne.jp 
>>  > wrote:
>>      Hi All,
>>   
>>      We intend to change some patches.
>>      We withdraw this patch.
>>  >>> 
>>  >>>   I suppose that you'll send another one? I can vaguelly 
> recall
>>  >>>   a problem with non-lower case node names, but not the 
> specifics.
>>  >>>   Is that supposed to be handled within a 

[ClusterLabs] [Question] Question about mysql RA.

2015-11-04 Thread renayama19661014
Hi All,

I have contributed patches for mysql several times before.

I did not pay much attention to it before, but the mysql RA behaves as 
follows.

Step 1) Build a Pacemaker cluster that manages mysql.
Step 2) Kill the mysql process with SIGKILL.
Step 3) Stop Pacemaker (and therefore mysql) before a monitor failure is 
detected.

As a result, the mysql RA fails the stop operation.
Because of this failure, Pacemaker does not stop until shutdown escalation.


The cause is that the mysql RA handles this case differently from the pgsql RA.
When the process for the recorded PID no longer exists and the stop was not 
preceded by a monitor failure, the mysql RA returns $OCF_ERR_GENERIC.
In the same situation the pgsql RA treats the stop as successful.

-
* mysql
(snip)
mysql_monitor() {
    local rc
    local status_loglevel="err"

    # Set loglevel to info during probe
    if ocf_is_probe; then
        status_loglevel="info"
    fi
 
    mysql_common_status $status_loglevel

    rc=$?

    # TODO: check max connections error

    # If status returned an error, return that immediately
    if [ $rc -ne $OCF_SUCCESS ]; then
        return $rc
    fi
(snip)
mysql_stop() {
    if ocf_is_ms; then
        # clear preference for becoming master
        $CRM_MASTER -D

        # Remove VIP capability
        set_reader_attr 0
    fi

    mysql_common_stop
}
(snip)

* mysql-common.sh
(snip)
mysql_common_status() {
    local loglevel=$1
    local pid=$2
    if [ -z "$pid" ]; then
        if [ ! -e $OCF_RESKEY_pid ]; then
            ocf_log $loglevel "MySQL is not running"
            return $OCF_NOT_RUNNING;
        fi

        pid=`cat $OCF_RESKEY_pid`;
    fi
    if [ -d /proc -a -d /proc/1 ]; then
        [ "u$pid" != "u" -a -d /proc/$pid ]
    else
        kill -s 0 $pid >/dev/null 2>&1
    fi

    if [ $? -eq 0 ]; then
        return $OCF_SUCCESS;
    else
        ocf_log $loglevel "MySQL not running: removing old PID file"
        rm -f $OCF_RESKEY_pid
        return $OCF_NOT_RUNNING;
    fi
}
(snip)
mysql_common_stop()
{
    local pid
    local rc

    if [ ! -f $OCF_RESKEY_pid ]; then
        ocf_log info "MySQL is not running"
        return $OCF_SUCCESS
    fi

    pid=`cat $OCF_RESKEY_pid 2> /dev/null `
    /bin/kill $pid > /dev/null
    rc=$?
    if [ $rc != 0 ]; then
        ocf_exit_reason "MySQL couldn't be stopped"
        return $OCF_ERR_GENERIC
    fi
(snip)
-

The mysql RA has contained this code since the old days:
 * http://hg.linux-ha.org/agents/file/67234f982ab7/heartbeat/mysql

Does anyone know why the mysql RA was written this way?
Could it perhaps be a consideration for MySQL Cluster?

I am thinking about a patch for this behavior of the mysql RA.
I would like to know the detailed reason first.

Best Regards,
Hideo Yamauchi.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [Patch][glue][external/libvirt] Conversion to a lower case of hostlist.

2015-09-06 Thread renayama19661014
Hi All,

When the cluster carries out STONITH, Pacemaker handles the node name in lower 
case.
When a user configures the OS host name and the host names in the hostlist of 
external/libvirt in capital letters, STONITH is therefore not carried out.

The external/libvrit to convert host name of hostlist, and to compare can 
assist a setting error of the user.
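
A hypothetical sketch (variable names follow my reading of the external 
stonith plugin convention, where the action is $1, the target host is $2 and 
the configured list is available as $hostlist; please adjust as needed):

  # Lower-case both the requested host and each hostlist entry so the
  # comparison is case-insensitive.
  target=`echo "$2" | tr 'A-Z' 'a-z'`
  found=0
  for h in $hostlist; do
      if [ "`echo $h | tr 'A-Z' 'a-z'`" = "$target" ]; then
          found=1
          break
      fi
  done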

Best Regards,
Hideo Yamauchi.


libvirt.patch
Description: Binary data
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Patch][glue][external/libvirt] Conversion to a lower case of hostlist.

2015-09-08 Thread renayama19661014
Hi All,

We intend to change some patches.
We withdraw this patch.

Best Regards,
Hideo Yamauchi.


- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: ClusterLabs-ML 
> Cc: 
> Date: 2015/9/7, Mon 09:06
> Subject: [ClusterLabs] [Patch][glue][external/libvirt] Conversion to a lower 
> case of hostlist.
> 
> Hi All,
> 
> When a cluster carries out stonith, Pacemaker handles host name by a small 
> letter.
> When a user sets the host name of the OS and host name of hostlist of 
> external/libvrit in capital letters and waits, stonith is not carried out.
> 
> The external/libvrit to convert host name of hostlist, and to compare can 
> assist 
> a setting error of the user.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Pacemaker1.0.13] [hbagent] The hbagent does not stop.

2015-09-08 Thread renayama19661014
Hi Yan,

Thank you for your comment.

> Sounds weird. I've never encountered the issue before. Actually I
> haven't run it with heartbeat for years ;-)  We'd probably have to find
> the pattern and produce it.



We have only just begun the investigation.

If there is anything you think could be the cause of the problem, please let
me know.

Best Regards,
Hideo Yamauchi.


- Original Message -
> From: "Gao,Yan" 
> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
> open-source clustering welcomed 
> Cc: 
> Date: 2015/9/8, Tue 23:14
> Subject: Re: [ClusterLabs] [Pacemaker1.0.13] [hbagent] The hbagent does not 
> stop.
> 
> Hi Hideo,
> 
> On 09/08/2015 04:28 AM, renayama19661...@ybb.ne.jp wrote:
>>  Hi All,
>> 
>>  A problem produced us in Pacemaker1.0.13.
>> 
>>   * RHEL6.4(kernel-2.6.32-358.23.2.el6.x86_64)
>>    * SNMP:
>>     * net-snmp-libs-5.5-49.el6_5.1.x86_64
>>     * hp-snmp-agents-9.50-2564.40.rhel6.x86_64
>>     * net-snmp-utils-5.5-49.el6_5.1.x86_64
>>     * net-snmp-5.5-49.el6_5.1.x86_64
>>   * Pacemaker 1.0.13
>>   * pacemaker-mgmt-2.0.1
>> 
>>  We started hbagnet in respawn in this environment, but hbagent did not stop 
> when we stopped Heartbeat.
>>  SIGTERM seemed to be transmitted by Heartbeat even if we saw log, but there 
> was not the trace that hbagent received SIGTERM.
>> 
>>  We try the reproduction of the problem, but the problem never reappears for 
> the moment.
>> 
>>  We suppose that pacemaker-mgmt(hbagent) or snmp has a problem.
>> 
>>  Know similar problem?
>>  Know the cause of the problem?
> Sounds weird. I've never encountered the issue before. Actually I
> haven't run it with heartbeat for years ;-)  We'd probably have to find
> the pattern and produce it.
> 
> Regards,
>   Yan
> -- 
> Gao,Yan 
> Senior Software Engineer
> SUSE LINUX GmbH
> 

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Pacemaker1.0.13] [hbagent] The hbagent does not stop.

2015-09-17 Thread renayama19661014
Hi Yan,
Hi All,

The problem seems to occur somewhere inside run_alarms(), which is called from
hbagent.

I confirmed that hbagent did receive SIGTERM.

There seems to be a problem with a connect() call made from run_alarms().

We will continue investigating, together with a member who specializes in this
area.
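
For reference, we are tracing it roughly as follows (the trace file path and
the exact invocation are only an example from our environment):

  pid=`pidof hbagent`
  strace -tt -p $pid -o /tmp/hbagent.trace &
  # stop Heartbeat on this node, then check whether SIGTERM was delivered
  # and whether a connect() call blocks inside run_alarms
  grep -e SIGTERM -e "connect(" /tmp/hbagent.trace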

Best Regards,
Hideo Yamauchi.



- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: "Gao,Yan" ; Cluster Labs - All topics related to 
> open-source clustering welcomed 
> Cc: 
> Date: 2015/9/9, Wed 05:19
> Subject: Re: [ClusterLabs] [Pacemaker1.0.13] [hbagent] The hbagent does not 
> stop.
> 
> Hi Yan,
> 
> Thank you for comment.
> 
>>  Sounds weird. I've never encountered the issue before. Actually I
>>  haven't run it with heartbeat for years ;-)  We'd probably have to 
> find
>>  the pattern and produce it.
> 
> 
> 
> We still just began an investigation.
> 
> If there is the point that you think to be the cause of the problem, please 
> tell 
> me.
> 
> Best Reards,
> Hideo Yamauchi.
> 
> 
> - Original Message -
>>  From: "Gao,Yan" 
>>  To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
> open-source clustering welcomed 
>>  Cc: 
>>  Date: 2015/9/8, Tue 23:14
>>  Subject: Re: [ClusterLabs] [Pacemaker1.0.13] [hbagent] The hbagent does not 
> stop.
>> 
>>  Hi Hideo,
>> 
>>  On 09/08/2015 04:28 AM, renayama19661...@ybb.ne.jp wrote:
>>>   Hi All,
>>> 
>>>   A problem produced us in Pacemaker1.0.13.
>>> 
>>>    * RHEL6.4(kernel-2.6.32-358.23.2.el6.x86_64)
>>>     * SNMP:
>>>      * net-snmp-libs-5.5-49.el6_5.1.x86_64
>>>      * hp-snmp-agents-9.50-2564.40.rhel6.x86_64
>>>      * net-snmp-utils-5.5-49.el6_5.1.x86_64
>>>      * net-snmp-5.5-49.el6_5.1.x86_64
>>>    * Pacemaker 1.0.13
>>>    * pacemaker-mgmt-2.0.1
>>> 
>>>   We started hbagnet in respawn in this environment, but hbagent did not 
> stop 
>>  when we stopped Heartbeat.
>>>   SIGTERM seemed to be transmitted by Heartbeat even if we saw log, but 
> there 
>>  was not the trace that hbagent received SIGTERM.
>>> 
>>>   We try the reproduction of the problem, but the problem never 
> reappears for 
>>  the moment.
>>> 
>>>   We suppose that pacemaker-mgmt(hbagent) or snmp has a problem.
>>> 
>>>   Know similar problem?
>>>   Know the cause of the problem?
>>  Sounds weird. I've never encountered the issue before. Actually I
>>  haven't run it with heartbeat for years ;-)  We'd probably have to 
> find
>>  the pattern and produce it.
>> 
>>  Regards,
>>    Yan
>>  -- 
>>  Gao,Yan 
>>  Senior Software Engineer
>>  SUSE LINUX GmbH
>> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [Problem] An SNMP trap is not transmitted.

2015-12-03 Thread renayama19661014
Hi All,

I tried the new SNMP notification function planned for release in
Pacemaker 1.1.14 (pacemaker-87bc29e4b821fd2a98c978d5300e43eef41c2367).

However, with the following procedure, some SNMP traps are not transmitted.

Step 1) Start node A.
Step 2) Load the CLI file. (The traps are transmitted correctly at this point.)

Dec  3 14:25:25 SNMP-MANAGER snmptrapd[2963]: 2015-12-03 14:25:25 rh72-01 [UDP: 
[192.168.40.20]:43092->[192.168.40.2]]:#012DISMAN-EVENT-MIB::sysUpTimeInstance 
= Timeticks: (167773) 0:27:57.73#011SNMPv2-MIB::snmpTrapOID.0 = OID: 
PACEMAKER-MIB::pacemakerNotificationTrap#011PACEMAKER-MIB::pacemakerNotificationTrap
 = STRING: "rh72-01"#011PACEMAKER-MIB::pacemakerNotificationResource = STRING: 
"prmDummy"#011PACEMAKER-MIB::pacemakerNotificationOperation = STRING: 
"start"#011PACEMAKER-MIB::pacemakerNotificationDescription = STRING: 
"ok"#011PACEMAKER-MIB::pacemakerNotificationStatus = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationReturnCode = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationTargetReturnCode = INTEGER: 0
Dec  3 14:25:25 SNMP-MANAGER snmptrapd[2963]: 2015-12-03 14:25:25 rh72-01 [UDP: 
[192.168.40.20]:53635->[192.168.40.2]]:#012DISMAN-EVENT-MIB::sysUpTimeInstance 
= Timeticks: (167773) 0:27:57.73#011SNMPv2-MIB::snmpTrapOID.0 = OID: 
PACEMAKER-MIB::pacemakerNotificationTrap#011PACEMAKER-MIB::pacemakerNotificationTrap
 = STRING: "rh72-01"#011PACEMAKER-MIB::pacemakerNotificationResource = STRING: 
"prmDummy"#011PACEMAKER-MIB::pacemakerNotificationOperation = STRING: 
"monitor"#011PACEMAKER-MIB::pacemakerNotificationDescription = STRING: 
"ok"#011PACEMAKER-MIB::pacemakerNotificationStatus = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationReturnCode = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationTargetReturnCode = INTEGER: 0
Dec  3 14:25:26 SNMP-MANAGER snmptrapd[2963]: 2015-12-03 14:25:26 rh72-01 [UDP: 
[192.168.40.20]:50867->[192.168.40.2]]:#012DISMAN-EVENT-MIB::sysUpTimeInstance 
= Timeticks: (167864) 0:27:58.64#011SNMPv2-MIB::snmpTrapOID.0 = OID: 
PACEMAKER-MIB::pacemakerNotificationTrap#011PACEMAKER-MIB::pacemakerNotificationTrap
 = STRING: "rh72-01"#011PACEMAKER-MIB::pacemakerNotificationResource = STRING: 
"prmStonith2-1"#011PACEMAKER-MIB::pacemakerNotificationOperation = STRING: 
"start"#011PACEMAKER-MIB::pacemakerNotificationDescription = STRING: 
"ok"#011PACEMAKER-MIB::pacemakerNotificationStatus = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationReturnCode = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationTargetReturnCode = INTEGER: 0

Step 3) Start node B.
* However, the trap of the start of prmStonith1-1 is not transmitted.

[root@rh72-01 ~]# crm_mon -1 -Af
Last updated: Thu Dec  3 14:25:46 2015  Last change: Thu Dec  3 
14:25:24 2015 by root via cibadmin on rh72-01
Stack: corosync
Current DC: rh72-01 (version 1.1.13-a7d6e6b) - partition with quorum
2 nodes and 3 resources configured

Online: [ rh72-01 rh72-02 ]

prmDummy   (ocf::pacemaker:Dummy): Started rh72-01
prmStonith1-1  (stonith:external/ssh): Started rh72-02
prmStonith2-1  (stonith:external/ssh): Started rh72-01



Step 4) Stop node B.
* However, the trap of the stop of prmStonith1-1 is not transmitted.

[root@rh72-01 ~]# crm_mon -1 -Af
Last updated: Thu Dec  3 14:28:24 2015  Last change: Thu Dec  3 
14:25:24 2015 by root via cibadmin on rh72-01
Stack: corosync
Current DC: rh72-01 (version 1.1.13-a7d6e6b) - partition with quorum
2 nodes and 3 resources configured

Node rh72-02: pending
Online: [ rh72-01 ]

prmDummy   (ocf::pacemaker:Dummy): Started rh72-01
prmStonith2-1  (stonith:external/ssh): Started rh72-01


The problem seems to be that the decision to enable NOTIFY handling is somehow
not made after the node that joined later has received the CIB.
* After receiving the CIB, the joining node does not execute the
crmd_enable_notifications() processing.

I registered this problem with Bugzilla.
* http://bugs.clusterlabs.org/show_bug.cgi?id=5261

Best Regards,
Hideo Yamauchi.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Problem] Start is carried out twice.

2016-06-06 Thread renayama19661014
Hi All, 

When a node joins while a start of a resource is taking a long time, the start
of that resource is carried out twice.

Step 1) Put a sleep in the start action of the Dummy
resource (/usr/lib/ocf/resource.d/heartbeat/Dummy).
 (snip)
dummy_start() {
 sleep 60
 dummy_monitor
 if [ $? =  $OCF_SUCCESS ]; then
(snip)

Step 2) Start one node and load the crm file below.

### Cluster Option ###
property no-quorum-policy="ignore" \
 stonith-enabled="false" \
 crmd-transition-delay="2s"

### Resource Defaults ###
rsc_defaults resource-stickiness="INFINITY" \ 
migration-threshold="1" 

### Group Configuration ###
group grpDummy \
 prmDummy1 \
 prmDummy2 

### Primitive Configuration ###
primitive prmDummy1 ocf:heartbeat:Dummy \
 op start interval="0s" timeout="120s" on-fail="restart" \
 op monitor interval="10s" timeout="60s" on-fail="restart" \
 op stop interval="0s" timeout="60s" on-fail="block" 
primitive prmDummy2 ocf:heartbeat:Dummy \
 op start interval="0s" timeout="120s" on-fail="restart"\
 op monitor interval="10s" timeout="60s" on-fail="restart" \
 op stop interval="0s" timeout="60s" on-fail="block" 

### Resource Location ###
location rsc_location-grpDummy-1 grpDummy \
 rule 200: #uname eq vm1 \
 rule 100: #uname eq vm2
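
For completeness, we load this file on the first node with crmsh (the file
name is only an example):

  crm configure load update test-start.crm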

 
Step 3) While the start of prmDummy1 is being carried out, start the second node.
The start of prmDummy1 is then carried out twice.

 [root@vm1 ~]# grep Initiating /var/log/ha-log
Jun  6 23:55:15 rh72-01 crmd[2921]:  notice: Initiating start operation 
prmDummy1_start_0 locally on vm1
Jun  6 23:56:17 rh72-01 crmd[2921]:  notice: Initiating start operation 
prmDummy1_start_0 locally on vm1 

It is not desirable for start to be carried out twice while the result of the
first start is still unknown.
The cause seems to be that, when a node joins, the information about the
in-flight start operation that has not yet completed is discarded.


I registered these contents with Bugzilla.
  * http://bugs.clusterlabs.org/show_bug.cgi?id=5286

I attach the file which I collected in crm_report to Bugzilla.


Best Regards,
Hideo Yamauchi.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [corosync][Problem] Very long "pause detect ... " was detected.

2016-06-13 Thread renayama19661014
Hi Honza,

Thank you for your comment.


>>  Our user constituted a cluster in corosync and Pacemaker in the next 
> environment.
>>  The cluster constituted it among guests.
>> 
>>  * Host/Guest : RHEL6.6 - kernel : 2.6.32-504.el6.x86_64
>>  * libqb 0.17.1
>>  * corosync 2.3.4
>>  * Pacemaker 1.1.12
>> 
>>  The cluster worked well.
>>  When a user stopped an active guest, the next log was output in standby 
> guests repeatedly.
> 
> What exactly you mean by "active guest" and "standby 
> guests"?

The cluster is an active/standby configuration.

The standby guest stays in standby until a resource fails on the active guest.

The problem seemed to occur when a resource failed over to the standby guest.


> 
>> 
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause 
> detected for 5515870 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause 
> detected for 5515920 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause 
> detected for 5515971 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause 
> detected for 5516021 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause 
> detected for 5516071 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause 
> detected for 5516121 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause 
> detected for 5516171 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause 
> detected for 5516221 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause 
> detected for 5516271 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause 
> detected for 5516322 ms, flushing membership messages.
>>  May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause 
> detected for 5516372 ms, flushing membership messages.
>>  (snip)
>>  May xx xx:26:03 standby-guest corosync[6311]:  [TOTEM ] Process pause 
> detected for 5526172 ms, flushing membership messages.
>>  May xx xx:26:03 standby-guest corosync[6311]:  [MAIN  ] Totem is unable to 
> form a cluster because of an operating system or network fault. The most 
> common 
> cause of this message is that the local firewall is configured improperly.
>>  May xx xx:26:03 standby-guest corosync[6311]:  [TOTEM ] Process pause 
> detected for 5526222 ms, flushing membership messages.
>>  (snip)
>> 
> 
> This is weird. Not because of enormous pause length but because corosync 
> has a "scheduler pause" detector which warns before "Process 
> pause 
> detected ..." error is logged.

I thought so, too.
However, the "scheduler pause" warning does not seem to have been logged.
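
For reference, we checked for it roughly like this (the log file is simply
where our syslog is written):

  grep -e "Corosync main process was not scheduled" \
       -e "Process pause detected" /var/log/messages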

> 
>>  As a result, the standby guest failed in the construction of the 
> independent cluster.
>> 
>>  It is recorded in log as if a timer stopped for 91 minutes.
>>  It is abnormal length for 91 minutes.
>> 
>>  Did you see a similar problem?
> 
> Never

Okay!


> 
>> 
>>  Possibly I think whether it is libqb or Kernel or some kind of problems.
> 
> What virtualization technology are you using? KVM?
> 
>>  * I suspect that the set of the timer failed in reset_pause_timeout().
> 
> You can try to put asserts into this function, but there is really not 
> too much reasons why it should fail (ether malloc returns NULL or some 
> nasty memory corruption).


I have read the source code as well.
It is just as you say.

I do not know whether the problem will reproduce, but I will build the same
configuration on RHEL6.6 and put it under load this week.

If you notice anything, please send me an email.

Best Regards,
Hideo Yamauchi.


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts

2016-04-28 Thread renayama19661014
Hi Klaus,

Because the alert scripts are executed asynchronously, I think it is difficult
to set "uptime" correctly with the method in your sample.
In the end we may still have to ask for ordered transmission.
# The patch I sent earlier only controls the execution order of the
# asynchronous calls; it does not add load to crmd.

Japan starts a one-week holiday tomorrow.
I will discuss this with our team after the vacation.

Best Regards,
Hideo Yamauchi.



- Original Message -
> From: Klaus Wenninger 
> To: users@clusterlabs.org
> Cc: 
> Date: 2016/4/28, Thu 03:14
> Subject: Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts
> 
> On 04/27/2016 04:19 PM, renayama19661...@ybb.ne.jp wrote:
>>  Hi All,
>> 
>>  We have a request for a new SNMP function.
>> 
>> 
>>  The order of traps is not right.
>> 
>>  The turn of the trap is not sometimes followed.
>>  This is because the handling of notice carries out "path" in 
> async.
>>  I think that it is necessary to wait for completion of the practice at 
> "path" unit of "alerts".
>>   
>>  The turn of the trap is different from the real stop order of the resource.
> Writing the alerts in a local list and having the alert-scripts called
> in a serialized manner
> would lead to the snmptrap-tool creating timestamps in the order of the
> occurrence 
> of the alerts.
> Having the snmp-manager order the traps by timestamp this would indeed
> lead to
> seeing them in the order they had occured.
> 
> But this approach has a number of drawbacks:
> 
> - it works just when the traps are coming from one node as there is no
> way to serialize
>   over nodes - at least none that would work under all circumstances we
> want alerts
>   to be delivered
> 
> - it distorts the timestamps created even more from the points in time
> when the
>   alert had been triggered - making the result in a multi-node-scenario
> even worse and
>   making it hard to correlate with other sources of information like
> logfiles
> 
> - if you imagine a scenario with multiple mechanisms of delivering an
> alert + multiple
>   recipients we couldn't use a single list but we would need something more
>   complicated to prevent unneeded delays, delays coming from one of the
> delivery
>   methods not working properly due to e.g. a recipient that is not
> reachable, ...
>   (all solvable of course but if it doesn't solve your problem in the
> first place why the effort)
> 
> The alternative approach taken doesn't create the timestamps in the
> scripts but
> provides timestamps to the scripts already.
> This way it doesn't matter if the execution of the script is delayed.
> 
> 
> A short example how this approach could be used with snmp-traps:
> 
> edit pcmk_snmp_helper.sh:
> 
> ...
> starttickfile="/var/run/starttick"
> 
> # hack to have a reference
> # can have it e.g. in an attribute to be visible throughout the cluster
> if [ ! -f ${starttickfile} ] ; then
>         echo ${CRM_alert_timestamp} > ${starttickfile}
> fi
> 
> starttick=`cat ${starttickfile}`
> ticks=`eval ${CRM_alert_timestamp} - ${starttick}`
> 
> if [[ ${CRM_alert_rc} != 0 && ${CRM_alert_task} == "monitor" 
> ]] || [[
> ${CRM_alert_task} != "monitor" ]] ; then
>     # This trap is compliant with PACEMAKER MIB
>     # 
> https://github.com/ClusterLabs/pacemaker/blob/master/extra/PCMK-MIB.txt
>     /usr/bin/snmptrap -v 2c -c public ${CRM_alert_recipient} ${ticks}
> PACEMAKER-MIB::pacemakerNotificationTrap \
>         PACEMAKER-MIB::pacemakerNotificationNode s "${CRM_alert_node}" 
> \
>         PACEMAKER-MIB::pacemakerNotificationResource s 
> "${CRM_alert_rsc}" \
>         PACEMAKER-MIB::pacemakerNotificationOperation s
> "${CRM_alert_task}" \
>         PACEMAKER-MIB::pacemakerNotificationDescription s
> "${CRM_alert_desc}" \
>         PACEMAKER-MIB::pacemakerNotificationStatus i 
> "${CRM_alert_status}" \
>         PACEMAKER-MIB::pacemakerNotificationReturnCode i ${CRM_alert_rc} \
>         PACEMAKER-MIB::pacemakerNotificationTargetReturnCode i
> ${CRM_alert_target_rc} && exit 0 || exit 1
> fi
> 
> exit 0
> ...
> 
> add a section to the cib:
> 
> cibadmin --create --xml-text '   id="snmp_traps" 
> path="/usr/share/pacemaker/tests/pcmk_snmp_helper.sh">
>   id="snmp_timestamp"
> name="tstamp_format" value="%s%02N"/> 
>   id="trap_destination" value="192.168.123.3"/> 
>  
> '
> 
> 
> This should solve the issue of correct order after being sorted by
> timestamps
> without having the ugly side-effects as described above.
> 
> I hope I understood your scenario correctly and this small example
> points out how I roughly would suggest to cope with the issue.
> 
> Regards,
> Klaus  
>> 
>>  
>>  [root@rh72-01 ~]# grep Operation  /var/log/ha-log | grep stop
>>  Apr 25 18:48:48 rh72-01 crmd[28897]:  notice: Operation prmDummy1_stop_0: 
> ok (node=rh72-01, call=33, rc=0, cib-update=56, confirmed=true)
>>  Apr 25 18:48:48 rh72-01 crmd[28897]:  notice: Operation prmDummy3_stop_0: 
> ok (node=rh72-01, call=37, rc=0, cib-update=57, 

Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts

2016-05-10 Thread renayama19661014
Hi All,

After discussing it, our team still needs control over the transmission order
of the SNMP traps.

We will make a patch that controls the transmission order and intend to
submit it.

With the patch, we will probably add the "ordered" attribute that we proposed
in an earlier email, as illustrated below.
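
To illustrate what we have in mind (this is purely hypothetical: the "ordered"
attribute and the ids below do not exist in Pacemaker, and the path and
address are simply taken from the earlier example in this thread):

  <alert id="snmp_alert" path="/usr/share/pacemaker/tests/pcmk_snmp_helper.sh">
    <meta_attributes id="snmp_alert-meta">
      <nvpair id="snmp_alert-ordered" name="ordered" value="true"/>
    </meta_attributes>
    <recipient id="snmp_alert-recipient" value="192.168.123.3"/>
  </alert>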


Best Regards,
Hideo Yamauchi.


- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: "kwenn...@redhat.com" ; "users@clusterlabs.org" 
> ; Cluster Labs - All topics related to open-source 
> clustering welcomed 
> Cc: 
> Date: 2016/4/28, Thu 22:43
> Subject: Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts
> 
> Hi Klaus,
> 
> Because the script is performed the effectiveness of in async, I think that 
> it 
> is difficult to set "uptime" by the method of the sample.
> After all we may request the transmission of the order.
> #The patch before mine only controls a practice turn of the async and is not 
> a 
> thing giving load of crmd.
> 
> Japan begins a rest for one week from tomorrow.
> I discuss after vacation with a member.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> 
> - Original Message -
>>  From: Klaus Wenninger 
>>  To: users@clusterlabs.org
>>  Cc: 
>>  Date: 2016/4/28, Thu 03:14
>>  Subject: Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts
>> 
>>  On 04/27/2016 04:19 PM, renayama19661...@ybb.ne.jp wrote:
>>>   Hi All,
>>> 
>>>   We have a request for a new SNMP function.
>>> 
>>> 
>>>   The order of traps is not right.
>>> 
>>>   The turn of the trap is not sometimes followed.
>>>   This is because the handling of notice carries out "path" in 
> 
>>  async.
>>>   I think that it is necessary to wait for completion of the practice at 
> 
>>  "path" unit of "alerts".
>>>    
>>>   The turn of the trap is different from the real stop order of the 
> resource.
>>  Writing the alerts in a local list and having the alert-scripts called
>>  in a serialized manner
>>  would lead to the snmptrap-tool creating timestamps in the order of the
>>  occurrence 
>>  of the alerts.
>>  Having the snmp-manager order the traps by timestamp this would indeed
>>  lead to
>>  seeing them in the order they had occured.
>> 
>>  But this approach has a number of drawbacks:
>> 
>>  - it works just when the traps are coming from one node as there is no
>>  way to serialize
>>    over nodes - at least none that would work under all circumstances we
>>  want alerts
>>    to be delivered
>> 
>>  - it distorts the timestamps created even more from the points in time
>>  when the
>>    alert had been triggered - making the result in a multi-node-scenario
>>  even worse and
>>    making it hard to correlate with other sources of information like
>>  logfiles
>> 
>>  - if you imagine a scenario with multiple mechanisms of delivering an
>>  alert + multiple
>>    recipients we couldn't use a single list but we would need something 
> more
>>    complicated to prevent unneeded delays, delays coming from one of the
>>  delivery
>>    methods not working properly due to e.g. a recipient that is not
>>  reachable, ...
>>    (all solvable of course but if it doesn't solve your problem in the
>>  first place why the effort)
>> 
>>  The alternative approach taken doesn't create the timestamps in the
>>  scripts but
>>  provides timestamps to the scripts already.
>>  This way it doesn't matter if the execution of the script is delayed.
>> 
>> 
>>  A short example how this approach could be used with snmp-traps:
>> 
>>  edit pcmk_snmp_helper.sh:
>> 
>>  ...
>>  starttickfile="/var/run/starttick"
>> 
>>  # hack to have a reference
>>  # can have it e.g. in an attribute to be visible throughout the cluster
>>  if [ ! -f ${starttickfile} ] ; then
>>          echo ${CRM_alert_timestamp} > ${starttickfile}
>>  fi
>> 
>>  starttick=`cat ${starttickfile}`
>>  ticks=`eval ${CRM_alert_timestamp} - ${starttick}`
>> 
>>  if [[ ${CRM_alert_rc} != 0 && ${CRM_alert_task} == 
> "monitor" 
>>  ]] || [[
>>  ${CRM_alert_task} != "monitor" ]] ; then
>>      # This trap is compliant with PACEMAKER MIB
>>      # 
>>  https://github.com/ClusterLabs/pacemaker/blob/master/extra/PCMK-MIB.txt
>>      /usr/bin/snmptrap -v 2c -c public ${CRM_alert_recipient} ${ticks}
>>  PACEMAKER-MIB::pacemakerNotificationTrap \
>>          PACEMAKER-MIB::pacemakerNotificationNode s 
> "${CRM_alert_node}" 
>>  \
>>          PACEMAKER-MIB::pacemakerNotificationResource s 
>>  "${CRM_alert_rsc}" \
>>          PACEMAKER-MIB::pacemakerNotificationOperation s
>>  "${CRM_alert_task}" \
>>          PACEMAKER-MIB::pacemakerNotificationDescription s
>>  "${CRM_alert_desc}" \
>>          PACEMAKER-MIB::pacemakerNotificationStatus i 
>>  "${CRM_alert_status}" \
>>          PACEMAKER-MIB::pacemakerNotificationReturnCode i ${CRM_alert_rc} 
> \
>>          PACEMAKER-MIB::pacemakerNotificationTargetReturnCode i
>>  ${CRM_alert_target_rc} && exit 0 || exit 1
>>  fi

Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts

2016-05-13 Thread renayama19661014
Hi Klaus,

In the end we do want ordered transmission.
I think that using the meta_attribute you suggested will be enough.

(snip)







(snip)

I intend to write a fix that adds a queue through this meta_attribute; what do
you think?
Handling the case where the queue setting is changed is troublesome, but I
think I can write the patch somehow.

We intend to make the change in a way that keeps pcmk_snmp_helper.sh versatile.

We want to use this function in Pacemaker 1.1.15.

Best Regards,
Hideo Yamauch.



- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: "kwenn...@redhat.com" ; "users@clusterlabs.org" 
> ; Cluster Labs - All topics related to open-source 
> clustering welcomed 
> Cc: 
> Date: 2016/5/12, Thu 06:28
> Subject: Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts
> 
> Hi Klaus,
> 
> Thank you for comment.
> 
> I confirm your comment.
> I think that I ask you a question again.
> 
> 
> Many thanks!
> Hideo Yamauchi.
> 
> 
> - Original Message -
>>  From: Klaus Wenninger 
>>  To: users@clusterlabs.org
>>  Cc: 
>>  Date: 2016/5/11, Wed 14:13
>>  Subject: Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts
>> 
>>  On 05/10/2016 11:19 PM, renayama19661...@ybb.ne.jp wrote:
>>>   Hi All,
>>> 
>>>   After all our member needs the control of the turn of the transmission 
> of 
>>  the SNMP trap.
>>> 
>>>   We make a patch of the control of the turn of the transmission and 
> intend 
>>  to send it.
>>> 
>>>   Probably, with the patch, we add the "ordered" attribute 
> that we 
>>  sent by an email before.
>>  Actually I still don't think that simple serialization of the calling 
> of
>>  the snmptrap-tool
>>  is a good solution to tackle the problem of loosing the order of traps
>>  arriving at
>>  some management station:
>> 
>>  - makes things worse in case of traps coming from multiple nodes
>>  - doesn't help when the order is lost on the network.
>> 
>>  Anyway I see 2 other scenarios where a certain degree of serialization 
> might
>>  be helpful:
>> 
>>  - alert agent-scripts that can't handle being called concurrently
>>  - performance issues that might arise on some systems that lack the
>>    performance-headroom needed and/or the agent-scripts in place
>>    require significant effort and/or there are a lot of resources/events
>>    that trigger a vast amount of alerts being handled in parallel
>> 
>>  So I could imagine the introduction of a meta-atribute that specifies a
>>  queue
>>  to be used for serialization.
>> 
>>  - 'none' is default and leads to the behavior we have at the 
> moment.
>>  - any other queue-name leads to the instantiation of an additional queue
>> 
>>  This approach should allow merely any kind of serialization you can think 
> of
>>  with as little impact as needed.
>>  e.g. if the agent doesn't cope with concurrent calls you use a queue 
> per
>>  agent leading to all recipients being handled in a serialized way (and of
>>  course the different alerts as well). And all the other agents are running
>>  in parallel.
>>  e.g. you can have a separate queue for a single recipient leading to
>>  the alerts being sent there being serialized.
>>  e.g. if the performance impact should be kept at a minimal level you
>>  would use a single queue for all agents and all recipients 
>> 
>>> 
>>> 
>>>   Best Regards,
>>>   Hideo Yamauchi.
>>> 
>>> 
>>>   - Original Message -
   From: "renayama19661...@ybb.ne.jp" 
>>  
   To: "kwenn...@redhat.com" ; 
>>  "users@clusterlabs.org" ; Cluster 
> Labs - 
>>  All topics related to open-source clustering welcomed 
>>  
   Cc: 
   Date: 2016/4/28, Thu 22:43
   Subject: Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts
 
   Hi Klaus,
 
   Because the script is performed the effectiveness of in async, I 
> think 
>>  that it 
   is difficult to set "uptime" by the method of the 
> sample.
   After all we may request the transmission of the order.
   #The patch before mine only controls a practice turn of the async 
> and 
>>  is not a 
   thing giving load of crmd.
 
   Japan begins a rest for one week from tomorrow.
   I discuss after vacation with a member.
 
   Best Regards,
   Hideo Yamauchi.
 
 
 
   - Original Message -
>    From: Klaus Wenninger 
>    To: users@clusterlabs.org
>    Cc: 
>    Date: 2016/4/28, Thu 03:14
>    Subject: Re: [ClusterLabs] Coming in 1.1.15: Event-driven 
> alerts
> 
>    On 04/27/2016 04:19 PM, renayama19661...@ybb.ne.jp wrote:
>>     Hi All,
>> 
>>     We have a request for a new SNMP function.
>> 
>> 
>>     The order of traps is not right.
>> 
>>     The turn 

Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts

2016-05-13 Thread renayama19661014
Hi Klaus,

I will write a patch using a queue by the end of next week.


Please also let me know your opinion.

Many thanks!
Hideo Yamauchi.


- Original Message -
> From: Klaus Wenninger 
> To: "users@clusterlabs.org" 
> Cc: 
> Date: 2016/5/14, Sat 00:51
> Subject: Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts
> 
> On 05/13/2016 04:59 PM, renayama19661...@ybb.ne.jp wrote:
>>  i Klaus,
>> 
>>  After all we want transmission order.
>>  I think that I am going to use meta_attribute which you suggested and am 
> enough.
>> 
>>  (snip)
>>  >  path="/xxx//xxx.sh">
>>  
>>   value="alert1-queue" />
> Maybe we find a name that somehow implies what the purpose of the
> queue is. Something like "serialization-queue" coming to my mind
> although it is a little bit of an alliteration of course.
>>  
>>  
>>  >  path="/xxx//xxx.sh">
>> 
>>  (snip)
>> 
>>  I intend to write the correction that included a cue in this 
> meta_attribute, what do you think?
>>  I think that processing when queue was changed is troublesome, but think 
> that I can write the patch somehow.
> The current implementation for the alerts-feature doesn't directly use
> data from cib-diffs coming
> in, but just triggers a query for the whole section, purges all local
> data and feeds it in again from
> what the query returns.
> Thus I would just drain the queues - actively or just wait - before
> deleting them together with
> all the other local data. The alerts section is not expected to be
> altered frequently during operation.
>> 
>>  We intend to make the correction that pcmk_snmp_helper.shkeeps versatility.
> I tried to ask around a little bit to find out which tools were widely
> used to collect and process
> traps and how these are coping with traps arriving out of order because of
> whatever reason (local reordering done by the scheduler - as we have it
> here,
> different delays on the network from different trap-sources, loss of
> order on the
> network, ...).
> Unfortunately I didn't get real answers.
> But maybe somebody here on the list can give a hint on that topic.
> Anyway I found OID hrSystemDate (1.3.6.1.2.1.25.1.2) - part of MIB-2.
> We could at least add this one to the pacemaker-mib-OIDs in
> pcmk_snmp_helper.sh
> and feed it with a correct timestamp (created by crmd).
> 
>> 
>>  We want to use this function in Pacemaker1.1.15.
>> 
>>  Best Regards,
>>  Hideo Yamauch.
>> 
>> 
>> 
>>  - Original Message -
>>>  From: "renayama19661...@ybb.ne.jp" 
> 
>>>  To: "kwenn...@redhat.com" ; 
> "users@clusterlabs.org" ; Cluster Labs - 
> All topics related to open-source clustering welcomed 
> 
>>>  Cc: 
>>>  Date: 2016/5/12, Thu 06:28
>>>  Subject: Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts
>>> 
>>>  Hi Klaus,
>>> 
>>>  Thank you for comment.
>>> 
>>>  I confirm your comment.
>>>  I think that I ask you a question again.
>>> 
>>> 
>>>  Many thanks!
>>>  Hideo Yamauchi.
>>> 
>>> 
>>>  - Original Message -
   From: Klaus Wenninger 
   To: users@clusterlabs.org
   Cc: 
   Date: 2016/5/11, Wed 14:13
   Subject: Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts
 
   On 05/10/2016 11:19 PM, renayama19661...@ybb.ne.jp wrote:
>    Hi All,
> 
>    After all our member needs the control of the turn of the 
> transmission 
>>>  of 
   the SNMP trap.
>    We make a patch of the control of the turn of the 
> transmission and 
>>>  intend 
   to send it.
>    Probably, with the patch, we add the "ordered" 
> attribute 
>>>  that we 
   sent by an email before.
   Actually I still don't think that simple serialization of the 
> calling 
>>>  of
   the snmptrap-tool
   is a good solution to tackle the problem of loosing the order of 
> traps
   arriving at
   some management station:
 
   - makes things worse in case of traps coming from multiple nodes
   - doesn't help when the order is lost on the network.
 
   Anyway I see 2 other scenarios where a certain degree of 
> serialization 
>>>  might
   be helpful:
 
   - alert agent-scripts that can't handle being called 
> concurrently
   - performance issues that might arise on some systems that lack 
> the
     performance-headroom needed and/or the agent-scripts in place
     require significant effort and/or there are a lot of 
> resources/events
     that trigger a vast amount of alerts being handled in parallel
 
   So I could imagine the introduction of a meta-atribute that 
> specifies a
   queue
   to be used for serialization.
 
   - 'none' is default and leads to the behavior we have at 
> the 
>>>  moment.
   - any other queue-name leads to the instantiation of an additional 
> queue
 
   

Re: [ClusterLabs] Coming in 1.1.15: Event-driven alerts

2016-04-27 Thread renayama19661014
Hi All,

We have a request for a new SNMP function.


The order of the traps is not correct.

The traps do not always arrive in the expected order.
This is because the notification handling executes "path" asynchronously.
I think it is necessary to wait for completion of execution per "path" entry
of "alerts".

The order of the traps differs from the actual stop order of the resources.


[root@rh72-01 ~]# grep Operation  /var/log/ha-log | grep stop
Apr 25 18:48:48 rh72-01 crmd[28897]:  notice: Operation prmDummy1_stop_0: ok 
(node=rh72-01, call=33, rc=0, cib-update=56, confirmed=true)
Apr 25 18:48:48 rh72-01 crmd[28897]:  notice: Operation prmDummy3_stop_0: ok 
(node=rh72-01, call=37, rc=0, cib-update=57, confirmed=true)
Apr 25 18:48:48 rh72-01 crmd[28897]:  notice: Operation prmDummy4_stop_0: ok 
(node=rh72-01, call=39, rc=0, cib-update=58, confirmed=true)
Apr 25 18:48:48 rh72-01 crmd[28897]:  notice: Operation prmDummy2_stop_0: ok 
(node=rh72-01, call=35, rc=0, cib-update=59, confirmed=true)
Apr 25 18:48:48 rh72-01 crmd[28897]:  notice: Operation prmDummy5_stop_0: ok 
(node=rh72-01, call=41, rc=0, cib-update=60, confirmed=true)

Apr 25 18:48:50 snmp-manager snmptrapd[6865]: 2016-04-25 18:48:50  
[UDP: 
[192.168.28.170]:40613->[192.168.28.189]:162]:#012DISMAN-EVENT-MIB::sysUpTimeInstance
 = Timeticks: (25512486) 2 days, 22:52:04.86#011SNMPv2-MIB::snmpTrapOID.0 = 
OID: 
PACEMAKER-MIB::pacemakerNotificationTrap#011PACEMAKER-MIB::pacemakerNotificationNode
 = STRING: "rh72-01"#011PACEMAKER-MIB::pacemakerNotificationResource = STRING: 
"prmDummy3"#011PACEMAKER-MIB::pacemakerNotificationOperation = STRING: 
"stop"#011PACEMAKER-MIB::pacemakerNotificationDescription = STRING: 
"ok"#011PACEMAKER-MIB::pacemakerNotificationStatus = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationReturnCode = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationTargetReturnCode = INTEGER: 0
Apr 25 18:48:50 snmp-manager snmptrapd[6865]: 2016-04-25 18:48:50  
[UDP: 
[192.168.28.170]:39581->[192.168.28.189]:162]:#012DISMAN-EVENT-MIB::sysUpTimeInstance
 = Timeticks: (25512489) 2 days, 22:52:04.89#011SNMPv2-MIB::snmpTrapOID.0 = 
OID: 
PACEMAKER-MIB::pacemakerNotificationTrap#011PACEMAKER-MIB::pacemakerNotificationNode
 = STRING: "rh72-01"#011PACEMAKER-MIB::pacemakerNotificationResource = STRING: 
"prmDummy4"#011PACEMAKER-MIB::pacemakerNotificationOperation = STRING: 
"stop"#011PACEMAKER-MIB::pacemakerNotificationDescription = STRING: 
"ok"#011PACEMAKER-MIB::pacemakerNotificationStatus = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationReturnCode = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationTargetReturnCode = INTEGER: 0
Apr 25 18:48:50 snmp-manager snmptrapd[6865]: 2016-04-25 18:48:50  
[UDP: 
[192.168.28.170]:37166->[192.168.28.189]:162]:#012DISMAN-EVENT-MIB::sysUpTimeInstance
 = Timeticks: (25512490) 2 days, 22:52:04.90#011SNMPv2-MIB::snmpTrapOID.0 = 
OID: 
PACEMAKER-MIB::pacemakerNotificationTrap#011PACEMAKER-MIB::pacemakerNotificationNode
 = STRING: "rh72-01"#011PACEMAKER-MIB::pacemakerNotificationResource = STRING: 
"prmDummy1"#011PACEMAKER-MIB::pacemakerNotificationOperation = STRING: 
"stop"#011PACEMAKER-MIB::pacemakerNotificationDescription = STRING: 
"ok"#011PACEMAKER-MIB::pacemakerNotificationStatus = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationReturnCode = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationTargetReturnCode = INTEGER: 0
Apr 25 18:48:50 snmp-manager snmptrapd[6865]: 2016-04-25 18:48:50  
[UDP: 
[192.168.28.170]:53502->[192.168.28.189]:162]:#012DISMAN-EVENT-MIB::sysUpTimeInstance
 = Timeticks: (25512494) 2 days, 22:52:04.94#011SNMPv2-MIB::snmpTrapOID.0 = 
OID: 
PACEMAKER-MIB::pacemakerNotificationTrap#011PACEMAKER-MIB::pacemakerNotificationNode
 = STRING: "rh72-01"#011PACEMAKER-MIB::pacemakerNotificationResource = STRING: 
"prmDummy2"#011PACEMAKER-MIB::pacemakerNotificationOperation = STRING: 
"stop"#011PACEMAKER-MIB::pacemakerNotificationDescription = STRING: 
"ok"#011PACEMAKER-MIB::pacemakerNotificationStatus = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationReturnCode = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationTargetReturnCode = INTEGER: 0
Apr 25 18:48:50 snmp-manager snmptrapd[6865]: 2016-04-25 18:48:50  
[UDP: 
[192.168.28.170]:45956->[192.168.28.189]:162]:#012DISMAN-EVENT-MIB::sysUpTimeInstance
 = Timeticks: (25512497) 2 days, 22:52:04.97#011SNMPv2-MIB::snmpTrapOID.0 = 
OID: 
PACEMAKER-MIB::pacemakerNotificationTrap#011PACEMAKER-MIB::pacemakerNotificationNode
 = STRING: "rh72-01"#011PACEMAKER-MIB::pacemakerNotificationResource = STRING: 
"prmDummy5"#011PACEMAKER-MIB::pacemakerNotificationOperation = STRING: 
"stop"#011PACEMAKER-MIB::pacemakerNotificationDescription = STRING: 
"ok"#011PACEMAKER-MIB::pacemakerNotificationStatus = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationReturnCode = INTEGER: 
0#011PACEMAKER-MIB::pacemakerNotificationTargetReturnCode = INTEGER: 0



I understand that the "timestamp" attribute was added for this asynchronous
behaviour by this change.

The order of traps may be important 

[ClusterLabs] [Question] About a change of crm_failcount.

2017-02-02 Thread renayama19661014
Hi All,

Due to the following change, users can no longer set any value other than zero
with crm_failcount.

 - [Fix: tools: implement crm_failcount command-line options correctly]
   - 
https://github.com/ClusterLabs/pacemaker/commit/95db10602e8f646eefed335414e40a994498cafd#diff-6e58482648938fd488a920b9902daac4

However, the pgsql RA sets INFINITY via crm_failcount in its script.

```
(snip)
    CRM_FAILCOUNT="${HA_SBIN_DIR}/crm_failcount"
(snip)
    ocf_exit_reason "My data is newer than new master's one. New   master's 
location : $master_baseline"
    exec_with_retry 0 $CRM_FAILCOUNT -r $OCF_RESOURCE_INSTANCE -U $NODENAME -v 
INFINITY
    return $OCF_ERR_GENERIC
(snip)
```

As far as we can tell, only pgsql is affected.

Could crm_failcount be revised so that values other than zero can be set again?
If it cannot be revised, we will modify the pgsql RA to use crm_attribute
instead, along the lines of the sketch below.
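
This is only a rough, untested sketch of that fallback; fail-count-<resource>
is the transient (reboot-lifetime) fail count attribute for the resource:

```
(snip)
    exec_with_retry 0 ${HA_SBIN_DIR}/crm_attribute -l reboot -N $NODENAME \
        -n "fail-count-${OCF_RESOURCE_INSTANCE}" -v INFINITY
(snip)
```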

Best Regards,
Hideo Yamauchi.

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [Question] About log collection of crm_report.

2017-01-23 Thread renayama19661014
Hi All,

When I run Pacemaker 1.1.15 or Pacemaker 1.1.16 on RHEL 7.3, the logs related to
pacemaker are not included in the archive collected by sosreport.
 

This seems to be caused by the following changes, combined with the pacemaker.py
sosreport plugin in RHEL 7.3.

 - 
https://github.com/ClusterLabs/pacemaker/commit/1bcad6a1eced1a3b6c314b05ac1d353adda260f6
 - 
https://github.com/ClusterLabs/pacemaker/commit/582e886dd8475f701746999c0093cd9735aca1ed#diff-284d516fab648676f5d93bc5ce8b0fbf


---
(/usr/lib/python2.7/site-packages/sos/plugins/pacemaker.py)
(snip)
        if not self.get_option("crm_scrub"):
            crm_scrub = ""
            self._log_warn("scrubbing of crm passwords has been disabled:")
            self._log_warn("data collected by crm_report may contain"
                           " sensitive values.")
        self.add_cmd_output('crm_report --sos-mode %s -S -d '
                            ' --dest %s --from "%s"' %
                            (crm_scrub, crm_dest, crm_from),
                            chroot=self.tmp_in_sysroot())
(snip)
---


When crm_report is invoked from sosreport, what is the reason for setting
search_logs to 0?

We think search_logs should be set to 1 when crm_report is run from sosreport.
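
For reference, we can reproduce this simply by running the pacemaker plugin of
sosreport on its own (the options may differ between sos versions):

  sosreport -o pacemaker --batch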


Best Regards,
Hideo Yamauchi.

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Question] About log collection of crm_report.

2017-01-25 Thread renayama19661014
Hi Ken,

Thank you for your comment.

For example, our users do not use pacemaker.log or corosync.log.

Instead, they configure syslog so that all of the logs are written to
/var/log/ha-log.

-
(/etc/corosycn/corosync.conf)
logging {
        syslog_facility: local1
        debug: off
}

(/etc/sysconfig/pacemaker)
PCMK_logfile=none
PCMK_logfacility=local1
PCMK_logpriority=info
PCMK_fail_fast=yes

(/etc/rsyslog.conf)
# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none;local1.none                
/var/log/messages
(snip)
# Save boot messages also to boot.log
local7.*                                                /var/log/boot.log
local1.info /var/log/ha-log
-

With the current crm_report, for users who send the logs to a different file
like this, those logs are not collected in sosreport.

Is this not a problem?
Or is sosreport eventually going to collect everything under /var/log?

Of course I know that /var/log/ha-log is collected correctly when I run
crm_report on its own.
I want to know why log collection by crm_report was disabled when it is run
from sosreport.
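
(For reference, running it standalone looks like this; the timestamp and
destination directory are just examples:)

  crm_report --from "2017-01-23 00:00:00" --dest /tmp/crm_report_ha-log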

Is it considered sufficient for Red Hat that sosreport collects its own set of
files?
If that is the case, we can understand.

- I am also testing crm_report itself at the moment, and it seems to have some
  other problems.
- I intend to report those problems in Bugzilla separately.

Best Regards,
Hideo Yamauchi.



- Original Message -
> From: Ken Gaillot 
> To: users@clusterlabs.org
> Cc: 
> Date: 2017/1/24, Tue 08:15
> Subject: Re: [ClusterLabs] [Question] About log collection of crm_report.
> 
> On 01/23/2017 04:17 PM, renayama19661...@ybb.ne.jp wrote:
>>  Hi All,
>> 
>>  When I carry out Pacemaker1.1.15 and Pacemaker1.1.16 in RHEL7.3, log in 
> conjunction with pacemaker is not collected in the file which I collected in 
> sosreport.
>>   
>> 
>>  This seems to be caused by the next correction and pacemaker.py script of 
> RHEL7.3.
>> 
>>   - 
> https://github.com/ClusterLabs/pacemaker/commit/1bcad6a1eced1a3b6c314b05ac1d353adda260f6
>>   - 
> https://github.com/ClusterLabs/pacemaker/commit/582e886dd8475f701746999c0093cd9735aca1ed#diff-284d516fab648676f5d93bc5ce8b0fbf
>> 
>> 
>>  ---
>>  (/usr/lib/python2.7/site-packages/sos/plugins/pacemaker.py)
>>  (snip)
>>          if not self.get_option("crm_scrub"):
>>              crm_scrub = ""
>>              self._log_warn("scrubbing of crm passwords has been 
> disabled:")
>>              self._log_warn("data collected by crm_report may 
> contain"
>>                             " sensitive values.")
>>          self.add_cmd_output('crm_report --sos-mode %s -S -d '
>>                              ' --dest %s --from "%s"' %
>>                              (crm_scrub, crm_dest, crm_from),
>>                              chroot=self.tmp_in_sysroot())
>>  (snip)
>>  ---
>> 
>> 
>>  When a user carries out crm_report in sosreport, what is the reason that 
> set search_logs to 0?
>> 
>>  We think that the one where search_logs works with 1 in sosreport is right.
>> 
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
> 
> Hi Hideo,
> 
> The --sos-mode option is intended for RHEL integration, so it is only
> guaranteed to work with the combination of pacemaker and sosreport
> packages delivered with a particular version of RHEL (and its derivatives).
> 
> That allows us to make assumptions about what sosreport features are
> available. It might be better to detect those features, but we haven't
> seen enough usage of sosreport + pacemaker outside RHEL to make that
> worth the effort.
> 
> In this case, the version of sosreport that will be in RHEL 7.4 will
> collect pacemaker.log and corosync.log on its own, so the crm_report in
> pacemaker 1.1.16 doesn't need to collect the logs itself.
> 
> It might work if you build the latest sosreport:
> https://github.com/sosreport/sos
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [Problem] The fail-over is completed without the stop of the resource being carried out.

2016-09-26 Thread renayama19661014
Hi All,

We discovered a problem in a cluster that has neither quorum control nor STONITH.

We can reproduce the problem with the following procedure.

Step1) Constitute a cluster.

[root@rh72-01 ~]# crm configure load update trac3437.crm 

[root@rh72-01 ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh72-01 (version 1.1.15-e174ec8) - partition with quorum
Last updated: Mon Sep 26 13:00:22 2016  Last change: Mon Sep 26 
12:59:52 2016 by root via cibadmin on rh72-01

2 nodes and 1 resource configured

Online: [ rh72-01 rh72-02 ]

Resource Group: grpDummy
prmDummy   (ocf::pacemaker:Dummy): Started rh72-01

Node Attributes:
* Node rh72-01:
* Node rh72-02:

Migration Summary:
* Node rh72-01:
* Node rh72-02:


Step2) Edit the Dummy resource so that its stop action fails.

(snip)
dummy_stop() {
    return $OCF_ERR_GENERIC
    dummy_monitor
    if [ $? -eq $OCF_SUCCESS ]; then
        rm ${OCF_RESKEY_state}
    fi
    rm -f "${VERIFY_SERIALIZED_FILE}"
    return $OCF_SUCCESS
}
(snip)

Step3) Stop Pacemaker on the node. The stop failure occurs.
[root@rh72-01 ~]# systemctl stop pacemaker

[root@rh72-01 ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh72-01 (version 1.1.15-e174ec8) - partition with quorum
Last updated: Mon Sep 26 13:01:33 2016  Last change: Mon Sep 26 
12:59:52 2016 by root via cibadmin on rh72-01

2 nodes and 1 resource configured

Online: [ rh72-01 rh72-02 ]

Resource Group: grpDummy
prmDummy   (ocf::pacemaker:Dummy): FAILED rh72-01 (blocked)

Node Attributes:
* Node rh72-01:
* Node rh72-02:

Migration Summary:
* Node rh72-01:
prmDummy: migration-threshold=1 fail-count=100 last-failure='Mon Sep 26 
13:01:18 2016'
* Node rh72-02:

Failed Actions:
* prmDummy_stop_0 on rh72-01 'unknown error' (1): call=8, status=complete, 
exitreason='none',
last-rc-change='Mon Sep 26 13:01:18 2016', queued=0ms, exec=33ms

Step 4) Restore the Dummy resource agent to its original code.
(snip)
dummy_stop() {
    dummy_monitor
    if [ $? -eq $OCF_SUCCESS ]; then
        rm ${OCF_RESKEY_state}
    fi
    rm -f "${VERIFY_SERIALIZED_FILE}"
    return $OCF_SUCCESS
}
(snip)

Step 5) Clean up the failure of the Dummy resource.

[root@rh72-01 ~]# crm_resource -C -r prmDummy -H rh72-01 -f
Cleaning up prmDummy on rh72-01, removing fail-count-prmDummy
Waiting for 1 replies from the CRMd. OK

Step 6) Fail-over completes. However, the Dummy resource is never stopped on the rh72-01 node.

[root@rh72-02 ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh72-02 (version 1.1.15-e174ec8) - partition WITHOUT quorum
Last updated: Mon Sep 26 13:02:32 2016  Last change: Mon Sep 26 
13:02:20 2016 by hacluster via crmd on rh72-01

2 nodes and 1 resource configured

Online: [ rh72-02 ]
OFFLINE: [ rh72-01 ]

Resource Group: grpDummy
prmDummy   (ocf::pacemaker:Dummy): Started rh72-02

Node Attributes:
* Node rh72-02:

Migration Summary:
* Node rh72-02:

[root@rh72-01 ~]# ls -lt /var/run/Dummy-prmDummy.state 
-rw-r-. 1 root root 0  9月 26  2016 /var/run/Dummy-prmDummy.state
-
Sep 26 13:02:21 rh72-01 crmd[1584]: warning: Action 2 (prmDummy_monitor_0) on 
rh72-01 failed (target: 7 vs. rc: 0): Error
Sep 26 13:02:21 rh72-01 crmd[1584]: notice: Transition aborted by operation 
prmDummy_monitor_0 'create' on rh72-01: Event failed | 
magic=0:0;2:6:7:196faae4-4faf-42a5-9ffb-9dcf6272e3fb cib=0.6.2 
source=match_graph_event:310 complete=false
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Action prmDummy_monitor_0 (2) 
confirmed on rh72-01 (rc=0)
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Detected action (6.2) 
prmDummy_monitor_0.13=ok: failed
Sep 26 13:02:21 rh72-01 crmd[1584]: warning: Action 2 (prmDummy_monitor_0) on 
rh72-01 failed (target: 7 vs. rc: 0): Error
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Transition aborted by operation 
prmDummy_monitor_0 'create' on rh72-01: Event failed | 
magic=0:0;2:6:7:196faae4-4faf-42a5-9ffb-9dcf6272e3fb cib=0.6.2 
source=match_graph_event:310 complete=false
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Action prmDummy_monitor_0 (2) 
confirmed on rh72-01 (rc=0)
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Detected action (6.2) 
prmDummy_monitor_0.13=ok: failed
Sep 26 13:02:21 rh72-01 crmd[1584]: notice: Transition 6 (Complete=3, 
Pending=0, Fired=0, Skipped=0, Incomplete=3, 
Source=/var/lib/pacemaker/pengine/pe-input-6.bz2): Complete
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Input I_STOP received in state 
S_TRANSITION_ENGINE from notify_crmd
Sep 26 13:02:21 rh72-01 crmd[1584]: info: State transition S_TRANSITION_ENGINE 
-> S_STOPPING | input=I_STOP cause=C_FSA_INTERNAL origin=notify_crmd
Sep 26 13:02:21 rh72-01 crmd[1584]: info: DC role released
Sep 26 13:02:21 rh72-01 crmd[1584]: info: Connection to the Policy Engine 
released
Sep 26 13:02:21 rh72-01 cib[1579]: info: Forwarding cib_modify operation for 
section status to all (origin=local/crmd/56)
Sep 26 13:02:21 rh72-01 cib[1579]: info: Diff: --- 0.6.2 2
Sep 26 13:02:21 rh72-01 cib[1579]: info: Diff: +++ 0.6.3 (null)
Sep 26 13:02:21 rh72-01 cib[1579]: info: +  /cib:  @num_updates=3Sep 26 
13:02:21 

[ClusterLabs] [Enhancement] Request to an SNMP trap function.

2016-10-27 Thread renayama19661014
Hi Ken,
Hi All,
For a future SNMP trap function, we request the following feature:

 * An SNMP trap on attribute changes, i.e. a function that transmits an SNMP
   trap when a specific attribute changes.

It would be useful to receive an SNMP trap when the DRBD score or the score of
pgsql streaming replication changes, and I think there are other uses as well.
This function could probably be realized by having crmd watch the diffs that
are delivered from the cib. However, the current MIB file may not fit this use,
so I think the MIB file will need to be revised.
We request examination and discussion of this function.

- I registered these contents with Bugzilla for discussion.
(http://bugs.clusterlabs.org/show_bug.cgi?id=5303)
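Until such a feature exists, a rough workaround sketch is possible in shell (this is my own illustration, not part of the request; the attribute name, node name, OID and SNMP manager below are all placeholders):

--
#!/bin/sh
# Rough workaround sketch (not the requested feature): poll one transient node
# attribute and send an SNMP trap when its value changes.
ATTR="master-pgsql"        # hypothetical attribute to watch
NODE="rh73-01-snmp"        # hypothetical node name
SINK="192.0.2.10"          # hypothetical SNMP manager
OID="1.3.6.1.4.1.32723"    # hypothetical enterprise OID

prev=""
while sleep 10; do
    cur=$(crm_attribute -N "$NODE" -n "$ATTR" -l reboot -G -q 2>/dev/null)
    if [ "$cur" != "$prev" ]; then
        # net-snmp's snmptrap; community string and varbind layout are assumptions
        snmptrap -v 2c -c public "$SINK" '' "$OID" \
            "$OID.1" s "attribute $ATTR on $NODE changed to '$cur'"
        prev="$cur"
    fi
done
--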

Best Regards,
Hideo Yamauchi.
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-11-05 Thread renayama19661014
Hi Klaus,
Hi Jan,
Hi All,

Regarding the watchdog using the corosync WD service, there does not seem to be any opposing opinion.
I will start work on making an official patch next week.

Best Regards,
Hideo Yamauchi.


- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Cc: 
> Date: 2016/10/26, Wed 17:46
> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is 
> frozen, cluster decisions are delayed infinitely
> 
> Hi Klaus,
> Hi Jan,
> Hi All,
> 
> Our member argued about watchdog using WD service.
> 
> 1) The WD service is not abolished.
> 2) In pacemaker_remote, it is available by starting corosync in localhost.
> 3) It is necessary for the scramble of watchdog to consider it.
> 4) Because I think about the case which does not use sbd, I do not think 
> about 
> adding an interface similar to corosync-API to sbd for the moment.
> 
> The user chooses a method using method and WD service using sbd and will use 
> it.
> It may cause confusion that there are two methods, but there is value for the 
> user who does not use sbd.
> 
> We want to include watchdog using WD service in Pacemaker.
> I intend to make an official patch.
> 
> What do you think?
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> 
> - Original Message -
>>  From: "renayama19661...@ybb.ne.jp" 
> 
>>  To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
>>  Cc: 
>>  Date: 2016/10/20, Thu 19:08
>>  Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd 
> is frozen, cluster decisions are delayed infinitely
>> 
>>  Hi Klaus,
>>  Hi Jan,
>> 
>>  Thank you for comment.
>> 
>>  I wait for other comment a little more.
>>  We will argue about this matter next week.
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
>> 
>> 
>>  - Original Message -
>>>   From: Jan Friesse 
>>>   To: kwenn...@redhat.com; Cluster Labs - All topics related to 
> open-source 
>>  clustering welcomed 
>>>   Cc: 
>>>   Date: 2016/10/20, Thu 15:46
>>>   Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC 
> crmd 
>>  is frozen, cluster decisions are delayed infinitely
>>> 
 
    On 10/14/2016 11:21 AM, renayama19661...@ybb.ne.jp wrote:
>    Hi Klaus,
>    Hi All,
> 
>    I tried prototype of watchdog using WD service.
>      - 
>>> 
>> 
> https://github.com/HideoYamauchi/pacemaker/commit/3ee97b76e0212b1790226864dfcacd1a327dbcc9
> 
>    Please comment.
    Thank you Hideo for providing the prototype.
    Added the patch to my build and it seems to
    be working as expected.
 
    A few thoughts triggered by this approach:
 
    - we have to alert the corosync-people as in
       a chat with Jan Friesse he pointed me to the
       fact that for corosync 3.x the wd-service was
       planned to be removed
>>> 
>>>   Actually I didn't express myself correctly. What I wanted to say 
> was 
>>>   "I'm considering idea of removing it", simply because 
>>  it's 
>>>   disabled in 
>>>   downstream.
>>> 
>>>   BUT keep in mind that removing functionality = ask community to find 
> out 
>>>   if there is not somebody actively using it.
>>> 
>>>   And because there is active users and future use case, removing of wd 
> is 
>>>   not an option.
>>> 
>>> 
 
       especially delicate as the binding is very loose
       so that - as is - it builds against a corosync with
       disabled wd-service without any complaints...
 
    - as of now if you enable wd-service in the
       corosync-build it is on by default and would
       be hogging the watchdog presumably
       (there is obviously a pull request that makes
       it default to off)
 
    - with my thoughts about adding an API to
       sbd previously in the thread I was trying to
       target closer observation of pacemaker_remoted
       as well (remote-nodes don't have corosync
       running)
 
       I guess it would be possible to run corosync
       with a static config as single-node cluster
       bound to localhost for that purpose.
 
       I read the thread about corosync-remote and
       that happening might make the special-handling
       for pacemaker-remote obsolete anyway ...
 
    - to enable the approach to live alongside
       sbd it would be possible to make sbd use
       the corosync-API as well for watchdog purposes
       instead of opening the watchdog directly
 
       This shouldn't be a big deal for sbd used to
       observe a pacemaker-node as cluster-watcher
       (the part of sbd that sends cpg-pings to corosync)
       already builds against corosync.
       The blockdevice-part 

Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-10-14 Thread renayama19661014
Hi Klaus,
Hi All,

I tried prototype of watchdog using WD service.
 - 
https://github.com/HideoYamauchi/pacemaker/commit/3ee97b76e0212b1790226864dfcacd1a327dbcc9

Please comment.


Best Regards,
Hideo Yamauchi.


- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: "users@clusterlabs.org" 
> Cc: 
> Date: 2016/10/11, Tue 17:58
> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is 
> frozen, cluster decisions are delayed infinitely
> 
> Hi Klaus,
> 
> Thank you for comment.
> 
> I make the patch which is prototype using WD service.
> 
> Please wait a little.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> 
> 
> - Original Message -
>>  From: Klaus Wenninger 
>>  To: users@clusterlabs.org
>>  Cc: 
>>  Date: 2016/10/10, Mon 21:03
>>  Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd 
> is frozen, cluster decisions are delayed infinitely
>> 
>>  On 10/07/2016 11:10 PM, renayama19661...@ybb.ne.jp wrote:
>>>   Hi All,
>>> 
>>>   Our user may not necessarily use sdb.
>>> 
>>>   I confirmed that there was a method using WD service of corosync as 
> one 
>>  method not to use sdb.
>>>   Pacemaker watches the process of pacemaker by WD service using CMAP 
> and can 
>>  carry out watchdog.
>> 
>>  Have to have a look at that...
>>  But if we establish some in-between-layer in pacemaker we could have this
>>  as one of the possibilities besides e.g. sbd (with enhanced API), going for
>>  a watchdog-device directly, ...
>> 
>>> 
>>> 
>>>   We can set up a patch of pacemaker.
>> 
>>  Always helpful to discuss/clarify an idea once some code is available ...
>> 
>>>   Was the discussion of using WD service over so far?
>> 
>>  Not from my pov. Just a day off ;-)
>> 
>>> 
>>> 
>>>   Best Regard,
>>>   Hideo Yamauchi.
>>> 
>>> 
>>>   - Original Message -
   From: Klaus Wenninger 
   To: Ulrich Windl ; 
>>  users@clusterlabs.org
   Cc: 
   Date: 2016/10/7, Fri 17:47
   Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the 
> DC 
>>  crmd is frozen, cluster decisions are delayed infinitely
 
   On 10/07/2016 08:14 AM, Ulrich Windl wrote:
    Klaus Wenninger  
> schrieb am 
>> 
   06.10.2016 um 18:03 in
>    Nachricht 
> <3980cfdd-ebd9-1597-f6bd-a1ca808f7...@redhat.com>:
>>    On 10/05/2016 04:22 PM, renayama19661...@ybb.ne.jp wrote:
>>>    Hi All,
>>> 
>    If a user uses sbd, can the cluster evade a 
>>  problem of 
   SIGSTOP of crmd?
    
    As pointed out earlier, maybe crmd should feed a 
>>  watchdog. Then 
   stopping 
>>    crmd 
    will reboot the node (unless the watchdog fails).
>>>    Thank you for comment.
>>> 
>>>    We examine watchdog of crmd, too.
>>>    In addition, I comment after examination advanced.
>>    Was thinking of doing a small test implementation going
>>    a little in the direction Lars Ellenberg had been 
> pointing 
>>  out.
>> 
>>    a couple of thoughts I had so far:
>> 
>>    - add an API (via DBus or libqb - favoring libqb atm) to 
> sbd
>>      an application can use to create a watchdog within sbd
>    Why has it to be done within sbd?
   Not necessarily, could be spawned out as well into an own project 
> or
   something already existent could be taken.
   Remember to have added a dbus-interface to
   https://sourceforge.net/projects/watchdog/ for a project once.
   If you have a suggestion I'm open.
   Going off sbd would have the advantage of a smooth start:
 
   - cluster/pacemaker-watcher are there already and can
     be replaced/moved over time
   - the lifecycle of the daemon (when started/stopped) is
     already something that is in the code and in the people's 
> minds
 
>>    - parameters for the first are a name and a timeout
>> 
>>    - first use-case would be crmd observation
>> 
>>    - later on we could think of removing pacemaker 
> dependencies
>>      from sbd by moving the actual implementation of
>>      pacemaker-watcher and probably cluster-watcher as well
>>      into pacemaker - using the new API
>> 
>>    - this of course creates sbd dependency within pacemaker 
> so
>>      that it would make sense to offer a simpler and 
>>  self-contained
>>      implementation within pacemaker as an alternative
>    I think the watchdog interface is so simple that you 
> don't 
>>  need a relay 
   for it. The only limit I can imagine is the number of watchdogs 
>>  available of 
   some specific hardware.
   That is the point ;-)
>>      thus it would be favorable to have the dependency
>>      within a non-compulsory pacemaker-rpm so that
>>      we can offer an 

Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-10-26 Thread renayama19661014
Hi Klaus,
Hi Jan,
Hi All,

Our member argued about watchdog using WD service.

1) The WD service is not abolished.
2) In pacemaker_remote, it is available by starting corosync in localhost.
3) It is necessary for the scramble of watchdog to consider it.
4) Because I think about the case which does not use sbd, I do not think about 
adding an interface similar to corosync-API to sbd for the moment.

The user chooses a method using method and WD service using sbd and will use it.
It may cause confusion that there are two methods, but there is value for the 
user who does not use sbd.

We want to include watchdog using WD service in Pacemaker.
I intend to make an official patch.

What do you think?

Best Regards,
Hideo Yamauchi.



- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Cc: 
> Date: 2016/10/20, Thu 19:08
> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is 
> frozen, cluster decisions are delayed infinitely
> 
> Hi Klaus,
> Hi Jan,
> 
> Thank you for comment.
> 
> I wait for other comment a little more.
> We will argue about this matter next week.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> - Original Message -
>>  From: Jan Friesse 
>>  To: kwenn...@redhat.com; Cluster Labs - All topics related to open-source 
> clustering welcomed 
>>  Cc: 
>>  Date: 2016/10/20, Thu 15:46
>>  Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd 
> is frozen, cluster decisions are delayed infinitely
>> 
>>> 
>>>   On 10/14/2016 11:21 AM, renayama19661...@ybb.ne.jp wrote:
   Hi Klaus,
   Hi All,
 
   I tried prototype of watchdog using WD service.
     - 
>> 
> https://github.com/HideoYamauchi/pacemaker/commit/3ee97b76e0212b1790226864dfcacd1a327dbcc9
 
   Please comment.
>>>   Thank you Hideo for providing the prototype.
>>>   Added the patch to my build and it seems to
>>>   be working as expected.
>>> 
>>>   A few thoughts triggered by this approach:
>>> 
>>>   - we have to alert the corosync-people as in
>>>      a chat with Jan Friesse he pointed me to the
>>>      fact that for corosync 3.x the wd-service was
>>>      planned to be removed
>> 
>>  Actually I didn't express myself correctly. What I wanted to say was 
>>  "I'm considering idea of removing it", simply because 
> it's 
>>  disabled in 
>>  downstream.
>> 
>>  BUT keep in mind that removing functionality = ask community to find out 
>>  if there is not somebody actively using it.
>> 
>>  And because there is active users and future use case, removing of wd is 
>>  not an option.
>> 
>> 
>>> 
>>>      especially delicate as the binding is very loose
>>>      so that - as is - it builds against a corosync with
>>>      disabled wd-service without any complaints...
>>> 
>>>   - as of now if you enable wd-service in the
>>>      corosync-build it is on by default and would
>>>      be hogging the watchdog presumably
>>>      (there is obviously a pull request that makes
>>>      it default to off)
>>> 
>>>   - with my thoughts about adding an API to
>>>      sbd previously in the thread I was trying to
>>>      target closer observation of pacemaker_remoted
>>>      as well (remote-nodes don't have corosync
>>>      running)
>>> 
>>>      I guess it would be possible to run corosync
>>>      with a static config as single-node cluster
>>>      bound to localhost for that purpose.
>>> 
>>>      I read the thread about corosync-remote and
>>>      that happening might make the special-handling
>>>      for pacemaker-remote obsolete anyway ...
>>> 
>>>   - to enable the approach to live alongside
>>>      sbd it would be possible to make sbd use
>>>      the corosync-API as well for watchdog purposes
>>>      instead of opening the watchdog directly
>>> 
>>>      This shouldn't be a big deal for sbd used to
>>>      observe a pacemaker-node as cluster-watcher
>>>      (the part of sbd that sends cpg-pings to corosync)
>>>      already builds against corosync.
>>>      The blockdevice-part of sbd being basically
>>>      generic it might be an issue though.
>>> 
>>>   Regards,
>>>   Klaus
>>> 
 
 
   Best Regards,
   Hideo Yamauchi.
 
 
   - Original Message -
>   From: "renayama19661...@ybb.ne.jp" 
>>  
>   To: "users@clusterlabs.org" 
> 
>   Cc:
>   Date: 2016/10/11, Tue 17:58
>   Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When 
> the 
>>  DC crmd is frozen, cluster decisions are delayed infinitely
> 
>   Hi Klaus,
> 
>   Thank you for comment.
> 
>   I make the patch which is prototype using WD service.
> 
>   Please wait a little.
> 
>   Best Regards,
>   Hideo Yamauchi.
> 
> 
> 
> 
>   - 

Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-10-11 Thread renayama19661014
Hi Klaus,

Thank you for comment.

I make the patch which is prototype using WD service.

Please wait a little.

Best Regards,
Hideo Yamauchi.




- Original Message -
> From: Klaus Wenninger 
> To: users@clusterlabs.org
> Cc: 
> Date: 2016/10/10, Mon 21:03
> Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC crmd is 
> frozen, cluster decisions are delayed infinitely
> 
> On 10/07/2016 11:10 PM, renayama19661...@ybb.ne.jp wrote:
>>  Hi All,
>> 
>>  Our user may not necessarily use sdb.
>> 
>>  I confirmed that there was a method using WD service of corosync as one 
> method not to use sdb.
>>  Pacemaker watches the process of pacemaker by WD service using CMAP and can 
> carry out watchdog.
> 
> Have to have a look at that...
> But if we establish some in-between-layer in pacemaker we could have this
> as one of the possibilities besides e.g. sbd (with enhanced API), going for
> a watchdog-device directly, ...
> 
>> 
>> 
>>  We can set up a patch of pacemaker.
> 
> Always helpful to discuss/clarify an idea once some code is available ...
> 
>>  Was the discussion of using WD service over so far?
> 
> Not from my pov. Just a day off ;-)
> 
>> 
>> 
>>  Best Regard,
>>  Hideo Yamauchi.
>> 
>> 
>>  - Original Message -
>>>  From: Klaus Wenninger 
>>>  To: Ulrich Windl ; 
> users@clusterlabs.org
>>>  Cc: 
>>>  Date: 2016/10/7, Fri 17:47
>>>  Subject: Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: Re: When the DC 
> crmd is frozen, cluster decisions are delayed infinitely
>>> 
>>>  On 10/07/2016 08:14 AM, Ulrich Windl wrote:
>>>   Klaus Wenninger  schrieb am 
> 
>>>  06.10.2016 um 18:03 in
   Nachricht <3980cfdd-ebd9-1597-f6bd-a1ca808f7...@redhat.com>:
>   On 10/05/2016 04:22 PM, renayama19661...@ybb.ne.jp wrote:
>>   Hi All,
>> 
   If a user uses sbd, can the cluster evade a 
> problem of 
>>>  SIGSTOP of crmd?
>>>   
>>>   As pointed out earlier, maybe crmd should feed a 
> watchdog. Then 
>>>  stopping 
>   crmd 
>>>   will reboot the node (unless the watchdog fails).
>>   Thank you for comment.
>> 
>>   We examine watchdog of crmd, too.
>>   In addition, I comment after examination advanced.
>   Was thinking of doing a small test implementation going
>   a little in the direction Lars Ellenberg had been pointing 
> out.
> 
>   a couple of thoughts I had so far:
> 
>   - add an API (via DBus or libqb - favoring libqb atm) to sbd
>     an application can use to create a watchdog within sbd
   Why has it to be done within sbd?
>>>  Not necessarily, could be spawned out as well into an own project or
>>>  something already existent could be taken.
>>>  Remember to have added a dbus-interface to
>>>  https://sourceforge.net/projects/watchdog/ for a project once.
>>>  If you have a suggestion I'm open.
>>>  Going off sbd would have the advantage of a smooth start:
>>> 
>>>  - cluster/pacemaker-watcher are there already and can
>>>    be replaced/moved over time
>>>  - the lifecycle of the daemon (when started/stopped) is
>>>    already something that is in the code and in the people's minds
>>> 
>   - parameters for the first are a name and a timeout
> 
>   - first use-case would be crmd observation
> 
>   - later on we could think of removing pacemaker dependencies
>     from sbd by moving the actual implementation of
>     pacemaker-watcher and probably cluster-watcher as well
>     into pacemaker - using the new API
> 
>   - this of course creates sbd dependency within pacemaker so
>     that it would make sense to offer a simpler and 
> self-contained
>     implementation within pacemaker as an alternative
   I think the watchdog interface is so simple that you don't 
> need a relay 
>>>  for it. The only limit I can imagine is the number of watchdogs 
> available of 
>>>  some specific hardware.
>>>  That is the point ;-)
>     thus it would be favorable to have the dependency
>     within a non-compulsory pacemaker-rpm so that
>     we can offer an alternative that doesn't use sbd
>     at maybe the cost of being less reliable or one
>     that owns a hardware-watchdog by itself for systems
>     where this is still unused.
> 
>     - e.g. via some kind of plugin (Andrew forgive me -
>                                                      no pils ;-) 
> )
>     - or via an additional daemon
> 
>   What did you have in mind?
>   Maybe it makes sense to synchronize...
> 
>   Regards,
>   Klaus
>   
>>   Best Regards,
>>   Hideo Yamauchi.
>> 
>> 
>> 
>>   - Original Message -
>>>   From: Ulrich Windl 
> 
>>>   To: users@clusterlabs.org; renayama19661...@ybb.ne.jp 
>>>   Cc: 
>>>   Date: 

[ClusterLabs] [Problem] The crmd causes an error of xml.

2017-04-06 Thread renayama19661014
Hi All,

I confirmed a development edition of Pacemaker.
 - 
https://github.com/ClusterLabs/pacemaker/tree/71dbd128c7b0a923c472c8e564d33a0ba1816cb5


property no-quorum-policy="ignore" \
        stonith-enabled="true" \
        startup-fencing="false"

rsc_defaults resource-stickiness="INFINITY" \
        migration-threshold="INFINITY"

fencing_topology \
        rh73-01-snmp: prmStonith1-1 \
        rh73-02-snmp: prmStonith2-1

primitive prmDummy ocf:pacemaker:Dummy \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="10s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="fence"

primitive prmStonith1-1 stonith:external/ssh \
        params \
        pcmk_reboot_retries="1" \
        pcmk_reboot_timeout="40s" \
        hostlist="rh73-01-snmp" \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="ignore"

primitive prmStonith2-1 stonith:external/ssh \
        params \
        pcmk_reboot_retries="1" \
        pcmk_reboot_timeout="40s" \
        hostlist="rh73-02-snmp" \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="ignore"

### Resource Location ###
location rsc_location-1 prmDummy \
        rule  300: #uname eq rh73-01-snmp \
        rule  200: #uname eq rh73-02-snmp



I load the brief crm file shown above.
I then cause a resource failure in the cluster.
Then crmd produces an error.
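For reference, one way to produce such a resource failure with ocf:pacemaker:Dummy (an assumption on my part, not necessarily how it was done in this report) is to remove the agent's state file on the node running it, so that the next monitor reports the resource as stopped:

--
# Assumed reproduction step: make the Dummy monitor fail on the node running it
rm -f /var/run/Dummy-prmDummy.state
--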


(snip)
Apr  6 18:04:22 rh73-01-snmp pengine[5214]: warning: Calculated transition 4 
(with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-0.bz2
Apr  6 18:04:22 rh73-01-snmp crmd[5215]:   error: XML Error: Entity: line 1: 
parser error : Specification mandate value for attribute 
CRM_meta_fail_count_prmDummy
Apr  6 18:04:22 rh73-01-snmp crmd[5215]:   error: XML Error: rh73-01-snmp" 
on_node_uuid="3232238265">

pe-warn-0.bz2
Description: Binary data
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Problem] The crmd causes an error of xml.

2017-04-07 Thread renayama19661014
Hi Ken,

Thank you for the comment.

Okay!
I will wait for the fix.


Many thanks!
Hideo Yamauchi.


- Original Message -
> From: Ken Gaillot 
> To: users@clusterlabs.org
> Cc: 
> Date: 2017/4/8, Sat 05:04
> Subject: Re: [ClusterLabs] [Problem] The crmd causes an error of xml.
> 
> On 04/06/2017 08:49 AM, renayama19661...@ybb.ne.jp wrote:
>>  Hi All,
>> 
>>  I confirmed a development edition of Pacemaker.
>>   - 
> https://github.com/ClusterLabs/pacemaker/tree/71dbd128c7b0a923c472c8e564d33a0ba1816cb5
>> 
>>  
>>  property no-quorum-policy="ignore" \
>>          stonith-enabled="true" \
>>          startup-fencing="false"
>> 
>>  rsc_defaults resource-stickiness="INFINITY" \
>>          migration-threshold="INFINITY"
>> 
>>  fencing_topology \
>>          rh73-01-snmp: prmStonith1-1 \
>>          rh73-02-snmp: prmStonith2-1
>> 
>>  primitive prmDummy ocf:pacemaker:Dummy \
>>          op start interval="0s" timeout="60s" 
> on-fail="restart" \
>>          op monitor interval="10s" timeout="60s" 
> on-fail="restart" \
>>          op stop interval="0s" timeout="60s" 
> on-fail="fence"
>> 
>>  primitive prmStonith1-1 stonith:external/ssh \
>>          params \
>>          pcmk_reboot_retries="1" \
>>          pcmk_reboot_timeout="40s" \
>>          hostlist="rh73-01-snmp" \
>>          op start interval="0s" timeout="60s" 
> on-fail="restart" \
>>          op stop interval="0s" timeout="60s" 
> on-fail="ignore"
>> 
>>  primitive prmStonith2-1 stonith:external/ssh \
>>          params \
>>          pcmk_reboot_retries="1" \
>>          pcmk_reboot_timeout="40s" \
>>          hostlist="rh73-02-snmp" \
>>          op start interval="0s" timeout="60s" 
> on-fail="restart" \
>>          op stop interval="0s" timeout="60s" 
> on-fail="ignore"
>> 
>>  ### Resource Location ###
>>  location rsc_location-1 prmDummy \
>>          rule  300: #uname eq rh73-01-snmp \
>>          rule  200: #uname eq rh73-02-snmp
>> 
>>  
>> 
>>  I pour the following brief crm files.
>>  I produce the trouble of the resource in a cluster.
>>  Then crmd causes an error.
>> 
>>  
>>  (snip)
>>  Apr  6 18:04:22 rh73-01-snmp pengine[5214]: warning: Calculated transition 
> 4 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-0.bz2
>>  Apr  6 18:04:22 rh73-01-snmp crmd[5215]:   error: XML Error: Entity: line 
> 1: parser error : Specification mandate value for attribute 
> CRM_meta_fail_count_prmDummy
>>  Apr  6 18:04:22 rh73-01-snmp crmd[5215]:   error: XML Error: 
> rh73-01-snmp" on_node_uuid="3232238265"> CRM_meta_fail_count_prmDummy
>>  Apr  6 18:04:22 rh73-01-snmp crmd[5215]:   error: XML Error:                
>                                                                 ^
>>  Apr  6 18:04:22 rh73-01-snmp crmd[5215]:   error: XML Error: Entity: line 
> 1: parser error : attributes construct error
>>  Apr  6 18:04:22 rh73-01-snmp crmd[5215]:   error: XML Error: 
> rh73-01-snmp" on_node_uuid="3232238265"> CRM_meta_fail_count_prmDummy
>>  Apr  6 18:04:22 rh73-01-snmp crmd[5215]:   error: XML Error:                
>                                                                 ^
>>  Apr  6 18:04:22 rh73-01-snmp crmd[5215]:   error: XML Error: Entity: line 
> 1: parser error : Couldn't find end of Start Tag attributes line 1
>>  Apr  6 18:04:22 rh73-01-snmp crmd[5215]:   error: XML Error: 
> rh73-01-snmp" on_node_uuid="3232238265"> CRM_meta_fail_count_prmDummy
>>  Apr  6 18:04:22 rh73-01-snmp crmd[5215]:   error: XML Error:                
>                                                                 ^
>>  Apr  6 18:04:22 rh73-01-snmp crmd[5215]: warning: Parsing failed (domain=1, 
> level=3, code=73): Couldn't find end of Start Tag attributes line 1
>>  (snip)
>>  
>> 
>>  The XML that a new trouble count was related to somehow or other seems to 
> have a problem.
>> 
>>  I attach pe-warn-0.bz2.
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
> 
> Hi Hideo,
> 
> Thanks for the report!
> 
> This appears to be a PE bug when fencing is needed due to stop failure.
> It wasn't caught in regression testing because the PE will continue to
> use the old-style fail-count attribute if the DC does not support the
> new style, and existing tests obviously have older DCs. I definitely
> need to add some new tests.
> 
> I'm not sure why fail-count and last-failure are being added as
> meta-attributes in this case, or why incorrect XML syntax is being
> generated, but I'll investigate.
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: 

Re: [ClusterLabs] Antw: fence_vmware_soap: reads VM status but fails to reboot/on/off

2017-08-02 Thread renayama19661014
Hi Octavian,

Are you possibly using the free version of ESXi?
On the free version of ESXi, the power on/off operations fail.

The same phenomenon also occurs when connecting with virsh.

 - https://communities.vmware.com/thread/542433
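As an illustration (my own example, not from the thread linked above), the same read-only limitation of the free ESXi API can be seen through libvirt's esx driver; the host name, user and domain name are placeholders:

--
# Listing domains works even on free ESXi (the API is read-only there)
virsh -c 'esx://root@esxi-host/?no_verify=1' list --all

# A hard power-off ("destroy") is rejected on the free license, much like the
# fence agent's off action
virsh -c 'esx://root@esxi-host/?no_verify=1' destroy guest-vm
--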

Best Regards,
Hideo Yamauchi.
- Original Message -
>From: Octavian Ciobanu 
>To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
>Date: 2017/8/1, Tue 23:07
>Subject: Re: [ClusterLabs] Antw: fence_vmware_soap: reads VM status but fails 
>to reboot/on/off
> 
>
>Hey Marek,
>
>I've run the command with --action off and uploaded the file on one of our 
>servers : https://cloud.iwgate.com/index.php/s/1SpZlG8mBSR1dNE
>
>Interesting thing is that at the end of the file I found "Unable to 
>connect/login to fencing device" instead of "Failed: Timed out waiting to 
>power OFF"
>
>As information about my test rig:
> Host OS: VMware ESXi 6.5 Hypervisor
> Guest OS: Centos 7.3.1611 minimal with the latest updates
> Fence agents installed with yum : 
>    fence-agents-hpblade-4.0.11-47.el7_3.5.x86_64
>    fence-agents-rsa-4.0.11-47.el7_3.5.x86_64
>    fence-agents-ilo-moonshot-4.0.11-47.el7_3.5.x86_64
>    fence-agents-rhevm-4.0.11-47.el7_3.5.x86_64
>    fence-virt-0.3.2-5.el7.x86_64
>    fence-agents-mpath-4.0.11-47.el7_3.5.x86_64
>    fence-agents-ibmblade-4.0.11-47.el7_3.5.x86_64
>    fence-agents-ipdu-4.0.11-47.el7_3.5.x86_64
>    fence-agents-common-4.0.11-47.el7_3.5.x86_64
>    fence-agents-rsb-4.0.11-47.el7_3.5.x86_64
>    fence-agents-ilo-ssh-4.0.11-47.el7_3.5.x86_64
>    fence-agents-bladecenter-4.0.11-47.el7_3.5.x86_64
>    fence-agents-drac5-4.0.11-47.el7_3.5.x86_64
>    fence-agents-brocade-4.0.11-47.el7_3.5.x86_64
>    fence-agents-wti-4.0.11-47.el7_3.5.x86_64
>    fence-agents-compute-4.0.11-47.el7_3.5.x86_64
>    fence-agents-eps-4.0.11-47.el7_3.5.x86_64
>    fence-agents-cisco-ucs-4.0.11-47.el7_3.5.x86_64
>    fence-agents-intelmodular-4.0.11-47.el7_3.5.x86_64
>    fence-agents-eaton-snmp-4.0.11-47.el7_3.5.x86_64
>    fence-agents-cisco-mds-4.0.11-47.el7_3.5.x86_64
>    fence-agents-apc-snmp-4.0.11-47.el7_3.5.x86_64
>    fence-agents-ilo2-4.0.11-47.el7_3.5.x86_64
>    fence-agents-all-4.0.11-47.el7_3.5.x86_64
>    fence-agents-vmware-soap-4.0.11-47.el7_3.5.x86_64
>    fence-agents-ilo-mp-4.0.11-47.el7_3.5.x86_64
>    fence-agents-apc-4.0.11-47.el7_3.5.x86_64
>    fence-agents-emerson-4.0.11-47.el7_3.5.x86_64
>    fence-agents-ipmilan-4.0.11-47.el7_3.5.x86_64
>    fence-agents-ifmib-4.0.11-47.el7_3.5.x86_64
>    fence-agents-kdump-4.0.11-47.el7_3.5.x86_64
>    fence-agents-scsi-4.0.11-47.el7_3.5.x86_64
>
>Thank you
>
>
>
>On Tue, Aug 1, 2017 at 2:22 PM, Marek Grac  wrote:
>
>Hi,
>>
>>
>>> But when I call any of the power actions (on, off, reboot) I get "Failed:
 Timed out waiting to power OFF".

 I've tried with all the combinations of --power-timeout and --power-wait
 and same error without any change in the response time.

 Any ideas from where or how to fix this issue ?
>>>
>>
>>
>>No, you have used the right options and if they were high enough it should 
>>work. You can try to post verbose (anonymized) output and we can take a look 
>>at it more deeply. 
>>
>>>I suspect "power off" is actually a virtual press of the ACPI power button 
>>>(reboot likewise), so your VM tries to shut down cleanly. That could take 
>>>time, and it could hang (I guess). I don't use VMware, but maybe there's a 
>>>"reset" action that presses the virtual reset button of the virtual 
>>>hardware... ;-)
>>>
>>
>>
>>There should not be a fence agent that will do soft reboot. The 'reset' 
>>action does  power off/check status/power on so we are sure that machine was 
>>really down (of course unless --method cycle when 'reboot' button is used).
>>
>>m,
>>___
>>Users mailing list: Users@clusterlabs.org
>>http://lists.clusterlabs.org/mailman/listinfo/users
>>
>>Project Home: http://www.clusterlabs.org
>>Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>Bugs: http://bugs.clusterlabs.org
>>
>>
>
>___
>Users mailing list: Users@clusterlabs.org
>http://lists.clusterlabs.org/mailman/listinfo/users
>
>Project Home: http://www.clusterlabs.org
>Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>Bugs: http://bugs.clusterlabs.org
>
>
>

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [Problem and Question] If there are too many resources, pacemaker-controld restarts when re-Probe is executed.

2018-05-17 Thread renayama19661014
Hi All, 


I have built the following environment.
 * RHEL7.3@KVM
 * libqb-1.0.2
 * corosync 2.4.4
 * pacemaker 2.0-rc4

Start the cluster and load a crm file containing 180 Dummy resources.
Node 3 will not start.

--
[root@rh73-01 ~]# crm_mon -1                                    
Stack: corosync
Current DC: rh73-01 (version 2.0.0-3aa2fced22) - partition with quorum
Last updated: Thu May 17 18:44:39 2018
Last change: Thu May 17 18:44:18 2018 by root via cibadmin on rh73-01
 2 nodes configured
180 resources configured
 Online: [ rh73-01 rh73-02 ]
 Active resources:
 Resource Group: grpJOS1
 prmDummy1  (ocf::pacemaker:Dummy): Started rh73-01
(snip)

 prmDummy140        (ocf::pacemaker:Dummy): Started rh73-01
(snip)
 prmDummy160        (ocf::pacemaker:Dummy): Started rh73-02

--

Execute crm_resource -R after 120 resources have started on the cluster.
--
[root@rh73-01 ~]# crm_resource -R      
Waiting for 1 replies from the controller. OK
--
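As a side note (my own suggestion, not something verified against this report), the amount of re-probe traffic can be reduced by limiting the refresh to one resource and node; crm_resource accepts --resource and --node together with --refresh:

--
# Refresh (re-probe) a single resource on a single node instead of everything
crm_resource --refresh --resource prmDummy1 --node rh73-01
--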

I tried the following 3 patterns.

***

Pattern 1) When /etc/sysconfig/pacemaker is set as follows.
--@/etc/sysconfig/pacemaker
PCMK_logfacility=local1
PCMK_logpriority=info
--

After a while, the crmd (pacemaker-controld) on the DC node fails and restarts.

[root@rh73-01 ~]# ps -ef |grep pace
root      6751     1  0 18:43 ?        00:00:00 /usr/sbin/pacemakerd -f
haclust+  6752  6751  2 18:43 ?        00:00:16 /usr/libexec/pacemaker/pacemaker-based
root      6753  6751  0 18:43 ?        00:00:01 /usr/libexec/pacemaker/pacemaker-fenced
root      6754  6751  0 18:43 ?        00:00:02 /usr/libexec/pacemaker/pacemaker-execd
haclust+  6755  6751  0 18:43 ?        00:00:00 /usr/libexec/pacemaker/pacemaker-attrd
haclust+  6756  6751  0 18:43 ?        00:00:00 /usr/libexec/pacemaker/pacemaker-schedulerd
haclust+ 20478  6751  0 18:50 ?        00:00:00 /usr/libexec/pacemaker/pacemaker-controld
root     25552  1302  0 18:52 pts/0    00:00:00 grep --color=auto pace



Pattern 2) In order to avoid problems, I made the following settings.
--@/etc/sysconfig/pacemaker
PCMK_logfacility=local1
PCMK_logpriority=info
PCMK_cib_timeout=120
PCMK_ipc_buffer=262144
--@crm file.
(snip)
property cib-bootstrap-options: \
        cluster-ipc-limit=2000 \
(snip)
-- 

Just like pattern 1, after a while the crmd (pacemaker-controld) on the DC node fails and restarts.

[root@rh73-01 ~]# ps -ef | grep pace
root      3840     1  0 18:57 ?        00:00:00 /usr/sbin/pacemakerd -f
haclust+  3841  3840  3 18:57 ?        00:00:16 /usr/libexec/pacemaker/pacemaker-based
root      3842  3840  0 18:57 ?        00:00:01 /usr/libexec/pacemaker/pacemaker-fenced
root      3843  3840  0 18:57 ?        00:00:01 /usr/libexec/pacemaker/pacemaker-execd
haclust+  3844  3840  0 18:57 ?        00:00:00 /usr/libexec/pacemaker/pacemaker-attrd
haclust+  3845  3840  0 18:57 ?        00:00:00 /usr/libexec/pacemaker/pacemaker-schedulerd
haclust+  6221  3840  0 19:00 ?        00:00:00 /usr/libexec/pacemaker/pacemaker-controld
root     17974  1302  0 19:05 pts/0    00:00:00 grep --color=auto pace



Pattern 3) In order to avoid problems, I made the following 

[ClusterLabs] [Question and Request] QUERY behavior of glue's plugin.

2018-09-01 Thread renayama19661014
Hi All,

The behavior of glue-based STONITH plugins such as external/ipmi has changed
since PM 1.1.16.
Up to PM 1.1.15, "status" was executed for the STONITH QUERY.
For PM 1.1.16 and later, "list" is executed.

This is due to the following changes.
 - 
https://github.com/ClusterLabs/pacemaker/commit/3f2d1b1302adc40d9647e854187b7a85bd38f8fb

We want to use the same status behavior as PM 1.1.15.

I checked the source code; the relevant part looks like the following.

---
static const char *
target_list_type(stonith_device_t * dev)
{
    const char *check_type = NULL;

    check_type = g_hash_table_lookup(dev->params, STONITH_ATTR_HOSTCHECK);

    if (check_type == NULL) {

        if (g_hash_table_lookup(dev->params, STONITH_ATTR_HOSTLIST)) {
            check_type = "static-list";
        } else if (g_hash_table_lookup(dev->params, STONITH_ATTR_HOSTMAP)) {
            check_type = "static-list";
        } else if(is_set(dev->flags, st_device_supports_list)){
            check_type = "dynamic-list";
        } else if(is_set(dev->flags, st_device_supports_status)){
            check_type = "status";
        } else {
            check_type = "none";
        }
    }

    return check_type;
}
---

We have made the following settings in order to execute "status" even after PM 
1.1.16.
Is this setting correct?

(snip)
        params \
                pcmk_host_check="status" \
(snip)
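For context, a fuller sketch of such a configuration (my own illustration based on the external/ssh fencing resources from an earlier thread, not a confirmed recommendation) might look like this:

--
primitive prmStonith1-1 stonith:external/ssh \
        params \
                pcmk_host_check="status" \
                hostlist="rh73-01-snmp" \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="3600s" timeout="60s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="ignore"
--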

Also, if this setting is correct, the "status" value is not described in the documentation.

 - 
http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_differences_of_stonith_resources.html
 - 
http://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html/Pacemaker_Explained/_special_treatment_of_stonith_resources.html

Can you add a description such as "status" to the document?

Best Regards,
Hideo Yamauchi.

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [Problem]The pengine core dumps when changing attributes of bundle.

2018-03-09 Thread renayama19661014
Hi All,

I was checking the operation of Bundle with Pacemaker version 2.0.0-9cd0f6cb86.
When a Bundle resource is configured in Pacemaker and its attributes are changed,
pengine core dumps.

Step1) Start Pacemaker and load the settings. (The replicas and
replicas-per-host are set to 1.)

[root@rh74-test ~]# cibadmin --modify --allow-create --scope resources -X '
(XML not preserved in the archive)
'

Step2) Bundle is configured.

[root@rh74-test ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh74-test (version 2.0.0-9cd0f6cb86) - partition WITHOUT quorum
Last updated: Fri Mar  9 10:09:20 2018
Last change: Fri Mar  9 10:06:30 2018 by root via cibadmin on rh74-test

2 nodes configured
4 resources configured

Online: [ rh74-test ]
GuestOnline: [ httpd-bundle-0@rh74-test ]

Active resources:

 Docker container: httpd-bundle [pcmktest:http]
   httpd-bundle-0 (192.168.20.188)     (ocf::heartbeat:apache):        Started rh74-test

Node Attributes:
* Node httpd-bundle-0@rh74-test:
* Node rh74-test:

Migration Summary:
* Node rh74-test:
* Node httpd-bundle-0@rh74-test:

Step3) Change attributes of bundle with cibadmin command. (The replicas and
replicas-per-host change to 3.)

[root@rh74-test ~]# cibadmin --modify -X ''

Step4) The pengine core dumps.

(snip)
Mar  9 10:10:21 rh74-test pengine[17726]:  notice: On loss of quorum: Ignore
Mar  9 10:10:21 rh74-test pengine[17726]:    info: Node rh74-test is online
Mar  9 10:10:21 rh74-test crmd[17727]:   error: Connection to pengine failed
Mar  9 10:10:21 rh74-test crmd[17727]:   error: Connection to pengine[0x55f2d068bfb0] closed (I/O condition=25)
Mar  9 10:10:21 rh74-test pacemakerd[17719]:   error: Managed process 17726 (pengine) dumped core
Mar  9 10:10:21 rh74-test pacemakerd[17719]:   error: pengine[17726] terminated with signal 11 (core=1)
Mar  9 10:10:21 rh74-test pacemakerd[17719]:  notice: Respawning failed child process: pengine
Mar  9 10:10:21 rh74-test pacemakerd[17719]:    info: Using uid=990 and group=984 for process pengine
Mar  9 10:10:21 rh74-test pacemakerd[17719]:    info: Forked child 19275 for process pengine
(snip)

This event reproduces 100 percent.

Apparently the problem is due to the different handling of the clone (httpd)
resources inside the Bundle resource.

- I registered this content with the following Bugzilla.
(https://bugs.clusterlabs.org/show_bug.cgi?id=5337)

Best Regards
Hideo Yamauchi.
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Problem]The pengine core dumps when changing attributes of bundle.

2018-03-09 Thread renayama19661014
Hi All, 

[Sorry, there was a defect in the line breaks; sending again.]

I was checking the operation of Bundle with Pacemaker version 2.0.0-9cd0f6cb86.
When a Bundle resource is configured in Pacemaker and its attributes are changed,
pengine core dumps.

Step1) Start Pacemaker and load the settings. (The replicas and
replicas-per-host are set to 1.)

[root@rh74-test ~]# cibadmin --modify --allow-create --scope resources -X '
(XML not preserved in the archive)
'

Step2) Bundle is configured. 

[root@rh74-test ~]# crm_mon -1 -Af
Stack: corosync
Current DC: rh74-test (version 2.0.0-9cd0f6cb86) - partition WITHOUT quorum
Last updated: Fri Mar  9 10:09:20 2018
Last change: Fri Mar  9 10:06:30 2018 by root via cibadmin on rh74-test

2 nodes configured
4 resources configured

Online: [ rh74-test ]
GuestOnline: [ httpd-bundle-0@rh74-test ]

Active resources:

 Docker container: httpd-bundle [pcmktest:http]
   httpd-bundle-0 (192.168.20.188)     (ocf::heartbeat:apache):        Started rh74-test

Node Attributes:
* Node httpd-bundle-0@rh74-test:
* Node rh74-test:

Migration Summary:
* Node rh74-test:
* Node httpd-bundle-0@rh74-test:

Step3) Change attributes of bundle with cibadmin command. (The replicas and 
replicas-per-host change to 3.)


[root@rh74-test ~]# cibadmin --modify -X '' 
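The XML payload of this command was stripped by the mailing-list archive. A hypothetical example of the kind of modification described in the text (changing replicas and replicas-per-host of the httpd-bundle to 3; the attribute names follow the Pacemaker bundle schema, but the exact original XML is unknown) might look like:

--
[root@rh74-test ~]# cibadmin --modify -X '<bundle id="httpd-bundle"><docker image="pcmktest:http" replicas="3" replicas-per-host="3"/></bundle>'
--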

Step4) The pengine core dumps.

(snip)
Mar  9 10:10:21 rh74-test pengine[17726]:  notice: On loss of quorum: Ignore
Mar  9 10:10:21 rh74-test pengine[17726]:    info: Node rh74-test is online
Mar  9 10:10:21 rh74-test crmd[17727]:  error: Connection to pengine failed
Mar  9 10:10:21 rh74-test crmd[17727]:  error: Connection to 
pengine[0x55f2d068bfb0] closed (I/O condition=25)
Mar  9 10:10:21 rh74-test pacemakerd[17719]:  error: Managed process 17726 
(pengine) dumped core
Mar  9 10:10:21 rh74-test pacemakerd[17719]:  error: pengine[17726] terminated 
with signal 11 (core=1)
Mar  9 10:10:21 rh74-test pacemakerd[17719]:  notice: Respawning failed child 
process: pengine
Mar  9 10:10:21 rh74-test pacemakerd[17719]:    info: Using uid=990 and 
group=984 for process pengine
Mar  9 10:10:21 rh74-test pacemakerd[17719]:    info: Forked child 19275 for 
process pengine
(snip) 

This event reproduces 100 percent. 

Apparently the problem is due to the different handling of the clone (httpd)
resources inside the Bundle resource.

- I registered this content with the following Bugzilla.
(https://bugs.clusterlabs.org/show_bug.cgi?id=5337)

Best Regards
Hideo Yamauchi.

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [Problem] The crmd fails to connect with pengine.

2018-12-27 Thread renayama19661014
Hi All,

This problem occurred with our users.

The following problem occurred in a two-node cluster that does not set STONITH.

The problem seems to have occurred in the following procedure.

Step 1) Configure the cluster with 2 nodes. The DC node is the second node.
Step 2) Several resources are running on the first node.
Step 3) Both nodes are stopped at almost the same time, the second node first and then the first node.
Step 4) After the second node stops, the first node tries to calculate the state transition for stopping its resources.

However, crmd fails to connect with pengine and does not calculate state 
transitions.

-
Dec 27 08:36:00 rh74-01 crmd[12997]: warning: Setup of client connection 
failed, not adding channel to mainloop
-

As a result, Pacemaker will stop without stopping the resource.

The problem seems to have occurred in the following environment.

 - libqb 1.0
 - corosync 2.4.1
 - Pacemaker 1.1.15

I tried to reproduce this problem, but for now it can not be reproduced.

Do you know the cause of this problem?

Best Regards,
Hideo Yamacuhi.
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Problem] The crmd fails to connect with pengine.

2019-01-05 Thread renayama19661014
Hi Jan,
Hi Ken,

Thanks for your comment.

I am going to check a little more about the problem of libqb.


Many thanks,
Hideo Yamauchi.


- Original Message -
> From: Ken Gaillot 
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Cc: 
> Date: 2019/1/3, Thu 01:26
> Subject: Re: [ClusterLabs] [Problem] The crmd fails to connect with pengine.
> 
> On Wed, 2019-01-02 at 15:43 +0100, Jan Pokorný wrote:
>>  On 28/12/18 05:51 +0900, renayama19661...@ybb.ne.jp wrote:
>>  > This problem occurred with our users.
>>  > 
>>  > The following problem occurred in a two-node cluster that does not
>>  > set STONITH.
>>  > 
>>  > The problem seems to have occurred in the following procedure.
>>  > 
>>  > Step 1) Configure the cluster with 2 nodes. The DC node is the
>>  > second node.
>>  > Step 2) Several resources are running on the first node.
>>  > Step 3) It stops almost at the same time in order of 2nd node and
>>  > 1st node.
>> 
>>  Do I decipher the above correctly that the cluster is scheduled for
>>  shutdown (fully independently node by node or through a single
>>  trigger
>>  with a high level management tool?) and starts proceeding in serial
>>  manner, shutting 2nd node ~ original DC first?
>> 
>>  > Step 4) After the second node stops, the first node tries to
>>  >         calculate the state transition for the resource stop.
>>  > 
>>  > However, crmd fails to connect with pengine and does not calculate
>>  > state transitions.
>>  > 
>>  > -
>>  > Dec 27 08:36:00 rh74-01 crmd[12997]: warning: Setup of client
>>  > connection failed, not adding channel to mainloop
>>  > -
>> 
>>  Sadly, it looks like details of why this happened would only be
>>  retained when debugging/tracing verbosity of the log messages
>>  was enabled, which likely wasn't the case.
>> 
>>  Anyway, perhaps providing a wider context of the log messages
>>  from this first node might shed some light into this.
> 
> Agreed, that's probably the only hope.
> 
> This would have to be a low-level issue like an out-of-memory error, or
> something at the libqb level.
> 
>>  > As a result, Pacemaker will stop without stopping the resource.
>> 
>>  This might have serious consequences in some scenarios, perhaps
>>  unless some watchdog-based solution (SBD?) was used as a fencing
>>  of choice since it would not get defused just as the resource
>>  wasn't stopped, I think...
> 
> Yep, this is unavoidable in this situation. If the last node standing
> has an unrecoverable problem, there's no other node remaining to fence
> it and recover.
> 
>>  > The problem seems to have occurred in the following environment.
>>  > 
>>  >  - libqb 1.0
>>  >  - corosync 2.4.1
>>  >  - Pacemaker 1.1.15
>>  > 
>>  > I tried to reproduce this problem, but for now it can not be
>>  > reproduced.
>>  > 
>>  > Do you know the cause of this problem?
>> 
>>  No idea at this point.
> -- 
> Ken Gaillot 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [Problem] If the token becomes unstable, the transient_attributes of all nodes disappear.

2018-12-18 Thread renayama19661014
Hi All,

In a cluster that does not use STONITH, the nodes ended up erasing each other's attributes.
The problem occurs when the CPU load goes up and the corosync token does not stabilize.

I confirmed that the problem will occur with a simple configuration.

Step1) Configure the cluster.

[root@rh74-01 ~]# crm_mon -1 -Af
(snip)
Online: [ rh74-01 rh74-02 ]


Active resources:
 Clone Set: clnPing [prmPing]

     Started: [ rh74-01 rh74-02 ]
Node Attributes:

* Node rh74-01:
    + default_ping_set                  : 100       
* Node rh74-02:
    + default_ping_set                  : 100       
Migration Summary:

* Node rh74-01:
* Node rh74-02:


Step2) Put a heavy load on the CPU of node 2, making the token unstable.

[root@rh74-02 ~]# stress -c 2 --timeout 2s


Step3) Each node deletes the other node's attributes, and when the cluster recovers partway through, the transient_attributes of all nodes end up deleted.


 ha-log.extract
(snip)
Dec 18 14:05:47 rh74-01 cib[21140]:    info: Completed cib_delete operation for 
section //node_state[@uname='rh74-01']/transient_attributes: OK (rc=0, 
origin=rh74-02/crmd/16, version=0.5.35)
(snip)
Dec 18 14:05:49 rh74-01 pengine[21144]:  notice: On loss of CCM Quorum: Ignore
Dec 18 14:05:49 rh74-01 pengine[21144]:    info: Node rh74-01 is online
Dec 18 14:05:49 rh74-01 pengine[21144]:    info: Node rh74-02 is online
Dec 18 14:05:49 rh74-01 pengine[21144]:    info: Node 1 is already processed
Dec 18 14:05:49 rh74-01 pengine[21144]:    info: Node 2 is already processed
Dec 18 14:05:49 rh74-01 pengine[21144]:    info: Node 1 is already processed
Dec 18 14:05:49 rh74-01 pengine[21144]:    info: Node 2 is already processed
Dec 18 14:05:49 rh74-01 pengine[21144]:    info:  Clone Set: clnPing [prmPing]
Dec 18 14:05:49 rh74-01 pengine[21144]:    info:      Started: [ rh74-01 
rh74-02 ]
Dec 18 14:05:49 rh74-01 pengine[21144]:    info: Leave   prmPing:0#011(Started 
rh74-01)
Dec 18 14:05:49 rh74-01 pengine[21144]:    info: Leave   prmPing:1#011(Started 
rh74-02)
Dec 18 14:05:49 rh74-01 pengine[21144]:  notice: Calculated transition 8, 
saving inputs in /var/lib/pacemaker/pengine/pe-input-1.bz2
Dec 18 14:05:49 rh74-01 cib[21387]: warning: Could not verify cluster 
configuration file /var/lib/pacemaker/cib/cib.xml: No such file or directory (2)
Dec 18 14:05:49 rh74-01 crmd[21145]:    info: State transition S_POLICY_ENGINE 
-> S_TRANSITION_ENGINE
Dec 18 14:05:49 rh74-01 crmd[21145]:    info: Processing graph 8 
(ref=pe_calc-dc-1545109549-63) derived from 
/var/lib/pacemaker/pengine/pe-input-1.bz2
Dec 18 14:05:49 rh74-01 crmd[21145]:  notice: Transition 8 (Complete=0, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-1.bz2): Complete
Dec 18 14:05:49 rh74-01 crmd[21145]:    info: Input I_TE_SUCCESS received in 
state S_TRANSITION_ENGINE from notify_crmd
Dec 18 14:05:49 rh74-01 crmd[21145]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE

 pe-input-1
(snip)
(XML not preserved in the archive)
(snip)


With this simple configuration no harm is done, but if a resource has a
constraint that depends on the attribute, that resource is stopped at the time
of pe-input-1.

To avoid the problem, the way in which a node's own attributes are deleted by
the peer node needs to be examined.
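For illustration, a constraint of the kind described above might look like the following (my own hypothetical example, not from the report; prmDummy is an assumed resource name). It ties a resource to the ping attribute shown in the crm_mon output, so if transient_attributes are erased, default_ping_set becomes undefined and the resource is stopped:

--
location loc-ping prmDummy \
  rule -inf: not_defined default_ping_set or default_ping_set lt 100
--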

I confirmed this problem with Pacemaker 1.1.19.
The same problem has also been reported by users using Pacemaker 1.1.17.

 * The crm_report file is attached.
 * The attached log was acquired after applying the high load.
 * https://bugs.clusterlabs.org/show_bug.cgi?id=5375

Best Regards,
Hideo Yamauchi.
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [Problem] Remote resource does not move when bundle resource moves.

2018-12-06 Thread renayama19661014
Hi All,

We have confirmed a slightly strange configuration of the bundle.
There is only one bundle resource, and it has an association with a group
resource.
The operation was confirmed in PM 1.1.19.

Step1) Configure the cluster.

[root@cent7-host1 ~]# crm_mon -R
Defaulting to one-shot mode
You need to have curses available at compile time to enable console mode
Stack: corosync
Current DC: cent7-host2 (3232262829) (version 1.1.19-c3c624ea3d) - partition 
with quorum
Last updated: Thu Dec  6 13:20:21 2018
Last change: Thu Dec  6 13:20:05 2018 by root via cibadmin on cent7-host1

4 nodes configured
10 resources configured

Online: [ cent7-host1 (3232262828) cent7-host2 (3232262829) ]
GuestOnline: [ httpd-bundle1-0@cent7-host1 httpd-bundle2-0@cent7-host2 ]

Active resources:

Resource Group: group1
dummy1 (ocf::pacemaker:Dummy): Started cent7-host1
Resource Group: group2
dummy2 (ocf::pacemaker:Dummy): Started cent7-host2
Docker container: httpd-bundle1 [pcmktest:http]
httpd-bundle1-ip-192.168.20.188   (ocf::heartbeat:IPaddr2):   Started cent7-host1
httpd-bundle1-docker-0(ocf::heartbeat:docker):Started cent7-host1
httpd-bundle1-0   (ocf::pacemaker:remote):Started cent7-host1
httpd1(ocf::heartbeat:apache):Started httpd-bundle1-0
Docker container: httpd-bundle2 [pcmktest:http]
httpd-bundle2-ip-192.168.20.190   (ocf::heartbeat:IPaddr2):   Started cent7-host2
httpd-bundle2-docker-0(ocf::heartbeat:docker):Started cent7-host2
httpd-bundle2-0   (ocf::pacemaker:remote):Started cent7-host2
httpd2(ocf::heartbeat:apache):Started httpd-bundle2-0

Step2) Once we have cent7-host1 as standby, move the resource to cent7-host2.

[root@cent7-host1 ~]# crm_standby -v on
[root@cent7-host1 ~]# crm_mon -R
Defaulting to one-shot mode
You need to have curses available at compile time to enable console mode
Stack: corosync
Current DC: cent7-host2 (3232262829) (version 1.1.19-c3c624ea3d) - partition with quorum
Last updated: Thu Dec  6 13:21:36 2018
Last change: Thu Dec  6 13:21:17 2018 by root via crm_attribute on cent7-host1

4 nodes configured
10 resources configured

Node cent7-host1 (3232262828): standby
Online: [ cent7-host2 (3232262829) ]
GuestOnline: [ httpd-bundle1-0@cent7-host2 httpd-bundle2-0@cent7-host2 ]

Active resources:

Resource Group: group1
dummy1 (ocf::pacemaker:Dummy): Started cent7-host2
Resource Group: group2
dummy2 (ocf::pacemaker:Dummy): Started cent7-host2
Docker container: httpd-bundle1 [pcmktest:http]
httpd-bundle1-ip-192.168.20.188   (ocf::heartbeat:IPaddr2):   Started cent7-host2
httpd-bundle1-docker-0(ocf::heartbeat:docker):Started cent7-host2
httpd-bundle1-0   (ocf::pacemaker:remote):Started cent7-host2
httpd1(ocf::heartbeat:apache):Started httpd-bundle1-0
Docker container: httpd-bundle2 [pcmktest:http]
httpd-bundle2-ip-192.168.20.190   (ocf::heartbeat:IPaddr2):   Started cent7-host2
httpd-bundle2-docker-0(ocf::heartbeat:docker):Started cent7-host2
httpd-bundle2-0   (ocf::pacemaker:remote):Started cent7-host2
httpd2(ocf::heartbeat:apache):Started httpd-bundle2-0

Step3) Release standby of cent7-host1.

[root@cent7-host1 ~]# crm_standby -v off
[root@cent7-host1 ~]# crm_mon -R
Defaulting to one-shot mode
You need to have curses available at compile time to enable console mode
Stack: corosync
Current DC: cent7-host2 (3232262829) (version 1.1.19-c3c624ea3d) - partition with quorum
Last updated: Thu Dec  6 13:21:59 2018
Last change: Thu Dec  6 13:21:56 2018 by root via crm_attribute on cent7-host1

4 nodes configured
10 resources configured

Online: [ cent7-host1 (3232262828) cent7-host2 (3232262829) ]
GuestOnline: [ httpd-bundle1-0@cent7-host2 httpd-bundle2-0@cent7-host2 ]

Active resources:

Resource Group: group1
dummy1 (ocf::pacemaker:Dummy): Started cent7-host2
Resource Group: group2
dummy2 (ocf::pacemaker:Dummy): Started cent7-host2
Docker container: httpd-bundle1 [pcmktest:http]
httpd-bundle1-ip-192.168.20.188   (ocf::heartbeat:IPaddr2):   Started cent7-host2
httpd-bundle1-docker-0(ocf::heartbeat:docker):Started cent7-host2
httpd-bundle1-0   (ocf::pacemaker:remote):Started cent7-host2
httpd1(ocf::heartbeat:apache):Started httpd-bundle1-0
Docker container: httpd-bundle2 [pcmktest:http]
httpd-bundle2-ip-192.168.20.190   (ocf::heartbeat:IPaddr2):   Started cent7-host2
httpd-bundle2-docker-0(ocf::heartbeat:docker):Started cent7-host2
httpd-bundle2-0   (ocf::pacemaker:remote):Started cent7-host2
httpd2(ocf::heartbeat:apache):Started httpd-bundle2-0

Step4) Move the group 1 resource and also return the bundle resource to cent7-host1.

Re: [ClusterLabs] [Problem] Remote resource does not move when bundle resource moves.

2018-12-08 Thread renayama19661014
Hi All,

Sorry...

I made a mistake with the line breaks, so I am sending it again.

---

Hi All,


We have confirmed a slightly strange configuration of the bundle.
There is only one bundle resource, and it has an association with a group 
resource.
The operation was confirmed in PM 1.1.19.
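
The constraint definitions were not included in this post. As an illustration
only (hypothetical IDs, crm shell syntax), the kind of association described
here would be something like:

colocation col-httpd-bundle1-with-group1 inf: httpd-bundle1 group1
order order-group1-then-httpd-bundle1 inf: group1 httpd-bundle1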

Step1) Configure the cluster.

[root@cent7-host1 ~]# crm_mon -R
Defaulting to one-shot mode
You need to have curses available at compile time to enable console mode
Stack: corosync
Current DC: cent7-host2 (3232262829) (version 1.1.19-c3c624ea3d) - partition 
with quorum
Last updated: Thu Dec  6 13:20:21 2018
Last change: Thu Dec  6 13:20:05 2018 by root via cibadmin on cent7-host1

4 nodes configured
10 resources configured

Online: [ cent7-host1 (3232262828) cent7-host2 (3232262829) ]
GuestOnline: [ httpd-bundle1-0@cent7-host1 httpd-bundle2-0@cent7-host2 ]

Active resources:

Resource Group: group1
dummy1 (ocf::pacemaker:Dummy): Started cent7-host1
Resource Group: group2
dummy2 (ocf::pacemaker:Dummy): Started cent7-host2
Docker container: httpd-bundle1 [pcmktest:http]
httpd-bundle1-ip-192.168.20.188   (ocf::heartbeat:IPaddr2):   Started 
cent7-host1
httpd-bundle1-docker-0(ocf::heartbeat:docker):Started cent7-host1
httpd-bundle1-0   (ocf::pacemaker:remote):Started cent7-host1
httpd1(ocf::heartbeat:apache):Started httpd-bundle1-0
Docker container: httpd-bundle2 [pcmktest:http]
httpd-bundle2-ip-192.168.20.190   (ocf::heartbeat:IPaddr2):   Started 
cent7-host2
httpd-bundle2-docker-0(ocf::heartbeat:docker):Started cent7-host2
httpd-bundle2-0   (ocf::pacemaker:remote):Started cent7-host2
httpd2(ocf::heartbeat:apache):Started httpd-bundle2-0


Step2) Once we have cent7-host1 as standby, move the resource to cent7-host2.


[root@cent7-host1 ~]# crm_standby -v on
[root@cent7-host1 ~]# crm_mon -R
Defaulting to one-shot mode
You need to have curses available at compile time to enable console mode
Stack: corosync
Current DC: cent7-host2 (3232262829) (version 1.1.19-c3c624ea3d) - partition 
with quorum
Last updated: Thu Dec  6 13:21:36 2018
Last change: Thu Dec  6 13:21:17 2018 by root via crm_attribute on cent7-host1

4 nodes configured
10 resources configured

Node cent7-host1 (3232262828): standby
Online: [ cent7-host2 (3232262829) ]
GuestOnline: [ httpd-bundle1-0@cent7-host2 httpd-bundle2-0@cent7-host2 ]

Active resources:

Resource Group: group1
dummy1 (ocf::pacemaker:Dummy): Started cent7-host2
Resource Group: group2
dummy2 (ocf::pacemaker:Dummy): Started cent7-host2
Docker container: httpd-bundle1 [pcmktest:http]
httpd-bundle1-ip-192.168.20.188   (ocf::heartbeat:IPaddr2):   Started 
cent7-host2
httpd-bundle1-docker-0(ocf::heartbeat:docker):Started cent7-host2
httpd-bundle1-0   (ocf::pacemaker:remote):Started cent7-host2
httpd1(ocf::heartbeat:apache):Started httpd-bundle1-0
Docker container: httpd-bundle2 [pcmktest:http]
httpd-bundle2-ip-192.168.20.190   (ocf::heartbeat:IPaddr2):   Started 
cent7-host2
httpd-bundle2-docker-0(ocf::heartbeat:docker):Started cent7-host2
httpd-bundle2-0   (ocf::pacemaker:remote):Started cent7-host2
httpd2(ocf::heartbeat:apache):Started httpd-bundle2-0


Step3) Release standby of cent7-host1.

[root@cent7-host1 ~]# crm_standby -v off 
[root@cent7-host1 ~]# crm_mon -R
Defaulting to one-shot mode
You need to have curses available at compile time to enable console mode
Stack: corosync
Current DC: cent7-host2 (3232262829) (version 1.1.19-c3c624ea3d) - partition 
with quorum
Last updated: Thu Dec  6 13:21:59 2018
Last change: Thu Dec  6 13:21:56 2018 by root via crm_attribute on cent7-host1

4 nodes configured
10 resources configured

Online: [ cent7-host1 (3232262828) cent7-host2 (3232262829) ]
GuestOnline: [ httpd-bundle1-0@cent7-host2 httpd-bundle2-0@cent7-host2 ]

Active resources:

Resource Group: group1
dummy1 (ocf::pacemaker:Dummy): Started cent7-host2
Resource Group: group2
dummy2 (ocf::pacemaker:Dummy): Started cent7-host2
Docker container: httpd-bundle1 [pcmktest:http]
httpd-bundle1-ip-192.168.20.188   (ocf::heartbeat:IPaddr2):   Started 
cent7-host2
httpd-bundle1-docker-0(ocf::heartbeat:docker):Started cent7-host2
httpd-bundle1-0   (ocf::pacemaker:remote):Started cent7-host2
httpd1(ocf::heartbeat:apache):Started httpd-bundle1-0
Docker container: httpd-bundle2 [pcmktest:http]
httpd-bundle2-ip-192.168.20.190   (ocf::heartbeat:IPaddr2):   Started 
cent7-host2
httpd-bundle2-docker-0(ocf::heartbeat:docker):Started cent7-host2
httpd-bundle2-0   (ocf::pacemaker:remote):Started cent7-host2
httpd2(ocf::heartbeat:apache):Started httpd-bundle2-0


Step4) Move the group 1 resource and also return the bundle resource to 
cent7-host1.

[root@cent7-host1 ~]# crm_resource -M -r group1 -H cent7-host1 -f -Q
[root@cent7-host1 ~]# 

Re: [ClusterLabs] Pending Fencing Actions shown in pcs status

2021-01-11 Thread renayama19661014
Hi Steffen,

I've been experimenting with it since last weekend, but I haven't been able to 
reproduce the same situation.
It seems that the reproduction procedure cannot be narrowed down, so the cause
is still unclear.

Could you attach a log from when the problem occurred?

Best Regards,
Hideo Yamauchi.


- Original Message -
> From: Klaus Wenninger 
> To: Steffen Vinther Sørensen ; Cluster Labs - All topics 
> related to open-source clustering welcomed 
> Cc: 
> Date: 2021/1/7, Thu 21:42
> Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
> 
> On 1/7/21 1:13 PM, Steffen Vinther Sørensen wrote:
>>  Hi Klaus,
>> 
>>  Yes then the status does sync to the other nodes. Also it looks like
>>  there are some hostname resolving problems in play here, maybe causing
>>  problems,  here is my notes from restarting pacemaker etc.
> Don't think there are hostname resolving problems.
> The messages you are seeing, that look as if, are caused
> by using -EHOSTUNREACH as error-code to fail a pending
> fence action when a node that is just coming up sees
> a pending action that is claimed to be handled by itself.
> Back then I chose that error-code as there was none
> that really matched available right away and it was
> urgent for some reason so that introduction of something
> new was too risky at that state.
> Probably would make sense to introduce something that
> is more descriptive.
> Back then the issue was triggered by fenced crashing and
> being restarted - so not a node-restart but just fenced
> restarting.
> And it looks as if building the failed-message failed somehow.
> So that could be the reason why the pending action persists.
> Would be something else then what we solved with Bug 5401.
> But what triggers the logs below might as well just be a
> follow-up issue after the Bug 5401 thing.
> Will try to find time for a deeper look later today.
> 
> Klaus
>> 
>>  pcs cluster standby kvm03-node02.avigol-gcs.dk
>>  pcs cluster stop kvm03-node02.avigol-gcs.dk
>>  pcs status
>> 
>>  Pending Fencing Actions:
>>  * reboot of kvm03-node02.avigol-gcs.dk pending: client=crmd.37819,
>>  origin=kvm03-node03.avigol-gcs.dk
>> 
>>  # From logs on all 3 nodes:
>>  Jan 07 12:48:18 kvm03-node03 stonith-ng[37815]:  warning: received
>>  pending action we are supposed to be the owner but it's not in our
>>  records -> fail it
>>  Jan 07 12:48:18 kvm03-node03 stonith-ng[37815]:    error: Operation
>>  'reboot' targeting kvm03-node02.avigol-gcs.dk on  for
>>  crmd.37...@kvm03-node03.avigol-gcs.dk.56a3018c: No route to host
>>  Jan 07 12:48:18 kvm03-node03 stonith-ng[37815]:    error:
>>  stonith_construct_reply: Triggered assert at commands.c:2406 : request
>>  != NULL
>>  Jan 07 12:48:18 kvm03-node03 stonith-ng[37815]:  warning: Can't create
>>  a sane reply
>>  Jan 07 12:48:18 kvm03-node03 crmd[37819]:   notice: Peer
>>  kvm03-node02.avigol-gcs.dk was not terminated (reboot) by  on
>>  behalf of crmd.37819: No route to host
>> 
>>  pcs cluster start kvm03-node02.avigol-gcs.dk
>>  pcs status (now outputs the same on all 3 nodes)
>> 
>>  Failed Fencing Actions:
>>  * reboot of kvm03-node02.avigol-gcs.dk failed: delegate=,
>>  client=crmd.37819, origin=kvm03-node03.avigol-gcs.dk,
>>      last-failed='Thu Jan  7 12:48:18 2021'
>> 
>> 
>>  pcs cluster unstandby kvm03-node02.avigol-gcs.dk
>> 
>>  # Now libvirtd refuses to start
>> 
>>  Jan 07 12:51:44 kvm03-node02 dnsmasq[20884]: read /etc/hosts - 8 addresses
>>  Jan 07 12:51:44 kvm03-node02 dnsmasq[20884]: read
>>  /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
>>  Jan 07 12:51:44 kvm03-node02 dnsmasq-dhcp[20884]: read
>>  /var/lib/libvirt/dnsmasq/default.hostsfile
>>  Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
>>  11:51:44.729+: 24160: info : libvirt version: 4.5.0, package:
>>  36.el7_9.3 (CentOS BuildSystem ,
>>  2020-11-16-16:25:20, x86-01.bsys.centos.org)
>>  Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
>>  11:51:44.729+: 24160: info : hostname: kvm03-node02
>>  Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
>>  11:51:44.729+: 24160: error : qemuMonitorOpenUnix:392 : failed to
>>  connect to monitor socket: Connection refused
>>  Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
>>  11:51:44.729+: 24159: error : qemuMonitorOpenUnix:392 : failed to
>>  connect to monitor socket: Connection refused
>>  Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
>>  11:51:44.730+: 24161: error : qemuMonitorOpenUnix:392 : failed to
>>  connect to monitor socket: Connection refused
>>  Jan 07 12:51:44 kvm03-node02 libvirtd[24091]: 2021-01-07
>>  11:51:44.730+: 24162: error : qemuMonitorOpenUnix:392 : failed to
>>  connect to monitor socket: Connection refused
>> 
>>  pcs status
>> 
>>  Failed Resource Actions:
>>  * libvirtd_start_0 on kvm03-node02.avigol-gcs.dk 'unknown error' 
> (1):
>>  call=142, status=complete, exitreason='',
>>      last-rc-change='Thu Jan  7 12:51:44 2021', queued=0ms, 
> 

Re: [ClusterLabs] Pending Fencing Actions shown in pcs status

2021-01-06 Thread renayama19661014
Hi Steffen,
Hi Reid,

I also checked the Centos source rpm and it seems to include a fix for the 
problem.

As Steffen suggested, if you share your CIB settings, I might know something.

If this issue is the same as the one fixed there, the message is only shown on
the DC node and does not affect operation.
The pending actions shown will remain for a long time, but they will not have a
negative impact on the cluster.
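
For reference, the installed version and its changelog (where the CentOS/RHEL
packagers list backported fixes) can be checked with:

rpm -q pacemaker
rpm -q --changelog pacemaker | less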

Best Regards,
Hideo Yamauchi.


- Original Message -
> From: Reid Wahl 
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Cc: 
> Date: 2021/1/7, Thu 15:58
> Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
> 
> It's supposedly fixed in that version.
>   - https://bugzilla.redhat.com/show_bug.cgi?id=1787749 
>   - https://access.redhat.com/solutions/4713471 
> 
> So you may be hitting a different issue (unless there's a bug in the
> pcmk 1.1 backport of the fix).
> 
> I may be a little bit out of my area of knowledge here, but can you
> share the CIBs from nodes 1 and 3? Maybe Hideo, Klaus, or Ken has some
> insight.
> 
> On Wed, Jan 6, 2021 at 10:53 PM Steffen Vinther Sørensen
>  wrote:
>> 
>>  Hi Hideo,
>> 
>>  If the fix is not going to make it into the CentOS7 pacemaker version,
>>  I guess the stable approach to take advantage of it is to build the
>>  cluster on another OS than CentOS7 ? A little late for that in this
>>  case though :)
>> 
>>  Regards
>>  Steffen
>> 
>> 
>> 
>> 
>>  On Thu, Jan 7, 2021 at 7:27 AM  wrote:
>>  >
>>  > Hi Steffen,
>>  >
>>  > The fix pointed out by Reid is affecting it.
>>  >
>>  > Since the fencing action requested by the DC node exists only in the 
> DC node, such an event occurs.
>>  > You will need to take advantage of the modified pacemaker to resolve 
> the issue.
>>  >
>>  > Best Regards,
>>  > Hideo Yamauchi.
>>  >
>>  >
>>  >
>>  > - Original Message -
>>  > > From: Reid Wahl 
>>  > > To: Cluster Labs - All topics related to open-source clustering 
> welcomed 
>>  > > Cc:
>>  > > Date: 2021/1/7, Thu 15:07
>>  > > Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs 
> status
>>  > >
>>  > > Hi, Steffen. Are your cluster nodes all running the same 
> Pacemaker
>>  > > versions? This looks like Bug 5401[1], which is fixed by upstream
>>  > > commit df71a07[2]. I'm a little bit confused about why it 
> only shows
>>  > > up on one out of three nodes though.
>>  > >
>>  > > [1] https://bugs.clusterlabs.org/show_bug.cgi?id=5401 
>>  > > [2] https://github.com/ClusterLabs/pacemaker/commit/df71a07 
>>  > >
>>  > > On Tue, Jan 5, 2021 at 8:31 AM Steffen Vinther Sørensen
>>  > >  wrote:
>>  > >>
>>  > >>  Hello
>>  > >>
>>  > >>  node 1 is showing this in 'pcs status'
>>  > >>
>>  > >>  Pending Fencing Actions:
>>  > >>  * reboot of kvm03-node02.avigol-gcs.dk pending: 
> client=crmd.37819,
>>  > >>  origin=kvm03-node03.avigol-gcs.dk
>>  > >>
>>  > >>  node 2 and node 3 outputs no such thing (node 3 is DC)
>>  > >>
>>  > >>  Google is not much help, how to investigate this further and 
> get rid
>>  > >>  of such terrifying status message ?
>>  > >>
>>  > >>  Regards
>>  > >>  Steffen
>>  > >>  ___
>>  > >>  Manage your subscription:
>>  > >>  https://lists.clusterlabs.org/mailman/listinfo/users 
>>  > >>
>>  > >>  ClusterLabs home: https://www.clusterlabs.org/ 
>>  > >>
>>  > >
>>  > >
>>  > > --
>>  > > Regards,
>>  > >
>>  > > Reid Wahl, RHCA
>>  > > Senior Software Maintenance Engineer, Red Hat
>>  > > CEE - Platform Support Delivery - ClusterHA
>>  > >
>>  > > ___
>>  > > Manage your subscription:
>>  > > https://lists.clusterlabs.org/mailman/listinfo/users 
>>  > >
>>  > > ClusterLabs home: https://www.clusterlabs.org/ 
>>  > >
>>  >
>>  > ___
>>  > Manage your subscription:
>>  > https://lists.clusterlabs.org/mailman/listinfo/users 
>>  >
>>  > ClusterLabs home: https://www.clusterlabs.org/ 
>>  ___
>>  Manage your subscription:
>>  https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>>  ClusterLabs home: https://www.clusterlabs.org/ 
> 
> 
> 
> -- 
> Regards,
> 
> Reid Wahl, RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 
> 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pending Fencing Actions shown in pcs status

2021-01-07 Thread renayama19661014
Hi Steffen,
Hi Reid,

The fencing history is kept inside stonith-ng and is not written to the CIB.
However, having the entire CIB sent will still help us reproduce the problem.
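
If it is useful, the history that stonith-ng holds can also be dumped directly
with stonith_admin (the '*' asks for all nodes; newer builds additionally accept
--cleanup together with --history to clear stale entries):

stonith_admin --history '*'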

Best Regards,
Hideo Yamauchi.


- Original Message -
>From: Reid Wahl 
>To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
>open-source clustering welcomed  
>Date: 2021/1/7, Thu 17:39
>Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
> 
>
>Hi, Steffen. Those attachments don't contain the CIB. They contain the `pcs 
>config` output. You can get the cib with `pcs cluster cib > 
>$(hostname).cib.xml`.
>
>
>Granted, it's possible that this fence action information wouldn't be in the 
>CIB at all. It might be stored in fencer memory.
>
>
>On Thu, Jan 7, 2021 at 12:26 AM  wrote:
>
>Hi Steffen,
>>
>>> Here CIB settings attached (pcs config show) for all 3 of my nodes
>>> (all 3 seems 100% identical), node03 is the DC.
>>
>>
>>Thank you for the attachment.
>>
>>What is the scenario when this situation occurs?
>>In what steps did the problem appear when fencing was performed (or failed)?
>>
>>
>>Best Regards,
>>Hideo Yamauchi.
>>
>>
>>- Original Message -
>>> From: Steffen Vinther Sørensen 
>>> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
>>> open-source clustering welcomed 
>>> Cc: 
>>> Date: 2021/1/7, Thu 17:05
>>> Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
>>> 
>>> Hi Hideo,
>>> 
>>> Here CIB settings attached (pcs config show) for all 3 of my nodes
>>> (all 3 seems 100% identical), node03 is the DC.
>>> 
>>> Regards
>>> Steffen
>>> 
>>> On Thu, Jan 7, 2021 at 8:06 AM  wrote:
 
  Hi Steffen,
  Hi Reid,
 
  I also checked the Centos source rpm and it seems to include a fix for 
the 
>>> problem.
 
  As Steffen suggested, if you share your CIB settings, I might know 
>>> something.
 
  If this issue is the same as the fix, the display will only be displayed 
on 
>>> the DC node and will not affect the operation.
  The pending actions shown will remain for a long time, but will not have 
a 
>>> negative impact on the cluster.
 
  Best Regards,
  Hideo Yamauchi.
 
 
  - Original Message -
  > From: Reid Wahl 
  > To: Cluster Labs - All topics related to open-source clustering 
>>> welcomed 
  > Cc:
  > Date: 2021/1/7, Thu 15:58
  > Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
  >
  > It's supposedly fixed in that version.
  >   - https://bugzilla.redhat.com/show_bug.cgi?id=1787749
  >   - https://access.redhat.com/solutions/4713471
  >
  > So you may be hitting a different issue (unless there's a bug in 
>>> the
  > pcmk 1.1 backport of the fix).
  >
  > I may be a little bit out of my area of knowledge here, but can you
  > share the CIBs from nodes 1 and 3? Maybe Hideo, Klaus, or Ken has some
  > insight.
  >
  > On Wed, Jan 6, 2021 at 10:53 PM Steffen Vinther Sørensen
  >  wrote:
  >>
  >>  Hi Hideo,
  >>
  >>  If the fix is not going to make it into the CentOS7 pacemaker 
>>> version,
  >>  I guess the stable approach to take advantage of it is to build 
>>> the
  >>  cluster on another OS than CentOS7 ? A little late for that in 
>>> this
  >>  case though :)
  >>
  >>  Regards
  >>  Steffen
  >>
  >>
  >>
  >>
  >>  On Thu, Jan 7, 2021 at 7:27 AM  
>>> wrote:
  >>  >
  >>  > Hi Steffen,
  >>  >
  >>  > The fix pointed out by Reid is affecting it.
  >>  >
  >>  > Since the fencing action requested by the DC node exists 
>>> only in the
  > DC node, such an event occurs.
  >>  > You will need to take advantage of the modified pacemaker to 
>>> resolve
  > the issue.
  >>  >
  >>  > Best Regards,
  >>  > Hideo Yamauchi.
  >>  >
  >>  >
  >>  >
  >>  > - Original Message -
  >>  > > From: Reid Wahl 
  >>  > > To: Cluster Labs - All topics related to open-source 
>>> clustering
  > welcomed 
  >>  > > Cc:
  >>  > > Date: 2021/1/7, Thu 15:07
  >>  > > Subject: Re: [ClusterLabs] Pending Fencing Actions 
>>> shown in pcs
  > status
  >>  > >
  >>  > > Hi, Steffen. Are your cluster nodes all running the 
>>> same
  > Pacemaker
  >>  > > versions? This looks like Bug 5401[1], which is fixed 
>>> by upstream
  >>  > > commit df71a07[2]. I'm a little bit confused about 
>>> why it
  > only shows
  >>  > > up on one out of three nodes though.
  >>  > >
  >>  > > [1] https://bugs.clusterlabs.org/show_bug.cgi?id=5401
  >>  > > [2] 
>>> https://github.com/ClusterLabs/pacemaker/commit/df71a07
  >>  > >
  >>  > > On Tue, Jan 5, 2021 at 8:31 AM Steffen Vinther Sørensen
  >>  > >  wrote:
  >>  > >>
  >>  > >>  Hello
  >>  > >>
  

Re: [ClusterLabs] Antw: [EXT] Re: Pending Fencing Actions shown in pcs status

2021-01-07 Thread renayama19661014
Hi Ulrich,

> So you were asking for a specific section of the CIB like "cibadmin -Q -o
> status"?


No.
There is no need for a specific section of the CIB.


Best Regards,
Hideo Yamauchi.


- Original Message -
> From: Ulrich Windl 
> To: users@clusterlabs.org; renayama19661...@ybb.ne.jp
> Cc: 
> Date: 2021/1/7, Thu 17:57
> Subject: Antw: [EXT] Re: [ClusterLabs] Pending Fencing Actions shown in pcs 
> status
> 
   schrieb am 07.01.2021 um 09:51 
> in Nachricht
> <91782048.765666.1610009460932.javamail.ya...@mail.yahoo.co.jp>:
>>  Hi Steffen,
>>  Hi Reid,
>> 
>>  The fencing history is kept inside stonith-ng and is not written to cib.
> 
> So you were asking for a specific section of the CIB like "cibadmin -Q -o
> status"?
> 
>>  However, getting the entire cib and getting it sent will help you to 
>>  reproduce the problem.
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
>> 
>> 
>>  - Original Message -
>>> From: Reid Wahl 
>>> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
>>  open-source clustering welcomed  
>>> Date: 2021/1/7, Thu 17:39
>>> Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
>>> 
>>> 
>>> Hi, Steffen. Those attachments don't contain the CIB. They contain 
> the `pcs
> 
>>  config` output. You can get the cib with `pcs cluster cib > 
>>  $(hostname).cib.xml`.
>>> 
>>> 
>>> Granted, it's possible that this fence action information 
> wouldn't be in the
> 
>>  CIB at all. It might be stored in fencer memory.
>>> 
>>> 
>>> On Thu, Jan 7, 2021 at 12:26 AM  
> wrote:
>>> 
>>> Hi Steffen,
 
>  Here CIB settings attached (pcs config show) for all 3 of my 
> nodes
>  (all 3 seems 100% identical), node03 is the DC.
 
 
 Thank you for the attachment.
 
 What is the scenario when this situation occurs?
 In what steps did the problem appear when fencing was performed (or
> failed)?
 
 
 Best Regards,
 Hideo Yamauchi.
 
 
 - Original Message -
>  From: Steffen Vinther Sørensen 
>  To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics 
> related to 
>>  open-source clustering welcomed 
>  Cc: 
>  Date: 2021/1/7, Thu 17:05
>  Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs 
> status
> 
>  Hi Hideo,
> 
>  Here CIB settings attached (pcs config show) for all 3 of my 
> nodes
>  (all 3 seems 100% identical), node03 is the DC.
> 
>  Regards
>  Steffen
> 
>  On Thu, Jan 7, 2021 at 8:06 AM 
>  wrote:
>> 
>>   Hi Steffen,
>>   Hi Reid,
>> 
>>   I also checked the Centos source rpm and it seems to 
> include a fix for
> the 
>  problem.
>> 
>>   As Steffen suggested, if you share your CIB settings, I 
> might know 
>  something.
>> 
>>   If this issue is the same as the fix, the display will 
> only be
> displayed on 
>> 
>  the DC node and will not affect the operation.
>>   The pending actions shown will remain for a long time, but 
> will not
> have a 
>  negative impact on the cluster.
>> 
>>   Best Regards,
>>   Hideo Yamauchi.
>> 
>> 
>>   - Original Message -
>>   > From: Reid Wahl 
>>   > To: Cluster Labs - All topics related to open-source 
> clustering 
>  welcomed 
>>   > Cc:
>>   > Date: 2021/1/7, Thu 15:58
>>   > Subject: Re: [ClusterLabs] Pending Fencing Actions 
> shown in pcs
> status
>>   >
>>   > It's supposedly fixed in that version.
>>   >   - 
> https://bugzilla.redhat.com/show_bug.cgi?id=1787749 
>>   >   - https://access.redhat.com/solutions/4713471 
>>   >
>>   > So you may be hitting a different issue (unless 
> there's a bug in 
>  the
>>   > pcmk 1.1 backport of the fix).
>>   >
>>   > I may be a little bit out of my area of knowledge 
> here, but can you
>>   > share the CIBs from nodes 1 and 3? Maybe Hideo, 
> Klaus, or Ken has
> some
>>   > insight.
>>   >
>>   > On Wed, Jan 6, 2021 at 10:53 PM Steffen Vinther 
> Sørensen
>>   >  wrote:
>>   >>
>>   >>  Hi Hideo,
>>   >>
>>   >>  If the fix is not going to make it into the 
> CentOS7 pacemaker 
>  version,
>>   >>  I guess the stable approach to take advantage of 
> it is to build 
>  the
>>   >>  cluster on another OS than CentOS7 ? A little 
> late for that in 
>  this
>>   >>  case though :)
>>   >>
>>   >>  Regards
>>   >>  Steffen
>>   >>
>>   >>
>>   >>
>>   >>
>>   >>  On Thu, Jan 7, 2021 at 7:27 AM 
>  
>  wrote:
>>   >>  >
>>   >>  > Hi Steffen,
>>   >>  >
>>   >>  > The fix pointed out by Reid is affecting 
> it.
>>   >>  >
>>   >>  > Since the fencing action requested by the 
> DC node exists 
>  only in the
>>   > DC node, such an event occurs.
>>   >>  > You will need to take advantage of the 
> modified pacemaker to 
>  resolve

Re: [ClusterLabs] Pending Fencing Actions shown in pcs status

2021-01-07 Thread renayama19661014
Hi Steffen,

> Unfortunately not sure about the exact scenario. But I have been doing
> some recent experiments with node standby/unstandby stop/start. This
> to get procedures right for updating node rpms etc.
> 
> Later I noticed the uncomforting "pending fencing actions" status msg.

Okay!

I will repeat the standby and unstandby steps in the same way to check.
We will start checking after tomorrow, so I think it will take until sometime
next week.
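
A rough sketch of the loop I intend to run (the node name is taken from your
earlier output, and the timings are arbitrary):

for i in $(seq 1 20); do
    pcs cluster standby kvm03-node02.avigol-gcs.dk
    sleep 120
    pcs cluster unstandby kvm03-node02.avigol-gcs.dk
    sleep 120
    pcs status | grep -A3 'Pending Fencing Actions'
done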


Many thanks,
Hideo Yamauchi.



- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: Reid Wahl ; Cluster Labs - All topics related to 
> open-source clustering welcomed 
> Cc: 
> Date: 2021/1/7, Thu 17:51
> Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
> 
> Hi Steffen,
> Hi Reid,
> 
> The fencing history is kept inside stonith-ng and is not written to cib.
> However, getting the entire cib and getting it sent will help you to 
> reproduce 
> the problem.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> - Original Message -
>> From: Reid Wahl 
>> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
> open-source clustering welcomed  
>> Date: 2021/1/7, Thu 17:39
>> Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
>> 
>> 
>> Hi, Steffen. Those attachments don't contain the CIB. They contain the 
> `pcs config` output. You can get the cib with `pcs cluster cib > 
> $(hostname).cib.xml`.
>> 
>> 
>> Granted, it's possible that this fence action information wouldn't 
> be in the CIB at all. It might be stored in fencer memory.
>> 
>> 
>> On Thu, Jan 7, 2021 at 12:26 AM  wrote:
>> 
>> Hi Steffen,
>>> 
  Here CIB settings attached (pcs config show) for all 3 of my nodes
  (all 3 seems 100% identical), node03 is the DC.
>>> 
>>> 
>>> Thank you for the attachment.
>>> 
>>> What is the scenario when this situation occurs?
>>> In what steps did the problem appear when fencing was performed (or 
> failed)?
>>> 
>>> 
>>> Best Regards,
>>> Hideo Yamauchi.
>>> 
>>> 
>>> - Original Message -
  From: Steffen Vinther Sørensen 
  To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related 
> to open-source clustering welcomed 
  Cc: 
  Date: 2021/1/7, Thu 17:05
  Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs 
> status
 
  Hi Hideo,
 
  Here CIB settings attached (pcs config show) for all 3 of my nodes
  (all 3 seems 100% identical), node03 is the DC.
 
  Regards
  Steffen
 
  On Thu, Jan 7, 2021 at 8:06 AM  
> wrote:
> 
>   Hi Steffen,
>   Hi Reid,
> 
>   I also checked the Centos source rpm and it seems to include a 
> fix for the 
  problem.
> 
>   As Steffen suggested, if you share your CIB settings, I might 
> know 
  something.
> 
>   If this issue is the same as the fix, the display will only be 
> displayed on 
  the DC node and will not affect the operation.
>   The pending actions shown will remain for a long time, but 
> will not have a 
  negative impact on the cluster.
> 
>   Best Regards,
>   Hideo Yamauchi.
> 
> 
>   - Original Message -
>   > From: Reid Wahl 
>   > To: Cluster Labs - All topics related to open-source 
> clustering 
  welcomed 
>   > Cc:
>   > Date: 2021/1/7, Thu 15:58
>   > Subject: Re: [ClusterLabs] Pending Fencing Actions shown 
> in pcs status
>   >
>   > It's supposedly fixed in that version.
>   >   - https://bugzilla.redhat.com/show_bug.cgi?id=1787749 
>   >   - https://access.redhat.com/solutions/4713471 
>   >
>   > So you may be hitting a different issue (unless 
> there's a bug in 
  the
>   > pcmk 1.1 backport of the fix).
>   >
>   > I may be a little bit out of my area of knowledge here, 
> but can you
>   > share the CIBs from nodes 1 and 3? Maybe Hideo, Klaus, or 
> Ken has some
>   > insight.
>   >
>   > On Wed, Jan 6, 2021 at 10:53 PM Steffen Vinther Sørensen
>   >  wrote:
>   >>
>   >>  Hi Hideo,
>   >>
>   >>  If the fix is not going to make it into the CentOS7 
> pacemaker 
  version,
>   >>  I guess the stable approach to take advantage of it 
> is to build 
  the
>   >>  cluster on another OS than CentOS7 ? A little late 
> for that in 
  this
>   >>  case though :)
>   >>
>   >>  Regards
>   >>  Steffen
>   >>
>   >>
>   >>
>   >>
>   >>  On Thu, Jan 7, 2021 at 7:27 AM 
>  
  wrote:
>   >>  >
>   >>  > Hi Steffen,
>   >>  >
>   >>  > The fix pointed out by Reid is affecting it.
>   >>  >
>   >>  > Since the fencing action requested by the DC 
> node exists 
  only in the
>   > DC node, such an event occurs.
>   >>  > You will need to take advantage of the modified 
> pacemaker to 
  resolve
>   > the issue.
>   >>  >
>   >>  > Best Regards,
>   

Re: [ClusterLabs] Pending Fencing Actions shown in pcs status

2021-01-07 Thread renayama19661014
Hi Steffen,

> Here CIB settings attached (pcs config show) for all 3 of my nodes
> (all 3 seems 100% identical), node03 is the DC.


Thank you for the attachment.

What is the scenario when this situation occurs?
In what steps did the problem appear when fencing was performed (or failed)?


Best Regards,
Hideo Yamauchi.


- Original Message -
> From: Steffen Vinther Sørensen 
> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
> open-source clustering welcomed 
> Cc: 
> Date: 2021/1/7, Thu 17:05
> Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
> 
> Hi Hideo,
> 
> Here CIB settings attached (pcs config show) for all 3 of my nodes
> (all 3 seems 100% identical), node03 is the DC.
> 
> Regards
> Steffen
> 
> On Thu, Jan 7, 2021 at 8:06 AM  wrote:
>> 
>>  Hi Steffen,
>>  Hi Reid,
>> 
>>  I also checked the Centos source rpm and it seems to include a fix for the 
> problem.
>> 
>>  As Steffen suggested, if you share your CIB settings, I might know 
> something.
>> 
>>  If this issue is the same as the fix, the display will only be displayed on 
> the DC node and will not affect the operation.
>>  The pending actions shown will remain for a long time, but will not have a 
> negative impact on the cluster.
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
>> 
>> 
>>  - Original Message -
>>  > From: Reid Wahl 
>>  > To: Cluster Labs - All topics related to open-source clustering 
> welcomed 
>>  > Cc:
>>  > Date: 2021/1/7, Thu 15:58
>>  > Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
>>  >
>>  > It's supposedly fixed in that version.
>>  >   - https://bugzilla.redhat.com/show_bug.cgi?id=1787749 
>>  >   - https://access.redhat.com/solutions/4713471 
>>  >
>>  > So you may be hitting a different issue (unless there's a bug in 
> the
>>  > pcmk 1.1 backport of the fix).
>>  >
>>  > I may be a little bit out of my area of knowledge here, but can you
>>  > share the CIBs from nodes 1 and 3? Maybe Hideo, Klaus, or Ken has some
>>  > insight.
>>  >
>>  > On Wed, Jan 6, 2021 at 10:53 PM Steffen Vinther Sørensen
>>  >  wrote:
>>  >>
>>  >>  Hi Hideo,
>>  >>
>>  >>  If the fix is not going to make it into the CentOS7 pacemaker 
> version,
>>  >>  I guess the stable approach to take advantage of it is to build 
> the
>>  >>  cluster on another OS than CentOS7 ? A little late for that in 
> this
>>  >>  case though :)
>>  >>
>>  >>  Regards
>>  >>  Steffen
>>  >>
>>  >>
>>  >>
>>  >>
>>  >>  On Thu, Jan 7, 2021 at 7:27 AM  
> wrote:
>>  >>  >
>>  >>  > Hi Steffen,
>>  >>  >
>>  >>  > The fix pointed out by Reid is affecting it.
>>  >>  >
>>  >>  > Since the fencing action requested by the DC node exists 
> only in the
>>  > DC node, such an event occurs.
>>  >>  > You will need to take advantage of the modified pacemaker to 
> resolve
>>  > the issue.
>>  >>  >
>>  >>  > Best Regards,
>>  >>  > Hideo Yamauchi.
>>  >>  >
>>  >>  >
>>  >>  >
>>  >>  > - Original Message -
>>  >>  > > From: Reid Wahl 
>>  >>  > > To: Cluster Labs - All topics related to open-source 
> clustering
>>  > welcomed 
>>  >>  > > Cc:
>>  >>  > > Date: 2021/1/7, Thu 15:07
>>  >>  > > Subject: Re: [ClusterLabs] Pending Fencing Actions 
> shown in pcs
>>  > status
>>  >>  > >
>>  >>  > > Hi, Steffen. Are your cluster nodes all running the 
> same
>>  > Pacemaker
>>  >>  > > versions? This looks like Bug 5401[1], which is fixed 
> by upstream
>>  >>  > > commit df71a07[2]. I'm a little bit confused about 
> why it
>>  > only shows
>>  >>  > > up on one out of three nodes though.
>>  >>  > >
>>  >>  > > [1] https://bugs.clusterlabs.org/show_bug.cgi?id=5401 
>>  >>  > > [2] 
> https://github.com/ClusterLabs/pacemaker/commit/df71a07 
>>  >>  > >
>>  >>  > > On Tue, Jan 5, 2021 at 8:31 AM Steffen Vinther Sørensen
>>  >>  > >  wrote:
>>  >>  > >>
>>  >>  > >>  Hello
>>  >>  > >>
>>  >>  > >>  node 1 is showing this in 'pcs status'
>>  >>  > >>
>>  >>  > >>  Pending Fencing Actions:
>>  >>  > >>  * reboot of kvm03-node02.avigol-gcs.dk pending:
>>  > client=crmd.37819,
>>  >>  > >>  origin=kvm03-node03.avigol-gcs.dk
>>  >>  > >>
>>  >>  > >>  node 2 and node 3 outputs no such thing (node 3 is 
> DC)
>>  >>  > >>
>>  >>  > >>  Google is not much help, how to investigate this 
> further and
>>  > get rid
>>  >>  > >>  of such terrifying status message ?
>>  >>  > >>
>>  >>  > >>  Regards
>>  >>  > >>  Steffen
>>  >>  > >>  ___
>>  >>  > >>  Manage your subscription:
>>  >>  > >>  
> https://lists.clusterlabs.org/mailman/listinfo/users 
>>  >>  > >>
>>  >>  > >>  ClusterLabs home: https://www.clusterlabs.org/ 
>>  >>  > >>
>>  >>  > >
>>  >>  > >
>>  >>  > > --
>>  >>  > > Regards,
>>  >>  > >
>>  >>  > > Reid Wahl, RHCA
>>  >>  > > Senior Software Maintenance Engineer, Red Hat
>>  >>  > > CEE - Platform Support Delivery - ClusterHA
>>  >>  > >
>>  >>  > > ___
>>  >>  > > Manage your 

Re: [ClusterLabs] Pending Fencing Actions shown in pcs status

2021-01-06 Thread renayama19661014
Hi Steffen,

The fix pointed out by Reid is affecting it.

Since the fencing action requested by the DC node exists only in the DC node, 
such an event occurs.
You will need to take advantage of the modified pacemaker to resolve the issue.

Best Regards,
Hideo Yamauchi.



- Original Message -
> From: Reid Wahl 
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Cc: 
> Date: 2021/1/7, Thu 15:07
> Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
> 
> Hi, Steffen. Are your cluster nodes all running the same Pacemaker
> versions? This looks like Bug 5401[1], which is fixed by upstream
> commit df71a07[2]. I'm a little bit confused about why it only shows
> up on one out of three nodes though.
> 
> [1] https://bugs.clusterlabs.org/show_bug.cgi?id=5401 
> [2] https://github.com/ClusterLabs/pacemaker/commit/df71a07 
> 
> On Tue, Jan 5, 2021 at 8:31 AM Steffen Vinther Sørensen
>  wrote:
>> 
>>  Hello
>> 
>>  node 1 is showing this in 'pcs status'
>> 
>>  Pending Fencing Actions:
>>  * reboot of kvm03-node02.avigol-gcs.dk pending: client=crmd.37819,
>>  origin=kvm03-node03.avigol-gcs.dk
>> 
>>  node 2 and node 3 outputs no such thing (node 3 is DC)
>> 
>>  Google is not much help, how to investigate this further and get rid
>>  of such terrifying status message ?
>> 
>>  Regards
>>  Steffen
>>  ___
>>  Manage your subscription:
>>  https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>>  ClusterLabs home: https://www.clusterlabs.org/ 
>> 
> 
> 
> -- 
> Regards,
> 
> Reid Wahl, RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 
> 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Pending Fencing Actions shown in pcs status

2021-01-06 Thread renayama19661014
Hi Reid,
Hi Steffen,



> According to Steffen's description, the "pending" is displayed 
> only on
> node 1, while the DC is node 3. That's another thing that makes me
> wonder if this is a distinct issue.


The problem may not be the same.
I think it would be a good idea to provide a crm_report etc. via Bugzilla or
the ML so that the problem can be investigated.
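
For reference, something along these lines is enough to collect what is needed
(adjust the time window to cover the incident; the output path is arbitrary):

crm_report --from "2021-01-05 00:00:00" --to "2021-01-08 00:00:00" /tmp/pcmk-fencing-report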


Best Regards,
Hideo Yamauchi.


- Original Message -
> From: Reid Wahl 
> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
> open-source clustering welcomed 
> Cc: 
> Date: 2021/1/7, Thu 16:16
> Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
> 
> On Wed, Jan 6, 2021 at 11:07 PM  wrote:
>> 
>>  Hi Steffen,
>>  Hi Reid,
>> 
>>  I also checked the Centos source rpm and it seems to include a fix for the 
> problem.
>> 
>>  As Steffen suggested, if you share your CIB settings, I might know 
> something.
>> 
>>  If this issue is the same as the fix, the display will only be displayed on 
> the DC node and will not affect the operation.
> 
> According to Steffen's description, the "pending" is displayed 
> only on
> node 1, while the DC is node 3. That's another thing that makes me
> wonder if this is a distinct issue.
> 
>>  The pending actions shown will remain for a long time, but will not have a 
> negative impact on the cluster.
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
>> 
>> 
>>  - Original Message -
>>  > From: Reid Wahl 
>>  > To: Cluster Labs - All topics related to open-source clustering 
> welcomed 
>>  > Cc:
>>  > Date: 2021/1/7, Thu 15:58
>>  > Subject: Re: [ClusterLabs] Pending Fencing Actions shown in pcs status
>>  >
>>  > It's supposedly fixed in that version.
>>  >   - https://bugzilla.redhat.com/show_bug.cgi?id=1787749 
>>  >   - https://access.redhat.com/solutions/4713471 
>>  >
>>  > So you may be hitting a different issue (unless there's a bug in 
> the
>>  > pcmk 1.1 backport of the fix).
>>  >
>>  > I may be a little bit out of my area of knowledge here, but can you
>>  > share the CIBs from nodes 1 and 3? Maybe Hideo, Klaus, or Ken has some
>>  > insight.
>>  >
>>  > On Wed, Jan 6, 2021 at 10:53 PM Steffen Vinther Sørensen
>>  >  wrote:
>>  >>
>>  >>  Hi Hideo,
>>  >>
>>  >>  If the fix is not going to make it into the CentOS7 pacemaker 
> version,
>>  >>  I guess the stable approach to take advantage of it is to build 
> the
>>  >>  cluster on another OS than CentOS7 ? A little late for that in 
> this
>>  >>  case though :)
>>  >>
>>  >>  Regards
>>  >>  Steffen
>>  >>
>>  >>
>>  >>
>>  >>
>>  >>  On Thu, Jan 7, 2021 at 7:27 AM  
> wrote:
>>  >>  >
>>  >>  > Hi Steffen,
>>  >>  >
>>  >>  > The fix pointed out by Reid is affecting it.
>>  >>  >
>>  >>  > Since the fencing action requested by the DC node exists 
> only in the
>>  > DC node, such an event occurs.
>>  >>  > You will need to take advantage of the modified pacemaker to 
> resolve
>>  > the issue.
>>  >>  >
>>  >>  > Best Regards,
>>  >>  > Hideo Yamauchi.
>>  >>  >
>>  >>  >
>>  >>  >
>>  >>  > - Original Message -
>>  >>  > > From: Reid Wahl 
>>  >>  > > To: Cluster Labs - All topics related to open-source 
> clustering
>>  > welcomed 
>>  >>  > > Cc:
>>  >>  > > Date: 2021/1/7, Thu 15:07
>>  >>  > > Subject: Re: [ClusterLabs] Pending Fencing Actions 
> shown in pcs
>>  > status
>>  >>  > >
>>  >>  > > Hi, Steffen. Are your cluster nodes all running the 
> same
>>  > Pacemaker
>>  >>  > > versions? This looks like Bug 5401[1], which is fixed 
> by upstream
>>  >>  > > commit df71a07[2]. I'm a little bit confused about 
> why it
>>  > only shows
>>  >>  > > up on one out of three nodes though.
>>  >>  > >
>>  >>  > > [1] https://bugs.clusterlabs.org/show_bug.cgi?id=5401 
>>  >>  > > [2] 
> https://github.com/ClusterLabs/pacemaker/commit/df71a07 
>>  >>  > >
>>  >>  > > On Tue, Jan 5, 2021 at 8:31 AM Steffen Vinther Sørensen
>>  >>  > >  wrote:
>>  >>  > >>
>>  >>  > >>  Hello
>>  >>  > >>
>>  >>  > >>  node 1 is showing this in 'pcs status'
>>  >>  > >>
>>  >>  > >>  Pending Fencing Actions:
>>  >>  > >>  * reboot of kvm03-node02.avigol-gcs.dk pending:
>>  > client=crmd.37819,
>>  >>  > >>  origin=kvm03-node03.avigol-gcs.dk
>>  >>  > >>
>>  >>  > >>  node 2 and node 3 outputs no such thing (node 3 is 
> DC)
>>  >>  > >>
>>  >>  > >>  Google is not much help, how to investigate this 
> further and
>>  > get rid
>>  >>  > >>  of such terrifying status message ?
>>  >>  > >>
>>  >>  > >>  Regards
>>  >>  > >>  Steffen
>>  >>  > >>  ___
>>  >>  > >>  Manage your subscription:
>>  >>  > >>  
> https://lists.clusterlabs.org/mailman/listinfo/users 
>>  >>  > >>
>>  >>  > >>  ClusterLabs home: https://www.clusterlabs.org/ 
>>  >>  > >>
>>  >>  > >
>>  >>  > >
>>  >>  > > --
>>  >>  > > Regards,
>>  >>  > >
>>  >>  > > Reid Wahl, RHCA
>>  >>  > > Senior Software Maintenance Engineer, Red Hat
>>  >>  > > CEE - Platform Support Delivery - ClusterHA
>>  >>  > >
>>  >>  > > 

Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-23 Thread renayama19661014
Hi Ken,
Hi Klaus,

Thanks for your comment.

>We did not have time to get it into the RHEL 8.4 GA (general
>availability) release, which means for example it will not be in 8.4
>install images, but we did get a 0-day fix, which means that it will be
>available via "yum update" the same day that 8.4 is released.
>
>Thanks for testing the 8.4 build and finding the issue!


Okay!


Best Regards,
Hideo Yamauchi.




- Original Message -
>From: Ken Gaillot 
>To: renayama19661...@ybb.ne.jp 
>Cc: kwenning 
>Date: 2021/4/24, Sat 01:25
>Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
>fails.
> 
>Hi Hideo,
>
>A private reply to follow up:
>
>The fix will be in the 2.1.0 upstream release.
>
>We did not have time to get it into the RHEL 8.4 GA (general
>availability) release, which means for example it will not be in 8.4
>install images, but we did get a 0-day fix, which means that it will be
>available via "yum update" the same day that 8.4 is released.
>
>Thanks for testing the 8.4 build and finding the issue!
>
>On Thu, 2021-04-15 at 11:45 +0900, renayama19661...@ybb.ne.jp wrote:
>> Hi Klaus,
>> Hi Ken,
>> 
>> We have confirmed that the operation is improved by the test.
>> Thank you for your prompt response.
>> 
>> We look forward to including this fix in the release version of RHEL
>> 8.4.
>> 
>> Best Regards,
>> Hideo Yamauchi.
>> 
>> 
>> 
>> - Original Message -
>> > From: "renayama19661...@ybb.ne.jp" 
>> > To: "kwenn...@redhat.com" ; Cluster Labs - All
>> > topics related to open-source clustering welcomed <
>> > users@clusterlabs.org>; Cluster Labs - All topics related to open-
>> > source clustering welcomed 
>> > Cc: 
>> > Date: 2021/4/13, Tue 07:08
>> > Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource
>> > control fails.
>> > 
>> > Hi Klaus,
>> > Hi Ken,
>> > 
>> > >  I've opened https://github.com/ClusterLabs/pacemaker/pull/2342
>> > > with
>> > >  I guess the simplest possible solution to the immediate issue so
>> > >  that we can discuss it.
>> > 
>> > 
>> > Thank you for the fix.
>> > 
>> > 
>> > I have confirmed that the fixes have been merged.
>> > 
>> > I'll test this fix today just in case.
>> > 
>> > Many thanks,
>> > Hideo Yamauchi.
>> > 
>> > 
>> > - Original Message -
>> > >  From: Klaus Wenninger 
>> > >  To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics
>> > > related to 
>> > 
>> > open-source clustering welcomed 
>> > >  Cc: 
>> > >  Date: 2021/4/12, Mon 22:22
>> > >  Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql
>> > > resource control 
>> > 
>> > fails.
>> > > 
>> > >  On 4/9/21 5:13 PM, Klaus Wenninger wrote:
>> > > >   On 4/9/21 4:04 PM, Klaus Wenninger wrote:
>> > > > >   On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>> > > > > >   On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>> > > > > > >   On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
>> > > > > > > >   Hi Klaus,
>> > > > > > > > 
>> > > > > > > >   Thanks for your comment.
>> > > > > > > > 
>> > > > > > > > >   Hmm ... is that with selinux enabled?
>> > > > > > > > >   Respectively do you see any related avc messages?
>> > > > > > > > 
>> > > > > > > >   Selinux is not enabled.
>> > > > > > > >   Isn't crm_mon caused by not returning a response 
>> > 
>> > when 
>> > >  pacemakerd 
>> > > > > > > >   prepares to stop?
>> > > > > > 
>> > > > > >   yep ... that doesn't look good.
>> > > > > >   While in pcmk_shutdown_worker ipc isn't handled.
>> > > > > 
>> > > > >   Stop ... that should actually work as pcmk_shutdown_worker
>> > > > >   should exit quite quickly and proceed after mainloop
>> > > > >   dispatching when called again.
>> > > > >   Don't see anything atm that might be blocking for longer
>> > > > > ...
>> > > > >   but let me dig into it further ...
>> > > > 
>> > > >   What happens is clear (thanks Ken for the hint ;-) ).
>> > > >   When pacemakerd is shutting down - already when it
>> > > >   shuts down the resources and not just when it starts to
>> > > >   reap the subdaemons - crm_mon reads that state and
>> > > >   doesn't try to connect to the cib anymore.
>> > > 
>> > >  I've opened https://github.com/ClusterLabs/pacemaker/pull/2342
>> > > with
>> > >  I guess the simplest possible solution to the immediate issue so
>> > >  that we can discuss it.
>> > > > > >   Question is why that didn't create issue earlier.
>> > > > > >   Probably I didn't test with resources that had crm_mon in
>> > > > > >   their stop/monitor-actions but sbd should have run into
>> > > > > >   issues.
>> > > > > > 
>> > > > > >   Klaus
>> > > > > > >   But when shutting down a node the resources should be
>> > > > > > >   shutdown before pacemakerd goes down.
>> > > > > > >   But let me have a look if it can happen that pacemakerd
>> > > > > > >   doesn't react to the ipc-pings before. That btw. might 
>> > 
>> > be
>> > > > > > >   lethal for sbd-scenarios (if the phase is too long and
>> > > > > > > it
>> > > > > > >   migh actually not be 

[ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-08 Thread renayama19661014
Hi Ken,
Hi All,

The pgsql resource agent executes crm_mon during its demote and stop
processing and acts on the result.

However, pacemaker included in RHEL8.4beta fails to execute this crm_mon.
 - The problem also occurs on github 
master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).

The problem can be easily reproduced in the following ways.

Step1. Modify the Dummy resource agent so that its stop operation executes crm_mon.


dummy_stop() {
    # Run crm_mon during the stop operation (as the pgsql RA does)
    # and log its exit code and output.
    mon=$(crm_mon -1)
    ret=$?
    ocf_log info "### YAMAUCHI  crm_mon[${ret}] : ${mon}"
    dummy_monitor
    if [ $? =  $OCF_SUCCESS ]; then
        rm ${OCF_RESKEY_state}
    fi
    return $OCF_SUCCESS
}


Step2. Configure a cluster with two nodes.


[root@rh84-beta01 ~]# crm_mon -rfA1
Cluster Summary:
  * Stack: corosync
  * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition with 
quorum
  * Last updated: Thu Apr  8 18:00:52 2021
  * Last change:  Thu Apr  8 18:00:38 2021 by root via cibadmin on rh84-beta01
  * 2 nodes configured
  * 1 resource instance configured

Node List:
  * Online: [ rh84-beta01 rh84-beta02 ]

Full List of Resources:
  * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta01

Migration Summary:


Step3. Stop the node where the Dummy resource is running. The resource will 
fail over.

[root@rh84-beta02 ~]# crm_mon -rfA1
Cluster Summary:
  * Stack: corosync
  * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition with 
quorum
  * Last updated: Thu Apr  8 18:08:56 2021
  * Last change:  Thu Apr  8 18:05:08 2021 by root via cibadmin on rh84-beta01
  * 2 nodes configured
  * 1 resource instance configured

Node List:
  * Online: [ rh84-beta02 ]
  * OFFLINE: [ rh84-beta01 ]

Full List of Resources:
  * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta02


However, if you look at the log, you can see that the execution of crm_mon in 
the stop processing of the Dummy resource has failed.


Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI  crm_mon[102] 
: Pacemaker daemons shutting down ...
Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] (log_op_output)  notice: 
dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not available on 
this node ]
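
The exit code 102 that crm_mon returns here can be translated on the build under
test with crm_error (assuming the --exit option is available in this build):

crm_error --exit 102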


Similarly, because pgsql executes crm_mon during demote and stop, its control fails as well.

The problem seems to be related to the next fix.
 * Report pacemakerd in state waiting for sbd
  - https://github.com/ClusterLabs/pacemaker/pull/2278

The problem does not occur with the release version of Pacemaker 2.0.5 or the 
Pacemaker included with RHEL8.3.

This issue has a huge impact on users.

Perhaps it also affects the control of other resources that utilize crm_mon.

Please make sure that the release version of RHEL8.4 ships a Pacemaker that
does not have this problem.
 * Distributions other than RHEL may also be affected in their future releases.


This content is the same as the following Bugzilla.
 - https://bugs.clusterlabs.org/show_bug.cgi?id=5471


Best Regards,
Hideo Yamauchi.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-09 Thread renayama19661014
Hi Klaus,

Thanks for your comment.

> Hmm ... is that with selinux enabled?

> Respectively do you see any related avc messages?


Selinux is not enabled.
Isn't this caused by crm_mon not returning a response while pacemakerd is
preparing to stop?

pgsql needs the result of crm_mon in its demote and stop processing.
crm_mon should keep returning a response even after pacemakerd has begun
shutting down.
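
As a stopgap only (not a fix for the underlying issue), a resource agent could
at least avoid failing its stop action when this happens, along these lines:

mon=$(crm_mon -1 2>/dev/null)
rc=$?
if [ $rc -ne 0 ]; then
    # crm_mon is unavailable, most likely because pacemakerd is shutting down;
    # skip the status-based logic instead of failing the stop action
    ocf_log warn "crm_mon failed (rc=${rc}); continuing stop without cluster status"
fi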

Best Regards,
Hideo Yamauchi.


- Original Message -
> From: Klaus Wenninger 
> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
> open-source clustering welcomed 
> Cc: 
> Date: 2021/4/9, Fri 21:12
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
> 
> On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:
>>  Hi Ken,
>>  Hi All,
>> 
>>  In the pgsql resource, crm_mon is executed in the process of demote and 
> stop, and the result is processed.
>> 
>>  However, pacemaker included in RHEL8.4beta fails to execute this crm_mon.
>>    - The problem also occurs on github 
> master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).
>> 
>>  The problem can be easily reproduced in the following ways.
>> 
>>  Step1. Modify to execute crm_mon in the stop process of the Dummy resource.
>>  
>> 
>>  dummy_stop() {
>>       mon=$(crm_mon -1)
>>       ret=$?
>>       ocf_log info "### YAMAUCHI  crm_mon[${ret}] : ${mon}"
>>       dummy_monitor
>>       if [ $? =  $OCF_SUCCESS ]; then
>>           rm ${OCF_RESKEY_state}
>>       fi
>>       return $OCF_SUCCESS
>>  }
>>  
>> 
>>  Step2. Configure a cluster with two nodes.
>>  
>> 
>>  [root@rh84-beta01 ~]# crm_mon -rfA1
>>  Cluster Summary:
>>     * Stack: corosync
>>     * Current DC: rh84-beta01 (version 2.0.5-8.el8-ba59be7122) - partition 
> with quorum
>>     * Last updated: Thu Apr  8 18:00:52 2021
>>     * Last change:  Thu Apr  8 18:00:38 2021 by root via cibadmin on 
> rh84-beta01
>>     * 2 nodes configured
>>     * 1 resource instance configured
>> 
>>  Node List:
>>     * Online: [ rh84-beta01 rh84-beta02 ]
>> 
>>  Full List of Resources:
>>     * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta01
>> 
>>  Migration Summary:
>>  
>> 
>>  Step3. Stop the node where the Dummy resource is running. The resource will 
> fail over.
>>  
>>  [root@rh84-beta02 ~]# crm_mon -rfA1
>>  Cluster Summary:
>>     * Stack: corosync
>>     * Current DC: rh84-beta02 (version 2.0.5-8.el8-ba59be7122) - partition 
> with quorum
>>     * Last updated: Thu Apr  8 18:08:56 2021
>>     * Last change:  Thu Apr  8 18:05:08 2021 by root via cibadmin on 
> rh84-beta01
>>     * 2 nodes configured
>>     * 1 resource instance configured
>> 
>>  Node List:
>>     * Online: [ rh84-beta02 ]
>>     * OFFLINE: [ rh84-beta01 ]
>> 
>>  Full List of Resources:
>>     * dummy-1     (ocf::heartbeat:Dummy):  Started rh84-beta02
>>  
>> 
>>  However, if you look at the log, you can see that the execution of crm_mon 
> in the stop processing of the Dummy resource has failed.
>> 
>>  
>>  Apr 08 18:05:17  Dummy(dummy-1)[2631]:    INFO: ### YAMAUCHI  
> crm_mon[102] : Pacemaker daemons shutting down ...
>>  Apr 08 18:05:17 rh84-beta01 pacemaker-execd     [2219] (log_op_output)  
> notice: dummy-1_stop_0[2631] error output [ crm_mon: Error: cluster is not 
> available on this node ]
> Hmm ... is that with selinux enabled?
> Respectively do you see any related avc messages?
> 
> Klaus
>>  
>> 
>>  Similarly, pgsql also executes crm_mon with demote or stop, so control 
> fails.
>> 
>>  The problem seems to be related to the next fix.
>>    * Report pacemakerd in state waiting for sbd
>>     - https://github.com/ClusterLabs/pacemaker/pull/2278 
>> 
>>  The problem does not occur with the release version of Pacemaker 2.0.5 or 
> the Pacemaker included with RHEL8.3.
>> 
>>  This issue has a huge impact on the user.
>> 
>>  Perhaps it also affects the control of other resources that utilize 
> crm_mon.
>> 
>>  Please improve the release version of RHEL8.4 so that it includes Pacemaker 
> which does not cause this problem.
>>    * Distributions other than RHEL may also be affected in future releases.
>> 
>>  
>>  This content is the same as the following Bugzilla.
>>    - https://bugs.clusterlabs.org/show_bug.cgi?id=5471 
>>  
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
>> 
>>  ___
>>  Manage your subscription:
>>  https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>>  ClusterLabs home: https://www.clusterlabs.org/ 
> 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-14 Thread renayama19661014
Hi Klaus,
Hi Ken,

We have confirmed in our tests that the fix improves the behaviour.
Thank you for your prompt response.

We look forward to this fix being included in the release version of RHEL 8.4.

Best Regards,
Hideo Yamauchi.



- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: "kwenn...@redhat.com" ; Cluster Labs - All topics 
> related to open-source clustering welcomed ; Cluster 
> Labs - All topics related to open-source clustering welcomed 
> 
> Cc: 
> Date: 2021/4/13, Tue 07:08
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
> 
> Hi Klaus,
> Hi Ken,
> 
>>  I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
> 
>>  I guess the simplest possible solution to the immediate issue so
>>  that we can discuss it.
> 
> 
> Thank you for the fix.
> 
> 
> I have confirmed that the fixes have been merged.
> 
> I'll test this fix today just in case.
> 
> Many thanks,
> Hideo Yamauchi.
> 
> 
> - Original Message -
>>  From: Klaus Wenninger 
>>  To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
> open-source clustering welcomed 
>>  Cc: 
>>  Date: 2021/4/12, Mon 22:22
>>  Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
>> 
>>  On 4/9/21 5:13 PM, Klaus Wenninger wrote:
>>>   On 4/9/21 4:04 PM, Klaus Wenninger wrote:
   On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>   On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>   On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
>>>   Hi Klaus,
>>> 
>>>   Thanks for your comment.
>>> 
   Hmm ... is that with selinux enabled?
   Respectively do you see any related avc messages?
>>> 
>>>   Selinux is not enabled.
>>>   Isn't crm_mon caused by not returning a response 
> when 
>>  pacemakerd 
>>>   prepares to stop?
>   yep ... that doesn't look good.
>   While in pcmk_shutdown_worker ipc isn't handled.
   Stop ... that should actually work as pcmk_shutdown_worker
   should exit quite quickly and proceed after mainloop
   dispatching when called again.
   Don't see anything atm that might be blocking for longer ...
   but let me dig into it further ...
>>>   What happens is clear (thanks Ken for the hint ;-) ).
>>>   When pacemakerd is shutting down - already when it
>>>   shuts down the resources and not just when it starts to
>>>   reap the subdaemons - crm_mon reads that state and
>>>   doesn't try to connect to the cib anymore.
>>  I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
>>  I guess the simplest possible solution to the immediate issue so
>>  that we can discuss it.
>   Question is why that didn't create issue earlier.
>   Probably I didn't test with resources that had crm_mon in
>   their stop/monitor-actions but sbd should have run into
>   issues.
> 
>   Klaus
>>   But when shutting down a node the resources should be
>>   shutdown before pacemakerd goes down.
>>   But let me have a look if it can happen that pacemakerd
>>   doesn't react to the ipc-pings before. That btw. might 
> be
>>   lethal for sbd-scenarios (if the phase is too long and it
>>   migh actually not be defined).
>> 
>>   My idea with selinux would have been that it might block
>>   the ipc if crm_mon is issued by execd. But well forget
>>   about it as it is not enabled ;-)
>> 
>> 
>>   Klaus
>>> 
>>>   pgsql needs the result of crm_mon in demote processing 
> and 
>>  stop 
>>>   processing.
>>>   crm_mon should return a response even after pacemakerd 
> goes 
>>  into a 
>>>   stop operation.
>>> 
>>>   Best Regards,
>>>   Hideo Yamauchi.
>>> 
>>> 
>>>   - Original Message -
   From: Klaus Wenninger 
   To: renayama19661...@ybb.ne.jp; Cluster Labs - All 
> 
>>  topics related 
   to open-source clustering welcomed 
>>  
   Cc:
   Date: 2021/4/9, Fri 21:12
   Subject: Re: [ClusterLabs] [Problem] In 
> RHEL8.4beta, 
>>  pgsql 
   resource control fails.
 
   On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp 
> wrote:
>     Hi Ken,
>     Hi All,
> 
>     In the pgsql resource, crm_mon is executed 
> in the 
>>  process of 
>   demote and
   stop, and the result is processed.
>     However, pacemaker included in RHEL8.4beta 
> fails 
>>  to execute 
>   this crm_mon.
>       - The problem also occurs on github
   master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).
>     The problem can be easily reproduced in the 
>>  following ways.
> 
>     Step1. Modify to execute crm_mon in the stop 
> 
>>  process of the 
>   Dummy resource.
>     
> 
>     dummy_stop() {
>          mon=$(crm_mon -1)
>          ret=$?

Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-15 Thread renayama19661014
Hi All,

Sorry...
Due to an operation mistake on my part, the same email was sent multiple times.


Best Regards,
Hideo Yamauchi.


- Original Message -
> From: "renayama19661...@ybb.ne.jp" 
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> ; Cluster Labs - All topics related to open-source 
> clustering welcomed 
> Cc: 
> Date: 2021/4/15, Thu 11:45
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
> 
> Hi Klaus,
> Hi Ken,
> 
> We have confirmed that the operation is improved by the test.
> Thank you for your prompt response.
> 
> We look forward to including this fix in the release version of RHEL 8.4.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> 
> - Original Message -
>>  From: "renayama19661...@ybb.ne.jp" 
> 
>>  To: "kwenn...@redhat.com" ; Cluster 
> Labs - All topics related to open-source clustering welcomed 
> ; Cluster Labs - All topics related to open-source 
> clustering welcomed 
>>  Cc: 
>>  Date: 2021/4/13, Tue 07:08
>>  Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
>> 
>>  Hi Klaus,
>>  Hi Ken,
>> 
>>>   I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 
> with
>> 
>>>   I guess the simplest possible solution to the immediate issue so
>>>   that we can discuss it.
>> 
>> 
>>  Thank you for the fix.
>> 
>> 
>>  I have confirmed that the fixes have been merged.
>> 
>>  I'll test this fix today just in case.
>> 
>>  Many thanks,
>>  Hideo Yamauchi.
>> 
>> 
>>  - Original Message -
>>>   From: Klaus Wenninger 
>>>   To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
>>  open-source clustering welcomed 
>>>   Cc: 
>>>   Date: 2021/4/12, Mon 22:22
>>>   Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource 
> control 
>>  fails.
>>> 
>>>   On 4/9/21 5:13 PM, Klaus Wenninger wrote:
    On 4/9/21 4:04 PM, Klaus Wenninger wrote:
>    On 4/9/21 3:45 PM, Klaus Wenninger wrote:
>>    On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>>>    On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
    Hi Klaus,
 
    Thanks for your comment.
 
>    Hmm ... is that with selinux enabled?
>    Respectively do you see any related avc 
> messages?
 
    Selinux is not enabled.
    Isn't crm_mon caused by not returning a 
> response 
>>  when 
>>>   pacemakerd 
    prepares to stop?
>>    yep ... that doesn't look good.
>>    While in pcmk_shutdown_worker ipc isn't handled.
>    Stop ... that should actually work as pcmk_shutdown_worker
>    should exit quite quickly and proceed after mainloop
>    dispatching when called again.
>    Don't see anything atm that might be blocking for longer 
> ...
>    but let me dig into it further ...
    What happens is clear (thanks Ken for the hint ;-) ).
    When pacemakerd is shutting down - already when it
    shuts down the resources and not just when it starts to
    reap the subdaemons - crm_mon reads that state and
    doesn't try to connect to the cib anymore.
>>>   I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 
> with
>>>   I guess the simplest possible solution to the immediate issue so
>>>   that we can discuss it.
>>    Question is why that didn't create issue earlier.
>>    Probably I didn't test with resources that had 
> crm_mon in
>>    their stop/monitor-actions but sbd should have run into
>>    issues.
>> 
>>    Klaus
>>>    But when shutting down a node the resources should be
>>>    shutdown before pacemakerd goes down.
>>>    But let me have a look if it can happen that 
> pacemakerd
>>>    doesn't react to the ipc-pings before. That btw. 
> might 
>>  be
>>>    lethal for sbd-scenarios (if the phase is too long 
> and it
>>>    migh actually not be defined).
>>> 
>>>    My idea with selinux would have been that it might 
> block
>>>    the ipc if crm_mon is issued by execd. But well 
> forget
>>>    about it as it is not enabled ;-)
>>> 
>>> 
>>>    Klaus
 
    pgsql needs the result of crm_mon in demote 
> processing 
>>  and 
>>>   stop 
    processing.
    crm_mon should return a response even after 
> pacemakerd 
>>  goes 
>>>   into a 
    stop operation.
 
    Best Regards,
    Hideo Yamauchi.
 
 
    - Original Message -
>    From: Klaus Wenninger 
> 
>    To: renayama19661...@ybb.ne.jp; Cluster Labs 
> - All 
>> 
>>>   topics related 
>    to open-source clustering welcomed 
>>>   
>    Cc:
>    Date: 2021/4/9, Fri 21:12
>    Subject: Re: [ClusterLabs] [Problem] In 
>>  RHEL8.4beta, 
>>>   pgsql 
>    resource control fails.
> 
>    On 4/8/21 11:21 PM, 
> renayama19661...@ybb.ne.jp 
>>  wrote:

Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control fails.

2021-04-12 Thread renayama19661014
Hi Klaus,
Hi Ken,

> I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with

> I guess the simplest possible solution to the immediate issue so
> that we can discuss it.


Thank you for the fix.


I have confirmed that the fixes have been merged.

I'll test this fix today just in case.

Many thanks,
Hideo Yamauchi.


- Original Message -
> From: Klaus Wenninger 
> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to 
> open-source clustering welcomed 
> Cc: 
> Date: 2021/4/12, Mon 22:22
> Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, pgsql resource control 
> fails.
> 
> On 4/9/21 5:13 PM, Klaus Wenninger wrote:
>>  On 4/9/21 4:04 PM, Klaus Wenninger wrote:
>>>  On 4/9/21 3:45 PM, Klaus Wenninger wrote:
  On 4/9/21 3:36 PM, Klaus Wenninger wrote:
>  On 4/9/21 2:37 PM, renayama19661...@ybb.ne.jp wrote:
>>  Hi Klaus,
>> 
>>  Thanks for your comment.
>> 
>>>  Hmm ... is that with selinux enabled?
>>>  Respectively do you see any related avc messages?
>> 
>>  Selinux is not enabled.
>>  Isn't crm_mon caused by not returning a response when 
> pacemakerd 
>>  prepares to stop?
  yep ... that doesn't look good.
  While in pcmk_shutdown_worker ipc isn't handled.
>>>  Stop ... that should actually work as pcmk_shutdown_worker
>>>  should exit quite quickly and proceed after mainloop
>>>  dispatching when called again.
>>>  Don't see anything atm that might be blocking for longer ...
>>>  but let me dig into it further ...
>>  What happens is clear (thanks Ken for the hint ;-) ).
>>  When pacemakerd is shutting down - already when it
>>  shuts down the resources and not just when it starts to
>>  reap the subdaemons - crm_mon reads that state and
>>  doesn't try to connect to the cib anymore.
> I've opened https://github.com/ClusterLabs/pacemaker/pull/2342 with
> I guess the simplest possible solution to the immediate issue so
> that we can discuss it.
  Question is why that didn't create issue earlier.
  Probably I didn't test with resources that had crm_mon in
  their stop/monitor-actions but sbd should have run into
  issues.
 
  Klaus
>  But when shutting down a node the resources should be
>  shutdown before pacemakerd goes down.
>  But let me have a look if it can happen that pacemakerd
>  doesn't react to the ipc-pings before. That btw. might be
>  lethal for sbd-scenarios (if the phase is too long and it
>  migh actually not be defined).
> 
>  My idea with selinux would have been that it might block
>  the ipc if crm_mon is issued by execd. But well forget
>  about it as it is not enabled ;-)
> 
> 
>  Klaus
>> 
>>  pgsql needs the result of crm_mon in demote processing and 
> stop 
>>  processing.
>>  crm_mon should return a response even after pacemakerd goes 
> into a 
>>  stop operation.
>> 
>>  Best Regards,
>>  Hideo Yamauchi.
>> 
>> 
>>  - Original Message -
>>>  From: Klaus Wenninger 
>>>  To: renayama19661...@ybb.ne.jp; Cluster Labs - All 
> topics related 
>>>  to open-source clustering welcomed 
> 
>>>  Cc:
>>>  Date: 2021/4/9, Fri 21:12
>>>  Subject: Re: [ClusterLabs] [Problem] In RHEL8.4beta, 
> pgsql 
>>>  resource control fails.
>>> 
>>>  On 4/8/21 11:21 PM, renayama19661...@ybb.ne.jp wrote:
    Hi Ken,
    Hi All,
 
    In the pgsql resource, crm_mon is executed in the 
> process of 
  demote and
>>>  stop, and the result is processed.
    However, pacemaker included in RHEL8.4beta fails 
> to execute 
  this crm_mon.
      - The problem also occurs on github
>>>  master(c40e18f085fad9ef1d9d79f671ed8a69eb3e753f).
    The problem can be easily reproduced in the 
> following ways.
 
    Step1. Modify to execute crm_mon in the stop 
> process of the 
  Dummy resource.
    
 
    dummy_stop() {
         mon=$(crm_mon -1)
         ret=$?
         ocf_log info "### YAMAUCHI  
> crm_mon[${ret}] : ${mon}"
         dummy_monitor
         if [ $? =  $OCF_SUCCESS ]; then
             rm ${OCF_RESKEY_state}
         fi
         return $OCF_SUCCESS
    }
    
 
    Step2. Configure a cluster with two nodes.
    
 
    [root@rh84-beta01 ~]# crm_mon -rfA1
    Cluster Summary:
       * Stack: corosync
       * Current DC: rh84-beta01 (version 
> 2.0.5-8.el8-ba59be7122) 
  - partition
>>>  with quorum
       * Last updated: Thu Apr  8 18:00:52 2021
       * Last change:  Thu Apr  8 18:00:38 2021 by 
> root via 
  cibadmin on
>>>  rh84-beta01
       * 2 nodes configured
       * 1 resource instance configured

[ClusterLabs] [Problem] In RHEL8.7beta, pgsql resource control fails.

2022-10-05 Thread renayama19661014
Hi Ken,
Hi All,

The problem that occurred in RHEL8.4 also occurs with the Pacemaker bundled in 
RHEL8.7beta.


(snip)
Oct 06 11:40:55  pgsql(pgsql)[17503]: WARNING: Retrying(remain 86115). 
"crm_mon -1 --output-as=xml" failed. rc=102. stdout="
  crm_mon: Error: cluster is not available on this node
".
Oct 06 11:40:56  pgsql(pgsql)[17503]: WARNING: Retrying(remain 86114). 
"crm_mon -1 --output-as=xml" failed. rc=102. stdout="
  crm_mon: Error: cluster is not available on this node
".
Oct 06 11:40:57  pgsql(pgsql)[17503]: WARNING: Retrying(remain 86113). 
"crm_mon -1 --output-as=xml" failed. rc=102. stdout="
  crm_mon: Error: cluster is not available on this node
".
(The XML wrapper elements in each stdout dump were stripped by the list 
archive; only the error text is reproduced here.)
(snip)



The previous fix is already included in the Pacemaker shipped with 
RHEL8.7beta, so this appears to be a different problem.

Because of this problem, stopping pgsql resources (and similar operations) 
fails.
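
For context, warnings like the ones above typically come from a retry loop of 
roughly this shape (an illustrative sketch only, not the actual pgsql RA code; 
the initial retry budget of 86400 is an assumption). As long as crm_mon keeps 
returning rc=102, the loop never succeeds, so the calling demote/stop action 
eventually times out:

retry_crm_mon() {
    remain=86400   # assumed retry budget, counted down once per second
    while [ "$remain" -gt 0 ]; do
        out=$(crm_mon -1 --output-as=xml 2>&1)
        rc=$?
        if [ $rc -eq 0 ]; then
            printf '%s\n' "$out"
            return 0
        fi
        ocf_log warn "Retrying(remain $remain). \"crm_mon -1 --output-as=xml\" failed. rc=$rc."
        remain=$((remain - 1))
        sleep 1
    done
    return 1
}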

I request your prompt response.

* This content is also registered in the following Bugzilla.
  - https://bugs.clusterlabs.org/show_bug.cgi?id=5501

Best Regards,
Hideo Yamauchi.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] [Problem] crm_attribute fails to expand run options.

2023-03-06 Thread renayama19661014
Hi All,

The crm_attribute command falls back to the OCF_RESOURCE_INSTANCE environment 
variable for the resource name when the -p option is given without a value.

However, if -INFINITY is passed as the value of the -v option in that case, 
crm_attribute incorrectly treats -INFINITY as further options rather than a 
value, and processing fails.


[root@rh91-01dev tools]# crm_attribute -p pgsql  -v 100
[root@rh91-01dev tools]# crm_attribute -p pgsql  -v -INFINITY

[root@rh91-01dev tools]# OCF_RESOURCE_INSTANCE=pgsql crm_attribute -p  -v 100
[root@rh91-01dev tools]# OCF_RESOURCE_INSTANCE=pgsql crm_attribute -p  -v 
-INFINITY
crm_attribute: Could not map name=FINITY to a UUID



This problem occurs with the latest resource-agents running on RHEL9.1, and 
also with the development version of Pacemaker.


Due to this issue, clusters that use resource agents such as pgsql fail to be 
configured correctly with resource-agents 4.12 and later.

It is a very serious problem.

As a provisional workaround, RAs such as pgsql should pass the resource name 
explicitly to the -p option (i.e. -p $OCF_RESOURCE_INSTANCE) when paired with 
a Pacemaker version that does not resolve this issue.

(snip)
ocf_promotion_score() {
    ocf_version_cmp "$OCF_RESKEY_crm_feature_set" "3.10.0"
    res=$?
    # Feature set 3.10.0 or later (or no crm_master binary): set the
    # promotion score via crm_attribute, passing the resource name
    # explicitly so -p needs no expansion.
    if [ $res -eq 2 ] || [ $res -eq 1 ] || ! have_binary "crm_master"; then
        ${HA_SBIN_DIR}/crm_attribute -p $OCF_RESOURCE_INSTANCE $@
    else
        # Older feature sets keep using the legacy crm_master interface.
        ${HA_SBIN_DIR}/crm_master -l reboot $@
    fi
(snip)
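
For illustration, an RA would then set or clear a promotion score through such 
a helper rather than calling crm_attribute directly (a sketch only; the exact 
call sites in pgsql may differ):

# e.g. in a promotable RA, forbid promotion on this node:
ocf_promotion_score -v -INFINITY
# and later remove the score again:
ocf_promotion_score -D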


This content has also been registered in the following Bugzilla:
https://bugs.clusterlabs.org/show_bug.cgi?id=5509

Best Regards,
Hideo Yamauchi.

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/