Re: [Pacemaker] Two node cluster and no hardware device for stonith.
emmanuel segura emi2fast@... writes: sorry, but i forgot to tell you: fence_scsi doesn't reboot the evicted node, so you can combine fence_vmware with fence_scsi as the second option.
For this I'm trying to use a watchdog script, https://access.redhat.com/solutions/65187 . But when I start the watchdog daemon, all nodes reboot. I continue testing...
2015-01-27 11:44 GMT+01:00 emmanuel segura emi2fast at gmail.com: In a normal situation every node can write to your file system; fence_scsi is used when your cluster is in split-brain, when a node doesn't communicate with the other node. I don't think it is a good idea.
So, will I see key registrations only when the nodes lose communication?
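A minimal sketch of the watchdog integration referenced above, assuming the fence_scsi_check helper shipped with the fence-agents package (the exact path varies by distribution and is an assumption here):

# Copy the check script where the watchdog daemon looks for tests,
# then start watchdog. If this node's SCSI reservation key disappears
# (i.e. the node was fenced), the check fails and the node reboots.
cp /usr/share/cluster/fence_scsi_check /etc/watchdog.d/   # path: assumption
service watchdog start

If every node reboots as soon as watchdog starts, the check is likely failing because the keys are not registered yet; unfencing (meta provides=unfencing on the stonith resource, as configured later in this thread) has to happen first.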
Re: [Pacemaker] Two node cluster and no hardware device for stonith.
please show your configuration and your logs.
2015-01-27 14:24 GMT+01:00 Andrea a.bac...@codices.com: yes, stonith is enabled; if I disable it, the stonith device doesn't start. So I must see 2 keys registered when I add the fence_scsi device? But I don't see 2 keys registered...
Andrea
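For reference, a quick way to check the settings discussed in this thread, using standard pcs commands (resource and property names as they appear in the thread):

# Show cluster-wide properties, including stonith-enabled
pcs property
# Enable stonith if it is off
pcs property set stonith-enabled=true
# Show the configured stonith devices and the overall state
pcs stonith show
pcs status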
Re: [Pacemaker] Segfault on monitor resource
maybe you can use sar to check whether your server was tight on resources?
Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output: (Nginx-rsc:monitor:stderr) Killed /usr/lib/ocf/resource.d//heartbeat/nginx: 910: /usr/lib/ocf/resource.d//heartbeat/nginx: Cannot fork
2015-01-26 18:22 GMT+01:00 Oscar Salvador osalvador.vilard...@gmail.com: Oh, I forgot some important details:
root# (S) crm status
Last updated: Mon Jan 26 18:21:35 2015
Last change: Sun Jan 25 05:19:13 2015 via crm_resource on lb01
Stack: Heartbeat
Current DC: lb01 (43b2c5a1-9552-4438-962b-6e98a2dd67c7) - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, unknown expected votes
8 Resources configured.
Online: [ lb01 lb02 ]
IP-rsc_mysql (ocf::heartbeat:IPaddr2): Started lb02
IP-rsc_nginx (ocf::heartbeat:IPaddr2): Started lb02
IP-rsc_nginx6 (ocf::heartbeat:IPv6addr): Started lb02
IP-rsc_mysql6 (ocf::heartbeat:IPv6addr): Started lb02
IP-rsc_elasticsearch6 (ocf::heartbeat:IPv6addr): Started lb02
IP-rsc_elasticsearch (ocf::heartbeat:IPaddr2): Started lb02
Ldirector-rsc (ocf::heartbeat:ldirectord): Started lb02
Nginx-rsc (ocf::heartbeat:nginx): Started lb02
This is running on: Debian 7.8, pacemaker 1.1.7-1
2015-01-26 18:20 GMT+01:00 Oscar Salvador osalvador.vilard...@gmail.com: Hi! I'm writing here because two days ago I experienced a strange problem in my Pacemaker cluster. Everything was working fine, till suddenly a segfault in the Nginx monitor resource happened:
Jan 25 03:55:24 lb02 crmd: [9975]: notice: run_graph: Transition 7551 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-90.bz2): Complete
Jan 25 03:55:24 lb02 crmd: [9975]: notice: do_state_transition: State transition S_TRANSITION_ENGINE - S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jan 25 04:00:08 lb02 cib: [9971]: info: cib_stats: Processed 1 operations (0.00us average, 0% utilization) in the last 10min
Jan 25 04:10:24 lb02 crmd: [9975]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (90ms)
Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jan 25 04:10:24 lb02 crmd: [9975]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Jan 25 04:10:24 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing failed op Ldirector-rsc_last_failure_0 on lb02: not running (7)
Jan 25 04:10:24 lb02 pengine: [10028]: notice: common_apply_stickiness: Ldirector-rsc can fail 97 more times on lb02 before being forced off
Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jan 25 04:10:24 lb02 pengine: [10028]: notice: process_pe_message: Transition 7552: PEngine Input stored in: /var/lib/pengine/pe-input-90.bz2
Jan 25 04:10:24 lb02 crmd: [9975]: info: do_te_invoke: Processing graph 7552 (ref=pe_calc-dc-1422155424-7644) derived from /var/lib/pengine/pe-input-90.bz2
Jan 25 04:10:24 lb02 crmd: [9975]: notice: run_graph: Transition 7552 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-90.bz2): Complete
Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State transition S_TRANSITION_ENGINE - S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output: (Nginx-rsc:monitor:stderr) Segmentation fault *** here it starts
As you can see, the last line. And then:
Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output: (Nginx-rsc:monitor:stderr) Killed /usr/lib/ocf/resource.d//heartbeat/nginx: 910: /usr/lib/ocf/resource.d//heartbeat/nginx: Cannot fork
I guess here Nginx was killed. And then I have some other errors till Pacemaker decides to move the resources to the other node:
Jan 25 04:10:30 lb02 crmd: [9975]: info: process_lrm_event: LRM operation Nginx-rsc_monitor_1 (call=52, rc=2, cib-update=7633, confirmed=false) invalid parameter
Jan 25 04:10:30 lb02 crmd: [9975]: info: process_graph_event: Detected action Nginx-rsc_monitor_1 from a different transition: 5739 vs. 7552
Jan 25 04:10:30 lb02 crmd: [9975]: info: abort_transition_graph: process_graph_event:476 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=Nginx-rsc_last_failure_0, magic=0:2;4:5739:0:42d1ed53-9686-4174-84e7-d2c230ed8832, cib=3.14.40) : Old event
Jan 25 04:10:30 lb02 crmd: [9975]: WARN: update_failcount: Updating failcount for Nginx-rsc on lb02 after failed monitor: rc=2 (update=value++, time=1422155430)
Jan 25 04:10:30 lb02 crmd: [9975]: notice: do_state_transition: State transition S_IDLE -
Re: [Pacemaker] Segfault on monitor resource
Hi,
On Mon, Jan 26, 2015 at 06:20:35PM +0100, Oscar Salvador wrote: [...] Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output: (Nginx-rsc:monitor:stderr) Segmentation fault *** here it starts
What exactly did segfault? Do you have a core dump to examine?
As you can see, the last line. And then: Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output: (Nginx-rsc:monitor:stderr) Killed /usr/lib/ocf/resource.d//heartbeat/nginx: 910: /usr/lib/ocf/resource.d//heartbeat/nginx: Cannot fork
This could be related to the segfault, or due to other serious system error.
I guess here Nginx was killed. And then I have some other errors till Pacemaker decides to move the resources to the other node:
Jan 25 04:10:30 lb02 crmd: [9975]: info: process_lrm_event: LRM operation Nginx-rsc_monitor_1 (call=52, rc=2, cib-update=7633, confirmed=false) invalid parameter
Jan 25 04:10:30 lb02 crmd: [9975]: info: process_graph_event: Detected action Nginx-rsc_monitor_1 from a different transition: 5739 vs. 7552
Jan 25 04:10:30 lb02 crmd: [9975]: info: abort_transition_graph: process_graph_event:476 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=Nginx-rsc_last_failure_0, magic=0:2;4:5739:0:42d1ed53-9686-4174-84e7-d2c230ed8832, cib=3.14.40) : Old event
Jan 25 04:10:30 lb02 crmd: [9975]: WARN: update_failcount: Updating failcount for Nginx-rsc on lb02 after failed monitor: rc=2 (update=value++, time=1422155430)
Jan 25 04:10:30 lb02 crmd: [9975]: notice: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jan 25 04:10:30 lb02 attrd: [9974]: info: log-rotate detected on logfile /var/log/ha-log
Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-Nginx-rsc (1)
Jan 25 04:10:30 lb02 pengine: [10028]: ERROR: unpack_rsc_op: Preventing Nginx-rsc from re-starting on lb02: operation monitor failed 'invalid parameter' (rc=2)
Jan 25 04:10:30 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing failed op Nginx-rsc_last_failure_0 on lb02: invalid parameter (2)
Jan 25 04:10:30 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing failed op Ldirector-rsc_last_failure_0 on lb02: not running (7)
Jan 25 04:10:30 lb02 pengine: [10028]: notice: common_apply_stickiness: Ldirector-rsc can fail 97 more times on lb02 before being forced off
Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop IP-rsc_mysql (lb02)
Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop IP-rsc_nginx (lb02)
Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop IP-rsc_nginx6 (lb02)
Jan 25 04:10:30 lb02 pengine: [10028]: notice:
Re: [Pacemaker] pacemaker-remote not listening
- Original Message -
Hi, my os is debian-wheezy. i compiled and installed pacemaker-remote. Startup log:
Jan 27 16:04:30 [2859] vm1 pacemaker_remoted: info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root
Jan 27 16:04:30 [2859] vm1 pacemaker_remoted: info: qb_ipcs_us_publish: server name: lrmd
Jan 27 16:04:30 [2859] vm1 pacemaker_remoted: info: main: Starting
My problem is that pacemaker_remote is not listening on port 3121.
By default pacemaker_remote should listen on 3121. This is odd. One thing I can think of: take a look at /etc/sysconfig/pacemaker on the node running pacemaker_remote. Make sure there isn't a custom port set using the PCMK_remote_port variable. http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Remote/index.html#_pacemaker_and_pacemaker_remote_options
-- Vossel
netstat -tulpen | grep 3121
netstat -alpen
Proto RefCnt Flags Type State I-Node PID/Program name Path
unix 2 [ ACC ] STREAM LISTENING 6635 2859/pacemaker_remo @lrmd
unix 2 [ ] DGRAM 6634 2859/pacemaker_remo
...
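A quick check along the lines Vossel suggests; on Debian the environment file is typically /etc/default/pacemaker rather than /etc/sysconfig/pacemaker (an assumption worth verifying on a self-compiled build):

# Look for a custom port override in either environment file
grep -i PCMK_remote_port /etc/sysconfig/pacemaker /etc/default/pacemaker 2>/dev/null
# Confirm whether anything is bound to the default TCP port
netstat -tlpn | grep 3121 || echo "nothing listening on 3121"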
Re: [Pacemaker] Segfault on monitor resource
Hi,
I've checked the resource graphs I have, and the resources were fine, so I think it's not a problem due to high memory use or something like that. And unfortunately I don't have a core dump to analyze (I'll enable it for a future case), so the only thing I have are the logs.
For the line below, I thought it was the process in charge of monitoring nginx that was killed due to a segfault:
RA output: (Nginx-rsc:monitor:stderr) Segmentation fault
I've checked the Nginx logs, and there is nothing worthwhile there, actually there is no activity, so I think it has to be something internal that caused the failure. I'll enable core dumps, it's the only thing I can do for now.
Thank you very much
Oscar
2015-01-27 10:39 GMT+01:00 Dejan Muhamedagic deja...@fastmail.fm: What exactly did segfault? Do you have a core dump to examine? [...] This could be related to the segfault, or due to other serious system error.
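For completeness, one way to enable core dumps so a future occurrence can be examined (generic Linux settings; how the limit is applied to the cluster daemons is distribution-specific and an assumption here):

# Allow core files for the shell that starts the daemons (often set in
# the init script or /etc/security/limits.conf)
ulimit -c unlimited
# Write cores with a predictable name and location
sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p
# Verify with a throwaway process that deliberately segfaults itself
sh -c 'kill -SEGV $$'; ls /var/tmp/core.*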
Re: [Pacemaker] Two node cluster and no hardware device for stonith.
Andrea a.bacchi@... writes:
Michael Schwartzkopff ms at ... writes: On Thursday, 22 January 2015, 10:03:38, E. Kuemmerle wrote: On 21.01.2015 11:18 Digimer wrote: On 21/01/15 08:13 AM, Andrea wrote: Hi All, I have a question about stonith. In my scenario I have to create a 2 node cluster, but I don't have any hardware device for stonith.
Are you sure that you do not have fencing hardware? Perhaps you just did not configure it? Please read the manual of your BIOS and check your system board for an IPMI interface.
In my test, when I simulate a network failure, split brain occurs, and when the network comes back, one node kills the other node.
-log on node 1: Jan 21 11:45:28 corosync [CMAN ] memb: Sending KILL to node 2
-log on node 2: Jan 21 11:45:28 corosync [CMAN ] memb: got KILL for node 2
That is how fencing works. Kind regards, Michael Schwartzkopff
Hi All, many thanks for your replies. I will update my scenario to ask about adding some devices for stonith.
- Option 1: I will ask for two vmware virtual machines, so I can try fence_vmware.
- Option 2: The project may need shared storage. In this case, the shared storage will be a NAS that I can add to my nodes via iscsi, and I can try fence_scsi.
I will write here about news. Many thanks to all for the support. Andrea
some news
- Option 2
In the customer environment I configured an iscsi target that our project will use as a cluster filesystem:
[ONE]pvcreate /dev/sdb
[ONE]vgcreate -Ay -cy cluster_vg /dev/sdb
[ONE]lvcreate -L*G -n cluster_lv cluster_vg
[ONE]mkfs.gfs2 -j2 -p lock_dlm -t ProjectHA:ArchiveFS /dev/cluster_vg/cluster_lv
now I can add a Filesystem resource:
[ONE]pcs resource create clusterfs Filesystem device=/dev/cluster_vg/cluster_lv directory=/var/mountpoint fstype=gfs2 options=noatime op monitor interval=10s clone interleave=true
and I can read and write from both nodes. Now I'd like to use this device with fence_scsi. Is that ok? Because I see this in the man page:
"The fence_scsi agent works by having each node in the cluster register a unique key with the SCSI device(s). Once registered, a single node will become the reservation holder by creating a write exclusive, registrants only reservation on the device(s). The result is that only registered nodes may write to the device(s)."
That's no good for me; I need both nodes to be able to write to the device. So, do I need another device to use with fence_scsi? In that case I will try to create two partitions on this device, sdb1 and sdb2, and use sdb1 as clusterfs and sdb2 for fencing.
If I try to test this manually, I obtain, before any operation:
[ONE]sg_persist -n --read-keys --device=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4
PR generation=0x27, 1 registered reservation key follows:
0x98343e580002734d
Then I try to set the serverHA1 key:
[serverHA1]fence_scsi -d /dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4 -f /tmp/miolog.txt -n serverHA1 -o on
But nothing has changed:
[ONE]sg_persist -n --read-keys --device=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4
PR generation=0x27, 1 registered reservation key follows:
0x98343e580002734d
and in the log:
gen 26 17:53:27 fence_scsi: [debug] main::do_register_ignore (node_key=4d5a0001, dev=/dev/sde)
gen 26 17:53:27 fence_scsi: [debug] main::do_reset (dev=/dev/sde, status=6)
gen 26 17:53:27 fence_scsi: [debug] main::do_register_ignore (err=0)
The same when I try on serverHA2. Is this normal?
In any case, I tried to create a stonith device:
[ONE]pcs stonith create iscsi-stonith-device fence_scsi pcmk_host_list="serverHA1 serverHA2" devices=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4 meta provides=unfencing
and the cluster status is ok:
[ONE] pcs status
Cluster name: MyCluHA
Last updated: Tue Jan 27 11:21:48 2015
Last change: Tue Jan 27 10:46:57 2015
Stack: cman
Current DC: serverHA1 - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured
5 Resources configured
Online: [ serverHA1 serverHA2 ]
Full list of resources:
Clone Set: ping-clone [ping] Started: [ serverHA1 serverHA2 ]
Clone Set: clusterfs-clone [clusterfs] Started: [ serverHA1 serverHA2 ]
iscsi-stonith-device (stonith:fence_scsi): Started serverHA1
How can I try this from a remote connection?
Andrea
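A short way to exercise the agent by hand, using the device and node names from the message above (note: the -o off call really evicts the named node's key, so only run it against a node you are prepared to fence):

DEV=/dev/disk/by-id/scsi-36e843b608e55bb8d6d72d43bfdbc47d4
# List the currently registered keys; with both nodes unfenced there
# should be one key per node.
sg_persist -n --read-keys --device=$DEV
# Show which node currently holds the reservation
sg_persist -n --read-reservation --device=$DEV
# Remove serverHA2's key, i.e. fence it (run from serverHA1)
fence_scsi -d $DEV -n serverHA2 -o off
# serverHA2's key should now be gone from the list
sg_persist -n --read-keys --device=$DEV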
Re: [Pacemaker] Two node cluster and no hardware device for stonith.
sorry, but i forgot to tell you: fence_scsi doesn't reboot the evicted node, so you can combine fence_vmware with fence_scsi as the second option.
2015-01-27 11:44 GMT+01:00 emmanuel segura emi2f...@gmail.com: In a normal situation every node can write to your file system; fence_scsi is used when your cluster is in split-brain, when a node doesn't communicate with the other node. I don't think it is a good idea.
[Pacemaker] rrp_mode in corosync.conf
Hi all,
I've been looking for a good answer to my question, but all the information I found is ambiguous. I hope to get a good answer here =)
The only description of active and passive modes I found is:
Active: both rings are active, in use.
Passive: only one of the 2 rings is in use; the second one will be used only if the first one fails.
There is no description of how it works and what the impact is. So, my general question is: how are the rings used in active and passive modes?
Thank you,
Kostya
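For context, rrp_mode is set in the totem section alongside the redundant ring definitions; a minimal sketch (addresses are placeholders, comments paraphrase the descriptions quoted above):

totem {
    version: 2
    transport: udpu
    # "passive": one ring carries traffic, the other is used on failure;
    # "active": traffic is replicated over both rings simultaneously.
    rrp_mode: passive
}
nodelist {
    node {
        nodeid: 1
        ring0_addr: 10.0.0.1   # first ring
        ring1_addr: 10.0.1.1   # second ring
    }
}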
Re: [Pacemaker] Two node cluster and no hardware device for stonith.
In a normal situation every node can write to your file system; fence_scsi is used when your cluster is in split-brain, when a node doesn't communicate with the other node. I don't think it is a good idea.
2015-01-27 11:35 GMT+01:00 Andrea a.bac...@codices.com: Now I'd like to use this device with fence_scsi. Is that ok? [...] So, do I need another device to use with fence_scsi? In that case I will try to create two partitions on this device, sdb1 and sdb2, and use sdb1 as clusterfs and sdb2 for fencing.
[Pacemaker] Pre/post-action notifications for master-slave and clone resources
Hi,
Playing with a two-week-old git master on a two-node cluster, I discovered that only a limited set of notify operations is performed for clone and master-slave instances when all of them are being started/stopped.
Clones (anonymous):
* post-start
* pre-stop
M/S:
* post-start
* post-promote
* pre-demote
* pre-stop
According to Pacemaker Explained there should be more:
* pre-start
* pre-promote
* post-demote
* post-stop
Some notifications (pre-stop for my clone and pre-demote for the ms) are repeated twice (due to transition aborts, or to the fact that multiple instances are stopping/demoting?), but that has minor impact for me. I tested this by setting the stop-all-resources property to 'true' and 'false'. On the other hand, if I put one node with running instances into standby and then back online, I see all the missing notifications.
Is it intended that the actions above are not performed when all instances are handled simultaneously?
One more question about 'post' notifications: are they sent to the RA right after the corresponding main action finishes, or do they wait in the transition queue? In other words, is it possible to get a post-stop notification that a foreign instance has stopped while the stop action on the local instance is still running?
Best,
Vladislav
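For readers unfamiliar with the mechanism: a notify-capable RA receives the details of each notification through environment variables. A minimal sketch of the notify action in an OCF shell agent (the variable names are the documented CRM_meta ones; the case-branch handling is purely illustrative):

notify() {
    # "pre" or "post"
    local type="$OCF_RESKEY_CRM_meta_notify_type"
    # "start", "stop", "promote" or "demote"
    local op="$OCF_RESKEY_CRM_meta_notify_operation"
    case "$type-$op" in
        post-start) : ;;   # e.g. peers are up, safe to join them
        pre-stop)   : ;;   # e.g. hand work off before peers go away
    esac
    return $OCF_SUCCESS    # defined by ocf-shellfuncs
}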
Re: [Pacemaker] Two node cluster and no hardware device for stonith.
emmanuel segura emi2fast@... writes: please show your configuration and your logs.
Sorry, I used the wrong device id. Now, with the correct device id, I see 2 keys registered:
[ONE] sg_persist -n --read-keys --device=/dev/disk/by-id/scsi-36e843b60f3d0cc6d1a11d4ff0da95cd8
PR generation=0x4, 2 registered reservation keys follow:
0x4d5a0001
0x4d5a0002
Tomorrow I will do some tests of fencing... thanks
Andrea
Re: [Pacemaker] Segfault on monitor resource
On Tue, Jan 27, 2015 at 03:18:13PM +0100, Oscar Salvador wrote: For the line below, I thought it was the process in charge of monitoring nginx that was killed due to a segfault: RA output: (Nginx-rsc:monitor:stderr) Segmentation fault
This is just output captured during the execution of the RA monitor action. It could've been anything within the RA (which is just a shell script) that segfaulted.
Thanks,
Dejan
Re: [Pacemaker] new release date for resource-agents release 3.9.6
Hi Vladislav,
On Mon, Jan 26, 2015 at 06:52:21PM +0300, Vladislav Bogdanov wrote: Hi Dejan, if it is not too late, would it be possible to add output of the environment into the resource trace file when tracing is enabled?
Applied. Thanks, Dejan
--- ocf-shellfuncs.orig 2015-01-26 15:50:34.435001364 +0000
+++ ocf-shellfuncs 2015-01-26 15:49:19.707001542 +0000
@@ -822,6 +822,7 @@
 	fi
 	PS4='+ `date +%T`: ${FUNCNAME[0]:+${FUNCNAME[0]}:}${LINENO}: '
 	set -x
+	env=$( echo; printenv | sort )
 }
 ocf_stop_trace() {
 	set +x
Best, Vladislav
23.01.2015 18:45, Dejan Muhamedagic wrote: Hello everybody, someone warned us that three days is too short a period to test a release, so let's postpone the final release of resource-agents v3.9.6 to: Tuesday, Jan 27. Please do more testing in the meantime. The v3.9.6-rc1 packages are available for most popular platforms: http://download.opensuse.org/repositories/home:/dmuhamedagic:/branches:/network:/ha-clustering:/Stable
RHEL-7 and Fedora 21 are unfortunately missing, due to some strange unresolvable dependency issue. Debian/Ubuntu people can use alien. Many thanks! The resource-agents crowd
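For anyone wondering where the trace output (now including the environment dump added by the patch above) ends up: tracing is enabled per operation via the trace_ra attribute. A sketch assuming crmsh conventions; the resource name is a placeholder and the trace directory may differ between versions:

# Turn on tracing for the monitor op of a resource (crmsh shorthand)
crm resource trace myresource monitor
# ...equivalently, set the attribute directly on the operation:
#   op monitor interval=10s trace_ra=1
# Trace files land under the heartbeat trace directory, e.g.:
ls /var/lib/heartbeat/trace_ra/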
[Pacemaker] authentication in the cluster
Hi all,
Here is a situation - there are two two-node clusters. They have totally identical configuration. Nodes in the clusters are connected directly, without any switches. Here is a part of the corosync.conf file:
totem {
    version: 2
    cluster_name: mycluster
    transport: udpu
    crypto_hash: sha256
    crypto_cipher: none
    rrp_mode: passive
}
nodelist {
    node {
        name: node-a
        nodeid: 1
        ring0_addr: 169.254.0.2
        ring1_addr: 169.254.1.2
    }
    node {
        name: node-b
        nodeid: 2
        ring0_addr: 169.254.0.3
        ring1_addr: 169.254.1.3
    }
}
The only difference between those two clusters is the authentication key (/etc/corosync/authkey) - it is different for the two clusters.
QUESTION:
What will the behavior be if the following mess in the connections occurs:
ring1_addr of node-a (cluster-A) is connected to ring1_addr of node-b (cluster-B)
ring1_addr of node-a (cluster-B) is connected to ring1_addr of node-b (cluster-A)
I attached a pic which shows the connections.
My actual goal - do not let the clusters work in such a case. To achieve it, I decided to use the authentication key mechanism. But I don't know the result in the situation which I described...
Thank you,
Kostya
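For reference, per-cluster keys like the ones described above are generated with corosync-keygen and must only be copied between nodes of the same cluster:

# On one node of cluster A (reads entropy, writes /etc/corosync/authkey)
corosync-keygen
# Distribute to the other node of the SAME cluster only
scp /etc/corosync/authkey node-b:/etc/corosync/authkey
chmod 400 /etc/corosync/authkey
# Repeat independently on cluster B so the two keys differ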
Re: [Pacemaker] Two node cluster and no hardware device for stonith.
When a node is dead the registration key is removed.
2015-01-27 13:29 GMT+01:00 Andrea a.bac...@codices.com: So, will I see key registrations only when the nodes lose communication?
Re: [Pacemaker] Two node cluster and no hardware device for stonith.
is stonith enabled in crm conf?
On Tue, Jan 27, 2015 at 6:21 PM, emmanuel segura emi2f...@gmail.com wrote: When a node is dead the registration key is removed.
Re: [Pacemaker] Two node cluster and no hardware device for stonith.
if you are using cman+pacemaker you need to enable stonith and configure it in your crm config.
2015-01-27 14:05 GMT+01:00 Vinod Prabhu pvinod@gmail.com: is stonith enabled in crm conf?
Re: [Pacemaker] Two node cluster and no hardware device for stonith.
emmanuel segura emi2fast@... writes: if you are using cman+pacemaker you need to enable stonith and configure it in your crm config.
2015-01-27 14:05 GMT+01:00 Vinod Prabhu pvinod@gmail.com: is stonith enabled in crm conf?
yes, stonith is enabled:
[ONE]pcs property
Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.11-97629de
 last-lrm-refresh: 1422285715
 no-quorum-policy: ignore
 stonith-enabled: true
If I disable it, the stonith device doesn't start.
On Tue, Jan 27, 2015 at 6:21 PM, emmanuel segura emi2f...@gmail.com wrote: When a node is dead the registration key is removed.
So I must see 2 keys registered when I add the fence_scsi device? But I don't see 2 keys registered...
Andrea
[Pacemaker] pacemaker-remote not listening
Hi, my os is debian-wheezy. i compiled and installed pacemaker-remote. Startup log:
Jan 27 16:04:30 [2859] vm1 pacemaker_remoted: info: crm_log_init: Changed active directory to /var/lib/heartbeat/cores/root
Jan 27 16:04:30 [2859] vm1 pacemaker_remoted: info: qb_ipcs_us_publish: server name: lrmd
Jan 27 16:04:30 [2859] vm1 pacemaker_remoted: info: main: Starting
My problem is that pacemaker_remote is not listening on port 3121.
netstat -tulpen | grep 3121
netstat -alpen
Proto RefCnt Flags Type State I-Node PID/Program name Path
unix 2 [ ACC ] STREAM LISTENING 6635 2859/pacemaker_remo @lrmd
unix 2 [ ] DGRAM 6634 2859/pacemaker_remo
...
Re: [Pacemaker] Segfault on monitor resource
2015-01-27 17:58 GMT+01:00 Dejan Muhamedagic deja...@fastmail.fm: This is just output captured during the execution of the RA monitor action. It could've been anything within the RA (which is just a shell script) that segfaulted.
Hi,
Yes, I see. I've enabled core dumps on the system, so the next time I'll be able to check what is causing this.
Thank you very much
Oscar Salvador
Re: [Pacemaker] HA Summit Key-signing Party (was: Organizing HA Summit 2015)
What's needed? Once you have a key pair (and provided that you are using GnuPG), please run the following sequence:
# figure out the key ID for the identity to be verified;
# IDENTITY is either your associated email address/your name
# if only a single key ID matches, a specific key otherwise
# (you can use gpg -K to select a desired ID at the sec line)
KEY=$(gpg --with-colons 'IDENTITY' | grep '^pub' | cut -d: -f5)
Oops, sorry, somehow '-k' got lost above ^. Correct version:
KEY=$(gpg -k --with-colons 'IDENTITY' | grep '^pub' | cut -d: -f5)
# export the public key to a file that is suitable for exchange
gpg --export -a -- $KEY > $KEY
# verify that you have the expected data to share
gpg --with-fingerprint -- $KEY
-- Jan
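After the fingerprints have been verified in person, the usual follow-up is to sign the verified keys; a hedged sketch of that sequence (not part of Jan's instructions, and the key ID below is made up):

# import the file exchanged at the party (file name = key ID, per the scheme above)
gpg --import -- 0xDEADBEEF
# compare the fingerprint against what was verified face to face
gpg --with-fingerprint -k 0xDEADBEEF
# sign, then return the signed key to its owner or upload it, as agreed
gpg --sign-key 0xDEADBEEF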
Re: [Pacemaker] authentication in the cluster
Hi Chrissie,
I know that this setup is a crazy thing =) First of all I need to say - think about each two-node cluster as one box with two nodes.
> You can't connect clusters together like that.
I know that.
> All nodes in the cluster have just 1 authkey file.
That is true. But in this example there are two clusters, each of which has its own auth key.
> What you have there is not a ring, it's err, a linked-cross?!
Yep, I showed the wrong way of connecting two clusters.
> Why do you need to connect the two clusters together - is it for failover?
No, it is not. I really don't (and won't) connect them in that way. It's wrong. But in real life those two clusters will be standing (physically, in the same room, in the same rack) pretty close to each other. And my concern is - what if someone makes that connection by mistake? What will happen in that situation?
What I would like to get in that situation is something that prevents the simultaneous work of two nodes in one cluster - because that would cause data corruption.
The situation is pretty simple when there is only one ring_addr defined per node. In that case, when someone cross-links two separate clusters, it leads to 4 clusters, each of which is missing one node - because the two connected nodes have different auth keys, and that is why they will not see each other even when there is a connection. STONITH always works within the same cluster. So STONITH will be rebooting the other node in the cluster. That will prevent simultaneous access to the data.
I tried to do my best in describing the situation, the problem and the question. Looking forward to hearing any suggestions =)
Thank you,
Kostya
Re: [Pacemaker] authentication in the cluster
On 27/01/15 15:56, Kostiantyn Ponomarenko wrote: Hi all, here is a situation - there are two two-node clusters. They have totally identical configuration. Nodes in the clusters are connected directly, without any switches.
You can't connect clusters together like that. All nodes in the cluster have just 1 authkey file. Also, corosync clusters are a ring, even if you have two nodes. What you have there is not a ring, it's err, a linked-cross?!
Why do you need to connect the two clusters together - is it for failover? There must be a better way of achieving what you need; have a look at 'stretch clusters' (not my speciality TBH) if they are at separate sites. If you just want to run resources outside of the cluster then pacemaker_remote might be more useful. If it's just for isolation of resources then pacemaker can do that anyway, so you don't need to partition the cluster like that.
If you can explain just why you think you need this system we might be able to come up with something that will work :)
Chrissie