Re: [ClusterLabs] [Patch][glue][external/libvirt] Conversion to a lower case of hostlist.
Hi All,

We intend to change some patches, so we withdraw this patch.

Best Regards,
Hideo Yamauchi.

----- Original Message -----
> From: "renayama19661...@ybb.ne.jp"
> To: ClusterLabs-ML
> Date: 2015/9/7, Mon 09:06
> Subject: [ClusterLabs] [Patch][glue][external/libvirt] Conversion to a lower case of hostlist.
>
> Hi All,
>
> When the cluster carries out stonith, Pacemaker handles host names in lower case.
> If a user sets the OS host name and the host names in the hostlist of
> external/libvirt in capital letters, stonith is not carried out.
>
> Having external/libvirt convert the host names in hostlist to lower case
> before comparing them guards against this configuration error by the user.
>
> Best Regards,
> Hideo Yamauchi.
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
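For readers unfamiliar with the problem: the idea of the (now withdrawn) patch is simply to normalize both sides of the host-name comparison to lower case. A minimal sketch of that normalization — the hostlist values here are made up, and this is only an illustration of the idiom, not the actual patch to the external/libvirt agent:

```shell
# Compare a stonith target (Pacemaker passes host names in lower case)
# against a hostlist that may have been configured in mixed case.
# Normalizing both sides with tr avoids the case mismatch.
hostlist="NODE1 Node2 node3"    # hypothetical hostlist setting
target="node2"                  # as passed by the cluster
for h in $hostlist; do
    if [ "$(printf '%s' "$h" | tr '[:upper:]' '[:lower:]')" = "$target" ]; then
        echo "matched $h"
    fi
done
# -> matched Node2
```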
[ClusterLabs] crm_report consumes all available RAM
Hi,

just discovered a very interesting issue. If there is a system user with a very big UID (8002 in my case), then crm_report (actually the 'grep' it runs) consumes too much RAM. The relevant part of the process tree at that moment looks like (word-wrap off):

USER PID   %CPU %MEM VSZ     RSS     TTY STAT START TIME COMMAND
...
root 25526 0.0  0.0  106364  636     ?   S    12:37 0:00 \_ /bin/sh /usr/sbin/crm_report --dest=/var/log/crm_report -f -01-01 00:00:00
root 25585 0.0  0.0  106364  636     ?   S    12:37 0:00   \_ bash /var/log/crm_report/collector
root 25613 0.0  0.0  106364  152     ?   S    12:37 0:00     \_ bash /var/log/crm_report/collector
root 25614 0.0  0.0  106364  692     ?   S    12:37 0:00       \_ bash /var/log/crm_report/collector
root 27965 4.9  0.0  100936  452     ?   S    12:38 0:01       | \_ cat /var/log/lastlog
root 27966 23.0 82.9 3248996 1594688 ?   D    12:38 0:08       | \_ grep -l -e Starting Pacemaker
root 25615 0.0  0.0  155432  600     ?   S    12:37 0:00       \_ sort -u

ls -ls /var/log/lastlog shows:

40 -rw-r--r--. 1 root root 2336876 Sep 8 04:36 /var/log/lastlog

That is a sparse binary file, which consumes only 40k of disk space. At the same time its size is 23GB, and grep takes all the RAM trying to grep a string from 23GB of mostly zeroes without newlines.

I believe this is worth fixing.

Thank you,
Vladislav
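One possible shape for a fix, sketched here as an assumption rather than what crm_report actually does: detect that a candidate file is sparse (allocated blocks far smaller than the apparent size) before feeding it to grep. With GNU stat, `%s` is the apparent size in bytes and `%b` the number of 512-byte blocks actually allocated:

```shell
# Detect a sparse file before grepping it.  A freshly truncated file has
# an apparent size but (on most filesystems) no allocated data blocks,
# mimicking /var/log/lastlog with a huge UID.
f=$(mktemp)
truncate -s 10M "$f"                      # 10 MiB apparent size, nothing written
apparent=$(stat -c %s "$f")               # apparent size in bytes
allocated=$(( $(stat -c %b "$f") * 512 )) # bytes actually allocated on disk
if [ "$allocated" -lt "$apparent" ]; then
    echo "$f is sparse: skip it, or use grep --binary-files=without-match"
fi
rm -f "$f"
```

GNU grep's `--binary-files=without-match` (or `-I`) would also stop it from scanning binary files like lastlog at all, which may be the simpler mitigation here.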
Re: [ClusterLabs] Antw: crm_report consumes all available RAM
08.09.2015 15:18, Ulrich Windl wrote:
>>>> Vladislav Bogdanov schrieb am 08.09.2015 um 14:05 in Nachricht
>>>> <55eecefb.8050...@hoster-ok.com>:
>> Hi, just discovered very interesting issue. If there is a system user
>> with very big UID (8002 in my case), then crm_report (actually 'grep'
>> it runs) consumes too much RAM.
>> [...]
> I guess the UID value is used as offset in the lastlog file (which is
> exactly OK).

I should just add that the user must have logged in at least once.

> When reading such a sparse file, the filesystem should simply deliver
> zero blocks to grep. As grep is designed to read from streams there is
> not much you can do against reading all these zeros, I guess.

yep, I think that another indicator should be used

> Also an mmap based solution might exceed the virtual address space,
> especially for 32-bit systems.
>
> BTW: Did you try "last Pacemaker"? I could only test with "last reboot" here...
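Ulrich's offset theory reproduces the ls output in the original report exactly. On Linux/x86_64 a struct lastlog record is 4 + 32 + 256 = 292 bytes (ll_time, ll_line, ll_host), and once a user has logged in, /var/log/lastlog extends to (uid + 1) records:

```shell
# Expected apparent size of /var/log/lastlog once UID 8002 has logged in:
# one 292-byte struct lastlog record per UID, indexed directly by UID.
uid=8002
recsize=292                        # sizeof(struct lastlog) on Linux/x86_64
echo $(( (uid + 1) * recsize ))    # -> 2336876, matching the ls -ls output
```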
That is post-1.1.13

Thanks,
Vladislav
Re: [ClusterLabs] [Pacemaker1.0.13] [hbagent] The hbagent does not stop.
Hi Yan,

Thank you for your comment.

> Sounds weird. I've never encountered the issue before. Actually I
> haven't run it with heartbeat for years ;-) We'd probably have to find
> the pattern and reproduce it.

We have only just begun the investigation. If there is anything you think could be the cause of the problem, please tell me.

Best Regards,
Hideo Yamauchi.

----- Original Message -----
> From: "Gao,Yan"
> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to
> open-source clustering welcomed
> Date: 2015/9/8, Tue 23:14
> Subject: Re: [ClusterLabs] [Pacemaker1.0.13] [hbagent] The hbagent does not stop.
>
> Hi Hideo,
>
> On 09/08/2015 04:28 AM, renayama19661...@ybb.ne.jp wrote:
>> Hi All,
>>
>> We ran into a problem with Pacemaker 1.0.13.
>>
>> * RHEL6.4 (kernel-2.6.32-358.23.2.el6.x86_64)
>> * SNMP:
>>   * net-snmp-libs-5.5-49.el6_5.1.x86_64
>>   * hp-snmp-agents-9.50-2564.40.rhel6.x86_64
>>   * net-snmp-utils-5.5-49.el6_5.1.x86_64
>>   * net-snmp-5.5-49.el6_5.1.x86_64
>> * Pacemaker 1.0.13
>> * pacemaker-mgmt-2.0.1
>>
>> We started hbagent in respawn mode in this environment, but hbagent did not
>> stop when we stopped Heartbeat.
>> According to the log, SIGTERM seems to have been sent by Heartbeat, but
>> there is no trace that hbagent received SIGTERM.
>>
>> We have tried to reproduce the problem, but so far it has not reappeared.
>>
>> We suppose that pacemaker-mgmt (hbagent) or snmp has a problem.
>>
>> Does anyone know of a similar problem?
>> Does anyone know the cause of the problem?
>
> Sounds weird. I've never encountered the issue before. Actually I
> haven't run it with heartbeat for years ;-) We'd probably have to find
> the pattern and reproduce it.
>
> Regards,
> Yan
> --
> Gao,Yan
> Senior Software Engineer
> SUSE LINUX GmbH
Re: [ClusterLabs] SBD & Failed Peer
On 09/07/2015 08:42 PM, Jorge Fábregas wrote:
> If anyone from SUSE here could recreate it that would be great.

Please open an SR - there are SUSE folks on this list, but with an SR you get the right people working on the bug.

greetings
Kai Dupke
Senior Product Manager
Server Product Line
--
Sell not virtue to purchase wealth, nor liberty to purchase power.
Phone: +49-(0)5102-9310828
Mail: kdu...@suse.com
Mobile: +49-(0)173-5876766
WWW: www.suse.com
SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nuernberg (Germany)
GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
Re: [ClusterLabs] SBD & Failed Peer
> On 9 Sep 2015, at 12:13 am, Ken Gaillot wrote:
>
> On 09/07/2015 07:48 AM, Jorge Fábregas wrote:
>> On 09/07/2015 03:27 AM, Digimer wrote:
>>> And this is why I am nervous; it is always ideal to have a primary fence
>>> method that has a way of confirming the 'off' state. IPMI fencing can
>>> do this, as can hypervisor-based fence methods like fence_virsh and
>>> fence_xvm.
>>
>> Hi Digimer,
>>
>> Yes, I thought that confirmation was kind of sacred but now I know it's
>> not always possible.
>>
>>> I would use IPMI (iLO, DRAC, etc) as the primary fence method and
>>> something else as a secondary, backup method. You can use SBD + watchdog
>>> as the backup method, or as I do, a pair of switched PDUs (I find the APC
>>> brand to be very fast at fencing).
>>
>> This sounds great. Is there a way to specify a primary & secondary
>> fencing device? I haven't seen a way to specify such a hierarchy in
>> pacemaker.
>
> Good news/bad news:
>
> Yes, Pacemaker supports complex hierarchies of multiple fencing devices,
> which it calls "fencing topology". There is a small example at
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_advanced_stonith_configurations
>
> Unfortunately, sbd is not supported in fencing topologies.

Another way to look at it is that sbd is only supported in fencing topologies - just not explicit ones. Self-termination is always the least preferred option, so we'll only use it if all other options (including topologies) are exhausted. But we'll do so automatically.

> Pacemaker
> hooks into sbd via dedicated internal logic, not a conventional fence
> agent, so it's treated differently. You might want to open an RFE bug
> either upstream or with your OS vendor if you want to put it on the
> radar, but sbd isn't entirely under Pacemaker's control, so I'm not sure
> how feasible it would be.
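For the record, a fencing topology is configured as stonith "levels": level 1 is tried first, and level 2 only if every device in level 1 fails. A hedged sketch with pcs - the device and node names below are hypothetical, and (per the discussion above) sbd cannot be listed as one of these devices:

```shell
# Hypothetical two-level fencing topology for node1:
# level 1 = IPMI fence device, level 2 = switched-PDU fence device.
pcs stonith level add 1 node1 fence-ipmi-node1
pcs stonith level add 2 node1 fence-pdu-node1

# Display the configured topology.
pcs stonith level
```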
Re: [ClusterLabs] Watchdog & Reset
On 09/08/2015 09:29 AM, Jorge Fábregas wrote:
> Who's feeding the watchdog timer? What or where's the watchdog timer
> since there's none defined?

Arrgh. It was kdump. By doing "chkconfig boot.kdump off" and restarting, I got the expected behavior (permanent freeze without rebooting).

When using SBD for STONITH, does having kdump enabled create any issues (HA-wise), with SBD feeding a hardware timer on the other hand?

Regards,
Jorge
[ClusterLabs] Problem with fence_virsh in RHEL 6 - selinux denial
Hi all,

I've been using KVM-based VMs as a testbed for clusters for ages, always using fence_virsh. I noticed today though that fence_virsh is now being blocked by selinux (rhel 6.7, fully updated as of today):

type=AVC msg=audit(1441752343.878:3269): avc: denied { execute } for pid=8848 comm="fence_virsh" name="ssh" dev=vda2 ino=2103935 scontext=unconfined_u:system_r:fenced_t:s0 tcontext=system_u:object_r:ssh_exec_t:s0 tclass=file
type=SYSCALL msg=audit(1441752343.878:3269): arch=c03e syscall=21 success=no exit=-13 a0=1a363a0 a1=1 a2=7f02aa7f89e8 a3=7ffdff0dc7c0 items=0 ppid=7759 pid=8848 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=27 comm="fence_virsh" exe="/usr/bin/python" subj=unconfined_u:system_r:fenced_t:s0 key=(null)

[root@node1 ~]# rpm -q fence-agents cman corosync
fence-agents-4.0.15-8.el6.x86_64
cman-3.0.12.1-73.el6.1.x86_64
corosync-1.4.7-2.el6.x86_64
[root@node1 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.7 (Santiago)

I'll post a follow-up if I can sort out how to fix it. My selinux-fu is weak...

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [ClusterLabs] [ClusterLabs Developers] Problem with fence_virsh in RHEL 6 - selinux denial
On 08/09/15 09:46 PM, Justin Pryzby wrote:
> In case it helps, I take that to mean:
>
> fence_virsh is a python program, which is attempting to run ssh, but failing.
>
> Can you check:
>
> which ssh  # make sure it's not a strange ssh in /usr/local or such
> ls -Z `which fence_virsh` `which ssh`

[root@node1 ~]# ls -Z `which fence_virsh` `which ssh`
-rwxr-xr-x. root root system_u:object_r:ssh_exec_t:s0 /usr/bin/ssh
-rwxr-xr-x. root root system_u:object_r:bin_t:s0 /usr/sbin/fence_virsh

> sudo restorecon -v `which fence_virsh` `which ssh`  # restore default selinux contexts
> ls -Z `which fence_virsh` `which ssh`  # check again..

No change:

[root@node1 ~]# restorecon -v `which fence_virsh` `which ssh`
[root@node1 ~]# ls -Z `which fence_virsh` `which ssh`
-rwxr-xr-x. root root system_u:object_r:ssh_exec_t:s0 /usr/bin/ssh
-rwxr-xr-x. root root system_u:object_r:bin_t:s0 /usr/sbin/fence_virsh

Not surprised, as this is a fresh install + OS update.

I wiped audit.log, restarted auditd and then tried to fence manually. Here is what I saw:

[root@node1 ~]# fence_node node2
fence node2 success

In messages:

Sep 9 02:53:30 node1 fence_node[23468]: fence node2 success

A few moments later, you can see in messages that corosync noticed the loss of the node and tried to fence, but failed:

Sep 9 02:53:38 node1 corosync[2792]: [TOTEM ] A processor failed, forming new configuration.
Sep 9 02:53:40 node1 corosync[2792]: [QUORUM] Members[1]: 1
Sep 9 02:53:40 node1 corosync[2792]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 9 02:53:40 node1 corosync[2792]: [CPG ] chosen downlist: sender r(0) ip(10.20.10.1) r(1) ip(10.10.10.1) ; members(old:2 left:1)
Sep 9 02:53:40 node1 corosync[2792]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 9 02:53:40 node1 kernel: dlm: closing connection to node 2
Sep 9 02:53:40 node1 fenced[2879]: node_history_fence_external no nodeid -1
Sep 9 02:53:40 node1 fenced[2879]: fencing node node2.ccrs.bcn
Sep 9 02:53:40 node1 fenced[2879]: fence node2.ccrs.bcn dev 0.0 agent fence_virsh result: error from agent
Sep 9 02:53:40 node1 fenced[2879]: fence node2.ccrs.bcn failed
Sep 9 02:53:43 node1 fenced[2879]: fencing node node2.ccrs.bcn
Sep 9 02:53:43 node1 fenced[2879]: fence node2.ccrs.bcn dev 0.0 agent fence_virsh result: error from agent
Sep 9 02:53:43 node1 fenced[2879]: fence node2.ccrs.bcn failed
Sep 9 02:53:46 node1 fenced[2879]: fencing node node2.ccrs.bcn
Sep 9 02:53:46 node1 fenced[2879]: fence node2.ccrs.bcn dev 0.0 agent fence_virsh result: error from agent
Sep 9 02:53:46 node1 fenced[2879]: fence node2.ccrs.bcn failed

I set selinux to permissive:

[root@node1 ~]# setenforce 0

And immediately the fence succeeded:

Sep 9 02:53:46 node1 dbus: avc: received setenforce notice (enforcing=0)
Sep 9 02:53:52 node1 fenced[2879]: fence node2.ccrs.bcn success

Here is my cluster.conf, in case it matters:

[root@node1 ~]# cat /etc/cluster/cluster.conf
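A common way to work around such a denial without leaving the box permissive is to generate a local policy module from the recorded AVC messages with audit2allow. This is a sketch, not a vetted policy for this exact denial - it assumes the policycoreutils-python package (which provides audit2allow) is installed, and the generated .te file should be reviewed before loading:

```shell
# Build a local SELinux policy module allowing what the fenced_t domain
# was denied (here: fence_virsh executing ssh), then load it.
grep fenced_t /var/log/audit/audit.log | audit2allow -M fenced_virsh_local
# Review fenced_virsh_local.te before loading!
semodule -i fenced_virsh_local.pp
```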