Re: [ClusterLabs] [Patch][glue][external/libvirt] Conversion to a lower case of hostlist.
Hi All,

We intend to change some patches, so we withdraw this patch.

Best Regards,
Hideo Yamauchi.

----- Original Message -----
> From: "renayama19661...@ybb.ne.jp"
> To: ClusterLabs-ML
> Date: 2015/9/7, Mon 09:06
> Subject: [ClusterLabs] [Patch][glue][external/libvirt] Conversion to a lower case of hostlist.
>
> Hi All,
>
> When the cluster carries out stonith, Pacemaker handles host names in lower case.
> If a user sets the OS host name and the host names in the hostlist of
> external/libvirt in capital letters, stonith is not carried out.
>
> Having external/libvirt convert the host names in hostlist to lower case
> before comparing them guards against this configuration error by the user.
>
> Best Regards,
> Hideo Yamauchi.
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
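For readers unfamiliar with the problem: the idea of the (now withdrawn) patch is simply to normalize both sides of the host-name comparison to lower case. A minimal sketch of that normalization — the hostlist values here are made up, and this is only an illustration of the idiom, not the actual patch to the external/libvirt agent:

```shell
# Compare a stonith target (Pacemaker passes host names in lower case)
# against a hostlist that may have been configured in mixed case.
# Normalizing both sides with tr avoids the case mismatch.
hostlist="NODE1 Node2 node3"    # hypothetical hostlist setting
target="node2"                  # as passed by the cluster
for h in $hostlist; do
    if [ "$(printf '%s' "$h" | tr '[:upper:]' '[:lower:]')" = "$target" ]; then
        echo "matched $h"
    fi
done
# -> matched Node2
```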
[ClusterLabs] crm_report consumes all available RAM
Hi,

just discovered a very interesting issue. If there is a system user with a very big UID (8002 in my case), then crm_report (actually the 'grep' it runs) consumes too much RAM. The relevant part of the process tree at that moment looks like (word-wrap off):

USER PID   %CPU %MEM VSZ     RSS     TTY STAT START TIME COMMAND
...
root 25526 0.0  0.0  106364  636     ?   S    12:37 0:00 \_ /bin/sh /usr/sbin/crm_report --dest=/var/log/crm_report -f -01-01 00:00:00
root 25585 0.0  0.0  106364  636     ?   S    12:37 0:00   \_ bash /var/log/crm_report/collector
root 25613 0.0  0.0  106364  152     ?   S    12:37 0:00     \_ bash /var/log/crm_report/collector
root 25614 0.0  0.0  106364  692     ?   S    12:37 0:00       \_ bash /var/log/crm_report/collector
root 27965 4.9  0.0  100936  452     ?   S    12:38 0:01       | \_ cat /var/log/lastlog
root 27966 23.0 82.9 3248996 1594688 ?   D    12:38 0:08       | \_ grep -l -e Starting Pacemaker
root 25615 0.0  0.0  155432  600     ?   S    12:37 0:00       \_ sort -u

ls -ls /var/log/lastlog shows:

40 -rw-r--r--. 1 root root 2336876 Sep 8 04:36 /var/log/lastlog

That is a sparse binary file, which consumes only 40k of disk space. At the same time its size is 23GB, and grep takes all the RAM trying to grep a string from 23GB of mostly zeroes without newlines.

I believe this is worth fixing.

Thank you,
Vladislav
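One possible shape for a fix, sketched here as an assumption rather than what crm_report actually does: detect that a candidate file is sparse (allocated blocks far smaller than the apparent size) before feeding it to grep. With GNU stat, `%s` is the apparent size in bytes and `%b` the number of 512-byte blocks actually allocated:

```shell
# Detect a sparse file before grepping it.  A freshly truncated file has
# an apparent size but (on most filesystems) no allocated data blocks,
# mimicking /var/log/lastlog with a huge UID.
f=$(mktemp)
truncate -s 10M "$f"                      # 10 MiB apparent size, nothing written
apparent=$(stat -c %s "$f")               # apparent size in bytes
allocated=$(( $(stat -c %b "$f") * 512 )) # bytes actually allocated on disk
if [ "$allocated" -lt "$apparent" ]; then
    echo "$f is sparse: skip it, or use grep --binary-files=without-match"
fi
rm -f "$f"
```

GNU grep's `--binary-files=without-match` (or `-I`) would also stop it from scanning binary files like lastlog at all, which may be the simpler mitigation here.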
Re: [ClusterLabs] Antw: crm_report consumes all available RAM
08.09.2015 15:18, Ulrich Windl wrote:
>>>> Vladislav Bogdanov schrieb am 08.09.2015 um 14:05 in Nachricht
>>>> <55eecefb.8050...@hoster-ok.com>:
>> Hi, just discovered very interesting issue. If there is a system user
>> with very big UID (8002 in my case), then crm_report (actually 'grep'
>> it runs) consumes too much RAM.
>> [...]
> I guess the UID value is used as offset in the lastlog file (which is
> exactly OK).

I should just add that the user must have logged in at least once.

> When reading such a sparse file, the filesystem should simply deliver
> zero blocks to grep. As grep is designed to read from streams there is
> not much you can do against reading all these zeros, I guess.

yep, I think that another indicator should be used

> Also an mmap based solution might exceed the virtual address space,
> especially for 32-bit systems.
>
> BTW: Did you try "last Pacemaker"? I could only test with "last reboot" here...
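Ulrich's offset theory reproduces the ls output in the original report exactly. On Linux/x86_64 a struct lastlog record is 4 + 32 + 256 = 292 bytes (ll_time, ll_line, ll_host), and once a user has logged in, /var/log/lastlog extends to (uid + 1) records:

```shell
# Expected apparent size of /var/log/lastlog once UID 8002 has logged in:
# one 292-byte struct lastlog record per UID, indexed directly by UID.
uid=8002
recsize=292                        # sizeof(struct lastlog) on Linux/x86_64
echo $(( (uid + 1) * recsize ))    # -> 2336876, matching the ls -ls output
```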
That is post-1.1.13

Thanks,
Vladislav
Re: [ClusterLabs] [Pacemaker1.0.13] [hbagent] The hbagent does not stop.
Hi Yan,

Thank you for your comment.

> Sounds weird. I've never encountered the issue before. Actually I
> haven't run it with heartbeat for years ;-) We'd probably have to find
> the pattern and reproduce it.

We have only just begun the investigation. If there is anything you think could be the cause of the problem, please tell me.

Best Regards,
Hideo Yamauchi.

----- Original Message -----
> From: "Gao,Yan"
> To: renayama19661...@ybb.ne.jp; Cluster Labs - All topics related to
> open-source clustering welcomed
> Date: 2015/9/8, Tue 23:14
> Subject: Re: [ClusterLabs] [Pacemaker1.0.13] [hbagent] The hbagent does not stop.
>
> Hi Hideo,
>
> On 09/08/2015 04:28 AM, renayama19661...@ybb.ne.jp wrote:
>> Hi All,
>>
>> We ran into a problem with Pacemaker 1.0.13.
>>
>> * RHEL6.4 (kernel-2.6.32-358.23.2.el6.x86_64)
>> * SNMP:
>>   * net-snmp-libs-5.5-49.el6_5.1.x86_64
>>   * hp-snmp-agents-9.50-2564.40.rhel6.x86_64
>>   * net-snmp-utils-5.5-49.el6_5.1.x86_64
>>   * net-snmp-5.5-49.el6_5.1.x86_64
>> * Pacemaker 1.0.13
>> * pacemaker-mgmt-2.0.1
>>
>> We started hbagent in respawn mode in this environment, but hbagent did not
>> stop when we stopped Heartbeat.
>> According to the log, SIGTERM seems to have been sent by Heartbeat, but
>> there is no trace that hbagent received SIGTERM.
>>
>> We have tried to reproduce the problem, but so far it has not reappeared.
>>
>> We suppose that pacemaker-mgmt (hbagent) or snmp has a problem.
>>
>> Does anyone know of a similar problem?
>> Does anyone know the cause of the problem?
>
> Sounds weird. I've never encountered the issue before. Actually I
> haven't run it with heartbeat for years ;-) We'd probably have to find
> the pattern and reproduce it.
>
> Regards,
> Yan
> --
> Gao,Yan
> Senior Software Engineer
> SUSE LINUX GmbH
Re: [ClusterLabs] SBD & Failed Peer
On 09/07/2015 08:42 PM, Jorge Fábregas wrote:
> If anyone from SUSE here could recreate it that would be great.

Please open an SR - there are SUSE folks on this list, but with an SR you get the right people working on the bug.

greetings
Kai Dupke
Senior Product Manager
Server Product Line
--
Sell not virtue to purchase wealth, nor liberty to purchase power.
Phone: +49-(0)5102-9310828
Mail: kdu...@suse.com
Mobile: +49-(0)173-5876766
WWW: www.suse.com
SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nuernberg (Germany)
GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
Re: [ClusterLabs] SBD & Failed Peer
> On 9 Sep 2015, at 12:13 am, Ken Gaillot wrote:
>
> On 09/07/2015 07:48 AM, Jorge Fábregas wrote:
>> On 09/07/2015 03:27 AM, Digimer wrote:
>>> And this is why I am nervous; it is always ideal to have a primary fence
>>> method that has a way of confirming the 'off' state. IPMI fencing can
>>> do this, as can hypervisor-based fence methods like fence_virsh and
>>> fence_xvm.
>>
>> Hi Digimer,
>>
>> Yes, I thought that confirmation was kind of sacred but now I know it's
>> not always possible.
>>
>>> I would use IPMI (iLO, DRAC, etc) as the primary fence method and
>>> something else as a secondary, backup method. You can use SBD + watchdog
>>> as the backup method, or as I do, a pair of switched PDUs (I find the APC
>>> brand to be very fast at fencing).
>>
>> This sounds great. Is there a way to specify a primary & secondary
>> fencing device? I haven't seen a way to specify such a hierarchy in
>> pacemaker.
>
> Good news/bad news:
>
> Yes, Pacemaker supports complex hierarchies of multiple fencing devices,
> which it calls "fencing topology". There is a small example at
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_advanced_stonith_configurations
>
> Unfortunately, sbd is not supported in fencing topologies.

Another way to look at it is that sbd is only supported in fencing topologies - just not explicit ones. Self-termination is always the least preferred option, so we'll only use it if all other options (including topologies) are exhausted. But we'll do so automatically.

> Pacemaker
> hooks into sbd via dedicated internal logic, not a conventional fence
> agent, so it's treated differently. You might want to open an RFE bug
> either upstream or with your OS vendor if you want to put it on the
> radar, but sbd isn't entirely under Pacemaker's control, so I'm not sure
> how feasible it would be.
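For the record, a fencing topology is configured as stonith "levels": level 1 is tried first, and level 2 only if every device in level 1 fails. A hedged sketch with pcs - the device and node names below are hypothetical, and (per the discussion above) sbd cannot be listed as one of these devices:

```shell
# Hypothetical two-level fencing topology for node1:
# level 1 = IPMI fence device, level 2 = switched-PDU fence device.
pcs stonith level add 1 node1 fence-ipmi-node1
pcs stonith level add 2 node1 fence-pdu-node1

# Display the configured topology.
pcs stonith level
```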
Re: [ClusterLabs] Watchdog & Reset
On 09/08/2015 09:29 AM, Jorge Fábregas wrote:
> Who's feeding the watchdog timer? What or where's the watchdog timer
> since there's none defined?

Arrgh. It was kdump. By doing "chkconfig boot.kdump off" and restarting, I got the expected behavior (permanent freeze without rebooting).

When using SBD for STONITH, does having kdump enabled create any issues (HA-wise), with SBD feeding a hardware timer on the other hand?

Regards,
Jorge
[ClusterLabs] Problem with fence_virsh in RHEL 6 - selinux denial
Hi all,

I've been using KVM-based VMs as a testbed for clusters for ages, always using fence_virsh. I noticed today though that fence_virsh is now being blocked by selinux (rhel 6.7, fully updated as of today):

type=AVC msg=audit(1441752343.878:3269): avc: denied { execute } for pid=8848 comm="fence_virsh" name="ssh" dev=vda2 ino=2103935 scontext=unconfined_u:system_r:fenced_t:s0 tcontext=system_u:object_r:ssh_exec_t:s0 tclass=file
type=SYSCALL msg=audit(1441752343.878:3269): arch=c03e syscall=21 success=no exit=-13 a0=1a363a0 a1=1 a2=7f02aa7f89e8 a3=7ffdff0dc7c0 items=0 ppid=7759 pid=8848 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=27 comm="fence_virsh" exe="/usr/bin/python" subj=unconfined_u:system_r:fenced_t:s0 key=(null)

[root@node1 ~]# rpm -q fence-agents cman corosync
fence-agents-4.0.15-8.el6.x86_64
cman-3.0.12.1-73.el6.1.x86_64
corosync-1.4.7-2.el6.x86_64
[root@node1 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.7 (Santiago)

I'll post a follow-up if I can sort out how to fix it. My selinux-fu is weak...

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
Re: [ClusterLabs] [ClusterLabs Developers] Problem with fence_virsh in RHEL 6 - selinux denial
On 08/09/15 09:46 PM, Justin Pryzby wrote:
> In case it helps, I take that to mean:
>
> fence_virsh is a python program, which is attempting to run ssh, but failing.
>
> Can you check:
>
> which ssh  # make sure it's not a strange ssh in /usr/local or such
> ls -Z `which fence_virsh` `which ssh`

[root@node1 ~]# ls -Z `which fence_virsh` `which ssh`
-rwxr-xr-x. root root system_u:object_r:ssh_exec_t:s0 /usr/bin/ssh
-rwxr-xr-x. root root system_u:object_r:bin_t:s0 /usr/sbin/fence_virsh

> sudo restorecon -v `which fence_virsh` `which ssh`  # restore default selinux contexts
> ls -Z `which fence_virsh` `which ssh`  # check again..

No change:

[root@node1 ~]# restorecon -v `which fence_virsh` `which ssh`
[root@node1 ~]# ls -Z `which fence_virsh` `which ssh`
-rwxr-xr-x. root root system_u:object_r:ssh_exec_t:s0 /usr/bin/ssh
-rwxr-xr-x. root root system_u:object_r:bin_t:s0 /usr/sbin/fence_virsh

Not surprised, as this is a fresh install + OS update.

I wiped audit.log, restarted auditd and then tried to fence manually. Here is what I saw:

[root@node1 ~]# fence_node node2
fence node2 success

In messages:

Sep 9 02:53:30 node1 fence_node[23468]: fence node2 success

A few moments later, you can see in messages that corosync noticed the loss of the node and tried to fence, but failed:

Sep 9 02:53:38 node1 corosync[2792]: [TOTEM ] A processor failed, forming new configuration.
Sep 9 02:53:40 node1 corosync[2792]: [QUORUM] Members[1]: 1
Sep 9 02:53:40 node1 corosync[2792]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 9 02:53:40 node1 corosync[2792]: [CPG ] chosen downlist: sender r(0) ip(10.20.10.1) r(1) ip(10.10.10.1) ; members(old:2 left:1)
Sep 9 02:53:40 node1 corosync[2792]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 9 02:53:40 node1 kernel: dlm: closing connection to node 2
Sep 9 02:53:40 node1 fenced[2879]: node_history_fence_external no nodeid -1
Sep 9 02:53:40 node1 fenced[2879]: fencing node node2.ccrs.bcn
Sep 9 02:53:40 node1 fenced[2879]: fence node2.ccrs.bcn dev 0.0 agent fence_virsh result: error from agent
Sep 9 02:53:40 node1 fenced[2879]: fence node2.ccrs.bcn failed
Sep 9 02:53:43 node1 fenced[2879]: fencing node node2.ccrs.bcn
Sep 9 02:53:43 node1 fenced[2879]: fence node2.ccrs.bcn dev 0.0 agent fence_virsh result: error from agent
Sep 9 02:53:43 node1 fenced[2879]: fence node2.ccrs.bcn failed
Sep 9 02:53:46 node1 fenced[2879]: fencing node node2.ccrs.bcn
Sep 9 02:53:46 node1 fenced[2879]: fence node2.ccrs.bcn dev 0.0 agent fence_virsh result: error from agent
Sep 9 02:53:46 node1 fenced[2879]: fence node2.ccrs.bcn failed

I set selinux to permissive:

[root@node1 ~]# setenforce 0

And immediately the fence succeeded:

Sep 9 02:53:46 node1 dbus: avc: received setenforce notice (enforcing=0)
Sep 9 02:53:52 node1 fenced[2879]: fence node2.ccrs.bcn success

Here is my cluster.conf, in case it matters:

[root@node1 ~]# cat /etc/cluster/cluster.conf
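A common way to work around such a denial without leaving the box permissive is to generate a local policy module from the recorded AVC messages with audit2allow. This is a sketch, not a vetted policy for this exact denial - it assumes the policycoreutils-python package (which provides audit2allow) is installed, and the generated .te file should be reviewed before loading:

```shell
# Build a local SELinux policy module allowing what the fenced_t domain
# was denied (here: fence_virsh executing ssh), then load it.
grep fenced_t /var/log/audit/audit.log | audit2allow -M fenced_virsh_local
# Review fenced_virsh_local.te before loading!
semodule -i fenced_virsh_local.pp
```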