Re: [ClusterLabs] mdraid - pacemaker resource agent

2022-12-09 Thread Roger Zhou via Users


On 12/9/22 17:36, Jelen, Piotr wrote:

Hi Roger,

Thank you for your quick reply,
The mdraid resource agent works very well for us,
Can you please tell me if there is any resource agent or tool built into 
Pacemaker for syncing configuration file(s), such as /etc/exports for the NFS 
service, between cluster nodes without mounting shares and creating symbolic link(s)?



I'm inclined to ask you to describe your use case a little more. Without knowing
it well, my view might not fit your situation.

csync2 is often used by the ClusterLabs community, but it is a standalone tool
and is not part of Pacemaker. https://github.com/linbit/csync2
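
For reference, a minimal csync2 sketch to keep /etc/exports in sync could look
like this (host names and key path are placeholders, adjust to your nodes):

# /etc/csync2/csync2.cfg
group ha_nodes {
    host node1 node2;
    key /etc/csync2/key_ha_nodes;
    include /etc/exports;
}

# after editing /etc/exports on one node, push it out:
csync2 -xv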

However, you refer to /etc/exports. I guess your use case is highly available
NFS? If so, you could leverage `man ocf_heartbeat_exportfs`. Note that this resource
agent doesn't use /etc/exports at all.
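
For a highly available NFS export, a minimal exportfs resource sketch could
look like this (directory, client spec and fsid are examples only):

crm configure primitive p_exportfs_data ocf:heartbeat:exportfs \
    params directory="/srv/nfs/data" clientspec="192.168.1.0/24" \
        options="rw,no_root_squash" fsid=1 \
    op monitor interval=30s

The export is then defined entirely by the resource parameters, so there is
nothing in /etc/exports that needs to be synced between the nodes.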


BR,
Roger





Thank you
Piotr Jelen
Senior Systems Platform Engineer

Mastercard
Mountain View, Central Park  | Leopard


-Original Message-
From: Roger Zhou 
Sent: Thursday 8 December 2022 05:56
To: Cluster Labs - All topics related to open-source clustering welcomed 
; Jelen, Piotr 
Cc: Nielsen, Laust 
Subject: {EXTERNAL} Re: [ClusterLabs] mdraid - pacemaker resource agent




On 12/7/22 18:44, Jelen, Piotr wrote:

Hi ClusterLabs team ,

I would like to ask if this resource agent has been tested and if it can be
used in production?
resource-agents/mdraid at main · ClusterLabs/resource-agents · GitHub
<https://github.com/ClusterLabs/resource-agents/blob/main/heartbeat/mdraid>



Yes, why not ;). We don't see any big missing piece, though there might be some 
improvements for certain scenarios. Anyway, you could report issues, or 
provide improvement code, if any.
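
For reference, a minimal mdraid resource sketch (the parameter names here are
taken from the agent metadata; verify them with `crm ra info ocf:heartbeat:mdraid`
on your build):

crm configure primitive p_md0 ocf:heartbeat:mdraid \
    params mdadm_conf="/etc/mdadm.conf" md_dev="/dev/md0" \
    op monitor interval=60s timeout=60s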

Cheers,
Roger



Thank you

*Piotr Jelen*

Senior Systems Platform Engineer

Mastercard

Mountain View, Central Park  | Leopard

   [mastercard.com] <http://www.mastercard.com>




Re: [ClusterLabs] mdraid - pacemaker resource agent

2022-12-07 Thread Roger Zhou via Users

On 12/7/22 18:44, Jelen, Piotr wrote:

Hi ClusterLabs team ,

I would like to ask if this resource agent has been tested and if it can be used in 
production?
resource-agents/mdraid at main · ClusterLabs/resource-agents · GitHub
<https://github.com/ClusterLabs/resource-agents/blob/main/heartbeat/mdraid>





Yes, why not ;). We don't see any big missing piece, though there might be some 
improvements for certain scenarios. Anyway, you could report issues, or 
provide improvement code, if any.


Cheers,
Roger



Thank you

*Piotr Jelen*

Senior Systems Platform Engineer

Mastercard

Mountain View, Central Park  | Leopard

[mastercard.com] 




Re: [ClusterLabs] Q: fence_kdump and fence_kdump_send

2022-02-25 Thread Roger Zhou via Users



On 2/24/22 20:21, Ulrich Windl wrote:

Hi!

After reading about fence_kdump and fence_kdump_send I wonder:
Does anybody use that in production?
Having the networking and bonding in initrd does not sound like a good idea to 
me.


I assume one of the motivations for fence_kdump is to reduce the dependency on the 
shared disk, which is the fundamental infrastructure for SBD.



Wouldn't it be easier to integrate that functionality into sbd?


sbd does support "crashdump", though you may want some further improvements.



I mean: Let sbd wait for a "kdump-ed" message that initrd could send when kdump 
is complete.
Basically that would be the same mechanism, but using storage instead of 
networking.

If I get it right, the original fence_kdump would also introduce an extra 
fencing delay, and I wonder what happens with a hardware watchdog while a kdump 
is in progress...

The background of all this is that our nodes kernel-panic, and support says the 
kdumps are all incomplete.
The events are most likely:
node1: panics (kdump)
other_node: sees node1 has failed and fences it (via sbd).

However sbd fencing won't work while kdump is executing (IMHO).



Setting up both sbd + fence_kdump does not sound like good practice.
I understand the sbd watchdog is tricky in this combination.


So what happens most likely is that the watchdog terminates the kdump.
In that case all the mess with fence_kdump won't help, right?



With the sbd crashdump functionality, sbd deals with the watchdog properly.

Here is a knowledge page as well
https://www.suse.com/support/kb/doc/?id=19873
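
A rough sketch of the sbd side (option names as used in /etc/sysconfig/sbd on
SUSE; check `man sbd` and the KB article above for your version):

# /etc/sysconfig/sbd
SBD_DEVICE="/dev/disk/by-id/<your-sbd-disk>"
SBD_WATCHDOG_DEV="/dev/watchdog"
# on self-fence, trigger a kernel crash dump instead of a plain reboot,
# so kdump gets a chance to run
SBD_TIMEOUT_ACTION="flush,crashdump"

kdump itself still has to be configured and enabled on the node.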



Regards,
Ulrich






Re: [ClusterLabs] what is the "best" way to completely shutdown a two-node cluster ?

2022-02-10 Thread Roger Zhou via Users


On 2/9/22 17:46, Lentes, Bernd wrote:



- On Feb 7, 2022, at 4:13 PM, Jehan-Guillaume de Rorthais j...@dalibo.com 
wrote:


On Mon, 7 Feb 2022 14:24:44 +0100 (CET)
"Lentes, Bernd"  wrote:


Hi,

i'm currently changing a bit in my cluster because i realized that my
configuration for a power outtage didn't work as i expected. My idea is
currently:
- first stop about 20 VirtualDomains, which are my services. This will surely
take some minutes. I'm thinking of stopping each with a time difference of
about 20 seconds so as not to get too much IO load. and then ...


This part is tricky. On the one hand, it is good thinking to throttle the IO load.

On the other hand, as Jehan and Ulrich mentioned, `crm resource stop <rsc>` 
introduces "target-role=Stopped" for each VirtualDomain, and you have to run `crm 
resource start <rsc>` to change it back to "target-role=Started" to start them 
after the power outage.



- how to stop the other resources ?


I would set cluster option "stop-all-resources" so all remaining resources are
stopped gracefully by the cluster.

Then you can stop both nodes using eg. "crm cluster stop".


Here, for SLES12 SP5, `crm cluster run "crm cluster stop"` could help a little.

From crmsh-4.4.0 onward, `crm cluster stop --all` is recommended to simplify 
the whole cluster-wide shutdown procedure.
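
A sketch of the whole sequence discussed here (assuming crmsh >= 4.4.0 for the
--all variant):

# graceful cluster-wide shutdown
crm configure property stop-all-resources=true   # stop everything in order
crm cluster stop --all                           # then stop the stack on all nodes

# after the outage, once all nodes are up and joined again
crm configure property stop-all-resources=false  # let resources start again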


BR,
Roger



On restart, after both nodes are up and joined to the cluster, you can set
"stop-all-resources=false", then start your VirtualDomains.


Aren't the VirtualDomains already started by "stop-all-resources=false"?

I wrote a script for the whole procedure, which is triggered by the UPS.
As I am not a big shell-script writer, please have a look and tell me your 
opinion.
You find it here: 
https://nc-mcd.helmholtz-muenchen.de/nextcloud/s/rEA9bFxs5Ay6fYG
Thanks.

Bernd




Re: [ClusterLabs] Possible timing bug in SLES15

2021-10-12 Thread Roger Zhou via Users



On 10/12/21 3:32 PM, Ulrich Windl wrote:

Hi!

I just examined the corosync.service unit in SLES15. It contains:
# /usr/lib/systemd/system/corosync.service
[Unit]
Description=Corosync Cluster Engine
Documentation=man:corosync man:corosync.conf man:corosync_overview
ConditionKernelCommandLine=!nocluster
Requires=network-online.target
After=network-online.target
...

However the documentation says corosync requires synchronized system clocks.
With this configuration corosync starts before the clocks are synchronized:


The point looks valid and makes sense. Well, it sounds like there are no (or 
very seldom) victims of it in real life.




Oct 05 14:57:47 h16 ntpd[6767]: ntpd 4.2.8p15@1.3728-o Tue Jun 15 12:00:00 UTC 
2021 (1): Starting
...
Oct 05 14:57:48 h16 systemd[1]: Starting Wait for ntpd to synchronize system 
clock...
...
Oct 05 14:57:48 h16 corosync[6793]:   [TOTEM ] Initializing transport (UDP/IP 
Unicast).
...
Oct 05 14:57:48 h16 systemd[1]: Started Corosync Cluster Engine.
...
Oct 05 14:58:10 h16 systemd[1]: Started Wait for ntpd to synchronize system 
clock.
Oct 05 14:58:10 h16 systemd[1]: Reached target System Time Synchronized.

Only pacemaker.service has:
# /usr/lib/systemd/system/pacemaker.service
[Unit]
Description=Pacemaker High Availability Cluster Manager
Documentation=man:pacemakerd
Documentation=https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html

# DefaultDependencies takes care of sysinit.target,
# basic.target, and shutdown.target

# We need networking to bind to a network address. It is recommended not to
# use Wants or Requires with network.target, and not to use
# network-online.target for server daemons.
After=network.target

# Time syncs can make the clock jump backward, which messes with logging
# and failure timestamps, so wait until it's done.
After=time-sync.target
...

Oct 05 14:58:10 h16 pacemakerd[6974]:  notice: Starting Pacemaker 
2.0.4+20200616.2deceaa3a-3.9.1
But still it does not "Require" time-sync.target...



Actually, `After=` is the ordering dependency that matters here; `Requires=` alone would not enforce the ordering.


Doesn't corosync need synchronized clocks?


Seems good to have, but low priority.
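
If someone really wants corosync to wait for time synchronization as well, a
local drop-in would do it (a sketch only; this is not shipped by any package):

# /etc/systemd/system/corosync.service.d/time-sync.conf
[Unit]
Wants=time-sync.target
After=time-sync.target

# then reload systemd
systemctl daemon-reload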

BR,
Roger





Regards,
Ulrich





Re: [ClusterLabs] (no subject)

2021-09-02 Thread Roger Zhou via Users


On 9/3/21 10:09 AM, ?? via Users wrote:

HELLO!
I built a two-node corosync + pacemaker cluster, and the main end runs on 
node0. There are two network ports with IP addresses in the same network segment on node0. I 


This needs a bit of attention:
"Usually not a good idea to connect two interfaces using the same subnet" [1]

created a VIP resource on one of the network ports and specified the NIC 
attribute. When I take down the network port where the VIP resource is located, 
theoretically the cluster should automatically switch to the other node, but it 
did not. I don't know why, and there is no log information about cluster errors. 
Is this a bug, or do I need to configure some additional parameters?




journalctl will show you a lot of cluster messages and might reveal the 
reason behind it.
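
For example (a generic starting point, adjust the time window to the failure):

journalctl -u corosync -u pacemaker --since "1 hour ago"

and look at what the cluster decided (or did not decide) when the port went down.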


[1] https://access.redhat.com/solutions/30564

Cheers,
Roger





Re: [ClusterLabs] Antw: [EXT] Re: Antw: Hanging OCFS2 Filesystem any one else?

2021-07-12 Thread Roger Zhou



On 7/9/21 3:56 PM, Ulrich Windl wrote:

[...]


h19 kernel: Out of memory: Killed process 6838 (corosync) total-vm:261212kB, 
anon-rss:31444kB, file-rss:7700kB, shmem-rss:121872kB

I doubt that was the best possible choice ;-)

The dead corosync caused the DC (h18) to fence h19 (which was successful), but 
the DC was fenced while it tried to recover resources, so the complete cluster 
rebooted.



Hi Ulrich,

Any clue why the DC (h18) got fenced ("suicide")? Did h18 become inquorate without 
h19, so that the default `no-quorum-policy=stop` kicked in?

BTW, `no-quorum-policy=freeze` is the general suggestion for OCFS2 and GFS2.
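
For reference, that is a one-liner:

crm configure property no-quorum-policy=freeze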

BR,
Roger



Re: [ClusterLabs] Antw: [EXT] Correctly stop pacemaker on 2-node cluster with SBD and failed devices?

2021-06-16 Thread Roger Zhou


On 6/16/21 3:03 PM, Andrei Borzenkov wrote:





We thought that access to storage was restored, but one step was
missing so devices appeared empty.

At this point I tried to restart the pacemaker. But as soon as I
stopped pacemaker SBD rebooted nodes ‑ which is logical, as quorum was
now lost.

How to cleanly stop pacemaker in this case and keep nodes up?


Unconfigure sbd devices, I guess.



Do you have *practical* suggestions on how to do it online in a
running pacemaker cluster? Can you explain how it is going to help
given that lack of sbd device was not the problem in the first place?


I would translate this issue as "how to gracefully shut down sbd to deregister 
sbd from pacemaker for the whole cluster". There seems to be no way to do that 
except `systemctl stop corosync`.


With that in mind, to calm down the sbd suicide, I'm thinking the somewhat tricky 
steps below might help. Well, I'm not sure they fit your situation as a whole.


crm cluster run "systemctl stop pacemaker"
crm cluster run "systemctl stop corosync"

BR,
Roger



Re: [ClusterLabs] Q: VirtualDomain RA

2021-03-02 Thread Roger Zhou



On 3/1/21 7:17 PM, Ulrich Windl wrote:

Hi!

I have a question about the VirtualDomain RA (as in SLES15 SP2):
Why does the RA "undefine", then "create" a domain instead of just "start"ing a 
domain?
I mean: Assuming that an "installation" does "define" the domains, why bother with configuration 
files and "create" when a simple "start" would do also?



It makes sense to clean up certain situations to avoid a "create" failure. 
Example:


1. given foobar.xml doesn't provide UUID
2. `virsh define foobar.xml`
3. `virsh create foobar.xml` <-- error: Failed to create domain from foobar.xml

Cheers,
Roger



Specifically this code:
verify_undefined() {
        local tmpf
        if virsh --connect=${OCF_RESKEY_hypervisor} list --all --name 2>/dev/null | grep -wqs "$DOMAIN_NAME"
        then
                tmpf=$(mktemp -t vmcfgsave.XX)
                if [ ! -r "$tmpf" ]; then
                        ocf_log warn "unable to create temp file, disk full?"
                        # we must undefine the domain
                        virsh $VIRSH_OPTIONS undefine $DOMAIN_NAME > /dev/null 2>&1
                else
                        cp -p $OCF_RESKEY_config $tmpf
                        virsh $VIRSH_OPTIONS undefine $DOMAIN_NAME > /dev/null 2>&1
                        [ -f $OCF_RESKEY_config ] || cp -f $tmpf $OCF_RESKEY_config
                        rm -f $tmpf
                fi
        fi
}

Regards,
Ulrich




Re: [ClusterLabs] Q: What is lvmlockd locking?

2021-01-22 Thread Roger Zhou



On 1/22/21 6:58 PM, Ulrich Windl wrote:

Roger Zhou  schrieb am 22.01.2021 um 11:26 in Nachricht

<8dcd53e2-b65b-aafe-ae29-7bdeea3b8...@suse.com>:


On 1/22/21 5:45 PM, Ulrich Windl wrote:

Roger Zhou  schrieb am 22.01.2021 um 10:18 in Nachricht

:


The naming of lvmlockd and virtlockd could have misled you, I guess.


I agree that there is one "virtlockd" name in the resources that refers to

lvmlockd. That is confusing, I agree.

But: Isn't virtlockd trying to lock the VM images used? Those are located on

a different OCFS2 filesystem here.

Right. virtlockd works together with libvirt for Virtual Machines locking.


And I thought virtlockd is using lvmlockd to lock those images. Maybe I'm

just confused.

Even after reading the manual page of virtlockd I could not find out how it

actually does perform locking.


lsof suggests it used files like this:


/var/lib/libvirt/lockd/files/f9d587c61002c7480f8b86116eb4f7dfa210e52af7e94476
2f58c2c2f89a6865

This file lock indicates the VM backing file is a qemu image. In case the VM

backing storage is SCSI or LVM, the directory structure will change

/var/lib/libvirt/lockd/scsi
/var/lib/libvirt/lockd/lvm

Some years ago, a draft patch set was sent to the libvirt community to add an 
alternative that lets virtlockd use a DLM lock, so no filesystem (NFS, OCFS2, or 
GFS2(?)) would be needed for "/var/lib/libvirt/lockd". Well, the libvirt 
community was not very motivated to move it forward.



That filesystem is OCFS:
h18:~ # df /var/lib/libvirt/lockd/files
Filesystem 1K-blocks  Used Available Use% Mounted on
/dev/md10 261120 99120162000  38% /var/lib/libvirt/lockd


Could part of the problem be that systemd controls virtlockd, but the

filesystem it needs is controlled by the cluster?


Do I have to mess with those systemd resources in the cluster?:
systemd:virtlockd   systemd:virtlockd-admin.socket

systemd:virtlockd.socket




The cluster configuration would be more complete and solid if you did so. Though,
I think it could work to let libvirtd and virtlockd run outside of the
cluster stack, as long as the whole system is not too complex to manage. Anyway,
testing could tell.


Hi!

So basically I have one question: Does virtlockd need a cluster-wide 
filesystem?
When running on a single node (the usual case assumed in the docs) a local 
filesystem will do, but how would virtlockd prevent a VM that uses a shared 
filesystem or disk from starting on two different nodes?


The libvirt community guides users to use NFS in this case. We, the cluster 
community, could have fun with the cluster filesystem ;)
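
For reference, the virtlockd side roughly looks like this (option names as in
libvirt's qemu.conf and qemu-lockd.conf; double-check against your libvirt
version):

# /etc/libvirt/qemu.conf
lock_manager = "lockd"

# /etc/libvirt/qemu-lockd.conf
auto_disk_leases = 1
# must live on a filesystem visible to all nodes (NFS, or a cluster FS like OCFS2)
file_lockspace_dir = "/var/lib/libvirt/lockd/files"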


Cheers,
Roger



Unfortunately I had exactly that before deploying the virtlockd configuration, 
and the filesystem for the VM is damaged to a degree that made it unrecoverable.

Regards,
Ulrich



BR,
Roger




Anyway, two more tweaks needed in your CIB:

colocation col_vm__virtlockd inf: ( prm_xen_test-jeos1 prm_xen_test-jeos2
prm_xen_test-jeos3 prm_xen_test-jeos4 ) cln_lockspace_ocfs2

order ord_virtlockd__vm Mandatory: cln_lockspace_ocfs2 ( prm_xen_test-jeos1
prm_xen_test-jeos2 prm_xen_test-jeos3 prm_xen_test-jeos4 )


I'm still trying to understand all that. Thanks for helping so far.

Regards,
Ulrich




BR,
Roger














Re: [ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Re: Q: What is lvmlockd locking?

2021-01-22 Thread Roger Zhou



On 1/22/21 5:45 PM, Ulrich Windl wrote:

Roger Zhou  schrieb am 22.01.2021 um 10:18 in Nachricht

:


The naming of lvmlockd and virtlockd could have misled you, I guess.


I agree that there is one "virtlockd" name in the resources that refers to 
lvmlockd. That is confusing, I agree.
But: Isn't virtlockd trying to lock the VM images used? Those are located on a 
different OCFS2 filesystem here.


Right. virtlockd works together with libvirt for Virtual Machines locking.


And I thought virtlockd is using lvmlockd to lock those images. Maybe I'm just 
confused.
Even after reading the manual page of virtlockd I could not find out how it 
actually does perform locking.

lsof suggests it used files like this:
/var/lib/libvirt/lockd/files/f9d587c61002c7480f8b86116eb4f7dfa210e52af7e944762f58c2c2f89a6865


This file lock indicates the VM backing file is a qemu image. In case the VM 
backing storage is SCSI or LVM, the directory structure will change


/var/lib/libvirt/lockd/scsi
/var/lib/libvirt/lockd/lvm

Some years ago, a draft patch set was sent to the libvirt community to add an 
alternative that lets virtlockd use a DLM lock, so no filesystem (NFS, OCFS2, or 
GFS2(?)) would be needed for "/var/lib/libvirt/lockd". Well, the libvirt 
community was not very motivated to move it forward.




That filesystem is OCFS:
h18:~ # df /var/lib/libvirt/lockd/files
Filesystem 1K-blocks  Used Available Use% Mounted on
/dev/md10 261120 99120162000  38% /var/lib/libvirt/lockd


Could part of the problem be that systemd controls virtlockd, but the 
filesystem it needs is controlled by the cluster?

Do I have to mess with those systemd resources in the cluster?:
systemd:virtlockd   systemd:virtlockd-admin.socket  
systemd:virtlockd.socket



The cluster configuration would be more complete and solid if you did so. Though, 
I think it could work to let libvirtd and virtlockd run outside of the 
cluster stack, as long as the whole system is not too complex to manage. Anyway, 
testing could tell.


BR,
Roger




Anyway, two more tweaks needed in your CIB:

colocation col_vm__virtlockd inf: ( prm_xen_test-jeos1 prm_xen_test-jeos2
prm_xen_test-jeos3 prm_xen_test-jeos4 ) cln_lockspace_ocfs2

order ord_virtlockd__vm Mandatory: cln_lockspace_ocfs2 ( prm_xen_test-jeos1
prm_xen_test-jeos2 prm_xen_test-jeos3 prm_xen_test-jeos4 )


I'm still trying to understand all that. Thanks for helping so far.

Regards,
Ulrich




BR,
Roger









Re: [ClusterLabs] Antw: Re: Antw: [EXT] Re: Q: What is lvmlockd locking?

2021-01-22 Thread Roger Zhou



On 1/22/21 4:17 PM, Ulrich Windl wrote:

Gang He  schrieb am 22.01.2021 um 09:13 in Nachricht

<1fd1c07d-d12c-fea9-4b17-90a977fe7...@suse.com>:

Hi Ulrich,

I reviewed the crm configuration file, there are some comments as below,
1) lvmlockd resource is used for shared VG, if you do not plan to add
any shared VG in your cluster, I suggest to drop this resource and clone.


Agree with Gang.

There is no need for 'lvmlockd' in your configuration anymore. You could remove 
all the "lvmlockd"-related configuration.



2) second, lvmlockd service depends on DLM service, it will create
"lvm_xxx" related lock spaces when any shared VG is created/activated.
but some other resource also depends on DLM to create lock spaces for
avoiding race condition, e.g. clustered MD, ocfs2, etc. Then, the file
system resource should start later than lvm2(lvmlockd) related resources.
That means this order should be wrong.
order ord_lockspace_fs__lvmlockd Mandatory: cln_lockspace_ocfs2 cln_lvmlock

But cln_lockspace_ocfs2 provides the shared filesystem that lvmlockd uses. I
thought for locking in a cluster it needs a cluster-wide filesystem.



I understand your root motivation is to set up virtlockd on top of OCFS2.

There is no relation between OCFS2 and lvmlockd unless you set up OCFS2 on top 
of Cluster LVM (aka a shared VG), which is not your case.


The naming of lvmlockd and virtlockd could have misled you, I guess.

Anyway, two more tweaks needed in your CIB:

colocation col_vm__virtlockd inf: ( prm_xen_test-jeos1 prm_xen_test-jeos2 
prm_xen_test-jeos3 prm_xen_test-jeos4 ) cln_lockspace_ocfs2


order ord_virtlockd__vm Mandatory: cln_lockspace_ocfs2 ( prm_xen_test-jeos1 
prm_xen_test-jeos2 prm_xen_test-jeos3 prm_xen_test-jeos4 )



BR,
Roger



Re: [ClusterLabs] Antw: [EXT] Re: Questions about the infamous TOTEM retransmit list

2021-01-13 Thread Roger Zhou



On 1/13/21 3:31 PM, Ulrich Windl wrote:

Roger Zhou  schrieb am 13.01.2021 um 05:32 in Nachricht

<97ac2305-85b4-cbb0-7133-ac1372143...@suse.com>:

On 1/12/21 4:23 PM, Ulrich Windl wrote:

Hi!

Before setting up our first pacemaker cluster we thought one low-speed

redundant network would be good in addition to the normal high-speed network.

However as is seems now (SLES15 SP2) there is NO reasonable RRP mode to

drive such a configuration with corosync.


Passive RRP mode with UDPU still sends each packet through both nets,


Indeed, packets are sent in the round-robin fashion.


being throttled by the slower network.
(Originally we were using multicast, but that was even worse)

Now I realized that even under modest load, I see messages about "retransmit

list", like this:

Jan 08 10:57:56 h16 corosync[3562]:   [TOTEM ] Retransmit List: 3e2
Jan 08 10:57:56 h16 corosync[3562]:   [TOTEM ] Retransmit List: 3e2 3e4
Jan 08 11:13:21 h16 corosync[3562]:   [TOTEM ] Retransmit List: 60e 610 612

614

Jan 08 11:13:21 h16 corosync[3562]:   [TOTEM ] Retransmit List: 610 614
Jan 08 11:13:21 h16 corosync[3562]:   [TOTEM ] Retransmit List: 614
Jan 08 11:13:41 h16 corosync[3562]:   [TOTEM ] Retransmit List: 6ed



What's the latency of this low speed link?


The normal net is fibre-based:
4 packets transmitted, 4 received, 0% packet loss, time 3058ms
rtt min/avg/max/mdev = 0.131/0.175/0.205/0.027 ms

The redundant net is copper-based:
5 packets transmitted, 5 received, 0% packet loss, time 4104ms
rtt min/avg/max/mdev = 0.293/0.304/0.325/0.019 ms



Aha, RTT < 1 ms, the network is fast enough. That clears up my doubt; I had 
guessed the latency of the slow link might even be at the level of tens or 
hundreds of ms. Then, I might wonder whether the corosync packets got bad luck 
and were delayed due to workload on one of the links.





Questions on that:
Will the situation be much better with knet?


knet provides "link_mode: passive", which could fit your thinking somewhat, since
it is not round-robin. But it still doesn't fit your game well, since knet again
assumes similar latency among links. You may have to tune parameters for the
low-speed link and likely sacrifice the benefit of the fast link.


Well, in the past when using HP Service Guard, everything worked quite 
differently:
There was a true heartbeat on each cluster net, determining its "being alive", 
and when the cluster performed no action there was no traffic on the cluster links 
(except that heartbeat).
When the cluster actually had to talk, it used the link that was flagged 
"alive", preferring the primary first, then the secondary when both were 
available.



"link_mode: passive" together with knet_link_priority would be useful. Also, 
use sctp in knet could be the alternative too.
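
A corosync.conf sketch along those lines (a rough illustration only; check
`man corosync.conf` for the exact semantics of these options on your version):

totem {
    version: 2
    transport: knet
    link_mode: passive

    interface {
        linknumber: 0
        knet_link_priority: 2    # fast fibre link, preferred in passive mode
    }
    interface {
        linknumber: 1
        knet_link_priority: 1    # slow copper link, fallback only
    }
}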


Cheers,
Roger



Re: [ClusterLabs] Questions about the infamous TOTEM retransmit list

2021-01-13 Thread Roger Zhou

On 1/12/21 4:23 PM, Ulrich Windl wrote:

Hi!

Before setting up our first pacemaker cluster we thought one low-speed 
redundant network would be good in addition to the normal high-speed network.
However as is seems now (SLES15 SP2) there is NO reasonable RRP mode to drive 
such a configuration with corosync.

Passive RRP mode with UDPU still sends each packet through both nets, 


Indeed, packets are sent in the round-robin fashion.


being throttled by the slower network.
(Originally we were using multicast, but that was even worse)

Now I realized that even under modest load, I see messages about "retransmit 
list", like this:
Jan 08 10:57:56 h16 corosync[3562]:   [TOTEM ] Retransmit List: 3e2
Jan 08 10:57:56 h16 corosync[3562]:   [TOTEM ] Retransmit List: 3e2 3e4
Jan 08 11:13:21 h16 corosync[3562]:   [TOTEM ] Retransmit List: 60e 610 612 614
Jan 08 11:13:21 h16 corosync[3562]:   [TOTEM ] Retransmit List: 610 614
Jan 08 11:13:21 h16 corosync[3562]:   [TOTEM ] Retransmit List: 614
Jan 08 11:13:41 h16 corosync[3562]:   [TOTEM ] Retransmit List: 6ed



What's the latency of this low-speed link?

I guess it is rather large, and probably not suitable for this use unless you 
modify the default corosync.conf carefully. To put it another way, by default 
corosync mostly works for local networks with small latency. Also, it is not 
designed for links with largely different latencies.



Questions on that:
Will the situation be much better with knet?


knet provides "link_mode: passive", which could fit your thinking somewhat, since 
it is not round-robin. But it still doesn't fit your game well, since knet again 
assumes similar latency among links. You may have to tune parameters for the 
low-speed link and likely sacrifice the benefit of the fast link.



Is there a smooth migration path from UDPU to knet?


Off the top of my head, corosync 3 needs a restart when switching from "transport: udpu" to 
"transport: knet".


Cheers,
Roger



Re: [ClusterLabs] Antw: Re: Antw: [EXT] delaying start of a resource

2020-12-17 Thread Roger Zhou


Here is a tool intended to standardize the approach to simulating split-brain:
https://software.opensuse.org/package/python3-cluster-preflight-check

After installation, simply run the command:
`ha-cluster-preflight-check --split-brain-iptables`


Thanks,
Roger



On 12/17/20 4:14 PM, Gabriele Bulfon wrote:

I see, but then I have two issues:
1. it is a dual-node server, the HA interface is internal, I have no way to 
unplug it, that's why I tried turning it down
2. even if I could test it by unplugging it, there is still the 
possibility that someone turns the interface down, causing a bad situation for 
the zpool... so I would like to understand why xstha2 decided to turn on the IP and 
zpool when the stonith of xstha1 was not yet done...

*Sonicle S.r.l. *: http://www.sonicle.com 
*Music: *http://www.gabrielebulfon.com 
*eXoplanets : *https://gabrielebulfon.bandcamp.com/album/exoplanets 





--

Da: Ulrich Windl 
A: users@clusterlabs.org
Data: 17 dicembre 2020 7.48.46 CET
Oggetto: [ClusterLabs] Antw: Re: Antw: [EXT] delaying start of a resource

 >>> Gabriele Bulfon  schrieb am 16.12.2020 um 15:56 in
Nachricht <386755316.773.1608130588146@www>:
 > Thanks, here are the logs, there are infos about how it tried to start
 > resources on the nodes.
 > Keep in mind the node1 was already running the resources, and I 
simulated a
 > problem by turning down the ha interface.

Please note that "turning down" an interface is NOT a realistic test;
realistic would be to unplug the cable.

 >
 > Gabriele
 >
 >
 > Sonicle S.r.l. : http://www.sonicle.com
 > Music: http://www.gabrielebulfon.com
 > eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 >
 >
 >
 >
 >
 > 

 > --
 >
 > Da: Ulrich Windl 
 > A: users@clusterlabs.org
 > Data: 16 dicembre 2020 15.45.36 CET
 > Oggetto: [ClusterLabs] Antw: [EXT] delaying start of a resource
 >
 >
  Gabriele Bulfon  schrieb am 16.12.2020 um 15:32 
in
 > Nachricht <1523391015.734.1608129155836@www>:
 >> Hi, I have now a two node cluster using stonith with different
 >> pcmk_delay_base, so that node 1 has priority to stonith node 2 in case 
of
 >> problems.
 >>
 >> Though, there is still one problem: once node 2 delays its stonith 
action
 >> for 10 seconds, and node 1 just 1, node 2 does not delay start of
resources,
 >
 >> so it happens that while it's not yet powered off by node 1 (and
waiting its
 >
 >> dalay to power off node 1) it actually starts resources, causing a
moment of
 >
 >> few seconds where both NFS IP and ZFS pool (!) is mounted by both!
 >
 > AFAIK pacemaker will not start resources on a node that is scheduled for
 > stonith. Even more: Pacemaker will try to stop resources on a node scheduled
 > for stonith to start them elsewhere.
 >
 >> How can I delay node 2 resource start until the delayed stonith action 
is
 >> done? Or how can I just delay the resource start so I can make it larger
 > than
 >> its pcmk_delay_base?
 >
 > We probably need to see logs and configs to understand.
 >
 >>
 >> Also, I was suggested to set "stonith-enabled=true", but I don't know
where
 >> to set this flag (cib-bootstrap-options is not happy with it...).
 >
 > I think it's on by default, so you must have set it to false.
 > In crm shell it is "configure# property stonith-enabled=...".
 >
 > Regards,
 > Ulrich
 >
 >






Re: [ClusterLabs] Antw: Another word of warning regarding VirtualDomain and Live Migration

2020-12-16 Thread Roger Zhou



On 12/16/20 5:06 PM, Ulrich Windl wrote:

Hi!

(I changed the subject of the thread)
VirtualDomain seems to be broken, as it does not handle a failed live-,igration 
correctly:

With my test-VM running on node h16, this happened when I tried to move it away 
(for testing):

Dec 16 09:28:46 h19 pacemaker-schedulerd[4427]:  notice:  * Migrate
prm_xen_test-jeos( h16 -> h19 )
Dec 16 09:28:46 h19 pacemaker-controld[4428]:  notice: Initiating migrate_to 
operation prm_xen_test-jeos_migrate_to_0 on h16
Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Transition 840 aborted 
by operation prm_xen_test-jeos_migrate_to_0 'modify' on h16: Event failed


The RA migrate_to failed quickly. Maybe the configuration is not perfect enough?

How about enabling tracing and collecting more RA logs to check exactly which 
virsh command is used, and then checking whether it works manually:


`crm resource trace prm_xen_test-jeos`



Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Transition 840 action 
115 (prm_xen_test-jeos_migrate_to_0 on h16): expected 'ok' but got 'error'
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result 
(error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to 
of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result 
(error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to 
of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020
### (note the message above is duplicate!)
Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  error: Resource 
prm_xen_test-jeos is active on 2 nodes (attempting recovery)
### This is nonsense after a failed live migration!


Indeed, that sounds like a valid improvement for pacemaker-schedulerd? Or, 
articulate what should happen when migrate_to fails. I couldn't find that 
defined in any doc yet.



Dec 16 09:28:47 h19 pacemaker-schedulerd[4427]:  notice:  * Recover
prm_xen_test-jeos( h19 )


So the cluster is doing exactly the wrong thing: The VM is still active on h16, while a 
"recovery" on h19 will start it there! So _after_ the recovery the VM is 
duplicated.

Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Initiating stop 
operation prm_xen_test-jeos_stop_0 locally on h19
Dec 16 09:28:47 h19 VirtualDomain(prm_xen_test-jeos)[20656]: INFO: Domain 
test-jeos already stopped.
Dec 16 09:28:47 h19 pacemaker-execd[4425]:  notice: prm_xen_test-jeos stop 
(call 372, PID 20620) exited with status 0 (execution time 283ms, queue time 
0ms)
Dec 16 09:28:47 h19 pacemaker-controld[4428]:  notice: Result of stop operation 
for prm_xen_test-jeos on h19: ok
Dec 16 09:31:45 h19 pacemaker-controld[4428]:  notice: Initiating start 
operation prm_xen_test-jeos_start_0 locally on h19

Dec 16 09:31:47 h19 pacemaker-execd[4425]:  notice: prm_xen_test-jeos start 
(call 373, PID 21005) exited with status 0 (execution time 2715ms, queue time 
0ms)
Dec 16 09:31:47 h19 pacemaker-controld[4428]:  notice: Result of start 
operation for prm_xen_test-jeos on h19: ok
Dec 16 09:33:46 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result 
(error: test-jeos: live migration to h19 failed: 1) was recorded for migrate_to 
of prm_xen_test-jeos on h16 at Dec 16 09:28:46 2020



Yeah, the scheduler is trying so hard to report the migrate_to failure here!


Amazingly manual migration using virsh worked:
virsh migrate --live test-jeos xen+tls://h18...



What about s/h18/h19/?

Or, manually reproduce exactly as the RA code:

`virsh ${VIRSH_OPTIONS} migrate --live $migrate_opts $DOMAIN_NAME $remoteuri 
$migrateuri`



Good luck!
Roger



Regards,
Ulrich Windl



Ulrich Windl schrieb am 14.12.2020 um 15:21 in Nachricht <5FD774CF.8DE : 161 :

60728>:

Hi!

I think I found the problem why a VM is started on two nodes:

Live-Migration had failed (e.g. away from h16), so the cluster uses stop and
start (stop on h16, start on h19 for example).
When rebooting h16, I see these messages (h19 is DC):

Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  warning: Unexpected result
(error: test-jeos: live migration to h16 failed: 1) was recorded for
migrate_to of prm_xen_test-jeos on h19 at Dec 14 11:54:08 2020
Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  error: Resource
prm_xen_test-jeos is active on 2 nodes (attempting recovery)

Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  notice:  * Restart
prm_xen_test-jeos( h16 )

THIS IS WRONG: h16 was booted, so no VM is running on h16 (unless there was
some autostart from libvirt. " virsh list --autostart" does not list any)

Dec 14 15:09:27 h16 VirtualDomain(prm_xen_test-jeos)[4850]: INFO: Domain
test-jeos already stopped.

Dec 14 15:09:27 h19 pacemaker-schedulerd[4427]:  error: Calculated
transition 669 (with errors), saving inputs in
/var/lib/pacemaker/pengine/pe-error-4.bz2

What's going on here?

Regards,
Ulrich


Ulrich Windl schrieb am 14.12.2020 um 

Re: [ClusterLabs] crm enhancement proposal (configure grep): Opinions?

2020-12-16 Thread Roger Zhou

Hi Ulrich,

Sounds reasonable and handy! Can you create a GitHub issue to track this?

Thanks,
Roger


On 11/30/20 8:47 PM, Ulrich Windl wrote:

Hi!

What would users of crm shell think about this enhancement proposal:
crm configure grep <pattern>
That command would search the configuration for any occurrence of <pattern> and 
would list the names where it occurred.

That is, if the pattern is testXYZ, then all resources either having testXYZ in their name or 
having the string testXYZ anywhere "inside" would be listed.

One could even construct more interesting commands like
"show [all] matching <pattern>" or "edit [all] matching <pattern>"

Regards,
Ulrich






Re: [ClusterLabs] resource management of standby node

2020-12-08 Thread Roger Zhou



On 12/1/20 4:03 PM, Ulrich Windl wrote:

Ken Gaillot  schrieb am 30.11.2020 um 19:52 in Nachricht

:

...


Though there's nothing wrong with putting all nodes in standby. Another
alternative would be to set the stop-all-resources cluster property.


Hi Ken,

thanks for the valuable feedback!

I was looking for that, but unfortunately crm shell cannot set that from the 
resource (or node) context; only from the configure context.
I don't know what a good syntax would be "resource stop all" / "resource start all" or 
"resource stop-all" / "resource unstop-all"
(the asymmetry is that after a "stop all" you cannot start a singly resource (I guess), 
but you'll have to use "start-all" (which, in turn, does not start resources that have a 
stopped role (I guess).

So maybe "resource set stop-all" / "resource unset stop-all" / "resource clear 
stop-all"



1.
Well, letting `crm resource stop|start all` change the cluster property 
`stop-all-resources` might contaminate the syntax at the resource level. 
To avoid that, the user interface would need to be more careful to deliver the 
proper information about the internals up front, to some degree, to avoid 
potential misunderstandings or questions.


2.
On the other hand, people might naturally read `crm resource stop all` as 
setting `target-role=Stopped` on all resources. Well, technically this seems a 
bit awkward, with no obvious benefit compared to stop-all-resources. And 
pacemaker developers could comment more on the internals around this.


3.
`resource set|unset` adds more commands under `resource`, will confuse some 
users, and should be avoided in my view.


I feel more discussion is expected, though my gut feeling is that approach 1 is 
the better one.


Anyway, a good topic indeed. Feedback from more users would be useful to shape 
a better UI/UX. I can imagine some people may even have the idea to suggest 
"--all", btw.


Thanks,
Roger



Re: [ClusterLabs] Antw: [EXT] Re: Q: high-priority messages from DLM?

2020-12-08 Thread Roger Zhou



On 12/8/20 6:48 PM, Strahil Nikolov wrote:

Nope,

but if you don't use clustered FS, you could also use plain LVM + tags.
As far as I know you need dlm and clvmd for clustered FS.



FYI, clvmd has been dropped since lvm2 v2.03 and is replaced by lvmlockd. BTW, 
lvmlockd (or its predecessor clvmd) is optional here in theory, though 
practically useful.



On Fri, Dec 4, 2020 at 5:32 AM Ulrich Windl


Offtopic: Are you using DLM with OCFS2 ?


Hi!

I'm using OCFS2, but I tend to ask "Can I use OCFS2 _without_ DLM?". ;-)



As Strahil said, DLM is a must-have. As a side note, however, a component named 
"o2cb" in the kernel space is an alternative that replaces corosync/pacemaker.


BR,
Roger



Re: [ClusterLabs] Antw: [EXT] sbd v1.4.2

2020-12-08 Thread Roger Zhou

Great news for the new version, first of all!

On 12/8/20 8:12 PM, Klaus Wenninger wrote:

On 12/8/20 11:51 AM, Klaus Wenninger wrote:

On 12/3/20 9:29 AM, Reid Wahl wrote:

On Thu, Dec 3, 2020 at 12:03 AM Ulrich Windl


[...]


‑ add robustness against misconfiguration / improve documentation

   * add environment section to man‑page previously just available in
 template‑config
   * inform the user to restart the sbd service after disk‑initialization

I thought with adding UUIDs sbd automatically detects a header change.

You're having a valid point here.
Actually a disk-init on an operational cluster should be
quite safe. (A very small race between header and slot
read does exist.)
Might make sense to think over taking the message back
or revising it.

Yan Gao just pointed me to the timeout configuration not being
updated if it changes in the header.
Guess until that is tackled one way or another the message
is a good idea.



Indeed, users may want to tune sbd at runtime without restarting the whole 
cluster stack, e.g. the watchdog timeout, msgwait timeout, etc. Currently, changing 
these timeouts forces users to recreate the sbd disk, which is a slightly 
strange user experience. And restarting the whole cluster is an even worse impression.
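
For reference, today the workflow looks roughly like this (the device path is a
placeholder):

# show the timeouts stored in the on-disk header
sbd -d /dev/disk/by-id/<sbd-device> dump

# re-create the header with new timeouts (e.g. watchdog 10s, msgwait 20s),
# which is why sbd then has to be restarted across the cluster
sbd -d /dev/disk/by-id/<sbd-device> -1 10 -4 20 create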


I can understand there are gaps currently, e.g. reinitializing the watchdog driver 
timeout, followed by a script to refresh the pacemaker stonith-watchdog-timeout 
and stonith-timeout, etc.


Furthermore, can we even change SBD_DEVICE at runtime?

All in all, these would add flexibility to management activities and make for a 
better user experience overall.


Thanks,
Roger



Re: [ClusterLabs] Q: "crm node status" display

2020-12-08 Thread Roger Zhou

Can you create a GitHub issue before we lose track of this? Thank you, Ulrich!
https://github.com/ClusterLabs/crmsh/issues

BR,
Roger


On 11/20/20 2:50 PM, Ulrich Windl wrote:

Hi!

Setting up a new cluster with SLES15 SP2, I'm wondering: "crm node status" 
displays XML. Is that the way it should be?
h16:~ # crm node
crm(live/rksaph16)node# status

   
   
   


crmsh-4.2.0+git.1604052559.2a348644-5.26.1.noarch

Regards,
Ulrich




Re: [ClusterLabs] Antw: [EXT] Re: Setting up HA cluster on Raspberry pi4 with ubuntu 20.04 aarch64 architecture

2020-06-16 Thread Roger Zhou



On 6/15/20 3:44 PM, Ulrich Windl wrote:

Strahil Nikolov  schrieb am 12.06.2020 um 14:00 in

Nachricht
<22726_1591963256_5EE36E78_22726_156_1_03FA2901-B9CC-4CE7-8952-283A864E1C72@yaho
.com>:

Out  of  curiosity , are you running it on sles/opensuse?

I think it is easier with  'crm cluster start'.


Indeed, it tries to provide a consistent interface to the end user, even across 
releases. It does manage the complexity behind the scenes.




Otherwise you can run 'journalctl -u pacemaker.service  -e'  to find what
dependency has failed.

Another  one is:

'systemctl list-dependencies pacemaker.service'


I wonder what "resource-agents-deps.target" (Description=resource-agents 
dependencies) is for (in SLES12 SP5).


It is used for drop-in dependencies at runtime. You will get more context about 
it from the output of:


```
resource-agents.git> git grep systemd_drop_in
```

Cheers,
Roger






Best Regards,
Strahil  Nikolov





Re: [ClusterLabs] Mirrored cLVM/Xen PVM Performance question for block device

2020-05-25 Thread Roger Zhou



On 5/20/20 2:50 PM, Ulrich Windl wrote:

Hi!

I have a performance question regarding delay for reading blocks in a PV Xen VM.
First, a little background: Originally, to monitor NFS outages, I developed a tool 
"iotwatch" (short: IOTW) that reads the first block of a block device or file 
(or anything you can open() and read() with Direct I/O). The tool samples the target at a 
rather high rate (like 5s), keeping statistics that are queried at a lower rate (like 5 
min).

A wrapper around the tool is used as a monitoring plugin, and the output looks 
like this:
/dev/sys/var: alpha=0.01, count=75(120/120), last=0.0011, avg=0.00423/0.00264/0\
.00427, min=0.00052(0.00052/0.00084), max=0.02465(0.02465/0.02062), variance=0.\
5(0.3)|last=0.0011;;;0 exp_avg=0.00427;;;0 emin=0.00084;;;0 emax=0.0206\
2;;;0 davg=0.00264;;;0 dstd_dev=0.00617;;;0

A short explanation what these numbers mean:
"alpha" is the weight used for exponential averaging (e.g. for "exp_avg"). "count" is the number of samples since last read and the 
number of samples in the sampling queue (e.g. 120 valid samples ot of a maximum of 120). The values "avg" is average, "min" is the 
minimum", "max" is the maximum, "variance" is what it says, and "last" is the last sampling value.
In text output there are three numbers instead of just one, meaning (the 
indicated value, the average of the value within the sampling queue, and the 
exponentially averaged value). This is mostly for debugging. The performance 
data output has just one of those values, selectable via command-line option. 
Also the statistics can be (in this case they are) reset after it was read, so 
min and max will start anew...

OK, that was a rather long story before presenting the details:

A VM has its root disk on a mirrored LV (cLVM) presented as "phy:", and inside 
the VM the disk is partitioned like this:
Device Boot  Start  End  Sectors  Size Id Type
/dev/xvdb1 *  2048   411647   409600  200M 83 Linux
/dev/xvdb2  411648 83886079 83474432 39.8G  5 Extended
/dev/xvdb5  413696 83886079 83472384 39.8G 8e Linux LVM

xvdb5 is a PV for the sys VG, like this:
   opt  sys -wi-ao   4.00g
   root sys -wi-ao   8.00g
   srv  sys -wi-ao   4.00g
   swap sys -wi-ao   2.00g
   tmp  sys -wi-ao 512.00m
   var  sys -wi-ao   6.00g

LV var is mounted on /var as ext3 (acl,user_xattr). The timing threads runs 
with prio -80 (nice 0) at SCHED_RR, so I guess other processes won't disturb 
the measurements much. I see no other threads using a real-time scheduling 
policy in the VM; system tasks seem to run at prio 0 with some negative nice 
value instead...
(On the xen host corosync, DLM and OCFS2 runs with prio -2)

Now the story: The performance of the root disk inside the VM (IOTW-PV) has a 
typical read delay of less than 2ms with peaks below 40ms (A comparable local 
disk in bare metal) would have less than 0.2ms delay with peaks below 7ms).  
However when timing the var LV (IOTW_FS), the average is below 4ms with peaks 
up to 80ms.

The storage system behind is a FC-based 3PAR StorServ with all SSDs and the 
service time for reads is (according to the storage system'S own perfomance 
monitor (SSMC)) significantly below 0.25ms at the same time interval.

So I wonder: How can LVM in the VM add another 40ms peak to the base timing? 
The other thing that puzzles  me is this: While the timing for the root disk is 
basically good with very few peaks, the timing of the LV has mainly three 
levels: First, most common level is good performance. the next level is like 
20ms (more), and the third level are peaks of another 20 or 40 ms.

Is there any explanation for this? The VM is SLES12 SP5, while the Xen Host is 
still SLES11 SP4.

At the moment I'm thinking how to implement VM disks in a way that is efficient 
while supporting live migration of VMs.
In the past we were using filesystem images stored in OCFS2 which itself was 
put in a mirrored cLVM LV. Performance was rather poor, so I skipped the OCFS2 
layer and created a separate LV for each VM. Unfortunately mirroring all VM 
images to different storage systems is an absolute requirement.



Hi Ulrich,

In your use case, the (clustered) LVM2 mirroring layer is known to be a 
performance-sensitive concern. OCFS2 and the SLES12 SP5 VM should not be a 
performance concern in your stack.


I do see an improvement in upgrading your host to SLES12 SP5 if possible. Then 
you can evolve clustered LVM2 mirroring into clustered MD RAID1, which is intended to 
resolve the LVM2 mirroring performance concern. You can play with the following 
migration doc; it should apply to SLES12 SP5 too.


https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-clvm.html#sec-ha-clvm-migrate

Cheers,
Roger




I'd be glad to get some insights.

Regards,
Ulrich





Re: [ClusterLabs] SBD restarted the node while pacemaker in maintenance mode

2019-12-26 Thread Roger Zhou

On 12/24/19 11:48 AM, Jerry Kross wrote:
> Hi,
> The pacemaker cluster manages a 2 node database cluster configured to use 3 
> iscsi disk targets in its stonith configuration. The pacemaker cluster was 
> put 
> in maintenance mode but we see SBD writing to the system logs. And just after 
> these logs, the production node was restarted.
> Log:
> sbd[5955]:  warning: inquisitor_child: Latency: No liveness for 37 s exceeds 
> threshold of 36 s (healthy servants: 1)
> I see these messages logged and then the node was restarted. I suspect if it 
> was the softdog module that restarted the node but I don't see it in the 
> logs. 

sbd is too critical to share the io path with others.

Very likely the workload is too heavy, the iSCSI connections are broken, and 
sbd loses access to the disks; then sbd uses sysrq 'b' to reboot the node 
brutally and immediately.

Regarding the watchdog reboot, it kicks in when sbd is not able to tickle the 
watchdog in time, e.g. when sbd is starved for CPU or has crashed. It is crucial 
too, but not likely the case here.
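
A couple of quick checks in such a situation (the device path is a placeholder):

sbd query-watchdog                        # is a usable watchdog device present?
sbd -d /dev/disk/by-id/<sbd-device> list  # can the disk servant still read its slots?

If the iSCSI path is gone, the second command will hang or fail, which matches
the "no liveness" warning above.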

Merry X'mas and Happy New Year!
Roger


Re: [ClusterLabs] Dual Primary DRBD + OCFS2

2019-11-19 Thread Roger Zhou

On 11/19/19 4:51 PM, Илья Насонов wrote:
> Hello!
> 
> Configured a cluster (2-node DRBD+DLM+CFS2) and it works.
> 
> I heard the opinion that OCFS2 file system is better. Found an old 
> cluster setup 
> description:https://wiki.clusterlabs.org/wiki/Dual_Primary_DRBD_%2B_OCFS2
> 
> but as I understand it, o2cb Service is not supported Pacemaker on Debian.
> 
> Where can I get the latest information on setting up the OCFS2.

Probably you can refer to the SUSE doc for OCFS2 with Pacemaker [1]. It should 
not be much different to adapt to Debian, I feel.

[1] 
https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-ocfs2.html

Cheers,
Roger


> 
> С уважением,
> Илья Насонов
> elias@po-mayak
> 
> 

Re: [ClusterLabs] Antw: Re: fencing on iscsi device not working

2019-11-06 Thread Roger Zhou


On 11/7/19 1:55 AM, Andrei Borzenkov wrote:
> 06.11.2019 18:55, Ken Gaillot пишет:
>> On Wed, 2019-11-06 at 08:04 +0100, Ulrich Windl wrote:
>> Ken Gaillot  schrieb am 05.11.2019 um
>> 16:05 in
>>>
>>> Nachricht
>>> :
 Coincidentally, the documentation for the pcmk_host_check default
 was
 recently updated for the upcoming 2.0.3 release. Once the release
 is
 out, the online documentation will be regenerated, but here is the
 text:

 Default
 ‑‑‑
 static‑list if either pcmk_host_list or pcmk_host_map is set,
 otherwise
 dynamic‑list if the fence device supports the list action,
 otherwise
 status if the fence device supports the status action, otherwise
 none
>>>
>>> I'd make that an itemized list with four items. I thinks it would be
>>> easer to
>>> understand.
>>
>> Good idea; I edited it so that the default and description are
>> combined:
>>
>> How to determine which machines are controlled by the device. Allowed
>> values:
>>
>> * +static-list:+ check the +pcmk_host_list+ or +pcmk_host_map+
>> attribute (this is the default if either one of those is set)
>>
>> * +dynamic-list:+ query the device via the "list" command (this is
>> otherwise the default if the fence device supports the list action)
>>
> 
> Oops, now it became even more ambiguous. What if both pcmk_host_list is
> set *and* device supports "list" (or "status") command? Previous variant
> at least was explicit about precedence.
> 
> "Otherwise" above is hard to attribute correctly. I really like previous
> version more.

+1

Plus my 2 cents:

I feel the Default and the assigned value are confusing if combined in 
the description as above. I prefer to keep them separate.

I guess Ken might want to keep the Pacemaker_Explained doc more readable at 
the end of the day, i.e. to avoid too many words in the Default column [1]. 
For that, maybe we can do it differently, like the mockup [2].

[1] 
https://github.com/ClusterLabs/pacemaker/blob/d863971b7e0c56fbe6cc12815348e8e39b2e25c4/doc/Pacemaker_Explained/en-US/Ch-Fencing.txt#L182

[2]

|pcmk_host_check
|string
|+NOTE+
a|How to determine which machines are controlled by the device.

* +NOTE:+
  The default value is static-list if either +pcmk_host_list+ or 
+pcmk_host_map+ is set,
  otherwise dynamic-list if the fence device supports the list action,
  otherwise status if the fence device supports the status action,
  otherwise none.

  Allowed values:

* +dynamic-list:+ query the device via the "list" command
* +static-list:+ check the +pcmk_host_list+ or +pcmk_host_map+ attribute
* +status:+ query the device via the "status" command
* +none:+ assume every device can fence every machine


Cheers,
Roger

> 
>> * +status:+ query the device via the "status" command (this is
>> otherwise the default if the fence device supports the status action)
>>
>> * +none:+ assume every device can fence every machine (this is
>> otherwise the default)
>>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] fencing on iscsi device not working

2019-11-04 Thread Roger Zhou


On 11/3/19 12:56 AM, wf...@niif.hu wrote:
> Andrei Borzenkov  writes:
> 
>> According to documentation, pcmk_host_list is used only if
>> pcmk_host_check=static-list which is not default, by default pacemaker
>> queries agent for nodes it can fence and fence_scsi does not return
>> anything.
> 
> The documentation is somewhat vague here.  The note about pcmk_host_list
> says: "optional unless pcmk_host_check is static-list".  It does not
> state how pcmk_host_list is used if pcmk_host_check is the default
> dynamic-list, 

The confusion might be because of "the language barrier".

My interpretation is like this:

1. pcmk_host_list is used only if pcmk_host_check is static-list.

2. pcmk_host_check's default is dynamic-list.
That means, by default pcmk_host_list is not used at all.
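
To make it concrete, a rough sketch for fence_scsi with an explicit
static-list (crm shell syntax; the device path and node names are made up):

    primitive fence-scsi stonith:fence_scsi \
        params devices=/dev/disk/by-id/scsi-36001405abcdef \
        pcmk_host_list="node1 node2" pcmk_host_check=static-list \
        meta provides=unfencing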

Cheers,
Roger


> but I successfully use such setups with Pacemaker 1.1.16
> with fence_ipmilan.  Maybe the behavior is different in 2.0.1 (the
> version in Debian buster).  Ram, what happens if you set pcmk_host_check
> to static-list?
> 
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Stupid DRBD/LVM Global Filter Question

2019-10-30 Thread Roger Zhou


On 10/30/19 6:17 AM, Eric Robinson wrote:
> If I have an LV as a backing device for a DRBD disk, can someone explain 
> why I need an LVM filter? It seems to me that we would want the LV to be 
> always active under both the primary and secondary DRBD devices, and 
> there should be no need or desire to have the LV activated or 
> deactivated by Pacemaker. What am I missing?

Your understanding is correct. There is no need to use the LVM resource
agent from Pacemaker in your case.

--Roger

> 
> --Eric
> 
> Disclaimer : This email and any files transmitted with it are 
> confidential and intended solely for intended recipients. If you are not 
> the named addressee you should not disseminate, distribute, copy or 
> alter this email. Any views or opinions presented in this email are 
> solely those of the author and might not represent those of Physician 
> Select Management. Warning: Although Physician Select Management has 
> taken reasonable precautions to ensure no viruses are present in this 
> email, the company cannot accept responsibility for any loss or damage 
> arising from the use of this email or attachments.
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] volume group won't start in a nested DRBD setup

2019-10-29 Thread Roger Zhou


On 10/29/19 12:30 PM, Andrei Borzenkov wrote:
>> Oct 28 14:42:56 node2 LVM(p_lvm_vg0)[8775]: INFO: Activating volume group vg0
>> Oct 28 14:42:56 node2 LVM(p_lvm_vg0)[8775]: INFO:  Reading all physical 
>> volumes. This may take a while... Found volume group "vmspace" using 
>> metadata type lvm2 Found volume group "freespace" using metadata type
>>   lvm2 Found volume group "vg0" using metadata type lvm2
>> Oct 28 14:42:56 node2 LVM(p_lvm_vg0)[8775]: INFO:  0 logical volume(s) in 
>> volume group "vg0" now active
> Resource agent really does just "vgchange vg0". Does it work when you
> run it manually?
> 

Agree with Andrei.

> 
>> Oct 28 14:42:56 node2 LVM(p_lvm_vg0)[8775]: ERROR: LVM Volume vg0 is not 
>> available (stopped)
>> Oct 28 14:42:56 node2 LVM(p_lvm_vg0)[8775]: ERROR: LVM: vg0 did not activate 
>> correctly
>> Oct 28 14:42:56 node2 pacemaker-execd[27054]:  notice: 
>> p_lvm_vg0_start_0:8775:stderr [   Configuration node global/use_lvmetad not 
>> found ]

This error indicates the root cause is related to lvmetad. Please check 
lvmetad, e.g.:

systemctl status lvm2-lvmetad
grep use_lvmetad /etc/lvm/lvm.conf

Check your lvm2 version and google its workaround/fix accordingly.
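
If it turns out lvmetad should be off for your clustered setup (an assumption
on my side, please verify against your distribution's guidance), the usual
steps look roughly like:

    # in the global {} section of /etc/lvm/lvm.conf
    use_lvmetad = 0
    # then stop the daemon and its socket
    systemctl disable --now lvm2-lvmetad.service lvm2-lvmetad.socket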

Cheers,
Roger

>> Oct 28 14:42:56 node2 pacemaker-execd[27054]:  notice: 
>> p_lvm_vg0_start_0:8775:stderr [ ocf-exit-reason:LVM: vg0 did not activate 
>> correctly ]
>> Oct 28 14:42:56 node2 pacemaker-controld[27057]:  notice: Result of start 
>> operation for p_lvm_vg0 on node2: 7 (not running)
>> Oct 28 14:42:56 node2 pacemaker-controld[27057]:  notice: 
>> node2-p_lvm_vg0_start_0:77 [   Configuration node global/use_lvmetad not 
>> found\nocf-exit-reason:LVM: vg0 did not activate correctly\n ]
>> Oct 28 14:42:56 node2 pacemaker-controld[27057]:  warning: Action 42 
>> (p_lvm_vg0_start_0) on node2 failed (target: 0 vs. rc: 7): Error
>> Oct 28 14:42:56 node2 pacemaker-controld[27057]:  notice: Transition 602 
>> aborted by operation p_lvm_vg0_start_0 'modify' on node2: Event failed



___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Safe way to stop pacemaker on both nodes of a two node cluster

2019-10-20 Thread Roger Zhou

On 10/21/19 12:28 AM, Valentin Vidić wrote:
> On Sun, Oct 20, 2019 at 09:24:31PM +0530, Dileep V Nair wrote:
>>  I am confused about the best way to stop pacemaker on both nodes of a
>> two node cluster. The options I know of are
>> 1. Put the cluster in Maintenance Mode, stop the applications manually and

Putting the whole cluster in maintenance mode is a reliable approach to
shut down pacemaker gracefully and leave the applications running
behind. It fits the use case in the title of this thread.
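
A rough sketch of that flow (crm shell syntax; adapt to pcs if you use it):

    crm configure property maintenance-mode=true
    systemctl stop pacemaker      # on each node, one by one
    # ... maintenance work ...
    systemctl start pacemaker     # on each node
    crm configure property maintenance-mode=false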

>> then stop pacemaker on both nodes. For this I need the application to  be
>> stopped manually
>> 2. Stop pacemaker on one node, wait for all resources to come up on second
>> node, then stop pacemaker on second node. This might cause a significant
>> delay because all resources has to come up on second node.
>>
>>  Is there any other way to stop pacemaker on both nodes gracefully ?
> 
> Maybe this pacemaker option can help?
> 
> stop-all-resources FALSE Should the cluster stop all resources?
> 

To shut down the applications together with pacemaker, this option is
useful. But be cautious: a stop failure of an application could trigger
STONITH. Setting "stonith-enabled=false" could remove that risk.
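
Correspondingly, a rough sketch for this route (crm shell syntax; weigh the
stonith trade-off for yourself before disabling it):

    crm configure property stonith-enabled=false   # optional, removes the risk above
    crm configure property stop-all-resources=true
    systemctl stop pacemaker                       # then on each node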

Cheers,
Roger

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Antw: Re: DLM, cLVM, GFS2 and OCFS2 managed by systemd instead of crm ?

2019-10-16 Thread Roger Zhou

On 10/16/19 3:19 PM,  Ulrich Windl  wrote:
>>>> Roger Zhou wrote on 16.10.2019 at 08:54 in message
> :
>> Hi Bernd,
>>
>> Apart from Ken's insights.
>>
>> I try to put it simple between systemd vs. pacemaker:
>>
>> pacemaker does manage dependencies among nodes, well, systemd just not.
> 
> What I also wanted to say is (maybe the reason for Bernd's message) that many
> examples how to configure OCFS or cLVM are very bad regarding extensibility: 
> If
> you follow the instructions for OCFS2, and then you want to follow the
> instructions for cLVM (just one example), you get a conflict as DLM already is
> configured, and it's not very clear how to resolve dependencies correctly. If
> you do it cLVM first, then OCFS2, you have the same problem. Likewise for
> clustered RAID.

My understanding of your feedback, and probably the same from Bernd,
roots back to "KISS". On that, I think I agree.

Well, those projects (components) under the ClusterLabs umbrella are the
ingredients to cook the meal. But the community does not provide the meal
directly, so to say.

One of the challenges here is to identify the solid solutions, and to
figure out the mutual benefit among the parties of this community. For
those parties that buy in, we can work together to add features to
simplify configuration, deployment, and orchestration of the solutions,
to make things KISS, etc. If it happens, it is really cool! Well, it
sounds like my reply is turning business oriented, and I hope you don't
mind ;)

BR,
Roger


> 
> Regards,
> Ulrich
> 
>>
>> Cheers,
>> Roger
>>
>> On 10/16/19 5:16 AM, Ken Gaillot wrote:
>>> On Tue, 2019‑10‑15 at 21:35 +0200, Lentes, Bernd wrote:
>>>> Hi,
>>>>
>>>> i'm a big fan of simple solutions (KISS).
>>>> Currently i have DLM, cLVM, GFS2 and OCFS2 managed by pacemaker.
>>>> They all are fundamental prerequisites for my resources (Virtual
>>>> Domains).
>>>> To configure them i used clones and groups.
>>>> Why not having them managed by systemd to make the cluster setup more
>>>> overseeable ?
>>>>
>>>> Is there a strong reason that pacemaker cares about them ?
>>>>
>>>> Bernd
>>>
>>> Either approach is reasonable. The advantages of keeping them in
>>> pacemaker are:
>>>
>>> ‑ Service‑aware recurring monitor (if OCF)
>>>
>>> ‑ If one of those components fails, pacemaker will know to try to
>>> recover everything in the group from that point, and if necessary,
>>> fence the node and recover the virtual domain elsewhere (if they're in
>>> systemd, pacemaker will only know that the virtual domain has failed,
>>> and likely keep trying to restart it fruitlessly)
>>>
>>> ‑ Convenience of things like putting a node in standby mode, and
>>> checking resource status on all nodes with one command
>>>
>>> If you do move them to systemd, be sure to use the resource‑agents‑deps
>>> target to ensure they're started before pacemaker and stopped after
>>> pacemaker.
>>>
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
> 
> 
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] DLM, cLVM, GFS2 and OCFS2 managed by systemd instead of crm ?

2019-10-16 Thread Roger Zhou
Hi Bernd,

Apart from Ken's insights.

I try to put it simple between systemd vs. pacemaker:

pacemaker does manage dependencies among nodes, well, systemd just not.

Cheers,
Roger

On 10/16/19 5:16 AM, Ken Gaillot wrote:
> On Tue, 2019-10-15 at 21:35 +0200, Lentes, Bernd wrote:
>> Hi,
>>
>> i'm a big fan of simple solutions (KISS).
>> Currently i have DLM, cLVM, GFS2 and OCFS2 managed by pacemaker.
>> They all are fundamental prerequisites for my resources (Virtual
>> Domains).
>> To configure them i used clones and groups.
>> Why not having them managed by systemd to make the cluster setup more
>> overseeable ?
>>
>> Is there a strong reason that pacemaker cares about them ?
>>
>> Bernd
> 
> Either approach is reasonable. The advantages of keeping them in
> pacemaker are:
> 
> - Service-aware recurring monitor (if OCF)
> 
> - If one of those components fails, pacemaker will know to try to
> recover everything in the group from that point, and if necessary,
> fence the node and recover the virtual domain elsewhere (if they're in
> systemd, pacemaker will only know that the virtual domain has failed,
> and likely keep trying to restart it fruitlessly)
> 
> - Convenience of things like putting a node in standby mode, and
> checking resource status on all nodes with one command
> 
> If you do move them to systemd, be sure to use the resource-agents-deps
> target to ensure they're started before pacemaker and stopped after
> pacemaker.
> 
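(For reference, a minimal sketch of that resource-agents-deps wiring,
assuming a systemd drop-in file; the storage service name below is made up:

    # /etc/systemd/system/resource-agents-deps.target.d/deps.conf
    [Unit]
    Requires=my-storage-setup.service
    After=my-storage-setup.service

followed by `systemctl daemon-reload`.)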
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] SBD with shared device - loss of both interconnect and shared device?

2019-10-10 Thread Roger Zhou




On 10/9/19 3:28 PM, Andrei Borzenkov wrote:
> What happens if both interconnect and shared device is lost by node? I
> assume node will reboot, correct?
> 

From my understanding of the Pacemaker integration feature in `man sbd`:

Yes, sbd will self-fence upon losing access to the sbd disk when the node
is not in a quorate state.

> Now assuming (two node cluster) second node still can access shared
> device it will fence (via SBD) and continue takeover, right?

Yes, a 2-node cluster is special. The node that loses access to the disk
will self-fence even if it is in the "quorate" state.

> 
> If both nodes lost shared device, both nodes will reboot and if access
> to shared device is not restored, then cluster services will simply
> not come up on both nodes, so it means total outage. Correct?

Yes, without a functioning SBD, pacemaker won't start at the systemd
level.

Cheers,
Roger


> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Where to find documentation for cluster MD?

2019-10-10 Thread Roger Zhou
In addition to the admin guide, there are some more advanced articles 
about the internals:

https://lwn.net/Articles/674085/
https://www.kernel.org/doc/Documentation/driver-api/md/md-cluster.rst
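
For a quick taste, creating a clustered RAID1 looks roughly like this
(device names are made up, and dlm_controld must be running on the nodes):

    mdadm --create /dev/md0 --bitmap=clustered --metadata=1.2 \
          --level=mirror --raid-devices=2 /dev/sda /dev/sdb
    mdadm --assemble /dev/md0 /dev/sda /dev/sdb   # on the other node(s)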

Cheers,
Roger


On 10/10/19 4:27 PM, Gang He wrote:
> Hello Ulrich
> 
> Cluster MD belongs to SLE HA extension product.
> The related doc link is here, e.g. 
> https://documentation.suse.com/sle-ha/15-SP1/single-html/SLE-HA-guide/#cha-ha-cluster-md
> 
> Thanks
> Gang
> 
>> -Original Message-
>> From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Ulrich
>> Windl
>> Sent: 2019-10-09 15:13
>> To: users@clusterlabs.org
>> Subject: [ClusterLabs] Where to find documentation for cluster MD?
>>
>> Hi!
>>
>> In recent SLES there is "cluster MD", like in
>> cluster-md-kmp-default-4.12.14-197.18.1.x86_64
>> (/lib/modules/4.12.14-197.18-default/kernel/drivers/md/md-cluster.ko).
>> However I could not find any manual page for it.
>>
>> Where is the official documentation, meaning: Where is a description of the
>> feature supprted by SLES?
>>
>> Regards,
>> Ulrich
>>
>>
>>
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Gracefully stop nodes one by one with disk-less sbd

2019-08-13 Thread Roger Zhou



On 8/12/19 9:24 PM, Klaus Wenninger wrote:

[...]

> If you shutdown solely pacemaker one-by-one on all nodes
> and these shutdowns are considered graceful then you are
> not gonna experience any reboots (e.g. 3 node cluster).

Revisiting what you said, I then ran `systemctl stop pacemaker` one by one.

At this point corosync is still running on all nodes. (NOTE: make sure 
corosync.service has no "StopWhenUnneeded=yes", otherwise corosync gets 
stopped as a side effect when stopping pacemaker.)


> Afterwards you can shutdown corosync one-by-one as well
> without experiencing reboots as without the cib-connection
> sbd isn't gonna check for quorum anymore (all resources
> down so no need to reboot in case of quorum-loss - extra
> care has to be taken care of with unmanaged resources but
> that isn't particular with sbd).
> 

Then `systemctl stop corosync` one by one. Nice! Disk-less sbd does 
play along as described above. All nodes stay up!
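
In other words, something along these lines (node names are made up):

    for n in node1 node2 node3; do ssh $n systemctl stop pacemaker; done
    for n in node1 node2 node3; do ssh $n systemctl stop corosync; done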

Thanks,
Roger

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: Re: Gracefully stop nodes one by one with disk-less sbd

2019-08-12 Thread Roger Zhou

On 8/12/19 2:48 PM,  Ulrich Windl  wrote:
 Andrei Borzenkov wrote on 09.08.2019 at 18:40 in
> message <217d10d8-022c-eaf6-28ae-a4f58b2f9...@gmail.com>:
>> On 09.08.2019 16:34, Yan Gao wrote:

[...]

>>
>> Lack of cluster wide shutdown mode was mentioned more than once on this
>> list. I guess the only workaround is to use higher level tools which
>> basically simply try to stop cluster on all nodes at once. 

I tried to think of using ssh/pssh to the involved nodes to stop the
diskless SBD daemons. However, SBD cannot be torn down on its own. It is
deeply tied to pacemaker and corosync and has to be stopped all
together, unless one hacks the SBD dependencies otherwise.

>> It is still
>> susceptible to race condition.
> 
> Are there any concrete plans to implement a clean solution?
> 

I can think of Yet Another Feature to disable diskless SBD on purpose,
e.g. to let SBD understand "stonith-enabled=false" cluster-wide.


Cheers,
Roger
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] corosync.service (and sbd.service) are not stopper on pacemaker shutdown when corosync-qdevice is used

2019-08-09 Thread Roger Zhou


On 8/9/19 3:39 PM, Jan Friesse wrote:
> Roger Zhou wrote:
>>
>> On 8/9/19 2:27 PM, Roger Zhou wrote:
>>>
>>> On 7/29/19 12:24 AM, Andrei Borzenkov wrote:
>>>> corosync.service sets StopWhenUnneded=yes which normally stops it when
>>>> pacemaker is shut down.
>>
>> One more thought,
>>
>> Make sense to add "RefuseManualStop=true" to pacemaker.service?
>> The same for corosync-qdevice.service?
>>
>> And "RefuseManualStart=true" to corosync.service?
> 
> I would say short answer is no, but I would like to hear what is the 
> main idea for this proposal.

It's more about the out-of-the-box user experience: guiding users through
the most common use cases in the field to manage the whole cluster stack
with the appropriate steps, namely:

- To start the stack: systemctl start pacemaker corosync-qdevice
- To stop the stack: systemctl stop corosync.service

and avoiding error-prone assumptions:

With "RefuseManualStop=true" in pacemaker.service, sometimes (if not often),

- it prevents the wrong assumption/wish/impression that stopping pacemaker
   stops the whole cluster together with corosync

- it prevents users from forgetting the extra step to actually stop corosync

- it prevents some ISVs from creating disruptive scripts that only stop 
pacemaker and forget the rest.

- being rejected in the first place naturally guides users to run 
`systemctl stop corosync.service`


And extending the same idea a little further:

- "RefuseManualStop=true" to corosync-qdevice.service
- and "RefuseManualStart=true" to corosync.service

Well, I do feel the corosync* services are less error prone than pacemaker in this regard.
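
For anyone who wants to experiment with the idea locally today (just an
experiment, not an upstream default), a systemd drop-in would do:

    systemctl edit pacemaker.service
    # add:
    #   [Unit]
    #   RefuseManualStop=true
    systemctl daemon-reload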

Thanks,
Roger


> 
> Regards,
>    Honza
> 
>>
>> @Jan, @Ken
>>
>> What do you think?
>>
>> Cheers,
>> Roger
>>
>>
>>>
>>> `systemctl stop corosync.service` is the right command to stop those
>>> cluster stack.
>>>
>>> It stops pacemaker and corosync-qdevice first, and stop SBD too.
>>>
>>> pacemaker.service: After=corosync.service
>>> corosync-qdevice.service: After=corosync.service
>>> sbd.service: PartOf=corosync.service
>>>
>>> On the reverse side, to start the cluster stack, use
>>>
>>> systemctl start pacemaker.service corosync-qdevice
>>>
>>> It is slightly confusing from the impression. So, openSUSE uses the
>>> consistent commands as below:
>>>
>>> crm cluster start
>>> crm cluster stop
>>>
>>> Cheers,
>>> Roger
>>>
>>>> Unfortunately, corosync-qdevice.service declares
>>>> Requires=corosync.service and corosync-qdevice.service itself is *not*
>>>> stopped when pacemaker.service is stopped. Which means corosync.service
>>>> remains "needed" and is never stopped.
>>>>
>>>> Also sbd.service (which is PartOf=corosync.service) remains running 
>>>> as well.
>>>>
>>>> The latter is really bad, as it means sbd watchdog can kick in at any
>>>> time when user believes cluster stack is safely stopped. In particular
>>>> if qnetd is not accessible (think network reconfiguration).
>>>> ___
>>>> Manage your subscription:
>>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>>
>>>> ClusterLabs home: https://www.clusterlabs.org/
>>>>
>>> ___
>>> Manage your subscription:
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> ClusterLabs home: https://www.clusterlabs.org/
>>>
> 
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] corosync.service (and sbd.service) are not stopper on pacemaker shutdown when corosync-qdevice is used

2019-08-09 Thread Roger Zhou


On 8/9/19 2:27 PM, Roger Zhou wrote:
> 
> On 7/29/19 12:24 AM, Andrei Borzenkov wrote:
>> corosync.service sets StopWhenUnneded=yes which normally stops it when
>> pacemaker is shut down.

One more thought,

Does it make sense to add "RefuseManualStop=true" to pacemaker.service?
The same for corosync-qdevice.service?

And "RefuseManualStart=true" to corosync.service?

@Jan, @Ken

What do you think?

Cheers,
Roger


> 
> `systemctl stop corosync.service` is the right command to stop those
> cluster stack.
> 
> It stops pacemaker and corosync-qdevice first, and stop SBD too.
> 
> pacemaker.service: After=corosync.service
> corosync-qdevice.service: After=corosync.service
> sbd.service: PartOf=corosync.service
> 
> On the reverse side, to start the cluster stack, use
> 
> systemctl start pacemaker.service corosync-qdevice
> 
> It is slightly confusing from the impression. So, openSUSE uses the
> consistent commands as below:
> 
> crm cluster start
> crm cluster stop
> 
> Cheers,
> Roger
> 
>> Unfortunately, corosync-qdevice.service declares
>> Requires=corosync.service and corosync-qdevice.service itself is *not*
>> stopped when pacemaker.service is stopped. Which means corosync.service
>> remains "needed" and is never stopped.
>>
>> Also sbd.service (which is PartOf=corosync.service) remains running as well.
>>
>> The latter is really bad, as it means sbd watchdog can kick in at any
>> time when user believes cluster stack is safely stopped. In particular
>> if qnetd is not accessible (think network reconfiguration).
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] corosync.service (and sbd.service) are not stopper on pacemaker shutdown when corosync-qdevice is used

2019-08-09 Thread Roger Zhou


On 7/29/19 12:24 AM, Andrei Borzenkov wrote:
> corosync.service sets StopWhenUnneded=yes which normally stops it when
> pacemaker is shut down.

`systemctl stop corosync.service` is the right command to stop those 
cluster stack.

It stops pacemaker and corosync-qdevice first, and stops SBD too.

pacemaker.service: After=corosync.service
corosync-qdevice.service: After=corosync.service
sbd.service: PartOf=corosync.service

On the reverse side, to start the cluster stack, use

systemctl start pacemaker.service corosync-qdevice

It is slightly confusing at first impression. So, openSUSE uses the 
consistent commands as below:

crm cluster start
crm cluster stop

Cheers,
Roger

> Unfortunately, corosync-qdevice.service declares
> Requires=corosync.service and corosync-qdevice.service itself is *not*
> stopped when pacemaker.service is stopped. Which means corosync.service
> remains "needed" and is never stopped.
> 
> Also sbd.service (which is PartOf=corosync.service) remains running as well.
> 
> The latter is really bad, as it means sbd watchdog can kick in at any
> time when user believes cluster stack is safely stopped. In particular
> if qnetd is not accessible (think network reconfiguration).
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Feedback wanted: Node reaction to fabric fencing

2019-07-25 Thread Roger Zhou


On 7/25/19 1:33 AM, Ken Gaillot wrote:
> Hi all,
> 
> A recent bugfix (clbz#5386) brings up a question.
> 
> A node may receive notification of its own fencing when fencing is
> misconfigured (for example, an APC switch with the wrong plug number)
> or when fabric fencing is used that doesn't cut the cluster network
> (for example, fence_scsi).
> 
> Previously, the *intended* behavior was for the node to attempt to
> reboot itself in that situation, falling back to stopping pacemaker if
> that failed. However, due to the bug, the reboot always failed, so the
> behavior effectively was to stop pacemaker.
> 
> Now that the bug is fixed, the node will indeed reboot in that
> situation.
> 
> It occurred to me that some users configure fabric fencing specifically
> so that nodes aren't ever intentionally rebooted. Therefore, I intend
> to make this behavior configurable.
> 
> My question is, what do you think the default should be?
> 
> 1. Default to the correct behavior (reboot)
> 
> 2. Default to the current behavior (stop)
> 
> 3. Default to the current behavior for now, and change it to the
> correct behavior whenever pacemaker 2.1 is released (probably a few
> years from now)
> 

Sounds like 3) is the best choice.

Make it configurable, and keep the current behavior (stop) for backward 
compatibility within the current minor version, e.g. the next 2.0.z (z >= 3).

Well, the correct behavior (reboot) should eventually be enforced as the 
default. It is as crucial as a stop failure of a resource. That makes sense 
in the next minor version, say, 2.1.

Thanks,
Roger




___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] What's the best practice to scale-out/increase the cluster size? (was: "node is unclean" leads to gratuitous reboot)

2019-07-11 Thread Roger Zhou

On 7/11/19 2:15 AM, Michael Powell wrote:
> Thanks to you and Andrei for your responses.  In our particular situation, we 
> want to be able to operate with either node in stand-alone mode, or with both 
> nodes protected by HA.  I did not mention this, but I am working on upgrading 
> our product from a version which used Pacemaker version 1.0.13 and Heartbeat 
> to run under CentOS 7.6 (later 8.0).  The older version did not exhibit this 
> behavior, hence my concern.
> 
> I do understand the "wait_for_all" option better, and now that I know why the 
> "gratuitous" reboot is happening, I'm more comfortable with that behavior.  I 
> think the biggest operational risk would occur following a power-up of the 
> chassis.  If one node were significantly delayed during bootup, e.g. because 
> of networking issues, the other node would issue the STONITH and reboot the 
> delayed node.  That would be an annoyance, but it would be relatively 
> infrequent.  Our customers almost always keep at least one node (and usually 
> both nodes) operational 24/7.
> 

2 cents,

I think your requirement is very clear. Well, I view this as a tricky 
design challenge. There are two different situations that easily fool people:

a) the situation of being stand-alone (one node, really not a cluster)
b) the situation of a 2-node cluster where only one node is up at the moment

Without defining the concepts clearly and clarifying their difference, 
people can mix them up and set the wrong expectation on the wrong 
concept, really.

In your case, the configuration is a 2-node cluster. The log indicates the 
correct behavior for b), i.e. those STONITH actions are by design indeed. 
But people set the wrong expectation by treating it as a).

With that, could it be a cleaner design to let it be a stand-alone system 
first, then smoothly grow it to two nodes?

Furthermore, this triggers me to raise a question, mostly for corosync:

What's the best practice to scale-out/increase the cluster size?

One approach I can think of is to modify corosync.conf and reload it at 
runtime. Well, it doesn't look as smart as the reverse direction, namely 
the allow_downscale/auto_tie_breaker/last_man_standing options of the 
advanced corosync feature set; see `man votequorum`.
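
For illustration only, that runtime approach would look roughly like this,
assuming corosync 3.x where `corosync-cfgtool -R` reloads the configuration
cluster-wide (the name, address, and nodeid below are made up):

    # append to the nodelist {} in /etc/corosync/corosync.conf on every node
    node {
        ring0_addr: 192.168.1.13
        name: node3
        nodeid: 3
    }
    # then trigger the reload
    corosync-cfgtool -R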


Cheers,
Roger




> Regards,
>Michael
> 
> -Original Message-
> From: Ken Gaillot 
> Sent: Tuesday, July 09, 2019 12:42 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Cc: Michael Powell ; Venkata Reddy 
> Chappavarapu 
> Subject: [EXTERNAL] Re: [ClusterLabs] "node is unclean" leads to gratuitous 
> reboot
> 
> On Tue, 2019-07-09 at 12:54 +, Michael Powell wrote:
>> I have a two-node cluster with a problem.  If I start
> 
> Not so much a problem as a configuration choice :)
> 
> There are trade-offs in any case.
> 
> - wait_for_all in corosync.conf: If set, this will make each starting node 
> wait until it sees the other before gaining quorum for the first time. The 
> downside is that both nodes must be up for the cluster to start; the upside 
> is a clean starting point and no fencing.
> 
> - startup-fencing in pacemaker properties: If disabled, either node can start 
> without fencing the other. This is unsafe; if the other node is actually 
> active and running resources, but unreachable from the newly up node, the 
> newly up node may start the same resources, causing split- brain. (Easier 
> than you might think: consider taking a node down for hardware maintenance, 
> bringing it back up without a network, then plugging it back into the network 
> -- by that point it may have brought up resources and starts causing havoc.)
> 
> - Start corosync on both nodes, then start pacemaker. This avoids start-up 
> fencing since when pacemaker starts on either node, it already sees the other 
> node present, even if that node's pacemaker isn't up yet.
> 
> Personally I'd go for wait_for_all in normal operation. You can always 
> disable it if there are special circumstances where a node is expected to be 
> out of the cluster for a long time.
> 
>> Corosync/Pacemaker on one node, and then delay startup on the 2nd node
>> (which is otherwise up and running), the 2nd node will be rebooted
>> very soon after STONITH is enabled on the first node.  This reboot
>> seems to be gratuitous and could under some circumstances be
>> problematic.  While, at present,  I “manually” start
>> Corosync/Pacemaker by invoking a script from an ssh session, in a
>> production environment, this script would be started by a systemd
>> service.  It’s not hard to imagine that if both nodes were started at
>> approximately the same time (each node runs on a separate motherboard
>> in the same chassis), this behavior could cause one of the nodes to be
>> rebooted while it’s in the process of booting up.
>>   
>> The two nodes’ host names are mgraid-16201289RN00023-0 and mgraid-
>> 16201289RN00023-1.  Both hosts are running, but Pacemaker has been
>> started on neither. 

Re: [ClusterLabs] Where do we download the source code of libdlm

2019-05-27 Thread Roger Zhou



David set up a new home for it more than two years ago:
https://pagure.io/dlm
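
For example, to fetch the sources and browse the history:

    git clone https://pagure.io/dlm.git
    cd dlm && git log --oneline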

Cheers,
Roger


On 5/27/19 5:04 PM, Gang He wrote:

Hello Guys,

As the subject said, I want to download the source code of libdlm, to see its 
git log changes.
libdm is used to build dlm_controld, dlm_stonith, dlm_tool and etc.


Thanks
Gang


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Anyone have a document on how to configure VMWare fencing on Suse Linux

2018-12-16 Thread Roger Zhou



The following command will give you the detailed information:

crm ra info stonith:external/vcenter

Hope it is useful.
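
For illustration only, a rough primitive definition (crm shell syntax; the
server, credstore path, and host-to-VM mapping below are made up, so check
the RA info output for the authoritative parameter list):

    primitive vcenter-fence stonith:external/vcenter \
        params VI_SERVER=vcenter.example.com \
        VI_CREDSTORE=/root/.vmware/credstore/vicredentials.xml \
        HOSTLIST="node1=node1-vm;node2=node2-vm" RESETPOWERON=0 \
        op monitor interval=60m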

Cheers,
Roger


On 12/14/18 12:29 AM, Dileep V Nair wrote:

Hi,

I am using pacemaker for my clusters and shared sbd disk as the Stonith 
mechanism. Now I have an issue because I am using VMWare SRM for DR and 
that does not support shared disk. So I am thinking of configuring 
external/vcenter as the stonith mechanism. Is there any document which I 
can refer for configuring the same. Is there some specific settings / 
configurations to be done on the vcenter to do this. Any help is highly 
appreciated.


Thanks & Regards
*
Dileep Nair*


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: LVM resource and DAS - would two resources off one DAS...

2017-08-04 Thread roger zhou



On 07/27/2017 09:20 PM, Ulrich Windl wrote:

Hi!

I think it will work, because the cluster does not monitor the PVs or prtition 
or LUNs. It just checks whether you can activate the LVs (i.e.: the VG). That's 
what I know...

Regards,
Ulrich


lejeczek wrote on 27.07.2017 at 15:05 in message

<636398a2-e8ea-644b-046b-ff12358de...@yahoo.co.uk>:

hi fellas

I realise this might be quite specialized topic, as this
regards hardware DAS(sas2) and LVM and cluster itself but
I'm hoping with some luck an expert peeps over here and I'll
get some or all the answers then.

question:
Can cluster manage two(or more) LVM resources which would be
on/in same single DAS storage and have these resources(eg.
one LVM runs on 1&2 the other LVM runs on 3&4) run on
different nodes(which naturally all connect to that single DAS)?


Yes, it works in production environments for users. A rough sketch 
follows below.

Still, it could depend on your detailed scenario. You should further 
evaluate whether you need to protect the LVM VG metadata, e.g. if you 
resize your LVs on multiple nodes simultaneously. If so, involving cLVM 
or the upcoming lvmlockd is necessary.
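
A rough sketch of the idea (crm shell syntax, with ocf:heartbeat:LVM; the
VG and node names are made up):

    primitive p-vg01 ocf:heartbeat:LVM params volgrpname=vg01 \
        op monitor interval=60 timeout=60
    primitive p-vg02 ocf:heartbeat:LVM params volgrpname=vg02 \
        op monitor interval=60 timeout=60
    location l-vg01-on-node1 p-vg01 100: node1
    location l-vg02-on-node2 p-vg02 100: node2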


--Roger



Now, I guess this might be something many do already and
many will say: trivial. In which case a few firm "yes"
confirmations will mean - typical, just do it.
Or could it be something unusual and untested but
might/should work when done with care and special "preparation"?

I understand that lots depends on what/how harwdare+kernel
do things, but if possible(?) I'd leave it out for now and
ask only the cluster itself - do you do it?

many thanks.
L.

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org






___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org



___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] clustered MD - beyond RAID1

2015-12-25 Thread roger zhou


On 12/22/2015 10:33 AM, Tejas Rao wrote:

On 12/21/2015 20:50, Aaron Knister wrote:


[...]


I'm curious now, Redhat doesn't support SW raid failover? I did some
googling and found this:

https://access.redhat.com/solutions/231643

While I can't read the solution I have to figure that they're now
supporting that. I might actually explore that for this project.

https://access.redhat.com/solutions/410203
This article states that md raid is not supported in RHEL6/7 under any 
circumstances, including active/passive modes.


OCFS2 or GFS2 (same for GPFS, as the shared filesystem) over shared 
storage is a typical cluster configuration for Linux High Availability. 
There, Clustered LVM (cLVM) is supported by both SUSE and Red Hat to do 
mirroring to protect the data. However, the performance loss is very big 
and makes people not so happy about this clustered mirror solution. This 
is where the motivation for the clustered MD solution comes from.


With clustered MD, this new solution can provide nearly the same 
performance as native RAID1. You may be interested in validating this 
in your lab with your configuration ;)


Cheers,
Roger



___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] (no subject) --> JUNK email

2015-10-09 Thread roger zhou

TaMen说我挑食 <974120...@qq.com>,

You'd better compose your email subject with a word like JUNK or TEST, to 
avoid misleading people here.


Digimer,

You are really nice! I suspect this user just sent a junk email to 
confirm that the subscription is not in digest format ;)


Regards,
Roger


On 10/09/2015 09:38 AM, Digimer wrote:

On 08/10/15 09:03 PM, TaMen说我挑食 wrote:

Corosync+Pacemaker error during failover

You need to ask a question if you want us to be able to help you.



--
regards,
Zhou, ZhiQiang (Roger) (office#:+86 10 65339283, cellphone# +86 13391978086)


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org