Re: [ClusterLabs] two node cluster: vm starting - shutting down 15min later - starting again 15min later ... and so on

2017-02-10 Thread Ken Gaillot
On 02/10/2017 06:49 AM, Lentes, Bernd wrote:
> 
> 
> - On Feb 10, 2017, at 1:10 AM, Ken Gaillot kgail...@redhat.com wrote:
> 
>> On 02/09/2017 10:48 AM, Lentes, Bernd wrote:
>>> Hi,
>>>
>>> i have a two node cluster with a vm as a resource. Currently i'm just 
>>> testing
>>> and playing. My vm boots and shuts down again in 15min gaps.
>>> Surely this is related to "PEngine Recheck Timer (I_PE_CALC) just popped
>>> (90ms)" found in the logs. I googled, and it is said that this
>>> is due to time-based rule
>>> (http://oss.clusterlabs.org/pipermail/pacemaker/2009-May/001647.html). OK.
>>> But i don't have any time-based rules.
>>> This is the config for my vm:
>>>
>>> primitive prim_vm_mausdb VirtualDomain \
>>> params config="/var/lib/libvirt/images/xml/mausdb_vm.xml" \
>>> params hypervisor="qemu:///system" \
>>> params migration_transport=ssh \
>>> op start interval=0 timeout=90 \
>>> op stop interval=0 timeout=95 \
>>> op monitor interval=30 timeout=30 \
>>> op migrate_from interval=0 timeout=100 \
>>> op migrate_to interval=0 timeout=120 \
>>> meta allow-migrate=true \
>>> meta target-role=Started \
>>> utilization cpu=2 hv_memory=4099
>>>
>>> The only constraint concerning the vm i had was a location (which i didn't
>>> create).
>>
>> What is the constraint? If its ID starts with "cli-", it was created by
>> a command-line tool (such as crm_resource, crm shell or pcs, generally
>> for a "move" or "ban" command).
>>
> I deleted the one i mentioned, but now i have two again. I didn't create them.
> Does the crm create constraints itself?
> 
> location cli-ban-prim_vm_mausdb-on-ha-idg-2 prim_vm_mausdb role=Started -inf: ha-idg-2
> location cli-prefer-prim_vm_mausdb prim_vm_mausdb role=Started inf: ha-idg-2

The command-line tool you use creates them.

If you're using crm_resource, they're created by crm_resource
--move/--ban. If you're using pcs, they're created by pcs resource
move/ban. Etc.
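
For illustration, this is roughly how those two constraints come about
(crm_resource syntax from a reasonably current Pacemaker; details may differ
slightly with the SLES 11 SP4 packages):

  # "move" adds a cli-prefer-* location constraint preferring the target node:
  crm_resource --resource prim_vm_mausdb --move --node ha-idg-2

  # "ban" adds a cli-ban-*-on-<node> constraint with score -INFINITY:
  crm_resource --resource prim_vm_mausdb --ban --node ha-idg-2

A move to ha-idg-2 followed later by a ban on ha-idg-2 would leave exactly the
two constraints you pasted.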

> One location constraint inf, one -inf for the same resource on the same node.
> Isn't that senseless?

Yes, but that's what you told it to do :-)

The command-line tools move or ban resources by setting constraints to
achieve that effect. Those constraints are permanent until you remove them.

How to clear them again depends on which tool you use ... crm_resource
--clear, pcs resource clear, etc.
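
For example (assuming the resource name above; exact syntax may vary a bit
between versions):

  # with crm_resource:
  crm_resource --resource prim_vm_mausdb --clear

  # or with pcs:
  pcs resource clear prim_vm_mausdb

Either removes the cli-prefer/cli-ban constraints so the cluster is free to
place the resource again.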

> 
> "crm resorce scores" show -inf for that resource on that node:
> native_color: prim_vm_mausdb allocation score on ha-idg-1: 100
> native_color: prim_vm_mausdb allocation score on ha-idg-2: -INFINITY
> 
> Is -inf stronger?
> Is it true that only the "native_color" values are relevant?
> 
> A general question: when i have trouble starting/stopping/migrating resources,
> is it sensible to do a "crm resource cleanup" before trying again?
> (Besides finding the reason for the trouble.)

It's best to figure out what the problem is first, make sure that's
taken care of, then clean up. The cluster might or might not do anything
when you clean up, depending on what stickiness you have, your failure
handling settings, etc.
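
If it helps, a typical sequence with the crm shell you seem to be using might
look like this (crm_resource --cleanup is the lower-level equivalent):

  # look at the failures first
  crm_mon -1rf

  # once the underlying problem is fixed, clear the failed actions
  crm resource cleanup prim_vm_mausdb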

> Sorry for asking basic stuff. I read a lot beforehand, but in practice it's
> totally different.
> Although i just have a vm as a resource, and i'm only testing, i'm sometimes
> astonished by the complexity of a simple two node cluster: scores, failcounts,
> constraints, default values for a lot of variables ...
> you have to keep an eye on a lot of stuff.
> 
> Bernd
>  
> 
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
> Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons 
> Enhsen
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] two node cluster: vm starting - shutting down 15min later - starting again 15min later ... and so on

2017-02-10 Thread Lentes, Bernd


- On Feb 10, 2017, at 1:10 AM, Ken Gaillot kgail...@redhat.com wrote:

> On 02/09/2017 10:48 AM, Lentes, Bernd wrote:
>> Hi,
>> 
>> i have a two node cluster with a vm as a resource. Currently i'm just testing
>> and playing. My vm boots and shuts down again in 15min gaps.
>> Surely this is related to "PEngine Recheck Timer (I_PE_CALC) just popped
>> (90ms)" found in the logs. I googled, and it is said that this
>> is due to time-based rule
>> (http://oss.clusterlabs.org/pipermail/pacemaker/2009-May/001647.html). OK.
>> But i don't have any time-based rules.
>> This is the config for my vm:
>> 
>> primitive prim_vm_mausdb VirtualDomain \
>> params config="/var/lib/libvirt/images/xml/mausdb_vm.xml" \
>> params hypervisor="qemu:///system" \
>> params migration_transport=ssh \
>> op start interval=0 timeout=90 \
>> op stop interval=0 timeout=95 \
>> op monitor interval=30 timeout=30 \
>> op migrate_from interval=0 timeout=100 \
>> op migrate_to interval=0 timeout=120 \
>> meta allow-migrate=true \
>> meta target-role=Started \
>> utilization cpu=2 hv_memory=4099
>> 
>> The only constraint concerning the vm i had was a location (which i didn't
>> create).
> 
> What is the constraint? If its ID starts with "cli-", it was created by
> a command-line tool (such as crm_resource, crm shell or pcs, generally
> for a "move" or "ban" command).
> 
I deleted the one i mentioned, but now i have two again. I didn't create them.
Does the crm create constraints itself?

location cli-ban-prim_vm_mausdb-on-ha-idg-2 prim_vm_mausdb role=Started -inf: ha-idg-2
location cli-prefer-prim_vm_mausdb prim_vm_mausdb role=Started inf: ha-idg-2

One location constraint inf, one -inf for the same resource on the same node.
Isn't that senseless?

"crm resorce scores" show -inf for that resource on that node:
native_color: prim_vm_mausdb allocation score on ha-idg-1: 100
native_color: prim_vm_mausdb allocation score on ha-idg-2: -INFINITY

Is -inf stronger?
Is it true that only the "native_color" values are relevant?

A general question: when i have trouble starting/stopping/migrating resources,
is it sensible to do a "crm resource cleanup" before trying again?
(Besides finding the reason for the trouble.)

Sorry for asking basic stuff. I read a lot beforehand, but in practice it's
totally different.
Although i just have a vm as a resource, and i'm only testing, i'm sometimes
astonished by the complexity of a simple two node cluster: scores, failcounts,
constraints, default values for a lot of variables ...
you have to keep an eye on a lot of stuff.

Bernd
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons Enhsen
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] two node cluster: vm starting - shutting down 15min later - starting again 15min later ... and so on

2017-02-09 Thread Ken Gaillot
On 02/09/2017 10:48 AM, Lentes, Bernd wrote:
> Hi,
> 
> i have a two node cluster with a vm as a resource. Currently i'm just testing 
> and playing. My vm boots and shuts down again in 15min gaps.
> Surely this is related to "PEngine Recheck Timer (I_PE_CALC) just popped 
> (90ms)" found in the logs. I googled, and it is said that this
> is due to time-based rule 
> (http://oss.clusterlabs.org/pipermail/pacemaker/2009-May/001647.html). OK.
> But i don't have any time-based rules.
> This is the config for my vm:
> 
> primitive prim_vm_mausdb VirtualDomain \
> params config="/var/lib/libvirt/images/xml/mausdb_vm.xml" \
> params hypervisor="qemu:///system" \
> params migration_transport=ssh \
> op start interval=0 timeout=90 \
> op stop interval=0 timeout=95 \
> op monitor interval=30 timeout=30 \
> op migrate_from interval=0 timeout=100 \
> op migrate_to interval=0 timeout=120 \
> meta allow-migrate=true \
> meta target-role=Started \
> utilization cpu=2 hv_memory=4099
> 
> The only constraint concerning the vm i had was a location (which i didn't 
> create).

What is the constraint? If its ID starts with "cli-", it was created by
a command-line tool (such as crm_resource, crm shell or pcs, generally
for a "move" or "ban" command).

> Ok, this timer is available, i can set it to zero to disable it.

The timer is used for multiple purposes; I wouldn't recommend disabling
it. Also, this doesn't fix the problem; the problem will still occur
whenever the cluster recalculates, just not on a regular time schedule.
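
For reference, the timer in question is the cluster-recheck-interval cluster
property, which defaults to 15 minutes. Roughly, with the tools on your system
(untested here):

  # query it; if it is unset, the 15-minute default applies
  crm_attribute --type crm_config --query --name cluster-recheck-interval

  # it could be changed like this, but as said above, setting it to 0 is not recommended
  crm configure property cluster-recheck-interval=15min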

> But why does it influence my vm in such a manner?
> 
> Excerpt from the log:
> 
> ...
> Feb  9 16:19:38 ha-idg-1 VirtualDomain(prim_vm_mausdb)[13148]: INFO: Domain 
> mausdb_vm already stopped.
> Feb  9 16:19:38 ha-idg-1 crmd[8407]:   notice: process_lrm_event: Operation 
> prim_vm_mausdb_stop_0: ok (node=ha-idg-1, call=401, rc=0, cib-update=340, 
> confirmed=true)
> Feb  9 16:19:38 ha-idg-1 kernel: [852506.947196] device vnet0 entered 
> promiscuous mode
> Feb  9 16:19:38 ha-idg-1 kernel: [852507.008770] br0: port 2(vnet0) entering 
> forwarding state
> Feb  9 16:19:38 ha-idg-1 kernel: [852507.008775] br0: port 2(vnet0) entering 
> forwarding state
> Feb  9 16:19:38 ha-idg-1 kernel: [852507.172120] qemu-kvm: sending ioctl 5326 
> to a partition!
> Feb  9 16:19:38 ha-idg-1 kernel: [852507.172133] qemu-kvm: sending ioctl 
> 80200204 to a partition!
> Feb  9 16:19:41 ha-idg-1 crmd[8407]:   notice: process_lrm_event: Operation 
> prim_vm_mausdb_start_0: ok (node=ha-idg-1, call=402, rc=0, cib-update=341, 
> confirmed=true)
> Feb  9 16:19:41 ha-idg-1 crmd[8407]:   notice: process_lrm_event: Operation 
> prim_vm_mausdb_monitor_30000: ok (node=ha-idg-1, call=403, rc=0, 
> cib-update=342, confirmed=false)
> Feb  9 16:19:48 ha-idg-1 kernel: [852517.049015] vnet0: no IPv6 routers 
> present
> ...
> Feb  9 16:34:41 ha-idg-1 VirtualDomain(prim_vm_mausdb)[18272]: INFO: Issuing 
> graceful shutdown request for domain mausdb_vm.
> Feb  9 16:35:06 ha-idg-1 kernel: [853434.550089] br0: port 2(vnet0) entering 
> forwarding state
> Feb  9 16:35:06 ha-idg-1 kernel: [853434.550160] device vnet0 left 
> promiscuous mode
> Feb  9 16:35:06 ha-idg-1 kernel: [853434.550165] br0: port 2(vnet0) entering 
> disabled state
> Feb  9 16:35:06 ha-idg-1 ifdown: vnet0
> Feb  9 16:35:06 ha-idg-1 ifdown: Interface not available and no configuration 
> found.
> Feb  9 16:35:07 ha-idg-1 crmd[8407]:   notice: process_lrm_event: Operation 
> prim_vm_mausdb_stop_0: ok (node=ha-idg-1, call=405, rc=0, cib-update=343, 
> confirmed=true)
> ...
> 
> I deleted the location, and since then the vm has already been running fine for 35min.

The logs don't go far back enough to have an idea why the VM was
stopped. Also, logs from the other node might be relevant, if it was the
DC (controller) at the time.
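
A quick way to see which node is currently the DC is something like:

  crm_mon -1 | grep "Current DC"

The pengine/crmd messages that explain why a stop was scheduled end up in that
node's log.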

> System is SLES 11 SP4 64bit, vm is SLES 10 SP4 64bit.
> 
> Thanks.
> 
> Bernd

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org