On Fri, 2019-07-05 at 13:07 +0200, Lentes, Bernd wrote:
> ----- On Jul 4, 2019, at 1:25 AM, kgaillot kgail...@redhat.com wrote:
> 
> > On Wed, 2019-06-19 at 18:46 +0200, Lentes, Bernd wrote:
> > > ----- On Jun 15, 2019, at 4:30 PM, Bernd Lentes
> > > bernd.len...@helmholtz-muenchen.de wrote:
> > > 
> > > > ----- Am 14. Jun 2019 um 21:20 schrieb kgaillot
> > > > kgail...@redhat.com:
> > > > 
> > > > > On Fri, 2019-06-14 at 18:27 +0200, Lentes, Bernd wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > I had that problem once already, but it's still not clear to
> > > > > > me what really happens. I hit it again some days ago:
> > > > > > I have a two-node cluster with several virtual domains as
> > > > > > resources. I put one node (ha-idg-2) into standby, and the
> > > > > > two virtual domains running there were migrated to the other
> > > > > > node (ha-idg-1). The other virtual domains were already
> > > > > > running on ha-idg-1.
> > > > > > Since then, the two domains that migrated (vm_idcc_devel and
> > > > > > vm_severin) are started or stopped every 15 minutes on
> > > > > > ha-idg-1, while ha-idg-2 stays in standby.
> > > > > > I know the 15-minute interval is related to the
> > > > > > "cluster-recheck-interval", but why are these two domains
> > > > > > started and stopped? I dug through the logs, checked the
> > > > > > pe-input files, and looked at graphs created by crm_simulate
> > > > > > with dotty. I always see that the domains are started, 15
> > > > > > minutes later stopped, 15 minutes later started again ...
> > > > > > but I don't see WHY, and I would really like to know.
> > > > > > And why aren't the domains restarted by the monitor
> > > > > > operation? It should recognize that a domain is stopped and
> > > > > > start it again; my monitor interval is 30 seconds.
> > > > > > I had two errors pending concerning these domains: a failed
> > > > > > migrate from ha-idg-1 to ha-idg-2, from some time before.
> > > > > > Could that be the culprit?
> 
> > It did indeed turn out to be.
> > 
> > The resource history on ha-idg-1 shows the last failed action as a
> > migrate_to from ha-idg-1 to ha-idg-2, and the last successful action
> > as a migrate_from from ha-idg-2 to ha-idg-1. That confused Pacemaker
> > as to the current status of the migration.
> > 
> > A full migration is a migrate_to on the source node, a migrate_from
> > on the target node, and a stop on the source node. When the resource
> > history has a failed migrate_to on the source, and a stop but no
> > migrate_from on the target, the migration is considered "dangling"
> > and forces a stop of the resource on the source, because it's
> > possible the migrate_from never got a chance to be scheduled.
> > 
> > That is wrong in this situation. The resource is happily running on
> > the node with the failed migrate_to, because it was later moved back
> > successfully, and the failed migrate_to is no longer relevant.
> > 
> > My current plan for a fix: if the node with a failed migrate_to has
> > a newer successful migrate_from or start, and the target node of the
> > failed migrate_to has a successful stop, then the migration should
> > not be considered dangling.
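As an illustrative sketch only (the pe-input file name below is a
placeholder; the resource and node names are the ones from this
thread), the saved scheduler inputs can be replayed to see why the
stop/start keeps being scheduled, and the stale failed migrate_to can
be cleared once the underlying problem is resolved:

    # Replay a saved scheduler input and dump the transition graph for
    # dotty (pe-input files normally live under
    # /var/lib/pacemaker/pengine/):
    crm_simulate --simulate --xml-file pe-input-123.bz2 --save-dotfile transition.dot
    dotty transition.dot

    # After the cause is understood and fixed, clear the stale failed
    # migrate_to from the resource history so it no longer looks like
    # a dangling migration:
    crm_resource --cleanup --resource vm_idcc_devel --node ha-idg-1
    crm_resource --cleanup --resource vm_severin --node ha-idg-1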
> > A couple of side notes on your configuration:
> > 
> > Instead of putting action=off in fence device configurations, you
> > should use pcmk_reboot_action=off. Pacemaker adds action itself when
> > sending the fence command.
> 
> I did that already.
> 
> > When keeping a fence device off its target node, use a finite
> > negative score rather than -INFINITY. This ensures the node can
> > fence itself as a last resort.
> 
> I will do that.
> 
> Thanks for clarifying this; it happened very often.
> I conclude that it's very important to clean up a resource failure
> quickly after finding the cause and solving the problem, and not to
> leave any errors pending.
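For the two configuration notes above, a rough crmsh sketch; the
stonith resource name, the fence agent, and the -5000 score are
placeholders, not taken from this thread:

    # Hypothetical fence device: pass pcmk_reboot_action=off instead
    # of action=off, since Pacemaker adds "action" itself when it
    # sends the fence command.
    crm configure primitive fence-ha-idg-1 stonith:fence_ipmilan \
        params pcmk_host_list="ha-idg-1" pcmk_reboot_action="off"

    # Keep the device away from its own target with a finite negative
    # score rather than -INFINITY, so ha-idg-1 can still fence itself
    # as a last resort:
    crm configure location loc-fence-ha-idg-1 fence-ha-idg-1 -5000: ha-idg-1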
This is the first bug I can recall that was triggered by an old
failure, so I don't think it's important as a general policy outside
of live migrations. I've got a fix I'll merge soon.

> Bernd
> 
> 
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling
> Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep,
> Heinrich Bassler, Kerstin Guenther
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671

-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/