----- On Aug 4, 2017, at 10:19 PM, Ken Gaillot [email protected] wrote:
> Unfortunately no -- logging, and troubleshooting in general, is an area
> we are continually striving to improve, but there are more to-do's than
> time to do them.

Sad but understandable. Is it worth trying to understand the logs, or
should I keep an eye on hb_report or crm history? I played around a bit
with hb_report, but it seems to just collect information that is already
available and does not simplify the view of it.

> The "ERROR" message is coming from the DRBD resource agent itself, not
> pacemaker. Between that message and the two separate monitor operations,
> it looks like the agent will only run as a master/slave clone.

Yes. I see it in the RA.

>> And why does it complain that stop is not configured ?
>
> A confusing error message. It's not complaining that the operations are
> not configured, it's saying the operations failed because the resource
> is not properly configured. What "properly configured" means is up to
> the individual resource agent.

Aah. And why does it not complain about a "failed" start op? Because I
have "target-role=stopped" in rsc_defaults? So initially it tries to stop
the resource rather than start it?

>> The DC says:
>> Aug 1 14:19:33 ha-idg-2 pengine[27043]: warning: unpack_rsc_op_failure:
>> Processing failed op stop for prim_drbd_idcc_devel on ha-idg-1: not
>> configured
>> (6)
>> Aug 1 14:19:33 ha-idg-2 pengine[27043]: error: unpack_rsc_op: Preventing
>> prim_drbd_idcc_devel from re-starting anywhere: operation stop failed 'not
>> configured' (6)
>>
>> Again complaining about a failed stop, saying it's not configured. Or does it
>> complain that the fail of a stop op is not configured ?
>
> Again, it's confusing, but you have various logs of the same event
> coming from three different places.
>
> First, DRBD logged that there is a "meta parameter misconfigured".
> It then reported that error value back to the crmd cluster daemon that
> called it, so the crmd logged the error as well, that the result of the
> operation was "not configured".
>
> Then (above), when the policy engine reads the current status of the
> cluster, it sees that there is a failed operation, so it decides what to
> do about the failure.

Ok.

>> The doc says:
>> "Some operations are generated by the cluster itself, for example, stopping
>> and
>> starting resources as needed."
>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html
>> . Is the doc wrong ?
>> What happens when i DON'T configure start/stop operations ? Are they
>> configured
>> automatically ?
>> I have several primitives without a configured start/stop operation, but
>> never
>> had any problems with them.
>
> Start and stop are indeed created by the cluster itself. If there are
> start and stop operations configured in the cluster configuration, those
> are used solely to get the meta-attributes such as timeout, to override
> the defaults.

Ok.

>> failcount is direct INFINITY:
>> Aug 1 14:19:33 ha-idg-1 attrd[4690]: notice: attrd_trigger_update: Sending
>> flush op to all hosts for: fail-count-prim_drbd_idcc_devel (INFINITY)
>> Aug 1 14:19:33 ha-idg-1 attrd[4690]: notice: attrd_perform_update: Sent
>> update 8: fail-count-prim_drbd_idcc_devel=INFINITY
>
> Yes, a few result codes are considered "fatal", or automatically
> INFINITY failures. The idea is that if the resource is misconfigured,
> that's not going to change by simply re-running the agent.

That makes sense.

>> After exact 9 minutes the complaints about the not configured stop operation
>> stopped, the complaints about missing clone-max still appears, although both
>> nodes are in standby
>
> I'm not sure why your nodes are in standby, but that should be unrelated
> to all of this, unless perhaps you configured on-fail=standby.
They are in standby because I put them into this state manually.

>> now fail-count is 1 million:
>> Aug 1 14:28:33 ha-idg-1 attrd[4690]: notice: attrd_trigger_update: Sending
>> flush op to all hosts for: fail-count-prim_drbd_idcc_devel (1000000)
>> Aug 1 14:28:33 ha-idg-1 attrd[4690]: notice: attrd_perform_update: Sent
>> update 7076: fail-count-prim_drbd_idcc_devel=1000000
>
> Within Pacemaker, INFINITY = 1000000. I'm not sure why it's logged
> differently here, but it's the same value.

Ok.

>> A big problem was that i have a ClusterMon resource running on each node. It
>> triggered about 20000 snmp traps in 193 seconds to my management station,
>> which
>> triggered 20000 e-Mails ...
>> From where comes this incredible amount of traps ? Nearly all traps said that
>> stop is not configured for the drdb resource. Why complaining so often ? And
>> why stopping after ~20.000 traps ?
>> And complaining about not configured monitor operation just 8 times.
>
> I'm not really sure; I haven't used ClusterMon enough to say. If you
> have Pacemaker 1.1.15 or later, the alerts feature is preferred to
> ClusterMon.

I have 1.12. Do you have experience with the SNMP monitoring from sys4,
https://github.com/sys4/pacemaker-snmp ?

Bernd
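
P.S. Since the fail-count shows up both as "INFINITY" and as "1000000" in
the attrd lines quoted in this thread, here is a rough sketch of how the
two spellings could be normalised when scanning the logs. It only relies
on what was said above (within Pacemaker, INFINITY = 1000000) and on the
shape of the two attrd messages quoted here. The function name
parse_fail_count and the regular expression are my own assumptions, not
part of Pacemaker, and other versions may format these lines differently.

```python
import re

# Per the thread: within Pacemaker, INFINITY = 1000000.
PACEMAKER_INFINITY = 1000000

# Matches both attrd message shapes quoted in this thread:
#   "... fail-count-prim_drbd_idcc_devel (INFINITY)"
#   "... fail-count-prim_drbd_idcc_devel=1000000"
# The pattern is an assumption based on those two lines only.
FAILCOUNT_RE = re.compile(
    r"fail-count-(?P<rsc>[\w.-]+)(?:=|\s\()(?P<value>INFINITY|\d+)"
)


def parse_fail_count(line):
    """Return (resource, numeric fail count), or None if the line
    contains no fail-count update. (Hypothetical helper.)"""
    match = FAILCOUNT_RE.search(line)
    if match is None:
        return None
    raw = match.group("value")
    value = PACEMAKER_INFINITY if raw == "INFINITY" else int(raw)
    return match.group("rsc"), value


if __name__ == "__main__":
    samples = [
        "Aug 1 14:19:33 ha-idg-1 attrd[4690]: notice: attrd_perform_update: "
        "Sent update 8: fail-count-prim_drbd_idcc_devel=INFINITY",
        "Aug 1 14:28:33 ha-idg-1 attrd[4690]: notice: attrd_trigger_update: "
        "Sending flush op to all hosts for: "
        "fail-count-prim_drbd_idcc_devel (1000000)",
    ]
    for sample in samples:
        print(parse_fail_count(sample))
```

With this, both log lines above map to the same numeric value, so a
filter can treat them as one and the same "fatal" fail-count.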
