Re: [ClusterLabs] interesting blog on Pacemaker-related outage

2017-12-07 Thread Andrei Borzenkov
07.12.2017 15:13, Adam Spiers wrote:
> https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/
> 
> 
> It's a great write-up, although a little frustrating that it is still
> not fully understood why a -inf colocation failed whereas a +inf
> succeeded.  (I actually have a vague memory of discovering something
> very similar a while back, but I can't find the details.)
> 

According to the only information we have (I can hardly call it
documentation) about how colocation constraints (are supposed to) work,
colocation on master/slave only affects the order in which nodes are chosen
for promotion, not the promotion decision itself. So yes, I'd love to see
the relevant piece of the Pacemaker configuration.
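
To make the distinction above concrete, here is a toy model in Python. It is
only a sketch of the statement as worded, not Pacemaker's scheduler: the
function names, node names and scores are all hypothetical, and the real
allocation logic is far more involved. It assumes colocation contributes a
per-node score adjustment that is used for ranking only.

```python
# Toy model only: this is NOT how Pacemaker is implemented.
# Assumption: a colocation constraint adds a per-node score adjustment
# that changes the ranking of promotion candidates, and nothing else.

NEG_INF = float("-inf")


def promotion_order(base_scores, colocation_adjust):
    """Return node names sorted from best to worst promotion candidate."""
    def total(node):
        return base_scores[node] + colocation_adjust.get(node, 0)

    # Highest total first; node name breaks ties deterministically.
    return sorted(base_scores, key=lambda node: (-total(node), node))


def choose_master(base_scores, colocation_adjust):
    """Promote the best-ranked node if any candidate exists at all.

    The colocation adjustment never vetoes promotion here; it only
    changes which node ends up first in the ordering.
    """
    order = promotion_order(base_scores, colocation_adjust)
    return order[0] if order else None
```

In this model a -inf adjustment pushes a node to the back of the queue, but
if that node is the only candidate left it is still promoted. Whether that
matches what happened in the incident cannot be said without the actual
configuration.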


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] interesting blog on Pacemaker-related outage

2017-12-07 Thread Ken Gaillot
On Thu, 2017-12-07 at 12:13 +, Adam Spiers wrote:
> https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/
> 
> It's a great write-up, although a little frustrating that it is still
> not fully understood why a -inf colocation failed whereas a +inf
> succeeded.  (I actually have a vague memory of discovering something
> very similar a while back, but I can't find the details.)

That is an excellent post. I'll contact them directly to discuss it
further.

> IMHO this serves as a good example of the difficulty Pacemaker faces,
> and consequently as valuable feedback for how Pacemaker needs to
> improve: it's all too easy to do one tiny misconfiguration which can
> potentially bring the whole house of cards tumbling down, and it's
> often really hard to understand what went wrong.
> 
> So FWIW, my personal view is that more than anything else right now,
> Pacemaker needs to be made easier to understand.  I know this is a 

Agreed, but there are about a dozen things that are more important than
anything else right now ;)

Personally, my current focus is technical debt: stripping out all the
legacy features that were deprecated in 1.1.18, so we can release 2.0.0
with a smaller code base that is easier to maintain going forward. The
hope is that this pays off in greater time savings down the road, but
it sucks up a lot of time in the near term.

There are a large number of outstanding bug reports that bother me,
several of them quite serious, and I would like to spend more time on
those before new features, but ...

There is constant demand for new features from paying customers, and we
can't stay relevant without trying to keep up at least to an extent.
Several recent projects (bundles, alerts, versioned attributes) could
really benefit from some follow-up work, and more major projects are
right on the horizon (failure handling configuration overhaul, crm_mon
overhaul, containerization of pacemaker/corosync, corosync 3/knet
compatibility).

And of course usability is, indeed, an incredibly important area to be
addressed, spanning log messages, documentation, and tooling.

Which is to say, volunteers welcome :-)

> big ask since HA is unavoidably complex, but I'm sure there are
> actionable items which would serve as relatively manageable yet very
> worthwhile
> steps towards this goal.  I alluded to this during my presentation at
> the Clusterlabs Summit, e.g. see
> 
> https://aspiers.github.io/clusterlabs-summit-2017-openstack-ha/#/debugging
> 
> and the following slide.  And in fact I remember some really good
> discussions on this during the summit too, but I'm not sure if they
> led anywhere.
> 
> Hope this feedback is useful!
-- 
Ken Gaillot 



[ClusterLabs] interesting blog on Pacemaker-related outage

2017-12-07 Thread Adam Spiers

https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/

It's a great write-up, although a little frustrating that it is still
not fully understood why a -inf colocation failed whereas a +inf
succeeded.  (I actually have a vague memory of discovering something
very similar a while back, but I can't find the details.)

IMHO this serves as a good example of the difficulty Pacemaker faces,
and consequently as valuable feedback for how Pacemaker needs to
improve: it's all too easy to do one tiny misconfiguration which can
potentially bring the whole house of cards tumbling down, and it's
often really hard to understand what went wrong.

So FWIW, my personal view is that more than anything else right now,
Pacemaker needs to be made easier to understand.  I know this is a big
ask since HA is unavoidably complex, but I'm sure there are actionable
items which would serve as relatively manageable yet very worthwhile
steps towards this goal.  I alluded to this during my presentation at
the Clusterlabs Summit, e.g. see

   https://aspiers.github.io/clusterlabs-summit-2017-openstack-ha/#/debugging

and the following slide.  And in fact I remember some really good
discussions on this during the summit too, but I'm not sure if they
led anywhere.

Hope this feedback is useful!
