Re: [openstack-dev] [Fuel] Waiting for Haproxy backends

2014-11-20 Thread Jay Pipes

Hi Sergey! Comments inline.

On 11/20/2014 05:25 AM, Sergey Vasilenko wrote:

Nor should it, IMO. Other than the Neutron dhcp-agent, all OpenStack
services that run on a "controller node" are completely stateless.
Therefore, I don't see any reason to use corosync/pacemaker for
management of these resources.

I see the following reasons for managing Neutron agents with Pacemaker:


Completely agree with you here for the Neutron agents, since they are 
not the same as the other OpenStack controller services. The Neutron 
agents keep state, and therefore are appropriate for management by 
Pacemaker, IMO.



  * *co-location* between resources. For example, the L3 and DHCP agents
    should run only on nodes that have a properly working Open vSwitch (or
    other L2) agent. If the L2 agent misbehaves, the L3 and DHCP agents
    should be stopped immediately, because Neutron does not detect this
    situation and may still allocate resources (routers or subnets) to such
    an agent.
  * extended *monitoring*. Traditional init/upstart subsystems allow only a
    simple status check (systemd may be an exception). Today we see
    situations, for example with the Neutron agents, where an agent pretends
    to be working but in reality does nothing (unfortunately, that is how
    OpenStack is written). Such an agent should be restarted immediately.
    Our Neutron team is now working on an internal health-checking feature
    for the agents, and I hope it will be implemented in 6.1. For example,
    we can run a simple check (PID found, process started) every 10 seconds
    and a deeper one (RabbitMQ connection, internal health check) less
    frequently.
  * no OS-specific business logic. We can use one OCF script for Ubuntu,
    CentOS, Debian, etc.
  * handling of cluster partitioning situations.

haproxy should just spread the HTTP request load evenly across all
API services and things should be fine, allowing haproxy's http
healthcheck monitoring to handle the simple service status checks.

An HTTP check alone is not enough. In the future it would be better to have
a deeper, service-specific check for each OpenStack service.


For endpoints that need more than an HTTP check, absolutely, you are 
correct. But, in real life, I haven't really seen much need for more 
than an HTTP check for the controller services *except for the Neutron 
agents*.


All the best,
-jay

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel] Waiting for Haproxy backends

2014-11-20 Thread Sergey Vasilenko
>
> Nor should it, IMO. Other than the Neutron dhcp-agent, all OpenStack
> services that run on a "controller node" are completely stateless.
> Therefore, I don't see any reason to use corosync/pacemaker for management
> of these resources.


I see the following reasons for managing Neutron agents with Pacemaker:

   - *co-location* between resources. For example, the L3 and DHCP agents
   should run only on nodes that have a properly working Open vSwitch (or
   other L2) agent. If the L2 agent misbehaves, the L3 and DHCP agents
   should be stopped immediately, because Neutron does not detect this
   situation and may still allocate resources (routers or subnets) to such
   an agent (see the sketch after this list).
   - extended *monitoring*. Traditional init/upstart subsystems allow only
   a simple status check (systemd may be an exception). Today we see
   situations, for example with the Neutron agents, where an agent pretends
   to be working but in reality does nothing (unfortunately, that is how
   OpenStack is written). Such an agent should be restarted immediately.
   Our Neutron team is now working on an internal health-checking feature
   for the agents, and I hope it will be implemented in 6.1. For example,
   we can run a simple check (PID found, process started) every 10 seconds
   and a deeper one (RabbitMQ connection, internal health check) less
   frequently.
   - no OS-specific business logic. We can use one OCF script for Ubuntu,
   CentOS, Debian, etc.
   - handling of cluster partitioning situations.
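
To illustrate the first two points, here is a rough sketch with pcs (resource
and agent names are only placeholders, not what Fuel actually configures, and
option handling may differ between pcs versions):

   # L2 agent clone: cheap check every 10s, deeper OCF_CHECK_LEVEL check less often
   pcs resource create p_neutron-ovs-agent ocf:example:neutron-ovs-agent \
       op monitor interval=10s timeout=30s \
       op monitor interval=120s timeout=60s OCF_CHECK_LEVEL=10 \
       --clone
   pcs resource create p_neutron-l3-agent ocf:example:neutron-l3-agent \
       op monitor interval=10s timeout=30s
   # run the L3 agent only where the L2 agent clone is healthy, and start it after
   pcs constraint colocation add p_neutron-l3-agent with p_neutron-ovs-agent-clone INFINITY
   pcs constraint order p_neutron-ovs-agent-clone then p_neutron-l3-agent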



> haproxy should just spread the HTTP request load evenly across all API
> services and things should be fine, allowing haproxy's http healthcheck
> monitoring to handle the simple service status checks.
>

An HTTP check alone is not enough. In the future it would be better to have a
deeper, service-specific check for each OpenStack service.
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel] Waiting for Haproxy backends

2014-11-19 Thread Andrew Beekhof

> On 20 Nov 2014, at 6:55 am, Sergii Golovatiuk  
> wrote:
> 
> Hi crew,
> 
> Please see my inline comments.
> 
> Hi Everyone,
> 
> I was reading the blueprints mentioned here and thought I'd take the 
> opportunity to introduce myself and ask a few questions.
> For those that don't recognise my name, Pacemaker is my baby - so I take a 
> keen interest helping people have a good experience with it :)
> 
> A couple of items stood out to me (apologies if I repeat anything that is 
> already well understood):
> 
> * Operations with the CIB utilize almost 100% of the CPU on the Controller
> 
>  We introduced a new CIB algorithm in 1.1.12 which is O(2) faster/less 
> resource hungry than prior versions.
>  I would be interested to hear your experiences with it if you are able to 
> upgrade to that version.
>  
> Our team is aware of that. That's a really nice improvement, thank you very 
> much. We've already prepared all the packages, but we are in feature freeze; 
> Pacemaker 1.1.12 will be added in the next release.
>  
> * Corosync shutdown process takes a lot of time
> 
>  Corosync (and Pacemaker) can shut down incredibly quickly.
>  If corosync is taking a long time, it will be because it is waiting for 
> pacemaker, and pacemaker is almost always waiting for one of the 
> clustered services to shut down.
> 
> As part of this improvement we have an idea to split the signalling layer 
> (corosync) and the resource management layer (pacemaker) by specifying
> service { 
>name: pacemaker
>ver:  1
> }
> 
> and to create an upstart script to set the start ordering. That will allow us to:
> 
> 1. Create some notifications in puppet for pacemaker
> 2. Restart and manage corosync and pacemaker independently
> 3. Use respawn in upstart to restart corosync or pacemaker
> 
> 
> * Current Fuel Architecture is limited to Corosync 1.x and Pacemaker 1.x
> 
>  Corosync 2 is really the way to go.
>  Is there something in particular that is holding you back?
>  Also, out of interest, are you using cman or the pacemaker plugin?
> 
> We use almost standard corosync 1.x and pacemaker from CentOS 6.5

Please be aware that the plugin is not long for this world on CentOS.
It was already removed once (in the 6.4 betas), is not even slightly tested at 
RH, and about the only ones using it upstream are SUSE.

http://blog.clusterlabs.org/blog/2013/pacemaker-on-rhel6-dot-4/ has some 
relevant details.
The short version is that I would really encourage a transition to CMAN (which 
is really just corosync 1.x plus a more mature and better tested plugin from 
the corosync people).
See http://clusterlabs.org/quickstart-redhat.html; it's really quite painless.

> and Ubuntu 12.04. However, we've prepared corosync 2.x and pacemaker 1.1.12 
> packages. Also we have updated the puppet manifests, which are on review. As 
> was said above, we can't just add them at the end of the development cycle.

Yep, makes sense.

>  
> 
> *  Diff operations against the Corosync CIB require saving data to a file
>   rather than keeping all data in memory
> 
>  Can someone clarify this one for me?
>  
> That's our implementation for puppet. We can't just use a shadow CIB in a 
> distributed environment, so we run 
> 
>  Also, I notice that the corosync init script has been modified to set/unset 
> maintenance-mode with cibadmin.
>  Any reason not to use crm_attribute instead?  You might find it's a less 
> fragile solution than a hard-coded diff.
>  
> Can you give a particular line where you see that?  

I saw it in one of the bugs:
   https://bugs.launchpad.net/fuel/+bug/1340172

Maybe it is no longer accurate

> 
> * Debug process of OCF scripts is not unified and requires a lot of actions
>  from the Cloud Operator
> 
>  Two things to mention here... the first is crm_resource 
> --force-(start|stop|check) which queries the cluster for the resource's 
> definition but runs the command directly. 
>  Combined with -V, this means that you get to see everything the agent is 
> doing.
> 
> We write many of our own OCF scripts. We just need to see how an OCF script 
> behaves; ocf_tester is not enough for our cases.

Agreed. ocf_tester is more for out-of-cluster regression testing, not really 
good for debugging a running cluster.

> I'll try crm_resource -V --force-start and see if it works better.
>  
> 
>  Also, pacemaker now supports the ability for agents to emit specially 
> formatted error messages that are stored in the cib and can be shown back to 
> users.
>  This can make things much less painful for admins. Look for 
> PCMK_OCF_REASON_PREFIX in the upstream resource-agents project.
> 
> Thank you for the tip. 
> 
> 
> * Openstack services are not managed by Pacemaker
> 
> The general idea is to have all OpenStack services under Pacemaker control 
> rather than split between upstart and Pacemaker. It will be very handy for 
> operators to see the status of all services from one console. It will also 
> give us the flexibility to have more complex service verification checks in 
> the monitor function.
>  
> 
>  Oh?
> 
> * Compute nodes aren't in Pacemaker cluster, hen

Re: [openstack-dev] [Fuel] Waiting for Haproxy backends

2014-11-19 Thread Jay Pipes

On 11/18/2014 07:25 PM, Andrew Woodward wrote:

On Tue, Nov 18, 2014 at 3:18 PM, Andrew Beekhof  wrote:

* Openstack services are not managed by Pacemaker

  Oh?


fuel doesn't (currently) set up API services in pacemaker


Nor should it, IMO. Other than the Neutron dhcp-agent, all OpenStack 
services that run on a "controller node" are completely stateless. 
Therefore, I don't see any reason to use corosync/pacemaker for 
management of these resources. haproxy should just spread the HTTP 
request load evenly across all API services and things should be fine, 
allowing haproxy's http healthcheck monitoring to handle the simple 
service status checks.
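
For illustration only (names, addresses and ports below are made up, not what 
Fuel generates), a backend with haproxy's simple HTTP check looks something 
like this:

  listen keystone-api
    bind 192.168.0.10:5000              # the VIP
    balance roundrobin
    option httpchk GET /                # 2xx/3xx responses count as healthy
    server node-1 192.168.0.11:5000 check inter 2000 rise 2 fall 3
    server node-2 192.168.0.12:5000 check inter 2000 rise 2 fall 3
    server node-3 192.168.0.13:5000 check inter 2000 rise 2 fall 3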


Best,
-jay

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel] Waiting for Haproxy backends

2014-11-19 Thread Sergii Golovatiuk
Hi crew,

Please see my inline comments.

Hi Everyone,
>
> I was reading the blueprints mentioned here and thought I'd take the
> opportunity to introduce myself and ask a few questions.
> For those that don't recognise my name, Pacemaker is my baby - so I take a
> keen interest helping people have a good experience with it :)
>
> A couple of items stood out to me (apologies if I repeat anything that is
> already well understood):
>
> * Operations with the CIB utilize almost 100% of the CPU on the Controller
>
>  We introduced a new CIB algorithm in 1.1.12 which is O(2) faster/less
> resource hungry than prior versions.
>  I would be interested to hear your experiences with it if you are able to
> upgrade to that version.
>

Our team is aware of that. That's a really nice improvement, thank you very
much. We've already prepared all the packages, but we are in feature freeze;
Pacemaker 1.1.12 will be added in the next release.


> * Corosync shutdown process takes a lot of time
>
>  Corosync (and Pacemaker) can shut down incredibly quickly.
>  If corosync is taking a long time, it will be because it is waiting for
> pacemaker, and pacemaker is almost always waiting for one of the
> clustered services to shut down.
>

As part of this improvement we have an idea to split the signalling layer
(corosync) and the resource management layer (pacemaker) by specifying

service {
   name: pacemaker
   ver:  1
}

and to create an upstart script to set the start ordering. That will allow us to:

1. Create some notifications in puppet for pacemaker
2. Restart and manage corosync and pacemaker independently
3. Use respawn in upstart to restart corosync or pacemaker (see the sketch below)
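
To make the ordering idea concrete, a minimal upstart job could look roughly
like this (only a sketch, not the final implementation):

   # /etc/init/pacemaker.conf (sketch)
   description "Pacemaker cluster resource manager"
   # start pacemaker only once corosync is up, stop it before corosync goes down
   start on started corosync
   stop on stopping corosync
   respawn
   # whether 'expect fork' or 'expect daemon' is needed depends on how pacemakerd
   # is launched; check the packaged job before relying on this
   exec /usr/sbin/pacemakerd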


> * Current Fuel Architecture is limited to Corosync 1.x and Pacemaker 1.x
>
>  Corosync 2 is really the way to go.
>  Is there something in particular that is holding you back?
>  Also, out of interest, are you using cman or the pacemaker plugin?
>

We use almost standard corosync 1.x and pacemaker from CentOS 6.5 and
Ubuntu 12.04. However, we've prepared corosync 2.x and pacemaker 1.1.12
packages. Also we have updated the puppet manifests, which are on review. As
was said above, we can't just add them at the end of the development cycle.


>
> *  Diff operations against the Corosync CIB require saving data to a file
>   rather than keeping all data in memory
>
>  Can someone clarify this one for me?
>

That's our implementation for puppet. We can't just use a shadow CIB in a
distributed environment, so we run

>
>  Also, I notice that the corosync init script has been modified to
> set/unset maintenance-mode with cibadmin.
>  Any reason not to use crm_attribute instead?  You might find it's a less
> fragile solution than a hard-coded diff.
>

Can you give a particular line where you see that?

* Debug process of OCF scripts is not unified and requires a lot of actions
>  from the Cloud Operator
>
>  Two things to mention here... the first is crm_resource
> --force-(start|stop|check) which queries the cluster for the resource's
> definition but runs the command directly.

 Combined with -V, this means that you get to see everything the agent is
> doing.
>

We write many of our own OCF scripts. We just need to see how an OCF script
behaves; ocf_tester is not enough for our cases. I'll try crm_resource -V
--force-start and see if it works better.


>
>  Also, pacemaker now supports the ability for agents to emit specially
> formatted error messages that are stored in the cib and can be shown back
> to users.
>  This can make things much less painful for admins. Look for
> PCMK_OCF_REASON_PREFIX in the upstream resource-agents project.
>

Thank you for the tip.

>
>
> * Openstack services are not managed by Pacemaker
>

The general idea is to have all OpenStack services under Pacemaker control
rather than split between upstart and Pacemaker. It will be very handy for
operators to see the status of all services from one console. It will also
give us the flexibility to have more complex service verification checks in
the monitor function.


>
>  Oh?
>
> * Compute nodes aren't in the Pacemaker cluster, hence, are lacking a viable
>  control plane for their compute/nova services.
>
>  pacemaker-remoted might be of some interest here.
>
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Remote/index.html
>
>
> * Creating and committing shadows not only adds constant pain with
> dependencies and unneeded complexity but also rewrites cluster attributes
> and even other changes if you mess up with ordering and it’s really hard to
> debug it.
>
>  Is this still an issue?  I'm reasonably sure this is specific to the way
> crmsh uses shadows.
>  Using the native tools it should be possible to commit only the delta, so
> any other changes that occur while you're updating the shadow would not be
> an issue, and existing attributes wouldn't be rewritten.
>

We are on the way to replacing pcs and crm with the native tools in the
puppet service provider.


>
> * Restarting resources by Puppet’s pacemaker service provider restarts
> them even if they are running on other nodes and it sometimes imp

Re: [openstack-dev] [Fuel] Waiting for Haproxy backends

2014-11-19 Thread Vladimir Kuklin
Hi everyone

Actually, we changed a lot in 5.1 HA and there are some changes in 6.0 as
well. Right now we are using an asymmetric cluster and location constraints
to control resources. We started using XML diffs as the most reliable and
supported approach, since it does not depend on the pcs/crmsh implementation.
Regarding corosync 2.x, we are looking forward to moving to it, but it did
not fit our 6.0 release timeframe. We will surely move to the pacemaker
plugin and corosync 2.x in the 6.1 release, as that should fix a lot of
our problems.
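
For reference, the diff-based flow is roughly the following (file names are
only an example):

   cibadmin --query > cib-before.xml             # snapshot the live CIB
   cp cib-before.xml cib-after.xml
   # ... edit cib-after.xml: resources, constraints, attributes ...
   crm_diff --original cib-before.xml --new cib-after.xml > delta.xml
   cibadmin --patch --xml-file delta.xml         # push only the delta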

On Wed, Nov 19, 2014 at 3:47 AM, Andrew Woodward  wrote:

> On Wed, Nov 12, 2014 at 4:10 AM, Aleksandr Didenko
>  wrote:
> > HI,
> >
> > in order to make sure some critical Haproxy backends are running (like
> mysql
> > or keystone) before proceeding with deployment, we use execs like [1] or
> > [2].
>
> We used to do the API waiting in the puppet resource providers
> consuming them [4] which tends to be very effective (unless it never
> comes up) as it doesn't care what is in-between the resource and the
> API it's attempting to use. This way works for everything except mysql
> because other services depend on it.
>
> >
> > We're currently working on minor improvements of those execs, but
> there is
>
> Really, we should not use these execs; they are bad, and we need to be
> doing proper response validation like in [4] instead of just using the
> simple (and often wrong) haproxy health check.
>
> > another approach - we can replace those execs with puppet resource
> providers
> > and move all the iterations/loops/timeouts logic there. Also we should
> fail
>
> Yes, this will become the most reliable method. I'm partially still on
> the fence about which provider we are modifying. In the service provider,
> we could identify the check method (i.e. HTTP 200 from a specific URL)
> and the start check, and the provider would block until the check
> passes or the timeout is reached. (I'm still on the fence about whether to
> do this for haproxy or for each of the openstack API services. I'm leaning
> towards each API, since this will allow the check to work regardless of
> haproxy, and should let it also work with refresh.)
>
> > catalog compilation/run if those resource providers are not able to
> ensure
> > needed Haproxy backends are up and running. Because there is no point to
> > proceed with deployment if keystone is not running, for example.
> >
> > If no one objects, I can start implementing this for Fuel-6.1. We can
> > address it as a part of pacemaker improvements BP [3] or create a new BP.
>
> unless we are fixing the problem with pacemaker it should have its own
> spec, possibly w/o a blueprint
>
> >
> > [1]
> >
> https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/osnailyfacter/manifests/cluster_ha.pp#L551-L572
> > [2]
> >
> https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/openstack/manifests/ha/mysqld.pp#L28-L33
> > [3] https://blueprints.launchpad.net/fuel/+spec/pacemaker-improvements
> >
> > Regards,
> > Aleksandr Didenko
> >
> >
> > ___
> > OpenStack-dev mailing list
> > OpenStack-dev@lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >
>
> [4]
> https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/neutron/lib/puppet/provider/neutron.rb#L83-116
>
> --
> Andrew
> Mirantis
> Ceph community
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



-- 
Yours Faithfully,
Vladimir Kuklin,
Fuel Library Tech Lead,
Mirantis, Inc.
+7 (495) 640-49-04
+7 (926) 702-39-68
Skype kuklinvv
45bk3, Vorontsovskaya Str.
Moscow, Russia,
www.mirantis.com 
www.mirantis.ru
vkuk...@mirantis.com
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel] Waiting for Haproxy backends

2014-11-18 Thread Andrew Woodward
On Wed, Nov 12, 2014 at 4:10 AM, Aleksandr Didenko
 wrote:
> HI,
>
> in order to make sure some critical Haproxy backends are running (like mysql
> or keystone) before proceeding with deployment, we use execs like [1] or
> [2].

We used to do the API waiting in the puppet resource providers
consuming them [4] which tends to be very effective (unless it never
comes up) as it doesn't care what is in-between the resource and the
API it's attempting to use. This way works for everything except mysql
because other services depend on it.

>
> We're currently working on minor improvements of those execs, but there is

Really, we should not use these execs; they are bad, and we need to be
doing proper response validation like in [4] instead of just using the
simple (and often wrong) haproxy health check.

> another approach - we can replace those execs with puppet resource providers
> and move all the iterations/loops/timeouts logic there. Also we should fail

Yes, this will become the most reliable method. I'm partially still on
the fence about which provider we are modifying. In the service provider,
we could identify the check method (i.e. HTTP 200 from a specific URL)
and the start check, and the provider would block until the check
passes or the timeout is reached. (I'm still on the fence about whether to
do this for haproxy or for each of the openstack API services. I'm leaning
towards each API, since this will allow the check to work regardless of
haproxy, and should let it also work with refresh.)
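
To make that concrete, the check itself would be something along these lines
(a sketch only; endpoint, port and timings are made up):

  # wait up to ~5 minutes for the API to answer with HTTP 200
  for i in $(seq 1 60); do
    code=$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:5000/v2.0/)
    [ "$code" = "200" ] && exit 0
    sleep 5
  done
  echo "keystone API did not come up in time" >&2
  exit 1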

> catalog compilation/run if those resource providers are not able to ensure
> needed Haproxy backends are up and running. Because there is no point to
> proceed with deployment if keystone is not running, for example.
>
> If no one objects, I can start implementing this for Fuel-6.1. We can
> address it as a part of pacemaker improvements BP [3] or create a new BP.

unless we are fixing the problem with pacemaker it should have its own
spec, possibly w/o a blueprint

>
> [1]
> https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/osnailyfacter/manifests/cluster_ha.pp#L551-L572
> [2]
> https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/openstack/manifests/ha/mysqld.pp#L28-L33
> [3] https://blueprints.launchpad.net/fuel/+spec/pacemaker-improvements
>
> Regards,
> Aleksandr Didenko
>
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>

[4] 
https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/neutron/lib/puppet/provider/neutron.rb#L83-116

-- 
Andrew
Mirantis
Ceph community

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel] Waiting for Haproxy backends

2014-11-18 Thread Andrew Woodward
Some comments inline

On Tue, Nov 18, 2014 at 3:18 PM, Andrew Beekhof  wrote:
> Hi Everyone,
>
> I was reading the blueprints mentioned here and thought I'd take the 
> opportunity to introduce myself and ask a few questions.
> For those that don't recognise my name, Pacemaker is my baby - so I take a 
> keen interest helping people have a good experience with it :)
>
> A couple of items stood out to me (apologies if I repeat anything that is 
> already well understood):
>
> * Operations with the CIB utilize almost 100% of the CPU on the Controller
>
>  We introduced a new CIB algorithm in 1.1.12 which is O(2) faster/less 
> resource hungry than prior versions.
>  I would be interested to hear your experiences with it if you are able to 
> upgrade to that version.

Pacemaker on CentOS 6.5 is 1.1.10-14.el6_5.3
https://review.fuel-infra.org/#/admin/projects/packages/centos6/pacemaker
Corosync on CentOS 6.5 is 1.4.6-26.2
https://review.fuel-infra.org/#/admin/projects/packages/centos6/corosync

>
> * Corosync shutdown process takes a lot of time
>
>  Corosync (and Pacemaker) can shut down incredibly quickly.
>  If corosync is taking a long time, it will be because it is waiting for 
> pacemaker, and pacemaker is almost always waiting for one of the 
> clustered services to shut down.
>
> * Current Fuel Architecture is limited to Corosync 1.x and Pacemaker 1.x
>
>  Corosync 2 is really the way to go.
>  Is there something in particular that is holding you back?

We try to keep close to the distro version when possible / reasonable.

>  Also, out of interest, are you using cman or the pacemaker plugin?
>
> *  Diff operations against the Corosync CIB require saving data to a file
>   rather than keeping all data in memory
>
>  Can someone clarify this one for me?
>
>  Also, I notice that the corosync init script has been modified to set/unset 
> maintenance-mode with cibadmin.
>  Any reason not to use crm_attribute instead?  You might find it's a less 
> fragile solution than a hard-coded diff.
>
> * Debug process of OCF scripts is not unified and requires a lot of actions
>  from the Cloud Operator
>
>  Two things to mention here... the first is crm_resource 
> --force-(start|stop|check) which queries the cluster for the resource's 
> definition but runs the command directly.
>  Combined with -V, this means that you get to see everything the agent is 
> doing.
>
>  Also, pacemaker now supports the ability for agents to emit specially 
> formatted error messages that are stored in the cib and can be shown back to 
> users.
>  This can make things much less painful for admins. Look for 
> PCMK_OCF_REASON_PREFIX in the upstream resource-agents project.
>
>
> * Openstack services are not managed by Pacemaker
>
>  Oh?

fuel doesn't (currently) set up API services in pacemaker

>
> * Compute nodes aren't in the Pacemaker cluster, hence, are lacking a viable
>  control plane for their compute/nova services.
>
>  pacemaker-remoted might be of some interest here.
>  
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Remote/index.html
>
>
> * Creating and committing shadows not only adds constant pain with 
> dependencies and unneeded complexity but also rewrites cluster attributes and 
> even other changes if you mess up with ordering and it’s really hard to debug 
> it.
>
>  Is this still an issue?  I'm reasonably sure this is specific to the way 
> crmsh uses shadows.
>  Using the native tools it should be possible to commit only the delta, so 
> any other changes that occur while you're updating the shadow would not be an 
> issue, and existing attributes wouldn't be rewritten.
>
> * Restarting resources by Puppet’s pacemaker service provider restarts them 
> even if they are running on other nodes and it sometimes impacts the cluster.
>
>  Not available yet, but upstream there is now a smart --restart option for 
> crm_resource which can optionally take a --host parameter.
>  Sounds like it would be useful here.
>  
> http://blog.clusterlabs.org/blog/2014/feature-spotlight-smart-resource-restart-from-the-command-line/
>
> * An attempt to stop or restart corosync service brings down a lot of 
> resources and probably will fail and bring down the entire deployment.
>
>  That sounds deeply worrying.  Details?
>
> * Controllers other than the first download the configured CIB and immediately 
> start all cloned resources before they are configured, so they have to be 
> cleaned up later.
>
>  By this you mean clones are being started on nodes which do not have the 
> software? Or before the ordering/colocation constraints have been configured?

this is an issue because we deploy one controller and
corosync/pacemaker is set up in one stage, and the software and
services are set up in a later stage. When we deploy the remaining
controllers, the cluster is joined in the same early stage, which
causes services to attempt to start when no software is installed
yet. This was worked around with a banning method so that
none of t

Re: [openstack-dev] [Fuel] Waiting for Haproxy backends

2014-11-18 Thread Andrew Beekhof
Hi Everyone,

I was reading the blueprints mentioned here and thought I'd take the 
opportunity to introduce myself and ask a few questions.
For those that don't recognise my name, Pacemaker is my baby - so I take a keen 
interest helping people have a good experience with it :)

A couple of items stood out to me (apologies if I repeat anything that is 
already well understood):

* Operations with the CIB utilize almost 100% of the CPU on the Controller

 We introduced a new CIB algorithm in 1.1.12 which is O(2) faster/less resource 
hungry than prior versions.
 I would be interested to hear your experiences with it if you are able to 
upgrade to that version.

* Corosync shutdown process takes a lot of time

 Corosync (and Pacemaker) can shut down incredibly quickly. 
 If corosync is taking a long time, it will be because it is waiting for 
pacemaker, and pacemaker is almost always waiting for one of the clustered 
services to shut down.

* Current Fuel Architecture is limited to Corosync 1.x and Pacemaker 1.x

 Corosync 2 is really the way to go.
 Is there something in particular that is holding you back?
 Also, out of interest, are you using cman or the pacemaker plugin?

*  Diff operations against the Corosync CIB require saving data to a file
  rather than keeping all data in memory

 Can someone clarify this one for me?

 Also, I notice that the corosync init script has been modified to set/unset 
maintenance-mode with cibadmin.
 Any reason not to use crm_attribute instead?  You might find it's a less 
fragile solution than a hard-coded diff.
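 e.g. something like:

   crm_attribute --type crm_config --name maintenance-mode --update true
   # ... and to clear it again:
   crm_attribute --type crm_config --name maintenance-mode --delete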

* Debug process of OCF scripts is not unified and requires a lot of actions from
 the Cloud Operator

 Two things to mention here... the first is crm_resource 
--force-(start|stop|check) which queries the cluster for the resource's 
definition but runs the command directly.
 Combined with -V, this means that you get to see everything the agent is doing.
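 For example (the resource name is just a placeholder):

   crm_resource --resource p_neutron-l3-agent --force-check -V
   crm_resource --resource p_neutron-l3-agent --force-start -V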

 Also, pacemaker now supports the ability for agents to emit specially 
formatted error messages that are stored in the cib and can be shown back to 
users.
 This can make things much less painful for admins. Look for 
PCMK_OCF_REASON_PREFIX in the upstream resource-agents project.
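 In an agent that sources the shipped shell functions this boils down to 
something like the following sketch (the daemon name is just an example, and 
ocf_exit_reason needs a reasonably recent resource-agents):

   . ${OCF_FUNCTIONS_DIR:-${OCF_ROOT}/lib/heartbeat}/ocf-shellfuncs

   agent_monitor() {
       if ! pidof neutron-l3-agent >/dev/null 2>&1; then
           # the prefixed message is stored in the CIB and shown in crm_mon / pcs status
           ocf_exit_reason "neutron-l3-agent process is not running"
           return $OCF_NOT_RUNNING
       fi
       return $OCF_SUCCESS
   }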


* Openstack services are not managed by Pacemaker

 Oh?

* Compute nodes aren't in the Pacemaker cluster, hence, are lacking a viable
 control plane for their compute/nova services.

 pacemaker-remoted might be of some interest here.  
 
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Remote/index.html
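 The short version: run pacemaker_remoted on each compute node, share 
/etc/pacemaker/authkey with it, and define a remote-node resource for it, e.g. 
(the hostname is a placeholder):

   pcs resource create compute-1 ocf:pacemaker:remote server=compute-1.example.org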


* Creating and committing shadows not only adds constant pain with dependencies 
and unneeded complexity but also rewrites cluster attributes and even other 
changes if you mess up with ordering and it’s really hard to debug it.

 Is this still an issue?  I'm reasonably sure this is specific to the way crmsh 
uses shadows.  
 Using the native tools it should be possible to commit only the delta, so any 
other changes that occur while you're updating the shadow would not be an 
issue, and existing attributes wouldn't be rewritten.

* Restarting resources by Puppet’s pacemaker service provider restarts them 
even if they are running on other nodes and it sometimes impacts the cluster.

 Not available yet, but upstream there is now a smart --restart option for 
crm_resource which can optionally take a --host parameter.
 Sounds like it would be useful here.  
 
http://blog.clusterlabs.org/blog/2014/feature-spotlight-smart-resource-restart-from-the-command-line/
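 i.e. eventually something like (a sketch, using the options described in that 
post):

   crm_resource --resource p_rabbitmq-server --restart --host node-2.example.org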

* An attempt to stop or restart corosync service brings down a lot of resources 
and probably will fail and bring down the entire deployment.

 That sounds deeply worrying.  Details?

* Controllers other than the first download the configured CIB and immediately 
start all cloned resources before they are configured, so they have to be 
cleaned up later.

 By this you mean clones are being started on nodes which do not have the 
software? Or before the ordering/colocation constraints have been configured?


> On 15 Nov 2014, at 10:31 am, Sergii Golovatiuk  
> wrote:
> 
> +1 for ha-pacemaker-improvements
> 
> --
> Best regards,
> Sergii Golovatiuk,
> Skype #golserge
> IRC #holser
> 
> On Fri, Nov 14, 2014 at 11:51 PM, Dmitry Borodaenko 
>  wrote:
> Good plan, but I really hate the name of this blueprint. I think we
> should stop lumping different unrelated HA improvements into a single
> blueprint with a generic name like that, especially when we already
> had a blueprint with essentially the same name
> (ha-pacemaker-improvements). There's nothing wrong with having 4
> trivial but specific blueprints instead of one catch-all.
> 
> On Wed, Nov 12, 2014 at 4:10 AM, Aleksandr Didenko
>  wrote:
> > HI,
> >
> > in order to make sure some critical Haproxy backends are running (like mysql
> > or keystone) before proceeding with deployment, we use execs like [1] or
> > [2].
> >
> > We're currently working on minor improvements of those execs, but there is
> > another approach - we can replace tho

Re: [openstack-dev] [Fuel] Waiting for Haproxy backends

2014-11-14 Thread Sergii Golovatiuk
+1 for ha-pacemaker-improvements

--
Best regards,
Sergii Golovatiuk,
Skype #golserge
IRC #holser

On Fri, Nov 14, 2014 at 11:51 PM, Dmitry Borodaenko <
dborodae...@mirantis.com> wrote:

> Good plan, but I really hate the name of this blueprint. I think we
> should stop lumping different unrelated HA improvements into a single
> blueprint with a generic name like that, especially when we already
> had a blueprint with essentially the same name
> (ha-pacemaker-improvements). There's nothing wrong with having 4
> trivial but specific blueprints instead of one catch-all.
>
> On Wed, Nov 12, 2014 at 4:10 AM, Aleksandr Didenko
>  wrote:
> > HI,
> >
> > in order to make sure some critical Haproxy backends are running (like
> mysql
> > or keystone) before proceeding with deployment, we use execs like [1] or
> > [2].
> >
> > We're currently working on minor improvements of those execs, but
> there is
> > another approach - we can replace those execs with puppet resource
> providers
> > and move all the iterations/loops/timeouts logic there. Also we should
> fail
> > catalog compilation/run if those resource providers are not able to
> ensure
> > needed Haproxy backends are up and running. Because there is no point to
> > proceed with deployment if keystone is not running, for example.
> >
> > If no one objects, I can start implementing this for Fuel-6.1. We can
> > address it as a part of pacemaker improvements BP [3] or create a new BP.
> >
> > [1]
> >
> https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/osnailyfacter/manifests/cluster_ha.pp#L551-L572
> > [2]
> >
> https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/openstack/manifests/ha/mysqld.pp#L28-L33
> > [3] https://blueprints.launchpad.net/fuel/+spec/pacemaker-improvements
> >
> > Regards,
> > Aleksandr Didenko
> >
> >
> > ___
> > OpenStack-dev mailing list
> > OpenStack-dev@lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >
>
>
>
> --
> Dmitry Borodaenko
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Fuel] Waiting for Haproxy backends

2014-11-14 Thread Dmitry Borodaenko
Good plan, but I really hate the name of this blueprint. I think we
should stop lumping different unrelated HA improvements into a single
blueprint with a generic name like that, especially when we already
had a blueprint with essentially the same name
(ha-pacemaker-improvements). There's nothing wrong with having 4
trivial but specific blueprints instead of one catch-all.

On Wed, Nov 12, 2014 at 4:10 AM, Aleksandr Didenko
 wrote:
> HI,
>
> in order to make sure some critical Haproxy backends are running (like mysql
> or keystone) before proceeding with deployment, we use execs like [1] or
> [2].
>
> We're currently working on minor improvements of those execs, but there is
> another approach - we can replace those execs with puppet resource providers
> and move all the iterations/loops/timeouts logic there. Also we should fail
> catalog compilation/run if those resource providers are not able to ensure
> needed Haproxy backends are up and running. Because there is no point to
> proceed with deployment if keystone is not running, for example.
>
> If no one objects, I can start implementing this for Fuel-6.1. We can
> address it as a part of pacemaker improvements BP [3] or create a new BP.
>
> [1]
> https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/osnailyfacter/manifests/cluster_ha.pp#L551-L572
> [2]
> https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/openstack/manifests/ha/mysqld.pp#L28-L33
> [3] https://blueprints.launchpad.net/fuel/+spec/pacemaker-improvements
>
> Regards,
> Aleksandr Didenko
>
>
> ___
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



-- 
Dmitry Borodaenko

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev