Re: [openstack-dev] [fuel] What to do when a controller runs out of space

2015-10-06 Thread Vladimir Kuklin
Eugene

With all due respect to you and other OpenStack developers, as a system
administrator I do not simply take someone's word that something works that
way. What I would actually prefer to do is stress-test these services for
their 'statelessness'. Currently our l3-agent is not so stateless and lacks
centralized synchronization done in a proper way, a point which you have not
actually refuted. So I agree - let's move this into a different thread and
not hijack this one.

On Tue, Oct 6, 2015 at 5:11 PM, Eugene Nikanorov 
wrote:

>
> On Tue, Oct 6, 2015 at 4:22 PM, Vladimir Kuklin 
> wrote:
>
>> Eugene
>>
>> For example, each time that you need to have one instance (e.g. master
>> instance) of something non-stateless running in the cluster.
>>
>
> Right. This is theoretical. Practically, there are no such services among
> openstack.
>
> You are right that currently lots of things are fixed already - heat
>> engine is fine, for example. But I still see this issue with l3 agents and
>> I will not change my mind until we conduct complete scale and destructive
>> testing with new neutron code.
>>
>> Secondly, if we cannot reliably identify when to engage - then we need to
>> write the code that will tell us when to engage. If this code is already in
>> place and we can trigger a couple of commands to figure out Neutron agent
>> state, then we can add them to the OCF script's monitor action and that is
>> all. I agree that we have some issues with our OCF scripts, for example
>> some suboptimal cleanup code that has issues at large scale, but I am
>> almost sure we can fix it.
>>
>> Finally, let me show an example of when you need a centralized cluster
>> manager to manage such situations - you have a temporary issue with
>> connectivity to the neutron server over the management network for some
>> reason. Your agents are not cleaned up and the neutron server starts new
>> l3 agent instances on a different node. In this case you will have IP
>> duplication in the network, and it will bring down the whole cluster
>> because connectivity through the 'public' network will still be working
>> just fine. When we are using Pacemaker, such a node will either be fenced
>> or will stop all the services controlled by pacemaker, as it is part of a
>> non-quorate partition of the cluster. When this happens, the l3 agent OCF
>> script will run its cleanup section and purge all the stale IPs, thus
>> saving us from the trouble. I obviously may be mistaken, so please correct
>> me if this is not the case.
>>
> I think this deserves discussion in a separate thread, which I'll start
> soon.
> My initial point was (to state it clearly) that I will be -2 on any new
> additions of openstack services to the pacemaker kingdom.
>
> Thanks,
> Eugene.
>
>>
>>
>> On Tue, Oct 6, 2015 at 3:46 PM, Eugene Nikanorov > > wrote:
>>
>>>
>>>
 2) I think you misunderstand the difference between
 upstart/systemd and Pacemaker in this case. There are many cases when you
 need to have a synchronized view of the cluster. Otherwise you will hit
 split-brain situations and have your cluster malfunctioning. Until
 OpenStack provides us with such means there is no other way than using
 Pacemaker/Zookeeper/etc.

>>>
>>> Could you please give some examples of those 'many cases' for openstack
>>> specifically?
>>> As for my 'misunderstanding' - openstack services only need to be always
>>> up, not more than that.
>>> Upstart does a perfect job there.
>>>
>>>
 3) Regarding Neutron agents - we discussed it many times - you need to
 be able to control and clean up stuff after some service has crashed.
 Currently, Neutron does not provide reliable ways to do it. If your agent
 dies and does not clean up IP addresses from the network namespace, you will
 get into a situation of ARP duplication, which is the kind of split
 brain described in item #2. I personally, as a system architect and
 administrator, do not expect this to change for at least several years
 in OpenStack, so we will be using Pacemaker for a very long period of time.

>>>
>>> This has been changed already, and a while ago.
>>> OCF infrastructure around neutron agents has never helped neutron in any
>>> meaningful way and is just an artifact from the dark past.
>>> The reason is that pacemaker/ocf doesn't have enough intelligence to know
>>> when to engage; as a result, any cleanup could only be achieved through
>>> manual operations. I don't need to remind you how many bugs were in OCF
>>> scripts which brought whole clusters down after those manual operations.
>>> So it's just way better to go with simple standard tools with
>>> fine-grained control.
>>> The same applies to any other openstack service (again, not rabbitmq/galera).
>>>
>>> > so we will be using Pacemaker for a very long period of time.
>>> Not for neutron, sorry. As soon as we finish the last bit of such
>>> cleanup, which is targeted for 8.0.
>>>
>>> 

Re: [openstack-dev] [fuel] What to do when a controller runs out of space

2015-10-06 Thread Eugene Nikanorov
On Tue, Oct 6, 2015 at 4:22 PM, Vladimir Kuklin 
wrote:

> Eugene
>
> For example, each time that you need to have one instance (e.g. master
> instance) of something non-stateless running in the cluster.
>

Right. This is theoretical. Practically, there are no such services among
openstack.

You are right that currently lots of things are fixed already - heat engine
> is fine, for example. But I still see this issue with l3 agents and I will
> not change my mind until we conduct complete scale and destructive testing
> with new neutron code.
>
> Secondly, if we cannot reliably identify when to engage - then we need to
> write the code that will tell us when to engage. If this code is already in
> place and we can trigger a couple of commands to figure out Neutron agent
> state, then we can add them to the OCF script's monitor action and that is
> all. I agree that we have some issues with our OCF scripts, for example
> some suboptimal cleanup code that has issues at large scale, but I am
> almost sure we can fix it.
>
> Finally, let me show an example of when you need a centralized cluster
> manager to manage such situations - you have a temporary issue with
> connectivity to the neutron server over the management network for some
> reason. Your agents are not cleaned up and the neutron server starts new l3
> agent instances on a different node. In this case you will have IP
> duplication in the network, and it will bring down the whole cluster
> because connectivity through the 'public' network will still be working
> just fine. When we are using Pacemaker, such a node will either be fenced
> or will stop all the services controlled by pacemaker, as it is part of a
> non-quorate partition of the cluster. When this happens, the l3 agent OCF
> script will run its cleanup section and purge all the stale IPs, thus
> saving us from the trouble. I obviously may be mistaken, so please correct
> me if this is not the case.
>
I think this deserves discussion in a separate thread, which I'll start
soon.
My initial point was (to state it clearly) that I will be -2 on any new
additions of openstack services to the pacemaker kingdom.

Thanks,
Eugene.

>
>
> On Tue, Oct 6, 2015 at 3:46 PM, Eugene Nikanorov 
> wrote:
>
>>
>>
>>> 2) I think you misunderstand the difference between
>>> upstart/systemd and Pacemaker in this case. There are many cases when you
>>> need to have a synchronized view of the cluster. Otherwise you will hit
>>> split-brain situations and have your cluster malfunctioning. Until
>>> OpenStack provides us with such means there is no other way than using
>>> Pacemaker/Zookeeper/etc.
>>>
>>
>> Could you please give some examples of those 'many cases' for openstack
>> specifically?
>> As for my 'misunderstanding' - openstack services only need to be always
>> up, not more than that.
>> Upstart does a perfect job there.
>>
>>
>>> 3) Regarding Neutron agents - we discussed it many times - you need to
>>> be able to control and clean up stuff after some service has crashed.
>>> Currently, Neutron does not provide reliable ways to do it. If your agent
>>> dies and does not clean up IP addresses from the network namespace, you
>>> will get into a situation of ARP duplication, which is the kind of split
>>> brain described in item #2. I personally, as a system architect and
>>> administrator, do not expect this to change for at least several years
>>> in OpenStack, so we will be using Pacemaker for a very long period of time.
>>>
>>
>> This has been changed already, and a while ago.
>> OCF infrastructure around neutron agents has never helped neutron in any
>> meaningful way and is just an artifact from the dark past.
>> The reason is that pacemaker/ocf doesn't have enough intelligence to know
>> when to engage; as a result, any cleanup could only be achieved through
>> manual operations. I don't need to remind you how many bugs were in OCF
>> scripts which brought whole clusters down after those manual operations.
>> So it's just way better to go with simple standard tools with
>> fine-grained control.
>> The same applies to any other openstack service (again, not rabbitmq/galera).
>>
>> > so we will be using Pacemaker for a very long period of time.
>> Not for neutron, sorry. As soon as we finish the last bit of such
>> cleanup, which is targeted for 8.0.
>>
>>> Now, back to the topic - we may decide to use some more sophisticated
>>> integral node health attribute which can be used with Pacemaker, as well as
>>> to put the node into some kind of maintenance mode. We can leverage the
>>> User Maintenance Mode feature here, or just stop particular services and
>>> disable particular haproxy backends.
>>>
>>
>> I think this kind of attribute, even if analyzed by pacemaker/ocf, doesn't
>> require putting any new OpenStack service under pacemaker control.
>>
>> Thanks,
>> Eugene.
>>
>>
>>>
>>> On Mon, Oct 5, 2015 at 11:57 PM, Eugene Nikanorov <
>>> enikano...@mirantis.com> wrote:
>>>

>>
> Mirantis does 

Re: [openstack-dev] [fuel] What to do when a controller runs out of space

2015-10-06 Thread Eugene Nikanorov
> 2) I think you misunderstand the difference between
> upstart/systemd and Pacemaker in this case. There are many cases when you
> need to have a synchronized view of the cluster. Otherwise you will hit
> split-brain situations and have your cluster malfunctioning. Until
> OpenStack provides us with such means there is no other way than using
> Pacemaker/Zookeeper/etc.
>

Could you please give some examples of those 'many cases' for openstack
specifically?
As for my 'misunderstanding' - openstack services only need to be always
up, not more than that.
Upstart does a perfect job there.


> 3) Regarding Neutron agents - we discussed it many times - you need to be
> able to control and clean up stuff after some service has crashed. Currently,
> Neutron does not provide reliable ways to do it. If your agent dies and
> does not clean up IP addresses from the network namespace, you will get into
> a situation of ARP duplication, which is the kind of split brain
> described in item #2. I personally, as a system architect and administrator,
> do not expect this to change for at least several years in OpenStack,
> so we will be using Pacemaker for a very long period of time.
>

This has been changed already, and a while ago.
OCF infrastructure around neutron agents has never helped neutron in any
meaningful way and is just an artifact from the dark past.
The reason is that pacemaker/ocf doesn't have enough intelligence to know
when to engage; as a result, any cleanup could only be achieved through
manual operations. I don't need to remind you how many bugs were in OCF
scripts which brought whole clusters down after those manual operations.
So it's just way better to go with simple standard tools with fine-grained
control.
The same applies to any other openstack service (again, not rabbitmq/galera).

> so we will be using Pacemaker for a very long period of time.
Not for neutron, sorry. As soon as we finish the last bit of such cleanup,
which is targeted for 8.0.

> Now, back to the topic - we may decide to use some more sophisticated
> integral node health attribute which can be used with Pacemaker, as well as
> to put the node into some kind of maintenance mode. We can leverage the
> User Maintenance Mode feature here, or just stop particular services and
> disable particular haproxy backends.
>

I think this kind of attribute, even if analyzed by pacemaker/ocf, doesn't
require putting any new OpenStack service under pacemaker control.

Thanks,
Eugene.


>
> On Mon, Oct 5, 2015 at 11:57 PM, Eugene Nikanorov  > wrote:
>
>>

>>> Mirantis controls neither RabbitMQ nor Galera. Mirantis cannot assure
>>> their quality either.
>>>
>>
>> Correct, and rabbitmq was always a pain in the back, preventing any *real*
>> enterprise usage of openstack where reliability does matter.
>>
>>
>>> > 2) it has terrible UX

>>>
>>> It looks like a personal opinion. I'd like to see surveys or operator
>>> feedback. Also, this statement is not constructive as it doesn't offer
>>> alternative solutions.
>>>
>>
>> The solution is to get rid of terrible UX wherever possible (I'm not
>> saying it is always possible, of course);
>> upstart is just so much better.
>> And yes, this is my personal opinion and a summary of the escalation
>> team's experience.
>>
>>
>>>
 > 3) it is not reliable

>>>
>>> I would say openstack services are not HA-reliable, so OCF scripts are
>>> the operators' reaction to these problems. Many of the services have
>>> childish issues from release to release; operators wrote OCF scripts to fix
>>> these problems. A lot of openstack services are stateful, so they require
>>> some kind of stickiness or synchronization. Openstack services don't have
>>> simple health-check functionality, so it's hard to tell whether they are
>>> running well or not. SIGHUP is still a problem for many openstack services.
>>> Etc., etc. So, let's be constructive here.
>>>
>>
>> Well, I prefer to be responsible for what I know and maintain. Thus, I
>> state that neutron doesn't need to be managed by pacemaker, neither the
>> server nor any of the agents, and that's the path that the neutron team
>> will be taking.
>>
>> Thanks,
>> Eugene.
>>
>>>
>>>
 >

 I disagree with #1 as I do not agree that should be a criterion for an
 open-source project.  Considering pacemaker is at the core of our
 controller setup, I would argue that if these are in fact true we need
 to be using something else.  I would agree that it is a terrible UX,
 but all the clustering software I've used falls into this category.  I'd
 like more information on how it is not reliable. Do we have numbers to
 back up these claims?

 > (3) is not an evaluation of the project itself, but just a logical
 > consequence of (1) and (2).
 > As a part of the escalation team I can say that it has cost our team
 > thousands of man-hours of head-scratching, staring at pacemaker logs
 > whose value is
 > usually slightly below 

Re: [openstack-dev] [fuel] What to do when a controller runs out of space

2015-10-06 Thread Vladimir Kuklin
Eugene

I would prefer to tackle your points in the following way:

1) Regarding rabbitmq - you and me both know that this a major flaw in how
OpenStack operates - it uses message broker in very unoptimal way sending
lots of unneeded data through when it actually may not send it. So far we
hardened our automated control of rabbitmq as much as possible and the only
issues we see are those when nodes are already under very stressful
conditions such as one of the OpenStack services consuming 95% of available
memory. I doubt that such a case should be handled by Pacemaker or any
other supervisor - they just will not help you. The proper thing that
should be done is fixing OpenStack itself to not overload messaging bus and
use built-in capabilities of RDBMS and other underlying components.

2) I think you misunderstand the difference between upstart/systemd
and Pacemaker in this case. There are many cases when you need to have a
synchronized view of the cluster. Otherwise you will hit split-brain
situations and have your cluster malfunctioning. Until OpenStack provides
us with such means there is no other way than using Pacemaker/Zookeeper/etc.

3) Regarding Neutron agents - we discussed it many times - you need to be
able to control and clean up stuff after some service has crashed. Currently,
Neutron does not provide reliable ways to do it. If your agent dies and
does not clean up IP addresses from the network namespace, you will get into
a situation of ARP duplication, which is the kind of split brain
described in item #2. I personally, as a system architect and administrator,
do not expect this to change for at least several years in OpenStack,
so we will be using Pacemaker for a very long period of time.
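
To make the point concrete, here is a minimal sketch (mine, not the actual
Fuel OCF cleanup code) of the kind of cleanup meant here, assuming the
default qrouter-* namespace naming and standard iproute2 tooling; in
practice neutron's own neutron-netns-cleanup utility covers much of this:

#!/usr/bin/env python
"""Illustrative sketch only: purge IP addresses left behind in qrouter-*
namespaces after an l3-agent crash, so duplicates stop answering ARP."""

import subprocess


def list_namespaces():
    out = subprocess.check_output(["ip", "netns", "list"])
    # 'ip netns list' may append an id like "(id: 3)"; keep the first token.
    return [l.split()[0] for l in out.decode().splitlines() if l.strip()]


def flush_router_namespace(ns):
    # Flush every address except loopback from all interfaces in the namespace.
    links = subprocess.check_output(
        ["ip", "netns", "exec", ns, "ip", "-o", "link", "show"]).decode()
    for line in links.splitlines():
        dev = line.split(":")[1].strip().split("@")[0]
        if dev == "lo":
            continue
        subprocess.call(["ip", "netns", "exec", ns,
                         "ip", "addr", "flush", "dev", dev])


if __name__ == "__main__":
    for ns in list_namespaces():
        if ns.startswith("qrouter-"):
            flush_router_namespace(ns)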

Now, back to the topic - we may decide to use some more sophisticated
integral node health attribute which can be used with Pacemaker, as well as
to put the node into some kind of maintenance mode. We can leverage the
User Maintenance Mode feature here, or just stop particular services and
disable particular haproxy backends.
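
To illustrate the idea (a sketch under assumptions, not an existing Fuel
script): taking a controller out of rotation could mean putting the node
into pacemaker standby (one possible 'maintenance-like' mode) and disabling
its backends through the local haproxy admin socket. The socket path and
backend/server names below are made up for the example.

#!/usr/bin/env python
"""Illustrative sketch: drain a controller from pacemaker and haproxy."""

import socket
import subprocess

HAPROXY_SOCKET = "/var/lib/haproxy/stats"          # assumed admin-level socket
BACKENDS = [("glance-api", socket.gethostname()),  # (backend, server) pairs;
            ("keystone-1", socket.gethostname())]  # names are hypothetical


def haproxy_cmd(command):
    # Send one command to the haproxy admin socket and return the reply.
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(HAPROXY_SOCKET)
    s.sendall((command + "\n").encode())
    reply = s.recv(4096)
    s.close()
    return reply.decode()


if __name__ == "__main__":
    # Stop pacemaker from running resources here; they get migrated away.
    subprocess.check_call(["crm_attribute", "--node", socket.gethostname(),
                           "--name", "standby", "--update", "on"])
    # Take this node's endpoints out of the load balancer.
    for backend, server in BACKENDS:
        haproxy_cmd("disable server %s/%s" % (backend, server))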

On Mon, Oct 5, 2015 at 11:57 PM, Eugene Nikanorov 
wrote:

>
>>>
>> Mirantis controls neither RabbitMQ nor Galera. Mirantis cannot assure
>> their quality either.
>>
>
> Correct, and rabbitmq was always a pain in the back, preventing any *real*
> enterprise usage of openstack where reliability does matter.
>
>
>> > 2) it has terrible UX
>>>
>>
>> It looks like a personal opinion. I'd like to see surveys or operator
>> feedback. Also, this statement is not constructive as it doesn't offer
>> alternative solutions.
>>
>
> The solution is to get rid of terrible UX wherever possible (I'm not
> saying it is always possible, of course); upstart is just so much better.
> And yes, this is my personal opinion and a summary of the escalation team's
> experience.
>
>
>>
>>> > 3) it is not reliable
>>>
>>
>> I would say openstack services are not HA-reliable, so OCF scripts are
>> the operators' reaction to these problems. Many of the services have
>> childish issues from release to release; operators wrote OCF scripts to fix
>> these problems. A lot of openstack services are stateful, so they require
>> some kind of stickiness or synchronization. Openstack services don't have
>> simple health-check functionality, so it's hard to tell whether they are
>> running well or not. SIGHUP is still a problem for many openstack services.
>> Etc., etc. So, let's be constructive here.
>>
>
> Well, I prefer to be responsible for what I know and maintain. Thus, I
> state that neutron doesn't need to be managed by pacemaker, neither the
> server nor any of the agents, and that's the path that the neutron team
> will be taking.
>
> Thanks,
> Eugene.
>
>>
>>
>>> >
>>>
>>> I disagree with #1 as I do not agree that should be a criterion for an
>>> open-source project.  Considering pacemaker is at the core of our
>>> controller setup, I would argue that if these are in fact true we need
>>> to be using something else.  I would agree that it is a terrible UX,
>>> but all the clustering software I've used falls into this category.  I'd
>>> like more information on how it is not reliable. Do we have numbers to
>>> back up these claims?
>>>
>>> > (3) is not an evaluation of the project itself, but just a logical
>>> > consequence of (1) and (2).
>>> > As a part of the escalation team I can say that it has cost our team
>>> > thousands of man-hours of head-scratching, staring at pacemaker logs
>>> > whose value is usually slightly below zero.
>>> >
>>> > Most openstack services (in fact, ALL API servers) are stateless; they
>>> > don't require any cluster management (also, they don't need to be
>>> > moved in case of lack of space).
>>> > Stateful services like neutron agents have their state as a function of
>>> > the DB state and are able to synchronize it with the server without
>>> > external "help".
>>> >
>>>
>>> So it's not an issue with moving services so much as being able to
>>> stop the services 

Re: [openstack-dev] [fuel] What to do when a controller runs out of space

2015-10-06 Thread Vladimir Kuklin
Eugene

For example, each time that you need to have one instance (e.g. master
instance) of something non-stateless running in the cluster. You are right
that currently lots of things are fixed already - heat engine is fine, for
example. But I still see this issue with l3 agents and I will not change my
mind until we conduct complete scale and destructive testing with new
neutron code.

Secondly, if we cannot reliably identify when to engage - then we need to
write the code that will tell us when to engage. If this code is already in
place and we can trigger a couple of commands to figure out Neutron agent
state, then we can add them to the OCF script's monitor action and that is
all. I agree that we have some issues with our OCF scripts, for example some
suboptimal cleanup code that has issues at large scale, but I am almost sure
we can fix it.
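
As a concrete, purely illustrative example of such a check (a sketch, not
the real Fuel OCF monitor), assuming the neutron CLI and OS_* credentials
are available on the node, a monitor helper could ask the server whether the
local L3 agent is reported alive and translate that into OCF exit codes:

#!/usr/bin/env python
"""Hedged sketch: map the neutron server's view of the local L3 agent onto
OCF monitor exit codes."""

import csv
import io
import socket
import subprocess
import sys

OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def l3_agent_alive(host):
    # 'neutron agent-list -f csv' prints one row per agent with an 'alive'
    # column (rendered as ':-)' when the agent is reporting in).
    out = subprocess.check_output(["neutron", "agent-list", "-f", "csv"])
    for row in csv.DictReader(io.StringIO(out.decode())):
        if row.get("agent_type") == "L3 agent" and row.get("host") == host:
            return row.get("alive") in (":-)", "True")
    return False


if __name__ == "__main__":
    try:
        alive = l3_agent_alive(socket.gethostname())
    except Exception:
        sys.exit(OCF_ERR_GENERIC)   # could not even ask the server
    sys.exit(OCF_SUCCESS if alive else OCF_NOT_RUNNING)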

Finally, let me show an example of when you need a centralized cluster
manager to manage such situations - you have a temporary issue with
connectivity to the neutron server over the management network for some
reason. Your agents are not cleaned up and the neutron server starts new l3
agent instances on a different node. In this case you will have IP
duplication in the network, and it will bring down the whole cluster because
connectivity through the 'public' network will still be working just fine.
When we are using Pacemaker, such a node will either be fenced or will stop
all the services controlled by pacemaker, as it is part of a non-quorate
partition of the cluster. When this happens, the l3 agent OCF script will
run its cleanup section and purge all the stale IPs, thus saving us from the
trouble. I obviously may be mistaken, so please correct me if this is not
the case.
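
For completeness, a tiny sketch of the quorum part of this argument
(illustrative only, not production code): an agent-side cleanup hook could
refuse to keep resources alive when the local node is no longer part of a
quorate pacemaker partition.

#!/usr/bin/env python
"""Sketch: gate cleanup decisions on local pacemaker quorum."""

import subprocess


def node_has_quorum():
    # 'crm_node -q' prints 1 when the local partition has quorum, 0 otherwise.
    out = subprocess.check_output(["crm_node", "-q"]).decode().strip()
    return out == "1"


if __name__ == "__main__":
    if not node_has_quorum():
        print("no quorum: stale l3 resources should be cleaned up here")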


On Tue, Oct 6, 2015 at 3:46 PM, Eugene Nikanorov 
wrote:

>
>
>> 2) I think you misunderstand the difference between
>> upstart/systemd and Pacemaker in this case. There are many cases when you
>> need to have a synchronized view of the cluster. Otherwise you will hit
>> split-brain situations and have your cluster malfunctioning. Until
>> OpenStack provides us with such means there is no other way than using
>> Pacemaker/Zookeeper/etc.
>>
>
> Could you please give some examples of those 'many cases' for openstack
> specifically?
> As for my 'misunderstanding' - openstack services only need to be always
> up, not more than that.
> Upstart does a perfect job there.
>
>
>> 3) Regarding Neutron agents - we discussed it many times - you need to be
>> able to control and clean up stuff after some service has crashed. Currently,
>> Neutron does not provide reliable ways to do it. If your agent dies and
>> does not clean up IP addresses from the network namespace, you will get into
>> a situation of ARP duplication, which is the kind of split brain
>> described in item #2. I personally, as a system architect and administrator,
>> do not expect this to change for at least several years in OpenStack,
>> so we will be using Pacemaker for a very long period of time.
>>
>
> This has been changed already, and a while ago.
> OCF infrastructure around neutron agents has never helped neutron in any
> meaningful way and is just an artifact from the dark past.
> The reason is that pacemaker/ocf doesn't have enough intelligence to know
> when to engage; as a result, any cleanup could only be achieved through
> manual operations. I don't need to remind you how many bugs were in OCF
> scripts which brought whole clusters down after those manual operations.
> So it's just way better to go with simple standard tools with fine-grained
> control.
> The same applies to any other openstack service (again, not rabbitmq/galera).
>
> > so we will be using Pacemaker for a very long period of time.
> Not for neutron, sorry. As soon as we finish the last bit of such cleanup,
> which is targeted for 8.0.
>
>> Now, back to the topic - we may decide to use some more sophisticated
>> integral node health attribute which can be used with Pacemaker, as well as
>> to put the node into some kind of maintenance mode. We can leverage the
>> User Maintenance Mode feature here, or just stop particular services and
>> disable particular haproxy backends.
>>
>
> I think this kind of attribute, even if analyzed by pacemaker/ocf, doesn't
> require putting any new OpenStack service under pacemaker control.
>
> Thanks,
> Eugene.
>
>
>>
>> On Mon, Oct 5, 2015 at 11:57 PM, Eugene Nikanorov <
>> enikano...@mirantis.com> wrote:
>>
>>>
>
 Mirantis controls neither RabbitMQ nor Galera. Mirantis cannot
 assure their quality either.

>>>
>>> Correct, and rabbitmq was always a pain in the back, preventing any *real*
>>> enterprise usage of openstack where reliability does matter.
>>>
>>>
 > 2) it has terrible UX
>

 It looks like a personal opinion. I'd like to see surveys or operator
 feedback. Also, this statement is not constructive as it doesn't offer
 alternative solutions.

>>>
>>> The solution is to get rid of terrible 

Re: [openstack-dev] [fuel] What to do when a controller runs out of space

2015-10-05 Thread Eugene Nikanorov
>
>
>>
> Mirantis controls neither RabbitMQ nor Galera. Mirantis cannot assure
> their quality either.
>

Correct, and rabbitmq was always a pain in the back, preventing any *real*
enterprise usage of openstack where reliability does matter.


> > 2) it has terrible UX
>>
>
> It looks like a personal opinion. I'd like to see surveys or operator
> feedback. Also, this statement is not constructive as it doesn't offer
> alternative solutions.
>

The solution is to get rid of terrible UX wherever possible (I'm not saying
it is always possible, of course); upstart is just so much better.
And yes, this is my personal opinion and a summary of the escalation team's
experience.


>
>> > 3) it is not reliable
>>
>
> I would say openstack services are not HA-reliable, so OCF scripts are
> the operators' reaction to these problems. Many of the services have
> childish issues from release to release; operators wrote OCF scripts to fix
> these problems. A lot of openstack services are stateful, so they require
> some kind of stickiness or synchronization. Openstack services don't have
> simple health-check functionality, so it's hard to tell whether they are
> running well or not. SIGHUP is still a problem for many openstack services.
> Etc., etc. So, let's be constructive here.
>

Well, I prefer to be responsible for what I know and maintain. Thus, I
state that neutron doesn't need to be managed by pacemaker, neither the
server nor any of the agents, and that's the path that the neutron team will
be taking.

Thanks,
Eugene.

>
>
>> >
>>
>> I disagree with #1 as I do not agree that should be a criterion for an
>> open-source project.  Considering pacemaker is at the core of our
>> controller setup, I would argue that if these are in fact true we need
>> to be using something else.  I would agree that it is a terrible UX,
>> but all the clustering software I've used falls into this category.  I'd
>> like more information on how it is not reliable. Do we have numbers to
>> back up these claims?
>>
>> > (3) is not an evaluation of the project itself, but just a logical
>> > consequence of (1) and (2).
>> > As a part of the escalation team I can say that it has cost our team
>> > thousands of man-hours of head-scratching, staring at pacemaker logs
>> > whose value is usually slightly below zero.
>> >
>> > Most openstack services (in fact, ALL API servers) are stateless; they
>> > don't require any cluster management (also, they don't need to be moved
>> > in case of lack of space).
>> > Stateful services like neutron agents have their state as a function of
>> > the DB state and are able to synchronize it with the server without
>> > external "help".
>> >
>>
>> So it's not an issue with moving services so much as being able to
>> stop the services when a condition is met. Have we tested all OS
>> services to ensure they do function 100% when out of disk space?  I
>> would assume that glance might have issues with image uploads if there
>> is no space to handle a request.
>>
>> > So now usage of pacemaker can be only justified for cases where
>> service's
>> > clustering mechanism requires active monitoring (rabbitmq, galera)
>> > But even there, examples when we are better off without pacemaker are
>> all
>> > around.
>> >
>> > Thanks,
>> > Eugene.
>> >
>>
>> After I sent this email, I had further discussions around the issues
>> that I'm facing and it may not be completely related to disk space. I
>> think we might be relying on the expectation that the local rabbitmq
>> is always available but I need to look into that. Either way, I
>> believe we still should continue to discuss this issue as we are
>> managing services in multiple ways on a single host. Additionally I do
>> not believe that we really perform quality health checks on our
>> services.
>>
>> Thanks,
>> -Alex
>>
>>
>> >
>> > On Mon, Oct 5, 2015 at 1:34 PM, Sergey Vasilenko <
>> svasile...@mirantis.com>
>> > wrote:
>> >>
>> >>
>> >> On Mon, Oct 5, 2015 at 12:22 PM, Eugene Nikanorov
>> >>  wrote:
>> >>>
>> >>> No pacemaker for os services, please.
>> >>> We'll be moving out neutron agents from pacemaker control in 8.0,
>> other
>> >>> os services don't need it either.
>> >>
>> >>
>> >> could you please provide your arguments.
>> >>
>> >>
>> >> /sv
>> >>
>> >>
>> >>
>> >
>> >
>> >

Re: [openstack-dev] [fuel] What to do when a controller runs out of space

2015-10-05 Thread Sergii Golovatiuk
Good morning gentlemen!

Alex raised a very good question. Thank you very much! We have 3 init
systems right now. Some services use SystemV, some use upstart, and some
are under pacemaker. Personally, I would like to have pacemaker as pid 1,
replacing init [1]. However, I would like to remove custom scripts as much
as possible and leave only the upstart/systemd resource classes [2]. That
move will give operators fantastic flexibility to control their services.

Concerning the HAProxy checker, I think it should be done in a different
way. If pacemaker/corosync has an issue, the node should be fenced.

Also, I would like to have pacemaker remote controlling services on compute
nodes. It's a very good replacement for monit.

[1] https://www.youtube.com/watch?v=yq5nYPKxBCo
[2]
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-supported.html



--
Best regards,
Sergii Golovatiuk,
Skype #golserge
IRC #holser


Re: [openstack-dev] [fuel] What to do when a controller runs out of space

2015-10-05 Thread Eugene Nikanorov
No pacemaker for os services, please.
We'll be moving out neutron agents from pacemaker control in 8.0, other os
services don't need it either.

E.
5 окт. 2015 г. 12:01 пользователь "Sergii Golovatiuk" <
sgolovat...@mirantis.com> написал:

> Good morning gentlemen!
>
> Alex raised a very good question. Thank you very much! We have 3 init
> systems right now. Some services use SystemV, some use upstart, and some
> are under pacemaker. Personally, I would like to have pacemaker as pid 1,
> replacing init [1]. However, I would like to remove custom scripts as much
> as possible and leave only the upstart/systemd resource classes [2]. That
> move will give operators fantastic flexibility to control their services.
>
> Concerning the HAProxy checker, I think it should be done in a different
> way. If pacemaker/corosync has an issue, the node should be fenced.
>
> Also, I would like to have pacemaker remote controlling services on compute
> nodes. It's a very good replacement for monit.
>
> [1] https://www.youtube.com/watch?v=yq5nYPKxBCo
> [2]
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-supported.html
>
>
>
> --
> Best regards,
> Sergii Golovatiuk,
> Skype #golserge
> IRC #holser
>
>
>


Re: [openstack-dev] [fuel] What to do when a controller runs out of space

2015-10-05 Thread Alex Schultz
On Mon, Oct 5, 2015 at 5:56 AM, Eugene Nikanorov
 wrote:
> Ok,
>
> Project-wise:
> 1) Pacemaker is not under our company's control, we can't assure its quality
> 2) it has terrible UX
> 3) it is not reliable
>

I disagree with #1 as I do not agree that should be a criterion for an
open-source project.  Considering pacemaker is at the core of our
controller setup, I would argue that if these are in fact true we need
to be using something else.  I would agree that it is a terrible UX,
but all the clustering software I've used falls into this category.  I'd
like more information on how it is not reliable. Do we have numbers to
back up these claims?

> (3) is not an evaluation of the project itself, but just a logical
> consequence of (1) and (2).
> As a part of the escalation team I can say that it has cost our team
> thousands of man-hours of head-scratching, staring at pacemaker logs whose
> value is usually slightly below zero.
>
> Most openstack services (in fact, ALL API servers) are stateless; they
> don't require any cluster management (also, they don't need to be moved in
> case of lack of space).
> Stateful services like neutron agents have their state as a function of the
> DB state and are able to synchronize it with the server without external
> "help".
>

So it's not an issue with moving services so much as being able to
stop the services when a condition is met. Have we tested all OS
services to ensure they do function 100% when out of disk space?  I
would assume that glance might have issues with image uploads if there
is no space to handle a request.

> So now usage of pacemaker can be only justified for cases where service's
> clustering mechanism requires active monitoring (rabbitmq, galera)
> But even there, examples when we are better off without pacemaker are all
> around.
>
> Thanks,
> Eugene.
>

After I sent this email, I had further discussions around the issues
that I'm facing and it may not be completely related to disk space. I
think we might be relying on the expectation that the local rabbitmq
is always available but I need to look into that. Either way, I
believe we still should continue to discuss this issue as we are
managing services in multiple ways on a single host. Additionally I do
not believe that we really perform quality health checks on our
services.

Thanks,
-Alex


>
> On Mon, Oct 5, 2015 at 1:34 PM, Sergey Vasilenko 
> wrote:
>>
>>
>> On Mon, Oct 5, 2015 at 12:22 PM, Eugene Nikanorov
>>  wrote:
>>>
>>> No pacemaker for os services, please.
>>> We'll be moving out neutron agents from pacemaker control in 8.0, other
>>> os services don't need it either.
>>
>>
>> could you please provide your arguments.
>>
>>
>> /sv
>>
>>
>
>
>



Re: [openstack-dev] [fuel] What to do when a controller runs out of space

2015-10-05 Thread Eugene Nikanorov
Ok,

Project-wise:
1) Pacemaker is not under our company's control, we can't assure its quality
2) it has terrible UX
3) it is not reliable

(3) is not an evaluation of the project itself, but just a logical
consequence of (1) and (2).
As a part of the escalation team I can say that it has cost our team
thousands of man-hours of head-scratching, staring at pacemaker logs whose
value is usually slightly below zero.

Most openstack services (in fact, ALL API servers) are stateless; they
don't require any cluster management (also, they don't need to be moved in
case of lack of space).
Stateful services like neutron agents have their state as a function of the
DB state and are able to synchronize it with the server without external
"help".

So now the usage of pacemaker can only be justified for cases where a
service's clustering mechanism requires active monitoring (rabbitmq, galera).
But even there, examples where we are better off without pacemaker are all
around.

Thanks,
Eugene.


On Mon, Oct 5, 2015 at 1:34 PM, Sergey Vasilenko 
wrote:

>
> On Mon, Oct 5, 2015 at 12:22 PM, Eugene Nikanorov  > wrote:
>
>> No pacemaker for os services, please.
>> We'll be moving out neutron agents from pacemaker control in 8.0, other
>> os services don't need it either.
>>
>
> could you please provide your arguments.
>
>
> /sv
>
>
>


Re: [openstack-dev] [fuel] What to do when a controller runs out of space

2015-10-05 Thread Sergey Vasilenko
On Mon, Oct 5, 2015 at 12:22 PM, Eugene Nikanorov 
wrote:

> No pacemaker for os services, please.
> We'll be moving out neutron agents from pacemaker control in 8.0, other os
> services don't need it either.
>

could you please provide your arguments.


/sv


Re: [openstack-dev] [fuel] What to do when a controller runs out of space

2015-10-05 Thread Sergii Golovatiuk
Hi,


On Mon, Oct 5, 2015 at 3:03 PM, Alex Schultz  wrote:

> On Mon, Oct 5, 2015 at 5:56 AM, Eugene Nikanorov
>  wrote:
> > Ok,
> >
> > Project-wise:
> > 1) Pacemaker is not under our company's control, we can't assure its
> quality
>

Mirantis controls neither RabbitMQ nor Galera. Mirantis cannot assure
their quality either.

> 2) it has terrible UX
>

It looks like a personal opinion. I'd like to see surveys or operator
feedback. Also, this statement is not constructive as it doesn't offer
alternative solutions.


> > 3) it is not reliable
>

I would say openstack services are not HA-reliable, so OCF scripts are
the operators' reaction to these problems. Many of the services have
childish issues from release to release; operators wrote OCF scripts to fix
these problems. A lot of openstack services are stateful, so they require
some kind of stickiness or synchronization. Openstack services don't have
simple health-check functionality, so it's hard to tell whether they are
running well or not. SIGHUP is still a problem for many openstack services.
Etc., etc. So, let's be constructive here.



> >
>
> I disagree with #1 as I do not agree that should be a criterion for an
> open-source project.  Considering pacemaker is at the core of our
> controller setup, I would argue that if these are in fact true we need
> to be using something else.  I would agree that it is a terrible UX,
> but all the clustering software I've used falls into this category.  I'd
> like more information on how it is not reliable. Do we have numbers to
> back up these claims?
>
> > (3) is not an evaluation of the project itself, but just a logical
> > consequence of (1) and (2).
> > As a part of the escalation team I can say that it has cost our team
> > thousands of man-hours of head-scratching, staring at pacemaker logs
> > whose value is usually slightly below zero.
> >
> > Most openstack services (in fact, ALL API servers) are stateless; they
> > don't require any cluster management (also, they don't need to be moved
> > in case of lack of space).
> > Stateful services like neutron agents have their state as a function of
> > the DB state and are able to synchronize it with the server without
> > external "help".
> >
>
> So it's not an issue with moving services so much as being able to
> stop the services when a condition is met. Have we tested all OS
> services to ensure they do function 100% when out of disk space?  I
> would assume that glance might have issues with image uploads if there
> is no space to handle a request.
>
> > So now usage of pacemaker can be only justified for cases where service's
> > clustering mechanism requires active monitoring (rabbitmq, galera)
> > But even there, examples when we are better off without pacemaker are all
> > around.
> >
> > Thanks,
> > Eugene.
> >
>
> After I sent this email, I had further discussions around the issues
> that I'm facing and it may not be completely related to disk space. I
> think we might be relying on the expectation that the local rabbitmq
> is always available but I need to look into that. Either way, I
> believe we still should continue to discuss this issue as we are
> managing services in multiple ways on a single host. Additionally I do
> not believe that we really perform quality health checks on our
> services.
>
> Thanks,
> -Alex
>
>
> >
> > On Mon, Oct 5, 2015 at 1:34 PM, Sergey Vasilenko <
> svasile...@mirantis.com>
> > wrote:
> >>
> >>
> >> On Mon, Oct 5, 2015 at 12:22 PM, Eugene Nikanorov
> >>  wrote:
> >>>
> >>> No pacemaker for os services, please.
> >>> We'll be moving out neutron agents from pacemaker control in 8.0, other
> >>> os services don't need it either.
> >>
> >>
> >> could you please provide your arguments.
> >>
> >>
> >> /sv
> >>
> >>
> >>
> >
> >
> >
> >
>
>


Re: [openstack-dev] [fuel] What to do when a controller runs out of space

2015-10-03 Thread Sergey Vasilenko
On Fri, Oct 2, 2015 at 10:50 PM, Alex Schultz  wrote:

> So I was working on Bug 1493520 which is about what happens when a
> controller runs out of space. For this I came up with a solution[1] to
> leverage pacemaker to migrate services away from the controller when
> it runs out of space.
>

Yesterday, while we discussed another bug, Dmitry Iliyn proposed the
following changes to the haproxy service checker:
the checker for each service covered by haproxy should also check the
pacemaker status on the target node.
If pacemaker is not in an operational state (maintenance mode, fencing in
progress, or something else), haproxy should mark this endpoint as bad and
not use it.

IMHO this approach may be used here as well.
As soon as an out-of-space condition is detected, pacemaker should be
switched to a non-operational mode.
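
As a sketch of how that could be wired up (an assumption on my side, not an
existing Fuel mechanism), pacemaker's node-health attributes could carry the
out-of-space signal: setting #health_disk to 'red' makes a node ineligible
for resources when a node-health-strategy such as migrate-on-red is
configured. The threshold and watched paths below are examples only.

#!/usr/bin/env python
"""Hedged sketch: publish an out-of-space condition as a pacemaker
node-health attribute."""

import os
import socket
import subprocess

THRESHOLD_PCT = 5          # consider the node unhealthy below 5% free space
PATHS = ["/", "/var/log", "/var/lib/mysql"]   # example set of paths to watch


def pct_free(path):
    st = os.statvfs(path)
    return 100.0 * st.f_bavail / st.f_blocks


def set_disk_health(value):
    # value is 'green' or 'red'; pacemaker folds #health_* attributes into
    # node health when a node-health-strategy is enabled.
    subprocess.check_call(["crm_attribute", "--node", socket.gethostname(),
                           "--name", "#health_disk", "--update", value,
                           "--lifetime", "reboot"])


if __name__ == "__main__":
    low = [p for p in PATHS if os.path.exists(p) and pct_free(p) < THRESHOLD_PCT]
    set_disk_health("red" if low else "green")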

If all openstack services are controlled by pacemaker, they will be brought
down automatically by pacemaker.
The advantage of moving all openstack services into pacemaker is one control
center for every service across the whole cluster.
Services controlled by pacemaker may be plain OS services (init, upstart, or
something else) or containerized.
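
And a sketch of the checker itself (again illustrative, not Fuel's actual
haproxy configuration): a script that could be used with haproxy's
'external-check command' option and that marks a server down when pacemaker
on that node has no quorum or the cluster is in maintenance mode. Checking
only the local node is a simplification for the example.

#!/usr/bin/env python
"""Hedged sketch of a pacemaker-aware haproxy external check."""

import subprocess
import sys


def in_maintenance_mode():
    # 'maintenance-mode' is a cluster property; if it was never set the
    # query fails, which we treat as 'false'.
    try:
        out = subprocess.check_output(
            ["crm_attribute", "--query", "--name", "maintenance-mode"]).decode()
    except (OSError, subprocess.CalledProcessError):
        return False
    return "value=true" in out


def pacemaker_operational():
    try:
        # 'crm_node -q' prints 1 when the local partition is quorate.
        quorate = subprocess.check_output(
            ["crm_node", "-q"]).decode().strip() == "1"
    except (OSError, subprocess.CalledProcessError):
        return False
    return quorate and not in_maintenance_mode()


if __name__ == "__main__":
    # haproxy treats exit code 0 as 'server up', anything else as 'down'.
    sys.exit(0 if pacemaker_operational() else 1)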

/sv