Re: [openstack-dev] Nova scheduler startup when database is not available

2015-12-28 Thread Jay Pipes

On 12/24/2015 02:30 PM, Clint Byrum wrote:

This is entirely philosophical, but we should think about when it is
appropriate to adopt which mode of operation.

There are basically two ways being discussed:

1) Fail fast.
2) Retry forever.

Fail fast pros- Immediate feedback for problems, no zombies to worry
about staying dormant and resurrecting because their configs accidentally
become right again. Much more determinism. Debugging is much simpler. To
summarize, it's up and working, or down and not.

Fail fast cons- Ripple effects. If you have a database or network blip
while services are starting, you must be aware of all of the downstream
dependencies and trigger them to start again, or have automation which
retries forever, giving up some of the benefits of fail-fast. Circular
dependencies require special workflow to unroll (Service1 aspect A relies
on aspect X of service2, service2 aspect X relies on aspect B of service1
which would start fine without service2).  To summarize: this moves the
retry-forever problem to orchestration, and complicates some corner cases.

Retry forever pros- Circular dependencies are cake. Blips auto-recover.
Bring-up orchestration is simpler (start everything, wait..). To
summarize: this makes orchestration simpler.

Retry forever cons- Non-determinism. It's impossible to just look at the
thing from outside and know if it is ready to do useful work. May
actually be hiding intermittent problems, requiring more logging and
indicators in general to allow analysis.

I honestly think any distributed system needs both.


So do I. I was proposing only that we deal with unrecoverable 
configuration errors on startup in a fail-fast way. I was not proposing 
that we remove the existing functionality that retries requests in the 
occasion where an already-up-and-running scheduler service experiences 
(typically transient) I/O disruptions to a dependent service like the DB 
or MQ.
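(For illustration only, the retry-on-transient-failure behaviour described
above might look roughly like the generic sketch below; this is not the
actual nova/oslo code, just the shape of it:)

    import logging
    import time

    LOG = logging.getLogger(__name__)

    def call_with_retry(func, max_interval=30):
        # Generic sketch: keep retrying a call to a dependent service
        # (DB, MQ) through transient outages, backing off up to a cap.
        interval = 1
        while True:
            try:
                return func()
            except Exception:  # real code would catch specific I/O errors
                LOG.exception("dependent service unavailable, retrying in %ss",
                              interval)
                time.sleep(interval)
                interval = min(interval * 2, max_interval)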




That said, the scheduler is, IMO, an _extremely_ complex piece of
OpenStack, with up and down stream dependencies on several levels (which
is why redesigning it gets debated so often on openstack-dev).


It's actually not all that complex. Or at least, it doesn't need to be :)

Best,
-jay

Making
it fail fast would complicate the process of bringing and keeping an
OpenStack cloud up. There are probably some benefits I haven't thought
of, but the main benefit you stated would be that one would know when
their configuration tooling was wrong and giving their scheduler the
wrong database information, which is not, IMO, a hard problem (one can
read the config file after all). But I'm sure we could think of more if
we tried hard.

I hope I'm not too vague here.. I *want* fail-fast on everything.
However, I also don't think it can just be a blanket policy without
requiring everybody to deploy complex orchestration on top.


Re: [openstack-dev] Nova scheduler startup when database is not available

2015-12-28 Thread Fox, Kevin M
Another data point: I've had to work around daemons failing fast, as discussed 
below, when working with docker-compose. It doesn't have nice dependency 
handling yet, and during the initial bootstrap of all the containers in a pod, 
some can fail because they don't stick around long enough for the things they 
depend on to initialize. It's kind of painful. Fail fast has some nice 
features, but retry forever is often very useful in the field.

Thanks,
Kevin

From: Jay Pipes [jaypi...@gmail.com]
Sent: Monday, December 28, 2015 9:45 AM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] Nova scheduler startup when database is not 
available

On 12/24/2015 02:30 PM, Clint Byrum wrote:
> This is entirely philosophical, but we should think about when it is
> appropriate to adopt which mode of operation.
>
> There are basically two ways being discussed:
>
> 1) Fail fast.
> 2) Retry forever.
>
> Fail fast pros- Immediate feedback for problems, no zombies to worry
> about staying dormant and resurrecting because their configs accidentally
> become right again. Much more determinism. Debugging is much simpler. To
> summarize, it's up and working, or down and not.
>
> Fail fast cons- Ripple effects. If you have a database or network blip
> while services are starting, you must be aware of all of the downstream
> dependencies and trigger them to start again, or have automation which
> retries forever, giving up some of the benefits of fail-fast. Circular
> dependencies require special workflow to unroll (Service1 aspect A relies
> on aspect X of service2, service2 aspect X relies on aspect B of service1
> which would start fine without service2).  To summarize: this moves the
> retry-forever problem to orchestration, and complicates some corner cases.
>
> Retry forever pros- Circular dependencies are cake. Blips auto-recover.
> Bring-up orchestration is simpler (start everything, wait..). To
> summarize: this makes orchestration simpler.
>
> Retry forever cons- Non-determinism. It's impossible to just look at the
> thing from outside and know if it is ready to do useful work. May
> actually be hiding intermittent problems, requiring more logging and
> indicators in general to allow analysis.
>
> I honestly think any distributed system needs both.

So do I. I was proposing only that we deal with unrecoverable
configuration errors on startup in a fail-fast way. I was not proposing
that we remove the existing functionality that retries requests in the
occasion where an already-up-and-running scheduler service experiences
(typically transient) I/O disruptions to a dependent service like the DB
or MQ.


> That said, the scheduler is, IMO, an _extremely_ complex piece of
> OpenStack, with up and down stream dependencies on several levels (which
> is why redesigning it gets debated so often on openstack-dev).

It's actually not all that complex. Or at least, it doesn't need to be :)

Best,
-jay

 > Making
> it fail fast would complicate the process of bringing and keeping an
> OpenStack cloud up. There are probably some benefits I haven't thought
> of, but the main benefit you stated would be that one would know when
> their configuration tooling was wrong and giving their scheduler the
> wrong database information, which is not, IMO, a hard problem (one can
> read the config file after all). But I'm sure we could think of more if
> we tried hard.
>
> I hope I'm not too vague here.. I *want* fail-fast on everything.
> However, I also don't think it can just be a blanket policy without
> requiring everybody to deploy complex orchestration on top.


Re: [openstack-dev] Nova scheduler startup when database is not available

2015-12-28 Thread Jay Pipes

On 12/23/2015 08:35 PM, Morgan Fainberg wrote:

On Wed, Dec 23, 2015 at 10:32 AM, Jay Pipes wrote:

On 12/23/2015 12:27 PM, Lars Kellogg-Stedman wrote:

I've been looking into the startup constraints involved when launching
Nova services with systemd using Type=notify (which causes systemd to
wait for an explicit notification from the service before considering
it to be "started").  Some services (e.g., nova-conductor) will happily
"start" even if the backing database is currently unavailable (and
will enter a retry loop waiting for the database).

Other services -- specifically, nova-scheduler -- will block waiting
for the database *before* providing systemd with the necessary
notification.

nova-scheduler blocks because it wants to initialize a list of
available aggregates (in scheduler.host_manager.HostManager.__init__),
which it gets by calling objects.AggregateList.get_all.

Does it make sense to block service startup at this stage?  The
database disappearing during runtime isn't a hard error -- we will
retry and reconnect when it comes back -- so should the same situation
at startup be a hard error?  As an operator, I am more interested in
"did my configuration files parse correctly?" at startup, and would
generally prefer the service to start (and permit any dependent
services to start) even when the database isn't up (because that's
probably a situation of which I am already aware).


If your configuration file parsed correctly but has the wrong
database connection URI, what good is the service in an active
state? It won't be able to do anything at all.

This is why I think it's better to have hard checks like for
connections on startup and not have services active if they won't be
able to do anything useful.


Are you advocating that scheduler bails out and ceases to run or that it
doesn't mark itself as active? I am in favour of the second scenario but
not the first. There are cases where it would be nice to start the
scheduler and have it at least report "hey I can't contact the DB" but
not mark itself active, but continue to run and on an interval report/try
to reconnect.


I am in favor of the service not starting at all if the database cannot 
be connected to in a "test connection" scenario.



It isn't clear which level of "hard check" you're advocating in your
response and I want to clarify for the sake of conversation.


If the scheduler cannot contact the database, it cannot do anything 
useful at all. I don't see the point of having the service daemon "up" 
if it cannot do anything useful.


Most monitoring tooling (Nagios or nginx for simple load balancing) and 
distributed service management (Zookeeper) look at whether a service is 
responding on some port to determine if the service is up. If the 
service responds on said port, but cannot do anything useful, the 
information is less than useful...it's harmful, IMHO.
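(To make that concrete, a health check that is actually meaningful would look
more like the hypothetical sketch below, where "healthy" means "can reach the
DB", not merely "answers on a port":)

    from sqlalchemy import create_engine, exc

    def db_health(connection_uri):
        # Hypothetical readiness probe, not an existing nova endpoint:
        # report healthy only if the database actually answers.
        try:
            engine = create_engine(connection_uri)
            with engine.connect():
                pass
            return 200, "OK"
        except exc.SQLAlchemyError as e:
            return 503, "database unreachable: %s" % e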


For errors that are recoverable, sure keep the service up and running 
and retry the condition that is recoverable. But in the case of bad 
configuration, it's not a recoverable error, and I don't think the 
service should be started at all.


Hope that clears things up.

Best,
-jay


It would be relatively easy to have the scheduler lazy-load the list
of aggregates on first references, rather than at __init__.


Sure, but if the root cause of the issue is a problem due to
misconfigured connection string, then that lazy-load will just bomb
out and the scheduler will be useless anyway. I'd rather have a
fail-early/fast occur here than a fail-late.

Best,
-jay

I'm not
familiar enough with the nova code to know if there would be any
undesirable implications of this behavior.  We're already punting
initializing the list of instances to an asynchronous task in order to
avoid blocking service startup.

Does it make sense to permit nova-scheduler to complete service
startup in the absence of the database (and then retry the connection
in the background)?




Re: [openstack-dev] Nova scheduler startup when database is not available

2015-12-28 Thread Clint Byrum
Excerpts from Jay Pipes's message of 2015-12-28 09:45:39 -0800:
> On 12/24/2015 02:30 PM, Clint Byrum wrote:
> > This is entirely philosophical, but we should think about when it is
> > appropriate to adopt which mode of operation.
> >
> > There are basically two ways being discussed:
> >
> > 1) Fail fast.
> > 2) Retry forever.
> >
> > Fail fast pros- Immediate feedback for problems, no zombies to worry
> > about staying dormant and resurrecting because their configs accidentally
> > become right again. Much more determinism. Debugging is much simpler. To
> > summarize, it's up and working, or down and not.
> >
> > Fail fast cons- Ripple effects. If you have a database or network blip
> > while services are starting, you must be aware of all of the downstream
> > dependencies and trigger them to start again, or have automation which
> > retries forever, giving up some of the benefits of fail-fast. Circular
> > dependencies require special workflow to unroll (Service1 aspect A relies
> > on aspect X of service2, service2 aspect X relies on aspect B of service1
> > which would start fine without service2).  To summarize: this moves the
> > retry-forever problem to orchestration, and complicates some corner cases.
> >
> > Retry forever pros- Circular dependencies are cake. Blips auto-recover.
> > Bring-up orchestration is simpler (start everything, wait..). To
> > summarize: this makes orchestration simpler.
> >
> > Retry forever cons- Non-determinism. It's impossible to just look at the
> > thing from outside and know if it is ready to do useful work. May
> > actually be hiding intermittent problems, requiring more logging and
> > indicators in general to allow analysis.
> >
> > I honestly think any distributed system needs both.
> 
> So do I. I was proposing only that we deal with unrecoverable 
> configuration errors on startup in a fail-fast way. I was not proposing 
> that we remove the existing functionality that retries requests in the 
> occasion where an already-up-and-running scheduler service experiences 
> (typically transient) I/O disruptions to a dependent service like the DB 
> or MQ.
> 

Even during startup, failing fast on remote dependencies complicates
things. There's no dependency resolver for the entire cloud, as Kevin
Fox suggested.

> 
> > That said, the scheduler is, IMO, an _extremely_ complex piece of
> > OpenStack, with up and down stream dependencies on several levels (which
> > is why redesigning it gets debated so often on openstack-dev).
> 
> It's actually not all that complex. Or at least, it doesn't need to be :)
> 

On this we definitely agree.



Re: [openstack-dev] Nova scheduler startup when database is not available

2015-12-24 Thread Clint Byrum
Excerpts from Jay Pipes's message of 2015-12-23 10:32:27 -0800:
> On 12/23/2015 12:27 PM, Lars Kellogg-Stedman wrote:
> > I've been looking into the startup constraints involved when launching
> > Nova services with systemd using Type=notify (which causes systemd to
> > wait for an explicit notification from the service before considering
> > it to be "started").  Some services (e.g., nova-conductor) will happily
> > "start" even if the backing database is currently unavailable (and
> > will enter a retry loop waiting for the database).
> >
> > Other services -- specifically, nova-scheduler -- will block waiting
> > for the database *before* providing systemd with the necessary
> > notification.
> >
> > nova-scheduler blocks because it wants to initialize a list of
> > available aggregates (in scheduler.host_manager.HostManager.__init__),
> > which it gets by calling objects.AggregateList.get_all.
> >
> > Does it make sense to block service startup at this stage?  The
> > database disappearing during runtime isn't a hard error -- we will
> > retry and reconnect when it comes back -- so should the same situation
> > at startup be a hard error?  As an operator, I am more interested in
> > "did my configuration files parse correctly?" at startup, and would
> > generally prefer the service to start (and permit any dependent
> > services to start) even when the database isn't up (because that's
> > probably a situation of which I am already aware).
> 
> If your configuration file parsed correctly but has the wrong database 
> connection URI, what good is the service in an active state? It won't be 
> able to do anything at all.
> 
> This is why I think it's better to have hard checks like for connections 
> on startup and not have services active if they won't be able to do 
> anything useful.
> 
> > It would be relatively easy to have the scheduler lazy-load the list
> > of aggregates on first references, rather than at __init__.
> 
> Sure, but if the root cause of the issue is a problem due to 
> misconfigured connection string, then that lazy-load will just bomb out 
> and the scheduler will be useless anyway. I'd rather have a 
> fail-early/fast occur here than a fail-late.
> 

This is entirely philosophical, but we should think about when it is
appropriate to adopt which mode of operation.

There are basically two ways being discussed:

1) Fail fast.
2) Retry forever.

Fail fast pros- Immediate feedback for problems, no zombies to worry
about staying dormant and resurrecting because their configs accidentally
become right again. Much more determinism. Debugging is much simpler. To
summarize, it's up and working, or down and not.

Fail fast cons- Ripple effects. If you have a database or network blip
while services are starting, you must be aware of all of the downstream
dependencies and trigger them to start again, or have automation which
retries forever, giving up some of the benefits of fail-fast. Circular
dependencies require special workflow to unroll (Service1 aspect A relies
on aspect X of service2, service2 aspect X relies on aspect B of service1
which would start fine without service2).  To summarize: this moves the
retry-forever problem to orchestration, and complicates some corner cases.

Retry forever pros- Circular dependencies are cake. Blips auto-recover.
Bring-up orchestration is simpler (start everything, wait..). To
summarize: this makes orchestration simpler.

Retry forever cons- Non-determinism. It's impossible to just look at the
thing from outside and know if it is ready to do useful work. May
actually be hiding intermittent problems, requiring more logging and
indicators in general to allow analysis.

I honestly think any distributed system needs both. The more complex the
dependencies inside the system get, the more I think you have to deal
with the cons of retry-forever, even though this compounds the problem
of debugging that system. In designing systems, we should avoid
complex dependencies for this reason.

That said, the scheduler is, IMO, an _extremely_ complex piece of
OpenStack, with up and down stream dependencies on several levels (which
is why redesigning it gets debated so often on openstack-dev). Making
it fail fast would complicate the process of bringing and keeping an
OpenStack cloud up. There are probably some benefits I haven't thought
of, but the main benefit you stated would be that one would know when
their configuration tooling was wrong and giving their scheduler the
wrong database information, which is not, IMO, a hard problem (one can
read the config file after all). But I'm sure we could think of more if
we tried hard.

I hope I'm not too vague here.. I *want* fail-fast on everything.
However, I also don't think it can just be a blanket policy without
requiring everybody to deploy complex orchestration on top.


Re: [openstack-dev] Nova scheduler startup when database is not available

2015-12-24 Thread Sylvain Bauza



On 24/12/2015 02:35, Morgan Fainberg wrote:



On Wed, Dec 23, 2015 at 10:32 AM, Jay Pipes wrote:


On 12/23/2015 12:27 PM, Lars Kellogg-Stedman wrote:

I've been looking into the startup constraints involved when launching
Nova services with systemd using Type=notify (which causes systemd to
wait for an explicit notification from the service before considering
it to be "started").  Some services (e.g., nova-conductor) will happily
"start" even if the backing database is currently unavailable (and
will enter a retry loop waiting for the database).

Other services -- specifically, nova-scheduler -- will block waiting
for the database *before* providing systemd with the necessary
notification.

nova-scheduler blocks because it wants to initialize a list of
available aggregates (in scheduler.host_manager.HostManager.__init__),
which it gets by calling objects.AggregateList.get_all.

Does it make sense to block service startup at this stage?  The
database disappearing during runtime isn't a hard error -- we will
retry and reconnect when it comes back -- so should the same situation
at startup be a hard error?  As an operator, I am more interested in
"did my configuration files parse correctly?" at startup, and would
generally prefer the service to start (and permit any dependent
services to start) even when the database isn't up (because that's
probably a situation of which I am already aware).


If your configuration file parsed correctly but has the wrong
database connection URI, what good is the service in an active
state? It won't be able to do anything at all.

This is why I think it's better to have hard checks like for
connections on startup and not have services active if they won't
be able to do anything useful.


Are you advocating that scheduler bails out and ceases to run or that 
it doesn't mark itself as active? I am in favour of the second 
scenario but not the first. There are cases where it would be nice to 
start the scheduler and have it at least report "hey I can't contact 
the DB" but not mark itself active, but continue to run and on 
 report/try to reconnect.


It isn't clear which level of "hard check" you're advocating in your 
response and I want to clarify for the sake of conversation.


So, to be clear, the scheduler calls the DB to get the list of 
aggregates and instances precisely so that it does not have to call the 
DB every time a filter wants to check them; the filters look at the 
in-memory copy instead.
While that data is only needed by those filters, it still means that if 
the DB is down, the scheduler wouldn't work anyway: even if the service 
is running, any request to the scheduler would return an exception.
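(Roughly the pattern sketched below, with illustrative names only -- not the
real HostManager or filter code: the DB is read once up front, and request
handling only consults the in-memory copy:)

    class InMemoryAggregates(object):
        # Illustrative sketch: aggregates are loaded from the DB once
        # and kept in memory...
        def __init__(self, load_from_db):
            # load_from_db is a callable hitting the DB, e.g. a thin
            # wrapper around objects.AggregateList.get_all; it runs once.
            self.by_host = load_from_db()  # {hostname: [aggregate dicts]}

        # ...so a filter-style check never touches the database while a
        # scheduling request is being handled.
        def host_in_aggregate(self, hostname, agg_name):
            return any(a.get("name") == agg_name
                       for a in self.by_host.get(hostname, []))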


So, which is better, do you think? A scheduler that says in an error log 
"heh cool, the DB is bad, but okay, you can call me", or one that says 
"meh, you have a config issue, please review it"?


To be honest, we could maybe document better why the scheduler is not 
starting when it's not possible to call the DB, but I'm not sure it's a 
good idea to make the scheduler resilient to the DB being unavailable.


-Sylvain


It would be relatively easy to have the scheduler lazy-load the list
of aggregates on first references, rather than at __init__.


Sure, but if the root cause of the issue is a problem due to
misconfigured connection string, then that lazy-load will just
bomb out and the scheduler will be useless anyway. I'd rather have
a fail-early/fast occur here than a fail-late.

Best,
-jay

I'm not
familiar enough with the nova code to know if there would be any
undesirable implications of this behavior.  We're already punting
initializing the list of instances to an asynchronous task in order to
avoid blocking service startup.

Does it make sense to permit nova-scheduler to complete service
startup in the absence of the database (and then retry the connection
in the background)?





Re: [openstack-dev] Nova scheduler startup when database is not available

2015-12-23 Thread Jay Pipes

On 12/23/2015 12:27 PM, Lars Kellogg-Stedman wrote:

I've been looking into the startup constraints involved when launching
Nova services with systemd using Type=notify (which causes systemd to
wait for an explicit notification from the service before considering
it to be "started".  Some services (e.g., nova-conductor) will happily
"start" even if the backing database is currently unavailable (and
will enter a retry loop waiting for the database).

Other services -- specifically, nova-scheduler -- will block waiting
for the database *before* providing systemd with the necessary
notification.

nova-scheduler blocks because it wants to initialize a list of
available aggregates (in scheduler.host_manager.HostManager.__init__),
which it gets by calling objects.AggregateList.get_all.

Does it make sense to block service startup at this stage?  The
database disappearing during runtime isn't a hard error -- we will
retry and reconnect when it comes back -- so should the same situation
at startup be a hard error?  As an operator, I am more interested in
"did my configuration files parse correctly?" at startup, and would
generally prefer the service to start (and permit any dependent
services to start) even when the database isn't up (because that's
probably a situation of which I am already aware).


If your configuration file parsed correctly but has the wrong database 
connection URI, what good is the service in an active state? It won't be 
able to do anything at all.


This is why I think it's better to have hard checks like for connections 
on startup and not have services active if they won't be able to do 
anything useful.
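(As a sketch of the kind of hard check being argued for -- assuming a plain
SQLAlchemy URI, and not tied to the actual nova startup code:)

    import sys

    from sqlalchemy import create_engine, exc

    def assert_database_usable(connection_uri):
        # Fail fast: if the configured connection URI is unusable, exit
        # non-zero at startup rather than sit "active" but useless.
        try:
            engine = create_engine(connection_uri)
            with engine.connect():
                pass
        except exc.SQLAlchemyError as e:
            sys.stderr.write("FATAL: cannot connect to database: %s\n" % e)
            sys.exit(1)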



It would be relatively easy to have the scheduler lazy-load the list
of aggregates on first references, rather than at __init__.


Sure, but if the root cause of the issue is a problem due to 
misconfigured connection string, then that lazy-load will just bomb out 
and the scheduler will be useless anyway. I'd rather have a 
fail-early/fast occur here than a fail-late.


Best,
-jay

I'm not
familiar enough with the nova code to know if there would be any
undesirable implications of this behavior.  We're already punting
initializing the list of instances to an asynchronous task in order to
avoid blocking service startup.

Does it make sense to permit nova-scheduler to complete service
startup in the absence of the database (and then retry the connection
in the background)?





[openstack-dev] Nova scheduler startup when database is not available

2015-12-23 Thread Lars Kellogg-Stedman
I've been looking into the startup constraints involved when launching
Nova services with systemd using Type=notify (which causes systemd to
wait for an explicit notification from the service before considering
it to be "started".  Some services (e.g., nova-conductor) will happily
"start" even if the backing database is currently unavailable (and
will enter a retry loop waiting for the database).

Other services -- specifically, nova-scheduler -- will block waiting
for the database *before* providing systemd with the necessary
notification.
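(For reference, the readiness notification itself is just a datagram on the
socket systemd passes in NOTIFY_SOCKET; a minimal sketch of that protocol,
rather than the library bindings a real service would likely use:)

    import os
    import socket

    def sd_notify_ready():
        # Minimal sd_notify: tell systemd (Type=notify) we are "started".
        addr = os.environ.get("NOTIFY_SOCKET")
        if not addr:
            return  # not launched by systemd with Type=notify
        if addr.startswith("@"):
            addr = "\0" + addr[1:]  # abstract namespace socket
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        try:
            sock.sendto(b"READY=1", addr)
        finally:
            sock.close()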

nova-scheduler blocks because it wants to initialize a list of
available aggregates (in scheduler.host_manager.HostManager.__init__),
which it gets by calling objects.AggregateList.get_all.

Does it make sense to block service startup at this stage?  The
database disappearing during runtime isn't a hard error -- we will
retry and reconnect when it comes back -- so should the same situation
at startup be a hard error?  As an operator, I am more interested in
"did my configuration files parse correctly?" at startup, and would
generally prefer the service to start (and permit any dependent
services to start) even when the database isn't up (because that's
probably a situation of which I am already aware).

It would be relatively easy to have the scheduler lazy-load the list
of aggregates on first references, rather than at __init__.  I'm not
familiar enough with the nova code to know if there would be any
undesirable implications of this behavior.  We're already punting
initializing the list of instances to an asynchronous task in order to
avoid blocking service startup.
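(Something along these lines -- a generic sketch of that lazy-load idea, not a
patch against the real HostManager:)

    class LazyHostManager(object):
        # Sketch: defer the aggregate query out of __init__ so startup
        # no longer blocks on the database being reachable.
        def __init__(self, load_aggregates):
            # load_aggregates would wrap objects.AggregateList.get_all().
            self._load_aggregates = load_aggregates
            self._aggregates = None

        @property
        def aggregates(self):
            # The DB is only touched on first reference.
            if self._aggregates is None:
                self._aggregates = self._load_aggregates()
            return self._aggregates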

Does it make sense to permit nova-scheduler to complete service
startup in the absence of the database (and then retry the connection
in the background)?

-- 
Lars Kellogg-Stedman  | larsks @ {freenode,twitter,github}
Cloud Engineering / OpenStack  | http://blog.oddbit.com/





Re: [openstack-dev] Nova scheduler startup when database is not available

2015-12-23 Thread Mike Bayer


On 12/23/2015 01:32 PM, Jay Pipes wrote:
> On 12/23/2015 12:27 PM, Lars Kellogg-Stedman wrote:
>> I've been looking into the startup constraints involved when launching
>> Nova services with systemd using Type=notify (which causes systemd to
>> wait for an explicit notification from the service before considering
>> it to be "started").  Some services (e.g., nova-conductor) will happily
>> "start" even if the backing database is currently unavailable (and
>> will enter a retry loop waiting for the database).
>>
>> Other services -- specifically, nova-scheduler -- will block waiting
>> for the database *before* providing systemd with the necessary
>> notification.
>>
>> nova-scheduler blocks because it wants to initialize a list of
>> available aggregates (in scheduler.host_manager.HostManager.__init__),
>> which it gets by calling objects.AggregateList.get_all.
>>
>> Does it make sense to block service startup at this stage?  The
>> database disappearing during runtime isn't a hard error -- we will
>> retry and reconnect when it comes back -- so should the same situation
>> at startup be a hard error?  As an operator, I am more interested in
>> "did my configuration files parse correctly?" at startup, and would
>> generally prefer the service to start (and permit any dependent
>> services to start) even when the database isn't up (because that's
>> probably a situation of which I am already aware).
> 
> If your configuration file parsed correctly but has the wrong database
> connection URI, what good is the service in an active state? It won't be
> able to do anything at all.

this is true, but to be fair, Nova doesn't work like this at all, at
least not in nova/db/sqlalchemy/api.py.  It is very intentionally
designed to *not* connect to the database until an API call is first
accessed, to the extent that it does an end-run around oslo.db's
create_engine() feature which itself does a "test" connection when it is
called (FTR, SQLAlchemy's create_engine() that is called by oslo.db is
in fact a lazy-initializing function).  I find it quite awkward
overall that oslo.db reverses SQLAlchemy's "laziness", but then nova and
others re-reverse *back* to "laziness", but at the expense of allowing
oslo.db's create_engine() to receive its configuration up front.
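(To illustrate the distinction, with no OpenStack code involved and assuming
the MySQL driver is installed:)

    from sqlalchemy import create_engine

    # SQLAlchemy's create_engine() is lazy: no connection is attempted
    # here, even though this URI points at an unreachable host.
    engine = create_engine("mysql+pymysql://nova:pw@192.0.2.1/nova")

    # Only the first actual use opens a connection (and would fail here):
    # engine.connect()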

In the reworked enginefacade API I went through a lot of effort to
replicate this behavior.  It would be nice if all OpenStack apps could
just pick one paradigm and stick with it so that we can just make
oslo.db do *one* pattern and that's all (probably too late though).


> 
> This is why I think it's better to have hard checks like for connections
> on startup and not have services active if they won't be able to do
> anything useful.
> 
>> It would be relatively easy to have the scheduler lazy-load the list
>> of aggregates on first references, rather than at __init__.
> 
> Sure, but if the root cause of the issue is a problem due to
> misconfigured connection string, then that lazy-load will just bomb out
> and the scheduler will be useless anyway. I'd rather have a
> fail-early/fast occur here than a fail-late.
> 
> Best,
> -jay
> 
>> I'm not
>> familiar enough with the nova code to know if there would be any
>> undesirable implications of this behavior.  We're already punting
>> initializing the list of instances to an asynchronous task in order to
>> avoid blocking service startup.
>>
>> Does it make sense to permit nova-scheduler to complete service
>> startup in the absence of the database (and then retry the connection
>> in the background)?
>>
>>
>>


Re: [openstack-dev] Nova scheduler startup when database is not available

2015-12-23 Thread Morgan Fainberg
On Wed, Dec 23, 2015 at 10:32 AM, Jay Pipes  wrote:

> On 12/23/2015 12:27 PM, Lars Kellogg-Stedman wrote:
>
>> I've been looking into the startup constraints involved when launching
>> Nova services with systemd using Type=notify (which causes systemd to
>> wait for an explicit notification from the service before considering
>> it to be "started").  Some services (e.g., nova-conductor) will happily
>> "start" even if the backing database is currently unavailable (and
>> will enter a retry loop waiting for the database).
>>
>> Other services -- specifically, nova-scheduler -- will block waiting
>> for the database *before* providing systemd with the necessary
>> notification.
>>
>> nova-scheduler blocks because it wants to initialize a list of
>> available aggregates (in scheduler.host_manager.HostManager.__init__),
>> which it gets by calling objects.AggregateList.get_all.
>>
>> Does it make sense to block service startup at this stage?  The
>> database disappearing during runtime isn't a hard error -- we will
>> retry and reconnect when it comes back -- so should the same situation
>> at startup be a hard error?  As an operator, I am more interested in
>> "did my configuration files parse correctly?" at startup, and would
>> generally prefer the service to start (and permit any dependent
>> services to start) even when the database isn't up (because that's
>> probably a situation of which I am already aware).
>>
>
> If your configuration file parsed correctly but has the wrong database
> connection URI, what good is the service in an active state? It won't be
> able to do anything at all.
>
> This is why I think it's better to have hard checks like for connections
> on startup and not have services active if they won't be able to do
> anything useful.
>
>
Are you advocating that scheduler bails out and ceases to run or that it
doesn't mark itself as active? I am in favour of the second scenario but
not the first. There are cases where it would be nice to start the
scheduler and have it at least report "hey I can't contact the DB" but not
mark itself active, but continue to run and on an interval report/try to
reconnect.

It isn't clear which level of "hard check" you're advocating in your
response and I want to clarify for the sake of conversation.
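(A generic sketch of that second scenario, with hypothetical callables rather
than nova code: the process stays up and keeps retrying, but only signals
"active" -- to systemd, a registry, whatever -- once the DB actually answers:)

    import logging
    import time

    LOG = logging.getLogger(__name__)

    def run_until_db_ready(try_db_connect, mark_active, interval=5):
        # Keep the service process alive, retry the DB, and only mark
        # ourselves active once a connection actually succeeds.
        while True:
            try:
                try_db_connect()
            except Exception:
                LOG.warning("cannot contact the DB; retrying in %ss", interval)
                time.sleep(interval)
                continue
            mark_active()
            return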


> It would be relatively easy to have the scheduler lazy-load the list
>> of aggregates on first references, rather than at __init__.
>>
>
> Sure, but if the root cause of the issue is a problem due to misconfigured
> connection string, then that lazy-load will just bomb out and the scheduler
> will be useless anyway. I'd rather have a fail-early/fast occur here than a
> fail-late.
>
> Best,
> -jay
>
> > I'm not
>
>> familiar enough with the nova code to know if there would be any
>> undesirable implications of this behavior.  We're already punting
>> initializing the list of instances to an asynchronous task in order to
>> avoid blocking service startup.
>>
>> Does it make sense to permit nova-scheduler to complete service
>> startup in the absence of the database (and then retry the connection
>> in the background)?
>>
>>
>>