Re: [openstack-dev] [nova][placement] Placement requests and caching in the resource tracker

2018-11-06 Thread Eric Fried
I do intend to respond to all the excellent discussion on this thread,
but right now I just want to offer an update on the code:

I've split the effort apart into multiple changes starting at [1]. A few
of these are ready for review.

One opinion was that a specless blueprint would be appropriate. If
there's consensus on this, I'll spin one up.

[1] https://review.openstack.org/#/c/615606/

On 11/5/18 03:16, Belmiro Moreira wrote:
> Thanks Eric for the patch.
> This will help keep placement calls under control.
> 
> Belmiro
> 
> 
> On Sun, Nov 4, 2018 at 1:01 PM Jay Pipes wrote:
> 
> On 11/02/2018 03:22 PM, Eric Fried wrote:
> > All-
> >
> > Based on a (long) discussion yesterday [1] I have put up a patch [2]
> > whereby you can set [compute]resource_provider_association_refresh to
> > zero and the resource tracker will never* refresh the report client's
> > provider cache. Philosophically, we're removing the "healing"
> aspect of
> > the resource tracker's periodic and trusting that placement won't
> > diverge from whatever's in our cache. (If it does, it's because the op
> > hit the CLI, in which case they should SIGHUP - see below.)
> >
> > *except:
> > - When we initially create the compute node record and bootstrap its
> > resource provider.
> > - When the virt driver's update_provider_tree makes a change,
> > update_from_provider_tree reflects them in the cache as well as
> pushing
> > them back to placement.
> > - If update_from_provider_tree fails, the cache is cleared and gets
> > rebuilt on the next periodic.
> > - If you send SIGHUP to the compute process, the cache is cleared.
> >
> > This should dramatically reduce the number of calls to placement from
> > the compute service. Like, to nearly zero, unless something is
> actually
> > changing.
> >
> > Can I get some initial feedback as to whether this is worth
> polishing up
> > into something real? (It will probably need a bp/spec if so.)
> >
> > [1]
> >
> 
> http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-11-01.log.html#t2018-11-01T17:32:03
> > [2] https://review.openstack.org/#/c/614886/
> >
> > ==
> > Background
> > ==
> > In the Queens release, our friends at CERN noticed a serious spike in
> > the number of requests to placement from compute nodes, even in a
> > stable-state cloud. Given that we were in the process of adding a
> ton of
> > infrastructure to support sharing and nested providers, this was not
> > unexpected. Roughly, what was previously:
> >
> >   @periodic_task:
> >       GET /resource_providers/$compute_uuid
> >       GET /resource_providers/$compute_uuid/inventories
> >
> > became more like:
> >
> >   @periodic_task:
> >       # In Queens/Rocky, this would still just return the compute RP
> >       GET /resource_providers?in_tree=$compute_uuid
> >       # In Queens/Rocky, this would return nothing
> >       GET /resource_providers?member_of=...&required=MISC_SHARES...
> >       for each provider returned above:  # i.e. just one in Q/R
> >           GET /resource_providers/$compute_uuid/inventories
> >           GET /resource_providers/$compute_uuid/traits
> >           GET /resource_providers/$compute_uuid/aggregates
> >
> > In a cloud the size of CERN's, the load wasn't acceptable. But at the
> > time, CERN worked around the problem by disabling refreshing entirely.
> > (The fact that this seems to have worked for them is an
> encouraging sign
> > for the proposed code change.)
> >
> > We're not actually making use of most of that information, but it sets
> > the stage for things that we're working on in Stein and beyond, like
> > multiple VGPU types, bandwidth resource providers, accelerators, NUMA,
> > etc., so removing/reducing the amount of information we look at isn't
> > really an option strategically.
> 
> I support your idea of getting rid of the periodic refresh of the cache
> in the scheduler report client. Much of that was added in order to
> emulate the original way the resource tracker worked.
> 
> Most of the behaviour in the original resource tracker (and some of the
> code still in there for dealing with (surprise!) PCI passthrough
> devices
> and NUMA topology) was due to doing allocations on the compute node
> (the
> whole claims stuff). We needed to always be syncing the state of the
> compute_nodes and pci_devices table in the cell database with whatever
> usage information was being created/modified on the compute nodes [0].
> 
> All of the "healing" code that's in the resource tracker was basically
> to deal with "soft delete", migrations that didn't complete or work
> properly, and, again, to handle allocations becoming out-of-sync because
> the compute nodes were responsible for allocating (as opposed to the
> current situation we have where the placement service -- via the
> scheduler's call to claim_resources() -- is responsible for allocating
> resources [1]).

Re: [openstack-dev] [nova][placement] Placement requests and caching in the resource tracker

2018-11-05 Thread Matt Riedemann

On 11/5/2018 1:17 PM, Matt Riedemann wrote:
I'm thinking of a case like, resize an instance but rather than 
confirm/revert it, the user deletes the instance. That would clean up the 
allocations from the target node but potentially not from the source node.


Well this case is at least not an issue:

https://review.openstack.org/#/c/615644/

It took me a bit to sort out how that worked but it does and I've added 
a test to confirm it.


--

Thanks,

Matt

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][placement] Placement requests and caching in the resource tracker

2018-11-05 Thread Matt Riedemann

On 11/5/2018 12:28 PM, Mohammed Naser wrote:

Have you dug into any of the operations around these instances to
determine what might have gone wrong? For example, was a live migration
performed recently on these instances and if so, did it fail? How about
evacuations (rebuild from a down host).

To be honest, I have not; however, I suspect a lot of those happen because
the service which makes the claim is not the same one that deletes it.

I'm not sure if this is something that's possible, but say compute2 makes
a claim for migrating to compute1 but something fails there; the revert happens
on compute1, but compute1 is already borked so it doesn't work.

This isn't necessarily the exact case that's happening, but it's a summary
of what I believe happens.



The computes don't create the resource allocations in placement though, 
the scheduler does, unless this deployment still has at least one 
compute that is 

The compute service should only be removing allocations for things like 
server delete, failed move operation (cleanup the allocations created by 
the scheduler), or a successful move operation (cleanup the allocations 
for the source node held by the migration record).


I wonder if you have migration records (from the cell DB migrations 
table) holding allocations in placement for some reason, even though the 
migration is complete. I know you have an audit script to look for 
allocations that are not held by instances, assuming those instances 
have been deleted and the allocations were leaked, but they could have 
also been held by the migration record and maybe leaked that way? 
Although if you delete the instance, the related migrations records are 
also removed (but maybe not their allocations?). I'm thinking of a case 
like, resize an instance but rather than confirm/revert it, the user 
deletes the instance. That would clean up the allocations from the target 
node but potentially not from the source node.
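The audit idea above can be sketched in a few lines (a hypothetical helper, not the actual audit script): an allocation's consumer in placement may be an instance UUID or a migration UUID, so only consumers matching neither are leak candidates.

```python
# Hypothetical sketch of the audit described above, not the real script:
# an allocation consumer can be an instance UUID or a migration UUID, so
# only consumers known to *neither* table look leaked.

def find_leaked(allocation_consumers, instance_uuids, migration_uuids):
    known = set(instance_uuids) | set(migration_uuids)
    return sorted(c for c in allocation_consumers if c not in known)

leaks = find_leaked(
    allocation_consumers=['inst-1', 'mig-7', 'mig-9'],
    instance_uuids=['inst-1'],     # live instances in the cell DB
    migration_uuids=['mig-7'])     # in-progress migration records
print(leaks)  # ['mig-9']: possibly held by a completed/deleted migration
```

A consumer held only by a migration record, as in the resize-then-delete case above, would show up exactly like `mig-9` here.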


--

Thanks,

Matt

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][placement] Placement requests and caching in the resource tracker

2018-11-05 Thread Mohammed Naser
On Mon, Nov 5, 2018 at 4:17 PM Matt Riedemann  wrote:
>
> On 11/4/2018 4:22 AM, Mohammed Naser wrote:
> > Just for information sake, a clean state cloud which had no reported issues
> > over maybe a period of 2-3 months already has 4 allocations which are
> > incorrect and 12 allocations pointing to the wrong resource provider, so I
> > think this comes down to committing to either "self-healing" to fix those
> > issues or not.
>
> Is this running Rocky or an older release?

In this case, this is inside a Queens cloud, I can run the same script
on a Rocky
cloud too.

> Have you dug into any of the operations around these instances to
> determine what might have gone wrong? For example, was a live migration
> performed recently on these instances and if so, did it fail? How about
> evacuations (rebuild from a down host).

To be honest, I have not; however, I suspect a lot of those happen because
the service which makes the claim is not the same one that deletes it.

I'm not sure if this is something that's possible, but say compute2 makes
a claim for migrating to compute1 but something fails there; the revert happens
on compute1, but compute1 is already borked so it doesn't work.

This isn't necessarily the exact case that's happening, but it's a summary
of what I believe happens.

> By "4 allocations which are incorrect" I assume that means they are
> pointing at the correct compute node resource provider but the values
> for allocated VCPU, MEMORY_MB and DISK_GB are wrong? If so, how do the
> allocations align with old/new flavors used to resize the instance? Did
> the resize fail?

The allocated flavours usually are not wrong; they are simply associated
with the wrong resource provider (so it feels like a failed migration or resize).

> Are there mixed compute versions at all, i.e. are you moving instances
> around during a rolling upgrade?

Nope

> --
>
> Thanks,
>
> Matt
>
> __
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



--
Mohammed Naser — vexxhost
-
D. 514-316-8872
D. 800-910-1726 ext. 200
E. mna...@vexxhost.com
W. http://vexxhost.com

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][placement] Placement requests and caching in the resource tracker

2018-11-05 Thread Matt Riedemann

On 11/5/2018 5:52 AM, Chris Dent wrote:

* We need to have further discussion and investigation on
   allocations getting out of sync. Volunteers?


This is something I've already spent a lot of time on with the 
heal_allocations CLI, and have already started asking mnaser questions 
about this elsewhere in the thread.


--

Thanks,

Matt

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][placement] Placement requests and caching in the resource tracker

2018-11-05 Thread Matt Riedemann

On 11/4/2018 4:22 AM, Mohammed Naser wrote:

Just for information sake, a clean state cloud which had no reported issues
over maybe a period of 2-3 months already has 4 allocations which are
incorrect and 12 allocations pointing to the wrong resource provider, so I
think this comes down to committing to either "self-healing" to fix those
issues or not.


Is this running Rocky or an older release?

Have you dug into any of the operations around these instances to 
determine what might have gone wrong? For example, was a live migration 
performed recently on these instances and if so, did it fail? How about 
evacuations (rebuild from a down host).


By "4 allocations which are incorrect" I assume that means they are 
pointing at the correct compute node resource provider but the values 
for allocated VCPU, MEMORY_MB and DISK_GB are wrong? If so, how do the 
allocations align with old/new flavors used to resize the instance? Did 
the resize fail?


Are there mixed compute versions at all, i.e. are you moving instances 
around during a rolling upgrade?


--

Thanks,

Matt

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][placement] Placement requests and caching in the resource tracker

2018-11-05 Thread Tetsuro Nakamura
> Thus we should only read from placement:
> * at compute node startup
> * when a write fails
> And we should only write to placement:
> * at compute node startup
> * when the virt driver tells us something has changed


I agree with this.

We could also prepare an interface for operators/other-projects to force
nova to pull fresh information from placement and put it into its cache in
order to avoid predictable conflicts.
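Such a hook already exists in the proposal earlier in the thread: nova-compute clears its provider cache on SIGHUP. A minimal illustration of the idea (the dict is an illustrative stand-in for the report client's cache, not nova's actual data structure):

```python
import os
import signal

# Illustrative stand-in for the report client's provider cache.
provider_cache = {'cn1': {'generation': 3, 'traits': ['HW_CPU_X86_AVX']}}

def _clear_cache(signum, frame):
    # The next periodic repopulates the cache fresh from placement.
    provider_cache.clear()

signal.signal(signal.SIGHUP, _clear_cache)

# What an operator's `kill -HUP <compute pid>` would trigger:
os.kill(os.getpid(), signal.SIGHUP)
print(provider_cache)  # {}
```

So operators (or other projects) can already force a fresh pull without restarting the service.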

> Is that right? If it is not right, can we do that? If not, why not?


The same question from me.
The periodic refresh strategy might now be an optional optimization for
smaller clouds?

On Mon, Nov 5, 2018 at 20:53 Chris Dent wrote:

> On Sun, 4 Nov 2018, Jay Pipes wrote:
>
> > Now that we have generation markers protecting both providers and
> consumers,
> > we can rely on those generations to signal to the scheduler report
> client
> > that it needs to pull fresh information about a provider or consumer.
> So,
> > there's really no need to automatically and blindly refresh any more.
>
> I agree with this ^.
>
> I've been trying to tease out the issues in this thread and on the
> associated review [1] and I've decided that much of my confusion
> comes from the fact that we refer to a thing which is a "cache" in
> the resource tracker and either trusting it more or not having it at
> all, and I think that's misleading. To me a "cache" has multiple
> clients and there's some need for reconciliation and invalidation
> amongst them. The thing that's in the resource tracker is in one
> process, changes to it are synchronized; it's merely a data structure.
>
> Some words follow where I try to tease things out a bit more (mostly
> for my own sake, but if it helps other people, great). At the very
> end there's a bit of a list of suggested todos for us to consider.
>
> What we have is a data structure which represents the resource
> tracker and virtdriver's current view on what providers and
> associates it is aware of. We maintain a boundary between the RT and
> the virtdriver that means there's "updating" going on that sometimes
> is a bit fussy to resolve (cf. recent adjustments to allocation
> ratio handling).
>
> In the old way, every now and again we get a bunch of info from
> placement to confirm that our view is right and try to reconcile
> things.
>
> What we're considering moving towards is only doing that "get a
> bunch of info from placement" when we fail to write to placement
> because of a generation conflict.
>
> Thus we should only read from placement:
>
> * at compute node startup
> * when a write fails
>
> And we should only write to placement:
>
> * at compute node startup
> * when the virt driver tells us something has changed
>
> Is that right? If it is not right, can we do that? If not, why not?
>
> Because generations change, often, they guard against us making
> changes in ignorance and allow us to write blindly and only GET when
> we fail. We've got this everywhere now, let's use it. So, for
> example, even if something else besides the compute is adding
> traits, it's cool. We'll fail when we (the compute) try to clobber.
>
> Elsewhere in the thread several other topics were raised. A lot of
> that boils back to "what are we actually trying to do in the
> periodics?". As is often the case (and appropriately so) what we're
> trying to do has evolved and accreted in an organic fashion and it
> is probably time for us to re-evaluate and make sure we're doing the
> right stuff. The first step is writing that down. That aspect has
> always been pretty obscure or tribal to me, I presume so for others.
> So doing a legit audit of that code and the goals is something we
> should do.
>
> Mohammed's comments about allocations getting out of sync are
> important. I agree with him that it would be excellent if we could
> go back to self-healing those, especially because of the "wait for
> the computes to automagically populate everything" part he mentions.
> However, that aspect, while related to this, is not quite the same
> thing. The management of allocations and the management of
> inventories (and "associates") is happening from different angles.
>
> And finally, even if we turn off these refreshes to lighten the
> load, placement still needs to be capable of dealing with frequent
> requests, so we have something to fix there. We need to do the
> analysis to find out where the cost is and implement some solutions.
> At the moment we don't know where it is. It could be:
>
> * In the database server
> * In the python code that marshals the data around those calls to
>    the database
> * In the python code that handles the WSGI interactions
> * In the web server that is talking to the python code
>
> belmoreira's document [2] suggests some avenues of investigation
> (most CPU time is in user space and not waiting) but we'd need a bit
> more information to plan any concrete next steps:
>
> * what's the web server and which wsgi configuration?
> * where's the database, if it's different what's the load there?

Re: [openstack-dev] [nova][placement] Placement requests and caching in the resource tracker

2018-11-05 Thread Chris Dent

On Sun, 4 Nov 2018, Jay Pipes wrote:

Now that we have generation markers protecting both providers and consumers, 
we can rely on those generations to signal to the scheduler report client 
that it needs to pull fresh information about a provider or consumer. So, 
there's really no need to automatically and blindly refresh any more.


I agree with this ^.

I've been trying to tease out the issues in this thread and on the
associated review [1] and I've decided that much of my confusion
comes from the fact that we refer to a thing which is a "cache" in
the resource tracker and talk about either trusting it more or not
having it at all, and I think that's misleading. To me a "cache" has
multiple clients and there's some need for reconciliation and
invalidation amongst them. The thing that's in the resource tracker
is in one process, and changes to it are synchronized; it's merely a
data structure.

Some words follow where I try to tease things out a bit more (mostly
for my own sake, but if it helps other people, great). At the very
end there's a bit of a list of suggested todos for us to consider.

What we have is a data structure which represents the resource
tracker and virtdriver's current view on what providers and
associates it is aware of. We maintain a boundary between the RT and
the virtdriver that means there's "updating" going on that sometimes
is a bit fussy to resolve (cf. recent adjustments to allocation
ratio handling).

In the old way, every now and again we get a bunch of info from
placement to confirm that our view is right and try to reconcile
things.

What we're considering moving towards is only doing that "get a
bunch of info from placement" when we fail to write to placement
because of a generation conflict.

Thus we should only read from placement:

* at compute node startup
* when a write fails

And we should only write to placement:

* at compute node startup
* when the virt driver tells us something has changed

Is that right? If it is not right, can we do that? If not, why not?

Because generations change, often, they guard against us making
changes in ignorance and allow us to write blindly and only GET when
we fail. We've got this everywhere now, let's use it. So, for
example, even if something else besides the compute is adding
traits, it's cool. We'll fail when we (the compute) try to clobber.
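That "write blindly, re-read only on failure" pattern can be sketched as follows. The classes are illustrative stand-ins (placement actually signals a stale provider generation with an HTTP 409 Conflict over the REST API):

```python
# Illustrative sketch: write with the cached generation, and only fetch a
# fresh view when the write fails with a generation conflict.

class Conflict(Exception):
    """Stands in for placement's 409 response on a stale generation."""

class FakeProvider:
    """Stand-in for a resource provider as placement sees it."""
    def __init__(self):
        self.generation = 0
        self.traits = set()

    def put_traits(self, generation, traits):
        if generation != self.generation:
            raise Conflict()       # someone else updated this provider
        self.traits = set(traits)
        self.generation += 1
        return self.generation

def set_traits(provider, cached_generation, traits):
    """Try the cached generation first; refresh and retry once on conflict."""
    try:
        return provider.put_traits(cached_generation, traits)
    except Conflict:
        fresh = provider.generation    # the "GET only when we fail" step
        return provider.put_traits(fresh, traits)

rp = FakeProvider()
rp.put_traits(0, {'HW_CPU_X86_AVX'})   # out-of-band change; generation is now 1
print(set_traits(rp, cached_generation=0, traits={'CUSTOM_FOO'}))  # 2
```

The compute's stale write fails, it refreshes, and the retry with the fresh generation lands, with no periodic polling needed.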

Elsewhere in the thread several other topics were raised. A lot of
that boils back to "what are we actually trying to do in the
periodics?". As is often the case (and appropriately so) what we're
trying to do has evolved and accreted in an organic fashion and it
is probably time for us to re-evaluate and make sure we're doing the
right stuff. The first step is writing that down. That aspect has
always been pretty obscure or tribal to me, I presume so for others.
So doing a legit audit of that code and the goals is something we
should do.

Mohammed's comments about allocations getting out of sync are
important. I agree with him that it would be excellent if we could
go back to self-healing those, especially because of the "wait for
the computes to automagically populate everything" part he mentions.
However, that aspect, while related to this, is not quite the same
thing. The management of allocations and the management of
inventories (and "associates") is happening from different angles.

And finally, even if we turn off these refreshes to lighten the
load, placement still needs to be capable of dealing with frequent
requests, so we have something to fix there. We need to do the
analysis to find out where the cost is and implement some solutions.
At the moment we don't know where it is. It could be:

* In the database server
* In the python code that marshals the data around those calls to
  the database
* In the python code that handles the WSGI interactions
* In the web server that is talking to the python code

belmoreira's document [2] suggests some avenues of investigation
(most CPU time is in user space and not waiting) but we'd need a bit
more information to plan any concrete next steps:

* what's the web server and which wsgi configuration?
* where's the database, if it's different what's the load there?

I suspect there's a lot we can do to make our code more correct and
efficient. And beyond that there is a great deal of standard run-of-
the mill server-side caching and etag handling that we could
implement if necessary. That is: treat placement like a web app that
needs to be optimized in the usual ways.
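One of those "usual ways" can be sketched briefly: strong ETags so unchanged responses can be answered with a 304 and no body. This is an illustration of the general HTTP technique, not placement's actual implementation:

```python
import hashlib

def make_etag(body):
    # Strong ETag derived from the response body (illustrative choice).
    return '"%s"' % hashlib.sha1(body).hexdigest()

def respond(body, if_none_match=None):
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, b'', etag      # client's cached copy is still fresh
    return 200, body, etag

payload = b'{"inventories": {"VCPU": {"total": 8}}}'
status1, _, etag = respond(payload)                        # first GET
status2, body2, _ = respond(payload, if_none_match=etag)   # revalidation
print(status1, status2)  # 200 304
```

For mostly-static provider data, repeated polls then cost a hash and a header comparison rather than a full marshalled payload.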

As Eric suggested at the start of the thread, this kind of
investigation is expected and normal. We've not done something
wrong. Make it, make it correct, make it fast is the process.
We're oscillating somewhere between 2 and 3.

So in terms of actions:

* I'm pretty well situated to do some deeper profiling and
  benchmarking of placement to find the elbows in that.

* It seems like Eric and Jay are probably best situated to define
  and refine what should really be going on with the 

Re: [openstack-dev] [nova][placement] Placement requests and caching in the resource tracker

2018-11-05 Thread Belmiro Moreira
Thanks Eric for the patch.
This will help keep placement calls under control.

Belmiro


On Sun, Nov 4, 2018 at 1:01 PM Jay Pipes  wrote:

> On 11/02/2018 03:22 PM, Eric Fried wrote:
> > All-
> >
> > Based on a (long) discussion yesterday [1] I have put up a patch [2]
> > whereby you can set [compute]resource_provider_association_refresh to
> > zero and the resource tracker will never* refresh the report client's
> > provider cache. Philosophically, we're removing the "healing" aspect of
> > the resource tracker's periodic and trusting that placement won't
> > diverge from whatever's in our cache. (If it does, it's because the op
> > hit the CLI, in which case they should SIGHUP - see below.)
> >
> > *except:
> > - When we initially create the compute node record and bootstrap its
> > resource provider.
> > - When the virt driver's update_provider_tree makes a change,
> > update_from_provider_tree reflects them in the cache as well as pushing
> > them back to placement.
> > - If update_from_provider_tree fails, the cache is cleared and gets
> > rebuilt on the next periodic.
> > - If you send SIGHUP to the compute process, the cache is cleared.
> >
> > This should dramatically reduce the number of calls to placement from
> > the compute service. Like, to nearly zero, unless something is actually
> > changing.
> >
> > Can I get some initial feedback as to whether this is worth polishing up
> > into something real? (It will probably need a bp/spec if so.)
> >
> > [1]
> >
> http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-11-01.log.html#t2018-11-01T17:32:03
> > [2] https://review.openstack.org/#/c/614886/
> >
> > ==
> > Background
> > ==
> > In the Queens release, our friends at CERN noticed a serious spike in
> > the number of requests to placement from compute nodes, even in a
> > stable-state cloud. Given that we were in the process of adding a ton of
> > infrastructure to support sharing and nested providers, this was not
> > unexpected. Roughly, what was previously:
> >
> >   @periodic_task:
> >   GET /resource_providers/$compute_uuid
> >   GET /resource_providers/$compute_uuid/inventories
> >
> > became more like:
> >
> >   @periodic_task:
> >   # In Queens/Rocky, this would still just return the compute RP
> >   GET /resource_providers?in_tree=$compute_uuid
> >   # In Queens/Rocky, this would return nothing
> >   GET /resource_providers?member_of=...&required=MISC_SHARES...
> >   for each provider returned above:  # i.e. just one in Q/R
> >   GET /resource_providers/$compute_uuid/inventories
> >   GET /resource_providers/$compute_uuid/traits
> >   GET /resource_providers/$compute_uuid/aggregates
> >
> > In a cloud the size of CERN's, the load wasn't acceptable. But at the
> > time, CERN worked around the problem by disabling refreshing entirely.
> > (The fact that this seems to have worked for them is an encouraging sign
> > for the proposed code change.)
> >
> > We're not actually making use of most of that information, but it sets
> > the stage for things that we're working on in Stein and beyond, like
> > multiple VGPU types, bandwidth resource providers, accelerators, NUMA,
> > etc., so removing/reducing the amount of information we look at isn't
> > really an option strategically.
>
> I support your idea of getting rid of the periodic refresh of the cache
> in the scheduler report client. Much of that was added in order to
> emulate the original way the resource tracker worked.
>
> Most of the behaviour in the original resource tracker (and some of the
> code still in there for dealing with (surprise!) PCI passthrough devices
> and NUMA topology) was due to doing allocations on the compute node (the
> whole claims stuff). We needed to always be syncing the state of the
> compute_nodes and pci_devices table in the cell database with whatever
> usage information was being created/modified on the compute nodes [0].
>
> All of the "healing" code that's in the resource tracker was basically
> to deal with "soft delete", migrations that didn't complete or work
> properly, and, again, to handle allocations becoming out-of-sync because
> the compute nodes were responsible for allocating (as opposed to the
> current situation we have where the placement service -- via the
> scheduler's call to claim_resources() -- is responsible for allocating
> resources [1]).
>
> Now that we have generation markers protecting both providers and
> consumers, we can rely on those generations to signal to the scheduler
> report client that it needs to pull fresh information about a provider
> or consumer. So, there's really no need to automatically and blindly
> refresh any more.
>
> Best,
> -jay
>
> [0] We always need to be syncing those tables because those tables,
> unlike the placement database's data modeling, couple both inventory AND
> usage in the same table structure...
>
> [1] again, except for PCI devices and NUMA topology, because of the
> tight coupling in place with the different resource trackers those
> types of resources use...

Re: [openstack-dev] [nova][placement] Placement requests and caching in the resource tracker

2018-11-04 Thread Jay Pipes

On 11/02/2018 03:22 PM, Eric Fried wrote:

All-

Based on a (long) discussion yesterday [1] I have put up a patch [2]
whereby you can set [compute]resource_provider_association_refresh to
zero and the resource tracker will never* refresh the report client's
provider cache. Philosophically, we're removing the "healing" aspect of
the resource tracker's periodic and trusting that placement won't
diverge from whatever's in our cache. (If it does, it's because the op
hit the CLI, in which case they should SIGHUP - see below.)

*except:
- When we initially create the compute node record and bootstrap its
resource provider.
- When the virt driver's update_provider_tree makes a change,
update_from_provider_tree reflects them in the cache as well as pushing
them back to placement.
- If update_from_provider_tree fails, the cache is cleared and gets
rebuilt on the next periodic.
- If you send SIGHUP to the compute process, the cache is cleared.
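The gating that the option implies can be sketched in a few lines (illustrative names, not nova's actual report client code), where an interval of zero means the periodic never refreshes and we trust the cache until SIGHUP or a failed write clears it:

```python
# Illustrative sketch of how resource_provider_association_refresh = 0
# could disable the periodic cache refresh entirely.

class ProviderCacheRefresh:
    def __init__(self, refresh_interval, now=0.0):
        self.refresh_interval = refresh_interval  # seconds; 0 disables it
        self.last_refreshed = now

    def maybe_refresh(self, now):
        """Return True when this periodic pass should re-poll placement."""
        if self.refresh_interval == 0:
            return False   # trust the cache; SIGHUP/failed writes reset it
        if now - self.last_refreshed >= self.refresh_interval:
            self.last_refreshed = now
            return True
        return False

never = ProviderCacheRefresh(refresh_interval=0)
print(any(never.maybe_refresh(now=t) for t in range(0, 3600, 60)))  # False
```

With a nonzero interval the same check reproduces today's periodic polling; with zero, placement is only consulted at the exception points listed above.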

This should dramatically reduce the number of calls to placement from
the compute service. Like, to nearly zero, unless something is actually
changing.

Can I get some initial feedback as to whether this is worth polishing up
into something real? (It will probably need a bp/spec if so.)

[1]
http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-11-01.log.html#t2018-11-01T17:32:03
[2] https://review.openstack.org/#/c/614886/

==
Background
==
In the Queens release, our friends at CERN noticed a serious spike in
the number of requests to placement from compute nodes, even in a
stable-state cloud. Given that we were in the process of adding a ton of
infrastructure to support sharing and nested providers, this was not
unexpected. Roughly, what was previously:

  @periodic_task:
  GET /resource_providers/$compute_uuid
  GET /resource_providers/$compute_uuid/inventories

became more like:

  @periodic_task:
  # In Queens/Rocky, this would still just return the compute RP
  GET /resource_providers?in_tree=$compute_uuid
  # In Queens/Rocky, this would return nothing
  GET /resource_providers?member_of=...&required=MISC_SHARES...
  for each provider returned above:  # i.e. just one in Q/R
  GET /resource_providers/$compute_uuid/inventories
  GET /resource_providers/$compute_uuid/traits
  GET /resource_providers/$compute_uuid/aggregates
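
The expanded sequence above can be made concrete with a small sketch (illustrative names such as `FakePlacement` and `get_json`, not nova's real report client), which also shows why the request count multiplies per provider:

```python
# Illustrative sketch of the per-periodic request pattern shown above.

class FakePlacement:
    """Stand-in that records request paths instead of calling placement."""
    def __init__(self, tree, sharers):
        self.calls = []
        self._tree = tree        # providers returned by the in_tree query
        self._sharers = sharers  # sharing providers (empty in Queens/Rocky)

    def get_json(self, path, params=None):
        self.calls.append(path)
        params = params or {}
        if 'in_tree' in params:
            return self._tree
        if 'member_of' in params:
            return self._sharers
        return {}

def refresh_provider_tree(client, compute_uuid):
    """One periodic pass: two list queries plus three GETs per provider."""
    providers = client.get_json('/resource_providers',
                                params={'in_tree': compute_uuid})
    sharers = client.get_json(
        '/resource_providers',
        params={'member_of': '...', 'required': 'MISC_SHARES_VIA_AGGREGATE'})
    for rp in providers + sharers:
        base = '/resource_providers/%s' % rp['uuid']
        client.get_json(base + '/inventories')
        client.get_json(base + '/traits')
        client.get_json(base + '/aggregates')
    return len(client.calls)

client = FakePlacement(tree=[{'uuid': 'cn1'}], sharers=[])
print(refresh_provider_tree(client, 'cn1'))  # 5 requests for a lone compute RP
```

Even the single-provider case costs five requests per periodic per compute, which is what adds up at CERN's scale.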

In a cloud the size of CERN's, the load wasn't acceptable. But at the
time, CERN worked around the problem by disabling refreshing entirely.
(The fact that this seems to have worked for them is an encouraging sign
for the proposed code change.)

We're not actually making use of most of that information, but it sets
the stage for things that we're working on in Stein and beyond, like
multiple VGPU types, bandwidth resource providers, accelerators, NUMA,
etc., so removing/reducing the amount of information we look at isn't
really an option strategically.


I support your idea of getting rid of the periodic refresh of the cache 
in the scheduler report client. Much of that was added in order to 
emulate the original way the resource tracker worked.


Most of the behaviour in the original resource tracker (and some of the 
code still in there for dealing with (surprise!) PCI passthrough devices 
and NUMA topology) was due to doing allocations on the compute node (the 
whole claims stuff). We needed to always be syncing the state of the 
compute_nodes and pci_devices table in the cell database with whatever 
usage information was being created/modified on the compute nodes [0].


All of the "healing" code that's in the resource tracker was basically 
to deal with "soft delete", migrations that didn't complete or work 
properly, and, again, to handle allocations becoming out-of-sync because 
the compute nodes were responsible for allocating (as opposed to the 
current situation we have where the placement service -- via the 
scheduler's call to claim_resources() -- is responsible for allocating 
resources [1]).


Now that we have generation markers protecting both providers and 
consumers, we can rely on those generations to signal to the scheduler 
report client that it needs to pull fresh information about a provider 
or consumer. So, there's really no need to automatically and blindly 
refresh any more.
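A hedged sketch of the generation-marker pattern described above: each write carries the generation the client last saw, and the server rejects stale writes (placement answers HTTP 409), so the client refetches only when told to rather than refreshing blindly on a timer. Names here are illustrative, not placement's actual implementation:

```python
class GenerationConflict(Exception):
    """Raised when a write is based on a stale view of the provider."""

class FakeProvider:
    """Toy stand-in for a resource provider record with a generation."""

    def __init__(self):
        self.generation = 0
        self.inventory = {}

    def put_inventory(self, inventory, generation):
        # Reject the write if the caller's view is stale; the caller
        # should refetch the provider and retry.
        if generation != self.generation:
            raise GenerationConflict("provider generation mismatch; refetch")
        self.inventory = inventory
        self.generation += 1
        return self.generation
```

The key property is that a stale writer is told so at write time, which is exactly what makes the blind periodic refresh unnecessary.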


Best,
-jay

[0] We always need to be syncing those tables because those tables, 
unlike the placement database's data modeling, couple both inventory AND 
usage in the same table structure...


[1] again, except for PCI devices and NUMA topology, because of the 
tight coupling in place with the different resource trackers those types 
of resources use...



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][placement] Placement requests and caching in the resource tracker

2018-11-04 Thread Mohammed Naser
Ugh, hit send accidentally. Please take my comments lightly, as I have not been
as involved with the development; I'm just chiming in as an operator with some
ideas.

On Fri, Nov 2, 2018 at 9:32 PM Matt Riedemann  wrote:
>
> On 11/2/2018 2:22 PM, Eric Fried wrote:
> > Based on a (long) discussion yesterday [1] I have put up a patch [2]
> > whereby you can set [compute]resource_provider_association_refresh to
> > zero and the resource tracker will never* refresh the report client's
> > provider cache. Philosophically, we're removing the "healing" aspect of
> > the resource tracker's periodic and trusting that placement won't
> > diverge from whatever's in our cache. (If it does, it's because the op
> > hit the CLI, in which case they should SIGHUP - see below.)
> >
> > *except:
> > - When we initially create the compute node record and bootstrap its
> > resource provider.
> > - When the virt driver's update_provider_tree makes a change,
> > update_from_provider_tree reflects them in the cache as well as pushing
> > them back to placement.
> > - If update_from_provider_tree fails, the cache is cleared and gets
> > rebuilt on the next periodic.
> > - If you send SIGHUP to the compute process, the cache is cleared.
> >
> > This should dramatically reduce the number of calls to placement from
> > the compute service. Like, to nearly zero, unless something is actually
> > changing.
> >
> > Can I get some initial feedback as to whether this is worth polishing up
> > into something real? (It will probably need a bp/spec if so.)
> >
> > [1]
> > http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-11-01.log.html#t2018-11-01T17:32:03
> > [2] https://review.openstack.org/#/c/614886/
> >
> > ==
> > Background
> > ==
> > In the Queens release, our friends at CERN noticed a serious spike in
> > the number of requests to placement from compute nodes, even in a
> > stable-state cloud. Given that we were in the process of adding a ton of
> > infrastructure to support sharing and nested providers, this was not
> > unexpected. Roughly, what was previously:
> >
> >   @periodic_task:
> >       GET /resource_providers/$compute_uuid
> >       GET /resource_providers/$compute_uuid/inventories
> >
> > became more like:
> >
> >   @periodic_task:
> >   # In Queens/Rocky, this would still just return the compute RP
> >   GET /resource_providers?in_tree=$compute_uuid
> >   # In Queens/Rocky, this would return nothing
> >   GET /resource_providers?member_of=...&required=MISC_SHARES...
> >   for each provider returned above:  # i.e. just one in Q/R
> >       GET /resource_providers/$compute_uuid/inventories
> >       GET /resource_providers/$compute_uuid/traits
> >       GET /resource_providers/$compute_uuid/aggregates
> >
> > In a cloud the size of CERN's, the load wasn't acceptable. But at the
> > time, CERN worked around the problem by disabling refreshing entirely.
> > (The fact that this seems to have worked for them is an encouraging sign
> > for the proposed code change.)
> >
> > We're not actually making use of most of that information, but it sets
> > the stage for things that we're working on in Stein and beyond, like
> > multiple VGPU types, bandwidth resource providers, accelerators, NUMA,
> > etc., so removing/reducing the amount of information we look at isn't
> > really an option strategically.
>
> A few random points from the long discussion that should probably be
> re-posed here for wider thought:
>
> * There was probably a lot of discussion about why we needed to do this
> caching and stuff in the compute in the first place. What has changed
> that we no longer need to aggressively refresh the cache on every
> periodic? I thought initially it came up because people really wanted
> the compute to be fully self-healing to any external changes, including
> hot plugging resources like disk on the host to automatically reflect
> those changes in inventory. Similarly, external user/service
> interactions with the placement API which would then be automatically
> picked up by the next periodic run - is that no longer a desire, and/or
> how was the decision made previously that simply requiring a SIGHUP in
> that case wasn't sufficient/desirable.

I think that would be nice to have; however, at the moment, from an
operator's perspective, it looks like the placement service can get out
of sync pretty easily. So I think it'd be good to commit to either
really making it self-heal (delete stale allocations, create the ones
that should be there) or removing all the self-healing stuff.

Also, if we take the self-healing route and implement it fully, it might
make the placement split much easier, because we could just switch over
and wait for the computes to automagically populate everything, but that's
the type of operation that happens once in the lifetime of a cloud.

Just for information sake, a clean state cloud which had no reported issues
over maybe a period 

Re: [openstack-dev] [nova][placement] Placement requests and caching in the resource tracker

2018-11-04 Thread Mohammed Naser
On Fri, Nov 2, 2018 at 9:32 PM Matt Riedemann  wrote:
>
> On 11/2/2018 2:22 PM, Eric Fried wrote:
> > Based on a (long) discussion yesterday [1] I have put up a patch [2]
> > whereby you can set [compute]resource_provider_association_refresh to
> > zero and the resource tracker will never* refresh the report client's
> > provider cache. Philosophically, we're removing the "healing" aspect of
> > the resource tracker's periodic and trusting that placement won't
> > diverge from whatever's in our cache. (If it does, it's because the op
> > hit the CLI, in which case they should SIGHUP - see below.)
> >
> > *except:
> > - When we initially create the compute node record and bootstrap its
> > resource provider.
> > - When the virt driver's update_provider_tree makes a change,
> > update_from_provider_tree reflects them in the cache as well as pushing
> > them back to placement.
> > - If update_from_provider_tree fails, the cache is cleared and gets
> > rebuilt on the next periodic.
> > - If you send SIGHUP to the compute process, the cache is cleared.
> >
> > This should dramatically reduce the number of calls to placement from
> > the compute service. Like, to nearly zero, unless something is actually
> > changing.
> >
> > Can I get some initial feedback as to whether this is worth polishing up
> > into something real? (It will probably need a bp/spec if so.)
> >
> > [1]
> > http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-11-01.log.html#t2018-11-01T17:32:03
> > [2] https://review.openstack.org/#/c/614886/
> >
> > ==
> > Background
> > ==
> > In the Queens release, our friends at CERN noticed a serious spike in
> > the number of requests to placement from compute nodes, even in a
> > stable-state cloud. Given that we were in the process of adding a ton of
> > infrastructure to support sharing and nested providers, this was not
> > unexpected. Roughly, what was previously:
> >
> >   @periodic_task:
> >       GET /resource_providers/$compute_uuid
> >       GET /resource_providers/$compute_uuid/inventories
> >
> > became more like:
> >
> >   @periodic_task:
> >   # In Queens/Rocky, this would still just return the compute RP
> >   GET /resource_providers?in_tree=$compute_uuid
> >   # In Queens/Rocky, this would return nothing
> >   GET /resource_providers?member_of=...&required=MISC_SHARES...
> >   for each provider returned above:  # i.e. just one in Q/R
> >       GET /resource_providers/$compute_uuid/inventories
> >       GET /resource_providers/$compute_uuid/traits
> >       GET /resource_providers/$compute_uuid/aggregates
> >
> > In a cloud the size of CERN's, the load wasn't acceptable. But at the
> > time, CERN worked around the problem by disabling refreshing entirely.
> > (The fact that this seems to have worked for them is an encouraging sign
> > for the proposed code change.)
> >
> > We're not actually making use of most of that information, but it sets
> > the stage for things that we're working on in Stein and beyond, like
> > multiple VGPU types, bandwidth resource providers, accelerators, NUMA,
> > etc., so removing/reducing the amount of information we look at isn't
> > really an option strategically.
>
> A few random points from the long discussion that should probably be
> re-posed here for wider thought:
>
> * There was probably a lot of discussion about why we needed to do this
> caching and stuff in the compute in the first place. What has changed
> that we no longer need to aggressively refresh the cache on every
> periodic? I thought initially it came up because people really wanted
> the compute to be fully self-healing to any external changes, including
> hot plugging resources like disk on the host to automatically reflect
> those changes in inventory. Similarly, external user/service
> interactions with the placement API which would then be automatically
> picked up by the next periodic run - is that no longer a desire, and/or
> how was the decision made previously that simply requiring a SIGHUP in
> that case wasn't sufficient/desirable.
>
> * I believe I made the point yesterday that we should probably not
> refresh by default, and let operators opt-in to that behavior if they
> really need it, i.e. they are frequently making changes to the
> environment, potentially by some external service (I could think of
> vCenter doing this to reflect changes from vCenter back into
> nova/placement), but I don't think that should be the assumed behavior
> by most and our defaults should reflect the "normal" use case.
>
> * I think I've noted a few times now that we don't actually use the
> provider aggregates information (yet) in the compute service. Nova host
> aggregate membership is mirrored to placement since Rocky [1] but that
> happens in the API, not in the compute. The only thing I can think of
> that relied on resource provider aggregate information in the compute is
> the shared storage providers concept, but that's not 

Re: [openstack-dev] [nova][placement] Placement requests and caching in the resource tracker

2018-11-02 Thread Matt Riedemann

On 11/2/2018 2:22 PM, Eric Fried wrote:

Based on a (long) discussion yesterday [1] I have put up a patch [2]
whereby you can set [compute]resource_provider_association_refresh to
zero and the resource tracker will never* refresh the report client's
provider cache. Philosophically, we're removing the "healing" aspect of
the resource tracker's periodic and trusting that placement won't
diverge from whatever's in our cache. (If it does, it's because the op
hit the CLI, in which case they should SIGHUP - see below.)

*except:
- When we initially create the compute node record and bootstrap its
resource provider.
- When the virt driver's update_provider_tree makes a change,
update_from_provider_tree reflects them in the cache as well as pushing
them back to placement.
- If update_from_provider_tree fails, the cache is cleared and gets
rebuilt on the next periodic.
- If you send SIGHUP to the compute process, the cache is cleared.

This should dramatically reduce the number of calls to placement from
the compute service. Like, to nearly zero, unless something is actually
changing.

Can I get some initial feedback as to whether this is worth polishing up
into something real? (It will probably need a bp/spec if so.)

[1]
http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-11-01.log.html#t2018-11-01T17:32:03
[2] https://review.openstack.org/#/c/614886/

==
Background
==
In the Queens release, our friends at CERN noticed a serious spike in
the number of requests to placement from compute nodes, even in a
stable-state cloud. Given that we were in the process of adding a ton of
infrastructure to support sharing and nested providers, this was not
unexpected. Roughly, what was previously:

  @periodic_task:
      GET /resource_providers/$compute_uuid
      GET /resource_providers/$compute_uuid/inventories

became more like:

  @periodic_task:
      # In Queens/Rocky, this would still just return the compute RP
      GET /resource_providers?in_tree=$compute_uuid
      # In Queens/Rocky, this would return nothing
      GET /resource_providers?member_of=...&required=MISC_SHARES...
      for each provider returned above:  # i.e. just one in Q/R
          GET /resource_providers/$compute_uuid/inventories
          GET /resource_providers/$compute_uuid/traits
          GET /resource_providers/$compute_uuid/aggregates

In a cloud the size of CERN's, the load wasn't acceptable. But at the
time, CERN worked around the problem by disabling refreshing entirely.
(The fact that this seems to have worked for them is an encouraging sign
for the proposed code change.)

We're not actually making use of most of that information, but it sets
the stage for things that we're working on in Stein and beyond, like
multiple VGPU types, bandwidth resource providers, accelerators, NUMA,
etc., so removing/reducing the amount of information we look at isn't
really an option strategically.


A few random points from the long discussion that should probably be 
re-posed here for wider thought:


* There was probably a lot of discussion about why we needed to do this 
caching and stuff in the compute in the first place. What has changed 
that we no longer need to aggressively refresh the cache on every 
periodic? I thought initially it came up because people really wanted 
the compute to be fully self-healing to any external changes, including 
hot plugging resources like disk on the host to automatically reflect 
those changes in inventory. Similarly, external user/service 
interactions with the placement API which would then be automatically 
picked up by the next periodic run - is that no longer a desire, and/or 
how was the decision made previously that simply requiring a SIGHUP in 
that case wasn't sufficient/desirable.


* I believe I made the point yesterday that we should probably not 
refresh by default, and let operators opt-in to that behavior if they 
really need it, i.e. they are frequently making changes to the 
environment, potentially by some external service (I could think of 
vCenter doing this to reflect changes from vCenter back into 
nova/placement), but I don't think that should be the assumed behavior 
by most and our defaults should reflect the "normal" use case.
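If the default were flipped as suggested, opting back in (or out) would be a one-line configuration change. A hedged sketch, assuming the option name from Eric's patch:

```ini
[compute]
# 0 disables the periodic provider-cache refresh entirely (per the
# proposed patch); a positive value is the refresh interval in seconds.
resource_provider_association_refresh = 0
```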


* I think I've noted a few times now that we don't actually use the 
provider aggregates information (yet) in the compute service. Nova host 
aggregate membership is mirrored to placement since Rocky [1] but that 
happens in the API, not in the compute. The only thing I can think of 
that relied on resource provider aggregate information in the compute is 
the shared storage providers concept, but that's not supported (yet) 
[2]. So do we need to keep retrieving aggregate information when nothing 
in compute uses it yet?


* Similarly, why do we need to get traits on each periodic? The only 
in-tree virt driver I'm aware of that *reports* traits is the libvirt 
driver for CPU features [3]. Otherwise I think the idea behind getting 
the