Re: [openstack-dev] [Openstack-operators] [nova] about resize the instance

2018-11-08 Thread Chris Friesen

On 11/8/2018 5:30 AM, Rambo wrote:


  When I resize the instance, the compute node reports "libvirtError: 
internal error: qemu unexpectedly closed the monitor: 
2018-11-08T09:42:04.695681Z qemu-kvm: cannot set up guest memory 
'pc.ram': Cannot allocate memory". Has anyone seen this situation? The 
ram_allocation_ratio is set to 3 in nova.conf and the total memory is 
125G. When I use the "nova hypervisor-show server" command, the compute 
node's free_ram_mb is -45G. Is this the result of excessive use of 
memory?

Can you give me some suggestions about this? Thank you very much.


I suspect that you simply don't have any available memory on that system.

What is your kernel overcommit setting on the host?  If 
/proc/sys/vm/overcommit_memory is set to 2, then try either changing the 
overcommit ratio or setting it to 1 to see if that makes a difference.
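
For example, something along these lines (the values are illustrative):

  # check the current overcommit policy, ratio, and free memory
  cat /proc/sys/vm/overcommit_memory    # 2 means strict accounting
  cat /proc/sys/vm/overcommit_ratio
  grep MemAvailable /proc/meminfo

  # either allow heuristic overcommit...
  sudo sysctl -w vm.overcommit_memory=1
  # ...or stay in strict mode but raise the ratio
  sudo sysctl -w vm.overcommit_ratio=95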


Chris



Re: [openstack-dev] [nova][limits] Does ANYONE at all use the quota class functionality in Nova?

2018-10-25 Thread Chris Friesen

On 10/25/2018 12:00 PM, Jay Pipes wrote:

On 10/25/2018 01:38 PM, Chris Friesen wrote:

On 10/24/2018 9:10 AM, Jay Pipes wrote:
Nova's API has the ability to create "quota classes", which are 
basically limits for a set of resource types. There is something 
called the "default quota class" which corresponds to the limits in 
the CONF.quota section. Quota classes are basically templates of 
limits to be applied if the calling project doesn't have any stored 
project-specific limits.


Has anyone ever created a quota class that is different from "default"?


The Compute API specifically says:

"Only ‘default’ quota class is valid and used to set the default 
quotas, all other quota class would not be used anywhere."


What this API does provide is the ability to set new default quotas 
for *all* projects at once rather than individually specifying new 
defaults for each project.


It's a "defaults template", yes.


Chris, are you advocating for *keeping* the os-quota-classes API?


Nope.  I had two points:

1) It's kind of irrelevant whether anyone has created a quota class 
other than "default" because nova wouldn't use it anyways.


2) The main benefit (as I see it) of the quota class API is to allow 
dynamic adjustment of the default quotas without restarting services.
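
For example (assuming the legacy os-quota-class-sets API is still enabled;
exact client options may vary by release), an operator can raise the
defaults for every project at runtime with something like:

  nova quota-class-show default
  nova quota-class-update default --instances 20 --cores 40

No service restart or per-project quota edit is needed.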


I totally agree that keystone limits should replace it.  I just didn't 
want the discussion to be focused on the non-default class portion 
because it doesn't matter.


Chris



Re: [openstack-dev] [nova][limits] Does ANYONE at all use the quota class functionality in Nova?

2018-10-25 Thread Chris Friesen

On 10/24/2018 9:10 AM, Jay Pipes wrote:
Nova's API has the ability to create "quota classes", which are 
basically limits for a set of resource types. There is something called 
the "default quota class" which corresponds to the limits in the 
CONF.quota section. Quota classes are basically templates of limits to 
be applied if the calling project doesn't have any stored 
project-specific limits.


Has anyone ever created a quota class that is different from "default"?


The Compute API specifically says:

"Only ‘default’ quota class is valid and used to set the default quotas, 
all other quota class would not be used anywhere."


What this API does provide is the ability to set new default quotas for 
*all* projects at once rather than individually specifying new defaults 
for each project.


Chris



Re: [openstack-dev] [nova] Supporting force live-migrate and force evacuate with nested allocations

2018-10-09 Thread Chris Friesen

On 10/9/2018 1:20 PM, Jay Pipes wrote:

On 10/09/2018 11:04 AM, Balázs Gibizer wrote:

If you do the force flag removal in a new microversion, that also means
(at least to me) that you should not change the behavior of the force
flag in the old microversions.


Agreed.

Keep the old, buggy and unsafe behaviour for the old microversion and in 
a new microversion remove the --force flag entirely and always call GET 
/a_c, followed by a claim_resources() on the destination host.


Agreed.  Once you start looking at more complicated resource topologies, 
you pretty much need to handle allocations properly.


Chris



[openstack-dev] [nova] agreement on how to specify options that impact scheduling and configuration

2018-10-04 Thread Chris Friesen
While discussing the "Add HPET timer support for x86 guests" 
blueprint[1] one of the items that came up was how to represent what are 
essentially flags that impact both scheduling and configuration.  Eric 
Fried posted a spec to start a discussion[2], and a number of nova 
developers met on a hangout to hash it out.  This is the result.


In this specific scenario the goal was to allow the user to specify that 
their image required a virtual HPET.  For efficient scheduling we wanted 
this to map to a placement trait, and the virt driver also needed to 
enable the feature when booting the instance.  (This can be generalized 
to other similar problems, including how to specify scheduling and 
configuration information for Ironic.)


We discussed two primary approaches:

The first approach was to specify an arbitrary "key=val" in flavor 
extra-specs or image properties, which nova would automatically 
translate into the appropriate placement trait before passing it to 
placement.  Once scheduled to a compute node, the virt driver would look 
for "key=val" in the flavor/image to determine how to proceed.


The second approach was to directly specify the placement trait in the 
flavor extra-specs or image properties.  Once scheduled to a compute 
node, the virt driver would look for the placement trait in the 
flavor/image to determine how to proceed.


Ultimately, the decision was made to go with the second approach.  The 
result is that it is officially acceptable for virt drivers to key off 
placement traits specified in the image/flavor in order to turn on/off 
configuration options for the instance.  If we do get down to the virt 
driver and the trait is set, and the driver for whatever reason 
determines it's not capable of flipping the switch, it should fail.
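
As an illustration of the second approach (the exact trait name for HPET was
still being settled at the time, so treat COMPUTE_TIME_HPET as a placeholder):

  # require the trait via the flavor...
  openstack flavor set my-flavor --property trait:COMPUTE_TIME_HPET=required
  # ...or via the image
  openstack image set my-image --property trait:COMPUTE_TIME_HPET=required

The scheduler then only picks hosts whose resource provider exposes the
trait, and the virt driver keys off the same trait to enable the feature.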


It should be noted that it only makes sense to use placement traits for 
things that affect scheduling.  If it doesn't affect scheduling, then it 
can be stored in the flavor extra-specs or image properties separate 
from the placement traits.  Also, this approach only makes sense for 
simple booleans.  Anything requiring more complex configuration will 
likely need additional extra-spec and/or config and/or unicorn dust.


Chris

[1] https://blueprints.launchpad.net/nova/+spec/support-hpet-on-guest
[2] 
https://review.openstack.org/#/c/607989/1/specs/stein/approved/support-hpet-on-guest.rst




Re: [openstack-dev] [helm] multiple nova compute nodes

2018-10-02 Thread Chris Friesen

On 10/2/2018 4:15 PM, Giridhar Jayavelu wrote:

Hi,
Currently, all nova components are packaged in the same helm chart "nova". Are 
there any plans to separate nova-compute from the rest of the services?
What should be the approach for deploying multiple nova compute nodes using 
OpenStack helm charts?


The nova-compute pods are part of a daemonset which will automatically 
create a nova-compute pod on each node that has the 
"openstack-compute-node=enabled" label.


Chris



[openstack-dev] [storyboard] why use different "bug" tags per project?

2018-09-26 Thread Chris Friesen

Hi,

At the PTG, it was suggested that each project should tag their bugs 
with "<project>-bug" to avoid tags being "leaked" across projects, or 
something like that.


Could someone elaborate on why this was recommended?  It seems to me 
that it'd be better for all projects to just use the "bug" tag for 
consistency.


If you want to get all bugs in a specific project it would be pretty 
easy to search for stories with a tag of "bug" and a project of "X".


Chris



Re: [openstack-dev] [goals][python3] mixed versions?

2018-09-12 Thread Chris Friesen

On 9/12/2018 12:04 PM, Doug Hellmann wrote:


This came up in a Vancouver summit session (the python3 one I think). General 
consensus there seemed to be that we should have grenade jobs that run python2 
on the old side and python3 on the new side and test the update from one to 
another through a release that way. Additionally there was thought that the 
nova partial job (and similar grenade jobs) could hold the non upgraded node on 
python2 and that would talk to a python3 control plane.

I haven't seen or heard of anyone working on this yet though.

Clark



IIRC, we also talked about not supporting multiple versions of
python on a given node, so all of the services on a node would need
to be upgraded together.


As I understand it, the various services talk to each other using 
over-the-wire protocols.  Assuming this is correct, why would we need to 
ensure they are using the same python version?


Chris



Re: [openstack-dev] [all] Bringing the community together (combine the lists!)

2018-08-30 Thread Chris Friesen

On 08/30/2018 11:03 AM, Jeremy Stanley wrote:


The proposal is simple: create a new openstack-discuss mailing list
to cover all the above sorts of discussion and stop using the other
four.


Do we want to merge usage and development onto one list?  That could be a busy 
list for someone who's just asking a simple usage question.


Alternately, if we are going to merge everything then why not just use the 
"openstack" mailing list since it already exists and there are references to it 
on the web.


(Or do you want to force people to move to something new to make them recognize 
that something has changed?)


Chris



Re: [openstack-dev] [all] [nova] [placement] placement below or beside compute after extraction?

2018-08-21 Thread Chris Friesen

On 08/21/2018 04:33 PM, melanie witt wrote:


If we separate into two different groups, all of the items I discussed in my
previous reply will become cross-project efforts. To me, this means that the
placement group will have their own priorities and goal setting process and if
their priorities and goals happen to align with ours on certain items, we can
agree to work on those in collaboration. But I won't make assumptions about how
much alignment we will have. The placement group, as a hypothetical example,
won't necessarily find helping us fix issues with compute functionality like
vGPUs as important as we do, if we need additional work in placement to support 
it.


I guess what I'm saying is that even with placement under nova governance, if 
the placement developers don't want to implement what the nova cores want them 
to implement there really isn't much that the nova cores can do to force them to 
implement it.


And if the placement developers/cores are on board with what nova wants, I don't 
see how it makes a difference if placement is a separate project, especially if 
all existing nova cores are also placement cores to start.


(Note that this is in the context of scratch-your-own-itch developers.  It would 
be very different if the PTL could order developers to work on something.)


Chris





Re: [openstack-dev] [all] [nova] [placement] placement below or beside compute after extraction?

2018-08-21 Thread Chris Friesen

On 08/21/2018 01:53 PM, melanie witt wrote:


Given all of that, I'm not seeing how *now* is a good time to separate the
placement project under separate governance with separate goals and priorities.
If operators need things for compute, that are well-known and that placement was
created to solve, how will placement have a shared interest in solving compute
problems, if it is not part of the compute project?


As someone who is not involved in the governance of nova, this seems like kind 
of an odd statement for an open-source project.


From the outside, it seems like there is a fairly small pool of active 
placement developers.  And either the placement developers are willing to 
implement the capabilities desired by compute or else they're not.  And if 
they're not, I don't see how being under compute governance would resolve that 
since the only official hard leverage the compute governance has is refusing to 
review/merge placement patches (which wouldn't really help implement compute's 
desires anyways).


Chris



Re: [openstack-dev] [all] [nova] [placement] placement below or beside compute after extraction?

2018-08-20 Thread Chris Friesen

On 08/20/2018 11:44 AM, Zane Bitter wrote:


If you want my personal opinion then I'm a big believer in incremental change.
So, despite recognising that it is born of long experience of which I have been
blissfully mostly unaware, I have to disagree with Chris's position that if
anybody lets you change something then you should try to change as much as
possible in case they don't let you try again. (In fact I'd go so far as to
suggest that those kinds of speculative changes are a contributing factor in
making people reluctant to allow anything to happen at all.) So I'd suggest
splitting the repo, trying things out for a while within Nova's governance, and
then re-evaluating. If there are at that point specific problems that separate
governance would appear to address, then it's only a trivial governance patch
and a PTL election away. It should also be much easier to get consensus at that
point than it is at this distance where we're only speculating what things will
be like after the extraction.

I'd like to point out for the record that Mel already said this and said it
better and is AFAICT pretty much never wrong :)


In order to address the "velocity of change in placement" issues, how about 
making the main placement folks members of nova-core with the understanding that 
those powers would only be used in the new placement repo?


Chris



Re: [openstack-dev] [nova] How to debug no valid host failures with placement

2018-08-15 Thread Chris Friesen

On 08/04/2018 05:18 PM, Matt Riedemann wrote:

On 8/3/2018 9:14 AM, Chris Friesen wrote:

I'm of two minds here.

On the one hand, you have the case where the end user has accidentally
requested some combination of things that isn't normally available, and they
need to be able to ask the provider what they did wrong.  I agree that this
case is not really an exception, those resources were never available in the
first place.

On the other hand, suppose the customer issues a valid request and it works,
and then issues the same request again and it fails, leading to a violation of
that customer's SLA. In this case I would suggest that it could be considered
an exception since the system is not delivering the service that it was
intended to deliver.


As I'm sure you're aware Chris, it looks like StarlingX has a kind of
post-mortem query utility to try and figure out where requested resources didn't
end up yielding a resource provider (for a compute node):

https://github.com/starlingx-staging/stx-nova/commit/71acfeae0d1c59fdc77704527d763bd85a276f9a#diff-94f87e728df6465becce5241f3da53c8R330


But as you noted way earlier in this thread, it might not be the actual reasons
at the time of the failure and in a busy cloud could quickly change.


Just noticed this email, sorry for the delay.

The bit you point out isn't a post-mortem query but rather a way of printing out 
the rejection reasons that were stored (via calls to filter_reject()) at the 
time the request was processed by each filter.


Chris




Re: [openstack-dev] [puppet] migrating to storyboard

2018-08-15 Thread Chris Friesen

On 08/14/2018 10:33 AM, Tobias Urdin wrote:


My goal is that we will be able to swap to Storyboard during the Stein cycle, but
considering that we have low activity on
bugs, my opinion is that we could do this swap very easily any time soon, as long
as everybody is in favor of it.

Please let me know what you think about moving to Storyboard?


Not a puppet dev, but am currently using Storyboard.

One of the things we've run into is that there is no way to attach log files for 
bug reports to a story.  There's an open story on this[1] but it's not assigned 
to anyone.


Chris


[1] https://storyboard.openstack.org/#!/story/2003071



Re: [openstack-dev] [Openstack-operators] [nova] StarlingX diff analysis

2018-08-13 Thread Chris Friesen

On 08/07/2018 07:29 AM, Matt Riedemann wrote:

On 8/7/2018 1:10 AM, Flint WALRUS wrote:

I didn’t have time to check StarlingX code quality; how did you feel about it while
you were doing your analysis?


I didn't dig into the test diffs themselves, but it was my impression that from
what I was poking around in the local git repo, there were several changes which
didn't have any test coverage.


Full disclosure, I'm on the StarlingX team.

Certainly some changes didn't have unit/functional test coverage, generally due 
to the perceived cost of writing useful tests.  (And when you don't have a lot 
of experience writing tests this becomes a self-fulfilling prophecy.)  On the 
other hand, we had fairly robust periodic integration testing including 
multi-node testing with physical hardware.



For the really big full stack changes (L3 CAT, CPU scaling and shared/pinned
CPUs on same host), toward the end I just started glossing over a lot of that
because it's so much code in so many places, so I can't really speak very well
to how it was written or how well it is tested (maybe WindRiver had a more
robust CI system running integration tests, I don't know).


We didn't have a per-commit CI system, though that's starting to change.  We do 
have a QA team running regression and targeted tests.



There were also some things which would have been caught in code review
upstream. For example, they ignore the "force" parameter for live migration so
that live migration requests always go through the scheduler. However, the
"force" parameter is only on newer microversions. Before that, if you specified
a host at all it would bypass the scheduler, but the change didn't take that
into account, so they still have gaps in some of the things they were trying to
essentially disable in the API.


Agreed, that's not up to upstream quality.  In this case we made some 
simplifying assumptions because our customers were expected to use the matching 
modified clients and to use the "current" microversion rather than explicitly 
specifying older microversions.


Chris




Re: [openstack-dev] [nova] Do we still want to lowercase metadata keys?

2018-08-13 Thread Chris Friesen

On 08/13/2018 08:26 AM, Jay Pipes wrote:

On 08/13/2018 10:10 AM, Matthew Booth wrote:



I suspect I've misunderstood, but I was arguing this is an anti-goal.
There's no reason to do this if the db is working correctly, and it
would violate the principle of least surprise in dbs with legacy
datasets (being all current dbs). These values have always been mixed
case, lets just leave them be and fix the db.


Do you want case-insensitive keys or do you not want case-insensitive keys?

It seems to me that people complain that MySQL is case-insensitive by default
but actually *like* the concept that a metadata key of "abc" should be "equal
to" a metadata key of "ABC".


How do we behave on PostgreSQL?  (I realize it's unsupported, but it still has 
users.)  It's case-sensitive by default; do we override that?
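
(For anyone curious, the default behaviour is easy to check from a shell;
this is just an illustration, not anything nova-specific:

  mysql -N -e "SELECT 'abc' = 'ABC';"   # prints 1 (compares equal)
  psql -t -c "SELECT 'abc' = 'ABC';"    # prints f (does not compare equal)
)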


Personally, I've worked on case-sensitive systems long enough that I'd actually 
be surprised if "abc" matched "ABC". :)


Chris



Re: [openstack-dev] [nova] about live-resize down the instance

2018-08-13 Thread Chris Friesen

On 08/13/2018 02:07 AM, Rambo wrote:

Hi,all

   I find it important to be able to live-resize an instance in a production
environment, especially to live-downsize the disk. We have talked about it for many
years, but I don't know why the bp[1] wasn't approved. Can you tell me more about
this? Thank you very much.

[1]https://review.openstack.org/#/c/141219/



It's been reviewed a number of times...I thought it was going to get approved 
for Rocky, but I think it didn't quite make it in...you'd have to ask the nova 
cores why not.


It should be noted though that the above live-resize spec explicitly did not 
cover resizing smaller, only larger.


Chris



Re: [openstack-dev] [nova] How to debug no valid host failures with placement

2018-08-03 Thread Chris Friesen

On 08/02/2018 06:27 PM, Jay Pipes wrote:

On 08/02/2018 06:18 PM, Michael Glasgow wrote:



More generally, any time a service fails to deliver a resource which it is
primarily designed to deliver, it seems to me at this stage that should
probably be taken a bit more seriously than just "check the log file, maybe
there's something in there?"  From the user's perspective, if nova fails to
produce an instance, or cinder fails to produce a volume, or neutron fails to
build a subnet, that's kind of a big deal, right?

In such cases, would it be possible to generate a detailed exception object
which contains all the necessary info to ascertain why that specific failure
occurred?


It's not an exception. It's normal course of events. NoValidHosts means there
were no compute nodes that met the requested resource amounts.


I'm of two minds here.

On the one hand, you have the case where the end user has accidentally requested 
some combination of things that isn't normally available, and they need to be 
able to ask the provider what they did wrong.  I agree that this case is not 
really an exception, those resources were never available in the first place.


On the other hand, suppose the customer issues a valid request and it works, and 
then issues the same request again and it fails, leading to a violation of that 
customers SLA.  In this case I would suggest that it could be considered an 
exception since the system is not delivering the service that it was intended to 
deliver.


Chris



Re: [openstack-dev] [nova] How to debug no valid host failures with placement

2018-08-02 Thread Chris Friesen

On 08/02/2018 01:04 PM, melanie witt wrote:


The problem is an infamous one, which is, your users are trying to boot
instances and they get "No Valid Host" and an instance in ERROR state. They
contact support, and now support is trying to determine why NoValidHost
happened. In the past, they would turn on DEBUG log level on the nova-scheduler,
try another request, and take a look at the scheduler logs.


At a previous Summit[1] there were some operators that said they just always ran 
nova-scheduler with debug logging enabled in order to deal with this issue, but 
that it was a pain to isolate the useful logs from the not-useful ones.


Chris


[1] in a discussion related to 
https://blueprints.launchpad.net/nova/+spec/improve-sched-logging




Re: [openstack-dev] [nova] How to debug no valid host failures with placement

2018-08-02 Thread Chris Friesen

On 08/02/2018 04:10 AM, Chris Dent wrote:


When people ask for something like what Chris mentioned:

 hosts with enough CPU: <list of hosts>
 hosts that also have enough disk: <list of hosts>
 hosts that also have enough memory: <list of hosts>
 hosts that also meet extra spec host aggregate keys: <list of hosts>
 hosts that also meet image properties host aggregate keys: <list of hosts>
 hosts that also have requested PCI devices: <list of hosts>

What are the operational questions that people are trying to answer
with those results? Is the idea to be able to have some insight into
the resource usage and reporting on and from the various hosts and
discover that things are being used differently than thought? Is
placement a resource monitoring tool, or is it more simple and
focused than that? Or is it that we might have flavors or other
resource requesting constraints that have bad logic and we want to
see at what stage the failure is?  I don't know and I haven't really
seen it stated explicitly here, and knowing it would help.

Do people want info like this for requests as they happen, or to be
able to go back later and try the same request again with some flag
on that says: "diagnose what happened"?

Or to put it another way: Before we design something that provides
the information above, which is a solution to an undescribed
problem, can we describe the problem more completely first to make
sure that what solution we get is the right one. The thing above,
that set of information, is context free.


The reason my organization added additional failure-case logging to the 
pre-placement scheduler was that we were enabling complex features (cpu pinning, 
hugepages, PCI, SRIOV, CPU model requests, NUMA topology, etc.) and we were 
running into scheduling failures, and people were asking the question "why did 
this scheduler request fail to find a valid host?".


There are a few reasons we might want to ask this question.  Some of them 
include:

1) double-checking the scheduler is working properly when first using additional 
features

2) weeding out images/flavors with excessive or mutually-contradictory 
constraints
3) determining whether the cluster needs to be reconfigured to meet user 
requirements


I suspect that something like "do the same request again with a debug flag" 
would cover many scenarios.  I suspect its main weakness would be dealing with 
contention between short-lived entities.


Chris



Re: [openstack-dev] [nova] How to debug no valid host failures with placement

2018-08-02 Thread Chris Friesen

On 08/01/2018 11:34 PM, Joshua Harlow wrote:


And I would be able to say request the explanation for a given request id
(historical even) so that analysis could be done post-change and pre-change (say
I update the algorithm for selection) so that the effects of alternations to
said decisions could be determined.


This would require storing a snapshot of all resources prior to processing every 
request...seems like that could add overhead and increase storage consumption.


Chris



Re: [openstack-dev] [nova] How to debug no valid host failures with placement

2018-08-01 Thread Chris Friesen

On 08/01/2018 11:32 AM, melanie witt wrote:


I think it's definitely a significant issue that troubleshooting "No allocation
candidates returned" from placement is so difficult. However, it's not
straightforward to log detail in placement when the request for allocation
candidates is essentially "SELECT * FROM nodes WHERE cpu usage < needed and disk
usage < needed and memory usage < needed" and the result is returned from the 
API.


I think the only way to get useful info on a failure would be to break down the 
huge SQL statement into subclauses and store the results of the intermediate 
queries.  So then if it failed placement could log something like:


hosts with enough CPU: <list of hosts>
hosts that also have enough disk: <list of hosts>
hosts that also have enough memory: <list of hosts>
hosts that also meet extra spec host aggregate keys: <list of hosts>
hosts that also meet image properties host aggregate keys: <list of hosts>
hosts that also have requested PCI devices: <list of hosts>

And maybe we could optimize the above by only emitting logs where the list has a 
length less than X (to avoid flooding the logs with hostnames in large clusters).


This would let you zero in on the things that finally caused the list to be 
whittled down to nothing.
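
In the meantime, a rough manual approximation of the same idea is to re-run
the allocation candidate query yourself, adding one resource at a time, and
see which addition empties the list.  (The values here are just examples.)

  openstack --os-placement-api-version 1.10 allocation candidate list \
      --resource VCPU=1
  openstack --os-placement-api-version 1.10 allocation candidate list \
      --resource VCPU=1 --resource MEMORY_MB=2048
  openstack --os-placement-api-version 1.10 allocation candidate list \
      --resource VCPU=1 --resource MEMORY_MB=2048 --resource DISK_GB=20

Of course this only reflects the current state of the cluster, not the state
at the time of the original failure.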


Chris



Re: [openstack-dev] [nova] How to debug no valid host failures with placement

2018-08-01 Thread Chris Friesen

On 08/01/2018 11:17 AM, Ben Nemec wrote:



On 08/01/2018 11:23 AM, Chris Friesen wrote:



The fact that there is no real way to get the equivalent of the old detailed
scheduler logs is a known shortcoming in placement, and will become more of a
problem if/when we move more complicated things like CPU pinning, hugepages,
and NUMA-awareness into placement.

The problem is that getting useful logs out of placement would require
significant development work.


Yeah, in my case I only had one compute node so it was obvious what the problem
was, but if I had a scheduling failure on a busy cloud with hundreds of nodes I
don't see how you would ever track it down.  Maybe we need to have a discussion
with operators about how often they do post-mortem debugging of this sort of 
thing?


For Wind River's Titanium Cloud it was enough of an issue that we customized the 
scheduler to emit detailed logs on scheduler failure.


We started upstreaming it[1] but the effort stalled out when the upstream folks 
requested major implementation changes.


Chris


[1] https://blueprints.launchpad.net/nova/+spec/improve-sched-logging



Re: [openstack-dev] [nova] How to debug no valid host failures with placement

2018-08-01 Thread Chris Friesen

On 08/01/2018 09:58 AM, Andrey Volkov wrote:

Hi,

It seems you need first to check what placement knows about resources of your 
cloud.
This can be done either with REST API [1] or with osc-placement [2].
For osc-placement you could use:

pip install osc-placement
openstack allocation candidate list --resource DISK_GB=20 --resource
MEMORY_MB=2048 --resource VCPU=1 --os-placement-api-version 1.10

And you can explore placement state with other commands like openstack resource
provider list, resource provider inventory list, resource provider usage show.



Unfortunately this doesn't help figure out what the missing resources were *at 
the time of the failure*.


The fact that there is no real way to get the equivalent of the old detailed 
scheduler logs is a known shortcoming in placement, and will become more of a 
problem if/when we move more complicated things like CPU pinning, hugepages, and 
NUMA-awareness into placement.


The problem is that getting useful logs out of placement would require 
significant development work.


Chris



Re: [openstack-dev] [nova] keypair quota usage info for user

2018-07-26 Thread Chris Friesen

On 07/25/2018 06:22 PM, Alex Xu wrote:



2018-07-26 1:43 GMT+08:00 Chris Friesen <chris.frie...@windriver.com>:



Keypairs are weird in that they're owned by users, not projects.  This is
arguably wrong, since it can cause problems if a user boots an instance with
their keypair and then gets removed from a project.

Nova microversion 2.54 added support for modifying the keypair associated
with an instance when doing a rebuild.  Before that there was no clean way
to do it.


I don't understand this; we don't count the keypair usage together with the
instance, we just count the keypair usage for a specific user.



I was giving an example of why it's strange that keypairs are owned by users 
rather than projects.  (When instances are owned by projects, and keypairs are 
used to access instances.)


Chris





Re: [openstack-dev] [nova] keypair quota usage info for user

2018-07-26 Thread Chris Friesen

On 07/25/2018 06:21 PM, Alex Xu wrote:



2018-07-26 0:29 GMT+08:00 William M Edmonds <edmon...@us.ibm.com>:


Ghanshyam Mann <gm...@ghanshyammann.com>
wrote on 07/25/2018 05:44:46 AM:
... snip ...
> 1. is it ok to show the keypair used info via the API? any original
> rationale not to do so, or was it just like that from the start?

keypairs aren't tied to a tenant/project, so how could nova track/report a
quota for them on a given tenant/project? Which is how the API is
constructed... note the "tenant_id" in GET /os-quota-sets/{tenant_id}/detail


Keypair usage is only meaningful for the API 'GET
/os-quota-sets/{tenant_id}/detail?user_id={user_id}'


The objection is that keypairs are tied to the user, not the tenant, so it 
doesn't make sense to specify a tenant_id in the above query.


And for Pike at least I think the above command does not actually show how many 
keypairs have been created by that user...it still shows zero.


Chris




Re: [openstack-dev] [nova] keypair quota usage info for user

2018-07-25 Thread Chris Friesen

On 07/25/2018 10:29 AM, William M Edmonds wrote:


Ghanshyam Mann  wrote on 07/25/2018 05:44:46 AM:
... snip ...
 > 1. is it ok to show the keypair used info via the API? any original
 > rationale not to do so, or was it just like that from the start?

keypairs aren't tied to a tenant/project, so how could nova track/report a quota
for them on a given tenant/project? Which is how the API is constructed... note
the "tenant_id" in GET /os-quota-sets/{tenant_id}/detail

 > 2. Because this change will show the keypair used quota information
 > in the API's existing field 'in_use', it is an API behaviour change (not
 > an interface signature change in a backward incompatible way) which can
 > cause an interop issue. Should we bump the microversion for this change?

If we find a meaningful way to return in_use data for keypairs, then yes, I
would expect a microversion bump so that callers can distinguish between a)
talking to an older installation where in_use is always 0 vs. b) talking to a
newer installation where in_use is 0 because there are really none in use. Or if
we remove keypairs from the response, which at a glance seems to make more
sense, that should also have a microversion bump so that someone who expects the
old response format will still get it.


Keypairs are weird in that they're owned by users, not projects.  This is 
arguably wrong, since it can cause problems if a user boots an instance with 
their keypair and then gets removed from a project.


Nova microversion 2.54 added support for modifying the keypair associated with 
an instance when doing a rebuild.  Before that there was no clean way to do it.
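
Something along these lines, if I remember the client option correctly
(server, image, and keypair names are placeholders):

  nova --os-compute-api-version 2.54 rebuild --key-name new-keypair \
      my-server my-image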


Chris



Re: [openstack-dev] [infra][nova] Running NFV tests in CI

2018-07-24 Thread Chris Friesen

On 07/24/2018 12:47 PM, Clark Boylan wrote:


Can you get by with qemu or is nested virt required?


Pretty sure that nested virt is needed in order to test CPU pinning.


As for hugepages, I've done a quick survey of cpuinfo across our clouds and all 
seem to have pse available but not all have pdpe1gb available. Are you using 
1GB hugepages?


If we want to test nova's handling of 1G hugepages then I think we'd need 
pdpe1gb.
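
A quick sanity check on any candidate test node would be something like the
following (standard x86 cpuinfo flags; the kvm_intel path assumes an Intel
host with the module loaded):

  grep -c pdpe1gb /proc/cpuinfo                   # >0 means 1G hugepages supported
  grep -c -E 'vmx|svm' /proc/cpuinfo              # >0 means hardware virt exposed
  cat /sys/module/kvm_intel/parameters/nested     # Y/1 if nested virt is enabled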

Chris



[openstack-dev] [StoryBoard] issues found while using storyboard

2018-07-23 Thread Chris Friesen

Hi,

I'm on a team that is starting to use StoryBoard, and I just thought I'd raise 
some issues I've recently run into.  It may be that I'm making assumptions based 
on previous tools that I've used (Launchpad and Atlassian's Jira) and perhaps 
StoryBoard is intended to be used differently, so if that's the case please let 
me know.


1) There doesn't seem to be a formal way to search for newly-created stories 
that have not yet been triaged.


2) There doesn't seem to be a way to find stories/tasks using arbitrary boolean 
logic, for example something of the form "(A OR (B AND C)) AND NOT D". 
Automatic worklists will only let you do "(A AND B) OR (C AND D) OR (E AND F)" 
and story queries won't even let you do that.


3) I don't see a structured way to specify that a bug has been confirmed by 
someone other than the reporter, or how many people have been impacted by it.


4) I can't find a way to add attachments to a story.  (Like a big log file, or a 
proposed patch, or a screenshot.)


5) I don't see a way to search for stories that have not been assigned to 
someone.

6) This is more a convenience thing, but when looking at someone else's public 
automatic worklist, there's no way to see what the query terms were that 
generated the worklist.


Chris



Re: [openstack-dev] [nova] Bug 1781710 killing the check queue

2018-07-18 Thread Chris Friesen

On 07/18/2018 03:43 PM, melanie witt wrote:

On Wed, 18 Jul 2018 15:14:55 -0500, Matt Riedemann wrote:

On 7/18/2018 1:13 PM, melanie witt wrote:

Can we get rid of multi-create?  It keeps causing complications, and
it already
has weird behaviour if you ask for min_count=X and max_count=Y and only X
instances can be scheduled.  (Currently it fails with NoValidHost, but
it should
arguably start up X instances.)

We've discussed that before but I think users do use it and appreciate
the ability to boot instances in batches (one request). The behavior you
describe could be changed with a microversion, though I'm not sure if
that would mean we have to preserve old behavior with the previous
microversion.

Correct, we can't just remove it since that's a backward incompatible
microversion change. Plus, NFV people*love*  it.


Sorry, I think I might have caused confusion with my question about a
microversion. I was saying that to change the min_count=X and max_count=Y
behavior of raising NoValidHost if X can be satisfied but Y can't, I thought we
could change that in a microversion. And I wasn't sure if that would also mean
we would have to keep the old behavior for previous microversions (and thus
maintain both behaviors).


I understood you. :)

For the case where we could satisfy min_count but not max_count I think we 
*would* need to keep the existing kill-them-all behaviour for existing 
microversions since that's definitely an end-user-visible behaviour.


Chris



Re: [openstack-dev] [nova] Bug 1781710 killing the check queue

2018-07-18 Thread Chris Friesen

On 07/18/2018 10:14 AM, Matt Riedemann wrote:

As can be seen from logstash [1] this bug is hurting us pretty bad in the check
queue.

I thought I originally had this fixed with [2] but that turned out to only be
part of the issue.

I think I've identified the problem but I have failed to write a recreate
regression test [3] because (I think) it's due to random ordering of which
request spec we select to send to the scheduler during a multi-create request
(and I tried making that predictable by sorting the instances by uuid in both
conductor and the scheduler but that didn't make a difference in my test).


Can we get rid of multi-create?  It keeps causing complications, and it already 
has weird behaviour if you ask for min_count=X and max_count=Y and only X 
instances can be scheduled.  (Currently it fails with NoValidHost, but it should 
arguably start up X instances.)



After talking with Sean Mooney, we have another fix which is self-contained to
the scheduler [5] so we wouldn't need to make any changes to the RequestSpec
handling in conductor. It's admittedly a bit hairy, so I'm asking for some eyes
on it since either way we go, we should get going soon before we hit the FF and
RC1 rush which *always* kills the gate.


One of your options mentioned using RequestSpec.num_instances to decide if it's 
in a multi-create.  Is there any reason to persist RequestSpec.num_instances? 
It seems like it's only applicable to the initial request, since after that each 
instance is managed individually.


Chris



Re: [openstack-dev] creating instance

2018-07-10 Thread Chris Friesen

On 07/10/2018 03:04 AM, jayshankar nair wrote:

Hi,

I am trying to create an instance of cirros OS (Project/Compute/Instances). I am
getting the following error.

Error: Failed to perform requested operation on instance "cirros1", the instance
has an error status: Please try again later [Error: Build of instance
5de65e6d-fca6-4e78-a688-ead942e8ed2a aborted: The server has either erred or is
incapable of performing the requested operation. (HTTP 500) (Request-ID:
req-91535564-4caf-4975-8eff-7bca515d414e)].

How to debug the error.


You'll want to look at the logs for the individual service.  Since you were 
trying to create a server instance, you probably want to start with the logs for 
the "nova-api" service to see if there are any failure messages.  You can then 
check the logs for "nova-scheduler", "nova-conductor", and "nova-compute". 
There should be something useful in one of those.
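
For example, you can follow that specific request across the services by
grepping for the request ID from the error message (log locations vary by
install; /var/log/nova/ is a common default):

  grep -r req-91535564-4caf-4975-8eff-7bca515d414e /var/log/nova/
  grep -B 2 -A 10 ERROR /var/log/nova/nova-compute.log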


Chris



Re: [openstack-dev] [cinder] making volume available without stopping VM

2018-06-25 Thread Chris Friesen

On 06/23/2018 08:38 AM, Volodymyr Litovka wrote:

Dear friends,

I did some tests with making volume available without stopping VM. I'm using
CEPH and these steps produce the following results:

1) openstack volume set --state available [UUID]
- nothing changed inside both VM (volume is still connected) and CEPH
2) openstack volume set --size [new size] --state in-use [UUID]
- nothing changed inside VM (volume is still connected and has an old size)
- size of CEPH volume changed to the new value
3) during these operations I was copying a lot of data from external source and
all md5 sums are the same on both VM and source
4) changes on VM happens upon any kind of power-cycle (e.g. reboot (either soft
or hard): openstack server reboot [--hard] [VM uuid] )
- note: NOT after 'reboot' from inside VM

It seems that all these manipulations with cinder just update internal
parameters of the cinder/CEPH subsystems, without immediate effect for VMs. Is it
safe to use this mechanism in this particular environment (e.g. CEPH as backend)?


There are a different set of instructions[1] which imply that the change should 
be done via the hypervisor, and that the guest will then see the changes 
immediately.
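
(For reference, the hypervisor-side step those instructions describe is
roughly the following, run on the compute host; the domain and device
names are placeholders, and doing this behind nova/cinder's back means
their bookkeeping won't reflect it:

  virsh blockresize instance-0000001a vdb 30G
)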


Also, if you resize the backend in a way that bypasses nova, I think it will 
result in the placement data being wrong.  (At least temporarily.)


Chris


[1] 
https://wiki.skytech.dk/index.php/Ceph_-_howto,_rbd,_lvm,_cluster#Online_resizing_of_KVM_images_.28rbd.29





Re: [openstack-dev] [nova] NUMA-aware live migration: easy but incomplete vs complete but hard

2018-06-21 Thread Chris Friesen

On 06/21/2018 07:04 AM, Artom Lifshitz wrote:

As I understand it, Artom is proposing to have a larger race window,
essentially
from when the scheduler selects a node until the resource audit runs on that
node.


Exactly. When writing the spec I thought we could just call the resource tracker
to claim the resources when the migration was done. However, when I started
looking at the code in reaction to Sahid's feedback, I noticed that there's no
way to do it without the MoveClaim context (right?)


In the previous patch, the MoveClaim is the thing that calculates the dest NUMA 
topology given the flavor/image, then calls hardware.numa_fit_instance_to_host() 
to figure out what specific host resources to consume.  That claim is then 
associated with the migration object and the instance.migration_context, and 
then we call _update_usage_from_migration() to actually consume the resources on 
the destination.  This all happens within check_can_live_migrate_destination().


As an improvement over what you've got, I think you could just kick off an early 
call of update_available_resource() once the migration is done.  It'd be 
potentially computationally expensive, but it'd reduce the race window.



Keep in mind, we're not making any race windows worse - I'm proposing keeping
the status quo and fixing it later with NUMA in placement (or the resource
tracker if we can swing it).


Well, right now live migration is totally broken so nobody's doing it.  You're 
going to make it kind of work but with racy resource tracking, which could lead 
to people doing it then getting in trouble.  We'll want to make sure there's a 
suitable release note for this.



The resource tracker stuff is just so... opaque. For instance, the original
patch [1] uses a mutated_migration_context around the pre_live_migration call to
the libvirt driver. Would I still need to do that? Why or why not?


The mutated context applies the "new" numa_topology and PCI stuff.

The reason for the mutated context for pre_live_migration() is so that the 
plug_vifs(instance) call will make use of the new macvtap device information. 
See Moshe's comment from Dec 8 2016 at https://review.openstack.org/#/c/244489/46.


I think the mutated context around the call to self.driver.live_migration() is 
so that the new XML represents the newly-claimed pinned CPUs on the destination.



At this point we need to commit to something and roll with it, so I'm sticking
to the "easy way". If it gets shut down in code review, at least we'll have
certainty on how to approach this next cycle.


Yep, I'm cool with incremental improvement.

Chris



Re: [openstack-dev] [nova] NUMA-aware live migration: easy but incomplete vs complete but hard

2018-06-21 Thread Chris Friesen

On 06/21/2018 07:50 AM, Mooney, Sean K wrote:

-Original Message-
From: Jay Pipes [mailto:jaypi...@gmail.com]



Side question... does either approach touch PCI device management
during live migration?

I ask because the only workloads I've ever seen that pin guest vCPU
threads to specific host processors -- or make use of huge pages
consumed from a specific host NUMA node -- have also made use of SR-IOV
and/or PCI passthrough. [1]

If workloads that use PCI passthrough or SR-IOV VFs cannot be live
migrated (due to existing complications in the lower-level virt layers)
I don't see much of a point spending lots of developer resources trying
to "fix" this situation when in the real world, only a mythical
workload that uses CPU pinning or huge pages but *doesn't* use PCI
passthrough or SR-IOV VFs would be helped by it.



[Mooney, Sean K]  I would generally agree, but with the extension of including 
DPDK-based vswitches like ovs-dpdk or VPP.
CPU-pinned or hugepage-backed guests generally also have some kind of 
high-performance networking solution or use a hardware
accelerator like a GPU to justify the performance assertion that pinning of 
cores or RAM is required.
A DPDK networking stack would however not require the PCI remapping to be 
addressed, though I believe that is planned to be added in Stein.


Jay, you make a good point but I'll second what Sean says...for the last few 
years my organization has been using a DPDK-accelerated vswitch which performs 
well enough for many high-performance purposes.


In the general case, I think live migration while using physical devices would 
require coordinating the migration with the guest software.


Chris



Re: [openstack-dev] [nova] NUMA-aware live migration: easy but incomplete vs complete but hard

2018-06-20 Thread Chris Friesen

On 06/20/2018 10:00 AM, Sylvain Bauza wrote:


When we reviewed the spec, we agreed as a community to say that we should still
get race conditions once the series is implemented, but at least it helps 
operators.
Quoting :
"It would also be possible for another instance to steal NUMA resources from a
live migrated instance before the latter’s destination compute host has a chance
to claim them. Until NUMA resource providers are implemented [3]
 and allow for an essentially atomic
schedule+claim operation, scheduling and claiming will keep being done at
different times on different nodes. Thus, the potential for races will continue
to exist."
https://specs.openstack.org/openstack/nova-specs/specs/rocky/approved/numa-aware-live-migration.html#proposed-change


My understanding of that quote was that we were acknowledging the fact that when 
using the ResourceTracker there was an unavoidable race window between the time 
when the scheduler selected a compute node and when the resources were claimed 
on that compute node in check_can_live_migrate_destination().  And in this model 
no resources are actually *used* until they are claimed.


As I understand it, Artom is proposing to have a larger race window, essentially 
from when the scheduler selects a node until the resource audit runs on that node.


Chris



Re: [openstack-dev] [nova] NUMA-aware live migration: easy but incomplete vs complete but hard

2018-06-19 Thread Chris Friesen

On 06/19/2018 01:59 PM, Artom Lifshitz wrote:

Adding
claims support later on wouldn't change any on-the-wire messaging, it would
just make things work more robustly.


I'm not even sure about that. Assuming [1] has at least the right
idea, it looks like it's an either-or kind of thing: either we use
resource tracker claims and get the new instance NUMA topology that
way, or do what was in the spec and have the dest send it to the
source.


One way or another you need to calculate the new topology in 
ComputeManager.check_can_live_migrate_destination() and communicate that 
information back to the source so that it can be used in 
ComputeManager._do_live_migration().  The previous patches communicated the new 
topology as part of the instance.



That being said, I still think I'm still in favor of choosing the
"easy" way out. For instance, [2] should fail because we can't access
the api db from the compute node.


I think you could use objects.ImageMeta.from_instance(instance) instead of 
request_spec.image.  The limits might be an issue.



So unless there's a simpler way,
using RT claims would involve changing the RPC to add parameters to
check_can_live_migration_destination, which, while not necessarily
bad, seems like useless complexity for a thing we know will get ripped
out.


I agree that it makes sense to get the "simple" option working first.  If we 
later choose to make it work "properly" I don't think it would require undoing 
too much.


Something to maybe factor in to what you're doing--I think there is currently a 
bug when migrating an instance with no numa_topology to a host with a different 
set of host CPUs usable for floating instances--I think it will assume it can 
still float over the same host CPUs as before.  Once we have the ability to 
re-write the instance XML prior to the live-migration it would be good to fix 
this.  I think this would require passing the set of available CPUs on the 
destination back to the host for use when recalculating the XML for the guest. 
(See the "if not guest_cpu_numa_config" case in 
LibvirtDriver._get_guest_numa_config() where "allowed_cpus" is specified, and 
LibvirtDriver._get_guest_config() where guest.cpuset is written.)


Chris



Re: [openstack-dev] [nova] NUMA-aware live migration: easy but incomplete vs complete but hard

2018-06-18 Thread Chris Friesen

On 06/18/2018 08:16 AM, Artom Lifshitz wrote:

Hey all,

For Rocky I'm trying to get live migration to work properly for
instances that have a NUMA topology [1].

A question that came up on one of patches [2] is how to handle
resources claims on the destination, or indeed whether to handle that
at all.


I think getting the live migration to work at all is better than having it stay 
broken, so even without resource claiming on the destination it's an improvement 
over the status quo and I think it'd be a desirable change.


However, *not* doing resource claiming means that until the migration is 
complete and the regular resource audit runs on the destination (which could be 
a minute later by default) you could end up having other instances try to use 
the same resources, causing various operations to fail.  I think we'd want to 
have a very clear notice in the release notes about the limitations if we go 
this route.


I'm a little bit worried that waiting for support in placement will result in 
"fully-functional" live migration with dedicated resources being punted out 
indefinitely.  One of the reasons why the spec[1] called for using the existing 
resource tracker was that we don't expect placement to be functional for all 
NUMA-related stuff for a while yet.


For what it's worth, I think the previous patch languished for a number of 
reasons other than the complexity of the code...the original author left, the 
coding style was a bit odd, there was an attempt to make it work even if the 
source was an earlier version, etc.  I think a fresh implementation would be 
less complicated to review.


Given the above, my personal preference would be to merge it even without 
claims, but then try to get the claims support merged as well.  (Adding claims 
support later on wouldn't change any on-the-wire messaging, it would just make 
things work more robustly.)


Chris

[1] 
https://github.com/openstack/nova-specs/blob/master/specs/rocky/approved/numa-aware-live-migration.rst


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] increasing the number of allowed volumes attached per instance > 26

2018-06-07 Thread Chris Friesen

On 06/07/2018 12:07 PM, Matt Riedemann wrote:

On 6/7/2018 12:56 PM, melanie witt wrote:



C) Create a configurable API limit for maximum number of volumes to attach to
a single instance that is either a quota or similar to a quota. Pros: lets
operators opt-in to a maximum that works in their environment. Cons: it's yet
another quota?


This seems the most reasonable to me if we're going to do this, but I'm probably
in the minority. Yes more quota limits sucks, but it's (1) discoverable by API
users and therefore (2) interoperable.


Quota seems like kind of a blunt instrument, since it might not make sense for a 
little single-vCPU guest to get the same number of connections as a massive 
guest with many dedicated vCPUs.  (Since you can fit many more of the former on 
a given compute node.)


If what we care about is the number of connections per compute node it almost 
feels like a resource that should be tracked...but you wouldn't want to have one 
instance consume all of the connections on the node so you're back to needing a 
per-instance limit of some sort.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][glance] Deprecation of nova.image.download.modules extension point

2018-05-31 Thread Chris Friesen

On 05/31/2018 04:14 PM, Moore, Curt wrote:


The challenge is that the Glance image transfer is _glacially slow_
when using the Glance HTTP API (~30 min for a 50GB Windows image (It’s Windows,
it’s huge with all of the necessary tools installed)).  If libvirt can instead
perform an RBD export on the image using the image download functionality, it is
able to download the same image in ~30 sec.


This seems oddly slow.  I just downloaded a 1.6 GB image from glance in slightly 
under 10 seconds.  That would map to about 5 minutes for a 50GB image.




We could look at attaching an additional ephemeral disk to the instance and have
cloudbase-init use it as the pagefile but it appears that if libvirt is using
rbd for its images_type, _all_ disks must then come from Ceph, there is no way
at present to allow the VM image to run from Ceph and have an ephemeral disk
mapped in from node-local storage.  Even still, this would have the effect of
"wasting" Ceph IOPS for the VM disk itself which could be better used for other
purposes.

Based on what I have explained about our use case, is there a better/different
way to accomplish the same goal without using the deprecated image download
functionality?  If not, can we work to "un-deprecate" the download extension
point? Should I work to get the code for this RBD download into the upstream
repository?


Have you considered using compute nodes configured for local storage but then 
use boot-from-volume with cinder and glance both using ceph?  I *think* there's 
an optimization there such that the volume creation is fast.


Assuming the volume creation is indeed fast, in this scenario you could then 
have a local ephemeral/swap disk for your pagefile.  You'd still have your VM 
root disks on ceph though.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [cyborg] [nova] Cyborg quotas

2018-05-21 Thread Chris Friesen

On 05/19/2018 05:58 PM, Blair Bethwaite wrote:

G'day Jay,

On 20 May 2018 at 08:37, Jay Pipes  wrote:

If it's not the VM or baremetal machine that is using the accelerator, what
is?


It will be a VM or BM, but I don't think accelerators should be tied
to the life of a single instance if that isn't technically necessary
(i.e., they are hot-pluggable devices). I can see plenty of scope for
use-cases where Cyborg is managing devices that are accessible to
compute infrastructure via network/fabric (e.g. rCUDA or dedicated
PCIe fabric). And even in the simple pci passthrough case (vfio or
mdev) it isn't hard to imagine use-cases for workloads that only need
an accelerator sometimes.


Currently nova only supports attach/detach of volumes and network interfaces. 
Is Cyborg looking to implement new Compute API operations to support hot 
attach/detach of various types of accelerators?


Chris



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] openstack-dev] [nova] Cannot live migrattion, because error:libvirtError: the CPU is incompatible with host CPU: Host CPU does not provide required features: cmt, mbm_total, mbm_lo

2018-05-14 Thread Chris Friesen

On 05/13/2018 09:23 PM, 何健乐 wrote:

Hi, all
When I did live-migration, I met the following error:

    result = proxy_call(self._autowrap, f, *args, **kwargs)

    May 14 10:33:11 nova-compute[981335]: File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1939, in migrateToURI3
    May 14 10:33:11 nova-compute[981335]: if ret == -1: raise libvirtError('virDomainMigrateToURI3() failed', dom=self)
    May 14 10:33:11 nova-compute[981335]: libvirtError: the CPU is incompatible with host CPU: Host CPU does not provide required features: cmt, mbm_total, mbm_local



Is there anyone who has a solution for this problem?
Thanks



Can you run "virsh capabilities" and provide the "cpu" section for both the 
source and dest compute nodes?  Can you also provide the "cpu_mode", 
"cpu_model", and "cpu_model_extra_flags" options from the "libvirt" section of 
/etc/nova/nova.conf on both compute nodes?


Chris
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] reboot a rescued instance?

2018-05-04 Thread Chris Friesen

On 05/04/2018 07:50 AM, Matt Riedemann wrote:

For full details on this, see the IRC conversation [1].

tl;dr: the nova compute manager and xen virt driver assume that you can reboot a
rescued instance [2] but the API does not allow that [3] and as far as I can
tell, it never has.

I can only assume that Rackspace had an out of tree change to the API to allow
rebooting a rescued instance. I don't know why that wouldn't have been
upstreamed, but the upstream API doesn't allow it. I'm also not aware of
anything internal to nova that reboots an instance in a rescued state.

So the question now is, should we add rescue to the possible states to reboot an
instance in the API? Or just rollback this essentially dead code in the compute
manager and xen virt driver? I don't know if any other virt drivers will support
rebooting a rescued instance.


Not sure where the more recent equivalent is, but the mitaka user guide[1] has 
this:

"Pause, suspend, and stop operations are not allowed when an instance is running 
in rescue mode, as triggering these actions causes the loss of the original 
instance state, and makes it impossible to unrescue the instance."


Would the same logic apply to reboot since it's basically stop/start?

Chris



[1] https://docs.openstack.org/mitaka/user-guide/cli_reboot_an_instance.html

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Heads up for out-of-tree drivers: supports_recreate -> supports_evacuate

2018-04-19 Thread Chris Friesen

On 04/19/2018 08:33 AM, Jay Pipes wrote:

On 04/19/2018 09:15 AM, Matthew Booth wrote:

We've had inconsistent naming of recreate/evacuate in Nova for a long
time, and it will persist in a couple of places for a while more.
However, I've proposed the following to rename 'recreate' to
'evacuate' everywhere with no rpc/api impact here:

https://review.openstack.org/560900

One of the things which is renamed is the driver 'supports_recreate'
capability, which I've renamed to 'supports_evacuate'. The above
change updates this for in-tree drivers, but as noted in review this
would impact out-of-tree drivers. If this might affect you, please
follow the above in case it merges.


I have to admit, Matt, I'm a bit confused by this. I was under the impression
that we were trying to *remove* uses of the term "evacuate" as much as possible
because that term is not adequately descriptive of the operation and terms like
"recreate" were more descriptive?


This is a good point.

Personally I'd prefer to see it go the other way and convert everything to the 
"recreate" terminology, including the external API.


From the CLI perspective, it makes no sense that "nova evacuate" operates after 
a host is already down, but "nova evacuate-live" operates on a running host.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Concern about trusted certificates API change

2018-04-18 Thread Chris Friesen

On 04/18/2018 10:57 AM, Jay Pipes wrote:

On 04/18/2018 12:41 PM, Matt Riedemann wrote:

There is a compute REST API change proposed [1] which will allow users to pass
trusted certificate IDs to be used with validation of images when creating or
rebuilding a server. The trusted cert IDs are based on certificates stored in
some key manager, e.g. Barbican.

The full nova spec is here [2].

The main concern I have is that trusted certs will not be supported for
volume-backed instances, and some clouds only support volume-backed instances.


Yes. And some clouds only support VMWare vCenter virt driver. And some only
support Hyper-V. I don't believe we should delay adding good functionality to
(large percentage of) clouds because it doesn't yet work with one virt driver or
one piece of (badly-designed) functionality.

 > The way the patch is written is that if the user attempts to

boot from volume with trusted certs, it will fail.


And... I think that's perfectly fine.


If this happens, is it clear to the end-user that the reason the boot failed is 
that the cloud doesn't support trusted cert IDs for boot-from-vol?  If so, then 
I think that's totally fine.


If the error message is unclear, then maybe we should just improve it.

Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Default scheduler filters survey

2018-04-18 Thread Chris Friesen

On 04/18/2018 09:17 AM, Artom Lifshitz wrote:


To that end, we'd like to know what filters operators are enabling in
their deployment. If you can, please reply to this email with your
[filter_scheduler]/enabled_filters (or
[DEFAULT]/scheduler_default_filters if you're using an older version)
option from nova.conf. Any other comments are welcome as well :)


RetryFilter
ComputeFilter
AvailabilityZoneFilter
AggregateInstanceExtraSpecsFilter
ComputeCapabilitiesFilter
ImagePropertiesFilter
NUMATopologyFilter
ServerGroupAffinityFilter
ServerGroupAntiAffinityFilter
PciPassthroughFilter


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [placement][nova] Decision time on granular request groups for like resources

2018-04-18 Thread Chris Friesen

On 04/18/2018 09:58 AM, Matt Riedemann wrote:

On 4/18/2018 9:06 AM, Jay Pipes wrote:

"By default, should resources/traits submitted in different numbered request
groups be supplied by separate resource providers?"


Without knowing all of the hairy use cases, I'm trying to channel my inner
sdague and some of the similar types of discussions we've had to changes in the
compute API, and a lot of the time we've agreed that we shouldn't assume a
default in certain cases.

So for this case, if I'm requesting numbered request groups, why doesn't the API
just require that I pass a query parameter telling it how I'd like those
requests to be handled, either via affinity or anti-affinity.


The request might get unwieldy if we have to specify affinity/anti-affinity for 
each resource.  Maybe you could specify the default for the request and then 
optionally override it for each resource?


I'm not current on the placement implementation details, but would this level of 
flexibility cause complexity problems in the code?


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [placement][nova] Decision time on granular request groups for like resources

2018-04-18 Thread Chris Friesen

On 04/18/2018 08:06 AM, Jay Pipes wrote:

Stackers,

Eric Fried and I are currently at an impasse regarding a decision that will have
far-reaching (and end-user facing) impacts to the placement API and how nova
interacts with the placement service from the nova scheduler.

We need to make a decision regarding the following question:

"By default, should resources/traits submitted in different numbered request
groups be supplied by separate resource providers?"


I'm a bit conflicted.  On the one hand if we're talking about virtual resources 
like "vCPUs" then there's really no reason why they couldn't be sourced from the 
same resource provider.


On the other hand, once we're talking about *physical* resources it seems like 
it might be more common to want them to be coming from different resource 
providers.  We may want memory spread across multiple NUMA nodes for higher 
aggregate bandwidth, we may want VFs from separate PFs for high availability.


I'm half tempted to side with mriedem and say that there is no default and it 
must be explicit, but I'm concerned that this would make the requests a lot 
larger if you have to specify it for every resource.  (Will follow up in a reply 
to mriedem's post.)



Both proposals include ways to specify whether certain resources or whole
request groups can be forced to be sources from either a single provider or from
different providers.

In Viewpoint A, the proposal is to have a can_split=RESOURCE1,RESOURCE2 query
parameter that would indicate which resource classes in the unnumbered request
group that may be split across multiple providers (remember that viewpoint A
considers different request groups to explicitly mean different providers, so it
doesn't make sense to have a can_split query parameter for numbered request
groups).



In Viewpoint B, the proposal is to have a separate_providers=1,2 query parameter
that would indicate that the identified request groups should be sourced from
separate providers. Request groups that are not listed in the separate_providers
query parameter are not guaranteed to be sourced from different providers.


In either viewpoint, is there a way to represent "I want two resource groups, 
with resource X in each group coming from different resource providers 
(anti-affinity) and resource Y from the same resource provider (affinity)?
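
For example (hypothetical request syntax, combining granular groups with the
proposals quoted above), something like:

    GET /allocation_candidates
        ?resources1=SRIOV_NET_VF:1,MEMORY_MB:1024
        &resources2=SRIOV_NET_VF:1,MEMORY_MB:1024
        &separate_providers=1,2

would force the two VFs onto different providers, but it's not obvious how to
also say that the MEMORY_MB in both groups must come from the same provider.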


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [novaclient] invoking methods on the same client object in different theads caused malformed requests

2018-04-03 Thread Chris Friesen

On 04/03/2018 04:25 AM, Xiong, Huan wrote:

Hi,

I'm using a cloud benchmarking tool [1], which creates a *single* nova
client object in main thread and invoke methods on that object in different
worker threads. I find it generated malformed requests at random (my
system has python-novaclient 10.1.0 installed). The root cause was because
some methods in novaclient (e.g., those in images.py and networks.py)
changed client object's service_type. Since all threads shared a single
client object, the change caused other threads generated malformed requests
and hence the failure.

I wonder if this is a known issue for novaclient, or the above approach is
not supported?


In general, unless something says it is thread-safe you should assume it is not.

Chris


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Hard fail if you try to rename an AZ with instances in it?

2018-03-27 Thread Chris Friesen

On 03/27/2018 10:42 AM, Matt Riedemann wrote:

On 3/27/2018 10:37 AM, Jay Pipes wrote:

If we want to actually fix the issue once and for all, we need to make
availability zones a real thing that has a permanent identifier (UUID) and
store that permanent identifier in the instance (not the instance metadata).


Aggregates have a UUID now, exposed in microversion 2.41 (you added it). Is that
what you mean by AZs having a UUID, since AZs are modeled as host aggregates?

One of the alternatives in the spec is not relying on name as a unique
identifier and just make sure everything is held together via the aggregate
UUID, which is now possible.


If we allow non-unique availability zone names, we'd need to display the 
availability zone UUID in horizon when selecting an availability zone.


I think it'd make sense to still require the availability zone names to be 
unique, but internally store the availability zone UUID in the instance instead 
of the name.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [keystone] batch processing with unified limits

2018-03-08 Thread Chris Friesen

On 03/07/2018 06:10 PM, Lance Bragstad wrote:

The keystone team is parsing the unified limits discussions from last
week. One of the things we went over as a group was the usability of the
current API [0].

Currently, the create and update APIs support batch processing. So
specifying a list of limits is valid for both. This was a part of the
original proposal as a way to make it easier for operators to set all
their registered limits with a single API call. The API also has unique
IDs for each limit reference. The consensus was that this felt a bit
weird with a resource that contains a unique set of attributes that can
make up a constraints (service, resource type, and optionally a region).
We're discussing ways to make this API more consistent with how the rest
of keystone works while maintaining usability for operators. Does anyone
see issues with supporting batch creation for limits and individual
updates? In other words, removing the ability to update a set of limits
in a single API call, but keeping the ability to create them in batches?


I suspect this would cover the typical usecases we have for standing up new 
clouds or a new service within a cloud.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Openstack-sigs] [keystone] [oslo] new unified limit library

2018-03-07 Thread Chris Friesen

On 03/07/2018 10:44 AM, Tim Bell wrote:

I think nested quotas would give the same thing, i.e. you have a parent project
for the group and child projects for the users. This would not need user/group
quotas but continue with the ‘project owns resources’ approach.


Agreed, I think that if we support nested quotas with a suitable depth of 
nesting it could be used to handle the existing nova user/project quotas.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [keystone] [oslo] new unified limit library

2018-03-07 Thread Chris Friesen

On 03/07/2018 09:49 AM, Lance Bragstad wrote:



On 03/07/2018 09:31 AM, Chris Friesen wrote:

On 03/07/2018 08:58 AM, Lance Bragstad wrote:

Hi all,

Per the identity-integration track at the PTG [0], I proposed a new oslo
library for services to use for hierarchical quota enforcement [1]. Let
me know if you have any questions or concerns about the library. If the
oslo team would like, I can add an agenda item for next weeks oslo
meeting to discuss.

Thanks,

Lance

[0] https://etherpad.openstack.org/p/unified-limits-rocky-ptg


Looks interesting.

Some complications related to quotas:

1) Nova currently supports quotas for a user/group tuple that can be
stricter than the overall quotas for that group.  As far as I know no
other project supports this.

By group, do you mean keystone group? Or are you talking about the quota
associated to a project?


Sorry, typo.  I meant  quotas for a user/project tuple, which can be stricter 
than the overall quotas for that project.



2) Nova and cinder also support the ability to set the "default" quota
class (which applies to any group that hasn't overridden their
quota).  Currently once it's set there is no way to revert back to the
original defaults.

This sounds like a registered limit [0], but again, I'm not exactly sure
what "group" means in this context. It sounds like group is supposed to
be a limit for a specific project?

[0]
https://docs.openstack.org/keystone/latest/admin/identity-unified-limits.html#registered-limits


Again, should be project instead of group.  And registered limits seem 
essentially analogous.




3) Neutron allows you to list quotas for projects with non-default
quota values.  This is useful, and I'd like to see it extended to
optionally just display the non-default quota values rather than all
quota values for that project.  If we were to support user/group
quotas this would be the only way to efficiently query which
user/group tuples have non-default quotas.

This might be something we can work into the keystone implementation
since it's still marked as experimental [1]. We have two APIs, one
returns the default limits, also known as a registered limit, for a
resource and one that returns the project-specific overrides. It sounds
like you're interested in the second one?

[1]
https://developer.openstack.org/api-ref/identity/v3/index.html#unified-limits


Again, should be user/project tuples.  Yes, in this case I'm talking about the 
project-specific ones.  (It's actually worse if you support user/project limits 
since with the current nova API you can potentially get combinatorial explosion 
if many users are part of many projects.)


I think it would be useful to be able to constrain this query to report limits 
for a specific project, (and a specific user if that will be supported.)


I also think it would be useful to be able to constrain it to report only the 
limits that have been explicitly set (rather than inheriting the default from 
the project or the registered limit).  Maybe it's already intended to work this 
way--if so that should be explicitly documented.



4) In nova, keypairs belong to the user rather than the project.
(This is a bit messed up, but is the current behaviour.)  The quota
for these should really be outside of any group, or else we should
modify nova to make them belong to the project.

I think the initial implementation of a unified limit pattern is
targeting limits and quotas for things associated to projects. In the
future, we can probably expand on the limit information in keystone to
include user-specific limits, which would be great if nova wants to move
away from handling that kind of stuff.


The quota handling for keypairs is a bit messed up in nova right now, but it's 
legacy behaviour at this point.  It'd be nice to be able to get it right if 
we're switching to new quota management mechanisms.


Chris


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [keystone] [oslo] new unified limit library

2018-03-07 Thread Chris Friesen

On 03/07/2018 10:33 AM, Tim Bell wrote:

Sorry, I remember more detail now... it was using the 'owner' of the VM as part 
of the policy rather than quota.

Is there a per-user/per-group quota in Nova?


Nova supports setting quotas for individual users within a project (as long as 
they are smaller than the project quota for that resource).  I'm not sure how 
much it's actually used, or if they want to get rid of it.  (Maybe melwitt can 
chime in.)  But it's there now.


As you can see at 
"https://developer.openstack.org/api-ref/compute/#update-quotas;, there's an 
optional "user_id" field in the request.  Same thing for the "delete" and 
"detailed get" operations.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [keystone] [oslo] new unified limit library

2018-03-07 Thread Chris Friesen

On 03/07/2018 08:58 AM, Lance Bragstad wrote:

Hi all,

Per the identity-integration track at the PTG [0], I proposed a new oslo
library for services to use for hierarchical quota enforcement [1]. Let
me know if you have any questions or concerns about the library. If the
oslo team would like, I can add an agenda item for next weeks oslo
meeting to discuss.

Thanks,

Lance

[0] https://etherpad.openstack.org/p/unified-limits-rocky-ptg


Looks interesting.

Some complications related to quotas:

1) Nova currently supports quotas for a user/group tuple that can be stricter 
than the overall quotas for that group.  As far as I know no other project 
supports this.


2) Nova and cinder also support the ability to set the "default" quota class 
(which applies to any group that hasn't overridden their quota).  Currently once 
it's set there is no way to revert back to the original defaults.


3) Neutron allows you to list quotas for projects with non-default quota values. 
 This is useful, and I'd like to see it extended to optionally just display the 
non-default quota values rather than all quota values for that project.  If we 
were to support user/group quotas this would be the only way to efficiently 
query which user/group tuples have non-default quotas.


4) In nova, keypairs belong to the user rather than the project.  (This is a bit 
messed up, but is the current behaviour.)  The quota for these should really be 
outside of any group, or else we should modify nova to make them belong to the 
project.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [libvrit] Can QEMU or LIBVIRT know VM is powering-off

2018-02-21 Thread Chris Friesen

On 02/21/2018 03:19 PM, Kwan, Louie wrote:

When turning off a VM by doing nova stop, the Status and Task State are there
for Nova. But can libvirt/qemu programmatically figure out the ‘Task State’,
i.e. that the VM is powering off?

For libvirt, it seems to only know the “Power State”? Is there any way to read
the “powering-off” info?


The fact that you have asked nova to power off the instance means nothing to 
libvirt/qemu.


In the "nova stop" case nova will do some housekeeping stuff, optionally tell 
libvirt to shutdown the domain cleanly, then tell libvirt to destroy the domain, 
then do more housekeeping stuff.
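
Roughly, this is just a sketch of the underlying libvirt calls using the
libvirt-python bindings (not nova's actual code path, and the domain name is
made up):

    import time
    import libvirt

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('instance-00000001')  # hypothetical domain name

    dom.shutdown()           # ask the guest OS to power off cleanly (ACPI)
    for _ in range(60):      # roughly what nova's shutdown_timeout controls
        if not dom.isActive():
            break
        time.sleep(1)

    if dom.isActive():
        dom.destroy()        # hard power-off if the guest didn't comply

    conn.close()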


Chris


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][Kingbird]Multi-Region Orchestrator

2018-02-07 Thread Chris Friesen

On 02/05/2018 06:33 PM, Jay Pipes wrote:


It does seem to me, however, that if the intention is *not* to get into the
multi-cloud orchestration game, that a simpler solution to this multi-region
OpenStack deployment use case would be to simply have a global Glance and
Keystone infrastructure that can seamlessly scale to multiple regions.

That way, there'd be no need for replicating anything.


One use-case I've seen for this sort of thing is someone that has multiple 
geographically-separate clouds, and maybe they want to run the same heat stack 
in all of them.


So they can use global glance/keystone, but they need to ensure that they have 
the right flavor(s) available in all the clouds.  This needs to be done by the 
admin user, so it can't be done as part of the normal user's heat stack.


Something like "create a keypair in each of the clouds with the same public key 
and same name" could be done by the end user with some coding, but it's 
convenient to have a tool to do it for you.
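
e.g. something like (the cloud names from clouds.yaml are made up):

    openstack --os-cloud region-one keypair create --public-key ~/.ssh/id_rsa.pub mykey
    openstack --os-cloud region-two keypair create --public-key ~/.ssh/id_rsa.pub mykey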


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][osc] How to deal with add/remove fixed/floating CLIs after novaclient 10.0.0?

2018-01-30 Thread Chris Friesen

On 01/30/2018 09:15 AM, Matt Riedemann wrote:

The 10.0.0 release of python-novaclient dropped some deprecated CLIs and python
API bindings for the server actions to add/remove fixed and floating IPs:

https://docs.openstack.org/releasenotes/python-novaclient/queens.html#id2

python-openstackclient was using some of those python API bindings from
novaclient which now no longer work:

https://bugs.launchpad.net/python-openstackclient/+bug/1745795




Is there a plan going forward to ensure that python-novaclient and OSC are on 
the same page as far as deprecating CLIs and API bindings?


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] heads up to users of Aggregate[Core|Ram|Disk]Filter: behavior change in >= Ocata

2018-01-19 Thread Chris Friesen

On 01/18/2018 02:54 PM, Mathieu Gagné wrote:


We use this feature to segregate capacity/hosts based on CPU
allocation ratio using aggregates.
This is because we have different offers/flavors based on those
allocation ratios. This is part of our business model.
A flavor extra_specs is use to schedule instances on appropriate hosts
using AggregateInstanceExtraSpecsFilter.

Our setup has a configuration management system and we use aggregates
exclusively when it comes to allocation ratio.
We do not rely on cpu_allocation_ratio config in nova-scheduler or nova-compute.
One of the reasons is we do not wish to have to
update/package/redeploy our configuration management system just to
add one or multiple compute nodes to an aggregate/capacity pool.
This means anyone (likely an operator or other provisioning
technician) can perform this action without having to touch or even
know about our configuration management system.
We can also transfer capacity from one aggregate to another if there
is a need, again, using aggregate memberships. (we do "evacuate" the
node if there are instances on it)
Our capacity monitoring is based on aggregate memberships and this
offer an easy overview of the current capacity. Note that a host can
be in one and only one aggregate in our setup.


The existing mechanisms to control aggregate membership will still work, so the 
remaining issue is how to control the allocation ratios.


What about implementing a new HTTP API call (as a local private patch) to set 
the allocation ratios for a given host?  This would only be valid for your 
scenario where a given host is only present in a single aggregate, but it would 
allow your techs to modify the ratios.
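
e.g. something along the lines of (purely hypothetical, this API doesn't exist
upstream):

    PUT /os-hypervisors/{hypervisor_id}/allocation-ratios
    {"cpu_allocation_ratio": 4.0, "ram_allocation_ratio": 1.5}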


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [ironic] Booting IPA from cinder: Was: Summary of ironic sessions from Sydney

2017-11-24 Thread Chris Friesen

On 11/24/2017 10:23 AM, Julia Kreger wrote:

Greetings Michael,

I believe It would need to involve multiple machines at the same time.

I guess there are two different approaches that I think _could_ be
taken to facilitate this:

1) Provide a facility to use a specific volume as the "golden volume"
to boot up for IPA, and then initiate copies of that volume. The
downside that I see is the act of either copying the volume, or
presenting a snapshot of it that will be deleted a little later. I
think that is really going to depend on the back-end, and if the
backend can handle it or not. :\


Don't most reasonable backends support copy-on-write for volumes?  If they do, 
then creating a mostly-read copy of the volume should be low-overhead.
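
e.g. with the RBD backend I believe something like this ends up as a COW clone
on the ceph side (volume ID and size are made up):

    cinder create --source-volid <golden-volume-id> 10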


Chris


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Upstream LTS Releases

2017-11-14 Thread Chris Friesen

On 11/14/2017 02:10 PM, Doug Hellmann wrote:

Excerpts from Chris Friesen's message of 2017-11-14 14:01:58 -0600:

On 11/14/2017 01:28 PM, Dmitry Tantsur wrote:


The quality of backported fixes is expected to be a direct (and only?)
interest of those new teams of new cores, coming from users and operators and
vendors.


I'm not assuming bad intentions, not at all. But there is a lot involved in a
decision whether to make a backport or not. Will these people be able to
evaluate a risk of each patch? Do they have enough context on how that release
was implemented and what can break? Do they understand why feature backports are
bad? Why they should not skip (supported) releases when backporting?

I know a lot of very reasonable people who do not understand the things above
really well.


I would hope that the core team for upstream LTS would be the (hopefully
experienced) people doing the downstream work that already happens within the
various distros.

Chris



Presumably those are the same people we've been trying to convince
to work on the existing stable branches for the last 5 years. What
makes these extended branches more appealing to those people than
the existing branches? Is it the reduced requirements on maintaining
test jobs? Or maybe some other policy change that could be applied
to the stable branches?



For what it's worth, we often lag more than 6 months behind master and so some 
of the things we backport wouldn't be allowed by the existing stable branch 
support phases.  (ie aren't "critical" or security patches.)


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Upstream LTS Releases

2017-11-14 Thread Chris Friesen

On 11/14/2017 01:28 PM, Dmitry Tantsur wrote:


The quality of backported fixes is expected to be a direct (and only?)
interest of those new teams of new cores, coming from users and operators and
vendors.


I'm not assuming bad intentions, not at all. But there is a lot involved in a
decision whether to make a backport or not. Will these people be able to
evaluate a risk of each patch? Do they have enough context on how that release
was implemented and what can break? Do they understand why feature backports are
bad? Why they should not skip (supported) releases when backporting?

I know a lot of very reasonable people who do not understand the things above
really well.


I would hope that the core team for upstream LTS would be the (hopefully 
experienced) people doing the downstream work that already happens within the 
various distros.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Upstream LTS Releases

2017-11-14 Thread Chris Friesen

On 11/14/2017 10:25 AM, Doug Hellmann wrote:

Why
would we have third-party jobs on an old branch that we don't have on
master, for instance?


One possible reason is to test the stable version of OpenStack against a stable 
version of the underlying OS distro.   (Where that distro may not meet the 
package version requirements for running master.)


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Interesting bug when unshelving an instance in an AZ and the AZ is gone

2017-10-16 Thread Chris Friesen

On 10/16/2017 09:22 AM, Matt Riedemann wrote:


2. Should we null out the instance.availability_zone when it's shelved offloaded
like we do for the instance.host and instance.node attributes? Similarly, we
would not take into account the RequestSpec.availability_zone when scheduling
during unshelve. I tend to prefer this option because once you unshelve offload
an instance, it's no longer associated with a host and therefore no longer
associated with an AZ.


This statement isn't true in the case where the user specifically requested a 
non-default AZ at boot time.



However, is it reasonable to assume that the user doesn't
care that the instance, once unshelved, is no longer in the originally requested
AZ? Probably not a safe assumption.


If they didn't request a non-default AZ then I think we could remove it.


3. When a user unshelves, they can't propose a new AZ (and I don't think we want
to add that capability to the unshelve API). So if the original AZ is gone,
should we automatically remove the RequestSpec.availability_zone when
scheduling? I tend to not like this as it's very implicit and the user could see
the AZ on their instance change before and after unshelve and be confused.


I think allowing the user to specify an AZ on unshelve might be a reasonable 
option.  Or maybe just allow modifying the AZ of a shelved instance without 
unshelving it via a PUT on /servers/{server_id}.



4. We could simply do nothing about this specific bug and assert the behavior is
correct. The user requested an instance in a specific AZ, shelved that instance
and when they wanted to unshelve it, it's no longer available so it fails. The
user would have to delete the instance and create a new instance from the shelve
snapshot image in a new AZ.


I'm inclined to feel that this is operator error.  If they want to delete an AZ 
that has shelved instances then they should talk with their customers and update 
the stored AZ in the DB to a new "valid" one.  (Though currently this would 
require manual DB operations.)


If we implemented Sylvain's spec in #1 above, maybe

we don't have this problem going forward since you couldn't remove/delete an AZ
when there are even shelved offloaded instances still tied to it.


I kind of think it would be okay to disallow deleting AZs with shelved instances 
in them.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Should we make rebuild + new image on a volume-backed instance fail fast?

2017-10-06 Thread Chris Friesen

On 10/06/2017 11:32 AM, Mathieu Gagné wrote:

Why not support this use case?


I don't think anyone is suggesting we shouldn't support it, but nobody has stepped up 
to actually merge a change that implements it.


I think what Matt is suggesting is that we make it fail fast *now*, and if 
someone else implements it then they can remove the fast failure at the same time.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] What is the goal of AggregateImagePropertiesIsolation filter?

2017-10-05 Thread Chris Friesen

On 10/05/2017 03:47 AM, Kekane, Abhishek wrote:


So the question here is, what is the exact goal of the
'AggregateImagePropertiesIsolation' scheduler filter? Is it one of the 
following:

1. Matching all metadata of host aggregate with image properties.

2. Matching image properties with host aggregate metadata.

If we decide that the actual goal of the 'AggregateImagePropertiesIsolation'
filter is as pointed out in #1, then a small correction is required to return
False if the image property is not present in the host aggregate metadata.


The name of the filter includes "Isolation", so I think the intent was to 
accomplish #1.  However, as you point out it only fails the filter if both the 
aggregate and the image have the same key but different values and so the 
isolation is imperfect.
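
For example (hypothetical aggregate name), with:

    openstack aggregate set --property os_distro=windows windows-hosts

an image that simply doesn't set os_distro at all will still pass the filter
for hosts in that aggregate, so the "isolation" only kicks in when the image
sets os_distro to some other value.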


At the same time we have the AggregateInstanceExtraSpecsFilter (which ensures 
that any keys in the flavor extra-specs must be present in the aggregate).


Since keys can be specified in either the flavor or the image, it could be 
confusing that the behaviour is different between these two filters.  At the 
same time we don't want to break existing users by modifying the behaviour of 
the existing filters.  Given this it might make sense to create a new filter 
which unifies the checks and behaves the same whether the key is specified in 
the image or the flavor, with some way to toggle whether we want strict 
isolation or not (so that we can ensure only "special" flavors/images are 
allowed to use aggregates with specific limited resources).


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Running large instances with CPU pinning and OOM

2017-09-28 Thread Chris Friesen

On 09/28/2017 05:29 AM, Sahid Orentino Ferdjaoui wrote:


Only the memory mapped for the guest is strictly allocated from the
NUMA node selected. The QEMU overhead should float on the host NUMA
nodes. So it seems that the "reserved_host_memory_mb" is enough.


What I see in the code/docs doesn't match that, but it's entirely possible I'm 
missing something.


nova uses LibvirtConfigGuestNUMATuneMemory with a mode of "strict" and a nodeset 
of "the host NUMA nodes used by a guest".


For a guest with a single NUMA node, I think this would map to libvirt XML of 
something like:

  <numatune>
    <memory mode="strict" nodeset="0"/>
  </numatune>

The docs at https://libvirt.org/formatdomain.html#elementsNUMATuning say, "The 
optional memory element specifies how to allocate memory for the domain process 
on a NUMA host."


That seems to me that the qemu overhead would be NUMA-affined, no?  (If you had 
a multi-NUMA-node guest, then the qemu overhead would float across all the NUMA 
nodes used by the guest.)


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Running large instances with CPU pinning and OOM

2017-09-27 Thread Chris Friesen

On 09/27/2017 04:55 PM, Blair Bethwaite wrote:

Hi Prema

On 28 September 2017 at 07:10, Premysl Kouril  wrote:

Hi, I work with Jakub (the op of this thread) and here is my two
cents: I think what is critical to realize is that KVM virtual
machines can have substantial memory overhead of up to 25% of memory,
allocated to KVM virtual machine itself. This overhead memory is not


I'm curious what sort of VM configuration causes such high overheads,
is this when using highly tuned virt devices with very large buffers?


For what it's worth we ran into issues a couple years back with I/O to 
RBD-backed disks in writethrough/writeback.  There was a bug that allowed a very 
large number of in-flight operations if the ceph server couldn't keep up with 
the aggregate load.  We hacked a local solution, I'm not sure if it's been dealt 
with upstream.


I think virtio networking has also caused issues, though not as bad.  (But 
noticeable when running close to the line.)


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Running large instances with CPU pinning and OOM

2017-09-27 Thread Chris Friesen

On 09/27/2017 03:10 PM, Premysl Kouril wrote:

Lastly, qemu has overhead that varies depending on what you're doing in the
guest.  In particular, there are various IO queues that can consume
significant amounts of memory.  The company that I work for put in a good
bit of effort engineering things so that they work more reliably, and part
of that was determining how much memory to reserve for the host.

Chris


Hi, I work with Jakub (the op of this thread) and here is my two
cents: I think what is critical to realize is that KVM virtual
machines can have substantial memory overhead of up to 25% of memory,
allocated to KVM virtual machine itself. This overhead memory is not
considered in nova code when calculating if the instance being
provisioned actually fits into host's available resources (only the
memory, configured in instance's flavor is considered). And this is
especially being a problem when CPU pinning is used as the memory
allocation is bounded by limits of specific NUMA node (due to the
strict memory allocation mode). This renders the global reservation
parameter reserved_host_memory_mb useless as it doesn't take NUMA into
account.

This KVM virtual machine overhead is what is causing the OOMs in our
infrastructure and that's what we need to fix.


Feel free to report a bug against nova...maybe reserved_host_memory_mb should be 
a list of per-numa-node values.
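
e.g. something like (purely hypothetical syntax, not a current nova option
format):

    [DEFAULT]
    # hypothetical: reserve 4G of host memory on NUMA node 0 and 2G on node 1
    reserved_host_memory_mb = 4096,2048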


It's a bit of a hack, but if you use hugepages for all the guests you can 
control the amount of per-numa-node memory reserved for host overhead.


Since the kvm overhead memory is allocated from 4K pages (in my experience) you 
can just choose to leave some memory on each host NUMA node as 4K pages instead 
of allocating them as hugepages.
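
For example, on a host with (say) 128GB per NUMA node, reserving 61440 2MB
hugepages per node leaves roughly 8GB of 4K pages on each node for host and
qemu overhead:

    echo 61440 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
    echo 61440 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

(Or do the equivalent at boot time so the pages aren't fragmented.)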


Chris


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Running large instances with CPU pinning and OOM

2017-09-27 Thread Chris Friesen

On 09/27/2017 08:01 AM, Blair Bethwaite wrote:

On 27 September 2017 at 23:19, Jakub Jursa  wrote:

'hw:cpu_policy=dedicated' (while NOT setting 'hw:numa_nodes') results in
libvirt pinning CPU in 'strict' memory mode

(from libvirt xml for given instance)
...
  <numatune>
    <memory mode='strict' nodeset='0'/>
    <memnode cellid='0' mode='strict' nodeset='0'/>
  </numatune>
...

So yeah, the instance is not able to allocate memory from another NUMA node.


I can't recall what the docs say on this but I wouldn't be surprised
if that was a bug. Though I do think most users would want CPU & NUMA
pinning together (you haven't shared your use case but perhaps you do
too?).


Not a bug.  Once you enable CPU pinning we assume you care about performance, 
and for max performance you need NUMA affinity as well.  (And hugepages are 
beneficial too.)



I'm not quite sure what do you mean by 'memory will be locked for the
guest'. Also, aren't huge pages enabled in kernel by default?


I think that suggestion was probably referring to static hugepages,
which can be reserved (per NUMA node) at boot and then (assuming your
host is configured correctly) QEMU will be able to back guest RAM with
them.


One nice thing about static hugepages is that you pre-allocate them at startup, 
so you can decide on a per-NUMA-node basis how much 4K memory you want to leave 
for incidental host stuff and qemu overhead.  This lets you specify different 
amounts of "host-reserved" memory on different NUMA nodes.


In order to use static hugepages for the guest you need to explicitly ask for a 
page size of 2MB.  (1GB is possible as well but in most cases doesn't buy you 
much compared to 2MB.)
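
e.g. something like (the flavor name is made up):

    nova flavor-key pinned.large set hw:cpu_policy=dedicated hw:mem_page_size=2048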


Lastly, qemu has overhead that varies depending on what you're doing in the 
guest.  In particular, there are various IO queues that can consume significant 
amounts of memory.  The company that I work for put in a good bit of effort 
engineering things so that they work more reliably, and part of that was 
determining how much memory to reserve for the host.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Running large instances with CPU pinning and OOM

2017-09-27 Thread Chris Friesen

On 09/27/2017 03:12 AM, Jakub Jursa wrote:



On 27.09.2017 10:40, Blair Bethwaite wrote:

On 27 September 2017 at 18:14, Stephen Finucane  wrote:

What you're probably looking for is the 'reserved_host_memory_mb' option. This
defaults to 512 (at least in the latest master) so if you up this to 4192 or
similar you should resolve the issue.


I don't see how this would help given the problem description -
reserved_host_memory_mb would only help avoid causing OOM when
launching the last guest that would otherwise fit on a host based on
Nova's simplified notion of memory capacity. It sounds like both CPU
and NUMA pinning are in play here, otherwise the host would have no
problem allocating RAM on a different NUMA node and OOM would be
avoided.


I'm not quite sure if/how OpenStack handles NUMA pinning (why is VM
being killed by OOM rather than having memory allocated on different
NUMA node). Anyway, good point, thank you, I should have a look at exact
parameters passed to QEMU when using CPU pinning.


OpenStack uses strict memory pinning when using CPU pinning and/or memory 
hugepages, so all allocations are supposed to be local.  When it can't allocate 
locally, it triggers OOM.


Chris


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Is there any reason to exclude originally failed build hosts during live migration?

2017-09-20 Thread Chris Friesen

On 09/20/2017 12:47 PM, Matt Riedemann wrote:


I wanted to bring it up here in case anyone had a good reason why we should not
continue to exclude originally failed hosts during live migration, even if the
admin is specifying one of those hosts for the live migration destination.

Presumably there was a good reason why the instance failed to build on a host
originally, but that could be for any number of reasons: resource claim failed
during a race, configuration issues, etc. Since we don't really know what
originally happened, it seems reasonable to not exclude originally attempted
build targets since the scheduler filters should still validate them during live
migration (this is all assuming you're not using the 'force' flag with live
migration - and if you are, all bets are off).


As you say, a failure on a host during the original instance creation (which 
could have been a long time ago) is not a reason to bypass that host during 
subsequent operations.


In other words, I think the list of hosts to ignore should be scoped to a single 
"operation" that requires scheduling (which would include any necessary 
rescheduling for that "operation").


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Should we add the 'force' option to the cold migrate API too?

2017-08-30 Thread Chris Friesen

On 08/30/2017 10:56 AM, Matt Riedemann wrote:

On 8/30/2017 11:35 AM, Chris Friesen wrote:

(We might even want to fail a live migration/evacuation with a forced
destination that could cause a conflict in these non-shareable resources, but
that'd be a behaviour change and therefore a new microversion.)


That's https://bugs.launchpad.net/nova/+bug/1427772 I believe, and I don't think
we should need a microversion to fix broken behavior in the backend. As noted,
even with a forced host live migration, we still do some things like the
ComputeFilter and RamFilter checks within conductor itself.


I think you're correct, and that bug seems like it would cover the desired 
behaviour.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] Should we add the 'force' option to the cold migrate API too?

2017-08-30 Thread Chris Friesen

On 08/30/2017 09:09 AM, Matt Riedemann wrote:

Given the recent bugs [1][2] due to the force flag in the live migrate and
evacuate APIs related to Placement, and some other long standing bugs about
bypassing the scheduler [3], I don't think we should add the force option to the
cold migrate API, as (re-)proposed in Takashi's cold migrate spec here [4].

I'm fine with being able to specify a host during cold migrate/resize, but I
think the specified host should be validated by the scheduler (and placement) so
that the instance can actually move to that specified destination host.

Since we've built more logic into the scheduler in Pike for integration with
Placement, bypassing that gets us into maintenance issues with having to
duplicate code throughout conductor and just in general, seems like a bad idea
to force a host and bypass the scheduler and potentially break the instance. Not
to mention the complicated logic of passing the host through from the API to
conductor to the scheduler is it's own maintenance problem [5].


I completely agree with all of this.

Now that nova properly tracks non-shareable resources over cold migration 
(things like hugepages and PCI devices that cannot be shared) it really doesn't 
make sense to bypass the scheduler since it could end up seriously confusing the 
resource tracking mechanisms.


(We might even want to fail a live migration/evacuation with a forced 
destination that could cause a conflict in these non-shareable resources, but 
that'd be a behaviour change and therefore a new microversion.)


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Horizon][Nova] Editing flavor causing instance flavor display error

2017-08-03 Thread Chris Friesen

On 08/03/2017 04:21 AM, Sean Dague wrote:

On 08/03/2017 06:13 AM, Zhenyu Zheng wrote:

I was thinking, the current "edit" in Horizon is delete-and-create, and
it is there maybe just because
flavor has many fields, user may want to have a new flavor but just
modify one of the old flavor, so
they don't want to manually copy all other fields. And it is the
automatic delete action that causes
all the problem. Maybe horizon can provide a copy-and-modify action and
leave the deletion of
the flavor to the admin.


For what it is worth, it is already an admin level permission.

I do think that "Copy and Modify" is a better paradigm. Or "New Flavor
based on X" which will prefill based on an existing one.

The Delete flavor button should come with a giant warning of "This will
make a lot of information in your environment confusing, you should
never do this".


The same could also be said for flavor extra-specs (which can be modified 
in-place).  Once they're configured and the flavor is made public it would be 
best to leave them untouched; otherwise you could end up confusing people if 
the behaviour of the "same" flavor changes due to extra-specs changes.


Chris


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][docs] Concerns with docs migration

2017-08-02 Thread Chris Friesen

On 08/02/2017 09:22 AM, Stephen Finucane wrote:

On Wed, 2017-08-02 at 09:55 -0500, Matt Riedemann wrote:



3. The patch for the import of the admin guide [8] is missing some CLI
specific pages which are pretty useful given they aren't documented
anywhere else, like the forced_host part of the compute API [9].
Basically anything that's cli-nova-* in the admin guide should be in the
Nova docs. It's also missing the compute-flavors page [10] which is
pretty important for using OpenStack at all.


This is a tricky one. Based on previous discussions with dhellmann, the plan
seems to be to replace any references to 'nova xxx' or 'openstack xxx' commands
(i.e. commands using python-novaclient or python-openstackclient) in favour of
'curl'-based requests. The idea here is that the Python clients are not the
only clients available, and we shouldn't be "mandating" their use by
referencing them in the docs. I get this, though I don't fully agree with it
(who really uses curl?)


Are we going to document the python clients elsewhere then?  Personally I find 
it highly useful to have complete examples of how to do things with 
python-novaclient or python-openstackclient.


Given that any users of the raw HTTP API are likely going to be developers, 
while users of the CLI tools may not be, it seems more important to give good 
examples of using the CLI tools.  Any developer should be able to figure out the 
underlying HTTP (using the --debug option of the CLI tool if necessary).


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Help regarding VM Migration

2017-07-26 Thread Chris Friesen

On 07/25/2017 10:21 PM, Ziad Nayyer wrote:

Can anybody help me out regarding VM migration between two devstacks installed
on two different physical machines? Hot or cold?



Are you configured as per 
https://docs.openstack.org/devstack/latest/guides/multinode-lab.html ?


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] realtime kvm cpu affinities

2017-06-30 Thread Chris Friesen

On 06/30/2017 07:06 AM, sfinu...@redhat.com wrote:

On Thu, 2017-06-29 at 12:20 -0600, Chris Friesen wrote:

On 06/29/2017 10:59 AM, sfinu...@redhat.com wrote:



  From the above, there are 3-4 work items:

- Add a 'emulator_pin_set' or 'cpu_emulator_threads_mask' configuration
option

- If using a mask, rename 'vcpu_pin_set' to 'pin_set' (or, better,
  'usable_cpus')

- Add a 'emulator_overcommit_ratio', which will do for emulator threads
what
the other ratios do for vCPUs and memory


If we were going to support "emulator_overcommit_ratio", then we wouldn't
necessarily need an explicit mask/set as a config option. If someone wants
to run with 'hw:emulator_thread_policy=isolate' and we're below the
overcommit ratio then we run it, otherwise nova could try to allocate a new
pCPU to add to the emulator_pin_set internally tracked by nova.  This would
allow for the number of pCPUs in emulator_pin_set to vary depending on the
number of instances with 'hw:emulator_thread_policy=isolate' on the compute
node, which should allow for optimal packing.


So we'd now mark pCPUs not only as used, but also as used for a specific
purpose? That would probably be more flexible than using a static pool of CPUs,
particularly if instances are heterogeneous. I'd imagine it would, however, be
much tougher to do right. I need to think on this.


I think you could do it with a new "emulator_cpus" field in NUMACell, and a new 
"emulator_pcpu" field in InstanceNUMACell.



As an aside, what would we do about billing? Currently we include CPUs used for
emulator threads as overhead. Would this change?


We currently have local changes to allow instances with "shared" and "dedicated" 
CPUs to coexist on the same compute node.  For CPU usage, "dedicated" CPUs count 
as "1", and "shared" CPUs count as 1/cpu_overcommit_ratio.  That way the total 
CPU usage can never exceed the number of available CPUs.


You could follow this model and bill for an extra 1/emulator_overcommit_ratio 
worth of a CPU for instances with 'hw:emulator_thread_policy=isolate'.
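
As a sketch of that accounting (this mirrors our local scheme rather than
anything upstream, and the two ratios below are made-up example values):

    def cpu_cost(dedicated_vcpus, shared_vcpus,
                 cpu_overcommit_ratio=16.0,
                 emulator_isolated=False, emulator_overcommit_ratio=8.0):
        # dedicated vCPUs consume a whole pCPU each, shared vCPUs a fraction
        cost = dedicated_vcpus + shared_vcpus / cpu_overcommit_ratio
        if emulator_isolated:
            # 'hw:emulator_thread_policy=isolate' pays for a share of one
            # extra pCPU to cover its emulator threads
            cost += 1.0 / emulator_overcommit_ratio
        return cost

    # a 4-vCPU dedicated guest with isolated emulator threads:
    print(cpu_cost(4, 0, emulator_isolated=True))    # 4.125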



- Deprecate 'hw:emulator_thread_policy'???


I'm not sure we need to deprecate it, it would instead signify whether the
emulator threads should be isolated from the vCPU threads.  If set to
"isolate" then they would run on the emulator_pin_set identified above
(potentially sharing them with emulator threads from other instances) rather
than each instance getting a whole pCPU for its emulator threads.


I'm confused, I thought we weren't going to need 'emulator_pin_set'?


I meant whatever field we use internally to track which pCPUs are currently 
being used to run emulator threads as opposed to vCPU threads.  (ie the 
"emulator_cpus" field in NUMACell suggested above.


In any case, it's probably less about deprecating the extra spec and instead
changing how things work under the hood. We'd actually still want something to
signify "I want my emulator overhead accounted for separately".


Agreed.

Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] realtime kvm cpu affinities

2017-06-29 Thread Chris Friesen

On 06/29/2017 10:59 AM, sfinu...@redhat.com wrote:


Thus far, we've no clear conclusions on directions to go, so I've took a stab
below. Henning, Sahid, Chris: does the above/below make sense, and is there
anything we need to further clarify?


The above is close enough. :)


# Problem 1

 From the above, there are 3-4 work items:

- Add a 'emulator_pin_set' or 'cpu_emulator_threads_mask' configuration option

   - If using a mask, rename 'vcpu_pin_set' to 'pin_set' (or, better,
 'usable_cpus')

- Add a 'emulator_overcommit_ratio', which will do for emulator threads what
   the other ratios do for vCPUs and memory


If we were going to support "emulator_overcommit_ratio", then we wouldn't 
necessarily need an explicit mask/set as a config option. If someone wants to 
run with 'hw:emulator_thread_policy=isolate' and we're below the overcommit 
ratio then we run it, otherwise nova could try to allocate a new pCPU to add to 
the emulator_pin_set internally tracked by nova.  This would allow for the 
number of pCPUs in emulator_pin_set to vary depending on the number of instances 
with 'hw:emulator_thread_policy=isolate' on the compute node, which should allow 
for optimal packing.



- Deprecate 'hw:emulator_thread_policy'???


I'm not sure we need to deprecate it, it would instead signify whether the 
emulator threads should be isolated from the vCPU threads.  If set to "isolate" 
then they would run on the emulator_pin_set identified above (potentially 
sharing them with emulator threads from other instances) rather than each 
instance getting a whole pCPU for its emulator threads.



# Problem 2

No clear conclusions yet?


I don't see any particular difficulty in supporting both RT and non-RT instances 
on the same host with one nova-compute process.  It might even be valid for a 
high-performance VM to make use of 'hw:emulator_thread_policy=isolate' without 
enabling RT.  (Which is why I've been careful not to imply RT in the description 
above.)


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][tc] How to deal with confusion around "hosted projects"

2017-06-29 Thread Chris Friesen

On 06/29/2017 09:23 AM, Monty Taylor wrote:


We are already WELL past where we can solve the problem you are describing.
Pandora's box has been opened - we have defined ourselves as an Open community.
Our only requirement to be official is that you behave as one of us. There is
nothing stopping those machine learning projects from becoming official. If they
did become official but were still bad software - what would we have solved?

We have a long-time official project that currently has staffing problems. If
someone Googles for OpenStack DBaaS and finds Trove and then looks to see that
the contribution rate has fallen off recently they could get the impression that
OpenStack is a bunch of dead crap.

Inclusion as an Official Project in OpenStack is not an indication that anyone
thinks the project is good quality. That's a decision we actively made. This is
the result.


I wonder if it would be useful to have a separate orthogonal status as to "level 
of stability/usefulness/maturity/quality" to help newcomers weed out projects 
that are under TC governance but are not ready for prime time.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [masakari][nova] Allow evacuation of instances in resized state

2017-06-28 Thread Chris Friesen

On 06/28/2017 05:50 AM, Kekane, Abhishek wrote:


In masakari, we are setting an instance to an error state if the vmstate is
resized before evacuating it to a new host.


Arguably the instance should be set to an error state as soon as you notice that 
the compute node is down.



Once an instance (which was in
resized state) is evacuated then it becomes active on the new host. The main
problem with this implementation from user’s point of view is the instance goes
into active state after evacuation, it should be in stopped state if the prior
action on the instance before resizing was stop. In masakari, It’s possible to
set the vm state to stopped state after evacuation but for a short period the
instance will go into the active state which is unacceptable.


That's a valid point, I think.


*Proposing changes to Nova:*

In the current nova code, if the instance is in stopped state before evacuation,
then it remains in the stopped state after evacuation is complete. On the
similar lines, we are proposing nova should allow instance to be evacuated in
resized state and after evacuation the instance should remain in stopped state
if the prior action on the instance is stopped before resizing.


The current nova code looks at the vm_state to decide whether or not it's 
allowable to evacuate, and while "stopped" is a valid state to evacuate from 
"resized" is not.  In your scenario it's both "stopped" *and* "resized" 
simultaneously, but there's no way to represent that in the vmstate so I think 
we'd have to check the power state, which would mean extending the 
check_instance_state() routine since it doesn't currently handle the power state.


The trickier question is how to handle the "resized" state...after evacuating an 
instance in the "resized" state should you be able to revert the resize?  If so, 
how would that work in the case where the instance was resized on the same host 
originally and that host is no longer available?  If not, then you'll end up 
with resources permanently reserved on the host the instance was on before the 
resize.  I suppose one option would be to auto-confirm the resize in the case of 
a resize-to-same-host, but that'll be tricky to process with the host not available.


Also, it should be noted that when rebuilding/evacuating a "stopped" instance 
the nova code just boots it up as normal and sets the vm_state to "active", then 
realizes that it's supposed to be stopped and sets the task_state to 
"powering_off" and goes down the normal path to stop the instance, eventually 
setting the vm_state to "stopped".  So you're still going to end up with the 
same state transitions as what you have now, though the timing will probably be 
a bit tighter.  If you really want a stopped instance to not actually start up 
on a rebuild/evacuate then that would be additional work.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] realtime kvm cpu affinities

2017-06-28 Thread Chris Friesen

On 06/28/2017 03:34 AM, Sahid Orentino Ferdjaoui wrote:

On Tue, Jun 27, 2017 at 04:00:35PM +0200, Henning Schild wrote:



As far as i remember it was not straight forward to get two novas onto
one host in the older release, i am not surprised that causing trouble
with the update to mitaka. If we agree on 2 novas and aggregates as the
recommended way we should make sure the 2 novas is a supported feature,
covered in test-cases and documented.
Dedicating a whole machine to either RT or nonRT would imho be no
viable option.


The realtime nodes should be isolated by aggregates as you seem to do.


Yes, with two novas on one machine. They share one libvirt using
different instrance-prefixes and have some other config options set, so
they do not collide on resources.


It's clearly not what I was suggesting, you should have 2 groups of
compute hosts. One aggregate with hosts for the non-RT VMs and an
other one for hosts with RT VMs.


Not all clouds are large enough to have an entire physical machine dedicated to 
RT VMs.  So Henning divided up the resources of the physical machine between two 
nova-compute instances and put them in separate aggregates.


It would be easier for operators if one single nova instance could manage both 
RT and non-RT instances on the same host (presumably running an RT OS).


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] realtime kvm cpu affinities

2017-06-27 Thread Chris Friesen

On 06/27/2017 09:36 AM, Henning Schild wrote:

Am Tue, 27 Jun 2017 09:28:34 -0600
schrieb Chris Friesen <chris.frie...@windriver.com>:



Once you use "isolcpus" on the host, the host scheduler won't "float"
threads between the CPUs based on load.  To get the float behaviour
you'd have to not isolate the pCPUs that will be used for emulator
threads, but then you run the risk of the host running other work on
those pCPUs (unless you use cpusets or something to isolate the host
work to a subset of non-isolcpus pCPUs).


With openstack you use libvirt and libvirt uses cgroups/cpusets to get
those threads onto these cores.


Right.  I misremembered.  We are currently using "isolcpus" on the compute node 
to isolate the pCPUs used for packet processing, but the pCPUs used for guests 
are not isolated.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] realtime kvm cpu affinities

2017-06-27 Thread Chris Friesen

On 06/27/2017 01:45 AM, Sahid Orentino Ferdjaoui wrote:

On Mon, Jun 26, 2017 at 12:12:49PM -0600, Chris Friesen wrote:

On 06/25/2017 02:09 AM, Sahid Orentino Ferdjaoui wrote:

On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:

On 06/23/2017 09:35 AM, Henning Schild wrote:

Am Fri, 23 Jun 2017 11:11:10 +0200
schrieb Sahid Orentino Ferdjaoui <sferd...@redhat.com>:



In Linux RT context, and as you mentioned, the non-RT vCPU can acquire
some guest kernel lock, then be pre-empted by emulator thread while
holding this lock. This situation blocks RT vCPUs from doing its
work. So that is why we have implemented [2]. For DPDK I don't think
we have such problems because it's running in userland.

So for DPDK context I think we could have a mask like we have for RT
and basically considering vCPU0 to handle best effort works (emulator
threads, SSH...). I think it's the current pattern used by DPDK users.


DPDK is just a library and one can imagine an application that has
cross-core communication/synchronisation needs where the emulator
slowing down vpu0 will also slow down vcpu1. You DPDK application would
have to know which of its cores did not get a full pcpu.

I am not sure what the DPDK-example is doing in this discussion, would
that not just be cpu_policy=dedicated? I guess normal behaviour of
dedicated is that emulators and io happily share pCPUs with vCPUs and
you are looking for a way to restrict emulators/io to a subset of pCPUs
because you can live with some of them beeing not 100%.


Yes.  A typical DPDK-using VM might look something like this:

vCPU0: non-realtime, housekeeping and I/O, handles all virtual interrupts
and "normal" linux stuff, emulator runs on same pCPU
vCPU1: realtime, runs in tight loop in userspace processing packets
vCPU2: realtime, runs in tight loop in userspace processing packets
vCPU3: realtime, runs in tight loop in userspace processing packets

In this context, vCPUs 1-3 don't really ever enter the kernel, and we've
offloaded as much kernel work as possible from them onto vCPU0.  This works
pretty well with the current system.


For RT we have to isolate the emulator threads to an additional pCPU
per guests or as your are suggesting to a set of pCPUs for all the
guests running.

I think we should introduce a new option:

 - hw:cpu_emulator_threads_mask=^1

If on 'nova.conf' - that mask will be applied to the set of all host
CPUs (vcpu_pin_set) to basically pack the emulator threads of all VMs
running here (useful for RT context).


That would allow modelling exactly what we need.
In nova.conf we are talking absolute known values, no need for a mask
and a set is much easier to read. Also using the same name does not
sound like a good idea.
And the name vcpu_pin_set clearly suggest what kind of load runs here,
if using a mask it should be called pin_set.


I agree with Henning.

In nova.conf we should just use a set, something like
"rt_emulator_vcpu_pin_set" which would be used for running the emulator/io
threads of *only* realtime instances.


I'm not agree with you, we have a set of pCPUs and we want to
substract some of them for the emulator threads. We need a mask. The
only set we need is to selection which pCPUs Nova can use
(vcpus_pin_set).


We may also want to have "rt_emulator_overcommit_ratio" to control how many
threads/instances we allow per pCPU.


Not really sure to have understand this point? If it is to indicate
that for a pCPU isolated we want X guest emulator threads, the same
behavior is achieved by the mask. A host for realtime is dedicated for
realtime, no overcommitment and the operators know the number of host
CPUs, they can easily deduct a ratio and so the corresponding mask.


Suppose I have a host with 64 CPUs.  I reserve three for host overhead and
networking, leaving 61 for instances.  If I have instances with one non-RT
vCPU and one RT vCPU then I can run 30 instances.  If instead my instances
have one non-RT and 5 RT vCPUs then I can run 12 instances.  If I put all of
my emulator threads on the same pCPU, it might make a difference whether I
put 30 sets of emulator threads or 12 sets.


Oh I understand your point now, but not sure that is going to make any
difference. I would say the load in the isolated cores is probably
going to be the same. Even that an overhead will be the number of
threads handled which will be slightly higher in your first scenario.


The proposed "rt_emulator_overcommit_ratio" would simply say "nova is
allowed to run X instances worth of emulator threads on each pCPU in
"rt_emulator_vcpu_pin_set".  If we've hit that threshold, then no more RT
instances are allowed to schedule on this compute node (but non-RT instances
would still be allowed).


Also I don't think we want to schedule where the emulator threads of
the guests should be pinned on the isolated cores. We will let them
float on the set of cores isolated. If there is a requiereme

Re: [openstack-dev] realtime kvm cpu affinities

2017-06-27 Thread Chris Friesen

On 06/27/2017 01:44 AM, Sahid Orentino Ferdjaoui wrote:

On Mon, Jun 26, 2017 at 10:19:12AM +0200, Henning Schild wrote:

Am Sun, 25 Jun 2017 10:09:10 +0200
schrieb Sahid Orentino Ferdjaoui <sferd...@redhat.com>:


On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:

On 06/23/2017 09:35 AM, Henning Schild wrote:

Am Fri, 23 Jun 2017 11:11:10 +0200
schrieb Sahid Orentino Ferdjaoui <sferd...@redhat.com>:



In Linux RT context, and as you mentioned, the non-RT vCPU can
acquire some guest kernel lock, then be pre-empted by emulator
thread while holding this lock. This situation blocks RT vCPUs
from doing its work. So that is why we have implemented [2].
For DPDK I don't think we have such problems because it's
running in userland.

So for DPDK context I think we could have a mask like we have
for RT and basically considering vCPU0 to handle best effort
works (emulator threads, SSH...). I think it's the current
pattern used by DPDK users.


DPDK is just a library and one can imagine an application that has
cross-core communication/synchronisation needs where the emulator
slowing down vpu0 will also slow down vcpu1. You DPDK application
would have to know which of its cores did not get a full pcpu.

I am not sure what the DPDK-example is doing in this discussion,
would that not just be cpu_policy=dedicated? I guess normal
behaviour of dedicated is that emulators and io happily share
pCPUs with vCPUs and you are looking for a way to restrict
emulators/io to a subset of pCPUs because you can live with some
of them beeing not 100%.


Yes.  A typical DPDK-using VM might look something like this:

vCPU0: non-realtime, housekeeping and I/O, handles all virtual
interrupts and "normal" linux stuff, emulator runs on same pCPU
vCPU1: realtime, runs in tight loop in userspace processing packets
vCPU2: realtime, runs in tight loop in userspace processing packets
vCPU3: realtime, runs in tight loop in userspace processing packets

In this context, vCPUs 1-3 don't really ever enter the kernel, and
we've offloaded as much kernel work as possible from them onto
vCPU0.  This works pretty well with the current system.


For RT we have to isolate the emulator threads to an additional
pCPU per guests or as your are suggesting to a set of pCPUs for
all the guests running.

I think we should introduce a new option:

- hw:cpu_emulator_threads_mask=^1

If on 'nova.conf' - that mask will be applied to the set of all
host CPUs (vcpu_pin_set) to basically pack the emulator threads
of all VMs running here (useful for RT context).


That would allow modelling exactly what we need.
In nova.conf we are talking absolute known values, no need for a
mask and a set is much easier to read. Also using the same name
does not sound like a good idea.
And the name vcpu_pin_set clearly suggest what kind of load runs
here, if using a mask it should be called pin_set.


I agree with Henning.

In nova.conf we should just use a set, something like
"rt_emulator_vcpu_pin_set" which would be used for running the
emulator/io threads of *only* realtime instances.


I'm not agree with you, we have a set of pCPUs and we want to
substract some of them for the emulator threads. We need a mask. The
only set we need is to selection which pCPUs Nova can use
(vcpus_pin_set).


At that point it does not really matter whether it is a set or a mask.
They can both express the same where a set is easier to read/configure.
With the same argument you could say that vcpu_pin_set should be a mask
over the hosts pcpus.

As i said before: vcpu_pin_set should be renamed because all sorts of
threads are put here (pcpu_pin_set?). But that would be a bigger change
and should be discussed as a seperate issue.

So far we talked about a compute-node for realtime only doing realtime.
In that case vcpu_pin_set + emulator_io_mask would work. If you want to
run regular VMs on the same host, you can run a second nova, like we do.

We could also use vcpu_pin_set + rt_vcpu_pin_set(/mask). I think that
would allow modelling all cases in just one nova. Having all in one
nova, you could potentially repurpose rt cpus to best-effort and back.
Some day in the future ...


That is not something we should allow or at least
advertise. compute-node can't run both RT and non-RT guests and that
because the nodes should have a kernel RT. We can't guarantee RT if
both are on same nodes.


A compute node with an RT OS could run RT and non-RT guests at the same time 
just fine.  In a small cloud (think hyperconverged with maybe two nodes total) 
it's not viable to dedicate an entire node to just RT loads.


I'd personally rather see nova able to handle a mix of RT and non-RT than need 
to run multiple nova instances on the same node and figure out an up-front split 
of resources between RT nova and non-RT nova.  Better to allow nova to 
dynamically allocate resources as needed.


Chris


Re: [openstack-dev] realtime kvm cpu affinities

2017-06-26 Thread Chris Friesen

On 06/25/2017 02:09 AM, Sahid Orentino Ferdjaoui wrote:

On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:

On 06/23/2017 09:35 AM, Henning Schild wrote:

Am Fri, 23 Jun 2017 11:11:10 +0200
schrieb Sahid Orentino Ferdjaoui <sferd...@redhat.com>:



In Linux RT context, and as you mentioned, the non-RT vCPU can acquire
some guest kernel lock, then be pre-empted by emulator thread while
holding this lock. This situation blocks RT vCPUs from doing its
work. So that is why we have implemented [2]. For DPDK I don't think
we have such problems because it's running in userland.

So for DPDK context I think we could have a mask like we have for RT
and basically considering vCPU0 to handle best effort works (emulator
threads, SSH...). I think it's the current pattern used by DPDK users.


DPDK is just a library and one can imagine an application that has
cross-core communication/synchronisation needs where the emulator
slowing down vpu0 will also slow down vcpu1. You DPDK application would
have to know which of its cores did not get a full pcpu.

I am not sure what the DPDK-example is doing in this discussion, would
that not just be cpu_policy=dedicated? I guess normal behaviour of
dedicated is that emulators and io happily share pCPUs with vCPUs and
you are looking for a way to restrict emulators/io to a subset of pCPUs
because you can live with some of them beeing not 100%.


Yes.  A typical DPDK-using VM might look something like this:

vCPU0: non-realtime, housekeeping and I/O, handles all virtual interrupts
and "normal" linux stuff, emulator runs on same pCPU
vCPU1: realtime, runs in tight loop in userspace processing packets
vCPU2: realtime, runs in tight loop in userspace processing packets
vCPU3: realtime, runs in tight loop in userspace processing packets

In this context, vCPUs 1-3 don't really ever enter the kernel, and we've
offloaded as much kernel work as possible from them onto vCPU0.  This works
pretty well with the current system.


For RT we have to isolate the emulator threads to an additional pCPU
per guests or as your are suggesting to a set of pCPUs for all the
guests running.

I think we should introduce a new option:

- hw:cpu_emulator_threads_mask=^1

If on 'nova.conf' - that mask will be applied to the set of all host
CPUs (vcpu_pin_set) to basically pack the emulator threads of all VMs
running here (useful for RT context).


That would allow modelling exactly what we need.
In nova.conf we are talking absolute known values, no need for a mask
and a set is much easier to read. Also using the same name does not
sound like a good idea.
And the name vcpu_pin_set clearly suggest what kind of load runs here,
if using a mask it should be called pin_set.


I agree with Henning.

In nova.conf we should just use a set, something like
"rt_emulator_vcpu_pin_set" which would be used for running the emulator/io
threads of *only* realtime instances.


I'm not agree with you, we have a set of pCPUs and we want to
substract some of them for the emulator threads. We need a mask. The
only set we need is to selection which pCPUs Nova can use
(vcpus_pin_set).


We may also want to have "rt_emulator_overcommit_ratio" to control how many
threads/instances we allow per pCPU.


Not really sure to have understand this point? If it is to indicate
that for a pCPU isolated we want X guest emulator threads, the same
behavior is achieved by the mask. A host for realtime is dedicated for
realtime, no overcommitment and the operators know the number of host
CPUs, they can easily deduct a ratio and so the corresponding mask.


Suppose I have a host with 64 CPUs.  I reserve three for host overhead and 
networking, leaving 61 for instances.  If I have instances with one non-RT vCPU 
and one RT vCPU then I can run 30 instances.  If instead my instances have one 
non-RT and 5 RT vCPUs then I can run 12 instances.  If I put all of my emulator 
threads on the same pCPU, it might make a difference whether I put 30 sets of 
emulator threads or 12 sets.


The proposed "rt_emulator_overcommit_ratio" would simply say "nova is allowed to 
run X instances worth of emulator threads on each pCPU in 
"rt_emulator_vcpu_pin_set".  If we've hit that threshold, then no more RT 
instances are allowed to schedule on this compute node (but non-RT instances 
would still be allowed).
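
In other words the admission check would look something like the sketch
below.  None of these names are existing nova options, it's purely
illustrative:

    def can_schedule_rt_instance(rt_instances_on_host,
                                 rt_emulator_vcpu_pin_set,
                                 rt_emulator_overcommit_ratio):
        # each pCPU in the emulator set may carry the emulator threads of
        # at most 'ratio' RT instances
        capacity = len(rt_emulator_vcpu_pin_set) * rt_emulator_overcommit_ratio
        return rt_instances_on_host < capacity

    # e.g. two emulator pCPUs and a ratio of 8 admits up to 16 RT instances
    print(can_schedule_rt_instance(15, {62, 63}, 8))    # True
    print(can_schedule_rt_instance(16, {62, 63}, 8))    # False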


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] realtime kvm cpu affinities

2017-06-23 Thread Chris Friesen

On 06/23/2017 09:35 AM, Henning Schild wrote:

Am Fri, 23 Jun 2017 11:11:10 +0200
schrieb Sahid Orentino Ferdjaoui :



In Linux RT context, and as you mentioned, the non-RT vCPU can acquire
some guest kernel lock, then be pre-empted by emulator thread while
holding this lock. This situation blocks RT vCPUs from doing its
work. So that is why we have implemented [2]. For DPDK I don't think
we have such problems because it's running in userland.

So for DPDK context I think we could have a mask like we have for RT
and basically considering vCPU0 to handle best effort works (emulator
threads, SSH...). I think it's the current pattern used by DPDK users.


DPDK is just a library and one can imagine an application that has
cross-core communication/synchronisation needs where the emulator
slowing down vpu0 will also slow down vcpu1. You DPDK application would
have to know which of its cores did not get a full pcpu.

I am not sure what the DPDK-example is doing in this discussion, would
that not just be cpu_policy=dedicated? I guess normal behaviour of
dedicated is that emulators and io happily share pCPUs with vCPUs and
you are looking for a way to restrict emulators/io to a subset of pCPUs
because you can live with some of them beeing not 100%.


Yes.  A typical DPDK-using VM might look something like this:

vCPU0: non-realtime, housekeeping and I/O, handles all virtual interrupts and 
"normal" linux stuff, emulator runs on same pCPU

vCPU1: realtime, runs in tight loop in userspace processing packets
vCPU2: realtime, runs in tight loop in userspace processing packets
vCPU3: realtime, runs in tight loop in userspace processing packets

In this context, vCPUs 1-3 don't really ever enter the kernel, and we've 
offloaded as much kernel work as possible from them onto vCPU0.  This works 
pretty well with the current system.
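
For reference, one way to express roughly that layout with the existing
flavor extra specs.  This is just a sketch using python-novaclient; the
flavor name and sizes are arbitrary and the keystone session is assumed to
already exist:

    from novaclient import client

    nova = client.Client('2', session=keystone_session)

    flavor = nova.flavors.create('dpdk.demo', ram=4096, vcpus=4, disk=20)
    flavor.set_keys({
        'hw:cpu_policy': 'dedicated',     # every vCPU pinned to its own pCPU
        'hw:cpu_realtime': 'yes',         # realtime scheduling for the vCPUs...
        'hw:cpu_realtime_mask': '^0',     # ...except vCPU0, which stays best-effort
        'hw:mem_page_size': 'large',      # back guest RAM with hugepages
    })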



For RT we have to isolate the emulator threads to an additional pCPU
per guests or as your are suggesting to a set of pCPUs for all the
guests running.

I think we should introduce a new option:

   - hw:cpu_emulator_threads_mask=^1

If on 'nova.conf' - that mask will be applied to the set of all host
CPUs (vcpu_pin_set) to basically pack the emulator threads of all VMs
running here (useful for RT context).


That would allow modelling exactly what we need.
In nova.conf we are talking absolute known values, no need for a mask
and a set is much easier to read. Also using the same name does not
sound like a good idea.
And the name vcpu_pin_set clearly suggest what kind of load runs here,
if using a mask it should be called pin_set.


I agree with Henning.

In nova.conf we should just use a set, something like "rt_emulator_vcpu_pin_set" 
which would be used for running the emulator/io threads of *only* realtime 
instances.


We may also want to have "rt_emulator_overcommit_ratio" to control how many 
threads/instances we allow per pCPU.



If on flavor extra-specs It will be applied to the vCPUs dedicated for
the guest (useful for DPDK context).


And if both are present the flavor wins and nova.conf is ignored?


In the flavor I'd like to see it be a full bitmask, not an exclusion mask with 
an implicit full set.  Thus the end-user could specify 
"hw:cpu_emulator_threads_mask=0" and get the emulator threads to run alongside 
vCPU0.


Henning, there is no conflict, the nova.conf setting and the flavor setting are 
used for two different things.
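
Whatever the final spelling ends up being (mask vs. set), the mechanics of
applying a "^N" exclusion to a CPU set are simple.  A standalone sketch in
plain Python, not nova's actual parsing code, following one reading of the
proposal above (the excluded pCPUs are the ones reserved for emulator
threads):

    def parse_cpu_set(spec):
        """Parse a "0-3,6,8-9" style string into a set of pCPU ids."""
        cpus = set()
        for chunk in spec.split(','):
            if '-' in chunk:
                lo, hi = chunk.split('-')
                cpus.update(range(int(lo), int(hi) + 1))
            else:
                cpus.add(int(chunk))
        return cpus

    def split_for_emulator(vcpu_pin_set, emulator_threads_mask):
        """Return (pcpus_for_vcpus, pcpus_for_emulator_threads)."""
        available = parse_cpu_set(vcpu_pin_set)
        excluded = {int(tok[1:]) for tok in emulator_threads_mask.split(',')
                    if tok.startswith('^')}
        return available - excluded, available & excluded

    print(split_for_emulator('2-7', '^2'))    # ({3, 4, 5, 6, 7}, {2})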


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] realtime kvm cpu affinities

2017-06-22 Thread Chris Friesen

On 06/22/2017 01:47 AM, Henning Schild wrote:

Am Wed, 21 Jun 2017 11:40:14 -0600
schrieb Chris Friesen <chris.frie...@windriver.com>:


On 06/21/2017 10:46 AM, Henning Schild wrote:



As we know from our setup, and as Luiz confirmed - it is _not_
"critical to separate emulator threads for different KVM instances".
They have to be separated from the vcpu-cores but not from each
other. At least not on the "cpuset" basis, maybe "blkio" and
cgroups like that.


I'm reluctant to say conclusively that we don't need to separate
emulator threads since I don't think we've considered all the cases.
For example, what happens if one or more of the instances are being
live-migrated?  The migration thread for those instances will be very
busy scanning for dirty pages, which could delay the emulator threads
for other instances and also cause significant cross-NUMA traffic
unless we ensure at least one core per NUMA-node.


Realtime instances can not be live-migrated. We are talking about
threads that can not even be moved between two cores on one numa-node
without missing a deadline. But your point is good because it could
mean that such an emulator_set - if defined - should not be used for all
VMs.


I'd suggest that realtime instances cannot be live-migrated *while meeting 
realtime commitments*.  There may be reasons to live-migrate realtime instances 
that aren't currently providing service.



Also, I don't think we've determined how much CPU time is needed for
the emulator threads.  If we have ~60 CPUs available for instances
split across two NUMA nodes, can we safely run the emulator threads
of 30 instances all together on a single CPU?  If not, how much
"emulator overcommit" is allowable?


That depends on how much IO your VMs are issuing and can not be
answered in general. All VMs can cause high load with IO/emulation,
rt-VMs are probably less likely to do so.


I think the result of this is that in addition to "rt_emulator_pin_set" you'd 
probably want a config option for "rt_emulator_overcommit_ratio" or something 
similar.


Chris


__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] realtime kvm cpu affinities

2017-06-21 Thread Chris Friesen

On 06/21/2017 10:46 AM, Henning Schild wrote:

Am Wed, 21 Jun 2017 10:04:52 -0600
schrieb Chris Friesen <chris.frie...@windriver.com>:



i guess you are talking about that section from [1]:


We could use a host level tunable to just reserve a set of host
pCPUs for running emulator threads globally, instead of trying to
account for it per instance. This would work in the simple case,
but when NUMA is used, it is highly desirable to have more fine
grained config to control emulator thread placement. When real-time
or dedicated CPUs are used, it will be critical to separate
emulator threads for different KVM instances.


Yes, that's the relevant section.


I know it has been considered, but i would like to bring the topic up
again. Because doing it that way allows for many more rt-VMs on a host
and i am not sure i fully understood why the idea was discarded in the
end.

I do not really see the influence of NUMA here. Say the
emulator_pin_set is used only for realtime VMs, we know that the
emulators and IOs can be "slow" so crossing numa-nodes should not be an
issue. Or you could say the set needs to contain at least one core per
numa-node and schedule emulators next to their vcpus.

As we know from our setup, and as Luiz confirmed - it is _not_ "critical
to separate emulator threads for different KVM instances".
They have to be separated from the vcpu-cores but not from each other.
At least not on the "cpuset" basis, maybe "blkio" and cgroups like that.


I'm reluctant to say conclusively that we don't need to separate emulator 
threads since I don't think we've considered all the cases.  For example, what 
happens if one or more of the instances are being live-migrated?  The migration 
thread for those instances will be very busy scanning for dirty pages, which 
could delay the emulator threads for other instances and also cause significant 
cross-NUMA traffic unless we ensure at least one core per NUMA-node.


Also, I don't think we've determined how much CPU time is needed for the 
emulator threads.  If we have ~60 CPUs available for instances split across two 
NUMA nodes, can we safely run the emulator threads of 30 instances all together 
on a single CPU?  If not, how much "emulator overcommit" is allowable?


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] realtime kvm cpu affinities

2017-06-21 Thread Chris Friesen

On 06/21/2017 02:42 AM, Henning Schild wrote:

Am Tue, 20 Jun 2017 10:41:44 -0600
schrieb Chris Friesen <chris.frie...@windriver.com>:



Our goal is to reach a high packing density of realtime VMs. Our
pragmatic first choice was to run all non-vcpu-threads on a shared
set of pcpus where we also run best-effort VMs and host load.
Now the OpenStack guys are not too happy with that because that is
load outside the assigned resources, which leads to quota and
accounting problems.


If you wanted to go this route, you could just edit the
"vcpu_pin_set" entry in nova.conf on the compute nodes so that nova
doesn't actually know about all of the host vCPUs.  Then you could
run host load and emulator threads on the pCPUs that nova doesn't
know about, and there will be no quota/accounting issues in nova.


Exactly that is the idea but OpenStack currently does not allow that.
No thread will ever end up on a core outside the vcpu_pin_set and
emulator/io-threads are controlled by OpenStack/libvirt.


Ah, right.  This will isolate the host load from the guest load, but it will 
leave the guest emulator work running on the same pCPUs as one or more vCPU threads.


Your emulator_pin_set idea is interesting...it might be worth proposing in nova.

Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

2017-06-20 Thread Chris Friesen

On 06/20/2017 09:51 AM, Eric Fried wrote:

Nice Stephen!

For those who aren't aware, the rendered version (pretty, so pretty) can
be accessed via the gate-nova-docs-ubuntu-xenial jenkins job:

http://docs-draft.openstack.org/10/475810/1/check/gate-nova-docs-ubuntu-xenial/25e5173//doc/build/html/scheduling.html?highlight=scheduling


Can we teach it to not put line breaks in the middle of words in the text boxes?

Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [nova] How to handle nova show --minimal with embedded flavors

2017-06-20 Thread Chris Friesen

On 06/20/2017 07:59 AM, Matt Riedemann wrote:


Personally I think that if I specify --minimal I want minimal output, which
would just be the flavor's original name after the new microversion, which is
closer in behavior to how --minimal works today before the 2.47 microversion.


In the existing novaclient code for show/rebuild, the --minimal option just 
skips doing the lookups on the flavor/image as described in the help text.  It 
doesn't affect the other ~40 fields in the instance.  After the new microversion 
we already have the flavor details without doing the flavor lookup so I thought 
it made sense to display them.


I suppose an argument could be made that for consistency we should keep the 
output with --minimal similar to what it was before.  If we want to go that 
route I'm happy to do so.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [openstack-dev[[nova] Simple question about sorting CPU topologies

2017-06-20 Thread Chris Friesen

On 06/20/2017 06:29 AM, Jay Pipes wrote:

On 06/19/2017 10:45 PM, Zhenyu Zheng wrote:

Sorry, The mail sent accidentally by mis-typing ...

My question is, what is the benefit of the above preference?


Hi Kevin!

I believe the benefit is so that the compute node prefers CPU topologies that do
not have hardware threads over CPU topologies that do include hardware threads.

I'm not sure exactly of the reason for this preference, but perhaps it is due to
assumptions that on some hardware, threads will compete for the same cache
resources as other siblings on a core whereas cores may have their own caches
(again, on some specific hardware).


Isn't the definition of hardware threads basically the fact that the sibling 
threads share the resources of a single core?


Are there architectures that OpenStack runs on where hardware threads don't 
compete for cache/TLB/execution units?  (And if there are, then why are they 
called threads and not cores?)


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] realtime kvm cpu affinities

2017-06-20 Thread Chris Friesen

On 06/20/2017 01:48 AM, Henning Schild wrote:

Hi,

We are using OpenStack for managing realtime guests. We modified
it and contributed to discussions on how to model the realtime
feature. More recent versions of OpenStack have support for realtime,
and there are a few proposals on how to improve that further.

But there is still no full answer on how to distribute threads across
host-cores. The vcpus are easy but for the emulation and io-threads
there are multiple options. I would like to collect the constraints
from a qemu/kvm perspective first, and then possibly influence the
OpenStack development.

I will put the summary/questions first, the text below provides more
context to where the questions come from.
- How do you distribute your threads when reaching the really low
   cyclictest results in the guests? In [3] Rik talked about problems
   like lock holder preemption, starvation etc. but not where/how to
   schedule emulators and io
- Is it ok to put a vcpu and emulator thread on the same core as long as
   the guest knows about it? Any funny behaving guest, not just Linux.
- Is it ok to make the emulators potentially slow by running them on
   busy best-effort cores, or will they quickly be on the critical path
   if you do more than just cyclictest? - our experience says we don't
   need them reactive even with rt-networking involved


Our goal is to reach a high packing density of realtime VMs. Our
pragmatic first choice was to run all non-vcpu-threads on a shared set
of pcpus where we also run best-effort VMs and host load.
Now the OpenStack guys are not too happy with that because that is load
outside the assigned resources, which leads to quota and accounting
problems.


If you wanted to go this route, you could just edit the "vcpu_pin_set" entry in 
nova.conf on the compute nodes so that nova doesn't actually know about all of 
the host vCPUs.  Then you could run host load and emulator threads on the pCPUs 
that nova doesn't know about, and there will be no quota/accounting issues in nova.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][tc] Moving away from "big tent" terminology

2017-06-19 Thread Chris Friesen

On 06/16/2017 02:57 AM, Julien Danjou wrote:

On Thu, Jun 15 2017, Doug Hellmann wrote:


One of the *most* common complaints the TC gets from outside the
contributor community is that people do not understand what projects
are part of OpenStack and what parts are not. We have a clear
definition of that in our minds (the projects that have said they
want to be part of OpenStack, and agreed to put themselves under
TC governance, with all of the policies that implies). That definition
is so trivial to say, that it seems like a tautology.  However,
looking in from the outside of the community, that definition isn't
helpful.


I still wonder why they care. Who care, really? Can we have some people
that care on this thread so they explain directly what we're trying to
solve here?

Everything is just a bunch of free software projects to me. The
governance made zero difference in my contributions or direction of the
projects I PTL'ed.


When I was first starting out, I didn't care at all about governance.  I wanted 
to know "What do the various components *do*, and which of them do I need to 
install to get a practical and useful OpenStack installation?".


A bit later on, I started thinking about "Which of these components are mature 
enough to be usable, and likely to be around for long enough to make it 
worthwhile to use them?"


A bit further down the road the issue became "I have this specific thing I want 
to accomplish, are there any projects out there that are working on it?"


I suspect I'm not the only one that went through this process, and I don't feel 
like there's a lot of information out there aimed at answering this sort of 
question without spending a lot of time digging into individual service 
documentation.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all][tc][glance] Glance needs help, it's getting critical

2017-06-12 Thread Chris Friesen

On 06/12/2017 01:50 PM, Flavio Percoco wrote:


Glance can be very exciting if one focuses on the interesting bits and it's an
*AWESOME* place where new comers can start contributing, new developers can
learn and practice, etc. That said, I believe that code doesn't have to be
challenging to be exciting. There's also excitment in the simple but interesting
things.


As an outsider, I found it harder to understand the glance code than the nova 
code...and that's saying something. :)


From the naive external viewpoint, it just doesn't seem like what glance is 
doing should be all that complicated, and yet somehow I found it to be so.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Is the pendulum swinging on PaaS layers?

2017-05-26 Thread Chris Friesen

On 05/19/2017 04:06 PM, Dean Troyer wrote:

On Fri, May 19, 2017 at 4:01 PM, Matt Riedemann  wrote:

I'm confused by this. Creating a server takes a volume ID if you're booting
from volume, and that's actually preferred (by nova devs) since then Nova
doesn't have to orchestrate the creation of the volume in the compute
service and then poll until it's available.

Same for ports - nova can create the port (default action) or get a port at
server creation time, which is required if you're doing trunk ports or
sr-iov / fancy pants ports.

Am I misunderstanding what you're saying is missing?


It turns out those are bad examples, they do accept IDs.


I was actually suggesting that maybe these commands in nova should *only* take 
IDs, and that nova itself should not set up either block storage or networking 
for you.


It seems non-intuitive to me that nova will do some basic stuff for you, but if 
you want something more complicated then you need to go do it a totally 
different way.


It seems to me that it'd be more logical if we always set up volumes/ports 
first, then passed the resulting UUIDs to nova.  This could maybe be hidden from 
the end-user by doing it in the client or some intermediate layer, but arguably 
nova proper shouldn't be in the proxying business.


Lastly, the existence of a partial proxy means that people ask for a more 
complete proxy--for example, specifying the vnic_type (for a port) or volume 
type (for a volume) when booting an instance.
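
For what it's worth, that flow already works today if you create the
resources yourself.  A rough sketch with python-novaclient, assuming the
volume, port and flavor IDs were obtained from cinder/neutron/nova
beforehand, the keystone session already exists, and with error handling
omitted:

    from novaclient import client

    nova = client.Client('2', session=keystone_session)

    server = nova.servers.create(
        name='demo',
        image=None,                      # booting from the pre-built volume
        flavor=flavor_id,
        nics=[{'port-id': port_id}],     # pre-created neutron port
        block_device_mapping_v2=[{
            'uuid': volume_id,           # pre-created cinder volume
            'source_type': 'volume',
            'destination_type': 'volume',
            'boot_index': 0,
            'delete_on_termination': False,
        }],
    )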


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] Is the pendulum swinging on PaaS layers?

2017-05-25 Thread Chris Friesen

On 05/20/2017 10:36 AM, Monty Taylor wrote:

On 05/19/2017 03:13 PM, Monty Taylor wrote:

On 05/19/2017 01:53 PM, Sean Dague wrote:

On 05/19/2017 02:34 PM, Dean Troyer wrote:

On Fri, May 19, 2017 at 1:04 PM, Sean Dague  wrote:

These should be used as ways to experiment with the kinds of interfaces
we want cheaply, then take them back into services (which is a more
expensive process involving compatibility stories, deeper
documentation,
performance implications, and the like), not an end game on their own.


I totally agree here.  But I also see the rate of progress for many
and varied reasons, and want to make users lives easier now.

Have any of the lessons already learned from Shade or OSC made it into
services yet?  I think a few may have, "get me a network" being the
obvious one.  But that still took a lot of work (granted that one _is_
complicated).


Doing hard things is hard. I don't expect changing APIs to be easy at
this level of deployedness of OpenStack.


You can get the behavior. It also has other behaviors. I'm not sure any
user has actually argued for "please make me do more rest calls to
create a server".


Maybe not in those words, but "give me the tools to do what I need"
has been heard often.  Sometimes those tools are composable
primitives, sometimes they are helpful opinionated interfaces.  I've
already done the helpful opinionated stuff in OSC here (accept flavor
and image names when the non-unique names _do_ identify a single
result).  Having that control lets me give the user more options in
handling edge cases.


Sure, it does. The fact that it makes 3 API calls every time when doing
flavors by name (404 on the name, list all flavors, local search, get
the flavor by real id) on mostly read only data (without any caching) is
the kind of problem that rises from "just fix it in an upper layer". So
it does provide an experience at a cost.


We also searching of all resources by name-or-id in shade. But it's one
call - GET /images - and then we test to see if the given value matches
the name field or the id field. And there is caching, so the list call
is done once in the session.

The thing I'm the saddest about is the Nova flavor "extra_info" that one
needs to grab for backwards compat but almost never has anything useful
in it. This causes me to make a billion API calls for the initial flavor
list (which is then cached of course) It would be WAY nicer if there was
a GET /flavors/detail that would just get me the whole lot in one go, fwiw.


Quick follow up on this one.

It was "extra_specs" I was thinking about - not "extra_info"

It used to be in the flavor as part of an extension (with a longer name) - we
fetch them in shade for backwards compat with the past when they were just
there. However, I've also learned from a follow up in IRC that these aren't
really things that were intended for me.


For what it's worth, there are cases where extra_specs are important to normal 
users because they constrain what image properties you are allowed to set.


Things like cpu_policy, cpu_thread_policy, memory page size, number of NUMA 
nodes, etc. can all be set in both places, and they behave differently if there 
is a mismatch between the flavor extra_spec and the image property.


Because of this I think it makes sense for a normal person to be able to look at 
flavor extra_specs so that they can create an image with suitable properties to 
be able to boot up an instance with that image on that flavor.
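
As a concrete illustration (assuming the standard hw:* flavor extra specs and
their hw_* image property counterparts), the same knobs exist in both places,
so a user building an image really needs to know what the flavor already
mandates:

    # set by the operator on the flavor
    flavor_extra_specs = {
        'hw:cpu_policy': 'dedicated',
        'hw:cpu_thread_policy': 'prefer',
        'hw:mem_page_size': 'large',
        'hw:numa_nodes': '2',
    }

    # set by the user on their image; a mismatch with the flavor can get the
    # boot rejected or change how the guest is placed
    image_properties = {
        'hw_cpu_policy': 'dedicated',
        'hw_cpu_thread_policy': 'prefer',
        'hw_mem_page_size': 'large',
        'hw_numa_nodes': '2',
    }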


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [nova][neutron] massive overhead processing "network-changed" events during live migration

2017-05-19 Thread Chris Friesen
Recently we noticed failures in Newton when we attempted to live-migrate an 
instance with 16 vifs.  We tracked it down to an RPC timeout in nova which timed 
out waiting for the 'refresh_cache-%s' lock in get_instance_nw_info().  This led 
to a few other discoveries.


First, we have no fair locking in OpenStack.  The live migration code path was 
waiting for the lock, but the code processing the incoming "network-changed" 
events kept getting the lock instead even though they arrived while the live 
migration code was already blocked waiting for the lock.
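
To make the contention concrete, both code paths serialize on the same per-instance
lock name, something like the following simplified sketch using oslo's lock utilities
(not the exact nova code):

    from oslo_concurrency import lockutils

    def refresh_instance_cache(instance_uuid, rebuild_cache):
        # Both the live-migration path and the "network-changed" event handler
        # funnel through a lock like this.  The lock is not fair/FIFO, so a
        # steady stream of event handlers can keep acquiring it while the
        # migration path waits until its RPC call times out.
        @lockutils.synchronized('refresh_cache-%s' % instance_uuid)
        def _do_refresh():
            rebuild_cache()
        _do_refresh()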


Second, it turns out the cost of processing the "network-changed" events is 
astronomical.


1) In Newton nova commit 5de902a was merged to fix evacuate bugs, but it meant 
both source and dest compute nodes got the "network-changed" events.  This 
doubled the number of neutron API calls during a live migration.


2) A "network-changed" event is sent from neutron each time something changes. 
There are multiple of these events for each vif during a live-migration.  In the 
current upstream code the only information passed with the event is the instance 
id, so nova will loop over all the ports in the instance and build up all the 
information about subnets/floatingIP/fixedIP/etc. for that instance.  This 
results in O(N^2) neutron API calls where N is the number of vifs in the instance.
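
Roughly, the refresh triggered by each event looks like the sketch below
(python-neutronclient style calls; the info_cache object is illustrative):

    def handle_network_changed(neutron, info_cache, instance_id):
        # The event only carries the instance id, so rebuild everything.
        ports = neutron.list_ports(device_id=instance_id)['ports']
        for port in ports:                                  # N ports...
            subnets = [neutron.show_subnet(ip['subnet_id'])['subnet']
                       for ip in port['fixed_ips']]         # ...each needing
            fips = neutron.list_floatingips(                #    several more calls
                port_id=port['id'])['floatingips']
            info_cache.update_port(port, subnets, fips)     # illustrative cache API
        # With one such event per vif, the total work is O(N^2) API calls.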


3) mriedem has proposed a patch series (https://review.openstack.org/#/c/465783 
and https://review.openstack.org/#/c/465787) that would change neutron to 
include the port ID, and allow nova to update just that port.  This reduces the 
cost to O(N), but it's still significant.
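
With the port ID in the event payload the handler only needs to touch one port,
something like this (again a sketch, not the actual patch):

    def handle_network_changed_port(neutron, info_cache, instance_id, port_id):
        port = neutron.show_port(port_id)['port']
        subnets = [neutron.show_subnet(ip['subnet_id'])['subnet']
                   for ip in port['fixed_ips']]
        fips = neutron.list_floatingips(port_id=port_id)['floatingips']
        info_cache.update_port(port, subnets, fips)   # only this port's cache entry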


In a hardware lab with 4 compute nodes I created 4 boot-from-volume instances, 
each with 16 vifs.  I then live-migrated them all in parallel.  (The one on 
compute-0 was migrated to compute-1, the one on compute-1 was migrated to 
compute-2, etc.)  The aggregate CPU usage for a few critical components on the 
controller node is shown below.  Note in particular the CPU usage for 
neutron--it's using most of 10 CPUs for ~10 seconds, spiking to 13 CPUs.  This 
seems like an absurd amount of work to do just to update the cache in nova.



Labels:
  L0: neutron-server
  L1: nova-conductor
  L2: beam.smp
  L3: postgres
----------  ------------  -----       L0       L1       L2       L3
date        time          dt         occ      occ      occ      occ
yyyy/mm/dd  hh:mm:ss.dec  (s)        (%)      (%)      (%)      (%)
2017-05-19  17:51:38.710  2.173    19.75     1.28     2.85     1.96
2017-05-19  17:51:40.012  1.302     1.02     1.75     3.80     5.07
2017-05-19  17:51:41.334  1.322     2.34     2.66     5.25     1.76
2017-05-19  17:51:42.681  1.347    91.79     3.31     5.27     5.64
2017-05-19  17:51:44.035  1.354    40.78     7.27     3.48     7.34
2017-05-19  17:51:45.406  1.371     7.12    21.35     8.66    19.58
2017-05-19  17:51:46.784  1.378    16.71   196.29     6.87    15.93
2017-05-19  17:51:48.133  1.349    18.51   362.46     8.57    25.70
2017-05-19  17:51:49.508  1.375   284.16   199.30     4.58    18.49
2017-05-19  17:51:50.919  1.411   512.88    17.61     7.47    42.88
2017-05-19  17:51:52.322  1.403   412.34     8.90     9.15    19.24
2017-05-19  17:51:53.734  1.411   320.24     5.20    10.59     9.08
2017-05-19  17:51:55.129  1.396   304.92     2.27    10.65    10.29
2017-05-19  17:51:56.551  1.422   556.09    14.56    10.74    18.85
2017-05-19  17:51:57.977  1.426   979.63    43.41    14.17    21.32
2017-05-19  17:51:59.382  1.405   902.56    48.31    13.69    18.59
2017-05-19  17:52:00.808  1.425  1140.99    74.28    15.12    17.18
2017-05-19  17:52:02.238  1.430  1013.91    69.77    16.46    21.19
2017-05-19  17:52:03.647  1.409   964.94   175.09    15.81    27.23
2017-05-19  17:52:05.077  1.430   838.15   109.13    15.70    34.12
2017-05-19  17:52:06.502  1.425   525.88    79.09    14.42    11.09
2017-05-19  17:52:07.954  1.452   614.58    38.38    12.20    17.89
2017-05-19  17:52:09.380  1.426   763.25    68.40    12.36    16.08
2017-05-19  17:52:10.825  1.445   901.57    73.59    15.90    41.12
2017-05-19  17:52:12.252  1.427   966.15    42.97    16.76    23.07
2017-05-19  17:52:13.702  1.450   902.40    70.98    19.66    17.50
2017-05-19  17:52:15.173  1.471  1023.33    59.71    19.78    18.91
2017-05-19  17:52:16.605  1.432  1127.04    64.19    16.41    26.80
2017-05-19  17:52:18.046  1.442  1300.56    68.22    16.29    24.39
2017-05-19  17:52:19.517  1.471  1055.60    71.74    14.39    17.09
2017-05-19  17:52:20.983  1.465   845.30    61.48    15.24    22.86
2017-05-19  17:52:22.447  1.464  1027.33    65.53    15.94    26.85
2017-05-19  17:52:23.919  1.472  1003.08    56.97    14.39    28.93
2017-05-19  17:52:25.367  1.448   702.50    45.42    11.78    20.53
2017-05-19  17:52:26.814  1.448   558.63    66.48    13.22    29.64
2017-05-19  17:52:28.276  1.462   620.34   206.63    14.58    17.17
2017-05-19  17:52:29.749  1.473   555.62   110.37    10.95    13.27
2017-05-19  17:52:31.228  1.479   436.66    33.65     9.00    21.55
2017-05-19  17:52:32.685  1.456   417.12    87.44    13.44    12.27
2017-05-19  17:52:34.128  1.443   368.31    87.08    11.95    14.70
2017-05-19 

Re: [openstack-dev] Is the pendulum swinging on PaaS layers?

2017-05-19 Thread Chris Friesen

On 05/19/2017 07:18 AM, Sean Dague wrote:


There was a conversation in the Cell v2 discussion around searchlight
that puts me more firmly in the anti enamel camp. Because of some
complexities around server list, Nova was planning on using Searchlight
to provide an efficient backend.

Q: Who in this room is running ELK already in their environment?
A: 100% of operators in room

Q: Who would be ok with standing up Searchlight for this?
A: 0% of operators in the room

We've now got an ecosystem that understands how to talk to our APIs
(yay! -
https://docs.google.com/presentation/d/1WAWHrVw8-u6XC7AG9ANdre8-Su0a3fdI-scjny3QOnk/pub?slide=id.g1d9d78a72b_0_0)
so saying "you need to also run this other service to *actually* do the
thing you want, and redo all your applications, and 3rd party SDKs" is
just weird.

And, yes, this is definitely a slider, and no I don't want Instance HA
in Nova. But we felt that "get-me-a-network" was important enough a user
experience to bake that in and stop poking users with sticks. And trying
hard to complete an expressed intent "POST /server" seems like it falls
on the line. Especially if the user received a conditional success (202).


A while back I suggested adding the vif-model as an attribute on the network 
during a nova boot request, and we were shot down because "that should be done 
in neutron".


I have some sympathy for this argument, but it seems to me that the logical 
extension of that is to expose simple orthogonal APIs where the nova boot 
request should only take neutron port ids and cinder volume ids.  The actual 
setup of those ports/volumes would be done by neutron and cinder.
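
In other words, something like the following hedged sketch using the
neutron/cinder/nova Python clients (names and values are made up):

    # Pre-create the port and volume with their own services, then hand nova
    # only the resulting ids.
    port = neutron.create_port(
        {'port': {'network_id': net_id, 'binding:vnic_type': 'normal'}})['port']
    volume = cinder.volumes.create(size=20, imageRef=image_id)

    server = nova.servers.create(
        name='vm1', image=None, flavor=flavor_id,
        nics=[{'port-id': port['id']}],
        block_device_mapping_v2=[{'uuid': volume.id,
                                  'source_type': 'volume',
                                  'destination_type': 'volume',
                                  'boot_index': 0}])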


It seems somewhat arbitrary to say "for historical reasons this subset of simple 
things can be done directly in a nova boot command, but for more complicated 
stuff you have to go use these other commands".  I think there's an argument to 
be made that it would be better to be consistent even for the simple things.


Chris

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

