Re: [Openstack-operators] Problems with AggregateMultiTenancyIsolation while migrating an instance

2018-05-30 Thread Matt Riedemann

On 5/30/2018 9:41 AM, Matt Riedemann wrote:
Thanks for your patience in debugging this, Massimo! I'll get a bug 
reported and a patch posted to fix it.


I'm tracking the problem with this bug:

https://bugs.launchpad.net/nova/+bug/1774205

I found that this has actually been fixed since Pike:

https://review.openstack.org/#/c/449640/

But I've got a patch up for another related issue, and a functional test 
to avoid regressions, which I can also use when backporting the fix to 
stable/ocata.


--

Thanks,

Matt

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Problems with AggregateMultiTenancyIsolation while migrating an instance

2018-05-30 Thread Matt Riedemann

On 5/30/2018 5:21 AM, Massimo Sgaravatto wrote:

The problem is indeed with the tenant_id

When I create a VM, tenant_id is ee1865a76440481cbcff08544c7d580a 
(SgaraPrj1), as expected


But when, as admin, I run the "nova migrate" command to migrate the very 
same instance, the tenant_id is 56c3f5c047e74a78a71438c4412e6e13 (admin) !


OK, that's good information.

Tracing the code for cold migrate in ocata, we get the request spec that 
was created when the instance was created here:


https://github.com/openstack/nova/blob/stable/ocata/nova/compute/api.py#L3339

As I mentioned earlier, if we were cold migrating an instance created 
before Newton and the online data migration hadn't been run on it, we'd 
create a temporary request spec here:


https://github.com/openstack/nova/blob/stable/ocata/nova/conductor/manager.py#L263

But that shouldn't be the case in your scenario.

Right before we call the scheduler, for some reason, we completely 
ignore the request spec retrieved in the API, and re-create it from 
local scope variables in conductor:


https://github.com/openstack/nova/blob/stable/ocata/nova/conductor/tasks/migrate.py#L50

And *that* is precisely where this breaks down: it takes the project_id 
from the current context (admin) rather than from the instance:


https://github.com/openstack/nova/blob/stable/ocata/nova/objects/request_spec.py#L407
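
To make that concrete, here is a rough sketch of the shape of the problem 
(illustrative only - the names are simplified and this is not the actual 
nova code or the eventual fix):

    # Simplified illustration of the Ocata conductor migrate path.
    def build_request_spec_for_migrate(context, instance, filter_properties):
        # Bug: the project used for scheduling comes from whoever issued
        # the API request, i.e. the admin running "nova migrate" ...
        wrong_project_id = context.project_id

        # ... while filters like AggregateMultiTenancyIsolation need the
        # project that actually owns the instance:
        right_project_id = instance.project_id

        # The fix is essentially to prefer the instance's project_id when
        # re-creating the RequestSpec instead of falling back to the
        # request context.
        return {
            'project_id': right_project_id or wrong_project_id,
            'instance_uuid': instance.uuid,
            'filter_properties': filter_properties,
        }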

Thanks for your patience in debugging this, Massimo! I'll get a bug 
reported and a patch posted to fix it.


--

Thanks,

Matt

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-30 Thread Matt Riedemann

On 5/29/2018 8:23 PM, Chris Apsey wrote:
I want to echo the effectiveness of this change - we had vif failures 
when launching more than 50 or so cirros instances simultaneously, but 
moving to daemon mode made this issue disappear and we've tested 5x that 
amount.  This has been the single biggest scalability improvement to 
date.  This option should be the default in the official docs.


This is really good feedback. I'm not sure if there is any kind of 
centralized performance/scale-related documentation; does the LCOO team 
[1] have anything that's current? There are also the performance docs 
[2], but those look pretty stale.


We could add a note to the neutron rootwrap configuration option saying 
that if you're running into timeout issues you could consider running it 
in daemon mode, but that's probably not very discoverable. In fact, I 
couldn't find anything about it in the neutron docs; I only found this 
[3] because I know it's defined in oslo.rootwrap (and I don't expect 
everyone to know where this is defined).


I found root_helper_daemon in the neutron docs [4], but it doesn't 
mention anything about performance or related options, and it makes it 
sound like it only matters for XenServer, which I'd gloss over if I were 
using libvirt. The root_helper_daemon config option help in neutron 
should probably refer to the neutron-rootwrap-daemon command, which is 
defined in the setup.cfg [5].
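
For reference, what I have in mind is something along these lines in the 
neutron agent configuration (illustrative only - the exact config file and 
rootwrap.conf path depend on your deployment):

    [agent]
    # Default behaviour: spawn "sudo neutron-rootwrap" for every privileged call.
    root_helper = sudo neutron-rootwrap /etc/neutron/rootwrap.conf
    # Daemon mode: keep a long-lived rootwrap process around instead, which
    # avoids the per-call process spawn overhead at scale.
    root_helper_daemon = sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf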


For better discoverability of this, probably the best place to mention 
it is in the nova vif_plugging_timeout configuration option, since I 
expect that's the first place operators will be looking when they start 
hitting timeouts during vif plugging at scale.
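
For example, something like this in nova.conf on the compute nodes is where 
people usually end up looking (the values shown are just the defaults and are 
illustrative, not a recommendation):

    [DEFAULT]
    # Fail the boot instead of activating a guest with an unusable port if
    # neutron never sends the network-vif-plugged event.
    vif_plugging_is_fatal = True
    # How long (in seconds) to wait for the network-vif-plugged event.
    vif_plugging_timeout = 300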


I can start pushing some docs patches and report back here for review help.

[1] https://wiki.openstack.org/wiki/LCOO
[2] https://docs.openstack.org/developer/performance-docs/
[3] 
https://docs.openstack.org/oslo.rootwrap/latest/user/usage.html#daemon-mode
[4] 
https://docs.openstack.org/neutron/latest/configuration/neutron.html#agent.root_helper_daemon

[5] https://github.com/openstack/neutron/blob/f486f0/setup.cfg#L54

--

Thanks,

Matt

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] Problems with AggregateMultiTenancyIsolation while migrating an instance

2018-05-30 Thread Massimo Sgaravatto
The problem is indeed with the tenant_id

When I create a VM, tenant_id is ee1865a76440481cbcff08544c7d580a
(SgaraPrj1), as expected

But when, as admin, I run the "nova migrate" command to migrate the very
same instance, the tenant_id is 56c3f5c047e74a78a71438c4412e6e13 (admin) !

Cheers, Massimo

On Wed, May 30, 2018 at 1:01 AM, Matt Riedemann  wrote:

> On 5/29/2018 3:07 PM, Massimo Sgaravatto wrote:
>
>> The VM that I am trying to migrate was created when the Cloud was already
>> running Ocata
>>
>
> OK, I'd add the tenant_id variable in scope to the log message here:
>
> https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/filters/aggregate_multitenancy_isolation.py#L50
>
> And make sure that when it fails, it matches what you'd expect. If it's None or
> '' or something weird then we have a bug.
>
> --
>
> Thanks,
>
> Matt
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-30 Thread Radu Popescu | eMAG, Technology
Hi,

just to let you know, the problem is now gone. Instances boot up with a working 
network interface.

Thanks a lot,
Radu

On Tue, 2018-05-29 at 21:23 -0400, Chris Apsey wrote:
I want to echo the effectiveness of this change - we had vif failures when 
launching more than 50 or so cirros instances simultaneously, but moving to 
daemon mode made this issue disappear and we've tested 5x that amount.  This 
has been the single biggest scalability improvement to date.  This option 
should be the default in the official docs.


On May 24, 2018 05:55:49 Saverio Proto  wrote:

Glad to hear it!
Always monitor the rabbitmq queues to identify bottlenecks!! :)

Cheers

Saverio

On Thu, 24 May 2018 at 11:07, Radu Popescu | eMAG, Technology 
<radu.pope...@emag.ro> wrote:
Hi,

I made the change yesterday and had no issues this morning with neutron not 
being able to keep up. We still had some storage issues, but that's another 
matter.
Anyway, I'll leave it like this for the next few days and report back if I 
get the same slow neutron errors.

Thanks a lot!
Radu

On Wed, 2018-05-23 at 10:08 +, Radu Popescu | eMAG, Technology wrote:
Hi,

actually, I didn't know about that option. I'll enable it right now.
Testing is done every morning at about 4:00 AM, so I'll know tomorrow morning 
whether it changed anything.

Thanks,
Radu

On Tue, 2018-05-22 at 15:30 +0200, Saverio Proto wrote:

Sorry, the email went out incomplete.

Read this:

https://cloudblog.switch.ch/2017/08/28/starting-1000-instances-on-switchengines/

Make sure that OpenStack rootwrap is configured to work in daemon mode.

Thank you

Saverio

2018-05-22 15:29 GMT+02:00 Saverio Proto <ziopr...@gmail.com>:

Hello Radu,

do you have the OpenStack rootwrap configured to work in daemon mode?

Please read this article:

2018-05-18 10:21 GMT+02:00 Radu Popescu | eMAG, Technology
<radu.pope...@emag.ro>:

Hi,

so, nova says the VM is ACTIVE and actually boots with no network. We are
setting some metadata that we use later on and have cloud-init for different
tasks.
So, the VM is up and the OS is running, but the network only starts working
after a random amount of time, which can reach around 45 minutes. It's not
happening to all VMs in that test (around 300), but it's happening to a fair
amount - around 25%.

I can see the callback coming a few seconds after the neutron openvswitch agent
says it's completed the setup. My question is, why is it taking so long for the
neutron openvswitch agent to configure the port? I can see the port up in both
the host OS and openvswitch. I would assume it's doing the whole namespace and
iptables setup. But still, 30 minutes? Seems a lot!

Thanks,

Radu


On Thu, 2018-05-17 at 11:50 -0400, George Mihaiescu wrote:

We have other scheduled tests that perform end-to-end checks (assign floating IP,
ssh, ping outside) and never had an issue.
I think we turned it off because the callback code was initially buggy and
nova would wait forever while things were in fact ok, but I'll change
"vif_plugging_is_fatal = True" and "vif_plugging_timeout = 300" and run
another large test, just to confirm.

We usually run these large tests after a version upgrade to test the APIs
under load.

On Thu, May 17, 2018 at 11:42 AM, Matt Riedemann <mriede...@gmail.com> wrote:

On 5/17/2018 9:46 AM, George Mihaiescu wrote:

and large rally tests of 500 instances complete with no issues.

Sure, except you can't ssh into the guests.

The whole reason the vif plugging fatal/timeout and callback code exists is
that the upstream CI was unstable without it. The server would report as
ACTIVE but the ports weren't wired up, so ssh would fail. Having an ACTIVE
guest that you can't actually do anything with is kind of pointless.


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators