Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-06-03 Thread Slawomir Kaplonski
Hi,

> Message written by Matt Riedemann on 03.06.2018 at 16:54:
> 
> On 6/2/2018 1:37 AM, Chris Apsey wrote:
>> This is great.  I would even go so far as to say the install docs should be 
>> updated to capture this as the default; as far as I know there is no 
>> negative impact when running in daemon mode, even on very small deployments. 
>>  I would imagine that there are operators out there who have run into this 
>> issue but didn't know how to work through it - making stuff like this less 
>> painful is key to breaking the 'openstack is hard' stigma.
> 
> I think changing the default on the root_helper_daemon option is a good idea 
> if everyone is setting that anyway. There are some comments in the code next 
> to the option that make me wonder if there are edge cases where it might not 
> be a good idea, but I don't really know the details, someone from the neutron 
> team that knows more about it would have to speak up.
> 
> Also, I wonder if converting to privsep in the neutron agent would eliminate 
> the need for this option altogether and still gain the performance benefits.

Converting the L2 agents to privsep is an ongoing process, but it's very slow. The 
switch of ip_lib to privsep is in progress: 
https://bugs.launchpad.net/neutron/+bug/1492714
But to drop rootwrap completely, tc_lib (used for QoS), the iptables module (used 
for security groups), and probably some other modules would also have to be 
switched to privsep. So I would not count on it being done soon :)

> 
> -- 
> 
> Thanks,
> 
> Matt
> 

-- 
Slawek Kaplonski
Senior software engineer
Red Hat


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-06-03 Thread Matt Riedemann

On 6/2/2018 1:37 AM, Chris Apsey wrote:
This is great.  I would even go so far as to say the install docs should 
be updated to capture this as the default; as far as I know there is no 
negative impact when running in daemon mode, even on very small 
deployments.  I would imagine that there are operators out there who 
have run into this issue but didn't know how to work through it - making 
stuff like this less painful is key to breaking the 'openstack is hard' 
stigma.


I think changing the default on the root_helper_daemon option is a good 
idea if everyone is setting that anyway. There are some comments in the 
code next to the option that make me wonder if there are edge cases 
where it might not be a good idea, but I don't really know the details, 
someone from the neutron team that knows more about it would have to 
speak up.


Also, I wonder if converting to privsep in the neutron agent would 
eliminate the need for this option altogether and still gain the 
performance benefits.


--

Thanks,

Matt

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-06-02 Thread Chris Apsey
This is great.  I would even go so far as to say the install docs should be 
updated to capture this as the default; as far as I know there is no 
negative impact when running in daemon mode, even on very small 
deployments.  I would imagine that there are operators out there who have 
run into this issue but didn't know how to work through it - making stuff 
like this less painful is key to breaking the 'openstack is hard' stigma.


On June 1, 2018 00:49:32 Matt Riedemann  wrote:


On 5/30/2018 9:30 AM, Matt Riedemann wrote:


I can start pushing some docs patches and report back here for review help.


Here are the docs patches in both nova and neutron:

https://review.openstack.org/#/q/topic:bug/1774217+(status:open+OR+status:merged)

--

Thanks,

Matt

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-31 Thread Matt Riedemann

On 5/30/2018 9:30 AM, Matt Riedemann wrote:


I can start pushing some docs patches and report back here for review help.


Here are the docs patches in both nova and neutron:

https://review.openstack.org/#/q/topic:bug/1774217+(status:open+OR+status:merged)

--

Thanks,

Matt

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-30 Thread Matt Riedemann

On 5/29/2018 8:23 PM, Chris Apsey wrote:
I want to echo the effectiveness of this change - we had vif failures 
when launching more than 50 or so cirros instances simultaneously, but 
moving to daemon mode made this issue disappear and we've tested 5x that 
amount.  This has been the single biggest scalability improvement to 
date.  This option should be the default in the official docs.


This is really good feedback. I'm not sure if there is any kind of 
centralized performance/scale-related documentation, does the LCOO team 
[1] have something that's current? There are also the performance docs 
[2] but that looks pretty stale.


We could add a note to the neutron rootwrap configuration option saying that 
if you're running into timeout issues you could consider running it in daemon 
mode, but that's probably not very discoverable. In fact, I couldn't find 
anything about it in the neutron docs; I only found this [3] because I know 
it's defined in oslo.rootwrap (I don't expect everyone to know where this is 
defined).


I found root_helper_daemon in the neutron docs [4] but it doesn't 
mention anything about performance or related options, and it just makes 
it sound like it matters for xenserver, which I'd gloss over if I were 
using libvirt. The root_helper_daemon config option help in neutron 
should probably refer to the neutron-rootwrap-daemon which is in the 
setup.cfg [5].
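
For anyone looking for it, enabling the daemon amounts to roughly the following 
agent configuration; this is only a sketch, and the exact file (neutron.conf vs. 
the L2 agent ini) and sudo wrapper paths depend on the deployment:

    # e.g. /etc/neutron/plugins/ml2/openvswitch_agent.ini (path assumed)
    [agent]
    root_helper = sudo neutron-rootwrap /etc/neutron/rootwrap.conf
    root_helper_daemon = sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf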


For better discoverability of this, probably the best place to mention 
it is in the nova vif_plugging_timeout configuration option, since I 
expect that's the first place operators will be looking when they start 
hitting timeouts during vif plugging at scale.
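
For reference, the options involved on the compute side look roughly like this; 
a minimal sketch only, assuming the usual /etc/nova/nova.conf location and 
showing the default values:

    # /etc/nova/nova.conf on the compute nodes
    [DEFAULT]
    # abort the boot if neutron never sends the network-vif-plugged event
    vif_plugging_is_fatal = True
    # seconds to wait for that event before giving up (300 is the default)
    vif_plugging_timeout = 300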


I can start pushing some docs patches and report back here for review help.

[1] https://wiki.openstack.org/wiki/LCOO
[2] https://docs.openstack.org/developer/performance-docs/
[3] 
https://docs.openstack.org/oslo.rootwrap/latest/user/usage.html#daemon-mode
[4] 
https://docs.openstack.org/neutron/latest/configuration/neutron.html#agent.root_helper_daemon

[5] https://github.com/openstack/neutron/blob/f486f0/setup.cfg#L54

--

Thanks,

Matt

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-30 Thread Radu Popescu | eMAG, Technology
Hi,

just to let you know. Problem is now gone. Instances boot up with working 
network interface.

Thanks a lot,
Radu

On Tue, 2018-05-29 at 21:23 -0400, Chris Apsey wrote:
I want to echo the effectiveness of this change - we had vif failures when 
launching more than 50 or so cirros instances simultaneously, but moving to 
daemon mode made this issue disappear and we've tested 5x that amount.  This 
has been the single biggest scalability improvement to date.  This option 
should be the default in the official docs.


On May 24, 2018 05:55:49 Saverio Proto  wrote:

Glad to hear it!
Always monitor rabbitmq queues to identify bottlenecks !! :)

Cheers

Saverio

On Thu, 24 May 2018 at 11:07, Radu Popescu | eMAG, Technology 
<radu.pope...@emag.ro> wrote:
Hi,

did the change yesterday. Had no issue this morning with neutron not being able 
to move fast enough. Still, we had some storage issues, but that's another 
thing.
Anyway, I'll leave it like this for the next few days and report back in case I 
get the same slow neutron errors.

Thanks a lot!
Radu

On Wed, 2018-05-23 at 10:08 +, Radu Popescu | eMAG, Technology wrote:
Hi,

actually, I didn't know about that option. I'll enable it right now.
Testing is done every morning at about 4:00AM ..so I'll know tomorrow morning 
if it changed anything.

Thanks,
Radu

On Tue, 2018-05-22 at 15:30 +0200, Saverio Proto wrote:

Sorry email went out incomplete.

Read this:

https://cloudblog.switch.ch/2017/08/28/starting-1000-instances-on-switchengines/


make sure that Openstack rootwrap configured to work in daemon mode


Thank you


Saverio



2018-05-22 15:29 GMT+02:00 Saverio Proto <ziopr...@gmail.com>:

Hello Radu,


do you have the Openstack rootwrap configured to work in daemon mode ?


please read this article:


2018-05-18 10:21 GMT+02:00 Radu Popescu | eMAG, Technology

mailto:radu.pope...@emag.ro>>:

Hi,


so, nova says the VM is ACTIVE and actually boots with no network. We are

setting some metadata that we use later on and have cloud-init for different

tasks.

So, VM is up, OS is running, but network is working after a random amount of

time, that can get to around 45 minutes. Thing is, is not happening to all

VMs in that test (around 300), but it's happening to a fair amount - around

25%.


I can see the callback coming few seconds after neutron openvswitch agent

says it's completed the setup. My question is, why is it taking so long for

nova openvswitch agent to configure the port? I can see the port up in both

host OS and openvswitch. I would assume it's doing the whole namespace and

iptables setup. But still, 30 minutes? Seems a lot!


Thanks,

Radu


On Thu, 2018-05-17 at 11:50 -0400, George Mihaiescu wrote:


We have other scheduled tests that perform end-to-end (assign floating IP,

ssh, ping outside) and never had an issue.

I think we turned it off because the callback code was initially buggy and

nova would wait forever while things were in fact ok, but I'll  change

"vif_plugging_is_fatal = True" and "vif_plugging_timeout = 300" and run

another large test, just to confirm.


We usually run these large tests after a version upgrade to test the APIs

under load.




On Thu, May 17, 2018 at 11:42 AM, Matt Riedemann 
mailto:mriede...@gmail.com>>

wrote:


On 5/17/2018 9:46 AM, George Mihaiescu wrote:


and large rally tests of 500 instances complete with no issues.



Sure, except you can't ssh into the guests.


The whole reason the vif plugging is fatal and timeout and callback code was

because the upstream CI was unstable without it. The server would report as

ACTIVE but the ports weren't wired up so ssh would fail. Having an ACTIVE

guest that you can't actually do anything with is kind of pointless.




___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-29 Thread Chris Apsey
I want to echo the effectiveness of this change - we had vif failures when 
launching more than 50 or so cirros instances simultaneously, but moving to 
daemon mode made this issue disappear and we've tested 5x that amount.  
This has been the single biggest scalability improvement to date.  This 
option should be the default in the official docs.


On May 24, 2018 05:55:49 Saverio Proto  wrote:

Glad to hear it!
Always monitor rabbitmq queues to identify bottlenecks !! :)

Cheers

Saverio


On Thu, 24 May 2018 at 11:07, Radu Popescu | eMAG, Technology 
wrote:

Hi,

did the change yesterday. Had no issue this morning with neutron not being 
able to move fast enough. Still, we had some storage issues, but that's 
another thing.
Anyway, I'll leave it like this for the next few days and report back in 
case I get the same slow neutron errors.


Thanks a lot!
Radu

On Wed, 2018-05-23 at 10:08 +, Radu Popescu | eMAG, Technology wrote:

Hi,

actually, I didn't know about that option. I'll enable it right now.
Testing is done every morning at about 4:00AM ..so I'll know tomorrow 
morning if it changed anything.


Thanks,
Radu

On Tue, 2018-05-22 at 15:30 +0200, Saverio Proto wrote:

Sorry email went out incomplete.
Read this:
https://cloudblog.switch.ch/2017/08/28/starting-1000-instances-on-switchengines/

make sure that Openstack rootwrap configured to work in daemon mode

Thank you

Saverio


2018-05-22 15:29 GMT+02:00 Saverio Proto :




Hello Radu,

do you have the Openstack rootwrap configured to work in daemon mode ?

please read this article:

2018-05-18 10:21 GMT+02:00 Radu Popescu | eMAG, Technology
:




Hi,

so, nova says the VM is ACTIVE and actually boots with no network. We are
setting some metadata that we use later on and have cloud-init for different
tasks.
So, VM is up, OS is running, but network is working after a random amount of
time, that can get to around 45 minutes. Thing is, is not happening to all
VMs in that test (around 300), but it's happening to a fair amount - around
25%.

I can see the callback coming few seconds after neutron openvswitch agent
says it's completed the setup. My question is, why is it taking so long for
nova openvswitch agent to configure the port? I can see the port up in both
host OS and openvswitch. I would assume it's doing the whole namespace and
iptables setup. But still, 30 minutes? Seems a lot!

Thanks,
Radu

On Thu, 2018-05-17 at 11:50 -0400, George Mihaiescu wrote:

We have other scheduled tests that perform end-to-end (assign floating IP,
ssh, ping outside) and never had an issue.
I think we turned it off because the callback code was initially buggy and
nova would wait forever while things were in fact ok, but I'll change
"vif_plugging_is_fatal = True" and "vif_plugging_timeout = 300" and run
another large test, just to confirm.

We usually run these large tests after a version upgrade to test the APIs
under load.



On Thu, May 17, 2018 at 11:42 AM, Matt Riedemann 
wrote:

On 5/17/2018 9:46 AM, George Mihaiescu wrote:

and large rally tests of 500 instances complete with no issues.


Sure, except you can't ssh into the guests.

The whole reason the vif plugging is fatal and timeout and callback code was
because the upstream CI was unstable without it. The server would report as
ACTIVE but the ports weren't wired up so ssh would fail. Having an ACTIVE
guest that you can't actually do anything with is kind of pointless.



___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-24 Thread Saverio Proto
Glad to hear it!
Always monitor rabbitmq queues to identify bottlenecks !! :)

Cheers

Saverio

On Thu, 24 May 2018 at 11:07, Radu Popescu | eMAG, Technology <
radu.pope...@emag.ro> wrote:

> Hi,
>
> did the change yesterday. Had no issue this morning with neutron not being
> able to move fast enough. Still, we had some storage issues, but that's
> another thing.
> Anyway, I'll leave it like this for the next few days and report back in
> case I get the same slow neutron errors.
>
> Thanks a lot!
> Radu
>
> On Wed, 2018-05-23 at 10:08 +, Radu Popescu | eMAG, Technology wrote:
>
> Hi,
>
> actually, I didn't know about that option. I'll enable it right now.
> Testing is done every morning at about 4:00AM ..so I'll know tomorrow
> morning if it changed anything.
>
> Thanks,
> Radu
>
> On Tue, 2018-05-22 at 15:30 +0200, Saverio Proto wrote:
>
> Sorry email went out incomplete.
>
> Read this:
>
> https://cloudblog.switch.ch/2017/08/28/starting-1000-instances-on-switchengines/
>
>
> make sure that Openstack rootwrap configured to work in daemon mode
>
>
> Thank you
>
>
> Saverio
>
>
>
> 2018-05-22 15:29 GMT+02:00 Saverio Proto :
>
> Hello Radu,
>
>
> do you have the Openstack rootwrap configured to work in daemon mode ?
>
>
> please read this article:
>
>
> 2018-05-18 10:21 GMT+02:00 Radu Popescu | eMAG, Technology
>
> :
>
> Hi,
>
>
> so, nova says the VM is ACTIVE and actually boots with no network. We are
>
> setting some metadata that we use later on and have cloud-init for different
>
> tasks.
>
> So, VM is up, OS is running, but network is working after a random amount of
>
> time, that can get to around 45 minutes. Thing is, is not happening to all
>
> VMs in that test (around 300), but it's happening to a fair amount - around
>
> 25%.
>
>
> I can see the callback coming few seconds after neutron openvswitch agent
>
> says it's completed the setup. My question is, why is it taking so long for
>
> nova openvswitch agent to configure the port? I can see the port up in both
>
> host OS and openvswitch. I would assume it's doing the whole namespace and
>
> iptables setup. But still, 30 minutes? Seems a lot!
>
>
> Thanks,
>
> Radu
>
>
> On Thu, 2018-05-17 at 11:50 -0400, George Mihaiescu wrote:
>
>
> We have other scheduled tests that perform end-to-end (assign floating IP,
>
> ssh, ping outside) and never had an issue.
>
> I think we turned it off because the callback code was initially buggy and
>
> nova would wait forever while things were in fact ok, but I'll  change
>
> "vif_plugging_is_fatal = True" and "vif_plugging_timeout = 300" and run
>
> another large test, just to confirm.
>
>
> We usually run these large tests after a version upgrade to test the APIs
>
> under load.
>
>
>
>
> On Thu, May 17, 2018 at 11:42 AM, Matt Riedemann 
>
> wrote:
>
>
> On 5/17/2018 9:46 AM, George Mihaiescu wrote:
>
>
> and large rally tests of 500 instances complete with no issues.
>
>
>
> Sure, except you can't ssh into the guests.
>
>
> The whole reason the vif plugging is fatal and timeout and callback code was
>
> because the upstream CI was unstable without it. The server would report as
>
> ACTIVE but the ports weren't wired up so ssh would fail. Having an ACTIVE
>
> guest that you can't actually do anything with is kind of pointless.
>
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-24 Thread Radu Popescu | eMAG, Technology
Hi,

did the change yesterday. Had no issue this morning with neutron not being able 
to move fast enough. Still, we had some storage issues, but that's another 
thing.
Anyway, I'll leave it like this for the next few days and report back in case I 
get the same slow neutron errors.

Thanks a lot!
Radu

On Wed, 2018-05-23 at 10:08 +, Radu Popescu | eMAG, Technology wrote:
Hi,

actually, I didn't know about that option. I'll enable it right now.
Testing is done every morning at about 4:00AM ..so I'll know tomorrow morning 
if it changed anything.

Thanks,
Radu

On Tue, 2018-05-22 at 15:30 +0200, Saverio Proto wrote:

Sorry email went out incomplete.

Read this:

https://cloudblog.switch.ch/2017/08/28/starting-1000-instances-on-switchengines/


make sure that Openstack rootwrap configured to work in daemon mode


Thank you


Saverio



2018-05-22 15:29 GMT+02:00 Saverio Proto:

Hello Radu,


do you have the Openstack rootwrap configured to work in daemon mode ?


please read this article:


2018-05-18 10:21 GMT+02:00 Radu Popescu | eMAG, Technology

>:

Hi,


so, nova says the VM is ACTIVE and actually boots with no network. We are

setting some metadata that we use later on and have cloud-init for different

tasks.

So, VM is up, OS is running, but network is working after a random amount of

time, that can get to around 45 minutes. Thing is, is not happening to all

VMs in that test (around 300), but it's happening to a fair amount - around

25%.


I can see the callback coming few seconds after neutron openvswitch agent

says it's completed the setup. My question is, why is it taking so long for

nova openvswitch agent to configure the port? I can see the port up in both

host OS and openvswitch. I would assume it's doing the whole namespace and

iptables setup. But still, 30 minutes? Seems a lot!


Thanks,

Radu


On Thu, 2018-05-17 at 11:50 -0400, George Mihaiescu wrote:


We have other scheduled tests that perform end-to-end (assign floating IP,

ssh, ping outside) and never had an issue.

I think we turned it off because the callback code was initially buggy and

nova would wait forever while things were in fact ok, but I'll  change

"vif_plugging_is_fatal = True" and "vif_plugging_timeout = 300" and run

another large test, just to confirm.


We usually run these large tests after a version upgrade to test the APIs

under load.




On Thu, May 17, 2018 at 11:42 AM, Matt Riedemann 
>

wrote:


On 5/17/2018 9:46 AM, George Mihaiescu wrote:


and large rally tests of 500 instances complete with no issues.



Sure, except you can't ssh into the guests.


The whole reason the vif plugging is fatal and timeout and callback code was

because the upstream CI was unstable without it. The server would report as

ACTIVE but the ports weren't wired up so ssh would fail. Having an ACTIVE

guest that you can't actually do anything with is kind of pointless.



___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-23 Thread Radu Popescu | eMAG, Technology
Hi,

actually, I didn't know about that option. I'll enable it right now.
Testing is done every morning at about 4:00AM ..so I'll know tomorrow morning 
if it changed anything.

Thanks,
Radu

On Tue, 2018-05-22 at 15:30 +0200, Saverio Proto wrote:

Sorry email went out incomplete.

Read this:

https://cloudblog.switch.ch/2017/08/28/starting-1000-instances-on-switchengines/


make sure that Openstack rootwrap configured to work in daemon mode


Thank you


Saverio



2018-05-22 15:29 GMT+02:00 Saverio Proto:

Hello Radu,


do you have the Openstack rootwrap configured to work in daemon mode ?


please read this article:


2018-05-18 10:21 GMT+02:00 Radu Popescu | eMAG, Technology

>:

Hi,


so, nova says the VM is ACTIVE and actually boots with no network. We are

setting some metadata that we use later on and have cloud-init for different

tasks.

So, VM is up, OS is running, but network is working after a random amount of

time, that can get to around 45 minutes. Thing is, is not happening to all

VMs in that test (around 300), but it's happening to a fair amount - around

25%.


I can see the callback coming few seconds after neutron openvswitch agent

says it's completed the setup. My question is, why is it taking so long for

nova openvswitch agent to configure the port? I can see the port up in both

host OS and openvswitch. I would assume it's doing the whole namespace and

iptables setup. But still, 30 minutes? Seems a lot!


Thanks,

Radu


On Thu, 2018-05-17 at 11:50 -0400, George Mihaiescu wrote:


We have other scheduled tests that perform end-to-end (assign floating IP,

ssh, ping outside) and never had an issue.

I think we turned it off because the callback code was initially buggy and

nova would wait forever while things were in fact ok, but I'll  change

"vif_plugging_is_fatal = True" and "vif_plugging_timeout = 300" and run

another large test, just to confirm.


We usually run these large tests after a version upgrade to test the APIs

under load.




On Thu, May 17, 2018 at 11:42 AM, Matt Riedemann 
>

wrote:


On 5/17/2018 9:46 AM, George Mihaiescu wrote:


and large rally tests of 500 instances complete with no issues.



Sure, except you can't ssh into the guests.


The whole reason the vif plugging is fatal and timeout and callback code was

because the upstream CI was unstable without it. The server would report as

ACTIVE but the ports weren't wired up so ssh would fail. Having an ACTIVE

guest that you can't actually do anything with is kind of pointless.



___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-22 Thread Saverio Proto
Sorry email went out incomplete.
Read this:
https://cloudblog.switch.ch/2017/08/28/starting-1000-instances-on-switchengines/

make sure that the Openstack rootwrap is configured to work in daemon mode

Thank you

Saverio


2018-05-22 15:29 GMT+02:00 Saverio Proto :
> Hello Radu,
>
> do you have the Openstack rootwrap configured to work in daemon mode ?
>
> please read this article:
>
> 2018-05-18 10:21 GMT+02:00 Radu Popescu | eMAG, Technology
> :
>> Hi,
>>
>> so, nova says the VM is ACTIVE and actually boots with no network. We are
>> setting some metadata that we use later on and have cloud-init for different
>> tasks.
>> So, VM is up, OS is running, but network is working after a random amount of
>> time, that can get to around 45 minutes. Thing is, is not happening to all
>> VMs in that test (around 300), but it's happening to a fair amount - around
>> 25%.
>>
>> I can see the callback coming few seconds after neutron openvswitch agent
>> says it's completed the setup. My question is, why is it taking so long for
>> nova openvswitch agent to configure the port? I can see the port up in both
>> host OS and openvswitch. I would assume it's doing the whole namespace and
>> iptables setup. But still, 30 minutes? Seems a lot!
>>
>> Thanks,
>> Radu
>>
>> On Thu, 2018-05-17 at 11:50 -0400, George Mihaiescu wrote:
>>
>> We have other scheduled tests that perform end-to-end (assign floating IP,
>> ssh, ping outside) and never had an issue.
>> I think we turned it off because the callback code was initially buggy and
>> nova would wait forever while things were in fact ok, but I'll  change
>> "vif_plugging_is_fatal = True" and "vif_plugging_timeout = 300" and run
>> another large test, just to confirm.
>>
>> We usually run these large tests after a version upgrade to test the APIs
>> under load.
>>
>>
>>
>> On Thu, May 17, 2018 at 11:42 AM, Matt Riedemann 
>> wrote:
>>
>> On 5/17/2018 9:46 AM, George Mihaiescu wrote:
>>
>> and large rally tests of 500 instances complete with no issues.
>>
>>
>> Sure, except you can't ssh into the guests.
>>
>> The whole reason the vif plugging is fatal and timeout and callback code was
>> because the upstream CI was unstable without it. The server would report as
>> ACTIVE but the ports weren't wired up so ssh would fail. Having an ACTIVE
>> guest that you can't actually do anything with is kind of pointless.
>>

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-22 Thread Saverio Proto
Hello Radu,

do you have the Openstack rootwrap configured to work in daemon mode ?

please read this article:

2018-05-18 10:21 GMT+02:00 Radu Popescu | eMAG, Technology
:
> Hi,
>
> so, nova says the VM is ACTIVE and actually boots with no network. We are
> setting some metadata that we use later on and have cloud-init for different
> tasks.
> So, VM is up, OS is running, but network is working after a random amount of
> time, that can get to around 45 minutes. Thing is, is not happening to all
> VMs in that test (around 300), but it's happening to a fair amount - around
> 25%.
>
> I can see the callback coming few seconds after neutron openvswitch agent
> says it's completed the setup. My question is, why is it taking so long for
> nova openvswitch agent to configure the port? I can see the port up in both
> host OS and openvswitch. I would assume it's doing the whole namespace and
> iptables setup. But still, 30 minutes? Seems a lot!
>
> Thanks,
> Radu
>
> On Thu, 2018-05-17 at 11:50 -0400, George Mihaiescu wrote:
>
> We have other scheduled tests that perform end-to-end (assign floating IP,
> ssh, ping outside) and never had an issue.
> I think we turned it off because the callback code was initially buggy and
> nova would wait forever while things were in fact ok, but I'll  change
> "vif_plugging_is_fatal = True" and "vif_plugging_timeout = 300" and run
> another large test, just to confirm.
>
> We usually run these large tests after a version upgrade to test the APIs
> under load.
>
>
>
> On Thu, May 17, 2018 at 11:42 AM, Matt Riedemann 
> wrote:
>
> On 5/17/2018 9:46 AM, George Mihaiescu wrote:
>
> and large rally tests of 500 instances complete with no issues.
>
>
> Sure, except you can't ssh into the guests.
>
> The whole reason the vif plugging is fatal and timeout and callback code was
> because the upstream CI was unstable without it. The server would report as
> ACTIVE but the ports weren't wired up so ssh would fail. Having an ACTIVE
> guest that you can't actually do anything with is kind of pointless.
>

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-18 Thread Radu Popescu | eMAG, Technology
Hi,

so, nova says the VM is ACTIVE, but it actually boots with no network. We set 
some metadata that we use later on and rely on cloud-init for different tasks.
So the VM is up and the OS is running, but the network only starts working after 
a random amount of time, which can be as long as around 45 minutes. The thing is, 
it's not happening to all VMs in that test (around 300), but it is happening to a 
fair amount - around 25%.

I can see the callback coming a few seconds after the neutron openvswitch agent 
says it has completed the setup. My question is, why is it taking so long for the 
neutron openvswitch agent to configure the port? I can see the port up in both 
the host OS and openvswitch. I would assume it's doing the whole namespace and 
iptables setup. But still, 30 minutes? Seems like a lot!

Thanks,
Radu

On Thu, 2018-05-17 at 11:50 -0400, George Mihaiescu wrote:
We have other scheduled tests that perform end-to-end (assign floating IP, ssh, 
ping outside) and never had an issue.
I think we turned it off because the callback code was initially buggy and nova 
would wait forever while things were in fact ok, but I'll  change 
"vif_plugging_is_fatal = True" and "vif_plugging_timeout = 300" and run another 
large test, just to confirm.

We usually run these large tests after a version upgrade to test the APIs under 
load.



On Thu, May 17, 2018 at 11:42 AM, Matt Riedemann wrote:
On 5/17/2018 9:46 AM, George Mihaiescu wrote:
and large rally tests of 500 instances complete with no issues.


Sure, except you can't ssh into the guests.

The whole reason the vif plugging is fatal and timeout and callback code was 
because the upstream CI was unstable without it. The server would report as 
ACTIVE but the ports weren't wired up so ssh would fail. Having an ACTIVE guest 
that you can't actually do anything with is kind of pointless.



___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-17 Thread George Mihaiescu
We have other scheduled tests that perform end-to-end (assign floating IP,
ssh, ping outside) and never had an issue.
I think we turned it off because the callback code was initially buggy and
nova would wait forever while things were in fact ok, but I'll  change
"vif_plugging_is_fatal = True" and "vif_plugging_timeout = 300" and run
another large test, just to confirm.

We usually run these large tests after a version upgrade to test the APIs
under load.



On Thu, May 17, 2018 at 11:42 AM, Matt Riedemann 
wrote:

> On 5/17/2018 9:46 AM, George Mihaiescu wrote:
>
>> and large rally tests of 500 instances complete with no issues.
>>
>
> Sure, except you can't ssh into the guests.
>
> The whole reason the vif plugging is fatal and timeout and callback code
> was because the upstream CI was unstable without it. The server would
> report as ACTIVE but the ports weren't wired up so ssh would fail. Having
> an ACTIVE guest that you can't actually do anything with is kind of
> pointless.
>
> --
>
> Thanks,
>
> Matt
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-17 Thread Matt Riedemann

On 5/17/2018 9:46 AM, George Mihaiescu wrote:

and large rally tests of 500 instances complete with no issues.


Sure, except you can't ssh into the guests.

The whole reason the vif plugging is fatal and timeout and callback code 
was because the upstream CI was unstable without it. The server would 
report as ACTIVE but the ports weren't wired up so ssh would fail. 
Having an ACTIVE guest that you can't actually do anything with is kind 
of pointless.


--

Thanks,

Matt

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-17 Thread George Mihaiescu
We use "vif_plugging_is_fatal = False" and "vif_plugging_timeout = 0" as
well as "no-ping" in the dnsmasq-neutron.conf, and large rally tests of 500
instances complete with no issues.
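
For context, that dnsmasq tweak is wired up roughly like this; a sketch only, 
with the file paths assumed rather than taken from our deployment:

    # /etc/neutron/dhcp_agent.ini (assumed path)
    [DEFAULT]
    dnsmasq_config_file = /etc/neutron/dnsmasq-neutron.conf

    # /etc/neutron/dnsmasq-neutron.conf
    # skip the ICMP probe dnsmasq does before handing out a lease
    no-ping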

These are some good blogposts about Neutron performance:
https://www.mirantis.com/blog/openstack-neutron-performance-and-scalability-testing-summary/
https://www.mirantis.com/blog/improving-dhcp-performance-openstack/

I would run a large rally test like this one and see where time is spent
mostly:
{
    "NovaServers.boot_and_delete_server": [
        {
            "args": {
                "flavor": {
                    "name": "c2.small"
                },
                "image": {
                    "name": "^Ubuntu 16.04 - latest$"
                },
                "force_delete": false
            },
            "runner": {
                "type": "constant",
                "times": 500,
                "concurrency": 100
            }
        }
    ]
}


Cheers,
George

On Thu, May 17, 2018 at 7:49 AM, Radu Popescu | eMAG, Technology <
radu.pope...@emag.ro> wrote:

> Hi,
>
> unfortunately, didn't get the reply in my inbox, so I'm answering from the
> link here:
> http://lists.openstack.org/pipermail/openstack-operators/
> 2018-May/015270.html
> (hopefully, my reply will go to the same thread)
>
> Anyway, I can see the neutron openvswitch agent logs processing the
> interface way after the VM is up (in this case, 30 minutes). And after the
> vif plugin timeout of 5 minutes (currently 10 minutes).
> After searching for logs, I came out with an example here: (replaced nova
> compute hostname with "nova.compute.hostname")
>
> http://paste.openstack.org/show/1VevKuimoBMs4G8X53Eu/
>
> As you can see, the request for the VM starts around 3:27AM. Ports get
> created, openvswitch has the command to do it, has DHCP, but apparently
> Neutron server sends the callback after Neutron Openvswitch agent finishes.
> Callback is at 2018-05-10 03:57:36.177 while Neutron Openvswitch agent says
> it completed the setup and configuration at 2018-05-10 03:57:35.247.
>
> So, my question is, why is Neutron Openvswitch agent processing the
> request 30 minutes after the VM is started? And where can I search for logs
> for whatever happens during those 30 minutes?
> And yes, we're using libvirt. At some point, we added some new nova
> compute nodes and the new ones came with v3.2.0 and was breaking migration
> between hosts. That's why we downgraded (and versionlocked) everything at
> v2.0.0.
>
> Thanks,
> Radu
>
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-17 Thread Radu Popescu | eMAG, Technology
Hi,

unfortunately, didn't get the reply in my inbox, so I'm answering from the link 
here:
http://lists.openstack.org/pipermail/openstack-operators/2018-May/015270.html
(hopefully, my reply will go to the same thread)

Anyway, I can see in the neutron openvswitch agent logs that it processes the 
interface way after the VM is up (in this case, 30 minutes), and well after the 
vif plugging timeout of 5 minutes (currently raised to 10 minutes).
After searching the logs, I came up with an example here (nova compute hostname 
replaced with "nova.compute.hostname"):

http://paste.openstack.org/show/1VevKuimoBMs4G8X53Eu/

As you can see, the request for the VM starts around 3:27AM. Ports get created, 
openvswitch has the command to do it, has DHCP, but apparently Neutron server 
sends the callback after Neutron Openvswitch agent finishes. Callback is at 
2018-05-10 03:57:36.177 while Neutron Openvswitch agent says it completed the 
setup and configuration at 2018-05-10 03:57:35.247.

So, my question is, why is the Neutron Openvswitch agent processing the request 30 
minutes after the VM is started? And where can I look for logs of whatever 
happens during those 30 minutes?
And yes, we're using libvirt. At some point we added some new nova compute 
nodes, and the new ones came with v3.2.0, which was breaking migration between 
hosts. That's why we downgraded (and versionlocked) everything to v2.0.0.

Thanks,
Radu
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-16 Thread Matt Riedemann

On 5/16/2018 10:30 AM, Radu Popescu | eMAG, Technology wrote:

but I can see nova attaching the interface after a huge amount of time.


What specifically are you looking for in the logs when you see this?

Are you passing pre-created ports to attach to nova or are you passing a 
network ID so nova will create the port for you during the attach call?


This is where the ComputeManager calls the driver to plug the vif on the 
host:


https://github.com/openstack/nova/blob/stable/ocata/nova/compute/manager.py#L5187

Assuming you're using the libvirt driver, the host vif plug happens here:

https://github.com/openstack/nova/blob/stable/ocata/nova/virt/libvirt/driver.py#L1463

And the guest is updated here:

https://github.com/openstack/nova/blob/stable/ocata/nova/virt/libvirt/driver.py#L1472

vif_plugging_is_fatal and vif_plugging_timeout don't come into play here 
because we're attaching an interface to an existing server - or are you 
talking about during the initial creation of the guest, i.e. this code 
in the driver?


https://github.com/openstack/nova/blob/stable/ocata/nova/virt/libvirt/driver.py#L5257

Are you seeing this in the logs for the given port?

https://github.com/openstack/nova/blob/stable/ocata/nova/compute/manager.py#L6875

If not, it could mean that neutron-server never send the event to nova, 
so nova-compute timed out waiting for the vif plug callback event to 
tell us that the port is ready and the server can be changed to ACTIVE 
status.


The neutron-server logs should log when external events are being sent 
to nova for the given port, you probably need to trace the requests and 
compare the nova-compute and neutron logs for a given server create request.
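
When doing that correlation, it usually helps to run both services with verbose 
logging for the duration of the test; a minimal sketch, assuming the standard 
config locations:

    # /etc/nova/nova.conf (nova-compute) and /etc/neutron/neutron.conf (neutron-server)
    [DEFAULT]
    # verbose logging, so the network-vif-plugged event handling on both
    # sides of a single boot can be matched up by request ID and port ID
    debug = True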


--

Thanks,

Matt

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-16 Thread Radu Popescu | eMAG, Technology
Hi all,

we have the following setup:
- Openstack Ocata deployed with Openstack Ansible (v15.1.7)
- 66 compute nodes, each having between 50 and 150 VMs, depending on their 
hardware configuration
- we don't use Ceilometer (so not adding extra load on RabbitMQ cluster)
- using Openvswitch HA with DVR
- all messaging are going through a 3 servers RabbitMQ cluster
- we now have 3 CCs (initially had 2) hosting every other internal service

What happens is that when we create a large number of VMs (something we do on a 
daily basis to test different types of VMs and apps, around 300 VMs), some of 
them don't get their network interface attached in a reasonable time.
After investigating, we can see that the Neutron Openvswitch agent sees the port 
attached to the server; from an Openstack point of view, I can see the tap 
interface created in Openvswitch using both its logs and dmesg, but nova only 
attaches the interface after a huge amount of time (I have seen delays of up to 
45 minutes).

Since I can't see any errors I could reasonably act on, my last resort is 
this mailing list.
The only thing I can think of is that maybe libvirt is not able to attach the 
interface in a reasonable amount of time. But still, 45 minutes is way too much.

At the moment:
vif_plugging_is_fatal = True
vif_plugging_timeout = 600 (modified from default 300s)

That's because we need the VMs to come up with networking. Otherwise, whether 
they fail with an error or come up with no network, it's the same thing for us.

Thanks,

--

Radu Popescu
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators