Re: [Openstack-operators] Neutron not adding iptables rules for metadata agent

2018-06-29 Thread Radu Popescu | eMAG, Technology
Well, right now, I've managed to manually add those rules.
For now, I will assume it was caused by the RabbitMQ upgrade I did a few weeks 
ago. If the issue reappears, I'll make sure to file a bug report.
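
For reference, re-adding the two missing rules by hand in the affected router 
namespace would look roughly like the sketch below. The router UUID is a 
placeholder, the MARK rule sits in the mangle table as far as I can tell, and 
0xffff is my assumption for the mark mask that appears truncated ("0x1/0x") in 
the rules quoted further down in this thread:

  # <router-uuid> is a placeholder for the affected qrouter namespace
  ip netns exec qrouter-<router-uuid> iptables -t nat -A neutron-l3-agent-PREROUTING \
      -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 9697
  # assumption: the metadata MARK rule lives in the mangle table with mask 0xffff
  ip netns exec qrouter-<router-uuid> iptables -t mangle -A neutron-l3-agent-PREROUTING \
      -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j MARK --set-xmark 0x1/0xffff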

Thanks,
Radu

> On Jun 29, 2018, at 3:55 PM, Saverio Proto wrote:
> 
> Hello,
> 
> I would suggest opening a bug on Launchpad to track this issue.
> 
> thank you
> 
> Saverio
> 
> 2018-06-18 12:19 GMT+02:00 Radu Popescu | eMAG, Technology:
>> Hi,
>> 
>> We're using OpenStack Ocata, deployed using OpenStack-Ansible v15.1.7.
>> The Neutron server is v10.0.3.
>> I can see enable_isolated_metadata and enable_metadata_network are only used
>> for isolated networks that don't have a router, which is not our case.
>> Also, I checked all namespaces on all our nova compute nodes and only 6 out
>> of 66 are affected, and only 1 namespace per node. Seems like an isolated
>> case that doesn't happen very often.
>> 
>> Can it be RabbitMQ? I'm not sure where to check.
>> 
>> Thanks,
>> Radu
>> 
>> On Fri, 2018-06-15 at 17:11 +0200, Saverio Proto wrote:
>> 
>> Hello Radu,
>> 
>> yours looks more or less like a bug report. Did you check the existing
>> open bugs for Neutron? Also, what version of OpenStack are you running?
>> 
>> How did you configure the enable_isolated_metadata and
>> enable_metadata_network options?
>> 
>> Saverio
>> 
>> 2018-06-13 12:45 GMT+02:00 Radu Popescu | eMAG, Technology:
>> 
>> Hi all,
>> 
>> So, I'm having the following issue. I'm creating a VM with a floating IP.
>> Everything is fine: the namespace is there, and the POSTROUTING and
>> PREROUTING rules from the internal IP to the floating IP are there. The
>> only rules missing are the rules to access the metadata service:
>> 
>> -A neutron-l3-agent-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 9697
>> -A neutron-l3-agent-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j MARK --set-xmark 0x1/0x
>> 
>> (this is taken from another working namespace with iptables-save)
>> 
>> Forgot to mention, the VM boots OK; I have both the default route and the
>> one for the metadata service (cloud-init is running at boot time):
>> 
>> [   57.150766] cloud-init[892]: ci-info: ++--+--+---+---+---+
>> [   57.150997] cloud-init[892]: ci-info: | Device |  Up  |   Address    |      Mask     | Scope |     Hw-Address    |
>> [   57.151219] cloud-init[892]: ci-info: ++--+--+---+---+---+
>> [   57.151431] cloud-init[892]: ci-info: |  lo:   | True |  127.0.0.1   |   255.0.0.0   |   .   |         .         |
>> [   57.151627] cloud-init[892]: ci-info: | eth0:  | True | 10.240.9.186 | 255.255.252.0 |   .   | fa:16:3e:43:d1:c2 |
>> [   57.151815] cloud-init[892]: ci-info: ++--+--+---+---+---+
>> [   57.152018] cloud-init[892]: ci-info: +++Route IPv4 info
>> [   57.152225] cloud-init[892]: ci-info: +---+-++-+---+---+
>> [   57.152426] cloud-init[892]: ci-info: | Route |   Destination   |  Gateway   |     Genmask     | Interface | Flags |
>> [   57.152621] cloud-init[892]: ci-info: +---+-++-+---+---+
>> [   57.152813] cloud-init[892]: ci-info: |   0   |     0.0.0.0     | 10.240.8.1 |     0.0.0.0     |    eth0   |   UG  |
>> [   57.153013] cloud-init[892]: ci-info: |   1   |    10.240.1.0   |  0.0.0.0   |  255.255.255.0  |    eth0   |   U   |
>> [   57.153202] cloud-init[892]: ci-info: |   2   |    10.240.8.0   |  0.0.0.0   |  255.255.252.0  |    eth0   |   U   |
>> [   57.153397] cloud-init[892]: ci-info: |   3   | 169.254.169.254 | 10.240.8.1 | 255.255.255.255 |    eth0   |  UGH  |
>> [   57.153579] cloud-init[892]: ci-info: +---+-++-+---+---+

Re: [Openstack-operators] Neutron not adding iptables rules for metadata agent

2018-06-18 Thread Radu Popescu | eMAG, Technology
Hi,

We're using OpenStack Ocata, deployed using OpenStack-Ansible v15.1.7. The 
Neutron server is v10.0.3.
I can see enable_isolated_metadata and enable_metadata_network are only used for 
isolated networks that don't have a router, which is not our case.
Also, I checked all namespaces on all our nova compute nodes and only 6 out of 
66 are affected, and only 1 namespace per node. Seems like an isolated case that 
doesn't happen very often.

Can it be RabbitMQ? I'm not sure where to check.
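
(A quick place to check is queue depth on one of the RabbitMQ cluster nodes; a 
minimal sketch, assuming rabbitmqctl access on that node:)

  # list the 20 deepest queues -- a queue that keeps growing points at the bottleneck
  rabbitmqctl list_queues name messages consumers | sort -n -k2,2 | tail -20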

Thanks,
Radu

On Fri, 2018-06-15 at 17:11 +0200, Saverio Proto wrote:

Hello Radu,

yours looks more or less like a bug report. Did you check the existing
open bugs for Neutron? Also, what version of OpenStack are you running?

How did you configure the enable_isolated_metadata and
enable_metadata_network options?

Saverio


2018-06-13 12:45 GMT+02:00 Radu Popescu | eMAG, Technology:

Hi all,

So, I'm having the following issue. I'm creating a VM with a floating IP.
Everything is fine: the namespace is there, and the POSTROUTING and PREROUTING
rules from the internal IP to the floating IP are there. The only rules missing
are the rules to access the metadata service:

-A neutron-l3-agent-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 9697
-A neutron-l3-agent-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j MARK --set-xmark 0x1/0x

(this is taken from another working namespace with iptables-save)

Forgot to mention, the VM boots OK; I have both the default route and the one
for the metadata service (cloud-init is running at boot time):

[   57.150766] cloud-init[892]: ci-info: ++--+--+---+---+---+
[   57.150997] cloud-init[892]: ci-info: | Device |  Up  |   Address    |      Mask     | Scope |     Hw-Address    |
[   57.151219] cloud-init[892]: ci-info: ++--+--+---+---+---+
[   57.151431] cloud-init[892]: ci-info: |  lo:   | True |  127.0.0.1   |   255.0.0.0   |   .   |         .         |
[   57.151627] cloud-init[892]: ci-info: | eth0:  | True | 10.240.9.186 | 255.255.252.0 |   .   | fa:16:3e:43:d1:c2 |
[   57.151815] cloud-init[892]: ci-info: ++--+--+---+---+---+
[   57.152018] cloud-init[892]: ci-info: +++Route IPv4 info
[   57.152225] cloud-init[892]: ci-info: +---+-++-+---+---+
[   57.152426] cloud-init[892]: ci-info: | Route |   Destination   |  Gateway   |     Genmask     | Interface | Flags |
[   57.152621] cloud-init[892]: ci-info: +---+-++-+---+---+
[   57.152813] cloud-init[892]: ci-info: |   0   |     0.0.0.0     | 10.240.8.1 |     0.0.0.0     |    eth0   |   UG  |
[   57.153013] cloud-init[892]: ci-info: |   1   |    10.240.1.0   |  0.0.0.0   |  255.255.255.0  |    eth0   |   U   |
[   57.153202] cloud-init[892]: ci-info: |   2   |    10.240.8.0   |  0.0.0.0   |  255.255.252.0  |    eth0   |   U   |
[   57.153397] cloud-init[892]: ci-info: |   3   | 169.254.169.254 | 10.240.8.1 | 255.255.255.255 |    eth0   |  UGH  |
[   57.153579] cloud-init[892]: ci-info: +---+-++-+---+---+

The extra route is there because the tenant has 2 subnets.

Before adding those 2 rules manually, I had this coming from cloud-init:

[  192.451801] cloud-init[892]: 2018-06-13 12:29:26,179 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [0/120s]: request error [('Connection aborted.', error(113, 'No route to host'))]
[  193.456805] cloud-init[892]: 2018-06-13 12:29:27,184 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [1/120s]: request error [('Connection aborted.', error(113, 'No route to host'))]
[  194.461592] cloud-init[892]: 2018-06-13 12:29:28,189 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [2/120s]: request error [('Connection aborted.', error(113, 'No route to host'))]
[  195.466441] cloud-init[892]: 2018-06-13 12:29:29,194 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [3/120s]: request error [('Connection aborted.', error(113, 'No route to host'))]

I can see no errors in either the nova or neutron services.
In the meantime, I've searched all our nova servers for this kind of behavior
and we have 1 random namespace missing those rules on 6 of our 66 compute nodes.

Any ideas would be greatly appreciated.

Thanks,
Radu

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

[Openstack-operators] Neutron not adding iptables rules for metadata agent

2018-06-13 Thread Radu Popescu | eMAG, Technology
Hi all,

So, I'm having the following issue. I'm creating a VM with a floating IP. 
Everything is fine: the namespace is there, and the POSTROUTING and PREROUTING 
rules from the internal IP to the floating IP are there. The only rules missing 
are the rules to access the metadata service:

-A neutron-l3-agent-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 9697
-A neutron-l3-agent-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j MARK --set-xmark 0x1/0x

(this is taken from another working namespace with iptables-save)

Forgot to mention, the VM boots OK; I have both the default route and the one 
for the metadata service (cloud-init is running at boot time):
[   57.150766] cloud-init[892]: ci-info: ++--+--+---+---+---+
[   57.150997] cloud-init[892]: ci-info: | Device |  Up  |   Address    |      Mask     | Scope |     Hw-Address    |
[   57.151219] cloud-init[892]: ci-info: ++--+--+---+---+---+
[   57.151431] cloud-init[892]: ci-info: |  lo:   | True |  127.0.0.1   |   255.0.0.0   |   .   |         .         |
[   57.151627] cloud-init[892]: ci-info: | eth0:  | True | 10.240.9.186 | 255.255.252.0 |   .   | fa:16:3e:43:d1:c2 |
[   57.151815] cloud-init[892]: ci-info: ++--+--+---+---+---+
[   57.152018] cloud-init[892]: ci-info: +++Route IPv4 info
[   57.152225] cloud-init[892]: ci-info: +---+-++-+---+---+
[   57.152426] cloud-init[892]: ci-info: | Route |   Destination   |  Gateway   |     Genmask     | Interface | Flags |
[   57.152621] cloud-init[892]: ci-info: +---+-++-+---+---+
[   57.152813] cloud-init[892]: ci-info: |   0   |     0.0.0.0     | 10.240.8.1 |     0.0.0.0     |    eth0   |   UG  |
[   57.153013] cloud-init[892]: ci-info: |   1   |    10.240.1.0   |  0.0.0.0   |  255.255.255.0  |    eth0   |   U   |
[   57.153202] cloud-init[892]: ci-info: |   2   |    10.240.8.0   |  0.0.0.0   |  255.255.252.0  |    eth0   |   U   |
[   57.153397] cloud-init[892]: ci-info: |   3   | 169.254.169.254 | 10.240.8.1 | 255.255.255.255 |    eth0   |  UGH  |
[   57.153579] cloud-init[892]: ci-info: +---+-++-+---+---+

The extra route is there because the tenant has 2 subnets.

Before adding those 2 rules manually, I had this coming from cloud-init:

[  192.451801] cloud-init[892]: 2018-06-13 12:29:26,179 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [0/120s]: request error [('Connection aborted.', error(113, 'No route to host'))]
[  193.456805] cloud-init[892]: 2018-06-13 12:29:27,184 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [1/120s]: request error [('Connection aborted.', error(113, 'No route to host'))]
[  194.461592] cloud-init[892]: 2018-06-13 12:29:28,189 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [2/120s]: request error [('Connection aborted.', error(113, 'No route to host'))]
[  195.466441] cloud-init[892]: 2018-06-13 12:29:29,194 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [3/120s]: request error [('Connection aborted.', error(113, 'No route to host'))]
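
(Once the rules are in place, the same request cloud-init makes can be repeated 
by hand from inside the guest to confirm the fix; a trivial check, assuming curl 
is available in the image:)

  # should print the instance ID instead of "No route to host"
  curl -s http://169.254.169.254/2009-04-04/meta-data/instance-id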

I can see no errors in either the nova or neutron services.
In the meantime, I've searched all our nova servers for this kind of behavior 
and we have 1 random namespace missing those rules on 6 of our 66 compute nodes.
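
A rough per-host version of that audit, run on each compute node, might look 
like the sketch below (illustrative, not the exact commands we used):

  # flag qrouter namespaces that are missing the metadata REDIRECT rule
  for ns in $(ip netns list | awk '{print $1}' | grep '^qrouter-'); do
      ip netns exec "$ns" iptables-save -t nat | grep -q 'REDIRECT --to-ports 9697' \
          || echo "missing metadata rules: $ns"
  done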

Any ideas would be greatly appreciated.

Thanks,
Radu
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-30 Thread Radu Popescu | eMAG, Technology
Hi,

just to let you know, the problem is now gone. Instances boot up with a working 
network interface.

Thanks a lot,
Radu

On Tue, 2018-05-29 at 21:23 -0400, Chris Apsey wrote:
I want to echo the effectiveness of this change - we had vif failures when 
launching more than 50 or so cirros instances simultaneously, but moving to 
daemon mode made this issue disappear and we've tested 5x that amount.  This 
has been the single biggest scalability improvement to date.  This option 
should be the default in the official docs.


On May 24, 2018 05:55:49 Saverio Proto wrote:

Glad to hear it!
Always monitor rabbitmq queues to identify bottlenecks !! :)

Cheers

Saverio

On Thu, 24 May 2018 at 11:07, Radu Popescu | eMAG, Technology wrote:
Hi,

did the change yesterday. Had no issue this morning with neutron not being able 
to move fast enough. Still, we had some storage issues, but that's another 
thing.
Anyway, I'll leave it like this for the next few days and report back in case I 
get the same slow neutron errors.

Thanks a lot!
Radu

On Wed, 2018-05-23 at 10:08 +, Radu Popescu | eMAG, Technology wrote:
Hi,

actually, I didn't know about that option. I'll enable it right now.
Testing is done every morning at about 4:00AM ..so I'll know tomorrow morning 
if it changed anything.

Thanks,
Radu

On Tue, 2018-05-22 at 15:30 +0200, Saverio Proto wrote:

Sorry, the email went out incomplete.

Read this:
https://cloudblog.switch.ch/2017/08/28/starting-1000-instances-on-switchengines/

Make sure that the OpenStack rootwrap is configured to work in daemon mode.

Thank you

Saverio


2018-05-22 15:29 GMT+02:00 Saverio Proto:

Hello Radu,

do you have the OpenStack rootwrap configured to work in daemon mode?

please read this article:

2018-05-18 10:21 GMT+02:00 Radu Popescu | eMAG, Technology:

Hi,

so, nova says the VM is ACTIVE, but it actually boots with no network. We are
setting some metadata that we use later on and have cloud-init for different
tasks.
So, the VM is up and the OS is running, but the network only starts working
after a random amount of time, which can reach around 45 minutes. The thing is,
it's not happening to all VMs in that test (around 300), but it's happening to
a fair amount - around 25%.

I can see the callback coming a few seconds after the neutron openvswitch agent
says it has completed the setup. My question is, why is it taking so long for
the openvswitch agent on the nova compute node to configure the port? I can see
the port up in both the host OS and openvswitch. I would assume it's doing the
whole namespace and iptables setup. But still, 30 minutes? Seems a lot!

Thanks,
Radu

On Thu, 2018-05-17 at 11:50 -0400, George Mihaiescu wrote:

We have other scheduled tests that perform end-to-end (assign floating IP, ssh,
ping outside) and never had an issue.
I think we turned it off because the callback code was initially buggy and nova
would wait forever while things were in fact ok, but I'll change
"vif_plugging_is_fatal = True" and "vif_plugging_timeout = 300" and run another
large test, just to confirm.

We usually run these large tests after a version upgrade to test the APIs under
load.


On Thu, May 17, 2018 at 11:42 AM, Matt Riedemann wrote:

On 5/17/2018 9:46 AM, George Mihaiescu wrote:

and large rally tests of 500 instances complete with no issues.

Sure, except you can't ssh into the guests.

The whole reason the vif plugging is fatal and timeout and callback code was
because the upstream CI was unstable without it. The server would report as
ACTIVE but the ports weren't wired up so ssh would fail. Having an ACTIVE guest
that you can't actually do anything with is kind of pointless.

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-24 Thread Radu Popescu | eMAG, Technology
Hi,

I made the change yesterday. This morning there were no issues with neutron not 
being able to move fast enough. We still had some storage issues, but that's 
another thing.
Anyway, I'll leave it like this for the next few days and report back in case I 
get the same slow neutron errors.

Thanks a lot!
Radu

On Wed, 2018-05-23 at 10:08 +, Radu Popescu | eMAG, Technology wrote:
Hi,

actually, I didn't know about that option. I'll enable it right now.
Testing is done every morning at about 4:00AM ..so I'll know tomorrow morning 
if it changed anything.

Thanks,
Radu

On Tue, 2018-05-22 at 15:30 +0200, Saverio Proto wrote:

Sorry, the email went out incomplete.

Read this:
https://cloudblog.switch.ch/2017/08/28/starting-1000-instances-on-switchengines/

Make sure that the OpenStack rootwrap is configured to work in daemon mode.

Thank you

Saverio


2018-05-22 15:29 GMT+02:00 Saverio Proto:

Hello Radu,

do you have the OpenStack rootwrap configured to work in daemon mode?

please read this article:

2018-05-18 10:21 GMT+02:00 Radu Popescu | eMAG, Technology:

Hi,

so, nova says the VM is ACTIVE, but it actually boots with no network. We are
setting some metadata that we use later on and have cloud-init for different
tasks.
So, the VM is up and the OS is running, but the network only starts working
after a random amount of time, which can reach around 45 minutes. The thing is,
it's not happening to all VMs in that test (around 300), but it's happening to
a fair amount - around 25%.

I can see the callback coming a few seconds after the neutron openvswitch agent
says it has completed the setup. My question is, why is it taking so long for
the openvswitch agent on the nova compute node to configure the port? I can see
the port up in both the host OS and openvswitch. I would assume it's doing the
whole namespace and iptables setup. But still, 30 minutes? Seems a lot!

Thanks,
Radu

On Thu, 2018-05-17 at 11:50 -0400, George Mihaiescu wrote:

We have other scheduled tests that perform end-to-end (assign floating IP, ssh,
ping outside) and never had an issue.
I think we turned it off because the callback code was initially buggy and nova
would wait forever while things were in fact ok, but I'll change
"vif_plugging_is_fatal = True" and "vif_plugging_timeout = 300" and run another
large test, just to confirm.

We usually run these large tests after a version upgrade to test the APIs under
load.


On Thu, May 17, 2018 at 11:42 AM, Matt Riedemann wrote:

On 5/17/2018 9:46 AM, George Mihaiescu wrote:

and large rally tests of 500 instances complete with no issues.

Sure, except you can't ssh into the guests.

The whole reason the vif plugging is fatal and timeout and callback code was
because the upstream CI was unstable without it. The server would report as
ACTIVE but the ports weren't wired up so ssh would fail. Having an ACTIVE guest
that you can't actually do anything with is kind of pointless.

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-23 Thread Radu Popescu | eMAG, Technology
Hi,

actually, I didn't know about that option. I'll enable it right now.
Testing is done every morning at about 4:00 AM, so I'll know tomorrow morning 
if it changed anything.
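
For anyone else who didn't know about it: as far as I can tell, the option 
Saverio means is root_helper_daemon for the neutron agents. A minimal sketch of 
what to set and how to verify it (exact file locations may differ in an 
OpenStack-Ansible deployment, where the agents run inside containers):

  # expected setting, under the [agent] section of the neutron agent config, e.g.:
  #   [agent]
  #   root_helper_daemon = sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
  # quick check that it is actually set on a compute/network node:
  grep -r 'root_helper_daemon' /etc/neutron/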

Thanks,
Radu

On Tue, 2018-05-22 at 15:30 +0200, Saverio Proto wrote:

Sorry, the email went out incomplete.

Read this:
https://cloudblog.switch.ch/2017/08/28/starting-1000-instances-on-switchengines/

Make sure that the OpenStack rootwrap is configured to work in daemon mode.

Thank you

Saverio


2018-05-22 15:29 GMT+02:00 Saverio Proto:

Hello Radu,

do you have the OpenStack rootwrap configured to work in daemon mode?

please read this article:

2018-05-18 10:21 GMT+02:00 Radu Popescu | eMAG, Technology:

Hi,

so, nova says the VM is ACTIVE, but it actually boots with no network. We are
setting some metadata that we use later on and have cloud-init for different
tasks.
So, the VM is up and the OS is running, but the network only starts working
after a random amount of time, which can reach around 45 minutes. The thing is,
it's not happening to all VMs in that test (around 300), but it's happening to
a fair amount - around 25%.

I can see the callback coming a few seconds after the neutron openvswitch agent
says it has completed the setup. My question is, why is it taking so long for
the openvswitch agent on the nova compute node to configure the port? I can see
the port up in both the host OS and openvswitch. I would assume it's doing the
whole namespace and iptables setup. But still, 30 minutes? Seems a lot!

Thanks,
Radu

On Thu, 2018-05-17 at 11:50 -0400, George Mihaiescu wrote:

We have other scheduled tests that perform end-to-end (assign floating IP, ssh,
ping outside) and never had an issue.
I think we turned it off because the callback code was initially buggy and nova
would wait forever while things were in fact ok, but I'll change
"vif_plugging_is_fatal = True" and "vif_plugging_timeout = 300" and run another
large test, just to confirm.

We usually run these large tests after a version upgrade to test the APIs under
load.


On Thu, May 17, 2018 at 11:42 AM, Matt Riedemann wrote:

On 5/17/2018 9:46 AM, George Mihaiescu wrote:

and large rally tests of 500 instances complete with no issues.

Sure, except you can't ssh into the guests.

The whole reason the vif plugging is fatal and timeout and callback code was
because the upstream CI was unstable without it. The server would report as
ACTIVE but the ports weren't wired up so ssh would fail. Having an ACTIVE guest
that you can't actually do anything with is kind of pointless.

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-18 Thread Radu Popescu | eMAG, Technology
Hi,

so, nova says the VM is ACTIVE, but it actually boots with no network. We are 
setting some metadata that we use later on and have cloud-init for different 
tasks.
So, the VM is up and the OS is running, but the network only starts working after 
a random amount of time, which can reach around 45 minutes. The thing is, it's 
not happening to all VMs in that test (around 300), but it's happening to a fair 
amount - around 25%.

I can see the callback coming a few seconds after the neutron openvswitch agent 
says it has completed the setup. My question is, why is it taking so long for the 
openvswitch agent on the nova compute node to configure the port? I can see the 
port up in both the host OS and openvswitch. I would assume it's doing the whole 
namespace and iptables setup. But still, 30 minutes? Seems a lot!

Thanks,
Radu

On Thu, 2018-05-17 at 11:50 -0400, George Mihaiescu wrote:
We have other scheduled tests that perform end-to-end (assign floating IP, ssh, 
ping outside) and never had an issue.
I think we turned it off because the callback code was initially buggy and nova 
would wait forever while things were in fact ok, but I'll  change 
"vif_plugging_is_fatal = True" and "vif_plugging_timeout = 300" and run another 
large test, just to confirm.

We usually run these large tests after a version upgrade to test the APIs under 
load.



On Thu, May 17, 2018 at 11:42 AM, Matt Riedemann wrote:
On 5/17/2018 9:46 AM, George Mihaiescu wrote:
and large rally tests of 500 instances complete with no issues.


Sure, except you can't ssh into the guests.

The whole reason the vif plugging is fatal and timeout and callback code was 
because the upstream CI was unstable without it. The server would report as 
ACTIVE but the ports weren't wired up so ssh would fail. Having an ACTIVE guest 
that you can't actually do anything with is kind of pointless.


___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


Re: [Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-17 Thread Radu Popescu | eMAG, Technology
Hi,

unfortunately, I didn't get the reply in my inbox, so I'm answering from the link 
here:
http://lists.openstack.org/pipermail/openstack-operators/2018-May/015270.html
(hopefully, my reply will go to the same thread)

Anyway, I can see the neutron openvswitch agent logs processing the interface 
way after the VM is up (in this case, 30 minutes), and well past the vif 
plugging timeout of 5 minutes (currently raised to 10 minutes).
After searching the logs, I came up with an example here (nova compute hostname 
replaced with "nova.compute.hostname"):

http://paste.openstack.org/show/1VevKuimoBMs4G8X53Eu/

As you can see, the request for the VM starts around 3:27 AM. Ports get created, 
openvswitch gets the command to set them up, DHCP is there, but apparently the 
Neutron server sends the callback only after the Neutron Openvswitch agent 
finishes: the callback is at 2018-05-10 03:57:36.177, while the Neutron 
Openvswitch agent says it completed the setup and configuration at 
2018-05-10 03:57:35.247.

So, my question is, why is Neutron Openvswitch agent processing the request 30 
minutes after the VM is started? And where can I search for logs for whatever 
happens during those 30 minutes?
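
One way to narrow that window down is to follow a single port UUID across the 
services involved; a rough sketch, with a placeholder UUID and stock log paths 
(an OpenStack-Ansible deployment may keep these inside the service containers 
or in journald):

  # on the compute node hosting the instance
  grep '<port-uuid>' /var/log/neutron/neutron-openvswitch-agent.log  # when the agent wires the port
  grep '<port-uuid>' /var/log/nova/nova-compute.log                  # vif plugging / network-vif-plugged event
  # on the controller
  grep '<port-uuid>' /var/log/neutron/neutron-server.log             # port binding and the callback to nova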
And yes, we're using libvirt. At some point, we added some new nova compute 
nodes; the new ones came with v3.2.0, which was breaking migration between 
hosts. That's why we downgraded (and versionlocked) everything to v2.0.0.

Thanks,
Radu
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


[Openstack-operators] attaching network cards to VMs taking a very long time

2018-05-16 Thread Radu Popescu | eMAG, Technology
Hi all,

we have the following setup:
- OpenStack Ocata deployed with OpenStack-Ansible (v15.1.7)
- 66 compute nodes, each having between 50 and 150 VMs, depending on their 
hardware configuration
- we don't use Ceilometer (so no extra load on the RabbitMQ cluster)
- using Openvswitch HA with DVR
- all messaging goes through a 3-server RabbitMQ cluster
- we now have 3 CCs (initially had 2) hosting every other internal service

What happens is, when we create a large number of VMs (around 300; it's something 
we do on a daily basis, just to test different types of VMs and apps), some of 
them don't get the network interface attached in a reasonable time.
After investigating, we can see that the Neutron Openvswitch agent sees the port 
attached to the server from an OpenStack point of view, and I can see the tap 
interface created in Openvswitch using both its logs and dmesg, but nova attaches 
the interface only after a huge amount of time (I have seen delays of up to 45 
minutes).

Since I can't see any reasonable errors I could take care of, my last chance is 
this mailing list.
The only thing I can think of is that maybe libvirt is not able to attach the 
interface in a reasonable amount of time. But still, 45 minutes is way too much.

At the moment:
vif_plugging_is_fatal = True
vif_plugging_timeout = 600 (modified from default 300s)

That's because we need VMs with networking. Otherwise, whether the instance 
errors out or comes up with no network, it's the same thing for us.
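
For reference, both options live under [DEFAULT] in nova.conf on the compute 
nodes; a quick way to confirm the values actually in effect (paths assume the 
stock layout):

  # should show the two settings quoted above
  grep -E '^vif_plugging_(is_fatal|timeout)' /etc/nova/nova.conf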

Thanks,

--

Radu Popescu
___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators