Re: [openstack-dev] [Neutron] L3 agent rescheduling issue

2015-06-08 Thread Eugene Nikanorov
Yes, 50-100 networks received by the DHCP agent on startup could cause the 2nd state report to be sent seconds after it should be sent. In my tests, if I recall correctly, it was ~70 networks, with a delay between the 1st and 2nd state report of around 25 seconds (while 5 sec was configured). Eugene. On Sun, Jun 7,

Re: [openstack-dev] [Neutron] L3 agent rescheduling issue

2015-06-07 Thread Kevin Benton
Well, a greenthread will only yield when it makes a blocking call like writing to a network socket, file, etc. So once the report_state greenthread starts executing, it won't yield until it makes a call like that. I looked through the report_state code for the DHCP agent and the only blocking call
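
A minimal sketch of the behavior Kevin describes (my own illustration in plain eventlet, not Neutron code): the CPU-bound greenthread never hits a cooperative point, so the reporter's next tick is held up until that work finishes.

    import time
    import eventlet

    def reporter():
        for _ in range(3):
            print("report_state-like tick at %.2f" % time.time())
            eventlet.sleep(1)      # yields here, like a blocking RPC send would

    def cpu_bound():
        t0 = time.time()
        while time.time() - t0 < 2:
            sum(range(100000))     # pure CPU work: no yield point at all
        print("cpu_bound done; only now can the reporter run again")

    r = eventlet.spawn(reporter)
    c = eventlet.spawn(cpu_bound)
    r.wait()
    c.wait()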

Re: [openstack-dev] [Neutron] L3 agent rescheduling issue

2015-06-07 Thread Eugene Nikanorov
No, I think the greenthread itself doesn't do anything special; it's just that when there are too many greenthreads, the state_report thread can't get control for too long, since there is no prioritization of greenthreads. Eugene. On Sun, Jun 7, 2015 at 8:24 PM, Kevin Benton wrote: > I understand now. So the i

Re: [openstack-dev] [Neutron] L3 agent rescheduling issue

2015-06-07 Thread Kevin Benton
I understand now. So the issue is that the report_state greenthread is just blocking and yielding whenever it tries to actually send a message? On Sun, Jun 7, 2015 at 8:10 PM, Eugene Nikanorov wrote: > Salvatore, > > By 'fairness' I meant chances for state report greenthread to get the > control

Re: [openstack-dev] [Neutron] L3 agent rescheduling issue

2015-06-07 Thread Eugene Nikanorov
Salvatore, By 'fairness' I meant the chances for the state report greenthread to get control. In the DHCP case, each network is processed by a separate greenthread, so the more greenthreads the agent has, the lower the chances that the report state greenthread will be able to report in time. Thanks, Eugene. On Sun, Jun
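
A toy model of the scheduling shape Eugene describes (illustration only, not the DHCP agent's code): one report_state loop plus one greenthread per network, all on the same cooperative hub with no priorities, so with enough workers the reporter's effective interval stretches past the configured one.

    import time
    import eventlet

    REPORT_INTERVAL = 1.0
    NUM_NETWORKS = 70              # roughly the number from the test above

    def report_state():
        last = time.time()
        for _ in range(5):
            eventlet.sleep(REPORT_INTERVAL)
            now = time.time()
            print("report_state: %.2fs since previous (configured %.1fs)"
                  % (now - last, REPORT_INTERVAL))
            last = now

    def process_network(net_id):
        for _ in range(20):
            sum(range(200000))     # stand-in for resync work between yields
            eventlet.sleep(0)      # the only place this worker yields

    reporter = eventlet.spawn(report_state)
    pool = eventlet.GreenPool(NUM_NETWORKS)
    for net in range(NUM_NETWORKS):
        pool.spawn_n(process_network, net)
    pool.waitall()
    reporter.wait()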

Re: [openstack-dev] [Neutron] L3 agent rescheduling issue

2015-06-07 Thread Salvatore Orlando
On 5 June 2015 at 01:29, Itsuro ODA wrote: > Hi, > > > After trying to reproduce this, I'm suspecting that the issue is actually > > on the server side from failing to drain the agent report state queue in > > time. > > I have seen this before. > I thought the scenario at that time was as follows. * a lo

Re: [openstack-dev] [Neutron] L3 agent rescheduling issue

2015-06-04 Thread Eugene Nikanorov
I doubt it's a server-side issue. Usually there are plenty of RPC workers to drain a much higher volume of RPC messages coming from agents. So the issue could be in 'fairness' on the L3 agent side. But from my observations it was more an issue of the DHCP agent than the L3 agent, due to the difference in resource proc

Re: [openstack-dev] [Neutron] L3 agent rescheduling issue

2015-06-04 Thread Itsuro ODA
Hi, > After trying to reproduce this, I'm suspecting that the issue is actually > on the server side from failing to drain the agent report state queue in > time. I have seen this before. I thought the scenario at that time was as follows. * a lot of create/update resource API calls issued * "rpc_conn_pool_size
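
A hedged sketch of the server-side knobs Itsuro's scenario points at (option names as they existed in Kilo-era Neutron/oslo.messaging; sections and defaults vary by release, and the values here are only illustrative):

    # neutron.conf on the neutron-server host
    [DEFAULT]
    # worker processes draining the RPC queues (agent state reports included);
    # too few of them lets the report-state queue back up under API load
    rpc_workers = 4
    # size of the shared RPC connection pool; exhausting it delays every
    # cast/call in flight, state reports among them
    rpc_conn_pool_size = 60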

Re: [openstack-dev] [Neutron] L3 agent rescheduling issue

2015-06-04 Thread Carl Baldwin
On Thu, Jun 4, 2015 at 3:20 PM, Kevin Benton wrote: > After trying to reproduce this, I'm suspecting that the issue is actually on > the server side from failing to drain the agent report state queue in time. > > I set the report_interval to 1 second on the agent and added a logging > statement an

Re: [openstack-dev] [Neutron] L3 agent rescheduling issue

2015-06-04 Thread Kevin Benton
After trying to reproduce this, I'm suspecting that the issue is actually on the server side from failing to drain the agent report state queue in time. I set the report_interval to 1 second on the agent and added a logging statement, and I see a report every 1 second even when sync_routers is taki
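
Hypothetical instrumentation along the lines Kevin describes (not his actual patch): wrap the agent's report-state call and log the gap between consecutive reports, to confirm they really leave the agent every report_interval.

    import functools
    import time

    _last = {"t": None}

    def log_report_gap(report_fn):
        @functools.wraps(report_fn)
        def wrapper(*args, **kwargs):
            now = time.time()
            if _last["t"] is not None:
                print("report_state gap: %.2fs" % (now - _last["t"]))
            _last["t"] = now
            return report_fn(*args, **kwargs)
        return wrapper

    # usage sketch (attribute name is assumed, check the agent class):
    # agent._report_state = log_report_gap(agent._report_state)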

Re: [openstack-dev] [Neutron] L3 agent rescheduling issue

2015-06-04 Thread Carl Baldwin
Ann, Thanks for bringing this up. It has been on the shelf for a while now. Carl. On Thu, Jun 4, 2015 at 8:54 AM, Salvatore Orlando wrote: > One reason for not sending the heartbeat from a separate greenthread could > be that the agent is already doing it [1]. > The current proposed patch addre

Re: [openstack-dev] [Neutron] L3 agent rescheduling issue

2015-06-04 Thread Assaf Muller
- Original Message - > One reason for not sending the heartbeat from a separate greenthread could be > that the agent is already doing it [1]. > The current proposed patch addresses the issue blindly - that is to say > before declaring an agent dead let's wait for some more time because i

Re: [openstack-dev] [Neutron] L3 agent rescheduling issue

2015-06-04 Thread Kevin Benton
Is there a way to parallelize the periodic tasks? I wanted to go this route because I encountered cases where a bunch of routers would get scheduled to L3 agents and they would all hit the server nearly simultaneously with a sync_routers task. This could result in thousands of routers and their floa

Re: [openstack-dev] [Neutron] L3 agent rescheduling issue

2015-06-04 Thread Salvatore Orlando
One reason for not sending the heartbeat from a separate greenthread could be that the agent is already doing it [1]. The current proposed patch addresses the issue blindly - that is to say before declaring an agent dead let's wait for some more time because it could be stuck doing stuff. In that c

Re: [openstack-dev] [Neutron] L3 agent rescheduling issue

2015-06-04 Thread Kevin Benton
Why don't we put the agent heartbeat into a separate greenthread on the agent so it continues to send updates even when it's busy processing changes? On Jun 4, 2015 2:56 AM, "Anna Kamyshnikova" wrote: > Hi, neutrons! > > Some time ago I discovered a bug for l3 agent rescheduling [1]. When there >
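
A sketch of what Kevin is proposing (illustration only, not a Neutron patch), with send_state_report standing in for the agent's actual RPC call:

    import time
    import eventlet

    REPORT_INTERVAL = 30

    def send_state_report():
        # placeholder for the real report_state RPC call
        print("heartbeat sent at %.0f" % time.time())

    def heartbeat_loop():
        while True:
            try:
                send_state_report()
            except Exception as exc:   # never let a failed report kill the loop
                print("state report failed: %s" % exc)
            eventlet.sleep(REPORT_INTERVAL)

    def start_agent():
        eventlet.spawn(heartbeat_loop)  # heartbeat runs independently of syncs
        # ... continue with the agent's normal processing loop ...

As the rest of the thread points out, this alone does not guarantee timely reports: the heartbeat greenthread still has to win the hub back from whatever other greenthreads are busy.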

[openstack-dev] [Neutron] L3 agent rescheduling issue

2015-06-04 Thread Anna Kamyshnikova
Hi, neutrons! Some time ago I discovered a bug for l3 agent rescheduling [1]. When there are a lot of resources and agent_down_time is not large enough, neutron-server starts marking l3 agents as dead. The same issue has been discovered and fixed for DHCP agents. I proposed a change similar to those
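
For reference, the two knobs involved (a hedged sketch; option names as in Kilo-era Neutron, defaults and config sections may differ by release):

    # neutron.conf on the server side
    [DEFAULT]
    # seconds without a state report before neutron-server marks an agent
    # dead and, for L3/DHCP, starts rescheduling its resources
    agent_down_time = 75

    # neutron.conf on the agent side
    [AGENT]
    # how often the agent sends report_state; keep agent_down_time at least
    # a couple of multiples of this so a single late report is not fatal
    report_interval = 30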