Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-16 Thread Maru Newby
On Dec 13, 2013, at 8:06 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote: On Fri, Dec 06, 2013 at 04:30:17PM +0900, Maru Newby ma...@redhat.com wrote: On Dec 5, 2013, at 5:21 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote: On Wed, Dec 04, 2013 at 12:37:19PM +0900, Maru Newby

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-10 Thread Maru Newby
On Dec 10, 2013, at 4:47 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote: On Tue, Dec 10, 2013 at 07:28:10PM +1300, Robert Collins robe...@robertcollins.net wrote: On 10 December 2013 19:16, Isaku Yamahata isaku.yamah...@gmail.com wrote: Answering myself. If the connection is closed, it

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-10 Thread Isaku Yamahata
On Mon, Dec 09, 2013 at 08:43:59AM +1300, Robert Collins robe...@robertcollins.net wrote: listening: when an agent connects after an outage, it first starts listening, then does a poll for updates it missed. Are you suggesting that processing of notifications and full state
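The "listen first, then poll" recovery order quoted above can be illustrated with a minimal sketch. It is not taken from Neutron: the class `ResyncingAgent`, the `fetch_full_state` callable, and the per-resource version numbers are all hypothetical, introduced only to show why subscribing before polling avoids a gap in which updates could be missed.

```python
import queue


class ResyncingAgent:
    """Hypothetical agent that subscribes first and only then polls full state."""

    def __init__(self, fetch_full_state):
        # fetch_full_state is an assumed callable returning
        # {resource_id: (value, version)} from the server.
        self._fetch_full_state = fetch_full_state
        self._pending = queue.Queue()
        self._listening = False
        self.state = {}

    def on_notification(self, key, value, version):
        # Invoked by the (assumed) messaging layer for every update.
        # Updates sent while the agent is not listening are simply lost.
        if self._listening:
            self._pending.put((key, value, version))

    def reconnect(self):
        # 1. Start listening so nothing sent from now on is missed.
        self._listening = True
        # 2. Poll the full state to cover updates lost during the outage.
        for key, (value, version) in self._fetch_full_state().items():
            self._apply(key, value, version)
        # 3. Apply notifications that arrived while the poll was running.
        self.drain()

    def drain(self):
        while True:
            try:
                key, value, version = self._pending.get_nowait()
            except queue.Empty:
                break
            self._apply(key, value, version)

    def _apply(self, key, value, version):
        # Versions (an assumption of this sketch) make replays harmless:
        # a stale update never overwrites newer state.
        current = self.state.get(key)
        if current is None or version > current[1]:
            self.state[key] = (value, version)
```

Because listening starts before the poll, an update can be seen twice (once in the poll result, once as a notification), which is why the sketch needs some way to discard stale or duplicate updates.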

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-10 Thread Maru Newby
On Dec 5, 2013, at 4:43 PM, Édouard Thuleau thul...@gmail.com wrote: There is also another bug you can link/duplicate with #1192381: https://bugs.launchpad.net/neutron/+bug/1185916. I proposed a fix, but it was not the right approach, so I abandoned it. Édouard. Thank you for pointing this out! m.

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-10 Thread Maru Newby
On Dec 10, 2013, at 6:36 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote: On Mon, Dec 09, 2013 at 08:43:59AM +1300, Robert Collins robe...@robertcollins.net wrote: listening: when an agent connects after an outage, it first starts listening, then does a poll for updates it missed. Are

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-10 Thread Isaku Yamahata
On Wed, Dec 11, 2013 at 01:23:36AM +0900, Maru Newby ma...@redhat.com wrote: On Dec 10, 2013, at 6:36 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote: On Mon, Dec 09, 2013 at 08:43:59AM +1300, Robert Collins robe...@robertcollins.net wrote: listening: when an agent connects after

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-10 Thread Maru Newby
On Dec 11, 2013, at 8:39 AM, Isaku Yamahata isaku.yamah...@gmail.com wrote: On Wed, Dec 11, 2013 at 01:23:36AM +0900, Maru Newby ma...@redhat.com wrote: On Dec 10, 2013, at 6:36 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote: On Mon, Dec 09, 2013 at 08:43:59AM +1300, Robert Collins

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-09 Thread Isaku Yamahata
On Mon, Dec 09, 2013 at 08:07:12PM +0900, Isaku Yamahata isaku.yamah...@gmail.com wrote: On Mon, Dec 09, 2013 at 08:43:59AM +1300, Robert Collins robe...@robertcollins.net wrote: On 9 December 2013 01:43, Maru Newby ma...@redhat.com wrote: If the AMQP service is set up not to lose

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-09 Thread Robert Collins
On 10 December 2013 19:16, Isaku Yamahata isaku.yamah...@gmail.com wrote: Answering myself. If the connection is closed, it will reconnect automatically at the RPC layer. See neutron.openstack.common.rpc.impl_{kombu, qpid}.py. So notifications during reconnects can be lost if the AMQP service is set to

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-09 Thread Isaku Yamahata
On Tue, Dec 10, 2013 at 07:28:10PM +1300, Robert Collins robe...@robertcollins.net wrote: On 10 December 2013 19:16, Isaku Yamahata isaku.yamah...@gmail.com wrote: Answering myself. If the connection is closed, it will reconnect automatically at the RPC layer. See

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-09 Thread Isaku Yamahata
On Mon, Dec 09, 2013 at 08:43:59AM +1300, Robert Collins robe...@robertcollins.net wrote: On 9 December 2013 01:43, Maru Newby ma...@redhat.com wrote: If the AMQP service is set up not to lose notifications, they will pile up and stress the AMQP service. I would say single node

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-08 Thread Maru Newby
On Dec 7, 2013, at 6:21 PM, Robert Collins robe...@robertcollins.net wrote: On 7 December 2013 21:53, Isaku Yamahata isaku.yamah...@gmail.com wrote: Case 3: Hardware failure. So an agent on the node is gone. Another agent will run on another node. If the AMQP service is set up not to

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-07 Thread Isaku Yamahata
On Fri, Dec 06, 2013 at 04:30:17PM +0900, Maru Newby ma...@redhat.com wrote: 2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent. I'm hoping that coming up with a solution to #1 will allow us the breathing room to work on #2 in this cycle. Loss of

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-07 Thread Robert Collins
On 7 December 2013 21:53, Isaku Yamahata isaku.yamah...@gmail.com wrote: Case 3: Hardware failure. So an agent on the node is gone. Another agent will run on another node. If the AMQP service is set up not to lose notifications, they will pile up and stress the AMQP service. I

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-06 Thread Carl Baldwin
Pasting a few things from IRC here to fill out the context...
<marun> carl_baldwin: but according to markmcclain and salv-orlando, it isn't possible to trivially use multiple workers for rpc because processing rpc requests out of sequence can be dangerous
<carl_baldwin> marun: I think it is already

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-05 Thread Isaku Yamahata
On Wed, Dec 04, 2013 at 12:37:19PM +0900, Maru Newby ma...@redhat.com wrote: On Dec 4, 2013, at 11:57 AM, Clint Byrum cl...@fewbar.com wrote: Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800: I've been investigating a bug that is preventing VMs from receiving IP

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-05 Thread Maru Newby
On Dec 5, 2013, at 6:43 AM, Carl Baldwin c...@ecbaldwin.net wrote: I have offered up https://review.openstack.org/#/c/60082/ as a backport to Havana. Interest was expressed in the blueprint for doing this even before this thread. If there is consensus for this as the stop-gap then it is

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-05 Thread Carl Baldwin
Creating separate processes for API workers does allow a bit more room for RPC message processing in the main process. If this isn't enough and the main process is still bound on CPU and/or green thread/SQLAlchemy blocking, then creating separate worker processes for RPC processing may be the next

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-05 Thread Maru Newby
On Dec 5, 2013, at 5:21 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote: On Wed, Dec 04, 2013 at 12:37:19PM +0900, Maru Newby ma...@redhat.com wrote: In the current architecture, the Neutron service handles RPC and WSGI with a single process and is prone to being overloaded such that

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-04 Thread Joe Gordon
On Dec 4, 2013 5:41 AM, Maru Newby ma...@redhat.com wrote: On Dec 4, 2013, at 11:57 AM, Clint Byrum cl...@fewbar.com wrote: Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800: I've been investigating a bug that is preventing VMs from receiving IP addresses when a Neutron

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-04 Thread Maru Newby
On Dec 4, 2013, at 8:55 AM, Carl Baldwin c...@ecbaldwin.net wrote: Stephen, all, I agree that there may be some opportunity to split things out a bit. However, I'm not sure what the best way will be. I recall that Mark mentioned breaking out the processes that handle API requests and RPC

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-04 Thread Ashok Kumaran
On Wed, Dec 4, 2013 at 8:30 PM, Maru Newby ma...@redhat.com wrote: On Dec 4, 2013, at 8:55 AM, Carl Baldwin c...@ecbaldwin.net wrote: Stephen, all, I agree that there may be some opportunity to split things out a bit. However, I'm not sure what the best way will be. I recall that Mark

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-04 Thread Carl Baldwin
Sorry to have taken the discussion on a slight tangent. I meant only to offer the solution as a stop-gap. I agree that the fundamental problem should still be addressed. On Tue, Dec 3, 2013 at 8:01 PM, Maru Newby ma...@redhat.com wrote: On Dec 4, 2013, at 1:47 AM, Stephen Gran

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-04 Thread Carl Baldwin
I have offered up https://review.openstack.org/#/c/60082/ as a backport to Havana. Interest was expressed in the blueprint for doing this even before this thread. If there is consensus for this as the stop-gap then it is there for the merging. However, I do not want to discourage discussion of

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-04 Thread Édouard Thuleau
There is also another bug you can link/duplicate with #1192381: https://bugs.launchpad.net/neutron/+bug/1185916. I proposed a fix, but it was not the right approach, so I abandoned it. Édouard. On Wed, Dec 4, 2013 at 10:43 PM, Carl Baldwin c...@ecbaldwin.net wrote: I have offered up

[openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Maru Newby
I've been investigating a bug that is preventing VMs from receiving IP addresses when a Neutron service is under high load: https://bugs.launchpad.net/neutron/+bug/1192381 High load causes the DHCP agent's status updates to be delayed, causing the Neutron service to assume that the agent is

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Stephen Gran
On 03/12/13 16:08, Maru Newby wrote: I've been investigating a bug that is preventing VMs from receiving IP addresses when a Neutron service is under high load: https://bugs.launchpad.net/neutron/+bug/1192381 High load causes the DHCP agent's status updates to be delayed, causing the

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Carl Baldwin
Stephen, all, I agree that there may be some opportunity to split things out a bit. However, I'm not sure what the best way will be. I recall that Mark mentioned breaking out the processes that handle API requests and RPC from each other at the summit. Anyway, it is something that has been

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Yongsheng Gong
Another way is to use a larger agent_down_time; by default it is 9 secs. On Wed, Dec 4, 2013 at 7:55 AM, Carl Baldwin c...@ecbaldwin.net wrote: Stephen, all, I agree that there may be some opportunity to split things out a bit. However, I'm not sure what the best way will be. I recall that

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Clint Byrum
Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800: I've been investigating a bug that is preventing VMs from receiving IP addresses when a Neutron service is under high load: https://bugs.launchpad.net/neutron/+bug/1192381 High load causes the DHCP agent's status updates to

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Maru Newby
On Dec 4, 2013, at 1:47 AM, Stephen Gran stephen.g...@theguardian.com wrote: On 03/12/13 16:08, Maru Newby wrote: I've been investigating a bug that is preventing VMs from receiving IP addresses when a Neutron service is under high load: https://bugs.launchpad.net/neutron/+bug/1192381

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Maru Newby
On Dec 4, 2013, at 11:02 AM, Yongsheng Gong gong...@unitedstack.com wrote: Another way is to use a larger agent_down_time; by default it is 9 secs. I don't believe that increasing the timeout by itself is a good solution. Relying on the agent state to know whether to send a notification has
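For readers following the agent_down_time exchange, here is a minimal sketch of the kind of heartbeat check under discussion. It assumes the service treats an agent as down once its last status update is older than agent_down_time seconds. The function name, signature, and timestamps are hypothetical, not Neutron's actual code; only the 9-second default comes from the thread.

```python
from datetime import datetime, timedelta

# Default quoted in the thread; everything else here is illustrative.
AGENT_DOWN_TIME = 9  # seconds


def is_agent_down(last_heartbeat, now, agent_down_time=AGENT_DOWN_TIME):
    """Return True if the last status update is older than agent_down_time."""
    return now - last_heartbeat > timedelta(seconds=agent_down_time)


now = datetime(2013, 12, 3, 16, 8, 9)
# A status update delayed by load makes a healthy agent look dead.
delayed = now - timedelta(seconds=12)
print(is_agent_down(delayed, now))                      # considered down
print(is_agent_down(delayed, now, agent_down_time=30))  # considered alive
```

Raising the threshold only widens the window: a status update delayed beyond the new value would still cause a live agent to be treated as down, which is in line with the objection that a larger timeout is not a good solution by itself.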

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Maru Newby
On Dec 4, 2013, at 11:57 AM, Clint Byrum cl...@fewbar.com wrote: Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800: I've been investigating a bug that is preventing VMs from receiving IP addresses when a Neutron service is under high load:

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Clint Byrum
Excerpts from Maru Newby's message of 2013-12-03 19:37:19 -0800: On Dec 4, 2013, at 11:57 AM, Clint Byrum cl...@fewbar.com wrote: Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800: I've been investigating a bug that is preventing VMs from receiving IP addresses when a