Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Dec 13, 2013, at 8:06 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote:

On Fri, Dec 06, 2013 at 04:30:17PM +0900, Maru Newby ma...@redhat.com wrote:

On Dec 5, 2013, at 5:21 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote:

On Wed, Dec 04, 2013 at 12:37:19PM +0900, Maru Newby ma...@redhat.com wrote:

In the current architecture, the Neutron service handles RPC and WSGI with a single process and is prone to being overloaded such that agent heartbeats can be delayed beyond the limit for the agent being declared 'down'. Even if we increased the agent timeout as Yongsheng suggests, there is no guarantee that we can accurately detect whether an agent is 'live' with the current architecture. Given that amqp can ensure eventual delivery - it is a queue - is sending a notification blind such a bad idea? In the best case the agent isn't really down and can process the notification. In the worst case, the agent really is down but will be brought up eventually by a deployment's monitoring solution and process the notification when it returns. What am I missing?

Do you mean overload of the neutron server, not the neutron agent? So even though the agent sends periodic 'live' reports, the reports pile up unprocessed by the server. When the server sends a notification, it wrongly considers the agent dead - not because the agent failed to send live reports due to its own overload. Is this understanding correct?

Your interpretation is likely correct. The demands on the service are going to be much higher by virtue of having to field RPC requests from all the agents to interact with the database on their behalf.

Is this strongly indicating thread starvation, i.e. too much unfair thread scheduling? Given that eventlet is cooperative threading, should we add sleep(0) to the hogging thread?

I'm afraid that's a question for a profiler: https://github.com/colinhowe/eventlet_profiler

m.
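To make the sleep(0) suggestion concrete, here is a minimal sketch in plain eventlet (not Neutron code; function names are made up). A green thread doing blocking work never yields on its own, so peers such as heartbeat handlers starve; an explicit eventlet.sleep(0) between work items hands control back to the hub:

    import time

    import eventlet


    def hog():
        # Stands in for a CPU-bound handler. time.sleep is deliberately
        # not monkey-patched, so it blocks the whole hub; without the
        # eventlet.sleep(0) below, heartbeat() would not run until hog()
        # finished entirely.
        for _ in range(3):
            time.sleep(0.3)       # blocking work
            eventlet.sleep(0)     # cooperative yield to peer threads


    def heartbeat():
        for i in range(6):
            print('heartbeat %d' % i)
            eventlet.sleep(0.1)


    pool = eventlet.GreenPool()
    pool.spawn(hog)
    pool.spawn(heartbeat)
    pool.waitall()

Running it shows heartbeats only getting through at the explicit yield points, which is exactly the starvation pattern a profiler would need to confirm in the real server.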
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Dec 10, 2013, at 4:47 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote:

On Tue, Dec 10, 2013 at 07:28:10PM +1300, Robert Collins robe...@robertcollins.net wrote:

On 10 December 2013 19:16, Isaku Yamahata isaku.yamah...@gmail.com wrote:

Answering myself. If the connection is closed, it will reconnect automatically at the rpc layer. See neutron.openstack.common.rpc.impl_{kombu, qpid}.py. So notifications during reconnects can be lost if the AMQP service is set to discard notifications while there is no subscriber.

Which is fine: the agent repulls the full set it's running on that machine, and life goes on.

On what event? Polling in the agent seems effectively disabled by self.needs_resync with the current code.

If the agent is not connected, it is either down (needs_resync will be set to True on launch) or experiencing a loss of connectivity to the amqp service (needs_resync will have been set to True on error). The loss of notifications is not a problem in either case.

m.
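For readers following along, a hypothetical condensation of the resync flag behavior described above (attribute and method names only loosely mirror the real agent; details are elided):

    import eventlet


    class DhcpAgentSketch(object):
        def __init__(self):
            # Set on launch: a full sync happens before normal operation.
            self.needs_resync = True

        def sync_state(self):
            print('repulling full state from the server')

        def _report_state(self):
            try:
                raise RuntimeError('simulated RPC/AMQP failure')
            except Exception:
                # Any RPC error marks local state as suspect; the next
                # periodic pass repairs it with a full resync.
                self.needs_resync = True

        def _periodic_resync_helper(self, interval=5):
            while True:
                if self.needs_resync:
                    self.needs_resync = False
                    self.sync_state()
                eventlet.sleep(interval)

Under this scheme, both of the disconnected cases above funnel into the same full resync, which is why lost notifications during the outage are harmless.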
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Mon, Dec 09, 2013 at 08:43:59AM +1300, Robert Collins robe...@robertcollins.net wrote:

listening: when an agent connects after an outage, it first starts listening, then does a poll for updates it missed.

Are you suggesting that processing of notifications and full state synchronization are able to cooperate safely? Or hoping that it will be so in the future?

I'm saying that you can avoid race conditions by a combination of 'subscribe to changes' + 'give me the full state'.

Like this? https://review.openstack.org/#/c/61057/ This patch is just to confirm the idea.

-- Isaku Yamahata isaku.yamah...@gmail.com
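The 'subscribe to changes' + 'give me the full state' ordering can be illustrated with a toy stand-in (no AMQP involved, all names invented). The point is only that subscribing before taking the snapshot means events raised during the snapshot are queued and replayed, never lost; at worst the replay is redundant with the snapshot, never inconsistent with it:

    import collections

    # Simulated event stream standing in for the AMQP subscription.
    events = collections.deque()


    def subscribe():
        # Step 1: start listening first, so changes made while the
        # snapshot RPC is in flight are queued rather than dropped.
        events.append(('network_create', 'net-3'))  # arrives mid-snapshot


    def get_full_state():
        # Step 2: the authoritative snapshot.
        return ['net-1', 'net-2']


    subscribe()
    state = set(get_full_state())
    # Step 3: replay anything queued during the snapshot.
    while events:
        kind, net = events.popleft()
        if kind == 'network_create':
            state.add(net)
    print(sorted(state))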
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Dec 5, 2013, at 4:43 PM, Édouard Thuleau thul...@gmail.com wrote:

There is also another bug you can link/duplicate with #1192381: https://bugs.launchpad.net/neutron/+bug/1185916. I proposed a fix but it's not the right way, so I abandoned it. Édouard.

Thank you for pointing this out!

m.
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Dec 10, 2013, at 6:36 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote:

On Mon, Dec 09, 2013 at 08:43:59AM +1300, Robert Collins robe...@robertcollins.net wrote:

listening: when an agent connects after an outage, it first starts listening, then does a poll for updates it missed.

Are you suggesting that processing of notifications and full state synchronization are able to cooperate safely? Or hoping that it will be so in the future?

I'm saying that you can avoid race conditions by a combination of 'subscribe to changes' + 'give me the full state'.

Like this? https://review.openstack.org/#/c/61057/ This patch is just to confirm the idea.

I'm afraid the proposed patch is no more reliable than the current approach of using file-based locking. I am working on an alternate patch that puts the rpc event loop in the dhcp agent so that better coordination between full synchronization and notification handling is possible. This approach has already been taken with the L3 agent, and work on the L2 agent is in progress.

m.
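A toy sketch of the single event loop idea (my own condensation, not the actual patch in progress): both resync requests and notifications funnel through one queue and one consumer, so ordering is enforced by construction rather than by file-based locks:

    import collections

    import eventlet

    # One queue, one consumer: full resyncs and notifications are handled
    # strictly in arrival order, so a resync can never interleave with a
    # notification handler.
    work = collections.deque()


    def dispatch(kind, payload=None):
        work.append((kind, payload))


    def event_loop():
        while work:
            kind, payload = work.popleft()
            if kind == 'resync':
                print('full sync')
            else:
                print('notification: %s' % (payload,))
            eventlet.sleep(0)  # yield between work items


    dispatch('resync')
    dispatch('port_create_end', {'port': {'id': 'p1'}})
    event_loop()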
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Wed, Dec 11, 2013 at 01:23:36AM +0900, Maru Newby ma...@redhat.com wrote:

On Dec 10, 2013, at 6:36 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote:

On Mon, Dec 09, 2013 at 08:43:59AM +1300, Robert Collins robe...@robertcollins.net wrote:

listening: when an agent connects after an outage, it first starts listening, then does a poll for updates it missed.

Are you suggesting that processing of notifications and full state synchronization are able to cooperate safely? Or hoping that it will be so in the future?

I'm saying that you can avoid race conditions by a combination of 'subscribe to changes' + 'give me the full state'.

Like this? https://review.openstack.org/#/c/61057/ This patch is just to confirm the idea.

I'm afraid the proposed patch is no more reliable than the current approach of using file-based locking. I am working on an alternate patch that puts the rpc event loop in the dhcp agent so that better coordination between full synchronization and notification handling is possible. This approach has already been taken with the L3 agent, and work on the L2 agent is in progress.

You objected to agent polling earlier in the discussion, but now you're proposing polling. Did you change your mind?

-- Isaku Yamahata isaku.yamah...@gmail.com
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Dec 11, 2013, at 8:39 AM, Isaku Yamahata isaku.yamah...@gmail.com wrote:

On Wed, Dec 11, 2013 at 01:23:36AM +0900, Maru Newby ma...@redhat.com wrote:

On Dec 10, 2013, at 6:36 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote:

On Mon, Dec 09, 2013 at 08:43:59AM +1300, Robert Collins robe...@robertcollins.net wrote:

listening: when an agent connects after an outage, it first starts listening, then does a poll for updates it missed.

Are you suggesting that processing of notifications and full state synchronization are able to cooperate safely? Or hoping that it will be so in the future?

I'm saying that you can avoid race conditions by a combination of 'subscribe to changes' + 'give me the full state'.

Like this? https://review.openstack.org/#/c/61057/ This patch is just to confirm the idea.

I'm afraid the proposed patch is no more reliable than the current approach of using file-based locking. I am working on an alternate patch that puts the rpc event loop in the dhcp agent so that better coordination between full synchronization and notification handling is possible. This approach has already been taken with the L3 agent, and work on the L2 agent is in progress.

You objected to agent polling earlier in the discussion, but now you're proposing polling. Did you change your mind?

Uh, no. I'm proposing better coordination between notification processing and full state synchronization, beyond simple exclusionary primitives (utils.synchronize etc). I apologize if my language was unclear.

m.
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Mon, Dec 09, 2013 at 08:07:12PM +0900, Isaku Yamahata isaku.yamah...@gmail.com wrote:

On Mon, Dec 09, 2013 at 08:43:59AM +1300, Robert Collins robe...@robertcollins.net wrote:

On 9 December 2013 01:43, Maru Newby ma...@redhat.com wrote:

If the AMQP service is set up not to lose notifications, notifications will pile up and stress the AMQP service. I would say single node failure isn't catastrophic.

So we should have AMQP set to discard notifications if there is no one

What are the semantics of AMQP discarding notifications when a consumer is no longer present? Can this be relied upon to ensure that potentially stale notifications do not remain in the queue when an agent restarts?

If the queue is set to autodelete, it will delete when the agent disconnects. There will be no queue until the agent reconnects. I don't know if we expose that functionality via oslo.messaging, but it's certainly something AMQP can do.

What happens if intermittent network instability occurs? When the connection between the agent and AMQP is unintentionally closed, will the agent die or reconnect to it?

Answering myself. If the connection is closed, it will reconnect automatically at the rpc layer. See neutron.openstack.common.rpc.impl_{kombu, qpid}.py. So notifications during reconnects can be lost if the AMQP service is set to discard notifications while there is no subscriber.

-- Isaku Yamahata isaku.yamah...@gmail.com
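For what it's worth, the autodelete behavior is easy to demonstrate directly with kombu (this assumes a broker running at localhost; the queue and exchange names are made up). The queue is created when the consumer attaches and deleted when the last consumer disconnects, so fanout messages published while the agent is down are simply discarded rather than piled up:

    import socket

    from kombu import Connection, Exchange, Queue

    exchange = Exchange('dhcp_agent_fanout', type='fanout')
    # auto_delete: the broker drops the queue once its last consumer
    # disconnects; until the agent reconnects there is nowhere for
    # published messages to accumulate.
    queue = Queue('dhcp_agent.host1', exchange=exchange, auto_delete=True)

    with Connection('amqp://guest:guest@localhost//') as conn:
        with conn.Consumer(queue, callbacks=[lambda body, msg: msg.ack()]):
            try:
                conn.drain_events(timeout=1)  # consume anything within 1s
            except socket.timeout:
                pass  # nothing arrived; queue vanishes on exit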
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On 10 December 2013 19:16, Isaku Yamahata isaku.yamah...@gmail.com wrote:

Answering myself. If the connection is closed, it will reconnect automatically at the rpc layer. See neutron.openstack.common.rpc.impl_{kombu, qpid}.py. So notifications during reconnects can be lost if the AMQP service is set to discard notifications while there is no subscriber.

Which is fine: the agent repulls the full set it's running on that machine, and life goes on.

-Rob

-- Robert Collins rbtcoll...@hp.com Distinguished Technologist HP Converged Cloud
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Tue, Dec 10, 2013 at 07:28:10PM +1300, Robert Collins robe...@robertcollins.net wrote:

On 10 December 2013 19:16, Isaku Yamahata isaku.yamah...@gmail.com wrote:

Answering myself. If the connection is closed, it will reconnect automatically at the rpc layer. See neutron.openstack.common.rpc.impl_{kombu, qpid}.py. So notifications during reconnects can be lost if the AMQP service is set to discard notifications while there is no subscriber.

Which is fine: the agent repulls the full set it's running on that machine, and life goes on.

On what event? Polling in the agent seems effectively disabled by self.needs_resync with the current code.

-- Isaku Yamahata isaku.yamah...@gmail.com
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Mon, Dec 09, 2013 at 08:43:59AM +1300, Robert Collins robe...@robertcollins.net wrote:

On 9 December 2013 01:43, Maru Newby ma...@redhat.com wrote:

If the AMQP service is set up not to lose notifications, notifications will pile up and stress the AMQP service. I would say single node failure isn't catastrophic.

So we should have AMQP set to discard notifications if there is no one

What are the semantics of AMQP discarding notifications when a consumer is no longer present? Can this be relied upon to ensure that potentially stale notifications do not remain in the queue when an agent restarts?

If the queue is set to autodelete, it will delete when the agent disconnects. There will be no queue until the agent reconnects. I don't know if we expose that functionality via oslo.messaging, but it's certainly something AMQP can do.

What happens if intermittent network instability occurs? When the connection between the agent and AMQP is unintentionally closed, will the agent die or reconnect to it?

-- Isaku Yamahata isaku.yamah...@gmail.com
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Dec 7, 2013, at 6:21 PM, Robert Collins robe...@robertcollins.net wrote:

On 7 December 2013 21:53, Isaku Yamahata isaku.yamah...@gmail.com wrote:

Case 3: Hardware failure. So an agent on the node is gone. Another agent will run on another node. If the AMQP service is set up not to lose notifications, notifications will pile up and stress the AMQP service. I would say single node failure isn't catastrophic.

So we should have AMQP set to discard notifications if there is no one

What are the semantics of AMQP discarding notifications when a consumer is no longer present? Can this be relied upon to ensure that potentially stale notifications do not remain in the queue when an agent restarts?

listening: when an agent connects after an outage, it first starts listening, then does a poll for updates it missed.

Are you suggesting that processing of notifications and full state synchronization are able to cooperate safely? Or hoping that it will be so in the future?

m.
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Fri, Dec 06, 2013 at 04:30:17PM +0900, Maru Newby ma...@redhat.com wrote:

2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent. I'm hoping that coming up with a solution to #1 will allow us the breathing room to work on #2 in this cycle.

Loss of notifications is somewhat inevitable, I think. (Or logging tasks to stable storage shared between server and agent.) And unconditionally sending notifications would cause problems.

Regarding sending notifications unconditionally, what specifically are you worried about? I can imagine 2 scenarios:

Case 1: Send notification to an agent that is incorrectly reported as down. Result: Agent receives notification and acts on it.

Case 2: Send notification to an agent that is actually down. Result: Agent comes up eventually (in a production environment this should be a given) and calls sync_state(). We definitely need to make sync_state more reliable, though (I discuss the specifics later in this message).

Notifications could of course be dropped if AMQP queues are not persistent and are lost, but I don't think there needs to be a code-based remedy for this. An operator is likely to deploy the AMQP service in HA to prevent the queues from being lost, and know to restart everything in the event of catastrophic failure.

Case 3: Hardware failure. So an agent on the node is gone. Another agent will run on another node. If the AMQP service is set up not to lose notifications, notifications will pile up and stress the AMQP service. I would say single node failure isn't catastrophic.

That's not to say we don't have work to do, though. An agent is responsible for communicating resource state changes to the service, but the service neither detects nor reacts when the state of a resource is scheduled to change and fails to do so in a reasonable timeframe. Thus, as in the bug that prompted this discussion, it is up to the user to detect the failure (a VM without connectivity). Ideally, Neutron should be tracking resource state changes with sufficient detail and reviewing them periodically to allow timely failure detection and remediation.

You are proposing polling by the Neutron server. So polling somewhere (in server, agent, or hybrid) is the way to go in the long term. Do you agree? Details to discuss would be how to do polling, how often (or how adaptively) polling should be done, and how the cost of polling can be mitigated by tricks...

However, such a change is unlikely to be a candidate for backport so it will have to wait.

Right, this isn't for backport. I'm talking about middle/long term direction.

You mentioned agent crash. Server crash should also be taken care of for reliability. Admins also sometimes want to restart the neutron server/agents for various reasons. An agent can crash after receiving notifications but before starting to process the actual tasks. The server can crash after committing changes to the DB but before sending notifications. In such cases, notifications will be lost. Polling to resync would be necessary somewhere.

Agreed, we need to consider the cases of both agent and service failure. In the case of service failure, thanks to recently merged patches, the dhcp agent will at least force a resync in the event of an error in communicating with the server. However, there is no guarantee that the agent will communicate with the server during the downtime. While polling is one possible solution, might it be preferable for the service to simply notify the agents when it starts? The dhcp agent can already receive an agent_updated RPC message that triggers a resync.

Agreed, notification on server startup is better.

- Notification loss isn't considered; self.resync is not always run. Some optimization is possible, for example:
  - detect loss by sequence number
  - polling can be postponed when notifications come without loss

Notification loss due to agent failure is already solved - sync_state() is called on startup. Notification loss due to server failure could be handled as described above. I think the larger problem is that calling sync_state() does not affect processing of notifications already in the queue, which could result in stale notifications being processed out of order, e.g.:

- service sends 'network down' notification
- service goes down after committing 'network up' to db, but before sending notification
- service comes back up
- agent knows (somehow) to resync, setting the network 'up'
- agent processes stale 'network down' notification

Though tracking sequence numbers is one possible fix, what do you think of instead ignoring all notifications generated before a timestamp set at the beginning of sync_state()?

I agree that improvement is necessary in the area and it is better for the agent to ignore stale notifications somehow. Regarding out-of-order notifications, making the agent able to accept
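A sketch of the timestamp guard proposed above (hypothetical; it assumes the server stamps each notification when it is generated, which would be a new field in the payload):

    import time


    class StaleNotificationGuard(object):
        def __init__(self):
            self.sync_started_at = 0.0

        def sync_state(self):
            # Record when the full sync began; anything generated before
            # this instant is already reflected in (or superseded by) the
            # state the sync pulls.
            self.sync_started_at = time.time()
            print('full sync; ignoring notifications older than %f'
                  % self.sync_started_at)

        def handle(self, payload):
            if payload.get('timestamp', 0.0) < self.sync_started_at:
                return  # stale: predates the last full sync
            print('applying %s' % payload)


    guard = StaleNotificationGuard()
    stale = {'event': 'network_down', 'timestamp': time.time()}
    guard.sync_state()
    guard.handle(stale)  # dropped: generated before sync_state() began
    guard.handle({'event': 'network_up', 'timestamp': time.time()})

In the 'network down'/'network up' scenario above, the stale 'network down' notification would be discarded because its timestamp predates the resync.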
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On 7 December 2013 21:53, Isaku Yamahata isaku.yamah...@gmail.com wrote:

Case 3: Hardware failure. So an agent on the node is gone. Another agent will run on another node. If the AMQP service is set up not to lose notifications, notifications will pile up and stress the AMQP service. I would say single node failure isn't catastrophic.

So we should have AMQP set to discard notifications if there is no one listening: when an agent connects after an outage, it first starts listening, then does a poll for updates it missed.

-Rob

-- Robert Collins rbtcoll...@hp.com Distinguished Technologist HP Converged Cloud
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
Pasting a few things from IRC here to fill out the context...

marun: carl_baldwin: but according to markmcclain and salv-orlando, it isn't possible to trivially use multiple workers for rpc because processing rpc requests out of sequence can be dangerous

carl_baldwin: marun: I think it is already possible to run more than one RPC message processor. If the neutron server process is run on multiple hosts in active/active I think you end up getting multiple independent RPC processing threads unless I'm missing something.

marun: carl_baldwin: is active/active an option?

I checked one of my environments where there are two API servers running. It is clear from the logs that both servers are consuming and processing RPC messages independently. I have not identified any problems resulting from doing this yet. I've been running this way for months. There could be something lurking in there preparing to cause a problem. I'm suddenly keenly interested in understanding the problems with processing RPC messages out of order. I tried reading the IRC backlog for information about this but it was not clear to me. Mark or Salvatore, can you comment?

Not only is RPC being handled by both physical servers in my environment, but each of the API server worker processes is consuming and processing RPC messages independently. So I am currently running a multi-process RPC scenario. I did not intend for this to happen this way; my environment has something different than the current upstream. I confirmed that with current upstream code and the ML2 plugin, only the parent process consumes RPC messages. It is probably because this environment is still using an older version of my multi-process API worker patch. Still looking into it.

Carl

On Thu, Dec 5, 2013 at 7:32 AM, Carl Baldwin c...@ecbaldwin.net wrote:

Creating separate processes for API workers does allow a bit more room for RPC message processing in the main process. If this isn't enough and the main process is still bound on CPU and/or green thread/sqlalchemy blocking then creating separate worker processes for RPC processing may be the next logical step to scale. I'll give it some thought today and possibly create a blueprint.

Carl

On Thu, Dec 5, 2013 at 7:13 AM, Maru Newby ma...@redhat.com wrote:

On Dec 5, 2013, at 6:43 AM, Carl Baldwin c...@ecbaldwin.net wrote:

I have offered up https://review.openstack.org/#/c/60082/ as a backport to Havana. Interest was expressed in the blueprint for doing this even before this thread. If there is consensus for this as the stop-gap then it is there for the merging. However, I do not want to discourage discussion of other stop-gap solutions like what Maru proposed in the original post.

Carl

Awesome. No worries, I'm still planning on submitting a patch to improve notification reliability. We seem to be cpu bound now in processing RPC messages. Do you think it would be reasonable to run multiple processes for RPC?

m.
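On the out-of-order question, here is a toy demonstration of why competing consumers can reorder work (plain multiprocessing, nothing Neutron-specific): two workers pull from one queue, and a slow handler in one worker lets the other finish a later message first:

    import multiprocessing
    import time


    def worker(q, name):
        while True:
            msg = q.get()
            if msg is None:
                break  # sentinel: shut down
            if msg == 'port-create':
                time.sleep(0.2)  # slow handler
            print('%s handled %s' % (name, msg))


    if __name__ == '__main__':
        q = multiprocessing.Queue()
        procs = [multiprocessing.Process(target=worker, args=(q, 'w%d' % i))
                 for i in range(2)]
        for p in procs:
            p.start()
        # Published in order, but 'port-delete' usually completes first
        # because the worker holding 'port-create' is busy.
        for msg in ('port-create', 'port-delete', None, None):
            q.put(msg)
        for p in procs:
            p.join()

If a handler assumes the create was fully processed before the delete arrives, this interleaving is exactly the danger markmcclain and salv-orlando were pointing at.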
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Wed, Dec 04, 2013 at 12:37:19PM +0900, Maru Newby ma...@redhat.com wrote:

On Dec 4, 2013, at 11:57 AM, Clint Byrum cl...@fewbar.com wrote:

Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800:

I've been investigating a bug that is preventing VM's from receiving IP addresses when a Neutron service is under high load: https://bugs.launchpad.net/neutron/+bug/1192381

High load causes the DHCP agent's status updates to be delayed, causing the Neutron service to assume that the agent is down. This results in the Neutron service not sending notifications of port addition to the DHCP agent. At present, the notifications are simply dropped. A simple fix is to send notifications regardless of agent status. Does anybody have any objections to this stop-gap approach? I'm not clear on the implications of sending notifications to agents that are down, but I'm hoping for a simple fix that can be backported to both havana and grizzly (yes, this bug has been with us that long).

Fixing this problem for real, though, will likely be more involved. The proposal to replace the current wsgi framework with Pecan may increase the Neutron service's scalability, but should we continue to use a 'fire and forget' approach to notification? Being able to track the success or failure of a given action outside of the logs would seem pretty important, and allow for more effective coordination with Nova than is currently possible.

Dropping requests without triggering a user-visible error is a pretty serious problem. You didn't mention if you have filed a bug about that. If not, please do or let us know here so we can investigate and file a bug.

There is a bug linked to in the original message that I am already working on. The fact that that bug title is 'dhcp agent doesn't configure ports' rather than 'dhcp notifications are silently dropped' is incidental.

It seems to me that they should be put into a queue to be retried. Sending the notifications blindly is almost as bad as dropping them, as you have no idea if the agent is alive or not.

This is more the kind of discussion I was looking for. In the current architecture, the Neutron service handles RPC and WSGI with a single process and is prone to being overloaded such that agent heartbeats can be delayed beyond the limit for the agent being declared 'down'. Even if we increased the agent timeout as Yongsheng suggests, there is no guarantee that we can accurately detect whether an agent is 'live' with the current architecture. Given that amqp can ensure eventual delivery - it is a queue - is sending a notification blind such a bad idea? In the best case the agent isn't really down and can process the notification. In the worst case, the agent really is down but will be brought up eventually by a deployment's monitoring solution and process the notification when it returns. What am I missing?

Do you mean overload of the neutron server, not the neutron agent? So even though the agent sends periodic 'live' reports, the reports pile up unprocessed by the server. When the server sends a notification, it wrongly considers the agent dead - not because the agent failed to send live reports due to its own overload. Is this understanding correct?

Please consider that while a good solution will track notification delivery and success, we may need 2 solutions:

1. A 'good-enough', minimally-invasive stop-gap that can be back-ported to grizzly and havana.

How about twisting DhcpAgent._periodic_resync_helper? If no notification has been received from the server since the last sleep, it calls self.sync_state() even if self.needs_resync = False. Thus the inconsistency between agent and server due to a lost notification will be fixed.

2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent. I'm hoping that coming up with a solution to #1 will allow us the breathing room to work on #2 in this cycle.

Loss of notifications is somewhat inevitable, I think. (Or logging tasks to stable storage shared between server and agent.) And unconditionally sending notifications would cause problems.

You mentioned agent crash. Server crash should also be taken care of for reliability. Admins also sometimes want to restart the neutron server/agents for various reasons. An agent can crash after receiving notifications but before starting to process the actual tasks. The server can crash after committing changes to the DB but before sending notifications. In such cases, notifications will be lost. Polling to resync would be necessary somewhere.

- Notification loss isn't considered; self.resync is not always run. Some optimization is possible, for example:
  - detect loss by sequence number
  - polling can be postponed when notifications come without loss
- Periodic resync spawns threads, but doesn't wait for their completion. So if a resync takes a long time, the next resync can start even while the previous one is still going on.
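A sketch of the _periodic_resync_helper twist being proposed (hypothetical attribute names, my own condensation): silence from the server since the last sleep is itself treated as a reason to resync, so a lost notification is repaired within one interval:

    import eventlet


    class DhcpAgentSketch(object):
        def __init__(self, resync_interval=5):
            self.resync_interval = resync_interval
            self.needs_resync = True
            self.notification_seen = False

        def sync_state(self):
            print('repulling full state from the server')

        def network_update(self, payload):
            # Every notification handler records that the server is
            # talking to us.
            self.notification_seen = True

        def _periodic_resync_helper(self):
            while True:
                eventlet.sleep(self.resync_interval)
                if self.needs_resync or not self.notification_seen:
                    # Resync on error OR on silence, even without an
                    # explicit needs_resync trigger.
                    self.needs_resync = False
                    self.sync_state()
                self.notification_seen = False  # reset for next interval

The trade-off, raised later in the thread, is that a quiet deployment resyncs every interval even when nothing was lost.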
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Dec 5, 2013, at 6:43 AM, Carl Baldwin c...@ecbaldwin.net wrote:

I have offered up https://review.openstack.org/#/c/60082/ as a backport to Havana. Interest was expressed in the blueprint for doing this even before this thread. If there is consensus for this as the stop-gap then it is there for the merging. However, I do not want to discourage discussion of other stop-gap solutions like what Maru proposed in the original post.

Carl

Awesome. No worries, I'm still planning on submitting a patch to improve notification reliability. We seem to be cpu bound now in processing RPC messages. Do you think it would be reasonable to run multiple processes for RPC?

m.
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
Creating separate processes for API workers does allow a bit more room for RPC message processing in the main process. If this isn't enough and the main process is still bound on CPU and/or green thread/sqlalchemy blocking then creating separate worker processes for RPC processing may be the next logical step to scale. I'll give it some thought today and possibly create a blueprint.

Carl

On Thu, Dec 5, 2013 at 7:13 AM, Maru Newby ma...@redhat.com wrote:

On Dec 5, 2013, at 6:43 AM, Carl Baldwin c...@ecbaldwin.net wrote:

I have offered up https://review.openstack.org/#/c/60082/ as a backport to Havana. Interest was expressed in the blueprint for doing this even before this thread. If there is consensus for this as the stop-gap then it is there for the merging. However, I do not want to discourage discussion of other stop-gap solutions like what Maru proposed in the original post.

Carl

Awesome. No worries, I'm still planning on submitting a patch to improve notification reliability. We seem to be cpu bound now in processing RPC messages. Do you think it would be reasonable to run multiple processes for RPC?

m.
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Dec 5, 2013, at 5:21 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote:

On Wed, Dec 04, 2013 at 12:37:19PM +0900, Maru Newby ma...@redhat.com wrote:

In the current architecture, the Neutron service handles RPC and WSGI with a single process and is prone to being overloaded such that agent heartbeats can be delayed beyond the limit for the agent being declared 'down'. Even if we increased the agent timeout as Yongsheng suggests, there is no guarantee that we can accurately detect whether an agent is 'live' with the current architecture. Given that amqp can ensure eventual delivery - it is a queue - is sending a notification blind such a bad idea? In the best case the agent isn't really down and can process the notification. In the worst case, the agent really is down but will be brought up eventually by a deployment's monitoring solution and process the notification when it returns. What am I missing?

Do you mean overload of the neutron server, not the neutron agent? So even though the agent sends periodic 'live' reports, the reports pile up unprocessed by the server. When the server sends a notification, it wrongly considers the agent dead - not because the agent failed to send live reports due to its own overload. Is this understanding correct?

Your interpretation is likely correct. The demands on the service are going to be much higher by virtue of having to field RPC requests from all the agents to interact with the database on their behalf.

Please consider that while a good solution will track notification delivery and success, we may need 2 solutions:

1. A 'good-enough', minimally-invasive stop-gap that can be back-ported to grizzly and havana.

How about twisting DhcpAgent._periodic_resync_helper? If no notification has been received from the server since the last sleep, it calls self.sync_state() even if self.needs_resync = False. Thus the inconsistency between agent and server due to a lost notification will be fixed.

Unless I'm missing something, wouldn't forcing more and potentially unnecessary resyncs increase the load on the Neutron service and negatively impact reliability?

2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent. I'm hoping that coming up with a solution to #1 will allow us the breathing room to work on #2 in this cycle.

Loss of notifications is somewhat inevitable, I think. (Or logging tasks to stable storage shared between server and agent.) And unconditionally sending notifications would cause problems.

Regarding sending notifications unconditionally, what specifically are you worried about? I can imagine 2 scenarios:

Case 1: Send notification to an agent that is incorrectly reported as down. Result: Agent receives notification and acts on it.

Case 2: Send notification to an agent that is actually down. Result: Agent comes up eventually (in a production environment this should be a given) and calls sync_state(). We definitely need to make sync_state more reliable, though (I discuss the specifics later in this message).

Notifications could of course be dropped if AMQP queues are not persistent and are lost, but I don't think there needs to be a code-based remedy for this. An operator is likely to deploy the AMQP service in HA to prevent the queues from being lost, and know to restart everything in the event of catastrophic failure.

That's not to say we don't have work to do, though. An agent is responsible for communicating resource state changes to the service, but the service neither detects nor reacts when the state of a resource is scheduled to change and fails to do so in a reasonable timeframe. Thus, as in the bug that prompted this discussion, it is up to the user to detect the failure (a VM without connectivity). Ideally, Neutron should be tracking resource state changes with sufficient detail and reviewing them periodically to allow timely failure detection and remediation. However, such a change is unlikely to be a candidate for backport so it will have to wait.

You mentioned agent crash. Server crash should also be taken care of for reliability. Admins also sometimes want to restart the neutron server/agents for various reasons. An agent can crash after receiving notifications but before starting to process the actual tasks. The server can crash after committing changes to the DB but before sending notifications. In such cases, notifications will be lost. Polling to resync would be necessary somewhere.

Agreed, we need to consider the cases of both agent and service failure. In the case of service failure, thanks to recently merged patches, the dhcp agent will at least force a resync in the event of an error in communicating with the server. However, there is no guarantee that the agent will communicate with the server during the downtime. While polling is one possible solution, might it be preferable for the service to simply notify the agents when it starts? The dhcp agent can already receive an agent_updated RPC message that triggers a resync.
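A sketch of the notify-on-startup idea, with a stand-in for the RPC fanout API (the real path would publish the existing agent_updated message over the dhcp_agent fanout topic; no new agent-side code is needed since the agent already resyncs on that message):

    class FanoutStub(object):
        # Stand-in for the plugin's agent notify API; a real
        # implementation would cast over AMQP to every listening agent.
        def cast(self, method, payload):
            print('fanout %s: %s' % (method, payload))


    def on_server_start(notifier):
        # Broadcast once at startup so any agent that missed
        # notifications during the outage triggers a full resync.
        notifier.cast('agent_updated', {'admin_state_up': True})


    on_server_start(FanoutStub())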
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Dec 4, 2013 5:41 AM, Maru Newby ma...@redhat.com wrote:

On Dec 4, 2013, at 11:57 AM, Clint Byrum cl...@fewbar.com wrote:

Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800:

I've been investigating a bug that is preventing VM's from receiving IP addresses when a Neutron service is under high load: https://bugs.launchpad.net/neutron/+bug/1192381

High load causes the DHCP agent's status updates to be delayed, causing the Neutron service to assume that the agent is down. This results in the Neutron service not sending notifications of port addition to the DHCP agent. At present, the notifications are simply dropped. A simple fix is to send notifications regardless of agent status. Does anybody have any objections to this stop-gap approach? I'm not clear on the implications of sending notifications to agents that are down, but I'm hoping for a simple fix that can be backported to both havana and grizzly (yes, this bug has been with us that long).

Fixing this problem for real, though, will likely be more involved. The proposal to replace the current wsgi framework with Pecan may increase the Neutron service's scalability, but should we continue to use a 'fire and forget' approach to notification? Being able to track the success or failure of a given action outside of the logs would seem pretty important, and allow for more effective coordination with Nova than is currently possible.

Dropping requests without triggering a user-visible error is a pretty serious problem. You didn't mention if you have filed a bug about that. If not, please do or let us know here so we can investigate and file a bug.

There is a bug linked to in the original message that I am already working on. The fact that that bug title is 'dhcp agent doesn't configure ports' rather than 'dhcp notifications are silently dropped' is incidental.

It seems to me that they should be put into a queue to be retried. Sending the notifications blindly is almost as bad as dropping them, as you have no idea if the agent is alive or not.

This is more the kind of discussion I was looking for. In the current architecture, the Neutron service handles RPC and WSGI with a single process and is prone to being overloaded such that agent heartbeats can be delayed beyond the limit for the agent being declared 'down'. Even if we increased the agent timeout as Yongsheng suggests, there is no guarantee that we can accurately detect whether an agent is 'live' with the current architecture. Given that amqp can ensure eventual delivery - it is a queue - is sending a notification blind such a bad idea? In the best case the agent isn't really down and can process the notification. In the worst case, the agent really is down but will be brought up eventually by a deployment's monitoring solution and process the notification when it returns. What am I missing?

Please consider that while a good solution will track notification delivery and success, we may need 2 solutions:

1. A 'good-enough', minimally-invasive stop-gap that can be back-ported to grizzly and havana.

2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent. I'm hoping that coming up with a solution to #1 will allow us the breathing room to work on #2 in this cycle.

I like the two part approach, but I would phrase it slightly differently: a short-term solution to help neutron meet the nova-network deprecation goals by icehouse-2, and a longer-term, more robust solution.

m.
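For clarity, the short-term stop-gap under discussion amounts to removing a liveness gate, roughly like this (names hypothetical, not the actual Neutron code path):

    def notify_agent(cast_fn, agent_alive, method, payload):
        # Old behavior, now removed: notifications to 'down' agents were
        # silently dropped.
        # if not agent_alive:
        #     return
        # New behavior: cast unconditionally. AMQP queues the message for
        # a merely slow agent; a genuinely down agent resyncs via
        # sync_state() when it comes back anyway.
        cast_fn(method, payload)


    def fake_cast(method, payload):
        print('cast %s: %s' % (method, payload))


    notify_agent(fake_cast, agent_alive=False,
                 method='port_create_end', payload={'port': {'id': 'p1'}})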
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Dec 4, 2013, at 8:55 AM, Carl Baldwin c...@ecbaldwin.net wrote:

Stephen, all,

I agree that there may be some opportunity to split things out a bit. However, I'm not sure what the best way will be. I recall that Mark mentioned breaking out the processes that handle API requests and RPC from each other at the summit. Anyway, it is something that has been discussed.

I actually wanted to point out that the neutron server now has the ability to run a configurable number of sub-processes to handle a heavier load. Introduced with this commit: https://review.openstack.org/#/c/37131/

Set api_workers to something > 1 and restart the server. The server can also be run on more than one physical host in combination with multiple child processes.

I completely misunderstood the import of the commit in question. Being able to run the wsgi server(s) out of process is a nice improvement, thank you for making it happen. Has there been any discussion around making the default for api_workers > 0 (at least 1) to ensure that the default configuration separates wsgi and rpc load? This also seems like a great candidate for backporting to havana and maybe even grizzly, although api_workers should probably be defaulted to 0 in those cases.

FYI, I re-ran the test that attempted to boot 75 micro VM's simultaneously with api_workers = 2, with mixed results. The increased wsgi throughput resulted in almost half of the boot requests failing with 500 errors due to QueuePool errors (https://bugs.launchpad.net/neutron/+bug/1160442) in Neutron. It also appears that maximizing the number of wsgi requests has the side-effect of increasing the RPC load on the main process, and this means that the problem of dhcp notifications being dropped is little improved. I intend to submit a fix that ensures that notifications are sent regardless of agent status, in any case.

m.

Carl

On Tue, Dec 3, 2013 at 9:47 AM, Stephen Gran stephen.g...@theguardian.com wrote:

On 03/12/13 16:08, Maru Newby wrote:

I've been investigating a bug that is preventing VM's from receiving IP addresses when a Neutron service is under high load: https://bugs.launchpad.net/neutron/+bug/1192381

High load causes the DHCP agent's status updates to be delayed, causing the Neutron service to assume that the agent is down. This results in the Neutron service not sending notifications of port addition to the DHCP agent. At present, the notifications are simply dropped. A simple fix is to send notifications regardless of agent status. Does anybody have any objections to this stop-gap approach? I'm not clear on the implications of sending notifications to agents that are down, but I'm hoping for a simple fix that can be backported to both havana and grizzly (yes, this bug has been with us that long).

Fixing this problem for real, though, will likely be more involved. The proposal to replace the current wsgi framework with Pecan may increase the Neutron service's scalability, but should we continue to use a 'fire and forget' approach to notification? Being able to track the success or failure of a given action outside of the logs would seem pretty important, and allow for more effective coordination with Nova than is currently possible.

It strikes me that we ask an awful lot of a single neutron-server instance - it has to take state updates from all the agents, it has to do scheduling, it has to respond to API requests, and it has to communicate about actual changes with the agents.
Maybe breaking some of these out the way nova has a scheduler and a conductor and so on might be a good model (I know there are things people are unhappy about with nova-scheduler, but imagine how much worse it would be if it was built into the API). Doing all of those tasks, and doing it largely single threaded, is just asking for overload.

Cheers,

-- Stephen Gran Senior Systems Integrator - theguardian.com
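For operators following the thread, the knob lives in neutron.conf; the value shown is illustrative (0 was the upstream default at the time):

    [DEFAULT]
    # 0 keeps the WSGI server in the main process alongside RPC handling;
    # any value > 0 forks that many dedicated API worker processes,
    # leaving the parent process more room to service agent RPC.
    api_workers = 2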
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Wed, Dec 4, 2013 at 8:30 PM, Maru Newby ma...@redhat.com wrote:

On Dec 4, 2013, at 8:55 AM, Carl Baldwin c...@ecbaldwin.net wrote:

Stephen, all,

I agree that there may be some opportunity to split things out a bit. However, I'm not sure what the best way will be. I recall that Mark mentioned breaking out the processes that handle API requests and RPC from each other at the summit. Anyway, it is something that has been discussed.

I actually wanted to point out that the neutron server now has the ability to run a configurable number of sub-processes to handle a heavier load. Introduced with this commit: https://review.openstack.org/#/c/37131/

Set api_workers to something > 1 and restart the server. The server can also be run on more than one physical host in combination with multiple child processes.

I completely misunderstood the import of the commit in question. Being able to run the wsgi server(s) out of process is a nice improvement, thank you for making it happen. Has there been any discussion around making the default for api_workers > 0 (at least 1) to ensure that the default configuration separates wsgi and rpc load? This also seems like a great candidate for backporting to havana and maybe even grizzly, although api_workers should probably be defaulted to 0 in those cases.

+1 for backporting the api_workers feature to havana as well as Grizzly :)

FYI, I re-ran the test that attempted to boot 75 micro VM's simultaneously with api_workers = 2, with mixed results. The increased wsgi throughput resulted in almost half of the boot requests failing with 500 errors due to QueuePool errors (https://bugs.launchpad.net/neutron/+bug/1160442) in Neutron. It also appears that maximizing the number of wsgi requests has the side-effect of increasing the RPC load on the main process, and this means that the problem of dhcp notifications being dropped is little improved. I intend to submit a fix that ensures that notifications are sent regardless of agent status, in any case.

m.

Carl

On Tue, Dec 3, 2013 at 9:47 AM, Stephen Gran stephen.g...@theguardian.com wrote:

On 03/12/13 16:08, Maru Newby wrote:

I've been investigating a bug that is preventing VM's from receiving IP addresses when a Neutron service is under high load: https://bugs.launchpad.net/neutron/+bug/1192381

High load causes the DHCP agent's status updates to be delayed, causing the Neutron service to assume that the agent is down. This results in the Neutron service not sending notifications of port addition to the DHCP agent. At present, the notifications are simply dropped. A simple fix is to send notifications regardless of agent status. Does anybody have any objections to this stop-gap approach? I'm not clear on the implications of sending notifications to agents that are down, but I'm hoping for a simple fix that can be backported to both havana and grizzly (yes, this bug has been with us that long).

Fixing this problem for real, though, will likely be more involved. The proposal to replace the current wsgi framework with Pecan may increase the Neutron service's scalability, but should we continue to use a 'fire and forget' approach to notification? Being able to track the success or failure of a given action outside of the logs would seem pretty important, and allow for more effective coordination with Nova than is currently possible.
It strikes me that we ask an awful lot of a single neutron-server instance - it has to take state updates from all the agents, it has to do scheduling, it has to respond to API requests, and it has to communicate about actual changes with the agents.

Maybe breaking some of these out the way nova has a scheduler and a conductor and so on might be a good model (I know there are things people are unhappy about with nova-scheduler, but imagine how much worse it would be if it was built into the API). Doing all of those tasks, and doing it largely single threaded, is just asking for overload.

Cheers,

-- Stephen Gran Senior Systems Integrator - theguardian.com
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
Sorry to have taken the discussion on a slight tangent. I meant only to offer the solution as a stop-gap. I agree that the fundamental problem should still be addressed.

On Tue, Dec 3, 2013 at 8:01 PM, Maru Newby ma...@redhat.com wrote:

On Dec 4, 2013, at 1:47 AM, Stephen Gran stephen.g...@theguardian.com wrote:

On 03/12/13 16:08, Maru Newby wrote:

I've been investigating a bug that is preventing VM's from receiving IP addresses when a Neutron service is under high load: https://bugs.launchpad.net/neutron/+bug/1192381

High load causes the DHCP agent's status updates to be delayed, causing the Neutron service to assume that the agent is down. This results in the Neutron service not sending notifications of port addition to the DHCP agent. At present, the notifications are simply dropped. A simple fix is to send notifications regardless of agent status. Does anybody have any objections to this stop-gap approach? I'm not clear on the implications of sending notifications to agents that are down, but I'm hoping for a simple fix that can be backported to both havana and grizzly (yes, this bug has been with us that long).

Fixing this problem for real, though, will likely be more involved. The proposal to replace the current wsgi framework with Pecan may increase the Neutron service's scalability, but should we continue to use a 'fire and forget' approach to notification? Being able to track the success or failure of a given action outside of the logs would seem pretty important, and allow for more effective coordination with Nova than is currently possible.

It strikes me that we ask an awful lot of a single neutron-server instance - it has to take state updates from all the agents, it has to do scheduling, it has to respond to API requests, and it has to communicate about actual changes with the agents.

Maybe breaking some of these out the way nova has a scheduler and a conductor and so on might be a good model (I know there are things people are unhappy about with nova-scheduler, but imagine how much worse it would be if it was built into the API). Doing all of those tasks, and doing it largely single threaded, is just asking for overload.

I'm sorry if it wasn't clear in my original message, but my primary concern lies with the reliability rather than the scalability of the Neutron service. Carl's addition of multiple workers is a good stop-gap to minimize the impact of blocking IO calls in the current architecture, and we already have consensus on the need to separate RPC and WSGI functions as part of the Pecan rewrite. I am worried, though, that we are not being sufficiently diligent in how we manage state transitions through notifications. Managing transitions and their associated error states is needlessly complicated by the current ad-hoc approach, and I'd appreciate input on the part of distributed systems experts as to how we could do better.

m.
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
I have offered up https://review.openstack.org/#/c/60082/ as a backport to Havana. Interest was expressed in the blueprint for doing this even before this thread. If there is consensus for this as the stop-gap then it is there for the merging. However, I do not want to discourage discussion of other stop-gap solutions like what Maru proposed in the original post.

Carl

On Wed, Dec 4, 2013 at 9:12 AM, Ashok Kumaran ashokkumara...@gmail.com wrote:

On Wed, Dec 4, 2013 at 8:30 PM, Maru Newby ma...@redhat.com wrote:

On Dec 4, 2013, at 8:55 AM, Carl Baldwin c...@ecbaldwin.net wrote:

Stephen, all,

I agree that there may be some opportunity to split things out a bit. However, I'm not sure what the best way will be. I recall that Mark mentioned breaking out the processes that handle API requests and RPC from each other at the summit. Anyway, it is something that has been discussed.

I actually wanted to point out that the neutron server now has the ability to run a configurable number of sub-processes to handle a heavier load. Introduced with this commit: https://review.openstack.org/#/c/37131/

Set api_workers to something > 1 and restart the server. The server can also be run on more than one physical host in combination with multiple child processes.

I completely misunderstood the import of the commit in question. Being able to run the wsgi server(s) out of process is a nice improvement, thank you for making it happen. Has there been any discussion around making the default for api_workers > 0 (at least 1) to ensure that the default configuration separates wsgi and rpc load? This also seems like a great candidate for backporting to havana and maybe even grizzly, although api_workers should probably be defaulted to 0 in those cases.

+1 for backporting the api_workers feature to havana as well as Grizzly :)

FYI, I re-ran the test that attempted to boot 75 micro VM's simultaneously with api_workers = 2, with mixed results. The increased wsgi throughput resulted in almost half of the boot requests failing with 500 errors due to QueuePool errors (https://bugs.launchpad.net/neutron/+bug/1160442) in Neutron. It also appears that maximizing the number of wsgi requests has the side-effect of increasing the RPC load on the main process, and this means that the problem of dhcp notifications being dropped is little improved. I intend to submit a fix that ensures that notifications are sent regardless of agent status, in any case.

m.

Carl

On Tue, Dec 3, 2013 at 9:47 AM, Stephen Gran stephen.g...@theguardian.com wrote:

On 03/12/13 16:08, Maru Newby wrote:

I've been investigating a bug that is preventing VM's from receiving IP addresses when a Neutron service is under high load: https://bugs.launchpad.net/neutron/+bug/1192381

High load causes the DHCP agent's status updates to be delayed, causing the Neutron service to assume that the agent is down. This results in the Neutron service not sending notifications of port addition to the DHCP agent. At present, the notifications are simply dropped. A simple fix is to send notifications regardless of agent status. Does anybody have any objections to this stop-gap approach? I'm not clear on the implications of sending notifications to agents that are down, but I'm hoping for a simple fix that can be backported to both havana and grizzly (yes, this bug has been with us that long).

Fixing this problem for real, though, will likely be more involved.
The proposal to replace the current wsgi framework with Pecan may increase the Neutron service's scalability, but should we continue to use a 'fire and forget' approach to notification? Being able to track the success or failure of a given action outside of the logs would seem pretty important, and allow for more effective coordination with Nova than is currently possible.

It strikes me that we ask an awful lot of a single neutron-server instance - it has to take state updates from all the agents, it has to do scheduling, it has to respond to API requests, and it has to communicate about actual changes with the agents.

Maybe breaking some of these out the way nova has a scheduler and a conductor and so on might be a good model (I know there are things people are unhappy about with nova-scheduler, but imagine how much worse it would be if it was built into the API). Doing all of those tasks, and doing it largely single threaded, is just asking for overload.

Cheers,

-- Stephen Gran Senior Systems Integrator - theguardian.com
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
There is also another bug you can link/duplicate with #1192381: https://bugs.launchpad.net/neutron/+bug/1185916. I proposed a fix, but it was not the right way to do it, so I abandoned it.

Édouard.

On Wed, Dec 4, 2013 at 10:43 PM, Carl Baldwin c...@ecbaldwin.net wrote: [...]
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On 03/12/13 16:08, Maru Newby wrote:

> I've been investigating a bug that is preventing VMs from receiving IP addresses when a Neutron service is under high load: https://bugs.launchpad.net/neutron/+bug/1192381
>
> High load causes the DHCP agent's status updates to be delayed, causing the Neutron service to assume that the agent is down. This results in the Neutron service not sending notifications of port addition to the DHCP agent. At present, the notifications are simply dropped. A simple fix is to send notifications regardless of agent status. Does anybody have any objections to this stop-gap approach? I'm not clear on the implications of sending notifications to agents that are down, but I'm hoping for a simple fix that can be backported to both havana and grizzly (yes, this bug has been with us that long).
>
> Fixing this problem for real, though, will likely be more involved. The proposal to replace the current wsgi framework with Pecan may increase the Neutron service's scalability, but should we continue to use a 'fire and forget' approach to notification? Being able to track the success or failure of a given action outside of the logs would seem pretty important, and would allow for more effective coordination with Nova than is currently possible.

It strikes me that we ask an awful lot of a single neutron-server instance - it has to take state updates from all the agents, it has to do scheduling, it has to respond to API requests, and it has to communicate about actual changes with the agents. Maybe breaking some of these out the way nova has a scheduler and a conductor and so on might be a good model (I know there are things people are unhappy about with nova-scheduler, but imagine how much worse it would be if it was built into the API). Doing all of those tasks, and doing it largely single threaded, is just asking for overload.

Cheers,
--
Stephen Gran
Senior Systems Integrator - theguardian.com
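To make the proposed stop-gap concrete, here is a rough, self-contained sketch of the behaviour change - notify regardless of perceived liveness instead of returning early. The names (AgentRegistry, notify_port_created, cast) are illustrative stand-ins, not Neutron's actual notifier API:

    import time

    AGENT_DOWN_TIME = 9  # seconds before an agent is presumed down

    class AgentRegistry(object):
        def __init__(self):
            self.last_heartbeat = {}

        def heartbeat(self, agent_id):
            self.last_heartbeat[agent_id] = time.time()

        def is_alive(self, agent_id):
            seen = self.last_heartbeat.get(agent_id)
            return seen is not None and time.time() - seen < AGENT_DOWN_TIME

    def notify_port_created(registry, agent_id, port, cast):
        if not registry.is_alive(agent_id):
            # Old behaviour: return here, silently dropping the event.
            # Stop-gap: warn and send anyway. If the agent is merely slow,
            # it processes the message; if it is truly down, the message
            # waits on the broker until the agent returns.
            print('agent %s appears down; notifying anyway' % agent_id)
        cast(agent_id, {'event': 'port_create_end', 'port': port})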
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
Stephen, all,

I agree that there may be some opportunity to split things out a bit. However, I'm not sure what the best way will be. I recall that Mark mentioned breaking out the processes that handle API requests and RPC from each other at the summit. Anyway, it is something that has been discussed.

I actually wanted to point out that the neutron server now has the ability to run a configurable number of sub-processes to handle a heavier load. Introduced with this commit: https://review.openstack.org/#/c/37131/

Set api_workers to something > 1 and restart the server. The server can also be run on more than one physical host in combination with multiple child processes.

Carl

On Tue, Dec 3, 2013 at 9:47 AM, Stephen Gran stephen.g...@theguardian.com wrote: [...]
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
Another way is to have a large agent_down_time; by default it is 9 secs.

On Wed, Dec 4, 2013 at 7:55 AM, Carl Baldwin c...@ecbaldwin.net wrote: [...]
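As a hedged illustration of that tuning (the value shown is arbitrary; the right number is deployment-specific, and raising it trades slower detection of genuinely dead agents for fewer false 'down' verdicts under load):

    [DEFAULT]
    # Seconds without a status report before the server declares an
    # agent down and stops sending it notifications.
    agent_down_time = 60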
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800: [...]

Dropping requests without triggering a user-visible error is a pretty serious problem. You didn't mention if you have filed a bug about that. If not, please do or let us know here so we can investigate and file a bug.

It seems to me that they should be put into a queue to be retried. Sending the notifications blindly is almost as bad as dropping them, as you have no idea if the agent is alive or not.
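A minimal sketch of the retry-queue idea, assuming nothing about Neutron's internals - send and is_alive are caller-supplied stand-ins, and a real implementation would need to bound the queue and preserve per-port ordering:

    import collections

    class NotificationRetryQueue(object):
        def __init__(self, send, is_alive):
            self.pending = collections.deque()
            self.send = send          # callable(agent_id, payload)
            self.is_alive = is_alive  # callable(agent_id) -> bool

        def notify(self, agent_id, payload):
            # Deliver immediately if the agent looks live; otherwise park
            # the notification instead of dropping it on the floor.
            if self.is_alive(agent_id):
                self.send(agent_id, payload)
            else:
                self.pending.append((agent_id, payload))

        def retry(self):
            # Call periodically: re-attempt each parked notification once,
            # re-parking any whose agent still looks down.
            for _ in range(len(self.pending)):
                agent_id, payload = self.pending.popleft()
                self.notify(agent_id, payload)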
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Dec 4, 2013, at 1:47 AM, Stephen Gran stephen.g...@theguardian.com wrote: [...]

I'm sorry if it wasn't clear in my original message, but my primary concern lies with the reliability rather than the scalability of the Neutron service. Carl's addition of multiple workers is a good stop-gap to minimize the impact of blocking IO calls in the current architecture, and we already have consensus on the need to separate RPC and WSGI functions as part of the Pecan rewrite. I am worried, though, that we are not being sufficiently diligent in how we manage state transitions through notifications. Managing transitions and their associated error states is needlessly complicated by the current ad-hoc approach, and I'd appreciate input on the part of distributed systems experts as to how we could do better.

m.
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Dec 4, 2013, at 11:02 AM, Yongsheng Gong gong...@unitedstack.com wrote:

> another way is to have a large agent_down_time, by default it is 9 secs.

I don't believe that increasing the timeout by itself is a good solution. Relying on the agent state to know whether to send a notification has simply proven unreliable with the current architecture of a poorly-performing single-process server handling both RPC and WSGI.

m.

On Wed, Dec 4, 2013 at 7:55 AM, Carl Baldwin c...@ecbaldwin.net wrote: [...]
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
On Dec 4, 2013, at 11:57 AM, Clint Byrum cl...@fewbar.com wrote:

> Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800: [...]
>
> Dropping requests without triggering a user-visible error is a pretty serious problem. You didn't mention if you have filed a bug about that. If not, please do or let us know here so we can investigate and file a bug.

There is a bug linked to in the original message that I am already working on. The fact that that bug's title is 'dhcp agent doesn't configure ports' rather than 'dhcp notifications are silently dropped' is incidental.

> It seems to me that they should be put into a queue to be retried. Sending the notifications blindly is almost as bad as dropping them, as you have no idea if the agent is alive or not.

This is more the kind of discussion I was looking for. In the current architecture, the Neutron service handles RPC and WSGI with a single process and is prone to being overloaded such that agent heartbeats can be delayed beyond the limit for the agent being declared 'down'. Even if we increased the agent timeout as Yongsheng suggests, there is no guarantee that we can accurately detect whether an agent is 'live' with the current architecture. Given that amqp can ensure eventual delivery - it is a queue - is sending a notification blind such a bad idea? In the best case the agent isn't really down and can process the notification. In the worst case, the agent really is down but will be brought up eventually by a deployment's monitoring solution and will process the notification when it returns. What am I missing?

Please consider that while a good solution will track notification delivery and success, we may need 2 solutions:

1. A 'good-enough', minimally-invasive stop-gap that can be back-ported to grizzly and havana.

2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent.

I'm hoping that coming up with a solution to #1 will allow us the breathing room to work on #2 in this cycle.

m.
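Whether 'sending blind' is safe hinges on how the rpc layer declares its queues: with a durable queue, a message published while the consumer is away sits on the broker until the consumer reconnects, whereas an auto-delete queue discards it. A hedged kombu sketch of the durable case - the exchange, queue, and routing-key names here are invented for illustration, not Neutron's actual topology:

    from kombu import Connection, Exchange, Queue

    exchange = Exchange('neutron', type='topic', durable=True)
    queue = Queue('dhcp_agent.host1', exchange,
                  routing_key='dhcp_agent.host1', durable=True)

    with Connection('amqp://guest:guest@localhost//') as conn:
        producer = conn.Producer()
        # declare=[queue] ensures the queue exists even if the agent has
        # never connected; the broker then holds the message for it.
        producer.publish({'event': 'port_create_end'},
                         exchange=exchange,
                         routing_key='dhcp_agent.host1',
                         declare=[queue])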
Re: [openstack-dev] [Neutron] DHCP Agent Reliability
Excerpts from Maru Newby's message of 2013-12-03 19:37:19 -0800: [...]

> There is a bug linked to in the original message that I am already working on. The fact that that bug's title is 'dhcp agent doesn't configure ports' rather than 'dhcp notifications are silently dropped' is incidental.

Good point, I suppose that one bug is enough.

> Given that amqp can ensure eventual delivery - it is a queue - is sending a notification blind such a bad idea? In the best case the agent isn't really down and can process the notification. In the worst case, the agent really is down but will be brought up eventually by a deployment's monitoring solution and will process the notification when it returns. What am I missing?

I have not looked closely into what expectations are built in to the notification system, so I may have been off base. My understanding was they were not necessarily guaranteed to be delivered, but if they are, then this is fine.

> 1. A 'good-enough', minimally-invasive stop-gap that can be back-ported to grizzly and havana.

I don't know why we'd backport to grizzly. But yes, if we can get a notable jump in reliability with a clear patch, I'm all for it.

> 2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent. I'm hoping that coming up with a solution to #1 will allow us the breathing room to work on #2 in this cycle.

Understood, I like the short term plan, and think that long term, having more CPU available to process more messages is a good thing - most likely in the form of more worker processes.