Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-16 Thread Maru Newby

On Dec 13, 2013, at 8:06 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote:

 On Fri, Dec 06, 2013 at 04:30:17PM +0900,
 Maru Newby ma...@redhat.com wrote:
 
 
 On Dec 5, 2013, at 5:21 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote:
 
 On Wed, Dec 04, 2013 at 12:37:19PM +0900,
 Maru Newby ma...@redhat.com wrote:
 
 In the current architecture, the Neutron service handles RPC and WSGI with 
 a single process and is prone to being overloaded such that agent 
 heartbeats can be delayed beyond the limit for the agent being declared 
 'down'.  Even if we increased the agent timeout as Yongsheg suggests, 
 there is no guarantee that we can accurately detect whether an agent is 
 'live' with the current architecture.  Given that amqp can ensure eventual 
 delivery - it is a queue - is sending a notification blind such a bad 
 idea?  In the best case the agent isn't really down and can process the 
 notification.  In the worst case, the agent really is down but will be 
 brought up eventually by a deployment's monitoring solution and process 
 the notification when it returns.  What am I missing? 
 
 
  Do you mean overload of the neutron server, not the neutron agent?
  So even though the agent sends periodic 'live' reports, the reports pile up
  unprocessed by the server.
  When the server sends a notification, it wrongly considers the agent dead,
  not because the agent failed to send live reports due to its own overload.
  Is this understanding correct?
 
 Your interpretation is likely correct.  The demands on the service are going 
 to be much higher by virtue of having to field RPC requests from all the 
 agents to interact with the database on their behalf.
 
 Is this strongly indicating thread starvation, i.e. too much unfair
 thread scheduling?
 Given that eventlet uses cooperative threading, should we add sleep(0) to the
 hogging thread?

I'm afraid that's a question for a profiler: 
https://github.com/colinhowe/eventlet_profiler
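
For anyone who wants to experiment before profiling, here is a minimal
sketch of the cooperative-yield idea (illustrative only, not Neutron code):
under eventlet, a CPU-bound green thread starves its peers, including the
one processing agent heartbeats, until it explicitly yields.

    import eventlet


    def handle(msg):
        # Stand-in for real per-message RPC work.
        return sum(ord(c) for c in msg)


    def process_rpc_backlog(messages):
        for i, msg in enumerate(messages):
            handle(msg)
            if i % 100 == 0:
                eventlet.sleep(0)  # yield so heartbeat processing can run


    pool = eventlet.GreenPool()
    pool.spawn(process_rpc_backlog, ['port-update'] * 10000)
    pool.waitall()

Whether Neutron's hot loops actually lack such yields is exactly what the
profiler should tell us.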


m.


Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-10 Thread Maru Newby

On Dec 10, 2013, at 4:47 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote:

 On Tue, Dec 10, 2013 at 07:28:10PM +1300,
 Robert Collins robe...@robertcollins.net wrote:
 
 On 10 December 2013 19:16, Isaku Yamahata isaku.yamah...@gmail.com wrote:
 
 Answering myself: if the connection is closed, it reconnects automatically
 at the rpc layer. See neutron.openstack.common.rpc.impl_{kombu, qpid}.py.
 So notifications during a reconnect can be lost if the AMQP service is set
 to discard notifications while there is no subscriber.
 
 Which is fine: the agent repulls the full set it's running on that
 machine, and life goes on.
 
 On what event?
 Polling in the agent seems effectively disabled by self.needs_resync in
 the current code.

If the agent is not connected, it is either down (needs_resync will be set to 
True on launch) or experiencing a loss of connectivity to the amqp service 
(needs_resync will have been set to True on error).  The loss of notifications 
is not a problem in either case.


m.


Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-10 Thread Isaku Yamahata
On Mon, Dec 09, 2013 at 08:43:59AM +1300,
Robert Collins robe...@robertcollins.net wrote:

  listening: when an agent connects after an outage, it first starts
  listening, then does a poll for updates it missed.
 
  Are you suggesting that processing of notifications and full state 
  synchronization are able to cooperate safely?  Or hoping that it will be so 
  in the future?
 
 I'm saying that you can avoid race conditions by a combination of
 'subscribe to changes' + 'give me the full state'.

Like this?
https://review.openstack.org/#/c/61057/
This patch is just to confirm the idea.
-- 
Isaku Yamahata isaku.yamah...@gmail.com



Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-10 Thread Maru Newby

On Dec 5, 2013, at 4:43 PM, Édouard Thuleau thul...@gmail.com wrote:

 There is also another bug you can link/mark as a duplicate of #1192381:
 https://bugs.launchpad.net/neutron/+bug/1185916.
 I proposed a fix but it wasn't the right approach, so I abandoned it.
 
 Édouard.

Thank you for pointing this out!


m.


Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-10 Thread Maru Newby

On Dec 10, 2013, at 6:36 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote:

 On Mon, Dec 09, 2013 at 08:43:59AM +1300,
 Robert Collins robe...@robertcollins.net wrote:
 
 listening: when an agent connects after an outage, it first starts
 listening, then does a poll for updates it missed.
 
 Are you suggesting that processing of notifications and full state 
 synchronization are able to cooperate safely?  Or hoping that it will be so 
 in the future?
 
 I'm saying that you can avoid race conditions by a combination of
 'subscribe to changes' + 'give me the full state'.
 
 Like this?
 https://review.openstack.org/#/c/61057/
 This patch is just to confirm the idea.

I'm afraid the proposed patch is no more reliable than the current approach of 
using file-based locking.  I am working on an alternate patch that puts the 
rpc event loop in the dhcp agent so that better coordination between full 
synchronization and notification handling is possible.  This approach has 
already been taken with the L3 agent, and work on the L2 agent is in progress.
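
To sketch what I mean (names invented, not the actual patch): a single
event loop owns both notification handling and full resync, so the two are
serialized by construction rather than by a lock.

    import queue


    class DhcpAgentLoop:
        RESYNC = object()  # sentinel queued when a full sync is needed

        def __init__(self):
            self._events = queue.Queue()

        def notify(self, payload):
            # Called from the RPC layer; never touches agent state directly.
            self._events.put(payload)

        def schedule_resync(self):
            self._events.put(self.RESYNC)

        def run_once(self):
            event = self._events.get()
            if event is self.RESYNC:
                self.sync_state()  # runs with no notification in flight
            else:
                self.handle_notification(event)

        def sync_state(self):
            print('full state sync')

        def handle_notification(self, payload):
            print('notification:', payload)


    loop = DhcpAgentLoop()
    loop.notify({'port': 'p1'})
    loop.schedule_resync()
    loop.run_once()  # handles the notification
    loop.run_once()  # performs the sync, serialized with the above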


m.

 -- 
 Isaku Yamahata isaku.yamah...@gmail.com
 


Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-10 Thread Isaku Yamahata
On Wed, Dec 11, 2013 at 01:23:36AM +0900,
Maru Newby ma...@redhat.com wrote:

 
 On Dec 10, 2013, at 6:36 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote:
 
  On Mon, Dec 09, 2013 at 08:43:59AM +1300,
  Robert Collins robe...@robertcollins.net wrote:
  
  listening: when an agent connects after an outage, it first starts
  listening, then does a poll for updates it missed.
  
  Are you suggesting that processing of notifications and full state 
  synchronization are able to cooperate safely?  Or hoping that it will be 
  so in the future?
  
  I'm saying that you can avoid race conditions by a combination of
  'subscribe to changes' + 'give me the full state'.
  
  Like this?
  https://review.openstack.org/#/c/61057/
  This patch is just to confirm the idea.
 
 I'm afraid the proposed patch is no more reliable than the current approach 
 of using file-based locking.  I am working on an alternate patch that puts 
 the rpc event loop in the dhcp agent so that better coordination between 
 full synchronization and notification handling is possible.  This approach 
 has already been taken with the L3 agent, and work on the L2 agent is in 
 progress.

You objected to agent polling earlier in the discussion,
but now you're proposing polling. Did you change your mind?
-- 
Isaku Yamahata isaku.yamah...@gmail.com



Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-10 Thread Maru Newby

On Dec 11, 2013, at 8:39 AM, Isaku Yamahata isaku.yamah...@gmail.com wrote:

 On Wed, Dec 11, 2013 at 01:23:36AM +0900,
 Maru Newby ma...@redhat.com wrote:
 
 
 On Dec 10, 2013, at 6:36 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote:
 
 On Mon, Dec 09, 2013 at 08:43:59AM +1300,
 Robert Collins robe...@robertcollins.net wrote:
 
 listening: when an agent connects after an outage, it first starts
 listening, then does a poll for updates it missed.
 
 Are you suggesting that processing of notifications and full state 
 synchronization are able to cooperate safely?  Or hoping that it will be 
 so in the future?
 
 I'm saying that you can avoid race conditions by a combination of
 'subscribe to changes' + 'give me the full state'.
 
 Like this?
 https://review.openstack.org/#/c/61057/
 This patch is just to confirm the idea.
 
 I'm afraid the proposed patch is no more reliable than the current approach 
 of using file-based locking.  I am working on an alternate patch that puts 
 the rpc event loop in the dhcp agent so that better coordination between 
 full synchronization and notification handling is possible.  This approach 
 has already been taken with the L3 agent, and work on the L2 agent is in 
 progress.
 
 You objected to agent polling earlier in the discussion,
 but now you're proposing polling. Did you change your mind?

Uh, no.  I'm proposing better coordination between notification processing and 
full state synchronization beyond simple exclusionary primitives  
(utils.synchronize etc).  I apologize if my language was unclear.  


m.


Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-09 Thread Isaku Yamahata
On Mon, Dec 09, 2013 at 08:07:12PM +0900,
Isaku Yamahata isaku.yamah...@gmail.com wrote:

 On Mon, Dec 09, 2013 at 08:43:59AM +1300,
 Robert Collins robe...@robertcollins.net wrote:
 
  On 9 December 2013 01:43, Maru Newby ma...@redhat.com wrote:
  
  
    If the AMQP service is set up not to lose notifications, they will pile
    up and stress the AMQP service. I would say a single node failure isn't
    catastrophic.
  
    So we should have AMQP set to discard notifications if there is no one
  
   What are the semantics of AMQP discarding notifications when a consumer 
   is no longer present?  Can this be relied upon to ensure that potentially 
   stale notifications do not remain in the queue when an agent restarts?
  
  If the queue is set to autodelete, it will delete when the agent
  disconnects. There will be no queue until the agent reconnects. I
  don't know if we expose that functionality via oslo.messaging, but
  it's certainly something AMQP can do.
 
  What happens if intermittent network instability occurs?
  When the connection between the agent and AMQP is unintentionally closed,
  will the agent die or reconnect?

Answering myself: if the connection is closed, it reconnects automatically
at the rpc layer. See neutron.openstack.common.rpc.impl_{kombu, qpid}.py.
So notifications during a reconnect can be lost if the AMQP service is set
to discard notifications while there is no subscriber.
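
The broker behaviour in question is easy to see with kombu directly
(an illustrative snippet, not Neutron's rpc layer): an auto_delete queue
disappears when its last consumer disconnects, so anything published while
the agent is away is discarded by the broker.

    from kombu import Connection, Exchange, Queue

    exchange = Exchange('neutron', type='topic')
    queue = Queue('dhcp_agent.host-1', exchange,
                  routing_key='dhcp_agent.host-1',
                  auto_delete=True)  # broker drops the queue on disconnect

    with Connection('amqp://guest:guest@localhost//') as conn:
        # Declaring binds the queue; once this consumer disconnects, the
        # queue (and any messages still in it) is deleted by the broker.
        queue(conn.channel()).declare()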
-- 
Isaku Yamahata isaku.yamah...@gmail.com



Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-09 Thread Robert Collins
On 10 December 2013 19:16, Isaku Yamahata isaku.yamah...@gmail.com wrote:

 Answering myself: if the connection is closed, it reconnects automatically
 at the rpc layer. See neutron.openstack.common.rpc.impl_{kombu, qpid}.py.
 So notifications during a reconnect can be lost if the AMQP service is set
 to discard notifications while there is no subscriber.

Which is fine: the agent repulls the full set it's running on that
machine, and life goes on.

-Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud



Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-09 Thread Isaku Yamahata
On Tue, Dec 10, 2013 at 07:28:10PM +1300,
Robert Collins robe...@robertcollins.net wrote:

 On 10 December 2013 19:16, Isaku Yamahata isaku.yamah...@gmail.com wrote:
 
  Answering myself: if the connection is closed, it reconnects automatically
  at the rpc layer. See neutron.openstack.common.rpc.impl_{kombu, qpid}.py.
  So notifications during a reconnect can be lost if the AMQP service is set
  to discard notifications while there is no subscriber.
 
 Which is fine: the agent repulls the full set it's running on that
 machine, and life goes on.

On what event?
Polling in the agent seems effectively disabled by self.needs_resync in
the current code.
-- 
Isaku Yamahata isaku.yamah...@gmail.com



Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-09 Thread Isaku Yamahata
On Mon, Dec 09, 2013 at 08:43:59AM +1300,
Robert Collins robe...@robertcollins.net wrote:

 On 9 December 2013 01:43, Maru Newby ma...@redhat.com wrote:
 
 
  If the AMQP service is set up not to lose notifications, they will pile
  up and stress the AMQP service. I would say a single node failure isn't
  catastrophic.
 
  So we should have AMQP set to discard notifications if there is no one
 
  What are the semantics of AMQP discarding notifications when a consumer is 
  no longer present?  Can this be relied upon to ensure that potentially 
  stale notifications do not remain in the queue when an agent restarts?
 
 If the queue is set to autodelete, it will delete when the agent
 disconnects. There will be no queue until the agent reconnects. I
 don't know if we expose that functionality via oslo.messaging, but
 it's certainly something AMQP can do.

What happens if intermittent network instability occurs?
When the connection between the agent and AMQP is unintentionally closed,
will the agent die or reconnect?
-- 
Isaku Yamahata isaku.yamah...@gmail.com



Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-08 Thread Maru Newby

On Dec 7, 2013, at 6:21 PM, Robert Collins robe...@robertcollins.net wrote:

 On 7 December 2013 21:53, Isaku Yamahata isaku.yamah...@gmail.com wrote:
 
 Case 3: Hardware failure. So an agent on the node is gone.
Another agent will run on another node.
 
 If the AMQP service is set up not to lose notifications, they will pile up
 and stress the AMQP service. I would say a single node failure isn't
 catastrophic.
 
 So we should have AMQP set to discard notifications if there is no one

What are the semantics of AMQP discarding notifications when a consumer is no 
longer present?  Can this be relied upon to ensure that potentially stale 
notifications do not remain in the queue when an agent restarts?


 listening: when an agent connects after an outage, it first starts
 listening, then does a poll for updates it missed.

Are you suggesting that processing of notifications and full state 
synchronization are able to cooperate safely?  Or hoping that it will be so in 
the future?


m.




Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-07 Thread Isaku Yamahata
On Fri, Dec 06, 2013 at 04:30:17PM +0900,
Maru Newby ma...@redhat.com wrote:

  2. A 'best-effort' refactor that maximizes the reliability of the DHCP 
  agent.
  
  I'm hoping that coming up with a solution to #1 will allow us the 
  breathing room to work on #2 in this cycle.
  
  Loss of notifications is somewhat inevitable, I think
  (short of logging tasks to stable storage shared between server and agent),
  and unconditionally sending notifications could cause problems.
 
 Regarding sending notifications unconditionally, what specifically are you 
 worried about?  I can imagine 2 scenarios:
 
 Case 1: Send notification to an agent that is incorrectly reported as down. 
 Result:  Agent receives notification and acts on it.
 
 Case 2: Send notification to an agent that is actually down.
 Result: Agent comes up eventually (in a production environment this should be 
 a given) and calls sync_state().  We definitely need to make sync_state more 
 reliable, though (I discuss the specifics later in this message).
 
 Notifications could of course be dropped if AMQP queues are not persistent 
 and are lost, but I don't think there needs to be a code-based remedy for 
 this.  An operator is likely to deploy the AMQP service in HA to prevent the 
 queues from being lost, and know to restart everything in the event of 
 catastrophic failure.

Case 3: Hardware failure, so an agent on the node is gone.
Another agent will run on another node.

If the AMQP service is set up not to lose notifications, they will pile up
and stress the AMQP service. I would say a single node failure isn't
catastrophic.


 That's not to say we don't have work to do, though.  An agent is responsible 
 for communicating resource state changes to the service, but the service 
 neither detects nor reacts when the state of a resource is scheduled to 
 change and fails to do so in a reasonable timeframe.  Thus, as in the bug 
 that prompted this discussion, it is up to the user to detect the failure (a 
 VM without connectivity).  Ideally, Neutron should be tracking resource state 
 changes with sufficient detail and reviewing them periodically to allow 
 timely failure detection and remediation.

You are proposing polling by the Neutron server.
So polling somewhere (in the server, the agent, or a hybrid) is the way to go
in the long term. Do you agree?
Details to discuss would be how to do the polling, how often (or how
adaptively) it should be done, and how the cost of polling can be mitigated
by tricks...
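
One possible shape for the adaptive variant (all names assumed): poll only
when the notification stream goes quiet, and back the polling interval off
while notifications keep arriving.

    import time


    class AdaptivePoller:
        def __init__(self, min_interval=30, max_interval=600):
            self.min_interval = min_interval
            self.max_interval = max_interval
            self.interval = min_interval
            self.last_notification = time.time()

        def on_notification(self):
            # A live notification stream suggests nothing is being lost,
            # so the next poll can be postponed.
            self.last_notification = time.time()
            self.interval = min(self.interval * 2, self.max_interval)

        def should_poll(self):
            quiet = time.time() - self.last_notification
            if quiet >= self.interval:
                self.interval = self.min_interval  # poll, reset backoff
                return True
            return False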


 However, such a change is unlikely to be a candidate for backport so it will 
 have to wait.

Right, this isn't for backport. I'm talking about the medium/long-term direction.


  You mentioned agent crashes. Server crashes should also be taken care of
  for reliability. Admins also sometimes want to restart the neutron
  server/agents for various reasons.
  An agent can crash after receiving a notification but before it starts
  processing the actual tasks. The server can crash after committing changes
  to the DB but before sending notifications. In such cases, the notification
  will be lost. Polling to resync would be necessary somewhere.
 
 Agreed, we need to consider the cases of both agent and service failure.  
 
 In the case of service failure, thanks to recently merged patches, the dhcp 
 agent will at least force a resync in the event of an error in communicating 
 with the server.  However, there is no guarantee that the agent will 
 communicate with the server during the downtime.  While polling is one 
 possible solution, might it be preferable for the service to simply notify 
 the agents when it starts?  The dhcp agent can already receive an 
 agent_updated RPC message that triggers a resync.  

Agreed, notification on server startup is better.
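
A sketch of that idea (the agent_updated message is mentioned in the
thread; the fanout plumbing below is assumed, not Neutron code):

    class DhcpAgentNotifier:
        def __init__(self, rpc_client):
            self.client = rpc_client  # assumed fanout-capable rpc client

        def on_server_start(self):
            # Fanout so every dhcp agent resyncs against the freshly
            # started server, recovering anything lost during the outage.
            self.client.fanout_cast(
                'dhcp_agent',
                {'method': 'agent_updated',
                 'args': {'payload': {'admin_state_up': True}}})


    class PrintingClient:
        # Stub standing in for a real rpc client, to keep this runnable.
        def fanout_cast(self, topic, msg):
            print('cast to %s: %s' % (topic, msg))


    DhcpAgentNotifier(PrintingClient()).on_server_start()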


  - notification loss isn't considered:
    self.resync is not always run.
    Some optimizations are possible, for example:
    - detect loss by sequence number
    - polling can be postponed when notifications arrive without loss.
 
 Notification loss due to agent failure is already solved - sync_state() is 
 called on startup.  Notification loss due to server failure could be handled 
 as described above.   I think the larger problem is that calling sync_state() 
 does not affect processing of notifications already in the queue, which could 
 result in stale notifications being processed out-of-order, e.g.
 
 - service sends 'network down' notification
 - service goes down after committing 'network up' to db, but before sending 
 notification
 - service comes back up
 - agent knows (somehow) to resync, setting the network 'up'
 - agent processes stale 'network down' notification
 
 Though tracking sequence numbers is one possible fix, what do you think of 
 instead ignoring all notifications generated before a timestamp set at the 
 beginning of sync_state()?  

I agree that improvement is necessary in this area, and it is better for the
agent to ignore stale notifications somehow.

Regarding out-of-order notifications, making the agent able to accept
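
The timestamp idea quoted above could look roughly like this (names
invented; note it assumes server and agent clocks are reasonably in sync):

    import time


    class NotificationFilter:
        def __init__(self):
            self.sync_started_at = 0.0

        def begin_sync(self):
            self.sync_started_at = time.time()

        def should_process(self, notification):
            # 'timestamp' is assumed to be set by the server at send time.
            return notification.get('timestamp', 0.0) >= self.sync_started_at


    f = NotificationFilter()
    f.begin_sync()
    stale = {'event': 'network down', 'timestamp': f.sync_started_at - 5}
    assert not f.should_process(stale)  # stale out-of-order event dropped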

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-07 Thread Robert Collins
On 7 December 2013 21:53, Isaku Yamahata isaku.yamah...@gmail.com wrote:

 Case 3: Hardware failure, so an agent on the node is gone.
 Another agent will run on another node.

 If the AMQP service is set up not to lose notifications, they will pile up
 and stress the AMQP service. I would say a single node failure isn't
 catastrophic.

So we should have AMQP set to discard notifications if there is no one
listening: when an agent connects after an outage, it first starts
listening, then does a poll for updates it missed.
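
Schematically (invented names; the arguments stand for assumed
interfaces), the ordering is what matters: subscribe first, then fetch the
full state, so nothing falls into the gap, and any event that races the
poll is simply seen twice.

    def recover_after_outage(bus, server, agent):
        pending = []
        bus.subscribe(pending.append)         # 1. start listening first
        full_state = server.get_full_state()  # 2. then poll for a snapshot
        agent.apply(full_state)
        for event in pending:                 # 3. replay anything that raced
            agent.apply(event)  # handlers must be idempotent for this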

-Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud



Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-06 Thread Carl Baldwin
Pasting a few things from IRC here to fill out the context...

marun carl_baldwin: but according to markmcclain and salv-orlando,
it isn't possible to trivially use multiple workers for rpc because
processing rpc requests out of sequence can be dangerous

carl_baldwin marun: I think it is already possible to run more than
one RPC message processor.  If the neutron server process is run on
multiple hosts in active/active I think you end up getting multiple
independent RPC processing threads unless I'm missing something.

marun carl_baldwin: is active/active an option?

I checked one of my environments where there are two API servers
running.  It is clear from the logs that both servers are consuming
and processing RPC messages independently.  I have not identified any
problems resulting from doing this yet.  I've been running this way
for months.  There could be something lurking in there preparing to
cause a problem.

I'm suddenly keenly interested in understanding the problems with
processing RPC messages out of order.  I tried reading the IRC backlog
for information about this but it was not clear to me.  Mark or
Salvatore, can you comment?

Not only is RPC being handled by both physical servers in my
environment but each of the API server worker processes is consuming
and processing RPC messages independently.  So, I am currently running
a multi-process RPC scenario now.

I did not intend for this to happen this way.  My environment has
something different from the current upstream.  I confirmed that with
the current upstream code and the ML2 plugin, only the parent process
consumes RPC messages.  It is probably because this environment is
still using an older version of my multi-process API worker patch.
Still looking into it.

Carl

On Thu, Dec 5, 2013 at 7:32 AM, Carl Baldwin c...@ecbaldwin.net wrote:
 Creating separate processes for API workers does allow a bit more room
 for RPC message processing in the main process.  If this isn't enough
 and the main process is still bound on CPU and/or green
 thread/sqlalchemy blocking then creating separate worker processes for
 RPC processing may be the next logical step to scale.  I'll give it
 some thought today and possibly create a blueprint.

 Carl

 On Thu, Dec 5, 2013 at 7:13 AM, Maru Newby ma...@redhat.com wrote:

 On Dec 5, 2013, at 6:43 AM, Carl Baldwin c...@ecbaldwin.net wrote:

 I have offered up https://review.openstack.org/#/c/60082/ as a
 backport to Havana.  Interest was expressed in the blueprint for doing
 this even before this thread.  If there is consensus for this as the
 stop-gap then it is there for the merging.  However, I do not want to
 discourage discussion of other stop-gap solutions like what Maru
 proposed in the original post.

 Carl

 Awesome.  No worries, I'm still planning on submitting a patch to improve 
 notification reliability.

 We seem to be cpu bound now in processing RPC messages.  Do you think it 
 would be reasonable to run multiple processes for RPC?


 m.




Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-05 Thread Isaku Yamahata
On Wed, Dec 04, 2013 at 12:37:19PM +0900,
Maru Newby ma...@redhat.com wrote:

 On Dec 4, 2013, at 11:57 AM, Clint Byrum cl...@fewbar.com wrote:
 
  Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800:
  I've been investigating a bug that is preventing VM's from receiving IP 
  addresses when a Neutron service is under high load:
  
  https://bugs.launchpad.net/neutron/+bug/1192381
  
  High load causes the DHCP agent's status updates to be delayed, causing 
  the Neutron service to assume that the agent is down.  This results in the 
  Neutron service not sending notifications of port addition to the DHCP 
  agent.  At present, the notifications are simply dropped.  A simple fix is 
  to send notifications regardless of agent status.  Does anybody have any 
  objections to this stop-gap approach?  I'm not clear on the implications 
  of sending notifications to agents that are down, but I'm hoping for a 
  simple fix that can be backported to both havana and grizzly (yes, this 
  bug has been with us that long).
  
  Fixing this problem for real, though, will likely be more involved.  The 
  proposal to replace the current wsgi framework with Pecan may increase the 
  Neutron service's scalability, but should we continue to use a 'fire and 
  forget' approach to notification?  Being able to track the success or 
  failure of a given action outside of the logs would seem pretty important, 
  and allow for more effective coordination with Nova than is currently 
  possible.
  
  
  Dropping requests without triggering a user-visible error is a pretty
  serious problem. You didn't mention if you have filed a bug about that.
  If not, please do or let us know here so we can investigate and file
  a bug.
 
 There is a bug linked to in the original message that I am already working 
 on.  The fact that the bug's title is 'dhcp agent doesn't configure ports' 
 rather than 'dhcp notifications are silently dropped' is incidental.
 
  
  It seems to me that they should be put into a queue to be retried.
  Sending the notifications blindly is almost as bad as dropping them,
  as you have no idea if the agent is alive or not.
 
 This is more the kind of discussion I was looking for.  
 
 In the current architecture, the Neutron service handles RPC and WSGI with a 
 single process and is prone to being overloaded such that agent heartbeats 
 can be delayed beyond the limit for the agent being declared 'down'.  Even if 
 we increased the agent timeout as Yongsheg suggests, there is no guarantee 
 that we can accurately detect whether an agent is 'live' with the current 
 architecture.  Given that amqp can ensure eventual delivery - it is a queue - 
 is sending a notification blind such a bad idea?  In the best case the agent 
 isn't really down and can process the notification.  In the worst case, the 
 agent really is down but will be brought up eventually by a deployment's 
 monitoring solution and process the notification when it returns.  What am I 
 missing? 
 

Do you mean overload of the neutron server, not the neutron agent?
So even though the agent sends periodic 'live' reports, the reports pile up
unprocessed by the server.
When the server sends a notification, it wrongly considers the agent dead,
not because the agent failed to send live reports due to its own overload.
Is this understanding correct?


 Please consider that while a good solution will track notification delivery 
 and success, we may need 2 solutions:
 
 1. A 'good-enough', minimally-invasive stop-gap that can be back-ported to 
 grizzly and havana.

How about twisting DhcpAgent._periodic_resync_helper?
If no notification has been received from the server since the last sleep,
it calls self.sync_state() even if self.needs_resync = False. Thus any
inconsistency between agent and server due to a lost notification
will be fixed.
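
Roughly like this (the real _periodic_resync_helper differs; names follow
the thread): treat silence from the server as a reason to resync, since it
may mean notifications were lost rather than that nothing changed.

    import time


    class ResyncHelper:
        def __init__(self, agent, interval=60):
            self.agent = agent  # assumed to expose sync_state()
            self.interval = interval
            self.needs_resync = False
            self.saw_notification = False

        def on_notification(self):
            self.saw_notification = True

        def periodic_resync_loop(self):
            while True:
                time.sleep(self.interval)
                if self.needs_resync or not self.saw_notification:
                    self.needs_resync = False
                    self.agent.sync_state()
                self.saw_notification = False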


 2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent.
 
 I'm hoping that coming up with a solution to #1 will allow us the breathing 
 room to work on #2 in this cycle.

Loss of notifications is somewhat inevitable, I think
(short of logging tasks to stable storage shared between server and agent),
and unconditionally sending notifications could cause problems.

You mentioned agent crashes. Server crashes should also be taken care of
for reliability. Admins also sometimes want to restart the neutron
server/agents for various reasons.
An agent can crash after receiving a notification but before it starts
processing the actual tasks. The server can crash after committing changes to
the DB but before sending notifications. In such cases, the notification will
be lost. Polling to resync would be necessary somewhere.

- notification loss isn't considered:
  self.resync is not always run.
  Some optimizations are possible, for example:
  - detect loss by sequence number
  - polling can be postponed when notifications arrive without loss.

- periodic resync spawns threads, but doesn't wait for their completion.
  So if a resync takes a long time, the next resync can start even while
  the previous one is still running.
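
The sequence-number idea above in miniature (invented names): the server
stamps each notification with a per-agent counter, and a gap tells the
agent it missed something and should resync.

    class SequenceTracker:
        def __init__(self):
            self.expected = None

        def observe(self, seq):
            """Return True if a gap (lost notification) was detected."""
            lost = self.expected is not None and seq != self.expected
            self.expected = seq + 1
            return lost


    t = SequenceTracker()
    assert t.observe(1) is False
    assert t.observe(2) is False
    assert t.observe(5) is True  # 3 and 4 lost: trigger sync_state()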


Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-05 Thread Maru Newby

On Dec 5, 2013, at 6:43 AM, Carl Baldwin c...@ecbaldwin.net wrote:

 I have offered up https://review.openstack.org/#/c/60082/ as a
 backport to Havana.  Interest was expressed in the blueprint for doing
 this even before this thread.  If there is consensus for this as the
 stop-gap then it is there for the merging.  However, I do not want to
 discourage discussion of other stop-gap solutions like what Maru
 proposed in the original post.
 
 Carl

Awesome.  No worries, I'm still planning on submitting a patch to improve 
notification reliability.

We seem to be cpu bound now in processing RPC messages.  Do you think it would 
be reasonable to run multiple processes for RPC?


m.




Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-05 Thread Carl Baldwin
Creating separate processes for API workers does allow a bit more room
for RPC message processing in the main process.  If this isn't enough
and the main process is still bound on CPU and/or green
thread/sqlalchemy blocking then creating separate worker processes for
RPC processing may be the next logical step to scale.  I'll give it
some thought today and possibly create a blueprint.

Carl

On Thu, Dec 5, 2013 at 7:13 AM, Maru Newby ma...@redhat.com wrote:

 On Dec 5, 2013, at 6:43 AM, Carl Baldwin c...@ecbaldwin.net wrote:

 I have offered up https://review.openstack.org/#/c/60082/ as a
 backport to Havana.  Interest was expressed in the blueprint for doing
 this even before this thread.  If there is consensus for this as the
 stop-gap then it is there for the merging.  However, I do not want to
 discourage discussion of other stop-gap solutions like what Maru
 proposed in the original post.

 Carl

 Awesome.  No worries, I'm still planning on submitting a patch to improve 
 notification reliability.

 We seem to be cpu bound now in processing RPC messages.  Do you think it 
 would be reasonable to run multiple processes for RPC?


 m.




Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-05 Thread Maru Newby

On Dec 5, 2013, at 5:21 PM, Isaku Yamahata isaku.yamah...@gmail.com wrote:

 On Wed, Dec 04, 2013 at 12:37:19PM +0900,
 Maru Newby ma...@redhat.com wrote:
 
 In the current architecture, the Neutron service handles RPC and WSGI with a 
 single process and is prone to being overloaded such that agent heartbeats 
 can be delayed beyond the limit for the agent being declared 'down'.  Even 
 if we increased the agent timeout as Yongsheg suggests, there is no 
 guarantee that we can accurately detect whether an agent is 'live' with the 
 current architecture.  Given that amqp can ensure eventual delivery - it is 
 a queue - is sending a notification blind such a bad idea?  In the best case 
 the agent isn't really down and can process the notification.  In the worst 
 case, the agent really is down but will be brought up eventually by a 
 deployment's monitoring solution and process the notification when it 
 returns.  What am I missing? 
 
 
 Do you mean overload of the neutron server, not the neutron agent?
 So even though the agent sends periodic 'live' reports, the reports pile up
 unprocessed by the server.
 When the server sends a notification, it wrongly considers the agent dead,
 not because the agent failed to send live reports due to its own overload.
 Is this understanding correct?

Your interpretation is likely correct.  The demands on the service are going to 
be much higher by virtue of having to field RPC requests from all the agents to 
interact with the database on their behalf.


 Please consider that while a good solution will track notification delivery 
 and success, we may need 2 solutions:
 
 1. A 'good-enough', minimally-invasive stop-gap that can be back-ported to 
 grizzly and havana.
 
 How about twisting DhcpAgent._periodic_resync_helper?
 If no notification has been received from the server since the last sleep,
 it calls self.sync_state() even if self.needs_resync = False. Thus any
 inconsistency between agent and server due to a lost notification
 will be fixed.

Unless I'm missing something, wouldn't forcing more and potentially unnecessary 
resyncs increase the load on the Neutron service and negatively impact 
reliability?


 2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent.
 
 I'm hoping that coming up with a solution to #1 will allow us the breathing 
 room to work on #2 in this cycle.
 
 Loss of notifications is somewhat inevitable, I think
 (short of logging tasks to stable storage shared between server and agent),
 and unconditionally sending notifications could cause problems.

Regarding sending notifications unconditionally, what specifically are you 
worried about?  I can imagine 2 scenarios:

Case 1: Send notification to an agent that is incorrectly reported as down. 
Result:  Agent receives notification and acts on it.

Case 2: Send notification to an agent that is actually down.
Result: Agent comes up eventually (in a production environment this should be a 
given) and calls sync_state().  We definitely need to make sync_state more 
reliable, though (I discuss the specifics later in this message).

Notifications could of course be dropped if AMQP queues are not persistent and 
are lost, but I don't think there needs to be a code-based remedy for this.  An 
operator is likely to deploy the AMQP service in HA to prevent the queues from 
being lost, and know to restart everything in the event of catastrophic failure.

That's not to say we don't have work to do, though.  An agent is responsible 
for communicating resource state changes to the service, but the service 
neither detects nor reacts when the state of a resource is scheduled to change 
and fails to do so in a reasonable timeframe.  Thus, as in the bug that 
prompted this discussion, it is up to the user to detect the failure (a VM 
without connectivity).  Ideally, Neutron should be tracking resource state 
changes with sufficient detail and reviewing them periodically to allow timely 
failure detection and remediation.  However, such a change is unlikely to be a 
candidate for backport so it will have to wait.


 
 You mentioned agent crashes. Server crashes should also be taken care of
 for reliability. Admins also sometimes want to restart the neutron
 server/agents for various reasons.
 An agent can crash after receiving a notification but before it starts
 processing the actual tasks. The server can crash after committing changes
 to the DB but before sending notifications. In such cases, the notification
 will be lost. Polling to resync would be necessary somewhere.

Agreed, we need to consider the cases of both agent and service failure.  

In the case of service failure, thanks to recently merged patches, the dhcp 
agent will at least force a resync in the event of an error in communicating 
with the server.  However, there is no guarantee that the agent will 
communicate with the server during the downtime.  While polling is one possible 
solution, might it be preferable for the service to simply notify the agents 
when it starts?  The dhcp agent can already receive an 

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-04 Thread Joe Gordon
On Dec 4, 2013 5:41 AM, Maru Newby ma...@redhat.com wrote:


 On Dec 4, 2013, at 11:57 AM, Clint Byrum cl...@fewbar.com wrote:

  Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800:
  I've been investigating a bug that is preventing VM's from receiving
IP addresses when a Neutron service is under high load:
 
  https://bugs.launchpad.net/neutron/+bug/1192381
 
  High load causes the DHCP agent's status updates to be delayed,
causing the Neutron service to assume that the agent is down.  This results
in the Neutron service not sending notifications of port addition to the
DHCP agent.  At present, the notifications are simply dropped.  A simple
fix is to send notifications regardless of agent status.  Does anybody have
any objections to this stop-gap approach?  I'm not clear on the
implications of sending notifications to agents that are down, but I'm
hoping for a simple fix that can be backported to both havana and grizzly
(yes, this bug has been with us that long).
 
  Fixing this problem for real, though, will likely be more involved.
 The proposal to replace the current wsgi framework with Pecan may increase
the Neutron service's scalability, but should we continue to use a 'fire
and forget' approach to notification?  Being able to track the success or
failure of a given action outside of the logs would seem pretty important,
and allow for more effective coordination with Nova than is currently
possible.
 
 
  Dropping requests without triggering a user-visible error is a pretty
  serious problem. You didn't mention if you have filed a bug about that.
  If not, please do or let us know here so we can investigate and file
  a bug.

 There is a bug linked to in the original message that I am already
working on.  The fact that the bug's title is 'dhcp agent doesn't configure
ports' rather than 'dhcp notifications are silently dropped' is incidental.

 
  It seems to me that they should be put into a queue to be retried.
  Sending the notifications blindly is almost as bad as dropping them,
  as you have no idea if the agent is alive or not.

 This is more the kind of discussion I was looking for.

 In the current architecture, the Neutron service handles RPC and WSGI
with a single process and is prone to being overloaded such that agent
heartbeats can be delayed beyond the limit for the agent being declared
'down'.  Even if we increased the agent timeout as Yongsheg suggests, there
is no guarantee that we can accurately detect whether an agent is 'live'
with the current architecture.  Given that amqp can ensure eventual
delivery - it is a queue - is sending a notification blind such a bad idea?
 In the best case the agent isn't really down and can process the
notification.  In the worst case, the agent really is down but will be
brought up eventually by a deployment's monitoring solution and process the
notification when it returns.  What am I missing?

 Please consider that while a good solution will track notification
delivery and success, we may need 2 solutions:

 1. A 'good-enough', minimally-invasive stop-gap that can be back-ported
to grizzly and havana.

 2. A 'best-effort' refactor that maximizes the reliability of the DHCP
agent.

 I'm hoping that coming up with a solution to #1 will allow us the
breathing room to work on #2 in this cycle.

I like the two-part approach, but I would phrase it slightly differently:

a short-term solution to help Neutron meet the nova-network deprecation goals
by icehouse-2, and a longer-term, more robust solution.



 m.



 


Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-04 Thread Maru Newby

On Dec 4, 2013, at 8:55 AM, Carl Baldwin c...@ecbaldwin.net wrote:

 Stephen, all,
 
 I agree that there may be some opportunity to split things out a bit.
 However, I'm not sure what the best way will be.  I recall that Mark
 mentioned breaking out the processes that handle API requests and RPC
 from each other at the summit.  Anyway, it is something that has been
 discussed.
 
 I actually wanted to point out that the neutron server now has the
 ability to run a configurable number of sub-processes to handle a
 heavier load.  Introduced with this commit:
 
 https://review.openstack.org/#/c/37131/
 
 Set api_workers to something > 1 and restart the server.
 
 The server can also be run on more than one physical host in
 combination with multiple child processes.
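
For readers following along, the shape of the change is roughly this (a
schematic, not the actual commit): the parent process keeps consuming RPC
while forked children serve the WSGI API.

    import multiprocessing


    def serve_api(worker_id):
        print('worker %d: serving WSGI requests' % worker_id)


    def serve_rpc():
        print('parent: consuming RPC messages')


    def main(api_workers=2):
        children = [multiprocessing.Process(target=serve_api, args=(i,))
                    for i in range(api_workers)]
        for child in children:
            child.start()
        serve_rpc()  # the parent process stays on RPC duty
        for child in children:
            child.join()


    if __name__ == '__main__':
        main()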

I completely misunderstood the import of the commit in question.  Being able to 
run the wsgi server(s) out of process is a nice improvement, thank you for 
making it happen.  Has there been any discussion around making the default for 
api_workers > 0 (at least 1) to ensure that the default configuration separates 
wsgi and rpc load?  This also seems like a great candidate for backporting to 
havana and maybe even grizzly, although api_workers should probably be 
defaulted to 0 in those cases.

FYI, I re-ran the test that attempted to boot 75 micro VM's simultaneously with 
api_workers = 2, with mixed results.  The increased wsgi throughput resulted in 
almost half of the boot requests failing with 500 errors due to QueuePool 
errors (https://bugs.launchpad.net/neutron/+bug/1160442) in Neutron.  It also 
appears that maximizing the number of wsgi requests has the side-effect of 
increasing the RPC load on the main process, and this means that the problem of 
dhcp notifications being dropped is little improved.  I intend to submit a fix 
that ensures that notifications are sent regardless of agent status, in any 
case.
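
The fix I have in mind is roughly the following (invented names, not the
actual Neutron notifier): keep the liveness check advisory and cast anyway,
letting AMQP hold the message until the agent recovers.

    class DhcpNotifier:
        def __init__(self, rpc, agent_is_down):
            self.rpc = rpc                      # assumed cast-capable client
            self.agent_is_down = agent_is_down  # liveness check, advisory

        def notify_port_create(self, host, port):
            if self.agent_is_down(host):
                # Previously: return here, silently dropping the event.
                print('warning: agent on %s looks down; sending anyway'
                      % host)
            self.rpc.cast('dhcp_agent.%s' % host,
                          {'method': 'port_create_end',
                           'args': {'port': port}})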


m.

 
 Carl
 
 On Tue, Dec 3, 2013 at 9:47 AM, Stephen Gran
 stephen.g...@theguardian.com wrote:
 On 03/12/13 16:08, Maru Newby wrote:
 
 I've been investigating a bug that is preventing VM's from receiving IP
 addresses when a Neutron service is under high load:
 
 https://bugs.launchpad.net/neutron/+bug/1192381
 
 High load causes the DHCP agent's status updates to be delayed, causing
 the Neutron service to assume that the agent is down.  This results in the
 Neutron service not sending notifications of port addition to the DHCP
 agent.  At present, the notifications are simply dropped.  A simple fix is
 to send notifications regardless of agent status.  Does anybody have any
 objections to this stop-gap approach?  I'm not clear on the implications of
 sending notifications to agents that are down, but I'm hoping for a simple
 fix that can be backported to both havana and grizzly (yes, this bug has
 been with us that long).
 
 Fixing this problem for real, though, will likely be more involved.  The
 proposal to replace the current wsgi framework with Pecan may increase the
 Neutron service's scalability, but should we continue to use a 'fire and
 forget' approach to notification?  Being able to track the success or
 failure of a given action outside of the logs would seem pretty important,
 and allow for more effective coordination with Nova than is currently
 possible.
 
 
 It strikes me that we ask an awful lot of a single neutron-server instance -
 it has to take state updates from all the agents, it has to do scheduling,
 it has to respond to API requests, and it has to communicate about actual
 changes with the agents.
 
 Maybe breaking some of these out the way nova has a scheduler and a
 conductor and so on might be a good model (I know there are things people
 are unhappy about with nova-scheduler, but imagine how much worse it would
 be if it was built into the API).
 
 Doing all of those tasks, and doing it largely single threaded, is just
 asking for overload.
 
 Cheers,
 --
 Stephen Gran
 Senior Systems Integrator - theguardian.com

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-04 Thread Ashok Kumaran
On Wed, Dec 4, 2013 at 8:30 PM, Maru Newby ma...@redhat.com wrote:


 On Dec 4, 2013, at 8:55 AM, Carl Baldwin c...@ecbaldwin.net wrote:

  Stephen, all,
 
  I agree that there may be some opportunity to split things out a bit.
  However, I'm not sure what the best way will be.  I recall that Mark
  mentioned breaking out the processes that handle API requests and RPC
  from each other at the summit.  Anyway, it is something that has been
  discussed.
 
  I actually wanted to point out that the neutron server now has the
  ability to run a configurable number of sub-processes to handle a
  heavier load.  Introduced with this commit:
 
  https://review.openstack.org/#/c/37131/
 
   Set api_workers to something > 1 and restart the server.
 
  The server can also be run on more than one physical host in
  combination with multiple child processes.

 I completely misunderstood the import of the commit in question.  Being
 able to run the wsgi server(s) out of process is a nice improvement, thank
 you for making it happen.  Has there been any discussion around making the
  default for api_workers > 0 (at least 1) to ensure that the default
 configuration separates wsgi and rpc load?  This also seems like a great
 candidate for backporting to havana and maybe even grizzly, although
 api_workers should probably be defaulted to 0 in those cases.


+1 for backporting the api_workers feature to havana as well as Grizzly :)


 FYI, I re-ran the test that attempted to boot 75 micro VM's simultaneously
 with api_workers = 2, with mixed results.  The increased wsgi throughput
 resulted in almost half of the boot requests failing with 500 errors due to
 QueuePool errors (https://bugs.launchpad.net/neutron/+bug/1160442) in
 Neutron.  It also appears that maximizing the number of wsgi requests has
 the side-effect of increasing the RPC load on the main process, and this
 means that the problem of dhcp notifications being dropped is little
 improved.  I intend to submit a fix that ensures that notifications are
 sent regardless of agent status, in any case.
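
For context on the QueuePool errors: they come from SQLAlchemy's connection
pool being exhausted under concurrency. A generic SQLAlchemy illustration of
the knobs involved (example values, not recommendations):

    from sqlalchemy import create_engine
    from sqlalchemy.pool import QueuePool

    engine = create_engine(
        'sqlite://',      # stand-in URL; Neutron would point at MySQL etc.
        poolclass=QueuePool,
        pool_size=10,     # steady-state connections kept open
        max_overflow=20,  # extra connections allowed during bursts
        pool_timeout=30,  # seconds to wait before raising TimeoutError
    )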


 m.

 
  Carl
 
  On Tue, Dec 3, 2013 at 9:47 AM, Stephen Gran
  stephen.g...@theguardian.com wrote:
  On 03/12/13 16:08, Maru Newby wrote:
 
  I've been investigating a bug that is preventing VM's from receiving IP
  addresses when a Neutron service is under high load:
 
  https://bugs.launchpad.net/neutron/+bug/1192381
 
  High load causes the DHCP agent's status updates to be delayed, causing
  the Neutron service to assume that the agent is down.  This results in
 the
  Neutron service not sending notifications of port addition to the DHCP
  agent.  At present, the notifications are simply dropped.  A simple
 fix is
  to send notifications regardless of agent status.  Does anybody have
 any
  objections to this stop-gap approach?  I'm not clear on the
 implications of
  sending notifications to agents that are down, but I'm hoping for a
 simple
  fix that can be backported to both havana and grizzly (yes, this bug
 has
  been with us that long).
 
  Fixing this problem for real, though, will likely be more involved.
  The
  proposal to replace the current wsgi framework with Pecan may increase
 the
  Neutron service's scalability, but should we continue to use a 'fire
 and
  forget' approach to notification?  Being able to track the success or
  failure of a given action outside of the logs would seem pretty
 important,
  and allow for more effective coordination with Nova than is currently
  possible.
 
 
  It strikes me that we ask an awful lot of a single neutron-server
 instance -
  it has to take state updates from all the agents, it has to do
 scheduling,
  it has to respond to API requests, and it has to communicate about
 actual
  changes with the agents.
 
  Maybe breaking some of these out the way nova has a scheduler and a
  conductor and so on might be a good model (I know there are things
 people
  are unhappy about with nova-scheduler, but imagine how much worse it
 would
  be if it was built into the API).
 
  Doing all of those tasks, and doing it largely single threaded, is just
  asking for overload.
 
  Cheers,
  --
  Stephen Gran
  Senior Systems Integrator - theguardian.com

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-04 Thread Carl Baldwin
Sorry to have taken the discussion on a slight tangent.  I meant only
to offer the solution as a stop-gap.  I agree that the fundamental
problem should still be addressed.

On Tue, Dec 3, 2013 at 8:01 PM, Maru Newby ma...@redhat.com wrote:

 On Dec 4, 2013, at 1:47 AM, Stephen Gran stephen.g...@theguardian.com wrote:

 On 03/12/13 16:08, Maru Newby wrote:
 I've been investigating a bug that is preventing VM's from receiving IP 
 addresses when a Neutron service is under high load:

 https://bugs.launchpad.net/neutron/+bug/1192381

 High load causes the DHCP agent's status updates to be delayed, causing the 
 Neutron service to assume that the agent is down.  This results in the 
 Neutron service not sending notifications of port addition to the DHCP 
 agent.  At present, the notifications are simply dropped.  A simple fix is 
 to send notifications regardless of agent status.  Does anybody have any 
 objections to this stop-gap approach?  I'm not clear on the implications of 
 sending notifications to agents that are down, but I'm hoping for a simple 
 fix that can be backported to both havana and grizzly (yes, this bug has 
 been with us that long).

 Fixing this problem for real, though, will likely be more involved.  The 
 proposal to replace the current wsgi framework with Pecan may increase the 
 Neutron service's scalability, but should we continue to use a 'fire and 
 forget' approach to notification?  Being able to track the success or 
 failure of a given action outside of the logs would seem pretty important, 
 and allow for more effective coordination with Nova than is currently 
 possible.

 It strikes me that we ask an awful lot of a single neutron-server instance - 
 it has to take state updates from all the agents, it has to do scheduling, 
 it has to respond to API requests, and it has to communicate about actual 
 changes with the agents.

 Maybe breaking some of these out the way nova has a scheduler and a 
 conductor and so on might be a good model (I know there are things people 
 are unhappy about with nova-scheduler, but imagine how much worse it would 
 be if it was built into the API).

 Doing all of those tasks, and doing it largely single threaded, is just 
 asking for overload.

 I'm sorry if it wasn't clear in my original message, but my primary concern 
 lies with the reliability rather than the scalability of the Neutron service. 
  Carl's addition of multiple workers is a good stop-gap to minimize the 
 impact of blocking IO calls in the current architecture, and we already have 
 consensus on the need to separate RPC and WSGI functions as part of the Pecan 
 rewrite.  I am worried, though, that we are not being sufficiently diligent 
 in how we manage state transitions through notifications.  Managing 
 transitions and their associate error states is needlessly complicated by the 
 current ad-hoc approach, and I'd appreciate input on the part of distributed 
 systems experts as to how we could do better.


 m.




Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-04 Thread Carl Baldwin
I have offered up https://review.openstack.org/#/c/60082/ as a
backport to Havana.  Interest was expressed in the blueprint for doing
this even before this thread.  If there is consensus for this as the
stop-gap then it is there for the merging.  However, I do not want to
discourage discussion of other stop-gap solutions like what Maru
proposed in the original post.

Carl

On Wed, Dec 4, 2013 at 9:12 AM, Ashok Kumaran ashokkumara...@gmail.com wrote:



 On Wed, Dec 4, 2013 at 8:30 PM, Maru Newby ma...@redhat.com wrote:


 On Dec 4, 2013, at 8:55 AM, Carl Baldwin c...@ecbaldwin.net wrote:

  Stephen, all,
 
  I agree that there may be some opportunity to split things out a bit.
  However, I'm not sure what the best way will be.  I recall that Mark
  mentioned breaking out the processes that handle API requests and RPC
  from each other at the summit.  Anyway, it is something that has been
  discussed.
 
  I actually wanted to point out that the neutron server now has the
  ability to run a configurable number of sub-processes to handle a
  heavier load.  Introduced with this commit:
 
  https://review.openstack.org/#/c/37131/
 
   Set api_workers to something > 1 and restart the server.
 
  The server can also be run on more than one physical host in
  combination with multiple child processes.

 I completely misunderstood the import of the commit in question.  Being
 able to run the wsgi server(s) out of process is a nice improvement, thank
 you for making it happen.  Has there been any discussion around making the
  default for api_workers > 0 (at least 1) to ensure that the default
 configuration separates wsgi and rpc load?  This also seems like a great
 candidate for backporting to havana and maybe even grizzly, although
 api_workers should probably be defaulted to 0 in those cases.


 +1 for backporting the api_workers feature to havana as well as Grizzly :)


 FYI, I re-ran the test that attempted to boot 75 micro VM's simultaneously
 with api_workers = 2, with mixed results.  The increased wsgi throughput
 resulted in almost half of the boot requests failing with 500 errors due to
 QueuePool errors (https://bugs.launchpad.net/neutron/+bug/1160442) in
 Neutron.  It also appears that maximizing the number of wsgi requests has
 the side-effect of increasing the RPC load on the main process, and this
 means that the problem of dhcp notifications being dropped is little
 improved.  I intend to submit a fix that ensures that notifications are sent
 regardless of agent status, in any case.


 m.

 

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-04 Thread Édouard Thuleau
There is also another bug you can link/mark as a duplicate of #1192381:
https://bugs.launchpad.net/neutron/+bug/1185916.
I proposed a fix, but it wasn't the right approach, so I abandoned it.

Édouard.


Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Stephen Gran

On 03/12/13 16:08, Maru Newby wrote:

I've been investigating a bug that is preventing VM's from receiving IP 
addresses when a Neutron service is under high load:

https://bugs.launchpad.net/neutron/+bug/1192381

High load causes the DHCP agent's status updates to be delayed, causing the 
Neutron service to assume that the agent is down.  This results in the Neutron 
service not sending notifications of port addition to the DHCP agent.  At 
present, the notifications are simply dropped.  A simple fix is to send 
notifications regardless of agent status.  Does anybody have any objections to 
this stop-gap approach?  I'm not clear on the implications of sending 
notifications to agents that are down, but I'm hoping for a simple fix that can 
be backported to both havana and grizzly (yes, this bug has been with us that 
long).

Fixing this problem for real, though, will likely be more involved.  The 
proposal to replace the current wsgi framework with Pecan may increase the 
Neutron service's scalability, but should we continue to use a 'fire and 
forget' approach to notification?  Being able to track the success or failure 
of a given action outside of the logs would seem pretty important, and allow 
for more effective coordination with Nova than is currently possible.


It strikes me that we ask an awful lot of a single neutron-server 
instance - it has to take state updates from all the agents, it has to 
do scheduling, it has to respond to API requests, and it has to 
communicate about actual changes with the agents.


Maybe breaking some of these out the way nova has a scheduler and a 
conductor and so on might be a good model (I know there are things 
people are unhappy about with nova-scheduler, but imagine how much worse 
it would be if it was built into the API).


Doing all of those tasks, and doing them largely single-threaded, is just 
asking for overload.


Cheers,
--
Stephen Gran
Senior Systems Integrator - theguardian.com
Please consider the environment before printing this email.

--


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Carl Baldwin
Stephen, all,

I agree that there may be some opportunity to split things out a bit.
However, I'm not sure what the best way will be.  I recall that Mark
mentioned breaking out the processes that handle API requests and RPC
from each other at the summit.  Anyway, it is something that has been
discussed.

I actually wanted to point out that the neutron server now has the
ability to run a configurable number of sub-processes to handle a
heavier load.  Introduced with this commit:

https://review.openstack.org/#/c/37131/

Set api_workers to something > 1 and restart the server.

The server can also be run on more than one physical host in
combination with multiple child processes.
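
As a concrete sketch, the relevant neutron.conf fragment would look
something like the following (the option name is real; the value of 4
is only an illustration and should be tuned to the host's core count):

    [DEFAULT]
    # Number of separate worker processes forked to serve the API.
    # 0 (the old behaviour) keeps WSGI in the main process with RPC.
    api_workers = 4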

Carl


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Yongsheng Gong
Another way is to have a larger agent_down_time; by default it is 9 secs.
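
For reference, the knobs involved would look roughly like this (the
values are illustrative, not recommendations; report_interval is set
on the agent side and should stay several multiples below
agent_down_time so one delayed report doesn't mark the agent down):

    [DEFAULT]
    # Server side: seconds without a state report before the server
    # declares an agent down.
    agent_down_time = 60

    [AGENT]
    # Agent side: seconds between state reports to the server.
    report_interval = 10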



___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Clint Byrum
Excerpts from Maru Newby's message of 2013-12-03 08:08:09 -0800:
 I've been investigating a bug that is preventing VM's from receiving IP 
 addresses when a Neutron service is under high load:
 
 https://bugs.launchpad.net/neutron/+bug/1192381
 
 High load causes the DHCP agent's status updates to be delayed, causing the 
 Neutron service to assume that the agent is down.  This results in the 
 Neutron service not sending notifications of port addition to the DHCP agent. 
  At present, the notifications are simply dropped.  A simple fix is to send 
 notifications regardless of agent status.  Does anybody have any objections 
 to this stop-gap approach?  I'm not clear on the implications of sending 
 notifications to agents that are down, but I'm hoping for a simple fix that 
 can be backported to both havana and grizzly (yes, this bug has been with us 
 that long).
 
 Fixing this problem for real, though, will likely be more involved.  The 
 proposal to replace the current wsgi framework with Pecan may increase the 
 Neutron service's scalability, but should we continue to use a 'fire and 
 forget' approach to notification?  Being able to track the success or failure 
 of a given action outside of the logs would seem pretty important, and allow 
 for more effective coordination with Nova than is currently possible.
 

Dropping requests without triggering a user-visible error is a pretty
serious problem. You didn't mention if you have filed a bug about that.
If not, please do or let us know here so we can investigate and file
a bug.

It seems to me that they should be put into a queue to be retried.
Sending the notifications blindly is almost as bad as dropping them,
as you have no idea if the agent is alive or not.
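
A rough sketch of that idea in Python, with illustrative names rather
than Neutron's actual notifier internals (a real version would need a
bounded queue, persistence, and integration with the liveness check):

    import collections
    import logging

    LOG = logging.getLogger(__name__)

    class RetryingNotifier(object):
        """Queue undelivered notifications and retry them later."""

        def __init__(self, send, max_attempts=5):
            # send: callable(event, payload) -> True on delivery.
            self._send = send
            self._pending = collections.deque()
            self._max_attempts = max_attempts

        def notify(self, event, payload):
            if not self._send(event, payload):
                self._pending.append((event, payload, 1))

        def retry_pending(self):
            # Run periodically (e.g. from a looping task); each pass
            # retries everything queued exactly once.
            for _ in range(len(self._pending)):
                event, payload, attempts = self._pending.popleft()
                if self._send(event, payload):
                    continue
                if attempts < self._max_attempts:
                    self._pending.append((event, payload, attempts + 1))
                else:
                    # Surface the failure instead of dropping silently.
                    LOG.error('%s undelivered after %d attempts',
                              event, attempts)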

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Maru Newby

On Dec 4, 2013, at 1:47 AM, Stephen Gran stephen.g...@theguardian.com wrote:

 It strikes me that we ask an awful lot of a single neutron-server instance - 
 it has to take state updates from all the agents, it has to do scheduling, it 
 has to respond to API requests, and it has to communicate about actual 
 changes with the agents.
 
 Maybe breaking some of these out the way nova has a scheduler and a conductor 
 and so on might be a good model (I know there are things people are unhappy 
 about with nova-scheduler, but imagine how much worse it would be if it was 
 built into the API).
 
 Doing all of those tasks, and doing them largely single-threaded, is just 
 asking for overload.

I'm sorry if it wasn't clear in my original message, but my primary concern 
lies with the reliability rather than the scalability of the Neutron service.  
Carl's addition of multiple workers is a good stop-gap to minimize the impact 
of blocking IO calls in the current architecture, and we already have consensus 
on the need to separate RPC and WSGI functions as part of the Pecan rewrite.  I 
am worried, though, that we are not being sufficiently diligent in how we 
manage state transitions through notifications.  Managing transitions and their 
associated error states is needlessly complicated by the current ad-hoc 
approach, and I'd appreciate input on the part of distributed systems experts 
as to how we could do better.


m. 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Maru Newby

On Dec 4, 2013, at 11:02 AM, Yongsheng Gong gong...@unitedstack.com wrote:

 Another way is to have a larger agent_down_time; by default it is 9 secs.

I don't believe that increasing the timeout by itself is a good solution.  
Relying on the agent state to know whether to send a notification has simply 
proven unreliable with the current architecture of a poorly performing 
single-process server handling both RPC and WSGI.


m.

 
 

Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Maru Newby

On Dec 4, 2013, at 11:57 AM, Clint Byrum cl...@fewbar.com wrote:

 
 Dropping requests without triggering a user-visible error is a pretty
 serious problem. You didn't mention if you have filed a bug about that.
 If not, please do or let us know here so we can investigate and file
 a bug.

There is a bug linked to in the original message that I am already working on.  
The fact that the bug's title is 'dhcp agent doesn't configure ports' rather 
than 'dhcp notifications are silently dropped' is incidental.

 
 It seems to me that they should be put into a queue to be retried.
 Sending the notifications blindly is almost as bad as dropping them,
 as you have no idea if the agent is alive or not.

This is more the kind of discussion I was looking for.  

In the current architecture, the Neutron service handles RPC and WSGI with a 
single process and is prone to being overloaded such that agent heartbeats can 
be delayed beyond the limit for the agent being declared 'down'.  Even if we 
increased the agent timeout as Yongsheng suggests, there is no guarantee that we 
can accurately detect whether an agent is 'live' with the current architecture. 
 Given that amqp can ensure eventual delivery - it is a queue - is sending a 
notification blind such a bad idea?  In the best case the agent isn't really 
down and can process the notification.  In the worst case, the agent really is 
down but will be brought up eventually by a deployment's monitoring solution 
and process the notification when it returns.  What am I missing? 

Please consider that while a good solution will track notification delivery and 
success, we may need 2 solutions:

1. A 'good-enough', minimally-invasive stop-gap that can be back-ported to 
grizzly and havana.

2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent.

I'm hoping that coming up with a solution to #1 will allow us the breathing 
room to work on #2 in this cycle.
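
To make #1 concrete, the change amounts to inverting the drop; in
illustrative form (the names below are not the actual plugin code):

    import logging

    LOG = logging.getLogger(__name__)

    def notify_dhcp_agent(cast, context, method, payload, topic, agent_down):
        # cast() is assumed to publish onto the agent's amqp topic.
        if agent_down:
            # Previous behaviour: return here, silently dropping the
            # notification.  Stop-gap: warn and send anyway; the queue
            # holds the message until the agent comes back to consume it.
            LOG.warning('agent on topic %s appears down; notifying anyway',
                        topic)
        cast(context, method, payload, topic=topic)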


m.



 


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] DHCP Agent Reliability

2013-12-03 Thread Clint Byrum
Excerpts from Maru Newby's message of 2013-12-03 19:37:19 -0800:
 
 On Dec 4, 2013, at 11:57 AM, Clint Byrum cl...@fewbar.com wrote:
 
  
  Dropping requests without triggering a user-visible error is a pretty
  serious problem. You didn't mention if you have filed a bug about that.
  If not, please do or let us know here so we can investigate and file
  a bug.
 
 There is a bug linked to in the original message that I am already working 
 on.  The fact that the bug's title is 'dhcp agent doesn't configure ports' 
 rather than 'dhcp notifications are silently dropped' is incidental.
 

Good point, I suppose that one bug is enough.

  
  It seems to me that they should be put into a queue to be retried.
  Sending the notifications blindly is almost as bad as dropping them,
  as you have no idea if the agent is alive or not.
 
 This is more the kind of discussion I was looking for.  
 
 In the current architecture, the Neutron service handles RPC and WSGI with a 
 single process and is prone to being overloaded such that agent heartbeats 
 can be delayed beyond the limit for the agent being declared 'down'.  Even if 
 we increased the agent timeout as Yongsheg suggests, there is no guarantee 
 that we can accurately detect whether an agent is 'live' with the current 
 architecture.  Given that amqp can ensure eventual delivery - it is a queue - 
 is sending a notification blind such a bad idea?  In the best case the agent 
 isn't really down and can process the notification.  In the worst case, the 
 agent really is down but will be brought up eventually by a deployment's 
 monitoring solution and process the notification when it returns.  What am I 
 missing? 


I have not looked closely into what expectations are built into the
notification system, so I may have been off base. My understanding was
that they were not necessarily guaranteed to be delivered, but if they
are, then this is fine.

 Please consider that while a good solution will track notification delivery 
 and success, we may need 2 solutions:
 
 1. A 'good-enough', minimally-invasive stop-gap that can be back-ported to 
 grizzly and havana.


I don't know why we'd backport to grizzly. But yes, if we can get a
notable jump in reliability with a clear patch, I'm all for it.

 2. A 'best-effort' refactor that maximizes the reliability of the DHCP agent.
 
 I'm hoping that coming up with a solution to #1 will allow us the breathing 
 room to work on #2 in this cycle.


Understood, I like the short-term plan and think that, longer term, having more
CPU available to process more messages is a good thing, most likely in
the form of more worker processes.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev