Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-03-13 Thread Ihar Hrachyshka

(Sorry for reviving an old thread.)

On 01/28/2015 02:55 PM, Ihar Hrachyshka wrote:
 On 01/28/2015 09:50 AM, Kevin Benton wrote:
 Hi,
 
 Approximately a year and a half ago, the default DHCP lease time
 in Neutron was increased from 120 seconds to 86400 seconds.[1]
 This was done with the goal of reducing DHCP traffic with very
 little discussion (based on what I can see in the review and bug
 report). While it does indeed reduce DHCP traffic, I don't
 think any bug reports were filed showing that a 120 second lease
 time resulted in too much traffic or that a jump all of the way
 to 86400 seconds was required instead of a value in the same
 order of magnitude.
 
 I guess that would be a good case for FORCERENEW DHCP extension
 [1] though after digging thru dnsmasq code a bit, I doubt it
 supports the extension (though e.g. systemd dhcp client/server from
 networkd module do). Le sigh.
 
 [1]: https://tools.ietf.org/html/rfc3203
 

Note that DHCPv6 has a Reconfigure message type exactly for the case of
pushing new configuration to clients that still possess valid IA_ID
configuration. It's defined in RFC 3315, section 19 [1].

The only problem with this message type is that DHCP authentication is
mandatory for it, to avoid potential DoS attacks (a concern that is
probably not relevant in our isolated setup).

I haven't had any experience with authN for DHCP before, but afaik it
does not involve any prior data injection into clients. Correct me if
I am wrong.

[1]: http://tools.ietf.org/html/rfc3315#section-19

/Ihar

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-02-04 Thread Kevin Benton
I proposed an alternative to adjusting the lease time early on in the
thread. By specifying the renewal time (DHCP option 58), we can have the
benefits of a long lease time (resiliency to long DHCP server outages)
while having a frequent renewal interval to check for IP changes. I favored
this approach because it only required a patch to dnsmasq to allow that
option to be set and a patch to our agent to set that option, both of which
are pretty straightforward.
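
For reference, a minimal sketch of what carrying both timers in one lease
looks like on the wire. The option codes and the T1 = 0.5 * lease default
are from RFC 2132 and RFC 2131; the helper names and the 120 second renewal
value are illustrative assumptions, not something dnsmasq or the Neutron
agent exposes under these names.

```python
import struct

# DHCPv4 option codes (RFC 2132).
OPT_LEASE_TIME = 51   # IP Address Lease Time
OPT_RENEWAL_T1 = 58   # Renewal (T1) Time Value


def encode_time_option(code, seconds):
    """Encode a time option: 1-byte code, 1-byte length (4), 32-bit value."""
    return struct.pack("!BBI", code, 4, seconds)


def lease_options(lease_time, renewal_time=None):
    """Return raw option bytes for a lease (hypothetical helper).

    When renewal_time is None, option 58 is omitted and a compliant client
    falls back to the RFC 2131 default of T1 = 0.5 * lease_time.
    """
    opts = encode_time_option(OPT_LEASE_TIME, lease_time)
    if renewal_time is not None:
        if renewal_time >= lease_time:
            raise ValueError("renewal time must be shorter than the lease")
        opts += encode_time_option(OPT_RENEWAL_T1, renewal_time)
    return opts


def effective_t1(lease_time, renewal_time=None):
    """Seconds after which a compliant client first tries to renew."""
    return renewal_time if renewal_time is not None else lease_time // 2


if __name__ == "__main__":
    # Long lease for outage resiliency, short renewal for fast updates.
    print(lease_options(86400, renewal_time=120).hex())
    print(effective_t1(86400), effective_t1(86400, renewal_time=120))
```

The point of the sketch is that the two values are independent: the lease
time bounds how long a client survives a DHCP outage, while T1 bounds how
stale its configuration can get.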

- Just don't allow users to change their IPs without a reboot.

How can we do this? Call Nova from Neutron to force a reboot when the port
is updated?

- Bounce the link under the VM when the IP is changed, to force the guest
to re-request a DHCP lease immediately.

I had thought about this as well and it's the approach that I think would
be ideal, but the Nova VIF code would require changes to add support for
changing interface state. Its definition of plugging and unplugging is
actually creating and deleting the interfaces, which might not work so well
with running VMs. Then more changes would have to be done on the Nova side
to react to a port IP change notification from Neutron to trigger the
interface bounce. Finally, a small change would have to be made to Neutron
to send the IP change event to Nova.

The number of changes required on the Nova side deterred me from
pursuing it further.

- Remove the IP spoofing firewall feature

I think this makes sense as a tenant-configurable option for networks they
own, but I don't think we should throw it out. It makes for good protection
on networks facing Internet traffic that could have compromised hosts.
Along the same line, we make use of shared networks, which have other shady
tenants that might be dishonest when it comes to IP addresses.

- Make the IP spoofing firewall allow an overlap of both old and new
addresses until the DHCP lease time is up (or the instance reboots).  Adds
some additional async tasks, but this is clearly the required solution if
we want to keep all our existing features.

I didn't find a clean spot to put this. Spoofing rules are generated a long
ways away from the code that knows about IP updates. Maybe we could tack it
onto the response to the query from the agent for allowed address pairs.
Then we have to deal with persisting these temporary allowed addresses to
the DB (not a big deal, but still a schema change). Another issue here
would be if Neutron then allocated that address for another port while it
was still in use by the old node. We will probably have to block IPAM from
re-allocating that address for another port during this window as well.
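
A rough sketch of the bookkeeping this overlap approach would need, purely
as an illustration. AddressOverlapTracker and its methods are hypothetical
names, not existing Neutron code; a real implementation would persist this
in the DB and hook into IPAM and the agent's allowed-address query.

```python
import time


class AddressOverlapTracker:
    """Remembers a port's previous IP until its DHCP lease would expire."""

    def __init__(self, lease_time, clock=time.monotonic):
        self.lease_time = lease_time
        self._clock = clock
        # port_id -> (old_ip, expires_at). Only the most recent old address
        # per port is kept, which is a simplification.
        self._old = {}

    def record_ip_change(self, port_id, old_ip):
        """Call when a port's fixed IP is updated."""
        self._old[port_id] = (old_ip, self._clock() + self.lease_time)

    def instance_rebooted(self, port_id):
        """A reboot forces a fresh lease, so the overlap can end early."""
        self._old.pop(port_id, None)

    def _purge_expired(self):
        now = self._clock()
        self._old = {p: v for p, v in self._old.items() if v[1] > now}

    def allowed_ips(self, port_id, current_ip):
        """Addresses the anti-spoofing rules should accept for this port."""
        self._purge_expired()
        ips = [current_ip]
        if port_id in self._old:
            ips.append(self._old[port_id][0])
        return ips

    def is_reserved(self, ip):
        """True while IPAM must not hand this address to another port."""
        self._purge_expired()
        return any(old_ip == ip for old_ip, _ in self._old.values())
```

The clock is injectable only so the expiry behavior can be exercised
without waiting out a real lease.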

However, this doesn't solve the general slowness of DHCP info propagation
for other updates (subnet gateway change, DNS nameserver change, etc), so I
would still like to go forward with the increased renewal interval. I will
also look into eliminating the downtime completely with your last
suggestion if it can be implemented without impacting too much stuff.

On Tue, Feb 3, 2015 at 11:01 PM, Angus Lees g...@inodes.org wrote:

 There's clearly not going to be any amount of time that satisfies both
 concerns here.

 Just to get some other options on the table, here's some things that would
 allow a non-zero dhcp lease timeout _and_ address Kevin's original bug
 report:

 - Just don't allow users to change their IPs without a reboot.

 - Bounce the link under the VM when the IP is changed, to force the guest
 to re-request a DHCP lease immediately.

 - Remove the IP spoofing firewall feature  (- my favourite, for what it's
 worth. I've never liked presenting a layer2 abstraction but then forcing
 specific layer3 addressing choices by default)

 - Make the IP spoofing firewall allow an overlap of both old and new
 addresses until the DHCP lease time is up (or the instance reboots).  Adds
 some additional async tasks, but this is clearly the required solution if
 we want to keep all our existing features.

 On Wed Feb 04 2015 at 4:28:11 PM Aaron Rosen aaronoro...@gmail.com
 wrote:

 I believe I was the one who changed the default value of this. When we
 upgraded our internal cloud (~6k networks back then) from Folsom to Grizzly,
 we didn't account for the fact that if the dhcp-agents went offline,
 instances would give up their lease and unconfigure themselves, causing an
 outage. Setting a larger value for this helps to avoid this downtime (as
 Brian pointed out as well). Personally, I wouldn't really expect my instance
 to automatically change its IP - I think requiring the user to reboot the
 instance or use the console to correct the IP should be good enough,
 especially since this will buy you less downtime if an agent fails for a
 little while, which is probably more important than having the instance
 change its IP.

 Aaron

 On Tue, Feb 3, 2015 at 5:25 PM, Kevin Benton blak...@gmail.com wrote:

 I definitely understand the use-case of having updatable stuff and I
 don't intend to support any proposals to strip away 

Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-02-04 Thread Cory Benfield
On Wed, Feb 04, 2015 at 08:59:54, Kevin Benton wrote:
 I proposed an alternative to adjusting the lease time early on in the
 thread. By specifying the renewal time (DHCP option 58), we can have
 the benefits of a long lease time (resiliency to long DHCP server
 outages) while having a frequent renewal interval to check for IP
 changes. I favored this approach because it only required a patch to
 dnsmasq to allow that option to be set and a patch to our agent to set
 that option, both of which are pretty straightforward.

It's hard to see a downside to this proposal. Even if one of the other ideas 
goes forward as well, a short DHCP renewal interval feels like a very good idea 
to me.

Cory


Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-02-04 Thread Angus Lees
On Wed Feb 04 2015 at 8:02:04 PM Kevin Benton blak...@gmail.com wrote:

 I proposed an alternative to adjusting the lease time early on in the
 thread. By specifying the renewal time (DHCP option 58), we can have the
 benefits of a long lease time (resiliency to long DHCP server outages)
 while having a frequent renewal interval to check for IP changes. I favored
 this approach because it only required a patch to dnsmasq to allow that
 option to be set and a patch to our agent to set that option, both of which
 are pretty straightforward.


Yep, I should have said +1 to this in my other post.  Simple coding change
that is strictly better than the current situation (other than a slight
increase in DHCP request traffic).

 - Gus


Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-02-04 Thread Miguel Ángel Ajo


Miguel Ángel Ajo


On Wednesday, 4 February 2015 at 10:41, Cory Benfield wrote:

 On Wed, Feb 04, 2015 at 08:59:54, Kevin Benton wrote:
  I proposed an alternative to adjusting the lease time early on in the
  thread. By specifying the renewal time (DHCP option 58), we can have
  the benefits of a long lease time (resiliency to long DHCP server
  outages) while having a frequent renewal interval to check for IP
  changes. I favored this approach because it only required a patch to
  dnsmasq to allow that option to be set and a patch to our agent to set
  that option, both of which are pretty straightforward.
   
  
  
 It's hard to see a downside to this proposal. Even if one of the other ideas 
 goes forward as well, a short DHCP renewal interval feels like a very good 
 idea to me.
  
+1

I understand some DHCP clients could ignore option 58, but even then they
would still obey the longer lease time, without their behavior being affected.

So only those who really need it would have to take care to use fully compliant clients…



Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-02-03 Thread Aaron Rosen
I believe I was the one who changed the default value of this. When we
upgraded our internal cloud (~6k networks back then) from Folsom to Grizzly,
we didn't account for the fact that if the dhcp-agents went offline,
instances would give up their lease and unconfigure themselves, causing an
outage. Setting a larger value for this helps to avoid this downtime (as
Brian pointed out as well). Personally, I wouldn't really expect my instance
to automatically change its IP - I think requiring the user to reboot the
instance or use the console to correct the IP should be good enough,
especially since this will buy you less downtime if an agent fails for a
little while, which is probably more important than having the instance
change its IP.

Aaron

On Tue, Feb 3, 2015 at 5:25 PM, Kevin Benton blak...@gmail.com wrote:

 I definitely understand the use-case of having updatable stuff and I don't
 intend to support any proposals to strip away that functionality. What Brian
 was suggesting was to block port IP changes since it depended on DHCP to
 deliver that information to the hosts. I was just pointing out that we
 would need to block any API operations that resulted in different
 information being delivered via DHCP for that approach to make sense.

 On Tue, Feb 3, 2015 at 5:01 PM, Robert Collins robe...@robertcollins.net
 wrote:

 On 3 February 2015 at 00:48, Kevin Benton blak...@gmail.com wrote:
  The only thing this discussion has convinced me of is that allowing users
  to change the fixed IP address on a neutron port leads to a bad
  user-experience.
  ...

  Documenting a VM reboot is necessary, or even deprecating this (you won't
  like that) are sounding better to me by the minute.

  If this is an approach you really want to go with, then we should at least
  be consistent and deprecate the extra dhcp options extension (or at least
  the ability to update ports' dhcp options). Updating subnet attributes like
  gateway_ip, dns_nameservers, and host_routes should be thrown out as well.
  All of these things depend on the DHCP server to deliver updated information
  and are hindered by renewal times. Why discriminate against IP updates on a
  port? A failure to receive many of those other types of changes could result
  in just as severe of a connection disruption.

 So the reason we added the extra dhcp options extension was to support
 PXE booting physical machines for Nova baremetal, and then Ironic. It
 wasn't added for end users to use on the port, but as a generic way of
 supporting the specific PXE options needed - and that was done that
 way after discussing w/Neutron devs.

 We update ports for two reasons. Primarily, Ironic is HA and will move
 the TFTPd that boots are happening from if an Ironic node has failed.
 Secondly, because a not uncommon operation on physical machines is to
 replace broken NICs, and forcing a redeploy seemed unreasonable. The
 former case doesn't affect running nodes since it's only consulted on
 reboot. The second case is by definition only possible when the NIC in
 question is offline (whether hotplug hardware or not).

 -Rob


 --
 Robert Collins rbtcoll...@hp.com
 Distinguished Technologist
 HP Converged Cloud





 --
 Kevin Benton





Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-02-03 Thread Angus Lees
There's clearly not going to be any amount of time that satisfies both
concerns here.

Just to get some other options on the table, here's some things that would
allow a non-zero dhcp lease timeout _and_ address Kevin's original bug
report:

- Just don't allow users to change their IPs without a reboot.

- Bounce the link under the VM when the IP is changed, to force the guest
to re-request a DHCP lease immediately.

- Remove the IP spoofing firewall feature  (- my favourite, for what it's
worth. I've never liked presenting a layer2 abstraction but then forcing
specific layer3 addressing choices by default)

- Make the IP spoofing firewall allow an overlap of both old and new
addresses until the DHCP lease time is up (or the instance reboots).  Adds
some additional async tasks, but this is clearly the required solution if
we want to keep all our existing features.

On Wed Feb 04 2015 at 4:28:11 PM Aaron Rosen aaronoro...@gmail.com wrote:

 I believe I was the one who changed the default value of this. When we
 upgraded our internal cloud (~6k networks back then) from Folsom to Grizzly,
 we didn't account for the fact that if the dhcp-agents went offline,
 instances would give up their lease and unconfigure themselves, causing an
 outage. Setting a larger value for this helps to avoid this downtime (as
 Brian pointed out as well). Personally, I wouldn't really expect my instance
 to automatically change its IP - I think requiring the user to reboot the
 instance or use the console to correct the IP should be good enough,
 especially since this will buy you less downtime if an agent fails for a
 little while, which is probably more important than having the instance
 change its IP.

 Aaron

 On Tue, Feb 3, 2015 at 5:25 PM, Kevin Benton blak...@gmail.com wrote:

 I definitely understand the use-case of having updatable stuff and I
 don't intend to support any proposals to strip away that functionality.
 What Brian was suggesting was to block port IP changes since it depended on DHCP
 to deliver that information to the hosts. I was just pointing out that we
 would need to block any API operations that resulted in different
 information being delivered via DHCP for that approach to make sense.

 On Tue, Feb 3, 2015 at 5:01 PM, Robert Collins robe...@robertcollins.net
  wrote:

 On 3 February 2015 at 00:48, Kevin Benton blak...@gmail.com wrote:
  The only thing this discussion has convinced me of is that allowing users
  to change the fixed IP address on a neutron port leads to a bad
  user-experience.
  ...

  Documenting a VM reboot is necessary, or even deprecating this (you won't
  like that) are sounding better to me by the minute.

  If this is an approach you really want to go with, then we should at least
  be consistent and deprecate the extra dhcp options extension (or at least
  the ability to update ports' dhcp options). Updating subnet attributes like
  gateway_ip, dns_nameservers, and host_routes should be thrown out as well.
  All of these things depend on the DHCP server to deliver updated information
  and are hindered by renewal times. Why discriminate against IP updates on a
  port? A failure to receive many of those other types of changes could result
  in just as severe of a connection disruption.

 So the reason we added the extra dhcp options extension was to support
 PXE booting physical machines for Nova baremetal, and then Ironic. It
 wasn't added for end users to use on the port, but as a generic way of
 supporting the specific PXE options needed - and that was done that
 way after discussing w/Neutron devs.

 We update ports for two reasons. Primarily, Ironic is HA and will move
 the TFTPd that boots are happening from if an Ironic node has failed.
 Secondly, because a not uncommon operation on physical machines is to
 replace broken NICs, and forcing a redeploy seemed unreasonable. The
 former case doesn't affect running nodes since it's only consulted on
 reboot. The second case is by definition only possible when the NIC in
 question is offline (whether hotplug hardware or not).

 -Rob


 --
 Robert Collins rbtcoll...@hp.com
 Distinguished Technologist
 HP Converged Cloud






 --
 Kevin Benton



 

Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-02-03 Thread Brian Haley
On 02/03/2015 05:10 AM, Kevin Benton wrote:
The unicast DHCP will make it to the wire, but if you've renumbered the
 subnet either a) the DHCP server won't respond because its IP has changed as
 well; or b) the DHCP server won't respond because there is no mapping for the
 VM on its old subnet.
 
 We aren't changing the DHCP server's IP here. The process that I saw was to
 add a subnet and start moving VMs over. It's not 'b' either, because the
 server generates a DHCPNAK in response, which will immediately cause the
 client to release/renew. I have verified this behavior already and recorded a
 packet capture for you.[1]
 
 In the capture, the renewal value is 4 seconds. I captured one renewal before
 the IP address change from 99.99.99.5 to 10.0.0.25 took place. You can see on
 the next renewal, the DHCP server immediately generates a NACK. The client
 then releases its address, requests a new one, assigns it and ACKs within a
 couple of seconds.

Thanks for the trace.  So one thing I noticed is that this unicast DHCP only got
to the server since you created a second subnet on this network (dest MAC of
packet was that of same router interface).  If you had created a second network
and subnet this would have been dropped (different broadcast domain).  These
little differences are things users need to know because they lead to heads
banging on desks :(

This would happen if the AZ their VM was in went offline as well, at which
 point they would change their design to be more cloud-aware than it was.
 Let's not heap all the blame on neutron - the user is tasked with vetting that
 their decisions meet the requirements they desire by thoroughly testing it.
 
 An availability zone going offline is not the same as an API operation that
 takes a day to apply. In an internal cloud, maintenance for AZs can be
 advertised and planned around by tenants running single-AZ services. Even if
 you want to reference a public cloud, look how much of the Internet breaks when
 Amazon's us-east-1a or us-east-1d AZs have issues. Even though people are
 supposed to be bringing cattle to the cloud, a huge portion already have pets
 that they are attached to or that they can't convert into cattle.

You completely missed the context of my reply Kevin - an AZ failure is not a
planned event.  You said people bring pets along, and rebooting them is painful.
 I said that's a bad design because other things can cause it to go offline, for
example:

1. Compute node failure
2. Network node failure
3. Router/switch failure
4. Internet failure
...
99. API call

All the user knows is they can't reach their VM - the cause doesn't matter when
they can't sell their widgets to customers because their site is down.  If it
takes 10 minutes for them to re-create their instance elsewhere that cannot be
blamed on neutron, even if it was our API call that caused it to go offline.

 If our floating IP 'associate' action took 12 hours to take effect on a
 running instance, would telling users to reboot their instances to apply
 floating IPs faster be okay? I would certainly heap the blame on Neutron there.

The difference in a port IP change API call is that it requires action on the
VM's part that neutron can't trigger immediately.  It's still asynchronous like a
floating IP call, but the delay is typically going to be longer.  All we can say
is it will take from (0 - interval) seconds.  How is warning the user about
this a bad thing?

How about a big (*) next to all the things that could cause issues?  :)
 
 You want to put it next to all of the API calls to put the burden on the
 users. I want to put it next to the DHCP renewal interval in the config files
 to put the burden on the operators. :)
 
 (*) Increasing this value will increase the delay between API calls and when
 they take effect on the data plane for any that depend on DHCP to relay the
 information. (e.g. port IP/subnet changes, port dhcp option changes, subnet
 gateways, subnet routes, subnet DNS servers, etc)

There is no delay in the API call here, the port was updated just as the user
requested.  Since they can't see into my config file (unless they look at their
lease info or run a tcpdump trace) they are essentially making a blind change
that immediately affects their instance.

And adding a DHCP option to tell them to renew more frequently doesn't fix the
problem, it only lessens it to ~(interval/2) - that might not be acceptable to
users and they need to know the danger.  This is the one point I've been trying
to get across in this whole discussion - these are advanced options that users
need to take caution with, neutron can only do so much.
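
To put rough numbers on the (0 - interval) window, here is a small sketch.
It assumes the port update lands at a uniformly random point in the
client's renewal cycle and that the client picks up the change at its next
renewal; the function names are made up for this example.

```python
import random


def worst_case_delay(renewal_interval):
    """Update lands right after a renewal: wait a full interval."""
    return renewal_interval


def expected_delay(renewal_interval):
    """Mean wait when the update time is uniform over the interval."""
    return renewal_interval / 2


def simulate_mean_delay(renewal_interval, trials=100_000, seed=0):
    """Monte Carlo check of expected_delay under the same assumption."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        update_at = rng.uniform(0, renewal_interval)
        total += renewal_interval - update_at  # time until next renewal
    return total / trials


if __name__ == "__main__":
    # With an 86400 s lease and no option 58, clients renew at T1 = 43200 s.
    for interval in (43200, 120):
        print(interval, worst_case_delay(interval), expected_delay(interval))
```

So a shorter renewal interval shrinks both the average and the worst case
proportionally, but it never brings either to zero.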

-Brian


 1. http://paste.openstack.org/show/166048/
 
 
 On Mon, Feb 2, 2015 at 8:57 AM, Brian Haley brian.ha...@hp.com wrote:
 
 Kevin,
 
 I think we are finally converging.  One of the points I've been trying to 
 make
 is 

Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-02-03 Thread Robert Collins
On 3 February 2015 at 00:48, Kevin Benton blak...@gmail.com wrote:
The only thing this discussion has convinced me of is that allowing users
 to change the fixed IP address on a neutron port leads to a bad
 user-experience.
...

Documenting a VM reboot is necessary, or even deprecating this (you won't
 like that) are sounding better to me by the minute.

 If this is an approach you really want to go with, then we should at least
 be consistent and deprecate the extra dhcp options extension (or at least
 the ability to update ports' dhcp options). Updating subnet attributes like
 gateway_ip, dns_nameservers, and host_routes should be thrown out as well.
 All of these things depend on the DHCP server to deliver updated information
 and are hindered by renewal times. Why discriminate against IP updates on a
 port? A failure to receive many of those other types of changes could result
 in just as severe of a connection disruption.

So the reason we added the extra dhcp options extension was to support
PXE booting physical machines for Nova baremetal, and then Ironic. It
wasn't added for end users to use on the port, but as a generic way of
supporting the specific PXE options needed - and that was done that
way after discussing w/Neutron devs.

We update ports for two reasons. Primarily, Ironic is HA and will move
the TFTPd that boots are happening from if an Ironic node has failed.
Secondly, because a not uncommon operation on physical machines is to
replace broken NICs, and forcing a redeploy seemed unreasonable. The
former case doesn't affect running nodes since it's only consulted on
reboot. The second case is by definition only possible when the NIC in
question is offline (whether hotplug hardware or not).

-Rob


-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud



Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-02-03 Thread Kevin Benton
 If you had created a second network and subnet this would have been
dropped (different broadcast domain).

Well that update wouldn't have been allowed at the API. You can't use a
fixed IP from a subnet on a network that your port isn't attached to.
Changing a neutron port to a different network is not what we are talking
about here.

 I said that's a bad design because other things can cause it to go
offline, for example:

Yet people do it anyway, which is why I referenced the EC2 example. People
can deal with outages caused by unexpected failures. The outage we are
talking about is part of a normal API call and it doesn't make any sense to
the user.

 If it takes 10 minutes for them to re-create their instance elsewhere
that cannot be blamed on neutron, even if it was our API call that caused
it to go offline.

The outage can still be blamed on Neutron. What you are implying here is
that instead of improving the usability of Neutron, we just give up and
tell users that they should have known better. I don't like supporting a
project with that kind of approach to usability. It leads to unhappy users
and it reflects poorly on the quality of the project.

 The difference in a port IP change API call is that it requires action on
the VM's part that neutron can't trigger immediately.

We know why these are different because we understand how Neutron works
internally, but there is no reason to think that a user would know why
these are different. From a user's perspective, one API call to change an
IP (floating IP) works as expected, the other has a huge variable delay
(port IP).

 How is warning the user about this a bad thing?

We can and should make a note of this behavior, but it's not enough IMO.
Users don't read the documentation for these kinds of things until they hit
an issue. We can update the Neutron server to return the DHCP interval to
the Neutron client and update the client to output these warnings, but it's
still a bit late at that point since we are telling the user, "You just
broke your VM for 0-$(1/2 dhcp lease) hours. If you need it sooner,
hopefully you have console access or are fine with a forced restart."

 There is no delay in the API call here, the port was updated just as the
user requested.

I never said there was a delay in the API call. I am talking about how long
it takes for that to take effect on the data plane. For it to take full
effect, the VMs need to get the information from the DHCP server. The long
default lease we have now means they won't get the information for hours on
average, which is the long delay I am referring to.


 And adding a DHCP option to tell them to renew more frequently doesn't fix
the problem, it only lessens it to ~(interval/2) - that might not be
acceptable to users and they need to know the danger.

In the very first email in this thread, I pointed out that this is only
reducing the time. I don't think that was ever up for debate. The danger
exists already and warning them with whatever mechanism you had in mind
is orthogonal to my proposal to reduce the downtime.

 This is the one point I've been trying to get across in this whole
discussion - these are advanced options that users need to take caution
with, neutron can only do so much.

Neutron is completely responsible for the management of the DHCP server in
this case. We have a lot of room for improvement here. I don't think we
should throw in the towel yet.

On Tue, Feb 3, 2015 at 8:53 AM, Brian Haley brian.ha...@hp.com wrote:

 On 02/03/2015 05:10 AM, Kevin Benton wrote:
 The unicast DHCP will make it to the wire, but if you've renumbered the
  subnet either a) the DHCP server won't respond because its IP has changed
  as well; or b) the DHCP server won't respond because there is no mapping
  for the VM on its old subnet.
 
  We aren't changing the DHCP server's IP here. The process that I saw was
  to add a subnet and start moving VMs over. It's not 'b' either, because the
  server generates a DHCPNAK in response, which will immediately cause the
  client to release/renew. I have verified this behavior already and recorded
  a packet capture for you.[1]
 
  In the capture, the renewal value is 4 seconds. I captured one renewal
  before the IP address change from 99.99.99.5 to 10.0.0.25 took place. You
  can see on the next renewal, the DHCP server immediately generates a NACK.
  The client then releases its address, requests a new one, assigns it and
  ACKs within a couple of seconds.

 Thanks for the trace.  So one thing I noticed is that this unicast DHCP
 only got to the server since you created a second subnet on this network
 (dest MAC of packet was that of same router interface).  If you had created
 a second network and subnet this would have been dropped (different
 broadcast domain).  These little differences are things users need to know
 because they lead to heads banging on desks :(

 This would happen if the AZ their VM was in went offline as well, at
 which
  point 

Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-02-03 Thread Kevin Benton
The unicast DHCP will make it to the wire, but if you've renumbered the
subnet either a) the DHCP server won't respond because its IP has changed
as well; or b) the DHCP server won't respond because there is no mapping
for the VM on its old subnet.

We aren't changing the DHCP server's IP here. The process that I saw was to
add a subnet and start moving VMs over. It's not 'b' either, because the
server generates a DHCPNAK in response, which will immediately cause the
client to release/renew. I have verified this behavior already and recorded
a packet capture for you.[1]

In the capture, the renewal value is 4 seconds. I captured one renewal
before the IP address change from 99.99.99.5 to 10.0.0.25 took place. You
can see on the next renewal, the DHCP server immediately generates a NACK.
The client then releases its address, requests a new one, assigns it and
ACKs within a couple of seconds.

This would happen if the AZ their VM was in went offline as well, at which
point they would change their design to be more cloud-aware than it was.
Let's not heap all the blame on neutron - the user is tasked with vetting
that their decisions meet the requirements they desire by thoroughly
testing it.

An availability zone going offline is not the same as an API operation that
takes a day to apply. In an internal cloud, maintenance for AZs can be
advertised and planned around by tenants running single-AZ services. Even
if you want to reference a public cloud, look how much of the Internet
breaks when Amazon's us-east-1a or us-east-1d AZs have issues. Even though
people are supposed to be bringing cattle to the cloud, a huge portion
already have pets that they are attached to or that they can't convert into
cattle.

If our floating IP 'associate' action took 12 hours to take effect on a
running instance, would telling users to reboot their instances to apply
floating IPs faster be okay? I would certainly heap the blame on Neutron
there.


How about a big (*) next to all the things that could cause issues?  :)

You want to put it next to all of the API calls to put the burden on the
users. I want to put it next to the DHCP renewal interval in the config
files to put the burden on the operators. :)

(*) Increasing this value will increase the delay between API calls and
when they take effect on the data plane for any that depend on DHCP to
relay the information. (e.g. port IP/subnet changes, port dhcp option
changes, subnet gateways, subnet routes, subnet DNS servers, etc)

1. http://paste.openstack.org/show/166048/


On Mon, Feb 2, 2015 at 8:57 AM, Brian Haley brian.ha...@hp.com wrote:

 Kevin,

 I think we are finally converging.  One of the points I've been trying to
 make is that users are playing with fire when they start playing with some
 of these port attributes, and given the tool we have to work with (DHCP),
 the instantiation of these changes cannot be made seamlessly to a VM.
 That's life in the cloud, and most of these things can (and should) be
 designed around.

 On 02/02/2015 06:48 AM, Kevin Benton wrote:
  The only thing this discussion has convinced me of is that allowing users
  to change the fixed IP address on a neutron port leads to a bad
  user-experience.
 
  Not as bad as having to delete a port and create another one on the same
  network just to change addresses though...
 
  Even with an 8-minute renew time you're talking up to a 7-minute blackout
  (87.5% of lease time before using broadcast).
 
  I suggested 240 seconds renewal time, which is up to 4 minutes of
  connectivity outage. This doesn't have anything to do with lease time and
  unicast DHCP will work because the spoof rules allow DHCP client traffic
  before restricting to specific IPs.

 The unicast DHCP will make it to the wire, but if you've renumbered the
 subnet either a) the DHCP server won't respond because its IP has changed
 as well; or b) the DHCP server won't respond because there is no mapping
 for the VM on its old subnet.

  Most would have rebooted long before then, true?  Cattle not pets, right?
 
  Only in an ideal world that I haven't encountered with customer
  deployments. Many enterprise deployments end up bringing pets along where
  reboots aren't always free. The time taken to relaunch programs and restore
  state can end up being 10 minutes+ if it's something like a VDI deployment
  or dev environment where someone spends a lot of time working on one VM.

 This would happen if the AZ their VM was in went offline as well, at which
 point they would change their design to be more cloud-aware than it was.
 Let's not heap all the blame on neutron - the user is tasked with vetting
 that their decisions meet the requirements they desire by thoroughly
 testing it.

  Changing the lease time is just papering-over the real bug - neutron
  doesn't support seamless changes in IP addresses on ports, since it
  totally relies on the dhcp configuration settings a deployer has chosen.
 
  It doesn't 

Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-02-03 Thread Kevin Benton
I definitely understand the use-case of having updatable stuff and I don't
intend to support any proposals to strip away that functionality. Brian was
suggesting that we block port IP changes since they depend on DHCP to
deliver that information to the hosts. I was just pointing out that we
would need to block any API operations that resulted in different
information being delivered via DHCP for that approach to make sense.

On Tue, Feb 3, 2015 at 5:01 PM, Robert Collins robe...@robertcollins.net
wrote:

 On 3 February 2015 at 00:48, Kevin Benton blak...@gmail.com wrote:
 The only thing this discussion has convinced me of is that allowing users
  to change the fixed IP address on a neutron port leads to a bad
  user-experience.
 ...

 Documenting a VM reboot is necessary, or even deprecating this (you won't
  like that) are sounding better to me by the minute.
 
  If this is an approach you really want to go with, then we should at least
  be consistent and deprecate the extra dhcp options extension (or at least
  the ability to update ports' dhcp options). Updating subnet attributes like
  gateway_ip, dns_nameservers, and host_routes should be thrown out as well.
  All of these things depend on the DHCP server to deliver updated
  information and are hindered by renewal times. Why discriminate against IP
  updates on a port? A failure to receive many of those other types of
  changes could result in just as severe of a connection disruption.

 So the reason we added the extra dhcp options extension was to support
 PXE booting physical machines for Nova baremetal, and then Ironic. It
 wasn't added for end users to use on the port, but as a generic way of
 supporting the specific PXE options needed - and that was done that
 way after discussing w/Neutron devs.

 We update ports for two reasons. Primarily, Ironic is HA and will move
 the TFTPd that boots are happening from if an Ironic node has failed.
 Secondly, because a not uncommon operation on physical machines is to
 replace broken NICs, and forcing a redeploy seemed unreasonable. The
 former case doesn't affect running nodes since it's only consulted on
 reboot. The second case is by definition only possible when the NIC in
 question is offline (whether hotplug hardware or not).

 -Rob


 --
 Robert Collins rbtcoll...@hp.com
 Distinguished Technologist
 HP Converged Cloud

 __
 OpenStack Development Mailing List (not for usage questions)
 Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




-- 
Kevin Benton


Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-02-02 Thread Kevin Benton
The only thing this discussion has convinced me of is that allowing users
to change the fixed IP address on a neutron port leads to a bad
user-experience.

Not as bad as having to delete a port and create another one on the same
network just to change addresses though...

Even with an 8-minute renew time you're talking up to a 7-minute blackout
(87.5% of lease time before using broadcast).

I suggested 240 seconds renewal time, which is up to 4 minutes of
connectivity outage. This doesn't have anything to do with lease time and
unicast DHCP will work because the spoof rules allow DHCP client traffic
before restricting to specific IPs.

 Most would have rebooted long before then, true?  Cattle not pets, right?

Only in an ideal world that I haven't encountered with customer
deployments. Many enterprise deployments end up bringing pets along where
reboots aren't always free. The time taken to relaunch programs and restore
state can end up being 10 minutes+ if it's something like a VDI deployment
or dev environment where someone spends a lot of time working on one VM.

Changing the lease time is just papering-over the real bug - neutron
doesn't support seamless changes in IP addresses on ports, since it totally
relies on the dhcp configuration settings a deployer has chosen.

It doesn't need to be seamless, but it certainly shouldn't be useless.
Connectivity interruptions can be expected with IP changes (e.g. I've seen
changes in elastic IPs on EC2 interrupt connectivity to an instance for
up to 2 minutes), but an entire day of downtime is awful.

One of the things I'm getting at is that a deployer shouldn't be choosing
such high lease times and we are encouraging it with a high default. You
are arguing for infrequent renewals to work around excessive logging, which
is just an implementation problem that should be addressed with a patch to
your logging collector (de-duplication) or to dnsmasq (don't log renewals).

Documenting a VM reboot is necessary, or even deprecating this (you won't
like that) are sounding better to me by the minute.

If this is an approach you really want to go with, then we should at least
be consistent and deprecate the extra dhcp options extension (or at least
the ability to update ports' dhcp options). Updating subnet attributes like
gateway_ip, dns_nameservers, and host_routes should be thrown out as well.
All of these things depend on the DHCP server to deliver updated
information and are hindered by renewal times. Why discriminate against IP
updates on a port? A failure to receive many of those other types of
changes could result in just as severe of a connection disruption.


In summary, the information the DHCP server gives to clients is not static.
Unless we eliminate updates to everything in the Neutron API that results
in different DHCP lease information, my suggestion is that we include a new
option for the renewal interval and have the default set to 5 minutes. We can
leave the lease default at 1 day so the amount of time a DHCP server can be
offline without impacting running clients can stay the same.
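
The trade-off in that summary can be sketched with a little arithmetic. This is only a rough model: it assumes the RFC 2131 default timers (T1 at 50% and T2 at 87.5% of the lease) when no renewal time is supplied, and that a renewal attempt actually reaches a server that answers it. The names `DhcpTimers`, `timers`, `worst_case_update_delay` and `guaranteed_outage_tolerance` are made up for illustration and are not Neutron or dnsmasq options.

```python
from typing import NamedTuple, Optional


class DhcpTimers(NamedTuple):
    """Client-side timers in seconds (hypothetical helper, not a Neutron API)."""
    renew: float   # T1: client starts unicast renewal attempts
    rebind: float  # T2: client falls back to broadcast
    lease: float   # address expires


def timers(lease: float, renew: Optional[float] = None) -> DhcpTimers:
    # Assumed RFC 2131 defaults when the server sends no explicit T1:
    # T1 = 0.5 * lease, T2 = 0.875 * lease.
    t1 = 0.5 * lease if renew is None else renew
    return DhcpTimers(renew=t1, rebind=0.875 * lease, lease=lease)


def worst_case_update_delay(t: DhcpTimers) -> float:
    # A port update that lands right after a renewal waits one full T1
    # before the client asks the server again.
    return t.renew


def guaranteed_outage_tolerance(t: DhcpTimers) -> float:
    # A client last renewed at most T1 ago, so at least lease - T1 seconds
    # remain on its lease when the DHCP server goes down.
    return t.lease - t.renew


if __name__ == "__main__":
    for label, t in (("current default", timers(86400)),
                     ("5 minute renewal", timers(86400, renew=300))):
        print(label,
              "update delay up to", worst_case_update_delay(t), "s;",
              "server can be down at least", guaranteed_outage_tolerance(t), "s")
```

Under these assumptions the one-day default gives a worst-case delay of 43200 seconds (12 hours), while a 300 second renewal with the same lease gives 300 seconds and leaves the tolerated DHCP server downtime governed by the lease rather than the renewal interval.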

On Fri, Jan 30, 2015 at 8:00 AM, Brian Haley brian.ha...@hp.com wrote:

 Kevin,

 The only thing this discussion has convinced me of is that allowing users
 to change the fixed IP address on a neutron port leads to a bad
 user-experience. Even with an 8-minute renew time you're talking up to a
 7-minute blackout (87.5% of lease time before using broadcast).  This is
 time that customers are paying for.  Most would have rebooted long before
 then, true?  Cattle not pets, right?

 Changing the lease time is just papering-over the real bug - neutron
 doesn't support seamless changes in IP addresses on ports, since it totally
 relies on the dhcp configuration settings a deployer has chosen.  Bickering
 over the lease time doesn't fix that non-deterministic recovery for the VM.
 Documenting a VM reboot is necessary, or even deprecating this (you won't
 like that) are sounding better to me by the minute.

 Is there anyone else that has used, or has customers using, this part of
 the neutron API?  Can they share their experiences?

 -Brian


 On 01/30/2015 07:26 AM, Kevin Benton wrote:
 But they will if we document it well, which is what Salvatore suggested.
 
  I don't think this is a good approach, and it's a big part of why I
  started this thread. Most of the deployers/operators I have worked with
  only read the bare minimum documentation to get a Neutron deployment
  working and they only adjust the settings necessary for basic
  functionality.
 
  We have an overwhelming amount of configuration options and adding a note
  specifying that a particular setting for DHCP leases has been optimized
  to reduce logging at the cost of long downtimes during port IP address
  updates is a waste of time and effort on our part.
 
 I think the current default value is also more indicative of something
  you'd find in your house, or at work - i.e. stable networks.
 
  

Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-02-02 Thread Brian Haley
Kevin,

I think we are finally converging.  One of the points I've been trying to make
is that users are playing with fire when they start playing with some of these
port attributes, and given the tool we have to work with (DHCP), the
instantiation of these changes cannot be made seamlessly to a VM.  That's life
in the cloud, and most of these things can (and should) be designed around.

On 02/02/2015 06:48 AM, Kevin Benton wrote:
 The only thing this discussion has convinced me of is that allowing users
 to change the fixed IP address on a neutron port leads to a bad
 user-experience.
 
 Not as bad as having to delete a port and create another one on the same
 network just to change addresses though...
 
 Even with an 8-minute renew time you're talking up to a 7-minute blackout
 (87.5% of lease time before using broadcast).
 
 I suggested 240 seconds renewal time, which is up to 4 minutes of
 connectivity outage. This doesn't have anything to do with lease time and
 unicast DHCP will work because the spoof rules allow DHCP client traffic
 before restricting to specific IPs.

The unicast DHCP will make it to the wire, but if you've renumbered the subnet
either a) the DHCP server won't respond because its IP has changed as well; or
b) the DHCP server won't respond because there is no mapping for the VM on its
old subnet.

 Most would have rebooted long before then, true?  Cattle not pets, right?
 
 Only in an ideal world that I haven't encountered with customer deployments. 
 Many enterprise deployments end up bringing pets along where reboots aren't 
 always free. The time taken to relaunch programs and restore state can end
 up being 10 minutes+ if it's something like a VDI deployment or dev
 environment where someone spends a lot of time working on one VM.

This would happen if the AZ their VM was in went offline as well, at which point
they would change their design to be more cloud-aware than it was.  Let's not
heap all the blame on neutron - the user is tasked with vetting that their
decisions meet the requirements they desire by thoroughly testing it.

 Changing the lease time is just papering-over the real bug - neutron
 doesn't support seamless changes in IP addresses on ports, since it totally 
 relies on the dhcp configuration settings a deployer has chosen.
 
 It doesn't need to be seamless, but it certainly shouldn't be useless. 
 Connectivity interruptions can be expected with IP changes (e.g. I've seen 
 changes in elastic IPs on EC2 can interrupt connectivity to an instance for
 up to 2 minutes), but an entire day of downtime is awful.

Yes, I agree, an entire day of downtime is bad.

 One of the things I'm getting at is that a deployer shouldn't be choosing
 such high lease times and we are encouraging it with a high default. You are
 arguing for infrequent renewals to work around excessive logging, which is
 just an implementation problem that should be addressed with a patch to your
 logging collector (de-duplication) or to dnsmasq (don't log renewals).

My #1 deployment problem was around control-plane upgrade, not logging:

During a control-plane upgrade or outage, having a short DHCP lease time will
take all your VMs offline.  The old value of 2 minutes is not a realistic value
for an upgrade, and I don't think 8 minutes is much better.  Yes, when DHCP is
down you can't boot a new VM, but as long as customers can get to their existing
VMs they're pretty happy and won't scream bloody murder.

 Documenting a VM reboot is necessary, or even deprecating this (you won't
 like that) are sounding better to me by the minute.
 
 If this is an approach you really want to go with, then we should at least
 be consistent and deprecate the extra dhcp options extension (or at least
 the ability to update ports' dhcp options). Updating subnet attributes like 
 gateway_ip, dns_nameservers, and host_routes should be thrown out as well. All
 of these things depend on the DHCP server to deliver updated information and
 are hindered by renewal times. Why discriminate against IP updates on a port?
 A failure to receive many of those other types of changes could result in
 just as severe of a connection disruption.

How about a big (*) next to all the things that could cause issues?  :)  We've
completely loaded the gun exposing all these attributes to the general user
when only the network-aware power-user should be playing with them.

(*) Changing these attributes could cause VMs to become unresponsive for a long
period of time depending on the deployment settings, and should be used with
caution.  Sometimes a VM reboot will be required to re-gain connectivity.

 In summary, the information the DHCP server gives to clients is not static. 
 Unless we eliminate updates to everything in the Neutron API that results in 
 different DHCP lease information, my suggestion is that we include a new
 option for the renewal interval and have the default set 5 minutes. We can
 leave the lease default to 1 day so the 

Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-30 Thread Brian Haley
Kevin,

The only thing this discussion has convinced me of is that allowing users to
change the fixed IP address on a neutron port leads to a bad user-experience.
Even with an 8-minute renew time you're talking up to a 7-minute blackout (87.5%
of lease time before using broadcast).  This is time that customers are paying
for.  Most would have rebooted long before then, true?  Cattle not pets, right?

Changing the lease time is just papering-over the real bug - neutron doesn't
support seamless changes in IP addresses on ports, since it totally relies on
the dhcp configuration settings a deployer has chosen.  Bickering over the lease
time doesn't fix that non-deterministic recovery for the VM.  Documenting a VM
reboot is necessary, or even deprecating this (you won't like that) are sounding
better to me by the minute.

Is there anyone else that has used, or has customers using, this part of the
neutron API?  Can they share their experiences?

-Brian


On 01/30/2015 07:26 AM, Kevin Benton wrote:
But they will if we document it well, which is what Salvatore suggested.
 
 I don't think this is a good approach, and it's a big part of why I started
 this thread. Most of the deployers/operators I have worked with only read
 the bare minimum documentation to get a Neutron deployment working and they
 only adjust the settings necessary for basic functionality.
 
 We have an overwhelming amount of configuration options and adding a note
 specifying that a particular setting for DHCP leases has been optimized to
 reduce logging at the cost of long downtimes during port IP address updates
 is a waste of time and effort on our part.
 
I think the current default value is also more indicative of something
 you'd find in your house, or at work - i.e. stable networks.
 
 Tenants don't care what the DHCP lease time is or that it matches what they
 would see from a home router. They only care about connectivity. 
 
One solution is to disallow this operation.
 
 I want this feature to be useful in deployments by default, not strip it
 away. You can probably do this with /etc/neutron/policy.json without a code
 change if you wanted to block it in a deployment like yours where you have
 such a high lease time.
 
Perhaps letting the user set it, but allow the admin to set the valid range
 for min/max?  And if they don't specify they get the default?
 
 Tenants wouldn't have any reason to adjust this default. They would be even
 less likely than the operator to know about this weird relationship between
 a DHCP setting and the amount of time they lose connectivity after updating
 their ports' IPs.
 
It impacts anyone that hasn't changed from the default since July 2013 and
 later (Havana), since if they don't notice, they might get bitten by it.
 
 Keep in mind that what I am suggesting with the lease-renewal-time would be
 separate from the lease expiration time. The only difference that an
 operator would see on upgrade (if using the defaults) is increased DHCP
 traffic and more logs to syslog from dnsmasq. The lease time would still be
 the same so the downtime windows for DHCP agents would be maintained. That
 is much less of an impact than many of the non-config changes we make
 between cycles.
 
 To clarify, even with an option for dhcp-renewal-time I am proposing, you are
 still opposed to setting it to anything low because of logging and the ~24 bps
 background DHCP traffic per VM?
 
 On Thu, Jan 29, 2015 at 7:11 PM, Brian Haley brian.ha...@hp.com wrote:
 
 On 01/29/2015 05:28 PM, Kevin Benton wrote:
 How is Neutron breaking this?  If I move a port on my physical switch to a
  different subnet, can you still communicate with the host sitting on it?
  Probably not since it has a view of the world (next-hop router) that no
  longer exists, and the network won't route packets for its old IP address
  to the new location.  It has to wait for its current DHCP lease to tick
  down to the point where it will use broadcast to get a new one, after which
  point it will work.
 
  That's not just moving to a different subnet. That's moving to a different
  broadcast domain. Neutron supports multiple subnets per network (broadcast
  domain). An address on either subnet will work. The router has two
  interfaces into the network, one on each subnet.[2]
 
 
 Does it work on Windows VMs too?  People run those in clouds too.  The
  point is that if we don't know if all the DHCP clients will support it then
  it's a non-starter since there's no way to tell from the server side.
 
  It appears they do.[1] Even for clients that don't, the worst case scenario
  is just that they are stuck where we are now.
 
 ... then the deployer can adjust the value upwards..., hmm, can they adjust
  it downwards as well?  :)
 
  Yes, but most people doing initial openstack deployments don't and 

Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-30 Thread Kevin Benton
But they will if we document it well, which is what Salvatore suggested.

I don't think this is a good approach, and it's a big part of why I started
this thread. Most of the deployers/operators I have worked with only read
the bare minimum documentation to get a Neutron deployment working and they
only adjust the settings necessary for basic functionality.

We have an overwhelming amount of configuration options and adding a note
specifying that a particular setting for DHCP leases has been optimized to
reduce logging at the cost of long downtimes during port IP address updates
is a waste of time and effort on our part.

I think the current default value is also more indicative of something
you'd find in your house, or at work - i.e. stable networks.

Tenants don't care what the DHCP lease time is or that it matches what they
would see from a home router. They only care about connectivity.

One solution is to disallow this operation.

I want this feature to be useful in deployments by default, not strip it
away. You can probably do this with /etc/neutron/policy.json without a code
change if you wanted to block it in a deployment like yours where you have
such a high lease time.

Perhaps letting the user set it, but allow the admin to set the valid
range for min/max?  And if they don't specify they get the default?

Tenants wouldn't have any reason to adjust this default. They would be even
less likely than the operator to know about this weird relationship between
a DHCP setting and the amount of time they lose connectivity after updating
their ports' IPs.

It impacts anyone that hasn't changed from the default since July 2013 and
later (Havana), since if they don't notice, they might get bitten by it.

Keep in mind that what I am suggesting with the lease-renewal-time would be
separate from the lease expiration time. The only difference that an
operator would see on upgrade (if using the defaults) is increased DHCP
traffic and more logs to syslog from dnsmasq. The lease time would still be
the same so the downtime windows for DHCP agents would be maintained. That
is much less of an impact than many of the non-config changes we make
between cycles.

To clarify, even with an option for dhcp-renewal-time I am proposing, you
are still opposed to setting it to anything low because of logging and the
~24 bps background DHCP traffic per VM?

On Thu, Jan 29, 2015 at 7:11 PM, Brian Haley brian.ha...@hp.com wrote:

 On 01/29/2015 05:28 PM, Kevin Benton wrote:
 How is Neutron breaking this?  If I move a port on my physical switch to a
  different subnet, can you still communicate with the host sitting on it?
  Probably not since it has a view of the world (next-hop router) that no
  longer exists, and the network won't route packets for its old IP address
  to the new location.  It has to wait for its current DHCP lease to tick
  down to the point where it will use broadcast to get a new one, after which
  point it will work.
 
  That's not just moving to a different subnet. That's moving to a different
  broadcast domain. Neutron supports multiple subnets per network (broadcast
  domain). An address on either subnet will work. The router has two
  interfaces into the network, one on each subnet.[2]
 
 
 Does it work on Windows VMs too?  People run those in clouds too.  The
  point is that if we don't know if all the DHCP clients will support it then
  it's a non-starter since there's no way to tell from the server side.
 
  It appears they do.[1] Even for clients that don't, the worst case scenario
  is just that they are stuck where we are now.
 
 ... then the deployer can adjust the value upwards..., hmm, can they adjust
  it downwards as well?  :)
 
  Yes, but most people doing initial openstack deployments don't and
  wouldn't think to without understanding the intricacies of the security
  groups filtering in Neutron.

 But they will if we document it well, which is what Salvatore suggested.

 I'm glad you're willing to boil the ocean to try and get the default
  changed, but is all this really worth it when all you have to do is edit
  the config file in your deployment?  That's why the value is there in the
  first place.
 
  The default value is basically incompatible with port IP changes. We
  shouldn't be shipping defaults that lead to half-broken functionality. What
  I'm understanding is that the current default value is to work around
  shortcomings in dnsmasq. This is an example of implementation details
  leaking out and leading to bad UX.

 I think the current default value is also more indicative of something
 you'd find in your house, or at work - i.e. stable networks.

 I had another thought on this Kevin, hoping that we could come to some
 resolution, because sure, shipping broken functionality isn't great.  But
 here's the rub - how do we make a change in a fixed IP work in *all*
 deployments? Since the end-user can't set this value, they'll run into this
 problem in my
 

Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-29 Thread Kyle Mestery
On Thu, Jan 29, 2015 at 2:55 AM, Kevin Benton blak...@gmail.com wrote:

 Why would users want to change an active port's IP address anyway?

 Re-addressing. It's not common, but the entire reason I brought this up is
 because a user was moving an instance to another subnet on the same network
 and stranded one of their VMs.

  I worry about setting a default config value to handle a very unusual
 use case.

 Changing a static lease is something that works on normal networks so I
 don't think we should break it in Neutron without a really good reason.

 Right now, the big reason to keep a high lease time that I agree with is
 that it buys operators lots of dnsmasq downtime without affecting running
 clients. To get the best of both worlds we can set DHCP option 58 (a.k.a
 dhcp-renewal-time or T1) to 240 seconds. Then the lease time can be left to
 be something large like 10 days to allow for tons of DHCP server downtime
 without affecting running clients.
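
For readers unfamiliar with option 58, here is a minimal sketch of how the two timers in that proposal look on the wire. It assumes the RFC 2132 layout for DHCPv4 time options (one byte of option code, one byte of length, then a 32-bit big-endian number of seconds); the helper names are hypothetical and this is not how dnsmasq or Neutron build their replies.

```python
import struct
from typing import Tuple

OPT_LEASE_TIME = 51  # IP Address Lease Time (RFC 2132)
OPT_RENEWAL_T1 = 58  # Renewal (T1) Time Value (RFC 2132)


def encode_time_option(code: int, seconds: int) -> bytes:
    """Encode a DHCPv4 time option: code, length 4, 32-bit big-endian seconds."""
    return struct.pack("!BBI", code, 4, seconds)


def decode_time_option(raw: bytes) -> Tuple[int, int]:
    """Return (option code, seconds) for a 6-byte time option."""
    code, length, seconds = struct.unpack("!BBI", raw)
    if length != 4:
        raise ValueError("time options carry a 4-byte value")
    return code, seconds


if __name__ == "__main__":
    # Values from the proposal above: renew after 240 seconds,
    # keep the lease itself valid for 10 days.
    t1 = encode_time_option(OPT_RENEWAL_T1, 240)
    lease = encode_time_option(OPT_LEASE_TIME, 10 * 24 * 3600)
    print(t1.hex(), lease.hex())
```

The point of the sketch is only that T1 and the lease time travel as independent options, so a server that lets T1 be configured can shorten renewals without touching the lease.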

 There are two issues with this approach. First, some simple dhcp clients
 don't honor that dhcp option (e.g. the one with Cirros), but it works with
 dhclient so it should work on CentOS, Fedora, etc (I verified it works on
 Ubuntu). This isn't a big deal because the worst case is what we have
 already (half of the lease time). The second issue is that dnsmasq
 hardcodes that option, so a patch would be required to allow it to be
 specified in the options file. I am happy to submit the patch required
 there so that isn't a big deal either.

 I'll defer to distributions here, but they would have to consume this
patch and release it before it would become prevalent in distributions
deployed with Neutron. Just something to note here. That said, I think
submitting a patch to remove hard coding this is a good idea, and ideally
you would submit that patch quickly while we hash out the details here.



 If we implement that fix, the remaining issue is Brian's other comment
 about too much DHCP traffic. I've been doing some packet captures and the
 standard request/reply for a renewal is 2 unicast packets totaling about
 725 bytes. Assuming 10,000 VMs renewing every 240 seconds, there will be an
 average of 242 kbps background traffic across the entire network. Even at a
 density of 50 VMs, that's only 1.2 kbps per compute node. If that's still
 too much, then the deployer can adjust the value upwards, but that's hardly
 a reason to have a high default.
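
The traffic estimate above is easy to reproduce. The sketch below only restates the arithmetic with the figures quoted in this message (about 725 bytes per renewal exchange, 240 second renewals); the function name is made up, and it ignores retransmissions and any traffic other than the renewal request/reply pair.

```python
def renewal_traffic_bps(vms: int, interval_s: float,
                        bytes_per_renewal: int = 725) -> float:
    """Average background DHCP renewal traffic in bits per second."""
    # Each VM completes one request/reply exchange per renewal interval.
    return vms * bytes_per_renewal * 8 / interval_s


if __name__ == "__main__":
    for label, vms in (("whole cloud (10,000 VMs)", 10_000),
                       ("one compute node (50 VMs)", 50),
                       ("single VM", 1)):
        kbps = renewal_traffic_bps(vms, 240) / 1000
        print(f"{label}: {kbps:.3f} kbps")
```

This matches the roughly 242 kbps network-wide and 1.2 kbps per compute node stated above, and works out to about 24 bps per VM.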

 That just leaves the logging problem. Since we require a change to dnsmasq
 anyway, perhaps we could also request an option to suppress logs from
 renewals? If that's not adequate, I think 2 log entries per vm every 240
 seconds is really only a concern for operators with large clouds and they
 should have the knowledge required to change a config file anyway. ;-)


 On Wed, Jan 28, 2015 at 3:59 PM, Chuck Carlino chuckjcarl...@gmail.com
 wrote:

  On 01/28/2015 12:51 PM, Kevin Benton wrote:

 If we are going to ignore the IP address changing use-case, can we just
 make the default infinity? Then nobody ever has to worry about control
 plane outages for existing clients. 24 hours is way too long to be useful
 anyway.


 Why would users want to change an active port's IP address anyway?  I can
 see possible use in changing an inactive port's IP address, but that
 wouldn't cause the dhcp issues mentioned here.  I worry about setting a
 default config value to handle a very unusual use case.

 Chuck



  On Jan 28, 2015 12:44 PM, Salvatore Orlando sorla...@nicira.com
 wrote:



 On 28 January 2015 at 20:19, Brian Haley brian.ha...@hp.com wrote:

 Hi Kevin,

 On 01/28/2015 03:50 AM, Kevin Benton wrote:
  Hi,
 
  Approximately a year and a half ago, the default DHCP lease time in
 Neutron was
  increased from 120 seconds to 86400 seconds.[1] This was done with
 the goal of
  reducing DHCP traffic with very little discussion (based on what I
 can see in
  the review and bug report). While it does indeed reduce DHCP
 traffic, I don't
  think any bug reports were filed showing that a 120 second lease time
 resulted
  in too much traffic or that a jump all of the way to 86400 seconds
 was required
  instead of a value in the same order of magnitude.
 
  Why does this matter?
 
  Neutron ports can be updated with a new IP address from the same
 subnet or
  another subnet on the same network. The port update will result in
 anti-spoofing
  iptables rule changes that immediately stop the old IP address from
 working on
  the host. This means the host is unreachable for 0-12 hours based on
 the current
  default lease time without manual intervention[2] (assuming
 half-lease length
  DHCP renewal attempts).

 So I'll first comment on the problem.  You're essentially pulling the
 rug out
 from under these VMs by changing their IP (and that of their router and
 DHCP/DNS
 server), but you expect they should fail quickly and come right back
 online.  In
 a non-Neutron environment wouldn't the IT person that did this need
 

Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-29 Thread Kevin Benton
Why would users want to change an active port's IP address anyway?

Re-addressing. It's not common, but the entire reason I brought this up is
because a user was moving an instance to another subnet on the same network
and stranded one of their VMs.

 I worry about setting a default config value to handle a very unusual use
case.

Changing a static lease is something that works on normal networks so I
don't think we should break it in Neutron without a really good reason.

Right now, the big reason to keep a high lease time that I agree with is
that it buys operators lots of dnsmasq downtime without affecting running
clients. To get the best of both worlds we can set DHCP option 58 (a.k.a
dhcp-renewal-time or T1) to 240 seconds. Then the lease time can be left to
be something large like 10 days to allow for tons of DHCP server downtime
without affecting running clients.

There are two issues with this approach. First, some simple dhcp clients
don't honor that dhcp option (e.g. the one with Cirros), but it works with
dhclient so it should work on CentOS, Fedora, etc (I verified it works on
Ubuntu). This isn't a big deal because the worst case is what we have
already (half of the lease time). The second issue is that dnsmasq
hardcodes that option, so a patch would be required to allow it to be
specified in the options file. I am happy to submit the patch required
there so that isn't a big deal either.
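
To make the trade-off concrete, here is a rough sketch of the two numbers the option 58 idea is trying to decouple: how long a client can sit on a stale address, and how much DHCP server downtime it survives. It assumes a client that honors T1 when sent and otherwise falls back to the usual half-lease renewal; the function names are made up for illustration and are not Neutron or dnsmasq APIs.

```python
from typing import Optional


def renewal_time(lease_s: int, t1_s: Optional[int] = None) -> float:
    """Seconds until the client's first renewal attempt (T1).

    Assumption: the client uses option 58 when the server sends it and
    otherwise falls back to half of the lease time.
    """
    return float(t1_s) if t1_s is not None else lease_s / 2


def worst_case_stale_s(lease_s: int, t1_s: Optional[int] = None) -> float:
    """Longest a client keeps using its old address after a port update.

    The worst case is a client that renewed just before the update.
    """
    return renewal_time(lease_s, t1_s)


def min_downtime_tolerated_s(lease_s: int, t1_s: Optional[int] = None) -> float:
    """Lease time left for a client that was about to renew when the
    DHCP server went down (the least lucky client)."""
    return lease_s - renewal_time(lease_s, t1_s)


if __name__ == "__main__":
    day = 86400
    cases = [
        ("24h lease, no option 58", day, None),
        ("10 day lease, T1 = 240s", 10 * day, 240),
    ]
    for label, lease, t1 in cases:
        print(label,
              "stale for up to", worst_case_stale_s(lease, t1), "s;",
              "survives at least", min_downtime_tolerated_s(lease, t1),
              "s of server downtime")
```

With these assumptions the current default leaves a client stale for up to 12 hours, while the 10 day lease with T1 at 240 seconds caps that at 4 minutes and still tolerates almost 10 days of dnsmasq downtime.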


If we implement that fix, the remaining issue is Brian's other comment
about too much DHCP traffic. I've been doing some packet captures and the
standard request/reply for a renewal is 2 unicast packets totaling about
725 bytes. Assuming 10,000 VMs renewing every 240 seconds, there will be an
average of 242 kbps background traffic across the entire network. Even at a
density of 50 VMs, that's only 1.2 kbps per compute node. If that's still
too much, then the deployer can adjust the value upwards, but that's hardly
a reason to have a high default.
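
For anyone who wants to redo the arithmetic with their own numbers, a minimal sketch follows. The 725 bytes and 240 seconds are the figures from my captures above; the function name is just for illustration.

```python
def renewal_traffic_bps(vms: int,
                        bytes_per_renewal: int = 725,
                        renew_interval_s: int = 240) -> float:
    """Average background DHCP renewal traffic in bits per second.

    Each VM performs one request/reply exchange per renewal interval.
    """
    return vms * bytes_per_renewal * 8 / renew_interval_s


if __name__ == "__main__":
    whole_cloud = renewal_traffic_bps(10_000)   # entire network
    per_node = renewal_traffic_bps(50)          # one compute node, 50 VMs
    print(f"whole cloud: {whole_cloud / 1000:.1f} kbps")
    print(f"per compute node: {per_node / 1000:.1f} kbps")
```

This reproduces the roughly 242 kbps and 1.2 kbps figures quoted above.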

That just leaves the logging problem. Since we require a change to dnsmasq
anyway, perhaps we could also request an option to suppress logs from
renewals? If that's not adequate, I think 2 log entries per vm every 240
seconds is really only a concern for operators with large clouds and they
should have the knowledge required to change a config file anyway. ;-)
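
The log volume can be estimated the same way. This is only a sketch: it reuses the 10,000 VM figure from the traffic estimate as an example cloud size, and the function name is hypothetical.

```python
def renewal_log_entries(vms: int,
                        window_s: int = 86400,
                        entries_per_renewal: int = 2,
                        renew_interval_s: int = 240) -> float:
    """Number of dnsmasq log entries caused by renewals within window_s."""
    return vms * entries_per_renewal * window_s / renew_interval_s


if __name__ == "__main__":
    per_second = renewal_log_entries(10_000, window_s=1)
    per_day = renewal_log_entries(10_000)
    print(f"{per_second:.1f} entries/s, {per_day:,.0f} entries/day")
```

For a 50 VM compute node's worth of clients the same formula gives well under one entry per second.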


On Wed, Jan 28, 2015 at 3:59 PM, Chuck Carlino chuckjcarl...@gmail.com
wrote:

  On 01/28/2015 12:51 PM, Kevin Benton wrote:

 If we are going to ignore the IP address changing use-case, can we just
 make the default infinity? Then nobody ever has to worry about control
 plane outages for existing clients. 24 hours is way too long to be useful
 anyway.


 Why would users want to change an active port's IP address anyway?  I can
 see possible use in changing an inactive port's IP address, but that
 wouldn't cause the dhcp issues mentioned here.  I worry about setting a
 default config value to handle a very unusual use case.

 Chuck



  On Jan 28, 2015 12:44 PM, Salvatore Orlando sorla...@nicira.com
 wrote:



 On 28 January 2015 at 20:19, Brian Haley brian.ha...@hp.com wrote:

 Hi Kevin,

 On 01/28/2015 03:50 AM, Kevin Benton wrote:
  Hi,
 
  Approximately a year and a half ago, the default DHCP lease time in
 Neutron was
  increased from 120 seconds to 86400 seconds.[1] This was done with the
 goal of
  reducing DHCP traffic with very little discussion (based on what I can
 see in
  the review and bug report). While it does indeed reduce DHCP
 traffic, I don't
  think any bug reports were filed showing that a 120 second lease time
 resulted
  in too much traffic or that a jump all of the way to 86400 seconds was
 required
  instead of a value in the same order of magnitude.
 
  Why does this matter?
 
  Neutron ports can be updated with a new IP address from the same
 subnet or
  another subnet on the same network. The port update will result in
 anti-spoofing
  iptables rule changes that immediately stop the old IP address from
 working on
  the host. This means the host is unreachable for 0-12 hours based on
 the current
  default lease time without manual intervention[2] (assuming half-lease
 length
  DHCP renewal attempts).

 So I'll first comment on the problem.  You're essentially pulling the
 rug out
 from under these VMs by changing their IP (and that of their router and
 DHCP/DNS
 server), but you expect they should fail quickly and come right back
 online.  In
 a non-Neutron environment wouldn't the IT person that did this need some
 pretty
 good heat-resistant pants for all the flames from pissed-off users?
 Sure, the
 guy on his laptop will just bounce the connection, but servers (aka VMs)
 should
 stay pretty static.  VMs are servers (and cows according to some).


  I actually expect this kind of operation to not be one Neutron users will
 do very often, mostly because regardless of whether you're in the cloud or
 not, you'd still need to wear those heat-resistant pants.



 The correct 

Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-29 Thread Kevin Benton
How is Neutron breaking this?  If I move a port on my physical switch to a
different subnet, can you still communicate with the host sitting on it?
Probably not since it has a view of the world (next-hop router) that no
longer exists, and the network won't route packets for its old IP address
to the new location.  It has to wait for its current DHCP lease to tick
down to the point where it will use broadcast to get a new one, after which
point it will work.

That's not just moving to a different subnet. That's moving to a different
broadcast domain. Neutron supports multiple subnets per network (broadcast
domain). An address on either subnet will work. The router has two
interfaces into the network, one on each subnet.[2]


Does it work on Windows VMs too?  People run those in clouds too.  The
point is that if we don't know if all the DHCP clients will support it then
it's a non-starter since there's no way to tell from the server side.

It appears they do.[1] Even for clients that don't, the worst case scenario
is just that they are stuck where we are now.

... then the deployer can adjust the value upwards..., hmm, can they
adjust it downwards as well?  :)

Yes, but most people doing initial openstack deployments don't and wouldn't
think to without understanding the intricacies of the security groups
filtering in Neutron.

I'm glad you're willing to boil the ocean to try and get the default
changed, but is all this really worth it when all you have to do is edit
the config file in your deployment?  That's why the value is there in the
first place.

The default value is basically incompatible with port IP changes. We
shouldn't be shipping defaults that lead to half-broken functionality. What
I'm understanding is that the current default value is there to work around
shortcomings in dnsmasq. This is an example of implementation details
leaking out and leading to bad UX.

If we had an option to configure how often iptables rules were refreshed to
match their security group, there is no way we would have a default of 12
hours. This is essentially the same level of connectivity interruption; it
just happens to be a narrow use case so it hasn't been getting any
attention.

To flip your question around, why do you care if the default is lower? You
already adjust it beyond the 1 day default in your deployment, so how would
a different default impact you?


1. http://support.microsoft.com/kb/121005
2. Similar to using the secondary keyword on Cisco devices. Or just the
ip addr add command on linux.

On Thu, Jan 29, 2015 at 1:34 PM, Brian Haley brian.ha...@hp.com wrote:

 On 01/29/2015 03:55 AM, Kevin Benton wrote:
 Why would users want to change an active port's IP address anyway?
 
  Re-addressing. It's not common, but the entire reason I brought this up
 is
  because a user was moving an instance to another subnet on the same
 network and
  stranded one of their VMs.
 
  I worry about setting a default config value to handle a very unusual
 use case.
 
  Changing a static lease is something that works on normal networks so I
 don't
  think we should break it in Neutron without a really good reason.

 How is Neutron breaking this?  If I move a port on my physical switch to a
 different subnet, can you still communicate with the host sitting on it?
 Probably not since it has a view of the world (next-hop router) that no
 longer exists, and the network won't route packets for its old IP address
 to the new location.  It has to wait for its current DHCP lease to tick
 down to the point where it will use broadcast to get a new one, after which
 point it will work.

  Right now, the big reason to keep a high lease time that I agree with is
 that it
  buys operators lots of dnsmasq downtime without affecting running
 clients. To
  get the best of both worlds we can set DHCP option 58 (a.k.a
 dhcp-renewal-time
  or T1) to 240 seconds. Then the lease time can be left to be something
 large
  like 10 days to allow for tons of DHCP server downtime without affecting
 running
  clients.
 
  There are two issues with this approach. First, some simple dhcp clients
 don't
  honor that dhcp option (e.g. the one with Cirros), but it works with
 dhclient so
  it should work on CentOS, Fedora, etc (I verified it works on Ubuntu).
 This
  isn't a big deal because the worst case is what we have already (half of
 the
  lease time). The second issue is that dnsmasq hardcodes that option, so
 a patch
  would be required to allow it to be specified in the options file. I am
 happy to
  submit the patch required there so that isn't a big deal either.

 Does it work on Windows VMs too?  People run those in clouds too.  The
 point is
 that if we don't know if all the DHCP clients will support it then it's a
 non-starter since there's no way to tell from the server side.

  If we implement that fix, the remaining issue is Brian's other comment
 about too
  much DHCP traffic. I've been doing some packet captures and the standard
  

Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-29 Thread Brian Haley
On 01/29/2015 03:55 AM, Kevin Benton wrote:
Why would users want to change an active port's IP address anyway?
 
 Re-addressing. It's not common, but the entire reason I brought this up is
 because a user was moving an instance to another subnet on the same network
 and stranded one of their VMs.
 
 I worry about setting a default config value to handle a very unusual use
 case.
 
 Changing a static lease is something that works on normal networks so I don't
 think we should break it in Neutron without a really good reason.

How is Neutron breaking this?  If I move a port on my physical switch to a
different subnet, can you still communicate with the host sitting on it?
Probably not since it has a view of the world (next-hop router) that no longer
exists, and the network won't route packets for its old IP address to the new
location.  It has to wait for its current DHCP lease to tick down to the point
where it will use broadcast to get a new one, after which point it will work.

 Right now, the big reason to keep a high lease time that I agree with is that
 it buys operators lots of dnsmasq downtime without affecting running clients.
 To get the best of both worlds we can set DHCP option 58 (a.k.a.
 dhcp-renewal-time or T1) to 240 seconds. Then the lease time can be left to be
 something large like 10 days to allow for tons of DHCP server downtime without
 affecting running clients.
 
 There are two issues with this approach. First, some simple dhcp clients don't
 honor that dhcp option (e.g. the one with Cirros), but it works with dhclient
 so it should work on CentOS, Fedora, etc (I verified it works on Ubuntu). This
 isn't a big deal because the worst case is what we have already (half of the
 lease time). The second issue is that dnsmasq hardcodes that option, so a
 patch would be required to allow it to be specified in the options file. I am
 happy to submit the patch required there so that isn't a big deal either.

Does it work on Windows VMs too?  People run those in clouds too.  The point is
that if we don't know if all the DHCP clients will support it then it's a
non-starter since there's no way to tell from the server side.

 If we implement that fix, the remaining issue is Brian's other comment about
 too much DHCP traffic. I've been doing some packet captures and the standard
 request/reply for a renewal is 2 unicast packets totaling about 725 bytes.
 Assuming 10,000 VMs renewing every 240 seconds, there will be an average of
 242 kbps background traffic across the entire network. Even at a density of 50
 VMs, that's only 1.2 kbps per compute node. If that's still too much, then the
 deployer can adjust the value upwards, but that's hardly a reason to have a
 high default.

... then the deployer can adjust the value upwards..., hmm, can they adjust it
downwards as well?  :)

 That just leaves the logging problem. Since we require a change to dnsmasq
 anyway, perhaps we could also request an option to suppress logs from
 renewals? If that's not adequate, I think 2 log entries per vm every 240
 seconds is really only a concern for operators with large clouds and they
 should have the knowledge required to change a config file anyway. ;-)

I'm glad you're willing to boil the ocean to try and get the default changed,
but is all this really worth it when all you have to do is edit the config file
in your deployment?  That's why the value is there in the first place.

Sorry, I'm still unconvinced we need to do anything more than document this.

-Brian



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-29 Thread Brian Haley
On 01/29/2015 05:28 PM, Kevin Benton wrote:
How is Neutron breaking this?  If I move a port on my physical switch to a
 different subnet, can you still communicate with the host sitting on it?
 Probably not since it has a view of the world (next-hop router) that no longer
 exists, and the network won't route packets for its old IP address to the new
 location.  It has to wait for its current DHCP lease to tick down to the point
 where it will use broadcast to get a new one, after which point it will work.
 
 That's not just moving to a different subnet. That's moving to a different
 broadcast domain. Neutron supports multiple subnets per network (broadcast
 domain). An address on either subnet will work. The router has two interfaces
 into the network, one on each subnet.[2]
 
 
Does it work on Windows VMs too?  People run those in clouds too.  The point is
 that if we don't know if all the DHCP clients will support it then it's a
 non-starter since there's no way to tell from the server side.
 
 It appears they do.[1] Even for clients that don't, the worst case scenario is
 just that they are stuck where we are now.
 
... then the deployer can adjust the value upwards..., hmm, can they adjust it
 downwards as well?  :)
 
 Yes, but most people doing initial openstack deployments don't and wouldn't
 think to without understanding the intricacies of the security groups
 filtering in Neutron.

But they will if we document it well, which is what Salvatore suggested.

I'm glad you're willing to boil the ocean to try and get the default changed,
 but is all this really worth it when all you have to do is edit the config file
 in your deployment?  That's why the value is there in the first place.
 
 The default value is basically incompatible with port IP changes. We shouldn't
 be shipping defaults that lead to half-broken functionality. What I'm
 understanding is that the current default value is there to work around
 shortcomings in dnsmasq. This is an example of implementation details leaking
 out and leading to bad UX.

I think the current default value is also more indicative of something you'd
find in your house, or at work - i.e. stable networks.

I had another thought on this Kevin, hoping that we could come to some
resolution, because sure, shipping broken functionality isn't great.  But here's
the rub - how do we make a change in a fixed IP work in *all* deployments?
Since the end-user can't set this value, they'll run into this problem in my
deployment, or any other that has some not-very-short lease time.  One solution
is to disallow this operation.  The other is to fix neutron to make this work
better (I don't know what that involves, but there's bound to be a way).
Perhaps let the user set it, but allow the admin to set the valid range for
min/max?  And if they don't specify a value, they get the default?

 If we had an option to configure how often iptables rules were refreshed to
 match their security group, there is no way we would have a default of 12
 hours. This is essentially the same level of connectivity interruption; it
 just happens to be a narrow use case so it hasn't been getting any attention.
 
 To flip your question around, why do you care if the default is lower? You
 already adjust it beyond the 1 day default in your deployment, so how would a
 different default impact you?

It impacts anyone who has been running with the default since July 2013
(Havana) or later: if they don't notice the change, they might get bitten by it.

-Brian


 
 1. http://support.microsoft.com/kb/121005
 2. Similar to using the secondary keyword on Cisco devices. Or just the ip
 addr add command on linux.
 
 On Thu, Jan 29, 2015 at 1:34 PM, Brian Haley brian.ha...@hp.com
 mailto:brian.ha...@hp.com wrote:
 
 On 01/29/2015 03:55 AM, Kevin Benton wrote:
 Why would users want to change an active port's IP address anyway?
 
  Re-addressing. It's not common, but the entire reason I brought this up 
 is
  because a user was moving an instance to another subnet on the same 
 network and
  stranded one of their VMs.
 
  I worry about setting a default config value to handle a very unusual 
 use case.
 
  Changing a static lease is something that works on normal networks so I 
 don't
  think we should break it in Neutron without a really good reason.
 
 How is Neutron breaking this?  If I move a port on my physical switch to a
 different subnet, can you still communicate with the host sitting on it?
 Probably not since it has a view of the world (next-hop router) that no
 longer exists, and the network won't route packets for its old IP address
 to the new location.  It has to wait for its current DHCP lease to tick
 down to the point where it will use broadcast to get a new one, after which
 point it will work.
 
  Right now, the big reason to keep a high lease time that I agree with 
 is that it
  buys operators lots of 

[openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-28 Thread Kevin Benton
Hi,

Approximately a year and a half ago, the default DHCP lease time in Neutron
was increased from 120 seconds to 86400 seconds.[1] This was done with the
goal of reducing DHCP traffic with very little discussion (based on what I
can see in the review and bug report). While it does indeed reduce DHCP
traffic, I don't think any bug reports were filed showing that a 120 second
lease time resulted in too much traffic or that a jump all of the way to
86400 seconds was required instead of a value in the same order of
magnitude.

Why does this matter?

Neutron ports can be updated with a new IP address from the same subnet or
another subnet on the same network. The port update will result in
anti-spoofing iptables rule changes that immediately stop the old IP
address from working on the host. This means the host is unreachable for
0-12 hours based on the current default lease time without manual
intervention[2] (assuming half-lease length DHCP renewal attempts).
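
As a quick illustration of where the 0-12 hour figure comes from, here is a small sketch. It assumes clients renew at half the lease time and, for the average, that the port update lands at a uniformly random point in the renewal cycle (that second assumption is mine, purely for illustration); the function names are made up.

```python
def stranded_window_s(lease_s: float) -> tuple:
    """(best, worst) seconds until the client's next renewal attempt,
    assuming renewal at half the lease time."""
    return (0.0, lease_s / 2)


def mean_stranded_s(lease_s: float) -> float:
    """Average wait, assuming the update time is uniformly distributed
    over the renewal cycle (illustrative assumption)."""
    return lease_s / 4


if __name__ == "__main__":
    # Old default, proposed default, current default (seconds).
    for lease in (120, 480, 86400):
        best, worst = stranded_window_s(lease)
        print(f"lease {lease}s: unreachable {best:.0f}-{worst:.0f}s, "
              f"mean {mean_stranded_s(lease):.0f}s")
```

For the current 86400 second default this gives a worst case of 43200 seconds (12 hours); for an 8 minute lease the worst case is 240 seconds.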

Why is this on the mailing list?

In an attempt to make the VMs usable in a much shorter timeframe following
a Neutron port address change, I submitted a patch to reduce the default
DHCP lease time to 8 minutes.[3] However, this was upsetting to several
people,[4] so it was suggested I bring this discussion to the mailing list.
The following are the high-level concerns followed by my responses:

   - 8 minutes is arbitrary
      - Yes, but it's no more arbitrary than 1440 minutes. I picked it as
        an interval because it is still 4 times larger than the last short
        value, but it still allows VMs to regain connectivity in 5 minutes
        in the event their IP is changed. If someone has a good suggestion
        for another interval based on known dnsmasq QPS limits or some other
        quantitative reason, please chime in here.
   - other datacenters use long lease times
      - This is true, but it's not really a valid comparison. In most
        regular datacenters, updating a static DHCP lease has no effect on
        the data plane so it doesn't matter that the client doesn't react
        for hours/days (even with DHCP snooping enabled). However, in
        Neutron's case, the security groups are immediately updated so all
        traffic using the old address is blocked.
   - dhcp traffic is scary because it's broadcast
      - ARP traffic is also broadcast and many clients will expire entries
        every 5-10 minutes and re-ARP. L2population may be used to prevent
        ARP propagation, so the comparison between DHCP and ARP isn't always
        relevant here.


Please reply back with your opinions/anecdotes/data related to short DHCP
lease times.

Cheers

1.
https://github.com/openstack/neutron/commit/d9832282cf656b162c51afdefb830dacab72defe
2. Manual intervention could be an instance reboot, a dhcp client
invocation via the console, or a delayed invocation right before the
update. (all significantly more difficult to script than a simple update of
a port's IP via the API).
3. https://review.openstack.org/#/c/150595/
4. http://i.imgur.com/xtvatkP.jpg

-- 
Kevin Benton


Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-28 Thread Ihar Hrachyshka

On 01/28/2015 09:50 AM, Kevin Benton wrote:

Hi,

Approximately a year and a half ago, the default DHCP lease time in 
Neutron was increased from 120 seconds to 86400 seconds.[1] This was 
done with the goal of reducing DHCP traffic with very little 
discussion (based on what I can see in the review and bug report). 
While it does indeed reduce DHCP traffic, I don't think any bug
reports were filed showing that a 120 second lease time resulted in 
too much traffic or that a jump all of the way to 86400 seconds was 
required instead of a value in the same order of magnitude.


I guess that would be a good case for FORCERENEW DHCP extension [1] 
though after digging thru dnsmasq code a bit, I doubt it supports the 
extension (though e.g. systemd dhcp client/server from networkd module 
do). Le sigh.


[1]: https://tools.ietf.org/html/rfc3203



Why does this matter?

Neutron ports can be updated with a new IP address from the same 
subnet or another subnet on the same network. The port update will 
result in anti-spoofing iptables rule changes that immediately stop 
the old IP address from working on the host. This means the host is 
unreachable for 0-12 hours based on the current default lease time 
without manual intervention[2] (assuming half-lease length DHCP 
renewal attempts).


Why is this on the mailing list?

In an attempt to make the VMs usable in a much shorter timeframe 
following a Neutron port address change, I submitted a patch to reduce 
the default DHCP lease time to 8 minutes.[3] However, this was 
upsetting to several people,[4] so it was suggested I bring this 
discussion to the mailing list. The following are the high-level 
concerns followed by my responses:


  * 8 minutes is arbitrary
  o Yes, but it's no more arbitrary than 1440 minutes. I picked it
as an interval because it is still 4 times larger than the
last short value, but it still allows VMs to regain
connectivity in 5 minutes in the event their IP is changed.
If someone has a good suggestion for another interval based on
known dnsmasq QPS limits or some other quantitative reason,
please chime in here.
  * other datacenters use long lease times
  o This is true, but it's not really a valid comparison. In most
regular datacenters, updating a static DHCP lease has no
effect on the data plane so it doesn't matter that the client
doesn't react for hours/days (even with DHCP snooping
enabled). However, in Neutron's case, the security groups are
immediately updated so all traffic using the old address is
blocked.
  * dhcp traffic is scary because it's broadcast
  o ARP traffic is also broadcast and many clients will expire
entries every 5-10 minutes and re-ARP. L2population may be
used to prevent ARP propagation, so the comparison between
DHCP and ARP isn't always relevant here.


Please reply back with your opinions/anecdotes/data related to short 
DHCP lease times.


Cheers

1. 
https://github.com/openstack/neutron/commit/d9832282cf656b162c51afdefb830dacab72defe
2. Manual intervention could be an instance reboot, a dhcp client 
invocation via the console, or a delayed invocation right before the 
update. (all significantly more difficult to script than a simple 
update of a port's IP via the API).

3. https://review.openstack.org/#/c/150595/
4. http://i.imgur.com/xtvatkP.jpg

--
Kevin Benton




Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-28 Thread Miguel Ángel Ajo
Miguel Ángel Ajo


On Wednesday, 28 de January de 2015 at 09:50, Kevin Benton wrote:

 Hi,
  
 Approximately a year and a half ago, the default DHCP lease time in Neutron 
 was increased from 120 seconds to 86400 seconds.[1] This was done with the 
 goal of reducing DHCP traffic with very little discussion (based on what I 
 can see in the review and bug report). While it does indeed reduce DHCP
 traffic, I don't think any bug reports were filed showing that a 120 second 
 lease time resulted in too much traffic or that a jump all of the way to 
 86400 seconds was required instead of a value in the same order of magnitude.
  
 Why does this matter?  
  
 Neutron ports can be updated with a new IP address from the same subnet or 
 another subnet on the same network. The port update will result in 
 anti-spoofing iptables rule changes that immediately stop the old IP address 
 from working on the host. This means the host is unreachable for 0-12 hours 
 based on the current default lease time without manual intervention[2] 
 (assuming half-lease length DHCP renewal attempts).
  
 Why is this on the mailing list?
  
 In an attempt to make the VMs usable in a much shorter timeframe following a 
 Neutron port address change, I submitted a patch to reduce the default DHCP 
 lease time to 8 minutes.[3] However, this was upsetting to several people,[4] 
 so it was suggested I bring this discussion to the mailing list. The 
 following are the high-level concerns followed by my responses:
 - 8 minutes is arbitrary
    - Yes, but it's no more arbitrary than 1440 minutes. I picked it as an
      interval because it is still 4 times larger than the last short value,
      but it still allows VMs to regain connectivity in 5 minutes in the
      event their IP is changed. If someone has a good suggestion for another
      interval based on known dnsmasq QPS limits or some other quantitative
      reason, please chime in here.
 - other datacenters use long lease times
    - This is true, but it's not really a valid comparison. In most regular
      datacenters, updating a static DHCP lease has no effect on the data
      plane so it doesn't matter that the client doesn't react for hours/days
      (even with DHCP snooping enabled). However, in Neutron's case, the
      security groups are immediately updated so all traffic using the old
      address is blocked.
 - dhcp traffic is scary because it's broadcast
    - ARP traffic is also broadcast and many clients will expire entries
      every 5-10 minutes and re-ARP. L2population may be used to prevent ARP
      propagation, so the comparison between DHCP and ARP isn't always
      relevant here.
  
  
  
  
From what I’ve seen, at least on Linux, the first DHCP request is
broadcast. All lease renewals are then unicast unless the original
DHCP server can’t be contacted, in which case the DHCP client falls back to
broadcast to try to find another server that can renew its lease.

So, only initial boot of an instance should generate broadcast traffic.

Your proposal seems reasonable to me.

In this context, please see this ongoing work [5], especially the comments
here [6], where we’re discussing optimization due to a theoretical 120 second
limit for renews at scale. We made some calculations of CPU usage for the
current default; I will recalculate those for the newly proposed default of
8 minutes.

TL;DR:
That patch fixes an issue found when you restart dnsmasq: old leases can’t
be renewed, so we end up in a storm of requests. To fix that, we need to
provide dnsmasq with a script that initializes the leases table. Initially
that script was written in Python, but it gets called for init (once), lease
(once per instance), and renew (once per lease renewal interval, for every
instance). We should therefore minimize the impact of the script as much as
possible, or contribute a change to dnsmasq so that, under some flag, the
script is not called for lease renewals.
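
For illustration only, a minimal sketch of what the "init" step could produce. It assumes dnsmasq is started with --dhcp-script and --leasefile-ro, so that the script is called with "init" and prints the lease table to stdout, and it assumes a lease-file line layout of "<expiry epoch> <mac> <ip> <hostname> <client-id>". The helper name and the (mac, ip) port list are hypothetical; the patch under review may do this differently.

```python
import time


def init_lease_lines(ports, lease_s, now=None):
    """Build lease-table lines to print when dnsmasq calls the script with 'init'.

    `ports` is a hypothetical list of (mac, ip) tuples known to the agent.
    """
    now = int(time.time()) if now is None else int(now)
    expiry = now + lease_s
    # Assumed format: "<expiry epoch> <mac> <ip> <hostname> <client-id>",
    # with "*" standing in for unknown hostname and client-id.
    return ["%d %s %s * *" % (expiry, mac, ip) for mac, ip in ports]


if __name__ == "__main__":
    for line in init_lease_lines([("fa:16:3e:00:00:01", "10.0.0.5")], 480):
        print(line)
```

Keeping the init path this small matters less than the renew path, which runs far more often; that is the part that would benefit from a dnsmasq-side flag.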
  
  
 Please reply back with your opinions/anecdotes/data related to short DHCP 
 lease times.
  
 Cheers
  
 1. 
 https://github.com/openstack/neutron/commit/d9832282cf656b162c51afdefb830dacab72defe
 2. Manual intervention could be an instance reboot, a dhcp client invocation 
 via the console, or a delayed invocation right before the update. (all 
 significantly more difficult to script than a simple update of a port's IP 
 via the API).
 3. https://review.openstack.org/#/c/150595/
 4. http://i.imgur.com/xtvatkP.jpg
  
  
  
  

5. https://review.openstack.org/#/c/108272/ 
(https://review.openstack.org/#/c/108272/8/neutron/agent/linux/dhcp.py)
6. https://review.openstack.org/#/c/108272/8/neutron/agent/linux/dhcp.py  
  
 --  
 Kevin Benton  
 __
 OpenStack Development Mailing List (not for usage questions)
 Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe 
 (mailto:openstack-dev-requ...@lists.openstack.org?subject:unsubscribe)
 

Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-28 Thread Carl Baldwin
On Wed, Jan 28, 2015 at 9:52 AM, Salvatore Orlando sorla...@nicira.com wrote:
 The patch Kevin points out increased the lease to 24 hours (which I agree is
 as arbitrary as 2 minutes, 8 minutes, or 1 century) because it introduced
 use of DHCPRELEASE message in the agent, which is supported by dnsmasq (to
 the best of my knowledge) and is functionally similar to FORCERENEW.

My understanding was that the dhcp release mechanism in dnsmasq does
not actually unicast a FORCERENEW message to the client.  Does it?  I
thought it just released dnsmasq's record of the lease.  If I'm right,
this is a huge difference.  It is a big pain knowing that there are
many clients out there who may not renew their leases to get updated
dhcp options for hours and hours.  I don't think there is a reliable
way for the server to force renew to the client, is there?  Do clients
support the FORCERENEW unicast message?

Carl



Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-28 Thread Vishvananda Ishaya

On Jan 28, 2015, at 9:36 AM, Carl Baldwin c...@ecbaldwin.net wrote:

 On Wed, Jan 28, 2015 at 9:52 AM, Salvatore Orlando sorla...@nicira.com 
 wrote:
 The patch Kevin points out increased the lease to 24 hours (which I agree is
 as arbitrary as 2 minutes, 8 minutes, or 1 century) because it introduced
 use of DHCPRELEASE message in the agent, which is supported by dnsmasq (to
 the best of my knowledge) and is functionally similar to FORCERENEW.
 
 My understanding was that the dhcp release mechanism in dnsmasq does
 not actually unicast a FORCERENEW message to the client.  Does it?  I
 thought it just released dnsmasq's record of the lease.  If I'm right,
 this is a huge difference.  It is a big pain knowing that there are
 many clients out there who may not renew their leases to get updated
 dhcp options for hours and hours.  I don't think there is a reliable
 way for the server to force renew to the client, is there?  Do clients
 support the FORCERENEW unicast message?

If you are using the dhcp-release script (that we got included in ubuntu years
ago for nova-network), it sends a release packet on behalf of the client so
that dnsmasq can update its leases table, but it doesn’t send any message to
the client to tell it to update.
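
As a rough sketch of what that looks like from the agent side, assuming the utility is installed as `dhcp_release` and takes the interface, the leased address and the MAC address in that order (the interface name and addresses below are made up):

```python
import subprocess


def build_dhcp_release_cmd(interface, ip_address, mac_address):
    """Command line that fakes a DHCPRELEASE from the client to dnsmasq."""
    # Assumed argument order: interface, leased IP, client MAC.
    return ["dhcp_release", interface, ip_address, mac_address]


def release_lease(interface, ip_address, mac_address, run=subprocess.check_call):
    """Drop dnsmasq's record of the lease; the client itself is NOT notified."""
    run(build_dhcp_release_cmd(interface, ip_address, mac_address))


if __name__ == "__main__":
    # Print instead of executing, so the sketch runs without dnsmasq installed.
    release_lease("tap0", "10.0.0.5", "fa:16:3e:00:00:01", run=print)
```

Note that nothing here reaches the VM: the client keeps using its old address until its own renewal timer fires.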

Vish

 
 Carl
 




Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-28 Thread Kevin Benton
If we are going to ignore the IP address changing use-case, can we just
make the default infinity? Then nobody ever has to worry about control
plane outages for existing clients. 24 hours is way too long to be useful
anyway.
On Jan 28, 2015 12:44 PM, Salvatore Orlando sorla...@nicira.com wrote:



 On 28 January 2015 at 20:19, Brian Haley brian.ha...@hp.com wrote:

 Hi Kevin,

 On 01/28/2015 03:50 AM, Kevin Benton wrote:
  Hi,
 
  Approximately a year and a half ago, the default DHCP lease time in
 Neutron was
  increased from 120 seconds to 86400 seconds.[1] This was done with the
 goal of
  reducing DHCP traffic with very little discussion (based on what I can
 see in
  the review and bug report). While it does indeed reduce DHCP
 traffic, I don't
  think any bug reports were filed showing that a 120 second lease time
 resulted
  in too much traffic or that a jump all of the way to 86400 seconds was
 required
  instead of a value in the same order of magnitude.
 
  Why does this matter?
 
  Neutron ports can be updated with a new IP address from the same subnet
 or
  another subnet on the same network. The port update will result in
 anti-spoofing
  iptables rule changes that immediately stop the old IP address from
 working on
  the host. This means the host is unreachable for 0-12 hours based on
 the current
  default lease time without manual intervention[2] (assuming half-lease
 length
  DHCP renewal attempts).

 So I'll first comment on the problem.  You're essentially pulling the
 rug out
 from under these VMs by changing their IP (and that of their router and
 DHCP/DNS
 server), but you expect they should fail quickly and come right back
 online.  In
 a non-Neutron environment wouldn't the IT person that did this need some
 pretty
 good heat-resistant pants for all the flames from pissed-off users?
 Sure, the
 guy on his laptop will just bounce the connection, but servers (aka VMs)
 should
 stay pretty static.  VMs are servers (and cows according to some).


 I actually expect this kind of operation not to be one Neutron users will do
 very often, mostly because regardless of whether you're in the cloud or
 not, you'd still need to wear those heat-resistant pants.



 The correct solution is to be able to renumber the network so there is no
 issue
 with the anti-spoofing rules dropping packets, or the VMs having an
 unreachable
 IP address, but that's a much bigger nut to crack.


 Indeed. In my opinion the update IP operation sets false expectations in
 users. I have considered disallowing PUT on fixed_ips in the past but that
 did not go ahead because there were users leveraging it.



  Why is this on the mailing list?
 
  In an attempt to make the VMs usable in a much shorter timeframe
 following a
  Neutron port address change, I submitted a patch to reduce the default
 DHCP
  lease time to 8 minutes.[3] However, this was upsetting to several
 people,[4] so
  it was suggested I bring this discussion to the mailing list. The
 following are
  the high-level concerns followed by my responses:
 
* 8 minutes is arbitrary
o Yes, but it's no more arbitrary than 1440 minutes. I picked it
 as an
  interval because it is still 4 times larger than the last short
 value,
  but it still allows VMs to regain connectivity in 5 minutes in
 the
  event their IP is changed. If someone has a good suggestion for
 another
  interval based on known dnsmasq QPS limits or some other
 quantitative
  reason, please chime in here.

 We run 48 hours as the default in our public cloud, and I did some
 digging to
 remind myself of the multiple reasons:

 1. Too much DHCP traffic.  Sure, only that initial request is broadcast,
 but
 dnsmasq is very verbose and loves writing to syslog for everything it
 does -
 less is more.  Do a scale test with 10K VMs and you'll quickly find out a
 large
 portion of traffic is DHCP RENEWs, and syslog is huge.


 This is correct, and something I overlooked in my previous post.
 Nevertheless I still think that it is really impossible to find an optimal
 default which is regarded as such by every user. The current default has
 been chosen mostly for the reason you explain below, and I don't see a
 strong reason for changing it.



 2. During a control-plane upgrade or outage, having a short DHCP lease
 time will
 take all your VMs offline.  The old value of 2 minutes is not a realistic
 value
 for an upgrade, and I don't think 8 minutes is much better.  Yes, when
 DHCP is
 down you can't boot a new VM, but as long as customers can get to their
 existing
 VMs they're pretty happy and won't scream bloody murder.


 In our cloud we were continuously hit by this. We could not take our dhcp
 agents out, otherwise all VMs would lose their leases, unless the downtime
 of the agent was very brief.


 There's probably more, but those were the top two, with #2 being most
 important.


 Summarizing, I think that Kevin is exposing a real, albeit 

Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-28 Thread Salvatore Orlando
On 28 January 2015 at 20:19, Brian Haley brian.ha...@hp.com wrote:

 Hi Kevin,

 On 01/28/2015 03:50 AM, Kevin Benton wrote:
  Hi,
 
  Approximately a year and a half ago, the default DHCP lease time in Neutron
  was increased from 120 seconds to 86400 seconds.[1] This was done with the
  goal of reducing DHCP traffic with very little discussion (based on what I
  can see in the review and bug report). While it does indeed reduce DHCP
  traffic, I don't think any bug reports were filed showing that a 120 second
  lease time resulted in too much traffic or that a jump all of the way to
  86400 seconds was required instead of a value in the same order of magnitude.
 
  Why does this matter?
 
  Neutron ports can be updated with a new IP address from the same subnet or
  another subnet on the same network. The port update will result in
  anti-spoofing iptables rule changes that immediately stop the old IP address
  from working on the host. This means the host is unreachable for 0-12 hours
  based on the current default lease time without manual intervention[2]
  (assuming half-lease length DHCP renewal attempts).

 So I'll first comment on the problem.  You're essentially pulling the rug out
 from under these VMs by changing their IP (and that of their router and
 DHCP/DNS server), but you expect they should fail quickly and come right back
 online.  In a non-Neutron environment wouldn't the IT person that did this
 need some pretty good heat-resistant pants for all the flames from pissed-off
 users?  Sure, the guy on his laptop will just bounce the connection, but
 servers (aka VMs) should stay pretty static.  VMs are servers (and cows
 according to some).


I actually expect this kind of operation not to be one Neutron users will do
very often, mostly because regardless of whether you're in the cloud or
not, you'd still need to wear those heat-resistant pants.



 The correct solution is to be able to renumber the network so there is no
 issue with the anti-spoofing rules dropping packets, or the VMs having an
 unreachable IP address, but that's a much bigger nut to crack.


Indeed. In my opinion the update IP operation sets false expectations in
users. I have considered disallowing PUT on fixed_ips in the past but that
did not go ahead because there were users leveraging it.



  Why is this on the mailing list?
 
  In an attempt to make the VMs usable in a much shorter timeframe following
  a Neutron port address change, I submitted a patch to reduce the default
  DHCP lease time to 8 minutes.[3] However, this was upsetting to several
  people,[4] so it was suggested I bring this discussion to the mailing list.
  The following are the high-level concerns followed by my responses:
 
    * 8 minutes is arbitrary
    o Yes, but it's no more arbitrary than 1440 minutes. I picked it as an
      interval because it is still 4 times larger than the last short value,
      but it still allows VMs to regain connectivity in 5 minutes in the
      event their IP is changed. If someone has a good suggestion for another
      interval based on known dnsmasq QPS limits or some other quantitative
      reason, please chime in here.

 We run 48 hours as the default in our public cloud, and I did some digging
 to remind myself of the multiple reasons:

 1. Too much DHCP traffic.  Sure, only that initial request is broadcast, but
 dnsmasq is very verbose and loves writing to syslog for everything it does -
 less is more.  Do a scale test with 10K VMs and you'll quickly find out a
 large portion of traffic is DHCP RENEWs, and syslog is huge.


This is correct, and something I overlooked in my previous post.
Nevertheless I still think that it is really impossible to find an optimal
default which is regarded as such by every user. The current default has
been chosen mostly for the reason you explain below, and I don't see a
strong reason for changing it.



 2. During a control-plane upgrade or outage, having a short DHCP lease time
 will take all your VMs offline.  The old value of 2 minutes is not a realistic
 value for an upgrade, and I don't think 8 minutes is much better.  Yes, when
 DHCP is down you can't boot a new VM, but as long as customers can get to
 their existing VMs they're pretty happy and won't scream bloody murder.


In our cloud we were continuously hit by this. We could not take our dhcp
agents out, otherwise all VMs would lose their leases, unless the downtime
of the agent was very brief.


 There's probably more, but those were the top two, with #2 being most important.


Summarizing, I think that Kevin is exposing a real, albeit well-known
problem (sorry about my dhcp release faux pas - I can use jet lag as a
justification!), and he's proposing a mitigation to it. On the other hand,
this mitigation, as Brian explains, is going to cause real operational
issues. Still, we're arguing about a default value for a configuration
parameter. I 

Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-28 Thread Chuck Carlino

On 01/28/2015 12:51 PM, Kevin Benton wrote:


If we are going to ignore the IP address changing use-case, can we 
just make the default infinity? Then nobody ever has to worry about 
control plane outages for existing clients. 24 hours is way too long to 
be useful anyway.




Why would users want to change an active port's IP address anyway? I can 
see possible use in changing an inactive port's IP address, but that 
wouldn't cause the dhcp issues mentioned here.  I worry about setting a 
default config value to handle a very unusual use case.


Chuck


On Jan 28, 2015 12:44 PM, Salvatore Orlando sorla...@nicira.com wrote:




On 28 January 2015 at 20:19, Brian Haley brian.ha...@hp.com wrote:

Hi Kevin,

On 01/28/2015 03:50 AM, Kevin Benton wrote:
 Hi,

 Approximately a year and a half ago, the default DHCP lease
time in Neutron was
 increased from 120 seconds to 86400 seconds.[1] This was
done with the goal of
 reducing DHCP traffic with very little discussion (based on
what I can see in
 the review and bug report). While it does indeed reduce
DHCP traffic, I don't
 think any bug reports were filed showing that a 120 second
lease time resulted
 in too much traffic or that a jump all of the way to 86400
seconds was required
 instead of a value in the same order of magnitude.

 Why does this matter?

 Neutron ports can be updated with a new IP address from the
same subnet or
 another subnet on the same network. The port update will
result in anti-spoofing
 iptables rule changes that immediately stop the old IP
address from working on
 the host. This means the host is unreachable for 0-12 hours
based on the current
 default lease time without manual intervention[2] (assuming
half-lease length
 DHCP renewal attempts).

So I'll first comment on the problem.  You're essentially
pulling the rug out
from under these VMs by changing their IP (and that of their
router and DHCP/DNS
server), but you expect they should fail quickly and come
right back online.  In
a non-Neutron environment wouldn't the IT person that did this
need some pretty
good heat-resistant pants for all the flames from pissed-off
users?  Sure, the
guy on his laptop will just bounce the connection, but servers
(aka VMs) should
stay pretty static.  VMs are servers (and cows according to some).


I actually expect this kind of operation not to be one Neutron users
will do very often, mostly because regardless of whether you're in
the cloud or not, you'd still need to wear those heat-resistant pants.


The correct solution is to be able to renumber the network so
there is no issue
with the anti-spoofing rules dropping packets, or the VMs
having an unreachable
IP address, but that's a much bigger nut to crack.


Indeed. In my opinion the update IP operation sets false
expectations in users. I have considered disallowing PUT on
fixed_ips in the past but that did not go ahead because there were
users leveraging it.


 Why is this on the mailing list?

 In an attempt to make the VMs usable in a much shorter
timeframe following a
 Neutron port address change, I submitted a patch to reduce
the default DHCP
 lease time to 8 minutes.[3] However, this was upsetting to
several people,[4] so
 it was suggested I bring this discussion to the mailing
list. The following are
 the high-level concerns followed by my responses:

   * 8 minutes is arbitrary
   o Yes, but it's no more arbitrary than 1440 minutes. I
picked it as an
 interval because it is still 4 times larger than the last 
short value,
 but it still allows VMs to regain connectivity in 5
minutes in the
 event their IP is changed. If someone has a good
suggestion for another
 interval based on known dnsmasq QPS limits or some
other quantitative
 reason, please chime in here.

We run 48 hours as the default in our public cloud, and I did
some digging to
remind myself of the multiple reasons:

1. Too much DHCP traffic.  Sure, only that initial request is
broadcast, but
dnsmasq is very verbose and loves writing to syslog for
everything it does -
less is more.  Do a scale test with 10K VMs and you'll quickly
find out a large
portion of traffic is DHCP RENEWs, and syslog is huge.


This is correct, and something I overlooked in 

Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-28 Thread Salvatore Orlando
The patch Kevin points out increased the lease to 24 hours (which I agree
is as arbitrary as 2 minutes, 8 minutes, or 1 century) because it
introduced use of DHCPRELEASE message in the agent, which is supported by
dnsmasq (to the best of my knowledge) and is functionally similar to
FORCERENEW.

This should have provided resiliency against changes of IP address from the
Neutron API, as the agent would send a DHCPRELEASE message as soon as the
notification was received. When we reviewed the patch we verified that a
number of clients supported this message (to my shame I must admit I did not
consider Windows clients, however).

It seems like the problem perhaps is that DHCPRELEASE is actually not
working as expected, or not working at all?

Salvatore

On 28 January 2015 at 14:55, Ihar Hrachyshka ihrac...@redhat.com wrote:

  On 01/28/2015 09:50 AM, Kevin Benton wrote:

 Hi,

  Approximately a year and a half ago, the default DHCP lease time in
 Neutron was increased from 120 seconds to 86400 seconds.[1] This was done
 with the goal of reducing DHCP traffic with very little discussion (based
 on what I can see in the review and bug report). While it does indeed
 reduce DHCP traffic, I don't think any bug reports were filed showing that
 a 120 second lease time resulted in too much traffic or that a jump all of
 the way to 86400 seconds was required instead of a value in the same order
 of magnitude.


 I guess that would be a good case for FORCERENEW DHCP extension [1] though
 after digging thru dnsmasq code a bit, I doubt it supports the extension
 (though e.g. systemd dhcp client/server from networkd module do). Le sigh.

 [1]: https://tools.ietf.org/html/rfc3203


  Why does this matter?

  Neutron ports can be updated with a new IP address from the same subnet
 or another subnet on the same network. The port update will result in
 anti-spoofing iptables rule changes that immediately stop the old IP
 address from working on the host. This means the host is unreachable for
 0-12 hours based on the current default lease time without manual
 intervention[2] (assuming half-lease length DHCP renewal attempts).

  Why is this on the mailing list?

  In an attempt to make the VMs usable in a much shorter timeframe
 following a Neutron port address change, I submitted a patch to reduce the
 default DHCP lease time to 8 minutes.[3] However, this was upsetting to
 several people,[4] so it was suggested I bring this discussion to the
 mailing list. The following are the high-level concerns followed by my
 responses:

- 8 minutes is arbitrary
   - Yes, but it's no more arbitrary than 1440 minutes. I picked it as
   an interval because it is still 4 times larger than the last short value,
   but it still allows VMs to regain connectivity in 5 minutes in the event
   their IP is changed. If someone has a good suggestion for another interval
   based on known dnsmasq QPS limits or some other quantitative reason, please
   chime in here.


I think there is little to no point in arguing about an optimal default lease
time, simply because there isn't one. If you want to move that to 8 minutes,
that's fine for me.


 - other datacenters use long lease times
   - This is true, but it's not really a valid comparison. In most
   regular datacenters, updating a static DHCP lease has no effect on the data
   plane so it doesn't matter that the client doesn't react for hours/days
   (even with DHCP snooping enabled). However, in Neutron's case, the security
   groups are immediately updated so all traffic using the old address is
   blocked.

Kevin's comment here is totally reasonable, but it implies that the devised
mechanism based on DHCPRELEASE is not working!



 - dhcp traffic is scary because it's broadcast
   - ARP traffic is also broadcast and many clients will expire
   entries every 5-10 minutes and re-ARP. L2population may be used to prevent
   ARP propagation, so the comparison between DHCP and ARP isn't always
   relevant here.


I think this is a bit of a moot point. What's the impact of DHCP traffic,
even the DHCPDISCOVER broadcast, on the overall traffic on a network? It's
not like a DHCP packet is a train of several hundred Ethernet frames,
is it?





  Please reply back with your opinions/anecdotes/data related to short
 DHCP lease times.

  Cheers

  1.
 https://github.com/openstack/neutron/commit/d9832282cf656b162c51afdefb830dacab72defe
 2. Manual intervention could be an instance reboot, a dhcp client
 invocation via the console, or a delayed invocation right before the
 update. (all significantly more difficult to script than a simple update of
 a port's IP via the API).
 3. https://review.openstack.org/#/c/150595/
 4. http://i.imgur.com/xtvatkP.jpg

  --
  Kevin Benton



Re: [openstack-dev] [neutron] high dhcp lease times in neutron deployments considered harmful (or not???)

2015-01-28 Thread Brian Haley
Hi Kevin,

On 01/28/2015 03:50 AM, Kevin Benton wrote:
 Hi,
 
 Approximately a year and a half ago, the default DHCP lease time in Neutron
 was increased from 120 seconds to 86400 seconds.[1] This was done with the
 goal of reducing DHCP traffic with very little discussion (based on what I can
 see in the review and bug report). While it does indeed reduce DHCP traffic, I
 don't think any bug reports were filed showing that a 120 second lease time
 resulted in too much traffic or that a jump all of the way to 86400 seconds
 was required instead of a value in the same order of magnitude.
 
 Why does this matter? 
 
 Neutron ports can be updated with a new IP address from the same subnet or
 another subnet on the same network. The port update will result in
 anti-spoofing iptables rule changes that immediately stop the old IP address
 from working on the host. This means the host is unreachable for 0-12 hours
 based on the current default lease time without manual intervention[2]
 (assuming half-lease length DHCP renewal attempts).
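
For reference, the 0-12 hour figure can be reproduced with a trivial calculation. This is only a sketch under the stated assumption that clients first try to renew at half the lease length; the function name is illustrative.

```python
def worst_case_outage_s(lease_s, renew_fraction=0.5):
    """Longest wait before a client next contacts the DHCP server.

    Assumes the client renews at `renew_fraction` of the lease (half by
    default) and that the port's IP changed right after its last renewal.
    """
    return lease_s * renew_fraction


for label, lease in (("old default", 120), ("proposed", 480), ("current default", 86400)):
    print("%-16s lease=%6ds  worst case=%8.0fs" % (label, lease, worst_case_outage_s(lease)))
```

With the current default of 86400 seconds this gives 43200 seconds, i.e. 12 hours; with the proposed 480 seconds it gives 240 seconds.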

So I'll first comment on the problem.  You're essentially pulling the rug out
from under these VMs by changing their IP (and that of their router and DHCP/DNS
server), but you expect they should fail quickly and come right back online.  In
a non-Neutron environment wouldn't the IT person that did this need some pretty
good heat-resistant pants for all the flames from pissed-off users?  Sure, the
guy on his laptop will just bounce the connection, but servers (aka VMs) should
stay pretty static.  VMs are servers (and cows according to some).

The correct solution is to be able to renumber the network so there is no issue
with the anti-spoofing rules dropping packets, or the VMs having an unreachable
IP address, but that's a much bigger nut to crack.

 Why is this on the mailing list?
 
 In an attempt to make the VMs usable in a much shorter timeframe following a
 Neutron port address change, I submitted a patch to reduce the default DHCP
 lease time to 8 minutes.[3] However, this was upsetting to several people,[4]
 so it was suggested I bring this discussion to the mailing list. The
 following are the high-level concerns followed by my responses:
 
   * 8 minutes is arbitrary
   o Yes, but it's no more arbitrary than 1440 minutes. I picked it as an
 interval because it is still 4 times larger than the last short value,
 but it still allows VMs to regain connectivity in 5 minutes in the
 event their IP is changed. If someone has a good suggestion for another
 interval based on known dnsmasq QPS limits or some other quantitative
 reason, please chime in here.

We run 48 hours as the default in our public cloud, and I did some digging to
remind myself of the multiple reasons:

1. Too much DHCP traffic.  Sure, only that initial request is broadcast, but
dnsmasq is very verbose and loves writing to syslog for everything it does -
less is more.  Do a scale test with 10K VMs and you'll quickly find out a large
portion of traffic is DHCP RENEWs, and syslog is huge.

2. During a control-plane upgrade or outage, having a short DHCP lease time will
take all your VMs offline.  The old value of 2 minutes is not a realistic value
for an upgrade, and I don't think 8 minutes is much better.  Yes, when DHCP is
down you can't boot a new VM, but as long as customers can get to their existing
VMs they're pretty happy and won't scream bloody murder.

There's probably more, but those were the top two, with #2 being most important.
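
To put rough numbers on reason #1, a back-of-the-envelope sketch. It assumes every VM renews at half its lease and that renewals are spread evenly over time; it is a steady-state average, not a measurement from a real deployment.

```python
def renewals_per_second(num_vms, lease_s, renew_fraction=0.5):
    """Average DHCP renewals per second across `num_vms` clients.

    Assumes each client renews once every `lease_s * renew_fraction` seconds.
    """
    return float(num_vms) / (lease_s * renew_fraction)


for lease in (120, 480, 86400, 172800):  # 2 min, 8 min, 24 h, 48 h
    rate = renewals_per_second(10000, lease)
    print("lease=%6ds -> %.2f renewals/s for 10K VMs" % (lease, rate))
```

Each renewal also produces syslog output from dnsmasq, so the log volume scales with the same rate.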

   * other datacenters use long lease times
   o This is true, but it's not really a valid comparison. In most regular
 datacenters, updating a static DHCP lease has no effect on the data
 plane so it doesn't matter that the client doesn't react for hours/days
 (even with DHCP snooping enabled). However, in Neutron's case, the
 security groups are immediately updated so all traffic using the old
 address is blocked.

Yes, and choosing the lease time is a deployment decision that needs to take a
lot of things into account.  Like I said, we don't even use the default.  The
default should just be a good guess for a standard deployment, not a value that
caters towards the edge cases, especially when the value is tunable in 
neutron.conf.
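
As a small illustration of what "tunable" means here, the sketch below parses a sample config fragment. It assumes the option in question is `dhcp_lease_duration` in the [DEFAULT] section, with the 86400 second default discussed in this thread; the sample text is made up and plain `configparser` stands in for however Neutron actually loads its config.

```python
import configparser

SAMPLE_NEUTRON_CONF = """
[DEFAULT]
# DHCP lease duration in seconds (assumed option name)
dhcp_lease_duration = 480
"""


def read_lease_duration(conf_text, default=86400):
    """Return the configured lease duration, or `default` if it is unset."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getint("DEFAULT", "dhcp_lease_duration", fallback=default)


print(read_lease_duration(SAMPLE_NEUTRON_CONF))  # value set by the deployer
print(read_lease_duration("[DEFAULT]\n"))        # falls back to the default
```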

   * dhcp traffic is scary because it's broadcast
   o ARP traffic is also broadcast and many clients will expire entries
 every 5-10 minutes and re-ARP. L2population may be used to prevent ARP
 propagation, so the comparison between DHCP and ARP isn't always
 relevant here.

I don't recall anyone being scared of broadcast, and can't find any comments
regarding it in https://review.openstack.org/#/c/150595/

 Please reply back with your opinions/anecdotes/data related to short DHCP
 lease times.

I can only speculate on why 24 hours was chosen as the default back in 2013,
possibly because a lot of