Public bug reported:

This is a very old issue that ended up rejected as an invalid feature, but I could not find an ideal solution, so I am raising it again. I wonder what others think of it.

It is heavily related to the old issue
(https://bugs.launchpad.net/neutron/+bug/1468236), and I have
reconstructed the problem from my understanding.

Problems
- A giant shared provider network with more than 10,000 ports.
- Several DHCP agents serving the network, even one per hypervisor as in the Calico project.
- A scalability issue occurs: the DHCP lease file is still not updated after the VM has started.

Solutions from the reporter
1. Add a distributed flag to the DHCP agent, and provision a DHCP agent on every compute node.
2. Change the DHCP agent notifier to target DHCP agents per host.
3. Do not spread DHCP traffic outside the local hypervisor.

Conclusion
- Rejected because:
- Solution (2) adds significant complexity to the agent notifier RPC.
- (3) is not a general solution.
- It is even worse for migration; there are many side effects we would have to care about.
- There are existing building blocks that could achieve the purpose. (This was mentioned on IRC, but I still do not understand which building blocks were meant.)

Our private cluster is very much like Calico: we have a giant provider
network, make it routable with quagga, and run a DHCP agent per compute
node. I believe the community has formed some consensus that this kind
of architecture handles scale well, judging by approaches such as
routed provider networks.

And to achieve this architecture in the absence of L2 connectivity,
modifying the DHCP agent cannot be avoided, since its default HA
behavior causes critical DB performance issues.

At the same time, I absolutely agree with the comment that worries
about the unnecessary complexity of distributed approaches like DVR.

So what I suggest is:
- Do not modify current DHCP agent behavior such as the notifier-side API; this keeps the migration logic unharmed.
- Do not change the DHCP HA concept or the L2 agent at all.
- Just add a distributed flag for the DHCP agent, and add host-filtering logic to the handler-side RPCs (get_active_network_info, get_network_info) only when the DHCP agent is distributed.
- Operators get a slightly new concept of distributed DHCP, in which the agent serves only the ports on its local hypervisor.
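To make the host-filtering idea concrete, here is a minimal Python sketch of how a handler-side RPC could narrow the port list for a distributed agent. The names here (`Port`, `binding_host_id`, `filter_ports_for_agent`) are illustrative assumptions, not Neutron's actual code:

```python
# Hypothetical sketch: host filtering on the handler-side RPC.
# When the agent is not distributed, behavior is unchanged and the
# full port list is returned (classic HA semantics).

from dataclasses import dataclass


@dataclass
class Port:
    id: str
    binding_host_id: str  # hypervisor the port is bound to


def filter_ports_for_agent(ports, agent_host, distributed):
    """Return only the ports a distributed DHCP agent should serve."""
    if not distributed:
        return list(ports)
    return [p for p in ports if p.binding_host_id == agent_host]


ports = [
    Port("p1", "compute-1"),
    Port("p2", "compute-2"),
    Port("p3", "compute-1"),
]

# A distributed agent on compute-1 only sees its local ports.
local = filter_ports_for_agent(ports, "compute-1", distributed=True)
print([p.id for p in local])  # ['p1', 'p3']
```

The point is that the filter lives entirely on the server side of the RPC, so the notifier and the agent itself stay untouched.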

Then we gain from this change:
- Reduced performance overhead. I found the performance penalty is mostly on the DB side (fetching ports with get_active_network_info() and completing the provisioning step with dhcp_ready_on_ports()); the RPC fanout is minor.
- A new concept in which the DHCP agent failure domain is split per host.

Any comments are appreciated.

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: rfe

** Tags added: rfe


https://bugs.launchpad.net/bugs/1806390

Title:
  [RFE] Distributed DHCP agent

Status in neutron:
  New

