Re: [openstack-dev] [Neutron][L3] Orphaned process cleanup
Sean M. Collins wrote:
> Hi,
>
> I started poking a bit at https://bugs.launchpad.net/devstack/+bug/1535661
>
> We have radvd processes that the l3 agent launches, and if the l3 agent
> is terminated these radvd processes continue to run. I think we should
> probably terminate them when the l3 agent is terminated, like if we are
> in DevStack and doing an unstack.sh[1]. There's a fix on the DevStack
> side but I'm waffling a bit on if it's the right thing to do or not[2].
>
> The only concern I have is if there are situations where the l3 agent
> terminates, but we don't want data plane disruption. For example, if
> something goes wrong and the L3 agent dies, if the OS will be sending a
> SIGABRT (which my WIP patch doesn't catch[3] and radvd would continue to
> run) or if a SIGTERM is issued, or worse, an OOM event occurs (I think
> that's a SIGTERM too?) and you get an outage.
>
> [1]: https://github.com/openstack-dev/devstack/blob/master/lib/neutron-legacy#L767
> [2]: https://review.openstack.org/269560
> [3]: https://review.openstack.org/273228
>
> --
> Sean M. Collins

As Assaf pointed out, we don’t want to clean up processes when an agent
dies. In RDO, we ship OCF resources to manage our services using
pacemaker, and there we trigger some scripts that clean up on service
fencing:

https://github.com/openstack-packages/neutron/blob/rpm-master/neutron-netns-cleanup.init#L42

We kill radvd, netns-proxy, keepalived, and friends.

I think the ideal solution here would be to have a separate executable,
similar to neutron-netns-cleanup and neutron-ovs-cleanup
(neutron-l3-agent-cleanup?), that would be executed by external tools
that want to clean up after an agent.

Ihar
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
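A separate cleanup executable of the kind suggested above could be sketched roughly as follows. This is only an illustration, not an existing Neutron tool: the function names (read_pid, is_running, cleanup) are hypothetical, and so is the assumption that every helper process (radvd, keepalived, ...) has a *.pid file somewhere under one directory.

```python
import os
import signal
from pathlib import Path
from typing import List, Optional


def read_pid(pid_file: Path) -> Optional[int]:
    """Return the PID stored in pid_file, or None if unreadable or malformed."""
    try:
        return int(pid_file.read_text().strip())
    except (OSError, ValueError):
        return None


def is_running(pid: int) -> bool:
    """Check for a process with this PID; signal 0 only performs the check."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def cleanup(pid_dir: Path, sig: int = signal.SIGTERM) -> List[int]:
    """Signal every live process named by a *.pid file under pid_dir.

    Hypothetical layout: one pid file per helper process. Returns the PIDs
    that were signalled.
    """
    signalled = []
    for pid_file in sorted(pid_dir.rglob("*.pid")):
        pid = read_pid(pid_file)
        if pid is not None and is_running(pid):
            os.kill(pid, sig)
            signalled.append(pid)
    return signalled
```

An external tool (a fencing script, unstack.sh) would call cleanup() on whatever directory holds the helpers' pid files. The agent itself never calls it, so a plain agent restart leaves the data plane alone.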
Re: [openstack-dev] [Neutron][L3] Orphaned process cleanup
On Wed, Jan 27, 2016 at 05:06:03PM EST, Assaf Muller wrote:
> >> RDO systemd init script for the L3 agent will send a signal 15 when
> >> 'systemctl restart neutron-l3-agent' is executed. I assume
> >> Debian/Ubuntu do the same. It is imperative that agent restarts do not
> >> cause data plane interruption. This has been the case for the L3 agent
> >
> > But wouldn't it really be wiser to use SIGHUP to communicate the intent
> > to restart a process?
>
> Maybe. I just checked and on a Liberty-based RDO installation, sending
> SIGHUP to an L3 agent doesn't actually do anything. Specifically it
> doesn't resync its routers (which restarting it with signal 15 does).

See, but there must be something that is starting the neutron l3 agent
again, *after* sending it a SIGTERM (signal 15). Then the l3 agent does
a full resync since it's started back up, based on some state accounting
done in what appears to be the plugin. Nothing about signal 15 actually
does any restarting. It just terminates the process.

> 2016-01-27 20:45:35.075 14651 INFO neutron.agent.l3.agent [-] Agent has just
> been revived. Doing a full sync.

https://github.com/openstack/neutron/blob/ea8cafdfc0789bd01cf6b26adc6e5b7ee6b141d6/neutron/agent/l3/agent.py#L697
https://github.com/openstack/neutron/blob/ea8cafdfc0789bd01cf6b26adc6e5b7ee6b141d6/neutron/agent/l3/agent.py#L679

--
Sean M. Collins
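To make the distinction concrete, here is a minimal sketch of the behaviour described in this thread: SIGTERM only stops the agent, a later start triggers the full sync, and neither touches the spawned helpers. The class and attribute names are made up for illustration and are not Neutron's actual code; treating SIGHUP as a configuration reload is likewise an assumption. Note also that SIGKILL, which is what the Linux OOM killer sends, cannot be caught at all, so no handler like these would run in that case.

```python
import signal
from typing import List


class AgentSketch:
    """Toy model of an agent's lifecycle; hypothetical, not Neutron code."""

    def __init__(self, child_pids: List[int]):
        # PIDs of helper processes (radvd, keepalived, ...) spawned earlier.
        self.child_pids = list(child_pids)
        self.running = False
        self.full_syncs = 0
        self.config_reloads = 0

    def start(self) -> None:
        # A freshly started agent resyncs all of its routers.
        self.running = True
        self.full_syncs += 1

    def handle_sigterm(self, signum, frame) -> None:
        # Stop the agent itself; deliberately leave child_pids alone so the
        # data plane keeps working until something starts the agent again.
        self.running = False

    def handle_sighup(self, signum, frame) -> None:
        # Assumed meaning of SIGHUP: reload configuration, no resync.
        self.config_reloads += 1

    def install_handlers(self) -> None:
        # signal.signal() may only be called from the main thread.
        signal.signal(signal.SIGTERM, self.handle_sigterm)
        signal.signal(signal.SIGHUP, self.handle_sighup)
```

In this model a "restart" is two separate events, handle_sigterm() followed by start() from whatever supervises the process, which is why only that sequence produces a resync.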
Re: [openstack-dev] [Neutron][L3] Orphaned process cleanup
On Wed, Jan 27, 2016 at 4:52 PM, Sean M. Collins wrote:
> On Wed, Jan 27, 2016 at 04:24:00PM EST, Assaf Muller wrote:
>> On Wed, Jan 27, 2016 at 4:10 PM, Sean M. Collins wrote:
>> > Hi,
>> >
>> > I started poking a bit at https://bugs.launchpad.net/devstack/+bug/1535661
>> >
>> > We have radvd processes that the l3 agent launches, and if the l3 agent
>> > is terminated these radvd processes continue to run. I think we should
>> > probably terminate them when the l3 agent is terminated, like if we are
>> > in DevStack and doing an unstack.sh[1]. There's a fix on the DevStack
>> > side but I'm waffling a bit on if it's the right thing to do or not[2].
>> >
>> > The only concern I have is if there are situations where the l3 agent
>> > terminates, but we don't want data plane disruption. For example, if
>> > something goes wrong and the L3 agent dies, if the OS will be sending a
>> > SIGABRT (which my WIP patch doesn't catch[3] and radvd would continue to
>> > run) or if a SIGTERM is issued, or worse, an OOM event occurs (I think
>> > that's a SIGTERM too?) and you get an outage.
>>
>> RDO systemd init script for the L3 agent will send a signal 15 when
>> 'systemctl restart neutron-l3-agent' is executed. I assume
>> Debian/Ubuntu do the same. It is imperative that agent restarts do not
>> cause data plane interruption. This has been the case for the L3 agent
>
> But wouldn't it really be wiser to use SIGHUP to communicate the intent
> to restart a process?

Maybe. I just checked and on a Liberty-based RDO installation, sending
SIGHUP to an L3 agent doesn't actually do anything. Specifically it
doesn't resync its routers (which restarting it with signal 15 does).

>
> --
> Sean M. Collins
Re: [openstack-dev] [Neutron][L3] Orphaned process cleanup
On Wed, Jan 27, 2016 at 5:20 PM, Sean M. Collins wrote:
> On Wed, Jan 27, 2016 at 05:06:03PM EST, Assaf Muller wrote:
>> >> RDO systemd init script for the L3 agent will send a signal 15 when
>> >> 'systemctl restart neutron-l3-agent' is executed. I assume
>> >> Debian/Ubuntu do the same. It is imperative that agent restarts do not
>> >> cause data plane interruption. This has been the case for the L3 agent
>> >
>> > But wouldn't it really be wiser to use SIGHUP to communicate the intent
>> > to restart a process?
>>
>> Maybe. I just checked and on a Liberty-based RDO installation, sending
>> SIGHUP to an L3 agent doesn't actually do anything. Specifically it
>> doesn't resync its routers (which restarting it with signal 15 does).
>
> See, but there must be something that is starting the neutron l3 agent
> again, *after* sending it a SIGTERM (signal 15).

That's why I wrote 'restarting it with signal 15'.

> Then the l3 agent does
> a full resync since it's started back up, based on some state accounting
> done in what appears to be the plugin. Nothing about signal 15 actually
> does any restarting. It just terminates the process.

Yup. The point stands: there's a difference between sig 15 then start,
and a SIGHUP. Currently, Neutron agents don't resync after a SIGHUP (and
I wouldn't expect them to; I'd just expect a SIGHUP to reload
configuration). Restarting an agent shouldn't stop any agent-spawned
processes like radvd or keepalived, or perform any cleanup of its
resources (namespaces, etc.), just like you wouldn't want the OVS agent
to destroy bridges and ports, and you wouldn't want a restart of
nova-compute to interfere with its qemu-kvm processes.

>
>> 2016-01-27 20:45:35.075 14651 INFO neutron.agent.l3.agent [-] Agent has just
>> been revived. Doing a full sync.
>
> https://github.com/openstack/neutron/blob/ea8cafdfc0789bd01cf6b26adc6e5b7ee6b141d6/neutron/agent/l3/agent.py#L697
> https://github.com/openstack/neutron/blob/ea8cafdfc0789bd01cf6b26adc6e5b7ee6b141d6/neutron/agent/l3/agent.py#L679
>
> --
> Sean M. Collins
Re: [openstack-dev] [Neutron][L3] Orphaned process cleanup
On Wed, Jan 27, 2016 at 4:10 PM, Sean M. Collins wrote:
> Hi,
>
> I started poking a bit at https://bugs.launchpad.net/devstack/+bug/1535661
>
> We have radvd processes that the l3 agent launches, and if the l3 agent
> is terminated these radvd processes continue to run. I think we should
> probably terminate them when the l3 agent is terminated, like if we are
> in DevStack and doing an unstack.sh[1]. There's a fix on the DevStack
> side but I'm waffling a bit on if it's the right thing to do or not[2].
>
> The only concern I have is if there are situations where the l3 agent
> terminates, but we don't want data plane disruption. For example, if
> something goes wrong and the L3 agent dies, if the OS will be sending a
> SIGABRT (which my WIP patch doesn't catch[3] and radvd would continue to
> run) or if a SIGTERM is issued, or worse, an OOM event occurs (I think
> that's a SIGTERM too?) and you get an outage.

RDO systemd init script for the L3 agent will send a signal 15 when
'systemctl restart neutron-l3-agent' is executed. I assume
Debian/Ubuntu do the same. It is imperative that agent restarts do not
cause data plane interruption. This has been the case for the L3 agent
for a while, and recently for the OVS agent. There's a difference
between an uninstallation (unstack.sh) and an agent restart/upgrade;
let's keep it that way :)

>
> [1]: https://github.com/openstack-dev/devstack/blob/master/lib/neutron-legacy#L767
> [2]: https://review.openstack.org/269560
> [3]: https://review.openstack.org/273228
> --
> Sean M. Collins
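One way a helper such as radvd can be kept out of the way of signals aimed at the agent is to launch it in its own session. The sketch below assumes a POSIX system and uses Python's subprocess module; spawn_detached is a hypothetical helper name, and this is not necessarily how the L3 agent launches its processes.

```python
import subprocess
from typing import List


def spawn_detached(argv: List[str]) -> subprocess.Popen:
    """Start argv in a new session (POSIX only).

    start_new_session=True makes the child call setsid() before exec, so it
    leaves the parent's process group and is not hit by signals sent to that
    group (for example Ctrl-C in a terminal, or os.killpg on the parent).
    """
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )
```

A child started this way keeps running after the parent exits, so stopping it becomes the job of an explicit cleanup step (the uninstallation case) rather than a side effect of an agent restart.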
Re: [openstack-dev] [Neutron][L3] Orphaned process cleanup
On Wed, Jan 27, 2016 at 04:24:00PM EST, Assaf Muller wrote:
> On Wed, Jan 27, 2016 at 4:10 PM, Sean M. Collins wrote:
> > Hi,
> >
> > I started poking a bit at https://bugs.launchpad.net/devstack/+bug/1535661
> >
> > We have radvd processes that the l3 agent launches, and if the l3 agent
> > is terminated these radvd processes continue to run. I think we should
> > probably terminate them when the l3 agent is terminated, like if we are
> > in DevStack and doing an unstack.sh[1]. There's a fix on the DevStack
> > side but I'm waffling a bit on if it's the right thing to do or not[2].
> >
> > The only concern I have is if there are situations where the l3 agent
> > terminates, but we don't want data plane disruption. For example, if
> > something goes wrong and the L3 agent dies, if the OS will be sending a
> > SIGABRT (which my WIP patch doesn't catch[3] and radvd would continue to
> > run) or if a SIGTERM is issued, or worse, an OOM event occurs (I think
> > that's a SIGTERM too?) and you get an outage.
>
> RDO systemd init script for the L3 agent will send a signal 15 when
> 'systemctl restart neutron-l3-agent' is executed. I assume
> Debian/Ubuntu do the same. It is imperative that agent restarts do not
> cause data plane interruption. This has been the case for the L3 agent

But wouldn't it really be wiser to use SIGHUP to communicate the intent
to restart a process?

--
Sean M. Collins