Re: [openstack-dev] [Neutron][L3] Orphaned process cleanup

2016-01-28 Thread Ihar Hrachyshka

Sean M. Collins  wrote:


Hi,

I started poking a bit at https://bugs.launchpad.net/devstack/+bug/1535661

We have radvd processes that the l3 agent launches, and if the l3 agent
is terminated these radvd processes continue to run. I think we should
probably terminate them when the l3 agent is terminated, like if we are
in DevStack and doing an unstack.sh[1]. There's a fix on the DevStack
side but I'm waffling a bit on if it's the right thing to do or not[2].
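The behaviour described here follows from how POSIX processes work: sending SIGTERM to a parent process does not, by itself, signal its children. They are reparented (to init or a subreaper) and keep running. A minimal Python sketch, assuming a Linux host, reproduces this with a stand-in "agent" that spawns a long-sleeping child in place of radvd. All names are hypothetical; this is not Neutron code.

```python
from __future__ import annotations

import os
import signal
import subprocess
import sys

# Hypothetical stand-in "agent": spawns a long-lived child (playing radvd),
# prints the child's PID, then idles until it is signalled.
AGENT_SRC = (
    "import subprocess, sys, time\n"
    "child = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])\n"
    "print(child.pid, flush=True)\n"
    "time.sleep(60)\n"
)


def is_alive(pid: int) -> bool:
    """Signal 0 performs only the existence/permission check."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    return True


def agent_terminated_child_survives() -> tuple[int, bool]:
    agent = subprocess.Popen(
        [sys.executable, "-c", AGENT_SRC], stdout=subprocess.PIPE, text=True
    )
    child_pid = int(agent.stdout.readline())
    agent.send_signal(signal.SIGTERM)  # terminate the agent only
    agent.wait(timeout=5)              # the child is reparented once the agent exits
    agent.stdout.close()
    survived = is_alive(child_pid)
    if survived:
        os.kill(child_pid, signal.SIGTERM)  # tidy up the orphan ourselves
    return child_pid, survived


if __name__ == "__main__":
    pid, survived = agent_terminated_child_survives()
    print(f"child {pid} survived agent termination: {survived}")
```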

The only concern I have is if there are situations where the l3 agent
terminates, but we don't want data plane disruption. For example, if
something goes wrong and the L3 agent dies, if the OS will be sending a
SIGABRT (which my WIP patch doesn't catch[3], and radvd would continue to
run) or if a SIGTERM is issued, or worse, an OOM event occurs (I think that's
a SIGTERM too?) and you get an outage.
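Whether the agent could react at all depends on the signal. SIGTERM and SIGABRT can be caught, but SIGKILL cannot, and SIGKILL is what the Linux kernel's OOM killer sends. So an OOM-killed agent never gets a chance to clean up (or to deliberately leave things running). The minimal sketch below, assuming Linux and Python 3, checks which signals accept a handler. `can_install_handler` is a hypothetical helper, not part of Neutron, and must run in the main thread.

```python
import signal


def can_install_handler(signum: int) -> bool:
    """Return True if a Python-level handler can be installed for signum."""

    def handler(received, frame):
        # A real agent could decide here whether to stop its child processes.
        pass

    try:
        previous = signal.signal(signum, handler)
    except (OSError, ValueError):
        # The kernel rejects handlers for SIGKILL and SIGSTOP (EINVAL).
        return False
    # Restore the previous disposition so the check has no side effects.
    signal.signal(signum, previous if previous is not None else signal.SIG_DFL)
    return True


if __name__ == "__main__":
    for sig in (signal.SIGTERM, signal.SIGABRT, signal.SIGKILL):
        print(sig.name, can_install_handler(sig))
```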

[1]: https://github.com/openstack-dev/devstack/blob/master/lib/neutron-legacy#L767

[2]: https://review.openstack.org/269560

[3]: https://review.openstack.org/273228
--
Sean M. Collins

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


As Assaf pointed out, we don’t want to clean up processes when the agent dies.

In RDO, we ship OCF resources to manage our services with Pacemaker, and
there we trigger some scripts that clean up on service fencing:

https://github.com/openstack-packages/neutron/blob/rpm-master/neutron-netns-cleanup.init#L42

We kill radvd, netns-proxy, keepalived, and friends.

I think the ideal solution here would be a separate executable, similar
to neutron-netns-cleanup and neutron-ovs-cleanup
(neutron-l3-agent-cleanup?), that external tools could run when they
want to clean up after an agent.
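A standalone cleanup tool of the kind suggested above could, at its simplest, read the PID files an agent left behind and signal each process. The sketch below assumes, hypothetically, one `<name>.pid` file per spawned process in a single directory; that layout and the `cleanup_orphans` name are illustrative only, not Neutron's actual on-disk layout or CLI.

```python
from __future__ import annotations

import os
import signal
from pathlib import Path


def cleanup_orphans(pid_dir: str, sig: int = signal.SIGTERM) -> list[int]:
    """Send `sig` to every process listed in `pid_dir/*.pid` (hypothetical layout).

    Each PID file is removed afterwards. Returns the PIDs that were signalled.
    """
    signalled = []
    for pid_file in sorted(Path(pid_dir).glob("*.pid")):
        try:
            pid = int(pid_file.read_text().strip())
        except ValueError:
            pid = None  # empty or corrupt PID file
        if pid is not None:
            try:
                os.kill(pid, sig)
            except ProcessLookupError:
                pass  # process already gone; the PID file was stale
            else:
                signalled.append(pid)
        pid_file.unlink(missing_ok=True)
    return signalled
```

A real tool would also need to confirm that each PID still belongs to radvd, keepalived, etc. (for example via `/proc/<pid>/cmdline`) before signalling it, since PIDs are reused.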


Ihar



Re: [openstack-dev] [Neutron][L3] Orphaned process cleanup

2016-01-27 Thread Sean M. Collins
On Wed, Jan 27, 2016 at 05:06:03PM EST, Assaf Muller wrote:
> >> RDO systemd init script for the L3 agent will send a signal 15 when
> >> 'systemctl restart neutron-l3-agent' is executed. I assume
> >> Debian/Ubuntu do the same. It is imperative that agent restarts do not
> >> cause data plane interruption. This has been the case for the L3 agent
> >
> > But wouldn't it really be wiser to use SIGHUP to communicate the intent
> > to restart a process?
> 
> Maybe. I just checked, and on a Liberty-based RDO installation, sending
> SIGHUP to an L3 agent doesn't actually do anything. Specifically, it
> doesn't resync its routers (which restarting it with signal 15 does).

See, but there must be something that is starting the neutron l3 agent
again, *after* sending it a SIGTERM (signal 15). Then the l3 agent does
a full resync since it's started back up, based on some state accounting
done in what appears to be the plugin. Nothing about signal 15 actually
does any restarting. It just terminates the process.
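The distinction can be illustrated with a toy supervisor: a "restart" is a SIGTERM followed by a fresh start, and the new process has no memory of the old one, which is why the agent has to rebuild its view of the world by resyncing. `start_agent` and `supervisor_restart` are hypothetical stand-ins for what `systemctl restart` does, not systemd itself. Note also that systemd's default `KillMode=control-group` signals every process in the unit's cgroup when the unit stops, so keeping radvd or keepalived alive across a restart depends on the unit using `KillMode=process` or equivalent.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for the agent: a child process that sleeps until signalled.
AGENT_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]


def start_agent() -> subprocess.Popen:
    return subprocess.Popen(AGENT_CMD)


def supervisor_restart(proc: subprocess.Popen, timeout: float = 5.0) -> subprocess.Popen:
    """What a supervisor's restart amounts to: SIGTERM, wait, start anew."""
    proc.send_signal(signal.SIGTERM)  # signal 15 only terminates the process...
    proc.wait(timeout=timeout)
    return start_agent()  # ...the supervisor is what brings it back


if __name__ == "__main__":
    old = start_agent()
    new = supervisor_restart(old)
    print(f"old pid {old.pid} exited with {old.returncode}; new pid {new.pid}")
    new.terminate()
    new.wait()
```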

> 2016-01-27 20:45:35.075 14651 INFO neutron.agent.l3.agent [-] Agent has just been revived. Doing a full sync.

https://github.com/openstack/neutron/blob/ea8cafdfc0789bd01cf6b26adc6e5b7ee6b141d6/neutron/agent/l3/agent.py#L697

https://github.com/openstack/neutron/blob/ea8cafdfc0789bd01cf6b26adc6e5b7ee6b141d6/neutron/agent/l3/agent.py#L679
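The resync decision comes from server-side bookkeeping rather than from the signal itself. Below is a simplified, hypothetical model of that kind of heartbeat accounting: the server records when each agent last reported, treats an agent as dead once that report is older than a timeout, and answers its next report with a "revived" status that makes the agent schedule a full sync. `HeartbeatTracker`, `needs_full_sync` and the status strings are illustrative names. Neutron's actual logic is in the code linked above and on the server side, and may differ in detail.

```python
from __future__ import annotations

from dataclasses import dataclass, field

NEW, ALIVE, REVIVED = "new", "alive", "revived"


@dataclass
class HeartbeatTracker:
    """Hypothetical server-side record of agent heartbeats."""

    down_time: float  # seconds without a report before an agent counts as dead
    last_seen: dict[str, float] = field(default_factory=dict)

    def report_state(self, agent_id: str, now: float) -> str:
        previous = self.last_seen.get(agent_id)
        self.last_seen[agent_id] = now
        if previous is None:
            return NEW
        if now - previous > self.down_time:
            # The agent was considered dead; tell it to resync everything.
            return REVIVED
        return ALIVE


def needs_full_sync(status: str) -> bool:
    """Agent side: schedule a full resync when the server reports a revival."""
    return status == REVIVED
```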


-- 
Sean M. Collins



Re: [openstack-dev] [Neutron][L3] Orphaned process cleanup

2016-01-27 Thread Assaf Muller
On Wed, Jan 27, 2016 at 4:52 PM, Sean M. Collins  wrote:
> On Wed, Jan 27, 2016 at 04:24:00PM EST, Assaf Muller wrote:
>> On Wed, Jan 27, 2016 at 4:10 PM, Sean M. Collins  wrote:
>> > Hi,
>> >
>> > I started poking a bit at https://bugs.launchpad.net/devstack/+bug/1535661
>> >
>> > We have radvd processes that the l3 agent launches, and if the l3 agent
>> > is terminated these radvd processes continue to run. I think we should
>> > probably terminate them when the l3 agent is terminated, like if we are
>> > in DevStack and doing an unstack.sh[1]. There's a fix on the DevStack
>> > side but I'm waffling a bit on if it's the right thing to do or not[2].
>> >
>> > The only concern I have is if there are situations where the l3 agent
>> > terminates, but we don't want data plane disruption. For example, if
>> > something goes wrong and the L3 agent dies, if the OS will be sending a
>> > SIGABRT (which my WIP patch doesn't catch[3] and radvd would continue to 
>> > run) or if a
>> > SIGTERM is issued, or worse, an OOM event occurs (I think that's a
>> > SIGTERM too?) and you get an outage.
>>
>> RDO systemd init script for the L3 agent will send a signal 15 when
>> 'systemctl restart neutron-l3-agent' is executed. I assume
>> Debian/Ubuntu do the same. It is imperative that agent restarts do not
>> cause data plane interruption. This has been the case for the L3 agent
>
> But wouldn't it really be wiser to use SIGHUP to communicate the intent
> to restart a process?

Maybe. I just checked, and on a Liberty-based RDO installation, sending
SIGHUP to an L3 agent doesn't actually do anything. Specifically, it
doesn't resync its routers (which restarting it with signal 15 does).

>
> --
> Sean M. Collins



Re: [openstack-dev] [Neutron][L3] Orphaned process cleanup

2016-01-27 Thread Assaf Muller
On Wed, Jan 27, 2016 at 5:20 PM, Sean M. Collins  wrote:
> On Wed, Jan 27, 2016 at 05:06:03PM EST, Assaf Muller wrote:
>> >> RDO systemd init script for the L3 agent will send a signal 15 when
>> >> 'systemctl restart neutron-l3-agent' is executed. I assume
>> >> Debian/Ubuntu do the same. It is imperative that agent restarts do not
>> >> cause data plane interruption. This has been the case for the L3 agent
>> >
>> > But wouldn't it really be wiser to use SIGHUP to communicate the intent
>> > to restart a process?
>>
>> Maybe. I just checked, and on a Liberty-based RDO installation, sending
>> SIGHUP to an L3 agent doesn't actually do anything. Specifically, it
>> doesn't resync its routers (which restarting it with signal 15 does).
>
> See, but there must be something that is starting the neutron l3 agent
> again, *after* sending it a SIGTERM (signal 15).

That's why I wrote 'restarting it with signal 15'.

> Then the l3 agent does
> a full resync since it's started back up, based on some state accounting
> done in what appears to be the plugin. Nothing about signal 15 actually
> does any restarting. It just terminates the process.

Yup. The point stands: there's a difference between SIGTERM (signal 15)
followed by a start, and a SIGHUP. Currently, Neutron agents don't resync
after a SIGHUP (and I wouldn't expect them to; I'd just expect a SIGHUP to
reload configuration). Restarting an agent shouldn't stop any
agent-spawned processes such as radvd or keepalived, or clean up any of
its resources (namespaces, etc.), just as you wouldn't want the OVS agent
to destroy bridges and ports, or a nova-compute restart to interfere
with its qemu-kvm processes.
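The split described here, SIGHUP meaning "reload configuration" and SIGTERM meaning "exit, but leave spawned processes and namespaces alone", could be expressed in an agent roughly as follows. This is a minimal, hypothetical Python sketch (`AgentSignals` and its fields are illustrative, not Neutron's service code). The handlers only set flags so that the agent's main loop can act on them, and `signal.signal` must be called from the main thread.

```python
import signal


class AgentSignals:
    """Hypothetical agent-side signal policy."""

    def __init__(self) -> None:
        self.reload_count = 0
        self.stopping = False
        self.children_stopped = False  # deliberately never set by SIGTERM

    def install(self) -> None:
        signal.signal(signal.SIGHUP, self._on_sighup)
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sighup(self, signum, frame) -> None:
        # Re-read configuration; routers and child processes are untouched.
        self.reload_count += 1

    def _on_sigterm(self, signum, frame) -> None:
        # Leave the main loop, but keep radvd/keepalived running so that a
        # restart (SIGTERM, then start) does not disrupt the data plane.
        self.stopping = True
```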

>
>> 2016-01-27 20:45:35.075 14651 INFO neutron.agent.l3.agent [-] Agent has just been revived. Doing a full sync.
>
> https://github.com/openstack/neutron/blob/ea8cafdfc0789bd01cf6b26adc6e5b7ee6b141d6/neutron/agent/l3/agent.py#L697
>
> https://github.com/openstack/neutron/blob/ea8cafdfc0789bd01cf6b26adc6e5b7ee6b141d6/neutron/agent/l3/agent.py#L679
>
>
> --
> Sean M. Collins



Re: [openstack-dev] [Neutron][L3] Orphaned process cleanup

2016-01-27 Thread Assaf Muller
On Wed, Jan 27, 2016 at 4:10 PM, Sean M. Collins  wrote:
> Hi,
>
> I started poking a bit at https://bugs.launchpad.net/devstack/+bug/1535661
>
> We have radvd processes that the l3 agent launches, and if the l3 agent
> is terminated these radvd processes continue to run. I think we should
> probably terminate them when the l3 agent is terminated, like if we are
> in DevStack and doing an unstack.sh[1]. There's a fix on the DevStack
> side but I'm waffling a bit on if it's the right thing to do or not[2].
>
> The only concern I have is if there are situations where the l3 agent
> terminates, but we don't want data plane disruption. For example, if
> something goes wrong and the L3 agent dies, if the OS will be sending a
> SIGABRT (which my WIP patch doesn't catch[3] and radvd would continue to run) 
> or if a
> SIGTERM is issued, or worse, an OOM event occurs (I think that's a
> SIGTERM too?) and you get an outage.

The RDO systemd init script for the L3 agent will send signal 15 (SIGTERM)
when 'systemctl restart neutron-l3-agent' is executed. I assume
Debian/Ubuntu do the same. It is imperative that agent restarts do not
cause data plane interruption. This has been the case for the L3 agent
for a while, and more recently for the OVS agent. There's a difference
between an uninstallation (unstack.sh) and an agent restart/upgrade;
let's keep it that way :)

>
> [1]: https://github.com/openstack-dev/devstack/blob/master/lib/neutron-legacy#L767
>
> [2]: https://review.openstack.org/269560
>
> [3]: https://review.openstack.org/273228
> --
> Sean M. Collins



Re: [openstack-dev] [Neutron][L3] Orphaned process cleanup

2016-01-27 Thread Sean M. Collins
On Wed, Jan 27, 2016 at 04:24:00PM EST, Assaf Muller wrote:
> On Wed, Jan 27, 2016 at 4:10 PM, Sean M. Collins  wrote:
> > Hi,
> >
> > I started poking a bit at https://bugs.launchpad.net/devstack/+bug/1535661
> >
> > We have radvd processes that the l3 agent launches, and if the l3 agent
> > is terminated these radvd processes continue to run. I think we should
> > probably terminate them when the l3 agent is terminated, like if we are
> > in DevStack and doing an unstack.sh[1]. There's a fix on the DevStack
> > side but I'm waffling a bit on if it's the right thing to do or not[2].
> >
> > The only concern I have is if there are situations where the l3 agent
> > terminates, but we don't want data plane disruption. For example, if
> > something goes wrong and the L3 agent dies, if the OS will be sending a
> > SIGABRT (which my WIP patch doesn't catch[3] and radvd would continue to 
> > run) or if a
> > SIGTERM is issued, or worse, an OOM event occurs (I think that's a
> > SIGTERM too?) and you get an outage.
> 
> RDO systemd init script for the L3 agent will send a signal 15 when
> 'systemctl restart neutron-l3-agent' is executed. I assume
> Debian/Ubuntu do the same. It is imperative that agent restarts do not
> cause data plane interruption. This has been the case for the L3 agent

But wouldn't it really be wiser to use SIGHUP to communicate the intent
to restart a process? 

-- 
Sean M. Collins
