Re: [ovs-discuss] [ovs-dev] Hypervisor down during upgrade OVS 2.10.x to 2.10.y

2019-09-06 Thread Han Zhou
Good finding. So it is a known kernel bug and it won't get fixed in that
kernel version. :(
From the OVS point of view, the problem is that if such a bug exists in the
kernel, it prevents ovs-vswitchd from being killed (even by SIGKILL),
because the process is stuck in a system call that is blocked by the kernel
bug. At that point, there is no way to either proceed with or roll back the
OVS reload.
Does anyone have a good idea how to deal with this problem during an OVS
reload?
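One idea (just a rough sketch, not something ovs-ctl does today; the helper
name vswitchd_is_stuck and the $vswitchd_pid variable are only illustrative)
would be to check after SIGKILL whether the old pid is sitting in
uninterruptible sleep, and abort the reload instead of pulling the module out
from under a live process:

```
# Sketch only: detect a pid that survived SIGKILL because it is stuck in
# uninterruptible sleep ("D" state) inside a kernel syscall.
# vswitchd_is_stuck is an illustrative name, not an existing ovs-ctl helper.
vswitchd_is_stuck () {
    pid=$1
    kill -9 "$pid" 2>/dev/null
    sleep 2
    # If the process is gone, the reload can continue normally.
    [ -d "/proc/$pid" ] || return 1
    # "State: D" in /proc/<pid>/status means uninterruptible sleep, so
    # SIGKILL cannot take effect until the blocked syscall returns.
    state=$(awk '/^State:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
    [ "$state" = "D" ]
}

if vswitchd_is_stuck "$vswitchd_pid"; then
    echo "ovs-vswitchd ($vswitchd_pid) cannot be killed; aborting reload" >&2
    exit 1
fi
```

That would at least turn "hypervisor down" into "reload refused", though it
still leaves the host running on the old module.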

Thanks,
Han

On Fri, Sep 6, 2019 at 4:17 PM aginwala wrote:

> Hi:
>
> Adding the correct ovs-discuss ML. I did get a chance to take a look at it
> a bit. I think this is a bug in the 4.4.0-104-generic kernel on Ubuntu
> 16.04, as discussed on the Ubuntu bug tracker at
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407 where it
> can be hit all of a sudden, matching the kernel log shared here:
> "unregister_netdevice: waiting for br0 to become free. Usage count = 1".
> Folks on that thread are proposing an upgrade to a newer kernel to get rid
> of this issue. Upstream Linux proposed relevant fixes at
> https://github.com/torvalds/linux/commit/ee60ad219f5c7c4fb2f047f88037770063ef785f
> to address related issues. I guess kernel folks can comment on this more.
> Not sure if I missed anything else.
>
> Maybe we can do some improvements in force-reload-kmod: right now
> stop_forwarding stops ovs-vswitchd, but the system stalls because br0
> (eth0 is added to br0) is busy, causing loss of network connectivity. In
> the current case the host recovers only after a restart. Not sure if we
> need to handle this corner case in OVS?
>
> On Wed, Aug 28, 2019 at 2:21 PM Jin, Liang via dev <
> ovs-...@openvswitch.org> wrote:
>
>>
>> Hi,
>> We recently upgraded OVS from one 2.10 version to another 2.10 version.
>> On some HVs, the HV went down while running force-reload-kmod.
>> In the ovs-ctl log, killing ovs-vswitchd failed, but the script still
>> went on to reload the modules.
>> ```
>> ovsdb-server is running with pid 2431
>> ovs-vswitchd is running with pid 2507
>> Thu Aug 22 23:13:49 UTC 2019:stop
>> 2019-08-22T23:13:50Z|1|fatal_signal|WARN|terminating with signal 14
>> (Alarm clock)
>> Alarm clock
>> 2019-08-22T23:13:51Z|1|fatal_signal|WARN|terminating with signal 14
>> (Alarm clock)
>> Alarm clock
>> * Exiting ovs-vswitchd (2507)
>> * Killing ovs-vswitchd (2507)
>> * Killing ovs-vswitchd (2507) with SIGKILL
>> * Killing ovs-vswitchd (2507) failed
>> * Exiting ovsdb-server (2431)
>> Thu Aug 22 23:14:58 UTC 2019:load-kmod
>> Thu Aug 22 23:14:58 UTC 2019:start --system-id=random --no-full-hostname
>> /usr/share/openvswitch/scripts/ovs-ctl: unknown option
>> "--no-full-hostname" (use --help for help)
>> * Starting ovsdb-server
>> * Configuring Open vSwitch system IDs
>> * ovs-vswitchd is already running
>> * Enabling remote OVSDB managers
>> ovsdb-server is running with pid 3860447
>> ovs-vswitchd is running with pid 2507
>> ovsdb-server is running with pid 3860447
>> ovs-vswitchd is running with pid 2507
>> Thu Aug 22 23:15:09 UTC 2019:load-kmod
>> Thu Aug 22 23:15:09 UTC 2019:force-reload-kmod --system-id=random
>> --no-full-hostname
>> /usr/share/openvswitch/scripts/ovs-ctl: unknown option
>> "--no-full-hostname" (use --help for help)
>> * Detected internal interfaces: br-int
>> Thu Aug 22 23:37:08 UTC 2019:stop
>> 2019-08-22T23:37:09Z|1|fatal_signal|WARN|terminating with signal 14
>> (Alarm clock)
>> Alarm clock
>> 2019-08-22T23:37:10Z|1|fatal_signal|WARN|terminating with signal 14
>> (Alarm clock)
>> Alarm clock
>> * Exiting ovs-vswitchd (2507)
>> * Killing ovs-vswitchd (2507)
>> * Killing ovs-vswitchd (2507) with SIGKILL
>> * Killing ovs-vswitchd (2507) failed
>> * Exiting ovsdb-server (3860447)
>> Thu Aug 22 23:40:42 UTC 2019:load-kmod
>> * Inserting openvswitch module
>> Thu Aug 22 23:40:42 UTC 2019:start --system-id=random --no-full-hostname
>> /usr/share/openvswitch/scripts/ovs-ctl: unknown option
>> "--no-full-hostname" (use --help for help)
>> * Starting ovsdb-server
>> * Configuring Open vSwitch system IDs
>> * Starting ovs-vswitchd
>> * Enabling remote OVSDB managers
>> ovsdb-server is running with pid 2399
>> ovs-vswitchd is running with pid 2440
>> ovsdb-server is running with pid 2399
>> ovs-vswitchd is running with pid 2440
>> Thu Aug 22 23:46:18 UTC 2019:load-kmod
>> Thu Aug 22 23:46:18 UTC 2019:force-reload-kmod --system-id=random
>> --no-full-hostname
>> /usr/share/openvswitch/scripts/ovs-ctl: unknown option
>> "--no-full-hostname" (use --help for help)
>> * Detected internal interfaces: br-int br0
>> * Saving flows
>> * Exiting ovsdb-server (2399)
>> * Starting ovsdb-server
>> * Configuring Open vSwitch system IDs
>> * Flush old conntrack entries
>> * Exiting ovs-vswitchd (2440)
>> * Saving interface configuration
>> * Removing datapath: system@ovs-system
>> * Removing openvswitch module
>> rmmod: ERROR: Module vxlan is in use by: i40e
>> * Forcing removal of vxlan module
>> * Inserting openvswitch module
>> * Starting 

Re: [ovs-discuss] [ovs-dev] Hypervisor down during upgrade OVS 2.10.x to 2.10.y

2019-09-06 Thread aginwala
Hi:

Adding the correct ovs-discuss ML. I did get a chance to take a look at it
a bit. I think this is a bug in the 4.4.0-104-generic kernel on Ubuntu
16.04, as discussed on the Ubuntu bug tracker at
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407 where it
can be hit all of a sudden, matching the kernel log shared here:
"unregister_netdevice: waiting for br0 to become free. Usage count = 1".
Folks on that thread are proposing an upgrade to a newer kernel to get rid
of this issue. Upstream Linux proposed relevant fixes at
https://github.com/torvalds/linux/commit/ee60ad219f5c7c4fb2f047f88037770063ef785f
to address related issues. I guess kernel folks can comment on this more.
Not sure if I missed anything else.

Maybe we can do some improvements in force-reload-kmod: right now
stop_forwarding stops ovs-vswitchd, but the system stalls because br0
(eth0 is added to br0) is busy, causing loss of network connectivity. In
the current case the host recovers only after a restart. Not sure if we
need to handle this corner case in OVS?
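For the force-reload-kmod path specifically, maybe a hard gate before the
datapath/module removal step would already help: only proceed once the old
ovs-vswitchd pid is really gone, and otherwise leave the old module loaded.
A rough sketch (the wait_for_exit helper and $old_vswitchd_pid variable are
illustrative only, not what ovs-ctl actually uses):

```
# Sketch only: refuse to remove the openvswitch module while the old
# ovs-vswitchd pid still exists (e.g. stuck in a kernel syscall).
wait_for_exit () {
    pid=$1
    tries=0
    while [ -d "/proc/$pid" ]; do
        tries=$((tries + 1))
        # Give up after ~30s; a pid stuck in the kernel never goes away.
        [ "$tries" -ge 30 ] && return 1
        sleep 1
    done
    return 0
}

# ... after the existing "Exiting/Killing ovs-vswitchd" steps ...
if ! wait_for_exit "$old_vswitchd_pid"; then
    echo "old ovs-vswitchd still present; skipping datapath/module removal" >&2
    exit 1    # keep the old module loaded rather than stalling the host
fi
# Only now remove system@ovs-system and rmmod openvswitch.
```

It does not fix the underlying kernel bug, but it would avoid proceeding
into the state where the box needs a power cycle.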

On Wed, Aug 28, 2019 at 2:21 PM Jin, Liang via dev 
wrote:

>
> Hi,
> We recently upgraded OVS from one 2.10 version to another 2.10 version.
> On some HVs, the HV went down while running force-reload-kmod.
> In the ovs-ctl log, killing ovs-vswitchd failed, but the script still
> went on to reload the modules.
> ```
> ovsdb-server is running with pid 2431
> ovs-vswitchd is running with pid 2507
> Thu Aug 22 23:13:49 UTC 2019:stop
> 2019-08-22T23:13:50Z|1|fatal_signal|WARN|terminating with signal 14
> (Alarm clock)
> Alarm clock
> 2019-08-22T23:13:51Z|1|fatal_signal|WARN|terminating with signal 14
> (Alarm clock)
> Alarm clock
> * Exiting ovs-vswitchd (2507)
> * Killing ovs-vswitchd (2507)
> * Killing ovs-vswitchd (2507) with SIGKILL
> * Killing ovs-vswitchd (2507) failed
> * Exiting ovsdb-server (2431)
> Thu Aug 22 23:14:58 UTC 2019:load-kmod
> Thu Aug 22 23:14:58 UTC 2019:start --system-id=random --no-full-hostname
> /usr/share/openvswitch/scripts/ovs-ctl: unknown option
> "--no-full-hostname" (use --help for help)
> * Starting ovsdb-server
> * Configuring Open vSwitch system IDs
> * ovs-vswitchd is already running
> * Enabling remote OVSDB managers
> ovsdb-server is running with pid 3860447
> ovs-vswitchd is running with pid 2507
> ovsdb-server is running with pid 3860447
> ovs-vswitchd is running with pid 2507
> Thu Aug 22 23:15:09 UTC 2019:load-kmod
> Thu Aug 22 23:15:09 UTC 2019:force-reload-kmod --system-id=random
> --no-full-hostname
> /usr/share/openvswitch/scripts/ovs-ctl: unknown option
> "--no-full-hostname" (use --help for help)
> * Detected internal interfaces: br-int
> Thu Aug 22 23:37:08 UTC 2019:stop
> 2019-08-22T23:37:09Z|1|fatal_signal|WARN|terminating with signal 14
> (Alarm clock)
> Alarm clock
> 2019-08-22T23:37:10Z|1|fatal_signal|WARN|terminating with signal 14
> (Alarm clock)
> Alarm clock
> * Exiting ovs-vswitchd (2507)
> * Killing ovs-vswitchd (2507)
> * Killing ovs-vswitchd (2507) with SIGKILL
> * Killing ovs-vswitchd (2507) failed
> * Exiting ovsdb-server (3860447)
> Thu Aug 22 23:40:42 UTC 2019:load-kmod
> * Inserting openvswitch module
> Thu Aug 22 23:40:42 UTC 2019:start --system-id=random --no-full-hostname
> /usr/share/openvswitch/scripts/ovs-ctl: unknown option
> "--no-full-hostname" (use --help for help)
> * Starting ovsdb-server
> * Configuring Open vSwitch system IDs
> * Starting ovs-vswitchd
> * Enabling remote OVSDB managers
> ovsdb-server is running with pid 2399
> ovs-vswitchd is running with pid 2440
> ovsdb-server is running with pid 2399
> ovs-vswitchd is running with pid 2440
> Thu Aug 22 23:46:18 UTC 2019:load-kmod
> Thu Aug 22 23:46:18 UTC 2019:force-reload-kmod --system-id=random
> --no-full-hostname
> /usr/share/openvswitch/scripts/ovs-ctl: unknown option
> "--no-full-hostname" (use --help for help)
> * Detected internal interfaces: br-int br0
> * Saving flows
> * Exiting ovsdb-server (2399)
> * Starting ovsdb-server
> * Configuring Open vSwitch system IDs
> * Flush old conntrack entries
> * Exiting ovs-vswitchd (2440)
> * Saving interface configuration
> * Removing datapath: system@ovs-system
> * Removing openvswitch module
> rmmod: ERROR: Module vxlan is in use by: i40e
> * Forcing removal of vxlan module
> * Inserting openvswitch module
> * Starting ovs-vswitchd
> * Restoring saved flows
> * Enabling remote OVSDB managers
> * Restoring interface configuration
> ```
>
> But in kern.log we see the messages below: the process could not exit
> because it was waiting for br0 to be released, and then ovs-ctl tried to
> kill the process (SIGTERM and then `kill -9`), which did not work because
> the kernel was stuck in an infinite loop. Then ovs-ctl tried to save the
> flows, and while saving the flows a core dump happened in the kernel. The
> HV stayed down until we restarted it.
> ```
> Aug 22 16:13:45 slx11c-9gjm kernel: [21177057.998961] device br0 left
> promiscuous mode
> Aug 22 16:13:55 slx11c-9gjm kernel: [21177068.044859]
> unregister_netdevice: waiting for br0 to become