Re: [ovs-discuss] OVS 3.2.0 crashing setting port QOS

2024-02-01 Thread Adrian Moreno via discuss



On 1/29/24 23:25, Ilya Maximets wrote:

On 1/29/24 20:13, Daryl Wang via discuss wrote:

After upgrading from Open vSwitch 3.1.2-4 to 3.2.0-2 we've been consistently
seeing new OVS crashes when setting QoS on ports. Both packages were taken
from the Debian distribution (https://packages.debian.org/source/sid/openvswitch)
we're running on. From the core dump we're seeing the following backtrace:

# gdb --batch -ex bt /usr/sbin/ovs-vswitchd /core
[New LWP 67669]
[New LWP 67682]
[New LWP 67681]
[New LWP 67671]
[New LWP 67679]
[New LWP 67680]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `ovs-vswitchd unix:/tmp/ovs/db.sock -vconsole:emer
-vsyslog:err -vfile:info --ml'.
Program terminated with signal SIGABRT, Aborted.
#0  0x7fcacecbb0fc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7fcacf4cfa80 (LWP 67669))]
#0  0x7fcacecbb0fc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7fcacec6d472 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7fcacec574b2 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x560787952c7e in ovs_abort_valist (err_no=, 
format=, args=args@entry=0x7ffd14c3bce0) at ../lib/util.c:447
#4  0x560787952d14 in ovs_abort (err_no=, 
format=format@entry=0x5607879f3ec7 "%s: pthread_%s_%s failed") at ../lib/util.c:439
#5  0x56078791ee11 in ovs_mutex_lock_at (l_=l_@entry=0x56078934c6c8, 
where=where@entry=0x5607879f864b "../lib/netdev-linux.c:2575") at 
../lib/ovs-thread.c:76
#6  0x56078796d03d in netdev_linux_get_speed (netdev_=0x56078934c640, 
current=0x7ffd14c3be64, max=0x7ffd14c3be1c) at ../lib/netdev-linux.c:2575
#7  0x5607878c04f3 in netdev_get_speed (netdev=netdev@entry=0x56078934c640, 
current=current@entry=0x7ffd14c3be64, max=0x7ffd14c3be1c, max@entry=0x0) at 
../lib/netdev.c:1175
#8  0x560787968d67 in htb_parse_qdisc_details__ 
(netdev=netdev@entry=0x56078934c640, details=details@entry=0x56078934a880, 
hc=hc@entry=0x7ffd14c3beb0) at ../lib/netdev-linux.c:4804
#9  0x5607879755da in htb_tc_install (details=0x56078934a880, 
netdev=0x56078934c640) at ../lib/netdev-linux.c:4883
#10 htb_tc_install (netdev=0x56078934c640, details=0x56078934a880) at 
../lib/netdev-linux.c:4875
#11 0x560787974937 in netdev_linux_set_qos (netdev_=0x56078934c640, 
type=, details=0x56078934a880) at ../lib/netdev-linux.c:3054
#12 0x560787814ea5 in iface_configure_qos (qos=0x56078934a780, 
iface=0x560789349fd0) at ../vswitchd/bridge.c:4845
#13 bridge_reconfigure (ovs_cfg=ovs_cfg@entry=0x5607892dbf90) at 
../vswitchd/bridge.c:928
#14 0x560787817f7d in bridge_run () at ../vswitchd/bridge.c:3321
#15 0x56078780d205 in main (argc=, argv=) at 
../vswitchd/ovs-vswitchd.c:130

A shell script to reproduce the issue is:

#!/bin/sh

apt-get install openvswitch-{common,switch}{,-dbgsym}

# Don't need it running on the system
systemctl stop openvswitch-switch

set -e

cleanup() {
   ip link del veth0
   rm /tmp/ovs/conf.db
}

trap cleanup EXIT

# Setup our environment

ip link add veth0 type veth peer veth1

mkdir -p /tmp/ovs

export OVS_RUNDIR=/tmp/ovs
export OVS_LOGDIR=/tmp/ovs
export OVS_DBDIR=/tmp/ovs

/usr/share/openvswitch/scripts/ovs-ctl start
ovs-vsctl add-br demo
ovs-vsctl add-port demo veth1

# Make it crash

ovs-vsctl set Port veth1 qos=@qos \
   -- --id=@qos create QoS type=linux-htb queues:1=@high queues:2=@low \
   -- --id=@high create Queue other-config:priority=1 other-config:min-rate=0.1 \
   -- --id=@low  create Queue other-config:priority=6 other-config:min-rate=0.05

We built the reproduction script based on speculation that
https://github.com/openvswitch/ovs/commit/6240c0b4c80ea3d8dd1bf77526b04b55742de2ce
is related to the crash. Notably we don't seem to run into the problem when we
pass in a specific maximum bandwidth instead of relying on the interface's
maximum bandwidth.


Hi, Daryl.  Thanks for the report!

Looking at the stack trace, the root cause seems to be the following commit:
   
https://github.com/openvswitch/ovs/commit/b8f8fad8643518551cf742056ae8728c936674c6

It introduced the netdev_get_speed() call in QoS functions.
These functions are running under netdev mutex and the netdev_get_speed()
function is trying to take the same mutex for a second time.  That fails
with 'deadlock avoided' or something like that.

From the commit I linked it's clear why it's not happening if the max-rate
is specified.  The code just doesn't go that route.

To fix the issue, we need to have a lockless version of netdev_linux_get_speed()
and call it directly from the QoS functions without going through the generic
netdev API.

Adrian, since it was your original patch, could you, please, take a look
at the issue?



Sure, I'll take a look at it.



Or Daryl, maybe you want to fix it yourself?

Best regards, Ilya Maximets.



--
Adrián Moreno


Re: [ovs-discuss] OVS 3.2.0 crashing setting port QOS

2024-01-29 Thread Ilya Maximets via discuss
On 1/29/24 20:13, Daryl Wang via discuss wrote:
> After upgrading from Open vSwitch 3.1.2-4 to 3.2.0-2 we've been consistently
> seeing new OVS crashes when setting QoS on ports. Both packages were taken
> from the Debian distribution 
> (https://packages.debian.org/source/sid/openvswitch)
> we're running on. From the core dump we're seeing the following backtrace:
> 
> # gdb --batch -ex bt /usr/sbin/ovs-vswitchd /core 
> [New LWP 67669]
> [New LWP 67682]
> [New LWP 67681]
> [New LWP 67671]
> [New LWP 67679]
> [New LWP 67680]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by `ovs-vswitchd unix:/tmp/ovs/db.sock -vconsole:emer 
> -vsyslog:err -vfile:info --ml'.
> Program terminated with signal SIGABRT, Aborted.
> #0  0x7fcacecbb0fc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> [Current thread is 1 (Thread 0x7fcacf4cfa80 (LWP 67669))]
> #0  0x7fcacecbb0fc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x7fcacec6d472 in raise () from /lib/x86_64-linux-gnu/libc.so.6
> #2  0x7fcacec574b2 in abort () from /lib/x86_64-linux-gnu/libc.so.6
> #3  0x560787952c7e in ovs_abort_valist (err_no=, 
> format=, args=args@entry=0x7ffd14c3bce0) at ../lib/util.c:447
> #4  0x560787952d14 in ovs_abort (err_no=, 
> format=format@entry=0x5607879f3ec7 "%s: pthread_%s_%s failed") at 
> ../lib/util.c:439
> #5  0x56078791ee11 in ovs_mutex_lock_at (l_=l_@entry=0x56078934c6c8, 
> where=where@entry=0x5607879f864b "../lib/netdev-linux.c:2575") at 
> ../lib/ovs-thread.c:76
> #6  0x56078796d03d in netdev_linux_get_speed (netdev_=0x56078934c640, 
> current=0x7ffd14c3be64, max=0x7ffd14c3be1c) at ../lib/netdev-linux.c:2575
> #7  0x5607878c04f3 in netdev_get_speed 
> (netdev=netdev@entry=0x56078934c640, current=current@entry=0x7ffd14c3be64, 
> max=0x7ffd14c3be1c, max@entry=0x0) at ../lib/netdev.c:1175
> #8  0x560787968d67 in htb_parse_qdisc_details__ 
> (netdev=netdev@entry=0x56078934c640, details=details@entry=0x56078934a880, 
> hc=hc@entry=0x7ffd14c3beb0) at ../lib/netdev-linux.c:4804
> #9  0x5607879755da in htb_tc_install (details=0x56078934a880, 
> netdev=0x56078934c640) at ../lib/netdev-linux.c:4883
> #10 htb_tc_install (netdev=0x56078934c640, details=0x56078934a880) at 
> ../lib/netdev-linux.c:4875
> #11 0x560787974937 in netdev_linux_set_qos (netdev_=0x56078934c640, 
> type=, details=0x56078934a880) at ../lib/netdev-linux.c:3054
> #12 0x560787814ea5 in iface_configure_qos (qos=0x56078934a780, 
> iface=0x560789349fd0) at ../vswitchd/bridge.c:4845
> #13 bridge_reconfigure (ovs_cfg=ovs_cfg@entry=0x5607892dbf90) at 
> ../vswitchd/bridge.c:928
> #14 0x560787817f7d in bridge_run () at ../vswitchd/bridge.c:3321
> #15 0x56078780d205 in main (argc=, argv=) 
> at ../vswitchd/ovs-vswitchd.c:130
> 
> A shell script to reproduce the issue is:
> 
> #!/bin/sh
> 
> apt-get install openvswitch-{common,switch}{,-dbgsym}
> 
> # Don't need it running on the system
> systemctl stop openvswitch-switch
> 
> set -e
> 
> cleanup() {
>   ip link del veth0
>   rm /tmp/ovs/conf.db
> }
> 
> trap cleanup EXIT
> 
> # Setup our environment
> 
> ip link add veth0 type veth peer veth1
> 
> mkdir -p /tmp/ovs
> 
> export OVS_RUNDIR=/tmp/ovs
> export OVS_LOGDIR=/tmp/ovs
> export OVS_DBDIR=/tmp/ovs
> 
> /usr/share/openvswitch/scripts/ovs-ctl start
> ovs-vsctl add-br demo
> ovs-vsctl add-port demo veth1
> 
> # Make it crash
> 
> ovs-vsctl set Port veth1 qos=@qos \
>   -- --id=@qos create QoS type=linux-htb queues:1=@high queues:2=@low \
>   -- --id=@high create Queue other-config:priority=1 other-config:min-rate=0.1 \
>   -- --id=@low  create Queue other-config:priority=6 other-config:min-rate=0.05
> 
> We built the reproduction script based on speculation that
> https://github.com/openvswitch/ovs/commit/6240c0b4c80ea3d8dd1bf77526b04b55742de2ce
> is related to the crash. Notably we don't seem to run into the problem when we
> pass in a specific maximum bandwidth instead of relying on the interface's
> maximum bandwidth.

Hi, Daryl.  Thanks for the report!

Looking at the stack trace, the root cause seems to be the following commit:
  
https://github.com/openvswitch/ovs/commit/b8f8fad8643518551cf742056ae8728c936674c6

It introduced the netdev_get_speed() call in QoS functions.
These functions are running under netdev mutex and the netdev_get_speed()
function is trying to take the same mutex for a second time.  That fails
with 'deadlock avoided' or something like that.
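
The 'deadlock avoided' text is just strerror(EDEADLK): an error-checking
pthread mutex returns EDEADLK when the same thread tries to lock it twice,
and ovs_mutex_lock_at() turns any lock error into an abort.  A tiny
standalone demo of that mechanism (assuming the netdev mutex is the usual
error-checking ovs_mutex):

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_mutex_t m;
    pthread_mutexattr_t attr;

    /* Error-checking mutex: a recursive lock attempt fails with EDEADLK
     * instead of hanging. */
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);

    pthread_mutex_lock(&m);
    int err = pthread_mutex_lock(&m);  /* Same thread locks it again. */
    printf("second lock: %s\n", strerror(err));  /* "Resource deadlock avoided" */

    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return 0;
}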

From the commit I linked it's clear why it's not happening if the max-rate
is specified.  The code just doesn't go that route.

To fix the issue, we need to have a lockless version of netdev_linux_get_speed()
and call it directly from the QoS functions without going through the generic
netdev API.
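
Roughly along these lines (untested sketch just to illustrate the idea; the
_locked name and the exact split are made up here, not an actual patch):

/* lib/netdev-linux.c sketch: the body of today's netdev_linux_get_speed()
 * (minus the locking) moves into a new helper:
 *
 *   static int netdev_linux_get_speed_locked(struct netdev_linux *netdev,
 *                                            uint32_t *current, uint32_t *max)
 *       OVS_REQUIRES(netdev->mutex);
 *
 * and the netdev-class callback becomes a thin locking wrapper: */
static int
netdev_linux_get_speed(const struct netdev *netdev_,
                       uint32_t *current, uint32_t *max)
{
    struct netdev_linux *netdev = netdev_linux_cast(netdev_);
    int error;

    ovs_mutex_lock(&netdev->mutex);
    error = netdev_linux_get_speed_locked(netdev, current, max);
    ovs_mutex_unlock(&netdev->mutex);

    return error;
}

/* QoS code such as htb_parse_qdisc_details__() already runs with
 * netdev->mutex held, so it would call netdev_linux_get_speed_locked()
 * directly instead of going through netdev_get_speed(). */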

Adrian, since it was your original patch, could you, please, take a look
at the issue?

Or Daryl, maybe you want to fix it yourself?

Best regards, Ilya Maximets.

[ovs-discuss] OVS 3.2.0 crashing setting port QOS

2024-01-29 Thread Daryl Wang via discuss
After upgrading from Open vSwitch 3.1.2-4 to 3.2.0-2 we've been
consistently seeing new OVS crashes when setting QoS on ports. Both
packages were taken from the Debian distribution (
https://packages.debian.org/source/sid/openvswitch) we're running on. From
the core dump we're seeing the following backtrace:

# gdb --batch -ex bt /usr/sbin/ovs-vswitchd /core

[New LWP 67669]

[New LWP 67682]

[New LWP 67681]

[New LWP 67671]

[New LWP 67679]

[New LWP 67680]

[Thread debugging using libthread_db enabled]

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Core was generated by `ovs-vswitchd unix:/tmp/ovs/db.sock -vconsole:emer
-vsyslog:err -vfile:info --ml'.

Program terminated with signal SIGABRT, Aborted.

#0  0x7fcacecbb0fc in ?? () from /lib/x86_64-linux-gnu/libc.so.6

[Current thread is 1 (Thread 0x7fcacf4cfa80 (LWP 67669))]

#0  0x7fcacecbb0fc in ?? () from /lib/x86_64-linux-gnu/libc.so.6

#1  0x7fcacec6d472 in raise () from /lib/x86_64-linux-gnu/libc.so.6

#2  0x7fcacec574b2 in abort () from /lib/x86_64-linux-gnu/libc.so.6

#3  0x560787952c7e in ovs_abort_valist (err_no=,
format=, args=args@entry=0x7ffd14c3bce0) at ../lib/util.c:447

#4  0x560787952d14 in ovs_abort (err_no=,
format=format@entry=0x5607879f3ec7 "%s: pthread_%s_%s failed") at
../lib/util.c:439

#5  0x56078791ee11 in ovs_mutex_lock_at (l_=l_@entry=0x56078934c6c8,
where=where@entry=0x5607879f864b "../lib/netdev-linux.c:2575") at
../lib/ovs-thread.c:76

#6  0x56078796d03d in netdev_linux_get_speed (netdev_=0x56078934c640,
current=0x7ffd14c3be64, max=0x7ffd14c3be1c) at ../lib/netdev-linux.c:2575

#7  0x5607878c04f3 in netdev_get_speed (netdev=netdev@entry=0x56078934c640,
current=current@entry=0x7ffd14c3be64, max=0x7ffd14c3be1c, max@entry=0x0) at
../lib/netdev.c:1175

#8  0x560787968d67 in htb_parse_qdisc_details__
(netdev=netdev@entry=0x56078934c640,
details=details@entry=0x56078934a880, hc=hc@entry=0x7ffd14c3beb0) at
../lib/netdev-linux.c:4804

#9  0x5607879755da in htb_tc_install (details=0x56078934a880,
netdev=0x56078934c640) at ../lib/netdev-linux.c:4883

#10 htb_tc_install (netdev=0x56078934c640, details=0x56078934a880) at
../lib/netdev-linux.c:4875

#11 0x560787974937 in netdev_linux_set_qos (netdev_=0x56078934c640,
type=, details=0x56078934a880) at ../lib/netdev-linux.c:3054

#12 0x560787814ea5 in iface_configure_qos (qos=0x56078934a780,
iface=0x560789349fd0) at ../vswitchd/bridge.c:4845

#13 bridge_reconfigure (ovs_cfg=ovs_cfg@entry=0x5607892dbf90) at
../vswitchd/bridge.c:928

#14 0x560787817f7d in bridge_run () at ../vswitchd/bridge.c:3321

#15 0x56078780d205 in main (argc=, argv=)
at ../vswitchd/ovs-vswitchd.c:130



A shell script to reproduce the issue is:

#!/bin/sh

apt-get install openvswitch-{common,switch}{,-dbgsym}



# Don't need it running on the system

systemctl stop openvswitch-switch

set -e

cleanup() {

  ip link del veth0

  rm /tmp/ovs/conf.db

}

trap cleanup EXIT

# Setup our environment

ip link add veth0 type veth peer veth1

mkdir -p /tmp/ovs

export OVS_RUNDIR=/tmp/ovs

export OVS_LOGDIR=/tmp/ovs

export OVS_DBDIR=/tmp/ovs

/usr/share/openvswitch/scripts/ovs-ctl start

ovs-vsctl add-br demo

ovs-vsctl add-port demo veth1

# Make it crash

ovs-vsctl set Port veth1 qos=@qos \

  -- --id=@qos create QoS type=linux-htb queues:1=@high queues:2=@low \

  -- --id=@high create Queue other-config:priority=1 other-config:min-rate=0.1 \

  -- --id=@low  create Queue other-config:priority=6 other-config:min-rate=0.05

We built the reproduction script based on speculation that
https://github.com/openvswitch/ovs/commit/6240c0b4c80ea3d8dd1bf77526b04b55742de2ce
is related to the crash. Notably we don't seem to run into the problem when
we pass in a specific maximum bandwidth instead of relying on the
interface's maximum bandwidth.
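
For example, a variant of the last ovs-vsctl command that also sets an
explicit max-rate on the QoS record (the 1 Gbps figure below is just an
arbitrary illustration) does not seem to trigger the abort for us:

ovs-vsctl set Port veth1 qos=@qos \
  -- --id=@qos create QoS type=linux-htb other-config:max-rate=1000000000 \
       queues:1=@high queues:2=@low \
  -- --id=@high create Queue other-config:priority=1 other-config:min-rate=0.1 \
  -- --id=@low  create Queue other-config:priority=6 other-config:min-rate=0.05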

Sincerely,

Daryl Wang
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss