Re: [ovs-discuss] OVS 3.2.0 crashing setting port QOS
On 1/29/24 23:25, Ilya Maximets wrote:
> On 1/29/24 20:13, Daryl Wang via discuss wrote:
>> After upgrading from Open vSwitch 3.1.2-4 to 3.2.0-2 we've been consistently
>> seeing new OVS crashes when setting QoS on ports.
>>
>> [...]
>>
>> Notably we don't seem to run into the problem when we pass in a specific
>> maximum bandwidth instead of relying on the interface's maximum bandwidth.
>
> Hi, Daryl. Thanks for the report!
>
> Looking at the stack trace, the root cause seems to be the following commit:
>   https://github.com/openvswitch/ovs/commit/b8f8fad8643518551cf742056ae8728c936674c6
>
> It introduced the netdev_get_speed() call in QoS functions. These functions
> are running under netdev mutex and the netdev_get_speed() function is trying
> to take the same mutex for a second time. That fails with 'deadlock avoided'
> or something like that.
>
> From the commit I linked it's clear why it's not happening if the max-rate
> is specified. The code just doesn't go that route.
>
> To fix the issue, we need to have a lockless version of
> netdev_linux_get_speed() and call it directly from the QoS functions without
> going through generic netdev API.
>
> Adrian, since it was your original patch, could you, please, take a look at
> the issue?

Sure, I'll take a look at it.

> Or Daryl, maybe you want to fix it yourself?
>
> Best regards, Ilya Maximets.

--
Adrián Moreno
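
For illustration, the split Ilya proposes (a lockless helper for callers that already hold the mutex, plus a locking wrapper for everyone else) could look roughly like the sketch below. All names here (demo_dev, demo_dev_init, demo_get_speed_locked, demo_get_speed, demo_set_qos) are hypothetical, and this is not the OVS code or the eventual patch. It only assumes an error-checking pthread mutex guarding the device, and a QoS path that falls back to the link speed when no max-rate is given.

```c
#define _XOPEN_SOURCE 700   /* For PTHREAD_MUTEX_ERRORCHECK under strict -std. */
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for a netdev; not the OVS structure. */
struct demo_dev {
    pthread_mutex_t mutex;   /* Error-checking mutex guarding 'speed_bps'. */
    uint64_t speed_bps;      /* Link speed in bits per second. */
};

void
demo_dev_init(struct demo_dev *dev, uint64_t speed_bps)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&dev->mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    dev->speed_bps = speed_bps;
}

/* Lockless variant: the caller must already hold dev->mutex. */
static uint64_t
demo_get_speed_locked(const struct demo_dev *dev)
{
    return dev->speed_bps;
}

/* Generic entry point: takes the mutex itself, so it must not be called
 * by code that already holds it. */
uint64_t
demo_get_speed(struct demo_dev *dev)
{
    uint64_t speed;
    int error = pthread_mutex_lock(&dev->mutex);

    assert(!error);
    speed = demo_get_speed_locked(dev);
    pthread_mutex_unlock(&dev->mutex);
    return speed;
}

/* QoS-style caller that runs under the mutex. It uses the lockless helper
 * and only needs the link speed when no explicit max rate was given.
 * Returns the rate cap that would be configured. */
uint64_t
demo_set_qos(struct demo_dev *dev, uint64_t max_rate_bps)
{
    uint64_t rate;
    int error = pthread_mutex_lock(&dev->mutex);

    assert(!error);
    rate = max_rate_bps ? max_rate_bps : demo_get_speed_locked(dev);
    pthread_mutex_unlock(&dev->mutex);
    return rate;
}
```

Calling demo_get_speed() from inside demo_set_qos() instead of the _locked helper would relock the mutex from the same thread, which is the pattern in frames #5 to #11 of the backtrace. The max_rate_bps branch also shows why an explicit max-rate would sidestep the problem in this sketch: the speed lookup is never reached.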
Re: [ovs-discuss] OVS 3.2.0 crashing setting port QOS
On 1/29/24 20:13, Daryl Wang via discuss wrote:
> After upgrading from Open vSwitch 3.1.2-4 to 3.2.0-2 we've been consistently
> seeing new OVS crashes when setting QoS on ports.
>
> [...]
>
> Notably we don't seem to run into the problem when we pass in a specific
> maximum bandwidth instead of relying on the interface's maximum bandwidth.

Hi, Daryl. Thanks for the report!

Looking at the stack trace, the root cause seems to be the following commit:
  https://github.com/openvswitch/ovs/commit/b8f8fad8643518551cf742056ae8728c936674c6

It introduced the netdev_get_speed() call in QoS functions. These functions
are running under netdev mutex and the netdev_get_speed() function is trying
to take the same mutex for a second time. That fails with 'deadlock avoided'
or something like that.

From the commit I linked it's clear why it's not happening if the max-rate
is specified. The code just doesn't go that route.

To fix the issue, we need to have a lockless version of
netdev_linux_get_speed() and call it directly from the QoS functions without
going through generic netdev API.

Adrian, since it was your original patch, could you, please, take a look at
the issue?

Or Daryl, maybe you want to fix it yourself?

Best regards, Ilya Maximets.
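
The failure mode described above can be seen outside OVS with a plain POSIX mutex. The sketch below assumes the netdev mutex behaves like one created with PTHREAD_MUTEX_ERRORCHECK, which would fit the "pthread_%s_%s failed" abort in the backtrace. relock_error() is a hypothetical helper, not an OVS function.

```c
#define _XOPEN_SOURCE 700   /* For PTHREAD_MUTEX_ERRORCHECK under strict -std. */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Locks an error-checking mutex twice from the same thread and returns the
 * error code of the second attempt. */
int
relock_error(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    int error;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    error = pthread_mutex_init(&mutex, &attr);
    assert(!error);
    pthread_mutexattr_destroy(&attr);

    error = pthread_mutex_lock(&mutex);   /* First lock succeeds. */
    assert(!error);

    /* Second lock by the owner: reported as an error instead of hanging. */
    error = pthread_mutex_lock(&mutex);

    pthread_mutex_unlock(&mutex);
    pthread_mutex_destroy(&mutex);
    return error;
}
```

On an error-checking mutex the second lock returns EDEADLK, and glibc's strerror() text for that code is "Resource deadlock avoided", which is presumably the 'deadlock avoided' message mentioned above. A caller that treats any lock error as fatal would then abort, as in frames #2 to #5.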
[ovs-discuss] OVS 3.2.0 crashing setting port QOS
After upgrading from Open vSwitch 3.1.2-4 to 3.2.0-2 we've been consistently
seeing new OVS crashes when setting QoS on ports. Both packages were taken
from the Debian distribution
(https://packages.debian.org/source/sid/openvswitch) we're running on. From
the core dump we're seeing the following backtrace:

# gdb --batch -ex bt /usr/sbin/ovs-vswitchd /core
[New LWP 67669]
[New LWP 67682]
[New LWP 67681]
[New LWP 67671]
[New LWP 67679]
[New LWP 67680]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `ovs-vswitchd unix:/tmp/ovs/db.sock -vconsole:emer
-vsyslog:err -vfile:info --ml'.
Program terminated with signal SIGABRT, Aborted.
#0  0x7fcacecbb0fc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7fcacf4cfa80 (LWP 67669))]
#0  0x7fcacecbb0fc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7fcacec6d472 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7fcacec574b2 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x560787952c7e in ovs_abort_valist (err_no=, format=,
    args=args@entry=0x7ffd14c3bce0) at ../lib/util.c:447
#4  0x560787952d14 in ovs_abort (err_no=,
    format=format@entry=0x5607879f3ec7 "%s: pthread_%s_%s failed")
    at ../lib/util.c:439
#5  0x56078791ee11 in ovs_mutex_lock_at (l_=l_@entry=0x56078934c6c8,
    where=where@entry=0x5607879f864b "../lib/netdev-linux.c:2575")
    at ../lib/ovs-thread.c:76
#6  0x56078796d03d in netdev_linux_get_speed (netdev_=0x56078934c640,
    current=0x7ffd14c3be64, max=0x7ffd14c3be1c) at ../lib/netdev-linux.c:2575
#7  0x5607878c04f3 in netdev_get_speed (netdev=netdev@entry=0x56078934c640,
    current=current@entry=0x7ffd14c3be64, max=0x7ffd14c3be1c, max@entry=0x0)
    at ../lib/netdev.c:1175
#8  0x560787968d67 in htb_parse_qdisc_details__
    (netdev=netdev@entry=0x56078934c640, details=details@entry=0x56078934a880,
    hc=hc@entry=0x7ffd14c3beb0) at ../lib/netdev-linux.c:4804
#9  0x5607879755da in htb_tc_install (details=0x56078934a880,
    netdev=0x56078934c640) at ../lib/netdev-linux.c:4883
#10 htb_tc_install (netdev=0x56078934c640, details=0x56078934a880)
    at ../lib/netdev-linux.c:4875
#11 0x560787974937 in netdev_linux_set_qos (netdev_=0x56078934c640, type=,
    details=0x56078934a880) at ../lib/netdev-linux.c:3054
#12 0x560787814ea5 in iface_configure_qos (qos=0x56078934a780,
    iface=0x560789349fd0) at ../vswitchd/bridge.c:4845
#13 bridge_reconfigure (ovs_cfg=ovs_cfg@entry=0x5607892dbf90)
    at ../vswitchd/bridge.c:928
#14 0x560787817f7d in bridge_run () at ../vswitchd/bridge.c:3321
#15 0x56078780d205 in main (argc=, argv=) at ../vswitchd/ovs-vswitchd.c:130

A shell script to reproduce the issue is:

#!/bin/bash
# bash rather than sh: the brace expansion below is not POSIX

apt-get install openvswitch-{common,switch}{,-dbgsym}

# Don't need it running on the system
systemctl stop openvswitch-switch

set -e

cleanup() {
  ip link del veth0
  rm /tmp/ovs/conf.db
}

trap cleanup EXIT

# Setup our environment

ip link add veth0 type veth peer veth1

mkdir -p /tmp/ovs

export OVS_RUNDIR=/tmp/ovs
export OVS_LOGDIR=/tmp/ovs
export OVS_DBDIR=/tmp/ovs

/usr/share/openvswitch/scripts/ovs-ctl start
ovs-vsctl add-br demo
ovs-vsctl add-port demo veth1

# Make it crash

ovs-vsctl set Port veth1 qos=@qos \
  -- --id=@qos create QoS type=linux-htb queues:1=@high queues:2=@low \
  -- --id=@high create Queue other-config:priority=1 other-config:min-rate=0.1 \
  -- --id=@low create Queue other-config:priority=6 other-config:min-rate=0.05

We built the reproduction script based on speculation that
https://github.com/openvswitch/ovs/commit/6240c0b4c80ea3d8dd1bf77526b04b55742de2ce
is related to the crash. Notably we don't seem to run into the problem when we
pass in a specific maximum bandwidth instead of relying on the interface's
maximum bandwidth.

Sincerely,
Daryl Wang
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss