Processed: Re: Bug#899044: Oops: 0000 [#1] SMP in skb_release_data, openvswitch related

2021-05-23 Thread Debian Bug Tracking System
Processing control commands:

> tags -1 + moreinfo
Bug #899044 [src:linux] Oops: 0000 [#1] SMP in skb_release_data, openvswitch related
Added tag(s) moreinfo.

-- 
899044: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=899044
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems



Bug#899044: Oops: 0000 [#1] SMP in skb_release_data, openvswitch related

2021-05-23 Thread Salvatore Bonaccorso
Control: tags -1 + moreinfo

On Fri, May 18, 2018 at 06:19:02PM +0200, Hans van Kranenburg wrote:
> Package: src:linux
> Version: 4.9.88-1
> 
> Hi,
> 
> I'm observing the attached errors on machines that are Xen dom0 and
> running the latest Debian Stretch 4.9 kernel as dom0 kernel. The errors
> have been happening a few times in the last few weeks. It started after
> upgrading them from Jessie and 3.16 kernel to Stretch with 4.9 kernel.
> 
> For networking between domUs and the outside world, we use openvswitch.
> 
> After such an error happens:
> * The number of "flows" in the kernel quickly rises to the limit,
> 1, as seen in output of ovs-dpctl show.
> * Network traffic that should flow through the openvswitch bridge starts
> disappearing in a seemingly random way.
> * The memory usage of the userspace ovs-vswitchd starts growing quickly.
> * Many of the ovs commands, such as adding or removing an interface or
> bridge, hang.
> 
> After a restart of the openvswitch-switch service, and fixing up a bunch
> of configuration of connected interfaces, functionality is restored.
> 
> While most of the symptoms seem related to userspace openvswitch
> processes, the cause of it all seems to be in the kernel, while the
> userspace ovs-vswitchd process is receiving a network packet?
> 
> Sadly I do not know how to reproduce this, except for just waiting until
> it happens again.
> 
> Please advise what else I could do to help resolve this issue.

Is this still an issue? I notice that the upstream report did not get
a reply. If it is still an issue, please try to prod upstream again. Is
it observable with recent kernels as well? If not, I suggest closing
the issue.

Regards,
Salvatore



Bug#899044: Oops: 0000 [#1] SMP in skb_release_data, openvswitch related

2018-05-24 Thread Hans van Kranenburg
To: netdev, dev@openvswitch
Cc: Eric Dumazet (author of ff04a771ad), debian bug

Hi,

As follow-up to my bug report at Debian [0], I'm trying to do bug triage
and find out more. I'm not the expert here, but anything could help, and
it's an opportunity to learn things.

I'm observing the attached errors ('general protection fault:  [#1]
SMP' and 'BUG: unable to handle kernel paging request') on machines that
are Xen dom0 and running a 4.9.88 Debian Stretch kernel as dom0 kernel.
The errors have been happening a few times in the last few weeks. It
started after upgrading them from Jessie and 3.16 kernel to Stretch with
4.9 kernel.

The traces printed look very much alike every time.

If I look up the listed address, I get:

-$ addr2line -e /usr/lib/debug/boot/vmlinux-4.9.0-6-amd64 -i -a 814f5c7d
0x814f5c7d
./debian/build/build_amd64_none_amd64/./include/linux/compiler.h:243 (discriminator 3)
./debian/build/build_amd64_none_amd64/./include/linux/page-flags.h:143 (discriminator 3)
./debian/build/build_amd64_none_amd64/./include/linux/mm.h:779 (discriminator 3)
./debian/build/build_amd64_none_amd64/./include/linux/skbuff.h:2592 (discriminator 3)
./debian/build/build_amd64_none_amd64/./net/core/skbuff.c:594 (discriminator 3)

 583 static void skb_release_data(struct sk_buff *skb)
 584 {
 585         struct skb_shared_info *shinfo = skb_shinfo(skb);
 586         int i;
 587
 588         if (skb->cloned &&
 589             atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
 590                               &shinfo->dataref))
 591                 return;
 592
 593         for (i = 0; i < shinfo->nr_frags; i++)
 594   ->            __skb_frag_unref(&shinfo->frags[i]);    <--
 595
 596         /*
 597          * If skb buf is from userspace, we need to notify the caller
 598          * the lower device DMA has done;
 599          */
 600         if (shinfo->tx_flags & SKBTX_DEV_ZEROCOPY) {
 601                 struct ubuf_info *uarg;
 602
 603                 uarg = shinfo->destructor_arg;
 604                 if (uarg->callback)
 605                         uarg->callback(uarg, true);
 606         }
 607
 608         if (shinfo->frag_list)
 609                 kfree_skb_list(shinfo->frag_list);
 610
 611         skb_free_head(skb);
 612 }
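
For anyone repeating this lookup on another trace, the "symbol+0xOFFSET/0xSIZE"
location printed in the oops can be split mechanically. The following is only
a rough sketch: the helper names are made up for illustration, and the symbol
base address has to come from somewhere else (for example System.map or nm on
the debug vmlinux). The base used in the usage comment is a placeholder, not a
value from these machines.

```python
import re

# Matches the "symbol+0xOFFSET/0xSIZE" form used in kernel oops traces,
# e.g. "skb_release_data+0x8d/0x110".
LOCATION_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
)


def parse_location(text):
    """Return (symbol, offset, size) for the first oops location in text."""
    match = LOCATION_RE.search(text)
    if match is None:
        raise ValueError("no symbol+offset/size location found")
    offset = int(match.group("offset"), 16)
    size = int(match.group("size"), 16)
    if offset >= size:
        raise ValueError("offset lies outside the function")
    return match.group("symbol"), offset, size


def faulting_address(symbol_base, offset):
    """Address to hand to addr2line: function start plus offset."""
    return symbol_base + offset


# Usage with a placeholder base address (hypothetical, not from the trace):
#   symbol, offset, size = parse_location("skb_release_data+0x8d/0x110")
#   print(hex(faulting_address(0x1000, offset)))
```

The resulting address is what goes after "-a" in the addr2line call above.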

The most recent (well, from 2014) biggest change in this area is...

commit ff04a771ad25fc9ba91690e73465b4d34b6bf8b3
Author: Eric Dumazet 
Date:   Tue Sep 23 18:39:30 2014 -0700

net : optimize skb_release_data()

...which is not present in the 3.16.y kernel that Debian Jessie still
uses, and which does not hit this problem (however, also using older
openvswitch userspace components).

Other changes in this area mention zero copy IO, which sounds like
something openvswitch could be using.

-- background: openvswitch usage --

For networking between domUs and the outside world, we use openvswitch.

After such an error happens:
* The number of "flows" in the kernel quickly rises to the limit,
1, as seen in output of ovs-dpctl show.
* Network traffic that should flow through the openvswitch bridge starts
disappearing in a seemingly random way (probably because it can't handle
new traffic flows).
* The memory usage of the userspace ovs-vswitchd starts growing quickly.
* Many of the ovs commands, such as adding or removing an interface or
bridge, hang.
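
To catch the first symptom early, the flow count could be scraped from the
output of ovs-dpctl show. This is a minimal sketch, assuming the output
contains a line of the form "flows: N" per datapath; the function names, the
threshold and the sample text in the usage comment are illustrative
assumptions, not taken from the affected machines.

```python
import re

# Assumes each datapath section of `ovs-dpctl show` has a "flows: N" line.
FLOWS_RE = re.compile(r"^\s*flows:\s*(\d+)\s*$", re.MULTILINE)


def datapath_flow_count(show_output):
    """Sum the 'flows: N' values over all datapaths in the given output."""
    counts = [int(n) for n in FLOWS_RE.findall(show_output)]
    if not counts:
        raise ValueError("no 'flows:' line found in output")
    return sum(counts)


def near_limit(flow_count, flow_limit, threshold=0.9):
    """True when the flow count has reached the given fraction of the limit."""
    return flow_count >= threshold * flow_limit


# Usage with made-up sample output (illustrative only):
#   sample = "system@ovs-system:\n\tflows: 42\n\tport 0: ovs-system (internal)\n"
#   print(datapath_flow_count(sample), near_limit(42, 100))
```

The flow limit has to be supplied by the caller; it is not read from the
output here.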

After a restart of the openvswitch-switch service, and fixing up a bunch
of configuration of connected interfaces, functionality is restored.

While most of the symptoms seem related to userspace openvswitch
processes, the cause of it all seems to be in the kernel, while the
userspace ovs-vswitchd process is receiving a network packet?

-- reproducer --

I don't have a reliable reproducer yet, except for waiting days or weeks
until it randomly happens somewhere. There's no sign of unusual amounts
of traffic / load etc when it happens.

One idea I have is building a semi-random UDP packet generator
to stress the code path from the kernel to ovs-vswitchd.
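
A first rough sketch of such a generator could look like the following. It
is untested against the affected setup. The function names, the port range
and the payload sizes are arbitrary choices for illustration, and the
destination address in the usage comment is a documentation placeholder.
Whether each packet really misses the kernel flow table and goes up to
ovs-vswitchd depends on the configured flows and on how the datapath
wildcards them, so this is an assumption to verify, not a guaranteed
reproducer.

```python
import random
import socket


def generate_packets(rng, count, min_len=1, max_len=1400):
    """Yield (dest_port, payload) pairs with random ports and payloads."""
    for _ in range(count):
        port = rng.randint(1024, 65535)
        length = rng.randint(min_len, max_len)
        payload = bytes(rng.getrandbits(8) for _ in range(length))
        yield port, payload


def send_packets(dest_ip, packets):
    """Send each (port, payload) pair to dest_ip over UDP; return count sent."""
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for port, payload in packets:
            sock.sendto(payload, (dest_ip, port))
            sent += 1
    return sent


# Usage (192.0.2.10 is a placeholder address, replace with a domU address):
#   send_packets("192.0.2.10", generate_packets(random.Random(), 1000))
```

Passing a seeded random.Random makes a run repeatable, which would help if a
particular packet sequence turns out to trigger the fault.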

If I succeed in reproducing it, I can start trying other kernels or changes.

Please advise what else I could do to help resolve this issue.

Thanks,
Regards,

Hans van Kranenburg

[0] https://bugs.debian.org/899044

Bug#899044: Oops: 0000 [#1] SMP in skb_release_data, openvswitch related

2018-05-18 Thread Hans van Kranenburg
Package: src:linux
Version: 4.9.88-1

Hi,

I'm observing the attached errors on machines that are Xen dom0 and
running the latest Debian Stretch 4.9 kernel as dom0 kernel. The errors
have been happening a few times in the last few weeks. It started after
upgrading them from Jessie and 3.16 kernel to Stretch with 4.9 kernel.

For networking between domUs and the outside world, we use openvswitch.

After such an error happens:
* The number of "flows" in the kernel quickly rises to the limit,
1, as seen in output of ovs-dpctl show.
* Network traffic that should flow through the openvswitch bridge starts
disappearing in a seemingly random way.
* The memory usage of the userspace ovs-vswitchd starts growing quickly.
* Many of the ovs commands, such as adding or removing an interface or
bridge, hang.

After a restart of the openvswitch-switch service, and fixing up a bunch
of configuration of connected interfaces, functionality is restored.

While most of the symptoms seem related to userspace openvswitch
processes, the cause of it all seems to be in the kernel, while the
userspace ovs-vswitchd process is receiving a network packet?

Sadly I do not know how to reproduce this, except for just waiting until
it happens again.

Please advise what else I could do to help resolve this issue.

Thanks,
Regards,
-- 
Hans van Kranenburg
May  4 08:23:03 altair kernel: [83978.662075] BUG: unable to handle kernel paging request at 0003001f
May  4 08:23:03 altair kernel: [83978.665887] IP: [] skb_release_data+0x8d/0x110
May  4 08:23:03 altair kernel: [83978.669837] PGD 0 
May  4 08:23:03 altair kernel: [83978.669882] 
May  4 08:23:03 altair kernel: [83978.673589] Oops: 0000 [#1] SMP
May  4 08:23:03 altair kernel: [83978.677281] Modules linked in: cls_u32 
sch_ingress act_mirred sch_fq_codel ifb xt_mark sch_htb xt_physdev br_netfilter 
bridge stp llc xen_netback xen_blkback algif_skcipher af_alg dm_service_time 
binfmt_misc xen_gntdev xen_evtchn openvswitch nf_nat_ipv6 libcrc32c xenfs 
xen_privcmd ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 
ip6table_filter ip6table_mangle ip6table_raw ip6_tables ipt_REJECT 
nf_reject_ipv4 xt_tcpudp xt_owner xt_multiport xt_conntrack iptable_filter 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 
iptable_mangle iptable_raw dm_crypt intel_powerclamp crct10dif_pclmul 
crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel pcspkr serio_raw 
joydev evdev amdkfd radeon ttm drm_kms_helper drm i2c_algo_bit lpc_ich mfd_core 
i7core_edac hpilo
May  4 08:23:03 altair kernel: [83978.701936]  sg ipmi_si hpwdt edac_core 
ipmi_msghandler acpi_power_meter button shpchp dm_multipath dm_mod scsi_dh_rdac 
scsi_dh_emc scsi_dh_alua ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp 
libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 ext4 
crc16 jbd2 fscrypto ecb mbcache btrfs crc32c_generic xor raid6_pq mlx4_en ptp 
pps_core hid_generic usbhid hid sd_mod crc32c_intel aesni_intel aes_x86_64 
glue_helper lrw gf128mul ablk_helper cryptd psmouse ehci_pci uhci_hcd ehci_hcd 
usbcore usb_common hpsa scsi_transport_sas bnx2 mlx4_core devlink scsi_mod 
thermal
May  4 08:23:03 altair kernel: [83978.724406] CPU: 1 PID: 1486 Comm: revalidator7 Not tainted 4.9.0-6-amd64 #1 Debian 4.9.88-1
May  4 08:23:03 altair kernel: [83978.729139] Hardware name: HP ProLiant DL360 G7, BIOS P68 08/16/2015
May  4 08:23:03 altair kernel: [83978.733958] task: 880119e1ee80 task.stack: c90042764000
May  4 08:23:03 altair kernel: [83978.738724] RIP: e030:[]  [] skb_release_data+0x8d/0x110
May  4 08:23:03 altair kernel: [83978.743560] RSP: e02b:c90042767c78  EFLAGS: 00010206
May  4 08:23:03 altair kernel: [83978.748352] RAX: 0050 RBX: 0002 RCX: 81ce0f40
May  4 08:23:03 altair kernel: [83978.753116] RDX:  RSI: 8800cc998900 RDI: 8800cc998900
May  4 08:23:03 altair kernel: [83978.757867] RBP: 8800cc998900 R08: 880123c0 R09: 88011f22
May  4 08:23:03 altair kernel: [83978.762598] R10: 8800cc998900 R11: 880119e10280 R12: 0002
May  4 08:23:03 altair kernel: [83978.767321] R13: 88011f227ec0 R14: 88011dea2800 R15: 
May  4 08:23:03 altair kernel: [83978.772000] FS:  7fc1656cc700() GS:88012824() knlGS:
May  4 08:23:03 altair kernel: [83978.776671] CS:  e033 DS:  ES:  CR0: 80050033
May  4 08:23:03 altair kernel: [83978.781355] CR2: 0003001f CR3: 0001212b1000 CR4: 2660
May  4 08:23:03 altair kernel: [83978.786135] Stack:
May  4 08:23:03 altair kernel: [83978.790841]  880120a28000 8800cc998900 c90042767ec0 7ea4
May  4 08:23:03 altair kernel: [83978.795898]  814f6267 880120a28000 8800cc998900 814fcc91
May  4 08:23:03 altair kernel: [83978.800806]  880120a28000 8153f2df c900