Processed: Re: Bug#899044: Oops: 0000 [#1] SMP in skb_release_data, openvswitch related
Processing control commands:

> tags -1 + moreinfo
Bug #899044 [src:linux] Oops: 0000 [#1] SMP in skb_release_data, openvswitch related
Added tag(s) moreinfo.

-- 
899044: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=899044
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems
Bug#899044: Oops: 0000 [#1] SMP in skb_release_data, openvswitch related
Control: tags -1 + moreinfo

On Fri, May 18, 2018 at 06:19:02PM +0200, Hans van Kranenburg wrote:
> Package: src:linux
> Version: 4.9.88-1
>
> Hi,
>
> I'm observing the attached errors on machines that are Xen dom0 and
> running the latest Debian Stretch 4.9 kernel as dom0 kernel. The errors
> have been happening a few times in the last few weeks. It started after
> upgrading them from Jessie and 3.16 kernel to Stretch with 4.9 kernel.
>
> For networking between domUs and the outside world, we use openvswitch.
>
> After such an error happens:
> * The number of "flows" in the kernel quickly rises to the limit,
>   1, as seen in output of ovs-dpctl show.
> * Network traffic that should flow through the openvswitch bridge starts
>   disappearing in a seemingly random way.
> * The memory usage of the userspace ovs-vswitchd starts growing quickly.
> * Many of the ovs commands, such as adding or removing an interface or
>   bridge, hang.
>
> After a restart of the openvswitch-switch service, and fixing up a bunch
> of configuration of connected interfaces, functionality is restored.
>
> While most of the symptoms seem related to userspace openvswitch
> processes, the cause of it all seems to be in the kernel, while the
> userspace ovs-vswitchd process is receiving a network packet?
>
> Sadly I do not know how to reproduce this, except for just waiting until
> it happens again.
>
> Please advise what else I could use to help resolve this issue.

Is this still an issue? I notice that the upstream report did not get a
reply. If it is still an issue, please try to prod upstream again.

Is it observable with recent kernels as well?

If not, I suggest closing the issue.

Regards,
Salvatore
Bug#899044: Oops: 0000 [#1] SMP in skb_release_data, openvswitch related
To: netdev, dev@openvswitch
Cc: Eric Dumazet (author of ff04a771ad), debian bug

Hi,

As a follow-up to my bug report at Debian [0], I'm trying to do bug
triage and find out more. I'm not the expert here, but anything could
help, and it's an opportunity to learn things.

I'm observing the attached errors ('general protection fault: [#1] SMP'
and 'BUG: unable to handle kernel paging request') on machines that are
Xen dom0 and running a 4.9.88 Debian Stretch kernel as dom0 kernel. The
errors have been happening a few times in the last few weeks. It started
after upgrading them from Jessie and 3.16 kernel to Stretch with 4.9
kernel.

The traces printed look very much alike every time. If I look up the
listed address, I get:

-$ addr2line -e /usr/lib/debug/boot/vmlinux-4.9.0-6-amd64 -i -a 814f5c7d
0x814f5c7d
./debian/build/build_amd64_none_amd64/./include/linux/compiler.h:243 (discriminator 3)
./debian/build/build_amd64_none_amd64/./include/linux/page-flags.h:143 (discriminator 3)
./debian/build/build_amd64_none_amd64/./include/linux/mm.h:779 (discriminator 3)
./debian/build/build_amd64_none_amd64/./include/linux/skbuff.h:2592 (discriminator 3)
./debian/build/build_amd64_none_amd64/./net/core/skbuff.c:594 (discriminator 3)

583 static void skb_release_data(struct sk_buff *skb)
584 {
585         struct skb_shared_info *shinfo = skb_shinfo(skb);
586         int i;
587
588         if (skb->cloned &&
589             atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
590                               &shinfo->dataref))
591                 return;
592
593         for (i = 0; i < shinfo->nr_frags; i++)
594 ->              __skb_frag_unref(&shinfo->frags[i]);    <--
595
596         /*
597          * If skb buf is from userspace, we need to notify the caller
598          * the lower device DMA has done;
599          */
600         if (shinfo->tx_flags & SKBTX_DEV_ZEROCOPY) {
601                 struct ubuf_info *uarg;
602
603                 uarg = shinfo->destructor_arg;
604                 if (uarg->callback)
605                         uarg->callback(uarg, true);
606         }
607
608         if (shinfo->frag_list)
609                 kfree_skb_list(shinfo->frag_list);
610
611         skb_free_head(skb);
612 }

The most recent big change in this area (well, from 2014) is...

commit ff04a771ad25fc9ba91690e73465b4d34b6bf8b3
Author: Eric Dumazet
Date:   Tue Sep 23 18:39:30 2014 -0700

    net : optimize skb_release_data()

...which is not present in the 3.16.y kernel that Debian Jessie still
uses, and which does not hit this problem (although those machines also
use older openvswitch userspace components).

Other changes in this area mention zero copy IO, which sounds like
something openvswitch could be using.

-- background: openvswitch usage --

For networking between domUs and the outside world, we use openvswitch.

After such an error happens:
* The number of "flows" in the kernel quickly rises to the limit,
  1, as seen in output of ovs-dpctl show.
* Network traffic that should flow through the openvswitch bridge starts
  disappearing in a seemingly random way (probably because it can't
  handle new traffic flows).
* The memory usage of the userspace ovs-vswitchd starts growing quickly.
* Many of the ovs commands, such as adding or removing an interface or
  bridge, hang.

After a restart of the openvswitch-switch service, and fixing up a bunch
of configuration of connected interfaces, functionality is restored.

While most of the symptoms seem related to userspace openvswitch
processes, the cause of it all seems to be in the kernel, while the
userspace ovs-vswitchd process is receiving a network packet?

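For readers unfamiliar with the reference-count check at lines 588-591 of the skb_release_data() listing above, here is a minimal user-space model of it. It is only a sketch: struct model_skb and model_release_must_free() are hypothetical names made up for illustration, the counter is a plain int where the kernel uses an atomic, and the shift value of 16 is taken from SKB_DATAREF_SHIFT in include/linux/skbuff.h. The lower 16 bits of dataref count all references to the shared data, the upper 16 bits count payload-only (nohdr) references.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same value as SKB_DATAREF_SHIFT in include/linux/skbuff.h. */
#define DATAREF_SHIFT 16

/* Hypothetical stand-in for the few skb fields the check looks at. */
struct model_skb {
    bool cloned;    /* data area may be shared with other skbs */
    bool nohdr;     /* this skb holds a payload-only reference */
    int *dataref;   /* shared counter; models shinfo->dataref */
};

/*
 * Models lines 588-591 of the listing: drop this skb's reference and
 * report whether the caller must go on to release frags and head.
 * Returns false while other references remain.
 */
bool model_release_must_free(struct model_skb *skb)
{
    assert(skb != NULL && skb->dataref != NULL);

    if (skb->cloned) {
        /* A nohdr skb gives back one payload-only ref plus one total ref. */
        int dec = skb->nohdr ? (1 << DATAREF_SHIFT) + 1 : 1;

        *skb->dataref -= dec;   /* the kernel does this atomically */
        if (*skb->dataref != 0)
            return false;       /* someone else still uses the data */
    }
    return true;                /* last user: frags get unreferenced next */
}
```

In this model, the loop at lines 593-594 only runs when the function returns true. If the counter or nr_frags in the shared info were corrupted or the data were already freed, the loop would walk over garbage frag entries, which would fit a fault at line 594; that is a guess on the sketch's part, not something the traces prove.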
-- reproducer --

I don't have a reliable reproducer yet, except for waiting days or weeks
until it randomly happens somewhere. There's no sign of unusual amounts
of traffic / load etc when it happens.

One idea is to build a semi-random UDP packet generator to stress the
code path from the kernel to ovs-vswitchd. If I succeed in reproducing
it, I can start trying other kernels or changes.

Please advise what else I could do to help resolve this issue.

Thanks,
Regards,
Hans van Kranenburg

[0] https://bugs.debian.org/899044

May 4 08:23:03 altair kernel: [83978.662075] BUG: unable to handle kernel paging request at 0003001f
May 4 08:23:03 altair kernel: [83978.665887] IP: [] skb_release_data+0x8d/0x110
May 4 08:23:03 altair kernel: [83978.669837] PGD 0
May 4 08:23:03 altair kernel: [83978.669882]
May 4 08:23:03 altair kernel: [83978.673589] Oops: 0000 [#1] SMP
May 4 08:23:03 altair kernel: [83978.677281] Modules linked in: cls_u32 sch_ingress act_mirred sch_fq_codel ifb xt_mark sch_htb xt_physdev br_netfilter bridge stp llc xen_netback xen_blkback algif_skcipher af_alg dm_service_time binfmt_misc xen_gntdev xen_evtchn openvswitch nf_nat_ipv6 libcrc32c xenfs xen_privcmd ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
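The packet-generator idea from the reproducer section above could start from something like the following. It is a rough sketch under assumptions, not a tested reproducer: the gen_* function names are made up for illustration, and it assumes that varying the destination port and payload per datagram is enough to make each packet look like a new flow to the openvswitch datapath. A seeded xorshift32 generator is used so that a run which triggers the oops can be replayed with the same seed.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* xorshift32: deterministic for a given non-zero state. */
uint32_t gen_next(uint32_t *state)
{
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Pick a destination port in the range 1024..65535. */
uint16_t gen_port(uint32_t *state)
{
    return (uint16_t)(1024 + gen_next(state) % (65536 - 1024));
}

/* Fill buf with pseudo-random bytes; returns a length in 1..max_len. */
size_t gen_payload(uint32_t *state, unsigned char *buf, size_t max_len)
{
    size_t len, i;

    assert(max_len > 0);
    len = 1 + gen_next(state) % max_len;
    for (i = 0; i < len; i++)
        buf[i] = (unsigned char)gen_next(state);
    return len;
}

/*
 * Send count datagrams to dst_ip, each with a pseudo-random destination
 * port and payload. Returns the number of datagrams handed to the
 * kernel, or -1 if the address is invalid or no socket can be created.
 */
long gen_send(const char *dst_ip, uint32_t seed, long count)
{
    struct sockaddr_in dst;
    unsigned char buf[1400];
    uint32_t state = seed ? seed : 1;   /* xorshift state must be non-zero */
    long sent = 0, i;
    int fd;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    if (inet_pton(AF_INET, dst_ip, &dst.sin_addr) != 1)
        return -1;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    for (i = 0; i < count; i++) {
        size_t len = gen_payload(&state, buf, sizeof(buf));

        dst.sin_port = htons(gen_port(&state));
        if (sendto(fd, buf, len, 0,
                   (struct sockaddr *)&dst, sizeof(dst)) >= 0)
            sent++;
    }
    close(fd);
    return sent;
}
```

A call such as gen_send(addr, 12345, 100000), run from a domU towards an address behind the bridge, would be the intended use; the address, seed and count are placeholders to be chosen per setup.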
Bug#899044: Oops: 0000 [#1] SMP in skb_release_data, openvswitch related
Package: src:linux
Version: 4.9.88-1

Hi,

I'm observing the attached errors on machines that are Xen dom0 and
running the latest Debian Stretch 4.9 kernel as dom0 kernel. The errors
have been happening a few times in the last few weeks. It started after
upgrading them from Jessie and 3.16 kernel to Stretch with 4.9 kernel.

For networking between domUs and the outside world, we use openvswitch.

After such an error happens:
* The number of "flows" in the kernel quickly rises to the limit,
  1, as seen in output of ovs-dpctl show.
* Network traffic that should flow through the openvswitch bridge starts
  disappearing in a seemingly random way.
* The memory usage of the userspace ovs-vswitchd starts growing quickly.
* Many of the ovs commands, such as adding or removing an interface or
  bridge, hang.

After a restart of the openvswitch-switch service, and fixing up a bunch
of configuration of connected interfaces, functionality is restored.

While most of the symptoms seem related to userspace openvswitch
processes, the cause of it all seems to be in the kernel, while the
userspace ovs-vswitchd process is receiving a network packet?

Sadly I do not know how to reproduce this, except for just waiting until
it happens again.

Please advise what else I could use to help resolve this issue.

Thanks,
Regards,
-- 
Hans van Kranenburg

May 4 08:23:03 altair kernel: [83978.662075] BUG: unable to handle kernel paging request at 0003001f
May 4 08:23:03 altair kernel: [83978.665887] IP: [] skb_release_data+0x8d/0x110
May 4 08:23:03 altair kernel: [83978.669837] PGD 0
May 4 08:23:03 altair kernel: [83978.669882]
May 4 08:23:03 altair kernel: [83978.673589] Oops: 0000 [#1] SMP
May 4 08:23:03 altair kernel: [83978.677281] Modules linked in: cls_u32 sch_ingress act_mirred sch_fq_codel ifb xt_mark sch_htb xt_physdev br_netfilter bridge stp llc xen_netback xen_blkback algif_skcipher af_alg dm_service_time binfmt_misc xen_gntdev xen_evtchn openvswitch nf_nat_ipv6 libcrc32c xenfs xen_privcmd ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6table_mangle ip6table_raw ip6_tables ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_owner xt_multiport xt_conntrack iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw dm_crypt intel_powerclamp crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel pcspkr serio_raw joydev evdev amdkfd radeon ttm drm_kms_helper drm i2c_algo_bit lpc_ich mfd_core i7core_edac hpilo
May 4 08:23:03 altair kernel: [83978.701936] sg ipmi_si hpwdt edac_core ipmi_msghandler acpi_power_meter button shpchp dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 ext4 crc16 jbd2 fscrypto ecb mbcache btrfs crc32c_generic xor raid6_pq mlx4_en ptp pps_core hid_generic usbhid hid sd_mod crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd psmouse ehci_pci uhci_hcd ehci_hcd usbcore usb_common hpsa scsi_transport_sas bnx2 mlx4_core devlink scsi_mod thermal
May 4 08:23:03 altair kernel: [83978.724406] CPU: 1 PID: 1486 Comm: revalidator7 Not tainted 4.9.0-6-amd64 #1 Debian 4.9.88-1
May 4 08:23:03 altair kernel: [83978.729139] Hardware name: HP ProLiant DL360 G7, BIOS P68 08/16/2015
May 4 08:23:03 altair kernel: [83978.733958] task: 880119e1ee80 task.stack: c90042764000
May 4 08:23:03 altair kernel: [83978.738724] RIP: e030:[] [] skb_release_data+0x8d/0x110
May 4 08:23:03 altair kernel: [83978.743560] RSP: e02b:c90042767c78 EFLAGS: 00010206
May 4 08:23:03 altair kernel: [83978.748352] RAX: 0050 RBX: 0002 RCX: 81ce0f40
May 4 08:23:03 altair kernel: [83978.753116] RDX: RSI: 8800cc998900 RDI: 8800cc998900
May 4 08:23:03 altair kernel: [83978.757867] RBP: 8800cc998900 R08: 880123c0 R09: 88011f22
May 4 08:23:03 altair kernel: [83978.762598] R10: 8800cc998900 R11: 880119e10280 R12: 0002
May 4 08:23:03 altair kernel: [83978.767321] R13: 88011f227ec0 R14: 88011dea2800 R15:
May 4 08:23:03 altair kernel: [83978.772000] FS: 7fc1656cc700() GS:88012824() knlGS:
May 4 08:23:03 altair kernel: [83978.776671] CS: e033 DS: ES: CR0: 80050033
May 4 08:23:03 altair kernel: [83978.781355] CR2: 0003001f CR3: 0001212b1000 CR4: 2660
May 4 08:23:03 altair kernel: [83978.786135] Stack:
May 4 08:23:03 altair kernel: [83978.790841] 880120a28000 8800cc998900 c90042767ec0 7ea4
May 4 08:23:03 altair kernel: [83978.795898] 814f6267 880120a28000 8800cc998900 814fcc91
May 4 08:23:03 altair kernel: [83978.800806] 880120a28000 8153f2df c900