Re: Status of 8.99.12
On Mon, 12 Feb 2018, Thor Lancelot Simon wrote: On Mon, Feb 12, 2018 at 08:48:32AM +0800, Paul Goyette wrote: 1. Starting the gnucash program (from pkgsrc finance/gnucash) now takes about 3 times as long as before. Even after successfully loading the image (to get libraries etc into the file system cache) it take more than three full minutes for the program to initialize. It previously took 1 full minute? Yes, at least on the initial load (before caching everything). On subsequent loads it would take at least 20-30 seconds to start. How does it look without LOCKDEBUG? I would have to build a new kernel and check. Do you thing it is truly relevant? Would there be a lot of locking/contention occurring during the program start-up phase? FWIW, someone also suggested that the "3-seconds needed to unmount a nullfs" problem could also be affected by LOCKDEBUG. Interestingly, this problem does not seem to persist. Some time after the system is booted, and some time after all the nullfs mounts are completed, the unmount process completes as expected; all 20+ umount complete within total of 3-4 seconds, rather than 3-6 seconds each. I'll try to characterize the "some time after" details later, after resolving the bigger outstanding issues. +--+--++ | Paul Goyette | PGP Key fingerprint: | E-mail addresses: | | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com | | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org | +--+--++
Re: Status of 8.99.12
On Mon, Feb 12, 2018 at 08:48:32AM +0800, Paul Goyette wrote: > > 1. Starting the gnucash program (from pkgsrc finance/gnucash) now takes >about 3 times as long as before. Even after successfully loading >the image (to get libraries etc into the file system cache) it take >more than three full minutes for the program to initialize. It previously took 1 full minute? How does it look without LOCKDEBUG? Thor
Re: Status of 8.99.12
On Mon, 12 Feb 2018, Paul Goyette wrote: 2. Whenever I try to shutdown the system, I get a networking-related panic. The following is manually transcribed: trap type 4 code 0 rip 0x802d3f75 cs 0x8 rflags 0x10282 cr2 0x77e0e931c020 ilevel 0x4 rsp 0x80090a7e3c80 curlwp 0xe4afbb6e8700 pid 926.1 lowest kstack 0x80090a7e0c20 kernel: protection fault trap, code = 0 stopped in 926.1 (avahi-daemon) at ip_setmoptions+0x237: movq 360(%rax),%rdi traceback: ip_setmoptions + 0x237 ip_rtloutput + 0x218 udp_ctloutput + 0x82 udp_ctloutput_wrapper + 0x2c sosetopt + 0x67 sys_setsockopt + 0x91 syscall + 0x1ed (syscall #105) This appears to be fixed by a patch provided by ozakir@ 3. After getting the above, as soon as I type a single character as command input to ddb(4), I get a LOCKDEBUG panic. I didn't yet transcribe the 40+ lines of output, but the backtrace clearly includes a couple entries from the xhci (USB-3) driver. Here's the console output from the LOCKDEBUG panic - all transcribed by hand, but hopefully without too many typos! Mutex error: mutex_vector_enter,523: spin lock held lock address: 0xe410e9d1d9a0 type: spin initialized: 0x802bac06 shared holds: 0 exclusive: 1 shares wanted: 0 exclusive: 0 current CPU: 11 last held: 11 curlwp: 0xe41fc09ad2c0 last held: 0xe41fc09ad2c0 last locked*: 0x802b81de unlocked: 0x80291179 owner field: 0x00010600 wait/spin:0/1 panic: LOCKDEBUG: Mutex error: mutex_vector_enter,523: spin lock held And the backtrace is vpanic+0x140 snprintf lockdebug_more mutex_enter+0x69d xhci_device_intr_start+0x125 usbd_start_next+0x65 xhci_soft_intr+0x49b xhci_poll+0x37 ukbd_cngetc+0x19 cngetc+0x34 db_readline+0x65 db_read_line+0x15 db_command_loop+0x84 db_trap+0xe3 kbd_trap+0xe2 trap (number 4) followed by the original backtrace (see above, starting with ip_setmoptions). 4. While the system is running, I have noticed that un-mounting nullfs mounts is very slow. Using mksandbox (from pkgsrc), I create a sandbox with about 22 null mounts. Creating/mounting is no problem, and everything runs as expected. However, when unmounting these nullfs, each one takes between 3 and 6 wall-seconds, during which the umount process is running at 100% of one CPU. Additionally, some of these umounts seem to grab the CPU with interrupts disabled, resulting in total stall of the machine for the duration (and, in X, cursor movement stalls/gets "jerky"). All the unmounts eventually complete successflly. +--+--++ | Paul Goyette | PGP Key fingerprint: | E-mail addresses: | | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com | | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org | +--+--++ +--+--++ | Paul Goyette | PGP Key fingerprint: | E-mail addresses: | | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com | | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org | +--+--++
Re: Status of 8.99.12
On Mon, 12 Feb 2018, Ryota Ozaki wrote: 2. Whenever I try to shutdown the system, I get a networking-related panic. The following is manually transcribed: trap type 4 code 0 rip 0x802d3f75 cs 0x8 rflags 0x10282 cr2 0x77e0e931c020 ilevel 0x4 rsp 0x80090a7e3c80 curlwp 0xe4afbb6e8700 pid 926.1 lowest kstack 0x80090a7e0c20 kernel: protection fault trap, code = 0 stopped in 926.1 (avahi-daemon) at ip_setmoptions+0x237: movq 360(%rax),%rdi traceback: ip_setmoptions + 0x237 ip_rtloutput + 0x218 udp_ctloutput + 0x82 udp_ctloutput_wrapper + 0x2c sosetopt + 0x67 sys_setsockopt + 0x91 syscall + 0x1ed (syscall #105) Is the panic fixed by the following patch? diff --git a/sys/netinet/ip_output.c b/sys/netinet/ip_output.c index 44d8032f387..2e5e346af91 100644 --- a/sys/netinet/ip_output.c +++ b/sys/netinet/ip_output.c @@ -1927,9 +1927,13 @@ ip_drop_membership(struct ip_moptions *imo, const struct sockopt *sopt) * Give up the multicast address record to which the * membership points. */ - IFNET_LOCK(imo->imo_membership[i]->inm_ifp); +{ + struct ifnet *inm_ifp = imo->imo_membership[i]->inm_ifp; + IFNET_LOCK(inm_ifp); in_delmulti(imo->imo_membership[i]); - IFNET_UNLOCK(imo->imo_membership[i]->inm_ifp); + /* ifp should not leave thanks to solock */ + IFNET_UNLOCK(inm_ifp); +} /* * Remove the gap in the membership array. Yes it appears to address the problem. Without this patch, the above crash was 100% reproducible (5 out of 5). With this patch applied (and no other changes) I have had 3 consecutive reboots without and problem! Thanks for the quick turn-around. +--+--++ | Paul Goyette | PGP Key fingerprint: | E-mail addresses: | | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com | | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org | +--+--++
Re: Status of 8.99.12
On Mon, Feb 12, 2018 at 9:48 AM, Paul Goyettewrote: > After an extended period of build breaks, I finally got a new release built > from sources updated on 2018-02-10 at 04:02:43 UTC > > I'm seeing several problems with this release that were not seen with my > previous installation (from last November). > > 1. Starting the gnucash program (from pkgsrc finance/gnucash) now takes >about 3 times as long as before. Even after successfully loading >the image (to get libraries etc into the file system cache) it take >more than three full minutes for the program to initialize. > > 2. Whenever I try to shutdown the system, I get a networking-related >panic. The following is manually transcribed: > > trap type 4 code 0 rip 0x802d3f75 cs 0x8 rflags 0x10282 > cr2 0x77e0e931c020 ilevel 0x4 rsp 0x80090a7e3c80 > curlwp 0xe4afbb6e8700 pid 926.1 lowest kstack > 0x80090a7e0c20 > kernel: protection fault trap, code = 0 > stopped in 926.1 (avahi-daemon) at ip_setmoptions+0x237: movq > 360(%rax),%rdi > traceback: > ip_setmoptions + 0x237 > ip_rtloutput + 0x218 > udp_ctloutput + 0x82 > udp_ctloutput_wrapper + 0x2c > sosetopt + 0x67 > sys_setsockopt + 0x91 > syscall + 0x1ed (syscall #105) Is the panic fixed by the following patch? Thanks, ozaki-r diff --git a/sys/netinet/ip_output.c b/sys/netinet/ip_output.c index 44d8032f387..2e5e346af91 100644 --- a/sys/netinet/ip_output.c +++ b/sys/netinet/ip_output.c @@ -1927,9 +1927,13 @@ ip_drop_membership(struct ip_moptions *imo, const struct sockopt *sopt) * Give up the multicast address record to which the * membership points. */ - IFNET_LOCK(imo->imo_membership[i]->inm_ifp); +{ + struct ifnet *inm_ifp = imo->imo_membership[i]->inm_ifp; + IFNET_LOCK(inm_ifp); in_delmulti(imo->imo_membership[i]); - IFNET_UNLOCK(imo->imo_membership[i]->inm_ifp); + /* ifp should not leave thanks to solock */ + IFNET_UNLOCK(inm_ifp); +} /* * Remove the gap in the membership array.
Status of 8.99.12
After an extended period of build breaks, I finally got a new release built from sources updated on 2018-02-10 at 04:02:43 UTC I'm seeing several problems with this release that were not seen with my previous installation (from last November). 1. Starting the gnucash program (from pkgsrc finance/gnucash) now takes about 3 times as long as before. Even after successfully loading the image (to get libraries etc into the file system cache) it take more than three full minutes for the program to initialize. 2. Whenever I try to shutdown the system, I get a networking-related panic. The following is manually transcribed: trap type 4 code 0 rip 0x802d3f75 cs 0x8 rflags 0x10282 cr2 0x77e0e931c020 ilevel 0x4 rsp 0x80090a7e3c80 curlwp 0xe4afbb6e8700 pid 926.1 lowest kstack 0x80090a7e0c20 kernel: protection fault trap, code = 0 stopped in 926.1 (avahi-daemon) at ip_setmoptions+0x237: movq 360(%rax),%rdi traceback: ip_setmoptions + 0x237 ip_rtloutput + 0x218 udp_ctloutput + 0x82 udp_ctloutput_wrapper + 0x2c sosetopt + 0x67 sys_setsockopt + 0x91 syscall + 0x1ed (syscall #105) 3. After getting the above, as soon as I type a single character as command input to ddb(4), I get a LOCKDEBUG panic. I didn't yet transcribe the 40+ lines of output, but the backtrace clearly includes a couple entries from the xhci (USB-3) driver. 4. While the system is running, I have noticed that un-mounting nullfs mounts is very slow. Using mksandbox (from pkgsrc), I create a sandbox with about 22 null mounts. Creating/mounting is no problem, and everything runs as expected. However, when unmounting these nullfs, each one takes between 3 and 6 wall-seconds, during which the umount process is running at 100% of one CPU. Additionally, some of these umounts seem to grab the CPU with interrupts disabled, resulting in total stall of the machine for the duration (and, in X, cursor movement stalls/gets "jerky"). All the unmounts eventually complete successflly. +--+--++ | Paul Goyette | PGP Key fingerprint: | E-mail addresses: | | (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com | | Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org | +--+--++