Re: Status of 8.99.12

2018-02-12 Thread Paul Goyette

On Mon, 12 Feb 2018, Thor Lancelot Simon wrote:


On Mon, Feb 12, 2018 at 08:48:32AM +0800, Paul Goyette wrote:


1. Starting the gnucash program (from pkgsrc finance/gnucash) now takes
   about 3 times as long as before.  Even after successfully loading
   the image (to get libraries etc into the file system cache) it take
   more than three full minutes for the program to initialize.


It previously took 1 full minute?


Yes, at least on the initial load (before caching everything).  On 
subsequent loads it would take at least 20-30 seconds to start.




How does it look without LOCKDEBUG?


I would have to build a new kernel and check.  Do you thing it is truly 
relevant?  Would there be a lot of locking/contention occurring during 
the program start-up phase?


FWIW, someone also suggested that the "3-seconds needed to unmount a 
nullfs" problem could also be affected by LOCKDEBUG.  Interestingly, 
this problem does not seem to persist.  Some time after the system is 
booted, and some time after all the nullfs mounts are completed, the 
unmount process completes as expected;  all 20+ umount complete within 
total of 3-4 seconds, rather than 3-6 seconds each.  I'll try to 
characterize the "some time after" details later, after resolving the 
bigger outstanding issues.





+--+--++
| Paul Goyette | PGP Key fingerprint: | E-mail addresses:  |
| (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
+--+--++


Re: Status of 8.99.12

2018-02-12 Thread Thor Lancelot Simon
On Mon, Feb 12, 2018 at 08:48:32AM +0800, Paul Goyette wrote:
> 
> 1. Starting the gnucash program (from pkgsrc finance/gnucash) now takes
>about 3 times as long as before.  Even after successfully loading
>the image (to get libraries etc into the file system cache) it take
>more than three full minutes for the program to initialize.

It previously took 1 full minute?

How does it look without LOCKDEBUG?

Thor


Re: Status of 8.99.12

2018-02-11 Thread Paul Goyette

On Mon, 12 Feb 2018, Paul Goyette wrote:


2. Whenever I try to shutdown the system, I get a networking-related
  panic.  The following is manually transcribed:

trap type 4 code 0 rip 0x802d3f75 cs 0x8 rflags 0x10282
  cr2 0x77e0e931c020 ilevel 0x4 rsp 0x80090a7e3c80
curlwp 0xe4afbb6e8700 pid 926.1 lowest kstack
  0x80090a7e0c20
kernel: protection fault trap, code = 0
stopped in 926.1 (avahi-daemon) at ip_setmoptions+0x237: movq
  360(%rax),%rdi
traceback:
ip_setmoptions + 0x237
ip_rtloutput + 0x218
udp_ctloutput + 0x82
udp_ctloutput_wrapper + 0x2c
sosetopt + 0x67
sys_setsockopt + 0x91
syscall + 0x1ed (syscall #105)


This appears to be fixed by a patch provided by ozakir@


3. After getting the above, as soon as I type a single character as
  command input to ddb(4), I get a LOCKDEBUG panic.  I didn't yet
  transcribe the 40+ lines of output, but the backtrace clearly
  includes a couple entries from the xhci (USB-3) driver.


Here's the console output from the LOCKDEBUG panic - all transcribed by 
hand, but hopefully without too many typos!


Mutex error: mutex_vector_enter,523: spin lock held
lock address: 0xe410e9d1d9a0   type: spin
initialized:  0x802bac06
shared holds:  0   exclusive:  1
shares wanted: 0   exclusive:  0
current CPU:  11   last held: 11
curlwp:   0xe41fc09ad2c0   last held: 0xe41fc09ad2c0
last locked*: 0x802b81de   unlocked:  0x80291179
owner field:  0x00010600   wait/spin:0/1
panic: LOCKDEBUG: Mutex error: mutex_vector_enter,523: spin lock held

And the backtrace is

vpanic+0x140
snprintf
lockdebug_more
mutex_enter+0x69d
xhci_device_intr_start+0x125
usbd_start_next+0x65
xhci_soft_intr+0x49b
xhci_poll+0x37
ukbd_cngetc+0x19
cngetc+0x34
db_readline+0x65
db_read_line+0x15
db_command_loop+0x84
db_trap+0xe3
kbd_trap+0xe2
trap (number 4)

followed by the original backtrace (see above, starting with 
ip_setmoptions).







4. While the system is running, I have noticed that un-mounting nullfs
  mounts is very slow.  Using mksandbox (from pkgsrc), I create a
  sandbox with about 22 null mounts.  Creating/mounting is no problem,
  and everything runs as expected.  However, when unmounting these
  nullfs, each one takes between 3 and 6 wall-seconds, during which
  the umount process is running at 100% of one CPU.  Additionally,
  some of these umounts seem to grab the CPU with interrupts disabled,
  resulting in total stall of the machine for the duration (and, in X,
  cursor movement stalls/gets "jerky").  All the unmounts eventually
  complete successflly.




+--+--++
| Paul Goyette | PGP Key fingerprint: | E-mail addresses:  |
| (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
+--+--++



+--+--++
| Paul Goyette | PGP Key fingerprint: | E-mail addresses:  |
| (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
+--+--++


Re: Status of 8.99.12

2018-02-11 Thread Paul Goyette

On Mon, 12 Feb 2018, Ryota Ozaki wrote:


2. Whenever I try to shutdown the system, I get a networking-related
   panic.  The following is manually transcribed:

trap type 4 code 0 rip 0x802d3f75 cs 0x8 rflags 0x10282
  cr2 0x77e0e931c020 ilevel 0x4 rsp 0x80090a7e3c80
curlwp 0xe4afbb6e8700 pid 926.1 lowest kstack
  0x80090a7e0c20
kernel: protection fault trap, code = 0
stopped in 926.1 (avahi-daemon) at ip_setmoptions+0x237: movq
  360(%rax),%rdi
traceback:
ip_setmoptions + 0x237
ip_rtloutput + 0x218
udp_ctloutput + 0x82
udp_ctloutput_wrapper + 0x2c
sosetopt + 0x67
sys_setsockopt + 0x91
syscall + 0x1ed (syscall #105)


Is the panic fixed by the following patch?

diff --git a/sys/netinet/ip_output.c b/sys/netinet/ip_output.c
index 44d8032f387..2e5e346af91 100644
--- a/sys/netinet/ip_output.c
+++ b/sys/netinet/ip_output.c
@@ -1927,9 +1927,13 @@ ip_drop_membership(struct ip_moptions *imo,
const struct sockopt *sopt)
* Give up the multicast address record to which the
* membership points.
*/
-   IFNET_LOCK(imo->imo_membership[i]->inm_ifp);
+{
+   struct ifnet *inm_ifp = imo->imo_membership[i]->inm_ifp;
+   IFNET_LOCK(inm_ifp);
   in_delmulti(imo->imo_membership[i]);
-   IFNET_UNLOCK(imo->imo_membership[i]->inm_ifp);
+   /* ifp should not leave thanks to solock */
+   IFNET_UNLOCK(inm_ifp);
+}

   /*
* Remove the gap in the membership array.



Yes it appears to address the problem.  Without this patch, the above 
crash was 100% reproducible (5 out of 5).  With this patch applied (and 
no other changes) I have had 3 consecutive reboots without and problem!


Thanks for the quick turn-around.



+--+--++
| Paul Goyette | PGP Key fingerprint: | E-mail addresses:  |
| (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
+--+--++


Re: Status of 8.99.12

2018-02-11 Thread Ryota Ozaki
On Mon, Feb 12, 2018 at 9:48 AM, Paul Goyette  wrote:
> After an extended period of build breaks, I finally got a new release built
> from sources updated on 2018-02-10 at 04:02:43 UTC
>
> I'm seeing several problems with this release that were not seen with my
> previous installation (from last November).
>
> 1. Starting the gnucash program (from pkgsrc finance/gnucash) now takes
>about 3 times as long as before.  Even after successfully loading
>the image (to get libraries etc into the file system cache) it take
>more than three full minutes for the program to initialize.
>
> 2. Whenever I try to shutdown the system, I get a networking-related
>panic.  The following is manually transcribed:
>
> trap type 4 code 0 rip 0x802d3f75 cs 0x8 rflags 0x10282
>   cr2 0x77e0e931c020 ilevel 0x4 rsp 0x80090a7e3c80
> curlwp 0xe4afbb6e8700 pid 926.1 lowest kstack
>   0x80090a7e0c20
> kernel: protection fault trap, code = 0
> stopped in 926.1 (avahi-daemon) at ip_setmoptions+0x237: movq
>   360(%rax),%rdi
> traceback:
> ip_setmoptions + 0x237
> ip_rtloutput + 0x218
> udp_ctloutput + 0x82
> udp_ctloutput_wrapper + 0x2c
> sosetopt + 0x67
> sys_setsockopt + 0x91
> syscall + 0x1ed (syscall #105)

Is the panic fixed by the following patch?

Thanks,
  ozaki-r


diff --git a/sys/netinet/ip_output.c b/sys/netinet/ip_output.c
index 44d8032f387..2e5e346af91 100644
--- a/sys/netinet/ip_output.c
+++ b/sys/netinet/ip_output.c
@@ -1927,9 +1927,13 @@ ip_drop_membership(struct ip_moptions *imo,
const struct sockopt *sopt)
 * Give up the multicast address record to which the
 * membership points.
 */
-   IFNET_LOCK(imo->imo_membership[i]->inm_ifp);
+{
+   struct ifnet *inm_ifp = imo->imo_membership[i]->inm_ifp;
+   IFNET_LOCK(inm_ifp);
in_delmulti(imo->imo_membership[i]);
-   IFNET_UNLOCK(imo->imo_membership[i]->inm_ifp);
+   /* ifp should not leave thanks to solock */
+   IFNET_UNLOCK(inm_ifp);
+}

/*
 * Remove the gap in the membership array.


Status of 8.99.12

2018-02-11 Thread Paul Goyette
After an extended period of build breaks, I finally got a new release 
built from sources updated on 2018-02-10 at 04:02:43 UTC


I'm seeing several problems with this release that were not seen with my 
previous installation (from last November).


1. Starting the gnucash program (from pkgsrc finance/gnucash) now takes
   about 3 times as long as before.  Even after successfully loading
   the image (to get libraries etc into the file system cache) it take
   more than three full minutes for the program to initialize.

2. Whenever I try to shutdown the system, I get a networking-related
   panic.  The following is manually transcribed:

trap type 4 code 0 rip 0x802d3f75 cs 0x8 rflags 0x10282
  cr2 0x77e0e931c020 ilevel 0x4 rsp 0x80090a7e3c80
curlwp 0xe4afbb6e8700 pid 926.1 lowest kstack
  0x80090a7e0c20
kernel: protection fault trap, code = 0
stopped in 926.1 (avahi-daemon) at ip_setmoptions+0x237: movq
  360(%rax),%rdi
traceback:
ip_setmoptions + 0x237
ip_rtloutput + 0x218
udp_ctloutput + 0x82
udp_ctloutput_wrapper + 0x2c
sosetopt + 0x67
sys_setsockopt + 0x91
syscall + 0x1ed (syscall #105)

3. After getting the above, as soon as I type a single character as
   command input to ddb(4), I get a LOCKDEBUG panic.  I didn't yet
   transcribe the 40+ lines of output, but the backtrace clearly
   includes a couple entries from the xhci (USB-3) driver.

4. While the system is running, I have noticed that un-mounting nullfs
   mounts is very slow.  Using mksandbox (from pkgsrc), I create a
   sandbox with about 22 null mounts.  Creating/mounting is no problem,
   and everything runs as expected.  However, when unmounting these
   nullfs, each one takes between 3 and 6 wall-seconds, during which
   the umount process is running at 100% of one CPU.  Additionally,
   some of these umounts seem to grab the CPU with interrupts disabled,
   resulting in total stall of the machine for the duration (and, in X,
   cursor movement stalls/gets "jerky").  All the unmounts eventually
   complete successflly.




+--+--++
| Paul Goyette | PGP Key fingerprint: | E-mail addresses:  |
| (Retired)| FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
+--+--++