Re: i386 installer panics on boot

2020-11-27 Thread Chris Bennett
On Fri, Nov 27, 2020 at 10:01:37AM +, NaQiao wrote:
> Hello,
> 
> The i386 installer panics early during boot with the error "panic: aml_die
> aml_convert:2093" (see attached image)
> 
> The steps leading to this are:
> 
> - wget https://cdn.openbsd.org/pub/OpenBSD/6.8/i386/install68.img
> 
> - dd if=install68.img of=/dev/da0 bs=8M status=progress
> 
> - booting from the usb dongle with no options or key presses
> 
> - kernel panics while initializing CPUs (see attached image)
> 
> The hardware is an HP mini 4100 netbook with Intel Atom N2600, working fine 
> with FreeBSD.
> 
> Thanks
> 
> 
> 

Someone else will be able to answer this for sure, but most i386 machines
work better running amd64. All of my servers do.

Chris Bennett




Re: panic: ehci_alloc_std: curlen == 0 on 6.8-beta

2020-11-27 Thread Marcus Glocker
On Fri, Nov 27, 2020 at 12:57:02PM +, Mikolaj Kucharski wrote:

> I think something as simple as the diff below would be okay. If requested,
> I can put in DPRINTFN()s based on the current printf()s, as I proposed in
> an earlier diff in this thread. The more important part, however, is that
> I think the DIAGNOSTIC ifdef should be removed, because the rest of the
> code, which relies on `if (curlen > len) curlen = len;`, is not enclosed
> in `#ifdef DIAGNOSTIC`.

Right.  That code should be outside of DIAGNOSTIC.  Though I would
leave the printf's in as DPRINTF's for the time being.  If you can
send such a diff I'm fine.
 
> Index: dev/usb/ehci.c
> ===
> RCS file: /cvs/src/sys/dev/usb/ehci.c,v
> retrieving revision 1.212
> diff -u -p -u -r1.212 ehci.c
> --- dev/usb/ehci.c  23 Oct 2020 20:25:35 -  1.212
> +++ dev/usb/ehci.c  27 Nov 2020 10:16:23 -
> @@ -2393,16 +2406,10 @@ ehci_alloc_sqtd_chain(struct ehci_softc 
>   /* must use multiple TDs, fill as much as possible. */
>   curlen = EHCI_QTD_NBUFFERS * EHCI_PAGE_SIZE -
>EHCI_PAGE_OFFSET(dataphys);
> -#ifdef DIAGNOSTIC
> - if (curlen > len) {
> - printf("ehci_alloc_sqtd_chain: curlen=%u "
> - "len=%u offs=0x%x\n", curlen, len,
> - EHCI_PAGE_OFFSET(dataphys));
> - printf("lastpage=0x%x page=0x%x phys=0x%x\n",
> - dataphyslastpage, dataphyspage, dataphys);
> +
> + if (curlen > len)
>   curlen = len;
> - }
> -#endif
> +
>   /* the length must be a multiple of the max size */
>   curlen -= curlen % mps;
>   DPRINTFN(1,("ehci_alloc_sqtd_chain: multiple QTDs, "
> 
> 
> On Sun, Nov 22, 2020 at 01:36:10AM +, Mikolaj Kucharski wrote:
> > Hi,
> > 
> > > Would the diff below be okay, or would just a simple:
> > 
> > if (curlen > len)
> > curlen = len;
> > 
> > be more appropriate here?
> > 
> > On Wed, Nov 11, 2020 at 09:02:49AM +, Mikolaj Kucharski wrote:
> > > On Sat, Oct 24, 2020 at 09:08:45AM +0200, Marcus Glocker wrote:
> > > > > Now you have one M less in your tree checkout :-)
> > > > Thanks for tracking this down.
> > > 
> > > > There is one more change which I would consider. It became visible after
> > > > I switched back to the official snapshot kernel. Now that the kernel is
> > > > not panicking, when the specific code path from this email thread is
> > > > executed it prints:
> > > 
> > > ehci_alloc_sqtd_chain: curlen=20480 len=0 offs=0x0
> > > lastpage=0xcfe66000 page=0xcfe67000 phys=0xcfe67000
> > > 
> > > and I think this is not needed by default any more, so I have this diff:
> > > 
> > > Index: dev/usb/ehci.c
> > > ===
> > > RCS file: /cvs/src/sys/dev/usb/ehci.c,v
> > > retrieving revision 1.212
> > > diff -u -p -u -r1.212 ehci.c
> > > > --- dev/usb/ehci.c  23 Oct 2020 20:25:35 -  1.212
> > > > +++ dev/usb/ehci.c  11 Nov 2020 08:55:01 -
> > > @@ -2395,11 +2408,11 @@ ehci_alloc_sqtd_chain(struct ehci_softc 
> > >EHCI_PAGE_OFFSET(dataphys);
> > >  #ifdef DIAGNOSTIC
> > >   if (curlen > len) {
> > > - printf("ehci_alloc_sqtd_chain: curlen=%u "
> > > + DPRINTFN(1,("ehci_alloc_sqtd_chain: curlen=%u "
> > >   "len=%u offs=0x%x\n", curlen, len,
> > > - EHCI_PAGE_OFFSET(dataphys));
> > > - printf("lastpage=0x%x page=0x%x phys=0x%x\n",
> > > - dataphyslastpage, dataphyspage, dataphys);
> > > + EHCI_PAGE_OFFSET(dataphys)));
> > > > + DPRINTFN(1,("lastpage=0x%x page=0x%x phys=0x%x\n",
> > > + dataphyslastpage, dataphyspage, dataphys));
> > >   curlen = len;
> > >   }
> > >  #endif
> > > 
> > > > to mute those messages. I'm also wondering whether the above could be
> > > > as simple as:
> > > 
> > >   if (curlen > len) {
> > >   curlen = len;
> > > 
> > > > and drop the above printf()s / DPRINTFN()s completely, as for me they
> > > > didn't bring a lot of troubleshooting value. Dunno. Anyway, one way or
> > > > another, I think muting those would be good.
> > > 
> > > 
> > > > On Fri, Oct 23, 2020 at 06:50:53PM +0200, Marcus Glocker wrote:
> > > > 
> > > > > > Honestly, I haven't spent much time to investigate how the curlen = 0 is
> > > > > getting generated exactly, because for me it will be very difficult to
> > > > > understand that without the hardware on my side re-producing the same.
> > > > > 
> > > > > But I had look when the code was introduced to 

Re: apu4 fatal protection fault in supervisor mode [Was: apu4 kernel panic]

2020-11-27 Thread Stuart Henderson
On 2020/11/27 18:50, Mark Kettenis wrote:
> > Date: Fri, 27 Nov 2020 18:43:47 +0100
> > From: Marcus MERIGHI 
> > 
> > s...@spacehopper.org (Stuart Henderson), 2020.11.27 (Fri) 17:54 (CET):
> > > On 2020/11/27 16:21, Marcus MERIGHI wrote:
> > > > It happened again; anything I should do when "syncing disks..." is done?
> > 
> > This time around it doesn't seem to finish "syncing disks..." and drop
> > into ddb>. So it can't be rebooted via "boot reboot". Is there a way to
> > reboot via the serial console? Sending a BREAK (~#) doesn't seem to do
> > anything...
> > 
> > > Can you try downgrading the bios to 4.11.0.4?
> > > https://pcengines.github.io/#mr-33
> > 
> > Will do, as soon as the machine is rebooted. Thanks for the pointer!
> > (You mention 4.11.0.4, but your link goes to 4.11.0.5?)
> 
> Frankly I think this issue is a kernel bug, where somehow the sysctl
> code that reports on open files is racing against code that closes
> those files or otherwise messes with the associated data structures.
> I bet that if you stop the process that is doing those sysctl calls,
> things will run stable again.

fstat was running on Marcus' machine.

> Given what you wrote about the configuration of the machine I'd say
> this is related to sockets and missing locking in/against the network
> stack.  Unfortunately the traces you showed so far don't really give
> me any clues.
> 



Re: apu4 fatal protection fault in supervisor mode [Was: apu4 kernel panic]

2020-11-27 Thread Stuart Henderson
On 2020/11/27 18:43, Marcus MERIGHI wrote:
> s...@spacehopper.org (Stuart Henderson), 2020.11.27 (Fri) 17:54 (CET):
> > On 2020/11/27 16:21, Marcus MERIGHI wrote:
> > > It happened again; anything I should do when "syncing disks..." is done?
> 
> This time around it doesn't seem to finish "syncing disks..." and drop
> into ddb>. So it can't be rebooted via "boot reboot". Is there a way to
> reboot via the serial console? Sending a BREAK (~#) doesn't seem to do
> anything...
> 
> > Can you try downgrading the bios to 4.11.0.4?
> > https://pcengines.github.io/#mr-33
> 
> Will do, as soon as the machine is rebooted. Thanks for the pointer!
> (You mention 4.11.0.4, but your link goes to 4.11.0.5?)
> 
> Marcus
> 

Scratch that - mine lasted longer after going back to that one but it
crashed with that too now.



Re: apu4 fatal protection fault in supervisor mode [Was: apu4 kernel panic]

2020-11-27 Thread Mark Kettenis
> Date: Fri, 27 Nov 2020 18:43:47 +0100
> From: Marcus MERIGHI 
> 
> s...@spacehopper.org (Stuart Henderson), 2020.11.27 (Fri) 17:54 (CET):
> > On 2020/11/27 16:21, Marcus MERIGHI wrote:
> > > It happened again; anything I should do when "syncing disks..." is done?
> 
> This time around it doesn't seem to finish "syncing disks..." and drop
> into ddb>. So it can't be rebooted via "boot reboot". Is there a way to
> reboot via the serial console? Sending a BREAK (~#) doesn't seem to do
> anything...
> 
> > Can you try downgrading the bios to 4.11.0.4?
> > https://pcengines.github.io/#mr-33
> 
> Will do, as soon as the machine is rebooted. Thanks for the pointer!
> (You mention 4.11.0.4, but your link goes to 4.11.0.5?)

Frankly I think this issue is a kernel bug, where somehow the sysctl
code that reports on open files is racing against code that closes
those files or otherwise messes with the associated data structures.
I bet that if you stop the process that is doing those sysctl calls,
things will run stable again.

Given what you wrote about the configuration of the machine I'd say
this is related to sockets and missing locking in/against the network
stack.  Unfortunately the traces you showed so far don't really give
me any clues.



Re: apu4 fatal protection fault in supervisor mode [Was: apu4 kernel panic]

2020-11-27 Thread Marcus MERIGHI
s...@spacehopper.org (Stuart Henderson), 2020.11.27 (Fri) 17:54 (CET):
> On 2020/11/27 16:21, Marcus MERIGHI wrote:
> > It happened again; anything I should do when "syncing disks..." is done?

This time around it doesn't seem to finish "syncing disks..." and drop
into ddb>. So it can't be rebooted via "boot reboot". Is there a way to
reboot via the serial console? Sending a BREAK (~#) doesn't seem to do
anything...

> Can you try downgrading the bios to 4.11.0.4?
> https://pcengines.github.io/#mr-33

Will do, as soon as the machine is rebooted. Thanks for the pointer!
(You mention 4.11.0.4, but your link goes to 4.11.0.5?)

Marcus



Re: apu4 fatal protection fault in supervisor mode [Was: apu4 kernel panic]

2020-11-27 Thread Stuart Henderson
On 2020/11/27 16:21, Marcus MERIGHI wrote:
> It happened again; anything I should do when "syncing disks..." is done?

Can you try downgrading the bios to 4.11.0.4?

https://pcengines.github.io/#mr-33



[PATCH] rdist cmdspecial handling

2020-11-27 Thread Aaron Poffenberger
According to rdist(1), the special and cmdspecial commands accept an
optional list of files. When the file list is omitted, the command
should execute for all files, otherwise it should be executed only
when one of the listed files is affected.

The special command works as expected, but cmdspecial runs for all
files regardless of the file list.

The diff fixes cmdspecial by adding "sc_updfilelist" to the subcmd
structure to hold a list of affected files found in the file list
(sc_args). For cmdspecial commands without a file list, the module
level variable "updfilelist" is used as before.
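
The documented rule can be sketched in a few lines of standalone C.
This is only an illustration, not code from the diff: sketch_applies()
is a hypothetical name, and the NULL-terminated array stands in for
the sc_args name list that rdist keeps as a linked list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch only: decide whether a special/cmdspecial command applies to
 * a changed file.  "args" stands in for sc_args: NULL means no file
 * list was given ("any"), otherwise it is a NULL-terminated array of
 * names ("list").
 */
int
sketch_applies(const char *const *args, const char *changed)
{
	size_t i;

	assert(changed != NULL);

	if (args == NULL)
		return 1;		/* no list: run for every changed file */
	for (i = 0; args[i] != NULL; i++)
		if (strcmp(args[i], changed) == 0)
			return 1;	/* changed file is on the list */
	return 0;			/* list given, file not on it: skip */
}
```

With a list containing only "/etc/hosts", a change to "/etc/motd"
returns 0, which is the case where the unpatched cmdspecial is
reported to run anyway.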

The diff also addresses a couple of related issues, which aren't
technically necessary for the fix, but which helped with debugging and
testing, and make rdist user messages clearer.  I can break them out
as separate patches if necessary.

Changes:
 - Fix file list handling in cmdspecial commands
 - Modify user message to indicate whether special or cmdspecial
   commands apply to "any" file or a "list" of files
 - Show cmdspecial commands that would be executed when the "verify"
   option is set (rdist shows all other actions, including special
   commands, when "verify" is enabled)


Test Script
---
I've included a shell script that demonstrates the problem. It's
attached as a diff for ease of use. I'm not suggesting it be
committed.

The script creates a Distfile with special and cmdspecial commands,
both with and without a file list specified, to highlight the
difference in handling by rdist.

The script runs the base-installed rdist twice, the first time with
two changed files, the second with just one. In both cases it executes
a simple sh script to log the env variable executed with the command
(special or cmdspecial).

If the script finds an executable rdist in /usr/src/usr.bin/rdist/, it
then runs that version with the same two test cases.

The script then displays a diff of the two logs showing that the
patched rdist executes according to the documentation, otherwise it
shows the output of the first execution to show the defective handling
of cmdspecial.

--Aaron

cvs diff: Diffing .
Index: client.c
===
RCS file: /cvs/src/usr.bin/rdist/client.c,v
retrieving revision 1.37
diff -u -p -r1.37 client.c
--- client.c  28 Jun 2019 13:35:03 -  1.37
+++ client.c  27 Nov 2020 14:45:16 -
@@ -66,7 +66,7 @@ struct namelist   *updfilelist = NULL; /* 
 
 static void runspecial(char *, opt_t, char *, int);
 static void addcmdspecialfile(char *, char *, int);
-static void freecmdspecialfiles(void);
+static void freecmdspecialfiles(struct namelist **);
 static struct linkbuf *linkinfo(struct stat *);
 static int sendhardlink(opt_t, struct linkbuf *, char *, int);
 static int sendfile(char *, opt_t, struct stat *, char *, char *, int);
@@ -172,7 +172,8 @@ runspecial(char *starget, opt_t opts, ch
continue;
if (sc->sc_args != NULL && !inlist(sc->sc_args, starget))
continue;
-   message(MT_CHANGE, "special \"%s\"", sc->sc_name);
+   message(MT_CHANGE, "special <%s> \"%s\"",
+   sc->sc_args == NULL ? "any" : "list", sc->sc_name);
if (IS_ON(opts, DO_VERIFY))
continue;
(void) sendcmd(C_SPECIAL,
@@ -201,11 +202,19 @@ addcmdspecialfile(char *starget, char *r
 
rfile = remfilename(source, Tdest, target, rname, destdir);
 
-   for (sc = subcmds; sc != NULL && !isokay; sc = sc->sc_next) {
+   for (sc = subcmds; sc != NULL; sc = sc->sc_next) {
if (sc->sc_type != CMDSPECIAL)
continue;
-   if (sc->sc_args != NULL && !inlist(sc->sc_args, starget))
+   if (sc->sc_args == NULL) {
+   isokay = TRUE;
continue;
+   } else if (!inlist(sc->sc_args, starget))
+   continue;
+   new = xmalloc(sizeof *new);
+   new->n_name = xstrdup(rfile);
+   new->n_regex = NULL;
+   new->n_next = sc->sc_updfilelist;
+   sc->sc_updfilelist = new;
isokay = TRUE;
}
 
@@ -222,11 +231,11 @@ addcmdspecialfile(char *starget, char *r
  * Free the file list
  */
 static void
-freecmdspecialfiles(void)
+freecmdspecialfiles(struct namelist **list)
 {
struct namelist *ptr, *save;
 
-   for (ptr = updfilelist; ptr; ) {
+   for (ptr = *list; ptr; ) {
if (ptr->n_name) (void) free(ptr->n_name);
save = ptr->n_next;
(void) free(ptr);
@@ -235,7 +244,7 @@ freecmdspecialfiles(void)
else
ptr = NULL;
}
-   updfilelist = NULL;
+   *list = NULL;
 }
 
 /*
@@ -251,11 +260,15 @@ runcmdspecial(struct cmd *cmd, opt_t opt
for (sc = cmd->c_cmds; sc != NULL; sc = sc->sc_next) {
if (sc->sc_type != CMDSPECIAL)

Re: apu4 fatal protection fault in supervisor mode [Was: apu4 kernel panic]

2020-11-27 Thread Marcus MERIGHI
It happened again; anything I should do when "syncing disks..." is done?

fatal protection fault in supervisor mode
trap type 4 code 0 rip 8198bf66 cs 8 rflags 10246 cr2 f2556d1000
cpl 0 0
gsbase 0x80002241aff0  kgsbase 0x0
panic: trap type 4, code=0, pc=8198bf66
Starting stack trace...
panic(81de8b6b) at panic+0x11d
kerntrap(800022969060) at kerntrap+0x114
alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
fill_file(80f25c00,fd8100fbfb58,fd81033e36e8,3,0,80002270806
sysctl_file(800022969688,4,f564eca9000,8000229696b8,80002265f188) a2
kern_sysctl(800022969684,5,f564eca9000,8000229696b8,0,0) at kern_sysctl1
sys_sysctl(80002265f188,800022969720,800022969780) at
sys_sysctl+0x4
syscall(8000229697f0) at syscall+0x389
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7c6e30, count: 248
End of stack trace.
syncing disks...

Marcus

mcmer-open...@tor.at (Marcus MERIGHI), 2020.11.26 (Thu) 16:51 (CET):
> >Synopsis:kernel panic on apu4
> >Category:kernel amd64
> >Environment:
>   System  : OpenBSD 6.8
>   Details : OpenBSD 6.8 (GENERIC.MP) #1: Tue Nov  3 09:06:04 MST 2020
>
> r...@syspatch-68-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> 
>   Architecture: OpenBSD.amd64
>   Machine : amd64
> >Description:
>   kernel panic on apu4
> >How-To-Repeat:
>   It happened for the first time with this hardware.
> Put some load (~3000 Interrupts) on apu4 with 20 VLANs and CARPs and
>   a 715 rules pf.conf and ipsec/npppd VPN.
> >Fix:
>   None known.
> 
> what i gathered from the ddb> prompt:
> 
> fatal protection fault in supervisor mode
> trap type 4 code 0 rip 81347f76 cs 8 rflags 10246 cr2 a0b46b96000 cpl 
> 0 rsp 80
> gsbase 0x800022411ff0  kgsbase 0x0
> panic: trap type 4, code=0, pc=81347f76
> Starting stack trace...
> panic(81de557b) at panic+0x11d
> kerntrap(8000229c1630) at kerntrap+0x114
> alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
> fill_file(80ca8000,fd811c6a23d0,fd81012516f8,3,0,800022735658)
>  at fil6
> sysctl_file(8000229c1c58,4,b4769963000,8000229c1c88,8000227d8c98) 
> at sysctl_f2
> kern_sysctl(8000229c1c54,5,b4769963000,8000229c1c88,0,0) at 
> kern_sysctl+0x1d1
> sys_sysctl(8000227d8c98,8000229c1cf0,8000229c1d50) at 
> sys_sysctl+0x184
> syscall(8000229c1dc0) at syscall+0x389
> Xsyscall() at Xsyscall+0x128
> end of kernel
> end trace frame: 0x7f7e8170, count: 248
> End of stack trace.
> syncing disks...WARNING: SPL NOT LOWERED ON SYSCALL 3 3 EXIT 0 9
> Stopped at  savectx+0xb1:   movl$0,%gs:0x530
> TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
>  447997  99550   1000 0x2  02  fstat
>  156309  93905  00x12  03K sh
> *495661   2360 620x100010  01  spamlogd
>  492394  78811  0 0x14000 0x42000  softclock
> savectx() at savectx+0xb1
> end of kernel
> end trace frame: 0x7f7caf00, count: 14
> ddb{1}> trace
> savectx() at savectx+0xb1
> end of kernel
> end trace frame: 0x7f7caf00, count: -1
> ddb{1}> mach ddbcpu 0
> Stopped at  x86_ipi_db+0x12:leave
> x86_ipi_db(820e2ff0) at x86_ipi_db+0x12
> x86_ipi_handler() at x86_ipi_handler+0x80
> Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> __mp_lock(8218c620) at __mp_lock+0x72
> intr_handler(8000225c3d00,8008ee80) at intr_handler+0x44
> Xintr_ioapic_level0_untramp() at Xintr_ioapic_level0_untramp+0x1a3
> in_cksum(fd80cd5bdb00,24) at in_cksum+0x44
> carp_send_ad(80be7c00) at carp_send_ad+0x27c
> carp_timer_ad(80be7c00) at carp_timer_ad+0x20
> softclock_thread(8000f3c0) at softclock_thread+0x16b
> end trace frame: 0x0, count: 5
> ddb{0}> trace
> x86_ipi_db(820e2ff0) at x86_ipi_db+0x12
> x86_ipi_handler() at x86_ipi_handler+0x80
> Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> __mp_lock(8218c620) at __mp_lock+0x72
> intr_handler(8000225c3d00,8008ee80) at intr_handler+0x44
> Xintr_ioapic_level0_untramp() at Xintr_ioapic_level0_untramp+0x1a3
> in_cksum(fd80cd5bdb00,24) at in_cksum+0x44
> carp_send_ad(80be7c00) at carp_send_ad+0x27c
> carp_timer_ad(80be7c00) at carp_timer_ad+0x20
> softclock_thread(8000f3c0) at softclock_thread+0x16b
> end trace frame: 0x0, count: -10
> ddb{0}> mach ddbcpu 2
> Stopped at  x86_ipi_db+0x12:leave
> x86_ipi_db(800022411ff0) at x86_ipi_db+0x12
> x86_ipi_handler() at x86_ipi_handler+0x80
> Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> __mp_acquire_count(8218c620,1) at __mp_acquire_count+0x92
> tsleep(fd812e111468,11,81e94678,0) at tsleep+0x10e
> getblk(fd812e880680,12fce0,4000,0,) at getblk+0xe5
> bread(fd812e880680,12fce0,4000,8000229c11e0) 

Re: kernel panic when removing interface

2020-11-27 Thread Martin Pieuchot
On 27/11/20(Fri) 15:47, Denis Fondras wrote:
> > It is.  I guess a fix should go in net/rtsock.c to prevent adding a
> > "-link" entry to a routing table different from ifp->if_rdomain.
> > 
> 
> I came up with this, which is more radical.

Which is not exactly what we want.  This will prevent adding any route
to a routing table different from the interface's rdomain.

What needs to be enforced is a check on requests coming from userland
that try to insert a "-link" route.  Such a check would have the added
benefit of documenting that L2 entries should only be inserted in the
rdomain table of an interface.
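
As a rough illustration of the narrower check (a standalone sketch,
not a proposed diff): the flag name, the struct and the function below
are placeholders invented for this example and are not the kernel's
definitions.  Only link-layer requests are rejected when the table
differs from the interface's rdomain; other routes pass through.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Placeholder flag marking a request made with "-link". */
#define SKETCH_RTF_LLINFO	0x1u

/* Placeholder for the one interface field the check needs. */
struct sketch_ifnet {
	unsigned int	if_rdomain;
};

/*
 * Return EINVAL for an L2 ("-link") entry aimed at a routing table
 * other than the interface's rdomain, 0 otherwise.
 */
int
sketch_check_route(unsigned int flags, unsigned int tableid,
    const struct sketch_ifnet *ifp)
{
	assert(ifp != NULL);

	if ((flags & SKETCH_RTF_LLINFO) && tableid != ifp->if_rdomain)
		return EINVAL;
	return 0;
}
```

Under these assumptions a request like "route -T1 add ... -link" on an
interface in rdomain 0 would be refused, while a non-link route in
table 1 would still be accepted.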

> Index: route.c
> ===
> RCS file: /cvs/src/sys/net/route.c,v
> retrieving revision 1.397
> diff -u -p -r1.397 route.c
> --- route.c   29 Oct 2020 21:15:27 -  1.397
> +++ route.c   27 Nov 2020 09:39:53 -
> @@ -865,6 +865,8 @@ rtrequest(int req, struct rt_addrinfo *i
>   return (EINVAL);
>   ifa = info->rti_ifa;
>   ifp = ifa->ifa_ifp;
> + if (tableid != ifp->if_rdomain)
> + return (EINVAL);
>   if (prio == 0)
>   prio = ifp->if_priority + RTP_STATIC;
>  
> 



Re: kernel panic when removing interface

2020-11-27 Thread Denis Fondras
> It is.  I guess a fix should go in net/rtsock.c to prevent adding a
> "-link" entry to a routing table different from ifp->if_rdomain.
> 

I came up with this, which is more radical.

Index: route.c
===
RCS file: /cvs/src/sys/net/route.c,v
retrieving revision 1.397
diff -u -p -r1.397 route.c
--- route.c 29 Oct 2020 21:15:27 -  1.397
+++ route.c 27 Nov 2020 09:39:53 -
@@ -865,6 +865,8 @@ rtrequest(int req, struct rt_addrinfo *i
return (EINVAL);
ifa = info->rti_ifa;
ifp = ifa->ifa_ifp;
+   if (tableid != ifp->if_rdomain)
+   return (EINVAL);
if (prio == 0)
prio = ifp->if_priority + RTP_STATIC;
 



Re: double fault trap, 6.8, kbind->...->pmap_tlb_shootpage

2020-11-27 Thread Stuart Henderson
On 2020/11/27 13:19, Stuart Henderson wrote:
> On 2020/11/26 20:17, Stuart Henderson wrote:
> > I set up a console server today - after leaving it for a few hours I came
> > back to a double fault trap. 6.8+syspatches, amd64, APU2. Simple PF
> > config, em(4), wg(4). Running ssh/sshd/conserver/lldpd plus default base
> > daemons.
> 
> Traces from another crash. I had another one in db_read_bytes as well
> but forgot to trace.

Notes on the conserver config: it has some ipmi sol consoles (UDP),
some regular network consoles on TCP ports, and some network consoles
via ssh client - no serial consoles on the machine itself.

The hardware has previously been used in a different role (ipsec
concentrator) with no problems so doesn't seem likely to be a hw issue.

I don't have a specific trigger but it doesn't seem to stay up for
more than an hour or so.



Re: kernel panic when removing interface

2020-11-27 Thread Martin Pieuchot
On 26/11/20(Thu) 20:38, Pierre Emeriaud wrote:
> Hello Martin
> 
> On Thu, 26 Nov 2020 at 14:27, Martin Pieuchot wrote:
> >
> > >
> > > $ doas route -T1 add 192.0.2.2/32 -link -iface vlan12
> >
> > I wonder if the problem isn't in the validation of these parameters.
> >
> > Should we accept a L2 (-link) entry on a routing table which isn't the
> > routing domain?  If so why does the entry persist in the ARP cache?
> 
> Which arp entry are you referring to? The one from the route I added?

Yes.  In the kernel ARP entries are represented as route entries.  So
when you add a "-link" route it is an ARP entry.

> > Can you reproduce the problem if you don't specify T1?
> 
> No. The routes are correctly removed when the interface is destroyed.
> It only crashes when the routes are added to another (non-empty if
> that matters) rdomain, but again, this was a silly mistake on my side.

Still, silly mistakes should be prevented and not crash the kernel ;)

> I reported it as it might be of interest to fix this for the sake of
> it, but it causes almost no harm.

It is.  I guess a fix should go in net/rtsock.c to prevent adding a
"-link" entry to a routing table different from ifp->if_rdomain.

> PS: I've managed to crash my first router just by waiting a few
> seconds - no need to remove the route - same thing as the second
> router:
> ddb> show panic
> kernel diagnostic assertion "ifp != NULL" failed: file "/usr/src/sys/netinet/if_ether.c", line 718
> 
> ddb> trace
> db_enter() at db_enter+0x10
> panic(81dc761f) at panic+0x12a
> __assert(81e321c2,81db9f2b,2ce,81d9e429) at __assert+0x2b
> arp_rtrequest(fd800baa10a8,fd800baa10a8,fd801aa63dc0) at arp_rtrequest
> arptimer(8216a090) at arptimer+0x67
> softclock_thread(8000ea40) at softclock_thread+0x13f
> end trace frame: 0x0, count: -6



Re: double fault trap, 6.8, kbind->...->pmap_tlb_shootpage

2020-11-27 Thread Stuart Henderson
On 2020/11/26 20:17, Stuart Henderson wrote:
> I set up a console server today - after leaving it for a few hours I came
> back to a double fault trap. 6.8+syspatches, amd64, APU2. Simple PF
> config, em(4), wg(4). Running ssh/sshd/conserver/lldpd plus default base
> daemons.

Traces from another crash. I had another one in db_read_bytes as well
but forgot to trace.


login: uvm_fault(0x821214b8, 0x0008a240, 0, 4) -> e
kernel: page fault trap, code=0
Stopped at  0x0008a240:uvm_fault(0x821214b8, 
0x0008a240, 0, 1) -> e
 kernel: page fault trap, code=0
Stopped at  db_read_bytes+0x70: movzbl  0(%rdi,%rcx,1),%eax
ddb{0}> tr
db_read_bytes(0008a240,1,80001fe21338) at db_read_bytes+0x70
db_get_value(0008a240,1,0) at db_get_value+0x3f
db_disasm(0008a240,0) at db_disasm+0x85
db_trap(6,0) at db_trap+0xa5
db_ktrap(6,0,80001fe21590) at db_ktrap+0x112
kerntrap(80001fe21590) at kerntrap+0xa4
alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
0008a240(a,a,91a8cae800152f3a,0,10,80001fe21670) at 0x0008a240
x86_fast_ipi(80001fa78ff0,f1) at x86_fast_ipi+0x42
pmap_tlb_shootpage(821b4c08,80001fe54000,1) at pmap_tlb_shootpage+0x136
pmap_do_remove(821b4c08,80001fe54000,80001fe55000,0) at pmap_do_remove+0x524
uvm_unmap_remove(821214b8,80001fe54000,80001fe55000,80001fe218e0,0,1) at uvm_unmap_remove+0x22b
sys_kbind(80001fe5b8f0,80001fe21960,80001fe219c0) at sys_kbind+0x382
syscall(80001fe21a30) at syscall+0x389
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7ebf38, count: -15
ddb{0}> sh reg
rdi   0x0008a240
rsi  0x1
rbp   0x80001fe21320
rbx   0x0008a240
rdx   0x80001fe21338
rcx0
rax  0x2
r8 0
r9   0x1
r10   0x240c40d54ae302a4
r11   0xf0a9831425dab75e
r12  0x1
r13  0x2
r14  0x1
r150
rip   0x812f17a0db_read_bytes+0x70
cs   0x8
rflags   0x10246__ALIGN_SIZE+0xf246
rsp   0x80001fe21300
ss  0x10
db_read_bytes+0x70: movzbl  0(%rdi,%rcx,1),%eax
ddb{0}> ps /o
TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
* 45975  16635736   0  00K conserver
   5773   2362736   0  02  conserver
ddb{0}> ps
   PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
*16635   45975   2362736  7   0conserver
  8429  426980   2362736  30x82  selectssh
 39053  373370  71219736  30x82  netio ssh
 19513   64298  98727736  30x82  selectssh
 47789  322641  71219736  30x82  netio ssh
 21271  468310  71219736  30x82  netio ssh
 30971  306907   2362736  30x82  netio ssh
 79098  114844   2362736  30x82  selectssh
 20267  292736   2362736  30x82  netio ssh
 35029  283216   2362736  30x82  netio ssh
 85342  247671   2362736  30x82  netio ssh
  5026   19178   2362736  30x82  selectssh
 77040  304406   2362736  30x82  selectssh
 76252  106736  71219736  30x82  netcon2   ssh
 36012  282631  71219736  30x82  netcon2   ssh
 33864  439485  71219736  30x82  netcon2   ssh
 59242  499297  71219736  30x82  netcon2   ssh
 98607  317809  71219736  30x82  netcon2   ssh
 34332  113957  71219736  30x82  netcon2   ssh
 73243   60205  71219736  30x82  netcon2   ssh
 55945  371324  38307   1000  30x100083  kqreadtail
 98727  337142  43620736  30x80  selectconserver
 32756   59128   2362736  30x100082  selectssh
 85586  501496   2362736  30x100082  selectssh
  23625773  43620736  7   0conserver
  2362  513578  43620736  3   0x480  selectconserver
  2362  396399  43620736  3   0x480  poll  conserver
 38307  239821  83880   1000  30x10008b  pause ksh
 43768  500734  71219736  30x100082  selectssh
 34176  206371  71219736  30x100082  selectssh
 46020  243860  71219736  30x100082  selectssh
 18338  340252  71219736  30x100082  selectssh
 14861  256347  71219736  30x100082  selectssh
 18404  496474  71219736  30x100082  selectssh
 71219  262961  

Re: panic: ehci_alloc_std: curlen == 0 on 6.8-beta

2020-11-27 Thread Mikolaj Kucharski
I think something as simple as the diff below would be okay. If requested,
I can put in DPRINTFN()s based on the current printf()s, as I proposed in
an earlier diff in this thread. The more important part, however, is that
I think the DIAGNOSTIC ifdef should be removed, because the rest of the
code, which relies on `if (curlen > len) curlen = len;`, is not enclosed
in `#ifdef DIAGNOSTIC`.


Index: dev/usb/ehci.c
===
RCS file: /cvs/src/sys/dev/usb/ehci.c,v
retrieving revision 1.212
diff -u -p -u -r1.212 ehci.c
--- dev/usb/ehci.c  23 Oct 2020 20:25:35 -  1.212
+++ dev/usb/ehci.c  27 Nov 2020 10:16:23 -
@@ -2393,16 +2406,10 @@ ehci_alloc_sqtd_chain(struct ehci_softc 
/* must use multiple TDs, fill as much as possible. */
curlen = EHCI_QTD_NBUFFERS * EHCI_PAGE_SIZE -
 EHCI_PAGE_OFFSET(dataphys);
-#ifdef DIAGNOSTIC
-   if (curlen > len) {
-   printf("ehci_alloc_sqtd_chain: curlen=%u "
-   "len=%u offs=0x%x\n", curlen, len,
-   EHCI_PAGE_OFFSET(dataphys));
-   printf("lastpage=0x%x page=0x%x phys=0x%x\n",
-   dataphyslastpage, dataphyspage, dataphys);
+
+   if (curlen > len)
curlen = len;
-   }
-#endif
+
/* the length must be a multiple of the max size */
curlen -= curlen % mps;
DPRINTFN(1,("ehci_alloc_sqtd_chain: multiple QTDs, "
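
For illustration, the length calculation around the changed lines can
be sketched as standalone C.  The two constants are assumptions chosen
so that 5 * 4096 equals the curlen=20480 printed in the log quoted
below, and sketch_curlen() is a hypothetical helper, not a function in
ehci.c.

```c
#include <assert.h>

/* Assumed values for the sketch; 5 * 4096 = 20480 as seen in the log. */
#define SKETCH_QTD_NBUFFERS	5u
#define SKETCH_PAGE_SIZE	4096u

/*
 * Mirror the three steps around the diff: fill as many buffers as
 * possible, clamp to the remaining length, then round down to a
 * multiple of the max packet size.
 */
unsigned int
sketch_curlen(unsigned int page_offset, unsigned int len, unsigned int mps)
{
	unsigned int curlen;

	assert(mps != 0);			/* avoid a modulo by zero */
	assert(page_offset < SKETCH_PAGE_SIZE);

	curlen = SKETCH_QTD_NBUFFERS * SKETCH_PAGE_SIZE - page_offset;
	if (curlen > len)
		curlen = len;			/* the clamp made unconditional */
	curlen -= curlen % mps;
	return curlen;
}
```

With len=0 and offs=0 as in the log, the clamp brings curlen from
20480 down to 0.  Without the clamp it would stay at 20480, which is
larger than the data actually left to transfer.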


On Sun, Nov 22, 2020 at 01:36:10AM +, Mikolaj Kucharski wrote:
> Hi,
> 
> Would the diff below be okay, or would just a simple:
> 
>   if (curlen > len)
>   curlen = len;
> 
> be more appropriate here?
> 
> On Wed, Nov 11, 2020 at 09:02:49AM +, Mikolaj Kucharski wrote:
> > On Sat, Oct 24, 2020 at 09:08:45AM +0200, Marcus Glocker wrote:
> > > Now you have one M less in your tree checkout :-)
> > > Thanks for tracking this down.
> > 
> > There is one more change which I would consider. It became visible after I
> > switched back to the official snapshot kernel. Now that the kernel is not
> > panicking, when the specific code path from this email thread is executed
> > it prints:
> > 
> > ehci_alloc_sqtd_chain: curlen=20480 len=0 offs=0x0
> > lastpage=0xcfe66000 page=0xcfe67000 phys=0xcfe67000
> > 
> > and I think this is not needed by default any more, so I have this diff:
> > 
> > Index: dev/usb/ehci.c
> > ===
> > RCS file: /cvs/src/sys/dev/usb/ehci.c,v
> > retrieving revision 1.212
> > diff -u -p -u -r1.212 ehci.c
> > --- dev/usb/ehci.c  23 Oct 2020 20:25:35 -  1.212
> > +++ dev/usb/ehci.c  11 Nov 2020 08:55:01 -
> > @@ -2395,11 +2408,11 @@ ehci_alloc_sqtd_chain(struct ehci_softc 
> >  EHCI_PAGE_OFFSET(dataphys);
> >  #ifdef DIAGNOSTIC
> > if (curlen > len) {
> > -   printf("ehci_alloc_sqtd_chain: curlen=%u "
> > +   DPRINTFN(1,("ehci_alloc_sqtd_chain: curlen=%u "
> > "len=%u offs=0x%x\n", curlen, len,
> > -   EHCI_PAGE_OFFSET(dataphys));
> > -   printf("lastpage=0x%x page=0x%x phys=0x%x\n",
> > -   dataphyslastpage, dataphyspage, dataphys);
> > +   EHCI_PAGE_OFFSET(dataphys)));
> > +   DPRINTFN(1,("lastpage=0x%x page=0x%x phys=0x%x\n",
> > +   dataphyslastpage, dataphyspage, dataphys));
> > curlen = len;
> > }
> >  #endif
> > 
> > to mute those messages. I'm also wondering whether the above could be as
> > simple as:
> > 
> > if (curlen > len) {
> > curlen = len;
> > 
> > and drop the above printf()s / DPRINTFN()s completely, as for me they
> > didn't bring a lot of troubleshooting value. Dunno. Anyway, one way or
> > another, I think muting those would be good.
> > 
> > 
> > > On Fri, Oct 23, 2020 at 06:50:53PM +0200, Marcus Glocker wrote:
> > > 
> > > > Honestly, I haven't spent much time to investigate how the curlen = 0 is
> > > > getting generated exactly, because for me it will be very difficult to
> > > > understand that without the hardware on my side re-producing the same.
> > > > 
> > > > But I had look when the code was introduced to handle curlen == 0 later
> > > > in the function:
> > > > 
> > > > if (iscontrol) {
> > > > /*
> > > >  * adjust the toggle based on the number of packets
> > > >  * in this qtd
> > > >  */
> > > > if ((((curlen + mps - 1) / mps) & 1) || curlen == 0)
> > > > qtdstatus ^= EHCI_QTD_TOGGLE_MASK;
> > > > }
> > > > 
> > > > This was