Re: i386 installer panics on boot
On Fri, Nov 27, 2020 at 10:01:37AM +, NaQiao wrote:
> Hello,
>
> The i386 installer panics in early booting with error "panic: aml_die
> aml_convert:2093" (see image attached)
>
> The steps leading to this are:
>
> - wget https://cdn.openbsd.org/pub/OpenBSD/6.8/i386/install68.img
>
> - dd if=install68.img of=/dev/da0 bs=8M status=progress
>
> - booting from the usb dongle with no options or key presses
>
> - kernel panics while initializing CPUs (see attached image)
>
> The hardware is an HP mini 4100 netbook with Intel Atom N2600, working fine
> with FreeBSD.
>
> Thanks

Someone else will be able to answer this for sure, but most i386 work
better running amd64. All of my servers do.

Chris Bennett
Re: panic: ehci_alloc_std: curlen == 0 on 6.8-beta
On Fri, Nov 27, 2020 at 12:57:02PM +, Mikolaj Kucharski wrote:
> I think something as simple as below would be okay. If requested I can
> put in DPRINTFN()s based on current printf()s, like I proposed in
> earlier diff in this thread. However more important part is, that I
> think DIAGNOSTIC ifdef should be removed as rest of the code, which
> relies on `if (curlen > len) curlen = len;` is not enclosed with
> `#ifdef DIAGNOSTIC`

Right. That code should be outside of DIAGNOSTIC. Though I would leave
the printf's in as DPRINTF's for the time being. If you can send such a
diff I'm fine.

> Index: dev/usb/ehci.c
> ===
> RCS file: /cvs/src/sys/dev/usb/ehci.c,v
> retrieving revision 1.212
> diff -u -p -u -r1.212 ehci.c
> --- dev/usb/ehci.c	23 Oct 2020 20:25:35 -	1.212
> +++ dev/usb/ehci.c	27 Nov 2020 10:16:23 -
> @@ -2393,16 +2406,10 @@ ehci_alloc_sqtd_chain(struct ehci_softc
>  			/* must use multiple TDs, fill as much as possible. */
>  			curlen = EHCI_QTD_NBUFFERS * EHCI_PAGE_SIZE -
>  			    EHCI_PAGE_OFFSET(dataphys);
> -#ifdef DIAGNOSTIC
> -			if (curlen > len) {
> -				printf("ehci_alloc_sqtd_chain: curlen=%u "
> -				    "len=%u offs=0x%x\n", curlen, len,
> -				    EHCI_PAGE_OFFSET(dataphys));
> -				printf("lastpage=0x%x page=0x%x phys=0x%x\n",
> -				    dataphyslastpage, dataphyspage, dataphys);
> +
> +			if (curlen > len)
>  				curlen = len;
> -			}
> -#endif
> +
>  			/* the length must be a multiple of the max size */
>  			curlen -= curlen % mps;
>  			DPRINTFN(1,("ehci_alloc_sqtd_chain: multiple QTDs, "
>
>
> On Sun, Nov 22, 2020 at 01:36:10AM +, Mikolaj Kucharski wrote:
> > Hi,
> >
> > Would below diff be okay, or just simple:
> >
> > 	if (curlen > len)
> > 		curlen = len;
> >
> > be more appropriate here?
> >
> > On Wed, Nov 11, 2020 at 09:02:49AM +, Mikolaj Kucharski wrote:
> > > On Sat, Oct 24, 2020 at 09:08:45AM +0200, Marcus Glocker wrote:
> > > > Now you have one M less in your tree checkout :-)
> > > > Thanks for tracking this down.
> > >
> > > There is one more change, which I would consider. It was visible after I
> > > switched back to official snapshot kernel. Now that kernel is not
> > > panicking, when the specific code path from this email thread is executed
> > > it prints:
> > >
> > > ehci_alloc_sqtd_chain: curlen=20480 len=0 offs=0x0
> > > lastpage=0xcfe66000 page=0xcfe67000 phys=0xcfe67000
> > >
> > > and I think this is not needed by default any more, so I have this diff:
> > >
> > > Index: dev/usb/ehci.c
> > > ===
> > > RCS file: /cvs/src/sys/dev/usb/ehci.c,v
> > > retrieving revision 1.212
> > > diff -u -p -u -r1.212 ehci.c
> > > --- dev/usb/ehci.c	23 Oct 2020 20:25:35 -	1.212
> > > +++ dev/usb/ehci.c	11 Nov 2020 08:55:01 -
> > > @@ -2395,11 +2408,11 @@ ehci_alloc_sqtd_chain(struct ehci_softc
> > >  			    EHCI_PAGE_OFFSET(dataphys);
> > >  #ifdef DIAGNOSTIC
> > >  			if (curlen > len) {
> > > -				printf("ehci_alloc_sqtd_chain: curlen=%u "
> > > +				DPRINTFN(1,("ehci_alloc_sqtd_chain: curlen=%u "
> > >  				    "len=%u offs=0x%x\n", curlen, len,
> > > -				    EHCI_PAGE_OFFSET(dataphys));
> > > -				printf("lastpage=0x%x page=0x%x phys=0x%x\n",
> > > -				    dataphyslastpage, dataphyspage, dataphys);
> > > +				    EHCI_PAGE_OFFSET(dataphys)));
> > > +				DPRINTFN(1,("lastpage=0x%x page=0x%x phys=0x%x\n",
> > > +				    dataphyslastpage, dataphyspage, dataphys));
> > >  				curlen = len;
> > >  			}
> > >  #endif
> > >
> > > to mute those messages. I'm also wondering could above be just as simple
> > > as:
> > >
> > > 	if (curlen > len) {
> > > 		curlen = len;
> > >
> > > and to drop completely above printf()s / DPRINTFN()s as for me they
> > > didn't bring a lot of troubleshooting value. Dunno. Anyway one way or
> > > another muting those I think would be good.
> > >
> > >
> > > > > On Fri, Oct 23, 2020 at 06:50:53PM +0200, Marcus Glocker wrote:
> > > > >
> > > > > Honestly, I haven't spent much time to investigate how the curlen = 0
> > > > > is getting generated exactly, because for me it will be very difficult
> > > > > to understand that without the hardware on my side re-producing the
> > > > > same.
> > > > >
> > > > > But I had a look when the code was introduced to
Re: apu4 fatal protection fault in supervisor mode [Was: apu4 kernel panic]
On 2020/11/27 18:50, Mark Kettenis wrote:
> > Date: Fri, 27 Nov 2020 18:43:47 +0100
> > From: Marcus MERIGHI
> >
> > s...@spacehopper.org (Stuart Henderson), 2020.11.27 (Fri) 17:54 (CET):
> > > On 2020/11/27 16:21, Marcus MERIGHI wrote:
> > > > It happened again; anything I should do when "syncing disks..." is done?
> >
> > This time around it doesn't seem to finish "syncing disks..." and drop
> > into ddb>. So it can't be rebooted via "boot reboot". Is there a way to
> > reboot via the serial console? Sending a BREAK (~#) doesn't seem to do
> > anything...
> >
> > > Can you try downgrading the bios to 4.11.0.4?
> > > https://pcengines.github.io/#mr-33
> >
> > Will do, as soon as the machine is rebooted. Thanks for the pointer!
> > (You mention 4.11.0.4, but your link goes to 4.11.0.5?)
>
> Frankly I think this issue is a kernel bug, where somehow the sysctl
> code that reports on open files is racing against code that closes
> those files or otherwise messes with the associated data structures.
> I bet that if you stop the process that is doing those sysctl calls,
> things will run stable again.

fstat was running on Marcus' machine.

> Given what you wrote about the configuration of the machine I'd say
> this is related to sockets and missing locking in/against the network
> stack. Unfortunately the traces you showed so far don't really give
> me any clues.
>
Re: apu4 fatal protection fault in supervisor mode [Was: apu4 kernel panic]
On 2020/11/27 18:43, Marcus MERIGHI wrote:
> s...@spacehopper.org (Stuart Henderson), 2020.11.27 (Fri) 17:54 (CET):
> > On 2020/11/27 16:21, Marcus MERIGHI wrote:
> > > It happened again; anything I should do when "syncing disks..." is done?
>
> This time around it doesn't seem to finish "syncing disks..." and drop
> into ddb>. So it can't be rebooted via "boot reboot". Is there a way to
> reboot via the serial console? Sending a BREAK (~#) doesn't seem to do
> anything...
>
> > Can you try downgrading the bios to 4.11.0.4?
> > https://pcengines.github.io/#mr-33
>
> Will do, as soon as the machine is rebooted. Thanks for the pointer!
> (You mention 4.11.0.4, but your link goes to 4.11.0.5?)
>
> Marcus
>

Scratch that - mine lasted longer after going back to that one but it
crashed with that too now.
Re: apu4 fatal protection fault in supervisor mode [Was: apu4 kernel panic]
> Date: Fri, 27 Nov 2020 18:43:47 +0100
> From: Marcus MERIGHI
>
> s...@spacehopper.org (Stuart Henderson), 2020.11.27 (Fri) 17:54 (CET):
> > On 2020/11/27 16:21, Marcus MERIGHI wrote:
> > > It happened again; anything I should do when "syncing disks..." is done?
>
> This time around it doesn't seem to finish "syncing disks..." and drop
> into ddb>. So it can't be rebooted via "boot reboot". Is there a way to
> reboot via the serial console? Sending a BREAK (~#) doesn't seem to do
> anything...
>
> > Can you try downgrading the bios to 4.11.0.4?
> > https://pcengines.github.io/#mr-33
>
> Will do, as soon as the machine is rebooted. Thanks for the pointer!
> (You mention 4.11.0.4, but your link goes to 4.11.0.5?)

Frankly I think this issue is a kernel bug, where somehow the sysctl
code that reports on open files is racing against code that closes
those files or otherwise messes with the associated data structures.
I bet that if you stop the process that is doing those sysctl calls,
things will run stable again.

Given what you wrote about the configuration of the machine I'd say
this is related to sockets and missing locking in/against the network
stack. Unfortunately the traces you showed so far don't really give
me any clues.
Re: apu4 fatal protection fault in supervisor mode [Was: apu4 kernel panic]
s...@spacehopper.org (Stuart Henderson), 2020.11.27 (Fri) 17:54 (CET):
> On 2020/11/27 16:21, Marcus MERIGHI wrote:
> > It happened again; anything I should do when "syncing disks..." is done?

This time around it doesn't seem to finish "syncing disks..." and drop
into ddb>. So it can't be rebooted via "boot reboot". Is there a way to
reboot via the serial console? Sending a BREAK (~#) doesn't seem to do
anything...

> Can you try downgrading the bios to 4.11.0.4?
> https://pcengines.github.io/#mr-33

Will do, as soon as the machine is rebooted. Thanks for the pointer!
(You mention 4.11.0.4, but your link goes to 4.11.0.5?)

Marcus
Re: apu4 fatal protection fault in supervisor mode [Was: apu4 kernel panic]
On 2020/11/27 16:21, Marcus MERIGHI wrote:
> It happened again; anything I should do when "syncing disks..." is done?

Can you try downgrading the bios to 4.11.0.4?
https://pcengines.github.io/#mr-33
[PATCH] rdist cmdspecial handling
According to rdist(1), the special and cmdspecial commands accept an
optional list of files. When the list is omitted, the command should
execute for all files; otherwise it should be executed only when one of
the listed files is affected. The special command works as expected, but
cmdspecial runs for all files regardless of the list.

The diff fixes cmdspecial by adding "sc_updfilelist" to the subcmd
structure to hold a list of affected files found in the list (sc_args).
For cmdspecial commands without a list, the module level variable
"updfilelist" is used as before.

The diff also addresses a couple of related issues, which aren't
technically necessary for the fix, but which helped with debugging and
testing, and make rdist user messages clearer. I can break them out as
separate patches if necessary.

Changes:

- Fix list handling in cmdspecial commands
- Modify user message to indicate whether special or cmdspecial commands
  apply to "any" file or a "list" of files
- Show cmdspecial commands that would be executed when the "verify"
  option is set (rdist shows all other actions, including special
  commands, when "verify" is enabled)

Test Script
---

I've included a shell script that demonstrates the problem. It's
attached as a diff for ease of use. I'm not suggesting it be committed.

The script creates a Distfile with special and cmdspecial commands, both
with a list and with no list specified, to highlight the difference in
handling by rdist.

The script runs the base-installed rdist twice, the first time with two
changed files, the second with just one. In both cases it executes a
simple sh script to log the env variable executed with the command
(special or cmdspecial).

If the script finds an executable rdist in /usr/src/usr.bin/rdist/, it
then runs that version with the same two test cases. The script then
displays a diff of the two logs showing that the patched rdist executes
according to the documentation, otherwise it shows the output of the
first execution to show the defective handling of cmdspecial.
--Aaron

cvs diff: Diffing .
Index: client.c
===
RCS file: /cvs/src/usr.bin/rdist/client.c,v
retrieving revision 1.37
diff -u -p -r1.37 client.c
--- client.c	28 Jun 2019 13:35:03 -	1.37
+++ client.c	27 Nov 2020 14:45:16 -
@@ -66,7 +66,7 @@ struct namelist *updfilelist = NULL; /*

 static void runspecial(char *, opt_t, char *, int);
 static void addcmdspecialfile(char *, char *, int);
-static void freecmdspecialfiles(void);
+static void freecmdspecialfiles(struct namelist **);
 static struct linkbuf *linkinfo(struct stat *);
 static int sendhardlink(opt_t, struct linkbuf *, char *, int);
 static int sendfile(char *, opt_t, struct stat *, char *, char *, int);
@@ -172,7 +172,8 @@ runspecial(char *starget, opt_t opts, ch
 			continue;
 		if (sc->sc_args != NULL && !inlist(sc->sc_args, starget))
 			continue;
-		message(MT_CHANGE, "special \"%s\"", sc->sc_name);
+		message(MT_CHANGE, "special <%s> \"%s\"",
+		    sc->sc_args == NULL ? "any" : "list", sc->sc_name);
 		if (IS_ON(opts, DO_VERIFY))
 			continue;
 		(void) sendcmd(C_SPECIAL,
@@ -201,11 +202,19 @@ addcmdspecialfile(char *starget, char *r

 	rfile = remfilename(source, Tdest, target, rname, destdir);

-	for (sc = subcmds; sc != NULL && !isokay; sc = sc->sc_next) {
+	for (sc = subcmds; sc != NULL; sc = sc->sc_next) {
 		if (sc->sc_type != CMDSPECIAL)
 			continue;
-		if (sc->sc_args != NULL && !inlist(sc->sc_args, starget))
+		if (sc->sc_args == NULL) {
+			isokay = TRUE;
 			continue;
+		} else if (!inlist(sc->sc_args, starget))
+			continue;
+		new = xmalloc(sizeof *new);
+		new->n_name = xstrdup(rfile);
+		new->n_regex = NULL;
+		new->n_next = sc->sc_updfilelist;
+		sc->sc_updfilelist = new;
 		isokay = TRUE;
 	}

@@ -222,11 +231,11 @@ addcmdspecialfile(char *starget, char *r
  * Free the file list
  */
 static void
-freecmdspecialfiles(void)
+freecmdspecialfiles(struct namelist **list)
 {
 	struct namelist *ptr, *save;

-	for (ptr = updfilelist; ptr; ) {
+	for (ptr = *list; ptr; ) {
 		if (ptr->n_name) (void) free(ptr->n_name);
 		save = ptr->n_next;
 		(void) free(ptr);
@@ -235,7 +244,7 @@ freecmdspecialfiles(void)
 		else
 			ptr = NULL;
 	}
-	updfilelist = NULL;
+	*list = NULL;
 }

 /*
@@ -251,11 +260,15 @@ runcmdspecial(struct cmd *cmd, opt_t opt
 	for (sc = cmd->c_cmds; sc != NULL; sc = sc->sc_next) {
 		if (sc->sc_type != CMDSPECIAL)
Re: apu4 fatal protection fault in supervisor mode [Was: apu4 kernel panic]
It happened again; anything I should do when "syncing disks..." is done?

fatal protection fault in supervisor mode
trap type 4 code 0 rip 8198bf66 cs 8 rflags 10246 cr2 f2556d1000 cpl 0 0
gsbase 0x80002241aff0 kgsbase 0x0
panic: trap type 4, code=0, pc=8198bf66
Starting stack trace...
panic(81de8b6b) at panic+0x11d
kerntrap(800022969060) at kerntrap+0x114
alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
fill_file(80f25c00,fd8100fbfb58,fd81033e36e8,3,0,80002270806
sysctl_file(800022969688,4,f564eca9000,8000229696b8,80002265f188) a2
kern_sysctl(800022969684,5,f564eca9000,8000229696b8,0,0) at kern_sysctl1
sys_sysctl(80002265f188,800022969720,800022969780) at sys_sysctl+0x4
syscall(8000229697f0) at syscall+0x389
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7c6e30, count: 248
End of stack trace.
syncing disks...

Marcus

mcmer-open...@tor.at (Marcus MERIGHI), 2020.11.26 (Thu) 16:51 (CET):
> >Synopsis:	kernel panic on apu4
> >Category:	kernel amd64
> >Environment:
> 	System      : OpenBSD 6.8
> 	Details     : OpenBSD 6.8 (GENERIC.MP) #1: Tue Nov 3 09:06:04 MST 2020
>
> 	r...@syspatch-68-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
>
> 	Architecture: OpenBSD.amd64
> 	Machine     : amd64
> >Description:
> 	kernel panic on apu4
> >How-To-Repeat:
> 	It happened for the first time with this hardware.
> 	Put some load (~3000 Interrupts) on apu4 with 20 VLANs and CARPs and
> 	a 715 rules pf.conf and ipsec/npppd VPN.
> >Fix:
> 	None known.
>
> what i gathered from the ddb> prompt:
>
> fatal protection fault in supervisor mode
> trap type 4 code 0 rip 81347f76 cs 8 rflags 10246 cr2 a0b46b96000 cpl
> 0 rsp 80
> gsbase 0x800022411ff0 kgsbase 0x0
> panic: trap type 4, code=0, pc=81347f76
> Starting stack trace...
> panic(81de557b) at panic+0x11d
> kerntrap(8000229c1630) at kerntrap+0x114
> alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
> fill_file(80ca8000,fd811c6a23d0,fd81012516f8,3,0,800022735658)
> at fil6
> sysctl_file(8000229c1c58,4,b4769963000,8000229c1c88,8000227d8c98)
> at sysctl_f2
> kern_sysctl(8000229c1c54,5,b4769963000,8000229c1c88,0,0) at
> kern_sysctl+0x1d1
> sys_sysctl(8000227d8c98,8000229c1cf0,8000229c1d50) at
> sys_sysctl+0x184
> syscall(8000229c1dc0) at syscall+0x389
> Xsyscall() at Xsyscall+0x128
> end of kernel
> end trace frame: 0x7f7e8170, count: 248
> End of stack trace.
> syncing disks...WARNING: SPL NOT LOWERED ON SYSCALL 3 3 EXIT 0 9
> Stopped at      savectx+0xb1:   movl    $0,%gs:0x530
>     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>  447997  99550   1000         0x2          0    2  fstat
>  156309  93905      0        0x12          0    3K sh
> *495661   2360     62    0x100010          0    1  spamlogd
>  492394  78811      0     0x14000    0x42000       softclock
> savectx() at savectx+0xb1
> end of kernel
> end trace frame: 0x7f7caf00, count: 14
> ddb{1}> trace
> savectx() at savectx+0xb1
> end of kernel
> end trace frame: 0x7f7caf00, count: -1
> ddb{1}> mach ddbcpu 0
> Stopped at      x86_ipi_db+0x12:        leave
> x86_ipi_db(820e2ff0) at x86_ipi_db+0x12
> x86_ipi_handler() at x86_ipi_handler+0x80
> Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> __mp_lock(8218c620) at __mp_lock+0x72
> intr_handler(8000225c3d00,8008ee80) at intr_handler+0x44
> Xintr_ioapic_level0_untramp() at Xintr_ioapic_level0_untramp+0x1a3
> in_cksum(fd80cd5bdb00,24) at in_cksum+0x44
> carp_send_ad(80be7c00) at carp_send_ad+0x27c
> carp_timer_ad(80be7c00) at carp_timer_ad+0x20
> softclock_thread(8000f3c0) at softclock_thread+0x16b
> end trace frame: 0x0, count: 5
> ddb{0}> trace
> x86_ipi_db(820e2ff0) at x86_ipi_db+0x12
> x86_ipi_handler() at x86_ipi_handler+0x80
> Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> __mp_lock(8218c620) at __mp_lock+0x72
> intr_handler(8000225c3d00,8008ee80) at intr_handler+0x44
> Xintr_ioapic_level0_untramp() at Xintr_ioapic_level0_untramp+0x1a3
> in_cksum(fd80cd5bdb00,24) at in_cksum+0x44
> carp_send_ad(80be7c00) at carp_send_ad+0x27c
> carp_timer_ad(80be7c00) at carp_timer_ad+0x20
> softclock_thread(8000f3c0) at softclock_thread+0x16b
> end trace frame: 0x0, count: -10
> ddb{0}> mach ddbcpu 2
> Stopped at      x86_ipi_db+0x12:        leave
> x86_ipi_db(800022411ff0) at x86_ipi_db+0x12
> x86_ipi_handler() at x86_ipi_handler+0x80
> Xresume_lapic_ipi() at Xresume_lapic_ipi+0x23
> __mp_acquire_count(8218c620,1) at __mp_acquire_count+0x92
> tsleep(fd812e111468,11,81e94678,0) at tsleep+0x10e
> getblk(fd812e880680,12fce0,4000,0,) at getblk+0xe5
> bread(fd812e880680,12fce0,4000,8000229c11e0)
Re: kernel panic when removing interface
On 27/11/20(Fri) 15:47, Denis Fondras wrote:
> > It is, I guess a fix should go in net/rtsock.c to prevent adding "-link"
> > entry on routing table different from ifp->if_rdomain.
> >
>
> I came up with this, which is more radical.

Which is not exactly what we want. This will prevent adding any route
on a routing table different from rdomain.

What needs to be enforced is the check from a request coming from
userland trying to insert a "-link" route. Such check should have the
benefit of documenting that L2 entries should be only inserted in the
rdomain table of an interface.

> Index: route.c
> ===
> RCS file: /cvs/src/sys/net/route.c,v
> retrieving revision 1.397
> diff -u -p -r1.397 route.c
> --- route.c	29 Oct 2020 21:15:27 -	1.397
> +++ route.c	27 Nov 2020 09:39:53 -
> @@ -865,6 +865,8 @@ rtrequest(int req, struct rt_addrinfo *i
>  			return (EINVAL);
>  		ifa = info->rti_ifa;
>  		ifp = ifa->ifa_ifp;
> +		if (tableid != ifp->if_rdomain)
> +			return (EINVAL);
>  		if (prio == 0)
>  			prio = ifp->if_priority + RTP_STATIC;
>
>
Re: kernel panic when removing interface
> It is, I guess a fix should go in net/rtsock.c to prevent adding "-link"
> entry on routing table different from ifp->if_rdomain.
>

I came up with this, which is more radical.

Index: route.c
===
RCS file: /cvs/src/sys/net/route.c,v
retrieving revision 1.397
diff -u -p -r1.397 route.c
--- route.c	29 Oct 2020 21:15:27 -	1.397
+++ route.c	27 Nov 2020 09:39:53 -
@@ -865,6 +865,8 @@ rtrequest(int req, struct rt_addrinfo *i
 			return (EINVAL);
 		ifa = info->rti_ifa;
 		ifp = ifa->ifa_ifp;
+		if (tableid != ifp->if_rdomain)
+			return (EINVAL);
 		if (prio == 0)
 			prio = ifp->if_priority + RTP_STATIC;
Re: double fault trap, 6.8, kbind->...->pmap_tlb_shootpage
On 2020/11/27 13:19, Stuart Henderson wrote:
> On 2020/11/26 20:17, Stuart Henderson wrote:
> > I setup a console server today - after leaving it for a few hours I came
> > back to a double fault trap. 6.8+syspatches, amd64, APU2. Simple PF
> > config, em(4), wg(4). Running ssh/sshd/conserver/lldpd plus default base
> > daemons.
>
> Traces from another crash. I had another one in db_read_bytes as well
> but forgot to trace.

Notes on the conserver config: it has some ipmi sol consoles (UDP), some
regular network consoles on TCP ports, and some network consoles via ssh
client - no serial consoles on the machine itself.

The hardware has previously been used in a different role (ipsec
concentrator) with no problems so doesn't seem likely to be a hw issue.

I don't have a specific trigger but it doesn't seem to stay up for more
than an hour or so.
Re: kernel panic when removing interface
On 26/11/20(Thu) 20:38, Pierre Emeriaud wrote:
> Hello Martin
>
> On Thu, Nov 26, 2020 at 14:27, Martin Pieuchot wrote:
> > >
> > > $ doas route -T1 add 192.0.2.2/32 -link -iface vlan12
> >
> > I wonder if the problem isn't in the validation of these parameters.
> >
> > Should we accept a L2 (-link) entry on a routing table which isn't the
> > routing domain? If so why does the entry persist in the ARP cache?
>
> Which arp entry are you referring to? The one from the route I added?

Yes. In the kernel ARP entries are represented as route entries. So
when you add a "-link" route it is an ARP entry.

> > Can you reproduce the problem if you don't specify T1?
>
> No. The routes are correctly removed when the interface is destroyed.
> It only crashes when the routes are added to another (non-empty if
> that matters) rdomain, but again, this was a silly mistake on my side.

Still, silly mistakes should be prevented and not crash the kernel ;)

> I reported it as it might be of interest to fix this for the sake of
> it, but it causes almost no harm.

It is, I guess a fix should go in net/rtsock.c to prevent adding "-link"
entry on routing table different from ifp->if_rdomain.

> PS: I've managed to crash my first router just by waiting a few
> seconds - no need to remove the route - same thing as the second
> router:
> ddb> show panic
> kernel diagnostic assertion "ifp != NULL" failed: file
> "/usr/src/sys/netinet/if_ether.c", line 718
>
> ddb> trace
> db_enter() at db_enter+0x10
> panic(81dc761f) at panic+0x12a
> __assert(81e321c2,81db9f2b,2ce,81d9e429) at __assert+0x2b
> arp_rtrequest(fd800baa10a8,fd800baa10a8,fd801aa63dc0) at arp_rtrequest
> arptimer(8216a090) at arptimer+0x67
> softclock_thread(8000ea40) at softclock_thread+0x13f
> end trace frame: 0x0, count: -6
Re: double fault trap, 6.8, kbind->...->pmap_tlb_shootpage
On 2020/11/26 20:17, Stuart Henderson wrote:
> I setup a console server today - after leaving it for a few hours I came
> back to a double fault trap. 6.8+syspatches, amd64, APU2. Simple PF
> config, em(4), wg(4). Running ssh/sshd/conserver/lldpd plus default base
> daemons.

Traces from another crash. I had another one in db_read_bytes as well
but forgot to trace.

login: uvm_fault(0x821214b8, 0x0008a240, 0, 4) -> e
kernel: page fault trap, code=0
Stopped at      0x0008a240:uvm_fault(0x821214b8, 0x0008a240, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at      db_read_bytes+0x70:     movzbl  0(%rdi,%rcx,1),%eax
ddb{0}> tr
db_read_bytes(0008a240,1,80001fe21338) at db_read_bytes+0x70
db_get_value(0008a240,1,0) at db_get_value+0x3f
db_disasm(0008a240,0) at db_disasm+0x85
db_trap(6,0) at db_trap+0xa5
db_ktrap(6,0,80001fe21590) at db_ktrap+0x112
kerntrap(80001fe21590) at kerntrap+0xa4
alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
0008a240(a,a,91a8cae800152f3a,0,10,80001fe21670) at 0x0008a240
x86_fast_ipi(80001fa78ff0,f1) at x86_fast_ipi+0x42
pmap_tlb_shootpage(821b4c08,80001fe54000,1) at pmap_tlb_shootpage+0x136
pmap_do_remove(821b4c08,80001fe54000,80001fe55000,0) at pmap_do_remove+0x524
uvm_unmap_remove(821214b8,80001fe54000,80001fe55000,80001fe218e0,0,1) at uvm_unmap_remove+0x22b
sys_kbind(80001fe5b8f0,80001fe21960,80001fe219c0) at sys_kbind+0x382
syscall(80001fe21a30) at syscall+0x389
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x7f7ebf38, count: -15
ddb{0}> sh reg
rdi     0x0008a240
rsi     0x1
rbp     0x80001fe21320
rbx     0x0008a240
rdx     0x80001fe21338
rcx     0
rax     0x2
r8      0
r9      0x1
r10     0x240c40d54ae302a4
r11     0xf0a9831425dab75e
r12     0x1
r13     0x2
r14     0x1
r15     0
rip     0x812f17a0      db_read_bytes+0x70
cs      0x8
rflags  0x10246 __ALIGN_SIZE+0xf246
rsp     0x80001fe21300
ss      0x10
db_read_bytes+0x70:     movzbl  0(%rdi,%rcx,1),%eax
ddb{0}> ps /o
    TID    PID    UID  PRFLAGS  PFLAGS  CPU  COMMAND
* 45975  16635    736        0       0    0K conserver
   5773   2362    736        0       0    2  conserver
ddb{0}> ps
   PID     TID   PPID    UID  S     FLAGS  WAIT     COMMAND
*16635   45975   2362    736  7         0           conserver
  8429  426980   2362    736  3      0x82  select   ssh
 39053  373370  71219    736  3      0x82  netio    ssh
 19513   64298  98727    736  3      0x82  select   ssh
 47789  322641  71219    736  3      0x82  netio    ssh
 21271  468310  71219    736  3      0x82  netio    ssh
 30971  306907   2362    736  3      0x82  netio    ssh
 79098  114844   2362    736  3      0x82  select   ssh
 20267  292736   2362    736  3      0x82  netio    ssh
 35029  283216   2362    736  3      0x82  netio    ssh
 85342  247671   2362    736  3      0x82  netio    ssh
  5026   19178   2362    736  3      0x82  select   ssh
 77040  304406   2362    736  3      0x82  select   ssh
 76252  106736  71219    736  3      0x82  netcon2  ssh
 36012  282631  71219    736  3      0x82  netcon2  ssh
 33864  439485  71219    736  3      0x82  netcon2  ssh
 59242  499297  71219    736  3      0x82  netcon2  ssh
 98607  317809  71219    736  3      0x82  netcon2  ssh
 34332  113957  71219    736  3      0x82  netcon2  ssh
 73243   60205  71219    736  3      0x82  netcon2  ssh
 55945  371324  38307   1000  3  0x100083  kqread   tail
 98727  337142  43620    736  3      0x80  select   conserver
 32756   59128   2362    736  3  0x100082  select   ssh
 85586  501496   2362    736  3  0x100082  select   ssh
  2362    5773  43620    736  7         0           conserver
  2362  513578  43620    736  3     0x480  select   conserver
  2362  396399  43620    736  3     0x480  poll     conserver
 38307  239821  83880   1000  3  0x10008b  pause    ksh
 43768  500734  71219    736  3  0x100082  select   ssh
 34176  206371  71219    736  3  0x100082  select   ssh
 46020  243860  71219    736  3  0x100082  select   ssh
 18338  340252  71219    736  3  0x100082  select   ssh
 14861  256347  71219    736  3  0x100082  select   ssh
 18404  496474  71219    736  3  0x100082  select   ssh
 71219  262961
Re: panic: ehci_alloc_std: curlen == 0 on 6.8-beta
I think something as simple as below would be okay. If requested I can
put in DPRINTFN()s based on current printf()s, like I proposed in
earlier diff in this thread. However more important part is, that I
think DIAGNOSTIC ifdef should be removed as rest of the code, which
relies on `if (curlen > len) curlen = len;` is not enclosed with
`#ifdef DIAGNOSTIC`

Index: dev/usb/ehci.c
===
RCS file: /cvs/src/sys/dev/usb/ehci.c,v
retrieving revision 1.212
diff -u -p -u -r1.212 ehci.c
--- dev/usb/ehci.c	23 Oct 2020 20:25:35 -	1.212
+++ dev/usb/ehci.c	27 Nov 2020 10:16:23 -
@@ -2393,16 +2406,10 @@ ehci_alloc_sqtd_chain(struct ehci_softc
 			/* must use multiple TDs, fill as much as possible. */
 			curlen = EHCI_QTD_NBUFFERS * EHCI_PAGE_SIZE -
 			    EHCI_PAGE_OFFSET(dataphys);
-#ifdef DIAGNOSTIC
-			if (curlen > len) {
-				printf("ehci_alloc_sqtd_chain: curlen=%u "
-				    "len=%u offs=0x%x\n", curlen, len,
-				    EHCI_PAGE_OFFSET(dataphys));
-				printf("lastpage=0x%x page=0x%x phys=0x%x\n",
-				    dataphyslastpage, dataphyspage, dataphys);
+
+			if (curlen > len)
 				curlen = len;
-			}
-#endif
+
 			/* the length must be a multiple of the max size */
 			curlen -= curlen % mps;
 			DPRINTFN(1,("ehci_alloc_sqtd_chain: multiple QTDs, "


On Sun, Nov 22, 2020 at 01:36:10AM +, Mikolaj Kucharski wrote:
> Hi,
>
> Would below diff be okay, or just simple:
>
> 	if (curlen > len)
> 		curlen = len;
>
> be more appropriate here?
>
> On Wed, Nov 11, 2020 at 09:02:49AM +, Mikolaj Kucharski wrote:
> > On Sat, Oct 24, 2020 at 09:08:45AM +0200, Marcus Glocker wrote:
> > > Now you have one M less in your tree checkout :-)
> > > Thanks for tracking this down.
> >
> > There is one more change, which I would consider. It was visible after I
> > switched back to official snapshot kernel. Now that kernel is not
> > panicking, when the specific code path from this email thread is executed
> > it prints:
> >
> > ehci_alloc_sqtd_chain: curlen=20480 len=0 offs=0x0
> > lastpage=0xcfe66000 page=0xcfe67000 phys=0xcfe67000
> >
> > and I think this is not needed by default any more, so I have this diff:
> >
> > Index: dev/usb/ehci.c
> > ===
> > RCS file: /cvs/src/sys/dev/usb/ehci.c,v
> > retrieving revision 1.212
> > diff -u -p -u -r1.212 ehci.c
> > --- dev/usb/ehci.c	23 Oct 2020 20:25:35 -	1.212
> > +++ dev/usb/ehci.c	11 Nov 2020 08:55:01 -
> > @@ -2395,11 +2408,11 @@ ehci_alloc_sqtd_chain(struct ehci_softc
> >  			    EHCI_PAGE_OFFSET(dataphys);
> >  #ifdef DIAGNOSTIC
> >  			if (curlen > len) {
> > -				printf("ehci_alloc_sqtd_chain: curlen=%u "
> > +				DPRINTFN(1,("ehci_alloc_sqtd_chain: curlen=%u "
> >  				    "len=%u offs=0x%x\n", curlen, len,
> > -				    EHCI_PAGE_OFFSET(dataphys));
> > -				printf("lastpage=0x%x page=0x%x phys=0x%x\n",
> > -				    dataphyslastpage, dataphyspage, dataphys);
> > +				    EHCI_PAGE_OFFSET(dataphys)));
> > +				DPRINTFN(1,("lastpage=0x%x page=0x%x phys=0x%x\n",
> > +				    dataphyslastpage, dataphyspage, dataphys));
> >  				curlen = len;
> >  			}
> >  #endif
> >
> > to mute those messages. I'm also wondering could above be just as simple
> > as:
> >
> > 	if (curlen > len) {
> > 		curlen = len;
> >
> > and to drop completely above printf()s / DPRINTFN()s as for me they
> > didn't bring a lot of troubleshooting value. Dunno. Anyway one way or
> > another muting those I think would be good.
> >
> >
> > > > On Fri, Oct 23, 2020 at 06:50:53PM +0200, Marcus Glocker wrote:
> > > >
> > > > Honestly, I haven't spent much time to investigate how the curlen = 0 is
> > > > getting generated exactly, because for me it will be very difficult to
> > > > understand that without the hardware on my side re-producing the same.
> > > >
> > > > But I had a look when the code was introduced to handle curlen == 0 later
> > > > in the function:
> > > >
> > > > 	if (iscontrol) {
> > > > 		/*
> > > > 		 * adjust the toggle based on the number of packets
> > > > 		 * in this qtd
> > > > 		 */
> > > > 		if ((((curlen + mps - 1) / mps) & 1) || curlen == 0)
> > > > 			qtdstatus ^= EHCI_QTD_TOGGLE_MASK;
> > > > 	}
> > > >
> > > > This was