Re: OpenBSD amd64 6.9 repeatable kernel panic starting X

2021-09-15 Thread M Smith



On 16/09/21 2:29 am, Martin Pieuchot wrote:

On 13/09/21(Mon) 08:25, M Smith wrote:

On 8/09/21 3:37 am, Martin Pieuchot wrote:

Hello,

Thanks for your bug report.

On 07/09/21(Tue) 15:18, M Smith wrote:

Synopsis:   OpenBSD amd64 6.9 repeatable kernel panic starting X
Category:   kernel
Environment:


System  : OpenBSD 6.9
Details : OpenBSD 6.9 (GENERIC.MP) #4: Tue Aug 10 08:12:23 MDT 2021

r...@syspatch-69-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP

Architecture: OpenBSD.amd64
Machine : amd64


Description:


I have been investigating a largely repeatable OpenBSD 6.9 
amd64 panic.  Essentially the OS drops into the kernel debugger about 90% of 
the time when starting X on specific hardware, and is doing so with what seems 
like a memory related issue - possibly errant modification by concurrent 
threads.


Indeed.  You're certainly hitting a VM/pmap bug.


The event is reproducible across two independent machines (both new).  
Each machine has identical underlying hardware.  A memory checker run overnight 
on one machine did not identify any underlying memory issues.


That points to something in your setup which exposes the bug.


The hardware: Avalue EMS-TGL-S85-A1-1R, CPU an 11th Gen Intel(R) 
Core(TM) i7-1185G7E @ 2.80GHz with 2x 16GB memory boards (32GB in total).

Regarding the possible errant memory modification mentioned above: the assertion underlying this
panic (https://www.sirranet.co.nz/openbsd_542456/69_panic.html) suggests that kernel
execution has failed to obtain a necessary exclusive lock.  Various other panics differ
in that many feature assertions based on "pool_do_get ... offset ???" with the
offset identifying the trigger condition, hinting at a memory inconsistency.
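
For context, an "offset" in this kind of assertion generally points at the first byte of a freed item that no longer holds the fill pattern written to it when it was freed. The following is a minimal sketch of that style of check, not the kernel's pool code; DEMO_POISON, demo_poison_fill() and demo_poison_check() are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fill pattern; the value is an assumption for this sketch. */
#define DEMO_POISON 0xdf

/* Overwrite a freed item with the fill pattern. */
static void
demo_poison_fill(void *item, size_t len)
{
	memset(item, DEMO_POISON, len);
}

/*
 * Verify that a free item still holds the fill pattern.
 * Returns 0 if intact.  Returns 1 and stores the offset of the first
 * modified byte in *offset if something wrote to the item after it
 * was freed.
 */
static int
demo_poison_check(const void *item, size_t len, size_t *offset)
{
	const unsigned char *b = item;
	size_t i;

	for (i = 0; i < len; i++) {
		if (b[i] != DEMO_POISON) {
			*offset = i;
			return 1;
		}
	}
	return 0;
}
```

An allocator using this scheme would call demo_poison_fill() when an item is freed and demo_poison_check() before handing it out again; a non-zero result means the memory changed while it was supposed to be unused.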

Testing on 7.0-current (https://www.sirranet.co.nz/openbsd_542456/70_panic.html)
sometimes results in a panic on boot before invoking startx; other times the boot fails
to complete cleanly at the kernel linking step with the error "reordering libraries
ld in calloc(): chunk info corrupted" and similar errors.  Whether these two events
are related to the 6.9 panic is unclear.

I see others have posted what looks like the same issue.  I have posted 
the above detail however as the assert identifying the lack of kernel lock 
looks as though it may be of some value.
https://marc.info/?t=16176931482=1=2
https://marc.info/?t=16239060261=1=2


All those reports have in common an 11th Gen Intel CPU.


Any ideas would be greatly appreciated.


You could start by booting bsd.sp to rule out any HW problem.


Sorry for the delay in replying.

Both 6.9 and 7.0 crash when booting bsd.sp
https://www.sirranet.co.nz/openbsd_542456/69_reply.html
https://www.sirranet.co.nz/openbsd_542456/70_reply.html


That rules out any concurrency issue.


Does the corruption happen with a vanilla install or does running
a particular program make it easier to trigger?


These are both basic installs. After a fresh install I have run fw_update,
and on the 6.9 machine syspatch was run. Other than that we have enabled
xenodm. No other software or packages are installed or running. The machines
don't always crash on first boot, but after a handful of reboots they do.


I can easily test/re-test on both 6.9 and 7.0-current.


Does it also happen if you disable drm at boot?



On both 6.9 and 7.0  if I disable drm the machine panics on reboot. (Images
in the links above.)


Please make sure you also disable inteldrm(4).  That's why you're
getting a panic on 6.9.  This is to see if the issue is related to
the graphic driver.




With drm and inteldrm installed 6.9 still crashes often. I booted it 10
times. A few times it booted; the rest of the time it crashed. (Again,
this is with a basic install with xenodm enabled.)

https://www.sirranet.co.nz/openbsd_542456/69_drm_inteldrm.html

Thanks
Megan



Re: OpenBSD amd64 6.9 repeatable kernel panic starting X

2021-09-15 Thread Martin Pieuchot
On 13/09/21(Mon) 08:25, M Smith wrote:
> On 8/09/21 3:37 am, Martin Pieuchot wrote:
> > Hello,
> > 
> > Thanks for your bug report.
> > 
> > On 07/09/21(Tue) 15:18, M Smith wrote:
> > > > Synopsis:   OpenBSD amd64 6.9 repeatable kernel panic starting X
> > > > Category:   kernel
> > > > Environment:
> > > 
> > >   System  : OpenBSD 6.9
> > >   Details : OpenBSD 6.9 (GENERIC.MP) #4: Tue Aug 10 08:12:23 MDT 2021
> > >   
> > > r...@syspatch-69-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> > > 
> > >   Architecture: OpenBSD.amd64
> > >   Machine : amd64
> > > 
> > > > Description:
> > > 
> > >   I have been investigating a largely repeatable OpenBSD 6.9 
> > > amd64 panic.  Essentially the OS drops into the kernel debugger about 90% 
> > > of the time when starting X on specific hardware, and is doing so with 
> > > what seems like a memory related issue - possibly errant modification by 
> > > concurrent threads.
> > 
> > Indeed.  You're certainly hitting a VM/pmap bug.
> > 
> > >   The event is reproducible across two independent machines (both new).  
> > > Each machine has identical underlying hardware.  A memory checker run 
> > > overnight on one machine did not identify any underlying memory issues.
> > 
> > That points to something in your setup which exposes the bug.
> > 
> > >   The hardware: Avalue EMS-TGL-S85-A1-1R, CPU an 11th Gen Intel(R) 
> > > Core(TM) i7-1185G7E @ 2.80GHz with 2x 16GB memory boards (32GB in total).
> > > 
> > > >   Regarding the possible errant memory modification mentioned above: the
> > > > assertion underlying this panic
> > > > (https://www.sirranet.co.nz/openbsd_542456/69_panic.html) suggests that
> > > > kernel execution has failed to obtain a necessary exclusive lock.
> > > > Various other panics differ in that many feature assertions based on
> > > > "pool_do_get ... offset ???" with the offset identifying the trigger
> > > > condition, hinting at a memory inconsistency.
> > > 
> > > >   Testing on 7.0-current
> > > > (https://www.sirranet.co.nz/openbsd_542456/70_panic.html) sometimes
> > > > results in a panic on boot before invoking startx; other times the boot
> > > > fails to complete cleanly at the kernel linking step with the error
> > > > "reordering libraries ld in calloc(): chunk info corrupted" and similar
> > > > errors.  Whether these two events are related to the 6.9 panic is
> > > > unclear.
> > > 
> > >   I see others have posted what looks like the same issue.  I have posted 
> > > the above detail however as the assert identifying the lack of kernel 
> > > lock looks as though it may be of some value.
> > >   https://marc.info/?t=16176931482=1=2
> > >   https://marc.info/?t=16239060261=1=2
> > 
> > > All those reports have in common an 11th Gen Intel CPU.
> > 
> > >   Any ideas would be greatly appreciated.
> > 
> > You could start by booting bsd.sp to rule out any HW problem.
> 
> Sorry for the delay in replying.
> 
> Both 6.9 and 7.0 crash when booting bsd.sp
> https://www.sirranet.co.nz/openbsd_542456/69_reply.html
> https://www.sirranet.co.nz/openbsd_542456/70_reply.html

That rules out any concurrency issue.

> > Does the corruption happen with a vanilla install or does running
> > a particular program make it easier to trigger?
> 
> These are both basic installs. After a fresh install I have run fw_update,
> and on the 6.9 machine syspatch was run. Other than that we have enabled
> xenodm. No other software or packages are installed or running. The machines
> don't always crash on first boot, but after a handful of reboots they do.
> 
> > > >   I can easily test/re-test on both 6.9 and 7.0-current.
> > 
> > Does it also happen if you disable drm at boot?
> > 
> 
> On both 6.9 and 7.0  if I disable drm the machine panics on reboot. (Images
> in the links above.)

Please make sure you also disable inteldrm(4).  That's why you're
getting a panic on 6.9.  This is to see if the issue is related to
the graphic driver.



Re: __mp_lock_spin: 0xffffffff822d1120 lock spun out

2021-09-15 Thread Martin Pieuchot
On 15/09/21(Wed) 12:06, Paul de Weerd wrote:
> Hi all,
> 
> After some off-list advice from Patrick to enable MP_LOCKDEBUG in
> order to debug the hangs I reported [1], I did exactly that and was
> running a self-built kernel for some time.  This morning, I wanted to
> upgrade to the latest snapshot so I also cvs up'd and rebuilt my
> kernel with MP_LOCKDEBUG.  However, now I get __mp_lock_spin during
> boot:
> 
> root on sd2a (a0b80508b6693ba1.a) swap on sd2b dump on sd2b
> inteldrm0: 1920x1080, 32bpp
> wsdisplay0 at inteldrm0 mux 1
> __mp_lock_spin: 0xffffffff822d1120 lock spun out
> Stopped at  db_enter+0x10:  popq    %rbp
> ddb{1}> trace
> db_enter() at db_enter+0x10
> __mp_lock(822d1120) at __mp_lock+0xa2
> __mp_acquire_count(822d1120,1) at __mp_acquire_count+0x38
> mi_switch() at mi_switch+0x299
> sleep_finish(8000226d4f80,1) at sleep_finish+0x11c
> msleep(8011d980,8011d998,20,81e828e3,0) at msleep+0xcc
> taskq_next_work(8011d980,8000226d5040) at taskq_next_work+0x61
> taskq_thread(8011d980) at taskq_thread+0x6c
> end trace frame: 0x0, count: -8

That means another CPU is holding the KERNEL_LOCK() for too long.  When
this happens it is more important to look at what other CPUs are doing
because one of them is holding the KERNEL_LOCK().  If you can reproduce
this, please include the output of "ps /o" and the trace from all the
CPUs.

Note that the default value of MP_LOCKDEBUG might be too sensitive for
some workloads.  Using WITNESS might not spot the same issue, but it does
not produce false positives.
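
To illustrate what a "lock spun out" report means, here is a minimal userland sketch of a spinlock that gives up after a bounded number of failed attempts. It uses a plain test-and-set flag and is not the kernel's __mp_lock implementation; struct debug_lock, debug_lock_acquire(), debug_lock_release() and the budget parameter are hypothetical names for this example only.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical lock type for illustration; not a kernel structure. */
struct debug_lock {
	atomic_flag held;
};

/*
 * Spin until the lock is acquired or the spin budget is exhausted.
 * Returns true when the lock is held by the caller.  Returns false after
 * reporting a spin-out, which is where a debug kernel would stop and
 * enter the debugger instead of returning.
 */
static bool
debug_lock_acquire(struct debug_lock *l, long budget)
{
	while (atomic_flag_test_and_set_explicit(&l->held,
	    memory_order_acquire)) {
		if (--budget <= 0) {
			fprintf(stderr, "%p lock spun out\n", (void *)l);
			return false;
		}
	}
	return true;
}

/* Release the lock so that another spinner can take it. */
static void
debug_lock_release(struct debug_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The spinner that reports the problem is only the victim: the interesting state belongs to whoever currently holds the lock, which is why the traces from the other CPUs matter. A budget that is too small for a legitimately long critical section yields a false positive.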

Thanks,
Martin



Re: bus error on octeon

2021-09-15 Thread Visa Hankala
On Tue, Sep 14, 2021 at 07:31:33PM +0200, Peter J. Philipp wrote:
> On Tue, Sep 14, 2021 at 10:48:44AM -0600, Theo de Raadt wrote:
> > Mark Kettenis  wrote:
> > 
> > > To be honest, I do think that adding __packed is a reasonable way to
> > > handle protocol structs like this where performance doesn't really
> > > matter.  This translates into __attribute__((packed)) and both GCC and
> > > LLVM started treating that in a way to signal that the data might not
> > > be properly aligned and use byte access on architectures that need
> > > strict alignment.  This is still not explicitly documented but I don't
> > > think the compiler writers can backtrack on that at this point.
> > 
> > __packed does not mean "use bytes".  It means ensure there are no
> > padding bytes, and then on strict alignment systems when something is
> > misaligned, the compiler can see it must use smaller loads.
> > 
> > Unfortunately, tcpdump also parses encapsulated protocols, and some of
> > the outer layers are limited to short-alignment.  So if an inner layer
> > has a 32-bit value inside the __packed array after packing, the compiler
> > will believe it is still on a 32-bit boundary, and not use char or short
> > accesses.  This will fault, because tcpdump is passing a pointer to an
> > object with tighter alignment.
> > 
> > __packed is not saying "each object must be loaded as bytes", and I
> > really am surprised you claim that is what happens today.  That is a
> > crazy expensive choice on some architectures.
> > 
> > So I think __packed is not enough for what tcpdump is doing.
> > 
> > Perhaps some of the compilers are being more cynical, or their costing
> > of loads leads them to do smaller-storage operations... but I have a
> > recollection that at least gcc on the alpha gets confused and does the
> > wrong thing.
> 
> To appease the situation, I have 1. taken what Visa said about the nonce,
> 2. taken my old patch and changed the memcpy's to EXTRACT_64BITS() which
> I had to homeroll off EXTRACT_32BITS(), and made a patch and tested it.
> 
> No bus errors again, though I don't know if it's the right approach.  The
> nonces in the tcpdump were sequential counting up from 1 as my wireguard 
> hardware was rebooting.  I think I control-c'ed by 218 nonce or so.
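
The alignment hazard discussed in the quoted text can be illustrated with a minimal sketch. Packing removes padding, so a wide field can sit at an address that is not a multiple of its size; copying the bytes out with memcpy() avoids depending on what the compiler assumes about alignment. struct demo_hdr and both helper functions are hypothetical and do not reflect the actual tcpdump structures; the packed attribute syntax assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire-format header, packed so that no padding is inserted. */
struct demo_hdr {
	uint8_t  type;
	uint32_t receiver;
	uint64_t nonce;		/* lands at offset 5 once padding is removed */
} __attribute__((packed));

/*
 * Load a 64-bit value in host byte order from an address that carries no
 * alignment guarantee.  memcpy() leaves the choice of access width to the
 * compiler, so this is safe on strict-alignment machines.
 */
static uint64_t
load_u64_unaligned(const void *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Return nonzero if p is suitably aligned for a direct uint64_t load. */
static int
is_aligned_u64(const void *p)
{
	return ((uintptr_t)p % _Alignof(uint64_t)) == 0;
}
```

With this layout, sizeof(struct demo_hdr) is 13 and the nonce starts at offset 5, so even a header placed at an aligned address holds a misaligned 64-bit field.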

EXTRACT_64BITS() followed by letoh64() is not correct because the former
converts byte order from big endian to host. It is better to add
EXTRACT_LE_64BITS().

The patch assumes that the data already are 4-byte aligned (bpf(4) and
libpcap ensure proper alignment for the network header; also, print-ip.c
and print-ip6.c do last-resort fixing when needed).

Index: extract.h
===================================================================
RCS file: src/usr.sbin/tcpdump/extract.h,v
retrieving revision 1.9
diff -u -p -r1.9 extract.h
--- extract.h   7 Oct 2007 16:41:05 -0000   1.9
+++ extract.h   15 Sep 2021 12:40:56 -0000
@@ -51,3 +51,12 @@
    (u_int32_t)*((const u_int8_t *)(p) + 2) << 16 | \
    (u_int32_t)*((const u_int8_t *)(p) + 1) << 8 | \
    (u_int32_t)*((const u_int8_t *)(p) + 0))
+#define EXTRACT_LE_64BITS(p) \
+   ((u_int64_t)*((const u_int8_t *)(p) + 7) << 56 | \
+   (u_int64_t)*((const u_int8_t *)(p) + 6) << 48 | \
+   (u_int64_t)*((const u_int8_t *)(p) + 5) << 40 | \
+   (u_int64_t)*((const u_int8_t *)(p) + 4) << 32 | \
+   (u_int64_t)*((const u_int8_t *)(p) + 3) << 24 | \
+   (u_int64_t)*((const u_int8_t *)(p) + 2) << 16 | \
+   (u_int64_t)*((const u_int8_t *)(p) + 1) << 8 | \
+   (u_int64_t)*((const u_int8_t *)(p) + 0))
Index: print-wg.c
===================================================================
RCS file: src/usr.sbin/tcpdump/print-wg.c,v
retrieving revision 1.6
diff -u -p -r1.6 print-wg.c
--- print-wg.c  14 Apr 2021 19:34:56 -0000  1.6
+++ print-wg.c  15 Sep 2021 12:40:56 -0000
@@ -142,8 +142,9 @@ wg_print(const u_char *bp, u_int length)
            printf("[wg] keepalive ");
        if (caplen < offsetof(struct wg_data, mac))
            goto trunc;
+       /* data->nonce may be unaligned. */
        printf("to 0x%08x nonce %llu",
-           letoh32(data->receiver), letoh64(data->nonce));
+           letoh32(data->receiver), EXTRACT_LE_64BITS(&data->nonce));
        break;
    }
    return;
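
The byte-order point made before the patch can be checked in isolation. The sketch below uses two standalone helpers, extract_le64() and extract_be64(), which are hypothetical function equivalents written for this example and are not part of extract.h. Both read one byte at a time, so they work at any address and give the same result on any host.

```c
#include <stdint.h>

/*
 * Assemble a 64-bit value from eight bytes stored in little-endian order.
 * Only single-byte loads are used, so p needs no particular alignment.
 */
static uint64_t
extract_le64(const void *p)
{
	const uint8_t *b = p;
	uint64_t v = 0;
	int i;

	/* Start with the most significant byte, which is stored last. */
	for (i = 7; i >= 0; i--)
		v = (v << 8) | b[i];
	return v;
}

/* Same, but treating the eight bytes as big-endian (network order). */
static uint64_t
extract_be64(const void *p)
{
	const uint8_t *b = p;
	uint64_t v = 0;
	int i;

	/* Start with the most significant byte, which is stored first. */
	for (i = 0; i < 8; i++)
		v = (v << 8) | b[i];
	return v;
}
```

For the bytes da 00 00 00 00 00 00 00, extract_le64() returns 218 and extract_be64() returns 0xda00000000000000. A big-endian extraction followed by a host-dependent swap only undoes the difference on big-endian hosts, where the swap actually reverses the bytes; on little-endian hosts the swap is a no-op and the value stays reversed.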



__mp_lock_spin: 0xffffffff822d1120 lock spun out

2021-09-15 Thread Paul de Weerd
Hi all,

After some off-list advice from Patrick to enable MP_LOCKDEBUG in
order to debug the hangs I reported [1], I did exactly that and was
running a self-built kernel for some time.  This morning, I wanted to
upgrade to the latest snapshot so I also cvs up'd and rebuilt my
kernel with MP_LOCKDEBUG.  However, now I get __mp_lock_spin during
boot:

root on sd2a (a0b80508b6693ba1.a) swap on sd2b dump on sd2b
inteldrm0: 1920x1080, 32bpp
wsdisplay0 at inteldrm0 mux 1
__mp_lock_spin: 0xffffffff822d1120 lock spun out
Stopped at  db_enter+0x10:  popq    %rbp
ddb{1}> trace
db_enter() at db_enter+0x10
__mp_lock(822d1120) at __mp_lock+0xa2
__mp_acquire_count(822d1120,1) at __mp_acquire_count+0x38
mi_switch() at mi_switch+0x299
sleep_finish(8000226d4f80,1) at sleep_finish+0x11c
msleep(8011d980,8011d998,20,81e828e3,0) at msleep+0xcc
taskq_next_work(8011d980,8000226d5040) at taskq_next_work+0x61
taskq_thread(8011d980) at taskq_thread+0x6c
end trace frame: 0x0, count: -8

Everything was fine with (had MP_LOCKDEBUG enabled):

OpenBSD 7.0 (GENERIC.MP) #17: Tue Sep 14 20:07:14 CEST 2021
we...@pom.alm.weirdnet.nl:/usr/src/sys/arch/amd64/compile/GENERIC.MP

And a kernel that's mere hours newer crashes:

OpenBSD 7.0 (GENERIC.MP) #18: Wed Sep 15 09:26:20 CEST 2021
we...@pom.alm.weirdnet.nl:/usr/src/sys/arch/amd64/compile/GENERIC.MP

I enabled verbose booting to see if anything else was happening, but
no:

inteldrm0: 1920x1080, 32bpp
>>> probing for drm*
>>> drm probe returned 0
>>> probing for intagp*
>>> intagp probe returned 0
>>> probing for wsdisplay*
>>> wsdisplay probe returned 1
>>> probing for wsdisplay0
>>> wsdisplay probe returned 10
>>> wsdisplay probe won
wsdisplay0 at inteldrm0 mux 1
__mp_lock_spin: 0xffffffff822d1120 lock spun out
Stopped at  db_enter+0x10:  popq    %rbp

Looking at the recent commits to CVS I see no obvious reason why this
was crashing.  Any suggestions for a commit to revert to see if it is
the cause?

Cheers,

Paul

[1]: https://marc.info/?l=openbsd-bugs&m=163155522416560&w=2

-- 
>[<++>-]<+++.>+++[<-->-]<.>+++[<+
+++>-]<.>++[<>-]<+.--.[-]
 http://www.weirdnet.nl/