Reproducible crash on repeatedly running OpenGL-accelerated games

2022-04-30 Thread Thomas Frohwein


>Synopsis:  Reproducible crash on repeatedly running mono+SDL2 games
>Category:  kernel amd64
>Environment:
System  : OpenBSD 7.1
Details : OpenBSD 7.1-current (GENERIC.MP) #487: Sat Apr 30 09:14:44 MDT 2022
 dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP

Architecture: OpenBSD.amd64
Machine : amd64
>Description:
I can predictably cause a system freeze when repeatedly running
any of the fnaify/FNA games or Godot. It doesn't happen on the
first run but usually running the same or other games repeatedly
shortly thereafter triggers the computer to freeze while
launching the application. This has _not_ happened on prolonged
runtime of a GL-accelerated application.

Most of the time, the system just becomes unresponsive, either
with the X11 window frozen or a black screen. A few times, I
suddenly ended up at the console, seeing the following:

panic: b_to_q: tty has no clist

Sadly, this is the only output that I have been able to grab
when this happens. I have checked /var/log/messages and
/var/log/Xorg.0.log{,.old} without anything showing up there. I
am unable to ssh into the computer (no response to ssh).

Of note, the freeze happens regardless of whether I launch the
application from an xterm or an x11/kitty terminal emulator.

>How-To-Repeat:
Launch pretty much any fnaify or Godot game repeatedly, not
necessarily the same application each time.

Of note, this does not happen when I repeatedly launch any of
xonotic, megaglest, lugaru, 0ad, or openra from ports.

It also doesn't happen on the Thinkpad X395 with AMD Ryzen
CPU/GPU.
>Fix:
Unknown.

dmesg:
OpenBSD 7.1-current (GENERIC.MP) #487: Sat Apr 30 09:14:44 MDT 2022
dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
real mem = 33493360640 (31941MB)
avail mem = 32460935168 (30957MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 3.2 @ 0x3e39c000 (77 entries)
bios0: vendor Dell Inc. version "1.7.0" date 12/10/2021
bios0: Dell Inc. Precision 7560
acpi0 at bios0: ACPI 6.1
acpi0: sleep states S0 S4 S5
acpi0: tables DSDT FACP SSDT SSDT SSDT HPET APIC MCFG SSDT NHLT SSDT SSDT SSDT 
SSDT LPIT SSDT SSDT DBGP DBG2 BOOT SSDT TPM2 DMAR SSDT SSDT SSDT PTDT BGRT FPDT
acpi0: wakeup devices GLAN(S4) XHCI(S0) XDCI(S4) HDAS(S4) RP01(S4) PXSX(S4) 
RP02(S4) PXSX(S4) RP03(S4) PXSX(S4) RP04(S4) PXSX(S4) RP05(S4) PXSX(S4) 
RP06(S4) PXSX(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpihpet0 at acpi0: 1920 Hz
acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Xeon(R) W-11955M CPU @ 2.60GHz, 2594.02 MHz, 06-8d-01
cpu0: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,AVX512IFMA,CLFLUSHOPT,CLWB,PT,AVX512CD,SHA,AVX512BW,AVX512VL,AVX512VBMI,UMIP,PKU,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu0: 256KB 64b/line disabled L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 38MHz
cpu0: mwait min=64, max=64, C-substates=0.2.0.1.2.1.1.1, IBE
cpu1 at mainbus0: apid 2 (application processor)
cpu1: Intel(R) Xeon(R) W-11955M CPU @ 2.60GHz, 2594.04 MHz, 06-8d-01
cpu1: 
FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,3DNOWP,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,AVX512F,AVX512DQ,RDSEED,ADX,SMAP,AVX512IFMA,CLFLUSHOPT,CLWB,PT,AVX512CD,SHA,AVX512BW,AVX512VL,AVX512VBMI,UMIP,PKU,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,XSAVEC,XGETBV1,XSAVES
cpu1: 256KB 64b/line disabled L2 cache
cpu1: smt 0, core 1, package 0
cpu2 at mainbus0: apid 4 (application processor)
cpu2: Intel(R) Xeon(R) W-11955M CPU @ 2.60GHz, 2594.02 MHz, 06-8d-01
cpu2: 

Re: bse: null dereference in genet_rxintr()

2022-04-30 Thread Mark Kettenis
> Date: Tue, 19 Apr 2022 07:32:36 +0200
> From: Anton Lindqvist 
> 
> On Thu, Mar 24, 2022 at 07:41:44AM +0100, Anton Lindqvist wrote:
> > >Synopsis:  bse: null dereference in genet_rxintr()
> > >Category:  arm64
> > >Environment:
> > System  : OpenBSD 7.1
> > Details : OpenBSD 7.1-beta (GENERIC.MP) #1594: Mon Mar 21 06:55:12 MDT 2022
> > 
> > dera...@arm64.openbsd.org:/usr/src/sys/arch/arm64/compile/GENERIC.MP
> > 
> > Architecture: OpenBSD.arm64
> > Machine : arm64
> > >Description:
> > 
> > Booting my rpi4 often but not always causes a panic while rc(8) tries to
> > start the bse network interface:
> > 
> > panic: attempt to access user address 0x38 from EL1
> > Stopped at  panic+0x160:cmp w21, #0x0
> > TID  PID  UID  PRFLAGS  PFLAGS  CPU  COMMAND
> > * 0  0  0 0x1  0x2000K swapper
> > db_enter() at panic+0x15c
> > panic() at do_el1h_sync+0x1f8
> > do_el1h_sync() at handle_el1h_sync+0x6c
> > handle_el1h_sync() at genet_rxintr+0x120
> > genet_rxintr() at genet_intr+0x74
> > genet_intr() at ampintc_irq_handler+0x14c
> > ampintc_irq_handler() at arm_cpu_irq+0x30
> > arm_cpu_irq() at handle_el1h_irq+0x6c
> > handle_el1h_irq() at ampintc_splx+0x80
> > ampintc_splx() at genet_ioctl+0x158
> > genet_ioctl() at ifioctl+0x308
> > ifioctl() at nfs_boot_init+0xc0
> > nfs_boot_init() at nfs_mountroot+0x3c
> > nfs_mountroot() at main+0x464
> > main() at virtdone+0x70
> > 
> > >Fix:
> > 
> > The mbuf associated with the current index is NULL. I noticed that the
> > NetBSD driver allocates mbufs for each ring entry in genet_setup_dma().
> > But even with that in place the same panic still occurs. Enabling
> > GENET_DEBUG shows that the total is quite high:
> > 
> > RX pidx=ca07 total=51463
> >
> > 
> > Since it's greater than GENET_DMA_DESC_COUNT (=256) the null dereference
> > will still happen after doing more than 256 iterations in genet_rxintr()
> > since we will start accessing mbufs cleared by the previous iteration.
> > 
> > Here's a diff with what I've tried so far. The KASSERT() is just capturing
> > the problem at an earlier stage. Any pointers would be much appreciated.
> 
> Further digging reveals that writes to GENET_RX_DMA_PROD_INDEX are
> ignored by the hardware. That's why I ended up with a large amount of
> mbufs available in genet_rxintr() since the software and hardware state
> was out of sync. Honoring any existing value makes the problem go away
> and matches what u-boot[1] does as well.

Writing to GENET_RX_DMA_PROD_INDEX works for me.  The U-Boot code says
that writing 0 doesn't work.  But even that works for me.  So I'm
puzzled.

> The current RX cidx/pidx defaults in genet_fill_rx_ring() were probably
> carefully selected as they ensure that the rx ring is filled with at
> least the configured low watermark number of mbufs. However, instead of
> being forced to ensure a pidx - cidx delta above 0 on the first
> invocations of genet_fill_rx_ring(), RX_DESC_COUNT could simply be
> passed as the max argument to if_rxr_get() which will clamp the value
> anyway.

Well, what the code does is set the "prod" index ahead of the
"cons" index to simulate a full ring.  And then when we (partially)
fill the ring we increase "cons" to make descriptors available to the
hardware.  This seems to work on my hardware and I've never seen the
crash you're seeing.
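
The index arithmetic discussed in this thread can be sketched in
isolation. This is a minimal illustration, not the driver's code: the
names rx_pending() and rx_pending_sane() are hypothetical, the 16-bit
wrap is an assumption based on the four-hex-digit pidx in the debug
output, and the software-side index of 0x0100 used in the note below
is inferred from the quoted numbers rather than stated in the thread.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_COUNT 256u    /* GENET_DMA_DESC_COUNT (=256), as quoted above */
#define INDEX_MASK 0xffffu /* assumption: indices are 16-bit wrapping counters */

/*
 * Number of packets the hardware reports as produced since the index
 * the software last saw. The subtraction is done modulo 2^16 so that
 * it stays correct when the hardware counter wraps around.
 */
static uint32_t
rx_pending(uint32_t hw_pidx, uint32_t sw_pidx)
{
	uint32_t pending = (hw_pidx - sw_pidx) & INDEX_MASK;

	assert(pending <= INDEX_MASK);
	return pending;
}

/*
 * A ring with DESC_COUNT slots can never hold more than DESC_COUNT
 * unprocessed packets. A larger value means the software and hardware
 * indices are out of sync, and walking that many entries would revisit
 * slots whose mbuf pointers were already cleared.
 */
static int
rx_pending_sane(uint32_t hw_pidx, uint32_t sw_pidx)
{
	return rx_pending(hw_pidx, sw_pidx) <= DESC_COUNT;
}
```

Under those assumptions, rx_pending(0xca07, 0x0100) evaluates to 51463,
the total quoted in the report, and rx_pending_sane() rejects it because
it exceeds the 256 descriptors in the ring.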



Re: ipsp_ids_gc panic after 7.1 upgrade

2022-04-30 Thread Vitaliy Makkoveev
Committed, thanks!

> On 30 Apr 2022, at 03:26, Alexander Bluhm  wrote:
> 
> On Thu, Apr 28, 2022 at 12:52:41AM +0300, Vitaliy Makkoveev wrote:
>> On Thu, Apr 28, 2022 at 12:15:25AM +0300, Vitaliy Makkoveev wrote:
>>>> On 27 Apr 2022, at 23:24, Kasak wrote:
>> [ skip ]
>>>> I'm afraid your patch did not help, it crashed again after three hours
>>> 
>>> Did it panic within ipsp_ids_gc() again?
>>> 
>> 
>> I missed that ipsp_ids_lookup() bumps `id_refcount' on dead `ids'. I fixed
>> my previous diff.
> 
> OK bluhm@
> 
>> Index: sys/netinet/ip_ipsp.c
>> ===================================================================
>> RCS file: /cvs/src/sys/netinet/ip_ipsp.c,v
>> retrieving revision 1.269
>> diff -u -p -r1.269 ip_ipsp.c
>> --- sys/netinet/ip_ipsp.c10 Mar 2022 15:21:08 -  1.269
>> +++ sys/netinet/ip_ipsp.c27 Apr 2022 21:40:58 -
>> @@ -1205,7 +1205,7 @@ ipsp_ids_insert(struct ipsec_ids *ids)
>>  found = RBT_INSERT(ipsec_ids_tree, &ipsec_ids_tree, ids);
>>  if (found) {
>>  /* if refcount was zero, then timeout is running */
>> -if (atomic_inc_int_nv(&found->id_refcount) == 1) {
>> +if ((++found->id_refcount) == 1) {
>>  LIST_REMOVE(found, id_gc_list);
>> 
>>  if (LIST_EMPTY(&ipsp_ids_gc_list))
>> @@ -1248,7 +1248,12 @@ ipsp_ids_lookup(u_int32_t ipsecflowinfo)
>> 
>>  mtx_enter(&ipsec_flows_mtx);
>>  ids = RBT_FIND(ipsec_ids_flows, &ipsec_ids_flows, &key);
>> -atomic_inc_int(&ids->id_refcount);
>> +if (ids != NULL) {
>> +if (ids->id_refcount != 0)
>> +ids->id_refcount++;
>> +else
>> +ids = NULL;
>> +}
>>  mtx_leave(&ipsec_flows_mtx);
>> 
>>  return ids;
>> @@ -1290,6 +1295,8 @@ ipsp_ids_free(struct ipsec_ids *ids)
>>  if (ids == NULL)
>>  return;
>> 
>> +mtx_enter(&ipsec_flows_mtx);
>> +
>>  /*
>>   * If the refcount becomes zero, then a timeout is started. This
>>   * timeout must be cancelled if refcount is increased from zero.
>> @@ -1297,10 +1304,10 @@ ipsp_ids_free(struct ipsec_ids *ids)
>>  DPRINTF("ids %p count %d", ids, ids->id_refcount);
>>  KASSERT(ids->id_refcount > 0);
>> 
>> -if (atomic_dec_int_nv(&ids->id_refcount) > 0)
>> +if ((--ids->id_refcount) > 0) {
>> +mtx_leave(&ipsec_flows_mtx);
>>  return;
>> -
>> -mtx_enter(&ipsec_flows_mtx);
>> +}
>> 
>>  /*
>>   * Add second for the case ipsp_ids_gc() is already running and
>> Index: sys/netinet/ip_ipsp.h
>> ===================================================================
>> RCS file: /cvs/src/sys/netinet/ip_ipsp.h,v
>> retrieving revision 1.238
>> diff -u -p -r1.238 ip_ipsp.h
>> --- sys/netinet/ip_ipsp.h21 Apr 2022 15:22:50 -  1.238
>> +++ sys/netinet/ip_ipsp.h27 Apr 2022 21:40:59 -
>> @@ -241,7 +241,7 @@ struct ipsec_ids {
>>  struct ipsec_id *id_local;  /* [I] */
>>  struct ipsec_id *id_remote; /* [I] */
>>  u_int32_t   id_flow;/* [I] */
>> -u_int   id_refcount;/* [a] */
>> +u_int   id_refcount;/* [F] */
>>  u_int   id_gc_ttl;  /* [F] */
>> };
>> RBT_HEAD(ipsec_ids_flows, ipsec_ids);
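
The locking pattern this diff moves to can be sketched in userland. This
is a simplified illustration under stated assumptions, not the kernel
code: it uses a POSIX mutex in place of the kernel mutex, and the names
struct obj, table_mtx, obj_lookup() and obj_release() are hypothetical.
The point it shows is that both the "is it still alive?" check and the
increment happen under the same lock, so a lookup cannot take a
reference on an object whose count has already dropped to zero.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct obj {
	unsigned int refcount;	/* protected by table_mtx */
};

static pthread_mutex_t table_mtx = PTHREAD_MUTEX_INITIALIZER;

/*
 * Take a reference only if the object is still alive. An object with
 * refcount == 0 is waiting to be garbage collected and must not be
 * handed out again by a lookup.
 */
static struct obj *
obj_lookup(struct obj *candidate)
{
	pthread_mutex_lock(&table_mtx);
	if (candidate != NULL) {
		if (candidate->refcount != 0)
			candidate->refcount++;
		else
			candidate = NULL;
	}
	pthread_mutex_unlock(&table_mtx);

	return candidate;
}

/*
 * Drop a reference under the same lock. Returns 1 when the caller
 * dropped the last reference, 0 otherwise.
 */
static int
obj_release(struct obj *o)
{
	int last;

	pthread_mutex_lock(&table_mtx);
	assert(o->refcount > 0);
	last = (--o->refcount == 0);
	pthread_mutex_unlock(&table_mtx);

	return last;
}
```

With plain atomics and no lock, a lookup could increment the count from
zero while the collector is about to free the object; serializing the
check and the increment under one mutex closes that window at the cost
of taking the lock on every reference change.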



Re: kernel panic on openbsd 7.1

2022-04-30 Thread Vitaliy Makkoveev
This diff was committed to -current.

> On 30 Apr 2022, at 15:18, Jihyun Yu  wrote:
> 
> Thanks! I’ll try the patch :)
> 
> 
>> On Apr 30, 2022, at 8:34 PM, Stuart Henderson  wrote:
>> 
>> On 2022/04/30 20:01, Jihyun Yu wrote:
>>>> Synopsis: kernel panic, without user activities
>>>> Category: kernel panic
>>>> Environment:
>>> System  : OpenBSD 7.1
>>> Details : OpenBSD 7.1 (GENERIC.MP) #465: Mon Apr 11 18:03:57 MDT 2022
>>> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
>>> 
>>> Architecture: OpenBSD.amd64
>>> Machine : amd64
>>>> Description:
>>> 
>>> kernel panics with no apparent user activities - for example running a
>>> command or interacting with shells, ...
>>> Here's info from ddb
>>> 
>>> ```
>>> fatal protection fault in supervisor mode
>>> 
>>> trap type 4 code 0 rip 817879a4 cs 8 rflags 10282 cr2
>>> 8000226fdb28 cpl 9 rsp 800022633b60
>>> gsbase 0x8227cff0  kgsbase 0x0
>>> 
>>> panic: trap type 4, code=0, pc=817879a4
>>> 
>>> Starting stack trace...
>>> panic(81f16ea6) at panic+0x12c
>>> kerntrap(800022633ab0) at kerntrap+0x114
>>> alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
>>> ipsp_ids_gc(0) at ipsp_ids_gc+0xb4
>> 
>> This was reported here: https://marc.info/?t=16507217981=1=2
>> 
>> There's a kernel patch in 
>> https://marc.info/?l=openbsd-bugs&m=165109635421930&w=2
>> which should fix it, hopefully it will make it into -current and then maybe
>> syspatches later
>> 
> 



Re: kernel panic on openbsd 7.1

2022-04-30 Thread Jihyun Yu
Thanks! I’ll try the patch :)


> On Apr 30, 2022, at 8:34 PM, Stuart Henderson  wrote:
> 
> On 2022/04/30 20:01, Jihyun Yu wrote:
>>> Synopsis: kernel panic, without user activities
>>> Category: kernel panic
>>> Environment:
>> System  : OpenBSD 7.1
>> Details : OpenBSD 7.1 (GENERIC.MP) #465: Mon Apr 11 18:03:57 MDT 2022
>> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
>> 
>> Architecture: OpenBSD.amd64
>> Machine : amd64
>>> Description:
>> 
>> kernel panics with no apparent user activities - for example running a
>> command or interacting with shells, ...
>> Here's info from ddb
>> 
>> ```
>> fatal protection fault in supervisor mode
>> 
>> trap type 4 code 0 rip 817879a4 cs 8 rflags 10282 cr2
>> 8000226fdb28 cpl 9 rsp 800022633b60
>> gsbase 0x8227cff0  kgsbase 0x0
>> 
>> panic: trap type 4, code=0, pc=817879a4
>> 
>> Starting stack trace...
>> panic(81f16ea6) at panic+0x12c
>> kerntrap(800022633ab0) at kerntrap+0x114
>> alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
>> ipsp_ids_gc(0) at ipsp_ids_gc+0xb4
> 
> This was reported here: https://marc.info/?t=16507217981=1=2
> 
> There's a kernel patch in 
> https://marc.info/?l=openbsd-bugs&m=165109635421930&w=2
> which should fix it, hopefully it will make it into -current and then maybe
> syspatches later
> 



Re: kernel panic on openbsd 7.1

2022-04-30 Thread Stuart Henderson
On 2022/04/30 20:01, Jihyun Yu wrote:
> >Synopsis: kernel panic, without user activities
> >Category: kernel panic
> >Environment:
> System  : OpenBSD 7.1
> Details : OpenBSD 7.1 (GENERIC.MP) #465: Mon Apr 11 18:03:57 MDT 2022
> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> 
> Architecture: OpenBSD.amd64
> Machine : amd64
> >Description:
> 
> kernel panics with no apparent user activities - for example running a
> command or interacting with shells, ...
> Here's info from ddb
> 
> ```
> fatal protection fault in supervisor mode
> 
> trap type 4 code 0 rip 817879a4 cs 8 rflags 10282 cr2
> 8000226fdb28 cpl 9 rsp 800022633b60
> gsbase 0x8227cff0  kgsbase 0x0
> 
> panic: trap type 4, code=0, pc=817879a4
> 
> Starting stack trace...
> panic(81f16ea6) at panic+0x12c
> kerntrap(800022633ab0) at kerntrap+0x114
> alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
> ipsp_ids_gc(0) at ipsp_ids_gc+0xb4

This was reported here: https://marc.info/?t=16507217981=1=2

There's a kernel patch in 
https://marc.info/?l=openbsd-bugs&m=165109635421930&w=2
which should fix it, hopefully it will make it into -current and then maybe
syspatches later



Re: Odd IPv6 ND behaviour after upgrading to OpenBSD 7.1

2022-04-30 Thread Otto Moerbeek
On Fri, Apr 29, 2022 at 04:42:25PM +0100, Ian Chilton wrote:

> Hi,
> 
> Not sure what the etiquette for this list is, so apologies if this is not 
> appropriate as it's not a confirmed bug...
> 
> I have a whole bunch of subnets which are static routed to a HSRP address, 
> provided by a pair of Cisco routers, on a linknet VLAN. Actually, there are
> two VLANs, vlan209 and vlan409. In the case of v6, the HSRP IP is fe80::1, so
> I have routes to fe80::1%vlan209 and fe80::1%vlan409.
> 
> This has worked fine for many weeks. On Wednesday evening I upgraded to 7.1.
> 
> On Friday morning, I woke up to nearly 2,000 alerts, because some v6 had 
> started flapping during the night.
> 
> It turns out that fe80::1%vlan409 had randomly become unreachable.
> 
> Every few minutes, it would become reachable again for 8 echo replies, then
> go unreachable again.
> 
> This is strange, because we use this same HSRP config / fe80::1 addresses for 
> all of our VLANs and have done for years, without issue.
> 
> Throughout this, the other OpenBSD host (still on 7.0), can access that 
> address with no problem.
> 
> Oddly, this host can still access fe80::1%vlan209 no problem.
> 
> What seems to happen is, a stale ND entry appears and 8 pings succeed...
> the-gw1# ndp -a |grep vlan409 | grep fe80
> fe80::1%vlan409  00:05:73:a0:00:01 vlan409 23h57m56s S R
> ..
> 
> Then this happens:
> the-gw1# ndp -a |grep vlan409 | grep fe80
> ndp: ioctl(SIOCGNBRINFO_IN6): Invalid argument
> ndp: failed to get neighbor information
> ndp: ioctl(SIOCGNBRINFO_IN6): Invalid argument
> ndp: failed to get neighbor information
> ndp: ioctl(SIOCGNBRINFO_IN6): Invalid argument
> ndp: failed to get neighbor information
> ndp: ioctl(SIOCGNBRINFO_IN6): Invalid argument
> ndp: failed to get neighbor information
> fe80::1%vlan409  (incomplete)  vlan409 1sI  2
> Check again, and the entry has disappeared.
> 
> A few mins later, the process repeats - 8 pings suddenly succeed and it 
> disappears again.
> 
> As I say though, fe80::1%vlan209 continues to work fine, as does 
> fe80::1%vlan409 from the other host.
> 
> fe80::1%vlan209  00:05:73:a0:00:01 vlan209 10s   R R
> 
> Interestingly, I did see a neighbour entry for fe80::1 on vlan409 on the 
> Cisco which is the HSRP master which had a MAC address of the-gw1, which 
> implied that the-gw1 is somehow responding to ND requests for that IP
> but I am not able to find those replies in a tcpdump.
> 
> As a workaround, I've added another HSRP address, fe80::2 on the Ciscos and
> changed the static routes on this box to use that. After a few hours, that's 
> still reachable ok.
> 
> It might be total coincidence that this is after a 7.0 -> 7.1 upgrade, but 
> thought I'd report it and see if anyone else is seeing any similar issues.
> 
> Thanks,
> 
> Ian

I had some issues with neighbour discovery lately, which started to
appear when I installed a new CPE.

The issue was that the kernel generated outgoing icmp6 messages with a
hop limit, which then got dropped by pf before even reaching the lan.

The workaround was to do

pass proto icmp6 allow-opts

In the meantime, bluhm@ has been working on a proper solution. See
https://marc.info/?l=openbsd-tech&m=165056094900572

-Otto