Re: bnxt panic - HWRM_RING_ALLOC command returned RESOURCE_ALLOC_ERROR error.

2024-01-08 Thread Jonathan Matthew
On Wed, Jan 03, 2024 at 10:14:12AM +0100, Hrvoje Popovski wrote:
> On 3.1.2024. 7:51, Jonathan Matthew wrote:
> > On Wed, Jan 03, 2024 at 01:50:06AM +0100, Alexander Bluhm wrote:
> >> On Wed, Jan 03, 2024 at 12:26:26AM +0100, Hrvoje Popovski wrote:
> >>> While testing kettenis@ ipl diff from tech@ and doing iperf3 to bnxt
> >>> interface and ifconfig bnxt0 down/up at the same time I can trigger
> >>> panic. Panic can be triggered without kettenis@ diff...
> >> It is easy to reproduce.  ifconfig bnxt1 down/up a few times while
> >> receiving TCP traffic with iperf3.  Machine still has kettenis@ diff.
> >> My panic looks different.
> > It looks like I wasn't trying very hard when I wrote bnxt_down().
> > I think there's also a problem with bnxt_up() unwinding after failure
> > in various places, but that's a different issue.
> > 
> > This makes it more resilient for me, though it still logs
> > 'bnxt0: unexpected completion type 3' a lot if I take the interface
> > down while it's in use.  I'll look at that separately.
> 
> Hi,
> 
> with this diff I can still panic the box with ifconfig up/down, but not
> as fast as without it

Right, this is the other problem where bnxt_up() wasn't cleaning up properly
after failing part way through.  This diff should fix that, but I don't think
it will fix the 'HWRM_RING_ALLOC command returned RESOURCE_ALLOC_ERROR error'
problem, so the interface will still stop working at that point.


Index: if_bnxt.c
===
RCS file: /cvs/src/sys/dev/pci/if_bnxt.c,v
retrieving revision 1.39
diff -u -p -r1.39 if_bnxt.c
--- if_bnxt.c   10 Nov 2023 15:51:20 -  1.39
+++ if_bnxt.c   9 Jan 2024 01:59:38 -
@@ -1073,7 +1081,7 @@ bnxt_up(struct bnxt_softc *sc)
 	if (bnxt_hwrm_vnic_ctx_alloc(sc, &sc->sc_vnic.rss_id) != 0) {
 		printf("%s: failed to allocate vnic rss context\n",
 		    DEVNAME(sc));
-		goto down_queues;
+		goto down_all_queues;
 	}
 
 	sc->sc_vnic.id = (uint16_t)HWRM_NA_SIGNATURE;
@@ -1139,8 +1147,11 @@ dealloc_vnic:
 	bnxt_hwrm_vnic_free(sc, &sc->sc_vnic);
 dealloc_vnic_ctx:
 	bnxt_hwrm_vnic_ctx_free(sc, &sc->sc_vnic.rss_id);
+
+down_all_queues:
+	i = sc->sc_nqueues;
 down_queues:
-	for (i = 0; i < sc->sc_nqueues; i++)
+	while (i-- > 0)
 		bnxt_queue_down(sc, &sc->sc_queues[i]);
 
 	bnxt_dmamem_free(sc, sc->sc_rx_cfg);
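
The unwinding idea in this diff can be sketched in userspace.  This is not
driver code: struct queue, queue_up(), queue_down() and bring_up() are
hypothetical stand-ins, and it assumes the queue setup loop jumps to
down_queues with i equal to the number of queues already brought up.  The
point is that the teardown loop counts down from i, so a failure part way
through only touches queues that were actually set up, while a failure after
the loop first sets i to the full count.

```c
#include <assert.h>
#include <stddef.h>

#define NQUEUES 4

struct queue {
	int up;
};

/* Hypothetical stand-in for bringing one queue up; fails at index fail_at. */
static int
queue_up(struct queue *q, int idx, int fail_at)
{
	if (idx == fail_at)
		return -1;
	q->up = 1;
	return 0;
}

static void
queue_down(struct queue *q)
{
	/* Tearing down a queue that never came up is the bug being avoided. */
	assert(q->up);
	q->up = 0;
}

/*
 * Bring up all queues, then run a later step that may fail.
 * Returns 0 on success, -1 on failure with everything torn down again.
 */
int
bring_up(struct queue *qs, int fail_queue, int fail_late)
{
	int i;

	for (i = 0; i < NQUEUES; i++) {
		if (queue_up(&qs[i], i, fail_queue) != 0)
			goto down_queues;	/* i queues are up */
	}

	if (fail_late)
		goto down_all_queues;	/* every queue is up */

	return 0;

down_all_queues:
	i = NQUEUES;
down_queues:
	while (i-- > 0)
		queue_down(&qs[i]);
	return -1;
}
```

Calling bring_up(qs, 2, 0) fails on the third queue and tears down only the
first two; bring_up(qs, -1, 1) fails late and tears down all four.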



Re: bnxt panic - HWRM_RING_ALLOC command returned RESOURCE_ALLOC_ERROR error.

2024-01-03 Thread Jonathan Matthew
On Wed, Jan 03, 2024 at 01:04:05PM +0100, Alexander Bluhm wrote:
> On Wed, Jan 03, 2024 at 04:51:39PM +1000, Jonathan Matthew wrote:
> > On Wed, Jan 03, 2024 at 01:50:06AM +0100, Alexander Bluhm wrote:
> > > On Wed, Jan 03, 2024 at 12:26:26AM +0100, Hrvoje Popovski wrote:
> > > > While testing kettenis@ ipl diff from tech@ and doing iperf3 to bnxt
> > > > interface and ifconfig bnxt0 down/up at the same time I can trigger
> > > > panic. Panic can be triggered without kettenis@ diff...
> > > 
> > > It is easy to reproduce.  ifconfig bnxt1 down/up a few times while
> > > receiving TCP traffic with iperf3.  Machine still has kettenis@ diff.
> > > My panic looks different.
> > 
> > It looks like I wasn't trying very hard when I wrote bnxt_down().
> > I think there's also a problem with bnxt_up() unwinding after failure
> > in various places, but that's a different issue.
> > 
> > This makes it more resilient for me, though it still logs
> > 'bnxt0: unexpected completion type 3' a lot if I take the interface
> > down while it's in use.  I'll look at that separately.
> 
> Should we intr_barrier(sc->sc_queues[0].q_ihc) if sc->sc_intrmap == NULL ?

In that case, we only have one interrupt vector and sc->sc_ih is its
cookie.

> 
> All these barriers make sense to me.  OK bluhm@

Thanks.

> 
> > Index: if_bnxt.c
> > ===
> > RCS file: /cvs/src/sys/dev/pci/if_bnxt.c,v
> > retrieving revision 1.39
> > diff -u -p -r1.39 if_bnxt.c
> > --- if_bnxt.c   10 Nov 2023 15:51:20 -  1.39
> > +++ if_bnxt.c   3 Jan 2024 06:36:02 -
> > @@ -1158,12 +1159,16 @@ bnxt_down(struct bnxt_softc *sc)
> >  
> >  	CLR(ifp->if_flags, IFF_RUNNING);
> >  
> > +	intr_barrier(sc->sc_ih);
> > +
> >  	for (i = 0; i < sc->sc_nqueues; i++) {
> >  		ifq_clr_oactive(ifp->if_ifqs[i]);
> >  		ifq_barrier(ifp->if_ifqs[i]);
> > -		/* intr barrier? */
> >  
> > -		timeout_del(&sc->sc_queues[i].q_rx.rx_refill);
> > +		timeout_del_barrier(&sc->sc_queues[i].q_rx.rx_refill);
> > +
> > +		if (sc->sc_intrmap != NULL)
> > +			intr_barrier(sc->sc_queues[i].q_ihc);
> >  	}
> >  
> >  	bnxt_hwrm_free_filter(sc, &sc->sc_vnic);
> > 
> 



Re: bnxt panic - HWRM_RING_ALLOC command returned RESOURCE_ALLOC_ERROR error.

2024-01-02 Thread Jonathan Matthew
On Wed, Jan 03, 2024 at 01:50:06AM +0100, Alexander Bluhm wrote:
> On Wed, Jan 03, 2024 at 12:26:26AM +0100, Hrvoje Popovski wrote:
> > While testing kettenis@ ipl diff from tech@ and doing iperf3 to bnxt
> > interface and ifconfig bnxt0 down/up at the same time I can trigger
> > panic. Panic can be triggered without kettenis@ diff...
> 
> It is easy to reproduce.  ifconfig bnxt1 down/up a few times while
> receiving TCP traffic with iperf3.  Machine still has kettenis@ diff.
> My panic looks different.

It looks like I wasn't trying very hard when I wrote bnxt_down().
I think there's also a problem with bnxt_up() unwinding after failure
in various places, but that's a different issue.

This makes it more resilient for me, though it still logs
'bnxt0: unexpected completion type 3' a lot if I take the interface
down while it's in use.  I'll look at that separately.


Index: if_bnxt.c
===
RCS file: /cvs/src/sys/dev/pci/if_bnxt.c,v
retrieving revision 1.39
diff -u -p -r1.39 if_bnxt.c
--- if_bnxt.c   10 Nov 2023 15:51:20 -  1.39
+++ if_bnxt.c   3 Jan 2024 06:36:02 -
@@ -1158,12 +1159,16 @@ bnxt_down(struct bnxt_softc *sc)
 
 	CLR(ifp->if_flags, IFF_RUNNING);
 
+	intr_barrier(sc->sc_ih);
+
 	for (i = 0; i < sc->sc_nqueues; i++) {
 		ifq_clr_oactive(ifp->if_ifqs[i]);
 		ifq_barrier(ifp->if_ifqs[i]);
-		/* intr barrier? */
 
-		timeout_del(&sc->sc_queues[i].q_rx.rx_refill);
+		timeout_del_barrier(&sc->sc_queues[i].q_rx.rx_refill);
+
+		if (sc->sc_intrmap != NULL)
+			intr_barrier(sc->sc_queues[i].q_ihc);
 	}
 
 	bnxt_hwrm_free_filter(sc, &sc->sc_vnic);




Re: 7.3 regression: high network latency every 12 seconds on all interfaces

2023-05-01 Thread Jonathan Matthew
On Sat, Apr 29, 2023 at 07:32:27AM +0200, Harald Dunkel wrote:
> >Synopsis:7.3 regression: high network latency every 12 seconds on all 
> >interfaces
> >Category:network
> >Environment:
>   System  : OpenBSD 7.3
>   Details : OpenBSD 7.3 (GENERIC.MP) #1125: Sat Mar 25 10:36:29 MDT 
> 2023
>
> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> 
>   Architecture: OpenBSD.amd64
>   Machine : amd64
> >Description:
>   Since the upgrade to 7.3 of a HA gateway ("redgatea" and "redgateb", one
> external network, 2 internal networks, carp on all interfaces) I see a
> high network latency for incoming network traffic every 12 seconds.
> dmesg:
 ...
> inteldrm0 at pci0 dev 2 function 0 "Intel HD Graphics" rev 0x35
> drm0 at inteldrm0
> inteldrm0: msi, CHERRYVIEW, gen 8

This generation often has problems with hdmi detection polling causing
latency spikes for network traffic and everything else.
Can you plug in a monitor (or a headless hdmi plug), or disable inteldrm?



Re: stuck after attaching scsibus at softraid0

2023-03-18 Thread Jonathan Matthew
On Fri, Mar 17, 2023 at 10:26:55PM +0100, Paul de Weerd wrote:
> I've gotten a lot further now, and can report things are not
> completely not working .. they're just .. s  l  o  w.
> 
> I built a kernel with AHCI_DEBUG and some printf's in init_main.c 
> and a rather stupid diff for amd64's dkcsum.c (I tried #define DEBUG,
> but that didn't work for some reason, so I went with s/ifdef/ifndef/ -
> see below for both diffs).  After building the kernel and rebooting, I
> had to go AFK for a bit, so I left the system for a few hours.  When I
> got back - a login prompt!
> 
> Here's what I saw:
> 
> [weerd@pom] $ time ktrace disklabel sd1 > /dev/null
> 3m00.18s real 0m00.00s user 0m00.20s system
> 
> So disk access is SUPER SLOW for some reason.  But it does work - the
> disklabel it showed for another run (where I didn't redirect to
> /dev/null) was the correct label:

It sounds like interrupts from ahci1 aren't working.  When a command
timeout fires, the driver checks to see if the command has actually
finished, and processes it normally if it has, which is where this
message comes from:

 Mar 17 20:41:47 ahci1.1: final poll of port completed command in slot 10
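
As a rough illustration of that timeout path (a simplified model, not the
actual ahci(4) code; struct port, its fields and command_timeout() are
made-up names), the handler looks at the hardware state once more before
declaring a real timeout:

```c
#include <assert.h>
#include <stdint.h>

struct port {
	uint32_t hw_active;	/* bit set: hardware still working on slot */
	uint32_t sw_pending;	/* bit set: driver still waiting on slot */
};

enum timeout_result {
	CMD_COMPLETED_LATE,	/* finished, but no interrupt was seen */
	CMD_REALLY_TIMED_OUT	/* hardware has not finished either */
};

/* Called when the per-command timeout fires for the given slot. */
enum timeout_result
command_timeout(struct port *p, int slot)
{
	uint32_t bit = 1U << slot;

	if ((p->sw_pending & bit) && !(p->hw_active & bit)) {
		/* Final poll: the command is done, so complete it normally. */
		p->sw_pending &= ~bit;
		return CMD_COMPLETED_LATE;
	}
	return CMD_REALLY_TIMED_OUT;
}
```

If every command only ever completes through the CMD_COMPLETED_LATE branch,
each one waits out its full timeout first, which would match the very slow
but working disk access described above.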

What does vmstat -zi show?



Re: firefox vs jitsi: stack exhaustion?

2021-04-08 Thread Jonathan Matthew
On Thu, Apr 08, 2021 at 10:24:06AM +0200, Martin Pieuchot wrote:
> firefox often crashes when somebody else connects to the jitsi I'm in.
> The trace looks like a stack exhaustion, see below. 
> 
> Does this ring a bell?
> 
> #530 
> #531 pthread_setschedparam (thread=0x0, policy=1, param=0x2ade5ca1df0)
> at /usr/src/lib/librthread/rthread_sched.c:56
> #532 0x02ae65f9f016 in 
> rtc::PlatformThread::SetPriority(rtc::ThreadPriority) () from 
> /usr/local/lib/firefox/libxul.so.101.0
> #533 0x02ae65f9ed75 in rtc::PlatformThread::Run() ()
>from /usr/local/lib/firefox/libxul.so.101.0
> #534 0x02ae65f9ead9 in rtc::PlatformThread::StartThread(void*) ()
>from /usr/local/lib/firefox/libxul.so.101.0
> #535 0x02ade9d4df51 in _rthread_start (v=)
> at /usr/src/lib/librthread/rthread.c:96
> #536 0x02ad9c2ec3da in __tfork_thread ()
> at /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:84

This looks like, at least with our pthreads implementation, libwebrtc has a
race between the newly created thread and the thread creating it.

The creating thread stores the thread handle in thread_ here:
https://github.com/mozilla/gecko-dev/blob/master/third_party/libwebrtc/webrtc/rtc_base/platform_thread.cc#L186

and the new thread uses it here:
https://github.com/mozilla/gecko-dev/blob/master/third_party/libwebrtc/webrtc/rtc_base/platform_thread.cc#L363
which is called as almost the first thing the new thread does.

Our pthread_create() only stores the pthread handle into the supplied
address after __tfork_thread() returns, by which time the new thread
could already be running.
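
A minimal sketch of one way to avoid that kind of race in portable code.
This is not the libwebrtc fix; struct starter and start_and_check() are
hypothetical names.  The new thread waits on a condition variable until the
creating thread has published the handle, and only then reads it:

```c
#include <assert.h>
#include <pthread.h>

struct starter {
	pthread_mutex_t	lock;
	pthread_cond_t	published;
	int		ready;		/* set once handle is valid */
	pthread_t	handle;		/* written by the creating thread */
	int		saw_own_handle;	/* recorded by the new thread */
};

static void *
thread_main(void *arg)
{
	struct starter *s = arg;

	/* Do not touch s->handle until the creator says it is valid. */
	pthread_mutex_lock(&s->lock);
	while (!s->ready)
		pthread_cond_wait(&s->published, &s->lock);
	s->saw_own_handle = pthread_equal(s->handle, pthread_self()) != 0;
	pthread_mutex_unlock(&s->lock);
	return NULL;
}

/* Returns 1 if the new thread saw its own handle, 0 if not, -1 on error. */
int
start_and_check(void)
{
	struct starter s;
	pthread_t t;
	int ok;

	pthread_mutex_init(&s.lock, NULL);
	pthread_cond_init(&s.published, NULL);
	s.ready = 0;
	s.saw_own_handle = 0;

	if (pthread_create(&t, NULL, thread_main, &s) != 0)
		return -1;

	/* Publish the handle under the lock, then wake the new thread. */
	pthread_mutex_lock(&s.lock);
	s.handle = t;
	s.ready = 1;
	pthread_cond_signal(&s.published);
	pthread_mutex_unlock(&s.lock);

	pthread_join(t, NULL);
	ok = s.saw_own_handle;
	pthread_cond_destroy(&s.published);
	pthread_mutex_destroy(&s.lock);
	return ok;
}
```

A simpler alternative, where it fits, is for the new thread to call
pthread_self() instead of reading a handle stored by its creator.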



Re: panic: uao_fin_swhash_elt: can't allocate entry

2021-02-22 Thread Jonathan Matthew
On Mon, Feb 22, 2021 at 01:48:01PM +, Stuart Henderson wrote:
> Not much information on this but it's an unusual one so I thought I'd
> post in case it's of interest to anyone. (Re-typed from a screen photo,
> it's remote and used by non-technical people, this is all I have).
> 
> panic: uao_fin_swhash_elt: can't allocate entry

uao_find_swhash_elt():

	/* allocate a new entry for the bucket and init/insert it in */
	elt = pool_get(&uao_swhash_elt_pool, PR_NOWAIT | PR_ZERO);
	/*
	 * XXX We cannot sleep here as the hash table might disappear
	 * from under our feet.  And we run the risk of deadlocking
	 * the pagedeamon.  In fact this code will only be called by
	 * the pagedaemon and allocation will only fail if we
	 * exhausted the pagedeamon reserve.  In that case we're
	 * doomed anyway, so panic.
	 */
	if (elt == NULL)
		panic("%s: can't allocate entry", __func__);

so it sounds like the machine was so out of memory it couldn't swap.


> Stopped at db_enter+0x10: popq %rbp
> TID   PID UID PRFLAGS PFLAGS  CPU COMMAND
> 38724523522   10010x100   0   sh
> *428940   98261   0   0x14000 0x200   1K  pagedaemon
> db_enter+0x10
> panic+0x12a
> uao_set_swslot(fd80c1ecc980,150,1f4d1) at uao_set_swslot+0x1a1
> uvmpd_scan_inactive(82188790) at uvmpd_scan_inactive+0x537
> uvmpd_scan+0x9f
> uvm_pageout(800053d0) at uvm_pageout+0x375
> end trace frame 0x0, count: 9
> 
> Happened after about 6 days uptime, running GNOME and chromium.
> Nothing in syslog anywhere near the crash. Machine is an haswell nuc
> D34010WYK (looks like a system builder has had their paws on the bios ID
> strings).
> 
> OpenBSD 6.8 (GENERIC.MP) #4: Mon Jan 11 10:35:56 MST 2021
> 
> r...@syspatch-68-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> real mem = 4169539584 (3976MB)
> avail mem = 4028141568 (3841MB)
> random: good seed from bootblocks
> mpath0 at root
> scsibus0 at mpath0: 256 targets
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 2.8 @ 0xec240 (83 entries)
> bios0: vendor Intel Corp. version "WYLPT10H.86A.0054.2019.0902.1752" date 
> 09/02/2019
> bios0: NOVATECH LTD PC-BX12966
> acpi0 at bios0: ACPI 5.0
> acpi0: sleep states S0 S3 S4 S5
> acpi0: tables DSDT FACP APIC FPDT FIDT SSDT SSDT MCFG HPET SSDT SSDT DMAR 
> CSRT MSDM
> acpi0: wakeup devices PXSX(S4) PXSX(S4) PXSX(S4) PXSX(S4) PXSX(S4) PXSX(S4) 
> PXSX(S4) PXSX(S4) GLAN(S4) EHC1(S4) EHC2(S4) XHC_(S4) HDEF(S4) PEG0(S4) 
> PEGP(S4) PWRB(S3)
> acpitimer0 at acpi0: 3579545 Hz, 24 bits
> acpimadt0 at acpi0 addr 0xfee0: PC-AT compat
> cpu0 at mainbus0: apid 0 (boot processor)
> cpu0: Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz, 1696.39 MHz, 06-45-01
> cpu0: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,SRBDS_CTRL,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
> cpu0: 256KB 64b/line 8-way L2 cache
> cpu0: smt 0, core 0, package 0
> mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
> cpu0: apic clock running at 99MHz
> cpu0: mwait min=64, max=64, C-substates=0.2.1.2.4.1.1.1, IBE
> cpu1 at mainbus0: apid 2 (application processor)
> cpu1: Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz, 1696.08 MHz, 06-45-01
> cpu1: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,SRBDS_CTRL,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
> cpu1: 256KB 64b/line 8-way L2 cache
> cpu1: smt 0, core 1, package 0
> cpu2 at mainbus0: apid 1 (application processor)
> cpu2: Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz, 1696.08 MHz, 06-45-01
> cpu2: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,EST,TM2,SSSE3,SDBG,FMA3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,PAGE1GB,RDTSCP,LONG,LAHF,ABM,PERF,ITSC,FSGSBASE,TSC_ADJUST,BMI1,AVX2,SMEP,BMI2,ERMS,INVPCID,SRBDS_CTRL,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
> cpu2: 256KB 64b/line 8-way L2 cache
> cpu2: smt 1, core 0, package 0
> cpu3 at mainbus0: apid 3 (application processor)
> cpu3: Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz, 1696.08 MHz, 06-45-01
> cpu3: 
> 

Re: Fwd: Re: Protectli FW1 with Intel 82583V - Interfaces errors and latency spike issue

2021-01-07 Thread Jonathan Matthew
On Wed, Jan 06, 2021 at 12:53:45PM +0100, Mark Kettenis wrote:
> > Date: Wed, 6 Jan 2021 21:29:52 +1000
> > From: Jonathan Matthew 
> > 
> > On Wed, Jan 06, 2021 at 10:52:48AM +0100, Mark Kettenis wrote:
> > > > Date: Wed, 6 Jan 2021 20:29:09 +1100
> > > > From: Jonathan Gray 
> > > > 
> > > > On Tue, Jan 05, 2021 at 10:28:20PM -1000, st...@wdwd.me wrote:
> > > > > I tested with a Protectli FW1 router (dmesg below) forwarding packets
> > > > > between two test machines. The latency spikes occur when running 
> > > > > headless
> > > > > beginning with this commit:
> > > > 
> > > > As the interrupt is handled via msi it wouldn't be a shared interrupt
> > > > related problem.
> > > > 
> > > > Perhaps some drm kernel thread, but I can't think of anything that would
> > > > be doing work with no display connected.
> > > 
> > > Could be the kernel periodically polling whether a monitor is
> > > attached.  Some generations of the Intel graphics hardware have broken
> > > hardware hotplug detection.  And some rely on polling i2c code to
> > > detect a VGA monitor.
> > > 
> > > Don't know this hardware.  If it has a VGA port that's left
> > > unconnected it might help to actually connect it.  Maybe one of those
> > > dongles that fake a VGA monitor would do.
> > > 
> > > Disabling inteldrm(4) would also help.
> > 
> > On my home router, which is a similar kind of machine, various drm work 
> > queue
> > threads use a fair bit of cpu time.  I normally have inteldrm disabled just
> > for that - I hadn't noticed it causing latency problems, it just seemed 
> > wrong
> > that my router spent more cpu time on drm stuff than on forwarding packets.
> > 
> > inteldrm0 at pci0 dev 2 function 0 "Intel HD Graphics" rev 0x35
> > drm0 at inteldrm0
> > inteldrm0: msi, CHERRYVIEW, gen 8
> > inteldrm0: 1024x768, 32bpp
> > 
> > It has one displayport and two hdmi, no vga, so hopefully analog hotplug
> > isn't involved.
> 
> HDMI may be in the same boat.  And I think there are cases where the
> VBIOS still advertises an (unconnected) VGA port even if there is no
> physical output.
> 
> > $ vmstat -zi 
> > interrupt                       total     rate
> > irq0/clock                     241657      396
> > irq0/ipi                        17830       29
> > irq96/acpi0                         0        0
> > irq144/inteldrm0                  469        0
> > irq97/ahci0                     50442       82
> > irq98/xhci0                        25        0
> > irq176/azalia0                      1        0
> > irq99/ppb0                          0        0
> > irq114/re0                      12914       21
> > irq100/ppb1                         0        0
> > irq115/re1                      13261       21
> > irq101/ppb2                         0        0
> > irq116/athn0                        0        0
> > irq102/ichiic0                      0        0
> > irq145/com0                       118        0
> > irq146/pckbc0                       0        0
> > irq147/pckbc0                       0        0
> > Total                          336717      551
> > 
> > This is what the drm workqueue threads have done in 15 minutes uptime:
> > 
> > root 85080  0.0  0.0 0 0 ??  DK  9:06PM    0:00.00 (drmlwq)
> > root 65272  0.0  0.0 0 0 ??  DK  9:06PM    0:00.00 (drmtskl)
> > root 39235  0.0  0.0 0 0 ??  DK  9:06PM    0:00.00 (drmlwq)
> > root 65215  0.0  0.0 0 0 ??  DK  9:06PM    0:01.00 (drmlwq)
> > root 62266  0.0  0.0 0 0 ??  DK  9:06PM    0:00.00 (drmlwq)
> > root 20339  0.0  0.0 0 0 ??  DK  9:06PM    0:00.00 (drmubwq)
> > root 62920  0.0  0.0 0 0 ??  DK  9:06PM    0:00.00 (drmubwq)
> > root 58454  0.0  0.0 0 0 ??  DK  9:06PM    0:01.00 (drmubwq)
> > root 75983  0.0  0.0 0 0 ??  DK  9:06PM    0:00.00 (drmubwq)
> > root 31352  0.0  0.0 0 0 ??  DK  9:06PM    0:00.01 (drmhpwq)
> > root 23634  0.0  0.0 0 0 ??  DK  9:06PM    0:00.00 (drmhpwq)
> > root 95926  0.0  0.0 0 0 ??  DK  9:06PM    0:00.00 (drmhpwq)
> > root 38038  0.0  0.0 0 0 ??  DK  9:06PM    0:07.10 (drmwq)
> > root 10622  0.0  0.0 0 0 ??  DK  9:06PM    0:06.50 (drmwq)
> > root 68591  0.0  0.0 0 0 ??  DK  9:

Re: Edimax EW-7811Un V2 not detected by urtwn

2020-11-14 Thread Jonathan Matthew
On Fri, Nov 13, 2020 at 03:44:27PM -0500, Morgan Aldridge wrote:
> On Thu, Nov 12, 2020 at 11:01 PM Jonathan Matthew  wrote:
> >
> > On Thu, Nov 12, 2020 at 03:31:09PM -0500, Morgan Aldridge wrote:
> > > >Synopsis:  Edimax EW-7811Un V2 not detected by urtwn
> > > >Category:  kernel amd64
> > > >Environment:
> > > System  : OpenBSD 6.8
> > > Details : OpenBSD 6.8 (GENERIC.MP) #0: Wed Oct 28 10:06:34 
> > > MDT 2020
> > >
> > > r...@syspatch-67-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> > >
> > > Architecture: OpenBSD.amd64
> > > Machine : amd64
> > > >Description:
> > > I purchased an Edimax EW-7811Un USB WiFi adapter after a number of
> > > suggestions that it is commonly supported by urtwn(4) on OpenBSD, 
> > > but
> > > they seem to be shipping new V2 model which is still identified as
> > > Realtek, but has a new device ID of 0xb811 (instead of 0x7811).
> > >
> > > dmesg shows:
> > >
> > > ugen2 at uhub1 port 2 "Realtek Edimax N150 Adapter" rev 2.00/0.00 
> > > addr 2
> > >
> > > usbdevs shows:
> > >
> > > addr 02: 7392:b811 Realtek, Edimax N150 Adapter
> > >  high speed, power 500 mA, config 1, rev 0.00, iSerial
> > > 08BEAC0EEAA1
> > >  driver: ugen2
> > >
> > > fw_update does not fetch the urtwn firmware.
> > >
> > > I'm not 100% sure that this is still using the same chipset, but 
> > > it
> > > seems like it's at least still using a Realtek chipset, and am
> > > happy to provide further details or even send the device to a
> > > developer. I'm tempted to just add the new device ID  to usbdevs 
> > > and
> > > if_urtwn.c and see if it works, but I don't know what the risks 
> > > are.
> > > >How-To-Repeat:
> > > Plug Edimax EW-7811Un V2 to USB port, run fw_update, dmesg, and 
> > > usbdevs.
> > > >Fix:
> > > Maybe we can just add the new device ID to usbdevs & if_urtwn.c, 
> > > but
> > > I'm not sure.
> >
> > It looks like this device is RTL8188EU based (not RTL8188CU like the 
> > 7811Un), since
> > the linux driver they offer for download is called rtl8188EUS, so you'd add
> >
> > URTWN_DEV_8188EU(EDIMAX, EW7811UNV2)
> >
> > to the urtwn device list.
> 
> Thanks for the tip! I have confirmed that from the Linux drivers and
> the following diff works for me on GENERIC.MP amd64:

Great, I've committed that now.
Just a note, we keep the devices in usbdevs sorted by device id, so
I moved the new entry down two lines.



Re: Not correctly supported on OpenBSD 6.7: HPE 10/25Gb 2p 640FLR-SFP28 network adapter on HPE DL380 Gen10 servers

2020-06-11 Thread Jonathan Matthew
On Fri, Jun 12, 2020 at 12:13:42AM +0200, Mark Schneider wrote:
> Hello
> 
> 
> Even though the 640FLR-SFP28 network adapter is listed in the "pcidump -v"
> output on OpenBSD 6.7, there are no entries for its interfaces in the
> output of "ifconfig -a"
> 
> # -
> obsd67a1# grep "^ [0-9]" OBSD67-pcidump-v.txt | grep -v Intel
>  1:0:0: Hewlett-Packard iLO3 Slave
>  1:0:1: Matrox unknown
>  1:0:2: Hewlett-Packard iLO3 Management
>  1:0:4: Hewlett-Packard unknown
>  2:0:0: Broadcom BCM5719
>  2:0:1: Broadcom BCM5719
>  2:0:2: Broadcom BCM5719
>  2:0:3: Broadcom BCM5719
>  18:0:0: Adaptec unknown
>  177:0:0: Adaptec unknown
>  178:0:0: Mellanox ConnectX-4 Lx
>  178:0:1: Mellanox ConnectX-4 Lx
> # -

Can you try -current please?  This system will likely work better with
support for acpi pci host bridges, which was not enabled in 6.7.



Re: aarch64: ahci0: log page read failed, slot 31 was still active

2019-10-10 Thread Jonathan Matthew
On Mon, Oct 07, 2019 at 01:30:52PM -0400, Kurt Miller wrote:
> > I hit the issue again using the latest snapshot which
> > includes the work-around.
> > 
> > ahci0: log page read failed, slot 31 was still active.
> > ahci0: stopping the port, softreset slot 31 was still active.
> > ahci0: failed to reset port during timeout handling, disabling it
> > 
> > No panic this time.
> > 
> > The work-around helped with stability in the RAMDISK env where
> > it was very unstable. Perhaps it is still a good idea.
> > 
> 
> This time I got a panic so perhaps there's something helpful
> in the info below. I had 2x rm -rf on some larger directories
> going at the time of this panic:

I've been running a similar load on a desktop pc (amd64), copying and deleting
ports trees, 4 in parallel, ~4k iops on average, on a SATA SSD through an
ASM1061 card (looks exactly the same as the one in the pine64 store) for
around a day with no problems at all.  Is it possible this is actually a
power problem?  How are you powering the board and the SSD?



Re: aarch64: ahci0: log page read failed, slot 31 was still active

2019-10-05 Thread Jonathan Matthew
On Fri, Oct 04, 2019 at 06:24:16PM -0400, k...@intricatesoftware.com wrote:
> >Synopsis:panic after ahci0: log page read failed, slot 31 was still 
> >active
> >Category:kernel
> >Environment:
>   System  : OpenBSD 6.6
>   Details : OpenBSD 6.6-beta (GENERIC.MP) #245: Sat Sep 28 20:43:51 
> MDT 2019
>
> dera...@arm64.openbsd.org:/usr/src/sys/arch/arm64/compile/GENERIC.MP
> 
>   Architecture: OpenBSD.arm64
>   Machine : arm64
> >Description:
>   While building lang/rust received some ahci0 messages then a panic.
>   System is a RockPro64 with 4G memory. filesystem root is on uSD, the
>   rest of the partitions are on SSD (including swap).
> >How-To-Repeat:
>   Not sure if reproducible yet. I was building lang/rust with
>   ulimit -Sd 4194304.
> >Fix:
>   Unknown
> 
> 
> ahci0: log page read failed, slot 31 was still active.
> ahci0: device didn't come ready after reset, TFD: 0x84c1
> panic: uvm_fault failed: ff80003475b8

So, the 'log page read' happens when there are multiple commands active and the
port reports an error.  The log page read is supposed to tell us which command
failed.  If that fails, we fail all active commands, and if the device won't
reset after that, we shut the device off and all further io will fail.  If
that's your swap device, and the system is swapping, you're kind of screwed.

Perhaps disabling command queueing (which means there can only be one command
in flight, so no need to read the log page on errors) might help?  The diff
below should do that.  Checking the SSD out with smartctl is probably also a
good idea at this point.


diff --git sys/dev/pci/ahci_pci.c sys/dev/pci/ahci_pci.c
index 79044b52dd5..f61ae96b0cf 100644
--- sys/dev/pci/ahci_pci.c
+++ sys/dev/pci/ahci_pci.c
@@ -108,7 +108,8 @@ static const struct ahci_device ahci_devices[] = {
 	{ PCI_VENDOR_ATI,	PCI_PRODUCT_ATI_SBX00_SATA_6,
 	    NULL,		ahci_ati_sb700_attach },
 
-	{ PCI_VENDOR_ASMEDIA,	PCI_PRODUCT_ASMEDIA_ASM1061_SATA },
+	{ PCI_VENDOR_ASMEDIA,	PCI_PRODUCT_ASMEDIA_ASM1061_SATA,
+	    NULL,		ahci_vt8251_attach },
 
 	{ PCI_VENDOR_INTEL,	PCI_PRODUCT_INTEL_6SERIES_AHCI_1,
 	    NULL,		ahci_intel_attach },
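
The error recovery flow described above can be modelled roughly like this.
It is only a sketch of the decision logic, not the ahci(4) implementation;
struct port_state, its fields and handle_port_error() are hypothetical.  It
also shows why having a single command in flight removes the need for the
log page read:

```c
#include <assert.h>
#include <stdint.h>

enum outcome {
	FAILED_ONE,	/* the failing command was identified */
	FAILED_ALL,	/* could not identify it, failed every command */
	PORT_DISABLED	/* device would not reset, port shut off */
};

struct port_state {
	uint32_t active;	/* bitmask of in-flight command slots */
	int log_read_ok;	/* model input: does the log page read work? */
	int failed_slot;	/* model input: slot the log page reports */
	int reset_ok;		/* model input: does the device reset work? */
	int disabled;
};

enum outcome
handle_port_error(struct port_state *p)
{
	/* Exactly one command in flight: it must be the one that failed. */
	if (p->active != 0 && (p->active & (p->active - 1)) == 0) {
		p->active = 0;
		return FAILED_ONE;
	}

	/* Several in flight: ask the device which one failed. */
	if (p->log_read_ok) {
		p->active &= ~(1U << p->failed_slot);
		return FAILED_ONE;
	}

	/* Log page read failed: fail everything and reset the device. */
	p->active = 0;
	if (!p->reset_ok) {
		p->disabled = 1;	/* all further io will fail */
		return PORT_DISABLED;
	}
	return FAILED_ALL;
}
```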



Re: Supermicro 2029TP-HC0R: xhci0 msiuvm_fault

2018-11-15 Thread Jonathan Matthew
On Mon, Oct 08, 2018 at 11:45:53AM +0200, Sebastian Benoit wrote:
> 
> Supermicro SYS-2029TP-HC0R
> https://www.supermicro.nl/products/system/2U/2029/SYS-2029TP-HC0R.cfm
> 
> similar to my mail to bugs@
> Subject: Super X11SPi-TF board, xhci0 at ... "Intel C620 xHCI" ... 
> msiuvm_fault
> but different system.

If you still have access to either of these, can you try this diff?
It should stop it crashing (it does on a dell r6415 here), but the usb
controller will most likely fail to initialise.


Index: xhci_pci.c
===
RCS file: /cvs/src/sys/dev/pci/xhci_pci.c,v
retrieving revision 1.9
diff -u -p -r1.9 xhci_pci.c
--- xhci_pci.c  8 May 2018 13:41:52 -   1.9
+++ xhci_pci.c  16 Nov 2018 01:26:05 -
@@ -255,6 +255,9 @@ xhci_pci_takecontroller(struct xhci_pci_
 	int i;
 
 	cparams = XREAD4(&psc->sc, XHCI_HCCPARAMS);
+	if (cparams == 0x)
+		return;
+
 	eec = -1;
 
 	/* Synchronise with the BIOS if it owns the controller. */
@@ -262,6 +265,8 @@ xhci_pci_takecontroller(struct xhci_pci_
 	    xecp != 0 && XHCI_XECP_NEXT(eec);
 	    xecp += XHCI_XECP_NEXT(eec) << 2) {
 		eec = XREAD4(&psc->sc, xecp);
+		if (eec == 0x)
+			return;
 		if (XHCI_XECP_ID(eec) != XHCI_ID_USB_LEGACY)
 			continue;
 		bios_sem = XREAD1(&psc->sc, xecp + XHCI_XECP_BIOS_SEM);
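
For illustration, here is a userspace model of walking a capability list
with that kind of guard.  The constant in the diff above is cut off in this
archive; the sketch assumes it is the all-ones value 0xffffffff, which is
what a register read typically returns when the device does not respond.
read_array(), read_dead() and find_capability() are hypothetical helpers,
and the id/next field layout is an assumption for the model:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEAD_READ	0xffffffffU	/* assumption: device not responding */
#define CAP_ID(x)	((x) & 0xff)
#define CAP_NEXT(x)	(((x) >> 8) & 0xff)	/* offset to next, in dwords */
#define MAX_CAPS	64			/* bound the walk */

typedef uint32_t (*read4_fn)(void *ctx, uint32_t off);

/* Fake register file backed by an array of 32-bit words. */
uint32_t
read_array(void *ctx, uint32_t off)
{
	return ((uint32_t *)ctx)[off / 4];
}

/* Fake device that has fallen off the bus: every read is all-ones. */
uint32_t
read_dead(void *ctx, uint32_t off)
{
	(void)ctx;
	(void)off;
	return DEAD_READ;
}

/* Returns 0 and sets *offp if found, 1 if not found, -1 on a dead read. */
int
find_capability(read4_fn read4, void *ctx, uint32_t first, uint32_t id,
    uint32_t *offp)
{
	uint32_t off = first, v;
	int n;

	for (n = 0; n < MAX_CAPS; n++) {
		v = read4(ctx, off);
		if (v == DEAD_READ)
			return -1;	/* bail out instead of walking garbage */
		if (CAP_ID(v) == id) {
			*offp = off;
			return 0;
		}
		if (CAP_NEXT(v) == 0)
			return 1;	/* end of list */
		off += CAP_NEXT(v) << 2;
	}
	return 1;
}
```

Without the guard, an all-ones value has a non-zero next field, so the walk
keeps advancing to offsets that were never valid.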



Re: strtod() can change "nan" to -nan

2018-05-27 Thread Jonathan Matthew

On 28/04/18 12:57, George Koehler wrote:
> On Mon, 16 Apr 2018 18:06:32 -0400
> George Koehler wrote:
> 
> > In some platforms (like amd64, but not macppc), strtod(3)
> > flips the sign of not-a-number (nan).  It changes "nan" into
> > -nan and "-nan" into nan.
> 
> I had sent a patch for strtod() on amd64, but I forgot to fix
> strtof() and strtold().  I now send a patch for all 3 functions on
> amd64.  This patch also tries to fix alpha, arm, and i386, but I only
> tested amd64.  I know that powerpc doesn't have the bug, and I guess
> that other arches might not have the bug.

I ran into this bug in the test suite for glib, so I tested the fix on 
the other arches, verified that the rest (except sh and m88k) are 
unaffected, and committed it.  Thanks for doing the tricky part for me.
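
A small check along these lines can be used to see which sign a platform's
strtod() gives to a parsed NaN.  nan_sign() is a hypothetical helper, not
part of the committed fix or of the glib test suite:

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>

/*
 * Returns 1 if s parses to a NaN with the sign bit set,
 * 0 if it parses to a NaN with the sign bit clear,
 * and -1 if it does not parse to a NaN at all.
 */
int
nan_sign(const char *s)
{
	double d = strtod(s, NULL);

	if (!isnan(d))
		return -1;
	return signbit(d) ? 1 : 0;
}
```

On an affected libc, nan_sign("nan") would report 1 and nan_sign("-nan")
would report 0, the reverse of what the input says.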




Re: crash when unplugging urtwn usb wifi adapter

2018-04-18 Thread Jonathan Matthew
On Sat, Apr 14, 2018 at 06:54:35AM +0200, p...@ex.com.pl wrote:
> >Synopsis:page fault trap when removing urtwn Wifi adapter from the port
> >Category:kernel
> >Environment:
>   System  : OpenBSD 6.3
>   Details : OpenBSD 6.3 (GENERIC.MP) #107: Sat Mar 24 14:21:59 MDT 
> 2018
>
> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> 
>   Architecture: OpenBSD.amd64
>   Machine : amd64
> >Description:
> I'm observing a system crash if I remove the TP-Link TL-WN725N
> WiFi adapter from the port. The system reports kernel panic:
> 
> kernel: page fault trap, code=0
> Stopped at softclock+0x16b: movq %rax,0(%rdx)

Does this fix it?

Index: ieee80211.c
===
RCS file: /cvs/src/sys/net80211/ieee80211.c,v
retrieving revision 1.65
diff -u -p -u -p -r1.65 ieee80211.c
--- ieee80211.c 12 Dec 2017 15:52:49 -  1.65
+++ ieee80211.c 18 Apr 2018 12:25:34 -
@@ -193,6 +193,7 @@ ieee80211_ifdetach(struct ifnet *ifp)
 {
 	struct ieee80211com *ic = (void *)ifp;
 
+	timeout_del(&ic->ic_bgscan_timeout);
 	ieee80211_proto_detach(ifp);
 	ieee80211_crypto_detach(ifp);
 	ieee80211_node_detach(ifp);
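
The reasoning behind the diff, as a toy model: a timer that is still armed
keeps a pointer to its owner, so it has to be cancelled before the owner
goes away.  Everything here (struct timer, timer_add(), timer_del(),
timers_fire(), struct wifi) is a made-up userspace stand-in, not the kernel
timeout(9) API:

```c
#include <assert.h>
#include <stddef.h>

struct timer {
	struct timer	*next;
	void		(*fn)(void *);
	void		*arg;
};

static struct timer *pending;	/* singly linked list of armed timers */

void
timer_add(struct timer *t, void (*fn)(void *), void *arg)
{
	t->fn = fn;
	t->arg = arg;
	t->next = pending;
	pending = t;
}

/* Remove t from the pending list; returns 1 if it was armed. */
int
timer_del(struct timer *t)
{
	struct timer **pp;

	for (pp = &pending; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == t) {
			*pp = t->next;
			t->next = NULL;
			return 1;
		}
	}
	return 0;
}

/* Fire every armed timer once; returns how many callbacks ran. */
int
timers_fire(void)
{
	int n = 0;

	while (pending != NULL) {
		struct timer *t = pending;

		pending = t->next;
		t->next = NULL;
		t->fn(t->arg);	/* dereferences the owning object */
		n++;
	}
	return n;
}

struct wifi {
	struct timer	bgscan;	/* stand-in for a background scan timeout */
	int		scans;
};

void
wifi_bgscan(void *arg)
{
	struct wifi *w = arg;

	w->scans++;
}

void
wifi_detach(struct wifi *w)
{
	/* Cancel first; only after this may the caller free w. */
	timer_del(&w->bgscan);
}
```

If wifi_detach() skipped the timer_del() call, a later timers_fire() would
still call wifi_bgscan() on an object that might already have been freed.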




Re: mp deadlock on 6.2 running on kvm

2017-12-15 Thread Jonathan Matthew
On Tue, Dec 12, 2017 at 09:11:40PM +1000, Jonathan Matthew wrote:
> On Mon, Dec 11, 2017 at 09:34:00AM +0100, Landry Breuil wrote:
> > On Mon, Dec 11, 2017 at 06:21:01PM +1000, Jonathan Matthew wrote:
> > > On 10/12/17 03:26, Landry Breuil wrote:
> > > > On Sat, Dec 09, 2017 at 04:33:28PM +0100, Juan Francisco Cantero 
> > > > Hurtado wrote:
> > > > > On Thu, Dec 07, 2017 at 02:27:29PM +0100, Landry Breuil wrote:
> > > > > > On Thu, Dec 07, 2017 at 11:52:46AM +0100, Martin Pieuchot wrote:
> > > > > > > On 07/12/17(Thu) 08:34, Landry Breuil wrote:
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > i've been having kvm VMs running 6.2 hardlocking/deadlocking 
> > > > > > > > since a
> > > > > > > > while, all those running on proxmox 5.1 using linux 4.13.8 & 
> > > > > > > > qemu-kvm
> > > > > > > > 2.9.1. There were hardlocks upon reboot which were 'solved' by 
> > > > > > > > disabling
> > > > > > > > x2apic emulation in kvm (args: -cpu=kvm64,-x2apic) or giving 
> > > > > > > > the host
> > > > > > > > cpu flags to the vm (args: -cpu host) but there still remains 
> > > > > > > > deadlocks
> > > > > > > > during normal operation.
> > > > > > > > 
> > > > > > > > I'm now running a kernel with MP_LOCKDEBUG, so i'm collecting 
> > > > > > > > traces in
> > > > > > > > the vain hope that it might help someone interested in locking 
> > > > > > > > issues.
> > > > > > > > Here's the latest one:
> > > > > > > 
> > > > > > > Let me add that when you had x2apic enabled the kernel 'froze' 
> > > > > > > inside
> > > > > > > x2apic_readreg, trace below:
> > > > > > > 
> > > > > > >ddb{0}> tr
> > > > > > >x2apic_readreg(10) at x2apic_readreg+0xf
> > > > > > >lapic_delay(800022136900) at lapic_delay+0x5c
> > > > > > >rtcput(800022136960) at rtcput+0x65
> > > > > > >resettodr() at resettodr+0x1d6
> > > > > > >perform_resettodr(81769b29) at perform_resettodr+0x9
> > > > > > >taskq_thread(0) at taskq_thread+0x67
> > > > > > >end trace frame: 0x0, count: -6
> > > > > > > 
> > > > > > > What you're seeing with a MP_LOCKDEBUG kernel is just a symptom.  
> > > > > > > A CPU
> > > > > > > enters DDB because another one is 'frozen' while holding the
> > > > > > > KERNEL_LOCK().  What's interesting is that in both case the 
> > > > > > > frozen CPU
> > > > > > > is trying to execute apic related code:
> > > > > > >- x2apic_readreg
> > > > > > >- lapic_delay
> > > > > > > 
> > > > > > > I believe this issue should be reported to KVM developers as well.
> > > > > > 
> > > > > > *very* interestingly, i had a new lock, running bsd.sp.. So i think 
> > > > > > that
> > > > > > rules out openbsd mp.
> > > > > > 
> > > > > > ddb> tr
> > > > > > i82489_readreg(0) at i82489_readreg+0xd
> > > > > > lapic_delay(81a84090) at lapic_delay+0x5c
> > > > > > rtcget(81a84090) at rtcget+0x1a
> > > > > > resettodr() at resettodr+0x3a
> > > > > > perform_resettodr(81659e99) at perform_resettodr+0x9
> > > > > > taskq_thread(0) at taskq_thread+0x57
> > > > > > end trace frame: 0x0, count: -6
> > > > > 
> > > > > Try running with "-machine q35". It changes the emulated machine to
> > > > > a modern platform.
> > > > 
> > > > Right, i suppose that matches https://wiki.qemu.org/Features/Q35,
> > > > interesting. Will definitely try, mailed the kvm mailing list but got no
> > > > feedback so far.
> > > 
> > > I've been seeing this for a while too, on VMs that are already run with
> > > -machine q35 and -cpu host.  I was blaming my (still unfinished) pvclock
> > &

Re: mp deadlock on 6.2 running on kvm

2017-12-12 Thread Jonathan Matthew
On Mon, Dec 11, 2017 at 09:34:00AM +0100, Landry Breuil wrote:
> On Mon, Dec 11, 2017 at 06:21:01PM +1000, Jonathan Matthew wrote:
> > On 10/12/17 03:26, Landry Breuil wrote:
> > > On Sat, Dec 09, 2017 at 04:33:28PM +0100, Juan Francisco Cantero Hurtado 
> > > wrote:
> > > > On Thu, Dec 07, 2017 at 02:27:29PM +0100, Landry Breuil wrote:
> > > > > On Thu, Dec 07, 2017 at 11:52:46AM +0100, Martin Pieuchot wrote:
> > > > > > On 07/12/17(Thu) 08:34, Landry Breuil wrote:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > i've been having kvm VMs running 6.2 hardlocking/deadlocking 
> > > > > > > since a
> > > > > > > while, all those running on proxmox 5.1 using linux 4.13.8 & 
> > > > > > > qemu-kvm
> > > > > > > 2.9.1. There were hardlocks upon reboot which were 'solved' by 
> > > > > > > disabling
> > > > > > > x2apic emulation in kvm (args: -cpu=kvm64,-x2apic) or giving the 
> > > > > > > host
> > > > > > > cpu flags to the vm (args: -cpu host) but there still remains 
> > > > > > > deadlocks
> > > > > > > during normal operation.
> > > > > > > 
> > > > > > > I'm now running a kernel with MP_LOCKDEBUG, so i'm collecting 
> > > > > > > traces in
> > > > > > > the vain hope that it might help someone interested in locking 
> > > > > > > issues.
> > > > > > > Here's the latest one:
> > > > > > 
> > > > > > Let me add that when you had x2apic enabled the kernel 'froze' 
> > > > > > inside
> > > > > > x2apic_readreg, trace below:
> > > > > > 
> > > > > >ddb{0}> tr
> > > > > >x2apic_readreg(10) at x2apic_readreg+0xf
> > > > > >lapic_delay(800022136900) at lapic_delay+0x5c
> > > > > >rtcput(800022136960) at rtcput+0x65
> > > > > >resettodr() at resettodr+0x1d6
> > > > > >perform_resettodr(81769b29) at perform_resettodr+0x9
> > > > > >taskq_thread(0) at taskq_thread+0x67
> > > > > >end trace frame: 0x0, count: -6
> > > > > > 
> > > > > > What you're seeing with a MP_LOCKDEBUG kernel is just a symptom.  A 
> > > > > > CPU
> > > > > > enters DDB because another one is 'frozen' while holding the
> > > > > > KERNEL_LOCK().  What's interesting is that in both case the frozen 
> > > > > > CPU
> > > > > > is trying to execute apic related code:
> > > > > >- x2apic_readreg
> > > > > >- lapic_delay
> > > > > > 
> > > > > > I believe this issue should be reported to KVM developers as well.
> > > > > 
> > > > > *very* interestingly, i had a new lock, running bsd.sp.. So i think 
> > > > > that
> > > > > rules out openbsd mp.
> > > > > 
> > > > > ddb> tr
> > > > > i82489_readreg(0) at i82489_readreg+0xd
> > > > > lapic_delay(81a84090) at lapic_delay+0x5c
> > > > > rtcget(81a84090) at rtcget+0x1a
> > > > > resettodr() at resettodr+0x3a
> > > > > perform_resettodr(81659e99) at perform_resettodr+0x9
> > > > > taskq_thread(0) at taskq_thread+0x57
> > > > > end trace frame: 0x0, count: -6
> > > > 
> > > > Try running with "-machine q35". It changes the emulated machine to
> > > > a modern platform.
> > > 
> > > Right, i suppose that matches https://wiki.qemu.org/Features/Q35,
> > > interesting. Will definitely try, mailed the kvm mailing list but got no
> > > feedback so far.
> > 
> > I've been seeing this for a while too, on VMs that are already run with
> > -machine q35 and -cpu host.  I was blaming my (still unfinished) pvclock
> > code, but now I can fairly easily trigger it on single cpu VMs without that,
> > mostly by running kernel compiles in a loop in a couple of different guests.
> > I'm using a Fedora 25 (4.10.15-200.fc25.x86_64) kernel.
> > 
> > Adding some debug output to lapic_delay, it appears the KVM virtualized
> > lapic counter hits zero and doesn't reset, so the lapic_delay loop in the
> > guest never term

Re: mp deadlock on 6.2 running on kvm

2017-12-11 Thread Jonathan Matthew

On 10/12/17 03:26, Landry Breuil wrote:
> On Sat, Dec 09, 2017 at 04:33:28PM +0100, Juan Francisco Cantero Hurtado wrote:
> > On Thu, Dec 07, 2017 at 02:27:29PM +0100, Landry Breuil wrote:
> > > On Thu, Dec 07, 2017 at 11:52:46AM +0100, Martin Pieuchot wrote:
> > > > On 07/12/17(Thu) 08:34, Landry Breuil wrote:
> > > > > Hi,
> > > > >
> > > > > i've been having kvm VMs running 6.2 hardlocking/deadlocking since a
> > > > > while, all those running on proxmox 5.1 using linux 4.13.8 & qemu-kvm
> > > > > 2.9.1. There were hardlocks upon reboot which were 'solved' by disabling
> > > > > x2apic emulation in kvm (args: -cpu=kvm64,-x2apic) or giving the host
> > > > > cpu flags to the vm (args: -cpu host) but there still remains deadlocks
> > > > > during normal operation.
> > > > >
> > > > > I'm now running a kernel with MP_LOCKDEBUG, so i'm collecting traces in
> > > > > the vain hope that it might help someone interested in locking issues.
> > > > > Here's the latest one:
> > > >
> > > > Let me add that when you had x2apic enabled the kernel 'froze' inside
> > > > x2apic_readreg, trace below:
> > > >
> > > >    ddb{0}> tr
> > > >    x2apic_readreg(10) at x2apic_readreg+0xf
> > > >    lapic_delay(800022136900) at lapic_delay+0x5c
> > > >    rtcput(800022136960) at rtcput+0x65
> > > >    resettodr() at resettodr+0x1d6
> > > >    perform_resettodr(81769b29) at perform_resettodr+0x9
> > > >    taskq_thread(0) at taskq_thread+0x67
> > > >    end trace frame: 0x0, count: -6
> > > >
> > > > What you're seeing with a MP_LOCKDEBUG kernel is just a symptom.  A CPU
> > > > enters DDB because another one is 'frozen' while holding the
> > > > KERNEL_LOCK().  What's interesting is that in both case the frozen CPU
> > > > is trying to execute apic related code:
> > > >    - x2apic_readreg
> > > >    - lapic_delay
> > > >
> > > > I believe this issue should be reported to KVM developers as well.
> > >
> > > *very* interestingly, i had a new lock, running bsd.sp.. So i think that
> > > rules out openbsd mp.
> > >
> > > ddb> tr
> > > i82489_readreg(0) at i82489_readreg+0xd
> > > lapic_delay(81a84090) at lapic_delay+0x5c
> > > rtcget(81a84090) at rtcget+0x1a
> > > resettodr() at resettodr+0x3a
> > > perform_resettodr(81659e99) at perform_resettodr+0x9
> > > taskq_thread(0) at taskq_thread+0x57
> > > end trace frame: 0x0, count: -6
> >
> > Try running with "-machine q35". It changes the emulated machine to
> > a modern platform.
>
> Right, i suppose that matches https://wiki.qemu.org/Features/Q35,
> interesting. Will definitely try, mailed the kvm mailing list but got no
> feedback so far.

I've been seeing this for a while too, on VMs that are already run with 
-machine q35 and -cpu host.  I was blaming my (still unfinished) pvclock 
code, but now I can fairly easily trigger it on single cpu VMs without 
that, mostly by running kernel compiles in a loop in a couple of 
different guests.  I'm using a Fedora 25 (4.10.15-200.fc25.x86_64) kernel.


Adding some debug output to lapic_delay, it appears the KVM virtualized 
lapic counter hits zero and doesn't reset, so the lapic_delay loop in 
the guest never terminates.  KVM has several different ways it can 
provide the lapic counter and I'm not sure which one I'm using yet.


I just tried making lapic_delay give up after a million zero reads, and 
it seems to recover after a minute or so.  I'll leave it running to see 
if it happens again.
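
For anyone who wants to play with the idea, the "give up after a million zero reads" workaround can be modelled in userland roughly as below. This is only a sketch under assumptions, not the actual lapic_delay() change: read_counter is a hypothetical callback standing in for the lapic current-count register read, and MAX_ZERO_READS, delay_ticks, stuck_counter and ticking_counter are names made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed limit: bail out after this many consecutive zero reads. */
#define MAX_ZERO_READS	1000000

/* Hypothetical stand-in for reading the lapic current-count register. */
typedef uint32_t (*read_counter_fn)(void *);

/*
 * Busy-wait until `ticks` ticks have elapsed on a down-counting timer
 * that reloads to `reload` after reaching zero.  Returns 0 on success,
 * or -1 if the counter looks stuck at zero.
 */
int
delay_ticks(read_counter_fn read_counter, void *arg, uint32_t reload,
    uint64_t ticks)
{
	uint32_t prev, cur;
	uint64_t elapsed = 0;
	int zero_reads = 0;

	prev = read_counter(arg);
	while (elapsed < ticks) {
		cur = read_counter(arg);
		if (cur == 0) {
			/* A stuck counter never leaves zero; don't spin forever. */
			if (++zero_reads >= MAX_ZERO_READS)
				return -1;
		} else {
			zero_reads = 0;
		}
		if (cur > prev)
			elapsed += prev + (reload - cur);	/* counter reloaded */
		else
			elapsed += prev - cur;
		prev = cur;
	}
	return 0;
}

/* Model of the failure: a counter that sits at zero and never reloads. */
uint32_t
stuck_counter(void *arg)
{
	(void)arg;
	return 0;
}

/* Model of a healthy counter: counts down by one per read, reloads to 100. */
uint32_t
ticking_counter(void *arg)
{
	uint32_t *c = arg;
	uint32_t v = *c;

	*c = (v == 0) ? 100 : v - 1;
	return v;
}
```

With the healthy model the delay completes normally; with the stuck model the loop returns -1 after MAX_ZERO_READS reads instead of spinning forever, which is the behaviour the workaround is after.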




Re: Invalid syntax error from syntax_is_time() on ldapd(8) when adding entries

2017-05-28 Thread Jonathan Matthew
On Fri, May 05, 2017 at 10:05:19AM +, Seiya Kawashima wrote:
> >Synopsis:    Invalid syntax error from syntax_is_time() on ldapd(8) when adding entries
> >Category:    system
> >Environment:
>   System  : OpenBSD 6.1
>   Details : OpenBSD 6.1-current (GENERIC.MP) #50: Thu May  4 11:52:48 MDT 2017
>             dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> 
>   Architecture: OpenBSD.amd64
>   Machine : amd64
> >Description:
>   Thank you for the great work on ldapd(8).
> 
>   ldapd(8) had been working great until I moved to OpenBSD 6.1-current
>   (GENERIC.MP) #50.
>   The entire dmesg is attached at the end of this report. If this report
>   is not relevant, please discard it. The relevant parts are
>   syntax_is_gentime(), syntax_is_utctime and syntax_is_time() on syntax.c.
>   I still wonder why it worked without the fix shown below and now it
>   doesn't work without any modification.

Thanks for reporting and investigating this.  I've committed a fix, also fixing
the check for timezones on generalized times.

What happened here was that in r1.4 of syntax.c, we corrected the CHECK_RANGE
macro, which previously wouldn't work properly following an 'if' statement with
no braces.  The function had accidentally been written to rely on the old,
broken expansion, and unfortunately we didn't notice that fixing the macro
caused the function to misbehave.
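
To illustrate the class of bug (this is not the real CHECK_RANGE from syntax.c, whose exact old definition isn't shown here), a macro that expands to more than one statement only has its first statement governed by a preceding braceless 'if'.  CHECK_RANGE_BAD, CHECK_RANGE_GOOD, valid_bad and valid_good below are hypothetical names for a minimal sketch:

```c
/*
 * Hypothetical multi-statement macro: after a braceless 'if', only the
 * first 'if' in the expansion is conditional; the second always runs.
 */
#define CHECK_RANGE_BAD(v, min, max)		\
	if ((v) < (min))			\
		return 0;			\
	if ((v) > (max))			\
		return 0

/* Wrapped in do/while(0), the expansion behaves as a single statement. */
#define CHECK_RANGE_GOOD(v, min, max) do {	\
	if ((v) < (min) || (v) > (max))		\
		return 0;			\
} while (0)

/* Intended meaning: only validate secs when have_secs is set. */
int
valid_bad(int have_secs, int secs)
{
	if (have_secs)
		CHECK_RANGE_BAD(secs, 0, 59);
	return 1;
}

int
valid_good(int have_secs, int secs)
{
	if (have_secs)
		CHECK_RANGE_GOOD(secs, 0, 59);
	return 1;
}
```

Here valid_bad(0, 99) returns 0 because the upper-bound check escapes the 'if', while valid_good(0, 99) returns 1.  Code written against the first form can easily end up depending on that stray check, so correcting the macro changes what the caller does.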



Re: kernel panic ATA_S_ONCHIP

2017-02-21 Thread Jonathan Matthew

On 02/22/2017 12:24 PM, Sanka Coffie wrote:
> Then I did a clean reboot and ran fsck manually which produced all of
> the output below. The rest of this e-mail is just output from fsck.
>
> # fsck
> ** /dev/sd0a (402af328685601ff.a) (NO WRITE)
> ** Last Mounted on /
> ** Root file system
> ** Phase 1 - Check Blocks and Sizes
> ** Phase 2 - Check Pathnames
> ** Phase 3 - Check Connectivity
> ** Phase 4 - Check Reference Counts
> ** Phase 5 - Check Cyl groups
> 1747 files, 22931 used, 493332 free (84 frags, 61656 blocks, 0.0% fragmentation)
> ** /dev/sd1a (6ad7e2b138d393b2.a) (NO WRITE)
> ** File system is clean; not checking
> ** /dev/sd1e (6ad7e2b138d393b2.e) (NO WRITE)
> ** File system is clean; not checking
> ** /dev/sd1d (6ad7e2b138d393b2.d) (NO WRITE)
> ** Last Mounted on /tmp
> ** Phase 1 - Check Blocks and Sizes
>
> CANNOT READ: BLK 2016
> CONTINUE? [Fyn?] y
>
> ahci0: ncq error: 6 0 41 84
> ahci0: NCQ errored slot 6 is idle (0004 active)

So, no matter what's happening, it always says command slot 6 was the 
one that failed.  If you run fsck again, does it fail at the same block 
numbers, or are they more or less random?


I wonder if the ssd is misreporting its queue depth, so we shove too 
many commands at it and it doesn't know how to report that properly.

What output do you get with this diff:
https://mild.embarrassm.net/~jonathan/t/atascsi-qdepth.diff
(apply in src/sys/dev/ata)?
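
For context on the queue depth theory: as far as I know, the ATA IDENTIFY DEVICE data reports the maximum queue depth minus one in the low five bits of word 75, and the host should never have more commands outstanding than that.  The sketch below shows that arithmetic only; it is not what the diff above does, and ata_reported_qdepth and usable_openings are made-up names.

```c
#include <stdint.h>

/* IDENTIFY DEVICE word 75, bits 4:0: maximum queue depth minus one. */
#define ATA_QDEPTH_MASK	0x1f

/* Queue depth the device claims to support (1..32). */
int
ata_reported_qdepth(uint16_t identify_word75)
{
	return (identify_word75 & ATA_QDEPTH_MASK) + 1;
}

/*
 * Number of commands the host may keep in flight: the smaller of the
 * controller's command slot count and the device's reported depth.
 */
int
usable_openings(int controller_slots, int device_qdepth)
{
	return device_qdepth < controller_slots ?
	    device_qdepth : controller_slots;
}
```

If a device reports a larger depth than it can really handle, the host will happily queue that many commands, which is the situation I'm wondering about here.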



Re: kernel panic ATA_S_ONCHIP

2017-02-20 Thread Jonathan Matthew
On Wed, Feb 01, 2017 at 06:51:00PM -0500, Sanka Coffie wrote:
> On Wed, Feb 1, 2017 at 6:28 AM, Jonathan Matthew <jonat...@d14n.org> wrote:
> 
> > On Wed, Feb 01, 2017 at 04:16:34AM -0500, Sanka Coffie wrote:
> > > Here is the output of dmesg with debugging enabled on ahci, running with
> > > -current that I pulled down a few hours ago. The line matches up with the
> > > driver which is an improvement.
> >
> > OK, that looks like the mechanism for determining which NCQ command failed
> > is failing, so let's see what happens if we turn NCQ off.  Does this diff
> > change anything?
> >
> >
> > Index: ahci_pci.c
> > ===
> > RCS file: /cvs/src/sys/dev/pci/ahci_pci.c,v
> > retrieving revision 1.12
> > diff -u -p -u -p -r1.12 ahci_pci.c
> > --- ahci_pci.c  14 Jan 2016 04:06:53 -  1.12
> > +++ ahci_pci.c  1 Feb 2017 11:26:01 -
> > @@ -281,7 +281,7 @@ ahci_amd_hudson2_attach(struct ahci_soft
> >  {
> > ahci_ati_sb_idetoahci(sc, pa);
> >
> > -   sc->sc_flags |= AHCI_F_IPMS_PROBE;
> > +   sc->sc_flags |= AHCI_F_IPMS_PROBE | AHCI_F_NO_NCQ;
> >
> > return (0);
> >  }
> >
> >
> Yep, just installed with that change and no more kernel panic. I was also
> able to reboot, as well as run df and ls on the 2nd drive without issue.

Every other ahci driver I've looked at doesn't consider this a fatal error,
so maybe we shouldn't either.  The diff below adds a bit of dmesg spam during
NCQ error recovery, and also attempts to handle the condition we're seeing
here by failing all outstanding commands rather than panicking.  I'd be really
interested to see what happens if you remove the AHCI_F_NO_NCQ change and try
this instead.


Index: ahci.c
===
RCS file: /cvs/src/sys/dev/ic/ahci.c,v
retrieving revision 1.28
diff -u -p -u -p -r1.28 ahci.c
--- ahci.c  2 Oct 2016 18:56:05 -   1.28
+++ ahci.c  21 Feb 2017 03:51:09 -
@@ -2158,6 +2158,12 @@ ahci_port_intr(struct ahci_port *ap, u_i
PORTNAME(ap), err_slot);
 
ccb = &ap->ap_ccbs[err_slot];
+   if (ccb->ccb_xa.state != ATA_S_ONCHIP) {
+   printf("%s: NCQ errored slot %d is idle"
+   " (%08x active)\n", PORTNAME(ap), err_slot,
+   ci_saved);
+   goto failall;
+   }
} else {
/* Didn't reset, could gather extended info from log. */
}
@@ -2572,9 +2578,21 @@ err:
/* Extract failed register set and tags from the scratch space. */
if (rc == 0) {
struct ata_log_page_10h *log;
-   int err_slot;
+   int err_slot, i;
+   uint8_t sum;
 
log = (struct ata_log_page_10h *)ap->ap_err_scratch;
+   sum = 0;
+   for (i = 0; i < sizeof(*log); i++)
+   sum += ap->ap_err_scratch[i];
+   if (sum != 0)
+   printf("%s: NCQ error log checksum mismatch\n",
+   PORTNAME(ap));
+
+   printf("%s: ncq error: %x %x %x %x\n", PORTNAME(ap),
+   log->err_regs.type, log->err_regs.flags,
+   log->err_regs.status, log->err_regs.error);
+
if (ISSET(log->err_regs.type, ATA_LOG_10H_TYPE_NOTQUEUED)) {
/* Not queued bit was set - wasn't an NCQ error? */
printf("%s: read NCQ error page, but not an NCQ "
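
The checksum test added in the diff above relies on the log page being laid out so that all of its bytes sum to zero modulo 256 (to my understanding the last byte of the page carries the two's complement of the sum of the others).  A standalone sketch of that check, with hypothetical helper names log_page_checksum_ok and log_page_set_checksum that are not part of the driver:

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if all bytes of the page sum to zero modulo 256. */
int
log_page_checksum_ok(const uint8_t *page, size_t len)
{
	uint8_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum += page[i];
	return sum == 0;
}

/*
 * Test helper: store the two's complement of the sum of the first
 * len - 1 bytes in the final byte.  len must be at least 1.
 */
void
log_page_set_checksum(uint8_t *page, size_t len)
{
	uint8_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i++)
		sum += page[i];
	page[len - 1] = (uint8_t)(0 - sum);
}
```

A page that fails this check probably wasn't read back correctly, in which case the err_regs fields printed afterwards shouldn't be trusted too much.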



Re: kernel panic ATA_S_ONCHIP

2017-01-29 Thread Jonathan Matthew
On Wed, Jan 18, 2017 at 03:04:21AM +, Sanka Coffie wrote:
> Just installed -current and am hitting a kernel panic on boot:
> 
> panic: kernel diagnostic assertion "ccb->ccb_xa.state == ATA_S_ONCHIP"
> failed: file "../../../../dev/ic/ahci.c", line 2174

This indicates the SATA controller reported an error for a command that had
already completed, or for a command slot that wasn't in use at all.
Can you narrow it down to one of the SSDs in that system?  Do those SSDs
work properly in other machines?

Are you able to build a custom kernel and boot it on the APU?  If so, the first
thing to do would be to enable ahci debug output by changing the
#define NO_AHCI_DEBUG line at the top of src/sys/dev/ic/ahcivar.h - this won't
fix the panic but it'll tell us where the error information is coming from.
Depending on what output you get from that, there are a few more things to try.