Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching

2017-07-24 Thread Mark Martinec

2017-07-24 18:25, Ken Merry wrote:

It is possible that the change I MFCed today (r321207 in head, r321415
in stable/11) is related, but Mark will have to boot his machine with
the fix to see if it makes any difference.

What happened in my case on one particular machine (not on most
machines in our lab running the same code) was that mps_wait_command()
/ mpr_wait_command() would not wait the full 60 seconds for a write to
the DPM table (Driver Persistent Mapping) table in the controller.
So, it reported that there was a timeout.
[...]
Eliminating bogus timeouts will eliminate most all of the sources of
those panics anyway.


Took r321415 from stable/11 and applied it to 11.1-RC3 - and it makes
no difference to booting: still hangs attempting to attach da0,
with a spinning CPU (according to fan speed).
Booting in safe mode, or with EARLY_AP_STARTUP disabled avoids the 
problem.



There is a secondary bug that is still in the mps(4) / mpr(4) drivers
when a timeout does happen — the error recovery code in the
wait_command() routine reinitializes the controller, which clears out
all the commands.  When the wait_command() routine returns, the
command passed in has been freed, but the caller doesn’t know that.
So the caller (it happens in a number of places) dereferences a
pointer to freed memory and the kernel panics.

I’m planning to fix that bug, too, if slm@ doesn’t get to it first,
I’ve just had other bugs to fix first.


No panics in my case, just hangs.

  Mark
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"

Re: stable/11: Kernel page fault with the following non-sleepable locks held: CAM device lock

2017-07-24 Thread Eugene Grosbein
On 23.07.2017 20:02, Eugene Grosbein wrote:
> Hi!
> 
> Long story short: stable/11 r321371 started to panic at the moment of smartd 
> invocation
> after my SSD died.
> 
> I have Intel motherboard with graid-supported pseudo-raid.
> I use it in RAID1 mode with one HDD and one SSD.
> 
> Yesterday the SSD has died: it is not detected by BIOS nor FreeBSD kernel 
> (timeouts).
> This went unnoticed by me as graid just disconnected it on-the-fly:
> 
> kernel: ahcich5: Timeout on slot 24 port 0
> kernel: ahcich5: is  cs  ss 0100 rs 0100 tfd 40 serr 
>  cmd d817
> kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 02 ad 12 9e 40 3b 
> 00 00 00 00 00
> kernel: (ada1:ahcich5:0:0:0): CAM status: Command timeout
> kernel: (ada1:ahcich5:0:0:0): Retrying command
> kernel: ahcich5: AHCI reset: device not ready after 31000ms (tfd = 0080)
> [skip]
> kernel: ada1 at ahcich5 bus 0 scbus2 target 0 lun 0
> kernel: ada1:  s/n JYKJ550855860139 detached
> [skip]
> kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 02 ad 12 9e 40 3b 
> 00 00 00 00 00
> kernel: (ada1:ahcich5:0:0:0): CAM status: Command timeout
> kernel: (ada1:ahcich5:0:0:0): Error 5, Periph was invalidated
> kernel: GEOM_RAID: Write failed: failing subdisk. 
> ada1[WRITE(offset=269389066240, length=32768)]
> kernel: GEOM_RAID: Intel-c291fe96: Disk ada1 state changed from ACTIVE to 
> FAILED.
> kernel: GEOM_RAID: Intel-c291fe96: Subdisk r0:1-ada1 state changed from 
> ACTIVE to FAILED.
> kernel: GEOM_RAID: Intel-c291fe96: Volume r0 state changed from OPTIMAL to 
> DEGRADED.
> kernel: GEOM_RAID: Intel-c291fe96: Disk ada1 state changed from FAILED to 
> OFFLINE.
> kernel: GEOM_RAID: Intel-c291fe96: Subdisk r0:1-[unknown] state changed from 
> FAILED to NONE.
> kernel: GEOM_RAID: Write failed: failing subdisk. 
> ada1[WRITE(offset=270699851776, length=32768)]
> kernel: GEOM_RAID: Intel-c291fe96: Warning! Fail request to a disk in a wrong 
> state (OFFLINE)!
> 
> Unaware of that, I've performed standard source upgrade from 11.1-PRERELEASE 
> r318692
> to stable/11 r321371 that went smooth. After reboot, BIOS was unable to 
> detect SSD,
> reported degraded state of the mirror and booted FreeBSD using second mirror 
> component (HDD).
> 
> After long timeout, the kernel could not detect dead SSD too and continued to 
> run with degraded mirror
> just fine: the system went multiuser mode and almost finished loading when 
> rcNG started smartd.
> The kernel panices that moment. This is repeatable: I can cold-boot to single 
> user mode, start smartd
> and get same panic. This is debugging kernel and I managed to obtain 
> crashdump.
> 
> kgdb session follows:
> 
> <118>Starting smartd.
> Kernel page fault with the following non-sleepable locks held:
> exclusive sleep mutex CAM device lock (CAM device lock) r = 0 
> (0xf8000cf71c60) locked @ /home/src/sys/cam/scsi/scsi_pass.c:1766
> stack backtrace:
> #0 0x80a12620 at witness_debugger+0x70
> #1 0x80a13a4e at witness_warn+0x45e
> #2 0x80e4b693 at trap_pfault+0x53
> #3 0x80e4ae3e at trap+0x29e
> #4 0x80e2ed91 at calltrap+0x8
> #5 0x8033873a at passsendccb+0x6a
> #6 0x80337836 at passdoioctl+0x3c6
> #7 0x80337052 at passioctl+0x22
> #8 0x80878c78 at devfs_ioctl_f+0x138
> #9 0x80a18184 at kern_ioctl+0x2c4
> #10 0x80a17e4f at sys_ioctl+0x16f
> #11 0x80e4c05a at amd64_syscall+0x53a
> #12 0x80e2f07b at Xfast_syscall+0xfb
> 
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0xa
> fault code  = supervisor read data, page not present
> instruction pointer = 0x20:0x80e494e1
> stack pointer   = 0x28:0xfe04675ff670
> frame pointer   = 0x28:0xfe04675ff670
> code segment= base 0x0, limit 0xf, type 0x1b
> = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags= interrupt enabled, resume, IOPL = 0
> current process = 1387 (smartd)
> trap number = 12
> panic: page fault
> cpuid = 0
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe04675ff250
> vpanic() at vpanic+0x186/frame 0xfe04675ff2d0
> panic() at panic+0x43/frame 0xfe04675ff330
> trap_fatal() at trap_fatal+0x322/frame 0xfe04675ff380
> trap_pfault() at trap_pfault+0x62/frame 0xfe04675ff3e0
> trap() at trap+0x29e/frame 0xfe04675ff5a0
> calltrap() at calltrap+0x8/frame 0xfe04675ff5a0
> --- trap 0xc, rip = 0x80e494e1, rsp = 0xfe04675ff670, rbp = 
> 0xfe04675ff670 ---
> copyin() at copyin+0x41/frame 0xfe04675ff670
> passsendccb() at passsendccb+0x6a/frame 0xfe04675ff6f0
> passdoioctl() at passdoioctl+0x3c6/frame 0xfe04675ff7a0
> passioctl() at passioctl+0x22/frame 0xfe04675ff7e0
> devfs_ioctl_f() at devfs_ioctl_f+0x138/frame 0xfe04675ff840
> kern_ioctl() at 

Re: stable/11 debugging kernel unable to produce crashdump again

2017-07-24 Thread Eugene Grosbein
On 25.07.2017 00:22, Mark Johnston wrote:
> On Tue, Jul 25, 2017 at 12:03:05AM +0700, Eugene Grosbein wrote:
>> Thanks, this helped:
>>
>> $ addr2line -f -e kernel.debug 0x80919c00
>> g_raid_shutdown_post_sync
>> /home/src/sys/geom/raid/g_raid.c:2458
>>
>> That is GEOM_RAID's g_raid_shutdown_post_sync() that hangs if called just 
>> before
>> crashdump generation but works just fine during normal system shutdown.
> 
> I think graid probably needs a treatment similar to r301173/r316032.
> g_raid_shutdown_post_sync() appears to be quite similar to the
> corresponding gmirror handler. In particular, it just attempts to mark
> the individual components as clean and destroy the GEOM, which is not
> really safe after a panic.
> 
> diff --git a/sys/geom/raid/g_raid.c b/sys/geom/raid/g_raid.c
> index 7a1fd8c5ce2e..aa2529d5466a 100644
> --- a/sys/geom/raid/g_raid.c
> +++ b/sys/geom/raid/g_raid.c
> @@ -2461,6 +2461,9 @@ g_raid_shutdown_post_sync(void *arg, int howto)
>   struct g_raid_softc *sc;
>   struct g_raid_volume *vol;
>  
> + if (panicstr != NULL)
> + return;
> +
>   mp = arg;
>   g_topology_lock();
>   g_raid_shutdown = 1;
> 

I'r rather leave this to Alexander.

Funny thing is that it's not 100% hangs if I add some debugging printfs:
more printfs added, more probability that it does not hang and proceeds
to successfull crashdump generation. I use old "sc" console (not vt), if that 
matters.


___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: stable/11 debugging kernel unable to produce crashdump again

2017-07-24 Thread Warner Losh
I've often wondered why, for CAM at least, we don't automatically fall back
to the dump way when scheduling is stopped rather than have two different
interfaces and special knowledge of this in a lot of places...

Warner

On Mon, Jul 24, 2017 at 11:25 AM, Alexander Motin  wrote:

> I guess that problem of g_raid_shutdown_post_sync in case of panic can
> be explained by the fact it tries to write clean metadata in regular
> (not dumping) way while system is already in panic mode and there is no
> proper scheduling.  May be it could be just bypassed in case of dumping
> (should be trivial and probably OK), or use g_raid_subdisk_kerneldump()
> in that case instead of normal GEOM I/O.
>
> On 24.07.2017 20:03, Eugene Grosbein wrote:
> > CCing mav@ as graid expert.
> >
> > On 24.07.2017 08:44, Mark Johnston wrote:
> >
> >>> Sadly, this time 11.1-STABLE r321371 SMP hangs instead of doing
> crashdump:
> >>>
> >>> - "call doadump" from DDB prompt works just fine;
> >>> - "shutdown -r now" reboots the system without problems;
> >>> - "sysctl debug.kdb.panic=1" triggers a panic just fine but system
> hangs just afer showing uptime
> >>> instead of continuing with crashdump generation; same if "real" panic
> occurs.
> >>>
> >>> Same for debug.minidump set to 1 or 0. How do I debug this?
> >>
> >> I'm not able to reproduce the problem in bhyve using r321401. Looking
> >> at the code, the culprits might be cngrab(), or one of the
> >> shutdown_post_sync eventhandlers. Since you're apparently able to see
> >> the console output at the time of the panic, I guess it's probably the
> >> latter. Could you try your test with the patch below applied? It'll
> >> print a bunch of "entering post_sync"/"leaving post_sync" messages with
> >> addresses that can be resolved using kgdb. That'll help determine where
> >> we're getting stuck.
> >>
> >> Index: sys/sys/eventhandler.h
> >> ===
> >> --- sys/sys/eventhandler.h   (revision 321401)
> >> +++ sys/sys/eventhandler.h   (working copy)
> >> @@ -85,7 +85,11 @@
> >>  _t = (struct eventhandler_entry_ ## name *)_ep; \
> >>  CTR1(KTR_EVH, "eventhandler_invoke: executing %p",
> \
> >>  (void *)_t->eh_func);   \
> >> +if (strcmp(__STRING(name), "shutdown_post_sync")
> == 0) \
> >> +printf("entering post_sync %p\n", (void
> *)_t->eh_func); \
> >>  _t->eh_func(_ep->ee_arg , ## __VA_ARGS__);  \
> >> +if (strcmp(__STRING(name), "shutdown_post_sync")
> == 0) \
> >> +printf("leaving post_sync %p\n", (void
> *)_t->eh_func); \
> >>  EHL_LOCK((list));   \
> >>  }   \
> >>  }   \
> >>
> >
> > Thanks, this helped:
> >
> > $ addr2line -f -e kernel.debug 0x80919c00
> > g_raid_shutdown_post_sync
> > /home/src/sys/geom/raid/g_raid.c:2458
> >
> > That is GEOM_RAID's g_raid_shutdown_post_sync() that hangs if called
> just before
> > crashdump generation but works just fine during normal system shutdown.
> >
> > I should note my graid's RAID1 is running in degraded state currently
> > due to dead SSD module that does not respond. Here is part of boot log:
> >
> > ahcich5: AHCI reset: device not ready after 31000ms (tfd = 0080)
> > ahcich5: Poll timeout on slot 2 port 0
> > ahcich5: is  cs 0004 ss  rs 0004 tfd 80 serr
>  cmd c217
> > (aprobe2:ahcich5:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00
> 00 00 00
> > (aprobe2:ahcich5:0:0:0): CAM status: Command timeout
> > (aprobe2:ahcich5:0:0:0): Error 5, Retries exhausted
> > run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config
> > ahcich5: Poll timeout on slot 3 port 0
> > ahcich5: is  cs 0008 ss  rs 0008 tfd 80 serr
>  cmd c317
> > (aprobe2:ahcich5:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00
> 00 00
> > (aprobe2:ahcich5:0:0:0): CAM status: Command timeout
> > (aprobe2:ahcich5:0:0:0): Error 5, Retries exhausted
> > [skip]
> > Trying to mount root from ufs:/dev/raid/r0s4a [rw,noatime]...
> > Root mount waiting for: GRAID-Intel
> > Root mount waiting for: GRAID-Intel
> > Root mount waiting for: GRAID-Intel
> > Root mount waiting for: GRAID-Intel
> > Root mount waiting for: GRAID-Intel
> > GEOM_RAID: Intel-c291fe96: Force array start due to timeout.
> > GEOM_RAID: Intel-c291fe96: Disk ada0 state changed from NONE to ACTIVE.
> > GEOM_RAID: Intel-c291fe96: Subdisk r0:0-ada0 state changed from NONE to
> STALE.
> > GEOM_RAID: Intel-c291fe96: Array started.
> > GEOM_RAID: Intel-c291fe96: Subdisk r0:0-ada0 state changed from STALE to
> ACTIVE.
> > GEOM_RAID: Intel-c291fe96: Volume r0 state 

Re: stable/11 debugging kernel unable to produce crashdump again

2017-07-24 Thread Alexander Motin
I guess that problem of g_raid_shutdown_post_sync in case of panic can
be explained by the fact it tries to write clean metadata in regular
(not dumping) way while system is already in panic mode and there is no
proper scheduling.  May be it could be just bypassed in case of dumping
(should be trivial and probably OK), or use g_raid_subdisk_kerneldump()
in that case instead of normal GEOM I/O.

On 24.07.2017 20:03, Eugene Grosbein wrote:
> CCing mav@ as graid expert.
> 
> On 24.07.2017 08:44, Mark Johnston wrote:
> 
>>> Sadly, this time 11.1-STABLE r321371 SMP hangs instead of doing crashdump:
>>>
>>> - "call doadump" from DDB prompt works just fine;
>>> - "shutdown -r now" reboots the system without problems;
>>> - "sysctl debug.kdb.panic=1" triggers a panic just fine but system hangs 
>>> just afer showing uptime
>>> instead of continuing with crashdump generation; same if "real" panic 
>>> occurs.
>>>
>>> Same for debug.minidump set to 1 or 0. How do I debug this?
>>
>> I'm not able to reproduce the problem in bhyve using r321401. Looking
>> at the code, the culprits might be cngrab(), or one of the
>> shutdown_post_sync eventhandlers. Since you're apparently able to see
>> the console output at the time of the panic, I guess it's probably the
>> latter. Could you try your test with the patch below applied? It'll
>> print a bunch of "entering post_sync"/"leaving post_sync" messages with
>> addresses that can be resolved using kgdb. That'll help determine where
>> we're getting stuck.
>>
>> Index: sys/sys/eventhandler.h
>> ===
>> --- sys/sys/eventhandler.h   (revision 321401)
>> +++ sys/sys/eventhandler.h   (working copy)
>> @@ -85,7 +85,11 @@
>>  _t = (struct eventhandler_entry_ ## name *)_ep; \
>>  CTR1(KTR_EVH, "eventhandler_invoke: executing %p", \
>>  (void *)_t->eh_func);   \
>> +if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \
>> +printf("entering post_sync %p\n", (void 
>> *)_t->eh_func); \
>>  _t->eh_func(_ep->ee_arg , ## __VA_ARGS__);  \
>> +if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \
>> +printf("leaving post_sync %p\n", (void 
>> *)_t->eh_func); \
>>  EHL_LOCK((list));   \
>>  }   \
>>  }   \
>>
> 
> Thanks, this helped:
> 
> $ addr2line -f -e kernel.debug 0x80919c00
> g_raid_shutdown_post_sync
> /home/src/sys/geom/raid/g_raid.c:2458
> 
> That is GEOM_RAID's g_raid_shutdown_post_sync() that hangs if called just 
> before
> crashdump generation but works just fine during normal system shutdown.
> 
> I should note my graid's RAID1 is running in degraded state currently
> due to dead SSD module that does not respond. Here is part of boot log:
> 
> ahcich5: AHCI reset: device not ready after 31000ms (tfd = 0080)
> ahcich5: Poll timeout on slot 2 port 0
> ahcich5: is  cs 0004 ss  rs 0004 tfd 80 serr  
> cmd c217
> (aprobe2:ahcich5:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 
> 00 00
> (aprobe2:ahcich5:0:0:0): CAM status: Command timeout
> (aprobe2:ahcich5:0:0:0): Error 5, Retries exhausted
> run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config
> ahcich5: Poll timeout on slot 3 port 0
> ahcich5: is  cs 0008 ss  rs 0008 tfd 80 serr  
> cmd c317
> (aprobe2:ahcich5:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> (aprobe2:ahcich5:0:0:0): CAM status: Command timeout
> (aprobe2:ahcich5:0:0:0): Error 5, Retries exhausted
> [skip]
> Trying to mount root from ufs:/dev/raid/r0s4a [rw,noatime]...
> Root mount waiting for: GRAID-Intel
> Root mount waiting for: GRAID-Intel
> Root mount waiting for: GRAID-Intel
> Root mount waiting for: GRAID-Intel
> Root mount waiting for: GRAID-Intel
> GEOM_RAID: Intel-c291fe96: Force array start due to timeout.
> GEOM_RAID: Intel-c291fe96: Disk ada0 state changed from NONE to ACTIVE.
> GEOM_RAID: Intel-c291fe96: Subdisk r0:0-ada0 state changed from NONE to STALE.
> GEOM_RAID: Intel-c291fe96: Array started.
> GEOM_RAID: Intel-c291fe96: Subdisk r0:0-ada0 state changed from STALE to 
> ACTIVE.
> GEOM_RAID: Intel-c291fe96: Volume r0 state changed from STARTING to DEGRADED.
> GEOM_RAID: Intel-c291fe96: Provider raid/r0 for volume r0 created.
> 
>  
> 

-- 
Alexander Motin
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: stable/11 debugging kernel unable to produce crashdump again

2017-07-24 Thread Mark Johnston
On Tue, Jul 25, 2017 at 12:03:05AM +0700, Eugene Grosbein wrote:
> Thanks, this helped:
> 
> $ addr2line -f -e kernel.debug 0x80919c00
> g_raid_shutdown_post_sync
> /home/src/sys/geom/raid/g_raid.c:2458
> 
> That is GEOM_RAID's g_raid_shutdown_post_sync() that hangs if called just 
> before
> crashdump generation but works just fine during normal system shutdown.

I think graid probably needs a treatment similar to r301173/r316032.
g_raid_shutdown_post_sync() appears to be quite similar to the
corresponding gmirror handler. In particular, it just attempts to mark
the individual components as clean and destroy the GEOM, which is not
really safe after a panic.

diff --git a/sys/geom/raid/g_raid.c b/sys/geom/raid/g_raid.c
index 7a1fd8c5ce2e..aa2529d5466a 100644
--- a/sys/geom/raid/g_raid.c
+++ b/sys/geom/raid/g_raid.c
@@ -2461,6 +2461,9 @@ g_raid_shutdown_post_sync(void *arg, int howto)
struct g_raid_softc *sc;
struct g_raid_volume *vol;
 
+   if (panicstr != NULL)
+   return;
+
mp = arg;
g_topology_lock();
g_raid_shutdown = 1;
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: stable/11 debugging kernel unable to produce crashdump again

2017-07-24 Thread Eugene Grosbein
CCing mav@ as graid expert.

On 24.07.2017 08:44, Mark Johnston wrote:

>> Sadly, this time 11.1-STABLE r321371 SMP hangs instead of doing crashdump:
>>
>> - "call doadump" from DDB prompt works just fine;
>> - "shutdown -r now" reboots the system without problems;
>> - "sysctl debug.kdb.panic=1" triggers a panic just fine but system hangs 
>> just afer showing uptime
>> instead of continuing with crashdump generation; same if "real" panic occurs.
>>
>> Same for debug.minidump set to 1 or 0. How do I debug this?
> 
> I'm not able to reproduce the problem in bhyve using r321401. Looking
> at the code, the culprits might be cngrab(), or one of the
> shutdown_post_sync eventhandlers. Since you're apparently able to see
> the console output at the time of the panic, I guess it's probably the
> latter. Could you try your test with the patch below applied? It'll
> print a bunch of "entering post_sync"/"leaving post_sync" messages with
> addresses that can be resolved using kgdb. That'll help determine where
> we're getting stuck.
> 
> Index: sys/sys/eventhandler.h
> ===
> --- sys/sys/eventhandler.h(revision 321401)
> +++ sys/sys/eventhandler.h(working copy)
> @@ -85,7 +85,11 @@
>   _t = (struct eventhandler_entry_ ## name *)_ep; \
>   CTR1(KTR_EVH, "eventhandler_invoke: executing %p", \
>   (void *)_t->eh_func);   \
> + if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \
> + printf("entering post_sync %p\n", (void 
> *)_t->eh_func); \
>   _t->eh_func(_ep->ee_arg , ## __VA_ARGS__);  \
> + if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \
> + printf("leaving post_sync %p\n", (void 
> *)_t->eh_func); \
>   EHL_LOCK((list));   \
>   }   \
>   }   \
> 

Thanks, this helped:

$ addr2line -f -e kernel.debug 0x80919c00
g_raid_shutdown_post_sync
/home/src/sys/geom/raid/g_raid.c:2458

That is GEOM_RAID's g_raid_shutdown_post_sync() that hangs if called just before
crashdump generation but works just fine during normal system shutdown.

I should note my graid's RAID1 is running in degraded state currently
due to dead SSD module that does not respond. Here is part of boot log:

ahcich5: AHCI reset: device not ready after 31000ms (tfd = 0080)
ahcich5: Poll timeout on slot 2 port 0
ahcich5: is  cs 0004 ss  rs 0004 tfd 80 serr  
cmd c217
(aprobe2:ahcich5:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00 00 
00
(aprobe2:ahcich5:0:0:0): CAM status: Command timeout
(aprobe2:ahcich5:0:0:0): Error 5, Retries exhausted
run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config
ahcich5: Poll timeout on slot 3 port 0
ahcich5: is  cs 0008 ss  rs 0008 tfd 80 serr  
cmd c317
(aprobe2:ahcich5:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
(aprobe2:ahcich5:0:0:0): CAM status: Command timeout
(aprobe2:ahcich5:0:0:0): Error 5, Retries exhausted
[skip]
Trying to mount root from ufs:/dev/raid/r0s4a [rw,noatime]...
Root mount waiting for: GRAID-Intel
Root mount waiting for: GRAID-Intel
Root mount waiting for: GRAID-Intel
Root mount waiting for: GRAID-Intel
Root mount waiting for: GRAID-Intel
GEOM_RAID: Intel-c291fe96: Force array start due to timeout.
GEOM_RAID: Intel-c291fe96: Disk ada0 state changed from NONE to ACTIVE.
GEOM_RAID: Intel-c291fe96: Subdisk r0:0-ada0 state changed from NONE to STALE.
GEOM_RAID: Intel-c291fe96: Array started.
GEOM_RAID: Intel-c291fe96: Subdisk r0:0-ada0 state changed from STALE to ACTIVE.
GEOM_RAID: Intel-c291fe96: Volume r0 state changed from STARTING to DEGRADED.
GEOM_RAID: Intel-c291fe96: Provider raid/r0 for volume r0 created.

 

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching

2017-07-24 Thread Ken Merry
It is possible that the change I MFCed today (r321207 in head, r321415 in 
stable/11) is related, but Mark will have to boot his machine with the fix to 
see if it makes any difference.

What happened in my case on one particular machine (not on most machines in our 
lab running the same code) was that mps_wait_command() / mpr_wait_command() 
would not wait the full 60 seconds for a write to the DPM table (Driver 
Persistent Mapping) table in the controller.  So, it reported that there was a 
timeout.

There is a secondary bug that is still in the mps(4) / mpr(4) drivers when a 
timeout does happen — the error recovery code in the wait_command() routine 
reinitializes the controller, which clears out all the commands.  When the 
wait_command() routine returns, the command passed in has been freed, but the 
caller doesn’t know that.  So the caller (it happens in a number of places) 
dereferences a pointer to freed memory and the kernel panics.

I’m planning to fix that bug, too, if slm@ doesn’t get to it first, I’ve just 
had other bugs to fix first.

Eliminating bogus timeouts will eliminate most all of the sources of those 
panics anyway.

Ken
— 
Ken Merry
k...@freebsd.org



> On Jul 24, 2017, at 12:10 PM, Steven Hartland  wrote:
> 
> Based on your boot info you're using mps, so this could be related to mps fix 
> committed to stable/11 today by ken@
> https://svnweb.freebsd.org/changeset/base/321415 
> 
> 
> re@ cc'ed as this could cause hangs for others too on 11.1-RELEASE if this is 
> the case.
> 
> Regards
> Steve
> 
> On 24/07/2017 15:55, Mark Martinec wrote:
>>> Thanks! Tried it, and the message (or a backtrace) does not show 
>>> during a boot of a generic (patched) kernel, at least not in 
>>> the last 40-lines screen before the hang occurs. 
>>> (It also does not show during a "Safe mode" successful boot.) 
>> 
>> Btw (may or may not be relevant): after the above experiment 
>> I have rebooted the machine in "Safe mode" (generic kernel, 
>> EARLY_AP_STARTUP enabled by default) - and spent some time 
>> doing non-intensive interactive work on this host (web browsing, 
>> editor, shell, all under KDE) - and after about an hour the 
>> machine froze: clock display not updating, keyboard unresponsive, 
>> console virtual terminals inaccessible) - so had to reboot. 
>> According to fans speed the machine was idle. 
>> The /var/log/messages does not show anything of interest 
>> before the freeze. All disks are under ZFS. 
>> 
>> Can EARLY_AP_STARTUP have an effect also _after_ booting? 
>> This host never hung during normal work when EARLY_AP_STARTUP 
>> was disabled (or with 11.0 and earlier). 
>> 
>>   Mark 
>> ___ 
>> freebsd-stable@freebsd.org  mailing list 
>> https://lists.freebsd.org/mailman/listinfo/freebsd-stable 
>>  
>> To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org" 
>>  
> 

___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"

Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching

2017-07-24 Thread Steven Hartland
Based on your boot info you're using mps, so this could be related to 
mps fix committed to stable/11 today by ken@

https://svnweb.freebsd.org/changeset/base/321415

re@ cc'ed as this could cause hangs for others too on 11.1-RELEASE if 
this is the case.


Regards
Steve

On 24/07/2017 15:55, Mark Martinec wrote:

Thanks! Tried it, and the message (or a backtrace) does not show
during a boot of a generic (patched) kernel, at least not in
the last 40-lines screen before the hang occurs.
(It also does not show during a "Safe mode" successful boot.)


Btw (may or may not be relevant): after the above experiment
I have rebooted the machine in "Safe mode" (generic kernel,
EARLY_AP_STARTUP enabled by default) - and spent some time
doing non-intensive interactive work on this host (web browsing,
editor, shell, all under KDE) - and after about an hour the
machine froze: clock display not updating, keyboard unresponsive,
console virtual terminals inaccessible) - so had to reboot.
According to fans speed the machine was idle.
The /var/log/messages does not show anything of interest
before the freeze. All disks are under ZFS.

Can EARLY_AP_STARTUP have an effect also _after_ booting?
This host never hung during normal work when EARLY_AP_STARTUP
was disabled (or with 11.0 and earlier).

  Mark
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching

2017-07-24 Thread Mark Martinec

Thanks! Tried it, and the message (or a backtrace) does not show
during a boot of a generic (patched) kernel, at least not in
the last 40-lines screen before the hang occurs.
(It also does not show during a "Safe mode" successful boot.)


Btw (may or may not be relevant): after the above experiment
I have rebooted the machine in "Safe mode" (generic kernel,
EARLY_AP_STARTUP enabled by default) - and spent some time
doing non-intensive interactive work on this host (web browsing,
editor, shell, all under KDE) - and after about an hour the
machine froze: clock display not updating, keyboard unresponsive,
console virtual terminals inaccessible) - so had to reboot.
According to fans speed the machine was idle.
The /var/log/messages does not show anything of interest
before the freeze. All disks are under ZFS.

Can EARLY_AP_STARTUP have an effect also _after_ booting?
This host never hung during normal work when EARLY_AP_STARTUP
was disabled (or with 11.0 and earlier).

  Mark
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: The 11.1-RC3 can only boot and attach disks in "Safe mode", otherwise gets stuck attaching

2017-07-24 Thread Mark Martinec

2017-07-24 04:15, Mark Johnston wrote:

Could you try re-enabling EARLY_AP_STARTUP, applying the patch at the
end of this email, and see if the message "sleeping before eventtimer
init" appears in the boot output? If it does, it'll be followed by a
backtrace that might be useful for tracking down the hang. It might
produce false positives, but we'll see.


Thanks! Tried it, and the message (or a backtrace) does not show
during a boot of a generic (patched) kernel, at least not in
the last 40-lines screen before the hang occurs.
(It also does not show during a "Safe mode" successful boot.)

  Mark
___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: stable/11 debugging kernel unable to produce crashdump again

2017-07-24 Thread Eugene Grosbein
On 24.07.2017 08:44, Mark Johnston wrote:

>> Sadly, this time 11.1-STABLE r321371 SMP hangs instead of doing crashdump:
> 
> Is this amd64 GENERIC, or something else?

Custom kernel, amd64.

> 
>>
>> - "call doadump" from DDB prompt works just fine;
>> - "shutdown -r now" reboots the system without problems;
>> - "sysctl debug.kdb.panic=1" triggers a panic just fine but system hangs 
>> just afer showing uptime
>> instead of continuing with crashdump generation; same if "real" panic occurs.
>>
>> Same for debug.minidump set to 1 or 0. How do I debug this?
> 
> I'm not able to reproduce the problem in bhyve using r321401. Looking
> at the code, the culprits might be cngrab(), or one of the
> shutdown_post_sync eventhandlers. Since you're apparently able to see
> the console output at the time of the panic, I guess it's probably the
> latter. Could you try your test with the patch below applied? It'll
> print a bunch of "entering post_sync"/"leaving post_sync" messages with
> addresses that can be resolved using kgdb. That'll help determine where
> we're getting stuck.
> 
> Index: sys/sys/eventhandler.h
> ===
> --- sys/sys/eventhandler.h(revision 321401)
> +++ sys/sys/eventhandler.h(working copy)
> @@ -85,7 +85,11 @@
>   _t = (struct eventhandler_entry_ ## name *)_ep; \
>   CTR1(KTR_EVH, "eventhandler_invoke: executing %p", \
>   (void *)_t->eh_func);   \
> + if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \
> + printf("entering post_sync %p\n", (void 
> *)_t->eh_func); \
>   _t->eh_func(_ep->ee_arg , ## __VA_ARGS__);  \
> + if (strcmp(__STRING(name), "shutdown_post_sync") == 0) \
> + printf("leaving post_sync %p\n", (void 
> *)_t->eh_func); \
>   EHL_LOCK((list));   \
>   }   \
>   }   \
> 

I'l try this evening.


___
freebsd-stable@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"