----- Original Message -----
> From: "Stephan Budach" <stephan.bud...@jvm.de>
> To: "Discussion list for OpenIndiana" <openindiana-discuss@openindiana.org>
> Sent: Tuesday, January 16, 2018 14:15:37
> Subject: [OpenIndiana-discuss] ZFS hangs - causes host to panic
> 
> 
> Hi,
> 
> 
> I am currently putting my new NVMe servers through their paces and I have
> already experienced two panics on one of those hosts.
> After it took "forever" to write the crash dump, I found this in the syslog
> after reboot:
> 
> 
> 
> Jan 16 13:25:29 nfsvmpool09 savecore: [ID 570001 auth.error] reboot
> after panic: I/O to pool 'nvmeTank02' appears to be hung.
> Jan 16 13:25:29 nfsvmpool09 savecore: [ID 771660 auth.error] Panic
> crashdump pending on dump device but dumpadm -n in effect; run
> savecore(1M) manually to extract. Image UUID
> 995846d5-8c94-4f68-bada-e05ae5e4cb25(fault-management initiated).
> 
> 
> I ran mdb against the crash dump, but I am still a dummy when it comes to
> reading this information:
> 
> 
> 
> root@nfsvmpool09:/var/crash/nfsvmpool09# mdb unix.0 vmcore.0
> Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc
> apix scsi_vhci zfs sata sd ip hook neti sockfs arp usba fctl stmf
> stmf_sbd mm lofs i40e idm cpc crypto fcip fcp random ufs logindmux
> nsmb ptm smbsrv nfs sppp ipc ]
> > $C
> ffffd000f5dd79d0 vpanic()
> ffffd000f5dd7a20 vdev_deadman+0x10b(ffffd0320fb69980)
> ffffd000f5dd7a70 vdev_deadman+0x4a(ffffd0333b018940)
> ffffd000f5dd7ac0 vdev_deadman+0x4a(ffffd03228f796c0)
> ffffd000f5dd7af0 spa_deadman+0xad(ffffd03229543000)
> ffffd000f5dd7b90 cyclic_softint+0xfd(ffffd031eac4db00, 0)
> ffffd000f5dd7ba0 cbe_low_level+0x14()
> ffffd000f5dd7bf0 av_dispatch_softvect+0x78(2)
> ffffd000f5dd7c20 apix_dispatch_softint+0x35(0, 0)
> ffffd000f5da1990 switch_sp_and_call+0x13()
> ffffd000f5da19e0 apix_do_softint+0x6c(ffffd000f5da1a50)
> ffffd000f5da1a40 apix_do_interrupt+0x362(ffffd000f5da1a50, 2)
> ffffd000f5da1a50 _interrupt+0xba()
> ffffd000f5da1bc0 acpi_cpu_cstate+0x11b(ffffd031e98a43e0)
> ffffd000f5da1bf0 cpu_acpi_idle+0x8d()
> ffffd000f5da1c00 cpu_idle_adaptive+0x13()
> ffffd000f5da1c20 idle+0xa7()
> ffffd000f5da1c30 thread_start+8()
> > 
> 
> 
> Can anybody make something useful of that?
> 
> 
> Thanks,
> Stephan
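
Following up on the dump above: a few more dcmds I still want to run from the
same "mdb unix.0 vmcore.0" session to see where the I/O is actually stuck; a
minimal sketch, assuming these dcmds are present in this build's zfs mdb module:

  ::spa -v          - the pools with their vdev trees and vdev states
  ::zio_state       - outstanding zios and the pipeline stage they sit in
  ::stacks -m zfs   - summarized kernel stacks of threads in the zfs module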


I have been trying to hunt this down further, as it only seems to affect some
NVMe SSDs, and consequently the error moves along with wherever I put these
NVMe SSDs. What seems to happen is that at some random point, writes to the
NVMe SSDs stop completing, until finally the ZFS deadman timer kicks in and
panics the host.
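
As a stopgap while I chase this, the deadman behaviour itself should be
tunable; a minimal sketch, assuming the stock illumos tunable names
zfs_deadman_enabled and zfs_deadman_synctime_ms:

  # check the current values on the running kernel
  echo "zfs_deadman_enabled/D" | mdb -k
  echo "zfs_deadman_synctime_ms/E" | mdb -k

and in /etc/system (takes effect after a reboot), to keep a hung pool from
panicking the whole box, or to give the I/O more time before the deadman fires:

  * assumption: stock illumos ZFS deadman tunables, example value of 30 minutes
  set zfs:zfs_deadman_enabled = 0
  set zfs:zfs_deadman_synctime_ms = 1800000

That of course only hides the hang, it does not fix the SSDs.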

What I was able to gather is that at that point the SSD becomes 100% busy, with
no actual transfer between the device and the host. iostat -xenM will show
something like this:

                            extended device statistics       ---- errors ---
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0,0    0,0    0,0    0,0  0,0  1,0    0,0    0,0   0 100   0  27   0  27 c21t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   3   0   3 c14t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   2   0   2 c29t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   3   0   3 c6t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   3   0   3 c15t1d0
    0,0    0,0    0,0    0,0  0,0  1,0    0,0    0,0   0 100   0   2   0   2 c13t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0  28   0  28 c23t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   2   0   2 c16t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   2   0   2 c24t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0  27   0  27 c19t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0  27   0  27 c22t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   3   0   3 c12t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   2   0   2 c17t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   3   0   3 c7t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0  27   0  27 c20t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   3   0   3 c10t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   3   0   3 c26t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   2   0   2 c8t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   3   0   3 c25t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   2   0   2 c27t1d0
 1844,2    0,0   14,4    0,0  0,0  0,4    0,0    0,2   0  39   0  27   0  27 c18t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   3   0   3 c11t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   3   0   3 c9t1d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   2   0   2 c28t1d0
    0,0    0,0    0,0    0,0 98,0  1,0    0,0    0,0 100 100   0   0   0   0 nvmeTank01
    0,0    0,0    0,0    0,0 104,0  1,0    0,0    0,0 100 100   0   0   0   0 nvmeTank02
 1844,2    0,0   14,4    0,0  0,0  0,5    0,0    0,3   0  50   0   0   0   0 poolc18d1t0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   0   0   0 rpool
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   0   0   0 c4t0d0
    0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0   0   0   0   0 c4t1d0

c21t1d0 and c13t1d0 are blocking their respective zpools here, but I have also
seen some other SSDs behave this way, so I am wondering how likely it is that I
got a really bad batch of NVMe drives, since at the moment at least three
devices exhibit this odd behaviour.
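
To rule the drives themselves in or out, the next thing I will look at are the
FMA ereports and the drives' own log pages; a rough sketch, where the
controller name nvme1 and the nvmeadm log-page names are my assumption:

  # FMA telemetry around the hang
  fmdump -eV
  fmadm faulty

  # drive health/error log pages (assumed controller name and log-page names)
  nvmeadm list
  nvmeadm get-logpage nvme1 health
  nvmeadm get-logpage nvme1 error

If the error counts there line up with the h/w column from the iostat output
above, that would point at the drives rather than the driver.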



