Re: [smartos-discuss] Pool scrub causes panic via spa_deadman and vdev_deadman

Brian Bennett Sat, 10 Sep 2016 21:30:00 -0700

This looks substantially similar to OS-2415
(https://smartos.org/bugview/OS-2415, https://www.illumos.org/issues/4013).
That issue has been closed for over three years, but the fix was just a
revert of a commit from three years before that. I'm not sure anyone ever
followed up to properly fix the original bug (any follow-up would surely have
been given a new ID, which makes it hard for me to track). One of our kernel
developers may remember this and know more. If this is indeed the same issue,
you've hit a bug that's at least six years old.


The description of OS-2415 seems to point to faulty hardware, though it's
still an OS bug that the system panicked instead of simply faulting the
device. Since you know you have a flaky drive, that could be what's going on
here. In any event, it's probably best to replace that drive as soon as
possible.

I'd like to get a copy of the crash dump from you, if possible (I can give you 
a signed Manta URL to upload to if that helps). That will help confirm whether 
or not it's actually related to OS-2415, and hopefully get the underlying issue 
corrected.
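
For readers unfamiliar with the deadman check, here is a rough Python sketch of
the kind of walk the stack trace quoted below suggests. It is an illustration
only, not the illumos source: the `Vdev` class, the `build_titan` helper, the
`oldest_io_age_ms` field, and the threshold value are all hypothetical, and the
real kernel panics rather than raising an exception. The three nested
vdev_deadman frames in the trace would be consistent with a walk from the root
vdev through raidz2-0 down to a single disk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class PoolHungError(Exception):
    """Stands in for the kernel panic in this sketch."""


@dataclass
class Vdev:
    name: str
    children: List["Vdev"] = field(default_factory=list)
    # Age in ms of the oldest outstanding I/O on a leaf; None if nothing queued.
    oldest_io_age_ms: Optional[int] = None


def vdev_deadman(vd: Vdev, pool: str, threshold_ms: int) -> None:
    # Interior vdevs (root, raidz2-0) only recurse into their children.
    for child in vd.children:
        vdev_deadman(child, pool, threshold_ms)
    # Leaves carry the I/O; complain if the oldest one has waited too long.
    if not vd.children and vd.oldest_io_age_ms is not None:
        if vd.oldest_io_age_ms > threshold_ms:
            raise PoolHungError(f"I/O to pool '{pool}' appears to be hung.")


def spa_deadman(root: Vdev, pool: str, threshold_ms: int) -> None:
    # Periodic entry point: start the walk at the pool's root vdev.
    vdev_deadman(root, pool, threshold_ms)


def build_titan(stuck_disk: Optional[str] = None, stuck_age_ms: int = 0) -> Vdev:
    # Disk names copied from the zpool status output in the quoted message.
    disks = ["c4t0d0", "c4t1d0", "c0t0d0", "c0t1d0",
             "c1t0d0", "c1t1d0", "c1t2d0", "c1t3d0"]
    leaves = [Vdev(d, oldest_io_age_ms=stuck_age_ms if d == stuck_disk else None)
              for d in disks]
    return Vdev("root", [Vdev("raidz2-0", leaves)])


if __name__ == "__main__":
    # Illustrative threshold only; not the kernel's actual tunable value.
    try:
        spa_deadman(build_titan("c1t2d0", 5000), "titan", threshold_ms=1000)
    except PoolHungError as err:
        print(err)
```

The point of the sketch is that a single leaf with one very old outstanding
I/O is enough to take the whole system down, which is why a marginal disk,
cable, or controller can surface as a panic rather than a faulted device.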

-- 
Brian Bennett
Systems Engineer, Cloud Operations
Joyent, Inc. | www.joyent.com

> On Sep 10, 2016, at 2:50 PM, Daniel Carosone <[email protected]> wrote:
> 
> Hi all, 
> 
> I get the following panic when scrubbing a pool. After the panic, the scrub
> resumes and may panic again, possibly several times. However, if left to
> run, the scrub eventually finishes and the pool is reported clean.
> 
> The pool is an 8-way raidz2, spread across several AHCI controllers (Intel
> and Marvell) on one of those ASRock Avoton C2750D4I Mini-ITX boards that
> were all the rage a couple of years ago for home servers. I'm only just
> getting around to putting it into full service, but it's been running fine
> (with fewer drives) until now. It's running the current PI.
> 
> I'm assuming there's some load-related locking issue, and hoping it's a
> solvable software issue rather than bad hardware. I haven't yet really
> looked at the BIOS options to see if there are some interrupt-mapping
> options that might move the issue around somehow. I can try that, but I'd
> prefer a deliberate set of tests to random shuffling.
> 
> There's also a flaky SSD in the zones pool (on sata0/0, c3t0d0). It works
> fine most of the time, but occasionally goes offline until I power off,
> fiddle with it and retry. I suspect a cable problem, and will be pulling it
> out to test separately. It doesn't share a controller with the pool, and
> has been offline through a full panic cycle, so I'm hoping it is unrelated.
> At least, a failed drive shouldn't be able to cause this, so I'm posting
> before trying that anyway.
> 
> Some crash and config details are below. Suggestions and requests for
> further info are welcome (I presume info on driver and interrupt status
> would be useful, but I don't have the mdb incantations).
> 
> [root@d0-50-99-46-c2-00 /var/crash/volatile]# mdb -e '::status;$C' vmcore.5 
> debugging crash dump vmcore.5 (64-bit) from d0-50-99-46-c2-00
> operating system: 5.11 joyent_20160906T181054Z (i86pc)
> image uuid: (not set)
> panic message: I/O to pool 'titan' appears to be hung.
> dump content: kernel pages only
> ffffff003d54f9d0 vpanic()
> ffffff003d54fa20 vdev_deadman+0x10b(ffffff0d08289380)
> ffffff003d54fa70 vdev_deadman+0x4a(ffffff0d10a28640)
> ffffff003d54fac0 vdev_deadman+0x4a(ffffff0d10bbd040)
> ffffff003d54faf0 spa_deadman+0xad(ffffff0d11210000)
> ffffff003d54fb90 cyclic_softint+0xfd(ffffff0d07de9a80, 0)
> ffffff003d54fba0 cbe_low_level+0x14()
> ffffff003d54fbf0 av_dispatch_softvect+0x78(2)
> ffffff003d54fc20 dispatch_softint+0x39(0, 0)
> ffffff003d4e8a20 switch_sp_and_call+0x13()
> ffffff003d4e8a60 dosoftint+0x44(ffffff003d4e8ad0)
> ffffff003d4e8ac0 do_interrupt+0xba(ffffff003d4e8ad0, 0)
> ffffff003d4e8ad0 _interrupt+0xba()
> ffffff003d4e8bc0 i86_mwait+0xd()
> ffffff003d4e8c00 cpu_idle_mwait+0x109()
> ffffff003d4e8c20 idle+0xa7()
> ffffff003d4e8c30 thread_start+8()
> 
> [root@d0-50-99-46-c2-00 /var/crash/volatile]# zpool status -v titan
>   pool: titan
>  state: ONLINE
>   scan: scrub repaired 0 in 3h53m with 0 errors on Sat Sep 10 13:59:42 2016
> config:
> 
>     NAME        STATE     READ WRITE CKSUM
>     titan       ONLINE       0     0     0
>       raidz2-0  ONLINE       0     0     0
>         c4t0d0  ONLINE       0     0     0
>         c4t1d0  ONLINE       0     0     0
>         c0t0d0  ONLINE       0     0     0
>         c0t1d0  ONLINE       0     0     0
>         c1t0d0  ONLINE       0     0     0
>         c1t1d0  ONLINE       0     0     0
>         c1t2d0  ONLINE       0     0     0
>         c1t3d0  ONLINE       0     0     0
> 
> [root@d0-50-99-46-c2-00 /var/crash/volatile]# cfgadm -lv 
> Ap_Id                          Receptacle   Occupant     Condition  Information
> When         Type         Busy     Phys_Id
> sata0/0                        connected    unconfigured unknown    Mod:  FRev:  SN:
> unavailable  unknown      n        /devices/pci@0,0/pci1849,1f22@17:0
> sata0/1::dsk/c3t1d0            connected    configured   ok         Mod: INTEL SSDSC2BW240A4 FRev: DC32 SN: CVDA44520AGG2403GN
> unavailable  disk         n        /devices/pci@0,0/pci1849,1f22@17:1
> sata0/2::dsk/c3t2d0            connected    configured   ok         Mod: KINGSTON SVP200S37A240G FRev: 502ABBF0 SN: 50026B722C0629B9
> unavailable  disk         n        /devices/pci@0,0/pci1849,1f22@17:2
> sata0/3                        empty        unconfigured ok
> unavailable  sata-port    n        /devices/pci@0,0/pci1849,1f22@17:3
> sata1/0::dsk/c4t0d0            connected    configured   ok         Mod: WDC WD80EFZX-68UW8N0 FRev: 83.H0A83 SN: VKJA71SX
> unavailable  disk         n        /devices/pci@0,0/pci1849,1f32@18:0
> sata1/1::dsk/c4t1d0            connected    configured   ok         Mod: WDC WD80EFZX-68UW8N0 FRev: 83.H0A83 SN: VLGUM69Z
> unavailable  disk         n        /devices/pci@0,0/pci1849,1f32@18:1
> sata2/0::dsk/c0t0d0            connected    configured   ok         Mod: WDC WD80EFZX-68UW8N0 FRev: 83.H0A83 SN: VKJ9UZ6X
> unavailable  disk         n        /devices/pci@0,0/pci8086,1f12@3/pci10b5,8608@0/pci10b5,8608@1/pci1849,9172@0:0
> sata2/1::dsk/c0t1d0            connected    configured   ok         Mod: WDC WD80EFZX-68UW8N0 FRev: 83.H0A83 SN: VKJAVGNX
> unavailable  disk         n        /devices/pci@0,0/pci8086,1f12@3/pci10b5,8608@0/pci10b5,8608@1/pci1849,9172@0:1
> sata3/0::dsk/c1t0d0            connected    configured   ok         Mod: WDC WD80EFZX-68UW8N0 FRev: 83.H0A83 SN: VKJYRS2Y
> unavailable  disk         n        /devices/pci@0,0/pci8086,1f13@4/pci1849,9230@0:0
> sata3/1::dsk/c1t1d0            connected    configured   ok         Mod: WDC WD80EFZX-68UW8N0 FRev: 83.H0A83 SN: VLGUJSDZ
> unavailable  disk         n        /devices/pci@0,0/pci8086,1f13@4/pci1849,9230@0:1
> sata3/2::dsk/c1t2d0            connected    configured   ok         Mod: WDC WD80EFZX-68UW8N0 FRev: 83.H0A83 SN: VLGUNDZZ
> unavailable  disk         n        /devices/pci@0,0/pci8086,1f13@4/pci1849,9230@0:2
> sata3/3::dsk/c1t3d0            connected    configured   ok         Mod: WDC WD80EFZX-68UW8N0 FRev: 83.H0A83 SN: VLGUJNXZ
> unavailable  disk         n        /devices/pci@0,0/pci8086,1f13@4/pci1849,9230@0:3
> sata3/4                        empty        unconfigured ok
> unavailable  sata-port    n        /devices/pci@0,0/pci8086,1f13@4/pci1849,9230@0:4
> sata3/5                        empty        unconfigured ok
> unavailable  sata-port    n        /devices/pci@0,0/pci8086,1f13@4/pci1849,9230@0:5
> sata3/6                        empty        unconfigured ok
> unavailable  sata-port    n        /devices/pci@0,0/pci8086,1f13@4/pci1849,9230@0:6
> sata3/7                        connected    unconfigured ok         Mod: MARVELL VIRTUALL FRev: 1.09 SN:
> unavailable  processor    n        /devices/pci@0,0/pci8086,1f13@4/pci1849,9230@0:7
> usb0/1                         connected    configured   ok         Mfg: <undef>  Product: <undef>  NConfigs: 1  Config: 0  <no cfg str descr>
> # usb stuff below here trimmed
> 
> -- 
> Dan.
