On May 12, 2012, at 4:52 AM, Jim Klimov wrote:

> 2012-05-11 14:22, Jim Klimov wrote:
>> What conditions can cause the reset of the resilvering
>> process? My lost-and-found disk can't get back into the
>> pool because of resilvers restarting...
> 
> FOLLOW-UP AND NEW QUESTIONS
> 
> Here is a new piece of evidence - I've finally got something
> out of fmdump: a series of several (5) retries ending with a
> fail, dated about 75 seconds before the resilver restarts (more below).
> Not a squeak in zpool status, dmesg or /dev/console.
> 
> I guess I must assume that the disk is indeed dying, losing
> its connection or something like that after a random time (my
> resilvers restart after 15 min - 5 hrs). At the least, a run of
> the SMART long self-test is in order, while the pool rebuilds
> onto another disk (the hot spare) instead of trying to update
> this one, which used to be in the pool.

Please share if SMART offers anything useful.
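
Something along these lines should start it (a sketch, assuming
smartmontools is installed; the exact device node and -d type depend
on your HBA, so adjust as needed):

# smartctl -d sat -t long /dev/rdsk/c1t2d0s0      # start the extended self-test
# smartctl -d sat -l selftest /dev/rdsk/c1t2d0s0  # read the result once it finishes
# smartctl -d sat -a /dev/rdsk/c1t2d0s0           # reallocated/pending sectors, CRC counters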

> Anyhow, the information on the ex-pool disk is probably not
> being used anyway - from iostat I see that it is only written
> to, with few to zero reads for minutes at a time - so I'd lose
> nothing by replacing it with a blank drive (which is strange,
> though)...
> 
> I also guess that the disk reappears after something like
> an unlogged bus reset or whatever, and this event causes
> the resilvering to restart from scratch.

This makes sense. 
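
An unlogged reset may still show up in the per-device error counters,
so these are worth a look (device name taken from your format output
below; adjust as needed):

# iostat -En c1t2d0                      # Soft/Hard/Transport error totals for the disk
# iostat -xn 5 | egrep 'device|c1t2d0'   # watch its read/write traffic during the resilver
# fmdump -e | grep disk.tran             # one-line summary of the transport ereports
# fmadm faulty                           # anything FMA has already diagnosed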

> Q: Would this be the same in OI_151a, or would it continue
> resilvering from where it left off? I think I had the pool
> exported once during a resilver, and it picked up again at the
> same percentage, so it seems possible ;)
> 
> The best course of action would be to get those people to fully
> replace the untrustworthy disk... Or at least to pull and push
> it a bit - maybe its contacts just got plain dirty/oxidized and
> the disk simply needs re-seating in the enclosure...

Not likely to help. SATA?
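If they do agree to swap it out, cutting over to the hot spare is
straightforward (a sketch - the spare name c0t7d0 here is made up,
use whatever zpool status lists under 'spares'):

# zpool replace pond c1t2d0 c0t7d0   # resilver onto the spare instead of the flaky disk
# zpool status -v pond               # watch the resilver progress
# zpool detach pond c1t2d0           # once it completes, drop the old disk for good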
 -- richard

> 
> I'd like someone to please confirm or deny my hypotheses
> and guesses :)
> 
> DETAILS
> 
> According to format, the disk in the fmdump reports tailed below
> is indeed the one I'm trying to resilver onto:
> 
> # format | gegrep -B1 '/pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0'
>      10. c1t2d0 <ATA-SEAGATE ST32500N-3AZQ-232.88GB>
>          /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
> 
> From the pool history we see the resilvering restart
> timestamps... they closely match the retry-fail cycles,
> following them by some 75 seconds:
> 
> # zpool history -il pond | tail; date
> ....
> 2012-05-12.10:43:35 [internal pool scrub done txg:91072311] complete=0 [user 
> root on thumper]
> 2012-05-12.10:43:36 [internal pool scrub txg:91072311] func=1 mintxg=41 
> maxtxg=91051854 [user root on thumper]
> 2012-05-12.14:12:44 [internal pool scrub done txg:91072723] complete=0 [user 
> root on thumper]
> 2012-05-12.14:12:45 [internal pool scrub txg:91072723] func=1 mintxg=41 
> maxtxg=91051854 [user root on thumper]
> Sat May 12 15:45:50 MSK 2012
> 
> And last but not least - the FMDUMP messages...
> 
> # fmdump -eV | tail -150
> 
> ...
> May 12 2012 10:42:19.559305872 ereport.io.scsi.cmd.disk.tran
> nvlist version: 0
>        class = ereport.io.scsi.cmd.disk.tran
>        ena = 0x928e32c9d1700401
>        detector = (embedded nvlist)
>        nvlist version: 0
>                version = 0x0
>                scheme = dev
>                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>        (end detector)
> 
>        driver-assessment = fail
>        op-code = 0x28
>        cdb = 0x28 0x0 0x14 0xd5 0x53 0x0 0x0 0x0 0x80 0x0
>        pkt-reason = 0x1
>        pkt-state = 0x37
>        pkt-stats = 0x0
>        __ttl = 0x1
>        __tod = 0x4fae064b 0x21565490
> 
> May 12 2012 14:11:27.754940954 ereport.io.scsi.cmd.disk.tran
> nvlist version: 0
>        class = ereport.io.scsi.cmd.disk.tran
>        ena = 0x492896f3a8500401
>        detector = (embedded nvlist)
>        nvlist version: 0
>                version = 0x0
>                scheme = dev
>                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>        (end detector)
> 
>        driver-assessment = retry
>        op-code = 0x28
>        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
>        pkt-reason = 0x1
>        pkt-state = 0x37
>        pkt-stats = 0x0
>        __ttl = 0x1
>        __tod = 0x4fae374f 0x2cff7c1a
> 
> May 12 2012 14:11:27.754905021 ereport.io.scsi.cmd.disk.tran
> nvlist version: 0
>        class = ereport.io.scsi.cmd.disk.tran
>        ena = 0x492896f3a8500401
>        detector = (embedded nvlist)
>        nvlist version: 0
>                version = 0x0
>                scheme = dev
>                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>        (end detector)
> 
>        driver-assessment = retry
>        op-code = 0x28
>        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
>        pkt-reason = 0x1
>        pkt-state = 0x37
>        pkt-stats = 0x0
>        __ttl = 0x1
>        __tod = 0x4fae374f 0x2cfeefbd
> 
> May 12 2012 14:11:27.754866050 ereport.io.scsi.cmd.disk.tran
> nvlist version: 0
>        class = ereport.io.scsi.cmd.disk.tran
>        ena = 0x492896f3a8500401
>        detector = (embedded nvlist)
>        nvlist version: 0
>                version = 0x0
>                scheme = dev
>                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>        (end detector)
> 
>        driver-assessment = retry
>        op-code = 0x28
>        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
>        pkt-reason = 0x1
>        pkt-state = 0x37
>        pkt-stats = 0x0
>        __ttl = 0x1
>        __tod = 0x4fae374f 0x2cfe5782
> 
> May 12 2012 14:11:27.754793613 ereport.io.scsi.cmd.disk.tran
> nvlist version: 0
>        class = ereport.io.scsi.cmd.disk.tran
>        ena = 0x492896f3a8500401
>        detector = (embedded nvlist)
>        nvlist version: 0
>                version = 0x0
>                scheme = dev
>                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>        (end detector)
> 
>        driver-assessment = retry
>        op-code = 0x28
>        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
>        pkt-reason = 0x1
>        pkt-state = 0x37
>        pkt-stats = 0x0
>        __ttl = 0x1
>        __tod = 0x4fae374f 0x2cfd3c8d
> 
> May 12 2012 14:11:27.754757103 ereport.io.scsi.cmd.disk.tran
> nvlist version: 0
>        class = ereport.io.scsi.cmd.disk.tran
>        ena = 0x492896f3a8500401
>        detector = (embedded nvlist)
>        nvlist version: 0
>                version = 0x0
>                scheme = dev
>                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>        (end detector)
> 
>        driver-assessment = retry
>        op-code = 0x28
>        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
>        pkt-reason = 0x1
>        pkt-state = 0x37
>        pkt-stats = 0x0
>        __ttl = 0x1
>        __tod = 0x4fae374f 0x2cfcadef
> 
> May 12 2012 14:11:27.754721778 ereport.io.scsi.cmd.disk.tran
> nvlist version: 0
>        class = ereport.io.scsi.cmd.disk.tran
>        ena = 0x492896f3a8500401
>        detector = (embedded nvlist)
>        nvlist version: 0
>                version = 0x0
>                scheme = dev
>                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>        (end detector)
> 
>        driver-assessment = fail
>        op-code = 0x28
>        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
>        pkt-reason = 0x1
>        pkt-state = 0x37
>        pkt-stats = 0x0
>        __ttl = 0x1
>        __tod = 0x4fae374f 0x2cfc23f2
> 
> 
> //Jim
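
By the way, the __tod pair in those ereports is just the event time as
seconds.nanoseconds in hex, so you can line them up against the zpool
history entries directly, e.g.:

# perl -le 'print scalar gmtime(0x4fae374f)'
Sat May 12 10:11:27 2012

That is 14:11:27 MSK - roughly 75 seconds before the 14:12:44/14:12:45
scrub-done/scrub pair in the pool history, which matches your observation.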

--
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
