On May 12, 2012, at 4:52 AM, Jim Klimov wrote:

> 2012-05-11 14:22, Jim Klimov wrote:
>> What conditions can cause the reset of the resilvering
>> process? My lost-and-found disk can't get back into the
>> pool because of resilvers restarting...
>
> FOLLOW-UP AND NEW QUESTIONS
>
> Here is a new piece of evidence - I've finally got something
> out of fmdump: a series of several (5) retries ending with a
> fail, dated 75 seconds before the resilvers restart (more below).
> Not a squeak in zpool status, dmesg or /dev/console.
>
> I guess I must assume that the disk is indeed dying, losing
> its connection or something like that after a random time (my
> resilvers restart after 15 min to 5 hrs), and that at least a run
> of a SMART long self-test is in order, while the pool tries to
> rebuild onto another disk (the hot spare) instead of trying
> to update this one, which was already in the pool.
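A rough sketch of that plan, assuming smartmontools is installed and the suspect disk is the c1t2d0 identified later in this message; the spare name c5t7d0 is made up for illustration, and the `-d sat` flag may or may not be needed depending on how the controller presents the drive:

```shell
# Kick off a SMART extended (long) self-test on the suspect disk.
# c1t2d0 is the disk from this thread; -d sat is a guess for a
# SATA drive behind this Marvell controller - adjust as needed.
smartctl -d sat -t long /dev/rdsk/c1t2d0s0

# Check back after the test duration; the self-test log and the
# reallocated/pending-sector attributes are the interesting parts.
smartctl -d sat -l selftest -A /dev/rdsk/c1t2d0s0

# Meanwhile, resilver onto the hot spare instead of the flaky disk
# (c5t7d0 is a hypothetical spare name, not from the thread).
zpool replace pond c1t2d0 c5t7d0
zpool status -v pond
```

These commands obviously need the real hardware, so treat them as a sketch of the procedure rather than a paste-ready recipe.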
Please share if SMART offers anything useful.

> Anyhow, the information on the ex-pool disk is likely unused
> anyway - from iostat I see that it is only written to,
> with few to zero reads for minutes at a time - so I'd lose
> nothing by replacing it with a blank drive (that's strange,
> though)...
>
> I also guess that the disk gets found again after something
> like an unlogged bus reset or whatever, and this event causes
> the resilvering to restart from scratch.

This makes sense.

> Q: Would this be the same in OI_151a, or would it continue
> resilvering from where it left off? I think I had the pool
> exported once during a resilver, and it restarted from the
> same percentage counter, so it is possible ;)
>
> The best course of action would be to get those people to
> fully replace the untrustworthy disk... Or at least pull and
> push it a bit - maybe its contacts just got plain dirty or
> oxidized and the disk should be re-seated in the enclosure...

Not likely to help. SATA?
 -- richard

> I'd like someone to please confirm or deny my hypotheses
> and guesses :)
>
> DETAILS
>
> According to format, the disk in the tailed fmdump reports
> below is indeed the one I'm trying to resilver onto:
>
> # format | gegrep -B1 '/pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0'
>        10. c1t2d0 <ATA-SEAGATE ST32500N-3AZQ-232.88GB>
>            /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>
> From the pool history we see the resilvering restart
> timestamps... they closely match the retry-fail cycles,
> following them by some 75 seconds:
>
> # zpool history -il pond | tail; date
> ....
> 2012-05-12.10:43:35 [internal pool scrub done txg:91072311] complete=0 [user root on thumper]
> 2012-05-12.10:43:36 [internal pool scrub txg:91072311] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
> 2012-05-12.14:12:44 [internal pool scrub done txg:91072723] complete=0 [user root on thumper]
> 2012-05-12.14:12:45 [internal pool scrub txg:91072723] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
> Sat May 12 15:45:50 MSK 2012
>
> And last but not least - the FMDUMP messages...
>
> # fmdump -eV | tail -150
> ...
> May 12 2012 10:42:19.559305872 ereport.io.scsi.cmd.disk.tran
> nvlist version: 0
>         class = ereport.io.scsi.cmd.disk.tran
>         ena = 0x928e32c9d1700401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>         (end detector)
>         driver-assessment = fail
>         op-code = 0x28
>         cdb = 0x28 0x0 0x14 0xd5 0x53 0x0 0x0 0x0 0x80 0x0
>         pkt-reason = 0x1
>         pkt-state = 0x37
>         pkt-stats = 0x0
>         __ttl = 0x1
>         __tod = 0x4fae064b 0x21565490
>
> May 12 2012 14:11:27.754940954 ereport.io.scsi.cmd.disk.tran
> nvlist version: 0
>         class = ereport.io.scsi.cmd.disk.tran
>         ena = 0x492896f3a8500401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>         (end detector)
>         driver-assessment = retry
>         op-code = 0x28
>         cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
>         pkt-reason = 0x1
>         pkt-state = 0x37
>         pkt-stats = 0x0
>         __ttl = 0x1
>         __tod = 0x4fae374f 0x2cff7c1a
>
> May 12 2012 14:11:27.754905021 ereport.io.scsi.cmd.disk.tran
> nvlist version: 0
>         class = ereport.io.scsi.cmd.disk.tran
>         ena = 0x492896f3a8500401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>         (end detector)
>         driver-assessment = retry
>         op-code = 0x28
>         cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
>         pkt-reason = 0x1
>         pkt-state = 0x37
>         pkt-stats = 0x0
>         __ttl = 0x1
>         __tod = 0x4fae374f 0x2cfeefbd
>
> May 12 2012 14:11:27.754866050 ereport.io.scsi.cmd.disk.tran
> nvlist version: 0
>         class = ereport.io.scsi.cmd.disk.tran
>         ena = 0x492896f3a8500401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>         (end detector)
>         driver-assessment = retry
>         op-code = 0x28
>         cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
>         pkt-reason = 0x1
>         pkt-state = 0x37
>         pkt-stats = 0x0
>         __ttl = 0x1
>         __tod = 0x4fae374f 0x2cfe5782
>
> May 12 2012 14:11:27.754793613 ereport.io.scsi.cmd.disk.tran
> nvlist version: 0
>         class = ereport.io.scsi.cmd.disk.tran
>         ena = 0x492896f3a8500401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>         (end detector)
>         driver-assessment = retry
>         op-code = 0x28
>         cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
>         pkt-reason = 0x1
>         pkt-state = 0x37
>         pkt-stats = 0x0
>         __ttl = 0x1
>         __tod = 0x4fae374f 0x2cfd3c8d
>
> May 12 2012 14:11:27.754757103 ereport.io.scsi.cmd.disk.tran
> nvlist version: 0
>         class = ereport.io.scsi.cmd.disk.tran
>         ena = 0x492896f3a8500401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>         (end detector)
>         driver-assessment = retry
>         op-code = 0x28
>         cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
>         pkt-reason = 0x1
>         pkt-state = 0x37
>         pkt-stats = 0x0
>         __ttl = 0x1
>         __tod = 0x4fae374f 0x2cfcadef
>
> May 12 2012 14:11:27.754721778 ereport.io.scsi.cmd.disk.tran
> nvlist version: 0
>         class = ereport.io.scsi.cmd.disk.tran
>         ena = 0x492896f3a8500401
>         detector = (embedded nvlist)
>         nvlist version: 0
>                 version = 0x0
>                 scheme = dev
>                 device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>         (end detector)
>         driver-assessment = fail
>         op-code = 0x28
>         cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
>         pkt-reason = 0x1
>         pkt-state = 0x37
>         pkt-stats = 0x0
>         __ttl = 0x1
>         __tod = 0x4fae374f 0x2cfc23f2
>
> //Jim
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
ZFS Performance and Training
richard.ell...@richardelling.com
+1-760-896-4422
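For what it's worth, the repeated CDB in those ereports is an ordinary READ(10) (opcode 0x28): bytes 2-5 are the big-endian LBA and bytes 7-8 the transfer length, so the failing command can be decoded with plain shell arithmetic. The `ereport.io.scsi.cmd.disk.tran` class and `pkt-reason = 0x1` point at the transport (if memory serves, 0x1 is CMD_INCOMPLETE) rather than a media error, which fits the dropped-connection theory:

```shell
#!/bin/sh
# Decode the CDB from the fmdump output above:
#   cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
# Byte 0 is the opcode (0x28 = READ(10)), bytes 2-5 form the
# big-endian LBA, bytes 7-8 the big-endian transfer length.
lba=$(( 0x14c87780 ))
nblk=$(( 0x0080 ))
echo "LBA: $lba"                            # 348682112
echo "blocks: $nblk"                        # 128
echo "bytes (at 512 B/sector): $(( nblk * 512 ))"  # 65536
```

So all five retries and the final fail target the same 64 KiB read near the 166 GB mark of the 232 GB disk; the identical `ena` on all six events confirms they are one retried command, not six independent errors.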