Re: [zfs-discuss] ZFS Recovery: What do I try next?
On Thu, Dec 22, 2011 at 11:25 AM, Tim Cook wrote:
> On Thu, Dec 22, 2011 at 10:00 AM, Myers Carpenter wrote:
>> So the lesson here: Don't be a dumbass like me. Set up Nagios or some
>> other system to alert you when a pool has become degraded. ZFS works very
>> well with one drive out of the array, so you probably aren't going to
>> notice problems unless you are proactively looking for them.
> Or, if you aren't scrubbing on a regular basis, just change your zpool
> failmode property. Had you set it to wait or panic, it would've been very
> clear, very quickly that something was wrong.
> http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/
I'm not sure this will help: a single failed drive in a raidz1 (or
two in a raidz2) leaves the zpool DEGRADED, not FAULTED, and I believe
the failmode property only governs behavior once a zpool is FAULTED.
We have a very simple shell script that runs hourly and does a
`zpool status -x` and generates an email to the admins if any pool is
in any state other than ONLINE. As soon as a zpool goes DEGRADED we
get notified and can initiate the correct response (open a case with
Oracle to replace the failed drive is the usual one). Here is the
snippet from the script of the actual health check (not my code, I
would have done it differently, but this works) ...
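    # Capture anything `zpool status -x` prints other than the two
    # "nothing wrong" messages. zfs_path is assumed to be set elsewhere
    # in the script.
    not_ok=$("${zfs_path}/zpool" status -x | grep -Ev "all pools are healthy|no pools available")
    if [ -n "${not_ok}" ]
    then
        # Record the fault; fault_count and new_faults belong to the rest
        # of the script (array syntax, so ksh or bash).
        fault_details="There is at least one zpool error."
        fault_count=$((fault_count + 1))
        new_faults[${fault_count}]=${fault_details}
    fi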
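
For anyone who wants to try the same check without a real pool to hand, here is a minimal sketch of it as a function that reads the `zpool status -x` output on stdin. The name pool_status_is_healthy is made up for illustration; it is not part of our script.

```shell
#!/bin/sh
# Sketch only: pool_status_is_healthy is a hypothetical helper name.
# It reads `zpool status -x` output on stdin and succeeds (exit 0) only
# when every non-blank line is one of the two "nothing wrong" messages.
pool_status_is_healthy() {
    # First grep drops blank lines. Second grep (-v) looks for any line
    # that is NOT a healthy message; -q suppresses its output.
    if grep -v '^[[:space:]]*$' |
        grep -Evq 'all pools are healthy|no pools available'
    then
        return 1    # found an unexpected line: treat the pool as unhealthy
    fi
    return 0        # nothing unexpected was printed
}
```

It would be called as `zpool status -x | pool_status_is_healthy`, with the e-mail sent when it returns non-zero. Note that empty input counts as healthy, just as in the snippet above, so a missing zpool binary would go unnoticed unless the exit status of zpool is checked as well.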
--
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
___
zfs-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS Recovery: What do I try next?
On Thu, Dec 22, 2011 at 10:00 AM, Myers Carpenter wrote:
> On Sat, Nov 5, 2011 at 2:35 PM, Myers Carpenter wrote:
>
>> I would like to pick the brains of the ZFS experts on this list: What
>> would you do next to try and recover this zfs pool?
>
> I hate running across threads that ask a question and the person that
> asked them never comes back to say what they eventually did, so...
>
> To summarize: In late October I had two drives fail in a raidz1 pool. I
> was able to recover all the data from one drive, but the other could not be
> seen by the controller. Trying to zpool import was not working. I had 3
> of the 4 drives, so why couldn't I mount this?
>
> I read about every option in zdb and tried ones that might tell me
> something more about what was on this recovered drive. I eventually hit on
>
>     zdb -p devs -e -lu /bank4/hd/devs/loop0
>
> where /bank4/hd/devs/loop0 was a symlink back to /dev/loop0 where I had
> set up the disk image of the recovered drive.
>
> This showed the uberblocks which looked like this:
>
> Uberblock[1]
>     magic = 00bab10c
>     version = 26
>     txg = 23128193
>     guid_sum = 13396147021153418877
>     timestamp = 1316987376 UTC = Sun Sep 25 17:49:36 2011
>     rootbp = DVA[0]=<0:2981f336c00:400> DVA[1]=<0:1e8dcc01400:400>
>     DVA[2]=<0:3b16a3dd400:400> [L0 DMU objset] fletcher4 lzjb LE contiguous
>     unique triple size=800L/200P birth=23128193L/23128193P fill=255
>     cksum=136175e0a4:79b27ae49c7:1857d594ca833:34ec76b965ae40
>
> Then it all came clear: This drive had encountered errors one month before
> the other drive had failed and ZFS had stopped writing to it.
>
> So the lesson here: Don't be a dumbass like me. Set up Nagios or some
> other system to alert you when a pool has become degraded. ZFS works very
> well with one drive out of the array, so you probably aren't going to
> notice problems unless you are proactively looking for them.
>
> myers

Or, if you aren't scrubbing on a regular basis, just change your zpool
failmode property. Had you set it to wait or panic, it would've been very
clear, very quickly that something was wrong.

http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/

--Tim
Re: [zfs-discuss] ZFS Recovery: What do I try next?
On Sat, Nov 5, 2011 at 2:35 PM, Myers Carpenter wrote:
> I would like to pick the brains of the ZFS experts on this list: What
> would you do next to try and recover this zfs pool?

I hate running across threads that ask a question and the person that
asked them never comes back to say what they eventually did, so...

To summarize: In late October I had two drives fail in a raidz1 pool. I
was able to recover all the data from one drive, but the other could not be
seen by the controller. Trying to zpool import was not working. I had 3
of the 4 drives, so why couldn't I mount this?

I read about every option in zdb and tried ones that might tell me
something more about what was on this recovered drive. I eventually hit on

    zdb -p devs -e -lu /bank4/hd/devs/loop0

where /bank4/hd/devs/loop0 was a symlink back to /dev/loop0 where I had
set up the disk image of the recovered drive.

This showed the uberblocks which looked like this:

Uberblock[1]
    magic = 00bab10c
    version = 26
    txg = 23128193
    guid_sum = 13396147021153418877
    timestamp = 1316987376 UTC = Sun Sep 25 17:49:36 2011
    rootbp = DVA[0]=<0:2981f336c00:400> DVA[1]=<0:1e8dcc01400:400>
    DVA[2]=<0:3b16a3dd400:400> [L0 DMU objset] fletcher4 lzjb LE contiguous
    unique triple size=800L/200P birth=23128193L/23128193P fill=255
    cksum=136175e0a4:79b27ae49c7:1857d594ca833:34ec76b965ae40

Then it all came clear: This drive had encountered errors one month before
the other drive had failed and ZFS had stopped writing to it.

So the lesson here: Don't be a dumbass like me. Set up Nagios or some
other system to alert you when a pool has become degraded. ZFS works very
well with one drive out of the array, so you probably aren't going to
notice problems unless you are proactively looking for them.

myers
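
The comparison that gave the game away (one device's newest uberblock being a month older than the others') can be sketched roughly as follows. This is only an illustration: newest_uberblock is a made-up helper name, and it assumes the uberblock listing has the shape shown above, with each "txg =" line appearing before its "timestamp =" line.

```shell
#!/bin/sh
# Sketch only: newest_uberblock is a hypothetical helper name.
# Reads an uberblock listing (as printed by `zdb -lu`) on stdin and prints
# the highest txg seen together with that uberblock's epoch timestamp.
newest_uberblock() {
    awk '
        BEGIN { best = -1 }
        # Remember the txg of the uberblock currently being read.
        $1 == "txg" && $2 == "=" { txg = $3 + 0 }
        # On the timestamp line, keep this uberblock if its txg is the
        # highest so far.
        $1 == "timestamp" && $2 == "=" {
            if (txg > best) { best = txg; ts = $3 }
        }
        # Print nothing at all if no uberblock was found.
        END { if (best >= 0) printf "%d %s\n", best, ts }
    '
}
```

Feeding it the output of the zdb command above for each device in turn and comparing the results should show which device stopped receiving writes first: its newest txg and timestamp will lag well behind the rest.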
Re: [zfs-discuss] ZFS Recovery: What do I try next?
Did you try zpool clear -F bank0 with the latest Solaris Express?

On Nov 5, 2011, at 2:35 PM, Myers Carpenter wrote:

> I would like to pick the brains of the ZFS experts on this list: What
> would you do next to try and recover this zfs pool?
>
> I have a ZFS RAIDZ1 pool named bank0 that I cannot import. It was
> composed of 4 1.5 TiB disks. One disk is totally dead. Another had
> SMART errors, but using GNU ddrescue I was able to copy all the data
> off successfully.
>
> I have copied all 3 remaining disks as images using dd on to
> another filesystem. Using the loopback filesystem I can treat these
> images as if they were real disks. I've made a snapshot of the
> filesystem the disk images are on so that I can try things and
> rollback the changes if needed.
>
> "gir" is the computer these disks are hosted on. It used to be a
> Nexenta server, but is now Ubuntu 11.10 with the zfs on linux modules.
>
> I have tried booting up Solaris Express 11 Live CD and doing "zpool
> import -fFX bank0" which ran for ~6 hours and put out: "one or more
> devices is currently unavailable"
>
> I have tried "zpool import -fFX bank0" on linux with the same results.
>
> I have tried moving the drives back into the controller config they
> were in before, and booted my old Nexenta root disk where the
> /etc/zfs/zpool.cache still had an entry for bank0. I was not able to
> get the filesystems mounted. I can't remember what errors I got. I can
> do it again if the errors might be useful.
>
> Here is the output of the different utils:
>
> root@gir:/bank3/hd# zpool import -d devs
>   pool: bank0
>     id: 3936305481264476979
>  state: FAULTED
> status: The pool was last accessed by another system.
> action: The pool cannot be imported due to damaged devices or data.
>         The pool may be active on another system, but can be imported using
>         the '-f' flag.
>    see: http://www.sun.com/msg/ZFS-8000-EY
> config:
>
>         bank0          FAULTED  corrupted data
>           raidz1-0     DEGRADED
>             loop0      ONLINE
>             loop1      ONLINE
>             loop2      ONLINE
>             c10t2d0p0  UNAVAIL
>
>
> root@gir:/bank3/hd# zpool import -d devs bank0
> cannot import 'bank0': pool may be in use from other system, it was
> last accessed by gir (hostid: 0xa1767) on Mon Oct 24 15:50:23 2011
> use '-f' to import anyway
>
>
> root@gir:/bank3/hd# zpool import -f -d devs bank0
> cannot import 'bank0': I/O error
>         Destroy and re-create the pool from
>         a backup source.
>
> root@gir:/bank3/hd# zdb -e -p devs bank0
> Configuration for import:
>         vdev_children: 1
>         version: 26
>         pool_guid: 3936305481264476979
>         name: 'bank0'
>         state: 0
>         hostid: 661351
>         hostname: 'gir'
>         vdev_tree:
>             type: 'root'
>             id: 0
>             guid: 3936305481264476979
>             children[0]:
>                 type: 'raidz'
>                 id: 0
>                 guid: 10967243523656644777
>                 nparity: 1
>                 metaslab_array: 23
>                 metaslab_shift: 35
>                 ashift: 9
>                 asize: 6001161928704
>                 is_log: 0
>                 create_txg: 4
>                 children[0]:
>                     type: 'disk'
>                     id: 0
>                     guid: 13554115250875315903
>                     phys_path: '/pci@0,0/pci1002,4391@11/disk@3,0:q'
>                     whole_disk: 0
>                     DTL: 57
>                     create_txg: 4
>                     path: '/bank3/hd/devs/loop0'
>                 children[1]:
>                     type: 'disk'
>                     id: 1
>                     guid: 17894226827518944093
>                     phys_path: '/pci@0,0/pci1002,4391@11/disk@0,0:q'
>                     whole_disk: 0
>                     DTL: 62
>                     create_txg: 4
>                     path: '/bank3/hd/devs/loop1'
>                 children[2]:
>                     type: 'disk'
>                     id: 2
>                     guid: 9087312107742869669
>                     phys_path: '/pci@0,0/pci1002,4391@11/disk@1,0:q'
>                     whole_disk: 0
>                     DTL: 61
>                     create_txg: 4
>                     faulted: 1
>                     aux_state: 'err_exceeded'
>                     path: '/bank3/hd/devs/loop2'
>                 children[3]:
>                     type: 'disk'
>                     id: 3
>                     guid: 13297176051223822304
>                     path: '/dev/dsk/c10t2d0p0'
>                     devid: 'id1,sd@SATA_ST31500341AS9VS32K25/q'
>                     phys_path: '/pci@0,0/pci1002,4391@11/disk@2,0:q'
>                     whole_disk: 0
>                     DTL: 60
>                     create_txg: 4
>
> zdb: can't open 'bank0': No such file or directory
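
A small aside on the output above: the hostid in the zdb configuration (661351) and the one in the import error (0xa1767) are the same number, printed in decimal in one place and in hexadecimal in the other. A minimal sketch of the conversion, with hostid_to_hex being a made-up name for illustration:

```shell
#!/bin/sh
# Sketch only: hostid_to_hex is a hypothetical helper name.
# Prints a decimal hostid (as shown by zdb) in the 0x... form that the
# zpool import message uses.
hostid_to_hex() {
    printf '0x%x\n' "$1"
}
```

Calling `hostid_to_hex 661351` prints 0xa1767, matching the value in the "last accessed by gir" message.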
