Re: [zfs-discuss] ZFS Recovery: What do I try next?
On Thu, Dec 22, 2011 at 11:25 AM, Tim Cook wrote:
> On Thu, Dec 22, 2011 at 10:00 AM, Myers Carpenter wrote:
>> So the lesson here: Don't be a dumbass like me. Set up nagios or some
>> other system to alert you when a pool has become degraded. ZFS works very
>> well with one drive out of the array; you probably aren't going to notice
>> problems unless you are proactively looking for them.
>
> Or, if you aren't scrubbing on a regular basis, just change your zpool
> failmode property. Had you set it to wait or panic, it would've been very
> clear, very quickly that something was wrong.
>
> http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/

I'm not sure this will help, as a single failed drive in a raidz1 (or two in a
raidz2) will make the zpool DEGRADED, not FAULTED. I believe this parameter
governs behavior for a FAULTED zpool.

We have a very simple shell script that runs hourly, does a `zpool status -x`,
and generates an email to the admins if any pool is in any state other than
ONLINE. As soon as a zpool goes DEGRADED we get notified and can initiate the
correct response (opening a case with Oracle to replace the failed drive is the
usual one).

Here is the snippet from the script of the actual health check (not my code, I
would have done it differently, but this works):

...
not_ok=`${zfs_path}/zpool status -x | egrep -v "all pools are healthy|no pools available"`
if [ "X${not_ok}" != "X" ]
then
        fault_details="There is at least one zpool error."
        let fault_count=fault_count+1
        new_faults[${fault_count}]=${fault_details}
fi

--
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
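For anyone who wants to drop something similar into cron, here is a minimal,
self-contained sketch along the same lines. It only assumes a mailx-style
mailer; the recipient address, zpool path, and script name below are
placeholders, and the `zpool status -x` / egrep test is the same one used in
the snippet above.

#!/bin/sh
# Hourly zpool health check -- mails the admins if any pool is not healthy.
# ADMINS and ZPOOL are site-specific placeholders; adjust for your system.
ADMINS="admin@example.com"
ZPOOL=/usr/sbin/zpool

# `zpool status -x` prints only unhealthy pools; filter out the two
# "nothing to report" messages it can emit.
not_ok=`${ZPOOL} status -x | egrep -v "all pools are healthy|no pools available"`

if [ "X${not_ok}" != "X" ]
then
        # Mail the full status output so the alert shows which pool and vdev.
        ${ZPOOL} status -x | mailx -s "zpool not healthy on `hostname`" ${ADMINS}
fi

Run it from cron, e.g. `0 * * * * /usr/local/bin/check_zpool.sh`.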
Re: [zfs-discuss] ZFS Recovery: What do I try next?
On Thu, Dec 22, 2011 at 10:00 AM, Myers Carpenter wrote:
> On Sat, Nov 5, 2011 at 2:35 PM, Myers Carpenter wrote:
>> I would like to pick the brains of the ZFS experts on this list: What
>> would you do next to try and recover this zfs pool?
>
> I hate running across threads that ask a question and the person that
> asked them never comes back to say what they eventually did, so...
>
> To summarize: In late October I had two drives fail in a raidz1 pool. I
> was able to recover all the data from one drive, but the other could not
> be seen by the controller. Trying to zpool import it was not working. I
> had 3 of the 4 drives, so why couldn't I mount this?
>
> I read about every option in zdb and tried ones that might tell me
> something more about what was on this recovered drive. I eventually hit on
>
> zdb -p devs -e -lu /bank4/hd/devs/loop0
>
> where /bank4/hd/devs/loop0 was a symlink back to /dev/loop0 where I had
> set up the disk image of the recovered drive.
>
> This showed the uberblocks, which looked like this:
>
> Uberblock[1]
>         magic = 00bab10c
>         version = 26
>         txg = 23128193
>         guid_sum = 13396147021153418877
>         timestamp = 1316987376 UTC = Sun Sep 25 17:49:36 2011
>         rootbp = DVA[0]=<0:2981f336c00:400> DVA[1]=<0:1e8dcc01400:400>
>         DVA[2]=<0:3b16a3dd400:400> [L0 DMU objset] fletcher4 lzjb LE contiguous
>         unique triple size=800L/200P birth=23128193L/23128193P fill=255
>         cksum=136175e0a4:79b27ae49c7:1857d594ca833:34ec76b965ae40
>
> Then it all became clear: This drive had encountered errors one month
> before the other drive had failed, and zfs had stopped writing to it.
>
> So the lesson here: Don't be a dumbass like me. Set up nagios or some
> other system to alert you when a pool has become degraded. ZFS works very
> well with one drive out of the array; you probably aren't going to notice
> problems unless you are proactively looking for them.
>
> myers

Or, if you aren't scrubbing on a regular basis, just change your zpool
failmode property. Had you set it to wait or panic, it would've been very
clear, very quickly that something was wrong.

http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/

--Tim
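For reference, failmode is an ordinary per-pool property with three values
(wait, continue, panic; wait is the default), so checking and changing it is
a one-liner. The pool name here just follows the thread:

# Show the current setting.
zpool get failmode bank0

# Panic the host on catastrophic pool failure instead of blocking I/O,
# so a failing pool is impossible to miss.
zpool set failmode=panic bank0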
Re: [zfs-discuss] ZFS Recovery: What do I try next?
On Sat, Nov 5, 2011 at 2:35 PM, Myers Carpenter wrote:
> I would like to pick the brains of the ZFS experts on this list: What
> would you do next to try and recover this zfs pool?

I hate running across threads that ask a question and the person that
asked them never comes back to say what they eventually did, so...

To summarize: In late October I had two drives fail in a raidz1 pool. I
was able to recover all the data from one drive, but the other could not
be seen by the controller. Trying to zpool import it was not working. I
had 3 of the 4 drives, so why couldn't I mount this?

I read about every option in zdb and tried ones that might tell me
something more about what was on this recovered drive. I eventually hit on

zdb -p devs -e -lu /bank4/hd/devs/loop0

where /bank4/hd/devs/loop0 was a symlink back to /dev/loop0 where I had
set up the disk image of the recovered drive.

This showed the uberblocks, which looked like this:

Uberblock[1]
        magic = 00bab10c
        version = 26
        txg = 23128193
        guid_sum = 13396147021153418877
        timestamp = 1316987376 UTC = Sun Sep 25 17:49:36 2011
        rootbp = DVA[0]=<0:2981f336c00:400> DVA[1]=<0:1e8dcc01400:400>
        DVA[2]=<0:3b16a3dd400:400> [L0 DMU objset] fletcher4 lzjb LE contiguous
        unique triple size=800L/200P birth=23128193L/23128193P fill=255
        cksum=136175e0a4:79b27ae49c7:1857d594ca833:34ec76b965ae40

Then it all became clear: This drive had encountered errors one month before
the other drive had failed, and zfs had stopped writing to it.

So the lesson here: Don't be a dumbass like me. Set up nagios or some
other system to alert you when a pool has become degraded. ZFS works very
well with one drive out of the array; you probably aren't going to notice
problems unless you are proactively looking for them.

myers
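For anyone who lands in the same spot, a rough sketch of the comparison that
made this obvious: attach each recovered image, dump its labels and uberblocks
with the same `zdb -lu` invocation used above, and compare the txg/timestamp
values across devices. The image paths and loop devices below are illustrative
placeholders, not the exact ones from this pool:

# Attach the recovered images (losetup options vary a little by distro).
losetup /dev/loop0 /bank4/hd/disk0.img
losetup /dev/loop1 /bank4/hd/disk1.img
losetup /dev/loop2 /bank4/hd/disk2.img

# Dump labels/uberblocks on each device and pull out the txg and timestamp
# lines. A device whose newest txg is far behind the others stopped being
# written to long before the pool finally died.
for d in /dev/loop0 /dev/loop1 /dev/loop2
do
        echo "== ${d} =="
        zdb -lu ${d} | egrep "txg|timestamp" | sort -u
done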
Re: [zfs-discuss] ZFS Recovery: What do I try next?
Did you try zpool clear -F bank0 with the latest Solaris Express?

Sent from my iPad

On Nov 5, 2011, at 2:35 PM, Myers Carpenter wrote:

> I would like to pick the brains of the ZFS experts on this list: What
> would you do next to try and recover this zfs pool?
>
> I have a ZFS RAIDZ1 pool named bank0 that I cannot import. It was
> composed of 4 1.5 TiB disks. One disk is totally dead. Another had
> SMART errors, but using GNU ddrescue I was able to copy all the data
> off successfully.
>
> I have copied all 3 remaining disks as images using dd onto another
> filesystem. Using the loopback filesystem I can treat these images as
> if they were real disks. I've made a snapshot of the filesystem the
> disk images are on so that I can try things and roll back the changes
> if needed.
>
> "gir" is the computer these disks are hosted on. It used to be a
> Nexenta server, but is now Ubuntu 11.10 with the zfs on linux modules.
>
> I have tried booting up the Solaris Express 11 Live CD and doing "zpool
> import -fFX bank0", which ran for ~6 hours and put out: "one or more
> devices is currently unavailable".
>
> I have tried "zpool import -fFX bank0" on linux with the same results.
>
> I have tried moving the drives back into the controller config they
> were in before, and booted my old Nexenta root disk where the
> /etc/zfs/zpool.cache still had an entry for bank0. I was not able to
> get the filesystems mounted. I can't remember what errors I got. I can
> do it again if the errors might be useful.
>
> Here is the output of the different utils:
>
> root@gir:/bank3/hd# zpool import -d devs
>   pool: bank0
>     id: 3936305481264476979
>  state: FAULTED
> status: The pool was last accessed by another system.
> action: The pool cannot be imported due to damaged devices or data.
>         The pool may be active on another system, but can be imported using
>         the '-f' flag.
>    see: http://www.sun.com/msg/ZFS-8000-EY
> config:
>
>         bank0          FAULTED  corrupted data
>           raidz1-0     DEGRADED
>             loop0      ONLINE
>             loop1      ONLINE
>             loop2      ONLINE
>             c10t2d0p0  UNAVAIL
>
> root@gir:/bank3/hd# zpool import -d devs bank0
> cannot import 'bank0': pool may be in use from other system, it was
> last accessed by gir (hostid: 0xa1767) on Mon Oct 24 15:50:23 2011
> use '-f' to import anyway
>
> root@gir:/bank3/hd# zpool import -f -d devs bank0
> cannot import 'bank0': I/O error
>         Destroy and re-create the pool from
>         a backup source.
> root@gir:/bank3/hd# zdb -e -p devs bank0
>
> Configuration for import:
>         vdev_children: 1
>         version: 26
>         pool_guid: 3936305481264476979
>         name: 'bank0'
>         state: 0
>         hostid: 661351
>         hostname: 'gir'
>         vdev_tree:
>             type: 'root'
>             id: 0
>             guid: 3936305481264476979
>             children[0]:
>                 type: 'raidz'
>                 id: 0
>                 guid: 10967243523656644777
>                 nparity: 1
>                 metaslab_array: 23
>                 metaslab_shift: 35
>                 ashift: 9
>                 asize: 6001161928704
>                 is_log: 0
>                 create_txg: 4
>                 children[0]:
>                     type: 'disk'
>                     id: 0
>                     guid: 13554115250875315903
>                     phys_path: '/pci@0,0/pci1002,4391@11/disk@3,0:q'
>                     whole_disk: 0
>                     DTL: 57
>                     create_txg: 4
>                     path: '/bank3/hd/devs/loop0'
>                 children[1]:
>                     type: 'disk'
>                     id: 1
>                     guid: 17894226827518944093
>                     phys_path: '/pci@0,0/pci1002,4391@11/disk@0,0:q'
>                     whole_disk: 0
>                     DTL: 62
>                     create_txg: 4
>                     path: '/bank3/hd/devs/loop1'
>                 children[2]:
>                     type: 'disk'
>                     id: 2
>                     guid: 9087312107742869669
>                     phys_path: '/pci@0,0/pci1002,4391@11/disk@1,0:q'
>                     whole_disk: 0
>                     DTL: 61
>                     create_txg: 4
>                     faulted: 1
>                     aux_state: 'err_exceeded'
>                     path: '/bank3/hd/devs/loop2'
>                 children[3]:
>                     type: 'disk'
>                     id: 3
>                     guid: 13297176051223822304
>                     path: '/dev/dsk/c10t2d0p0'
>                     devid: 'id1,sd@SATA_ST31500341AS9VS32K25/q'
>                     phys_path: '/pci@0,0/pci1002,4391@11/disk@2,0:q'
>                     whole_disk: 0
>                     DTL: 60
>                     create_txg: 4
>
> zdb: can't open 'bank0': No such file or directory
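For context, the -F recovery mode suggested at the top of this message exists
in two places: on zpool clear, for a pool that has faulted while imported, and
on zpool import, for a pool that won't import at all. In both cases -n is a
dry run that reports whether discarding the last few transactions would help,
without actually discarding anything. A sketch, assuming a platform recent
enough to have these flags:

# Dry-run recovery of a pool that faulted while it was imported.
zpool clear -F -n bank0

# Same idea at import time: report whether a rewind import could succeed.
zpool import -F -n bank0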
[zfs-discuss] ZFS Recovery: What do I try next?
I would like to pick the brains of the ZFS experts on this list: What
would you do next to try and recover this zfs pool?

I have a ZFS RAIDZ1 pool named bank0 that I cannot import. It was
composed of 4 1.5 TiB disks. One disk is totally dead. Another had
SMART errors, but using GNU ddrescue I was able to copy all the data
off successfully.

I have copied all 3 remaining disks as images using dd onto another
filesystem. Using the loopback filesystem I can treat these images as
if they were real disks. I've made a snapshot of the filesystem the
disk images are on so that I can try things and roll back the changes
if needed.

"gir" is the computer these disks are hosted on. It used to be a
Nexenta server, but is now Ubuntu 11.10 with the zfs on linux modules.

I have tried booting up the Solaris Express 11 Live CD and doing "zpool
import -fFX bank0", which ran for ~6 hours and put out: "one or more
devices is currently unavailable".

I have tried "zpool import -fFX bank0" on linux with the same results.

I have tried moving the drives back into the controller config they
were in before, and booted my old Nexenta root disk where the
/etc/zfs/zpool.cache still had an entry for bank0. I was not able to
get the filesystems mounted. I can't remember what errors I got. I can
do it again if the errors might be useful.

Here is the output of the different utils:

root@gir:/bank3/hd# zpool import -d devs
  pool: bank0
    id: 3936305481264476979
 state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        bank0          FAULTED  corrupted data
          raidz1-0     DEGRADED
            loop0      ONLINE
            loop1      ONLINE
            loop2      ONLINE
            c10t2d0p0  UNAVAIL

root@gir:/bank3/hd# zpool import -d devs bank0
cannot import 'bank0': pool may be in use from other system, it was
last accessed by gir (hostid: 0xa1767) on Mon Oct 24 15:50:23 2011
use '-f' to import anyway

root@gir:/bank3/hd# zpool import -f -d devs bank0
cannot import 'bank0': I/O error
        Destroy and re-create the pool from
        a backup source.

root@gir:/bank3/hd# zdb -e -p devs bank0

Configuration for import:
        vdev_children: 1
        version: 26
        pool_guid: 3936305481264476979
        name: 'bank0'
        state: 0
        hostid: 661351
        hostname: 'gir'
        vdev_tree:
            type: 'root'
            id: 0
            guid: 3936305481264476979
            children[0]:
                type: 'raidz'
                id: 0
                guid: 10967243523656644777
                nparity: 1
                metaslab_array: 23
                metaslab_shift: 35
                ashift: 9
                asize: 6001161928704
                is_log: 0
                create_txg: 4
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 13554115250875315903
                    phys_path: '/pci@0,0/pci1002,4391@11/disk@3,0:q'
                    whole_disk: 0
                    DTL: 57
                    create_txg: 4
                    path: '/bank3/hd/devs/loop0'
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 17894226827518944093
                    phys_path: '/pci@0,0/pci1002,4391@11/disk@0,0:q'
                    whole_disk: 0
                    DTL: 62
                    create_txg: 4
                    path: '/bank3/hd/devs/loop1'
                children[2]:
                    type: 'disk'
                    id: 2
                    guid: 9087312107742869669
                    phys_path: '/pci@0,0/pci1002,4391@11/disk@1,0:q'
                    whole_disk: 0
                    DTL: 61
                    create_txg: 4
                    faulted: 1
                    aux_state: 'err_exceeded'
                    path: '/bank3/hd/devs/loop2'
                children[3]:
                    type: 'disk'
                    id: 3
                    guid: 13297176051223822304
                    path: '/dev/dsk/c10t2d0p0'
                    devid: 'id1,sd@SATA_ST31500341AS9VS32K25/q'
                    phys_path: '/pci@0,0/pci1002,4391@11/disk@2,0:q'
                    whole_disk: 0
                    DTL: 60
                    create_txg: 4

zdb: can't open 'bank0': No such file or directory
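For anyone wanting the same safety net before experimenting on a broken pool,
a rough sketch of the image-plus-loopback setup described above. Device names,
image paths, and the dataset being snapshotted are illustrative placeholders:

# Image the flaky drive with GNU ddrescue; the map file lets it resume
# and retry bad sectors.
ddrescue /dev/sdb /bank3/hd/disk1.img /bank3/hd/disk1.map

# Plain dd is fine for the drives that still read cleanly.
dd if=/dev/sdc of=/bank3/hd/disk2.img bs=1M conv=noerror,sync

# Snapshot the filesystem holding the images so any import experiment
# can be rolled back.
zfs snapshot bank3/hd@before-recovery

# Expose the images as block devices and point zpool import at a
# directory of symlinks to them.
losetup /dev/loop0 /bank3/hd/disk1.img
losetup /dev/loop1 /bank3/hd/disk2.img
mkdir -p devs
ln -s /dev/loop0 /dev/loop1 devs/
zpool import -d devs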