Re: [zfs-discuss] ZFS Recovery: What do I try next?

2011-12-22 Thread Paul Kraus
On Thu, Dec 22, 2011 at 11:25 AM, Tim Cook  wrote:
> On Thu, Dec 22, 2011 at 10:00 AM, Myers Carpenter  wrote:

>> So the lesson here: Don't be a dumbass like me.  Set up Nagios or some
>> other system to alert you when a pool has become degraded.  ZFS works very
>> well with one drive out of the array; you probably aren't going to notice
>> problems unless you are proactively looking for them.

> Or, if you aren't scrubbing on a regular basis, just change your zpool
> failmode property.  Had you set it to wait or panic, it would've been very
> clear, very quickly that something was wrong.
> http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/

I'm not sure this will help, as a single failed drive in a raidz1
(or two in a raidz2) will leave the zpool DEGRADED, not FAULTED. I
believe the failmode property only governs behavior for a FAULTED zpool.

We have a very simple shell script that runs hourly, does a
`zpool status -x`, and generates an email to the admins if any pool is
in any state other than ONLINE. As soon as a zpool goes DEGRADED we
get notified and can initiate the correct response (the usual one being
to open a case with Oracle to replace the failed drive). Here is the
snippet from the script that does the actual health check (not my code,
I would have done it differently, but it works) ...

not_ok=`${zfs_path}/zpool status -x | egrep -v "all pools are healthy|no pools available"`

if [ "X${not_ok}" != "X" ]
then
    fault_details="There is at least one zpool error."
    let fault_count=fault_count+1
    new_faults[${fault_count}]=${fault_details}
fi
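
For what it's worth, a minimal standalone sketch of the same idea (the
mail address, script path, and cron schedule below are made up, not
taken from our actual setup):

#!/bin/sh
# Mail the admins if `zpool status -x` reports anything other than
# "all pools are healthy" / "no pools available".
ADMIN_MAIL="admins@example.com"   # placeholder address
status=`zpool status -x`
if echo "${status}" | egrep -qv "all pools are healthy|no pools available"
then
    echo "${status}" | mailx -s "zpool health alert on `hostname`" ${ADMIN_MAIL}
fi

Dropped into root's crontab it could run hourly, e.g.:

0 * * * * /usr/local/bin/zpool_health.sh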

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players


Re: [zfs-discuss] ZFS Recovery: What do I try next?

2011-12-22 Thread Tim Cook
On Thu, Dec 22, 2011 at 10:00 AM, Myers Carpenter  wrote:

>
>
> On Sat, Nov 5, 2011 at 2:35 PM, Myers Carpenter  wrote:
>
>> I would like to pick the brains of the ZFS experts on this list: What
>> would you do next to try and recover this zfs pool?
>>
>
> I hate running across threads that ask a question and the person that
> asked them never comes back to say what they eventually did, so...
>
> To summarize: In late October I had two drives fail in a raidz1 pool.  I
> was able to recover all the data from one drive, but the other could not be
> seen by the controller.  Trying to zpool import was not working.  I had 3
> of the 4 drives; why couldn't I mount this?
>
> I read about every option in zdb and tried ones that might tell me
> something more about what was on this recovered drive.  I eventually hit on
>
> zdb -p devs -e -lu /bank4/hd/devs/loop0
>
> where /bank4/hd/devs/loop0 was a symlink back to /dev/loop0, where I had
> set up the disk image of the recovered drive.
>
> This showed the uberblocks which looked like this:
>
> Uberblock[1]
> magic = 00bab10c
> version = 26
> txg = 23128193
> guid_sum = 13396147021153418877
> timestamp = 1316987376 UTC = Sun Sep 25 17:49:36 2011
> rootbp = DVA[0]=<0:2981f336c00:400> DVA[1]=<0:1e8dcc01400:400>
> DVA[2]=<0:3b16a3dd400:400> [L0 DMU objset] fletcher4 lzjb LE contiguous
> unique triple size=800L/200P birth=23128193L/23128193P fill=255
> cksum=136175e0a4:79b27ae49c7:1857d594ca833:34ec76b965ae40
>
> Then it all became clear: This drive had encountered errors one month before
> the other drive failed, and zfs had stopped writing to it at that point.
>
> So the lesson here: Don't be a dumbass like me.  Set up Nagios or some
> other system to alert you when a pool has become degraded.  ZFS works very
> well with one drive out of the array; you probably aren't going to notice
> problems unless you are proactively looking for them.
>
> myers


Or, if you aren't scrubbing on a regular basis, just change your zpool
failmode property.  Had you set it to wait or panic, it would've been very
clear, very quickly that something was wrong.
http://prefetch.net/blog/index.php/2008/03/01/configuring-zfs-to-gracefully-deal-with-failures/
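
For reference, failmode can be checked and changed like any other pool
property (valid values are wait, continue, and panic; the default is wait;
the pool name below is just the one from this thread):

# check the current setting
zpool get failmode bank0

# panic the host on catastrophic pool failure instead of blocking I/O
zpool set failmode=panic bank0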

--Tim


Re: [zfs-discuss] ZFS Recovery: What do I try next?

2011-12-22 Thread Myers Carpenter
On Sat, Nov 5, 2011 at 2:35 PM, Myers Carpenter  wrote:

> I would like to pick the brains of the ZFS experts on this list: What
> would you do next to try and recover this zfs pool?
>

I hate running across threads that ask a question and the person that asked
them never comes back to say what they eventually did, so...

To summarize: In late October I had two drives fail in a raidz1 pool.  I
was able to recover all the data from one drive, but the other could not be
seen by the controller.  Trying to zpool import was not working.  I had 3
of the 4 drives; why couldn't I mount this?

I read about every option in zdb and tried ones that might tell me
something more about what was on this recovered drive.  I eventually hit on

zdb -p devs -e -lu /bank4/hd/devs/loop0

where /bank4/hd/devs/loop0 was a symlink back to /dev/loop0, where I had
set up the disk image of the recovered drive.
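
For anyone retracing this, a rough sketch of the loop-device setup being
described (the image file name is illustrative; the zdb command is the one
above):

# attach the rescued disk image to a loop device (Linux)
losetup /dev/loop0 /bank4/hd/recovered-disk.img

# zdb's -p option names a directory to search for devices, so point a
# symlink at the loop device from there
mkdir -p /bank4/hd/devs
ln -s /dev/loop0 /bank4/hd/devs/loop0

# dump the labels and uberblocks (-l -u) of the exported (-e) pool member
cd /bank4/hd && zdb -p devs -e -lu /bank4/hd/devs/loop0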

This showed the uberblocks which looked like this:

Uberblock[1]
magic = 00bab10c
version = 26
txg = 23128193
guid_sum = 13396147021153418877
timestamp = 1316987376 UTC = Sun Sep 25 17:49:36 2011
rootbp = DVA[0]=<0:2981f336c00:400> DVA[1]=<0:1e8dcc01400:400>
DVA[2]=<0:3b16a3dd400:400> [L0 DMU objset] fletcher4 lzjb LE contiguous
unique triple size=800L/200P birth=23128193L/23128193P fill=255
cksum=136175e0a4:79b27ae49c7:1857d594ca833:34ec76b965ae40

Then it all became clear: This drive had encountered errors one month before
the other drive failed, and zfs had stopped writing to it at that point.
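
In other words, the newest uberblock on the recovered image presumably
carries a txg and timestamp about a month older than those on the members
that stayed in the pool. A quick way to compare them side by side
(assuming all three images are attached as loop0-loop2 under the same
devs directory):

for d in loop0 loop1 loop2
do
    echo "=== ${d} ==="
    zdb -p devs -e -lu /bank4/hd/devs/${d} | egrep "txg = |timestamp = "
done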

So the lesson here: Don't be a dumbass like me.  Set up Nagios or some
other system to alert you when a pool has become degraded.  ZFS works very
well with one drive out of the array; you probably aren't going to notice
problems unless you are proactively looking for them.

myers


Re: [zfs-discuss] ZFS Recovery: What do I try next?

2011-11-05 Thread LaoTsao
Did you try
zpool clear -F bank0 with the latest Solaris Express?

Sent from my iPad

On Nov 5, 2011, at 2:35 PM, Myers Carpenter  wrote:

> I would like to pick the brains of the ZFS experts on this list: What
> would you do next to try and recover this zfs pool?
> 
> I have a ZFS RAIDZ1 pool named bank0 that I cannot import.  It was
> composed of 4 1.5 TiB disks.  One disk is totally dead.  Another had
> SMART errors, but using GNU ddrescue I was able to copy all the data
> off successfully.
> 
> I have copied all 3 remaining disks as images using dd onto another
> filesystem.  Using the loopback filesystem I can treat these
> images as if they were real disks.  I've made a snapshot of the
> filesystem the disk images are on so that I can try things and
> roll back the changes if needed.
> 
> "gir" is the computer these disks are hosted on.  It used to be a
> Nexenta server, but is now Ubuntu 11.10 with the zfs on linux modules.
> 
> I have tried booting up Solaris Express 11 Live CD and doing "zpool
> import -fFX bank0" which ran for ~6 hours and put out: "one or more
> devices is currently unavailable"
> 
> I have tried "zpool import -fFX bank0" on linux with the same results.
> 
> I have tried moving the drives back into the controller config they
> were in before, and booted my old Nexenta root disk where the
> /etc/zfs/zpool.cache still had an entry for bank0.  I was not able to
> get the filesystems mounted.  I can't remember what errors I got.  I can
> do it again if the errors might be useful.
> 
> Here is the output of the different utils:
> 
> root@gir:/bank3/hd# zpool import -d devs
>   pool: bank0
>     id: 3936305481264476979
>  state: FAULTED
> status: The pool was last accessed by another system.
> action: The pool cannot be imported due to damaged devices or data.
>         The pool may be active on another system, but can be imported using
>         the '-f' flag.
>    see: http://www.sun.com/msg/ZFS-8000-EY
> config:
>
>         bank0          FAULTED  corrupted data
>           raidz1-0     DEGRADED
>             loop0      ONLINE
>             loop1      ONLINE
>             loop2      ONLINE
>             c10t2d0p0  UNAVAIL
> 
> 
> root@gir:/bank3/hd# zpool import -d devs bank0
> cannot import 'bank0': pool may be in use from other system, it was
> last accessed by gir (hostid: 0xa1767) on Mon Oct 24 15:50:23 2011
> use '-f' to import anyway
> 
> 
> root@gir:/bank3/hd# zpool import -f -d devs bank0
> cannot import 'bank0': I/O error
>Destroy and re-create the pool from
>a backup source.
> 
> root@gir:/bank3/hd# zdb -e -p devs bank0
> Configuration for import:
>         vdev_children: 1
>         version: 26
>         pool_guid: 3936305481264476979
>         name: 'bank0'
>         state: 0
>         hostid: 661351
>         hostname: 'gir'
>         vdev_tree:
>             type: 'root'
>             id: 0
>             guid: 3936305481264476979
>             children[0]:
>                 type: 'raidz'
>                 id: 0
>                 guid: 10967243523656644777
>                 nparity: 1
>                 metaslab_array: 23
>                 metaslab_shift: 35
>                 ashift: 9
>                 asize: 6001161928704
>                 is_log: 0
>                 create_txg: 4
>                 children[0]:
>                     type: 'disk'
>                     id: 0
>                     guid: 13554115250875315903
>                     phys_path: '/pci@0,0/pci1002,4391@11/disk@3,0:q'
>                     whole_disk: 0
>                     DTL: 57
>                     create_txg: 4
>                     path: '/bank3/hd/devs/loop0'
>                 children[1]:
>                     type: 'disk'
>                     id: 1
>                     guid: 17894226827518944093
>                     phys_path: '/pci@0,0/pci1002,4391@11/disk@0,0:q'
>                     whole_disk: 0
>                     DTL: 62
>                     create_txg: 4
>                     path: '/bank3/hd/devs/loop1'
>                 children[2]:
>                     type: 'disk'
>                     id: 2
>                     guid: 9087312107742869669
>                     phys_path: '/pci@0,0/pci1002,4391@11/disk@1,0:q'
>                     whole_disk: 0
>                     DTL: 61
>                     create_txg: 4
>                     faulted: 1
>                     aux_state: 'err_exceeded'
>                     path: '/bank3/hd/devs/loop2'
>                 children[3]:
>                     type: 'disk'
>                     id: 3
>                     guid: 13297176051223822304
>                     path: '/dev/dsk/c10t2d0p0'
>                     devid: 'id1,sd@SATA_ST31500341AS9VS32K25/q'
>                     phys_path: '/pci@0,0/pci1002,4391@11/disk@2,0:q'
>                     whole_disk: 0
>                     DTL: 60
>                     create_txg: 4
> 
> zdb: can't open 'bank0': No such file or directory

[zfs-discuss] ZFS Recovery: What do I try next?

2011-11-05 Thread Myers Carpenter
I would like to pick the brains of the ZFS experts on this list: What
would you do next to try and recover this zfs pool?

I have a ZFS RAIDZ1 pool named bank0 that I cannot import.  It was
composed of 4 1.5 TiB disks.  One disk is totally dead.  Another had
SMART errors, but using GNU ddrescue I was able to copy all the data
off successfully.

I have copied all 3 remaining disks as images using dd onto another
filesystem.  Using the loopback filesystem I can treat these
images as if they were real disks.  I've made a snapshot of the
filesystem the disk images are on so that I can try things and
roll back the changes if needed.
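
Roughly, the imaging setup looks like this (device names, image paths,
and the dataset name are illustrative, not my exact commands):

# rescue the drive with SMART errors, keeping a map file so the copy
# can be resumed and retried
ddrescue -d -r3 /dev/sdc /bank3/hd/disk2.img /bank3/hd/disk2.map

# plain dd for the drives that read cleanly
dd if=/dev/sdb of=/bank3/hd/disk1.img bs=1M conv=noerror,sync

# attach each image to a loop device so the zfs tools see it as a disk
losetup /dev/loop0 /bank3/hd/disk1.img

# snapshot the dataset holding the images so experiments can be undone
zfs snapshot bank3/hd@pre-recovery
# ... try something destructive, then if needed:
zfs rollback bank3/hd@pre-recovery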

"gir" is the computer these disks are hosted on.  It used to be a
Nexenta server, but is now Ubuntu 11.10 with the zfs on linux modules.

I have tried booting up Solaris Express 11 Live CD and doing "zpool
import -fFX bank0" which ran for ~6 hours and put out: "one or more
devices is currently unavailable"

I have tried "zpool import -fFX bank0" on linux with the same results.

I have tried moving the drives back into the controller config they
were in before, and booted my old Nexenta root disk where the
/etc/zfs/zpool.cache still had an entry for bank0.  I was not able to
get the filesystems mounted.  I can't remember what errors I got.  I can
do it again if the errors might be useful.

Here is the output of the different utils:

root@gir:/bank3/hd# zpool import -d devs
  pool: bank0
    id: 3936305481264476979
 state: FAULTED
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        bank0          FAULTED  corrupted data
          raidz1-0     DEGRADED
            loop0      ONLINE
            loop1      ONLINE
            loop2      ONLINE
            c10t2d0p0  UNAVAIL


root@gir:/bank3/hd# zpool import -d devs bank0
cannot import 'bank0': pool may be in use from other system, it was
last accessed by gir (hostid: 0xa1767) on Mon Oct 24 15:50:23 2011
use '-f' to import anyway


root@gir:/bank3/hd# zpool import -f -d devs bank0
cannot import 'bank0': I/O error
Destroy and re-create the pool from
a backup source.

root@gir:/bank3/hd# zdb -e -p devs bank0
Configuration for import:
        vdev_children: 1
        version: 26
        pool_guid: 3936305481264476979
        name: 'bank0'
        state: 0
        hostid: 661351
        hostname: 'gir'
        vdev_tree:
            type: 'root'
            id: 0
            guid: 3936305481264476979
            children[0]:
                type: 'raidz'
                id: 0
                guid: 10967243523656644777
                nparity: 1
                metaslab_array: 23
                metaslab_shift: 35
                ashift: 9
                asize: 6001161928704
                is_log: 0
                create_txg: 4
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 13554115250875315903
                    phys_path: '/pci@0,0/pci1002,4391@11/disk@3,0:q'
                    whole_disk: 0
                    DTL: 57
                    create_txg: 4
                    path: '/bank3/hd/devs/loop0'
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 17894226827518944093
                    phys_path: '/pci@0,0/pci1002,4391@11/disk@0,0:q'
                    whole_disk: 0
                    DTL: 62
                    create_txg: 4
                    path: '/bank3/hd/devs/loop1'
                children[2]:
                    type: 'disk'
                    id: 2
                    guid: 9087312107742869669
                    phys_path: '/pci@0,0/pci1002,4391@11/disk@1,0:q'
                    whole_disk: 0
                    DTL: 61
                    create_txg: 4
                    faulted: 1
                    aux_state: 'err_exceeded'
                    path: '/bank3/hd/devs/loop2'
                children[3]:
                    type: 'disk'
                    id: 3
                    guid: 13297176051223822304
                    path: '/dev/dsk/c10t2d0p0'
                    devid: 'id1,sd@SATA_ST31500341AS9VS32K25/q'
                    phys_path: '/pci@0,0/pci1002,4391@11/disk@2,0:q'
                    whole_disk: 0
                    DTL: 60
                    create_txg: 4

zdb: can't open 'bank0': No such file or directory