Re: [zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk

2010-03-09 Thread Victor Latushkin

Christian Hessmann wrote:

Victor,


Btw, they affect some files referenced by snapshots as
'zpool status -v' suggests:

  tank/DVD:0x9cd
  tank/d...@2010025100:/Memento.m4v
  tank/d...@2010025100:/Payback.m4v
  tank/d...@2010025100:/TheManWhoWasntThere.m4v

In case of OpenSolaris it is not that difficult to work around this bug
without getting rid of the files with errors (or the snapshots referencing them),
but I'm not sure how to do the same on FreeBSD.
But you always have the option of destroying the snapshots indicated above (and
maybe more).


I'm still reluctant to reboot the machine, so what I did now was, as you
suggested, destroy these snapshots (after deleting the files from the
current filesystem, of course).
I'm not so sure the result is good, though:

===
[r...@camelot /tank/DVD]# zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 10h42m with 136 errors on Tue Mar  2
07:55:05 2010
config:

NAME   STATE READ WRITE CKSUM
tank   DEGRADED   137 0 0
  raidz1   ONLINE   0 0 0
ad17p2 ONLINE   0 0 0
ad18p2 ONLINE   0 0 0
ad20p2 ONLINE   0 0 0
  raidz1   DEGRADED   326 0 0
replacing  DEGRADED 0 0 0
  ad16p2   OFFLINE  2  241K 6
  ad4p2    ONLINE   0 0 0  839G resilvered
ad14p2 ONLINE   0 0 0  5.33G resilvered
ad15p2 ONLINE 418 0 0  5.33G resilvered

errors: Permanent errors have been detected in the following files:

tank/DVD:0x9cd
0x2064:0x25a4
0x20ae:0x503
0x20ae:0x9cd
===

Is any further information available on these hex messages?


This tells you that ZFS can no longer map the object numbers from the error log into
meaningful names, which is expected, since you have destroyed the files and snapshots
they referred to.
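
If you are curious, zdb can sometimes map the remaining numbers back to something
readable (a sketch, assuming FreeBSD's zdb behaves like the OpenSolaris one and that
the object still exists; it takes the object number in decimal, and 0x9cd is 2509):

zdb -dddd tank/DVD 2509    # dump object 0x9cd of tank/DVD; for a plain file this prints its path

The entries where the dataset part is itself a hex number refer to the snapshots you
have already destroyed, so there is nothing left to look up for those.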


Now you need to rerun a scrub.
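
A minimal sequence would be something like this (zpool clear is optional and only
resets the per-device READ/WRITE/CKSUM counters; it may take a second scrub before the
stale entries disappear from the error log):

zpool scrub tank        # re-read and verify every block, rebuilding the error log
zpool status -v tank    # watch progress and check whether the old entries are gone
zpool clear tank        # optional: reset the error counters shown in the config above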

regards,
victor

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk

2010-03-05 Thread Victor Latushkin

Mark J Musante wrote:

It looks like you're running into a DTL issue.  ZFS believes that ad16p2 has
some data on it that hasn't been copied off yet, and it's not considering the
fact that it's part of a raidz group and that ad4p2 has already resilvered that data.

There is a CR on this,
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6909724 but what's
viewable in the bug database is pretty minimal.

If you haven't made a backup yet (or at least done a complete snapshot and
generated a send stream from it), my advice would be to do that now.  Then
reboot and see if that clears the DTL enough to let you do the detach.


Actually, besides the bug mentioned above, resilvering will not clear DTLs upon 
completion due to


6887372 DTLs not cleared after resilver if permanent errors present

as there are permanent errors present. Btw, they affect some files referenced by 
snapshots as 'zpool status -v' suggests:


 tank/DVD:0x9cd
 tank/d...@2010025100:/Memento.m4v
 tank/d...@2010025100:/Payback.m4v
 tank/d...@2010025100:/TheManWhoWasntThere.m4v

In case of OpenSolaris it is not that difficult to work around this bug without 
getting rid of the files with errors (or the snapshots referencing them), but I'm not 
sure how to do the same on FreeBSD.


But you always have the option of destroying the snapshots indicated above (and maybe 
more).
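
If you go that route, the snapshots are easy to find and remove (a sketch; the snapshot 
name below is a placeholder, use the ones your pool actually has):

zfs list -t snapshot -r tank/DVD     # list every snapshot of the affected filesystem
zfs destroy tank/DVD@somesnapshot    # placeholder name; repeat for each snapshot that
                                     # still references one of the damaged files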

regards,
victor





On 3 Mar, 2010, at 18.46, Christian Heßmann wrote:


Hello guys,


I've already written this on the FreeBSD forums, but so far, the feedback
is not so great - seems FreeBSD guys aren't that keen on ZFS. I have some
hopes you'll be more experienced with this kind of error:

I have a ZFS pool comprised of two 3-disk RAIDs which I've recently moved
from OS X to FreeBSD (8 stable).

One hard disk failed last weekend with lots of shouting, SMART messages and
even a kernel panic. I attached a new disk and started the replacement. 
Unfortunately, about 20% into the replacement, a second disk in the same

RAID showed signs of misbehaviour by giving me read errors. The resilvering
did finish, though, and it left me with only three broken files according
to zpool status:

[r...@camelot /]# zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 10h42m with 136 errors on Tue Mar  2 07:55:05 2010
config:

NAME   STATE READ WRITE CKSUM
tank   DEGRADED   137 0 0
  raidz1   ONLINE   0 0 0
ad17p2 ONLINE   0 0 0
ad18p2 ONLINE   0 0 0
ad20p2 ONLINE   0 0 0
  raidz1   DEGRADED   326 0 0
replacing  DEGRADED 0 0 0
  ad16p2   OFFLINE  2  169K 6
  ad4p2    ONLINE   0 0 0  839G resilvered
ad14p2 ONLINE   0 0 0  5.33G resilvered
ad15p2 ONLINE 418 0 0  5.33G resilvered

errors: Permanent errors have been detected in the following files:

tank/DVD:0x9cd
tank/d...@2010025100:/Memento.m4v
tank/d...@2010025100:/Payback.m4v
tank/d...@2010025100:/TheManWhoWasntThere.m4v


I have the feeling the problems on ad15p2 are related to a cable issue,
since it doesn't have any SMART errors, is quite a new drive (3 months old)
and was IMHO sufficiently burned in by repeatedly filling it to the brim
and checking the contents (via ZFS). So I'd like to switch off the server,
replace the cable and do a scrub afterwards to make sure it doesn't produce
additional errors.

Unfortunately, although it says the resilvering completed, I can't detach
ad16p2 (the first faulted disk) from the system:

[r...@camelot /]# zpool detach tank ad16p2
cannot detach ad16p2: no valid replicas

To be honest, I don't know how to proceed now. It feels like my system is
in a very unstable state right now, with a replacement not yet finished and
errors on two drives in one RAID.Z1.

I deleted the files affected, but have about 20 snapshots of this
filesystem and think these files are in most of them since they're quite
old.

So, what should I do now? Delete all snapshots? Move all other files from
this filesystem to a new filesystem and destroy the old filesystem? Try to
export and import the pool? Is it even safe to reboot the machine right
now?

I got one response in the FreeBSD Forum telling me I should reboot the
machine and do a scrub afterwards, it should then detect that it doesn't
need the old disk anymore - I am a bit reluctant doing that, to be
honest...

Any help would be appreciated.

Thank you.

Christian
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org 
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk

2010-03-05 Thread Christian Hessmann
Victor,

 Btw, they affect some files referenced by snapshots as
 'zpool status -v' suggests:

   tank/DVD:0x9cd
   tank/d...@2010025100:/Memento.m4v
   tank/d...@2010025100:/Payback.m4v
   tank/d...@2010025100:/TheManWhoWasntThere.m4v

 In case of OpenSolaris it is not that difficult to work around this bug
 without getting rid of the files with errors (or the snapshots referencing them),
 but I'm not sure how to do the same on FreeBSD.
 But you always have the option of destroying the snapshots indicated above (and
 maybe more).

I'm still reluctant to reboot the machine, so what I did now was, as you
suggested, destroy these snapshots (after deleting the files from the
current filesystem, of course).
I'm not so sure the result is good, though:

===
[r...@camelot /tank/DVD]# zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 10h42m with 136 errors on Tue Mar  2
07:55:05 2010
config:

NAME   STATE READ WRITE CKSUM
tank   DEGRADED   137 0 0
  raidz1   ONLINE   0 0 0
ad17p2 ONLINE   0 0 0
ad18p2 ONLINE   0 0 0
ad20p2 ONLINE   0 0 0
  raidz1   DEGRADED   326 0 0
replacing  DEGRADED 0 0 0
  ad16p2   OFFLINE  2  241K 6
  ad4p2    ONLINE   0 0 0  839G resilvered
ad14p2 ONLINE   0 0 0  5.33G resilvered
ad15p2 ONLINE 418 0 0  5.33G resilvered

errors: Permanent errors have been detected in the following files:

tank/DVD:0x9cd
0x2064:0x25a4
0x20ae:0x503
0x20ae:0x9cd
===

Is any further information available on these hex messages?


Regards
Christian

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk

2010-03-04 Thread Mark J Musante
It looks like you're running into a DTL issue.  ZFS believes that ad16p2 has 
some data on it that hasn't been copied off yet, and it's not considering the 
fact that it's part of a raidz group and that ad4p2 has already resilvered that data.

There is a CR on this, 
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6909724 but what's 
viewable in the bug database is pretty minimal.

If you haven't made a backup yet (or at least done a complete snapshot and 
generated a send stream from it), my advice would be to do that now.  Then 
reboot and see if that clears the DTL enough to let you do the detach.
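
As a rough sketch (the snapshot name and destination are placeholders, and -R assumes a 
ZFS version new enough for replication streams):

zfs snapshot -r tank@pre-reboot                      # recursive snapshot of the whole pool
zfs send -R tank@pre-reboot > /somewhere/tank.send   # or pipe it over ssh to another box
# after the reboot, retry the detach:
zpool detach tank ad16p2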


On 3 Mar, 2010, at 18.46, Christian Heßmann wrote:

 Hello guys,
 
 
 I've already written this on the FreeBSD forums, but so far, the feedback is 
 not so great - seems FreeBSD guys aren't that keen on ZFS. I have some hopes 
 you'll be more experienced with this kind of error:
 
 I have a ZFS pool comprised of two 3-disk RAIDs which I've recently moved 
 from OS X to FreeBSD (8 stable).
 
 One hard disk failed last weekend with lots of shouting, SMART messages and 
 even a kernel panic.
 I attached a new disk and started the replacement.
 Unfortunately, about 20% into the replacement, a second disk in the same RAID 
 showed signs of misbehaviour by giving me read errors. The resilvering did 
 finish, though, and it left me with only three broken files according to 
 zpool status:
 
 [r...@camelot /]# zpool status -v tank
  pool: tank
 state: DEGRADED
 status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
 action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 10h42m with 136 errors on Tue Mar  2 07:55:05 
 2010
 config:
 
NAME   STATE READ WRITE CKSUM
tank   DEGRADED   137 0 0
  raidz1   ONLINE   0 0 0
ad17p2 ONLINE   0 0 0
ad18p2 ONLINE   0 0 0
ad20p2 ONLINE   0 0 0
  raidz1   DEGRADED   326 0 0
replacing  DEGRADED 0 0 0
  ad16p2   OFFLINE  2  169K 6
  ad4p2    ONLINE   0 0 0  839G resilvered
ad14p2 ONLINE   0 0 0  5.33G resilvered
ad15p2 ONLINE 418 0 0  5.33G resilvered
 
 errors: Permanent errors have been detected in the following files:
 
tank/DVD:0x9cd
tank/d...@2010025100:/Memento.m4v
tank/d...@2010025100:/Payback.m4v
tank/d...@2010025100:/TheManWhoWasntThere.m4v
 
 I have the feeling the problems on ad15p2 are related to a cable issue, since 
 it doesn't have any SMART errors, is quite a new drive (3 months old) and was 
 IMHO sufficiently burned in by repeatedly filling it to the brim and 
 checking the contents (via ZFS). So I'd like to switch off the server, 
 replace the cable and do a scrub afterwards to make sure it doesn't produce 
 additional errors.
 
 Unfortunately, although it says the resilvering completed, I can't detach 
 ad16p2 (the first faulted disk) from the system:
 
 [r...@camelot /]# zpool detach tank ad16p2
 cannot detach ad16p2: no valid replicas
 
 To be honest, I don't know how to proceed now. It feels like my system is in 
 a very unstable state right now, with a replacement not yet finished and 
 errors on two drives in one RAID.Z1.
 
 I deleted the files affected, but have about 20 snapshots of this filesystem 
 and think these files are in most of them since they're quite old.
 
 So, what should I do now? Delete all snapshots? Move all other files from 
 this filesystem to a new filesystem and destroy the old filesystem? Try to 
 export and import the pool? Is it even safe to reboot the machine right now?
 
 I got one response in the FreeBSD Forum telling me I should reboot the 
 machine and do a scrub afterwards, it should then detect that it doesn't need 
 the old disk anymore - I am a bit reluctant doing that, to be honest...
 
 Any help would be appreciated.
 
 Thank you.
 
 Christian
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk

2010-03-03 Thread Christian Heßmann

Hello guys,


I've already written this on the FreeBSD forums, but so far, the  
feedback is not so great - seems FreeBSD guys aren't that keen on ZFS.  
I have some hopes you'll be more experienced with this kind of error:


I have a ZFS pool comprised of two 3-disk RAIDs which I've recently  
moved from OS X to FreeBSD (8 stable).


One hard disk failed last weekend with lots of shouting, SMART messages  
and even a kernel panic.

I attached a new disk and started the replacement.
Unfortunately, about 20% into the replacement, a second disk in the  
same RAID showed signs of misbehaviour by giving me read errors. The  
resilvering did finish, though, and it left me with only three broken  
files according to zpool status:


[r...@camelot /]# zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 10h42m with 136 errors on Tue Mar  2  
07:55:05 2010

config:

NAME   STATE READ WRITE CKSUM
tank   DEGRADED   137 0 0
  raidz1   ONLINE   0 0 0
ad17p2 ONLINE   0 0 0
ad18p2 ONLINE   0 0 0
ad20p2 ONLINE   0 0 0
  raidz1   DEGRADED   326 0 0
replacing  DEGRADED 0 0 0
  ad16p2   OFFLINE  2  169K 6
  ad4p2    ONLINE   0 0 0  839G resilvered
ad14p2 ONLINE   0 0 0  5.33G resilvered
ad15p2 ONLINE 418 0 0  5.33G resilvered

errors: Permanent errors have been detected in the following files:

tank/DVD:0x9cd
tank/d...@2010025100:/Memento.m4v
tank/d...@2010025100:/Payback.m4v
tank/d...@2010025100:/TheManWhoWasntThere.m4v

I have the feeling the problems on ad15p2 are related to a cable  
issue, since it doesn't have any SMART errors, is quite a new drive (3  
months old) and was IMHO sufficiently burned in by repeatedly  
filling it to the brim and checking the contents (via ZFS). So I'd  
like to switch off the server, replace the cable and do a scrub  
afterwards to make sure it doesn't produce additional errors.


Unfortunately, although it says the resilvering completed, I can't  
detach ad16p2 (the first faulted disk) from the system:


[r...@camelot /]# zpool detach tank ad16p2
cannot detach ad16p2: no valid replicas

To be honest, I don't know how to proceed now. It feels like my system  
is in a very unstable state right now, with a replacement not yet  
finished and errors on two drives in one RAID.Z1.


I deleted the files affected, but have about 20 snapshots of this  
filesystem and think these files are in most of them since they're  
quite old.


So, what should I do now? Delete all snapshots? Move all other files  
from this filesystem to a new filesystem and destroy the old  
filesystem? Try to export and import the pool? Is it even safe to  
reboot the machine right now?


I got one response in the FreeBSD Forum telling me I should reboot the  
machine and do a scrub afterwards; it should then detect that it  
doesn't need the old disk anymore - I am a bit reluctant to do that,  
to be honest...


Any help would be appreciated.

Thank you.

Christian
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk

2010-03-03 Thread Bob Friesenhahn

On Thu, 4 Mar 2010, Christian Heßmann wrote:


I've already written this on the FreeBSD forums, but so far, the feedback is 
not so great - seems FreeBSD guys aren't that keen on ZFS. I have some hopes


I see lots and lots of zfs traffic on the discussion list 
freebsd...@freebsd.org.  This is where the FreeBSD filesystem 
developers hang out.



raidz1   DEGRADED   326 0 0
  replacing  DEGRADED 0 0 0
ad16p2   OFFLINE  2  169K 6
ad4p2    ONLINE   0 0 0  839G resilvered
  ad14p2 ONLINE   0 0 0  5.33G resilvered
  ad15p2 ONLINE 418 0 0  5.33G resilvered

Unfortunately, although it says the resilvering completed, I can't detach 
ad16p2 (the first faulted disk) from the system:


The zpool status you posted shows that ad16p2 is still in 'replacing' 
mode.  If this is still the case, then it could be a reason that the 
original disk can't yet be removed.


To be honest, I don't know how to proceed now. It feels like my system is in 
a very unstable state right now, with a replacement not yet finished and 
errors on two drives in one RAID.Z1.


If it is still in 'replacing' mode then it seems that the best policy 
is to just wait.  If there is no drive activity on ad4p2 then there 
may be something more wrong.
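
An easy way to check (zpool iostat works anywhere, gstat is the FreeBSD GEOM 
statistics tool):

zpool iostat -v tank 5    # per-device read/write activity, refreshed every 5 seconds
gstat                     # FreeBSD-wide per-disk I/O view; look for activity on ad4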


Cold booting a system can be one of the scariest things to do, so it 
should be a means of last resort.  Maybe the system would not come 
back.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk

2010-03-03 Thread Freddie Cash
On Wed, Mar 3, 2010 at 5:57 PM, Bob Friesenhahn 
bfrie...@simple.dallas.tx.us wrote:

 On Thu, 4 Mar 2010, Christian Heßmann wrote:

 To be honest, I don't know how to proceed now. It feels like my system is
 in a very unstable state right now, with a replacement not yet finished and
 errors on two drives in one RAID.Z1.


 If it is still in 'replacing' mode then it seems that the best policy is to
 just wait.  If there is no drive activity on ad4p2 then there may be
 something more wrong.

 Cold booting a system can be one of the scariest things to do so it should
 be a means of last resort.  Maybe the system would not come back.


We've had this happen a couple of times on our FreeBSD-based storage
servers.  Rebooting and manually running a scrub has fixed the issue each
time.

24x 500 GB SATA drives in 3x raidz2 vdevs of 8 drives each

-- 
Freddie Cash
fjwc...@gmail.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk

2010-03-03 Thread Christian Heßmann

On 04.03.2010, at 02:57, Bob Friesenhahn wrote:

I see lots and lots of zfs traffic on the discussion list freebsd...@freebsd.org.
This is where the FreeBSD filesystem developers hang out.


Thanks - I'll have a look there. As usual, the cool kids are in  
mailing lists... ;-)



The zpool status you posted shows that ad16p2 is still in  
'replacing' mode.  If this is still the case, then it could be a  
reason that the original disk can't yet be removed.

[...]
If it is still in 'replacing' mode then it seems that the best  
policy is to just wait.  If there is no drive activity on ad4p2 then  
there may be something more wrong.


It bothers me as well that it says 'replacing' instead of 'replaced' or  
whatever else it should say. Since the resilvering completed, I don't  
have any activity on the drives anymore, so I presume it somehow  
thinks it's done.



Cold booting a system can be one of the scariest things to do so it  
should be a means of last resort.  Maybe the system would not come  
back.


That's my fear. Although from what I can gather from the feedback so  
far the FreeBSD users seem somewhat familiar with an error like that  
and recommend rebooting. I might take the majority advice, make a  
backup of the important parts of the pool and just go for a reboot.


I might go for another repost to the freebsd-fs list first, though,  
so please bear with me if you have to read this again...


Thanks.

Christian
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss