Re: [zfs-discuss] zfs corruptions in pool

2010-06-08 Thread Toby Thain


On 6-Jun-10, at 7:11 AM, Thomas Maier-Komor wrote:


On 06.06.2010 08:06, devsk wrote:
I had an unclean shutdown because of a hang and suddenly my pool is  
degraded (I realized something is wrong when python dumped core a  
couple of times).


This is before I ran scrub:

  pool: mypool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h7m with 0 errors on Mon May 31 09:00:27 2010

config:

        NAME        STATE     READ WRITE CKSUM
        mypool      DEGRADED     0     0     0
          c6t0d0s0  DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

   mypool/ROOT/May25-2010-Image-Update:0x3041e
   mypool/ROOT/May25-2010-Image-Update:0x31524
   mypool/ROOT/May25-2010-Image-Update:0x26d24
   mypool/ROOT/May25-2010-Image-Update:0x37234
   //var/pkg/download/d6/d6be0ef348e3c81f18eca38085721f6d6503af7a
   mypool/ROOT/May25-2010-Image-Update:0x25db3
   //var/pkg/download/cb/cbb0ff02bcdc6649da3763900363de7cff78ec72
   mypool/ROOT/May25-2010-Image-Update:0x26cf6


I ran a scrub and this is what it has to say afterwards.

  pool: mypool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0h11m with 0 errors on Sat Jun  5 22:43:54 2010

config:

        NAME        STATE     READ WRITE CKSUM
        mypool      DEGRADED     0     0     0
          c6t0d0s0  DEGRADED     0     0     0  too many errors

errors: No known data errors

A few questions:

1. Have the errors really gone away? Can I just clear and be content that
the errors are really gone?


2. Why did the errors occur anyway if ZFS guarantees on-disk consistency? I
wasn't writing anything. Those files were definitely not being touched when
the hang and unclean shutdown happened.


I mean, I don't mind if I create or modify a file and it doesn't land on
disk because an unclean shutdown happened, but a bunch of unrelated files
getting corrupted is sort of painful to digest.


3. The action says 'Determine if the device needs to be replaced.'
How the heck do I do that?



Is it possible that this system runs in VirtualBox? At least I've seen
such a thing happen in VirtualBox, but never on a real machine.


As I postulated in the relevant forum thread:
http://forums.virtualbox.org/viewtopic.php?t=13661
(can't verify the URL, the site seems down for me at the moment)



The reason the errors have gone away might be that metadata has three
copies, IIRC. So if your disk only had corruption in the metadata area,
these errors can be repaired by scrubbing the pool.

The smartmontools might help you figure out whether the disk is broken. But
if you only had an unexpected shutdown and now everything is clean after a
scrub, I wouldn't expect the disk to be broken. You can get the
smartmontools from opencsw.org.

If your system is really running in VirtualBox, I'd recommend that you
turn off VirtualBox's disk write caching.


Specifically, stop it from ignoring cache flushes. Caching is irrelevant
if flushes are being correctly handled.
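
For reference, the knob in question is the IgnoreFlush extradata setting
described in the VirtualBox manual. A hedged sketch of passing guest flush
requests through for the first disk on an IDE controller (the VM name and
LUN number below are placeholders; for a SATA controller the device key is
'ahci' rather than 'piix3ide'):

   VBoxManage setextradata "OpenSolarisVM" \
       "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0

Setting the value to 0 tells VirtualBox to stop ignoring flush requests;
for IDE, flushes are ignored by default, if I recall the manual correctly.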


ZFS isn't the only software system that will suffer inconsistencies/corruption
in the guest if flushes are ignored, of course.


--Toby



Search the OpenSolaris forum of VirtualBox. There is an article somewhere
on how to do this. IIRC the subject is something like 'zfs pool corruption'.
But it is also somewhere in the docs.

HTH,
Thomas




Re: [zfs-discuss] zfs corruptions in pool

2010-06-06 Thread Thomas Maier-Komor
On 06.06.2010 08:06, devsk wrote:
 I had an unclean shutdown because of a hang and suddenly my pool is degraded 
 (I realized something is wrong when python dumped core a couple of times).
 
 This is before I ran scrub:
 
   pool: mypool
  state: DEGRADED
 status: One or more devices has experienced an error resulting in data
 corruption.  Applications may be affected.
 action: Restore the file in question if possible.  Otherwise restore the
 entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0h7m with 0 errors on Mon May 31 09:00:27 2010
 config:
 
 NAME        STATE     READ WRITE CKSUM
 mypool      DEGRADED     0     0     0
   c6t0d0s0  DEGRADED     0     0     0  too many errors
 
 errors: Permanent errors have been detected in the following files:
 
 mypool/ROOT/May25-2010-Image-Update:0x3041e
 mypool/ROOT/May25-2010-Image-Update:0x31524
 mypool/ROOT/May25-2010-Image-Update:0x26d24
 mypool/ROOT/May25-2010-Image-Update:0x37234
 //var/pkg/download/d6/d6be0ef348e3c81f18eca38085721f6d6503af7a
 mypool/ROOT/May25-2010-Image-Update:0x25db3
 //var/pkg/download/cb/cbb0ff02bcdc6649da3763900363de7cff78ec72
 mypool/ROOT/May25-2010-Image-Update:0x26cf6
 
 
 I ran a scrub and this is what it has to say afterwards.
 
   pool: mypool
  state: DEGRADED
 status: One or more devices has experienced an unrecoverable error.  An
 attempt was made to correct the error.  Applications are unaffected.
 action: Determine if the device needs to be replaced, and clear the errors
 using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0h11m with 0 errors on Sat Jun  5 22:43:54 2010
 config:
 
 NAME        STATE     READ WRITE CKSUM
 mypool      DEGRADED     0     0     0
   c6t0d0s0  DEGRADED     0     0     0  too many errors
 
 errors: No known data errors
 
 A few questions:
 
 1. Have the errors really gone away? Can I just clear and be content that 
 the errors are really gone?
 
 2. Why did the errors occur anyway if ZFS guarantees on-disk consistency? I 
 wasn't writing anything. Those files were definitely not being touched when 
 the hang and unclean shutdown happened.
 
 I mean, I don't mind if I create or modify a file and it doesn't land on disk 
 because an unclean shutdown happened, but a bunch of unrelated files getting 
 corrupted is sort of painful to digest.
 
 3. The action says 'Determine if the device needs to be replaced.' How the 
 heck do I do that?


Is it possible that this system runs in VirtualBox? At least I've seen
such a thing happen in VirtualBox, but never on a real machine.

The reason the errors have gone away might be that metadata has three
copies, IIRC. So if your disk only had corruption in the metadata area,
these errors can be repaired by scrubbing the pool.
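
If that is what happened here, a follow-up scrub plus 'zpool clear' should
be all that is needed. A minimal sketch using the pool name from this thread
(run as root; note that 'zpool clear' only resets the error counters and the
DEGRADED marker, it does not repair anything by itself):

   zpool scrub mypool        # re-read and verify every block in the pool
   zpool status -v mypool    # once the scrub finishes, review the result
   zpool clear mypool        # clear error counters and the DEGRADED state
   zpool status -x           # should now report the pool as healthy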

The smartmontools might help you figure out whether the disk is broken. But
if you only had an unexpected shutdown and now everything is clean after
a scrub, I wouldn't expect the disk to be broken. You can get the
smartmontools from opencsw.org.
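
For example, something along these lines (assuming the OpenCSW pkgutil
bootstrap is already installed under /opt/csw; the exact device path, and
whether smartctl needs a '-d' device-type option, depends on the controller):

   /opt/csw/bin/pkgutil -i smartmontools    # install smartmontools from OpenCSW
   smartctl -H /dev/rdsk/c6t0d0s0           # quick overall health verdict
   smartctl -a /dev/rdsk/c6t0d0s0           # full SMART attributes and error log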

If your system is really running in VirtualBox, I'd recommend that you turn
off VirtualBox's disk write caching. Search the OpenSolaris forum of
VirtualBox. There is an article somewhere on how to do this. IIRC the
subject is something like 'zfs pool corruption'. But it is also
somewhere in the docs.
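
The per-controller 'Use host I/O cache' option can also be toggled from the
host command line in 3.2-era releases, if I recall correctly. A hedged
sketch, where the VM and controller names are placeholders:

   VBoxManage storagectl "OpenSolarisVM" --name "SATA Controller" --hostiocache off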

HTH,
Thomas


Re: [zfs-discuss] zfs corruptions in pool

2010-06-06 Thread Bob Friesenhahn

On Sun, 6 Jun 2010, Roy Sigurd Karlsbakk wrote:


I mean, I don't mind if I create or modify a file and it doesn't land
on disk because an unclean shutdown happened, but a bunch of unrelated
files getting corrupted is sort of painful to digest.


ZFS guarantees consistency in a redundant setup, but it looks like
your pool only consists of one drive, meaning zero redundancy.


This is not a true statement.  Redundancy is not required for
consistency.  Consistency is assured by zfs writing transaction groups
in order and committing the data to disk prior to transitioning to the
next transaction group.  If the disk fails to sync its cache and
writes data out of order (data from multiple transaction groups), then
zfs loses consistency.
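
As an illustration of that ordering, the transaction group commits can be
watched on a live OpenSolaris system with DTrace. A rough sketch, assuming
the fbt provider exposes spa_sync() with typed arguments on the build in
question:

   # print a line each time ZFS commits a transaction group to disk
   dtrace -qn 'fbt::spa_sync:entry { printf("%Y  spa_sync txg %d\n", walltimestamp, args[1]); }'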


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/


Re: [zfs-discuss] zfs corruptions in pool

2010-06-06 Thread devsk
I think both Bob and Thomas have it right. I am using VirtualBox and just
checked: the host I/O is cached on the SATA controller, although I thought I had
it enabled (this is VB-3.2.0).

Let me run in this mode for a while and see if this happens again.
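
For what it's worth, the host-side configuration can be double-checked from
the command line as well. A hedged sketch (the VM name is a placeholder):

   VBoxManage showvminfo "OpenSolarisVM"               # lists storage controllers and attached disks
   VBoxManage getextradata "OpenSolarisVM" enumerate   # lists any extradata keys, e.g. IgnoreFlush overrides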
-- 
This message posted from opensolaris.org