Re: [zfs-discuss] Does zpool clear delete corrupted files

2009-06-02 Thread Paul Choi
Hm. That's odd. zpool clear should have cleared the list of errors,
unless you were accessing files at the same time, in which case new
checksum errors would have been reported as the reads occurred.
As for zpool scrub, there's no benefit in your case: checksums are
already verified as you read from the zpool, and I assume you're going
to read every single file there is. zpool scrub is useful when you want
to verify checksums for the whole zpool, including files you haven't
read recently.


Well, good luck with your recovery efforts.

-Paul


Jonathan Loran wrote:


Well, I tried to clear the errors, but zpool clear didn't clear them.
I think the errors are in the metadata in such a way that they can't
be cleared.  I'm actually a bit scared to scrub it before I grab a
backup, so I'm going to do that first.  After the backup, I need to
break the mirror to pull the x4540 out, and I just hope that can
succeed.  If not, we'll be losing some data between the time the
backup is taken and I roll out the new storage.

Let this be a double warning to all you zfs-ers out there:  Make sure
you have redundancy at the zfs layer, and also do backups.
Unfortunately for me, penny pinching has precluded both for us until
now.


Jon

On Jun 1, 2009, at 4:19 PM, A Darren Dunham wrote:


On Mon, Jun 01, 2009 at 03:19:59PM -0700, Jonathan Loran wrote:


Kinda scary then.  Better make sure we delete all the bad files before  
I back it up.


That shouldn't be necessary.  Clearing the error count doesn't disable
checksums.  Every read is going to verify checksums on the file data
blocks.  If it can't find at least one copy with a valid checksum,
you should just get an I/O error trying to read the file, not invalid
data.

What's odd is we've checked a few hundred files, and most of them  
don't seem to have any corruption.  I'm thinking what's wrong is the  
metadata for these files is corrupted somehow, yet we can read them  
just fine.


Are you still getting errors?

--
Darren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss




- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 jlo...@ssl.berkeley.edu

- __/__/__/   AST:7731^29u18e3
 








___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Does zpool clear delete corrupted files

2009-06-01 Thread Jonathan Loran


Hi list,

First off:

# cat /etc/release
                       Solaris 10 6/06 s10x_u2wos_09a X86
           Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                            Assembled 09 June 2006

Here's an (almost) disaster scenario that came to life over the past
week.  We have a very large zpool containing over 30TB, composed
(foolishly) of three concatenated iSCSI SAN devices.  There's no
redundancy in this pool at the zfs level.  We are actually in the
process of migrating this to an x4540 + j4500 setup, but since the
x4540 is part of the existing pool, we need to mirror it, then detach
it so we can build out the replacement storage.


What happened was some time after I had attached the mirror to the  
x4540, the scsi_vhci/network connection went south, and the server  
panicked.  Since this system has been up, over the past 2.5 years,  
this has never happened before.  When we got the thing glued back  
together, it immediately started resilvering from the beginning, and  
reported about 1.9 million data errors.  The list from zpool status -v
gave over 883k bad files.  That is a small fraction (about 1%) of the
total number of files in this volume: over 80 million.


My question is this:  When we clear the pool with zpool clear, what
happens to all of the bad files?  Are they deleted from the pool, or
do the error counters just get reset, leaving the bad files intact?
I'm going to perform a full backup of this guy (not so easy on my
budget), and I would rather only get the good files.


Thanks,

Jon


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 jlo...@ssl.berkeley.edu
- __/__/__/   AST:7731^29u18e3




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Does zpool clear delete corrupted files

2009-06-01 Thread Paul Choi
zpool clear just clears the list of errors (and # of checksum errors) 
from its stats. It does not modify the filesystem in any manner. You run 
zpool clear to make the zpool forget that it ever had any issues.


-Paul

Jonathan Loran wrote:


Hi list,

First off:  


# cat /etc/release
   Solaris 10 6/06 s10x_u2wos_09a X86
  Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
   Use is subject to license terms.
Assembled 09 June 2006

Here's an (almost) disaster scenario that came to life over the past 
week.  We have a very large zpool containing over 30TB, composed 
(foolishly) of three concatenated iSCSI SAN devices.  There's no 
redundancy in this pool at the zfs level.  We are actually in the 
process of migrating this to an x4540 + j4500 setup, but since the 
x4540 is part of the existing pool, we need to mirror it, 
then detach it so we can build out the replacement storage.  

What happened was some time after I had attached the mirror to the 
x4540, the scsi_vhci/network connection went south, and the 
server panicked.  Since this system has been up, over the past 2.5 
years, this has never happened before.  When we got the thing glued 
back together, it immediately started resilvering from the beginning, 
and reported about 1.9 million data errors.  The list from zpool 
status -v gave over 883k bad files.  This is a small percentage of the 
total number of files in this volume: over 80 million (1%).  

My question is this:  When we clear the pool with zpool clear, what 
happens to all of the bad files?  Are they deleted from the pool, or 
do the error counters just get reset, leaving the bad files intact?
 I'm going to perform a full backup of this guy (not so easy on my 
budget), and I would rather only get the good files.


Thanks,

Jon


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 jlo...@ssl.berkeley.edu

- __/__/__/   AST:7731^29u18e3
 








___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Does zpool clear delete corrupted files

2009-06-01 Thread Jonathan Loran


Kinda scary then.  Better make sure we delete all the bad files before  
I back it up.


What's odd is we've checked a few hundred files, and most of them  
don't seem to have any corruption.  I'm thinking what's wrong is the  
metadata for these files is corrupted somehow, yet we can read them  
just fine.  I wish I could tell which ones are really bad, so we  
wouldn't have to recreate them unnecessarily.  They are mirrored in  
various places, or can be recreated via reprocessing, but recreating/ 
restoring that many files is no easy task.


Thanks,

Jon

On Jun 1, 2009, at 2:41 PM, Paul Choi wrote:

zpool clear just clears the list of errors (and # of checksum  
errors) from its stats. It does not modify the filesystem in any  
manner. You run zpool clear to make the zpool forget that it ever  
had any issues.


-Paul

Jonathan Loran wrote:


Hi list,

First off:
# cat /etc/release
  Solaris 10 6/06 s10x_u2wos_09a X86
 Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
  Use is subject to license terms.
   Assembled 09 June 2006

Here's an (almost) disaster scenario that came to life over the  
past week.  We have a very large zpool containing over 30TB,  
composed (foolishly) of three concatenated iSCSI SAN devices.   
There's no redundancy in this pool at the zfs level.  We are  
actually in the process of migrating this to an x4540 + j4500 setup, 
but since the x4540 is part of the existing pool, we need to mirror  
it, then detach it so we can build out the replacement storage.
What happened was some time after I had attached the mirror to the  
x4540, the scsi_vhci/network connection went south, and the server  
panicked.  Since this system has been up, over the past 2.5 years,  
this has never happened before.  When we got the thing glued back  
together, it immediately started resilvering from the beginning,  
and reported about 1.9 million data errors.  The list from zpool  
status -v gave over 883k bad files.  This is a small percentage of  
the total number of files in this volume: over 80 million (1%).
My question is this:  When we clear the pool with zpool clear, what  
happens to all of the bad files?  Are they deleted from the pool,  
or do the error counters just get reset, leaving the bad files
intact?  I'm going to perform a full backup of this guy (not so easy
on my budget), and I would rather only get the good files.


Thanks,

Jon


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 jlo...@ssl.berkeley.edu
- __/__/__/   AST:7731^29u18e3












- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 jlo...@ssl.berkeley.edu
- __/__/__/   AST:7731^29u18e3




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Does zpool clear delete corrupted files

2009-06-01 Thread Paul Choi
If you run zpool scrub on the zpool, it'll do its best to identify the
file(s) or filesystems/snapshots that have issues. Since your pool has
no redundancy, it won't be able to self-heal any checksum errors...
It'll take a long time, though, to scrub 30TB...


-Paul

Jonathan Loran wrote:


Kinda scary then.  Better make sure we delete all the bad files before 
I back it up.


What's odd is we've checked a few hundred files, and most of them 
don't seem to have any corruption.  I'm thinking what's wrong is the 
metadata for these files is corrupted somehow, yet we can read them 
just fine.  I wish I could tell which ones are really bad, so we 
wouldn't have to recreate them unnecessarily.  They are mirrored in 
various places, or can be recreated via reprocessing, but 
recreating/restoring that many files is no easy task.


Thanks,

Jon

On Jun 1, 2009, at 2:41 PM, Paul Choi wrote:

zpool clear just clears the list of errors (and # of checksum 
errors) from its stats. It does not modify the filesystem in any 
manner. You run zpool clear to make the zpool forget that it ever 
had any issues.


-Paul

Jonathan Loran wrote:


Hi list,

First off:
# cat /etc/release
  Solaris 10 6/06 s10x_u2wos_09a X86
 Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
  Use is subject to license terms.
   Assembled 09 June 2006

Here's an (almost) disaster scenario that came to life over the past 
week.  We have a very large zpool containing over 30TB, composed 
(foolishly) of three concatenated iSCSI SAN devices.  There's no 
redundancy in this pool at the zfs level.  We are actually in the 
process of migrating this to an x4540 + j4500 setup, but since the 
x4540 is part of the existing pool, we need to mirror it, then 
detach it so we can build out the replacement storage.
What happened was some time after I had attached the mirror to the 
x4540, the scsi_vhci/network connection went south, and the server 
panicked.  Since this system has been up, over the past 2.5 years, 
this has never happened before.  When we got the thing glued back 
together, it immediately started resilvering from the beginning, and 
reported about 1.9 million data errors.  The list from zpool status 
-v gave over 883k bad files.  This is a small percentage of the 
total number of files in this volume: over 80 million (1%).
My question is this:  When we clear the pool with zpool clear, what 
happens to all of the bad files?  Are they deleted from the pool, or 
do the error counters just get reset, leaving the bad files
intact?  I'm going to perform a full backup of this guy (not so easy
on my budget), and I would rather only get the good files.


Thanks,

Jon


- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 jlo...@ssl.berkeley.edu

- __/__/__/   AST:7731^29u18e3



 










- _/ _/  /   - Jonathan Loran -   -
-/  /   /IT Manager   -
-  _  /   _  / / Space Sciences Laboratory, UC Berkeley
-/  / /  (510) 643-5146 jlo...@ssl.berkeley.edu
- __/__/__/   AST:7731^29u18e3




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss





Re: [zfs-discuss] Does zpool clear delete corrupted files

2009-06-01 Thread Marion Hakanson
jlo...@ssl.berkeley.edu said:
 What's odd is we've checked a few hundred files, and most of them don't
 seem to have any corruption.  I'm thinking what's wrong is the metadata
 for these files is corrupted somehow, yet we can read them just fine.
 I wish I could tell which ones are really bad, so we wouldn't have to
 recreate them unnecessarily.  They are mirrored in various places, or
 can be recreated via reprocessing, but recreating/restoring that many
 files is no easy task.

You know, this sounds similar to what happened to me once when I did a
zpool offline to half of a mirror, changed a lot of stuff in the pool
(like adding 20GB of data to an 80GB pool), then zpool online, thinking
ZFS might be smart enough to sync up the changes that had happened
since offlining.

Instead, a bunch of bad files were reported.  Since I knew nothing was
wrong with the half of the mirror that had never been offlined, I just
did a zpool detach of the formerly offlined drive, zpool clear to
clear the error counts, zpool scrub to check for integrity, then
zpool attach to cause resilver to start from scratch.
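In commands, that sequence would be roughly the following (the pool
name and device names here are hypothetical; use your own):

```shell
# Drop the formerly-offlined half of the mirror.
zpool detach tank c1t1d0

# Reset the error counters on the surviving half.
zpool clear tank

# Verify the surviving half really is clean.
zpool scrub tank

# Re-attach the second device so a full resilver starts from scratch.
zpool attach tank c1t0d0 c1t1d0
```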

If this describes your situation, I guess the tricky part for you is to
now decide which half of your mirror is the good half.

There's always rsync -n -v -a -c ... to compare copies of files
that happen to reside elsewhere.  Slow but safe.
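For example (source path and destination host are hypothetical):

```shell
# -n: dry run, report differences without copying anything
# -v: list each file that differs
# -a: archive mode (recurse, preserve metadata)
# -c: compare by checksum rather than by size/mtime
rsync -n -v -a -c /tank/data/ backuphost:/copies/data/
```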

Regards,

Marion


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Does zpool clear delete corrupted files

2009-06-01 Thread A Darren Dunham
On Mon, Jun 01, 2009 at 03:19:59PM -0700, Jonathan Loran wrote:
 
 Kinda scary then.  Better make sure we delete all the bad files before  
 I back it up.

That shouldn't be necessary.  Clearing the error count doesn't disable
checksums.  Every read is going to verify checksums on the file data
blocks.  If it can't find at least one copy with a valid checksum,
you should just get an I/O error trying to read the file, not invalid
data. 
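A crude way to exercise that is simply to attempt to read every file
and note which reads fail. A sketch, assuming a bash shell (the mount
point in the usage line is hypothetical):

```shell
# scan_tree DIR: try to read every regular file under DIR and print
# the names of files that fail to read (e.g. an I/O error from a block
# with no valid copy). Requires bash for the NUL-delimited read.
scan_tree() {
  find "$1" -type f -print0 |
  while IFS= read -r -d '' f; do
    dd if="$f" of=/dev/null bs=1M 2>/dev/null || printf '%s\n' "$f"
  done
}

# Hypothetical usage against the pool's mount point:
# scan_tree /tank/data > unreadable-files.txt
```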

 What's odd is we've checked a few hundred files, and most of them  
 don't seem to have any corruption.  I'm thinking what's wrong is the  
 metadata for these files is corrupted somehow, yet we can read them  
 just fine.

Are you still getting errors?

-- 
Darren
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss