Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2009-02-07 Thread Gino
 FYI, I'm working on a workaround for broken devices.  As you note,
 some disks flat-out lie: you issue the synchronize-cache command,
 they say got it, boss, yet the data is still not on stable storage.
 Why do they do this?  Because it performs better.  Well, duh --
 you can make stuff *really* fast if it doesn't have to be correct.

 The uberblock ring buffer in ZFS gives us a way to cope with this,
 as long as we don't reuse freed blocks for a few transaction groups.
 The basic idea: if we can't read the pool starting from the most
 recent uberblock, then we should be able to use the one before it,
 or the one before that, etc., as long as we haven't yet reused any
 blocks that were freed in those earlier txgs.  This allows us to
 use the normal load on the pool, plus the passage of time, as a
 displacement flush for disk caches that ignore the sync command.

 If we go back far enough in (txg) time, we will eventually find an
 uberblock all of whose dependent data blocks have made it to disk.
 I'll run tests with known-broken disks to determine how far back we
 need to go in practice -- I'll bet one txg is almost always enough.
 Jeff

Hi Jeff,
we just lost 2 pools on snv_91.
Any news on your workaround to recover pools by discarding the last txg?

thanks
gino


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-11-30 Thread Ray Clark
It would be extremely helpful to know what brands/models of disks lie and which 
don't.  This information could be provided diplomatically simply as threads 
documenting problems you are working on, stating the facts.  Use of a specific 
string of words would make searching for it easy.  There should be no 
liability, since you are simply documenting compatibility with zfs.  

Or perhaps if the lawyers let you, you could simply publish a 
compatibility/incompatibility list.  These ARE facts. 

If there is a way to make a detection tool, that would be very useful too, 
although after the purchase is made, it could be hard to send the drive back.  
However, that info could be fed into the database, marking that drive/model as 
incompatible with ZFS.

As Solaris / zfs gains ground, this could become a strong driver in the 
industry.

Re: I'll run tests with known-broken disks to determine how far back we
need to go in practice -- I'll bet one txg is almost always enough.

So go back three - we are using zfs because we want absolute reliability (or at 
least as close as we can get).

--Ray


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-13 Thread Wade . Stuart


[EMAIL PROTECTED] wrote on 10/11/2008 09:36:02 PM:


 On Oct 10, 2008, at 7:55 PM, David Magda wrote:

 
  If someone finds themselves in this position, what advice can be
  followed to minimize risks?

 Can you ask for two LUNs on different physical SAN devices and have
 an expectation of getting it?

Better yet, also ask for multiple paths over different SAN infrastructure to
each.  Then again, I would hope you don't need to ask your SAN folks for
that?

-Wade



Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-13 Thread Mike Gerdts
On Thu, Oct 9, 2008 at 10:33 PM, Mike Gerdts [EMAIL PROTECTED] wrote:
 On Thu, Oct 9, 2008 at 10:18 AM, Mike Gerdts [EMAIL PROTECTED] wrote:
 On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw [EMAIL PROTECTED] wrote:
 Nevada isn't production code.  For real ZFS testing, you must use a
 production release, currently Solaris 10 (update 5, soon to be update 6).

 I misstated before in my LDoms case.  The corrupted pool was on
 Solaris 10, with LDoms 1.0.  The control domain was SX*E, but the
 zpool there showed no problems.  I got into a panic loop with dangling
 dbufs.  My understanding is that this was caused by a bug in the LDoms
 manager 1.0 code that has been fixed in a later release.  It was a
 supported configuration, I pushed for and got a fix.  However, that
 pool was still lost.

 Or maybe it wasn't fixed yet.  I see that this was committed just today.

 6684721 file backed virtual i/o should be synchronous

 http://hg.genunix.org/onnv-gate.hg/rev/eb40ff0c92ec

The related information from the LDoms Manager 1.1 Early Access
release notes (820-4914-10):

Data Might Not Be Written Immediately to the Virtual Disk Backend If
Virtual I/O Is Backed by a File or Volume

Bug ID 6684721: When a file or volume is exported as a virtual disk,
then the service domain exporting that file or volume is acting as a
storage cache for the virtual disk. In that case, data written to the
virtual disk might get cached into the service domain memory instead
of being immediately written to the virtual disk backend. Data are not
cached if the virtual disk backend is a physical disk or slice, or if
it is a volume device exported as a single-slice disk.

Workaround: If the virtual disk backend is a file or a volume device
exported as a full disk, then you can prevent data from being cached
into the service domain memory and have data written immediately to
the virtual disk backend by adding the following line to the
/etc/system file on the service domain.

set vds:vd_file_write_flags = 0

Note – Setting this tunable flag does have an impact on performance
when writing to a virtual disk, but it does ensure that data are
written immediately to the virtual disk backend.
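
As a concrete illustration (a sketch only: run the commands as root on the
service domain, and note that the mdb check works only once the vds module
is actually loaded):

# echo 'set vds:vd_file_write_flags = 0' >> /etc/system
# reboot
# echo 'vd_file_write_flags/D' | mdb -k      (should now report 0)

The trade-off is exactly what the note above describes: every write to a
file-backed virtual disk is pushed through to the backend before being
acknowledged, so throughput drops, but the guest's view of what has been
committed stays honest.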

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-11 Thread Keith Bierman

On Oct 10, 2008, at 7:55 PM, David Magda wrote:


 If someone finds themselves in this position, what advice can be
 followed to minimize risks?

Can you ask for two LUNs on different physical SAN devices and have  
an expectation of getting it?



-- 
Keith H. Bierman   [EMAIL PROTECTED]  | AIM kbiermank
5430 Nassau Circle East  |
Cherry Hills Village, CO 80113   | 303-997-2749
speaking for myself* Copyright 2008






Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread Timh Bergström
2008/10/9 Bob Friesenhahn [EMAIL PROTECTED]:
 On Thu, 9 Oct 2008, Miles Nordin wrote:

 catastrophically.  If this is really the situation, then ZFS needs to
 give the sysadmin a way to isolate and fix the problems
 deterministically before filling the pool with data, not just blame
 the sysadmin based on nebulous speculatory hindsight gremlins.

 And if it's NOT the case, the ZFS problems need to be acknowledged and
 fixed.

 Can you provide any supportive evidence that ZFS is as fragile as you
 describe?

The hundreds of sysadmins seeing their pools go byebye after normal
operations in a production environment is evidence enough. And the
number of times people like Victor have saved our asses.


 From recent opinions expressed here, properly-designed ZFS pools must
 be inexplicably permanently cratering each and every day.

 Bob
 ==
 Bob Friesenhahn
 [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
 GraphicsMagick Maintainer,http://www.GraphicsMagick.org/





-- 
Timh Bergström
System Administrator
Diino AB - www.diino.com
:wq


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread Marcelo Leal
Hello all,
 I think the problem here is ZFS' capacity to recover from a failure.
Forgive me, but in trying to create code without failures, maybe the hackers 
forgot that other people can make mistakes (even if they can't). 
 - "ZFS does not need fsck."
 OK, that's a great statement, but I think ZFS needs one. Really does. And in 
my opinion an enhanced zdb would be the solution. Flexibility. Options.
 - "I have 90% of something I think is your filesystem, do you want it?"
 I think software is only as good as its ability to recover from failures. And I 
don't want to know who failed; I'm not going to send anyone to jail, I'm not a 
lawyer. I agree with Jeff, really do, but that is another problem...
 The solution Jeff is working on, I think, is really great, since it is NOT 
the all-or-nothing approach again... I don't know about you, but A LOT of times I 
was saved by the lost+found directory! All the beauty of a UNIX system is being 
able to rm /etc/passwd after having edited it, and get the whole file back with a 
cat /dev/mem. ;-)
 There are a lot of parts of the ZFS design that remind me of seeing something 
left on the floor at home: you ask your son why he did not pick it up, and he 
says "it was not me".
 peace.

 Leal.


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread Jeff Bonwick
 The circumstances where I have lost data have been when ZFS has not
 handled a layer of redundancy.  However, I am not terribly optimistic
 of the prospects of ZFS on any device that hasn't committed writes
 that ZFS thinks are committed.

FYI, I'm working on a workaround for broken devices.  As you note,
some disks flat-out lie: you issue the synchronize-cache command,
they say got it, boss, yet the data is still not on stable storage.
Why do they do this?  Because it performs better.  Well, duh --
you can make stuff *really* fast if it doesn't have to be correct.

Before I explain how ZFS can fix this, I need to get something off my
chest: people who knowingly make such disks should be in federal prison.
It is *fraud* to win benchmarks this way.  Doing so causes real harm
to real people.  Same goes for NFS implementations that ignore sync.
We have specifications for a reason.  People assume that you honor them,
and build higher-level systems on top of them.  Change the mass of
the proton by a few percent, and the stars explode.  It is impossible
to build a functioning civil society in a culture that tolerates lies.
We need a little more Code of Hammurabi in the storage industry.

Now:

The uberblock ring buffer in ZFS gives us a way to cope with this,
as long as we don't reuse freed blocks for a few transaction groups.
The basic idea: if we can't read the pool starting from the most
recent uberblock, then we should be able to use the one before it,
or the one before that, etc, as long as we haven't yet reused any
blocks that were freed in those earlier txgs.  This allows us to
use the normal load on the pool, plus the passage of time, as a
displacement flush for disk caches that ignore the sync command.

If we go back far enough in (txg) time, we will eventually find an
uberblock all of whose dependent data blocks have made it to disk.
I'll run tests with known-broken disks to determine how far back we
need to go in practice -- I'll bet one txg is almost always enough.
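
To make the search order concrete, here is a rough sketch in C of the idea
only -- this is NOT actual ZFS code, and every name in it (uberblock_t,
pool_traversable_at(), the ring size) is invented for illustration.  The
real uberblocks live in the vdev labels, and a real implementation would
traverse the metadata and verify checksums rather than consult a flag:

/*
 * Sketch only: walk the uberblock ring from the newest txg backwards
 * until we find one whose entire dependent tree is readable.
 */
#include <stdint.h>
#include <stdio.h>

#define UB_RING_SIZE    128             /* entries in the (hypothetical) ring */

typedef struct uberblock {
        uint64_t ub_txg;                /* txg this uberblock committed */
        int      ub_valid;              /* stand-in for "tree is readable" */
} uberblock_t;

/* Stand-in for "can we traverse all metadata reachable from ub?" */
static int
pool_traversable_at(const uberblock_t *ub)
{
        return (ub->ub_valid);
}

/* Newest uberblock giving a consistent view of the pool, or NULL. */
static const uberblock_t *
find_usable_uberblock(const uberblock_t ring[], int n)
{
        uint64_t limit = UINT64_MAX;    /* only consider txgs below this */

        for (;;) {
                const uberblock_t *cand = NULL;

                for (int i = 0; i < n; i++) {
                        if (ring[i].ub_txg != 0 && ring[i].ub_txg < limit &&
                            (cand == NULL || ring[i].ub_txg > cand->ub_txg))
                                cand = &ring[i];
                }
                if (cand == NULL)
                        return (NULL);          /* ring exhausted */
                if (pool_traversable_at(cand))
                        return (cand);          /* good view of the pool */
                limit = cand->ub_txg;           /* step back one txg */
        }
}

int
main(void)
{
        /* Fake ring: txgs 100..103, where the two newest never made it
         * to stable storage (the "lying disk" scenario above). */
        uberblock_t ring[UB_RING_SIZE] = {
                { 100, 1 }, { 101, 1 }, { 102, 0 }, { 103, 0 },
        };
        const uberblock_t *ub = find_usable_uberblock(ring, UB_RING_SIZE);

        if (ub != NULL)
                printf("would import pool at txg %llu\n",
                    (unsigned long long)ub->ub_txg);
        else
                printf("no usable uberblock found\n");
        return (0);
}

The point of the sketch is only the search order: newest txg first, stepping
back one txg at a time until an uberblock is found whose whole tree actually
made it to disk.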

Jeff


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread Ricardo M. Correia
Hi Jeff,

On Fri, 2008-10-10 at 01:26 -0700, Jeff Bonwick wrote:
  The circumstances where I have lost data have been when ZFS has not
  handled a layer of redundancy.  However, I am not terribly optimistic
  of the prospects of ZFS on any device that hasn't committed writes
  that ZFS thinks are committed.
 
 FYI, I'm working on a workaround for broken devices.  As you note,
 some disks flat-out lie: you issue the synchronize-cache command,
 they say got it, boss, yet the data is still not on stable storage.

It's not just about ignoring the synchronize-cache command, there's also
another weak spot.

ZFS is quite resilient against so-called phantom writes, provided that
they occur sporadically - let's say, if the disk decides to _randomly_
ignore writes 10% of the time, ZFS could probably survive that pretty
well even on single-vdev pools, due to ditto blocks.

However, it is not so resilient when the storage system suffers hiccups
which cause phantom writes to occur continuously, even if for a small
period of time (say less than 10 seconds), and then return to normal.
This could happen for several reasons, including network problems, bugs
in software or even firmware, etc.

I think in this case, going back to a previous uberblock could also be
enough to recover from such a scenario most of the time, unless perhaps
the error occurred too long ago, and the unwritten metadata got flushed
out of the ARC and didn't have a chance to get rewritten.

In any case, a more generic solution to repair all kinds of metadata
corruption, such as (e.g.) space map corruption, would be very
desirable, as I think everyone can agree.

Best regards,
Ricardo




Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread Ross
That sounds like a great idea for a tool Jeff.  Would it be possible to build 
that in as a zpool recover command?

Being able to run a tool like that and see just how bad the corruption is, but 
know it's possible to recover an older version would be great.  Is there any 
chance of outputting details so the sysadmin can know roughly how much was 
lost?  

My thoughts are going to be very rough (I don't know much about zfs internals), 
but I'm wondering if something like this would work, where all bad blocks are 
reported, along with the latest 3 good ones:

*8
# zpool recover pool
. pool details ...

Finding and testing uberblocks...
1.  block a  date/time:  x/
 CORRUPTED
2.  block b  date/time:  y/
 CORRUPTED
3.  block c  date/time:  z/
 Appears OK
4.  block d  date/time:  z/
 Appears OK
5.  block e  date/time:  z/
 Appears OK

 
*8

Victor was talking in another thread about using zdb to check the pool before 
doing an import of a damaged pool.  Might it be possible for the next stage of 
the recovery process to give the user an option of testing or importing the 
pool for any particular uberblock?

It does sound like testing can take a long time, so this would need to be 
something that can be cancelled, and you would also need a way to mark 
uberblocks as bad should problems be found with either the test or import.

This would be a great addition to ZFS though, and would hopefully save Victor a 
bit of time ;-)

Ross


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread Miles Nordin
 jb == Jeff Bonwick [EMAIL PROTECTED] writes:
 rmc == Ricardo M Correia [EMAIL PROTECTED] writes:

jb We need a little more Code of Hammurabi in the storage
jb industry.

It seems like most of the work people have to do now is cleaning up
after the sloppyness of others.  At least it takes the longest.

You could always mention which disks you found ignoring the
command---wouldn't that help the overall problem?  I understand
there's a pervasive ``i don' wan' any trouble, mistah'' attitude, but
I don't understand where it comes from.

 http://www.ferris.edu/news/jimcrow/tom/

jb displacement flush for disk caches that ignore the sync
jb command.

Sounds like a good idea but:

 (1) won't this break the NFS guarantees you were just saying should
 never be broken?

 I get it, someone else is breaking a standard so how can ZFS be
 expected to yadda yadda yadda.  But I fear it will just push
 ``blame the sysadmin'' one step further out.  ex., Q. ``with ZFS
 all my NFS clients become unstable after the server reboots,'' or
 ``I'm getting silent corruption with NFS''.  A.  ``your drives
 might have gremlins in them, no way to know,'' and ``well what do
 you expect without a single integrity domain and TCP's weak
 checksums.  / no i'm using a crossover cable, and FCS is not
 weak. / ZFS managing a layer of redundancy it is probably your
 RAM or corruption on the uh, between the Ethernet MAC chip and
 the PCI slot''

 (1a) I'm concerned about how it'll be reported when it happens.

  (a) if it's not reported at all, then ZFS is hiding the fact
  that fsync() is not working.  Also, other journaling
  filesystems sometimes report when they find
  ``unexpected'' corruption, which is useful for finding
  both hardware and software problems.

  I'm already concerned ZFS is not reporting enough, like
  when it says a vdev component is ONLINE, but 'zpool
  offline pool component' says 'no valid replicas', then
  after a scrub there is no change to zpool status, but
  zpool offline works again.  

  ZFS should not ``simplify'' the user interface to the
  point that it's hiding problems with itself and its
  environment to the ends of avoiding discussion.

  (b) if it is reported, then whenever the reporter-blob
  raises its hand it will have the effect of exonerating
  ZFS in most people's minds, like the stupid CKSUM column
  does right now.  ``ZFS-FEED-B33F error?  oh yeah that's
  the new ueberblock search code.  that means your disks
  are ignoring the SYNCHRONIZE CACHE command.  thank GOD
  you have ZFS with ANY OTHER FILESYSTEM all bets would be
  totally off.  lucky you.  / I have tried ten different
  models from all four brands.  / yeah sucks don't it?
  flagrant violation of the standard, industry wide.  / my
  linux testing tool says they're obeying the command fine
  / linux is crap / i added a patch to solaris to block
  the SYNC CACHE command and the disks got faster so I
  think it's not being ignored / well the stack is
  complicated and flushing happens at many levels, like
  think about controller performance, and that's
  completely unsupported you are doing something REALLY
  UNSAFE there you should NOT DO THAT it is STUPID'' and
  so on, stalling the actual fix literally for years.

  The right way to exonerate ZFS is to make a diagnosis
  tool for the disks which proves they're broken, and then
  don't buy those disks.  not to make a new class of ZFS
  fault report that could potentially capture all kinds of
  problems, then hazily assign blame to an untestable
  quantity.

 (2) disks are probably not the only thing dropping the write
 barriers.  So far, we're also suspecting (unproven!) iSCSI
 targets/initiators, particularly around a TCP reconnection event
 or target reboot.  and VM stacks, both VirtualBox and the HVM in
 UltraSPARC T1.  probably other stuff.  

 I'm concerned that assumptions you'll find safe to make about
 disks after you get started, like nothing is more than 1s stale,
 or send a CDB to size the on-disk cache and imagine it's a FIFO
 and it'll be no worse than that, or ``you can get an fsync by
 pausing reads for 500ms'' or whatever, will add robustness for
 current and future broken disks but won't apply to other types of
 broken storage layer.

   rmc However, it is not so resilient when the storage system
   rmc suffers hiccups which cause phantom writes to occur
   rmc continuously, even if for a small period of time (say less
 

Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread Eric Schrock
On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote:
  - ZFS does not need fsck.
  Ok, that's a great statement, but i think ZFS needs one. Really does.
  And in my opinion a enhanced zdb would be the solution. Flexibility.
  Options.

About 99% of the problems reported as "I need ZFS fsck" can be summed up
by two ZFS bugs:

1. If a toplevel vdev fails to open, we should be able to pull
   information from necessary ditto blocks to open the pool and make
   what progress we can.  Right now, the root vdev code assumes "can't
   open" = faulted pool, which results in failure scenarios that are
   perfectly recoverable most of the time.  This needs to be fixed
   so that pool failure is only determined by the ability to read
   critical metadata (such as the root of the DSL).

2. If an uberblock ends up with an inconsistent view of the world (due
   to failure of DKIOCFLUSHWRITECACHE, for example), we should be able
   to go back to previous uberblocks to find a good view of our pool.
   This is the failure mode described by Jeff.

These are both bugs in ZFS and will be fixed.  The other 1% of the
complaints are usually of the form "I created my pool on top of my old
one" or "I imported a LUN on two different systems at the same time."
It's unclear what a 'fsck' tool could do in this scenario, if anything.
Due to a variety of reasons (hierarchical nature of ZFS, variable block
sizes, RAID-Z, compression, etc.), it's difficult to even *identify* a
ZFS block, let alone determine its validity and associate it in some
larger construct.

There are some interesting possibilities for limited forensic tools - in
particular, I like the idea of an mdb backend for reading and writing ZFS
pools[1].  But I haven't actually heard a reasonable proposal for what a
fsck-like tool (i.e. one that could repair things automatically) would
actually *do*, let alone how it would work in the variety of situations
it needs to (compressed RAID-Z?) where the standard ZFS infrastructure
fails.

- Eric

[1] 
http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html

--
Eric Schrock, Fishworks    http://blogs.sun.com/eschrock


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread Victor Latushkin
Eric Schrock wrote:
 On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote:
  - ZFS does not need fsck.
  Ok, that's a great statement, but i think ZFS needs one. Really does.
  And in my opinion a enhanced zdb would be the solution. Flexibility.
  Options.
 
 About 99% of the problems reported as I need ZFS fsck can be summed up
 by two ZFS bugs:
 
 1. If a toplevel vdev fails to open, we should be able to pull
information from necessary ditto blocks to open the pool and make
what progress we can.  Right now, the root vdev code assumes can't
open = faulted pool, which results in failure scenarios that are
perfectly recoverable most of the time.  This needs to be fixed
so that pool failure is only determined by the ability to read
critical metadata (such as the root of the DSL).
 
 2. If an uberblock ends up with an inconsistent view of the world (due
to failure of DKIOCFLUSHWRITECACHE, for example), we should be able
to go back to previous uberblocks to find a good view of our pool.
This is the failure mode described by Jeff.

I've mostly seen (2), because despite all the best practices out there, 
single-vdev pools are quite common. In all such cases that I have had my 
hands on, it was possible to recover the pool by going back one or two txgs.

 These are both bugs in ZFS and will be fixed.  The other 1% of the
 complaints are usually of the form I created my pool on top of my old
 one or I imported a LUN on two different systems at the same time.

Of these two, the former is not easy because it requires searching through 
the entire disk space for root block candidates and trying each of them. 
The latter is not catastrophic in case there was little to no activity 
from one of the systems. In this case one of the first things to suffer is 
the pool config object, and corruption of it prevents the pool from being opened.

Fortunately enough, after putback of

6733970 assertion failure in dbuf_dirty() via spa_sync_nvlist()

in build 99, a corrupted pool config object is written during open in such a 
way that the old corrupted copy can no longer be read in, and in most cases 
this allows the pool to be imported and most of the data to be saved. zdb is 
useful for understanding how much is corrupted and how much is recovered. If 
nothing else is corrupted, then the pool may be available for further use 
without recreation. Again, in every case I have had my hands on it was 
possible to either recover the pool completely or at least save most of the data.

 It's unclear what a 'fsck' tool could do in this scenario, if anything.
 Due to a variety of reasons (hierarchical nature of ZFS, variable block
 sizes, RAID-Z, compression, etc.), it's difficult to even *identify* a
 ZFS block, let alone determine its validity and associate it in some
 larger construct.

Indeed. In one more ZFS recovery case, involving a 42TB pool with about 8TB 
used, zdb -bv alone took several hours to walk the block tree and verify 
consistency of block pointers, and zdb -bcv took a couple of days to 
verify all user data blocks as well. And the different checksums and gang 
blocks, in addition to all the other dynamic features mentioned, complicate 
the task of identifying ZFS blocks and linking those blocks into a tree, 
and make it really time (and space) consuming.

 There are some interesting possibilities for limited forensic tools - in
 particular, I like the idea of a mdb backend for reading and writing ZFS
 pools[1].  But I haven't actually heard a reasonable proposal for what a
 fsck-like tool (i.e. one that could repair things automatically) would
 actually *do*, let alone how it would work in the variety of situations
 it needs to (compressed RAID-Z?) where the standard ZFS infrastructure
 fails.

There are a number of bugs and RFEs to improve the usefulness of zdb for 
field use, e.g.:

6720637 want zdb -l option to dump uberblock arrays as well
6709782 issues running zdb with -p and -e options
6736356 zdb -R needs to work with exported pools
6720907 zdb should handle errors while dumping datasets and objects
6746101 zdb command to search for ZFS labels in a device
6757444 want zdb -R to support decompression, checksumming and raid-z
6757430 want an option for zdb to disable space map loading and leak 
tracking

Hth,
Victor

 - Eric
 
 [1] 
 http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html
 
 --
 Eric Schrock, Fishworkshttp://blogs.sun.com/eschrock



Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread Timh Bergström
2008/10/10 Richard Elling [EMAIL PROTECTED]:
 Timh Bergström wrote:

 2008/10/9 Bob Friesenhahn [EMAIL PROTECTED]:


 On Thu, 9 Oct 2008, Miles Nordin wrote:


 catastrophically.  If this is really the situation, then ZFS needs to
 give the sysadmin a way to isolate and fix the problems
 deterministically before filling the pool with data, not just blame
 the sysadmin based on nebulous speculatory hindsight gremlins.

 And if it's NOT the case, the ZFS problems need to be acknowledged and
 fixed.


 Can you provide any supportive evidence that ZFS is as fragile as you
 describe?


 The hundreds of sysadmins seeing their pools go byebye after normal
 operations in a production environment is evidence enough. And the
 number of times people like Victor have saved our asses.


 Hundreds?  Do you have evidence of this?

One is one too many. I don't need evidence of hundreds - that is
hopefully an exaggeration.

//T

 -- richard




Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread Marcelo Leal
 On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote:
   - ZFS does not need fsck.
   Ok, that's a great statement, but i think ZFS needs one. Really does.
   And in my opinion a enhanced zdb would be the solution. Flexibility.
   Options.
 
 About 99% of the problems reported as "I need ZFS fsck" can be summed up
 by two ZFS bugs:
 
 1. If a toplevel vdev fails to open, we should be able to pull
    information from necessary ditto blocks to open the pool and make
    what progress we can.  Right now, the root vdev code assumes "can't
    open" = faulted pool, which results in failure scenarios that are
    perfectly recoverable most of the time.  This needs to be fixed
    so that pool failure is only determined by the ability to read
    critical metadata (such as the root of the DSL).
 
 2. If an uberblock ends up with an inconsistent view of the world (due
    to failure of DKIOCFLUSHWRITECACHE, for example), we should be able
    to go back to previous uberblocks to find a good view of our pool.
    This is the failure mode described by Jeff.
 
 These are both bugs in ZFS and will be fixed.

 That's it! It's 100% for me! ;-) 
 One is the all-or-nothing problem, and the other is about who is guilty... ;-))

 
 There are some interesting possibilities for limited
 forensic tools - in
 particular, I like the idea of a mdb backend for
 reading and writing ZFS
 pools[1]. 
 In my opinion it would be great to have the whole functionality in zdb. It's 
simple, and the concepts are clear in the tool. mdb is a debugger, and needs 
concepts that I think are different from a tool for reading/fixing filesystems. 
Just an opinion... Which does not mean we cannot have both. Like I said, 
flexibility, options... ;-)


 But I haven't actually heard a reasonable
 proposal for what a
 fsck-like tool 

 I think we must NOT get stuck on the word fsck; I have used it just as an 
example (lost+found). And I think other users used it just as an example too. 
The important thing is the two points you have described very *well*.

 (i.e. one that could repair things automatically) would actually *do*, let
 alone how it would work in the variety of situations it needs to
 (compressed RAID-Z?) where the standard ZFS infrastructure fails.
 
 - Eric
 
 [1] http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html
 
 --
 Eric Schrock, Fishworks    http://blogs.sun.com/eschrock
 

 Many thanks for your answer!
 Leal.


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread Ricardo M. Correia
On Fri, 2008-10-10 at 11:23 -0700, Eric Schrock wrote:
 But I haven't actually heard a reasonable proposal for what a
 fsck-like tool (i.e. one that could repair things automatically) would
 actually *do*, let alone how it would work in the variety of situations
 it needs to (compressed RAID-Z?) where the standard ZFS infrastructure
 fails.

I'd say an fsck-like tool for ZFS should not worry much about compression,
checksums, RAID-Z and whatnot. In essence, it would try to do what an
fsck tool does for a typical filesystem, and so would be mostly
oblivious to the layout or encoding of the blocks, perhaps treating
blocks with failed checksums as blocks full of zeros.

Here's how it could work (of course, this is all easier said than done):

1) Open all the devices specified by the user. Optionally, take just a
pool name/guid and scan for the right devices in /dev/[r]dsk.

2) Verify if the pool configuration read from the devices is sane -- if
not, try to generate a consistent configuration. Some elements of the
pool configuration, such as the correct pool version, could be checked
in later steps, depending on features that were found.

3) Starting from the last uberblock, fully traverse a few levels down
the tree. If less than 100% of the blocks could be read without errors,
do the same for previous uberblocks and offer the user the choice of
which uberblock to use, or, if running non-interactively, choose the one
with the best success rate.

4) Traverse the list/tree of filesystems, snapshots and clones. Make
sure that they are well-connected. For each filesystem, try to replay
the ZILs, clean them out.

5) Now fully traverse the pool. Compute the space maps and FS space
usage on-the-go, as blocks are read.

6) For each metadata block read, check whether the fields are sane, fix
them/zero them out if they're not. Basically we're assuming here that we
may have corrupted metadata with correct checksums.

If some metadata block can not be read due to a failed checksum, assume
the block is full of zeros, and fix it.

By the way, this includes every field of every kind of metadata block,
including ZAPs, ACLs, FID maps, znode fields, everything.

For fields that reference other objects, make sure that the object they
reference is of the correct type and that the object itself is correct.

For objects that are missing, create empty ones if necessary.

7) Check that every object is referenced somewhere and link unreferenced
objects to /lost+found/object-type/, or similar.

8) Probably do other things that I'm forgetting.

9) In the end, check if the space maps are consistent with the ones
computed, write correct ones if not. Check that space
usage/reservations/quotas are correct.

Essentially, the goal is that at the end of this process, the pool
should contain consistent information, should have as much data as could
be recovered and should never cause any further errors in ZFS due to
invalid metadata/fields; either when importing it, reading from it or
writing/modifying it (except that it would still return EIO errors when
trying to read corrupted file data blocks, of course).
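
If it helps to visualize the shape of such a tool, here is a deliberately
tiny C skeleton of the pass structure outlined above.  Every function in it
is a hypothetical stub invented for illustration -- there is no real on-disk
logic here, and a real tool would need the actual ZFS structures, space-map
accounting and lost+found plumbing:

#include <stdio.h>

/* Hypothetical passes, roughly matching steps 1-9 above. */
static int open_devices(void)        { puts("1: open devices / scan for the pool"); return 0; }
static int check_pool_config(void)   { puts("2: verify or rebuild the pool config"); return 0; }
static int pick_uberblock(void)      { puts("3: probe uberblocks, newest first"); return 0; }
static int check_dsl_tree(void)      { puts("4: walk filesystems/snapshots, replay ZILs"); return 0; }
static int traverse_and_repair(void) { puts("5-8: traverse blocks, fix/zero bad metadata, relink orphans"); return 0; }
static int rewrite_space_maps(void)  { puts("9: recompute space maps, quotas, reservations"); return 0; }

int
main(void)
{
        int (*passes[])(void) = {
                open_devices, check_pool_config, pick_uberblock,
                check_dsl_tree, traverse_and_repair, rewrite_space_maps,
        };

        for (unsigned i = 0; i < sizeof (passes) / sizeof (passes[0]); i++) {
                if (passes[i]() != 0) {
                        fprintf(stderr, "pass %u failed, aborting\n", i + 1);
                        return (1);
                }
        }
        puts("pool should now be internally consistent");
        return (0);
}

The hard part, as the next paragraph points out, is not the pass structure
but the memory and traversal cost of the middle passes on a large pool.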

Now, a problem with fsck-like tools, and perhaps especially with ZFS, is
that some of these steps may either require lots of memory or multiple
filesystem/pool traversals.

I'd say having such a tool, even if it required additional temporary
storage for operation (hopefully not a very large fraction of the pool
size), would be *very* useful and would clear up any worries that people
currently have.

Kind regards,
Ricardo



Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread Richard Elling
Timh Bergström wrote:
 2008/10/10 Richard Elling [EMAIL PROTECTED]:
   
 Timh Bergström wrote:
 
 2008/10/9 Bob Friesenhahn [EMAIL PROTECTED]:

   
 On Thu, 9 Oct 2008, Miles Nordin wrote:

 
 catastrophically.  If this is really the situation, then ZFS needs to
 give the sysadmin a way to isolate and fix the problems
 deterministically before filling the pool with data, not just blame
 the sysadmin based on nebulous speculatory hindsight gremlins.

 And if it's NOT the case, the ZFS problems need to be acknowledged and
 fixed.

   
 Can you provide any supportive evidence that ZFS is as fragile as you
 describe?

 
 The hundreds of sysadmins seeing their pools go byebye after normal
 operations in a production environment is evidence enough. And the
 number of times people like Victor have saved our asses.

   
 Hundreds?  Do you have evidence of this?
 

 One is one to many, I dont need evidence of hundreds - that is
 hopefully an exaggeration.

   

Don't show up to a data fight without data :-/
Yes, we do track this information and guys like me analyze it.
The ratio of installed base to problem reports for ZFS is quite high.
When we see a trend, we adjust priorities to address it.  This is just
part of our overall quality program.

Which brings me to the required mantra, if you don't file a bug or
make a service call, the problem doesn't get tracked.  Please make
the effort so that we can prioritize the use of our limited resources.
Posting a fine whine on this (or any) forum is not guaranteed to result
in an entry in our problem tracking system -- someone has to put in
the extra effort, or it will fall into the silent complainant category.
Please help us to improve the quality of our systems, thanks.
 -- richard



Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread David Magda
On Oct 10, 2008, at 15:48, Victor Latushkin wrote:

 I've mostly seen (2), because despite all the best practices out  
 there,
 single vdev pools are quite common. In all such cases that I had my
 hands on it was possible to recover pool by going back by one or two  
 txgs.

For better or worse this is the case where I work.

Most of our storage is on SANs (EMC and NetApp), and so if we need  
more space we ask for it and we get a giant LUN given to us (usually  
multi-pathed). We also have a lot of Veritas VxVM and VxFS for Oracle,  
and so even if we're running Solaris 10, we're not using ZFS in that  
case.

SAN space is also allocated to Windows and VMware ESX machines as  
well, so it's not like we can ask for the disks in the SAN to be  
exported raw, as that would mess up managing of things with the other  
OSes. (We have a very small global storage / back up team, and I  
really don't want to add more to their workload.)

If someone finds themselves in this position, what advice can be  
followed to minimize risks?

For example, is having checksums enabled a good idea? If you have no  
redundancy and an error occurs, the system will panic by default  
(configurable in newer builds of OpenSolaris, but not in Solaris  
'proper' yet). But if the system is ignoring checksums, you're no  
worse off than most other file systems (but still get all the other  
features of ZFS).

Or is there a way to mitigate a checksum error on non-redundant zpool?



Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread Jeff Bonwick
 Or is there a way to mitigate a checksum error on non-redundant zpool?

It's just like the difference between non-parity, parity, and ECC memory.
Most filesystems don't have checksums (non-parity), so they don't even
know when they're returning corrupt data.  ZFS without any replication
can detect errors, but can't fix them (like parity memory).  ZFS with
mirroring or RAID-Z can both detect and correct (like ECC memory).

Note: even in a single-device pool, ZFS metadata is replicated via
ditto blocks at two or three different places on the device, so that
a localized media failure can be both detected and corrected.
If you have two or more devices, even without any mirroring
or RAID-Z, ZFS metadata is mirrored (again via ditto blocks)
across those devices.
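
(Illustration only -- the pool name is made up.)  The detect-but-not-correct
case is easy to see on a non-redundant pool: a scrub will find blocks whose
checksums no longer match, and zpool status -v will list the affected files
even though nothing can be repaired:

# zpool scrub tank
# zpool status -v tank
  ...
errors: Permanent errors have been detected in the following files:
        /tank/path/to/damaged/file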

Jeff


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-10 Thread Mike Gerdts
On Fri, Oct 10, 2008 at 9:14 PM, Jeff Bonwick [EMAIL PROTECTED] wrote:
 Note: even in a single-device pool, ZFS metadata is replicated via
 ditto blocks at two or three different places on the device, so that
 a localized media failure can be both detected and corrected.
 If you have two or more devices, even without any mirroring
 or RAID-Z, ZFS metadata is mirrored (again via ditto blocks)
 across those devices.

And in the event that you have a pool that is mostly not very
important but some of it is important, you can have data mirrored on a
per dataset level via copies=n.

If we can avoid losing an entire pool by rolling back a txg or two,
the biggest source of data loss and frustration is taken care of.
Ditto blocks for metadata should take care of most other cases that
would result in widespread loss.  Normal bit rot that causes you to
lose blocks here and there is somewhat likely to take out a small
minority of files and spit warnings along the way.  If there are some
files that are more important to you than others (e.g. losing files in
rpool/home may have more impact than losing rpool/ROOT), copies=2 can
help there.
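
A minimal illustration (using the same rpool/home example as above; note
that copies=n only applies to blocks written after the property is set, so
it is best set when the dataset is created):

# zfs set copies=2 rpool/home
# zfs get copies rpool/home
NAME        PROPERTY  VALUE  SOURCE
rpool/home  copies    2      local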

And for those places where losing a txg or two is a mortal sin, don't
use flaky hardware and allow zfs to handle a layer of redundancy.

This gets me thinking that it may be worthwhile to have a small (100
MB x 2) rescue boot environment with copies=2 (as well as rpool/boot/)
so that pkg repair could be used to deal with cases that prevent
your normal (4 GB) boot environment from booting.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-09 Thread .
 His explanation: he invalidated the incorrect
 uberblocks and forced zfs to revert to an earlier
 state that was consistent.

Would someone be willing to document the steps required in order to do this 
please?

I have a disk in a similar state:

# zpool import
  pool: tank
id: 13234439337856002730
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

        tank        FAULTED  corrupted data
          c7d0      ONLINE

This happened after I foolishly began trusting zfs-fuse with some large but 
relatively unimportant data on a big, empty single disk zpool in my home 
machine and then suffered a power cut before I got around to backing it up.

OpenSolaris can't import the pool either, so the drive is sat on a shelf 
waiting till a method for fixing it is published.

While it's clearly my own fault for taking the risks I did, it's still pretty 
frustrating knowing that all my data is likely still intact and nicely 
checksummed on the disk but that none of it is accessible due to some tiny 
filesystem inconsistency.  With pretty much any other FS I think I could get 
most of it back.

Clearly such a small number of occurrences in what were admittedly precarious 
configurations aren't going to be particularly convincing motivators to provide 
a general solution, but I'd feel a whole lot better about using ZFS if I knew 
that there were some documented steps or a tool (zfsck? ;) that could help to 
recover from this kind of metadata corruption in the unlikely event of it 
happening.

cheers,

Rob


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-09 Thread Mike Gerdts
On Thu, Oct 9, 2008 at 4:53 AM, . [EMAIL PROTECTED] wrote:
 While it's clearly my own fault for taking the risks I did, it's
 still pretty frustrating knowing that all my data is likely still
 intact and nicely checksummed on the disk but that none of it is
 accessible due to some tiny filesystem inconsistency.  With pretty
 much any other FS I think I could get most of it back.

 Clearly such a small number of occurrences in what were admittedly
 precarious configurations aren't going to be particularly convincing
 motivators to provide a general solution, but I'd feel a whole lot
 better about using ZFS if I knew that there were some documented
 steps or a tool (zfsck? ;) that could help to recover from this kind
 of metadata corruption in the unlikely event of it happening.

Well said.  You have hit on my #1 concern with deploying ZFS.

FWIW, I believe that I have hit the same type of bug as the OP in the
following combinations:

- T2000, LDoms 1.0, various builds of Nevada in control and guest
  domains.
- Laptop, VirtualBox 1.6.2, Windows XP SP2 host, OpenSolaris 2008.05 @
  build 97 guest

In the past year I've lost more ZFS file systems than I have any other
type of file system in the past 5 years.  With other file systems I
can almost always get some data back.  With ZFS I can't get any back.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-09 Thread Wilkinson, Alex

0n Thu, Oct 09, 2008 at 06:37:23AM -0500, Mike Gerdts wrote: 

FWIW, I belive that I have hit the same type of bug as the OP in the
following combinations:

- T2000, LDoms 1.0, various builds of Nevada in control and guest
  domains.
- Laptop, VirtualBox 1.6.2, Windows XP SP2 host, OpenSolaris 2008.05 @
  build 97 guest

In the past year I've lost more ZFS file systems than I have any other
type of file system in the past 5 years.  With other file systems I
can almost always get some data back.  With ZFS I can't get any back.

That's scary to hear!

 -aW





Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-09 Thread Mike Gerdts
On Thu, Oct 9, 2008 at 7:44 AM, Ahmed Kamal
[EMAIL PROTECTED] wrote:


In the past year I've lost more ZFS file systems than I have any other
type of file system in the past 5 years.  With other file systems I
can almost always get some data back.  With ZFS I can't get any back.

 Thats scary to hear!


 I am really scared now! I was the one trying to quantify ZFS reliability,
 and that is surely bad to hear!

The circumstances where I have lost data have been when ZFS has not
handled a layer of redundancy.  However, I am not terribly optimistic
of the prospects of ZFS on any device that hasn't committed writes
that ZFS thinks are committed.  Mirrors and raidz would also be
vulnerable to such failures.

I also have run into other failures that have gone unanswered on the
lists.  It makes me wary about using zfs without a support contract
that allows me to escalate to engineering.  Patching-only support
won't help.

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-December/044984.html
   Hang only after I mirrored the zpool, no response on the list

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048255.html
   I think this is fixed around snv_98, but the zfs-discuss list was
   surprisingly silent on acknowledging it as a problem - I had no
   idea that it was being worked until I saw the commit.  The panic
   seemed to be caused by dtrace - core developers of dtrace
   were quite interested in the kernel crash dump.

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-September/051109.html
   Panic during ON build.  Pool was lost, no response from list.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-09 Thread Timh Bergström
Unfortunately I can only agree with the doubts about running ZFS in
production environments. I've lost ditto blocks, I've gotten
corrupted pools and a bunch of other failures, even in
mirror/raidz/raidz2 setups, with or without hardware mirrors/raid5/6.
Plus there is the insecurity that a sudden crash/reboot will corrupt or even
destroy the pools, with restore from backup as the only advice. I've
been lucky so far about getting my pools back, thanks to people like
Victor.

What would be needed is a proper fsck for ZFS which can resolve minor
data corruption; tools for rebuilding, resizing and moving the data
about on pools are also needed, even recovery of data from faulted
pools, like there is for ext2/3/ufs/ntfs.

All in all, a great FS, but not production ready until the tools are in
place or it gets really, really resilient to minor failures and/or
crashes in both software and hardware. For now I'll stick to XFS/UFS
and sw/hw-raid and live with the restrictions of such filesystems.

//T

2008/10/9 Mike Gerdts [EMAIL PROTECTED]:
 On Thu, Oct 9, 2008 at 7:44 AM, Ahmed Kamal
 [EMAIL PROTECTED] wrote:


In the past year I've lost more ZFS file systems than I have any other
type of file system in the past 5 years.  With other file systems I
can almost always get some data back.  With ZFS I can't get any back.

 Thats scary to hear!


 I am really scared now! I was the one trying to quantify ZFS reliability,
 and that is surely bad to hear!

 The circumstances where I have lost data have been when ZFS has not
 handled a layer of redundancy.  However, I am not terribly optimistic
 of the prospects of ZFS on any device that hasn't committed writes
 that ZFS thinks are committed.  Mirrors and raidz would also be
 vulnerable to such failures.

 I also have run into other failures that have gone unanswered on the
 lists.  It makes me wary about using zfs without a support contract
 that allows me to escalate to engineering.  Patching only support
 won't help.

 http://mail.opensolaris.org/pipermail/zfs-discuss/2007-December/044984.html
   Hang only after I mirrored the zpool, no response on the list

 http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048255.html
   I think this is fixed around snv_98, but the zfs-discuss list was
   surprisingly silent on acknowledging it as a problem - I had no
   idea that it was being worked until I saw the commit.  The panic
   seemed to be caused by dtrace - core developers of dtrace
   were quite interested in the kernel crash dump.

 http://mail.opensolaris.org/pipermail/zfs-discuss/2008-September/051109.html
   Panic during ON build.  Pool was lost, no response from list.

 --
 Mike Gerdts
 http://mgerdts.blogspot.com/




-- 
Timh Bergström
System Administrator
Diino AB - www.diino.com
:wq


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-09 Thread Greg Shaw
Perhaps I misunderstand, but the issues below are all based on Nevada, 
not Solaris 10.

Nevada isn't production code.  For real ZFS testing, you must use a 
production release, currently Solaris 10 (update 5, soon to be update 6).

In the last 2 years, I've stored everything in my environment (home 
directory, builds, etc.) on ZFS on multiple types of storage subsystems 
without issues.  All of this has been on Solaris 10, however.

Btw, I completely agree on the panic issue.If I have a large DB 
server with many pools, and one inconsequential pool fails, I lose the 
entire DB server.   I'd really like to see an option at the zpool level 
directing what to do in a panic for a particular pool.Perhaps this 
is in the latest bits; if so, sorry, I'm running old stuff.  :-)

I also run ZFS on my Mac.  While not production quality, some of the 
panics dealing with external (FireWire, USB, eSATA) drives are very 
irritating.   A hiccup due to a jostled cable, and the entire box 
panics.   That's frustrating.

Timh Bergström wrote:
 Unfortunely I can only agree to the doubts about running ZFS in
 production environments, i've lost ditto-blocks, i''ve gotten
 corrupted pools and a bunch of other failures even in
 mirror/raidz/raidz2 setups with or without hardware mirrors/raid5/6.
 Plus the insecurity of a sudden crash/reboot will corrupt or even
 destroy the pools with restore from backup as the only advice. I've
 been lucky so far about getting my pools back thanks to people like
 Victor.

 What would be needed is a proper fsck for ZFS which can resolv minor
 data corruptions, tools for rebuilding, resizing and moving the data
 about on pools is also needed, even recover of data from faulted
 pools, like there is for ext2/3/ufs/ntfs.

 All in all, great FS but not production ready until the tools are in
 place or it gets really really resillient to minor failures and/or
 crashes in both software and hardware. For now i'll stick to XFS/UFS
 and sw/hw-raid and live with the restrictions of such fs.

 //T

 2008/10/9 Mike Gerdts [EMAIL PROTECTED]:
   
 On Thu, Oct 9, 2008 at 7:44 AM, Ahmed Kamal
 [EMAIL PROTECTED] wrote:
 

In the past year I've lost more ZFS file systems than I have any other
type of file system in the past 5 years.  With other file systems I
can almost always get some data back.  With ZFS I can't get any back.

   
 Thats scary to hear!

 
 I am really scared now! I was the one trying to quantify ZFS reliability,
 and that is surely bad to hear!
   
 The circumstances where I have lost data have been when ZFS has not
 handled a layer of redundancy.  However, I am not terribly optimistic
 of the prospects of ZFS on any device that hasn't committed writes
 that ZFS thinks are committed.  Mirrors and raidz would also be
 vulnerable to such failures.

 I also have run into other failures that have gone unanswered on the
 lists.  It makes me wary about using zfs without a support contract
 that allows me to escalate to engineering.  Patching only support
 won't help.

 http://mail.opensolaris.org/pipermail/zfs-discuss/2007-December/044984.html
   Hang only after I mirrored the zpool, no response on the list

 http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048255.html
   I think this is fixed around snv_98, but the zfs-discuss list was
   surprisingly silent on acknowledging it as a problem - I had no
   idea that it was being worked until I saw the commit.  The panic
   seemed to be caused by dtrace - core developers of dtrace
   were quite interested in the kernel crash dump.

 http://mail.opensolaris.org/pipermail/zfs-discuss/2008-September/051109.html
   Panic during ON build.  Pool was lost, no response from list.

 --
 Mike Gerdts
 http://mgerdts.blogspot.com/

 



   


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-09 Thread Mike Gerdts
On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw [EMAIL PROTECTED] wrote:
 Nevada isn't production code.  For real ZFS testing, you must use a
 production release, currently Solaris 10 (update 5, soon to be update 6).

I misstated before in my LDoms case.  The corrupted pool was on
Solaris 10, with LDoms 1.0.  The control domain was SX*E, but the
zpool there showed no problems.  I got into a panic loop with dangling
dbufs.  My understanding is that this was caused by a bug in the LDoms
manager 1.0 code that has been fixed in a later release.  It was a
supported configuration, I pushed for and got a fix.  However, that
pool was still lost.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-09 Thread Miles Nordin
 gs == Greg Shaw [EMAIL PROTECTED] writes:

gs Nevada isn't production code.  For real ZFS testing, you must
gs use a production release, currently Solaris 10 (update 5, soon
gs to be update 6).

based on list feedback, my impression is that the results of a
``test'' confined to s10, particularly s10u4 (the latest available
during most of Mike's experience), would be worse than Nevada
experience over the same period.  but I doubt either matches UFS+SVM
or ext3+LVM2.  The on-disk format with ``ditto blocks'' and ``always
consistent'' may be fantastic, but the code for reading it is not.

Maybe the code is stellar, and the problem really is underlying
storage stacks that fail to respect write barriers.  If so, ZFS needs
to include a storage stack qualification tool.  For me it doesn't
strain credibility to believe these problems might be rampant in VM
stacks and SAN's, nor do I find it unacceptable if ZFS is vastly more
sensitive to them than any other filesystem.  If this speculation
turns out to really be the case, I imagine the two going together: the
problems are rampant because they don't bother other filesystems too
catastrophically.  If this is really the situation, then ZFS needs to
give the sysadmin a way to isolate and fix the problems
deterministically before filling the pool with data, not just blame
the sysadmin based on nebulous, speculative hindsight gremlins.

And if it's NOT the case, the ZFS problems need to be acknowledged and
fixed.

In my view, the above is *IN ADDITION* to developing a
recovery/forensic/``fsck'' tool, not either/or.  The pools should not
be getting corrupt in the first place, and pulling the cord should not
mean you have to settle for best-effort.  None of the modern
filesystems demand an fsck after unclean shutdown.

The current procedure for qualifying a platform seems to be: (1)
subject it to heavy write activity, (2) pull the cord, (3) repeat.
Ahmed, maybe you should use that test to ``quantify'' filesystem
reliability.  You can try it with ZFS, then reinstall the machine with
CentOS and try the same test with ext3+LVM2 or xfs+areca.  The numbers
you get are how many times you can pull the cord before you lose
something, and how much you lose.  Here's a really old test of that
sort comparing Linux filesystems which is something like what I have
in mind:

 https://www.redhat.com/archives/fedora-list/2004-July/msg00418.html

So you see he got two sets of numbers---the number of reboots and the
amount of corruption.  For reiserfs and JFS he lost their equivalent
of ``the whole pool'', while for ext3 and XFS he got corruption but
never lost the pool.  It's not clear to me the filesystems ever
claimed to prevent corruption in his test scenario (was he calling
fsync() after each log write?  syslog does that sometimes, and if so
they do make that claim, but if he's just writing with some silly
script they don't).  They definitely all claim you won't lose the
whole pool in a power outage, though, and only two out of four
delivered on that.  I base my choice of Linux filesystem on this
test, and wish I'd done such a test before converting things to ZFS.
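
To make that concrete, here is a rough sketch of the writer/verifier
harness I have in mind.  It is my own illustration, not an existing
tool (the name pullplug.c and the record layout are made up), but
something like it could double as the ``storage stack qualification
tool'' I asked for above: run ``write'' against the storage under
test, pull the cord, reboot, run ``verify'', and count how many
records the stack acknowledged as durable but then lost.

/*
 * pullplug.c -- sketch of a cord-pull durability test (illustrative
 * only).  "write" mode appends records and prints a sequence number
 * only after fsync() has returned, i.e. after the storage stack has
 * claimed the record is on stable media.  "verify" mode, run after
 * the crash, reports the highest intact record actually found on
 * disk.  Any gap between the two is a lost acknowledged write.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct rec {
	uint64_t seq;
	uint64_t check;		/* integrity check: seq ^ MAGIC */
};
#define	MAGIC	0x5a5a5a5a5a5a5a5aULL

int
main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s write|verify <file>\n", argv[0]);
		return (1);
	}
	if (strcmp(argv[1], "write") == 0) {
		int fd = open(argv[2], O_WRONLY | O_CREAT | O_APPEND, 0644);
		if (fd < 0) { perror("open"); return (1); }
		struct rec r = { 0, 0 };
		for (;;) {	/* run until the cord is pulled */
			r.seq++;
			r.check = r.seq ^ MAGIC;
			if (write(fd, &r, sizeof (r)) != sizeof (r)) {
				perror("write"); return (1);
			}
			if (fsync(fd) != 0) { perror("fsync"); return (1); }
			/* only now has durability been claimed */
			printf("%llu\n", (unsigned long long)r.seq);
			fflush(stdout);
		}
	} else {
		int fd = open(argv[2], O_RDONLY);
		if (fd < 0) { perror("open"); return (1); }
		struct rec r;
		uint64_t last = 0;
		while (read(fd, &r, sizeof (r)) == sizeof (r) &&
		    r.check == (r.seq ^ MAGIC))
			last = r.seq;
		printf("highest intact record: %llu\n",
		    (unsigned long long)last);
		close(fd);
	}
	return (0);
}

The two numbers per cord-pull are then: how many records were
acknowledged but missing after reboot, and whether the file (or the
whole pool) survived at all.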


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-09 Thread Bob Friesenhahn
On Thu, 9 Oct 2008, Miles Nordin wrote:

 catastrophically.  If this is really the situation, then ZFS needs to
 give the sysadmin a way to isolate and fix the problems
 deterministically before filling the pool with data, not just blame
 the sysadmin based on nebulous speculatory hindsight gremlins.

 And if it's NOT the case, the ZFS problems need to be acknowledged and
 fixed.

Can you provide any supporting evidence that ZFS is as fragile as you 
describe?

From recent opinions expressed here, properly-designed ZFS pools must 
be inexplicably permanently cratering each and every day.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-09 Thread Mike Gerdts
On Thu, Oct 9, 2008 at 10:18 AM, Mike Gerdts [EMAIL PROTECTED] wrote:
 On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw [EMAIL PROTECTED] wrote:
 Nevada isn't production code.  For real ZFS testing, you must use a
 production release, currently Solaris 10 (update 5, soon to be update 6).

 I misstated before in my LDoms case.  The corrupted pool was on
 Solaris 10, with LDoms 1.0.  The control domain was SX*E, but the
 zpool there showed no problems.  I got into a panic loop with dangling
 dbufs.  My understanding is that this was caused by a bug in the LDoms
 manager 1.0 code that has been fixed in a later release.  It was a
 supported configuration, so I pushed for and got a fix.  However, that
 pool was still lost.

Or maybe it wasn't fixed yet.  I see that this was committed just today.

6684721 file backed virtual i/o should be synchronous

http://hg.genunix.org/onnv-gate.hg/rev/eb40ff0c92ec
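
For anyone wondering what ``synchronous'' buys here: the point is that
a file-backed virtual disk must not acknowledge a guest write while
the data is only sitting in the host's page cache.  The sketch below
is purely illustrative (it is not the vds code, and the file name is
made up); it just shows the distinction in plain POSIX terms, assuming
O_DSYNC semantics on the backing file.

/*
 * Illustration only -- not the LDoms virtual disk server.  A backend
 * that opens its backing file with O_DSYNC (or that calls fdatasync()
 * before completing each guest request) only acknowledges a write
 * once the data is on stable storage; a plain open() lets the write
 * "complete" while it is still in host memory.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <backing-file>\n", argv[0]);
		return (1);
	}

	/* Unsafe: completion only means "in the host page cache". */
	/* int fd = open(argv[1], O_WRONLY); */

	/* Safe: write() returns only after the data is stable. */
	int fd = open(argv[1], O_WRONLY | O_DSYNC);
	if (fd < 0) { perror("open"); return (1); }

	char block[512] = "guest write payload";
	if (pwrite(fd, block, sizeof (block), 0) != sizeof (block)) {
		perror("pwrite");
		return (1);
	}
	/* Only now is it safe to signal completion back to the guest. */
	(void) close(fd);
	return (0);
}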

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-06 Thread Darren J Moffat
Fajar A. Nugraha wrote:
 On Fri, Oct 3, 2008 at 10:37 PM, Vasile Dumitrescu
 [EMAIL PROTECTED] wrote:
 
 VMWare 6.0.4 running on Debian unstable,
 Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 
 GNU/Linux

 Solaris is vanilla snv_90 installed with no GUI.
 
 
 in summary: physical disks, assigned 100% to the VM
 
 That's weird. I thought one of the points of using physical disks
 instead of files was to avoid problems caused by caching on the host/dom0?

The data still flows through the host/dom0 device drivers and is thus at 
the mercy of the commands they issue to the physical devices.

-- 
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-04 Thread Fajar A. Nugraha
On Fri, Oct 3, 2008 at 10:37 PM, Vasile Dumitrescu
[EMAIL PROTECTED] wrote:

 VMWare 6.0.4 running on Debian unstable,
 Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 
 GNU/Linux

 Solaris is vanilla snv_90 installed with no GUI.



 in summary: physical disks, assigned 100% to the VM

That's weird. I thought one of the points of using physical disks
instead of files was to avoid problems caused by caching on the host/dom0?
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-03 Thread Vasile Dumitrescu
Hi folks,

I just wanted to share the end of my adventure here and especially take the 
time to thank Victor for helping me out of this mess.

I will let him explain the technical details (I am out of my depth here) but, 
bottom line, he spent a couple of hours with me on the machine and sorted me 
out. His explanation: he invalidated the incorrect uberblocks and forced zfs to 
revert to an earlier, consistent state.

The machine is now in the process of doing a full scrub and the first order of 
business tomorrow will be to do a full backup :-)

According to his explanation, the reason for the troubles I had was that 
Solaris was running in a VM on my Debian server and it was not shut down 
properly when the Debian server did a controlled shutdown following a UPS event.

The Solaris machine was abruptly shut down, and because it was not in control of 
the entire chain down to the bare hardware, it appears that some writes were in 
fact still buffered by Debian when Solaris thought them safely committed.

This left the zpool in question in a state that even raidz1 did not help with.
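
As far as I could follow what Victor did, the idea is roughly this
(my lay sketch only, not ZFS source and not the actual commands he
ran): each ZFS vdev label keeps a ring of recent uberblocks, and the
recovery amounted to ignoring the newest ones whose transaction
groups never fully reached my disks, then importing from the newest
older uberblock whose metadata still checks out.

/*
 * Conceptual toy, nothing more.  pick_uberblock() chooses the newest
 * entry in the ring whose tree verifies; entries whose txgs were
 * "acknowledged" by the storage but never made it are skipped, which
 * is the fall-back-to-an-earlier-txg behaviour described above.
 */
#include <stdint.h>
#include <stdio.h>

struct ub {			/* hypothetical, simplified uberblock */
	uint64_t txg;		/* transaction group number */
	uint64_t rootbp;	/* stand-in for the root block pointer */
};

/* Stand-in for walking and checksumming everything under the root. */
static int
tree_ok(const struct ub *u)
{
	return (u->rootbp != 0);	/* toy rule: 0 means unreadable */
}

static const struct ub *
pick_uberblock(const struct ub *ring, int n)
{
	const struct ub *best = NULL;
	int i;

	for (i = 0; i < n; i++) {
		if (!tree_ok(&ring[i]))
			continue;
		if (best == NULL || ring[i].txg > best->txg)
			best = &ring[i];
	}
	return (best);
}

int
main(void)
{
	/* txgs 103 and 104 never made it out of the host in my case */
	struct ub ring[] = {
		{ 101, 0xaaa }, { 102, 0xbbb }, { 103, 0 }, { 104, 0 },
	};
	const struct ub *u = pick_uberblock(ring, 4);

	if (u != NULL)
		printf("import from txg %llu\n", (unsigned long long)u->txg);
	else
		printf("no usable uberblock found\n");
	return (0);
}

If memory serves, zdb lets you peek at some of this from user land
(zdb -l against a vdev dumps the labels, and zdb -u against the pool
shows the active uberblock), but please treat that as a pointer to the
man page rather than a recipe.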

Anyway, again, lots and lots of thanks to Victor!!!

kind regards
Vasile
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-03 Thread Darren J Moffat
Vasile Dumitrescu wrote:
 Hi folks,
 
 I just wanted to share the end of my adventure here and especially take the 
 time to thank Victor for helping me out of this mess.
 
 I will let him explain the technical details (I am out of my depth here) but 
 bottom line he spent a couple of hours with me on the machine and sorted me 
 out. His explanation: he invalidated the incorrect uberblocks and forced zfs 
 to revert to an earlier state that was consistent.
 
 The machine is now in the process of doing a full scrub and the first order 
 of business tomorrow will be to do a full backup :-)
 
 According to his explanation, the reason for the troubles I had was that 
 Solaris was running in a VM on my Debian server and it was not shut down 
 properly when the Debian server did a controlled shutdown following a UPS 
 event.

Which VM solution was this ? VMware, VirtualBox, Xen, other ?  How were 
the disks presented to the guest ?  What are the disks in the host, 
real disks, files, something else ?


-- 
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow

2008-10-03 Thread Vasile Dumitrescu
 
 Which VM solution was this ? VMware, VirtualBox, Xen, other ?
 How were the disks presented to the guest ?  What are the disks
 in the host, real disks, files, something else ?
 
 
 -- 
 Darren J Moffat

VMWare 6.0.4 running on Debian unstable, 
Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 GNU/Linux

Solaris is vanilla snv_90 installed with no GUI.

Here is the content of the .vmx file in question:

#!/usr/bin/vmware
config.version = 8
virtualHW.version = 6
scsi0.present = TRUE
scsi0.virtualDev = lsilogic

memsize = 4096
MemAllowAutoScaleDown = FALSE
MemTrimRate = 0
sched.mem.pshare.enable = FALSE
sched.mem.minsize = 3062
sched.mem.max = 7000
sched.mem.maxmemctl = 0
sched.mem.shares = 10

scsi0:0.present = TRUE
scsi0:0.fileName = /home/vasile/vmware/solsrv/OpenSolaris64.vmdk
ide1:0.present = TRUE
ide1:0.autodetect = TRUE
ide1:0.deviceType = cdrom-image
floppy0.startConnected = FALSE
floppy0.autodetect = TRUE
ethernet0.present = TRUE
ethernet0.virtualDev = e1000
ethernet0.wakeOnPcktRcv = TRUE
sound.present = FALSE
sound.fileName = -1
sound.autodetect = TRUE
svga.autodetect = FALSE
pciBridge0.present = TRUE
displayName = zfssrv
guestOS = solaris10-64
nvram = Solaris 10 64-bit.nvram
deploymentPlatform = windows
virtualHW.productCompatibility = hosted
RemoteDisplay.vnc.port = 0
tools.upgrade.policy = useGlobal

floppy0.fileName = /dev/fd0
extendedConfigFile = Solaris 10 64-bit.vmxf

ide1:0.fileName = 
floppy0.present = FALSE
gui.powerOnAtStartup = TRUE

ide1:0.startConnected = TRUE
ethernet0.addressType = generated
uuid.location = 56 4d da 02 a4 a0 78 74-2e 09 90 62 45 bb c4 94
uuid.bios = 56 4d da 02 a4 a0 78 74-2e 09 90 62 45 bb c4 94
scsi0:0.redo = 
pciBridge0.pciSlotNumber = 17
scsi0.pciSlotNumber = 16
ethernet0.pciSlotNumber = 32
sound.pciSlotNumber = -1
ethernet0.generatedAddress = 00:0c:29:bb:c4:94
ethernet0.generatedAddressOffset = 0
tools.syncTime = FALSE

svga.maxWidth = 1024
svga.maxHeight = 768
svga.vramSize = 3145728
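
# scsi0:1 through scsi0:6 below are the six ztank disks, raw-device
# mapped (deviceType = rawDisk) and marked independent-persistent, so
# they are excluded from VMware snapshots and never use redo logs.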

scsi0:1.present = TRUE
scsi0:1.fileName = ztank-sda.vmdk
scsi0:1.mode = independent-persistent
scsi0:1.deviceType = rawDisk
scsi0:2.present = TRUE
scsi0:2.fileName = ztank-sdb.vmdk
scsi0:2.mode = independent-persistent
scsi0:2.deviceType = rawDisk
scsi0:3.present = TRUE
scsi0:3.fileName = ztank-sdc.vmdk
scsi0:3.mode = independent-persistent
scsi0:3.deviceType = rawDisk
scsi0:4.present = TRUE
scsi0:4.fileName = ztank-sdd.vmdk
scsi0:4.mode = independent-persistent
scsi0:4.deviceType = rawDisk
scsi0:5.present = TRUE
scsi0:5.fileName = ztank-sde.vmdk
scsi0:5.mode = independent-persistent
scsi0:5.deviceType = rawDisk
scsi0:6.present = TRUE
scsi0:6.fileName = ztank-sdf.vmdk
scsi0:6.mode = independent-persistent
scsi0:6.deviceType = rawDisk

scsi0:1.redo = 
scsi0:2.redo = 
scsi0:3.redo = 
scsi0:4.redo = 
scsi0:5.redo = 
scsi0:6.redo = 

isolation.tools.dnd.disable = TRUE
snapshot.disabled = TRUE

scsi0:0.mode = independent-persistent

isolation.tools.copy.disable = FALSE
isolation.tools.paste.disable = FALSE

tools.remindInstall = TRUE


in summary: physical disks, assigned 100% to the VM

HTH

kind regards
Vasile
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss