Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
FYI, I'm working on a workaround for broken devices. As you note, some disks flat-out lie: you issue the synchronize-cache command, they say got it, boss, yet the data is still not on stable storage. Why do they do this? Because it performs better. Well, duh -- you can make stuff *really* fast if it doesn't have to be correct. The uberblock ring buffer in ZFS gives us a way to cope with this, as long as we don't reuse freed blocks for a few transaction groups. The basic idea: if we can't read the pool starting from the most recent uberblock, then we should be able to use the one before it, or the one before that, etc, as long as we haven't yet reused any blocks that were freed in those earlier txgs. This allows us to use the normal load on the pool, plus the passage of time, as a displacement flush for disk caches that ignore the sync command. If we go back far enough in (txg) time, we will eventually find an uberblock all of whose dependent data blocks have made it to disk. I'll run tests with known-broken disks to determine how far back we need to go in practice -- I'll bet one txg is almost always enough. Jeff Hi Jeff, we just lost 2 pools on snv91. Any news about your workaround to recover pools by discarding the last txg? thanks gino -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
It would be extremely helpful to know what brands/models of disks lie and which don't. This information could be provided diplomatically simply as threads documenting problems you are working on, stating the facts. Use of a specific string of words would make searching for it easy. There should be no liability, since you are simply documenting compatibility with zfs. Or perhaps if the lawyers let you, you could simply publish a compatibility/incompatibility list. These ARE facts. If there is a way to make a detection tool, that would be very useful too, although after the purchase is made, it could be hard to send it back. However that info could be fed into the database as that drive/model being incompatible with zfs. As Solaris / zfs gains ground, this could become a strong driver in the industry. Re: I'll run tests with known-broken disks to determine how far back we need to go in practice -- I'll bet one txg is almost always enough. So go back three - we are using zfs because we want absolute reliability (or at least as close as we can get). --Ray -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
[EMAIL PROTECTED] wrote on 10/11/2008 09:36:02 PM: On Oct 10, 2008, at 7:55 PM, David Magda wrote: If someone finds themselves in this position, what advice can be followed to minimize risks? Can you ask for two LUNs on different physical SAN devices and have an expectation of getting it? Better yet, also ask for multiple paths over different SAN infrastructure to each. Then again, I would hope you don't need to ask your SAN folks for that? -Wade ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 10:33 PM, Mike Gerdts [EMAIL PROTECTED] wrote: On Thu, Oct 9, 2008 at 10:18 AM, Mike Gerdts [EMAIL PROTECTED] wrote: On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw [EMAIL PROTECTED] wrote: Nevada isn't production code. For real ZFS testing, you must use a production release, currently Solaris 10 (update 5, soon to be update 6). I misstated before in my LDoms case. The corrupted pool was on Solaris 10, with LDoms 1.0. The control domain was SX*E, but the zpool there showed no problems. I got into a panic loop with dangling dbufs. My understanding is that this was caused by a bug in the LDoms manager 1.0 code that has been fixed in a later release. It was a supported configuration, I pushed for and got a fix. However, that pool was still lost. Or maybe it wasn't fixed yet. I see that this was committed just today. 6684721 file backed virtual i/o should be synchronous http://hg.genunix.org/onnv-gate.hg/rev/eb40ff0c92ec The related information from the LDoms Manager 1.1 Early Access release notes (820-4914-10): Data Might Not Be Written Immediately to the Virtual Disk Backend If Virtual I/O Is Backed by a File or Volume Bug ID 6684721: When a file or volume is exported as a virtual disk, then the service domain exporting that file or volume is acting as a storage cache for the virtual disk. In that case, data written to the virtual disk might get cached into the service domain memory instead of being immediately written to the virtual disk backend. Data are not cached if the virtual disk backend is a physical disk or slice, or if it is a volume device exported as a single-slice disk. Workaround: If the virtual disk backend is a file or a volume device exported as a full disk, then you can prevent data from being cached into the service domain memory and have data written immediately to the virtual disk backend by adding the following line to the /etc/system file on the service domain. set vds:vd_file_write_flags = 0 Note – Setting this tunable flag does have an impact on performance when writing to a virtual disk, but it does ensure that data are written immediately to the virtual disk backend. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
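For reference, applying and then double-checking that workaround might look like the following. (The mdb check is my assumption about where the live tunable can be read, based on the vds: module prefix in the release note; treat it as a sketch, not official documentation.)

    # On the service domain, append the workaround and reboot:
    echo 'set vds:vd_file_write_flags = 0' >> /etc/system

    # After reboot, read the live value back from the kernel
    # (assumes the vds module is loaded):
    echo 'vd_file_write_flags/D' | mdb -k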
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Oct 10, 2008, at 7:55 PM, David Magda wrote: If someone finds themselves in this position, what advice can be followed to minimize risks? Can you ask for two LUNs on different physical SAN devices and have an expectation of getting it? -- Keith H. Bierman [EMAIL PROTECTED] | AIM kbiermank 5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749 speaking for myself* Copyright 2008 ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
2008/10/9 Bob Friesenhahn [EMAIL PROTECTED]: On Thu, 9 Oct 2008, Miles Nordin wrote: catastrophically. If this is really the situation, then ZFS needs to give the sysadmin a way to isolate and fix the problems deterministically before filling the pool with data, not just blame the sysadmin based on nebulous speculatory hindsight gremlins. And if it's NOT the case, the ZFS problems need to be acknowledged and fixed. Can you provide any supportive evidence that ZFS is as fragile as you describe? The hundreds of sysadmins seeing their pools go byebye after normal operations in a production environment is evidence enough. And the number of times people like Victor have saved our asses. From recent opinions expressed here, properly-designed ZFS pools must be inexplicably permanently cratering each and every day. Bob == Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Timh Bergström System Administrator Diino AB - www.diino.com :wq ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Hello all, I think the problem here is ZFS' capacity for recovery from a failure. Forgive me, but in aiming to write code without failures, maybe the hackers forgot that other people can make mistakes (even if they can't). - ZFS does not need fsck. Ok, that's a great statement, but I think ZFS needs one. Really does. And in my opinion an enhanced zdb would be the solution. Flexibility. Options. - I have 90% of something I think is your filesystem, do you want it? I think a piece of software is only as good as its ability to recover from failures. And I don't want to know who failed; I'm not going to send anyone to jail, I'm not a lawyer. I agree with Jeff, really do, but that is another problem... The solution Jeff is working on is, I think, really great, since it is NOT all-or-nothing again... I don't know about you, but A LOT of times I was saved by the Lost and Found directory! All the beauty of a UNIX system is rm /etc/passwd after having edited it, and getting the whole file back by doing a cat /dev/mem. ;-) A lot of parts of the ZFS design remind me of when you see something left on the floor at home, and you ask your son why he did not pick it up, and he says it was not me. peace. Leal. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
The circumstances where I have lost data have been when ZFS has not handled a layer of redundancy. However, I am not terribly optimistic of the prospects of ZFS on any device that hasn't committed writes that ZFS thinks are committed. FYI, I'm working on a workaround for broken devices. As you note, some disks flat-out lie: you issue the synchronize-cache command, they say got it, boss, yet the data is still not on stable storage. Why do they do this? Because it performs better. Well, duh -- you can make stuff *really* fast if it doesn't have to be correct. Before I explain how ZFS can fix this, I need to get something off my chest: people who knowingly make such disks should be in federal prison. It is *fraud* to win benchmarks this way. Doing so causes real harm to real people. Same goes for NFS implementations that ignore sync. We have specifications for a reason. People assume that you honor them, and build higher-level systems on top of them. Change the mass of the proton by a few percent, and the stars explode. It is impossible to build a functioning civil society in a culture that tolerates lies. We need a little more Code of Hammurabi in the storage industry. Now: The uberblock ring buffer in ZFS gives us a way to cope with this, as long as we don't reuse freed blocks for a few transaction groups. The basic idea: if we can't read the pool starting from the most recent uberblock, then we should be able to use the one before it, or the one before that, etc, as long as we haven't yet reused any blocks that were freed in those earlier txgs. This allows us to use the normal load on the pool, plus the passage of time, as a displacement flush for disk caches that ignore the sync command. If we go back far enough in (txg) time, we will eventually find an uberblock all of whose dependent data blocks have made it to disk. I'll run tests with known-broken disks to determine how far back we need to go in practice -- I'll bet one txg is almost always enough. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
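A rough sketch of what that search could look like (hypothetical code, not the actual ZFS implementation; the slot count, helper functions and names here are invented for illustration):

    /*
     * Walk the uberblock ring buffer from newest txg to oldest and
     * settle on the first uberblock whose block tree is fully
     * readable.  ZFS keeps an array of uberblock slots in each label;
     * 128 here is illustrative.
     */
    #define UB_SLOTS 128

    uberblock_t *
    find_openable_uberblock(uberblock_t ub[UB_SLOTS])
    {
            sort_by_txg_descending(ub, UB_SLOTS);   /* helper assumed */

            for (int i = 0; i < UB_SLOTS; i++) {
                    /*
                     * traverse_from() is assumed to walk all blocks
                     * reachable from this uberblock, verifying
                     * checksums, and to return 0 on success.
                     */
                    if (traverse_from(&ub[i]) == 0)
                            return (&ub[i]);  /* consistent view found */
            }
            return (NULL);  /* nothing fully readable; pool is lost */
    }

The key constraint Jeff describes is the precondition: this only works if blocks freed in the last few txgs have not yet been reallocated, which is why the proposal pairs the search with delayed reuse of freed blocks.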
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Hi Jeff, On Fri, 2008-10-10 at 01:26 -0700, Jeff Bonwick wrote: The circumstances where I have lost data have been when ZFS has not handled a layer of redundancy. However, I am not terribly optimistic of the prospects of ZFS on any device that hasn't committed writes that ZFS thinks are committed. FYI, I'm working on a workaround for broken devices. As you note, some disks flat-out lie: you issue the synchronize-cache command, they say got it, boss, yet the data is still not on stable storage. It's not just about ignoring the synchronize-cache command; there's also another weak spot. ZFS is quite resilient against so-called phantom writes, provided that they occur sporadically - let's say, if the disk decides to _randomly_ ignore writes 10% of the time, ZFS could probably survive that pretty well even on single-vdev pools, due to ditto blocks. However, it is not so resilient when the storage system suffers hiccups which cause phantom writes to occur continuously, even if for a small period of time (say less than 10 seconds), and then return to normal. This could happen for several reasons, including network problems, bugs in software or even firmware, etc. I think in this case, going back to a previous uberblock could also be enough to recover from such a scenario most of the time, unless perhaps the error occurred too long ago, and the unwritten metadata got flushed out of the ARC and didn't have a chance to get rewritten. In any case, a more generic solution to repair all kinds of metadata corruption, such as (e.g.) space map corruption, would be very desirable, as I think everyone can agree. Best regards, Ricardo ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
That sounds like a great idea for a tool Jeff. Would it be possible to build that in as a zpool recover command? Being able to run a tool like that and see just how bad the corruption is, but knowing it's possible to recover an older version, would be great. Is there any chance of outputting details so the sysadmin can know roughly how much was lost? My thoughts are going to be very rough (I don't know much about zfs internals), but I'm wondering if something like this would work, where all bad blocks are reported, along with the latest 3 good ones:

    # zpool recover pool
    ... pool details ...
    Finding and testing uberblocks...
    1.  block a    date/time: x    CORRUPTED
    2.  block b    date/time: y    CORRUPTED
    3.  block c    date/time: z    Appears OK
    4.  block d    date/time: z    Appears OK
    5.  block e    date/time: z    Appears OK

Victor was talking in another thread about using zdb to check the pool before doing an import of a damaged pool. Might it be possible for the next stage of the recovery process to give the user an option of testing or importing the pool for any particular uberblock? It does sound like testing can take a long time, so this would need to be something that can be cancelled, and you would also need a way to mark uberblocks as bad should problems be found with either the test or import. This would be a great addition to ZFS though, and would hopefully save Victor a bit of time ;-) Ross -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
jb == Jeff Bonwick [EMAIL PROTECTED] writes: rmc == Ricardo M Correia [EMAIL PROTECTED] writes: jb We need a little more Code of Hammurabi in the storage jb industry. It seems like most of the work people have to do now is cleaning up after the sloppiness of others. At least it takes the longest. You could always mention which disks you found ignoring the command---wouldn't that help the overall problem? I understand there's a pervasive ``i don' wan' any trouble, mistah'' attitude, but I don't understand where it comes from. http://www.ferris.edu/news/jimcrow/tom/ jb displacement flush for disk caches that ignore the sync jb command. Sounds like a good idea, but: (1) won't this break the NFS guarantees you were just saying should never be broken? I get it, someone else is breaking a standard so how can ZFS be expected to yadda yadda yadda. But I fear it will just push ``blame the sysadmin'' one step further out. ex., Q. ``with ZFS all my NFS clients become unstable after the server reboots,'' or ``I'm getting silent corruption with NFS''. A. ``your drives might have gremlins in them, no way to know,'' and ``well what do you expect without a single integrity domain and TCP's weak checksums. / no i'm using a crossover cable, and FCS is not weak. / ZFS managing a layer of redundancy it is probably your RAM or corruption on the uh, between the Ethernet MAC chip and the PCI slot'' (1a) I'm concerned about how it'll be reported when it happens. (a) if it's not reported at all, then ZFS is hiding the fact that fsync() is not working. Also, other journaling filesystems sometimes report when they find ``unexpected'' corruption, which is useful for finding both hardware and software problems. I'm already concerned ZFS is not reporting enough, like when it says a vdev component is ONLINE, but 'zpool offline pool component' says 'no valid replicas', then after a scrub there is no change to zpool status, but zpool offline works again. ZFS should not ``simplify'' the user interface to the point that it's hiding problems with itself and its environment to the ends of avoiding discussion. (b) if it is reported, then whenever the reporter-blob raises its hand it will have the effect of exonerating ZFS in most people's minds, like the stupid CKSUM column does right now. ``ZFS-FEED-B33F error? oh yeah that's the new uberblock search code. that means your disks are ignoring the SYNCHRONIZE CACHE command. thank GOD you have ZFS with ANY OTHER FILESYSTEM all bets would be totally off. lucky you. / I have tried ten different models from all four brands. / yeah sucks don't it? flagrant violation of the standard, industry wide. / my linux testing tool says they're obeying the command fine / linux is crap / i added a patch to solaris to block the SYNC CACHE command and the disks got faster so I think it's not being ignored / well the stack is complicated and flushing happens at many levels, like think about controller performance, and that's completely unsupported you are doing something REALLY UNSAFE there you should NOT DO THAT it is STUPID'' and so on, stalling the actual fix literally for years. The right way to exonerate ZFS is to make a diagnosis tool for the disks which proves they're broken, and then don't buy those disks. not to make a new class of ZFS fault report that could potentially capture all kinds of problems, then hazily assign blame to an untestable quantity. (2) disks are probably not the only thing dropping the write barriers. So far, we're also suspecting (unproven!) iSCSI targets/initiators, particularly around a TCP reconnection event or target reboot, and VM stacks, both VirtualBox and the HVM in UltraSPARC T1. probably other stuff. I'm concerned that assumptions you'll find safe to make about disks after you get started, like nothing is more than 1s stale, or send a CDB to size the on-disk cache and imagine it's a FIFO and it'll be no worse than that, or ``you can get an fsync by pausing reads for 500ms'' or whatever, will add robustness for current and future broken disks but won't apply to other types of broken storage layer. rmc However, it is not so resilient when the storage system rmc suffers hiccups which cause phantom writes to occur rmc continuously, even if for a small period of time (say less rmc than 10 seconds)
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote: - ZFS does not need fsck. Ok, that's a great statement, but I think ZFS needs one. Really does. And in my opinion an enhanced zdb would be the solution. Flexibility. Options. About 99% of the problems reported as I need ZFS fsck can be summed up by two ZFS bugs: 1. If a toplevel vdev fails to open, we should be able to pull information from necessary ditto blocks to open the pool and make what progress we can. Right now, the root vdev code assumes can't open = faulted pool, which results in failure scenarios that are perfectly recoverable most of the time. This needs to be fixed so that pool failure is only determined by the ability to read critical metadata (such as the root of the DSL). 2. If an uberblock ends up with an inconsistent view of the world (due to failure of DKIOCFLUSHWRITECACHE, for example), we should be able to go back to previous uberblocks to find a good view of our pool. This is the failure mode described by Jeff. These are both bugs in ZFS and will be fixed. The other 1% of the complaints are usually of the form I created my pool on top of my old one or I imported a LUN on two different systems at the same time. It's unclear what a 'fsck' tool could do in this scenario, if anything. Due to a variety of reasons (hierarchical nature of ZFS, variable block sizes, RAID-Z, compression, etc), it's difficult to even *identify* a ZFS block, let alone determine its validity and associate it in some larger construct. There are some interesting possibilities for limited forensic tools - in particular, I like the idea of an mdb backend for reading and writing ZFS pools[1]. But I haven't actually heard a reasonable proposal for what a fsck-like tool (i.e. one that could repair things automatically) would actually *do*, let alone how it would work in the variety of situations it needs to (compressed RAID-Z?) where the standard ZFS infrastructure fails. - Eric [1] http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html -- Eric Schrock, Fishworks http://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Eric Schrock wrote: On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote: - ZFS does not need fsck. Ok, that's a great statement, but I think ZFS needs one. Really does. And in my opinion an enhanced zdb would be the solution. Flexibility. Options. About 99% of the problems reported as I need ZFS fsck can be summed up by two ZFS bugs: 1. If a toplevel vdev fails to open, we should be able to pull information from necessary ditto blocks to open the pool and make what progress we can. Right now, the root vdev code assumes can't open = faulted pool, which results in failure scenarios that are perfectly recoverable most of the time. This needs to be fixed so that pool failure is only determined by the ability to read critical metadata (such as the root of the DSL). 2. If an uberblock ends up with an inconsistent view of the world (due to failure of DKIOCFLUSHWRITECACHE, for example), we should be able to go back to previous uberblocks to find a good view of our pool. This is the failure mode described by Jeff. I've mostly seen (2), because despite all the best practices out there, single-vdev pools are quite common. In all such cases that I had my hands on, it was possible to recover the pool by going back by one or two txgs. These are both bugs in ZFS and will be fixed. The other 1% of the complaints are usually of the form I created my pool on top of my old one or I imported a LUN on two different systems at the same time. Of these two, the former is not easy because it requires searching through the entire disk space for root block candidates and trying each of them. The latter is not catastrophic if there was little to no activity from one system. In that case one of the first things to suffer is the pool config object, and corruption of it prevents pool open. Fortunately enough, after the putback of 6733970 (assertion failure in dbuf_dirty() via spa_sync_nvlist()) in build 99, a corrupted pool config object is rewritten during open in such a way that prevents reading in the old corrupted copy, and in most cases this allows one to import the pool and save most of the data. zdb is useful to understand how much is corrupted and how much is recovered. If nothing else is corrupted, then the pool may be available for further use without recreation. Again, in every case I had my hands on it was possible to either recover the pool completely or at least save most of the data. It's unclear what a 'fsck' tool could do in this scenario, if anything. Due to a variety of reasons (hierarchical nature of ZFS, variable block sizes, RAID-Z, compression, etc), it's difficult to even *identify* a ZFS block, let alone determine its validity and associate it in some larger construct. Indeed. In one ZFS recovery case involving a 42TB pool with about 8TB used, zdb -bv alone took several hours to walk the block tree and verify the consistency of block pointers, and zdb -bcv took a couple of days to verify all user data blocks as well. The different checksums and gang blocks, in addition to all the other dynamic features mentioned, complicate the task of identifying ZFS blocks and linking those blocks into a tree, and make it really time (and space) consuming. There are some interesting possibilities for limited forensic tools - in particular, I like the idea of an mdb backend for reading and writing ZFS pools[1]. But I haven't actually heard a reasonable proposal for what a fsck-like tool (i.e. one that could repair things automatically) would actually *do*, let alone how it would work in the variety of situations it needs to (compressed RAID-Z?)
where the standard ZFS infrastructure fails. There are a number of bugs and RFEs to improve the usefulness of zdb for field use, e.g.:

    6720637 want zdb -l option to dump uberblock arrays as well
    6709782 issues running zdb with -p and -e options
    6736356 zdb -R needs to work with exported pools
    6720907 zdb should handle errors while dumping datasets and objects
    6746101 zdb command to search for ZFS labels in a device
    6757444 want zdb -R to support decompression, checksumming and raid-z
    6757430 want an option for zdb to disable space map loading and leak tracking

Hth, Victor - Eric [1] http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html -- Eric Schrock, Fishworks http://blogs.sun.com/eschrock ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
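For those following along, the zdb invocations Victor mentions look roughly like this (the pool and device names are examples; run as root, and note his warning that a full checksum walk can take days on a large pool):

    # Walk the block tree and verify the consistency of block pointers:
    zdb -bv tank

    # Also read and checksum every user data block (much slower):
    zdb -bcv tank

    # Dump the labels (including uberblocks) from a device:
    zdb -l /dev/rdsk/c7d0s0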
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
2008/10/10 Richard Elling [EMAIL PROTECTED]: Timh Bergström wrote: 2008/10/9 Bob Friesenhahn [EMAIL PROTECTED]: On Thu, 9 Oct 2008, Miles Nordin wrote: catastrophically. If this is really the situation, then ZFS needs to give the sysadmin a way to isolate and fix the problems deterministically before filling the pool with data, not just blame the sysadmin based on nebulous speculatory hindsight gremlins. And if it's NOT the case, the ZFS problems need to be acknowledged and fixed. Can you provide any supportive evidence that ZFS is as fragile as you describe? The hundreds of sysadmins seeing their pools go byebye after normal operations in a production environment is evidence enough. And the number of times people like Victor have saved our asses. Hundreds? Do you have evidence of this? One is one too many; I don't need evidence of hundreds - that is hopefully an exaggeration. //T -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote: - ZFS does not need fsck. Ok, that's a great statement, but I think ZFS needs one. Really does. And in my opinion an enhanced zdb would be the solution. Flexibility. Options. About 99% of the problems reported as I need ZFS fsck can be summed up by two ZFS bugs: 1. If a toplevel vdev fails to open, we should be able to pull information from necessary ditto blocks to open the pool and make what progress we can. Right now, the root vdev code assumes can't open = faulted pool, which results in failure scenarios that are perfectly recoverable most of the time. This needs to be fixed so that pool failure is only determined by the ability to read critical metadata (such as the root of the DSL). 2. If an uberblock ends up with an inconsistent view of the world (due to failure of DKIOCFLUSHWRITECACHE, for example), we should be able to go back to previous uberblocks to find a good view of our pool. This is the failure mode described by Jeff. These are both bugs in ZFS and will be fixed. That's it! It's 100% for me! ;-) One is the all-or-nothing problem, and the other is about who is guilty... ;-)) There are some interesting possibilities for limited forensic tools - in particular, I like the idea of an mdb backend for reading and writing ZFS pools[1]. In my opinion it would be great to have the whole functionality in zdb. It's simple, and the concepts are clear in the tool. mdb is a debugger, and needs concepts that I think are different from those of a tool for reading/fixing filesystems. Just an opinion... Which does not mean we can not have both. Like I said, flexibility, options... ;-) But I haven't actually heard a reasonable proposal for what a fsck-like tool I think we must NOT get stuck on the word fsck; I used it just as an example (Lost and Found). And I think other users used it just as an example too. The important thing is the two points you have described very *well*. (i.e. one that could repair things automatically) would actually *do*, let alone how it would work in the variety of situations it needs to (compressed RAID-Z?) where the standard ZFS infrastructure fails. - Eric [1] http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html -- Eric Schrock, Fishworks http://blogs.sun.com/eschrock zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Many thanks for your answer! Leal. -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Fri, 2008-10-10 at 11:23 -0700, Eric Schrock wrote: But I haven't actually heard a reasonable proposal for what a fsck-like tool (i.e. one that could repair things automatically) would actually *do*, let alone how it would work in the variety of situations it needs to (compressed RAID-Z?) where the standard ZFS infrastructure fails. I'd say an fsck-like tool for ZFS should not worry much about compression, checksums, RAID-Z and whatnot. In essence, it would try to do what an fsck tool does for a typical filesystem, and so would be mostly oblivious to the layout or encoding of the blocks, perhaps treating blocks with failed checksums as blocks full of zeros. Here's how it could work (of course, this is all easier said than done):

1) Open all the devices specified by the user. Optionally, take just a pool name/guid and scan for the right devices in /dev/[r]dsk.

2) Verify that the pool configuration read from the devices is sane -- if not, try to generate a consistent configuration. Some elements of the pool configuration, such as the correct pool version, could be checked in later steps, depending on features that were found.

3) Starting from the last uberblock, fully traverse a few levels down the tree. If less than 100% of the blocks could be read without errors, do the same for previous uberblocks and offer the user the choice of which uberblock to use, or if running non-interactively, choose the one with the best success rate.

4) Traverse the list/tree of filesystems, snapshots and clones. Make sure that they are well-connected. For each filesystem, try to replay the ZILs, then clean them out.

5) Now fully traverse the pool. Compute the space maps and FS space usage on the go, as blocks are read.

6) For each metadata block read, check whether the fields are sane; fix them or zero them out if they're not. Basically we're assuming here that we may have corrupted metadata with correct checksums. If some metadata block can not be read due to a failed checksum, assume the block is full of zeros, and fix it. By the way, this includes every field of every kind of metadata block, including ZAPs, ACLs, FID maps, znode fields, everything. For fields that reference other objects, make sure that the object they reference is of the correct type and that the object itself is correct. For objects that are missing, create empty ones if necessary.

7) Check that every object is referenced somewhere and link unreferenced objects to /lost+found/object-type/, or similar.

8) Probably do other things that I'm forgetting.

9) In the end, check whether the space maps are consistent with the computed ones, and write correct ones if not. Check that space usage/reservations/quotas are correct.

Essentially, the goal is that at the end of this process, the pool should contain consistent information, should have as much data as could be recovered, and should never cause any further errors in ZFS due to invalid metadata/fields, either when importing it, reading from it or writing/modifying it (except that it would still return EIO errors when trying to read corrupted file data blocks, of course). Now, a problem with fsck-like tools, and perhaps especially with ZFS, is that some of these steps may either require lots of memory or multiple filesystem/pool traversals. I'd say having such a tool, even if it required additional temporary storage for operation (hopefully not a very large fraction of the pool size), would be *very* useful and would clear up any worries that people currently have.
Kind regards, Ricardo ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Timh Bergström wrote: 2008/10/10 Richard Elling [EMAIL PROTECTED]: Timh Bergström wrote: 2008/10/9 Bob Friesenhahn [EMAIL PROTECTED]: On Thu, 9 Oct 2008, Miles Nordin wrote: catastrophically. If this is really the situation, then ZFS needs to give the sysadmin a way to isolate and fix the problems deterministically before filling the pool with data, not just blame the sysadmin based on nebulous speculatory hindsight gremlins. And if it's NOT the case, the ZFS problems need to be acknowledged and fixed. Can you provide any supportive evidence that ZFS is as fragile as you describe? The hundreds of sysadmins seeing their pools go byebye after normal operations in a production environment is evidence enough. And the number of times people like Victor have saved our asses. Hundreds? Do you have evidence of this? One is one to many, I dont need evidence of hundreds - that is hopefully an exaggeration. Don't show up to a data fight without data :-/ Yes, we do track this information and guys like me analyze it. The ratio of installed base to problem reports for ZFS is quite high. When we see a trend, we adjust priorities to address it. This is just part of our overall quality program. Which brings me to the required mantra, if you don't file a bug or make a service call, the problem doesn't get tracked. Please make the effort so that we can prioritize the use of our limited resources. Posting a fine whine on this (or any) forum is not guaranteed to result in an entry in our problem tracking system -- someone has to put in the extra effort, or it will fall into the silent complainant category. Please help us to improve the quality of our systems, thanks. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Oct 10, 2008, at 15:48, Victor Latushkin wrote: I've mostly seen (2), because despite all the best practices out there, single vdev pools are quite common. In all such cases that I had my hands on it was possible to recover pool by going back by one or two txgs. For better or worse this is the case where I work. Most of our storage is on SANs (EMC and NetApp), and so if we need more space we ask for it and we get a giant LUN given to us (usually multi-pathed). We also have a lot of Veritas VxVM and VxFS for Oracle, and so even if we're running Solaris 10, we're not using ZFS in that case. SAN space is also allocated to Windows and VMware ESX machines as well, so it's not like we can ask for the disks in the SAN to be exported raw, as that would mess up managing of things with the other OSes. (We have a very small global storage / back up team, and I really don't want to add more to their workload.) If someone finds themselves in this position, what advice can be followed to minimize risks? For example, is having checksums enabled a good idea? If you have no redundancy and an error occurs, the system will panic by default (configurable in newer builds of OpenSolaris, but not in Solaris 'proper' yet). But if the system is ignoring checksums, you're no worse off than most other file systems (but still get all the other features of ZFS). Or is there a way to mitigate a checksum error on non-redundant zpool? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Or is there a way to mitigate a checksum error on non-redundant zpool? It's just like the difference between non-parity, parity, and ECC memory. Most filesystems don't have checksums (non-parity), so they don't even know when they're returning corrupt data. ZFS without any replication can detect errors, but can't fix them (like parity memory). ZFS with mirroring or RAID-Z can both detect and correct (like ECC memory). Note: even in a single-device pool, ZFS metadata is replicated via ditto blocks at two or three different places on the device, so that a localized media failure can be both detected and corrected. If you have two or more devices, even without any mirroring or RAID-Z, ZFS metadata is mirrored (again via ditto blocks) across those devices. Jeff ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Fri, Oct 10, 2008 at 9:14 PM, Jeff Bonwick [EMAIL PROTECTED] wrote: Note: even in a single-device pool, ZFS metadata is replicated via ditto blocks at two or three different places on the device, so that a localized media failure can be both detected and corrected. If you have two or more devices, even without any mirroring or RAID-Z, ZFS metadata is mirrored (again via ditto blocks) across those devices. And in the event that you have a pool that is mostly not very important but some of it is important, you can have data mirrored on a per-dataset level via copies=n, as shown below. If we can avoid losing an entire pool by rolling back a txg or two, the biggest source of data loss and frustration is taken care of. Ditto blocks for metadata should take care of most other cases that would result in widespread loss. Normal bit rot that causes you to lose blocks here and there is somewhat likely to take out a small minority of files and spit warnings along the way. If there are some files that are more important to you than others (e.g. losing files in rpool/home may have more impact than losing files in rpool/ROOT), copies=2 can help there. And for those places where losing a txg or two is a mortal sin, don't use flaky hardware, and allow zfs to handle a layer of redundancy. This gets me thinking that it may be worthwhile to have a small (100 MB x 2) rescue boot environment with copies=2 (as well as rpool/boot/) so that pkg repair could be used to deal with cases that prevent your normal (4 GB) boot environment from booting. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
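For example, setting dataset-level redundancy looks like this (the dataset name is just illustrative):

    # Store two copies of every block in this filesystem, even on a
    # single-disk pool; only blocks written after the change get the
    # extra copy:
    zfs set copies=2 rpool/home

    # Verify the setting:
    zfs get copies rpool/home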
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
His explanation: he invalidated the incorrect uberblocks and forced zfs to revert to an earlier state that was consistent. Would someone be willing to document the steps required in order to do this please? I have a disk in a similar state:

    # zpool import
      pool: tank
        id: 13234439337856002730
     state: FAULTED
    status: The pool metadata is corrupted.
    action: The pool cannot be imported due to damaged devices or data.
            The pool may be active on another system, but can be imported
            using the '-f' flag.
       see: http://www.sun.com/msg/ZFS-8000-72
    config:

            tank    FAULTED   corrupted data
              c7d0  ONLINE

This happened after I foolishly began trusting zfs-fuse with some large but relatively unimportant data on a big, empty single disk zpool in my home machine and then suffered a power cut before I got around to backing it up. OpenSolaris can't import the pool either, so the drive is sat on a shelf waiting till a method for fixing it is published. While it's clearly my own fault for taking the risks I did, it's still pretty frustrating knowing that all my data is likely still intact and nicely checksummed on the disk but that none of it is accessible due to some tiny filesystem inconsistency. With pretty much any other FS I think I could get most of it back. Clearly such a small number of occurrences in what were admittedly precarious configurations aren't going to be particularly convincing motivators to provide a general solution, but I'd feel a whole lot better about using ZFS if I knew that there were some documented steps or a tool (zfsck? ;) that could help to recover from this kind of metadata corruption in the unlikely event of it happening. cheers, Rob -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 4:53 AM, . [EMAIL PROTECTED] wrote: While it's clearly my own fault for taking the risks I did, it's still pretty frustrating knowing that all my data is likely still intact and nicely checksummed on the disk but that none of it is accessible due to some tiny filesystem inconsistency. With pretty much any other FS I think I could get most of it back. Clearly such a small number of occurrences in what were admittedly precarious configurations aren't going to be particularly convincing motivators to provide a general solution, but I'd feel a whole lot better about using ZFS if I knew that there were some documented steps or a tool (zfsck? ;) that could help to recover from this kind of metadata corruption in the unlikely event of it happening. Well said. You have hit on my #1 concern with deploying ZFS. FWIW, I believe that I have hit the same type of bug as the OP in the following combinations: - T2000, LDoms 1.0, various builds of Nevada in control and guest domains. - Laptop, VirtualBox 1.6.2, Windows XP SP2 host, OpenSolaris 2008.05 @ build 97 guest. In the past year I've lost more ZFS file systems than I have any other type of file system in the past 5 years. With other file systems I can almost always get some data back. With ZFS I can't get any back. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 09, 2008 at 06:37:23AM -0500, Mike Gerdts wrote: FWIW, I believe that I have hit the same type of bug as the OP in the following combinations: - T2000, LDoms 1.0, various builds of Nevada in control and guest domains. - Laptop, VirtualBox 1.6.2, Windows XP SP2 host, OpenSolaris 2008.05 @ build 97 guest. In the past year I've lost more ZFS file systems than I have any other type of file system in the past 5 years. With other file systems I can almost always get some data back. With ZFS I can't get any back. That's scary to hear! -aW IMPORTANT: This email remains the property of the Australian Defence Organisation and is subject to the jurisdiction of section 70 of the CRIMES ACT 1914. If you have received this email in error, you are requested to contact the sender and delete the email. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 7:44 AM, Ahmed Kamal [EMAIL PROTECTED] wrote: In the past year I've lost more ZFS file systems than I have any other type of file system in the past 5 years. With other file systems I can almost always get some data back. With ZFS I can't get any back. That's scary to hear! I am really scared now! I was the one trying to quantify ZFS reliability, and that is surely bad to hear! The circumstances where I have lost data have been when ZFS has not handled a layer of redundancy. However, I am not terribly optimistic of the prospects of ZFS on any device that hasn't committed writes that ZFS thinks are committed. Mirrors and raidz would also be vulnerable to such failures. I also have run into other failures that have gone unanswered on the lists. It makes me wary about using zfs without a support contract that allows me to escalate to engineering. Patch-only support won't help. http://mail.opensolaris.org/pipermail/zfs-discuss/2007-December/044984.html Hang only after I mirrored the zpool, no response on the list http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048255.html I think this is fixed around snv_98, but the zfs-discuss list was surprisingly silent on acknowledging it as a problem - I had no idea that it was being worked on until I saw the commit. The panic seemed to be caused by dtrace - core developers of dtrace were quite interested in the kernel crash dump. http://mail.opensolaris.org/pipermail/zfs-discuss/2008-September/051109.html Panic during ON build. Pool was lost, no response from list. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Unfortunately I can only agree with the doubts about running ZFS in production environments. I've lost ditto blocks, I've gotten corrupted pools and a bunch of other failures, even in mirror/raidz/raidz2 setups with or without hardware mirrors/raid5/6. Plus the insecurity that a sudden crash/reboot will corrupt or even destroy the pools, with restore from backup as the only advice. I've been lucky so far about getting my pools back, thanks to people like Victor. What would be needed is a proper fsck for ZFS which can resolve minor data corruptions; tools for rebuilding, resizing and moving the data about on pools are also needed, even recovery of data from faulted pools, like there is for ext2/3/ufs/ntfs. All in all, a great FS, but not production ready until the tools are in place or it gets really really resilient to minor failures and/or crashes in both software and hardware. For now I'll stick to XFS/UFS and sw/hw-raid and live with the restrictions of such fs. //T 2008/10/9 Mike Gerdts [EMAIL PROTECTED]: On Thu, Oct 9, 2008 at 7:44 AM, Ahmed Kamal [EMAIL PROTECTED] wrote: In the past year I've lost more ZFS file systems than I have any other type of file system in the past 5 years. With other file systems I can almost always get some data back. With ZFS I can't get any back. That's scary to hear! I am really scared now! I was the one trying to quantify ZFS reliability, and that is surely bad to hear! The circumstances where I have lost data have been when ZFS has not handled a layer of redundancy. However, I am not terribly optimistic of the prospects of ZFS on any device that hasn't committed writes that ZFS thinks are committed. Mirrors and raidz would also be vulnerable to such failures. I also have run into other failures that have gone unanswered on the lists. It makes me wary about using zfs without a support contract that allows me to escalate to engineering. Patch-only support won't help. http://mail.opensolaris.org/pipermail/zfs-discuss/2007-December/044984.html Hang only after I mirrored the zpool, no response on the list http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048255.html I think this is fixed around snv_98, but the zfs-discuss list was surprisingly silent on acknowledging it as a problem - I had no idea that it was being worked on until I saw the commit. The panic seemed to be caused by dtrace - core developers of dtrace were quite interested in the kernel crash dump. http://mail.opensolaris.org/pipermail/zfs-discuss/2008-September/051109.html Panic during ON build. Pool was lost, no response from list. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss -- Timh Bergström System Administrator Diino AB - www.diino.com :wq ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Perhaps I mis-understand, but the below issues are all based on Nevada, not Solaris 10. Nevada isn't production code. For real ZFS testing, you must use a production release, currently Solaris 10 (update 5, soon to be update 6). In the last 2 years, I've stored everything in my environment (home directory, builds, etc.) on ZFS on multiple types of storage subsystems without issues. All of this has been on Solaris 10, however. Btw, I completely agree on the panic issue. If I have a large DB server with many pools, and one inconsequential pool fails, I lose the entire DB server. I'd really like to see an option at the zpool level directing what to do in a panic for a particular pool. Perhaps this is in the latest bits; if so, sorry, I'm running old stuff. :-) I also run ZFS on my mac. While not production quality, some of the panic errors dealing with external drives (firewire, usb, esata) are very irritating. A hiccup due to a jostled cable, and the entire box panics. That's frustrating. Timh Bergström wrote: Unfortunately I can only agree with the doubts about running ZFS in production environments. I've lost ditto blocks, I've gotten corrupted pools and a bunch of other failures, even in mirror/raidz/raidz2 setups with or without hardware mirrors/raid5/6. Plus the insecurity that a sudden crash/reboot will corrupt or even destroy the pools, with restore from backup as the only advice. I've been lucky so far about getting my pools back, thanks to people like Victor. What would be needed is a proper fsck for ZFS which can resolve minor data corruptions; tools for rebuilding, resizing and moving the data about on pools are also needed, even recovery of data from faulted pools, like there is for ext2/3/ufs/ntfs. All in all, a great FS, but not production ready until the tools are in place or it gets really really resilient to minor failures and/or crashes in both software and hardware. For now I'll stick to XFS/UFS and sw/hw-raid and live with the restrictions of such fs. //T 2008/10/9 Mike Gerdts [EMAIL PROTECTED]: On Thu, Oct 9, 2008 at 7:44 AM, Ahmed Kamal [EMAIL PROTECTED] wrote: In the past year I've lost more ZFS file systems than I have any other type of file system in the past 5 years. With other file systems I can almost always get some data back. With ZFS I can't get any back. That's scary to hear! I am really scared now! I was the one trying to quantify ZFS reliability, and that is surely bad to hear! The circumstances where I have lost data have been when ZFS has not handled a layer of redundancy. However, I am not terribly optimistic of the prospects of ZFS on any device that hasn't committed writes that ZFS thinks are committed. Mirrors and raidz would also be vulnerable to such failures. I also have run into other failures that have gone unanswered on the lists. It makes me wary about using zfs without a support contract that allows me to escalate to engineering. Patch-only support won't help. http://mail.opensolaris.org/pipermail/zfs-discuss/2007-December/044984.html Hang only after I mirrored the zpool, no response on the list http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048255.html I think this is fixed around snv_98, but the zfs-discuss list was surprisingly silent on acknowledging it as a problem - I had no idea that it was being worked on until I saw the commit. The panic seemed to be caused by dtrace - core developers of dtrace were quite interested in the kernel crash dump. http://mail.opensolaris.org/pipermail/zfs-discuss/2008-September/051109.html Panic during ON build.
Pool was lost, no response from list. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
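For what it's worth, later bits do grow a per-pool knob along these lines: the failmode pool property (assuming a build and pool version recent enough to have it), which controls what happens when a pool suffers catastrophic failure:

    # Return EIO instead of panicking or hanging when the pool loses
    # its devices; other values are wait (the default) and panic:
    zpool set failmode=continue tank

    # Check the current setting:
    zpool get failmode tank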
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw [EMAIL PROTECTED] wrote: Nevada isn't production code. For real ZFS testing, you must use a production release, currently Solaris 10 (update 5, soon to be update 6). I misstated before in my LDoms case. The corrupted pool was on Solaris 10, with LDoms 1.0. The control domain was SX*E, but the zpool there showed no problems. I got into a panic loop with dangling dbufs. My understanding is that this was caused by a bug in the LDoms manager 1.0 code that has been fixed in a later release. It was a supported configuration, I pushed for and got a fix. However, that pool was still lost. -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
gs == Greg Shaw [EMAIL PROTECTED] writes: gs Nevada isn't production code. For real ZFS testing, you must gs use a production release, currently Solaris 10 (update 5, soon gs to be update 6). based on list feedback, my impression is that the results of a ``test'' confined to s10, particularly s10u4 (the latest available during most of Mike's experience), would be worse than Nevada experience over the same period. but I doubt either matches UFS+SVM or ext3+LVM2. The on-disk format with ``ditto blocks'' and ``always consistent'' may be fantastic, but the code for reading it is not. Maybe the code is stellar, and the problem really is underlying storage stacks that fail to respect write barriers. If so, ZFS needs to include a storage stack qualification tool. For me it doesn't strain credibility to believe these problems might be rampant in VM stacks and SANs, nor do I find it unacceptable if ZFS is vastly more sensitive to them than any other filesystem. If this speculation turns out to really be the case, I imagine the two going together: the problems are rampant because they don't bother other filesystems too catastrophically. If this is really the situation, then ZFS needs to give the sysadmin a way to isolate and fix the problems deterministically before filling the pool with data, not just blame the sysadmin based on nebulous speculatory hindsight gremlins. And if it's NOT the case, the ZFS problems need to be acknowledged and fixed. To my view, the above is *IN ADDITION* to developing a recovery/forensic/``fsck'' tool, not either/or. The pools should not be getting corrupt in the first place, and pulling the cord should not mean you have to settle for best-effort. None of the modern filesystems demand an fsck after unclean shutdown. The current procedure for qualifying a platform seems to be: (1) subject it to heavy write activity, (2) pull the cord, (3) repeat. Ahmed, maybe you should use that test to ``quantify'' filesystem reliability. You can try it with ZFS, then reinstall the machine with CentOS and try the same test with ext3+LVM2 or xfs+areca. The numbers you get are how many times you can pull the cord before you lose something, and how much you lose. Here's a really old test of that sort comparing Linux filesystems which is something like what I have in mind: https://www.redhat.com/archives/fedora-list/2004-July/msg00418.html so you see he got two sets of numbers---number of reboots and amount of corruption. For reiserfs and JFS he lost their equivalent of ``the whole pool'', and for ext3 and XFS he got corruption but never lost the pool. It's not clear to me the filesystems ever claimed to prevent corruption in his test scenario (was he calling fsync() after each log write? syslog does that sometimes, and if so, they do claim it, but if he's just writing with some silly script they don't), but definitely they do all claim you won't lose the whole pool in a power outage, and only two out of four delivered on that. I base my choice of Linux filesystem on this test, and wish I'd done such a test before converting things to ZFS. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
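If anyone wants to run that experiment, a minimal harness might look like the sketch below (assumptions: the pool under test is mounted at /tank, the path is made up, and power is cut while it runs; after reboot, the highest sequence number in the file is the last write the storage stack actually kept of those it acknowledged):

    /*
     * Pull-the-plug harness sketch: append fsync()ed, sequence-
     * numbered records until the power dies.  Any gap between the
     * last number printed to the console and the last number found
     * in the file after reboot is acknowledged-but-lost data.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
            int fd = open("/tank/crashtest.log",
                O_WRONLY | O_CREAT | O_APPEND, 0644);
            if (fd < 0) {
                    perror("open");
                    return (1);
            }
            for (unsigned long seq = 0; ; seq++) {
                    char rec[64];
                    int len = snprintf(rec, sizeof (rec), "%lu\n", seq);
                    if (write(fd, rec, len) != len || fsync(fd) != 0) {
                            perror("write/fsync");
                            return (1);
                    }
                    if (seq % 1000 == 0)    /* show progress on console */
                            fprintf(stderr, "acked %lu\n", seq);
            }
    }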
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, 9 Oct 2008, Miles Nordin wrote: catastrophically. If this is really the situation, then ZFS needs to give the sysadmin a way to isolate and fix the problems deterministically before filling the pool with data, not just blame the sysadmin based on nebulous speculative hindsight gremlins. And if it's NOT the case, the ZFS problems need to be acknowledged and fixed. Can you provide any supportive evidence that ZFS is as fragile as you describe? From the recent opinions expressed here, properly-designed ZFS pools must be inexplicably and permanently cratering each and every day. Bob -- Bob Friesenhahn [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 10:18 AM, Mike Gerdts [EMAIL PROTECTED] wrote: On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw [EMAIL PROTECTED] wrote: Nevada isn't production code. For real ZFS testing, you must use a production release, currently Solaris 10 (update 5, soon to be update 6). I misstated before in my LDoms case. The corrupted pool was on Solaris 10, with LDoms 1.0. The control domain was SX*E, but the zpool there showed no problems. I got into a panic loop with dangling dbufs. My understanding is that this was caused by a bug in the LDoms manager 1.0 code that has been fixed in a later release. It was a supported configuration, I pushed for and got a fix. However, that pool was still lost. Or maybe it wasn't fixed yet. I see that this was committed just today. 6684721 file backed virtual i/o should be synchronous http://hg.genunix.org/onnv-gate.hg/rev/eb40ff0c92ec -- Mike Gerdts http://mgerdts.blogspot.com/ ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
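To make the nature of that fix concrete: a backend that services a guest's write with a plain buffered write() to its backing file tells the guest the data is safe while the data may still be sitting in the service domain's page cache. Opening the backing file with O_SYNC (or calling fsync() before acknowledging) closes that window. A rough illustration of the difference follows; it assumes nothing about the actual LDoms code, and the file name is invented.

import os

data = b'x' * 512

# Unsafe pattern: write() returns as soon as the data is in the host's
# page cache, so a guest that gets an ack at this point can still lose
# the block if the host fails before writeback.
fd = os.open('backend.img', os.O_WRONLY | os.O_CREAT, 0o600)
os.write(fd, data)
os.close(fd)

# Safe pattern: with POSIX O_SYNC, each write() returns only after the
# data (and the metadata needed to retrieve it) reaches stable storage.
fd = os.open('backend.img', os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o600)
os.write(fd, data)
os.close(fd)

The cost of the safe pattern is exactly the performance impact the release notes warn about: every acknowledged write waits for the disk instead of the cache.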
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Fajar A. Nugraha wrote: On Fri, Oct 3, 2008 at 10:37 PM, Vasile Dumitrescu [EMAIL PROTECTED] wrote: VMWare 6.0.4 running on Debian unstable, Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 GNU/Linux Solaris is vanilla snv_90 installed with no GUI. in summary: physical disks, assigned 100% to the VM That's weird. I thought one of the points of using physical disks instead of files was to avoid problems caused by caching on the host/dom0? The data still flows through the host/dom0 device drivers, and is thus at the mercy of the commands they issue to the physical devices. -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Fri, Oct 3, 2008 at 10:37 PM, Vasile Dumitrescu [EMAIL PROTECTED] wrote: VMWare 6.0.4 running on Debian unstable, Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 GNU/Linux Solaris is vanilla snv_90 installed with no GUI. in summary: physical disks, assigned 100% to the VM That's weird. I thought one of the points of using physical disks instead of files was to avoid problems caused by caching on the host/dom0? ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Hi folks, I just wanted to share the end of my adventure here and especially take the time to thank Victor for helping me out of this mess. I will let him explain the technical details (I am out of my depth here), but bottom line: he spent a couple of hours with me on the machine and sorted me out. His explanation: he invalidated the incorrect uberblocks and forced ZFS to revert to an earlier state that was consistent. The machine is now in the process of doing a full scrub, and the first order of business tomorrow will be to do a full backup :-) According to his explanation, the reason for the trouble I had was that Solaris was running in a VM on my Debian server and was not shut down properly when the Debian server did a controlled shutdown following a UPS event. The Solaris machine was abruptly shut down, and because it was not in control of the entire chain down to the bare hardware, it appears that some writes were in fact still buffered by Debian when Solaris believed them safely executed. This left the zpool in question in a state that even raidz1 could not help with. Anyway, again, lots and lots of thanks to Victor!!! kind regards Vasile -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
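A note for the curious on what ``invalidating the incorrect uberblocks'' leverages: each vdev label keeps a ring of recent uberblocks, one per committed txg, so a pool whose newest uberblock points at blocks that never reached the disk can often be brought back by activating an older one. Newer zdb builds can dump the ring (zdb -l prints labels; combining it with -u shows the uberblocks). The sketch below is my own, for inspection only, not repair; it scans the first label of a vdev device or backing file and prints the txgs it finds, following the on-disk layout as I understand it: a 256 KB label with the uberblock ring in its upper half, in 1 KB slots on 512-byte-sector disks.

import struct, sys, datetime

UB_MAGIC = 0x00bab10c          # uberblock magic ("oo-ba-bloc")
LABEL_SIZE = 256 * 1024        # each of the four vdev labels is 256 KB
UB_RING_OFFSET = 128 * 1024    # the ring fills the upper half of the label
UB_SLOT = 1024                 # slot size for ashift=9; larger on 4K-sector disks

def scan(path):
    # Label 0 sits at the very start of the vdev device or file.
    with open(path, 'rb') as dev:
        dev.seek(UB_RING_OFFSET)
        ring = dev.read(LABEL_SIZE - UB_RING_OFFSET)
    found = []
    for slot in range(len(ring) // UB_SLOT):
        head = ring[slot * UB_SLOT : slot * UB_SLOT + 40]
        # ub_magic, ub_version, ub_txg, ub_guid_sum, ub_timestamp are the
        # first five uint64s; the magic is written in the pool's native
        # byte order, so try both endiannesses.
        for endian in ('<', '>'):
            magic, version, txg, guid_sum, ts = struct.unpack(endian + '5Q', head)
            if magic == UB_MAGIC:
                found.append((txg, ts))
                break
    for txg, ts in sorted(found, reverse=True):
        print('txg %d  written %s' % (txg, datetime.datetime.utcfromtimestamp(ts)))

if __name__ == '__main__':
    scan(sys.argv[1])

The newest txg whose entire tree of dependent blocks is intact is the candidate to roll back to; deciding which one that is remains the judgment call a recovery tool (or Victor) has to make.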
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Vasile Dumitrescu wrote: Hi folks, I just wanted to share the end of my adventure here and especially take the time to thank Victor for helping me out of this mess. I will let him explain the technical details (I am out of my depth here), but bottom line: he spent a couple of hours with me on the machine and sorted me out. His explanation: he invalidated the incorrect uberblocks and forced ZFS to revert to an earlier state that was consistent. The machine is now in the process of doing a full scrub, and the first order of business tomorrow will be to do a full backup :-) According to his explanation, the reason for the trouble I had was that Solaris was running in a VM on my Debian server and was not shut down properly when the Debian server did a controlled shutdown following a UPS event. Which VM solution was this? VMware, VirtualBox, Xen, other? How were the disks presented to the guest? What are the disks in the host: real disks, files, something else? -- Darren J Moffat ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Which VM solution was this? VMware, VirtualBox, Xen, other? How were the disks presented to the guest? What are the disks in the host: real disks, files, something else? -- Darren J Moffat

VMWare 6.0.4 running on Debian unstable, Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 GNU/Linux. Solaris is vanilla snv_90 installed with no GUI. Here is the content of the .vmx file in question:

#!/usr/bin/vmware
config.version = 8
virtualHW.version = 6
scsi0.present = TRUE
scsi0.virtualDev = lsilogic
memsize = 4096
MemAllowAutoScaleDown = FALSE
MemTrimRate = 0
sched.mem.pshare.enable = FALSE
sched.mem.minsize = 3062
sched.mem.max = 7000
sched.mem.maxmemctl = 0
sched.mem.shares = 10
scsi0:0.present = TRUE
scsi0:0.fileName = /home/vasile/vmware/solsrv/OpenSolaris64.vmdk
ide1:0.present = TRUE
ide1:0.autodetect = TRUE
ide1:0.deviceType = cdrom-image
floppy0.startConnected = FALSE
floppy0.autodetect = TRUE
ethernet0.present = TRUE
ethernet0.virtualDev = e1000
ethernet0.wakeOnPcktRcv = TRUE
sound.present = FALSE
sound.fileName = -1
sound.autodetect = TRUE
svga.autodetect = FALSE
pciBridge0.present = TRUE
displayName = zfssrv
guestOS = solaris10-64
nvram = Solaris 10 64-bit.nvram
deploymentPlatform = windows
virtualHW.productCompatibility = hosted
RemoteDisplay.vnc.port = 0
tools.upgrade.policy = useGlobal
floppy0.fileName = /dev/fd0
extendedConfigFile = Solaris 10 64-bit.vmxf
ide1:0.fileName =
floppy0.present = FALSE
gui.powerOnAtStartup = TRUE
ide1:0.startConnected = TRUE
ethernet0.addressType = generated
uuid.location = 56 4d da 02 a4 a0 78 74-2e 09 90 62 45 bb c4 94
uuid.bios = 56 4d da 02 a4 a0 78 74-2e 09 90 62 45 bb c4 94
scsi0:0.redo =
pciBridge0.pciSlotNumber = 17
scsi0.pciSlotNumber = 16
ethernet0.pciSlotNumber = 32
sound.pciSlotNumber = -1
ethernet0.generatedAddress = 00:0c:29:bb:c4:94
ethernet0.generatedAddressOffset = 0
tools.syncTime = FALSE
svga.maxWidth = 1024
svga.maxHeight = 768
svga.vramSize = 3145728
scsi0:1.present = TRUE
scsi0:1.fileName = ztank-sda.vmdk
scsi0:1.mode = independent-persistent
scsi0:1.deviceType = rawDisk
scsi0:2.present = TRUE
scsi0:2.fileName = ztank-sdb.vmdk
scsi0:2.mode = independent-persistent
scsi0:2.deviceType = rawDisk
scsi0:3.present = TRUE
scsi0:3.fileName = ztank-sdc.vmdk
scsi0:3.mode = independent-persistent
scsi0:3.deviceType = rawDisk
scsi0:4.present = TRUE
scsi0:4.fileName = ztank-sdd.vmdk
scsi0:4.mode = independent-persistent
scsi0:4.deviceType = rawDisk
scsi0:5.present = TRUE
scsi0:5.fileName = ztank-sde.vmdk
scsi0:5.mode = independent-persistent
scsi0:5.deviceType = rawDisk
scsi0:6.present = TRUE
scsi0:6.fileName = ztank-sdf.vmdk
scsi0:6.mode = independent-persistent
scsi0:6.deviceType = rawDisk
scsi0:1.redo =
scsi0:2.redo =
scsi0:3.redo =
scsi0:4.redo =
scsi0:5.redo =
scsi0:6.redo =
isolation.tools.dnd.disable = TRUE
snapshot.disabled = TRUE
scsi0:0.mode = independent-persistent
isolation.tools.copy.disable = FALSE
isolation.tools.paste.disable = FALSE
tools.remindInstall = TRUE

in summary: physical disks, assigned 100% to the VM

HTH kind regards Vasile -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
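As an aside, a configuration like the one above makes it easy to check mechanically which guest devices are raw disks whose durability depends on the host's driver stack honoring cache flushes. A throwaway sketch, assuming only the flat key = value layout shown; the file name argument is just an example:

import re, sys
from collections import defaultdict

devices = defaultdict(dict)
pattern = re.compile(r'^(scsi\d+:\d+)\.(\w+)\s*=\s*(.*)$')

with open(sys.argv[1]) as vmx:        # e.g. zfssrv.vmx
    for line in vmx:
        m = pattern.match(line.strip())
        if m:
            dev, key, value = m.groups()
            devices[dev][key] = value.strip('"')  # vmx values are often quoted

# flag every raw-disk entry and the mode it is attached with
for dev, keys in sorted(devices.items()):
    if keys.get('deviceType') == 'rawDisk':
        print('%s: %s (mode=%s)' % (dev, keys.get('fileName'), keys.get('mode')))

For this configuration it would list scsi0:1 through scsi0:6, the six ztank disks, all in independent-persistent mode, which is consistent with the point made earlier in the thread: raw or not, those writes still pass through the Linux host's drivers on their way to the hardware.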