Re: [zfs-discuss] ZFS, power failures, and UPSes (and ZFS recovery guide links)
Ian Collins wrote:
> David Magda wrote:
>> On Jun 30, 2009, at 14:08, Bob Friesenhahn wrote:
>>> I have seen UPSes help quite a lot for short glitches lasting seconds, or a minute. Otherwise the outage is usually longer than the UPSes can stay up, since the problem requires human attention. A standby generator is needed for any long outage.
>> Can't remember where I read the claim, but supposedly if power isn't restored within about ten minutes, then it will probably be out for a few hours. If this 'statistic' is true, it would mean that your UPS should last (say) fifteen minutes, and after that you really need a generator.
> Or run your systems off DC and get as much backup as you have room (and budget!) for batteries. I once visited a central exchange with 48 hours of battery capacity...

The way Google handles UPSes is to have a small 12V battery integrated with each PC power supply. When the machine is on, the battery has its charge maintained. It is not unlike a laptop in that it has a built-in battery backup, but it uses an inexpensive sealed lead-acid battery instead of lithium-ion. Here is info along with photos of the Google server internals:

http://news.cnet.com/8301-1001_3-10209580-92.html
http://willysr.blogspot.com/2009/04/googles-server-design.html

(IIRC there have been power supplies with built-in UPS batteries since at least the late 1980s. Either that, or they were UPSes that fit inside the standard PC/AT-compatible desktop case, making the power protection system entirely internal to the computer. I think I saw these models while browsing late-1980s or early-1990s issues of PC Magazine that reviewed UPSes. They still exist; one company selling them is http://www.globtek.com/html/ups.html . A Google search for 'power supply built in UPS' would likely find more.)

I also did additional searches in the zfs-discuss archives and found a thread from mid-February, which led me to other threads. It looks like there are still scattered instances where ZFS has not recovered gracefully from power failures or other failures, and where it became necessary to perform a manual transaction group (txg) rollback. Here is a consolidated list of links related to manual uberblock transaction group (txg) rollback and similar ZFS data recovery guides, including undeleting:

Section 1: Nathan Hand's guide and related thread

Nathan Hand's guide to invalidating uberblocks (Dec 2008 thread)
http://www.opensolaris.org/jive/thread.jspa?threadID=85794 or
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg22153.html

Section 2: Victor Latushkin's guide and related threads

Thread: zpool unimportable (corrupt zpool metadata??) but no zdb -l device problems (Oct 2008 to Feb 2009 thread)
http://www.opensolaris.org/jive/thread.jspa?threadID=76960 or
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg19839.html

Repair report: Re: Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
http://www.opensolaris.org/jive/message.jspa?messageID=289537#289537

Some recovery discussion by Victor: zdb -bv alone took several hours to walk the block tree
http://www.opensolaris.org/jive/message.jspa?messageID=292991#292991 or
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/022365.html or
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg20095.html

Victor Latushkin's guide: thanks to the COW nature of ZFS, it was possible to successfully recover a pool state that was only 5 seconds older than the last, unopenable one.
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/022331.html or
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg20061.html

Section 3: reliability debates, recovery tool planning, uberblock info

Thread: Availability: ZFS needs to handle disk removal / driver failure better (August 2008 thread)
http://www.opensolaris.org/jive/thread.jspa?threadID=70811 or
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg19057.html

Thread: ZFS: unreliable for professional usage? (Feb 2009 thread)
http://www.opensolaris.org/jive/thread.jspa?threadID=91426 or
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg23833.html

Richard Elling's post explaining that uberblocks are kept in a 128-entry circular queue which is 4x redundant, with 2 copies each at the beginning and end of the vdev. Other metadata, by default, is 2x redundant and spatially diverse.
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg24145.html

Jeff Bonwick's post about Bug ID 6667683
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg23961.html

Bug ID 6667683: need a way to rollback to an uberblock from a previous txg
Description: If we are unable to open the pool based on the most recent uberblock, then it might be useful to try an older txg uberblock, as it might provide a better view of the world. Having a utility to reset the uberblock to a previous txg might provide a nice recovery mechanism.
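Since most of the guides above start by inspecting the labels and uberblocks before attempting any rollback, here is a minimal, read-only sketch of the kind of inspection commands involved. The device path and pool name are placeholders, and exact zdb options vary between builds, so treat this as an illustration rather than a recovery procedure:

  # Dump the four vdev labels (each holds an uberblock array) from one disk.
  zdb -l /dev/rdsk/c2t8d0s0

  # Show the active uberblock (txg, timestamp, root block pointer) of a pool.
  zdb -u tank

  # For a pool that cannot be imported, -e examines it in its exported state.
  zdb -e -u tank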
Re: [zfs-discuss] Q: zfs log device
On Tue, 30 Jun 2009, John Hoogerdijk wrote:
> I've set up a RAIDZ2 pool with 5 SATA drives and added a 32GB SSD log device. To see how well it works, I ran bonnie++, but never saw any I/Os on the log device (using iostat -nxce). Pool status is good - no issues or errors. Any ideas?

Try using direct I/O (the -D flag) in bonnie++. You'll need at least version 1.03e.

Regards,
markm
Re: [zfs-discuss] ZFS tale of woe and fail
On 19.01.09 12:09, Tom Bird wrote:
> Toby Thain wrote:
>> On 18-Jan-09, at 6:12 PM, Nathan Kroenert wrote:
>>> Hey, Tom - Correct me if I'm wrong here, but it seems you are not allowing ZFS any sort of redundancy to manage.
>
> Every other file system out there runs fine on a single LUN; when things go wrong you have an fsck utility that patches it up and the world keeps on turning. I can't find anywhere that will sell me a 48-drive SATA JBOD with all the drives presented on a single SAS channel, so running on a single giant LUN is a real-world scenario that ZFS should be able to cope with, as this is how the hardware I am stuck with is arranged.
>
>> Which is particularly catastrophic when one's 'content' is organized as a monolithic file, as it is here - unless, of course, you have some way of scavenging that file based on internal structure.
>
> No, it's not a monolithic file; the point I was making there is that no files are showing up.
>
> r...@cs4:~# find /content
> /content
> r...@cs4:~#
>
> (yes, that really is it)

This issue (and the previous one reported by Tom) has got some publicity recently - see here:
http://www.uknof.org.uk/uknof13/Bird-Redux.pdf

So I feel like I need to provide a little bit more information about the outcome (sorry that it is delayed and not as full as the previous one).

First, it looked like this:

r...@cs4:~# zpool list
NAME      SIZE   USED   AVAIL  CAP  HEALTH  ALTROOT
content  62.5T  59.9T   2.63T  95%  ONLINE  -

r...@cs4:~# zpool status -v
  pool: content
 state: ONLINE
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME      STATE   READ WRITE CKSUM
        content   ONLINE     0     0    32
          c2t8d0  ONLINE     0     0    32

errors: Permanent errors have been detected in the following files:

        content:<0x0>
        content:<0x2c898>

The first permanent error means that the root block of the filesystem named 'content' was corrupted (all copies), so it was not possible to open it and access any content of that filesystem. Fortunately enough, there was not too much activity on the pool, so we decided to try previous states of the pool. I do not remember the exact txg number we tried, but it was something like a hundred txg back or so. We checked it with zdb and discovered that that state was more or less good - at least the filesystem 'content' was openable and it was possible to access its contents - so we decided to reactivate that previous state. The pool imported fine and the contents of 'content' were there. A subsequent scrub did find some errors, but I do not remember exactly how many. Tom may have the exact number.

Victor
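For anyone following the same path, the usual way to check how much damage remains after reactivating an older pool state is a full scrub followed by a look at the error list. A minimal sketch, using the pool name from this thread purely as an example:

  # Walk and verify every allocated block in the pool against its checksums.
  zpool scrub content

  # Watch progress, then list any files or objects with permanent errors.
  zpool status -v content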
Re: [zfs-discuss] Q: zfs log device
On Wed, 1 Jul 2009, Mark J Musante wrote:
> On Tue, 30 Jun 2009, John Hoogerdijk wrote:
>> I've set up a RAIDZ2 pool with 5 SATA drives and added a 32GB SSD log device. To see how well it works, I ran bonnie++, but never saw any I/Os on the log device (using iostat -nxce). Pool status is good - no issues or errors. Any ideas?
>
> Try using direct I/O (the -D flag) in bonnie++. You'll need at least version 1.03e.

If this -D flag uses the Solaris directio() function, then it will do nothing for ZFS. It only works for UFS and NFS.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] Q: zfs log device
Mark J Musante wrote:
> On Tue, 30 Jun 2009, John Hoogerdijk wrote:
>> I've set up a RAIDZ2 pool with 5 SATA drives and added a 32GB SSD log device. To see how well it works, I ran bonnie++, but never saw any I/Os on the log device (using iostat -nxce). Pool status is good - no issues or errors. Any ideas?
>
> Try using direct I/O (the -D flag) in bonnie++. You'll need at least version 1.03e.

Or you could export the filesystem via NFS and run any file creation/write workload on an NFS client; that should generate a large amount of log activity thanks to the synchronous writes that the NFS server must issue to honour its obligations to the NFS client.

--
jason.ozol...@anu.edu.au              ANU Supercomputer Facility
Leonard Huxley Bldg 56, Mills Road    Ph:  +61 2 6125 5449
Australian National University        Fax: +61 2 6125 8199
Canberra, ACT, 0200, Australia
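To illustrate the NFS suggestion above, here is a minimal sketch of forcing ZIL traffic that way. The pool, filesystem, mount point, and host names are placeholders, and the client commands assume a Solaris NFS client:

  # On the server: share a filesystem from the pool that has the SSD log device.
  zfs set sharenfs=on tank/test

  # On the client: mount it and generate many small file creates, which the
  # NFS server must commit to stable storage synchronously.
  mount -F nfs server:/tank/test /mnt
  i=0; while [ $i -lt 1000 ]; do cp /etc/motd /mnt/f$i; i=$((i+1)); done

  # Back on the server, the log device should now show write activity.
  zpool iostat -v tank 1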
Re: [zfs-discuss] ZFS tale of woe and fail
On Jul 1, 2009, at 12:37, Victor Latushkin wrote:
> This issue (and the previous one reported by Tom) has got some publicity recently - see here:
> http://www.uknof.org.uk/uknof13/Bird-Redux.pdf

Joyent also had issues a while back:

http://tinyurl.com/ytyzs6
http://www.joyeur.com/2008/01/22/bingodisk-and-strongspace-what-happened

A lot of people billed it as a ZFS issue, but it should be noted that because of all the checksumming going on, when you get data back you can be fairly sure that it hasn't been corrupted.
Re: [zfs-discuss] Q: zfs log device
John Hoogerdijk wrote:
> So I guess there is some porting to do - no O_DIRECT in Solaris... anyone have bonnie++ 1.03e ported already?

For your purposes, couldn't you replace O_DIRECT with O_SYNC as a hack? If you're trying to benchmark the log device, the important thing is to generate synchronous writes, and the zero-copy aspect of O_DIRECT is less important, no?

--
jason.ozol...@anu.edu.au              ANU Supercomputer Facility
Leonard Huxley Bldg 56, Mills Road    Ph:  +61 2 6125 5449
Australian National University        Fax: +61 2 6125 8199
Canberra, ACT, 0200, Australia
Re: [zfs-discuss] ZFS write I/O stalls
> Note that this issue does not apply at all to NFS service, database service, or any other usage which does synchronous writes.
>
> Bob

Hello Bob,

There is an impact for all workloads. Whether a write is synchronous or not only determines whether it is also written to the slog (SSD); the txg interval and sync time are the same either way. Actually, the ZIL code exists precisely to preserve that same behaviour for synchronous writes.

Leal
[ http://www.eall.com.br/blog ]
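One way to see the txg sync cadence being discussed here, independent of whether the workload is synchronous, is simply to watch pool-level write bursts, or to timestamp each sync as it starts and finishes. This is only an illustrative sketch; the DTrace part assumes the fbt provider can see spa_sync() on your build, and the pool name is a placeholder:

  # Watch write bursts at the pool level, one-second samples.
  zpool iostat tank 1

  # Or timestamp each txg sync as it begins and ends.
  dtrace -n 'fbt::spa_sync:entry  { printf("%Y sync start", walltimestamp); }
             fbt::spa_sync:return { printf("%Y sync done",  walltimestamp); }'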
Re: [zfs-discuss] Backing up OS drive?
Hi cindys,

That recovery procedure seems overly complex. I've instead purchased a disk to mirror my root pool onto. Unfortunately, it seems that the disk is slightly smaller than my current rpool; however, I would be happy to have a mirror the same size as the smaller disk. Is there a way to mirror onto a smaller disk, or alternatively to send the root pool to the smaller disk, boot from it, and then mirror to the larger one?

Thanks,
Lydgate
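No answer appears in this digest, but for what it's worth, the second approach the poster describes (send the root pool to the smaller disk, boot from it, then attach the larger disk as a mirror) is commonly sketched roughly as follows. Device names, pool names, the snapshot name, and the boot environment path are all placeholders, and details such as boot blocks and the bootfs property differ between releases, so treat this as an outline rather than a recipe:

  # Create a new pool on the smaller disk (root pools need an SMI-labelled slice).
  zpool create -f rpool2 c1t1d0s0

  # Snapshot the existing root pool recursively and copy it over.
  zfs snapshot -r rpool@migrate
  zfs send -R rpool@migrate | zfs recv -Fd rpool2

  # Make the new pool bootable (x86 example), then boot from the new disk.
  zpool set bootfs=rpool2/ROOT/opensolaris rpool2
  installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t1d0s0

  # After booting from the new pool, attach the original (larger) disk as a
  # mirror; the mirror's usable size is limited by the smaller device.
  zpool attach rpool2 c1t1d0s0 c1t0d0s0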
Re: [zfs-discuss] rpool mirror on USB sticks
Did you ever figure this out? I'm trying to do the same thing and am also getting "new device must be a single disk".
Re: [zfs-discuss] ZFS, power failures, and UPSes
On Thu, 2 Jul 2009, Ian Collins wrote:
> 5+ is typical for telco use.

Aah, but we start getting into rooms full of giant 2V wet lead-acid cells and giant busbars the size of railway tracks.

--
Andre van Eyssen.
mail: an...@purplecow.org            jabber: an...@interact.purplecow.org
purplecow.org: UNIX for the masses   http://www2.purplecow.org
purplecow.org: PCOWpix               http://pix.purplecow.org
Re: [zfs-discuss] ZFS write I/O stalls
Actually it seems to be 3/4 of physical memory:

dsl_pool.c:
 391         zfs_write_limit_max = ptob(physmem) >> zfs_write_limit_shift;
 392         zfs_write_limit_inflated = MAX(zfs_write_limit_min,
 393             spa_get_asize(dp->dp_spa, zfs_write_limit_max));

while spa_get_asize() is:

spa_misc.c:
1249 uint64_t
1250 spa_get_asize(spa_t *spa, uint64_t lsize)
1251 {
1252         /*
1253          * For now, the worst case is 512-byte RAID-Z blocks, in which
1254          * case the space requirement is exactly 2x; so just assume that.
1255          * Add to this the fact that we can have up to 3 DVAs per bp, and
1256          * we have to multiply by a total of 6x.
1257          */
1258         return (lsize * 6);
1259 }

which will result in:

zfs_write_limit_inflated = MAX((32 << 20), (ptob(physmem) >> 3) * 6);

Bob Friesenhahn wrote:
> Even if I set zfs_write_limit_override to 8053063680 I am unable to achieve the massive writes that Solaris 10 (141415-03) sends to my drive array by default.
>
> When I read the blog entry at http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle, I see this statement:
>
>   "The new code keeps track of the amount of data accepted in a TXG and the time it takes to sync. It dynamically adjusts that amount so that each TXG sync takes about 5 seconds (txg_time variable). It also clamps the limit to no more than 1/8th of physical memory."
>
> On my system I see that the "about 5 seconds" rule is being followed, but I see no sign of clamping the limit to no more than 1/8th of physical memory. There is no sign of clamping at all. The written data is captured and does take about 5 seconds to write (good estimate).
>
> On my system with 20GB of RAM, and the ARC memory limit set to 10GB (zfs:zfs_arc_max = 0x280000000), the maximum zfs_write_limit_override value I can set is on the order of 8053063680, yet this results in a much smaller amount of data being written per write cycle than the Solaris 10 default operation. The default operation is 24 seconds of no write activity followed by 5 seconds of write.
>
> On my system, 1/8 of memory would be 2.5GB. If I set the zfs_write_limit_override value to 2684354560 then it seems that about 1.2 seconds of data is captured for write. In this case I see 5 seconds of no write followed by maybe a second of write. This causes me to believe that the algorithm is not implemented as described in Solaris 10.
>
> Bob
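For readers who want to reproduce Bob's experiment, zfs_write_limit_override is a kernel variable that is usually poked at runtime with mdb. A minimal sketch, assuming a build where the tunable exists; the value shown is Bob's 2.5GB example (2684354560 bytes = 0xa0000000):

  # Read the current 64-bit value from the running kernel.
  echo 'zfs_write_limit_override/J' | mdb -k

  # Write a new 64-bit value (takes effect immediately, not persistent).
  echo 'zfs_write_limit_override/Z 0xa0000000' | mdb -kw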
Re: [zfs-discuss] rpool mirror on USB sticks
Tertius Lydgate wrote:
> Did you ever figure this out?

Figure what out?

--
Ian.