Re: assemble vs create an array.......
On Thu, Dec 06, 2007 at 07:39:28PM +0300, Michael Tokarev wrote: What to do is to give repairfs a try for each permutation, but again without letting it actually fix anything. Just run it in read-only mode and see which combination of drives gives fewer errors, or no fatal errors (there may be several similar combinations, with the same order of drives but with a different drive missing). Ugggh. It's sad that xfs refuses mount when structure needs cleaning - the best way here is to actually mount it and see how it looks, instead of trying repair tools. It's self-protection - if you try to write to a corrupted filesystem, you'll only make the corruption worse. Mounting involves log recovery, which writes to the filesystem. Is there some option to force-mount it still (in readonly mode, knowing it may OOPs the kernel etc)? Sure you can: mount -o ro,norecovery dev mtpt But if you hit corruption it will still shut down on you. If the machine oopses then that is a bug. thread prompted me to think. If I can't force-mount it (or browse it using other ways) as I can almost always do with (somewhat?) broken ext[23] just to examine things, maybe I'm trying it before it's mature enough? ;) Hehe ;) For maximum uber-XFS-guru points, learn to browse your filesystem with xfs_db. :P Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3ware 9650 tips
On Mon, Jul 16, 2007 at 12:41:15PM +1000, David Chinner wrote: On Fri, Jul 13, 2007 at 03:36:46PM -0400, Justin Piszcz wrote: On Fri, 13 Jul 2007, Jon Collette wrote: Wouldn't Raid 6 be slower than Raid 5 because of the extra fault tolerance? http://www.enterprisenetworksandservers.com/monthly/art.php?1754 - 20% drop according to this article His 500GB WD drives are 7200RPM compared to the Raptors 10K. So his numbers will be slower. Justin what file system do you have running on the Raptors? I think that's an interesting point made by Joshua. I use XFS: When it comes to bandwidth, there is good reason for that. Trying to stick with a supported config as much as possible, I need to run ext3. As per usual, though, initial ext3 numbers are less than impressive. Using bonnie++ to get a baseline, I get (after doing 'blockdev --setra 65536' on the device): Write: 136MB/s Read: 384MB/s Proving it's not the hardware, with XFS the numbers look like: Write: 333MB/s Read: 465MB/s Those are pretty typical numbers. In my experience, ext3 is limited to about 250MB/s buffered write speed. It's not disk limited, it's design limited. e.g. on a disk subsystem where XFS was getting 4-5GB/s buffered write, ext3 was doing 250MB/s. http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf If you've got any sort of serious disk array, ext3 is not the filesystem to use. To show what the difference is, I used blktrace and Chris Mason's seekwatcher script on a simple, single threaded dd command on a 12 disk dm RAID0 stripe: # dd if=/dev/zero of=/mnt/scratch/fred bs=1024k count=10k; sync http://oss.sgi.com/~dgc/writes/ext3_write.png http://oss.sgi.com/~dgc/writes/xfs_write.png You can see from the ext3 graph that it comes to a screeching halt every 5s (probably when pdflush runs) and at all other times the seek rate is 10,000 seeks/s. That's pretty bad for a brand new, empty filesystem, and the only way it is sustained is the fact that the disks have their write caches turned on. 
ext4 will probably show better results, but I haven't got any of the tools installed to be able to test it. The XFS pattern consistently shows an order of magnitude fewer seeks and consistent throughput above 600MB/s. To put the number of seeks in context, XFS is doing 512k I/Os at about 1200-1300 per second. The number of seeks? A bit above 10^3 per second, or roughly 1 seek per I/O, which is pretty much optimal. Cheers, Dave.
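As a quick sanity check on the numbers Dave quotes above (this arithmetic sketch is mine, not from the thread), 512k I/Os at the middle of the 1200-1300/s range works out to just above the observed 600MB/s, at roughly one seek per I/O:

```python
# Back-of-the-envelope check of the XFS figures quoted above:
# ~1250 I/Os per second at 512 KiB each, with ~10^3 seeks/s.
io_size_kib = 512
iops = 1250            # middle of the quoted 1200-1300 range
seeks_per_sec = 1250   # "roughly 1 seek per I/O"

throughput_mib = io_size_kib * iops / 1024   # 625.0 MiB/s
seeks_per_io = seeks_per_sec / iops          # 1.0

print(f"throughput ~= {throughput_mib:.0f} MiB/s, {seeks_per_io:.1f} seeks per I/O")
```

Which matches the "consistent throughput above 600MB/s" seen in the seekwatcher graph.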
Re: 3ware 9650 tips
On Mon, Jul 16, 2007 at 10:50:34AM -0500, Eric Sandeen wrote: David Chinner wrote: On Mon, Jul 16, 2007 at 12:41:15PM +1000, David Chinner wrote: On Fri, Jul 13, 2007 at 03:36:46PM -0400, Justin Piszcz wrote: ... If you've got any sort of serious disk array, ext3 is not the filesystem to use. To show what the difference is, I used blktrace and Chris Mason's seekwatcher script on a simple, single threaded dd command on a 12 disk dm RAID0 stripe: # dd if=/dev/zero of=/mnt/scratch/fred bs=1024k count=10k; sync http://oss.sgi.com/~dgc/writes/ext3_write.png http://oss.sgi.com/~dgc/writes/xfs_write.png Were those all with default mkfs mount options? ext3 in writeback mode might be an interesting comparison too. Defaults. i.e. # mkfs.ext3 /dev/mapper/dm0 # mkfs.xfs /dev/mapper/dm0 The mkfs.xfs picked up sunit/swidth correctly from the dm volume. Last time I checked, writeback made little difference to ext3 throughput; maybe 5-10% at most. I'll run it again later today... Cheers, Dave.
Re: 3ware 9650 tips
On Fri, Jul 13, 2007 at 03:36:46PM -0400, Justin Piszcz wrote: On Fri, 13 Jul 2007, Jon Collette wrote: Wouldn't Raid 6 be slower than Raid 5 because of the extra fault tolerance? http://www.enterprisenetworksandservers.com/monthly/art.php?1754 - 20% drop according to this article His 500GB WD drives are 7200RPM compared to the Raptors 10K. So his numbers will be slower. Justin what file system do you have running on the Raptors? I think that's an interesting point made by Joshua. I use XFS: When it comes to bandwidth, there is good reason for that. Trying to stick with a supported config as much as possible, I need to run ext3. As per usual, though, initial ext3 numbers are less than impressive. Using bonnie++ to get a baseline, I get (after doing 'blockdev --setra 65536' on the device): Write: 136MB/s Read: 384MB/s Proving it's not the hardware, with XFS the numbers look like: Write: 333MB/s Read: 465MB/s Those are pretty typical numbers. In my experience, ext3 is limited to about 250MB/s buffered write speed. It's not disk limited, it's design limited. e.g. on a disk subsystem where XFS was getting 4-5GB/s buffered write, ext3 was doing 250MB/s. http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf If you've got any sort of serious disk array, ext3 is not the filesystem to use. How many folks are using these? Any tuning tips? Make sure you tell XFS the correct sunit/swidth. For hardware raid5/6, sunit = per-disk chunksize, swidth = number of *data* disks in array. Cheers, Dave.
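Dave's rule of thumb can be captured in a few lines. This is an illustrative sketch (the helper name and example geometry are mine, not from the thread): for a hardware RAID5/6 array, `su` is the per-disk chunk size and `sw` is the number of *data* disks, i.e. total disks minus parity disks.

```python
def mkfs_xfs_geometry(chunk_kib, total_disks, parity_disks):
    """Suggest mkfs.xfs stripe options for a hardware RAID array.

    sunit (su) = per-disk chunk size; swidth (sw) = number of *data*
    disks, i.e. total disks minus parity (1 for RAID5, 2 for RAID6).
    """
    data_disks = total_disks - parity_disks
    return f"-d su={chunk_kib}k,sw={data_disks}"

# e.g. an 8-disk RAID6 with a 64k chunk (hypothetical array):
print("mkfs.xfs", mkfs_xfs_geometry(64, 8, 2), "/dev/sdX")
# -> mkfs.xfs -d su=64k,sw=6 /dev/sdX
```

For md RAID (as in the `dm0` example above) mkfs.xfs picks this up automatically; the manual options matter mainly for hardware RAID, where the controller hides the geometry.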
Re: Fastest Chunk Size w/XFS For MD Software RAID = 1024k
On Thu, Jun 28, 2007 at 04:27:15AM -0400, Justin Piszcz wrote: On Thu, 28 Jun 2007, Peter Rabbitson wrote: Justin Piszcz wrote: mdadm --create \ --verbose /dev/md3 \ --level=5 \ --raid-devices=10 \ --chunk=1024 \ --force \ --run /dev/sd[cdefghijkl]1 Justin. Interesting, I came up with the same results (1M chunk being superior) with a completely different raid set with XFS on top: mdadm --create \ --level=10 \ --chunk=1024 \ --raid-devices=4 \ --layout=f3 \ ... Could it be attributed to XFS itself? More likely it's related to the I/O size being sent to the disks. The larger the chunk size, the larger the I/O hitting each disk. I think the maximum I/O size is 512k ATM on x86(_64), so a chunk of 1MB will guarantee that there are maximally sized I/Os being sent to the disk. Cheers, Dave.
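Dave's reasoning can be sketched numerically (the 512k cap is as stated in the thread; the helper itself is my illustration): with the block layer splitting requests at 512k, a 1MB chunk decomposes into nothing but maximally sized requests per disk, whereas smaller chunks can leave smaller requests on the wire.

```python
def requests_per_chunk(chunk_kib, max_io_kib=512):
    """How one chunk-sized write splits into block-layer requests,
    assuming requests are capped at max_io_kib (512k per the thread)."""
    full, tail = divmod(chunk_kib, max_io_kib)
    return [max_io_kib] * full + ([tail] if tail else [])

# A 1024k chunk becomes two maximally sized 512k requests per disk,
# while a 256k chunk yields a single smaller request:
print(requests_per_chunk(1024))  # [512, 512]
print(requests_per_chunk(256))   # [256]
```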
Re: [linux-lvm] 2.6.22-rc4 XFS fails after hibernate/resume
On Wed, Jun 27, 2007 at 08:49:24PM +, Pavel Machek wrote: Hi! FWIW, I'm on record stating that sync is not sufficient to quiesce an XFS filesystem for a suspend/resume to work safely and have argued that the only Hmm, so XFS writes to disk even when its threads are frozen? They issue async I/O before they sleep and expect processing to be done on I/O completion via workqueues. safe thing to do is freeze the filesystem before suspend and thaw it after resume. This is why I originally asked you to test that with the other problem. Could you add that to the XFS threads if it is really required? They do know that they are being frozen for suspend. We don't suspend the threads on a filesystem freeze - they continue to run. A filesystem freeze guarantees the filesystem is clean and that the in-memory state matches what is on disk. It is not possible for the filesystem to issue I/O or have outstanding I/O when it is in the frozen state, so the state of the threads and/or workqueues does not matter because they will be idle. Cheers, Dave.
Re: [linux-pm] Re: [linux-lvm] 2.6.22-rc4 XFS fails after hibernate/resume
On Fri, Jun 29, 2007 at 12:16:44AM +0200, Rafael J. Wysocki wrote: There are two solutions possible, IMO. One would be to make these workqueues freezable, which is possible, but hacky and Oleg didn't like that very much. The second would be to freeze XFS from within the hibernation code path, using freeze_bdev(). The second is much more likely to work reliably. If freezing the filesystem leaves something in an inconsistent state, then it's something I can reproduce and debug without needing to suspend/resume. FWIW, don't forget you need to thaw the filesystem on resume. Cheers, Dave.
Re: Fastest Chunk Size w/XFS For MD Software RAID = 1024k
On Wed, Jun 27, 2007 at 07:20:42PM -0400, Justin Piszcz wrote: For drives with 16MB of cache (in this case, raptors). That's four (4) drives, right? If so, how do you get a block read rate of 578MB/s from 4 drives? That's 145MB/s per drive. Cheers, Dave.
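The arithmetic behind Dave's question, spelled out (a trivial check, not from the thread): the aggregate rate divided by the drive count gives the implausibly high per-drive figure he is pointing at.

```python
# 578 MB/s aggregate read claimed from a 4-drive array implies each
# drive would have to sustain ~145 MB/s on its own.
aggregate_mb_s = 578
drives = 4
per_drive = aggregate_mb_s / drives
print(f"{per_drive:.1f} MB/s per drive")  # 144.5 MB/s per drive
```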
Re: [linux-lvm] 2.6.22-rc4 XFS fails after hibernate/resume
On Mon, Jun 18, 2007 at 08:49:34AM +0100, David Greaves wrote: David Greaves wrote: OK, that gave me an idea. Freeze the filesystem md5sum the lvm hibernate resume md5sum the lvm snip So the lvm and below looks OK... I'll see how it behaves now the filesystem has been frozen/thawed over the hibernate... And it appears to behave well. (A few hours of compile/clean cycling kernel builds on that filesystem were OK). Historically I've done:

sync
echo platform > /sys/power/disk
echo disk > /sys/power/state
# resume

and had filesystem corruption (only on this machine, my other hibernating xfs machines don't have this problem) So doing:

xfs_freeze -f /scratch
sync
echo platform > /sys/power/disk
echo disk > /sys/power/state
# resume
xfs_freeze -u /scratch

works (for now - more usage testing tonight). Verrry interesting. What you were seeing was an XFS shutdown occurring because the free space btree was corrupted. IOWs, the process of suspend/resume has resulted in either bad data being written to disk, the correct data not being written to disk, or the cached block being corrupted in memory. If you run xfs_check on the filesystem after it has shut down after a resume, can you tell us if it reports on-disk corruption? Note: do not run xfs_repair to check this - it does not check the free space btrees; instead it simply rebuilds them from scratch. If xfs_check reports an error, then run xfs_repair to fix it up. FWIW, I'm on record stating that sync is not sufficient to quiesce an XFS filesystem for a suspend/resume to work safely and have argued that the only safe thing to do is freeze the filesystem before suspend and thaw it after resume. This is why I originally asked you to test that with the other problem that you reported. Up until this point in time, there's been no evidence to prove either side of the argument. Cheers, Dave. 
Re: limits on raid
On Sat, Jun 16, 2007 at 07:59:29AM +1000, Neil Brown wrote: Combining these thoughts, it would make a lot of sense for the filesystem to be able to say to the block device That block looks wrong - can you find me another copy to try?. That is an example of the sort of closer integration between filesystem and RAID that would make sense. I think that this would only be useful on devices that store discrete copies of the blocks on different devices, i.e. mirrors. If it's an XOR based RAID, you don't have another copy you can retrieve. Cheers, Dave.
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote: On Thu, May 31 2007, David Chinner wrote: IOWs, there are two parts to the problem: 1 - guaranteeing I/O ordering 2 - guaranteeing blocks are on persistent storage. Right now, a single barrier I/O is used to provide both of these guarantees. In most cases, all we really need to provide is 1); the need for 2) is a much rarer condition but still needs to be provided. if I am understanding it correctly, the big win for barriers is that you do NOT have to stop and wait until the data is on persistent media before you can continue. Yes, if we define a barrier to only guarantee 1), then yes this would be a big win (esp. for XFS). But that requires all filesystems to handle sync writes differently, and sync_blockdev() needs to call blkdev_issue_flush() as well. So, what do we do here? Do we define a barrier I/O to only provide ordering, or do we define it to also provide persistent storage writeback? Whatever we decide, it needs to be documented. The block layer already has a notion of the two types of barriers, with a very small amount of tweaking we could expose that. There's absolutely zero reason we can't easily support both types of barriers. That sounds like a good idea - we can leave the existing WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED behaviour that only guarantees ordering. The filesystem can then choose which to use where appropriate. Cheers, Dave.
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote: David Chinner wrote: That sounds like a good idea - we can leave the existing WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED behaviour that only guarantees ordering. The filesystem can then choose which to use where appropriate. So what if you want a synchronous write, but DON'T care about the order? submit_bio(WRITE_SYNC, bio); Already there, already used by XFS, JFS and direct I/O. Cheers, Dave.
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Tue, May 29, 2007 at 05:01:24PM -0700, [EMAIL PROTECTED] wrote: On Wed, 30 May 2007, David Chinner wrote: On Tue, May 29, 2007 at 04:03:43PM -0400, Phillip Susi wrote: David Chinner wrote: The use of barriers in XFS assumes the commit write to be on stable storage before it returns. One of the ordering guarantees that we need is that the transaction (commit write) is on disk before the metadata block containing the change in the transaction is written to disk, and the current barrier behaviour gives us that. Barrier != synchronous write, Of course. FYI, XFS only issues barriers on *async* writes. But barrier semantics - as far as they've been described by everyone but you - indicate that the barrier write is guaranteed to be on stable storage when it returns. this doesn't match what I have seen with barriers it's perfectly legal to have the following sequence of events 1. app writes block 10 to OS 2. app writes block 4 to OS 3. app writes barrier to OS 4. app writes block 5 to OS 5. app writes block 20 to OS hm - applications can't issue barriers to the filesystem. However, if you consider the barrier to be an fsync() for example, then it's still the filesystem that is issuing the barrier and there's a block that needs to be written that is associated with that barrier (either an inode or a transaction commit) that needs to be on stable storage before the filesystem returns to userspace. 6. OS writes block 4 to disk drive 7. OS writes block 10 to disk drive 8. OS writes barrier to disk drive 9. OS writes block 5 to disk drive 10. OS writes block 20 to disk drive Replace OS with filesystem, and combine 7+8 together - we don't have zero-length barriers and hence they are *always* associated with a write to a certain block on disk. i.e.: 1. FS writes block 4 to disk drive 2. FS writes block 10 to disk drive 3. FS writes *barrier* block X to disk drive 4. FS writes block 5 to disk drive 5. FS writes block 20 to disk drive The order in which these are expected by the filesystem to hit stable storage is: 1. blocks 4 and 10 on stable storage in any order 2. barrier block X on stable storage 3. blocks 5 and 20 on stable storage in any order. The point I'm trying to make is that in XFS, blocks 5 and 20 cannot be allowed to hit the disk before the barrier block because they have a strict order dependency on block X being stable before them, just like block X has a strict order dependency that blocks 4 and 10 must be stable before we start the barrier block write. 11. disk drive writes block 10 to platter 12. disk drive writes block 4 to platter 13. disk drive writes block 20 to platter 14. disk drive writes block 5 to platter if the disk drive doesn't support barriers then step #8 becomes 'issue flush' and steps 11 and 12 take place before steps #9, 13, 14 No, you need a flush on either side of the block X write to maintain the same semantics as barrier writes currently have. We have filesystems that require barriers to prevent reordering of writes in both directions and to ensure that the block associated with the barrier is on stable storage when I/O completion is signalled. The existing barrier implementation (where it works) provides these requirements. We need barriers to retain these semantics, otherwise we'll still have to do special stuff in the filesystems to get the semantics that we need. Cheers, Dave.
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Wed, May 30, 2007 at 09:52:49AM -0700, [EMAIL PROTECTED] wrote: On Wed, 30 May 2007, David Chinner wrote: with the barrier is on stable storage when I/O completion is signalled. The existing barrier implementation (where it works) provides these requirements. We need barriers to retain these semantics, otherwise we'll still have to do special stuff in the filesystems to get the semantics that we need. one of us is misunderstanding barriers here. No, I think we are both on the same level here - it's what barriers are used for that is not clearly understood, I think. you are understanding barriers to be the same as synchronous writes. (and therefore the data is on persistent media before the call returns) No, I'm describing the high level behaviour that is expected by a filesystem. The reasons for this are below. I am understanding barriers to only indicate ordering requirements. things before the barrier can be reordered freely, things after the barrier can be reordered freely, but things cannot be reordered across the barrier. Ok, that's my understanding of how *device based barriers* can work, but there's more to it than that. As far as the filesystem is concerned, the barrier write needs to *behave* exactly like a sync write because of the guarantees the filesystem has to provide userspace. Specifically - sync, sync writes and fsync. This is the big problem, right? If we use barriers for commit writes, the filesystem can return to userspace after a sync write or fsync() and an *ordered barrier device implementation* may not have written the blocks to persistent media. If we then pull the plug on the box, we've just lost data that sync or fsync said was successfully on disk. That's BAD. Right now a barrier write on the last block of the fsync/sync write is sufficient to prevent that because of the FUA on the barrier block write. A purely ordered barrier implementation does not provide this guarantee. 
This is the crux of my argument - from a filesystem perspective, there is a *major* difference between a barrier implemented to just guarantee ordering and a barrier implemented with a flush+FUA or flush+write+flush. IOWs, there are two parts to the problem: 1 - guaranteeing I/O ordering 2 - guaranteeing blocks are on persistent storage. Right now, a single barrier I/O is used to provide both of these guarantees. In most cases, all we really need to provide is 1); the need for 2) is a much rarer condition but still needs to be provided. if I am understanding it correctly, the big win for barriers is that you do NOT have to stop and wait until the data is on persistent media before you can continue. Yes, if we define a barrier to only guarantee 1), then yes this would be a big win (esp. for XFS). But that requires all filesystems to handle sync writes differently, and sync_blockdev() needs to call blkdev_issue_flush() as well. So, what do we do here? Do we define a barrier I/O to only provide ordering, or do we define it to also provide persistent storage writeback? Whatever we decide, it needs to be documented. Cheers, Dave.
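The distinction Dave draws - ordering (1) versus durability (2) - can be illustrated with a toy model (entirely my sketch, not kernel code): an ordering-only barrier constrains which completion orders are legal, but says nothing about when any block reaches stable media.

```python
def legal_segments(stream):
    """Split a write stream at barriers.

    Under an ordering-only barrier, the device may complete writes
    within a segment in any order, but may never reorder across a
    barrier. Nothing here guarantees any block is on stable media
    when the barrier completes -- that is guarantee (2), which
    needs a cache flush and/or FUA in addition to ordering.
    """
    segments, current = [], []
    for op in stream:
        if op == "BARRIER":
            segments.append(current)
            current = []
        else:
            current.append(op)
    segments.append(current)
    return segments

# The example from the thread: blocks 4 and 10, then barrier block X,
# then blocks 5 and 20.
print(legal_segments([4, 10, "BARRIER", 5, 20]))
# -> [[4, 10], [5, 20]]  (4/10 may swap, 5/20 may swap, no crossing)
```

This makes the fsync problem concrete: after the barrier completes, the model permits every block to still be only in the drive's write cache, which is exactly why a purely ordered barrier cannot back an fsync guarantee.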
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31, 2007 at 02:07:39AM +0100, Alasdair G Kergon wrote: On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote: If a filesystem cares, it could 'ask' as suggested above. What would be a good interface for asking? XFS already tests: bd_disk->queue->ordered == QUEUE_ORDERED_NONE The side effects of removing that check are what started this whole discussion. Cheers, Dave.
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Mon, May 28, 2007 at 01:17:31PM +0200, Pallai Roland wrote: On Monday 28 May 2007 04:17:18 David Chinner wrote: Hmmm. A quick look at the linux code makes me think that background writeback on linux has never been able to cause a shutdown in this case. However, the same error on Irix will definitely cause a shutdown, though. I hope Linux will follow Irix, that's a consistent standpoint. I raised a bug for this yesterday when writing that reply. It won't get forgotten now. David, have you a plan to implement your reporting raid5 block layer idea? No one else seems to care about this silent data loss on temporary (cable, power) failed raid5 arrays as I see, I really hope you do at least! Yeah, I'd love to get something like this happening, but given it's about half way down my list of stuff to do, when I have some spare time I'd say it will be about 2015 before I get to it. Cheers, Dave.
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Mon, May 28, 2007 at 05:30:52PM +0200, Pallai Roland wrote: On Monday 28 May 2007 14:53:55 Pallai Roland wrote: On Friday 25 May 2007 02:05:47 David Chinner wrote: -o ro,norecovery will allow you to mount the filesystem and get any uncorrupted data off it. You still may get shutdowns if you trip across corrupted metadata in the filesystem, though. This filesystem is completely dead. [...] I tried to make a md patch to stop writes if a raid5 array got 2+ failed drives, but I found it's already done, oops. :) handle_stripe5() ignores writes in this case quietly, I tried and it works. Hmmm - it clears the uptodate bit on the bio, which is supposed to make the bio return EIO. That looks to be doing the right thing... There's another layer I used on this box between md and xfs: loop-aes. I Oh, that's a kind of important thing to forget to mention. used it for years and it's rock stable, but now it's my first suspect, because I found a bug in it today: I assembled my array from n-1 disks, and I failed a second disk for a test and I found /dev/loop1 still provides *random* data where /dev/md1 serves nothing; it's definitely a loop-aes bug: . It's not an explanation for my screwed-up file system, but for me it's enough to drop loop-aes. Eh. If you can get random data back instead of an error from the block device, then I'm not surprised your filesystem is toast. If it's one sector in a larger block that is corrupted, then the only thing that will protect you from this sort of corruption causing problems is metadata checksums (yet another thing on my list of stuff to do). Cheers, Dave.
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Mon, May 28, 2007 at 05:45:27PM -0500, Alberto Alonso wrote: On Fri, 2007-05-25 at 18:36 +1000, David Chinner wrote: On Fri, May 25, 2007 at 12:43:51AM -0500, Alberto Alonso wrote: I think his point was that going into a read only mode causes a less catastrophic situation (ie. a web server can still serve pages). Sure - but once you've detected one corruption or had metadata I/O errors, can you trust the rest of the filesystem? I think that is a valid point; rather than shutting down the file system completely, an automatic switch to where the least disruption of service can occur is always desired. I consider the possibility of serving out bad data (i.e. after a remount to readonly) to be the worst possible disruption of service that can happen ;) I guess it does depend on the nature of the failure. A write failure on block 2000 does not imply corruption of the other 2TB of data. The rest might not be corrupted, but if block 2000 is an index of some sort (i.e. metadata), you could reference any of that 2TB incorrectly and get the wrong data, write to the wrong spot on disk, etc. I personally have found the XFS file system to be great for my needs (except issues with NFS interaction, where the bug report never got answered), but that doesn't mean it can not be improved. Got a pointer? I can't seem to find it. I'm pretty sure I used bugzilla to report it. I did find the kernel dump file though, so here it is: Oct 3 15:34:07 localhost kernel: xfs_iget_core: ambiguous vns: vp/0xd1e69c80, invp/0xc989e380 Oh, I haven't seen any of those problems for quite some time. = /proc/kmsg started. Oct 3 15:51:23 localhost kernel: Inspecting /boot/System.map-2.6.8-2-686-smp Oh, well, yes, kernels that old did have that problem. It got fixed some time around 2.6.12 or 2.6.13 IIRC. Cheers, Dave. 
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote: On Friday 25 May 2007 06:55:00 David Chinner wrote: Oh, did you look at your logs and find that XFS had spammed them about writes that were failing? The first message after the incident:

May 24 01:53:50 hq kernel: Filesystem loop1: XFS internal error xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c. Caller 0xf8ac14f8
May 24 01:53:50 hq kernel: f8adae69 xfs_btree_check_sblock+0x4f/0xc2 [xfs] f8ac14f8 xfs_alloc_lookup+0x34e/0x47b [xfs]
May 24 01:53:50 hq kernel: f8ac14f8 xfs_alloc_lookup+0x34e/0x47b [xfs] f8b1a9c7 kmem_zone_zalloc+0x1b/0x43 [xfs]
May 24 01:53:50 hq kernel: f8abe645 xfs_alloc_ag_vextent+0x24d/0x1110 [xfs] f8ac0647 xfs_alloc_vextent+0x3bd/0x53b [xfs]
May 24 01:53:50 hq kernel: f8ad2f7e xfs_bmapi+0x1ac4/0x23cd [xfs] f8acab97 xfs_bmap_search_multi_extents+0x8e/0xd8 [xfs]
May 24 01:53:50 hq kernel: f8b1 xlog_dealloc_log+0x49/0xea [xfs] f8afdaee xfs_iomap_write_allocate+0x2d9/0x58b [xfs]
May 24 01:53:50 hq kernel: f8afc3ae xfs_iomap+0x60e/0x82d [xfs] c0113bc8 __wake_up_common+0x39/0x59
May 24 01:53:50 hq kernel: f8b1ae11 xfs_map_blocks+0x39/0x6c [xfs] f8b1bd7b xfs_page_state_convert+0x644/0xf9c [xfs]
May 24 01:53:50 hq kernel: c036f384 schedule+0x5d1/0xf4d f8b1c780 xfs_vm_writepage+0x0/0xe0 [xfs]
May 24 01:53:50 hq kernel: f8b1c7d7 xfs_vm_writepage+0x57/0xe0 [xfs] c01830e8 mpage_writepages+0x1fb/0x3bb
May 24 01:53:50 hq kernel: c0183020 mpage_writepages+0x133/0x3bb f8b1c780 xfs_vm_writepage+0x0/0xe0 [xfs]
May 24 01:53:50 hq kernel: c0147bb3 do_writepages+0x35/0x3b c018135c __writeback_single_inode+0x88/0x387
May 24 01:53:50 hq kernel: c01819b7 sync_sb_inodes+0x1b4/0x2a8 c0181c63 writeback_inodes+0x63/0xdc
May 24 01:53:50 hq kernel: c0147943 background_writeout+0x66/0x9f c01482b3 pdflush+0x0/0x1ad
May 24 01:53:50 hq kernel: c01483a2 pdflush+0xef/0x1ad c01478dd background_writeout+0x0/0x9f
May 24 01:53:50 hq kernel: c012d10b kthread+0xc2/0xc6 c012d049 kthread+0x0/0xc6
May 24 01:53:50 hq kernel: c0100dd5 kernel_thread_helper+0x5/0xb

...and I've spammed such messages. This internal error isn't a good reason to shut down the file system? Actually, that error does shut the filesystem down in most cases. When you see that output, the function is returning -EFSCORRUPTED. You've got a corrupted freespace btree. The reason why you get spammed is that this is happening during background writeback, and there is no one to return the -EFSCORRUPTED error to. The background writeback path doesn't specifically detect shut down filesystems or trigger shutdowns on errors because that happens in different layers, so you just end up with failed data writes. These errors will occur on the next foreground data or metadata allocation, and that will shut the filesystem down at that point. I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe in this case we should be shutting down the filesystem. That would certainly cut down on the spamming and would not appear to change any other behaviour. I think if there's a sign of a corrupted file system, the first thing we should do is stop writes (or the entire FS) and let the admin examine the situation. Yes, that's *exactly* what a shutdown does. In this case, your writes are being stopped - hence the error messages - but the filesystem has not yet been shut down. Cheers, Dave.
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Mon, May 28, 2007 at 03:50:17AM +0200, Pallai Roland wrote: On Monday 28 May 2007 02:30:11 David Chinner wrote: On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote: ...and I've spammed such messages. This internal error isn't a good reason to shut down the file system? Actually, that error does shut the filesystem down in most cases. When you see that output, the function is returning -EFSCORRUPTED. You've got a corrupted freespace btree. The reason why you get spammed is that this is happening during background writeback, and there is no one to return the -EFSCORRUPTED error to. The background writeback path doesn't specifically detect shut down filesystems or trigger shutdowns on errors because that happens in different layers, so you just end up with failed data writes. These errors will occur on the next foreground data or metadata allocation, and that will shut the filesystem down at that point. I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe in this case we should be shutting down the filesystem. That would certainly cut down on the spamming and would not appear to change any other behaviour. If I remember correctly, my file system wasn't shut down at all; it was writeable for the whole night, and yafc slowly wrote files to it. Maybe all write operations had failed, but yafc didn't warn. So you never created new files or directories, unlinked files or directories, did synchronous writes, etc? Just had slowly growing files? Spamming is just annoying when we need to find out what went wrong (my kernel.log is 300MB), but for data security it's important to react to the EFSCORRUPTED error in any case, I think. Please consider this. The filesystem has responded correctly to the corruption in terms of data security (i.e. failed the data write and warned noisily about it), but it probably hasn't done everything it should. Hmmm.
A quick look at the Linux code makes me think that background writeback on Linux has never been able to cause a shutdown in this case. However, the same error on Irix will definitely cause a shutdown. Cheers, Dave.
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote: Thanks everyone for your input. There were some very valuable observations in the various emails. I will try to pull most of it together and bring out what seem to be the important points. 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUPP. Sounds good to me, but how do we test to see if the underlying device supports barriers? Do we just assume that they do and only change behaviour if -o nobarrier is specified in the mount options? 2/ Maybe barriers provide stronger semantics than are required. All write requests are synchronised around a barrier write. This is often more than is required and apparently can cause a measurable slowdown. Also the FUA for the actual commit write might not be needed. It is important for consistency that the preceding writes are in safe storage before the commit write, but it is not so important that the commit write is immediately safe on storage. That isn't needed until a 'sync' or 'fsync' or similar. The use of barriers in XFS assumes the commit write to be on stable storage before it returns. One of the ordering guarantees that we need is that the transaction (commit write) is on disk before the metadata block containing the change in the transaction is written to disk, and the current barrier behaviour gives us that. One possible alternative is: - writes can overtake barriers, but a barrier cannot overtake writes. No, that breaks the above usage of a barrier - flush before the barrier, not after. This is considerably weaker, and hence cheaper. But I think it is enough for all filesystems (providing it is still an option to call blkdev_issue_flush on 'fsync'). No, not enough for XFS. Another alternative would be to tag each bio as being in a particular barrier-group. Then bios in different groups could overtake each other in either direction, but a BARRIER request must be totally ordered w.r.t. other requests in the barrier group.
This would require an extra bio field, and would give the filesystem more appearance of control. I'm not yet sure how much it would really help... And that assumes the filesystem is tracking exact dependencies between I/Os. Such a mechanism would probably require filesystems to be redesigned to use this, but I can see how it would be useful for doing things like ensuring ordering between just an inode and its data writes. What would the overhead of having to support several hundred thousand different barrier groups be (i.e. one per dirty inode in a system)? I think the implementation priorities here are: Depending on the answer to my first question: 0/ implement a specific test for filesystems to run at mount time to determine if barriers are supported or not. 1/ implement a zero-length BIO_RW_BARRIER option. 2/ Use it (or otherwise) to make all dm and md modules handle barriers (and loop?). 3/ Devise and implement appropriate fall-backs within the block layer so that -EOPNOTSUPP is never returned. 4/ Remove unneeded cruft from filesystems (and elsewhere). Sounds like a good start. ;) Cheers, Dave.
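The barrier-group idea Neil floats above can be made concrete with a toy ordering predicate (a hypothetical sketch of the proposal, not any existing block-layer API; the representation of a request as a (group, is_barrier) pair is invented here for illustration):

```python
# Toy model of the proposed barrier groups: requests in different
# groups may be freely reordered, but within a group nothing may
# cross a barrier request in either direction.

def may_reorder(earlier, later):
    """May `later` (issued after `earlier`) be moved ahead of it?

    Each request is a (group, is_barrier) pair.
    """
    group_a, barrier_a = earlier
    group_b, barrier_b = later
    if group_a != group_b:
        return True                      # different groups: no ordering imposed
    return not (barrier_a or barrier_b)  # a barrier pins its own group


# A data write in group 1 may pass a barrier belonging to group 2...
assert may_reorder((2, True), (1, False))
# ...but nothing in group 2 may pass group 2's barrier, either way.
assert not may_reorder((2, True), (2, False))
assert not may_reorder((2, False), (2, True))
# Plain writes in the same group, with no barrier involved, may reorder.
assert may_reorder((1, False), (1, False))
```

Dave's objection maps directly onto this model: with one group per dirty inode, the scheduler would have to evaluate this predicate across hundreds of thousands of live group tags.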
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Fri, May 25, 2007 at 12:43:51AM -0500, Alberto Alonso wrote: The difference between ext3 and XFS is that ext3 will remount to read-only on the first write error but XFS won't; XFS fails only the current operation, IMHO. The method of ext3 isn't perfect, but in practice, it's working well. XFS will shut down the filesystem if metadata corruption would occur due to a failed write. We don't immediately fail the filesystem on data write errors because on large systems you can get *transient* I/O errors (e.g. FC path failover) and so retrying failed data writes is useful for preventing unnecessary shutdowns of the filesystem. Different design criteria, different solutions... I think his point was that going into a read-only mode causes a less catastrophic situation (ie. a web server can still serve pages). Sure - but once you've detected one corruption or had metadata I/O errors, can you trust the rest of the filesystem? I think that is a valid point: rather than shutting down the file system completely, an automatic switch to a mode where the least disruption of service occurs is always desired. I consider the possibility of serving out bad data (i.e. after a remount to read-only) to be the worst possible disruption of service that can happen ;) Maybe the automatic failure mode could be something that is configurable via the mount options. If only it were that simple. Have you looked to see how many hooks there are in XFS to shutdown without causing further damage? % grep FORCED_SHUTDOWN fs/xfs/*.[ch] fs/xfs/*/*.[ch] | wc -l 116 Changing the way we handle shutdowns would take a lot of time, effort and testing. When can I expect a patch? ;) I personally have found the XFS file system to be great for my needs (except issues with NFS interaction, where the bug report never got answered), but that doesn't mean it can not be improved. Got a pointer? Cheers, Dave.
Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Fri, May 25, 2007 at 05:58:25PM +1000, Neil Brown wrote: We can think of there being three types of devices: 1/ SAFE. With a SAFE device, there is no write-behind cache, or if there is it is non-volatile. Once a write completes it is completely safe. Such a device does not require barriers or ->issue_flush_fn, and can respond to them either by a no-op or with -EOPNOTSUPP (the former is preferred). 2/ FLUSHABLE. A FLUSHABLE device may have a volatile write-behind cache. This cache can be flushed with a call to blkdev_issue_flush. It may not support barrier requests. So returns -EOPNOTSUPP to any barrier request? 3/ BARRIER. A BARRIER device supports both blkdev_issue_flush and BIO_RW_BARRIER. Either may be used to synchronise any write-behind cache to non-volatile storage (media). Handling of SAFE and FLUSHABLE devices is essentially the same and can work on a BARRIER device. The BARRIER device has the option of more efficient handling. How does a filesystem use this? === The filesystem will want to ensure that all preceding writes are safe before writing the barrier block. There are two ways to achieve this. Three, actually. 1/ Issue all 'preceding writes', wait for them to complete (bi_endio called), call blkdev_issue_flush, issue the commit write, wait for it to complete, call blkdev_issue_flush a second time. (This is needed for FLUSHABLE) *nod* 2/ Set the BIO_RW_BARRIER bit in the write request for the commit block. (This is more efficient on BARRIER). *nod* 3/ Use a SAFE device. The second, while much easier, can fail. So we do a test I/O to see if the device supports them before enabling that mode. But, as we've recently discovered, this is not sufficient to detect *correctly functioning* barrier support. So a filesystem should be prepared to deal with that failure by falling back to the first option. I don't buy that argument. Thus the general sequence might be: a/ issue all preceding writes.
b/ issue the commit write with BIO_RW_BARRIER At this point, the filesystem has done everything it needs to ensure that the block layer has been informed of the I/O ordering requirements. Why should the filesystem now have to detect block layer breakage, and then use a different block layer API to issue the same I/O under the same constraints? c/ wait for the commit to complete. If it was successful - done. If it failed other than with EOPNOTSUPP, abort else continue d/ wait for all 'preceding writes' to complete e/ call blkdev_issue_flush f/ issue commit write without BIO_RW_BARRIER g/ wait for commit write to complete if it failed, abort h/ call blkdev_issue_flush? DONE steps b and c can be left out if it is known that the device does not support barriers. The only way to discover this is to try and see if it fails. That's a very linear, single-threaded way of looking at it... ;) I don't think any filesystem follows all these steps. ext3 has the right structure, but it doesn't include steps e and h. reiserfs is similar. It does have a call to blkdev_issue_flush, but that is only on the fsync path, so it isn't really protecting general journal commits. XFS - I'm less sure. I think it does 'a' then 'd', then 'b' or 'f' depending on whether it thinks the device handles barriers, and finally 'g'. That's right, except for the g (or c) bit - commit writes are async and nothing waits for them - the I/O completion wakes anything waiting on its completion (yes, all XFS barrier I/Os are issued async, which is why having to handle an -EOPNOTSUPP error is a real pain. The fix I currently have is to reissue the I/O from the completion handler, which is ugly, ugly, ugly.) So for devices that support BIO_RW_BARRIER, and for devices that don't need any flush, they work OK, but for devices that need flushing, but don't support BIO_RW_BARRIER, none of them work. This should be easy to fix.
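Neil's a-h sequence can be sketched as a toy model (a hypothetical illustration of the fallback logic, not kernel code; the `Device` class and its methods are invented here, and the waiting steps are elided):

```python
# Toy model of the fallback sequence above: try the commit write with
# the barrier bit set; if the device rejects it with -EOPNOTSUPP,
# bracket a plain commit write with explicit cache flushes instead.

EOPNOTSUPP = 95


class Device:
    def __init__(self, supports_barriers):
        self.supports_barriers = supports_barriers
        self.trace = []               # record of what reached the device

    def submit(self, what, barrier=False):
        if barrier and not self.supports_barriers:
            return -EOPNOTSUPP        # barrier request rejected
        self.trace.append(what + ("+barrier" if barrier else ""))
        return 0

    def flush(self):
        self.trace.append("flush")    # blkdev_issue_flush equivalent
        return 0


def commit(dev):
    dev.submit("journal-writes")              # a/ preceding writes (waits elided)
    if dev.submit("commit", barrier=True) == 0:
        return                                # b/, c/ barrier path succeeded
    dev.flush()                               # e/ flush preceding writes
    dev.submit("commit")                      # f/ plain commit write
    dev.flush()                               # h/ flush the commit itself


barrier_dev = Device(supports_barriers=True)
commit(barrier_dev)
assert barrier_dev.trace == ["journal-writes", "commit+barrier"]

flush_only_dev = Device(supports_barriers=False)
commit(flush_only_dev)
assert flush_only_dev.trace == ["journal-writes", "flush", "commit", "flush"]
```

Dave's complaint about async barrier I/O is exactly the part this linear sketch papers over: in XFS the -EOPNOTSUPP comes back in a completion handler, not at a convenient point in a sequential function like `commit()`.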
Right - XFS as it stands was designed to work on SAFE devices, and we've modified it to work on BARRIER devices. We don't support FLUSHABLE devices at all. But if the filesystem supports BARRIER devices, I don't see any reason why a filesystem needs to be modified to support FLUSHABLE devices - the key point being that by the time the filesystem has issued the commit write it has already waited for all its dependent I/O, and so all the block device needs to do is issue flushes either side of the commit write HOW DO MD or DM USE THIS 1/ striping devices. This includes md/raid0 md/linear dm-linear dm-stripe and probably others. These devices can easily support blkdev_issue_flush by simply calling blkdev_issue_flush on all component devices.
Re: raid5: I lost a XFS file system due to a minor IDE cable problem
On Thu, May 24, 2007 at 07:20:35AM -0400, Justin Piszcz wrote: Including XFS mailing list on this one. Thanks Justin. On Thu, 24 May 2007, Pallai Roland wrote: Hi, I'm wondering why the md raid5 accepts writes after 2 disks have failed. I've an array built from 7 drives, filesystem is XFS. Yesterday, an IDE cable failed (my friend kicked it off from the box on the floor:) and 2 disks were kicked, but my download (yafc) did not stop; it kept writing to the file system for the whole night! Now I changed the cable and tried to reassemble the array (mdadm -f --run); the event counter increased from 4908158 up to 4929612 on the failed disks, but I cannot mount the file system and 'xfs_repair -n' shows lots of errors there. This is explainable by the partially successful writes. Ext3 and JFS have an errors= mount option to switch the filesystem read-only on any error, but XFS hasn't: why? -o ro,norecovery will allow you to mount the filesystem and get any uncorrupted data off it. You still may get shutdowns if you trip across corrupted metadata in the filesystem, though. It's a good question too, but I think the md layer could save dumb filesystems like XFS if it denied writes after 2 disks have failed, and I cannot see a good reason why it does not behave this way. How is *any* filesystem supposed to know that the underlying block device has gone bad if it is not returning errors? I did mention this exact scenario in the filesystems workshop back in February - we'd *really* like to know if a RAID block device has gone into degraded mode (i.e. lost a disk) so we can throttle new writes until the rebuild has been completed. Stopping writes completely on a fatal error (like 2 lost disks in RAID5, and 3 lost disks in RAID6) would also be possible if only we could get the information out of the block layer. Do you have a better idea how I can avoid such filesystem corruptions in the future? No, I don't want to use ext3 on this box.
:) Well, the problem is a bug in MD - it should have detected drives going away and stopped access to the device until it was repaired. You would have had the same problem with ext3, or JFS, or reiser or any other filesystem, too. my mount error: XFS: Log inconsistent (didn't find previous header) XFS: failed to find log head XFS: log mount/recovery failed: error 5 XFS: log mount failed Your MD device is still hosed - error 5 = EIO; the md device is reporting errors back to the filesystem now. You need to fix that before trying to recover any data... Cheers, Dave.
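The fatal-error case Dave describes - stopping writes once more disks are lost than the redundancy can cover - reduces to a one-line policy check. A toy Python sketch of that policy (illustrative only; md does not expose such a function):

```python
# Toy model of the proposed md behaviour: an array keeps accepting
# writes while degraded-but-consistent, and refuses them once more
# disks have failed than the RAID level tolerates (1 for RAID5,
# 2 for RAID6) - instead of silently accepting doomed writes as in
# the reported incident.

TOLERATED_FAILURES = {"raid5": 1, "raid6": 2}


def can_write(level, failed_disks):
    """Should the array still accept writes?"""
    return failed_disks <= TOLERATED_FAILURES[level]


assert can_write("raid5", 1)        # degraded but consistent: allow (throttled)
assert not can_write("raid5", 2)    # the reported case: 2 lost disks, refuse
assert can_write("raid6", 2)
assert not can_write("raid6", 3)
```

In the incident above, `can_write("raid5", 2)` being false is exactly the check md never made: it kept completing writes to a 7-drive RAID5 with 2 failed members all night.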
Re: XFS and write barrier
On Tue, Jul 18, 2006 at 06:58:56PM +1000, Neil Brown wrote: On Tuesday July 18, [EMAIL PROTECTED] wrote: On Mon, Jul 17, 2006 at 01:32:38AM +0800, Federico Sevilla III wrote: On Sat, Jul 15, 2006 at 12:48:56PM +0200, Martin Steigerwald wrote: I am currently gathering information to write an article about journal filesystems with emphasis on write barrier functionality, how it works, why journalling filesystems need write barrier and the current implementation of write barrier support for different filesystems. "Journalling filesystems need write barrier" isn't really accurate. They can make good use of write barrier if it is supported, and where it isn't supported, they should use blkdev_issue_flush in combination with regular submit/wait. blkdev_issue_flush() causes a write cache flush - just like a barrier typically causes a write cache flush up to the I/O with the barrier in it. Both of these mechanisms provide the same thing - an I/O barrier that enforces ordering of I/Os to disk. Given that filesystems already indicate to the block layer when they want a barrier, wouldn't it be better to get the block layer to issue this cache flush if the underlying device doesn't support barriers and it receives a barrier request? FWIW, only XFS and Reiser3 use this function, and only then when issuing a fsync when barriers are disabled to make sure a common test (fsync then power cycle) doesn't result in data loss... No one here seems to know; maybe Neil or the other folks on linux-raid can help us out with details on the status of MD and write barriers? In 2.6.17, md/raid1 will detect if the underlying devices support barriers and if they all do, it will accept barrier requests from the filesystem and pass those requests down to all devices. Other raid levels will reject all barrier requests. Any particular reason for not supporting barriers on the other types of RAID? Cheers, Dave.
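The md/raid1 policy Neil describes - accept barriers only when every component supports them - is a simple all-or-nothing gate. A toy Python sketch (an illustration of the stated policy, not md code; the function and its boolean representation of components are invented here):

```python
# Toy model of the 2.6.17 md/raid1 behaviour described above: a
# barrier request is passed down to all component devices only if
# every one of them supports barriers; otherwise the array rejects
# the request and the filesystem must fall back to flush-based
# ordering.

EOPNOTSUPP = 95


def raid1_barrier_write(component_barrier_support):
    """component_barrier_support: one bool per disk in the mirror."""
    if all(component_barrier_support):
        return 0                  # barrier forwarded to every mirror leg
    return -EOPNOTSUPP            # any non-supporting disk: reject


assert raid1_barrier_write([True, True]) == 0
assert raid1_barrier_write([True, False]) == -EOPNOTSUPP
```

The all-or-nothing rule exists because a barrier is only meaningful if it orders writes on *every* leg of the mirror; one disk reordering around it would defeat the guarantee.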