Re: [PATCH 00/37] Permit filesystem local caching
On Thursday 21 February 2008, David Howells wrote:
> David Howells [EMAIL PROTECTED] wrote:
> > Have you got before/after benchmark results?
>
> See attached. Attached here are results using BTRFS (patched so that it'll
> work at all) rather than Ext3 on the client on the partition backing the
> cache.

Thanks for trying this; of course I'll ask you to try again with the latest v0.13 code, which has a number of optimizations, especially for CPU usage.

> Note that I didn't bother redoing the tests that didn't involve a cache as
> the choice of filesystem backing the cache should have no bearing on the
> result. Generally, completely cold caches shouldn't show much variation as
> all the writing can be done completely asynchronously, provided the client
> doesn't fill its RAM. The interesting case is where the disk cache is warm,
> but the pagecache is cold (ie: just after a reboot after filling the
> caches). Here, for the two big files case, BTRFS appears quite a bit better
> than Ext3, showing a 21% reduction in time for the smaller case and a 13%
> reduction for the larger case.

I'm afraid I don't have a good handle on the filesystem operations that result from this workload. Are we reading from the FS to fill the NFS page cache?

> For the many small/medium files case, BTRFS performed significantly better
> (15% reduction in time) in the case where the caches were completely cold.
> I'm not sure why, though - perhaps because it doesn't execute a
> write_begin() stage during the write_one_page() call and thus doesn't go
> allocating disk blocks to back the data, but instead allocates them later.

If your write_one_page call does parts of btrfs_file_write, you'll get delayed allocation for anything bigger than 8k by default. Writes <= 8k will get packed into the btree leaves.

> More surprising is that BTRFS performed significantly worse (15% increase
> in time) in the case where the cache on disk was fully populated and then
> the machine had been rebooted to clear the pagecaches.

Which FS operations are included here?
Finding all the files or just an unmount? Btrfs defrags metadata in the background, and unmount has to wait for that defrag to finish.

Thanks again,
Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[ANNOUNCE] Btrfs v0.13
Hello everyone,

Btrfs v0.13 is now available for download from:

http://oss.oracle.com/projects/btrfs/

We took another short break from the multi-device code to make the minor mods required to compile on 2.6.25, fix some problematic bugs and do more tuning.

The most important fix is for file data checksumming errors. These might show up on .o files from compiles or other files where seeky writes were done internally to fill it up. The end result was a bunch of zeros in the file where people expected their data to be. Thanks to Yan Zheng for tracking it down.

GregKH provided most of the 2.6.25 port with some sysfs updates. Since the sysfs files are not used much and Greg has offered additional cleanups, I've disabled the btrfs sysfs interface on kernels older than 2.6.25. This way he won't have to backport any of his changes.

Optimizations and other fixes:

* File data checksumming done in larger chunks, resulting in fewer btree
  searches and fewer kmap calls.
* CPU optimizations for back reference removal.
* CPU optimizations for block allocation, and much more efficient searching
  through the free space cache.
* Allocation optimizations: the free space clustering code was not properly
  allocating from a cluster once it found it. For normal mounts the fix
  improves metadata writeback; for mount -o ssd it improves everything.
* Unaligned access fixes from Dave Miller.
* Btree reads are done in larger bios when possible.
* i_block accounting is fixed.

-chris
Re: [ANNOUNCE] Btrfs v0.13
On Thursday 21 February 2008, Chris Mason wrote:
> Hello everyone,
>
> Btrfs v0.13 is now available for download from:
> http://oss.oracle.com/projects/btrfs/
>
> We took another short break from the multi-device code to make the minor
> mods required to compile on 2.6.25, fix some problematic bugs and do more
> tuning.

Sorry, I should have added: v0.13 has no disk format changes since v0.12.

-chris
Re: very poor ext3 write performance on big filesystems?
On Tuesday 19 February 2008, Tomasz Chmielewski wrote:
> Theodore Tso schrieb:
> (...)
> > The following ld_preload can help in some cases. Mutt has this hack
> > encoded in for maildir directories, which helps.
>
> It doesn't work very reliably for me. For some reason, it sometimes hangs
> (doesn't remove any files, rm -rf just stalls), or segfaults.

You can go the low-tech route (assuming your file names don't have spaces in them):

find . -printf "%i %p\n" | sort -n | awk '{print $2}' | xargs rm

> As most of the ideas here in this thread assume (re)creating a new
> filesystem from scratch - would perhaps playing with
> /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio help a
> bit?

Probably not. You're seeking between all the inodes on the box, and probably not bound by the memory used.

-chris
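Though the answer above is "probably not", the two knobs are easy to poke at; this is just a read-only look (writing new percentages into the same files requires root and is omitted here):

```shell
# Writeback thresholds as a percentage of memory: background writeback
# kicks in at dirty_background_ratio, and processes dirtying pages are
# throttled once dirty pages reach dirty_ratio.
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_ratio
```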
Re: very poor ext3 write performance on big filesystems?
On Tuesday 19 February 2008, Tomasz Chmielewski wrote:
> Chris Mason schrieb:
> > You can go the low-tech route (assuming your file names don't have spaces
> > in them):
> >
> > find . -printf "%i %p\n" | sort -n | awk '{print $2}' | xargs rm
>
> Why should it make a difference?

It does something similar to Ted's ld preload, sorting the results from readdir by inode number before using them. You will still seek quite a lot between the directory entries, but operations on the files themselves will go in a much more optimal order. It might help.

> Does find find filenames/paths faster than rm -r? Or is find once/remove
> once faster than find files/rm files/find files/rm files/..., which I
> suppose rm -r does?

rm -r removes things in the order that readdir returns. In your hard linked tree (on almost any FS), this will be very random. The sorting is probably the best you can do from userland to optimize the ordering.

-chris
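For what it's worth, a space-tolerant variant of the same trick is possible if you have GNU findutils and coreutils (sort -z and cut -z are GNU extensions); the demo tree created here is only for illustration:

```shell
# Build a throwaway demo tree; point TREE at the real directory instead.
TREE=$(mktemp -d)
touch "$TREE/a file" "$TREE/b"

# Emit NUL-separated "inode<TAB>path" records, sort numerically by inode,
# strip the inode column, and delete in inode order.
find "$TREE" -type f -printf '%i\t%p\0' \
  | sort -z -n \
  | cut -z -f2- \
  | xargs -0 -r rm --
```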
Re: BTRFS only works with PAGE_SIZE = 4K
On Tuesday 12 February 2008, David Miller wrote:
> From: Chris Mason [EMAIL PROTECTED]
> Date: Wed, 6 Feb 2008 12:00:13 -0500
> > So, here's v0.12.
>
> Any page size larger than 4K will not work with btrfs. All of the extent
> stuff assumes that PAGE_SIZE = sectorsize.

Yeah, there is definitely clean up to do in that area.

> I confirmed this by forcing mkfs.btrfs to use an 8K sectorsize on sparc64
> and I was finally able to successfully mount a partition.

Nice

> With 4K there are zeros in the root tree node header, because its extent's
> location on disk is at a sub-PAGE_SIZE multiple and the extent code doesn't
> handle that. You really need to start validating this stuff on other
> platforms. Something that isn't little endian and something that doesn't
> use 4K pages. I'm sure you have some powerpc parts around somewhere. :)

Grin, I think around v0.4 I grabbed a ppc box for a day and got things working. There has been some churn since then... My first prio is the newest set of disk format changes, and then I'll sit down and work on stability on a bunch of arches.

> Anyways, here is a patch for the kernel bits which fixes most of the
> unaligned accesses on sparc64.

Many thanks, I'll try these out here and push them into the tree.

-chris
Re: BTRFS partition usage...
On Tuesday 12 February 2008, Jan Engelhardt wrote:
> On Feb 12 2008 09:08, Chris Mason wrote:
> > So, if Btrfs starts zeroing at 1k, will that be acceptable for you?
>
> Something looks wrong here. Why would btrfs need to zero at all?
> Superblock at 0, and done. Just like xfs. (Yes, I had xfs on sparc before,
> so it's not like you NEED the whitespace at the start of a partition.)

I've had requests to move the super down to 64k to make room for bootloaders, which may not matter for sparc, but I don't really plan on different locations for different arches.

> In x86, there is even more space for a bootloader (some 28k or so) even if
> your partition table is as closely packed as possible, from 0 to 7e00 IIRC.
> For sparc you could have something like
>
>          start lba   end lba   type
>   sda1   0           2         1   Boot
>   sda2   2           58        3   Whole disk
>   sda3   58          9         83  Linux
>
> and slap the bootloader into MBR, just like on x86. Or I am missing
> something..

It was a request from hpa, and he clearly had something in mind. He kindly offered to review the disk format for bootloaders and other lower level issues, but I asked him to wait until I firm it up a bit. From my point of view, 0 is a bad idea because it is very likely to conflict with other things.

There are lots of things in the FS that need deep thought, and the perfect system to fully use the first 64k of a 1TB filesystem isn't quite at the top of my list right now ;)

Regardless of offset, it is a good idea to mop up previous filesystems where possible, and a very good idea to align things on some sector boundary. Even going 1MB in wouldn't be a horrible idea to align with erasure blocks on SSD.

-chris
Re: BTRFS partition usage...
On Tuesday 12 February 2008, Jan Engelhardt wrote:
> On Feb 12 2008 08:49, Chris Mason wrote:
> > > This is a real issue on sparc where the default sun disk labels created
> > > use an initial partition where block zero aliases the disk label. It
> > > took me a few iterations before I figured out why every btrfs make
> > > would zero out my disk label :-/
> > >
> > > Actually it seems this is only a problem with mkfs.btrfs, it clears out
> > > the first 64 4K chunks of the disk for whatever reason.
> >
> > It is a good idea to remove supers from other filesystems. I also need
> > to add zeroing at the end of the device as well. Looks like I misread
> > the e2fs zeroing code. It zeros the whole external log device, and I
> > assumed it also zeroed out the start of the main FS.
> >
> > So, if Btrfs starts zeroing at 1k, will that be acceptable for you?
>
> Something looks wrong here. Why would btrfs need to zero at all?
> Superblock at 0, and done. Just like xfs. (Yes, I had xfs on sparc before,
> so it's not like you NEED the whitespace at the start of a partition.)

I've had requests to move the super down to 64k to make room for bootloaders, which may not matter for sparc, but I don't really plan on different locations for different arches. 4k alignment is important given that sector sizes are growing.

-chris
Re: BTRFS partition usage...
On Tuesday 12 February 2008, David Miller wrote:
> From: David Miller [EMAIL PROTECTED]
> Date: Mon, 11 Feb 2008 23:21:39 -0800 (PST)
> > Filesystems like ext2 put their superblock 1 block into the partition in
> > order to avoid overwriting disk labels and other uglies. UFS does this
> > too, as do several others. One of the few exceptions I've been able to
> > find is XFS.
>
> This is a real issue on sparc where the default sun disk labels created
> use an initial partition where block zero aliases the disk label. It took
> me a few iterations before I figured out why every btrfs make would zero
> out my disk label :-/
>
> Actually it seems this is only a problem with mkfs.btrfs, it clears out
> the first 64 4K chunks of the disk for whatever reason.

It is a good idea to remove supers from other filesystems. I also need to add zeroing at the end of the device as well. Looks like I misread the e2fs zeroing code. It zeros the whole external log device, and I assumed it also zeroed out the start of the main FS.

So, if Btrfs starts zeroing at 1k, will that be acceptable for you?

-chris
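As a rough illustration of the 1k-offset zeroing being discussed (done on a file-backed stand-in rather than a real device; the offsets are the ones from this thread, not necessarily what mkfs.btrfs ended up doing):

```shell
# Stand-in "device": 64 1K blocks of random data, like a disk with an
# old filesystem on it and a label in block 0.
DEV=$(mktemp)
dd if=/dev/urandom of="$DEV" bs=1k count=64 2>/dev/null

# Zero from offset 1k onward, preserving block 0 (the disk label) -
# clearing old supers without clobbering the label.
dd if=/dev/zero of="$DEV" bs=1k seek=1 count=63 conv=notrunc 2>/dev/null
```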
[ANNOUNCE] Btrfs v0.12 released
Hello everyone,

I wasn't planning on releasing v0.12 yet, and it was supposed to have some initial support for multiple devices. But, I have made a number of performance fixes and small bug fixes, and I wanted to get them out there before the (destabilizing) work on multiple devices took over.

So, here's v0.12. It comes with a shiny new disk format (sorry), but the gain is dramatically better random writes to existing files. In testing here, the random write phase of tiobench went from 1MB/s to 30MB/s. The fix was to change the way back references for file extents were hashed.

Other changes:

* Insert and delete multiple items at once in the btree where possible.
  Back references added more tree balances, and it showed up in a few
  benchmarks. With v0.12, backrefs have no real impact on performance.

* Optimize bio end_io routines. Btrfs was spending way too much CPU time in
  the bio end_io routines, leading to lock contention and other problems.

* Optimize read ahead during transaction commit. The old code was trying to
  read far too much at once, which made the end_io problems really stand
  out.

* mount -o ssd option, which clusters file data writes together regardless
  of the directory the files belong to. There are a number of other
  performance tweaks for SSD, aimed at clustering metadata and data writes
  to better take advantage of the hardware.

* mount -o max_inline=size option, to override the default max inline file
  data size (default is 8k). Any value up to the leaf size is allowed
  (default 16k).

* Simple -ENOSPC handling. Emphasis on simple, but it prevents accidentally
  filling the disk most of the time. With enough threads/procs banging on
  things, you can still easily crash the box.

-chris
Re: [RFC] ext3: per-process soft-syncing data=ordered mode
On Thursday 31 January 2008, Jan Kara wrote:
> On Thu 31-01-08 11:56:01, Chris Mason wrote:
> > On Thursday 31 January 2008, Al Boldi wrote:
> > > Andreas Dilger wrote:
> > > > On Wednesday 30 January 2008, Al Boldi wrote:
> > > > > And, a quick test of successive 1sec delayed syncs shows no hangs
> > > > > until about 1 minute (~180mb) of db-writeout activity, when the
> > > > > sync abruptly hangs for minutes on end, and io-wait shows almost
> > > > > 100%.
> > > >
> > > > How large is the journal in this filesystem? You can check via
> > > > debugfs -R 'stat <8>' /dev/XXX.
> > >
> > > 32mb.
> > >
> > > > Is this affected by increasing the journal size? You can set the
> > > > journal size via mke2fs -J size=400 at format time, or on an
> > > > unmounted filesystem by running tune2fs -O ^has_journal /dev/XXX
> > > > then tune2fs -J size=400 /dev/XXX.
> > >
> > > Setting size=400 doesn't help, nor does size=4.
> > >
> > > > I suspect that the stall is caused by the journal filling up, and
> > > > then waiting while the entire journal is checkpointed back to the
> > > > filesystem before the next transaction can start. It is possible to
> > > > improve this behaviour in JBD by reducing the amount of space that
> > > > is cleared if the journal becomes full, and also doing journal
> > > > checkpointing before it becomes full. While that may reduce
> > > > performance a small amount, it would help avoid such huge latency
> > > > problems. I believe we have such a patch in one of the Lustre
> > > > branches already, and while I'm not sure what kernel it is for, the
> > > > JBD code rarely changes much.
> > >
> > > The big difference between ordered and writeback is that once the
> > > slowdown starts, ordered goes into ~100% iowait, whereas writeback
> > > continues 100% user.
> >
> > Does data=ordered write buffers in the order they were dirtied? This
> > might explain the extreme problems in transactional workloads.
>
> Well, it does, but we submit them to the block layer all at once so the
> elevator should sort the requests for us...

nr_requests is fairly small, so a long stream of random requests should still end up being random IO.

Al, could you please compare the write throughput from vmstat for the data=ordered vs data=writeback runs?
I would guess the data=ordered one has a lower overall write throughput.

-chris
Re: [RFC] ext3: per-process soft-syncing data=ordered mode
On Thursday 31 January 2008, Al Boldi wrote:
> Andreas Dilger wrote:
> > On Wednesday 30 January 2008, Al Boldi wrote:
> > > And, a quick test of successive 1sec delayed syncs shows no hangs
> > > until about 1 minute (~180mb) of db-writeout activity, when the sync
> > > abruptly hangs for minutes on end, and io-wait shows almost 100%.
> >
> > How large is the journal in this filesystem? You can check via
> > debugfs -R 'stat <8>' /dev/XXX.
>
> 32mb.
>
> > Is this affected by increasing the journal size? You can set the journal
> > size via mke2fs -J size=400 at format time, or on an unmounted
> > filesystem by running tune2fs -O ^has_journal /dev/XXX then
> > tune2fs -J size=400 /dev/XXX.
>
> Setting size=400 doesn't help, nor does size=4.
>
> > I suspect that the stall is caused by the journal filling up, and then
> > waiting while the entire journal is checkpointed back to the filesystem
> > before the next transaction can start. It is possible to improve this
> > behaviour in JBD by reducing the amount of space that is cleared if the
> > journal becomes full, and also doing journal checkpointing before it
> > becomes full. While that may reduce performance a small amount, it would
> > help avoid such huge latency problems. I believe we have such a patch in
> > one of the Lustre branches already, and while I'm not sure what kernel
> > it is for, the JBD code rarely changes much.
>
> The big difference between ordered and writeback is that once the slowdown
> starts, ordered goes into ~100% iowait, whereas writeback continues 100%
> user.

Does data=ordered write buffers in the order they were dirtied? This might explain the extreme problems in transactional workloads.

-chris
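The mke2fs/tune2fs recipe quoted above can be tried risk-free on a file-backed image (needs e2fsprogs; no root required since nothing is mounted; sizes here are just illustrative):

```shell
# 128MB scratch image standing in for /dev/XXX.
IMG=$(mktemp)
dd if=/dev/zero of="$IMG" bs=1M count=128 2>/dev/null

# Format as ext3 (ext2 + journal) with a 16MB journal.
mke2fs -q -F -j -J size=16 "$IMG"

# Remove the journal, then recreate it at 32MB - the same two-step
# resize described in the thread.
tune2fs -O ^has_journal "$IMG" >/dev/null
tune2fs -j -J size=32 "$IMG" >/dev/null
```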
Re: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66)
On Friday 25 January 2008, Jan Kara wrote:
> > If ext3's DIO code only touches transactions in get_block, then it can
> > violate data=ordered rules. Basically the transaction that allocates the
> > blocks might commit before the DIO code gets around to writing them. A
> > crash in the wrong place will expose stale data on disk.
>
> Hmm, I've looked at it and I don't think so - look at the rationale in the
> patch below... That patch should fix the lock-inversion problem (at least
> I see no lockdep warnings on my test machine).

Ah ok, when I was looking at this I was allowing holes to get filled without falling back to buffered. But, with the orphan inode entry protecting things I see how you're safe with this patch.

-chris
Re: konqueror deadlocks on 2.6.22
On Tuesday 22 January 2008, Al Boldi wrote:
> Ingo Molnar wrote:
> > * Oliver Pinter (Pintér Olivér) [EMAIL PROTECTED] wrote:
> > > and then please update to CFS-v24.1
> > > http://people.redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.6.22.15-v24.1.patch
> > >
> > > Yes with CFSv20.4, as in the log. It also hangs on 2.6.23.13
> >
> > my feeling is that this is some sort of timing dependent race in
> > konqueror/kde/qt that is exposed when a different scheduler is put in.
> > If it disappears with CFS-v24.1 it is probably just because the timings
> > will change again. Would be nice to debug this on the konqueror side and
> > analyze why it fails and how. You can probably tune the timings by
> > enabling SCHED_DEBUG and tweaking /proc/sys/kernel/*sched* values - in
> > particular sched_latency and the granularity settings. Setting wakeup
> > granularity to 0 might be one of the things that could make a
> > difference.
>
> Thanks Ingo, but Mike suggested that data=writeback may make a difference,
> which it does indeed. So the bug seems to be related to data=ordered,
> although I haven't gotten any feedback from the ext3 gurus yet. Seems
> rather critical though, as data=writeback is a dangerous mode to run.

Running fsync in data=ordered means that all of the dirty blocks on the FS will get written before fsync returns. Your original stack trace shows everyone either performing writeback for a log commit or waiting for the log commit to return.

The key task in your trace is kjournald, stuck in get_request_wait. It could be a block layer bug, not giving him requests quickly enough, or it could be the scheduler not giving him back the cpu fast enough. At any rate, that's where to concentrate the debugging.
You should be able to simulate this by running a few instances of the below loop and looking for stalls:

while true; do
        time dd if=/dev/zero of=foo bs=50M count=4 oflag=sync
done
Re: konqueror deadlocks on 2.6.22
On Tuesday 22 January 2008, Al Boldi wrote:
> Chris Mason wrote:
> > Running fsync in data=ordered means that all of the dirty blocks on the
> > FS will get written before fsync returns.
>
> Hm, that's strange, I expected this kind of behaviour from data=journal.
> data=writeback should return immediately, which seems it does, but
> data=ordered should only wait for the metadata flush; it shouldn't wait
> for the file data flush. Are you sure it waits for both?

I oversimplified. data=ordered means that all data blocks are written before the metadata that references them commits. So, if you add 1GB to a fileA in a transaction and then run fsync(fileB) in the same transaction, the 1GB from fileA is sent to disk (and waited on) before the fsync on fileB returns.

-chris
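A crude way to watch for this effect (file names are illustrative; GNU dd's conv=fsync makes it fsync its output file before exiting, so the second command plays the role of fsync(fileB)):

```shell
# Dirty a pile of fileA data, then time a tiny synchronous write to fileB.
# On ext3 data=ordered the fsync of fileB can end up waiting on fileA's
# dirty blocks as well; on data=writeback it usually will not.
dd if=/dev/zero of=fileA bs=1M count=64 conv=notrunc 2>/dev/null
time dd if=/dev/zero of=fileB bs=4k count=1 conv=fsync 2>/dev/null
```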
Re: [Btrfs-devel] [ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)
On Thursday 17 January 2008, Christian Hesse wrote:
> On Thursday 17 January 2008, Chris Mason wrote:
> > So, I've put v0.11 out there.
>
> Ok, back to the suspend problem I mentioned:
>
> [ oopsen ]
>
> I get this after a suspend/resume cycle with mounted btrfs.

Looks like metadata corruption. How are you suspending?

-chris
Re: [Btrfs-devel] [ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)
On Tuesday 15 January 2008, Chris Mason wrote:
> Hello everyone,
>
> Btrfs v0.10 is now available for download from:
> http://oss.oracle.com/projects/btrfs/

Well, it turns out this release had a few small problems:

* data=ordered deadlock on older kernels (including 2.6.23)
* Compile problems when ACLs were not enabled in the kernel

So, I've put v0.11 out there. It fixes those two problems and will also compile on older (2.6.18) enterprise kernels. v0.11 does not have any disk format changes.

-chris
Re: [Btrfs-devel] [ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)
On Thursday 17 January 2008, Daniel Phillips wrote:
> On Jan 17, 2008 1:25 PM, Chris Mason [EMAIL PROTECTED] wrote:
> > So, I've put v0.11 out there. It fixes those two problems and will also
> > compile on older (2.6.18) enterprise kernels. v0.11 does not have any
> > disk format changes.
>
> Hi Chris,
>
> First, massive congratulations for bringing this to fruition in such a
> short time. Now back to the regular carping: why even support older
> kernels?

The general answer is that the backports are small and easy. I don't test them heavily, and I don't go out of my way to make things work. But, they do make it easier for people to try out, and to figure out how to use all these new features to solve problems. Small changes that enable more testers are always welcome.

In general, the core parts of the kernel that btrfs uses haven't had many interface changes since 2.6.18, so this isn't a huge deal.

-chris
[ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)
Hello everyone,

Btrfs v0.10 is now available for download from:

http://oss.oracle.com/projects/btrfs/

Btrfs is still in an early alpha state, and the disk format is not finalized. v0.10 introduces a new disk format, and is not compatible with v0.9.

The core of this release is explicit back references for all metadata blocks, data extents, and directory items. These are a crucial building block for future features such as online fsck and migration between devices. The back references are verified during deletes, and the extent back references are checked by the existing offline fsck tool. For all of the details of how the back references are maintained, please see the design document:

http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-design.html

Other new features (described in detail below):

* Online resizing (including shrinking)
* In place conversion from Ext3 to Btrfs
* data=ordered support
* Mount options to disable data COW and checksumming
* Barrier support for sata and IDE drives

[ Resizing ]

In order to demonstrate and test the back references, I've added an online resizer, which can both grow and shrink the filesystem:

mount -t btrfs /dev/xxx /mnt

# add 2GB to the FS
btrfsctl -r +2g /mnt

# shrink the FS by 4GB
btrfsctl -r -4g /mnt

# Explicitly set the FS size
btrfsctl -r 20g /mnt

# Use 'max' to grow the FS to the limit of the device
btrfsctl -r max /mnt

[ Conversion from Ext3 ]

This is an offline, in place, conversion program written by Yan Zheng. It has been through basic testing, but should not be trusted with critical data. To build the conversion program, run 'make convert' in the btrfs-progs tree. It depends on libe2fs and acl development libraries.

The conversion program uses the copy on write nature of Btrfs to preserve the original Ext3 FS, sharing the data blocks between Btrfs and Ext3 metadata.
Btrfs metadata is created inside the free space of the Ext3 filesystem, and it is possible to either make the conversion permanent (reclaiming the space used by Ext3) or roll back the conversion to the original Ext3 filesystem. More details and example usage of the conversion program can be found here:

http://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-converter.html

Thanks to Yan Zheng for all of his work on the converter.

[ New mount options ]

mount -o nodatacsum disables checksumming on data extents.

mount -o nodatacow disables copy on write of data extents, unless a given extent is referenced by more than one snapshot. This is targeted at database workloads, where copy on write is not optimal for performance. The explicit back references allow the nodatacow code to make sure copy on write is done when multiple snapshots reference the same file, maintaining snapshot consistency.

mount -o alloc_start=num forces allocation hints to start at least num bytes into the disk. This was introduced to test the resizer. Example usage:

mount -o alloc_start=16g /dev/ /mnt
(do something to the FS)
btrfsctl -r 12g /mnt

The btrfsctl command will resize the FS down to 12GB in size. Because the FS was mounted with -o alloc_start=16g, any allocations done after mounting will need to be relocated by the resizer. It is safe to specify a number past the end of the FS; if the alloc_start is too large, it is ignored.

mount -o nobarrier disables cache flushes during commit.

-chris
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
On Tue, 15 Jan 2008 20:24:27 -0500 Daniel Phillips [EMAIL PROTECTED] wrote:
> On Jan 15, 2008 7:15 PM, Alan Cox [EMAIL PROTECTED] wrote:
> > > Writeback cache on disk in itself is not bad, it only gets bad if the
> > > disk is not engineered to save all its dirty cache on power loss,
> > > using the disk motor as a generator or alternatively a small battery.
> > > It would be awfully nice to know which brands fail here, if any,
> > > because writeback cache is a big performance booster.
> >
> > AFAIK no drive saves the cache. The worst case cache flush for drives is
> > several seconds with no retries and a couple of minutes if something
> > really bad happens. This is why the kernel has some knowledge of
> > barriers and uses them to issue flushes when needed.
>
> Indeed, you are right, which is supported by actual measurements:
>
> http://sr5tech.com/write_back_cache_experiments.htm
>
> Sorry for implying that anybody has engineered a drive that can do such a
> nice thing with writeback cache.
>
> The disk motor as a generator tale may not be purely folklore. When an
> IDE drive is not in writeback mode, something special needs to be done to
> ensure the last write to media is not a scribble.
>
> A small UPS can make writeback mode actually reliable, provided the
> system is smart enough to take the drives out of writeback mode when the
> line power is off.

We've had mount -o barrier=1 for ext3 for a while now; it makes writeback caching safe. XFS has this on by default, as does reiserfs.

-chris
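For anyone wanting to turn this on persistently, the option goes on the mount command line or in fstab; a typical (illustrative) fstab line:

```
# /etc/fstab - ext3 with write barriers enabled, so drive writeback
# caching stays safe across power loss
/dev/sda1  /home  ext3  defaults,barrier=1  0  2
```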
Re: lockdep warning with LTP dio test (v2.6.24-rc6-125-g5356f66)
On Mon, 14 Jan 2008 18:06:09 +0100 Jan Kara [EMAIL PROTECTED] wrote:
> On Wed 02-01-08 12:42:19, Zach Brown wrote:
> > Erez Zadok wrote:
> > > Setting: ltp-full-20071031, dio01 test on ext3 with Linus's latest
> > > tree. Kernel w/ SMP, preemption, and lockdep configured.
> >
> > This is a real lock ordering problem. Thanks for reporting it.
> >
> > The updating of atime inside sys_mmap() orders the mmap_sem in the vfs
> > outside of the journal handle in ext3's inode dirtying:
> >
> > [ lock inversion traces ]
> >
> > Two fixes come to mind:
> >
> > 1) use something like Peter's ->mmap_prepare() to update atime before
> > acquiring the mmap_sem. ( http://lkml.org/lkml/2007/11/11/97 ). I don't
> > know if this would leave more paths which do a journal_start() while
> > holding the mmap_sem.
> >
> > 2) rework ext3's dio to only hold the jbd handle in ext3_get_block().
> > Chris has a patch for this kicking around somewhere but I'm told it has
> > problems exposing old blocks in ordered data mode.
> >
> > Does anyone have preferences? I could go either way. I certainly don't
> > like the idea of journal handles being held across the entirety of
> > fs/direct-io.c. It's yet another case of O_DIRECT differing wildly from
> > the buffered path :(.
>
> I've looked more into it and I think that 2) is the only way to go since
> transaction start ranks below page lock (standard buffered write path)
> and page lock ranks below mmap_sem. So we have at least one more
> dependency: mmap_sem must go before transaction start...

Just to clarify a little bit: if ext3's DIO code only touches transactions in get_block, then it can violate data=ordered rules. Basically the transaction that allocates the blocks might commit before the DIO code gets around to writing them. A crash in the wrong place will expose stale data on disk.

-chris
Re: [PATCH][RFC] fast file mapping for loop
On Fri, 11 Jan 2008 10:01:18 +1100 Neil Brown [EMAIL PROTECTED] wrote: On Thursday January 10, [EMAIL PROTECTED] wrote: On Thu, Jan 10 2008, Chris Mason wrote: On Thu, 10 Jan 2008 09:31:31 +0100 Jens Axboe [EMAIL PROTECTED] wrote: On Wed, Jan 09 2008, Alasdair G Kergon wrote: Here's the latest version of dm-loop, for comparison. To try it out, ln -s dmsetup dmlosetup and supply similar basic parameters to losetup. (using dmsetup version 1.02.11 or higher) Why oh why does dm always insist to reinvent everything? That's bad enough in itself, but on top of that most of the extra stuff ends up being essentially unmaintained. I don't quite get how the dm version is reinventing things. They use Things like raid, now file mapping functionality. I'm sure there are more examples, it's how dm was always developed probably originating back to when they developed mostly out of tree. And I think it's a bad idea, we don't want duplicate functionality. If something is wrong with loop, fix it, don't write dm-loop. I'm with Jens here. We currently have two interfaces that interesting block devices can be written for: 'dm' and 'block'. We really should aim to have just one. I would call it 'block' and move anything really useful from dm into block. As far as I can tell, the important things that 'dm' has that 'block' doesn't have are: - a standard ioctl interface for assembling and creating interesting devices. For 'block', everybody just rolls their own. e.g. md, loop, and nbd all use totally different approaches for setup and tear down etc. - suspend/reconfigure/resume. This is something that I would really like to see in 'block'. If I had a filesystem mounted on /dev/sda1 and I wanted to make it a raid1, it would be cool if I could suspend /dev/sda1, build a raid1 from sda1 and something else, plug that raid1 in as 'sda1', and resume sda1. - Integrated 'linear' mapping. This is the bit of 'dm' that I think of as yucky.
If I read the code correctly, every dm device is a linear array of a bunch of targets. Each target can be a stripe-set(raid0) or a multipath or a raid1 or a plain block device or whatever. Having 'linear' at a different level to everything else seems a bit ugly, but it isn't really a big deal. DM is also a framework where you can introduce completely new types of block devices without having to go through the associated pain of finding major numbers. In terms of developing new things with greater flexibility, I think it is easier. I would really like to see every 'dm' target being just a regular 'block' device. Then a 'linear' block device could be used to assemble dm targets into a dm device. Or the targets could be used directly if the 'linear' function wasn't needed. Each target/device could respond to both dm ioctls and 'adhoc' ioctls. That is a bit ugly, but backwards compatibility always is, but it isn't a big cost. I think the way forward here is to put the important suspend/reconfig/resume functionality into the block layer, then work on making code work with multiple ioctl interfaces. I *don't* think the way forward is to duplicate current block devices as dm targets. This is duplication of effort (which I admit isn't always a bad thing) and a maintenance headache (which is). raid in dm aside (that's an entirely different debate ;), loop is a pile of things which dm can nicely layer out into pieces (dm-crypt vs loopback crypt). Also, dm doesn't have to jump through hoops to get a variable number of minors. Yes, the loop side was recently improved for # of minors, and it does have enough in there for userland to do variable number of minors, but this is one specific case where dm is just easier. At any rate, I'm all for ideas that make dm less of the evil stepchild of the block layer ;) I'm not saying everything should be dm, but I did want to point out that dm-loop isn't entirely silly. 
I have a version of Jens' patch in testing here that makes a new API with the FS for mapping extents and hope to post it later today. -chris
Re: [PATCH][RFC] fast file mapping for loop
On Thu, 10 Jan 2008 09:31:31 +0100 Jens Axboe [EMAIL PROTECTED] wrote: On Wed, Jan 09 2008, Alasdair G Kergon wrote: Here's the latest version of dm-loop, for comparison. To try it out, ln -s dmsetup dmlosetup and supply similar basic parameters to losetup. (using dmsetup version 1.02.11 or higher) Why oh why does dm always insist to reinvent everything? That's bad enough in itself, but on top of that most of the extra stuff ends up being essentially unmaintained. I don't quite get how the dm version is reinventing things. They use the dmsetup command that they use for everything else and provide a small and fairly clean module for bio specific loop instead of piling it onto loop.c. Their code doesn't have the fancy hole handling that yours does, but neither did yours 4 days ago ;) If we instead improve loop, everyone wins. Sorry to sound a bit harsh, but sometimes it doesn't hurt to think a bit outside your own sandbox. It is a natural fit in either place, as both loop and dm have a good infrastructure for it. I'm not picky about where it ends up, but dm wouldn't be a bad place. -chris
Re: [PATCH][RFC] fast file mapping for loop
On Thu, 10 Jan 2008 08:54:59 + Christoph Hellwig [EMAIL PROTECTED] wrote: On Thu, Jan 10, 2008 at 09:44:57AM +0100, Jens Axboe wrote: IMHO this shouldn't be done in the loop driver anyway. Filesystems have their own efficient extent lookup trees (well, at least xfs and btrfs do), and we should leverage that instead of reinventing it. Completely agree, it's just needed right now for this solution since all we have is a crappy bmap() interface to get at those mappings. So let's fix the interface instead of piling crap on top of it. As I said I think Peter has something to start with so let's beat on it until we have something suitable. If we aren't done by end of Feb I'm happy to host a hackfest to get it sorted around the fs/storage summit.. Ok, I've been meaning to break my extent_map code up, and this is a very good reason. I'll work up a sample today based on Jens' code. The basic goals: * Loop (swap) calls into the FS for each mapping. Any caching happens on the FS side. * The FS returns an extent, filling any holes Swap would need to use an extra call early on for preallocation. Step two is having a call back into the FS to allow the FS to delay the bios until commit completion so that COW and delalloc blocks can be fully on disk when the bios are reported as done. Jens, can you add some way to queue the bio completions up? -chris
Re: [PATCH][RFC] fast file mapping for loop
On Thu, 10 Jan 2008 14:03:24 +0100 Jens Axboe [EMAIL PROTECTED] wrote: On Thu, Jan 10 2008, Chris Mason wrote: On Thu, 10 Jan 2008 08:54:59 + Christoph Hellwig [EMAIL PROTECTED] wrote: On Thu, Jan 10, 2008 at 09:44:57AM +0100, Jens Axboe wrote: IMHO this shouldn't be done in the loop driver anyway. Filesystems have their own efficient extent lookup trees (well, at least xfs and btrfs do), and we should leverage that instead of reinventing it. Completely agree, it's just needed right now for this solution since all we have is a crappy bmap() interface to get at those mappings. So let's fix the interface instead of piling crap on top of it. As I said I think Peter has something to start with so let's beat on it until we have something suitable. If we aren't done by end of Feb I'm happy to host a hackfest to get it sorted around the fs/storage summit.. Ok, I've been meaning to break my extent_map code up, and this is a very good reason. I'll work up a sample today based on Jens' code. Great! Grin, we'll see how the sample looks. The basic goals: * Loop (swap) calls into the FS for each mapping. Any caching happens on the FS side. * The FS returns an extent, filling any holes We don't want to fill holes for a read, but I guess that's a given? Right. Swap would need to use an extra call early on for preallocation. Step two is having a call back into the FS to allow the FS to delay the bios until commit completion so that COW and delalloc blocks can be fully on disk when the bios are reported as done. Jens, can you add some way to queue the bio completions up? Sure, a function to save a completed bio and a function to execute completions on those already stored? Sounds right, I'm mostly looking for a way to aggregate a few writes to make the commits a little larger. -chris
Re: [PATCH][RFC] fast file mapping for loop
On Wed, 9 Jan 2008 10:43:21 +0100 Jens Axboe [EMAIL PROTECTED] wrote: On Wed, Jan 09 2008, Christoph Hellwig wrote: On Wed, Jan 09, 2008 at 09:52:32AM +0100, Jens Axboe wrote: - The file block mappings must not change while loop is using the file. This means that we have to ensure exclusive access to the file and this is the bit that is currently missing in the implementation. It would be nice if we could just do this via open(), ideas welcome... And the way this is done is simply broken. It means you have to get rid of things like delayed or unwritten extents beforehand, it'll be a complete pain for COW or non-block backed filesystems. COW is not that hard to handle, you just need to be notified of moving blocks. If you view the patch as just a tighter integration between loop and fs, I don't think it's necessarily that broken. Filling holes (delayed allocation) and COW are definitely a problem. But at least for the loop use case, most non-cow filesystems will want to preallocate the space for the loop file and be done with it. Sparse loop definitely has uses, but generally those users are willing to pay a little performance. Jens' patch falls back to buffered writes for the hole case and pretends cow doesn't exist. It's a good starting point that I hope to extend with something like the extent_map apis. I did consider these cases, and it can be done with the existing approach. The right way to do this is to allow direct I/O from kernel sources where the filesystem is in charge of submitting the actual I/O after the pages are handed to it. I think Peter Zijlstra has been looking into something like that for swap over nfs. That does sound like a nice approach, but a lot more work. It'll behave differently too, the advantage of what I proposed is that it behaves like a real device. The problem with O_DIRECT (or even O_SYNC) loop is that every write into loop becomes synchronous, and it really changes the performance of things like filemap_fdatawrite.
If we just hand ownership of the file over to loop entirely and prevent other openers (perhaps even forcing backups through the loop device), we get fewer corner cases and much better performance. -chris
[ANNOUNCE] Btrfs v0.9
Hello everyone, I've just tagged and released Btrfs v0.9. Special thanks to Yan Zheng and Josef Bacik for their work. This release includes a number of disk format changes from v0.8 and also a small change from recent btrfs-unstable HG trees. So, if you have existing Btrfs filesystems, you will need to backup, reformat and restore to try out v0.9. You can find download links and other details here: http://oss.oracle.com/projects/btrfs/ Since v0.8: * Support for btree blocks larger than the page size. mkfs.btrfs defaults to 8k blocks, but -l and -n can be used to set the block size for leaves and nodes. Powers of 2 are required, example: mkfs.btrfs -l 32768 -n 32768 /dev/ * Support for inline (packed into the btree) file data larger than the page size. Any file smaller than a btree block will probably be packed into the btree. * Xattr support (no ACLs yet) from Josef Bacik. This works for generic user xattrs and was tested with beagle among other things. * Stripe size parameter to mkfs.btrfs (-s size_in_bytes). Extents will be aligned to the stripe size for performance. * Many performance and stability fixes, especially on 32 bit x86 machines. Unfixed: ENOSPC handling. Things are much more predictable now, and Btrfs will work up until the disk is very close to full. Concurrency: Everything is still protected by a single mutex, which is held during IO. Multi-threaded benchmarks will not perform well. Database performance: Still very slow in database workloads. You can get an idea of where Btrfs is headed from the TODO list: http://oss.oracle.com/projects/btrfs/dist/documentation/todo.html -chris
Reminder: Last day for submissions to the Storage and Filesystem Workshop.
Hello everyone, The deadline for position statements to the Linux Storage and Filesystem Workshop is here. Submitting a position statement is an easy way for you to tell the organizers that you would like to attend, and which topics you are most interested in. You can find all the details about the workshop here: http://www.usenix.org/events/lsf08/ The Linux Storage and Filesystem Workshop is a small, tightly focused, by-invitation workshop. It is intended to bring together developers and researchers interested in implementing improvements in the Linux filesystem and storage subsystems that can find their way into the mainline kernel and into Linux distributions in the 1–2 year timeframe. The workshop will be two days and will be separated into storage and filesystem tracks, with some combined plenary sessions. The workshop will be held Feb 25 and 26 in San Jose. -chris
Reminder: Linux Storage and Filesystem Workshop
Hello everyone, The deadline for position statements to the Linux Storage and Filesystem Workshop is quickly approaching. The position statements are an easy way for you to tell the organizers that you would like to attend, and which topics you are most interested in. You can find all the details about the workshop here: http://www.usenix.org/events/lsf08/ The Linux Storage and Filesystem Workshop is a small, tightly focused, by-invitation workshop. It is intended to bring together developers and researchers interested in implementing improvements in the Linux filesystem and storage subsystems that can find their way into the mainline kernel and into Linux distributions in the 1–2 year timeframe. The workshop will be two days and will be separated into storage and filesystem tracks, with some combined plenary sessions. The workshop will be held Feb 25 and 26 in San Jose. -chris
Re: migratepage failures on reiserfs
On Mon, 5 Nov 2007 10:23:35 + [EMAIL PROTECTED] (Mel Gorman) wrote: On (01/11/07 10:10), Badari Pulavarty didst pronounce: Hmpf, my first reply had a paragraph about the block device inode pages, I noticed the phrase file data pages and deleted it ;) But, for the metadata buffers there's not much we can do. They are included in a bunch of different lists and the patch would be non-trivial. Unfortunately, these buffer pages are spread all around making those sections of memory non-removable. Of course, one can use ZONE_MOVABLE to make sure to guarantee the remove. But I am hoping we could easily group all these allocations and minimize spreading them around. Mel ? The grow_dev_page() pages should be reclaimable even though migration is not supported for those pages? They were marked movable as it was useful for lumpy reclaim taking back pages for hugepage allocations and the like. Would it make sense for memory remove to attempt migration first and reclaim second? In this case, reiserfs has the page pinned while it is doing journal magic. Not sure if ext3 has the same issues. -chris
2008 Linux Storage and Filesystem Workshop
Hello everyone, The position statement submission system for the 2008 storage and filesystem workshop is now online. This is how you let us know you're interested in attending and what topics are most important for discussion. For all the details, please see: http://www.usenix.org/events/lsf08/ -chris
Re: migratepage failures on reiserfs
On Thu, 01 Nov 2007 08:38:57 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: On Wed, 2007-10-31 at 13:40 -0400, Chris Mason wrote: On Wed, 31 Oct 2007 08:14:21 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: I tried data=writeback mode and it didn't help :( Ouch, so much for the easy way out. unable to release the page 262070 bh c000211b9408 flags 110029 count 1 private 0 unable to release the page 262098 bh c00020ec9198 flags 110029 count 1 private 0 memory offlining 3f000 to 4 failed The only other special thing reiserfs does with the page cache is file tails. I don't suppose all of these pages are index zero in files smaller than 4k? Ah !! I am so blind :( I have been suspecting reiserfs all along, since it's executing fallback_migrate_page(). Actually, these buffer heads are backing blockdev. I guess these are metadata buffers :( I am not sure we can do much with these.. Hmpf, my first reply had a paragraph about the block device inode pages, I noticed the phrase file data pages and deleted it ;) But, for the metadata buffers there's not much we can do. They are included in a bunch of different lists and the patch would be non-trivial. -chris
Re: migratepage failures on reiserfs
On Wed, 31 Oct 2007 08:14:21 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: I tried data=writeback mode and it didn't help :( Ouch, so much for the easy way out. unable to release the page 262070 bh c000211b9408 flags 110029 count 1 private 0 unable to release the page 262098 bh c00020ec9198 flags 110029 count 1 private 0 memory offlining 3f000 to 4 failed The only other special thing reiserfs does with the page cache is file tails. I don't suppose all of these pages are index zero in files smaller than 4k? -chris
Re: migratepage failures on reiserfs
On Tue, 30 Oct 2007 10:27:04 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: Hi, While testing hotplug memory remove, I ran into this issue. Given a range of pages hotplug memory remove tries to migrate those pages. migrate_pages() keeps failing to migrate pages containing pagecache pages for reiserfs files. I noticed that reiserfs doesn't have ->migratepage() ops. So, fallback_migrate_page() code tries to do try_to_release_page(). try_to_release_page() fails to drop_buffers() since b_count == 1. Here is what my debug shows: migrate pages failed pfn 258111/flags 3f801 bh cb53f6e0 flags 110029 count 1 Anyone know why the b_count == 1 and not getting dropped to zero? If these are file data pages, the count is probably elevated as part of the data=ordered tracking. You can verify this via b_private, or just mount data=writeback to double check. -chris
Re: migratepage failures on reiserfs
On Tue, 30 Oct 2007 13:54:05 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: On Tue, 2007-10-30 at 13:54 -0400, Chris Mason wrote: On Tue, 30 Oct 2007 10:27:04 -0800 Badari Pulavarty [EMAIL PROTECTED] wrote: Hi, While testing hotplug memory remove, I ran into this issue. Given a range of pages hotplug memory remove tries to migrate those pages. migrate_pages() keeps failing to migrate pages containing pagecache pages for reiserfs files. I noticed that reiserfs doesn't have ->migratepage() ops. So, fallback_migrate_page() code tries to do try_to_release_page(). try_to_release_page() fails to drop_buffers() since b_count == 1. Here is what my debug shows: migrate pages failed pfn 258111/flags 3f801 bh cb53f6e0 flags 110029 count 1 Anyone know why the b_count == 1 and not getting dropped to zero? If these are file data pages, the count is probably elevated as part of the data=ordered tracking. You can verify this via b_private, or just mount data=writeback to double check. Chris, That was my first assumption. But after looking at reiserfs_releasepage(), realized that it would do reiserfs_free_jh() and clears the b_private. I couldn't easily find out who has the ref. against this bh. bh cbdaaf00 flags 110029 count 1 private 0 If I'm reading this correctly the buffer is BH_Lock | BH_Req, perhaps it is currently under IO? The page isn't locked, but data=ordered does IO directly on the buffer heads, without taking the page lock. The easy way to narrow our search is to try without data=ordered, it is certainly complicating things. -chris
Re: [patch 4/6][RFC] Attempt to plug race with truncate
On Fri, 26 Oct 2007 16:37:36 -0700 Mike Waychison [EMAIL PROTECTED] wrote: Attempt to deal with races with truncate paths. I'm not really sure on the locking here, but these seem to be taken by the truncate path. BKL is left as some filesystem may(?) still require it. Signed-off-by: Mike Waychison [EMAIL PROTECTED]

 fs/ioctl.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux-2.6.23/fs/ioctl.c
===================================================================
--- linux-2.6.23.orig/fs/ioctl.c	2007-10-26 15:27:29.0 -0700
+++ linux-2.6.23/fs/ioctl.c	2007-10-26 16:16:28.0 -0700
@@ -43,13 +43,21 @@ static long do_ioctl(struct file *filp,
 static int do_fibmap(struct address_space *mapping, sector_t block,
 		sector_t *phys_block)
 {
+	struct inode *inode = mapping->host;
+
 	if (!capable(CAP_SYS_RAWIO))
 		return -EPERM;
 	if (!mapping->a_ops->bmap)
 		return -EINVAL;
 	lock_kernel();
+	/* Avoid races with truncate */
+	mutex_lock(&inode->i_mutex);
+	/* FIXME: Do we really need i_alloc_sem? */
+	down_read(&inode->i_alloc_sem);

i_alloc_sem will avoid races with filesystems filling holes inside writepage (where i_mutex isn't held). I'd expect everyone to currently give some consistent result (either the old value or the new but not garbage), but I wouldn't expect taking the semaphore to hurt anything. -chris
Re: [patch 0/6][RFC] Cleanup FIBMAP
On Sat, 27 Oct 2007 18:57:06 +0100 Anton Altaparmakov [EMAIL PROTECTED] wrote: Hi, ->bmap is ugly and horrible! If you have to do this at the very least please cause ->bmap64 to be able to return error values in case the file system failed to get the information or indeed such information does not exist as is the case for compressed and encrypted files for example and also for small files that are inside the on-disk inode (NTFS resident files and reiserfs packed tails are examples of this). And another of my pet peeves with ->bmap is that it uses 0 to mean sparse which causes a conflict on NTFS at least as block zero is part of the $Boot system file so it is a real, valid block... NTFS uses -1 to denote sparse blocks internally. Reiserfs and Btrfs also use 0 to mean packed. It would be nice if there was a way to indicate your-data-is-here-but-isn't-alone. But that's more of a feature for the FIEMAP stuff. -chris
Re: [patch 0/6][RFC] Cleanup FIBMAP
On Mon, 29 Oct 2007 12:18:22 -0700 Mike Waychison [EMAIL PROTECTED] wrote: Zach Brown wrote: And another of my pet peeves with ->bmap is that it uses 0 to mean sparse which causes a conflict on NTFS at least as block zero is part of the $Boot system file so it is a real, valid block... NTFS uses -1 to denote sparse blocks internally. Reiserfs and Btrfs also use 0 to mean packed. It would be nice if there was a way to indicate your-data-is-here-but-isn't-alone. But that's more of a feature for the FIEMAP stuff. And maybe we can step back and see what the callers of FIBMAP are doing with the results they're getting. One use is to discover the order in which to read file data that will result in efficient IO. If we had an interface specifically for this use case then perhaps a sparse block would be better reported as the position of the inode relative to other data blocks. Maybe the inode block number in ext* land. Can you clarify what you mean above with an example? I don't really follow. This is a larger topic of helping userland optimize access to groups of files. For example, during a readdir if we knew the next step was to delete all the files found, we could do one type of readahead (or even ordering the returned values). If we knew the next step would be to read all the files found, a different type of readahead would be useful. But, we shouldn't inflict all of this on fibmap/fiemap; we'll get lost trying to make the one true interface for all operations. For grouping operations on files, I think a read_tree syscall with hints for what userland will do (read, stat, delete, list filenames), and a better cookie than readdir should do it. -chris
[CFP] 2008 Linux Storage and Filesystem Workshop
Hello everyone, We are organizing another filesystem and storage workshop in San Jose next Feb 25 and 26. You can find some great writeups of last year's conference on LWN: http://lwn.net/Articles/226351/ This year we're trying to concentrate on more problem solving sessions, short term projects and joint sessions. You can find all the details on the conference webpages: http://www.usenix.org/events/lsf08/ Soon there will be a link for submitting your position statement, which is basically a note to the organizers that you are interested in attending and which topics you think should be covered. We're also looking for people to lead the discussion around the major topics, so please let us know if you're interested in that. The discussion leaders will have input into the people that get invited and the format of the discussion. Please let me know if there are any questions about the workshop. Thanks, Chris
Re: [PATCH] reiserfs: don't drop PG_dirty when releasing sub-page-sized dirty file
On Tue, 23 Oct 2007 19:56:20 +0800 Fengguang Wu [EMAIL PROTECTED] wrote: On Tue, Oct 23, 2007 at 12:07:07PM +0200, Peter Zijlstra wrote: [ adding reiserfs devs to the CC ] Thank you. This fix is kind of crude - even when it fixed Maxim's problem, and survived my stress testing of a lot of patching and kernel compiling. I'd be glad to see better solutions. This should be safe, reiserfs has the buffer heads themselves clean and the page should get cleaned eventually. The cancel_dirty_page call was just an optimization to be VM friendly. -chris
Re: More Large blocksize benchmarks
On Tue, 2007-10-16 at 12:36 +1000, David Chinner wrote: On Mon, Oct 15, 2007 at 08:22:31PM -0400, Chris Mason wrote: Hello everyone, I'm stealing the cc list and reviving an old thread because I've finally got some numbers to go along with the Btrfs variable blocksize feature. The basic idea is to create a read/write interface to map a range of bytes on the address space, and use it in Btrfs for all metadata operations (file operations have always been extent based). So, instead of casting buffer_head->b_data to some structure, I read and write at offsets in a struct extent_buffer. The extent buffer is very small and backed by an address space, and I get large block sizes the same way file_write gets to write to 16k at a time, by finding the appropriate page in the address space. This is an oversimplification since I try to cache these mapping decisions to avoid using too much CPU, but hopefully you get the idea. The advantage to this approach is the changes are all inside Btrfs. No extra kernel patches were required. Dave reported that XFS saw much higher write throughput with large blocksizes, but so far I'm seeing the most benefits during reads. Apples to oranges, Chris ;) Grin, if the two were the same, there'd be no reason to write a new one. I didn't expect faster writes on btrfs, at least not for workloads that did not require reads. The basic idea is to show there are a variety of ways the larger blocks can improve (and hurt) performance. Also, vmap isn't the only implementation path. It's true the Btrfs changes for this were huge, but a big chunk of the changes were for different leaf/node blocksizes, something that may never get used in practice. -chris
More Large blocksize benchmarks
Hello everyone, I'm stealing the cc list and reviving an old thread because I've finally got some numbers to go along with the Btrfs variable blocksize feature. The basic idea is to create a read/write interface to map a range of bytes on the address space, and use it in Btrfs for all metadata operations (file operations have always been extent based). So, instead of casting buffer_head->b_data to some structure, I read and write at offsets in a struct extent_buffer. The extent buffer is very small and backed by an address space, and I get large block sizes the same way file_write gets to write to 16k at a time, by finding the appropriate page in the address space. This is an oversimplification since I try to cache these mapping decisions to avoid using too much CPU, but hopefully you get the idea. The advantage to this approach is the changes are all inside Btrfs. No extra kernel patches were required. Dave reported that XFS saw much higher write throughput with large blocksizes, but so far I'm seeing the most benefits during reads. The next step is a bunch more benchmarks. I've done the first round and posted it here: http://oss.oracle.com/~mason/blocksizes/ The Btrfs code makes it relatively easy to experiment, and so this may be a good step toward figuring out if some automagic solution is worth it in general. I can even use different sizes for nodes and leaves, although I haven't done much testing at all there yet. -chris
Correct behavior on O_DIRECT sparse file writes
Hello everyone,

The test below creates a sparse file and then fills a hole with O_DIRECT. As far as I can tell from reading generic_osync_inode, the filesystem metadata is only forced to disk if i_size changes during the file write. I've tested ext3, xfs and reiserfs and they all skip the commit when filling holes. I would argue that filling holes via O_DIRECT is supposed to commit the metadata required to find those file blocks later. At least on ext3, O_SYNC does force a commit when filling holes (haven't tested others). So, is the current behavior a bug or a feature?

dd if=/dev/zero of=foo bs=1M seek=1 count=1 oflag=direct
hexdump foo | head -n 2
000 62b1 ea2d 73e8 c64f f5ef 1af5 dd09 8ccd
010 75ec 9581 e0ea ae9b e28f b76d a700 4d5b

dd if=/dev/urandom of=foo bs=4k count=1 conv=notrunc oflag=direct
reboot -nf

(after reboot)
hexdump foo
000
*
020

-chris
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Wed, 29 Aug 2007 00:55:30 +1000 David Chinner [EMAIL PROTECTED] wrote: On Fri, Aug 24, 2007 at 09:55:04PM +0800, Fengguang Wu wrote: On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote: On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote: On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:

Notes: (1) I'm not sure inode number is correlated to disk location in filesystems other than ext2/3/4. Or parent dir?

They correspond to the exact location on disk on XFS. But XFS has its own inode clustering (see xfs_iflush), and it can't be moved up into the generic layers because of locking and integration into the transaction subsystem.

(2) It duplicates some function of elevators. Why is it necessary?

The elevators have no clue as to how the filesystem might treat adjacent inodes. In XFS, inode clustering is a fundamental feature of the inode reading and writing, and that is something no elevator can hope to achieve.

Thank you. That explains the linear write curve (perfect!) in Chris' graph. I wonder if XFS can benefit any more from the general writeback clustering. How large would be a typical XFS cluster?

Depends on inode size. Typically they are 8k in size, so anything from 4-32 inodes. The inode writeback clustering is pretty tightly integrated into the transaction subsystem and has some intricate locking, so it's not likely to be easy (or perhaps even possible) to make it more generic.

When I talked to hch about this, he said the order file data pages got written in XFS was still dictated by the order the higher layers sent things down. Shouldn't the clustering still help to have delalloc done in inode order instead of in whatever random order pdflush sends things down now?

-chris
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Wed, 29 Aug 2007 02:33:08 +1000 David Chinner [EMAIL PROTECTED] wrote: On Tue, Aug 28, 2007 at 11:08:20AM -0400, Chris Mason wrote:

I wonder if XFS can benefit any more from the general writeback clustering. How large would be a typical XFS cluster?

Depends on inode size. Typically they are 8k in size, so anything from 4-32 inodes. The inode writeback clustering is pretty tightly integrated into the transaction subsystem and has some intricate locking, so it's not likely to be easy (or perhaps even possible) to make it more generic.

When I talked to hch about this, he said the order file data pages got written in XFS was still dictated by the order the higher layers sent things down.

Sure, that's file data. I was talking about the inode writeback, not the data writeback.

I think we're trying to gain different things from inode based clustering... I'm not worried that the inode be next to the data. I'm going under the assumption that most of the time, the FS will try to allocate inodes in groups in a directory, and so most of the time the data blocks for inode N will be close to inode N+1. So what I'm really trying for here is data block clustering when writing multiple inodes at once. This matters most when files are relatively small and written in groups, which is a common workload. It may make the most sense to change the patch to supply some key for the data block clustering instead of the inode number, but it's an easy first pass.

-chris
Re: [PATCH 0/6] writeback time order/delay fixes take 3
On Thu, 23 Aug 2007 12:47:23 +1000 David Chinner [EMAIL PROTECTED] wrote: On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:

I think we should assume a full scan of s_dirty is impossible in the presence of concurrent writers. We want to be able to pick a start time (right now) and find all the inodes older than that start time. New things will come in while we're scanning. But perhaps that's what you're saying...

At any rate, we've got two types of lists now. One keeps track of age and the other two keep track of what is currently being written. I would try two things:

1) s_dirty stays a list for FIFO. s_io becomes a radix tree that indexes by inode number (or some arbitrary field the FS can set in the inode). Radix tree tags are used to indicate which things in s_io are already in progress or are pending (hand waving because I'm not sure exactly). Inodes are pulled off s_dirty and the corresponding slot in s_io is tagged to indicate IO has started. Any nearby inodes in s_io are also sent down.

The problem with this approach is that it only looks at inode locality. Data locality is ignored completely here and the data for all the inodes that are close together could be splattered all over the drive. In that case, clustering by inode location is exactly the wrong thing to do.

Usually it won't be less wrong than clustering by time.

For example, XFS changes allocation strategy at 1TB for 32-bit inode filesystems, which makes the data get placed way away from the inodes, i.e. inodes in AGs below 1TB, all data in AGs above 1TB. Clustering by inode number for data writeback is mostly useless in the over-1TB case.

I agree we'll want a way to let the FS provide the clustering key. But for the first cut on the patch, I would suggest keeping it simple.

The inode32-for-under-1TB and inode64 allocators both try to keep data close to the inode (i.e. in the same AG), so clustering by inode number might work better here.
Also, it might be worthwhile allowing the filesystem to supply a hint or mask for closeness for inode clustering. This would help the generic code only try to cluster inode writes to inodes that fall into the same cluster as the first inode.

Yes, also a good idea after things are working.

Notes: (1) I'm not sure inode number is correlated to disk location in filesystems other than ext2/3/4. Or parent dir?

In general, it is a better assumption than sorting by time. It may make sense to one day let the FS provide a clustering hint (corresponding to the first block in the file?), but for starters it makes sense to just go with the inode number. Perhaps multiple hints are needed - one for data locality and one for inode cluster locality.

So, my feature creep idea would have been more data clustering. I'm mainly trying to solve this graph: http://oss.oracle.com/~mason/compilebench/makej/compare-create-dirs-0.png Where background writing of the block device inode is making ext3 do seeky writes while creating directory trees. My simple idea was to kick off a 'I've just written block X' call back to the FS, where it may decide to send down dirty chunks of the block device inode that also happen to be dirty. But maintaining the kupdate max dirty time and congestion limits in the face of all this clustering gets tricky. So, I wasn't going to suggest it until the basic machinery was working.

Fengguang, this isn't a small project ;) But lots of people will be interested in the results.

-chris
[ANNOUNCE] seekwatcher v0.3 IO graphing and animation
Hello everyone, I've tossed out seekwatcher v0.3. The major changes are using rolling averages to smooth out the seek and throughput graphs, and it can generate mpgs of the IO done by a given trace. Here's a sample of the smoother graphs (creating 20 kernel trees): http://oss.oracle.com/~mason/seekwatcher/ext3_vs_btrfs_vs_xfs.png There are details and sample movies of the kernel tree run at: http://oss.oracle.com/~mason/seekwatcher -chris
Re: [PATCH RFC] extent mapped page cache
On Thu, 26 Jul 2007 04:36:39 +0200 Nick Piggin [EMAIL PROTECTED] wrote:

[ are state trees a good idea? ]

One thing it gains us is finding the start of the cluster. Even if called by kswapd, the state tree allows writepage to find the start of the cluster and send down a big bio (provided I implement trylock to avoid various deadlocks).

That's very true, we could potentially also do that with the block extent tree that I want to try with fsblock.

If fsblock records an extent of 200MB, and writepage is called on a page in the middle of the extent, how do you walk the radix backwards to find the first dirty up to date page in the range?

I'm looking at cleaning up some of these aops APIs so hopefully most of the deadlock problems go away. Should be useful to both our efforts. Will post patches hopefully when I get time to finish the draft this weekend.

Great.

O_DIRECT becomes a special case of readpages and writepages; the memory used for IO just comes from userland instead of the page cache.

Could be, although you'll probably also need to teach the mm about the state tree and/or still manipulate the pagecache tree to prevent concurrency?

Well, it isn't coded yet, but I should be able to do it from the FS specific ops.

Probably, if you invalidate all the pagecache in the range beforehand you should be able to do it (and I guess you want to do the invalidate anyway). Although, below deadlock issues might still bite somewhere...

Well, O_DIRECT is French for deadlocks. But I shouldn't have to worry so much about evicting the pages themselves since I can tag the range.

But isn't the main aim of O_DIRECT to do as little locking and synchronisation with the pagecache as possible? I thought this is why your race fixing patches got put on the back burner (although they did look fairly nice from a correctness POV).

I put the placeholder patches on hold because of a corner case where userland did O_DIRECT from a mmap'd region of the same file (Linus pointed it out to me).
Basically my patches had to work in 64k chunks to avoid a deadlock in get_user_pages. With the state tree, I can allow the page to be faulted in but still properly deal with it.

Oh right, I didn't think of that one. Would you still have similar issues with the external state tree? I mean, the filesystem doesn't really know why the fault is taken. O_DIRECT read from a file into mmapped memory of the same block in the file is almost hopeless I think.

Racing is fine as long as we don't deadlock or expose garbage from disk.

The ability to put in additional tracking info like the process that first dirtied a range is also significant. So, I think it is worth trying.

Definitely, and I'm glad you are. You haven't converted me yet, but I look forward to finding the best ideas from our two approaches when the patches are further along (ext2 port of fsblock coming along, so we'll be able to have races soon :P). I'm sure we can find some river in Cambridge, winner gets to throw Axboe in.

Very noble of you to donate your colleague to such a worthy cause. Jens is always interested in helping solve such debates. It's a fantastic service he provides to the community.

-chris
Re: [PATCH RFC] extent mapped page cache
On Wed, 25 Jul 2007 04:32:17 +0200 Nick Piggin [EMAIL PROTECTED] wrote: On Tue, Jul 24, 2007 at 07:25:09PM -0400, Chris Mason wrote: On Tue, 24 Jul 2007 23:25:43 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

The tree is a critical part of the patch, but it is also the easiest to rip out and replace. Basically the code stores a range by inserting an object at an index corresponding to the end of the range. Then it does searches by looking forward from the start of the range. More or less any tree that can search and return the first key >= the requested key will work. So, I'd be happy to rip out the tree and replace with something else. Going completely lockless will be tricky; it's something that will need deep thought once the rest of the interface is sane.

Just having the other tree and managing it is what makes me a little less positive of this approach, especially using it to store pagecache state when we already have the pagecache tree. Having another tree to store block state I think is a good idea as I said in the fsblock thread with Dave, but I haven't clicked as to why it is a big advantage to use it to manage pagecache state. (And I can see some possible disadvantages in locking and tree manipulation overhead.)

Yes, there are definitely costs with the state tree, it will take some careful benchmarking to convince me it is a feasible solution. But storing all the state in the pages themselves is impossible unless the block size equals the page size. So, we end up with something like fsblock/buffer heads or the state tree. One advantage to the state tree is that it separates the state from the memory being described, allowing a simple kmap style interface that covers subpages, highmem and superpages. It also more naturally matches the way we want to do IO, making for easy clustering. O_DIRECT becomes a special case of readpages and writepages; the memory used for IO just comes from userland instead of the page cache.
The ability to put in additional tracking info like the process that first dirtied a range is also significant. So, I think it is worth trying. -chris
Re: [PATCH RFC] extent mapped page cache
On Thu, 26 Jul 2007 03:37:28 +0200 Nick Piggin [EMAIL PROTECTED] wrote:

One advantage to the state tree is that it separates the state from the memory being described, allowing a simple kmap style interface that covers subpages, highmem and superpages.

I suppose so, although we should have added those interfaces long ago ;) The variants in fsblock are pretty good, and you could always do an arbitrary extent (rather than block) based API using the pagecache tree if it would be helpful.

Yes, you could use fsblock for the state bits and make a separate API to map the actual pages.

It also more naturally matches the way we want to do IO, making for easy clustering.

Well the pagecache tree is used to reasonable effect for that now. OK the code isn't beautiful ;). Granted, this might be an area where the separate state tree ends up being better. We'll see.

One thing it gains us is finding the start of the cluster. Even if called by kswapd, the state tree allows writepage to find the start of the cluster and send down a big bio (provided I implement trylock to avoid various deadlocks).

O_DIRECT becomes a special case of readpages and writepages; the memory used for IO just comes from userland instead of the page cache.

Could be, although you'll probably also need to teach the mm about the state tree and/or still manipulate the pagecache tree to prevent concurrency?

Well, it isn't coded yet, but I should be able to do it from the FS specific ops.

But isn't the main aim of O_DIRECT to do as little locking and synchronisation with the pagecache as possible? I thought this is why your race fixing patches got put on the back burner (although they did look fairly nice from a correctness POV).

I put the placeholder patches on hold because of a corner case where userland did O_DIRECT from a mmap'd region of the same file (Linus pointed it out to me). Basically my patches had to work in 64k chunks to avoid a deadlock in get_user_pages.
With the state tree, I can allow the page to be faulted in but still properly deal with it.

Well I'm kind of handwaving when it comes to O_DIRECT ;) It does look like this might be another advantage of the state tree (although you aren't allowed to slow down buffered IO to achieve the locking ;)).

;) The O_DIRECT benefit is a fringe thing. I've long wanted to help clean up that code, but the real point of the patch is to make general usage faster and less complex. If I can't get there, the O_DIRECT stuff doesn't matter.

The ability to put in additional tracking info like the process that first dirtied a range is also significant. So, I think it is worth trying.

Definitely, and I'm glad you are. You haven't converted me yet, but I look forward to finding the best ideas from our two approaches when the patches are further along (ext2 port of fsblock coming along, so we'll be able to have races soon :P). I'm sure we can find some river in Cambridge, winner gets to throw Axboe in.

-chris
[PATCH RFC] extent mapped page cache
On Tue, 10 Jul 2007 17:03:26 -0400 Chris Mason [EMAIL PROTECTED] wrote: This patch aims to demonstrate one way to replace buffer heads with a few extent trees. Buffer heads provide a few different features: 1) Mapping of logical file offset to blocks on disk 2) Recording state (dirty, locked etc) 3) Providing a mechanism to access sub-page sized blocks. This patch covers #1 and #2, I'll start on #3 a little later next week. Well, almost. I decided to try out an rbtree instead of the radix, which turned out to be much faster. Even though individual operations are slower, the rbtree was able to do many fewer ops to accomplish the same thing, especially for merging extents together. It also uses much less ram. This code still has lots of room for optimization, but it comes in at around 2-5% more cpu time for ext2 streaming reads and writes. I haven't done readpages or writepages yet, so this is more or less a worst case setup. I'm comparing against ext2 with readpages and writepages disabled. The new code has the added benefit of passing fsx-linux, and not triggering MCE's on my poor little test box. The basic idea is to store state in byte ranges in an rbtree, and to mirror that state down into individual pages. This allows us to store arbitrary state outside of the page struct, so we could include the pid of the process that dirtied a page range for cfq purposes. The example readpage and writepage code is probably the easiest way to understand the basic API. A separate rbtree stores a mapping of byte offset in the file to byte offset on disk. This allows the filesystem to fill in mapping information in bulk, and reduces the number of metadata lookups required to do common operations. Because the state and mapping information are separate from the page, pages can come and go and their corresponding metadata can still be cached (the current code drops mappings as the last page corresponding to that mapping disappears). 
Two patches follow, the core extent_map implementation and a sample user (ext2). This is pretty basic, implementing prepare/commit_write, read/writepage and a few other funcs to exercise the new code. Longer term, it should fit in with Nick's other extent work instead of prepare/commit_write.

My patch sets page->private to 1, really for no good reason. It is just a debugging aid I was using to make sure the page took the right path down the line. If this catches on, we might set it to a magic value so you can if (ExtentPage(page)) or just leave it as null.

-chris
[PATCH RFC] extent mapped page cache main code
Core Extentmap implementation

diff -r 126111346f94 -r 53cabea328f7 fs/Makefile
--- a/fs/Makefile	Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/Makefile	Tue Jul 24 15:40:27 2007 -0400
@@ -11,7 +11,7 @@ obj-y :=	open.o read_write.o file_table.
 	attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
 	seq_file.o xattr.o libfs.o fs-writeback.o \
 	pnode.o drop_caches.o splice.o sync.o utimes.o \
-	stack.o
+	stack.o extent_map.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff -r 126111346f94 -r 53cabea328f7 fs/extent_map.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/fs/extent_map.c	Tue Jul 24 15:40:27 2007 -0400
@@ -0,0 +1,1591 @@
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/bio.h>
+#include <linux/mm.h>
+#include <linux/gfp.h>
+#include <linux/pagemap.h>
+#include <linux/page-flags.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/extent_map.h>
+
+static struct kmem_cache *extent_map_cache;
+static struct kmem_cache *extent_state_cache;
+
+struct tree_entry {
+	u64 start;
+	u64 end;
+	int in_tree;
+	struct rb_node rb_node;
+};
+
+/* bits for the extent state */
+#define EXTENT_DIRTY 1
+#define EXTENT_WRITEBACK (1 << 1)
+#define EXTENT_UPTODATE (1 << 2)
+#define EXTENT_LOCKED (1 << 3)
+#define EXTENT_NEW (1 << 4)
+
+#define EXTENT_IOBITS (EXTENT_LOCKED | EXTENT_WRITEBACK)
+
+void __init extent_map_init(void)
+{
+	extent_map_cache = kmem_cache_create("extent_map",
+					    sizeof(struct extent_map), 0,
+					    SLAB_RECLAIM_ACCOUNT |
+					    SLAB_DESTROY_BY_RCU,
+					    NULL, NULL);
+	extent_state_cache = kmem_cache_create("extent_state",
+					      sizeof(struct extent_state), 0,
+					      SLAB_RECLAIM_ACCOUNT |
+					      SLAB_DESTROY_BY_RCU,
+					      NULL, NULL);
+}
+
+void extent_map_tree_init(struct extent_map_tree *tree,
+			  struct address_space *mapping, gfp_t mask)
+{
+	tree->map.rb_node = NULL;
+	tree->state.rb_node = NULL;
+	rwlock_init(&tree->lock);
+	tree->mapping = mapping;
+}
+EXPORT_SYMBOL(extent_map_tree_init);
+
+struct extent_map
*alloc_extent_map(gfp_t mask)
+{
+	struct extent_map *em;
+	em = kmem_cache_alloc(extent_map_cache, mask);
+	if (!em || IS_ERR(em))
+		return em;
+	em->in_tree = 0;
+	atomic_set(&em->refs, 1);
+	return em;
+}
+EXPORT_SYMBOL(alloc_extent_map);
+
+void free_extent_map(struct extent_map *em)
+{
+	if (atomic_dec_and_test(&em->refs)) {
+		WARN_ON(em->in_tree);
+		kmem_cache_free(extent_map_cache, em);
+	}
+}
+EXPORT_SYMBOL(free_extent_map);
+
+struct extent_state *alloc_extent_state(gfp_t mask)
+{
+	struct extent_state *state;
+	state = kmem_cache_alloc(extent_state_cache, mask);
+	if (!state || IS_ERR(state))
+		return state;
+	state->state = 0;
+	state->in_tree = 0;
+	atomic_set(&state->refs, 1);
+	init_waitqueue_head(&state->wq);
+	return state;
+}
+EXPORT_SYMBOL(alloc_extent_state);
+
+void free_extent_state(struct extent_state *state)
+{
+	if (atomic_dec_and_test(&state->refs)) {
+		WARN_ON(state->in_tree);
+		kmem_cache_free(extent_state_cache, state);
+	}
+}
+EXPORT_SYMBOL(free_extent_state);
+
+static struct rb_node *tree_insert(struct rb_root *root, u64 offset,
+				   struct rb_node *node)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct tree_entry *entry;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct tree_entry, rb_node);
+
+		if (offset < entry->end)
+			p = &(*p)->rb_left;
+		else if (offset > entry->end)
+			p = &(*p)->rb_right;
+		else
+			return parent;
+	}
+
+	entry = rb_entry(node, struct tree_entry, rb_node);
+	entry->in_tree = 1;
+	rb_link_node(node, parent, p);
+	rb_insert_color(node, root);
+	return NULL;
+}
+
+static struct rb_node *__tree_search(struct rb_root *root, u64 offset,
+				     struct rb_node **prev_ret)
+{
+	struct rb_node *n = root->rb_node;
+	struct rb_node *prev = NULL;
+	struct tree_entry *entry;
+	struct tree_entry *prev_entry = NULL;
+
+	while (n) {
+		entry = rb_entry(n, struct tree_entry, rb_node);
+		prev = n;
+		prev_entry = entry;
+
+		if (offset
[PATCH RFC] ext2 extentmap support
mount -o extentmap to use the new stuff

diff -r 126111346f94 -r 53cabea328f7 fs/ext2/ext2.h
--- a/fs/ext2/ext2.h	Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/ext2/ext2.h	Tue Jul 24 15:40:27 2007 -0400
@@ -1,5 +1,6 @@
 #include <linux/fs.h>
 #include <linux/ext2_fs.h>
+#include <linux/extent_map.h>
 
 /*
  * ext2 mount options
@@ -65,6 +66,7 @@ struct ext2_inode_info {
 	struct posix_acl	*i_default_acl;
 #endif
 	rwlock_t i_meta_lock;
+	struct extent_map_tree extent_tree;
 	struct inode	vfs_inode;
 };
@@ -167,6 +169,7 @@ extern const struct address_space_operat
 extern const struct address_space_operations ext2_aops;
 extern const struct address_space_operations ext2_aops_xip;
 extern const struct address_space_operations ext2_nobh_aops;
+extern const struct address_space_operations ext2_extent_map_aops;
 
 /* namei.c */
 extern const struct inode_operations ext2_dir_inode_operations;
diff -r 126111346f94 -r 53cabea328f7 fs/ext2/inode.c
--- a/fs/ext2/inode.c	Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/ext2/inode.c	Tue Jul 24 15:40:27 2007 -0400
@@ -625,6 +625,84 @@ changed:
 	goto reread;
 }
 
+/*
+ * simple get_extent implementation using get_block.  This assumes
+ * the get_block function can return something larger than a single block,
+ * but the ext2 implementation doesn't do so.  Just change b_size to
+ * something larger if get_block can return larger extents.
+ */
+struct extent_map *ext2_get_extent(struct inode *inode, struct page *page,
+				   size_t page_offset, u64 start, u64 end,
+				   int create)
+{
+	struct buffer_head bh;
+	sector_t iblock;
+	struct extent_map *em = NULL;
+	struct extent_map_tree *extent_tree = &EXT2_I(inode)->extent_tree;
+	int ret = 0;
+	u64 max_end = (u64)-1;
+	u64 found_len;
+	u64 bh_start;
+	u64 bh_end;
+
+	bh.b_size = inode->i_sb->s_blocksize;
+	bh.b_state = 0;
+again:
+	em = lookup_extent_mapping(extent_tree, start, end);
+	if (em) {
+		return em;
+	}
+
+	iblock = start >> inode->i_blkbits;
+	if (!buffer_mapped(&bh)) {
+		ret = ext2_get_block(inode, iblock, &bh, create);
+		if (ret)
+			goto out;
+	}
+
+	found_len = min((u64)(bh.b_size), max_end - start);
+	if (!em)
+		em = alloc_extent_map(GFP_NOFS);
+
+	bh_start = start;
+	bh_end = start + found_len - 1;
+	em->start = start;
+	em->end = bh_end;
+	em->bdev = inode->i_sb->s_bdev;
+
+	if (!buffer_mapped(&bh)) {
+		em->block_start = 0;
+		em->block_end = 0;
+	} else {
+		em->block_start = bh.b_blocknr << inode->i_blkbits;
+		em->block_end = em->block_start + found_len - 1;
+	}
+	ret = add_extent_mapping(extent_tree, em);
+	if (ret == -EEXIST) {
+		free_extent_map(em);
+		em = NULL;
+		max_end = end;
+		goto again;
+	}
+out:
+	if (ret) {
+		if (em)
+			free_extent_map(em);
+		return ERR_PTR(ret);
+	} else if (em && buffer_new(&bh)) {
+		set_extent_new(extent_tree, bh_start, bh_end, GFP_NOFS);
+	}
+	return em;
+}
+
+static int ext2_extent_map_writepage(struct page *page,
+				     struct writeback_control *wbc)
+{
+	struct extent_map_tree *tree;
+	tree = &EXT2_I(page->mapping->host)->extent_tree;
+	return extent_write_full_page(tree, page, ext2_get_extent, wbc);
+}
+
 static int ext2_writepage(struct page *page, struct writeback_control *wbc)
 {
 	return block_write_full_page(page, ext2_get_block, wbc);
@@ -633,6 +711,42 @@ static int ext2_readpage(struct file *fi
 static int ext2_readpage(struct file *file, struct page *page)
 {
 	return mpage_readpage(page, ext2_get_block);
+}
+
+static int
ext2_extent_map_readpage(struct file *file, struct page *page)
+{
+	struct extent_map_tree *tree;
+	tree = &EXT2_I(page->mapping->host)->extent_tree;
+	return extent_read_full_page(tree, page, ext2_get_extent);
+}
+
+static int ext2_extent_map_releasepage(struct page *page,
+				       gfp_t unused_gfp_flags)
+{
+	struct extent_map_tree *tree;
+	int ret;
+
+	if (page->private != 1)
+		return try_to_free_buffers(page);
+	tree = &EXT2_I(page->mapping->host)->extent_tree;
+	ret = try_release_extent_mapping(tree, page);
+	if (ret == 1) {
+		ClearPagePrivate(page);
+		set_page_private(page, 0);
+		page_cache_release(page);
+	}
+	return ret;
+}
+
+static void ext2_extent_map_invalidatepage(struct page *page,
+					   unsigned long offset)
+{
+	struct extent_map_tree *tree;
+
+	tree =
Re: [PATCH RFC] extent mapped page cache
On Tue, 24 Jul 2007 23:25:43 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: On Tue, 2007-07-24 at 16:13 -0400, Trond Myklebust wrote: On Tue, 2007-07-24 at 16:00 -0400, Chris Mason wrote: On Tue, 10 Jul 2007 17:03:26 -0400 Chris Mason [EMAIL PROTECTED] wrote:

This patch aims to demonstrate one way to replace buffer heads with a few extent trees. Buffer heads provide a few different features: 1) Mapping of logical file offset to blocks on disk 2) Recording state (dirty, locked etc) 3) Providing a mechanism to access sub-page sized blocks. This patch covers #1 and #2, I'll start on #3 a little later next week.

Well, almost. I decided to try out an rbtree instead of the radix, which turned out to be much faster. Even though individual operations are slower, the rbtree was able to do many fewer ops to accomplish the same thing, especially for merging extents together. It also uses much less ram.

The problem with an rbtree is that you can't use it together with RCU to do lockless lookups. You can probably modify it to allocate nodes dynamically (like the radix tree does) and thus make it RCU-compatible, but then you risk losing the two main benefits that you list above.

The tree is a critical part of the patch, but it is also the easiest to rip out and replace. Basically the code stores a range by inserting an object at an index corresponding to the end of the range. Then it does searches by looking forward from the start of the range. More or less any tree that can search and return the first key >= the requested key will work. So, I'd be happy to rip out the tree and replace with something else. Going completely lockless will be tricky; it's something that will need deep thought once the rest of the interface is sane.

-chris
[ANNOUNCE] seekwatcher IO graphing v0.2
Hello everyone, Since doing the initial Btrfs benchmarks, I've made my blktrace graphing utility a little more generic and tossed it out on oss.oracle.com. This new version can easily graph two different runs, and has a few other tweaks that make the graphs look nicer. Docs, examples and other details are at: http://oss.oracle.com/~mason/seekwatcher -chris
Re: [PATCH RFC] extent mapped page cache
On Thu, 12 Jul 2007 00:00:28 -0700 Daniel Phillips [EMAIL PROTECTED] wrote: On Tuesday 10 July 2007 14:03, Chris Mason wrote:

This patch aims to demonstrate one way to replace buffer heads with a few extent trees...

Hi Chris, Quite terse commentary on algorithms and data structures, but I suppose that is not a problem because Jon has a whole week to reverse engineer it for us. What did you have in mind for subpages?

This partially depends on input here. The goal is to have one interface that works for subpages, highmem and superpages, and for the FS maintainers to not care if the mappings come magically from clameter's work or vmap or whatever. Given the whole extent based theme, I plan on something like this:

struct extent_ptr {
	char *ptr;
	/* some way to indicate size and type of map */
	struct page pages[];
};

struct extent_ptr *alloc_extent_ptr(struct extent_map_tree *tree, u64 start, u64 end);
void free_extent_ptr(struct extent_map_tree *tree, struct extent_ptr *ptr);

And then some calls along the lines of kmap/kunmap that gives you a pointer you can use for accessing the ram. read/write calls would also be fine by me, but harder to convert filesystems to use. The struct extent_ptr would increase the ref count on the pages, but the pages would have no back pointers to it. All dirty/locked/writeback state would go in the extent state tree and would not be stored in the struct extent_ptr. The idea is to make a simple mapping entity, and not complicate it by storing FS specific state in there. It could be variably sized to hold an array of pages, and allocated via kmalloc.

-chris
[PATCH RFC] extent mapped page cache
This patch aims to demonstrate one way to replace buffer heads with a few extent trees. Buffer heads provide a few different features: 1) Mapping of logical file offset to blocks on disk 2) Recording state (dirty, locked etc) 3) Providing a mechanism to access sub-page sized blocks. This patch covers #1 and #2; I'll start on #3 a little later next week. The file offset to disk block mapping is done in one radix tree, and the state is done in a second radix tree. Extent ranges are stored in the radix trees by inserting into the slot corresponding to the end of the range, and always using gang lookups for searching. The basic implementation mirrors the page and buffer bits already used, but allows state bits to be set on regions smaller or larger than a single page. Eventually I would like to use this mechanism to replace my DIO locking/placeholder patch. Ext2 is changed to use the extent mapping code when mounted with -o extentmap. DIO is not supported and readpages/writepages are not yet implemented, but this should be enough to get the basic idea across. Testing has been very, very light; I'm mostly sending this out for comments and to continue the discussion started by Nick's patch set.

diff -r 126111346f94 fs/Makefile
--- a/fs/Makefile	Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/Makefile	Tue Jul 10 16:49:26 2007 -0400
@@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.
 	attr.o bad_inode.o file.o filesystems.o namespace.o aio.o \
 	seq_file.o xattr.o libfs.o fs-writeback.o \
 	pnode.o drop_caches.o splice.o sync.o utimes.o \
-	stack.o
+	stack.o extent_map.o
 
 ifeq ($(CONFIG_BLOCK),y)
 obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o
diff -r 126111346f94 fs/ext2/ext2.h
--- a/fs/ext2/ext2.h	Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/ext2/ext2.h	Tue Jul 10 16:49:26 2007 -0400
@@ -1,5 +1,6 @@
 #include <linux/fs.h>
 #include <linux/ext2_fs.h>
+#include <linux/extent_map.h>
 
 /*
  * ext2 mount options
@@ -65,6 +66,7 @@ struct ext2_inode_info {
 	struct posix_acl	*i_default_acl;
 #endif
 	rwlock_t		i_meta_lock;
+	struct extent_map_tree	extent_tree;
 	struct inode		vfs_inode;
 };
@@ -167,6 +169,7 @@ extern const struct address_space_operat
 extern const struct address_space_operations ext2_aops;
 extern const struct address_space_operations ext2_aops_xip;
 extern const struct address_space_operations ext2_nobh_aops;
+extern const struct address_space_operations ext2_extent_map_aops;
 
 /* namei.c */
 extern const struct inode_operations ext2_dir_inode_operations;
diff -r 126111346f94 fs/ext2/inode.c
--- a/fs/ext2/inode.c	Mon Jul 09 10:53:57 2007 -0400
+++ b/fs/ext2/inode.c	Tue Jul 10 16:49:26 2007 -0400
@@ -625,6 +625,78 @@ changed:
 	goto reread;
 }
 
+/*
+ * simple get_extent implementation using get_block.  This assumes
+ * the get_block function can return something larger than a single block,
+ * but the ext2 implementation doesn't do so.  Just change b_size to
+ * something larger if get_block can return larger extents.
+ */
+struct extent_map *ext2_get_extent(struct inode *inode, struct page *page,
+				   size_t page_offset, u64 start, u64 end,
+				   int create)
+{
+	struct buffer_head bh;
+	sector_t iblock;
+	struct extent_map *em = NULL;
+	struct extent_map_tree *extent_tree = &EXT2_I(inode)->extent_tree;
+	int ret = 0;
+	u64 max_end = (u64)-1;
+	u64 found_len;
+
+	bh.b_size = inode->i_sb->s_blocksize;
+	bh.b_state = 0;
+again:
+	em = lookup_extent_mapping(extent_tree, start, end);
+	if (em)
+		return em;
+
+	iblock = start >> inode->i_blkbits;
+	if (!buffer_mapped(&bh)) {
+		ret = ext2_get_block(inode, iblock, &bh, create);
+		if (ret)
+			goto out;
+	}
+
+	found_len = min((u64)(bh.b_size), max_end - start);
+	if (!em)
+		em = alloc_extent_map(GFP_NOFS);
+
+	em->start = start;
+	em->end = start + found_len - 1;
+	em->bdev = inode->i_sb->s_bdev;
+
+	if (!buffer_mapped(&bh)) {
+		em->block_start = 0;
+		em->block_end = 0;
+	} else {
+		em->block_start = bh.b_blocknr << inode->i_blkbits;
+		em->block_end = em->block_start + found_len - 1;
+	}
+
+	ret = add_extent_mapping(extent_tree, em);
+	if (ret == -EEXIST) {
+		max_end = end;
+		goto again;
+	}
+out:
+	if (ret) {
+		if (em)
+			free_extent_map(em);
+		return ERR_PTR(ret);
+	} else if (em && buffer_new(&bh)) {
+		set_extent_new(extent_tree, start, end, GFP_NOFS);
+	}
+	return em;
+}
+
+static int ext2_extent_map_writepage(struct page *page,
+				     struct
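The end-of-range insertion plus gang-lookup scheme described above can be modeled in plain userspace C. This is an illustrative sketch, not the patch's radix-tree code: entries are kept sorted by their end offset, and a lookup scans forward from the first entry whose end reaches the query start -- exactly the property the gang lookup relies on to find intersecting ranges:

```c
#include <assert.h>
#include <stddef.h>

/* One mapped range, inclusive on both ends, mirroring the [start, end]
 * convention used by the extent_map code above. */
struct extent {
	unsigned long long start, end;
};

/* Entries in 'tab' are sorted by 'end', which models keying each range
 * by the radix slot of its last offset.  The forward scan stands in for
 * the gang lookup: skip everything that ends before the query, then the
 * first remaining entry either intersects or nothing does. */
static struct extent *range_lookup(struct extent *tab, size_t n,
				   unsigned long long start,
				   unsigned long long end)
{
	for (size_t i = 0; i < n; i++) {
		if (tab[i].end < start)		/* ends before query: skip */
			continue;
		if (tab[i].start > end)		/* starts after query: hole */
			return NULL;
		return &tab[i];			/* first intersecting range */
	}
	return NULL;
}
```

The same shape works whether the payload is a disk mapping or a state record, which is why one mechanism can back both trees.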
Re: vm/fs meetup details
On Fri, 6 Jul 2007 23:42:01 +1000 David Chinner [EMAIL PROTECTED] wrote: On Fri, Jul 06, 2007 at 12:26:23PM +0200, Jörn Engel wrote: On Fri, 6 July 2007 20:01:10 +1000, David Chinner wrote: On Fri, Jul 06, 2007 at 04:26:51AM +0200, Nick Piggin wrote: But, surprisingly enough, the above work is relevent to this forum because of two things: - we've had to move to direct I/O and user space caching to work around deficiencies in kernel block device caching under memory pressure - we've exploited techniques that XFS supports but the VM does not. i.e. priority tagging of cached metadata so that less important metadata is tossed first (e.g. toss tree leaves before nodes and nodes before roots) when under memory pressure. And the latter is exactly what logfs needs as well. You certainly have me interested. I believe it applies to btrfs and any other cow-fs as well. The point is that higher levels get dirtied by writing lower layers. So perfect behaviour for sync is to write leaves first, then nodes, then the root. Any other order will either cause sync not to sync or cause unnecessary writes and cost performance. Hmmm - I guess you could use it for writeback ordering. I hadn't really thought about that. Doesn't seem a particularly efficient way of doing it, though. Why not just use multiple address spaces for this? i.e. one per level and flush in ascending order. At least in the case of btrfs, the perfect order for sync is disk order ;) COW happens when blocks are changed for the first time in a transaction, not when they are written out to disk. If logfs is writing things out in some form of tree order, you're going to have to group disk allocations such that tree order reflects disk order somehow. But, the part where we toss leaves first is definitely useful. -chris
Re: Versioning file system
On Thu, 5 Jul 2007 09:57:40 -0400 John Stoffel [EMAIL PROTECTED] wrote: Erik == Erik Mouw [EMAIL PROTECTED] writes: Erik (sorry for the late reply, just got back from holiday) Erik On Mon, Jun 18, 2007 at 01:29:56PM -0400, Theodore Tso wrote: As I mentioned in my Linux.conf.au presentation a year and a half ago, the main use of Streams in Windows to date has been for system crackers to hide trojan horse code and rootkits so that system administrators couldn't find them. :-) Erik The only valid use of Streams in Windows I've seen was a virus Erik checker that stored a hash of the file in a separate Erik stream. Checking a file was a matter of rehashing it and Erik comparing against the hash stored in the special hash data Erik stream for that particular file. So what was stopping a virus from infecting a file, re-computing the hash and pushing the new hash into the stream? You need to keep the computed hashes on read-only media for true security; once you let the system change them, you're toast. I'm not a huge fan of streams, but I'm pretty sure there are various encryption tools that let us verify and validate the source of data. It's entirely possible the virus checker wasn't doing it right, but storing verification info in an EA or stream isn't entirely invalid. You still need an external key that you do trust of course. -chris
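The external-key point can be sketched concretely. This is illustrative only -- a real checker would use HMAC over a cryptographic hash, not the toy FNV-1a mix shown here: an unkeyed hash stored alongside the file can simply be recomputed by whatever modified the file, while a digest folded together with a secret that never touches the filesystem cannot be reproduced without that secret:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a, used here only as a stand-in digest.  'seed' lets us chain a
 * previous hash state in, which is how the key gets mixed with the data. */
static uint64_t fnv1a(const void *data, size_t len, uint64_t seed)
{
	const unsigned char *p = data;
	uint64_t h = seed ? seed : 14695981039346656037ULL;

	while (len--) {
		h ^= *p++;
		h *= 1099511628211ULL;
	}
	return h;
}

/* Fold the secret key in first, so the resulting value depends on both
 * the key and the file contents.  A verifier holding the key can always
 * recompute it; an attacker who can rewrite the stream cannot. */
static uint64_t keyed_digest(const char *key, const void *data, size_t len)
{
	uint64_t seed = fnv1a(key, strlen(key), 0);

	return fnv1a(data, len, seed);
}
```

The stream itself is then just a cache of the keyed value; trust comes entirely from the key kept off the system.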
Re: how do versioning filesystems take snapshot of opened files?
On Tue, 3 Jul 2007 01:28:57 -0400 Xin Zhao [EMAIL PROTECTED] wrote: Hi, If a file is already opened when snapshot command is issued, the file itself could be in an inconsistent state already. Before the file is closed, maybe part of the file contains old data, the rest contains new data. How does a versioning filesystem guarantee that the file snapshot is in a consistent state in this case? I googled it but didn't find any answer. Can someone explain it a little bit? It's the same answer as in most filesystem related questions...it depends ;) Consistent state means many different things. It may mean that the metadata accurately reflects the space on disk allocated to the file and that all data for the file is properly on disk (ie from an fsync). But, even this is less than useful because very few files on the filesystem stand alone. Applications spread their state across a number of files and so consistent means something different to every application. Getting a snapshot that is useful with respect to application data requires help from the application. The app needs to be shutdown or paused prior to the snapshot and then started up again after the snapshot is taken. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: how do versioning filesystems take snapshot of opened files?
On Tue, 3 Jul 2007 12:31:49 -0400 Xin Zhao [EMAIL PROTECTED] wrote: That's a good point! But this sounds hopeless to take a real consistent snapshot from app perspective unless you shutdown the computer. Right? Many different applications support some form of pausing in order to facilitate live backups. You just have to keep it all in mind when designing the total backup solution. -chris
Re: how do versioning filesystems take snapshot of opened files?
On Tue, 3 Jul 2007 13:15:06 -0400 Xin Zhao [EMAIL PROTECTED] wrote: OK. From discussion above, can we reach a conclusion: from the application perspective, it is very hard, if not impossible, to take a transactional consistent snapshot without the help from applications? You definitely need help from the applications. They define what a transaction is. Chris, you mentioned that Many different applications support some form of pausing in order to facilitate live backups. Can you provide some examples? I mean popular apps. Oracle, db2, mysql, ldap, postgres, sleepycat databases...just search for online backup and most programs that involve something transactional have a way to do it. Finally, if we back up a little bit, say, we don't care the transaction level consistency ( a transaction that open/close many times), but we want a open/close consistency in snapshots. That is, a file in a snapshot must be in a single version, but it can be in a middle state of a transaction. Can we do that? Pausing apps itself does not solve this problem, because a file could be already opened and in the middle of write. As I mentioned earlier, some systems can backup old data every time new data is written, but I suspect that this will impact the system performance quite a bit. Any idea about that? This depends on the transaction engine in your filesystem. None of the existing linux filesystems have a way to start a transaction when the file opens and finish it when the file closes, or a way to roll back individual operations that have happened inside a given transaction. It certainly could be done, but it would also introduce a great deal of complexity to the FS. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] fsblock
On Thu, Jun 28, 2007 at 04:44:43AM +0200, Nick Piggin wrote: On Thu, Jun 28, 2007 at 08:35:48AM +1000, David Chinner wrote: On Wed, Jun 27, 2007 at 07:50:56AM -0400, Chris Mason wrote: Lets look at a typical example of how IO actually gets done today, starting with sys_write():

sys_write(file, buffer, 1MB)
    for each page:
        prepare_write()
            allocate contiguous chunks of disk
            attach buffers
        copy_from_user()
        commit_write()
            dirty buffers

pdflush:
    writepages()
        find pages with contiguous chunks of disk
        build and submit large bios

So, we replace prepare_write and commit_write with an extent based api, but we keep the dirty each buffer part. writepages has to turn that back into extents (bio sized), and the result is completely full of dark dark corner cases. That's true but I don't think an extent data structure means we can become too far divorced from the pagecache or the native block size -- what will end up happening is that often we'll need stuff to map between all those as well, even if it is only at IO-time. I think the fundamental difference is that fsblock still does: mapping_info = page->something, where something is attached on a per page basis. What we really want is mapping_info = lookup_mapping(page), where that function goes and finds something stored on a per extent basis, with extra bits for tracking dirty and locked state. Ideally, in at least some of the cases the dirty and locked state could be at an extent granularity (streaming IO) instead of the block granularity (random IO). In my little brain, even block based filesystems should be able to take advantage of this...but such things are always easier to believe in before the coding starts. But the point is taken, and I do believe that at least for APIs, extent based seems like the best way to go. And that should allow fsblock to be replaced or augmented in future without _too_ much pain. Yup - I've been on the painful end of those dark corner cases several times in the last few months.
It's also worth pointing out that mpage_readpages() already works on an extent basis - it overloads bufferheads to provide a map_bh that can point to a range of blocks in the same state. The code then iterates the map_bh range a page at a time building bios (i.e. not even using buffer heads) from that map.. One issue I have with the current nobh and mpage stuff is that it requires multiple calls into get_block (first to prepare write, then to writepage), it doesn't allow filesystems to attach resources required for writeout at prepare_write time, and it doesn't play nicely with buffers in general. (not to mention that nobh error handling is buggy). I haven't done any mpage-like code for fsblocks yet, but I think they wouldn't be too much trouble, and wouldn't have any of the above problems... Could be, but the fundamental issue of sometimes pages have mappings attached and sometimes they don't is still there. The window is smaller, but non-zero. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] fsblock
On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote: On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote: On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote: On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote: [ ... fsblocks vs extent range mapping ] iomaps can double as range locks simply because iomaps are expressions of ranges within the file. Seeing as you can only access a given range exclusively to modify it, inserting an empty mapping into the tree as a range lock gives an effective method of allowing safe parallel reads, writes and allocation into the file. The fsblocks and the vm page cache interface cannot be used to facilitate this because a radix tree is the wrong type of tree to store this information in. A sparse, range based tree (e.g. btree) is the right way to do this and it matches very well with a range based API. I'm really not against the extent based page cache idea, but I kind of assumed it would be too big a change for this kind of generic setup. At any rate, if we'd like to do it, it may be best to ditch the idea of attach mapping information to a page, and switch to lookup mapping information and range locking for a page. Well the get_block equivalent API is extent based one now, and I'll look at what is required in making map_fsblock a more generic call that could be used for an extent-based scheme. An extent based thing IMO really isn't appropriate as the main generic layer here though. If it is really useful and popular, then it could be turned into generic code and sit along side fsblock or underneath fsblock... 
Lets look at a typical example of how IO actually gets done today, starting with sys_write():

sys_write(file, buffer, 1MB)
    for each page:
        prepare_write()
            allocate contiguous chunks of disk
            attach buffers
        copy_from_user()
        commit_write()
            dirty buffers

pdflush:
    writepages()
        find pages with contiguous chunks of disk
        build and submit large bios

So, we replace prepare_write and commit_write with an extent based api, but we keep the dirty each buffer part. writepages has to turn that back into extents (bio sized), and the result is completely full of dark dark corner cases. I do think fsblocks is a nice cleanup on its own, but Dave has a good point that it makes sense to look for ways to generalize things even more. -chris
Re: [patch 1/3] add the fsblock layer
On Tue, Jun 26, 2007 at 01:07:43PM +1000, Nick Piggin wrote: Neil Brown wrote: On Tuesday June 26, [EMAIL PROTECTED] wrote: Chris Mason wrote: The block device pagecache isn't special, and certainly isn't that much code. I would suggest keeping it buffer head specific and making a second variant that does only fsblocks. This is mostly to keep the semantics of PagePrivate sane, lets not fuzz the line. That would require a new inode and address_space for the fsblock type blockdev pagecache, wouldn't it? I just can't think of a better non-intrusive way of allowing a buffer_head filesystem and an fsblock filesystem to live on the same blkdev together. I don't think they would ever try to. Both filesystems would bd_claim the blkdev, and only one would win. Hmm OK, I might have confused myself thinking about partitions... The issue is more of a filesystem sharing a blockdev with the block-special device (i.e. open(/dev/sda1), read) isn't it? If a filesystem wants to attach information to the blockdev pagecache that is different to what blockdev want to attach, then I think Yes - a new inode and address space is what it needs to create. Then you get into consistency issues between the metadata and direct blockdevice access. Do we care about those? Yeah that issue is definitely a real one. The problem is not just consistency, but how do the block device aops even know that the PG_private page they have has buffer heads or fsblocks, so it is an oopsable condition rather than just a plain consistency issue (consistency is already not guaranteed). Since we're testing new code, I would just leave the blkdev address space alone. If a filesystem wants to use fsblocks, they allocate a new inode during mount, stuff it into their private super block (or in the generic super), and use that for everything. Basically ignoring the block device address space completely. 
It means there will be some inconsistency between what you get when reading the block device file and the filesystem metadata, but we've got that already (ext2 dir in page cache). -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] fsblock
On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote: On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote: [ ... fsblocks vs extent range mapping ] iomaps can double as range locks simply because iomaps are expressions of ranges within the file. Seeing as you can only access a given range exclusively to modify it, inserting an empty mapping into the tree as a range lock gives an effective method of allowing safe parallel reads, writes and allocation into the file. The fsblocks and the vm page cache interface cannot be used to facilitate this because a radix tree is the wrong type of tree to store this information in. A sparse, range based tree (e.g. btree) is the right way to do this and it matches very well with a range based API. I'm really not against the extent based page cache idea, but I kind of assumed it would be too big a change for this kind of generic setup. At any rate, if we'd like to do it, it may be best to ditch the idea of attach mapping information to a page, and switch to lookup mapping information and range locking for a page. A btree could be used to hold the range mapping and locking, but it could just as easily be a radix tree where you do a gang lookup for the end of the range (the same way my placeholder patch did). It'll still find intersecting range locks but is much faster for random insertion/deletion than the btrees. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: vm/fs meetup in september?
On Tue, Jun 26, 2007 at 12:35:09PM +1000, Nick Piggin wrote: Christoph Hellwig wrote: On Sun, Jun 24, 2007 at 06:23:45AM +0200, Nick Piggin wrote: I'd just like to take the chance also to ask about a VM/FS meetup some time around kernel summit (maybe take a bit of time during UKUUG or so). I won't be around until a day or two before KS, so I'd prefer to have it after KS if possible. I'd like to see you there, so I hope we can find a date that most people are happy with. I'll try to start working that out after we have a rough idea of who's interested. I'm game, but won't be staying past the end of KS (I'll arrive Sept 2nd or so though). Given debates so far, it probably makes sense to talk about things at KS too. -chris
Re: [RFC] fsblock
On Mon, Jun 25, 2007 at 04:58:48PM +1000, Nick Piggin wrote: Using buffer heads instead allows the FS to send file data down inside the transaction code, without taking the page lock. So, locking wrt data=ordered is definitely going to be tricky. The best long term option may be making the locking order transaction -> page lock, and change writepage to punt to some other queue when it needs to start a transaction. Yeah, that's what I would like, and I think it would come naturally if we move away from these pass down a single, locked page APIs in the VM, and let the filesystem do the locking and potentially batching of larger ranges. Definitely. write_begin/write_end is a step in that direction (and it helps OCFS and GFS quite a bit). I think there is also not much reason for writepage sites to lock the page and clear the dirty bit themselves (which seems ugly to me). If we keep the page mapping information with the page all the time (ie writepage doesn't have to call get_block ever), it may be possible to avoid sending down a locked page. But, I don't know the delayed allocation internals well enough to say for sure if that is true. Either way, writepage is the easiest of the bunch because it can be deferred. -chris
Re: [patch 1/3] add the fsblock layer
On Mon, Jun 25, 2007 at 05:41:58PM +1000, Nick Piggin wrote: Neil Brown wrote: On Sunday June 24, [EMAIL PROTECTED] wrote: +#define PG_blocks 20 /* Page has block mappings */ + I've only had a very quick look, but this line looks *very* wrong. You should be using PG_private. There should never be any confusion about whether -private has buffers or blocks attached as the only routines that ever look in -private are address_space operations (or should be. I think 'NULL' is sometimes special cased, as in try_to_release_page. It would be good to do some preliminary work and tidy all that up). There is a lot of confusion, actually :) But as you see in the patch, I added a couple more aops APIs, and am working toward decoupling it as much as possible. It's pretty close after the fsblock patch... however: Why do you think you need PG_blocks? Block device pagecache (buffer cache) has to be able to accept attachment of either buffers or blocks for filesystem metadata, and call into either buffer.c or fsblock.c based on that. If the page flag is really important, we can do some awful hack like assuming the first long of the private data is flags, and those flags will tell us whether the structure is a buffer_head or fsblock ;) But for now it is just easier to use a page flag. The block device pagecache isn't special, and certainly isn't that much code. I would suggest keeping it buffer head specific and making a second variant that does only fsblocks. This is mostly to keep the semantics of PagePrivate sane, lets not fuzz the line. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/3] add the fsblock layer
On Sun, Jun 24, 2007 at 03:46:13AM +0200, Nick Piggin wrote: Rewrite the buffer layer. Overall, I like the basic concepts, but it is hard to track the locking rules. Could you please write them up? I like the way you split out the assoc_buffers from the main fsblock code, but the list setup is still something of a wart. It also provides poor ordering of blocks for writeback. I think it makes sense to replace the assoc_buffers list head with a radix tree sorted by block number. mark_buffer_dirty_inode would up the reference count and put it into the radix, the various flushing routines would walk the radix etc. If you wanted to be able to drop the reference count once the block was written you could have a back pointer to the appropriate inode. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching
On Thu, Jun 21, 2007 at 09:06:40PM -0400, James Morris wrote: On Thu, 21 Jun 2007, Chris Mason wrote: The incomplete mediation flows from the design, since the pathname-based mediation doesn't generalize to cover all objects unlike label- or attribute-based mediation. And the use the natural abstraction for each object type approach likewise doesn't yield any general model or anything that you can analyze systematically for data flow. This feels quite a lot like a repeat of the discussion at the kernel summit. There are valid uses for path based security, and if they don't fit your needs, please don't use them. But, path based semantics alone are not a valid reason to shut out AA. The validity or otherwise of pathname access control is not being discussed here. The point is that the pathname model does not generalize, and that AppArmor's inability to provide adequate coverage of the system is a design issue arising from this. I'm sorry, but I don't see where in the paragraphs above you aren't making a general argument against the pathname model. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching
On Fri, Jun 22, 2007 at 10:23:03AM -0400, James Morris wrote: On Fri, 22 Jun 2007, Chris Mason wrote: But, this is a completely different discussion than if AA is solving problems in the wild for its intended audience, or if the code is somehow flawed and breaking other parts of the kernel. Is its intended audience aware of its limitations? Lars has just acknowledged that it does not implement mandatory access control, for one. Until people understand these issues, they certainly need to be addressed in the context of upstream merge. It is definitely useful to clearly understand the intended AA use cases during the merge. We've been over the AA is different discussion in threads about a billion times, and at the last kernel summit. I don't believe that people at the summit were adequately informed on the issue, and from several accounts I've heard, Stephen Smalley was effectively cut off before he could even get to his second slide. I'm sure people there will have different versions of events. The one part that was discussed was if pathname based security was useful, and a number of the people in the room (outside of Novell) said it was. Now, it could be that nobody wanted to argue anymore, since most opinions had come out on one list or another by then. But as someone who doesn't use either SELinux or AA, I really hope we can get past the part of the debate where:

while(1)
    AA) we think we're making users happy with pathname security
    SELINUX) pathname security sucks

So, yes Greg got it started and Lars is a well known trouble maker, and I completely understand if you want to say no thank you to an SELinux based AA ;) The models are different and it shouldn't be a requirement that they try to use the same underlying mechanisms. -chris
Re: [AppArmor 39/45] AppArmor: Profile loading and manipulation, pathname matching
On Thu, Jun 21, 2007 at 04:59:54PM -0400, Stephen Smalley wrote: On Thu, 2007-06-21 at 21:54 +0200, Lars Marowsky-Bree wrote: On 2007-06-21T15:42:28, James Morris [EMAIL PROTECTED] wrote: A veto is not a technical argument. All technical arguments (except for path name is ugly, yuk yuk!) have been addressed, have they not? AppArmor doesn't actually provide confinement, because it only operates on filesystem objects. What you define in AppArmor policy does _not_ reflect the actual confinement properties of the policy. Applications can simply use other mechanisms to access objects, and the policy is effectively meaningless. Only if they have access to another process which provides them with that data. Or can access the data under a different path to which their profile does give them access, whether in its final destination or in some temporary file processed along the way. And now, yes, I know AA doesn't mediate IPC or networking (yet), but that's a missing feature, not broken by design. The incomplete mediation flows from the design, since the pathname-based mediation doesn't generalize to cover all objects unlike label- or attribute-based mediation. And the use the natural abstraction for each object type approach likewise doesn't yield any general model or anything that you can analyze systematically for data flow. This feels quite a lot like a repeat of the discussion at the kernel summit. There are valid uses for path based security, and if they don't fit your needs, please don't use them. But, path based semantics alone are not a valid reason to shut out AA. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Tue, Jun 19, 2007 at 10:11:13AM +0100, Pádraig Brady wrote: Vladislav Bolkhovitin wrote: I would also suggest one more feature: support for block level de-duplication. I mean: 1. Ability for Btrfs to have blocks in several files to point to the same block on disk 2. Support for new syscall or IOCTL to de-duplicate as a single transaction two or more blocks on disk, i.e. link them to one of them and free others 3. De-de-duplicate blocks on disk, i.e. copy them on write I suppose that de-duplication itself would be done by some user space process that would scan files, determine blocks with the same data and then de-duplicate them by using syscall or IOCTL (2). That would be very usable feature, which in most cases would allow to shrink occupied disk space on 50-90%. Have you references for this number? In my experience one gets a lot of benefit from the much simpler process of de-duplication of files. Yes, I would expect simple hard links to be a better solution for this, but the feature request is not that out of line. I actually had plans on implementing auto duplicate block reuse earlier in btrfs. Snapshots already share duplicate blocks between files, and so all of the reference counting needed to implement this already exists. Snapshots are writable, and data mods are copy on write, and in general things work. But, to help fsck, the extent allocation tree has a back pointer to the inode that owns an extent. If you're doing snapshots, all of the owners of the extent have the same inode number. If you're sharing duplicate blocks, the owners can have any inode number, and fsck becomes much more complex. In general, when I have to decide between fsck and a feature, I'm going to pick fsck. The features are much more fun, but fsck is one of the main motivations for doing this work. 
Thanks for the input, Chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Sat, Jun 16, 2007 at 11:31:47AM +0200, Florian D. wrote: Chris Mason wrote: Strange, these numbers are not quite what I was expecting ;) Could you please post your fio job files? Also, how much ram does the machine have? Only writing doesn't seem like enough to fill the ram. -chris Sure:

[global]
directory=/mnt/temp/default
filename=testfile
size=300m
randrepeat=1
overwrite=1
end_fsync=1

[ very bad results on btrfs with these parameters ]

Ok, the numbers make more sense now. Basically what is happening is that during the random IO phase, fio is hitting every single block in the file. Btrfs will allocate new blocks in a sequential fashion, but the fsync does writeback in page order. So, the fsync sees completely random block ordering, and then we see it again on the reads. In ext3 even though the writes are random, the fsync uses the original (sequential) ordering of the blocks, and everything works nicely. The fix is either delayed allocation or defrag-on-writeback. Another option (which I'll have to do for O_SYNC performance) is to leave space in the blocks allocated to the file for COWs (basically strides of allocated blocks). I'll do the defrag-on-writeback right after enospc. -chris
Re: Versioning file system
On Mon, Jun 18, 2007 at 03:45:24AM -0600, Andreas Dilger wrote: Too bad everyone is spending time on 10 similar-but-slightly-different filesystems. This will likely end up with a bunch of filesystems that implement some easy subset of features, but will not get polished for users or have a full set of features implemented (e.g. ACL, quota, fsck, etc). While I don't think there is a single answer to every question, it does seem that the number of filesystem projects has climbed lately. Maybe there should be a BOF at OLS to merge these filesystem projects (btrfs, chunkfs, tilefs, logfs, etc) into a single project with multiple people working on getting it solid, scalable (parallel readers/writers on lots of CPUs), robust (checksums, failure localization), recoverable, etc. I thought Val's FS summits were designed to get developers to collaborate, but it seems everyone has gone back to their corners to work on their own filesystem? Unfortunately, I can't do OLS this year, but anyone who wants to talk on these things can drop me a line and we can setup phone calls or whatever for planning. Adding polish to any FS is not a one man show, and so I know I'll need to get more people on board to really finish btrfs off. One of my long term goals for btrfs is to figure out the features and layout people are most interested in for filesystems that don't have to be ext* backwards compatible. I've got a pretty good start, but I'm sure parts of it will change if I can get a big enough developer base. Working on getting hooks into DM/MD so that the filesystem and RAID layers can move beyond ignorance is bliss when talking to each other would be great. Not rebuilding empty parts of the fs, limit parity resync to parts of the fs that were in the previous transaction, use fs-supplied checksums to verify on-disk data is correct, use RAID geometry when doing allocations, etc. Definitely. There's a lot of work in the DM integration bits that are not FS specific. 
-chris
Updated Btrfs project site online
Hello everyone, I've moved the Btrfs pages here: http://oss.oracle.com/projects/btrfs which gives us a bugzilla, mailing lists, and a somewhat more orderly file download area. There are links to my HG trees for sources as well. The oss project area automagically creates a few different mailing lists. For now, [EMAIL PROTECTED] and [EMAIL PROTECTED] will be used. -chris
Re: Updated Btrfs project site online -git repo?
On Mon, Jun 18, 2007 at 09:53:39PM +0200, Maria Domenica Bertolucci wrote: Would it be possible to have a git repo as well so as to keep in sync with all git kernel projects? It also helps standardize things. Sorry, the repos will stay Mercurial based for now. These are small repos and not attached to the main kernel sources, so they will be easy to download (I promise). -chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Fri, Jun 15, 2007 at 09:08:38PM +0200, Florian D. wrote: Chris Mason wrote: is it possible to test it on top of LVM2 on RAID at this stage? Yes, I haven't done much multi-spindle testing yet, so I'm definitely interested in these numbers. -chris I did not get very far:

# insmod btrfs.ko
# mkfs.btrfs /dev/brain_volume_group/btrfstest
on close 0 blocks are allocated
fs created on /dev/brain_volume_group/btrfstest blocksize 4096 blocks 4980736

(/dev/brain_volume_group/btrfstest is a 20GB logical volume on top of RAID6)

# mount /dev/brain_volume_group/btrfstest /mnt/temp/
(this gives these kernel-msgs:
[ 385.980358] btrfs: dm-6 checksum verify failed on 4
[ 385.980462] btrfs: dm-6 checksum verify failed on 12
[ 385.980559] btrfs: dm-6 checksum verify failed on 11

These are normal on the first mount, the mkfs doesn't set the csums on the blocks it creates (will fix ;) )

# touch /mnt/temp/default/testfile.txt
[ 445.445638] btrfs: dm-6 checksum verify failed on 10
# umount /mnt/temp/
[ 457.980372] [ cut here ]
[ 457.980377] kernel BUG at fs/buffer.c:2644!

Whoops. Please try this:

diff -r 38b36731 disk-io.c
--- a/disk-io.c Fri Jun 15 13:50:20 2007 -0400
+++ b/disk-io.c Fri Jun 15 15:12:26 2007 -0400
@@ -541,6 +541,7 @@ int write_ctree_super(struct btrfs_trans
 	else
 		ret = submit_bh(WRITE, bh);
 	if (ret == -EOPNOTSUPP) {
+		lock_buffer(bh);
 		set_buffer_uptodate(bh);
 		root->fs_info->do_barriers = 0;
 		ret = submit_bh(WRITE, bh);
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Fri, Jun 15, 2007 at 10:46:04PM +0200, Florian D. wrote: Chris Mason wrote:

# umount /mnt/temp/
[ 457.980372] [ cut here ]
[ 457.980377] kernel BUG at fs/buffer.c:2644!

Whoops. Please try this: [ bad patch ]

sorry, with the patch applied:
[ 147.475077] BUG: at /home/florian/system/btrfs_test/btrfs-0.2/disk-io.c:534

Well, apparently I can get the silly stuff wrong an infinite number of times. Sorry, let's try again:

diff -r 38b36731 disk-io.c
--- a/disk-io.c Fri Jun 15 13:50:20 2007 -0400
+++ b/disk-io.c Fri Jun 15 16:52:38 2007 -0400
@@ -541,6 +541,8 @@ int write_ctree_super(struct btrfs_trans
 	else
 		ret = submit_bh(WRITE, bh);
 	if (ret == -EOPNOTSUPP) {
+		get_bh(bh);
+		lock_buffer(bh);
 		set_buffer_uptodate(bh);
 		root->fs_info->do_barriers = 0;
 		ret = submit_bh(WRITE, bh);
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Sat, Jun 16, 2007 at 12:03:06AM +0200, Florian D. wrote: Chris Mason wrote: Well, apparently I can get the silly stuff wrong an infinite number of times. Sorry, let's try again:

diff -r 38b36731 disk-io.c
--- a/disk-io.c Fri Jun 15 13:50:20 2007 -0400
+++ b/disk-io.c Fri Jun 15 16:52:38 2007 -0400
@@ -541,6 +541,8 @@ int write_ctree_super(struct btrfs_trans
 	else
 		ret = submit_bh(WRITE, bh);
 	if (ret == -EOPNOTSUPP) {
+		get_bh(bh);
+		lock_buffer(bh);
 		set_buffer_uptodate(bh);
 		root->fs_info->do_barriers = 0;
 		ret = submit_bh(WRITE, bh);

ha! it is working now. some numbers from here (with the fio tool): Great, I'll have a v0.3 out on Monday with that fix rolled in.

1. sequential read
2. random writes
3. sequential read again
filesize: 300MB, bs: 4K

     btrfs               reiserfs            ext3
     usr% sys% bw sec.   usr% sys% bw sec.   usr% sys% bw sec.
1    551 68.3 4.6        117 67.4 4.6        524 68.0 4.6
2    010.7 431           221 29.8 10.5       318 29.0 10.8
3    012.3 133           119 70.5 4.4        524 68.6 4.5
bw: MB/sec.

ext3: -o data=writeback,barrier=1
20GB LVM2 partition on a RAID6 (4 SATA-disks)

Strange, these numbers are not quite what I was expecting ;) Could you please post your fio job files? Also, how much ram does the machine have? Only writing doesn't seem like enough to fill the ram. -chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Thu, Jun 14, 2007 at 08:29:10PM +0200, Florian D. wrote: Chris Mason wrote: The basic list of features looks like this: [amazing stuff snipped] The current status is a very early alpha state, and the kernel code weighs in at a sparsely commented 10,547 lines. I'm releasing now in hopes of finding people interested in testing, benchmarking, documenting, and contributing to the code. ok, what kind of benchmarks would help you most? bonnie? compilebench? something else? Thanks! Let's start with a list of the things I know will go badly:
O_SYNC (not implemented)
O_DIRECT (not implemented)
aio (not implemented)
multi-threaded (brain dead tree locking)
things that fill the drive (will oops)
mmap() writes (not supported, mmap reads are ok)
Also, overlapping writes are not that well supported. For example, tar by default will write in 10k chunks, and btrfs_file_write currently cows on every single write. So, if your tar file has a bunch of 16k files, it'll go much faster if you tell tar to use 16k (or 8k) buffers. In general, I was hoping for a generic delayed allocation facility to magically appear in the kernel, and so I haven't spent a lot of time tuning btrfs_file_write for this yet. Any other workload is fair game, and I'm especially interested in seeing how badly the COW hurts. For example, on a big file, I'd like to see how much slower big sequential reads are after small random writes (fio is good for this). Or, writing to every file on the FS in random order and then seeing how much slower we are at reading. Benchmarks that stress the directory structure are interesting too, huge numbers of files + directories etc. Ric Wheeler's fs_mark has a lot of options and output. But, that's just my list, you can pick anything that you find interesting ;) Please try btrfsck after the run to see how well it keeps up.
If you use blktrace to generate io traces, graphs can be generated: http://oss.oracle.com/~mason/seekwatcher/ Not that well documented, but drop me a line if you need help running it. btt is a good alternative to the graphs too, and easier to run. is it possible to test it on top of LVM2 on RAID at this stage? Yes, I haven't done much multi-spindle testing yet, so I'm definitely interested in these numbers. -chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Wed, Jun 13, 2007 at 04:08:30AM +0100, Christoph Hellwig wrote: On Tue, Jun 12, 2007 at 04:14:39PM -0400, Chris Mason wrote: Aside from folding snapshot history into the origin's namespace... It could be possible to have a mount.btrfs that allows subvolumes and/or snapshot volumes to be mounted as unique roots? I'd imagine a bind mount _could_ provide this too? Anyway, I'm just interested in understanding the vision for managing the potentially complex nature of a Btrfs namespace. One option is to put the real btrfs root into some directory in (/sys/fs/btrfs/$device?) and then use tools in userland to mount -o bind outside of that. I wanted to wait to get fancy until I had a better idea of how people would use the feature. We already support mounting into subdirectories of a filesystem for nfs connection sharing. The patch below makes use of this to allow mounting any subdirectory of a btrfs filesystem by specifying it in the form of /dev/somedevice:directory and when no subdirectory is specified uses 'default'. Neat, thanks Christoph, this will be much nicer longer term. I'll integrate it after I finish off -enospc. To make this more useful btrfs directories should grow some way to be marked head of a subvolume. They are already different in the btree, but maybe I'm not 100% sure what you mean by marked as the head of a subvolume? and we'd need a more useful way to actually create subvolumes and snapshots without fugly ioctls. One way I can think of that doesn't involve an ioctl is to have a special subdir at the root of the subvolume:

cd /mnt/default/.snaps
mkdir new_snapshot
rmdir old_snapshot

cd /mnt
mkdir new_subvol
rmdir old_subvol

-chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Tue, Jun 12, 2007 at 11:46:20PM -0400, John Stoffel wrote: Chris == Chris Mason [EMAIL PROTECTED] writes:

Chris> After the last FS summit, I started working on a new filesystem
Chris> that maintains checksums of all file data and metadata. Many
Chris> thanks to Zach Brown for his ideas, and to Dave Chinner for his
Chris> help on benchmarking analysis.

Chris> The basic list of features looks like this:
Chris> * Extent based file storage (2^64 max file size)
Chris> * Space efficient packing of small files
Chris> * Space efficient indexed directories
Chris> * Dynamic inode allocation
Chris> * Writable snapshots
Chris> * Subvolumes (separate internal filesystem roots)
Chris> - Object level mirroring and striping
Chris> * Checksums on data and metadata (multiple algorithms available)
Chris> - Strong integration with device mapper for multiple device support
Chris> - Online filesystem check
Chris> * Very fast offline filesystem check
Chris> - Efficient incremental backup and FS mirroring

So, can you resize a filesystem both bigger and smaller? Or is that implicit in the Object level mirroring and striping? Growing the FS is just either extending or adding a new extent tree. Shrinking is more complex. The extent trees do have back pointers to the objectids that own the extent, but snapshotting makes that a little non-deterministic. The good news is there are no fixed locations for any of the metadata. So it is at least possible to shrink and pop out arbitrary chunks. As a user of Netapps, having quotas (if only for reporting purposes) and some way to migrate non-used files to slower/cheaper storage would be great. So far, I'm not planning quotas beyond the subvolume level. Ie. being able to setup two pools, one being RAID6, the other being RAID1, where all currently accessed files are in the RAID1 setup, but if un-used get migrated to the RAID6 area. HSM in general is definitely interesting.
I'm afraid it is a long ways off, but it could be integrated into the scrubber that wanders the trees in the background. -chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Wed, Jun 13, 2007 at 01:45:28AM -0400, Albert Cahalan wrote: Neat! It's great to see somebody else waking up to the idea that storage media is NOT to be trusted. Judging by the design paper, it looks like your structs have some alignment problems. Actual defs are all packed, but I may still shuffle around the structs to optimize alignment. The keys are fixed, although I may make the u32 in the middle smaller. The usual wishlist:

* inode-to-pathnames mapping
This one I'll code, it will help with inode link count verification. I want to be able to detect at run time that an inode with a link count of zero is still actually in a directory. So there will be back pointers from the inode to the directory. Also, the incremental backup code will be able to walk the btree to find inodes that have changed, and the backpointers will help make a list of file names that need to be rsync'd or whatever.

* a subvolume that is a single file (disk image, database, etc.)
subvolumes can be made that have a single file in them, but they have to be directories right now. Doing otherwise would complicate mounts and other management tools (inside the btree, it doesn't really matter).

* directory indexes to better support Wine and Samba

* secure delete via destruction of per-file or per-block random crypto keys
I'd rather keep secure delete as a userland problem (or a layered FS problem). When you take backups and other copies of the file into account, it's a bigger problem than btrfs wants to tackle right now.

* fast (seekless) access to normal-sized SE Linux data
acls and xattrs will be adjacent to the inode in the tree. Most of the time it'll be seekless.

* atomic creation of copy-on-write directory trees
Do you mean something more fine grained than the current snapshotting system?

* immutable bits like UFS has
I'll do the ext2 chattr calls.

* hole punch ability
Hole punching isn't harder or easier in btrfs than most other filesystems that support holes. It's largely a VM issue.
* insert/delete ability (add/remove a chunk in the middle of a file)
The disk format makes this O(extent records past the chunk). It's possible to code but it would not be optimized.

-chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Wed, Jun 13, 2007 at 10:00:56AM -0400, John Stoffel wrote: Chris == Chris Mason [EMAIL PROTECTED] writes: As a user of Netapps, having quotas (if only for reporting purposes) and some way to migrate non-used files to slower/cheaper storage would be great. Chris> So far, I'm not planning quotas beyond the subvolume level. So let me get this straight. Are you saying that quotas would only be on the volume level, and for the initial level of sub-volumes below that level? Or would *all* sub-volumes have quota support? And does that include snapshots as well? On disk, snapshots and subvolumes are identical...the only difference is their starting state (sorry, it's confusing, and it doesn't help that I interchange the terms when describing features). Every subvolume will have a quota on the number of blocks it can consume. I haven't yet decided on the best way to account for blocks that are actually shared between snapshots, but it'll be in there somehow. So if you wanted to make a snapshot readonly, you just set the quota to 1 block. But, I'm not planning on adding a way to say user X in subvolume Y has quota Z. It'll just be: this subvolume can't get bigger than a given size (at least for version 1.0). -chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Wed, Jun 13, 2007 at 12:12:23PM -0400, John Stoffel wrote: Chris == Chris Mason [EMAIL PROTECTED] writes: [ nod ] Also, I think you're wrong here when you state that making a snapshot (sub-volume?) RO just requires you to set the quota to 1 block. What is to stop me from writing 1 block to a random file that already exists? It's copy on write, so changing one block means allocating a new one and putting the new contents there. The old blocks don't become available for reuse until the transaction commits. -chris
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Wed, Jun 13, 2007 at 12:14:40PM -0400, Albert Cahalan wrote: On 6/13/07, Chris Mason [EMAIL PROTECTED] wrote: On Wed, Jun 13, 2007 at 01:45:28AM -0400, Albert Cahalan wrote: The usual wishlist: * inode-to-pathnames mapping This one I'll code, it will help with inode link count verification. I want to be able to detect at run time that an inode with a link count of zero is still actually in a directory. So there will be back pointers from the inode to the directory. Great, but fsck improvement wasn't on my mind. This is a desirable feature for the NFS server, and for regular users. Think about a backup program trying to maintain hard links. Sure, it'll be there either way ;) Also, the incremental backup code will be able to walk the btree to find inodes that have changed, and the backpointers will help make a list of file names that need to be rsync'd or whatever. * a subvolume that is a single file (disk image, database, etc.) subvolumes can be made that have a single file in them, but they have to be directories right now. Doing otherwise would complicate mounts and other management tools (inside the btree, it doesn't really matter). Bummer. As I understand it, ZFS provides this. :-) Grin, when the pain of typing cd subvol is btrfs' biggest worry, I'll be doing very well. * directory indexes to better support Wine and Samba * secure delete via destruction of per-file or per-block random crypto keys I'd rather keep secure delete as a userland problem (or a layered FS problem). When you take backups and other copies of the file into account, it's a bigger problem than btrfs wants to tackle right now. It can't be a userland problem if you allow disk blocks to move. Volume resizing, logging/journalling, etc. -- they combine to make the userland solution essentially impossible. 
(one could wipe the whole partition, or maybe fill ALL space on the volume) Right about here is where I would insert a long story about ecryptfs, or encryption solutions that happen all in userland. At any rate, it is outside the scope of v1.0, even though I definitely agree it is an important problem for some people. * atomic creation of copy-on-write directory trees Do you mean something more fine grained than the current snapshotting system? I believe so. Example: I have a linux-2.6 directory. It's not a mount point or anything special like that. I want to copy it to a new directory called wip, without actually copying all the blocks. To all the normal POSIX API stuff, this copy should look like the result of cp -a, not hard links. This would be a snapshot, which has to be done on a subvolume right now. It is not as nice as being able to pick a random directory, but I've only been able to get this far by limiting the feature scope significantly. What I did do was make subvolumes very cheap...just make a bunch of them. Keep in mind that if you implement a cow directory tree without a snapshot, and you don't want to duplicate any blocks in the cow, you're going to have fun with inode numbers. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS
On Tue, Jun 12, 2007 at 03:53:03PM -0400, Mike Snitzer wrote: On 6/12/07, Chris Mason [EMAIL PROTECTED] wrote: Hello everyone, After the last FS summit, I started working on a new filesystem that maintains checksums of all file data and metadata. Many thanks to Zach Brown for his ideas, and to Dave Chinner for his help on benchmarking analysis. Chris, Given the substantial work that you've already put into btrfs and the direction your Todo list details; it feels as though Btrfs will quickly provide the features that only Sun's ZFS provides. Looking at your Btrfs benchmark and design pages it is clear that your motivation is a filesystem that addresses modern concerns (performance that doesn't degrade over time, online fsck, fast offline fsck, data/metadata checksums, unlimited snapshots, efficient remote mirroring, etc). There is still much Todo but you've made very impressive progress for the first announcement! I have some management oriented questions/comments. 1) Regarding the direction of Btrfs as it relates to integration with DM. The allocation policies, the ease of configuring DM-based striping/mirroring, management of large pools of storage all seem to indicate that Btrfs will manage the physical spindles internally. This is very ZFS-ish (ZFS pools) so I'd like to understand where you see Btrfs going in this area. There's quite a lot of hand waving in that section. What I'd like to do is work closely with the LVM/DM/MD maintainers and come up with something that leverages what linux already does. I don't want to rewrite LVM into the FS, but I do want to make better use of info about the underlying storage. Your initial benchmarks were all done on top of a single disk with an LVM stack yet your roadmap/todo and design speaks to a tighter integration of the volume management features. So long term is traditional LVM/MD functionality to be pulled directly into Btrfs?
2) The Btrfs notion of subvolumes and snapshots is very elegant and provides for a fluid management of the filesystem system data. It feels as though each subvolume/snapshot is just folded into the parent Btrfs volumes' namespace. Was there any particular reason you elected to do this? I can see that it lends itself to allowing snapshots of snapshots. If you could elaborate I'd appreciate it. Yes, I wanted snapshots to be writable and resnapshottable. It also lowers the complexity to keep each snapshot as a subvolume/tree. subvolumes are only slightly more expensive than a directory. So, even though a subvolume is a large grained unit for a snapshot, you can get around this by just making more subvolumes. In practice subvolumes and/or snapshots appear to be implicitly mounted upon creation (refcount of parent is incremented). Is this correct? For snapshots, this runs counter to mapping the snapshots' data into the namespace of the origin Btrfs (e.g. with a .snapshot dir, but this is only useful for read-only snaps). Having snapshot namespaces in terms of monolithic subvolumes puts a less intuitive face on N Btrfs snapshots. The history of a given file/dir feels to be lost with this model. That's somewhat true, the disk format does have enough information to show you that history, but cleanly expressing it to the user is a daunting task. Aside from folding snapshot history into the origin's namespace... It could be possible to have a mount.btrfs that allows subvolumes and/or snapshot volumes to be mounted as unique roots? I'd imagine a bind mount _could_ provide this too? Anyway, I'm just interested in understanding the vision for managing the potentially complex nature of a Btrfs namespace. One option is to put the real btrfs root into some directory in (/sys/fs/btrfs/$device?) and then use tools in userland to mount -o bind outside of that. I wanted to wait to get fancy until I had a better idea of how people would use the feature. 
Thanks for doing all this work; I think the Linux community got a much needed shot in the arm with this Btrfs announcement. Thanks for the comments. -chris
Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
On Wed, May 16, 2007 at 08:09:19PM +0800, David Woodhouse wrote: On Wed, 2007-05-16 at 11:19 +0100, David Howells wrote: The start and end points passed to block_prepare_write() delimit the region of the page that is going to be modified. This means that prepare_write() doesn't need to fill it in if the page is not up to date. Really? Is it _really_ going to be modified? Even if the pointer userspace gave to write() is bogus, and is going to fault half-way through the copy_from_user()? This is why there are so many variations on copy_from_user that zero on faults. One way or another, the prepare_write/commit_write pair are responsible for filling it in. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1 of 2] block_page_mkwrite() Implementation V2
On Wed, May 16, 2007 at 11:04:11PM +1000, Nick Piggin wrote: Chris Mason wrote: On Wed, May 16, 2007 at 08:09:19PM +0800, David Woodhouse wrote: On Wed, 2007-05-16 at 11:19 +0100, David Howells wrote: The start and end points passed to block_prepare_write() delimit the region of the page that is going to be modified. This means that prepare_write() doesn't need to fill it in if the page is not up to date. Really? Is it _really_ going to be modified? Even if the pointer userspace gave to write() is bogus, and is going to fault half-way through the copy_from_user()? This is why there are so many variations on copy_from_user that zero on faults. One way or another, the prepare_write/commit_write pair are responsible for filling it in. I'll add to David's question about David's comment on David's patch, yes it will be modified but in that case it would be zero-filled as Chris says. However I believe this is incorrect behaviour. It is possible to easily fix that so it would only happen via a tiny race window (where the source memory gets unmapped at just the right time) however nobody seemed too interested (just by checking the return value of fault_in_pages_readable). The buffered write patches I'm working on fix that (among other things) of course. But they do away with prepare_write and introduce new aops, and they indeed must not expect the full range to have been written to. I was also wrong to say prepare_write and commit_write are responsible, they work together with their callers to make the right things happen. Oh well, so much for trying to give a short answer for a chunk of code full of corner cases ;) -chris
Re: [PATCH 4 of 8] Add flags to control direct IO helpers
On Thu, Feb 08, 2007 at 09:33:05AM +0530, Suparna Bhattacharya wrote: On Wed, Feb 07, 2007 at 01:05:44PM -0500, Chris Mason wrote: On Wed, Feb 07, 2007 at 10:38:45PM +0530, Suparna Bhattacharya wrote:

+ * The flags parameter is a bitmask of:
+ *
+ * DIO_PLACEHOLDERS (use placeholder pages for locking)
+ * DIO_CREATE (pass create=1 to get_block for filling holes or extending)

A little more explanation about why these options are needed, and examples of when one would specify each of these options would be good. I'll extend the comments in the patch, but for discussion here: DIO_PLACEHOLDERS: placeholders are inserted into the page cache to synchronize the DIO with buffered writes. From a locking point of view, this is similar to inserting and locking pages in the address space corresponding to the DIO. placeholders guard against concurrent allocations and truncates during the DIO. You don't need placeholders if truncates and allocations are impossible (for example, on a block device). Likewise placeholders may not be needed if the underlying filesystem already takes care of locking to synchronize DIO vs buffered. True, although I don't think any FS covers 100% of the cases right now. DIO_CREATE: placeholders make it possible for filesystems to safely fill holes and extend the file via get_block during the DIO. If DIO_CREATE is turned on, get_block will be called with create=1, allowing the FS to allocate blocks during the DIO. When would one NOT specify DIO_CREATE, and what are the implications? The purpose of having an option of NOT allowing the FS to allocate blocks during DIO is not very intuitive from the standpoint of the caller (the block device case could be an example, but then create=1 could not do any harm or add extra overhead, so why bother?). DIO has fallen back to buffered IO for so long that I wanted filesystems to explicitly choose the create=1 for now.
A good example is my patch for ext3, where the ext3 get_block routine
needed to be changed to start a transaction instead of finding the
current trans in current->journal_info.  The reiserfs DIO get_block
needed to be told not to expect i_mutex to be held, etc etc.

> Is there still a valid case where we fall back to buffered IO to fill
> holes?  To me that seems to be the only situation where create=0 must
> be enforced.

Right, when create=0 we fall back, otherwise we don't.

> > DIO_DROP_I_MUTEX: If the write is inside of i_size, i_mutex is
> > dropped during the DIO and taken again before returning.
>
> Again an example of when one would not specify this (block device and
> XFS?) would be useful.

If the FS can't fill a hole or extend the file without i_mutex, or if
the caller has already dropped i_mutex themselves.  I think this is
only XFS right now; the long term goal is to make placeholders fast
enough for XFS to use.

-chris
Re: [PATCH 1 of 2] Implement generic block_page_mkwrite() functionality
On Thu, Feb 08, 2007 at 09:50:13AM +1100, David Chinner wrote:
> > You don't need to lock out all truncation, but you do need to lock
> > out truncation of the page in question.  Instead of your i_size
> > checks, check page->mapping isn't NULL after the lock_page?
>
> Yes, that can be done, but we still need to know if part of the page
> is beyond EOF for when we call block_commit_write() and mark buffers
> dirty.  Hence we need to check the inode size.  I guess if we block
> the truncate with the page lock, then the inode size is not going to
> change until we unlock the page.  If the inode size has already been
> changed but the page not yet removed from the mapping, we'll be beyond
> EOF.  So it seems to me that we can get away with not using the
> i_mutex in the generic code here.

vmtruncate changes the inode size before waiting on any pages.  So,
i_size could change at any time during page_mkwrite.  Since the patch
does:

	if (((page->index + 1) << PAGE_CACHE_SHIFT) > i_size_read(inode))
		end = i_size_read(inode) & ~PAGE_CACHE_MASK;
	else
		end = PAGE_CACHE_SIZE;

it would be a good idea to read i_size once and put it in a local var
instead.

The FS truncate op should be locking the last page in the file to make
sure it is properly zero filled.  The worst case should be that we zero
too many bytes in page_mkwrite (an expanding truncate past our current
i_size), but at least it won't expose stale data.

-chris
[PATCH 7 of 8] Adapt XFS to the new blockdev_direct_IO calls
XFS is changed to use blockdev_direct_IO flags instead of
DIO_OWN_LOCKING.

Signed-off-by: Chris Mason [EMAIL PROTECTED]

diff -r 1ab8a2112a7d -r f53fd3802dc9 fs/xfs/linux-2.6/xfs_aops.c
--- a/fs/xfs/linux-2.6/xfs_aops.c	Tue Feb 06 20:02:56 2007 -0500
+++ b/fs/xfs/linux-2.6/xfs_aops.c	Tue Feb 06 20:02:56 2007 -0500
@@ -1392,19 +1392,16 @@ xfs_vm_direct_IO(
 	iocb->private = xfs_alloc_ioend(inode, IOMAP_UNWRITTEN);
 
-	if (rw == WRITE) {
-		ret = blockdev_direct_IO_own_locking(rw, iocb, inode,
-			iomap.iomap_target->bt_bdev,
-			iov, offset, nr_segs,
-			xfs_get_blocks_direct,
-			xfs_end_io_direct);
-	} else {
-		ret = blockdev_direct_IO_no_locking(rw, iocb, inode,
-			iomap.iomap_target->bt_bdev,
-			iov, offset, nr_segs,
-			xfs_get_blocks_direct,
-			xfs_end_io_direct);
-	}
+	/*
+	 * ask DIO not to do any special locking for us, and to always
+	 * pass create=1 to get_block on writes
+	 */
+	ret = blockdev_direct_IO_flags(rw, iocb, inode,
+			iomap.iomap_target->bt_bdev,
+			iov, offset, nr_segs,
+			xfs_get_blocks_direct,
+			xfs_end_io_direct,
+			DIO_CREATE);
 
 	if (unlikely(ret != -EIOCBQUEUED && iocb->private))
 		xfs_destroy_ioend(iocb->private);
[PATCH 8 of 8] Avoid too many boundary buffers in DIO
Dave Chinner found a 10% performance regression with ext3 when using
DIO to fill holes instead of buffered IO.  On large IOs, the ext3
get_block routine will send more than a page worth of blocks back to
DIO via a single buffer_head with a large b_size value.

The DIO code iterates through this massive block and tests for a
boundary buffer over and over again.  For every block size unit spanned
by the big map_bh, the boundary bit is tested and a bio may be forced
down to the block layer.

There are two potential fixes; one is to ignore the boundary bit on
large regions returned by the FS.  DIO can't tell which part of the big
region was a boundary, and so it may not be a good idea to trust the
hint.  This patch just clears the boundary bit after using it once.  It
is 10% faster for a streaming DIO write w/blocksize of 512k on my sata
drive.

Signed-off-by: Chris Mason [EMAIL PROTECTED]

diff -r f53fd3802dc9 -r d068ea378c04 fs/direct-io.c
--- a/fs/direct-io.c	Tue Feb 06 20:02:56 2007 -0500
+++ b/fs/direct-io.c	Tue Feb 06 20:02:56 2007 -0500
@@ -625,7 +625,6 @@ static int dio_new_bio(struct dio *dio, 
 	nr_pages = min(dio->pages_in_io, bio_get_nr_vecs(dio->map_bh.b_bdev));
 	BUG_ON(nr_pages <= 0);
 	ret = dio_bio_alloc(dio, dio->map_bh.b_bdev, sector, nr_pages);
-	dio->boundary = 0;
 out:
 	return ret;
 }
@@ -679,12 +678,6 @@ static int dio_send_cur_page(struct dio 
 		 */
 		if (dio->final_block_in_bio != dio->cur_page_block)
 			dio_bio_submit(dio);
-		/*
-		 * Submit now if the underlying fs is about to perform a
-		 * metadata read
-		 */
-		if (dio->boundary)
-			dio_bio_submit(dio);
 	}
 
 	if (dio->bio == NULL) {
@@ -701,6 +694,12 @@ static int dio_send_cur_page(struct dio 
 			BUG_ON(ret != 0);
 		}
 	}
+	/*
+	 * Submit now if the underlying fs is about to perform a
+	 * metadata read
+	 */
+	if (dio->boundary)
+		dio_bio_submit(dio);
 out:
 	return ret;
 }
@@ -727,6 +726,10 @@ submit_page_section(struct dio *dio, str
 		unsigned offset, unsigned len, sector_t blocknr)
 {
 	int ret = 0;
+	int boundary = dio->boundary;
+
+	/* don't let dio_send_cur_page do the boundary too soon */
+	dio->boundary = 0;
 
 	if (dio->rw & WRITE) {
 		/*
@@ -743,17 +746,7 @@ submit_page_section(struct dio *dio, str
 	    (dio->cur_page_block +
 	     (dio->cur_page_len >> dio->blkbits) == blocknr)) {
 		dio->cur_page_len += len;
-
-		/*
-		 * If dio->boundary then we want to schedule the IO now to
-		 * avoid metadata seeks.
-		 */
-		if (dio->boundary) {
-			ret = dio_send_cur_page(dio);
-			page_cache_release(dio->cur_page);
-			dio->cur_page = NULL;
-		}
-		goto out;
+		goto out_send;
 	}
 
 	/*
@@ -772,6 +765,18 @@ submit_page_section(struct dio *dio, str
 	dio->cur_page_offset = offset;
 	dio->cur_page_len = len;
 	dio->cur_page_block = blocknr;
+
+out_send:
+	/*
+	 * If dio->boundary then we want to schedule the IO now to
+	 * avoid metadata seeks.
+	 */
+	if (boundary) {
+		dio->boundary = 1;
+		ret = dio_send_cur_page(dio);
+		page_cache_release(dio->cur_page);
+		dio->cur_page = NULL;
+	}
 out:
 	return ret;
 }
@@ -977,7 +982,16 @@ do_holes:
 			this_chunk_bytes = this_chunk_blocks << blkbits;
 			BUG_ON(this_chunk_bytes == 0);
 
-			dio->boundary = buffer_boundary(map_bh);
+			/*
+			 * get_block may return more than one page worth
+			 * of blocks.  Make sure only the last io we
+			 * send down for this region is a boundary
+			 */
+			if (dio->blocks_available == this_chunk_blocks)
+				dio->boundary = buffer_boundary(map_bh);
+			else
+				dio->boundary = 0;
+
 			ret = submit_page_section(dio, page, offset_in_page,
 				this_chunk_bytes, dio->next_block_for_io);
 			if (ret) {
[PATCH 5 of 8] Make ext3 safe for the new DIO locking rules
This creates a version of ext3_get_block that starts and ends a
transaction.  By starting and ending the transaction inside get_block,
this is able to avoid lock inversion problems when the DIO code tries
to take page locks inside blockdev_direct_IO (transaction locks must
always happen after page locks).

Signed-off-by: Chris Mason [EMAIL PROTECTED]

diff -r 04dd7ddd593e -r 42596f5254ca fs/ext3/inode.c
--- a/fs/ext3/inode.c	Tue Feb 06 20:02:56 2007 -0500
+++ b/fs/ext3/inode.c	Tue Feb 06 20:02:56 2007 -0500
@@ -1673,6 +1673,30 @@ static int ext3_releasepage(struct page 
 	return journal_try_to_free_buffers(journal, page, wait);
 }
 
+static int ext3_get_block_direct_IO(struct inode *inode, sector_t iblock,
+			struct buffer_head *bh_result, int create)
+{
+	int ret = 0;
+	handle_t *handle = ext3_journal_start(inode, DIO_CREDITS);
+	if (IS_ERR(handle)) {
+		ret = PTR_ERR(handle);
+		goto out;
+	}
+	ret = ext3_get_block(inode, iblock, bh_result, create);
+	/*
+	 * Reacquire the handle: ext3_get_block() can restart the transaction
+	 */
+	handle = journal_current_handle();
+	if (handle) {
+		int err;
+		err = ext3_journal_stop(handle);
+		if (!ret)
+			ret = err;
+	}
+out:
+	return ret;
+}
+
 /*
  * If the O_DIRECT write will extend the file then add this inode to the
  * orphan list.  So recovery will truncate it back to the original size
@@ -1693,39 +1717,58 @@ static ssize_t ext3_direct_IO(int rw, st
 	int orphan = 0;
 	size_t count = iov_length(iov, nr_segs);
 
-	if (rw == WRITE) {
-		loff_t final_size = offset + count;
-
+	if (rw == WRITE && (offset + count > inode->i_size)) {
 		handle = ext3_journal_start(inode, DIO_CREDITS);
 		if (IS_ERR(handle)) {
 			ret = PTR_ERR(handle);
 			goto out;
 		}
-		if (final_size > inode->i_size) {
-			ret = ext3_orphan_add(handle, inode);
-			if (ret)
-				goto out_stop;
-			orphan = 1;
-			ei->i_disksize = inode->i_size;
-		}
-	}
-
+		ret = ext3_orphan_add(handle, inode);
+		if (ret) {
+			ext3_journal_stop(handle);
+			goto out;
+		}
+		ei->i_disksize = inode->i_size;
+		ret = ext3_journal_stop(handle);
+		if (ret) {
+			/* something has gone horribly wrong, cleanup
+			 * the orphan list in ram
+			 */
+			if (inode->i_nlink)
+				ext3_orphan_del(NULL, inode);
+			goto out;
+		}
+		orphan = 1;
+	}
+
+	/*
+	 * the placeholder page code may take a page lock, so we have
+	 * to stop any running transactions before calling
+	 * blockdev_direct_IO.  Use ext3_get_block_direct_IO to start
+	 * and stop a transaction on each get_block call.
+	 */
 	ret = blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
 				 offset, nr_segs,
-				 ext3_get_block, NULL);
+				 ext3_get_block_direct_IO, NULL);
 
 	/*
 	 * Reacquire the handle: ext3_get_block() can restart the transaction
 	 */
 	handle = journal_current_handle();
 
-out_stop:
-	if (handle) {
+	if (orphan) {
 		int err;
-
-		if (orphan && inode->i_nlink)
+		handle = ext3_journal_start(inode, DIO_CREDITS);
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			if (inode->i_nlink)
+				ext3_orphan_del(NULL, inode);
+			goto out;
+		}
+
+		if (inode->i_nlink)
 			ext3_orphan_del(handle, inode);
-		if (orphan && ret > 0) {
+		if (ret > 0) {
 			loff_t end = offset + ret;
 			if (end > inode->i_size) {
 				ei->i_disksize = end;
[PATCH 1 of 8] Introduce a place holder page for the pagecache
mm/filemap.c is changed to wait on these before adding a page into the
page cache, and truncates are changed to wait for all of the place
holder pages to disappear.

Place holder pages can only be examined with the mapping lock held.
They cannot be locked, and cannot have references increased or
decreased on them.

Placeholders can span a range bigger than one page.  The placeholder is
inserted into the radix slot for the end of the range, and the flags
field in the page struct is used to record the start of the range.

A bit is added for the radix root (PAGECACHE_TAG_EXTENTS), and when
mm/filemap.c finds that bit set, lookups for an index in the pagecache
search forward to find any placeholders that index may intersect.

Signed-off-by: Chris Mason [EMAIL PROTECTED]

diff -r fc2d683623bb -r 7819e6e3f674 drivers/mtd/devices/block2mtd.c
--- a/drivers/mtd/devices/block2mtd.c	Sun Feb 04 10:44:54 2007 -0800
+++ b/drivers/mtd/devices/block2mtd.c	Tue Feb 06 19:45:28 2007 -0500
@@ -66,7 +66,7 @@ static void cache_readahead(struct addre
 			INFO("Overrun end of disk in cache readahead\n");
 			break;
 		}
-		page = radix_tree_lookup(&mapping->page_tree, pagei);
+		page = radix_tree_lookup_extent(&mapping->page_tree, pagei);
 		if (page && (!i))
 			break;
 		if (page)
diff -r fc2d683623bb -r 7819e6e3f674 include/linux/fs.h
--- a/include/linux/fs.h	Sun Feb 04 10:44:54 2007 -0800
+++ b/include/linux/fs.h	Tue Feb 06 19:45:28 2007 -0500
@@ -490,6 +490,11 @@ struct block_device {
  */
 #define PAGECACHE_TAG_DIRTY	0
 #define PAGECACHE_TAG_WRITEBACK	1
+
+/*
+ * This tag is only valid on the root of the radix tree
+ */
+#define PAGE_CACHE_TAG_EXTENTS	2
 
 int mapping_tagged(struct address_space *mapping, int tag);
diff -r fc2d683623bb -r 7819e6e3f674 include/linux/page-flags.h
--- a/include/linux/page-flags.h	Sun Feb 04 10:44:54 2007 -0800
+++ b/include/linux/page-flags.h	Tue Feb 06 19:45:28 2007 -0500
@@ -263,4 +263,6 @@ static inline void set_page_writeback(st
 	test_set_page_writeback(page);
 }
 
+void set_page_placeholder(struct page *page, pgoff_t start, pgoff_t end);
+
 #endif	/* PAGE_FLAGS_H */
diff -r fc2d683623bb -r 7819e6e3f674 include/linux/pagemap.h
--- a/include/linux/pagemap.h	Sun Feb 04 10:44:54 2007 -0800
+++ b/include/linux/pagemap.h	Tue Feb 06 19:45:28 2007 -0500
@@ -76,6 +76,9 @@ extern struct page * find_get_page(struc
 				unsigned long index);
 extern struct page * find_lock_page(struct address_space *mapping,
 				unsigned long index);
+int find_or_insert_placeholders(struct address_space *mapping,
+				unsigned long start, unsigned long end,
+				gfp_t gfp_mask, int wait);
 extern __deprecated_for_modules struct page * find_trylock_page(
 			struct address_space *mapping, unsigned long index);
 extern struct page * find_or_create_page(struct address_space *mapping,
@@ -86,6 +89,12 @@ unsigned find_get_pages_contig(struct ad
 			unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
+void remove_placeholder_pages(struct address_space *mapping,
+			unsigned long offset, unsigned long end);
+void wake_up_placeholder_page(struct page *page);
+void wait_on_placeholder_pages_range(struct address_space *mapping,
+			pgoff_t start, pgoff_t end);
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
@@ -116,6 +125,8 @@ int add_to_page_cache_lru(struct page *p
 				unsigned long index, gfp_t gfp_mask);
 extern void remove_from_page_cache(struct page *page);
 extern void __remove_from_page_cache(struct page *page);
+struct page *radix_tree_lookup_extent(struct radix_tree_root *root,
+				unsigned long index);
 
 /*
  * Return byte-offset into filesystem object for page.
diff -r fc2d683623bb -r 7819e6e3f674 include/linux/radix-tree.h
--- a/include/linux/radix-tree.h	Sun Feb 04 10:44:54 2007 -0800
+++ b/include/linux/radix-tree.h	Tue Feb 06 19:45:28 2007 -0500
@@ -53,6 +53,7 @@ static inline int radix_tree_is_direct_p
 /*** radix-tree API starts here ***/
 
 #define RADIX_TREE_MAX_TAGS 2
+#define RADIX_TREE_MAX_ROOT_TAGS 3
 
 /* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
 struct radix_tree_root {
@@ -168,6 +169,7 @@ radix_tree_gang_lookup_tag(struct radix_
 		unsigned long first_index, unsigned int max_items,
 		unsigned int tag);
 int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
+void radix_tree_root_tag_set
[PATCH 6 of 8] Make reiserfs safe for new DIO locking rules
reiserfs is changed to use a version of reiserfs_get_block that is safe
for filling holes without i_mutex held.

Signed-off-by: Chris Mason [EMAIL PROTECTED]

diff -r 42596f5254ca -r 1ab8a2112a7d fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c	Tue Feb 06 20:02:56 2007 -0500
+++ b/fs/reiserfs/inode.c	Tue Feb 06 20:02:56 2007 -0500
@@ -469,7 +469,8 @@ static int reiserfs_get_blocks_direct_io
 	bh_result->b_size = (1 << inode->i_blkbits);
 
 	ret = reiserfs_get_block(inode, iblock, bh_result,
-				 create | GET_BLOCK_NO_DANGLE);
+				 create | GET_BLOCK_NO_DANGLE |
+				 GET_BLOCK_NO_IMUX);
 	if (ret)
 		goto out;
[PATCH 4 of 8] Add flags to control direct IO helpers
This creates a number of flags so that filesystems can control
blockdev_direct_IO.  It is based on code from Russell Cettelan.

The new flags are:

DIO_CREATE -- always pass create=1 to get_block on writes.  This allows
DIO to fill holes in the file.

DIO_PLACEHOLDERS -- use placeholder pages to provide locking against
buffered io and truncates.

DIO_DROP_I_MUTEX -- drop i_mutex before starting the mapping, io
submission, or io waiting.  The mutex is still dropped for AIO as well.

Some API changes are made so that filesystems can have more control
over the DIO features.  __blockdev_direct_IO is more or less renamed to
blockdev_direct_IO_flags.  All waiting and invalidating of page cache
data is pushed down into blockdev_direct_IO_flags (and removed from
mm/filemap.c).

direct_io_worker is exported into the wild.  Filesystems that want to
be special can pull out the bits of blockdev_direct_IO_flags they care
about and then call direct_io_worker directly.

Signed-off-by: Chris Mason [EMAIL PROTECTED]

diff -r 1a7105ab9c19 -r 04dd7ddd593e fs/direct-io.c
--- a/fs/direct-io.c	Tue Feb 06 20:02:55 2007 -0500
+++ b/fs/direct-io.c	Tue Feb 06 20:02:56 2007 -0500
@@ -1,4 +1,3 @@
-			GFP_KERNEL, 1);
 /*
  * fs/direct-io.c
  *
@@ -55,13 +54,6 @@
  *
  * If blkfactor is zero then the user's request was aligned to the filesystem's
  * blocksize.
- *
- * lock_type is DIO_LOCKING for regular files on direct-IO-naive filesystems.
- * This determines whether we need to do the fancy locking which prevents
- * direct-IO from being able to read uninitialised disk blocks.  If its zero
- * (blockdev) this locking is not done, and if it is DIO_OWN_LOCKING i_mutex is
- * not held for the entire direct write (taken briefly, initially, during a
- * direct read though, but its never held for the duration of a direct-IO).
  */
 
 struct dio {
@@ -70,8 +62,7 @@ struct dio {
 	struct inode *inode;
 	int rw;
 	loff_t i_size;		/* i_size when submitted */
-	int lock_type;		/* doesn't change */
-	int reacquire_i_mutex;	/* should we get i_mutex when done? */
+	unsigned flags;		/* locking and get_block flags */
 	unsigned blkbits;	/* doesn't change */
 	unsigned blkfactor;	/* When we're using an alignment which
 				   is finer than the filesystem's soft
@@ -211,7 +202,7 @@ out:
 static void dio_unlock_page_range(struct dio *dio)
 {
-	if (dio->lock_type != DIO_NO_LOCKING) {
+	if (dio->flags & DIO_PLACEHOLDERS) {
 		remove_placeholder_pages(dio->inode->i_mapping,
 					 dio->fspages_start_off,
 					 dio->fspages_end_off);
@@ -226,7 +217,7 @@ static int dio_lock_page_range(struct di
 	unsigned long max_size;
 	int ret = 0;
 
-	if (dio->lock_type == DIO_NO_LOCKING)
+	if (!(dio->flags & DIO_PLACEHOLDERS))
 		return 0;
 
 	while (index <= dio->fspages_end_off) {
@@ -310,9 +301,6 @@ static int dio_complete(struct dio *dio,
 			dio->map_bh.b_private);
 	dio_unlock_page_range(dio);
 
-	if (dio->reacquire_i_mutex)
-		mutex_lock(&dio->inode->i_mutex);
-
 	if (ret == 0)
 		ret = dio->page_errors;
 	if (ret == 0)
@@ -597,8 +585,9 @@ static int get_more_blocks(struct dio *d
 		map_bh->b_state = 0;
 		map_bh->b_size = fs_count << dio->inode->i_blkbits;
 
-		create = dio->rw & WRITE;
-		if (dio->lock_type == DIO_NO_LOCKING)
+		if (dio->flags & DIO_CREATE)
+			create = dio->rw & WRITE;
+		else
 			create = 0;
 		index = fs_startblk >> (PAGE_CACHE_SHIFT -
 					dio->inode->i_blkbits);
@@ -1014,19 +1003,41 @@ out:
 	return ret;
 }
 
-static ssize_t
-direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
-	const struct iovec *iov, loff_t offset, unsigned long nr_segs,
+/*
+ * This does all the real work of the direct io.  Most filesystems want to
+ * call blockdev_direct_IO_flags instead, but if you have exotic locking
+ * routines you can call this directly.
+ *
+ * The flags parameter is a bitmask of:
+ *
+ * DIO_PLACEHOLDERS (use placeholder pages for locking)
+ * DIO_CREATE (pass create=1 to get_block for filling holes or extending)
+ * DIO_DROP_I_MUTEX (drop inode->i_mutex during writes)
+ */
+ssize_t
+direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
+	const struct iovec *iov, loff_t offset, unsigned long nr_segs,
 	unsigned blkbits, get_block_t get_block, dio_iodone_t end_io,
-	struct dio *dio)
-{
-	unsigned long user_addr;
+	int is_async, unsigned dioflags
Re: [PATCH 4 of 8] Add flags to control direct IO helpers
On Wed, Feb 07, 2007 at 10:38:45PM +0530, Suparna Bhattacharya wrote:
> > + * The flags parameter is a bitmask of:
> > + *
> > + * DIO_PLACEHOLDERS (use placeholder pages for locking)
> > + * DIO_CREATE (pass create=1 to get_block for filling holes or extending)
>
> A little more explanation about why these options are needed, and
> examples of when one would specify each of these options would be good.

I'll extend the comments in the patch, but for discussion here:

DIO_PLACEHOLDERS: placeholders are inserted into the page cache to
synchronize the DIO with buffered writes.  From a locking point of
view, this is similar to inserting and locking pages in the address
space corresponding to the DIO.  Placeholders guard against concurrent
allocations and truncates during the DIO.  You don't need placeholders
if truncates and allocations are impossible (for example, on a block
device).

DIO_CREATE: placeholders make it possible for filesystems to safely
fill holes and extend the file via get_block during the DIO.  If
DIO_CREATE is turned on, get_block will be called with create=1,
allowing the FS to allocate blocks during the DIO.

DIO_DROP_I_MUTEX: If the write is inside of i_size, i_mutex is dropped
during the DIO and taken again before returning.

-chris