Re: Proposal for proper durable fsync() and fdatasync()
On Tuesday 26 February 2008 18:59, Jamie Lokier wrote:

Andrew Morton wrote: On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier [EMAIL PROTECTED] wrote:

(It would be nicer if sync_file_range() took a vector of ranges for better elevator scheduling, but let's ignore that :-)

Two passes:
Pass 1: shove each of the segments into the queue with SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
Pass 2: wait for them all to complete and return accumulated result with SYNC_FILE_RANGE_WAIT_AFTER

Thanks. Seems ok, though being able to cork the I/O until the last one would be a bonus (like TCP_MORE... SYNC_FILE_RANGE_MORE?)

I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE. Is there a reason why you have it there? The man page isn't very enlightening.

Yeah, sync_file_range has slightly unusual semantics and introduces a new concept, writeout, to userspace (does writeout include data in the drive cache? the kernel doesn't think so, but the only way to make sync_file_range safe is if you do consider it writeout). If it makes it any easier to understand, we can add in SYNC_FILE_ASYNC, SYNC_FILE_SYNC parts that just deal with safe/unsafe and sync/async semantics that are part of the normal POSIX API.

Anyway, making fsync/fdatasync etc. safe by default is a good idea IMO, and it is a bad bug that we don't do that :(

- To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
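For reference, the two-pass scheme described above can be sketched from userspace roughly as follows. This is a minimal sketch, not code from the thread: the `struct frange` type and the `fsync_ranges` helper are my names, and (as the thread itself points out) this still does not guarantee durability past a volatile drive write cache.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>

/* A hypothetical range descriptor; not part of any kernel API. */
struct frange {
	off_t offset;
	off_t nbytes;
};

/*
 * Two-pass range sync:
 * pass 1 queues writeout for every range (waiting first for any
 * writeout already in flight on those pages), pass 2 waits for all
 * of them and accumulates the result.
 */
static int fsync_ranges(int fd, const struct frange *r, size_t n)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < n; i++)
		if (sync_file_range(fd, r[i].offset, r[i].nbytes,
				SYNC_FILE_RANGE_WAIT_BEFORE |
				SYNC_FILE_RANGE_WRITE) != 0)
			ret = -1;

	for (i = 0; i < n; i++)
		if (sync_file_range(fd, r[i].offset, r[i].nbytes,
				SYNC_FILE_RANGE_WAIT_AFTER) != 0)
			ret = -1;

	return ret;
}
```

Linux-only, since sync_file_range(2) is a Linux-specific call.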
Re: [PATCH] [0/18] Implement some low hanging BKL removal fruit in fs/*
On Sunday 27 January 2008 13:17, Andi Kleen wrote:

[Andrew: I believe this is -mm material for .25]

- Convert some more file systems (generally those that don't use the BKL for anything except mount) to use unlocked_bkl.
- Implement BKL-less fasync (see patch for the rationale). This is currently a separate entry point, but since the number of fasync users in the tree is relatively small I hope the older entry point can be removed in the not too distant future [help from other people converting more fasync users to unlocked_fasync would be appreciated]
- Implement BKL-less remote_llseek
- While I was at it I also added a few missing compat ioctl handlers
- Fix a few comments

This fixes a lot of relatively trivial BKL users in fs/*. The main remaining non-legacy offenders are now locks.c, nfs/nfsd and reiserfs. I believe BKL removal for all of those is being worked on by other people. Also a lot of legacy file systems still use it, but converting those does not seem to be very pressing.

BTW, here is a patch I did a while back for minix. I know it isn't a big deal, but the work is done so I guess I should send it along.

The minix filesystem uses the BKL to protect access to metadata. Switch to a per-superblock mutex.
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

Index: linux-2.6/fs/minix/bitmap.c
===================================================================
--- linux-2.6.orig/fs/minix/bitmap.c
+++ linux-2.6/fs/minix/bitmap.c
@@ -69,11 +69,11 @@ void minix_free_block(struct inode *inod
 		return;
 	}
 	bh = sbi->s_zmap[zone];
-	lock_kernel();
+	mutex_lock(&sbi->s_mutex);
 	if (!minix_test_and_clear_bit(bit, bh->b_data))
 		printk("minix_free_block (%s:%lu): bit already cleared\n",
 		       sb->s_id, block);
-	unlock_kernel();
+	mutex_unlock(&sbi->s_mutex);
 	mark_buffer_dirty(bh);
 	return;
 }
@@ -88,18 +88,18 @@ int minix_new_block(struct inode * inode
 		struct buffer_head *bh = sbi->s_zmap[i];
 		int j;
 
-		lock_kernel();
+		mutex_lock(&sbi->s_mutex);
 		j = minix_find_first_zero_bit(bh->b_data, bits_per_zone);
 		if (j < bits_per_zone) {
 			minix_set_bit(j, bh->b_data);
-			unlock_kernel();
+			mutex_unlock(&sbi->s_mutex);
 			mark_buffer_dirty(bh);
 			j += i * bits_per_zone + sbi->s_firstdatazone-1;
 			if (j < sbi->s_firstdatazone || j >= sbi->s_nzones)
 				break;
 			return j;
 		}
-		unlock_kernel();
+		mutex_unlock(&sbi->s_mutex);
 	}
 	return 0;
 }
@@ -211,10 +211,10 @@ void minix_free_inode(struct inode * ino
 	minix_clear_inode(inode);	/* clear on-disk copy */
 
 	bh = sbi->s_imap[ino];
-	lock_kernel();
+	mutex_lock(&sbi->s_mutex);
 	if (!minix_test_and_clear_bit(bit, bh->b_data))
 		printk("minix_free_inode: bit %lu already cleared\n", bit);
-	unlock_kernel();
+	mutex_unlock(&sbi->s_mutex);
 	mark_buffer_dirty(bh);
 out:
 	clear_inode(inode);	/* clear in-memory copy */
@@ -237,7 +237,7 @@ struct inode * minix_new_inode(const str
 	j = bits_per_zone;
 	bh = NULL;
 	*error = -ENOSPC;
-	lock_kernel();
+	mutex_lock(&sbi->s_mutex);
 	for (i = 0; i < sbi->s_imap_blocks; i++) {
 		bh = sbi->s_imap[i];
 		j = minix_find_first_zero_bit(bh->b_data, bits_per_zone);
@@ -245,17 +245,17 @@ struct inode * minix_new_inode(const str
 			break;
 	}
 	if (!bh || j >= bits_per_zone) {
-		unlock_kernel();
+		mutex_unlock(&sbi->s_mutex);
 		iput(inode);
 		return NULL;
 	}
 	if (minix_test_and_set_bit(j, bh->b_data)) {	/* shouldn't happen */
-		unlock_kernel();
+		mutex_unlock(&sbi->s_mutex);
 		printk("minix_new_inode: bit already set\n");
 		iput(inode);
 		return NULL;
 	}
-	unlock_kernel();
+	mutex_unlock(&sbi->s_mutex);
 	mark_buffer_dirty(bh);
 	j += i * bits_per_zone;
 	if (!j || j > sbi->s_ninodes) {
Index: linux-2.6/fs/minix/dir.c
===================================================================
--- linux-2.6.orig/fs/minix/dir.c
+++ linux-2.6/fs/minix/dir.c
@@ -102,7 +102,7 @@ static int minix_readdir(struct file * f
 	char *name;
 	__u32 inumber;
 
-	lock_kernel();
+	mutex_lock(&sbi->s_mutex);
 
 	pos = (pos + chunk_size-1) & ~(chunk_size-1);
 	if (pos >= inode->i_size)
@@ -146,7 +146,7 @@ static int minix_readdir(struct file * f
 
 done:
 	filp->f_pos = (n << PAGE_CACHE_SHIFT) | offset;
-	unlock_kernel();
+	mutex_unlock(&sbi->s_mutex);
 	return 0;
 }
Index: linux-2.6/fs/minix/inode.c
===================================================================
--- linux-2.6.orig/fs/minix/inode.c
+++ linux-2.6/fs/minix/inode.c
@@ -174,6 +174,7 @@ static int minix_fill_super(struct super
 	sbi->s_firstdatazone = ms->s_firstdatazone;
 	sbi->s_log_zone_size = ms->s_log_zone_size;
 	sbi->s_max_size = ms->s_max_size;
+	mutex_init(&sbi->s_mutex);
 	s->s_magic = ms->s_magic;
 	if (s->s_magic == MINIX_SUPER_MAGIC) {
 		sbi->s_version = MINIX_V1;
Index: linux-2.6/fs/minix/minix.h
===================================================================
--- linux-2.6.orig/fs/minix/minix.h
+++ linux-2.6/fs/minix/minix.h
@@ -1,6 +1,7 @@
 #include <linux/fs.h>
 #include <linux/pagemap.h>
 #include <linux/minix_fs.h>
+#include <linux/mutex.h>
 
 /*
  * change the define below to 0
Re: [PATCH][RFC] fast file mapping for loop
On Wednesday 09 January 2008 19:52, Jens Axboe wrote:

So how does it work? Instead of punting IO to a thread and passing it through the page cache, we instead attempt to send the IO directly to the filesystem block that it maps to.

You told Christoph that just using direct-IO from kernel still doesn't give you the required behaviour... What about queueing the IO directly *and* using direct-IO? I guess it still has to go through the underlying filesystem, but that's probably a good thing.

loop maintains a prio tree of known extents in the file (populated lazily on demand, as needed).

Just a quick question (I haven't looked closely at the code): how come you are using a prio tree for extents? I don't think they could be overlapping?
Re: [PATCH 36/42] VFS: export drop_pagecache_sb
On Friday 14 December 2007 02:24, Erez Zadok wrote:

In message [EMAIL PROTECTED], Nick Piggin writes: On Monday 10 December 2007 13:42, Erez Zadok wrote: Needed to maintain cache coherency after branch management.

Hmm, I'd much prefer to be able to sleep in invalidate_mapping_pages before this function gets exported. As it is, it can cause massive latencies on preemption and the inode_lock so it is pretty much debug-only IMO. I'd rather it didn't escape into the wild as is. Either that or rework your cache coherency somehow.

Nick, thanks for the advice. We use a generation number after each successful branch configuration command, so that ->d_revalidate later on can discover that change, and rebuild the union of objects. At ->remount time, I figured it'd be nice to encourage that revalidation to happen sooner, by invalidating as many upper pages as possible, thus causing ->d_revalidate/->readpage to take place sooner. So we used to call drop_pagecache_sb from our remount code: it was the only caller of drop_pagecache_sb. It wasn't too much of a latency issue to call drop_pagecache_sb there: the VFS remount code path is already pretty slow (dropping temporarily to readonly mode, and dropping other caches), and remount isn't an operation used often, so a little bit more latency would probably not have been noticed by users.

Well, a large, infrequent spike is the most damaging to latency-sensitive users. And anyway, I guess the infrequency of remount means it doesn't have to be really efficient with invalidating pagecache either.

Nevertheless, it was not strictly necessary to call drop_pagecache_sb in unionfs_remount, because the objects in question will have gotten revalidated sooner or later anyway; the call to drop_pagecache_sb was just an optimization (one which I wasn't 100% sure about anyway, as per my long XXX comment above that call in unionfs_remount).
So I agree with you: if this symbol can be abused by modules and cause problems, then exporting it to modules is too risky. I've reworked my code to avoid calling drop_pagecache_sb and I'll [sic] drop that patch.

Thanks, I'd be much happier with that.
Re: [patch] mm: fix XIP file writes
On Mon, Dec 10, 2007 at 03:38:20PM +0100, Christian Borntraeger wrote:

Hi Nick,

Here we go. See, brd already found a bug ;) Can you apply the ext2 XIP patch too? And I'll resend the brd XIP patch.

[...] Writing to XIP files at a non-page-aligned offset results in data corruption because the writes were always sent to the start of the page. [...]

@@ -314,7 +314,7 @@ __xip_file_write(struct file *filp, cons
 		fault_in_pages_readable(buf, bytes);
 		kaddr = kmap_atomic(page, KM_USER0);
 		copied = bytes -
-			__copy_from_user_inatomic_nocache(kaddr, buf, bytes);
+			__copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 		kunmap_atomic(kaddr, KM_USER0);
 		flush_dcache_page(page);

I asked myself why this problem never happened before. So I asked our testers to reproduce this problem on 2.6.23 and service levels. As the testcase did not trigger, I looked into the 2.6.23 code. This problem was introduced by commit 4a9e5ef1f4f15205e477817a5cefc34bd3f65f55 (mm: write iovec cleanup from Nick Piggin) during 2.6.24-rc:

snip---
-		copied = filemap_copy_from_user(page, offset, buf, bytes);
[...]
+		copied = bytes -
+			__copy_from_user_inatomic_nocache(kaddr, buf, bytes);
---

So yes, it's good to have xip on brd. It even tests your changes ;-) Good news is that we don't need anything for stable.

Heh ;) that explains a lot. Thanks!
Re: [PATCH 36/42] VFS: export drop_pagecache_sb
On Monday 10 December 2007 13:42, Erez Zadok wrote:

Needed to maintain cache coherency after branch management.

Hmm, I'd much prefer to be able to sleep in invalidate_mapping_pages before this function gets exported. As it is, it can cause massive latencies on preemption and the inode_lock so it is pretty much debug-only IMO. I'd rather it didn't escape into the wild as is. Either that or rework your cache coherency somehow.

Signed-off-by: Erez Zadok [EMAIL PROTECTED]
---
 fs/drop_caches.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 59375ef..90410ac 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -3,6 +3,7 @@
  */
 
 #include <linux/kernel.h>
+#include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/fs.h>
 #include <linux/writeback.h>
@@ -12,7 +13,7 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
-static void drop_pagecache_sb(struct super_block *sb)
+void drop_pagecache_sb(struct super_block *sb)
 {
 	struct inode *inode;
 
@@ -24,6 +25,7 @@ static void drop_pagecache_sb(struct super_block *sb)
 	}
 	spin_unlock(&inode_lock);
 }
+EXPORT_SYMBOL(drop_pagecache_sb);
 
 void drop_pagecache(void)
 {
Re: [patch] ext2: xip check fix
On Thu, Dec 06, 2007 at 09:43:27AM +0100, Carsten Otte wrote:

Nick Piggin wrote:

Xip does only work if both do match PAGE_SIZE, because it doesn't support multiple calls to direct_access in the get_xip_page address space operation. Thus we check both here; actually this was changed from how it looks after your patch as a bugfix where our tester tried a 4k filesystem on a 2k blockdev. Did I miss something?

However, the bdev block size may be changed with sb_set_blocksize. It doesn't actually have to match the hardware sector size -- if this does matter for XIP, then I think you need some other check here.

Hhh. For a bdev with PAGE_SIZE hardsect size, there is no other valid value than PAGE_SIZE that one could set it to. Or can it indeed be changed to a value greater than PAGE_SIZE or smaller than hardsect size?

It can't be made smaller (or larger, in current kernels). But you already get all that checking done for you -- both by checking that the filesystem blocksize == PAGE_SIZE, and by the error checking in sb_set_blocksize. After my patch, we can do XIP on a hardsect size < PAGE_SIZE block device -- this seems to be a fine thing to do at least for the ramdisk code. Would this situation be problematic for existing drivers, and if so, in what way?

Thanks, Nick
Re: [patch] ext2: xip check fix
On Thu, Dec 06, 2007 at 10:59:02AM +0100, Carsten Otte wrote:

Nick Piggin wrote: After my patch, we can do XIP on a hardsect size < PAGE_SIZE block device -- this seems to be a fine thing to do at least for the ramdisk code. Would this situation be problematic for existing drivers, and if so, in what way?

I have done some archeology, and our ancient CVS logs show this check was introduced in early 2005 into our 2.6.x codebase. However, it existed way before, and was copied from our prehistorical ext2 split named xip2 back in the old days of 2.4.x, where we did not really have a block device behind because that one was scamped into the file system in a very queer way.

OK, thanks for taking a look at that. It will be helpful for testing XIP with my new ramdisk driver (did you see the patch?).

After all, I don't see any risk in removing the check. The only driver we have that does direct_access is drivers/s390/block/dcssblk.c, and that one only supports block_size == PAGE_SIZE. I think the patch should go into mainline.

Actually another one's recently sprung up too (arch/powerpc/sysdev/axonram.c), but it looks like that one should be fine as it looks to be all simply memory mapped.

Acked-by: Carsten Otte [EMAIL PROTECTED]

Thanks! Andrew, would you queue this up for 2.6.25 please?
Re: [patch] ext2: xip check fix
On Thu, Dec 06, 2007 at 10:17:39PM -0600, Rob Landley wrote:

On Thursday 06 December 2007 21:22:25 Jared Hulbert wrote: I haven't looked at it yet. I do appreciate it, I think it might broaden the user-base of this feature which is up to now s390-only due to the fact that the flash memory extensions have not been implemented (yet?). And it enables testing xip on other platforms. The patch is on my must-read list.

query: which feature is currently s390 only? (Execute In Place?)

I think so. The filemap_xip.c functionality doesn't work for Flash memory yet. Flash memory doesn't have struct pages to back it up with, which this stuff depends on.

Um, trying to clarify: S390. Also known as zSeries, big iron machine, uses its own weird processor design rather than x86, x86-64, arm, or mips processors. How does struct page enter into this? What I want to know is, are you saying execute in place doesn't work on things like arm and mips? (If so, I was unaware of this. I heard about somebody getting it to work on a Nintendo DS: http://forums.maxconsole.net/showthread.php?t=18668 )

It's just that the only device driver with XIP support for the moment is s390 (and an obscure powerpc one). Now with my ramdisk patch, anybody can use it. That's just using regular RAM, but there is nothing preventing XIP backed by more exotic storage, provided the CPU can natively address and execute from it.
Re: [patch] rewrite rd
On Tue, Dec 04, 2007 at 01:55:17AM -0600, Rob Landley wrote: On Monday 03 December 2007 22:26:28 Nick Piggin wrote: There is one slight downside -- direct block device access and filesystem metadata access goes through an extra copy and gets stored in RAM twice. However, this downside is only slight, because the real buffercache of the device is now reclaimable (because we're not playing crazy games with it), so under memory intensive situations, footprint should effectively be the same -- maybe even a slight advantage to the new driver because it can also reclaim buffer heads. For the embedded world, initramfs has pretty much rendered initrd obsolete, and that was the biggest user of the ramdisk code I know of. Beyond that, loopback mounts give you more flexible transient block devices than ramdisks do. (In fact, ramdisks are such an amazing pain to use/size/free that if I really needed something like that I'd just make a loopback mount in a ramfs instance.) They are, although we could easily hook up a couple more ioctls for them now (if anybody is interested). The rd driver can potentially be a _lot_ more scalable than the loop driver. It's completely lockless in the fastpath and doesn't even dirty any cachelines except for the actual destination memory. OK, this is only really important for testing purposes (eg. testing scalability of a filesystem etc.) But it is one reason why I (personally) want brd... Embedded users who still want a block interface for memory are generally trying to use a cramfs or squashfs image out of ROM or flash, although there are flash-specific filesystems for this and I dunno if they're actually mounting /dev/mem at an offset or something (md? losetup -o? Beats me, I haven't tried that myself yet...) OK, it would be interesting to hear from anyone using rd for things like cramfs. I don't think we can get rid of the code entirely, but it sounds like the few downsides of the new code won't be big problems. Thanks for the feedback. 
Re: [patch] rewrite rd
On Tue, Dec 04, 2007 at 10:54:51AM +0100, Christian Borntraeger wrote:

On Tuesday, 4 December 2007, Nick Piggin wrote: [...] There is one slight downside -- direct block device access and filesystem metadata access goes through an extra copy and gets stored in RAM twice. However, this downside is only slight, because the real buffercache of the device is now reclaimable (because we're not playing crazy games with it), so under memory intensive situations, footprint should effectively be the same -- maybe even a slight advantage to the new driver because it can also reclaim buffer heads.

This is just an idea, I don't know if it is worth the trouble, but have you thought about implementing direct_access for brd? That would allow execute-in-place (xip) on brd, eliminating the extra copy.

Actually that's a pretty good idea. It would allow xip to be tested without special hardware as well... I'll see what the patch looks like. Thanks
[patch] rd: support XIP
On Tue, Dec 04, 2007 at 11:10:09AM +0100, Nick Piggin wrote:

This is just an idea, I dont know if it is worth the trouble, but have you though about implementing direct_access for brd? That would allow execute-in-place (xip) on brd eliminating the extra copy.

Actually that's a pretty good idea. It would allow xip to be tested without special hardware as well... I'll see what the patch looks like. Thanks

This seems to work, although I needed another patch for ext2 too. Never run XIP before, but I'm able to execute files out of the -oxip mount, so I'm guessing it's working ;)

---
Support direct_access XIP method with brd.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/drivers/block/Kconfig
===================================================================
--- linux-2.6.orig/drivers/block/Kconfig
+++ linux-2.6/drivers/block/Kconfig
@@ -346,6 +346,16 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
 
+config BLK_DEV_XIP
+	bool "Support XIP filesystems on RAM block device"
+	depends on BLK_DEV_RAM
+	default n
+	help
+	  Support XIP filesystems (such as ext2 with XIP support on) on
+	  top of block ram device. This will slightly enlarge the kernel, and
+	  will prevent RAM block device backing store memory from being
+	  allocated from highmem (only a problem for highmem systems).
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -89,6 +89,7 @@ static struct page *brd_insert_page(stru
 {
 	pgoff_t idx;
 	struct page *page;
+	gfp_t gfp_flags;
 
 	page = brd_lookup_page(brd, sector);
 	if (page)
@@ -97,7 +98,16 @@ static struct page *brd_insert_page(stru
 	/*
 	 * Must use NOIO because we don't want to recurse back into the
 	 * block or filesystem layers from page reclaim.
+	 *
+	 * Cannot support XIP and highmem, because our ->direct_access
+	 * routine for XIP must return memory that is always addressable.
+	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * restriction might be able to be lifted.
 	 */
+	gfp_flags = GFP_NOIO | __GFP_ZERO;
+#ifndef CONFIG_BLK_DEV_XIP
+	gfp_flags |= __GFP_HIGHMEM;
+#endif
 	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
 	if (!page)
 		return NULL;
@@ -307,6 +317,28 @@ out:
 	return 0;
 }
 
+#ifdef CONFIG_BLK_DEV_XIP
+static int brd_direct_access (struct block_device *bdev, sector_t sector,
+			unsigned long *data)
+{
+	struct brd_device *brd = bdev->bd_disk->private_data;
+	struct page *page;
+
+	if (!brd)
+		return -ENODEV;
+	if (sector & (PAGE_SECTORS-1))
+		return -EINVAL;
+	if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk))
+		return -ERANGE;
+	page = brd_insert_page(brd, sector);
+	if (!page)
+		return -ENOMEM;
+	*data = (unsigned long)page_address(page);
+
+	return 0;
+}
+#endif
+
 static int brd_ioctl(struct inode *inode, struct file *file,
 			unsigned int cmd, unsigned long arg)
 {
@@ -342,8 +374,11 @@ static int brd_ioctl(struct inode *inode
 }
 
 static struct block_device_operations brd_fops = {
-	.owner =		THIS_MODULE,
-	.ioctl =		brd_ioctl,
+	.owner =		THIS_MODULE,
+	.ioctl =		brd_ioctl,
+#ifdef CONFIG_BLK_DEV_XIP
+	.direct_access =	brd_direct_access,
+#endif
 };
 
 /*
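For readers following the sector checks in brd_direct_access above: the driver stores its backing memory one page at a time, so a sector maps to a radix-tree page index plus a byte offset within that page. A userspace sketch of that arithmetic (my own helper names; the constants assume 512-byte sectors and 4K pages, as on x86):

```c
#define SECTOR_SHIFT	9	/* 512-byte sectors */
#define PAGE_SHIFT_	12	/* 4K pages (x86); suffixed to avoid kernel-header names */
#define PAGE_SECTORS	(1UL << (PAGE_SHIFT_ - SECTOR_SHIFT))

/* Which backing page a sector lands in (the radix tree index). */
static unsigned long sector_to_page_idx(unsigned long long sector)
{
	return (unsigned long)(sector >> (PAGE_SHIFT_ - SECTOR_SHIFT));
}

/* Byte offset of a sector within its backing page. */
static unsigned long sector_to_page_off(unsigned long long sector)
{
	return (unsigned long)((sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT);
}
```

The `sector & (PAGE_SECTORS-1)` test in the patch is exactly the "offset within page must be zero" condition: direct_access only hands out whole, page-aligned chunks.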
[patch] ext2: xip check fix
Am I missing something here? I wonder how s390 works without this change?

--
ext2 should not worry about checking sb->s_blocksize for XIP before the sb's blocksize actually gets set.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/fs/ext2/super.c
===================================================================
--- linux-2.6.orig/fs/ext2/super.c
+++ linux-2.6/fs/ext2/super.c
@@ -844,8 +844,7 @@ static int ext2_fill_super(struct super_
 
 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
-	if ((ext2_use_xip(sb)) && ((blocksize != PAGE_SIZE) ||
-				  (sb->s_blocksize != blocksize))) {
+	if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
 		if (!silent)
 			printk("XIP: Unsupported blocksize\n");
 		goto failed_mount;
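The blocksize computed in the hunk above comes from the ext2 on-disk superblock, which stores the block size as a log2 shift relative to 1024 bytes. A small userland sketch of that computation (the helper name is mine; 4K pages assumed):

```c
#define BLOCK_SIZE_	1024	/* the kernel's BLOCK_SIZE base unit */
#define PAGE_SIZE_	4096	/* assuming 4K pages */

/* ext2: block size = 1024 << s_log_block_size. */
static unsigned long ext2_blocksize(unsigned int s_log_block_size)
{
	return (unsigned long)BLOCK_SIZE_ << s_log_block_size;
}
```

So on a 4K-page machine only s_log_block_size == 2 (a 4096-byte block size) passes the XIP check in the patch.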
Re: [patch] rd: support XIP
On Tue, Dec 04, 2007 at 03:26:20AM -0800, Andrew Morton wrote:

On Tue, 4 Dec 2007 12:21:00 +0100 Nick Piggin [EMAIL PROTECTED] wrote:

+	 *
+	 * Cannot support XIP and highmem, because our ->direct_access
+	 * routine for XIP must return memory that is always addressable.
+	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * restriction might be able to be lifted.
 	 */
+	gfp_flags = GFP_NOIO | __GFP_ZERO;
+#ifndef CONFIG_BLK_DEV_XIP
+	gfp_flags |= __GFP_HIGHMEM;
+#endif

A dubious tradeoff?

On big highmem machines certainly. It may be somewhat useful on small memory systems... but having the config option there is nice for a VM developer without an s390 easily available ;)

But don't apply these XIP patches yet -- after a bit more testing I'm seeing some data corruption, so I'll have to work out what's going wrong with that first.
[patch] mm: fix XIP file writes
On Tue, Dec 04, 2007 at 12:35:49PM +0100, Nick Piggin wrote:

On Tue, Dec 04, 2007 at 03:26:20AM -0800, Andrew Morton wrote: On Tue, 4 Dec 2007 12:21:00 +0100 Nick Piggin [EMAIL PROTECTED] wrote:

+	 *
+	 * Cannot support XIP and highmem, because our ->direct_access
+	 * routine for XIP must return memory that is always addressable.
+	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * restriction might be able to be lifted.
 	 */
+	gfp_flags = GFP_NOIO | __GFP_ZERO;
+#ifndef CONFIG_BLK_DEV_XIP
+	gfp_flags |= __GFP_HIGHMEM;
+#endif

A dubious tradeoff?

On big highmem machines certainly. It may be somewhat useful on small memory systems... but having the config option there is nice for a VM developer without an s390 easily available ;) But don't apply these XIP patches yet -- after a bit more testing I'm seeing some data corruption, so I'll have to work out what's going wrong with that first.

Here we go. See, brd already found a bug ;) Can you apply the ext2 XIP patch too? And I'll resend the brd XIP patch.

---
Writing to XIP files at a non-page-aligned offset results in data corruption because the writes were always sent to the start of the page.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -314,7 +314,7 @@ __xip_file_write(struct file *filp, cons
 		fault_in_pages_readable(buf, bytes);
 		kaddr = kmap_atomic(page, KM_USER0);
 		copied = bytes -
-			__copy_from_user_inatomic_nocache(kaddr, buf, bytes);
+			__copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 		kunmap_atomic(kaddr, KM_USER0);
 		flush_dcache_page(page);
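A userspace analogue may help show why the one-character fix above matters: the copy must land at the intra-page offset, otherwise a write at offset N clobbers the bytes at the start of the page instead. This is purely illustrative (the buffer and helper name are mine, not kernel code):

```c
#include <string.h>

#define PG_SIZE 4096

/*
 * Mimics the fixed __xip_file_write copy path: user data is copied
 * to kaddr + offset. The pre-fix code copied to kaddr, i.e. always
 * to the start of the page -- hence the corruption at non-aligned
 * write offsets.
 */
static size_t page_write(char *kaddr, size_t offset,
			 const char *buf, size_t bytes)
{
	memcpy(kaddr + offset, buf, bytes);	/* the fix: "+ offset" */
	return bytes;
}
```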
[patch] rd: support XIP (updated)
On Tue, Dec 04, 2007 at 12:06:23PM +, Duane Griffin wrote:

On 04/12/2007, Nick Piggin [EMAIL PROTECTED] wrote:

+	gfp_flags = GFP_NOIO | __GFP_ZERO;
+#ifndef CONFIG_BLK_DEV_XIP
+	gfp_flags |= __GFP_HIGHMEM;
+#endif
 	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);

I think that should be alloc_page(gfp_flags), no?

Yes. Here is a resend. Andrew, please apply this one (has passed some testing with ext2 XIP).

--
Support direct_access XIP method with brd.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/drivers/block/Kconfig
===================================================================
--- linux-2.6.orig/drivers/block/Kconfig
+++ linux-2.6/drivers/block/Kconfig
@@ -346,6 +346,16 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
 
+config BLK_DEV_XIP
+	bool "Support XIP filesystems on RAM block device"
+	depends on BLK_DEV_RAM
+	default n
+	help
+	  Support XIP filesystems (such as ext2 with XIP support on) on
+	  top of block ram device. This will slightly enlarge the kernel, and
+	  will prevent RAM block device backing store memory from being
+	  allocated from highmem (only a problem for highmem systems).
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -89,6 +89,7 @@ static struct page *brd_insert_page(stru
 {
 	pgoff_t idx;
 	struct page *page;
+	gfp_t gfp_flags;
 
 	page = brd_lookup_page(brd, sector);
 	if (page)
@@ -97,8 +98,17 @@ static struct page *brd_insert_page(stru
 	/*
 	 * Must use NOIO because we don't want to recurse back into the
 	 * block or filesystem layers from page reclaim.
+	 *
+	 * Cannot support XIP and highmem, because our ->direct_access
+	 * routine for XIP must return memory that is always addressable.
+	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * restriction might be able to be lifted.
 	 */
-	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
+	gfp_flags = GFP_NOIO | __GFP_ZERO;
+#ifndef CONFIG_BLK_DEV_XIP
+	gfp_flags |= __GFP_HIGHMEM;
+#endif
+	page = alloc_page(gfp_flags);
 	if (!page)
 		return NULL;
@@ -307,6 +317,28 @@ out:
 	return 0;
 }
 
+#ifdef CONFIG_BLK_DEV_XIP
+static int brd_direct_access (struct block_device *bdev, sector_t sector,
+			unsigned long *data)
+{
+	struct brd_device *brd = bdev->bd_disk->private_data;
+	struct page *page;
+
+	if (!brd)
+		return -ENODEV;
+	if (sector & (PAGE_SECTORS-1))
+		return -EINVAL;
+	if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk))
+		return -ERANGE;
+	page = brd_insert_page(brd, sector);
+	if (!page)
+		return -ENOMEM;
+	*data = (unsigned long)page_address(page);
+
+	return 0;
+}
+#endif
+
 static int brd_ioctl(struct inode *inode, struct file *file,
 			unsigned int cmd, unsigned long arg)
 {
@@ -342,8 +374,11 @@ static int brd_ioctl(struct inode *inode
 }
 
 static struct block_device_operations brd_fops = {
-	.owner =		THIS_MODULE,
-	.ioctl =		brd_ioctl,
+	.owner =		THIS_MODULE,
+	.ioctl =		brd_ioctl,
+#ifdef CONFIG_BLK_DEV_XIP
+	.direct_access =	brd_direct_access,
+#endif
 };
 
 /*
[patch] rewrite rd
Hi,

This is my proposal for a (hopefully) backwards compatible rd driver. The motivation for me is not pressing, except that I have this code sitting here that is either going to rot or get merged. I'm happy to make myself maintainer of this code, but if any real block device driver writer would like to, that would be fine by me ;) Comments?

--
This is a rewrite of the ramdisk block device driver. The old one is really difficult because it effectively implements a block device which serves data out of its own buffer cache. It relies on the dirty bit being set to pin its backing store in cache, however there are non-trivial paths which can clear the dirty bit (eg. try_to_free_buffers()), which had recently led to data corruption. And in general it is completely wrong for a block device driver to do this.

The new one is more like a regular block device driver. It has no idea about vm/vfs stuff. Its backing store is similar to the buffer cache (a simple radix-tree of pages), but it doesn't know anything about page cache (the pages in the radix tree are not pagecache pages).

There is one slight downside -- direct block device access and filesystem metadata access goes through an extra copy and gets stored in RAM twice. However, this downside is only slight, because the real buffercache of the device is now reclaimable (because we're not playing crazy games with it), so under memory intensive situations, footprint should effectively be the same -- maybe even a slight advantage to the new driver because it can also reclaim buffer heads.

The fact that it now goes through all the regular vm/fs paths makes it much more useful for testing, too.

   text    data     bss     dec     hex filename
   2837     849     384    4070     fe6 drivers/block/rd.o
   3528     371      12    3911     f47 drivers/block/brd.o

Text is larger, but data and bss are smaller, making total size smaller.

A few other nice things about it:
- Similar structure and layout to the new loop device handling.
- Dynamic ramdisk creation.
- Runtime flexible buffer head size (because it is no longer part of the ramdisk code). - Boot / load time flexible ramdisk size, which could easily be extended to a per-ramdisk runtime changeable size (eg. with an ioctl). Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- MAINTAINERS|5 drivers/block/Kconfig | 12 - drivers/block/Makefile |2 drivers/block/brd.c| 500 + drivers/block/rd.c | 536 - fs/buffer.c|1 6 files changed, 508 insertions(+), 548 deletions(-) Index: linux-2.6/drivers/block/Kconfig === --- linux-2.6.orig/drivers/block/Kconfig +++ linux-2.6/drivers/block/Kconfig @@ -311,7 +311,7 @@ config BLK_DEV_UB If unsure, say N. config BLK_DEV_RAM - tristate "RAM disk support" + tristate "RAM block device support" ---help--- Saying Y here will allow you to use a portion of your RAM memory as a block device, so that you can make file systems on it, read and @@ -346,16 +346,6 @@ config BLK_DEV_RAM_SIZE The default value is 4096 kilobytes. Only change this if you know what you are doing. -config BLK_DEV_RAM_BLOCKSIZE - int "Default RAM disk block size (bytes)" - depends on BLK_DEV_RAM - default 1024 - help - The default value is 1024 bytes. PAGE_SIZE is a much more - efficient choice however. The default is kept to ensure initrd - setups function - apparently needed by the rd_load_image routine - that supposes the filesystem in the image uses a 1024 blocksize.
- config CDROM_PKTCDVD tristate "Packet writing on CD/DVD media" depends on !UML Index: linux-2.6/drivers/block/Makefile === --- linux-2.6.orig/drivers/block/Makefile +++ linux-2.6/drivers/block/Makefile @@ -11,7 +11,7 @@ obj-$(CONFIG_AMIGA_FLOPPY)+= amiflop.o obj-$(CONFIG_PS3_DISK) += ps3disk.o obj-$(CONFIG_ATARI_FLOPPY) += ataflop.o obj-$(CONFIG_AMIGA_Z2RAM) += z2ram.o -obj-$(CONFIG_BLK_DEV_RAM) += rd.o +obj-$(CONFIG_BLK_DEV_RAM) += brd.o obj-$(CONFIG_BLK_DEV_LOOP) += loop.o obj-$(CONFIG_BLK_DEV_PS2) += ps2esdi.o obj-$(CONFIG_BLK_DEV_XD) += xd.o Index: linux-2.6/drivers/block/brd.c === --- /dev/null +++ linux-2.6/drivers/block/brd.c @@ -0,0 +1,536 @@ +/* + * Ram backed block device driver. + * + * Copyright (C) 2007 Nick Piggin + * Copyright (C) 2007 Novell Inc. + * + * Parts derived from drivers/block/rd.c, and drivers/block/loop.c, copyright + * of their respective owners. + */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/moduleparam.h> +#include <linux/major.h> +#include <linux/blkdev.h>
Re: [patch] rewrite rd
On Mon, Dec 03, 2007 at 10:29:03PM -0800, Andrew Morton wrote: On Tue, 4 Dec 2007 05:26:28 +0100 Nick Piggin [EMAIL PROTECTED] wrote: There is one slight downside -- direct block device access and filesystem metadata access goes through an extra copy and gets stored in RAM twice. However, this downside is only slight, because the real buffercache of the device is now reclaimable (because we're not playing crazy games with it), so under memory intensive situations, footprint should effectively be the same -- maybe even a slight advantage to the new driver because it can also reclaim buffer heads. what about mmap(/dev/ram0)? Same thing I guess -- it will use more memory in the common case, but should actually be able to reclaim slightly more memory under pressure, so the tiny systems guys shouldn't have too much trouble. And we're avoiding the whole class of aliasing problems where mmap on the old rd driver means that you're creating new mappings to your backing store pages. Text is larger, but data and bss are smaller, making total size smaller. A few other nice things about it: - Similar structure and layout to the new loop device handling. - Dynamic ramdisk creation. - Runtime flexible buffer head size (because it is no longer part of the ramdisk code). - Boot / load time flexible ramdisk size, which could easily be extended to a per-ramdisk runtime changeable size (eg. with an ioctl). This ramdisk driver can use highmem whereas the existing one can't (yes?). That's a pretty major difference. Plus look at the revoltingness in rd.c's mapping_set_gfp_mask()s. Ah yep, there is the highmem advantage too. +#define SECTOR_SHIFT 9 That's our third definition of SECTOR_SHIFT. Or 7th, depending on how you count ;) I always thought redefining it is a prerequisite to getting anything merged into the block layer, so I'm too scared to put it in include/linux/blkdev.h ;) +/* + * Look up and return a brd's page for a given sector.
+ */ +static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector) +{ + unsigned long idx; Could use pgoff_t here if you think that's clearer. I guess it is. Although on one hand the radix-tree uses unsigned long, but on the other hand, page->index is pgoff. +{ + unsigned long idx; + struct page *page; + + page = brd_lookup_page(brd, sector); + if (page) + return page; + + page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO); Why GFP_NOIO? I guess it has similar issues to rd.c -- we can't risk recursing into the block layer here. However unlike rd.c, we _could_ easily add some mode or ioctl to allocate the backing store upfront, with full reclaim flags... Have you thought about __GFP_MOVABLE treatment here? Seems pretty desirable as this sucker can be large. AFAIK that only applies to pagecache but I haven't been paying much attention to that stuff lately. It wouldn't be hard to move these pages around, if we had a hook from the vm for it. +static void brd_free_pages(struct brd_device *brd) +{ + unsigned long pos = 0; + struct page *pages[FREE_BATCH]; + int nr_pages; + + do { + int i; + + nr_pages = radix_tree_gang_lookup(&brd->brd_pages, + (void **)pages, pos, FREE_BATCH); + + for (i = 0; i < nr_pages; i++) { + void *ret; + + BUG_ON(pages[i]->index < pos); + pos = pages[i]->index; + ret = radix_tree_delete(&brd->brd_pages, pos); + BUG_ON(!ret || ret != pages[i]); + __free_page(pages[i]); + } + + pos++; + + } while (nr_pages == FREE_BATCH); +} I have vague memories that radix_tree_gang_lookup()'s naive implementation may return fewer items than you asked for even when there are more items remaining - when it hits certain boundaries. Good memory, but it's the low level leaf traversal that bails out at boundaries. The higher level code then retries, so we should be OK here. +/* + * copy_to_brd_setup must be called before copy_to_brd. It may sleep.
+ */ +static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n) +{ + unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT; + size_t copy; + + copy = min((unsigned long)n, PAGE_SIZE - offset); use min_t. Or, better, sort out the types. I guess the API is using size_t, so that would be the more appropriate type to convert to. And I'll use min_t too. +static void copy_to_brd(struct brd_device *brd, const void *src, + sector_t sector, size_t n) +{ + struct page *page; + void *dst; + unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT; + size_t copy; + + copy = min((unsigned long)n, PAGE_SIZE - offset); Ditto. + page
Re: [patch] rewrite rd
On Tue, Dec 04, 2007 at 08:01:31AM +0100, Nick Piggin wrote: Thanks for the review, I'll post an incremental patch in a sec. Index: linux-2.6/drivers/block/brd.c === --- linux-2.6.orig/drivers/block/brd.c +++ linux-2.6/drivers/block/brd.c @@ -56,7 +56,7 @@ struct brd_device { */ static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector) { - unsigned long idx; + pgoff_t idx; struct page *page; /* @@ -87,13 +87,17 @@ static struct page *brd_lookup_page(stru */ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector) { - unsigned long idx; + pgoff_t idx; struct page *page; page = brd_lookup_page(brd, sector); if (page) return page; + /* +* Must use NOIO because we don't want to recurse back into the +* block or filesystem layers from page reclaim. +*/ page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO); if (!page) return NULL; @@ -148,6 +152,11 @@ static void brd_free_pages(struct brd_de pos++; + /* +* This assumes radix_tree_gang_lookup always returns as +* many pages as possible. If the radix-tree code changes, +* so will this have to. 
+*/ } while (nr_pages == FREE_BATCH); } @@ -159,7 +168,7 @@ static int copy_to_brd_setup(struct brd_ unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT; size_t copy; - copy = min((unsigned long)n, PAGE_SIZE - offset); + copy = min_t(size_t, n, PAGE_SIZE - offset); if (!brd_insert_page(brd, sector)) return -ENOMEM; if (copy < n) { @@ -181,7 +190,7 @@ static void copy_to_brd(struct brd_devic unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT; size_t copy; - copy = min((unsigned long)n, PAGE_SIZE - offset); + copy = min_t(size_t, n, PAGE_SIZE - offset); page = brd_lookup_page(brd, sector); BUG_ON(!page); @@ -213,7 +222,7 @@ static void copy_from_brd(void *dst, str unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT; size_t copy; - copy = min((unsigned long)n, PAGE_SIZE - offset); + copy = min_t(size_t, n, PAGE_SIZE - offset); page = brd_lookup_page(brd, sector); if (page) { src = kmap_atomic(page, KM_USER1); @@ -318,6 +327,9 @@ static int brd_ioctl(struct inode /* * Invalidate the cache first, so it isn't written * back to the device. +* +* Another thread might instantiate more buffercache here, +* but there is not much we can do to close that race. */ invalidate_bh_lrus(); truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
[rfc][patch 2/2] inotify: remove debug code
The inotify debugging code is supposed to verify that the DCACHE_INOTIFY_PARENT_WATCHED scalability optimisation does not result in notifications getting lost nor extra needless locking generated. Unfortunately there are also some races in the debugging code. And it isn't very good at finding problems anyway. So remove it for now. Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- Index: linux-2.6/fs/dcache.c === --- linux-2.6.orig/fs/dcache.c +++ linux-2.6/fs/dcache.c @@ -1408,9 +1408,6 @@ void d_delete(struct dentry * dentry) if (atomic_read(&dentry->d_count) == 1) { dentry_iput(dentry); fsnotify_nameremove(dentry, isdir); - - /* remove this and other inotify debug checks after 2.6.18 */ - dentry->d_flags &= ~DCACHE_INOTIFY_PARENT_WATCHED; return; } Index: linux-2.6/fs/inotify.c === --- linux-2.6.orig/fs/inotify.c +++ linux-2.6/fs/inotify.c @@ -168,20 +168,14 @@ static void set_dentry_child_flags(struc struct dentry *child; list_for_each_entry(child, &alias->d_subdirs, d_u.d_child) { - if (!child->d_inode) { - WARN_ON(child->d_flags & DCACHE_INOTIFY_PARENT_WATCHED); + if (!child->d_inode) continue; - } + spin_lock(&child->d_lock); - if (watched) { - WARN_ON(child->d_flags & - DCACHE_INOTIFY_PARENT_WATCHED); + if (watched) child->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED; - } else { - WARN_ON(!(child->d_flags & - DCACHE_INOTIFY_PARENT_WATCHED)); - child->d_flags &= ~DCACHE_INOTIFY_PARENT_WATCHED; - } + else + child->d_flags &= ~DCACHE_INOTIFY_PARENT_WATCHED; spin_unlock(&child->d_lock); } } @@ -253,7 +247,6 @@ void inotify_d_instantiate(struct dentry if (!inode) return; - WARN_ON(entry->d_flags & DCACHE_INOTIFY_PARENT_WATCHED); spin_lock(&entry->d_lock); parent = entry->d_parent; if (parent->d_inode && inotify_inode_watched(parent->d_inode))
[rfc][patch 1/2] inotify: fix race
Hmm, thinking I'd better just take a hammer to this inotify problem despite the fact I cannot confirm the root cause of the warning messages. Possibilities include the race being fixed here, and races in the debug code itself. -- There is a race between setting an inode's children's parent watched flag when placing the first watch on a parent, and instantiating new children of that parent: a child could miss having its flags set by set_dentry_child_flags, but then inotify_d_instantiate might still see !inotify_inode_watched. The solution is to set_dentry_child_flags after adding the watch. Locking is taken care of, because both set_dentry_child_flags and inotify_d_instantiate hold dcache_lock and child->d_locks. Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- Index: linux-2.6/fs/inotify.c === --- linux-2.6.orig/fs/inotify.c +++ linux-2.6/fs/inotify.c @@ -627,6 +627,7 @@ s32 inotify_add_watch(struct inotify_han struct inode *inode, u32 mask) { int ret = 0; + int newly_watched; /* don't allow invalid bits: we don't want flags set */ mask &= IN_ALL_EVENTS | IN_ONESHOT; @@ -653,12 +654,18 @@ s32 inotify_add_watch(struct inotify_han */ watch->inode = igrab(inode); - if (!inotify_inode_watched(inode)) - set_dentry_child_flags(inode, 1); - /* Add the watch to the handle's and the inode's list */ + newly_watched = !inotify_inode_watched(inode); list_add(&watch->h_list, &ih->watches); list_add(&watch->i_list, &inode->inotify_watches); + /* +* Set child flags _after_ adding the watch, so there is no race +* window where newly instantiated children could miss their parent's +* watched flag. +*/ + if (newly_watched) + set_dentry_child_flags(inode, 1); + out: mutex_unlock(&ih->mutex); mutex_unlock(&inode->inotify_mutex);
Re: Should PAGE_CACHE_SIZE be discarded?
On Thu, Nov 15, 2007 at 02:46:46PM +, David Howells wrote: Benny Halevy [EMAIL PROTECTED] wrote: I think that what Nick was trying to say is that PAGE_CACHE_SIZE should always be used properly as the size of the memory struct Page covers (while PAGE_SIZE is the hardware page size and the constraint is that PAGE_CACHE_SIZE == (PAGE_SIZE << k) for some k >= 0). If everybody does that then None of the filesystems should really care at all. That said, it doesn't seem like the current usage in fs/ and drivers/ is consistent with this convention. Indeed. One thing you have to consider is kmap(). I would expect it to present an area of PAGE_SIZE for access. However, if the filesystem gets an area of PAGE_CACHE_SIZE to fill, then I would have to do multiple kmap() calls in the process of filling that 'pagecache page' in AFS. Furthermore, if a page struct covers a PAGE_CACHE_SIZE chunk of memory, then I suspect the page allocator is also wrong, as I believe it deals with PAGE_SIZE chunks of memory, assuming a struct page for each. That's because you're thinking about all these difficulties outside the filesystem. Really, pagecache is in PAGE_CACHE_SIZE. Nobody said pagecache > PAGE_SIZE will work by changing a single define, but filesystems are about the least problematic here.
Re: [rfc][patch 3/5] afs: new aops
On Thu, Nov 15, 2007 at 12:15:41PM +, David Howells wrote: Nick Piggin [EMAIL PROTECTED] wrote: So you're saying a struct page controls an area of PAGE_CACHE_SIZE, not an area of PAGE_SIZE? No, a pagecache page is PAGE_CACHE_SIZE. That doesn't answer my question. I didn't ask about 'pagecache pages' per se. Are you saying then that a page struct always represents an area of PAGE_SIZE to, say, the page allocator and PAGE_CACHE_SIZE to a filesystem's address operations? Yes. How about I state it this way: Please define what the coverage of a (non-compound) struct page is, and how this relates to PAGE_SIZE and PAGE_CACHE_SIZE. If it's well-defined then this cannot be hard, right? No it's easy. It's PAGE_SIZE (which also happens to be PAGE_CACHE_SIZE). An implementation that would (not trivially, but without changing the basic concepts) allow a larger pagecache size is with compound pages. And actually hugetlbfs already does this. And not all struct pages control the same amount of data anyway, with compound pages. Compound pages are irrelevant to my question. A compound page is actually regulated by a series of page structs, each of which represents a 'page' of real memory. And it can also represent a PAGE_CACHE_SIZE page of pagecache. Do you say, then, that all, say, readpage() and readpages() methods must handle a compound page if that is given to them? I'm not talking about a specific implementation that allows PAGE_CACHE_SIZE > PAGE_SIZE. So no, I don't say anything about that. I say that pagecache pages are PAGE_CACHE_SIZE! Yes it is easy now because that is the same as PAGE_SIZE. Yes it will be harder if you wanted to change that. What you would not have to change is the assumption that pagecache pages are in PAGE_CACHE_SIZE units.
Re: [rfc][patch 3/5] afs: new aops
On Wed, Nov 14, 2007 at 12:18:43PM +, David Howells wrote: Nick Piggin [EMAIL PROTECTED] wrote: The problem is that the code called assumes that the struct page * argument points to a single page, not an array of pages as would presumably be the case if PAGE_CACHE_SIZE > PAGE_SIZE. Incorrect. Christoph's patch for example does this by using compound pages. Now I personally don't like the patch or see the point in PAGE_CACHE_SIZE / PAGE_SIZE distinction, but I'm just telling you what the convention is. There is no point you arguing against it, that's simply how it is. No! You are wrong. I wrote the AFS code. I *know* it can only deal with No I'm talking about core code. In core code, the PAGE_CACHE_SIZE is for page cache struct pages. Single struct pages (not page arrays). Take a look at generic mapping read or something. There is nothing to deal with page arrays there either, but that's simply the convention. So: you may not change the assertion unless you also fix the lower functions. I won't change the assertion, because you haven't been following said convention, so just changing it in one place is stupider than not changing it at all, but not for the reason cited. The convention is not precisely clear. Just grep for PAGE_CACHE_SIZE in Documentation/. It's only mentioned twice, and in neither case does it give any information about what PAGE_CACHE_SIZE is, what it represents, or where it applies. Therefore it's an ill-defined concept. If you look in Documentation/filesystems/vfs.txt, you'll see that it almost always talks about 'pages'. It only mentions 'pagecache pages' once - in the description of write_begin(), but it's not clear whether that means anything. Documentation is the opposite of convention ;) Look in mm/. However, I've now noted that I need to fix my code, so just keep the assertion for now and I'll fix my code to handle multipage blocks. I'm not saying you need to do that. Leave it at PAGE_SIZE, really it doesn't matter that much at present.
This has just blown out of proportion.
Re: Should PAGE_CACHE_SIZE be discarded?
On Wed, Nov 14, 2007 at 01:56:53PM +, David Howells wrote: Are we ever going to have PAGE_CACHE_SIZE != PAGE_SIZE? If not, why not discard PAGE_CACHE_SIZE as it's then redundant. Christoph Lameter has patches exactly to make PAGE_CACHE_SIZE larger than PAGE_SIZE, and they seem to work without much effort. I happen to hate the patches ;) but that doesn't change the fact that PAGE_CACHE_SIZE is relatively useful and it is not at all an ill-defined concept. PAGE_CACHE_SIZE is also an ill-defined concept. Just grep in Documentation/ and it only comes up in a couple of places, neither particularly informative. If there's a possibility that it will be used, then someone who knows how it's supposed to work needs to sit down and document what it is, what it represents, where it applies, how it interacts with PG_compound and how the page flags distribute over a page cache slot. No, this was an *example*. It has nothing to do with PG_compound upstream. That was just an example. Basically, anything that goes in the page cache is in units of PAGE_CACHE_SIZE, and nothing else. For filesystems it should be pretty easy... One further thing to consider: actually making PAGE_CACHE_SIZE > PAGE_SIZE work will be an interesting problem, as I'm certain most filesystems will break horribly without a lot of work (ramfs might be only exception). I think most filesystems actually don't have much problem with it. mm code has bigger problems, eg. due to ptes <-> pagecache no longer being equal size but why would filesystems care?
Re: [rfc][patch 3/5] afs: new aops
On Wed, Nov 14, 2007 at 03:57:46PM +, David Howells wrote: Nick Piggin [EMAIL PROTECTED] wrote: In core code, the PAGE_CACHE_SIZE is for page cache struct pages. Single struct pages (not page arrays). Take a look at generic mapping read or something. So you're saying a struct page controls an area of PAGE_CACHE_SIZE, not an area of PAGE_SIZE? No, a pagecache page is PAGE_CACHE_SIZE. And not all struct pages control the same amount of data anyway, with compound pages. Is a struct page then a purely pagecache concept? Documentation is the opposite of convention ;) If it's not Documented, then it's irrelevant. But you can't just decide yourself that it is irrelevant because you don't grep hard enough ;) include/linux/mm.h: * A page may belong to an inode's memory mapping. In this case, page->mapping * is the pointer to the inode, and page->index is the file offset of the page, * in units of PAGE_CACHE_SIZE. include/linux/mm_types.h unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE units, *not* PAGE_CACHE_SIZE */ Looks like documentation to me.
Re: Should PAGE_CACHE_SIZE be discarded?
On Wed, Nov 14, 2007 at 03:59:39PM +, David Howells wrote: Nick Piggin [EMAIL PROTECTED] wrote: Christoph Lameter has patches exactly to make PAGE_CACHE_SIZE larger than PAGE_SIZE, and they seem to work without much effort. I happen to hate the patches ;) but that doesn't change the fact that PAGE_CACHE_SIZE is relatively useful and it is not at all an ill-defined concept. Where, please? mm kernels? Floating around. I'm not saying it will even get upstream. It's just an example. Basically, anything that goes in the page cache is in units of PAGE_CACHE_SIZE, and nothing else. For filesystems it should be pretty easy... That depends on what the coverage of struct page is. I don't actually know whether this is PAGE_SIZE or PAGE_CACHE_SIZE; I assumed it to be the former, but from what you've said, I'm not actually sure. It can be pretty well any power of 2 from PAGE_SIZE upwards, with compound pages. None of the filesystems should really care at all. It's not even a new concept, hugetlbfs uses HPAGE_SIZE...
Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock
On Wed, Nov 14, 2007 at 09:01:39PM +0100, Peter Zijlstra wrote: Normal locking order is: i_mutex mmap_sem However NFS's -mmap hook, which is called under mmap_sem, can take i_mutex. Avoid this potential deadlock by doing the work that requires i_mutex from the new -mmap_prepare(). [ Is this sufficient, or does it introduce a race? ] Seems like an OK patchset in my opinion. I don't know much about NFS unfortunately, but I wonder what prevents the condition fixed by nfs_revalidate_mapping from happening again while the mmap is active...? Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- fs/nfs/file.c | 25 + 1 file changed, 17 insertions(+), 8 deletions(-) Index: linux-2.6/fs/nfs/file.c === --- linux-2.6.orig/fs/nfs/file.c +++ linux-2.6/fs/nfs/file.c @@ -41,6 +41,9 @@ static int nfs_file_open(struct inode *, struct file *); static int nfs_file_release(struct inode *, struct file *); static loff_t nfs_file_llseek(struct file *file, loff_t offset, int origin); +static int +nfs_file_mmap_prepare(struct file * file, unsigned long len, + unsigned long prot, unsigned long flags, unsigned long pgoff); static int nfs_file_mmap(struct file *, struct vm_area_struct *); static ssize_t nfs_file_splice_read(struct file *filp, loff_t *ppos, struct pipe_inode_info *pipe, @@ -64,6 +67,7 @@ const struct file_operations nfs_file_op .write = do_sync_write, .aio_read = nfs_file_read, .aio_write = nfs_file_write, + .mmap_prepare = nfs_file_mmap_prepare, .mmap = nfs_file_mmap, .open = nfs_file_open, .flush = nfs_file_flush, @@ -270,7 +274,8 @@ nfs_file_splice_read(struct file *filp, } static int -nfs_file_mmap(struct file * file, struct vm_area_struct * vma) +nfs_file_mmap_prepare(struct file * file, unsigned long len, + unsigned long prot, unsigned long flags, unsigned long pgoff) { struct dentry *dentry = file-f_path.dentry; struct inode *inode = dentry-d_inode; @@ -279,13 +284,17 @@ nfs_file_mmap(struct file * file, struct dfprintk(VFS, nfs: mmap(%s/%s)\n, dentry-d_parent-d_name.name, 
dentry-d_name.name); - status = nfs_revalidate_mapping(inode, file-f_mapping); - if (!status) { - vma-vm_ops = nfs_file_vm_ops; - vma-vm_flags |= VM_CAN_NONLINEAR; - file_accessed(file); - } - return status; + return nfs_revalidate_mapping(inode, file-f_mapping); +} + +static int +nfs_file_mmap(struct file *file, struct vm_area_struct *vma) +{ + vma-vm_ops = nfs_file_vm_ops; + vma-vm_flags |= VM_CAN_NONLINEAR; + file_accessed(file); + + return 0; } /* -- - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock
On Wed, Nov 14, 2007 at 05:18:50PM -0500, Trond Myklebust wrote: On Wed, 2007-11-14 at 22:50 +0100, Peter Zijlstra wrote: Right, but I guess what Nick asked is, if pages could be stale to start with, how is that avoided in the future. The way I understand it, this re-validate is just a best effort at getting a coherent image. The normal convention for NFS is to use a close-to-open cache consistency model. In that model, applications must agree never to open the file for reading or writing if an application on a different NFS client already holds it open for writing. However there is no standard locking model for _enforcing_ such an agreement, so some setups do violate it. One obvious model that we try to support is that where the applications are using POSIX locking in order to ensure exclusive access to the data when required. Another model is to rely rather on synchronous writes and heavy attribute revalidation to detect when a competing application has written to the file (the 'noac' mount option). While such a model is obviously deficient in that it can never guarantee cache coherency, we do attempt to ensure that it works on a per-operation basis (IOW: we check cache coherency before each call to read(), to mmap(), etc) since it is by far the easiest model to apply if you have applications that cannot be rewritten and that satisfy the requirement that they rarely conflict. mmap()s can be different from read in that the syscall may have little relation to when the data gets used. But I guess it's still a best effort thing. Fair enough. Thanks, Nick
Re: [rfc][patch 3/5] afs: new aops
On Tue, Nov 13, 2007 at 10:56:25AM +, David Howells wrote: Nick Piggin [EMAIL PROTECTED] wrote: It takes a pagecache page, yes. If you follow convention, you use PAGE_CACHE_SIZE for that guy. You don't have to allow PAGE_CACHE_SIZE != PAGE_SIZE, and if all the rest of your code is in units of PAGE_SIZE, then obviously my changing of just the one unit is even more confusing than the current arrangement ;) The problem is that the code called assumes that the struct page * argument points to a single page, not an array of pages as would presumably be the case if PAGE_CACHE_SIZE > PAGE_SIZE. Incorrect. Christoph's patch for example does this by using compound pages. Now I personally don't like the patch or see the point in PAGE_CACHE_SIZE / PAGE_SIZE distinction, but I'm just telling you what the convention is. There is no point you arguing against it, that's simply how it is. If I should allow for an array of pages then the lower functions (specifically afs_deliver_fs_fetch_data()) need to change, and until that time occurs, the assertion *must* remain as it is now. It defends the lower functions against being asked to do something they weren't designed to do. So: you may not change the assertion unless you also fix the lower functions. I won't change the assertion, because you haven't been following said convention, so just changing it in one place is stupider than not changing it at all, but not for the reason cited.
Re: [rfc][patch 3/5] afs: new aops
On Mon, Nov 12, 2007 at 03:29:14PM +, David Howells wrote: Nick Piggin [EMAIL PROTECTED] wrote: - ASSERTCMP(start + len, <=, PAGE_SIZE); + ASSERTCMP(len, <=, PAGE_CACHE_SIZE); Do you guarantee this will work if PAGE_CACHE_SIZE != PAGE_SIZE? If not, you can't make this particular change. PAGE_CACHE_SIZE should be used to address the pagecache. Do we ever intend to have PAGE_CACHE_SIZE != PAGE_SIZE? If not, then surely the former is redundant and should be scrapped to avoid confusion? Maybe. + i_size = i_size_read(&vnode->vfs_inode); + if (pos + len > i_size) + eof = i_size; + else + eof = PAGE_CACHE_SIZE; + + ret = afs_vnode_fetch_data(vnode, key, 0, eof, page); That can't be right, surely. Either 'eof' is the size of the file or it's the length of the data to be read. It can't be both. The first case needs eof masking off. Also, 'eof' isn't a good choice of name. 'len' would be better were it not already taken:-/ Yeah, just missed the mask. I notice you removed the stuff that clears holes in the page to be written. Is this now done by the caller? It is supposed to bring the page uptodate first. So, no need to clear AFAIKS? I notice also that you use afs_fill_page() in place of afs_prepare_page() to prepare a page. You can't do this if the region to be filled currently lies beyond the server's idea of EOF. I'll try and get a look at fixing this patch tomorrow. No rush, it won't get into 2.6.24 obviously. But that would be nice, thanks.
Re: [rfc][patch 3/5] afs: new aops
On Tue, Nov 13, 2007 at 12:30:05AM +, David Howells wrote: Nick Piggin [EMAIL PROTECTED] wrote:

PAGE_CACHE_SIZE should be used to address the pagecache.

Perhaps, but the function being called from there takes pages, not page cache slots. If I have to allow for PAGE_CACHE_SIZE != PAGE_SIZE then I need to modify my code; if not, then the assertion needs to remain what it is.

It takes a pagecache page, yes. If you follow convention, you use PAGE_CACHE_SIZE for that guy. You don't have to allow PAGE_CACHE_SIZE != PAGE_SIZE, and if all the rest of your code is in units of PAGE_SIZE, then obviously my changing of just the one unit is even more confusing than the current arrangement ;)

I notice you removed the stuff that clears holes in the page to be written. Is this now done by the caller?

It is supposed to bring the page uptodate first. So, no need to clear AFAIKS?

Hmmm... I suppose. However, it is wasteful in the common case, as it is then bringing the page up to date by filling/clearing the whole of it and not just the bits that are not going to be written.

Yes, that's where you come in. You are free (and encouraged) to optimise this. Let's see, for a network filesystem this is what you could do:

- if the page is not uptodate at write_begin time, do not bring it uptodate (at least, not the region that is going to be written to);

- if the page is not uptodate at write_end time, but the copy was fully completed, just mark it uptodate (provided you brought the regions outside the copy uptodate);

- if the page is not uptodate and you have a short copy, simply do not mark the page uptodate or dirty, and return 0 from write_end, indicating that you have committed 0 bytes. The generic code should DTRT.

Or you could: pass back a temporary (not pagecache) page in *pagep, and copy that yourself into the _real_ pagecache page at write_end time, so you know exactly how big the copy will be (this is basically what the 2copy method does now... it is probably not as good as the first method I described, but for a high latency filesystem it may be preferable to always bringing the page uptodate).

Or: keep track of sub-page dirty / uptodate status, eg. with a lightweight buffer_head-like structure, so you can actually have partially dirty pages that are not completely uptodate.

Or... if you can think of another way...
[rfc][patches] remove -prepare_write
Hi,

These are a set of patches to convert the last few filesystems to use the new deadlock-free write aops, and remove the core code to handle the legacy write path. I don't really have setups to sufficiently test these filesystems, so I would really appreciate it if filesystem maintainers can pick these patches up, bear with my bugs, and send them upstream when they're ready. The benefit to you is that you get to use the fast and well tested code paths!

Actually, it is interesting: several of the conversions I've done (including these) take a relatively naive approach of simply prefilling the whole page if it isn't uptodate. It might be the case that some filesystems actually prefer to do something similar to the legacy double-copy path which they're being converted away from! (Then again, it would probably be even more ideal to have simple sub-page state tracking structures.)

There is still quite a lot of work left to be done. Several filesystems still use prepare_write() helpers, and when they're fixed up, all the old helpers themselves have to be removed. But this step is probably most important to getting rid of complex code.
[rfc][patch 1/5] ecryptfs new aops
Convert ecryptfs to new aops. Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- Index: linux-2.6/fs/ecryptfs/mmap.c === --- linux-2.6.orig/fs/ecryptfs/mmap.c +++ linux-2.6/fs/ecryptfs/mmap.c @@ -263,31 +263,38 @@ out: return 0; } -static int ecryptfs_prepare_write(struct file *file, struct page *page, - unsigned from, unsigned to) +static int ecryptfs_write_begin(struct file *file, + struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) { + pgoff_t index = pos PAGE_CACHE_SHIFT; + struct page *page; int rc = 0; - if (from == 0 to == PAGE_CACHE_SIZE) + page = __grab_cache_page(mapping, index); + if (!page) + return -ENOMEM; + + if (flags AOP_FLAG_UNINTERRUPTIBLE len == PAGE_CACHE_SIZE) goto out; /* If we are writing a full page, it will be up to date. */ if (!PageUptodate(page)) { - rc = ecryptfs_read_lower_page_segment(page, page-index, 0, + rc = ecryptfs_read_lower_page_segment(page, index, 0, PAGE_CACHE_SIZE, - page-mapping-host); + mapping-host); if (rc) { printk(KERN_ERR %s: Error attemping to read lower page segment; rc = [%d]\n, __FUNCTION__, rc); - ClearPageUptodate(page); goto out; } else SetPageUptodate(page); } - if (page-index != 0) { + if (index != 0) { loff_t end_of_prev_pg_pos = - (((loff_t)page-index PAGE_CACHE_SHIFT) - 1); + (((loff_t)index PAGE_CACHE_SHIFT) - 1); - if (end_of_prev_pg_pos i_size_read(page-mapping-host)) { + if (end_of_prev_pg_pos i_size_read(mapping-host)) { rc = ecryptfs_truncate(file-f_path.dentry, end_of_prev_pg_pos); if (rc) { @@ -297,7 +304,7 @@ static int ecryptfs_prepare_write(struct goto out; } } - if (end_of_prev_pg_pos + 1 i_size_read(page-mapping-host)) + if (end_of_prev_pg_pos + 1 i_size_read(mapping-host)) zero_user_page(page, 0, PAGE_CACHE_SIZE, KM_USER0); } out: @@ -391,21 +398,25 @@ int ecryptfs_write_inode_size_to_metadat } /** - * ecryptfs_commit_write + * ecryptfs_write_end * @file: The eCryptfs file object + * @mapping: The eCryptfs object + * @pos, len: 
Ignored (we rotate the page IV on each write) * @page: The eCryptfs page - * @from: Ignored (we rotate the page IV on each write) - * @to: Ignored * * This is where we encrypt the data and pass the encrypted data to * the lower filesystem. In OpenPGP-compatible mode, we operate on * entire underlying packets. */ -static int ecryptfs_commit_write(struct file *file, struct page *page, -unsigned from, unsigned to) -{ - loff_t pos; - struct inode *ecryptfs_inode = page-mapping-host; +static int ecryptfs_write_end(struct file *file, + struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata) +{ + pgoff_t index = pos PAGE_CACHE_SHIFT; + unsigned from = pos (PAGE_CACHE_SIZE - 1); + unsigned to = from + copied; + struct inode *ecryptfs_inode = mapping-host; struct ecryptfs_crypt_stat *crypt_stat = ecryptfs_inode_to_private(file-f_path.dentry-d_inode)-crypt_stat; int rc; @@ -417,25 +428,22 @@ static int ecryptfs_commit_write(struct } else ecryptfs_printk(KERN_DEBUG, Not a new file\n); ecryptfs_printk(KERN_DEBUG, Calling fill_zeros_to_end_of_page - (page w/ index = [0x%.16x], to = [%d])\n, page-index, - to); + (page w/ index = [0x%.16x], to = [%d])\n, index, to); /* Fills in zeros if 'to' goes beyond inode size */ rc = fill_zeros_to_end_of_page(page, to); if (rc) { ecryptfs_printk(KERN_WARNING, Error attempting to fill - zeros in page with index = [0x%.16x]\n, - page-index); + zeros in page with index = [0x%.16x]\n, index); goto out; } rc = ecryptfs_encrypt_page(page); if (rc
[rfc][patch 2/5] cifs: new aops
Convert cifs to new aops. This is another relatively naive conversion. Always do the read upfront when the page is not uptodate (unless we're in the writethrough path). Fix an uninitialized data exposure where SetPageUptodate was called before the page was uptodate. SetPageUptodate and switch to writeback mode in the case that the full page was dirtied. Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- Index: linux-2.6/fs/cifs/file.c === --- linux-2.6.orig/fs/cifs/file.c +++ linux-2.6/fs/cifs/file.c @@ -103,7 +103,7 @@ static inline int cifs_open_inode_helper /* want handles we can use to read with first in the list so we do not have to walk the - list to search for one in prepare_write */ + list to search for one in write_begin */ if ((file-f_flags O_ACCMODE) == O_WRONLY) { list_add_tail(pCifsFile-flist, pCifsInode-openFileList); @@ -1411,49 +1411,53 @@ static int cifs_writepage(struct page *p return rc; } -static int cifs_commit_write(struct file *file, struct page *page, - unsigned offset, unsigned to) +static int cifs_write_end(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata) { - int xid; - int rc = 0; - struct inode *inode = page-mapping-host; - loff_t position = ((loff_t)page-index PAGE_CACHE_SHIFT) + to; - char *page_data; + int rc; + struct inode *inode = mapping-host; + loff_t position; + + cFYI(1, (commit write for page %p from position %lld with %d bytes, +page, pos, copied)); + + if (!PageUptodate(page) copied == PAGE_CACHE_SIZE) + SetPageUptodate(page); - xid = GetXid(); - cFYI(1, (commit write for page %p up to position %lld for %d, -page, position, to)); - spin_lock(inode-i_lock); - if (position inode-i_size) { - i_size_write(inode, position); - } - spin_unlock(inode-i_lock); if (!PageUptodate(page)) { - position = ((loff_t)page-index PAGE_CACHE_SHIFT) + offset; - /* can not rely on (or let) writepage write this data */ - if (to offset) { - cFYI(1, (Illegal offsets, can 
not copy from %d to %d, - offset, to)); - FreeXid(xid); - return rc; - } + char *page_data; + unsigned offset = pos (PAGE_CACHE_SIZE - 1); + int xid; + + xid = GetXid(); /* this is probably better than directly calling partialpage_write since in this function the file handle is known which we might as well leverage */ /* BB check if anything else missing out of ppw such as updating last write time */ page_data = kmap(page); - rc = cifs_write(file, page_data + offset, to-offset, - position); - if (rc 0) - rc = 0; - /* else if (rc 0) should we set writebehind rc? */ + rc = cifs_write(file, page_data + offset, copied, position); + /* if (rc 0) should we set writebehind rc? */ kunmap(page); + + FreeXid(xid); } else { + rc = copied; set_page_dirty(page); } - FreeXid(xid); + if (rc 0) { + position = pos + rc; + spin_lock(inode-i_lock); + if (position inode-i_size) + i_size_write(inode, position); + spin_unlock(inode-i_lock); + } + + unlock_page(page); + page_cache_release(page); + return rc; } @@ -1999,49 +2003,45 @@ int is_size_safe_to_change(struct cifsIn return 1; } -static int cifs_prepare_write(struct file *file, struct page *page, - unsigned from, unsigned to) +static int cifs_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) { - int rc = 0; - loff_t i_size; - loff_t offset; + pgoff_t index = pos PAGE_CACHE_SHIFT; + loff_t offset = pos (PAGE_CACHE_SIZE - 1); + struct page *page; + + cFYI(1, (write begin from %lld len %d, (long long)pos, len)); + + page = __grab_cache_page(mapping, index); + if (!page) + return -ENOMEM; - cFYI(1, (prepare write for page %p from %d to %d, page, from, to)); if (PageUptodate(page)) return 0; /* If we are writing a full page
[rfc][patch 3/5] afs: new aops
Convert afs to new aops. Cannot assume writes will fully complete, so this conversion goes the easy way and always brings the page uptodate before the write. Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- Index: linux-2.6/fs/afs/file.c === --- linux-2.6.orig/fs/afs/file.c +++ linux-2.6/fs/afs/file.c @@ -50,8 +50,8 @@ const struct address_space_operations af .launder_page = afs_launder_page, .releasepage= afs_releasepage, .invalidatepage = afs_invalidatepage, - .prepare_write = afs_prepare_write, - .commit_write = afs_commit_write, + .write_begin= afs_write_begin, + .write_end = afs_write_end, .writepage = afs_writepage, .writepages = afs_writepages, }; Index: linux-2.6/fs/afs/internal.h === --- linux-2.6.orig/fs/afs/internal.h +++ linux-2.6/fs/afs/internal.h @@ -731,8 +731,12 @@ extern int afs_volume_release_fileserver */ extern int afs_set_page_dirty(struct page *); extern void afs_put_writeback(struct afs_writeback *); -extern int afs_prepare_write(struct file *, struct page *, unsigned, unsigned); -extern int afs_commit_write(struct file *, struct page *, unsigned, unsigned); +extern int afs_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata); +extern int afs_write_end(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata); extern int afs_writepage(struct page *, struct writeback_control *); extern int afs_writepages(struct address_space *, struct writeback_control *); extern int afs_write_inode(struct inode *, int); Index: linux-2.6/fs/afs/write.c === --- linux-2.6.orig/fs/afs/write.c +++ linux-2.6/fs/afs/write.c @@ -84,15 +84,23 @@ void afs_put_writeback(struct afs_writeb * partly or wholly fill a page that's under preparation for writing */ static int afs_fill_page(struct afs_vnode *vnode, struct key *key, -unsigned start, unsigned len, struct page *page) + loff_t pos, unsigned len, struct 
page *page) { + loff_t i_size; + unsigned eof; int ret; - _enter(,,%u,%u, start, len); + _enter(,,%llu,%u, (unsigned long long)pos, len); - ASSERTCMP(start + len, =, PAGE_SIZE); + ASSERTCMP(len, =, PAGE_CACHE_SIZE); - ret = afs_vnode_fetch_data(vnode, key, start, len, page); + i_size = i_size_read(vnode-vfs_inode); + if (pos + len i_size) + eof = i_size; + else + eof = PAGE_CACHE_SIZE; + + ret = afs_vnode_fetch_data(vnode, key, 0, eof, page); if (ret 0) { if (ret == -ENOENT) { _debug(got NOENT from server @@ -107,109 +115,55 @@ static int afs_fill_page(struct afs_vnod } /* - * prepare a page for being written to - */ -static int afs_prepare_page(struct afs_vnode *vnode, struct page *page, - struct key *key, unsigned offset, unsigned to) -{ - unsigned eof, tail, start, stop, len; - loff_t i_size, pos; - void *p; - int ret; - - _enter(); - - if (offset == 0 to == PAGE_SIZE) - return 0; - - p = kmap_atomic(page, KM_USER0); - - i_size = i_size_read(vnode-vfs_inode); - pos = (loff_t) page-index PAGE_SHIFT; - if (pos = i_size) { - /* partial write, page beyond EOF */ - _debug(beyond); - if (offset 0) - memset(p, 0, offset); - if (to PAGE_SIZE) - memset(p + to, 0, PAGE_SIZE - to); - kunmap_atomic(p, KM_USER0); - return 0; - } - - if (i_size - pos = PAGE_SIZE) { - /* partial write, page entirely before EOF */ - _debug(before); - tail = eof = PAGE_SIZE; - } else { - /* partial write, page overlaps EOF */ - eof = i_size - pos; - _debug(overlap %u, eof); - tail = max(eof, to); - if (tail PAGE_SIZE) - memset(p + tail, 0, PAGE_SIZE - tail); - if (offset eof) - memset(p + eof, 0, PAGE_SIZE - eof); - } - - kunmap_atomic(p, KM_USER0); - - ret = 0; - if (offset 0 || eof to) { - /* need to fill one or two bits that aren't going to be written -* (cover both fillers in one read if there are two) */ - start = (offset 0) ? 0 : to; - stop = (eof to) ? eof : offset; - len
[rfc][patch 4/5] rd: rewrite rd
This is a rewrite of the ramdisk block device driver.

The old one is really difficult because it effectively implements a block device which serves data out of its own buffer cache. It relies on the dirty bit being set to pin its backing store in cache, however there are non-trivial paths which can clear the dirty bit (eg. try_to_free_buffers()), which had recently led to data corruption. And in general it is completely wrong for a block device driver to do this.

The new one is more like a regular block device driver. It has no idea about vm/vfs stuff. Its backing store is similar to the buffer cache (a simple radix-tree of pages), but it doesn't know anything about page cache (the pages in the radix tree are not pagecache pages).

There is one slight downside -- direct block device access and filesystem metadata access goes through an extra copy and gets stored in RAM twice. However, this downside is only slight, because the real buffercache of the device is now reclaimable (because we're not playing crazy games with it), so under memory intensive situations, footprint should effectively be the same -- maybe even a slight advantage to the new driver because it can also reclaim buffer heads. The fact that it now goes through all the regular vm/fs paths makes it much more useful for testing, too.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/drivers/block/Kconfig
===
--- linux-2.6.orig/drivers/block/Kconfig
+++ linux-2.6/drivers/block/Kconfig
@@ -311,7 +311,7 @@ config BLK_DEV_UB
 	  If unsure, say N.
 
 config BLK_DEV_RAM
-	tristate "RAM disk support"
+	tristate "RAM block device support"
 	---help---
 	  Saying Y here will allow you to use a portion of your RAM memory as
 	  a block device, so that you can make file systems on it, read and
@@ -346,16 +346,6 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
-config BLK_DEV_RAM_BLOCKSIZE
-	int "Default RAM disk block size (bytes)"
-	depends on BLK_DEV_RAM
-	default 1024
-	help
-	  The default value is 1024 bytes. PAGE_SIZE is a much more
-	  efficient choice however. The default is kept to ensure initrd
-	  setups function - apparently needed by the rd_load_image routine
-	  that supposes the filesystem in the image uses a 1024 blocksize.
-
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
Index: linux-2.6/drivers/block/Makefile
===
--- linux-2.6.orig/drivers/block/Makefile
+++ linux-2.6/drivers/block/Makefile
@@ -11,7 +11,7 @@ obj-$(CONFIG_AMIGA_FLOPPY)	+= amiflop.o
 obj-$(CONFIG_PS3_DISK)		+= ps3disk.o
 obj-$(CONFIG_ATARI_FLOPPY)	+= ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
-obj-$(CONFIG_BLK_DEV_RAM)	+= rd.o
+obj-$(CONFIG_BLK_DEV_RD)	+= brd.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
 obj-$(CONFIG_BLK_DEV_PS2)	+= ps2esdi.o
 obj-$(CONFIG_BLK_DEV_XD)	+= xd.o
Index: linux-2.6/drivers/block/brd.c
===
--- /dev/null
+++ linux-2.6/drivers/block/brd.c
@@ -0,0 +1,477 @@
+/*
+ * Ram backed block device driver.
+ *
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ *
+ * Parts derived from drivers/block/rd.c, and drivers/block/loop.c, copyright
+ * of their respective owners.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/major.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/highmem.h>
+#include <linux/gfp.h>
+#include <linux/radix-tree.h>
+#include <linux/buffer_head.h> /* invalidate_bh_lrus() */
+
+#include <asm/uaccess.h>
+
+#define PAGE_SECTORS	(1 << (PAGE_SHIFT - 9))
+
+struct brd_device {
+	int		brd_number;
+	int		brd_refcnt;
+	loff_t		brd_offset;
+	loff_t		brd_sizelimit;
+	unsigned	brd_blocksize;
+
+	struct request_queue	*brd_queue;
+	struct gendisk		*brd_disk;
+
+	spinlock_t		brd_lock;
+	struct radix_tree_root	brd_pages;
+
+	struct list_head	brd_list;
+};
+
+static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
+{
+	unsigned long idx;
+	struct page *page;
+
+	rcu_read_lock();
+	idx = sector >> (PAGE_SHIFT - 9);
+	page = radix_tree_lookup(&brd->brd_pages, idx);
+	rcu_read_unlock();
+
+	BUG_ON(page && page->index != idx);
+
+	return page;
+}
+
+static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
+{
+	unsigned long idx;
+	struct page *page;
+
+	page = brd_lookup_page(brd, sector);
+	if (page)
+		return page;
+
+	page = alloc_page(GFP_NOIO
[rfc][patch 5/5] remove prepare_write
Index: linux-2.6/drivers/block/loop.c === --- linux-2.6.orig/drivers/block/loop.c +++ linux-2.6/drivers/block/loop.c @@ -40,8 +40,7 @@ * Heinz Mauelshagen [EMAIL PROTECTED], Feb 2002 * * Support for falling back on the write file operation when the address space - * operations prepare_write and/or commit_write are not available on the - * backing filesystem. + * operations write_begin is not available on the backing filesystem. * Anton Altaparmakov, 16 Feb 2005 * * Still To Fix: @@ -761,7 +760,7 @@ static int loop_set_fd(struct loop_devic */ if (!file-f_op-splice_read) goto out_putf; - if (aops-prepare_write || aops-write_begin) + if (aops-write_begin) lo_flags |= LO_FLAGS_USE_AOPS; if (!(lo_flags LO_FLAGS_USE_AOPS) !file-f_op-write) lo_flags |= LO_FLAGS_READ_ONLY; Index: linux-2.6/fs/fat/inode.c === --- linux-2.6.orig/fs/fat/inode.c +++ linux-2.6/fs/fat/inode.c @@ -175,7 +175,7 @@ static ssize_t fat_direct_IO(int rw, str if (rw == WRITE) { /* -* FIXME: blockdev_direct_IO() doesn't use -prepare_write(), +* FIXME: blockdev_direct_IO() doesn't use -write_begin(), * so we need to update the -mmu_private to block boundary. * * But we must fill the remaining area or hole by nul for Index: linux-2.6/fs/ocfs2/file.c === --- linux-2.6.orig/fs/ocfs2/file.c +++ linux-2.6/fs/ocfs2/file.c @@ -790,8 +790,7 @@ leave: /* Some parts of this taken from generic_cont_expand, which turned out * to be too fragile to do exactly what we need without us having to - * worry about recursive locking in -prepare_write() and - * -commit_write(). */ + * worry about recursive locking in -write_begin() and -write_end(). */ static int ocfs2_write_zero_page(struct inode *inode, u64 size) { Index: linux-2.6/fs/splice.c === --- linux-2.6.orig/fs/splice.c +++ linux-2.6/fs/splice.c @@ -734,8 +734,8 @@ ssize_t splice_from_pipe(struct pipe_ino }; /* -* The actor worker might be calling -prepare_write and -* -commit_write. 
Most of the time, these expect i_mutex to +* The actor worker might be calling -write_begin and +* -write_end. Most of the time, these expect i_mutex to * be held. Since this may result in an ABBA deadlock with * pipe-inode, we have to order lock acquiry here. */ Index: linux-2.6/include/linux/fs.h === --- linux-2.6.orig/include/linux/fs.h +++ linux-2.6/include/linux/fs.h @@ -453,13 +453,6 @@ struct address_space_operations { int (*readpages)(struct file *filp, struct address_space *mapping, struct list_head *pages, unsigned nr_pages); - /* -* ext3 requires that a successful prepare_write() call be followed -* by a commit_write() call - they must be balanced -*/ - int (*prepare_write)(struct file *, struct page *, unsigned, unsigned); - int (*commit_write)(struct file *, struct page *, unsigned, unsigned); - int (*write_begin)(struct file *, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata); Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -1863,48 +1863,8 @@ int pagecache_write_begin(struct file *f { const struct address_space_operations *aops = mapping-a_ops; - if (aops-write_begin) { - return aops-write_begin(file, mapping, pos, len, flags, + return aops-write_begin(file, mapping, pos, len, flags, pagep, fsdata); - } else { - int ret; - pgoff_t index = pos PAGE_CACHE_SHIFT; - unsigned offset = pos (PAGE_CACHE_SIZE - 1); - struct inode *inode = mapping-host; - struct page *page; -again: - page = __grab_cache_page(mapping, index); - *pagep = page; - if (!page) - return -ENOMEM; - - if (flags AOP_FLAG_UNINTERRUPTIBLE !PageUptodate(page)) { - /* -* There is no way to resolve a short write situation -* for a !Uptodate page (except by double copying in -* the caller done by
Re: [patch] fs: restore nobh
On Thu, Oct 25, 2007 at 09:07:36PM +0200, Jan Kara wrote:

Hi, This is overdue, sorry. Got a little complicated, and I've been away from my filesystem test setup so I didn't want to send it (lucky, coz I found a bug after more substantial testing). Anyway, RFC?

Hmm, maybe one comment/question:

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c	2007-10-08 14:09:35.0 +1000
+++ linux-2.6/fs/buffer.c	2007-10-08 16:45:28.0 +1000
...
-/*
- * Make sure any changes to nobh_commit_write() are reflected in
- * nobh_truncate_page(), since it doesn't call commit_write().
- */
-int nobh_commit_write(struct file *file, struct page *page,
-		unsigned from, unsigned to)
+int nobh_write_end(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned copied,
+		struct page *page, void *fsdata)
 {
 	struct inode *inode = page->mapping->host;
-	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	struct buffer_head *head = NULL;
+	struct buffer_head *bh;
 
-	if (page_has_buffers(page))
-		return generic_commit_write(file, page, from, to);
+	if (!PageMappedToDisk(page)) {
+		if (unlikely(copied < len) && !page_has_buffers(page))
+			attach_nobh_buffers(page, head);
+		if (page_has_buffers(page))
+			return generic_write_end(file, mapping, pos, len, copied, page, fsdata);
+	}

So we can have a page with attached buffers but we decide not to call generic_write_end()...

 	SetPageUptodate(page);
 	set_page_dirty(page);

And we just set the page dirty but we leave buffers clean. So cannot a subsequent try_to_free_buffers() come, mark the page as clean (as buffers are clean) and discard the data?

Hmm, we might just be OK here because set_page_dirty should dirty the buffers. However, I think you have spotted a mistake here, and it would be better if the generic_write_end test was outside the PageMappedToDisk check. Thanks!
Re: [PATCH 00/31] Remove iget() and read_inode() [try #4]
On Friday 12 October 2007 19:07, David Howells wrote: Hi Linus, Here's a set of patches that remove all calls to iget() and all read_inode() functions. They should be removed for two reasons: firstly they don't lend themselves to good error handling, and secondly their presence is a temptation for code outside a filesystem to call iget() to access inodes within that filesystem. Is there any particular reason not to push these through -mm?
[patch] fs: restore nobh
Hi, This is overdue, sorry. Got a little complicated, and I've been away from my filesystem test setup so I didn't want ot send it (lucky, coz I found a bug after more substantial testing). Anyway, RFC? --- Implement nobh in new aops. This is a bit tricky. FWIW, nobh_truncate is now implemented in a way that does not create blocks in sparse regions, which is a silly thing for it to have been doing (isn't it?) ext2 survives fsx and fsstress. jfs is converted as well... ext3 should be easy to do (but not done yet). Index: linux-2.6/fs/buffer.c === --- linux-2.6.orig/fs/buffer.c 2007-10-08 14:09:35.0 +1000 +++ linux-2.6/fs/buffer.c 2007-10-08 16:45:28.0 +1000 @@ -2378,7 +2378,7 @@ } /* - * nobh_prepare_write()'s prereads are special: the buffer_heads are freed + * nobh_write_begin()'s prereads are special: the buffer_heads are freed * immediately, while under the page lock. So it needs a special end_io * handler which does not touch the bh after unlocking it. */ @@ -2388,16 +2388,45 @@ } /* + * Attach the singly-linked list of buffers created by nobh_write_begin, to + * the page (converting it to circular linked list and taking care of page + * dirty races). + */ +static void attach_nobh_buffers(struct page *page, struct buffer_head *head) +{ + struct buffer_head *bh; + + BUG_ON(!PageLocked(page)); + + spin_lock(page-mapping-private_lock); + bh = head; + do { + if (PageDirty(page)) + set_buffer_dirty(bh); + if (!bh-b_this_page) + bh-b_this_page = head; + bh = bh-b_this_page; + } while (bh != head); + attach_page_buffers(page, head); + spin_unlock(page-mapping-private_lock); +} + +/* * On entry, the page is fully not uptodate. 
* On exit the page is fully uptodate in the areas outside (from,to) */ -int nobh_prepare_write(struct page *page, unsigned from, unsigned to, +int nobh_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata, get_block_t *get_block) { - struct inode *inode = page-mapping-host; + struct inode *inode = mapping-host; const unsigned blkbits = inode-i_blkbits; const unsigned blocksize = 1 blkbits; struct buffer_head *head, *bh; + struct page *page; + pgoff_t index; + unsigned from, to; unsigned block_in_page; unsigned block_start, block_end; sector_t block_in_file; @@ -2406,8 +2435,22 @@ int ret = 0; int is_mapped_to_disk = 1; - if (page_has_buffers(page)) - return block_prepare_write(page, from, to, get_block); + index = pos PAGE_CACHE_SHIFT; + from = pos (PAGE_CACHE_SIZE - 1); + to = from + len; + + page = __grab_cache_page(mapping, index); + if (!page) + return -ENOMEM; + *pagep = page; + *fsdata = NULL; + + if (page_has_buffers(page)) { + unlock_page(page); + page_cache_release(page); + *pagep = NULL; + return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata, get_block); + } if (PageMappedToDisk(page)) return 0; @@ -2422,8 +2465,10 @@ * than the circular one we're used to. */ head = alloc_page_buffers(page, blocksize, 0); - if (!head) - return -ENOMEM; + if (!head) { + ret = -ENOMEM; + goto out_release; + } block_in_file = (sector_t)page-index (PAGE_CACHE_SHIFT - blkbits); @@ -2492,15 +2537,12 @@ if (is_mapped_to_disk) SetPageMappedToDisk(page); - do { - bh = head; - head = head-b_this_page; - free_buffer_head(bh); - } while (head); + *fsdata = head; /* to be released by nobh_write_end */ return 0; failed: + BUG_ON(!ret); /* * Error recovery is a bit difficult. We need to zero out blocks that * were newly allocated, and dirty them to ensure they get written out. 
@@ -2508,64 +2550,56 @@ * the handling of potential IO errors during writeout would be hard * (could try doing synchronous writeout, but what if that fails too?) */ - spin_lock(page-mapping-private_lock); - bh = head; - block_start = 0; - do { - if (PageUptodate(page)) - set_buffer_uptodate(bh); - if (PageDirty(page)) - set_buffer_dirty(bh); + attach_nobh_buffers(page, head); + page_zero_new_buffers(page, from, to); - block_end = block_start+blocksize; - if (block_end
Re: [PATCH]fix VM_CAN_NONLINEAR check in sys_remap_file_pages
On Monday 08 October 2007 23:37, Hugh Dickins wrote:

On Mon, 8 Oct 2007, Yan Zheng wrote: The test for VM_CAN_NONLINEAR always fails.

Good catch indeed. Though I was puzzled how we do nonlinear at all, until I realized it's the test for not-VM_CAN_NONLINEAR that always fails. It's not as serious as it appears, since code further down has been added more recently to simulate nonlinear on non-RAM-backed filesystems, instead of going the real nonlinear way; so most filesystems are now not required to do what VM_CAN_NONLINEAR was put in to ensure they could do.

Well, I think all filesystems can do VM_CAN_NONLINEAR anyway. Device drivers and weird things tend to have trouble...

I'm confused as to where that leaves us: is this actually a fix that needs to go into 2.6.23? Or will it suddenly disable a system call which has been silently working fine on various filesystems which did not add VM_CAN_NONLINEAR? Could we just rip out VM_CAN_NONLINEAR?

We probably should keep VM_CAN_NONLINEAR for the moment, I think. But now that we have the fallback path, we _could_ use that instead of failing. I doubt anybody will be using nonlinear mappings on anything but regular files for the time being, but as a trivial fix, I think this probably should go into 2.6.23. Thanks for spotting this problem.

Acked-by: Nick Piggin [EMAIL PROTECTED]

I hope Nick or Miklos is clearer on what the risks are. (Apologies for all the nots and nons here, I'm embarrassed after just criticizing Ingo's SCHED_NO_NO_OMIT_FRAME_POINTER!)

Hugh

Signed-off-by: Yan Zheng [EMAIL PROTECTED]

diff -ur linux-2.6.23-rc9/mm/fremap.c linux/mm/fremap.c
--- linux-2.6.23-rc9/mm/fremap.c	2007-10-07 15:03:33.0 +0800
+++ linux/mm/fremap.c	2007-10-08 19:33:44.0 +0800
@@ -160,7 +160,7 @@
 	if (vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR))
 		goto out;
 
-	if (!vma->vm_flags & VM_CAN_NONLINEAR)
+	if (!(vma->vm_flags & VM_CAN_NONLINEAR))
 		goto out;
 
 	if (end <= start || start < vma->vm_start || end > vma->vm_end)
Re: [PATCH]fix VM_CAN_NONLINEAR check in sys_remap_file_pages
On Tuesday 09 October 2007 03:04, Andrew Morton wrote:

On Mon, 8 Oct 2007 19:45:08 +0800 Yan Zheng [EMAIL PROTECTED] wrote: Hi all. The test for VM_CAN_NONLINEAR always fails.

Signed-off-by: Yan Zheng [EMAIL PROTECTED]

diff -ur linux-2.6.23-rc9/mm/fremap.c linux/mm/fremap.c
--- linux-2.6.23-rc9/mm/fremap.c	2007-10-07 15:03:33.0 +0800
+++ linux/mm/fremap.c	2007-10-08 19:33:44.0 +0800
@@ -160,7 +160,7 @@
 	if (vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR))
 		goto out;
 
-	if (!vma->vm_flags & VM_CAN_NONLINEAR)
+	if (!(vma->vm_flags & VM_CAN_NONLINEAR))
 		goto out;
 
 	if (end <= start || start < vma->vm_start || end > vma->vm_end)

Lovely. From this we can deduce that nobody has run remap_file_pages() since 2.6.23-rc1 and that nobody (including the developer who made that change) ran it while that change was in -mm.

But you'd be wrong. remap_file_pages was tested both with my own tester and Ingo's test program. vm_flags != 0, so !vm_flags == 0, and 0 & x == 0, so the test always falls through.

Of course, what I _should_ have done is also test a driver which does not have VM_CAN_NONLINEAR... but even I wouldn't rewrite half the nonlinear mapping code without once testing it ;)

FWIW, Oracle (maybe the sole real user of this) has been testing it, which I'm very happy about (rather than testing after 2.6.23 is released).
Re: [PATCH]fix VM_CAN_NONLINEAR check in sys_remap_file_pages
On Tuesday 09 October 2007 03:51, Andrew Morton wrote: On Mon, 8 Oct 2007 10:28:43 -0700 I'll now add remap_file_pages soon. Maybe those other 2 tests aren't strong enough (?). Or maybe they don't return a non-0 exit status even when they fail... (I'll check.) Perhaps Yan Zheng can tell us what test was used to demonstrate this? Was probably found by review. Otherwise, you could probably reproduce it by mmapping, say, a drm device node, running remap_file_pages() on it to create a nonlinear mapping, and then finding that you get the wrong data. I'm surprised that LTP doesn't have any remap_file_pages() tests. A quick grep didn't find any for me. Me either. There are a few lying around the place which could be integrated. It would be good if LTP were to have some remap_file_pages() tests (please). As we see here, it is something which we can easily break, and leave broken for some time. Here is Ingo's old test, since cleaned up and fixed a bit by me. I'm sure he would distribute it GPL, but I've cc'ed him because I didn't find an explicit statement about that.
/*
 * Copyright (C) Ingo Molnar, 2002
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/times.h>
#include <sys/wait.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/unistd.h>

#define PAGE_SIZE 4096
#define PAGE_WORDS (PAGE_SIZE/sizeof(int))

#define CACHE_PAGES 1024
#define CACHE_SIZE (CACHE_PAGES*PAGE_SIZE)

#define WINDOW_PAGES 16
#define WINDOW_SIZE (WINDOW_PAGES*PAGE_SIZE)
#define WINDOW_START 0x48000000

static char cache_contents [CACHE_SIZE];

static void test_nonlinear(int fd)
{
	char *data = NULL;
	int i, j, repeat = 2;

	for (i = 0; i < CACHE_PAGES; i++) {
		int *page = (int *) (cache_contents + i*PAGE_SIZE);

		for (j = 0; j < PAGE_WORDS; j++)
			page[j] = i;
	}
	if (write(fd, cache_contents, CACHE_SIZE) != CACHE_SIZE)
		perror("write"), exit(1);

	data = mmap((void *)WINDOW_START, WINDOW_SIZE,
		    PROT_READ|PROT_WRITE,
		    MAP_FIXED | MAP_SHARED, fd, 0);
	if (data == MAP_FAILED)
		perror("mmap"), exit(1);

again:
	for (i = 0; i < WINDOW_PAGES; i += 2) {
		char *page = data + i*PAGE_SIZE;

		if (remap_file_pages(page, PAGE_SIZE * 2, 0,
				     (WINDOW_PAGES-i-2), 0) == -1)
			perror("remap_file_pages"), exit(1);
	}

	for (i = 0; i < WINDOW_PAGES; i++) {
		/*
		 * Double-check the correctness of the mapping:
		 */
		if (i & 1) {
			if (data[i*PAGE_SIZE] != WINDOW_PAGES-i) {
				printf("hm, mapped incorrect data!\n");
				exit(1);
			}
		} else {
			if (data[i*PAGE_SIZE] != WINDOW_PAGES-i-2) {
				printf("hm, mapped incorrect data!\n");
				exit(1);
			}
		}
	}
	if (--repeat)
		goto again;
}

int main(int argc, char **argv)
{
	int fd;

	fd = open("/dev/shm/cache", O_RDWR|O_CREAT|O_TRUNC, S_IRWXU);
	if (fd < 0)
		perror("open"), exit(1);
	test_nonlinear(fd);
	if (close(fd) == -1)
		perror("close"), exit(1);
	printf("nonlinear shm file OK\n");

	fd = open("/tmp/cache", O_RDWR|O_CREAT|O_TRUNC, S_IRWXU);
	if (fd < 0)
		perror("open"), exit(1);
	test_nonlinear(fd);
	if (close(fd) == -1)
		perror("close"), exit(1);
	printf("nonlinear /tmp/ file OK\n");

	exit(0);
}
Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
On Tuesday 02 October 2007 07:01, Christoph Lameter wrote: On Sat, 29 Sep 2007, Peter Zijlstra wrote: On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote: Really? That means we can no longer even allocate stacks for forking. I think I'm running with 4k stacks... 4k stacks will never fly on an SGI x86_64 NUMA configuration given the additional data that may be kept on the stack. We are currently considering going from 8k to 16k (or even 32k) to make things work. So having the ability to put the stacks in vmalloc space may be something to look at. i386 and x86-64 have used 8K stacks for years and they have never really been much of a problem before. They only started failing when contiguous memory gets used up by other things, _even with_ those anti-frag patches in there. Bottom line is that you do not use higher order allocations when you do not need them.
Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
On Sunday 30 September 2007 05:20, Andrew Morton wrote: On Sat, 29 Sep 2007 06:19:33 +1000 Nick Piggin [EMAIL PROTECTED] wrote: On Saturday 29 September 2007 19:27, Andrew Morton wrote: On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: oom-killings, or page allocation failures? The latter, one hopes. Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007 ... mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0 Call Trace: 611b3878: [6002dd28] printk_ratelimit+0x15/0x17 611b3888: [60052ed4] out_of_memory+0x80/0x100 611b38c8: [60054b0c] __alloc_pages+0x1ed/0x280 611b3948: [6006c608] allocate_slab+0x5b/0xb0 611b3968: [6006c705] new_slab+0x7e/0x183 611b39a8: [6006cbae] __slab_alloc+0xc9/0x14b 611b39b0: [6011f89f] radix_tree_preload+0x70/0xbf 611b39b8: [600980f2] do_mpage_readpage+0x3b3/0x472 611b39e0: [6011f89f] radix_tree_preload+0x70/0xbf 611b39f8: [6006cc81] kmem_cache_alloc+0x51/0x98 611b3a38: [6011f89f] radix_tree_preload+0x70/0xbf 611b3a58: [6004f8e2] add_to_page_cache+0x22/0xf7 611b3a98: [6004f9c6] add_to_page_cache_lru+0xf/0x24 611b3ab8: [6009821e] mpage_readpages+0x6d/0x109 611b3ac0: [600d59f0] ext3_get_block+0x0/0xf2 611b3b08: [6005483d] get_page_from_freelist+0x8d/0xc1 611b3b88: [600d6937] ext3_readpages+0x18/0x1a 611b3b98: [60056f00] read_pages+0x37/0x9b 611b3bd8: [60057064] __do_page_cache_readahead+0x100/0x157 611b3c48: [60057196] do_page_cache_readahead+0x52/0x5f 611b3c78: [60050ab4] filemap_fault+0x145/0x278 611b3ca8: [60022b61] run_syscall_stub+0xd1/0xdd 611b3ce8: [6005eae3] __do_fault+0x7e/0x3ca 611b3d68: [6005ee60] do_linear_fault+0x31/0x33 611b3d88: [6005f149] handle_mm_fault+0x14e/0x246 611b3da8: [60120a7b] __up_read+0x73/0x7b 611b3de8: [60013177] handle_page_fault+0x11f/0x23b 611b3e48: [60013419] segv+0xac/0x297 611b3f28: [60013367] segv_handler+0x68/0x6e 611b3f48: [600232ad] get_skas_faultinfo+0x9c/0xa1 611b3f68: [60023853] 
userspace+0x13a/0x19d 611b3fc8: [60010d58] fork_handler+0x86/0x8d OK, that's different. Someone broke the vm - order-2 GFP_KERNEL allocations aren't supposed to fail. I'm suspecting that did_some_progress thing. The allocation didn't fail -- it invoked the OOM killer because the kernel ran out of unfragmented memory. We can't run out of unfragmented memory for an order-2 GFP_KERNEL allocation in this workload. We go and synchronously free stuff up to make it work. How did this get broken? Either no more order-2 pages could be freed, or the ones that were being freed were being used by something else (eg. other order-2 slab allocations). Probably because higher order allocations are the new vogue in -mm at the moment ;) That's a different bug. bug 1: We shouldn't be doing higher-order allocations in slub because of the considerable damage this does to atomic allocations. bug 2: order-2 GFP_KERNEL allocations shouldn't fail like this. I think one causes 2 as well -- it isn't just considerable damage to atomic allocations but to GFP_KERNEL allocations too. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
On Monday 01 October 2007 06:12, Andrew Morton wrote: On Sun, 30 Sep 2007 05:09:28 +1000 Nick Piggin [EMAIL PROTECTED] wrote: On Sunday 30 September 2007 05:20, Andrew Morton wrote: We can't run out of unfragmented memory for an order-2 GFP_KERNEL allocation in this workload. We go and synchronously free stuff up to make it work. How did this get broken? Either no more order-2 pages could be freed, or the ones that were being freed were being used by something else (eg. other order-2 slab allocations). No. The current design of reclaim (for better or for worse) is that for order 0,1,2 and 3 allocations we just keep on trying until it works. That got broken and I think it got broken at a design level when that did_some_progress logic went in. Perhaps something else we did later worsened things. It will keep trying until it works. It won't have stopped trying (unless I'm very mistaken?), it's just oom killing things merrily along the way. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
On Saturday 29 September 2007 19:27, Andrew Morton wrote: On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: oom-killings, or page allocation failures? The latter, one hopes. Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007 ... mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0 Call Trace: 611b3878: [6002dd28] printk_ratelimit+0x15/0x17 611b3888: [60052ed4] out_of_memory+0x80/0x100 611b38c8: [60054b0c] __alloc_pages+0x1ed/0x280 611b3948: [6006c608] allocate_slab+0x5b/0xb0 611b3968: [6006c705] new_slab+0x7e/0x183 611b39a8: [6006cbae] __slab_alloc+0xc9/0x14b 611b39b0: [6011f89f] radix_tree_preload+0x70/0xbf 611b39b8: [600980f2] do_mpage_readpage+0x3b3/0x472 611b39e0: [6011f89f] radix_tree_preload+0x70/0xbf 611b39f8: [6006cc81] kmem_cache_alloc+0x51/0x98 611b3a38: [6011f89f] radix_tree_preload+0x70/0xbf 611b3a58: [6004f8e2] add_to_page_cache+0x22/0xf7 611b3a98: [6004f9c6] add_to_page_cache_lru+0xf/0x24 611b3ab8: [6009821e] mpage_readpages+0x6d/0x109 611b3ac0: [600d59f0] ext3_get_block+0x0/0xf2 611b3b08: [6005483d] get_page_from_freelist+0x8d/0xc1 611b3b88: [600d6937] ext3_readpages+0x18/0x1a 611b3b98: [60056f00] read_pages+0x37/0x9b 611b3bd8: [60057064] __do_page_cache_readahead+0x100/0x157 611b3c48: [60057196] do_page_cache_readahead+0x52/0x5f 611b3c78: [60050ab4] filemap_fault+0x145/0x278 611b3ca8: [60022b61] run_syscall_stub+0xd1/0xdd 611b3ce8: [6005eae3] __do_fault+0x7e/0x3ca 611b3d68: [6005ee60] do_linear_fault+0x31/0x33 611b3d88: [6005f149] handle_mm_fault+0x14e/0x246 611b3da8: [60120a7b] __up_read+0x73/0x7b 611b3de8: [60013177] handle_page_fault+0x11f/0x23b 611b3e48: [60013419] segv+0xac/0x297 611b3f28: [60013367] segv_handler+0x68/0x6e 611b3f48: [600232ad] get_skas_faultinfo+0x9c/0xa1 611b3f68: [60023853] userspace+0x13a/0x19d 611b3fc8: [60010d58] fork_handler+0x86/0x8d OK, that's different. 
Someone broke the vm - order-2 GFP_KERNEL allocations aren't supposed to fail. I'm suspecting that did_some_progress thing. The allocation didn't fail -- it invoked the OOM killer because the kernel ran out of unfragmented memory. Probably because higher order allocations are the new vogue in -mm at the moment ;) - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
On Saturday 29 September 2007 04:41, Christoph Lameter wrote: On Fri, 28 Sep 2007, Peter Zijlstra wrote: memory got massively fragmented, as anti-frag gets easily defeated. setting min_free_kbytes to 12M does seem to solve it - it forces 2 max order blocks to stay available, so we don't mix types. however 12M on 128M is rather a lot. Yes, strict ordering would be much better. On NUMA it may be possible to completely forbid merging. We can fall back to other nodes if necessary. 12M is not much on a NUMA system. But this shows that (unsurprisingly) we may have issues on systems with small amounts of memory and we may not want to use higher orders on such systems. The case you got may be good to use as a testcase for the virtual fallback. Hmmm... Maybe it is possible to allocate the stack as a virtual compound page. Got some script/code to produce that problem? Yeah, you could do that, but we generally don't have big problems allocating stacks in mainline, because we have very few users of higher order pages, and the few that are there don't seem to be a problem.
Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
On Wednesday 19 September 2007 13:36, Christoph Lameter wrote: SLAB_VFALLBACK can be specified for selected slab caches. If fallback is available then the conservative settings for higher order allocations are overridden. We then request an order that can accommodate at minimum 100 objects. The size of an individual slab allocation is allowed to reach up to 256k (order 6 on i386, order 4 on IA64). How come SLUB wants such a big amount of objects? I thought the unqueued nature of it made it better than slab because it minimised the amount of cache hot memory lying around in slabs... vmalloc is incredibly slow and unscalable at the moment. I'm still working on making it more scalable and faster -- hopefully to a point where it would actually be usable for this... but you still get moved off large TLBs, and also have to inevitably do tlb flushing. Or do you have SLUB at a point where performance is comparable to SLAB, and this is just a possible idea for more performance?
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 20 September 2007 11:38, David Chinner wrote: On Wed, Sep 19, 2007 at 04:04:30PM +0200, Andrea Arcangeli wrote: Plus of course you don't like fsblock because it requires work to adapt a fs to it, I can't argue about that. No, I don't like fsblock because it is inherently a structure-per-filesystem-block construct, just like buggerheads. You still need to allocate millions of them when you have millions of dirty pages around. Rather than type it all out again, read the fsblocks thread from here: I don't think there is anything inherently wrong with a structure-per-filesystem-block construct, in the places where you want to have a handle on that information. In the data path of a lot of filesystems, it's not really useful, no question. But the block / buffer head concept is there and is useful for many things obviously. fsblock, I believe, improves on it; maybe even to the point where there won't be too much reason for many filesystems to convert their data paths to something different (eg. nobh mode, or an extent block mapping). http://marc.info/?l=linux-fsdevel&m=118284983925719&w=2 FWIW, with Chris Mason's extent-based block mapping (which btrfs is using and Christoph Hellwig is porting XFS over to) we completely remove buggerheads from XFS, and so fsblock would be a pretty major step backwards for us if Chris's work goes into mainline. If you don't need to manipulate or manage the pagecache on a per-block basis anyway, then you shouldn't need fsblock (or anything else particularly special) to do higher order block sizes. If you do sometimes need to, then fsblock *may* be a way you can remove vmap code from your filesystem and share it with generic code...
But, I'm not going to argue endlessly for one solution or another; I'm happy to see different solutions being chased, so may the best VM win ;) Agreed :) - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
On Saturday 29 September 2007 03:33, Christoph Lameter wrote: On Fri, 28 Sep 2007, Nick Piggin wrote: On Wednesday 19 September 2007 13:36, Christoph Lameter wrote: SLAB_VFALLBACK can be specified for selected slab caches. If fallback is available then the conservative settings for higher order allocations are overridden. We then request an order that can accommodate at minimum 100 objects. The size of an individual slab allocation is allowed to reach up to 256k (order 6 on i386, order 4 on IA64). How come SLUB wants such a big amount of objects? I thought the unqueued nature of it made it better than slab because it minimised the amount of cache hot memory lying around in slabs... The more objects in a page the more the fast path runs. The more the fast path runs the lower the cache footprint and the faster the overall allocations etc. SLAB can be configured for large queues holding lots of objects. SLUB can only reach the same through large pages because it does not have queues. One could add the ability to manage pools of cpu slabs but that would be adding yet another layer to compensate for the problem of the small pages. Reliable large page allocations means that we can get rid of these layers and the many workarounds that we have in place right now. That doesn't sound very nice because you don't actually want to use up higher order allocations if you can avoid it, and you definitely don't want to be increasing your slab page size unit if you can help it, because it compounds the problem of slab fragmentation. The unqueued nature of SLUB reduces memory requirements, and in general the more efficient code paths of SLUB offset the advantage that SLAB can reach by being able to put more objects onto its queues. Queueing introduces complexity and cache line use through the need to manage those queues. I thought it was slower. Have you fixed the performance regression? (OK, I read further down that you are still working on it but not confirmed yet...)
vmalloc is incredibly slow and unscalable at the moment. I'm still working on making it more scalable and faster -- hopefully to a point where it would actually be usable for this... but you still get moved off large TLBs, and also have to inevitably do tlb flushing. Again I have not seen any fallbacks to vmalloc in my testing. What we are doing here is mainly to address your theoretical cases that we so far have never seen to be a problem and increase the reliability of allocations of page orders larger than 3 to a usable level. So far I have so far not dared to enable orders larger than 3 by default. Basically, all that shows is that your testing isn't very thorough. 128MB is an order of magnitude *more* memory than some users have. They probably wouldn't be happy with a regression in slab allocator performance either. Or do you have SLUB at a point where performance is comparable to SLAB, and this is just a possible idea for more performance? AFAICT SLUBs performance is superior to SLAB in most cases and it was like that from the beginning. I am still concerned about several corner cases though (I think most of them are going to be addressed by the per cpu patches in mm). Having a comparable or larger amount of per cpu objects as SLAB is something that also could address some of these concerns and could increase performance much further. OK, so long as it isn't going to depend on using higher order pages, that's fine. (if they help even further as an optional thing, that's fine too. You can turn them on your huge systems and not even bother about adding this vmap fallback -- you won't have me to nag you about these purely theoretical issues). - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [13/17] Virtual compound page freeing in interrupt context
On Wednesday 19 September 2007 13:36, Christoph Lameter wrote: If we are in an interrupt context then simply defer the free via a workqueue. In an interrupt context it is not possible to use vmalloc_addr() to determine the vmalloc address. So add a variant that does that too. Removing a virtual mapping *must* be done with interrupts enabled since tlb_xx functions are called that rely on interrupts for processor to processor communications. These things will clash drastically with my lazy TLB flushing and scalability work. Actually the lazy TLB flushing will provide a natural way to defer unmapping at interrupt time, and the scalability work should make it easier to vmap from interrupt context too, if you really need that.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 mm/page_alloc.c |   23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2007-09-18 20:10:55.000000000 -0700
+++ linux-2.6/mm/page_alloc.c	2007-09-18 20:11:40.000000000 -0700
@@ -1297,7 +1297,12 @@ abort:
 	return NULL;
 }

-static void vcompound_free(void *addr)
+/*
+ * Virtual Compound freeing functions. This is complicated by the vmalloc
+ * layer not being able to free virtual allocations when interrupts are
+ * disabled. So we defer the frees via a workqueue if necessary.
+ */
+static void __vcompound_free(void *addr)
 {
 	struct page **pages = vunmap(addr);
 	int i;
@@ -1320,6 +1325,22 @@ static void vcompound_free(void *addr)
 	kfree(pages);
 }

+static void vcompound_free_work(struct work_struct *w)
+{
+	__vcompound_free((void *)w);
+}
+
+static void vcompound_free(void *addr)
+{
+	if (in_interrupt()) {
+		struct work_struct *w = addr;
+
+		INIT_WORK(w, vcompound_free_work);
+		schedule_work(w);
+	} else
+		__vcompound_free(addr);
+}
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
*/ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 19 September 2007 04:30, Linus Torvalds wrote: On Tue, 18 Sep 2007, Nick Piggin wrote: ROFL! Yeah of course, how could I have forgotten about our trusty OOM killer as the solution to the fragmentation problem? It would only have been funnier if you had said to reboot every so often when memory gets fragmented :) Can we please stop this *idiotic* thread. Nick, you and some others seem to be arguing based on a totally flawed base, namely: - we can guarantee anything at all in the VM - we even care about the 16kB blocksize - second-class citizenry is bad The fact is, *none* of those things are true. The VM doesn't guarantee anything, and is already very much about statistics in many places. You seem to be arguing as if Christoph was introducing something new and unacceptable, when it's largely just more of the same. I will stop this idiotic thread. However, at the VM and/or vm/fs things we had, I was happy enough for this thing of Christoph's to get merged. Actually I even didn't care if it had mmap support, so long as it solved their problem. But a solution to the general problem of VM and IO scalability, it is not. IMO. And the fact is, nobody but SGI customers would ever want the 16kB blocksize. IOW - NONE OF THIS MATTERS! Maybe. Maybe not. Can you guys stop this inane thread already, or at least take it private between you guys, instead of forcing everybody else to listen in on your flamefest. Will do. Sorry. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tuesday 18 September 2007 08:00, Christoph Lameter wrote: On Sun, 16 Sep 2007, Nick Piggin wrote: I don't know how it would prevent fragmentation from building up anyway. It's commonly the case that potentially unmovable objects are allowed to fill up all of ram (dentries, inodes, etc). Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from ZONE_MOVABLE and thus the memory that can be allocated for them is limited. Why would ZONE_MOVABLE require that movable objects should be moved out of the way for unmovable ones? It never _has_ any unmovable objects in it. Quite obviously we were not talking about reserve zones. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tuesday 18 September 2007 08:21, Christoph Lameter wrote: On Sun, 16 Sep 2007, Nick Piggin wrote: So if you argue that vmap is a downside, then please tell me how you consider the -ENOMEM of your approach to be better? That is again pretty undifferentiated. Are we talking about low page In general. There is no -ENOMEM approach. Lower order page allocation (< PAGE_ALLOC_COSTLY_ORDER) will reclaim and in the worst case the OOM killer will be activated. ROFL! Yeah of course, how could I have forgotten about our trusty OOM killer as the solution to the fragmentation problem? It would only have been funnier if you had said to reboot every so often when memory gets fragmented :) That is the nature of the failures that we saw early in the year when this was first merged into mm. With the ZONE_MOVABLE you can remove the unmovable objects into a defined pool then higher order success rates become reasonable. OK, if you rely on reserve pools, then it is not 1st class support and hence it is a non-solution to VM and IO scalability problems. ZONE_MOVABLE creates two memory pools in a machine. One of them is for movable and one for unmovable. That is in 2.6.23. So 2.6.23 has no first class support for order 0 pages? What? If, by special software layer, you mean the vmap/vunmap support in fsblock, let's see... that's probably all of a hundred or two lines. Contrast that with anti-fragmentation, lumpy reclaim, higher order pagecache and its new special mmap layer... Hmm, seems like a no brainer to me. You really still want to pursue the extra layer argument as a point against fsblock here? Yes sure. Your code could not live without these approaches. Without the Actually: your code is the one that relies on higher order allocations. Now you're trying to turn that into an argument against fsblock? fsblock also needs contiguous pages in order to have a beneficial effect that we seem to be looking for. Keyword: relies.
antifragmentation measures your fsblock code would not be very successful in getting the larger contiguous segments you need to improve performance. Completely wrong. *I* don't need to do any of that to improve performance. Actually the VM is well tuned for order-0 pages, and so seeing as I have sane hardware, 4K pagecache works beautifully for me. Sure the system works fine as is. Not sure why we would need fsblock then. Large block filesystem. (There is no new mmap layer, the higher order pagecache is simply the old API with set_blocksize expanded). Yes you add another layer in the userspace mapping code to handle higher order pagecache. That would imply a new API or something? I do not see it. I was not implying a new API. Why: It is the same approach that you use. Again, rubbish. Ok the logical conclusion from the above is that you think your approach is rubbish. The logical conclusion is that _they are not the same approach_! Is there some way you could cool down a bit? I'm not upset, but what you were saying was rubbish, plain and simple. The amount of times we've gone in circles, I most likely have already explained this, several times, in a more polite manner. And I know you're more than capable of understanding at least the concept behind fsblock, even without time to work through the exact details. What are you expecting me to say, after all this back and forth, when you come up with things like "[fsblock] is not a generic change but special to the block layer", and then claim that fsblock is the same as allocating virtual compound pages with vmalloc as a fallback for higher order allocs. What I will say is that fsblock still has a relatively long way to go, so maybe that's your reason for not looking at it. And yes, when fsblock is in a better state to actually perform useful comparisons with, it will be a much better time to have these debates.
But in that case, just say so :) then I can go away and do more constructive work on it instead of filling people's inboxes. I believe the fsblock approach is the best one, but it's not without problems and complexities, so I'm quite ready for it to be proven incorrect, not performant, or otherwise rejected. I'm going on holiday for 2 weeks. I'll try to stay away from email, and particularly this thread. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tuesday 18 September 2007 08:05, Christoph Lameter wrote: On Sun, 16 Sep 2007, Nick Piggin wrote: fsblock doesn't need any of those hacks, of course. Nor does mine for the low orders that we are considering. For order > MAX_ORDER this is unavoidable since the page allocator cannot manage such large pages. It can be used for lower order if there are issues (that I have not seen yet). Or we can just avoid all doubt (and doesn't have arbitrary limitations according to what you think might be reasonable or how well the system actually behaves). We can avoid all doubt in this patchset as well by adding support for fallback to a vmalloced compound page. How would you do a vmapped fallback in your patchset? How would you keep track of pages 2..N if they don't exist in the radix tree? What if they don't even exist in the kernel's linear mapping? It seems you would also require more special casing in the fault path and special casing in the block layer to do this. It's not a trivial problem you can just brush away by handwaving. Let's see... you could add another field in struct page to store the vmap virtual address, and set a new flag to indicate that constituent page N can be found via vmalloc_to_page(page->vaddr + N*PAGE_SIZE). Then add more special casing to the block layer and fault path etc. to handle these new non-contiguous compound pages. I guess you must have thought about it much harder than the 2 minutes I just did then, so you must have a much nicer solution... But even so, you're still trying very hard to avoid touching the filesystems or buffer layer while advocating instead to squeeze the complexity out into the vm and block layer. I don't agree that is the right thing to do. Sure it is _easier_, because we know the VM. I don't argue that fsblock large block support is trivial. But you are first asserting that it is too complicated and then trying to address one of the issues it solves by introducing complexity elsewhere.
Re: 2.6.22.6: kernel BUG at fs/locks.c:171
On Saturday 15 September 2007 20:22, Soeren Sonnenburg wrote: On Sat, 2007-09-15 at 09:47 +, Soeren Sonnenburg wrote: Memtest did not find anything after 16 passes so I finally stopped it applied your patch and used CONFIG_DEBUG_SLAB=y CONFIG_DEBUG_SLAB_LEAK=y and booted into the new kernel. A few hours later the machine hung (due to nmi watchdog rebooted), so I restarted and disabled the watchdog and while compiling a kernel with a ``more minimal'' config I got this (not sure whether this is related/the cause .../ note that I don't use a swapfile/partition). I would need more guidance on what to try now... Thanks! Soeren swap_dup: Bad swap file entry 28c8af9d Hmm, this is another telltale symptom of either bad hardware or a memory scribbling bug. VM: killing process cc1 Eeek! page_mapcount(page) went negative! (-1) page pfn = 36233 page-flags = 4834 page-count = 2 page-mapping = c1cfed14 vma-vm_ops = run_init_process+0x3feff000/0x14 And these are probably related (it's just gone off and started performing VM operations on the wrong page...). Had you been using the dvb card since rebooting when you saw these messages come up? What happens if you remove the card from the system? [ cut here ] kernel BUG at mm/rmap.c:628! 
invalid opcode: [#1] Modules linked in: ipt_iprange ipt_REDIRECT capi kernelcapi capifs ipt_REJECT xt_tcpudp xt_state xt_limit ipt_LOG ipt_MASQUERADE iptable_mangle iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables b44 ohci1394 ieee1394 nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lcd tda827x saa7134_dvb dvb_pll video_buf_dvb tda1004x tuner ves1820 usb_storage usblp budget_ci budget_core saa7134 compat_ioctl32 dvb_ttpci dvb_core saa7146_vv video_buf saa7146 ttpci_eeprom ir_kbd_i2c videodev v4l2_common v4l1_compat ir_common via_agp agpgart CPU:0 EIP:0060:[c0144487]Not tainted VLI EFLAGS: 00010246 (2.6.22.6 #2) EIP is at page_remove_rmap+0xd4/0x101 eax: ebx: c16c4660 ecx: edx: esi: d4570b30 edi: d6560a78 ebp: b740 esp: d6265eac ds: 007b es: 007b fs: gs: ss: 0068 Process cc1 (pid: 26095, ti=d6264000 task=d67af5b0 task.ti=d6264000) Stack: c0422e26 c1cfed14 c16c4660 b729e000 c013f5b8 36233cce d4570b30 d6265f20 0001 f4ffcb70 f483a3b8 c04f44b8 f4ffcb70 00303ff4 b7c18000 d6265f20 f4a8c510 f483a3b8 0009 Call Trace: [c013f5b8] unmap_vmas+0x23f/0x404 [c0141c09] exit_mmap+0x5f/0xc9 [c011923a] mmput+0x1b/0x5e [c011cf97] do_exit+0x1a0/0x606 [c01135f8] do_page_fault+0x49c/0x518 [c011e340] __do_softirq+0x35/0x75 [c011315c] do_page_fault+0x0/0x518 [c039aada] error_code+0x6a/0x70 === Code: c0 74 0d 8b 50 08 b8 56 2e 42 c0 e8 ac f4 fe ff 8b 46 48 85 c0 74 14 8b 40 10 85 c0 74 0d 8b 50 2c b8 75 2e 42 c0 e8 91 f4 fe ff 0f 0b eb fe 8b 53 10 8b 03 83 e2 01 c1 e8 1e f7 da 83 c2 04 69 EIP: [c0144487] page_remove_rmap+0xd4/0x101 SS:ESP 0068:d6265eac Fixing recursive fault but reboot is needed! Hmmhh, so now I rebooted and again tried to $ make the new kernel which again triggered this(?) BUG: Any ideas? Soeren. Eeek! page_mapcount(page) went negative! (-1) page pfn = 18722 page-flags = 4000 page-count = 1 page-mapping = vma-vm_ops = run_init_process+0x3feff000/0x14 [ cut here ] kernel BUG at mm/rmap.c:628! 
invalid opcode: [#1] Modules linked in: ipt_iprange ipt_REDIRECT capi kernelcapi capifs ipt_REJECT xt_tcpudp xt_state xt_limit ipt_LOG ipt_MASQUERADE iptable_mangle iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_t CPU:0 EIP:0060:[c0144487]Not tainted VLI EFLAGS: 00010246 (2.6.22.6 #2) EIP is at page_remove_rmap+0xd4/0x101 eax: ebx: c130e440 ecx: edx: esi: f438b510 edi: f3328ac8 ebp: c130e440 esp: f28d5eec ds: 007b es: 007b fs: gs: 0033 ss: 0068 Process cc1 (pid: 17957, ti=f28d4000 task=f60bb0d0 task.ti=f28d4000) Stack: c0422e26 f3328ac8 0002 c013f185 b76b2000 f438b510 f43013b8 c1a7c640 1879 b76b2000 f3328ac8 f438b510 c014021d f3328ac8 f4360b74 f43013f8 1879 00100073 b76b2000 f43013b8 f4360b74 0100 f28d5f90 Call Trace: [c013f185] do_wp_page+0x28a/0x35c [c014021d] __handle_mm_fault+0x626/0x6a4 [c0113368] do_page_fault+0x20c/0x518 [c011315c] do_page_fault+0x0/0x518 [c039aada] error_code+0x6a/0x70 === Code: c0 74 0d 8b 50 08 b8 56 2e 42 c0 e8 ac f4 fe ff 8b 46 48 85 c0 74 14 8b 40 10 85 c0 74 0d 8b 50 2c b8 75 2e 42 c0 e8 91 f4 fe ff 0f 0b eb fe 8b 53 10 8b 03 83 e2 01 c1 e8 1e f7 da 83 c2 04 69 EIP: [c0144487] page_remove_rmap+0xd4/0x101 SS:ESP 0068:f28d5eec Eeek! page_mapcount(page) went negative! (-2) page pfn = 18722 page-flags = 4004
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Saturday 15 September 2007 03:52, Christoph Lameter wrote: On Fri, 14 Sep 2007, Nick Piggin wrote: [*] ok, this isn't quite true because if you can actually put a hard limit on unmovable allocations then anti-frag will fundamentally help -- get back to me on that when you get patches to move most of the obvious ones. We have this hard limit using ZONE_MOVABLE in 2.6.23. So we're back to 2nd class support. 2nd class support for me means a feature that is not enabled by default but that can be enabled in order to increase performance. The 2nd class support is there because we are not yet sure about the maturity of the memory allocation methods. I'd rather an approach that does not require all these hacks. Reserve pools as handled (by the not yet available) large page pool patches (which again has altogether another purpose) are not a limit. The reserve pools are used to provide a minimum of higher order pages that is not broken down in order to ensure that a minimum number of the desired order of pages is even available in your worst case scenario. Mainly I think that is needed during the period when memory defragmentation is still under development. fsblock doesn't need any of those hacks, of course. Nor does mine for the low orders that we are considering. For order MAX_ORDER this is unavoidable since the page allocator cannot manage such large pages. It can be used for lower order if there are issues (that I have not seen yet). Or we can just avoid all doubt (and doesn't have arbitrary limitations according to what you think might be reasonable or how well the system actually behaves).
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Saturday 15 September 2007 04:08, Christoph Lameter wrote: On Fri, 14 Sep 2007, Nick Piggin wrote: However fsblock can do everything that higher order pagecache can do in terms of avoiding vmap and giving contiguous memory to block devices by opportunistically allocating higher orders of pages, and falling back to vmap if they cannot be satisfied. fsblock is restricted to the page cache and cannot be used in other contexts where subsystems can benefit from larger linear memory. Unless you believe higher order pagecache is not restricted to the pagecache, can we please just stick on topic of fsblock vs higher order pagecache? So if you argue that vmap is a downside, then please tell me how you consider the -ENOMEM of your approach to be better? That is again pretty undifferentiated. Are we talking about low page In general. orders? There we will reclaim all of the reclaimable memory before getting an -ENOMEM. Given the quantities of pages on today's machines--a 1 TB machine has 256 million 4k pages--and the unmovable ratios we see today it would require a very strange setup to get an allocation failure while still being able to allocate order 0 pages. So much handwaving. 1TB machines without very strange setups (where very strange is something arbitrarily defined by you) I guess make up 0.001% of Linux installations. With the ZONE_MOVABLE you can remove the unmovable objects into a defined pool then higher order success rates become reasonable. OK, if you rely on reserve pools, then it is not 1st class support and hence it is a non-solution to VM and IO scalability problems. If, by special software layer, you mean the vmap/vunmap support in fsblock, let's see... that's probably all of a hundred or two lines. Contrast that with anti-fragmentation, lumpy reclaim, higher order pagecache and its new special mmap layer... Hmm, seems like a no brainer to me. You really still want to pursue the extra layer argument as a point against fsblock here? Yes sure.
Your code could not live without these approaches. Without the Actually: your code is the one that relies on higher order allocations. Now you're trying to turn that into an argument against fsblock? antifragmentation measures your fsblock code would not be very successful in getting the larger contiguous segments you need to improve performance. Completely wrong. *I* don't need to do any of that to improve performance. Actually the VM is well tuned for order-0 pages, and so seeing as I have sane hardware, 4K pagecache works beautifully for me. My point was this: fsblock does not preclude your using such measures to fix the performance of your hardware that's broken with 4K pages. And it would allow higher order allocations to fail gracefully. (There is no new mmap layer, the higher order pagecache is simply the old API with set_blocksize expanded). Yes you add another layer in the userspace mapping code to handle higher order pagecache. Of course I wouldn't state that. On the contrary, I categorically state that I have already solved it :) Well then I guess that you have not read the requirements... I'm not talking about solving your problem of poor hardware. I'm talking about supporting higher order block sizes in the kernel. Because it has already been rejected in another form and adds more You have rejected it. But they are bogus reasons, as I showed above. That's not me. I am working on this because many of the filesystem people have repeatedly asked me to do this. I am no expert on filesystems. Yes it is you. You brought up reasons in this thread and I said why I thought they were bogus. If you think you can now forget about my shooting them down by saying you aren't an expert in filesystems, then you shouldn't have brought them up in the first place. Either stand by your arguments or don't. You also describe some other real (if lesser) issues like number of page structs to manage in the pagecache. But this is hardly enough to reject my patch now...
for every downside you can point out in my approach, I can point out one in yours. - fsblock doesn't require changes to virtual memory layer Therefore it is not a generic change but special to the block layer. So other subsystems still have to deal with the single page issues on their own. Rubbish. fsblock doesn't touch a single line in the block layer. Maybe we could get to something like a hybrid that avoids some of these issues? Add support so something like a virtual compound page can be handled transparently in the filesystem layer with special casing if such a beast reaches the block layer? That's conceptually much worse, IMO. Why: It is the same approach that you use. Again, rubbish. If it is barely ever used and satisfies your concern then I am fine with it. Right below this line is where I told you I am _not_ fine with it. And practically worse as well: vmap space
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Monday 17 September 2007 04:13, Mel Gorman wrote: On (15/09/07 14:14), Goswin von Brederlow didst pronounce: I keep coming back to the fact that movable objects should be moved out of the way for unmovable ones. Anything else just allows fragmentation to build up. This is easily achieved, just really really expensive because of the amount of copying that would have to take place. It would also compel that min_free_kbytes be at least one free PAGEBLOCK_NR_PAGES and likely MIGRATE_TYPES * PAGEBLOCK_NR_PAGES to reduce excessive copying. That is a lot of free memory to keep around which is why fragmentation avoidance doesn't do it. I don't know how it would prevent fragmentation from building up anyway. It's commonly the case that potentially unmovable objects are allowed to fill up all of ram (dentries, inodes, etc). And of course, if you craft your exploit nicely with help from higher ordered unmovable memory (eg. mm structs or unix sockets), then you don't even need to fill all memory with unmovables before you can have them take over all groups.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Monday 17 September 2007 14:07, David Chinner wrote: On Fri, Sep 14, 2007 at 06:48:55AM +1000, Nick Piggin wrote: OK, the vunmap batching code wipes your TLB flushing and IPIs off the table. Diffstat below, but the TLB portions are here (besides that _everything_ is probably lower due to less TLB misses caused by the TLB flushing): -170 -99.4% sn2_send_IPI -343 -100.0% sn_send_IPI_phys -17911 -99.9% smp_call_function Total performance went up by 30% on a 64-way system (248 seconds to 172 seconds to run parallel finds over different huge directories). Good start, Nick ;) I didn't have the chance to test against a 16K directory block size to find the optimal performance, but it is something I will do (I'm sure it will be still a _lot_ faster than 172 seconds :)). 23012 54790.5% _read_lock 9427 329.0% __get_vm_area_node 5792 0.0% __find_vm_area 1590 53000.0% __vunmap _read_lock? I thought vmap() and vunmap() only took the vmlist_lock in write mode. Yeah, it is a slight change... the lazy vunmap only has to take it for read. In practice, I'm not sure that it helps a great deal because everything else still takes the lock for write. But that explains why it's popped up in the profile. Next I have some patches to scale the vmap locks and data structures better, but they're not quite ready yet. This looks like it should result in a further speedup of several times when combined with the TLB flushing reductions here... Sounds promising - when you have patches that work, send them my way and I'll run some tests on them. Still away from home (for the next 2 weeks), so I'll be going a bit slow :P I'm thinking about various scalable locking schemes and I'll definitely ping you when I've made a bit of progress. Thanks, Nick
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 13 September 2007 09:06, Christoph Lameter wrote: On Wed, 12 Sep 2007, Nick Piggin wrote: So lumpy reclaim does not change my formula nor significantly help against a fragmentation attack. AFAIKS. Lumpy reclaim improves the situation significantly because the overwhelming majority of allocations during the lifetime of a system are movable and thus it is able to opportunistically restore the availability of higher order pages by reclaiming neighboring pages. I'm talking about non movable allocations. [*] ok, this isn't quite true because if you can actually put a hard limit on unmovable allocations then anti-frag will fundamentally help -- get back to me on that when you get patches to move most of the obvious ones. We have this hard limit using ZONE_MOVABLE in 2.6.23. So we're back to 2nd class support. Sure, and I pointed out the theoretical figure for 64K pages as well. Is that figure not problematic to you? Where do you draw the limit for what is acceptable? Why? What happens with tiny memory machines where a reserve or even the anti-frag patches may not be acceptable and/or work very well? When do you require reserve pools? Why are reserve pools acceptable for first-class support of filesystems when it has very loudly been made a known policy decision by Linus in the past (and for some valid reasons) that we should not put limits on the sizes of caches in the kernel. 64K pages may be problematic because it is above the PAGE_ORDER_COSTLY in 2.6.23. 32K is currently much safer because lumpy reclaim can restore these and does so on my systems. I expect the situation for 64K pages to improve when more of Mel's patches go in. We have long term experience with 32k sized allocation through Andrew's tree. Reserve pools as handled (by the not yet available) large page pool patches (which again has altogether another purpose) are not a limit.
The reserve pools are used to provide a minimum of higher order pages that is not broken down in order to ensure that a minimum number of the desired order of pages is even available in your worst case scenario. Mainly I think that is needed during the period when memory defragmentation is still under development. fsblock doesn't need any of those hacks, of course.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 13 September 2007 12:01, Nick Piggin wrote: On Thursday 13 September 2007 23:03, David Chinner wrote: Then just do operations on directories with lots of files in them (tens of thousands). Every directory operation will require at least one vmap in this situation - e.g. a traversal will result in lots and lots of blocks being read that will require vmap() for every directory block read from disk and an unmap almost immediately afterwards when the reference is dropped Ah, wow, thanks: I can reproduce it. OK, the vunmap batching code wipes your TLB flushing and IPIs off the table. Diffstat below, but the TLB portions are here (besides that _everything_ is probably lower due to less TLB misses caused by the TLB flushing): -170 -99.4% sn2_send_IPI -343 -100.0% sn_send_IPI_phys -17911 -99.9% smp_call_function Total performance went up by 30% on a 64-way system (248 seconds to 172 seconds to run parallel finds over different huge directories). 23012 54790.5% _read_lock 9427 329.0% __get_vm_area_node 5792 0.0% __find_vm_area 1590 53000.0% __vunmap 10726.0% _spin_lock 74 119.4% _xfs_buf_find 58 0.0% __unmap_kernel_range 5336.6% kmem_zone_alloc -129 -100.0% pio_phys_write_mmr -144 -100.0% unmap_kernel_range -170 -99.4% sn2_send_IPI -233 -59.1% kfree -266 -100.0% find_next_bit -343 -100.0% sn_send_IPI_phys -564 -19.9% xfs_iget_core -1946 -100.0% remove_vm_area -17911 -99.9% smp_call_function -62726-7.2% _write_lock -438360 -64.2% default_idle -482631 -30.4% total Next I have some patches to scale the vmap locks and data structures better, but they're not quite ready yet. This looks like it should result in a further speedup of several times when combined with the TLB flushing reductions here...
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 13 September 2007 09:17, Christoph Lameter wrote: On Wed, 12 Sep 2007, Nick Piggin wrote: I will still argue that my approach is the better technical solution for large block support than yours, I don't think we made progress on that. And I'm quite sure we agreed at the VM summit not to rely on your patches for VM or IO scalability. The approach has already been tried (see the XFS layer) and found lacking. It is lacking because our vmap algorithms are simplistic to the point of being utterly inadequate for the new requirements. There has not been any fundamental problem revealed (like the fragmentation approach has). However fsblock can do everything that higher order pagecache can do in terms of avoiding vmap and giving contiguous memory to block devices by opportunistically allocating higher orders of pages, and falling back to vmap if they cannot be satisfied. So if you argue that vmap is a downside, then please tell me how you consider the -ENOMEM of your approach to be better? Having a fake linear block through vmalloc means that a special software layer must be introduced and we may face special casing in the block / fs layer to check if we have one of these strange vmalloc blocks. I guess you're a little confused. There is nothing fake about the linear address. Filesystems of course need changes (the block layer needs none -- why would it?). But changing APIs to accommodate a better solution is what Linux is about. If, by special software layer, you mean the vmap/vunmap support in fsblock, let's see... that's probably all of a hundred or two lines. Contrast that with anti-fragmentation, lumpy reclaim, higher order pagecache and its new special mmap layer... Hmm, seems like a no brainer to me. You really still want to pursue the extra layer argument as a point against fsblock here? But you just showed in two emails that you don't understand what the problem is.
To reiterate: lumpy reclaim does *not* invalidate my formulae; and antifrag does *not* isolate the issue. I do understand what the problem is. I just do not get what your problem with this is and why you have this drive to demand perfection. We are Oh. I don't think I could explain that if you still don't understand by now. But that's not the main issue: all that I ask is you consider fsblock on technical grounds. working a variety of approaches on the (potential) issue but you categorically state that it cannot be solved. Of course I wouldn't state that. On the contrary, I categorically state that I have already solved it :) But what do you say about viable alternatives that do not have to worry about these unlikely scenarios, full stop? So, why should we not use fs block for higher order page support? Because it has already been rejected in another form and adds more You have rejected it. But they are bogus reasons, as I showed above. You also describe some other real (if lesser) issues like number of page structs to manage in the pagecache. But this is hardly enough to reject my patch now... for every downside you can point out in my approach, I can point out one in yours. - fsblock doesn't require changes to virtual memory layer - fsblock can retain cache of just 4K in a 4K block size file How about those? I know very well how Linus feels about both of them. Maybe we could get to something like a hybrid that avoids some of these issues? Add support so something like a virtual compound page can be handled transparently in the filesystem layer with special casing if such a beast reaches the block layer? That's conceptually much worse, IMO. And practically worse as well: vmap space is limited on 32-bit; fsblock approach can avoid vmap completely in many cases; for two reasons. The fsblock data accessor APIs aren't _that_ bad changes.
They change zero conceptually in the filesystem, are arguably cleaner, and can be essentially nooped if we wanted to stay with a b_data type approach (but they give you that flexibility to replace it with any implementation).
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 13 September 2007 11:49, David Chinner wrote: On Wed, Sep 12, 2007 at 01:27:33AM +1000, Nick Piggin wrote: I just gave 4 things which combined might easily reduce xfs vmap overhead by several orders of magnitude, all without changing much code at all. Patches would be greatly appreciated. You obviously understand this vm code much better than I do, so if it's easy to fix by adding some generic vmap cache thingy, please do. Well, it may not be easy to _fix_, but it's easy to try a few improvements ;) How do I make an image and run a workload that will coerce XFS into doing a significant number of vmaps?
Re: 2.6.22.6: kernel BUG at fs/locks.c:171
On Thursday 13 September 2007 19:20, Soeren Sonnenburg wrote: Dear all, I've just seen this in dmesg on a AMD K7 / kernel 2.6.22.6 machine (config attached). Any ideas / which further information needed ? Thanks for the report. Is it reproduceable? It seems like the locks_free_lock call that's oopsing is coming from __posix_lock_file. The actual function looks fine, but the lock being freed could have been corrupted if there was slab corruption, or a hardware corruption. You could: try running memtest86+ overnight. And try the following patch and turn on slab debugging then try to reproduce the problem. Soeren [ cut here ] kernel BUG at fs/locks.c:171! invalid opcode: [#1] Modules linked in: ipt_iprange ipt_REDIRECT capi kernelcapi capifs ipt_REJECT xt_tcpudp xt_state xt_limit ipt_LOG ipt_MASQUERADE iptable_mangle iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables b44 ohci1394 ieee1394 nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lcd tda827x saa7134_dvb dvb_pll video_buf_dvb tuner tda1004x ves1820 usb_storage usblp saa7134 compat_ioctl32 budget_ci budget_core dvb_ttpci dvb_core saa7146_vv video_buf saa7146 ttpci_eeprom via_agp ir_kbd_i2c videodev v4l2_common v4l1_compat ir_common agpgart CPU:0 EIP:0060:[c0158f59]Not tainted VLI EFLAGS: 00010206 (2.6.22.6 #1) EIP is at locks_free_lock+0xb/0x3b eax: e1d07f9c ebx: e1d07f80 ecx: f5f5e2f0 edx: esi: edi: ebp: esp: da3d7f04 ds: 007b es: 007b fs: gs: 0033 ss: 0068 Process mrtg-load (pid: 19688, ti=da3d6000 task=f5e3a030 task.ti=da3d6000) Stack: c015972b 0002 c04889c8 c012b920 f5f5e290 c048541c f0ed3ca0 01485414 e1d07f80 f0f39f58 44ef35f1 f62fc2ac f5f5e290 d23106c0 c015a891 0007 0004 Call Trace: [c015972b] __posix_lock_file+0x44e/0x47f [c012b920] getnstimeofday+0x2b/0xaf [c015a891] fcntl_setlk+0xff/0x1f6 [c011d836] do_setitimer+0xfa/0x226 [c0156b87] sys_fcntl64+0x74/0x85 [c0103ade] syscall_call+0x7/0xb === Code: 74 1b 8b 15 30 93 48 c0 8d 43 04 89 53 04 89 42 04 a3 30 93 48 c0 c7 40 04 30 93 48 c0 5b 5e c3 
53 89 c3 8d 40 1c 39 43 1c 74 04 0f 0b eb fe 8d 43 0c 39 43 0c 74 04 0f 0b eb fe 8d 43 04 39 43 EIP: [c0158f59] locks_free_lock+0xb/0x3b SS:ESP 0068:da3d7f04 BUG: unable to handle kernel paging request at virtual address 9ee420b0 printing eip: c014ab7d *pde = Oops: 0002 [#2] Modules linked in: ipt_iprange ipt_REDIRECT capi kernelcapi capifs ipt_REJECT xt_tcpudp xt_state xt_limit ipt_LOG ipt_MASQUERADE iptable_mangle iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables b44 ohci1394 ieee1394 nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lcd tda827x saa7134_dvb dvb_pll video_buf_dvb tuner tda1004x ves1820 usb_storage usblp saa7134 compat_ioctl32 budget_ci budget_core dvb_ttpci dvb_core saa7146_vv video_buf saa7146 ttpci_eeprom via_agp ir_kbd_i2c videodev v4l2_common v4l1_compat ir_common agpgart CPU:0 EIP:0060:[c014ab7d]Not tainted VLI EFLAGS: 00010082 (2.6.22.6 #1) EIP is at free_block+0x61/0xfb eax: a75b2c19 ebx: c1cf6c10 ecx: e1d070c4 edx: 9ee420ac esi: e1d07000 edi: dfde6960 ebp: dfde7620 esp: dfd87f44 ds: 007b es: 007b fs: gs: ss: 0068 Process events/0 (pid: 4, ti=dfd86000 task=dfdc4a50 task.ti=dfd86000) Stack: 0012 0018 c1cf6c10 c1cf6c10 0018 c1cf6c00 dfde7620 c014ac86 dfde6960 dfde7620 c0521d20 c014b869 dfde69e0 c0521d20 c014b827 c0125955 dfdc4b5c 8f0c99c0 Call Trace: [c014ac86] drain_array+0x6f/0x89 [c014b869] cache_reap+0x42/0xde [c014b827] cache_reap+0x0/0xde [c0125955] run_workqueue+0x6b/0xdf [c0125ec7] worker_thread+0x0/0xbd [c0125f79] worker_thread+0xb2/0xbd [c0128221] autoremove_wake_function+0x0/0x35 [c01280cc] kthread+0x36/0x5a [c0128096] kthread+0x0/0x5a [c0104607] kernel_thread_helper+0x7/0x10 === Code: 8b 02 25 00 40 02 00 3d 00 40 02 00 75 03 8b 52 0c 8b 02 84 c0 78 04 0f 0b eb fe 8b 72 1c 8b 54 24 28 8b 46 04 8b 7c 95 4c 8b 16 89 42 04 89 10 2b 4e 0c c7 06 00 01 10 00 c7 46 04 00 02 20 00 EIP: [c014ab7d] free_block+0x61/0xfb SS:ESP 0068:dfd87f44 [ cut here ] kernel BUG at fs/locks.c:171! 
invalid opcode: [#3] Modules linked in: ipt_iprange ipt_REDIRECT capi kernelcapi capifs ipt_REJECT xt_tcpudp xt_state xt_limit ipt_LOG ipt_MASQUERADE iptable_mangle iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables b44 ohci1394 ieee1394 nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lcd tda827x saa7134_dvb dvb_pll video_buf_dvb tuner tda1004x ves1820 usb_storage usblp saa7134 compat_ioctl32 budget_ci budget_core dvb_ttpci dvb_core saa7146_vv video_buf saa7146 ttpci_eeprom via_agp ir_kbd_i2c videodev v4l2_common v4l1_compat ir_common agpgart CPU:
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 13 September 2007 23:03, David Chinner wrote: On Thu, Sep 13, 2007 at 03:23:21AM +1000, Nick Piggin wrote: Well, it may not be easy to _fix_, but it's easy to try a few improvements ;) How do I make an image and run a workload that will coerce XFS into doing a significant number of vmaps? # mkfs.xfs -n size=16384 dev to create a filesystem with a 16k directory block size on a 4k page machine. Then just do operations on directories with lots of files in them (tens of thousands). Every directory operation will require at least one vmap in this situation - e.g. a traversal will result in lots and lots of blocks being read that will require vmap() for every directory block read from disk and an unmap almost immediately afterwards when the reference is dropped Ah, wow, thanks: I can reproduce it.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 11:49, David Chinner wrote: On Tue, Sep 11, 2007 at 04:00:17PM +1000, Nick Piggin wrote: OTOH, I'm not sure how much buy-in there was from the filesystems guys. Particularly Christoph H and XFS (which is strange because they already do vmapping in places). I think they use vmapping because they have to, not because they want to. They might be a lot happier with fsblock if it used contiguous pages for large blocks whenever possible - I don't know for sure. The metadata accessors they might be unhappy with because it's inconvenient but as Christoph Hellwig pointed out at VM/FS, the filesystems who really care will convert. Sure, they would rather not to. But there are also a lot of ways you can improve vmap more than what XFS does (or probably what darwin does) (more persistence for cached objects, and batched invalidates for example). XFS already has persistence across the object life time (which can be many tens of seconds for a frequently used buffer) But you don't do a very good job. When you go above 64 cached mappings, you purge _all_ of them. fsblock's vmap cache can have a much higher number (if you want), and purging can just unmap a batch which is decided by a simple LRU (thus important metadata gets saved). and it also does batched unmapping of objects as well. It also could do a lot better at unmapping. Currently you're just calling vunmap a lot of times in sequence. That still requires global IPIs and TLB flushing every time. This simple patch should easily be able to reduce that number by 2 or 3 orders of magnitude (maybe more on big systems). http://www.mail-archive.com/[EMAIL PROTECTED]/msg03956.html vmap area locking and data structures could also be made a lot better quite easily, I suspect. 
There are also a lot of trivial things you can do to make a lot of those accesses not require vmaps (and less trivial things, but even such things as binary searches over multiple pages should be quite possible with a bit of logic). Yes, we already do many of these things (via xfs_buf_offset()), but that is not good enough for something like a memcpy that spans multiple pages in a large block (think btree block compaction, splits and recombines). fsblock_memcpy(fsblock *src, int soff, fsblock *dst, int doff, int size); ? IOWs, we already play these vmap harm-minimisation games in the places where we can, but still the overhead is high and something we'd prefer to be able to avoid. I don't think you've looked nearly far enough with all this low hanging fruit. I just gave 4 things which combined might easily reduce xfs vmap overhead by several orders of magnitude, all without changing much code at all.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 11:49, David Chinner wrote: On Tue, Sep 11, 2007 at 04:00:17PM +1000, Nick Piggin wrote: OTOH, I'm not sure how much buy-in there was from the filesystems guys. Particularly Christoph H and XFS (which is strange because they already do vmapping in places). I think they use vmapping because they have to, not because they want to. They might be a lot happier with fsblock if it used contiguous pages for large blocks whenever possible - I don't know for sure. The metadata accessors they might be unhappy with because it's inconvenient but as Christoph Hellwig pointed out at VM/FS, the filesystems who really care will convert. Sure, they would rather not to. But there are also a lot of ways you can improve vmap more than what XFS does (or probably what darwin does) (more persistence for cached objects, and batched invalidates for example). XFS already has persistence across the object life time (which can be many tens of seconds for a frequently used buffer) But you don't do a very good job. When you go above 64 vmaps cached, you purge _all_ of them. fsblock's vmap cache can have a much higher number (if you want), and purging will only unmap a smaller batch, decided by a simple LRU. and it also does batched unmapping of objects as well. It also could do a lot better at unmapping. Currently you're just calling vunmap a lot of times in sequence. That still requires global IPIs and TLB flushing every time. This simple patch should easily be able to reduce that number by 2 or 3 orders of magnitude on 64-bit systems. Maybe more if you increase the batch size. http://www.mail-archive.com/[EMAIL PROTECTED]/msg03956.html vmap area manipulation scalability and search complexity could also be improved quite easily, I suspect. 
There are also a lot of trivial things you can do to make a lot of those accesses not require vmaps (and less trivial things, but even such things as binary searches over multiple pages should be quite possible with a bit of logic). Yes, we already do many of these things (via xfs_buf_offset()), but that is not good enough for something like a memcpy that spans multiple pages in a large block (think btree block compaction, splits and recombines). fsblock_memcpy(fsblock *src, int soff, fsblock *dst, int doff, int size); ? IOWs, we already play these vmap harm-minimisation games in the places where we can, but still the overhead is high and something we'd prefer to be able to avoid. I don't think you've looked very far with all this low hanging fruit. The several ways I suggested combined might easily reduce xfs vmap overhead by several orders of magnitude, all without changing much code at all. Can you provide a formula to reproduce these workloads where vmap overhead in XFS is a problem? (huge IO capacity need not be an issue, because I could try reproducing it on a 64-way on ramdisks for example).
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 10:00, Christoph Lameter wrote: On Tue, 11 Sep 2007, Nick Piggin wrote: Yes. I think we differ on our interpretations of okay. In my interpretation, it is not OK to use this patch as a way to solve VM or FS or IO scalability issues, especially not while the alternative approaches that do _not_ have these problems have not been adequately compared or argued against. We never talked about not being able to solve scalability issues with this patchset. The alternate approaches were discussed at the VM MiniSummit and at the VM/FS meeting. You organized the VM/FS summit. I know you were there and were arguing for your approach. That was not sufficient? I will still argue that my approach is the better technical solution for large block support than yours, I don't think we made progress on that. And I'm quite sure we agreed at the VM summit not to rely on your patches for VM or IO scalability. So even after all this time you do not understand what the fundamental problem is with anti-frag and yet you are happy to waste both our time in endless flamewars telling me how wrong I am about it. We obviously have discussed this before, and the only point of you asking this question seems to be to have me repeat the whole line of argument again? But you just showed in two emails that you don't understand what the problem is. To reiterate: lumpy reclaim does *not* invalidate my formulae; and antifrag does *not* isolate the issue. Forgive me if I'm starting to be rude, Christoph. This is really irritating. Sorry but I have had too much exposure to philosophy. Talk about absolutes like guarantees (that do not exist even for order 0 allocs) and unlikely memory fragmentation scenarios to show that something does not work seems to be getting into some metaphysical realm where there is no data anymore to draw any firm conclusions. Some problems cannot be solved easily or at all within reasonable constraints. I have no problems with that. 
And it is a valid stance to take on the fragmentation issue, and the hash collision problem, etc. But what do you say about viable alternatives that do not have to worry about these unlikely scenarios, full stop? So, why should we not use fsblock for higher order page support? Last times this came up, you dismissed the fsblock approach because it adds another layer of complexity (even though it is simpler than the buffer layer it replaces); and also had some other strange objection like you thought it provides higher order pages or something. And as far as the problems I bring up with your approach, you say they shouldn't be likely, don't really matter, can be fixed as we go along, or want me to show they are a problem in a real world situation!! :P Software reliability is inherently probabilistic, otherwise we would not have things like CRC sums and SHA1 algorithms. It's just a matter of reducing the failure rate sufficiently. The failure rate for lower order allocations (0-3) seems to have been significantly reduced in 2.6.23 through lumpy reclaim. If antifrag measures are not successful (likely for 2M allocs) then other methods (like the large page pools that you skipped when reading my post) will need to be used. I didn't skip that. We have large page pools today. How does that give first class support to those allocations if you have to have memory reserves?
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tuesday 11 September 2007 16:03, Christoph Lameter wrote: 5. VM scalability Large block sizes mean less state keeping for the information being transferred. For a 1TB file one needs to handle 256 million page structs in the VM if one uses 4k page size. A 64k page size reduces that amount to 16 million. If the limitations in existing filesystems are removed then even higher reductions become possible. For very large files like that a page size of 2 MB may be beneficial which will reduce the number of page structs to handle to 512k. The variable nature of the block size means that the size can be tuned at file system creation time for the anticipated needs on a volume. There is a limitation in the VM. Fragmentation. You keep saying this is a solved issue and just assuming you'll be able to fix any cases that come up as they happen. I still don't get the feeling you realise that there is a fundamental fragmentation issue that is unsolvable with Mel's approach. The idea that there even _is_ a bug to fail when higher order pages cannot be allocated was also brushed aside by some people at the vm/fs summit. I don't know if those people had gone through the math about this, but it goes somewhat like this: if you use a 64K page size, you can run out of memory with 93% of your pages free. If you use a 2MB page size, you can fail with 99.8% of your pages still free. That's 64GB of memory used on a 32TB Altix. If you don't consider that is a problem because you don't care about theoretical issues or nobody has reported it from running -mm kernels, then I simply can't argue against that on a technical basis. But I'm totally against introducing known big fundamental problems to the VM at this stage of the kernel. God knows how long it takes to ever fix them in future after they have become pervasive throughout the kernel. IMO the only thing that higher order pagecache is good for is a quick hack for filesystems to support larger block sizes. 
And after seeing it is fairly ugly to support mmap, I'm not even really happy for it to do that. If VM scalability is a problem, then it needs to be addressed in other areas anyway for order-0 pages, and if contiguous pages helps IO scalability or crappy hardware, then there is nothing stopping us from *attempting* to get contiguous memory in the current scheme. Basically, if you're placing your hopes for VM and IO scalability on this, then I think that's a totally broken thing to do and will end up making the kernel worse in the years to come (except maybe on some poor configurations of bad hardware).
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tuesday 11 September 2007 22:12, Jörn Engel wrote: On Tue, 11 September 2007 04:52:19 +1000, Nick Piggin wrote: On Tuesday 11 September 2007 16:03, Christoph Lameter wrote: 5. VM scalability Large block sizes mean less state keeping for the information being transferred. For a 1TB file one needs to handle 256 million page structs in the VM if one uses 4k page size. A 64k page size reduces that amount to 16 million. If the limitation in existing filesystems are removed then even higher reductions become possible. For very large files like that a page size of 2 MB may be beneficial which will reduce the number of page struct to handle to 512k. The variable nature of the block size means that the size can be tuned at file system creation time for the anticipated needs on a volume. The idea that there even _is_ a bug to fail when higher order pages cannot be allocated was also brushed aside by some people at the vm/fs summit. I don't know if those people had gone through the math about this, but it goes somewhat like this: if you use a 64K page size, you can run out of memory with 93% of your pages free. If you use a 2MB page size, you can fail with 99.8% of your pages still free. That's 64GB of memory used on a 32TB Altix. While I agree with your concern, those numbers are quite silly. The They are the theoretical worst case. Obviously with a non trivially sized system and non-DoS workload, they will not be reached. chances of 99.8% of pages being free and the remaining 0.2% being perfectly spread across all 2MB large_pages are lower than those of SHA1 creating a collision. I don't see anyone abandoning git or rsync, so your extreme example clearly is the wrong one. Again, I agree with your concern, even though your example makes it look silly. It is not simply a question of once-off chance for an all-at-once layout to fail in this way. 
Fragmentation slowly builds over time, and especially if you do actually use higher-order pages for a significant number of things (unlike we do today), then the problem will become worse. If you have any part of your workload that is affected by fragmentation, then it will cause unfragmented regions to eventually be used for fragmentation inducing allocations (by definition -- if it did not, eg. then there would be no fragmentation problem and no need for Mel's patches). I don't know what happens as time tends towards infinity, but I don't think it will be good. At millions of allocations per second, how long does it take to produce an unacceptable number of free pages before the ENOMEM condition? Furthermore, what *is* an unacceptable number? I don't know. I am not trying to push this feature in, so the burden is not mine to make sure it is OK. Yes, we already have some of these problems today. Introducing more and worse problems and justifying them because of existing ones is much more silly than my quoting of the numbers. IMO.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 04:31, Mel Gorman wrote: On Tue, 2007-09-11 at 18:47 +0200, Andrea Arcangeli wrote: Hi Mel, Hi, On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote: that increasing the pagesize like what Andrea suggested would lead to internal fragmentation problems. Regrettably we didn't discuss Andrea's The config_page_shift guarantees that kernel stacks or whatever other non-defragmentable allocations go into the same 64k non-defragmentable page. Not like with the SGI design, where an 8k kernel stack could be allocated in the first 64k page, and then another 8k stack could be allocated in the next 64k page, effectively pinning all 64k pages until Nick's worst case scenario triggers. In practice, it's pretty difficult to trigger. Buddy allocators always try and use the smallest possible sized buddy to split. Once a 64K is split for a 4K or 8K allocation, the remainder of that block will be used for other 4K, 8K, 16K, 32K allocations. The situation where multiple 64K blocks get split does not occur. Now, the worst case scenario for your patch is that a hostile process allocates a large amount of memory and mlocks() one 4K page per 64K chunk (this is unlikely in practice I know). The end result is you have many 64KB regions that are now unusable because 4K is pinned in each of them. Your approach is not immune from problems either. To me, only Nick's approach is bullet-proof in the long run. One important thing, I think: in Andrea's case, the memory will be accounted for (eg. we can limit mlock, or work within various memory accounting things). With fragmentation, I suspect it will be much more difficult to do this. It would be another layer of heuristics that will also inevitably go wrong at times if you try to limit how much fragmentation a process can do. Quite likely it is hard to make something even work reasonably well in most cases. 
We can still try to save some memory by defragging the slab a bit, but it's by far *not* required with config_page_shift. No defrag at all is required in fact. You will need some sort of defragmentation to deal with internal fragmentation. It's a very similar problem to blasting away at slab pages and still not being able to free them because objects are in use. Replace slab with large page and object with 4k page and the issues are similar. Well yes, and slab has issues today too with internal fragmentation, targeted reclaim and some (small) higher order allocations too today. But at least with config_page_shift, you don't introduce _new_ sources of problems (eg. coming from pagecache or other allocs). Sure, there are some other things -- like pagecache can actually use up more memory instead -- but there are a number of other positives that Andrea's approach has as well. It is using order-0 pages, which are first class throughout the VM; they have per-cpu queues, and do not require any special reclaim code. They also *actually do* reduce the page management overhead in the general case, unlike higher order pcache. So combined with the accounting issues, I think it is unfair to say that Andrea's is just moving the fragmentation to internal. It has a number of upsides. I have no idea how it will actually behave and perform, mind you ;) Plus there's a cost in defragging and freeing cache... the more you need defrag, the slower the kernel will be. approach in depth. Well it wasn't my fault if we didn't discuss it in depth though. If it's my fault, sorry about that. It wasn't my intention. I think it did get brushed aside a little quickly too (not blaming anyone). Maybe because Linus was hostile. But *if* the idea is that page management overhead has or will become a problem that needs fixing, then neither higher order pagecache, nor (obviously) fsblock, fixes this properly. Andrea's most definitely has the potential to. 
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 04:25, Maxim Levitsky wrote: Hi, I think that the fundamental problem is not fragmentation/large pages/... The problem is the VM itself. The VM doesn't use virtual memory, that's all, that's the problem. Although this will probably be Linux 3.0, I think that the right way to solve all those problems is to make all kernel memory vmalloced (except few areas like kernel .text) It will suddenly remove the buddy allocator, it will remove the need for highmem, it will allow allocating any amount of memory (for example 4k stacks will be obsolete) It will even allow kernel memory to be swapped to disk. This is the solution, but it is very very hard. I'm not sure that it is too hard. OK it is far from trivial... This is not a new idea though, it has been floated around for a long time (since before Linux I'm sure, although I have no references). There are lots of reasons why such an approach has fundamental performance problems too, however. Your kernel can't use huge tlbs for a lot of memory, you can't find the physical address of a page without walking page tables, defragmenting still has a significant cost in terms of moving pages and flushing TLBs etc. So the train of thought up to now has been that a virtually mapped kernel would be the problem with the VM itself ;) We're actually at a point now where higher order allocations are pretty rare and not a big problem (except with very special cases like hugepages and memory hotplug which can mostly get away with compromises, so we don't want to turn over the kernel just for these). So in my opinion, any increase of the dependence on higher order allocations is simply a bad move until a killer use-case can be found. They move us further away from good behaviour on our assumed ideal of an identity mapped kernel. (I don't actually dislike the idea of virtually mapped kernel. 
Maybe hardware trends will favour that model and there are some potential simple instructions a CPU can implement to help with some of the performance hits. I'm sure it will be looked at again for Linux one day)
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 06:01, Christoph Lameter wrote: On Tue, 11 Sep 2007, Nick Piggin wrote: There is a limitation in the VM. Fragmentation. You keep saying this is a solved issue and just assuming you'll be able to fix any cases that come up as they happen. I still don't get the feeling you realise that there is a fundamental fragmentation issue that is unsolvable with Mel's approach. Well my problem first of all is that you did not read the full message. It discusses that later and provides page pools to address the issue. Secondly you keep FUDding people with lots of theoretical concerns assuming Mel's approaches must fail. If there is an issue (I guess there must be right?) then please give us a concrete case of a failure that we can work against. On the other hand, you ignore the potential failure cases, and ignore the alternatives that do not have such cases.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 06:11, Christoph Lameter wrote: On Tue, 11 Sep 2007, Nick Piggin wrote: It would be interesting to craft an attack. If you knew roughly the layout and size of your dentry slab for example... maybe you could stat a whole lot of files, then open one and keep it open (maybe post the fd to a unix socket or something crazy!) when you think you have filled up a couple of MB worth of them. Repeat the process until your movable zone is gone. Or do the same things with pagetables, or task structs, or radix tree nodes, etc.. these are the kinds of things I worry about (as well as just the gradual natural degradation). I guess you would have to run that without my targeted slab reclaim patchset? Otherwise the slabs that are in the way could be reclaimed and you could not produce your test case. I didn't realise you had patches to move pinned dentries, radix tree nodes, task structs, page tables etc. Did I miss them in your last patchset?
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 06:01, Christoph Lameter wrote: On Tue, 11 Sep 2007, Nick Piggin wrote: There is a limitation in the VM. Fragmentation. You keep saying this is a solved issue and just assuming you'll be able to fix any cases that come up as they happen. I still don't get the feeling you realise that there is a fundamental fragmentation issue that is unsolvable with Mel's approach. Well my problem first of all is that you did not read the full message. It discusses that later and provides page pools to address the issue. Secondly you keep FUDding people with lots of theoretical concerns assuming Mel's approaches must fail. If there is an issue (I guess there must be right?) then please give us a concrete case of a failure that we can work against. And BTW, before you accuse me of FUD, I'm actually talking about the fragmentation issues on which Mel I think mostly agrees with me at this point. Also, do you really have a rational reason why we should just up and accept all these big changes, given that, while there are lots of theoretical issues, the person pointing them out to you hasn't happened to give you a concrete failure case? Oh, and the actual performance benefit is not really even quantified yet, crappy hardware notwithstanding, and neither has a proper evaluation of the alternatives. So... would you drive over a bridge if the engineer had this mindset?
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 06:53, Mel Gorman wrote: On (11/09/07 11:44), Nick Piggin didst pronounce: However, this discussion belongs more with the non-existent-remove-slab patch. Based on what we've seen since the summits, we need a thorough analysis with benchmarks before making a final decision (kernbench, ebizzy, tbench (netpipe if someone has the time/resources), hackbench and maybe sysbench as well as something the filesystem people recommend to get good coverage of the subsystems). True. Aside, it seems I might have been mistaken in saying Christoph is proposing to use higher order allocations to fix the SLUB regression. Anyway, I agree let's not get sidetracked about this here. I'd rather not get side-tracked here. I regret you feel steam-rolled but I think grouping pages by mobility is the right thing to do for better usage of the TLB by the kernel and for improving hugepage support in userspace minimally. We never really did see eye-to-eye but this way, if I'm wrong you get to chuck eggs down the line. No it's a fair point, and even the hugepage allocations alone are a fair point. From the discussions I think it seems like quite probably the right thing to do pragmatically, which is what Linux is about and I hope will result in a better kernel in the end. So I don't have complaints except from the little ivory tower ;) Sure. And some people run workloads where fragmentation is likely never going to be a problem, they are shipping this poorly configured hardware now or soon, so they don't have too much interest in doing it right at this point, rather than doing it *now*. OK, that's a valid reason which is why I don't use the argument that we should do it correctly or never at all. So are we saying the right thing to do is go with fs-block from day 1 once we get it to optimistically use high-order pages? 
I think your concern might be that if this goes in then it'll be harder to justify fsblock in the future because it'll be solving a theoretical problem that takes months to trigger if at all. i.e. The filesystem people will push because apparently large block support as it is solves world peace. Is that accurate? Heh. It's hard to say. I think fsblock could take a while to implement, regardless of high order pages or not. I actually would like to be able to pass down a mandate to say higher order pagecache will never get merged, simply so that these talented people would work on fsblock ;) But that's not my place to say, and I'm actually not arguing that high order pagecache does not have uses (especially as a practical, shorter-term solution which is unintrusive to filesystems). So no, I don't think I'm really going against the basics of what we agreed in Cambridge. But it sounds like it's still being billed as first-order support right off the bat here. OTOH, I'm not sure how much buy-in there was from the filesystems guys. Particularly Christoph H and XFS (which is strange because they already do vmapping in places). I think they use vmapping because they have to, not because they want to. They might be a lot happier with fsblock if it used contiguous pages for large blocks whenever possible - I don't know for sure. The metadata accessors they might be unhappy with because it's inconvenient but as Christoph Hellwig pointed out at VM/FS, the filesystems who really care will convert. Sure, they would rather not to. But there are also a lot of ways you can improve vmap more than what XFS does (or probably what darwin does) (more persistence for cached objects, and batched invalidates for example). There are also a lot of trivial things you can do to make a lot of those accesses not require vmaps (and less trivial things, but even such things as binary searches over multiple pages should be quite possible with a bit of logic). 
It would be interesting to craft an attack. If you knew roughly the layout and size of your dentry slab for example... maybe you could stat a whole lot of files, then open one and keep it open (maybe post the fd to a unix socket or something crazy!) when you think you have filled up a couple of MB worth of them. I might regret saying this, but it would be easier to craft an attack using pagetable pages. It's woefully difficult to do but it's probably doable. I say pagetables because while slub targetted reclaim is on the cards and memory compaction exists for page cache pages, pagetables are currently pinned with no prototype patch existing to deal with them. But even so, you can just hold an open fd in order to pin the dentry you want. My attack would go like this: get the page size and allocation group size for the machine, then get the number of dentries required to fill a slab. Then read in that many dentries and pin one of them. Repeat the process. Even if there is other activity on the system, it seems possible that such a thing will cause some headaches after not too long a time. Some sources of pinned memory are going to be better than others for this of course.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 07:41, Christoph Lameter wrote: On Tue, 11 Sep 2007, Nick Piggin wrote: I think I would have as good a shot as any to write a fragmentation exploit, yes. I think I've given you enough info to do the same, so I'd like to hear a reason why it is not a problem. No you have not explained why the theoretical issues continue to exist given even just considering Lumpy Reclaim in .23 nor what effect the antifrag patchset would have. So how does lumpy reclaim, your slab patches, or anti-frag have much effect on the worst case situation? Or help much against a targetted fragmentation attack? And you have used a 2M pagesize which is irrelevant to this patchset that deals with blocksizes up to 64k. In my experience the use of blocksize <= PAGE_COSTLY_ORDER (32k) is reasonably safe. I used EXACTLY the page sizes that you brought up in your patch description (ie. 64K and 2MB).
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 07:48, Christoph Lameter wrote: On Tue, 11 Sep 2007, Nick Piggin wrote: But that's not my place to say, and I'm actually not arguing that high order pagecache does not have uses (especially as a practical, shorter-term solution which is unintrusive to filesystems). So no, I don't think I'm really going against the basics of what we agreed in Cambridge. But it sounds like it's still being billed as first-order support right off the bat here. Well it seems that we have different interpretations of what was agreed on. My understanding was that the large blocksize patchset was okay provided that I supply an acceptable mmap implementation and put a warning in. Yes. I think we differ on our interpretations of okay. In my interpretation, it is not OK to use this patch as a way to solve VM or FS or IO scalability issues, especially not while the alternative approaches that do _not_ have these problems have not been adequately compared or argued against. But even so, you can just hold an open fd in order to pin the dentry you want. My attack would go like this: get the page size and allocation group size for the machine, then get the number of dentries required to fill a slab. Then read in that many dentries and pin one of them. Repeat the process. Even if there is other activity on the system, it seems possible that such a thing will cause some headaches after not too long a time. Some sources of pinned memory are going to be better than others for this of course, so yeah maybe pagetables will be a bit easier (I don't know). Well even without slab targeted reclaim: Mel's antifrag will sort the dentries into separate blocks of memory and so isolate the issue. So even after all this time you do not understand what the fundamental problem is with anti-frag and yet you are happy to waste both our time in endless flamewars telling me how wrong I am about it. Forgive me if I'm starting to be rude, Christoph. This is really irritating. 
Re: [07/36] Use page_cache_xxx in mm/filemap_xip.c
Christoph Hellwig wrote: On Tue, Aug 28, 2007 at 09:49:38PM +0200, Jörn Engel wrote: On Tue, 28 August 2007 12:05:58 -0700, [EMAIL PROTECTED] wrote: - index = *ppos >> PAGE_CACHE_SHIFT; - offset = *ppos & ~PAGE_CACHE_MASK; + index = page_cache_index(mapping, *ppos); + offset = page_cache_offset(mapping, *ppos); Part of me feels inclined to merge this patch now because it makes the code more readable, even if page_cache_index() is implemented as #define page_cache_index(mapping, pos) ((pos) >> PAGE_CACHE_SHIFT) I know there is little use in yet another global search'n'replace wankfest and Andrew might wash my mouth just for mentioning it. Still, hard to dislike this part of your patch. Yes, I suggested that before. Andrew seems to somehow hate this patchset, but even if we don't get it in, the lowercase macros are much much better than the current PAGE_CACHE_* confusion. I don't mind the change either. The open coded macros are very recognisable, but it isn't hard to have a typo and get one slightly wrong. If it goes upstream now it wouldn't have the mapping argument though, would it? Or the need to replace PAGE_CACHE_SIZE I guess. -- SUSE Labs, Novell Inc.
[rfc] block-based page writeout
Hi, I've always liked the idea of being able to do writeout directly based on block number, rather than the valiant but doomed-to-be-suboptimal heuristics that our current dirty writeout system does, as it is running above the pagecache and doesn't know about poor file layouts, or interaction between multiple files or file and metadata. My idea for block based writeout is to have dirty pagecache get inserted into a data structure based on its (block device, block number), and sorted in order of block number for a given block device. You then have a kernel thread per block device, bdflush, which does the background writeout in order. There are minimal heuristics involved here, best is probably just submit blocks before a particular age in a one-way elevator order (which my patch doesn't quite get right). But it would be easy to do things like detect idle block devices to kick off writeout sooner, or write out _all_ dirty blocks in ascending order (which my patch does -- this trades off timeliness for throughput). I haven't worked out what to do about delayed allocation filesystems, exactly. Perhaps blocks without mappings yet could be given some default heuristic by the filesystem for placement in the dirty data structure, or else some kind of scheme to drop into a scan-by-logical-offset mode to do the allocations contiguously before resuming block based writeout. I haven't thought that far ahead ;) Simple patch here to implement this against fsblock. Sorry for an experimental patch on top of an experimental patch... I seem to need some features in fsblock but not buffer.c to make this work nicely (eg. tighter coupling of dirty state between block and page). Patch is pretty experimental, and has a known use-after-free bug on unmount (I'm not going to get much more time for hacking between now and vm/fs meetup, and I'd like to be able to discuss this if people think it is interesting, so I had to rush). 
Mainly attached as proof of concept, which I don't expect anybody is actually going to use. But comments on the concept would be welcome. I wonder why there hasn't been much noise about doing something like this -- are there some known/obvious problems? The data structure I'm using is a per-block_device rbtree for simplicity now. I perhaps foresee scalability issues with locking, so it may be more appropriate to go with a radix-tree or b-tree or something with a more scalable write-side. I don't think this should be a showstopper, though. Thanks, Nick ---
Index: linux-2.6/fs/fsblock.c
===================================================================
--- linux-2.6.orig/fs/fsblock.c
+++ linux-2.6/fs/fsblock.c
@@ -21,6 +21,9 @@
 #include <linux/gfp.h>
 #include <linux/cache.h>
 #include <linux/profile.h>
+#include <linux/rbtree.h>
+#include <linux/kthread.h>
+#include <linux/delay.h>
 //#include <linux/buffer_head.h> /* too much crap in me */
 extern int try_to_free_buffers(struct page *);
@@ -45,6 +48,15 @@ void __init fsblock_init(void)
 		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD, NULL);
 }
+static void clear_block_dirty(struct fsblock *block);
+
+static int test_and_set_block_dirty(struct fsblock *block);
+
+static void set_block_dirty(struct fsblock *block)
+{
+	test_and_set_block_dirty(block);
+}
+
 static void init_block(struct page *page, struct fsblock *block, unsigned int bits)
 {
 	block->flags = 0;
@@ -472,14 +484,14 @@ int fsblock_set_page_dirty(struct page *
 		for_each_block(block, b) {
 			FSB_BUG_ON(!test_bit(BL_uptodate, &b->flags));
 			if (!test_bit(BL_dirty, &b->flags)) {
-				set_bit(BL_dirty, &b->flags);
+				set_block_dirty(b);
 				ret = 1;
 			}
 		}
 	} else {
 		FSB_BUG_ON(!test_bit(BL_uptodate, &block->flags));
 		if (!test_bit(BL_dirty, &block->flags)) {
-			set_bit(BL_dirty, &block->flags);
+			set_block_dirty(block);
 			ret = 1;
 		}
 	}
@@ -1329,6 +1341,37 @@ static int write_block(struct fsblock *b
 	return submit_block(block, WRITE);
 }
+static int fsblock_test_clear_dirty(struct fsblock *block)
+{
+	int ret = 0;
+	struct page *page = block->page;
+
+	if (test_bit(BL_dirty, &block->flags)) {
+		FSB_BUG_ON(!test_bit(BL_uptodate, &block->flags));
+		clear_block_dirty(block);
+		ret = 1;
+
+		if (fsblock_subpage(block)) {
+			struct fsblock *b;
+			for_each_block(page_blocks(page), b) {
+				if (test_bit(BL_dirty, &b->flags))
+					goto page_dirty;
+			}
+		}
+		if (!fsblock_superpage(block)) {
+			ret = clear_page_dirty_for_io(page);
+
[patch] 2.6.23-rc3: fsblock
Hi, I'm still plugging away at fsblock slowly. Haven't really got around to finishing up any big new features, but there has been a lot of bug fixing and little API changes since the last release. I still think fsblock has merit, and even if a more extent-based approach ends up working better for most things, I think a block-based one will still have its place (either along-side or underneath), and I think fsblock is just much better than buffer_heads. fsblock proper and minix port patches are on kernel.org, because they're pretty big and not well split-up to be posted here. http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fsblock/2.6.23-rc3/ Some notes: mbforget is restored (though very poor implementation for now), and filesystems do not have their underlying metadata unmapped or synched by default for new blocks. They could do this in their get_block function if they want, or otherwise ensure that metadata blocks are not just abandoned with possible dirty data in-flight. Basically my thinking is that it is quite an extra expense that should be optional. Made use of lockless radix-tree. Pages in superpage-blocks are tracked with the pagecache radix-tree, and a fair amount of lookups can be involved. This makes those operations faster and more scalable. I've got rid of RCU completely from the fsblock data structures. fsblock's getblk equivalent must now tolerate blocking (if this turns out to be too limiting, I might be convinced to add back RCU or some kind of spin locking). However I want to get rid of RCU freeing, because a normal use-case (the nobh mode) is expected to have a lot of fsblock structures being allocated and freed as IO is happening, so this needs to be really efficient and RCU can degrade CPU cache properties in these situations. Added a real vmap cache. It is still pretty primitive, but it gets like 99% hit rate in my basic tests, so I didn't bother with anything more fancy yet.
Changed block memory mapping API to only support metadata blocks at this stage. Just makes a few implementation things easier at this point. Added the /proc/sys/vm/fsblock_no_cache tunable, which disables caching of fsblock structures, and actively frees them after IOs etc have finished. TODO - introduce, and make use of, range based aops APIs and remove all the silly locking code for multi-page blocks (as well as do nice multi-block bios). In progress. - write an extent-based block mapping module that filesystems could optionally use to efficiently cache block mappings. In progress. - port some real filesystems. ext2 is in progress, and I like the idea of btrfs, it seems like it would be a lot more manageable than ext3 for me.
Re: [patch][rfc] fs: fix nobh error handling
On Wed, Aug 08, 2007 at 07:39:42AM -0700, Mingming Cao wrote: On Wed, 2007-08-08 at 08:07 -0500, Dave Kleikamp wrote: For jfs's sake, I don't really care if it ever uses nobh again. I originally started using it because I figured the movement was away from buffer heads and jfs seemed just as happy with the nobh functions (after a few bugs were flushed out). I don't think jfs really benefited though. That said, I don't really know who cares about the nobh option in ext3. Actually IBM/LTC use the nobh option in ext3 on our internal kernel development server, to control the consumption of a large amount of low memory space. Fair enough... but you mean to say that page reclaim does not control the consumption of low memory well enough? I.e. what is the exact benefit of using nobh? (I'm not trying to argue to remove it... yet... just interested)
[patch][rfc] fs: fix nobh error handling
Hi, So I seem to have hit some resistance to the idea of removing nobh mode with the new aops patches. I don't need to reiterate my reasons for disliking nobh mode (if the buffer layer is crap, it should be improved rather than circumvented -- see fsblock). But at any rate, this view does not seem to be unanimous, and it is completely sensible to be averse to removing things that exist and somewhat work today. So I've been mulling over the best way to have nobh and the new aops coexist nicely. I think I've got a reasonable idea now... the main problem with nobh mode is that it blissfully throws away most of the buffer state; when an error does happen, it's up shit creek. This becomes more of a problem with error recovery after a short write allowed by the new aop API: the short write is actually going to happen relatively often as part of normal operation, rather than just when the disk subsystem fails. Still, the fix for existing error handling will look basically the same as the fix for short write error handling. So I propose this patch against head in order not to confuse the issues (and to fix the existing bugs in release kernels). If this gets accepted, then I can subsequently look at doing _additional_ patches for the new aops series in -mm. Note: I actually did observe some of these problems by using IO error injection. I wouldn't say the patch is completely stress tested yet, but it did survive the same tests without any file corruption. Nick --- nobh mode error handling is not just pretty slack, it's wrong. One cannot zero out the whole page to ensure new blocks are zeroed, because it just brings the whole page uptodate with zeroes even if that may not be the correct uptodate data. Also, other parts of the page may already contain dirty data which would get lost by zeroing it out. Thirdly, the writeback of zeroes to the new blocks will also erase existing blocks. All these conditions are pagecache and/or filesystem corruption.
The problem comes about because we didn't keep track of which buffers actually are new or old. However it is not enough just to keep only this state, because at the point we start dirtying parts of the page (new blocks, with zeroes), the handling of IO errors becomes impossible without buffers because the page may only be partially uptodate, in which case the page flags alone cannot capture the state of the parts of the page. So allocate all buffers for the page upfront, but leave them unattached so that they don't pick up any other references and can be freed when we're done. If the error path is hit, then zero the new buffers as the regular buffer path does, then attach the buffers to the page so that it can actually be written out correctly and be subject to the normal IO error handling paths. As an upshot, we save 1K of kernel stack on ia64 or powerpc 64K page systems. Signed-off-by: Nick Piggin [EMAIL PROTECTED] --
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2272,51 +2272,53 @@ int nobh_prepare_write(struct page *page
 	struct inode *inode = page->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
 	const unsigned blocksize = 1 << blkbits;
-	struct buffer_head map_bh;
-	struct buffer_head *read_bh[MAX_BUF_PER_PAGE];
+	struct buffer_head *head, *bh;
 	unsigned block_in_page;
-	unsigned block_start;
+	unsigned block_start, block_end;
 	sector_t block_in_file;
 	char *kaddr;
 	int nr_reads = 0;
-	int i;
 	int ret = 0;
 	int is_mapped_to_disk = 1;
+	if (PagePrivate(page))
+		return block_prepare_write(page, from, to, get_block);
+
 	if (PageMappedToDisk(page))
 		return 0;
+	head = alloc_page_buffers(page, blocksize, 1);
+
 	block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
-	map_bh.b_page = page;
 	/*
	 * We loop across all blocks in the page, whether or not they are
	 * part of the affected region. This is so we can discover if the
	 * page is fully mapped-to-disk.
	 */
-	for (block_start = 0, block_in_page = 0;
+	for (block_start = 0, block_in_page = 0, bh = head;
 		block_start < PAGE_CACHE_SIZE;
-		block_in_page++, block_start += blocksize) {
-		unsigned block_end = block_start + blocksize;
+		block_in_page++, block_start += blocksize, bh = bh->b_this_page) {
 		int create;
-		map_bh.b_state = 0;
+		block_end = block_start + blocksize;
+		bh->b_state = 0;
 		create = 1;
 		if (block_start >= to)
 			create = 0;
-		map_bh.b_size = blocksize;
 		ret = get_block(inode, block_in_file + block_in_page,
-				map_bh
Re: [PATCH RFC] extent mapped page cache
On Thu, Jul 26, 2007 at 09:05:15AM -0400, Chris Mason wrote: On Thu, 26 Jul 2007 04:36:39 +0200 Nick Piggin [EMAIL PROTECTED] wrote: [ are state trees a good idea? ] One thing it gains us is finding the start of the cluster. Even if called by kswapd, the state tree allows writepage to find the start of the cluster and send down a big bio (provided I implement trylock to avoid various deadlocks). That's very true, we could potentially also do that with the block extent tree that I want to try with fsblock. If fsblock records an extent of 200MB, and writepage is called on a page in the middle of the extent, how do you walk the radix backwards to find the first dirty up to date page in the range? I mean if we also have a block extent tree between fsblock and the filesystem's get_block (which I think could be a good idea). So you would use that tree to find the block extent that you're in, then use the pagecache tree dirty tag lookup from the start of that block extent (or you could add a backward tag lookup if you just wanted to gather a small range around the given offset). I think (haven't got any code for this yet, mind you). I put the placeholder patches on hold because of a corner case where userland did O_DIRECT from a mmap'd region of the same file (Linus pointed it out to me). Basically my patches had to work in 64k chunks to avoid a deadlock in get_user_pages. With the state tree, I can allow the page to be faulted in but still properly deal with it. Oh right, I didn't think of that one. Would you still have similar issues with the external state tree? I mean, the filesystem doesn't really know why the fault is taken. O_DIRECT read from a file into mmapped memory of the same block in the file is almost hopeless I think. Racing is fine as long as we don't deadlock or expose garbage from disk. Hmm, OK you're probably right. I'll have to see how it looks. I'm sure we can find some river in Cambridge, winner gets to throw Axboe in.
Very noble of you to donate your colleague to such a worthy cause. Jens is always interested in helping solve such debates. It's a fantastic service he provides to the community. ;)
Re: [PATCH RFC] extent mapped page cache
On Wed, Jul 25, 2007 at 10:10:07PM -0400, Chris Mason wrote: On Thu, 26 Jul 2007 03:37:28 +0200 Nick Piggin [EMAIL PROTECTED] wrote: One advantage to the state tree is that it separates the state from the memory being described, allowing a simple kmap style interface that covers subpages, highmem and superpages. I suppose so, although we should have added those interfaces long ago ;) The variants in fsblock are pretty good, and you could always do an arbitrary extent (rather than block) based API using the pagecache tree if it would be helpful. Yes, you could use fsblock for the state bits and make a separate API to map the actual pages. It also more naturally matches the way we want to do IO, making for easy clustering. Well the pagecache tree is used to reasonable effect for that now. OK the code isn't beautiful ;). Granted, this might be an area where the separate state tree ends up being better. We'll see. One thing it gains us is finding the start of the cluster. Even if called by kswapd, the state tree allows writepage to find the start of the cluster and send down a big bio (provided I implement trylock to avoid various deadlocks). That's very true, we could potentially also do that with the block extent tree that I want to try with fsblock. I'm looking at cleaning up some of these aops APIs so hopefully most of the deadlock problems go away. Should be useful to both our efforts. Will post patches hopefully when I get time to finish the draft this weekend. O_DIRECT becomes a special case of readpages and writepages; the memory used for IO just comes from userland instead of the page cache. Could be, although you'll probably also need to teach the mm about the state tree and/or still manipulate the pagecache tree to prevent concurrency? Well, it isn't coded yet, but I should be able to do it from the FS specific ops.
Probably, if you invalidate all the pagecache in the range beforehand you should be able to do it (and I guess you want to do the invalidate anyway). Although, below deadlock issues might still bite somewhere... But isn't the main aim of O_DIRECT to do as little locking and synchronisation with the pagecache as possible? I thought this is why your race fixing patches got put on the back burner (although they did look fairly nice from a correctness POV). I put the placeholder patches on hold because of a corner case where userland did O_DIRECT from a mmap'd region of the same file (Linus pointed it out to me). Basically my patches had to work in 64k chunks to avoid a deadlock in get_user_pages. With the state tree, I can allow the page to be faulted in but still properly deal with it. Oh right, I didn't think of that one. Would you still have similar issues with the external state tree? I mean, the filesystem doesn't really know why the fault is taken. O_DIRECT read from a file into mmapped memory of the same block in the file is almost hopeless I think. Well I'm kind of handwaving when it comes to O_DIRECT ;) It does look like this might be another advantage of the state tree (although you aren't allowed to slow down buffered IO to achieve the locking ;)). ;) The O_DIRECT benefit is a fringe thing. I've long wanted to help clean up that code, but the real point of the patch is to make general usage faster and less complex. If I can't get there, the O_DIRECT stuff doesn't matter. Sure, although unifying code is always a plus so I like that you've got that in mind. The ability to put in additional tracking info like the process that first dirtied a range is also significant. So, I think it is worth trying. Definitely, and I'm glad you are. You haven't converted me yet, but I look forward to finding the best ideas from our two approaches when the patches are further along (ext2 port of fsblock coming along, so we'll be able to have races soon :P).
I'm sure we can find some river in Cambridge, winner gets to throw Axboe in. Very noble of you to donate your colleague to such a worthy cause.
Re: [PATCH RFC] extent mapped page cache
On Wed, Jul 25, 2007 at 08:18:53AM -0400, Chris Mason wrote: On Wed, 25 Jul 2007 04:32:17 +0200 Nick Piggin [EMAIL PROTECTED] wrote: Having another tree to store block state I think is a good idea as I said in the fsblock thread with Dave, but I haven't clicked as to why it is a big advantage to use it to manage pagecache state. (and I can see some possible disadvantages in locking and tree manipulation overhead). Yes, there are definitely costs with the state tree, it will take some careful benchmarking to convince me it is a feasible solution. But, storing all the state in the pages themselves is impossible unless the block size equals the page size. So, we end up with something like fsblock/buffer heads or the state tree. Yep, we have to have something. One advantage to the state tree is that it separates the state from the memory being described, allowing a simple kmap style interface that covers subpages, highmem and superpages. I suppose so, although we should have added those interfaces long ago ;) The variants in fsblock are pretty good, and you could always do an arbitrary extent (rather than block) based API using the pagecache tree if it would be helpful. It also more naturally matches the way we want to do IO, making for easy clustering. Well the pagecache tree is used to reasonable effect for that now. OK the code isn't beautiful ;). Granted, this might be an area where the separate state tree ends up being better. We'll see. O_DIRECT becomes a special case of readpages and writepages; the memory used for IO just comes from userland instead of the page cache. Could be, although you'll probably also need to teach the mm about the state tree and/or still manipulate the pagecache tree to prevent concurrency? But isn't the main aim of O_DIRECT to do as little locking and synchronisation with the pagecache as possible? I thought this is why your race fixing patches got put on the back burner (although they did look fairly nice from a correctness POV).
Well I'm kind of handwaving when it comes to O_DIRECT ;) It does look like this might be another advantage of the state tree (although you aren't allowed to slow down buffered IO to achieve the locking ;)). The ability to put in additional tracking info like the process that first dirtied a range is also significant. So, I think it is worth trying. Definitely, and I'm glad you are. You haven't converted me yet, but I look forward to finding the best ideas from our two approaches when the patches are further along (ext2 port of fsblock coming along, so we'll be able to have races soon :P). Thanks, Nick
[patch 1/4] ext2 convert to new aops fix
These are some write_begin aops conversion fixes for the current -mm tree. I didn't quite get it building, but did compile test at least. Patches all address the same silly bug in my directory pagecache conversions. --- ext2 directory code has a conversion overflow. Spotted by Hugh Dickins. Probably wouldn't hurt anyone in practice unless they are using 4GB directories, but must fix. Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Index: linux-2.6/fs/ext2/dir.c
===================================================================
--- linux-2.6.orig/fs/ext2/dir.c
+++ linux-2.6/fs/ext2/dir.c
@@ -424,7 +424,7 @@ ino_t ext2_inode_by_name(struct inode *
 void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
 		struct page *page, struct inode *inode)
 {
-	loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+	loff_t pos = page_offset(page) +
 			(char *) de - (char *) page_address(page);
 	unsigned len = le16_to_cpu(de->rec_len);
 	int err;
@@ -511,7 +511,7 @@ int ext2_add_link (struct dentry *dentry
 		return -EINVAL;
 got_it:
-	pos = (page->index << PAGE_CACHE_SHIFT) +
+	pos = page_offset(page) +
 		(char*)de - (char*)page_address(page);
 	err = __ext2_write_begin(NULL, page->mapping, pos, rec_len, 0,
 			&page, NULL);
@@ -569,7 +569,7 @@ int ext2_delete_entry (struct ext2_dir_e
 	}
 	if (pde)
 		from = (char*)pde - (char*)page_address(page);
-	pos = (page->index << PAGE_CACHE_SHIFT) + from;
+	pos = page_offset(page) + from;
 	lock_page(page);
 	err = __ext2_write_begin(NULL, page->mapping, pos, to - from, 0,
 			&page, NULL);
[patch 2/4] minix convert to new aops fix
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Index: linux-2.6/fs/minix/dir.c
===================================================================
--- linux-2.6.orig/fs/minix/dir.c
+++ linux-2.6/fs/minix/dir.c
@@ -311,7 +311,7 @@ int minix_delete_entry(struct minix_dir_
 	struct address_space *mapping = page->mapping;
 	struct inode *inode = (struct inode*)mapping->host;
 	char *kaddr = page_address(page);
-	loff_t pos = (page->index << PAGE_CACHE_SHIFT) + (char*)de - kaddr;
+	loff_t pos = page_offset(page) + (char*)de - kaddr;
 	unsigned len = minix_sb(inode->i_sb)->s_dirsize;
 	int err;
@@ -435,7 +435,7 @@ void minix_set_link(struct minix_dir_ent
 	struct address_space *mapping = page->mapping;
 	struct inode *dir = mapping->host;
 	struct minix_sb_info *sbi = minix_sb(dir->i_sb);
-	loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+	loff_t pos = page_offset(page) +
 			(char *)de-(char*)page_address(page);
 	int err;
[patch 3/4] sysv convert to new aops fix
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Index: linux-2.6/fs/sysv/dir.c
===================================================================
--- linux-2.6.orig/fs/sysv/dir.c
+++ linux-2.6/fs/sysv/dir.c
@@ -219,7 +219,7 @@ int sysv_add_link(struct dentry *dentry,
 		return -EINVAL;
 got_it:
-	pos = (page->index << PAGE_CACHE_SHIFT) +
+	pos = page_offset(page) +
 		(char*)de - (char*)page_address(page);
 	lock_page(page);
 	err = __sysv_write_begin(NULL, page->mapping, pos, SYSV_DIRSIZE,
@@ -246,7 +246,7 @@ int sysv_delete_entry(struct sysv_dir_en
 	struct address_space *mapping = page->mapping;
 	struct inode *inode = (struct inode*)mapping->host;
 	char *kaddr = (char*)page_address(page);
-	loff_t pos = (page->index << PAGE_CACHE_SHIFT) + (char *)de - kaddr;
+	loff_t pos = page_offset(page) + (char *)de - kaddr;
 	int err;
 	lock_page(page);
@@ -347,7 +347,7 @@ void sysv_set_link(struct sysv_dir_entry
 {
 	struct address_space *mapping = page->mapping;
 	struct inode *dir = mapping->host;
-	loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+	loff_t pos = page_offset(page) +
 			(char *)de-(char*)page_address(page);
 	int err;
[patch 4/4] ufs convert to new aops fix
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Index: linux-2.6/fs/ufs/dir.c
===================================================================
--- linux-2.6.orig/fs/ufs/dir.c
+++ linux-2.6/fs/ufs/dir.c
@@ -89,7 +89,7 @@ ino_t ufs_inode_by_name(struct inode *di
 void ufs_set_link(struct inode *dir, struct ufs_dir_entry *de,
 		struct page *page, struct inode *inode)
 {
-	loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+	loff_t pos = page_offset(page) +
 			(char *) de - (char *) page_address(page);
 	unsigned len = fs16_to_cpu(dir->i_sb, de->d_reclen);
 	int err;
@@ -379,7 +379,7 @@ int ufs_add_link(struct dentry *dentry,
 		return -EINVAL;
 got_it:
-	pos = (page->index << PAGE_CACHE_SHIFT) +
+	pos = page_offset(page) +
 		(char*)de - (char*)page_address(page);
 	err = __ufs_write_begin(NULL, page->mapping, pos, rec_len,
 			AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
@@ -547,7 +547,7 @@ int ufs_delete_entry(struct inode *inode
 	if (pde)
 		from = (char*)pde - (char*)page_address(page);
-	pos = (page->index << PAGE_CACHE_SHIFT) + from;
+	pos = page_offset(page) + from;
 	lock_page(page);
 	err = __ufs_write_begin(NULL, mapping, pos, to - from,
 			AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
Re: [PATCH RFC] extent mapped page cache
On Tue, Jul 24, 2007 at 07:25:09PM -0400, Chris Mason wrote: On Tue, 24 Jul 2007 23:25:43 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: The tree is a critical part of the patch, but it is also the easiest to rip out and replace. Basically the code stores a range by inserting an object at an index corresponding to the end of the range. Then it does searches by looking forward from the start of the range. More or less any tree that can search and return the first key >= the requested key will work. So, I'd be happy to rip out the tree and replace with something else. Going completely lockless will be tricky, it's something that will need deep thought once the rest of the interface is sane. Just having the other tree and managing it is what makes me a little less positive of this approach, especially using it to store pagecache state when we already have the pagecache tree. Having another tree to store block state I think is a good idea as I said in the fsblock thread with Dave, but I haven't clicked as to why it is a big advantage to use it to manage pagecache state. (and I can see some possible disadvantages in locking and tree manipulation overhead).
Re: block_page_mkwrite? (Re: fault vs invalidate race (Re: -mm merge plans for 2.6.23))
David Chinner wrote: On Thu, Jul 12, 2007 at 10:54:57AM +1000, Nick Piggin wrote: Andrew Morton wrote: The fault-vs-invalidate race fix. I have belatedly learned that these need more work, so their state is uncertain. The more work may turn out being too much for you (although it is nothing exactly tricky that would introduce subtle bugs, it is a fair amount of churn). OK, so does that mean we can finally get the block_page_mkwrite patches merged? i.e.: http://marc.info/?l=linux-kernel&m=117426058311032&w=2 http://marc.info/?l=linux-kernel&m=11742607036&w=2 I've got up-to-date versions of them ready to go and they've been consistently tested thanks to the XFSQA test I wrote for the bug that it fixes. I've been holding them out-of-tree for months now because ->fault was supposed to supersede this interface. Yeah, as I've said, don't hold them back because of me. They are relatively simple enough that I don't see why they couldn't be merged in this window. -- SUSE Labs, Novell Inc.
Re: [RFC] fsblock
On Mon, Jul 09, 2007 at 10:14:06AM -0700, Christoph Lameter wrote: On Sun, 24 Jun 2007, Nick Piggin wrote: Firstly, what is the buffer layer? The buffer layer isn't really a buffer layer as in the buffer cache of unix: the block device cache is unified with the pagecache (in terms of the pagecache, a blkdev file is just like any other, but with a 1:1 mapping between offset and block). I thought that the buffer layer is essentially a method to index to sub section of a page? It converts pagecache addresses to block addresses I guess. The current implementation cannot handle blocks larger than pages, but not because use of larger pages for pagecache was anticipated (likely because it is more work, and the APIs aren't really set up for it). Why rewrite the buffer layer? Lots of people have had a desire to completely rip out the buffer layer, but we can't do that[*] because it does actually serve a useful purpose. Why the bad rap? Because the code is old and crufty, and buffer_head is an awful name. It must be among the oldest code in the core fs/vm, and the main reason is because of the inertia of so many and such complex filesystems. Hmmm I did not notice that yet but then I have not done much work there. Notice what? - data structure size. struct fsblock is 20 bytes on 32-bit, and 40 on 64-bit (could easily be 32 if we can have int bitops). Compare this to around 50 and 100ish for struct buffer_head. With a 4K page and 1K blocks, IO requires 10% RAM overhead in buffer heads alone. With fsblocks you're down to around 3%. I thought we were going to simply use the page struct instead of having buffer heads? Would that not reduce the overhead to zero? What do you mean by that? As I said, you couldn't use just the page struct for anything except page sized blocks, and even then it would require more fields or at least more flags in the page struct.
nobh mode actually tries to do something similar, however it requires multiple calls into the filesystem to first allocate the block, and then find its sector. It is also buggy and can't handle errors properly (although I'm trying to fix that). - A real nobh mode. nobh was created I think mainly to avoid problems with buffer_head memory consumption, especially on lowmem machines. It is basically a hack (sorry), which requires special code in filesystems, and duplication of quite a bit of tricky buffer layer code (and bugs). It also doesn't work so well for buffers with non-trivial private data (like most journalling ones). fsblock implements this with basically a few lines of code, and it should work in situations like ext3. Hmmm That means simply page struct are not working... I don't understand you. jbd needs to attach private data to each bh, and that can stay around for longer than the life of the page in the pagecache. - Large block support. I can mount and run an 8K block size minix3 fs on my 4K page system and it didn't require anything special in the fs. We can go up to about 32MB blocks now, and gigabyte+ blocks would only require one more bit in the fsblock flags. fsblock_superpage blocks are > PAGE_CACHE_SIZE, midpage ==, and subpage <. Core pagecache code is pretty creaky with respect to this. I think it is mostly race free, but it requires stupid unlocking and relocking hacks because the vm usually passes single locked pages to the fs layers, and we need to lock all pages of a block in offset ascending order. This could be avoided by doing locking on only the first page of a block for locking in the fsblock layer, but that's a bit scary too. Probably better would be to move towards offset,length rather than page based fs APIs where everything can be batched up nicely and this sort of non-trivial locking can be more optimal. Large blocks also have a performance black spot where an 8K sized and aligned write(2) would require an RMW in the filesystem.
Again because of the page based nature of the fs API, and this too would be fixed if the APIs were better.

The simple solution would be to use a compound page and make the head page represent the status of all the pages in the vm. Logic for that is already in place.

I do not consider that a solution because I explicitly want to allow order-0 pages here. I know about your higher order pagecache, the anti-frag and defrag work, I know about compound pages. I'm not just ignoring them because of NIH or something silly. Anyway, I have thought about just using the first page in the block for the locking, and that might be a reasonable optimisation. However for now I'm keeping it simple.

Large block memory access via the filesystem uses vmap, but it will go back to kmap if the access doesn't cross a page. Filesystems really should do this because vmap is slow as anything. I've implemented a vmap cache which
Re: [RFC] fsblock
On Mon, Jul 09, 2007 at 05:59:47PM -0700, Christoph Lameter wrote: On Tue, 10 Jul 2007, Nick Piggin wrote: Hmmm, I did not notice that yet but then I have not done much work there.

Notice what?

The bad code for the buffer heads.

Oh. Well my first mail in this thread listed some of the problems with them.

- A real nobh mode. nobh was created I think mainly to avoid problems with buffer_head memory consumption, especially on lowmem machines. It is basically a hack (sorry), which requires special code in filesystems, and duplication of quite a bit of tricky buffer layer code (and bugs). It also doesn't work so well for buffers with non-trivial private data (like most journalling ones). fsblock implements this with basically a few lines of code, and it should work in situations like ext3.

Hmmm, that means simply page structs are not working...

I don't understand you. jbd needs to attach private data to each bh, and that can stay around for longer than the life of the page in the pagecache.

Right. So just using the page struct alone won't work for the filesystems.

There are no changes to the filesystem API for large pages (although I am adding a couple of helpers to do page based bitmap ops). And I don't want to rely on contiguous memory. Why do you think handling of large pages (presumably you mean larger than page sized blocks) is strange?

We already have a way to handle large pages: compound pages.

Yes, but I don't want to use large pages and I am not going to use them (at least, they won't be mandatory). Conglomerating the constituent pages via the pagecache radix-tree seems logical to me.

Meaning overhead to handle each page still exists? This scheme cannot handle large contiguous blocks as a single entity?

Of course some things have to be done per-page if the pages are not contiguous.
I actually haven't seen that to be a problem or have much reason to think it will suddenly become a problem (although I do like Andrea's config page sizes approach for really big systems that cannot change their HW page size). - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: vm/fs meetup details
On Thu, Jul 05, 2007 at 01:54:06PM -0400, Rik van Riel wrote: Nick Piggin wrote: Hi, The vm/fs meetup will be held September 4th from 10am till 4pm (with the option of going longer), at the University of Cambridge. I am interested. A few potential topics: OK, I'll put you on the list. Hope to see you there. - improving laptop_mode in the VFS VM to further increase battery life in laptops - repair driven design, we know what it is (Val told us), but how does it apply to the things we are currently working on? should we do more of it? Thanks for the suggestions.
Re: vm/fs meetup details
On Thu, Jul 05, 2007 at 05:40:57PM -0400, Rik van Riel wrote: David Chinner wrote: On Thu, Jul 05, 2007 at 01:40:08PM -0700, Zach Brown wrote: - repair driven design, we know what it is (Val told us), but how does it apply to the things we are currently working on? should we do more of it? I'm sure Chris and I could talk about the design elements in btrfs that should aid repair if folks are interested in hearing about them. We'd keep the hand-waving to a minimum :). And I'm sure I could provide a counterpoint by talking about the techniques we've used improving XFS repair speed and scalability without needing to change any on disk formats Sounds like that could be an interesting discussion. Especially when trying to answer questions like: At what filesystem size will the mitigating fixes no longer be enough? and When will people start using filesystems THAT big? :) Keep in mind that the way to get the most out of this meeting is for the fs people to have topics of the form we'd really like to do X, can we get some help from the VM? Or vice versa from vm people. That said, we can talk about whatever interests the group on the day. And that could definitely include issues common to different filesystems.
vm/fs meetup details
Hi, The vm/fs meetup will be held September 4th from 10am till 4pm (with the option of going longer), at the University of Cambridge. Anton Altaparmakov has arranged a conference room for us with whiteboard and projector, so many thanks to him. I will send out the location and plans for meeting/getting there after we work out the best strategy for that. At the moment we have 15 people interested so far. We can have a few more people, so if you aren't cc'ed and would like to come along please let me know. We do have limited space, so I'm sorry in advance if anybody misses out. I'll post out a running list of suggested topics later, but they're really just a rough guideline. It will be a round-table kind of thing and long monologue talks won't be appropriate, however some slides or whiteboarding to interactively introduce and discuss your idea would be OK. I think we want to avoid assigning slots for specific people/topics. Feel free to propose anything, if it only gets a small amount of interest then at least you'll know who to discuss it with later :) Thanks, Nick