Re: Proposal for proper durable fsync() and fdatasync()
On Tuesday 26 February 2008 18:59, Jamie Lokier wrote:

Andrew Morton wrote: On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier [EMAIL PROTECTED] wrote:

(It would be nicer if sync_file_range() took a vector of ranges for better elevator scheduling, but let's ignore that :-)

Two passes:
Pass 1: shove each of the segments into the queue with SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
Pass 2: wait for them all to complete and return accumulated result with SYNC_FILE_RANGE_WAIT_AFTER

Thanks. Seems ok, though being able to cork the I/O until the last one would be a bonus (like TCP_MORE... SYNC_FILE_RANGE_MORE?)

I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE. Is there a reason why you have it there? The man page isn't very enlightening.

Yeah, sync_file_range has slightly unusual semantics and introduces a new concept, writeout, to userspace (does writeout include data in the drive cache? the kernel doesn't think so, but the only way to make sync_file_range safe is if you do consider it writeout). If it makes it any easier to understand, we can add in SYNC_FILE_ASYNC, SYNC_FILE_SYNC parts that just deal with safe/unsafe and sync/async semantics that are part of the normal POSIX API.

Anyway, making fsync/fdatasync etc. safe by default is a good idea IMO, and it is a bad bug that we don't do that :(

- To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
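For reference, the two-pass scheme described above can be sketched from userspace roughly as follows. This is a minimal sketch, not code from the thread: the `struct frange` type and the `fsync_ranges` helper are my names, and (as the thread itself points out) this still does not guarantee durability past a volatile drive write cache.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>

/* A hypothetical range descriptor; not part of any kernel API. */
struct frange {
	off_t offset;
	off_t nbytes;
};

/*
 * Two-pass range sync:
 * pass 1 queues writeout for every range (waiting first for any
 * writeout already in flight on those pages), pass 2 waits for all
 * of them and accumulates the result.
 */
static int fsync_ranges(int fd, const struct frange *r, size_t n)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < n; i++)
		if (sync_file_range(fd, r[i].offset, r[i].nbytes,
				SYNC_FILE_RANGE_WAIT_BEFORE |
				SYNC_FILE_RANGE_WRITE) != 0)
			ret = -1;

	for (i = 0; i < n; i++)
		if (sync_file_range(fd, r[i].offset, r[i].nbytes,
				SYNC_FILE_RANGE_WAIT_AFTER) != 0)
			ret = -1;

	return ret;
}
```

Linux-only, since sync_file_range(2) is a Linux-specific call.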
Re: [PATCH] [0/18] Implement some low hanging BKL removal fruit in fs/*
On Sunday 27 January 2008 13:17, Andi Kleen wrote:

[Andrew: I believe this is -mm material for .25]

- Convert some more file systems (generally those that don't use the BKL for anything except mount) to use unlocked_bkl.
- Implement BKL-less fasync (see patch for the rationale). This is currently a separate entry point, but since the number of fasync users in the tree is relatively small I hope the older entry point can be removed in the not too distant future [help from other people converting more fasync users to unlocked_fasync would be appreciated]
- Implement BKL-less remote_llseek
- While I was at it I also added a few missing compat ioctl handlers
- Fix a few comments

This fixes a lot of relatively trivial BKL users in fs/*. The main remaining non-legacy offenders are now locks.c, nfs/nfsd and reiserfs. I believe BKL removal for all of those is being worked on by other people. Also a lot of legacy file systems still use it, but converting those does not seem to be very pressing.

BTW, here is a patch I did a while back for minix. I know it isn't a big deal, but the work is done so I guess I should send it along.

The minix filesystem uses the BKL to protect access to metadata. Switch to a per-superblock mutex.
Signed-off-by: Nick Piggin [EMAIL PROTECTED]

Index: linux-2.6/fs/minix/bitmap.c
===================================================================
--- linux-2.6.orig/fs/minix/bitmap.c
+++ linux-2.6/fs/minix/bitmap.c
@@ -69,11 +69,11 @@ void minix_free_block(struct inode *inod
 		return;
 	}
 	bh = sbi->s_zmap[zone];
-	lock_kernel();
+	mutex_lock(&sbi->s_mutex);
 	if (!minix_test_and_clear_bit(bit, bh->b_data))
 		printk("minix_free_block (%s:%lu): bit already cleared\n",
 		       sb->s_id, block);
-	unlock_kernel();
+	mutex_unlock(&sbi->s_mutex);
 	mark_buffer_dirty(bh);
 	return;
 }
@@ -88,18 +88,18 @@ int minix_new_block(struct inode * inode
 		struct buffer_head *bh = sbi->s_zmap[i];
 		int j;
 
-		lock_kernel();
+		mutex_lock(&sbi->s_mutex);
 		j = minix_find_first_zero_bit(bh->b_data, bits_per_zone);
 		if (j < bits_per_zone) {
 			minix_set_bit(j, bh->b_data);
-			unlock_kernel();
+			mutex_unlock(&sbi->s_mutex);
 			mark_buffer_dirty(bh);
 			j += i * bits_per_zone + sbi->s_firstdatazone-1;
 			if (j < sbi->s_firstdatazone || j >= sbi->s_nzones)
 				break;
 			return j;
 		}
-		unlock_kernel();
+		mutex_unlock(&sbi->s_mutex);
 	}
 	return 0;
 }
@@ -211,10 +211,10 @@ void minix_free_inode(struct inode * ino
 	minix_clear_inode(inode);	/* clear on-disk copy */
 
 	bh = sbi->s_imap[ino];
-	lock_kernel();
+	mutex_lock(&sbi->s_mutex);
 	if (!minix_test_and_clear_bit(bit, bh->b_data))
 		printk("minix_free_inode: bit %lu already cleared\n", bit);
-	unlock_kernel();
+	mutex_unlock(&sbi->s_mutex);
 	mark_buffer_dirty(bh);
 out:
 	clear_inode(inode);	/* clear in-memory copy */
@@ -237,7 +237,7 @@ struct inode * minix_new_inode(const str
 	j = bits_per_zone;
 	bh = NULL;
 	*error = -ENOSPC;
-	lock_kernel();
+	mutex_lock(&sbi->s_mutex);
 	for (i = 0; i < sbi->s_imap_blocks; i++) {
 		bh = sbi->s_imap[i];
 		j = minix_find_first_zero_bit(bh->b_data, bits_per_zone);
@@ -245,17 +245,17 @@ struct inode * minix_new_inode(const str
 			break;
 	}
 	if (!bh || j >= bits_per_zone) {
-		unlock_kernel();
+		mutex_unlock(&sbi->s_mutex);
 		iput(inode);
 		return NULL;
 	}
 	if (minix_test_and_set_bit(j, bh->b_data)) {	/* shouldn't happen */
-		unlock_kernel();
+		mutex_unlock(&sbi->s_mutex);
 		printk("minix_new_inode: bit already set\n");
 		iput(inode);
 		return NULL;
 	}
-	unlock_kernel();
+	mutex_unlock(&sbi->s_mutex);
 	mark_buffer_dirty(bh);
 	j += i * bits_per_zone;
 	if (!j || j > sbi->s_ninodes) {
Index: linux-2.6/fs/minix/dir.c
===================================================================
--- linux-2.6.orig/fs/minix/dir.c
+++ linux-2.6/fs/minix/dir.c
@@ -102,7 +102,7 @@ static int minix_readdir(struct file * f
 	char *name;
 	__u32 inumber;
 
-	lock_kernel();
+	mutex_lock(&sbi->s_mutex);
 
 	pos = (pos + chunk_size-1) & ~(chunk_size-1);
 	if (pos >= inode->i_size)
@@ -146,7 +146,7 @@ static int minix_readdir(struct file * f
 
 done:
 	filp->f_pos = (n << PAGE_CACHE_SHIFT) | offset;
-	unlock_kernel();
+	mutex_unlock(&sbi->s_mutex);
 	return 0;
 }
Index: linux-2.6/fs/minix/inode.c
===================================================================
--- linux-2.6.orig/fs/minix/inode.c
+++ linux-2.6/fs/minix/inode.c
@@ -174,6 +174,7 @@ static int minix_fill_super(struct super
 	sbi->s_firstdatazone = ms->s_firstdatazone;
 	sbi->s_log_zone_size = ms->s_log_zone_size;
 	sbi->s_max_size = ms->s_max_size;
+	mutex_init(&sbi->s_mutex);
 	s->s_magic = ms->s_magic;
 	if (s->s_magic == MINIX_SUPER_MAGIC) {
 		sbi->s_version = MINIX_V1;
Index: linux-2.6/fs/minix/minix.h
===================================================================
--- linux-2.6.orig/fs/minix/minix.h
+++ linux-2.6/fs/minix/minix.h
@@ -1,6 +1,7 @@
 #include <linux/fs.h>
 #include <linux/pagemap.h>
 #include <linux/minix_fs.h>
+#include <linux/mutex.h>
 
 /*
  * change the define below to 0
Re: [PATCH][RFC] fast file mapping for loop
On Wednesday 09 January 2008 19:52, Jens Axboe wrote:

So how does it work? Instead of punting IO to a thread and passing it through the page cache, we instead attempt to send the IO directly to the filesystem block that it maps to.

You told Christoph that just using direct-IO from kernel still doesn't give you the required behaviour... What about queueing the IO directly *and* using direct-IO? I guess it still has to go through the underlying filesystem, but that's probably a good thing.

loop maintains a prio tree of known extents in the file (populated lazily on demand, as needed).

Just a quick question (I haven't looked closely at the code): how come you are using a prio tree for extents? I don't think they could be overlapping?
Re: [PATCH 36/42] VFS: export drop_pagecache_sb
On Friday 14 December 2007 02:24, Erez Zadok wrote:

In message [EMAIL PROTECTED], Nick Piggin writes: On Monday 10 December 2007 13:42, Erez Zadok wrote: Needed to maintain cache coherency after branch management.

Hmm, I'd much prefer to be able to sleep in invalidate_mapping_pages before this function gets exported. As it is, it can cause massive latencies on preemption and the inode_lock so it is pretty much debug-only IMO. I'd rather it didn't escape into the wild as is. Either that or rework your cache coherency somehow.

Nick, thanks for the advice. We use a generation number after each successful branch configuration command, so that ->d_revalidate later on can discover that change, and rebuild the union of objects. At ->remount time, I figured it'd be nice to encourage that revalidation to happen sooner, by invalidating as many upper pages as possible, thus causing ->d_revalidate/->readpage to take place sooner. So we used to call drop_pagecache_sb from our remount code: it was the only caller of drop_pagecache_sb. It wasn't too much of a latency issue to call drop_pagecache_sb there: the VFS remount code path is already pretty slow (dropping temporarily to readonly mode, and dropping other caches), and remount isn't an operation used often, so a little bit more latency would probably not have been noticed by users.

Well, a large, infrequent spike is the most damaging to latency-sensitive users. And anyway, I guess the infrequency of remount means it doesn't have to be really efficient with invalidating pagecache either.

Nevertheless, it was not strictly necessary to call drop_pagecache_sb in unionfs_remount, because the objects in question will have gotten revalidated sooner or later anyway; the call to drop_pagecache_sb was just an optimization (one which I wasn't 100% sure about anyway, as per my long XXX comment above that call in unionfs_remount).
So I agree with you: if this symbol can be abused by modules and cause problems, then exporting it to modules is too risky. I've reworked my code to avoid calling drop_pagecache_sb and I'll [sic] drop that patch.

Thanks, I'd be much happier with that.
Re: [patch] mm: fix XIP file writes
On Mon, Dec 10, 2007 at 03:38:20PM +0100, Christian Borntraeger wrote:

Hi Nick,

Here we go. See, brd already found a bug ;) Can you apply the ext2 XIP patch too? And I'll resend the brd XIP patch.

[...] Writing to XIP files at a non-page-aligned offset results in data corruption because the writes were always sent to the start of the page. [...]

@@ -314,7 +314,7 @@ __xip_file_write(struct file *filp, cons
 		fault_in_pages_readable(buf, bytes);
 		kaddr = kmap_atomic(page, KM_USER0);
 		copied = bytes -
-			__copy_from_user_inatomic_nocache(kaddr, buf, bytes);
+			__copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 		kunmap_atomic(kaddr, KM_USER0);
 		flush_dcache_page(page);

I asked myself why this problem never happened before. So I asked our testers to reproduce this problem on 2.6.23 and service levels. As the testcase did not trigger, I looked into the 2.6.23 code. This problem was introduced by commit 4a9e5ef1f4f15205e477817a5cefc34bd3f65f55 (mm: write iovec cleanup from Nick Piggin) during 2.6.24-rc:

snip---
-		copied = filemap_copy_from_user(page, offset, buf, bytes);
[...]
+		copied = bytes -
+			__copy_from_user_inatomic_nocache(kaddr, buf, bytes);
---

So yes, it's good to have xip on brd. It even tests your changes ;-) Good news is that we don't need anything for stable.

Heh ;) that explains a lot. Thanks!
Re: [PATCH 36/42] VFS: export drop_pagecache_sb
On Monday 10 December 2007 13:42, Erez Zadok wrote:

Needed to maintain cache coherency after branch management.

Hmm, I'd much prefer to be able to sleep in invalidate_mapping_pages before this function gets exported. As it is, it can cause massive latencies on preemption and the inode_lock so it is pretty much debug-only IMO. I'd rather it didn't escape into the wild as is. Either that or rework your cache coherency somehow.

Signed-off-by: Erez Zadok [EMAIL PROTECTED]
---
 fs/drop_caches.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 59375ef..90410ac 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -3,6 +3,7 @@
  */
 
 #include <linux/kernel.h>
+#include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/fs.h>
 #include <linux/writeback.h>
@@ -12,7 +13,7 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
-static void drop_pagecache_sb(struct super_block *sb)
+void drop_pagecache_sb(struct super_block *sb)
 {
 	struct inode *inode;
 
@@ -24,6 +25,7 @@ static void drop_pagecache_sb(struct super_block *sb)
 	}
 	spin_unlock(&inode_lock);
 }
+EXPORT_SYMBOL(drop_pagecache_sb);
 
 void drop_pagecache(void)
 {
Re: [patch] ext2: xip check fix
On Thu, Dec 06, 2007 at 09:43:27AM +0100, Carsten Otte wrote:

Nick Piggin wrote:

Xip does only work if both do match PAGE_SIZE, because it doesn't support multiple calls to direct_access in the get_xip_page address space operation. Thus we check both here; actually this was changed from how it looks after your patch as a bugfix where our tester tried a 4k filesystem on a 2k blockdev. Did I miss something?

However, the bdev block size may be changed with sb_set_blocksize. It doesn't actually have to match the hardware sector size -- if this does matter for XIP, then I think you need some other check here.

Hhh. For a bdev with PAGE_SIZE hardsect size, there is no other valid value than PAGE_SIZE that one could set it to. Or can it indeed be changed to a value greater than PAGE_SIZE or smaller than hardsect size?

It can't be made smaller (or larger, in current kernels). But you already get all that checking done for you -- both by checking that the filesystem blocksize == PAGE_SIZE, and by the error checking in sb_set_blocksize. After my patch, we can do XIP on a hardsect size < PAGE_SIZE block device -- this seems to be a fine thing to do at least for the ramdisk code. Would this situation be problematic for existing drivers, and if so, in what way?

Thanks, Nick
Re: [patch] ext2: xip check fix
On Thu, Dec 06, 2007 at 10:59:02AM +0100, Carsten Otte wrote:

Nick Piggin wrote: After my patch, we can do XIP on a hardsect size < PAGE_SIZE block device -- this seems to be a fine thing to do at least for the ramdisk code. Would this situation be problematic for existing drivers, and if so, in what way?

I have done some archeology, and our ancient CVS logs show this check was introduced in early 2005 into our 2.6.x codebase. However, it existed way before, and was copied from our prehistorical ext2 split named xip2 back in the old days of 2.4.x, where we did not really have a block device behind because that one was scamped into the file system in a very queer way.

OK, thanks for taking a look at that. It will be helpful for testing XIP with my new ramdisk driver (did you see the patch?).

After all, I don't see any risk in removing the check. The only driver we have that does direct_access is drivers/s390/block/dcssblk.c, and that one only supports block_size == PAGE_SIZE. I think the patch should go into mainline.

Actually another one's recently sprung up too (arch/powerpc/sysdev/axonram.c), but it looks like that one should be fine as it looks to be all simply memory mapped.

Acked-by: Carsten Otte [EMAIL PROTECTED]

Thanks! Andrew, would you queue this up for 2.6.25 please?
Re: [patch] ext2: xip check fix
On Thu, Dec 06, 2007 at 10:17:39PM -0600, Rob Landley wrote:

On Thursday 06 December 2007 21:22:25 Jared Hulbert wrote: I haven't looked at it yet. I do appreciate it, I think it might broaden the user-base of this feature which is up to now s390-only due to the fact that the flash memory extensions have not been implemented (yet?). And it enables testing xip on other platforms. The patch is on my must-read list.

query: which feature is currently s390 only? (Execute In Place?)

I think so. The filemap_xip.c functionality doesn't work for Flash memory yet. Flash memory doesn't have struct pages to back it up with, which this stuff depends on.

Um, trying to clarify: S390. Also known as zSeries, big iron machine, uses its own weird processor design rather than x86, x86-64, arm, or mips processors. How does struct page enter into this? What I want to know is, are you saying execute in place doesn't work on things like arm and mips? (If so, I was unaware of this. I heard about somebody getting it to work on a Nintendo DS: http://forums.maxconsole.net/showthread.php?t=18668 )

It's just that the only device driver with XIP support for the moment is s390 (and an obscure powerpc one). Now with my ramdisk patch, anybody can use it. That's just using regular RAM, but there is nothing preventing XIP backed by more exotic storage, provided the CPU can natively address and execute from it.
Re: [patch] rewrite rd
On Tue, Dec 04, 2007 at 01:55:17AM -0600, Rob Landley wrote: On Monday 03 December 2007 22:26:28 Nick Piggin wrote: There is one slight downside -- direct block device access and filesystem metadata access goes through an extra copy and gets stored in RAM twice. However, this downside is only slight, because the real buffercache of the device is now reclaimable (because we're not playing crazy games with it), so under memory intensive situations, footprint should effectively be the same -- maybe even a slight advantage to the new driver because it can also reclaim buffer heads. For the embedded world, initramfs has pretty much rendered initrd obsolete, and that was the biggest user of the ramdisk code I know of. Beyond that, loopback mounts give you more flexible transient block devices than ramdisks do. (In fact, ramdisks are such an amazing pain to use/size/free that if I really needed something like that I'd just make a loopback mount in a ramfs instance.) They are, although we could easily hook up a couple more ioctls for them now (if anybody is interested). The rd driver can potentially be a _lot_ more scalable than the loop driver. It's completely lockless in the fastpath and doesn't even dirty any cachelines except for the actual destination memory. OK, this is only really important for testing purposes (eg. testing scalability of a filesystem etc.) But it is one reason why I (personally) want brd... Embedded users who still want a block interface for memory are generally trying to use a cramfs or squashfs image out of ROM or flash, although there are flash-specific filesystems for this and I dunno if they're actually mounting /dev/mem at an offset or something (md? losetup -o? Beats me, I haven't tried that myself yet...) OK, it would be interesting to hear from anyone using rd for things like cramfs. I don't think we can get rid of the code entirely, but it sounds like the few downsides of the new code won't be big problems. Thanks for the feedback. 
Re: [patch] rewrite rd
On Tue, Dec 04, 2007 at 10:54:51AM +0100, Christian Borntraeger wrote:

On Tuesday, 4 December 2007, Nick Piggin wrote: [...] There is one slight downside -- direct block device access and filesystem metadata access goes through an extra copy and gets stored in RAM twice. However, this downside is only slight, because the real buffercache of the device is now reclaimable (because we're not playing crazy games with it), so under memory intensive situations, footprint should effectively be the same -- maybe even a slight advantage to the new driver because it can also reclaim buffer heads.

This is just an idea, I don't know if it is worth the trouble, but have you thought about implementing direct_access for brd? That would allow execute-in-place (xip) on brd, eliminating the extra copy.

Actually that's a pretty good idea. It would allow xip to be tested without special hardware as well... I'll see what the patch looks like. Thanks
[patch] rd: support XIP
On Tue, Dec 04, 2007 at 11:10:09AM +0100, Nick Piggin wrote:

This is just an idea, I dont know if it is worth the trouble, but have you though about implementing direct_access for brd? That would allow execute-in-place (xip) on brd eliminating the extra copy.

Actually that's a pretty good idea. It would allow xip to be tested without special hardware as well... I'll see what the patch looks like. Thanks

This seems to work, although I needed another patch for ext2 too. Never run XIP before, but I'm able to execute files out of the -oxip mount, so I'm guessing it's working ;)

---
Support direct_access XIP method with brd.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/drivers/block/Kconfig
===================================================================
--- linux-2.6.orig/drivers/block/Kconfig
+++ linux-2.6/drivers/block/Kconfig
@@ -346,6 +346,16 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
 
+config BLK_DEV_XIP
+	bool "Support XIP filesystems on RAM block device"
+	depends on BLK_DEV_RAM
+	default n
+	help
+	  Support XIP filesystems (such as ext2 with XIP support on) on
+	  top of block ram device. This will slightly enlarge the kernel, and
+	  will prevent RAM block device backing store memory from being
+	  allocated from highmem (only a problem for highmem systems).
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -89,6 +89,7 @@ static struct page *brd_insert_page(stru
 {
 	pgoff_t idx;
 	struct page *page;
+	gfp_t gfp_flags;
 
 	page = brd_lookup_page(brd, sector);
 	if (page)
@@ -97,7 +98,16 @@ static struct page *brd_insert_page(stru
 	/*
 	 * Must use NOIO because we don't want to recurse back into the
 	 * block or filesystem layers from page reclaim.
+	 *
+	 * Cannot support XIP and highmem, because our ->direct_access
+	 * routine for XIP must return memory that is always addressable.
+	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * restriction might be able to be lifted.
 	 */
+	gfp_flags = GFP_NOIO | __GFP_ZERO;
+#ifndef CONFIG_BLK_DEV_XIP
+	gfp_flags |= __GFP_HIGHMEM;
+#endif
 	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
 	if (!page)
 		return NULL;
@@ -307,6 +317,28 @@ out:
 	return 0;
 }
 
+#ifdef CONFIG_BLK_DEV_XIP
+static int brd_direct_access (struct block_device *bdev, sector_t sector,
+			unsigned long *data)
+{
+	struct brd_device *brd = bdev->bd_disk->private_data;
+	struct page *page;
+
+	if (!brd)
+		return -ENODEV;
+	if (sector & (PAGE_SECTORS-1))
+		return -EINVAL;
+	if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk))
+		return -ERANGE;
+	page = brd_insert_page(brd, sector);
+	if (!page)
+		return -ENOMEM;
+	*data = (unsigned long)page_address(page);
+
+	return 0;
+}
+#endif
+
 static int brd_ioctl(struct inode *inode, struct file *file,
 			unsigned int cmd, unsigned long arg)
 {
@@ -342,8 +374,11 @@ static int brd_ioctl(struct inode *inode
 }
 
 static struct block_device_operations brd_fops = {
-	.owner =		THIS_MODULE,
-	.ioctl =		brd_ioctl,
+	.owner =		THIS_MODULE,
+	.ioctl =		brd_ioctl,
+#ifdef CONFIG_BLK_DEV_XIP
+	.direct_access =	brd_direct_access,
+#endif
 };
 
 /*
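For readers following the sector checks in brd_direct_access above: the driver stores its backing memory one page at a time, so a sector maps to a radix-tree page index plus a byte offset within that page. A userspace sketch of that arithmetic (my own helper names; the constants assume 512-byte sectors and 4K pages, as on x86):

```c
#define SECTOR_SHIFT	9	/* 512-byte sectors */
#define PAGE_SHIFT_	12	/* 4K pages (x86); suffixed to avoid kernel-header names */
#define PAGE_SECTORS	(1UL << (PAGE_SHIFT_ - SECTOR_SHIFT))

/* Which backing page a sector lands in (the radix tree index). */
static unsigned long sector_to_page_idx(unsigned long long sector)
{
	return (unsigned long)(sector >> (PAGE_SHIFT_ - SECTOR_SHIFT));
}

/* Byte offset of a sector within its backing page. */
static unsigned long sector_to_page_off(unsigned long long sector)
{
	return (unsigned long)((sector & (PAGE_SECTORS - 1)) << SECTOR_SHIFT);
}
```

The `sector & (PAGE_SECTORS-1)` test in the patch is exactly the "offset within page must be zero" condition: direct_access only hands out whole, page-aligned chunks.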
[patch] ext2: xip check fix
Am I missing something here? I wonder how s390 works without this change?

--
ext2 should not worry about checking sb->s_blocksize for XIP before the sb's blocksize actually gets set.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/fs/ext2/super.c
===================================================================
--- linux-2.6.orig/fs/ext2/super.c
+++ linux-2.6/fs/ext2/super.c
@@ -844,8 +844,7 @@ static int ext2_fill_super(struct super_
 
 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
 
-	if ((ext2_use_xip(sb)) && ((blocksize != PAGE_SIZE) ||
-				  (sb->s_blocksize != blocksize))) {
+	if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
 		if (!silent)
 			printk("XIP: Unsupported blocksize\n");
 		goto failed_mount;
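The blocksize computed in the hunk above comes from the ext2 on-disk superblock, which stores the block size as a log2 shift relative to 1024 bytes. A small userland sketch of that computation (the helper name is mine; 4K pages assumed):

```c
#define BLOCK_SIZE_	1024	/* the kernel's BLOCK_SIZE base unit */
#define PAGE_SIZE_	4096	/* assuming 4K pages */

/* ext2: block size = 1024 << s_log_block_size. */
static unsigned long ext2_blocksize(unsigned int s_log_block_size)
{
	return (unsigned long)BLOCK_SIZE_ << s_log_block_size;
}
```

So on a 4K-page machine only s_log_block_size == 2 (a 4096-byte block size) passes the XIP check in the patch.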
Re: [patch] rd: support XIP
On Tue, Dec 04, 2007 at 03:26:20AM -0800, Andrew Morton wrote:

On Tue, 4 Dec 2007 12:21:00 +0100 Nick Piggin [EMAIL PROTECTED] wrote:

+	 *
+	 * Cannot support XIP and highmem, because our ->direct_access
+	 * routine for XIP must return memory that is always addressable.
+	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * restriction might be able to be lifted.
 	 */
+	gfp_flags = GFP_NOIO | __GFP_ZERO;
+#ifndef CONFIG_BLK_DEV_XIP
+	gfp_flags |= __GFP_HIGHMEM;
+#endif

A dubious tradeoff?

On big highmem machines certainly. It may be somewhat useful on small memory systems... but having the config option there is nice for a VM developer without an s390 easily available ;)

But don't apply these XIP patches yet -- after a bit more testing I'm seeing some data corruption, so I'll have to work out what's going wrong with that first.
[patch] mm: fix XIP file writes
On Tue, Dec 04, 2007 at 12:35:49PM +0100, Nick Piggin wrote:

On Tue, Dec 04, 2007 at 03:26:20AM -0800, Andrew Morton wrote: On Tue, 4 Dec 2007 12:21:00 +0100 Nick Piggin [EMAIL PROTECTED] wrote:

+	 *
+	 * Cannot support XIP and highmem, because our ->direct_access
+	 * routine for XIP must return memory that is always addressable.
+	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * restriction might be able to be lifted.
 	 */
+	gfp_flags = GFP_NOIO | __GFP_ZERO;
+#ifndef CONFIG_BLK_DEV_XIP
+	gfp_flags |= __GFP_HIGHMEM;
+#endif

A dubious tradeoff?

On big highmem machines certainly. It may be somewhat useful on small memory systems... but having the config option there is nice for a VM developer without an s390 easily available ;) But don't apply these XIP patches yet -- after a bit more testing I'm seeing some data corruption, so I'll have to work out what's going wrong with that first.

Here we go. See, brd already found a bug ;) Can you apply the ext2 XIP patch too? And I'll resend the brd XIP patch.

---
Writing to XIP files at a non-page-aligned offset results in data corruption because the writes were always sent to the start of the page.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c
+++ linux-2.6/mm/filemap_xip.c
@@ -314,7 +314,7 @@ __xip_file_write(struct file *filp, cons
 		fault_in_pages_readable(buf, bytes);
 		kaddr = kmap_atomic(page, KM_USER0);
 		copied = bytes -
-			__copy_from_user_inatomic_nocache(kaddr, buf, bytes);
+			__copy_from_user_inatomic_nocache(kaddr + offset, buf, bytes);
 		kunmap_atomic(kaddr, KM_USER0);
 		flush_dcache_page(page);
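A userspace analogue may help show why the one-character fix above matters: the copy must land at the intra-page offset, otherwise a write at offset N clobbers the bytes at the start of the page instead. This is purely illustrative (the buffer and helper name are mine, not kernel code):

```c
#include <string.h>

#define PG_SIZE 4096

/*
 * Mimics the fixed __xip_file_write copy path: user data is copied
 * to kaddr + offset. The pre-fix code copied to kaddr, i.e. always
 * to the start of the page -- hence the corruption at non-aligned
 * write offsets.
 */
static size_t page_write(char *kaddr, size_t offset,
			 const char *buf, size_t bytes)
{
	memcpy(kaddr + offset, buf, bytes);	/* the fix: "+ offset" */
	return bytes;
}
```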
[patch] rd: support XIP (updated)
On Tue, Dec 04, 2007 at 12:06:23PM +, Duane Griffin wrote:

On 04/12/2007, Nick Piggin [EMAIL PROTECTED] wrote:

+	gfp_flags = GFP_NOIO | __GFP_ZERO;
+#ifndef CONFIG_BLK_DEV_XIP
+	gfp_flags |= __GFP_HIGHMEM;
+#endif
 	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);

I think that should be alloc_page(gfp_flags), no?

Yes. Here is a resend. Andrew, please apply this one (has passed some testing with ext2 XIP).

--
Support direct_access XIP method with brd.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/drivers/block/Kconfig
===================================================================
--- linux-2.6.orig/drivers/block/Kconfig
+++ linux-2.6/drivers/block/Kconfig
@@ -346,6 +346,16 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
 
+config BLK_DEV_XIP
+	bool "Support XIP filesystems on RAM block device"
+	depends on BLK_DEV_RAM
+	default n
+	help
+	  Support XIP filesystems (such as ext2 with XIP support on) on
+	  top of block ram device. This will slightly enlarge the kernel, and
+	  will prevent RAM block device backing store memory from being
+	  allocated from highmem (only a problem for highmem systems).
+
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -89,6 +89,7 @@ static struct page *brd_insert_page(stru
 {
 	pgoff_t idx;
 	struct page *page;
+	gfp_t gfp_flags;
 
 	page = brd_lookup_page(brd, sector);
 	if (page)
@@ -97,8 +98,17 @@ static struct page *brd_insert_page(stru
 	/*
 	 * Must use NOIO because we don't want to recurse back into the
 	 * block or filesystem layers from page reclaim.
+	 *
+	 * Cannot support XIP and highmem, because our ->direct_access
+	 * routine for XIP must return memory that is always addressable.
+	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * restriction might be able to be lifted.
 	 */
-	page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
+	gfp_flags = GFP_NOIO | __GFP_ZERO;
+#ifndef CONFIG_BLK_DEV_XIP
+	gfp_flags |= __GFP_HIGHMEM;
+#endif
+	page = alloc_page(gfp_flags);
 	if (!page)
 		return NULL;
@@ -307,6 +317,28 @@ out:
 	return 0;
 }
 
+#ifdef CONFIG_BLK_DEV_XIP
+static int brd_direct_access (struct block_device *bdev, sector_t sector,
+			unsigned long *data)
+{
+	struct brd_device *brd = bdev->bd_disk->private_data;
+	struct page *page;
+
+	if (!brd)
+		return -ENODEV;
+	if (sector & (PAGE_SECTORS-1))
+		return -EINVAL;
+	if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk))
+		return -ERANGE;
+	page = brd_insert_page(brd, sector);
+	if (!page)
+		return -ENOMEM;
+	*data = (unsigned long)page_address(page);
+
+	return 0;
+}
+#endif
+
 static int brd_ioctl(struct inode *inode, struct file *file,
 			unsigned int cmd, unsigned long arg)
 {
@@ -342,8 +374,11 @@ static int brd_ioctl(struct inode *inode
 }
 
 static struct block_device_operations brd_fops = {
-	.owner =		THIS_MODULE,
-	.ioctl =		brd_ioctl,
+	.owner =		THIS_MODULE,
+	.ioctl =		brd_ioctl,
+#ifdef CONFIG_BLK_DEV_XIP
+	.direct_access =	brd_direct_access,
+#endif
 };
 
 /*
[patch] rewrite rd
Hi,

This is my proposal for a (hopefully) backwards compatible rd driver. The motivation for me is not pressing, except that I have this code sitting here that is either going to rot or get merged. I'm happy to make myself maintainer of this code, but if any real block device driver writer would like to, that would be fine by me ;) Comments?

--
This is a rewrite of the ramdisk block device driver. The old one is really difficult because it effectively implements a block device which serves data out of its own buffer cache. It relies on the dirty bit being set to pin its backing store in cache, however there are non-trivial paths which can clear the dirty bit (eg. try_to_free_buffers()), which had recently led to data corruption. And in general it is completely wrong for a block device driver to do this.

The new one is more like a regular block device driver. It has no idea about vm/vfs stuff. Its backing store is similar to the buffer cache (a simple radix-tree of pages), but it doesn't know anything about page cache (the pages in the radix tree are not pagecache pages).

There is one slight downside -- direct block device access and filesystem metadata access goes through an extra copy and gets stored in RAM twice. However, this downside is only slight, because the real buffercache of the device is now reclaimable (because we're not playing crazy games with it), so under memory intensive situations, footprint should effectively be the same -- maybe even a slight advantage to the new driver because it can also reclaim buffer heads.

The fact that it now goes through all the regular vm/fs paths makes it much more useful for testing, too.

   text    data     bss     dec     hex filename
   2837     849     384    4070     fe6 drivers/block/rd.o
   3528     371      12    3911     f47 drivers/block/brd.o

Text is larger, but data and bss are smaller, making total size smaller.

A few other nice things about it:
- Similar structure and layout to the new loop device handling.
- Dynamic ramdisk creation.
- Runtime flexible buffer head size (because it is no longer part of the ramdisk code). - Boot / load time flexible ramdisk size, which could easily be extended to a per-ramdisk runtime changeable size (eg. with an ioctl). Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- MAINTAINERS|5 drivers/block/Kconfig | 12 - drivers/block/Makefile |2 drivers/block/brd.c| 500 + drivers/block/rd.c | 536 - fs/buffer.c|1 6 files changed, 508 insertions(+), 548 deletions(-) Index: linux-2.6/drivers/block/Kconfig === --- linux-2.6.orig/drivers/block/Kconfig +++ linux-2.6/drivers/block/Kconfig @@ -311,7 +311,7 @@ config BLK_DEV_UB If unsure, say N. config BLK_DEV_RAM - tristate "RAM disk support" + tristate "RAM block device support" ---help--- Saying Y here will allow you to use a portion of your RAM memory as a block device, so that you can make file systems on it, read and @@ -346,16 +346,6 @@ config BLK_DEV_RAM_SIZE The default value is 4096 kilobytes. Only change this if you know what you are doing. -config BLK_DEV_RAM_BLOCKSIZE - int "Default RAM disk block size (bytes)" - depends on BLK_DEV_RAM - default 1024 - help - The default value is 1024 bytes. PAGE_SIZE is a much more - efficient choice however. The default is kept to ensure initrd - setups function - apparently needed by the rd_load_image routine - that supposes the filesystem in the image uses a 1024 blocksize.
- config CDROM_PKTCDVD tristate "Packet writing on CD/DVD media" depends on !UML Index: linux-2.6/drivers/block/Makefile === --- linux-2.6.orig/drivers/block/Makefile +++ linux-2.6/drivers/block/Makefile @@ -11,7 +11,7 @@ obj-$(CONFIG_AMIGA_FLOPPY)+= amiflop.o obj-$(CONFIG_PS3_DISK) += ps3disk.o obj-$(CONFIG_ATARI_FLOPPY) += ataflop.o obj-$(CONFIG_AMIGA_Z2RAM) += z2ram.o -obj-$(CONFIG_BLK_DEV_RAM) += rd.o +obj-$(CONFIG_BLK_DEV_RAM) += brd.o obj-$(CONFIG_BLK_DEV_LOOP) += loop.o obj-$(CONFIG_BLK_DEV_PS2) += ps2esdi.o obj-$(CONFIG_BLK_DEV_XD) += xd.o Index: linux-2.6/drivers/block/brd.c === --- /dev/null +++ linux-2.6/drivers/block/brd.c @@ -0,0 +1,536 @@ +/* + * Ram backed block device driver. + * + * Copyright (C) 2007 Nick Piggin + * Copyright (C) 2007 Novell Inc. + * + * Parts derived from drivers/block/rd.c, and drivers/block/loop.c, copyright + * of their respective owners. + */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/moduleparam.h> +#include <linux/major.h> +#include <linux/blkdev.h>
Re: [patch] rewrite rd
On Mon, Dec 03, 2007 at 10:29:03PM -0800, Andrew Morton wrote: On Tue, 4 Dec 2007 05:26:28 +0100 Nick Piggin [EMAIL PROTECTED] wrote: There is one slight downside -- direct block device access and filesystem metadata access goes through an extra copy and gets stored in RAM twice. However, this downside is only slight, because the real buffercache of the device is now reclaimable (because we're not playing crazy games with it), so under memory intensive situations, footprint should effectively be the same -- maybe even a slight advantage to the new driver because it can also reclaim buffer heads. what about mmap(/dev/ram0)? Same thing I guess -- it will use more memory in the common case, but should actually be able to reclaim slightly more memory under pressure, so the tiny systems guys shouldn't have too much trouble. And we're avoiding the whole class of aliasing problems where mmap on the old rd driver means that you're creating new mappings to your backing store pages. Text is larger, but data and bss are smaller, making total size smaller. A few other nice things about it: - Similar structure and layout to the new loop device handling. - Dynamic ramdisk creation. - Runtime flexible buffer head size (because it is no longer part of the ramdisk code). - Boot / load time flexible ramdisk size, which could easily be extended to a per-ramdisk runtime changeable size (eg. with an ioctl). This ramdisk driver can use highmem whereas the existing one can't (yes?). That's a pretty major difference. Plus look at the revoltingness in rd.c's mapping_set_gfp_mask()s. Ah yep, there is the highmem advantage too. +#define SECTOR_SHIFT 9 That's our third definition of SECTOR_SHIFT. Or 7th, depending on how you count ;) I always thought redefining it is a prerequisite to getting anything merged into the block layer, so I'm too scared to put it in include/linux/blkdev.h ;) +/* + * Look up and return a brd's page for a given sector.
+ */ +static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector) +{ + unsigned long idx; Could use pgoff_t here if you think that's clearer. I guess it is. Although on one hand the radix-tree uses unsigned long, but on the other hand, page->index is pgoff. +{ + unsigned long idx; + struct page *page; + + page = brd_lookup_page(brd, sector); + if (page) + return page; + + page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO); Why GFP_NOIO? I guess it has similar issues to rd.c -- we can't risk recursing into the block layer here. However unlike rd.c, we _could_ easily add some mode or ioctl to allocate the backing store upfront, with full reclaim flags... Have you thought about __GFP_MOVABLE treatment here? Seems pretty desirable as this sucker can be large. AFAIK that only applies to pagecache but I haven't been paying much attention to that stuff lately. It wouldn't be hard to move these pages around, if we had a hook from the vm for it. +static void brd_free_pages(struct brd_device *brd) +{ + unsigned long pos = 0; + struct page *pages[FREE_BATCH]; + int nr_pages; + + do { + int i; + + nr_pages = radix_tree_gang_lookup(&brd->brd_pages, + (void **)pages, pos, FREE_BATCH); + + for (i = 0; i < nr_pages; i++) { + void *ret; + + BUG_ON(pages[i]->index < pos); + pos = pages[i]->index; + ret = radix_tree_delete(&brd->brd_pages, pos); + BUG_ON(!ret || ret != pages[i]); + __free_page(pages[i]); + } + + pos++; + + } while (nr_pages == FREE_BATCH); +} I have vague memories that radix_tree_gang_lookup()'s naive implementation may return fewer items than you asked for even when there are more items remaining - when it hits certain boundaries. Good memory, but it's the low level leaf traversal that bails out at boundaries. The higher level code then retries, so we should be OK here. +/* + * copy_to_brd_setup must be called before copy_to_brd. It may sleep.
+ */ +static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n) +{ + unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT; + size_t copy; + + copy = min((unsigned long)n, PAGE_SIZE - offset); use min_t. Or, better, sort out the types. I guess the API is using size_t, so that would be the more appropriate type to convert to. And I'll use min_t too. +static void copy_to_brd(struct brd_device *brd, const void *src, + sector_t sector, size_t n) +{ + struct page *page; + void *dst; + unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT; + size_t copy; + + copy = min((unsigned long)n, PAGE_SIZE - offset); Ditto. + page
Re: [patch] rewrite rd
On Tue, Dec 04, 2007 at 08:01:31AM +0100, Nick Piggin wrote: Thanks for the review, I'll post an incremental patch in a sec. Index: linux-2.6/drivers/block/brd.c === --- linux-2.6.orig/drivers/block/brd.c +++ linux-2.6/drivers/block/brd.c @@ -56,7 +56,7 @@ struct brd_device { */ static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector) { - unsigned long idx; + pgoff_t idx; struct page *page; /* @@ -87,13 +87,17 @@ static struct page *brd_lookup_page(stru */ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector) { - unsigned long idx; + pgoff_t idx; struct page *page; page = brd_lookup_page(brd, sector); if (page) return page; + /* +* Must use NOIO because we don't want to recurse back into the +* block or filesystem layers from page reclaim. +*/ page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO); if (!page) return NULL; @@ -148,6 +152,11 @@ static void brd_free_pages(struct brd_de pos++; + /* +* This assumes radix_tree_gang_lookup always returns as +* many pages as possible. If the radix-tree code changes, +* so will this have to. 
+*/ } while (nr_pages == FREE_BATCH); } @@ -159,7 +168,7 @@ static int copy_to_brd_setup(struct brd_ unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT; size_t copy; - copy = min((unsigned long)n, PAGE_SIZE - offset); + copy = min_t(size_t, n, PAGE_SIZE - offset); if (!brd_insert_page(brd, sector)) return -ENOMEM; if (copy < n) { @@ -181,7 +190,7 @@ static void copy_to_brd(struct brd_devic unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT; size_t copy; - copy = min((unsigned long)n, PAGE_SIZE - offset); + copy = min_t(size_t, n, PAGE_SIZE - offset); page = brd_lookup_page(brd, sector); BUG_ON(!page); @@ -213,7 +222,7 @@ static void copy_from_brd(void *dst, str unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT; size_t copy; - copy = min((unsigned long)n, PAGE_SIZE - offset); + copy = min_t(size_t, n, PAGE_SIZE - offset); page = brd_lookup_page(brd, sector); if (page) { src = kmap_atomic(page, KM_USER1); @@ -318,6 +327,9 @@ static int brd_ioctl(struct inode /* * Invalidate the cache first, so it isn't written * back to the device. +* +* Another thread might instantiate more buffercache here, +* but there is not much we can do to close that race. */ invalidate_bh_lrus(); truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
[rfc][patch 2/2] inotify: remove debug code
The inotify debugging code is supposed to verify that the DCACHE_INOTIFY_PARENT_WATCHED scalability optimisation does not result in notifications getting lost nor extra needless locking generated. Unfortunately there are also some races in the debugging code. And it isn't very good at finding problems anyway. So remove it for now. Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- Index: linux-2.6/fs/dcache.c === --- linux-2.6.orig/fs/dcache.c +++ linux-2.6/fs/dcache.c @@ -1408,9 +1408,6 @@ void d_delete(struct dentry * dentry) if (atomic_read(&dentry->d_count) == 1) { dentry_iput(dentry); fsnotify_nameremove(dentry, isdir); - - /* remove this and other inotify debug checks after 2.6.18 */ - dentry->d_flags &= ~DCACHE_INOTIFY_PARENT_WATCHED; return; } Index: linux-2.6/fs/inotify.c === --- linux-2.6.orig/fs/inotify.c +++ linux-2.6/fs/inotify.c @@ -168,20 +168,14 @@ static void set_dentry_child_flags(struc struct dentry *child; list_for_each_entry(child, &alias->d_subdirs, d_u.d_child) { - if (!child->d_inode) { - WARN_ON(child->d_flags & DCACHE_INOTIFY_PARENT_WATCHED); + if (!child->d_inode) continue; - } + spin_lock(&child->d_lock); - if (watched) { - WARN_ON(child->d_flags & - DCACHE_INOTIFY_PARENT_WATCHED); + if (watched) child->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED; - } else { - WARN_ON(!(child->d_flags & - DCACHE_INOTIFY_PARENT_WATCHED)); - child->d_flags &= ~DCACHE_INOTIFY_PARENT_WATCHED; - } + else + child->d_flags &= ~DCACHE_INOTIFY_PARENT_WATCHED; spin_unlock(&child->d_lock); } } @@ -253,7 +247,6 @@ void inotify_d_instantiate(struct dentry if (!inode) return; - WARN_ON(entry->d_flags & DCACHE_INOTIFY_PARENT_WATCHED); spin_lock(&entry->d_lock); parent = entry->d_parent; if (parent->d_inode && inotify_inode_watched(parent->d_inode))
[rfc][patch 1/2] inotify: fix race
Hmm, thinking I'd better just take a hammer to this inotify problem despite the fact I cannot confirm the root cause of the warning messages. Possibilities include the race being fixed here, and races in the debug code itself. -- There is a race between setting an inode's children's parent watched flag when placing the first watch on a parent, and instantiating new children of that parent: a child could miss having its flags set by set_dentry_child_flags, but then inotify_d_instantiate might still see !inotify_inode_watched. The solution is to set_dentry_child_flags after adding the watch. Locking is taken care of, because both set_dentry_child_flags and inotify_d_instantiate hold dcache_lock and child->d_locks. Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- Index: linux-2.6/fs/inotify.c === --- linux-2.6.orig/fs/inotify.c +++ linux-2.6/fs/inotify.c @@ -627,6 +627,7 @@ s32 inotify_add_watch(struct inotify_han struct inode *inode, u32 mask) { int ret = 0; + int newly_watched; /* don't allow invalid bits: we don't want flags set */ mask &= IN_ALL_EVENTS | IN_ONESHOT; @@ -653,12 +654,18 @@ s32 inotify_add_watch(struct inotify_han */ watch->inode = igrab(inode); - if (!inotify_inode_watched(inode)) - set_dentry_child_flags(inode, 1); - /* Add the watch to the handle's and the inode's list */ + newly_watched = !inotify_inode_watched(inode); list_add(&watch->h_list, &ih->watches); list_add(&watch->i_list, &inode->inotify_watches); + /* +* Set child flags _after_ adding the watch, so there is no race +* window where newly instantiated children could miss their parent's +* watched flag. +*/ + if (newly_watched) + set_dentry_child_flags(inode, 1); + out: mutex_unlock(&ih->mutex); mutex_unlock(&inode->inotify_mutex);
Re: Should PAGE_CACHE_SIZE be discarded?
On Thu, Nov 15, 2007 at 02:46:46PM +, David Howells wrote: Benny Halevy [EMAIL PROTECTED] wrote: I think that what Nick was trying to say is that PAGE_CACHE_SIZE should always be used properly as the size of the memory struct Page covers (while PAGE_SIZE is the hardware page size and the constraint is that PAGE_CACHE_SIZE == (PAGE_SIZE << k) for some k >= 0). If everybody does that then None of the filesystems should really care at all. That said, it doesn't seem like the current usage in fs/ and drivers/ is consistent with this convention. Indeed. One thing you have to consider is kmap(). I would expect it to present an area of PAGE_SIZE for access. However, if the filesystem gets an area of PAGE_CACHE_SIZE to fill, then I would have to do multiple kmap() calls in the process of filling that 'pagecache page' in AFS. Furthermore, if a page struct covers a PAGE_CACHE_SIZE chunk of memory, then I suspect the page allocator is also wrong, as I believe it deals with PAGE_SIZE chunks of memory, assuming a struct page for each. That's because you're thinking about all these difficulties outside the filesystem. Really, pagecache is in PAGE_CACHE_SIZE. Nobody said pagecache > PAGE_SIZE will work by changing a single define, but filesystems are about the least problematic here.
Re: [rfc][patch 3/5] afs: new aops
On Thu, Nov 15, 2007 at 12:15:41PM +, David Howells wrote: Nick Piggin [EMAIL PROTECTED] wrote: So you're saying a struct page controls an area of PAGE_CACHE_SIZE, not an area of PAGE_SIZE? No, a pagecache page is PAGE_CACHE_SIZE. That doesn't answer my question. I didn't ask about 'pagecache pages' per se. Are you saying then that a page struct always represents an area of PAGE_SIZE to, say, the page allocator and PAGE_CACHE_SIZE to a filesystem's address operations? Yes. How about I state it this way: Please define what the coverage of a (non-compound) struct page is, and how this relates to PAGE_SIZE and PAGE_CACHE_SIZE. If it's well-defined then this cannot be hard, right? No it's easy. It's PAGE_SIZE (which also happens to be PAGE_CACHE_SIZE). An implementation that would (not trivially, but without changing the basic concepts) allow a larger pagecache size is with compound pages. And actually hugetlbfs already does this. And not all struct pages control the same amount of data anyway, with compound pages. Compound pages are irrelevant to my question. A compound page is actually regulated by a series of page structs, each of which represents a 'page' of real memory. And it can also represent a PAGE_CACHE_SIZE page of pagecache. Do you say, then, that all, say, readpage() and readpages() methods must handle a compound page if that is given to them? I'm not talking about a specific implementation that allows PAGE_CACHE_SIZE > PAGE_SIZE. So no, I don't say anything about that. I say that pagecache pages are PAGE_CACHE_SIZE! Yes it is easy now because that is the same as PAGE_SIZE. Yes it will be harder if you wanted to change that. What you would not have to change is the assumption that pagecache pages are in PAGE_CACHE_SIZE units.
Re: [rfc][patch 3/5] afs: new aops
On Wed, Nov 14, 2007 at 12:18:43PM +, David Howells wrote: Nick Piggin [EMAIL PROTECTED] wrote: The problem is that the code called assumes that the struct page * argument points to a single page, not an array of pages as would presumably be the case if PAGE_CACHE_SIZE > PAGE_SIZE. Incorrect. Christoph's patch for example does this by using compound pages. Now I personally don't like the patch or see the point in PAGE_CACHE_SIZE / PAGE_SIZE distinction, but I'm just telling you what the convention is. There is no point you arguing against it, that's simply how it is. No! You are wrong. I wrote the AFS code. I *know* it can only deal with No I'm talking about core code. In core code, the PAGE_CACHE_SIZE is for page cache struct pages. Single struct pages (not page arrays). Take a look at generic mapping read or something. There is nothing to deal with page arrays there either, but that's simply the convention. So: you may not change the assertion unless you also fix the lower functions. I won't change the assertion, because you haven't been following said convention, so just changing it in one place is stupider than not changing it at all, but not for the reason cited. The convention is not precisely clear. Just grep for PAGE_CACHE_SIZE in Documentation/. It's only mentioned twice, and in neither case does it give any information about what PAGE_CACHE_SIZE is, what it represents, or where it applies. Therefore it's an ill-defined concept. If you look in Documentation/filesystems/vfs.txt, you'll see that it almost always talks about 'pages'. It only mentions 'pagecache pages' once - in the description of write_begin(), but it's not clear whether that means anything. Documentation is the opposite of convention ;) Look in mm/. However, I've now noted that I need to fix my code, so just keep the assertion for now and I'll fix my code to handle multipage blocks. I'm not saying you need to do that. Leave it at PAGE_SIZE, really it doesn't matter that much at present.
This has just blown out of proportion.
Re: Should PAGE_CACHE_SIZE be discarded?
On Wed, Nov 14, 2007 at 01:56:53PM +, David Howells wrote: Are we ever going to have PAGE_CACHE_SIZE != PAGE_SIZE? If not, why not discard PAGE_CACHE_SIZE as it's then redundant. Christoph Lameter has patches exactly to make PAGE_CACHE_SIZE larger than PAGE_SIZE, and they seem to work without much effort. I happen to hate the patches ;) but that doesn't change the fact that PAGE_CACHE_SIZE is relatively useful and it is not at all an ill-defined concept. PAGE_CACHE_SIZE is also an ill-defined concept. Just grep in Documentation/ and it only comes up in a couple of places, neither particularly informative. If there's a possibility that it will be used, then someone who knows how it's supposed to work needs to sit down and document what it is, what it represents, where it applies, how it interacts with PG_compound and how the page flags distribute over a page cache slot. No, this was an *example*. It has nothing to do with PG_compound upstream. That was just an example. Basically, anything that goes in the page cache is in units of PAGE_CACHE_SIZE, and nothing else. For filesystems it should be pretty easy... One further thing to consider: actually making PAGE_CACHE_SIZE > PAGE_SIZE work will be an interesting problem, as I'm certain most filesystems will break horribly without a lot of work (ramfs might be only exception). I think most filesystems actually don't have much problem with it. mm code has bigger problems, eg. due to ptes <-> pagecache no longer being equal size but why would filesystems care?
Re: [rfc][patch 3/5] afs: new aops
On Wed, Nov 14, 2007 at 03:57:46PM +, David Howells wrote: Nick Piggin [EMAIL PROTECTED] wrote: In core code, the PAGE_CACHE_SIZE is for page cache struct pages. Single struct pages (not page arrays). Take a look at generic mapping read or something. So you're saying a struct page controls an area of PAGE_CACHE_SIZE, not an area of PAGE_SIZE? No, a pagecache page is PAGE_CACHE_SIZE. And not all struct pages control the same amount of data anyway, with compound pages. Is a struct page then a purely pagecache concept? Documentation is the opposite of convention ;) If it's not Documented, then it's irrelevant. But you can't just decide yourself that it is irrelevant because you don't grep hard enough ;) include/linux/mm.h: * A page may belong to an inode's memory mapping. In this case, page->mapping * is the pointer to the inode, and page->index is the file offset of the page, * in units of PAGE_CACHE_SIZE. include/linux/mm_types.h unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE units, *not* PAGE_CACHE_SIZE */ Looks like documentation to me.
Re: Should PAGE_CACHE_SIZE be discarded?
On Wed, Nov 14, 2007 at 03:59:39PM +, David Howells wrote: Nick Piggin [EMAIL PROTECTED] wrote: Christoph Lameter has patches exactly to make PAGE_CACHE_SIZE larger than PAGE_SIZE, and they seem to work without much effort. I happen to hate the patches ;) but that doesn't change the fact that PAGE_CACHE_SIZE is relatively useful and it is not at all an ill-defined concept. Where, please? mm kernels? Floating around. I'm not saying it will even get upstream. It's just an example. Basically, anything that goes in the page cache is in units of PAGE_CACHE_SIZE, and nothing else. For filesystems it should be pretty easy... That depends on what the coverage of struct page is. I don't actually know whether this is PAGE_SIZE or PAGE_CACHE_SIZE; I assumed it to be the former, but from what you've said, I'm not actually sure. It can be pretty well any power of 2 from PAGE_SIZE upwards, with compound pages. None of the filesystems should really care at all. It's not even a new concept, hugetlbfs uses HPAGE_SIZE...
Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock
On Wed, Nov 14, 2007 at 09:01:39PM +0100, Peter Zijlstra wrote: Normal locking order is: i_mutex mmap_sem However NFS's -mmap hook, which is called under mmap_sem, can take i_mutex. Avoid this potential deadlock by doing the work that requires i_mutex from the new -mmap_prepare(). [ Is this sufficient, or does it introduce a race? ] Seems like an OK patchset in my opinion. I don't know much about NFS unfortunately, but I wonder what prevents the condition fixed by nfs_revalidate_mapping from happening again while the mmap is active...? Signed-off-by: Peter Zijlstra [EMAIL PROTECTED] --- fs/nfs/file.c | 25 + 1 file changed, 17 insertions(+), 8 deletions(-) Index: linux-2.6/fs/nfs/file.c === --- linux-2.6.orig/fs/nfs/file.c +++ linux-2.6/fs/nfs/file.c @@ -41,6 +41,9 @@ static int nfs_file_open(struct inode *, struct file *); static int nfs_file_release(struct inode *, struct file *); static loff_t nfs_file_llseek(struct file *file, loff_t offset, int origin); +static int +nfs_file_mmap_prepare(struct file * file, unsigned long len, + unsigned long prot, unsigned long flags, unsigned long pgoff); static int nfs_file_mmap(struct file *, struct vm_area_struct *); static ssize_t nfs_file_splice_read(struct file *filp, loff_t *ppos, struct pipe_inode_info *pipe, @@ -64,6 +67,7 @@ const struct file_operations nfs_file_op .write = do_sync_write, .aio_read = nfs_file_read, .aio_write = nfs_file_write, + .mmap_prepare = nfs_file_mmap_prepare, .mmap = nfs_file_mmap, .open = nfs_file_open, .flush = nfs_file_flush, @@ -270,7 +274,8 @@ nfs_file_splice_read(struct file *filp, } static int -nfs_file_mmap(struct file * file, struct vm_area_struct * vma) +nfs_file_mmap_prepare(struct file * file, unsigned long len, + unsigned long prot, unsigned long flags, unsigned long pgoff) { struct dentry *dentry = file-f_path.dentry; struct inode *inode = dentry-d_inode; @@ -279,13 +284,17 @@ nfs_file_mmap(struct file * file, struct dfprintk(VFS, nfs: mmap(%s/%s)\n, dentry-d_parent-d_name.name, 
dentry-d_name.name); - status = nfs_revalidate_mapping(inode, file-f_mapping); - if (!status) { - vma-vm_ops = nfs_file_vm_ops; - vma-vm_flags |= VM_CAN_NONLINEAR; - file_accessed(file); - } - return status; + return nfs_revalidate_mapping(inode, file-f_mapping); +} + +static int +nfs_file_mmap(struct file *file, struct vm_area_struct *vma) +{ + vma-vm_ops = nfs_file_vm_ops; + vma-vm_flags |= VM_CAN_NONLINEAR; + file_accessed(file); + + return 0; } /* -- - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock
On Wed, Nov 14, 2007 at 05:18:50PM -0500, Trond Myklebust wrote: On Wed, 2007-11-14 at 22:50 +0100, Peter Zijlstra wrote: Right, but I guess what Nick asked is, if pages could be stale to start with, how is that avoided in the future. The way I understand it, this re-validate is just a best effort at getting a coherent image. The normal convention for NFS is to use a close-to-open cache consistency model. In that model, applications must agree never to open the file for reading or writing if an application on a different NFS client already holds it open for writing. However there is no standard locking model for _enforcing_ such an agreement, so some setups do violate it. One obvious model that we try to support is that where the applications are using POSIX locking in order to ensure exclusive access to the data when required. Another model is to rely rather on synchronous writes and heavy attribute revalidation to detect when a competing application has written to the file (the 'noac' mount option). While such a model is obviously deficient in that it can never guarantee cache coherency, we do attempt to ensure that it works on a per-operation basis (IOW: we check cache coherency before each call to read(), to mmap(), etc) since it is by far the easiest model to apply if you have applications that cannot be rewritten and that satisfy the requirement that they rarely conflict. mmap()s can be different from read in that the syscall may have little relation to when the data gets used. But I guess it's still a best effort thing. Fair enough. Thanks, Nick
Re: [rfc][patch 3/5] afs: new aops
On Tue, Nov 13, 2007 at 10:56:25AM +, David Howells wrote: Nick Piggin [EMAIL PROTECTED] wrote: It takes a pagecache page, yes. If you follow convention, you use PAGE_CACHE_SIZE for that guy. You don't have to allow PAGE_CACHE_SIZE != PAGE_SIZE, and if all the rest of your code is in units of PAGE_SIZE, then obviously my changing of just the one unit is even more confusing than the current arrangement ;) The problem is that the code called assumes that the struct page * argument points to a single page, not an array of pages as would presumably be the case if PAGE_CACHE_SIZE > PAGE_SIZE. Incorrect. Christoph's patch for example does this by using compound pages. Now I personally don't like the patch or see the point in PAGE_CACHE_SIZE / PAGE_SIZE distinction, but I'm just telling you what the convention is. There is no point you arguing against it, that's simply how it is. If I should allow for an array of pages then the lower functions (specifically afs_deliver_fs_fetch_data()) need to change, and until that time occurs, the assertion *must* remain as it is now. It defends the lower functions against being asked to do something they weren't designed to do. So: you may not change the assertion unless you also fix the lower functions. I won't change the assertion, because you haven't been following said convention, so just changing it in one place is stupider than not changing it at all, but not for the reason cited.
Re: [rfc][patch 3/5] afs: new aops
On Mon, Nov 12, 2007 at 03:29:14PM +, David Howells wrote: Nick Piggin [EMAIL PROTECTED] wrote: - ASSERTCMP(start + len, <=, PAGE_SIZE); + ASSERTCMP(len, <=, PAGE_CACHE_SIZE); Do you guarantee this will work if PAGE_CACHE_SIZE != PAGE_SIZE? If not, you can't make this particular change. PAGE_CACHE_SIZE should be used to address the pagecache. Do we ever intend to have PAGE_CACHE_SIZE != PAGE_SIZE? If not, then surely the former is redundant and should be scrapped to avoid confusion? Maybe. + i_size = i_size_read(&vnode->vfs_inode); + if (pos + len > i_size) + eof = i_size; + else + eof = PAGE_CACHE_SIZE; + + ret = afs_vnode_fetch_data(vnode, key, 0, eof, page); That can't be right, surely. Either 'eof' is the size of the file or it's the length of the data to be read. It can't be both. The first case needs eof masking off. Also, 'eof' isn't a good choice of name. 'len' would be better were it not already taken:-/ Yeah, just missed the mask. I notice you removed the stuff that clears holes in the page to be written. Is this now done by the caller? It is supposed to bring the page uptodate first. So, no need to clear AFAIKS? I notice also that you use afs_fill_page() in place of afs_prepare_page() to prepare a page. You can't do this if the region to be filled currently lies beyond the server's idea of EOF. I'll try and get a look at fixing this patch tomorrow. No rush, it won't get into 2.6.24 obviously. But that would be nice, thanks.
Re: [rfc][patch 3/5] afs: new aops
On Tue, Nov 13, 2007 at 12:30:05AM +, David Howells wrote: Nick Piggin [EMAIL PROTECTED] wrote:

PAGE_CACHE_SIZE should be used to address the pagecache.

Perhaps, but the function being called from there takes pages, not page cache slots. If I have to allow for PAGE_CACHE_SIZE != PAGE_SIZE then I need to modify my code; if not, then the assertion needs to remain what it is.

It takes a pagecache page, yes. If you follow convention, you use PAGE_CACHE_SIZE for that guy. You don't have to allow PAGE_CACHE_SIZE != PAGE_SIZE, and if all the rest of your code is in units of PAGE_SIZE, then obviously my changing of just the one unit is even more confusing than the current arrangement ;)

I notice you removed the stuff that clears holes in the page to be written. Is this now done by the caller?

It is supposed to bring the page uptodate first. So, no need to clear AFAIKS?

Hmmm... I suppose. However, it is wasteful in the common case, as it is then bringing the page up to date by filling/clearing the whole of it and not just the bits that are not going to be written.

Yes, that's where you come in. You are free (and encouraged) to optimise this. Let's see, for a network filesystem this is what you could do:

- if the page is not uptodate at write_begin time, do not bring it uptodate (at least, not the region that is going to be written to);

- if the page is not uptodate at write_end time, but the copy was fully completed, just mark it uptodate (provided you brought the regions outside the copy uptodate);

- if the page is not uptodate and you have a short copy, simply do not mark the page uptodate or dirty, and return 0 from write_end, indicating that you have committed 0 bytes. The generic code should DTRT.

Or you could: pass back a temporary (not pagecache) page in *pagep, and copy that yourself into the _real_ pagecache page at write_end time, so you know exactly how big the copy will be (this is basically what the 2copy method does now... it is probably not as good as the first method I described, but for a high latency filesystem it may be preferable to always bringing the page uptodate).

Or: keep track of sub-page dirty / uptodate status, eg. with a lightweight buffer_head-like structure, so you can actually have partially dirty pages that are not completely uptodate.

Or... if you can think of another way...
[rfc][patches] remove -prepare_write
Hi,

These are a set of patches to convert the last few filesystems to use the new deadlock-free write aops, and remove the core code to handle the legacy write path. I don't really have setups to sufficiently test these filesystems, so I would really appreciate it if filesystem maintainers can pick these patches up, bear with my bugs, and send them upstream when they're ready. The benefit to you is that you get to use the fast and well tested code paths!

Actually, it is interesting: several of the conversions I've done (including these) take a relatively naive approach of simply prefilling the whole page if it isn't uptodate. It might be the case that some filesystems actually prefer to do something similar to the legacy double-copy path which they're being converted away from! (Then again, it would probably be even more ideal to have simple sub-page state tracking structures.)

There is still quite a lot of work left to be done. Several filesystems still use prepare_write() helpers, and when they're fixed up, all the old helpers themselves have to be removed. But this step is probably most important to getting rid of complex code.
[rfc][patch 1/5] ecryptfs new aops
Convert ecryptfs to new aops. Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- Index: linux-2.6/fs/ecryptfs/mmap.c === --- linux-2.6.orig/fs/ecryptfs/mmap.c +++ linux-2.6/fs/ecryptfs/mmap.c @@ -263,31 +263,38 @@ out: return 0; } -static int ecryptfs_prepare_write(struct file *file, struct page *page, - unsigned from, unsigned to) +static int ecryptfs_write_begin(struct file *file, + struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) { + pgoff_t index = pos PAGE_CACHE_SHIFT; + struct page *page; int rc = 0; - if (from == 0 to == PAGE_CACHE_SIZE) + page = __grab_cache_page(mapping, index); + if (!page) + return -ENOMEM; + + if (flags AOP_FLAG_UNINTERRUPTIBLE len == PAGE_CACHE_SIZE) goto out; /* If we are writing a full page, it will be up to date. */ if (!PageUptodate(page)) { - rc = ecryptfs_read_lower_page_segment(page, page-index, 0, + rc = ecryptfs_read_lower_page_segment(page, index, 0, PAGE_CACHE_SIZE, - page-mapping-host); + mapping-host); if (rc) { printk(KERN_ERR %s: Error attemping to read lower page segment; rc = [%d]\n, __FUNCTION__, rc); - ClearPageUptodate(page); goto out; } else SetPageUptodate(page); } - if (page-index != 0) { + if (index != 0) { loff_t end_of_prev_pg_pos = - (((loff_t)page-index PAGE_CACHE_SHIFT) - 1); + (((loff_t)index PAGE_CACHE_SHIFT) - 1); - if (end_of_prev_pg_pos i_size_read(page-mapping-host)) { + if (end_of_prev_pg_pos i_size_read(mapping-host)) { rc = ecryptfs_truncate(file-f_path.dentry, end_of_prev_pg_pos); if (rc) { @@ -297,7 +304,7 @@ static int ecryptfs_prepare_write(struct goto out; } } - if (end_of_prev_pg_pos + 1 i_size_read(page-mapping-host)) + if (end_of_prev_pg_pos + 1 i_size_read(mapping-host)) zero_user_page(page, 0, PAGE_CACHE_SIZE, KM_USER0); } out: @@ -391,21 +398,25 @@ int ecryptfs_write_inode_size_to_metadat } /** - * ecryptfs_commit_write + * ecryptfs_write_end * @file: The eCryptfs file object + * @mapping: The eCryptfs object + * @pos, len: 
Ignored (we rotate the page IV on each write) * @page: The eCryptfs page - * @from: Ignored (we rotate the page IV on each write) - * @to: Ignored * * This is where we encrypt the data and pass the encrypted data to * the lower filesystem. In OpenPGP-compatible mode, we operate on * entire underlying packets. */ -static int ecryptfs_commit_write(struct file *file, struct page *page, -unsigned from, unsigned to) -{ - loff_t pos; - struct inode *ecryptfs_inode = page-mapping-host; +static int ecryptfs_write_end(struct file *file, + struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata) +{ + pgoff_t index = pos PAGE_CACHE_SHIFT; + unsigned from = pos (PAGE_CACHE_SIZE - 1); + unsigned to = from + copied; + struct inode *ecryptfs_inode = mapping-host; struct ecryptfs_crypt_stat *crypt_stat = ecryptfs_inode_to_private(file-f_path.dentry-d_inode)-crypt_stat; int rc; @@ -417,25 +428,22 @@ static int ecryptfs_commit_write(struct } else ecryptfs_printk(KERN_DEBUG, Not a new file\n); ecryptfs_printk(KERN_DEBUG, Calling fill_zeros_to_end_of_page - (page w/ index = [0x%.16x], to = [%d])\n, page-index, - to); + (page w/ index = [0x%.16x], to = [%d])\n, index, to); /* Fills in zeros if 'to' goes beyond inode size */ rc = fill_zeros_to_end_of_page(page, to); if (rc) { ecryptfs_printk(KERN_WARNING, Error attempting to fill - zeros in page with index = [0x%.16x]\n, - page-index); + zeros in page with index = [0x%.16x]\n, index); goto out; } rc = ecryptfs_encrypt_page(page); if (rc
[rfc][patch 2/5] cifs: new aops
Convert cifs to new aops. This is another relatively naive conversion. Always do the read upfront when the page is not uptodate (unless we're in the writethrough path). Fix an uninitialized data exposure where SetPageUptodate was called before the page was uptodate. SetPageUptodate and switch to writeback mode in the case that the full page was dirtied. Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- Index: linux-2.6/fs/cifs/file.c === --- linux-2.6.orig/fs/cifs/file.c +++ linux-2.6/fs/cifs/file.c @@ -103,7 +103,7 @@ static inline int cifs_open_inode_helper /* want handles we can use to read with first in the list so we do not have to walk the - list to search for one in prepare_write */ + list to search for one in write_begin */ if ((file-f_flags O_ACCMODE) == O_WRONLY) { list_add_tail(pCifsFile-flist, pCifsInode-openFileList); @@ -1411,49 +1411,53 @@ static int cifs_writepage(struct page *p return rc; } -static int cifs_commit_write(struct file *file, struct page *page, - unsigned offset, unsigned to) +static int cifs_write_end(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata) { - int xid; - int rc = 0; - struct inode *inode = page-mapping-host; - loff_t position = ((loff_t)page-index PAGE_CACHE_SHIFT) + to; - char *page_data; + int rc; + struct inode *inode = mapping-host; + loff_t position; + + cFYI(1, (commit write for page %p from position %lld with %d bytes, +page, pos, copied)); + + if (!PageUptodate(page) copied == PAGE_CACHE_SIZE) + SetPageUptodate(page); - xid = GetXid(); - cFYI(1, (commit write for page %p up to position %lld for %d, -page, position, to)); - spin_lock(inode-i_lock); - if (position inode-i_size) { - i_size_write(inode, position); - } - spin_unlock(inode-i_lock); if (!PageUptodate(page)) { - position = ((loff_t)page-index PAGE_CACHE_SHIFT) + offset; - /* can not rely on (or let) writepage write this data */ - if (to offset) { - cFYI(1, (Illegal offsets, can 
not copy from %d to %d, - offset, to)); - FreeXid(xid); - return rc; - } + char *page_data; + unsigned offset = pos (PAGE_CACHE_SIZE - 1); + int xid; + + xid = GetXid(); /* this is probably better than directly calling partialpage_write since in this function the file handle is known which we might as well leverage */ /* BB check if anything else missing out of ppw such as updating last write time */ page_data = kmap(page); - rc = cifs_write(file, page_data + offset, to-offset, - position); - if (rc 0) - rc = 0; - /* else if (rc 0) should we set writebehind rc? */ + rc = cifs_write(file, page_data + offset, copied, position); + /* if (rc 0) should we set writebehind rc? */ kunmap(page); + + FreeXid(xid); } else { + rc = copied; set_page_dirty(page); } - FreeXid(xid); + if (rc 0) { + position = pos + rc; + spin_lock(inode-i_lock); + if (position inode-i_size) + i_size_write(inode, position); + spin_unlock(inode-i_lock); + } + + unlock_page(page); + page_cache_release(page); + return rc; } @@ -1999,49 +2003,45 @@ int is_size_safe_to_change(struct cifsIn return 1; } -static int cifs_prepare_write(struct file *file, struct page *page, - unsigned from, unsigned to) +static int cifs_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata) { - int rc = 0; - loff_t i_size; - loff_t offset; + pgoff_t index = pos PAGE_CACHE_SHIFT; + loff_t offset = pos (PAGE_CACHE_SIZE - 1); + struct page *page; + + cFYI(1, (write begin from %lld len %d, (long long)pos, len)); + + page = __grab_cache_page(mapping, index); + if (!page) + return -ENOMEM; - cFYI(1, (prepare write for page %p from %d to %d, page, from, to)); if (PageUptodate(page)) return 0; /* If we are writing a full page
[rfc][patch 3/5] afs: new aops
Convert afs to new aops. Cannot assume writes will fully complete, so this conversion goes the easy way and always brings the page uptodate before the write. Signed-off-by: Nick Piggin [EMAIL PROTECTED] --- Index: linux-2.6/fs/afs/file.c === --- linux-2.6.orig/fs/afs/file.c +++ linux-2.6/fs/afs/file.c @@ -50,8 +50,8 @@ const struct address_space_operations af .launder_page = afs_launder_page, .releasepage= afs_releasepage, .invalidatepage = afs_invalidatepage, - .prepare_write = afs_prepare_write, - .commit_write = afs_commit_write, + .write_begin= afs_write_begin, + .write_end = afs_write_end, .writepage = afs_writepage, .writepages = afs_writepages, }; Index: linux-2.6/fs/afs/internal.h === --- linux-2.6.orig/fs/afs/internal.h +++ linux-2.6/fs/afs/internal.h @@ -731,8 +731,12 @@ extern int afs_volume_release_fileserver */ extern int afs_set_page_dirty(struct page *); extern void afs_put_writeback(struct afs_writeback *); -extern int afs_prepare_write(struct file *, struct page *, unsigned, unsigned); -extern int afs_commit_write(struct file *, struct page *, unsigned, unsigned); +extern int afs_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata); +extern int afs_write_end(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata); extern int afs_writepage(struct page *, struct writeback_control *); extern int afs_writepages(struct address_space *, struct writeback_control *); extern int afs_write_inode(struct inode *, int); Index: linux-2.6/fs/afs/write.c === --- linux-2.6.orig/fs/afs/write.c +++ linux-2.6/fs/afs/write.c @@ -84,15 +84,23 @@ void afs_put_writeback(struct afs_writeb * partly or wholly fill a page that's under preparation for writing */ static int afs_fill_page(struct afs_vnode *vnode, struct key *key, -unsigned start, unsigned len, struct page *page) + loff_t pos, unsigned len, struct 
page *page) { + loff_t i_size; + unsigned eof; int ret; - _enter(,,%u,%u, start, len); + _enter(,,%llu,%u, (unsigned long long)pos, len); - ASSERTCMP(start + len, =, PAGE_SIZE); + ASSERTCMP(len, =, PAGE_CACHE_SIZE); - ret = afs_vnode_fetch_data(vnode, key, start, len, page); + i_size = i_size_read(vnode-vfs_inode); + if (pos + len i_size) + eof = i_size; + else + eof = PAGE_CACHE_SIZE; + + ret = afs_vnode_fetch_data(vnode, key, 0, eof, page); if (ret 0) { if (ret == -ENOENT) { _debug(got NOENT from server @@ -107,109 +115,55 @@ static int afs_fill_page(struct afs_vnod } /* - * prepare a page for being written to - */ -static int afs_prepare_page(struct afs_vnode *vnode, struct page *page, - struct key *key, unsigned offset, unsigned to) -{ - unsigned eof, tail, start, stop, len; - loff_t i_size, pos; - void *p; - int ret; - - _enter(); - - if (offset == 0 to == PAGE_SIZE) - return 0; - - p = kmap_atomic(page, KM_USER0); - - i_size = i_size_read(vnode-vfs_inode); - pos = (loff_t) page-index PAGE_SHIFT; - if (pos = i_size) { - /* partial write, page beyond EOF */ - _debug(beyond); - if (offset 0) - memset(p, 0, offset); - if (to PAGE_SIZE) - memset(p + to, 0, PAGE_SIZE - to); - kunmap_atomic(p, KM_USER0); - return 0; - } - - if (i_size - pos = PAGE_SIZE) { - /* partial write, page entirely before EOF */ - _debug(before); - tail = eof = PAGE_SIZE; - } else { - /* partial write, page overlaps EOF */ - eof = i_size - pos; - _debug(overlap %u, eof); - tail = max(eof, to); - if (tail PAGE_SIZE) - memset(p + tail, 0, PAGE_SIZE - tail); - if (offset eof) - memset(p + eof, 0, PAGE_SIZE - eof); - } - - kunmap_atomic(p, KM_USER0); - - ret = 0; - if (offset 0 || eof to) { - /* need to fill one or two bits that aren't going to be written -* (cover both fillers in one read if there are two) */ - start = (offset 0) ? 0 : to; - stop = (eof to) ? eof : offset; - len
[rfc][patch 4/5] rd: rewrite rd
This is a rewrite of the ramdisk block device driver.

The old one is really difficult because it effectively implements a block device which serves data out of its own buffer cache. It relies on the dirty bit being set to pin its backing store in cache, however there are non-trivial paths which can clear the dirty bit (eg. try_to_free_buffers()), which had recently led to data corruption. And in general it is completely wrong for a block device driver to do this.

The new one is more like a regular block device driver. It has no idea about vm/vfs stuff. Its backing store is similar to the buffer cache (a simple radix-tree of pages), but it doesn't know anything about page cache (the pages in the radix tree are not pagecache pages).

There is one slight downside -- direct block device access and filesystem metadata access goes through an extra copy and gets stored in RAM twice. However, this downside is only slight, because the real buffercache of the device is now reclaimable (because we're not playing crazy games with it), so under memory intensive situations, footprint should effectively be the same -- maybe even a slight advantage to the new driver because it can also reclaim buffer heads. The fact that it now goes through all the regular vm/fs paths makes it much more useful for testing, too.

Signed-off-by: Nick Piggin [EMAIL PROTECTED]
---
Index: linux-2.6/drivers/block/Kconfig
===
--- linux-2.6.orig/drivers/block/Kconfig
+++ linux-2.6/drivers/block/Kconfig
@@ -311,7 +311,7 @@ config BLK_DEV_UB
 	  If unsure, say N.
 
 config BLK_DEV_RAM
-	tristate "RAM disk support"
+	tristate "RAM block device support"
 	---help---
 	  Saying Y here will allow you to use a portion of your RAM memory as
 	  a block device, so that you can make file systems on it, read and
@@ -346,16 +346,6 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
-config BLK_DEV_RAM_BLOCKSIZE
-	int "Default RAM disk block size (bytes)"
-	depends on BLK_DEV_RAM
-	default 1024
-	help
-	  The default value is 1024 bytes. PAGE_SIZE is a much more
-	  efficient choice however. The default is kept to ensure initrd
-	  setups function - apparently needed by the rd_load_image routine
-	  that supposes the filesystem in the image uses a 1024 blocksize.
-
 config CDROM_PKTCDVD
 	tristate "Packet writing on CD/DVD media"
 	depends on !UML
Index: linux-2.6/drivers/block/Makefile
===
--- linux-2.6.orig/drivers/block/Makefile
+++ linux-2.6/drivers/block/Makefile
@@ -11,7 +11,7 @@ obj-$(CONFIG_AMIGA_FLOPPY)	+= amiflop.o
 obj-$(CONFIG_PS3_DISK)		+= ps3disk.o
 obj-$(CONFIG_ATARI_FLOPPY)	+= ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)	+= z2ram.o
-obj-$(CONFIG_BLK_DEV_RAM)	+= rd.o
+obj-$(CONFIG_BLK_DEV_RD)	+= brd.o
 obj-$(CONFIG_BLK_DEV_LOOP)	+= loop.o
 obj-$(CONFIG_BLK_DEV_PS2)	+= ps2esdi.o
 obj-$(CONFIG_BLK_DEV_XD)	+= xd.o
Index: linux-2.6/drivers/block/brd.c
===
--- /dev/null
+++ linux-2.6/drivers/block/brd.c
@@ -0,0 +1,477 @@
+/*
+ * Ram backed block device driver.
+ *
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ *
+ * Parts derived from drivers/block/rd.c, and drivers/block/loop.c, copyright
+ * of their respective owners.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/major.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/highmem.h>
+#include <linux/gfp.h>
+#include <linux/radix-tree.h>
+#include <linux/buffer_head.h> /* invalidate_bh_lrus() */
+
+#include <asm/uaccess.h>
+
+#define PAGE_SECTORS	(1 << (PAGE_SHIFT - 9))
+
+struct brd_device {
+	int		brd_number;
+	int		brd_refcnt;
+	loff_t		brd_offset;
+	loff_t		brd_sizelimit;
+	unsigned	brd_blocksize;
+
+	struct request_queue	*brd_queue;
+	struct gendisk		*brd_disk;
+
+	spinlock_t		brd_lock;
+	struct radix_tree_root	brd_pages;
+
+	struct list_head	brd_list;
+};
+
+static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
+{
+	unsigned long idx;
+	struct page *page;
+
+	rcu_read_lock();
+	idx = sector >> (PAGE_SHIFT - 9);
+	page = radix_tree_lookup(&brd->brd_pages, idx);
+	rcu_read_unlock();
+
+	BUG_ON(page && page->index != idx);
+
+	return page;
+}
+
+static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
+{
+	unsigned long idx;
+	struct page *page;
+
+	page = brd_lookup_page(brd, sector);
+	if (page)
+		return page;
+
+	page = alloc_page(GFP_NOIO
[rfc][patch 5/5] remove prepare_write
Index: linux-2.6/drivers/block/loop.c === --- linux-2.6.orig/drivers/block/loop.c +++ linux-2.6/drivers/block/loop.c @@ -40,8 +40,7 @@ * Heinz Mauelshagen [EMAIL PROTECTED], Feb 2002 * * Support for falling back on the write file operation when the address space - * operations prepare_write and/or commit_write are not available on the - * backing filesystem. + * operations write_begin is not available on the backing filesystem. * Anton Altaparmakov, 16 Feb 2005 * * Still To Fix: @@ -761,7 +760,7 @@ static int loop_set_fd(struct loop_devic */ if (!file-f_op-splice_read) goto out_putf; - if (aops-prepare_write || aops-write_begin) + if (aops-write_begin) lo_flags |= LO_FLAGS_USE_AOPS; if (!(lo_flags LO_FLAGS_USE_AOPS) !file-f_op-write) lo_flags |= LO_FLAGS_READ_ONLY; Index: linux-2.6/fs/fat/inode.c === --- linux-2.6.orig/fs/fat/inode.c +++ linux-2.6/fs/fat/inode.c @@ -175,7 +175,7 @@ static ssize_t fat_direct_IO(int rw, str if (rw == WRITE) { /* -* FIXME: blockdev_direct_IO() doesn't use -prepare_write(), +* FIXME: blockdev_direct_IO() doesn't use -write_begin(), * so we need to update the -mmu_private to block boundary. * * But we must fill the remaining area or hole by nul for Index: linux-2.6/fs/ocfs2/file.c === --- linux-2.6.orig/fs/ocfs2/file.c +++ linux-2.6/fs/ocfs2/file.c @@ -790,8 +790,7 @@ leave: /* Some parts of this taken from generic_cont_expand, which turned out * to be too fragile to do exactly what we need without us having to - * worry about recursive locking in -prepare_write() and - * -commit_write(). */ + * worry about recursive locking in -write_begin() and -write_end(). */ static int ocfs2_write_zero_page(struct inode *inode, u64 size) { Index: linux-2.6/fs/splice.c === --- linux-2.6.orig/fs/splice.c +++ linux-2.6/fs/splice.c @@ -734,8 +734,8 @@ ssize_t splice_from_pipe(struct pipe_ino }; /* -* The actor worker might be calling -prepare_write and -* -commit_write. 
Most of the time, these expect i_mutex to +* The actor worker might be calling -write_begin and +* -write_end. Most of the time, these expect i_mutex to * be held. Since this may result in an ABBA deadlock with * pipe-inode, we have to order lock acquiry here. */ Index: linux-2.6/include/linux/fs.h === --- linux-2.6.orig/include/linux/fs.h +++ linux-2.6/include/linux/fs.h @@ -453,13 +453,6 @@ struct address_space_operations { int (*readpages)(struct file *filp, struct address_space *mapping, struct list_head *pages, unsigned nr_pages); - /* -* ext3 requires that a successful prepare_write() call be followed -* by a commit_write() call - they must be balanced -*/ - int (*prepare_write)(struct file *, struct page *, unsigned, unsigned); - int (*commit_write)(struct file *, struct page *, unsigned, unsigned); - int (*write_begin)(struct file *, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata); Index: linux-2.6/mm/filemap.c === --- linux-2.6.orig/mm/filemap.c +++ linux-2.6/mm/filemap.c @@ -1863,48 +1863,8 @@ int pagecache_write_begin(struct file *f { const struct address_space_operations *aops = mapping-a_ops; - if (aops-write_begin) { - return aops-write_begin(file, mapping, pos, len, flags, + return aops-write_begin(file, mapping, pos, len, flags, pagep, fsdata); - } else { - int ret; - pgoff_t index = pos PAGE_CACHE_SHIFT; - unsigned offset = pos (PAGE_CACHE_SIZE - 1); - struct inode *inode = mapping-host; - struct page *page; -again: - page = __grab_cache_page(mapping, index); - *pagep = page; - if (!page) - return -ENOMEM; - - if (flags AOP_FLAG_UNINTERRUPTIBLE !PageUptodate(page)) { - /* -* There is no way to resolve a short write situation -* for a !Uptodate page (except by double copying in -* the caller done by
Re: [patch] fs: restore nobh
On Thu, Oct 25, 2007 at 09:07:36PM +0200, Jan Kara wrote:

Hi, This is overdue, sorry. Got a little complicated, and I've been away from my filesystem test setup so I didn't want to send it (lucky, coz I found a bug after more substantial testing). Anyway, RFC?

Hmm, maybe one comment/question:

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c	2007-10-08 14:09:35.0 +1000
+++ linux-2.6/fs/buffer.c	2007-10-08 16:45:28.0 +1000
...
-/*
- * Make sure any changes to nobh_commit_write() are reflected in
- * nobh_truncate_page(), since it doesn't call commit_write().
- */
-int nobh_commit_write(struct file *file, struct page *page,
-		unsigned from, unsigned to)
+int nobh_write_end(struct file *file, struct address_space *mapping,
+		loff_t pos, unsigned len, unsigned copied,
+		struct page *page, void *fsdata)
 {
 	struct inode *inode = page->mapping->host;
-	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	struct buffer_head *head = NULL;
+	struct buffer_head *bh;
 
-	if (page_has_buffers(page))
-		return generic_commit_write(file, page, from, to);
+	if (!PageMappedToDisk(page)) {
+		if (unlikely(copied < len) && !page_has_buffers(page))
+			attach_nobh_buffers(page, head);
+		if (page_has_buffers(page))
+			return generic_write_end(file, mapping, pos, len, copied, page, fsdata);
+	}

So we can have a page with attached buffers but we decide not to call generic_write_end()...

 	SetPageUptodate(page);
 	set_page_dirty(page);

And we just set the page dirty but we leave buffers clean. So cannot a subsequent try_to_free_buffers() come, mark the page as clean (as buffers are clean) and discard the data?

Hmm, we might just be OK here because set_page_dirty should dirty the buffers. However, I think you have spotted a mistake here, and it would be better if the generic_write_end test was outside the PageMappedToDisk check. Thanks!
Re: [PATCH 00/31] Remove iget() and read_inode() [try #4]
On Friday 12 October 2007 19:07, David Howells wrote: Hi Linus, Here's a set of patches that remove all calls to iget() and all read_inode() functions. They should be removed for two reasons: firstly they don't lend themselves to good error handling, and secondly their presence is a temptation for code outside a filesystem to call iget() to access inodes within that filesystem. Is there any particular reason not to push these through -mm?
[patch] fs: restore nobh
Hi, This is overdue, sorry. Got a little complicated, and I've been away from my filesystem test setup so I didn't want ot send it (lucky, coz I found a bug after more substantial testing). Anyway, RFC? --- Implement nobh in new aops. This is a bit tricky. FWIW, nobh_truncate is now implemented in a way that does not create blocks in sparse regions, which is a silly thing for it to have been doing (isn't it?) ext2 survives fsx and fsstress. jfs is converted as well... ext3 should be easy to do (but not done yet). Index: linux-2.6/fs/buffer.c === --- linux-2.6.orig/fs/buffer.c 2007-10-08 14:09:35.0 +1000 +++ linux-2.6/fs/buffer.c 2007-10-08 16:45:28.0 +1000 @@ -2378,7 +2378,7 @@ } /* - * nobh_prepare_write()'s prereads are special: the buffer_heads are freed + * nobh_write_begin()'s prereads are special: the buffer_heads are freed * immediately, while under the page lock. So it needs a special end_io * handler which does not touch the bh after unlocking it. */ @@ -2388,16 +2388,45 @@ } /* + * Attach the singly-linked list of buffers created by nobh_write_begin, to + * the page (converting it to circular linked list and taking care of page + * dirty races). + */ +static void attach_nobh_buffers(struct page *page, struct buffer_head *head) +{ + struct buffer_head *bh; + + BUG_ON(!PageLocked(page)); + + spin_lock(page-mapping-private_lock); + bh = head; + do { + if (PageDirty(page)) + set_buffer_dirty(bh); + if (!bh-b_this_page) + bh-b_this_page = head; + bh = bh-b_this_page; + } while (bh != head); + attach_page_buffers(page, head); + spin_unlock(page-mapping-private_lock); +} + +/* * On entry, the page is fully not uptodate. 
* On exit the page is fully uptodate in the areas outside (from,to) */ -int nobh_prepare_write(struct page *page, unsigned from, unsigned to, +int nobh_write_begin(struct file *file, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata, get_block_t *get_block) { - struct inode *inode = page-mapping-host; + struct inode *inode = mapping-host; const unsigned blkbits = inode-i_blkbits; const unsigned blocksize = 1 blkbits; struct buffer_head *head, *bh; + struct page *page; + pgoff_t index; + unsigned from, to; unsigned block_in_page; unsigned block_start, block_end; sector_t block_in_file; @@ -2406,8 +2435,22 @@ int ret = 0; int is_mapped_to_disk = 1; - if (page_has_buffers(page)) - return block_prepare_write(page, from, to, get_block); + index = pos PAGE_CACHE_SHIFT; + from = pos (PAGE_CACHE_SIZE - 1); + to = from + len; + + page = __grab_cache_page(mapping, index); + if (!page) + return -ENOMEM; + *pagep = page; + *fsdata = NULL; + + if (page_has_buffers(page)) { + unlock_page(page); + page_cache_release(page); + *pagep = NULL; + return block_write_begin(file, mapping, pos, len, flags, pagep, fsdata, get_block); + } if (PageMappedToDisk(page)) return 0; @@ -2422,8 +2465,10 @@ * than the circular one we're used to. */ head = alloc_page_buffers(page, blocksize, 0); - if (!head) - return -ENOMEM; + if (!head) { + ret = -ENOMEM; + goto out_release; + } block_in_file = (sector_t)page-index (PAGE_CACHE_SHIFT - blkbits); @@ -2492,15 +2537,12 @@ if (is_mapped_to_disk) SetPageMappedToDisk(page); - do { - bh = head; - head = head-b_this_page; - free_buffer_head(bh); - } while (head); + *fsdata = head; /* to be released by nobh_write_end */ return 0; failed: + BUG_ON(!ret); /* * Error recovery is a bit difficult. We need to zero out blocks that * were newly allocated, and dirty them to ensure they get written out. 
@@ -2508,64 +2550,56 @@ * the handling of potential IO errors during writeout would be hard * (could try doing synchronous writeout, but what if that fails too?) */ - spin_lock(page-mapping-private_lock); - bh = head; - block_start = 0; - do { - if (PageUptodate(page)) - set_buffer_uptodate(bh); - if (PageDirty(page)) - set_buffer_dirty(bh); + attach_nobh_buffers(page, head); + page_zero_new_buffers(page, from, to); - block_end = block_start+blocksize; - if (block_end
Re: [PATCH]fix VM_CAN_NONLINEAR check in sys_remap_file_pages
On Monday 08 October 2007 23:37, Hugh Dickins wrote:

On Mon, 8 Oct 2007, Yan Zheng wrote: The test for VM_CAN_NONLINEAR always fails.

Good catch indeed. Though I was puzzled how we do nonlinear at all, until I realized it's the test for not-VM_CAN_NONLINEAR that always fails. It's not as serious as it appears, since code further down has been added more recently to simulate nonlinear on non-RAM-backed filesystems, instead of going the real nonlinear way; so most filesystems are now not required to do what VM_CAN_NONLINEAR was put in to ensure they could do.

Well, I think all filesystems can do VM_CAN_NONLINEAR anyway. Device drivers and weird things tend to have trouble...

I'm confused as to where that leaves us: is this actually a fix that needs to go into 2.6.23? Or will it suddenly disable a system call which has been silently working fine on various filesystems which did not add VM_CAN_NONLINEAR? Could we just rip out VM_CAN_NONLINEAR?

We probably should keep VM_CAN_NONLINEAR for the moment, I think. But now that we have the fallback path, we _could_ use that instead of failing. I doubt anybody will be using nonlinear mappings on anything but regular files for the time being, but as a trivial fix, I think this probably should go into 2.6.23. Thanks for spotting this problem.

Acked-by: Nick Piggin [EMAIL PROTECTED]

I hope Nick or Miklos is clearer on what the risks are. (Apologies for all the nots and nons here, I'm embarrassed after just criticizing Ingo's SCHED_NO_NO_OMIT_FRAME_POINTER!)

Hugh

Signed-off-by: Yan Zheng [EMAIL PROTECTED]

diff -ur linux-2.6.23-rc9/mm/fremap.c linux/mm/fremap.c
--- linux-2.6.23-rc9/mm/fremap.c	2007-10-07 15:03:33.0 +0800
+++ linux/mm/fremap.c	2007-10-08 19:33:44.0 +0800
@@ -160,7 +160,7 @@
 	if (vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR))
 		goto out;
 
-	if (!vma->vm_flags & VM_CAN_NONLINEAR)
+	if (!(vma->vm_flags & VM_CAN_NONLINEAR))
 		goto out;
 
 	if (end <= start || start < vma->vm_start || end > vma->vm_end)
Re: [PATCH]fix VM_CAN_NONLINEAR check in sys_remap_file_pages
On Tuesday 09 October 2007 03:04, Andrew Morton wrote:

On Mon, 8 Oct 2007 19:45:08 +0800 Yan Zheng [EMAIL PROTECTED] wrote: Hi all. The test for VM_CAN_NONLINEAR always fails.

Signed-off-by: Yan Zheng [EMAIL PROTECTED]

diff -ur linux-2.6.23-rc9/mm/fremap.c linux/mm/fremap.c
--- linux-2.6.23-rc9/mm/fremap.c	2007-10-07 15:03:33.0 +0800
+++ linux/mm/fremap.c	2007-10-08 19:33:44.0 +0800
@@ -160,7 +160,7 @@
 	if (vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR))
 		goto out;
 
-	if (!vma->vm_flags & VM_CAN_NONLINEAR)
+	if (!(vma->vm_flags & VM_CAN_NONLINEAR))
 		goto out;
 
 	if (end <= start || start < vma->vm_start || end > vma->vm_end)

Lovely. From this we can deduce that nobody has run remap_file_pages() since 2.6.23-rc1 and that nobody (including the developer who made that change) ran it while that change was in -mm.

But you'd be wrong. remap_file_pages was tested both with my own tester and Ingo's test program. vm_flags != 0, so !vm_flags == 0, and 0 & x == 0, so the test always falls through.

Of course, what I _should_ have done is also test a driver which does not have VM_CAN_NONLINEAR... but even I wouldn't rewrite half the nonlinear mapping code without once testing it ;)

FWIW, Oracle (maybe the sole real user of this) has been testing it, which I'm very happy about (rather than testing after 2.6.23 is released).
Re: [PATCH]fix VM_CAN_NONLINEAR check in sys_remap_file_pages
On Tuesday 09 October 2007 03:51, Andrew Morton wrote: On Mon, 8 Oct 2007 10:28:43 -0700 I'll now add remap_file_pages soon. Maybe those other 2 tests aren't strong enough (?). Or maybe they don't return a non-0 exit status even when they fail... (I'll check.) Perhaps Yan Zheng can tell us what test was used to demonstrate this? Was probably found by review. Otherwise, you could probably reproduce it by mmapping, say, a drm device node, running remap_file_pages() on it to create a nonlinear mapping, and then finding that you get the wrong data. I'm surprised that LTP doesn't have any remap_file_pages() tests. A quick grep didn't find any for me. Me either. There are a few lying around the place which could be integrated. It would be good if LTP were to have some remap_file_pages() tests (please). As we see here, it is something which we can easily break, and leave broken for some time. Here is Ingo's old test, since cleaned up and fixed a bit by me. I'm sure he would distribute it GPL, but I've cc'ed him because I didn't find an explicit statement about that.
/*
 * Copyright (C) Ingo Molnar, 2002
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <fcntl.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/times.h>
#include <sys/wait.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/unistd.h>

#define PAGE_SIZE 4096
#define PAGE_WORDS (PAGE_SIZE/sizeof(int))

#define CACHE_PAGES 1024
#define CACHE_SIZE (CACHE_PAGES*PAGE_SIZE)

#define WINDOW_PAGES 16
#define WINDOW_SIZE (WINDOW_PAGES*PAGE_SIZE)
#define WINDOW_START 0x48000000

static char cache_contents [CACHE_SIZE];

static void test_nonlinear(int fd)
{
	char *data = NULL;
	int i, j, repeat = 2;

	for (i = 0; i < CACHE_PAGES; i++) {
		int *page = (int *) (cache_contents + i*PAGE_SIZE);

		for (j = 0; j < PAGE_WORDS; j++)
			page[j] = i;
	}
	if (write(fd, cache_contents, CACHE_SIZE) != CACHE_SIZE)
		perror("write"), exit(1);

	data = mmap((void *)WINDOW_START, WINDOW_SIZE,
		    PROT_READ|PROT_WRITE,
		    MAP_FIXED | MAP_SHARED, fd, 0);
	if (data == MAP_FAILED)
		perror("mmap"), exit(1);

again:
	for (i = 0; i < WINDOW_PAGES; i += 2) {
		char *page = data + i*PAGE_SIZE;

		if (remap_file_pages(page, PAGE_SIZE * 2, 0,
				     (WINDOW_PAGES-i-2), 0) == -1)
			perror("remap_file_pages"), exit(1);
	}

	for (i = 0; i < WINDOW_PAGES; i++) {
		/*
		 * Double-check the correctness of the mapping:
		 */
		if (i & 1) {
			if (data[i*PAGE_SIZE] != WINDOW_PAGES-i) {
				printf("hm, mapped incorrect data!\n");
				exit(1);
			}
		} else {
			if (data[i*PAGE_SIZE] != WINDOW_PAGES-i-2) {
				printf("hm, mapped incorrect data!\n");
				exit(1);
			}
		}
	}
	if (--repeat)
		goto again;
}

int main(int argc, char **argv)
{
	int fd;

	fd = open("/dev/shm/cache", O_RDWR|O_CREAT|O_TRUNC, S_IRWXU);
	if (fd < 0)
		perror("open"), exit(1);
	test_nonlinear(fd);
	if (close(fd) == -1)
		perror("close"), exit(1);
	printf("nonlinear shm file OK\n");

	fd = open("/tmp/cache", O_RDWR|O_CREAT|O_TRUNC, S_IRWXU);
	if (fd < 0)
		perror("open"), exit(1);
	test_nonlinear(fd);
	if (close(fd) == -1)
		perror("close"), exit(1);
	printf("nonlinear /tmp/ file OK\n");

	exit(0);
}
Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
On Tuesday 02 October 2007 07:01, Christoph Lameter wrote: On Sat, 29 Sep 2007, Peter Zijlstra wrote: On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote: Really? That means we can no longer even allocate stacks for forking. I think I'm running with 4k stacks... 4k stacks will never fly on an SGI x86_64 NUMA configuration given the additional data that may be kept on the stack. We are currently considering going from 8k to 16k (or even 32k) to make things work. So having the ability to put the stacks in vmalloc space may be something to look at. i386 and x86-64 have used 8K stacks for years and they have never really been much of a problem before. They only started failing when contiguous memory gets used up by other things, _even with_ those anti-frag patches in there. Bottom line is that you do not use higher order allocations when you do not need them.
Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
On Sunday 30 September 2007 05:20, Andrew Morton wrote: On Sat, 29 Sep 2007 06:19:33 +1000 Nick Piggin [EMAIL PROTECTED] wrote: On Saturday 29 September 2007 19:27, Andrew Morton wrote: On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: oom-killings, or page allocation failures? The latter, one hopes. Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007 ... mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0 Call Trace: 611b3878: [6002dd28] printk_ratelimit+0x15/0x17 611b3888: [60052ed4] out_of_memory+0x80/0x100 611b38c8: [60054b0c] __alloc_pages+0x1ed/0x280 611b3948: [6006c608] allocate_slab+0x5b/0xb0 611b3968: [6006c705] new_slab+0x7e/0x183 611b39a8: [6006cbae] __slab_alloc+0xc9/0x14b 611b39b0: [6011f89f] radix_tree_preload+0x70/0xbf 611b39b8: [600980f2] do_mpage_readpage+0x3b3/0x472 611b39e0: [6011f89f] radix_tree_preload+0x70/0xbf 611b39f8: [6006cc81] kmem_cache_alloc+0x51/0x98 611b3a38: [6011f89f] radix_tree_preload+0x70/0xbf 611b3a58: [6004f8e2] add_to_page_cache+0x22/0xf7 611b3a98: [6004f9c6] add_to_page_cache_lru+0xf/0x24 611b3ab8: [6009821e] mpage_readpages+0x6d/0x109 611b3ac0: [600d59f0] ext3_get_block+0x0/0xf2 611b3b08: [6005483d] get_page_from_freelist+0x8d/0xc1 611b3b88: [600d6937] ext3_readpages+0x18/0x1a 611b3b98: [60056f00] read_pages+0x37/0x9b 611b3bd8: [60057064] __do_page_cache_readahead+0x100/0x157 611b3c48: [60057196] do_page_cache_readahead+0x52/0x5f 611b3c78: [60050ab4] filemap_fault+0x145/0x278 611b3ca8: [60022b61] run_syscall_stub+0xd1/0xdd 611b3ce8: [6005eae3] __do_fault+0x7e/0x3ca 611b3d68: [6005ee60] do_linear_fault+0x31/0x33 611b3d88: [6005f149] handle_mm_fault+0x14e/0x246 611b3da8: [60120a7b] __up_read+0x73/0x7b 611b3de8: [60013177] handle_page_fault+0x11f/0x23b 611b3e48: [60013419] segv+0xac/0x297 611b3f28: [60013367] segv_handler+0x68/0x6e 611b3f48: [600232ad] get_skas_faultinfo+0x9c/0xa1 611b3f68: [60023853] 
userspace+0x13a/0x19d 611b3fc8: [60010d58] fork_handler+0x86/0x8d OK, that's different. Someone broke the vm - order-2 GFP_KERNEL allocations aren't supposed to fail. I'm suspecting that did_some_progress thing. The allocation didn't fail -- it invoked the OOM killer because the kernel ran out of unfragmented memory. We can't run out of unfragmented memory for an order-2 GFP_KERNEL allocation in this workload. We go and synchronously free stuff up to make it work. How did this get broken? Either no more order-2 pages could be freed, or the ones that were being freed were being used by something else (eg. other order-2 slab allocations). Probably because higher order allocations are the new vogue in -mm at the moment ;) That's a different bug. bug 1: We shouldn't be doing higher-order allocations in slub because of the considerable damage this does to atomic allocations. bug 2: order-2 GFP_KERNEL allocations shouldn't fail like this. I think one causes 2 as well -- it isn't just considerable damage to atomic allocations but to GFP_KERNEL allocations too. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
On Monday 01 October 2007 06:12, Andrew Morton wrote: On Sun, 30 Sep 2007 05:09:28 +1000 Nick Piggin [EMAIL PROTECTED] wrote: On Sunday 30 September 2007 05:20, Andrew Morton wrote: We can't run out of unfragmented memory for an order-2 GFP_KERNEL allocation in this workload. We go and synchronously free stuff up to make it work. How did this get broken? Either no more order-2 pages could be freed, or the ones that were being freed were being used by something else (eg. other order-2 slab allocations). No. The current design of reclaim (for better or for worse) is that for order 0,1,2 and 3 allocations we just keep on trying until it works. That got broken and I think it got broken at a design level when that did_some_progress logic went in. Perhaps something else we did later worsened things. It will keep trying until it works. It won't have stopped trying (unless I'm very mistaken?), it's just oom killing things merrily along the way. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
On Saturday 29 September 2007 19:27, Andrew Morton wrote: On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: oom-killings, or page allocation failures? The latter, one hopes. Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007 ... mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0 Call Trace: 611b3878: [6002dd28] printk_ratelimit+0x15/0x17 611b3888: [60052ed4] out_of_memory+0x80/0x100 611b38c8: [60054b0c] __alloc_pages+0x1ed/0x280 611b3948: [6006c608] allocate_slab+0x5b/0xb0 611b3968: [6006c705] new_slab+0x7e/0x183 611b39a8: [6006cbae] __slab_alloc+0xc9/0x14b 611b39b0: [6011f89f] radix_tree_preload+0x70/0xbf 611b39b8: [600980f2] do_mpage_readpage+0x3b3/0x472 611b39e0: [6011f89f] radix_tree_preload+0x70/0xbf 611b39f8: [6006cc81] kmem_cache_alloc+0x51/0x98 611b3a38: [6011f89f] radix_tree_preload+0x70/0xbf 611b3a58: [6004f8e2] add_to_page_cache+0x22/0xf7 611b3a98: [6004f9c6] add_to_page_cache_lru+0xf/0x24 611b3ab8: [6009821e] mpage_readpages+0x6d/0x109 611b3ac0: [600d59f0] ext3_get_block+0x0/0xf2 611b3b08: [6005483d] get_page_from_freelist+0x8d/0xc1 611b3b88: [600d6937] ext3_readpages+0x18/0x1a 611b3b98: [60056f00] read_pages+0x37/0x9b 611b3bd8: [60057064] __do_page_cache_readahead+0x100/0x157 611b3c48: [60057196] do_page_cache_readahead+0x52/0x5f 611b3c78: [60050ab4] filemap_fault+0x145/0x278 611b3ca8: [60022b61] run_syscall_stub+0xd1/0xdd 611b3ce8: [6005eae3] __do_fault+0x7e/0x3ca 611b3d68: [6005ee60] do_linear_fault+0x31/0x33 611b3d88: [6005f149] handle_mm_fault+0x14e/0x246 611b3da8: [60120a7b] __up_read+0x73/0x7b 611b3de8: [60013177] handle_page_fault+0x11f/0x23b 611b3e48: [60013419] segv+0xac/0x297 611b3f28: [60013367] segv_handler+0x68/0x6e 611b3f48: [600232ad] get_skas_faultinfo+0x9c/0xa1 611b3f68: [60023853] userspace+0x13a/0x19d 611b3fc8: [60010d58] fork_handler+0x86/0x8d OK, that's different. 
Someone broke the vm - order-2 GFP_KERNEL allocations aren't supposed to fail. I'm suspecting that did_some_progress thing. The allocation didn't fail -- it invoked the OOM killer because the kernel ran out of unfragmented memory. Probably because higher order allocations are the new vogue in -mm at the moment ;) - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
On Saturday 29 September 2007 04:41, Christoph Lameter wrote: On Fri, 28 Sep 2007, Peter Zijlstra wrote: memory got massively fragmented, as anti-frag gets easily defeated. setting min_free_kbytes to 12M does seem to solve it - it forces 2 max order blocks to stay available, so we don't mix types. however 12M on 128M is rather a lot. Yes, strict ordering would be much better. On NUMA it may be possible to completely forbid merging. We can fall back to other nodes if necessary. 12M is not much on a NUMA system. But this shows that (unsurprisingly) we may have issues on systems with small amounts of memory and we may not want to use higher orders on such systems. The case you got may be good to use as a testcase for the virtual fallback. Hmmm... Maybe it is possible to allocate the stack as a virtual compound page. Got some script/code to produce that problem? Yeah, you could do that, but we generally don't have big problems allocating stacks in mainline, because we have very few users of higher order pages, and the few that are there don't seem to be a problem.
Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
On Wednesday 19 September 2007 13:36, Christoph Lameter wrote: SLAB_VFALLBACK can be specified for selected slab caches. If fallback is available then the conservative settings for higher order allocations are overridden. We then request an order that can accommodate at minimum 100 objects. The size of an individual slab allocation is allowed to reach up to 256k (order 6 on i386, order 4 on IA64). How come SLUB wants such a big amount of objects? I thought the unqueued nature of it made it better than slab because it minimised the amount of cache hot memory lying around in slabs... vmalloc is incredibly slow and unscalable at the moment. I'm still working on making it more scalable and faster -- hopefully to a point where it would actually be usable for this... but you still get moved off large TLBs, and also have to inevitably do tlb flushing. Or do you have SLUB at a point where performance is comparable to SLAB, and this is just a possible idea for more performance?
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 20 September 2007 11:38, David Chinner wrote: On Wed, Sep 19, 2007 at 04:04:30PM +0200, Andrea Arcangeli wrote: Plus of course you don't like fsblock because it requires work to adapt a fs to it, I can't argue about that. No, I don't like fsblock because it is inherently a structure-per-filesystem-block construct, just like buggerheads. You still need to allocate millions of them when you have millions of dirty pages around. Rather than type it all out again, read the fsblocks thread from here: I don't think there is anything inherently wrong with a structure-per-filesystem-block construct, in the places where you want to have a handle on that information. In the data path of a lot of filesystems, it's not really useful, no question. But the block / buffer head concept is there and is useful for many things obviously. fsblock, I believe, improves on it; maybe even to the point where there won't be too much reason for many filesystems to convert their data paths to something different (eg. nobh mode, or an extent block mapping). http://marc.info/?l=linux-fsdevel&m=118284983925719&w=2 FWIW, with Chris Mason's extent-based block mapping (which btrfs is using and Christoph Hellwig is porting XFS over to) we completely remove buggerheads from XFS, and so fsblock would be a pretty major step backwards for us if Chris's work goes into mainline. If you don't need to manipulate or manage the pagecache on a per-block basis anyway, then you shouldn't need fsblock (or anything else particularly special) to do higher order block sizes. If you do sometimes need to, then fsblock *may* be a way you can remove vmap code from your filesystem and share it with generic code...
But, I'm not going to argue endlessly for one solution or another; I'm happy to see different solutions being chased, so may the best VM win ;) Agreed :) - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK
On Saturday 29 September 2007 03:33, Christoph Lameter wrote: On Fri, 28 Sep 2007, Nick Piggin wrote: On Wednesday 19 September 2007 13:36, Christoph Lameter wrote: SLAB_VFALLBACK can be specified for selected slab caches. If fallback is available then the conservative settings for higher order allocations are overridden. We then request an order that can accommodate at minimum 100 objects. The size of an individual slab allocation is allowed to reach up to 256k (order 6 on i386, order 4 on IA64). How come SLUB wants such a big amount of objects? I thought the unqueued nature of it made it better than slab because it minimised the amount of cache hot memory lying around in slabs... The more objects in a page the more the fast path runs. The more the fast path runs the lower the cache footprint and the faster the overall allocations etc. SLAB can be configured for large queues holding lots of objects. SLUB can only reach the same through large pages because it does not have queues. One could add the ability to manage pools of cpu slabs but that would be adding yet another layer to compensate for the problem of the small pages. Reliable large page allocations means that we can get rid of these layers and the many workarounds that we have in place right now. That doesn't sound very nice because you don't actually want to use up higher order allocations if you can avoid it, and you definitely don't want to be increasing your slab page size unit if you can help it, because it compounds the problem of slab fragmentation. The unqueued nature of SLUB reduces memory requirements, and in general the more efficient code paths of SLUB offset the advantage that SLAB can reach by being able to put more objects onto its queues. Queueing introduces complexity and cache line use through the need to manage those queues. I thought it was slower. Have you fixed the performance regression? (OK, I read further down that you are still working on it but not confirmed yet...)
vmalloc is incredibly slow and unscalable at the moment. I'm still working on making it more scalable and faster -- hopefully to a point where it would actually be usable for this... but you still get moved off large TLBs, and also have to inevitably do tlb flushing. Again I have not seen any fallbacks to vmalloc in my testing. What we are doing here is mainly to address your theoretical cases that we so far have never seen to be a problem and increase the reliability of allocations of page orders larger than 3 to a usable level. So far I have so far not dared to enable orders larger than 3 by default. Basically, all that shows is that your testing isn't very thorough. 128MB is an order of magnitude *more* memory than some users have. They probably wouldn't be happy with a regression in slab allocator performance either. Or do you have SLUB at a point where performance is comparable to SLAB, and this is just a possible idea for more performance? AFAICT SLUBs performance is superior to SLAB in most cases and it was like that from the beginning. I am still concerned about several corner cases though (I think most of them are going to be addressed by the per cpu patches in mm). Having a comparable or larger amount of per cpu objects as SLAB is something that also could address some of these concerns and could increase performance much further. OK, so long as it isn't going to depend on using higher order pages, that's fine. (if they help even further as an optional thing, that's fine too. You can turn them on your huge systems and not even bother about adding this vmap fallback -- you won't have me to nag you about these purely theoretical issues). - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [13/17] Virtual compound page freeing in interrupt context
On Wednesday 19 September 2007 13:36, Christoph Lameter wrote: If we are in an interrupt context then simply defer the free via a workqueue. In an interrupt context it is not possible to use vmalloc_addr() to determine the vmalloc address. So add a variant that does that too. Removing a virtual mapping *must* be done with interrupts enabled since tlb_xx functions are called that rely on interrupts for processor to processor communications. These things will clash drastically with my lazy TLB flushing and scalability work. Actually the lazy TLB flushing will provide a natural way to defer unmapping at interrupt time, and the scalability work should make it easier to vmap from interrupt context too, if you really need that.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 mm/page_alloc.c |   23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2007-09-18 20:10:55.000000000 -0700
+++ linux-2.6/mm/page_alloc.c	2007-09-18 20:11:40.000000000 -0700
@@ -1297,7 +1297,12 @@ abort:
 	return NULL;
 }

-static void vcompound_free(void *addr)
+/*
+ * Virtual Compound freeing functions. This is complicated by the vmalloc
+ * layer not being able to free virtual allocations when interrupts are
+ * disabled. So we defer the frees via a workqueue if necessary.
+ */
+static void __vcompound_free(void *addr)
 {
 	struct page **pages = vunmap(addr);
 	int i;
@@ -1320,6 +1325,22 @@ static void vcompound_free(void *addr)
 	kfree(pages);
 }

+static void vcompound_free_work(struct work_struct *w)
+{
+	__vcompound_free((void *)w);
+}
+
+static void vcompound_free(void *addr)
+{
+	if (in_interrupt()) {
+		struct work_struct *w = addr;
+
+		INIT_WORK(w, vcompound_free_work);
+		schedule_work(w);
+	} else
+		__vcompound_free(addr);
+}
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
*/ - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 19 September 2007 04:30, Linus Torvalds wrote: On Tue, 18 Sep 2007, Nick Piggin wrote: ROFL! Yeah of course, how could I have forgotten about our trusty OOM killer as the solution to the fragmentation problem? It would only have been funnier if you had said to reboot every so often when memory gets fragmented :) Can we please stop this *idiotic* thread. Nick, you and some others seem to be arguing based on a totally flawed base, namely: - we can guarantee anything at all in the VM - we even care about the 16kB blocksize - second-class citizenry is bad The fact is, *none* of those things are true. The VM doesn't guarantee anything, and is already very much about statistics in many places. You seem to be arguing as if Christoph was introducing something new and unacceptable, when it's largely just more of the same. I will stop this idiotic thread. However, at the VM and/or vm/fs things we had, I was happy enough for this thing of Christoph's to get merged. Actually I even didn't care if it had mmap support, so long as it solved their problem. But a solution to the general problem of VM and IO scalability, it is not. IMO. And the fact is, nobody but SGI customers would ever want the 16kB blocksize. IOW - NONE OF THIS MATTERS! Maybe. Maybe not. Can you guys stop this inane thread already, or at least take it private between you guys, instead of forcing everybody else to listen in on your flamefest. Will do. Sorry. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tuesday 18 September 2007 08:00, Christoph Lameter wrote: On Sun, 16 Sep 2007, Nick Piggin wrote: I don't know how it would prevent fragmentation from building up anyway. It's commonly the case that potentially unmovable objects are allowed to fill up all of ram (dentries, inodes, etc). Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from ZONE_MOVABLE and thus the memory that can be allocated for them is limited. Why would ZONE_MOVABLE require that movable objects should be moved out of the way for unmovable ones? It never _has_ any unmovable objects in it. Quite obviously we were not talking about reserve zones. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tuesday 18 September 2007 08:21, Christoph Lameter wrote: On Sun, 16 Sep 2007, Nick Piggin wrote: So if you argue that vmap is a downside, then please tell me how you consider the -ENOMEM of your approach to be better? That is again pretty undifferentiated. Are we talking about low page In general. There is no -ENOMEM approach. Lower order page allocation (< PAGE_ALLOC_COSTLY_ORDER) will reclaim and in the worst case the OOM killer will be activated. ROFL! Yeah of course, how could I have forgotten about our trusty OOM killer as the solution to the fragmentation problem? It would only have been funnier if you had said to reboot every so often when memory gets fragmented :) That is the nature of the failures that we saw early in the year when this was first merged into mm. With the ZONE_MOVABLE you can remove the unmovable objects into a defined pool then higher order success rates become reasonable. OK, if you rely on reserve pools, then it is not 1st class support and hence it is a non-solution to VM and IO scalability problems. ZONE_MOVABLE creates two memory pools in a machine. One of them is for movable and one for unmovable. That is in 2.6.23. So 2.6.23 has no first class support for order 0 pages? What? If, by special software layer, you mean the vmap/vunmap support in fsblock, let's see... that's probably all of a hundred or two lines. Contrast that with anti-fragmentation, lumpy reclaim, higher order pagecache and its new special mmap layer... Hmm, seems like a no brainer to me. You really still want to pursue the extra layer argument as a point against fsblock here? Yes sure. Your code could not live without these approaches. Without the Actually: your code is the one that relies on higher order allocations. Now you're trying to turn that into an argument against fsblock? fsblock also needs contiguous pages in order to have a beneficial effect that we seem to be looking for. Keyword: relies.
antifragmentation measures your fsblock code would not be very successful in getting the larger contiguous segments you need to improve performance. Completely wrong. *I* don't need to do any of that to improve performance. Actually the VM is well tuned for order-0 pages, and so seeing as I have sane hardware, 4K pagecache works beautifully for me. Sure the system works fine as is. Not sure why we would need fsblock then. Large block filesystem. (There is no new mmap layer, the higher order pagecache is simply the old API with set_blocksize expanded). Yes you add another layer in the userspace mapping code to handle higher order pagecache. That would imply a new API or something? I do not see it. I was not implying a new API. Why: It is the same approach that you use. Again, rubbish. Ok the logical conclusion from the above is that you think your approach is rubbish. The logical conclusion is that _they are not the same approach_! Is there some way you could cool down a bit? I'm not upset, but what you were saying was rubbish, plain and simple. The amount of times we've gone in circles, I most likely have already explained this, several times, in a more polite manner. And I know you're more than capable of understanding at least the concept behind fsblock, even without time to work through the exact details. What are you expecting me to say, after all this back and forth, when you come up with things like "[fsblock] is not a generic change but special to the block layer", and then claim that fsblock is the same as allocating virtual compound pages with vmalloc as a fallback for higher order allocs. What I will say is that fsblock still has a relatively long way to go, so maybe that's your reason for not looking at it. And yes, when fsblock is in a better state to actually perform useful comparisons with, it will be a much better time to have these debates.
But in that case, just say so :) then I can go away and do more constructive work on it instead of filling people's inboxes. I believe the fsblock approach is the best one, but it's not without problems and complexities, so I'm quite ready for it to be proven incorrect, not performant, or otherwise rejected. I'm going on holiday for 2 weeks. I'll try to stay away from email, and particularly this thread. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tuesday 18 September 2007 08:05, Christoph Lameter wrote: On Sun, 16 Sep 2007, Nick Piggin wrote: fsblock doesn't need any of those hacks, of course. Nor does mine for the low orders that we are considering. For order > MAX_ORDER this is unavoidable since the page allocator cannot manage such large pages. It can be used for lower order if there are issues (that I have not seen yet). Or we can just avoid all doubt (and doesn't have arbitrary limitations according to what you think might be reasonable or how well the system actually behaves). We can avoid all doubt in this patchset as well by adding support for fallback to a vmalloced compound page. How would you do a vmapped fallback in your patchset? How would you keep track of pages 2..N if they don't exist in the radix tree? What if they don't even exist in the kernel's linear mapping? It seems you would also require more special casing in the fault path and special casing in the block layer to do this. It's not a trivial problem you can just brush away by handwaving. Let's see... you could add another field in struct page to store the vmap virtual address, and set a new flag to indicate that constituent page N can be found via vmalloc_to_page(page->vaddr + N*PAGE_SIZE). Then add more special casing to the block layer and fault path etc. to handle these new non-contiguous compound pages. I guess you must have thought about it much harder than the 2 minutes I just did then, so you must have a much nicer solution... But even so, you're still trying very hard to avoid touching the filesystems or buffer layer while advocating instead to squeeze the complexity out into the vm and block layer. I don't agree that is the right thing to do. Sure it is _easier_, because we know the VM. I don't argue that fsblock large block support is trivial. But you are first asserting that it is too complicated and then trying to address one of the issues it solves by introducing complexity elsewhere.
Re: 2.6.22.6: kernel BUG at fs/locks.c:171
On Saturday 15 September 2007 20:22, Soeren Sonnenburg wrote: On Sat, 2007-09-15 at 09:47 +, Soeren Sonnenburg wrote: Memtest did not find anything after 16 passes so I finally stopped it applied your patch and used CONFIG_DEBUG_SLAB=y CONFIG_DEBUG_SLAB_LEAK=y and booted into the new kernel. A few hours later the machine hung (due to nmi watchdog rebooted), so I restarted and disabled the watchdog and while compiling a kernel with a ``more minimal'' config I got this (not sure whether this is related/the cause .../ note that I don't use a swapfile/partition). I would need more guidance on what to try now... Thanks! Soeren swap_dup: Bad swap file entry 28c8af9d Hmm, this is another telltale symptom of either bad hardware or a memory scribbling bug. VM: killing process cc1 Eeek! page_mapcount(page) went negative! (-1) page pfn = 36233 page-flags = 4834 page-count = 2 page-mapping = c1cfed14 vma-vm_ops = run_init_process+0x3feff000/0x14 And these are probably related (it's just gone off and started performing VM operations on the wrong page...). Had you been using the dvb card since rebooting when you saw these messages come up? What happens if you remove the card from the system? [ cut here ] kernel BUG at mm/rmap.c:628! 
invalid opcode: [#1] Modules linked in: ipt_iprange ipt_REDIRECT capi kernelcapi capifs ipt_REJECT xt_tcpudp xt_state xt_limit ipt_LOG ipt_MASQUERADE iptable_mangle iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables b44 ohci1394 ieee1394 nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lcd tda827x saa7134_dvb dvb_pll video_buf_dvb tda1004x tuner ves1820 usb_storage usblp budget_ci budget_core saa7134 compat_ioctl32 dvb_ttpci dvb_core saa7146_vv video_buf saa7146 ttpci_eeprom ir_kbd_i2c videodev v4l2_common v4l1_compat ir_common via_agp agpgart CPU:0 EIP:0060:[c0144487]Not tainted VLI EFLAGS: 00010246 (2.6.22.6 #2) EIP is at page_remove_rmap+0xd4/0x101 eax: ebx: c16c4660 ecx: edx: esi: d4570b30 edi: d6560a78 ebp: b740 esp: d6265eac ds: 007b es: 007b fs: gs: ss: 0068 Process cc1 (pid: 26095, ti=d6264000 task=d67af5b0 task.ti=d6264000) Stack: c0422e26 c1cfed14 c16c4660 b729e000 c013f5b8 36233cce d4570b30 d6265f20 0001 f4ffcb70 f483a3b8 c04f44b8 f4ffcb70 00303ff4 b7c18000 d6265f20 f4a8c510 f483a3b8 0009 Call Trace: [c013f5b8] unmap_vmas+0x23f/0x404 [c0141c09] exit_mmap+0x5f/0xc9 [c011923a] mmput+0x1b/0x5e [c011cf97] do_exit+0x1a0/0x606 [c01135f8] do_page_fault+0x49c/0x518 [c011e340] __do_softirq+0x35/0x75 [c011315c] do_page_fault+0x0/0x518 [c039aada] error_code+0x6a/0x70 === Code: c0 74 0d 8b 50 08 b8 56 2e 42 c0 e8 ac f4 fe ff 8b 46 48 85 c0 74 14 8b 40 10 85 c0 74 0d 8b 50 2c b8 75 2e 42 c0 e8 91 f4 fe ff 0f 0b eb fe 8b 53 10 8b 03 83 e2 01 c1 e8 1e f7 da 83 c2 04 69 EIP: [c0144487] page_remove_rmap+0xd4/0x101 SS:ESP 0068:d6265eac Fixing recursive fault but reboot is needed! Hmmhh, so now I rebooted and again tried to $ make the new kernel which again triggered this(?) BUG: Any ideas? Soeren. Eeek! page_mapcount(page) went negative! (-1) page pfn = 18722 page-flags = 4000 page-count = 1 page-mapping = vma-vm_ops = run_init_process+0x3feff000/0x14 [ cut here ] kernel BUG at mm/rmap.c:628! 
invalid opcode: [#1] Modules linked in: ipt_iprange ipt_REDIRECT capi kernelcapi capifs ipt_REJECT xt_tcpudp xt_state xt_limit ipt_LOG ipt_MASQUERADE iptable_mangle iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_t CPU:0 EIP:0060:[c0144487]Not tainted VLI EFLAGS: 00010246 (2.6.22.6 #2) EIP is at page_remove_rmap+0xd4/0x101 eax: ebx: c130e440 ecx: edx: esi: f438b510 edi: f3328ac8 ebp: c130e440 esp: f28d5eec ds: 007b es: 007b fs: gs: 0033 ss: 0068 Process cc1 (pid: 17957, ti=f28d4000 task=f60bb0d0 task.ti=f28d4000) Stack: c0422e26 f3328ac8 0002 c013f185 b76b2000 f438b510 f43013b8 c1a7c640 1879 b76b2000 f3328ac8 f438b510 c014021d f3328ac8 f4360b74 f43013f8 1879 00100073 b76b2000 f43013b8 f4360b74 0100 f28d5f90 Call Trace: [c013f185] do_wp_page+0x28a/0x35c [c014021d] __handle_mm_fault+0x626/0x6a4 [c0113368] do_page_fault+0x20c/0x518 [c011315c] do_page_fault+0x0/0x518 [c039aada] error_code+0x6a/0x70 === Code: c0 74 0d 8b 50 08 b8 56 2e 42 c0 e8 ac f4 fe ff 8b 46 48 85 c0 74 14 8b 40 10 85 c0 74 0d 8b 50 2c b8 75 2e 42 c0 e8 91 f4 fe ff 0f 0b eb fe 8b 53 10 8b 03 83 e2 01 c1 e8 1e f7 da 83 c2 04 69 EIP: [c0144487] page_remove_rmap+0xd4/0x101 SS:ESP 0068:f28d5eec Eeek! page_mapcount(page) went negative! (-2) page pfn = 18722 page-flags = 4004
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Saturday 15 September 2007 03:52, Christoph Lameter wrote: On Fri, 14 Sep 2007, Nick Piggin wrote: [*] ok, this isn't quite true because if you can actually put a hard limit on unmovable allocations then anti-frag will fundamentally help -- get back to me on that when you get patches to move most of the obvious ones. We have this hard limit using ZONE_MOVABLE in 2.6.23. So we're back to 2nd class support. 2nd class support for me means a feature that is not enabled by default but that can be enabled in order to increase performance. The 2nd class support is there because we are not yet sure about the maturity of the memory allocation methods. I'd rather an approach that does not require all these hacks. Reserve pools as handled (by the not yet available) large page pool patches (which again has altogether another purpose) are not a limit. The reserve pools are used to provide a minimum of higher order pages that is not broken down in order to ensure that a minimum number of the desired order of pages is even available in your worst case scenario. Mainly I think that is needed during the period when memory defragmentation is still under development. fsblock doesn't need any of those hacks, of course. Nor does mine for the low orders that we are considering. For order MAX_ORDER this is unavoidable since the page allocator cannot manage such large pages. It can be used for lower order if there are issues (that I have not seen yet). Or we can just avoid all doubt (and doesn't have arbitrary limitations according to what you think might be reasonable or how well the system actually behaves).
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Saturday 15 September 2007 04:08, Christoph Lameter wrote: On Fri, 14 Sep 2007, Nick Piggin wrote: However fsblock can do everything that higher order pagecache can do in terms of avoiding vmap and giving contiguous memory to block devices by opportunistically allocating higher orders of pages, and falling back to vmap if they cannot be satisfied. fsblock is restricted to the page cache and cannot be used in other contexts where subsystems can benefit from larger linear memory. Unless you believe higher order pagecache is not restricted to the pagecache, can we please just stick on topic of fsblock vs higher order pagecache? So if you argue that vmap is a downside, then please tell me how you consider the -ENOMEM of your approach to be better? That is again pretty undifferentiated. Are we talking about low page In general. orders? There we will reclaim all of the reclaimable memory before getting an -ENOMEM. Given the quantities of pages on today's machines--a 1 TB machine has 256 million 4k pages--and the unmovable ratios we see today it would require a very strange setup to get an allocation failure while still being able to allocate order 0 pages. So much handwaving. 1TB machines without very strange setups (where very strange is something arbitrarily defined by you) I guess make up 0.001% of Linux installations. With the ZONE_MOVABLE you can remove the unmovable objects into a defined pool then higher order success rates become reasonable. OK, if you rely on reserve pools, then it is not 1st class support and hence it is a non-solution to VM and IO scalability problems. If, by special software layer, you mean the vmap/vunmap support in fsblock, let's see... that's probably all of a hundred or two lines. Contrast that with anti-fragmentation, lumpy reclaim, higher order pagecache and its new special mmap layer... Hmm, seems like a no brainer to me. You really still want to pursue the extra layer argument as a point against fsblock here? Yes sure.
Your code could not live without these approaches. Without the Actually: your code is the one that relies on higher order allocations. Now you're trying to turn that into an argument against fsblock? antifragmentation measures your fsblock code would not be very successful in getting the larger contiguous segments you need to improve performance. Completely wrong. *I* don't need to do any of that to improve performance. Actually the VM is well tuned for order-0 pages, and so seeing as I have sane hardware, 4K pagecache works beautifully for me. My point was this: fsblock does not preclude your using such measures to fix the performance of your hardware that's broken with 4K pages. And it would allow higher order allocations to fail gracefully. (There is no new mmap layer, the higher order pagecache is simply the old API with set_blocksize expanded). Yes you add another layer in the userspace mapping code to handle higher order pagecache. Of course I wouldn't state that. On the contrary, I categorically state that I have already solved it :) Well then I guess that you have not read the requirements... I'm not talking about solving your problem of poor hardware. I'm talking about supporting higher order block sizes in the kernel. Because it has already been rejected in another form and adds more You have rejected it. But they are bogus reasons, as I showed above. That's not me. I am working on this because many of the filesystem people have repeatedly asked me to do this. I am no expert on filesystems. Yes it is you. You brought up reasons in this thread and I said why I thought they were bogus. If you think you can now forget about my shooting them down by saying you aren't an expert in filesystems, then you shouldn't have brought them up in the first place. Either stand by your arguments or don't. You also describe some other real (if lesser) issues like number of page structs to manage in the pagecache. But this is hardly enough to reject my patch now...
for every downside you can point out in my approach, I can point out one in yours. - fsblock doesn't require changes to virtual memory layer Therefore it is not a generic change but special to the block layer. So other subsystems still have to deal with the single page issues on their own. Rubbish. fsblock doesn't touch a single line in the block layer. Maybe we could get to something like a hybrid that avoids some of these issues? Add support so something like a virtual compound page can be handled transparently in the filesystem layer with special casing if such a beast reaches the block layer? That's conceptually much worse, IMO. Why: It is the same approach that you use. Again, rubbish. If it is barely ever used and satisfies your concern then I am fine with it. Right below this line is where I told you I am _not_ fine with it. And practically worse as well: vmap space
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Monday 17 September 2007 04:13, Mel Gorman wrote: On (15/09/07 14:14), Goswin von Brederlow didst pronounce: I keep coming back to the fact that movable objects should be moved out of the way for unmovable ones. Anything else just allows fragmentation to build up. This is easily achieved, just really really expensive because of the amount of copying that would have to take place. It would also compel that min_free_kbytes be at least one free PAGEBLOCK_NR_PAGES and likely MIGRATE_TYPES * PAGEBLOCK_NR_PAGES to reduce excessive copying. That is a lot of free memory to keep around which is why fragmentation avoidance doesn't do it. I don't know how it would prevent fragmentation from building up anyway. It's commonly the case that potentially unmovable objects are allowed to fill up all of ram (dentries, inodes, etc). And of course, if you craft your exploit nicely with help from higher ordered unmovable memory (eg. mm structs or unix sockets), then you don't even need to fill all memory with unmovables before you can have them take over all groups.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Monday 17 September 2007 14:07, David Chinner wrote: On Fri, Sep 14, 2007 at 06:48:55AM +1000, Nick Piggin wrote: OK, the vunmap batching code wipes your TLB flushing and IPIs off the table. Diffstat below, but the TLB portions are here (besides that _everything_ is probably lower due to less TLB misses caused by the TLB flushing): -170 -99.4% sn2_send_IPI -343 -100.0% sn_send_IPI_phys -17911 -99.9% smp_call_function Total performance went up by 30% on a 64-way system (248 seconds to 172 seconds to run parallel finds over different huge directories). Good start, Nick ;) I didn't have the chance to test against a 16K directory block size to find the optimal performance, but it is something I will do (I'm sure it will be still a _lot_ faster than 172 seconds :)). 23012 54790.5% _read_lock 9427 329.0% __get_vm_area_node 5792 0.0% __find_vm_area 1590 53000.0% __vunmap _read_lock? I thought vmap() and vunmap() only took the vmlist_lock in write mode. Yeah, it is a slight change... the lazy vunmap only has to take it for read. In practice, I'm not sure that it helps a great deal because everything else still takes the lock for write. But that explains why it's popped up in the profile. Next I have some patches to scale the vmap locks and data structures better, but they're not quite ready yet. This looks like it should result in a further speedup of several times when combined with the TLB flushing reductions here... Sounds promising - when you have patches that work, send them my way and I'll run some tests on them. Still away from home (for the next 2 weeks), so I'll be going a bit slow :P I'm thinking about various scalable locking schemes and I'll definitely ping you when I've made a bit of progress. Thanks, Nick
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 13 September 2007 09:06, Christoph Lameter wrote: On Wed, 12 Sep 2007, Nick Piggin wrote: So lumpy reclaim does not change my formula nor significantly help against a fragmentation attack. AFAIKS. Lumpy reclaim improves the situation significantly because the overwhelming majority of allocations during the lifetime of a system are movable and thus it is able to opportunistically restore the availability of higher order pages by reclaiming neighboring pages. I'm talking about non movable allocations. [*] ok, this isn't quite true because if you can actually put a hard limit on unmovable allocations then anti-frag will fundamentally help -- get back to me on that when you get patches to move most of the obvious ones. We have this hard limit using ZONE_MOVABLE in 2.6.23. So we're back to 2nd class support. Sure, and I pointed out the theoretical figure for 64K pages as well. Is that figure not problematic to you? Where do you draw the limit for what is acceptable? Why? What happens with tiny memory machines where a reserve or even the anti-frag patches may not be acceptable and/or work very well? When do you require reserve pools? Why are reserve pools acceptable for first-class support of filesystems when it has very loudly been made a known policy decision by Linus in the past (and for some valid reasons) that we should not put limits on the sizes of caches in the kernel. 64K pages may be problematic because it is above the PAGE_ORDER_COSTLY in 2.6.23. 32K is currently much safer because lumpy reclaim can restore these and does so on my systems. I expect the situation for 64K pages to improve when more of Mel's patches go in. We have long term experience with 32k sized allocation through Andrew's tree. Reserve pools as handled (by the not yet available) large page pool patches (which again has altogether another purpose) are not a limit.
The reserve pools are used to provide a minimum of higher order pages that is not broken down in order to ensure that a minimum number of the desired order of pages is even available in your worst case scenario. Mainly I think that is needed during the period when memory defragmentation is still under development. fsblock doesn't need any of those hacks, of course.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 13 September 2007 12:01, Nick Piggin wrote: On Thursday 13 September 2007 23:03, David Chinner wrote: Then just do operations on directories with lots of files in them (tens of thousands). Every directory operation will require at least one vmap in this situation - e.g. a traversal will result in lots and lots of blocks being read that will require vmap() for every directory block read from disk and an unmap almost immediately afterwards when the reference is dropped Ah, wow, thanks: I can reproduce it. OK, the vunmap batching code wipes your TLB flushing and IPIs off the table. Diffstat below, but the TLB portions are here (besides that _everything_ is probably lower due to less TLB misses caused by the TLB flushing): -170 -99.4% sn2_send_IPI -343 -100.0% sn_send_IPI_phys -17911 -99.9% smp_call_function Total performance went up by 30% on a 64-way system (248 seconds to 172 seconds to run parallel finds over different huge directories). 23012 54790.5% _read_lock 9427 329.0% __get_vm_area_node 5792 0.0% __find_vm_area 1590 53000.0% __vunmap 10726.0% _spin_lock 74 119.4% _xfs_buf_find 58 0.0% __unmap_kernel_range 5336.6% kmem_zone_alloc -129 -100.0% pio_phys_write_mmr -144 -100.0% unmap_kernel_range -170 -99.4% sn2_send_IPI -233 -59.1% kfree -266 -100.0% find_next_bit -343 -100.0% sn_send_IPI_phys -564 -19.9% xfs_iget_core -1946 -100.0% remove_vm_area -17911 -99.9% smp_call_function -62726-7.2% _write_lock -438360 -64.2% default_idle -482631 -30.4% total Next I have some patches to scale the vmap locks and data structures better, but they're not quite ready yet. This looks like it should result in a further speedup of several times when combined with the TLB flushing reductions here...
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 13 September 2007 09:17, Christoph Lameter wrote: On Wed, 12 Sep 2007, Nick Piggin wrote: I will still argue that my approach is the better technical solution for large block support than yours, I don't think we made progress on that. And I'm quite sure we agreed at the VM summit not to rely on your patches for VM or IO scalability. The approach has already been tried (see the XFS layer) and found lacking. It is lacking because our vmap algorithms are simplistic to the point of being utterly inadequate for the new requirements. There has not been any fundamental problem revealed (like the fragmentation approach has). However fsblock can do everything that higher order pagecache can do in terms of avoiding vmap and giving contiguous memory to block devices by opportunistically allocating higher orders of pages, and falling back to vmap if they cannot be satisfied. So if you argue that vmap is a downside, then please tell me how you consider the -ENOMEM of your approach to be better? Having a fake linear block through vmalloc means that a special software layer must be introduced and we may face special casing in the block / fs layer to check if we have one of these strange vmalloc blocks. I guess you're a little confused. There is nothing fake about the linear address. Filesystems of course need changes (the block layer needs none -- why would it?). But changing APIs to accommodate a better solution is what Linux is about. If, by special software layer, you mean the vmap/vunmap support in fsblock, let's see... that's probably all of a hundred or two lines. Contrast that with anti-fragmentation, lumpy reclaim, higher order pagecache and its new special mmap layer... Hmm, seems like a no brainer to me. You really still want to pursue the extra layer argument as a point against fsblock here? But you just showed in two emails that you don't understand what the problem is.
To reiterate: lumpy reclaim does *not* invalidate my formulae; and antifrag does *not* isolate the issue. I do understand what the problem is. I just do not get what your problem with this is and why you have this drive to demand perfection. We are Oh. I don't think I could explain that if you still don't understand by now. But that's not the main issue: all that I ask is you consider fsblock on technical grounds. working a variety of approaches on the (potential) issue but you categorically state that it cannot be solved. Of course I wouldn't state that. On the contrary, I categorically state that I have already solved it :) But what do you say about viable alternatives that do not have to worry about these unlikely scenarios, full stop? So, why should we not use fs block for higher order page support? Because it has already been rejected in another form and adds more You have rejected it. But they are bogus reasons, as I showed above. You also describe some other real (if lesser) issues like number of page structs to manage in the pagecache. But this is hardly enough to reject my patch now... for every downside you can point out in my approach, I can point out one in yours. - fsblock doesn't require changes to virtual memory layer - fsblock can retain cache of just 4K in a 4K block size file How about those? I know very well how Linus feels about both of them. Maybe we could get to something like a hybrid that avoids some of these issues? Add support so something like a virtual compound page can be handled transparently in the filesystem layer with special casing if such a beast reaches the block layer? That's conceptually much worse, IMO. And practically worse as well: vmap space is limited on 32-bit; fsblock approach can avoid vmap completely in many cases; for two reasons. The fsblock data accessor APIs aren't _that_ bad changes.
They change zero conceptually in the filesystem, are arguably cleaner, and can be essentially nooped if we wanted to stay with a b_data type approach (but they give you that flexibility to replace it with any implementation).
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 13 September 2007 11:49, David Chinner wrote: On Wed, Sep 12, 2007 at 01:27:33AM +1000, Nick Piggin wrote: I just gave 4 things which combined might easily reduce xfs vmap overhead by several orders of magnitude, all without changing much code at all. Patches would be greatly appreciated. You obviously understand this vm code much better than I do, so if it's easy to fix by adding some generic vmap cache thingy, please do. Well, it may not be easy to _fix_, but it's easy to try a few improvements ;) How do I make an image and run a workload that will coerce XFS into doing a significant number of vmaps?
Re: 2.6.22.6: kernel BUG at fs/locks.c:171
On Thursday 13 September 2007 19:20, Soeren Sonnenburg wrote: Dear all, I've just seen this in dmesg on a AMD K7 / kernel 2.6.22.6 machine (config attached). Any ideas / which further information needed ? Thanks for the report. Is it reproduceable? It seems like the locks_free_lock call that's oopsing is coming from __posix_lock_file. The actual function looks fine, but the lock being freed could have been corrupted if there was slab corruption, or a hardware corruption. You could: try running memtest86+ overnight. And try the following patch and turn on slab debugging then try to reproduce the problem. Soeren [ cut here ] kernel BUG at fs/locks.c:171! invalid opcode: [#1] Modules linked in: ipt_iprange ipt_REDIRECT capi kernelcapi capifs ipt_REJECT xt_tcpudp xt_state xt_limit ipt_LOG ipt_MASQUERADE iptable_mangle iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables b44 ohci1394 ieee1394 nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lcd tda827x saa7134_dvb dvb_pll video_buf_dvb tuner tda1004x ves1820 usb_storage usblp saa7134 compat_ioctl32 budget_ci budget_core dvb_ttpci dvb_core saa7146_vv video_buf saa7146 ttpci_eeprom via_agp ir_kbd_i2c videodev v4l2_common v4l1_compat ir_common agpgart CPU:0 EIP:0060:[c0158f59]Not tainted VLI EFLAGS: 00010206 (2.6.22.6 #1) EIP is at locks_free_lock+0xb/0x3b eax: e1d07f9c ebx: e1d07f80 ecx: f5f5e2f0 edx: esi: edi: ebp: esp: da3d7f04 ds: 007b es: 007b fs: gs: 0033 ss: 0068 Process mrtg-load (pid: 19688, ti=da3d6000 task=f5e3a030 task.ti=da3d6000) Stack: c015972b 0002 c04889c8 c012b920 f5f5e290 c048541c f0ed3ca0 01485414 e1d07f80 f0f39f58 44ef35f1 f62fc2ac f5f5e290 d23106c0 c015a891 0007 0004 Call Trace: [c015972b] __posix_lock_file+0x44e/0x47f [c012b920] getnstimeofday+0x2b/0xaf [c015a891] fcntl_setlk+0xff/0x1f6 [c011d836] do_setitimer+0xfa/0x226 [c0156b87] sys_fcntl64+0x74/0x85 [c0103ade] syscall_call+0x7/0xb === Code: 74 1b 8b 15 30 93 48 c0 8d 43 04 89 53 04 89 42 04 a3 30 93 48 c0 c7 40 04 30 93 48 c0 5b 5e c3 
53 89 c3 8d 40 1c 39 43 1c 74 04 0f 0b eb fe 8d 43 0c 39 43 0c 74 04 0f 0b eb fe 8d 43 04 39 43 EIP: [c0158f59] locks_free_lock+0xb/0x3b SS:ESP 0068:da3d7f04 BUG: unable to handle kernel paging request at virtual address 9ee420b0 printing eip: c014ab7d *pde = Oops: 0002 [#2] Modules linked in: ipt_iprange ipt_REDIRECT capi kernelcapi capifs ipt_REJECT xt_tcpudp xt_state xt_limit ipt_LOG ipt_MASQUERADE iptable_mangle iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables b44 ohci1394 ieee1394 nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lcd tda827x saa7134_dvb dvb_pll video_buf_dvb tuner tda1004x ves1820 usb_storage usblp saa7134 compat_ioctl32 budget_ci budget_core dvb_ttpci dvb_core saa7146_vv video_buf saa7146 ttpci_eeprom via_agp ir_kbd_i2c videodev v4l2_common v4l1_compat ir_common agpgart CPU:0 EIP:0060:[c014ab7d]Not tainted VLI EFLAGS: 00010082 (2.6.22.6 #1) EIP is at free_block+0x61/0xfb eax: a75b2c19 ebx: c1cf6c10 ecx: e1d070c4 edx: 9ee420ac esi: e1d07000 edi: dfde6960 ebp: dfde7620 esp: dfd87f44 ds: 007b es: 007b fs: gs: ss: 0068 Process events/0 (pid: 4, ti=dfd86000 task=dfdc4a50 task.ti=dfd86000) Stack: 0012 0018 c1cf6c10 c1cf6c10 0018 c1cf6c00 dfde7620 c014ac86 dfde6960 dfde7620 c0521d20 c014b869 dfde69e0 c0521d20 c014b827 c0125955 dfdc4b5c 8f0c99c0 Call Trace: [c014ac86] drain_array+0x6f/0x89 [c014b869] cache_reap+0x42/0xde [c014b827] cache_reap+0x0/0xde [c0125955] run_workqueue+0x6b/0xdf [c0125ec7] worker_thread+0x0/0xbd [c0125f79] worker_thread+0xb2/0xbd [c0128221] autoremove_wake_function+0x0/0x35 [c01280cc] kthread+0x36/0x5a [c0128096] kthread+0x0/0x5a [c0104607] kernel_thread_helper+0x7/0x10 === Code: 8b 02 25 00 40 02 00 3d 00 40 02 00 75 03 8b 52 0c 8b 02 84 c0 78 04 0f 0b eb fe 8b 72 1c 8b 54 24 28 8b 46 04 8b 7c 95 4c 8b 16 89 42 04 89 10 2b 4e 0c c7 06 00 01 10 00 c7 46 04 00 02 20 00 EIP: [c014ab7d] free_block+0x61/0xfb SS:ESP 0068:dfd87f44 [ cut here ] kernel BUG at fs/locks.c:171! 
invalid opcode: [#3] Modules linked in: ipt_iprange ipt_REDIRECT capi kernelcapi capifs ipt_REJECT xt_tcpudp xt_state xt_limit ipt_LOG ipt_MASQUERADE iptable_mangle iptable_nat nf_conntrack_ipv4 iptable_filter ip_tables x_tables b44 ohci1394 ieee1394 nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lcd tda827x saa7134_dvb dvb_pll video_buf_dvb tuner tda1004x ves1820 usb_storage usblp saa7134 compat_ioctl32 budget_ci budget_core dvb_ttpci dvb_core saa7146_vv video_buf saa7146 ttpci_eeprom via_agp ir_kbd_i2c videodev v4l2_common v4l1_compat ir_common agpgart CPU:
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Thursday 13 September 2007 23:03, David Chinner wrote: On Thu, Sep 13, 2007 at 03:23:21AM +1000, Nick Piggin wrote: Well, it may not be easy to _fix_, but it's easy to try a few improvements ;) How do I make an image and run a workload that will coerce XFS into doing a significant number of vmaps? # mkfs.xfs -n size=16384 dev to create a filesystem with a 16k directory block size on a 4k page machine. Then just do operations on directories with lots of files in them (tens of thousands). Every directory operation will require at least one vmap in this situation - e.g. a traversal will result in lots and lots of blocks being read that will require vmap() for every directory block read from disk and an unmap almost immediately afterwards when the reference is dropped Ah, wow, thanks: I can reproduce it.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 11:49, David Chinner wrote: On Tue, Sep 11, 2007 at 04:00:17PM +1000, Nick Piggin wrote: OTOH, I'm not sure how much buy-in there was from the filesystems guys. Particularly Christoph H and XFS (which is strange because they already do vmapping in places). I think they use vmapping because they have to, not because they want to. They might be a lot happier with fsblock if it used contiguous pages for large blocks whenever possible - I don't know for sure. The metadata accessors they might be unhappy with because it's inconvenient but as Christoph Hellwig pointed out at VM/FS, the filesystems who really care will convert. Sure, they would rather not to. But there are also a lot of ways you can improve vmap more than what XFS does (or probably what darwin does) (more persistence for cached objects, and batched invalidates for example). XFS already has persistence across the object life time (which can be many tens of seconds for a frequently used buffer) But you don't do a very good job. When you go above 64 cached mappings, you purge _all_ of them. fsblock's vmap cache can have a much higher number (if you want), and purging can just unmap a batch which is decided by a simple LRU (thus important metadata gets saved). and it also does batched unmapping of objects as well. It also could do a lot better at unmapping. Currently you're just calling vunmap a lot of times in sequence. That still requires global IPIs and TLB flushing every time. This simple patch should easily be able to reduce that number by 2 or 3 orders of magnitude (maybe more on big systems). http://www.mail-archive.com/[EMAIL PROTECTED]/msg03956.html vmap area locking and data structures could also be made a lot better quite easily, I suspect. 
There are also a lot of trivial things you can do to make a lot of those accesses not require vmaps (and less trivial things, but even such things as binary searches over multiple pages should be quite possible with a bit of logic). Yes, we already do many of these things (via xfs_buf_offset()), but that is not good enough for something like a memcpy that spans multiple pages in a large block (think btree block compaction, splits and recombines). fsblock_memcpy(fsblock *src, int soff, fsblock *dst, int doff, int size); ? IOWs, we already play these vmap harm-minimisation games in the places where we can, but still the overhead is high and something we'd prefer to be able to avoid. I don't think you've looked nearly far enough with all this low hanging fruit. I just gave 4 things which combined might easily reduce xfs vmap overhead by several orders of magnitude, all without changing much code at all.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 11:49, David Chinner wrote: On Tue, Sep 11, 2007 at 04:00:17PM +1000, Nick Piggin wrote: OTOH, I'm not sure how much buy-in there was from the filesystems guys. Particularly Christoph H and XFS (which is strange because they already do vmapping in places). I think they use vmapping because they have to, not because they want to. They might be a lot happier with fsblock if it used contiguous pages for large blocks whenever possible - I don't know for sure. The metadata accessors they might be unhappy with because it's inconvenient but as Christoph Hellwig pointed out at VM/FS, the filesystems who really care will convert. Sure, they would rather not to. But there are also a lot of ways you can improve vmap more than what XFS does (or probably what darwin does) (more persistence for cached objects, and batched invalidates for example). XFS already has persistence across the object life time (which can be many tens of seconds for a frequently used buffer) But you don't do a very good job. When you go above 64 vmaps cached, you purge _all_ of them. fsblock's vmap cache can have a much higher number (if you want), and purging will only unmap a smaller batch, decided by a simple LRU. and it also does batched unmapping of objects as well. It also could do a lot better at unmapping. Currently you're just calling vunmap a lot of times in sequence. That still requires global IPIs and TLB flushing every time. This simple patch should easily be able to reduce that number by 2 or 3 orders of magnitude on 64-bit systems. Maybe more if you increase the batch size. http://www.mail-archive.com/[EMAIL PROTECTED]/msg03956.html vmap area manipulation scalability and search complexity could also be improved quite easily, I suspect. 
There are also a lot of trivial things you can do to make a lot of those accesses not require vmaps (and less trivial things, but even such things as binary searches over multiple pages should be quite possible with a bit of logic). Yes, we already do many of these things (via xfs_buf_offset()), but that is not good enough for something like a memcpy that spans multiple pages in a large block (think btree block compaction, splits and recombines). fsblock_memcpy(fsblock *src, int soff, fsblock *dst, int doff, int size); ? IOWs, we already play these vmap harm-minimisation games in the places where we can, but still the overhead is high and something we'd prefer to be able to avoid. I don't think you've looked very far with all this low hanging fruit. The several ways I suggested combined might easily reduce xfs vmap overhead by several orders of magnitude, all without changing much code at all. Can you provide a formula to reproduce these workloads where vmap overhead in XFS is a problem? (huge IO capacity need not be an issue, because I could try reproducing it on a 64-way on ramdisks for example).
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 10:00, Christoph Lameter wrote: On Tue, 11 Sep 2007, Nick Piggin wrote: Yes. I think we differ on our interpretations of okay. In my interpretation, it is not OK to use this patch as a way to solve VM or FS or IO scalability issues, especially not while the alternative approaches that do _not_ have these problems have not been adequately compared or argued against. We never talked about not being able to solve scalability issues with this patchset. The alternate approaches were discussed at the VM MiniSummit and at the VM/FS meeting. You organized the VM/FS summit. I know you were there and were arguing for your approach. That was not sufficient? I will still argue that my approach is the better technical solution for large block support than yours, I don't think we made progress on that. And I'm quite sure we agreed at the VM summit not to rely on your patches for VM or IO scalability. So even after all this time you do not understand what the fundamental problem is with anti-frag and yet you are happy to waste both our time in endless flamewars telling me how wrong I am about it. We obviously have discussed this before, and the only point of you asking this question seems to be to have me repeat the whole line of argument again? But you just showed in two emails that you don't understand what the problem is. To reiterate: lumpy reclaim does *not* invalidate my formulae; and antifrag does *not* isolate the issue. Forgive me if I'm starting to be rude, Christoph. This is really irritating. Sorry but I have had too much exposure to philosophy. Talk about absolutes like guarantees (that do not exist even for order 0 allocs) and unlikely memory fragmentation scenarios to show that something does not work seems to be getting into some metaphysical realm where there is no data anymore to draw any firm conclusions. Some problems cannot be solved easily or at all within reasonable constraints. I have no problems with that. 
And it is a valid stance to take on the fragmentation issue, and the hash collision problem, etc. But what do you say about viable alternatives that do not have to worry about these unlikely scenarios, full stop? So, why should we not use fsblock for higher order page support? Last times this came up, you dismissed the fsblock approach because it adds another layer of complexity (even though it is simpler than the buffer layer it replaces); and also had some other strange objection like you thought it provides higher order pages or something. And as far as the problems I bring up with your approach, you say they shouldn't be likely, don't really matter, can be fixed as we go along, or want me to show they are a problem in a real world situation!! :P Software reliability is inherently probabilistic, otherwise we would not have things like CRC sums and SHA1 algorithms. It's just a matter of reducing the failure rate sufficiently. The failure rate for lower order allocations (0-3) seems to have been significantly reduced in 2.6.23 through lumpy reclaim. If antifrag measures are not successful (likely for 2M allocs) then other methods (like the large page pools that you skipped when reading my post) will need to be used. I didn't skip that. We have large page pools today. How does that give first class support to those allocations if you have to have memory reserves?
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tuesday 11 September 2007 16:03, Christoph Lameter wrote: 5. VM scalability Large block sizes mean less state keeping for the information being transferred. For a 1TB file one needs to handle 256 million page structs in the VM if one uses 4k page size. A 64k page size reduces that amount to 16 million. If the limitations in existing filesystems are removed then even higher reductions become possible. For very large files like that a page size of 2 MB may be beneficial which will reduce the number of page structs to handle to 512k. The variable nature of the block size means that the size can be tuned at file system creation time for the anticipated needs on a volume. There is a limitation in the VM. Fragmentation. You keep saying this is a solved issue and just assuming you'll be able to fix any cases that come up as they happen. I still don't get the feeling you realise that there is a fundamental fragmentation issue that is unsolvable with Mel's approach. The idea that there even _is_ a bug to fail when higher order pages cannot be allocated was also brushed aside by some people at the vm/fs summit. I don't know if those people had gone through the math about this, but it goes somewhat like this: if you use a 64K page size, you can run out of memory with 93% of your pages free. If you use a 2MB page size, you can fail with 99.8% of your pages still free. That's 64GB of memory used on a 32TB Altix. If you don't consider that is a problem because you don't care about theoretical issues or nobody has reported it from running -mm kernels, then I simply can't argue against that on a technical basis. But I'm totally against introducing known big fundamental problems to the VM at this stage of the kernel. God knows how long it takes to ever fix them in future after they have become pervasive throughout the kernel. IMO the only thing that higher order pagecache is good for is a quick hack for filesystems to support larger block sizes. 
And after seeing it is fairly ugly to support mmap, I'm not even really happy for it to do that. If VM scalability is a problem, then it needs to be addressed in other areas anyway for order-0 pages, and if contiguous pages helps IO scalability or crappy hardware, then there is nothing stopping us from *attempting* to get contiguous memory in the current scheme. Basically, if you're placing your hopes for VM and IO scalability on this, then I think that's a totally broken thing to do and will end up making the kernel worse in the years to come (except maybe on some poor configurations of bad hardware).
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Tuesday 11 September 2007 22:12, Jörn Engel wrote: On Tue, 11 September 2007 04:52:19 +1000, Nick Piggin wrote: On Tuesday 11 September 2007 16:03, Christoph Lameter wrote: 5. VM scalability Large block sizes mean less state keeping for the information being transferred. For a 1TB file one needs to handle 256 million page structs in the VM if one uses 4k page size. A 64k page size reduces that amount to 16 million. If the limitation in existing filesystems are removed then even higher reductions become possible. For very large files like that a page size of 2 MB may be beneficial which will reduce the number of page struct to handle to 512k. The variable nature of the block size means that the size can be tuned at file system creation time for the anticipated needs on a volume. The idea that there even _is_ a bug to fail when higher order pages cannot be allocated was also brushed aside by some people at the vm/fs summit. I don't know if those people had gone through the math about this, but it goes somewhat like this: if you use a 64K page size, you can run out of memory with 93% of your pages free. If you use a 2MB page size, you can fail with 99.8% of your pages still free. That's 64GB of memory used on a 32TB Altix. While I agree with your concern, those numbers are quite silly. The They are the theoretical worst case. Obviously with a non trivially sized system and non-DoS workload, they will not be reached. chances of 99.8% of pages being free and the remaining 0.2% being perfectly spread across all 2MB large_pages are lower than those of SHA1 creating a collision. I don't see anyone abandoning git or rsync, so your extreme example clearly is the wrong one. Again, I agree with your concern, even though your example makes it look silly. It is not simply a question of once-off chance for an all-at-once layout to fail in this way. 
Fragmentation slowly builds over time, and especially if you do actually use higher-order pages for a significant number of things (unlike we do today), then the problem will become worse. If you have any part of your workload that is affected by fragmentation, then it will cause unfragmented regions to eventually be used for fragmentation inducing allocations (by definition -- if it did not, eg. then there would be no fragmentation problem and no need for Mel's patches). I don't know what happens as time tends towards infinity, but I don't think it will be good. At millions of allocations per second, how long does it take to produce an unacceptable number of free pages before the ENOMEM condition? Furthermore, what *is* an unacceptable number? I don't know. I am not trying to push this feature in, so the burden is not mine to make sure it is OK. Yes, we already have some of these problems today. Introducing more and worse problems and justifying them because of existing ones is much more silly than my quoting of the numbers. IMO.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 04:31, Mel Gorman wrote: On Tue, 2007-09-11 at 18:47 +0200, Andrea Arcangeli wrote: Hi Mel, Hi, On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote: that increasing the pagesize like what Andrea suggested would lead to internal fragmentation problems. Regrettably we didn't discuss Andrea's The config_page_shift guarantees that kernel stacks or whatever other non-defragmentable allocations go into the same 64k non-defragmentable page. Not like with the SGI design, where an 8k kernel stack could be allocated in the first 64k page, and then another 8k stack could be allocated in the next 64k page, effectively pinning all 64k pages until Nick's worst case scenario triggers. In practice, it's pretty difficult to trigger. Buddy allocators always try and use the smallest possible sized buddy to split. Once a 64K is split for a 4K or 8K allocation, the remainder of that block will be used for other 4K, 8K, 16K, 32K allocations. The situation where multiple 64K blocks get split does not occur. Now, the worst case scenario for your patch is that a hostile process allocates a large amount of memory and mlocks() one 4K page per 64K chunk (this is unlikely in practice I know). The end result is you have many 64KB regions that are now unusable because 4K is pinned in each of them. Your approach is not immune from problems either. To me, only Nick's approach is bullet-proof in the long run. One important thing, I think: in Andrea's case, the memory will be accounted for (eg. we can limit mlock, or work within various memory accounting things). With fragmentation, I suspect it will be much more difficult to do this. It would be another layer of heuristics that will also inevitably go wrong at times if you try to limit how much fragmentation a process can do. Quite likely it is hard to make something even work reasonably well in most cases. 
We can still try to save some memory by defragging the slab a bit, but it's by far *not* required with config_page_shift. No defrag at all is required in fact. You will need some sort of defragmentation to deal with internal fragmentation. It's a very similar problem to blasting away at slab pages and still not being able to free them because objects are in use. Replace slab with large page and object with 4k page and the issues are similar. Well yes, and slab has issues today too with internal fragmentation, targeted reclaim and some (small) higher order allocations too today. But at least with config_page_shift, you don't introduce _new_ sources of problems (eg. coming from pagecache or other allocs). Sure, there are some other things -- like pagecache can actually use up more memory instead -- but there are a number of other positives that Andrea's approach has as well. It is using order-0 pages, which are first class throughout the VM; they have per-cpu queues, and do not require any special reclaim code. They also *actually do* reduce the page management overhead in the general case, unlike higher order pcache. So combined with the accounting issues, I think it is unfair to say that Andrea's is just moving the fragmentation to internal. It has a number of upsides. I have no idea how it will actually behave and perform, mind you ;) Plus there's a cost in defragging and freeing cache... the more you need defrag, the slower the kernel will be. approach in depth. Well it wasn't my fault if we didn't discuss it in depth though. If it's my fault, sorry about that. It wasn't my intention. I think it did get brushed aside a little quickly too (not blaming anyone). Maybe because Linus was hostile. But *if* the idea is that page management overhead has or will become a problem that needs fixing, then neither higher order pagecache, nor (obviously) fsblock, fixes this properly. Andrea's most definitely has the potential to. 
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 04:25, Maxim Levitsky wrote: Hi, I think that the fundamental problem is not fragmentation/large pages/... The problem is the VM itself. The VM doesn't use virtual memory, that's all, that's the problem. Although this will probably be Linux 3.0, I think that the right way to solve all those problems is to make all kernel memory vmalloced (except few areas like kernel .text) It will suddenly remove the buddy allocator, it will remove the need for highmem, it will allow allocating any amount of memory (for example 4k stacks will be obsolete) It will even allow kernel memory to be swapped to disk. This is the solution, but it is very very hard. I'm not sure that it is too hard. OK it is far from trivial... This is not a new idea though, it has been floated around for a long time (since before Linux I'm sure, although I have no references). There are lots of reasons why such an approach has fundamental performance problems too, however. Your kernel can't use huge tlbs for a lot of memory, you can't find the physical address of a page without walking page tables, defragmenting still has a significant cost in terms of moving pages and flushing TLBs etc. So the train of thought up to now has been that a virtually mapped kernel would be the problem with the VM itself ;) We're actually at a point now where higher order allocations are pretty rare and not a big problem (except with very special cases like hugepages and memory hotplug which can mostly get away with compromises, so we don't want to turn over the kernel just for these). So in my opinion, any increase of the dependence on higher order allocations is simply a bad move until a killer use-case can be found. They move us further away from good behaviour on our assumed ideal of an identity mapped kernel. (I don't actually dislike the idea of virtually mapped kernel. 
Maybe hardware trends will favour that model and there are some potential simple instructions a CPU can implement to help with some of the performance hits. I'm sure it will be looked at again for Linux one day)
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 06:01, Christoph Lameter wrote: On Tue, 11 Sep 2007, Nick Piggin wrote: There is a limitation in the VM. Fragmentation. You keep saying this is a solved issue and just assuming you'll be able to fix any cases that come up as they happen. I still don't get the feeling you realise that there is a fundamental fragmentation issue that is unsolvable with Mel's approach. Well my problem first of all is that you did not read the full message. It discusses that later and provides page pools to address the issue. Secondly you keep FUDding people with lots of theoretical concerns assuming Mel's approaches must fail. If there is an issue (I guess there must be right?) then please give us a concrete case of a failure that we can work against. On the other hand, you ignore the potential failure cases, and ignore the alternatives that do not have such cases.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 06:11, Christoph Lameter wrote: On Tue, 11 Sep 2007, Nick Piggin wrote: It would be interesting to craft an attack. If you knew roughly the layout and size of your dentry slab for example... maybe you could stat a whole lot of files, then open one and keep it open (maybe post the fd to a unix socket or something crazy!) when you think you have filled up a couple of MB worth of them. Repeat the process until your movable zone is gone. Or do the same things with pagetables, or task structs, or radix tree nodes, etc.. these are the kinds of things I worry about (as well as just the gradual natural degradation). I guess you would have to run that without my targeted slab reclaim patchset? Otherwise the slabs that are in the way could be reclaimed and you could not produce your test case. I didn't realise you had patches to move pinned dentries, radix tree nodes, task structs, page tables etc. Did I miss them in your last patchset?
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 06:01, Christoph Lameter wrote: On Tue, 11 Sep 2007, Nick Piggin wrote: There is a limitation in the VM. Fragmentation. You keep saying this is a solved issue and just assuming you'll be able to fix any cases that come up as they happen. I still don't get the feeling you realise that there is a fundamental fragmentation issue that is unsolvable with Mel's approach. Well my problem first of all is that you did not read the full message. It discusses that later and provides page pools to address the issue. Secondly you keep FUDding people with lots of theoretical concerns assuming Mel's approaches must fail. If there is an issue (I guess there must be right?) then please give us a concrete case of a failure that we can work against. And BTW, before you accuse me of FUD, I'm actually talking about the fragmentation issues on which Mel I think mostly agrees with me at this point. Also, do you really have a rational reason why we should just up and accept all these big changes, given that, while there are lots of theoretical issues, the person pointing them out to you hasn't happened to give you a concrete failure case? Oh, and the actual performance benefit is not really even quantified yet, crappy hardware notwithstanding, and neither has a proper evaluation of the alternatives. So... would you drive over a bridge if the engineer had this mindset?
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 06:53, Mel Gorman wrote: On (11/09/07 11:44), Nick Piggin didst pronounce: However, this discussion belongs more with the non-existent-remove-slab patch. Based on what we've seen since the summits, we need a thorough analysis with benchmarks before making a final decision (kernbench, ebizzy, tbench (netpipe if someone has the time/resources), hackbench and maybe sysbench as well as something the filesystem people recommend to get good coverage of the subsystems). True. Aside, it seems I might have been mistaken in saying Christoph is proposing to use higher order allocations to fix the SLUB regression. Anyway, I agree let's not get sidetracked about this here. I'd rather not get side-tracked here. I regret you feel steam-rolled but I think grouping pages by mobility is the right thing to do for better usage of the TLB by the kernel and for improving hugepage support in userspace minimally. We never really did see eye-to-eye but this way, if I'm wrong you get to chuck eggs down the line. No it's a fair point, and even the hugepage allocations alone are a fair point. From the discussions I think it seems like quite probably the right thing to do pragmatically, which is what Linux is about and I hope will result in a better kernel in the end. So I don't have complaints except from the little ivory tower ;) Sure. And some people run workloads where fragmentation is likely never going to be a problem, they are shipping this poorly configured hardware now or soon, so they don't have too much interest in doing it right at this point, rather than doing it *now*. OK, that's a valid reason which is why I don't use the argument that we should do it correctly or never at all. So are we saying the right thing to do is go with fs-block from day 1 once we get it to optimistically use high-order pages? 
I think your concern might be that if this goes in then it'll be harder to justify fsblock in the future because it'll be solving a theoretical problem that takes months to trigger if at all. i.e. The filesystem people will push because apparently large block support as it is solves world peace. Is that accurate? Heh. It's hard to say. I think fsblock could take a while to implement, regardless of high order pages or not. I actually would like to be able to pass down a mandate to say higher order pagecache will never get merged, simply so that these talented people would work on fsblock ;) But that's not my place to say, and I'm actually not arguing that high order pagecache does not have uses (especially as a practical, shorter-term solution which is unintrusive to filesystems). So no, I don't think I'm really going against the basics of what we agreed in Cambridge. But it sounds like it's still being billed as first-order support right off the bat here. OTOH, I'm not sure how much buy-in there was from the filesystems guys. Particularly Christoph H and XFS (which is strange because they already do vmapping in places). I think they use vmapping because they have to, not because they want to. They might be a lot happier with fsblock if it used contiguous pages for large blocks whenever possible - I don't know for sure. The metadata accessors they might be unhappy with because it's inconvenient but as Christoph Hellwig pointed out at VM/FS, the filesystems who really care will convert. Sure, they would rather not to. But there are also a lot of ways you can improve vmap more than what XFS does (or probably what darwin does) (more persistence for cached objects, and batched invalidates for example). There are also a lot of trivial things you can do to make a lot of those accesses not require vmaps (and less trivial things, but even such things as binary searches over multiple pages should be quite possible with a bit of logic). 
It would be interesting to craft an attack. If you knew roughly the layout and size of your dentry slab for example... maybe you could stat a whole lot of files, then open one and keep it open (maybe post the fd to a unix socket or something crazy!) when you think you have filled up a couple of MB worth of them. I might regret saying this, but it would be easier to craft an attack using pagetable pages. It's woefully difficult to do but it's probably doable. I say pagetables because while slub targetted reclaim is on the cards and memory compaction exists for page cache pages, pagetables are currently pinned with no prototype patch existing to deal with them. But even so, you can just hold an open fd in order to pin the dentry you want. My attack would go like this: get the page size and allocation group size for the machine, then get the number of dentries required to fill a slab. Then read in that many dentries and pin one of them. Repeat the process. Even if there is other activity on the system, it seems possible that such a thing will cause some headaches after not too long a time. Some sources of pinned memory are going to be better than others for this of course.
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 07:41, Christoph Lameter wrote: On Tue, 11 Sep 2007, Nick Piggin wrote: I think I would have as good a shot as any to write a fragmentation exploit, yes. I think I've given you enough info to do the same, so I'd like to hear a reason why it is not a problem. No you have not explained why the theoretical issues continue to exist given even just considering Lumpy Reclaim in .23 nor what effect the antifrag patchset would have. So how does lumpy reclaim, your slab patches, or anti-frag have much effect on the worst case situation? Or help much against a targetted fragmentation attack? And you have used a 2M pagesize which is irrelevant to this patchset that deals with blocksizes up to 64k. In my experience the use of blocksize <= PAGE_COSTLY_ORDER (32k) is reasonably safe. I used EXACTLY the page sizes that you brought up in your patch description (ie. 64K and 2MB).
Re: [00/41] Large Blocksize Support V7 (adds memmap support)
On Wednesday 12 September 2007 07:48, Christoph Lameter wrote: On Tue, 11 Sep 2007, Nick Piggin wrote: But that's not my place to say, and I'm actually not arguing that high order pagecache does not have uses (especially as a practical, shorter-term solution which is unintrusive to filesystems). So no, I don't think I'm really going against the basics of what we agreed in Cambridge. But it sounds like it's still being billed as first-order support right off the bat here. Well it seems that we have different interpretations of what was agreed on. My understanding was that the large blocksize patchset was okay provided that I supply an acceptable mmap implementation and put a warning in. Yes. I think we differ on our interpretations of okay. In my interpretation, it is not OK to use this patch as a way to solve VM or FS or IO scalability issues, especially not while the alternative approaches that do _not_ have these problems have not been adequately compared or argued against. But even so, you can just hold an open fd in order to pin the dentry you want. My attack would go like this: get the page size and allocation group size for the machine, then get the number of dentries required to fill a slab. Then read in that many dentries and pin one of them. Repeat the process. Even if there is other activity on the system, it seems possible that such a thing will cause some headaches after not too long a time. Some sources of pinned memory are going to be better than others for this of course, so yeah maybe pagetables will be a bit easier (I don't know). Well even without slab targeted reclaim: Mel's antifrag will sort the dentries into separate blocks of memory and so isolate the issue. So even after all this time you do not understand what the fundamental problem is with anti-frag and yet you are happy to waste both our time in endless flamewars telling me how wrong I am about it. Forgive me if I'm starting to be rude, Christoph. This is really irritating. 
Re: [07/36] Use page_cache_xxx in mm/filemap_xip.c
Christoph Hellwig wrote: On Tue, Aug 28, 2007 at 09:49:38PM +0200, Jörn Engel wrote: On Tue, 28 August 2007 12:05:58 -0700, [EMAIL PROTECTED] wrote: - index = *ppos >> PAGE_CACHE_SHIFT; - offset = *ppos & ~PAGE_CACHE_MASK; + index = page_cache_index(mapping, *ppos); + offset = page_cache_offset(mapping, *ppos); Part of me feels inclined to merge this patch now because it makes the code more readable, even if page_cache_index() is implemented as #define page_cache_index(mapping, pos) ((pos) >> PAGE_CACHE_SHIFT) I know there is little use in yet another global search'n'replace wankfest and Andrew might wash my mouth just for mentioning it. Still, hard to dislike this part of your patch. Yes, I suggested that before. Andrew seems to somehow hate this patchset, but even if we don't get it in, the lowercase macros are much much better than the current PAGE_CACHE_* confusion. I don't mind the change either. The open coded macros are very recognisable, but it isn't hard to have a typo and get one slightly wrong. If it goes upstream now it wouldn't have the mapping argument though, would it? Or the need to replace PAGE_CACHE_SIZE I guess. -- SUSE Labs, Novell Inc.
[rfc] block-based page writeout
Hi, I've always liked the idea of being able to do writeout directly based on block number, rather than the valiant but doomed-to-be-suboptimal heuristics that our current dirty writeout system does, as it is running above the pagecache and doesn't know about poor file layouts, or interaction between multiple files or file and metadata. My idea for block based writeout is to have dirty pagecache get inserted into a data structure based on its (block device, block number), and sorted in order of block number for a given block device. You then have a kernel thread per block device, bdflush, which does the background writeout in order. There are minimal heuristics involved here, best is probably just submit blocks before a particular age in a one-way elevator order (which my patch doesn't quite get right). But it would be easy to do things like detect idle block devices to kick off writeout sooner, or write out _all_ dirty blocks in ascending order (which my patch does -- this trades off timeliness for throughput). I haven't worked out what to do about delayed allocation filesystems, exactly. Perhaps blocks without mappings yet could be given some default heuristic by the filesystem for placement in the dirty data structure, or else some kind of scheme to drop into a scan-by-logical-offset mode to do the allocations contiguously before resuming block based writeout. I haven't thought that far ahead ;) Simple patch here to implement this against fsblock. Sorry for an experimental patch on top of an experimental patch... I seem to need some features in fsblock but not buffer.c to make this work nicely (eg. tighter coupling of dirty state between block and page). Patch is pretty experimental, and has a known use-after-free bug on unmount (I'm not going to get much more time for hacking between now and vm/fs meetup, and I'd like to be able to discuss this if people think it is interesting, so I had to rush). 
Mainly attached as proof of concept, which I don't expect anybody is actually going to use. But comments on the concept would be welcome. I wonder why there hasn't been much noise about doing something like this -- are there some known/obvious problems? The data structure I'm using is a per-block_device rbtree for simplicity now. I perhaps foresee scalability issues with locking, so it may be more appropriate to go with a radix-tree or b-tree or something with a more scalable write-side. I don't think this should be a showstopper, though. Thanks, Nick ---
Index: linux-2.6/fs/fsblock.c
===================================================================
--- linux-2.6.orig/fs/fsblock.c
+++ linux-2.6/fs/fsblock.c
@@ -21,6 +21,9 @@
 #include <linux/gfp.h>
 #include <linux/cache.h>
 #include <linux/profile.h>
+#include <linux/rbtree.h>
+#include <linux/kthread.h>
+#include <linux/delay.h>
 //#include <linux/buffer_head.h> /* too much crap in me */
 extern int try_to_free_buffers(struct page *);
@@ -45,6 +48,15 @@ void __init fsblock_init(void)
 		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD, NULL);
 }
+static void clear_block_dirty(struct fsblock *block);
+
+static int test_and_set_block_dirty(struct fsblock *block);
+
+static void set_block_dirty(struct fsblock *block)
+{
+	test_and_set_block_dirty(block);
+}
+
 static void init_block(struct page *page, struct fsblock *block, unsigned int bits)
 {
 	block->flags = 0;
@@ -472,14 +484,14 @@ int fsblock_set_page_dirty(struct page *
 		for_each_block(block, b) {
 			FSB_BUG_ON(!test_bit(BL_uptodate, &b->flags));
 			if (!test_bit(BL_dirty, &b->flags)) {
-				set_bit(BL_dirty, &b->flags);
+				set_block_dirty(b);
 				ret = 1;
 			}
 		}
 	} else {
 		FSB_BUG_ON(!test_bit(BL_uptodate, &block->flags));
 		if (!test_bit(BL_dirty, &block->flags)) {
-			set_bit(BL_dirty, &block->flags);
+			set_block_dirty(block);
 			ret = 1;
 		}
 	}
@@ -1329,6 +1341,37 @@ static int write_block(struct fsblock *b
 	return submit_block(block, WRITE);
 }
+static int fsblock_test_clear_dirty(struct fsblock *block)
+{
+	int ret = 0;
+	struct page *page = block->page;
+
+	if (test_bit(BL_dirty, &block->flags)) {
+		FSB_BUG_ON(!test_bit(BL_uptodate, &block->flags));
+		clear_block_dirty(block);
+		ret = 1;
+
+		if (fsblock_subpage(block)) {
+			struct fsblock *b;
+			for_each_block(page_blocks(page), b) {
+				if (test_bit(BL_dirty, &b->flags))
+					goto page_dirty;
+			}
+		}
+		if (!fsblock_superpage(block)) {
+			ret = clear_page_dirty_for_io(page);
+
[patch] 2.6.23-rc3: fsblock
Hi, I'm still plugging away at fsblock slowly. Haven't really got around to finishing up any big new features, but there has been a lot of bug fixing and little API changes since the last release. I still think fsblock has merit, and even if a more extent-based approach ends up working better for most things, I think a block-based one will still have its place (either along-side or underneath), and I think fsblock is just much better than buffer_heads. fsblock proper and minix port patches are on kernel.org, because they're pretty big and not well split-up to be posted here. http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fsblock/2.6.23-rc3/ Some notes: mbforget is restored (though very poor implementation for now), and filesystems do not have their underlying metadata unmapped or synched by default for new blocks. They could do this in their get_block function if they want, or otherwise ensure that metadata blocks are not just abandoned with possible dirty data in-flight. Basically my thinking is that it is quite an extra expense that should be optional. Made use of lockless radix-tree. Pages in superpage-blocks are tracked with the pagecache radix-tree, and a fair amount of lookups can be involved. This makes those operations faster and more scalable. I've got rid of RCU completely from the fsblock data structures. fsblock's getblk equivalent must now tolerate blocking (if this turns out to be too limiting, I might be convinced to add back RCU or some kind of spin locking). However I want to get rid of RCU freeing, because a normal use-case (the nobh mode) is expected to have a lot of fsblock structures being allocated and freed as IO is happening, so this needs to be really efficient and RCU can degrade CPU cache properties in these situations. Added a real vmap cache. It is still pretty primitive, but it gets like 99% hit rate in my basic tests, so I didn't bother with anything more fancy yet.
Changed block memory mapping API to only support metadata blocks at this stage. Just makes a few implementation things easier at this point. Added the /proc/sys/vm/fsblock_no_cache tunable, which disables caching of fsblock structures, and actively frees them after IOs etc have finished. TODO - introduce, and make use of, range based aops APIs and remove all the silly locking code for multi-page blocks (as well as do nice multi-block bios). In progress. - write an extent-based block mapping module that filesystems could optionally use to efficiently cache block mappings. In progress. - port some real filesystems. ext2 is in progress, and I like the idea of btrfs, it seems like it would be a lot more manageable than ext3 for me.
Re: [patch][rfc] fs: fix nobh error handling
On Wed, Aug 08, 2007 at 07:39:42AM -0700, Mingming Cao wrote: On Wed, 2007-08-08 at 08:07 -0500, Dave Kleikamp wrote: For jfs's sake, I don't really care if it ever uses nobh again. I originally started using it because I figured the movement was away from buffer heads and jfs seemed just as happy with the nobh functions (after a few bugs were flushed out). I don't think jfs really benefited though. That said, I don't really know who cares about the nobh option in ext3. Actually IBM/LTC use the nobh option in ext3 on our internal kernel development server, to control the consumption of a large amount of low memory space. Fair enough... but you mean to say that page reclaim does not control the consumption of low memory well enough? I.e. what is the exact benefit of using nobh? (I'm not trying to argue to remove it... yet... just interested)
[patch][rfc] fs: fix nobh error handling
Hi, So I seem to have hit some resistance to the idea of removing nobh mode with the new aops patches. I don't need to reiterate my reasons for disliking nobh mode (if the buffer layer is crap, it should be improved rather than circumvented -- see fsblock). But at any rate, this view does not seem to be unanimous, and it is completely sensible to be averse to removing things that exist and somewhat work today. So I've been mulling over the best way to have nobh and the new aops coexist nicely. I think I've got a reasonable idea now... the main problem with nobh mode is that it blissfully throws away most of the buffer state; when an error does happen, it's up shit creek. This becomes more of a problem with error recovery after a short write allowed by the new aop API: the short write is actually going to happen relatively often as part of normal operation, rather than just when the disk subsystem fails. Still, the fix for existing error handling will look basically the same as the fix for short write error handling. So I propose this patch against head in order not to confuse the issues (and to fix the existing bugs in release kernels). If this gets accepted, then I can subsequently look at doing _additional_ patches for the new aops series in -mm. Note: I actually did observe some of these problems by using IO error injection. I wouldn't say the patch is completely stress tested yet, but it did survive the same tests without any file corruption. Nick --- nobh mode error handling is not just pretty slack, it's wrong. One cannot zero out the whole page to ensure new blocks are zeroed, because it just brings the whole page uptodate with zeroes even if that may not be the correct uptodate data. Also, other parts of the page may already contain dirty data which would get lost by zeroing it out. Thirdly, the writeback of zeroes to the new blocks will also erase existing blocks. All these conditions are pagecache and/or filesystem corruption.
The problem comes about because we didn't keep track of which buffers actually are new or old. However it is not enough just to keep only this state, because at the point we start dirtying parts of the page (new blocks, with zeroes), the handling of IO errors becomes impossible without buffers because the page may only be partially uptodate, in which case the page flags alone cannot capture the state of the parts of the page. So allocate all buffers for the page upfront, but leave them unattached so that they don't pick up any other references and can be freed when we're done. If the error path is hit, then zero the new buffers as the regular buffer path does, then attach the buffers to the page so that it can actually be written out correctly and be subject to the normal IO error handling paths. As an upshot, we save 1K of kernel stack on ia64 or powerpc 64K page systems. Signed-off-by: Nick Piggin [EMAIL PROTECTED] --
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2272,51 +2272,53 @@ int nobh_prepare_write(struct page *page
 	struct inode *inode = page->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
 	const unsigned blocksize = 1 << blkbits;
-	struct buffer_head map_bh;
-	struct buffer_head *read_bh[MAX_BUF_PER_PAGE];
+	struct buffer_head *head, *bh;
 	unsigned block_in_page;
-	unsigned block_start;
+	unsigned block_start, block_end;
 	sector_t block_in_file;
 	char *kaddr;
 	int nr_reads = 0;
-	int i;
 	int ret = 0;
 	int is_mapped_to_disk = 1;
+	if (PagePrivate(page))
+		return block_prepare_write(page, from, to, get_block);
+
 	if (PageMappedToDisk(page))
 		return 0;
+	head = alloc_page_buffers(page, blocksize, 1);
+
 	block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
-	map_bh.b_page = page;
 	/*
	 * We loop across all blocks in the page, whether or not they are
	 * part of the affected region. This is so we can discover if the
	 * page is fully mapped-to-disk.
	 */
-	for (block_start = 0, block_in_page = 0;
+	for (block_start = 0, block_in_page = 0, bh = head;
 		block_start < PAGE_CACHE_SIZE;
-		block_in_page++, block_start += blocksize) {
-		unsigned block_end = block_start + blocksize;
+		block_in_page++, block_start += blocksize, bh = bh->b_this_page) {
 		int create;
-		map_bh.b_state = 0;
+		block_end = block_start + blocksize;
+		bh->b_state = 0;
 		create = 1;
 		if (block_start >= to)
 			create = 0;
-		map_bh.b_size = blocksize;
 		ret = get_block(inode, block_in_file + block_in_page,
-				map_bh
Re: [PATCH RFC] extent mapped page cache
On Thu, Jul 26, 2007 at 09:05:15AM -0400, Chris Mason wrote: On Thu, 26 Jul 2007 04:36:39 +0200 Nick Piggin [EMAIL PROTECTED] wrote: [ are state trees a good idea? ] One thing it gains us is finding the start of the cluster. Even if called by kswapd, the state tree allows writepage to find the start of the cluster and send down a big bio (provided I implement trylock to avoid various deadlocks). That's very true, we could potentially also do that with the block extent tree that I want to try with fsblock. If fsblock records an extent of 200MB, and writepage is called on a page in the middle of the extent, how do you walk the radix backwards to find the first dirty up to date page in the range? I mean if we also have a block extent tree between fsblock and the filesystem's get_block (which I think could be a good idea). So you would use that tree to find the block extent that you're in, then use the pagecache tree dirty tag lookup from the start of that block extent (or you could add a backward tag lookup if you just wanted to gather a small range around the given offset). I think (haven't got any code for this yet, mind you). I put the placeholder patches on hold because of a corner case where userland did O_DIRECT from a mmap'd region of the same file (Linus pointed it out to me). Basically my patches had to work in 64k chunks to avoid a deadlock in get_user_pages. With the state tree, I can allow the page to be faulted in but still properly deal with it. Oh right, I didn't think of that one. Would you still have similar issues with the external state tree? I mean, the filesystem doesn't really know why the fault is taken. O_DIRECT read from a file into mmapped memory of the same block in the file is almost hopeless I think. Racing is fine as long as we don't deadlock or expose garbage from disk. Hmm, OK you're probably right. I'll have to see how it looks. I'm sure we can find some river in Cambridge, winner gets to throw Axboe in.
Very noble of you to donate your colleague to such a worthy cause. Jens is always interested in helping solve such debates. It's a fantastic service he provides to the community. ;)
Re: [PATCH RFC] extent mapped page cache
On Wed, Jul 25, 2007 at 10:10:07PM -0400, Chris Mason wrote: On Thu, 26 Jul 2007 03:37:28 +0200 Nick Piggin [EMAIL PROTECTED] wrote: One advantage to the state tree is that it separates the state from the memory being described, allowing a simple kmap style interface that covers subpages, highmem and superpages. I suppose so, although we should have added those interfaces long ago ;) The variants in fsblock are pretty good, and you could always do an arbitrary extent (rather than block) based API using the pagecache tree if it would be helpful. Yes, you could use fsblock for the state bits and make a separate API to map the actual pages. It also more naturally matches the way we want to do IO, making for easy clustering. Well the pagecache tree is used to reasonable effect for that now. OK the code isn't beautiful ;). Granted, this might be an area where the separate state tree ends up being better. We'll see. One thing it gains us is finding the start of the cluster. Even if called by kswapd, the state tree allows writepage to find the start of the cluster and send down a big bio (provided I implement trylock to avoid various deadlocks). That's very true, we could potentially also do that with the block extent tree that I want to try with fsblock. I'm looking at cleaning up some of these aops APIs so hopefully most of the deadlock problems go away. Should be useful to both our efforts. Will post patches hopefully when I get time to finish the draft this weekend. O_DIRECT becomes a special case of readpages and writepages; the memory used for IO just comes from userland instead of the page cache. Could be, although you'll probably also need to teach the mm about the state tree and/or still manipulate the pagecache tree to prevent concurrency? Well, it isn't coded yet, but I should be able to do it from the FS specific ops.
Probably, if you invalidate all the pagecache in the range beforehand you should be able to do it (and I guess you want to do the invalidate anyway). Although, below deadlock issues might still bite somewhere... But isn't the main aim of O_DIRECT to do as little locking and synchronisation with the pagecache as possible? I thought this is why your race fixing patches got put on the back burner (although they did look fairly nice from a correctness POV). I put the placeholder patches on hold because of a corner case where userland did O_DIRECT from a mmap'd region of the same file (Linus pointed it out to me). Basically my patches had to work in 64k chunks to avoid a deadlock in get_user_pages. With the state tree, I can allow the page to be faulted in but still properly deal with it. Oh right, I didn't think of that one. Would you still have similar issues with the external state tree? I mean, the filesystem doesn't really know why the fault is taken. O_DIRECT read from a file into mmapped memory of the same block in the file is almost hopeless I think. Well I'm kind of handwaving when it comes to O_DIRECT ;) It does look like this might be another advantage of the state tree (although you aren't allowed to slow down buffered IO to achieve the locking ;)). ;) The O_DIRECT benefit is a fringe thing. I've long wanted to help clean up that code, but the real point of the patch is to make general usage faster and less complex. If I can't get there, the O_DIRECT stuff doesn't matter. Sure, although unifying code is always a plus so I like that you've got that in mind. The ability to put in additional tracking info like the process that first dirtied a range is also significant. So, I think it is worth trying. Definitely, and I'm glad you are. You haven't converted me yet, but I look forward to finding the best ideas from our two approaches when the patches are further along (ext2 port of fsblock coming along, so we'll be able to have races soon :P).
I'm sure we can find some river in Cambridge, winner gets to throw Axboe in. Very noble of you to donate your colleague to such a worthy cause.
Re: [PATCH RFC] extent mapped page cache
On Wed, Jul 25, 2007 at 08:18:53AM -0400, Chris Mason wrote: On Wed, 25 Jul 2007 04:32:17 +0200 Nick Piggin [EMAIL PROTECTED] wrote: Having another tree to store block state I think is a good idea as I said in the fsblock thread with Dave, but I haven't clicked as to why it is a big advantage to use it to manage pagecache state. (and I can see some possible disadvantages in locking and tree manipulation overhead). Yes, there are definitely costs with the state tree, it will take some careful benchmarking to convince me it is a feasible solution. But, storing all the state in the pages themselves is impossible unless the block size equals the page size. So, we end up with something like fsblock/buffer heads or the state tree. Yep, we have to have something. One advantage to the state tree is that it separates the state from the memory being described, allowing a simple kmap style interface that covers subpages, highmem and superpages. I suppose so, although we should have added those interfaces long ago ;) The variants in fsblock are pretty good, and you could always do an arbitrary extent (rather than block) based API using the pagecache tree if it would be helpful. It also more naturally matches the way we want to do IO, making for easy clustering. Well the pagecache tree is used to reasonable effect for that now. OK the code isn't beautiful ;). Granted, this might be an area where the separate state tree ends up being better. We'll see. O_DIRECT becomes a special case of readpages and writepages; the memory used for IO just comes from userland instead of the page cache. Could be, although you'll probably also need to teach the mm about the state tree and/or still manipulate the pagecache tree to prevent concurrency? But isn't the main aim of O_DIRECT to do as little locking and synchronisation with the pagecache as possible? I thought this is why your race fixing patches got put on the back burner (although they did look fairly nice from a correctness POV).
Well I'm kind of handwaving when it comes to O_DIRECT ;) It does look like this might be another advantage of the state tree (although you aren't allowed to slow down buffered IO to achieve the locking ;)). The ability to put in additional tracking info like the process that first dirtied a range is also significant. So, I think it is worth trying. Definitely, and I'm glad you are. You haven't converted me yet, but I look forward to finding the best ideas from our two approaches when the patches are further along (ext2 port of fsblock coming along, so we'll be able to have races soon :P). Thanks, Nick
[patch 1/4] ext2 convert to new aops fix
These are some write_begin aops conversion fixes for the current -mm tree. I didn't quite get it building, but did compile test at least. Patches all address the same silly bug in my directory pagecache conversions. --- ext2 directory code has a conversion overflow. Spotted by Hugh Dickins. Probably wouldn't hurt anyone in practice unless they are using 4GB directories, but must fix. Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Index: linux-2.6/fs/ext2/dir.c
===================================================================
--- linux-2.6.orig/fs/ext2/dir.c
+++ linux-2.6/fs/ext2/dir.c
@@ -424,7 +424,7 @@ ino_t ext2_inode_by_name(struct inode *
 void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
 		struct page *page, struct inode *inode)
 {
-	loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+	loff_t pos = page_offset(page) +
 			(char *) de - (char *) page_address(page);
 	unsigned len = le16_to_cpu(de->rec_len);
 	int err;
@@ -511,7 +511,7 @@ int ext2_add_link (struct dentry *dentry
 		return -EINVAL;
 got_it:
-	pos = (page->index << PAGE_CACHE_SHIFT) +
+	pos = page_offset(page) +
 		(char*)de - (char*)page_address(page);
 	err = __ext2_write_begin(NULL, page->mapping, pos, rec_len, 0,
 			&page, NULL);
@@ -569,7 +569,7 @@ int ext2_delete_entry (struct ext2_dir_e
 	}
 	if (pde)
 		from = (char*)pde - (char*)page_address(page);
-	pos = (page->index << PAGE_CACHE_SHIFT) + from;
+	pos = page_offset(page) + from;
 	lock_page(page);
 	err = __ext2_write_begin(NULL, page->mapping, pos, to - from, 0,
 			&page, NULL);
[patch 2/4] minix convert to new aops fix
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Index: linux-2.6/fs/minix/dir.c
===================================================================
--- linux-2.6.orig/fs/minix/dir.c
+++ linux-2.6/fs/minix/dir.c
@@ -311,7 +311,7 @@ int minix_delete_entry(struct minix_dir_
 	struct address_space *mapping = page->mapping;
 	struct inode *inode = (struct inode*)mapping->host;
 	char *kaddr = page_address(page);
-	loff_t pos = (page->index << PAGE_CACHE_SHIFT) + (char*)de - kaddr;
+	loff_t pos = page_offset(page) + (char*)de - kaddr;
 	unsigned len = minix_sb(inode->i_sb)->s_dirsize;
 	int err;
@@ -435,7 +435,7 @@ void minix_set_link(struct minix_dir_ent
 	struct address_space *mapping = page->mapping;
 	struct inode *dir = mapping->host;
 	struct minix_sb_info *sbi = minix_sb(dir->i_sb);
-	loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+	loff_t pos = page_offset(page) +
 			(char *)de-(char*)page_address(page);
 	int err;
[patch 3/4] sysv convert to new aops fix
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Index: linux-2.6/fs/sysv/dir.c
===================================================================
--- linux-2.6.orig/fs/sysv/dir.c
+++ linux-2.6/fs/sysv/dir.c
@@ -219,7 +219,7 @@ int sysv_add_link(struct dentry *dentry,
 		return -EINVAL;
 got_it:
-	pos = (page->index << PAGE_CACHE_SHIFT) +
+	pos = page_offset(page) +
 		(char*)de - (char*)page_address(page);
 	lock_page(page);
 	err = __sysv_write_begin(NULL, page->mapping, pos, SYSV_DIRSIZE,
@@ -246,7 +246,7 @@ int sysv_delete_entry(struct sysv_dir_en
 	struct address_space *mapping = page->mapping;
 	struct inode *inode = (struct inode*)mapping->host;
 	char *kaddr = (char*)page_address(page);
-	loff_t pos = (page->index << PAGE_CACHE_SHIFT) + (char *)de - kaddr;
+	loff_t pos = page_offset(page) + (char *)de - kaddr;
 	int err;
 	lock_page(page);
@@ -347,7 +347,7 @@ void sysv_set_link(struct sysv_dir_entry
 {
 	struct address_space *mapping = page->mapping;
 	struct inode *dir = mapping->host;
-	loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+	loff_t pos = page_offset(page) +
 			(char *)de-(char*)page_address(page);
 	int err;
[patch 4/4] ufs convert to new aops fix
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Index: linux-2.6/fs/ufs/dir.c
===================================================================
--- linux-2.6.orig/fs/ufs/dir.c
+++ linux-2.6/fs/ufs/dir.c
@@ -89,7 +89,7 @@ ino_t ufs_inode_by_name(struct inode *di
 void ufs_set_link(struct inode *dir, struct ufs_dir_entry *de,
 		struct page *page, struct inode *inode)
 {
-	loff_t pos = (page->index << PAGE_CACHE_SHIFT) +
+	loff_t pos = page_offset(page) +
 			(char *) de - (char *) page_address(page);
 	unsigned len = fs16_to_cpu(dir->i_sb, de->d_reclen);
 	int err;
@@ -379,7 +379,7 @@ int ufs_add_link(struct dentry *dentry,
 		return -EINVAL;
 got_it:
-	pos = (page->index << PAGE_CACHE_SHIFT) +
+	pos = page_offset(page) +
 		(char*)de - (char*)page_address(page);
 	err = __ufs_write_begin(NULL, page->mapping, pos, rec_len,
 			AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
@@ -547,7 +547,7 @@ int ufs_delete_entry(struct inode *inode
 	if (pde)
 		from = (char*)pde - (char*)page_address(page);
-	pos = (page->index << PAGE_CACHE_SHIFT) + from;
+	pos = page_offset(page) + from;
 	lock_page(page);
 	err = __ufs_write_begin(NULL, mapping, pos, to - from,
 			AOP_FLAG_UNINTERRUPTIBLE, &page, NULL);
Re: [PATCH RFC] extent mapped page cache
On Tue, Jul 24, 2007 at 07:25:09PM -0400, Chris Mason wrote: On Tue, 24 Jul 2007 23:25:43 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote: The tree is a critical part of the patch, but it is also the easiest to rip out and replace. Basically the code stores a range by inserting an object at an index corresponding to the end of the range. Then it does searches by looking forward from the start of the range. More or less any tree that can search and return the first key >= the requested key will work. So, I'd be happy to rip out the tree and replace with something else. Going completely lockless will be tricky, it's something that will need deep thought once the rest of the interface is sane. Just having the other tree and managing it is what makes me a little less positive of this approach, especially using it to store pagecache state when we already have the pagecache tree. Having another tree to store block state I think is a good idea as I said in the fsblock thread with Dave, but I haven't clicked as to why it is a big advantage to use it to manage pagecache state. (and I can see some possible disadvantages in locking and tree manipulation overhead).
Re: block_page_mkwrite? (Re: fault vs invalidate race (Re: -mm merge plans for 2.6.23))
David Chinner wrote: On Thu, Jul 12, 2007 at 10:54:57AM +1000, Nick Piggin wrote: Andrew Morton wrote: The fault-vs-invalidate race fix. I have belatedly learned that these need more work, so their state is uncertain. The more work may turn out being too much for you (although it is nothing exactly tricky that would introduce subtle bugs, it is a fair amount of churn). OK, so does that mean we can finally get the block_page_mkwrite patches merged? i.e.: http://marc.info/?l=linux-kernel&m=117426058311032&w=2 http://marc.info/?l=linux-kernel&m=11742607036&w=2 I've got up-to-date versions of them ready to go and they've been consistently tested thanks to the XFSQA test I wrote for the bug that it fixes. I've been holding them out-of-tree for months now because ->fault was supposed to supersede this interface. Yeah, as I've said, don't hold them back because of me. They are relatively simple enough that I don't see why they couldn't be merged in this window. -- SUSE Labs, Novell Inc.
Re: [RFC] fsblock
On Mon, Jul 09, 2007 at 10:14:06AM -0700, Christoph Lameter wrote: On Sun, 24 Jun 2007, Nick Piggin wrote: Firstly, what is the buffer layer? The buffer layer isn't really a buffer layer as in the buffer cache of unix: the block device cache is unified with the pagecache (in terms of the pagecache, a blkdev file is just like any other, but with a 1:1 mapping between offset and block). I thought that the buffer layer is essentially a method to index to sub section of a page? It converts pagecache addresses to block addresses I guess. The current implementation cannot handle blocks larger than pages, but not because use of larger pages for pagecache was anticipated (likely because it is more work, and the APIs aren't really set up for it). Why rewrite the buffer layer? Lots of people have had a desire to completely rip out the buffer layer, but we can't do that[*] because it does actually serve a useful purpose. Why the bad rap? Because the code is old and crufty, and buffer_head is an awful name. It must be among the oldest code in the core fs/vm, and the main reason is because of the inertia of so many and such complex filesystems. Hmmm I did not notice that yet but then I have not done much work there. Notice what? - data structure size. struct fsblock is 20 bytes on 32-bit, and 40 on 64-bit (could easily be 32 if we can have int bitops). Compare this to around 50 and 100ish for struct buffer_head. With a 4K page and 1K blocks, IO requires 10% RAM overhead in buffer heads alone. With fsblocks you're down to around 3%. I thought we were going to simply use the page struct instead of having buffer heads? Would that not reduce the overhead to zero? What do you mean by that? As I said, you couldn't use just the page struct for anything except page sized blocks, and even then it would require more fields or at least more flags in the page struct.
nobh mode actually tries to do something similar, however it requires multiple calls into the filesystem to first allocate the block, and then find its sector. It is also buggy and can't handle errors properly (although I'm trying to fix that). - A real nobh mode. nobh was created I think mainly to avoid problems with buffer_head memory consumption, especially on lowmem machines. It is basically a hack (sorry), which requires special code in filesystems, and duplication of quite a bit of tricky buffer layer code (and bugs). It also doesn't work so well for buffers with non-trivial private data (like most journalling ones). fsblock implements this with basically a few lines of code, and it should work in situations like ext3. Hmmm That means simply page struct are not working... I don't understand you. jbd needs to attach private data to each bh, and that can stay around for longer than the life of the page in the pagecache. - Large block support. I can mount and run an 8K block size minix3 fs on my 4K page system and it didn't require anything special in the fs. We can go up to about 32MB blocks now, and gigabyte+ blocks would only require one more bit in the fsblock flags. fsblock_superpage blocks are > PAGE_CACHE_SIZE, midpage ==, and subpage <. Core pagecache code is pretty creaky with respect to this. I think it is mostly race free, but it requires stupid unlocking and relocking hacks because the vm usually passes single locked pages to the fs layers, and we need to lock all pages of a block in offset ascending order. This could be avoided by doing locking on only the first page of a block for locking in the fsblock layer, but that's a bit scary too. Probably better would be to move towards offset,length rather than page based fs APIs where everything can be batched up nicely and this sort of non-trivial locking can be more optimal. Large blocks also have a performance black spot where an 8K sized and aligned write(2) would require an RMW in the filesystem.
Again because of the page based nature of the fs API, and this too would be fixed if the APIs were better.

The simple solution would be to use a compound page and make the head page represent the status of all the pages in the vm. Logic for that is already in place.

I do not consider that a solution because I explicitly want to allow order-0 pages here. I know about your higher order pagecache, the anti-frag and defrag work, I know about compound pages. I'm not just ignoring them because of NIH or something silly. Anyway, I have thought about just using the first page in the block for the locking, and that might be a reasonable optimisation. However for now I'm keeping it simple.

Large block memory access via the filesystem uses vmap, but it will go back to kmap if the access doesn't cross a page. Filesystems really should do this because vmap is slow as anything. I've implemented a vmap cache which
Re: [RFC] fsblock
On Mon, Jul 09, 2007 at 05:59:47PM -0700, Christoph Lameter wrote: On Tue, 10 Jul 2007, Nick Piggin wrote: Hmmm, I did not notice that yet but then I have not done much work there.

Notice what?

The bad code for the buffer heads.

Oh. Well my first mail in this thread listed some of the problems with them.

- A real nobh mode. nobh was created I think mainly to avoid problems with buffer_head memory consumption, especially on lowmem machines. It is basically a hack (sorry), which requires special code in filesystems, and duplication of quite a bit of tricky buffer layer code (and bugs). It also doesn't work so well for buffers with non-trivial private data (like most journalling ones). fsblock implements this with basically a few lines of code, and it should work in situations like ext3.

Hmmm, that means simply page structs are not working...

I don't understand you. jbd needs to attach private data to each bh, and that can stay around for longer than the life of the page in the pagecache.

Right. So just using the page struct alone won't work for the filesystems.

There are no changes to the filesystem API for large pages (although I am adding a couple of helpers to do page based bitmap ops). And I don't want to rely on contiguous memory. Why do you think handling of large pages (presumably you mean larger than page sized blocks) is strange?

We already have a way to handle large pages: compound pages.

Yes, but I don't want to use large pages and I am not going to use them (at least, they won't be mandatory). Conglomerating the constituent pages via the pagecache radix-tree seems logical to me.

Meaning overhead to handle each page still exists? This scheme cannot handle large contiguous blocks as a single entity?

Of course some things have to be done per-page if the pages are not contiguous.
I actually haven't seen that to be a problem or have much reason to think it will suddenly become a problem (although I do like Andrea's config page sizes approach for really big systems that cannot change their HW page size). - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: vm/fs meetup details
On Thu, Jul 05, 2007 at 01:54:06PM -0400, Rik van Riel wrote: Nick Piggin wrote: Hi, The vm/fs meetup will be held September 4th from 10am till 4pm (with the option of going longer), at the University of Cambridge. I am interested. A few potential topics: OK, I'll put you on the list. Hope to see you there. - improving laptop_mode in the VFS VM to further increase battery life in laptops - repair driven design, we know what it is (Val told us), but how does it apply to the things we are currently working on? should we do more of it? Thanks for the suggestions.
Re: vm/fs meetup details
On Thu, Jul 05, 2007 at 05:40:57PM -0400, Rik van Riel wrote: David Chinner wrote: On Thu, Jul 05, 2007 at 01:40:08PM -0700, Zach Brown wrote: - repair driven design, we know what it is (Val told us), but how does it apply to the things we are currently working on? should we do more of it? I'm sure Chris and I could talk about the design elements in btrfs that should aid repair if folks are interested in hearing about them. We'd keep the hand-waving to a minimum :). And I'm sure I could provide a counterpoint by talking about the techniques we've used improving XFS repair speed and scalability without needing to change any on disk formats Sounds like that could be an interesting discussion. Especially when trying to answer questions like: At what filesystem size will the mitigating fixes no longer be enough? and When will people start using filesystems THAT big? :) Keep in mind that the way to get the most out of this meeting is for the fs people to have topics of the form we'd really like to do X, can we get some help from the VM? Or vice versa from vm people. That said, we can talk about whatever interests the group on the day. And that could definitely include issues common to different filesystems.
vm/fs meetup details
Hi, The vm/fs meetup will be held September 4th from 10am till 4pm (with the option of going longer), at the University of Cambridge. Anton Altaparmakov has arranged a conference room for us with whiteboard and projector, so many thanks to him. I will send out the location and plans for meeting/getting there after we work out the best strategy for that. At the moment we have 15 people interested so far. We can have a few more people, so if you aren't cc'ed and would like to come along please let me know. We do have limited space, so I'm sorry in advance if anybody misses out. I'll post out a running list of suggested topics later, but they're really just a rough guideline. It will be a round-table kind of thing and long monologue talks won't be appropriate, however some slides or whiteboarding to interactively introduce and discuss your idea would be OK. I think we want to avoid assigning slots for specific people/topics. Feel free to propose anything, if it only gets a small amount of interest then at least you'll know who to discuss it with later :) Thanks, Nick