[rfc][patch 5/5] remove prepare_write

2007-11-11 Thread Nick Piggin

Index: linux-2.6/drivers/block/loop.c
===
--- linux-2.6.orig/drivers/block/loop.c
+++ linux-2.6/drivers/block/loop.c
@@ -40,8 +40,7 @@
  * Heinz Mauelshagen <[EMAIL PROTECTED]>, Feb 2002
  *
  * Support for falling back on the write file operation when the address space
- * operations prepare_write and/or commit_write are not available on the
- * backing filesystem.
+ * operations write_begin is not available on the backing filesystem.
  * Anton Altaparmakov, 16 Feb 2005
  *
  * Still To Fix:
@@ -761,7 +760,7 @@ static int loop_set_fd(struct loop_devic
 */
if (!file->f_op->splice_read)
goto out_putf;
-   if (aops->prepare_write || aops->write_begin)
+   if (aops->write_begin)
lo_flags |= LO_FLAGS_USE_AOPS;
if (!(lo_flags & LO_FLAGS_USE_AOPS) && !file->f_op->write)
lo_flags |= LO_FLAGS_READ_ONLY;
Index: linux-2.6/fs/fat/inode.c
===
--- linux-2.6.orig/fs/fat/inode.c
+++ linux-2.6/fs/fat/inode.c
@@ -175,7 +175,7 @@ static ssize_t fat_direct_IO(int rw, str
 
if (rw == WRITE) {
/*
-* FIXME: blockdev_direct_IO() doesn't use ->prepare_write(),
+* FIXME: blockdev_direct_IO() doesn't use ->write_begin(),
 * so we need to update the ->mmu_private to block boundary.
 *
 * But we must fill the remaining area or hole by nul for
Index: linux-2.6/fs/ocfs2/file.c
===
--- linux-2.6.orig/fs/ocfs2/file.c
+++ linux-2.6/fs/ocfs2/file.c
@@ -790,8 +790,7 @@ leave:
 
 /* Some parts of this taken from generic_cont_expand, which turned out
  * to be too fragile to do exactly what we need without us having to
- * worry about recursive locking in ->prepare_write() and
- * ->commit_write(). */
+ * worry about recursive locking in ->write_begin() and ->write_end(). */
 static int ocfs2_write_zero_page(struct inode *inode,
 u64 size)
 {
Index: linux-2.6/fs/splice.c
===
--- linux-2.6.orig/fs/splice.c
+++ linux-2.6/fs/splice.c
@@ -734,8 +734,8 @@ ssize_t splice_from_pipe(struct pipe_ino
};
 
/*
-* The actor worker might be calling ->prepare_write and
-* ->commit_write. Most of the time, these expect i_mutex to
+* The actor worker might be calling ->write_begin and
+* ->write_end. Most of the time, these expect i_mutex to
 * be held. Since this may result in an ABBA deadlock with
 * pipe->inode, we have to order lock acquiry here.
 */
Index: linux-2.6/include/linux/fs.h
===
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -453,13 +453,6 @@ struct address_space_operations {
int (*readpages)(struct file *filp, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages);
 
-   /*
-* ext3 requires that a successful prepare_write() call be followed
-* by a commit_write() call - they must be balanced
-*/
-   int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
-   int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
-
int (*write_begin)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata);
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1863,48 +1863,8 @@ int pagecache_write_begin(struct file *f
 {
const struct address_space_operations *aops = mapping->a_ops;
 
-   if (aops->write_begin) {
-   return aops->write_begin(file, mapping, pos, len, flags,
+   return aops->write_begin(file, mapping, pos, len, flags,
pagep, fsdata);
-   } else {
-   int ret;
-   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
-   unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
-   struct inode *inode = mapping->host;
-   struct page *page;
-again:
-   page = __grab_cache_page(mapping, index);
-   *pagep = page;
-   if (!page)
-   return -ENOMEM;
-
-   if (flags & AOP_FLAG_UNINTERRUPTIBLE && !PageUptodate(page)) {
-   /*
-* There is no way to resolve a short write situation
-* for a !Uptodate page (except by double copying in
-

[rfc][patch 4/5] rd: rewrite rd

2007-11-11 Thread Nick Piggin

This is a rewrite of the ramdisk block device driver.

The old one is really difficult because it effectively implements a block
device which serves data out of its own buffer cache. It relies on the dirty
bit being set, to pin its backing store in cache, however there are non
trivial paths which can clear the dirty bit (eg. try_to_free_buffers()),
which had recently lead to data corruption. And in general it is completely
wrong for a block device driver to do this.

The new one is more like a regular block device driver. It has no idea
about vm/vfs stuff. It's backing store is similar to the buffer cache
(a simple radix-tree of pages), but it doesn't know anything about page
cache (the pages in the radix tree are not pagecache pages).

There is one slight downside -- direct block device access and filesystem
metadata access goes through an extra copy and gets stored in RAM twice.
However, this downside is only slight, because the real buffercache of the
device is now reclaimable (because we're not playing crazy games with it),
so under memory intensive situations, footprint should effectively be the
same -- maybe even a slight advantage to the new driver because it can also
reclaim buffer heads.

The fact that it now goes through all the regular vm/fs paths makes it
much more useful for testing, too.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
---
Index: linux-2.6/drivers/block/Kconfig
===
--- linux-2.6.orig/drivers/block/Kconfig
+++ linux-2.6/drivers/block/Kconfig
@@ -311,7 +311,7 @@ config BLK_DEV_UB
  If unsure, say N.
 
 config BLK_DEV_RAM
-   tristate "RAM disk support"
+   tristate "RAM block device support"
---help---
  Saying Y here will allow you to use a portion of your RAM memory as
  a block device, so that you can make file systems on it, read and
@@ -346,16 +346,6 @@ config BLK_DEV_RAM_SIZE
  The default value is 4096 kilobytes. Only change this if you know
  what you are doing.
 
-config BLK_DEV_RAM_BLOCKSIZE
-   int "Default RAM disk block size (bytes)"
-   depends on BLK_DEV_RAM
-   default "1024"
-   help
- The default value is 1024 bytes.  PAGE_SIZE is a much more
- efficient choice however.  The default is kept to ensure initrd
- setups function - apparently needed by the rd_load_image routine
- that supposes the filesystem in the image uses a 1024 blocksize.
-
 config CDROM_PKTCDVD
tristate "Packet writing on CD/DVD media"
depends on !UML
Index: linux-2.6/drivers/block/Makefile
===
--- linux-2.6.orig/drivers/block/Makefile
+++ linux-2.6/drivers/block/Makefile
@@ -11,7 +11,7 @@ obj-$(CONFIG_AMIGA_FLOPPY)+= amiflop.o
 obj-$(CONFIG_PS3_DISK) += ps3disk.o
 obj-$(CONFIG_ATARI_FLOPPY) += ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)  += z2ram.o
-obj-$(CONFIG_BLK_DEV_RAM)  += rd.o
+obj-$(CONFIG_BLK_DEV_RD)   += brd.o
 obj-$(CONFIG_BLK_DEV_LOOP) += loop.o
 obj-$(CONFIG_BLK_DEV_PS2)  += ps2esdi.o
 obj-$(CONFIG_BLK_DEV_XD)   += xd.o
Index: linux-2.6/drivers/block/brd.c
===
--- /dev/null
+++ linux-2.6/drivers/block/brd.c
@@ -0,0 +1,477 @@
+/*
+ * Ram backed block device driver.
+ *
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ *
+ * Parts derived from drivers/block/rd.c, and drivers/block/loop.c, copyright
+ * of their respective owners.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include  /* invalidate_bh_lrus() */
+
+#include 
+
+#define PAGE_SECTORS (1 << (PAGE_SHIFT - 9))
+
+struct brd_device {
+   int brd_number;
+   int brd_refcnt;
+   loff_t  brd_offset;
+   loff_t  brd_sizelimit;
+   unsignedbrd_blocksize;
+
+   struct request_queue*brd_queue;
+   struct gendisk  *brd_disk;
+
+   spinlock_t  brd_lock;
+   struct radix_tree_root  brd_pages;
+
+   struct list_headbrd_list;
+};
+
+static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
+{
+   unsigned long idx;
+   struct page *page;
+
+   rcu_read_lock();
+   idx = sector >> (PAGE_SHIFT - 9);
+   page = radix_tree_lookup(&brd->brd_pages, idx);
+   rcu_read_unlock();
+
+   BUG_ON(page && page->index != idx);
+
+   return page;
+}
+
+static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
+{
+   unsigned long idx;
+   struct page *page;
+
+   page = brd_lookup_page(brd, sector);
+   if (page)
+   return page;
+
+   page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
+   if (!page)
+   return NULL;
+
+   if (radix_tree_preload(GFP_NOIO)) {
+   __free_pag

[rfc][patch 3/5] afs: new aops

2007-11-11 Thread Nick Piggin

Convert afs to new aops.

Cannot assume writes will fully complete, so this conversion goes the easy
way and always brings the page uptodate before the write.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
---
Index: linux-2.6/fs/afs/file.c
===
--- linux-2.6.orig/fs/afs/file.c
+++ linux-2.6/fs/afs/file.c
@@ -50,8 +50,8 @@ const struct address_space_operations af
.launder_page   = afs_launder_page,
.releasepage= afs_releasepage,
.invalidatepage = afs_invalidatepage,
-   .prepare_write  = afs_prepare_write,
-   .commit_write   = afs_commit_write,
+   .write_begin= afs_write_begin,
+   .write_end  = afs_write_end,
.writepage  = afs_writepage,
.writepages = afs_writepages,
 };
Index: linux-2.6/fs/afs/internal.h
===
--- linux-2.6.orig/fs/afs/internal.h
+++ linux-2.6/fs/afs/internal.h
@@ -731,8 +731,12 @@ extern int afs_volume_release_fileserver
  */
 extern int afs_set_page_dirty(struct page *);
 extern void afs_put_writeback(struct afs_writeback *);
-extern int afs_prepare_write(struct file *, struct page *, unsigned, unsigned);
-extern int afs_commit_write(struct file *, struct page *, unsigned, unsigned);
+extern int afs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata);
+extern int afs_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata);
 extern int afs_writepage(struct page *, struct writeback_control *);
 extern int afs_writepages(struct address_space *, struct writeback_control *);
 extern int afs_write_inode(struct inode *, int);
Index: linux-2.6/fs/afs/write.c
===
--- linux-2.6.orig/fs/afs/write.c
+++ linux-2.6/fs/afs/write.c
@@ -84,15 +84,23 @@ void afs_put_writeback(struct afs_writeb
  * partly or wholly fill a page that's under preparation for writing
  */
 static int afs_fill_page(struct afs_vnode *vnode, struct key *key,
-unsigned start, unsigned len, struct page *page)
+   loff_t pos, unsigned len, struct page *page)
 {
+   loff_t i_size;
+   unsigned eof;
int ret;
 
-   _enter(",,%u,%u", start, len);
+   _enter(",,%llu,%u", (unsigned long long)pos, len);
 
-   ASSERTCMP(start + len, <=, PAGE_SIZE);
+   ASSERTCMP(len, <=, PAGE_CACHE_SIZE);
 
-   ret = afs_vnode_fetch_data(vnode, key, start, len, page);
+   i_size = i_size_read(&vnode->vfs_inode);
+   if (pos + len > i_size)
+   eof = i_size;
+   else
+   eof = PAGE_CACHE_SIZE;
+
+   ret = afs_vnode_fetch_data(vnode, key, 0, eof, page);
if (ret < 0) {
if (ret == -ENOENT) {
_debug("got NOENT from server"
@@ -107,109 +115,55 @@ static int afs_fill_page(struct afs_vnod
 }
 
 /*
- * prepare a page for being written to
- */
-static int afs_prepare_page(struct afs_vnode *vnode, struct page *page,
-   struct key *key, unsigned offset, unsigned to)
-{
-   unsigned eof, tail, start, stop, len;
-   loff_t i_size, pos;
-   void *p;
-   int ret;
-
-   _enter("");
-
-   if (offset == 0 && to == PAGE_SIZE)
-   return 0;
-
-   p = kmap_atomic(page, KM_USER0);
-
-   i_size = i_size_read(&vnode->vfs_inode);
-   pos = (loff_t) page->index << PAGE_SHIFT;
-   if (pos >= i_size) {
-   /* partial write, page beyond EOF */
-   _debug("beyond");
-   if (offset > 0)
-   memset(p, 0, offset);
-   if (to < PAGE_SIZE)
-   memset(p + to, 0, PAGE_SIZE - to);
-   kunmap_atomic(p, KM_USER0);
-   return 0;
-   }
-
-   if (i_size - pos >= PAGE_SIZE) {
-   /* partial write, page entirely before EOF */
-   _debug("before");
-   tail = eof = PAGE_SIZE;
-   } else {
-   /* partial write, page overlaps EOF */
-   eof = i_size - pos;
-   _debug("overlap %u", eof);
-   tail = max(eof, to);
-   if (tail < PAGE_SIZE)
-   memset(p + tail, 0, PAGE_SIZE - tail);
-   if (offset > eof)
-   memset(p + eof, 0, PAGE_SIZE - eof);
-   }
-
-   kunmap_atomic(p, KM_USER0);
-
-   ret = 0;
-   if (offset > 0 || eof > to) {
-   /* need to fill one or two bits that aren't going to be written
-* (cover both fillers in one read if there are two) */
-   start = (offset > 0) ? 0 : to;
-   stop = (eof > to) ?

[rfc][patch 2/5] cifs: new aops

2007-11-11 Thread Nick Piggin

Convert cifs to new aops.

This is another relatively naive conversion. Always do the read upfront when
the page is not uptodate (unless we're in the writethrough path).

Fix an uninitialized data exposure where SetPageUptodate was called before
the page was uptodate.

SetPageUptodate and switch to writeback mode in the case that the full page
was dirtied.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
---
Index: linux-2.6/fs/cifs/file.c
===
--- linux-2.6.orig/fs/cifs/file.c
+++ linux-2.6/fs/cifs/file.c
@@ -103,7 +103,7 @@ static inline int cifs_open_inode_helper
 
/* want handles we can use to read with first
   in the list so we do not have to walk the
-  list to search for one in prepare_write */
+  list to search for one in write_begin */
if ((file->f_flags & O_ACCMODE) == O_WRONLY) {
list_add_tail(&pCifsFile->flist,
  &pCifsInode->openFileList);
@@ -1411,49 +1411,53 @@ static int cifs_writepage(struct page *p
return rc;
 }
 
-static int cifs_commit_write(struct file *file, struct page *page,
-   unsigned offset, unsigned to)
+static int cifs_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
-   int xid;
-   int rc = 0;
-   struct inode *inode = page->mapping->host;
-   loff_t position = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
-   char *page_data;
+   int rc;
+   struct inode *inode = mapping->host;
+   loff_t position;
+
+   cFYI(1, ("commit write for page %p from position %lld with %d bytes",
+page, pos, copied));
+
+   if (!PageUptodate(page) && copied == PAGE_CACHE_SIZE)
+   SetPageUptodate(page);
 
-   xid = GetXid();
-   cFYI(1, ("commit write for page %p up to position %lld for %d",
-page, position, to));
-   spin_lock(&inode->i_lock);
-   if (position > inode->i_size) {
-   i_size_write(inode, position);
-   }
-   spin_unlock(&inode->i_lock);
if (!PageUptodate(page)) {
-   position =  ((loff_t)page->index << PAGE_CACHE_SHIFT) + offset;
-   /* can not rely on (or let) writepage write this data */
-   if (to < offset) {
-   cFYI(1, ("Illegal offsets, can not copy from %d to %d",
-   offset, to));
-   FreeXid(xid);
-   return rc;
-   }
+   char *page_data;
+   unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+   int xid;
+
+   xid = GetXid();
/* this is probably better than directly calling
   partialpage_write since in this function the file handle is
   known which we might as well leverage */
/* BB check if anything else missing out of ppw
   such as updating last write time */
page_data = kmap(page);
-   rc = cifs_write(file, page_data + offset, to-offset,
-   &position);
-   if (rc > 0)
-   rc = 0;
-   /* else if (rc < 0) should we set writebehind rc? */
+   rc = cifs_write(file, page_data + offset, copied, &position);
+   /* if (rc < 0) should we set writebehind rc? */
kunmap(page);
+
+   FreeXid(xid);
} else {
+   rc = copied;
set_page_dirty(page);
}
 
-   FreeXid(xid);
+   if (rc > 0) {
+   position = pos + rc;
+   spin_lock(&inode->i_lock);
+   if (position > inode->i_size)
+   i_size_write(inode, position);
+   spin_unlock(&inode->i_lock);
+   }
+
+   unlock_page(page);
+   page_cache_release(page);
+
return rc;
 }
 
@@ -1999,49 +2003,45 @@ int is_size_safe_to_change(struct cifsIn
return 1;
 }
 
-static int cifs_prepare_write(struct file *file, struct page *page,
-   unsigned from, unsigned to)
+static int cifs_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
-   int rc = 0;
-   loff_t i_size;
-   loff_t offset;
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+   loff_t offset = pos & (PAGE_CACHE_SIZE - 1);
+   struct page *page;
+
+   cFYI(1, ("write begin from %lld len %d", (long long)pos, len));
+
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
 
-   cFYI(1, ("prepare write for page %p from %d to %d", page, from, to));
if (PageUptodate(page))
return 0;

[rfc][patch 1/5] ecryptfs new aops

2007-11-11 Thread Nick Piggin

Convert ecryptfs to new aops.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
---
Index: linux-2.6/fs/ecryptfs/mmap.c
===
--- linux-2.6.orig/fs/ecryptfs/mmap.c
+++ linux-2.6/fs/ecryptfs/mmap.c
@@ -263,31 +263,38 @@ out:
return 0;
 }
 
-static int ecryptfs_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
+static int ecryptfs_write_begin(struct file *file,
+   struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned flags,
+   struct page **pagep, void **fsdata)
 {
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+   struct page *page;
int rc = 0;
 
-   if (from == 0 && to == PAGE_CACHE_SIZE)
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+
+   if (flags & AOP_FLAG_UNINTERRUPTIBLE && len == PAGE_CACHE_SIZE)
goto out;   /* If we are writing a full page, it will be
   up to date. */
if (!PageUptodate(page)) {
-   rc = ecryptfs_read_lower_page_segment(page, page->index, 0,
+   rc = ecryptfs_read_lower_page_segment(page, index, 0,
  PAGE_CACHE_SIZE,
- page->mapping->host);
+ mapping->host);
if (rc) {
printk(KERN_ERR "%s: Error attemping to read lower "
   "page segment; rc = [%d]\n", __FUNCTION__, rc);
-   ClearPageUptodate(page);
goto out;
} else
SetPageUptodate(page);
}
-   if (page->index != 0) {
+   if (index != 0) {
loff_t end_of_prev_pg_pos =
-   (((loff_t)page->index << PAGE_CACHE_SHIFT) - 1);
+   (((loff_t)index << PAGE_CACHE_SHIFT) - 1);
 
-   if (end_of_prev_pg_pos > i_size_read(page->mapping->host)) {
+   if (end_of_prev_pg_pos > i_size_read(mapping->host)) {
rc = ecryptfs_truncate(file->f_path.dentry,
   end_of_prev_pg_pos);
if (rc) {
@@ -297,7 +304,7 @@ static int ecryptfs_prepare_write(struct
goto out;
}
}
-   if (end_of_prev_pg_pos + 1 > i_size_read(page->mapping->host))
+   if (end_of_prev_pg_pos + 1 > i_size_read(mapping->host))
zero_user_page(page, 0, PAGE_CACHE_SIZE, KM_USER0);
}
 out:
@@ -391,21 +398,25 @@ int ecryptfs_write_inode_size_to_metadat
 }
 
 /**
- * ecryptfs_commit_write
+ * ecryptfs_write_end
  * @file: The eCryptfs file object
+ * @mapping: The eCryptfs object
+ * @pos, len: Ignored (we rotate the page IV on each write)
  * @page: The eCryptfs page
- * @from: Ignored (we rotate the page IV on each write)
- * @to: Ignored
  *
  * This is where we encrypt the data and pass the encrypted data to
  * the lower filesystem.  In OpenPGP-compatible mode, we operate on
  * entire underlying packets.
  */
-static int ecryptfs_commit_write(struct file *file, struct page *page,
-unsigned from, unsigned to)
-{
-   loff_t pos;
-   struct inode *ecryptfs_inode = page->mapping->host;
+static int ecryptfs_write_end(struct file *file,
+   struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
+{
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+   unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+   unsigned to = from + copied;
+   struct inode *ecryptfs_inode = mapping->host;
struct ecryptfs_crypt_stat *crypt_stat =

&ecryptfs_inode_to_private(file->f_path.dentry->d_inode)->crypt_stat;
int rc;
@@ -417,25 +428,22 @@ static int ecryptfs_commit_write(struct 
} else
ecryptfs_printk(KERN_DEBUG, "Not a new file\n");
ecryptfs_printk(KERN_DEBUG, "Calling fill_zeros_to_end_of_page"
-   "(page w/ index = [0x%.16x], to = [%d])\n", page->index,
-   to);
+   "(page w/ index = [0x%.16x], to = [%d])\n", index, to);
/* Fills in zeros if 'to' goes beyond inode size */
rc = fill_zeros_to_end_of_page(page, to);
if (rc) {
ecryptfs_printk(KERN_WARNING, "Error attempting to fill "
-   "zeros in page with index = [0x%.16x]\n",
-   page->index);
+   "zeros in page with index = [0x%.16x]\n", index);
goto out;
}
rc =

[rfc][patches] remove ->prepare_write

2007-11-11 Thread Nick Piggin
Hi,

These are a set of patches to convert the last few filesystems to use
the new deadlock-free write aops, and remove the core code to handle the
legacy write path.

I don't really have setups to sufficiently test these filesystems. So I
would really appreciate if filesystem maintainers can pick these patches
up, bear with my bugs, and send them upstream when they're ready.

The benefit to you is that you get to use the fast and well tested code 
paths! Actually, it is interesting: several of the conversions I've done
(including these) take a relatively naive aproach of simply prefilling
the whole page if it isn't uptodate. It might be the case that some
filesystems actually prefer to do something similar to the legacy
double-copy path which they're being converted away from! (then again,
it would be probably even more ideal to have simple sub-page state
tracking structures).

There is still quite a lot of work left to be done. Several filesystems
still use prepare_write() helpers, and when they're fixed up, all the
old helpers themselves have to be removed. But this step is probably
most important to getting rid of complex code.



-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-11-11 Thread Hugh Dickins
On Fri, 9 Nov 2007, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> > 
> > One, I think you would be safer to do a set_page_dirty(lower_page)
> > before your clear_page_dirty_for_io(lower_page).  I know that sounds
> > silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io:
> > there's a lot of subtlety hereabouts, and I think you'd be mimicing the
> > usual path closer if you set_page_dirty first - there's nothing else
> > doing it on that lower_page, is there?  I'm not certain that you need
> > to, but I think you'd do well to look into it and make up your own mind.
> 
> Hugh, my code looks like:
> 
>   if (wbc->for_reclaim) {
>   set_page_dirty(lower_page);
>   unlock_page(lower_page);
>   goto out_release;
>   }
>   BUG_ON(!lower_mapping->a_ops->writepage);
>   clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
>   err = lower_mapping->a_ops->writepage(lower_page, wbc);
> 
> Do you mean I should set_page_dirty(lower_page) unconditionally before
> clear_page_dirty_for_io?  (I already do that in the 'if' statement above it.)

Yes.  Whether you're wrong not to be doing that already, I've not checked;
but I think doing so will make unionfs safer against any future changes
in the relationship between set_page_dirty and clear_page_dirty_for_io.

For example, look at clear_page_dirty_for_io: it's decrementing some
statistics which __set_page_dirty_nobuffers increments.  Does use of
unionfs (over some filesystems) make those numbers wrap increasingly
negative?  Does adding this set_page_dirty(lower_page) correct that?
I suspect so, but may be wrong.

> > Two, I'm unsure of the way you're clearing or setting PageUptodate on
> > the upper page there.  The rules for PageUptodate are fairly obvious
> > when reading, but when a write fails, it's not so obvious.  Again, I'm
> > not saying what you've got is wrong (it may be unavoidable, to keep
> > synch between lower and upper), but it deserves a second thought.
> 
> I looked at all mainline filesystems's ->writepage to see what, if any, they
> do with their page's uptodate flag.  Most f/s don't touch the flag one way
> or another.

I'm not going to try and guess what assorted filesystems are up to,
and not all of them will be bugfree.  The crucial point of PageUptodate
is that we insert a filesystem page into the page cache before it's had
any data read in: it needs to be !PageUptodate until the data is there,
and then marked PageUptodate to say the data is good and others can
start using it.  See mm/filemap.c.

> And finally, unionfs clears the uptodate flag on error from the lower
> ->writepage, and otherwise sets the flag on success from the lower
> ->writepage.  My gut feeling is that unionfs shouldn't change the page
> uptodate flag at all: if the VFS passes unionfs_writepage a page which isn't
> uptodate, then the VFS has a serious problem b/c it'd be asking a f/s to
> write out a page which isn't up-to-date, right?  Otherwise, whether
> unionfs_writepage manages to write the lower page or not, why should that
> invalidate the state of the unionfs page itself?  Come to think of it, I
> think clearing pageuptodate on error from ->writepage(lower_page) may be
> bad.  Imagine if after such a failed unionfs_writepage, I get a
> unionfs_readpage: that ->readpage will get data from the lower f/s page and
> copy it *over* the unionfs page, even if the upper page's data was more
> recent prior to the failed call to unionfs_writepage.  IOW, we could be
> reverting a user-visible mmap'ed page to a previous on-disk version.  What
> do you think: could this happen?  Anyway, I'll run some exhaustive testing
> next and see what happens if I don't set/clear the uptodate flag in
> unionfs_writepage.

That was my point, and I don't really have more to add.  It's unusual
to do anything with PageUptodate when writing.  By clearing it when the
lower level has an error, you're throwing away the changes already made
at the upper level.  You might have some good reason for that, but it's
worth questioning.  If you don't know why you're Set/Clear'ing it there,
better to just take that out.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: cramfs in big endian

2007-11-11 Thread Andi Drebes
> What about simply deep-sixing cramfs and replacing it with squashfs or 
> something else?
I think this is the long term solution. Cramfs isn't a very beautiful 
filesystem. It's a good candidate for removal. However, there are still some 
distributions that use cramfs for initrds. So removing it immediately isn't a 
good idea.

Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html