Re: [patch 2/5] fs: introduce new aops and infrastructure

2007-03-14 Thread Joel Becker
On Thu, Mar 15, 2007 at 05:36:42AM +0100, Nick Piggin wrote:
> On Wed, Mar 14, 2007 at 09:13:29PM -0700, Mark Fasheh wrote:
> > Are we going to get rid of the file and intr arguments btw? I'm not sure
> > intr is useful, and mapping is probably enough to get whatever we inside
> > ->write_begin / ->write_end.
> 
> Yeah, I was going to, but I had this version ready to go so decided
> to leave them in at the last minute. We can definitely take them out
> if people agree.

You're really going to need the file argument around.  Some
folks care about file->private_data, etc.  A good example is
nfs_updatepage() from nfs_commit_write().  There's a context on the
filp.  Mapping can get back to the inode via ->host, but not to the
struct file.

Joel

-- 

Life's Little Instruction Book #157 

"Take time to smell the roses."

Joel Becker
Principal Software Developer
Oracle
E-mail: [EMAIL PROTECTED]
Phone: (650) 506-8127
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/5] fs: introduce new aops and infrastructure

2007-03-14 Thread Mark Fasheh
On Thu, Mar 15, 2007 at 05:36:42AM +0100, Nick Piggin wrote:
> > Are we going to get rid of the file and intr arguments btw? I'm not sure
> > intr is useful, and mapping is probably enough to get whatever we inside
> > ->write_begin / ->write_end.
> 
> Yeah, I was going to, but I had this version ready to go so decided
> to leave them in at the last minute. We can definitely take them out
> if people agree.
> 
> However a side note about intr -- I wonder if it might be wise to
> include a flags argument, in case we might want to add something like
> that later? (definitely if we do keep intr, then it should be done as
> a flag rather than its own int).

I don't see a problem with having a flags argument. It could give us some
flexibility in the future which would otherwise require a much bigger
update. If we found out that we needed intr, it could just be a flag.


> > One interesting side effect is that we no longer pass AOP_TRUNCATE_PAGE up a
> > level. This gives callers less to deal with. And it means that ocfs2 doesn't
> > have to use the ocfs2_*_lock_with_page() cluster lock variants in
> > ocfs2_block_write_begin() because it can order cluster locks outside of the
> > page lock there.
> 
> OK that's very cool. I was hoping that would be the case. If GFS2 can
> avoid that too, then we might be able to get rid of AOP_TRUNCATE_PAGE
> handling from the legacy prepare/commit_write paths, which will make
> them simpler.

Yeah - so long as we're not taking a page fault between write_begin /
write_end, there's no reason for the cluster locks to be taken and dropped
within the individual callbacks, which means we can just take them in
write_begin (where page lock ordering is possible) and hold them until
write_end is called.


> OK, well I'll add this to my queue for now, and post the full patchset
> after incorporating feedback I've had so far, and doing more testing,
> so people can actually apply them and boot kernels.

Great, thanks - it just occured to me that I should be holding the clusters
locks across the entire copy (as I point out above), so I'll have a slightly
updated version of this patch for you soon :)
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/5] fs: introduce new aops and infrastructure

2007-03-14 Thread Nick Piggin
On Wed, Mar 14, 2007 at 09:13:29PM -0700, Mark Fasheh wrote:
> Hi Nick,
> 
> On Wed, Mar 14, 2007 at 02:38:22PM +0100, Nick Piggin wrote:
> > Introduce write_begin, write_end, and perform_write aops.
> > 
> > These are intended to replace prepare_write and commit_write with more
> > flexible alternatives that are also able to avoid the buffered write
> > deadlock problems efficiently (which prepare_write is unable to do).
> 
> > Index: linux-2.6/include/linux/fs.h
> > ===
> > --- linux-2.6.orig/include/linux/fs.h
> > +++ linux-2.6/include/linux/fs.h
> > @@ -449,6 +449,17 @@ struct address_space_operations {
> >  */
> > int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
> > int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
> > +
> > +   int (*write_begin)(struct file *, struct address_space *mapping,
> > +   loff_t pos, unsigned len, int intr,
> > +   struct page **pagep, void **fsdata);
> > +   int (*write_end)(struct file *, struct address_space *mapping,
> > +   loff_t pos, unsigned len, unsigned copied,
> > +   struct page *page, void *fsdata);
> 
> Are we going to get rid of the file and intr arguments btw? I'm not sure
> intr is useful, and mapping is probably enough to get whatever we inside
> ->write_begin / ->write_end.

Yeah, I was going to, but I had this version ready to go so decided
to leave them in at the last minute. We can definitely take them out
if people agree.

However a side note about intr -- I wonder if it might be wise to
include a flags argument, in case we might want to add something like
that later? (definitely if we do keep intr, then it should be done as
a flag rather than its own int).


> Also, I noticed that you didn't export block_write_begin(),
> simple_write_begin(), block_write_end() and simple_write_end() - I think we
> want those for client modules.

Yep, simple oversight on my part.


> Attached is a quick patch to hook up the existing ocfs2 write code. This has
> been compile tested only for now - one of my test machines isn't
> cooperating, so a runtime test will have to wait until tommorrow.
> 
> One interesting side effect is that we no longer pass AOP_TRUNCATE_PAGE up a
> level. This gives callers less to deal with. And it means that ocfs2 doesn't
> have to use the ocfs2_*_lock_with_page() cluster lock variants in
> ocfs2_block_write_begin() because it can order cluster locks outside of the
> page lock there.

OK that's very cool. I was hoping that would be the case. If GFS2 can
avoid that too, then we might be able to get rid of AOP_TRUNCATE_PAGE
handling from the legacy prepare/commit_write paths, which will make
them simpler.

> My ocfs2 write rework will be a more serious user of these stuff, including
> the fsdata backpointer. That code will also eliminate the entire
> ocfs2_*_lock_with_page() cluster locking workarounds for write (they'll have
> to remain for ->readpage()). I'm beginning work on cleaning those ocfs2
> patches up and getting them plugged into this stuff. I hope to post them in
> the next day or two.

OK, well I'll add this to my queue for now, and post the full patchset
after incorporating feedback I've had so far, and doing more testing,
so people can actually apply them and boot kernels.

Thanks,
Nick
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/5] fs: introduce new aops and infrastructure

2007-03-14 Thread Mark Fasheh
Hi Nick,

On Wed, Mar 14, 2007 at 02:38:22PM +0100, Nick Piggin wrote:
> Introduce write_begin, write_end, and perform_write aops.
> 
> These are intended to replace prepare_write and commit_write with more
> flexible alternatives that are also able to avoid the buffered write
> deadlock problems efficiently (which prepare_write is unable to do).

> Index: linux-2.6/include/linux/fs.h
> ===
> --- linux-2.6.orig/include/linux/fs.h
> +++ linux-2.6/include/linux/fs.h
> @@ -449,6 +449,17 @@ struct address_space_operations {
>*/
>   int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
>   int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
> +
> + int (*write_begin)(struct file *, struct address_space *mapping,
> + loff_t pos, unsigned len, int intr,
> + struct page **pagep, void **fsdata);
> + int (*write_end)(struct file *, struct address_space *mapping,
> + loff_t pos, unsigned len, unsigned copied,
> + struct page *page, void *fsdata);

Are we going to get rid of the file and intr arguments btw? I'm not sure
intr is useful, and mapping is probably enough to get whatever we inside
->write_begin / ->write_end.

Also, I noticed that you didn't export block_write_begin(),
simple_write_begin(), block_write_end() and simple_write_end() - I think we
want those for client modules.


Attached is a quick patch to hook up the existing ocfs2 write code. This has
been compile tested only for now - one of my test machines isn't
cooperating, so a runtime test will have to wait until tommorrow.

One interesting side effect is that we no longer pass AOP_TRUNCATE_PAGE up a
level. This gives callers less to deal with. And it means that ocfs2 doesn't
have to use the ocfs2_*_lock_with_page() cluster lock variants in
ocfs2_block_write_begin() because it can order cluster locks outside of the
page lock there.

My ocfs2 write rework will be a more serious user of these stuff, including
the fsdata backpointer. That code will also eliminate the entire
ocfs2_*_lock_with_page() cluster locking workarounds for write (they'll have
to remain for ->readpage()). I'm beginning work on cleaning those ocfs2
patches up and getting them plugged into this stuff. I hope to post them in
the next day or two.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]

ocfs2: Convert to new aops

Turn ocfs2_prepare_write() and ocfs2_commit_write() into ocfs2_write_begin()
and ocfs2_write_end().

Signed-off-by: Mark Fasheh <[EMAIL PROTECTED]>

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 93628b0..e7bcbbd 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -293,29 +293,30 @@ int ocfs2_prepare_write_nolock(struct in
 }
 
 /*
- * ocfs2_prepare_write() can be an outer-most ocfs2 call when it is called
- * from loopback.  It must be able to perform its own locking around
- * ocfs2_get_block().
+ * ocfs2_write_begin() can be an outer-most ocfs2 call when it is
+ * called from elsewhere in the kernel. It must be able to perform its
+ * own locking around ocfs2_get_block().
  */
-static int ocfs2_prepare_write(struct file *file, struct page *page,
-  unsigned from, unsigned to)
+static int ocfs2_write_begin(struct file *file, struct address_space *mapping,
+loff_t pos, unsigned len, int intr,
+struct page **pagep, void **fsdata)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = mapping->host;
int ret;
 
-   mlog_entry("(0x%p, 0x%p, %u, %u)\n", file, page, from, to);
-
-   ret = ocfs2_meta_lock_with_page(inode, NULL, 0, page);
+   ret = ocfs2_meta_lock(inode, NULL, 0);
if (ret != 0) {
mlog_errno(ret);
goto out;
}
 
-   ret = ocfs2_prepare_write_nolock(inode, page, from, to);
+   down_read(&OCFS2_I(inode)->ip_alloc_sem);
+   ret = block_write_begin(file, mapping, pos, len, intr, pagep, fsdata,
+   ocfs2_get_block);
+   up_read(&OCFS2_I(inode)->ip_alloc_sem);
 
ocfs2_meta_unlock(inode, 0);
 out:
-   mlog_exit(ret);
return ret;
 }
 
@@ -388,16 +389,21 @@ out:
return handle;
 }
 
-static int ocfs2_commit_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
+static int ocfs2_write_end(struct file *file, struct address_space *mapping,
+  loff_t pos, unsigned len, unsigned copied,
+  struct page *page, void *fsdata)
 {
int ret;
+   unsigned from, to;
struct buffer_head *di_bh = NULL;
struct inode *inode = page->mapping->host;
handle_t *handle = NULL;
struct ocfs2_dinode *di;
 
-   mlog_entry("(0x%p, 0x%p,

Re: [patch 2/5] fs: introduce new aops and infrastructure

2007-03-14 Thread Nick Piggin
On Wed, Mar 14, 2007 at 10:46:25PM +0100, Mariusz Kozlowski wrote:
> Hello, 
> 
>   I guess no need to define 'ret' twice here.

[...]

Hi Mariusz,

Thanks, I'll clean that up.

Nick
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/5] fs: introduce new aops and infrastructure

2007-03-14 Thread Nick Piggin
On Thu, Mar 15, 2007 at 12:28:04AM +0300, Dmitriy Monakhov wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> 
> > +
> > +int pagecache_write_end(struct file *file, struct address_space *mapping,
> > +   loff_t pos, unsigned len, unsigned copied,
> > +   struct page *page, void *fsdata)
> > +{
> > +   const struct address_space_operations *aops = mapping->a_ops;
> > +   int ret;
> > +
> > +   if (aops->write_begin)
> > +   ret = aops->write_end(file, mapping, pos, len, copied, page, 
> > fsdata);
> > +   else {
> > +   int ret;
> > +   unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
> > +   struct inode *inode = mapping->host;
> > +
> > +   flush_dcache_page(page);
> > +   ret = aops->commit_write(file, page, offset, offset+len);
> > +   if (ret < 0) {
> > +   unlock_page(page);
> > +   page_cache_release(page);
> > +   if (pos + len > inode->i_size)
> > +   vmtruncate(inode, inode->i_size);
> > +   } else
> > +   ret = copied;
> What about AOP_TRUNCATED_PAGE?  Off corse we can't just "goto retry" here :) ,
> but we may return it to caller and let's caller handle it.  

Yeah AOP_TRUNCATED_PAGE... I'm _hoping_ that OCFS2 and GFS2 will be able
to avoid that using write_begin/write_end, so the caller will not have to
know anything about it.

I don't know that commit_write can even return AOP_TRUNCATED_PAGE... we
should have gathered all our locks in prepare_write.


> > +   }
> > +
> > +   return copied;
> if ->commit_write return non negative value we return with sill locked page  
> look above at [1] 
> may be it will be unlocked by caller? I guess no it was just forgoten.

Yeah, thanks. I think I converted all my filesystems to use write_begin /
write_end, so I probably didn't test this path :P. I do plan to go through
and try to individually test error cases and stress test it over the next
couple of days.


> > +void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
> > +{
> > +   unsigned int block_start, block_end;
> > +   struct buffer_head *head, *bh;
> > +
> > +   BUG_ON(!PageLocked(page));
> > +   if (!page_has_buffers(page))
> > +   return;__block_prepare_write 
> > +
> > +   bh = head = page_buffers(page);
> > block_start = 0;
> > do {
> > -   block_end = block_start+blocksize;
> > -   if (block_end <= from)
> > -   goto next_bh;
> > -   if (block_start >= to)
> > -   break;
> > +   block_end = block_start + bh->b_size;
> > +
> > if (buffer_new(bh)) {
> > -   void *kaddr;
> > +   if (block_end > from && block_start < to) {
> > +   if (!PageUptodate(page)) {
> > +   unsigned start, end;
> > +   void *kaddr;
> > +
> > +   start = max(from, block_start);
> > +   end = min(to, block_end);
> > +
> > +   kaddr = kmap_atomic(page, KM_USER0);
> > +   memset(kaddr+start, 0, block_end-end);
> <<< At least this result in information leak in case of (stat == from)
> just imagine fs with blocksize == 1k conains file with i_size == 4096 and 
> fist two blocks not mapped (hole), now invoke write op from 1023 to 2048.
> For example we succeed in allocating first block, but faile while allocating 
> second
> , then we call page_zero_new_buffers(...from == 1023, to == 2048)
>   and then zerro only one last byte for first block, and set is uptodate
> After this we just do read( from == 0, to == 1023) and steal old block 
> content.

When we first invoke the write op, it should see were doing a partial
write into the first buffer and bring it uptodate first. I don't see the
problem, but again I do need to go through and exercise various cases
like this.


> > @@ -222,67 +221,47 @@ static int do_lo_send_aops(struct loop_d
> > len = bvec->bv_len;
> > while (len > 0) {
> > sector_t IV;
> > -   unsigned size;
> > +   unsigned size, copied;
> > int transfer_result;
> > +   struct page *page;
> > +   void *fsdata;
> >  
> > IV = ((sector_t)index << (PAGE_CACHE_SHIFT - 9))+(offset >> 9);
> > size = PAGE_CACHE_SIZE - offset;
> > if (size > len)
> > size = len;
> > -   page = grab_cache_page(mapping, index);
> > -   if (unlikely(!page))
> > +
> > +   ret = pagecache_write_begin(file, mapping, pos, size, 1,
> > +   &page, &fsdata);
> > +   if (ret)
> > goto fail;
> > -   ret = aops->prepare_write(file, page, offset,
> > -  

Re: Move across mount,sb

2007-03-14 Thread Al Viro
On Thu, Mar 15, 2007 at 02:20:25AM +0100, Jan Engelhardt wrote:
> Why is EXDEV returned when _vfs mountpoints_ are crossed? Should not it 
> be more like the following?
> 
>   error = -EXDEV;
>   if (oldnd.mnt->mnt_sb != newnd.mnt->mnt_sb)
>   goto exit2;
> 

No.  This is absolutely deliberate - mountpoint creates a boundary and
such boundaries are very useful for restricting modifications of filesystem.
IOW, it's not a bug and it applies to other operations as well (link(), for
example).
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Move across mount,sb

2007-03-14 Thread Jan Engelhardt
Hello,


touch /tmp/foo;
mount /tmp /mnt --bind;
strace -e rename mv /tmp/foo /mnt/bar

Ideally, I would expect, that since /tmp and /mnt are the same 
filesystem, that the move operation would complete without _copying_ the 
file. But strace returns

rename("/tmp/foo", "/mnt/bar") = -1 EXDEV (Invalid cross-device link)

The piece of code identified is fs/namei.c:2578:do_rename()

error = -EXDEV;
if (oldnd.mnt != newnd.mnt)
goto exit2;

Why is EXDEV returned when _vfs mountpoints_ are crossed? Should not it 
be more like the following?

error = -EXDEV;
if (oldnd.mnt->mnt_sb != newnd.mnt->mnt_sb)
goto exit2;




Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 3/8] per backing_dev dirty and writeback page accounting

2007-03-14 Thread Miklos Szeredi
> Only if the queue depth is not bound. Queue depths are bound and so
> the distance we can go over the threshold is limited.  This is the
> fundamental principle on which the throttling is based.
> 
> Hence, if the queue is not full, then we will have either written
> dirty pages to it (i.e wbc->nr_write != write_chunk so we will throttle
> or continue normally if write_chunk was written) or we have no more
> dirty pages left.
> 
> Having no dirty pages left on the bdi and it not being congested
> means we effectively have a clean, idle bdi. We should not be trying
> to throttle writeback here - we can't do anything to improve the
> situation by continuing to try to do writeback on this bdi, so we
> may as well give up and let the writer continue. Once we have dirty
> pages on the bdi, we'll get throttled appropriately.

OK, you convinced me.

How about this patch?  I introduced a new wbc counter, that sums the
number of dirty pages encountered, including ones already under
writeback.

Dave, big thanks for your insights.

Miklos

Index: linux/include/linux/writeback.h
===
--- linux.orig/include/linux/writeback.h2007-03-14 22:43:42.0 
+0100
+++ linux/include/linux/writeback.h 2007-03-14 22:58:56.0 +0100
@@ -44,6 +44,7 @@ struct writeback_control {
long nr_to_write;   /* Write this many pages, and decrement
   this for each page written */
long pages_skipped; /* Pages which were not written */
+   long nr_dirty;  /* Number of dirty pages encountered */
 
/*
 * For a_ops->writepages(): is start or end are non-zero then this is
Index: linux/mm/page-writeback.c
===
--- linux.orig/mm/page-writeback.c  2007-03-14 22:41:01.0 +0100
+++ linux/mm/page-writeback.c   2007-03-14 23:00:20.0 +0100
@@ -220,6 +220,17 @@ static void balance_dirty_pages(struct a
pages_written += write_chunk - wbc.nr_to_write;
if (pages_written >= write_chunk)
break;  /* We've done our duty */
+
+   /*
+* If just a few dirty pages were encountered, and
+* the queue is not congested, then allow this dirty
+* producer to continue.  This resolves the deadlock
+* that happens when one filesystem writes back data
+* through another.  It should also help when a slow
+* device is completely blocking other writes.
+*/
+   if (wbc.nr_dirty < 8 && !bdi_write_congested(bdi))
+   break;
}
congestion_wait(WRITE, HZ/10);
}
@@ -612,6 +623,7 @@ retry:
  min(end - index, 
(pgoff_t)PAGEVEC_SIZE-1) + 1))) {
unsigned i;
 
+   wbc->nr_dirty += nr_pages;
scanned = 1;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 2/5] fs: introduce new aops and infrastructure

2007-03-14 Thread Dmitriy Monakhov
Nick Piggin <[EMAIL PROTECTED]> writes:

> Introduce write_begin, write_end, and perform_write aops.
>
> These are intended to replace prepare_write and commit_write with more
> flexible alternatives that are also able to avoid the buffered write
> deadlock problems efficiently (which prepare_write is unable to do).
>
>  Documentation/filesystems/Locking |9 +
>  Documentation/filesystems/vfs.txt |   38 ++
>  drivers/block/loop.c  |   69 ---
>  fs/buffer.c   |  144 
>  fs/libfs.c|   40 ++
>  fs/namei.c|   47 ++-
>  fs/splice.c   |  106 +
>  include/linux/buffer_head.h   |7 +
>  include/linux/fs.h|   29 
>  include/linux/pagemap.h   |2 
>  mm/filemap.c  |  228 
> ++
>  11 files changed, 494 insertions(+), 225 deletions(-)
>
> Index: linux-2.6/include/linux/fs.h
> ===
> --- linux-2.6.orig/include/linux/fs.h
> +++ linux-2.6/include/linux/fs.h
> @@ -449,6 +449,17 @@ struct address_space_operations {
>*/
>   int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
>   int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
> +
> + int (*write_begin)(struct file *, struct address_space *mapping,
> + loff_t pos, unsigned len, int intr,
> + struct page **pagep, void **fsdata);
> + int (*write_end)(struct file *, struct address_space *mapping,
> + loff_t pos, unsigned len, unsigned copied,
> + struct page *page, void *fsdata);
> +
> + int (*perform_write)(struct file *, struct address_space *,
> + struct iov_iter *, loff_t);
> +
>   /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
>   sector_t (*bmap)(struct address_space *, sector_t);
>   void (*invalidatepage) (struct page *, unsigned long);
> @@ -463,6 +474,18 @@ struct address_space_operations {
>   int (*launder_page) (struct page *);
>  };
>  
> +/*
> + * pagecache_write_begin/pagecache_write_end must be used by general code
> + * to write into the pagecache.
> + */
> +int pagecache_write_begin(struct file *, struct address_space *mapping,
> + loff_t pos, unsigned len, int intr,
> + struct page **pagep, void **fsdata);
> +
> +int pagecache_write_end(struct file *, struct address_space *mapping,
> + loff_t pos, unsigned len, unsigned copied,
> + struct page *page, void *fsdata);
> +
>  struct backing_dev_info;
>  struct address_space {
>   struct inode*host;  /* owner: inode, block_device */
> @@ -1902,6 +1925,12 @@ extern int simple_prepare_write(struct f
>   unsigned offset, unsigned to);
>  extern int simple_commit_write(struct file *file, struct page *page,
>   unsigned offset, unsigned to);
> +extern int simple_write_begin(struct file *file, struct address_space 
> *mapping,
> + loff_t pos, unsigned len, int intr,
> + struct page **pagep, void **fsdata);
> +extern int simple_write_end(struct file *file, struct address_space *mapping,
> + loff_t pos, unsigned len, unsigned copied,
> + struct page *page, void *fsdata);
>  
>  extern struct dentry *simple_lookup(struct inode *, struct dentry *, struct 
> nameidata *);
>  extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t 
> *);
> Index: linux-2.6/mm/filemap.c
> ===
> --- linux-2.6.orig/mm/filemap.c
> +++ linux-2.6/mm/filemap.c
> @@ -2049,6 +2049,88 @@ inline int generic_write_checks(struct f
>  }
>  EXPORT_SYMBOL(generic_write_checks);
>  
> +int pagecache_write_begin(struct file *file, struct address_space *mapping,
> + loff_t pos, unsigned len, int intr,
> + struct page **pagep, void **fsdata)
> +{
> + const struct address_space_operations *aops = mapping->a_ops;
> +
> + if (aops->write_begin)
> + return aops->write_begin(file, mapping, pos, len, intr, pagep, 
> fsdata);
> + else {
> + int ret;
> + pgoff_t index = pos >> PAGE_CACHE_SHIFT;
> + unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
> + struct inode *inode = mapping->host;
> + struct page *page;
> +again:
> + page = __grab_cache_page(mapping, index);
> + *pagep = page;
> + if (!page)
> + return -ENOMEM;
> +
> + if (intr && !PageUptodate(page)) {
> +  

Re: [patch 2/3] fs: introduce perform_write aop

2007-03-14 Thread Christoph Hellwig
On Wed, Mar 14, 2007 at 02:30:24PM +0100, Nick Piggin wrote:
> So I've tried a different approach - the 2-op API rather than an actor.
> 
> perform_write stays around as a higher performance API, but it isn't
> required if the filesystem implements the 2-op API. I've called them
> write_begin/write_end for now.
> 
> There are a few upshots to doing this rather than the actor approach.
> First of all, this is what callers expect, they want to write into the
> page directly rather than making an actor.
> 
> More importantly, it allows us to implement generic block versions of
> the API which is much more reusable than block_perform_write (which was
> basically useless for anything more than ext2).

Generally thiis look pretty cool.  But even if we go with perform_write
as aop for now (which I think is a bad idea aswell, but moving it out
would better be done after all filesystems are converted) these should
just stay callbacks passed to generic_perform_write.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 1/5] fs: add an iovec iterator

2007-03-14 Thread Nick Piggin
On Wed, Mar 14, 2007 at 02:38:12PM +0100, Nick Piggin wrote:
> Add an iterator data structure to operate over an iovec. Add usercopy
> operators needed by generic_file_buffered_write, and convert that function
> over.

Just a note to anyone looking at these -- they don't apply to any tree,
and I'm posting them at this stage mainly to get feedback on the a_ops
APIs. Comments from anyone welcome.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 4/5] ext2: convert to new aops

2007-03-14 Thread Nick Piggin
Implement new aops for ext2.

 fs/ext2/inode.c |   12 
 1 file changed, 12 insertions(+)

Index: linux-2.6/fs/ext2/inode.c
===
--- linux-2.6.orig/fs/ext2/inode.c
+++ linux-2.6/fs/ext2/inode.c
@@ -643,6 +643,16 @@ ext2_readpages(struct file *file, struct
 }
 
 static int
+ext2_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, int intr,
+   struct page **pagep, void **fsdata)
+{
+   *pagep = NULL;
+   return block_write_begin(file, mapping, pos, len, intr, pagep, fsdata,
+   ext2_get_block);
+}
+
+static int
 ext2_prepare_write(struct file *file, struct page *page,
unsigned from, unsigned to)
 {
@@ -689,6 +699,8 @@ const struct address_space_operations ex
.readpages  = ext2_readpages,
.writepage  = ext2_writepage,
.sync_page  = block_sync_page,
+   .write_begin= ext2_write_begin,
+   .write_end  = block_write_end,
.prepare_write  = ext2_prepare_write,
.commit_write   = generic_commit_write,
.bmap   = ext2_bmap,
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 5/5] ext3: convert to new aops

2007-03-14 Thread Nick Piggin
Implement new aops for ext3. Probably has some bugs in interaction with
journalling, and corner cases aren't tested/thought out fully, but it
boots and runs. I don't see a fundamental reason why it can't work...

 fs/ext3/inode.c |  137 +++-
 1 file changed, 88 insertions(+), 49 deletions(-)

Index: linux-2.6/fs/ext3/inode.c
===
--- linux-2.6.orig/fs/ext3/inode.c
+++ linux-2.6/fs/ext3/inode.c
@@ -1155,7 +1155,7 @@ static int do_journal_get_write_access(h
  * This content is expected to be set to zeroes by block_prepare_write().
  * 2006/10/14  SAW
  */
-static int ext3_prepare_failure(struct file *file, struct page *page,
+static int ext3_write_failure(struct file *file, struct page *page,
unsigned from, unsigned to)
 {
struct address_space *mapping;
@@ -1208,29 +1208,40 @@ skip:
return mapping->a_ops->commit_write(file, page, from, block_start);
 }
 
-static int ext3_prepare_write(struct file *file, struct page *page,
- unsigned from, unsigned to)
+static int ext3_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, int intr,
+   struct page **pagep, void **fsdata)
 {
-   struct inode *inode = page->mapping->host;
-   int ret, ret2;
+   struct inode *inode = mapping->host;
int needed_blocks = ext3_writepage_trans_blocks(inode);
+   int ret, ret2;
handle_t *handle;
int retries = 0;
+   struct page *page;
+   pgoff_t index;
+   unsigned start, end;
+
+   index = pos >> PAGE_CACHE_SHIFT;
+   start = pos * (PAGE_CACHE_SIZE - 1);
+   end = start + len;
+
+   page = __grab_cache_page(mapping, index);
+   if (!page)
+   return -ENOMEM;
+   *pagep = page;
 
 retry:
handle = ext3_journal_start(inode, needed_blocks);
if (IS_ERR(handle))
return PTR_ERR(handle);
-   if (test_opt(inode->i_sb, NOBH) && ext3_should_writeback_data(inode))
-   ret = nobh_prepare_write(page, from, to, ext3_get_block);
-   else
-   ret = block_prepare_write(page, from, to, ext3_get_block);
+   ret = block_write_begin(file, mapping, pos, len, intr, pagep, fsdata,
+   ext3_get_block);
if (ret)
goto failure;
 
if (ext3_should_journal_data(inode)) {
ret = walk_page_buffers(handle, page_buffers(page),
-   from, to, NULL, do_journal_get_write_access);
+   start, end, NULL, do_journal_get_write_access);
if (ret)
/* fatal error, just put the handle and return */
journal_stop(handle);
@@ -1238,7 +1249,7 @@ retry:
return ret;
 
 failure:
-   ret2 = ext3_prepare_failure(file, page, from, to);
+   ret2 = ext3_write_failure(file, page, start, end);
if (ret2 < 0)
return ret2;
if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
@@ -1247,17 +1258,18 @@ failure:
return ret;
 }
 
+
 int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
 {
int err = journal_dirty_data(handle, bh);
if (err)
ext3_journal_abort_handle(__FUNCTION__, __FUNCTION__,
-   bh, handle,err);
+   bh, handle, err);
return err;
 }
 
-/* For commit_write() in data=journal mode */
-static int commit_write_fn(handle_t *handle, struct buffer_head *bh)
+/* For write_end() in data=journal mode */
+static int write_end_fn(handle_t *handle, struct buffer_head *bh)
 {
if (!buffer_mapped(bh) || buffer_freed(bh))
return 0;
@@ -1272,78 +1284,103 @@ static int commit_write_fn(handle_t *han
  * ext3 never places buffers on inode->i_mapping->private_list.  metadata
  * buffers are managed internally.
  */
-static int ext3_ordered_commit_write(struct file *file, struct page *page,
-unsigned from, unsigned to)
+static int ext3_ordered_write_end(struct file *file,
+   struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata)
 {
handle_t *handle = ext3_journal_current_handle();
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = file->f_mapping->host;
+   unsigned from, to;
int ret = 0, ret2;
 
+   from = pos & (PAGE_CACHE_SIZE - 1);
+   to = from + len;
+
ret = walk_page_buffers(handle, page_buffers(page),
from, to, NULL, ext3_journal_dirty_data);
 
if (ret == 0) {
  

[patch 2/5] fs: introduce new aops and infrastructure

2007-03-14 Thread Nick Piggin
Introduce write_begin, write_end, and perform_write aops.

These are intended to replace prepare_write and commit_write with more
flexible alternatives that are also able to avoid the buffered write
deadlock problems efficiently (which prepare_write is unable to do).

 Documentation/filesystems/Locking |9 +
 Documentation/filesystems/vfs.txt |   38 ++
 drivers/block/loop.c  |   69 ---
 fs/buffer.c   |  144 
 fs/libfs.c|   40 ++
 fs/namei.c|   47 ++-
 fs/splice.c   |  106 +
 include/linux/buffer_head.h   |7 +
 include/linux/fs.h|   29 
 include/linux/pagemap.h   |2 
 mm/filemap.c  |  228 ++
 11 files changed, 494 insertions(+), 225 deletions(-)

Index: linux-2.6/include/linux/fs.h
===
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -449,6 +449,17 @@ struct address_space_operations {
 */
int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
+
+   int (*write_begin)(struct file *, struct address_space *mapping,
+   loff_t pos, unsigned len, int intr,
+   struct page **pagep, void **fsdata);
+   int (*write_end)(struct file *, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata);
+
+   int (*perform_write)(struct file *, struct address_space *,
+   struct iov_iter *, loff_t);
+
/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidatepage) (struct page *, unsigned long);
@@ -463,6 +474,18 @@ struct address_space_operations {
int (*launder_page) (struct page *);
 };
 
+/*
+ * pagecache_write_begin/pagecache_write_end must be used by general code
+ * to write into the pagecache.
+ */
+int pagecache_write_begin(struct file *, struct address_space *mapping,
+   loff_t pos, unsigned len, int intr,
+   struct page **pagep, void **fsdata);
+
+int pagecache_write_end(struct file *, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata);
+
 struct backing_dev_info;
 struct address_space {
struct inode*host;  /* owner: inode, block_device */
@@ -1902,6 +1925,12 @@ extern int simple_prepare_write(struct f
unsigned offset, unsigned to);
 extern int simple_commit_write(struct file *file, struct page *page,
unsigned offset, unsigned to);
+extern int simple_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, int intr,
+   struct page **pagep, void **fsdata);
+extern int simple_write_end(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, unsigned copied,
+   struct page *page, void *fsdata);
 
 extern struct dentry *simple_lookup(struct inode *, struct dentry *, struct 
nameidata *);
 extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t 
*);
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2049,6 +2049,88 @@ inline int generic_write_checks(struct f
 }
 EXPORT_SYMBOL(generic_write_checks);
 
+int pagecache_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, int intr,
+   struct page **pagep, void **fsdata)
+{
+   const struct address_space_operations *aops = mapping->a_ops;
+
+   if (aops->write_begin)
+   return aops->write_begin(file, mapping, pos, len, intr, pagep, 
fsdata);
+   else {
+   int ret;
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+   unsigned offset = pos & (PAGE_CACHE_SIZE - 1);
+   struct inode *inode = mapping->host;
+   struct page *page;
+again:
+   page = __grab_cache_page(mapping, index);
+   *pagep = page;
+   if (!page)
+   return -ENOMEM;
+
+   if (intr && !PageUptodate(page)) {
+   /*
+* There is no way to resolve a short write situation
+* for a !Uptodate page (except by double copying 

[patch 3/5] fs: convert some simple filesystems

2007-03-14 Thread Nick Piggin
Implement new aops for some of the simpler filesystems.

 fs/configfs/inode.c |4 ++--
 fs/sysfs/inode.c|4 ++--
 mm/shmem.c  |   16 ++--
 3 files changed, 14 insertions(+), 10 deletions(-)

Index: linux-2.6/mm/shmem.c
===
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -1419,10 +1419,14 @@ static const struct inode_operations shm
  * lets a tmpfs file be used read-write below the loop driver.
  */
 static int
-shmem_prepare_write(struct file *file, struct page *page, unsigned offset, 
unsigned to)
-{
-   struct inode *inode = page->mapping->host;
-   return shmem_getpage(inode, page->index, &page, SGP_WRITE, NULL);
+shmem_write_begin(struct file *file, struct address_space *mapping,
+   loff_t pos, unsigned len, int intr,
+   struct page **pagep, void **fsdata)
+{
+   struct inode *inode = mapping->host;
+   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+   *pagep = NULL;
+   return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
 }
 
 static ssize_t
@@ -2319,8 +2323,8 @@ static const struct address_space_operat
.writepage  = shmem_writepage,
.set_page_dirty = __set_page_dirty_no_writeback,
 #ifdef CONFIG_TMPFS
-   .prepare_write  = shmem_prepare_write,
-   .commit_write   = simple_commit_write,
+   .write_begin= shmem_write_begin,
+   .write_end  = simple_write_end,
 #endif
.migratepage= migrate_page,
 };
Index: linux-2.6/fs/configfs/inode.c
===
--- linux-2.6.orig/fs/configfs/inode.c
+++ linux-2.6/fs/configfs/inode.c
@@ -40,8 +40,8 @@ extern struct super_block * configfs_sb;
 
 static const struct address_space_operations configfs_aops = {
.readpage   = simple_readpage,
-   .prepare_write  = simple_prepare_write,
-   .commit_write   = simple_commit_write
+   .write_begin= simple_write_begin,
+   .write_end  = simple_write_end,
 };
 
 static struct backing_dev_info configfs_backing_dev_info = {
Index: linux-2.6/fs/sysfs/inode.c
===
--- linux-2.6.orig/fs/sysfs/inode.c
+++ linux-2.6/fs/sysfs/inode.c
@@ -20,8 +20,8 @@ extern struct super_block * sysfs_sb;
 
 static const struct address_space_operations sysfs_aops = {
.readpage   = simple_readpage,
-   .prepare_write  = simple_prepare_write,
-   .commit_write   = simple_commit_write
+   .write_begin= simple_write_begin,
+   .write_end  = simple_write_end,
 };
 
 static struct backing_dev_info sysfs_backing_dev_info = {
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 1/5] fs: add an iovec iterator

2007-03-14 Thread Nick Piggin
Add an iterator data structure to operate over an iovec. Add usercopy
operators needed by generic_file_buffered_write, and convert that function
over.

 include/linux/fs.h |   33 
 mm/filemap.c   |  144 +++--
 mm/filemap.h   |  103 -
 3 files changed, 150 insertions(+), 130 deletions(-)

Index: linux-2.6/include/linux/fs.h
===
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -396,6 +396,39 @@ struct page;
 struct address_space;
 struct writeback_control;
 
+struct iov_iter {
+   const struct iovec *iov;
+   unsigned long nr_segs;
+   size_t iov_offset;
+   size_t count;
+};
+
+size_t iov_iter_copy_from_user_atomic(struct page *page,
+   struct iov_iter *i, unsigned long offset, size_t bytes);
+size_t iov_iter_copy_from_user(struct page *page,
+   struct iov_iter *i, unsigned long offset, size_t bytes);
+void iov_iter_advance(struct iov_iter *i, size_t bytes);
+int iov_iter_fault_in_readable(struct iov_iter *i);
+size_t iov_iter_single_seg_count(struct iov_iter *i);
+
+static inline void iov_iter_init(struct iov_iter *i,
+   const struct iovec *iov, unsigned long nr_segs,
+   size_t count, size_t written)
+{
+   i->iov = iov;
+   i->nr_segs = nr_segs;
+   i->iov_offset = 0;
+   i->count = count + written;
+
+   iov_iter_advance(i, written);
+}
+
+static inline size_t iov_iter_count(struct iov_iter *i)
+{
+   return i->count;
+}
+
+
 struct address_space_operations {
int (*writepage)(struct page *page, struct writeback_control *wbc);
int (*readpage)(struct file *, struct page *);
Index: linux-2.6/mm/filemap.c
===
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -30,7 +30,7 @@
 #include 
 #include 
 #include 
-#include "filemap.h"
+#include  /* for BUG_ON(!in_atomic()) only */
 #include "internal.h"
 
 /*
@@ -1839,8 +1839,7 @@ int remove_suid(struct dentry *dentry)
 }
 EXPORT_SYMBOL(remove_suid);
 
-size_t
-__filemap_copy_from_user_iovec_inatomic(char *vaddr,
+static size_t __iovec_copy_from_user_inatomic(char *vaddr,
const struct iovec *iov, size_t base, size_t bytes)
 {
size_t copied = 0, left = 0;
@@ -1863,6 +1862,110 @@ __filemap_copy_from_user_iovec_inatomic(
 }
 
 /*
+ * Copy as much as we can into the page and return the number of bytes which
+ * were sucessfully copied.  If a fault is encountered then return the number 
of
+ * bytes which were copied.
+ */
+size_t iov_iter_copy_from_user_atomic(struct page *page,
+   struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+   char *kaddr;
+   size_t copied;
+
+   BUG_ON(!in_atomic());
+   kaddr = kmap_atomic(page, KM_USER0);
+   if (likely(i->nr_segs == 1)) {
+   int left;
+   char __user *buf = i->iov->iov_base + i->iov_offset;
+   left = __copy_from_user_inatomic_nocache(kaddr + offset,
+   buf, bytes);
+   copied = bytes - left;
+   } else {
+   copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+   i->iov, i->iov_offset, bytes);
+   }
+   kunmap_atomic(kaddr, KM_USER0);
+
+   return copied;
+}
+
+/*
+ * This has the same sideeffects and return value as
+ * iov_iter_copy_from_user_atomic().
+ * The difference is that it attempts to resolve faults.
+ * Page must not be locked.
+ */
+size_t iov_iter_copy_from_user(struct page *page,
+   struct iov_iter *i, unsigned long offset, size_t bytes)
+{
+   char *kaddr;
+   size_t copied;
+
+   kaddr = kmap(page);
+   if (likely(i->nr_segs == 1)) {
+   int left;
+   char __user *buf = i->iov->iov_base + i->iov_offset;
+   left = __copy_from_user_nocache(kaddr + offset, buf, bytes);
+   copied = bytes - left;
+   } else {
+   copied = __iovec_copy_from_user_inatomic(kaddr + offset,
+   i->iov, i->iov_offset, bytes);
+   }
+   kunmap(page);
+   return copied;
+}
+
+static void __iov_iter_advance_iov(struct iov_iter *i, size_t bytes)
+{
+   if (likely(i->nr_segs == 1)) {
+   i->iov_offset += bytes;
+   } else {
+   const struct iovec *iov = i->iov;
+   size_t base = i->iov_offset;
+
+   while (bytes) {
+   int copy = min(bytes, iov->iov_len - base);
+
+   bytes -= copy;
+   base += copy;
+   if (iov->iov_len == base) {
+   iov++;
+   base = 0;
+   

Re: [patch 2/3] fs: introduce perform_write aop

2007-03-14 Thread Nick Piggin
On Sat, Mar 10, 2007 at 09:25:41AM +, Christoph Hellwig wrote:
> On Fri, Mar 09, 2007 at 03:33:01PM -0800, Mark Fasheh wrote:
> > ->kernel_write() as opposed to genericizing ->perform_write() would be fine
> > with me. Just so long as we get rid of ->prepare_write and ->commit_write in
> > that other kernel code doesn't call them directly. That interface just
> > doesn't work for Ocfs2.
> 
> It doesn't work for any filesystem that needs slightly fancy locking.
> That and the reason that's an interface that doesn't fit into our
> layering is why I want to get rid of it.  Note that fops->kernel_write
> might in fact use ->perform_write with an actor as Nick suggested.
> I'm not quite sure how it'll look like - I'd rather take care of the
> buffered write path first and then handle this issue once the first
> changes have stabilized.

So I've tried a different approach - the 2-op API rather than an actor.

perform_write stays around as a higher performance API, but it isn't
required if the filesystem implements the 2-op API. I've called them
write_begin/write_end for now.

There are a few upshots to doing this rather than the actor approach.
First of all, this is what callers expect, they want to write into the
page directly rather than making an actor.

More importantly, it allows us to implement generic block versions of
the API which is much more reusable than block_perform_write (which was
basically useless for anything more than ext2).

The API calls for the filesystem to find and lock the page itself, and
pass that up to the caller, as well as handle short-writes properly, so
we can solve this deadlock properly.

The nice thing about this is that write_begin is basically a single
page case of the first half (before the iovec copy) of perform_write,
and write_end is the single page case of the second half (after the
copy). So any filesystem that implements perform_write should be able
to reuse write_begin/write_end components.

Anyway, I'm attaching the top patches of my stack (underneath are the
initial patches to solve prepare_write deadlock -- I'll repost the
complete set once I get some more feedback). Sorry it is a bit under
commented and not stress tested. However it does boot and run with
ext2/3/shmem and other simplefses.

Mark has been providing some helpful advice about the new 2-op interface,
but it is still pretty much up in the air.

Also note that a _lot_ of crud is there to support prepare_write (eg.
pagecache_write_begin/pagecache_write_end basically become single liners
once we get rid of the old API).

Anyway, I think I should throw this out for comments before investing too
much more time, in case everyone hates it ;)

Anyway, comments much appreciated...

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 2.6.21-rc3] [smbfs] "double free" memory corruption in smbfs

2007-03-14 Thread Vasily Averin
smbfs allocates rq_trans2buffer to handle server's multi transaction2 response
messages. As struct smb_request may be reused, rq_trans2buffer is freed before
each new request. However if last servers's response is not multi but single
trans2 message then new rq_trans2buffer is not allocated but last smb_rput still
tries to free it again.
To prevent this issue rq_trans2buffer pointer should be set to NULL after kfree.

Signed-off-by:  Vasily Averin <[EMAIL PROTECTED]>

--- 2.6.21-rc3/fs/smbfs/request.c   2007-03-13 14:22:53.0 +0300
+++ 2.6.21-rc3/fs/smbfs/request.c   2007-03-14 11:44:18.0 +0300
@@ -181,6 +181,7 @@ static int smb_setup_request(struct smb_
req->rq_errno = 0;
req->rq_fragment = 0;
kfree(req->rq_trans2buffer);
+   req->rq_trans2buffer = NULL;

return 0;
 }
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html