Re: [PATCH] incorrect error handling inside generic_file_direct_write

2007-01-02 Thread 'Christoph Hellwig'
On Fri, Dec 15, 2006 at 10:53:18AM -0800, Chen, Kenneth W wrote:
> Christoph Hellwig wrote on Friday, December 15, 2006 2:44 AM
> > So we're doing the sync_page_range once in __generic_file_aio_write
> > with i_mutex held.
> > 
> > 
> > >   mutex_lock(>i_mutex);
> > > - ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
> > > - >ki_pos);
> > > + ret = __generic_file_aio_write(iocb, iov, nr_segs, pos);
> > >   mutex_unlock(>i_mutex);
> > >  
> > >   if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> > 
> > And then another time after it's unlocked, this seems wrong.
> 
> 
> I didn't invent that mess though.
> 
> I should've ask the question first: in 2.6.20-rc1, generic_file_aio_write
> will call sync_page_range twice, once from __generic_file_aio_write_nolock
> and once within the function itself.  Is it redundant?  Can we delete the
> one in the top level function?  Like the following?

Really?  I'm looking at -rc3 now as -rc1 is rather old and it's definitly
not the case there.  I also can't remember ever doing this - when I
started the generic read/write path untangling I had exactly the same
situation that's now in -rc3:

  - generic_file_aio_write_nolock calls sync_page_range_nolock
  - generic_file_aio_write calls sync_page_range
  - __generic_file_aio_write_nolock doesn't call any sync_page_range variant

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2007-01-02 Thread 'Christoph Hellwig'
On Fri, Dec 15, 2006 at 10:53:18AM -0800, Chen, Kenneth W wrote:
 Christoph Hellwig wrote on Friday, December 15, 2006 2:44 AM
  So we're doing the sync_page_range once in __generic_file_aio_write
  with i_mutex held.
  
  
 mutex_lock(inode-i_mutex);
   - ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
   - iocb-ki_pos);
   + ret = __generic_file_aio_write(iocb, iov, nr_segs, pos);
 mutex_unlock(inode-i_mutex);

 if (ret  0  ((file-f_flags  O_SYNC) || IS_SYNC(inode))) {
  
  And then another time after it's unlocked, this seems wrong.
 
 
 I didn't invent that mess though.
 
 I should've ask the question first: in 2.6.20-rc1, generic_file_aio_write
 will call sync_page_range twice, once from __generic_file_aio_write_nolock
 and once within the function itself.  Is it redundant?  Can we delete the
 one in the top level function?  Like the following?

Really?  I'm looking at -rc3 now as -rc1 is rather old and it's definitly
not the case there.  I also can't remember ever doing this - when I
started the generic read/write path untangling I had exactly the same
situation that's now in -rc3:

  - generic_file_aio_write_nolock calls sync_page_range_nolock
  - generic_file_aio_write calls sync_page_range
  - __generic_file_aio_write_nolock doesn't call any sync_page_range variant

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-15 Thread Chen, Kenneth W
Christoph Hellwig wrote on Friday, December 15, 2006 2:44 AM
> So we're doing the sync_page_range once in __generic_file_aio_write
> with i_mutex held.
> 
> 
> > mutex_lock(>i_mutex);
> > -   ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
> > -   >ki_pos);
> > +   ret = __generic_file_aio_write(iocb, iov, nr_segs, pos);
> > mutex_unlock(>i_mutex);
> >  
> > if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> 
> And then another time after it's unlocked, this seems wrong.


I didn't invent that mess though.

I should've ask the question first: in 2.6.20-rc1, generic_file_aio_write
will call sync_page_range twice, once from __generic_file_aio_write_nolock
and once within the function itself.  Is it redundant?  Can we delete the
one in the top level function?  Like the following?


--- ./mm/filemap.c.orig 2006-12-15 09:02:58.0 -0800
+++ ./mm/filemap.c  2006-12-15 09:03:19.0 -0800
@@ -2370,14 +2370,6 @@ ssize_t generic_file_aio_write(struct ki
ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
>ki_pos);
mutex_unlock(>i_mutex);
-
-   if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-   ssize_t err;
-
-   err = sync_page_range(inode, mapping, pos, ret);
-   if (err < 0)
-   ret = err;
-   }
return ret;
 }
 EXPORT_SYMBOL(generic_file_aio_write);

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-15 Thread 'Christoph Hellwig'
> +ssize_t
> +__generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
> + unsigned long nr_segs, loff_t pos)

I'd still call this generic_file_aio_write_nolock.

> + loff_t  *ppos = >ki_pos;

I'd rather use iocb->ki_pos directly in the few places ppos is referenced
currently.

>   if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
> - ssize_t err;
> -
>   err = sync_page_range_nolock(inode, mapping, pos, ret);
>   if (err < 0)
>   ret = err;
>   }

So we're doing the sync_page_range once in __generic_file_aio_write
with i_mutex held.


>   mutex_lock(>i_mutex);
> - ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
> - >ki_pos);
> + ret = __generic_file_aio_write(iocb, iov, nr_segs, pos);
>   mutex_unlock(>i_mutex);
>  
>   if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {

And then another time after it's unlocked, this seems wrong.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-15 Thread 'Christoph Hellwig'
 +ssize_t
 +__generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 + unsigned long nr_segs, loff_t pos)

I'd still call this generic_file_aio_write_nolock.

 + loff_t  *ppos = iocb-ki_pos;

I'd rather use iocb-ki_pos directly in the few places ppos is referenced
currently.

   if (ret  0  ((file-f_flags  O_SYNC) || IS_SYNC(inode))) {
 - ssize_t err;
 -
   err = sync_page_range_nolock(inode, mapping, pos, ret);
   if (err  0)
   ret = err;
   }

So we're doing the sync_page_range once in __generic_file_aio_write
with i_mutex held.


   mutex_lock(inode-i_mutex);
 - ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
 - iocb-ki_pos);
 + ret = __generic_file_aio_write(iocb, iov, nr_segs, pos);
   mutex_unlock(inode-i_mutex);
  
   if (ret  0  ((file-f_flags  O_SYNC) || IS_SYNC(inode))) {

And then another time after it's unlocked, this seems wrong.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-15 Thread Chen, Kenneth W
Christoph Hellwig wrote on Friday, December 15, 2006 2:44 AM
 So we're doing the sync_page_range once in __generic_file_aio_write
 with i_mutex held.
 
 
  mutex_lock(inode-i_mutex);
  -   ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
  -   iocb-ki_pos);
  +   ret = __generic_file_aio_write(iocb, iov, nr_segs, pos);
  mutex_unlock(inode-i_mutex);
   
  if (ret  0  ((file-f_flags  O_SYNC) || IS_SYNC(inode))) {
 
 And then another time after it's unlocked, this seems wrong.


I didn't invent that mess though.

I should've ask the question first: in 2.6.20-rc1, generic_file_aio_write
will call sync_page_range twice, once from __generic_file_aio_write_nolock
and once within the function itself.  Is it redundant?  Can we delete the
one in the top level function?  Like the following?


--- ./mm/filemap.c.orig 2006-12-15 09:02:58.0 -0800
+++ ./mm/filemap.c  2006-12-15 09:03:19.0 -0800
@@ -2370,14 +2370,6 @@ ssize_t generic_file_aio_write(struct ki
ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
iocb-ki_pos);
mutex_unlock(inode-i_mutex);
-
-   if (ret  0  ((file-f_flags  O_SYNC) || IS_SYNC(inode))) {
-   ssize_t err;
-
-   err = sync_page_range(inode, mapping, pos, ret);
-   if (err  0)
-   ret = err;
-   }
return ret;
 }
 EXPORT_SYMBOL(generic_file_aio_write);

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-12 Thread Chen, Kenneth W
Andrew Morton wrote on Tuesday, December 12, 2006 2:40 AM
> On Tue, 12 Dec 2006 16:18:32 +0300
> Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:
> 
> > >> but according to filemaps locking rules: mm/filemap.c:77
> > >>  ..
> > >>  *  ->i_mutex(generic_file_buffered_write)
> > >>  *->mmap_sem (fault_in_pages_readable->do_page_fault)
> > >>  ..
> > >> I'm confused a litle bit, where is the truth? 
> > >
> > > xfs_write() calls generic_file_direct_write() without taking i_mutex for
> > > O_DIRECT writes.
> > Yes, but my quastion is about __generic_file_aio_write_nolock().
> > As i understand _nolock sufix means that i_mutex was already locked 
> > by caller, am i right ?
> 
> Nope.  It just means that __generic_file_aio_write_nolock() doesn't take
> the lock.  We don't assume or require that the caller took it.  For example
> the raw driver calls generic_file_aio_write_nolock() without taking
> i_mutex.  Raw isn't relevant to the problem (although ocfs2 might be).  But
> we cannot assume that all callers have taken i_mutex, I think.


I think we should also clean up generic_file_aio_write_nolock. This was
brought up a couple of weeks ago and I gave up too early.  Here is my
second attempt.

How about the following patch, I think we can kill generic_file_aio_write_nolock
and merge both *file_aio_write_nolock into one function, then

generic_file_aio_write
ocfs2_file_aio_write
blk_dev->aio_write

all points to a non-lock version of __generic_file_aio_write().  First
two already hold i_mutex, while the block device's aio_write method
doesn't require i_mutex to be held.


Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

diff -Nurp linux-2.6.19/drivers/char/raw.c linux-2.6.19.ken/drivers/char/raw.c
--- linux-2.6.19/drivers/char/raw.c 2006-11-29 13:57:37.0 -0800
+++ linux-2.6.19.ken/drivers/char/raw.c 2006-12-12 16:41:39.0 -0800
@@ -242,7 +242,7 @@ static const struct file_operations raw_
.read   =   do_sync_read,
.aio_read = generic_file_aio_read,
.write  =   do_sync_write,
-   .aio_write =generic_file_aio_write_nolock,
+   .aio_write =__generic_file_aio_write,
.open   =   raw_open,
.release=   raw_release,
.ioctl  =   raw_ioctl,
diff -Nurp linux-2.6.19/fs/block_dev.c linux-2.6.19.ken/fs/block_dev.c
--- linux-2.6.19/fs/block_dev.c 2006-11-29 13:57:37.0 -0800
+++ linux-2.6.19.ken/fs/block_dev.c 2006-12-12 16:47:58.0 -0800
@@ -1198,7 +1198,7 @@ const struct file_operations def_blk_fop
.read   = do_sync_read,
.write  = do_sync_write,
.aio_read   = generic_file_aio_read,
-   .aio_write  = generic_file_aio_write_nolock,
+   .aio_write  = __generic_file_aio_write,
.mmap   = generic_file_mmap,
.fsync  = block_fsync,
.unlocked_ioctl = block_ioctl,
diff -Nurp linux-2.6.19/fs/ocfs2/file.c linux-2.6.19.ken/fs/ocfs2/file.c
--- linux-2.6.19/fs/ocfs2/file.c2006-11-29 13:57:37.0 -0800
+++ linux-2.6.19.ken/fs/ocfs2/file.c2006-12-12 16:42:09.0 -0800
@@ -1107,7 +1107,7 @@ static ssize_t ocfs2_file_aio_write(stru
/* communicate with ocfs2_dio_end_io */
ocfs2_iocb_set_rw_locked(iocb);
 
-   ret = generic_file_aio_write_nolock(iocb, iov, nr_segs, iocb->ki_pos);
+   ret = __generic_file_aio_write(iocb, iov, nr_segs, iocb->ki_pos);
 
/* buffered aio wouldn't have proper lock coverage today */
BUG_ON(ret == -EIOCBQUEUED && !(filp->f_flags & O_DIRECT));
diff -Nurp linux-2.6.19/include/linux/fs.h linux-2.6.19.ken/include/linux/fs.h
--- linux-2.6.19/include/linux/fs.h 2006-11-29 13:57:37.0 -0800
+++ linux-2.6.19.ken/include/linux/fs.h 2006-12-12 16:41:58.0 -0800
@@ -1742,7 +1742,7 @@ extern int file_send_actor(read_descript
 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int 
isblk);
 extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, 
unsigned long, loff_t);
 extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, 
unsigned long, loff_t);
-extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct 
iovec *,
+extern ssize_t __generic_file_aio_write(struct kiocb *, const struct iovec *,
unsigned long, loff_t);
 extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *,
unsigned long *, loff_t, loff_t *, size_t, size_t);
diff -Nurp linux-2.6.19/mm/filemap.c linux-2.6.19.ken/mm/filemap.c
--- linux-2.6.19/mm/filemap.c   2006-11-29 13:57:37.0 -0800
+++ linux-2.6.19.ken/mm/filemap.c   2006-12-12 16:47:58.0 -0800
@@ -2219,9 +2219,9 @@ zero_length_segment:
 }
 EXPORT_SYMBOL(generic_file_buffered_write);
 
-static ssize_t
-__generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
-   unsigned long nr_segs, loff_t 

Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-12 Thread Dmitriy Monakhov
Andrew Morton <[EMAIL PROTECTED]> writes:

> On Tue, 12 Dec 2006 16:18:32 +0300
> Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:
>
>> >> but according to filemaps locking rules: mm/filemap.c:77
>> >>  ..
>> >>  *  ->i_mutex (generic_file_buffered_write)
>> >>  *->mmap_sem  (fault_in_pages_readable->do_page_fault)
>> >>  ..
>> >> I'm confused a litle bit, where is the truth? 
>> >
>> > xfs_write() calls generic_file_direct_write() without taking i_mutex for
>> > O_DIRECT writes.
>> Yes, but my quastion is about __generic_file_aio_write_nolock().
>> As i understand _nolock sufix means that i_mutex was already locked 
>> by caller, am i right ?
>
> Nope.  It just means that __generic_file_aio_write_nolock() doesn't take
> the lock.  We don't assume or require that the caller took it.  For example
> the raw driver calls generic_file_aio_write_nolock() without taking
> i_mutex.  Raw isn't relevant to the problem (although ocfs2 might be).  But
> we cannot assume that all callers have taken i_mutex, I think.
>
> I guess we can make that a rule (document it, add
> BUG_ON(!mutex_is_locked(..)) if it isn't a blockdev) if needs be.  After
> really checking that this matches reality for all callers.
I've checked generic_file_aio_write_nolock() callers for non blockdev.
Only ocfs2 call it explicitly, and do it under i_mutex.
So we need to do: 
1) Change wrong comments.
2) Add BUG_ON(!mutex_is_locked(..)) for non blkdev.
3) Invoke vmtruncate only for non blkdev.

Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b84dc8..540ef5e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2046,8 +2046,8 @@ generic_file_direct_write(struct kiocb *
/*
 * Sync the fs metadata but not the minor inode changes and
 * of course not the data as we did direct DMA for the IO.
-* i_mutex is held, which protects generic_osync_inode() from
-* livelocking.
+* i_mutex may not being held, if so some specific locking
+* ordering must protect generic_osync_inode() from livelocking.
 */
if (written >= 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
@@ -2282,6 +2282,17 @@ __generic_file_aio_write_nolock(struct k
 
written = generic_file_direct_write(iocb, iov, _segs, pos,
ppos, count, ocount);
+   /*
+* If host is not S_ISBLK generic_file_direct_write() may 
+* have instantiated a few blocks outside i_size  files
+* Trim these off again.
+*/
+   if (unlikely(written < 0) && !S_ISBLK(inode->i_mode)) {
+   loff_t isize = i_size_read(inode);
+   if (pos + count > isize)
+   vmtruncate(inode, isize);
+   }
+
if (written < 0 || written == count)
goto out;
/*
@@ -2344,6 +2355,13 @@ ssize_t generic_file_aio_write_nolock(st
ssize_t ret;
 
BUG_ON(iocb->ki_pos != pos);
+   /*
+*  generic_file_buffered_write() may be called inside 
+*  __generic_file_aio_write_nolock() even in case of
+*  O_DIRECT for non S_ISBLK files. So i_mutex must be held.
+*/
+   if (!S_ISBLK(inode->i_mode))
+   BUG_ON(!mutex_is_locked(>i_mutex));
 
ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
>ki_pos);
@@ -2386,8 +2404,8 @@ ssize_t generic_file_aio_write(struct ki
 EXPORT_SYMBOL(generic_file_aio_write);
 
 /*
- * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if 
something
- * went wrong during pagecache shootdown.
+ * May be called without i_mutex for writes to S_ISREG files.
+ * Returns -EIO if something went wrong during pagecache shootdown.
  */
 static ssize_t
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-12 Thread Andrew Morton
On Tue, 12 Dec 2006 16:18:32 +0300
Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:

> >> but according to filemaps locking rules: mm/filemap.c:77
> >>  ..
> >>  *  ->i_mutex  (generic_file_buffered_write)
> >>  *->mmap_sem   (fault_in_pages_readable->do_page_fault)
> >>  ..
> >> I'm confused a litle bit, where is the truth? 
> >
> > xfs_write() calls generic_file_direct_write() without taking i_mutex for
> > O_DIRECT writes.
> Yes, but my quastion is about __generic_file_aio_write_nolock().
> As i understand _nolock sufix means that i_mutex was already locked 
> by caller, am i right ?

Nope.  It just means that __generic_file_aio_write_nolock() doesn't take
the lock.  We don't assume or require that the caller took it.  For example
the raw driver calls generic_file_aio_write_nolock() without taking
i_mutex.  Raw isn't relevant to the problem (although ocfs2 might be).  But
we cannot assume that all callers have taken i_mutex, I think.

I guess we can make that a rule (document it, add
BUG_ON(!mutex_is_locked(..)) if it isn't a blockdev) if needs be.  After
really checking that this matches reality for all callers.

It's important, too - if we have an unprotected i_size_write() then the
seqlock can get out of sync due to a race and then i_size_read() locks up
the kernel.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-12 Thread Dmitriy Monakhov
Andrew Morton <[EMAIL PROTECTED]> writes:

> On Tue, 12 Dec 2006 15:20:52 +0300
> Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:
>
>> > XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
>> > And vmtruncate() expects i_mutex to be held.
>> >
>> > I guess a suitable solution would be to push this problem back up to the
>> > callers: let them decide whether to run vmtruncate() and if so, to ensure
>> > that i_mutex is held.
>> >
>> > The existence of generic_file_aio_write_nolock() makes that rather messy
>> > though.
>> This means we may call generic_file_aio_write_nolock() without i_mutex, 
>> right?
>> but call trace is :
>>   generic_file_aio_write_nolock() 
>>   ->generic_file_buffered_write() /* i_mutex not held here */ 
>> but according to filemaps locking rules: mm/filemap.c:77
>>  ..
>>  *  ->i_mutex(generic_file_buffered_write)
>>  *->mmap_sem (fault_in_pages_readable->do_page_fault)
>>  ..
>> I'm confused a litle bit, where is the truth? 
>
> xfs_write() calls generic_file_direct_write() without taking i_mutex for
> O_DIRECT writes.
Yes, but my quastion is about __generic_file_aio_write_nolock().
As i understand _nolock sufix means that i_mutex was already locked 
by caller, am i right ?
If yes, than __generic_file_aio_write_nolock() is beter place for vmtrancate() 
acclivity after generic_file_direct_write() has fail.
Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
---
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b84dc8..723d2ca 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2282,6 +2282,15 @@ __generic_file_aio_write_nolock(struct k
 
written = generic_file_direct_write(iocb, iov, _segs, pos,
ppos, count, ocount);
+   if (written < 0) {
+   loff_t isize = i_size_read(inode);
+   /*
+* generic_file_direct_write() may have instantiated 
+* a few blocks outside i_size. Trim these off again.
+*/
+   if (pos + count > isize)
+   vmtruncate(inode, isize);
+   }
if (written < 0 || written == count)
goto out;
/*


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-12 Thread Andrew Morton
On Tue, 12 Dec 2006 15:20:52 +0300
Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:

> > XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
> > And vmtruncate() expects i_mutex to be held.
> >
> > I guess a suitable solution would be to push this problem back up to the
> > callers: let them decide whether to run vmtruncate() and if so, to ensure
> > that i_mutex is held.
> >
> > The existence of generic_file_aio_write_nolock() makes that rather messy
> > though.
> This means we may call generic_file_aio_write_nolock() without i_mutex, right?
> but call trace is :
>   generic_file_aio_write_nolock() 
>   ->generic_file_buffered_write() /* i_mutex not held here */ 
> but according to filemaps locking rules: mm/filemap.c:77
>  ..
>  *  ->i_mutex (generic_file_buffered_write)
>  *->mmap_sem  (fault_in_pages_readable->do_page_fault)
>  ..
> I'm confused a litle bit, where is the truth? 

xfs_write() calls generic_file_direct_write() without taking i_mutex for
O_DIRECT writes.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-12 Thread Dmitriy Monakhov
Andrew Morton <[EMAIL PROTECTED]> writes:

> On Mon, 11 Dec 2006 16:34:27 +0300
> Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:
>
>> OpenVZ team has discovered error inside generic_file_direct_write()
>> If generic_file_direct_IO() has fail (ENOSPC condition) it may have 
>> instantiated
>> a few blocks outside i_size. And fsck will complain about wrong i_size
>> (ext2, ext3 and reiserfs interpret i_size and biggest block difference as 
>> error),
>> after fsck will fix error i_size will be increased to the biggest block,
>> but this blocks contain gurbage from previous write attempt, this is not 
>> information leak, but its silence file data corruption. 
>> We need truncate any block beyond i_size after write have failed , do in 
>> simular
>> generic_file_buffered_write() error path.
>> 
>> Exampe:
>> open("mnt2/FILE3", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
>> write(3, "aa"..., 4096) = -1 ENOSPC (No space left on device)
>> 
>> stat mnt2/FILE3
>> File: `mnt2/FILE3'
>> Size: 0   Blocks: 4  IO Block: 4096   regular empty file
>> >>^^ file size is less than biggest block idx
>> Device: 700h/1792d  Inode: 14  Links: 1
>> Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
>> 
>> fsck.ext2 -f -n  mnt1/fs_img
>> Pass 1: Checking inodes, blocks, and sizes
>> Inode 14, i_size is 0, should be 2048.  Fix? no
>> 
>> Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
>> --
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 7b84dc8..bf7cf6c 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2041,6 +2041,14 @@ generic_file_direct_write(struct kiocb *
>>  mark_inode_dirty(inode);
>>  }
>>  *ppos = end;
>> +} else if (written < 0) {
>> +loff_t isize = i_size_read(inode);
>> +/*
>> + * generic_file_direct_IO() may have instantiated a few blocks
>> + * outside i_size.  Trim these off again.
>> + */
>> +if (pos + count > isize)
>> +vmtruncate(inode, isize);
>>  }
>>  
>
> XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
> And vmtruncate() expects i_mutex to be held.
>
> I guess a suitable solution would be to push this problem back up to the
> callers: let them decide whether to run vmtruncate() and if so, to ensure
> that i_mutex is held.
>
> The existence of generic_file_aio_write_nolock() makes that rather messy
> though.
This means we may call generic_file_aio_write_nolock() without i_mutex, right?
but call trace is :
  generic_file_aio_write_nolock() 
  ->generic_file_buffered_write() /* i_mutex not held here */ 
but according to filemaps locking rules: mm/filemap.c:77
 ..
 *  ->i_mutex   (generic_file_buffered_write)
 *->mmap_sem(fault_in_pages_readable->do_page_fault)
 ..
I'm confused a litle bit, where is the truth? 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-12 Thread Dmitriy Monakhov
Andrew Morton [EMAIL PROTECTED] writes:

 On Mon, 11 Dec 2006 16:34:27 +0300
 Dmitriy Monakhov [EMAIL PROTECTED] wrote:

 OpenVZ team has discovered error inside generic_file_direct_write()
 If generic_file_direct_IO() has fail (ENOSPC condition) it may have 
 instantiated
 a few blocks outside i_size. And fsck will complain about wrong i_size
 (ext2, ext3 and reiserfs interpret i_size and biggest block difference as 
 error),
 after fsck will fix error i_size will be increased to the biggest block,
 but this blocks contain gurbage from previous write attempt, this is not 
 information leak, but its silence file data corruption. 
 We need truncate any block beyond i_size after write have failed , do in 
 simular
 generic_file_buffered_write() error path.
 
 Exampe:
 open(mnt2/FILE3, O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
 write(3, aa..., 4096) = -1 ENOSPC (No space left on device)
 
 stat mnt2/FILE3
 File: `mnt2/FILE3'
 Size: 0   Blocks: 4  IO Block: 4096   regular empty file
 ^^ file size is less than biggest block idx
 Device: 700h/1792d  Inode: 14  Links: 1
 Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
 
 fsck.ext2 -f -n  mnt1/fs_img
 Pass 1: Checking inodes, blocks, and sizes
 Inode 14, i_size is 0, should be 2048.  Fix? no
 
 Signed-off-by: Dmitriy Monakhov [EMAIL PROTECTED]
 --

 diff --git a/mm/filemap.c b/mm/filemap.c
 index 7b84dc8..bf7cf6c 100644
 --- a/mm/filemap.c
 +++ b/mm/filemap.c
 @@ -2041,6 +2041,14 @@ generic_file_direct_write(struct kiocb *
  mark_inode_dirty(inode);
  }
  *ppos = end;
 +} else if (written  0) {
 +loff_t isize = i_size_read(inode);
 +/*
 + * generic_file_direct_IO() may have instantiated a few blocks
 + * outside i_size.  Trim these off again.
 + */
 +if (pos + count  isize)
 +vmtruncate(inode, isize);
  }
  

 XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
 And vmtruncate() expects i_mutex to be held.

 I guess a suitable solution would be to push this problem back up to the
 callers: let them decide whether to run vmtruncate() and if so, to ensure
 that i_mutex is held.

 The existence of generic_file_aio_write_nolock() makes that rather messy
 though.
This means we may call generic_file_aio_write_nolock() without i_mutex, right?
but call trace is :
  generic_file_aio_write_nolock() 
  -generic_file_buffered_write() /* i_mutex not held here */ 
but according to filemaps locking rules: mm/filemap.c:77
 ..
 *  -i_mutex   (generic_file_buffered_write)
 *-mmap_sem(fault_in_pages_readable-do_page_fault)
 ..
I'm confused a litle bit, where is the truth? 

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-12 Thread Andrew Morton
On Tue, 12 Dec 2006 15:20:52 +0300
Dmitriy Monakhov [EMAIL PROTECTED] wrote:

  XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
  And vmtruncate() expects i_mutex to be held.
 
  I guess a suitable solution would be to push this problem back up to the
  callers: let them decide whether to run vmtruncate() and if so, to ensure
  that i_mutex is held.
 
  The existence of generic_file_aio_write_nolock() makes that rather messy
  though.
 This means we may call generic_file_aio_write_nolock() without i_mutex, right?
 but call trace is :
   generic_file_aio_write_nolock() 
   -generic_file_buffered_write() /* i_mutex not held here */ 
 but according to filemaps locking rules: mm/filemap.c:77
  ..
  *  -i_mutex (generic_file_buffered_write)
  *-mmap_sem  (fault_in_pages_readable-do_page_fault)
  ..
 I'm confused a litle bit, where is the truth? 

xfs_write() calls generic_file_direct_write() without taking i_mutex for
O_DIRECT writes.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-12 Thread Dmitriy Monakhov
Andrew Morton [EMAIL PROTECTED] writes:

 On Tue, 12 Dec 2006 15:20:52 +0300
 Dmitriy Monakhov [EMAIL PROTECTED] wrote:

  XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
  And vmtruncate() expects i_mutex to be held.
 
  I guess a suitable solution would be to push this problem back up to the
  callers: let them decide whether to run vmtruncate() and if so, to ensure
  that i_mutex is held.
 
  The existence of generic_file_aio_write_nolock() makes that rather messy
  though.
 This means we may call generic_file_aio_write_nolock() without i_mutex, 
 right?
 but call trace is :
   generic_file_aio_write_nolock() 
   -generic_file_buffered_write() /* i_mutex not held here */ 
 but according to filemaps locking rules: mm/filemap.c:77
  ..
  *  -i_mutex(generic_file_buffered_write)
  *-mmap_sem (fault_in_pages_readable-do_page_fault)
  ..
 I'm confused a litle bit, where is the truth? 

 xfs_write() calls generic_file_direct_write() without taking i_mutex for
 O_DIRECT writes.
Yes, but my quastion is about __generic_file_aio_write_nolock().
As i understand _nolock sufix means that i_mutex was already locked 
by caller, am i right ?
If yes, than __generic_file_aio_write_nolock() is beter place for vmtrancate() 
acclivity after generic_file_direct_write() has fail.
Signed-off-by: Dmitriy Monakhov [EMAIL PROTECTED]
---
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b84dc8..723d2ca 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2282,6 +2282,15 @@ __generic_file_aio_write_nolock(struct k
 
written = generic_file_direct_write(iocb, iov, nr_segs, pos,
ppos, count, ocount);
+   if (written  0) {
+   loff_t isize = i_size_read(inode);
+   /*
+* generic_file_direct_write() may have instantiated 
+* a few blocks outside i_size. Trim these off again.
+*/
+   if (pos + count  isize)
+   vmtruncate(inode, isize);
+   }
if (written  0 || written == count)
goto out;
/*


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-12 Thread Andrew Morton
On Tue, 12 Dec 2006 16:18:32 +0300
Dmitriy Monakhov [EMAIL PROTECTED] wrote:

  but according to filemaps locking rules: mm/filemap.c:77
   ..
   *  -i_mutex  (generic_file_buffered_write)
   *-mmap_sem   (fault_in_pages_readable-do_page_fault)
   ..
  I'm confused a litle bit, where is the truth? 
 
  xfs_write() calls generic_file_direct_write() without taking i_mutex for
  O_DIRECT writes.
 Yes, but my quastion is about __generic_file_aio_write_nolock().
 As i understand _nolock sufix means that i_mutex was already locked 
 by caller, am i right ?

Nope.  It just means that __generic_file_aio_write_nolock() doesn't take
the lock.  We don't assume or require that the caller took it.  For example
the raw driver calls generic_file_aio_write_nolock() without taking
i_mutex.  Raw isn't relevant to the problem (although ocfs2 might be).  But
we cannot assume that all callers have taken i_mutex, I think.

I guess we can make that a rule (document it, add
BUG_ON(!mutex_is_locked(..)) if it isn't a blockdev) if needs be.  After
really checking that this matches reality for all callers.

It's important, too - if we have an unprotected i_size_write() then the
seqlock can get out of sync due to a race and then i_size_read() locks up
the kernel.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-12 Thread Dmitriy Monakhov
Andrew Morton [EMAIL PROTECTED] writes:

 On Tue, 12 Dec 2006 16:18:32 +0300
 Dmitriy Monakhov [EMAIL PROTECTED] wrote:

  but according to filemaps locking rules: mm/filemap.c:77
   ..
   *  -i_mutex (generic_file_buffered_write)
   *-mmap_sem  (fault_in_pages_readable-do_page_fault)
   ..
  I'm confused a litle bit, where is the truth? 
 
  xfs_write() calls generic_file_direct_write() without taking i_mutex for
  O_DIRECT writes.
 Yes, but my quastion is about __generic_file_aio_write_nolock().
 As i understand _nolock sufix means that i_mutex was already locked 
 by caller, am i right ?

 Nope.  It just means that __generic_file_aio_write_nolock() doesn't take
 the lock.  We don't assume or require that the caller took it.  For example
 the raw driver calls generic_file_aio_write_nolock() without taking
 i_mutex.  Raw isn't relevant to the problem (although ocfs2 might be).  But
 we cannot assume that all callers have taken i_mutex, I think.

 I guess we can make that a rule (document it, add
 BUG_ON(!mutex_is_locked(..)) if it isn't a blockdev) if needs be.  After
 really checking that this matches reality for all callers.
I've checked generic_file_aio_write_nolock() callers for non blockdev.
Only ocfs2 call it explicitly, and do it under i_mutex.
So we need to do: 
1) Change wrong comments.
2) Add BUG_ON(!mutex_is_locked(..)) for non blkdev.
3) Invoke vmtruncate only for non blkdev.

Signed-off-by: Dmitriy Monakhov [EMAIL PROTECTED]
---
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b84dc8..540ef5e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2046,8 +2046,8 @@ generic_file_direct_write(struct kiocb *
/*
 * Sync the fs metadata but not the minor inode changes and
 * of course not the data as we did direct DMA for the IO.
-* i_mutex is held, which protects generic_osync_inode() from
-* livelocking.
+* i_mutex may not being held, if so some specific locking
+* ordering must protect generic_osync_inode() from livelocking.
 */
if (written = 0  ((file-f_flags  O_SYNC) || IS_SYNC(inode))) {
int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
@@ -2282,6 +2282,17 @@ __generic_file_aio_write_nolock(struct k
 
written = generic_file_direct_write(iocb, iov, nr_segs, pos,
ppos, count, ocount);
+   /*
+* If host is not S_ISBLK generic_file_direct_write() may 
+* have instantiated a few blocks outside i_size  files
+* Trim these off again.
+*/
+   if (unlikely(written  0)  !S_ISBLK(inode-i_mode)) {
+   loff_t isize = i_size_read(inode);
+   if (pos + count  isize)
+   vmtruncate(inode, isize);
+   }
+
if (written  0 || written == count)
goto out;
/*
@@ -2344,6 +2355,13 @@ ssize_t generic_file_aio_write_nolock(st
ssize_t ret;
 
BUG_ON(iocb-ki_pos != pos);
+   /*
+*  generic_file_buffered_write() may be called inside 
+*  __generic_file_aio_write_nolock() even in case of
+*  O_DIRECT for non S_ISBLK files. So i_mutex must be held.
+*/
+   if (!S_ISBLK(inode-i_mode))
+   BUG_ON(!mutex_is_locked(inode-i_mutex));
 
ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
iocb-ki_pos);
@@ -2386,8 +2404,8 @@ ssize_t generic_file_aio_write(struct ki
 EXPORT_SYMBOL(generic_file_aio_write);
 
 /*
- * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if 
something
- * went wrong during pagecache shootdown.
+ * May be called without i_mutex for writes to S_ISREG files.
+ * Returns -EIO if something went wrong during pagecache shootdown.
  */
 static ssize_t
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,


RE: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-12 Thread Chen, Kenneth W
Andrew Morton wrote on Tuesday, December 12, 2006 2:40 AM
 On Tue, 12 Dec 2006 16:18:32 +0300
 Dmitriy Monakhov [EMAIL PROTECTED] wrote:
 
   but according to filemaps locking rules: mm/filemap.c:77
..
*  -i_mutex(generic_file_buffered_write)
*-mmap_sem (fault_in_pages_readable-do_page_fault)
..
   I'm confused a litle bit, where is the truth? 
  
   xfs_write() calls generic_file_direct_write() without taking i_mutex for
   O_DIRECT writes.
  Yes, but my quastion is about __generic_file_aio_write_nolock().
  As i understand _nolock sufix means that i_mutex was already locked 
  by caller, am i right ?
 
 Nope.  It just means that __generic_file_aio_write_nolock() doesn't take
 the lock.  We don't assume or require that the caller took it.  For example
 the raw driver calls generic_file_aio_write_nolock() without taking
 i_mutex.  Raw isn't relevant to the problem (although ocfs2 might be).  But
 we cannot assume that all callers have taken i_mutex, I think.


I think we should also clean up generic_file_aio_write_nolock. This was
brought up a couple of weeks ago and I gave up too early.  Here is my
second attempt.

How about the following patch, I think we can kill generic_file_aio_write_nolock
and merge both *file_aio_write_nolock into one function, then

generic_file_aio_write
ocfs2_file_aio_write
blk_dev-aio_write

all points to a non-lock version of __generic_file_aio_write().  First
two already hold i_mutex, while the block device's aio_write method
doesn't require i_mutex to be held.


Signed-off-by: Ken Chen [EMAIL PROTECTED]

diff -Nurp linux-2.6.19/drivers/char/raw.c linux-2.6.19.ken/drivers/char/raw.c
--- linux-2.6.19/drivers/char/raw.c 2006-11-29 13:57:37.0 -0800
+++ linux-2.6.19.ken/drivers/char/raw.c 2006-12-12 16:41:39.0 -0800
@@ -242,7 +242,7 @@ static const struct file_operations raw_
.read   =   do_sync_read,
.aio_read = generic_file_aio_read,
.write  =   do_sync_write,
-   .aio_write =generic_file_aio_write_nolock,
+   .aio_write =__generic_file_aio_write,
.open   =   raw_open,
.release=   raw_release,
.ioctl  =   raw_ioctl,
diff -Nurp linux-2.6.19/fs/block_dev.c linux-2.6.19.ken/fs/block_dev.c
--- linux-2.6.19/fs/block_dev.c 2006-11-29 13:57:37.0 -0800
+++ linux-2.6.19.ken/fs/block_dev.c 2006-12-12 16:47:58.0 -0800
@@ -1198,7 +1198,7 @@ const struct file_operations def_blk_fop
.read   = do_sync_read,
.write  = do_sync_write,
.aio_read   = generic_file_aio_read,
-   .aio_write  = generic_file_aio_write_nolock,
+   .aio_write  = __generic_file_aio_write,
.mmap   = generic_file_mmap,
.fsync  = block_fsync,
.unlocked_ioctl = block_ioctl,
diff -Nurp linux-2.6.19/fs/ocfs2/file.c linux-2.6.19.ken/fs/ocfs2/file.c
--- linux-2.6.19/fs/ocfs2/file.c2006-11-29 13:57:37.0 -0800
+++ linux-2.6.19.ken/fs/ocfs2/file.c2006-12-12 16:42:09.0 -0800
@@ -1107,7 +1107,7 @@ static ssize_t ocfs2_file_aio_write(stru
/* communicate with ocfs2_dio_end_io */
ocfs2_iocb_set_rw_locked(iocb);
 
-   ret = generic_file_aio_write_nolock(iocb, iov, nr_segs, iocb-ki_pos);
+   ret = __generic_file_aio_write(iocb, iov, nr_segs, iocb-ki_pos);
 
/* buffered aio wouldn't have proper lock coverage today */
BUG_ON(ret == -EIOCBQUEUED  !(filp-f_flags  O_DIRECT));
diff -Nurp linux-2.6.19/include/linux/fs.h linux-2.6.19.ken/include/linux/fs.h
--- linux-2.6.19/include/linux/fs.h 2006-11-29 13:57:37.0 -0800
+++ linux-2.6.19.ken/include/linux/fs.h 2006-12-12 16:41:58.0 -0800
@@ -1742,7 +1742,7 @@ extern int file_send_actor(read_descript
 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int 
isblk);
 extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, 
unsigned long, loff_t);
 extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, 
unsigned long, loff_t);
-extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct 
iovec *,
+extern ssize_t __generic_file_aio_write(struct kiocb *, const struct iovec *,
unsigned long, loff_t);
 extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *,
unsigned long *, loff_t, loff_t *, size_t, size_t);
diff -Nurp linux-2.6.19/mm/filemap.c linux-2.6.19.ken/mm/filemap.c
--- linux-2.6.19/mm/filemap.c   2006-11-29 13:57:37.0 -0800
+++ linux-2.6.19.ken/mm/filemap.c   2006-12-12 16:47:58.0 -0800
@@ -2219,9 +2219,9 @@ zero_length_segment:
 }
 EXPORT_SYMBOL(generic_file_buffered_write);
 
-static ssize_t
-__generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
-   unsigned long nr_segs, loff_t *ppos)
+ssize_t
+__generic_file_aio_write(struct kiocb *iocb, 

Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-11 Thread Andrew Morton
On Tue, 12 Dec 2006 12:22:14 +0300
Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:

> >> @@ -2041,6 +2041,14 @@ generic_file_direct_write(struct kiocb *
> >>mark_inode_dirty(inode);
> >>}
> >>*ppos = end;
> >> +  } else if (written < 0) {
> >> +  loff_t isize = i_size_read(inode);
> >> +  /*
> >> +   * generic_file_direct_IO() may have instantiated a few blocks
> >> +   * outside i_size.  Trim these off again.
> >> +   */
> >> +  if (pos + count > isize)
> >> +  vmtruncate(inode, isize);
> >>}
> >>  
> >
> > XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
> How could it be ?
> 
> from mm/filemap.c:2046 generic_file_direct_write() comment right after 
> place where i want to add vmtruncate()
> /*
>* Sync the fs metadata but not the minor inode changes and
>* of course not the data as we did direct DMA for the IO.
>* i_mutex is held, which protects generic_osync_inode() from
>* livelocking.
>*/
> 
> > And vmtruncate() expects i_mutex to be held.
> generic_file_direct_IO must called under i_mutex too
> from mm/filemap.c:2388
>   /*
>* Called under i_mutex for writes to S_ISREG files.   Returns -EIO if 
> something
>* went wrong during pagecache shootdown.
>*/
>   static ssize_t
>   generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,

yup, the comments are wrong.

> This means XFS generic_file_direct_write() call generic_file_direct_IO() 
> without
> i_mutex held too?

Think so.  XFS uses blockdev_direct_IO_own_locking().  We'd need to check
with the XFS guys regarding its precise operation and what needs to be done
here.

> >
> > I guess a suitable solution would be to push this problem back up to the
> > callers: let them decide whether to run vmtruncate() and if so, to ensure
> > that i_mutex is held.
> >
> > The existence of generic_file_aio_write_nolock() makes that rather messy
> > though.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-11 Thread Dmitriy Monakhov
Andrew Morton <[EMAIL PROTECTED]> writes:

> On Mon, 11 Dec 2006 16:34:27 +0300
> Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:
>
>> OpenVZ team has discovered error inside generic_file_direct_write()
>> If generic_file_direct_IO() has fail (ENOSPC condition) it may have 
>> instantiated
>> a few blocks outside i_size. And fsck will complain about wrong i_size
>> (ext2, ext3 and reiserfs interpret i_size and biggest block difference as 
>> error),
>> after fsck will fix error i_size will be increased to the biggest block,
>> but this blocks contain gurbage from previous write attempt, this is not 
>> information leak, but its silence file data corruption. 
>> We need truncate any block beyond i_size after write have failed , do in 
>> simular
>> generic_file_buffered_write() error path.
>> 
>> Exampe:
>> open("mnt2/FILE3", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
>> write(3, "aa"..., 4096) = -1 ENOSPC (No space left on device)
>> 
>> stat mnt2/FILE3
>> File: `mnt2/FILE3'
>> Size: 0   Blocks: 4  IO Block: 4096   regular empty file
>> >>^^ file size is less than biggest block idx
>> Device: 700h/1792d  Inode: 14  Links: 1
>> Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
>> 
>> fsck.ext2 -f -n  mnt1/fs_img
>> Pass 1: Checking inodes, blocks, and sizes
>> Inode 14, i_size is 0, should be 2048.  Fix? no
>> 
>> Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
>> --
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 7b84dc8..bf7cf6c 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2041,6 +2041,14 @@ generic_file_direct_write(struct kiocb *
>>  mark_inode_dirty(inode);
>>  }
>>  *ppos = end;
>> +} else if (written < 0) {
>> +loff_t isize = i_size_read(inode);
>> +/*
>> + * generic_file_direct_IO() may have instantiated a few blocks
>> + * outside i_size.  Trim these off again.
>> + */
>> +if (pos + count > isize)
>> +vmtruncate(inode, isize);
>>  }
>>  
>
> XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
How could it be ?

from mm/filemap.c:2046 generic_file_direct_write() comment right after 
place where i want to add vmtruncate()
/*
 * Sync the fs metadata but not the minor inode changes and
 * of course not the data as we did direct DMA for the IO.
 * i_mutex is held, which protects generic_osync_inode() from
 * livelocking.
 */

> And vmtruncate() expects i_mutex to be held.
generic_file_direct_IO must called under i_mutex too
from mm/filemap.c:2388
  /*
   * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if 
something
   * went wrong during pagecache shootdown.
   */
  static ssize_t
  generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,

This means XFS generic_file_direct_write() call generic_file_direct_IO() without
i_mutex held too?
>
> I guess a suitable solution would be to push this problem back up to the
> callers: let them decide whether to run vmtruncate() and if so, to ensure
> that i_mutex is held.
>
> The existence of generic_file_aio_write_nolock() makes that rather messy
> though.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-11 Thread Andrew Morton
On Mon, 11 Dec 2006 16:34:27 +0300
Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:

> OpenVZ team has discovered error inside generic_file_direct_write()
> If generic_file_direct_IO() has fail (ENOSPC condition) it may have 
> instantiated
> a few blocks outside i_size. And fsck will complain about wrong i_size
> (ext2, ext3 and reiserfs interpret i_size and biggest block difference as 
> error),
> after fsck will fix error i_size will be increased to the biggest block,
> but this blocks contain gurbage from previous write attempt, this is not 
> information leak, but its silence file data corruption. 
> We need truncate any block beyond i_size after write have failed , do in 
> simular
> generic_file_buffered_write() error path.
> 
> Exampe:
> open("mnt2/FILE3", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
> write(3, "aa"..., 4096) = -1 ENOSPC (No space left on device)
> 
> stat mnt2/FILE3
> File: `mnt2/FILE3'
> Size: 0   Blocks: 4  IO Block: 4096   regular empty file
> >>^^ file size is less than biggest block idx
> Device: 700h/1792d  Inode: 14  Links: 1
> Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
> 
> fsck.ext2 -f -n  mnt1/fs_img
> Pass 1: Checking inodes, blocks, and sizes
> Inode 14, i_size is 0, should be 2048.  Fix? no
> 
> Signed-off-by: Dmitriy Monakhov <[EMAIL PROTECTED]>
> --
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7b84dc8..bf7cf6c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2041,6 +2041,14 @@ generic_file_direct_write(struct kiocb *
>   mark_inode_dirty(inode);
>   }
>   *ppos = end;
> + } else if (written < 0) {
> + loff_t isize = i_size_read(inode);
> + /*
> +  * generic_file_direct_IO() may have instantiated a few blocks
> +  * outside i_size.  Trim these off again.
> +  */
> + if (pos + count > isize)
> + vmtruncate(inode, isize);
>   }
>  

XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
And vmtruncate() expects i_mutex to be held.

I guess a suitable solution would be to push this problem back up to the
callers: let them decide whether to run vmtruncate() and if so, to ensure
that i_mutex is held.

The existence of generic_file_aio_write_nolock() makes that rather messy
though.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-11 Thread Andrew Morton
On Mon, 11 Dec 2006 16:34:27 +0300
Dmitriy Monakhov [EMAIL PROTECTED] wrote:

 OpenVZ team has discovered error inside generic_file_direct_write()
 If generic_file_direct_IO() has fail (ENOSPC condition) it may have 
 instantiated
 a few blocks outside i_size. And fsck will complain about wrong i_size
 (ext2, ext3 and reiserfs interpret i_size and biggest block difference as 
 error),
 after fsck will fix error i_size will be increased to the biggest block,
 but this blocks contain gurbage from previous write attempt, this is not 
 information leak, but its silence file data corruption. 
 We need truncate any block beyond i_size after write have failed , do in 
 simular
 generic_file_buffered_write() error path.
 
 Exampe:
 open(mnt2/FILE3, O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
 write(3, aa..., 4096) = -1 ENOSPC (No space left on device)
 
 stat mnt2/FILE3
 File: `mnt2/FILE3'
 Size: 0   Blocks: 4  IO Block: 4096   regular empty file
 ^^ file size is less than biggest block idx
 Device: 700h/1792d  Inode: 14  Links: 1
 Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
 
 fsck.ext2 -f -n  mnt1/fs_img
 Pass 1: Checking inodes, blocks, and sizes
 Inode 14, i_size is 0, should be 2048.  Fix? no
 
 Signed-off-by: Dmitriy Monakhov [EMAIL PROTECTED]
 --

 diff --git a/mm/filemap.c b/mm/filemap.c
 index 7b84dc8..bf7cf6c 100644
 --- a/mm/filemap.c
 +++ b/mm/filemap.c
 @@ -2041,6 +2041,14 @@ generic_file_direct_write(struct kiocb *
   mark_inode_dirty(inode);
   }
   *ppos = end;
 + } else if (written  0) {
 + loff_t isize = i_size_read(inode);
 + /*
 +  * generic_file_direct_IO() may have instantiated a few blocks
 +  * outside i_size.  Trim these off again.
 +  */
 + if (pos + count  isize)
 + vmtruncate(inode, isize);
   }
  

XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
And vmtruncate() expects i_mutex to be held.

I guess a suitable solution would be to push this problem back up to the
callers: let them decide whether to run vmtruncate() and if so, to ensure
that i_mutex is held.

The existence of generic_file_aio_write_nolock() makes that rather messy
though.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-11 Thread Dmitriy Monakhov
Andrew Morton [EMAIL PROTECTED] writes:

 On Mon, 11 Dec 2006 16:34:27 +0300
 Dmitriy Monakhov [EMAIL PROTECTED] wrote:

 OpenVZ team has discovered error inside generic_file_direct_write()
 If generic_file_direct_IO() has fail (ENOSPC condition) it may have 
 instantiated
 a few blocks outside i_size. And fsck will complain about wrong i_size
 (ext2, ext3 and reiserfs interpret i_size and biggest block difference as 
 error),
 after fsck will fix error i_size will be increased to the biggest block,
 but this blocks contain gurbage from previous write attempt, this is not 
 information leak, but its silence file data corruption. 
 We need truncate any block beyond i_size after write have failed , do in 
 simular
 generic_file_buffered_write() error path.
 
 Exampe:
 open(mnt2/FILE3, O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3
 write(3, aa..., 4096) = -1 ENOSPC (No space left on device)
 
 stat mnt2/FILE3
 File: `mnt2/FILE3'
 Size: 0   Blocks: 4  IO Block: 4096   regular empty file
 ^^ file size is less than biggest block idx
 Device: 700h/1792d  Inode: 14  Links: 1
 Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
 
 fsck.ext2 -f -n  mnt1/fs_img
 Pass 1: Checking inodes, blocks, and sizes
 Inode 14, i_size is 0, should be 2048.  Fix? no
 
 Signed-off-by: Dmitriy Monakhov [EMAIL PROTECTED]
 --

 diff --git a/mm/filemap.c b/mm/filemap.c
 index 7b84dc8..bf7cf6c 100644
 --- a/mm/filemap.c
 +++ b/mm/filemap.c
 @@ -2041,6 +2041,14 @@ generic_file_direct_write(struct kiocb *
  mark_inode_dirty(inode);
  }
  *ppos = end;
 +} else if (written  0) {
 +loff_t isize = i_size_read(inode);
 +/*
 + * generic_file_direct_IO() may have instantiated a few blocks
 + * outside i_size.  Trim these off again.
 + */
 +if (pos + count  isize)
 +vmtruncate(inode, isize);
  }
  

 XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
How could it be ?

from mm/filemap.c:2046 generic_file_direct_write() comment right after 
place where i want to add vmtruncate()
/*
 * Sync the fs metadata but not the minor inode changes and
 * of course not the data as we did direct DMA for the IO.
 * i_mutex is held, which protects generic_osync_inode() from
 * livelocking.
 */

 And vmtruncate() expects i_mutex to be held.
generic_file_direct_IO must called under i_mutex too
from mm/filemap.c:2388
  /*
   * Called under i_mutex for writes to S_ISREG files.   Returns -EIO if 
something
   * went wrong during pagecache shootdown.
   */
  static ssize_t
  generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,

This means XFS generic_file_direct_write() call generic_file_direct_IO() without
i_mutex held too?

 I guess a suitable solution would be to push this problem back up to the
 callers: let them decide whether to run vmtruncate() and if so, to ensure
 that i_mutex is held.

 The existence of generic_file_aio_write_nolock() makes that rather messy
 though.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] incorrect error handling inside generic_file_direct_write

2006-12-11 Thread Andrew Morton
On Tue, 12 Dec 2006 12:22:14 +0300
Dmitriy Monakhov [EMAIL PROTECTED] wrote:

  @@ -2041,6 +2041,14 @@ generic_file_direct_write(struct kiocb *
 mark_inode_dirty(inode);
 }
 *ppos = end;
  +  } else if (written  0) {
  +  loff_t isize = i_size_read(inode);
  +  /*
  +   * generic_file_direct_IO() may have instantiated a few blocks
  +   * outside i_size.  Trim these off again.
  +   */
  +  if (pos + count  isize)
  +  vmtruncate(inode, isize);
 }
   
 
  XFS (at least) can call generic_file_direct_write() with i_mutex not held. 
 How could it be ?
 
 from mm/filemap.c:2046 generic_file_direct_write() comment right after 
 place where i want to add vmtruncate()
 /*
* Sync the fs metadata but not the minor inode changes and
* of course not the data as we did direct DMA for the IO.
* i_mutex is held, which protects generic_osync_inode() from
* livelocking.
*/
 
  And vmtruncate() expects i_mutex to be held.
 generic_file_direct_IO must called under i_mutex too
 from mm/filemap.c:2388
   /*
* Called under i_mutex for writes to S_ISREG files.   Returns -EIO if 
 something
* went wrong during pagecache shootdown.
*/
   static ssize_t
   generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,

yup, the comments are wrong.

 This means XFS generic_file_direct_write() call generic_file_direct_IO() 
 without
 i_mutex held too?

Think so.  XFS uses blockdev_direct_IO_own_locking().  We'd need to check
with the XFS guys regarding its precise operation and what needs to be done
here.

 
  I guess a suitable solution would be to push this problem back up to the
  callers: let them decide whether to run vmtruncate() and if so, to ensure
  that i_mutex is held.
 
  The existence of generic_file_aio_write_nolock() makes that rather messy
  though.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/