subject:"\[Cluster\-devel\] \[PATCH v1 0\/4\] fs\: fix race between llseek SEEK

Re: [Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write

2018-11-25 Thread Eiichi Tsukata

2018年11月22日(木) 16:06 Al Viro :
>
> Can you show me where does POSIX/SuS/whatever it's called these days promise
> that kind of atomicity?

No. I couldn't found it.
That's why I previously posted RFC Patch:
https://marc.info/?t=15423727791&r=1&w=2
I wasn't sure this is a bug in the kernel or not.

> that kind of atomicity?  TBH, I would be very surprised if any Unix promised
> to have file size updated no more than once per write(2).  Any userland code
> that relies on such property is, as minimum, non-portable and I would argue
> that it is outright broken.

Thanks. Now It's clear. It is not a bug in the kernel, but in
userspace if `tail` assumes such
atomicity.

> Note, BTW, that your example depends upon rather non-obvious details of echo
> implementation, starting with the bufferization logics used by particular
> shell.  And AFAICS ash(1) ends up with possibility of more than one write(2)

I've never imagined such a difference in echo implementation, thanks.

Re: [Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write

2018-11-21 Thread Al Viro

On Thu, Nov 22, 2018 at 02:40:50PM +0900, Eiichi Tsukata wrote:
> 2018年11月21日(水) 13:54 Al Viro :
> >
> > On Wed, Nov 21, 2018 at 11:43:56AM +0900, Eiichi Tsukata wrote:
> > > Some file systems (including ext4, xfs, ramfs ...) have the following
> > > problem as I've described in the commit message of the 1/4 patch.
> > >
> > >   The commit ef3d0fd27e90 ("vfs: do (nearly) lockless 
> > > generic_file_llseek")
> > >   removed almost all locks in llseek() including SEEK_END. It based on the
> > >   idea that write() updates size atomically. But in fact, write() can be
> > >   divided into two or more parts in generic_perform_write() when pos
> > >   straddles over the PAGE_SIZE, which results in updating size multiple
> > >   times in one write(). It means that llseek() can see the size being
> > >   updated during write().
> >
> > And?  Who has ever promised anything that insane?  write(2) can take an 
> > arbitrary
> > amount of time; another process doing lseek() on independently opened 
> > descriptor
> > is *not* going to wait for that (e.g. page-in of the buffer being written, 
> > which
> > just happens to be mmapped from a file on NFS over RFC1149 link).
> 
> Thanks.
> 
> The lock I added in NFS was nothing but slow down lseek() because a file size 
> is
> updated atomically. Even `spin_lock(&inode->i_lock)` is unnecessary.
> 
> I'll fix the commit message which only refers to specific local file
> systems that use
> generic_perform_write() and remove unnecessary locks in some
> distributed file systems
> (e.g. nfs, cifs, or more) by replacing generic_file_llseek() with
> generic_file_llseek_unlocked()
> so that `tail` don't have to wait for avian carriers.

fd = open("/mnt/slw/foo", O_RDONLY);
p = mmap(NULL, 8192, PROT_READ, MAP_SHARED, fd, 0);
local_fd = open("/tmp/foo", O_RDWR);
write(local_fd, p, 8192);

and there you go - extremely slow write(2).  To file on a local filesystem.

Can you show me where does POSIX/SuS/whatever it's called these days promise
that kind of atomicity?  TBH, I would be very surprised if any Unix promised
to have file size updated no more than once per write(2).  Any userland code
that relies on such property is, as minimum, non-portable and I would argue
that it is outright broken.

Note, BTW, that your example depends upon rather non-obvious details of echo
implementation, starting with the bufferization logics used by particular
shell.  And AFAICS ash(1) ends up with possibility of more than one write(2)
anyway - get SIGSTOP delivered, followed by SIGCONT and you've got just that.
Transparently for echo(1).  I'm fairly sure that the same holds for anything
that isn't outright broken - write(2) *IS* interruptible (must be, for obvious
reasons) and that pair of signals will lead to short write if it comes at the
right time.  The only thing userland can do (and does) is to issue further
write(2) on anything that hadn't been written the last time around.

Re: [Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write

2018-11-21 Thread Eiichi Tsukata

2018年11月21日(水) 13:54 Al Viro :
>
> On Wed, Nov 21, 2018 at 11:43:56AM +0900, Eiichi Tsukata wrote:
> > Some file systems (including ext4, xfs, ramfs ...) have the following
> > problem as I've described in the commit message of the 1/4 patch.
> >
> >   The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
> >   removed almost all locks in llseek() including SEEK_END. It based on the
> >   idea that write() updates size atomically. But in fact, write() can be
> >   divided into two or more parts in generic_perform_write() when pos
> >   straddles over the PAGE_SIZE, which results in updating size multiple
> >   times in one write(). It means that llseek() can see the size being
> >   updated during write().
>
> And?  Who has ever promised anything that insane?  write(2) can take an 
> arbitrary
> amount of time; another process doing lseek() on independently opened 
> descriptor
> is *not* going to wait for that (e.g. page-in of the buffer being written, 
> which
> just happens to be mmapped from a file on NFS over RFC1149 link).

Thanks.

The lock I added in NFS was nothing but slow down lseek() because a file size is
updated atomically. Even `spin_lock(&inode->i_lock)` is unnecessary.

I'll fix the commit message which only refers to specific local file
systems that use
generic_perform_write() and remove unnecessary locks in some
distributed file systems
(e.g. nfs, cifs, or more) by replacing generic_file_llseek() with
generic_file_llseek_unlocked()
so that `tail` don't have to wait for avian carriers.

[Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write

2018-11-21 Thread Eiichi Tsukata

Some file systems (including ext4, xfs, ramfs ...) have the following
problem as I've described in the commit message of the 1/4 patch.

  The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
  removed almost all locks in llseek() including SEEK_END. It based on the
  idea that write() updates size atomically. But in fact, write() can be
  divided into two or more parts in generic_perform_write() when pos
  straddles over the PAGE_SIZE, which results in updating size multiple
  times in one write(). It means that llseek() can see the size being
  updated during write().

  This race changes behavior of some applications. 'tail' is one of those
  applications. It reads range [pos, pos_end] where pos_end is obtained
  via llseek() SEEK_END. Sometimes, a read line could be broken.

  reproducer:

$ while true; do echo 123456 >> out; done
$ while true; do tail out | grep -v 123456 ; done

  example output(take 30 secs):

12345
1
1234
1
12
1234

Note: Some file systems which indivisually implements llseek() and hold
  inode mutex lock in it are not affected. ex:) btrfs, ocfs2

Patch 1: re-implements inode lock for SEEK_END in generic_file_llseek()
Patch 2 to 4: implement inode lock for SEEK_END in each file systems

Eiichi Tsukata (4):
  vfs: fix race between llseek SEEK_END and write
  ext4: fix race between llseek SEEK_END and write
  f2fs: fix race between llseek SEEK_END and write
  overlayfs: fix race between llseek SEEK_END and write

 fs/btrfs/file.c |  2 +-
 fs/ext4/file.c  | 10 ++
 fs/f2fs/file.c  |  6 +-
 fs/fuse/file.c  |  5 +++--
 fs/gfs2/file.c  |  3 ++-
 fs/overlayfs/file.c | 23 ---
 fs/read_write.c | 37 ++---
 include/linux/fs.h  |  2 ++
 8 files changed, 73 insertions(+), 15 deletions(-)

-- 
2.19.1

Re: [Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write

2018-11-20 Thread Al Viro

On Wed, Nov 21, 2018 at 11:43:56AM +0900, Eiichi Tsukata wrote:
> Some file systems (including ext4, xfs, ramfs ...) have the following
> problem as I've described in the commit message of the 1/4 patch.
> 
>   The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
>   removed almost all locks in llseek() including SEEK_END. It based on the
>   idea that write() updates size atomically. But in fact, write() can be
>   divided into two or more parts in generic_perform_write() when pos
>   straddles over the PAGE_SIZE, which results in updating size multiple
>   times in one write(). It means that llseek() can see the size being
>   updated during write().

And?  Who has ever promised anything that insane?  write(2) can take an 
arbitrary
amount of time; another process doing lseek() on independently opened descriptor
is *not* going to wait for that (e.g. page-in of the buffer being written, which
just happens to be mmapped from a file on NFS over RFC1149 link).

Re: [Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write

Re: [Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write

Re: [Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write

[Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write

Re: [Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write

5 matches

Site Navigation

Mail list logo

Footer information