Re: Advice sought on how to lock multiple pages in ->prepare_write and ->writepage

2005-01-28 Thread Anton Altaparmakov
Hi Nathan,

On Fri, 28 Jan 2005, Nathan Scott wrote:
> On Thu, Jan 27, 2005 at 04:58:22PM -0800, Andrew Morton wrote:
> > Anton Altaparmakov <[EMAIL PROTECTED]> wrote:
> > >
> > > What would you propose can I do to perform the required zeroing in a
> > > deadlock safe manner whilst also ensuring that it cannot happen that a
> > > concurrent ->readpage() will cause me to miss a page and thus end up
> > > with non-initialized/random data on disk for that page?
> > 
> > The only thing I can think of is to lock all the pages.  There's no other
> > place in the kernel above you which locks multiple pagecache pages, but we
> > can certainly adopt the convention that multiple-page-locking must proceed
> > in ascending file offset order.
> > 
> > ...
> > 
> > Not very pretty though.
> 
> Just putting up my hand to say "yeah, us too" - we could also make
> use of that functionality, so we can grok existing XFS filesystems
> that have blocksizes larger than the page size.
> 
> Were you looking at using an i_blkbits value larger than pageshift,
> Anton?  There's many places where generic code does 1 << i_blkbits
> that'll need careful auditing, I think.

No, definitely not.  The inode and superblock block sizes are always set to 
512 bytes for NTFS (in the present code).  The NTFS driver does the 
translation from "page->index + offset" to "byte offset" to "logical block 
+ offset in the block" everywhere.  This allows both reading and writing 
to occur at 512 byte granularity (the smallest logical block size allowed 
on NTFS).  So if the user is reading only 1 byte, I need only read in 
that page, and if the user is writing only 1 byte into a 
non-sparse/preallocated file, I need only write out 512 bytes 
(prepare/commit write) or a page (writepage).

But the real reason I have to stick with 512 bytes is that NTFS stores 
all its metadata (including the on-disk inodes, the directory indices, 
the boot record, the volume label and flags, etc.) inside regular files.  
The contents of the metadata are organised like flat database files 
containing fixed length records (the record size differs depending on the 
type of metadata), which in turn contain variable length records.  Each 
fixed length record is protected at a 512 byte granularity with fixups 
which ensure that partial writes due to power failure or disk failure are 
detected.  These fixups need to be applied every time metadata records 
are written, and while the fixups are applied the metadata appears 
corrupt.  So we do: apply fixups, write, remove fixups.  This means we 
have to lock each metadata record for exclusive access while this is 
happening, otherwise someone could look at the record and see it in a 
corrupt state...  And since two adjacent records are usually not related 
to each other at all, we cannot simply lock multiple records at once 
(given the page is already locked) without deadlocking immediately, and 
hence we cannot write out more than one record at a time.  Which in turn 
means we need to maintain 512 byte granularity at the write level.  
Anyway, this is slightly off topic.  I hope the above makes some sense...
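
[For illustration only, a much simplified sketch of what applying and
removing such a fixup looks like.  This is not the actual fs/ntfs/mst.c
code; the structure layout is abbreviated and endianness conversion of the
on-disk fields is omitted.]

#include <linux/types.h>

#define NTFS_BLOCK_SIZE	512

/* Abbreviated header found at the start of every fixup protected record. */
struct mst_record {
	u8  magic[4];
	u16 usa_ofs;	/* byte offset of the update sequence array */
	u16 usa_count;	/* 1 (the USN itself) + number of 512 byte blocks */
};

/*
 * Before writing: save the last two bytes of each 512 byte block into the
 * update sequence array and overwrite them with the update sequence number.
 * Until mst_post_write_fixup() undoes this, the record looks corrupt, which
 * is why the record must be held exclusively while the write is in flight.
 */
static void mst_pre_write_fixup(struct mst_record *r)
{
	u16 *usa = (u16 *)((u8 *)r + r->usa_ofs);
	u16 usn = ++usa[0];	/* new update sequence number for this write */
	int i;

	for (i = 1; i < r->usa_count; i++) {
		u16 *last = (u16 *)((u8 *)r + i * NTFS_BLOCK_SIZE - 2);

		usa[i] = *last;	/* remember the real data */
		*last = usn;	/* stamp the USN in its place */
	}
}

/* After the write has completed: restore the saved bytes. */
static void mst_post_write_fixup(struct mst_record *r)
{
	u16 *usa = (u16 *)((u8 *)r + r->usa_ofs);
	int i;

	for (i = 1; i < r->usa_count; i++)
		*(u16 *)((u8 *)r + i * NTFS_BLOCK_SIZE - 2) = usa[i];
}

[A reader of an on-disk record does the reverse check: if the last two
bytes of any block do not match usa[0], a partial (torn) write has been
detected.]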

> A lock-in-page-index-order approach seems the simplest way to prevent 
> page deadlocks as Andrew suggested, and always starting to lock pages at 
> the lowest block-aligned file offset (rather than taking a page lock, 
> dropping it, locking earlier pages, reacquiring the later one, etc) - 
> that can't really be done inside the filesystems though.

Indeed.  But I don't think we would want to write out whole logical blocks 
at once unless they actually were all dirty data.  Why write out loads when 
you can get away with as little as 512 bytes or PAGE_CACHE_SIZE in the 
majority of cases?  In NTFS the logical block size can (in theory) go into 
the megabytes, and if we were forced to do such large writes every time, 
performance would degrade really badly for most usage scenarios.

> So it's probably something that should be handled in generic page
> cache code such that the locking is done in common places for all
> filesystems using large i_blkbits values, and such that locking is
> done before the filesystem-specific read/writepage(s) routines are
> called, so that they don't have to be changed to do page locking
> any differently.

Yes, this would have advantages.  In fact this may be the point where we 
would actually make PAGE_CACHE_SIZE dynamic (but presumably only in 
multiples of PAGE_SIZE) but that is a rather big project in its own right.

For NTFS, I would prefer to not use a large i_blkbits as explained above.

However, I do not see why the VFS cannot provide one or more 
library/helper functions to do "lock multiple pages with start_index, 
end_index, already_locked_page, writeback_control/NULL, etc".  NTFS would 
then just call this from prepare_write and writepage/writepages.  And the 
compression engine (once we implement write to compressed files) may well 
need to use this, too.  And if XFS would find it useful then having a VFS 
based helper function would be
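
[For concreteness, such a helper might look roughly like the sketch below.
The name lock_page_range() and its calling convention are invented for
illustration; error handling is minimal and the writeback_control side is
left out entirely.]

#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * Sketch only: lock all pagecache pages in [start, end] in ascending index
 * order, as per the convention discussed above.  @locked_page, if not NULL,
 * is a page the caller already holds locked; its lock is dropped first and
 * retaken in order, so the caller must revalidate that page afterwards.
 * On success the locked pages (each with an extra reference) are returned
 * in @pages and the number of pages is returned.
 */
static int lock_page_range(struct address_space *mapping, pgoff_t start,
		pgoff_t end, struct page *locked_page, struct page **pages)
{
	pgoff_t index;
	int nr = 0;

	if (locked_page)
		unlock_page(locked_page);
	for (index = start; index <= end; index++) {
		/* find_or_create_page() returns the page locked. */
		struct page *page = find_or_create_page(mapping, index,
				mapping_gfp_mask(mapping));
		if (!page)
			goto err;
		pages[nr++] = page;
	}
	return nr;
err:
	while (nr--) {
		unlock_page(pages[nr]);
		page_cache_release(pages[nr]);
	}
	return -ENOMEM;
}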

Re: Advice sought on how to lock multiple pages in ->prepare_write and ->writepage

2005-01-28 Thread Anton Altaparmakov
Hi Andrew,

Thanks a lot for your help!  Some comments below...

On Thu, 27 Jan 2005, Andrew Morton wrote:
> Anton Altaparmakov <[EMAIL PROTECTED]> wrote:
> >
> > What would you propose can I do to perform the required zeroing in a
> > deadlock safe manner whilst also ensuring that it cannot happen that a
> > concurrent ->readpage() will cause me to miss a page and thus end up
> > with non-initialized/random data on disk for that page?
> 
> The only thing I can think of is to lock all the pages.

Yes, that is what I was thinking, too.

> There's no other place in the kernel above you which locks multiple 
> pagecache pages, but we can certainly adopt the convention that 
> multiple-page-locking must proceed in ascending file offset order.

That sounds sensible.

> Which means that you'll need to drop and reacquire the page lock in
> ->prepare_write and in ->writepage, which could get unpleasant.

Yes, this is the bit I was worried about...

> For ->prepare_write it should be OK: the caller has a ref on the inode and
> you can check ->mapping after locking the page to see if a truncate flew
> past (OK, you have i_sem, but writepage will need to do this check).

Ok.  One down.

> For writepage() or writepages() with for_reclaim=0 you're OK: the caller
> has a ref on the inode and has taken sb->s_umount, so you can safely drop
> and retake the page lock.

Ok.

> For ->writepage with for_reclaim=1 the problem is that the inode can
> disappear on you altogether: you have no inode ref and if you let go of
> that page lock, truncate+reclaim or truncate+umount can zap the inode.
> 
> So hrm.  I guess your ->writepage(for_reclaim=1) could do a trylock on
> s_umount and fail the writepage if that didn't work out.

Ok, this should cause no problems.  I already have other places inside 
writepage (for metadata writes) which do a trylock on an ntfs internal inode 
lock (it was necessary due to lock reversal: the page lock is already held 
in writepage but usually the ntfs lock is taken first and then pages are 
locked - now we just redirty the page in writepage if the trylock fails).
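
[The pattern is roughly the one sketched below.  example_ntfs_lock() is a
hypothetical stand-in for whichever ntfs internal lock protects the
metadata in question, and redirty_page_for_writepage() is assumed to be
available (otherwise __set_page_dirty_nobuffers() achieves much the same);
the sketch also glosses over the PageWriteback handling of the real
write-out path.]

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/rwsem.h>
#include <linux/writeback.h>

/* Hypothetical: returns the ntfs internal lock relevant to this inode. */
extern struct rw_semaphore *example_ntfs_lock(struct inode *vi);

static int example_metadata_writepage(struct page *page,
		struct writeback_control *wbc)
{
	struct inode *vi = page->mapping->host;
	struct rw_semaphore *lock = example_ntfs_lock(vi);

	BUG_ON(!PageLocked(page));
	/*
	 * Normal lock ordering is "ntfs lock first, then page lock", but
	 * ->writepage() hands us the page already locked, so only try to
	 * take the ntfs lock and back off if that fails.
	 */
	if (!down_write_trylock(lock)) {
		/* Leave the page dirty so it is written out later. */
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}
	/* ... apply fixups, write the record(s), remove fixups ... */
	up_write(lock);
	unlock_page(page);
	return 0;
}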

> That leaves the problem of preventing truncate+reclaim from zapping the
> inode when you've dropped the page lock.  I don't think you'll want to take
> a ref on the inode because the subsequent iput() can cause a storm of
> activity and I have vague memories that iput() inside
> ->writepage(for_reclaim=1) is a bad deal.  Maybe do a trylock on i_sem and
> fail the writepage if that doesn't work out?

The trylock is once again no problem.  But holding i_sem without a 
reference to the inode seems dodgy.  Will it really be sufficient?  I 
mean, yes it will exclude truncate but couldn't a reclaim kick in after 
the truncate has released i_sem and we have taken it?  Or am I missing 
something here?

In any case, I assume that once I relock the page I will need to check 
that it is still inside i_size and also that it is in fact still dirty.
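
[In other words, something along the lines of this sketch after retaking
the page lock - illustration only.]

#include <linux/fs.h>
#include <linux/pagemap.h>

/* Sketch: checks to redo after dropping and retaking the page lock. */
static int page_still_wants_writeout(struct page *page, struct inode *vi)
{
	/* A truncate may have removed the page from the mapping. */
	if (page->mapping != vi->i_mapping)
		return 0;
	/* The page may now lie entirely beyond i_size. */
	if ((loff_t)page->index << PAGE_CACHE_SHIFT >= i_size_read(vi))
		return 0;
	/* Someone else may have cleaned the page in the meantime. */
	if (!PageDirty(page))
		return 0;
	return 1;
}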

> That way, once you have i_sem and s_umount you can unlock the target page
> then populate+lock all the pages in the 64k segment.
> 
> Not very pretty though.

No, not pretty at all.  Especially as I will have to drop all sorts of 
NTFS locks (as my code is at present) which will cause me to have to redo 
a lot of work after reacquiring those locks.  I guess I will have to put 
"if (cluster_size > PAGE_CACHE_SIZE) detect hole" logic in really early on 
and do the page cache magic at that stage, and only then start taking NTFS 
locks and doing the work.

But it doesn't matter that it is not pretty.  It is not the usual scenario 
after all.  Normally a file gets created/opened and writing begins, so the 
problem does not really occur as we would always be starting with the 
first page in a cluster and hence would not need to drop its lock at 
all.  Also, the problem only occurs if cluster size > PAGE_CACHE_SIZE, 
which will be false in the majority of cases.  (In NTFS, using a cluster 
size > 4kiB causes all sorts of negative side effects, such as compression 
and possibly encryption being disabled, and Windows will in fact never use 
a cluster size > 4kiB by default, but some people override the default due 
to special usage scenarios or really large volumes: with 4kiB clusters and 
the current artificial limit of 2^32-1 clusters imposed by Windows, the 
maximum volume size is 16TiB.)

Best regards,

Anton
-- 
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


Re: Advice sought on how to lock multiple pages in ->prepare_write and ->writepage

2005-01-27 Thread Nathan Scott
Hi Anton,

On Thu, Jan 27, 2005 at 04:58:22PM -0800, Andrew Morton wrote:
> Anton Altaparmakov <[EMAIL PROTECTED]> wrote:
> >
> > What would you propose can I do to perform the required zeroing in a
> > deadlock safe manner whilst also ensuring that it cannot happen that a
> > concurrent ->readpage() will cause me to miss a page and thus end up
> > with non-initialized/random data on disk for that page?
> 
> The only thing I can think of is to lock all the pages.  There's no other
> place in the kernel above you which locks multiple pagecache pages, but we
> can certainly adopt the convention that multiple-page-locking must proceed
> in ascending file offset order.
> 
> ...
> 
> Not very pretty though.
> 

Just putting up my hand to say "yeah, us too" - we could also make
use of that functionality, so we can grok existing XFS filesystems
that have blocksizes larger than the page size.

Were you looking at using an i_blkbits value larger than pageshift,
Anton?  There's many places where generic code does 1 << i_blkbits
that'll need careful auditing, I think.  A lock-in-page-index-order
approach seems the simplest way to prevent page deadlocks as Andrew
suggested, and always starting to lock pages at the lowest block-aligned
file offset (rather than taking a page lock, dropping it, locking earlier
pages, reacquiring the later one, etc) - that can't really be done inside
the filesystems though.

So it's probably something that should be handled in generic page
cache code such that the locking is done in common places for all
filesystems using large i_blkbits values, and such that locking is
done before the filesystem-specific read/writepage(s) routines are
called, so that they don't have to be changed to do page locking
any differently.

cheers.

-- 
Nathan


Re: Advice sought on how to lock multiple pages in ->prepare_write and ->writepage

2005-01-27 Thread Andrew Morton
Anton Altaparmakov <[EMAIL PROTECTED]> wrote:
>
> What would you propose can I do to perform the required zeroing in a
> deadlock safe manner whilst also ensuring that it cannot happen that a
> concurrent ->readpage() will cause me to miss a page and thus end up
> with non-initialized/random data on disk for that page?

The only thing I can think of is to lock all the pages.  There's no other
place in the kernel above you which locks multiple pagecache pages, but we
can certainly adopt the convention that multiple-page-locking must proceed
in ascending file offset order.

Which means that you'll need to drop and reacquire the page lock in
->prepare_write and in ->writepage, which could get unpleasant.

For ->prepare_write it should be OK: the caller has a ref on the inode and
you can check ->mapping after locking the page to see if a truncate flew
past (OK, you have i_sem, but writepage will need to do this check).

For writepage() or writepages() with for_reclaim=0 you're OK: the caller
has a ref on the inode and has taken sb->s_umount, so you can safely drop
and retake the page lock.

For ->writepage with for_reclaim=1 the problem is that the inode can
disappear on you altogether: you have no inode ref and if you let go of
that page lock, truncate+reclaim or truncate+umount can zap the inode.

So hrm.  I guess your ->writepage(for_reclaim=1) could do a trylock on
s_umount and fail the writepage if that didn't work out.

That leaves the problem of preventing truncate+reclaim from zapping the
inode when you've dropped the page lock.  I don't think you'll want to take
a ref on the inode because the subsequent iput() can cause a storm of
activity and I have vague memories that iput() inside
->writepage(for_reclaim=1) is a bad deal.  Maybe do a trylock on i_sem and
fail the writepage if that doesn't work out?

That way, once you have i_sem and s_umount you can unlock the target page
then populate+lock all the pages in the 64k segment.

Not very pretty though.
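
[Putting the above together, the ->writepage(for_reclaim=1) path might look
roughly like the sketch below.  It is only an illustration of the locking
order being suggested: i_sem is a plain semaphore and s_umount an rwsem in
current 2.6 kernels, hence down_trylock() versus down_read_trylock(); the
actual page range locking, revalidation, zeroing and write out are all
elided, and redirty_page_for_writepage() is assumed to be available.]

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

static int example_writepage_for_reclaim(struct page *page,
		struct writeback_control *wbc)
{
	struct inode *vi = page->mapping->host;
	struct super_block *sb = vi->i_sb;

	/* Pin the superblock so umount cannot zap the inode under us. */
	if (!down_read_trylock(&sb->s_umount))
		goto redirty;
	/* Exclude truncate.  down_trylock() returns non-zero on failure. */
	if (down_trylock(&vi->i_sem)) {
		up_read(&sb->s_umount);
		goto redirty;
	}
	/*
	 * Now it is safe to drop the page lock and (re)lock every page of
	 * the 64k cluster in ascending index order, revalidate the target
	 * page, zero the uninitialised parts and write everything out.
	 * All page locks are dropped again by the elided code.
	 */
	/* ... */
	up(&vi->i_sem);
	up_read(&sb->s_umount);
	return 0;

redirty:
	/* Could not get the locks: keep the page dirty and try again later. */
	redirty_page_for_writepage(wbc, page);
	unlock_page(page);
	return 0;
}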


Advice sought on how to lock multiple pages in ->prepare_write and ->writepage

2005-01-27 Thread Anton Altaparmakov
Hi,

I am looking for advice on how to lock multiple pages in ->prepare_write
and ->writepage.  Here is an example scenario where I need to do this:

We have a mounted NTFS volume with a cluster, i.e. logical block, size
of 64kiB on a system with a PAGE_CACHE_SIZE of 4kiB.  This means we can
allocate space in a file in multiples of 64kiB (aligned to 64kiB
boundaries).  Now take a sparse file where the user is attempting to write
into a hole, and let's say the write is in the middle of a sparse cluster
(logical block).

This means that the NTFS driver will receive a write request either via
->prepare_write or ->writepage for a page which lies in the middle of
the cluster (logical block).

The NTFS driver now allocates a cluster on disk to fill the hole.

Now the driver needs to write zeroes between the start of the newly
allocated cluster and the beginning of the write request as well as
between the end of the write request and the end of the cluster.

In ->prepare_write we are holding i_sem on the file's inode as well as
the page lock on the page containing the write request.

In ->writepage we are only holding the page lock on the page containing
the write request.

We also need to keep in mind that there may be other already dirty pages
in the affected region that have not hit ->writepage yet, or in the most
complicated case that are hitting ->writepage simultaneously on a
different cpu or due to preemption.  Such pages, being already uptodate,
must not be zeroed.

A further complication is that any of those pages might be currently
under a ->readpage() and hence locked but they would never be marked
dirty for writeback unless we do it now.

I also need to ensure that any pages I zero also make it to disk
(presumably simply marking these pages dirty would be the Right Way(TM)
to do this).
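
[For what it is worth, zeroing a byte range of a locked pagecache page and
making sure the zeroes reach disk would be along these lines - sketch only,
with @from and @to being byte offsets within the page.]

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/string.h>

/* Sketch: zero [from, to) within a locked page and mark it dirty so that
 * the zeroes are written back to disk later. */
static void zero_partial_page(struct page *page, unsigned int from,
		unsigned int to)
{
	void *kaddr;

	BUG_ON(!PageLocked(page));
	kaddr = kmap_atomic(page, KM_USER0);
	memset(kaddr + from, 0, to - from);
	flush_dcache_page(page);
	kunmap_atomic(kaddr, KM_USER0);
	set_page_dirty(page);
}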

What would you propose can I do to perform the required zeroing in a
deadlock safe manner whilst also ensuring that it cannot happen that a
concurrent ->readpage() will cause me to miss a page and thus end up
with non-initialized/random data on disk for that page?

Thanks a lot in advance for any advice/suggestions you can give me...

Best regards,

Anton
-- 
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/
