Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages.

2005-02-16 Thread Anton Altaparmakov
Hi Andrew,

On Sun, 6 Feb 2005, Andrew Morton wrote:
> Anton Altaparmakov <[EMAIL PROTECTED]> wrote:
> > On Thu, 3 Feb 2005, Andrew Morton wrote:
> > > I did a patch which switched loop to use the file_operations.read/write
> > > about a year ago.  Forget what happened to it.  It always seemed the right
> > > thing to do..
> > 
> > How did you implement the write?
> 
> It was Urban, actually.  Memory fails me.
> 
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=102133217310200&w=2
> 
> It won't work properly for crypto transformations - iirc we decided it
> would need to perform a copy.

That is the conclusion I came to as well.  I whipped up a patch today that 
implements fallback to file_operations->write in the case that 
aops->{prepare,commit}_write are not present on the backing filesystem.

The fallback happens in two different ways:

- For normal loop devices, i.e. ones which do not transform the data but 
simply pass it along, we just call fops->write.  This should be pretty much 
as fast as using aops->{prepare,commit}_write directly.

- For all other loop devices (e.g. xor and cryptoloop), i.e. all the ones 
which may be doing transformations on the data, we allocate and map a page 
(once for each bio), then for each bio vec we copy the bio vec page data 
to our mapped page, apply the loop transformation, and use fops->write to 
write out the transformed data from our page.  Once all bio vecs from the 
bio are done, we unmap and free the page.
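
In outline, the transforming path does something like the following.  This 
is only an illustrative sketch, not the patch code below: the helper name 
and error handling are made up, and the set_fs(get_ds()) dance is just the 
usual 2.6-era idiom for calling fops->write with a kernel buffer.

static int lo_send_transformed(struct loop_device *lo, struct bio *bio,
		loff_t pos)
{
	struct file *file = lo->lo_backing_file;
	struct bio_vec *bvec;
	struct page *page;
	char *kaddr;
	int i, ret = 0;

	/* One bounce page per bio, mapped once and reused for each bio vec. */
	page = alloc_page(GFP_NOIO);
	if (unlikely(!page))
		return -ENOMEM;
	kaddr = kmap(page);
	bio_for_each_segment(bvec, bio, i) {
		mm_segment_t old_fs;
		ssize_t written;

		/* Copy and transform the bio vec data into the bounce page. */
		ret = lo_do_transfer(lo, WRITE, page, bvec->bv_offset,
				bvec->bv_page, bvec->bv_offset, bvec->bv_len,
				pos >> 9);
		if (unlikely(ret))
			break;
		/* Write the transformed data out via fops->write. */
		old_fs = get_fs();
		set_fs(get_ds());
		written = file->f_op->write(file,
				(char __user *)kaddr + bvec->bv_offset,
				bvec->bv_len, &pos);
		set_fs(old_fs);
		if (unlikely(written != bvec->bv_len)) {
			ret = -EIO;
			break;
		}
	}
	kunmap(page);
	__free_page(page);
	return ret;
}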

This approach has the absolute minimum of overhead I could come up with, 
and for performance-hungry people, as you can see, I left the address space 
operations method in place for filesystems which implement 
aops->{prepare,commit}_write.

I have tested this patch with normal loop devices using 
aops->{prepare,commit}_write on the backing filesystem, with normal loop 
devices using the fops->write code path and with cryptoloop devices using 
the double buffering + fops->write code path.

It all seemed to work fine for me.

Andrew, please consider this patch for your -mm tree and please apply it 
to mainline when/as you see fit.  Thanks!

Best regards,

Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

--- loop-write.diff ---

loop: Implement fallback to file operations->write in the loop driver for 
  backing filesystems which do not implement address space 
  operations->{prepare,commit}_write.

Signed-off-by: Anton Altaparmakov <[EMAIL PROTECTED]>

drivers/block/loop.c |  159 +++++++++++++++++++++++++++++++++++++----------
include/linux/loop.h |    5 ++++-
2 files changed, 136 insertions(+), 28 deletions(-)

--- linus-2.6/include/linux/loop.h  2005-02-16 11:04:41.439418700 +0000
+++ linux-2.6/include/linux/loop.h  2005-02-16 11:08:07.535781915 +0000
@@ -71,7 +71,10 @@ struct loop_device {
 /*
  * Loop flags
  */
-#define LO_FLAGS_READ_ONLY 1
+enum {
+   LO_FLAGS_READ_ONLY  = 1,
+   LO_FLAGS_USE_AOPS   = 2,
+};
 
 #include <asm/posix_types.h>	/* for __kernel_old_dev_t */
 #include <asm/types.h>		/* for __u64 */
--- linus-2.6/drivers/block/loop.c  2005-02-16 10:44:02.994427343 +0000
+++ linux-2.6/drivers/block/loop.c  2005-02-16 21:08:25.235894884 +0000
@@ -39,6 +39,11 @@
  * Support up to 256 loop devices
  * Heinz Mauelshagen <[EMAIL PROTECTED]>, Feb 2002
  *
+ * Support for falling back on the write file operation when the address space
+ * operations prepare_write and/or commit_write are not available on the
+ * backing filesystem.
+ * Anton Altaparmakov, 16 Feb 2005
+ *
  * Still To Fix:
  * - Advisory locking is ignored here.
  * - Should use an own CAP_* category instead of CAP_SYS_ADMIN
@@ -67,6 +72,8 @@
 #include <linux/writeback.h>
 #include <linux/buffer_head.h>	/* for invalidate_bdev() */
 #include <linux/completion.h>
+#include <linux/highmem.h>
+#include <linux/gfp.h>
 
 #include <asm/uaccess.h>
 
@@ -127,7 +134,7 @@ static int transfer_xor(struct loop_devi
 
 static int xor_init(struct loop_device *lo, const struct loop_info64 *info)
 {
-   if (info->lo_encrypt_key_size <= 0)
+   if (unlikely(info->lo_encrypt_key_size <= 0))
return -EINVAL;
return 0;
 }
@@ -173,7 +180,7 @@ figure_loop_size(struct loop_device *lo)
loff_t size = get_loop_size(lo, lo->lo_backing_file);
sector_t x = (sector_t)size;
 
-   if ((loff_t)x != size)
+   if (unlikely((loff_t)x != size))
return -EFBIG;
 
set_capacity(disks[lo->lo_number], x);
@@ -186,23 +193,27 @@ lo_do_transfer(struct loop_device *lo, i
   struct page *lpage, unsigned loffs,
   int size, sector_t rblock)
 {
-   if (!lo->transfer)
+   if (unlikely(!lo->transfer))
return 0;
 
return lo->transfer(lo, cmd, rpage, roffs, lpage, loffs, size, rblock);
 }
 
-static int
-do_lo_send(struct loop_device *lo, struct bio_vec *bvec, int 


Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages.

2005-02-06 Thread Andrew Morton
Anton Altaparmakov <[EMAIL PROTECTED]> wrote:
>
> On Thu, 3 Feb 2005, Andrew Morton wrote:
> > I did a patch which switched loop to use the file_operations.read/write
> > about a year ago.  Forget what happened to it.  It always seemed the right
> > thing to do..
> 
> How did you implement the write?

It was Urban, actually.  Memory fails me.

http://marc.theaimsgroup.com/?l=linux-fsdevel&m=102133217310200&w=2

It won't work properly for crypto transformations - iirc we decided it
would need to perform a copy.


Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages.

2005-02-06 Thread Anton Altaparmakov
On Thu, 3 Feb 2005, Andrew Morton wrote:
> I did a patch which switched loop to use the file_operations.read/write
> about a year ago.  Forget what happened to it.  It always seemed the right
> thing to do..

How did you implement the write?  At the moment the loop driver gets hold 
of both source and destination pages (the latter via grab_cache_page() and 
aops->prepare_write()) and copies/transforms directly from the source to 
the destination page (and then calls commit_write() on the destination 
page).  Did you allocate a buffer for each request, copy/transform to the 
buffer and then submit the buffer via file_operations->write?  That would 
clearly be not very efficient but given fops->write() is not atomic I 
don't see how that could be optimised further...

Perhaps the loop driver should work as is when 
aops->{prepare,commit}_write() are not NULL and should fall back to 
a buffered fops->write() otherwise?
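
For concreteness, the check might look something like this rough sketch 
(made-up helper name, against the 2.6 structures):

static inline int lo_backing_has_aops(struct loop_device *lo)
{
	struct address_space_operations *a_ops =
			lo->lo_backing_file->f_mapping->a_ops;

	/* Take the direct page cache path only if the backing filesystem
	   implements both operations; fall back to fops->write otherwise. */
	return a_ops->prepare_write && a_ops->commit_write;
}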

Or have I missed some way in which the fops->write() case can be 
optimised?

Best regards,

Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - nopage alternative

2005-02-04 Thread Hugh Dickins
On Fri, 4 Feb 2005, Anton Altaparmakov wrote:
> On Thu, 2005-02-03 at 11:23 -0800, Bryan Henderson wrote:
> > 
> > I think the point is that we can't have a "handler for writes," because 
> > the writes are being done by simple CPU Store instructions in a user 
> > program.  The handler we're talking about is just for page faults.  Other 
> 
> That was my understanding.
> 
> > operating systems approach this by actually _having_ a handler for a CPU 
> > store instruction, in the form of a page protection fault handler -- the 
> > nopage routine adds the page to the user's address space, but write 
> > protects it.  The first time the user tries to store into it, the 
> > filesystem driver gets a chance to do what's necessary to support a dirty 
> > cache page -- allocate a block, add additional dirty pages to the cache, 
> > etc.  It would be wonderful to have that in Linux.
> 
> It would.  This would certainly solve this problem.

Isn't this exactly what David Howells' page_mkwrite stuff in -mm's
add-page-becoming-writable-notification.patch is designed for?

Though it looks a little broken to me as it stands (beyond the two
fixup patches already there).  I've not found time to double-check
or test, apologies in advance if I'm libelling, but...

(a) I thought the prot bits that do_nopage gives to a pte in a shared
writable mapping include write permission, even when it's a read fault:
that can't be allowed if there's a page_mkwrite.

(b) I don't understand how do_wp_page's "reuse" logic for whether it
can just go ahead and use the existing anonymous page, would have
any relevance to calling page_mkwrite on a shared writable page,
which must be used and not COWed however many references there are.

Didn't look further, think you should take a look at page_mkwrite,
but don't expect the implementation to be correct yet.
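
For reference, the hook in that patch has roughly this shape; the handler 
body here is a hypothetical filesystem sketch, not David's code:

	/* New member of struct vm_operations_struct in the -mm patch: */
	int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);

static int myfs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
{
	/* First store into a shared writable page: allocate blocks,
	   instantiate any hole and dirty the extra cache pages before
	   the pte is made writable.  myfs_prepare_dirty_page() is made up. */
	return myfs_prepare_dirty_page(vma->vm_file->f_mapping->host, page);
}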

Hugh


Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - nopage alternative

2005-02-04 Thread Anton Altaparmakov
On Thu, 2005-02-03 at 11:23 -0800, Bryan Henderson wrote:
> >> > > And for the vmscan->writepage() side of things I wonder if it would be
> >> > > possible to overload the mapping's ->nopage handler.  If the target page
> >> > > lies in a hole, go off and allocate all the necessary pagecache pages, zero
> >> > > them, mark them dirty?
> >> > 
> >> > I guess it would be possible but ->nopage is used for the read case and
> >> > why would we want to then cause writes/allocations?
> >> 
> >> yup, we'd need to create a new handler for writes, or pass `write_access'
> >> into ->nopage.  I think others (dwdm2?) have seen a need for that.
> >
> >That would work as long as all writable mappings are actually written to
> >everywhere.  Otherwise you still get that reading the whole mmap()ped
> >are but writing a small part of it would still instantiate all of it on
> >disk.  As far as I understand this there is no way to hook into the mmap
> >system such that we have a hook whenever a mmap()ped page gets written
> >to for the first time.  (I may well be wrong on that one so please
> >correct me if that is the case.)
> 
> I think the point is that we can't have a "handler for writes," because 
> the writes are being done by simple CPU Store instructions in a user 
> program.  The handler we're talking about is just for page faults.  Other 

That was my understanding.

> operating systems approach this by actually _having_ a handler for a CPU 
> store instruction, in the form of a page protection fault handler -- the 
> nopage routine adds the page to the user's address space, but write 
> protects it.  The first time the user tries to store into it, the 
> filesystem driver gets a chance to do what's necessary to support a dirty 
> cache page -- allocate a block, add additional dirty pages to the cache, 
> etc.  It would be wonderful to have that in Linux.

It would.  This would certainly solve this problem.

> Short of that, I don't see any way to avoid sometimes filling in holes due 
> to reads.  It's not a huge problem, though -- it requires someone to do a 
> shared writable mmap and then read lots of holes and not write to them, 
> which is a pretty rare situation for a normal file.
> 
> I didn't follow how the helper function solves this problem.  If it's 
> something involving adding the required extra pages to the cache at 
> pageout time, then that's not going to work -- you can't make adding pages 
> to the cache a prerequisite for cleaning a page -- that would be Deadlock 
> City.

Ouch.  Yes, it would rather, wouldn't it.  How very annoying of it.  I
guess the ->nopage way is the only sane way to deal with this then.  I
suppose it is also possible to do it via writepage by scanning for and
locking pages if present in writepage and if not present go direct to
disk (using bio or bh) whilst holding exclusive lock so no new pages can
be added to the page cache with stale data.  Or actually we wouldn't
even care if stale pages are added as they would still be cleared in
readpage().  And pages found and uptodate and locked simply need to be
marked dirty and released again and if not uptodate they need to be
cleared first.  This might actually turn out the easiest for ntfs at
least and it should avoid the deadlocks that would be caused by trying
to add new pages to the page cache.  It's maybe not as clean as a write
on a read-only page causing a fault handler to be triggered but it
should work I think.  Comments?
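
Very roughly, something like this (hypothetical helper, no locking 
subtleties or error handling, just the shape of the idea):

static void dirty_or_skip_block_pages(struct address_space *mapping,
		struct page *locked_page, pgoff_t first, pgoff_t last)
{
	pgoff_t idx;

	for (idx = first; idx <= last; idx++) {
		struct page *page;

		if (idx == locked_page->index)
			continue;
		page = find_lock_page(mapping, idx);
		if (!page)
			continue;	/* Absent: take this part direct from disk. */
		if (!PageUptodate(page)) {
			/* Clear stale data before dirtying the page. */
			clear_highpage(page);
			SetPageUptodate(page);
		}
		set_page_dirty(page);
		unlock_page(page);
		page_cache_release(page);
	}
}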

> My large-block filesystem driver does the nopage thing, and does in fact 
> fill in files unnecessarily in this scenario.  :-(  The driver for the 
> same filesystems on AIX does not, though.  It has the write protection 
> thing.

Is your driver's source available to look at?

Best regards,

Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - loop device

2005-02-03 Thread Bryan Henderson
>I did a patch which switched loop to use the file_operations.read/write
>about a year ago.  Forget what happened to it.  It always seemed the right
>thing to do..

This is unquestionably the right thing to do (at least compared to what we 
have now).  The loop device driver has no business assuming that the 
underlying filesystem uses the generic routines.  I always assumed it was 
a simple design error that it did.  (Such errors are easy to make because 
prepare_write and commit_write are declared as address space operations, 
when they're really private to the buffer cache and generic writer).

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems


Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - nopage alternative

2005-02-03 Thread Bryan Henderson
>> > > And for the vmscan->writepage() side of things I wonder if it would be
>> > > possible to overload the mapping's ->nopage handler.  If the target page
>> > > lies in a hole, go off and allocate all the necessary pagecache pages, zero
>> > > them, mark them dirty?
>> > 
>> > I guess it would be possible but ->nopage is used for the read case and
>> > why would we want to then cause writes/allocations?
>> 
>> yup, we'd need to create a new handler for writes, or pass `write_access'
>> into ->nopage.  I think others (dwdm2?) have seen a need for that.
>
>That would work as long as all writable mappings are actually written to
>everywhere.  Otherwise you still get that reading the whole mmap()ped
>area but writing a small part of it would still instantiate all of it on
>disk.  As far as I understand this there is no way to hook into the mmap
>system such that we have a hook whenever a mmap()ped page gets written
>to for the first time.  (I may well be wrong on that one so please
>correct me if that is the case.)

I think the point is that we can't have a "handler for writes," because 
the writes are being done by simple CPU Store instructions in a user 
program.  The handler we're talking about is just for page faults.  Other 
operating systems approach this by actually _having_ a handler for a CPU 
store instruction, in the form of a page protection fault handler -- the 
nopage routine adds the page to the user's address space, but write 
protects it.  The first time the user tries to store into it, the 
filesystem driver gets a chance to do what's necessary to support a dirty 
cache page -- allocate a block, add additional dirty pages to the cache, 
etc.  It would be wonderful to have that in Linux.  I saw hints of such 
code in a Linux kernel once (a "write_protect" address space operation or 
something like that); I don't know what happened to it.

Short of that, I don't see any way to avoid sometimes filling in holes due 
to reads.  It's not a huge problem, though -- it requires someone to do a 
shared writable mmap and then read lots of holes and not write to them, 
which is a pretty rare situation for a normal file.

I didn't follow how the helper function solves this problem.  If it's 
something involving adding the required extra pages to the cache at 
pageout time, then that's not going to work -- you can't make adding pages 
to the cache a prerequisite for cleaning a page -- that would be Deadlock 
City.

My large-block filesystem driver does the nopage thing, and does in fact 
fill in files unnecessarily in this scenario.  :-(  The driver for the 
same filesystems on AIX does not, though.  It has the write protection 
thing.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems


Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages.

2005-02-03 Thread Anton Altaparmakov
On Thu, 2005-02-03 at 02:47 -0800, Andrew Morton wrote:
> Anton Altaparmakov <[EMAIL PROTECTED]> wrote:
> > On Wed, 2005-02-02 at 14:34 -0800, Andrew Morton wrote:
> > > Anton Altaparmakov <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Below is a patch which adds a function
> > > > mm/filemap.c::find_or_create_pages(), locks a range of pages.  Please see
> > > > the function description in the patch for details.
> > > 
> > > This isn't very nice, is it, really?  Kind of a square peg in a round hole.
> > 
> > Only followed your advice.  (-;  But yes, it is not very nice at all.
> > 
> > > If you took the approach of defining a custom file_operations.write() then
> > > I'd imagine that the write() side of things would fall out fairly neatly:
> > > no need for s_umount and i_sem needs to be taken anyway.  No trylocking.
> > 
> > But the write() side of things don't need s_umount or trylocking with
> > the proposed find_or_create_pages(), either...
> 
> i_sem nests outside lock_page, normally.  I guess that can be avoided though.

I meant that the write() side of things (i.e. ->{prepare,commit}_write)
already has i_sem held on entry.

> > Unfortunately it is not possible to do this since removing
> > ->{prepare,commit}_write() from NTFS would mean that we cannot use loop
> > devices on NTFS any more and this is a really important feature for
> > several Linux distributions (e.g. TopologiLinux) which install Linux on
> > a loopback mounted NTFS file which they then use to place an ext3 (or
> > whatever) fs on and use that as the root fs...
> > 
> > So we definitely need full blown prepare/commit write.  (Unless we
> > modify the loop device driver not to use ->{prepare,commit}_write
> > first.)
> > 
> > Any ideas how to solve that one?
> 
> I did a patch which switched loop to use the file_operations.read/write
> about a year ago.  Forget what happened to it.  It always seemed the right
> thing to do..

Yes, I remember seeing something like that on LKML.  I guess it would
enable readahead on the loop devices.  Whether this is a good or bad
thing is I guess entirely dependent on the usage scenario.

> > > And for the vmscan->writepage() side of things I wonder if it would be
> > > possible to overload the mapping's ->nopage handler.  If the target page
> > > lies in a hole, go off and allocate all the necessary pagecache pages, zero
> > > them, mark them dirty?
> > 
> > I guess it would be possible but ->nopage is used for the read case and
> > why would we want to then cause writes/allocations?
> 
> yup, we'd need to create a new handler for writes, or pass `write_access'
> into ->nopage.  I think others (dwdm2?) have seen a need for that.

That would work as long as all writable mappings are actually written to
everywhere.  Otherwise you still get that reading the whole mmap()ped
area but writing a small part of it would still instantiate all of it on
disk.  As far as I understand this there is no way to hook into the mmap
system such that we have a hook whenever a mmap()ped page gets written
to for the first time.  (I may well be wrong on that one so please
correct me if that is the case.)

> > At the moment I cannot see a way to solve my problem without the
> > proposed find_or_create_pages().  )-:
> 
> Unpleasant, isn't it.
> 
> I guess the path of least resistance is to do it within ntfs for now.

Ok, I will do that.  ntfs_find_or_create_pages() will be hitting
fs/ntfs/aops.c soon...

As always, thanks a lot for your help!

Best regards,

Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages.

2005-02-03 Thread Andrew Morton
Anton Altaparmakov <[EMAIL PROTECTED]> wrote:
>
> On Wed, 2005-02-02 at 14:34 -0800, Andrew Morton wrote:
> > Anton Altaparmakov <[EMAIL PROTECTED]> wrote:
> > >
> > > Below is a patch which adds a function
> > > mm/filemap.c::find_or_create_pages(), locks a range of pages.  Please see
> > > the function description in the patch for details.
> > 
> > This isn't very nice, is it, really?  Kind of a square peg in a round hole.
> 
> Only followed your advice.  (-;  But yes, it is not very nice at all.
> 
> > If you took the approach of defining a custom file_operations.write() then
> > I'd imagine that the write() side of things would fall out fairly neatly:
> > no need for s_umount and i_sem needs to be taken anyway.  No trylocking.
> 
> But the write() side of things don't need s_umount or trylocking with
> the proposed find_or_create_pages(), either...

i_sem nests outside lock_page, normally.  I guess that can be avoided though.

> Unfortunately it is not possible to do this since removing
> ->{prepare,commit}_write() from NTFS would mean that we cannot use loop
> devices on NTFS any more and this is a really important feature for
> several Linux distributions (e.g. TopologiLinux) which install Linux on
> a loopback mounted NTFS file which they then use to place an ext3 (or
> whatever) fs on and use that as the root fs...
> 
> So we definitely need full blown prepare/commit write.  (Unless we
> modify the loop device driver not to use ->{prepare,commit}_write
> first.)
> 
> Any ideas how to solve that one?

I did a patch which switched loop to use the file_operations.read/write
about a year ago.  Forget what happened to it.  It always seemed the right
thing to do..

> > And for the vmscan->writepage() side of things I wonder if it would be
> > possible to overload the mapping's ->nopage handler.  If the target page
> > lies in a hole, go off and allocate all the necessary pagecache pages, zero
> > them, mark them dirty?
> 
> I guess it would be possible but ->nopage is used for the read case and
> why would we want to then cause writes/allocations?

yup, we'd need to create a new handler for writes, or pass `write_access'
into ->nopage.  I think others (dwdm2?) have seen a need for that.

> At the moment I cannot see a way to solve my problem without the
> proposed find_or_create_pages().  )-:

Unpleasant, isn't it.

I guess the path of least resistance is to do it within ntfs for now.


Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages.

2005-02-03 Thread Anton Altaparmakov
On Wed, 2005-02-02 at 14:34 -0800, Andrew Morton wrote:
> Anton Altaparmakov <[EMAIL PROTECTED]> wrote:
> >
> > Below is a patch which adds a function 
> >  mm/filemap.c::find_or_create_pages(), locks a range of pages.  Please see 
> >  the function description in the patch for details.
> 
> This isn't very nice, is it, really?  Kind of a square peg in a round hole.

Only followed your advice.  (-;  But yes, it is not very nice at all.

> If you took the approach of defining a custom file_operations.write() then
> I'd imagine that the write() side of things would fall out fairly neatly:
> no need for s_umount and i_sem needs to be taken anyway.  No trylocking.

But the write() side of things don't need s_umount or trylocking with
the proposed find_or_create_pages(), either...

Unfortunately it is not possible to do this since removing
->{prepare,commit}_write() from NTFS would mean that we cannot use loop
devices on NTFS any more and this is a really important feature for
several Linux distributions (e.g. TopologiLinux) which install Linux on
a loopback mounted NTFS file which they then use to place an ext3 (or
whatever) fs on and use that as the root fs...

So we definitely need full blown prepare/commit write.  (Unless we
modify the loop device driver not to use ->{prepare,commit}_write
first.)

Any ideas how to solve that one?

> And for the vmscan->writepage() side of things I wonder if it would be
> possible to overload the mapping's ->nopage handler.  If the target page
> lies in a hole, go off and allocate all the necessary pagecache pages, zero
> them, mark them dirty?

I guess it would be possible but ->nopage is used for the read case and
why would we want to then cause writes/allocations?  Example: I create a
sparse file of 2TiB size and put some data in relevant places.  Then an
application mmap()s it and does loads of reads on the mmap()ped file
and perhaps a write here or there.  Do we really want that to start
allocating and filling in all read holes?  That seems worse than having
a square peg for a round hole that is hidden away in a single function.
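
To illustrate the scenario from userspace (a sketch for a 64-bit machine; 
error checking omitted and the file name made up):

#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	off_t size = (off_t)2 << 40;	/* 2TiB file, almost all holes. */
	int fd = open("sparse.img", O_RDWR | O_CREAT, 0644);
	char *p;
	char c;

	ftruncate(fd, size);		/* Instantiates nothing on disk. */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	c = p[123456789];		/* Loads of reads over holes... */
	p[0] = c;			/* ...and a write here or there. */
	munmap(p, size);
	close(fd);
	return 0;
}

An allocate-on-fault ->nopage would instantiate on disk every hole those 
reads touch, not just the one written page.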

There is nothing in the proposed find_or_create_pages() that means it
needs to go into mm/filemap.c.  It could easily be a private function in
fs/ntfs/aops.c.  I just thought that other filesystems that want to support
writing to large block sizes might find it useful and having a shared
copy in mm/filemap.c would be better in that case.  But if it is too
ugly to go in mm/filemap.c then that is fine, too.

At the moment I cannot see a way to solve my problem without the
proposed find_or_create_pages().  )-:

Best regards,

Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/


Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages.

2005-02-02 Thread Andrew Morton
Anton Altaparmakov <[EMAIL PROTECTED]> wrote:
>
> Below is a patch which adds a function 
>  mm/filemap.c::find_or_create_pages(), locks a range of pages.  Please see 
>  the function description in the patch for details.

This isn't very nice, is it, really?  Kind of a square peg in a round hole.

If you took the approach of defining a custom file_operations.write() then
I'd imagine that the write() side of things would fall out fairly neatly:
no need for s_umount and i_sem needs to be taken anyway.  No trylocking.

And for the vmscan->writepage() side of things I wonder if it would be
possible to overload the mapping's ->nopage handler.  If the target page
lies in a hole, go off and allocate all the necessary pagecache pages, zero
them, mark them dirty?
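
A rough sketch of that idea (the hole test and the block-extent helper are 
invented names for this sketch; races with truncate, and the possibility 
that some of the pages already hold data, are ignored):

	static struct page *myfs_nopage(struct vm_area_struct *vma,
			unsigned long address, int *type)
	{
		struct address_space *mapping = vma->vm_file->f_mapping;
		pgoff_t index = vma->vm_pgoff +
				((address - vma->vm_start) >> PAGE_CACHE_SHIFT);
		pgoff_t i, first, last;
		struct page *page = NULL;

		if (!fs_offset_in_hole(mapping->host, index))
			return filemap_nopage(vma, address, type);

		/* Faulting into a hole: instantiate, zero and dirty every
		 * pagecache page the filesystem needs to write out as a
		 * unit, so ->writepage later finds a complete set. */
		fs_block_extent(mapping->host, index, &first, &last);
		for (i = first; i <= last; i++) {
			struct page *p = find_or_create_page(mapping, i,
					GFP_KERNEL);
			if (!p)
				return NOPAGE_OOM;
			clear_highpage(p);
			set_page_dirty(p);
			unlock_page(p);
			if (i == index)
				page = p;	/* keep ref for the fault */
			else
				page_cache_release(p);
		}
		if (type)
			*type = VM_FAULT_MINOR;
		return page;
	}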


Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages.

2005-02-02 Thread Anton Altaparmakov
Hi Matthew,

On Wed, 2005-02-02 at 15:43 +, Matthew Wilcox wrote:
> On Wed, Feb 02, 2005 at 03:12:50PM +, Anton Altaparmakov wrote:
> 
> I think the below loop would be clearer as a for loop ...
> 
>   err = 0;
>   for (nr = 0; nr < nr_pages; nr++, start++) {
>   if (start == lp_idx) {
>   pages[nr] = locked_page;
>   if (!nr)
>   continue;
>   lock_page(locked_page);
>   if (!wbc)
>   continue;
>   if (wbc->for_reclaim) {
>   up(&inode->i_sem);
>   up_read(&inode->i_sb->s_umount);
>   }
>   /* Was the page truncated under us? */
>   if (page_mapping(locked_page) != mapping) {
>   err = -ESTALE;
>   goto err_out_locked;
>   }
>   } else {
>   pages[nr] = find_lock_page(mapping, start);
>   if (pages[nr])
>   continue;
>   if (!cached_page) {
>   cached_page = alloc_page(gfp_mask);
>   if (unlikely(!cached_page))
>   goto err_out;
>   }
>   err = add_to_page_cache_lru(cached_page,
>   mapping, start, gfp_mask);
>   if (unlikely(err)) {
>   if (err == -EEXIST)
>   continue;
>   goto err_out;
>   }
>   pages[nr] = cached_page;
>   cached_page = NULL;
>   }
>   }
> 
> The above fixes two bugs in the below:
>  - if (!unlikely(cached_page)) should be if (unlikely(!cached_page))

Ah, oops.  Thanks!  Well spotted!  I did say it was only compile
tested...  (-;

>  - The -EEXIST case after add_to_page_cache_lru() would result in
>an infinite loop in the original as nr wasn't being incremented.

That was exactly what was meant to happen.  It is not a bug.  It is a
feature.  This is why it is a while loop instead of a for loop.  I need
to have @nr and @start incremented only if the code reaches the end of
the loop.

The -EEXIST case needs to repeat for the same @nr and @start.  It
basically means that someone else allocated the page with index @start
and added it to the page cache in between us running find_lock_page()
and add_to_page_cache_lru().  So what we want to do is to run
find_lock_page() again which should then find and lock the page that the
other process created.

Of course what could happen is that between us getting the -EEXIST and
us repeating the find_lock_page() the page is freed again so the
find_lock_page() fails again.  Perhaps this time we will succeed with
add_to_page_cache_lru() and if not we repeat again.  Eventually either
find_lock_page() or add_to_page_cache_lru() will succeed, so in practice 
it will never be an endless loop.

If the while loop is changed to a for loop, the "continue;" on -EEXIST
would need to be changed to "goto repeat;" and a label "repeat:" would
need to be placed at the beginning of the loop.  I considered this but
decided the while loop looks nicer.  (-:
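
For reference, a sketch of the for-loop variant just described -- only the
-EEXIST path changes relative to the proposed for loop above:

	err = 0;
	for (nr = 0; nr < nr_pages; nr++, start++) {
repeat:
		if (start == lp_idx) {
			/* ... locked_page handling as above ... */
		} else {
			pages[nr] = find_lock_page(mapping, start);
			if (pages[nr])
				continue;
			if (!cached_page) {
				cached_page = alloc_page(gfp_mask);
				if (unlikely(!cached_page))
					goto err_out;
			}
			err = add_to_page_cache_lru(cached_page,
					mapping, start, gfp_mask);
			if (unlikely(err)) {
				if (err == -EEXIST) {
					/* Page appeared behind our back:
					 * retry the same index. */
					goto repeat;
				}
				goto err_out;
			}
			pages[nr] = cached_page;
			cached_page = NULL;
		}
	}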

Thanks for the review!

> > +   err = nr = 0;
> > +   while (nr < nr_pages) {
> > +   if (start == lp_idx) {
> > +   pages[nr] = locked_page;
> > +   if (nr) {
> > +   lock_page(locked_page);
> > +   if (wbc) {
> > +   if (wbc->for_reclaim) {
> > +   up(&inode->i_sem);
> > +   up_read(&inode->i_sb->s_umount);
> > +   }
> > +   /* Was the page truncated under us? */
> > +   if (page_mapping(locked_page) !=
> > +   mapping) {
> > +   err = -ESTALE;
> > +   goto err_out_locked;
> > +   }
> > +   }
> > +   }
> > +   } else {
> > +   pages[nr] = find_lock_page(mapping, start);
> > +   if (!pages[nr]) {
> > +   if (!cached_page) {
> > +   cached_page = alloc_page(gfp_mask);
> > +   if (!unlikely(cached_page))
> > +   goto err_out;
> > +   }
> > +   err = add_to_page_cache_lru(cached_page,
> > +   

Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages.

2005-02-02 Thread Matthew Wilcox
On Wed, Feb 02, 2005 at 03:12:50PM +, Anton Altaparmakov wrote:

I think the below loop would be clearer as a for loop ...

	err = 0;
	for (nr = 0; nr < nr_pages; nr++, start++) {
		if (start == lp_idx) {
			pages[nr] = locked_page;
			if (!nr)
				continue;
			lock_page(locked_page);
			if (!wbc)
				continue;
			if (wbc->for_reclaim) {
				up(&inode->i_sem);
				up_read(&inode->i_sb->s_umount);
			}
			/* Was the page truncated under us? */
			if (page_mapping(locked_page) != mapping) {
				err = -ESTALE;
				goto err_out_locked;
			}
		} else {
			pages[nr] = find_lock_page(mapping, start);
			if (pages[nr])
				continue;
			if (!cached_page) {
				cached_page = alloc_page(gfp_mask);
				if (unlikely(!cached_page))
					goto err_out;
			}
			err = add_to_page_cache_lru(cached_page,
					mapping, start, gfp_mask);
			if (unlikely(err)) {
				if (err == -EEXIST)
					continue;
				goto err_out;
			}
			pages[nr] = cached_page;
			cached_page = NULL;
		}
	}

The above fixes two bugs in the below:
 - if (!unlikely(cached_page)) should be if (unlikely(!cached_page))
 - The -EEXIST case after add_to_page_cache_lru() would result in
   an infinite loop in the original as nr wasn't being incremented.

> +	err = nr = 0;
> +	while (nr < nr_pages) {
> +		if (start == lp_idx) {
> +			pages[nr] = locked_page;
> +			if (nr) {
> +				lock_page(locked_page);
> +				if (wbc) {
> +					if (wbc->for_reclaim) {
> +						up(&inode->i_sem);
> +						up_read(&inode->i_sb->s_umount);
> +					}
> +					/* Was the page truncated under us? */
> +					if (page_mapping(locked_page) !=
> +							mapping) {
> +						err = -ESTALE;
> +						goto err_out_locked;
> +					}
> +				}
> +			}
> +		} else {
> +			pages[nr] = find_lock_page(mapping, start);
> +			if (!pages[nr]) {
> +				if (!cached_page) {
> +					cached_page = alloc_page(gfp_mask);
> +					if (!unlikely(cached_page))
> +						goto err_out;
> +				}
> +				err = add_to_page_cache_lru(cached_page,
> +						mapping, start, gfp_mask);
> +				if (unlikely(err)) {
> +					if (err == -EEXIST)
> +						continue;
> +					goto err_out;
> +				}
> +				pages[nr] = cached_page;
> +				cached_page = NULL;
> +			}
> +		}
> +		nr++;
> +		start++;
> +	}

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

