Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
2008/1/21, Peter Staubach <[EMAIL PROTECTED]>:
> Linus Torvalds wrote:
> > On Fri, 18 Jan 2008, Ingo Oeser wrote:
> >
> >> Can we get "if the write to the page hits the disk, the mtime has hit the disk
> >> already no less than SOME_GRANULARITY before"?
> >>
> >> That is very important for computer forensics. Esp. in saving your ass!
> >>
> >> Ok, now back again to making that fast :-)
> >>
> >
> > I certainly don't mind it if we have some tighter guarantees, but what I'd
> > want is:
> >
> >  - keep it simple. Let's face it, Linux has never ever given those
> >    guarantees before, and it's not as if anybody has really cared. Even
> >    now, the issue seems to be more about paper standards conformance than
> >    anything else.
> >
> >
>
> I have been working on getting something supported here
> because I have some very large Wall Street customers who do
> care about getting the mtime updated because their backups
> are getting corrupted.  They are incomplete because although
> their applications update files, they don't get backed up
> because the mtime never changes.
>
> >  - I get worried about people playing around with the dirty bit in
> >    particular. We have had some really rather nasty bugs here. Most of
> >    which are totally impossible to trigger under normal loads (for
> >    example the old random-access utorrent writable mmap issue from about
> >    a year ago).
> >
> > So these two issues - the big red danger signs flashing in my brain,
> > coupled with the fact that no application has apparently ever really
> > noticed in the last 15 years - just makes it a case where I'd like each
> > step of the way to be obvious and simple and no larger than really
> > absolutely necessary.
>
> Simple is good.  However, too simple is not good.  I would suggest
> that we implement file time updates which make sense and if they
> happen to follow POSIX, then nifty, otherwise, oh well.

Thank you very much for your support, Peter!

I'm going to submit the design document, the next version of the patch
series, and the performance test results soon.

>
> Thanx...
>
>ps
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-21 Thread Peter Staubach

Linus Torvalds wrote:
> On Fri, 18 Jan 2008, Ingo Oeser wrote:
>
>> Can we get "if the write to the page hits the disk, the mtime has hit the disk
>> already no less than SOME_GRANULARITY before"?
>>
>> That is very important for computer forensics. Esp. in saving your ass!
>>
>> Ok, now back again to making that fast :-)
>
> I certainly don't mind it if we have some tighter guarantees, but what I'd
> want is:
>
>  - keep it simple. Let's face it, Linux has never ever given those
>    guarantees before, and it's not as if anybody has really cared. Even
>    now, the issue seems to be more about paper standards conformance than
>    anything else.

I have been working on getting something supported here
because I have some very large Wall Street customers who do
care about getting the mtime updated because their backups
are getting corrupted.  They are incomplete because although
their applications update files, they don't get backed up
because the mtime never changes.

>  - I get worried about people playing around with the dirty bit in
>    particular. We have had some really rather nasty bugs here. Most of
>    which are totally impossible to trigger under normal loads (for
>    example the old random-access utorrent writable mmap issue from about
>    a year ago).
>
> So these two issues - the big red danger signs flashing in my brain,
> coupled with the fact that no application has apparently ever really
> noticed in the last 15 years - just makes it a case where I'd like each
> step of the way to be obvious and simple and no larger than really
> absolutely necessary.

Simple is good.  However, too simple is not good.  I would suggest
that we implement file time updates which make sense and if they
happen to follow POSIX, then nifty, otherwise, oh well.

   Thanx...

  ps


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-19 Thread Matt Mackall

On Sat, 2008-01-19 at 11:22 +0100, Miklos Szeredi wrote:
> > Reminds me, I've got a patch here for addressing that problem with loop 
> > mounts:
> > 
> > Writes to loop should update the mtime of the underlying file.
> > 
> > Signed-off-by: Matt Mackall <[EMAIL PROTECTED]>
> > 
> > Index: l/drivers/block/loop.c
> > ===================================================================
> > --- l.orig/drivers/block/loop.c 2007-11-05 17:50:07.0 -0600
> > +++ l/drivers/block/loop.c  2007-11-05 19:03:51.0 -0600
> > @@ -221,6 +221,7 @@ static int do_lo_send_aops(struct loop_d
> > offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
> > bv_offs = bvec->bv_offset;
> > len = bvec->bv_len;
> > +   file_update_time(file);
> > while (len > 0) {
> > sector_t IV;
> > unsigned size;
> > @@ -299,6 +300,7 @@ static int __do_lo_send_write(struct fil
> >  
> > set_fs(get_ds());
> > bw = file->f_op->write(file, buf, len, &pos);
> > +   file_update_time(file);
> 
> ->write should have already updated the times, no?

Yes, this second case is redundant. Still needed in the first case.

-- 
Mathematics is the supreme nostalgia of our time.



Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-19 Thread Miklos Szeredi
> Reminds me, I've got a patch here for addressing that problem with loop 
> mounts:
> 
> Writes to loop should update the mtime of the underlying file.
> 
> Signed-off-by: Matt Mackall <[EMAIL PROTECTED]>
> 
> Index: l/drivers/block/loop.c
> ===================================================================
> --- l.orig/drivers/block/loop.c   2007-11-05 17:50:07.0 -0600
> +++ l/drivers/block/loop.c2007-11-05 19:03:51.0 -0600
> @@ -221,6 +221,7 @@ static int do_lo_send_aops(struct loop_d
>   offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
>   bv_offs = bvec->bv_offset;
>   len = bvec->bv_len;
> + file_update_time(file);
>   while (len > 0) {
>   sector_t IV;
>   unsigned size;
> @@ -299,6 +300,7 @@ static int __do_lo_send_write(struct fil
>  
>   set_fs(get_ds());
>   bw = file->f_op->write(file, buf, len, &pos);
> + file_update_time(file);

->write should have already updated the times, no?

>   set_fs(old_fs);
>   if (likely(bw == len))
>   return 0;
> 
> 



Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Rik van Riel
On Fri, 18 Jan 2008 18:50:03 -0600
Matt Mackall <[EMAIL PROTECTED]> wrote:
> On Fri, 2008-01-18 at 17:54 -0500, Rik van Riel wrote:

> > Backup programs not seeing an updated mtime is a really big deal.
> 
> And that's fixed with the 4-line approach.
> 
> Reminds me, I've got a patch here for addressing that problem with loop 
> mounts:
> 
> Writes to loop should update the mtime of the underlying file.
> 
> Signed-off-by: Matt Mackall <[EMAIL PROTECTED]>

Acked-by: Rik van Riel <[EMAIL PROTECTED]>

-- 
All rights reversed.


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Matt Mackall

On Fri, 2008-01-18 at 17:54 -0500, Rik van Riel wrote:
> On Fri, 18 Jan 2008 14:47:33 -0800 (PST)
> Linus Torvalds <[EMAIL PROTECTED]> wrote:
> 
> >  - keep it simple. Let's face it, Linux has never ever given those 
> >    guarantees before, and it's not as if anybody has really cared. Even
> >    now, the issue seems to be more about paper standards conformance than
> >    anything else.
> 
> There is one issue which is way more than just standards conformance.
> 
> When a program changes file data through mmap(), at some point the
> > mtime needs to be updated so that backup programs know to back up the
> new version of the file.
> 
> Backup programs not seeing an updated mtime is a really big deal.

And that's fixed with the 4-line approach.

Reminds me, I've got a patch here for addressing that problem with loop mounts:

Writes to loop should update the mtime of the underlying file.

Signed-off-by: Matt Mackall <[EMAIL PROTECTED]>

Index: l/drivers/block/loop.c
===================================================================
--- l.orig/drivers/block/loop.c 2007-11-05 17:50:07.0 -0600
+++ l/drivers/block/loop.c  2007-11-05 19:03:51.0 -0600
@@ -221,6 +221,7 @@ static int do_lo_send_aops(struct loop_d
offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
bv_offs = bvec->bv_offset;
len = bvec->bv_len;
+   file_update_time(file);
while (len > 0) {
sector_t IV;
unsigned size;
@@ -299,6 +300,7 @@ static int __do_lo_send_write(struct fil
 
set_fs(get_ds());
bw = file->f_op->write(file, buf, len, &pos);
+   file_update_time(file);
set_fs(old_fs);
if (likely(bw == len))
return 0;


-- 
Mathematics is the supreme nostalgia of our time.



Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Rik van Riel
On Fri, 18 Jan 2008 14:47:33 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

>  - keep it simple. Let's face it, Linux has never ever given those 
>    guarantees before, and it's not as if anybody has really cared. Even
>    now, the issue seems to be more about paper standards conformance than
>    anything else.

There is one issue which is way more than just standards conformance.

When a program changes file data through mmap(), at some point the
> mtime needs to be updated so that backup programs know to back up the
new version of the file.

Backup programs not seeing an updated mtime is a really big deal.

-- 
All rights reversed.


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Linus Torvalds


On Fri, 18 Jan 2008, Ingo Oeser wrote:
> 
> Can we get "if the write to the page hits the disk, the mtime has hit the disk
> already no less than SOME_GRANULARITY before"? 
> 
> That is very important for computer forensics. Esp. in saving your ass!
> 
> Ok, now back again to making that fast :-)

I certainly don't mind it if we have some tighter guarantees, but what I'd 
want is:

 - keep it simple. Let's face it, Linux has never ever given those 
   guarantees before, and it's not as if anybody has really cared. Even 
   now, the issue seems to be more about paper standards conformance than 
   anything else.

 - I get worried about people playing around with the dirty bit in 
   particular. We have had some really rather nasty bugs here. Most of 
   which are totally impossible to trigger under normal loads (for 
   example the old random-access utorrent writable mmap issue from about 
   a year ago).

So these two issues - the big red danger signs flashing in my brain, 
coupled with the fact that no application has apparently ever really 
noticed in the last 15 years - just makes it a case where I'd like each 
step of the way to be obvious and simple and no larger than really 
absolutely necessary.

Linus


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/19, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
> >
> > The page_check_address() function is called from the
> > page_mkclean_one() routine as follows:
>
> .. and the page_mkclean_one() function is totally different.
>
> Lookie here, this is the correct and complex sequence:
>
> > entry = ptep_clear_flush(vma, address, pte);
> > entry = pte_wrprotect(entry);
> > entry = pte_mkclean(entry);
> > set_pte_at(mm, address, pte, entry);
>
> That's a rather expensive sequence, but it's done exactly because it has
> to be done that way. What it does is to
>
>  - *atomically* load the pte entry _and_ clear the old one in memory.
>
>    That's the
>
> entry = ptep_clear_flush(vma, address, pte);
>
>    thing, and it basically means that it's doing some
>    architecture-specific magic to make sure that another CPU that accesses
>    the PTE at the same time will never actually modify the pte (because
>    it's clear and not valid)
>
>  - it then - while the page table is actually clear and invalid - takes
>    the old value and turns it into the new one:
>
> entry = pte_wrprotect(entry);
> entry = pte_mkclean(entry);
>
>  - and finally, it replaces the entry with the new one:
>
> set_pte_at(mm, address, pte, entry);
>
>    which takes care to write the new entry in some specific way that is
>    atomic wrt other CPU's (ie on 32-bit x86 with a 64-bit page table
>    entry it writes the high word first, see the write barriers in
>    "native_set_pte()" in include/asm-x86/pgtable-3level.h)
>
> Now, compare that subtle and correct thing with what is *not* correct:
>
> if (pte_dirty(*pte) && pte_write(*pte))
> *pte = pte_wrprotect(*pte);
>
> which makes no effort at all to make sure that it's safe in case another
> CPU updates the accessed bit.
>
> Now, arguably it's unlikely to cause horrible problems at least on x86,
> because:
>
>  - we only do this if the pte is already marked dirty, so while we can
>    lose the accessed bit, we can *not* lose the dirty bit. And the
>    accessed bit isn't such a big deal.
>
>  - it's not doing any of the "be careful about" ordering things, but since
>    the really important bits aren't changing, ordering probably won't
>    practically matter.
>
> But the problem is that we have something like 24 different architectures,
> it's hard to make sure that none of them have issues.
>
> In other words: it may well work in practice. But when these things go
> subtly wrong, they are *really* nasty to find, and the unsafe sequence is
> really not how it's supposed to be done. For example, you don't even flush
> the TLB, so even if there are no cross-CPU issues, there's probably going
> to be writable entries in the TLB that now don't match the page tables.
>
> Will it matter? Again, probably impossible to see in practice. But ...

Linus, I am very grateful to you for your extremely clear explanation
of the issue I have overlooked!

Back to the msync() issue, I'm going to come back with a new design
for the bug fix.

Thank you once again.

Anton

>
> Linus
>


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Ingo Oeser
Hi Linus,

On Friday 18 January 2008, Linus Torvalds wrote:
> On Fri, 18 Jan 2008, Miklos Szeredi wrote:
> > 
> > What I'm saying is that the times could be left un-updated for a long
> > time if program doesn't do munmap() or msync(MS_SYNC) for a long time.
> 
> Sure.
> 
> But in those circumstances, the programmer cannot depend on the mtime 
> *anyway* (because there is no synchronization), so what's the downside?

Can we get "if the write to the page hits the disk, the mtime has hit the disk
already no less than SOME_GRANULARITY before"? 

That is very important for computer forensics. Esp. in saving your ass!

Ok, now back again to making that fast :-)


Best Regards

Ingo Oeser


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Linus Torvalds


On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
> 
> The page_check_address() function is called from the
> page_mkclean_one() routine as follows:

.. and the page_mkclean_one() function is totally different.

Lookie here, this is the correct and complex sequence:

> entry = ptep_clear_flush(vma, address, pte);
> entry = pte_wrprotect(entry);
> entry = pte_mkclean(entry);
> set_pte_at(mm, address, pte, entry);

That's a rather expensive sequence, but it's done exactly because it has 
to be done that way. What it does is to

 - *atomically* load the pte entry _and_ clear the old one in memory.

   That's the

entry = ptep_clear_flush(vma, address, pte);

   thing, and it basically means that it's doing some 
   architecture-specific magic to make sure that another CPU that accesses 
   the PTE at the same time will never actually modify the pte (because 
   it's clear and not valid)

 - it then - while the page table is actually clear and invalid - takes 
   the old value and turns it into the new one:

entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);

 - and finally, it replaces the entry with the new one:

set_pte_at(mm, address, pte, entry);

   which takes care to write the new entry in some specific way that is 
   atomic wrt other CPU's (ie on 32-bit x86 with a 64-bit page table 
   entry it writes the high word first, see the write barriers in 
   "native_set_pte()" in include/asm-x86/pgtable-3level.h)

Now, compare that subtle and correct thing with what is *not* correct:

if (pte_dirty(*pte) && pte_write(*pte))
*pte = pte_wrprotect(*pte);

which makes no effort at all to make sure that it's safe in case another 
CPU updates the accessed bit.

Now, arguably it's unlikely to cause horrible problems at least on x86, 
because:

 - we only do this if the pte is already marked dirty, so while we can 
   lose the accessed bit, we can *not* lose the dirty bit. And the 
   accessed bit isn't such a big deal.

 - it's not doing any of the "be careful about" ordering things, but since 
   the really important bits aren't changing, ordering probably won't 
   practically matter.

But the problem is that we have something like 24 different architectures, 
it's hard to make sure that none of them have issues. 

In other words: it may well work in practice. But when these things go 
subtly wrong, they are *really* nasty to find, and the unsafe sequence is 
really not how it's supposed to be done. For example, you don't even flush 
the TLB, so even if there are no cross-CPU issues, there's probably going 
to be writable entries in the TLB that now don't match the page tables.

Will it matter? Again, probably impossible to see in practice. But ...

Linus


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/19, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
> >
> > Before using pte_wrprotect() the vma_wrprotect() routine uses the
> > pte_offset_map_lock() macro to get the PTE and to acquire the ptl
> > spinlock. Why did you say that this code was not SMP-safe? It should
> > be atomic, I think.
>
> It's atomic WITH RESPECT TO OTHER PEOPLE WHO GET THE LOCK.
>
> Guess how much another x86 CPU cares when it sets the accessed bit in
> hardware?

Thank you very much for taking part in this discussion. Personally,
it's very important to me.  But I'm not sure that I understand which
bit can be lost.

Please let me explain.

The logic for my vma_wrprotect() routine was taken from the
page_check_address() function in mm/rmap.c. Here is a code snippet of
the latter function:

        pgd = pgd_offset(mm, address);
        if (!pgd_present(*pgd))
                return NULL;

        pud = pud_offset(pgd, address);
        if (!pud_present(*pud))
                return NULL;

        pmd = pmd_offset(pud, address);
        if (!pmd_present(*pmd))
                return NULL;

        pte = pte_offset_map(pmd, address);
        /* Make a quick check before getting the lock */
        if (!pte_present(*pte)) {
                pte_unmap(pte);
                return NULL;
        }

        ptl = pte_lockptr(mm, pmd);
        spin_lock(ptl);
        if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
                *ptlp = ptl;
                return pte;
        }
        pte_unmap_unlock(pte, ptl);

The page_check_address() function is called from the
page_mkclean_one() routine as follows:

        pte = page_check_address(page, mm, address, &ptl);
        if (!pte)
                goto out;

        if (pte_dirty(*pte) || pte_write(*pte)) {
                pte_t entry;

                flush_cache_page(vma, address, pte_pfn(*pte));
                entry = ptep_clear_flush(vma, address, pte);
                entry = pte_wrprotect(entry);
                entry = pte_mkclean(entry);
                set_pte_at(mm, address, pte, entry);
                ret = 1;
        }

        pte_unmap_unlock(pte, ptl);

The write-protection of the PTE is done using the pte_wrprotect()
entity. I intended to do the same during msync() with MS_ASYNC. I
understand that I'm taking a risk of looking a complete idiot now,
however I don't see any difference between the two situations.

I presumed that the code in mm/rmap.c was absolutely correct, that's
why I basically reused the design.

>
> > The POSIX standard requires the ctime and mtime stamps to be updated
> > not later than at the second call to msync() with the MS_ASYNC flag.
>
> .. and that is no excuse for bad code.
>
> Linus
>


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Linus Torvalds


On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
> 
> Before using pte_wrprotect() the vma_wrprotect() routine uses the
> pte_offset_map_lock() macro to get the PTE and to acquire the ptl
> spinlock. Why did you say that this code was not SMP-safe? It should
> be atomic, I think.

It's atomic WITH RESPECT TO OTHER PEOPLE WHO GET THE LOCK.

Guess how much another x86 CPU cares when it sets the accessed bit in 
hardware?

> The POSIX standard requires the ctime and mtime stamps to be updated
> no later than the second call to msync() with the MS_ASYNC flag.

.. and that is no excuse for bad code.

Linus


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Fri, 18 Jan 2008, Anton Salikhmetov wrote:
> >
> > The current solution doesn't hurt performance at all when compared to
> > competing POSIX-compliant systems. It is faster and does even more
> > than the POSIX standard requires.
>
> Your current patches have two problems:
>  - they are simply unnecessarily invasive for a relatively simple issue
>  - all versions I've looked at closer are buggy too
>
> Example:
>
> +   if (pte_dirty(*pte) && pte_write(*pte))
> +   *pte = pte_wrprotect(*pte);
>
> Uhhuh. Looks simple enough. Except it does a non-atomic pte access while
> other CPU's may be accessing it and updating it from their hw page table
> walkers. What will happen? Who knows? I can see lost access bits at a
> minimum.
>
> IOW, this isn't simple code. It's code that it is simple to screw up. In
> this case, you really need to use ptep_set_wrprotect(), for example.

Before using pte_wrprotect() the vma_wrprotect() routine uses the
pte_offset_map_lock() macro to get the PTE and to acquire the ptl
spinlock. Why did you say that this code was not SMP-safe? It should
be atomic, I think.


>
> So why not do it in many fewer lines with that simpler vma->dirty flag?

Neither the dirty flag you suggest nor the AS_MCTIME flag I introduced
in my previous solutions solves the following problem:

- mmap()
- a write reference
- msync() with MS_ASYNC
- a write reference
- msync() with MS_ASYNC

The POSIX standard requires the ctime and mtime stamps to be updated
not later than at the second call to msync() with the MS_ASYNC flag.

Some other POSIX-compliant operating systems, such as HP-UX and FreeBSD,
satisfy this POSIX requirement. Linux does not.

>
> Linus
>


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Linus Torvalds


On Fri, 18 Jan 2008, Anton Salikhmetov wrote:
> 
> The current solution doesn't hurt performance at all when compared to
> competing POSIX-compliant systems. It is faster and does even more
> than the POSIX standard requires.

Your current patches have two problems:
 - they are simply unnecessarily invasive for a relatively simple issue
 - all versions I've looked at closer are buggy too

Example:

+   if (pte_dirty(*pte) && pte_write(*pte))
+   *pte = pte_wrprotect(*pte);

Uhhuh. Looks simple enough. Except it does a non-atomic pte access while 
other CPU's may be accessing it and updating it from their hw page table 
walkers. What will happen? Who knows? I can see lost access bits at a 
minimum.

IOW, this isn't simple code. It's code that it is simple to screw up. In 
this case, you really need to use ptep_set_wrprotect(), for example.

So why not do it in many fewer lines with that simpler vma->dirty flag?

Linus


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Fri, 18 Jan 2008, Miklos Szeredi wrote:
> >
> > What I'm saying is that the times could be left un-updated for a long
> > time if the program doesn't do munmap() or msync(MS_SYNC) for a long time.
>
> Sure.
>
> But in those circumstances, the programmer cannot depend on the mtime
> *anyway* (because there is no synchronization), so what's the downside?
>
> Let's face it, there's exactly three possible solutions:
>
>  - the insane one: trap EVERY SINGLE instruction that does a write to the
>page, and update mtime each and every time.
>
>This one is so obviously STUPID that it's not even worth discussing
>further, except to say that "yes, there is an 'exact' algorithm, but
>no, we are never EVER going to use it".
>
>  - the non-exact solutions that don't give you mtime updates every time
>a write to the page happens, but give *some* guarantees for things that
>will update it.
>
>This is the one I think we can do, and the only things a programmer can
>impact using it are "msync()" and "munmap()", since no other operations
>really have anything to do with it in a programmer-visible way (ie a
>normal "sync" operation may happen in the background and has no
>program-relevant timing information)
>
>Other things *may* or may not update mtime (some filesystems - take
>most networked one as an example - will *always* update mtime on the
>server on writeback, so we cannot ever guarantee that nothing but
>msync/munmap does so), but at least we'll have a minimum set of things
>that people can depend on.
>
>  - the "we don't care at all" solutions.
>
>mmap(MAP_WRITE) doesn't really update times reliably after the write
>has happened (but might do it *before* - maybe the mmap() itself does).
>
> Those are the three choices, I think. We currently approximate #3. We
> *can* do #2 (and there are various flavors of it). And even *aiming* for
> #1 is totally insane and stupid.

The current solution doesn't hurt performance at all when compared to
competing POSIX-compliant systems. It is faster and does even more
than the POSIX standard requires.

Please see the test results I've sent into the thread "-v6 0/2":

http://lkml.org/lkml/2008/1/18/447

I think the current solution is ready to use.

>
> Linus
>


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Linus Torvalds


On Fri, 18 Jan 2008, Miklos Szeredi wrote:
> 
> What I'm saying is that the times could be left un-updated for a long
> time if the program doesn't do munmap() or msync(MS_SYNC) for a long time.

Sure.

But in those circumstances, the programmer cannot depend on the mtime 
*anyway* (because there is no synchronization), so what's the downside?

Let's face it, there's exactly three possible solutions:

 - the insane one: trap EVERY SINGLE instruction that does a write to the 
   page, and update mtime each and every time.

   This one is so obviously STUPID that it's not even worth discussing 
   further, except to say that "yes, there is an 'exact' algorithm, but 
   no, we are never EVER going to use it".

 - the non-exact solutions that don't give you mtime updates every time 
   a write to the page happens, but give *some* guarantees for things that 
   will update it.

   This is the one I think we can do, and the only things a programmer can 
   impact using it are "msync()" and "munmap()", since no other operations 
   really have anything to do with it in a programmer-visible way (ie a 
   normal "sync" operation may happen in the background and has no 
   program-relevant timing information)

   Other things *may* or may not update mtime (some filesystems - take
   most networked ones as an example - will *always* update mtime on the 
   server on writeback, so we cannot ever guarantee that nothing but 
   msync/munmap does so), but at least we'll have a minimum set of things 
   that people can depend on.

 - the "we don't care at all" solutions.

   mmap(MAP_WRITE) doesn't really update times reliably after the write 
   has happened (but might do it *before* - maybe the mmap() itself does).

Those are the three choices, I think. We currently approximate #3. We 
*can* do #2 (and there are various flavors of it). And even *aiming* for 
#1 is totally insane and stupid.

Linus


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Miklos Szeredi
> > 
> > But then background writeout, sync(2), etc, wouldn't update the times.
> 
> Sure it would, but only when doing the final unmap.
> 
> Did you miss the "on unmap and msync" part?

No :)

What I'm saying is that the times could be left un-updated for a long
time if the program doesn't do munmap() or msync(MS_SYNC) for a long time.

If program has this pattern:

mmap()
write to map
msync(MS_ASYNC)
sleep(long)
write to map
msync(MS_ASYNC)
sleep(long)
...

Then we'd never see time updates (until the program exits, but that
could be years).

Maybe this doesn't matter; I'm just saying this is a disadvantage
compared to the "update on first dirtying" approach, which would
ensure that times are updated at least once per 30s.

Miklos


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Linus Torvalds


On Fri, 18 Jan 2008, Miklos Szeredi wrote:
> 
> But then background writeout, sync(2), etc, wouldn't update the times.

Sure it would, but only when doing the final unmap.

Did you miss the "on unmap and msync" part?

Linus


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Miklos Szeredi
> > That would need a new page flag (PG_mmap_dirty?).  Do we have one
> > available?
> 
> Yeah, that would be bad. We probably have flags free, but those page flags 
> are always a pain. Scratch that.
> 
> How about just setting a per-vma dirty flag, and then instead of updating 
> the mtime when taking the dirty-page fault, we just set that flag?
> 
> Then, on unmap and msync, we just do
> 
>   if (vma->dirty_flag) {
>   vma->dirty_flag = 0;
>   update_file_times(vma->vm_file);
>   }
> 
> and be done with it? 

But then background writeout, sync(2), etc, wouldn't update the times.
Dunno.  I don't think actual _physical_ writeout matters much, so it's
not worse to be 30s early with the timestamp, than to be 30s or more
late.

Miklos


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Miklos Szeredi
> > > And even in that four-liner, I suspect that the *last* two lines are 
> > > actually incorrect: there's no point in updating the file time when the 
> > > page *becomes* dirty,
> > 
> > Actually all four lines do that.  The first two for a write access on
> > a present, read-only pte, the other two for a write on a non-present
> > pte.
> > 
> > > we should update the file time when it is marked 
> > > clean, and "msync(MS_SYNC)" should update it as part of *that*.
> > 
> > That would need a new page flag (PG_mmap_dirty?).  Do we have one
> > available?
> 
> I thought the page writing stuff looked at (and cleared) the pte
> dirty bit, too?

Yeah, it does.  Hmm...

What happens on munmap?  The times _could_ get updated from there as
well, but it's getting complicated.

Miklos



Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Linus Torvalds


On Fri, 18 Jan 2008, Miklos Szeredi wrote:
> 
> That would need a new page flag (PG_mmap_dirty?).  Do we have one
> available?

Yeah, that would be bad. We probably have flags free, but those page flags 
are always a pain. Scratch that.

How about just setting a per-vma dirty flag, and then instead of updating 
the mtime when taking the dirty-page fault, we just set that flag?

Then, on unmap and msync, we just do

if (vma->dirty_flag) {
vma->dirty_flag = 0;
update_file_times(vma->vm_file);
}

and be done with it? 


Linus


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Rik van Riel
On Fri, 18 Jan 2008 19:11:47 +0100
Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > And even in that four-liner, I suspect that the *last* two lines are 
> > actually incorrect: there's no point in updating the file time when the 
> > page *becomes* dirty,
> 
> Actually all four lines do that.  The first two for a write access on
> a present, read-only pte, the other two for a write on a non-present
> pte.
> 
> > we should update the file time when it is marked 
> > clean, and "msync(MS_SYNC)" should update it as part of *that*.
> 
> That would need a new page flag (PG_mmap_dirty?).  Do we have one
> available?

I thought the page writing stuff looked at (and cleared) the pte
dirty bit, too?

-- 
All rights reversed.


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Miklos Szeredi
> And even in that four-liner, I suspect that the *last* two lines are 
> actually incorrect: there's no point in updating the file time when the 
> page *becomes* dirty,

Actually all four lines do that.  The first two for a write access on
a present, read-only pte, the other two for a write on a non-present
pte.

> we should update the file time when it is marked 
> clean, and "msync(MS_SYNC)" should update it as part of *that*.

That would need a new page flag (PG_mmap_dirty?).  Do we have one
available?

Miklos


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Linus Torvalds


On Fri, 18 Jan 2008, Peter Zijlstra wrote:
> 
> Bah, and will break on s390... so we'd need a page_mkclean() variant
> that doesn't actually clear dirty.

No, we simply want to not play all these very expensive games with dirty 
in the first place.

Guys, mmap access times aren't important enough for this. It's not 
specified closely enough, and people don't care enough.

Of the patches around so far, the best one by far seems to be the simple 
four-liner from Miklos.

And even in that four-liner, I suspect that the *last* two lines are 
actually incorrect: there's no point in updating the file time when the 
page *becomes* dirty, we should update the file time when it is marked 
clean, and "msync(MS_SYNC)" should update it as part of *that*.

So I think the file time update should be part of just the page writeout 
logic, not by msync() or page faulting itself or anything like that.

Linus


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Miklos Szeredi
> Possibly, I didn't see a quick way to break that iteration.
> From a quick glance at prio_tree.c the iterator isn't valid anymore
> after releasing i_mmap_lock. Fixing that would be,.. 'fun'.

Maybe i_mmap_lock isn't needed at all, since msync holds mmap_sem,
which protects the prio tree as well, no?

> I also realized I forgot to copy/paste the prio_tree_iter declaration
> and ought to make all these functions static.
> 
> But for a quick draft it conveys the idea pretty well, I guess :-)

Yes :)

There could also be nasty performance corner cases, like having a huge
file mapped thousands of times, and having only a couple of pages
dirtied between MS_ASYNC invocations.  Then most of that page table
walking would be just unnecessary overhead.

There's something to be said for walking only the dirty pages, and
doing page_mkclean on them, even if in some cases that would be
slower.

But I have a strong feeling of deja vu, and last time it ended with
Andrew not liking the whole thing...

Miklos


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Peter Zijlstra

On Fri, 2008-01-18 at 12:17 +0100, Miklos Szeredi wrote:
> > diff --git a/mm/msync.c b/mm/msync.c
> > index 144a757..a1b3fc6 100644
> > --- a/mm/msync.c
> > +++ b/mm/msync.c
> > @@ -14,6 +14,122 @@
> >  #include 
> >  #include 
> >  
> > +unsigned long masync_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > +   unsigned long addr, unsigned long end)
> > +{
> > +   pte_t *pte;
> > +   spinlock_t *ptl;
> > +
> > +   pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +   arch_enter_lazy_mmu_mode();
> > +   do {
> > +   pte_t ptent = *pte;
> > +
> > +   if (pte_none(ptent))
> > +   continue;
> > +
> > +   if (!pte_present(ptent))
> > +   continue;
> > +
> > +   if (pte_dirty(ptent) && pte_write(ptent)) {
> > +   flush_cache_page(vma, addr, pte_pfn(ptent));
> 
> Hmm, I'm not sure flush_cache_page() is needed.  Or does dirty
> data in the cache somehow interfere with the page protection?

No, just being paranoid..

> > +   ptent = ptep_clear_flush(vma, addr, pte);
> > +   ptent = pte_wrprotect(ptent);
> > +   set_pte_at(vma->vm_mm, addr, pte, ptent);
> > +   }
> > +   } while (pte++, addr += PAGE_SIZE, addr != end);
> > +   arch_leave_lazy_mmu_mode();
> > +   pte_unmap_unlock(pte - 1, ptl);
> > +
> > +   return addr;
> > +}
> > +
> > +unsigned long masync_pmd_range(struct vm_area_struct *vma, pud_t *pud,
> > +   unsigned long addr, unsigned long end)
> > +{
> > +   pmd_t *pmd;
> > +   unsigned long next;
> > +
> > +   pmd = pmd_offset(pud, addr);
> > +   do {
> > +   next = pmd_addr_end(addr, end);
> > +   if (pmd_none_or_clear_bad(pmd))
> > +   continue;
> > +   next = masync_pte_range(vma, pmd, addr, next);
> > +   } while (pmd++, addr = next, addr != end);
> > +
> > +   return addr;
> > +}
> > +
> > +unsigned long masync_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
> > +   unsigned long addr, unsigned long end)
> > +{
> > +   pud_t *pud;
> > +   unsigned long next;
> > +
> > +   pud = pud_offset(pgd, addr);
> > +   do {
> > +   next = pud_addr_end(addr, end);
> > +   if (pud_none_or_clear_bad(pud))
> > +   continue;
> > +   next = masync_pmd_range(vma, pud, addr, next);
> > +   } while (pud++, addr = next, addr != end);
> > +
> > +   return addr;
> > +}
> > +
> > +unsigned long masync_pgd_range(struct vm_area_struct *vma,
> > +   unsigned long addr, unsigned long end)
> > +{
> > +   pgd_t *pgd;
> > +   unsigned long next;
> > +
> > +   pgd = pgd_offset(vma->vm_mm, addr);
> > +   do {
> > +   next = pgd_addr_end(addr, end);
> > +   if (pgd_none_or_clear_bad(pgd))
> > +   continue;
> > +   next = masync_pud_range(vma, pgd, addr, next);
> > +   } while (pgd++, addr = next, addr != end);
> > +
> > +   return addr;
> > +}
> > +
> > +int masync_vma_one(struct vm_area_struct *vma,
> > +   unsigned long start, unsigned long end)
> > +{
> > +   if (start < vma->vm_start)
> > +   start = vma->vm_start;
> > +
> > +   if (end > vma->vm_end)
> > +   end = vma->vm_end;
> > +
> > +   masync_pgd_range(vma, start, end);
> > +
> > +   return 0;
> > +}
> > +
> > +int masync_vma(struct vm_area_struct *vma, 
> > +   unsigned long start, unsigned long end)
> > +{
> > +   struct address_space *mapping;
> > +   struct vm_area_struct *vma_iter;
> > +
> > +   if (!(vma->vm_flags & VM_SHARED))
> > +   return 0;
> > +
> > +   mapping = vma->vm_file->f_mapping;
> > +
> > +   if (!mapping_cap_account_dirty(mapping))
> > +   return 0;
> > +
> > +   spin_lock(&mapping->i_mmap_lock);
> > +   vma_prio_tree_foreach(vma_iter, &iter, &mapping->i_mmap, start, end)
> > +   masync_vma_one(vma_iter, start, end);
> > +   spin_unlock(&mapping->i_mmap_lock);
> 
This is holding i_mmap_lock for possibly quite long.  Isn't that going
> to cause problems?

Possibly, I didn't see a quick way to break that iteration.
From a quick glance at prio_tree.c the iterator isn't valid anymore
after releasing i_mmap_lock. Fixing that would be,.. 'fun'.

I also realized I forgot to copy/paste the prio_tree_iter declaration
and ought to make all these functions static.

But for a quick draft it conveys the idea pretty well, I guess :-)



Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Miklos Szeredi
> diff --git a/mm/msync.c b/mm/msync.c
> index 144a757..a1b3fc6 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -14,6 +14,122 @@
>  #include 
>  #include 
>  
> +unsigned long masync_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> + unsigned long addr, unsigned long end)
> +{
> + pte_t *pte;
> + spinlock_t *ptl;
> +
> + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> + arch_enter_lazy_mmu_mode();
> + do {
> + pte_t ptent = *pte;
> +
> + if (pte_none(ptent))
> + continue;
> +
> + if (!pte_present(ptent))
> + continue;
> +
> + if (pte_dirty(ptent) && pte_write(ptent)) {
> + flush_cache_page(vma, addr, pte_pfn(ptent));

Hmm, I'm not sure flush_cache_page() is needed.  Or does dirty
data in the cache somehow interfere with the page protection?

> + ptent = ptep_clear_flush(vma, addr, pte);
> + ptent = pte_wrprotect(ptent);
> + set_pte_at(vma->vm_mm, addr, pte, ptent);
> + }
> + } while (pte++, addr += PAGE_SIZE, addr != end);
> + arch_leave_lazy_mmu_mode();
> + pte_unmap_unlock(pte - 1, ptl);
> +
> + return addr;
> +}
> +
> +unsigned long masync_pmd_range(struct vm_area_struct *vma, pud_t *pud,
> + unsigned long addr, unsigned long end)
> +{
> + pmd_t *pmd;
> + unsigned long next;
> +
> + pmd = pmd_offset(pud, addr);
> + do {
> + next = pmd_addr_end(addr, end);
> + if (pmd_none_or_clear_bad(pmd))
> + continue;
> + next = masync_pte_range(vma, pmd, addr, next);
> + } while (pmd++, addr = next, addr != end);
> +
> + return addr;
> +}
> +
> +unsigned long masync_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
> + unsigned long addr, unsigned long end)
> +{
> + pud_t *pud;
> + unsigned long next;
> +
> + pud = pud_offset(pgd, addr);
> + do {
> + next = pud_addr_end(addr, end);
> + if (pud_none_or_clear_bad(pud))
> + continue;
> + next = masync_pmd_range(vma, pud, addr, next);
> + } while (pud++, addr = next, addr != end);
> +
> + return addr;
> +}
> +
> +unsigned long masync_pgd_range(struct vm_area_struct *vma,
> + unsigned long addr, unsigned long end)
> +{
> + pgd_t *pgd;
> + unsigned long next;
> +
> + pgd = pgd_offset(vma->vm_mm, addr);
> + do {
> + next = pgd_addr_end(addr, end);
> + if (pgd_none_or_clear_bad(pgd))
> + continue;
> + next = masync_pud_range(vma, pgd, addr, next);
> + } while (pgd++, addr = next, addr != end);
> +
> + return addr;
> +}
> +
> +int masync_vma_one(struct vm_area_struct *vma,
> + unsigned long start, unsigned long end)
> +{
> + if (start < vma->vm_start)
> + start = vma->vm_start;
> +
> + if (end > vma->vm_end)
> + end = vma->vm_end;
> +
> + masync_pgd_range(vma, start, end);
> +
> + return 0;
> +}
> +
> +int masync_vma(struct vm_area_struct *vma, 
> + unsigned long start, unsigned long end)
> +{
> + struct address_space *mapping;
> + struct vm_area_struct *vma_iter;
> +
> + if (!(vma->vm_flags & VM_SHARED))
> + return 0;
> +
> + mapping = vma->vm_file->f_mapping;
> +
> + if (!mapping_cap_account_dirty(mapping))
> + return 0;
> +
> + spin_lock(&mapping->i_mmap_lock);
> + vma_prio_tree_foreach(vma_iter, &iter, &mapping->i_mmap, start, end)
> + masync_vma_one(vma_iter, start, end);
> + spin_unlock(&mapping->i_mmap_lock);

This is holding i_mmap_lock for possibly quite long.  Isn't that going
to cause problems?

Miklos

> +
> + return 0;
> +}
> +
>  /*
>   * MS_SYNC syncs the entire file - including mappings.
>   *
> 
> 
> 
> 


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Peter Zijlstra

On Fri, 2008-01-18 at 11:38 +0100, Miklos Szeredi wrote:
> > On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
> > 
> > > > diff --git a/mm/msync.c b/mm/msync.c
> > > > index a4de868..a49af28 100644
> > > > --- a/mm/msync.c
> > > > +++ b/mm/msync.c
> > > > @@ -13,11 +13,33 @@
> > > >  #include 
> > > >  
> > > >  /*
> > > > + * Scan the PTEs for pages belonging to the VMA and mark them 
> > > > read-only.
> > > > + * It will force a pagefault on the next write access.
> > > > + */
> > > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > > +{
> > > > +   unsigned long addr;
> > > > +
> > > > +   for (addr = vma->vm_start; addr < vma->vm_end; addr += 
> > > > PAGE_SIZE) {
> > > > +   spinlock_t *ptl;
> > > > +   pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > > +   pud_t *pud = pud_offset(pgd, addr);
> > > > +   pmd_t *pmd = pmd_offset(pud, addr);
> > > > +   pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr,
> > > > +   &ptl);
> > > > +
> > > > +   if (pte_dirty(*pte) && pte_write(*pte))
> > > > +   *pte = pte_wrprotect(*pte);
> > > > +   pte_unmap_unlock(pte, ptl);
> > > > +   }
> > > > +}
> > > 
> > > What about ram based filesystems?  They don't start out with read-only
> > > pte's, so I think they don't want them read-protected now either.
> > > Unless this is essential for correct mtime/ctime accounting on these
> > > filesystems (I don't think it really is).  But then the mapping should
> > > start out read-only as well, otherwise the time update will only work
> > > after an msync(MS_ASYNC).
> > 
> > page_mkclean() has all the needed logic for this, it also walks the rmap
> > and cleans out all other users, which I think is needed too for
> > consistencies sake:
> > 
> > Process A   Process B
> > 
> > mmap(foo.txt)   mmap(foo.txt)
> > 
> > dirty page
> > dirty page
> > 
> > msync(MS_ASYNC)
> > 
> > dirty page
> > 
> > msync(MS_ASYNC) <--- now what?!

how about:

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..a1b3fc6 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -14,6 +14,122 @@
 #include 
 #include 
 
+unsigned long masync_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
+   unsigned long addr, unsigned long end)
+{
+   pte_t *pte;
+   spinlock_t *ptl;
+
+   pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+   arch_enter_lazy_mmu_mode();
+   do {
+   pte_t ptent = *pte;
+
+   if (pte_none(ptent))
+   continue;
+
+   if (!pte_present(ptent))
+   continue;
+
+   if (pte_dirty(ptent) && pte_write(ptent)) {
+   flush_cache_page(vma, addr, pte_pfn(ptent));
+   ptent = ptep_clear_flush(vma, addr, pte);
+   ptent = pte_wrprotect(ptent);
+   set_pte_at(vma->vm_mm, addr, pte, ptent);
+   }
+   } while (pte++, addr += PAGE_SIZE, addr != end);
+   arch_leave_lazy_mmu_mode();
+   pte_unmap_unlock(pte - 1, ptl);
+
+   return addr;
+}
+
+unsigned long masync_pmd_range(struct vm_area_struct *vma, pud_t *pud,
+   unsigned long addr, unsigned long end)
+{
+   pmd_t *pmd;
+   unsigned long next;
+
+   pmd = pmd_offset(pud, addr);
+   do {
+   next = pmd_addr_end(addr, end);
+   if (pmd_none_or_clear_bad(pmd))
+   continue;
+   next = masync_pte_range(vma, pmd, addr, next);
+   } while (pmd++, addr = next, addr != end);
+
+   return addr;
+}
+
+unsigned long masync_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
+   unsigned long addr, unsigned long end)
+{
+   pud_t *pud;
+   unsigned long next;
+
+   pud = pud_offset(pgd, addr);
+   do {
+   next = pud_addr_end(addr, end);
+   if (pud_none_or_clear_bad(pud))
+   continue;
+   next = masync_pmd_range(vma, pud, addr, next);
+   } while (pud++, addr = next, addr != end);
+
+   return addr;
+}
+
+unsigned long masync_pgd_range(struct vm_area_struct *vma,
+   unsigned long addr, unsigned long end)
+{
+   pgd_t *pgd;
+   unsigned long next;
+
+   pgd = pgd_offset(vma->vm_mm, addr);
+   do {
+   next = pgd_addr_end(addr, end);
+   if (pgd_none_or_clear_bad(pgd))
+   continue;
+   next = masync_pud_range(vma, pgd, addr, next);
+   } while (pgd++, addr = next, addr != end);
+
+   return addr;
+}
+
+int masync_vma_one(struct vm_area_struct *vma,
+   unsigned long start, unsigned long end)
+{
+   if (start < vma->vm_start)
+   start = vma->vm_start;
+
+   if (end > vma->vm_end)
+   end = vma->vm_end;
+
+   masync_pgd_range(vma, start, end);
+
+   return 0;
+}
+
+int 

Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Peter Zijlstra <[EMAIL PROTECTED]>:
>
> On Fri, 2008-01-18 at 11:15 +0100, Peter Zijlstra wrote:
> > On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
> >
> > > > diff --git a/mm/msync.c b/mm/msync.c
> > > > index a4de868..a49af28 100644
> > > > --- a/mm/msync.c
> > > > +++ b/mm/msync.c
> > > > @@ -13,11 +13,33 @@
> > > >  #include 
> > > >
> > > >  /*
> > > > + * Scan the PTEs for pages belonging to the VMA and mark them 
> > > > read-only.
> > > > + * It will force a pagefault on the next write access.
> > > > + */
> > > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > > +{
> > > > + unsigned long addr;
> > > > +
> > > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > > > + spinlock_t *ptl;
> > > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > > + pud_t *pud = pud_offset(pgd, addr);
> > > > + pmd_t *pmd = pmd_offset(pud, addr);
> > > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > +
> > > > + if (pte_dirty(*pte) && pte_write(*pte))
> > > > + *pte = pte_wrprotect(*pte);
> > > > + pte_unmap_unlock(pte, ptl);
> > > > + }
> > > > +}
> > >
> > > What about ram based filesystems?  They don't start out with read-only
> > > pte's, so I think they don't want them read-protected now either.
> > > Unless this is essential for correct mtime/ctime accounting on these
> > > filesystems (I don't think it really is).  But then the mapping should
> > > start out read-only as well, otherwise the time update will only work
> > > after an msync(MS_ASYNC).
> >
> > page_mkclean() has all the needed logic for this, it also walks the rmap
> > and cleans out all other users, which I think is needed too for
> > consistencies sake:
> >
> > Process A Process B
> >
> > mmap(foo.txt) mmap(foo.txt)
> >
> > dirty page
> >   dirty page
> >
> > msync(MS_ASYNC)
> >
> >   dirty page
> >
> > msync(MS_ASYNC) <--- now what?!
> >
> >
> > So what I would suggest is using the page table walkers from mm, and
> > walks the page range, obtain the page using vm_normal_page() and call
> > page_mkclean(). (Oh, and ensure you don't nest the pte lock :-)
> >
> > All in all, that sounds rather expensive..
>
> Bah, and will break on s390... so we'd need a page_mkclean() variant
> that doesn't actually clear dirty.

So the current version of the functional changes patch takes this into account.

>
>


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Miklos Szeredi
> On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
> 
> > > diff --git a/mm/msync.c b/mm/msync.c
> > > index a4de868..a49af28 100644
> > > --- a/mm/msync.c
> > > +++ b/mm/msync.c
> > > @@ -13,11 +13,33 @@
> > >  #include 
> > >  
> > >  /*
> > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > > + * It will force a pagefault on the next write access.
> > > + */
> > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > +{
> > > + unsigned long addr;
> > > +
> > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > > + spinlock_t *ptl;
> > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > + pud_t *pud = pud_offset(pgd, addr);
> > > + pmd_t *pmd = pmd_offset(pud, addr);
> > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > +
> > > + if (pte_dirty(*pte) && pte_write(*pte))
> > > + *pte = pte_wrprotect(*pte);
> > > + pte_unmap_unlock(pte, ptl);
> > > + }
> > > +}
> > 
> > What about ram based filesystems?  They don't start out with read-only
> > pte's, so I think they don't want them read-protected now either.
> > Unless this is essential for correct mtime/ctime accounting on these
> > filesystems (I don't think it really is).  But then the mapping should
> > start out read-only as well, otherwise the time update will only work
> > after an msync(MS_ASYNC).
> 
> page_mkclean() has all the needed logic for this, it also walks the rmap
> and cleans out all other users, which I think is needed too for
> consistencies sake:
> 
> Process A			Process B
> 
> mmap(foo.txt)			mmap(foo.txt)
> 
> dirty page
> 				dirty page
> 
> msync(MS_ASYNC)
> 
> 				dirty page
> 
> msync(MS_ASYNC) <--- now what?!

Nothing.  I think it's perfectly acceptable behavior, if msync in
process A doesn't care about any dirtying in process B.

> All in all, that sounds rather expensive..

Right.  The advantage of Anton's current approach is that it's at
least simple, and possibly not so expensive, while providing some
quite sane semantics for MS_ASYNC vs. mtime updates.

Miklos


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Peter Zijlstra

On Fri, 2008-01-18 at 11:15 +0100, Peter Zijlstra wrote:
> On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
> 
> > > diff --git a/mm/msync.c b/mm/msync.c
> > > index a4de868..a49af28 100644
> > > --- a/mm/msync.c
> > > +++ b/mm/msync.c
> > > @@ -13,11 +13,33 @@
> > >  #include <linux/syscalls.h>
> > >  
> > >  /*
> > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > > + * It will force a pagefault on the next write access.
> > > + */
> > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > +{
> > > + unsigned long addr;
> > > +
> > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > > + spinlock_t *ptl;
> > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > + pud_t *pud = pud_offset(pgd, addr);
> > > + pmd_t *pmd = pmd_offset(pud, addr);
> > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > +
> > > + if (pte_dirty(*pte) && pte_write(*pte))
> > > + *pte = pte_wrprotect(*pte);
> > > + pte_unmap_unlock(pte, ptl);
> > > + }
> > > +}
> > 
> > What about ram based filesystems?  They don't start out with read-only
> > pte's, so I think they don't want them read-protected now either.
> > Unless this is essential for correct mtime/ctime accounting on these
> > filesystems (I don't think it really is).  But then the mapping should
> > start out read-only as well, otherwise the time update will only work
> > after an msync(MS_ASYNC).
> 
> page_mkclean() has all the needed logic for this, it also walks the rmap
> and cleans out all other users, which I think is needed too for
> consistencies sake:
> 
> Process A			Process B
> 
> mmap(foo.txt)			mmap(foo.txt)
> 
> dirty page
> 				dirty page
> 
> msync(MS_ASYNC)
> 
> 				dirty page
> 
> msync(MS_ASYNC) <--- now what?!
> 
> 
> So what I would suggest is using the page table walkers from mm, and
> walks the page range, obtain the page using vm_normal_page() and call
> page_mkclean(). (Oh, and ensure you don't nest the pte lock :-)
> 
> All in all, that sounds rather expensive..

Bah, and will break on s390... so we'd need a page_mkclean() variant
that doesn't actually clear dirty.



Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Peter Zijlstra

On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:

> > diff --git a/mm/msync.c b/mm/msync.c
> > index a4de868..a49af28 100644
> > --- a/mm/msync.c
> > +++ b/mm/msync.c
> > @@ -13,11 +13,33 @@
> >  #include <linux/syscalls.h>
> >  
> >  /*
> > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > + * It will force a pagefault on the next write access.
> > + */
> > +static void vma_wrprotect(struct vm_area_struct *vma)
> > +{
> > +   unsigned long addr;
> > +
> > +   for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > +   spinlock_t *ptl;
> > +   pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > +   pud_t *pud = pud_offset(pgd, addr);
> > +   pmd_t *pmd = pmd_offset(pud, addr);
> > +   pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > +
> > +   if (pte_dirty(*pte) && pte_write(*pte))
> > +   *pte = pte_wrprotect(*pte);
> > +   pte_unmap_unlock(pte, ptl);
> > +   }
> > +}
> 
> What about ram based filesystems?  They don't start out with read-only
> pte's, so I think they don't want them read-protected now either.
> Unless this is essential for correct mtime/ctime accounting on these
> filesystems (I don't think it really is).  But then the mapping should
> start out read-only as well, otherwise the time update will only work
> after an msync(MS_ASYNC).

page_mkclean() has all the needed logic for this, it also walks the rmap
and cleans out all other users, which I think is needed too for
consistencies sake:

Process A   Process B

mmap(foo.txt)   mmap(foo.txt)

dirty page
dirty page

msync(MS_ASYNC)

dirty page

msync(MS_ASYNC) <--- now what?!


So what I would suggest is using the page table walkers from mm, and
walks the page range, obtain the page using vm_normal_page() and call
page_mkclean(). (Oh, and ensure you don't nest the pte lock :-)

All in all, that sounds rather expensive..
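[Editorial note: a pseudocode-level sketch of the walk Peter suggests, assuming the 2.6.24-era mm helpers (vm_normal_page(), page_mkclean(), pte_offset_map_lock()); it is not compile-tested and elides huge pages and error handling. It drops the pte lock before calling page_mkclean(), since page_mkclean() takes pte locks itself through the rmap.]

```c
/* Sketch only: clean (and thereby write-protect) every page mapped in
 * [addr, end) of the vma via page_mkclean(), which also walks the rmap
 * and wrprotects all other mappings of each page. */
static void msync_mkclean_range(struct vm_area_struct *vma,
				unsigned long addr, unsigned long end)
{
	for (; addr < end; addr += PAGE_SIZE) {
		spinlock_t *ptl;
		struct page *page;
		pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
		pud_t *pud;
		pmd_t *pmd;
		pte_t *pte;

		if (pgd_none_or_clear_bad(pgd))
			continue;
		pud = pud_offset(pgd, addr);
		if (pud_none_or_clear_bad(pud))
			continue;
		pmd = pmd_offset(pud, addr);
		if (pmd_none_or_clear_bad(pmd))
			continue;

		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
		page = vm_normal_page(vma, addr, *pte);
		/* drop the pte lock before page_mkclean(), which takes
		 * pte locks itself via the rmap -- no nesting */
		pte_unmap_unlock(pte, ptl);

		if (page)
			page_mkclean(page);
	}
}
```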





Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Miklos Szeredi
> Updating file times at write references to memory-mapped files and
> forcing file times update at the next write reference after
> calling the msync() system call with the MS_ASYNC flag.
> 
> Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
> ---
>  mm/memory.c |6 ++
>  mm/msync.c  |   52 +++-
>  2 files changed, 45 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 4bf0b6d..13d5bbf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1668,6 +1668,9 @@ gotten:
>  unlock:
>   pte_unmap_unlock(page_table, ptl);
>   if (dirty_page) {
> + if (vma->vm_file)
> + file_update_time(vma->vm_file);
> +
>   /*
>* Yes, Virginia, this is actually required to prevent a race
>* with clear_page_dirty_for_io() from clearing the page dirty
> @@ -2341,6 +2344,9 @@ out_unlocked:
>   if (anon)
>   page_cache_release(vmf.page);
>   else if (dirty_page) {
> + if (vma->vm_file)
> + file_update_time(vma->vm_file);
> +
>   set_page_dirty_balance(dirty_page, page_mkwrite);
>   put_page(dirty_page);
>   }
> diff --git a/mm/msync.c b/mm/msync.c
> index a4de868..a49af28 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -13,11 +13,33 @@
>  #include <linux/syscalls.h>
>  
>  /*
> + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> + * It will force a pagefault on the next write access.
> + */
> +static void vma_wrprotect(struct vm_area_struct *vma)
> +{
> + unsigned long addr;
> +
> + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> + spinlock_t *ptl;
> + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> + pud_t *pud = pud_offset(pgd, addr);
> + pmd_t *pmd = pmd_offset(pud, addr);
> + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +
> + if (pte_dirty(*pte) && pte_write(*pte))
> + *pte = pte_wrprotect(*pte);
> + pte_unmap_unlock(pte, ptl);
> + }
> +}

What about ram based filesystems?  They don't start out with read-only
pte's, so I think they don't want them read-protected now either.
Unless this is essential for correct mtime/ctime accounting on these
filesystems (I don't think it really is).  But then the mapping should
start out read-only as well, otherwise the time update will only work
after an msync(MS_ASYNC).

> +
> +/*
>   * MS_SYNC syncs the entire file - including mappings.
>   *
> - * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> - * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> - * Now it doesn't do anything, since dirty pages are properly tracked.
> + * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
> + * read-only by calling vma_wrprotect(). This is needed to catch the next
> + * write reference to the mapped region and update the file times
> + * accordingly.
>   *
>   * The application may now run fsync() to write out the dirty pages and
>   * wait on the writeout and check the result. Or the application may run
> @@ -77,16 +99,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
>   error = 0;
>   start = vma->vm_end;
>   file = vma->vm_file;
> - if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
> - get_file(file);
> - up_read(&mm->mmap_sem);
> - error = do_fsync(file, 0);
> - fput(file);
> - if (error || start >= end)
> - goto out;
> - down_read(&mm->mmap_sem);
> - vma = find_vma(mm, start);
> - continue;
> + if (file && (vma->vm_flags & VM_SHARED)) {
> + if (flags & MS_ASYNC)
> + vma_wrprotect(vma);
> + if (flags & MS_SYNC) {
> + get_file(file);
> + up_read(&mm->mmap_sem);
> + error = do_fsync(file, 0);
> + fput(file);
> + if (error || start >= end)
> + goto out;
> + down_read(&mm->mmap_sem);
> + vma = find_vma(mm, start);
> + continue;
> + }
>   }
>  
>   vma = vma->vm_next;
> -- 
> 1.4.4.4
> 
> 


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Peter Zijlstra

On Fri, 2008-01-18 at 12:17 +0100, Miklos Szeredi wrote:
  diff --git a/mm/msync.c b/mm/msync.c
  index 144a757..a1b3fc6 100644
  --- a/mm/msync.c
  +++ b/mm/msync.c
  @@ -14,6 +14,122 @@
   #include <linux/syscalls.h>
   #include <linux/sched.h>
   
  +unsigned long masync_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
  +   unsigned long addr, unsigned long end)
  +{
  +   pte_t *pte;
  +   spinlock_t *ptl;
  +
  +   pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
  +   arch_enter_lazy_mmu_mode();
  +   do {
  +   pte_t ptent = *pte;
  +
  +   if (pte_none(ptent))
  +   continue;
  +
  +   if (!pte_present(ptent))
  +   continue;
  +
  +   if (pte_dirty(ptent) && pte_write(ptent)) {
  +   flush_cache_page(vma, addr, pte_pfn(ptent));
 
 Hmm, I'm not sure flush_cache_page() is needed.  Or does dirty
 data in the cache somehow interfere with the page protection?

No, just being paranoid..

  +   ptent = ptep_clear_flush(vma, addr, pte);
  +   ptent = pte_wrprotect(ptent);
  +   set_pte_at(vma->vm_mm, addr, pte, ptent);
  +   }
  +   } while (pte++, addr += PAGE_SIZE, addr != end);
  +   arch_leave_lazy_mmu_mode();
  +   pte_unmap_unlock(pte - 1, ptl);
  +
  +   return addr;
  +}
  +
  +unsigned long masync_pmd_range(struct vm_area_struct *vma, pud_t *pud,
  +   unsigned long addr, unsigned long end)
  +{
  +   pmd_t *pmd;
  +   unsigned long next;
  +
  +   pmd = pmd_offset(pud, addr);
  +   do {
  +   next = pmd_addr_end(addr, end);
  +   if (pmd_none_or_clear_bad(pmd))
  +   continue;
  +   next = masync_pte_range(vma, pmd, addr, next);
  +   } while (pmd++, addr = next, addr != end);
  +
  +   return addr;
  +}
  +
  +unsigned long masync_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
  +   unsigned long addr, unsigned long end)
  +{
  +   pud_t *pud;
  +   unsigned long next;
  +
  +   pud = pud_offset(pgd, addr);
  +   do {
  +   next = pud_addr_end(addr, end);
  +   if (pud_none_or_clear_bad(pud))
  +   continue;
  +   next = masync_pmd_range(vma, pud, addr, next);
  +   } while (pud++, addr = next, addr != end);
  +
  +   return addr;
  +}
  +
  +unsigned long masync_pgd_range(struct vm_area_struct *vma,
  +   unsigned long addr, unsigned long end)
  +{
  +   pgd_t *pgd;
  +   unsigned long next;
  +
  +   pgd = pgd_offset(vma->vm_mm, addr);
  +   do {
  +   next = pgd_addr_end(addr, end);
  +   if (pgd_none_or_clear_bad(pgd))
  +   continue;
  +   next = masync_pud_range(vma, pgd, addr, next);
  +   } while (pgd++, addr = next, addr != end);
  +
  +   return addr;
  +}
  +
  +int masync_vma_one(struct vm_area_struct *vma,
  +   unsigned long start, unsigned long end)
  +{
  +   if (start < vma->vm_start)
  +   start = vma->vm_start;
  +
  +   if (end > vma->vm_end)
  +   end = vma->vm_end;
  +
  +   masync_pgd_range(vma, start, end);
  +
  +   return 0;
  +}
  +
  +int masync_vma(struct vm_area_struct *vma, 
  +   unsigned long start, unsigned long end)
  +{
  +   struct address_space *mapping;
  +   struct vm_area_struct *vma_iter;
  +
  +   if (!(vma->vm_flags & VM_SHARED))
  +   return 0;
  +
  +   mapping = vma->vm_file->f_mapping;
  +
  +   if (!mapping_cap_account_dirty(mapping))
  +   return 0;
  +
  +   spin_lock(&mapping->i_mmap_lock);
  +   vma_prio_tree_foreach(vma_iter, &iter, &mapping->i_mmap, start, end)
  +   masync_vma_one(vma_iter, start, end);
  +   spin_unlock(&mapping->i_mmap_lock);
 
 This is holding i_mmap_lock for possibly quite long.  Isn't that going
 to cause problems?

Possibly, I didn't see a quick way to break that iteration.
From a quick glance at prio_tree.c the iterator isn't valid anymore
after releasing i_mmap_lock. Fixing that would be,.. 'fun'.

I also realized I forgot to copy/paste the prio_tree_iter declaration
and ought to make all these functions static.

But for a quick draft it conveys the idea pretty well, I guess :-)



Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Miklos Szeredi
 diff --git a/mm/msync.c b/mm/msync.c
 index 144a757..a1b3fc6 100644
 --- a/mm/msync.c
 +++ b/mm/msync.c
 @@ -14,6 +14,122 @@
  #include <linux/syscalls.h>
  #include <linux/sched.h>
  
 +unsigned long masync_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 + unsigned long addr, unsigned long end)
 +{
 + pte_t *pte;
 + spinlock_t *ptl;
 +
 + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 + arch_enter_lazy_mmu_mode();
 + do {
 + pte_t ptent = *pte;
 +
 + if (pte_none(ptent))
 + continue;
 +
 + if (!pte_present(ptent))
 + continue;
 +
 + if (pte_dirty(ptent) && pte_write(ptent)) {
 + flush_cache_page(vma, addr, pte_pfn(ptent));

Hmm, I'm not sure flush_cache_page() is needed.  Or does dirty
data in the cache somehow interfere with the page protection?

 + ptent = ptep_clear_flush(vma, addr, pte);
 + ptent = pte_wrprotect(ptent);
 + set_pte_at(vma->vm_mm, addr, pte, ptent);
 + }
 + } while (pte++, addr += PAGE_SIZE, addr != end);
 + arch_leave_lazy_mmu_mode();
 + pte_unmap_unlock(pte - 1, ptl);
 +
 + return addr;
 +}
 +
 +unsigned long masync_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 + unsigned long addr, unsigned long end)
 +{
 + pmd_t *pmd;
 + unsigned long next;
 +
 + pmd = pmd_offset(pud, addr);
 + do {
 + next = pmd_addr_end(addr, end);
 + if (pmd_none_or_clear_bad(pmd))
 + continue;
 + next = masync_pte_range(vma, pmd, addr, next);
 + } while (pmd++, addr = next, addr != end);
 +
 + return addr;
 +}
 +
 +unsigned long masync_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
 + unsigned long addr, unsigned long end)
 +{
 + pud_t *pud;
 + unsigned long next;
 +
 + pud = pud_offset(pgd, addr);
 + do {
 + next = pud_addr_end(addr, end);
 + if (pud_none_or_clear_bad(pud))
 + continue;
 + next = masync_pmd_range(vma, pud, addr, next);
 + } while (pud++, addr = next, addr != end);
 +
 + return addr;
 +}
 +
 +unsigned long masync_pgd_range(struct vm_area_struct *vma,
 + unsigned long addr, unsigned long end)
 +{
 + pgd_t *pgd;
 + unsigned long next;
 +
 + pgd = pgd_offset(vma->vm_mm, addr);
 + do {
 + next = pgd_addr_end(addr, end);
 + if (pgd_none_or_clear_bad(pgd))
 + continue;
 + next = masync_pud_range(vma, pgd, addr, next);
 + } while (pgd++, addr = next, addr != end);
 +
 + return addr;
 +}
 +
 +int masync_vma_one(struct vm_area_struct *vma,
 + unsigned long start, unsigned long end)
 +{
 + if (start < vma->vm_start)
 + start = vma->vm_start;
 +
 + if (end > vma->vm_end)
 + end = vma->vm_end;
 +
 + masync_pgd_range(vma, start, end);
 +
 + return 0;
 +}
 +
 +int masync_vma(struct vm_area_struct *vma, 
 + unsigned long start, unsigned long end)
 +{
 + struct address_space *mapping;
 + struct vm_area_struct *vma_iter;
 +
 + if (!(vma->vm_flags & VM_SHARED))
 + return 0;
 +
 + mapping = vma->vm_file->f_mapping;
 +
 + if (!mapping_cap_account_dirty(mapping))
 + return 0;
 +
 + spin_lock(&mapping->i_mmap_lock);
 + vma_prio_tree_foreach(vma_iter, &iter, &mapping->i_mmap, start, end)
 + masync_vma_one(vma_iter, start, end);
 + spin_unlock(&mapping->i_mmap_lock);

This is holding i_mmap_lock for possibly quite long.  Isn't that going
to cause problems?

Miklos

 +
 + return 0;
 +}
 +
  /*
   * MS_SYNC syncs the entire file - including mappings.
   *
 
 
 
 


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Peter Zijlstra

On Fri, 2008-01-18 at 11:38 +0100, Miklos Szeredi wrote:
  On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
  
diff --git a/mm/msync.c b/mm/msync.c
index a4de868..a49af28 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -13,11 +13,33 @@
 #include <linux/syscalls.h>
 
 /*
+ * Scan the PTEs for pages belonging to the VMA and mark them read-only.
+ * It will force a pagefault on the next write access.
+ */
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+   unsigned long addr;
+
+   for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+   spinlock_t *ptl;
+   pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
+   pud_t *pud = pud_offset(pgd, addr);
+   pmd_t *pmd = pmd_offset(pud, addr);
+   pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+
+   if (pte_dirty(*pte) && pte_write(*pte))
+   *pte = pte_wrprotect(*pte);
+   pte_unmap_unlock(pte, ptl);
+   }
+}
   
   What about ram based filesystems?  They don't start out with read-only
   pte's, so I think they don't want them read-protected now either.
   Unless this is essential for correct mtime/ctime accounting on these
   filesystems (I don't think it really is).  But then the mapping should
   start out read-only as well, otherwise the time update will only work
   after an msync(MS_ASYNC).
  
  page_mkclean() has all the needed logic for this, it also walks the rmap
  and cleans out all other users, which I think is needed too for
  consistencies sake:
  
  Process A			Process B
  
  mmap(foo.txt)			mmap(foo.txt)
  
  dirty page
  				dirty page
  
  msync(MS_ASYNC)
  
  				dirty page
  
  msync(MS_ASYNC) <--- now what?!

how about:

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..a1b3fc6 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -14,6 +14,122 @@
 #include <linux/syscalls.h>
 #include <linux/sched.h>
 
+unsigned long masync_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
+   unsigned long addr, unsigned long end)
+{
+   pte_t *pte;
+   spinlock_t *ptl;
+
+   pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+   arch_enter_lazy_mmu_mode();
+   do {
+   pte_t ptent = *pte;
+
+   if (pte_none(ptent))
+   continue;
+
+   if (!pte_present(ptent))
+   continue;
+
+   if (pte_dirty(ptent) && pte_write(ptent)) {
+   flush_cache_page(vma, addr, pte_pfn(ptent));
+   ptent = ptep_clear_flush(vma, addr, pte);
+   ptent = pte_wrprotect(ptent);
+   set_pte_at(vma->vm_mm, addr, pte, ptent);
+   }
+   } while (pte++, addr += PAGE_SIZE, addr != end);
+   arch_leave_lazy_mmu_mode();
+   pte_unmap_unlock(pte - 1, ptl);
+
+   return addr;
+}
+
+unsigned long masync_pmd_range(struct vm_area_struct *vma, pud_t *pud,
+   unsigned long addr, unsigned long end)
+{
+   pmd_t *pmd;
+   unsigned long next;
+
+   pmd = pmd_offset(pud, addr);
+   do {
+   next = pmd_addr_end(addr, end);
+   if (pmd_none_or_clear_bad(pmd))
+   continue;
+   next = masync_pte_range(vma, pmd, addr, next);
+   } while (pmd++, addr = next, addr != end);
+
+   return addr;
+}
+
+unsigned long masync_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
+   unsigned long addr, unsigned long end)
+{
+   pud_t *pud;
+   unsigned long next;
+
+   pud = pud_offset(pgd, addr);
+   do {
+   next = pud_addr_end(addr, end);
+   if (pud_none_or_clear_bad(pud))
+   continue;
+   next = masync_pmd_range(vma, pud, addr, next);
+   } while (pud++, addr = next, addr != end);
+
+   return addr;
+}
+
+unsigned long masync_pgd_range(struct vm_area_struct *vma,
+   unsigned long addr, unsigned long end)
+{
+   pgd_t *pgd;
+   unsigned long next;
+
+   pgd = pgd_offset(vma->vm_mm, addr);
+   do {
+   next = pgd_addr_end(addr, end);
+   if (pgd_none_or_clear_bad(pgd))
+   continue;
+   next = masync_pud_range(vma, pgd, addr, next);
+   } while (pgd++, addr = next, addr != end);
+
+   return addr;
+}
+
+int masync_vma_one(struct vm_area_struct *vma,
+   unsigned long start, unsigned long end)
+{
+   if (start < vma->vm_start)
+   start = vma->vm_start;
+
+   if (end > vma->vm_end)
+   end = vma->vm_end;
+
+   masync_pgd_range(vma, start, end);
+
+   return 0;
+}
+
+int masync_vma(struct vm_area_struct *vma, 
+   unsigned long start, unsigned long end)
+{
+   struct address_space *mapping;
+   struct 

Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Miklos Szeredi
 On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
 
   diff --git a/mm/msync.c b/mm/msync.c
   index a4de868..a49af28 100644
   --- a/mm/msync.c
   +++ b/mm/msync.c
   @@ -13,11 +13,33 @@
#include linux/syscalls.h

/*
   + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
   + * It will force a pagefault on the next write access.
   + */
   +static void vma_wrprotect(struct vm_area_struct *vma)
   +{
   + unsigned long addr;
   +
   + for (addr = vma-vm_start; addr  vma-vm_end; addr += PAGE_SIZE) {
   + spinlock_t *ptl;
   + pgd_t *pgd = pgd_offset(vma-vm_mm, addr);
   + pud_t *pud = pud_offset(pgd, addr);
   + pmd_t *pmd = pmd_offset(pud, addr);
   + pte_t *pte = pte_offset_map_lock(vma-vm_mm, pmd, addr, ptl);
   +
   + if (pte_dirty(*pte)  pte_write(*pte))
   + *pte = pte_wrprotect(*pte);
   + pte_unmap_unlock(pte, ptl);
   + }
   +}
  
  What about ram based filesystems?  They don't start out with read-only
  pte's, so I think they don't want them read-protected now either.
  Unless this is essential for correct mtime/ctime accounting on these
  filesystems (I don't think it really is).  But then the mapping should
  start out read-only as well, otherwise the time update will only work
  after an msync(MS_ASYNC).
 
 page_mkclean() has all the needed logic for this, it also walks the rmap
 and cleans out all other users, which I think is needed too for
 consistencies sake:
 
 Process A Process B
 
 mmap(foo.txt) mmap(foo.txt)
 
 dirty page
   dirty page
 
 msync(MS_ASYNC)
 
   dirty page
 
 msync(MS_ASYNC) --- now what?!

Nothing.  I think it's perfectly acceptable behavior, if msync in
process A doesn't care about any dirtying in process B.

 All in all, that sounds rather expensive..

Right.  The advantage of Anton's current approach is that it's at
least simple, and possibly not so expensive, while providing some
quite sane semantics for MS_ASYNC vs. mtime updates.

Miklos


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Peter Zijlstra [EMAIL PROTECTED]:

 On Fri, 2008-01-18 at 11:15 +0100, Peter Zijlstra wrote:
  On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
 
diff --git a/mm/msync.c b/mm/msync.c
index a4de868..a49af28 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -13,11 +13,33 @@
 #include <linux/syscalls.h>
   
 /*
 + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
+ * It will force a pagefault on the next write access.
+ */
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+ unsigned long addr;
+
 + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
 + spinlock_t *ptl;
 + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
 + pud_t *pud = pud_offset(pgd, addr);
 + pmd_t *pmd = pmd_offset(pud, addr);
 + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 +
 + if (pte_dirty(*pte) && pte_write(*pte))
 + *pte = pte_wrprotect(*pte);
+ pte_unmap_unlock(pte, ptl);
+ }
+}
  
   What about ram based filesystems?  They don't start out with read-only
   pte's, so I think they don't want them read-protected now either.
   Unless this is essential for correct mtime/ctime accounting on these
   filesystems (I don't think it really is).  But then the mapping should
   start out read-only as well, otherwise the time update will only work
   after an msync(MS_ASYNC).
 
  page_mkclean() has all the needed logic for this, it also walks the rmap
  and cleans out all other users, which I think is needed too for
  consistency's sake:
 
  Process A Process B
 
  mmap(foo.txt) mmap(foo.txt)
 
  dirty page
dirty page
 
  msync(MS_ASYNC)
 
dirty page
 
  msync(MS_ASYNC) --- now what?!
 
 
  So what I would suggest is using the page table walkers from mm, and
  walks the page range, obtain the page using vm_normal_page() and call
  page_mkclean(). (Oh, and ensure you don't nest the pte lock :-)
 
  All in all, that sounds rather expensive..

 Bah, and will break on s390... so we'd need a page_mkclean() variant
 that doesn't actually clear dirty.

So the current version of the functional changes patch takes this into account.





Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Miklos Szeredi
 Possibly, I didn't see a quick way to break that iteration.
 From a quick glance at prio_tree.c the iterator isn't valid anymore
 after releasing i_mmap_lock. Fixing that would be... 'fun'.

Maybe i_mmap_lock isn't needed at all, since msync holds mmap_sem,
which protects the prio tree as well, no?

 I also realized I forgot to copy/paste the prio_tree_iter declaration
 and ought to make all these functions static.
 
 But for a quick draft it conveys the idea pretty well, I guess :-)

Yes :)

There could also be nasty performance corner cases, like having a huge
file mapped thousands of times, and having only a couple of pages
dirtied between MS_ASYNC invocations.  Then most of that page table
walking would be just unnecessary overhead.

There's something to be said for walking only the dirty pages, and
doing page_mkclean on them, even if in some cases that would be
slower.

But I have a strong feeling of deja vu, and last time it ended with
Andrew not liking the whole thing...

Miklos


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Linus Torvalds


On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
 
 Before using pte_wrprotect() the vma_wrprotect() routine uses the
 pte_offset_map_lock() macro to get the PTE and to acquire the ptl
 spinlock. Why did you say that this code was not SMP-safe? It should
 be atomic, I think.

It's atomic WITH RESPECT TO OTHER PEOPLE WHO GET THE LOCK.

Guess how much another x86 CPU cares when it sets the accessed bit in 
hardware?

 The POSIX standard requires the ctime and mtime stamps to be updated
 not later than at the second call to msync() with the MS_ASYNC flag.

.. and that is no excuse for bad code.

Linus


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/19, Linus Torvalds [EMAIL PROTECTED]:


 On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
 
  The page_check_address() function is called from the
  page_mkclean_one() routine as follows:

 .. and the page_mkclean_one() function is totally different.

 Lookie here, this is the correct and complex sequence:

  entry = ptep_clear_flush(vma, address, pte);
  entry = pte_wrprotect(entry);
  entry = pte_mkclean(entry);
  set_pte_at(mm, address, pte, entry);

 That's a rather expensive sequence, but it's done exactly because it has
 to be done that way. What it does is to

  - *atomically* load the pte entry _and_ clear the old one in memory.

That's the

 entry = ptep_clear_flush(vma, address, pte);

thing, and it basically means that it's doing some
architecture-specific magic to make sure that another CPU that accesses
the PTE at the same time will never actually modify the pte (because
it's clear and not valid)

  - it then - while the page table is actually clear and invalid - takes
the old value and turns it into the new one:

 entry = pte_wrprotect(entry);
 entry = pte_mkclean(entry);

  - and finally, it replaces the entry with the new one:

 set_pte_at(mm, address, pte, entry);

which takes care to write the new entry in some specific way that is
atomic wrt other CPU's (ie on 32-bit x86 with a 64-bit page table
entry it writes the high word first, see the write barriers in
native_set_pte() in include/asm-x86/pgtable-3level.h)

 Now, compare that subtle and correct thing with what is *not* correct:

 if (pte_dirty(*pte) && pte_write(*pte))
 *pte = pte_wrprotect(*pte);

 which makes no effort at all to make sure that it's safe in case another
 CPU updates the accessed bit.

 Now, arguably it's unlikely to cause horrible problems at least on x86,
 because:

  - we only do this if the pte is already marked dirty, so while we can
lose the accessed bit, we can *not* lose the dirty bit. And the
accessed bit isn't such a big deal.

 - it's not doing any of the "be careful about ordering" things, but since
the really important bits aren't changing, ordering probably won't
practically matter.

 But the problem is that we have something like 24 different architectures,
 it's hard to make sure that none of them have issues.

 In other words: it may well work in practice. But when these things go
 subtly wrong, they are *really* nasty to find, and the unsafe sequence is
 really not how it's supposed to be done. For example, you don't even flush
 the TLB, so even if there are no cross-CPU issues, there's probably going
 to be writable entries in the TLB that now don't match the page tables.

 Will it matter? Again, probably impossible to see in practice. But ...

Linus, I am very grateful to you for your extremely clear explanation
of the issue I had overlooked!

Back to the msync() issue, I'm going to come back with a new design
for the bug fix.

Thank you once again.

Anton


 Linus



Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Linus Torvalds


On Fri, 18 Jan 2008, Ingo Oeser wrote:
 
 Can we get "if the write to the page hits the disk, the mtime has hit the disk
 already no less than SOME_GRANULARITY before"?
 
 That is very important for computer forensics. Esp. in saving your ass!
 
 Ok, now back again to making that fast :-)

I certainly don't mind it if we have some tighter guarantees, but what I'd 
want is:

 - keep it simple. Let's face it, Linux has never ever given those 
   guarantees before, and it's not is if anybody has really cared. Even 
   now, the issue seems to be more about paper standards conformance than 
   anything else.

 - I get worried about people playing around with the dirty bit in 
   particular. We have had some really rather nasty bugs here. Most of 
   which are totally impossible to trigger under normal loads (for 
   example the old random-access utorrent writable mmap issue from about 
   a year ago).

So these two issues - the big red danger signs flashing in my brain, 
coupled with the fact that no application has apparently ever really 
noticed in the last 15 years - just makes it a case where I'd like each 
step of the way to be obvious and simple and no larger than really 
absolutely necessary.

Linus


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/19, Linus Torvalds [EMAIL PROTECTED]:


 On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
 
  Before using pte_wrprotect() the vma_wrprotect() routine uses the
  pte_offset_map_lock() macro to get the PTE and to acquire the ptl
  spinlock. Why did you say that this code was not SMP-safe? It should
  be atomic, I think.

 It's atomic WITH RESPECT TO OTHER PEOPLE WHO GET THE LOCK.

 Guess how much another x86 CPU cares when it sets the accessed bit in
 hardware?

Thank you very much for taking part in this discussion. Personally,
it's very important to me.  But I'm not sure that I understand which
bit can be lost.

Please let me explain.

The logic for my vma_wrprotect() routine was taken from the
page_check_address() function in mm/rmap.c. Here is a code snippet of
the latter function:

pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
return NULL;

pud = pud_offset(pgd, address);
if (!pud_present(*pud))
return NULL;

pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
return NULL;

pte = pte_offset_map(pmd, address);
/* Make a quick check before getting the lock */
if (!pte_present(*pte)) {
pte_unmap(pte);
return NULL;
}

ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
*ptlp = ptl;
return pte;
}
pte_unmap_unlock(pte, ptl);

The page_check_address() function is called from the
page_mkclean_one() routine as follows:

pte = page_check_address(page, mm, address, ptl);
if (!pte)
goto out;

if (pte_dirty(*pte) || pte_write(*pte)) {
pte_t entry;

flush_cache_page(vma, address, pte_pfn(*pte));
entry = ptep_clear_flush(vma, address, pte);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
ret = 1;
}

pte_unmap_unlock(pte, ptl);

The write-protection of the PTE is done using the pte_wrprotect()
entity. I intended to do the same during msync() with MS_ASYNC. I
understand that I'm taking a risk of looking like a complete idiot now,
however I don't see any difference between the two situations.

I presumed that the code in mm/rmap.c was absolutely correct, that's
why I basically reused the design.


  The POSIX standard requires the ctime and mtime stamps to be updated
  not later than at the second call to msync() with the MS_ASYNC flag.

 .. and that is no excuse for bad code.

 Linus



Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Linus Torvalds [EMAIL PROTECTED]:


 On Fri, 18 Jan 2008, Anton Salikhmetov wrote:
 
  The current solution doesn't hurt performance at all when compared to
  competing POSIX-compliant systems. It is faster and does even more
  than the POSIX standard requires.

 Your current patches have two problems:
  - they are simply unnecessarily invasive for a relatively simple issue
  - all versions I've looked at closer are buggy too

 Example:

 +   if (pte_dirty(*pte) && pte_write(*pte))
 +   *pte = pte_wrprotect(*pte);

 Uhhuh. Looks simple enough. Except it does a non-atomic pte access while
 other CPU's may be accessing it and updating it from their hw page table
 walkers. What will happen? Who knows? I can see lost access bits at a
 minimum.

 IOW, this isn't simple code. It's code that it is simple to screw up. In
 this case, you really need to use ptep_set_wrprotect(), for example.

Before using pte_wrprotect() the vma_wrprotect() routine uses the
pte_offset_map_lock() macro to get the PTE and to acquire the ptl
spinlock. Why did you say that this code was not SMP-safe? It should
be atomic, I think.



 So why not do it in many fewer lines with that simpler vma->dirty flag?

Neither the dirty flag you suggest, nor the AS_MCTIME flag I've
introduced in my previous solutions, solves the following problem:

- mmap()
- a write reference
- msync() with MS_ASYNC
- a write reference
- msync() with MS_ASYNC

The POSIX standard requires the ctime and mtime stamps to be updated
not later than at the second call to msync() with the MS_ASYNC flag.

Some other POSIX-compliant operating systems, such as HP-UX and FreeBSD,
satisfy this POSIX requirement. Linux does not.


 Linus



Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Miklos Szeredi
 And even in that four-liner, I suspect that the *last* two lines are 
 actually incorrect: there's no point in updating the file time when the 
 page *becomes* dirty,

Actually all four lines do that.  The first two for a write access on
a present, read-only pte, the other two for a write on a non-present
pte.

 we should update the file time when it is marked 
 clean, and msync(MS_SYNC) should update it as part of *that*.

That would need a new page flag (PG_mmap_dirty?).  Do we have one
available?

Miklos


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Linus Torvalds


On Fri, 18 Jan 2008, Peter Zijlstra wrote:
 
 Bah, and will break on s390... so we'd need a page_mkclean() variant
 that doesn't actually clear dirty.

No, we simply want to not play all these very expensive games with dirty 
in the first place.

Guys, mmap access times aren't important enough for this. It's not 
specified closely enough, and people don't care enough.

Of the patches around so far, the best one by far seems to be the simple 
four-liner from Miklos.

And even in that four-liner, I suspect that the *last* two lines are 
actually incorrect: there's no point in updating the file time when the 
page *becomes* dirty, we should update the file time when it is marked 
clean, and msync(MS_SYNC) should update it as part of *that*.

So I think the file time update should be part of just the page writeout 
logic, not by msync() or page faulting itself or anything like that.

Linus


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Miklos Szeredi
   And even in that four-liner, I suspect that the *last* two lines are 
   actually incorrect: there's no point in updating the file time when the 
   page *becomes* dirty,
  
  Actually all four lines do that.  The first two for a write access on
  a present, read-only pte, the other two for a write on a non-present
  pte.
  
   we should update the file time when it is marked 
   clean, and msync(MS_SYNC) should update it as part of *that*.
  
  That would need a new page flag (PG_mmap_dirty?).  Do we have one
  available?
 
 I thought the page writing stuff looked at (and cleared) the pte
 dirty bit, too?

Yeah, it does.  Hmm...

What happens on munmap?  The times _could_ get updated from there as
well, but it's getting complicated.

Miklos



Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Ingo Oeser
Hi Linus,

On Friday 18 January 2008, Linus Torvalds wrote:
 On Fri, 18 Jan 2008, Miklos Szeredi wrote:
  
  What I'm saying is that the times could be left un-updated for a long
  time if program doesn't do munmap() or msync(MS_SYNC) for a long time.
 
 Sure.
 
 But in those circumstances, the programmer cannot depend on the mtime 
 *anyway* (because there is no synchronization), so what's the downside?

Can we get "if the write to the page hits the disk, the mtime has hit the disk
already no less than SOME_GRANULARITY before"?

That is very important for computer forensics. Esp. in saving your ass!

Ok, now back again to making that fast :-)


Best Regards

Ingo Oeser


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Linus Torvalds


On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
 
 The page_check_address() function is called from the
 page_mkclean_one() routine as follows:

.. and the page_mkclean_one() function is totally different.

Lookie here, this is the correct and complex sequence:

 entry = ptep_clear_flush(vma, address, pte);
 entry = pte_wrprotect(entry);
 entry = pte_mkclean(entry);
 set_pte_at(mm, address, pte, entry);

That's a rather expensive sequence, but it's done exactly because it has 
to be done that way. What it does is to

 - *atomically* load the pte entry _and_ clear the old one in memory.

   That's the

entry = ptep_clear_flush(vma, address, pte);

   thing, and it basically means that it's doing some 
   architecture-specific magic to make sure that another CPU that accesses 
   the PTE at the same time will never actually modify the pte (because 
   it's clear and not valid)

 - it then - while the page table is actually clear and invalid - takes 
   the old value and turns it into the new one:

entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);

 - and finally, it replaces the entry with the new one:

set_pte_at(mm, address, pte, entry);

   which takes care to write the new entry in some specific way that is 
   atomic wrt other CPU's (ie on 32-bit x86 with a 64-bit page table 
   entry it writes the high word first, see the write barriers in 
   native_set_pte() in include/asm-x86/pgtable-3level.h)

Now, compare that subtle and correct thing with what is *not* correct:

if (pte_dirty(*pte) && pte_write(*pte))
*pte = pte_wrprotect(*pte);

which makes no effort at all to make sure that it's safe in case another 
CPU updates the accessed bit.

Now, arguably it's unlikely to cause horrible problems at least on x86, 
because:

 - we only do this if the pte is already marked dirty, so while we can 
   lose the accessed bit, we can *not* lose the dirty bit. And the 
   accessed bit isn't such a big deal.

 - it's not doing any of the "be careful about ordering" things, but since
   the really important bits aren't changing, ordering probably won't 
   practically matter.

But the problem is that we have something like 24 different architectures, 
it's hard to make sure that none of them have issues. 

In other words: it may well work in practice. But when these things go 
subtly wrong, they are *really* nasty to find, and the unsafe sequence is 
really not how it's supposed to be done. For example, you don't even flush 
the TLB, so even if there are no cross-CPU issues, there's probably going 
to be writable entries in the TLB that now don't match the page tables.

Will it matter? Again, probably impossible to see in practice. But ...

Linus


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Miklos Szeredi
  That would need a new page flag (PG_mmap_dirty?).  Do we have one
  available?
 
 Yeah, that would be bad. We probably have flags free, but those page flags 
 are always a pain. Scratch that.
 
 How about just setting a per-vma dirty flag, and then instead of updating 
 the mtime when taking the dirty-page fault, we just set that flag?
 
 Then, on unmap and msync, we just do
 
   if (vma->dirty_flag) {
   vma->dirty_flag = 0;
   update_file_times(vma->vm_file);
   }
 
 and be done with it? 

But then background writeout, sync(2), etc, wouldn't update the times.
Dunno.  I don't think actual _physical_ writeout matters much, so it's
not worse to be 30s early with the timestamp, than to be 30s or more
late.

Miklos


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Linus Torvalds


On Fri, 18 Jan 2008, Miklos Szeredi wrote:
 
 That would need a new page flag (PG_mmap_dirty?).  Do we have one
 available?

Yeah, that would be bad. We probably have flags free, but those page flags 
are always a pain. Scratch that.

How about just setting a per-vma dirty flag, and then instead of updating 
the mtime when taking the dirty-page fault, we just set that flag?

Then, on unmap and msync, we just do

if (vma->dirty_flag) {
vma->dirty_flag = 0;
update_file_times(vma->vm_file);
}

and be done with it? 


Linus


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Linus Torvalds


On Fri, 18 Jan 2008, Miklos Szeredi wrote:
 
 But then background writeout, sync(2), etc, wouldn't update the times.

Sure it would, but only when doing the final unmap.

Did you miss the "on unmap and msync" part?

Linus


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Rik van Riel
On Fri, 18 Jan 2008 19:11:47 +0100
Miklos Szeredi [EMAIL PROTECTED] wrote:

  And even in that four-liner, I suspect that the *last* two lines are 
  actually incorrect: there's no point in updating the file time when the 
  page *becomes* dirty,
 
 Actually all four lines do that.  The first two for a write access on
 a present, read-only pte, the other two for a write on a non-present
 pte.
 
  we should update the file time when it is marked 
  clean, and msync(MS_SYNC) should update it as part of *that*.
 
 That would need a new page flag (PG_mmap_dirty?).  Do we have one
 available?

I thought the page writing stuff looked at (and cleared) the pte
dirty bit, too?

-- 
All rights reversed.


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Miklos Szeredi
  
  But then background writeout, sync(2), etc, wouldn't update the times.
 
 Sure it would, but only when doing the final unmap.
 
 Did you miss the "on unmap and msync" part?

No :)

What I'm saying is that the times could be left un-updated for a long
time if program doesn't do munmap() or msync(MS_SYNC) for a long time.

If program has this pattern:

mmap()
write to map
msync(MS_ASYNC)
sleep(long)
write to map
msync(MS_ASYNC)
sleep(long)
...

Then we'd never see time updates (until the program exits, but that
could be years).

Maybe this doesn't matter, I'm just saying this is a disadvantage
compared to the "update on first dirtying" approach, which would
ensure, that times are updated at least once per 30s.

Miklos


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Linus Torvalds


On Fri, 18 Jan 2008, Miklos Szeredi wrote:
 
 What I'm saying is that the times could be left un-updated for a long
 time if program doesn't do munmap() or msync(MS_SYNC) for a long time.

Sure.

But in those circumstances, the programmer cannot depend on the mtime 
*anyway* (because there is no synchronization), so what's the downside?

Let's face it, there's exactly three possible solutions:

 - the insane one: trap EVERY SINGLE instruction that does a write to the 
   page, and update mtime each and every time.

   This one is so obviously STUPID that it's not even worth discussing 
   further, except to say that yes, there is an 'exact' algorithm, but 
   no, we are never EVER going to use it.

 - the non-exact solutions that don't give you mtime updates every time 
   a write to the page happens, but give *some* guarantees for things that 
   will update it.

   This is the one I think we can do, and the only things a programmer can 
   impact using it is msync() and munmap(), since no other operations 
   really have any thing to do with it in a programmer-visible way (ie a 
   normal sync operation may happen in the background and has no 
   program-relevant timing information)

   Other things *may* or may not update mtime (some filesystems - take
   most networked ones as an example - will *always* update mtime on the
   server on writeback, so we cannot ever guarantee that nothing but 
   msync/munmap does so), but at least we'll have a minimum set of things 
   that people can depend on.

 - the "we don't care at all" solutions.

   mmap(MAP_WRITE) doesn't really update times reliably after the write 
   has happened (but might do it *before* - maybe the mmap() itself does).

Those are the three choices, I think. We currently approximate #3. We 
*can* do #2 (and there are various flavors of it). And even *aiming* for 
#1 is totally insane and stupid.

Linus


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Linus Torvalds [EMAIL PROTECTED]:


 On Fri, 18 Jan 2008, Miklos Szeredi wrote:
 
  What I'm saying is that the times could be left un-updated for a long
  time if program doesn't do munmap() or msync(MS_SYNC) for a long time.

 Sure.

 But in those circumstances, the programmer cannot depend on the mtime
 *anyway* (because there is no synchronization), so what's the downside?

 Let's face it, there's exactly three possible solutions:

  - the insane one: trap EVERY SINGLE instruction that does a write to the
page, and update mtime each and every time.

This one is so obviously STUPID that it's not even worth discussing
further, except to say that yes, there is an 'exact' algorithm, but
no, we are never EVER going to use it.

  - the non-exact solutions that don't give you mtime updates every time
a write to the page happens, but give *some* guarantees for things that
will update it.

This is the one I think we can do, and the only things a programmer can
impact using it is msync() and munmap(), since no other operations
really have any thing to do with it in a programmer-visible way (ie a
normal sync operation may happen in the background and has no
program-relevant timing information)

Other things *may* or may not update mtime (some filesystems - take
most networked ones as an example - will *always* update mtime on the
server on writeback, so we cannot ever guarantee that nothing but
msync/munmap does so), but at least we'll have a minimum set of things
that people can depend on.

  - the "we don't care at all" solutions.

mmap(MAP_WRITE) doesn't really update times reliably after the write
has happened (but might do it *before* - maybe the mmap() itself does).

 Those are the three choices, I think. We currently approximate #3. We
 *can* do #2 (and there are various flavors of it). And even *aiming* for
 #1 is totally insane and stupid.

The current solution doesn't hurt performance at all when compared to
competing POSIX-compliant systems. It is faster and does even more
than the POSIX standard requires.

Please see the test results I've sent into the thread -v6 0/2:

http://lkml.org/lkml/2008/1/18/447

I guess the current solution is ready to use.


 Linus



Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Linus Torvalds


On Fri, 18 Jan 2008, Anton Salikhmetov wrote:
 
 The current solution doesn't hurt performance at all when compared to
 competing POSIX-compliant systems. It is faster and does even more
 than the POSIX standard requires.

Your current patches have two problems:
 - they are simply unnecessarily invasive for a relatively simple issue
 - all versions I've looked at closely are buggy too

Example:

+   if (pte_dirty(*pte) && pte_write(*pte))
+   *pte = pte_wrprotect(*pte);

Uhhuh. Looks simple enough. Except it does a non-atomic pte access while
other CPUs may be accessing it and updating it from their hw page table
walkers. What will happen? Who knows? I can see lost access bits at a
minimum.

IOW, this isn't simple code. It's code that is simple to screw up. In
this case, you really need to use ptep_set_wrprotect(), for example.

So why not do it in many fewer lines with that simpler vma-dirty flag?

Linus


[PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-17 Thread Anton Salikhmetov
Update file times at write references to memory-mapped files, and
force a file times update at the next write reference after calling
the msync() system call with the MS_ASYNC flag.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/memory.c |6 ++
 mm/msync.c  |   52 +++-
 2 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4bf0b6d..13d5bbf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1668,6 +1668,9 @@ gotten:
 unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+   if (vma->vm_file)
+   file_update_time(vma->vm_file);
+
/*
 * Yes, Virginia, this is actually required to prevent a race
 * with clear_page_dirty_for_io() from clearing the page dirty
@@ -2341,6 +2344,9 @@ out_unlocked:
if (anon)
page_cache_release(vmf.page);
else if (dirty_page) {
+   if (vma->vm_file)
+   file_update_time(vma->vm_file);
+
set_page_dirty_balance(dirty_page, page_mkwrite);
put_page(dirty_page);
}
diff --git a/mm/msync.c b/mm/msync.c
index a4de868..a49af28 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -13,11 +13,33 @@
 #include <linux/syscalls.h>
 
 /*
+ * Scan the PTEs for pages belonging to the VMA and mark them read-only.
+ * This will force a page fault on the next write access.
+ */
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+   unsigned long addr;
+
+   for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+   spinlock_t *ptl;
+   pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
+   pud_t *pud = pud_offset(pgd, addr);
+   pmd_t *pmd = pmd_offset(pud, addr);
+   pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+
+   if (pte_dirty(*pte) && pte_write(*pte))
+   *pte = pte_wrprotect(*pte);
+   pte_unmap_unlock(pte, ptl);
+   }
+}
+
+/*
  * MS_SYNC syncs the entire file - including mappings.
  *
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
+ * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
+ * read-only by calling vma_wrprotect(). This is needed to catch the next
+ * write reference to the mapped region and update the file times
+ * accordingly.
  *
  * The application may now run fsync() to write out the dirty pages and
  * wait on the writeout and check the result. Or the application may run
@@ -77,16 +99,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
error = 0;
start = vma->vm_end;
file = vma->vm_file;
-   if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
-   get_file(file);
-   up_read(&mm->mmap_sem);
-   error = do_fsync(file, 0);
-   fput(file);
-   if (error || start >= end)
-   goto out;
-   down_read(&mm->mmap_sem);
-   vma = find_vma(mm, start);
-   continue;
+   if (file && (vma->vm_flags & VM_SHARED)) {
+   if (flags & MS_ASYNC)
+   vma_wrprotect(vma);
+   if (flags & MS_SYNC) {
+   get_file(file);
+   up_read(&mm->mmap_sem);
+   error = do_fsync(file, 0);
+   fput(file);
+   if (error || start >= end)
+   goto out;
+   down_read(&mm->mmap_sem);
+   vma = find_vma(mm, start);
+   continue;
+   }
}
 
vma = vma->vm_next;
-- 
1.4.4.4


