Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-28 Thread Nick Piggin

[EMAIL PROTECTED] wrote:

But if you didn't notice until now, then the current implementation
must be pretty reasonable for you use as well.



Oh, I definitely noticed.  As soon as I tried to port my application
to 2.6, it broke - as evidenced by my complaints last year.  The
current solution is simple - since it's running on dedicated boxes,
leave them on 2.4.


Well I didn't know that was a change in behaviour vs 2.4 (or maybe I
did and forgot). That was probably a bit silly, unless there was a
good reason for it.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-28 Thread linux
> But if you didn't notice until now, then the current implementation
> must be pretty reasonable for you use as well.

Oh, I definitely noticed.  As soon as I tried to port my application
to 2.6, it broke - as evidenced by my complaints last year.  The
current solution is simple - since it's running on dedicated boxes,
leave them on 2.4.

I've now got the hint on how to make it work on 2.6 (sync_file_range()),
so I can try again.  But the pressure to upgrade is not strong, so it
might be a while.

You may recall, this subthread started when I responding to "the
only reason to use msync(MS_ASYNC) is to update timestamps" with a
counterexample.  I still think the purpose of the call is a hint to the
kernel that writing to the specified page(s) is complete and now would be
a good time to clean them.  Which has very little to do with timestamps.

Now, my application, which leaves less than a second between the MS_ASYNC
and a subsequent MS_SYNC to check whether it's done, broke, but I can
imagine similar cases where MS_ASYNC would remain a useful hint to reduce
the sort of memory hogging generally associated with "dd if=/dev/zero"
type operations.

Reading between the lines of the standard, that seems (to me, at least)
to obviously be the intended purpose of msync(MS_ASYNC).  I wonder if
there's any historical documentation describing the original intent
behind creating the call.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-28 Thread Nick Piggin

[EMAIL PROTECTED] wrote:

Linux _will_ write all modified data to permanent storage locations.
Since 2.6.17 it will do this regardless of msync().  Before 2.6.17 you
do need msync() to enable data to be written back.

But it will not start I/O immediately, which is not a requirement in
the standard, or at least it's pretty vague about that.



As I've said before, I disagree, but I'm not going to start a major
flame war about it.

The most relevant paragraph is:

# When MS_ASYNC is specified, msync() returns immediately once all the
# write operations are initiated or queued for servicing; when MS_SYNC is
# specified, msync() will not return until all write operations are
# completed as defined for synchronised I/O data integrity completion.
# Either MS_ASYNC or MS_SYNC is specified, but not both.

Note two things:
1) In the paragraphs before, what msync does is defined independently
   of the MS_* flags.  Only the time of the return to user space varies.
   Thus, whatever the delay between calling msync() and the data being
   written, it should be the same whether MS_SYNC or MS_ASYNC is used.

   The implementation intended is:
   - Start all I/O
   - If MS_SYNC, wait for I/O to complete
   - Return to user space


set_page_dirty queues pages for IO, and the writeout daemon will service
that queue when it sees fit. IMO it is sufficient that you cannot say the
implementation does not meet the standard.



2) "all the write operations are initiated or queued for servicing".
   It is a common convention in English (and most languages, I expect)
   that in the "or" is a preference for the first alternative.  The second
   is a permitted alternative if the first is not possible.

   And "queued for servicing", especially "initiated or queued for
   servicing", to me imples queuing waiting for some resource.  To have
   the resource being waited for be a timer expiry seems like rather a
   cheat to me.  It's perhaps doesn't break the letter of the standard,
   but definitely bends it.  It feels like a fiddle.

Still, the basic hint function of msync(MS_ASYNC) *is* being accomplished:
"I don't expect to write this page any more, so now would be a good time
to clean it."
It would just make my life easier if the kernel procrastinated less.


But if you didn't notice until now, then the current implementation
must be pretty reasonable for you use as well.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-28 Thread Nick Piggin

[EMAIL PROTECTED] wrote:

Linux _will_ write all modified data to permanent storage locations.
Since 2.6.17 it will do this regardless of msync().  Before 2.6.17 you
do need msync() to enable data to be written back.

But it will not start I/O immediately, which is not a requirement in
the standard, or at least it's pretty vague about that.



As I've said before, I disagree, but I'm not going to start a major
flame war about it.

The most relevant paragraph is:

# When MS_ASYNC is specified, msync() returns immediately once all the
# write operations are initiated or queued for servicing; when MS_SYNC is
# specified, msync() will not return until all write operations are
# completed as defined for synchronised I/O data integrity completion.
# Either MS_ASYNC or MS_SYNC is specified, but not both.

Note two things:
1) In the paragraphs before, what msync does is defined independently
   of the MS_* flags.  Only the time of the return to user space varies.
   Thus, whatever the delay between calling msync() and the data being
   written, it should be the same whether MS_SYNC or MS_ASYNC is used.

   The implementation intended is:
   - Start all I/O
   - If MS_SYNC, wait for I/O to complete
   - Return to user space


set_page_dirty queues pages for IO, and the writeout daemon will service
that queue when it sees fit. IMO it is sufficient that you cannot say the
implementation does not meet the standard.



2) all the write operations are initiated or queued for servicing.
   It is a common convention in English (and most languages, I expect)
   that in the or is a preference for the first alternative.  The second
   is a permitted alternative if the first is not possible.

   And queued for servicing, especially initiated or queued for
   servicing, to me imples queuing waiting for some resource.  To have
   the resource being waited for be a timer expiry seems like rather a
   cheat to me.  It's perhaps doesn't break the letter of the standard,
   but definitely bends it.  It feels like a fiddle.

Still, the basic hint function of msync(MS_ASYNC) *is* being accomplished:
I don't expect to write this page any more, so now would be a good time
to clean it.
It would just make my life easier if the kernel procrastinated less.


But if you didn't notice until now, then the current implementation
must be pretty reasonable for you use as well.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-28 Thread linux
 But if you didn't notice until now, then the current implementation
 must be pretty reasonable for you use as well.

Oh, I definitely noticed.  As soon as I tried to port my application
to 2.6, it broke - as evidenced by my complaints last year.  The
current solution is simple - since it's running on dedicated boxes,
leave them on 2.4.

I've now got the hint on how to make it work on 2.6 (sync_file_range()),
so I can try again.  But the pressure to upgrade is not strong, so it
might be a while.

You may recall, this subthread started when I responding to the
only reason to use msync(MS_ASYNC) is to update timestamps with a
counterexample.  I still think the purpose of the call is a hint to the
kernel that writing to the specified page(s) is complete and now would be
a good time to clean them.  Which has very little to do with timestamps.

Now, my application, which leaves less than a second between the MS_ASYNC
and a subsequent MS_SYNC to check whether it's done, broke, but I can
imagine similar cases where MS_ASYNC would remain a useful hint to reduce
the sort of memory hogging generally associated with dd if=/dev/zero
type operations.

Reading between the lines of the standard, that seems (to me, at least)
to obviously be the intended purpose of msync(MS_ASYNC).  I wonder if
there's any historical documentation describing the original intent
behind creating the call.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-28 Thread Nick Piggin

[EMAIL PROTECTED] wrote:

But if you didn't notice until now, then the current implementation
must be pretty reasonable for you use as well.



Oh, I definitely noticed.  As soon as I tried to port my application
to 2.6, it broke - as evidenced by my complaints last year.  The
current solution is simple - since it's running on dedicated boxes,
leave them on 2.4.


Well I didn't know that was a change in behaviour vs 2.4 (or maybe I
did and forgot). That was probably a bit silly, unless there was a
good reason for it.

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread linux
> Linux _will_ write all modified data to permanent storage locations.
> Since 2.6.17 it will do this regardless of msync().  Before 2.6.17 you
> do need msync() to enable data to be written back.
> 
> But it will not start I/O immediately, which is not a requirement in
> the standard, or at least it's pretty vague about that.

As I've said before, I disagree, but I'm not going to start a major
flame war about it.

The most relevant paragraph is:

# When MS_ASYNC is specified, msync() returns immediately once all the
# write operations are initiated or queued for servicing; when MS_SYNC is
# specified, msync() will not return until all write operations are
# completed as defined for synchronised I/O data integrity completion.
# Either MS_ASYNC or MS_SYNC is specified, but not both.

Note two things:
1) In the paragraphs before, what msync does is defined independently
   of the MS_* flags.  Only the time of the return to user space varies.
   Thus, whatever the delay between calling msync() and the data being
   written, it should be the same whether MS_SYNC or MS_ASYNC is used.

   The implementation intended is:
   - Start all I/O
   - If MS_SYNC, wait for I/O to complete
   - Return to user space

2) "all the write operations are initiated or queued for servicing".
   It is a common convention in English (and most languages, I expect)
   that in the "or" is a preference for the first alternative.  The second
   is a permitted alternative if the first is not possible.

   And "queued for servicing", especially "initiated or queued for
   servicing", to me imples queuing waiting for some resource.  To have
   the resource being waited for be a timer expiry seems like rather a
   cheat to me.  It's perhaps doesn't break the letter of the standard,
   but definitely bends it.  It feels like a fiddle.

Still, the basic hint function of msync(MS_ASYNC) *is* being accomplished:
"I don't expect to write this page any more, so now would be a good time
to clean it."
It would just make my life easier if the kernel procrastinated less.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Andrew Morton
On 27 Mar 2007 16:09:33 -0400
[EMAIL PROTECTED] wrote:

> What part of "The msync() function writes all modified data to
> permanent storage locations [...] For mappings to files, the msync()
> function ensures that all write operations are completed as defined
> for synchronised I/O data integrity completion." suggests that it's not
> supposed to do disk I/O?  How is that uselessly vague?
> 

Because for MS_ASYNC, "msync() shall return immediately once all the write
operations are initiated or queued for servicing".

ie: the writes can complete one millisecond or one week later.  We chose 30
seconds.

And this is not completely fatuous - before 2.6.17, MAP_SHARED pages could
float about in memory in a dirty state for arbitrarily long periods -
potentially for the entire application lifetime.  It was quite reasonable
for our MS_ASYNC implementation to do what it did: tell the VM about the
dirtiness of these pages so they get written back soon.

Post-2.6.17 we preserved that behaviour.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Miklos Szeredi
> > Suggest you use msync(MS_ASYNC), then
> > sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE).
> 
> Thank you; I didn't know about that.  And I can handle -ENOSYS by falling
> back to the old behavior.
> 
> > We can fix your application, and we'll break someone else's.
> 
> If you can point to an application that it'll break, I'd be a lot more
> understanding.  Nobody did, last year.
> 
> > I don't think it's solveable, really - the range of applications is so
> > broad, and the "standard" is so vague as to be useless.
> 
> I agree that standards are sometimes vague, but that one seemed about
> as clear as it's possible to be without imposing unreasonably on
> the file system and device driver layers.
> 
> What part of "The msync() function writes all modified data to
> permanent storage locations [...] For mappings to files, the msync()
> function ensures that all write operations are completed as defined
> for synchronised I/O data integrity completion." suggests that it's not
> supposed to do disk I/O?  How is that uselessly vague?

Linux _will_ write all modified data to permanent storage locations.
Since 2.6.17 it will do this regardless of msync().  Before 2.6.17 you
do need msync() to enable data to be written back.

But it will not start I/O immediately, which is not a requirement in
the standard, or at least it's pretty vague about that.

Miklos
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread linux
> Suggest you use msync(MS_ASYNC), then
> sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE).

Thank you; I didn't know about that.  And I can handle -ENOSYS by falling
back to the old behavior.

> We can fix your application, and we'll break someone else's.

If you can point to an application that it'll break, I'd be a lot more
understanding.  Nobody did, last year.

> I don't think it's solveable, really - the range of applications is so
> broad, and the "standard" is so vague as to be useless.

I agree that standards are sometimes vague, but that one seemed about
as clear as it's possible to be without imposing unreasonably on
the file system and device driver layers.

What part of "The msync() function writes all modified data to
permanent storage locations [...] For mappings to files, the msync()
function ensures that all write operations are completed as defined
for synchronised I/O data integrity completion." suggests that it's not
supposed to do disk I/O?  How is that uselessly vague?

It says to me that msync's raison d'ĂȘtre is to write data from RAM to
stable storage.  If an application calls it too often, that's
the application's fault just as if it called sync(2) too often.

> This is why we've
> been extending these things with linux-specific goodies which permit
> applications to actually tell the kernel what they want to be done in a
> more finely-grained fashion.

Well, I still think the current Linux behavior is a bug, but there's a
usable (and run-time compatible) workaround that doesn't unreasonably
complicate the code, and that's good enough.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Andrew Morton
On 27 Mar 2007 15:24:52 -0400
[EMAIL PROTECTED] wrote:

> > * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> 
> Yes, I noticed.  See
> http://www.ussg.iu.edu/hypermail/linux/kernel/0602.1/0450.html
> for a bug report on the subject February 2006.

Suggest you use msync(MS_ASYNC), then
sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE).

The new (post 2.6.17) MAP_SHARED dirty-page semantics mean that the msync()
isn't actually needed.

> That's why this application is still running on 2.4.
> 
> As I mentioned at the time, the SUS says:
> (http://opengroup.org/onlinepubs/007908799/xsh/msync.html)
> "When MS_ASYNC is specified, msync() returns immediately once all the
> write operations are initiated or queued for servicing."
> 
> You can argue that putting it on the dirty list constitutes "queued for
> servicing", but the intent seems pretty clear to me: MS_ASYNC is supposed
> to start the I/O.  Although strict standards-ese parsing says that
> either branch of an or is acceptable, it is a common English language
> convention that the first alternative is preferred and the second
> is a fallback.
> 
> It makes sense in this case: start the write or, if that's not possible
> (the disk is already busy), queue it for service as soon as the disk
> is available.
> 
> They perhaps didn't mandate it this strictly, but that's clearly the
> intent.

We can fix your application, and we'll break someone else's.

I don't think it's solveable, really - the range of applications is so
broad, and the "standard" is so vague as to be useless.  This is why we've
been extending these things with linux-specific goodies which permit
applications to actually tell the kernel what they want to be done in a
more finely-grained fashion.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread linux
> * MS_ASYNC does not start I/O (it used to, up to 2.5.67).

Yes, I noticed.  See
http://www.ussg.iu.edu/hypermail/linux/kernel/0602.1/0450.html
for a bug report on the subject February 2006.

That's why this application is still running on 2.4.

As I mentioned at the time, the SUS says:
(http://opengroup.org/onlinepubs/007908799/xsh/msync.html)
"When MS_ASYNC is specified, msync() returns immediately once all the
write operations are initiated or queued for servicing."

You can argue that putting it on the dirty list constitutes "queued for
servicing", but the intent seems pretty clear to me: MS_ASYNC is supposed
to start the I/O.  Although strict standards-ese parsing says that
either branch of an or is acceptable, it is a common English language
convention that the first alternative is preferred and the second
is a fallback.

It makes sense in this case: start the write or, if that's not possible
(the disk is already busy), queue it for service as soon as the disk
is available.

They perhaps didn't mandate it this strictly, but that's clearly the
intent.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Andrew Morton
On Tue, 27 Mar 2007 20:55:51 +0200
Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> It's actually wrong about FADV_DONTNEED, which I think doesn't start
> writeout either.

case POSIX_FADV_DONTNEED:
if (!bdi_write_congested(mapping->backing_dev_info))
filemap_flush(mapping);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Miklos Szeredi
> > I agree that msync(MS_ASYNC) has no semantics if time is ignored.
> > But it's a useful way to tell the OS that the page is not going
> > to be dirtied again.
> 
> Just to clarify, here's the header comment for sys_msync():
> 
> /*
>  * MS_SYNC syncs the entire file - including mappings.
>  *
>  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
>  * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
>  * Now it doesn't do anything, since dirty pages are properly tracked.
>  *
>  * The application may now run fsync() to
>  * write out the dirty pages and wait on the writeout and check the result.
>  * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
>  * async writeout immediately.
>  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
>  * applications.
>  */
> 
> It's actually wrong about FADV_DONTNEED, which I think doesn't start
> writeout either.  So there you have it ;)

Sorry, FADV_DONTNEED _does_ seem to start writeout, but it does it
indiscriminately, not just on the specified range.  That should be
easy to fix though.

Miklos
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Miklos Szeredi
> > Yes, this will make msync(MS_ASYNC) more heavyweight again.  But if an
> > application doesn't want to update the timestamps, it should just omit
> > this call, since it does nothing else.
> 
> Er... FWIW, I have an application that makes heavy use of msync(MS_ASYNC)
> and doesn't care about timestamps.  (In fact, sometimes it's configured
> to write to a raw device and there are no timestamps.)
> 
> It's used as a poor man's portable async I/O.  The application logs
> data to disk, and sometimes needs to sync it to disk to ensure it has
> all been written.
> 
> To reduce long pauses when doing msync(MS_SYNC), it does msync(MS_ASYNC)
> as soon as a page is filled up to prompt asynchronous writeback.
> "I'm done writing this page and don't intend to write it again.
> Please start committing it to stable storage, but don't block me."
> 
> Then, occasionally, there's an msync(MS_SYNC) call to be sure the data
> is synced to disk.  This caused annoying hiccups before the MS_ASYNC
> calls were added.
> 
> 
> I agree that msync(MS_ASYNC) has no semantics if time is ignored.
> But it's a useful way to tell the OS that the page is not going
> to be dirtied again.

Just to clarify, here's the header comment for sys_msync():

/*
 * MS_SYNC syncs the entire file - including mappings.
 *
 * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
 * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
 * Now it doesn't do anything, since dirty pages are properly tracked.
 *
 * The application may now run fsync() to
 * write out the dirty pages and wait on the writeout and check the result.
 * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
 * async writeout immediately.
 * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
 * applications.
 */

It's actually wrong about FADV_DONTNEED, which I think doesn't start
writeout either.  So there you have it ;)

Miklos
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread linux
> Yes, this will make msync(MS_ASYNC) more heavyweight again.  But if an
> application doesn't want to update the timestamps, it should just omit
> this call, since it does nothing else.

Er... FWIW, I have an application that makes heavy use of msync(MS_ASYNC)
and doesn't care about timestamps.  (In fact, sometimes it's configured
to write to a raw device and there are no timestamps.)

It's used as a poor man's portable async I/O.  The application logs
data to disk, and sometimes needs to sync it to disk to ensure it has
all been written.

To reduce long pauses when doing msync(MS_SYNC), it does msync(MS_ASYNC)
as soon as a page is filled up to prompt asynchronous writeback.
"I'm done writing this page and don't intend to write it again.
Please start committing it to stable storage, but don't block me."

Then, occasionally, there's an msync(MS_SYNC) call to be sure the data
is synced to disk.  This caused annoying hiccups before the MS_ASYNC
calls were added.


I agree that msync(MS_ASYNC) has no semantics if time is ignored.
But it's a useful way to tell the OS that the page is not going
to be dirtied again.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Miklos Szeredi
> that's simpler ;) Is it correct enough though?  The place where it will
> become inaccurate is for repeated modification via an established map.  ie:
> 
>   p = mmap(..., MAP_SHARED);
>   for ( ; ; )
>   *p = 1;
> 
> in which case I think the timestamp will only get updated once per
> writeback interval (ie: 30 seconds).

Which is perfectly OK, we really can't do any better in any sane way.

My concern is only about MS_ASYNC, it would be really nice to know
what other OSs do.  I've checked on a Solaris 5.7 (not exactly a
modern OS) and that is as good as current Linux.

If someone has access to others pls. find my test program at the end.

The logical way to handle msync is IMO:

   write to memory + msync(MS_ASYNC) == write()
   write to memory + msync(MS_SYNC) == write() + fdatasync()

Yes, it would add some overhead for MS_ASYNC, but that's what the user
wants isn't it?  If the user doesn't want correctly updated
timestamps, it should not call msync().

Miklos

=== msync_time.c ==
#include 
#include 
#include 
#include 
#include 
#include 
#include 

static const char *filename;

static void print_times(const char *msg)
{
struct stat stbuf;
stat(filename, );
printf("%s\t%li\t%li\t%li\n", msg, stbuf.st_ctime, stbuf.st_mtime,
   stbuf.st_atime);
}

static void usage(const char *progname)
{
fprintf(stderr, "usage: %s filename [-s]\n", progname);
exit(1);
}

int main(int argc, char *argv[])
{
int res;
char *addr;
int msync_flag = MS_ASYNC;
int fd;

if (argc < 2)
usage(argv[0]);

filename = argv[1];
if (argc > 2) {
if (argc > 3)
usage(argv[0]);
if (strcmp(argv[2], "-s") == 0)
msync_flag = MS_SYNC;
else
usage(argv[0]);
}

fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, 0666);
if (fd == -1) {
perror(filename);
return 1;
}
print_times("begin");
sleep(1);
write(fd, "\n", 4);
print_times("write");
sleep(1);
addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (addr == (char *) -1) {
perror("mmap");
return 1;
}
print_times("mmap");
sleep(1);

addr[1] = 'b';
print_times("b");
sleep(1);
res = msync(addr, 4, msync_flag);
if (res == -1) {
perror("msync");
return 1;
}
print_times("msync b");
sleep(1);

addr[2] = 'c';
print_times("c");
sleep(1);
res = msync(addr, 4, msync_flag);
if (res == -1) {
perror("msync");
return 1;
}
print_times("msync c");
sleep(1);

addr[3] = 'd';
print_times("d");
sleep(1);
res = munmap(addr, 4);
if (res == -1) {
perror("munmap");
return 1;
}
print_times("munmap");
sleep(1);

res = close(fd);
if (res == -1) {
perror("close");
return 1;
}
print_times("end");
return 0;
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Andrew Morton
On Tue, 27 Mar 2007 11:23:06 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > > > > But Peter Staubach says a RH custumer has files written thorugh mmap,
> > > > > which are not being backed up.
> > > > 
> > > > Yes, I expect the backup problem is the major real-world hurt arising 
> > > > from
> > > > this bug.
> > > > 
> > > > But I expect we could adequately plug that problem at munmap()-time.  
> > > > Or,
> > > > better, do_wp_page().  As I said - half-assed.
> > > > 
> > > > It's a question if whether the backup problem is the only thing which 
> > > > is hurting
> > > > in the real-world, or if people have other problems.
> > > > 
> > > > (In fact, what's wrong with doing it in do_wp_page()?
> > > 
> > > It's rather more expensive, than just toggling a bit.
> > 
> > It shouldn't be, especially for filesystems which have one-second timestamp
> > granularity.
> > 
> > Filesystems which have s_time_gran=1 might hurt a bit, but no more than
> > they will with write().
> > 
> > Actually, no - we'd only update the mctime once per page per writeback
> > period (30 seconds by default) so the load will be small.
> 
> Why?  For each faulted page the times will be updated, no?

Yes, but only at pagefault-time.  And

- the faults are already "slow": we need to pull the page contents in
  from disk, or memset or cow the page

- we need to take the trap

compared to which, the cost of the timestamp update will (we hope) be
relatively low.

> Maybe it's acceptable, I don't really know the cost of
> file_update_time().
> 
> Tried this patch, and it seems to work.  It will even randomly update
> the time for tmpfs files (on initial fault, and on swapins).
> 
> Miklos
> 
> Index: linux/mm/memory.c
> ===
> --- linux.orig/mm/memory.c2007-03-27 11:04:40.0 +0200
> +++ linux/mm/memory.c 2007-03-27 11:08:19.0 +0200
> @@ -1664,6 +1664,8 @@ gotten:
>  unlock:
>   pte_unmap_unlock(page_table, ptl);
>   if (dirty_page) {
> + if (vma->vm_file)
> + file_update_time(vma->vm_file);
>   set_page_dirty_balance(dirty_page);
>   put_page(dirty_page);
>   }
> @@ -2316,6 +2318,8 @@ retry:
>  unlock:
>   pte_unmap_unlock(page_table, ptl);
>   if (dirty_page) {
> + if (vma->vm_file)
> + file_update_time(vma->vm_file);
>   set_page_dirty_balance(dirty_page);
>   put_page(dirty_page);
>   }

that's simpler ;) Is it correct enough though?  The place where it will
become inaccurate is for repeated modification via an established map.  ie:

p = mmap(..., MAP_SHARED);
for ( ; ; )
*p = 1;

in which case I think the timestamp will only get updated once per
writeback interval (ie: 30 seconds).


tmpfs files have an s_time_gran of 1, so benchmarking some workload on
tmpfs with this patch will tell us the worst-case overhead of the change. 

I guess we should arrange for multiple CPUs to perform write faults against
multiple pages of the same file in parallel.  Of course, that'd be a pretty
darn short benchmark because it'll run out of RAM.  Which reveals why we
probably won't have a performance problem in there.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Miklos Szeredi
> > > > But Peter Staubach says a RH custumer has files written thorugh mmap,
> > > > which are not being backed up.
> > > 
> > > Yes, I expect the backup problem is the major real-world hurt arising from
> > > this bug.
> > > 
> > > But I expect we could adequately plug that problem at munmap()-time.  Or,
> > > better, do_wp_page().  As I said - half-assed.
> > > 
> > > It's a question if whether the backup problem is the only thing which is 
> > > hurting
> > > in the real-world, or if people have other problems.
> > > 
> > > (In fact, what's wrong with doing it in do_wp_page()?
> > 
> > It's rather more expensive, than just toggling a bit.
> 
> It shouldn't be, especially for filesystems which have one-second timestamp
> granularity.
> 
> Filesystems which have s_time_gran=1 might hurt a bit, but no more than
> they will with write().
> 
> Actually, no - we'd only update the mctime once per page per writeback
> period (30 seconds by default) so the load will be small.

Why?  For each faulted page the times will be updated, no?

Maybe it's acceptable, I don't really know the cost of
file_update_time().

Tried this patch, and it seems to work.  It will even randomly update
the time for tmpfs files (on initial fault, and on swapins).

Miklos

Index: linux/mm/memory.c
===
--- linux.orig/mm/memory.c  2007-03-27 11:04:40.0 +0200
+++ linux/mm/memory.c   2007-03-27 11:08:19.0 +0200
@@ -1664,6 +1664,8 @@ gotten:
 unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+   if (vma->vm_file)
+   file_update_time(vma->vm_file);
set_page_dirty_balance(dirty_page);
put_page(dirty_page);
}
@@ -2316,6 +2318,8 @@ retry:
 unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+   if (vma->vm_file)
+   file_update_time(vma->vm_file);
set_page_dirty_balance(dirty_page);
put_page(dirty_page);
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Andrew Morton
On Tue, 27 Mar 2007 10:28:16 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > > But Peter Staubach says a RH custumer has files written thorugh mmap,
> > > which are not being backed up.
> > 
> > Yes, I expect the backup problem is the major real-world hurt arising from
> > this bug.
> > 
> > But I expect we could adequately plug that problem at munmap()-time.  Or,
> > better, do_wp_page().  As I said - half-assed.
> > 
> > It's a question if whether the backup problem is the only thing which is 
> > hurting
> > in the real-world, or if people have other problems.
> > 
> > (In fact, what's wrong with doing it in do_wp_page()?
> 
> It's rather more expensive, than just toggling a bit.

It shouldn't be, especially for filesystems which have one-second timestamp
granularity.

Filesystems which have s_time_gran=1 might hurt a bit, but no more than
they will with write().

Actually, no - we'd only update the mctime once per page per writeback
period (30 seconds by default) so the load will be small.  It'll be more
often if the user is doing a lot of pte-cleaning via msync() or fsync(),
but then the m/ctime writes will be the least of their problems. 

I'd have thought there were more substantial problems with something that
crude?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Miklos Szeredi
> > But Peter Staubach says a RH custumer has files written thorugh mmap,
> > which are not being backed up.
> 
> Yes, I expect the backup problem is the major real-world hurt arising from
> this bug.
> 
> But I expect we could adequately plug that problem at munmap()-time.  Or,
> better, do_wp_page().  As I said - half-assed.
> 
> It's a question if whether the backup problem is the only thing which is 
> hurting
> in the real-world, or if people have other problems.
> 
> (In fact, what's wrong with doing it in do_wp_page()?

It's rather more expensive, than just toggling a bit.

Let me work on it a bit more.  I think I can make the current patch
more palatable.

Miklos
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Andrew Morton
On Tue, 27 Mar 2007 10:03:41 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> But Peter Staubach says a RH custumer has files written thorugh mmap,
> which are not being backed up.

Yes, I expect the backup problem is the major real-world hurt arising from
this bug.

But I expect we could adequately plug that problem at munmap()-time.  Or,
better, do_wp_page().  As I said - half-assed.

It's a question if whether the backup problem is the only thing which is hurting
in the real-world, or if people have other problems.

(In fact, what's wrong with doing it in do_wp_page()?  The timestamp could
be up to 30 seconds too early, but that's heaps better than what we have
now..)


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Miklos Szeredi
> > > There is surely no need to duplicate all that.
> > 
> > Yeah, we could teach generic_writepages() to conditionally not submit
> > for io just test/clear pte dirtyness.
> > 
> > Maybe that would be somewhat cleaner, dunno.
> > 
> > Then there are the ram backed filesystems, which don't have dirty
> > accounting and radix trees, and for which this pte walking is still
> > needed to provide semantics consistent with normal filesystems.
> 
> hm.
> 
> I don't know how important all this is, really - we've had this bug for
> ever and presumably we've already trained everyone to work around it.
> 
> What usage scenarios are people actually hurting from?  Is there anything
> interesting in the mysterious Novell Bugzilla #206431?

That's just a failing LTP testcase, not quite real life ;)

But Peter Staubach says a RH custumer has files written thorugh mmap,
which are not being backed up.

> Perhaps we can get away with doing something half-assed which covers most
> requirements...

OK.  At least I can split the patch into two half asses.

The big question is tmpfs and friends.  Those won't get any timestamp
update without additional page table walking.

Miklos
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Andrew Morton
On Tue, 27 Mar 2007 09:36:50 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > There is surely no need to duplicate all that.
> 
> Yeah, we could teach generic_writepages() to conditionally not submit
> for io just test/clear pte dirtyness.
> 
> Maybe that would be somewhat cleaner, dunno.
> 
> Then there are the ram backed filesystems, which don't have dirty
> accounting and radix trees, and for which this pte walking is still
> needed to provide semantics consistent with normal filesystems.

hm.

I don't know how important all this is, really - we've had this bug for
ever and presumably we've already trained everyone to work around it.

What usage scenarios are people actually hurting from?  Is there anything
interesting in the mysterious Novell Bugzilla #206431?

Perhaps we can get away with doing something half-assed which covers most
requirements...
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Miklos Szeredi
> > > clear_page_dirty_for_io() already does that.
> > > 
> > > So we should be able to test PageDirtiedByWrite() after running
> > > clear_page_dirty_for_io() to discover whether this page was dirtied via
> > > MAP_SHARED, and then update the inode times if so.
> > 
> > What do you need the page flag for?
> 
> To work out whether the page was dirtied via write (in which case the
> timestamps were updated) or via mmap (in which case they were not).
> 
> >  The "modified through mmap" info
> > is there in the ptes.
> 
> It might not be there any more - the ptes could have got taken down by, for
> example, reclaim.

Yes, but then the "modified through mmap" is transferred to the
per-address_space flag.  All this is already done by this patch.

> I dunno - I'm not trying very hard.  I'm trying to encourage you to come up
> with something less costly and less complex than this thing, but you appear
> to be resisting.

No, I'm just arguing that your suggestion is actually a complication,
not a simplification ;)

> >  And from the ptes it can be transfered to a
> > per-address_space flag.  Nobody is interested through which page was
> > the file modified.
> > 
> > Anyway, that's just MS_SYNC.  MS_ASYNC doesn't walk the pages, yet it
> > should update the timestamp.  That's the difficult one.
> > 
> 
> We can treat MS_ASYNC as we treat MS_SYNC.  Then, MS_ASYNC *does* walk the
> pages.  Is does it in generic_writepages().  It also even walks the ptes
> for you, in clear_page_dirty_for_io().

Yes.  But that's not very useful semantic for MS_ASYNC vs. file time
update.

It would basically say:

  "if you cann MS_ASYNC, and the file was modified then sometime in
  the future you will get an updated c/mtime".

But this is not what POSIX says, and it's not what applications want.

For example "make" would want to know if a file was modified or not,
and with your suggestion only msync(MS_SYNC) would reliably provide
that info.  But msync(MS_SYNC) is unnecessary in many cases.

> There is surely no need to duplicate all that.

Yeah, we could teach generic_writepages() to conditionally not submit
for io just test/clear pte dirtyness.

Maybe that would be somewhat cleaner, dunno.

Then there are the ram backed filesystems, which don't have dirty
accounting and radix trees, and for which this pte walking is still
needed to provide semantics consistent with normal filesystems.

Miklos
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Andrew Morton
On Tue, 27 Mar 2007 08:55:40 +0200 Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > > > > This patch makes writing to shared memory mappings update st_ctime and
> > > > > st_mtime as defined by SUSv3:
> > > > 
> > > > Boy this is complicated.
> > > 
> > > You tell me?
> > > 
> > > > Is there a simpler way of doing all this?  Say, we define a new page 
> > > > flag
> > > > PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> > > > ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> > > > page-dirtiness.  Then, when performing writeback we look to see if any 
> > > > of
> > > > the dirty pages are !PageDirtiedByWrite() and, if so, we update 
> > > > [mc]time to
> > > > current-time.
> > > 
> > > I don't think a page flag gains anything over the address_space flag
> > > that this patch already has.
> > > 
> > > The complexity is not about keeping track of the "data modified
> > > through mmap" state, but about msync() guarantees, that POSIX wants.
> > > 
> > > And these requirements do in fact make some sense: msync() basically
> > > means:
> > > 
> > >   "I want the data written through mmaps to be visible to the world"
> > > 
> > > And that obviously includes updating the timestamps.
> > > 
> > > So how do we know if the data was modified between two msync()
> > > invocations?  The only sane way I can think of is to walk the page
> > > tables in msync() and test/clear the pte dirty bit.
> > 
> > clear_page_dirty_for_io() already does that.
> > 
> > So we should be able to test PageDirtiedByWrite() after running
> > clear_page_dirty_for_io() to discover whether this page was dirtied via
> > MAP_SHARED, and then update the inode times if so.
> 
> What do you need the page flag for?

To work out whether the page was dirtied via write (in which case the
timestamps were updated) or via mmap (in which case they were not).

>  The "modified through mmap" info
> is there in the ptes.

It might not be there any more - the ptes could have got taken down by, for
example, reclaim.

I dunno - I'm not trying very hard.  I'm trying to encourage you to come up
with something less costly and less complex than this thing, but you appear
to be resisting.

>  And from the ptes it can be transfered to a
> per-address_space flag.  Nobody is interested through which page was
> the file modified.
> 
> Anyway, that's just MS_SYNC.  MS_ASYNC doesn't walk the pages, yet it
> should update the timestamp.  That's the difficult one.
> 

We can treat MS_ASYNC as we treat MS_SYNC.  Then, MS_ASYNC *does* walk the
pages.  Is does it in generic_writepages().  It also even walks the ptes
for you, in clear_page_dirty_for_io().

There is surely no need to duplicate all that.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Andrew Morton
On Tue, 27 Mar 2007 08:55:40 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:

 This patch makes writing to shared memory mappings update st_ctime and
 st_mtime as defined by SUSv3:

Boy this is complicated.
   
   You tell me?
   
Is there a simpler way of doing all this?  Say, we define a new page 
flag
PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
page-dirtiness.  Then, when performing writeback we look to see if any 
of
the dirty pages are !PageDirtiedByWrite() and, if so, we update 
[mc]time to
current-time.
   
   I don't think a page flag gains anything over the address_space flag
   that this patch already has.
   
   The complexity is not about keeping track of the data modified
   through mmap state, but about msync() guarantees, that POSIX wants.
   
   And these requirements do in fact make some sense: msync() basically
   means:
   
 I want the data written through mmaps to be visible to the world
   
   And that obviously includes updating the timestamps.
   
   So how do we know if the data was modified between two msync()
   invocations?  The only sane way I can think of is to walk the page
   tables in msync() and test/clear the pte dirty bit.
  
  clear_page_dirty_for_io() already does that.
  
  So we should be able to test PageDirtiedByWrite() after running
  clear_page_dirty_for_io() to discover whether this page was dirtied via
  MAP_SHARED, and then update the inode times if so.
 
 What do you need the page flag for?

To work out whether the page was dirtied via write (in which case the
timestamps were updated) or via mmap (in which case they were not).

  The modified through mmap info
 is there in the ptes.

It might not be there any more - the ptes could have got taken down by, for
example, reclaim.

I dunno - I'm not trying very hard.  I'm trying to encourage you to come up
with something less costly and less complex than this thing, but you appear
to be resisting.

  And from the ptes it can be transfered to a
 per-address_space flag.  Nobody is interested through which page was
 the file modified.
 
 Anyway, that's just MS_SYNC.  MS_ASYNC doesn't walk the pages, yet it
 should update the timestamp.  That's the difficult one.
 

We can treat MS_ASYNC as we treat MS_SYNC.  Then, MS_ASYNC *does* walk the
pages.  Is does it in generic_writepages().  It also even walks the ptes
for you, in clear_page_dirty_for_io().

There is surely no need to duplicate all that.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Miklos Szeredi
   clear_page_dirty_for_io() already does that.
   
   So we should be able to test PageDirtiedByWrite() after running
   clear_page_dirty_for_io() to discover whether this page was dirtied via
   MAP_SHARED, and then update the inode times if so.
  
  What do you need the page flag for?
 
 To work out whether the page was dirtied via write (in which case the
 timestamps were updated) or via mmap (in which case they were not).
 
   The modified through mmap info
  is there in the ptes.
 
 It might not be there any more - the ptes could have got taken down by, for
 example, reclaim.

Yes, but then the modified through mmap is transferred to the
per-address_space flag.  All this is already done by this patch.

 I dunno - I'm not trying very hard.  I'm trying to encourage you to come up
 with something less costly and less complex than this thing, but you appear
 to be resisting.

No, I'm just arguing that your suggestion is actually a complication,
not a simplification ;)

   And from the ptes it can be transfered to a
  per-address_space flag.  Nobody is interested through which page was
  the file modified.
  
  Anyway, that's just MS_SYNC.  MS_ASYNC doesn't walk the pages, yet it
  should update the timestamp.  That's the difficult one.
  
 
 We can treat MS_ASYNC as we treat MS_SYNC.  Then, MS_ASYNC *does* walk the
 pages.  Is does it in generic_writepages().  It also even walks the ptes
 for you, in clear_page_dirty_for_io().

Yes.  But that's not very useful semantic for MS_ASYNC vs. file time
update.

It would basically say:

  if you cann MS_ASYNC, and the file was modified then sometime in
  the future you will get an updated c/mtime.

But this is not what POSIX says, and it's not what applications want.

For example make would want to know if a file was modified or not,
and with your suggestion only msync(MS_SYNC) would reliably provide
that info.  But msync(MS_SYNC) is unnecessary in many cases.

 There is surely no need to duplicate all that.

Yeah, we could teach generic_writepages() to conditionally not submit
for io just test/clear pte dirtyness.

Maybe that would be somewhat cleaner, dunno.

Then there are the ram backed filesystems, which don't have dirty
accounting and radix trees, and for which this pte walking is still
needed to provide semantics consistent with normal filesystems.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Andrew Morton
On Tue, 27 Mar 2007 09:36:50 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:

  There is surely no need to duplicate all that.
 
 Yeah, we could teach generic_writepages() to conditionally not submit
 for io just test/clear pte dirtyness.
 
 Maybe that would be somewhat cleaner, dunno.
 
 Then there are the ram backed filesystems, which don't have dirty
 accounting and radix trees, and for which this pte walking is still
 needed to provide semantics consistent with normal filesystems.

hm.

I don't know how important all this is, really - we've had this bug for
ever and presumably we've already trained everyone to work around it.

What usage scenarios are people actually hurting from?  Is there anything
interesting in the mysterious Novell Bugzilla #206431?

Perhaps we can get away with doing something half-assed which covers most
requirements...
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Miklos Szeredi
   There is surely no need to duplicate all that.
  
  Yeah, we could teach generic_writepages() to conditionally not submit
  for io just test/clear pte dirtyness.
  
  Maybe that would be somewhat cleaner, dunno.
  
  Then there are the ram backed filesystems, which don't have dirty
  accounting and radix trees, and for which this pte walking is still
  needed to provide semantics consistent with normal filesystems.
 
 hm.
 
 I don't know how important all this is, really - we've had this bug for
 ever and presumably we've already trained everyone to work around it.
 
 What usage scenarios are people actually hurting from?  Is there anything
 interesting in the mysterious Novell Bugzilla #206431?

That's just a failing LTP testcase, not quite real life ;)

But Peter Staubach says a RH custumer has files written thorugh mmap,
which are not being backed up.

 Perhaps we can get away with doing something half-assed which covers most
 requirements...

OK.  At least I can split the patch into two half asses.

The big question is tmpfs and friends.  Those won't get any timestamp
update without additional page table walking.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Andrew Morton
On Tue, 27 Mar 2007 10:03:41 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:

 But Peter Staubach says a RH custumer has files written thorugh mmap,
 which are not being backed up.

Yes, I expect the backup problem is the major real-world hurt arising from
this bug.

But I expect we could adequately plug that problem at munmap()-time.  Or,
better, do_wp_page().  As I said - half-assed.

It's a question if whether the backup problem is the only thing which is hurting
in the real-world, or if people have other problems.

(In fact, what's wrong with doing it in do_wp_page()?  The timestamp could
be up to 30 seconds too early, but that's heaps better than what we have
now..)


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Miklos Szeredi
  But Peter Staubach says a RH custumer has files written thorugh mmap,
  which are not being backed up.
 
 Yes, I expect the backup problem is the major real-world hurt arising from
 this bug.
 
 But I expect we could adequately plug that problem at munmap()-time.  Or,
 better, do_wp_page().  As I said - half-assed.
 
 It's a question if whether the backup problem is the only thing which is 
 hurting
 in the real-world, or if people have other problems.
 
 (In fact, what's wrong with doing it in do_wp_page()?

It's rather more expensive, than just toggling a bit.

Let me work on it a bit more.  I think I can make the current patch
more palatable.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Andrew Morton
On Tue, 27 Mar 2007 10:28:16 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:

   But Peter Staubach says a RH custumer has files written thorugh mmap,
   which are not being backed up.
  
  Yes, I expect the backup problem is the major real-world hurt arising from
  this bug.
  
  But I expect we could adequately plug that problem at munmap()-time.  Or,
  better, do_wp_page().  As I said - half-assed.
  
  It's a question if whether the backup problem is the only thing which is 
  hurting
  in the real-world, or if people have other problems.
  
  (In fact, what's wrong with doing it in do_wp_page()?
 
 It's rather more expensive, than just toggling a bit.

It shouldn't be, especially for filesystems which have one-second timestamp
granularity.

Filesystems which have s_time_gran=1 might hurt a bit, but no more than
they will with write().

Actually, no - we'd only update the mctime once per page per writeback
period (30 seconds by default) so the load will be small.  It'll be more
often if the user is doing a lot of pte-cleaning via msync() or fsync(),
but then the m/ctime writes will be the least of their problems. 

I'd have thought there were more substantial problems with something that
crude?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Miklos Szeredi
But Peter Staubach says a RH custumer has files written thorugh mmap,
which are not being backed up.
   
   Yes, I expect the backup problem is the major real-world hurt arising from
   this bug.
   
   But I expect we could adequately plug that problem at munmap()-time.  Or,
   better, do_wp_page().  As I said - half-assed.
   
   It's a question if whether the backup problem is the only thing which is 
   hurting
   in the real-world, or if people have other problems.
   
   (In fact, what's wrong with doing it in do_wp_page()?
  
  It's rather more expensive, than just toggling a bit.
 
 It shouldn't be, especially for filesystems which have one-second timestamp
 granularity.
 
 Filesystems which have s_time_gran=1 might hurt a bit, but no more than
 they will with write().
 
 Actually, no - we'd only update the mctime once per page per writeback
 period (30 seconds by default) so the load will be small.

Why?  For each faulted page the times will be updated, no?

Maybe it's acceptable, I don't really know the cost of
file_update_time().

Tried this patch, and it seems to work.  It will even randomly update
the time for tmpfs files (on initial fault, and on swapins).

Miklos

Index: linux/mm/memory.c
===
--- linux.orig/mm/memory.c  2007-03-27 11:04:40.0 +0200
+++ linux/mm/memory.c   2007-03-27 11:08:19.0 +0200
@@ -1664,6 +1664,8 @@ gotten:
 unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+   if (vma-vm_file)
+   file_update_time(vma-vm_file);
set_page_dirty_balance(dirty_page);
put_page(dirty_page);
}
@@ -2316,6 +2318,8 @@ retry:
 unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+   if (vma-vm_file)
+   file_update_time(vma-vm_file);
set_page_dirty_balance(dirty_page);
put_page(dirty_page);
}
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Andrew Morton
On Tue, 27 Mar 2007 11:23:06 +0200 Miklos Szeredi [EMAIL PROTECTED] wrote:

 But Peter Staubach says a RH custumer has files written thorugh mmap,
 which are not being backed up.

Yes, I expect the backup problem is the major real-world hurt arising 
from
this bug.

But I expect we could adequately plug that problem at munmap()-time.  
Or,
better, do_wp_page().  As I said - half-assed.

It's a question if whether the backup problem is the only thing which 
is hurting
in the real-world, or if people have other problems.

(In fact, what's wrong with doing it in do_wp_page()?
   
   It's rather more expensive, than just toggling a bit.
  
  It shouldn't be, especially for filesystems which have one-second timestamp
  granularity.
  
  Filesystems which have s_time_gran=1 might hurt a bit, but no more than
  they will with write().
  
  Actually, no - we'd only update the mctime once per page per writeback
  period (30 seconds by default) so the load will be small.
 
 Why?  For each faulted page the times will be updated, no?

Yes, but only at pagefault-time.  And

- the faults are already slow: we need to pull the page contents in
  from disk, or memset or cow the page

- we need to take the trap

compared to which, the cost of the timestamp update will (we hope) be
relatively low.

 Maybe it's acceptable, I don't really know the cost of
 file_update_time().
 
 Tried this patch, and it seems to work.  It will even randomly update
 the time for tmpfs files (on initial fault, and on swapins).
 
 Miklos
 
 Index: linux/mm/memory.c
 ===
 --- linux.orig/mm/memory.c2007-03-27 11:04:40.0 +0200
 +++ linux/mm/memory.c 2007-03-27 11:08:19.0 +0200
 @@ -1664,6 +1664,8 @@ gotten:
  unlock:
   pte_unmap_unlock(page_table, ptl);
   if (dirty_page) {
 + if (vma-vm_file)
 + file_update_time(vma-vm_file);
   set_page_dirty_balance(dirty_page);
   put_page(dirty_page);
   }
 @@ -2316,6 +2318,8 @@ retry:
  unlock:
   pte_unmap_unlock(page_table, ptl);
   if (dirty_page) {
 + if (vma-vm_file)
 + file_update_time(vma-vm_file);
   set_page_dirty_balance(dirty_page);
   put_page(dirty_page);
   }

that's simpler ;) Is it correct enough though?  The place where it will
become inaccurate is for repeated modification via an established map.  ie:

p = mmap(..., MAP_SHARED);
for ( ; ; )
*p = 1;

in which case I think the timestamp will only get updated once per
writeback interval (ie: 30 seconds).


tmpfs files have an s_time_gran of 1, so benchmarking some workload on
tmpfs with this patch will tell us the worst-case overhead of the change. 

I guess we should arrange for multiple CPUs to perform write faults against
multiple pages of the same file in parallel.  Of course, that'd be a pretty
darn short benchmark because it'll run out of RAM.  Which reveals why we
probably won't have a performance problem in there.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Miklos Szeredi
 that's simpler ;) Is it correct enough though?  The place where it will
 become inaccurate is for repeated modification via an established map.  ie:
 
   p = mmap(..., MAP_SHARED);
   for ( ; ; )
   *p = 1;
 
 in which case I think the timestamp will only get updated once per
 writeback interval (ie: 30 seconds).

Which is perfectly OK, we really can't do any better in any sane way.

My concern is only about MS_ASYNC, it would be really nice to know
what other OSs do.  I've checked on a Solaris 5.7 (not exactly a
modern OS) and that is as good as current Linux.

If someone has access to others pls. find my test program at the end.

The logical way to handle msync is IMO:

   write to memory + msync(MS_ASYNC) == write()
   write to memory + msync(MS_SYNC) == write() + fdatasync()

Yes, it would add some overhead for MS_ASYNC, but that's what the user
wants isn't it?  If the user doesn't want correctly updated
timestamps, it should not call msync().

Miklos

=== msync_time.c ==
#include stdio.h
#include stdlib.h
#include unistd.h
#include string.h
#include fcntl.h
#include sys/mman.h
#include sys/stat.h

static const char *filename;

static void print_times(const char *msg)
{
struct stat stbuf;
stat(filename, stbuf);
printf(%s\t%li\t%li\t%li\n, msg, stbuf.st_ctime, stbuf.st_mtime,
   stbuf.st_atime);
}

static void usage(const char *progname)
{
fprintf(stderr, usage: %s filename [-s]\n, progname);
exit(1);
}

int main(int argc, char *argv[])
{
int res;
char *addr;
int msync_flag = MS_ASYNC;
int fd;

if (argc  2)
usage(argv[0]);

filename = argv[1];
if (argc  2) {
if (argc  3)
usage(argv[0]);
if (strcmp(argv[2], -s) == 0)
msync_flag = MS_SYNC;
else
usage(argv[0]);
}

fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, 0666);
if (fd == -1) {
perror(filename);
return 1;
}
print_times(begin);
sleep(1);
write(fd, \n, 4);
print_times(write);
sleep(1);
addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (addr == (char *) -1) {
perror(mmap);
return 1;
}
print_times(mmap);
sleep(1);

addr[1] = 'b';
print_times(b);
sleep(1);
res = msync(addr, 4, msync_flag);
if (res == -1) {
perror(msync);
return 1;
}
print_times(msync b);
sleep(1);

addr[2] = 'c';
print_times(c);
sleep(1);
res = msync(addr, 4, msync_flag);
if (res == -1) {
perror(msync);
return 1;
}
print_times(msync c);
sleep(1);

addr[3] = 'd';
print_times(d);
sleep(1);
res = munmap(addr, 4);
if (res == -1) {
perror(munmap);
return 1;
}
print_times(munmap);
sleep(1);

res = close(fd);
if (res == -1) {
perror(close);
return 1;
}
print_times(end);
return 0;
}
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread linux
 Yes, this will make msync(MS_ASYNC) more heavyweight again.  But if an
 application doesn't want to update the timestamps, it should just omit
 this call, since it does nothing else.

Er... FWIW, I have an application that makes heavy use of msync(MS_ASYNC)
and doesn't care about timestamps.  (In fact, sometimes it's configured
to write to a raw device and there are no timestamps.)

It's used as a poor man's portable async I/O.  The application logs
data to disk, and sometimes needs to sync it to disk to ensure it has
all been written.

To reduce long pauses when doing msync(MS_SYNC), it does msync(MS_ASYNC)
as soon as a page is filled up to prompt asynchronous writeback.
I'm done writing this page and don't intend to write it again.
Please start committing it to stable storage, but don't block me.

Then, occasionally, there's an msync(MS_SYNC) call to be sure the data
is synced to disk.  This caused annoying hiccups before the MS_ASYNC
calls were added.


I agree that msync(MS_ASYNC) has no semantics if time is ignored.
But it's a useful way to tell the OS that the page is not going
to be dirtied again.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Miklos Szeredi
  Yes, this will make msync(MS_ASYNC) more heavyweight again.  But if an
  application doesn't want to update the timestamps, it should just omit
  this call, since it does nothing else.
 
 Er... FWIW, I have an application that makes heavy use of msync(MS_ASYNC)
 and doesn't care about timestamps.  (In fact, sometimes it's configured
 to write to a raw device and there are no timestamps.)
 
 It's used as a poor man's portable async I/O.  The application logs
 data to disk, and sometimes needs to sync it to disk to ensure it has
 all been written.
 
 To reduce long pauses when doing msync(MS_SYNC), it does msync(MS_ASYNC)
 as soon as a page is filled up to prompt asynchronous writeback.
 I'm done writing this page and don't intend to write it again.
 Please start committing it to stable storage, but don't block me.
 
 Then, occasionally, there's an msync(MS_SYNC) call to be sure the data
 is synced to disk.  This caused annoying hiccups before the MS_ASYNC
 calls were added.
 
 
 I agree that msync(MS_ASYNC) has no semantics if time is ignored.
 But it's a useful way to tell the OS that the page is not going
 to be dirtied again.

Just to clarify, here's the header comment for sys_msync():

/*
 * MS_SYNC syncs the entire file - including mappings.
 *
 * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
 * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
 * Now it doesn't do anything, since dirty pages are properly tracked.
 *
 * The application may now run fsync() to
 * write out the dirty pages and wait on the writeout and check the result.
 * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
 * async writeout immediately.
 * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
 * applications.
 */

It's actually wrong about FADV_DONTNEED, which I think doesn't start
writeout either.  So there you have it ;)

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Miklos Szeredi
  I agree that msync(MS_ASYNC) has no semantics if time is ignored.
  But it's a useful way to tell the OS that the page is not going
  to be dirtied again.
 
 Just to clarify, here's the header comment for sys_msync():
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
  * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
  * The application may now run fsync() to
  * write out the dirty pages and wait on the writeout and check the result.
  * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
  * async writeout immediately.
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
 
 It's actually wrong about FADV_DONTNEED, which I think doesn't start
 writeout either.  So there you have it ;)

Sorry, FADV_DONTNEED _does_ seem to start writeout, but it does it
indiscriminately, not just on the specified range.  That should be
easy to fix though.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Andrew Morton
On Tue, 27 Mar 2007 20:55:51 +0200
Miklos Szeredi [EMAIL PROTECTED] wrote:

 It's actually wrong about FADV_DONTNEED, which I think doesn't start
 writeout either.

case POSIX_FADV_DONTNEED:
if (!bdi_write_congested(mapping-backing_dev_info))
filemap_flush(mapping);
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread linux
 * MS_ASYNC does not start I/O (it used to, up to 2.5.67).

Yes, I noticed.  See
http://www.ussg.iu.edu/hypermail/linux/kernel/0602.1/0450.html
for a bug report on the subject February 2006.

That's why this application is still running on 2.4.

As I mentioned at the time, the SUS says:
(http://opengroup.org/onlinepubs/007908799/xsh/msync.html)
When MS_ASYNC is specified, msync() returns immediately once all the
write operations are initiated or queued for servicing.

You can argue that putting it on the dirty list constitutes queued for
servicing, but the intent seems pretty clear to me: MS_ASYNC is supposed
to start the I/O.  Although strict standards-ese parsing says that
either branch of an or is acceptable, it is a common English language
convention that the first alternative is preferred and the second
is a fallback.

It makes sense in this case: start the write or, if that's not possible
(the disk is already busy), queue it for service as soon as the disk
is available.

They perhaps didn't mandate it this strictly, but that's clearly the
intent.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Andrew Morton
On 27 Mar 2007 15:24:52 -0400
[EMAIL PROTECTED] wrote:

  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
 
 Yes, I noticed.  See
 http://www.ussg.iu.edu/hypermail/linux/kernel/0602.1/0450.html
 for a bug report on the subject February 2006.

Suggest you use msync(MS_ASYNC), then
sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE).

The new (post 2.6.17) MAP_SHARED dirty-page semantics mean that the msync()
isn't actually needed.

 That's why this application is still running on 2.4.
 
 As I mentioned at the time, the SUS says:
 (http://opengroup.org/onlinepubs/007908799/xsh/msync.html)
 When MS_ASYNC is specified, msync() returns immediately once all the
 write operations are initiated or queued for servicing.
 
 You can argue that putting it on the dirty list constitutes queued for
 servicing, but the intent seems pretty clear to me: MS_ASYNC is supposed
 to start the I/O.  Although strict standards-ese parsing says that
 either branch of an or is acceptable, it is a common English language
 convention that the first alternative is preferred and the second
 is a fallback.
 
 It makes sense in this case: start the write or, if that's not possible
 (the disk is already busy), queue it for service as soon as the disk
 is available.
 
 They perhaps didn't mandate it this strictly, but that's clearly the
 intent.

We can fix your application, and we'll break someone else's.

I don't think it's solveable, really - the range of applications is so
broad, and the standard is so vague as to be useless.  This is why we've
been extending these things with linux-specific goodies which permit
applications to actually tell the kernel what they want to be done in a
more finely-grained fashion.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread linux
 Suggest you use msync(MS_ASYNC), then
 sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE).

Thank you; I didn't know about that.  And I can handle -ENOSYS by falling
back to the old behavior.

 We can fix your application, and we'll break someone else's.

If you can point to an application that it'll break, I'd be a lot more
understanding.  Nobody did, last year.

 I don't think it's solveable, really - the range of applications is so
 broad, and the standard is so vague as to be useless.

I agree that standards are sometimes vague, but that one seemed about
as clear as it's possible to be without imposing unreasonably on
the file system and device driver layers.

What part of The msync() function writes all modified data to
permanent storage locations [...] For mappings to files, the msync()
function ensures that all write operations are completed as defined
for synchronised I/O data integrity completion. suggests that it's not
supposed to do disk I/O?  How is that uselessly vague?

It says to me that msync's raison d'ĂȘtre is to write data from RAM to
stable storage.  If an application calls it too often, that's
the application's fault just as if it called sync(2) too often.

 This is why we've
 been extending these things with linux-specific goodies which permit
 applications to actually tell the kernel what they want to be done in a
 more finely-grained fashion.

Well, I still think the current Linux behavior is a bug, but there's a
usable (and run-time compatible) workaround that doesn't unreasonably
complicate the code, and that's good enough.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Miklos Szeredi
  Suggest you use msync(MS_ASYNC), then
  sync_file_range(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE).
 
 Thank you; I didn't know about that.  And I can handle -ENOSYS by falling
 back to the old behavior.
 
  We can fix your application, and we'll break someone else's.
 
 If you can point to an application that it'll break, I'd be a lot more
 understanding.  Nobody did, last year.
 
  I don't think it's solveable, really - the range of applications is so
  broad, and the standard is so vague as to be useless.
 
 I agree that standards are sometimes vague, but that one seemed about
 as clear as it's possible to be without imposing unreasonably on
 the file system and device driver layers.
 
 What part of The msync() function writes all modified data to
 permanent storage locations [...] For mappings to files, the msync()
 function ensures that all write operations are completed as defined
 for synchronised I/O data integrity completion. suggests that it's not
 supposed to do disk I/O?  How is that uselessly vague?

Linux _will_ write all modified data to permanent storage locations.
Since 2.6.17 it will do this regardless of msync().  Before 2.6.17 you
do need msync() to enable data to be written back.

But it will not start I/O immediately, which is not a requirement in
the standard, or at least it's pretty vague about that.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread Andrew Morton
On 27 Mar 2007 16:09:33 -0400
[EMAIL PROTECTED] wrote:

 What part of The msync() function writes all modified data to
 permanent storage locations [...] For mappings to files, the msync()
 function ensures that all write operations are completed as defined
 for synchronised I/O data integrity completion. suggests that it's not
 supposed to do disk I/O?  How is that uselessly vague?
 

Because for MS_ASYNC, msync() shall return immediately once all the write
operations are initiated or queued for servicing.

ie: the writes can complete one millisecond or one week later.  We chose 30
seconds.

And this is not completely fatuous - before 2.6.17, MAP_SHARED pages could
float about in memory in a dirty state for arbitrarily long periods -
potentially for the entire application lifetime.  It was quite reasonable
for our MS_ASYNC implementation to do what it did: tell the VM about the
dirtiness of these pages so they get written back soon.

Post-2.6.17 we preserved that behaviour.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-27 Thread linux
 Linux _will_ write all modified data to permanent storage locations.
 Since 2.6.17 it will do this regardless of msync().  Before 2.6.17 you
 do need msync() to enable data to be written back.
 
 But it will not start I/O immediately, which is not a requirement in
 the standard, or at least it's pretty vague about that.

As I've said before, I disagree, but I'm not going to start a major
flame war about it.

The most relevant paragraph is:

# When MS_ASYNC is specified, msync() returns immediately once all the
# write operations are initiated or queued for servicing; when MS_SYNC is
# specified, msync() will not return until all write operations are
# completed as defined for synchronised I/O data integrity completion.
# Either MS_ASYNC or MS_SYNC is specified, but not both.

Note two things:
1) In the paragraphs before, what msync does is defined independently
   of the MS_* flags.  Only the time of the return to user space varies.
   Thus, whatever the delay between calling msync() and the data being
   written, it should be the same whether MS_SYNC or MS_ASYNC is used.

   The implementation intended is:
   - Start all I/O
   - If MS_SYNC, wait for I/O to complete
   - Return to user space

2) all the write operations are initiated or queued for servicing.
   It is a common convention in English (and most languages, I expect)
   that in the or is a preference for the first alternative.  The second
   is a permitted alternative if the first is not possible.

   And queued for servicing, especially initiated or queued for
   servicing, to me imples queuing waiting for some resource.  To have
   the resource being waited for be a timer expiry seems like rather a
   cheat to me.  It's perhaps doesn't break the letter of the standard,
   but definitely bends it.  It feels like a fiddle.

Still, the basic hint function of msync(MS_ASYNC) *is* being accomplished:
I don't expect to write this page any more, so now would be a good time
to clean it.
It would just make my life easier if the kernel procrastinated less.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-26 Thread Miklos Szeredi
> > > > This patch makes writing to shared memory mappings update st_ctime and
> > > > st_mtime as defined by SUSv3:
> > > 
> > > Boy this is complicated.
> > 
> > You tell me?
> > 
> > > Is there a simpler way of doing all this?  Say, we define a new page flag
> > > PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> > > ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> > > page-dirtiness.  Then, when performing writeback we look to see if any of
> > > the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time 
> > > to
> > > current-time.
> > 
> > I don't think a page flag gains anything over the address_space flag
> > that this patch already has.
> > 
> > The complexity is not about keeping track of the "data modified
> > through mmap" state, but about msync() guarantees, that POSIX wants.
> > 
> > And these requirements do in fact make some sense: msync() basically
> > means:
> > 
> >   "I want the data written through mmaps to be visible to the world"
> > 
> > And that obviously includes updating the timestamps.
> > 
> > So how do we know if the data was modified between two msync()
> > invocations?  The only sane way I can think of is to walk the page
> > tables in msync() and test/clear the pte dirty bit.
> 
> clear_page_dirty_for_io() already does that.
> 
> So we should be able to test PageDirtiedByWrite() after running
> clear_page_dirty_for_io() to discover whether this page was dirtied via
> MAP_SHARED, and then update the inode times if so.

What do you need the page flag for?  The "modified through mmap" info
is there in the ptes.  And from the ptes it can be transfered to a
per-address_space flag.  Nobody is interested through which page was
the file modified.

Anyway, that's just MS_SYNC.  MS_ASYNC doesn't walk the pages, yet it
should update the timestamp.  That's the difficult one.

Miklos
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-26 Thread Andrew Morton
On Mon, 26 Mar 2007 23:43:08 +0200
Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> > > This patch makes writing to shared memory mappings update st_ctime and
> > > st_mtime as defined by SUSv3:
> > 
> > Boy this is complicated.
> 
> You tell me?
> 
> > Is there a simpler way of doing all this?  Say, we define a new page flag
> > PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> > ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> > page-dirtiness.  Then, when performing writeback we look to see if any of
> > the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> > current-time.
> 
> I don't think a page flag gains anything over the address_space flag
> that this patch already has.
> 
> The complexity is not about keeping track of the "data modified
> through mmap" state, but about msync() guarantees, that POSIX wants.
> 
> And these requirements do in fact make some sense: msync() basically
> means:
> 
>   "I want the data written through mmaps to be visible to the world"
> 
> And that obviously includes updating the timestamps.
> 
> So how do we know if the data was modified between two msync()
> invocations?  The only sane way I can think of is to walk the page
> tables in msync() and test/clear the pte dirty bit.

clear_page_dirty_for_io() already does that.

So we should be able to test PageDirtiedByWrite() after running
clear_page_dirty_for_io() to discover whether this page was dirtied via
MAP_SHARED, and then update the inode times if so.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-26 Thread Andrew Morton
On Mon, 26 Mar 2007 16:10:09 -0500
Matt Mackall <[EMAIL PROTECTED]> wrote:

> On Mon, Mar 26, 2007 at 02:00:36PM -0700, Andrew Morton wrote:
> > On Sun, 25 Mar 2007 23:10:21 +0200
> > Miklos Szeredi <[EMAIL PROTECTED]> wrote:
> > 
> > > This patch makes writing to shared memory mappings update st_ctime and
> > > st_mtime as defined by SUSv3:
> > 
> > Boy this is complicated.
> > 
> > Is there a simpler way of doing all this?  Say, we define a new page flag
> > PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> > ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> > page-dirtiness.  Then, when performing writeback we look to see if any of
> > the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> > current-time.
> > 
> > Or something like that - I'm just thinking out loud and picking holes in
> > the above doesn't shut me up ;) We're adding complexity and some overhead
> > and we're losing our recent msync() simplifications and this all hurts.  Is
> > there some other way?  I think burning a page flag to avoid this additional
> > complexity would be worthwhile.  
> 
> Aren't we basically out of those?

Rafael has liberated a couple of the flags which swsusp was using.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-26 Thread Miklos Szeredi
> > This patch makes writing to shared memory mappings update st_ctime and
> > st_mtime as defined by SUSv3:
> 
> Boy this is complicated.

You tell me?

> Is there a simpler way of doing all this?  Say, we define a new page flag
> PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> page-dirtiness.  Then, when performing writeback we look to see if any of
> the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> current-time.

I don't think a page flag gains anything over the address_space flag
that this patch already has.

The complexity is not about keeping track of the "data modified
through mmap" state, but about msync() guarantees, that POSIX wants.

And these requirements do in fact make some sense: msync() basically
means:

  "I want the data written through mmaps to be visible to the world"

And that obviously includes updating the timestamps.

So how do we know if the data was modified between two msync()
invocations?  The only sane way I can think of is to walk the page
tables in msync() and test/clear the pte dirty bit.

Yes, this will make msync(MS_ASYNC) more heavyweight again.  But if an
application doesn't want to update the timestamps, it should just omit
this call, since it does nothing else.

There shouldn't be any other side effect.

Miklos
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-26 Thread Matt Mackall
On Mon, Mar 26, 2007 at 02:00:36PM -0700, Andrew Morton wrote:
> On Sun, 25 Mar 2007 23:10:21 +0200
> Miklos Szeredi <[EMAIL PROTECTED]> wrote:
> 
> > This patch makes writing to shared memory mappings update st_ctime and
> > st_mtime as defined by SUSv3:
> 
> Boy this is complicated.
> 
> Is there a simpler way of doing all this?  Say, we define a new page flag
> PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
> ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
> page-dirtiness.  Then, when performing writeback we look to see if any of
> the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
> current-time.
> 
> Or something like that - I'm just thinking out loud and picking holes in
> the above doesn't shut me up ;) We're adding complexity and some overhead
> and we're losing our recent msync() simplifications and this all hurts.  Is
> there some other way?  I think burning a page flag to avoid this additional
> complexity would be worthwhile.  

Aren't we basically out of those?

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-26 Thread Andrew Morton
On Sun, 25 Mar 2007 23:10:21 +0200
Miklos Szeredi <[EMAIL PROTECTED]> wrote:

> This patch makes writing to shared memory mappings update st_ctime and
> st_mtime as defined by SUSv3:

Boy this is complicated.

Is there a simpler way of doing all this?  Say, we define a new page flag
PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
page-dirtiness.  Then, when performing writeback we look to see if any of
the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
current-time.

Or something like that - I'm just thinking out loud and picking holes in
the above doesn't shut me up ;) We're adding complexity and some overhead
and we're losing our recent msync() simplifications and this all hurts.  Is
there some other way?  I think burning a page flag to avoid this additional
complexity would be worthwhile.  
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-26 Thread Andrew Morton
On Sun, 25 Mar 2007 23:10:21 +0200
Miklos Szeredi [EMAIL PROTECTED] wrote:

 This patch makes writing to shared memory mappings update st_ctime and
 st_mtime as defined by SUSv3:

Boy this is complicated.

Is there a simpler way of doing all this?  Say, we define a new page flag
PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
page-dirtiness.  Then, when performing writeback we look to see if any of
the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
current-time.

Or something like that - I'm just thinking out loud and picking holes in
the above doesn't shut me up ;) We're adding complexity and some overhead
and we're losing our recent msync() simplifications and this all hurts.  Is
there some other way?  I think burning a page flag to avoid this additional
complexity would be worthwhile.  
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-26 Thread Matt Mackall
On Mon, Mar 26, 2007 at 02:00:36PM -0700, Andrew Morton wrote:
 On Sun, 25 Mar 2007 23:10:21 +0200
 Miklos Szeredi [EMAIL PROTECTED] wrote:
 
  This patch makes writing to shared memory mappings update st_ctime and
  st_mtime as defined by SUSv3:
 
 Boy this is complicated.
 
 Is there a simpler way of doing all this?  Say, we define a new page flag
 PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
 ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
 page-dirtiness.  Then, when performing writeback we look to see if any of
 the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
 current-time.
 
 Or something like that - I'm just thinking out loud and picking holes in
 the above doesn't shut me up ;) We're adding complexity and some overhead
 and we're losing our recent msync() simplifications and this all hurts.  Is
 there some other way?  I think burning a page flag to avoid this additional
 complexity would be worthwhile.  

Aren't we basically out of those?

-- 
Mathematics is the supreme nostalgia of our time.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-26 Thread Miklos Szeredi
  This patch makes writing to shared memory mappings update st_ctime and
  st_mtime as defined by SUSv3:
 
 Boy this is complicated.

You tell me?

 Is there a simpler way of doing all this?  Say, we define a new page flag
 PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
 ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
 page-dirtiness.  Then, when performing writeback we look to see if any of
 the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
 current-time.

I don't think a page flag gains anything over the address_space flag
that this patch already has.

The complexity is not about keeping track of the data modified
through mmap state, but about msync() guarantees, that POSIX wants.

And these requirements do in fact make some sense: msync() basically
means:

  I want the data written through mmaps to be visible to the world

And that obviously includes updating the timestamps.

So how do we know if the data was modified between two msync()
invocations?  The only sane way I can think of is to walk the page
tables in msync() and test/clear the pte dirty bit.

Yes, this will make msync(MS_ASYNC) more heavyweight again.  But if an
application doesn't want to update the timestamps, it should just omit
this call, since it does nothing else.

There shouldn't be any other side effect.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-26 Thread Andrew Morton
On Mon, 26 Mar 2007 16:10:09 -0500
Matt Mackall [EMAIL PROTECTED] wrote:

 On Mon, Mar 26, 2007 at 02:00:36PM -0700, Andrew Morton wrote:
  On Sun, 25 Mar 2007 23:10:21 +0200
  Miklos Szeredi [EMAIL PROTECTED] wrote:
  
   This patch makes writing to shared memory mappings update st_ctime and
   st_mtime as defined by SUSv3:
  
  Boy this is complicated.
  
  Is there a simpler way of doing all this?  Say, we define a new page flag
  PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
  ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
  page-dirtiness.  Then, when performing writeback we look to see if any of
  the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
  current-time.
  
  Or something like that - I'm just thinking out loud and picking holes in
  the above doesn't shut me up ;) We're adding complexity and some overhead
  and we're losing our recent msync() simplifications and this all hurts.  Is
  there some other way?  I think burning a page flag to avoid this additional
  complexity would be worthwhile.  
 
 Aren't we basically out of those?

Rafael has liberated a couple of the flags which swsusp was using.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-26 Thread Andrew Morton
On Mon, 26 Mar 2007 23:43:08 +0200
Miklos Szeredi [EMAIL PROTECTED] wrote:

   This patch makes writing to shared memory mappings update st_ctime and
   st_mtime as defined by SUSv3:
  
  Boy this is complicated.
 
 You tell me?
 
  Is there a simpler way of doing all this?  Say, we define a new page flag
  PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
  ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
  page-dirtiness.  Then, when performing writeback we look to see if any of
  the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time to
  current-time.
 
 I don't think a page flag gains anything over the address_space flag
 that this patch already has.
 
 The complexity is not about keeping track of the data modified
 through mmap state, but about msync() guarantees, that POSIX wants.
 
 And these requirements do in fact make some sense: msync() basically
 means:
 
   I want the data written through mmaps to be visible to the world
 
 And that obviously includes updating the timestamps.
 
 So how do we know if the data was modified between two msync()
 invocations?  The only sane way I can think of is to walk the page
 tables in msync() and test/clear the pte dirty bit.

clear_page_dirty_for_io() already does that.

So we should be able to test PageDirtiedByWrite() after running
clear_page_dirty_for_io() to discover whether this page was dirtied via
MAP_SHARED, and then update the inode times if so.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch resend v4] update ctime and mtime for mmaped write

2007-03-26 Thread Miklos Szeredi
This patch makes writing to shared memory mappings update st_ctime and
st_mtime as defined by SUSv3:
   
   Boy this is complicated.
  
  You tell me?
  
   Is there a simpler way of doing all this?  Say, we define a new page flag
   PG_dirtiedbywrite and we do SetPageDirtiedByWrite() inside write() and
   ClearPageDirtiedByWrite() whenever we propagate pte-dirtiness into
   page-dirtiness.  Then, when performing writeback we look to see if any of
   the dirty pages are !PageDirtiedByWrite() and, if so, we update [mc]time 
   to
   current-time.
  
  I don't think a page flag gains anything over the address_space flag
  that this patch already has.
  
  The complexity is not about keeping track of the data modified
  through mmap state, but about msync() guarantees, that POSIX wants.
  
  And these requirements do in fact make some sense: msync() basically
  means:
  
I want the data written through mmaps to be visible to the world
  
  And that obviously includes updating the timestamps.
  
  So how do we know if the data was modified between two msync()
  invocations?  The only sane way I can think of is to walk the page
  tables in msync() and test/clear the pte dirty bit.
 
 clear_page_dirty_for_io() already does that.
 
 So we should be able to test PageDirtiedByWrite() after running
 clear_page_dirty_for_io() to discover whether this page was dirtied via
 MAP_SHARED, and then update the inode times if so.

What do you need the page flag for?  The modified through mmap info
is there in the ptes.  And from the ptes it can be transfered to a
per-address_space flag.  Nobody is interested through which page was
the file modified.

Anyway, that's just MS_SYNC.  MS_ASYNC doesn't walk the pages, yet it
should update the timestamp.  That's the difficult one.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch resend v4] update ctime and mtime for mmaped write

2007-03-25 Thread Miklos Szeredi
From: Miklos Szeredi <[EMAIL PROTECTED]>

Changes:
v4:
 o rename test_clear_page_modified to page_mkclean_noprot
 o clean up page_mkclean_noprot
 o don't set AS_CMTIME from fault handler, since that also sets the PTE dirty
 o only update c/mtime in munmap, if file is not mapped any more
 o cleanups
v3:
 o rename is_page_modified to test_clear_page_modified
v2:
 o set AS_CMTIME flag in clear_page_dirty_for_io() too
 o don't clear AS_CMTIME in file_update_time()
 o check the dirty bit in the page tables
v1:
 o moved check from __fput() to remove_vma(), which is more logical
 o changed set_page_dirty() to set_page_dirty_mapping in hugetlb.c
 o cleaned up #ifdef CONFIG_BLOCK mess

This patch makes writing to shared memory mappings update st_ctime and
st_mtime as defined by SUSv3:

   The st_ctime and st_mtime fields of a file that is mapped with
   MAP_SHARED and PROT_WRITE shall be marked for update at some point
   in the interval between a write reference to the mapped region and
   the next call to msync() with MS_ASYNC or MS_SYNC for that portion
   of the file by any process. If there is no such call and if the
   underlying file is modified as a result of a write reference, then
   these fields shall be marked for update at some time after the
   write reference.

A new address_space flag is introduced: AS_CMTIME.  This is set each
time dirtyness from the page table is transferred to the page or if
the page is dirtied without the page table being set dirty.

Note, the flag is set unconditionally, even if the page is already
dirty.  This is important, because the page might have been dirtied
earlier by a non-mmap write.

Msync checks this flag and also dirty bit in the page tables, because
the data might change again after an msync(MS_ASYNC), while the page
is already dirty and read-write.  This also makes the time updating
work for memory backed filesystems such as tmpfs.

This implementation walks the pages in the synced range, and uses rmap
to find all the ptes for each page.  Non-linear vmas are ignored,
since the ptes can only be found by scanning the whole vma, which is
very inefficient.

As an optimization, if dirty pages are accounted, then only walk the
dirty pages, since the clean pages necessarily have clean ptes.  This
doesn't work for memory backed filesystems, where no dirty accounting
is done.

An alternative implementation could check for all intersecting vmas in
the mapping and walk the page tables for each.  This would probably be
more efficient for memory backed filesystems and if the number of
dirty pages is near the total number of pages in the range.

In munmap if there are no more memory mappings of this file, and the
AS_CMTIME flag has been set, the file times are updated.

Fixes Novell Bugzilla #206431.

Inspired by Peter Staubach's patch and the resulting comments.

Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
---

Index: linux-2.6.21-rc4-mm1/include/linux/pagemap.h
===
--- linux-2.6.21-rc4-mm1.orig/include/linux/pagemap.h   2007-03-25 
19:00:06.0 +0200
+++ linux-2.6.21-rc4-mm1/include/linux/pagemap.h2007-03-25 
19:00:35.0 +0200
@@ -19,6 +19,7 @@
  */
 #defineAS_EIO  (__GFP_BITS_SHIFT + 0)  /* IO error on async 
write */
 #define AS_ENOSPC  (__GFP_BITS_SHIFT + 1)  /* ENOSPC on async write */
+#define AS_CMTIME  (__GFP_BITS_SHIFT + 2)  /* ctime/mtime update needed */
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
Index: linux-2.6.21-rc4-mm1/include/linux/mm.h
===
--- linux-2.6.21-rc4-mm1.orig/include/linux/mm.h2007-03-25 
19:00:06.0 +0200
+++ linux-2.6.21-rc4-mm1/include/linux/mm.h 2007-03-25 19:00:36.0 
+0200
@@ -808,6 +808,7 @@ int redirty_page_for_writepage(struct wr
struct page *page);
 int FASTCALL(set_page_dirty(struct page *page));
 int set_page_dirty_lock(struct page *page);
+int set_page_dirty_mapping(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
 
 extern unsigned long do_mremap(unsigned long addr,
Index: linux-2.6.21-rc4-mm1/mm/memory.c
===
--- linux-2.6.21-rc4-mm1.orig/mm/memory.c   2007-03-25 19:00:06.0 
+0200
+++ linux-2.6.21-rc4-mm1/mm/memory.c2007-03-25 19:00:36.0 +0200
@@ -676,7 +676,7 @@ static unsigned long zap_pte_range(struc
anon_rss--;
else {
if (pte_dirty(ptent))
-   set_page_dirty(page);
+   set_page_dirty_mapping(page);
if (pte_young(ptent))
SetPageReferenced(page);
file_rss--;
@@ -954,7 +954,7 @@ struct page 

[patch resend v4] update ctime and mtime for mmaped write

2007-03-25 Thread Miklos Szeredi
From: Miklos Szeredi [EMAIL PROTECTED]

Changes:
v4:
 o rename test_clear_page_modified to page_mkclean_noprot
 o clean up page_mkclean_noprot
 o don't set AS_CMTIME from fault handler, since that also sets the PTE dirty
 o only update c/mtime in munmap, if file is not mapped any more
 o cleanups
v3:
 o rename is_page_modified to test_clear_page_modified
v2:
 o set AS_CMTIME flag in clear_page_dirty_for_io() too
 o don't clear AS_CMTIME in file_update_time()
 o check the dirty bit in the page tables
v1:
 o moved check from __fput() to remove_vma(), which is more logical
 o changed set_page_dirty() to set_page_dirty_mapping in hugetlb.c
 o cleaned up #ifdef CONFIG_BLOCK mess

This patch makes writing to shared memory mappings update st_ctime and
st_mtime as defined by SUSv3:

   The st_ctime and st_mtime fields of a file that is mapped with
   MAP_SHARED and PROT_WRITE shall be marked for update at some point
   in the interval between a write reference to the mapped region and
   the next call to msync() with MS_ASYNC or MS_SYNC for that portion
   of the file by any process. If there is no such call and if the
   underlying file is modified as a result of a write reference, then
   these fields shall be marked for update at some time after the
   write reference.

A new address_space flag is introduced: AS_CMTIME.  This is set each
time dirtyness from the page table is transferred to the page or if
the page is dirtied without the page table being set dirty.

Note, the flag is set unconditionally, even if the page is already
dirty.  This is important, because the page might have been dirtied
earlier by a non-mmap write.

Msync checks this flag and also dirty bit in the page tables, because
the data might change again after an msync(MS_ASYNC), while the page
is already dirty and read-write.  This also makes the time updating
work for memory backed filesystems such as tmpfs.

This implementation walks the pages in the synced range, and uses rmap
to find all the ptes for each page.  Non-linear vmas are ignored,
since the ptes can only be found by scanning the whole vma, which is
very inefficient.

As an optimization, if dirty pages are accounted, then only walk the
dirty pages, since the clean pages necessarily have clean ptes.  This
doesn't work for memory backed filesystems, where no dirty accounting
is done.

An alternative implementation could check for all intersecting vmas in
the mapping and walk the page tables for each.  This would probably be
more efficient for memory backed filesystems and if the number of
dirty pages is near the total number of pages in the range.

In munmap if there are no more memory mappings of this file, and the
AS_CMTIME flag has been set, the file times are updated.

Fixes Novell Bugzilla #206431.

Inspired by Peter Staubach's patch and the resulting comments.

Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
---

Index: linux-2.6.21-rc4-mm1/include/linux/pagemap.h
===
--- linux-2.6.21-rc4-mm1.orig/include/linux/pagemap.h   2007-03-25 
19:00:06.0 +0200
+++ linux-2.6.21-rc4-mm1/include/linux/pagemap.h2007-03-25 
19:00:35.0 +0200
@@ -19,6 +19,7 @@
  */
 #defineAS_EIO  (__GFP_BITS_SHIFT + 0)  /* IO error on async 
write */
 #define AS_ENOSPC  (__GFP_BITS_SHIFT + 1)  /* ENOSPC on async write */
+#define AS_CMTIME  (__GFP_BITS_SHIFT + 2)  /* ctime/mtime update needed */
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
Index: linux-2.6.21-rc4-mm1/include/linux/mm.h
===
--- linux-2.6.21-rc4-mm1.orig/include/linux/mm.h2007-03-25 
19:00:06.0 +0200
+++ linux-2.6.21-rc4-mm1/include/linux/mm.h 2007-03-25 19:00:36.0 
+0200
@@ -808,6 +808,7 @@ int redirty_page_for_writepage(struct wr
struct page *page);
 int FASTCALL(set_page_dirty(struct page *page));
 int set_page_dirty_lock(struct page *page);
+int set_page_dirty_mapping(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
 
 extern unsigned long do_mremap(unsigned long addr,
Index: linux-2.6.21-rc4-mm1/mm/memory.c
===
--- linux-2.6.21-rc4-mm1.orig/mm/memory.c   2007-03-25 19:00:06.0 
+0200
+++ linux-2.6.21-rc4-mm1/mm/memory.c2007-03-25 19:00:36.0 +0200
@@ -676,7 +676,7 @@ static unsigned long zap_pte_range(struc
anon_rss--;
else {
if (pte_dirty(ptent))
-   set_page_dirty(page);
+   set_page_dirty_mapping(page);
if (pte_young(ptent))
SetPageReferenced(page);
file_rss--;
@@ -954,7 +954,7 @@ struct page