Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-20 Thread Dave Chinner
On Tue, Aug 20, 2013 at 05:47:10PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 20, 2013 at 3:43 PM, Dave Chinner  wrote:
> > On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote:
> >> On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner  wrote:
> >> > On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
> >> >> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara  wrote:
> >> >> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
> >> >> >> >> I could require ->writepages *and* ->flush_cmtime to handle the 
> >> >> >> >> time
> >> >> >> >> update, but that would complicate non-transactional filesystems.
> >> >> >> >> Those filesystems should just flush cmtime at the end of 
> >> >> >> >> writepages.
> >> >> >> >
> >> >> >> > do_writepages() is the wrong place to do such updates - we can get
> >> >> >> > writeback directly through .writepage, so the time updates need to
> >> >> >> > be in .writepage. That first .writepage call will clear the bit on
> >> >> >> > the mapping, so it's only done on the first call to .writepage on
> >> >> >> > the given mapping.
> >> >> >>
> >> >> >> Last time I checked, all the paths that actually needed the timestamp
> >> >> >> update went through .writepages.  I'll double-check.
> >> >> >   kswapd can call just .writepage to do the writeout so timestamp 
> >> >> > update
> >> >> > should be handled there as well. Otherwise all pages in a mapping can 
> >> >> > be
> >> >> > cleaned without timestamp being updated.
> >> >>
> >> >> OK, I'll fix that.
> >> >>
> >> >> >
> >> >> > Which btw made me realize that even your scheme doesn't completely 
> >> >> > make
> >> >> > sure timestamp is updated after mmap write - if you have pages 0 and 
> >> >> > 1, you
> >> >> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 
> >> >> > 4096)
> >> >> > is called. We write the page 0, writeprotect it, update timestamps. 
> >> >> > But
> >> >> > page 1 is still writeable so writes to it won't set CMTIME flag, 
> >> >> > neither
> >> >> > update the timestamp... Not that I think this can be reasonably 
> >> >> > solved but
> >> >> > it is a food for thought.
> >> >>
> >> >> This should already work.  AS_CMTIME is set when the pte goes from
> >> >> dirty to clean, not when the pte goes from wp to writable.  So
> >> >> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
> >> >> be set and a subsequent writepages call will update the timestamp.
> >> >
> >> > Oh, I missed that - I thought you were setting AS_CMTIME during
> >> > .page_mkwrite.
> >> >
> >> > Setting it in clear_page_dirty_for_io() is too late for filesystems
> >> > to include it in their existing transactions during .writepage, (at
> >> > least for XFs and ext4) because they do their delayed allocation
> >> > transactions before changing page state
> >>
> >> Couldn't it go between mpage_map_and_submit_extent and
> >> ext4_journal_stop in ext4_writepages?
> >
> > Maybe - I'm not an ext4 expert - but even if you can make it work
> > for ext4 in some way, that doesn't mean it is possible for any other
> > filesystem to use the same method. You're adding code to generic,
> > non-filesystem specific code paths and so the solutions need to be
> > generic rather not tied to how a specific filesystem is implemented.
> >
> 
> I don't see the problem for xfs or btrfs either.
> 
> xfs uses generic_writepages, which already does the right thing.  (xfs
> with my updated patches passes my tests.)  xfs_vm_writepage calls
> xfs_start_page_writeback(..., 1, ...), so clear_page_dirty_for_io is
> called.  At that point (I presume), it would still be possible to add
> metadata to a transaction (assuming there's a transaction open -- I
> don't have a clue here). 

That's my point - there isn't a transaction in XFS at this point,
and so if we gather that flag from clear_page_dirty_for_io() we'd
need a second transaction and therefore the optimisation you want
filesystems to use to mitigate the additional overhead can't be done
for all commonly used filesystems.

> Even if this is too late, xfs_vm_writepage
> could call page_mkwrite to for AS_CMTIME to be set if needed.
> page_mkwrite will be fast if the page isn't mmapped.  What am I
> missing?

That it leads to different behaviour for different filesystems.

i.e. page_mkwrite on page A sets the flag. writeback on a range that
doesn't include page A occurs, sees the flag, clears it after
updating the timestamp. Some time later writeback on page A occurs,
no timestamp update occurs.

The behaviour needs to be consistent across filesystems.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-20 Thread Andy Lutomirski
On Tue, Aug 20, 2013 at 3:43 PM, Dave Chinner  wrote:
> On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote:
>> On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner  wrote:
>> > On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
>> >> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara  wrote:
>> >> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
>> >> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
>> >> >> >> update, but that would complicate non-transactional filesystems.
>> >> >> >> Those filesystems should just flush cmtime at the end of writepages.
>> >> >> >
>> >> >> > do_writepages() is the wrong place to do such updates - we can get
>> >> >> > writeback directly through .writepage, so the time updates need to
>> >> >> > be in .writepage. That first .writepage call will clear the bit on
>> >> >> > the mapping, so it's only done on the first call to .writepage on
>> >> >> > the given mapping.
>> >> >>
>> >> >> Last time I checked, all the paths that actually needed the timestamp
>> >> >> update went through .writepages.  I'll double-check.
>> >> >   kswapd can call just .writepage to do the writeout so timestamp update
>> >> > should be handled there as well. Otherwise all pages in a mapping can be
>> >> > cleaned without timestamp being updated.
>> >>
>> >> OK, I'll fix that.
>> >>
>> >> >
>> >> > Which btw made me realize that even your scheme doesn't completely make
>> >> > sure timestamp is updated after mmap write - if you have pages 0 and 1, 
>> >> > you
>> >> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 
>> >> > 4096)
>> >> > is called. We write the page 0, writeprotect it, update timestamps. But
>> >> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
>> >> > update the timestamp... Not that I think this can be reasonably solved 
>> >> > but
>> >> > it is a food for thought.
>> >>
>> >> This should already work.  AS_CMTIME is set when the pte goes from
>> >> dirty to clean, not when the pte goes from wp to writable.  So
>> >> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
>> >> be set and a subsequent writepages call will update the timestamp.
>> >
>> > Oh, I missed that - I thought you were setting AS_CMTIME during
>> > .page_mkwrite.
>> >
>> > Setting it in clear_page_dirty_for_io() is too late for filesystems
>> > to include it in their existing transactions during .writepage, (at
>> > least for XFs and ext4) because they do their delayed allocation
>> > transactions before changing page state
>>
>> Couldn't it go between mpage_map_and_submit_extent and
>> ext4_journal_stop in ext4_writepages?
>
> Maybe - I'm not an ext4 expert - but even if you can make it work
> for ext4 in some way, that doesn't mean it is possible for any other
> filesystem to use the same method. You're adding code to generic,
> non-filesystem specific code paths and so the solutions need to be
> generic rather not tied to how a specific filesystem is implemented.
>

I don't see the problem for xfs or btrfs either.

xfs uses generic_writepages, which already does the right thing.  (xfs
with my updated patches passes my tests.)  xfs_vm_writepage calls
xfs_start_page_writeback(..., 1, ...), so clear_page_dirty_for_io is
called.  At that point (I presume), it would still be possible to add
metadata to a transaction (assuming there's a transaction open -- I
don't have a clue here).  Even if this is too late, xfs_vm_writepage
could call page_mkwrite to for AS_CMTIME to be set if needed.
page_mkwrite will be fast if the page isn't mmapped.  What am I
missing?

btrfs seems to do much the same thing.

--Andy


> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-20 Thread Dave Chinner
On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner  wrote:
> > On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
> >> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara  wrote:
> >> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
> >> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
> >> >> >> update, but that would complicate non-transactional filesystems.
> >> >> >> Those filesystems should just flush cmtime at the end of writepages.
> >> >> >
> >> >> > do_writepages() is the wrong place to do such updates - we can get
> >> >> > writeback directly through .writepage, so the time updates need to
> >> >> > be in .writepage. That first .writepage call will clear the bit on
> >> >> > the mapping, so it's only done on the first call to .writepage on
> >> >> > the given mapping.
> >> >>
> >> >> Last time I checked, all the paths that actually needed the timestamp
> >> >> update went through .writepages.  I'll double-check.
> >> >   kswapd can call just .writepage to do the writeout so timestamp update
> >> > should be handled there as well. Otherwise all pages in a mapping can be
> >> > cleaned without timestamp being updated.
> >>
> >> OK, I'll fix that.
> >>
> >> >
> >> > Which btw made me realize that even your scheme doesn't completely make
> >> > sure timestamp is updated after mmap write - if you have pages 0 and 1, 
> >> > you
> >> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 
> >> > 4096)
> >> > is called. We write the page 0, writeprotect it, update timestamps. But
> >> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
> >> > update the timestamp... Not that I think this can be reasonably solved 
> >> > but
> >> > it is a food for thought.
> >>
> >> This should already work.  AS_CMTIME is set when the pte goes from
> >> dirty to clean, not when the pte goes from wp to writable.  So
> >> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
> >> be set and a subsequent writepages call will update the timestamp.
> >
> > Oh, I missed that - I thought you were setting AS_CMTIME during
> > .page_mkwrite.
> >
> > Setting it in clear_page_dirty_for_io() is too late for filesystems
> > to include it in their existing transactions during .writepage, (at
> > least for XFs and ext4) because they do their delayed allocation
> > transactions before changing page state
> 
> Couldn't it go between mpage_map_and_submit_extent and
> ext4_journal_stop in ext4_writepages?

Maybe - I'm not an ext4 expert - but even if you can make it work
for ext4 in some way, that doesn't mean it is possible for any other
filesystem to use the same method. You're adding code to generic,
non-filesystem specific code paths and so the solutions need to be
generic rather not tied to how a specific filesystem is implemented.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-20 Thread Andy Lutomirski
On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner  wrote:
> On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
>> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara  wrote:
>> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
>> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
>> >> >> update, but that would complicate non-transactional filesystems.
>> >> >> Those filesystems should just flush cmtime at the end of writepages.
>> >> >
>> >> > do_writepages() is the wrong place to do such updates - we can get
>> >> > writeback directly through .writepage, so the time updates need to
>> >> > be in .writepage. That first .writepage call will clear the bit on
>> >> > the mapping, so it's only done on the first call to .writepage on
>> >> > the given mapping.
>> >>
>> >> Last time I checked, all the paths that actually needed the timestamp
>> >> update went through .writepages.  I'll double-check.
>> >   kswapd can call just .writepage to do the writeout so timestamp update
>> > should be handled there as well. Otherwise all pages in a mapping can be
>> > cleaned without timestamp being updated.
>>
>> OK, I'll fix that.
>>
>> >
>> > Which btw made me realize that even your scheme doesn't completely make
>> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
>> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
>> > is called. We write the page 0, writeprotect it, update timestamps. But
>> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
>> > update the timestamp... Not that I think this can be reasonably solved but
>> > it is a food for thought.
>>
>> This should already work.  AS_CMTIME is set when the pte goes from
>> dirty to clean, not when the pte goes from wp to writable.  So
>> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
>> be set and a subsequent writepages call will update the timestamp.
>
> Oh, I missed that - I thought you were setting AS_CMTIME during
> .page_mkwrite.
>
> Setting it in clear_page_dirty_for_io() is too late for filesystems
> to include it in their existing transactions during .writepage, (at
> least for XFs and ext4) because they do their delayed allocation
> transactions before changing page state

Couldn't it go between mpage_map_and_submit_extent and
ext4_journal_stop in ext4_writepages?

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-20 Thread Dave Chinner
On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara  wrote:
> > On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
> >> >> I could require ->writepages *and* ->flush_cmtime to handle the time
> >> >> update, but that would complicate non-transactional filesystems.
> >> >> Those filesystems should just flush cmtime at the end of writepages.
> >> >
> >> > do_writepages() is the wrong place to do such updates - we can get
> >> > writeback directly through .writepage, so the time updates need to
> >> > be in .writepage. That first .writepage call will clear the bit on
> >> > the mapping, so it's only done on the first call to .writepage on
> >> > the given mapping.
> >>
> >> Last time I checked, all the paths that actually needed the timestamp
> >> update went through .writepages.  I'll double-check.
> >   kswapd can call just .writepage to do the writeout so timestamp update
> > should be handled there as well. Otherwise all pages in a mapping can be
> > cleaned without timestamp being updated.
> 
> OK, I'll fix that.
> 
> >
> > Which btw made me realize that even your scheme doesn't completely make
> > sure timestamp is updated after mmap write - if you have pages 0 and 1, you
> > write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
> > is called. We write the page 0, writeprotect it, update timestamps. But
> > page 1 is still writeable so writes to it won't set CMTIME flag, neither
> > update the timestamp... Not that I think this can be reasonably solved but
> > it is a food for thought.
> 
> This should already work.  AS_CMTIME is set when the pte goes from
> dirty to clean, not when the pte goes from wp to writable.  So
> whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
> be set and a subsequent writepages call will update the timestamp.

Oh, I missed that - I thought you were setting AS_CMTIME during
.page_mkwrite.

Setting it in clear_page_dirty_for_io() is too late for filesystems
to include it in their existing transactions during .writepage, (at
least for XFs and ext4) because they do their delayed allocation
transactions before changing page state

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-20 Thread Andy Lutomirski
On Tue, Aug 20, 2013 at 9:42 AM, Andy Lutomirski  wrote:
> On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara  wrote:
>> On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
>>> >> I could require ->writepages *and* ->flush_cmtime to handle the time
>>> >> update, but that would complicate non-transactional filesystems.
>>> >> Those filesystems should just flush cmtime at the end of writepages.
>>> >
>>> > do_writepages() is the wrong place to do such updates - we can get
>>> > writeback directly through .writepage, so the time updates need to
>>> > be in .writepage. That first .writepage call will clear the bit on
>>> > the mapping, so it's only done on the first call to .writepage on
>>> > the given mapping.
>>>
>>> Last time I checked, all the paths that actually needed the timestamp
>>> update went through .writepages.  I'll double-check.
>>   kswapd can call just .writepage to do the writeout so timestamp update
>> should be handled there as well. Otherwise all pages in a mapping can be
>> cleaned without timestamp being updated.
>
> OK, I'll fix that.

This is a bit ugly.  mpage_writepages and generic_writepages both call
->writepage.  If writepage starts checking cmtime, then writepages
will do multiple timestamp updates on filesystems that use this code.

I can modify vmscan and migrate to flush cmtime -- they seem to be the
only callers of ->writepage that aren't themselves called from
->writepages.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-20 Thread Andy Lutomirski
On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara  wrote:
> On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
>> >> I could require ->writepages *and* ->flush_cmtime to handle the time
>> >> update, but that would complicate non-transactional filesystems.
>> >> Those filesystems should just flush cmtime at the end of writepages.
>> >
>> > do_writepages() is the wrong place to do such updates - we can get
>> > writeback directly through .writepage, so the time updates need to
>> > be in .writepage. That first .writepage call will clear the bit on
>> > the mapping, so it's only done on the first call to .writepage on
>> > the given mapping.
>>
>> Last time I checked, all the paths that actually needed the timestamp
>> update went through .writepages.  I'll double-check.
>   kswapd can call just .writepage to do the writeout so timestamp update
> should be handled there as well. Otherwise all pages in a mapping can be
> cleaned without timestamp being updated.

OK, I'll fix that.

>
> Which btw made me realize that even your scheme doesn't completely make
> sure timestamp is updated after mmap write - if you have pages 0 and 1, you
> write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
> is called. We write the page 0, writeprotect it, update timestamps. But
> page 1 is still writeable so writes to it won't set CMTIME flag, neither
> update the timestamp... Not that I think this can be reasonably solved but
> it is a food for thought.

This should already work.  AS_CMTIME is set when the pte goes from
dirty to clean, not when the pte goes from wp to writable.  So
whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
be set and a subsequent writepages call will update the timestamp.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-20 Thread Jan Kara
On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
> >> I could require ->writepages *and* ->flush_cmtime to handle the time
> >> update, but that would complicate non-transactional filesystems.
> >> Those filesystems should just flush cmtime at the end of writepages.
> >
> > do_writepages() is the wrong place to do such updates - we can get
> > writeback directly through .writepage, so the time updates need to
> > be in .writepage. That first .writepage call will clear the bit on
> > the mapping, so it's only done on the first call to .writepage on
> > the given mapping.
> 
> Last time I checked, all the paths that actually needed the timestamp
> update went through .writepages.  I'll double-check.
  kswapd can call just .writepage to do the writeout so timestamp update
should be handled there as well. Otherwise all pages in a mapping can be
cleaned without timestamp being updated.

Which btw made me realize that even your scheme doesn't completely make
sure timestamp is updated after mmap write - if you have pages 0 and 1, you
write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
is called. We write the page 0, writeprotect it, update timestamps. But
page 1 is still writeable so writes to it won't set CMTIME flag, neither
update the timestamp... Not that I think this can be reasonably solved but
it is a food for thought.

Honza
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-20 Thread Jan Kara
On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
  I could require -writepages *and* -flush_cmtime to handle the time
  update, but that would complicate non-transactional filesystems.
  Those filesystems should just flush cmtime at the end of writepages.
 
  do_writepages() is the wrong place to do such updates - we can get
  writeback directly through .writepage, so the time updates need to
  be in .writepage. That first .writepage call will clear the bit on
  the mapping, so it's only done on the first call to .writepage on
  the given mapping.
 
 Last time I checked, all the paths that actually needed the timestamp
 update went through .writepages.  I'll double-check.
  kswapd can call just .writepage to do the writeout so timestamp update
should be handled there as well. Otherwise all pages in a mapping can be
cleaned without timestamp being updated.

Which btw made me realize that even your scheme doesn't completely make
sure timestamp is updated after mmap write - if you have pages 0 and 1, you
write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
is called. We write the page 0, writeprotect it, update timestamps. But
page 1 is still writeable so writes to it won't set CMTIME flag, neither
update the timestamp... Not that I think this can be reasonably solved but
it is a food for thought.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-20 Thread Andy Lutomirski
On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara j...@suse.cz wrote:
 On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
  I could require -writepages *and* -flush_cmtime to handle the time
  update, but that would complicate non-transactional filesystems.
  Those filesystems should just flush cmtime at the end of writepages.
 
  do_writepages() is the wrong place to do such updates - we can get
  writeback directly through .writepage, so the time updates need to
  be in .writepage. That first .writepage call will clear the bit on
  the mapping, so it's only done on the first call to .writepage on
  the given mapping.

 Last time I checked, all the paths that actually needed the timestamp
 update went through .writepages.  I'll double-check.
   kswapd can call just .writepage to do the writeout so timestamp update
 should be handled there as well. Otherwise all pages in a mapping can be
 cleaned without timestamp being updated.

OK, I'll fix that.


 Which btw made me realize that even your scheme doesn't completely make
 sure timestamp is updated after mmap write - if you have pages 0 and 1, you
 write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
 is called. We write the page 0, writeprotect it, update timestamps. But
 page 1 is still writeable so writes to it won't set CMTIME flag, neither
 update the timestamp... Not that I think this can be reasonably solved but
 it is a food for thought.

This should already work.  AS_CMTIME is set when the pte goes from
dirty to clean, not when the pte goes from wp to writable.  So
whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
be set and a subsequent writepages call will update the timestamp.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-20 Thread Andy Lutomirski
On Tue, Aug 20, 2013 at 9:42 AM, Andy Lutomirski l...@amacapital.net wrote:
 On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara j...@suse.cz wrote:
 On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
  I could require -writepages *and* -flush_cmtime to handle the time
  update, but that would complicate non-transactional filesystems.
  Those filesystems should just flush cmtime at the end of writepages.
 
  do_writepages() is the wrong place to do such updates - we can get
  writeback directly through .writepage, so the time updates need to
  be in .writepage. That first .writepage call will clear the bit on
  the mapping, so it's only done on the first call to .writepage on
  the given mapping.

 Last time I checked, all the paths that actually needed the timestamp
 update went through .writepages.  I'll double-check.
   kswapd can call just .writepage to do the writeout so timestamp update
 should be handled there as well. Otherwise all pages in a mapping can be
 cleaned without timestamp being updated.

 OK, I'll fix that.

This is a bit ugly.  mpage_writepages and generic_writepages both call
-writepage.  If writepage starts checking cmtime, then writepages
will do multiple timestamp updates on filesystems that use this code.

I can modify vmscan and migrate to flush cmtime -- they seem to be the
only callers of -writepage that aren't themselves called from
-writepages.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-20 Thread Dave Chinner
On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
 On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara j...@suse.cz wrote:
  On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
   I could require -writepages *and* -flush_cmtime to handle the time
   update, but that would complicate non-transactional filesystems.
   Those filesystems should just flush cmtime at the end of writepages.
  
   do_writepages() is the wrong place to do such updates - we can get
   writeback directly through .writepage, so the time updates need to
   be in .writepage. That first .writepage call will clear the bit on
   the mapping, so it's only done on the first call to .writepage on
   the given mapping.
 
  Last time I checked, all the paths that actually needed the timestamp
  update went through .writepages.  I'll double-check.
kswapd can call just .writepage to do the writeout so timestamp update
  should be handled there as well. Otherwise all pages in a mapping can be
  cleaned without timestamp being updated.
 
 OK, I'll fix that.
 
 
  Which btw made me realize that even your scheme doesn't completely make
  sure timestamp is updated after mmap write - if you have pages 0 and 1, you
  write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
  is called. We write the page 0, writeprotect it, update timestamps. But
  page 1 is still writeable so writes to it won't set CMTIME flag, neither
  update the timestamp... Not that I think this can be reasonably solved but
  it is a food for thought.
 
 This should already work.  AS_CMTIME is set when the pte goes from
 dirty to clean, not when the pte goes from wp to writable.  So
 whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
 be set and a subsequent writepages call will update the timestamp.

Oh, I missed that - I thought you were setting AS_CMTIME during
.page_mkwrite.

Setting it in clear_page_dirty_for_io() is too late for filesystems
to include it in their existing transactions during .writepage, (at
least for XFs and ext4) because they do their delayed allocation
transactions before changing page state

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-20 Thread Andy Lutomirski
On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner da...@fromorbit.com wrote:
 On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
 On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara j...@suse.cz wrote:
  On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
   I could require -writepages *and* -flush_cmtime to handle the time
   update, but that would complicate non-transactional filesystems.
   Those filesystems should just flush cmtime at the end of writepages.
  
   do_writepages() is the wrong place to do such updates - we can get
   writeback directly through .writepage, so the time updates need to
   be in .writepage. That first .writepage call will clear the bit on
   the mapping, so it's only done on the first call to .writepage on
   the given mapping.
 
  Last time I checked, all the paths that actually needed the timestamp
  update went through .writepages.  I'll double-check.
kswapd can call just .writepage to do the writeout so timestamp update
  should be handled there as well. Otherwise all pages in a mapping can be
  cleaned without timestamp being updated.

 OK, I'll fix that.

 
  Which btw made me realize that even your scheme doesn't completely make
  sure timestamp is updated after mmap write - if you have pages 0 and 1, you
  write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 4096)
  is called. We write the page 0, writeprotect it, update timestamps. But
  page 1 is still writeable so writes to it won't set CMTIME flag, neither
  update the timestamp... Not that I think this can be reasonably solved but
  it is a food for thought.

 This should already work.  AS_CMTIME is set when the pte goes from
 dirty to clean, not when the pte goes from wp to writable.  So
 whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
 be set and a subsequent writepages call will update the timestamp.

 Oh, I missed that - I thought you were setting AS_CMTIME during
 .page_mkwrite.

 Setting it in clear_page_dirty_for_io() is too late for filesystems
 to include it in their existing transactions during .writepage, (at
 least for XFs and ext4) because they do their delayed allocation
 transactions before changing page state

Couldn't it go between mpage_map_and_submit_extent and
ext4_journal_stop in ext4_writepages?


 Cheers,

 Dave.
 --
 Dave Chinner
 da...@fromorbit.com



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-20 Thread Dave Chinner
On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote:
 On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner da...@fromorbit.com wrote:
  On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
  On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara j...@suse.cz wrote:
   On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
I could require -writepages *and* -flush_cmtime to handle the time
update, but that would complicate non-transactional filesystems.
Those filesystems should just flush cmtime at the end of writepages.
   
do_writepages() is the wrong place to do such updates - we can get
writeback directly through .writepage, so the time updates need to
be in .writepage. That first .writepage call will clear the bit on
the mapping, so it's only done on the first call to .writepage on
the given mapping.
  
   Last time I checked, all the paths that actually needed the timestamp
   update went through .writepages.  I'll double-check.
 kswapd can call just .writepage to do the writeout so timestamp update
   should be handled there as well. Otherwise all pages in a mapping can be
   cleaned without timestamp being updated.
 
  OK, I'll fix that.
 
  
   Which btw made me realize that even your scheme doesn't completely make
   sure timestamp is updated after mmap write - if you have pages 0 and 1, 
   you
   write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 
   4096)
   is called. We write the page 0, writeprotect it, update timestamps. But
   page 1 is still writeable so writes to it won't set CMTIME flag, neither
   update the timestamp... Not that I think this can be reasonably solved 
   but
   it is a food for thought.
 
  This should already work.  AS_CMTIME is set when the pte goes from
  dirty to clean, not when the pte goes from wp to writable.  So
  whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
  be set and a subsequent writepages call will update the timestamp.
 
  Oh, I missed that - I thought you were setting AS_CMTIME during
  .page_mkwrite.
 
  Setting it in clear_page_dirty_for_io() is too late for filesystems
  to include it in their existing transactions during .writepage, (at
  least for XFs and ext4) because they do their delayed allocation
  transactions before changing page state
 
 Couldn't it go between mpage_map_and_submit_extent and
 ext4_journal_stop in ext4_writepages?

Maybe - I'm not an ext4 expert - but even if you can make it work
for ext4 in some way, that doesn't mean it is possible for any other
filesystem to use the same method. You're adding code to generic,
non-filesystem specific code paths and so the solutions need to be
generic rather not tied to how a specific filesystem is implemented.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-20 Thread Andy Lutomirski
On Tue, Aug 20, 2013 at 3:43 PM, Dave Chinner da...@fromorbit.com wrote:
 On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote:
 On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner da...@fromorbit.com wrote:
  On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
  On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara j...@suse.cz wrote:
   On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
I could require -writepages *and* -flush_cmtime to handle the time
update, but that would complicate non-transactional filesystems.
Those filesystems should just flush cmtime at the end of writepages.
   
do_writepages() is the wrong place to do such updates - we can get
writeback directly through .writepage, so the time updates need to
be in .writepage. That first .writepage call will clear the bit on
the mapping, so it's only done on the first call to .writepage on
the given mapping.
  
   Last time I checked, all the paths that actually needed the timestamp
   update went through .writepages.  I'll double-check.
 kswapd can call just .writepage to do the writeout so timestamp update
   should be handled there as well. Otherwise all pages in a mapping can be
   cleaned without timestamp being updated.
 
  OK, I'll fix that.
 
  
   Which btw made me realize that even your scheme doesn't completely make
   sure timestamp is updated after mmap write - if you have pages 0 and 1, 
   you
   write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 
   4096)
   is called. We write the page 0, writeprotect it, update timestamps. But
   page 1 is still writeable so writes to it won't set CMTIME flag, neither
   update the timestamp... Not that I think this can be reasonably solved 
   but
   it is a food for thought.
 
  This should already work.  AS_CMTIME is set when the pte goes from
  dirty to clean, not when the pte goes from wp to writable.  So
  whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
  be set and a subsequent writepages call will update the timestamp.
 
  Oh, I missed that - I thought you were setting AS_CMTIME during
  .page_mkwrite.
 
  Setting it in clear_page_dirty_for_io() is too late for filesystems
  to include it in their existing transactions during .writepage, (at
  least for XFs and ext4) because they do their delayed allocation
  transactions before changing page state

 Couldn't it go between mpage_map_and_submit_extent and
 ext4_journal_stop in ext4_writepages?

 Maybe - I'm not an ext4 expert - but even if you can make it work
 for ext4 in some way, that doesn't mean it is possible for any other
 filesystem to use the same method. You're adding code to generic,
 non-filesystem specific code paths and so the solutions need to be
 generic rather not tied to how a specific filesystem is implemented.


I don't see the problem for xfs or btrfs either.

xfs uses generic_writepages, which already does the right thing.  (xfs
with my updated patches passes my tests.)  xfs_vm_writepage calls
xfs_start_page_writeback(..., 1, ...), so clear_page_dirty_for_io is
called.  At that point (I presume), it would still be possible to add
metadata to a transaction (assuming there's a transaction open -- I
don't have a clue here).  Even if this is too late, xfs_vm_writepage
could call page_mkwrite to for AS_CMTIME to be set if needed.
page_mkwrite will be fast if the page isn't mmapped.  What am I
missing?

btrfs seems to do much the same thing.

--Andy


 Cheers,

 Dave.
 --
 Dave Chinner
 da...@fromorbit.com



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-20 Thread Dave Chinner
On Tue, Aug 20, 2013 at 05:47:10PM -0700, Andy Lutomirski wrote:
 On Tue, Aug 20, 2013 at 3:43 PM, Dave Chinner da...@fromorbit.com wrote:
  On Tue, Aug 20, 2013 at 02:54:01PM -0700, Andy Lutomirski wrote:
  On Tue, Aug 20, 2013 at 2:48 PM, Dave Chinner da...@fromorbit.com wrote:
   On Tue, Aug 20, 2013 at 09:42:34AM -0700, Andy Lutomirski wrote:
   On Tue, Aug 20, 2013 at 9:00 AM, Jan Kara j...@suse.cz wrote:
On Mon 19-08-13 21:14:44, Andy Lutomirski wrote:
 I could require -writepages *and* -flush_cmtime to handle the 
 time
 update, but that would complicate non-transactional filesystems.
 Those filesystems should just flush cmtime at the end of 
 writepages.

 do_writepages() is the wrong place to do such updates - we can get
 writeback directly through .writepage, so the time updates need to
 be in .writepage. That first .writepage call will clear the bit on
 the mapping, so it's only done on the first call to .writepage on
 the given mapping.
   
Last time I checked, all the paths that actually needed the timestamp
update went through .writepages.  I'll double-check.
  kswapd can call just .writepage to do the writeout so timestamp 
update
should be handled there as well. Otherwise all pages in a mapping can 
be
cleaned without timestamp being updated.
  
   OK, I'll fix that.
  
   
Which btw made me realize that even your scheme doesn't completely 
make
sure timestamp is updated after mmap write - if you have pages 0 and 
1, you
write to both of them - CMTIME flag gets set. Then fsync_range(fd, 0, 
4096)
is called. We write the page 0, writeprotect it, update timestamps. 
But
page 1 is still writeable so writes to it won't set CMTIME flag, 
neither
update the timestamp... Not that I think this can be reasonably 
solved but
it is a food for thought.
  
   This should already work.  AS_CMTIME is set when the pte goes from
   dirty to clean, not when the pte goes from wp to writable.  So
   whenever clear_page_dirty_for_io is called on page 1, AS_CMTIME will
   be set and a subsequent writepages call will update the timestamp.
  
   Oh, I missed that - I thought you were setting AS_CMTIME during
   .page_mkwrite.
  
   Setting it in clear_page_dirty_for_io() is too late for filesystems
   to include it in their existing transactions during .writepage, (at
   least for XFs and ext4) because they do their delayed allocation
   transactions before changing page state
 
  Couldn't it go between mpage_map_and_submit_extent and
  ext4_journal_stop in ext4_writepages?
 
  Maybe - I'm not an ext4 expert - but even if you can make it work
  for ext4 in some way, that doesn't mean it is possible for any other
  filesystem to use the same method. You're adding code to generic,
  non-filesystem specific code paths and so the solutions need to be
  generic rather not tied to how a specific filesystem is implemented.
 
 
 I don't see the problem for xfs or btrfs either.
 
 xfs uses generic_writepages, which already does the right thing.  (xfs
 with my updated patches passes my tests.)  xfs_vm_writepage calls
 xfs_start_page_writeback(..., 1, ...), so clear_page_dirty_for_io is
 called.  At that point (I presume), it would still be possible to add
 metadata to a transaction (assuming there's a transaction open -- I
 don't have a clue here). 

That's my point - there isn't a transaction in XFS at this point,
and so if we gather that flag from clear_page_dirty_for_io() we'd
need a second transaction and therefore the optimisation you want
filesystems to use to mitigate the additional overhead can't be done
for all commonly used filesystems.

 Even if this is too late, xfs_vm_writepage
 could call page_mkwrite to for AS_CMTIME to be set if needed.
 page_mkwrite will be fast if the page isn't mmapped.  What am I
 missing?

That it leads to different behaviour for different filesystems.

i.e. page_mkwrite on page A sets the flag. writeback on a range that
doesn't include page A occurs, sees the flag, clears it after
updating the timestamp. Some time later writeback on page A occurs,
no timestamp update occurs.

The behaviour needs to be consistent across filesystems.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-19 Thread Andy Lutomirski
On Mon, Aug 19, 2013 at 9:08 PM, Dave Chinner  wrote:
> On Mon, Aug 19, 2013 at 08:28:20PM -0700, Andy Lutomirski wrote:
>> On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner  wrote:
>> > On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
>> >> Filesystems that defer cmtime updates should update cmtime when any
>> >> of these events happen after a write via a mapping:
>> >>
>> >>  - The mapping is written back to disk.  This happens from all kinds
>> >>of places, all of which eventually call ->writepages.
>> >>
>> >>  - munmap is called or the mapping is removed when the process exits
>> >>
>> >>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
>> >>msync(MS_ASYNC), but POSIX says that cmtime should be updated some
>> >>time between an mmaped write and the subsequent msync call.
>> >>MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
>> >>
>> >> Filesystmes that defer cmtime updates should flush them on munmap or
>> >> exit.  Finding out that this happened through vm_ops is messy, so
>> >> add a new address space op for this.
>> >>
>> >> It's not strictly necessary to call ->flush_cmtime after ->writepages,
>> >> but it simplifies the fs code.  As an optional optimization,
>> >> filesystems can call mapping_test_clear_cmtime themselves in
>> >> ->writepages (as long as they're careful to scan all the pages first
>> >> -- the cmtime bit may not be set when ->writepages is entered).
>> >
>> > .flush_cmtime is effectively a duplicate method.  We already have
>> > .update_time to notify filesystems that they need to update the
>> > timestamp in the inode transactionally.
>>
>> .update_time is used for the atime update as well, and it relies on
>> the core code to update the in-memory timestamp first.
>
> No, it doesn't. The caller of .update_time provides the timestamp to
> that gets written into the relevant fields of the inode.
>
> i.e. this code:
>
> now = current_fs_time(inode->i_sb);
> if (inode->i_op->update_time)
> return inode->i_op->update_time(inode, , S_CTIME|S_MTIME);
>
> will update *only* the ctime and mtime of the inode to have a value
> of @now. That's precisely what you want to do, yes?
>
> Indeed, your .flush_cmtime function effectively does:
>
> flags = prepare_cmtime(inode, )
> { *now = current_fs_time()
>   flags = S_CTIME|S_MTIME;
> }
> update_time(inode, now, flags);
> {
> inode->i_op->update_time(inode, now, flags)
> or
> mark_inode_dirty_sync()
> }
>
> IOWs, you're adding a filesystem specific method to update a
> timestamp that wraps around a generic method for updating a
> timestamp. i.e.  .flush_cmtime is not necessary - you could just
> call inode_update_time_writable() from generic_writepages() and it
> would simply just work for all filesystems

OK, I'll try that out.

>
>> I used that
>> approach in v2, but it was (correctly, I think) pointed out that this
>> was a layering violation and that core code shouldn't be mucking with
>> the timestamps directly during writeback.
>
> If calling a generic method to update the timestamp is a layering
> violation, then why is replacing that with a filesystem specific
> method of updating a timestamp not a layering violation?
>
>> > Indeed:
>> >
>> >> + /*
>> >> +  * Userspace expects certain system calls to update cmtime if
>> >> +  * a file has been recently written using a shared vma.  In
>> >> +  * cases where cmtime may need to be updated but writepages is
>> >> +  * not called, this is called instead.  (Implementations
>> >> +  * should call mapping_test_clear_cmtime.)
>> >> +  */
>> >> + void (*flush_cmtime)(struct address_space *);
>> >
>> > You say it can be implemented in the ->writepage(s) method, and all
>> > filesystems provide ->writepage(s) in some form. Therefore I would
>> > have thought it be best to simply require filesystems to check that
>> > mapping flag during those methods and update the inode directly when
>> > that is set?
>>
>> The problem with only doing it in ->writepages is that calling
>> writepages from munmap and exit would probably hurt performance for no
>> particular gain.  So I need some kind of callback to say "update the
>> time, but don't write data."  The AS_CMTIME bit will still be set when
>> the ptes are removed.
>
> What's the point of updating the timestamp at unmap when it's going
> to be changed again when the writeback occurs?

To avoid breaking make and similar tools.  Suppose that an output file
already exists but is stale.  Some program gets called to update it,
and it opens it for write, mmaps it, updates it, calls munmap, and
exits.  Its parent will expect the timestamp to have been updated, but
writeback may not have happened.

>
>> I could require ->writepages *and* ->flush_cmtime to handle the time

Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-19 Thread Dave Chinner
On Mon, Aug 19, 2013 at 08:28:20PM -0700, Andy Lutomirski wrote:
> On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner  wrote:
> > On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
> >> Filesystems that defer cmtime updates should update cmtime when any
> >> of these events happen after a write via a mapping:
> >>
> >>  - The mapping is written back to disk.  This happens from all kinds
> >>of places, all of which eventually call ->writepages.
> >>
> >>  - munmap is called or the mapping is removed when the process exits
> >>
> >>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
> >>msync(MS_ASYNC), but POSIX says that cmtime should be updated some
> >>time between an mmaped write and the subsequent msync call.
> >>MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
> >>
> >> Filesystmes that defer cmtime updates should flush them on munmap or
> >> exit.  Finding out that this happened through vm_ops is messy, so
> >> add a new address space op for this.
> >>
> >> It's not strictly necessary to call ->flush_cmtime after ->writepages,
> >> but it simplifies the fs code.  As an optional optimization,
> >> filesystems can call mapping_test_clear_cmtime themselves in
> >> ->writepages (as long as they're careful to scan all the pages first
> >> -- the cmtime bit may not be set when ->writepages is entered).
> >
> > .flush_cmtime is effectively a duplicate method.  We already have
> > .update_time to notify filesystems that they need to update the
> > timestamp in the inode transactionally.
> 
> .update_time is used for the atime update as well, and it relies on
> the core code to update the in-memory timestamp first.

No, it doesn't. The caller of .update_time provides the timestamp to
that gets written into the relevant fields of the inode.

i.e. this code:

now = current_fs_time(inode->i_sb);
if (inode->i_op->update_time)
return inode->i_op->update_time(inode, , S_CTIME|S_MTIME);

will update *only* the ctime and mtime of the inode to have a value
of @now. That's precisely what you want to do, yes?

Indeed, your .flush_cmtime function effectively does:

flags = prepare_cmtime(inode, )
{ *now = current_fs_time()
  flags = S_CTIME|S_MTIME;
}
update_time(inode, now, flags);
{
inode->i_op->update_time(inode, now, flags)
or
mark_inode_dirty_sync()
}

IOWs, you're adding a filesystem specific method to update a
timestamp that wraps around a generic method for updating a
timestamp. i.e.  .flush_cmtime is not necessary - you could just
call inode_update_time_writable() from generic_writepages() and it
would simply just work for all filesystems

> I used that
> approach in v2, but it was (correctly, I think) pointed out that this
> was a layering violation and that core code shouldn't be mucking with
> the timestamps directly during writeback.

If calling a generic method to update the timestamp is a layering
violation, then why is replacing that with a filesystem specific
method of updating a timestamp not a layering violation?

> > Indeed:
> >
> >> + /*
> >> +  * Userspace expects certain system calls to update cmtime if
> >> +  * a file has been recently written using a shared vma.  In
> >> +  * cases where cmtime may need to be updated but writepages is
> >> +  * not called, this is called instead.  (Implementations
> >> +  * should call mapping_test_clear_cmtime.)
> >> +  */
> >> + void (*flush_cmtime)(struct address_space *);
> >
> > You say it can be implemented in the ->writepage(s) method, and all
> > filesystems provide ->writepage(s) in some form. Therefore I would
> > have thought it be best to simply require filesystems to check that
> > mapping flag during those methods and update the inode directly when
> > that is set?
> 
> The problem with only doing it in ->writepages is that calling
> writepages from munmap and exit would probably hurt performance for no
> particular gain.  So I need some kind of callback to say "update the
> time, but don't write data."  The AS_CMTIME bit will still be set when
> the ptes are removed.

What's the point of updating the timestamp at unmap when it's going
to be changed again when the writeback occurs?

> I could require ->writepages *and* ->flush_cmtime to handle the time
> update, but that would complicate non-transactional filesystems.
> Those filesystems should just flush cmtime at the end of writepages.

do_writepages() is the wrong place to do such updates - we can get
writeback directly through .writepage, so the time updates need to
be in .writepage. That first .writepage call will clear the bit on
the mapping, so it's only done on the first call to .writepage on
the given mapping.

And some filesystems provide a .writepages method that doesn't call
.writepage, so those 

Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-19 Thread Andy Lutomirski
On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner  wrote:
> On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
>> Filesystems that defer cmtime updates should update cmtime when any
>> of these events happen after a write via a mapping:
>>
>>  - The mapping is written back to disk.  This happens from all kinds
>>of places, all of which eventually call ->writepages.
>>
>>  - munmap is called or the mapping is removed when the process exits
>>
>>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
>>msync(MS_ASYNC), but POSIX says that cmtime should be updated some
>>time between an mmaped write and the subsequent msync call.
>>MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
>>
>> Filesystmes that defer cmtime updates should flush them on munmap or
>> exit.  Finding out that this happened through vm_ops is messy, so
>> add a new address space op for this.
>>
>> It's not strictly necessary to call ->flush_cmtime after ->writepages,
>> but it simplifies the fs code.  As an optional optimization,
>> filesystems can call mapping_test_clear_cmtime themselves in
>> ->writepages (as long as they're careful to scan all the pages first
>> -- the cmtime bit may not be set when ->writepages is entered).
>
> .flush_cmtime is effectively a duplicate method.  We already have
> .update_time to notify filesystems that they need to update the
> timestamp in the inode transactionally.

.update_time is used for the atime update as well, and it relies on
the core code to update the in-memory timestamp first.  I used that
approach in v2, but it was (correctly, I think) pointed out that this
was a layering violation and that core code shouldn't be mucking with
the timestamps directly during writeback.

There was a recent effort to move most of the file_update_calls from
the core into .page_mkwrite, and I don't think anyone wants to undo
that.

>
> Indeed:
>
>> + /*
>> +  * Userspace expects certain system calls to update cmtime if
>> +  * a file has been recently written using a shared vma.  In
>> +  * cases where cmtime may need to be updated but writepages is
>> +  * not called, this is called instead.  (Implementations
>> +  * should call mapping_test_clear_cmtime.)
>> +  */
>> + void (*flush_cmtime)(struct address_space *);
>
> You say it can be implemented in the ->writepage(s) method, and all
> filesystems provide ->writepage(s) in some form. Therefore I would
> have thought it be best to simply require filesystems to check that
> mapping flag during those methods and update the inode directly when
> that is set?

The problem with only doing it in ->writepages is that calling
writepages from munmap and exit would probably hurt performance for no
particular gain.  So I need some kind of callback to say "update the
time, but don't write data."  The AS_CMTIME bit will still be set when
the ptes are removed.

I could require ->writepages *and* ->flush_cmtime to handle the time
update, but that would complicate non-transactional filesystems.
Those filesystems should just flush cmtime at the end of writepages.

>
> Indeed, the way you've set up the infrastructure, we'll have to
> rewrite the cmtime update code to enable writepages to update this
> within some other transaction. Perhaps you should just implement it
> that way first?

This is already possible although not IMO necessary for correctness.
All that ext4 would need to do is to add something like:

if (mapping_test_clear_cmtime(mapping)) {
  update times within current transaction
}

somewhere inside the transaction in writepages.  There would probably
be room for some kind of generic helper to do everything in
inode_update_time_writable except for the actual mark_inode_dirty
part, but this still seems nasty from a locking perspective, and I'd
rather leave that optimization to an ext4 developer who wants to do
it.

I could simplify this a bit by moving the mapping_test_clear_cmtime
part from .flush_cmtime to its callers.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-19 Thread Dave Chinner
On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
> Filesystems that defer cmtime updates should update cmtime when any
> of these events happen after a write via a mapping:
> 
>  - The mapping is written back to disk.  This happens from all kinds
>of places, all of which eventually call ->writepages.
> 
>  - munmap is called or the mapping is removed when the process exits
> 
>  - msync(MS_ASYNC) is called.  Linux currently does nothing for
>msync(MS_ASYNC), but POSIX says that cmtime should be updated some
>time between an mmaped write and the subsequent msync call.
>MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.
> 
> Filesystmes that defer cmtime updates should flush them on munmap or
> exit.  Finding out that this happened through vm_ops is messy, so
> add a new address space op for this.
> 
> It's not strictly necessary to call ->flush_cmtime after ->writepages,
> but it simplifies the fs code.  As an optional optimization,
> filesystems can call mapping_test_clear_cmtime themselves in
> ->writepages (as long as they're careful to scan all the pages first
> -- the cmtime bit may not be set when ->writepages is entered).

.flush_cmtime is effectively a duplicate method.  We already have
.update_time to notify filesystems that they need to update the
timestamp in the inode transactionally.

Indeed:

> + /*
> +  * Userspace expects certain system calls to update cmtime if
> +  * a file has been recently written using a shared vma.  In
> +  * cases where cmtime may need to be updated but writepages is
> +  * not called, this is called instead.  (Implementations
> +  * should call mapping_test_clear_cmtime.)
> +  */
> + void (*flush_cmtime)(struct address_space *);

You say it can be implemented in the ->writepage(s) method, and all
filesystems provide ->writepage(s) in some form. Therefore I would
have thought it be best to simply require filesystems to check that
mapping flag during those methods and update the inode directly when
that is set?

Indeed, the way you've set up the infrastructure, we'll have to
rewrite the cmtime update code to enable writepages to update this
within some other transaction. Perhaps you should just implement it
that way first?

> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1928,6 +1928,18 @@ int do_writepages(struct address_space *mapping, 
> struct writeback_control *wbc)
>   ret = mapping->a_ops->writepages(mapping, wbc);
>   else
>   ret = generic_writepages(mapping, wbc);
> +
> + /*
> +  * ->writepages will call clear_page_dirty_for_io, which may, in turn,
> +  * mark the mapping for deferred cmtime update.  As an optimization,
> +  * a filesystem can flush the update at the end of ->writepages
> +  * (possibly avoiding a journal transaction, for example), but,
> +  * for simplicity, let filesystems skip that part and just implement
> +  * ->flush_cmtime.
> +  */
> + if (mapping->a_ops->flush_cmtime)
> + mapping->a_ops->flush_cmtime(mapping);

And that's where you cannot call sb_pagefault_start/end

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-19 Thread Dave Chinner
On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
 Filesystems that defer cmtime updates should update cmtime when any
 of these events happen after a write via a mapping:
 
  - The mapping is written back to disk.  This happens from all kinds
of places, all of which eventually call -writepages.
 
  - munmap is called or the mapping is removed when the process exits
 
  - msync(MS_ASYNC) is called.  Linux currently does nothing for
msync(MS_ASYNC), but POSIX says that cmtime should be updated some
time between an mmaped write and the subsequent msync call.
MS_SYNC calls -writepages, but MS_ASYNC needs special handling.
 
 Filesystmes that defer cmtime updates should flush them on munmap or
 exit.  Finding out that this happened through vm_ops is messy, so
 add a new address space op for this.
 
 It's not strictly necessary to call -flush_cmtime after -writepages,
 but it simplifies the fs code.  As an optional optimization,
 filesystems can call mapping_test_clear_cmtime themselves in
 -writepages (as long as they're careful to scan all the pages first
 -- the cmtime bit may not be set when -writepages is entered).

.flush_cmtime is effectively a duplicate method.  We already have
.update_time to notify filesystems that they need to update the
timestamp in the inode transactionally.

Indeed:

 + /*
 +  * Userspace expects certain system calls to update cmtime if
 +  * a file has been recently written using a shared vma.  In
 +  * cases where cmtime may need to be updated but writepages is
 +  * not called, this is called instead.  (Implementations
 +  * should call mapping_test_clear_cmtime.)
 +  */
 + void (*flush_cmtime)(struct address_space *);

You say it can be implemented in the -writepage(s) method, and all
filesystems provide -writepage(s) in some form. Therefore I would
have thought it be best to simply require filesystems to check that
mapping flag during those methods and update the inode directly when
that is set?

Indeed, the way you've set up the infrastructure, we'll have to
rewrite the cmtime update code to enable writepages to update this
within some other transaction. Perhaps you should just implement it
that way first?

 --- a/mm/page-writeback.c
 +++ b/mm/page-writeback.c
 @@ -1928,6 +1928,18 @@ int do_writepages(struct address_space *mapping, 
 struct writeback_control *wbc)
   ret = mapping-a_ops-writepages(mapping, wbc);
   else
   ret = generic_writepages(mapping, wbc);
 +
 + /*
 +  * -writepages will call clear_page_dirty_for_io, which may, in turn,
 +  * mark the mapping for deferred cmtime update.  As an optimization,
 +  * a filesystem can flush the update at the end of -writepages
 +  * (possibly avoiding a journal transaction, for example), but,
 +  * for simplicity, let filesystems skip that part and just implement
 +  * -flush_cmtime.
 +  */
 + if (mapping-a_ops-flush_cmtime)
 + mapping-a_ops-flush_cmtime(mapping);

And that's where you cannot call sb_pagefault_start/end

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-19 Thread Andy Lutomirski
On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner da...@fromorbit.com wrote:
 On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
 Filesystems that defer cmtime updates should update cmtime when any
 of these events happen after a write via a mapping:

  - The mapping is written back to disk.  This happens from all kinds
of places, all of which eventually call -writepages.

  - munmap is called or the mapping is removed when the process exits

  - msync(MS_ASYNC) is called.  Linux currently does nothing for
msync(MS_ASYNC), but POSIX says that cmtime should be updated some
time between an mmaped write and the subsequent msync call.
MS_SYNC calls -writepages, but MS_ASYNC needs special handling.

 Filesystmes that defer cmtime updates should flush them on munmap or
 exit.  Finding out that this happened through vm_ops is messy, so
 add a new address space op for this.

 It's not strictly necessary to call -flush_cmtime after -writepages,
 but it simplifies the fs code.  As an optional optimization,
 filesystems can call mapping_test_clear_cmtime themselves in
 -writepages (as long as they're careful to scan all the pages first
 -- the cmtime bit may not be set when -writepages is entered).

 .flush_cmtime is effectively a duplicate method.  We already have
 .update_time to notify filesystems that they need to update the
 timestamp in the inode transactionally.

.update_time is used for the atime update as well, and it relies on
the core code to update the in-memory timestamp first.  I used that
approach in v2, but it was (correctly, I think) pointed out that this
was a layering violation and that core code shouldn't be mucking with
the timestamps directly during writeback.

There was a recent effort to move most of the file_update_calls from
the core into .page_mkwrite, and I don't think anyone wants to undo
that.


 Indeed:

 + /*
 +  * Userspace expects certain system calls to update cmtime if
 +  * a file has been recently written using a shared vma.  In
 +  * cases where cmtime may need to be updated but writepages is
 +  * not called, this is called instead.  (Implementations
 +  * should call mapping_test_clear_cmtime.)
 +  */
 + void (*flush_cmtime)(struct address_space *);

 You say it can be implemented in the -writepage(s) method, and all
 filesystems provide -writepage(s) in some form. Therefore I would
 have thought it be best to simply require filesystems to check that
 mapping flag during those methods and update the inode directly when
 that is set?

The problem with only doing it in -writepages is that calling
writepages from munmap and exit would probably hurt performance for no
particular gain.  So I need some kind of callback to say update the
time, but don't write data.  The AS_CMTIME bit will still be set when
the ptes are removed.

I could require -writepages *and* -flush_cmtime to handle the time
update, but that would complicate non-transactional filesystems.
Those filesystems should just flush cmtime at the end of writepages.


 Indeed, the way you've set up the infrastructure, we'll have to
 rewrite the cmtime update code to enable writepages to update this
 within some other transaction. Perhaps you should just implement it
 that way first?

This is already possible although not IMO necessary for correctness.
All that ext4 would need to do is to add something like:

if (mapping_test_clear_cmtime(mapping)) {
  update times within current transaction
}

somewhere inside the transaction in writepages.  There would probably
be room for some kind of generic helper to do everything in
inode_update_time_writable except for the actual mark_inode_dirty
part, but this still seems nasty from a locking perspective, and I'd
rather leave that optimization to an ext4 developer who wants to do
it.

I could simplify this a bit by moving the mapping_test_clear_cmtime
part from .flush_cmtime to its callers.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-19 Thread Dave Chinner
On Mon, Aug 19, 2013 at 08:28:20PM -0700, Andy Lutomirski wrote:
 On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner da...@fromorbit.com wrote:
  On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
  Filesystems that defer cmtime updates should update cmtime when any
  of these events happen after a write via a mapping:
 
   - The mapping is written back to disk.  This happens from all kinds
 of places, all of which eventually call -writepages.
 
   - munmap is called or the mapping is removed when the process exits
 
   - msync(MS_ASYNC) is called.  Linux currently does nothing for
 msync(MS_ASYNC), but POSIX says that cmtime should be updated some
 time between an mmaped write and the subsequent msync call.
 MS_SYNC calls -writepages, but MS_ASYNC needs special handling.
 
  Filesystmes that defer cmtime updates should flush them on munmap or
  exit.  Finding out that this happened through vm_ops is messy, so
  add a new address space op for this.
 
  It's not strictly necessary to call -flush_cmtime after -writepages,
  but it simplifies the fs code.  As an optional optimization,
  filesystems can call mapping_test_clear_cmtime themselves in
  -writepages (as long as they're careful to scan all the pages first
  -- the cmtime bit may not be set when -writepages is entered).
 
  .flush_cmtime is effectively a duplicate method.  We already have
  .update_time to notify filesystems that they need to update the
  timestamp in the inode transactionally.
 
 .update_time is used for the atime update as well, and it relies on
 the core code to update the in-memory timestamp first.

No, it doesn't. The caller of .update_time provides the timestamp to
that gets written into the relevant fields of the inode.

i.e. this code:

now = current_fs_time(inode-i_sb);
if (inode-i_op-update_time)
return inode-i_op-update_time(inode, now, S_CTIME|S_MTIME);

will update *only* the ctime and mtime of the inode to have a value
of @now. That's precisely what you want to do, yes?

Indeed, your .flush_cmtime function effectively does:

flags = prepare_cmtime(inode, now)
{ *now = current_fs_time()
  flags = S_CTIME|S_MTIME;
}
update_time(inode, now, flags);
{
inode-i_op-update_time(inode, now, flags)
or
mark_inode_dirty_sync()
}

IOWs, you're adding a filesystem specific method to update a
timestamp that wraps around a generic method for updating a
timestamp. i.e.  .flush_cmtime is not necessary - you could just
call inode_update_time_writable() from generic_writepages() and it
would simply just work for all filesystems

 I used that
 approach in v2, but it was (correctly, I think) pointed out that this
 was a layering violation and that core code shouldn't be mucking with
 the timestamps directly during writeback.

If calling a generic method to update the timestamp is a layering
violation, then why is replacing that with a filesystem specific
method of updating a timestamp not a layering violation?

  Indeed:
 
  + /*
  +  * Userspace expects certain system calls to update cmtime if
  +  * a file has been recently written using a shared vma.  In
  +  * cases where cmtime may need to be updated but writepages is
  +  * not called, this is called instead.  (Implementations
  +  * should call mapping_test_clear_cmtime.)
  +  */
  + void (*flush_cmtime)(struct address_space *);
 
  You say it can be implemented in the -writepage(s) method, and all
  filesystems provide -writepage(s) in some form. Therefore I would
  have thought it be best to simply require filesystems to check that
  mapping flag during those methods and update the inode directly when
  that is set?
 
 The problem with only doing it in -writepages is that calling
 writepages from munmap and exit would probably hurt performance for no
 particular gain.  So I need some kind of callback to say update the
 time, but don't write data.  The AS_CMTIME bit will still be set when
 the ptes are removed.

What's the point of updating the timestamp at unmap when it's going
to be changed again when the writeback occurs?

 I could require -writepages *and* -flush_cmtime to handle the time
 update, but that would complicate non-transactional filesystems.
 Those filesystems should just flush cmtime at the end of writepages.

do_writepages() is the wrong place to do such updates - we can get
writeback directly through .writepage, so the time updates need to
be in .writepage. That first .writepage call will clear the bit on
the mapping, so it's only done on the first call to .writepage on
the given mapping.

And some filesystems provide a .writepages method that doesn't call
.writepage, so those filesystems will need to do the timestamp
update in those methods.

  Indeed, the way you've set up the infrastructure, we'll have to

Re: [PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-19 Thread Andy Lutomirski
On Mon, Aug 19, 2013 at 9:08 PM, Dave Chinner da...@fromorbit.com wrote:
 On Mon, Aug 19, 2013 at 08:28:20PM -0700, Andy Lutomirski wrote:
 On Mon, Aug 19, 2013 at 7:36 PM, Dave Chinner da...@fromorbit.com wrote:
  On Fri, Aug 16, 2013 at 04:22:10PM -0700, Andy Lutomirski wrote:
  Filesystems that defer cmtime updates should update cmtime when any
  of these events happen after a write via a mapping:
 
   - The mapping is written back to disk.  This happens from all kinds
 of places, all of which eventually call -writepages.
 
   - munmap is called or the mapping is removed when the process exits
 
   - msync(MS_ASYNC) is called.  Linux currently does nothing for
 msync(MS_ASYNC), but POSIX says that cmtime should be updated some
 time between an mmaped write and the subsequent msync call.
 MS_SYNC calls -writepages, but MS_ASYNC needs special handling.
 
  Filesystmes that defer cmtime updates should flush them on munmap or
  exit.  Finding out that this happened through vm_ops is messy, so
  add a new address space op for this.
 
  It's not strictly necessary to call -flush_cmtime after -writepages,
  but it simplifies the fs code.  As an optional optimization,
  filesystems can call mapping_test_clear_cmtime themselves in
  -writepages (as long as they're careful to scan all the pages first
  -- the cmtime bit may not be set when -writepages is entered).
 
  .flush_cmtime is effectively a duplicate method.  We already have
  .update_time to notify filesystems that they need to update the
  timestamp in the inode transactionally.

 .update_time is used for the atime update as well, and it relies on
 the core code to update the in-memory timestamp first.

 No, it doesn't. The caller of .update_time provides the timestamp to
 that gets written into the relevant fields of the inode.

 i.e. this code:

 now = current_fs_time(inode-i_sb);
 if (inode-i_op-update_time)
 return inode-i_op-update_time(inode, now, S_CTIME|S_MTIME);

 will update *only* the ctime and mtime of the inode to have a value
 of @now. That's precisely what you want to do, yes?

 Indeed, your .flush_cmtime function effectively does:

 flags = prepare_cmtime(inode, now)
 { *now = current_fs_time()
   flags = S_CTIME|S_MTIME;
 }
 update_time(inode, now, flags);
 {
 inode-i_op-update_time(inode, now, flags)
 or
 mark_inode_dirty_sync()
 }

 IOWs, you're adding a filesystem specific method to update a
 timestamp that wraps around a generic method for updating a
 timestamp. i.e.  .flush_cmtime is not necessary - you could just
 call inode_update_time_writable() from generic_writepages() and it
 would simply just work for all filesystems

OK, I'll try that out.


 I used that
 approach in v2, but it was (correctly, I think) pointed out that this
 was a layering violation and that core code shouldn't be mucking with
 the timestamps directly during writeback.

 If calling a generic method to update the timestamp is a layering
 violation, then why is replacing that with a filesystem specific
 method of updating a timestamp not a layering violation?

  Indeed:
 
  + /*
  +  * Userspace expects certain system calls to update cmtime if
  +  * a file has been recently written using a shared vma.  In
  +  * cases where cmtime may need to be updated but writepages is
  +  * not called, this is called instead.  (Implementations
  +  * should call mapping_test_clear_cmtime.)
  +  */
  + void (*flush_cmtime)(struct address_space *);
 
  You say it can be implemented in the -writepage(s) method, and all
  filesystems provide -writepage(s) in some form. Therefore I would
  have thought it be best to simply require filesystems to check that
  mapping flag during those methods and update the inode directly when
  that is set?

 The problem with only doing it in -writepages is that calling
 writepages from munmap and exit would probably hurt performance for no
 particular gain.  So I need some kind of callback to say update the
 time, but don't write data.  The AS_CMTIME bit will still be set when
 the ptes are removed.

 What's the point of updating the timestamp at unmap when it's going
 to be changed again when the writeback occurs?

To avoid breaking make and similar tools.  Suppose that an output file
already exists but is stale.  Some program gets called to update it,
and it opens it for write, mmaps it, updates it, calls munmap, and
exits.  Its parent will expect the timestamp to have been updated, but
writeback may not have happened.


 I could require -writepages *and* -flush_cmtime to handle the time
 update, but that would complicate non-transactional filesystems.
 Those filesystems should just flush cmtime at the end of writepages.

 do_writepages() is the wrong place to do such updates - we can get
 

[PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-16 Thread Andy Lutomirski
Filesystems that defer cmtime updates should update cmtime when any
of these events happen after a write via a mapping:

 - The mapping is written back to disk.  This happens from all kinds
   of places, all of which eventually call ->writepages.

 - munmap is called or the mapping is removed when the process exits

 - msync(MS_ASYNC) is called.  Linux currently does nothing for
   msync(MS_ASYNC), but POSIX says that cmtime should be updated some
   time between an mmaped write and the subsequent msync call.
   MS_SYNC calls ->writepages, but MS_ASYNC needs special handling.

Filesystmes that defer cmtime updates should flush them on munmap or
exit.  Finding out that this happened through vm_ops is messy, so
add a new address space op for this.

It's not strictly necessary to call ->flush_cmtime after ->writepages,
but it simplifies the fs code.  As an optional optimization,
filesystems can call mapping_test_clear_cmtime themselves in
->writepages (as long as they're careful to scan all the pages first
-- the cmtime bit may not be set when ->writepages is entered).

This patch does not implement the MS_ASYNC case; that's in the next
patch.

Signed-off-by: Andy Lutomirski 
---
 include/linux/fs.h|  9 +
 include/linux/writeback.h |  1 +
 mm/mmap.c |  9 -
 mm/page-writeback.c   | 26 ++
 4 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 86cf0a4..f224155 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -350,6 +350,15 @@ struct address_space_operations {
/* Write back some dirty pages from this mapping. */
int (*writepages)(struct address_space *, struct writeback_control *);
 
+   /*
+* Userspace expects certain system calls to update cmtime if
+* a file has been recently written using a shared vma.  In
+* cases where cmtime may need to be updated but writepages is
+* not called, this is called instead.  (Implementations
+* should call mapping_test_clear_cmtime.)
+*/
+   void (*flush_cmtime)(struct address_space *);
+
/* Set a page dirty.  Return true if this dirtied it */
int (*set_page_dirty)(struct page *page);
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 4e198ca..f6e8261 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -174,6 +174,7 @@ typedef int (*writepage_t)(struct page *page, struct 
writeback_control *wbc,
 
 int generic_writepages(struct address_space *mapping,
   struct writeback_control *wbc);
+void generic_flush_cmtime(struct address_space *mapping);
 void tag_pages_for_writeback(struct address_space *mapping,
 pgoff_t start, pgoff_t end);
 int write_cache_pages(struct address_space *mapping,
diff --git a/mm/mmap.c b/mm/mmap.c
index 1edbaa3..7ed7700 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1,3 +1,4 @@
+
 /*
  * mm/mmap.c
  *
@@ -249,8 +250,14 @@ static struct vm_area_struct *remove_vma(struct 
vm_area_struct *vma)
might_sleep();
if (vma->vm_ops && vma->vm_ops->close)
vma->vm_ops->close(vma);
-   if (vma->vm_file)
+   if (vma->vm_file) {
+   if ((vma->vm_flags & VM_SHARED) && vma->vm_file->f_mapping) {
+   struct address_space *mapping = vma->vm_file->f_mapping;
+   if (mapping->a_ops && mapping->a_ops->flush_cmtime)
+   mapping->a_ops->flush_cmtime(mapping);
+   }
fput(vma->vm_file);
+   }
mpol_put(vma_policy(vma));
kmem_cache_free(vm_area_cachep, vma);
return next;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3f0c895..9ab8c9e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1928,6 +1928,18 @@ int do_writepages(struct address_space *mapping, struct 
writeback_control *wbc)
ret = mapping->a_ops->writepages(mapping, wbc);
else
ret = generic_writepages(mapping, wbc);
+
+   /*
+* ->writepages will call clear_page_dirty_for_io, which may, in turn,
+* mark the mapping for deferred cmtime update.  As an optimization,
+* a filesystem can flush the update at the end of ->writepages
+* (possibly avoiding a journal transaction, for example), but,
+* for simplicity, let filesystems skip that part and just implement
+* ->flush_cmtime.
+*/
+   if (mapping->a_ops->flush_cmtime)
+   mapping->a_ops->flush_cmtime(mapping);
+
return ret;
 }
 
@@ -1970,6 +1982,20 @@ int write_one_page(struct page *page, int wait)
 }
 EXPORT_SYMBOL(write_one_page);
 
+/**
+ * generic_flush_cmtime - perform a deferred cmtime update if needed
+ * @mapping: address space structure
+ *
+ * This is a library function, which implements the flush_cmtime()
+ * address_space_operation.
+ */

[PATCH v3 3/5] mm: Notify filesystems when it's time to apply a deferred cmtime update

2013-08-16 Thread Andy Lutomirski
Filesystems that defer cmtime updates should update cmtime when any
of these events happen after a write via a mapping:

 - The mapping is written back to disk.  This happens from all kinds
   of places, all of which eventually call -writepages.

 - munmap is called or the mapping is removed when the process exits

 - msync(MS_ASYNC) is called.  Linux currently does nothing for
   msync(MS_ASYNC), but POSIX says that cmtime should be updated some
   time between an mmaped write and the subsequent msync call.
   MS_SYNC calls -writepages, but MS_ASYNC needs special handling.

Filesystmes that defer cmtime updates should flush them on munmap or
exit.  Finding out that this happened through vm_ops is messy, so
add a new address space op for this.

It's not strictly necessary to call -flush_cmtime after -writepages,
but it simplifies the fs code.  As an optional optimization,
filesystems can call mapping_test_clear_cmtime themselves in
-writepages (as long as they're careful to scan all the pages first
-- the cmtime bit may not be set when -writepages is entered).

This patch does not implement the MS_ASYNC case; that's in the next
patch.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 include/linux/fs.h|  9 +
 include/linux/writeback.h |  1 +
 mm/mmap.c |  9 -
 mm/page-writeback.c   | 26 ++
 4 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 86cf0a4..f224155 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -350,6 +350,15 @@ struct address_space_operations {
/* Write back some dirty pages from this mapping. */
int (*writepages)(struct address_space *, struct writeback_control *);
 
+   /*
+* Userspace expects certain system calls to update cmtime if
+* a file has been recently written using a shared vma.  In
+* cases where cmtime may need to be updated but writepages is
+* not called, this is called instead.  (Implementations
+* should call mapping_test_clear_cmtime.)
+*/
+   void (*flush_cmtime)(struct address_space *);
+
/* Set a page dirty.  Return true if this dirtied it */
int (*set_page_dirty)(struct page *page);
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 4e198ca..f6e8261 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -174,6 +174,7 @@ typedef int (*writepage_t)(struct page *page, struct 
writeback_control *wbc,
 
 int generic_writepages(struct address_space *mapping,
   struct writeback_control *wbc);
+void generic_flush_cmtime(struct address_space *mapping);
 void tag_pages_for_writeback(struct address_space *mapping,
 pgoff_t start, pgoff_t end);
 int write_cache_pages(struct address_space *mapping,
diff --git a/mm/mmap.c b/mm/mmap.c
index 1edbaa3..7ed7700 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1,3 +1,4 @@
+
 /*
  * mm/mmap.c
  *
@@ -249,8 +250,14 @@ static struct vm_area_struct *remove_vma(struct 
vm_area_struct *vma)
might_sleep();
if (vma-vm_ops  vma-vm_ops-close)
vma-vm_ops-close(vma);
-   if (vma-vm_file)
+   if (vma-vm_file) {
+   if ((vma-vm_flags  VM_SHARED)  vma-vm_file-f_mapping) {
+   struct address_space *mapping = vma-vm_file-f_mapping;
+   if (mapping-a_ops  mapping-a_ops-flush_cmtime)
+   mapping-a_ops-flush_cmtime(mapping);
+   }
fput(vma-vm_file);
+   }
mpol_put(vma_policy(vma));
kmem_cache_free(vm_area_cachep, vma);
return next;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3f0c895..9ab8c9e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1928,6 +1928,18 @@ int do_writepages(struct address_space *mapping, struct 
writeback_control *wbc)
ret = mapping-a_ops-writepages(mapping, wbc);
else
ret = generic_writepages(mapping, wbc);
+
+   /*
+* -writepages will call clear_page_dirty_for_io, which may, in turn,
+* mark the mapping for deferred cmtime update.  As an optimization,
+* a filesystem can flush the update at the end of -writepages
+* (possibly avoiding a journal transaction, for example), but,
+* for simplicity, let filesystems skip that part and just implement
+* -flush_cmtime.
+*/
+   if (mapping-a_ops-flush_cmtime)
+   mapping-a_ops-flush_cmtime(mapping);
+
return ret;
 }
 
@@ -1970,6 +1982,20 @@ int write_one_page(struct page *page, int wait)
 }
 EXPORT_SYMBOL(write_one_page);
 
+/**
+ * generic_flush_cmtime - perform a deferred cmtime update if needed
+ * @mapping: address space structure
+ *
+ * This is a library function, which implements the flush_cmtime()
+ * address_space_operation.
+ */
+void