Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-04 Thread Jan Kara
On Tue 03-06-14 15:37:39, Daniel Phillips wrote:
> On Tuesday, June 3, 2014 8:21:55 AM PDT, Jan Kara wrote:
> >On Tue 03-06-14 07:14:44, Christoph Hellwig wrote:
> >>On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
> >And I agree we went for per-bdi
> >flushing to avoid two threads congesting a single device leading to
> >suboptimal IO patterns during background writeback.
> 
> A proposal is on the table to implement s_ops->writeback() as a per-sb
> operation in such a way that nothing changes in the current per-inode path.
> Good or bad approach?
  Having s_ops->writeback() is fine. But I just hate how you hook that
callback into the writeback code (and Dave has expressed similar concerns).
The way you hook it up, the filesystem still has to keep one inode in the
dirty list so that __writeback_inodes_wb() can find that inode, get the
superblock from it, and ask for writeback to be done via the callback.
That's really ugly and a no-go for me.
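
For contrast, the shape being asked for here could look roughly like the
sketch below, where the flusher walks the superblocks on the bdi and calls
the proposed ->writeback() directly, so no placeholder inode is needed.
This is illustrative only: writeback_sbs_on_bdi() is a made-up name and the
super_blocks locking/refcounting is elided.

static long writeback_sbs_on_bdi(struct bdi_writeback *wb,
				 struct wb_writeback_work *work)
{
	struct writeback_control wbc = {
		.sync_mode	= work->sync_mode,
		.range_start	= 0,
		.range_end	= LLONG_MAX,
	};
	struct super_block *sb;
	long progress = 0;

	/* sb_lock / grab_super() handling elided for brevity */
	list_for_each_entry(sb, &super_blocks, s_list) {
		if (sb->s_bdi != wb->bdi || !sb->s_op->writeback)
			continue;
		/* no placeholder inode needed: ask the superblock directly */
		progress += sb->s_op->writeback(sb, wb, work, &wbc);
		if (work->nr_pages <= 0)
			break;
	}
	return progress;
}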

> >So currently I'm convinced we want to go for per-sb dirty tracking. That
> >also makes some speedups in that code noticeably simpler. I'm not
> convinced
> >about the per-sb flushing thread - if we don't regress the multiple sb on
> >bdi case when we just let the threads from different superblocks contend
> >for IO, then that would be a natural thing to do. But once we have to
> >introduce some synchronization between threads to avoid regressions, I
> >think it might be easier to just stay with per-bdi thread which switches
> >between superblocks.
> 
> Could you elaborate on the means of switching between superblocks?  Do
> you mean a new fs-writeback path just for data=journal class filesystems,
> or are you suggesting changing the way all filesystems are driven?
  So I suggest changing the way all filesystems are driven to a per-sb one.
That makes sense for other reasons as well and allows clean incorporation
of your writeback callback (either a filesystem uses the generic callback,
which would do what the generic writeback code does now, only with per-sb
dirty lists instead of the current per-bdi ones, or the fs does its own
thing). I can write something (the switch isn't really that complex) but it
will need quite a bit of testing to verify we don't regress somewhere...
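
As a sketch of the "generic callback" case (simplefs_* names are
hypothetical; generic_writeback_sb_inodes() is the helper introduced by the
patch later in this thread, and .writeback is the proposed super_operations
extension):

static long simplefs_writeback(struct super_block *sb, struct bdi_writeback *wb,
			       struct wb_writeback_work *work,
			       struct writeback_control *wbc)
{
	/* no fs-private dirty tracking: just do what the core code does today */
	return generic_writeback_sb_inodes(sb, wb, work, wbc);
}

static const struct super_operations simplefs_sops = {
	/* ... the usual operations ... */
	.writeback	= simplefs_writeback,
};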

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Daniel Phillips

On Tuesday, June 3, 2014 8:21:55 AM PDT, Jan Kara wrote:
> On Tue 03-06-14 07:14:44, Christoph Hellwig wrote:
> > On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
> > > ...
>   So I agree per-bdi / per-sb matters only in simple setups but machines
> with single rotating disk with several partitions and without LVM aren't
> that rare AFAICT from my experience.

Retribution is sure to be swift, terrible and eternal for anyone who dares
to break those.

> And I agree we went for per-bdi
> flushing to avoid two threads congesting a single device leading to
> suboptimal IO patterns during background writeback.

A proposal is on the table to implement s_ops->writeback() as a per-sb
operation in such a way that nothing changes in the current per-inode path.
Good or bad approach?

> So currently I'm convinced we want to go for per-sb dirty tracking. That
> also makes some speedups in that code noticeably simpler. I'm not convinced
> about the per-sb flushing thread - if we don't regress the multiple sb on
> bdi case when we just let the threads from different superblocks contend
> for IO, then that would be a natural thing to do. But once we have to
> introduce some synchronization between threads to avoid regressions, I
> think it might be easier to just stay with per-bdi thread which switches
> between superblocks.

Could you elaborate on the means of switching between superblocks? Do you
mean a new fs-writeback path just for data=journal class filesystems, or are
you suggesting changing the way all filesystems are driven?

Regards,

Daniel


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Christian Stroetmann

On the 3rd of June 2014 16:57, Theodore Ts'o wrote:
> On Tue, Jun 03, 2014 at 07:30:32AM +0200, Christian Stroetmann wrote:
> > In general, I do not believe that the complexity problems of soft updates,
> > atomic writes, and related techniques can be solved by hand/manually. So my
> > suggestion is to automatically handle the complexity problem of e.g.
> > dependencies in a way that is comparable to a(n on-the-fly) file-system
> > compiler, so to say, that works on a very large dependency graph (having
> > several billions of graph vertices actually). And at this point an
> > abstraction like the one given with Featherstitch helps to feed and control
> > this special FS compiler.
>
> Well, if you want to try to implement something like this, go for it!

I have already been active on this for some weeks.

> I'd be very curious to see (a) how much CPU overhead it takes
> to crunch on a dependency graph with billions of vertices, and (b) how
> easy it would be to express these dependencies and how maintainable such a
> dependency language would be.  Sounds like a great research topic, and

To a) A run is expected to take a few hours on a single computing node.

Also, such graph processing does not have to be done all the time, but only
when a new application demands specific handling of the data with respect to
e.g. one of the ACID criteria. That is the reason why I put "on-the-fly" in
parentheses.

To b) I hoped that file system developers could make some suggestions or
point out some no-gos. I am also thinking about Petri nets in this context,
though that is just an idea.
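
As a toy illustration of the dependency-graph idea (user-space C, nothing to
do with any real on-disk format or with Featherstitch itself): writes become
vertices, ordering constraints become edges, and a write is only submitted
once everything it depends on has already been submitted.

#include <stdio.h>

#define MAX_WRITES 8

struct write_node {
	int block;			/* target block number */
	int ndeps;			/* writes this one still waits for */
	int dependents[MAX_WRITES];	/* writes that wait for this one */
	int ndependents;
	int submitted;
};

/* record "later must not hit disk before earlier" */
static void add_dep(struct write_node *w, int later, int earlier)
{
	w[later].ndeps++;
	w[earlier].dependents[w[earlier].ndependents++] = later;
}

/* Kahn-style: keep emitting writes whose dependencies are all satisfied */
static void submit_in_order(struct write_node *w, int n)
{
	int progress = 1;

	while (progress) {
		progress = 0;
		for (int i = 0; i < n; i++) {
			if (w[i].submitted || w[i].ndeps)
				continue;
			printf("submit write of block %d\n", w[i].block);
			w[i].submitted = 1;
			for (int j = 0; j < w[i].ndependents; j++)
				w[w[i].dependents[j]].ndeps--;
			progress = 1;
		}
	}
}

int main(void)
{
	struct write_node w[3] = { { .block = 100 }, { .block = 200 }, { .block = 300 } };

	add_dep(w, 1, 0);	/* e.g. an inode block (200) after its data block (100) */
	add_dep(w, 2, 1);	/* and a commit block (300) after the inode block */
	submit_in_order(w, 3);	/* prints 100, 200, 300 */
	return 0;
}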


I would also like to mention that it could be used in conjunction with 
Non-Volatile Memory (NVM) as well.



> I'll note the Call For Papers for FAST 2015 is out, and if you can
> solve these problems, it would make a great FAST 2015 submission:
>
> https://www.usenix.org/conference/fast15/call-for-papers

Are you serious, or have I missed the 1st of April once again?
Actually, I could only write a general overview of the approach,
comparable to a white paper, but nothing more.

> Cheers,
>
> - Ted



Best regards
Christian


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Jan Kara
On Tue 03-06-14 07:14:44, Christoph Hellwig wrote:
> On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
> > So we currently flush inodes in first dirtied first written back order when
> > superblock is not specified in writeback work. That completely ignores the
> > fact to which superblock inode belongs but I don't see per-sb fairness to
> > actually make any sense when
> > 1) flushing old data (to keep promise set in dirty_expire_centisecs)
> > 2) flushing data to reduce number of dirty pages
> > And these are really the only two cases where we don't do per-sb flushing.
> > 
> > Now when filesystems want to do something more clever (and I can see
> > reasons for that e.g. when journalling metadata, even more so when
> > journalling data) I agree we need to somehow implement the above two types
> > of writeback using per-sb flushing. Type 1) is actually pretty easy - just
> > tell each sb to writeback dirty data upto time T. Type 2) is more difficult
> > because that is more openended task - it seems similar to what shrinkers do
> > but that would require us to track per sb amount of dirty pages / inodes
> > and I'm not sure we want to add even more page counting statistics...
> > Especially since often bdi == fs. Thoughts?
> 
> Honestly I think doing per-bdi writeback has been a major mistake.  As
> you said it only even matters when we have filesystems on multiple
> partitions on a single device, and even then only in a simple setup,
> as soon as we use LVM or btrfs this sort of sharing stops to happen
> anyway.  I don't even see much of a benefit except that we prevent
> two flushing daemons to congest a single device for that special case
> of multiple filesystems on partitions of the same device, and that could
> be solved in other ways.
  So I agree per-bdi / per-sb matters only in simple setups but machines
with single rotating disk with several partitions and without LVM aren't
that rare AFAICT from my experience. And I agree we went for per-bdi
flushing to avoid two threads congesting a single device leading to
suboptimal IO patterns during background writeback.

So currently I'm convinced we want to go for per-sb dirty tracking. That
also makes some speedups in that code noticeably simpler. I'm not convinced
about the per-sb flushing thread - if we don't regress the multiple sb on
bdi case when we just let the threads from different superblocks contend
for IO, then that would be a natural thing to do. But once we have to
introduce some synchronization between threads to avoid regressions, I
think it might be easier to just stay with per-bdi thread which switches
between superblocks.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Theodore Ts'o
On Tue, Jun 03, 2014 at 07:30:32AM +0200, Christian Stroetmann wrote:
> In general, I do not believe that the complexity problems of soft updates,
> atomic writes, and related techniques can be solved by hand/manually. So my
> suggestion is to automatically handle the complexity problem of e.g.
> dependencies in a way that is comparable to a(n on-the-fly) file-system
> compiler, so to say, that works on a very large dependency graph (having
> several billions of graph vertices actually). And at this point an
> abstraction like the one given with Featherstitch helps to feed and control
> this special FS compiler.

Well, if you want to try to implement something like this, go for it!

I'd be very curious to see (a) how much CPU overhead it takes
to crunch on a dependency graph with billions of vertices, and (b) how
easy it would be to express these dependencies and how maintainable such a
dependency language would be.  Sounds like a great research topic, and
I'll note the Call For Papers for FAST 2015 is out, and if you can
solve these problems, it would make a great FAST 2015 submission:

https://www.usenix.org/conference/fast15/call-for-papers

Cheers,

- Ted


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Theodore Ts'o
On Tue, Jun 03, 2014 at 07:14:44AM -0700, Christoph Hellwig wrote:
> Honestly I think doing per-bdi writeback has been a major mistake.  As
> you said it only even matters when we have filesystems on multiple
> partitions on a single device, and even then only in a simple setup,
> as soon as we use LVM or btrfs this sort of sharing stops to happen
> anyway.  I don't even see much of a benefit except that we prevent
> two flushing daemons to congest a single device for that special case
> of multiple filesystems on partitions of the same device, and that could
> be solved in other ways.

To be fair, back when per-bdi writeback was introduced, having
multiple partitions on a single disk was far more common, and the use
of LVM was much less common.  These days, with many more systems using
one big root filesystem, and/or using flash where parallel writes can
actually be a good thing (since there isn't a single disk head which
has to seek all over the HDD), the case for keeping per-bdi writeback
is much weaker, if not non-existent.

Cheers,

- Ted


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Christoph Hellwig
On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
> So we currently flush inodes in first dirtied first written back order when
> superblock is not specified in writeback work. That completely ignores the
> fact to which superblock inode belongs but I don't see per-sb fairness to
> actually make any sense when
> 1) flushing old data (to keep promise set in dirty_expire_centisecs)
> 2) flushing data to reduce number of dirty pages
> And these are really the only two cases where we don't do per-sb flushing.
> 
> Now when filesystems want to do something more clever (and I can see
> reasons for that e.g. when journalling metadata, even more so when
> journalling data) I agree we need to somehow implement the above two types
> of writeback using per-sb flushing. Type 1) is actually pretty easy - just
> tell each sb to writeback dirty data upto time T. Type 2) is more difficult
> because that is more openended task - it seems similar to what shrinkers do
> but that would require us to track per sb amount of dirty pages / inodes
> and I'm not sure we want to add even more page counting statistics...
> Especially since often bdi == fs. Thoughts?

Honestly I think doing per-bdi writeback has been a major mistake.  As
you said, it only even matters when we have filesystems on multiple
partitions on a single device, and even then only in a simple setup;
as soon as we use LVM or btrfs this sort of sharing stops happening
anyway.  I don't even see much of a benefit except that we prevent
two flushing daemons from congesting a single device for that special
case of multiple filesystems on partitions of the same device, and that
could be solved in other ways.

The major benefit of the per-bdi writeback was that for the usual case
of one filesystem per device we get exactly one flusher thread per
filesystem instead of multiple competing ones, but per-sb writeback
would solve that just as well.



Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Jan Kara
On Tue 03-06-14 17:52:09, Dave Chinner wrote:
> On Tue, Jun 03, 2014 at 12:01:11AM -0700, Daniel Phillips wrote:
> > > However, we already avoid the VFS writeback lists for certain
> > > filesystems for pure metadata. e.g. XFS does not use the VFS dirty
> > > inode lists for inode metadata changes.  They get tracked internally
> > > by the transaction subsystem which does it's own writeback according
> > > to the requirements of journal space availability.
> > >
> > > This is done simply by not calling mark_inode_dirty() on any
> > > metadata only change. If we want to do the same for data, then we'd
> > > simply not call mark_inode_dirty() in the data IO path. That
> > > requires a custom ->set_page_dirty method to be provided by the
> > > filesystem that didn't call
> > >
> > >   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > >
> > > and instead did it's own thing.
> > >
> > > So the per-superblock dirty tracking is something we can do right
> > > now, and some filesystems do it for metadata. The missing piece for
> > > data is writeback infrastructure capable of deferring to superblocks
> > > for writeback rather than BDIs
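
To illustrate the mechanism described in the quote above, a
filesystem-provided ->set_page_dirty might be shaped like the sketch below;
myfs_* names are hypothetical and the truncate-race checks that
__set_page_dirty_nobuffers() performs are elided.

static int myfs_set_page_dirty(struct page *page)
{
	struct address_space *mapping = page->mapping;
	unsigned long flags;

	if (TestSetPageDirty(page))
		return 0;

	/* same page/radix-tree accounting the generic helper does ... */
	spin_lock_irqsave(&mapping->tree_lock, flags);
	account_page_dirtied(page, mapping);
	radix_tree_tag_set(&mapping->page_tree, page_index(page),
			   PAGECACHE_TAG_DIRTY);
	spin_unlock_irqrestore(&mapping->tree_lock, flags);

	/*
	 * ... but no __mark_inode_dirty(mapping->host, I_DIRTY_PAGES) here:
	 * the inode is tracked in filesystem-private dirty structures instead.
	 */
	myfs_track_dirty_inode(mapping->host);		/* hypothetical */
	return 1;
}
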
> > 
> > We agree that fs-writeback inode lists are broken for anything
> > more sophisticated than Ext2.
> 
> No, I did not say that.
> 
> I said XFS does something different for metadata changes because it
> has different flushing constraints and requirements than the generic
> code provides. That does not make the generic code broken.
> 
> > An advantage of the patch under
> > consideration is that it still lets fs-writeback mostly work the
> > way it has worked for the last few years, except for not allowing it
> > to pick specific inodes and data pages for writeout. As far as I
> > can see, it still balances writeout between different filesystems
> > on the same block device pretty well.
> 
> Not really. If there are 3 superblocks on a BDI, and the dirty inode
> list iterates between 2 of them with lots of dirty inodes, it can
> starve writeback from the third until one of it's dirty inodes pops
> to the head of the b_io list. So it's inherently unfair from that
> perspective.
> 
> Changing the high level flushing to be per-superblock rather than
> per-BDI would enable us to control that behaviour and be much fairer
> to all the superblocks on a given BDI. That said, I don't really
> care that much about this case...
So we currently flush inodes in first-dirtied, first-written-back order when
a superblock is not specified in the writeback work. That completely ignores
which superblock an inode belongs to, but I don't see per-sb fairness
actually making any sense when
1) flushing old data (to keep the promise set by dirty_expire_centisecs)
2) flushing data to reduce the number of dirty pages
And these are really the only two cases where we don't do per-sb flushing.

Now, when filesystems want to do something more clever (and I can see
reasons for that e.g. when journalling metadata, even more so when
journalling data), I agree we need to somehow implement the above two types
of writeback using per-sb flushing. Type 1) is actually pretty easy - just
tell each sb to write back dirty data up to time T. Type 2) is more
difficult because it is a more open-ended task - it seems similar to what
shrinkers do, but it would require us to track a per-sb amount of dirty
pages / inodes, and I'm not sure we want to add even more page counting
statistics... Especially since often bdi == fs. Thoughts?
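
Type 1) could be sketched per superblock roughly as below; the cutoff
arithmetic mirrors the existing kupdate-style writeback, while
for_each_sb_on_bdi() and sb_writeback_expired() are hypothetical helpers.

static void flush_expired_per_sb(struct backing_dev_info *bdi)
{
	/* dirty_expire_interval is in centisecs, as today */
	unsigned long cutoff = jiffies -
			msecs_to_jiffies(dirty_expire_interval * 10);
	struct super_block *sb;

	for_each_sb_on_bdi(sb, bdi)		/* hypothetical iterator */
		/* "write back everything you dirtied before <cutoff>" */
		sb_writeback_expired(sb, cutoff);
}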

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread OGAWA Hirofumi
Dave Chinner  writes:

>> It doesn't move inode to end of the dirty if wb.b_dirty is empty
>> (I.e. it can move from wb.b_io to wb.b_dirty too).
>
> Um, really?  What code are you reading? From 3.15-rc8:
>
> static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
> {
> 	assert_spin_locked(&wb->list_lock);
> 	if (!list_empty(&wb->b_dirty)) {
> 		struct inode *tail;
>
> 		tail = wb_inode(wb->b_dirty.next);
> 		if (time_before(inode->dirtied_when, tail->dirtied_when))
> 			inode->dirtied_when = jiffies;
> 	}
> 	list_move(&inode->i_wb_list, &wb->b_dirty);
> }
>
> The list_move() is not conditional at all and so the inode is
> *always* moved to the tail of wb->b_dirty

Oh, you are right.

>> It has difference.
>> 
>> Say, tail->dirtied_when == 1, inode->dirtied_when == 2, and now == 30
>> (tail->dirtied_when is expired at 31 with default config). In this case,
>> redirty_tail() doesn't update ->dirtied_when.
>
> OK, that's a different issue, and is actually handled by
> requeue_inode(), which is called to put inodes back on the correct
> dirty list when IO completes. I think that if you are going to use
> the wb dirty inode lists, you should probably use the existing
> functions to manage the inode lists appropriately rather than
> creating your own writeback list lifecycle.

See what __mark_inode_dirty() does. We can consolidate this with
__mark_inode_dirty() if you want.

In our case, a dirty that happens while the inode already has I_DIRTY set
still needs to be handled as new dirty at some point, because of the
data=journal behavior.

> If tux3 wants it's own dirty inode list lifecycles, then that's
> where avoiding the wb lists completely is an appropriate design
> point. I don't want to hack little bits and pieces all over the
> writeback code to support what tux3 is doing right now if it's going
> to do something else real soon. When tux3 moves to use it's own
> internal lists these new funcitons just have to be removed again, so
> let's skip the hack step we are at right now and go straight for
> supporting the "don't need the fs-writeback lists" infrstructure.

Is that agreed by anyone else? There was bdflush, then pdflush, and now bdi
flush. Why was it changed to bdi flush? This respects the intent of the
"pdflush to bdi flush" change, more or less.

Thanks.
-- 
OGAWA Hirofumi 


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Dave Chinner
On Tue, Jun 03, 2014 at 04:47:52PM +0900, OGAWA Hirofumi wrote:
> Daniel Phillips  writes:
> 
> > Hi Dave,
> > On 06/02/2014 08:33 PM, Dave Chinner wrote:
> >> On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
> >>>
> >>> Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is
> >>> not correct because the inode needs to end up on the dirty list
> >>> whether it was already there or not.
> >> redirty_tail() always moves the inode to the end of the dirty
> >> list.
> 
> It doesn't move inode to end of the dirty if wb.b_dirty is empty
> (I.e. it can move from wb.b_io to wb.b_dirty too).

Um, really?  What code are you reading? From 3.15-rc8:

static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
{
	assert_spin_locked(&wb->list_lock);
	if (!list_empty(&wb->b_dirty)) {
		struct inode *tail;

		tail = wb_inode(wb->b_dirty.next);
		if (time_before(inode->dirtied_when, tail->dirtied_when))
			inode->dirtied_when = jiffies;
	}
	list_move(&inode->i_wb_list, &wb->b_dirty);
}

The list_move() is not conditional at all and so the inode is
*always* moved to the tail of wb->b_dirty

> It has difference.
> 
> Say, tail->dirtied_when == 1, inode->dirtied_when == 2, and now == 30
> (tail->dirtied_when is expired at 31 with default config). In this case,
> redirty_tail() doesn't update ->dirtied_when.

OK, that's a different issue, and is actually handled by
requeue_inode(), which is called to put inodes back on the correct
dirty list when IO completes. I think that if you are going to use
the wb dirty inode lists, you should probably use the existing
functions to manage the inode lists appropriately rather than
creating your own writeback list lifecycle.

If tux3 wants its own dirty inode list lifecycles, then that's
where avoiding the wb lists completely is an appropriate design
point. I don't want to hack little bits and pieces all over the
writeback code to support what tux3 is doing right now if it's going
to do something else real soon. When tux3 moves to use its own
internal lists these new functions just have to be removed again, so
let's skip the hack step we are at right now and go straight for
supporting the "don't need the fs-writeback lists" infrastructure.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Dave Chinner
On Tue, Jun 03, 2014 at 12:01:11AM -0700, Daniel Phillips wrote:
> Hi Dave,
> On 06/02/2014 08:33 PM, Dave Chinner wrote:
> > On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
> >>> Hmmm - this is using the wb dirty lists and locks, but you
> >>> don't pass the wb structure to the writeback callout? That seem
> >>> wrong to me - why would you bother manipulating these lists if
> >>> you aren't using them to track dirty inodes in the first place?
> >> From Tux3's point of view, the wb lists just allow fs-writeback to
> >> determine when to call ->writeback(). We agree that inode lists
> >> are a suboptimal mechanism, but that is how fs-writeback currently
> >> thinks. It would be better if our inodes never go onto wb lists in
> >> the first place, provided that fs-writeback can still drive
> >> writeback appropriately.
> > It can't, and definitely not with the callout you added.
> 
> Obviously, fs-writeback does not get to choose inodes or specific pages
> with our proposal (which is the whole point) but it still decides when
> to call Tux3 vs some other filesystem on the same device, and is still
> able to indicate how much data it thinks should be written out. Even
> though we ignore the latter suggestion and always flush everything, we
> appreciate the timing of those callbacks very much, because they give
> us exactly the pressure sensitive batching behavior we want.

Which, again, points out that you want per-superblock flushing, not
per-bdi flushing which is what the current writeback code does.

> > However, we already avoid the VFS writeback lists for certain
> > filesystems for pure metadata. e.g. XFS does not use the VFS dirty
> > inode lists for inode metadata changes.  They get tracked internally
> > by the transaction subsystem which does it's own writeback according
> > to the requirements of journal space availability.
> >
> > This is done simply by not calling mark_inode_dirty() on any
> > metadata only change. If we want to do the same for data, then we'd
> > simply not call mark_inode_dirty() in the data IO path. That
> > requires a custom ->set_page_dirty method to be provided by the
> > filesystem that didn't call
> >
> > __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> >
> > and instead did it's own thing.
> >
> > So the per-superblock dirty tracking is something we can do right
> > now, and some filesystems do it for metadata. The missing piece for
> > data is writeback infrastructure capable of deferring to superblocks
> > for writeback rather than BDIs
> 
> We agree that fs-writeback inode lists are broken for anything
> more sophisticated than Ext2.

No, I did not say that.

I said XFS does something different for metadata changes because it
has different flushing constraints and requirements than the generic
code provides. That does not make the generic code broken.

> An advantage of the patch under
> consideration is that it still lets fs-writeback mostly work the
> way it has worked for the last few years, except for not allowing it
> to pick specific inodes and data pages for writeout. As far as I
> can see, it still balances writeout between different filesystems
> on the same block device pretty well.

Not really. If there are 3 superblocks on a BDI, and the dirty inode
list iterates between 2 of them with lots of dirty inodes, it can
> starve writeback from the third until one of its dirty inodes pops
to the head of the b_io list. So it's inherently unfair from that
perspective.

Changing the high level flushing to be per-superblock rather than
per-BDI would enable us to control that behaviour and be much fairer
to all the superblocks on a given BDI. That said, I don't really
care that much about this case...

> >> Perhaps fs-writeback should have an option to work without inode
> >> lists at all, and just maintain writeback scheduling statistics in
> >> the superblock or similar. That would be a big change, more on the
> >> experimental side. We would be happy to take it on after merge,
> >> hopefully for the benefit of other filesystems besides Tux3.
> > Well, I don't see it that way. If you have a requirement to be able
> > to track dirty inodes internally, then lets move to that implement
> > that infrastructure rather than hacking around with callouts that
> > only have a limited shelf-life.
> 
> What you call shelf-life, I call iteration. Small changes are beautiful.
> Apologies for the rhetoric, content is below.

Ok, let me use stronger language: implementing something at the
wrong layer only to have to redo it at the correct layer a short
while later is wasted effort. Do it right the first time.

I do believe I've explained this to you at least once before, so
let's not have to go over this again.

> > XFS already has everything it needs internally to track dirty
> > inodes. In fact, we used to do data writeback from within XFS and we
> > slowly removed it as the generic writeback code was improved made
> > the XFS code redundant.
> >
> > 

Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread OGAWA Hirofumi
Daniel Phillips  writes:

> Hi Dave,
> On 06/02/2014 08:33 PM, Dave Chinner wrote:
>> On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
>>>
>>> Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is
>>> not correct because the inode needs to end up on the dirty list
>>> whether it was already there or not.
>> redirty_tail() always moves the inode to the end of the dirty
>> list.

It doesn't move the inode to the end of the dirty list if wb.b_dirty is
empty (i.e. it can move from wb.b_io to wb.b_dirty too).

BTW, this is called from something like the mark-inode-dirty process, not
always from the writeback path.

>>> This requirement is analogous to __mark_inode_dirty() and must
>>> tolerate similar races. At the microoptimization level, calling
>>> redirty_tail from inode_writeback_touch would be less efficient
>>> and more bulky. Another small issue is, redirty_tail does not
>>> always update the timestamp, which could trigger some bogus
>>> writeback events.
>> redirty_tail does not update the timestamp when it doesn't need to
>> change. If it needs to be changed because the current value would
>> violate the time ordering requirements of the list, it rewrites it.
>>
>> So there is essentially no functional difference between the new
>> function and redirty_tail
>
> Hirofumi, would you care to comment?

There is a difference.

Say tail->dirtied_when == 1, inode->dirtied_when == 2, and now == 30
(tail->dirtied_when expires at 31 with the default config). In this case,
redirty_tail() doesn't update ->dirtied_when.

And if inode->dirtied_when is not updated to 30, the expiry time differs:
32 vs. 60.
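
In other words, with the example's simplified time units (illustrative
helper only):

/* Worked numbers for the example above (30-unit expire interval assumed). */
static unsigned long expires_at(unsigned long dirtied_when)
{
	return dirtied_when + 30;
}
/* expires_at(2)  == 32  -- dirtied_when left alone by redirty_tail() */
/* expires_at(30) == 60  -- dirtied_when reset to "now"               */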

Thanks.
-- 
OGAWA Hirofumi 


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Daniel Phillips
Hi Dave,

Here is a non-incremental patch. This implements your suggestion
from yesterday, except that the wb list lock is dropped before
calling ->writeback().

Regards,

Daniel

From d030d328757b160b39b252e82811a94843513cfc Mon Sep 17 00:00:00 2001
From: Daniel Phillips 
Date: Tue, 3 Jun 2014 00:19:11 -0700
Subject: [PATCH] Add a super operation for writeback

Add a "writeback" super operation to be called in the
form:

   progress = s_op->writeback(sb, wb, work, wbc);

Where sb is (struct super_block *), wb is (struct
bdi_writeback *), work is (struct wb_writeback_work *),
and wbc is (struct writeback_control *).

The filesystem is expected to flush some inodes to disk
and return progress of at least 1, or if no inodes are
flushed, return progress of zero. The filesystem should
try to flush at least the number of pages specified in
work->nr_pages, or if that is not possible, return
approximately the number of pages that were not flushed
in work->nr_pages.

Within the ->writeback callback, the filesystem should
call inode_writeback_done(inode) for each inode flushed
(and therefore set clean) or inode_writeback_touch(inode)
for any inode that will be retained dirty in cache.

Signed-off-by: Daniel Phillips 
---
 fs/fs-writeback.c | 107 +-
 include/linux/fs.h|   9 +++-
 include/linux/writeback.h |  19 
 3 files changed, 95 insertions(+), 40 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index be568b7..98810bd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -34,25 +34,6 @@
  */
 #define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_CACHE_SHIFT - 10))
 
-/*
- * Passed into wb_writeback(), essentially a subset of writeback_control
- */
-struct wb_writeback_work {
-	long nr_pages;
-	struct super_block *sb;
-	unsigned long *older_than_this;
-	enum writeback_sync_modes sync_mode;
-	unsigned int tagged_writepages:1;
-	unsigned int for_kupdate:1;
-	unsigned int range_cyclic:1;
-	unsigned int for_background:1;
-	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
-	enum wb_reason reason;		/* why was writeback initiated? */
-
-	struct list_head list;		/* pending work list */
-	struct completion *done;	/* set if the caller waits */
-};
-
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -192,6 +173,35 @@ void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * Remove inode from writeback list if clean.
+ */
+void inode_writeback_done(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.list_lock);
+	spin_lock(&inode->i_lock);
+	if (!(inode->i_state & I_DIRTY))
+		list_del_init(&inode->i_wb_list);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_done);
+
+/*
+ * Add inode to writeback dirty list with current time.
+ */
+void inode_writeback_touch(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
+	spin_lock(&bdi->wb.list_lock);
+	inode->dirtied_when = jiffies;
+	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+	spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_touch);
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -593,20 +603,12 @@ static long writeback_chunk_size(struct backing_dev_info *bdi,
  *
  * Return the number of pages and/or inodes written.
  */
-static long writeback_sb_inodes(struct super_block *sb,
-				struct bdi_writeback *wb,
-				struct wb_writeback_work *work)
+long generic_writeback_sb_inodes(
+	struct super_block *sb,
+	struct bdi_writeback *wb,
+	struct wb_writeback_work *work,
+	struct writeback_control *wbc)
 {
-	struct writeback_control wbc = {
-		.sync_mode		= work->sync_mode,
-		.tagged_writepages	= work->tagged_writepages,
-		.for_kupdate		= work->for_kupdate,
-		.for_background		= work->for_background,
-		.for_sync		= work->for_sync,
-		.range_cyclic		= work->range_cyclic,
-		.range_start		= 0,
-		.range_end		= LLONG_MAX,
-	};
 	unsigned long start_time = jiffies;
 	long write_chunk;
 	long wrote = 0;  /* count both pages and inodes */
@@ -644,7 +646,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 			redirty_tail(inode, wb);
 			continue;
 		}
-		if ((inode->i_state & I_SYNC) && wbc.sync_mode != WB_SYNC_ALL) {
+		if ((inode->i_state & I_SYNC) && wbc->sync_mode != WB_SYNC_ALL) {
 			/*
 			 * If this inode is locked for writeback and we are not
 			 * doing writeback-for-data-integrity, move it to
@@ -677,22 +679,22 @@ static long writeback_sb_inodes(struct super_block *sb,
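
For illustration only, a filesystem's own callback honouring the contract
described in this patch might be shaped like the sketch below;
myfs_pick_dirty_inode() and myfs_flush_inode() are hypothetical fs-internal
helpers.

static long myfs_writeback(struct super_block *sb, struct bdi_writeback *wb,
			   struct wb_writeback_work *work,
			   struct writeback_control *wbc)
{
	struct inode *inode;
	long progress = 0;

	while (work->nr_pages > 0 &&
	       (inode = myfs_pick_dirty_inode(sb)) != NULL) {
		long pages = myfs_flush_inode(inode, wbc);	/* hypothetical */

		if (pages < 0) {
			/* could not clean it now: restamp and keep it dirty */
			inode_writeback_touch(inode);
			break;
		}
		/* inode is clean: drop it from the wb dirty list */
		inode_writeback_done(inode);
		work->nr_pages -= pages;
		progress++;
	}
	return progress;	/* 0 means "nothing flushed", per the contract */
}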
 

Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Daniel Phillips
Hi Dave,
On 06/02/2014 08:33 PM, Dave Chinner wrote:
> On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
>>
>> Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is
>> not correct because the inode needs to end up on the dirty list
>> whether it was already there or not.
> redirty_tail() always moves the inode to the end of the dirty
> list.
>
>> This requirement is analogous to __mark_inode_dirty() and must
>> tolerate similar races. At the microoptimization level, calling
>> redirty_tail from inode_writeback_touch would be less efficient
>> and more bulky. Another small issue is, redirty_tail does not
>> always update the timestamp, which could trigger some bogus
>> writeback events.
> redirty_tail does not update the timestamp when it doesn't need to
> change. If it needs to be changed because the current value would
> violate the time ordering requirements of the list, it rewrites it.
>
> So there is essentially no functional difference between the new
> function and redirty_tail

Hirofumi, would you care to comment?

>
>>> Hmmm - this is using the wb dirty lists and locks, but you
>>> don't pass the wb structure to the writeback callout? That seem
>>> wrong to me - why would you bother manipulating these lists if
>>> you aren't using them to track dirty inodes in the first place?
>> From Tux3's point of view, the wb lists just allow fs-writeback to
>> determine when to call ->writeback(). We agree that inode lists
>> are a suboptimal mechanism, but that is how fs-writeback currently
>> thinks. It would be better if our inodes never go onto wb lists in
>> the first place, provided that fs-writeback can still drive
>> writeback appropriately.
> It can't, and definitely not with the callout you added.

Obviously, fs-writeback does not get to choose inodes or specific pages
with our proposal (which is the whole point) but it still decides when
to call Tux3 vs some other filesystem on the same device, and is still
able to indicate how much data it thinks should be written out. Even
though we ignore the latter suggestion and always flush everything, we
appreciate the timing of those callbacks very much, because they give
us exactly the pressure sensitive batching behavior we want.

> However, we already avoid the VFS writeback lists for certain
> filesystems for pure metadata. e.g. XFS does not use the VFS dirty
> inode lists for inode metadata changes.  They get tracked internally
> by the transaction subsystem which does it's own writeback according
> to the requirements of journal space availability.
>
> This is done simply by not calling mark_inode_dirty() on any
> metadata only change. If we want to do the same for data, then we'd
> simply not call mark_inode_dirty() in the data IO path. That
> requires a custom ->set_page_dirty method to be provided by the
> filesystem that didn't call
>
>   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>
> and instead did it's own thing.
>
> So the per-superblock dirty tracking is something we can do right
> now, and some filesystems do it for metadata. The missing piece for
> data is writeback infrastructure capable of deferring to superblocks
> for writeback rather than BDIs

We agree that fs-writeback inode lists are broken for anything
more sophisticated than Ext2. An advantage of the patch under
consideration is that it still lets fs-writeback mostly work the
way it has worked for the last few years, except for not allowing it
to pick specific inodes and data pages for writeout. As far as I
can see, it still balances writeout between different filesystems
on the same block device pretty well.

>> Perhaps fs-writeback should have an option to work without inode
>> lists at all, and just maintain writeback scheduling statistics in
>> the superblock or similar. That would be a big change, more on the
>> experimental side. We would be happy to take it on after merge,
>> hopefully for the benefit of other filesystems besides Tux3.
> Well, I don't see it that way. If you have a requirement to be able
> to track dirty inodes internally, then lets move to that implement
> that infrastructure rather than hacking around with callouts that
> only have a limited shelf-life.

What you call shelf-life, I call iteration. Small changes are beautiful.
Apologies for the rhetoric, content is below.

> XFS already has everything it needs internally to track dirty
> inodes. In fact, we used to do data writeback from within XFS and we
> slowly removed it as the generic writeback code was improved made
> the XFS code redundant.
>
> That said, parallelising writeback so we can support hundreds of
> GB/s of delayed allocation based writeback is something we
> eventually need to do, and that pretty much means we need to bring
> dirty data inode tracking back into XFS.
>
> So what we really need is writeback infrastructure that is:
>
>   a) independent of the dirty inode lists
>   b) implements background, periodic and immediate writeback
>   c) can be driven by BDI or superblock

Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Daniel Phillips
Hi Dave,
On 06/02/2014 08:33 PM, Dave Chinner wrote:
 On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:

 Redirty_tail nearly works, but if (!list_empty(wb-b_dirty)) is
 not correct because the inode needs to end up on the dirty list
 whether it was already there or not.
 redirty_tail() always moves the inode to the end of the dirty
 list.

 This requirement is analogous to __mark_inode_dirty() and must
 tolerate similar races. At the microoptimization level, calling
 redirty_tail from inode_writeback_touch would be less efficient
 and more bulky. Another small issue is, redirty_tail does not
 always update the timestamp, which could trigger some bogus
 writeback events.
 redirty_tail does not update the timestamp when it doesn't need to
 change. If it needs to be changed because the current value would
 violate the time ordering requirements of the list, it rewrites it.

 So there is essentially no functional difference between the new
 function and redirty_tail

Hirofumi, would you care to comment?


 H - this is using the wb dirty lists and locks, but you
 don't pass the wb structure to the writeback callout? That seem
 wrong to me - why would you bother manipulating these lists if
 you aren't using them to track dirty inodes in the first place?
 From Tux3's point of view, the wb lists just allow fs-writeback to
 determine when to call -writeback(). We agree that inode lists
 are a suboptimal mechanism, but that is how fs-writeback currently
 thinks. It would be better if our inodes never go onto wb lists in
 the first place, provided that fs-writeback can still drive
 writeback appropriately.
 It can't, and definitely not with the callout you added.

Obviously, fs-writeback does not get to choose inodes or specific pages
with our proposal (which is the whole point) but it still decides when
to call Tux3 vs some other filesystem on the same device, and is still
able to indicate how much data it thinks should be written out. Even
though we ignore the latter suggestion and always flush everything, we
appreciate the timing of those callbacks very much, because they give
us exactly the pressure sensitive batching behavior we want.

 However, we already avoid the VFS writeback lists for certain
 filesystems for pure metadata. e.g. XFS does not use the VFS dirty
 inode lists for inode metadata changes.  They get tracked internally
 by the transaction subsystem which does it's own writeback according
 to the requirements of journal space availability.

 This is done simply by not calling mark_inode_dirty() on any
 metadata only change. If we want to do the same for data, then we'd
 simply not call mark_inode_dirty() in the data IO path. That
 requires a custom -set_page_dirty method to be provided by the
 filesystem that didn't call

   __mark_inode_dirty(mapping-host, I_DIRTY_PAGES);

 and instead did it's own thing.

 So the per-superblock dirty tracking is something we can do right
 now, and some filesystems do it for metadata. The missing piece for
 data is writeback infrastructure capable of deferring to superblocks
 for writeback rather than BDIs

We agree that fs-writeback inode lists are broken for anything
more sophisticated than Ext2. An advantage of the patch under
consideration is that it still lets fs-writeback mostly work the
way it has worked for the last few years, except for not allowing it
to pick specific inodes and data pages for writeout. As far as I
can see, it still balances writeout between different filesystems
on the same block device pretty well.

 Perhaps fs-writeback should have an option to work without inode
 lists at all, and just maintain writeback scheduling statistics in
 the superblock or similar. That would be a big change, more on the
 experimental side. We would be happy to take it on after merge,
 hopefully for the benefit of other filesystems besides Tux3.
 Well, I don't see it that way. If you have a requirement to be able
 to track dirty inodes internally, then lets move to that implement
 that infrastructure rather than hacking around with callouts that
 only have a limited shelf-life.

What you call shelf-life, I call iteration. Small changes are beautiful.
Apologies for the rhetoric, content is below.

 XFS already has everything it needs internally to track dirty
 inodes. In fact, we used to do data writeback from within XFS and we
 slowly removed it as the generic writeback code was improved made
 the XFS code redundant.

 That said, parallelising writeback so we can support hundreds of
 GB/s of delayed allocation based writeback is something we
 eventually need to do, and that pretty much means we need to bring
 dirty data inode tracking back into XFS.

 So what we really need is writeback infrastructure that is:

   a) independent of the dirty inode lists
   b) implements background, periodic and immediate writeback
   c) can be driven by BDI or superblock

 IOWs, the abstraction we need for this writeback callout is 

Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Daniel Phillips
Hi Dave,

Here is a non-incremental patch. This implements your suggestion
from yesterday, except that the wb list lock is dropped before
calling -writeback().

Regards,

Daniel

From d030d328757b160b39b252e82811a94843513cfc Mon Sep 17 00:00:00 2001
From: Daniel Phillips dan...@tux3.org
Date: Tue, 3 Jun 2014 00:19:11 -0700
Subject: [PATCH] Add a super operation for writeback

Add a writeback super operation to be called in the
form:

   progress = s_op-writeback(sb, wb, work, wbc);

Where sb is (struct super_block *), wb is (struct
bdi_writeback *), work is (struct wb_writeback_work *),
and wbc is (struct writeback_control *).

The filesystem is expected to flush some inodes to disk
and return progress of at least 1, or if no inodes are
flushed, return progress of zero. The filesystem should
try to flush at least the number of pages specified in
work-nr_pages, or if that is not possible, return
approximately the number of pages that were not flushed
in work-nr_pages.

Within the -writeback callback, the filesystem should
call inode_writeback_done(inode) for each inode flushed
and therefore set clean) or inode_writeback_touch(inode)
for any inode that will be retained dirty in cache.

Signed-off-by: Daniel Phillips dan...@tux3.org
---
 fs/fs-writeback.c | 107 +-
 include/linux/fs.h|   9 +++-
 include/linux/writeback.h |  19 
 3 files changed, 95 insertions(+), 40 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index be568b7..98810bd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -34,25 +34,6 @@
  */
 #define MIN_WRITEBACK_PAGES(4096UL  (PAGE_CACHE_SHIFT - 10))
 
-/*
- * Passed into wb_writeback(), essentially a subset of writeback_control
- */
-struct wb_writeback_work {
-long nr_pages;
-struct super_block *sb;
-unsigned long *older_than_this;
-enum writeback_sync_modes sync_mode;
-unsigned int tagged_writepages:1;
-unsigned int for_kupdate:1;
-unsigned int range_cyclic:1;
-unsigned int for_background:1;
-unsigned int for_sync:1;/* sync(2) WB_SYNC_ALL writeback */
-enum wb_reason reason;/* why was writeback initiated? */
-
-struct list_head list;/* pending work list */
-struct completion *done;/* set if the caller waits */
-};
-
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -192,6 +173,35 @@ void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * Remove inode from writeback list if clean.
+ */
+void inode_writeback_done(struct inode *inode)
+{
+struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+spin_lock(bdi-wb.list_lock);
+spin_lock(inode-i_lock);
+if (!(inode-i_state  I_DIRTY))
+list_del_init(inode-i_wb_list);
+spin_unlock(inode-i_lock);
+spin_unlock(bdi-wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_done);
+
+/*
+ * Add inode to writeback dirty list with current time.
+ */
+void inode_writeback_touch(struct inode *inode)
+{
+struct backing_dev_info *bdi = inode-i_sb-s_bdi;
+spin_lock(bdi-wb.list_lock);
+inode-dirtied_when = jiffies;
+list_move(inode-i_wb_list, bdi-wb.b_dirty);
+spin_unlock(bdi-wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_touch);
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -593,20 +603,12 @@ static long writeback_chunk_size(struct backing_dev_info 
*bdi,
  *
  * Return the number of pages and/or inodes written.
  */
-static long writeback_sb_inodes(struct super_block *sb,
-struct bdi_writeback *wb,
-struct wb_writeback_work *work)
+long generic_writeback_sb_inodes(
+struct super_block *sb,
+struct bdi_writeback *wb,
+struct wb_writeback_work *work,
+struct writeback_control *wbc)
 {
-struct writeback_control wbc = {
-.sync_mode= work-sync_mode,
-.tagged_writepages= work-tagged_writepages,
-.for_kupdate= work-for_kupdate,
-.for_background= work-for_background,
-.for_sync= work-for_sync,
-.range_cyclic= work-range_cyclic,
-.range_start= 0,
-.range_end= LLONG_MAX,
-};
 unsigned long start_time = jiffies;
 long write_chunk;
 long wrote = 0;  /* count both pages and inodes */
@@ -644,7 +646,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 redirty_tail(inode, wb);
 continue;
 }
-if ((inode-i_state  I_SYNC)  wbc.sync_mode != WB_SYNC_ALL) {
+if ((inode-i_state  I_SYNC)  wbc-sync_mode != WB_SYNC_ALL) {
 /*
  * If this inode is locked for writeback and we are not
  * doing writeback-for-data-integrity, move it to
@@ -677,22 +679,22 @@ static long writeback_sb_inodes(struct 

Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread OGAWA Hirofumi
Daniel Phillips dan...@phunq.net writes:

 Hi Dave,
 On 06/02/2014 08:33 PM, Dave Chinner wrote:
 On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:

 Redirty_tail nearly works, but if (!list_empty(wb-b_dirty)) is
 not correct because the inode needs to end up on the dirty list
 whether it was already there or not.
 redirty_tail() always moves the inode to the end of the dirty
 list.

It doesn't move inode to end of the dirty if wb.b_dirty is empty
(I.e. it can move from wb.b_io to wb.b_dirty too).

BTW, this is called for like mark inode dirty process, not always
writeback path.

 This requirement is analogous to __mark_inode_dirty() and must
 tolerate similar races. At the microoptimization level, calling
 redirty_tail from inode_writeback_touch would be less efficient
 and more bulky. Another small issue is, redirty_tail does not
 always update the timestamp, which could trigger some bogus
 writeback events.
 redirty_tail does not update the timestamp when it doesn't need to
 change. If it needs to be changed because the current value would
 violate the time ordering requirements of the list, it rewrites it.

 So there is essentially no functional difference between the new
 function and redirty_tail

 Hirofumi, would you care to comment?

It has difference.

Say, tail-dirtied_when == 1, inode-dirtied_when == 2, and now == 30
(tail-dirtied_when is expired at 31 with default config). In this case,
redirty_tail() doesn't update -dirtied_when.

And if inode-dirtied_when is not updated to 30, expire time has
difference. I.e. 32 vs 60.

Thanks.
-- 
OGAWA Hirofumi hirof...@mail.parknet.co.jp
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Dave Chinner
On Tue, Jun 03, 2014 at 12:01:11AM -0700, Daniel Phillips wrote:
 Hi Dave,
 On 06/02/2014 08:33 PM, Dave Chinner wrote:
  On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
  H - this is using the wb dirty lists and locks, but you
  don't pass the wb structure to the writeback callout? That seem
  wrong to me - why would you bother manipulating these lists if
  you aren't using them to track dirty inodes in the first place?
  From Tux3's point of view, the wb lists just allow fs-writeback to
  determine when to call -writeback(). We agree that inode lists
  are a suboptimal mechanism, but that is how fs-writeback currently
  thinks. It would be better if our inodes never go onto wb lists in
  the first place, provided that fs-writeback can still drive
  writeback appropriately.
  It can't, and definitely not with the callout you added.
 
 Obviously, fs-writeback does not get to choose inodes or specific pages
 with our proposal (which is the whole point) but it still decides when
 to call Tux3 vs some other filesystem on the same device, and is still
 able to indicate how much data it thinks should be written out. Even
 though we ignore the latter suggestion and always flush everything, we
 appreciate the timing of those callbacks very much, because they give
 us exactly the pressure sensitive batching behavior we want.

Which, again, points out that you want per-superblock flushing, not
per-bdi flushing which is what the current writeback code does.

  However, we already avoid the VFS writeback lists for certain
  filesystems for pure metadata. e.g. XFS does not use the VFS dirty
  inode lists for inode metadata changes.  They get tracked internally
  by the transaction subsystem which does it's own writeback according
  to the requirements of journal space availability.
 
  This is done simply by not calling mark_inode_dirty() on any
  metadata only change. If we want to do the same for data, then we'd
  simply not call mark_inode_dirty() in the data IO path. That
  requires a custom ->set_page_dirty method to be provided by the
  filesystem that didn't call
 
  __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
  and instead did its own thing.
 
  So the per-superblock dirty tracking is something we can do right
  now, and some filesystems do it for metadata. The missing piece for
  data is writeback infrastructure capable of deferring to superblocks
  for writeback rather than BDIs
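
A schematic sketch of the kind of ->set_page_dirty Dave describes (the myfs_*
helpers are hypothetical; a real implementation would also need the page-cache
tagging and dirty-page accounting that __set_page_dirty_nobuffers() does, just
without the __mark_inode_dirty() call):

static int myfs_set_page_dirty(struct page *page)
{
        struct inode *inode = page->mapping->host;

        /* hypothetical: tag the page dirty in the page cache and account it */
        if (!myfs_set_page_dirty_in_cache(page))
                return 0;               /* page was already dirty */

        /*
         * Deliberately no __mark_inode_dirty(inode, I_DIRTY_PAGES) here:
         * the inode never reaches the VFS/bdi dirty lists.  Instead the
         * filesystem remembers it on its own dirty list and flushes it
         * according to its own ordering rules.
         */
        myfs_track_dirty_inode(inode);  /* hypothetical */
        return 1;
}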
 
 We agree that fs-writeback inode lists are broken for anything
 more sophisticated than Ext2.

No, I did not say that.

I said XFS does something different for metadata changes because it
has different flushing constraints and requirements than the generic
code provides. That does not make the generic code broken.

 An advantage of the patch under
 consideration is that it still lets fs-writeback mostly work the
 way it has worked for the last few years, except for not allowing it
 to pick specific inodes and data pages for writeout. As far as I
 can see, it still balances writeout between different filesystems
 on the same block device pretty well.

Not really. If there are 3 superblocks on a BDI, and the dirty inode
list iterates between 2 of them with lots of dirty inodes, it can
starve writeback from the third until one of it's dirty inodes pops
to the head of the b_io list. So it's inherently unfair from that
perspective.

Changing the high level flushing to be per-superblock rather than
per-BDI would enable us to control that behaviour and be much fairer
to all the superblocks on a given BDI. That said, I don't really
care that much about this case...

  Perhaps fs-writeback should have an option to work without inode
  lists at all, and just maintain writeback scheduling statistics in
  the superblock or similar. That would be a big change, more on the
  experimental side. We would be happy to take it on after merge,
  hopefully for the benefit of other filesystems besides Tux3.
  Well, I don't see it that way. If you have a requirement to be able
  to track dirty inodes internally, then let's move to implementing
  that infrastructure rather than hacking around with callouts that
  only have a limited shelf-life.
 
 What you call shelf-life, I call iteration. Small changes are beautiful.
 Apologies for the rhetoric, content is below.

Ok, let me use stronger language: implementing something at the
wrong layer only to have to redo it at the correct layer a short
while later is wasted effort. Do it right the first time.

I do believe I've explained this to you at least once before, so
let's not have to go over this again.

  XFS already has everything it needs internally to track dirty
  inodes. In fact, we used to do data writeback from within XFS and we
  slowly removed it as the generic writeback code was improved and made
  the XFS code redundant.
 
  That said, parallelising writeback so we can support hundreds of
  GB/s of delayed allocation based writeback is something we
  eventually 

Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Dave Chinner
On Tue, Jun 03, 2014 at 04:47:52PM +0900, OGAWA Hirofumi wrote:
 Daniel Phillips dan...@phunq.net writes:
 
  Hi Dave,
  On 06/02/2014 08:33 PM, Dave Chinner wrote:
  On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
 
   Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is
  not correct because the inode needs to end up on the dirty list
  whether it was already there or not.
  redirty_tail() always moves the inode to the end of the dirty
  list.
 
  It doesn't move the inode to the end of the dirty list if wb.b_dirty is empty
  (i.e. it can move from wb.b_io to wb.b_dirty too).

Um, really?  What code are you reading? From 3.15-rc8:

static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
{
        assert_spin_locked(&wb->list_lock);
        if (!list_empty(&wb->b_dirty)) {
                struct inode *tail;

                tail = wb_inode(wb->b_dirty.next);
                if (time_before(inode->dirtied_when, tail->dirtied_when))
                        inode->dirtied_when = jiffies;
        }
        list_move(&inode->i_wb_list, &wb->b_dirty);
}

The list_move() is not conditional at all and so the inode is
*always* moved to the tail of wb->b_dirty

 There is a difference.
 
 Say, tail->dirtied_when == 1, inode->dirtied_when == 2, and now == 30
 (tail->dirtied_when is expired at 31 with default config). In this case,
 redirty_tail() doesn't update ->dirtied_when.

OK, that's a different issue, and is actually handled by
requeue_inode(), which is called to put inodes back on the correct
dirty list when IO completes. I think that if you are going to use
the wb dirty inode lists, you should probably use the existing
functions to manage the inode lists appropriately rather than
creating your own writeback list lifecycle.

If tux3 wants its own dirty inode list lifecycles, then that's
where avoiding the wb lists completely is an appropriate design
point. I don't want to hack little bits and pieces all over the
writeback code to support what tux3 is doing right now if it's going
to do something else real soon. When tux3 moves to use its own
internal lists these new functions just have to be removed again, so
let's skip the hack step we are at right now and go straight for
supporting the "don't need the fs-writeback lists" infrastructure.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread OGAWA Hirofumi
Dave Chinner da...@fromorbit.com writes:

 It doesn't move the inode to the end of the dirty list if wb.b_dirty is empty
 (i.e. it can move from wb.b_io to wb.b_dirty too).

 Um, really?  What code are you reading? From 3.15-rc8:

 static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 {
         assert_spin_locked(&wb->list_lock);
         if (!list_empty(&wb->b_dirty)) {
                 struct inode *tail;

                 tail = wb_inode(wb->b_dirty.next);
                 if (time_before(inode->dirtied_when, tail->dirtied_when))
                         inode->dirtied_when = jiffies;
         }
         list_move(&inode->i_wb_list, &wb->b_dirty);
 }

 The list_move() is not conditional at all and so the inode is
 *always* moved to the tail of wb->b_dirty

Oh, you are right.

 There is a difference.
 
 Say, tail->dirtied_when == 1, inode->dirtied_when == 2, and now == 30
 (tail->dirtied_when is expired at 31 with default config). In this case,
 redirty_tail() doesn't update ->dirtied_when.

 OK, that's a different issue, and is actually handled by
 requeue_inode(), which is called to put inodes back on the correct
 dirty list when IO completes. I think that if you are going to use
 the wb dirty inode lists, you should probably use the existing
 functions to manage the inode lists appropriately rather than
 creating your own writeback list lifecycle.

See what __mark_inode_dirty() does. We can consolidate this with
__mark_inode_dirty() if you want.

In our case, a new dirty event while the inode already has I_DIRTY needs to
be handled at some point, because of data=journal behavior.

 If tux3 wants its own dirty inode list lifecycles, then that's
 where avoiding the wb lists completely is an appropriate design
 point. I don't want to hack little bits and pieces all over the
 writeback code to support what tux3 is doing right now if it's going
 to do something else real soon. When tux3 moves to use its own
 internal lists these new functions just have to be removed again, so
 let's skip the hack step we are at right now and go straight for
 supporting the "don't need the fs-writeback lists" infrastructure.

Is that agreed by others? There was bdflush, pdflush, and now bdi
flush. Why was it changed to bdi flush? This respects the intent of the change
from pdflush to bdi flush, more or less.

Thanks.
-- 
OGAWA Hirofumi hirof...@mail.parknet.co.jp


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Jan Kara
On Tue 03-06-14 17:52:09, Dave Chinner wrote:
 On Tue, Jun 03, 2014 at 12:01:11AM -0700, Daniel Phillips wrote:
   However, we already avoid the VFS writeback lists for certain
   filesystems for pure metadata. e.g. XFS does not use the VFS dirty
   inode lists for inode metadata changes.  They get tracked internally
    by the transaction subsystem which does its own writeback according
   to the requirements of journal space availability.
  
   This is done simply by not calling mark_inode_dirty() on any
   metadata only change. If we want to do the same for data, then we'd
   simply not call mark_inode_dirty() in the data IO path. That
    requires a custom ->set_page_dirty method to be provided by the
   filesystem that didn't call
  
 __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
  
    and instead did its own thing.
  
   So the per-superblock dirty tracking is something we can do right
   now, and some filesystems do it for metadata. The missing piece for
   data is writeback infrastructure capable of deferring to superblocks
   for writeback rather than BDIs
  
  We agree that fs-writeback inode lists are broken for anything
  more sophisticated than Ext2.
 
 No, I did not say that.
 
 I said XFS does something different for metadata changes because it
 has different flushing constraints and requirements than the generic
 code provides. That does not make the generic code broken.
 
  An advantage of the patch under
  consideration is that it still lets fs-writeback mostly work the
  way it has worked for the last few years, except for not allowing it
  to pick specific inodes and data pages for writeout. As far as I
  can see, it still balances writeout between different filesystems
  on the same block device pretty well.
 
 Not really. If there are 3 superblocks on a BDI, and the dirty inode
 list iterates between 2 of them with lots of dirty inodes, it can
 starve writeback from the third until one of it's dirty inodes pops
 to the head of the b_io list. So it's inherently unfair from that
 perspective.
 
 Changing the high level flushing to be per-superblock rather than
 per-BDI would enable us to control that behaviour and be much fairer
 to all the superblocks on a given BDI. That said, I don't really
 care that much about this case...
So we currently flush inodes in first dirtied first written back order when
superblock is not specified in writeback work. That completely ignores the
fact to which superblock inode belongs but I don't see per-sb fairness to
actually make any sense when
1) flushing old data (to keep promise set in dirty_expire_centisecs)
2) flushing data to reduce number of dirty pages
And these are really the only two cases where we don't do per-sb flushing.

Now when filesystems want to do something more clever (and I can see
reasons for that e.g. when journalling metadata, even more so when
journalling data) I agree we need to somehow implement the above two types
of writeback using per-sb flushing. Type 1) is actually pretty easy - just
tell each sb to writeback dirty data upto time T. Type 2) is more difficult
because that is more openended task - it seems similar to what shrinkers do
but that would require us to track per sb amount of dirty pages / inodes
and I'm not sure we want to add even more page counting statistics...
Especially since often bdi == fs. Thoughts?
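
As a rough illustration of the type 1) case only (everything below is
hypothetical pseudocode, not an existing kernel interface):

static void flush_old_data(struct backing_dev_info *bdi,
                           unsigned long older_than_this)
{
        struct super_block *sb;

        for_each_sb_on_bdi(sb, bdi) {           /* hypothetical iterator */
                if (sb->s_op->writeback_old)    /* hypothetical per-sb hook */
                        sb->s_op->writeback_old(sb, older_than_this);
                else
                        generic_flush_sb_old_data(sb, older_than_this); /* hypothetical */
        }
}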

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Christoph Hellwig
On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
 So we currently flush inodes in first dirtied first written back order when
 superblock is not specified in writeback work. That completely ignores the
 fact to which superblock inode belongs but I don't see per-sb fairness to
 actually make any sense when
 1) flushing old data (to keep promise set in dirty_expire_centisecs)
 2) flushing data to reduce number of dirty pages
 And these are really the only two cases where we don't do per-sb flushing.
 
 Now when filesystems want to do something more clever (and I can see
 reasons for that e.g. when journalling metadata, even more so when
 journalling data) I agree we need to somehow implement the above two types
 of writeback using per-sb flushing. Type 1) is actually pretty easy - just
 tell each sb to writeback dirty data upto time T. Type 2) is more difficult
 because that is more openended task - it seems similar to what shrinkers do
 but that would require us to track per sb amount of dirty pages / inodes
 and I'm not sure we want to add even more page counting statistics...
 Especially since often bdi == fs. Thoughts?

Honestly I think doing per-bdi writeback has been a major mistake.  As
you said it only even matters when we have filesystems on multiple
partitions on a single device, and even then only in a simple setup,
as soon as we use LVM or btrfs this sort of sharing stops to happen
anyway.  I don't even see much of a benefit except that we prevent
two flushing daemons to congest a single device for that special case
of multiple filesystems on partitions of the same device, and that could
be solved in other ways.

The major benefit of the per-bdi writeback was that for the usual case
of one filesystem per device we get exactly one flusher thread per
filesystem instead of multiple competing ones, but per-sb writeback
would solve that just as fine.



Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Theodore Ts'o
On Tue, Jun 03, 2014 at 07:14:44AM -0700, Christoph Hellwig wrote:
 Honestly I think doing per-bdi writeback has been a major mistake.  As
 you said it only even matters when we have filesystems on multiple
 partitions on a single device, and even then only in a simple setup,
 as soon as we use LVM or btrfs this sort of sharing stops to happen
 anyway.  I don't even see much of a benefit except that we prevent
 two flushing daemons to congest a single device for that special case
 of multiple filesystems on partitions of the same device, and that could
 be solved in other ways.

To be fair, back when per-bdi writeback was introduced, having
multiple partitions on a single disk was far more common, and the use
of LVM was much less common.  These days, with many more systems using one
big root filesystem, and/or using flash where parallel writes can
actually be a good thing (since there isn't a single disk head which
has to seek all over the HDD), the case for keeping per-bdi writeback
is much weaker, if not non-existent.

Cheers,

- Ted


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Theodore Ts'o
On Tue, Jun 03, 2014 at 07:30:32AM +0200, Christian Stroetmann wrote:
 In general, I do not believe that the complexity problems of soft updates,
 atomic writes, and related techniques can be solved by hand/manually. So my
 suggestion is to automatically handle the complexity problem of e.g.
 dependancies in a way that is comparable to a(n on-the-fly) file-system
 compiler so to say that works on a very large dependancy graph (having
 several billions of graph vertices actually). And at this point an
 abstraction like it is given with Featherstitch helps to feed and control
 this special FS compiler.

Well, if you want to try to implement something like this, go for it!

I'd be very curious to see (a) how much CPU overhead it takes
to crunch on a dependency graph with billions of vertices, and (b) how
easy it would be to express these dependencies and how maintainable such a
dependency language would be.  Sounds like a great research topic, and
I'll note the Call For Papers for FAST 2015 is out, and if you can
solve these problems, it would make a great FAST 2015 submission:

https://www.usenix.org/conference/fast15/call-for-papers

Cheers,

- Ted


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Jan Kara
On Tue 03-06-14 07:14:44, Christoph Hellwig wrote:
 On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
  So we currently flush inodes in first dirtied first written back order when
  superblock is not specified in writeback work. That completely ignores the
  fact to which superblock inode belongs but I don't see per-sb fairness to
  actually make any sense when
  1) flushing old data (to keep promise set in dirty_expire_centisecs)
  2) flushing data to reduce number of dirty pages
  And these are really the only two cases where we don't do per-sb flushing.
  
  Now when filesystems want to do something more clever (and I can see
  reasons for that e.g. when journalling metadata, even more so when
  journalling data) I agree we need to somehow implement the above two types
  of writeback using per-sb flushing. Type 1) is actually pretty easy - just
  tell each sb to writeback dirty data upto time T. Type 2) is more difficult
  because that is more openended task - it seems similar to what shrinkers do
  but that would require us to track per sb amount of dirty pages / inodes
  and I'm not sure we want to add even more page counting statistics...
  Especially since often bdi == fs. Thoughts?
 
 Honestly I think doing per-bdi writeback has been a major mistake.  As
 you said it only even matters when we have filesystems on multiple
 partitions on a single device, and even then only in a simple setup,
 as soon as we use LVM or btrfs this sort of sharing stops to happen
 anyway.  I don't even see much of a benefit except that we prevent
 two flushing daemons to congest a single device for that special case
 of multiple filesystems on partitions of the same device, and that could
 be solved in other ways.
  So I agree per-bdi / per-sb matters only in simple setups but machines
with single rotating disk with several partitions and without LVM aren't
that rare AFAICT from my experience. And I agree we went for per-bdi
flushing to avoid two threads congesting a single device leading to
suboptimal IO patterns during background writeback.

So currently I'm convinced we want to go for per-sb dirty tracking. That
also makes some speedups in that code noticeably simpler. I'm not convinced
about the per-sb flushing thread - if we don't regress the multiple sb on
bdi case when we just let the threads from different superblocks contend
for IO, then that would be a natural thing to do. But once we have to
introduce some synchronization between threads to avoid regressions, I
think it might be easier to just stay with per-bdi thread which switches
between superblocks.

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Christian Stroetmann

On the 3rd of June 2014 16:57, Theodore Ts'o wrote:

On Tue, Jun 03, 2014 at 07:30:32AM +0200, Christian Stroetmann wrote:

In general, I do not believe that the complexity problems of soft updates,
atomic writes, and related techniques can be solved by hand/manually. So my
suggestion is to automatically handle the complexity problem of e.g.
dependencies in a way that is comparable to a(n on-the-fly) file-system
compiler so to say that works on a very large dependency graph (having
several billions of graph vertices actually). And at this point an
abstraction like it is given with Featherstitch helps to feed and control
this special FS compiler.

Well, if you want to try to implement something like this, go for it!


I have already been active on this for some weeks.


I'd be very curious to see (a) how much CPU overhead it takes
to crunch on a dependency graph with billions of vertices, and (b) how
easy it would be to express these dependencies and how maintainable such a
dependency language would be.  Sounds like a great research topic, and


To a): A run is expected to take a few hours on a single computing node.

Also, such graph processing does not have to be done all the time, but only if 
a new application demands a specific handling of the data with respect to, 
e.g., one of the ACID criteria. That is the reason why I put 
"on-the-fly" in parentheses.


To b) I hoped that file system developers could make some suggestions or 
point to some no-gos.
I am also thinking about Petri nets in this context, though it is just 
an idea.


I would also like to mention that it could be used in conjunction with 
Non-Volatile Memory (NVM) as well.



I'll note the Call For Papers for FAST 2015 is out, and if you can
solve these problems, it would make a great FAST 2015 submission:

https://www.usenix.org/conference/fast15/call-for-papers


Are you serious or have I missed the 1st of April once again?
Actually, I could only write a general overview about the approach 
comparable to a white paper, but nothing more.



Cheers,

- Ted



Best regards
Christian


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-03 Thread Daniel Phillips

On Tuesday, June 3, 2014 8:21:55 AM PDT, Jan Kara wrote:

On Tue 03-06-14 07:14:44, Christoph Hellwig wrote:

On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:

 ...
  So I agree per-bdi / per-sb matters only in simple setups but machines
with single rotating disk with several partitions and without LVM aren't
that rare AFAICT from my experience.


Retribution is sure to be swift, terrible and eternal for anyone who dares to
break those.


And I agree we went for per-bdi
flushing to avoid two threads congesting a single device leading to
suboptimal IO patterns during background writeback.


A proposal is on the table to implement s_ops->writeback() as a per-sb
operation in such a way that nothing changes in the current per-inode path.
Good or bad approach?


So currently I'm convinced we want to go for per-sb dirty tracking. That
also makes some speedups in that code noticeably simpler. I'm not convinced
about the per-sb flushing thread - if we don't regress the multiple sb on
bdi case when we just let the threads from different superblocks contend
for IO, then that would be a natural thing to do. But once we have to
introduce some synchronization between threads to avoid regressions, I
think it might be easier to just stay with per-bdi thread which switches
between superblocks.


Could you elaborate on the means of switching between superblocks? Do you mean
a new fs-writeback path just for data=journal class filesystems, or are you
suggesting changing the way all filesystems are driven?

Regards,

Daniel


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-02 Thread Christian Stroetmann

On the 3rd of June 2014 05:39, Dave Chinner wrote:

On Mon, Jun 02, 2014 at 10:30:07AM +0200, Christian Stroetmann wrote:

When I followed the advice of Dave Chinner:
"We're not going to merge that page forking stuff (like you were
told at LSF 2013 more than a year ago:
http://lwn.net/Articles/548091/) without rigorous design review and
a demonstration of the solutions to all the hard corner cases it
has"
given in his e-mail related with the presentation of the latest
version of the Tux3 file system (see [1]) and read the linked
article, I found in the second comments:
"Parts of this almost sound like it either a.) overlaps with or b.)
would benefit greatly from something similar to Featherstitch
[[2]]."

Could it be that we have with Featherstitch a general solution
already that is said to be even "file system agnostic"?
Honestly, I thought that something like this would make its way into
the Linux code base.

Here's what I said about the last proposal (a few months ago) for
integrating featherstitch into the kernel:

http://www.spinics.net/lists/linux-fsdevel/msg72799.html

It's not a viable solution.

Cheers,

Dave.


How annoying, I did not remember your e-mail in the referenced thread 
"[Lsf-pc] [LSF/MM TOPIC] atomic block device" despite having saved it on 
local disk. Thanks a lot for the reminder.


I also directly saw the problem with the research prototype 
Featherstitch, specifically the point "All the filesystem modules it has 
are built into the featherstitch kernel module, and called through a VFS 
shim layer". But it is just a prototype and its concept of abstraction 
does not have to be copied 1:1 into the Linux code base.


In general, I do not believe that the complexity problems of soft 
updates, atomic writes, and related techniques can be solved by 
hand/manually. So my suggestion is to automatically handle the 
complexity problem of e.g. dependencies in a way that is comparable to 
a(n on-the-fly) file-system compiler so to say that works on a very 
large dependency graph (having several billions of graph vertices 
actually). And at this point an abstraction like it is given with 
Featherstitch helps to feed and control this special FS compiler.


Actually, I have to follow the discussion further on the one hand and go 
deeper into the highly complex problem space on the other hand.




With all the best
Christian


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-02 Thread Dave Chinner
On Mon, Jun 02, 2014 at 10:30:07AM +0200, Christian Stroetmann wrote:
> When I followed the advice of Dave Chinner:
> "We're not going to merge that page forking stuff (like you were
> told at LSF 2013 more than a year ago:
> http://lwn.net/Articles/548091/) without rigorous design review and
> a demonstration of the solutions to all the hard corner cases it
> has"
> given in his e-mail related with the presentation of the latest
> version of the Tux3 file system (see [1]) and read the linked
> article, I found in the second comments:
> "Parts of this almost sound like it either a.) overlaps with or b.)
> would benefit greatly from something similar to Featherstitch
> [[2]]."
> 
> Could it be that we have with Featherstitch a general solution
> already that is said to be even "file system agnostic"?
> Honestly, I thought that something like this would make its way into
> the Linux code base.

Here's what I said about the last proposal (a few months ago) for
integrating featherstitch into the kernel:

http://www.spinics.net/lists/linux-fsdevel/msg72799.html

It's not a viable solution.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-02 Thread Dave Chinner

[ please line wrap at something sane like 68 columns ]

On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
> On 06/01/2014 08:15 PM, Dave Chinner wrote:
> > On Sun, Jun 01, 2014 at 02:41:02PM -0700, I wrote:
> >> +
> >> +/*
> >> + * Add inode to writeback dirty list with current time.
> >> + */
> >> +void inode_writeback_touch(struct inode *inode)
> >> +{
> >> +  struct backing_dev_info *bdi = inode->i_sb->s_bdi;
> >> +  spin_lock(&bdi->wb.list_lock);
> >> +  inode->dirtied_when = jiffies;
> >> +  list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> >> +  spin_unlock(&bdi->wb.list_lock);
> >> +}
> >> +EXPORT_SYMBOL_GPL(inode_writeback_touch);
> > You should be able to use redirty_tail() for this
> 
> Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is
> not correct because the inode needs to end up on the dirty list
> whether it was already there or not.

redirty_tail() always moves the inode to the end of the dirty
list.

> This requirement is analogous to __mark_inode_dirty() and must
> tolerate similar races. At the microoptimization level, calling
> redirty_tail from inode_writeback_touch would be less efficient
> and more bulky. Another small issue is, redirty_tail does not
> always update the timestamp, which could trigger some bogus
> writeback events.

redirty_tail does not update the timestamp when it doesn't need to
change. If it needs to be changed because the current value would
violate the time ordering requirements of the list, it rewrites it.

So there is essentially no functional difference between the new
function and redirty_tail

> > Hmmm - this is using the wb dirty lists and locks, but you
> > don't pass the wb structure to the writeback callout? That seem
> > wrong to me - why would you bother manipulating these lists if
> > you aren't using them to track dirty inodes in the first place?
> 
> From Tux3's point of view, the wb lists just allow fs-writeback to
> determine when to call ->writeback(). We agree that inode lists
> are a suboptimal mechanism, but that is how fs-writeback currently
> thinks. It would be better if our inodes never go onto wb lists in
> the first place, provided that fs-writeback can still drive
> writeback appropriately.

It can't, and definitely not with the callout you added.

However, we already avoid the VFS writeback lists for certain
filesystems for pure metadata. e.g. XFS does not use the VFS dirty
inode lists for inode metadata changes.  They get tracked internally
by the transaction subsystem which does its own writeback according
to the requirements of journal space availability.

This is done simply by not calling mark_inode_dirty() on any
metadata only change. If we want to do the same for data, then we'd
simply not call mark_inode_dirty() in the data IO path. That
requires a custom ->set_page_dirty method to be provided by the
filesystem that didn't call

__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

> and instead did its own thing.

So the per-superblock dirty tracking is something we can do right
now, and some filesystems do it for metadata. The missing piece for
data is writeback infrastructure capable of deferring to superblocks
for writeback rather than BDIs

> Perhaps fs-writeback should have an option to work without inode
> lists at all, and just maintain writeback scheduling statistics in
> the superblock or similar. That would be a big change, more on the
> experimental side. We would be happy to take it on after merge,
> hopefully for the benefit of other filesystems besides Tux3.

Well, I don't see it that way. If you have a requirement to be able
> to track dirty inodes internally, then let's move to implementing
that infrastructure rather than hacking around with callouts that
only have a limited shelf-life.

> What we pass to ->writeback() is just a matter of taste at the
> moment because we currently ignore everything except
> work->nr_pages. Anything sane is fine. Note that bdi_writeback is
> already available from bdi->wb, the "default writeback info",
> whatever that means. A quick tour of existing usage suggests that
> you can reliably obtain the wb that way, but a complete audit
> would be needed.
> 
> Most probably, this API will evolve as new users arrive, and also
> as our Tux3 usage becomes more sophisticated. For now, Tux3 just
> flushes everything without asking questions. Exactly how that
> might change in the future is hard to predict. You are in a better
> position to know what XFS would require in order to use this
> interface.

XFS already has everything it needs internally to track dirty
inodes. In fact, we used to do data writeback from within XFS and we
slowly removed it as the generic writeback code was improved and made
the XFS code redundant.

That said, parallelising writeback so we can support hundreds of
GB/s of delayed allocation based writeback is something we
eventually need to do, and that pretty much means we need to bring
dirty data inode tracking back into XFS.

So what we really 

Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-02 Thread Daniel Phillips

On 06/01/2014 08:15 PM, Dave Chinner wrote:
> On Sun, Jun 01, 2014 at 02:41:02PM -0700, I wrote:
>> +
>> +/*
>> + * Add inode to writeback dirty list with current time.
>> + */
>> +void inode_writeback_touch(struct inode *inode)
>> +{
>> +struct backing_dev_info *bdi = inode->i_sb->s_bdi;
>> +spin_lock(&bdi->wb.list_lock);
>> +inode->dirtied_when = jiffies;
>> +list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
>> +spin_unlock(&bdi->wb.list_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(inode_writeback_touch);
> You should be able to use redirty_tail() for this

Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is not correct 
because the inode
needs to end up on the dirty list whether it was already there or not. This 
requirement is analogous
to __mark_inode_dirty() and must tolerate similar races. At the 
microoptimization level, calling
redirty_tail from inode_writeback_touch would be less efficient and more bulky. 
Another small issue
is, redirty_tail does not always update the timestamp, which could trigger some 
bogus writeback events.

> Hmmm - this is using the wb dirty lists and locks, but you
> don't pass the wb structure to the writeback callout? That seem
> wrong to me - why would you bother manipulating these lists if you
> aren't using them to track dirty inodes in the first place?

From Tux3's point of view, the wb lists just allow fs-writeback to determine 
when to call
->writeback(). We agree that inode lists are a suboptimal mechanism, but that 
is how fs-writeback
currently thinks. It would be better if our inodes never go onto wb lists in 
the first place,
provided that fs-writeback can still drive writeback appropriately.

Perhaps fs-writeback should have an option to work without inode lists at all, 
and just maintain
writeback scheduling statistics in the superblock or similar. That would be a 
big change, more on
the experimental side. We would be happy to take it on after merge, hopefully 
for the benefit of
other filesystems besides Tux3.

What we pass to ->writeback() is just a matter of taste at the moment because 
we currently ignore
everything except work->nr_pages. Anything sane is fine. Note that 
bdi_writeback is already
available from bdi->wb, the "default writeback info", whatever that means. A 
quick tour of existing
usage suggests that you can reliably obtain the wb that way, but a complete 
audit would be needed.

Most probably, this API will evolve as new users arrive, and also as our Tux3 
usage becomes more
sophisticated. For now, Tux3 just flushes everything without asking questions. 
Exactly how that
might change in the future is hard to predict. You are in a better position to 
know what XFS would
require in order to use this interface.

> The first thing that __writeback_sb_inodes() does is create a struct
> writeback_control from the wb_writeback_work. That should be done
> here and passed to __writeback_sb_inodes(), which should be renamed
> "generic_writeback_sb_inodes()".  Also, all the fields in the wbc
> need to be initialised correctly (i.e including range_start/end).

Good point. I will try that generic_writeback_sb_inode concept for the next 
iteration, which will
need a day or two including regression testing.

> Further, a writeback implementation will need to use the generic bdi
> list and lock structures and so we need to pass the bdi_writeback.
> Similarly, if we are going to pass nr_pages, we might as well pass
> the entire work structure. 
>
> Finally, I don't like the way the wb->list_lock is treated
> differently by this code. I suspect that we need to rationalise the
> layering of the wb->list_lock as it is currently not clear what it
> protects and what (nested) layers of the writeback code actually
> require it.

One design goal of this proposed writeback interface is to hide the wb list 
lock entirely from the
filesystem so core writeback can evolve more easily. This is not cast in stone, 
but it smells like
decent factoring. We can save some spinlocks by violating that factoring (as 
Hirofumi's original
hack did) at the cost of holding a potentially busy wb lock longer, which does 
not seem like a good
trade.

I agree that writeback list locking is murky, and fs-writeback is murky in 
general. We would like
Tux3 to be part of the solution, not the problem.

> What I'd like to see is this work:
>
> struct super_ops ... = {
> 
>   .writeback = generic_writeback_sb_inodes,
> 
>
> And that means writeback_sb_inodes() would become:
>
> static long writeback_sb_inodes(struct super_block *sb,
>   struct bdi_writeback *wb,
>   struct wb_writeback_work *work)
> {
>   struct writeback_control wbc = {
>   .sync_mode  = work->sync_mode,
>   .tagged_writepages  = work->tagged_writepages,
>   .for_kupdate= work->for_kupdate,
>   .for_background = work->for_background,
>   .for_sync  
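
The quoted initialiser is cut off by the archive; for reference, the
corresponding code in 3.15's __writeback_sb_inodes() continues roughly as:

        struct writeback_control wbc = {
                .sync_mode              = work->sync_mode,
                .tagged_writepages      = work->tagged_writepages,
                .for_kupdate            = work->for_kupdate,
                .for_background         = work->for_background,
                .for_sync               = work->for_sync,
                .range_cyclic           = work->range_cyclic,
                .range_start            = 0,
                .range_end              = LLONG_MAX,
        };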

Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-02 Thread Christian Stroetmann

On the 1st of June 2014 23:41, Daniel Phillips wrote:

Hi,

This is the first of four core changes we would like for Tux3. We
start with a hard one and suggest a simple solution.

The first patch in this series adds a new super operation to write
back multiple inodes in a single call. The second patch applies to
our linux-tux3 repository at git.kernel.org to demonstrate how this
interface is used, and removes about 450 lines of workaround code.

Traditionally, core kernel tells each mounted filesystems which
dirty pages of which inodes should be flushed to disk, but
unfortunately, is blissfully ignorant of filesystem-specific
ordering constraints. This scheme was really invented for Ext2 and
has not evolved much recently. Tux3, with its strong ordering and
optimized delta commit, cannot tolerate random flushing and
therefore takes responsibility for flush ordering itself. On the
other hand, Tux3 has no good way to know when is the right time to
flush, but core is good at that. This proposed API harmonizes those
two competencies so that Tux3 and core each take care of what they
are good at, and not more.

The API extension consists of a new writeback hook and two helpers
to be uses within the hook. The hook sits about halfway down the
fs-writeback stack, just after core has determined that some dirty
inodes should be flushed to disk and just before it starts thinking
about which inodes those should be. At that point, core calls Tux3
instead of continuing on down the usual do_writepages path. Tux3
responds by staging a new delta commit, using the new helpers to
tell core which inodes were flushed versus being left dirty in
cache. This is pretty much the same behavior as the traditional
writeout path, but less convoluted, probably more efficient, and
certainly easier to analyze.

The new writeback hook looks like:

 progress = sb->s_op->writeback(sb, &writeback_control, &nr_pages);

This should be self-explanatory: nr_pages and progress have the
semantics of existing usage in fs-writeback; writeback_control is
ignored by Tux3, but that is only because Tux3 always flushes
everything and does not require hints for now. We can safely assume
that &wbc or equivalent is wanted here. An obvious wart is the
overlap between "progress" and "nr_pages", but fs-writeback thinks
that way, so it would not make much sense to improve one without
improving the other.
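
For concreteness, a minimal sketch of that hook and of a trivial
filesystem-side implementation in the "flush everything" spirit described
here; the exact prototype is not spelled out in the thread, so the types and
the myfs_* names are assumptions:

struct super_operations {
        /* ... existing operations ... */
        long (*writeback)(struct super_block *sb,
                          struct writeback_control *wbc,
                          long *nr_pages);
};

/* Hypothetical filesystem side: stage and commit a whole delta, ignore the
 * hints in wbc, and report how much was written back. */
static long myfs_writeback(struct super_block *sb,
                           struct writeback_control *wbc,
                           long *nr_pages)
{
        long flushed = myfs_flush_delta(sb);    /* hypothetical: flush everything */

        if (*nr_pages > flushed)
                *nr_pages -= flushed;
        else
                *nr_pages = 0;
        return flushed;                         /* the "progress" mentioned above */
}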

Tux3 necessarily keeps its own dirty inode list, which is an area
of overlap with fs-writeback. In a perfect world, there would be
just one dirty inode list per superblock, on which both fs-writeback
and Tux3 would operate. That would be a deeper core change than
seems appropriate right now.

Potential races are a big issue with this API, which is no surprise.
The fs-writeback scheme requires keeping several kinds of object in
sync: tux3 dirty inode lists, fs-writeback dirty inode lists and
inode dirty state. The new helpers inode_writeback_done(inode) and
inode_writeback_touch(inode) take care of that while hiding
internal details of the fs-writeback implementation.

Tux3 calls inode_writeback_done when it has flushed an inode and
marked it clean, or calls inode_writeback_touch if it intends to
retain a dirty inode in cache. These have simple implementations.
The former just removes a clean inode from any fs-writeback list.
The latter updates the inode's dirty timestamp so that fs-writeback
does not keep trying to flush it. Both these things could be done more
efficiently by re-engineering fs-writeback, but we prefer to work
with the existing scheme for now.
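
The bodies of these helpers are not quoted here; a guess at the shape of
inode_writeback_done(), based on the description above and on the
inode_writeback_touch() hunk quoted elsewhere in the thread, might be:

void inode_writeback_done(struct inode *inode)
{
        struct backing_dev_info *bdi = inode->i_sb->s_bdi;

        spin_lock(&bdi->wb.list_lock);
        spin_lock(&inode->i_lock);
        if (!(inode->i_state & I_DIRTY))
                list_del_init(&inode->i_wb_list);       /* clean: drop from wb lists */
        spin_unlock(&inode->i_lock);
        spin_unlock(&bdi->wb.list_lock);
}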

Hirofumi's masterful hack nicely avoided racy removal of inodes from
the writeback list by taking an internal fs-writeback lock inside
filesystem code. The new helper requires dropping i_lock inside
filesystem code and retaking it in the helper, so inode redirty can
race with writeback list removal. This does not seem to present a
problem because filesystem code is able to enforce strict
alternation of cleaning and calling the helper. As an offsetting
advantage, writeback lock contention is reduced.

Compared to Hirofumi's hack, the cost of this interface is one
additional spinlock per inode_writeback_done,  which is
insignificant compared to the convoluted code path that is avoided.
Regards,

Daniel

When I followed the advice of Dave Chinner:
"We're not going to merge that page forking stuff (like you were told at 
LSF 2013 more than a year ago: http://lwn.net/Articles/548091/) without 
rigorous design review and a demonstration of the solutions to all the 
hard corner cases it has"
given in his e-mail related with the presentation of the latest version 
of the Tux3 file system (see [1]) and read the linked article, I found 
in the second comments:
"Parts of this almost sound like it either a.) overlaps with or b.) 
would benefit greatly from something similar to Featherstitch [[2]]."


Could it be that we have with Featherstitch a general solution already 
that is said to be even "file system agnostic"?
Honestly, I 

Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-02 Thread Christian Stroetmann

On the 1st of June 2014 23:41, Daniel Phillips wrote:

Hi,

This is the first of four core changes we would like for Tux3. We
start with a hard one and suggest a simple solution.

The first patch in this series adds a new super operation to write
back multiple inodes in a single call. The second patch applies to
our linux-tux3 repository at git.kernel.org to demonstrate how this
interface is used, and removes about 450 lines of workaround code.

Traditionally, core kernel tells each mounted filesystems which
dirty pages of which inodes should be flushed to disk, but
unfortunately, is blissfully ignorant of filesystem-specific
ordering constraints. This scheme was really invented for Ext2 and
has not evolved much recently. Tux3, with its strong ordering and
optimized delta commit, cannot tolerate random flushing and
therefore takes responsibility for flush ordering itself. On the
other hand, Tux3 has no good way to know when is the right time to
flush, but core is good at that. This proposed API harmonizes those
two competencies so that Tux3 and core each take care of what they
are good at, and not more.

The API extension consists of a new writeback hook and two helpers
to be uses within the hook. The hook sits about halfway down the
fs-writeback stack, just after core has determined that some dirty
inodes should be flushed to disk and just before it starts thinking
about which inodes those should be. At that point, core calls Tux3
instead of continuing on down the usual do_writepages path. Tux3
responds by staging a new delta commit, using the new helpers to
tell core which inodes were flushed versus being left dirty in
cache. This is pretty much the same behavior as the traditional
writeout path, but less convoluted, probably more efficient, and
certainly easier to analyze.

The new writeback hook looks like:

 progress = sb-s_op-writeback(sb,writeback_control,nr_pages);

This should be self-explanatory: nr_pages and progress have the
semantics of existing usage in fs-writeback; writeback_control is
ignored by Tux3, but that is only because Tux3 always flushes
everything and does not require hints for now. We can safely assume
thatwbc or equivalent is wanted here. An obvious wart is the
overlap between progress and nr_pages, but fs-writeback thinks
that way, so it would not make much sense to improve one without
improving the other.

Tux3 necessarily keeps its own dirty inode list, which is an area
of overlap with fs-writeback. In a perfect world, there would be
just one dirty inode list per superblock, on which both fs-writeback
and Tux3 would operate. That would be a deeper core change than
seems appropriate right now.

Potential races are a big issue with this API, which is no surprise.
The fs-writeback scheme requires keeping several kinds of object in
sync: tux3 dirty inode lists, fs-writeback dirty inode lists and
inode dirty state. The new helpers inode_writeback_done(inode) and
inode_writeback_touch(inode) take care of that while hiding
internal details of the fs-writeback implementation.

Tux3 calls inode_writeback_done when it has flushed an inode and
marked it clean, or calls inode_writeback_touch if it intends to
retain a dirty inode in cache. These have simple implementations.
The former just removes a clean inode from any fs-writeback list.
The latter updates the inode's dirty timestamp so that fs-writeback
does not keep trying flush it. Both these things could be done more
efficiently by re-engineering fs-writeback, but we prefer to work
with the existing scheme for now.

Hirofumi's masterful hack nicely avoided racy removal of inodes from
the writeback list by taking an internal fs-writeback lock inside
filesystem code. The new helper requires dropping i_lock inside
filesystem code and retaking it in the helper, so inode redirty can
race with writeback list removal. This does not seem to present a
problem because filesystem code is able to enforce strict
alternation of cleaning and calling the helper. As an offsetting
advantage, writeback lock contention is reduced.

Compared to Hirofumi's hack, the cost of this interface is one
additional spinlock per inode_writeback_done,  which is
insignificant compared to the convoluted code path that is avoided.
Regards,

Daniel

When I followed the advice of Dave Chinner:
We're not going to merge that page forking stuff (like you were told at 
LSF 2013 more than a year ago: http://lwn.net/Articles/548091/) without 
rigorous design review and a demonstration of the solutions to all the 
hard corner cases it has
given in his e-mail related with the presentation of the latest version 
of the Tux3 file system (see [1]) and read the linked article, I found 
in the second comments:
Parts of this almost sound like it either a.) overlaps with or b.) 
would benefit greatly from something similar to Featherstitch [[2]].


Could it be that we have with Featherstitch a general solution already 
that is said to be even file system agnostic?
Honestly, I 

Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-02 Thread Daniel Phillips

On 06/01/2014 08:15 PM, Dave Chinner wrote:
 On Sun, Jun 01, 2014 at 02:41:02PM -0700, I wrote:
 +
 +/*
 + * Add inode to writeback dirty list with current time.
 + */
 +void inode_writeback_touch(struct inode *inode)
 +{
 +struct backing_dev_info *bdi = inode-i_sb-s_bdi;
 +spin_lock(bdi-wb.list_lock);
 +inode-dirtied_when = jiffies;
 +list_move(inode-i_wb_list, bdi-wb.b_dirty);
 +spin_unlock(bdi-wb.list_lock);
 +}
 +EXPORT_SYMBOL_GPL(inode_writeback_touch);
 You should be able to use redirty_tail() for this

Redirty_tail nearly works, but if (!list_empty(wb-b_dirty)) is not correct 
because the inode
needs to end up on the dirty list whether it was already there or not. This 
requirement is analogous
to __mark_inode_dirty() and must tolerate similar races. At the 
microoptimization level, calling
redirty_tail from inode_writeback_touch would be less efficient and more bulky. 
Another small issue
is, redirty_tail does not always update the timestamp, which could trigger some 
bogus writeback events.

 H - this is using the wb dirty lists and locks, but you
 don't pass the wb structure to the writeback callout? That seem
 wrong to me - why would you bother manipulating these lists if you
 aren't using them to track dirty inodes in the first place?

From Tux3's point of view, the wb lists just allow fs-writeback to determine 
when to call
-writeback(). We agree that inode lists are a suboptimal mechanism, but that 
is how fs-writeback
currently thinks. It would be better if our inodes never go onto wb lists in 
the first place,
provided that fs-writeback can still drive writeback appropriately.

Perhaps fs-writeback should have an option to work without inode lists at all, 
and just maintain
writeback scheduling statistics in the superblock or similar. That would be a 
big change, more on
the experimental side. We would be happy to take it on after merge, hopefully 
for the benefit of
other filesystems besides Tux3.

What we pass to -writeback() is just a matter of taste at the moment because 
we currently ignore
everything except work-nr_pages. Anything sane is fine. Note that 
bdi_writeback is already
available from bdi-wb, the default writeback info, whatever that means. A 
quick tour of existing
usage suggests that you can reliably obtain the wb that way, but a complete 
audit would be needed.

Most probably, this API will evolve as new users arrive, and also as our Tux3 
usage becomes more
sophisticated. For now, Tux3 just flushes everything without asking questions. 
Exactly how that
might change in the future is hard to predict. You are in a better position to 
know what XFS would
require in order to use this interface.

 The first thing that __writeback_sb_inodes() does is create a struct
 writeback_control from the wb_writeback_work. That should be done
 here and passed to __writeback_sb_inodes(), which should be renamed
 generic_writeback_sb_inodes().  Also, all the fields in the wbc
 need to be initialised correctly (i.e including range_start/end).

Good point. I will try that generic_writeback_sb_inode concept for the next 
iteration, which will
need a day or two including regression testing.

 Further, a writeback implementation will need to use the generic bdi
 list and lock structures and so we need to pass the bdi_writeback.
 Similarly, if we are going to pass nr_pages, we might as well pass
 the entire work structure. 

 Finally, I don't like the way the wb-list_lock is treated
 differently by this code. I suspect that we need to rationalise the
 layering of the wb-list_lock as it is currently not clear what it
 protects and what (nested) layers of the writeback code actually
 require it.

One design goal of this proposed writeback interface is to hide the wb list 
lock entirely from the
filesystem so core writeback can evolve more easily. This is not cast in stone, 
but it smells like
decent factoring. We can save some spinlocks by violating that factoring (as 
Hirofumi's original
hack did) at the cost of holding a potentially busy wb lock longer, which does 
not seem like a good
trade.

I agree that writeback list locking is murky, and fs-writeback is murky in 
general. We would like
Tux3 to be part of the solution, not the problem.

 What I'd like to see is this work:

 struct super_ops ... = {
 
   .writeback = generic_writeback_sb_inodes,
 

 And that means writeback_sb_inodes() would become:

 static long writeback_sb_inodes(struct super_block *sb,
   struct bdi_writeback *wb,
   struct wb_writeback_work *work)
 {
   struct writeback_control wbc = {
   .sync_mode  = work-sync_mode,
   .tagged_writepages  = work-tagged_writepages,
   .for_kupdate= work-for_kupdate,
   .for_background = work-for_background,
   .for_sync   = work-for_sync,
   .range_cyclic   = 

Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-02 Thread Dave Chinner

[ please line wrap at something sane like 68 columns ]

On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
 On 06/01/2014 08:15 PM, Dave Chinner wrote:
  On Sun, Jun 01, 2014 at 02:41:02PM -0700, I wrote:
  +
  +/*
  + * Add inode to writeback dirty list with current time.
  + */
  +void inode_writeback_touch(struct inode *inode)
  +{
  +  struct backing_dev_info *bdi = inode-i_sb-s_bdi;
  +  spin_lock(bdi-wb.list_lock);
  +  inode-dirtied_when = jiffies;
  +  list_move(inode-i_wb_list, bdi-wb.b_dirty);
  +  spin_unlock(bdi-wb.list_lock);
  +}
  +EXPORT_SYMBOL_GPL(inode_writeback_touch);
  You should be able to use redirty_tail() for this
 
 Redirty_tail nearly works, but if (!list_empty(wb-b_dirty)) is
 not correct because the inode needs to end up on the dirty list
 whether it was already there or not.

redirty_tail() always moves the inode to the end of the dirty
list.

 This requirement is analogous to __mark_inode_dirty() and must
 tolerate similar races. At the microoptimization level, calling
 redirty_tail from inode_writeback_touch would be less efficient
 and more bulky. Another small issue is, redirty_tail does not
 always update the timestamp, which could trigger some bogus
 writeback events.

redirty_tail does not update the timestamp when it doesn't need to
change. If it needs to be changed because the current value would
violate the time ordering requirements of the list, it rewrites it.

So there is essentially no functional difference between the new
function and redirty_tail

  H - this is using the wb dirty lists and locks, but you
  don't pass the wb structure to the writeback callout? That seem
  wrong to me - why would you bother manipulating these lists if
  you aren't using them to track dirty inodes in the first place?
 
 From Tux3's point of view, the wb lists just allow fs-writeback to
 determine when to call -writeback(). We agree that inode lists
 are a suboptimal mechanism, but that is how fs-writeback currently
 thinks. It would be better if our inodes never go onto wb lists in
 the first place, provided that fs-writeback can still drive
 writeback appropriately.

It can't, and definitely not with the callout you added.

However, we already avoid the VFS writeback lists for certain
filesystems for pure metadata. e.g. XFS does not use the VFS dirty
inode lists for inode metadata changes.  They get tracked internally
by the transaction subsystem which does it's own writeback according
to the requirements of journal space availability.

This is done simply by not calling mark_inode_dirty() on any
metadata only change. If we want to do the same for data, then we'd
simply not call mark_inode_dirty() in the data IO path. That
requires a custom -set_page_dirty method to be provided by the
filesystem that didn't call

__mark_inode_dirty(mapping-host, I_DIRTY_PAGES);

and instead did it's own thing.

So the per-superblock dirty tracking is something we can do right
now, and some filesystems do it for metadata. The missing piece for
data is writeback infrastructure capable of deferring to superblocks
for writeback rather than BDIs

 Perhaps fs-writeback should have an option to work without inode
 lists at all, and just maintain writeback scheduling statistics in
 the superblock or similar. That would be a big change, more on the
 experimental side. We would be happy to take it on after merge,
 hopefully for the benefit of other filesystems besides Tux3.

Well, I don't see it that way. If you have a requirement to be able
to track dirty inodes internally, then let's move to implementing
that infrastructure rather than hacking around with callouts that
only have a limited shelf-life.

 What we pass to ->writeback() is just a matter of taste at the
 moment because we currently ignore everything except
 work->nr_pages. Anything sane is fine. Note that bdi_writeback is
 already available from bdi->wb, the default writeback info,
 whatever that means. A quick tour of existing usage suggests that
 you can reliably obtain the wb that way, but a complete audit
 would be needed.
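
 For the current one-wb-per-bdi layout, that amounts to something
 like this inside the callback:

	struct backing_dev_info *bdi = sb->s_bdi;
	struct bdi_writeback *wb = &bdi->wb;	/* the bdi's default writeback state */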
 
 Most probably, this API will evolve as new users arrive, and also
 as our Tux3 usage becomes more sophisticated. For now, Tux3 just
 flushes everything without asking questions. Exactly how that
 might change in the future is hard to predict. You are in a better
 position to know what XFS would require in order to use this
 interface.

XFS already has everything it needs internally to track dirty
inodes. In fact, we used to do data writeback from within XFS and we
slowly removed it as the generic writeback code was improved and made
the XFS code redundant.

That said, parallelising writeback so we can support hundreds of
GB/s of delayed allocation based writeback is something we
eventually need to do, and that pretty much means we need to bring
dirty data inode tracking back into XFS.

So what we really need is writeback infrastructure that is:

a) independent of the dirty 

Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-02 Thread Dave Chinner
On Mon, Jun 02, 2014 at 10:30:07AM +0200, Christian Stroetmann wrote:
 When I followed the advice of Dave Chinner:
 "We're not going to merge that page forking stuff (like you were
 told at LSF 2013 more than a year ago:
 http://lwn.net/Articles/548091/) without rigorous design review and
 a demonstration of the solutions to all the hard corner cases it
 has"
 given in his e-mail related to the presentation of the latest
 version of the Tux3 file system (see [1]) and read the linked
 article, I found in the second comment:
 "Parts of this almost sound like it either a.) overlaps with or b.)
 would benefit greatly from something similar to Featherstitch
 [[2]]."
 
 Could it be that we have with Featherstitch a general solution
 already that is said to be even file system agnostic?
 Honestly, I thought that something like this would make its way into
 the Linux code base.

Here's what I said about the last proposal (a few months ago) for
integrating featherstitch into the kernel:

http://www.spinics.net/lists/linux-fsdevel/msg72799.html

It's not a viable solution.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-02 Thread Christian Stroetmann

On the 3rd of June 2014 05:39, Dave Chinner wrote:

On Mon, Jun 02, 2014 at 10:30:07AM +0200, Christian Stroetmann wrote:

When I followed the advice of Dave Chinner:
We're not going to merge that page forking stuff (like you were
told at LSF 2013 more than a year ago:
http://lwn.net/Articles/548091/) without rigorous design review and
a demonstration of the solutions to all the hard corner cases it
has
given in his e-mail related with the presentation of the latest
version of the Tux3 file system (see [1]) and read the linked
article, I found in the second comments:
Parts of this almost sound like it either a.) overlaps with or b.)
would benefit greatly from something similar to Featherstitch
[[2]].

Could it be that we have with Featherstitch a general solution
already that is said to be even file system agnostic?
Honestly, I thought that something like this would make its way into
the Linux code base.

Here's what I said about the last proposal (a few months ago) for
integrating featherstitch into the kernel:

http://www.spinics.net/lists/linux-fsdevel/msg72799.html

It's not a viable solution.

Cheers,

Dave.


How annoying, I did not remember your e-mail in the referenced thread 
"[Lsf-pc] [LSF/MM TOPIC] atomic block device", even though I saved it on 
local disk. Thanks a lot for the reminder.


I also directly saw the problem with the research prototype 
Featherstitch, specifically the point "All the filesystem modules it has 
are built into the featherstitch kernel module, and called through a VFS 
shim layer." But it is just a prototype, and its concept of abstraction 
does not have to be copied 1:1 into the Linux code base.


In general, I do not believe that the complexity problems of soft 
updates, atomic writes, and related techniques can be solved by 
hand/manually. So my suggestion is to handle the complexity problem of 
e.g. dependencies automatically, in a way comparable to an (on-the-fly) 
file-system compiler, so to speak, that works on a very large dependency 
graph (with several billion vertices). And at this point an abstraction 
like the one given with Featherstitch helps to feed and control this 
special FS compiler.


Actually, I have to follow the discussion further on the one hand and go 
deeper into the highly complex problem space on the other hand.




With all the best
Christian


Re: [RFC][PATCH 1/2] Add a super operation for writeback

2014-06-01 Thread Dave Chinner
On Sun, Jun 01, 2014 at 02:41:02PM -0700, Daniel Phillips wrote:
> ---
> From: Daniel Phillips 
> Subject: [PATCH] Add a super operation for writeback
> 
> Add a "writeback" super operation to be called in the
> form:
> 
> progress = sb->s_op->writeback(sb, &wbc, &pages);
> 
> The filesystem is expected to flush some inodes to disk
> and return progress of at least 1, or if no inodes are
> flushed, return progress of zero. The filesystem should
> try to flush at least the number of pages specified in
> *pages, or if that is not possible, return approximately
> the number of pages not flushed into *pages.
> 
> Within the ->writeback callback, the filesystem should
> call inode_writeback_done(inode) for each inode flushed
> (and therefore set clean) or inode_writeback_touch(inode)
> for any inode that will be retained dirty in cache.
> 
> Signed-off-by: Daniel Phillips  
> Signed-off-by: OGAWA Hirofumi 
> ---
> 
>  fs/fs-writeback.c  |   59 
> +---
>  include/linux/fs.h |4 +++
>  2 files changed, 60 insertions(+), 3 deletions(-)
> 
> diff -puN fs/fs-writeback.c~core-writeback fs/fs-writeback.c
> --- linux-tux3/fs/fs-writeback.c~core-writeback   2014-05-31 
> 06:43:19.416031712 +0900
> +++ linux-tux3-hirofumi/fs/fs-writeback.c 2014-05-31 06:44:46.087904373 
> +0900
> @@ -192,6 +192,35 @@ void inode_wb_list_del(struct inode *ino
>  }
> 
>  /*
> + * Remove inode from writeback list if clean.
> + */
> +void inode_writeback_done(struct inode *inode)
> +{
> + struct backing_dev_info *bdi = inode_to_bdi(inode);
> +
> + spin_lock(&bdi->wb.list_lock);
> + spin_lock(&inode->i_lock);
> + if (!(inode->i_state & I_DIRTY))
> + list_del_init(&inode->i_wb_list);
> + spin_unlock(&inode->i_lock);
> + spin_unlock(&bdi->wb.list_lock);
> +}
> +EXPORT_SYMBOL_GPL(inode_writeback_done);
> +
> +/*
> + * Add inode to writeback dirty list with current time.
> + */
> +void inode_writeback_touch(struct inode *inode)
> +{
> + struct backing_dev_info *bdi = inode->i_sb->s_bdi;
> + spin_lock(&bdi->wb.list_lock);
> + inode->dirtied_when = jiffies;
> + list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> + spin_unlock(&bdi->wb.list_lock);
> +}
> +EXPORT_SYMBOL_GPL(inode_writeback_touch);

You should be able to use redirty_tail() for this

Hmmm - this is using the wb dirty lists and locks, but you
don't pass the wb structure to the writeback callout? That seems
wrong to me - why would you bother manipulating these lists if you
aren't using them to track dirty inodes in the first place?

> +
> +/*
>   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
>   * furthest end of its superblock's dirty-inode list.
>   *
> @@ -593,9 +622,9 @@ static long writeback_chunk_size(struct
>   *
>   * Return the number of pages and/or inodes written.
>   */
> -static long writeback_sb_inodes(struct super_block *sb,
> - struct bdi_writeback *wb,
> - struct wb_writeback_work *work)
> +static long __writeback_sb_inodes(struct super_block *sb,
> +   struct bdi_writeback *wb,
> +   struct wb_writeback_work *work)
>  {
>   struct writeback_control wbc = {
>   .sync_mode  = work->sync_mode,
> @@ -710,6 +739,30 @@ static long writeback_sb_inodes(struct s
>   return wrote;
>  }
> 
> +static long writeback_sb_inodes(struct super_block *sb,
> + struct bdi_writeback *wb,
> + struct wb_writeback_work *work)
> +{
> + if (sb->s_op->writeback) {
> + struct writeback_control wbc = {
> + .sync_mode  = work->sync_mode,
> + .tagged_writepages  = work->tagged_writepages,
> + .for_kupdate= work->for_kupdate,
> + .for_background = work->for_background,
> + .for_sync   = work->for_sync,
> + .range_cyclic   = work->range_cyclic,
> + };
> + long ret;
> +
> + spin_unlock(&wb->list_lock);
> + ret = sb->s_op->writeback(sb, &wbc, &work->nr_pages);
> + spin_lock(&wb->list_lock);
> + return ret;
> + }
> +
> + return __writeback_sb_inodes(sb, wb, work);
> +}

The first thing that __writeback_sb_inodes() does is create a struct
writeback_control from the wb_writeback_work. That should be done
here and passed to __writeback_sb_inodes(), which should be renamed
"generic_writeback_sb_inodes()".  Also, all the fields in the wbc
need to be initialised correctly (i.e including range_start/end).

Further, a writeback implementation will need to use the generic bdi
list and lock structures and so we need to pass the bdi_writeback.
Similarly, if we are going to pass nr_pages, we might as well pass
the entire work structure. 
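
Something along these lines, roughly (a sketch only - the names and
the exact callout signature are illustrative, with
generic_writeback_sb_inodes() being the renamed __writeback_sb_inodes()):

static long writeback_sb_inodes(struct super_block *sb,
				struct bdi_writeback *wb,
				struct wb_writeback_work *work)
{
	struct writeback_control wbc = {
		.sync_mode		= work->sync_mode,
		.tagged_writepages	= work->tagged_writepages,
		.for_kupdate		= work->for_kupdate,
		.for_background		= work->for_background,
		.for_sync		= work->for_sync,
		.range_cyclic		= work->range_cyclic,
		.range_start		= 0,
		.range_end		= LLONG_MAX,
	};

	if (sb->s_op->writeback) {
		long ret;

		spin_unlock(&wb->list_lock);
		ret = sb->s_op->writeback(sb, wb, &wbc, work);
		spin_lock(&wb->list_lock);
		return ret;
	}

	return generic_writeback_sb_inodes(sb, wb, &wbc, work);
}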

Finally, I don't like the way the wb->list_lock is treated

[RFC][PATCH 1/2] Add a super operation for writeback

2014-06-01 Thread Daniel Phillips
Hi,

This is the first of four core changes we would like for Tux3. We
start with a hard one and suggest a simple solution.

The first patch in this series adds a new super operation to write
back multiple inodes in a single call. The second patch applies to
our linux-tux3 repository at git.kernel.org to demonstrate how this
interface is used, and removes about 450 lines of workaround code.

Traditionally, core kernel tells each mounted filesystem which
dirty pages of which inodes should be flushed to disk, but
unfortunately, is blissfully ignorant of filesystem-specific
ordering constraints. This scheme was really invented for Ext2 and
has not evolved much recently. Tux3, with its strong ordering and
optimized delta commit, cannot tolerate random flushing and
therefore takes responsibility for flush ordering itself. On the
other hand, Tux3 has no good way to know when is the right time to
flush, but core is good at that. This proposed API harmonizes those
two competencies so that Tux3 and core each take care of what they
are good at, and not more.

The API extension consists of a new writeback hook and two helpers
to be used within the hook. The hook sits about halfway down the
fs-writeback stack, just after core has determined that some dirty
inodes should be flushed to disk and just before it starts thinking
about which inodes those should be. At that point, core calls Tux3
instead of continuing on down the usual do_writepages path. Tux3
responds by staging a new delta commit, using the new helpers to
tell core which inodes were flushed versus being left dirty in
cache. This is pretty much the same behavior as the traditional
writeout path, but less convoluted, probably more efficient, and
certainly easier to analyze.

The new writeback hook looks like:

progress = sb->s_op->writeback(sb, &writeback_control, &nr_pages);

This should be self-explanatory: nr_pages and progress have the
semantics of existing usage in fs-writeback; writeback_control is
ignored by Tux3, but that is only because Tux3 always flushes
everything and does not require hints for now. We can safely assume
that &wbc or equivalent is wanted here. An obvious wart is the
overlap between "progress" and "nr_pages", but fs-writeback thinks
that way, so it would not make much sense to improve one without
improving the other.

Tux3 necessarily keeps its own dirty inode list, which is an area
of overlap with fs-writeback. In a perfect world, there would be
just one dirty inode list per superblock, on which both fs-writeback
and Tux3 would operate. That would be a deeper core change than
seems appropriate right now.

Potential races are a big issue with this API, which is no surprise.
The fs-writeback scheme requires keeping several kinds of object in
sync: tux3 dirty inode lists, fs-writeback dirty inode lists and
inode dirty state. The new helpers inode_writeback_done(inode) and
inode_writeback_touch(inode) take care of that while hiding
internal details of the fs-writeback implementation.

Tux3 calls inode_writeback_done when it has flushed an inode and
marked it clean, or calls inode_writeback_touch if it intends to
retain a dirty inode in cache. These have simple implementations.
The former just removes a clean inode from any fs-writeback list.
The latter updates the inode's dirty timestamp so that fs-writeback
does not keep trying to flush it. Both these things could be done more
efficiently by re-engineering fs-writeback, but we prefer to work
with the existing scheme for now.
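
To make the intended call pattern concrete, a filesystem-side
implementation of the hook might look roughly like the sketch below,
assuming a prototype of the form int (*writeback)(struct super_block *,
struct writeback_control *, long *). Here myfs_sb_info, myfs_inode_info,
myfs_flush_inode() and the fs-private dirty list are hypothetical
stand-ins, not part of the patch:

static int myfs_writeback(struct super_block *sb,
			  struct writeback_control *wbc, long *nr_pages)
{
	struct myfs_sb_info *sbi = sb->s_fs_info;
	struct myfs_inode_info *mi, *next;
	long want = *nr_pages, done = 0;
	int progress = 0;

	list_for_each_entry_safe(mi, next, &sbi->dirty_inodes, dirty_list) {
		struct inode *inode = &mi->vfs_inode;

		done += myfs_flush_inode(inode);  /* writes pages, marks inode clean */
		inode_writeback_done(inode);      /* drop it from the wb dirty list */
		progress = 1;
		if (done >= want)
			break;
	}

	/* Inodes we chose to keep dirty should not be flushed again soon. */
	list_for_each_entry(mi, &sbi->dirty_inodes, dirty_list)
		inode_writeback_touch(&mi->vfs_inode);

	*nr_pages = done < want ? want - done : 0;
	return progress;
}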

Hirofumi's masterful hack nicely avoided racy removal of inodes from
the writeback list by taking an internal fs-writeback lock inside
filesystem code. The new helper requires dropping i_lock inside
filesystem code and retaking it in the helper, so inode redirty can
race with writeback list removal. This does not seem to present a
problem because filesystem code is able to enforce strict
alternation of cleaning and calling the helper. As an offsetting
advantage, writeback lock contention is reduced.

Compared to Hirofumi's hack, the cost of this interface is one
additional spinlock per inode_writeback_done, which is
insignificant compared to the convoluted code path that is avoided.

Regards,

Daniel

---
From: Daniel Phillips 
Subject: [PATCH] Add a super operation for writeback

Add a "writeback" super operation to be called in the
form:

progress = sb->s_op->writeback(sb, &wbc, &pages);

The filesystem is expected to flush some inodes to disk
and return progress of at least 1, or if no inodes are
flushed, return progress of zero. The filesystem should
try to flush at least the number of pages specified in
*pages, or if that is not possible, return approximately
the number of pages not flushed into *pages.

Within the ->writeback callback, the filesystem should
call inode_writeback_done(inode) for each inode flushed
(and therefore set clean) or inode_writeback_touch(inode)
for any inode that will be retained dirty in cache.

Signed-off-by: Daniel Phillips dan...@tux3.org
Signed-off-by: OGAWA Hirofumi hirof...@mail.parknet.co.jp
