Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tue 03-06-14 15:37:39, Daniel Phillips wrote:
> On Tuesday, June 3, 2014 8:21:55 AM PDT, Jan Kara wrote:
> > On Tue 03-06-14 07:14:44, Christoph Hellwig wrote:
> > > On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
> > And I agree we went for per-bdi flushing to avoid two threads
> > congesting a single device leading to suboptimal IO patterns during
> > background writeback.
>
> A proposal is on the table to implement s_ops->writeback() as a per-sb
> operation in such a way that nothing changes in the current per-inode
> path. Good or bad approach?

Having s_ops->writeback() is fine. But I just hate how you hook that
callback into the writeback code (and Dave has expressed similar
concerns). The way you hook it up, the filesystem still has to keep one
inode in the dirty list so that __writeback_inodes_wb() can find that
inode, get the superblock from it, and ask for writeback to be done via
the callback. That's really ugly and a no-go for me.

> > So currently I'm convinced we want to go for per-sb dirty tracking.
> > That also makes some speedups in that code noticeably simpler. I'm not
> > convinced about the per-sb flushing thread - if we don't regress the
> > multiple sb on bdi case when we just let the threads from different
> > superblocks contend for IO, then that would be a natural thing to do.
> > But once we have to introduce some synchronization between threads to
> > avoid regressions, I think it might be easier to just stay with a
> > per-bdi thread which switches between superblocks.
>
> Could you elaborate on the means of switching between superblocks? Do
> you mean a new fs-writeback path just for data=journal class
> filesystems, or are you suggesting changing the way all filesystems
> are driven?

So I suggest changing the way all filesystems are driven to a per-sb one.
That makes sense for other reasons as well and allows clean incorporation
of your writeback callback (either a fs will use the generic callback,
which would do what the generic writeback code does now, only with per-sb
dirty lists instead of the current per-bdi ones, or the fs can do its own
stuff). I can write something (the switch isn't really that complex) but
it will need quite some testing to verify we don't regress somewhere...

								Honza
-- 
Jan Kara
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tuesday, June 3, 2014 8:21:55 AM PDT, Jan Kara wrote:
> On Tue 03-06-14 07:14:44, Christoph Hellwig wrote:
> > On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
> ...
> So I agree per-bdi / per-sb matters only in simple setups but machines
> with a single rotating disk with several partitions and without LVM
> aren't that rare AFAICT from my experience.

Retribution is sure to be swift, terrible and eternal for anyone who
dares to break those.

> And I agree we went for per-bdi flushing to avoid two threads
> congesting a single device leading to suboptimal IO patterns during
> background writeback.

A proposal is on the table to implement s_ops->writeback() as a per-sb
operation in such a way that nothing changes in the current per-inode
path. Good or bad approach?

> So currently I'm convinced we want to go for per-sb dirty tracking.
> That also makes some speedups in that code noticeably simpler. I'm not
> convinced about the per-sb flushing thread - if we don't regress the
> multiple sb on bdi case when we just let the threads from different
> superblocks contend for IO, then that would be a natural thing to do.
> But once we have to introduce some synchronization between threads to
> avoid regressions, I think it might be easier to just stay with a
> per-bdi thread which switches between superblocks.

Could you elaborate on the means of switching between superblocks? Do
you mean a new fs-writeback path just for data=journal class
filesystems, or are you suggesting changing the way all filesystems are
driven?

Regards,

Daniel
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On the 3rd of June 2014 16:57, Theodore Ts'o wrote:
> On Tue, Jun 03, 2014 at 07:30:32AM +0200, Christian Stroetmann wrote:
> > In general, I do not believe that the complexity problems of soft
> > updates, atomic writes, and related techniques can be solved by
> > hand/manually. So my suggestion is to automatically handle the
> > complexity problem of e.g. dependencies in a way that is comparable
> > to a(n on-the-fly) file-system compiler, so to say, that works on a
> > very large dependency graph (having several billions of graph
> > vertices actually). And at this point an abstraction like the one
> > given with Featherstitch helps to feed and control this special FS
> > compiler.
>
> Well, if you want to try to implement something like this, go for it!

I have already been active for some weeks.

> I'd be very curious to see (a) how much CPU overhead it takes to
> crunch on a dependency graph with billions of vertices, and (b) how
> easy it is to express these dependencies and how maintainable such a
> dependency language would be. Sounds like a great research topic, and

To (a): A run is expected to take a few hours on a single computing
node. Also, such graph processing must not be done all the time, but
only if a new application demands a specific handling of the data with
respect to e.g. one of the ACID criteria. That is the reason why I put
"on-the-fly" in parentheses.

To (b): I hoped that file system developers could make some suggestions
or point to some no-gos. I am also thinking about Petri nets in this
relation, though it is just an idea.

I would also like to mention that it could be used in conjunction with
Non-Volatile Memory (NVM) as well.

> I'll note the Call For Papers for FAST 2015 is out, and if you can
> solve these problems, it would make a great FAST 2015 submission:
>
>     https://www.usenix.org/conference/fast15/call-for-papers

Are you serious, or have I missed the 1st of April once again?
Actually, I could only write a general overview about the approach,
comparable to a white paper, but nothing more.

> Cheers,
>
> - Ted

Best regards
Christian
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tue 03-06-14 07:14:44, Christoph Hellwig wrote:
> On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
> > So we currently flush inodes in first-dirtied, first-written-back
> > order when the superblock is not specified in the writeback work.
> > That completely ignores which superblock an inode belongs to, but I
> > don't see per-sb fairness actually making any sense when
> > 1) flushing old data (to keep the promise set in dirty_expire_centisecs)
> > 2) flushing data to reduce the number of dirty pages
> > And these are really the only two cases where we don't do per-sb
> > flushing.
> >
> > Now when filesystems want to do something more clever (and I can see
> > reasons for that e.g. when journalling metadata, even more so when
> > journalling data) I agree we need to somehow implement the above two
> > types of writeback using per-sb flushing. Type 1) is actually pretty
> > easy - just tell each sb to write back dirty data up to time T.
> > Type 2) is more difficult because that is a more open-ended task -
> > it seems similar to what shrinkers do, but it would require us to
> > track the per-sb amount of dirty pages / inodes, and I'm not sure we
> > want to add even more page counting statistics... Especially since
> > often bdi == fs. Thoughts?
>
> Honestly I think doing per-bdi writeback has been a major mistake. As
> you said it only even matters when we have filesystems on multiple
> partitions on a single device, and even then only in a simple setup,
> as soon as we use LVM or btrfs this sort of sharing stops happening
> anyway. I don't even see much of a benefit except that we prevent
> two flushing daemons from congesting a single device for that special
> case of multiple filesystems on partitions of the same device, and
> that could be solved in other ways.

So I agree per-bdi / per-sb matters only in simple setups, but machines
with a single rotating disk with several partitions and without LVM
aren't that rare AFAICT from my experience.

And I agree we went for per-bdi flushing to avoid two threads congesting
a single device, leading to suboptimal IO patterns during background
writeback.

So currently I'm convinced we want to go for per-sb dirty tracking. That
also makes some speedups in that code noticeably simpler. I'm not
convinced about the per-sb flushing thread - if we don't regress the
multiple-sb-on-one-bdi case when we just let the threads from different
superblocks contend for IO, then that would be a natural thing to do.
But once we have to introduce some synchronization between threads to
avoid regressions, I think it might be easier to just stay with a
per-bdi thread which switches between superblocks.

								Honza
-- 
Jan Kara
SUSE Labs, CR
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tue, Jun 03, 2014 at 07:30:32AM +0200, Christian Stroetmann wrote:
> In general, I do not believe that the complexity problems of soft
> updates, atomic writes, and related techniques can be solved by
> hand/manually. So my suggestion is to automatically handle the
> complexity problem of e.g. dependencies in a way that is comparable to
> a(n on-the-fly) file-system compiler, so to say, that works on a very
> large dependency graph (having several billions of graph vertices
> actually). And at this point an abstraction like the one given with
> Featherstitch helps to feed and control this special FS compiler.

Well, if you want to try to implement something like this, go for it!
I'd be very curious to see (a) how much CPU overhead it takes to crunch
on a dependency graph with billions of vertices, and (b) how easy it is
to express these dependencies and how maintainable such a dependency
language would be. Sounds like a great research topic, and I'll note the
Call For Papers for FAST 2015 is out, and if you can solve these
problems, it would make a great FAST 2015 submission:

    https://www.usenix.org/conference/fast15/call-for-papers

Cheers,

					- Ted
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tue, Jun 03, 2014 at 07:14:44AM -0700, Christoph Hellwig wrote:
> Honestly I think doing per-bdi writeback has been a major mistake. As
> you said it only even matters when we have filesystems on multiple
> partitions on a single device, and even then only in a simple setup,
> as soon as we use LVM or btrfs this sort of sharing stops happening
> anyway. I don't even see much of a benefit except that we prevent
> two flushing daemons from congesting a single device for that special
> case of multiple filesystems on partitions of the same device, and
> that could be solved in other ways.

To be fair, back when per-bdi writeback was introduced, having multiple
partitions on a single disk was far more common, and the use of LVM was
much less common. These days, with many more systems using one big root
filesystem, and/or using flash where parallel writes can actually be a
good thing (since there isn't a single disk head which has to seek all
over the HDD), the case for keeping per-bdi writeback is much weaker, if
not non-existent.

Cheers,

					- Ted
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
> So we currently flush inodes in first-dirtied, first-written-back
> order when the superblock is not specified in the writeback work. That
> completely ignores which superblock an inode belongs to, but I don't
> see per-sb fairness actually making any sense when
> 1) flushing old data (to keep the promise set in dirty_expire_centisecs)
> 2) flushing data to reduce the number of dirty pages
> And these are really the only two cases where we don't do per-sb
> flushing.
>
> Now when filesystems want to do something more clever (and I can see
> reasons for that e.g. when journalling metadata, even more so when
> journalling data) I agree we need to somehow implement the above two
> types of writeback using per-sb flushing. Type 1) is actually pretty
> easy - just tell each sb to write back dirty data up to time T.
> Type 2) is more difficult because that is a more open-ended task - it
> seems similar to what shrinkers do, but it would require us to track
> the per-sb amount of dirty pages / inodes, and I'm not sure we want to
> add even more page counting statistics... Especially since often
> bdi == fs. Thoughts?

Honestly I think doing per-bdi writeback has been a major mistake. As
you said, it only even matters when we have filesystems on multiple
partitions on a single device, and even then only in a simple setup; as
soon as we use LVM or btrfs this sort of sharing stops happening anyway.
I don't even see much of a benefit except that we prevent two flushing
daemons from congesting a single device for that special case of
multiple filesystems on partitions of the same device, and that could be
solved in other ways.

The major benefit of per-bdi writeback was that for the usual case of
one filesystem per device we get exactly one flusher thread per
filesystem instead of multiple competing ones, but per-sb writeback
would solve that just as well.
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tue 03-06-14 17:52:09, Dave Chinner wrote:
> On Tue, Jun 03, 2014 at 12:01:11AM -0700, Daniel Phillips wrote:
> > > However, we already avoid the VFS writeback lists for certain
> > > filesystems for pure metadata. e.g. XFS does not use the VFS dirty
> > > inode lists for inode metadata changes. They get tracked internally
> > > by the transaction subsystem which does its own writeback according
> > > to the requirements of journal space availability.
> > >
> > > This is done simply by not calling mark_inode_dirty() on any
> > > metadata-only change. If we want to do the same for data, then we'd
> > > simply not call mark_inode_dirty() in the data IO path. That
> > > requires a custom ->set_page_dirty method to be provided by the
> > > filesystem that didn't call
> > >
> > > 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > >
> > > and instead did its own thing.
> > >
> > > So the per-superblock dirty tracking is something we can do right
> > > now, and some filesystems do it for metadata. The missing piece
> > > for data is writeback infrastructure capable of deferring to
> > > superblocks for writeback rather than BDIs.
> >
> > We agree that fs-writeback inode lists are broken for anything
> > more sophisticated than Ext2.
>
> No, I did not say that.
>
> I said XFS does something different for metadata changes because it
> has different flushing constraints and requirements than the generic
> code provides. That does not make the generic code broken.
>
> > An advantage of the patch under consideration is that it still lets
> > fs-writeback mostly work the way it has worked for the last few
> > years, except for not allowing it to pick specific inodes and data
> > pages for writeout. As far as I can see, it still balances writeout
> > between different filesystems on the same block device pretty well.
>
> Not really. If there are 3 superblocks on a BDI, and the dirty inode
> list iterates between 2 of them with lots of dirty inodes, it can
> starve writeback from the third until one of its dirty inodes pops
> to the head of the b_io list. So it's inherently unfair from that
> perspective.
>
> Changing the high level flushing to be per-superblock rather than
> per-BDI would enable us to control that behaviour and be much fairer
> to all the superblocks on a given BDI. That said, I don't really
> care that much about this case...

So we currently flush inodes in first-dirtied, first-written-back order
when the superblock is not specified in the writeback work. That
completely ignores which superblock an inode belongs to, but I don't see
per-sb fairness actually making any sense when
1) flushing old data (to keep the promise set in dirty_expire_centisecs)
2) flushing data to reduce the number of dirty pages
And these are really the only two cases where we don't do per-sb
flushing.

Now when filesystems want to do something more clever (and I can see
reasons for that e.g. when journalling metadata, even more so when
journalling data) I agree we need to somehow implement the above two
types of writeback using per-sb flushing. Type 1) is actually pretty
easy - just tell each sb to write back dirty data up to time T. Type 2)
is more difficult because that is a more open-ended task - it seems
similar to what shrinkers do, but it would require us to track the
per-sb amount of dirty pages / inodes, and I'm not sure we want to add
even more page counting statistics... Especially since often bdi == fs.
Thoughts?

								Honza
-- 
Jan Kara
SUSE Labs, CR
Re: [RFC][PATCH 1/2] Add a super operation for writeback
Dave Chinner writes:
> > It doesn't move the inode to the end of the dirty list if wb.b_dirty
> > is empty (i.e. it can move from wb.b_io to wb.b_dirty too).
>
> Um, really? What code are you reading? From 3.15-rc8:
>
> static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
> {
> 	assert_spin_locked(&wb->list_lock);
> 	if (!list_empty(&wb->b_dirty)) {
> 		struct inode *tail;
>
> 		tail = wb_inode(wb->b_dirty.next);
> 		if (time_before(inode->dirtied_when, tail->dirtied_when))
> 			inode->dirtied_when = jiffies;
> 	}
> 	list_move(&inode->i_wb_list, &wb->b_dirty);
> }
>
> The list_move() is not conditional at all and so the inode is
> *always* moved to the tail of wb->b_dirty.

Oh, you are right.

> > It has a difference.
> >
> > Say, tail->dirtied_when == 1, inode->dirtied_when == 2, and now == 30
> > (tail->dirtied_when is expired at 31 with the default config). In
> > this case, redirty_tail() doesn't update ->dirtied_when.
>
> OK, that's a different issue, and is actually handled by
> requeue_inode(), which is called to put inodes back on the correct
> dirty list when IO completes. I think that if you are going to use
> the wb dirty inode lists, you should probably use the existing
> functions to manage the inode lists appropriately rather than
> creating your own writeback list lifecycle.

See what __mark_inode_dirty() does. We can consolidate this with
__mark_inode_dirty() if you want. In our case, a new dirty while the
inode already has I_DIRTY still needs handling at some point, because of
data=journal behavior.

> If tux3 wants its own dirty inode list lifecycles, then that's
> where avoiding the wb lists completely is an appropriate design
> point. I don't want to hack little bits and pieces all over the
> writeback code to support what tux3 is doing right now if it's going
> to do something else real soon. When tux3 moves to use its own
> internal lists these new functions just have to be removed again, so
> let's skip the hack step we are at right now and go straight for
> supporting the "don't need the fs-writeback lists" infrastructure.

Was that agreed with anyone? There was bdflush, then pdflush, and now
bdi flush. Why was it changed to bdi flush? This patch respects the
intent of the change from pdflush to bdi flush, more or less.

Thanks.
-- 
OGAWA Hirofumi
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tue, Jun 03, 2014 at 04:47:52PM +0900, OGAWA Hirofumi wrote:
> Daniel Phillips writes:
>
> > Hi Dave,
> > On 06/02/2014 08:33 PM, Dave Chinner wrote:
> > > On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
> > > >
> > > > Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))"
> > > > is not correct because the inode needs to end up on the dirty
> > > > list whether it was already there or not.
> > > redirty_tail() always moves the inode to the end of the dirty
> > > list.
>
> It doesn't move the inode to the end of the dirty list if wb.b_dirty
> is empty (i.e. it can move from wb.b_io to wb.b_dirty too).

Um, really? What code are you reading? From 3.15-rc8:

static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
{
	assert_spin_locked(&wb->list_lock);
	if (!list_empty(&wb->b_dirty)) {
		struct inode *tail;

		tail = wb_inode(wb->b_dirty.next);
		if (time_before(inode->dirtied_when, tail->dirtied_when))
			inode->dirtied_when = jiffies;
	}
	list_move(&inode->i_wb_list, &wb->b_dirty);
}

The list_move() is not conditional at all and so the inode is
*always* moved to the tail of wb->b_dirty.

> It has a difference.
>
> Say, tail->dirtied_when == 1, inode->dirtied_when == 2, and now == 30
> (tail->dirtied_when is expired at 31 with the default config). In this
> case, redirty_tail() doesn't update ->dirtied_when.

OK, that's a different issue, and is actually handled by
requeue_inode(), which is called to put inodes back on the correct
dirty list when IO completes. I think that if you are going to use
the wb dirty inode lists, you should probably use the existing
functions to manage the inode lists appropriately rather than
creating your own writeback list lifecycle.

If tux3 wants its own dirty inode list lifecycles, then that's where
avoiding the wb lists completely is an appropriate design point. I
don't want to hack little bits and pieces all over the writeback code
to support what tux3 is doing right now if it's going to do something
else real soon. When tux3 moves to use its own internal lists these new
functions just have to be removed again, so let's skip the hack step we
are at right now and go straight for supporting the "don't need the
fs-writeback lists" infrastructure.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tue, Jun 03, 2014 at 12:01:11AM -0700, Daniel Phillips wrote:
> Hi Dave,
> On 06/02/2014 08:33 PM, Dave Chinner wrote:
> > On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
> > > > Hmmm - this is using the wb dirty lists and locks, but you
> > > > don't pass the wb structure to the writeback callout? That seems
> > > > wrong to me - why would you bother manipulating these lists if
> > > > you aren't using them to track dirty inodes in the first place?
> > > From Tux3's point of view, the wb lists just allow fs-writeback to
> > > determine when to call ->writeback(). We agree that inode lists
> > > are a suboptimal mechanism, but that is how fs-writeback currently
> > > thinks. It would be better if our inodes never went onto wb lists
> > > in the first place, provided that fs-writeback can still drive
> > > writeback appropriately.
> > It can't, and definitely not with the callout you added.
>
> Obviously, fs-writeback does not get to choose inodes or specific pages
> with our proposal (which is the whole point) but it still decides when
> to call Tux3 vs some other filesystem on the same device, and is still
> able to indicate how much data it thinks should be written out. Even
> though we ignore the latter suggestion and always flush everything, we
> appreciate the timing of those callbacks very much, because they give
> us exactly the pressure-sensitive batching behavior we want.

Which, again, points out that you want per-superblock flushing, not
per-bdi flushing, which is what the current writeback code does.

> > However, we already avoid the VFS writeback lists for certain
> > filesystems for pure metadata. e.g. XFS does not use the VFS dirty
> > inode lists for inode metadata changes. They get tracked internally
> > by the transaction subsystem which does its own writeback according
> > to the requirements of journal space availability.
> >
> > This is done simply by not calling mark_inode_dirty() on any
> > metadata-only change. If we want to do the same for data, then we'd
> > simply not call mark_inode_dirty() in the data IO path. That
> > requires a custom ->set_page_dirty method to be provided by the
> > filesystem that didn't call
> >
> > 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> >
> > and instead did its own thing.
> >
> > So the per-superblock dirty tracking is something we can do right
> > now, and some filesystems do it for metadata. The missing piece for
> > data is writeback infrastructure capable of deferring to superblocks
> > for writeback rather than BDIs.
>
> We agree that fs-writeback inode lists are broken for anything
> more sophisticated than Ext2.

No, I did not say that.

I said XFS does something different for metadata changes because it
has different flushing constraints and requirements than the generic
code provides. That does not make the generic code broken.

> An advantage of the patch under consideration is that it still lets
> fs-writeback mostly work the way it has worked for the last few years,
> except for not allowing it to pick specific inodes and data pages for
> writeout. As far as I can see, it still balances writeout between
> different filesystems on the same block device pretty well.

Not really. If there are 3 superblocks on a BDI, and the dirty inode
list iterates between 2 of them with lots of dirty inodes, it can
starve writeback from the third until one of its dirty inodes pops
to the head of the b_io list. So it's inherently unfair from that
perspective.

Changing the high level flushing to be per-superblock rather than
per-BDI would enable us to control that behaviour and be much fairer
to all the superblocks on a given BDI. That said, I don't really
care that much about this case...

> > > Perhaps fs-writeback should have an option to work without inode
> > > lists at all, and just maintain writeback scheduling statistics in
> > > the superblock or similar. That would be a big change, more on the
> > > experimental side. We would be happy to take it on after merge,
> > > hopefully for the benefit of other filesystems besides Tux3.
> > Well, I don't see it that way. If you have a requirement to be able
> > to track dirty inodes internally, then let's move to implementing
> > that infrastructure rather than hacking around with callouts that
> > only have a limited shelf-life.
>
> What you call shelf-life, I call iteration. Small changes are
> beautiful. Apologies for the rhetoric, content is below.

Ok, let me use stronger language: implementing something at the wrong
layer only to have to redo it at the correct layer a short while later
is wasted effort. Do it right the first time. I do believe I've
explained this to you at least once before, so let's not have to go
over this again.

> > XFS already has everything it needs internally to track dirty
> > inodes. In fact, we used to do data writeback from within XFS and we
> > slowly removed it as the generic writeback code was improved and
> > made the XFS code redundant.
Re: [RFC][PATCH 1/2] Add a super operation for writeback
Daniel Phillips writes:
> Hi Dave,
> On 06/02/2014 08:33 PM, Dave Chinner wrote:
> > On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
> > >
> > > Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is
> > > not correct because the inode needs to end up on the dirty list
> > > whether it was already there or not.
> > redirty_tail() always moves the inode to the end of the dirty
> > list.

It doesn't move the inode to the end of the dirty list if wb.b_dirty is
empty (i.e. it can move from wb.b_io to wb.b_dirty too). BTW, this is
called for things like the mark-inode-dirty path, not only the
writeback path.

> > > This requirement is analogous to __mark_inode_dirty() and must
> > > tolerate similar races. At the microoptimization level, calling
> > > redirty_tail from inode_writeback_touch would be less efficient
> > > and more bulky. Another small issue is, redirty_tail does not
> > > always update the timestamp, which could trigger some bogus
> > > writeback events.
> > redirty_tail does not update the timestamp when it doesn't need to
> > change. If it needs to be changed because the current value would
> > violate the time ordering requirements of the list, it rewrites it.
> >
> > So there is essentially no functional difference between the new
> > function and redirty_tail.
>
> Hirofumi, would you care to comment?

It has a difference.

Say, tail->dirtied_when == 1, inode->dirtied_when == 2, and now == 30
(tail->dirtied_when is expired at 31 with the default config). In this
case, redirty_tail() doesn't update ->dirtied_when. And if
inode->dirtied_when is not updated to 30, the expire time differs:
32 vs 60.

Thanks.
-- 
OGAWA Hirofumi
Re: [RFC][PATCH 1/2] Add a super operation for writeback
Hi Dave,

Here is a non-incremental patch. It implements your suggestion from
yesterday, except that the wb list lock is dropped before calling
->writeback().

Regards,

Daniel

From d030d328757b160b39b252e82811a94843513cfc Mon Sep 17 00:00:00 2001
From: Daniel Phillips
Date: Tue, 3 Jun 2014 00:19:11 -0700
Subject: [PATCH] Add a super operation for writeback

Add a "writeback" super operation to be called in the form:

    progress = s_op->writeback(sb, wb, work, wbc);

where sb is (struct super_block *), wb is (struct bdi_writeback *),
work is (struct wb_writeback_work *), and wbc is
(struct writeback_control *).

The filesystem is expected to flush some inodes to disk and return
progress of at least 1, or if no inodes are flushed, return progress
of zero. The filesystem should try to flush at least the number of
pages specified in work->nr_pages, or if that is not possible, return
approximately the number of pages that were not flushed in
work->nr_pages.

Within the ->writeback callback, the filesystem should call
inode_writeback_done(inode) for each inode flushed (and therefore set
clean), or inode_writeback_touch(inode) for any inode that will be
retained dirty in cache.
Signed-off-by: Daniel Phillips
---
 fs/fs-writeback.c         | 107 +-
 include/linux/fs.h        |   9 +++-
 include/linux/writeback.h |  19
 3 files changed, 95 insertions(+), 40 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index be568b7..98810bd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -34,25 +34,6 @@
  */
 #define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_CACHE_SHIFT - 10))
 
-/*
- * Passed into wb_writeback(), essentially a subset of writeback_control
- */
-struct wb_writeback_work {
-	long nr_pages;
-	struct super_block *sb;
-	unsigned long *older_than_this;
-	enum writeback_sync_modes sync_mode;
-	unsigned int tagged_writepages:1;
-	unsigned int for_kupdate:1;
-	unsigned int range_cyclic:1;
-	unsigned int for_background:1;
-	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
-	enum wb_reason reason;		/* why was writeback initiated? */
-
-	struct list_head list;		/* pending work list */
-	struct completion *done;	/* set if the caller waits */
-};
-
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -192,6 +173,35 @@ void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * Remove inode from writeback list if clean.
+ */
+void inode_writeback_done(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.list_lock);
+	spin_lock(&inode->i_lock);
+	if (!(inode->i_state & I_DIRTY))
+		list_del_init(&inode->i_wb_list);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_done);
+
+/*
+ * Add inode to writeback dirty list with current time.
+ */
+void inode_writeback_touch(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
+	spin_lock(&bdi->wb.list_lock);
+	inode->dirtied_when = jiffies;
+	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+	spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_touch);
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -593,20 +603,12 @@ static long writeback_chunk_size(struct backing_dev_info *bdi,
  *
  * Return the number of pages and/or inodes written.
  */
-static long writeback_sb_inodes(struct super_block *sb,
-				struct bdi_writeback *wb,
-				struct wb_writeback_work *work)
+long generic_writeback_sb_inodes(
+	struct super_block *sb,
+	struct bdi_writeback *wb,
+	struct wb_writeback_work *work,
+	struct writeback_control *wbc)
 {
-	struct writeback_control wbc = {
-		.sync_mode		= work->sync_mode,
-		.tagged_writepages	= work->tagged_writepages,
-		.for_kupdate		= work->for_kupdate,
-		.for_background		= work->for_background,
-		.for_sync		= work->for_sync,
-		.range_cyclic		= work->range_cyclic,
-		.range_start		= 0,
-		.range_end		= LLONG_MAX,
-	};
 	unsigned long start_time = jiffies;
 	long write_chunk;
 	long wrote = 0;  /* count both pages and inodes */
@@ -644,7 +646,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 			redirty_tail(inode, wb);
 			continue;
 		}
-		if ((inode->i_state & I_SYNC) && wbc.sync_mode != WB_SYNC_ALL) {
+		if ((inode->i_state & I_SYNC) && wbc->sync_mode != WB_SYNC_ALL) {
 			/*
 			 * If this inode is locked for writeback and we are not
 			 * doing writeback-for-data-integrity, move it to
@@ -677,22 +679,22 @@ static long writeback_sb_inodes(struct super_block *sb,
Re: [RFC][PATCH 1/2] Add a super operation for writeback
Hi Dave,

On 06/02/2014 08:33 PM, Dave Chinner wrote:
> On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
>>
>> Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is
>> not correct because the inode needs to end up on the dirty list
>> whether it was already there or not.
> redirty_tail() always moves the inode to the end of the dirty
> list.
>
>> This requirement is analogous to __mark_inode_dirty() and must
>> tolerate similar races. At the microoptimization level, calling
>> redirty_tail from inode_writeback_touch would be less efficient
>> and more bulky. Another small issue is, redirty_tail does not
>> always update the timestamp, which could trigger some bogus
>> writeback events.
> redirty_tail does not update the timestamp when it doesn't need to
> change. If it needs to be changed because the current value would
> violate the time ordering requirements of the list, it rewrites it.
>
> So there is essentially no functional difference between the new
> function and redirty_tail

Hirofumi, would you care to comment?

>>> Hmmm - this is using the wb dirty lists and locks, but you
>>> don't pass the wb structure to the writeback callout? That seems
>>> wrong to me - why would you bother manipulating these lists if
>>> you aren't using them to track dirty inodes in the first place?
>> From Tux3's point of view, the wb lists just allow fs-writeback to
>> determine when to call ->writeback(). We agree that inode lists
>> are a suboptimal mechanism, but that is how fs-writeback currently
>> thinks. It would be better if our inodes never go onto wb lists in
>> the first place, provided that fs-writeback can still drive
>> writeback appropriately.
> It can't, and definitely not with the callout you added.

Obviously, fs-writeback does not get to choose inodes or specific pages
with our proposal (which is the whole point), but it still decides when
to call Tux3 vs. some other filesystem on the same device, and is still
able to indicate how much data it thinks should be written out. Even
though we ignore the latter suggestion and always flush everything, we
appreciate the timing of those callbacks very much, because they give
us exactly the pressure-sensitive batching behavior we want.

> However, we already avoid the VFS writeback lists for certain
> filesystems for pure metadata. e.g. XFS does not use the VFS dirty
> inode lists for inode metadata changes. They get tracked internally
> by the transaction subsystem which does its own writeback according
> to the requirements of journal space availability.
>
> This is done simply by not calling mark_inode_dirty() on any
> metadata only change. If we want to do the same for data, then we'd
> simply not call mark_inode_dirty() in the data IO path. That
> requires a custom ->set_page_dirty method to be provided by the
> filesystem that didn't call
>
> 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>
> and instead did its own thing.
>
> So the per-superblock dirty tracking is something we can do right
> now, and some filesystems do it for metadata. The missing piece for
> data is writeback infrastructure capable of deferring to superblocks
> for writeback rather than BDIs.

We agree that fs-writeback inode lists are broken for anything more
sophisticated than Ext2. An advantage of the patch under consideration
is that it still lets fs-writeback mostly work the way it has worked
for the last few years, except for not allowing it to pick specific
inodes and data pages for writeout. As far as I can see, it still
balances writeout between different filesystems on the same block
device pretty well.

Perhaps fs-writeback should have an option to work without inode
lists at all, and just maintain writeback scheduling statistics in
the superblock or similar. That would be a big change, more on the
experimental side. We would be happy to take it on after merge,
hopefully for the benefit of other filesystems besides Tux3.

> Well, I don't see it that way. If you have a requirement to be able
> to track dirty inodes internally, then let's move to implementing
> that infrastructure rather than hacking around with callouts that
> only have a limited shelf-life.

What you call shelf-life, I call iteration. Small changes are
beautiful. Apologies for the rhetoric, content is below.

> XFS already has everything it needs internally to track dirty
> inodes. In fact, we used to do data writeback from within XFS and we
> slowly removed it as the generic writeback code was improved and
> made the XFS code redundant.
>
> That said, parallelising writeback so we can support hundreds of
> GB/s of delayed allocation based writeback is something we
> eventually need to do, and that pretty much means we need to bring
> dirty data inode tracking back into XFS.
>
> So what we really need is writeback infrastructure that is:
>
> a) independent of the dirty inode lists
> b) implements background, periodic and immediate writeback
> c) can be driven by BDI or superblock
>
> IOWs, the abstraction we need for this writeback callout is
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tue, Jun 03, 2014 at 12:01:11AM -0700, Daniel Phillips wrote:
> Hi Dave,
>
> On 06/02/2014 08:33 PM, Dave Chinner wrote:
>> On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
>>>> Hmmm - this is using the wb dirty lists and locks, but you
>>>> don't pass the wb structure to the writeback callout? That seems
>>>> wrong to me - why would you bother manipulating these lists if
>>>> you aren't using them to track dirty inodes in the first place?
>>> From Tux3's point of view, the wb lists just allow fs-writeback to
>>> determine when to call ->writeback(). We agree that inode lists
>>> are a suboptimal mechanism, but that is how fs-writeback currently
>>> thinks. It would be better if our inodes never go onto wb lists in
>>> the first place, provided that fs-writeback can still drive
>>> writeback appropriately.
>> It can't, and definitely not with the callout you added.
>
> Obviously, fs-writeback does not get to choose inodes or specific
> pages with our proposal (which is the whole point), but it still
> decides when to call Tux3 vs. some other filesystem on the same
> device, and is still able to indicate how much data it thinks should
> be written out. Even though we ignore the latter suggestion and
> always flush everything, we appreciate the timing of those callbacks
> very much, because they give us exactly the pressure-sensitive
> batching behavior we want.

Which, again, points out that you want per-superblock flushing, not
per-bdi flushing which is what the current writeback code does.

>> However, we already avoid the VFS writeback lists for certain
>> filesystems for pure metadata. e.g. XFS does not use the VFS dirty
>> inode lists for inode metadata changes. They get tracked internally
>> by the transaction subsystem which does its own writeback according
>> to the requirements of journal space availability.
>>
>> This is done simply by not calling mark_inode_dirty() on any
>> metadata only change. If we want to do the same for data, then we'd
>> simply not call mark_inode_dirty() in the data IO path. That
>> requires a custom ->set_page_dirty method to be provided by the
>> filesystem that didn't call
>>
>> 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>>
>> and instead did its own thing.
>>
>> So the per-superblock dirty tracking is something we can do right
>> now, and some filesystems do it for metadata. The missing piece for
>> data is writeback infrastructure capable of deferring to superblocks
>> for writeback rather than BDIs.
>
> We agree that fs-writeback inode lists are broken for anything more
> sophisticated than Ext2.

No, I did not say that. I said XFS does something different for
metadata changes because it has different flushing constraints and
requirements than the generic code provides. That does not make the
generic code broken.

> An advantage of the patch under consideration is that it still lets
> fs-writeback mostly work the way it has worked for the last few
> years, except for not allowing it to pick specific inodes and data
> pages for writeout. As far as I can see, it still balances writeout
> between different filesystems on the same block device pretty well.

Not really. If there are 3 superblocks on a BDI, and the dirty inode
list iterates between 2 of them with lots of dirty inodes, it can
starve writeback from the third until one of its dirty inodes pops to
the head of the b_io list. So it's inherently unfair from that
perspective.

Changing the high level flushing to be per-superblock rather than
per-BDI would enable us to control that behaviour and be much fairer
to all the superblocks on a given BDI. That said, I don't really care
that much about this case...

> Perhaps fs-writeback should have an option to work without inode
> lists at all, and just maintain writeback scheduling statistics in
> the superblock or similar. That would be a big change, more on the
> experimental side. We would be happy to take it on after merge,
> hopefully for the benefit of other filesystems besides Tux3.
>
>> Well, I don't see it that way. If you have a requirement to be able
>> to track dirty inodes internally, then let's move to implementing
>> that infrastructure rather than hacking around with callouts that
>> only have a limited shelf-life.
>
> What you call shelf-life, I call iteration. Small changes are
> beautiful. Apologies for the rhetoric, content is below.

Ok, let me use stronger language: implementing something at the wrong
layer only to have to redo it at the correct layer a short while later
is wasted effort. Do it right the first time. I do believe I've
explained this to you at least once before, so let's not have to go
over this again.

>> XFS already has everything it needs internally to track dirty
>> inodes. In fact, we used to do data writeback from within XFS and we
>> slowly removed it as the generic writeback code was improved and
>> made the XFS code redundant.
>>
>> That said, parallelising writeback so we can support hundreds of
>> GB/s of delayed allocation based writeback is something we eventually
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tue, Jun 03, 2014 at 04:47:52PM +0900, OGAWA Hirofumi wrote:
> Daniel Phillips <dan...@phunq.net> writes:
>>>> Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is
>>>> not correct because the inode needs to end up on the dirty list
>>>> whether it was already there or not.
>>> redirty_tail() always moves the inode to the end of the dirty
>>> list.
>
> It doesn't move the inode to the end of the dirty list if wb.b_dirty
> is empty (i.e. it can move from wb.b_io to wb.b_dirty too).

Um, really? What code are you reading? From 3.15-rc8:

static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
{
	assert_spin_locked(&wb->list_lock);
	if (!list_empty(&wb->b_dirty)) {
		struct inode *tail;

		tail = wb_inode(wb->b_dirty.next);
		if (time_before(inode->dirtied_when, tail->dirtied_when))
			inode->dirtied_when = jiffies;
	}
	list_move(&inode->i_wb_list, &wb->b_dirty);
}

The list_move() is not conditional at all and so the inode is *always*
moved to the tail of wb->b_dirty.

> It has a difference. Say, tail->dirtied_when == 1, inode->dirtied_when
> == 2, and now == 30 (tail->dirtied_when expires at 31 with the default
> config). In this case, redirty_tail() doesn't update ->dirtied_when.

OK, that's a different issue, and is actually handled by
requeue_inode(), which is called to put inodes back on the correct
dirty list when IO completes.

I think that if you are going to use the wb dirty inode lists, you
should probably use the existing functions to manage the inode lists
appropriately rather than creating your own writeback list lifecycle.
If tux3 wants its own dirty inode list lifecycles, then that's where
avoiding the wb lists completely is an appropriate design point. I
don't want to hack little bits and pieces all over the writeback code
to support what tux3 is doing right now if it's going to do something
else real soon.

When tux3 moves to using its own internal lists these new functions
just have to be removed again, so let's skip the hack step we are at
right now and go straight for supporting the "don't need the
fs-writeback lists" infrastructure.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [RFC][PATCH 1/2] Add a super operation for writeback
Dave Chinner <da...@fromorbit.com> writes:

>> It doesn't move the inode to the end of the dirty list if wb.b_dirty
>> is empty (i.e. it can move from wb.b_io to wb.b_dirty too).
>
> Um, really? What code are you reading? From 3.15-rc8:
>
> static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
> {
> 	assert_spin_locked(&wb->list_lock);
> 	if (!list_empty(&wb->b_dirty)) {
> 		struct inode *tail;
>
> 		tail = wb_inode(wb->b_dirty.next);
> 		if (time_before(inode->dirtied_when, tail->dirtied_when))
> 			inode->dirtied_when = jiffies;
> 	}
> 	list_move(&inode->i_wb_list, &wb->b_dirty);
> }
>
> The list_move() is not conditional at all and so the inode is
> *always* moved to the tail of wb->b_dirty.

Oh, you are right.

>> It has a difference. Say, tail->dirtied_when == 1,
>> inode->dirtied_when == 2, and now == 30 (tail->dirtied_when expires
>> at 31 with the default config). In this case, redirty_tail()
>> doesn't update ->dirtied_when.
>
> OK, that's a different issue, and is actually handled by
> requeue_inode(), which is called to put inodes back on the correct
> dirty list when IO completes.
>
> I think that if you are going to use the wb dirty inode lists, you
> should probably use the existing functions to manage the inode lists
> appropriately rather than creating your own writeback list lifecycle.

See what __mark_inode_dirty() does. We can consolidate this with
__mark_inode_dirty() if you want. In our case, dirtying while the inode
already has I_DIRTY needs to be handled as a new dirty at some point,
because of data=journal behavior.

> If tux3 wants its own dirty inode list lifecycles, then that's where
> avoiding the wb lists completely is an appropriate design point. I
> don't want to hack little bits and pieces all over the writeback code
> to support what tux3 is doing right now if it's going to do something
> else real soon.
>
> When tux3 moves to using its own internal lists these new functions
> just have to be removed again, so let's skip the hack step we are at
> right now and go straight for supporting the "don't need the
> fs-writeback lists" infrastructure.

Is everyone agreed on that? There was bdflush, then pdflush, and now
bdi flush. Why was it changed to bdi flush? This respects the intent
of the change from pdflush to bdi flush, more or less.

Thanks.
--
OGAWA Hirofumi <hirof...@mail.parknet.co.jp>
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tue 03-06-14 17:52:09, Dave Chinner wrote: On Tue, Jun 03, 2014 at 12:01:11AM -0700, Daniel Phillips wrote: However, we already avoid the VFS writeback lists for certain filesystems for pure metadata. e.g. XFS does not use the VFS dirty inode lists for inode metadata changes. They get tracked internally by the transaction subsystem which does it's own writeback according to the requirements of journal space availability. This is done simply by not calling mark_inode_dirty() on any metadata only change. If we want to do the same for data, then we'd simply not call mark_inode_dirty() in the data IO path. That requires a custom -set_page_dirty method to be provided by the filesystem that didn't call __mark_inode_dirty(mapping-host, I_DIRTY_PAGES); and instead did it's own thing. So the per-superblock dirty tracking is something we can do right now, and some filesystems do it for metadata. The missing piece for data is writeback infrastructure capable of deferring to superblocks for writeback rather than BDIs We agree that fs-writeback inode lists are broken for anything more sophisticated than Ext2. No, I did not say that. I said XFS does something different for metadata changes because it has different flushing constraints and requirements than the generic code provides. That does not make the generic code broken. An advantage of the patch under consideration is that it still lets fs-writeback mostly work the way it has worked for the last few years, except for not allowing it to pick specific inodes and data pages for writeout. As far as I can see, it still balances writeout between different filesystems on the same block device pretty well. Not really. If there are 3 superblocks on a BDI, and the dirty inode list iterates between 2 of them with lots of dirty inodes, it can starve writeback from the third until one of it's dirty inodes pops to the head of the b_io list. So it's inherently unfair from that perspective. 
> Changing the high level flushing to be per-superblock rather than
> per-BDI would enable us to control that behaviour and be much fairer
> to all the superblocks on a given BDI. That said, I don't really
> care that much about this case...

So we currently flush inodes in first-dirtied, first-written-back
order when the superblock is not specified in the writeback work.
That completely ignores which superblock an inode belongs to, but I
don't see per-sb fairness actually making any sense when
  1) flushing old data (to keep the promise set by
     dirty_expire_centisecs)
  2) flushing data to reduce the number of dirty pages
And these are really the only two cases where we don't do per-sb
flushing. Now when filesystems want to do something more clever (and
I can see reasons for that, e.g. when journalling metadata, even more
so when journalling data), I agree we need to somehow implement the
above two types of writeback using per-sb flushing. Type 1) is
actually pretty easy - just tell each sb to write back dirty data up
to time T. Type 2) is more difficult because that is a more
open-ended task - it seems similar to what shrinkers do, but it would
require us to track the per-sb amount of dirty pages / inodes and I'm
not sure we want to add even more page counting statistics...
Especially since often bdi == fs. Thoughts?

							Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
> So we currently flush inodes in first-dirtied, first-written-back
> order when the superblock is not specified in the writeback work.
> That completely ignores which superblock an inode belongs to, but I
> don't see per-sb fairness actually making any sense when
>   1) flushing old data (to keep the promise set by
>      dirty_expire_centisecs)
>   2) flushing data to reduce the number of dirty pages
> And these are really the only two cases where we don't do per-sb
> flushing. Now when filesystems want to do something more clever (and
> I can see reasons for that, e.g. when journalling metadata, even
> more so when journalling data), I agree we need to somehow implement
> the above two types of writeback using per-sb flushing. Type 1) is
> actually pretty easy - just tell each sb to write back dirty data up
> to time T. Type 2) is more difficult because that is a more
> open-ended task - it seems similar to what shrinkers do, but it
> would require us to track the per-sb amount of dirty pages / inodes
> and I'm not sure we want to add even more page counting
> statistics... Especially since often bdi == fs. Thoughts?

Honestly I think doing per-bdi writeback has been a major mistake. As
you said, it only even matters when we have filesystems on multiple
partitions on a single device, and even then only in a simple setup;
as soon as we use LVM or btrfs this sort of sharing stops happening
anyway. I don't even see much of a benefit except that we prevent two
flushing daemons from congesting a single device in that special case
of multiple filesystems on partitions of the same device, and that
could be solved in other ways.

The major benefit of the per-bdi writeback was that for the usual
case of one filesystem per device we get exactly one flusher thread
per filesystem instead of multiple competing ones, but per-sb
writeback would solve that just as well.
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tue, Jun 03, 2014 at 07:14:44AM -0700, Christoph Hellwig wrote:
> Honestly I think doing per-bdi writeback has been a major mistake.
> As you said, it only even matters when we have filesystems on
> multiple partitions on a single device, and even then only in a
> simple setup; as soon as we use LVM or btrfs this sort of sharing
> stops happening anyway. I don't even see much of a benefit except
> that we prevent two flushing daemons from congesting a single device
> in that special case of multiple filesystems on partitions of the
> same device, and that could be solved in other ways.

To be fair, back when per-bdi writeback was introduced, having
multiple partitions on a single disk was far more common, and the use
of LVM was much less common. These days, with many more systems using
one big root filesystem, and/or using flash where parallel writes can
actually be a good thing (since there isn't a single disk head which
has to seek all over the HDD), the case for keeping per-bdi writeback
is much weaker, if not non-existent.

Cheers,

					- Ted
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tue, Jun 03, 2014 at 07:30:32AM +0200, Christian Stroetmann wrote:
> In general, I do not believe that the complexity problems of soft
> updates, atomic writes, and related techniques can be solved by
> hand/manually. So my suggestion is to automatically handle the
> complexity problem of e.g. dependencies in a way that is comparable
> to a(n on-the-fly) file-system compiler, so to say, that works on a
> very large dependency graph (having several billions of graph
> vertices actually). And at this point an abstraction like the one
> given with Featherstitch helps to feed and control this special FS
> compiler.

Well, if you want to try to implement something like this, go for it!
I'd be very curious to see (a) how much CPU overhead it takes to
crunch on a dependency graph with billions of vertices, and (b) how
easy it would be to express these dependencies and how maintainable
such a dependency language would be. Sounds like a great research
topic, and I'll note the Call For Papers for FAST 2015 is out, so if
you can solve these problems, it would make a great FAST 2015
submission:

https://www.usenix.org/conference/fast15/call-for-papers

Cheers,

					- Ted
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tue 03-06-14 07:14:44, Christoph Hellwig wrote:
> On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
> > So we currently flush inodes in first-dirtied, first-written-back
> > order when the superblock is not specified in the writeback work.
> > That completely ignores which superblock an inode belongs to, but
> > I don't see per-sb fairness actually making any sense when
> >   1) flushing old data (to keep the promise set by
> >      dirty_expire_centisecs)
> >   2) flushing data to reduce the number of dirty pages
> > And these are really the only two cases where we don't do per-sb
> > flushing. Now when filesystems want to do something more clever
> > (and I can see reasons for that, e.g. when journalling metadata,
> > even more so when journalling data), I agree we need to somehow
> > implement the above two types of writeback using per-sb flushing.
> > Type 1) is actually pretty easy - just tell each sb to write back
> > dirty data up to time T. Type 2) is more difficult because that is
> > a more open-ended task - it seems similar to what shrinkers do,
> > but it would require us to track the per-sb amount of dirty pages
> > / inodes and I'm not sure we want to add even more page counting
> > statistics... Especially since often bdi == fs. Thoughts?
>
> Honestly I think doing per-bdi writeback has been a major mistake.
> As you said, it only even matters when we have filesystems on
> multiple partitions on a single device, and even then only in a
> simple setup; as soon as we use LVM or btrfs this sort of sharing
> stops happening anyway. I don't even see much of a benefit except
> that we prevent two flushing daemons from congesting a single device
> in that special case of multiple filesystems on partitions of the
> same device, and that could be solved in other ways.

So I agree per-bdi / per-sb matters only in simple setups, but
machines with a single rotating disk with several partitions and
without LVM aren't that rare, AFAICT from my experience.
And I agree we went for per-bdi flushing to avoid two threads
congesting a single device, leading to suboptimal IO patterns during
background writeback.

So currently I'm convinced we want to go for per-sb dirty tracking.
That also makes some speedups in that code noticeably simpler. I'm
not convinced about the per-sb flushing thread - if we don't regress
the multiple-sb-on-one-bdi case when we just let the threads from
different superblocks contend for IO, then that would be a natural
thing to do. But once we have to introduce some synchronization
between threads to avoid regressions, I think it might be easier to
just stay with a per-bdi thread which switches between superblocks.

							Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On the 3rd of June 2014 16:57, Theodore Ts'o wrote:
> On Tue, Jun 03, 2014 at 07:30:32AM +0200, Christian Stroetmann wrote:
> > In general, I do not believe that the complexity problems of soft
> > updates, atomic writes, and related techniques can be solved by
> > hand/manually. So my suggestion is to automatically handle the
> > complexity problem of e.g. dependencies in a way that is
> > comparable to a(n on-the-fly) file-system compiler, so to say,
> > that works on a very large dependency graph (having several
> > billions of graph vertices actually). And at this point an
> > abstraction like the one given with Featherstitch helps to feed
> > and control this special FS compiler.
>
> Well, if you want to try to implement something like this, go for it!

I have already been active for some weeks.

> I'd be very curious to see (a) how much CPU overhead it takes to
> crunch on a dependency graph with billions of vertices, and (b) how
> easy it would be to express these dependencies and how maintainable
> such a dependency language would be. Sounds like a great research
> topic, and

To a): A run is expected to take a few hours on a single computing
node. Also, such graph processing does not have to be done all the
time, but only if a new application demands a specific handling of
the data with respect to e.g. one of the ACID criteria. That is the
reason why I put "on-the-fly" in parentheses.

To b): I hoped that file system developers could make some
suggestions or point to some no-gos. I am also thinking about Petri
nets in this relation, though it is just an idea. I would also like
to mention that it could be used in conjunction with Non-Volatile
Memory (NVM) as well.

> I'll note the Call For Papers for FAST 2015 is out, and if you can
> solve these problems, it would make a great FAST 2015 submission:
>
> https://www.usenix.org/conference/fast15/call-for-papers

Are you serious, or have I missed the 1st of April once again?
Actually, I could only write a general overview of the approach,
comparable to a white paper, but nothing more.
> Cheers,
>
> 					- Ted

Best regards
Christian
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Tuesday, June 3, 2014 8:21:55 AM PDT, Jan Kara wrote:
> On Tue 03-06-14 07:14:44, Christoph Hellwig wrote:
> > On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
> > ...
> So I agree per-bdi / per-sb matters only in simple setups, but
> machines with a single rotating disk with several partitions and
> without LVM aren't that rare, AFAICT from my experience.

Retribution is sure to be swift, terrible and eternal for anyone who
dares to break those.

> And I agree we went for per-bdi flushing to avoid two threads
> congesting a single device, leading to suboptimal IO patterns during
> background writeback.

A proposal is on the table to implement s_ops->writeback() as a
per-sb operation in such a way that nothing changes in the current
per-inode path. Good or bad approach?

> So currently I'm convinced we want to go for per-sb dirty tracking.
> That also makes some speedups in that code noticeably simpler. I'm
> not convinced about the per-sb flushing thread - if we don't regress
> the multiple-sb-on-one-bdi case when we just let the threads from
> different superblocks contend for IO, then that would be a natural
> thing to do. But once we have to introduce some synchronization
> between threads to avoid regressions, I think it might be easier to
> just stay with a per-bdi thread which switches between superblocks.

Could you elaborate on the means of switching between superblocks? Do
you mean a new fs-writeback path just for data=journal class
filesystems, or are you suggesting changing the way all filesystems
are driven?

Regards,

Daniel
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On the 3rd of June 2014 05:39, Dave Chinner wrote:
> On Mon, Jun 02, 2014 at 10:30:07AM +0200, Christian Stroetmann wrote:
> > When I followed the advice of Dave Chinner:
> > "We're not going to merge that page forking stuff (like you were
> > told at LSF 2013 more than a year ago:
> > http://lwn.net/Articles/548091/) without rigorous design review
> > and a demonstration of the solutions to all the hard corner cases
> > it has"
> > given in his e-mail related to the presentation of the latest
> > version of the Tux3 file system (see [1]), and read the linked
> > article, I found in the second comment:
> > "Parts of this almost sound like it either a.) overlaps with or
> > b.) would benefit greatly from something similar to Featherstitch
> > [[2]]."
> >
> > Could it be that with Featherstitch we already have a general
> > solution that is said to be even "file system agnostic"?
> > Honestly, I thought that something like this would make its way
> > into the Linux code base.
>
> Here's what I said about the last proposal (a few months ago) for
> integrating featherstitch into the kernel:
>
> http://www.spinics.net/lists/linux-fsdevel/msg72799.html
>
> It's not a viable solution.
>
> Cheers,
>
> Dave.

How annoying; I did not remember your e-mail from the referenced
thread "[Lsf-pc] [LSF/MM TOPIC] atomic block device", despite having
saved it on local disk. Thanks a lot for the reminder.

I also directly saw the problem with the research prototype
Featherstitch, specifically the point "All the filesystem modules it
has are built into the featherstitch kernel module, and called
through a VFS shim layer". But it is just a prototype, and its
concept of abstraction does not have to be copied 1:1 into the Linux
code base.

In general, I do not believe that the complexity problems of soft
updates, atomic writes, and related techniques can be solved by
hand/manually. So my suggestion is to automatically handle the
complexity problem of e.g.
dependencies in a way that is comparable to a(n on-the-fly)
file-system compiler, so to say, that works on a very large
dependency graph (having several billions of graph vertices
actually). And at this point an abstraction like the one given with
Featherstitch helps to feed and control this special FS compiler.

Actually, I have to follow the discussion further on the one hand and
go deeper into the highly complex problem space on the other hand.

With all the best
Christian
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Mon, Jun 02, 2014 at 10:30:07AM +0200, Christian Stroetmann wrote:
> When I followed the advice of Dave Chinner:
> "We're not going to merge that page forking stuff (like you were
> told at LSF 2013 more than a year ago:
> http://lwn.net/Articles/548091/) without rigorous design review and
> a demonstration of the solutions to all the hard corner cases it
> has"
> given in his e-mail related with the presentation of the latest
> version of the Tux3 file system (see [1]) and read the linked
> article, I found in the second comments:
> "Parts of this almost sound like it either a.) overlaps with or b.)
> would benefit greatly from something similar to Featherstitch
> [[2]]."
>
> Could it be that we have with Featherstitch a general solution
> already that is said to be even "file system agnostic"?
> Honestly, I thought that something like this would make its way
> into the Linux code base.

Here's what I said about the last proposal (a few months ago) for
integrating featherstitch into the kernel:

http://www.spinics.net/lists/linux-fsdevel/msg72799.html

It's not a viable solution.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC][PATCH 1/2] Add a super operation for writeback
[ please line wrap at something sane like 68 columns ]

On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
> On 06/01/2014 08:15 PM, Dave Chinner wrote:
> > On Sun, Jun 01, 2014 at 02:41:02PM -0700, I wrote:
> > > +
> > > +/*
> > > + * Add inode to writeback dirty list with current time.
> > > + */
> > > +void inode_writeback_touch(struct inode *inode)
> > > +{
> > > +	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
> > > +	spin_lock(&bdi->wb.list_lock);
> > > +	inode->dirtied_when = jiffies;
> > > +	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> > > +	spin_unlock(&bdi->wb.list_lock);
> > > +}
> > > +EXPORT_SYMBOL_GPL(inode_writeback_touch);
> >
> > You should be able to use redirty_tail() for this
>
> redirty_tail() nearly works, but "if (!list_empty(&wb->b_dirty))"
> is not correct because the inode needs to end up on the dirty list
> whether it was already there or not.

redirty_tail() always moves the inode to the end of the dirty list.

> This requirement is analogous to __mark_inode_dirty() and must
> tolerate similar races. At the microoptimization level, calling
> redirty_tail from inode_writeback_touch would be less efficient
> and more bulky. Another small issue is, redirty_tail does not
> always update the timestamp, which could trigger some bogus
> writeback events.

redirty_tail() does not update the timestamp when it doesn't need to
change. If it needs to be changed because the current value would
violate the time ordering requirements of the list, it rewrites it.

So there is essentially no functional difference between the new
function and redirty_tail().

> > Hmmm - this is using the wb dirty lists and locks, but you
> > don't pass the wb structure to the writeback callout? That seems
> > wrong to me - why would you bother manipulating these lists if
> > you aren't using them to track dirty inodes in the first place?
>
> From Tux3's point of view, the wb lists just allow fs-writeback to
> determine when to call ->writeback().
> We agree that inode lists are a suboptimal mechanism, but that is
> how fs-writeback currently thinks. It would be better if our inodes
> never go onto wb lists in the first place, provided that
> fs-writeback can still drive writeback appropriately.

It can't, and definitely not with the callout you added.

However, we already avoid the VFS writeback lists for certain
filesystems for pure metadata. e.g. XFS does not use the VFS dirty
inode lists for inode metadata changes. They get tracked internally
by the transaction subsystem which does its own writeback according
to the requirements of journal space availability. This is done
simply by not calling mark_inode_dirty() on any metadata only change.
If we want to do the same for data, then we'd simply not call
mark_inode_dirty() in the data IO path. That requires a custom
->set_page_dirty method to be provided by the filesystem that didn't
call __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); and instead
did its own thing.

So the per-superblock dirty tracking is something we can do right
now, and some filesystems do it for metadata. The missing piece for
data is writeback infrastructure capable of deferring to superblocks
for writeback rather than BDIs.

> Perhaps fs-writeback should have an option to work without inode
> lists at all, and just maintain writeback scheduling statistics in
> the superblock or similar. That would be a big change, more on the
> experimental side. We would be happy to take it on after merge,
> hopefully for the benefit of other filesystems besides Tux3.

Well, I don't see it that way. If you have a requirement to be able
to track dirty inodes internally, then let's move to implementing
that infrastructure rather than hacking around with callouts that
only have a limited shelf-life.

> What we pass to ->writeback() is just a matter of taste at the
> moment because we currently ignore everything except ->nr_pages.
> Anything sane is fine.
> Note that bdi_writeback is already available from bdi->wb, the
> "default writeback info", whatever that means. A quick tour of
> existing usage suggests that you can reliably obtain the wb that
> way, but a complete audit would be needed.
>
> Most probably, this API will evolve as new users arrive, and also
> as our Tux3 usage becomes more sophisticated. For now, Tux3 just
> flushes everything without asking questions. Exactly how that
> might change in the future is hard to predict. You are in a better
> position to know what XFS would require in order to use this
> interface.

XFS already has everything it needs internally to track dirty inodes.
In fact, we used to do data writeback from within XFS, and we slowly
removed it as improvements to the generic writeback code made the XFS
code redundant. That said, parallelising writeback so we can support
hundreds of GB/s of delayed allocation based writeback is something
we eventually need to do, and that pretty much means we need to bring
dirty data inode tracking back into XFS. So what we really
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On 06/01/2014 08:15 PM, Dave Chinner wrote:
> On Sun, Jun 01, 2014 at 02:41:02PM -0700, I wrote:
> > +
> > +/*
> > + * Add inode to writeback dirty list with current time.
> > + */
> > +void inode_writeback_touch(struct inode *inode)
> > +{
> > +	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
> > +	spin_lock(&bdi->wb.list_lock);
> > +	inode->dirtied_when = jiffies;
> > +	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> > +	spin_unlock(&bdi->wb.list_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(inode_writeback_touch);
>
> You should be able to use redirty_tail() for this

redirty_tail() nearly works, but "if (!list_empty(&wb->b_dirty))" is
not correct because the inode needs to end up on the dirty list
whether it was already there or not. This requirement is analogous
to __mark_inode_dirty() and must tolerate similar races. At the
microoptimization level, calling redirty_tail from
inode_writeback_touch would be less efficient and more bulky. Another
small issue is, redirty_tail does not always update the timestamp,
which could trigger some bogus writeback events.

> Hmmm - this is using the wb dirty lists and locks, but you
> don't pass the wb structure to the writeback callout? That seems
> wrong to me - why would you bother manipulating these lists if you
> aren't using them to track dirty inodes in the first place?

From Tux3's point of view, the wb lists just allow fs-writeback to
determine when to call ->writeback(). We agree that inode lists are a
suboptimal mechanism, but that is how fs-writeback currently thinks.
It would be better if our inodes never go onto wb lists in the first
place, provided that fs-writeback can still drive writeback
appropriately.

Perhaps fs-writeback should have an option to work without inode
lists at all, and just maintain writeback scheduling statistics in
the superblock or similar. That would be a big change, more on the
experimental side. We would be happy to take it on after merge,
hopefully for the benefit of other filesystems besides Tux3.
What we pass to ->writeback() is just a matter of taste at the moment
because we currently ignore everything except ->nr_pages. Anything
sane is fine. Note that bdi_writeback is already available from
bdi->wb, the "default writeback info", whatever that means. A quick
tour of existing usage suggests that you can reliably obtain the wb
that way, but a complete audit would be needed.

Most probably, this API will evolve as new users arrive, and also as
our Tux3 usage becomes more sophisticated. For now, Tux3 just flushes
everything without asking questions. Exactly how that might change in
the future is hard to predict. You are in a better position to know
what XFS would require in order to use this interface.

> The first thing that __writeback_sb_inodes() does is create a
> struct writeback_control from the wb_writeback_work. That should be
> done here and passed to __writeback_sb_inodes(), which should be
> renamed "generic_writeback_sb_inodes()". Also, all the fields in
> the wbc need to be initialised correctly (i.e. including
> range_start/end).

Good point. I will try that generic_writeback_sb_inodes() concept for
the next iteration, which will need a day or two including regression
testing.

> Further, a writeback implementation will need to use the generic
> bdi list and lock structures and so we need to pass the
> bdi_writeback. Similarly, if we are going to pass nr_pages, we
> might as well pass the entire work structure.
>
> Finally, I don't like the way the wb->list_lock is treated
> differently by this code. I suspect that we need to rationalise the
> layering of the wb->list_lock as it is currently not clear what it
> protects and what (nested) layers of the writeback code actually
> require it.

One design goal of this proposed writeback interface is to hide the
wb list lock entirely from the filesystem so core writeback can
evolve more easily. This is not cast in stone, but it smells like
decent factoring.
We can save some spinlocks by violating that factoring (as Hirofumi's
original hack did) at the cost of holding a potentially busy wb lock
longer, which does not seem like a good trade. I agree that writeback
list locking is murky, and fs-writeback is murky in general. We would
like Tux3 to be part of the solution, not the problem.

> What I'd like to see is this work:
>
>	struct super_ops ... = {
>		.writeback	= generic_writeback_sb_inodes,
>	};
>
> And that means writeback_sb_inodes() would become:
>
> static long writeback_sb_inodes(struct super_block *sb,
>				 struct bdi_writeback *wb,
>				 struct wb_writeback_work *work)
> {
>	struct writeback_control wbc = {
>		.sync_mode		= work->sync_mode,
>		.tagged_writepages	= work->tagged_writepages,
>		.for_kupdate		= work->for_kupdate,
>		.for_background		= work->for_background,
>		.for_sync
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On the 1st of June 2014 23:41, Daniel Phillips wrote:
> Hi,
>
> This is the first of four core changes we would like for Tux3. We
> start with a hard one and suggest a simple solution.
>
> The first patch in this series adds a new super operation to write
> back multiple inodes in a single call. The second patch applies to
> our linux-tux3 repository at git.kernel.org to demonstrate how this
> interface is used, and removes about 450 lines of workaround code.
>
> Traditionally, the core kernel tells each mounted filesystem which
> dirty pages of which inodes should be flushed to disk, but is,
> unfortunately, blissfully ignorant of filesystem-specific ordering
> constraints. This scheme was really invented for Ext2 and has not
> evolved much recently. Tux3, with its strong ordering and optimized
> delta commit, cannot tolerate random flushing and therefore takes
> responsibility for flush ordering itself. On the other hand, Tux3
> has no good way to know when is the right time to flush, but core
> is good at that. This proposed API harmonizes those two
> competencies so that Tux3 and core each take care of what they are
> good at, and not more.
>
> The API extension consists of a new writeback hook and two helpers
> to be used within the hook. The hook sits about halfway down the
> fs-writeback stack, just after core has determined that some dirty
> inodes should be flushed to disk and just before it starts thinking
> about which inodes those should be. At that point, core calls Tux3
> instead of continuing on down the usual do_writepages path. Tux3
> responds by staging a new delta commit, using the new helpers to
> tell core which inodes were flushed versus being left dirty in
> cache. This is pretty much the same behavior as the traditional
> writeout path, but less convoluted, probably more efficient, and
> certainly easier to analyze.
> The new writeback hook looks like:
>
>	progress = sb->s_op->writeback(sb, &writeback_control, &nr_pages);
>
> This should be self-explanatory: nr_pages and progress have the
> semantics of existing usage in fs-writeback; writeback_control is
> ignored by Tux3, but that is only because Tux3 always flushes
> everything and does not require hints for now. We can safely assume
> that &wbc or equivalent is wanted here. An obvious wart is the
> overlap between "progress" and "nr_pages", but fs-writeback thinks
> that way, so it would not make much sense to improve one without
> improving the other.
>
> Tux3 necessarily keeps its own dirty inode list, which is an area
> of overlap with fs-writeback. In a perfect world, there would be
> just one dirty inode list per superblock, on which both
> fs-writeback and Tux3 would operate. That would be a deeper core
> change than seems appropriate right now.
>
> Potential races are a big issue with this API, which is no
> surprise. The fs-writeback scheme requires keeping several kinds of
> object in sync: tux3 dirty inode lists, fs-writeback dirty inode
> lists and inode dirty state. The new helpers
> inode_writeback_done(inode) and inode_writeback_touch(inode) take
> care of that while hiding internal details of the fs-writeback
> implementation. Tux3 calls inode_writeback_done when it has flushed
> an inode and marked it clean, or calls inode_writeback_touch if it
> intends to retain a dirty inode in cache. These have simple
> implementations. The former just removes a clean inode from any
> fs-writeback list. The latter updates the inode's dirty timestamp
> so that fs-writeback does not keep trying to flush it. Both these
> things could be done more efficiently by re-engineering
> fs-writeback, but we prefer to work with the existing scheme for
> now.
>
> Hirofumi's masterful hack nicely avoided racy removal of inodes
> from the writeback list by taking an internal fs-writeback lock
> inside filesystem code.
> The new helper requires dropping i_lock inside filesystem code and
> retaking it in the helper, so inode redirty can race with writeback
> list removal. This does not seem to present a problem because
> filesystem code is able to enforce strict alternation of cleaning
> and calling the helper. As an offsetting advantage, writeback lock
> contention is reduced.
>
> Compared to Hirofumi's hack, the cost of this interface is one
> additional spinlock per inode_writeback_done, which is
> insignificant compared to the convoluted code path that is avoided.
>
> Regards,
>
> Daniel

When I followed the advice of Dave Chinner:
"We're not going to merge that page forking stuff (like you were told
at LSF 2013 more than a year ago: http://lwn.net/Articles/548091/)
without rigorous design review and a demonstration of the solutions
to all the hard corner cases it has"
given in his e-mail related to the presentation of the latest version
of the Tux3 file system (see [1]), and read the linked article, I
found in the second comment:
"Parts of this almost sound like it either a.) overlaps with or b.)
would benefit greatly from something similar to Featherstitch
[[2]]."

Could it be that we have with Featherstitch a general solution
already that is said to be even "file system agnostic"? Honestly, I
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On the 1st of June 2014 23:41, Daniel Phillips wrote: Hi, This is the first of four core changes we would like for Tux3. We start with a hard one and suggest a simple solution. The first patch in this series adds a new super operation to write back multiple inodes in a single call. The second patch applies to our linux-tux3 repository at git.kernel.org to demonstrate how this interface is used, and removes about 450 lines of workaround code. Traditionally, core kernel tells each mounted filesystem which dirty pages of which inodes should be flushed to disk, but unfortunately, is blissfully ignorant of filesystem-specific ordering constraints. This scheme was really invented for Ext2 and has not evolved much recently. Tux3, with its strong ordering and optimized delta commit, cannot tolerate random flushing and therefore takes responsibility for flush ordering itself. On the other hand, Tux3 has no good way to know when is the right time to flush, but core is good at that. This proposed API harmonizes those two competencies so that Tux3 and core each take care of what they are good at, and not more. The API extension consists of a new writeback hook and two helpers to be used within the hook. The hook sits about halfway down the fs-writeback stack, just after core has determined that some dirty inodes should be flushed to disk and just before it starts thinking about which inodes those should be. At that point, core calls Tux3 instead of continuing on down the usual do_writepages path. Tux3 responds by staging a new delta commit, using the new helpers to tell core which inodes were flushed versus being left dirty in cache. This is pretty much the same behavior as the traditional writeout path, but less convoluted, probably more efficient, and certainly easier to analyze.
The new writeback hook looks like: progress = sb->s_op->writeback(sb, &writeback_control, &nr_pages); This should be self-explanatory: nr_pages and progress have the semantics of existing usage in fs-writeback; writeback_control is ignored by Tux3, but that is only because Tux3 always flushes everything and does not require hints for now. We can safely assume that &wbc or equivalent is wanted here. An obvious wart is the overlap between "progress" and "nr_pages", but fs-writeback thinks that way, so it would not make much sense to improve one without improving the other. Tux3 necessarily keeps its own dirty inode list, which is an area of overlap with fs-writeback. In a perfect world, there would be just one dirty inode list per superblock, on which both fs-writeback and Tux3 would operate. That would be a deeper core change than seems appropriate right now. Potential races are a big issue with this API, which is no surprise. The fs-writeback scheme requires keeping several kinds of object in sync: tux3 dirty inode lists, fs-writeback dirty inode lists and inode dirty state. The new helpers inode_writeback_done(inode) and inode_writeback_touch(inode) take care of that while hiding internal details of the fs-writeback implementation. Tux3 calls inode_writeback_done when it has flushed an inode and marked it clean, or calls inode_writeback_touch if it intends to retain a dirty inode in cache. These have simple implementations. The former just removes a clean inode from any fs-writeback list. The latter updates the inode's dirty timestamp so that fs-writeback does not keep trying to flush it. Both these things could be done more efficiently by re-engineering fs-writeback, but we prefer to work with the existing scheme for now. Hirofumi's masterful hack nicely avoided racy removal of inodes from the writeback list by taking an internal fs-writeback lock inside filesystem code.
The new helper requires dropping i_lock inside filesystem code and retaking it in the helper, so inode redirty can race with writeback list removal. This does not seem to present a problem because filesystem code is able to enforce strict alternation of cleaning and calling the helper. As an offsetting advantage, writeback lock contention is reduced. Compared to Hirofumi's hack, the cost of this interface is one additional spinlock per inode_writeback_done, which is insignificant compared to the convoluted code path that is avoided. Regards, Daniel When I followed the advice of Dave Chinner: "We're not going to merge that page forking stuff (like you were told at LSF 2013 more than a year ago: http://lwn.net/Articles/548091/) without rigorous design review and a demonstration of the solutions to all the hard corner cases it has" given in his e-mail related with the presentation of the latest version of the Tux3 file system (see [1]) and read the linked article, I found in the second comments: "Parts of this almost sound like it either a.) overlaps with or b.) would benefit greatly from something similar to Featherstitch [[2]]." Could it be that we have with Featherstitch a general solution already that is said to be even "file system agnostic"? Honestly, I
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On 06/01/2014 08:15 PM, Dave Chinner wrote: On Sun, Jun 01, 2014 at 02:41:02PM -0700, I wrote:

+/*
+ * Add inode to writeback dirty list with current time.
+ */
+void inode_writeback_touch(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
+	spin_lock(&bdi->wb.list_lock);
+	inode->dirtied_when = jiffies;
+	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+	spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_touch);

You should be able to use redirty_tail() for this

Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is not correct because the inode needs to end up on the dirty list whether it was already there or not. This requirement is analogous to __mark_inode_dirty() and must tolerate similar races. At the microoptimization level, calling redirty_tail from inode_writeback_touch would be less efficient and more bulky. Another small issue is, redirty_tail does not always update the timestamp, which could trigger some bogus writeback events.

Hmmm - this is using the wb dirty lists and locks, but you don't pass the wb structure to the writeback callout? That seems wrong to me - why would you bother manipulating these lists if you aren't using them to track dirty inodes in the first place?

From Tux3's point of view, the wb lists just allow fs-writeback to determine when to call ->writeback(). We agree that inode lists are a suboptimal mechanism, but that is how fs-writeback currently thinks. It would be better if our inodes never go onto wb lists in the first place, provided that fs-writeback can still drive writeback appropriately. Perhaps fs-writeback should have an option to work without inode lists at all, and just maintain writeback scheduling statistics in the superblock or similar. That would be a big change, more on the experimental side. We would be happy to take it on after merge, hopefully for the benefit of other filesystems besides Tux3.
What we pass to ->writeback() is just a matter of taste at the moment because we currently ignore everything except work->nr_pages. Anything sane is fine. Note that bdi_writeback is already available from bdi->wb, the default writeback info, whatever that means. A quick tour of existing usage suggests that you can reliably obtain the wb that way, but a complete audit would be needed. Most probably, this API will evolve as new users arrive, and also as our Tux3 usage becomes more sophisticated. For now, Tux3 just flushes everything without asking questions. Exactly how that might change in the future is hard to predict. You are in a better position to know what XFS would require in order to use this interface. The first thing that __writeback_sb_inodes() does is create a struct writeback_control from the wb_writeback_work. That should be done here and passed to __writeback_sb_inodes(), which should be renamed generic_writeback_sb_inodes(). Also, all the fields in the wbc need to be initialised correctly (i.e. including range_start/end). Good point. I will try that generic_writeback_sb_inode concept for the next iteration, which will need a day or two including regression testing. Further, a writeback implementation will need to use the generic bdi list and lock structures and so we need to pass the bdi_writeback. Similarly, if we are going to pass nr_pages, we might as well pass the entire work structure. Finally, I don't like the way the wb->list_lock is treated differently by this code. I suspect that we need to rationalise the layering of the wb->list_lock as it is currently not clear what it protects and what (nested) layers of the writeback code actually require it. One design goal of this proposed writeback interface is to hide the wb list lock entirely from the filesystem so core writeback can evolve more easily. This is not cast in stone, but it smells like decent factoring.
We can save some spinlocks by violating that factoring (as Hirofumi's original hack did) at the cost of holding a potentially busy wb lock longer, which does not seem like a good trade. I agree that writeback list locking is murky, and fs-writeback is murky in general. We would like Tux3 to be part of the solution, not the problem. What I'd like to see is this work: struct super_ops ... = { .writeback = generic_writeback_sb_inodes, And that means writeback_sb_inodes() would become:

static long writeback_sb_inodes(struct super_block *sb,
				struct bdi_writeback *wb,
				struct wb_writeback_work *work)
{
	struct writeback_control wbc = {
		.sync_mode = work->sync_mode,
		.tagged_writepages = work->tagged_writepages,
		.for_kupdate = work->for_kupdate,
		.for_background = work->for_background,
		.for_sync = work->for_sync,
		.range_cyclic =
Re: [RFC][PATCH 1/2] Add a super operation for writeback
[ please line wrap at something sane like 68 columns ] On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote: On 06/01/2014 08:15 PM, Dave Chinner wrote: On Sun, Jun 01, 2014 at 02:41:02PM -0700, I wrote:

+/*
+ * Add inode to writeback dirty list with current time.
+ */
+void inode_writeback_touch(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
+	spin_lock(&bdi->wb.list_lock);
+	inode->dirtied_when = jiffies;
+	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+	spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_touch);

You should be able to use redirty_tail() for this

Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is not correct because the inode needs to end up on the dirty list whether it was already there or not.

redirty_tail() always moves the inode to the end of the dirty list.

This requirement is analogous to __mark_inode_dirty() and must tolerate similar races. At the microoptimization level, calling redirty_tail from inode_writeback_touch would be less efficient and more bulky. Another small issue is, redirty_tail does not always update the timestamp, which could trigger some bogus writeback events.

redirty_tail does not update the timestamp when it doesn't need to be changed. If it needs to be changed because the current value would violate the time ordering requirements of the list, it rewrites it. So there is essentially no functional difference between the new function and redirty_tail.

Hmmm - this is using the wb dirty lists and locks, but you don't pass the wb structure to the writeback callout? That seems wrong to me - why would you bother manipulating these lists if you aren't using them to track dirty inodes in the first place?

From Tux3's point of view, the wb lists just allow fs-writeback to determine when to call ->writeback(). We agree that inode lists are a suboptimal mechanism, but that is how fs-writeback currently thinks.
It would be better if our inodes never go onto wb lists in the first place, provided that fs-writeback can still drive writeback appropriately. It can't, and definitely not with the callout you added. However, we already avoid the VFS writeback lists for certain filesystems for pure metadata. e.g. XFS does not use the VFS dirty inode lists for inode metadata changes. They get tracked internally by the transaction subsystem which does its own writeback according to the requirements of journal space availability. This is done simply by not calling mark_inode_dirty() on any metadata only change. If we want to do the same for data, then we'd simply not call mark_inode_dirty() in the data IO path. That requires a custom ->set_page_dirty method to be provided by the filesystem that didn't call __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); and instead did its own thing. So the per-superblock dirty tracking is something we can do right now, and some filesystems do it for metadata. The missing piece for data is writeback infrastructure capable of deferring to superblocks for writeback rather than BDIs. Perhaps fs-writeback should have an option to work without inode lists at all, and just maintain writeback scheduling statistics in the superblock or similar. That would be a big change, more on the experimental side. We would be happy to take it on after merge, hopefully for the benefit of other filesystems besides Tux3. Well, I don't see it that way. If you have a requirement to be able to track dirty inodes internally, then let's move to implementing that infrastructure rather than hacking around with callouts that only have a limited shelf-life. What we pass to ->writeback() is just a matter of taste at the moment because we currently ignore everything except work->nr_pages. Anything sane is fine. Note that bdi_writeback is already available from bdi->wb, the default writeback info, whatever that means.
A quick tour of existing usage suggests that you can reliably obtain the wb that way, but a complete audit would be needed. Most probably, this API will evolve as new users arrive, and also as our Tux3 usage becomes more sophisticated. For now, Tux3 just flushes everything without asking questions. Exactly how that might change in the future is hard to predict. You are in a better position to know what XFS would require in order to use this interface. XFS already has everything it needs internally to track dirty inodes. In fact, we used to do data writeback from within XFS and we slowly removed it as the generic writeback code was improved and made the XFS code redundant. That said, parallelising writeback so we can support hundreds of GB/s of delayed allocation based writeback is something we eventually need to do, and that pretty much means we need to bring dirty data inode tracking back into XFS. So what we really need is writeback infrastructure that is: a) independent of the dirty
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Mon, Jun 02, 2014 at 10:30:07AM +0200, Christian Stroetmann wrote: When I followed the advice of Dave Chinner: "We're not going to merge that page forking stuff (like you were told at LSF 2013 more than a year ago: http://lwn.net/Articles/548091/) without rigorous design review and a demonstration of the solutions to all the hard corner cases it has" given in his e-mail related with the presentation of the latest version of the Tux3 file system (see [1]) and read the linked article, I found in the second comments: "Parts of this almost sound like it either a.) overlaps with or b.) would benefit greatly from something similar to Featherstitch [[2]]." Could it be that we have with Featherstitch a general solution already that is said to be even "file system agnostic"? Honestly, I thought that something like this would make its way into the Linux code base. Here's what I said about the last proposal (a few months ago) for integrating featherstitch into the kernel: http://www.spinics.net/lists/linux-fsdevel/msg72799.html It's not a viable solution. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On the 3rd of June 2014 05:39, Dave Chinner wrote: On Mon, Jun 02, 2014 at 10:30:07AM +0200, Christian Stroetmann wrote: When I followed the advice of Dave Chinner: "We're not going to merge that page forking stuff (like you were told at LSF 2013 more than a year ago: http://lwn.net/Articles/548091/) without rigorous design review and a demonstration of the solutions to all the hard corner cases it has" given in his e-mail related with the presentation of the latest version of the Tux3 file system (see [1]) and read the linked article, I found in the second comments: "Parts of this almost sound like it either a.) overlaps with or b.) would benefit greatly from something similar to Featherstitch [[2]]." Could it be that we have with Featherstitch a general solution already that is said to be even "file system agnostic"? Honestly, I thought that something like this would make its way into the Linux code base. Here's what I said about the last proposal (a few months ago) for integrating featherstitch into the kernel: http://www.spinics.net/lists/linux-fsdevel/msg72799.html It's not a viable solution. Cheers, Dave. How annoying, I did not remember your e-mail of the referenced thread [Lsf-pc] [LSF/MM TOPIC] atomic block device even though I saved it on local disk. Thanks a lot for the reminder. I also directly saw the problem with the research prototype Featherstitch, specifically the point "All the filesystem modules it has are built into the featherstitch kernel module, and called through a VFS shim layer." But it is just a prototype and its concept of abstraction does not have to be copied 1:1 into the Linux code base. In general, I do not believe that the complexity problems of soft updates, atomic writes, and related techniques can be solved by hand/manually. So my suggestion is to automatically handle the complexity problem of e.g.
dependencies in a way that is comparable to a(n on-the-fly) file-system compiler, so to say, that works on a very large dependency graph (having several billions of graph vertices, actually). And at this point an abstraction like the one given with Featherstitch helps to feed and control this special FS compiler. Actually, I have to follow the discussion further on the one hand and go deeper into the highly complex problem space on the other hand. With all the best Christian
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Sun, Jun 01, 2014 at 02:41:02PM -0700, Daniel Phillips wrote:
> ---
> From: Daniel Phillips
> Subject: [PATCH] Add a super operation for writeback
>
> Add a "writeback" super operation to be called in the
> form:
>
> progress = sb->s_op->writeback(sb, &wbc, &pages);
>
> The filesystem is expected to flush some inodes to disk
> and return progress of at least 1, or if no inodes are
> flushed, return progress of zero. The filesystem should
> try to flush at least the number of pages specified in
> *pages, or if that is not possible, return approximately
> the number of pages not flushed into *pages.
>
> Within the ->writeback callback, the filesystem should
> call inode_writeback_done(inode) for each inode flushed
> (and therefore set clean) or inode_writeback_touch(inode)
> for any inode that will be retained dirty in cache.
>
> Signed-off-by: Daniel Phillips
> Signed-off-by: OGAWA Hirofumi
> ---
>
>  fs/fs-writeback.c  | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  include/linux/fs.h |  4 +++
>  2 files changed, 60 insertions(+), 3 deletions(-)
>
> diff -puN fs/fs-writeback.c~core-writeback fs/fs-writeback.c
> --- linux-tux3/fs/fs-writeback.c~core-writeback	2014-05-31 06:43:19.416031712 +0900
> +++ linux-tux3-hirofumi/fs/fs-writeback.c	2014-05-31 06:44:46.087904373 +0900
> @@ -192,6 +192,35 @@ void inode_wb_list_del(struct inode *ino
>  }
>
>  /*
> + * Remove inode from writeback list if clean.
> + */
> +void inode_writeback_done(struct inode *inode)
> +{
> +	struct backing_dev_info *bdi = inode_to_bdi(inode);
> +
> +	spin_lock(&bdi->wb.list_lock);
> +	spin_lock(&inode->i_lock);
> +	if (!(inode->i_state & I_DIRTY))
> +		list_del_init(&inode->i_wb_list);
> +	spin_unlock(&inode->i_lock);
> +	spin_unlock(&bdi->wb.list_lock);
> +}
> +EXPORT_SYMBOL_GPL(inode_writeback_done);
> +
> +/*
> + * Add inode to writeback dirty list with current time.
> + */
> +void inode_writeback_touch(struct inode *inode)
> +{
> +	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
> +	spin_lock(&bdi->wb.list_lock);
> +	inode->dirtied_when = jiffies;
> +	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> +	spin_unlock(&bdi->wb.list_lock);
> +}
> +EXPORT_SYMBOL_GPL(inode_writeback_touch);

You should be able to use redirty_tail() for this

Hmmm - this is using the wb dirty lists and locks, but you don't pass the wb structure to the writeback callout? That seems wrong to me - why would you bother manipulating these lists if you aren't using them to track dirty inodes in the first place?

> +
> +/*
>  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
>  * furthest end of its superblock's dirty-inode list.
>  *
> @@ -593,9 +622,9 @@ static long writeback_chunk_size(struct
>  *
>  * Return the number of pages and/or inodes written.
>  */
> -static long writeback_sb_inodes(struct super_block *sb,
> -				struct bdi_writeback *wb,
> -				struct wb_writeback_work *work)
> +static long __writeback_sb_inodes(struct super_block *sb,
> +				  struct bdi_writeback *wb,
> +				  struct wb_writeback_work *work)
>  {
>  	struct writeback_control wbc = {
>  		.sync_mode = work->sync_mode,
> @@ -710,6 +739,30 @@ static long writeback_sb_inodes(struct s
>  	return wrote;
>  }
>
> +static long writeback_sb_inodes(struct super_block *sb,
> +				struct bdi_writeback *wb,
> +				struct wb_writeback_work *work)
> +{
> +	if (sb->s_op->writeback) {
> +		struct writeback_control wbc = {
> +			.sync_mode = work->sync_mode,
> +			.tagged_writepages = work->tagged_writepages,
> +			.for_kupdate = work->for_kupdate,
> +			.for_background = work->for_background,
> +			.for_sync = work->for_sync,
> +			.range_cyclic = work->range_cyclic,
> +		};
> +		long ret;
> +
> +		spin_unlock(&wb->list_lock);
> +		ret = sb->s_op->writeback(sb, &wbc, &work->nr_pages);
> +		spin_lock(&wb->list_lock);
> +		return ret;
> +	}
> +
> +	return __writeback_sb_inodes(sb, wb, work);
> +}

The first thing that __writeback_sb_inodes() does is create a struct
writeback_control from the wb_writeback_work. That should be done here and passed to __writeback_sb_inodes(), which should be renamed "generic_writeback_sb_inodes()". Also, all the fields in the wbc need to be initialised correctly (i.e including range_start/end). Further, a writeback implementation will need to use the generic bdi list and lock structures and so we need to pass the bdi_writeback. Similarly, if we are going to pass nr_pages, we might as well pass the entire work structure. Finally, I don't like the way the
[RFC][PATCH 1/2] Add a super operation for writeback
Hi, This is the first of four core changes we would like for Tux3. We start with a hard one and suggest a simple solution. The first patch in this series adds a new super operation to write back multiple inodes in a single call. The second patch applies to our linux-tux3 repository at git.kernel.org to demonstrate how this interface is used, and removes about 450 lines of workaround code. Traditionally, core kernel tells each mounted filesystem which dirty pages of which inodes should be flushed to disk, but unfortunately, is blissfully ignorant of filesystem-specific ordering constraints. This scheme was really invented for Ext2 and has not evolved much recently. Tux3, with its strong ordering and optimized delta commit, cannot tolerate random flushing and therefore takes responsibility for flush ordering itself. On the other hand, Tux3 has no good way to know when is the right time to flush, but core is good at that. This proposed API harmonizes those two competencies so that Tux3 and core each take care of what they are good at, and not more. The API extension consists of a new writeback hook and two helpers to be used within the hook. The hook sits about halfway down the fs-writeback stack, just after core has determined that some dirty inodes should be flushed to disk and just before it starts thinking about which inodes those should be. At that point, core calls Tux3 instead of continuing on down the usual do_writepages path. Tux3 responds by staging a new delta commit, using the new helpers to tell core which inodes were flushed versus being left dirty in cache. This is pretty much the same behavior as the traditional writeout path, but less convoluted, probably more efficient, and certainly easier to analyze.
The new writeback hook looks like: progress = sb->s_op->writeback(sb, &writeback_control, &nr_pages); This should be self-explanatory: nr_pages and progress have the semantics of existing usage in fs-writeback; writeback_control is ignored by Tux3, but that is only because Tux3 always flushes everything and does not require hints for now. We can safely assume that &wbc or equivalent is wanted here. An obvious wart is the overlap between "progress" and "nr_pages", but fs-writeback thinks that way, so it would not make much sense to improve one without improving the other. Tux3 necessarily keeps its own dirty inode list, which is an area of overlap with fs-writeback. In a perfect world, there would be just one dirty inode list per superblock, on which both fs-writeback and Tux3 would operate. That would be a deeper core change than seems appropriate right now. Potential races are a big issue with this API, which is no surprise. The fs-writeback scheme requires keeping several kinds of object in sync: tux3 dirty inode lists, fs-writeback dirty inode lists and inode dirty state. The new helpers inode_writeback_done(inode) and inode_writeback_touch(inode) take care of that while hiding internal details of the fs-writeback implementation. Tux3 calls inode_writeback_done when it has flushed an inode and marked it clean, or calls inode_writeback_touch if it intends to retain a dirty inode in cache. These have simple implementations. The former just removes a clean inode from any fs-writeback list. The latter updates the inode's dirty timestamp so that fs-writeback does not keep trying to flush it. Both these things could be done more efficiently by re-engineering fs-writeback, but we prefer to work with the existing scheme for now. Hirofumi's masterful hack nicely avoided racy removal of inodes from the writeback list by taking an internal fs-writeback lock inside filesystem code.
The new helper requires dropping i_lock inside filesystem code and retaking it in the helper, so inode redirty can race with writeback list removal. This does not seem to present a problem because filesystem code is able to enforce strict alternation of cleaning and calling the helper. As an offsetting advantage, writeback lock contention is reduced. Compared to Hirofumi's hack, the cost of this interface is one additional spinlock per inode_writeback_done, which is insignificant compared to the convoluted code path that is avoided. Regards, Daniel --- From: Daniel Phillips Subject: [PATCH] Add a super operation for writeback Add a "writeback" super operation to be called in the form: progress = sb->s_op->writeback(sb, &wbc, &pages); The filesystem is expected to flush some inodes to disk and return progress of at least 1, or if no inodes are flushed, return progress of zero. The filesystem should try to flush at least the number of pages specified in *pages, or if that is not possible, return approximately the number of pages not flushed into *pages. Within the ->writeback callback, the filesystem should call inode_writeback_done(inode) for each inode flushed (and therefore set clean) or inode_writeback_touch(inode) for any inode that will be retained dirty in cache. Signed-off-by: Daniel Phillips Signed-off-by:
[RFC][PATCH 1/2] Add a super operation for writeback
Hi, This is the first of four core changes we would like for Tux3. We start with a hard one and suggest a simple solution. The first patch in this series adds a new super operation to write back multiple inodes in a single call. The second patch applies to our linux-tux3 repository at git.kernel.org to demonstrate how this interface is used, and removes about 450 lines of workaround code. Traditionally, core kernel tells each mounted filesystem which dirty pages of which inodes should be flushed to disk, but unfortunately, is blissfully ignorant of filesystem-specific ordering constraints. This scheme was really invented for Ext2 and has not evolved much recently. Tux3, with its strong ordering and optimized delta commit, cannot tolerate random flushing and therefore takes responsibility for flush ordering itself. On the other hand, Tux3 has no good way to know when is the right time to flush, but core is good at that. This proposed API harmonizes those two competencies so that Tux3 and core each take care of what they are good at, and not more. The API extension consists of a new writeback hook and two helpers to be used within the hook. The hook sits about halfway down the fs-writeback stack, just after core has determined that some dirty inodes should be flushed to disk and just before it starts thinking about which inodes those should be. At that point, core calls Tux3 instead of continuing on down the usual do_writepages path. Tux3 responds by staging a new delta commit, using the new helpers to tell core which inodes were flushed versus being left dirty in cache. This is pretty much the same behavior as the traditional writeout path, but less convoluted, probably more efficient, and certainly easier to analyze.
The new writeback hook looks like: progress = sb->s_op->writeback(sb, &writeback_control, &nr_pages); This should be self-explanatory: nr_pages and progress have the semantics of existing usage in fs-writeback; writeback_control is ignored by Tux3, but that is only because Tux3 always flushes everything and does not require hints for now. We can safely assume that &wbc or equivalent is wanted here. An obvious wart is the overlap between "progress" and "nr_pages", but fs-writeback thinks that way, so it would not make much sense to improve one without improving the other. Tux3 necessarily keeps its own dirty inode list, which is an area of overlap with fs-writeback. In a perfect world, there would be just one dirty inode list per superblock, on which both fs-writeback and Tux3 would operate. That would be a deeper core change than seems appropriate right now. Potential races are a big issue with this API, which is no surprise. The fs-writeback scheme requires keeping several kinds of object in sync: tux3 dirty inode lists, fs-writeback dirty inode lists and inode dirty state. The new helpers inode_writeback_done(inode) and inode_writeback_touch(inode) take care of that while hiding internal details of the fs-writeback implementation. Tux3 calls inode_writeback_done when it has flushed an inode and marked it clean, or calls inode_writeback_touch if it intends to retain a dirty inode in cache. These have simple implementations. The former just removes a clean inode from any fs-writeback list. The latter updates the inode's dirty timestamp so that fs-writeback does not keep trying to flush it. Both these things could be done more efficiently by re-engineering fs-writeback, but we prefer to work with the existing scheme for now. Hirofumi's masterful hack nicely avoided racy removal of inodes from the writeback list by taking an internal fs-writeback lock inside filesystem code.
The new helper requires dropping i_lock inside filesystem code and retaking it in the helper, so inode redirty can race with writeback list removal. This does not seem to present a problem because filesystem code is able to enforce strict alternation of cleaning and calling the helper. As an offsetting advantage, writeback lock contention is reduced. Compared to Hirofumi's hack, the cost of this interface is one additional spinlock per inode_writeback_done, which is insignificant compared to the convoluted code path that is avoided. Regards, Daniel --- From: Daniel Phillips dan...@tux3.org Subject: [PATCH] Add a super operation for writeback Add a "writeback" super operation to be called in the form: progress = sb->s_op->writeback(sb, &wbc, &pages); The filesystem is expected to flush some inodes to disk and return progress of at least 1, or if no inodes are flushed, return progress of zero. The filesystem should try to flush at least the number of pages specified in *pages, or if that is not possible, return approximately the number of pages not flushed into *pages. Within the ->writeback callback, the filesystem should call inode_writeback_done(inode) for each inode flushed (and therefore set clean) or inode_writeback_touch(inode) for any inode that will be retained dirty in cache. Signed-off-by: Daniel
Re: [RFC][PATCH 1/2] Add a super operation for writeback
On Sun, Jun 01, 2014 at 02:41:02PM -0700, Daniel Phillips wrote:

---
From: Daniel Phillips <dan...@tux3.org>
Subject: [PATCH] Add a super operation for writeback

Add a writeback super operation to be called in the form:

	progress = sb->s_op->writeback(sb, wbc, pages);

The filesystem is expected to flush some inodes to disk and return progress of at least 1, or if no inodes are flushed, return progress of zero. The filesystem should try to flush at least the number of pages specified in *pages, or if that is not possible, return approximately the number of pages not flushed into *pages.

Within the ->writeback callback, the filesystem should call inode_writeback_done(inode) for each inode flushed (and therefore set clean) or inode_writeback_touch(inode) for any inode that will be retained dirty in cache.

Signed-off-by: Daniel Phillips <dan...@tux3.org>
Signed-off-by: OGAWA Hirofumi <hirof...@mail.parknet.co.jp>
---
 fs/fs-writeback.c  | 59 ++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/fs.h |  4 +++
 2 files changed, 60 insertions(+), 3 deletions(-)

diff -puN fs/fs-writeback.c~core-writeback fs/fs-writeback.c
--- linux-tux3/fs/fs-writeback.c~core-writeback	2014-05-31 06:43:19.416031712 +0900
+++ linux-tux3-hirofumi/fs/fs-writeback.c	2014-05-31 06:44:46.087904373 +0900
@@ -192,6 +192,35 @@ void inode_wb_list_del(struct inode *ino
 }

 /*
+ * Remove inode from writeback list if clean.
+ */
+void inode_writeback_done(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.list_lock);
+	spin_lock(&inode->i_lock);
+	if (!(inode->i_state & I_DIRTY))
+		list_del_init(&inode->i_wb_list);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_done);
+
+/*
+ * Add inode to writeback dirty list with current time.
+ */
+void inode_writeback_touch(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
+	spin_lock(&bdi->wb.list_lock);
+	inode->dirtied_when = jiffies;
+	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+	spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_touch);

You should be able to use redirty_tail() for this.

Hmmm - this is using the wb dirty lists and locks, but you don't pass the wb structure to the writeback callout? That seems wrong to me - why would you bother manipulating these lists if you aren't using them to track dirty inodes in the first place?

+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -593,9 +622,9 @@ static long writeback_chunk_size(struct
  *
  * Return the number of pages and/or inodes written.
  */
-static long writeback_sb_inodes(struct super_block *sb,
-				struct bdi_writeback *wb,
-				struct wb_writeback_work *work)
+static long __writeback_sb_inodes(struct super_block *sb,
+				  struct bdi_writeback *wb,
+				  struct wb_writeback_work *work)
 {
	struct writeback_control wbc = {
		.sync_mode = work->sync_mode,
@@ -710,6 +739,30 @@ static long writeback_sb_inodes(struct s
 	return wrote;
 }

+static long writeback_sb_inodes(struct super_block *sb,
+				struct bdi_writeback *wb,
+				struct wb_writeback_work *work)
+{
+	if (sb->s_op->writeback) {
+		struct writeback_control wbc = {
+			.sync_mode		= work->sync_mode,
+			.tagged_writepages	= work->tagged_writepages,
+			.for_kupdate		= work->for_kupdate,
+			.for_background		= work->for_background,
+			.for_sync		= work->for_sync,
+			.range_cyclic		= work->range_cyclic,
+		};
+		long ret;
+
+		spin_unlock(&wb->list_lock);
+		ret = sb->s_op->writeback(sb, &wbc, &work->nr_pages);
+		spin_lock(&wb->list_lock);
+		return ret;
+	}
+
+	return __writeback_sb_inodes(sb, wb, work);
+}

The first thing that __writeback_sb_inodes() does is create a struct writeback_control from the wb_writeback_work.
That should be done here and passed to __writeback_sb_inodes(), which should be renamed generic_writeback_sb_inodes(). Also, all the fields in the wbc need to be initialised correctly (i.e. including range_start/end).

Further, a writeback implementation will need to use the generic bdi list and lock structures and so we need to pass the bdi_writeback. Similarly, if we are going to pass nr_pages, we might as well pass the entire work structure.

Finally, I don't like the way the wb->list_lock is treated