Re: [jffs2] [rfc] fix write deadlock regression
On Sun, 2007-09-02 at 16:17 +0200, Nick Piggin wrote:
> On Sun, Sep 02, 2007 at 02:48:04PM +0100, David Woodhouse wrote:
> > On Sun, 2007-09-02 at 15:20 +0200, Nick Piggin wrote:
> > > OK, but then hasn't the patch just made the deadlock harder to hit,
> > > or is there some invariant that says that readpage() will never be
> > > invoked if gc was invoked on the same page as we're commit_write()ing?
> >
> > > The Q/A comments aren't very sure about this. I guess from the look
> > > of it, prepare_write/commit_write make sure the page will be uptodate
> > > by the start of commit_write,
> >
> > That's the intention, yes.
> >
> > > and you avoid GCing the page in
> > > prepare_write because your new page won't have any nodes allocated
> > > yet that can possibly be GCed?
> >
> > We _might_ GC the page -- it might not be a new page; we might be
> > overwriting it. But it's fine if we do. Actually it's slightly
> > suboptimal because we'll write out the same data twice -- once in GC and
> > then immediately afterward in the write which we were making space for.
>
> But doesn't GC only happen in prepare_write in the case that the
> i_size is being extended into a new page?

Ah, yes. I was thinking of commit_write, and it had temporarily escaped
my notice that we also write in prepare_write, to extend the file.

So yes, you're probably right that it doesn't matter; in any GC
triggered from _prepare_write_ we avoid GCing the page in question
because it by definition doesn't exist.

-- 
dwmw2

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [jffs2] [rfc] fix write deadlock regression
On Sun, Sep 02, 2007 at 02:48:04PM +0100, David Woodhouse wrote:
> On Sun, 2007-09-02 at 15:20 +0200, Nick Piggin wrote:
> > OK, but then hasn't the patch just made the deadlock harder to hit,
> > or is there some invariant that says that readpage() will never be
> > invoked if gc was invoked on the same page as we're commit_write()ing?
>
> > The Q/A comments aren't very sure about this. I guess from the look
> > of it, prepare_write/commit_write make sure the page will be uptodate
> > by the start of commit_write,
>
> That's the intention, yes.
>
> > and you avoid GCing the page in
> > prepare_write because your new page won't have any nodes allocated
> > yet that can possibly be GCed?
>
> We _might_ GC the page -- it might not be a new page; we might be
> overwriting it. But it's fine if we do. Actually it's slightly
> suboptimal because we'll write out the same data twice -- once in GC and
> then immediately afterward in the write which we were making space for.

But doesn't GC only happen in prepare_write in the case that the
i_size is being extended into a new page? If you GC the page in
prepare_write (when it may be potentially !uptodate), then I'm sure
you would get a deadlock when read_cache_page finds it non-uptodate
and locks it for readpage().

> But that's not the end of the world, and it's not very common.
>
> > BTW. with write_begin/write_end, you get to control the page lock,
> > so for example if the readpage in prepare_write for partial writes
> > is *only* for the purpose of avoiding this deadlock later, you
> > could possibly avoid the RMW with the new aops. Maybe it would
> > help you with data nodes crossing page boundaries too...
>
> I'll look at that; thanks.

OK. The patches are in -mm now, but could get in as early as 2.6.24.
If you have any suggestions about the form of the APIs, it would be
good to hear them.

> > OK, thanks for looking at it. If you'd care to pass it on to Linus
> > before he releases 2.6.23 in random() % X days time... ;)
>
> Not before the Kernel Summit now, I suspect. But yes, I'll do that later
> today or in the morning (the linuxconf.eu conference has already
> started).

Thanks,
Re: [jffs2] [rfc] fix write deadlock regression
On Sun, 2007-09-02 at 15:20 +0200, Nick Piggin wrote:
> OK, but then hasn't the patch just made the deadlock harder to hit,
> or is there some invariant that says that readpage() will never be
> invoked if gc was invoked on the same page as we're commit_write()ing?

> The Q/A comments aren't very sure about this. I guess from the look
> of it, prepare_write/commit_write make sure the page will be uptodate
> by the start of commit_write,

That's the intention, yes.

> and you avoid GCing the page in
> prepare_write because your new page won't have any nodes allocated
> yet that can possibly be GCed?

We _might_ GC the page -- it might not be a new page; we might be
overwriting it. But it's fine if we do. Actually it's slightly
suboptimal because we'll write out the same data twice -- once in GC and
then immediately afterward in the write which we were making space for.
But that's not the end of the world, and it's not very common.

> BTW. with write_begin/write_end, you get to control the page lock,
> so for example if the readpage in prepare_write for partial writes
> is *only* for the purpose of avoiding this deadlock later, you
> could possibly avoid the RMW with the new aops. Maybe it would
> help you with data nodes crossing page boundaries too...

I'll look at that; thanks.

> OK, thanks for looking at it. If you'd care to pass it on to Linus
> before he releases 2.6.23 in random() % X days time... ;)

Not before the Kernel Summit now, I suspect. But yes, I'll do that later
today or in the morning (the linuxconf.eu conference has already
started).

-- 
dwmw2
Re: [jffs2] [rfc] fix write deadlock regression
On Sun, Sep 02, 2007 at 01:13:23PM +0100, David Woodhouse wrote:
> Jason, thank you _so_ much for finding the underlying cause of this.
>
> On Sun, 2007-09-02 at 06:20 +0200, Nick Piggin wrote:
> > Hmm, thanks for that. It does sound like it is deadlocking via
> > commit_write(). OTOH, it seems like it could be using the page
> > before it is uptodate -- it _may_ only be dealing with uptodate
> > data at that point... but if so, why even read_cache_page at
> > all?
>
> jffs2_readpage() is synchronous -- there's no chance that the page won't
> be up to date. We're doing this for garbage collection -- if there are
> many log entries covering a single page of data, we want to write out a
> single replacement which covers the whole page, obsoleting the previous
> suboptimal representation of the same data.

OK, but then hasn't the patch just made the deadlock harder to hit,
or is there some invariant that says that readpage() will never be
invoked if gc was invoked on the same page as we're commit_write()ing?

The Q/A comments aren't very sure about this. I guess from the look
of it, prepare_write/commit_write make sure the page will be uptodate
by the start of commit_write, and you avoid GCing the page in
prepare_write because your new page won't have any nodes allocated
yet that can possibly be GCed?

BTW. with write_begin/write_end, you get to control the page lock,
so for example if the readpage in prepare_write for partial writes
is *only* for the purpose of avoiding this deadlock later, you
could possibly avoid the RMW with the new aops. Maybe it would
help you with data nodes crossing page boundaries too...

> > However, it is a regression. So unless David can come up with a
> > more satisfactory approach, I guess we'd have to go with your
> > patch.
>
> I think Jason's patch is the best answer for the moment. At some point
> in the very near future I want to improve the RAM usage and compression
> ratio by dropping the rule that data nodes may not cross page boundaries
> -- in which case garbage collection will need to do something other than
> reading the page using read_cache_page() and then writing it out again;
> it'll probably need to end up using its own internal buffer. But for
> now, Jason's patch looks good.

OK, thanks for looking at it. If you'd care to pass it on to Linus
before he releases 2.6.23 in random() % X days time... ;)
Re: [jffs2] [rfc] fix write deadlock regression
Jason, thank you _so_ much for finding the underlying cause of this.

On Sun, 2007-09-02 at 06:20 +0200, Nick Piggin wrote:
> Hmm, thanks for that. It does sound like it is deadlocking via
> commit_write(). OTOH, it seems like it could be using the page
> before it is uptodate -- it _may_ only be dealing with uptodate
> data at that point... but if so, why even read_cache_page at
> all?

jffs2_readpage() is synchronous -- there's no chance that the page won't
be up to date. We're doing this for garbage collection -- if there are
many log entries covering a single page of data, we want to write out a
single replacement which covers the whole page, obsoleting the previous
suboptimal representation of the same data.

> However, it is a regression. So unless David can come up with a
> more satisfactory approach, I guess we'd have to go with your
> patch.

I think Jason's patch is the best answer for the moment. At some point
in the very near future I want to improve the RAM usage and compression
ratio by dropping the rule that data nodes may not cross page boundaries
-- in which case garbage collection will need to do something other than
reading the page using read_cache_page() and then writing it out again;
it'll probably need to end up using its own internal buffer. But for
now, Jason's patch looks good.

Thanks.

-- 
dwmw2
Re: [jffs2] [rfc] fix write deadlock regression
On Sat, Sep 01, 2007 at 12:06:03PM -0700, Jason Lunz wrote:
>
> It introduced a wait to read_cache_page, as well as a
> read_cache_page_async function equivalent to the old read_cache_page
> without any callers.
>
> Switching jffs2_gc_fetch_page to read_cache_page_async for the old
> behavior makes the deadlocks go away, but maybe reintroduces the
> use-before-uptodate problem? I don't understand the mm/fs interaction
> well enough to say.
>
> Someone more knowledgeable should see if similar deadlock issues may have
> been introduced for other read_cache_page callers, including the other
> two in jffs2.

Hmm, thanks for that. It does sound like it is deadlocking via
commit_write(). OTOH, it seems like it could be using the page
before it is uptodate -- it _may_ only be dealing with uptodate
data at that point... but if so, why even read_cache_page at
all?

However, it is a regression. So unless David can come up with a
more satisfactory approach, I guess we'd have to go with your
patch.

> Signed-off-by: Jason Lunz <[EMAIL PROTECTED]>
>
> ---
>  fs/jffs2/fs.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c
> index 1d3b7a9..8bc727b 100644
> --- a/fs/jffs2/fs.c
> +++ b/fs/jffs2/fs.c
> @@ -627,7 +627,7 @@ unsigned char *jffs2_gc_fetch_page(struct jffs2_sb_info *c,
>  	struct inode *inode = OFNI_EDONI_2SFFJ(f);
>  	struct page *pg;
>
> -	pg = read_cache_page(inode->i_mapping, offset >> PAGE_CACHE_SHIFT,
> +	pg = read_cache_page_async(inode->i_mapping, offset >> PAGE_CACHE_SHIFT,
>  			     (void *)jffs2_do_readpage_unlock, inode);
>  	if (IS_ERR(pg))
>  		return (void *)pg;
[jffs2] [rfc] fix write deadlock regression
I've bisected the deadlock when many small appends are done on jffs2
down to this commit:

commit 6fe6900e1e5b6fa9e5c59aa5061f244fe3f467e2
Author: Nick Piggin <[EMAIL PROTECTED]>
Date:   Sun May 6 14:49:04 2007 -0700

    mm: make read_cache_page synchronous

    Ensure pages are uptodate after returning from read_cache_page, which
    allows us to cut out most of the filesystem-internal PageUptodate calls.

    I didn't have a great look down the call chains, but this appears to fixes 7
    possible use-before uptodate in hfs, 2 in hfsplus, 1 in jfs, a few in
    ecryptfs, 1 in jffs2, and a possible cleared data overwritten with readpage
    in block2mtd. All depending on whether the filler is async and/or can
    return with a !uptodate page.

It introduced a wait to read_cache_page, as well as a
read_cache_page_async function equivalent to the old read_cache_page
without any callers.

Switching jffs2_gc_fetch_page to read_cache_page_async for the old
behavior makes the deadlocks go away, but maybe reintroduces the
use-before-uptodate problem? I don't understand the mm/fs interaction
well enough to say.

Someone more knowledgeable should see if similar deadlock issues may have
been introduced for other read_cache_page callers, including the other
two in jffs2.

Signed-off-by: Jason Lunz <[EMAIL PROTECTED]>

---
 fs/jffs2/fs.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c
index 1d3b7a9..8bc727b 100644
--- a/fs/jffs2/fs.c
+++ b/fs/jffs2/fs.c
@@ -627,7 +627,7 @@ unsigned char *jffs2_gc_fetch_page(struct jffs2_sb_info *c,
 	struct inode *inode = OFNI_EDONI_2SFFJ(f);
 	struct page *pg;
 
-	pg = read_cache_page(inode->i_mapping, offset >> PAGE_CACHE_SHIFT,
+	pg = read_cache_page_async(inode->i_mapping, offset >> PAGE_CACHE_SHIFT,
 			     (void *)jffs2_do_readpage_unlock, inode);
 	if (IS_ERR(pg))
 		return (void *)pg;