Re: Write-back from inside FS - need suggestions
On Sat, 29 September 2007 13:00:11 -0700, Andrew Morton wrote:
> err, it's basically an open-coded mutex via which one thread can get
> exclusive access to some parts of an inode's internals. Perhaps it could
> literally be replaced with a mutex. Exactly what I_LOCK protects has not
> been documented afaik. That would need to be reverse engineered :(

I believe you actually have some documentation in your tree. At least the
behaviour after my I_SYNC patch has been documented with that patch.

Jörn

--
"Error protection by error detection and correction."
-- from a university class
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Write-back from inside FS - need suggestions
Andrew, thank you for this help.

Andrew Morton wrote:
> writepage under i_mutex is commonly done on the
> sys_write->alloc_pages->direct-reclaim path. It absolutely has to work,
> and you'll be fine relying upon that.
>
> However ->prepare_write() is called with the page locked, so you are
> vulnerable to deadlocks there. I suspect you got lucky because the page
> which you're holding the lock on is not dirty in your testing. But in
> other applications (eg: 1k blocksize ext2/3/4) the page _can_ be dirty
> while we're trying to allocate more blocks for it, in which case the
> lock_page() deadlock can happen.
>
> One approach might be to add another flag to writeback_control telling
> write_cache_pages() to skip locked pages. Or even put a page* into
> writeback_control and change it to skip *this* page.

Well, in my case I force write-back from prepare_write _only_ when the
page is clean, because if it is dirty, it was (pessimistically) accounted
earlier already, and changing a dirty page does not change anything on the
media. So I call write-back only for _new_ pages, which are always clean.

I use the PagePrivate() flag to mark pages as dirty, and clear it in
writepage(). I need to keep my own accounting of the number of dirty pages
at any point of time. I found that I cannot use the PageDirty() flag,
because it is cleared before my ->writepage() is called, so I could not
decrement my dirty_pg_counter, and I'd have to muck with the radix tree's
tags, which I do not really want to do, so I just use the private flag.

So in prepare_write() I only call write-back if PagePrivate() is unset,
which guarantees me that the page is clean, I presume. So for my purposes
the patch below _looks_ ok. I'm saying "looks" because I tested it just a
little.

>> This means that if I'm in the middle of an operation on ino #X, I own
>> its i_mutex, but not I_LOCK, I can be preempted and ->writepage can be
>> called for a dirty page belonging to this inode #X?
>
> yup. Or another CPU can do the same.

Ok, thank you!
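The PagePrivate-based accounting scheme described above can be sketched as a small userspace model (this is an illustration, not kernel code; `struct page` and `dirty_pg_counter` here are simplified stand-ins for the real structures):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for struct page: only the flag we care about. */
struct page {
	bool private;		/* models the PagePrivate() flag */
};

static long dirty_pg_counter;	/* the FS's own count of dirty pages */

/*
 * Called from ->prepare_write(): account a page that is about to be
 * dirtied.  Only _new_ (clean) pages change the liability; a page that
 * already has PagePrivate set was pessimistically accounted earlier.
 * Returns true if the liability grew, i.e. a budget check is needed.
 */
static bool fs_account_page(struct page *pg)
{
	if (pg->private)
		return false;	/* already accounted, nothing to do */
	pg->private = true;
	dirty_pg_counter++;
	return true;
}

/*
 * Called from ->writepage(): the page goes out to the media, so drop
 * it from the FS's own dirty accounting and clear the private flag.
 */
static void fs_writeout_page(struct page *pg)
{
	if (pg->private) {
		pg->private = false;
		dirty_pg_counter--;
	}
}
```

Writing to an already-dirty page leaves the counter unchanged, which is exactly why PageDirty() (cleared by the VM before ->writepage() runs) cannot serve this purpose.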
I (naively) thought i_mutex is locked in ->writepage(). But now I see that
pdflush does not lock it, and readahead calls ->readpage() without i_mutex
too. They just lock the page.

>> Could you or someone please give me a hint what exactly
>> inode->i_flags & I_LOCK protects?
>
> err, it's basically an open-coded mutex via which one thread can get
> exclusive access to some parts of an inode's internals. Perhaps it could
> literally be replaced with a mutex. Exactly what I_LOCK protects has not
> been documented afaik. That would need to be reverse engineered :(

I see, thanks. There are also i_size and the i_size_write() and
i_size_read() helpers. My understanding is that i_size may be changed
without anything (i_mutex or I_LOCK) locked, hence these helpers. i_size
is read/written without them in many places, though, so the relation of
these i_size protection helpers to i_mutex/I_LOCK is unclear to me.
Ideally, it would be nice to teach lockdep to monitor I_LOCK vs i_mutex.

Below is the patch which seems to give me what I need. Just for reference.

=======================================================================

Subject: [PATCH] VFS: introduce writeback_inodes_sb()

Let file systems write back their pages and inodes when needed. Note, it
cannot be called if one of the dirty pages is locked by the caller,
otherwise it'll deadlock.

Signed-off-by: Artem Bityutskiy <[EMAIL PROTECTED]>
---
 fs/fs-writeback.c         |    8 ++++++++
 include/linux/writeback.h |    1 +
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a4b142a..17c8aaa 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -486,6 +486,14 @@ void sync_inodes_sb(struct super_block *sb, int wait)
 	spin_unlock(&inode_lock);
 }
 
+void writeback_inodes_sb(struct super_block *sb, struct writeback_control *wbc)
+{
+	spin_lock(&inode_lock);
+	sync_sb_inodes(sb, wbc);
+	spin_unlock(&inode_lock);
+}
+EXPORT_SYMBOL_GPL(writeback_inodes_sb);
+
 /*
  * Rather lame livelock avoidance.
  */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 4ef4d22..e20cd12 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -72,6 +72,7 @@ void writeback_inodes(struct writeback_control *wbc);
 void wake_up_inode(struct inode *inode);
 int inode_wait(void *);
 void sync_inodes_sb(struct super_block *, int wait);
+void writeback_inodes_sb(struct super_block *sb, struct writeback_control *wbc);
 void sync_inodes(int wait);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
--
1.5.0.6

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
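The behaviour this patch relies on (bounded write-back driven by wbc->nr_to_write) can be modelled in plain userspace C. Everything below is a simplified stand-in for the kernel types, kept only to show the quota semantics; the kernel's sync_sb_inodes() does far more (locking, livelock avoidance, skipping locked/in-flight inodes):

```c
#include <assert.h>

/* Simplified stand-ins for the kernel structures. */
struct writeback_control {
	long nr_to_write;	/* page quota for this write-back pass */
};

struct inode {
	int dirty_pages;
};

struct super_block {
	struct inode *inodes;
	int n_inodes;
};

/*
 * Model of the new helper: walk the super block's dirty inodes and
 * write pages out until the caller's quota is exhausted.  This is the
 * property the mail relies on: set nr_to_write = 20 and the pass
 * flushes roughly that many pages and returns.
 */
static void writeback_inodes_sb(struct super_block *sb,
				struct writeback_control *wbc)
{
	for (int i = 0; i < sb->n_inodes && wbc->nr_to_write > 0; i++) {
		struct inode *ino = &sb->inodes[i];

		while (ino->dirty_pages > 0 && wbc->nr_to_write > 0) {
			ino->dirty_pages--;	/* "write" one page */
			wbc->nr_to_write--;
		}
	}
}
```

The caller never names a specific inode, which is why the pass can be started while holding one inode's i_mutex: the oldest dirty pages of *other* inodes are written first (deadlocking only if one of the dirty pages is locked by the caller, as the patch comment warns).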
Re: Write-back from inside FS - need suggestions
On Sat, 29 Sep 2007 22:10:42 +0300 Artem Bityutskiy <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > I'd have thought that a suitable wrapper around a suitably-modified
> > sync_sb_inodes() would be appropriate for both filesystems?
>
> Ok, I've modified sync_inodes_sb() so that I can pass it my own wbc,
> where I set wbc->nr_to_write = 20. It gives me _exactly_ what I want.
> It just flushes a bit more than 20 pages and returns. I use
> WB_SYNC_ALL. Great!

ok..

> Now I would like to understand why it works :-) To my surprise, it
> does not deadlock! I call it from ->prepare_write where I'm holding
> i_mutex, and it works just fine. It calls ->writepage() without trying
> to lock i_mutex! This looks like some witchcraft to me.

writepage under i_mutex is commonly done on the
sys_write->alloc_pages->direct-reclaim path. It absolutely has to work,
and you'll be fine relying upon that.

However ->prepare_write() is called with the page locked, so you are
vulnerable to deadlocks there. I suspect you got lucky because the page
which you're holding the lock on is not dirty in your testing. But in
other applications (eg: 1k blocksize ext2/3/4) the page _can_ be dirty
while we're trying to allocate more blocks for it, in which case the
lock_page() deadlock can happen.

One approach might be to add another flag to writeback_control telling
write_cache_pages() to skip locked pages. Or even put a page* into
writeback_control and change it to skip *this* page.

> This means that if I'm in the middle of an operation on ino #X, I own
> its i_mutex, but not I_LOCK, I can be preempted and ->writepage can
> be called for a dirty page belonging to this inode #X?

yup. Or another CPU can do the same.

> I haven't seen
> this in practice and I do not believe this may happen. Why?

Perhaps a heavier workload is needed.

There is code in the VFS which tries to prevent lots of CPUs from getting
in and fighting with each other (see writeback_acquire()) which will have
the effect of serialising things to some extent. But writeback_acquire()
is causing scalability problems on monster IO systems and might be
removed, and it is only a partial thing - there are other ways in which
concurrent writeout can occur (fsync, sync, page reclaim, ...)

> Could you or someone please give me a hint what exactly
> inode->i_flags & I_LOCK protects?

err, it's basically an open-coded mutex via which one thread can get
exclusive access to some parts of an inode's internals. Perhaps it could
literally be replaced with a mutex. Exactly what I_LOCK protects has not
been documented afaik. That would need to be reverse engineered :(

> What is its relationship to i_mutex?

On a regular file i_mutex is used mainly for protection of the data part
of the file, although it gets borrowed for other things, like protecting
f_pos of all the inode's file*'s. I_LOCK is used to serialise access to a
few parts of the inode itself.
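The "open-coded mutex" pattern Andrew describes (a state bit protected by a spinlock, plus a wait queue for sleepers) can be illustrated with a userspace pthreads model. This is only a sketch of the general pattern, not the kernel's actual I_LOCK implementation; the mutex/condvar pair stands in for inode_lock and the inode wait queue:

```c
#include <assert.h>
#include <pthread.h>

/* Userspace model of an open-coded mutex: a flag bit in a state word,
 * with sleepers parked on a wait queue until the holder clears it. */

#define I_LOCK 0x01UL

struct fake_inode {
	unsigned long i_state;
	pthread_mutex_t lock;	/* stands in for the global inode_lock */
	pthread_cond_t wq;	/* stands in for the inode's wait queue */
};

/* Take exclusive access: sleep until the current holder clears I_LOCK,
 * then set the bit ourselves (compare wait_on_inode() in the kernel). */
static void lock_inode(struct fake_inode *ino)
{
	pthread_mutex_lock(&ino->lock);
	while (ino->i_state & I_LOCK)
		pthread_cond_wait(&ino->wq, &ino->lock);
	ino->i_state |= I_LOCK;
	pthread_mutex_unlock(&ino->lock);
}

/* Drop exclusive access and wake any waiters (compare wake_up_inode()). */
static void unlock_inode(struct fake_inode *ino)
{
	pthread_mutex_lock(&ino->lock);
	ino->i_state &= ~I_LOCK;
	pthread_cond_broadcast(&ino->wq);
	pthread_mutex_unlock(&ino->lock);
}
```

This is exactly the structure that "could literally be replaced with a mutex": the flag plus wait queue together behave like one, which is what the later I_SYNC work formalised.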
Re: Write-back from inside FS - need suggestions
Andrew Morton wrote:
> I'd have thought that a suitable wrapper around a suitably-modified
> sync_sb_inodes() would be appropriate for both filesystems?

Ok, I've modified sync_inodes_sb() so that I can pass it my own wbc, where
I set wbc->nr_to_write = 20. It gives me _exactly_ what I want. It just
flushes a bit more than 20 pages and returns. I use WB_SYNC_ALL. Great!

Now I would like to understand why it works :-) To my surprise, it does
not deadlock! I call it from ->prepare_write where I'm holding i_mutex,
and it works just fine. It calls ->writepage() without trying to lock
i_mutex! This looks like some witchcraft to me.

This means that if I'm in the middle of an operation on ino #X, I own its
i_mutex, but not I_LOCK, I can be preempted and ->writepage can be called
for a dirty page belonging to this inode #X? I haven't seen this in
practice and I do not believe this may happen. Why?

Could you or someone please give me a hint what exactly
inode->i_flags & I_LOCK protects? What is its relationship to i_mutex?

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
Re: Write-back from inside FS - need suggestions
Andrew Morton wrote:
> I'd have thought that a suitable wrapper around a suitably-modified
> sync_sb_inodes() would be appropriate for both filesystems?

Hmm, OK, I'll try to do this. Thanks.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
Re: Write-back from inside FS - need suggestions
On Sat, 29 Sep 2007 12:56:55 +0300 Artem Bityutskiy <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > This is precisely the problem which needs to be solved for delayed
> > allocation on ext2/3/4. This is because it is infeasible to work out
> > how much disk space an ext2 pagecache page will take to write out (it
> > will require zero to three indirect blocks as well).
> >
> > When I did delalloc-for-ext2, umm, six years ago I did
> > maximally-pessimistic in-memory space accounting and I think I just
> > ran a superblock-wide sync operation when ENOSPC was about to happen.
> > That caused all the pessimistic reservations to be collapsed into real
> > ones, releasing space. So as the disk neared a real ENOSPC, the syncs
> > became more frequent. But the overhead was small.
> >
> > I expect that a similar thing was done in the ext4 delayed allocation
> > patches - you should take a look at that and see what can be
> > shared/generalised/etc.
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/LATEST/broken-out/
> >
> > Although, judging by the comment in here:
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/LATEST/broken-out/ext4-delayed-allocation.patch
> >
> > + * TODO:
> > + *  MUST:
> > + *   - flush dirty pages in -ENOSPC case in order to free reserved blocks
> >
> > things need a bit more work. Hopefully that's a dead comment.
> >
> > omigod, that thing has gone and done a clone-and-own on half the VFS.
> > Anyway, I doubt if you'll be able to find a design description anyway
> > but you should spend some time picking it apart. It is the same problem..
>
> (For some reason I haven't got your answer in my mailbox, found it in
> the archives)
>
> Thank you for these pointers. I was looking at the ext4 code and haven't
> found what they do in these cases.

I don't think it's written yet. Not in those patches, at least.

> I think I need some hints to realize
> what's going on there. Our FS is so different from traditional ones
> - e.g., we do not use buffer heads, we do not have a block device
> underneath, etc, so I even doubt I can borrow anything from ext4.

Common ideas need to be found and implemented in the VFS. The ext4
patches do it all in the fs which is just wrong.

The tracking of reservations (or worst-case utilisation) is surely common
across these two implementations? Quite possibly the ENOSPC-time forced
writeback is too.

> I have the impression that I just have to implement my own list of
> inodes and my own victim-picking policies. Although I still think it
> would better be done at the VFS level, because it has all these LRU
> lists, and I'd duplicate things.

I'd have thought that a suitable wrapper around a suitably-modified
sync_sb_inodes() would be appropriate for both filesystems?
Re: Write-back from inside FS - need suggestions
Andrew Morton wrote:
> This is precisely the problem which needs to be solved for delayed
> allocation on ext2/3/4. This is because it is infeasible to work out how
> much disk space an ext2 pagecache page will take to write out (it will
> require zero to three indirect blocks as well).
>
> When I did delalloc-for-ext2, umm, six years ago I did
> maximally-pessimistic in-memory space accounting and I think I just ran
> a superblock-wide sync operation when ENOSPC was about to happen. That
> caused all the pessimistic reservations to be collapsed into real ones,
> releasing space. So as the disk neared a real ENOSPC, the syncs became
> more frequent. But the overhead was small.
>
> I expect that a similar thing was done in the ext4 delayed allocation
> patches - you should take a look at that and see what can be
> shared/generalised/etc.
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/LATEST/broken-out/
>
> Although, judging by the comment in here:
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/LATEST/broken-out/ext4-delayed-allocation.patch
>
> + * TODO:
> + *  MUST:
> + *   - flush dirty pages in -ENOSPC case in order to free reserved blocks
>
> things need a bit more work. Hopefully that's a dead comment.
>
> omigod, that thing has gone and done a clone-and-own on half the VFS.
> Anyway, I doubt if you'll be able to find a design description anyway
> but you should spend some time picking it apart. It is the same problem..

(For some reason I haven't got your answer in my mailbox, found it in the
archives.)

Thank you for these pointers. I was looking at the ext4 code and haven't
found what they do in these cases. I think I need some hints to realize
what's going on there. Our FS is so different from traditional ones -
e.g., we do not use buffer heads, we do not have a block device
underneath, etc, so I even doubt I can borrow anything from ext4.

I have the impression that I just have to implement my own list of inodes
and my own victim-picking policies. Although I still think it would better
be done at the VFS level, because it has all these LRU lists, and I'd
duplicate things.

Nevertheless, I add Teo on CC in the hope he'll give me some pointers.

--
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
Re: Write-back from inside FS - need suggestions
On Fri, 28 Sep 2007 12:16:54 +0300 Artem Bityutskiy <[EMAIL PROTECTED]> wrote: > Hi, > > we are writing a new flash FS (UBIFS) and need some advice/suggestions. > Brief FS info and the code are available at > http://www.linux-mtd.infradead.org/doc/ubifs.html. > > At any point of time we may have plenty of cached stuff which has to > be written back later to the flash media: dirty pages and dirty inodes. > This is what we call "liability" - the current set of dirty pages and > inodes UBIFS must be able to write back on demand. > > The problem is that we cannot do accurate flash space accounting due > to several reasons: > 1. Wastage - some small random amount of flash space at the ends of > eraseblocks cannot be used. > 2. Compression - we do not know how well the pages will be compressed, > so we do not know how much flash space they will consume. > > So, if our current liability is X, we do not know exactly how much > flash space (Y) it will take. All we can do is to introduce some > pessimistic, worst-case function Y = F(X). This pessimistic function > assumes that pages won't be compressible, and it assumes worst-case > wastage. In real life this is hardly going to happen, but it is possible. > The function is really bad and may lead to huge over-estimations > like 40%. > > So, if we are, say, in ->prepare_write(), we have to decide whether > there is enough flash space to write back this page later. We do not > want to fail with -ENOSPC when, say, pdflush writes the page back. So > we use our pessimistic function F(X) to decide whether we have enough > space or not. If there is plenty of flash space, F(X) says "yes", > and we just proceed. The question is what do we do if F(X) says "no"? > > If we just return -ENOSPC, the flash space utilization becomes too > poor, just because F() is really rough. We do have space in most > real-life cases. All we have to do in this case is to lessen our > liability. 
IOW, we have to flush a few dirty inodes/pages, then we'd > be able to proceed. > > So my question is: how can we flush a _few_ of the oldest dirty pages/inodes > while we are inside UBIFS (e.g., in ->prepare_write(), ->mkdir(), > ->link(), etc)? > > I failed to find VFS calls which would do this. Stuff like > sync_sb_inodes() is not exactly what we need. Should we implement > a similar function? Since we have to call it from inside UBIFS, which > means we are holding i_mutex and the inode is locked, the function > has to be smart enough not to wait on this inode, but wait on other > inodes if needed. > > A solution like kicking pdflush to do the job and waiting on a waitqueue > would probably also work, but I'd prefer to do this from the context > of the current task. > > Should we have our own list of inodes and call write_inode_now() for > dirty ones? But I'd prefer to let the VFS pick the oldest victims. > > So I'm asking for ideas which would work and be acceptable by the > community later. > This is precisely the problem which needs to be solved for delayed allocation on ext2/3/4. This is because it is infeasible to work out how much disk space an ext2 pagecache page will take to write out (it will require zero to three indirect blocks as well). When I did delalloc-for-ext2, umm, six years ago I did maximally-pessimistic in-memory space accounting and I think I just ran a superblock-wide sync operation when ENOSPC was about to happen. That caused all the pessimistic reservations to be collapsed into real ones, releasing space. So as the disk neared a real ENOSPC, the syncs became more frequent. But the overhead was small. I expect that a similar thing was done in the ext4 delayed allocation patches - you should take a look at that and see what can be shared/generalised/etc. 
ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/LATEST/broken-out/ Although, judging by the comment in here: ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/ext4-patches/LATEST/broken-out/ext4-delayed-allocation.patch + * TODO: + * MUST: + * - flush dirty pages in -ENOSPC case in order to free reserved blocks things need a bit more work. Hopefully that's a dead comment. omigod, that thing has gone and done a clone-and-own on half the VFS. Anyway, I doubt if you'll be able to find a design description anyway but you should spend some time picking it apart. It is the same problem..
Write-back from inside FS - need suggestions
Hi, we are writing a new flash FS (UBIFS) and need some advice/suggestions. Brief FS info and the code are available at http://www.linux-mtd.infradead.org/doc/ubifs.html. At any point of time we may have plenty of cached stuff which has to be written back later to the flash media: dirty pages and dirty inodes. This is what we call "liability" - the current set of dirty pages and inodes UBIFS must be able to write back on demand. The problem is that we cannot do accurate flash space accounting due to several reasons: 1. Wastage - some small random amount of flash space at the ends of eraseblocks cannot be used. 2. Compression - we do not know how well the pages will be compressed, so we do not know how much flash space they will consume. So, if our current liability is X, we do not know exactly how much flash space (Y) it will take. All we can do is to introduce some pessimistic, worst-case function Y = F(X). This pessimistic function assumes that pages won't be compressible, and it assumes worst-case wastage. In real life this is hardly going to happen, but it is possible. The function is really bad and may lead to huge over-estimations like 40%. So, if we are, say, in ->prepare_write(), we have to decide whether there is enough flash space to write back this page later. We do not want to fail with -ENOSPC when, say, pdflush writes the page back. So we use our pessimistic function F(X) to decide whether we have enough space or not. If there is plenty of flash space, F(X) says "yes", and we just proceed. The question is what do we do if F(X) says "no"? If we just return -ENOSPC, the flash space utilization becomes too poor, just because F() is really rough. We do have space in most real-life cases. All we have to do in this case is to lessen our liability. IOW, we have to flush a few dirty inodes/pages, then we'd be able to proceed. 
So my question is: how can we flush a _few_ of the oldest dirty pages/inodes while we are inside UBIFS (e.g., in ->prepare_write(), ->mkdir(), ->link(), etc)? I failed to find VFS calls which would do this. Stuff like sync_sb_inodes() is not exactly what we need. Should we implement a similar function? Since we have to call it from inside UBIFS, which means we are holding i_mutex and the inode is locked, the function has to be smart enough not to wait on this inode, but wait on other inodes if needed. A solution like kicking pdflush to do the job and waiting on a waitqueue would probably also work, but I'd prefer to do this from the context of the current task. Should we have our own list of inodes and call write_inode_now() for dirty ones? But I'd prefer to let the VFS pick the oldest victims. So I'm asking for ideas which would work and be acceptable by the community later. Thanks! -- Best Regards, Artem Bityutskiy (Артём Битюцкий)