Re: [PATCH 0/2] mm/fs: put_user_page() proposal
On Mon 09-07-18 13:00:49, Matthew Wilcox wrote:
> On Mon, Jul 09, 2018 at 09:47:40PM +0200, Jan Kara wrote:
> > On Mon 09-07-18 10:16:51, Matthew Wilcox wrote:
> > > > 2) What to do when some page is pinned but we need to do e.g.
> > > > clear_page_dirty_for_io(). After some more thinking I agree with you
> > > > that just blocking waiting for the page to unpin will create
> > > > deadlocks like:
> > >
> > > Why are we trying to writeback a page that is pinned? It's presumed to
> > > be continuously redirtied by its pinner. We can't evict it.
> >
> > So what should be the result of fsync(file), where some 'file' pages are
> > pinned, e.g. by running direct IO? If we just skip those pages, we'll lie
> > to userspace that data was committed while it was not (and it's not only
> > about data that has landed in those pages via DMA; you can have the first
> > 1k of a page modified by normal IO in parallel to DMA modifying the
> > second 1k chunk). If fsync(2) returns an error, it would be really
> > unexpected by userspace and most apps will just not handle that
> > correctly. So what else can you do than block?
>
> I was thinking about writeback, and neglected the fsync case.

For memory cleaning writeback, skipping is certainly the right thing to do
and that's what we plan to do.

> For fsync, we could copy the "current" contents of the page to a
> freshly-allocated page and write _that_ to disc? As long as we redirty
> the real page after the pin is dropped, I think we're fine.

So for the record, this technique is called "bouncing" in block layer
terminology and we do have support for it there (see block/bounce.c). It
would need some tweaking (e.g. a bio flag to indicate that some page in a
bio needs bouncing if the underlying storage requires stable pages), but
that is easy to do - we even had support for something similar some years
back, as ext3 needed it to guarantee that a metadata buffer cannot be
modified while IO is running on it.

I was actually already considering using this some time ago but then
disregarded it, as it seemed it wouldn't buy us much compared to blocking /
skipping. But now, seeing the troubles with blocking, using page bouncing
for situations where we cannot just skip page writeout does look appealing.
Thanks for suggesting it!

As a side note, I'm not 100% decided whether it is better to keep the
original page dirty all the time while it is pinned or not. I'm more
inclined to keeping it dirty all the time, as that gives mm more accurate
information about the amount of really dirty pages, prevents reclaim of the
filesystem's dirtiness / allocation tracking information (buffers or
whatever it has attached to the page), and generally avoids a "surprising"
set_page_dirty() once the page is unpinned (one less dirtying path for
filesystems to care about). OTOH it would make flusher threads always try
to write back these pages only to skip them, fsync(2) would always write
them, etc...

								Honza
--
Jan Kara
SUSE Labs, CR
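For illustration, the bouncing idea discussed above can be modelled in
plain userspace C. This is only a sketch under stated assumptions: the
`struct page` layout, `submit_write()`, and `writeout_page()` below are
hypothetical stand-ins, not the actual block/bounce.c or mm interfaces.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct page {
    unsigned char data[PAGE_SIZE];
    int pin_count;   /* > 0 while some DMA user holds the page pinned */
    int dirty;
};

/* Stand-in for handing a buffer to the block layer. */
static void submit_write(const unsigned char *buf, size_t len)
{
    (void)buf;
    (void)len;   /* a real implementation would build and submit a bio */
}

/*
 * Write out one page for fsync. If the page is pinned, its contents may
 * still change under us, so snapshot it into a freshly allocated bounce
 * buffer and write that instead; the original page stays dirty until the
 * pin is dropped. Returns the buffer actually submitted, for illustration.
 */
static unsigned char *writeout_page(struct page *page)
{
    if (page->pin_count > 0) {
        unsigned char *bounce = malloc(PAGE_SIZE);
        memcpy(bounce, page->data, PAGE_SIZE);
        submit_write(bounce, PAGE_SIZE);
        /* page->dirty stays set: the pinner may modify the page again */
        return bounce;
    }
    page->dirty = 0;
    submit_write(page->data, PAGE_SIZE);
    return page->data;
}
```

The key property is that the snapshot is stable for the duration of the
write, while the live page remains free to change and remains dirty.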
Re: [PATCH 0/2] mm/fs: put_user_page() proposal
On Mon 09-07-18 13:56:57, Jason Gunthorpe wrote:
> On Mon, Jul 09, 2018 at 09:47:40PM +0200, Jan Kara wrote:
> > On Mon 09-07-18 10:16:51, Matthew Wilcox wrote:
> > > On Mon, Jul 09, 2018 at 06:08:06PM +0200, Jan Kara wrote:
> > > > On Mon 09-07-18 18:49:37, Nicholas Piggin wrote:
> > > > > The problem with blocking in clear_page_dirty_for_io is that the fs
> > > > > is holding the page lock (or locks) and possibly others too. If you
> > > > > expect to have a bunch of long term references hanging around on
> > > > > the page, then there will be hangs and deadlocks everywhere. And if
> > > > > you do not have such long term references, then page lock (or some
> > > > > similar lock bit) for the duration of the DMA should be about
> > > > > enough?
> > > >
> > > > There are two separate questions:
> > > >
> > > > 1) How to identify pages pinned for DMA? We have no bit in struct
> > > > page to use and we cannot reuse the page lock, as that immediately
> > > > creates lock inversions, e.g. in the direct IO code (which could be
> > > > fixed, but then good luck with auditing all the other GUP users).
> > > > Matthew had an idea and John implemented it based on removing the
> > > > page from the LRU and using that space in struct page. So we at
> > > > least have a way to identify pages that are pinned and can track
> > > > their pin count.
> > > >
> > > > 2) What to do when some page is pinned but we need to do e.g.
> > > > clear_page_dirty_for_io(). After some more thinking I agree with you
> > > > that just blocking waiting for the page to unpin will create
> > > > deadlocks like:
> > >
> > > Why are we trying to writeback a page that is pinned? It's presumed to
> > > be continuously redirtied by its pinner. We can't evict it.
> >
> > So what should be the result of fsync(file), where some 'file' pages are
> > pinned, e.g. by running direct IO? If we just skip those pages, we'll
> > lie to userspace that data was committed while it was not (and it's not
> > only about data that has landed in those pages via DMA; you can have the
> > first 1k of a page modified by normal IO in parallel to DMA modifying
> > the second 1k chunk). If fsync(2) returns an error, it would be really
> > unexpected by userspace and most apps will just not handle that
> > correctly. So what else can you do than block?
>
> I think as a userspace I would expect the 'current content' to be
> flushed without waiting..

Yes, but the problem is we cannot generally write out a page whose contents
are possibly changing (e.g. RAID5 checksums would then be wrong). But maybe
using bounce pages (and keeping the original page still dirty) in such a
case would be worth it - originally I thought using bounce pages would not
buy us much, but now, seeing the problems with blocking in more detail,
maybe they are worth the trouble after all...

> If you block fsync() then anyone using a RDMA MR with it will just
> deadlock. What happens if two processes open the same file and
> one makes a MR and the other calls fsync()? Sounds bad.

Yes, that's one of the reasons why we were discussing revoke mechanisms for
long term pins. But with bounce pages we could possibly avoid that (except
for cases like DAX + truncate, where it's really unavoidable; but there
it's new functionality, so mandating revoke and returning an error
otherwise is fine, I guess).

								Honza
--
Jan Kara
SUSE Labs, CR
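Why a page that changes under IO is a problem can be seen with a toy XOR
"checksum", loosely standing in for RAID5 parity. The function and buffers
here are purely illustrative, not the md/RAID5 code: if the live page is
modified after parity is computed but before the write completes, the
stored parity no longer matches the stored data, whereas parity over a
stable bounce copy stays consistent with what was actually written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy XOR checksum, loosely standing in for RAID5 parity. */
static unsigned char parity(const unsigned char *buf, size_t len)
{
    unsigned char p = 0;

    for (size_t i = 0; i < len; i++)
        p ^= buf[i];
    return p;
}
```

A concurrent modification of even one byte invalidates the parity computed
over the earlier contents, which is exactly the "stable pages" requirement
the bio bouncing flag would address.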
Re: [PATCH 0/2] mm/fs: put_user_page() proposal
On Mon, Jul 09, 2018 at 09:47:40PM +0200, Jan Kara wrote:
> On Mon 09-07-18 10:16:51, Matthew Wilcox wrote:
> > > 2) What to do when some page is pinned but we need to do e.g.
> > > clear_page_dirty_for_io(). After some more thinking I agree with you
> > > that just blocking waiting for page to unpin will create deadlocks
> > > like:
> >
> > Why are we trying to writeback a page that is pinned? It's presumed to
> > be continuously redirtied by its pinner. We can't evict it.
>
> So what should be a result of fsync(file), where some 'file' pages are
> pinned e.g. by running direct IO? If we just skip those pages, we'll lie
> to userspace that data was committed while it was not (and it's not only
> about data that has landed in those pages via DMA, you can have first 1k
> of a page modified by normal IO in parallel to DMA modifying second 1k
> chunk). If fsync(2) returns error, it would be really unexpected by
> userspace and most apps will just not handle that correctly. So what else
> can you do than block?

I was thinking about writeback, and neglected the fsync case.

For fsync, we could copy the "current" contents of the page to a
freshly-allocated page and write _that_ to disc? As long as we redirty the
real page after the pin is dropped, I think we're fine.
Re: [PATCH 0/2] mm/fs: put_user_page() proposal
On Mon, Jul 09, 2018 at 09:47:40PM +0200, Jan Kara wrote:
> On Mon 09-07-18 10:16:51, Matthew Wilcox wrote:
> > On Mon, Jul 09, 2018 at 06:08:06PM +0200, Jan Kara wrote:
> > > On Mon 09-07-18 18:49:37, Nicholas Piggin wrote:
> > > > The problem with blocking in clear_page_dirty_for_io is that the fs
> > > > is holding the page lock (or locks) and possibly others too. If you
> > > > expect to have a bunch of long term references hanging around on the
> > > > page, then there will be hangs and deadlocks everywhere. And if you
> > > > do not have such long term references, then page lock (or some
> > > > similar lock bit) for the duration of the DMA should be about
> > > > enough?
> > >
> > > There are two separate questions:
> > >
> > > 1) How to identify pages pinned for DMA? We have no bit in struct page
> > > to use and we cannot reuse the page lock, as that immediately creates
> > > lock inversions, e.g. in the direct IO code (which could be fixed, but
> > > then good luck with auditing all the other GUP users). Matthew had an
> > > idea and John implemented it based on removing the page from the LRU
> > > and using that space in struct page. So we at least have a way to
> > > identify pages that are pinned and can track their pin count.
> > >
> > > 2) What to do when some page is pinned but we need to do e.g.
> > > clear_page_dirty_for_io(). After some more thinking I agree with you
> > > that just blocking waiting for the page to unpin will create deadlocks
> > > like:
> >
> > Why are we trying to writeback a page that is pinned? It's presumed to
> > be continuously redirtied by its pinner. We can't evict it.
>
> So what should be the result of fsync(file), where some 'file' pages are
> pinned, e.g. by running direct IO? If we just skip those pages, we'll lie
> to userspace that data was committed while it was not (and it's not only
> about data that has landed in those pages via DMA; you can have the first
> 1k of a page modified by normal IO in parallel to DMA modifying the
> second 1k chunk). If fsync(2) returns an error, it would be really
> unexpected by userspace and most apps will just not handle that
> correctly. So what else can you do than block?

I think as a userspace I would expect the 'current content' to be flushed
without waiting..

If you block fsync() then anyone using a RDMA MR with it will just
deadlock. What happens if two processes open the same file and one makes a
MR and the other calls fsync()? Sounds bad.

Jason
Re: [PATCH 0/2] mm/fs: put_user_page() proposal
On Mon 09-07-18 10:16:51, Matthew Wilcox wrote:
> On Mon, Jul 09, 2018 at 06:08:06PM +0200, Jan Kara wrote:
> > On Mon 09-07-18 18:49:37, Nicholas Piggin wrote:
> > > The problem with blocking in clear_page_dirty_for_io is that the fs is
> > > holding the page lock (or locks) and possibly others too. If you
> > > expect to have a bunch of long term references hanging around on the
> > > page, then there will be hangs and deadlocks everywhere. And if you do
> > > not have such long term references, then page lock (or some similar
> > > lock bit) for the duration of the DMA should be about enough?
> >
> > There are two separate questions:
> >
> > 1) How to identify pages pinned for DMA? We have no bit in struct page
> > to use and we cannot reuse the page lock, as that immediately creates
> > lock inversions, e.g. in the direct IO code (which could be fixed, but
> > then good luck with auditing all the other GUP users). Matthew had an
> > idea and John implemented it based on removing the page from the LRU
> > and using that space in struct page. So we at least have a way to
> > identify pages that are pinned and can track their pin count.
> >
> > 2) What to do when some page is pinned but we need to do e.g.
> > clear_page_dirty_for_io(). After some more thinking I agree with you
> > that just blocking waiting for the page to unpin will create deadlocks
> > like:
>
> Why are we trying to writeback a page that is pinned? It's presumed to
> be continuously redirtied by its pinner. We can't evict it.

So what should be the result of fsync(file), where some 'file' pages are
pinned, e.g. by running direct IO? If we just skip those pages, we'll lie
to userspace that data was committed while it was not (and it's not only
about data that has landed in those pages via DMA; you can have the first
1k of a page modified by normal IO in parallel to DMA modifying the second
1k chunk). If fsync(2) returns an error, it would be really unexpected by
userspace and most apps will just not handle that correctly. So what else
can you do than block?

> > ext4_writepages()                     ext4_direct_IO_write()
> >                                         __blockdev_direct_IO()
> >                                           iov_iter_get_pages()
> >                                             - pins page
> >   handle = ext4_journal_start_with_reserve(inode, ...)
> >     - starts transaction
> >                                         ...
> >   lock_page(page)
> >   mpage_submit_page()
> >     clear_page_dirty_for_io(page) -> blocks on pin
>
> I don't think it should block. It should fail.

See above...

								Honza
--
Jan Kara
SUSE Labs, CR
Re: [PATCH 0/2] mm/fs: put_user_page() proposal
On Mon, Jul 09, 2018 at 06:08:06PM +0200, Jan Kara wrote:
> On Mon 09-07-18 18:49:37, Nicholas Piggin wrote:
> > The problem with blocking in clear_page_dirty_for_io is that the fs is
> > holding the page lock (or locks) and possibly others too. If you
> > expect to have a bunch of long term references hanging around on the
> > page, then there will be hangs and deadlocks everywhere. And if you do
> > not have such long term references, then page lock (or some similar
> > lock bit) for the duration of the DMA should be about enough?
>
> There are two separate questions:
>
> 1) How to identify pages pinned for DMA? We have no bit in struct page to
> use and we cannot reuse page lock as that immediately creates lock
> inversions e.g. in direct IO code (which could be fixed but then good
> luck with auditing all the other GUP users). Matthew had an idea and John
> implemented it based on removing page from LRU and using that space in
> struct page. So we at least have a way to identify pages that are pinned
> and can track their pin count.
>
> 2) What to do when some page is pinned but we need to do e.g.
> clear_page_dirty_for_io(). After some more thinking I agree with you that
> just blocking waiting for page to unpin will create deadlocks like:

Why are we trying to writeback a page that is pinned? It's presumed to be
continuously redirtied by its pinner. We can't evict it.

> ext4_writepages()                     ext4_direct_IO_write()
>                                         __blockdev_direct_IO()
>                                           iov_iter_get_pages()
>                                             - pins page
>   handle = ext4_journal_start_with_reserve(inode, ...)
>     - starts transaction
>                                         ...
>   lock_page(page)
>   mpage_submit_page()
>     clear_page_dirty_for_io(page) -> blocks on pin

I don't think it should block. It should fail.
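The "fail instead of block" behaviour can be sketched with a minimal
userspace model. The `struct page` and the function name below are
hypothetical simplifications for illustration, not the real mm/ code:
instead of sleeping until the pinner drops its reference (which deadlocks
if the unpin path needs locks the caller already holds), the call reports
failure and lets the caller skip or bounce the page.

```c
#include <assert.h>
#include <errno.h>

struct page {
    int pin_count;   /* > 0 while pinned, e.g. by get_user_pages() */
    int dirty;
};

/* Hypothetical non-blocking variant of clear_page_dirty_for_io(). */
static int clear_page_dirty_for_io_nonblock(struct page *page)
{
    if (page->pin_count > 0)
        return -EBUSY;   /* caller must skip the page or bounce it */
    page->dirty = 0;     /* page will be written out by the caller */
    return 0;
}
```

The caller then decides per page: skip it for memory-cleaning writeback,
or fall back to a bounce copy for fsync-style data-integrity writes.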
Re: [PATCH 0/2] mm/fs: put_user_page() proposal
Hi,

On Mon 09-07-18 01:05:52, john.hubb...@gmail.com wrote:
> From: John Hubbard
>
> With respect to tracking get_user_pages*() pages with page->dma_pinned*
> fields [1], I spent a few days retrofitting most of the get_user_pages*()
> call sites, by adding calls to a new put_user_page() function, in place
> of put_page(), where appropriate. This will work, but it's a large
> effort.
>
> Design note: I didn't see anything that hinted at a way to fix this
> problem without actually changing all of the get_user_pages*() call
> sites, so I think it's reasonable to start with that.

Agreed.

> Anyway, it's still incomplete, but because this is a large, tree-wide
> change (that will take some time and testing), I'd like to propose a
> plan, before spamming zillions of people with put_user_page() conversion
> patches. So I picked out the first two patches to show where this is
> going.
>
> Proposed steps:
>
> Step 1:
>
> Start with the patches here, then continue with...dozens more.
> This will eventually convert all of the call sites to use
> put_user_page(). This is easy in some places, but complex in others,
> such as:
>
>     -- drivers/gpu/drm/amd
>     -- bio
>     -- fuse
>     -- cifs
>     -- anything from:
>        git grep iov_iter_get_pages | cut -f1 -d ':' | sort | uniq
>
> The easy ones can be grouped into a single patchset, perhaps, and the
> complex ones probably each need a patchset, in order to get the in-depth
> review they'll need.

Agreed.

> Furthermore, some of these areas I hope to attract some help on, once
> this starts going.
>
> Step 2:
>
> In parallel, tidy up the core patchset that was discussed in [1] (version
> 2 has already been reviewed, so I know what to do), and get it perfected
> and reviewed. Don't apply it until step 1 is all done, though.
>
> Step 3:
>
> Activate refcounting of dma-pinned pages (essentially, patch #5 of [1]),
> but don't use it yet. Place a few WARN_ON_ONCE calls to start mopping up
> any missed call sites.
>
> Step 4:
>
> After some soak time, actually connect it up (patch #6 of [1]) and start
> taking action based on the new page->dma_pinned* fields.
>
> [1] https://www.spinics.net/lists/linux-mm/msg156409.html
>
> or, the same thread on LKML if it's working for you:
>
> https://lkml.org/lkml/2018/7/4/368

Yeah, but as Nick pointed out, we have some more work to do in step 4 to
avoid deadlocks. Still, there's a lot of work to do on which the direction
of progress is clear :).

								Honza
--
Jan Kara
SUSE Labs, CR
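The conversion the series aims for can be illustrated with a toy model of
the pin accounting. Only the `put_user_page()` name and the
`page->dma_pinned*` field naming come from the proposal; the helper names
and field layout below are made up for this sketch.

```c
#include <assert.h>

struct page {
    int refcount;          /* elevated by get_user_pages() */
    int dma_pinned_count;  /* pin tracking from the proposal (illustrative) */
};

/* get_user_pages()-style acquire: takes a reference and records the pin. */
static void get_user_page_model(struct page *page)
{
    page->refcount++;
    page->dma_pinned_count++;
}

/*
 * Converted call sites release with put_user_page(), which undoes the pin
 * accounting, instead of a bare put_page(), which would only drop the
 * refcount and leave the pin count unbalanced.
 */
static void put_user_page_model(struct page *page)
{
    page->dma_pinned_count--;
    page->refcount--;
}

/* What clear_page_dirty_for_io() and friends would consult. */
static int page_dma_pinned(const struct page *page)
{
    return page->dma_pinned_count > 0;
}
```

This is why every get_user_pages*() call site has to be converted: a site
still using put_page() would leave `dma_pinned_count` elevated forever,
which is what the WARN_ON_ONCE calls in step 3 are meant to catch.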
Re: [PATCH 0/2] mm/fs: put_user_page() proposal
Hi, On Mon 09-07-18 01:05:52, john.hubb...@gmail.com wrote: > From: John Hubbard > > With respect to tracking get_user_pages*() pages with page->dma_pinned* > fields [1], I spent a few days retrofitting most of the get_user_pages*() > call sites, by adding calls to a new put_user_page() function, in place > of put_page(), where appropriate. This will work, but it's a large effort. > > Design note: I didn't see anything that hinted at a way to fix this > problem, without actually changing all of the get_user_pages*() call sites, > so I think it's reasonable to start with that. Agreed. > Anyway, it's still incomplete, but because this is a large, tree-wide > change (that will take some time and testing), I'd like to propose a plan, > before spamming zillions of people with put_user_page() conversion patches. > So I picked out the first two patches to show where this is going. > > Proposed steps: > > Step 1: > > Start with the patches here, then continue with...dozens more. > This will eventually convert all of the call sites to use put_user_page(). > This is easy in some places, but complex in others, such as: > > -- drivers/gpu/drm/amd > -- bio > -- fuse > -- cifs > -- anything from: >git grep iov_iter_get_pages | cut -f1 -d ':' | sort | uniq > > The easy ones can be grouped into a single patchset, perhaps, and the > complex ones probably each need a patchset, in order to get the in-depth > review they'll need. Agreed. > Furthermore, some of these areas I hope to attract some help on, once > this starts going. > > Step 2: > > In parallel, tidy up the core patchset that was discussed in [1], (version > 2 has already been reviewed, so I know what to do), and get it perfected > and reviewed. Don't apply it until step 1 is all done, though. > > Step 3: > > Activate refcounting of dma-pinned pages (essentially, patch #5, which is > [1]), but don't use it yet. Place a few WARN_ON_ONCE calls to start > mopping up any missed call sites. 
> 
> Step 4:
> 
> After some soak time, actually connect it up (patch #6 of [1]) and start
> taking action based on the new page->dma_pinned* fields.
> 
> [1] https://www.spinics.net/lists/linux-mm/msg156409.html
> 
> or, the same thread on LKML if it's working for you:
> 
> https://lkml.org/lkml/2018/7/4/368

Yeah, but as Nick pointed out, we have some more work to do in step 4 to
avoid deadlocks. Still, there's a lot of work to do for which the direction
to progress is clear :).

Honza
-- 
Jan Kara
SUSE Labs, CR
Re: [PATCH 0/2] mm/fs: put_user_page() proposal
On Mon 09-07-18 18:49:37, Nicholas Piggin wrote:
> On Mon, 9 Jul 2018 01:05:52 -0700
> john.hubb...@gmail.com wrote:
> > From: John Hubbard
> > 
> > Hi,
> > 
> > With respect to tracking get_user_pages*() pages with page->dma_pinned*
> > fields [1], I spent a few days retrofitting most of the get_user_pages*()
> > call sites, by adding calls to a new put_user_page() function, in place
> > of put_page(), where appropriate. This will work, but it's a large effort.
> > 
> > Design note: I didn't see anything that hinted at a way to fix this
> > problem, without actually changing all of the get_user_pages*() call sites,
> > so I think it's reasonable to start with that.
> > 
> > Anyway, it's still incomplete, but because this is a large, tree-wide
> > change (that will take some time and testing), I'd like to propose a plan,
> > before spamming zillions of people with put_user_page() conversion patches.
> > So I picked out the first two patches to show where this is going.
> > 
> > Proposed steps:
> > 
> > Step 1:
> > 
> > Start with the patches here, then continue with...dozens more.
> > This will eventually convert all of the call sites to use put_user_page().
> > This is easy in some places, but complex in others, such as:
> > 
> > -- drivers/gpu/drm/amd
> > -- bio
> > -- fuse
> > -- cifs
> > -- anything from:
> >    git grep iov_iter_get_pages | cut -f1 -d ':' | sort | uniq
> > 
> > The easy ones can be grouped into a single patchset, perhaps, and the
> > complex ones probably each need a patchset, in order to get the in-depth
> > review they'll need.
> > 
> > Furthermore, some of these areas I hope to attract some help on, once
> > this starts going.
> > 
> > Step 2:
> > 
> > In parallel, tidy up the core patchset that was discussed in [1], (version
> > 2 has already been reviewed, so I know what to do), and get it perfected
> > and reviewed. Don't apply it until step 1 is all done, though.
> > 
> > Step 3:
> > 
> > Activate refcounting of dma-pinned pages (essentially, patch #5, which is
> > [1]), but don't use it yet. Place a few WARN_ON_ONCE calls to start
> > mopping up any missed call sites.
> > 
> > Step 4:
> > 
> > After some soak time, actually connect it up (patch #6 of [1]) and start
> > taking action based on the new page->dma_pinned* fields.
> 
> You can use my decade old patch!
> 
> https://lkml.org/lkml/2009/2/17/113

The problem has a longer history than I thought ;)

> The problem with blocking in clear_page_dirty_for_io is that the fs is
> holding the page lock (or locks) and possibly others too. If you
> expect to have a bunch of long term references hanging around on the
> page, then there will be hangs and deadlocks everywhere. And if you do
> not have such long term references, then page lock (or some similar lock
> bit) for the duration of the DMA should be about enough?

There are two separate questions:

1) How to identify pages pinned for DMA? We have no bit in struct page to
use and we cannot reuse the page lock as that immediately creates lock
inversions e.g. in direct IO code (which could be fixed, but then good luck
with auditing all the other GUP users). Matthew had an idea and John
implemented it, based on removing the page from the LRU and using that
space in struct page. So we at least have a way to identify pages that are
pinned and can track their pin count.

2) What to do when some page is pinned but we need to do e.g.
clear_page_dirty_for_io(). After some more thinking I agree with you that
just blocking waiting for the page to unpin will create deadlocks like:

ext4_writepages()                  ext4_direct_IO_write()
                                     __blockdev_direct_IO()
                                       iov_iter_get_pages() - pins page
  handle = ext4_journal_start_with_reserve(inode, ...)
    - starts transaction
  ...
  lock_page(page)
  mpage_submit_page()
    clear_page_dirty_for_io(page) -> blocks on pin
                                       ext4_dio_get_block_unwritten_sync()
                                         - called to allocate blocks for DIO
                                         ext4_journal_start()
                                           - may block and wait for the
                                             transaction started by
                                             ext4_writepages() to finish

> I think it has to be more fundamental to the filesystem. Filesystem
> would get callbacks to register such long term dirtying on its files.
> Then it can do locking, resource allocation, -ENOTSUPP, etc.

Well, direct IO would not classify as long term dirtying I guess but still
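The pin-tracking scheme described in point 1, and the "skip instead of block" answer to point 2, can be made concrete with a small userspace model. This is purely an illustrative sketch: `struct model_page` and every function below are invented names, not the actual kernel API, and real `struct page` manipulation is far more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the pin tracking discussed above:
 * a page removed from the LRU carries a DMA pin count in the freed
 * space, and writeback paths consult it before clearing the dirty
 * bit. All names here are illustrative, not kernel code. */
struct model_page {
    bool on_lru;
    bool dirty;
    int dma_pin_count;  /* lives in the space freed by leaving the LRU */
};

/* get_user_pages*() side: take a pin, removing the page from the LRU. */
static void model_pin(struct model_page *page)
{
    page->on_lru = false;
    page->dma_pin_count++;
}

/* put_user_page() side: drop the pin; the last unpin can return the
 * page to the LRU. */
static void model_unpin(struct model_page *page)
{
    if (--page->dma_pin_count == 0)
        page->on_lru = true;
}

static bool model_page_pinned(const struct model_page *page)
{
    return page->dma_pin_count > 0;
}

/* Memory-cleaning writeback: skip pinned pages rather than blocking on
 * the pin, which is the behavior that sidesteps deadlocks like the
 * ext4 trace above. The page stays dirty until the pinner is done. */
static bool model_try_clear_dirty_for_io(struct model_page *page)
{
    if (model_page_pinned(page))
        return false;   /* leave it dirty; writeback skips this page */
    page->dirty = false;
    return true;
}
```

The key property the model demonstrates is that writeback never sleeps on a pinned page: it either cleans an unpinned page or leaves a pinned one dirty and moves on.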
Re: [PATCH 0/2] mm/fs: put_user_page() proposal
On Mon, 9 Jul 2018 01:05:52 -0700 john.hubb...@gmail.com wrote:
> From: John Hubbard
> 
> Hi,
> 
> With respect to tracking get_user_pages*() pages with page->dma_pinned*
> fields [1], I spent a few days retrofitting most of the get_user_pages*()
> call sites, by adding calls to a new put_user_page() function, in place
> of put_page(), where appropriate. This will work, but it's a large effort.
> 
> Design note: I didn't see anything that hinted at a way to fix this
> problem, without actually changing all of the get_user_pages*() call sites,
> so I think it's reasonable to start with that.
> 
> Anyway, it's still incomplete, but because this is a large, tree-wide
> change (that will take some time and testing), I'd like to propose a plan,
> before spamming zillions of people with put_user_page() conversion patches.
> So I picked out the first two patches to show where this is going.
> 
> Proposed steps:
> 
> Step 1:
> 
> Start with the patches here, then continue with...dozens more.
> This will eventually convert all of the call sites to use put_user_page().
> This is easy in some places, but complex in others, such as:
> 
> -- drivers/gpu/drm/amd
> -- bio
> -- fuse
> -- cifs
> -- anything from:
>    git grep iov_iter_get_pages | cut -f1 -d ':' | sort | uniq
> 
> The easy ones can be grouped into a single patchset, perhaps, and the
> complex ones probably each need a patchset, in order to get the in-depth
> review they'll need.
> 
> Furthermore, some of these areas I hope to attract some help on, once
> this starts going.
> 
> Step 2:
> 
> In parallel, tidy up the core patchset that was discussed in [1], (version
> 2 has already been reviewed, so I know what to do), and get it perfected
> and reviewed. Don't apply it until step 1 is all done, though.
> 
> Step 3:
> 
> Activate refcounting of dma-pinned pages (essentially, patch #5, which is
> [1]), but don't use it yet. Place a few WARN_ON_ONCE calls to start
> mopping up any missed call sites.
> 
> Step 4:
> 
> After some soak time, actually connect it up (patch #6 of [1]) and start
> taking action based on the new page->dma_pinned* fields.

You can use my decade old patch!

https://lkml.org/lkml/2009/2/17/113

The problem with blocking in clear_page_dirty_for_io is that the fs is
holding the page lock (or locks) and possibly others too. If you expect to
have a bunch of long term references hanging around on the page, then there
will be hangs and deadlocks everywhere. And if you do not have such long
term references, then the page lock (or some similar lock bit) for the
duration of the DMA should be about enough?

I think it has to be more fundamental to the filesystem. The filesystem
would get callbacks to register such long term dirtying on its files. Then
it can do locking, resource allocation, -ENOTSUPP, etc.

Thanks,
Nick
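To make the callback idea in the last paragraph more tangible, here is a minimal userspace sketch of what such a registration hook could look like. Everything here is hypothetical: these structs and names do not exist in the kernel, and standard `-EOPNOTSUPP` stands in for the kernel-internal `-ENOTSUPP` mentioned above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative-only sketch: before taking long term references that will
 * dirty a file's pages, the caller asks the filesystem, which may take
 * locks, reserve resources, or refuse outright. */
struct model_file_ops {
    /* Return 0 to allow long term dirtying, or a negative errno
     * (e.g. -EOPNOTSUPP) if the filesystem cannot support it. */
    int (*register_long_term_dirty)(void *file_private);
};

struct model_file {
    const struct model_file_ops *ops;
    void *private_data;
};

static int model_register_long_term_dirty(struct model_file *file)
{
    if (!file->ops || !file->ops->register_long_term_dirty)
        return -EOPNOTSUPP;   /* filesystem opted out entirely */
    return file->ops->register_long_term_dirty(file->private_data);
}

/* A sample filesystem that simply permits the registration. */
static int model_allow(void *priv)
{
    (void)priv;               /* a real fs would lock/reserve here */
    return 0;
}

static const struct model_file_ops model_allow_ops = { model_allow };
```

The point of the shape is that the decision moves into the filesystem: a GUP caller cannot take a long term pin on a file whose filesystem has not provided (or has refused) the hook.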
[PATCH 0/2] mm/fs: put_user_page() proposal
From: John Hubbard

Hi,

With respect to tracking get_user_pages*() pages with page->dma_pinned*
fields [1], I spent a few days retrofitting most of the get_user_pages*()
call sites, by adding calls to a new put_user_page() function, in place
of put_page(), where appropriate. This will work, but it's a large effort.

Design note: I didn't see anything that hinted at a way to fix this
problem, without actually changing all of the get_user_pages*() call sites,
so I think it's reasonable to start with that.

Anyway, it's still incomplete, but because this is a large, tree-wide
change (that will take some time and testing), I'd like to propose a plan,
before spamming zillions of people with put_user_page() conversion patches.
So I picked out the first two patches to show where this is going.

Proposed steps:

Step 1:

Start with the patches here, then continue with...dozens more.
This will eventually convert all of the call sites to use put_user_page().
This is easy in some places, but complex in others, such as:

-- drivers/gpu/drm/amd
-- bio
-- fuse
-- cifs
-- anything from:
   git grep iov_iter_get_pages | cut -f1 -d ':' | sort | uniq

The easy ones can be grouped into a single patchset, perhaps, and the
complex ones probably each need a patchset, in order to get the in-depth
review they'll need.

Furthermore, some of these areas I hope to attract some help on, once
this starts going.

Step 2:

In parallel, tidy up the core patchset that was discussed in [1], (version
2 has already been reviewed, so I know what to do), and get it perfected
and reviewed. Don't apply it until step 1 is all done, though.

Step 3:

Activate refcounting of dma-pinned pages (essentially, patch #5, which is
[1]), but don't use it yet. Place a few WARN_ON_ONCE calls to start
mopping up any missed call sites.

Step 4:

After some soak time, actually connect it up (patch #6 of [1]) and start
taking action based on the new page->dma_pinned* fields.
[1] https://www.spinics.net/lists/linux-mm/msg156409.html

or, the same thread on LKML if it's working for you:

https://lkml.org/lkml/2018/7/4/368

John Hubbard (2):
  mm: introduce put_user_page(), placeholder version
  goldfish_pipe/mm: convert to the new put_user_page() call

 drivers/platform/goldfish/goldfish_pipe.c |  6 +++---
 include/linux/mm.h                        | 14 ++++++++++++++
 2 files changed, 17 insertions(+), 3 deletions(-)

-- 
2.18.0
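The shape of the step 1 conversion can be sketched in a few lines of userspace C. This is only a model of the idea: in the "placeholder version" of patch 1, put_user_page() would do nothing beyond forwarding to put_page(), and only the later steps would make it act on page->dma_pinned* state. The `model_*` names below are invented for illustration, not the actual kernel code.

```c
#include <assert.h>

/* Userspace stand-in for struct page with a plain reference count. */
struct model_page {
    int refcount;
};

static void model_put_page(struct model_page *page)
{
    page->refcount--;
}

/* Placeholder version: identical to put_page() for now. Giving GUP
 * pages their own release call is what later lets the core tell
 * "drop a DMA pin" apart from "drop an ordinary reference". */
static void model_put_user_page(struct model_page *page)
{
    model_put_page(page);
}

/* The tree-wide conversion is then mechanical: every reference that
 * was obtained via get_user_pages*() gets released with
 * put_user_page() instead of put_page(). */
static void model_release_gup_pages(struct model_page **pages, int npages)
{
    for (int i = 0; i < npages; i++)
        model_put_user_page(pages[i]);
}
```

Because the placeholder behaves exactly like put_page(), each conversion patch is a no-op at runtime, which is what makes it safe to merge the dozens of call-site patches before the refcounting of steps 3 and 4 is switched on.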