Re: [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit
On 20.04.22 19:10, Vlastimil Babka wrote:
> On 3/29/22 18:43, David Hildenbrand wrote:
>> [full patch description snipped]
>>
>> Signed-off-by: David Hildenbrand
>
> With the fixup as reported by Miaohe Lin
>
> Acked-by: Vlastimil Babka
>
> (sent a separate mm-commits mail to inquire about the fix going missing
> from mmotm)
>
> https://lore.kernel.org/mm-commits/c3195d8a-2931-0749-973a-1d04e4bae...@suse.cz/T/#m4e98ccae6f747e11f45e4d0726427ba2fef740eb

Yes I saw that, thanks for catching that!

--
Thanks,

David / dhildenb
Re: [PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit
On 3/29/22 18:43, David Hildenbrand wrote:
> [full patch description snipped]
>
> Signed-off-by: David Hildenbrand

With the fixup as reported by Miaohe Lin

Acked-by: Vlastimil Babka

(sent a separate mm-commits mail to inquire about the fix going missing
from mmotm)

https://lore.kernel.org/mm-commits/c3195d8a-2931-0749-973a-1d04e4bae...@suse.cz/T/#m4e98ccae6f747e11f45e4d0726427ba2fef740eb
[PATCH v2 1/8] mm/swap: remember PG_anon_exclusive via a swp pte bit
Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about it. We do this to keep fork() logic on swap entries easy and efficient: for example, if we wouldn't clear it when unmapping, we'd have to look up the page in the swapcache for each and every swap entry during fork() and clear PG_anon_exclusive if set.

Instead, we want to store that information directly in the swap pte, protected by the page table lock, similarly to how we handle SWP_MIGRATION_READ_EXCLUSIVE for migration entries. However, for actual swap entries, we don't want to mess with the swap type (e.g., still one bit) because it overcomplicates swap code.

In try_to_unmap(), we already refuse to unmap in case the page might be pinned, because we must never lose PG_anon_exclusive on pinned pages. Reliably checking for other unexpected references *before* completely unmapping a page is unfortunately not really possible: THPs heavily overcomplicate the situation. Once fully unmapped it's easier -- we, for example, make sure that there are no unexpected references *after* unmapping a page before starting writeback on that page.

So, we currently might end up unmapping a page and clearing PG_anon_exclusive if that page has additional references, for example, due to a FOLL_GET.

do_swap_page() has to re-determine if a page is exclusive, which will easily fail if there are other references on a page, most prominently GUP references via FOLL_GET. This can currently result in memory corruptions when taking a FOLL_GET | FOLL_WRITE reference on a page even when fork() is never involved: try_to_unmap() will succeed, and when refaulting the page, it cannot be marked exclusive and will get replaced by a copy in the page tables on the next write access, resulting in writes via the GUP reference to the page being lost.

In an ideal world, everybody that uses GUP and wants to modify page content, such as O_DIRECT, would properly use FOLL_PIN. However, that conversion will take a while.
It's easier to fix what used to work in the past (FOLL_GET | FOLL_WRITE) by remembering PG_anon_exclusive. In addition, by remembering PG_anon_exclusive we can further reduce unnecessary COW in some cases, so it's the natural thing to do.

So let's transfer the PG_anon_exclusive information to the swap pte and store it via an architecture-dependent pte bit; use that information when restoring the swap pte in do_swap_page() and unuse_pte(). During fork(), we simply have to clear the pte bit and are done.

Of course, there is one corner case to handle: swap backends that don't support concurrent page modifications while the page is under writeback. Special case these, and drop the exclusive marker. Add a comment why that is just fine (also, reuse_swap_page() would have done the same in the past).

In the future, we'll hopefully have all architectures support __HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty stubs and the define completely. Then, we can also convert SWP_MIGRATION_READ_EXCLUSIVE. For architectures it's fairly easy to support: either simply use a yet unused pte bit that can be used for swap entries, steal one from the arch type bits if they exceed 5, or steal one from the offset bits.

Note: R/O FOLL_GET references were never really reliable, especially when taking one on a shared page and then writing to the page (e.g., GUP after fork()). FOLL_GET, including R/W references, was never really reliable once fork() was involved (e.g., GUP before fork(), GUP during fork()). KSM steps back in case it stumbles over unexpected references and is, therefore, fine.
Signed-off-by: David Hildenbrand
---
 include/linux/pgtable.h | 29 ++
 include/linux/swapops.h |  2 ++
 mm/memory.c             | 55 ++---
 mm/rmap.c               | 19 --
 mm/swapfile.c           | 13 +-
 5 files changed, 105 insertions(+), 13 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f4f4077b97aa..53750224e176 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1003,6 +1003,35 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 #define arch_start_context_switch(prev)	do {} while (0)
 #endif

+/*
+ * When replacing an anonymous page by a real (!non) swap entry, we clear
+ * PG_anon_exclusive from the page and instead remember whether the flag was
+ * set in the swp pte. During fork(), we have to mark the entry as !exclusive
+ * (possibly shared). On swapin, we use that information to restore
+ * PG_anon_exclusive, which is very helpful in cases where we might have
+ * additional (e.g., FOLL_GET) references on a page and wouldn't be able to
+ * detect exclusivity.
+ *
+ * These functions don't apply to non-swap entries (e.g., migration, hwpoison,
+ * ...).
+ */
+#ifndef __HAVE_ARCH_PTE_SWP_EXCLUSIVE
+static inline pte_t pte_swp_mkexclusive(pte_t pte)
+{
+	return pte;
+}
+
+static inline int pte_swp_