Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
On Fri 15-02-19 15:08:36, Huang, Ying wrote:
> Michal Hocko writes:
>
> > On Mon 11-02-19 16:38:46, Huang, Ying wrote:
> >> From: Huang Ying
> >>
> >> When swapin is performed, after getting the swap entry information from
> >> the page table, the system will swap in the swap entry without any lock
> >> held to prevent the swap device from being swapped off.  This may cause
> >> a race like the one below,
> >>
> >> CPU 1                           CPU 2
> >> -----                           -----
> >> do_swap_page
> >>   swapin_readahead
> >>     __read_swap_cache_async
> >>                                 swapoff
> >>       swapcache_prepare           p->swap_map = NULL
> >>         __swap_duplicate
> >>           p->swap_map[?] /* !!! NULL pointer access */
> >>
> >> Because swapoff is usually done only at system shutdown, the race may
> >> not hit many people in practice.  But it is still a race that needs to
> >> be fixed.
> >>
> >> To fix the race, get_swap_device() is added to check whether the
> >> specified swap entry is valid in its swap device.  If so, it will keep
> >> the swap entry valid by preventing the swap device from being swapped
> >> off, until put_swap_device() is called.
> >>
> >> Because swapoff() is a very rare code path, to make the normal path run
> >> as fast as possible, disabling preemption + stop_machine() is used to
> >> implement get/put_swap_device() instead of a reference count.  From
> >> get_swap_device() to put_swap_device(), preemption is disabled, so
> >> stop_machine() in swapoff() will wait until put_swap_device() is called.
> >>
> >> In addition to the swap_map, cluster_info, etc. data structures in
> >> struct swap_info_struct, the swap cache radix tree will be freed after
> >> swapoff, so this patch fixes the race between swap cache lookup and
> >> swapoff too.
> >>
> >> Races between some other swap cache usages protected by disabling
> >> preemption and swapoff are fixed too, by calling stop_machine() between
> >> clearing PageSwapCache() and freeing the swap cache data structure.
> >>
> >> An alternative implementation could replace disabling preemption with
> >> rcu_read_lock_sched() and stop_machine() with synchronize_sched().
> >
> > Using stop_machine() is generally discouraged.  It is a gross
> > synchronization primitive.
> >
> > Besides that, since when do we have this problem?
>
> For the problem, you mean the race between swapoff and the page fault
> handler?

yes

> The problem was introduced in v4.11, when we replaced
> swap_info_struct->lock with swap_cluster_info->lock in
> __swap_duplicate() where possible, to improve the scalability of swap
> operations.  But because swapoff is a really rare operation, I don't
> think it's necessary to backport the fix.

Well, a lack of any bug reports would support your theory that this is
unlikely to hit in practice.  A Fixes: tag would be nice to have
regardless, though.  Thanks!
--
Michal Hocko
SUSE Labs
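[The changelog above rejects a reference count in favor of disabling preemption + stop_machine() to keep the fast path cheap.  For readers following along, the rejected reference-count alternative can be sketched as a self-contained userspace analogue; all names here are hypothetical, and this is illustrative, not the kernel's code.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace analogue of a swap device that can be torn down. */
struct dev {
	atomic_int valid;	/* plays the role of SWP_VALID */
	atomic_int users;	/* plays the role of pinned readers */
	int *map;		/* plays the role of p->swap_map */
};

/* "get_swap_device": pin first, then re-check validity; unpin on failure. */
static int get_dev(struct dev *d)
{
	atomic_fetch_add(&d->users, 1);
	if (!atomic_load(&d->valid)) {
		atomic_fetch_sub(&d->users, 1);
		return 0;
	}
	return 1;
}

/* "put_swap_device": drop the pin. */
static void put_dev(struct dev *d)
{
	atomic_fetch_sub(&d->users, 1);
}

/* "swapoff": unset valid, drain all readers, only then free the map. */
static void dev_off(struct dev *d)
{
	atomic_store(&d->valid, 0);
	while (atomic_load(&d->users) > 0)
		;	/* the role stop_machine()/synchronize_rcu() plays in the patch */
	d->map = NULL;
}
```

The per-access atomic increment/decrement on the fast path is exactly the overhead the changelog wants to avoid; preemption-disable or RCU readers make the pin free.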
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
Andrea Arcangeli writes:

> On Thu, Feb 14, 2019 at 04:07:37PM +0800, Huang, Ying wrote:
>> Before, we chose to use stop_machine() to reduce the overhead of the
>> hot path (the page fault handler) as much as possible.  But now I have
>> found that rcu_read_lock_sched() is just a wrapper around
>> preempt_disable().  So maybe we can switch to the RCU version now.
>
> rcu_read_lock looks more efficient than rcu_read_lock_sched.  So for
> this purpose, in the fast path, rcu_read_lock()/unlock() should be the
> preferred methods; there is no need to force preempt_disable() (except
> for debug purposes if sleep debugging is enabled).  Server builds are
> done with voluntary preempt (no-preempt shouldn't even exist as a
> config option), and there rcu_read_lock might be just a noop.

With CONFIG_PREEMPT_VOLUNTARY=y, CONFIG_PREEMPT=n, and
CONFIG_PREEMPT_COUNT=n, which is common for servers, rcu_read_lock()
will be a noop, while rcu_read_lock_sched() and preempt_disable() will
be barrier().  So rcu_read_lock() is a little better.

> Against a fast path rcu_read_lock/unlock, before the consolidation
> synchronize_rcu would have been enough; now, after the consolidation,
> it is even more certain that it's enough, because it's equivalent to
> _mult.

Yes.  Will change to the rcu_read_lock()/unlock() based method.

Best Regards,
Huang, Ying
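[The rcu_read_lock()-based method being agreed on here is only described, not shown, in the thread.  A kernel-style pseudocode sketch of what it might look like follows; SWP_VALID and the function names come from the quoted patch, the rest is a guess at the shape of the final code, not the actual patch.]

```
/* fast path (sketch): RCU read-side critical section instead of
 * preempt_disable() */
struct swap_info_struct *get_swap_device(swp_entry_t entry)
{
	struct swap_info_struct *si = swp_swap_info(entry);

	if (!si)
		return NULL;
	rcu_read_lock();
	if (!(si->flags & SWP_VALID)) {
		rcu_read_unlock();
		return NULL;
	}
	return si;
}

static inline void put_swap_device(struct swap_info_struct *si)
{
	rcu_read_unlock();
}

/* swapoff side (sketch): synchronize_rcu() instead of stop_machine() */
	p->flags &= ~SWP_VALID;
	...
	synchronize_rcu();	/* wait for all get/put critical sections */
	p->swap_map = NULL;
```

With the consolidated RCU discussed above, synchronize_rcu() waits for both rcu_read_lock() and preempt-disable readers, which is what makes this a drop-in replacement for the stop_machine() scheme.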
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
On Thu, Feb 14, 2019 at 04:07:37PM +0800, Huang, Ying wrote:
> Before, we chose to use stop_machine() to reduce the overhead of the
> hot path (the page fault handler) as much as possible.  But now I have
> found that rcu_read_lock_sched() is just a wrapper around
> preempt_disable().  So maybe we can switch to the RCU version now.

rcu_read_lock looks more efficient than rcu_read_lock_sched.  So for
this purpose, in the fast path, rcu_read_lock()/unlock() should be the
preferred methods; there is no need to force preempt_disable() (except
for debug purposes if sleep debugging is enabled).  Server builds are
done with voluntary preempt (no-preempt shouldn't even exist as a
config option), and there rcu_read_lock might be just a noop.

Against a fast path rcu_read_lock/unlock, before the consolidation
synchronize_rcu would have been enough; now, after the consolidation,
it is even more certain that it's enough, because it's equivalent to
_mult.
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
Hello,

On Thu, Feb 14, 2019 at 12:30:02PM -0800, Andrew Morton wrote:
> This was discussed to death and I think the changelog explains the
> conclusions adequately.  swapoff is super-rare, so a stop_machine() in
> that path is appropriate if its use permits more efficiency in the
> regular swap code paths.

The problem is precisely that the way the stop_machine callback is
implemented right now (a dummy noop) makes the stop_machine() solution
fully equivalent to RCU from the fast path point of view.  It does not
permit more efficiency in the fast path, which is all we care about.

From the slow path point of view the only difference is possibly that
stop_machine will reach the quiescent state faster (i.e. swapoff may
return a few dozen milliseconds sooner), but nobody cares about the
latency of swapoff, and it's actually nicer if swapoff doesn't stop all
CPUs on large systems and uses less CPU overall.

This is why I suggested that, if we keep using stop_machine(), we
should not use a dummy function whose only purpose is to reach a
quiescent state (which is something more efficiently achieved with the
synchronize_rcu/sched/kernel RCU API of the day), but should instead
try to leverage the UP-like serialization to remove more spinlocks from
the fast path and convert them to preempt_disable().  However, the
current dummy callback cannot achieve that higher efficiency in the
fast paths; the code would need to be reshuffled to try to remove at
least the swap_lock.  If no spinlock is converted to preempt_disable(),
I don't see the point of stop_machine() over RCU.

On a side note, the cmpxchg machinery I posted to run the function
simultaneously on all CPUs may actually be superfluous if using
cpus=NULL like Huang suggested.

Implementation details aside, the idea of stop_machine would still be
to do the p->swap_map = NULL assignment, and everything else protected
by the swap_lock, inside the callback that runs as if on a UP system,
to speed up the fast path further.

Thanks,
Andrea
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
On Thu, 14 Feb 2019 15:33:18 +0100 Michal Hocko wrote:

> > Because swapoff() is a very rare code path, to make the normal path
> > run as fast as possible, disabling preemption + stop_machine() is
> > used to implement get/put_swap_device() instead of a reference
> > count.  From get_swap_device() to put_swap_device(), preemption is
> > disabled, so stop_machine() in swapoff() will wait until
> > put_swap_device() is called.
> >
> > In addition to the swap_map, cluster_info, etc. data structures in
> > struct swap_info_struct, the swap cache radix tree will be freed
> > after swapoff, so this patch fixes the race between swap cache
> > lookup and swapoff too.
> >
> > Races between some other swap cache usages protected by disabling
> > preemption and swapoff are fixed too, by calling stop_machine()
> > between clearing PageSwapCache() and freeing the swap cache data
> > structure.
> >
> > An alternative implementation could replace disabling preemption
> > with rcu_read_lock_sched() and stop_machine() with
> > synchronize_sched().
>
> Using stop_machine() is generally discouraged.  It is a gross
> synchronization primitive.

This was discussed to death and I think the changelog explains the
conclusions adequately.  swapoff is super-rare, so a stop_machine() in
that path is appropriate if its use permits more efficiency in the
regular swap code paths.

> Besides that, since when do we have this problem?

What problem??
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
On Mon 11-02-19 16:38:46, Huang, Ying wrote:
> From: Huang Ying
>
> When swapin is performed, after getting the swap entry information from
> the page table, the system will swap in the swap entry without any lock
> held to prevent the swap device from being swapped off.  This may cause
> a race like the one below,
>
> CPU 1                           CPU 2
> -----                           -----
> do_swap_page
>   swapin_readahead
>     __read_swap_cache_async
>                                 swapoff
>       swapcache_prepare           p->swap_map = NULL
>         __swap_duplicate
>           p->swap_map[?] /* !!! NULL pointer access */
>
> Because swapoff is usually done only at system shutdown, the race may
> not hit many people in practice.  But it is still a race that needs to
> be fixed.
>
> To fix the race, get_swap_device() is added to check whether the
> specified swap entry is valid in its swap device.  If so, it will keep
> the swap entry valid by preventing the swap device from being swapped
> off, until put_swap_device() is called.
>
> Because swapoff() is a very rare code path, to make the normal path run
> as fast as possible, disabling preemption + stop_machine() is used to
> implement get/put_swap_device() instead of a reference count.  From
> get_swap_device() to put_swap_device(), preemption is disabled, so
> stop_machine() in swapoff() will wait until put_swap_device() is called.
>
> In addition to the swap_map, cluster_info, etc. data structures in
> struct swap_info_struct, the swap cache radix tree will be freed after
> swapoff, so this patch fixes the race between swap cache lookup and
> swapoff too.
>
> Races between some other swap cache usages protected by disabling
> preemption and swapoff are fixed too, by calling stop_machine() between
> clearing PageSwapCache() and freeing the swap cache data structure.
>
> An alternative implementation could replace disabling preemption with
> rcu_read_lock_sched() and stop_machine() with synchronize_sched().

Using stop_machine() is generally discouraged.  It is a gross
synchronization primitive.

Besides that, since when do we have this problem?
> Signed-off-by: "Huang, Ying"
> Not-Nacked-by: Hugh Dickins
> Cc: Paul E. McKenney
> Cc: Minchan Kim
> Cc: Johannes Weiner
> Cc: Tim Chen
> Cc: Mel Gorman
> Cc: Jérôme Glisse
> Cc: Michal Hocko
> Cc: Andrea Arcangeli
> Cc: David Rientjes
> Cc: Rik van Riel
> Cc: Jan Kara
> Cc: Dave Jiang
> Cc: Daniel Jordan
> Cc: Andrea Parri
>
> Changelog:
>
> v7:
>
> - Rebased on patch: "mm, swap: bounds check swap_info accesses to avoid
>   NULL derefs"
>
> v6:
>
> - Add more comments to get_swap_device() to make it clearer about
>   possible swapoff or swapoff+swapon.
>
> v5:
>
> - Replace RCU with stop_machine()
>
> v4:
>
> - Use synchronize_rcu() in enable_swap_info() to reduce the overhead of
>   normal paths further.
>
> v3:
>
> - Re-implemented with RCU to reduce the overhead of normal paths
>
> v2:
>
> - Re-implemented with SRCU to reduce the overhead of normal paths.
>
> - Avoid checking whether the swap device has been swapped off in
>   get_swap_device().  Because we can check the origin of the swap
>   entry to make sure the swap device hasn't been swapped off.
> ---
>  include/linux/swap.h |  13 +++-
>  mm/memory.c          |   2 +-
>  mm/swap_state.c      |  16 -
>  mm/swapfile.c        | 150 ++-
>  4 files changed, 145 insertions(+), 36 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 649529be91f2..aecd1430d0a9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -175,8 +175,9 @@ enum {
>  	SWP_PAGE_DISCARD = (1 << 10),	/* freed swap page-cluster discards */
>  	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
>  	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
> +	SWP_VALID	= (1 << 13),	/* swap is valid to be operated on? */
>  	/* add others here before... */
> -	SWP_SCANNING	= (1 << 13),	/* refcount in scan_swap_map */
> +	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
>  };
>
>  #define SWAP_CLUSTER_MAX 32UL
> @@ -460,7 +461,7 @@ extern unsigned int count_swap_pages(int, int);
>  extern sector_t map_swap_page(struct page *, struct block_device **);
>  extern sector_t swapdev_block(int, pgoff_t);
>  extern int page_swapcount(struct page *);
> -extern int __swap_count(struct swap_info_struct *si, swp_entry_t entry);
> +extern int __swap_count(swp_entry_t entry);
>  extern int __swp_swapcount(swp_entry_t entry);
>  extern int swp_swapcount(swp_entry_t entry);
>  extern struct swap_info_struct *page_swap_info(struct page *);
> @@ -470,6 +471,12 @@ extern int try_to_free_swap(struct page *);
>  struct backing_dev_info;
>  extern int
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
Hello everyone,

On Mon, Feb 11, 2019 at 04:38:46PM +0800, Huang, Ying wrote:
> @@ -2386,7 +2463,17 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
>  	frontswap_init(p->type, frontswap_map);
>  	spin_lock(&swap_lock);
>  	spin_lock(&p->lock);
> -	_enable_swap_info(p, prio, swap_map, cluster_info);
> +	setup_swap_info(p, prio, swap_map, cluster_info);
> +	spin_unlock(&p->lock);
> +	spin_unlock(&swap_lock);
> +	/*
> +	 * Guarantee swap_map, cluster_info, etc. fields are used
> +	 * between get/put_swap_device() only if SWP_VALID bit is set
> +	 */
> +	stop_machine(swap_onoff_stop, NULL, cpu_online_mask);

Should cpu_online_mask be read while holding cpus_read_lock?

	cpus_read_lock();
	err = __stop_machine(swap_onoff_stop, NULL, cpu_online_mask);
	cpus_read_unlock();

I missed what the exact motivation was for the switch from
rcu_read_lock()/synchronize_rcu() to preempt_disable()/stop_machine().

It looks like all the above stop_machine() does is reach a quiescent
point, but you have RCU, which can already reach the quiescent point
without an explicit stop_machine.

The reason both implementations look basically the same is that the
stop_machine dummy call of swap_onoff_stop() { /* noop */ } will only
reach a quiescent point faster than RCU; it's otherwise functionally
identical to RCU, but extremely more expensive.  If it weren't
functionally identical, stop_machine() couldn't have been used as a
drop-in replacement for synchronize_sched() in the previous patch.  I
don't see the point of worrying about the synchronize_rcu latency in
swapoff when RCU is basically identical and not more complex.

So, to be clear, I'm not against stop_machine(), but with the
stop_machine() method invoked on all CPUs you can actually do more than
RCU: you can remove real locking, not just reach a quiescent point.
With stop_machine() the code would need reshuffling so that the actual
p->swap_map = NULL happens inside stop_machine, not outside like with
RCU.

With RCU all code stays concurrent at all times, simply the race is
controlled; as opposed, with stop_machine() you can fully serialize and
run like on a UP system temporarily (vs all preempt_disable() sections
at least).  For example nr_swapfiles could in theory become a constant
under preempt_disable() with stop_machine(), without having to take the
swap_lock.

swap_onoff_stop can be implemented like this:

enum {
	FIRST_STOP_MACHINE_INIT,
	FIRST_STOP_MACHINE_START,
	FIRST_STOP_MACHINE_END,
};
static int first_stop_machine;

static int swap_onoff_stop(void *data)
{
	struct swap_stop_machine *swsm = (struct swap_stop_machine *)data;
	int first;

	first = cmpxchg(&first_stop_machine, FIRST_STOP_MACHINE_INIT,
			FIRST_STOP_MACHINE_START);
	if (first == FIRST_STOP_MACHINE_INIT) {
		swsm->p->swap_map = NULL;
		/* add more stuff here until the swap_lock goes away */
		smp_wmb();
		WRITE_ONCE(first_stop_machine, FIRST_STOP_MACHINE_END);
	} else {
		do {
			cpu_relax();
		} while (READ_ONCE(first_stop_machine) !=
			 FIRST_STOP_MACHINE_END);
		smp_rmb();
	}

	return 0;
}

stop_machine invoked with a method like the above will guarantee that
while we set p->swap_map to NULL (and while we do nr_swapfiles++)
nothing else can run, not even interrupts, so some locks may just
disappear.  Only NMI and SMI could possibly run concurrently with the
swsm->p->swap_map = NULL operation.

If we have to keep swap_onoff_stop() a dummy function run on all CPUs
just to reach a quiescent point, then I don't see why the
synchronize_rcu() (or synchronize_sched or synchronize_kernel or
whatever it is called right now, but still RCU) solution isn't
preferable.

Thanks,
Andrea
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
Tim Chen writes:

> On 2/11/19 10:47 PM, Huang, Ying wrote:
>> Andrea Parri writes:
>>
>>>>> +	if (!si)
>>>>> +		goto bad_nofile;
>>>>> +
>>>>> +	preempt_disable();
>>>>> +	if (!(si->flags & SWP_VALID))
>>>>> +		goto unlock_out;
>>>>
>>>> After Hugh alluded to barriers, it seems the read of SWP_VALID could
>>>> be reordered with the write in preempt_disable at runtime.  Without
>>>> smp_mb() between the two, couldn't this happen, however unlikely a
>>>> race it is?
>>>>
>>>> CPU0                            CPU1
>>>>
>>>> __swap_duplicate()
>>>>   get_swap_device()
>>>>     // sees SWP_VALID set
>>>>                                 swapoff
>>>>                                   p->flags &= ~SWP_VALID;
>>>>                                   spin_unlock(&p->lock);  // pair w/ smp_mb
>>>>                                   ...
>>>>                                   stop_machine(...)
>>>>                                   p->swap_map = NULL;
>>>>   preempt_disable()
>>>>   read NULL p->swap_map
>>>
>>> I don't think that that smp_mb() is necessary.  I elaborate:
>>>
>>> An important piece of information, I think, that is missing in the
>>> diagram above is the stopper thread which executes the work queued
>>> by stop_machine().  We have two cases to consider, that is,
>>>
>>>   1) the stopper is "executed before" the preempt-disable section
>>>
>>>	CPU0
>>>
>>>	cpu_stopper_thread()
>>>	...
>>>	preempt_disable()
>>>	...
>>>	preempt_enable()
>>>
>>>   2) the stopper is "executed after" the preempt-disable section
>>>
>>>	CPU0
>>>
>>>	preempt_disable()
>>>	...
>>>	preempt_enable()
>>>	...
>>>	cpu_stopper_thread()
>>>
>>> Notice that the reads from p->flags and p->swap_map in CPU0 cannot
>>> cross cpu_stopper_thread().  The claim is that CPU0 sees SWP_VALID
>>> unset in (1) and that it sees a non-NULL p->swap_map in (2).
>>>
>>> I consider the two cases separately:
>>>
>>>   1) CPU1 unsets SWP_VALID, it locks the stopper's lock, and it
>>>      queues the stopper work; CPU0 locks the stopper's lock, it
>>>      dequeues this work, and it reads from p->flags.
>>>
>>>      Diagrammatically, we have the following MP-like pattern:
>>>
>>>	CPU0                            CPU1
>>>
>>>	lock(stopper->lock)             p->flags &= ~SWP_VALID
>>>	get @work                       lock(stopper->lock)
>>>	unlock(stopper->lock)           add @work
>>>	reads p->flags                  unlock(stopper->lock)
>>>
>>>      where CPU0 must see SWP_VALID unset (if CPU0 sees the work
>>>      added by CPU1).
>>>
>>>   2) CPU0 reads from p->swap_map, it locks the completion lock,
>>>      and it signals completion; CPU1 locks the completion lock,
>>>      it checks for completion, and it writes to p->swap_map.
>>>
>>>      (If CPU0 doesn't signal the completion, or CPU1 doesn't see
>>>      the completion, then CPU1 will have to iterate the read and
>>>      to postpone the control-dependent write to p->swap_map.)
>>>
>>>      Diagrammatically, we have the following LB-like pattern:
>>>
>>>	CPU0                            CPU1
>>>
>>>	reads p->swap_map               lock(completion)
>>>	lock(completion)                read completion->done
>>>	completion->done++              unlock(completion)
>>>	unlock(completion)              p->swap_map = NULL
>>>
>>>      where CPU0 must see a non-NULL p->swap_map if CPU1 sees the
>>>      completion from CPU0.
>>>
>>> Does this make sense?
>>
>> Thanks a lot for detailed explanation!
>
> This is certainly a non-trivial explanation of why a memory barrier is
> not needed.  Can we put it in the commit log and mention something in
> the comments on why we don't need a memory barrier?

Good idea!  Will do this.

Best Regards,
Huang, Ying

> Thanks.
>
> Tim
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
On Tue, Feb 12, 2019 at 04:21:21AM +0100, Andrea Parri wrote:
> > > +	if (!si)
> > > +		goto bad_nofile;
> > > +
> > > +	preempt_disable();
> > > +	if (!(si->flags & SWP_VALID))
> > > +		goto unlock_out;
> >
> > After Hugh alluded to barriers, it seems the read of SWP_VALID could
> > be reordered with the write in preempt_disable at runtime.  Without
> > smp_mb() between the two, couldn't this happen, however unlikely a
> > race it is?
> >
> > CPU0                            CPU1
> >
> > __swap_duplicate()
> >   get_swap_device()
> >     // sees SWP_VALID set
> >                                 swapoff
> >                                   p->flags &= ~SWP_VALID;
> >                                   spin_unlock(&p->lock);  // pair w/ smp_mb
> >                                   ...
> >                                   stop_machine(...)
> >                                   p->swap_map = NULL;
> >   preempt_disable()
> >   read NULL p->swap_map
>
> I don't think that that smp_mb() is necessary.  I elaborate:
>
> An important piece of information, I think, that is missing in the
> diagram above is the stopper thread which executes the work queued
> by stop_machine().  We have two cases to consider, that is,
>
>   1) the stopper is "executed before" the preempt-disable section
>
>	CPU0
>
>	cpu_stopper_thread()
>	...
>	preempt_disable()
>	...
>	preempt_enable()
>
>   2) the stopper is "executed after" the preempt-disable section
>
>	CPU0
>
>	preempt_disable()
>	...
>	preempt_enable()
>	...
>	cpu_stopper_thread()
>
> Notice that the reads from p->flags and p->swap_map in CPU0 cannot
> cross cpu_stopper_thread().  The claim is that CPU0 sees SWP_VALID
> unset in (1) and that it sees a non-NULL p->swap_map in (2).
>
> I consider the two cases separately:
>
>   1) CPU1 unsets SWP_VALID, it locks the stopper's lock, and it
>      queues the stopper work; CPU0 locks the stopper's lock, it
>      dequeues this work, and it reads from p->flags.
>
>      Diagrammatically, we have the following MP-like pattern:
>
>	CPU0                            CPU1
>
>	lock(stopper->lock)             p->flags &= ~SWP_VALID
>	get @work                       lock(stopper->lock)
>	unlock(stopper->lock)           add @work
>	reads p->flags                  unlock(stopper->lock)
>
>      where CPU0 must see SWP_VALID unset (if CPU0 sees the work
>      added by CPU1).
>
>   2) CPU0 reads from p->swap_map, it locks the completion lock,
>      and it signals completion; CPU1 locks the completion lock,
>      it checks for completion, and it writes to p->swap_map.
>
>      (If CPU0 doesn't signal the completion, or CPU1 doesn't see
>      the completion, then CPU1 will have to iterate the read and
>      to postpone the control-dependent write to p->swap_map.)
>
>      Diagrammatically, we have the following LB-like pattern:
>
>	CPU0                            CPU1
>
>	reads p->swap_map               lock(completion)
>	lock(completion)                read completion->done
>	completion->done++              unlock(completion)
>	unlock(completion)              p->swap_map = NULL
>
>      where CPU0 must see a non-NULL p->swap_map if CPU1 sees the
>      completion from CPU0.
>
> Does this make sense?

Yes, thanks for this, Andrea!  Good that smp_mb isn't needed.
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
On 2/11/19 10:47 PM, Huang, Ying wrote:
> Andrea Parri writes:
>
>>>> +	if (!si)
>>>> +		goto bad_nofile;
>>>> +
>>>> +	preempt_disable();
>>>> +	if (!(si->flags & SWP_VALID))
>>>> +		goto unlock_out;
>>>
>>> After Hugh alluded to barriers, it seems the read of SWP_VALID could
>>> be reordered with the write in preempt_disable at runtime.  Without
>>> smp_mb() between the two, couldn't this happen, however unlikely a
>>> race it is?
>>>
>>> CPU0                            CPU1
>>>
>>> __swap_duplicate()
>>>   get_swap_device()
>>>     // sees SWP_VALID set
>>>                                 swapoff
>>>                                   p->flags &= ~SWP_VALID;
>>>                                   spin_unlock(&p->lock);  // pair w/ smp_mb
>>>                                   ...
>>>                                   stop_machine(...)
>>>                                   p->swap_map = NULL;
>>>   preempt_disable()
>>>   read NULL p->swap_map
>>
>> I don't think that that smp_mb() is necessary.  I elaborate:
>>
>> An important piece of information, I think, that is missing in the
>> diagram above is the stopper thread which executes the work queued
>> by stop_machine().  We have two cases to consider, that is,
>>
>>   1) the stopper is "executed before" the preempt-disable section
>>
>>	CPU0
>>
>>	cpu_stopper_thread()
>>	...
>>	preempt_disable()
>>	...
>>	preempt_enable()
>>
>>   2) the stopper is "executed after" the preempt-disable section
>>
>>	CPU0
>>
>>	preempt_disable()
>>	...
>>	preempt_enable()
>>	...
>>	cpu_stopper_thread()
>>
>> Notice that the reads from p->flags and p->swap_map in CPU0 cannot
>> cross cpu_stopper_thread().  The claim is that CPU0 sees SWP_VALID
>> unset in (1) and that it sees a non-NULL p->swap_map in (2).
>>
>> I consider the two cases separately:
>>
>>   1) CPU1 unsets SWP_VALID, it locks the stopper's lock, and it
>>      queues the stopper work; CPU0 locks the stopper's lock, it
>>      dequeues this work, and it reads from p->flags.
>>
>>      Diagrammatically, we have the following MP-like pattern:
>>
>>	CPU0                            CPU1
>>
>>	lock(stopper->lock)             p->flags &= ~SWP_VALID
>>	get @work                       lock(stopper->lock)
>>	unlock(stopper->lock)           add @work
>>	reads p->flags                  unlock(stopper->lock)
>>
>>      where CPU0 must see SWP_VALID unset (if CPU0 sees the work
>>      added by CPU1).
>>
>>   2) CPU0 reads from p->swap_map, it locks the completion lock,
>>      and it signals completion; CPU1 locks the completion lock,
>>      it checks for completion, and it writes to p->swap_map.
>>
>>      (If CPU0 doesn't signal the completion, or CPU1 doesn't see
>>      the completion, then CPU1 will have to iterate the read and
>>      to postpone the control-dependent write to p->swap_map.)
>>
>>      Diagrammatically, we have the following LB-like pattern:
>>
>>	CPU0                            CPU1
>>
>>	reads p->swap_map               lock(completion)
>>	lock(completion)                read completion->done
>>	completion->done++              unlock(completion)
>>	unlock(completion)              p->swap_map = NULL
>>
>>      where CPU0 must see a non-NULL p->swap_map if CPU1 sees the
>>      completion from CPU0.
>>
>> Does this make sense?
>
> Thanks a lot for detailed explanation!

This is certainly a non-trivial explanation of why a memory barrier is
not needed.  Can we put it in the commit log and mention something in
the comments on why we don't need a memory barrier?

Thanks.

Tim
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
> An alternative implementation could replace disabling preemption with
> rcu_read_lock_sched() and stop_machine() with synchronize_sched().

JFYI, starting with v4.20-rc1, synchronize_rcu{,_expedited}() also wait
for preempt-disable sections (the intent seems to be to retire the
RCU-sched update-side API), so here you could instead use
preempt-disable + synchronize_rcu{,_expedited}().

This LWN article gives an overview of the latest RCU API/semantics
changes: https://lwn.net/Articles/777036/.

  Andrea
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
Daniel Jordan writes:

> On Mon, Feb 11, 2019 at 04:38:46PM +0800, Huang, Ying wrote:
>> +struct swap_info_struct *get_swap_device(swp_entry_t entry)
>> +{
>> +	struct swap_info_struct *si;
>> +	unsigned long type, offset;
>> +
>> +	if (!entry.val)
>> +		goto out;
>> +	type = swp_type(entry);
>> +	si = swap_type_to_swap_info(type);
>
> These lines can be collapsed into swp_swap_info if you want.

Yes.  I can use that function to remove another line from the patch.
Thanks!  Will do that.

>> +	if (!si)
>> +		goto bad_nofile;
>> +
>> +	preempt_disable();
>> +	if (!(si->flags & SWP_VALID))
>> +		goto unlock_out;
>
> After Hugh alluded to barriers, it seems the read of SWP_VALID could be
> reordered with the write in preempt_disable at runtime.  Without
> smp_mb() between the two, couldn't this happen, however unlikely a race
> it is?
>
> CPU0                            CPU1
>
> __swap_duplicate()
>   get_swap_device()
>     // sees SWP_VALID set
>                                 swapoff
>                                   p->flags &= ~SWP_VALID;
>                                   spin_unlock(&p->lock);  // pair w/ smp_mb
>                                   ...
>                                   stop_machine(...)
>                                   p->swap_map = NULL;
>   preempt_disable()
>   read NULL p->swap_map

Andrea has helped to explain this.

Best Regards,
Huang, Ying
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
> > +	if (!si)
> > +		goto bad_nofile;
> > +
> > +	preempt_disable();
> > +	if (!(si->flags & SWP_VALID))
> > +		goto unlock_out;
>
> After Hugh alluded to barriers, it seems the read of SWP_VALID could be
> reordered with the write in preempt_disable at runtime.  Without
> smp_mb() between the two, couldn't this happen, however unlikely a race
> it is?
>
> CPU0                            CPU1
>
> __swap_duplicate()
>   get_swap_device()
>     // sees SWP_VALID set
>                                 swapoff
>                                   p->flags &= ~SWP_VALID;
>                                   spin_unlock(&p->lock);  // pair w/ smp_mb
>                                   ...
>                                   stop_machine(...)
>                                   p->swap_map = NULL;
>   preempt_disable()
>   read NULL p->swap_map

I don't think that that smp_mb() is necessary.  I elaborate:

An important piece of information, I think, that is missing in the
diagram above is the stopper thread which executes the work queued
by stop_machine().  We have two cases to consider, that is,

  1) the stopper is "executed before" the preempt-disable section

	CPU0

	cpu_stopper_thread()
	...
	preempt_disable()
	...
	preempt_enable()

  2) the stopper is "executed after" the preempt-disable section

	CPU0

	preempt_disable()
	...
	preempt_enable()
	...
	cpu_stopper_thread()

Notice that the reads from p->flags and p->swap_map in CPU0 cannot
cross cpu_stopper_thread().  The claim is that CPU0 sees SWP_VALID
unset in (1) and that it sees a non-NULL p->swap_map in (2).

I consider the two cases separately:

  1) CPU1 unsets SWP_VALID, it locks the stopper's lock, and it
     queues the stopper work; CPU0 locks the stopper's lock, it
     dequeues this work, and it reads from p->flags.

     Diagrammatically, we have the following MP-like pattern:

	CPU0                            CPU1

	lock(stopper->lock)             p->flags &= ~SWP_VALID
	get @work                       lock(stopper->lock)
	unlock(stopper->lock)           add @work
	reads p->flags                  unlock(stopper->lock)

     where CPU0 must see SWP_VALID unset (if CPU0 sees the work
     added by CPU1).

  2) CPU0 reads from p->swap_map, it locks the completion lock,
     and it signals completion; CPU1 locks the completion lock,
     it checks for completion, and it writes to p->swap_map.

     (If CPU0 doesn't signal the completion, or CPU1 doesn't see
     the completion, then CPU1 will have to iterate the read and
     to postpone the control-dependent write to p->swap_map.)

     Diagrammatically, we have the following LB-like pattern:

	CPU0                            CPU1

	reads p->swap_map               lock(completion)
	lock(completion)                read completion->done
	completion->done++              unlock(completion)
	unlock(completion)              p->swap_map = NULL

     where CPU0 must see a non-NULL p->swap_map if CPU1 sees the
     completion from CPU0.

Does this make sense?

  Andrea
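[Andrea's case (1) has a standard message-passing shape and can be checked mechanically against the Linux-kernel memory model.  The following is my own transcription of the MP-like diagram into herd7 litmus syntax (tools/memory-model); the variable names are mine, and the stopper's lock is modeled as an ordinary spinlock.  If the transcription is right, herd7 reports the `exists` clause as never satisfied, matching the claim that CPU0 cannot dequeue the work yet read a stale SWP_VALID.]

```
C swapoff-stopper-MP

{
	flags = 1;	(* SWP_VALID initially set *)
}

P0(spinlock_t *stopper_lock, int *flags, int *work)
{
	WRITE_ONCE(*flags, 0);		(* p->flags &= ~SWP_VALID *)
	spin_lock(stopper_lock);
	WRITE_ONCE(*work, 1);		(* add @work *)
	spin_unlock(stopper_lock);
}

P1(spinlock_t *stopper_lock, int *flags, int *work)
{
	int r0;
	int r1;

	spin_lock(stopper_lock);
	r0 = READ_ONCE(*work);		(* get @work *)
	spin_unlock(stopper_lock);
	r1 = READ_ONCE(*flags);		(* reads p->flags *)
}

exists (1:r0=1 /\ 1:r1=1)
```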
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
On Mon, Feb 11, 2019 at 04:38:46PM +0800, Huang, Ying wrote:
> +struct swap_info_struct *get_swap_device(swp_entry_t entry)
> +{
> +	struct swap_info_struct *si;
> +	unsigned long type, offset;
> +
> +	if (!entry.val)
> +		goto out;
> +	type = swp_type(entry);
> +	si = swap_type_to_swap_info(type);

These lines can be collapsed into swp_swap_info if you want.

> +	if (!si)
> +		goto bad_nofile;
> +
> +	preempt_disable();
> +	if (!(si->flags & SWP_VALID))
> +		goto unlock_out;

After Hugh alluded to barriers, it seems the read of SWP_VALID could be
reordered with the write in preempt_disable at runtime.  Without
smp_mb() between the two, couldn't this happen, however unlikely a race
it is?

CPU0                            CPU1

__swap_duplicate()
  get_swap_device()
    // sees SWP_VALID set
                                swapoff
                                  p->flags &= ~SWP_VALID;
                                  spin_unlock(&p->lock);  // pair w/ smp_mb
                                  ...
                                  stop_machine(...)
                                  p->swap_map = NULL;
  preempt_disable()
  read NULL p->swap_map
[PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
From: Huang Ying

When swapin is performed, after getting the swap entry information from
the page table, the system will swap in the swap entry without any lock
held to prevent the swap device from being swapped off.  This may cause
a race like the one below,

CPU 1                           CPU 2
-----                           -----
do_swap_page
  swapin_readahead
    __read_swap_cache_async
                                swapoff
      swapcache_prepare           p->swap_map = NULL
        __swap_duplicate
          p->swap_map[?] /* !!! NULL pointer access */

Because swapoff is usually done only at system shutdown, the race may
not hit many people in practice.  But it is still a race that needs to
be fixed.

To fix the race, get_swap_device() is added to check whether the
specified swap entry is valid in its swap device.  If so, it will keep
the swap entry valid by preventing the swap device from being swapped
off, until put_swap_device() is called.

Because swapoff() is a very rare code path, to make the normal path run
as fast as possible, disabling preemption + stop_machine() is used to
implement get/put_swap_device() instead of a reference count.  From
get_swap_device() to put_swap_device(), preemption is disabled, so
stop_machine() in swapoff() will wait until put_swap_device() is called.

In addition to the swap_map, cluster_info, etc. data structures in
struct swap_info_struct, the swap cache radix tree will be freed after
swapoff, so this patch fixes the race between swap cache lookup and
swapoff too.

Races between some other swap cache usages protected by disabling
preemption and swapoff are fixed too, by calling stop_machine() between
clearing PageSwapCache() and freeing the swap cache data structure.

An alternative implementation could replace disabling preemption with
rcu_read_lock_sched() and stop_machine() with synchronize_sched().

Signed-off-by: "Huang, Ying"
Not-Nacked-by: Hugh Dickins
Cc: Paul E. McKenney
Cc: Minchan Kim
Cc: Johannes Weiner
Cc: Tim Chen
Cc: Mel Gorman
Cc: Jérôme Glisse
Cc: Michal Hocko
Cc: Andrea Arcangeli
Cc: David Rientjes
Cc: Rik van Riel
Cc: Jan Kara
Cc: Dave Jiang
Cc: Daniel Jordan
Cc: Andrea Parri

Changelog:

v7:

- Rebased on patch: "mm, swap: bounds check swap_info accesses to avoid
  NULL derefs"

v6:

- Add more comments to get_swap_device() to make it clearer about
  possible swapoff or swapoff+swapon.

v5:

- Replace RCU with stop_machine()

v4:

- Use synchronize_rcu() in enable_swap_info() to reduce the overhead of
  normal paths further.

v3:

- Re-implemented with RCU to reduce the overhead of normal paths

v2:

- Re-implemented with SRCU to reduce the overhead of normal paths.

- Avoid checking whether the swap device has been swapped off in
  get_swap_device().  Because we can check the origin of the swap
  entry to make sure the swap device hasn't been swapped off.
---
 include/linux/swap.h |  13 +++-
 mm/memory.c          |   2 +-
 mm/swap_state.c      |  16 -
 mm/swapfile.c        | 150 ++-
 4 files changed, 145 insertions(+), 36 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 649529be91f2..aecd1430d0a9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -175,8 +175,9 @@ enum {
 	SWP_PAGE_DISCARD = (1 << 10),	/* freed swap page-cluster discards */
 	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
+	SWP_VALID	= (1 << 13),	/* swap is valid to be operated on? */
 	/* add others here before... */
-	SWP_SCANNING	= (1 << 13),	/* refcount in scan_swap_map */
+	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
 };

 #define SWAP_CLUSTER_MAX 32UL
@@ -460,7 +461,7 @@ extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
-extern int __swap_count(struct swap_info_struct *si, swp_entry_t entry);
+extern int __swap_count(swp_entry_t entry);
 extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
@@ -470,6 +471,12 @@ extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
 extern void exit_swap_address_space(unsigned int type);
+extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
+
+static inline void put_swap_device(struct swap_info_struct *si)
+{
+	preempt_enable();
+}

 #else /* CONFIG_SWAP */

@@ -576,7 +583,7 @@ static inline int page_swapcount(struct page *page)
 	return 0;
 }

-static inline