Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
On Fri 15-02-19 15:08:36, Huang, Ying wrote:
> Michal Hocko writes:
>
> > On Mon 11-02-19 16:38:46, Huang, Ying wrote:
> >> From: Huang Ying
> >>
> >> When swapin is performed, after getting the swap entry information from
> >> the page table, the system will swap in the swap entry without any lock
> >> held to prevent the swap device from being swapped off.  This may cause
> >> a race like the one below,
> >>
> >> CPU 1                           CPU 2
> >> -----                           -----
> >> do_swap_page
> >>   swapin_readahead
> >>     __read_swap_cache_async
> >>                                 swapoff
> >>       swapcache_prepare           p->swap_map = NULL
> >>         __swap_duplicate
> >>           p->swap_map[?] /* !!! NULL pointer access */
> >>
> >> Because swapoff is usually done only at system shutdown, the race may
> >> not hit many people in practice.  But it is still a race that needs to
> >> be fixed.
> >>
> >> To fix the race, get_swap_device() is added to check whether the
> >> specified swap entry is valid in its swap device.  If so, it will keep
> >> the swap entry valid by preventing the swap device from being swapped
> >> off, until put_swap_device() is called.
> >>
> >> Because swapoff() is a very rare code path, to make the normal path run
> >> as fast as possible, disabling preemption + stop_machine() is used to
> >> implement get/put_swap_device() instead of a reference count.  From
> >> get_swap_device() to put_swap_device(), preemption is disabled, so
> >> stop_machine() in swapoff() will wait until put_swap_device() is called.
> >>
> >> In addition to the swap_map, cluster_info, etc. data structures in
> >> struct swap_info_struct, the swap cache radix tree will be freed after
> >> swapoff, so this patch fixes the race between swap cache lookup and
> >> swapoff too.
> >>
> >> Races between some other swap cache usages protected by disabling
> >> preemption and swapoff are fixed too, by calling stop_machine() between
> >> clearing PageSwapCache() and freeing the swap cache data structure.
> >>
> >> An alternative implementation could replace disabling preemption with
> >> rcu_read_lock_sched() and stop_machine() with synchronize_sched().
> >
> > Using stop_machine() is generally discouraged.  It is a gross
> > synchronization primitive.
> >
> > Besides that, since when do we have this problem?
>
> For the problem, you mean the race between swapoff and the page fault
> handler?

yes

> The problem was introduced in v4.11, when we replaced
> swap_info_struct->lock with swap_cluster_info->lock in
> __swap_duplicate() where possible, to improve the scalability of swap
> operations.  But because swapoff is a really rare operation, I don't
> think it's necessary to backport the fix.

Well, a lack of any bug reports would support your theory that this is
unlikely to hit in practice.  A Fixes: tag would be nice to have
regardless, though.  Thanks!
--
Michal Hocko
SUSE Labs
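[The changelog above rejects a reference count in favor of disabling preemption + stop_machine() to keep the fast path cheap.  For readers following along, the rejected reference-count alternative can be sketched as a self-contained userspace analogue; all names here are hypothetical, and this is illustrative, not the kernel's code.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace analogue of a swap device that can be torn down. */
struct dev {
	atomic_int valid;	/* plays the role of SWP_VALID */
	atomic_int users;	/* plays the role of pinned readers */
	int *map;		/* plays the role of p->swap_map */
};

/* "get_swap_device": pin first, then re-check validity; unpin on failure. */
static int get_dev(struct dev *d)
{
	atomic_fetch_add(&d->users, 1);
	if (!atomic_load(&d->valid)) {
		atomic_fetch_sub(&d->users, 1);
		return 0;
	}
	return 1;
}

/* "put_swap_device": drop the pin. */
static void put_dev(struct dev *d)
{
	atomic_fetch_sub(&d->users, 1);
}

/* "swapoff": unset valid, drain all readers, only then free the map. */
static void dev_off(struct dev *d)
{
	atomic_store(&d->valid, 0);
	while (atomic_load(&d->users) > 0)
		;	/* the role stop_machine()/synchronize_rcu() plays in the patch */
	d->map = NULL;
}
```

The per-access atomic increment/decrement on the fast path is exactly the overhead the changelog wants to avoid; preemption-disable or RCU readers make the pin free.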
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
Andrea Arcangeli writes:

> On Thu, Feb 14, 2019 at 04:07:37PM +0800, Huang, Ying wrote:
>> Before, we chose to use stop_machine() to reduce the overhead of the
>> hot path (the page fault handler) as much as possible.  But now I have
>> found that rcu_read_lock_sched() is just a wrapper around
>> preempt_disable().  So maybe we can switch to the RCU version now.
>
> rcu_read_lock looks more efficient than rcu_read_lock_sched.  So for
> this purpose, in the fast path, rcu_read_lock()/unlock() should be the
> preferred methods; there is no need to force preempt_disable() (except
> for debug purposes if sleep debugging is enabled).  Server builds are
> done with voluntary preempt (no-preempt shouldn't even exist as a
> config option), and there rcu_read_lock might be just a noop.

With CONFIG_PREEMPT_VOLUNTARY=y, CONFIG_PREEMPT=n, and
CONFIG_PREEMPT_COUNT=n, which is common for servers, rcu_read_lock()
will be a noop, while rcu_read_lock_sched() and preempt_disable() will
be barrier().  So rcu_read_lock() is a little better.

> Against a fast path rcu_read_lock/unlock, before the consolidation
> synchronize_rcu would have been enough; now, after the consolidation,
> it is even more certain that it's enough, because it's equivalent to
> _mult.

Yes.  Will change to the rcu_read_lock()/unlock() based method.

Best Regards,
Huang, Ying
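[The rcu_read_lock()-based method being agreed on here is only described, not shown, in the thread.  A kernel-style pseudocode sketch of what it might look like follows; SWP_VALID and the function names come from the quoted patch, the rest is a guess at the shape of the final code, not the actual patch.]

```
/* fast path (sketch): RCU read-side critical section instead of
 * preempt_disable() */
struct swap_info_struct *get_swap_device(swp_entry_t entry)
{
	struct swap_info_struct *si = swp_swap_info(entry);

	if (!si)
		return NULL;
	rcu_read_lock();
	if (!(si->flags & SWP_VALID)) {
		rcu_read_unlock();
		return NULL;
	}
	return si;
}

static inline void put_swap_device(struct swap_info_struct *si)
{
	rcu_read_unlock();
}

/* swapoff side (sketch): synchronize_rcu() instead of stop_machine() */
	p->flags &= ~SWP_VALID;
	...
	synchronize_rcu();	/* wait for all get/put critical sections */
	p->swap_map = NULL;
```

With the consolidated RCU discussed above, synchronize_rcu() waits for both rcu_read_lock() and preempt-disable readers, which is what makes this a drop-in replacement for the stop_machine() scheme.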
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
On Thu, Feb 14, 2019 at 04:07:37PM +0800, Huang, Ying wrote:
> Before, we chose to use stop_machine() to reduce the overhead of the
> hot path (the page fault handler) as much as possible.  But now I have
> found that rcu_read_lock_sched() is just a wrapper around
> preempt_disable().  So maybe we can switch to the RCU version now.

rcu_read_lock looks more efficient than rcu_read_lock_sched.  So for
this purpose, in the fast path, rcu_read_lock()/unlock() should be the
preferred methods; there is no need to force preempt_disable() (except
for debug purposes if sleep debugging is enabled).  Server builds are
done with voluntary preempt (no-preempt shouldn't even exist as a
config option), and there rcu_read_lock might be just a noop.

Against a fast path rcu_read_lock/unlock, before the consolidation
synchronize_rcu would have been enough; now, after the consolidation,
it is even more certain that it's enough, because it's equivalent to
_mult.
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
Hello,

On Thu, Feb 14, 2019 at 12:30:02PM -0800, Andrew Morton wrote:
> This was discussed to death and I think the changelog explains the
> conclusions adequately.  swapoff is super-rare, so a stop_machine() in
> that path is appropriate if its use permits more efficiency in the
> regular swap code paths.

The problem is precisely that the way the stop_machine callback is
implemented right now (a dummy noop) makes the stop_machine() solution
fully equivalent to RCU from the fast path point of view.  It does not
permit more efficiency in the fast path, which is all we care about.

From the slow path point of view the only difference is possibly that
stop_machine will reach the quiescent state faster (i.e. swapoff may
return a few dozen milliseconds sooner), but nobody cares about the
latency of swapoff, and it's actually nicer if swapoff doesn't stop all
CPUs on large systems and uses less CPU overall.

This is why I suggested that, if we keep using stop_machine(), we
should not use a dummy function whose only purpose is to reach a
quiescent state (which is something more efficiently achieved with the
synchronize_rcu/sched/kernel RCU API of the day), but should instead
try to leverage the UP-like serialization to remove more spinlocks from
the fast path and convert them to preempt_disable().  However, the
current dummy callback cannot achieve that higher efficiency in the
fast paths; the code would need to be reshuffled to try to remove at
least the swap_lock.  If no spinlock is converted to preempt_disable(),
I don't see the point of stop_machine() over RCU.

On a side note, the cmpxchg machinery I posted to run the function
simultaneously on all CPUs may actually be superfluous if using
cpus=NULL like Huang suggested.

Implementation details aside, the idea of stop_machine would still be
to do the p->swap_map = NULL assignment, and everything else protected
by the swap_lock, inside the callback that runs as if on a UP system,
to speed up the fast path further.

Thanks,
Andrea
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
On Thu, 14 Feb 2019 15:33:18 +0100 Michal Hocko wrote:

> > Because swapoff() is a very rare code path, to make the normal path
> > run as fast as possible, disabling preemption + stop_machine() is
> > used to implement get/put_swap_device() instead of a reference
> > count.  From get_swap_device() to put_swap_device(), preemption is
> > disabled, so stop_machine() in swapoff() will wait until
> > put_swap_device() is called.
> >
> > In addition to the swap_map, cluster_info, etc. data structures in
> > struct swap_info_struct, the swap cache radix tree will be freed
> > after swapoff, so this patch fixes the race between swap cache
> > lookup and swapoff too.
> >
> > Races between some other swap cache usages protected by disabling
> > preemption and swapoff are fixed too, by calling stop_machine()
> > between clearing PageSwapCache() and freeing the swap cache data
> > structure.
> >
> > An alternative implementation could replace disabling preemption
> > with rcu_read_lock_sched() and stop_machine() with
> > synchronize_sched().
>
> Using stop_machine() is generally discouraged.  It is a gross
> synchronization primitive.

This was discussed to death and I think the changelog explains the
conclusions adequately.  swapoff is super-rare, so a stop_machine() in
that path is appropriate if its use permits more efficiency in the
regular swap code paths.

> Besides that, since when do we have this problem?

What problem??
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
On Mon 11-02-19 16:38:46, Huang, Ying wrote:
> From: Huang Ying
>
> When swapin is performed, after getting the swap entry information from
> the page table, the system will swap in the swap entry without any lock
> held to prevent the swap device from being swapped off.  This may cause
> a race like the one below,
>
> CPU 1                           CPU 2
> -----                           -----
> do_swap_page
>   swapin_readahead
>     __read_swap_cache_async
>                                 swapoff
>       swapcache_prepare           p->swap_map = NULL
>         __swap_duplicate
>           p->swap_map[?] /* !!! NULL pointer access */
>
> Because swapoff is usually done only at system shutdown, the race may
> not hit many people in practice.  But it is still a race that needs to
> be fixed.
>
> To fix the race, get_swap_device() is added to check whether the
> specified swap entry is valid in its swap device.  If so, it will keep
> the swap entry valid by preventing the swap device from being swapped
> off, until put_swap_device() is called.
>
> Because swapoff() is a very rare code path, to make the normal path run
> as fast as possible, disabling preemption + stop_machine() is used to
> implement get/put_swap_device() instead of a reference count.  From
> get_swap_device() to put_swap_device(), preemption is disabled, so
> stop_machine() in swapoff() will wait until put_swap_device() is called.
>
> In addition to the swap_map, cluster_info, etc. data structures in
> struct swap_info_struct, the swap cache radix tree will be freed after
> swapoff, so this patch fixes the race between swap cache lookup and
> swapoff too.
>
> Races between some other swap cache usages protected by disabling
> preemption and swapoff are fixed too, by calling stop_machine() between
> clearing PageSwapCache() and freeing the swap cache data structure.
>
> An alternative implementation could replace disabling preemption with
> rcu_read_lock_sched() and stop_machine() with synchronize_sched().

Using stop_machine() is generally discouraged.  It is a gross
synchronization primitive.

Besides that, since when do we have this problem?
> Signed-off-by: "Huang, Ying"
> Not-Nacked-by: Hugh Dickins
> Cc: Paul E. McKenney
> Cc: Minchan Kim
> Cc: Johannes Weiner
> Cc: Tim Chen
> Cc: Mel Gorman
> Cc: Jérôme Glisse
> Cc: Michal Hocko
> Cc: Andrea Arcangeli
> Cc: David Rientjes
> Cc: Rik van Riel
> Cc: Jan Kara
> Cc: Dave Jiang
> Cc: Daniel Jordan
> Cc: Andrea Parri
>
> Changelog:
>
> v7:
>
> - Rebased on patch: "mm, swap: bounds check swap_info accesses to avoid
>   NULL derefs"
>
> v6:
>
> - Add more comments to get_swap_device() to make it clearer about
>   possible swapoff or swapoff+swapon.
>
> v5:
>
> - Replace RCU with stop_machine()
>
> v4:
>
> - Use synchronize_rcu() in enable_swap_info() to reduce the overhead of
>   normal paths further.
>
> v3:
>
> - Re-implemented with RCU to reduce the overhead of normal paths
>
> v2:
>
> - Re-implemented with SRCU to reduce the overhead of normal paths.
>
> - Avoid checking whether the swap device has been swapped off in
>   get_swap_device().  Because we can check the origin of the swap
>   entry to make sure the swap device hasn't been swapped off.
> ---
>  include/linux/swap.h |  13 +++-
>  mm/memory.c          |   2 +-
>  mm/swap_state.c      |  16 -
>  mm/swapfile.c        | 150 ++-
>  4 files changed, 145 insertions(+), 36 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 649529be91f2..aecd1430d0a9 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -175,8 +175,9 @@ enum {
>  	SWP_PAGE_DISCARD = (1 << 10),	/* freed swap page-cluster discards */
>  	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
>  	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
> +	SWP_VALID	= (1 << 13),	/* swap is valid to be operated on? */
>  	/* add others here before... */
> -	SWP_SCANNING	= (1 << 13),	/* refcount in scan_swap_map */
> +	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
>  };
>
>  #define SWAP_CLUSTER_MAX 32UL
> @@ -460,7 +461,7 @@ extern unsigned int count_swap_pages(int, int);
>  extern sector_t map_swap_page(struct page *, struct block_device **);
>  extern sector_t swapdev_block(int, pgoff_t);
>  extern int page_swapcount(struct page *);
> -extern int __swap_count(struct swap_info_struct *si, swp_entry_t entry);
> +extern int __swap_count(swp_entry_t entry);
>  extern int __swp_swapcount(swp_entry_t entry);
>  extern int swp_swapcount(swp_entry_t entry);
>  extern struct swap_info_struct *page_swap_info(struct page *);
> @@ -470,6 +471,12 @@ extern int try_to_free_swap(struct page *);
>  struct backing_dev_info;
>  extern int
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
Hello everyone,

On Mon, Feb 11, 2019 at 04:38:46PM +0800, Huang, Ying wrote:
> @@ -2386,7 +2463,17 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
>  	frontswap_init(p->type, frontswap_map);
>  	spin_lock(&swap_lock);
>  	spin_lock(&p->lock);
> -	_enable_swap_info(p, prio, swap_map, cluster_info);
> +	setup_swap_info(p, prio, swap_map, cluster_info);
> +	spin_unlock(&p->lock);
> +	spin_unlock(&swap_lock);
> +	/*
> +	 * Guarantee swap_map, cluster_info, etc. fields are used
> +	 * between get/put_swap_device() only if SWP_VALID bit is set
> +	 */
> +	stop_machine(swap_onoff_stop, NULL, cpu_online_mask);

Should cpu_online_mask be read while holding cpus_read_lock?

	cpus_read_lock();
	err = __stop_machine(swap_onoff_stop, NULL, cpu_online_mask);
	cpus_read_unlock();

I missed what the exact motivation was for the switch from
rcu_read_lock()/synchronize_rcu() to preempt_disable()/stop_machine().

It looks like all the above stop_machine() does is reach a quiescent
point, but you have RCU, which can already reach the quiescent point
without an explicit stop_machine.

The reason both implementations look basically the same is that the
stop_machine dummy call of swap_onoff_stop() { /* noop */ } will only
reach a quiescent point faster than RCU; it's otherwise functionally
identical to RCU, but extremely more expensive.  If it weren't
functionally identical, stop_machine() couldn't have been used as a
drop-in replacement for synchronize_sched() in the previous patch.  I
don't see the point of worrying about the synchronize_rcu latency in
swapoff when RCU is basically identical and not more complex.

So, to be clear, I'm not against stop_machine(), but with the
stop_machine() method invoked on all CPUs you can actually do more than
RCU: you can remove real locking, not just reach a quiescent point.
With stop_machine() the code would need reshuffling so that the actual
p->swap_map = NULL happens inside stop_machine, not outside like with
RCU.

With RCU all code stays concurrent at all times, simply the race is
controlled; as opposed, with stop_machine() you can fully serialize and
run like on a UP system temporarily (vs all preempt_disable() sections
at least).  For example nr_swapfiles could in theory become a constant
under preempt_disable() with stop_machine(), without having to take the
swap_lock.

swap_onoff_stop can be implemented like this:

enum {
	FIRST_STOP_MACHINE_INIT,
	FIRST_STOP_MACHINE_START,
	FIRST_STOP_MACHINE_END,
};
static int first_stop_machine;

static int swap_onoff_stop(void *data)
{
	struct swap_stop_machine *swsm = (struct swap_stop_machine *)data;
	int first;

	first = cmpxchg(&first_stop_machine, FIRST_STOP_MACHINE_INIT,
			FIRST_STOP_MACHINE_START);
	if (first == FIRST_STOP_MACHINE_INIT) {
		swsm->p->swap_map = NULL;
		/* add more stuff here until the swap_lock goes away */
		smp_wmb();
		WRITE_ONCE(first_stop_machine, FIRST_STOP_MACHINE_END);
	} else {
		do {
			cpu_relax();
		} while (READ_ONCE(first_stop_machine) !=
			 FIRST_STOP_MACHINE_END);
		smp_rmb();
	}

	return 0;
}

stop_machine invoked with a method like the above will guarantee that
while we set p->swap_map to NULL (and while we do nr_swapfiles++)
nothing else can run, not even interrupts, so some locks may just
disappear.  Only NMI and SMI could possibly run concurrently with the
swsm->p->swap_map = NULL operation.

If we have to keep swap_onoff_stop() a dummy function run on all CPUs
just to reach a quiescent point, then I don't see why the
synchronize_rcu() (or synchronize_sched or synchronize_kernel or
whatever it is called right now, but still RCU) solution isn't
preferable.

Thanks,
Andrea
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
Tim Chen writes:

> On 2/11/19 10:47 PM, Huang, Ying wrote:
>> Andrea Parri writes:
>>
>>>>> +	if (!si)
>>>>> +		goto bad_nofile;
>>>>> +
>>>>> +	preempt_disable();
>>>>> +	if (!(si->flags & SWP_VALID))
>>>>> +		goto unlock_out;
>>>>
>>>> After Hugh alluded to barriers, it seems the read of SWP_VALID could
>>>> be reordered with the write in preempt_disable at runtime.  Without
>>>> smp_mb() between the two, couldn't this happen, however unlikely a
>>>> race it is?
>>>>
>>>> CPU0                            CPU1
>>>>
>>>> __swap_duplicate()
>>>>   get_swap_device()
>>>>     // sees SWP_VALID set
>>>>                                 swapoff
>>>>                                   p->flags &= ~SWP_VALID;
>>>>                                   spin_unlock(&p->lock);  // pair w/ smp_mb
>>>>                                   ...
>>>>                                   stop_machine(...)
>>>>                                   p->swap_map = NULL;
>>>>   preempt_disable()
>>>>   read NULL p->swap_map
>>>
>>> I don't think that that smp_mb() is necessary.  I elaborate:
>>>
>>> An important piece of information, I think, that is missing in the
>>> diagram above is the stopper thread which executes the work queued
>>> by stop_machine().  We have two cases to consider, that is,
>>>
>>>   1) the stopper is "executed before" the preempt-disable section
>>>
>>>	CPU0
>>>
>>>	cpu_stopper_thread()
>>>	...
>>>	preempt_disable()
>>>	...
>>>	preempt_enable()
>>>
>>>   2) the stopper is "executed after" the preempt-disable section
>>>
>>>	CPU0
>>>
>>>	preempt_disable()
>>>	...
>>>	preempt_enable()
>>>	...
>>>	cpu_stopper_thread()
>>>
>>> Notice that the reads from p->flags and p->swap_map in CPU0 cannot
>>> cross cpu_stopper_thread().  The claim is that CPU0 sees SWP_VALID
>>> unset in (1) and that it sees a non-NULL p->swap_map in (2).
>>>
>>> I consider the two cases separately:
>>>
>>>   1) CPU1 unsets SWP_VALID, it locks the stopper's lock, and it
>>>      queues the stopper work; CPU0 locks the stopper's lock, it
>>>      dequeues this work, and it reads from p->flags.
>>>
>>>      Diagrammatically, we have the following MP-like pattern:
>>>
>>>	CPU0                            CPU1
>>>
>>>	lock(stopper->lock)             p->flags &= ~SWP_VALID
>>>	get @work                       lock(stopper->lock)
>>>	unlock(stopper->lock)           add @work
>>>	reads p->flags                  unlock(stopper->lock)
>>>
>>>      where CPU0 must see SWP_VALID unset (if CPU0 sees the work
>>>      added by CPU1).
>>>
>>>   2) CPU0 reads from p->swap_map, it locks the completion lock,
>>>      and it signals completion; CPU1 locks the completion lock,
>>>      it checks for completion, and it writes to p->swap_map.
>>>
>>>      (If CPU0 doesn't signal the completion, or CPU1 doesn't see
>>>      the completion, then CPU1 will have to iterate the read and
>>>      to postpone the control-dependent write to p->swap_map.)
>>>
>>>      Diagrammatically, we have the following LB-like pattern:
>>>
>>>	CPU0                            CPU1
>>>
>>>	reads p->swap_map               lock(completion)
>>>	lock(completion)                read completion->done
>>>	completion->done++              unlock(completion)
>>>	unlock(completion)              p->swap_map = NULL
>>>
>>>      where CPU0 must see a non-NULL p->swap_map if CPU1 sees the
>>>      completion from CPU0.
>>>
>>> Does this make sense?
>>
>> Thanks a lot for detailed explanation!
>
> This is certainly a non-trivial explanation of why a memory barrier is
> not needed.  Can we put it in the commit log and mention something in
> the comments on why we don't need a memory barrier?

Good idea!  Will do this.

Best Regards,
Huang, Ying

> Thanks.
>
> Tim
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
On Tue, Feb 12, 2019 at 04:21:21AM +0100, Andrea Parri wrote:
> > > +	if (!si)
> > > +		goto bad_nofile;
> > > +
> > > +	preempt_disable();
> > > +	if (!(si->flags & SWP_VALID))
> > > +		goto unlock_out;
> >
> > After Hugh alluded to barriers, it seems the read of SWP_VALID could
> > be reordered with the write in preempt_disable at runtime.  Without
> > smp_mb() between the two, couldn't this happen, however unlikely a
> > race it is?
> >
> > CPU0                            CPU1
> >
> > __swap_duplicate()
> >   get_swap_device()
> >     // sees SWP_VALID set
> >                                 swapoff
> >                                   p->flags &= ~SWP_VALID;
> >                                   spin_unlock(&p->lock);  // pair w/ smp_mb
> >                                   ...
> >                                   stop_machine(...)
> >                                   p->swap_map = NULL;
> >   preempt_disable()
> >   read NULL p->swap_map
>
> I don't think that that smp_mb() is necessary.  I elaborate:
>
> An important piece of information, I think, that is missing in the
> diagram above is the stopper thread which executes the work queued
> by stop_machine().  We have two cases to consider, that is,
>
>   1) the stopper is "executed before" the preempt-disable section
>
>	CPU0
>
>	cpu_stopper_thread()
>	...
>	preempt_disable()
>	...
>	preempt_enable()
>
>   2) the stopper is "executed after" the preempt-disable section
>
>	CPU0
>
>	preempt_disable()
>	...
>	preempt_enable()
>	...
>	cpu_stopper_thread()
>
> Notice that the reads from p->flags and p->swap_map in CPU0 cannot
> cross cpu_stopper_thread().  The claim is that CPU0 sees SWP_VALID
> unset in (1) and that it sees a non-NULL p->swap_map in (2).
>
> I consider the two cases separately:
>
>   1) CPU1 unsets SWP_VALID, it locks the stopper's lock, and it
>      queues the stopper work; CPU0 locks the stopper's lock, it
>      dequeues this work, and it reads from p->flags.
>
>      Diagrammatically, we have the following MP-like pattern:
>
>	CPU0                            CPU1
>
>	lock(stopper->lock)             p->flags &= ~SWP_VALID
>	get @work                       lock(stopper->lock)
>	unlock(stopper->lock)           add @work
>	reads p->flags                  unlock(stopper->lock)
>
>      where CPU0 must see SWP_VALID unset (if CPU0 sees the work
>      added by CPU1).
>
>   2) CPU0 reads from p->swap_map, it locks the completion lock,
>      and it signals completion; CPU1 locks the completion lock,
>      it checks for completion, and it writes to p->swap_map.
>
>      (If CPU0 doesn't signal the completion, or CPU1 doesn't see
>      the completion, then CPU1 will have to iterate the read and
>      to postpone the control-dependent write to p->swap_map.)
>
>      Diagrammatically, we have the following LB-like pattern:
>
>	CPU0                            CPU1
>
>	reads p->swap_map               lock(completion)
>	lock(completion)                read completion->done
>	completion->done++              unlock(completion)
>	unlock(completion)              p->swap_map = NULL
>
>      where CPU0 must see a non-NULL p->swap_map if CPU1 sees the
>      completion from CPU0.
>
> Does this make sense?

Yes, thanks for this, Andrea!  Good that smp_mb isn't needed.
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
On 2/11/19 10:47 PM, Huang, Ying wrote:
> Andrea Parri writes:
>
>>>> +	if (!si)
>>>> +		goto bad_nofile;
>>>> +
>>>> +	preempt_disable();
>>>> +	if (!(si->flags & SWP_VALID))
>>>> +		goto unlock_out;
>>>
>>> After Hugh alluded to barriers, it seems the read of SWP_VALID could
>>> be reordered with the write in preempt_disable at runtime.  Without
>>> smp_mb() between the two, couldn't this happen, however unlikely a
>>> race it is?
>>>
>>> CPU0                            CPU1
>>>
>>> __swap_duplicate()
>>>   get_swap_device()
>>>     // sees SWP_VALID set
>>>                                 swapoff
>>>                                   p->flags &= ~SWP_VALID;
>>>                                   spin_unlock(&p->lock);  // pair w/ smp_mb
>>>                                   ...
>>>                                   stop_machine(...)
>>>                                   p->swap_map = NULL;
>>>   preempt_disable()
>>>   read NULL p->swap_map
>>
>> I don't think that that smp_mb() is necessary.  I elaborate:
>>
>> An important piece of information, I think, that is missing in the
>> diagram above is the stopper thread which executes the work queued
>> by stop_machine().  We have two cases to consider, that is,
>>
>>   1) the stopper is "executed before" the preempt-disable section
>>
>>	CPU0
>>
>>	cpu_stopper_thread()
>>	...
>>	preempt_disable()
>>	...
>>	preempt_enable()
>>
>>   2) the stopper is "executed after" the preempt-disable section
>>
>>	CPU0
>>
>>	preempt_disable()
>>	...
>>	preempt_enable()
>>	...
>>	cpu_stopper_thread()
>>
>> Notice that the reads from p->flags and p->swap_map in CPU0 cannot
>> cross cpu_stopper_thread().  The claim is that CPU0 sees SWP_VALID
>> unset in (1) and that it sees a non-NULL p->swap_map in (2).
>>
>> I consider the two cases separately:
>>
>>   1) CPU1 unsets SWP_VALID, it locks the stopper's lock, and it
>>      queues the stopper work; CPU0 locks the stopper's lock, it
>>      dequeues this work, and it reads from p->flags.
>>
>>      Diagrammatically, we have the following MP-like pattern:
>>
>>	CPU0                            CPU1
>>
>>	lock(stopper->lock)             p->flags &= ~SWP_VALID
>>	get @work                       lock(stopper->lock)
>>	unlock(stopper->lock)           add @work
>>	reads p->flags                  unlock(stopper->lock)
>>
>>      where CPU0 must see SWP_VALID unset (if CPU0 sees the work
>>      added by CPU1).
>>
>>   2) CPU0 reads from p->swap_map, it locks the completion lock,
>>      and it signals completion; CPU1 locks the completion lock,
>>      it checks for completion, and it writes to p->swap_map.
>>
>>      (If CPU0 doesn't signal the completion, or CPU1 doesn't see
>>      the completion, then CPU1 will have to iterate the read and
>>      to postpone the control-dependent write to p->swap_map.)
>>
>>      Diagrammatically, we have the following LB-like pattern:
>>
>>	CPU0                            CPU1
>>
>>	reads p->swap_map               lock(completion)
>>	lock(completion)                read completion->done
>>	completion->done++              unlock(completion)
>>	unlock(completion)              p->swap_map = NULL
>>
>>      where CPU0 must see a non-NULL p->swap_map if CPU1 sees the
>>      completion from CPU0.
>>
>> Does this make sense?
>
> Thanks a lot for detailed explanation!

This is certainly a non-trivial explanation of why a memory barrier is
not needed.  Can we put it in the commit log and mention something in
the comments on why we don't need a memory barrier?

Thanks.

Tim
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
> An alternative implementation could replace disabling preemption with
> rcu_read_lock_sched() and stop_machine() with synchronize_sched().

JFYI, starting with v4.20-rc1, synchronize_rcu{,_expedited}() also wait
for preempt-disable sections (the intent seems to be to retire the
RCU-sched update-side API), so here you could instead use
preempt-disable + synchronize_rcu{,_expedited}().

This LWN article gives an overview of the latest RCU API/semantics
changes: https://lwn.net/Articles/777036/.

  Andrea
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
Daniel Jordan writes:

> On Mon, Feb 11, 2019 at 04:38:46PM +0800, Huang, Ying wrote:
>> +struct swap_info_struct *get_swap_device(swp_entry_t entry)
>> +{
>> +	struct swap_info_struct *si;
>> +	unsigned long type, offset;
>> +
>> +	if (!entry.val)
>> +		goto out;
>> +	type = swp_type(entry);
>> +	si = swap_type_to_swap_info(type);
>
> These lines can be collapsed into swp_swap_info if you want.

Yes.  I can use that function to remove another line from the patch.
Thanks!  Will do that.

>> +	if (!si)
>> +		goto bad_nofile;
>> +
>> +	preempt_disable();
>> +	if (!(si->flags & SWP_VALID))
>> +		goto unlock_out;
>
> After Hugh alluded to barriers, it seems the read of SWP_VALID could be
> reordered with the write in preempt_disable at runtime.  Without
> smp_mb() between the two, couldn't this happen, however unlikely a race
> it is?
>
> CPU0                            CPU1
>
> __swap_duplicate()
>   get_swap_device()
>     // sees SWP_VALID set
>                                 swapoff
>                                   p->flags &= ~SWP_VALID;
>                                   spin_unlock(&p->lock);  // pair w/ smp_mb
>                                   ...
>                                   stop_machine(...)
>                                   p->swap_map = NULL;
>   preempt_disable()
>   read NULL p->swap_map

Andrea has helped to explain this.

Best Regards,
Huang, Ying
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
> > +	if (!si)
> > +		goto bad_nofile;
> > +
> > +	preempt_disable();
> > +	if (!(si->flags & SWP_VALID))
> > +		goto unlock_out;
>
> After Hugh alluded to barriers, it seems the read of SWP_VALID could be
> reordered with the write in preempt_disable at runtime.  Without
> smp_mb() between the two, couldn't this happen, however unlikely a race
> it is?
>
> CPU0                            CPU1
>
> __swap_duplicate()
>   get_swap_device()
>     // sees SWP_VALID set
>                                 swapoff
>                                   p->flags &= ~SWP_VALID;
>                                   spin_unlock(&p->lock);  // pair w/ smp_mb
>                                   ...
>                                   stop_machine(...)
>                                   p->swap_map = NULL;
>   preempt_disable()
>   read NULL p->swap_map

I don't think that that smp_mb() is necessary.  I elaborate:

An important piece of information, I think, that is missing in the
diagram above is the stopper thread which executes the work queued
by stop_machine().  We have two cases to consider, that is,

  1) the stopper is "executed before" the preempt-disable section

	CPU0

	cpu_stopper_thread()
	...
	preempt_disable()
	...
	preempt_enable()

  2) the stopper is "executed after" the preempt-disable section

	CPU0

	preempt_disable()
	...
	preempt_enable()
	...
	cpu_stopper_thread()

Notice that the reads from p->flags and p->swap_map in CPU0 cannot
cross cpu_stopper_thread().  The claim is that CPU0 sees SWP_VALID
unset in (1) and that it sees a non-NULL p->swap_map in (2).

I consider the two cases separately:

  1) CPU1 unsets SWP_VALID, it locks the stopper's lock, and it
     queues the stopper work; CPU0 locks the stopper's lock, it
     dequeues this work, and it reads from p->flags.

     Diagrammatically, we have the following MP-like pattern:

	CPU0                            CPU1

	lock(stopper->lock)             p->flags &= ~SWP_VALID
	get @work                       lock(stopper->lock)
	unlock(stopper->lock)           add @work
	reads p->flags                  unlock(stopper->lock)

     where CPU0 must see SWP_VALID unset (if CPU0 sees the work
     added by CPU1).

  2) CPU0 reads from p->swap_map, it locks the completion lock,
     and it signals completion; CPU1 locks the completion lock,
     it checks for completion, and it writes to p->swap_map.

     (If CPU0 doesn't signal the completion, or CPU1 doesn't see
     the completion, then CPU1 will have to iterate the read and
     to postpone the control-dependent write to p->swap_map.)

     Diagrammatically, we have the following LB-like pattern:

	CPU0                            CPU1

	reads p->swap_map               lock(completion)
	lock(completion)                read completion->done
	completion->done++              unlock(completion)
	unlock(completion)              p->swap_map = NULL

     where CPU0 must see a non-NULL p->swap_map if CPU1 sees the
     completion from CPU0.

Does this make sense?

  Andrea
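[Andrea's case (1) has a standard message-passing shape and can be checked mechanically against the Linux-kernel memory model.  The following is my own transcription of the MP-like diagram into herd7 litmus syntax (tools/memory-model); the variable names are mine, and the stopper's lock is modeled as an ordinary spinlock.  If the transcription is right, herd7 reports the `exists` clause as never satisfied, matching the claim that CPU0 cannot dequeue the work yet read a stale SWP_VALID.]

```
C swapoff-stopper-MP

{
	flags = 1;	(* SWP_VALID initially set *)
}

P0(spinlock_t *stopper_lock, int *flags, int *work)
{
	WRITE_ONCE(*flags, 0);		(* p->flags &= ~SWP_VALID *)
	spin_lock(stopper_lock);
	WRITE_ONCE(*work, 1);		(* add @work *)
	spin_unlock(stopper_lock);
}

P1(spinlock_t *stopper_lock, int *flags, int *work)
{
	int r0;
	int r1;

	spin_lock(stopper_lock);
	r0 = READ_ONCE(*work);		(* get @work *)
	spin_unlock(stopper_lock);
	r1 = READ_ONCE(*flags);		(* reads p->flags *)
}

exists (1:r0=1 /\ 1:r1=1)
```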
Re: [PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
On Mon, Feb 11, 2019 at 04:38:46PM +0800, Huang, Ying wrote:
> +struct swap_info_struct *get_swap_device(swp_entry_t entry)
> +{
> +	struct swap_info_struct *si;
> +	unsigned long type, offset;
> +
> +	if (!entry.val)
> +		goto out;
> +	type = swp_type(entry);
> +	si = swap_type_to_swap_info(type);

These lines can be collapsed into swp_swap_info if you want.

> +	if (!si)
> +		goto bad_nofile;
> +
> +	preempt_disable();
> +	if (!(si->flags & SWP_VALID))
> +		goto unlock_out;

After Hugh alluded to barriers, it seems the read of SWP_VALID could be
reordered with the write in preempt_disable at runtime.  Without
smp_mb() between the two, couldn't this happen, however unlikely a race
it is?

CPU0                            CPU1

__swap_duplicate()
  get_swap_device()
    // sees SWP_VALID set
                                swapoff
                                  p->flags &= ~SWP_VALID;
                                  spin_unlock(&p->lock);  // pair w/ smp_mb
                                  ...
                                  stop_machine(...)
                                  p->swap_map = NULL;
  preempt_disable()
  read NULL p->swap_map
[PATCH -mm -V7] mm, swap: fix race between swapoff and some swap operations
From: Huang Ying

When swapin is performed, after getting the swap entry information from
the page table, the system will swap in the swap entry without any lock
held to prevent the swap device from being swapped off.  This may cause
a race like the one below,

CPU 1                           CPU 2
-----                           -----
do_swap_page
  swapin_readahead
    __read_swap_cache_async
                                swapoff
      swapcache_prepare           p->swap_map = NULL
        __swap_duplicate
          p->swap_map[?] /* !!! NULL pointer access */

Because swapoff is usually done only at system shutdown, the race may
not hit many people in practice.  But it is still a race that needs to
be fixed.

To fix the race, get_swap_device() is added to check whether the
specified swap entry is valid in its swap device.  If so, it will keep
the swap entry valid by preventing the swap device from being swapped
off, until put_swap_device() is called.

Because swapoff() is a very rare code path, to make the normal path run
as fast as possible, disabling preemption + stop_machine() is used to
implement get/put_swap_device() instead of a reference count.  From
get_swap_device() to put_swap_device(), preemption is disabled, so
stop_machine() in swapoff() will wait until put_swap_device() is called.

In addition to the swap_map, cluster_info, etc. data structures in
struct swap_info_struct, the swap cache radix tree will be freed after
swapoff, so this patch fixes the race between swap cache lookup and
swapoff too.

Races between some other swap cache usages protected by disabling
preemption and swapoff are fixed too, by calling stop_machine() between
clearing PageSwapCache() and freeing the swap cache data structure.

An alternative implementation could replace disabling preemption with
rcu_read_lock_sched() and stop_machine() with synchronize_sched().

Signed-off-by: "Huang, Ying"
Not-Nacked-by: Hugh Dickins
Cc: Paul E. McKenney
Cc: Minchan Kim
Cc: Johannes Weiner
Cc: Tim Chen
Cc: Mel Gorman
Cc: Jérôme Glisse
Cc: Michal Hocko
Cc: Andrea Arcangeli
Cc: David Rientjes
Cc: Rik van Riel
Cc: Jan Kara
Cc: Dave Jiang
Cc: Daniel Jordan
Cc: Andrea Parri

Changelog:

v7:

- Rebased on patch: "mm, swap: bounds check swap_info accesses to avoid
  NULL derefs"

v6:

- Add more comments to get_swap_device() to make it clearer about
  possible swapoff or swapoff+swapon.

v5:

- Replace RCU with stop_machine()

v4:

- Use synchronize_rcu() in enable_swap_info() to reduce the overhead of
  normal paths further.

v3:

- Re-implemented with RCU to reduce the overhead of normal paths

v2:

- Re-implemented with SRCU to reduce the overhead of normal paths.

- Avoid checking whether the swap device has been swapped off in
  get_swap_device().  Because we can check the origin of the swap
  entry to make sure the swap device hasn't been swapped off.
---
 include/linux/swap.h |  13 +++-
 mm/memory.c          |   2 +-
 mm/swap_state.c      |  16 -
 mm/swapfile.c        | 150 ++-
 4 files changed, 145 insertions(+), 36 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 649529be91f2..aecd1430d0a9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -175,8 +175,9 @@ enum {
 	SWP_PAGE_DISCARD = (1 << 10),	/* freed swap page-cluster discards */
 	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
+	SWP_VALID	= (1 << 13),	/* swap is valid to be operated on? */
 	/* add others here before... */
-	SWP_SCANNING	= (1 << 13),	/* refcount in scan_swap_map */
+	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
 };

 #define SWAP_CLUSTER_MAX 32UL
@@ -460,7 +461,7 @@ extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
-extern int __swap_count(struct swap_info_struct *si, swp_entry_t entry);
+extern int __swap_count(swp_entry_t entry);
 extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
@@ -470,6 +471,12 @@ extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
 extern void exit_swap_address_space(unsigned int type);
+extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
+
+static inline void put_swap_device(struct swap_info_struct *si)
+{
+	preempt_enable();
+}

 #else /* CONFIG_SWAP */

@@ -576,7 +583,7 @@ static inline int page_swapcount(struct page *page)
 	return 0;
 }

-static inline