Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2018-01-04 Thread Mel Gorman
On Thu, Jan 04, 2018 at 09:17:36AM +0800, Huang, Ying wrote:
> > Maybe, but in this particular case, I would prefer to go with something
> > more conventional unless there is strong evidence that it's an improvement
> > (which I doubt in this case given the cost of migration overall and the
> > corner case of migrating a dirty page).
> 
> So you like page_lock() more than RCU? 

In this instance, yes.

> Is there any problem with RCU?
> Is it that the object to be protected isn't clear?
> 

It's not clear what object is being protected or how, and it's not the
usual means by which a mapping is pinned. Furthermore, when a page is
being truncated, we really do not want to do any migration work for
compaction purposes, as it would be wasted.

-- 
Mel Gorman
SUSE Labs



Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2018-01-03 Thread Huang, Ying
Mel Gorman  writes:

> On Wed, Jan 03, 2018 at 08:42:15AM +0800, Huang, Ying wrote:
>> Mel Gorman  writes:
>> 
>> > On Tue, Jan 02, 2018 at 12:29:55PM +0100, Jan Kara wrote:
>> >> On Tue 02-01-18 10:21:03, Mel Gorman wrote:
>> >> > On Sat, Dec 23, 2017 at 10:36:53AM +0900, Minchan Kim wrote:
>> >> > > > code path.  It appears that a similar situation is possible for them too.
>> >> > > > 
>> >> > > > The file cache pages will be deleted from the file cache address_space
>> >> > > > before the address_space (embedded in the inode) is freed.  But they
>> >> > > > will be deleted from the LRU list only when their refcount drops to
>> >> > > > zero; please take a look at put_page() and release_pages().  The
>> >> > > > address_space will be freed only after the references to all file
>> >> > > > cache pages are put.  If someone holds a reference to a file cache
>> >> > > > page for quite a long time, it is possible for a file cache page to
>> >> > > > be on the LRU list after the inode/address_space is freed.
>> >> > > > 
>> >> > > > And I found the inode/address_space is freed with call_rcu().  I don't
>> >> > > > know whether this is related to page_mapping().
>> >> > > > 
>> >> > > > This is just my understanding.
>> >> > > 
>> >> > > Hmm, it smells like a bug in __isolate_lru_page.
>> >> > > 
>> >> > > Ccing Mel:
>> >> > > 
>> >> > > What lock protects the address_space from being destroyed when a race
>> >> > > happens between inode truncation and __isolate_lru_page?
>> >> > > 
>> >> > 
>> >> > I'm just back online and have a lot of catching up to do, so this is a
>> >> > rushed answer and I didn't read the background of this. However, the
>> >> > question is somewhat ambiguous and the scope is broad, as I'm not sure
>> >> > which race you refer to. For file cache pages, I wouldn't expect the
>> >> > address_space to be destroyed as long as the inode, which is the
>> >> > structure containing the address_space in this case, exists. A page on
>> >> > the LRU being isolated in __isolate_lru_page will have an elevated
>> >> > reference count, which pins the inode until remove_mapping is called,
>> >> > which holds the page lock; inode truncation looking at a page for
>> >> > truncation also only checks page_mapping under the page lock. Very
>> >> > broadly speaking, pages avoid being added back to an inode being freed
>> >> > by the I_FREEING state check.
>> >> 
>> >> So I'm wondering what prevents the following:
>> >> 
>> >> CPU1                                      CPU2
>> >> 
>> >> truncate(inode)                           __isolate_lru_page()
>> >>   ...
>> >>   truncate_inode_page(mapping, page);
>> >>     delete_from_page_cache(page)
>> >>       spin_lock_irqsave(&mapping->tree_lock, flags);
>> >>         __delete_from_page_cache(page, NULL)
>> >>           page_cache_tree_delete(..)
>> >>             ...                           mapping = page_mapping(page);
>> >>             page->mapping = NULL;
>> >>             ...
>> >>       spin_unlock_irqrestore(&mapping->tree_lock, flags);
>> >>       page_cache_free_page(mapping, page)
>> >>         put_page(page)
>> >>           if (put_page_testzero(page)) -> false
>> >> - inode now has no pages and can be freed, including the embedded
>> >>   address_space
>> >> 
>> >>                                           if (mapping &&
>> >>                                               !mapping->a_ops->migratepage)
>> >> - we've dereferenced a mapping which is potentially already freed.
>> >> 
>> >> 
>> >
>> > Hmm, possible, if unlikely.
>> >
>> > Before delete_from_page_cache, we called truncate_cleanup_page, so the
>> > page is likely to be !PageDirty or PageWriteback, which gets skipped by
>> > the only caller that checks the mapping in __isolate_lru_page. The race
>> > is tiny but it does exist. One way of closing it is to check the mapping
>> > under the page lock, which will prevent races with truncation. The
>> > overhead is minimal as the calling context (compaction) is quite a heavy
>> > operation anyway.
>> >
>> 
>> I think another possible fix is to use call_rcu_sched() to free the inode
>> (and address_space).  Because __isolate_lru_page() will be called with
>> the LRU spinlock held and IRQs disabled, call_rcu_sched() will wait until
>> the LRU spinlock is released and IRQs are enabled.
>> 
>
> Maybe, but in this particular case, I would prefer to go with something
> more conventional unless there is strong evidence that it's an improvement
> (which I doubt in this case given the cost of migration overall and the
> corner case of migrating a dirty page).

So you like page_lock() more than RCU?  Is there any problem with RCU?
Is it that the object to be protected isn't clear?

Another way to fix this with RCU is to replace
trylock_page()/unlock_page() with rcu_read_lock()/rcu_read_unlock() in
your fix.

JFYI, please keep your fix if you think that is more appropriate.

Best Regards,
Huang, Ying



Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2018-01-03 Thread Mel Gorman
On Wed, Jan 03, 2018 at 08:42:15AM +0800, Huang, Ying wrote:
> Mel Gorman  writes:
> 
> > On Tue, Jan 02, 2018 at 12:29:55PM +0100, Jan Kara wrote:
> >> On Tue 02-01-18 10:21:03, Mel Gorman wrote:
> >> > On Sat, Dec 23, 2017 at 10:36:53AM +0900, Minchan Kim wrote:
> >> > > > code path.  It appears that a similar situation is possible for them too.
> >> > > > 
> >> > > > The file cache pages will be deleted from the file cache address_space
> >> > > > before the address_space (embedded in the inode) is freed.  But they
> >> > > > will be deleted from the LRU list only when their refcount drops to
> >> > > > zero; please take a look at put_page() and release_pages().  The
> >> > > > address_space will be freed only after the references to all file
> >> > > > cache pages are put.  If someone holds a reference to a file cache
> >> > > > page for quite a long time, it is possible for a file cache page to
> >> > > > be on the LRU list after the inode/address_space is freed.
> >> > > > 
> >> > > > And I found the inode/address_space is freed with call_rcu().  I don't
> >> > > > know whether this is related to page_mapping().
> >> > > > 
> >> > > > This is just my understanding.
> >> > > 
> >> > > Hmm, it smells like a bug in __isolate_lru_page.
> >> > > 
> >> > > Ccing Mel:
> >> > > 
> >> > > What lock protects the address_space from being destroyed when a race
> >> > > happens between inode truncation and __isolate_lru_page?
> >> > > 
> >> > 
> >> > I'm just back online and have a lot of catching up to do, so this is a
> >> > rushed answer and I didn't read the background of this. However, the
> >> > question is somewhat ambiguous and the scope is broad, as I'm not sure
> >> > which race you refer to. For file cache pages, I wouldn't expect the
> >> > address_space to be destroyed as long as the inode, which is the
> >> > structure containing the address_space in this case, exists. A page on
> >> > the LRU being isolated in __isolate_lru_page will have an elevated
> >> > reference count, which pins the inode until remove_mapping is called,
> >> > which holds the page lock; inode truncation looking at a page for
> >> > truncation also only checks page_mapping under the page lock. Very
> >> > broadly speaking, pages avoid being added back to an inode being freed
> >> > by the I_FREEING state check.
> >> 
> >> So I'm wondering what prevents the following:
> >> 
> >> CPU1                                      CPU2
> >> 
> >> truncate(inode)                           __isolate_lru_page()
> >>   ...
> >>   truncate_inode_page(mapping, page);
> >>     delete_from_page_cache(page)
> >>       spin_lock_irqsave(&mapping->tree_lock, flags);
> >>         __delete_from_page_cache(page, NULL)
> >>           page_cache_tree_delete(..)
> >>             ...                           mapping = page_mapping(page);
> >>             page->mapping = NULL;
> >>             ...
> >>       spin_unlock_irqrestore(&mapping->tree_lock, flags);
> >>       page_cache_free_page(mapping, page)
> >>         put_page(page)
> >>           if (put_page_testzero(page)) -> false
> >> - inode now has no pages and can be freed, including the embedded
> >>   address_space
> >> 
> >>                                           if (mapping &&
> >>                                               !mapping->a_ops->migratepage)
> >> - we've dereferenced a mapping which is potentially already freed.
> >> 
> >> 
> >
> > Hmm, possible, if unlikely.
> >
> > Before delete_from_page_cache, we called truncate_cleanup_page, so the
> > page is likely to be !PageDirty or PageWriteback, which gets skipped by
> > the only caller that checks the mapping in __isolate_lru_page. The race
> > is tiny but it does exist. One way of closing it is to check the mapping
> > under the page lock, which will prevent races with truncation. The
> > overhead is minimal as the calling context (compaction) is quite a heavy
> > operation anyway.
> >
> 
> I think another possible fix is to use call_rcu_sched() to free the inode
> (and address_space).  Because __isolate_lru_page() will be called with
> the LRU spinlock held and IRQs disabled, call_rcu_sched() will wait until
> the LRU spinlock is released and IRQs are enabled.
> 

Maybe, but in this particular case, I would prefer to go with something
more conventional unless there is strong evidence that it's an improvement
(which I doubt in this case given the cost of migration overall and the
corner case of migrating a dirty page).

-- 
Mel Gorman
SUSE Labs



Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2018-01-02 Thread Huang, Ying
Mel Gorman  writes:

> On Tue, Jan 02, 2018 at 12:29:55PM +0100, Jan Kara wrote:
>> On Tue 02-01-18 10:21:03, Mel Gorman wrote:
>> > On Sat, Dec 23, 2017 at 10:36:53AM +0900, Minchan Kim wrote:
>> > > > code path.  It appears that a similar situation is possible for them too.
>> > > > 
>> > > > The file cache pages will be deleted from the file cache address_space
>> > > > before the address_space (embedded in the inode) is freed.  But they
>> > > > will be deleted from the LRU list only when their refcount drops to
>> > > > zero; please take a look at put_page() and release_pages().  The
>> > > > address_space will be freed only after the references to all file
>> > > > cache pages are put.  If someone holds a reference to a file cache
>> > > > page for quite a long time, it is possible for a file cache page to
>> > > > be on the LRU list after the inode/address_space is freed.
>> > > > 
>> > > > And I found the inode/address_space is freed with call_rcu().  I don't
>> > > > know whether this is related to page_mapping().
>> > > > 
>> > > > This is just my understanding.
>> > > 
>> > > Hmm, it smells like a bug in __isolate_lru_page.
>> > > 
>> > > Ccing Mel:
>> > > 
>> > > What lock protects the address_space from being destroyed when a race
>> > > happens between inode truncation and __isolate_lru_page?
>> > > 
>> > 
>> > I'm just back online and have a lot of catching up to do, so this is a
>> > rushed answer and I didn't read the background of this. However, the
>> > question is somewhat ambiguous and the scope is broad, as I'm not sure
>> > which race you refer to. For file cache pages, I wouldn't expect the
>> > address_space to be destroyed as long as the inode, which is the
>> > structure containing the address_space in this case, exists. A page on
>> > the LRU being isolated in __isolate_lru_page will have an elevated
>> > reference count, which pins the inode until remove_mapping is called,
>> > which holds the page lock; inode truncation looking at a page for
>> > truncation also only checks page_mapping under the page lock. Very
>> > broadly speaking, pages avoid being added back to an inode being freed
>> > by the I_FREEING state check.
>> 
>> So I'm wondering what prevents the following:
>> 
>> CPU1                                      CPU2
>> 
>> truncate(inode)                           __isolate_lru_page()
>>   ...
>>   truncate_inode_page(mapping, page);
>>     delete_from_page_cache(page)
>>       spin_lock_irqsave(&mapping->tree_lock, flags);
>>         __delete_from_page_cache(page, NULL)
>>           page_cache_tree_delete(..)
>>             ...                           mapping = page_mapping(page);
>>             page->mapping = NULL;
>>             ...
>>       spin_unlock_irqrestore(&mapping->tree_lock, flags);
>>       page_cache_free_page(mapping, page)
>>         put_page(page)
>>           if (put_page_testzero(page)) -> false
>> - inode now has no pages and can be freed, including the embedded
>>   address_space
>> 
>>                                           if (mapping &&
>>                                               !mapping->a_ops->migratepage)
>> - we've dereferenced a mapping which is potentially already freed.
>> 
>> 
>
> Hmm, possible, if unlikely.
>
> Before delete_from_page_cache, we called truncate_cleanup_page, so the
> page is likely to be !PageDirty or PageWriteback, which gets skipped by
> the only caller that checks the mapping in __isolate_lru_page. The race
> is tiny but it does exist. One way of closing it is to check the mapping
> under the page lock, which will prevent races with truncation. The
> overhead is minimal as the calling context (compaction) is quite a heavy
> operation anyway.
>

I think another possible fix is to use call_rcu_sched() to free the inode
(and address_space).  Because __isolate_lru_page() will be called with
the LRU spinlock held and IRQs disabled, call_rcu_sched() will wait until
the LRU spinlock is released and IRQs are enabled.

Best Regards,
Huang, Ying



Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2018-01-02 Thread Mel Gorman
On Tue, Jan 02, 2018 at 12:29:55PM +0100, Jan Kara wrote:
> On Tue 02-01-18 10:21:03, Mel Gorman wrote:
> > On Sat, Dec 23, 2017 at 10:36:53AM +0900, Minchan Kim wrote:
> > > > code path.  It appears that a similar situation is possible for them too.
> > > > 
> > > > The file cache pages will be deleted from the file cache address_space
> > > > before the address_space (embedded in the inode) is freed.  But they
> > > > will be deleted from the LRU list only when their refcount drops to
> > > > zero; please take a look at put_page() and release_pages().  The
> > > > address_space will be freed only after the references to all file
> > > > cache pages are put.  If someone holds a reference to a file cache
> > > > page for quite a long time, it is possible for a file cache page to
> > > > be on the LRU list after the inode/address_space is freed.
> > > > 
> > > > And I found the inode/address_space is freed with call_rcu().  I don't
> > > > know whether this is related to page_mapping().
> > > > 
> > > > This is just my understanding.
> > > 
> > > Hmm, it smells like a bug in __isolate_lru_page.
> > > 
> > > Ccing Mel:
> > > 
> > > What lock protects the address_space from being destroyed when a race
> > > happens between inode truncation and __isolate_lru_page?
> > > 
> > 
> > I'm just back online and have a lot of catching up to do, so this is a
> > rushed answer and I didn't read the background of this. However, the
> > question is somewhat ambiguous and the scope is broad, as I'm not sure
> > which race you refer to. For file cache pages, I wouldn't expect the
> > address_space to be destroyed as long as the inode, which is the
> > structure containing the address_space in this case, exists. A page on
> > the LRU being isolated in __isolate_lru_page will have an elevated
> > reference count, which pins the inode until remove_mapping is called,
> > which holds the page lock; inode truncation looking at a page for
> > truncation also only checks page_mapping under the page lock. Very
> > broadly speaking, pages avoid being added back to an inode being freed
> > by the I_FREEING state check.
> 
> So I'm wondering what prevents the following:
> 
> CPU1                                      CPU2
> 
> truncate(inode)                           __isolate_lru_page()
>   ...
>   truncate_inode_page(mapping, page);
>     delete_from_page_cache(page)
>       spin_lock_irqsave(&mapping->tree_lock, flags);
>         __delete_from_page_cache(page, NULL)
>           page_cache_tree_delete(..)
>             ...                           mapping = page_mapping(page);
>             page->mapping = NULL;
>             ...
>       spin_unlock_irqrestore(&mapping->tree_lock, flags);
>       page_cache_free_page(mapping, page)
>         put_page(page)
>           if (put_page_testzero(page)) -> false
> - inode now has no pages and can be freed, including the embedded
>   address_space
> 
>                                           if (mapping &&
>                                               !mapping->a_ops->migratepage)
> - we've dereferenced a mapping which is potentially already freed.
> 

Hmm, possible, if unlikely.

Before delete_from_page_cache, we called truncate_cleanup_page, so the
page is likely to be !PageDirty or PageWriteback, which gets skipped by
the only caller that checks the mapping in __isolate_lru_page. The race
is tiny but it does exist. One way of closing it is to check the mapping
under the page lock, which will prevent races with truncation. The
overhead is minimal as the calling context (compaction) is quite a heavy
operation anyway.

Build tested only for review


diff --git a/mm/vmscan.c b/mm/vmscan.c
index c02c850ea349..61bf0bc60d96 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1433,14 +1433,24 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 
 		if (PageDirty(page)) {
 			struct address_space *mapping;
+			bool migrate_dirty;
 
 			/*
 			 * Only pages without mappings or that have a
 			 * ->migratepage callback are possible to migrate
-			 * without blocking
+			 * without blocking. However, we can be racing with
+			 * truncation so it's necessary to lock the page
+			 * to stabilise the mapping as truncation holds
+			 * the page lock until after the page is removed
+			 * from the page cache.
 			 */
+			if (!trylock_page(page))
+				return ret;
+
 			mapping = page_mapping(page);
-			if (mapping && !mapping->a_ops->migratepage)
+			migrate_dirty = !mapping || mapping->a_ops->migratepage;
+			unlock_page(page);
+			if (!migrate_dirty)
 				return ret;
 		}
 	}

-- 
Mel Gorman
SUSE Labs


Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2018-01-02 Thread Mel Gorman
On Tue, Jan 02, 2018 at 12:29:55PM +0100, Jan Kara wrote:
> On Tue 02-01-18 10:21:03, Mel Gorman wrote:
> > On Sat, Dec 23, 2017 at 10:36:53AM +0900, Minchan Kim wrote:
> > > > code path.  It appears that a similar situation is possible for them too.
> > > > 
> > > > The file cache pages will be deleted from file cache address_space before
> > > > address_space (embedded in inode) is freed.  But they will be deleted
> > > > from LRU list only when their refcount drops to zero, please take a look
> > > > at put_page() and release_pages().  The address_space will be freed only
> > > > after the references to all file cache pages are put.  If someone holds a
> > > > reference to a file cache page for quite a long time, it is possible for
> > > > a file cache page to be on the LRU list after the inode/address_space is
> > > > freed.
> > > > 
> > > > And I found inode/address_space is freed with call_rcu().  I don't know
> > > > whether this is related to page_mapping().
> > > > 
> > > > This is just my understanding.
> > > 
> > > Hmm, it smells like a bug in __isolate_lru_page.
> > > 
> > > Ccing Mel:
> > > 
> > > What locks protect address_space destruction when a race happens between
> > > inode truncation and __isolate_lru_page?
> > > 
> > 
> > I'm just back online and have a lot of catching up to do so this is a rushed
> > answer and I didn't read the background of this. However the question is
> > somewhat ambiguous and the scope is broad as I'm not sure which race you
> > refer to. For file cache pages, I wouldn't expect the address_space to be
> > destroyed specifically as long as the inode exists which is the structure
> > containing the address_space in this case. A page on the LRU being isolated
> > in __isolate_lru_page will have an elevated reference count which will
> > pin the inode until remove_mapping is called which holds the page lock
> > while inode truncation looking at a page for truncation also only checks
> > page_mapping under the page lock. Very broadly speaking, pages avoid being
> > added back to an inode being freed by checking the I_FREEING state.
> 
> So I'm wondering what prevents the following:
> 
> CPU1                                  CPU2
> 
> truncate(inode)                       __isolate_lru_page()
>   ...
>   truncate_inode_page(mapping, page);
>     delete_from_page_cache(page)
>       spin_lock_irqsave(&mapping->tree_lock, flags);
>         __delete_from_page_cache(page, NULL)
>           page_cache_tree_delete(..)
>             ...                       mapping = page_mapping(page);
>             page->mapping = NULL;
>             ...
>       spin_unlock_irqrestore(&mapping->tree_lock, flags);
>       page_cache_free_page(mapping, page)
>         put_page(page)
>           if (put_page_testzero(page)) -> false
> - inode now has no pages and can be freed including embedded address_space
> 
>                                       if (mapping &&
>                                           !mapping->a_ops->migratepage)
> - we've dereferenced mapping which is potentially already free.
> 

Hmm, possible if unlikely.

Before delete_from_page_cache, we called truncate_cleanup_page so the
page is likely to be !PageDirty or PageWriteback which gets skipped by
the only caller that checks the mapping in __isolate_lru_page. The race
is tiny but it does exist. One way of closing it is to check the mapping
under the page lock which will prevent races with truncation. The
overhead is minimal as the calling context (compaction) is quite a heavy
operation anyway.

Build tested only for review


diff --git a/mm/vmscan.c b/mm/vmscan.c
index c02c850ea349..61bf0bc60d96 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1433,14 +1433,24 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 
if (PageDirty(page)) {
struct address_space *mapping;
+   bool migrate_dirty;
 
/*
 * Only pages without mappings or that have a
 * ->migratepage callback are possible to migrate
-* without blocking
+* without blocking. However, we can be racing with
+* truncation so it's necessary to lock the page
+* to stabilise the mapping as truncation holds
+* the page lock until after the page is removed
+* from the page cache.
 */
+   if (!trylock_page(page))
+   return ret;
+
mapping = page_mapping(page);
-   if (mapping && !mapping->a_ops->migratepage)
+   migrate_dirty = mapping && mapping->a_ops->migratepage;
+   unlock_page(page);
+   if (!migrate_dirty)
return ret;
}
}

-- 
Mel Gorman
SUSE Labs


Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2018-01-02 Thread Jan Kara
On Tue 02-01-18 10:21:03, Mel Gorman wrote:
> On Sat, Dec 23, 2017 at 10:36:53AM +0900, Minchan Kim wrote:
> > > code path.  It appears that a similar situation is possible for them too.
> > > 
> > > The file cache pages will be deleted from file cache address_space before
> > > address_space (embedded in inode) is freed.  But they will be deleted
> > > from LRU list only when their refcount drops to zero, please take a look
> > > at put_page() and release_pages().  The address_space will be freed only
> > > after the references to all file cache pages are put.  If someone holds a
> > > reference to a file cache page for quite a long time, it is possible for
> > > a file cache page to be on the LRU list after the inode/address_space is
> > > freed.
> > > 
> > > And I found inode/address_space is freed with call_rcu().  I don't know
> > > whether this is related to page_mapping().
> > > 
> > > This is just my understanding.
> > 
> > Hmm, it smells like a bug in __isolate_lru_page.
> > 
> > Ccing Mel:
> > 
> > What locks protect address_space destruction when a race happens between
> > inode truncation and __isolate_lru_page?
> > 
> 
> I'm just back online and have a lot of catching up to do so this is a rushed
> answer and I didn't read the background of this. However the question is
> somewhat ambiguous and the scope is broad as I'm not sure which race you
> refer to. For file cache pages, I wouldn't expect the address_space to be
> destroyed specifically as long as the inode exists which is the structure
> containing the address_space in this case. A page on the LRU being isolated
> in __isolate_lru_page will have an elevated reference count which will
> pin the inode until remove_mapping is called which holds the page lock
> while inode truncation looking at a page for truncation also only checks
> page_mapping under the page lock. Very broadly speaking, pages avoid being
> added back to an inode being freed by checking the I_FREEING state.

So I'm wondering what prevents the following:

CPU1                                    CPU2

truncate(inode)                         __isolate_lru_page()
  ...
  truncate_inode_page(mapping, page);
    delete_from_page_cache(page)
      spin_lock_irqsave(&mapping->tree_lock, flags);
        __delete_from_page_cache(page, NULL)
          page_cache_tree_delete(..)
            ...                         mapping = page_mapping(page);
            page->mapping = NULL;
            ...
      spin_unlock_irqrestore(&mapping->tree_lock, flags);
      page_cache_free_page(mapping, page)
        put_page(page)
          if (put_page_testzero(page)) -> false
- inode now has no pages and can be freed including embedded address_space

                                        if (mapping &&
                                            !mapping->a_ops->migratepage)
- we've dereferenced mapping which is potentially already free.

This all seems very theoretical but in principle possible...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2018-01-02 Thread Mel Gorman
On Sat, Dec 23, 2017 at 10:36:53AM +0900, Minchan Kim wrote:
> > code path.  It appears that a similar situation is possible for them too.
> > 
> > The file cache pages will be deleted from file cache address_space before
> > address_space (embedded in inode) is freed.  But they will be deleted
> > from LRU list only when their refcount drops to zero, please take a look
> > at put_page() and release_pages().  The address_space will be freed only
> > after the references to all file cache pages are put.  If someone holds a
> > reference to a file cache page for quite a long time, it is possible for
> > a file cache page to be on the LRU list after the inode/address_space is
> > freed.
> > 
> > And I found inode/address_space is freed with call_rcu().  I don't know
> > whether this is related to page_mapping().
> > 
> > This is just my understanding.
> 
> Hmm, it smells like a bug in __isolate_lru_page.
> 
> Ccing Mel:
> 
> What locks protect address_space destruction when a race happens between
> inode truncation and __isolate_lru_page?
> 

I'm just back online and have a lot of catching up to do so this is a rushed
answer and I didn't read the background of this. However the question is
somewhat ambiguous and the scope is broad as I'm not sure which race you
refer to. For file cache pages, I wouldn't expect the address_space to be
destroyed specifically as long as the inode exists which is the structure
containing the address_space in this case. A page on the LRU being isolated
in __isolate_lru_page will have an elevated reference count which will
pin the inode until remove_mapping is called which holds the page lock
while inode truncation looking at a page for truncation also only checks
page_mapping under the page lock. Very broadly speaking, pages avoid being
added back to an inode being freed by checking the I_FREEING state.

Hopefully that helps while I go back to the TODO mountain.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2017-12-25 Thread Huang, Ying
Minchan Kim  writes:

> On Fri, Dec 22, 2017 at 10:14:43PM +0800, Huang, Ying wrote:
>> Minchan Kim  writes:
>> 
>> > On Thu, Dec 21, 2017 at 03:48:56PM +0800, Huang, Ying wrote:
>> >> Minchan Kim  writes:
>> >> 
>> >> > On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
>> >> >> From: Huang Ying 
>> >> >> 
>> >> >> When the swapin is performed, after getting the swap entry information
>> >> >> from the page table, the system will swap in the swap entry without any
>> >> >> lock held to prevent the swap device from being swapped off.  This may
>> >> >> cause a race like the one below,
>> >> >> 
>> >> >> CPU 1                          CPU 2
>> >> >> -----                          -----
>> >> >> do_swap_page
>> >> >>   swapin_readahead
>> >> >>     __read_swap_cache_async
>> >> >> swapoff                          swapcache_prepare
>> >> >>   p->swap_map = NULL               __swap_duplicate
>> >> >>                                      p->swap_map[?] /* !!! NULL pointer access */
>> >> >> 
>> >> >> Because swapoff is usually done only at system shutdown, the race
>> >> >> may not hit many people in practice.  But it is still a race that
>> >> >> needs to be fixed.
>> >> >> 
>> >> >> To fix the race, get_swap_device() is added to check whether the
>> >> >> specified swap entry is valid in its swap device.  If so, it will keep
>> >> >> the swap entry valid by preventing the swap device from being
>> >> >> swapped off, until put_swap_device() is called.
>> >> >> 
>> >> >> Because swapoff() is a very rare code path, to make the normal path run
>> >> >> as fast as possible, RCU instead of a reference count is used to
>> >> >> implement get/put_swap_device().  From get_swap_device() to
>> >> >> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
>> >> >> swapoff() will wait until put_swap_device() is called.
>> >> >> 
>> >> >> In addition to the swap_map, cluster_info, etc. data structures in
>> >> >> struct swap_info_struct, the swap cache radix tree will be freed after
>> >> >> swapoff, so this patch fixes the race between swap cache lookup
>> >> >> and swapoff too.
>> >> >> 
>> >> >> Cc: Hugh Dickins 
>> >> >> Cc: Paul E. McKenney 
>> >> >> Cc: Minchan Kim 
>> >> >> Cc: Johannes Weiner 
>> >> >> Cc: Tim Chen 
>> >> >> Cc: Shaohua Li 
>> >> >> Cc: Mel Gorman 
>> >> >> Cc: "Jérôme Glisse" 
>> >> >> Cc: Michal Hocko 
>> >> >> Cc: Andrea Arcangeli 
>> >> >> Cc: David Rientjes 
>> >> >> Cc: Rik van Riel 
>> >> >> Cc: Jan Kara 
>> >> >> Cc: Dave Jiang 
>> >> >> Cc: Aaron Lu 
>> >> >> Signed-off-by: "Huang, Ying" 
>> >> >> 
>> >> >> Changelog:
>> >> >> 
>> >> >> v4:
>> >> >> 
>> >> >> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
>> >> >>   normal paths further.
>> >> >
>> >> > Hi Huang,
>> >> 
>> >> Hi, Minchan,
>> >> 
>> >> > This version is much better than the old one. To me, that's not due to
>> >> > the rcu/srcu/refcount choice, but because it adds the swap device
>> >> > dependency (i.e., get/put) into the swap-related functions, so users who
>> >> > aren't interested in swap don't need to care about it. Good.
>> >> >
>> >> > The problem is caused by freeing the swap-related data structures
>> >> > *dynamically*, while the old swap logic was based on static data
>> >> > structures (i.e., never freed, so there was no need to verify staleness).
>> >> > So, I reviewed some places that use PageSwapCache and swp_entry_t,
>> >> > which could access swap-related data structures.
>> >> >
>> >> > An example is __isolate_lru_page
>> >> >
>> >> > It calls page_mapping to get an address_space.
>> >> > What happens if the page is on SwapCache and races with swapoff?
>> >> > The mapping obtained could disappear due to the race. Right?
>> >> 
>> >> Yes.  We should think about that.  Considering the file cache pages, the
>> >> address_space backing the file cache pages may be freed dynamically too.
>> >> So to use page_mapping() return value for the file cache pages, some
>> >> kind of locking is needed to guarantee the address_space isn't freed
>> >> under us.  Page may be locked, or under writeback, or some other locks
>> >
>> > I didn't look at the code in detail, but I guess every file page should
>> > be freed before the address_space destruction, and page_lock/lru_lock
>> > makes that work safe. So, it wouldn't be a problem.
>> >
>> > However, in the case of swapoff, it doesn't remove pages from the LRU
>> > list, so there is no lock to prevent the race at this moment. :(
>> 
>> Take a look at file cache pages and file cache address_space freeing
>> code path.  It appears that a similar situation is possible for them too.
>> 
>> The file cache pages will be deleted from file cache address_space before
>> address_space (embedded in inode) is freed.  But they will be deleted
>> from LRU list only when their refcount drops to zero, please take a look
>> at put_page() and release_pages().  While address_space will be freed
>> after 

Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2017-12-24 Thread Huang, Ying
"Paul E. McKenney"  writes:

> On Fri, Dec 22, 2017 at 10:14:43PM +0800, Huang, Ying wrote:
>> Minchan Kim  writes:
>> 
>> > On Thu, Dec 21, 2017 at 03:48:56PM +0800, Huang, Ying wrote:
>> >> Minchan Kim  writes:
>> >> 
>> >> > On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
>> >> >> From: Huang Ying 
>> >> >> 
>> >> >> When the swapin is performed, after getting the swap entry information
>> >> >> from the page table, the system will swap in the swap entry without any
>> >> >> lock held to prevent the swap device from being swapped off.  This may
>> >> >> cause a race like the one below,
>> >> >> 
>> >> >> CPU 1                          CPU 2
>> >> >> -----                          -----
>> >> >> do_swap_page
>> >> >>   swapin_readahead
>> >> >>     __read_swap_cache_async
>> >> >> swapoff                          swapcache_prepare
>> >> >>   p->swap_map = NULL               __swap_duplicate
>> >> >>                                      p->swap_map[?] /* !!! NULL pointer access */
>> >> >> 
>> >> >> Because swapoff is usually done only at system shutdown, the race
>> >> >> may not hit many people in practice.  But it is still a race that
>> >> >> needs to be fixed.
>> >> >> 
>> >> >> To fix the race, get_swap_device() is added to check whether the
>> >> >> specified swap entry is valid in its swap device.  If so, it will keep
>> >> >> the swap entry valid by preventing the swap device from being
>> >> >> swapped off, until put_swap_device() is called.
>> >> >> 
>> >> >> Because swapoff() is a very rare code path, to make the normal path run
>> >> >> as fast as possible, RCU instead of a reference count is used to
>> >> >> implement get/put_swap_device().  From get_swap_device() to
>> >> >> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
>> >> >> swapoff() will wait until put_swap_device() is called.
>> >> >> 
>> >> >> In addition to the swap_map, cluster_info, etc. data structures in
>> >> >> struct swap_info_struct, the swap cache radix tree will be freed after
>> >> >> swapoff, so this patch fixes the race between swap cache lookup
>> >> >> and swapoff too.
>> >> >> 
>> >> >> Cc: Hugh Dickins 
>> >> >> Cc: Paul E. McKenney 
>> >> >> Cc: Minchan Kim 
>> >> >> Cc: Johannes Weiner 
>> >> >> Cc: Tim Chen 
>> >> >> Cc: Shaohua Li 
>> >> >> Cc: Mel Gorman 
>> >> >> Cc: "Jérôme Glisse" 
>> >> >> Cc: Michal Hocko 
>> >> >> Cc: Andrea Arcangeli 
>> >> >> Cc: David Rientjes 
>> >> >> Cc: Rik van Riel 
>> >> >> Cc: Jan Kara 
>> >> >> Cc: Dave Jiang 
>> >> >> Cc: Aaron Lu 
>> >> >> Signed-off-by: "Huang, Ying" 
>> >> >> 
>> >> >> Changelog:
>> >> >> 
>> >> >> v4:
>> >> >> 
>> >> >> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
>> >> >>   normal paths further.
>> >> >
>> >> > Hi Huang,
>> >> 
>> >> Hi, Minchan,
>> >> 
>> >> > This version is much better than the old one. To me, that's not due to
>> >> > the rcu/srcu/refcount choice, but because it adds the swap device
>> >> > dependency (i.e., get/put) into the swap-related functions, so users who
>> >> > aren't interested in swap don't need to care about it. Good.
>> >> >
>> >> > The problem is caused by freeing the swap-related data structures
>> >> > *dynamically*, while the old swap logic was based on static data
>> >> > structures (i.e., never freed, so there was no need to verify staleness).
>> >> > So, I reviewed some places that use PageSwapCache and swp_entry_t,
>> >> > which could access swap-related data structures.
>> >> >
>> >> > An example is __isolate_lru_page
>> >> >
>> >> > It calls page_mapping to get an address_space.
>> >> > What happens if the page is on SwapCache and races with swapoff?
>> >> > The mapping obtained could disappear due to the race. Right?
>> >> 
>> >> Yes.  We should think about that.  Considering the file cache pages, the
>> >> address_space backing the file cache pages may be freed dynamically too.
>> >> So to use page_mapping() return value for the file cache pages, some
>> >> kind of locking is needed to guarantee the address_space isn't freed
>> >> under us.  Page may be locked, or under writeback, or some other locks
>> >
>> > I didn't look at the code in detail, but I guess every file page should
>> > be freed before the address_space destruction, and page_lock/lru_lock
>> > makes that work safe. So, it wouldn't be a problem.
>> >
>> > However, in case of swapoff, it doesn't remove pages from LRU list
>> > so there is no lock to prevent the race at this moment. :(
>> 
>> Take a look at the file cache pages and the file cache address_space
>> freeing code path.  It appears that a similar situation is possible for
>> them too.
>> 
>> The file cache pages will be deleted from the file cache address_space
>> before the address_space (embedded in the inode) is freed.  But they will
>> be deleted from the LRU list only when their refcount drops to zero;
>> please take a look at put_page() and release_pages().  Meanwhile, the
>> address_space will be freed
>> 

Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2017-12-22 Thread Minchan Kim
On Fri, Dec 22, 2017 at 10:14:43PM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> 
> > On Thu, Dec 21, 2017 at 03:48:56PM +0800, Huang, Ying wrote:
> >> Minchan Kim  writes:
> >> 
> >> > On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
> >> >> From: Huang Ying 
> >> >> 
> >> >> When the swapin is performed, after getting the swap entry information
> >> >> from the page table, system will swap in the swap entry, without any
> >> >> lock held to prevent the swap device from being swapoff.  This may
> >> >> cause the race like below,
> >> >> 
> >> >> CPU 1                              CPU 2
> >> >> -----                              -----
> >> >>                                    do_swap_page
> >> >>                                      swapin_readahead
> >> >>                                        __read_swap_cache_async
> >> >> swapoff                                swapcache_prepare
> >> >>   p->swap_map = NULL                     __swap_duplicate
> >> >>                                            p->swap_map[?] /* !!! NULL
> >> >>                                               pointer access */
> >> >> 
> >> >> Because swapoff is usually done only at system shutdown, the race
> >> >> may not hit many people in practice.  But it is still a race that
> >> >> needs to be fixed.
> >> >> 
> >> >> To fix the race, get_swap_device() is added to check whether the
> >> >> specified swap entry is valid in its swap device.  If so, it will keep
> >> >> the swap entry valid via preventing the swap device from being
> >> >> swapoff, until put_swap_device() is called.
> >> >> 
> >> >> Because swapoff() is a very rare code path, to make the normal path run
> >> >> as fast as possible, RCU instead of reference count is used to
> >> >> implement get/put_swap_device().  From get_swap_device() to
> >> >> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
> >> >> swapoff() will wait until put_swap_device() is called.
> >> >> 
> >> >> In addition to swap_map, cluster_info, etc. data structure in the
> >> >> struct swap_info_struct, the swap cache radix tree will be freed after
> >> >> swapoff, so this patch fixes the race between swap cache looking up
> >> >> and swapoff too.
> >> >> 
> >> >> Cc: Hugh Dickins 
> >> >> Cc: Paul E. McKenney 
> >> >> Cc: Minchan Kim 
> >> >> Cc: Johannes Weiner 
> >> >> Cc: Tim Chen 
> >> >> Cc: Shaohua Li 
> >> >> Cc: Mel Gorman 
> >> >> Cc: "Jérôme Glisse" 
> >> >> Cc: Michal Hocko 
> >> >> Cc: Andrea Arcangeli 
> >> >> Cc: David Rientjes 
> >> >> Cc: Rik van Riel 
> >> >> Cc: Jan Kara 
> >> >> Cc: Dave Jiang 
> >> >> Cc: Aaron Lu 
> >> >> Signed-off-by: "Huang, Ying" 
> >> >> 
> >> >> Changelog:
> >> >> 
> >> >> v4:
> >> >> 
> >> >> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
> >> >>   normal paths further.
> >> >
> >> > Hi Huang,
> >> 
> >> Hi, Minchan,
> >> 
> >> > This version is much better than the old one. To me, that's not because
> >> > of the rcu/srcu/refcount choice, but because it adds the swap device
> >> > dependency (i.e., get/put) into every swap-related function, so users who
> >> > aren't interested in swap don't need to care about it. Good.
> >> >
> >> > The problem is caused by freeing the swap-related data structures
> >> > *dynamically*, while the old swap logic was based on static data
> >> > structures (i.e., never freed; just verify whether they're stale).
> >> > So, I reviewed some places that use PageSwapCache and swp_entry_t
> >> > and could access swap-related data structures.
> >> >
> >> > An example is __isolate_lru_page()
> >> >
> >> > It calls page_mapping() to get an address_space.
> >> > What happens if the page is in the SwapCache and races with swapoff?
> >> > The mapping obtained could disappear due to the race. Right?
> >> 
> >> Yes.  We should think about that.  Considering the file cache pages, the
> >> address_space backing the file cache pages may be freed dynamically too.
> >> So to use page_mapping() return value for the file cache pages, some
> >> kind of locking is needed to guarantee the address_space isn't freed
> >> under us.  Page may be locked, or under writeback, or some other locks
> >
> > I didn't look at the code in detail, but I guess every file page should
> > be freed before the address space is destroyed, and page_lock/lru_lock
> > makes that safe. So it wouldn't be a problem.
> >
> > However, in the case of swapoff, pages are not removed from the LRU
> > list, so there is no lock to prevent the race at this moment. :(
> 
> Take a look at the file cache pages and the file cache address_space
> freeing code path.  It appears that a similar situation is possible for
> them too.
> 
> The file cache pages will be deleted from the file cache address_space
> before the address_space (embedded in the inode) is freed.  But they will
> be deleted from the LRU list only when their refcount drops to zero;
> please take a look at put_page() and release_pages().  Meanwhile, the
> address_space will be freed after the references to all file cache pages
> are put.  If someone holds a reference to a file cache page for quite a
> long time, it is possible 

Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2017-12-22 Thread Paul E. McKenney
On Fri, Dec 22, 2017 at 10:14:43PM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> 
> > On Thu, Dec 21, 2017 at 03:48:56PM +0800, Huang, Ying wrote:
> >> Minchan Kim  writes:
> >> 
> >> > On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
> >> >> From: Huang Ying 
> >> >> 
> >> >> When the swapin is performed, after getting the swap entry information
> >> >> from the page table, system will swap in the swap entry, without any
> >> >> lock held to prevent the swap device from being swapoff.  This may
> >> >> cause the race like below,
> >> >> 
> >> >> CPU 1                              CPU 2
> >> >> -----                              -----
> >> >>                                    do_swap_page
> >> >>                                      swapin_readahead
> >> >>                                        __read_swap_cache_async
> >> >> swapoff                                swapcache_prepare
> >> >>   p->swap_map = NULL                     __swap_duplicate
> >> >>                                            p->swap_map[?] /* !!! NULL
> >> >>                                               pointer access */
> >> >> 
> >> >> Because swapoff is usually done only at system shutdown, the race
> >> >> may not hit many people in practice.  But it is still a race that
> >> >> needs to be fixed.
> >> >> 
> >> >> To fix the race, get_swap_device() is added to check whether the
> >> >> specified swap entry is valid in its swap device.  If so, it will keep
> >> >> the swap entry valid via preventing the swap device from being
> >> >> swapoff, until put_swap_device() is called.
> >> >> 
> >> >> Because swapoff() is a very rare code path, to make the normal path run
> >> >> as fast as possible, RCU instead of reference count is used to
> >> >> implement get/put_swap_device().  From get_swap_device() to
> >> >> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
> >> >> swapoff() will wait until put_swap_device() is called.
> >> >> 
> >> >> In addition to swap_map, cluster_info, etc. data structure in the
> >> >> struct swap_info_struct, the swap cache radix tree will be freed after
> >> >> swapoff, so this patch fixes the race between swap cache looking up
> >> >> and swapoff too.
> >> >> 
> >> >> Cc: Hugh Dickins 
> >> >> Cc: Paul E. McKenney 
> >> >> Cc: Minchan Kim 
> >> >> Cc: Johannes Weiner 
> >> >> Cc: Tim Chen 
> >> >> Cc: Shaohua Li 
> >> >> Cc: Mel Gorman 
> >> >> Cc: "Jérôme Glisse" 
> >> >> Cc: Michal Hocko 
> >> >> Cc: Andrea Arcangeli 
> >> >> Cc: David Rientjes 
> >> >> Cc: Rik van Riel 
> >> >> Cc: Jan Kara 
> >> >> Cc: Dave Jiang 
> >> >> Cc: Aaron Lu 
> >> >> Signed-off-by: "Huang, Ying" 
> >> >> 
> >> >> Changelog:
> >> >> 
> >> >> v4:
> >> >> 
> >> >> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
> >> >>   normal paths further.
> >> >
> >> > Hi Huang,
> >> 
> >> Hi, Minchan,
> >> 
> >> > This version is much better than the old one. To me, that's not because
> >> > of the rcu/srcu/refcount choice, but because it adds the swap device
> >> > dependency (i.e., get/put) into every swap-related function, so users who
> >> > aren't interested in swap don't need to care about it. Good.
> >> >
> >> > The problem is caused by freeing the swap-related data structures
> >> > *dynamically*, while the old swap logic was based on static data
> >> > structures (i.e., never freed; just verify whether they're stale).
> >> > So, I reviewed some places that use PageSwapCache and swp_entry_t
> >> > and could access swap-related data structures.
> >> >
> >> > An example is __isolate_lru_page()
> >> >
> >> > It calls page_mapping() to get an address_space.
> >> > What happens if the page is in the SwapCache and races with swapoff?
> >> > The mapping obtained could disappear due to the race. Right?
> >> 
> >> Yes.  We should think about that.  Considering the file cache pages, the
> >> address_space backing the file cache pages may be freed dynamically too.
> >> So to use page_mapping() return value for the file cache pages, some
> >> kind of locking is needed to guarantee the address_space isn't freed
> >> under us.  Page may be locked, or under writeback, or some other locks
> >
> > I didn't look at the code in detail, but I guess every file page should
> > be freed before the address space is destroyed, and page_lock/lru_lock
> > makes that safe. So it wouldn't be a problem.
> >
> > However, in the case of swapoff, pages are not removed from the LRU
> > list, so there is no lock to prevent the race at this moment. :(
> 
> Take a look at the file cache pages and the file cache address_space
> freeing code path.  It appears that a similar situation is possible for
> them too.
> 
> The file cache pages will be deleted from the file cache address_space
> before the address_space (embedded in the inode) is freed.  But they will
> be deleted from the LRU list only when their refcount drops to zero;
> please take a look at put_page() and release_pages().  Meanwhile, the
> address_space will be freed after the references to all file cache pages
> are put.  If someone holds a reference to a file cache page for quite a
> long time, it is possible 

Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2017-12-22 Thread Huang, Ying
Minchan Kim  writes:

> On Thu, Dec 21, 2017 at 03:48:56PM +0800, Huang, Ying wrote:
>> Minchan Kim  writes:
>> 
>> > On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
>> >> From: Huang Ying 
>> >> 
>> >> When the swapin is performed, after getting the swap entry information
>> >> from the page table, system will swap in the swap entry, without any
>> >> lock held to prevent the swap device from being swapoff.  This may
>> >> cause the race like below,
>> >> 
>> >> CPU 1                              CPU 2
>> >> -----                              -----
>> >>                                    do_swap_page
>> >>                                      swapin_readahead
>> >>                                        __read_swap_cache_async
>> >> swapoff                                swapcache_prepare
>> >>   p->swap_map = NULL                     __swap_duplicate
>> >>                                            p->swap_map[?] /* !!! NULL
>> >>                                               pointer access */
>> >> 
>> >> Because swapoff is usually done only at system shutdown, the race
>> >> may not hit many people in practice.  But it is still a race that
>> >> needs to be fixed.
>> >> 
>> >> To fix the race, get_swap_device() is added to check whether the
>> >> specified swap entry is valid in its swap device.  If so, it will keep
>> >> the swap entry valid via preventing the swap device from being
>> >> swapoff, until put_swap_device() is called.
>> >> 
>> >> Because swapoff() is a very rare code path, to make the normal path run
>> >> as fast as possible, RCU instead of reference count is used to
>> >> implement get/put_swap_device().  From get_swap_device() to
>> >> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
>> >> swapoff() will wait until put_swap_device() is called.
>> >> 
>> >> In addition to swap_map, cluster_info, etc. data structure in the
>> >> struct swap_info_struct, the swap cache radix tree will be freed after
>> >> swapoff, so this patch fixes the race between swap cache looking up
>> >> and swapoff too.
>> >> 
>> >> Cc: Hugh Dickins 
>> >> Cc: Paul E. McKenney 
>> >> Cc: Minchan Kim 
>> >> Cc: Johannes Weiner 
>> >> Cc: Tim Chen 
>> >> Cc: Shaohua Li 
>> >> Cc: Mel Gorman 
>> >> Cc: "Jérôme Glisse" 
>> >> Cc: Michal Hocko 
>> >> Cc: Andrea Arcangeli 
>> >> Cc: David Rientjes 
>> >> Cc: Rik van Riel 
>> >> Cc: Jan Kara 
>> >> Cc: Dave Jiang 
>> >> Cc: Aaron Lu 
>> >> Signed-off-by: "Huang, Ying" 
>> >> 
>> >> Changelog:
>> >> 
>> >> v4:
>> >> 
>> >> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
>> >>   normal paths further.
>> >
>> > Hi Huang,
>> 
>> Hi, Minchan,
>> 
>> > This version is much better than the old one. To me, that's not because
>> > of the rcu/srcu/refcount choice, but because it adds the swap device
>> > dependency (i.e., get/put) into every swap-related function, so users who
>> > aren't interested in swap don't need to care about it. Good.
>> >
>> > The problem is caused by freeing the swap-related data structures
>> > *dynamically*, while the old swap logic was based on static data
>> > structures (i.e., never freed; just verify whether they're stale).
>> > So, I reviewed some places that use PageSwapCache and swp_entry_t
>> > and could access swap-related data structures.
>> >
>> > An example is __isolate_lru_page()
>> >
>> > It calls page_mapping() to get an address_space.
>> > What happens if the page is in the SwapCache and races with swapoff?
>> > The mapping obtained could disappear due to the race. Right?
>> 
>> Yes.  We should think about that.  Considering the file cache pages, the
>> address_space backing the file cache pages may be freed dynamically too.
>> So to use page_mapping() return value for the file cache pages, some
>> kind of locking is needed to guarantee the address_space isn't freed
>> under us.  Page may be locked, or under writeback, or some other locks
>
> I didn't look at the code in detail, but I guess every file page should
> be freed before the address space is destroyed, and page_lock/lru_lock
> makes that safe. So it wouldn't be a problem.
>
> However, in the case of swapoff, pages are not removed from the LRU
> list, so there is no lock to prevent the race at this moment. :(

Take a look at the file cache pages and the file cache address_space
freeing code path.  It appears that a similar situation is possible for
them too.

The file cache pages will be deleted from the file cache address_space
before the address_space (embedded in the inode) is freed.  But they will
be deleted from the LRU list only when their refcount drops to zero;
please take a look at put_page() and release_pages().  Meanwhile, the
address_space will be freed after the references to all file cache pages
are put.  If someone holds a reference to a file cache page for quite a
long time, it is possible for a file cache page to be in the LRU list
after the inode/address_space is freed.

And I found the inode/address_space is freed with call_rcu().  I don't
know whether this is related to page_mapping().

This is just my understanding.

>> need to be held, for example, page table lock, or lru_lock, etc.  
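
For readers following the thread: the get/put_swap_device() scheme that the
quoted patch description outlines can be sketched in plain C.  The following
is a userspace model only, with stand-in types and no-op RCU primitives; the
names (SWP_VALID, the swap_info[] array) follow the patch description, but
every detail is illustrative rather than the actual kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types: in the kernel these come from <linux/swap.h>. */
typedef struct { unsigned long val; } swp_entry_t;
struct swap_info_struct {
	unsigned long flags;
	unsigned char *swap_map;
};

#define SWP_VALID 0x1	/* illustrative flag: device not being swapped off */

static struct swap_info_struct *swap_info[1];

/* No-op stand-ins for the RCU read-side primitives. */
static void rcu_read_lock(void) { }
static void rcu_read_unlock(void) { }

/*
 * get_swap_device(): check that the swap entry's device is still valid,
 * and if so keep it valid (in the patch, by holding the RCU read lock)
 * until put_swap_device() is called.  Returns NULL for a stale entry.
 */
static struct swap_info_struct *get_swap_device(swp_entry_t entry)
{
	struct swap_info_struct *si;

	(void)entry;		/* real code indexes by swp_type(entry) */
	rcu_read_lock();
	si = swap_info[0];
	if (si == NULL || !(si->flags & SWP_VALID)) {
		rcu_read_unlock();
		return NULL;	/* raced with swapoff: caller must bail out */
	}
	return si;		/* caller is now safe to touch si->swap_map */
}

static void put_swap_device(struct swap_info_struct *si)
{
	(void)si;
	rcu_read_unlock();	/* lets swapoff's synchronize_rcu() proceed */
}
```

Under this model the NULL-pointer access in the race diagram cannot happen:
__swap_duplicate() would only be reached between a successful
get_swap_device() and the matching put_swap_device(), and swapoff()'s
synchronize_rcu() waits for that window to close.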

Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2017-12-21 Thread Minchan Kim
On Thu, Dec 21, 2017 at 03:48:56PM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> 
> > On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
> >> From: Huang Ying 
> >> 
> >> When the swapin is performed, after getting the swap entry information
> >> from the page table, system will swap in the swap entry, without any
> >> lock held to prevent the swap device from being swapoff.  This may
> >> cause the race like below,
> >> 
> >> CPU 1                              CPU 2
> >> -----                              -----
> >>                                    do_swap_page
> >>                                      swapin_readahead
> >>                                        __read_swap_cache_async
> >> swapoff                                swapcache_prepare
> >>   p->swap_map = NULL                     __swap_duplicate
> >>                                            p->swap_map[?] /* !!! NULL
> >>                                               pointer access */
> >> 
> >> Because swapoff is usually done only at system shutdown, the race
> >> may not hit many people in practice.  But it is still a race that
> >> needs to be fixed.
> >> 
> >> To fix the race, get_swap_device() is added to check whether the
> >> specified swap entry is valid in its swap device.  If so, it will keep
> >> the swap entry valid via preventing the swap device from being
> >> swapoff, until put_swap_device() is called.
> >> 
> >> Because swapoff() is a very rare code path, to make the normal path run
> >> as fast as possible, RCU instead of reference count is used to
> >> implement get/put_swap_device().  From get_swap_device() to
> >> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
> >> swapoff() will wait until put_swap_device() is called.
> >> 
> >> In addition to swap_map, cluster_info, etc. data structure in the
> >> struct swap_info_struct, the swap cache radix tree will be freed after
> >> swapoff, so this patch fixes the race between swap cache looking up
> >> and swapoff too.
> >> 
> >> Cc: Hugh Dickins 
> >> Cc: Paul E. McKenney 
> >> Cc: Minchan Kim 
> >> Cc: Johannes Weiner 
> >> Cc: Tim Chen 
> >> Cc: Shaohua Li 
> >> Cc: Mel Gorman 
> >> Cc: "Jérôme Glisse" 
> >> Cc: Michal Hocko 
> >> Cc: Andrea Arcangeli 
> >> Cc: David Rientjes 
> >> Cc: Rik van Riel 
> >> Cc: Jan Kara 
> >> Cc: Dave Jiang 
> >> Cc: Aaron Lu 
> >> Signed-off-by: "Huang, Ying" 
> >> 
> >> Changelog:
> >> 
> >> v4:
> >> 
> >> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
> >>   normal paths further.
> >
> > Hi Huang,
> 
> Hi, Minchan,
> 
> > This version is much better than the old one. To me, that's not because
> > of the rcu/srcu/refcount choice, but because it adds the swap device
> > dependency (i.e., get/put) into every swap-related function, so users who
> > aren't interested in swap don't need to care about it. Good.
> >
> > The problem is caused by freeing the swap-related data structures
> > *dynamically*, while the old swap logic was based on static data
> > structures (i.e., never freed; just verify whether they're stale).
> > So, I reviewed some places that use PageSwapCache and swp_entry_t
> > and could access swap-related data structures.
> >
> > An example is __isolate_lru_page()
> >
> > It calls page_mapping() to get an address_space.
> > What happens if the page is in the SwapCache and races with swapoff?
> > The mapping obtained could disappear due to the race. Right?
> 
> Yes.  We should think about that.  Considering the file cache pages, the
> address_space backing the file cache pages may be freed dynamically too.
> So to use page_mapping() return value for the file cache pages, some
> kind of locking is needed to guarantee the address_space isn't freed
> under us.  Page may be locked, or under writeback, or some other locks

I didn't look at the code in detail but I guess every file page should
be freed before the address space destruction and page_lock/lru_lock makes
the work safe, I guess. So, it wouldn't be a problem.

However, in case of swapoff, it doesn't remove pages from LRU list
so there is no lock to prevent the race at this moment. :(

> need to be held, for example, page table lock, or lru_lock, etc.  For
> __isolate_lru_page(), lru_lock will be held when it is called.  And we
> will call synchronize_rcu() between clear PageSwapCache and free swap
> cache, so the usage of swap cache in __isolate_lru_page() should be
> safe.  Do you think my analysis makes sense?

I don't understand how synchronize_rcu closes the race with spin_lock.
Paul might help it.

Even if we solve it, there is a other problem I spot.
When I see migrate_vma_pages, it pass mapping to migrate_page which
accesses mapping->tree_lock unconditionally even though the 

Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2017-12-21 Thread Minchan Kim
On Thu, Dec 21, 2017 at 03:48:56PM +0800, Huang, Ying wrote:
> Minchan Kim  writes:
> 
> > On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
> >> From: Huang Ying 
> >> 
> >> When swapin is performed, after getting the swap entry information
> >> from the page table, the system will swap in the swap entry without any
> >> lock held to prevent the swap device from being swapped off.  This may
> >> cause a race like the one below,
> >> 
> >> CPU 1                          CPU 2
> >> -----                          -----
> >>                                do_swap_page
> >>                                  swapin_readahead
> >>                                    __read_swap_cache_async
> >> swapoff                              swapcache_prepare
> >>   p->swap_map = NULL                   __swap_duplicate
> >>                                          p->swap_map[?] /* !!! NULL pointer access */
> >> 
> >> Because swapoff is usually done only when the system shuts down, the
> >> race may not hit many people in practice.  But it is still a race that
> >> needs to be fixed.
> >> 
> >> To fix the race, get_swap_device() is added to check whether the
> >> specified swap entry is valid in its swap device.  If so, it will keep
> >> the swap entry valid by preventing the swap device from being swapped
> >> off, until put_swap_device() is called.
> >> 
> >> Because swapoff() is a very rare code path, to make the normal path run
> >> as fast as possible, RCU instead of a reference count is used to
> >> implement get/put_swap_device().  From get_swap_device() to
> >> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
> >> swapoff() will wait until put_swap_device() is called.
> >> 
> >> In addition to the swap_map, cluster_info, etc. data structures in
> >> struct swap_info_struct, the swap cache radix tree will be freed after
> >> swapoff, so this patch fixes the race between swap cache lookup and
> >> swapoff too.
> >> 
> >> Cc: Hugh Dickins 
> >> Cc: Paul E. McKenney 
> >> Cc: Minchan Kim 
> >> Cc: Johannes Weiner 
> >> Cc: Tim Chen 
> >> Cc: Shaohua Li 
> >> Cc: Mel Gorman 
> >> Cc: "Jrme Glisse" 
> >> Cc: Michal Hocko 
> >> Cc: Andrea Arcangeli 
> >> Cc: David Rientjes 
> >> Cc: Rik van Riel 
> >> Cc: Jan Kara 
> >> Cc: Dave Jiang 
> >> Cc: Aaron Lu 
> >> Signed-off-by: "Huang, Ying" 
> >> 
> >> Changelog:
> >> 
> >> v4:
> >> 
> >> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
> >>   normal paths further.
> >
> > Hi Huang,
> 
> Hi, Minchan,
> 
> > This version is much better than the old one. To me, that's not because
> > of the RCU/SRCU/refcount choice, but because it adds the swap device
> > dependency (i.e., get/put) into every swap-related function, so users who
> > aren't interested in swap don't need to care about it. Good.
> >
> > The problem is caused by freeing swap-related data structures
> > *dynamically*, while the old swap logic was based on static data
> > structures (i.e., never freed, so there was no need to verify staleness).
> > So, I reviewed some places that use PageSwapCache and swp_entry_t
> > and could therefore access swap-related data structures.
> >
> > An example is __isolate_lru_page.
> >
> > It calls page_mapping to get an address_space.
> > What happens if the page is in the swap cache and races with swapoff?
> > The mapping we got could disappear because of the race. Right?
> 
> Yes.  We should think about that.  Considering file cache pages, the
> address_space backing them may be freed dynamically too.
> So to use the page_mapping() return value for file cache pages, some
> kind of locking is needed to guarantee the address_space isn't freed
> under us.  The page may be locked, or under writeback, or some other locks

I didn't look at the code in detail, but I guess every file page should
be freed before the address space is destroyed, and page_lock/lru_lock
makes that safe. So, it wouldn't be a problem.

However, in the case of swapoff, pages are not removed from the LRU list,
so there is no lock to prevent the race at this moment. :(

> need to be held, for example, the page table lock, or lru_lock, etc.  For
> __isolate_lru_page(), lru_lock will be held when it is called.  And we
> will call synchronize_rcu() between clearing PageSwapCache and freeing
> the swap cache, so the usage of the swap cache in __isolate_lru_page()
> should be safe.  Do you think my analysis makes sense?

I don't understand how synchronize_rcu closes the race with spin_lock.
Paul might be able to help with that.

Even if we solve it, there is another problem I spotted.
Looking at migrate_vma_pages, it passes the mapping to migrate_page, which
accesses mapping->tree_lock unconditionally even though the address_space
is already gone.

Hmm, I didn't check all the sites that use PageSwapCache and swp_entry_t,
but my gut feeling is that it would not be simple.


Re: [PATCH -V4 -mm] mm, swap: Fix race between swapoff and some swap operations

2017-12-20 Thread Minchan Kim
On Wed, Dec 20, 2017 at 09:26:32AM +0800, Huang, Ying wrote:
> From: Huang Ying 
> 
> When swapin is performed, after getting the swap entry information
> from the page table, the system will swap in the swap entry without any
> lock held to prevent the swap device from being swapped off.  This may
> cause a race like the one below,
> 
> CPU 1                          CPU 2
> -----                          -----
>                                do_swap_page
>                                  swapin_readahead
>                                    __read_swap_cache_async
> swapoff                              swapcache_prepare
>   p->swap_map = NULL                   __swap_duplicate
>                                          p->swap_map[?] /* !!! NULL pointer access */
> 
> Because swapoff is usually done only when the system shuts down, the
> race may not hit many people in practice.  But it is still a race that
> needs to be fixed.
> 
> To fix the race, get_swap_device() is added to check whether the
> specified swap entry is valid in its swap device.  If so, it will keep
> the swap entry valid by preventing the swap device from being swapped
> off, until put_swap_device() is called.
> 
> Because swapoff() is a very rare code path, to make the normal path run
> as fast as possible, RCU instead of a reference count is used to
> implement get/put_swap_device().  From get_swap_device() to
> put_swap_device(), the RCU read lock is held, so synchronize_rcu() in
> swapoff() will wait until put_swap_device() is called.
> 
> In addition to the swap_map, cluster_info, etc. data structures in
> struct swap_info_struct, the swap cache radix tree will be freed after
> swapoff, so this patch fixes the race between swap cache lookup and
> swapoff too.
> 
> Cc: Hugh Dickins 
> Cc: Paul E. McKenney 
> Cc: Minchan Kim 
> Cc: Johannes Weiner 
> Cc: Tim Chen 
> Cc: Shaohua Li 
> Cc: Mel Gorman 
> Cc: "Jérôme Glisse" 
> Cc: Michal Hocko 
> Cc: Andrea Arcangeli 
> Cc: David Rientjes 
> Cc: Rik van Riel 
> Cc: Jan Kara 
> Cc: Dave Jiang 
> Cc: Aaron Lu 
> Signed-off-by: "Huang, Ying" 
> 
> Changelog:
> 
> v4:
> 
> - Use synchronize_rcu() in enable_swap_info() to reduce overhead of
>   normal paths further.

Hi Huang,

This version is much better than the old one. To me, that's not because
of the RCU/SRCU/refcount choice, but because it adds the swap device
dependency (i.e., get/put) into every swap-related function, so users who
aren't interested in swap don't need to care about it. Good.

The problem is caused by freeing swap-related data structures
*dynamically*, while the old swap logic was based on static data
structures (i.e., never freed, so there was no need to verify staleness).
So, I reviewed some places that use PageSwapCache and swp_entry_t
and could therefore access swap-related data structures.

An example is __isolate_lru_page.

It calls page_mapping to get an address_space.
What happens if the page is in the swap cache and races with swapoff?
The mapping we got could disappear because of the race. Right?

Thanks.
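[Editorial aside: the window being pointed at here can be sketched as follows. This is a simplified excerpt, not the actual kernel source; the exact checks in __isolate_lru_page() differ.]

```c
/* Simplified sketch of the race window in __isolate_lru_page(). */
int __isolate_lru_page(struct page *page, isolate_mode_t mode)
{
        ...
        /* For a SwapCache page this returns the swap address_space. */
        mapping = page_mapping(page);

        /*
         * If swapoff() frees the swap cache address_space right here,
         * "mapping" is left dangling.  Only something held across both
         * steps (e.g. lru_lock, paired with a grace period in swapoff
         * between clearing PageSwapCache and freeing the swap cache)
         * can keep the dereference below from hitting freed memory.
         */
        if (mapping && !mapping->a_ops->migratepage)
                return ret;
        ...
}
```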

> 
> v3:
> 
> - Re-implemented with RCU to reduce the overhead of normal paths
> 
> v2:
> 
> - Re-implemented with SRCU to reduce the overhead of normal paths.
> 
> - Avoid checking whether the swap device has been swapped off in
>   get_swap_device(), because we can check the origin of the swap
>   entry to make sure the swap device hasn't been swapped off.
> ---
>  include/linux/swap.h |  11 -
>  mm/memory.c  |   2 +-
>  mm/swap_state.c  |  16 +--
>  mm/swapfile.c| 123 ++-
>  4 files changed, 116 insertions(+), 36 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2417d288e016..f7e8f26cf07f 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -172,8 +172,9 @@ enum {
>   SWP_PAGE_DISCARD = (1 << 9),/* freed swap page-cluster discards */
>   SWP_STABLE_WRITES = (1 << 10),  /* no overwrite PG_writeback pages */
>   SWP_SYNCHRONOUS_IO = (1 << 11), /* synchronous IO is efficient */
> + SWP_VALID   = (1 << 12),/* swap is valid to be operated on? */
>   /* add others here before... */
> - SWP_SCANNING= (1 << 12),/* refcount in scan_swap_map */
> + SWP_SCANNING= (1 << 13),/* refcount in scan_swap_map */
>  };
>  
>  #define SWAP_CLUSTER_MAX 32UL
> @@ -460,7 +461,7 @@ extern unsigned int count_swap_pages(int, int);
>  extern sector_t map_swap_page(struct page *, struct block_device **);
>  extern sector_t swapdev_block(int, pgoff_t);
>  extern int page_swapcount(struct page *);
> -extern int __swap_count(struct swap_info_struct *si, swp_entry_t entry);
> +extern int __swap_count(swp_entry_t entry);
>  extern int __swp_swapcount(swp_entry_t entry);
>  extern int swp_swapcount(swp_entry_t entry);
>  extern struct swap_info_struct *page_swap_info(struct page *);
> @@ -470,6 +471,12 @@ extern int try_to_free_swap(struct page *);
>  struct backing_dev_info;
>  extern int