Re: [PATCH] mm: migration: fix migration of huge PMD shared pages

2018-08-15 Thread Kirill A. Shutemov
On Tue, Aug 14, 2018 at 05:15:57PM -0700, Mike Kravetz wrote:
> On 08/14/2018 01:48 AM, Kirill A. Shutemov wrote:
> > On Mon, Aug 13, 2018 at 11:21:41PM +, Mike Kravetz wrote:
> >> On 08/13/2018 03:58 AM, Kirill A. Shutemov wrote:
> >>> On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
>  I am not 100% sure on the required flushing, so suggestions would be
>  appreciated.  This also should go to stable.  It has been around for
>  a long time so still looking for an appropriate 'fixes:'.
> >>>
> >>> I believe we need flushing. And huge_pmd_unshare() usage in
> >>> __unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
> >>> that case.
> >>
> >> Thanks Kirill,
> >>
> >> __unmap_hugepage_range() has two callers:
> >> 1) unmap_hugepage_range, which wraps the call with tlb_gather_mmu and
> >>tlb_finish_mmu on the range.  IIUC, this should cause an appropriate
> >>TLB flush.
> >> 2) __unmap_hugepage_range_final via unmap_single_vma.  unmap_single_vma
> >>   has three callers:
> >>   - unmap_vmas which assumes the caller will flush the whole range after
> >> return.
> >>   - zap_page_range wraps the call with tlb_gather_mmu/tlb_finish_mmu
> >>   - zap_page_range_single wraps the call with tlb_gather_mmu/tlb_finish_mmu
> >>
> >> So, it appears we are covered.  But, I could be missing something.
> > 
> > My problem here is that the mapping that is moved by huge_pmd_unshare() is
> > not accounted in the mmu_gather and can be missed by tlb_finish_mmu().
> 
> Ah, I think I now see the issue you are concerned with.
> 
> When huge_pmd_unshare succeeds we effectively unmap a PUD_SIZE area.
> The routine __unmap_hugepage_range may only have been passed a range
> that is a subset of PUD_SIZE.  In the case I was trying to address,
> try_to_unmap_one(), the 'range' will certainly be less than PUD_SIZE.
> Upon further thought, I think that even in the case of try_to_unmap_one
> we should flush a PUD_SIZE range.
> 
> My first thought would be to embed this flushing within huge_pmd_unshare
> itself.  Perhaps, whenever huge_pmd_unshare succeeds we should do an
> explicit:
> flush_cache_range(PUD_SIZE)
> flush_tlb_range(PUD_SIZE)
> mmu_notifier_invalidate_range(PUD_SIZE)
> That would take some of the burden off the callers of huge_pmd_unshare.
> However, I am not sure if the flushing calls above play nice in all the
> calling environments.  I'll look into it some more, but would appreciate
> additional comments.

I don't think it would work: flush_tlb_range() issues IPIs, and calling it
under a spinlock will not go well. I think we need to find a way to account
it properly in the mmu_gather. It's not obvious to me how.

-- 
 Kirill A. Shutemov
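
A minimal sketch of the mmu_gather accounting idea raised above: grow the
gather's tracked range to the PUD area that a successful huge_pmd_unshare()
tore down, so the deferred tlb_finish_mmu() flush covers it rather than
calling flush_tlb_range() under the page-table lock. The helper name is
hypothetical, and whether widening tlb->start/tlb->end (the range fields in
include/asm-generic/tlb.h) is sufficient on every architecture is exactly the
open question in the thread.

/* Hypothetical helper; a sketch, not a patch from this thread. */
static void tlb_account_pud_unshare(struct mmu_gather *tlb,
                                    unsigned long address)
{
        unsigned long start = address & PUD_MASK;
        unsigned long end = start + PUD_SIZE;

        /* Widen the deferred-flush window kept by the mmu_gather. */
        tlb->start = min(tlb->start, start);
        tlb->end = max(tlb->end, end);
}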


Re: [PATCH] mm: migration: fix migration of huge PMD shared pages

2018-08-14 Thread Mike Kravetz
On 08/14/2018 01:48 AM, Kirill A. Shutemov wrote:
> On Mon, Aug 13, 2018 at 11:21:41PM +, Mike Kravetz wrote:
>> On 08/13/2018 03:58 AM, Kirill A. Shutemov wrote:
>>> On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
 I am not 100% sure on the required flushing, so suggestions would be
 appreciated.  This also should go to stable.  It has been around for
 a long time so still looking for an appropriate 'fixes:'.
>>>
>>> I believe we need flushing. And huge_pmd_unshare() usage in
>>> __unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
>>> that case.
>>
>> Thanks Kirill,
>>
>> __unmap_hugepage_range() has two callers:
>> 1) unmap_hugepage_range, which wraps the call with tlb_gather_mmu and
>>tlb_finish_mmu on the range.  IIUC, this should cause an appropriate
>>TLB flush.
>> 2) __unmap_hugepage_range_final via unmap_single_vma.  unmap_single_vma
>>   has three callers:
>>   - unmap_vmas which assumes the caller will flush the whole range after
>> return.
>>   - zap_page_range wraps the call with tlb_gather_mmu/tlb_finish_mmu
>>   - zap_page_range_single wraps the call with tlb_gather_mmu/tlb_finish_mmu
>>
>> So, it appears we are covered.  But, I could be missing something.
> 
> My problem here is that the mapping that is moved by huge_pmd_unshare() is
> not accounted in the mmu_gather and can be missed by tlb_finish_mmu().

Ah, I think I now see the issue you are concerned with.

When huge_pmd_unshare succeeds we effectively unmap a PUD_SIZE area.
The routine __unmap_hugepage_range may only have been passed a range
that is a subset of PUD_SIZE.  In the case I was trying to address,
try_to_unmap_one(), the 'range' will certainly be less than PUD_SIZE.
Upon further thought, I think that even in the case of try_to_unmap_one
we should flush a PUD_SIZE range.

My first thought would be to embed this flushing within huge_pmd_unshare
itself.  Perhaps, whenever huge_pmd_unshare succeeds we should do an
explicit:
flush_cache_range(PUD_SIZE)
flush_tlb_range(PUD_SIZE)
mmu_notifier_invalidate_range(PUD_SIZE)
That would take some of the burden off the callers of huge_pmd_unshare.
However, I am not sure if the flushing calls above play nice in all the
calling environments.  I'll look into it some more, but would appreciate
additional comments.

-- 
Mike Kravetz
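
Spelled out with real arguments, the PUD-wide flushing proposed above would
look roughly like the sketch below if open-coded next to a successful
huge_pmd_unshare(). The helper name and the assumption that the VMA is in
hand at that point are additions for illustration, and the later objection in
the thread about flush_tlb_range() under the page-table spinlock applies to
it as written.

/* Sketch of the proposed "flush the whole shared PUD area" step. */
static void flush_unshared_pud_range(struct vm_area_struct *vma,
                                     unsigned long address)
{
        /* PUD_SIZE-aligned range (1GB on x86) that the unshare removed. */
        unsigned long start = address & PUD_MASK;
        unsigned long end = start + PUD_SIZE;

        flush_cache_range(vma, start, end);
        flush_tlb_range(vma, start, end);
        mmu_notifier_invalidate_range(vma->vm_mm, start, end);
}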


Re: [PATCH] mm: migration: fix migration of huge PMD shared pages

2018-08-14 Thread Kirill A. Shutemov
On Mon, Aug 13, 2018 at 11:21:41PM +, Mike Kravetz wrote:
> On 08/13/2018 03:58 AM, Kirill A. Shutemov wrote:
> > On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
> >> The page migration code employs try_to_unmap() to try and unmap the
> >> source page.  This is accomplished by using rmap_walk to find all
> >> vmas where the page is mapped.  This search stops when page mapcount
> >> is zero.  For shared PMD huge pages, the page map count is always 1
> >> no matter the number of mappings.  Shared mappings are tracked via
> >> the reference count of the PMD page.  Therefore, try_to_unmap stops
> >> prematurely and does not completely unmap all mappings of the source
> >> page.
> >>
> >> This problem can result in data corruption as writes to the original
> >> source page can happen after contents of the page are copied to the
> >> target page.  Hence, data is lost.
> >>
> >> This problem was originally seen as DB corruption of shared global
> >> areas after a huge page was soft offlined.  DB developers noticed
> >> they could reproduce the issue by (hotplug) offlining memory used
> >> to back huge pages.  A simple testcase can reproduce the problem by
> >> creating a shared PMD mapping (note that this must be at least
> >> PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
> >> migrate_pages() to migrate process pages between nodes.
> >>
> >> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> >> by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
> >> shared mapping it will be 'unshared' which removes the page table
> >> entry and drops reference on PMD page.  After this, flush caches and
> >> TLB.
> >>
> >> Signed-off-by: Mike Kravetz 
> >> ---
> >> I am not 100% sure on the required flushing, so suggestions would be
> >> appreciated.  This also should go to stable.  It has been around for
> >> a long time so still looking for an appropriate 'fixes:'.
> > 
> > I believe we need flushing. And huge_pmd_unshare() usage in
> > __unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
> > that case.
> 
> Thanks Kirill,
> 
> __unmap_hugepage_range() has two callers:
> 1) unmap_hugepage_range, which wraps the call with tlb_gather_mmu and
>tlb_finish_mmu on the range.  IIUC, this should cause an appropriate
>TLB flush.
> 2) __unmap_hugepage_range_final via unmap_single_vma.  unmap_single_vma
>   has three callers:
>   - unmap_vmas which assumes the caller will flush the whole range after
> return.
>   - zap_page_range wraps the call with tlb_gather_mmu/tlb_finish_mmu
>   - zap_page_range_single wraps the call with tlb_gather_mmu/tlb_finish_mmu
> 
> So, it appears we are covered.  But, I could be missing something.

My problem here is that the mapping that is moved by huge_pmd_unshare() is
not accounted in the mmu_gather and can be missed by tlb_finish_mmu().

-- 
 Kirill A. Shutemov


Re: [PATCH] mm: migration: fix migration of huge PMD shared pages

2018-08-13 Thread Mike Kravetz
On 08/13/2018 03:58 AM, Kirill A. Shutemov wrote:
> On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
>> The page migration code employs try_to_unmap() to try and unmap the
>> source page.  This is accomplished by using rmap_walk to find all
>> vmas where the page is mapped.  This search stops when page mapcount
>> is zero.  For shared PMD huge pages, the page map count is always 1
>> no matter the number of mappings.  Shared mappings are tracked via
>> the reference count of the PMD page.  Therefore, try_to_unmap stops
>> prematurely and does not completely unmap all mappings of the source
>> page.
>>
>> This problem can result in data corruption as writes to the original
>> source page can happen after contents of the page are copied to the
>> target page.  Hence, data is lost.
>>
>> This problem was originally seen as DB corruption of shared global
>> areas after a huge page was soft offlined.  DB developers noticed
>> they could reproduce the issue by (hotplug) offlining memory used
>> to back huge pages.  A simple testcase can reproduce the problem by
>> creating a shared PMD mapping (note that this must be at least
>> PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
>> migrate_pages() to migrate process pages between nodes.
>>
>> To fix, have the try_to_unmap_one routine check for huge PMD sharing
>> by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
>> shared mapping it will be 'unshared' which removes the page table
>> entry and drops reference on PMD page.  After this, flush caches and
>> TLB.
>>
>> Signed-off-by: Mike Kravetz 
>> ---
>> I am not 100% sure on the required flushing, so suggestions would be
>> appreciated.  This also should go to stable.  It has been around for
>> a long time so still looking for an appropriate 'fixes:'.
> 
> I believe we need flushing. And huge_pmd_unshare() usage in
> __unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
> that case.

Thanks Kirill,

__unmap_hugepage_range() has two callers:
1) unmap_hugepage_range, which wraps the call with tlb_gather_mmu and
   tlb_finish_mmu on the range.  IIUC, this should cause an appropriate
   TLB flush.
2) __unmap_hugepage_range_final via unmap_single_vma.  unmap_single_vma
  has three callers:
  - unmap_vmas which assumes the caller will flush the whole range after
return.
  - zap_page_range wraps the call with tlb_gather_mmu/tlb_finish_mmu
  - zap_page_range_single wraps the call with tlb_gather_mmu/tlb_finish_mmu

So, it appears we are covered.  But, I could be missing something.

My primary reason for asking the question was with respect to the code
added to try_to_unmap_one.  In my testing, the changes I added appeared
to be required.  Just wanted to make sure.

I need to fix a build issue and will send another version.
-- 
Mike Kravetz
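
For reference, the caller-side wrapping described in the list above follows
the usual mmu_gather pattern, roughly as below (paraphrased from the
v4.18-era unmap_hugepage_range(), not quoted): tlb_gather_mmu() starts range
tracking, the unmap populates the gather, and tlb_finish_mmu() performs the
deferred TLB flush.

/* Paraphrase of the wrapper pattern; v4.18-era signatures assumed. */
void unmap_hugepage_range_sketch(struct vm_area_struct *vma,
                                 unsigned long start, unsigned long end,
                                 struct page *ref_page)
{
        struct mmu_gather tlb;

        tlb_gather_mmu(&tlb, vma->vm_mm, start, end);
        __unmap_hugepage_range(&tlb, vma, start, end, ref_page);
        tlb_finish_mmu(&tlb, start, end);       /* deferred TLB flush */
}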


Re: [PATCH] mm: migration: fix migration of huge PMD shared pages

2018-08-13 Thread Kirill A. Shutemov
On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
> The page migration code employs try_to_unmap() to try and unmap the
> source page.  This is accomplished by using rmap_walk to find all
> vmas where the page is mapped.  This search stops when page mapcount
> is zero.  For shared PMD huge pages, the page map count is always 1
> no matter the number of mappings.  Shared mappings are tracked via
> the reference count of the PMD page.  Therefore, try_to_unmap stops
> prematurely and does not completely unmap all mappings of the source
> page.
> 
> This problem can result in data corruption as writes to the original
> source page can happen after contents of the page are copied to the
> target page.  Hence, data is lost.
> 
> This problem was originally seen as DB corruption of shared global
> areas after a huge page was soft offlined.  DB developers noticed
> they could reproduce the issue by (hotplug) offlining memory used
> to back huge pages.  A simple testcase can reproduce the problem by
> creating a shared PMD mapping (note that this must be at least
> PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
> migrate_pages() to migrate process pages between nodes.
> 
> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
> shared mapping it will be 'unshared' which removes the page table
> entry and drops reference on PMD page.  After this, flush caches and
> TLB.
> 
> Signed-off-by: Mike Kravetz 
> ---
> I am not 100% sure on the required flushing, so suggestions would be
> appreciated.  This also should go to stable.  It has been around for
> a long time so still looking for an appropriate 'fixes:'.

I believe we need flushing. And huge_pmd_unshare() usage in
__unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
that case.

-- 
 Kirill A. Shutemov
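
As a concrete illustration of the testcase outlined in the quoted
description, a reproducer could look roughly like the sketch below. The
hugetlbfs path, mapping hint address, huge page stride, and node numbers are
assumptions for illustration; the essential ingredients from the description
are a MAP_SHARED hugetlbfs mapping at least PUD_SIZE in size and PUD_SIZE
aligned, a second process mapping the same file so the PMDs can be shared,
and migrate_pages(2).

/* Reproducer sketch.  Assumptions: hugetlbfs mounted at /dev/hugepages,
 * 2MB default huge pages, NUMA nodes 0 and 1 present, and the mapping
 * hint address below free and PUD_SIZE aligned. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define PUD_SZ          (1UL << 30)             /* 1GB on x86_64 */
#define HUGE_STRIDE     (2UL << 20)             /* touch one 2MB page at a time */
#define MAP_HINT        ((void *)(1UL << 40))   /* PUD_SIZE-aligned hint */

int main(void)
{
        int fd = open("/dev/hugepages/pmd-share-test", O_CREAT | O_RDWR, 0600);

        if (fd < 0 || ftruncate(fd, PUD_SZ)) {
                perror("hugetlbfs file");
                return 1;
        }

        /* MAP_SHARED plus >= PUD_SIZE size and alignment enables PMD sharing. */
        char *p = mmap(MAP_HINT, PUD_SZ, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* A second process mapping the same range is what shares the PMDs. */
        pid_t child = fork();
        if (child < 0) {
                perror("fork");
                return 1;
        }
        if (child == 0) {
                for (unsigned long off = 0; off < PUD_SZ; off += HUGE_STRIDE)
                        p[off] = 1;             /* fault the mapping in */
                pause();                        /* keep the shared mapping alive */
                _exit(0);
        }

        sleep(1);       /* crude: let the child populate its page tables */
        for (unsigned long off = 0; off < PUD_SZ; off += HUGE_STRIDE)
                p[off] = 1;

        /* Move this process's pages from node 0 to node 1. */
        unsigned long old_nodes = 1UL << 0, new_nodes = 1UL << 1;

        if (syscall(SYS_migrate_pages, 0, 8 * sizeof(unsigned long),
                    &old_nodes, &new_nodes) < 0)
                perror("migrate_pages");

        kill(child, SIGKILL);
        wait(NULL);
        return 0;
}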


Re: [PATCH] mm: migration: fix migration of huge PMD shared pages

2018-08-12 Thread kbuild test robot
Hi Mike,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18 next-20180810]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Mike-Kravetz/mm-migration-fix-migration-of-huge-PMD-shared-pages/20180813-114549
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   mm/rmap.c: In function 'try_to_unmap_one':
>> mm/rmap.c:1425:7: error: implicit declaration of function 'huge_pmd_unshare'; did you mean 'do_huge_pmd_wp_page'? [-Werror=implicit-function-declaration]
       huge_pmd_unshare(mm, &address, pvmw.pte)) {
       ^~~~~~~~~~~~~~~~
       do_huge_pmd_wp_page
   cc1: some warnings being treated as errors

vim +1425 mm/rmap.c

  1382  
  1383                  /*
  1384                   * If the page is mlock()d, we cannot swap it out.
  1385                   * If it's recently referenced (perhaps page_referenced
  1386                   * skipped over this mm) then we should reactivate it.
  1387                   */
  1388                  if (!(flags & TTU_IGNORE_MLOCK)) {
  1389                          if (vma->vm_flags & VM_LOCKED) {
  1390                                  /* PTE-mapped THP are never mlocked */
  1391                                  if (!PageTransCompound(page)) {
  1392                                          /*
  1393                                           * Holding pte lock, we do *not* need
  1394                                           * mmap_sem here
  1395                                           */
  1396                                          mlock_vma_page(page);
  1397                                  }
  1398                                  ret = false;
  1399                                  page_vma_mapped_walk_done(&pvmw);
  1400                                  break;
  1401                          }
  1402                          if (flags & TTU_MUNLOCK)
  1403                                  continue;
  1404                  }
  1405  
  1406                  /* Unexpected PMD-mapped THP? */
  1407                  VM_BUG_ON_PAGE(!pvmw.pte, page);
  1408  
  1409                  subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
  1410                  address = pvmw.address;
  1411  
  1412                  /*
  1413                   * PMDs for hugetlbfs pages could be shared.  In this case,
  1414                   * pages with shared PMDs will have a mapcount of 1 no matter
  1415                   * how many times it is actually mapped.  Map counting for
  1416                   * PMD sharing is mostly done via the reference count on the
  1417                   * PMD page itself.  If the page we are trying to unmap is a
  1418                   * hugetlbfs page, attempt to 'unshare' at the PMD level.
  1419                   * huge_pmd_unshare takes care of clearing the PUD and
  1420                   * reference counting on the PMD page which effectively unmaps
  1421                   * the page.  Take care of flushing cache and TLB for page in
  1422                   * this specific mapping here.
  1423                   */
  1424                  if (PageHuge(page) &&
> 1425                      huge_pmd_unshare(mm, &address, pvmw.pte)) {
  1426                          unsigned long end_add = address + vma_mmu_pagesize(vma);
  1427  
  1428                          flush_cache_range(vma, address, end_add);
  1429                          flush_tlb_range(vma, address, end_add);
  1430                          mmu_notifier_invalidate_range(mm, address, end_add);
  1431                          continue;
  1432                  }
  1433  
  1434                  if (IS_ENABLED(CONFIG_MIGRATION) &&
  1435                      (flags & TTU_MIGRATION) &&
  1436                      is_zone_device_page(page)) {
  1437                          swp_entry_t entry;
  1438                          pte_t swp_pte;
  1439  
  1440                          pteval = ptep_get_and_clear(mm, pvmw.address, pvmw.pte);
  1441  
  1442                          /*
  1443                           * Store the pfn of the page in a special migration
  1444                           * pte. do_swap_page() will wait until the migration
  1445                           * pte is removed and then restart fault handling.
  1446                           */
  1447                          entry = make_migration_entry(page, 0);
  1448                          swp_pte = swp_entry_to_pte(entry);
  1449                          if (pte_soft_dirty(pteval))
  1450                                  swp_pte = pte_swp_mksoft_dirty(swp_pte);
  1451                          set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
  1452  
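
The error comes from configurations that do not build the hugetlb
PMD-sharing code, so mm/rmap.c sees no declaration of huge_pmd_unshare().
One way a respin could address it is the usual fallback-stub pattern
sketched below; this shows the pattern only, it is not the actual follow-up
patch, and the precise config guard to key off is an assumption.

/* Sketch of a hugetlb.h-style fallback so try_to_unmap_one() always
 * sees a declaration; the #ifdef condition here is an assumption. */
#ifdef CONFIG_HUGETLB_PAGE
int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
#else
static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
                                   pte_t *ptep)
{
        return 0;       /* nothing shared, nothing to unshare */
}
#endif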

Re: [PATCH] mm: migration: fix migration of huge PMD shared pages

2018-08-12 Thread kbuild test robot
Hi Mike,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18 next-20180810]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Mike-Kravetz/mm-migration-fix-migration-of-huge-PMD-shared-pages/20180813-114549
config: i386-randconfig-x003-201832 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   mm/rmap.c: In function 'try_to_unmap_one':
>> mm/rmap.c:1425:7: error: implicit declaration of function 'huge_pmd_unshare'; did you mean '__NR_unshare'? [-Werror=implicit-function-declaration]
       huge_pmd_unshare(mm, &address, pvmw.pte)) {
       ^~~~~~~~~~~~~~~~
       __NR_unshare
   cc1: some warnings being treated as errors

vim +1425 mm/rmap.c

  1382  
  1383                  /*
  1384                   * If the page is mlock()d, we cannot swap it out.
  1385                   * If it's recently referenced (perhaps page_referenced
  1386                   * skipped over this mm) then we should reactivate it.
  1387                   */
  1388                  if (!(flags & TTU_IGNORE_MLOCK)) {
  1389                          if (vma->vm_flags & VM_LOCKED) {
  1390                                  /* PTE-mapped THP are never mlocked */
  1391                                  if (!PageTransCompound(page)) {
  1392                                          /*
  1393                                           * Holding pte lock, we do *not* need
  1394                                           * mmap_sem here
  1395                                           */
  1396                                          mlock_vma_page(page);
  1397                                  }
  1398                                  ret = false;
  1399                                  page_vma_mapped_walk_done(&pvmw);
  1400                                  break;
  1401                          }
  1402                          if (flags & TTU_MUNLOCK)
  1403                                  continue;
  1404                  }
  1405  
  1406                  /* Unexpected PMD-mapped THP? */
  1407                  VM_BUG_ON_PAGE(!pvmw.pte, page);
  1408  
  1409                  subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
  1410                  address = pvmw.address;
  1411  
  1412                  /*
  1413                   * PMDs for hugetlbfs pages could be shared.  In this case,
  1414                   * pages with shared PMDs will have a mapcount of 1 no matter
  1415                   * how many times it is actually mapped.  Map counting for
  1416                   * PMD sharing is mostly done via the reference count on the
  1417                   * PMD page itself.  If the page we are trying to unmap is a
  1418                   * hugetlbfs page, attempt to 'unshare' at the PMD level.
  1419                   * huge_pmd_unshare takes care of clearing the PUD and
  1420                   * reference counting on the PMD page which effectively unmaps
  1421                   * the page.  Take care of flushing cache and TLB for page in
  1422                   * this specific mapping here.
  1423                   */
  1424                  if (PageHuge(page) &&
> 1425                      huge_pmd_unshare(mm, &address, pvmw.pte)) {
  1426                          unsigned long end_add = address + vma_mmu_pagesize(vma);
  1427  
  1428                          flush_cache_range(vma, address, end_add);
  1429                          flush_tlb_range(vma, address, end_add);
  1430                          mmu_notifier_invalidate_range(mm, address, end_add);
  1431                          continue;
  1432                  }
  1433  
  1434                  if (IS_ENABLED(CONFIG_MIGRATION) &&
  1435                      (flags & TTU_MIGRATION) &&
  1436                      is_zone_device_page(page)) {
  1437                          swp_entry_t entry;
  1438                          pte_t swp_pte;
  1439  
  1440                          pteval = ptep_get_and_clear(mm, pvmw.address, pvmw.pte);
  1441  
  1442                          /*
  1443                           * Store the pfn of the page in a special migration
  1444                           * pte. do_swap_page() will wait until the migration
  1445                           * pte is removed and then restart fault handling.
  1446                           */
  1447                          entry = make_migration_entry(page, 0);
  1448                          swp_pte = swp_entry_to_pte(entry);
  1449                          if (pte_soft_dirty(pteval))
  1450                                  swp_pte = pte_swp_mksoft_dirty(swp_pte);
  1451                          set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
  1452  
