Re: [PATCH v11 19/26] mm: provide speculative fault infrastructure

2018-07-25 Thread zhong jiang
Re: [PATCH v11 19/26] mm: provide speculative fault infrastructure

2018-07-25 Thread Laurent Dufour




Re: [PATCH v11 19/26] mm: provide speculative fault infrastructure

2018-07-25 Thread zhong jiang

Re: [PATCH v11 19/26] mm: provide speculative fault infrastructure

2018-07-24 Thread zhong jiang

Re: [PATCH v11 19/26] mm: provide speculative fault infrastructure

2018-07-24 Thread Laurent Dufour




[PATCH v11 19/26] mm: provide speculative fault infrastructure

2018-05-17 Thread Laurent Dufour
From: Peter Zijlstra 

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

Not holding mmap_sem means we can race against VMA
change/removal and page-table destruction. We use the SRCU VMA freeing
to keep the VMA around. We use the VMA seqcount to detect change
(including unmapping / page-table deletion) and we use gup_fast() style
page-table walking to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate that the state we started the fault with is still valid; if
not, we fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.

Signed-off-by: Peter Zijlstra (Intel) 

[Manage the newly introduced pte_spinlock() for speculative page
 fault to fail if the VMA is touched behind our back]
[Rename vma_is_dead() to vma_has_changed() and declare it here]
[Fetch p4d and pud]
[Set vmf.sequence in __handle_mm_fault()]
[Abort speculative path when handle_userfault() has to be called]
[Add additional VMA's flags checks in handle_speculative_fault()]
[Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
[Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
[Remove warning comment about waiting for !seq&1 since we don't want
 to wait]
[Remove warning about no huge page support, mention it explicitly]
[Don't call do_fault() in the speculative path as __do_fault() calls
 vma->vm_ops->fault() which may want to release mmap_sem]
[Only vm_fault pointer argument for vma_has_changed()]
[Fix check against huge page, calling pmd_trans_huge()]
[Use READ_ONCE() when reading VMA's fields in the speculative path]
[Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support the
 processing done in vm_normal_page()]
[Check that vma->anon_vma is already set when starting the speculative
 path]
[Check for memory policy as we can't support MPOL_INTERLEAVE case due to
 the processing done in mpol_misplaced()]
[Don't support VMA growing up or down]
[Move check on vm_sequence just before calling handle_pte_fault()]
[Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Add mem cgroup oom check]
[Use READ_ONCE to access p*d entries]
[Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
[Don't fetch pte again in handle_pte_fault() when running the speculative
 path]
[Check PMD against concurrent collapsing operation]
[Try to spin lock the pte during the speculative path to avoid deadlock with
 other CPUs invalidating the TLB and requiring this CPU to catch the
 inter-processor interrupt]
[Move define of FAULT_FLAG_SPECULATIVE here]
[Introduce __handle_speculative_fault() and add a check against
 mm->mm_users in handle_speculative_fault() defined in mm.h]
Signed-off-by: Laurent Dufour 
---
 include/linux/hugetlb_inline.h |   2 +-
 include/linux/mm.h |  30 
 include/linux/pagemap.h|   4 +-
 mm/internal.h  |  16 +-
 mm/memory.c| 340 -
 5 files changed, 385 insertions(+), 7 deletions(-)

diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 0660a03d37d9..9e25283d6fc9 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -8,7 +8,7 @@
 
 static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
 {
-   return !!(vma->vm_flags & VM_HUGETLB);
+   return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
 }
 
 #else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 05cbba70104b..31acf98a7d92 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -315,6 +315,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_USER	0x40	/* The fault originated in userspace */
 #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
 #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
+#define FAULT_FLAG_SPECULATIVE	0x200	/* Speculative fault, not holding mmap_sem */
 
 #define FAULT_FLAG_TRACE \
{ FAULT_FLAG_WRITE, "WRITE" }, \
@@ -343,6 +344,10 @@ struct vm_fault {
	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
pgoff_t pgoff;  /* Logical page offset based on vma */
unsigned long address;  /* Faulting virtual address */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   unsigned int sequence;
+   pmd_t orig_pmd; /* value of PMD at the time of fault */
+#endif
pmd_t *pmd; /* Pointer to pmd entry matching
 * the 'address' */
pud_t *pud; /* Pointer to pud entry matching
@@ -1415,6 +1420,31 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags);
+
+#ifdef