Re: [PATCH -mm] mm, hugetlb: Pass fault address to no page handler

2018-05-16 Thread Mike Kravetz
On 05/16/2018 02:12 AM, Michal Hocko wrote:
> On Tue 15-05-18 08:57:56, Huang, Ying wrote:
>> From: Huang Ying 
>>
>> This is to take better advantage of huge page clearing
>> optimization (c79b57e462b5d, "mm: hugetlb: clear target sub-page last
>> when clearing huge page").  Which will clear to access sub-page last
>> to avoid the cache lines of to access sub-page to be evicted when
>> clearing other sub-pages.  This needs to get the address of the
>> sub-page to access, that is, the fault address inside of the huge
>> page.  So the hugetlb no page fault handler is changed to pass that
>> information.  This will benefit workloads which don't access the begin
>> of the huge page after page fault.
>>
>> With this patch, the throughput increases ~28.1% in vm-scalability
>> anon-w-seq test case with 88 processes on a 2 socket Xeon E5 2699 v4
>> system (44 cores, 88 threads).  The test case creates 88 processes,
>> each process mmap a big anonymous memory area and writes to it from
>> the end to the begin.  For each process, other processes could be seen
>> as other workload which generates heavy cache pressure.  At the same
>> time, the cache miss rate reduced from ~36.3% to ~25.6%, the
>> IPC (instruction per cycle) increased from 0.3 to 0.37, and the time
>> spent in user space is reduced ~19.3%
> 
> This paragraph is confusing as Mike mentioned already. It would be
> probably more helpful to see how was the test configured to use hugetlb
> pages and what is the end benefit.
> 
> I do not have any real objection to the implementation so feel free to
> add
> Acked-by: Michal Hocko 
> I am just wondering what is the usecase driving this. Or is it just a
> generic optimization that always makes sense to do? Indicating that in
> the changelog would be helpful as well.

I just noticed that the optimization was not added for 'gigantic' pages.
Should we consider adding support for gigantic pages as well?  It may be
that the cache miss cost is insignificant when added to the time required
to clear a 1GB (for x86) gigantic page.

One more thing, I'm guessing the copy_huge/gigantic_page() routines would
see a similar benefit.  Specifically, for copies as a result of a COW.
Is that another area to consider?

That gets back to Michal's question of a specific use case or generic
optimization.  Unless code is simple (as in this patch), seems like we should
hold off on considering additional optimizations unless there is a specific
use case.

I'm still OK with this change.
-- 
Mike Kravetz


Re: [PATCH -mm] mm, hugetlb: Pass fault address to no page handler

2018-05-16 Thread Michal Hocko
On Tue 15-05-18 08:57:56, Huang, Ying wrote:
> From: Huang Ying 
> 
> This is to take better advantage of huge page clearing
> optimization (c79b57e462b5d, "mm: hugetlb: clear target sub-page last
> when clearing huge page").  Which will clear to access sub-page last
> to avoid the cache lines of to access sub-page to be evicted when
> clearing other sub-pages.  This needs to get the address of the
> sub-page to access, that is, the fault address inside of the huge
> page.  So the hugetlb no page fault handler is changed to pass that
> information.  This will benefit workloads which don't access the begin
> of the huge page after page fault.
> 
> With this patch, the throughput increases ~28.1% in vm-scalability
> anon-w-seq test case with 88 processes on a 2 socket Xeon E5 2699 v4
> system (44 cores, 88 threads).  The test case creates 88 processes,
> each process mmap a big anonymous memory area and writes to it from
> the end to the begin.  For each process, other processes could be seen
> as other workload which generates heavy cache pressure.  At the same
> time, the cache miss rate reduced from ~36.3% to ~25.6%, the
> IPC (instruction per cycle) increased from 0.3 to 0.37, and the time
> spent in user space is reduced ~19.3%

This paragraph is confusing, as Mike mentioned already. It would
probably be more helpful to see how the test was configured to use
hugetlb pages and what the end benefit is.

I do not have any real objection to the implementation so feel free to
add
Acked-by: Michal Hocko 
I am just wondering what use case is driving this. Or is it just a
generic optimization that always makes sense to do? Indicating that in
the changelog would be helpful as well.

Thanks!

> Signed-off-by: "Huang, Ying" 
> Cc: Andrea Arcangeli 
> Cc: "Kirill A. Shutemov" 
> Cc: Andi Kleen 
> Cc: Jan Kara 
> Cc: Michal Hocko 
> Cc: Matthew Wilcox 
> Cc: Hugh Dickins 
> Cc: Minchan Kim 
> Cc: Shaohua Li 
> Cc: Christopher Lameter 
> Cc: "Aneesh Kumar K.V" 
> Cc: Punit Agrawal 
> Cc: Anshuman Khandual 
> ---
>  mm/hugetlb.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 129088710510..3de6326abf39 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3677,7 +3677,7 @@ int huge_add_to_page_cache(struct page *page, struct 
> address_space *mapping,
>  
>  static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  struct address_space *mapping, pgoff_t idx,
> -unsigned long address, pte_t *ptep, unsigned int 
> flags)
> +unsigned long faddress, pte_t *ptep, unsigned int 
> flags)
>  {
>   struct hstate *h = hstate_vma(vma);
>   int ret = VM_FAULT_SIGBUS;
> @@ -3686,6 +3686,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   struct page *page;
>   pte_t new_pte;
>   spinlock_t *ptl;
> + unsigned long address = faddress & huge_page_mask(h);
>  
>   /*
>* Currently, we are forced to kill the process in the event the
> @@ -3749,7 +3750,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   ret = VM_FAULT_SIGBUS;
>   goto out;
>   }
> - clear_huge_page(page, address, pages_per_huge_page(h));
> + clear_huge_page(page, faddress, pages_per_huge_page(h));
>   __SetPageUptodate(page);
>   set_page_huge_active(page);
>  
> @@ -3871,7 +3872,7 @@ u32 hugetlb_fault_mutex_hash(struct hstate *h, struct 
> mm_struct *mm,
>  #endif
>  
>  int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> - unsigned long address, unsigned int flags)
> + unsigned long faddress, unsigned int flags)
>  {
>   pte_t *ptep, entry;
>   spinlock_t *ptl;
> @@ -3883,8 +3884,7 @@ int hugetlb_fault(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   struct hstate *h = hstate_vma(vma);
>   struct address_space *mapping;
>   int need_wait_lock = 0;
> -
> - address &= huge_page_mask(h);
> + unsigned long address = faddress & huge_page_mask(h);
>  
>   ptep = huge_pte_offset(mm, address, huge_page_size(h));
>   if (ptep) {
> @@ -3914,7 +3914,7 @@ int hugetlb_fault(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>  
>   entry = huge_ptep_get(ptep);
>   if (huge_pte_none(entry)) {
> - ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, 
> flags);
> + ret = hugetlb_no_page(mm, vma, mapping, idx, faddress, ptep, 
> flags);
>   goto out_mutex;
>   }
>  
> -- 
> 2.16.1
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH -mm] mm, hugetlb: Pass fault address to no page handler

2018-05-16 Thread Kirill A. Shutemov
On Wed, May 16, 2018 at 12:42:43AM +, Huang, Ying wrote:
> >> +  unsigned long address = faddress & huge_page_mask(h);
> >
> > faddress? I would rather keep it as 'address' and rename the masked-out variable to
> > 'haddr'. We use 'haddr' for this purpose in other places.
> 
> I found haddr is popular in huge_memory.c but not used in hugetlb.c at
> all.  Is it desirable to start to use "haddr" in hugetlb.c?

Yes, I think so. There's no reason to limit haddr convention to THP.
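
The naming convention discussed here could be sketched in userspace as follows. This is only an illustration under stated assumptions: the 2 MiB huge page size, the EX_HPAGE_* macros, struct fault_addrs and split_fault_address() are all hypothetical names for this example; in the patch the mask comes from huge_page_mask(h).

```c
/* Assumed 2 MiB huge page; the kernel derives the mask per hstate. */
#define EX_HPAGE_SIZE (1UL << 21)
#define EX_HPAGE_MASK (~(EX_HPAGE_SIZE - 1))

/* Hypothetical pair: the raw fault address and its huge-page-aligned base. */
struct fault_addrs {
	unsigned long address;	/* unmodified fault address */
	unsigned long haddr;	/* address masked down to the huge page boundary */
};

/* Keep 'address' as passed in and derive 'haddr' from it. */
static struct fault_addrs split_fault_address(unsigned long address)
{
	struct fault_addrs fa;

	fa.address = address;
	fa.haddr = address & EX_HPAGE_MASK;
	return fa;
}
```

With this split, code that needs the huge page base uses haddr, while code that cares about the exact faulting sub-page keeps using address.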

-- 
 Kirill A. Shutemov


Re: [PATCH -mm] mm, hugetlb: Pass fault address to no page handler

2018-05-15 Thread Kirill A. Shutemov
On Tue, May 15, 2018 at 08:57:56AM +0800, Huang, Ying wrote:
> From: Huang Ying 
> 
> This is to take better advantage of huge page clearing
> optimization (c79b57e462b5d, "mm: hugetlb: clear target sub-page last
> when clearing huge page").  Which will clear to access sub-page last
> to avoid the cache lines of to access sub-page to be evicted when
> clearing other sub-pages.  This needs to get the address of the
> sub-page to access, that is, the fault address inside of the huge
> page.  So the hugetlb no page fault handler is changed to pass that
> information.  This will benefit workloads which don't access the begin
> of the huge page after page fault.
> 
> With this patch, the throughput increases ~28.1% in vm-scalability
> anon-w-seq test case with 88 processes on a 2 socket Xeon E5 2699 v4
> system (44 cores, 88 threads).  The test case creates 88 processes,
> each process mmap a big anonymous memory area and writes to it from
> the end to the begin.  For each process, other processes could be seen
> as other workload which generates heavy cache pressure.  At the same
> time, the cache miss rate reduced from ~36.3% to ~25.6%, the
> IPC (instruction per cycle) increased from 0.3 to 0.37, and the time
> spent in user space is reduced ~19.3%
> 
> Signed-off-by: "Huang, Ying" 
> Cc: Andrea Arcangeli 
> Cc: "Kirill A. Shutemov" 
> Cc: Andi Kleen 
> Cc: Jan Kara 
> Cc: Michal Hocko 
> Cc: Matthew Wilcox 
> Cc: Hugh Dickins 
> Cc: Minchan Kim 
> Cc: Shaohua Li 
> Cc: Christopher Lameter 
> Cc: "Aneesh Kumar K.V" 
> Cc: Punit Agrawal 
> Cc: Anshuman Khandual 
> ---
>  mm/hugetlb.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 129088710510..3de6326abf39 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3677,7 +3677,7 @@ int huge_add_to_page_cache(struct page *page, struct 
> address_space *mapping,
>  
>  static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  struct address_space *mapping, pgoff_t idx,
> -unsigned long address, pte_t *ptep, unsigned int 
> flags)
> +unsigned long faddress, pte_t *ptep, unsigned int 
> flags)
>  {
>   struct hstate *h = hstate_vma(vma);
>   int ret = VM_FAULT_SIGBUS;
> @@ -3686,6 +3686,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   struct page *page;
>   pte_t new_pte;
>   spinlock_t *ptl;
> + unsigned long address = faddress & huge_page_mask(h);

faddress? I would rather keep it as 'address' and rename the masked-out variable to
'haddr'. We use 'haddr' for this purpose in other places.

-- 
 Kirill A. Shutemov


Re: [PATCH -mm] mm, hugetlb: Pass fault address to no page handler

2018-05-15 Thread David Rientjes
On Tue, 15 May 2018, Huang, Ying wrote:

> From: Huang Ying 
> 
> This is to take better advantage of huge page clearing
> optimization (c79b57e462b5d, "mm: hugetlb: clear target sub-page last
> when clearing huge page").  Which will clear to access sub-page last
> to avoid the cache lines of to access sub-page to be evicted when
> clearing other sub-pages.  This needs to get the address of the
> sub-page to access, that is, the fault address inside of the huge
> page.  So the hugetlb no page fault handler is changed to pass that
> information.  This will benefit workloads which don't access the begin
> of the huge page after page fault.
> 
> With this patch, the throughput increases ~28.1% in vm-scalability
> anon-w-seq test case with 88 processes on a 2 socket Xeon E5 2699 v4
> system (44 cores, 88 threads).  The test case creates 88 processes,
> each process mmap a big anonymous memory area and writes to it from
> the end to the begin.  For each process, other processes could be seen
> as other workload which generates heavy cache pressure.  At the same
> time, the cache miss rate reduced from ~36.3% to ~25.6%, the
> IPC (instruction per cycle) increased from 0.3 to 0.37, and the time
> spent in user space is reduced ~19.3%
> 
> Signed-off-by: "Huang, Ying" 
> Cc: Andrea Arcangeli 
> Cc: "Kirill A. Shutemov" 
> Cc: Andi Kleen 
> Cc: Jan Kara 
> Cc: Michal Hocko 
> Cc: Matthew Wilcox 
> Cc: Hugh Dickins 
> Cc: Minchan Kim 
> Cc: Shaohua Li 
> Cc: Christopher Lameter 
> Cc: "Aneesh Kumar K.V" 
> Cc: Punit Agrawal 
> Cc: Anshuman Khandual 

Acked-by: David Rientjes 


Re: [PATCH -mm] mm, hugetlb: Pass fault address to no page handler

2018-05-14 Thread Mike Kravetz
On 05/14/2018 05:57 PM, Huang, Ying wrote:
> From: Huang Ying 
> 
> This is to take better advantage of huge page clearing
> optimization (c79b57e462b5d, "mm: hugetlb: clear target sub-page last
> when clearing huge page").  Which will clear to access sub-page last
> to avoid the cache lines of to access sub-page to be evicted when
> clearing other sub-pages.  This needs to get the address of the
> sub-page to access, that is, the fault address inside of the huge
> page.  So the hugetlb no page fault handler is changed to pass that
> information.  This will benefit workloads which don't access the begin
> of the huge page after page fault.
> 
> With this patch, the throughput increases ~28.1% in vm-scalability
> anon-w-seq test case with 88 processes on a 2 socket Xeon E5 2699 v4
> system (44 cores, 88 threads).  The test case creates 88 processes,
> each process mmap a big anonymous memory area and writes to it from
> the end to the begin.  For each process, other processes could be seen
> as other workload which generates heavy cache pressure.  At the same
> time, the cache miss rate reduced from ~36.3% to ~25.6%, the
> IPC (instruction per cycle) increased from 0.3 to 0.37, and the time
> spent in user space is reduced ~19.3%

Since this patch only addresses hugetlbfs huge pages, I would suggest
making that more explicit in the commit message.  Other than that, the
changes look fine to me.

> Signed-off-by: "Huang, Ying" 

Reviewed-by: Mike Kravetz 
-- 
Mike Kravetz

> Cc: Andrea Arcangeli 
> Cc: "Kirill A. Shutemov" 
> Cc: Andi Kleen 
> Cc: Jan Kara 
> Cc: Michal Hocko 
> Cc: Matthew Wilcox 
> Cc: Hugh Dickins 
> Cc: Minchan Kim 
> Cc: Shaohua Li 
> Cc: Christopher Lameter 
> Cc: "Aneesh Kumar K.V" 
> Cc: Punit Agrawal 
> Cc: Anshuman Khandual 
> ---
>  mm/hugetlb.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 129088710510..3de6326abf39 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3677,7 +3677,7 @@ int huge_add_to_page_cache(struct page *page, struct 
> address_space *mapping,
>  
>  static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  struct address_space *mapping, pgoff_t idx,
> -unsigned long address, pte_t *ptep, unsigned int 
> flags)
> +unsigned long faddress, pte_t *ptep, unsigned int 
> flags)
>  {
>   struct hstate *h = hstate_vma(vma);
>   int ret = VM_FAULT_SIGBUS;
> @@ -3686,6 +3686,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   struct page *page;
>   pte_t new_pte;
>   spinlock_t *ptl;
> + unsigned long address = faddress & huge_page_mask(h);
>  
>   /*
>* Currently, we are forced to kill the process in the event the
> @@ -3749,7 +3750,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   ret = VM_FAULT_SIGBUS;
>   goto out;
>   }
> - clear_huge_page(page, address, pages_per_huge_page(h));
> + clear_huge_page(page, faddress, pages_per_huge_page(h));
>   __SetPageUptodate(page);
>   set_page_huge_active(page);
>  
> @@ -3871,7 +3872,7 @@ u32 hugetlb_fault_mutex_hash(struct hstate *h, struct 
> mm_struct *mm,
>  #endif
>  
>  int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> - unsigned long address, unsigned int flags)
> + unsigned long faddress, unsigned int flags)
>  {
>   pte_t *ptep, entry;
>   spinlock_t *ptl;
> @@ -3883,8 +3884,7 @@ int hugetlb_fault(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   struct hstate *h = hstate_vma(vma);
>   struct address_space *mapping;
>   int need_wait_lock = 0;
> -
> - address &= huge_page_mask(h);
> + unsigned long address = faddress & huge_page_mask(h);
>  
>   ptep = huge_pte_offset(mm, address, huge_page_size(h));
>   if (ptep) {
> @@ -3914,7 +3914,7 @@ int hugetlb_fault(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>  
>   entry = huge_ptep_get(ptep);
>   if (huge_pte_none(entry)) {
> - ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, 
> flags);
> + ret = hugetlb_no_page(mm, vma, mapping, idx, faddress, ptep, 
> flags);
>   goto out_mutex;
>   }
>  
> 


[PATCH -mm] mm, hugetlb: Pass fault address to no page handler

2018-05-14 Thread Huang, Ying
From: Huang Ying 

This is to take better advantage of the huge page clearing
optimization (c79b57e462b5d, "mm: hugetlb: clear target sub-page last
when clearing huge page").  That optimization clears the sub-page that
is about to be accessed last, so that its cache lines are not evicted
while the other sub-pages are being cleared.  To do so, it needs the
address of the sub-page to be accessed, that is, the fault address
inside the huge page.  So the hugetlb no page fault handler is changed
to pass that information.  This will benefit workloads which don't
access the beginning of the huge page after the page fault.
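
As a rough user-space illustration of the idea above (not the kernel
code), the sketch below derives the huge page base and the target
sub-page index from a fault address, then zeroes every sub-page with
the target one last.  The sizes (4 KiB sub-pages, 2 MiB huge pages) and
the names target_subpage() and clear_target_last() are assumptions made
for this example only; the order in which the kernel's
clear_huge_page() clears the non-target sub-pages may differ.

```c
#include <stddef.h>
#include <string.h>

/* Assumed sizes for illustration: 4 KiB sub-pages, 2 MiB huge pages. */
#define SUBPAGE_SIZE   4096UL
#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)
#define HUGE_PAGE_MASK (~(HUGE_PAGE_SIZE - 1))
#define NR_SUBPAGES    (HUGE_PAGE_SIZE / SUBPAGE_SIZE)

/* Hypothetical helper: index of the sub-page that contains faddress. */
static size_t target_subpage(unsigned long faddress)
{
	/* Same masking as "faddress & huge_page_mask(h)" in the patch. */
	unsigned long base = faddress & HUGE_PAGE_MASK;

	return (faddress - base) / SUBPAGE_SIZE;
}

/*
 * Hypothetical helper: zero all sub-pages of buf, clearing the target
 * sub-page last so that its cache lines are the most recently touched.
 * If order is non-NULL, it records the sequence of cleared indices.
 */
static void clear_target_last(unsigned char *buf, size_t target, size_t *order)
{
	size_t n = 0;

	for (size_t i = 0; i < NR_SUBPAGES; i++) {
		if (i == target)
			continue;
		memset(buf + i * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
		if (order)
			order[n++] = i;
	}
	memset(buf + target * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
	if (order)
		order[n] = target;
}
```

If only the aligned base address were passed in, target_subpage() would
always return 0, which is why the handler needs the unaligned fault
address.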

With this patch, the throughput increases ~28.1% in the vm-scalability
anon-w-seq test case with 88 processes on a 2 socket Xeon E5 2699 v4
system (44 cores, 88 threads).  The test case creates 88 processes;
each process mmaps a big anonymous memory area and writes to it from
the end to the beginning.  For each process, the other processes can be
seen as another workload which generates heavy cache pressure.  At the
same time, the cache miss rate is reduced from ~36.3% to ~25.6%, the
IPC (instructions per cycle) increases from 0.3 to 0.37, and the time
spent in user space is reduced by ~19.3%.
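
The access pattern described above can be sketched in user space as
follows.  This is only an illustration, not the vm-scalability code:
the function name write_backwards() and its parameters are made up for
the example, and it maps plain anonymous memory, since how the test
was configured to use hugetlb pages is not stated here.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map len bytes of anonymous memory and write one
 * byte every step bytes, walking from the end of the area towards the
 * beginning.  Returns the number of writes, or -1 on error.
 */
static long write_backwards(size_t len, size_t step)
{
	unsigned char *p;
	long writes = 0;

	if (len == 0 || step == 0)
		return -1;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* The first touch of each page happens near its end, not its start. */
	for (size_t off = len; off > 0; off -= (off < step ? off : step)) {
		p[off - 1] = 1;
		writes++;
	}

	munmap(p, len);
	return writes;
}
```

With such a pattern the first fault in each huge page lands in its last
sub-page, so clearing the first sub-page last would keep the wrong
cache lines hot.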

Signed-off-by: "Huang, Ying" 
Cc: Andrea Arcangeli 
Cc: "Kirill A. Shutemov" 
Cc: Andi Kleen 
Cc: Jan Kara 
Cc: Michal Hocko 
Cc: Matthew Wilcox 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Shaohua Li 
Cc: Christopher Lameter 
Cc: "Aneesh Kumar K.V" 
Cc: Punit Agrawal 
Cc: Anshuman Khandual 
---
 mm/hugetlb.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 129088710510..3de6326abf39 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3677,7 +3677,7 @@ int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			   struct address_space *mapping, pgoff_t idx,
-			   unsigned long address, pte_t *ptep, unsigned int flags)
+			   unsigned long faddress, pte_t *ptep, unsigned int flags)
 {
 	struct hstate *h = hstate_vma(vma);
 	int ret = VM_FAULT_SIGBUS;
@@ -3686,6 +3686,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *page;
 	pte_t new_pte;
 	spinlock_t *ptl;
+	unsigned long address = faddress & huge_page_mask(h);
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -3749,7 +3750,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
-		clear_huge_page(page, address, pages_per_huge_page(h));
+		clear_huge_page(page, faddress, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 		set_page_huge_active(page);
 
@@ -3871,7 +3872,7 @@ u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
 #endif
 
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, unsigned int flags)
+			unsigned long faddress, unsigned int flags)
 {
 	pte_t *ptep, entry;
 	spinlock_t *ptl;
@@ -3883,8 +3884,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct hstate *h = hstate_vma(vma);
 	struct address_space *mapping;
 	int need_wait_lock = 0;
-
-	address &= huge_page_mask(h);
+	unsigned long address = faddress & huge_page_mask(h);
 
 	ptep = huge_pte_offset(mm, address, huge_page_size(h));
 	if (ptep) {
@@ -3914,7 +3914,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	entry = huge_ptep_get(ptep);
 	if (huge_pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
+		ret = hugetlb_no_page(mm, vma, mapping, idx, faddress, ptep, flags);
 		goto out_mutex;
 	}
 
-- 
2.16.1


