Re: INFO: rcu detected stall in nsim_fib_event_work

2024-02-03 Thread Hillf Danton
On Sat, 03 Feb 2024 14:16:16 +0800 Ubisectech Sirius 
> Hello.
> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec.
> Recently, our team has discovered an issue in Linux kernel 
> 6.8.0-rc2-g6764c317b6bb.
> Attached to the email is a PoC file for the issue.

Could you test whether the fix [1], which syzbot tested a while ago, works for you?

[1] https://lore.kernel.org/lkml/c041c206103c2...@google.com/



Re: [PATCH for-next v3 1/7] RDMA/rxe: Convert triple tasklets to use workqueue

2022-12-23 Thread Hillf Danton
On 23 Dec 2022 15:51:52 +0900 Daisuke Matsuda 
> @@ -137,15 +153,27 @@ void rxe_sched_task(struct rxe_task *task)
>   if (task->destroyed)
>   return;
>  
> - tasklet_schedule(&task->tasklet);
> + /*
> +  * busy-loop while qp reset is in progress.
> +  * This may be called from softirq context and thus cannot sleep.
> +  */
> + while (atomic_read(&task->suspended))
> + cpu_relax();
> +
> + queue_work(task->workq, &task->work);
>  }

This busy wait, particularly in softirq context, hardly makes sense given
the flush_workqueue() below.
>  
>  void rxe_disable_task(struct rxe_task *task)
>  {
> - tasklet_disable(&task->tasklet);
> + /* Alternative to tasklet_disable() */
> + atomic_inc(&task->suspended);
> + smp_mb__after_atomic();
> + flush_workqueue(task->workq);
>  }
>  
>  void rxe_enable_task(struct rxe_task *task)
>  {
> - tasklet_enable(&task->tasklet);
> + /* Alternative to tasklet_enable() */
> + smp_mb__before_atomic();
> + atomic_dec(&task->suspended);
>  }

Feel free to add a one-line comment explaining why the smp_mb is needed in
each case.
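
For what it is worth, a minimal sketch (not the actual rxe code) of the
kind of one-line comments being asked for, on top of the quoted hunk:

	void rxe_disable_task(struct rxe_task *task)
	{
		atomic_inc(&task->suspended);
		/* intended to make the increment visible before the flush starts */
		smp_mb__after_atomic();
		flush_workqueue(task->workq);
	}

	void rxe_enable_task(struct rxe_task *task)
	{
		/* intended to order prior teardown work before clearing "suspended" */
		smp_mb__before_atomic();
		atomic_dec(&task->suspended);
	}

Whether those barriers actually buy anything is a separate question, see
the busy-wait remark above.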



Re: [PATCH 05/12] dma-buf: add explicit buffer pinning

2019-05-25 Thread Hillf Danton

On Tue, 16 Apr 2019 20:38:34 +0200 Christian König wrote:
> + /**
> +  * @unpin_dma_buf:
> +  *
> +  * This is called by dma_buf_unpin and lets the exporter know that an
> +  * importer doesn't need to the DMA-buf to stay were it is any more.
> +  *
s/need to/need/  s/were/where/

> +  * This is called with the dmabuf->resv object locked.
> +  *
> +  * This callback is optional.
> +  */
> + void (*unpin)(struct dma_buf *);
> +
BR
Hillf


Re: [PATCH 06/12] drm: remove prime sg_table caching

2019-05-25 Thread Hillf Danton

On Tue, 16 Apr 2019 20:38:35 +0200 Christian König wrote:
> @@ -331,14 +282,19 @@ EXPORT_SYMBOL(drm_gem_map_dma_buf);
>   * @sgt: scatterlist info of the buffer to unmap
>   * @dir: direction of DMA transfer
>   *
> - * Not implemented. The unmap is done at drm_gem_map_detach().  This can be
> - * used as the &dma_buf_ops.unmap_dma_buf callback.
> + * This can be used as the &dma_buf_ops.unmap_dma_buf callback.
>   */
>  void drm_gem_unmap_dma_buf(struct dma_buf_attachment *attach,
>  struct sg_table *sgt,
>  enum dma_data_direction dir)
>  {
> - /* nothing to be done here */
> + if (!sgt)

if (WARN_ON(!sgt))   ?
> + return;
> +
> + dma_unmap_sg_attrs(attach->dev, sgt->sgl, sgt->nents, dir,
> +DMA_ATTR_SKIP_CPU_SYNC);
> + sg_free_table(sgt);
> + kfree(sgt);
>  }
>  EXPORT_SYMBOL(drm_gem_unmap_dma_buf);
>  
> -- 
> 2.17.1
> 
BR
Hillf


Re: [PATCH 03/12] dma-buf: lock the reservation object during (un)map_dma_buf v3

2019-05-25 Thread Hillf Danton

On Tue, 16 Apr 2019 20:38:32 +0200 Christian König wrote:
> @@ -688,9 +689,9 @@ struct sg_table *dma_buf_map_attachment(struct 
> dma_buf_attachment *attach,
>   if (attach->sgt)
>   return attach->sgt;
>  
> - sg_table = attach->dmabuf->ops->map_dma_buf(attach, direction);
> - if (!sg_table)
> - sg_table = ERR_PTR(-ENOMEM);
> + reservation_object_lock(attach->dmabuf->resv, NULL);
> + sg_table = dma_buf_map_attachment_locked(attach, direction);
> + reservation_object_unlock(attach->dmabuf->resv);
>  
It would look better if sg_table were checked after mapping, with error
info fed back in case anything unusual happens; a quick sketch follows the
quoted hunk.

>   return sg_table;
>  }
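
One way to read the suggestion, as a sketch on top of the quoted hunk
(the pr_err() is only an example of feeding the error back):

	reservation_object_lock(attach->dmabuf->resv, NULL);
	sg_table = dma_buf_map_attachment_locked(attach, direction);
	reservation_object_unlock(attach->dmabuf->resv);

	/* report the failure instead of handing it back silently */
	if (IS_ERR(sg_table))
		pr_err("dma-buf: mapping attachment failed: %ld\n",
		       PTR_ERR(sg_table));

	return sg_table;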

Best Regards
Hillf


Re: [PATCH 04/12] dma-buf: add optional invalidate_mappings callback v5

2019-05-25 Thread Hillf Danton

On Tue, 16 Apr 2019 20:38:33 +0200 Christian König wrote:
> Each importer can now provide an invalidate_mappings callback.
> 
> This allows the exporter to provide the mappings without the need to pin
> the backing store.
> 
> v2: don't try to invalidate mappings when the callback is NULL,
> lock the reservation obj while using the attachments,
> add helper to set the callback
> v3: move flag for invalidation support into the DMA-buf,
> use new attach_info structure to set the callback
> v4: use importer_priv field instead of mangling exporter priv.
> v5: drop invalidation_supported flag
> 
> Signed-off-by: Christian König 
> ---
>  drivers/dma-buf/dma-buf.c | 37 +
>  include/linux/dma-buf.h   | 33 +++--
>  2 files changed, 68 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 83c92bfd964c..a3738fab3927 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -563,6 +563,8 @@ struct dma_buf_attachment *dma_buf_attach(const struct 
> dma_buf_attach_info *info
>  
>   attach->dev = info->dev;
>   attach->dmabuf = dmabuf;
> + attach->importer_priv = info->importer_priv;
> + attach->invalidate = info->invalidate;
>  
>   mutex_lock(&dmabuf->lock);
>  
> @@ -571,7 +573,9 @@ struct dma_buf_attachment *dma_buf_attach(const struct 
> dma_buf_attach_info *info
>   if (ret)
>   goto err_attach;
>   }
> + reservation_object_lock(dmabuf->resv, NULL);
> + list_add(&attach->node, &dmabuf->attachments);
> + reservation_object_unlock(dmabuf->resv);
>  
OK, protection from resv lock is needed for attach.

>   mutex_unlock(&dmabuf->lock);
>  
The snippet in [PATCH 01/12] dma-buf: add dynamic caching of sg_table
is copied and pasted below:

@@ -573,6 +573,20 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf 
*dmabuf,
list_add(&attach->node, &dmabuf->attachments);

mutex_unlock(&dmabuf->lock);
+
+   if (!dmabuf->ops->dynamic_sgt_mapping) {
+   struct sg_table *sgt;
+
+   sgt = dmabuf->ops->map_dma_buf(attach, DMA_BIDIRECTIONAL);
+   if (!sgt)
+   sgt = ERR_PTR(-ENOMEM);
+   if (IS_ERR(sgt)) {
+   dma_buf_detach(dmabuf, attach);
+   return ERR_CAST(sgt);
+   }
+   attach->sgt = sgt;

Looks like the protection mentioned is also needed in this case.
+   }
+
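
If I read that correctly, the non-dynamic path from patch 01 would then
need something along these lines (sketch, untested):

	if (!dmabuf->ops->dynamic_sgt_mapping) {
		struct sg_table *sgt;

		reservation_object_lock(dmabuf->resv, NULL);
		sgt = dmabuf->ops->map_dma_buf(attach, DMA_BIDIRECTIONAL);
		reservation_object_unlock(dmabuf->resv);
		if (!sgt)
			sgt = ERR_PTR(-ENOMEM);
		if (IS_ERR(sgt)) {
			dma_buf_detach(dmabuf, attach);
			return ERR_CAST(sgt);
		}
		attach->sgt = sgt;
	}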
[...]
> +/**
> + * dma_buf_invalidate_mappings - invalidate all mappings of this dma_buf
> + *
> + * @dmabuf:  [in]buffer which mappings should be invalidated
> + *
> + * Informs all attachmenst that they need to destroy and recreated all their

s/Informs/Inform/ s/attachmenst/attachments/ s/recreated/recreate/

> + * mappings.
> + */
> +void dma_buf_invalidate_mappings(struct dma_buf *dmabuf)
> +{
> + struct dma_buf_attachment *attach;
> +
> + reservation_object_assert_held(dmabuf->resv);
> +
> + list_for_each_entry(attach, &dmabuf->attachments, node)
> + if (attach->invalidate)
> + attach->invalidate(attach);
> +}
> +EXPORT_SYMBOL_GPL(dma_buf_invalidate_mappings);
> +
> 
BR
Hillf


Re: [PATCH 02/12] dma-buf: add dma_buf_(un)map_attachment_locked variants v3

2019-05-25 Thread Hillf Danton

On Tue, 16 Apr 2019 20:38:31 +0200 Christian König wrote:
> Add function variants which can be called with the reservation lock
> already held.
> 
> v2: reordered, add lockdep asserts, fix kerneldoc
> v3: rebased on sgt caching
> 
> Signed-off-by: Christian König 
> ---
>  drivers/dma-buf/dma-buf.c | 63 +++
>  include/linux/dma-buf.h   |  5 
>  2 files changed, 68 insertions(+)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 65161a82d4d5..ef480e5fb239 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -623,6 +623,43 @@ void dma_buf_detach(struct dma_buf *dmabuf, struct 
> dma_buf_attachment *attach)
>  }
>  EXPORT_SYMBOL_GPL(dma_buf_detach);
>  
> +/**
> + * dma_buf_map_attachment_locked - Maps the buffer into _device_ address 
> space
> + * with the reservation lock held. Is a wrapper for map_dma_buf() of the

Something is missing; seems to be s/of the/of the dma_buf_ops./
> + *
> + * Returns the scatterlist table of the attachment;
> + * dma_buf_ops.

Oh it is sitting here!

> + * @attach:  [in]attachment whose scatterlist is to be returned
> + * @direction:   [in]direction of DMA transfer
> + *
> + * Returns sg_table containing the scatterlist to be returned; returns 
> ERR_PTR
> + * on error. May return -EINTR if it is interrupted by a signal.
> + *
EINTR looks impossible in the code.

> + * A mapping must be unmapped by using dma_buf_unmap_attachment_locked(). 
> Note
> + * that the underlying backing storage is pinned for as long as a mapping
> + * exists, therefore users/importers should not hold onto a mapping for undue
> + * amounts of time.
> + */
> +struct sg_table *
> +dma_buf_map_attachment_locked(struct dma_buf_attachment *attach,
> +   enum dma_data_direction direction)
> +{
> + struct sg_table *sg_table;
> +
> + might_sleep();
> + reservation_object_assert_held(attach->dmabuf->resv);
> +
> + if (attach->sgt)
> + return attach->sgt;
> +
> + sg_table = attach->dmabuf->ops->map_dma_buf(attach, direction);
> + if (!sg_table)
> + sg_table = ERR_PTR(-ENOMEM);
> +
> + return sg_table;
> +}
> +EXPORT_SYMBOL_GPL(dma_buf_map_attachment_locked);
> +
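
For illustration, a hypothetical caller of the locked variants (the
unmap variant being the one added by this same patch):

	struct sg_table *sgt;

	reservation_object_lock(dmabuf->resv, NULL);

	sgt = dma_buf_map_attachment_locked(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt)) {
		reservation_object_unlock(dmabuf->resv);
		return PTR_ERR(sgt);
	}

	/* ... use the mapping while the reservation lock is held ... */

	dma_buf_unmap_attachment_locked(attach, sgt, DMA_BIDIRECTIONAL);
	reservation_object_unlock(dmabuf->resv);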
Best Regards
Hillf



Re: [RFC 5/6] mm, cpuset: always use seqlock when changing task's nodemask

2017-04-12 Thread Hillf Danton
On April 11, 2017 10:06 PM Vlastimil Babka wrote: 
> 
>  static void cpuset_change_task_nodemask(struct task_struct *tsk,
>   nodemask_t *newmems)
>  {
> - bool need_loop;
> -
>   task_lock(tsk);
> - /*
> -  * Determine if a loop is necessary if another thread is doing
> -  * read_mems_allowed_begin().  If at least one node remains unchanged 
> and
> -  * tsk does not have a mempolicy, then an empty nodemask will not be
> -  * possible when mems_allowed is larger than a word.
> -  */
> - need_loop = task_has_mempolicy(tsk) ||
> - !nodes_intersects(*newmems, tsk->mems_allowed);
> 
> - if (need_loop) {
> - local_irq_disable();
> - write_seqcount_begin(&tsk->mems_allowed_seq);
> - }
> + local_irq_disable();
> + write_seqcount_begin(&tsk->mems_allowed_seq);
> 
> - nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
>   mpol_rebind_task(tsk, newmems);
>   tsk->mems_allowed = *newmems;
> 
> - if (need_loop) {
> - write_seqcount_end(&tsk->mems_allowed_seq);
> - local_irq_enable();
> - }
> + write_seqcount_end(&tsk->mems_allowed_seq);
> 
I doubt IRQs are ever enabled again (a sketch of what I would expect
follows the quoted hunk).

>   task_unlock(tsk);
>  }
> --
> 2.12.2
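
What I would expect the write side to look like with IRQs re-enabled
(sketch only; local_irq_enable() may simply sit outside the quoted hunk):

	task_lock(tsk);

	local_irq_disable();
	write_seqcount_begin(&tsk->mems_allowed_seq);

	mpol_rebind_task(tsk, newmems);
	tsk->mems_allowed = *newmems;

	write_seqcount_end(&tsk->mems_allowed_seq);
	local_irq_enable();

	task_unlock(tsk);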



Re: [PATCH] hugetlbfs: fix offset overflow in huegtlbfs mmap

2017-04-11 Thread Hillf Danton
On April 12, 2017 6:52 AM Mike Kravetz wrote: 
> 
> If mmap() maps a file, it can be passed an offset into the file at
> which the mapping is to start.  Offset could be a negative value when
> represented as a loff_t.  The offset plus length will be used to
> update the file size (i_size) which is also a loff_t.  Validate the
> value of offset and offset + length to make sure they do not overflow
> and appear as negative.
> 
> Found by syzcaller with commit ff8c0c53c475 ("mm/hugetlb.c: don't call
> region_abort if region_chg fails") applied.  Prior to this commit, the
> overflow would still occur but we would luckily return ENOMEM.
> To reproduce:
> mmap(0, 0x2000, 0, 0x40021, 0xULL, 0x8000ULL);
> 
> Resulted in,
> kernel BUG at mm/hugetlb.c:742!
> Call Trace:
>  hugetlbfs_evict_inode+0x80/0xa0
>  ? hugetlbfs_setattr+0x3c0/0x3c0
>  evict+0x24a/0x620
>  iput+0x48f/0x8c0
>  dentry_unlink_inode+0x31f/0x4d0
>  __dentry_kill+0x292/0x5e0
>  dput+0x730/0x830
>  __fput+0x438/0x720
>  fput+0x1a/0x20
>  task_work_run+0xfe/0x180
>  exit_to_usermode_loop+0x133/0x150
>  syscall_return_slowpath+0x184/0x1c0
>  entry_SYSCALL_64_fastpath+0xab/0xad
> 
> Reported-by: Vegard Nossum <vegard.nos...@gmail.com>
> Signed-off-by: Mike Kravetz <mike.krav...@oracle.com>
> ---

Acked-by: Hillf Danton <hillf...@alibaba-inc.com>
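
For reference, the kind of check described above, as a minimal sketch
(not necessarily the exact hunk of the patch):

	loff_t vma_len = (loff_t)(vma->vm_end - vma->vm_start);
	loff_t len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);

	/* offset + length must not wrap and appear negative */
	if (len < vma_len)
		return -EINVAL;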




Re: NULL pointer dereference in the kernel 3.10

2017-04-10 Thread Hillf Danton
On April 10, 2017 5:54 PM Xishi Qiu wrote: 
> On 2017/4/10 17:37, Hillf Danton wrote:
> 
> > On April 10, 2017 4:57 PM Xishi Qiu wrote:
> >> On 2017/4/10 14:42, Hillf Danton wrote:
> >>
> >>> On April 08, 2017 9:40 PM zhong Jiang wrote:
> >>>>
> >>>> when runing the stabile docker cases in the vm.   The following issue 
> >>>> will come up.
> >>>>
> >>>> #40 [8801b57ffb30] async_page_fault at 8165c9f8
> >>>> [exception RIP: down_read_trylock+5]
> >>>> RIP: 810aca65  RSP: 8801b57ffbe8  RFLAGS: 00010202
> >>>> RAX:   RBX: 88018ae858c1  RCX: 
> >>>> RDX:   RSI:   RDI: 0008
> >>>> RBP: 8801b57ffc10   R8: ea0006903de0   R9: 8800b3c61810
> >>>> R10: 22cb  R11:   R12: 88018ae858c0
> >>>> R13: ea0006903dc0  R14: 0008  R15: ea0006903dc0
> >>>> ORIG_RAX:   CS: 0010  SS: 
> >>>> #41 [8801b57ffbe8] page_lock_anon_vma_read at 811b241c
> >>>> #42 [8801b57ffc18] page_referenced at 811b26a7
> >>>> #43 [8801b57ffc90] shrink_active_list at 8118d634
> >>>> #44 [8801b57ffd48] balance_pgdat at 8118f088
> >>>> #45 [8801b57ffe20] kswapd at 8118f633
> >>>> #46 [8801b57ffec8] kthread at 810a795f
> >>>> #47 [8801b57fff50] ret_from_fork at 81665398
> >>>> crash> struct page.mapping ea0006903dc0
> >>>>   mapping = 0x88018ae858c1
> >>>> crash> struct anon_vma 0x88018ae858c0
> >>>> struct anon_vma {
> >>>>   root = 0x0,
> >>>>   rwsem = {
> >>>> count = 0,
> >>>> wait_lock = {
> >>>>   raw_lock = {
> >>>> {
> >>>>   head_tail = 1,
> >>>>   tickets = {
> >>>> head = 1,
> >>>> tail = 0
> >>>>   }
> >>>> }
> >>>>   }
> >>>> },
> >>>> wait_list = {
> >>>>   next = 0x0,
> >>>>   prev = 0x0
> >>>> }
> >>>>   },
> >>>>   refcount = {
> >>>> counter = 0
> >>>>   },
> >>>>   rb_root = {
> >>>> rb_node = 0x0
> >>>>   }
> >>>> }
> >>>>
> >>>> This maks me wonder,  the anon_vma do not come from slab structure.
> >>>> and the content is abnormal. IMO,  At least anon_vma->root will not NULL.
> >>>> The issue can be reproduced every other week.
> >>>>
> >>> Check please if commit
> >>> 624483f3ea8 ("mm: rmap: fix use-after-free in __put_anon_vma")
> >>> is included in the 3.10 you are running.
> >>>
> >> We missed this patch in RHEL 7.2
> >> Could you please give more details for how it triggered?
> >
> > Sorry, I could not.
> > I guess it is UAF as described in the log of that commit.
> > And if it works for you, we know how.
> >
> > Hillf
> >
> 
> __put_anon_vma|   page_lock_anon_vma_read
>   anon_vma_free(root) |
>   | root_anon_vma = ACCESS_ONCE(anon_vma->root)
>   | down_read_trylock(_anon_vma->rwsem)
>   anon_vma_free(anon_vma) |
> 
> I find anon_vma was created by SLAB_DESTROY_BY_RCU, so it will not merge
> by other slabs, and free_slab() will not free it during 
> page_lock_anon_vma_read(),
> because it holds rcu_read_lock(), right?
> 
Frankly I don't know, Sir; you know I am not an rmap expert like you.
Quite probably I made a wrong guess, and sorry again.

> If root_anon_vma was reuse by someone, why "crash> struct anon_vma"
> shows almost zero?
> 
thank you very much
Hillf



Re: NULL pointer dereference in the kernel 3.10

2017-04-10 Thread Hillf Danton
On April 10, 2017 4:57 PM Xishi Qiu wrote: 
> On 2017/4/10 14:42, Hillf Danton wrote:
> 
> > On April 08, 2017 9:40 PM zhong Jiang wrote:
> >>
> >> when runing the stabile docker cases in the vm.   The following issue will 
> >> come up.
> >>
> >> #40 [8801b57ffb30] async_page_fault at 8165c9f8
> >> [exception RIP: down_read_trylock+5]
> >> RIP: 810aca65  RSP: 8801b57ffbe8  RFLAGS: 00010202
> >> RAX:   RBX: 88018ae858c1  RCX: 
> >> RDX:   RSI:   RDI: 0008
> >> RBP: 8801b57ffc10   R8: ea0006903de0   R9: 8800b3c61810
> >> R10: 22cb  R11:   R12: 88018ae858c0
> >> R13: ea0006903dc0  R14: 0008  R15: ea0006903dc0
> >> ORIG_RAX:   CS: 0010  SS: 
> >> #41 [8801b57ffbe8] page_lock_anon_vma_read at 811b241c
> >> #42 [8801b57ffc18] page_referenced at 811b26a7
> >> #43 [8801b57ffc90] shrink_active_list at 8118d634
> >> #44 [8801b57ffd48] balance_pgdat at 8118f088
> >> #45 [8801b57ffe20] kswapd at 8118f633
> >> #46 [8801b57ffec8] kthread at 810a795f
> >> #47 [8801b57fff50] ret_from_fork at 81665398
> >> crash> struct page.mapping ea0006903dc0
> >>   mapping = 0x88018ae858c1
> >> crash> struct anon_vma 0x88018ae858c0
> >> struct anon_vma {
> >>   root = 0x0,
> >>   rwsem = {
> >> count = 0,
> >> wait_lock = {
> >>   raw_lock = {
> >> {
> >>   head_tail = 1,
> >>   tickets = {
> >> head = 1,
> >> tail = 0
> >>   }
> >> }
> >>   }
> >> },
> >> wait_list = {
> >>   next = 0x0,
> >>   prev = 0x0
> >> }
> >>   },
> >>   refcount = {
> >> counter = 0
> >>   },
> >>   rb_root = {
> >> rb_node = 0x0
> >>   }
> >> }
> >>
> >> This maks me wonder,  the anon_vma do not come from slab structure.
> >> and the content is abnormal. IMO,  At least anon_vma->root will not NULL.
> >> The issue can be reproduced every other week.
> >>
> > Check please if commit
> > 624483f3ea8 ("mm: rmap: fix use-after-free in __put_anon_vma")
> > is included in the 3.10 you are running.
> >
> We missed this patch in RHEL 7.2
> Could you please give more details for how it triggered?

Sorry, I could not. 
I guess it is UAF as described in the log of that commit.
And if it works for you, we know how.

Hillf





Re: NULL pointer dereference in the kernel 3.10

2017-04-10 Thread Hillf Danton
On April 08, 2017 9:40 PM zhong Jiang wrote: 
> 
> when runing the stabile docker cases in the vm.   The following issue will 
> come up.
> 
> #40 [8801b57ffb30] async_page_fault at 8165c9f8
> [exception RIP: down_read_trylock+5]
> RIP: 810aca65  RSP: 8801b57ffbe8  RFLAGS: 00010202
> RAX:   RBX: 88018ae858c1  RCX: 
> RDX:   RSI:   RDI: 0008
> RBP: 8801b57ffc10   R8: ea0006903de0   R9: 8800b3c61810
> R10: 22cb  R11:   R12: 88018ae858c0
> R13: ea0006903dc0  R14: 0008  R15: ea0006903dc0
> ORIG_RAX:   CS: 0010  SS: 
> #41 [8801b57ffbe8] page_lock_anon_vma_read at 811b241c
> #42 [8801b57ffc18] page_referenced at 811b26a7
> #43 [8801b57ffc90] shrink_active_list at 8118d634
> #44 [8801b57ffd48] balance_pgdat at 8118f088
> #45 [8801b57ffe20] kswapd at 8118f633
> #46 [8801b57ffec8] kthread at 810a795f
> #47 [8801b57fff50] ret_from_fork at 81665398
> crash> struct page.mapping ea0006903dc0
>   mapping = 0x88018ae858c1
> crash> struct anon_vma 0x88018ae858c0
> struct anon_vma {
>   root = 0x0,
>   rwsem = {
> count = 0,
> wait_lock = {
>   raw_lock = {
> {
>   head_tail = 1,
>   tickets = {
> head = 1,
> tail = 0
>   }
> }
>   }
> },
> wait_list = {
>   next = 0x0,
>   prev = 0x0
> }
>   },
>   refcount = {
> counter = 0
>   },
>   rb_root = {
> rb_node = 0x0
>   }
> }
> 
> This maks me wonder,  the anon_vma do not come from slab structure.
> and the content is abnormal. IMO,  At least anon_vma->root will not NULL.
> The issue can be reproduced every other week.
> 
Check please if commit 
624483f3ea8 ("mm: rmap: fix use-after-free in __put_anon_vma")
is included in the 3.10 you are running.

btw, why not run the mainline?

Hillf




Re: [PATCH] Documentation: vm, add hugetlbfs reservation overview

2017-04-09 Thread Hillf Danton

On April 08, 2017 1:43 AM Mike Kravetz wrote: 
> 
> Adding a brief overview of hugetlbfs reservation design and implementation
> as an aid to those making code modifications in this area.
> 
> Signed-off-by: Mike Kravetz <mike.krav...@oracle.com>
> ---
You are doing more than I could ask for; double thanks, Mike:)

Acked-by: Hillf Danton <hillf...@alibaba-inc.com>




Re: [PATCH 2/4] mm: introduce memalloc_noreclaim_{save,restore}

2017-04-07 Thread Hillf Danton

On April 05, 2017 3:47 PM Vlastimil Babka wrote: 
> 
> The previous patch has shown that simply setting and clearing PF_MEMALLOC in
> current->flags can result in wrongly clearing a pre-existing PF_MEMALLOC flag
> and potentially lead to recursive reclaim. Let's introduce helpers that 
> support
> proper nesting by saving the previous state of the flag, similar to the 
> existing
> memalloc_noio_* and memalloc_nofs_* helpers. Convert existing setting/clearing
> of PF_MEMALLOC within mm to the new helpers.
> 
> There are no known issues with the converted code, but the change makes it 
> more
> robust.
> 
> Suggested-by: Michal Hocko <mho...@suse.com>
> Signed-off-by: Vlastimil Babka <vba...@suse.cz>
> ---

Acked-by: Hillf Danton <hillf...@alibaba-inc.com>
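
The helpers described above would presumably look much like the existing
memalloc_noio_*/memalloc_nofs_* ones, roughly:

	static inline unsigned int memalloc_noreclaim_save(void)
	{
		unsigned int flags = current->flags & PF_MEMALLOC;

		current->flags |= PF_MEMALLOC;
		/* remember whether PF_MEMALLOC was already set */
		return flags;
	}

	static inline void memalloc_noreclaim_restore(unsigned int flags)
	{
		/* restore only the saved bit, never clear a pre-existing one */
		current->flags = (current->flags & ~PF_MEMALLOC) | flags;
	}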



Re: [PATCH 1/4] mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC

2017-04-07 Thread Hillf Danton
On April 05, 2017 3:47 PM Vlastimil Babka wrote: 
> 
> The function __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent
> deadlock during page migration by lock_page() (see the comment in
> __unmap_and_move()). Then it unconditionally clears the flag, which can clear 
> a
> pre-existing PF_MEMALLOC flag and result in recursive reclaim. This was not a
> problem until commit a8161d1ed609 ("mm, page_alloc: restructure direct
> compaction handling in slowpath"), because direct compaction was called only
> after direct reclaim, which was skipped when PF_MEMALLOC flag was set.
> 
> Even now it's only a theoretical issue, as the new callsite of
> __alloc_pages_direct_compact() is reached only for costly orders and when
> gfp_pfmemalloc_allowed() is true, which means either __GFP_NOMEMALLOC is in
> gfp_flags or in_interrupt() is true. There is no such known context, but let's
> play it safe and make __alloc_pages_direct_compact() robust for cases where
> PF_MEMALLOC is already set.
> 
> Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling 
> in slowpath")
> Reported-by: Andrey Ryabinin <aryabi...@virtuozzo.com>
> Signed-off-by: Vlastimil Babka <vba...@suse.cz>
> Cc: <sta...@vger.kernel.org>
> ---
Acked-by: Hillf Danton <hillf...@alibaba-inc.com>

>  mm/page_alloc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589f8be53be..b84e6ffbe756 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3288,6 +3288,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned 
> int order,
>   enum compact_priority prio, enum compact_result *compact_result)
>  {
>   struct page *page;
> + unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;
> 
>   if (!order)
>   return NULL;
> @@ -3295,7 +3296,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned 
> int order,
>   current->flags |= PF_MEMALLOC;
>   *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>   prio);
> - current->flags &= ~PF_MEMALLOC;
> + current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;
> 
>   if (*compact_result <= COMPACT_INACTIVE)
>   return NULL;
> --
> 2.12.2



Re: [PATCH 3/6] mm: remove return value from init_currently_empty_zone

2017-03-31 Thread Hillf Danton
On March 31, 2017 2:49 PM Michal Hocko wrote: 
> On Fri 31-03-17 11:49:49, Hillf Danton wrote:
> [...]
> > > -/* Can fail with -ENOMEM from allocating a wait table with vmalloc() or
> > > - * alloc_bootmem_node_nopanic()/memblock_virt_alloc_node_nopanic() */
> > > -static int __ref ensure_zone_is_initialized(struct zone *zone,
> > > +static void __ref ensure_zone_is_initialized(struct zone *zone,
> > >   unsigned long start_pfn, unsigned long num_pages)
> > >  {
> > > - if (zone_is_empty(zone))
> > > - return init_currently_empty_zone(zone, start_pfn, num_pages);
> > > -
> > > - return 0;
> > > + if (!zone_is_empty(zone))
> > > + init_currently_empty_zone(zone, start_pfn, num_pages);
> > >  }
> > Semantic change added?
> 
> could you be more specific?

Well, I'm wondering why you are trying to initialize a nonempty zone.

Hillf



Re: [PATCH 5/6] mm, memory_hotplug: do not associate hotadded memory to zones until online

2017-03-31 Thread Hillf Danton

On March 30, 2017 7:55 PM Michal Hocko wrote:
> 
> +static void __meminit resize_zone_range(struct zone *zone, unsigned long 
> start_pfn,
> + unsigned long nr_pages)
> +{
> + unsigned long old_end_pfn = zone_end_pfn(zone);
> +
> + if (start_pfn < zone->zone_start_pfn)
> + zone->zone_start_pfn = start_pfn;
> +
> + zone->spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - 
> zone->zone_start_pfn;
> +}
The implementation above implies the zone can only grow.
Can we shrink the zone with the given data?
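
To make the growth-only behaviour concrete (numbers invented): if a zone
spans pfns [1000, 2000) and pfns [500, 600) are hot-added, zone_start_pfn
becomes 500 and spanned_pages becomes max(600, 2000) - 500 = 1500; no
combination of inputs can move the start up or shrink the span.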

btw,  this mail address, Zhang Zhen  , is not 
reachable. 

Hillf



Re: [PATCH 3/6] mm: remove return value from init_currently_empty_zone

2017-03-30 Thread Hillf Danton
On March 30, 2017 7:55 PM Michal Hocko wrote: 
> 
> From: Michal Hocko 
> 
> init_currently_empty_zone doesn't have any error to return yet it is
> still an int and callers try to be defensive and try to handle potential
> error. Remove this nonsense and simplify all callers.
> 
Is it already cut off in 1/6 of this series?



> -/* Can fail with -ENOMEM from allocating a wait table with vmalloc() or
> - * alloc_bootmem_node_nopanic()/memblock_virt_alloc_node_nopanic() */
> -static int __ref ensure_zone_is_initialized(struct zone *zone,
> +static void __ref ensure_zone_is_initialized(struct zone *zone,
>   unsigned long start_pfn, unsigned long num_pages)
>  {
> - if (zone_is_empty(zone))
> - return init_currently_empty_zone(zone, start_pfn, num_pages);
> -
> - return 0;
> + if (!zone_is_empty(zone))
> + init_currently_empty_zone(zone, start_pfn, num_pages);
>  }
Semantic change added?

Hillf



Re: [PATCH 1/6] mm: get rid of zone_is_initialized

2017-03-30 Thread Hillf Danton

On March 30, 2017 7:55 PM Michal Hocko wrote:
> 
> @@ -5535,9 +5535,6 @@ int __meminit init_currently_empty_zone(struct zone 
> *zone,
>   zone_start_pfn, (zone_start_pfn + size));
> 
>   zone_init_free_lists(zone);
> - zone->initialized = 1;
> -
> - return 0;
>  }
Nit: Add changes more than correct.

Hillf



Re: [PATCH] shmem: fix __shmem_file_setup error path leaks

2017-03-27 Thread Hillf Danton

On March 28, 2017 1:06 AM Vito Caputo wrote:
> 
> The existing path and memory cleanups appear to be in reverse order, and
> there's no iput() potentially leaking the inode in the last two error gotos.
> 
> Also make put_memory shmem_unacct_size() conditional on !inode since if we
> entered cleanup at put_inode, shmem_evict_inode() occurs via
> iput()->iput_final(), which performs the shmem_unacct_size() for us.
> 
> Signed-off-by: Vito Caputo 
> ---
> 
> This caught my eye while looking through the memfd_create() implementation.
> Included patch was compile tested only...
> 
>  mm/shmem.c | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e67d6ba..a1a84eaf 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -4134,7 +4134,7 @@ static struct file *__shmem_file_setup(const char 
> *name, loff_t size,
>  unsigned long flags, unsigned int 
> i_flags)
>  {
>   struct file *res;
> - struct inode *inode;
> + struct inode *inode = NULL;
>   struct path path;
>   struct super_block *sb;
>   struct qstr this;
> @@ -4162,7 +4162,7 @@ static struct file *__shmem_file_setup(const char 
> *name, loff_t size,
>   res = ERR_PTR(-ENOSPC);
>   inode = shmem_get_inode(sb, NULL, S_IFREG | S_IRWXUGO, 0, flags);
>   if (!inode)
> - goto put_memory;
> + goto put_path;
> 
>   inode->i_flags |= i_flags;
>   d_instantiate(path.dentry, inode);

After this call, the inode is in use by the dcache, as its comment says.

> @@ -4170,19 +4170,22 @@ static struct file *__shmem_file_setup(const char 
> *name, loff_t size,
>   clear_nlink(inode); /* It is unlinked */
>   res = ERR_PTR(ramfs_nommu_expand_for_mapping(inode, size));
>   if (IS_ERR(res))
> - goto put_path;
> + goto put_inode;
> 
>   res = alloc_file(&path, FMODE_WRITE | FMODE_READ,
> &shmem_file_operations);
>   if (IS_ERR(res))
> - goto put_path;
> + goto put_inode;
> 
>   return res;
> 
> -put_memory:
> - shmem_unacct_size(flags, size);
> +put_inode:
> + iput(inode);
>  put_path:
>   path_put(&path);
> +put_memory:
> + if (!inode)
> + shmem_unacct_size(flags, size);
>   return res;
>  }
> 
> --
> 2.1.4



Re: [PATCH v2] hugetlbfs: initialize shared policy as part of inode allocation

2017-03-27 Thread Hillf Danton

On March 26, 2017 5:38 AM Mike Kravetz wrote:
> 
> Any time after inode allocation, destroy_inode can be called.  The
> hugetlbfs inode contains a shared_policy structure, and
> mpol_free_shared_policy is unconditionally called as part of
> hugetlbfs_destroy_inode.  Initialize the policy as part of inode
> allocation so that any quick (error path) calls to destroy_inode
> will be handed an initialized policy.
> 
> syzkaller fuzzer found this bug, that resulted in the following:
> 
> BUG: KASAN: user-memory-access in atomic_inc
> include/asm-generic/atomic-instrumented.h:87 [inline] at addr
> 00131730bd7a
> BUG: KASAN: user-memory-access in __lock_acquire+0x21a/0x3a80
> kernel/locking/lockdep.c:3239 at addr 00131730bd7a
> Write of size 4 by task syz-executor6/14086
> CPU: 3 PID: 14086 Comm: syz-executor6 Not tainted 4.11.0-rc3+ #364
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x1b8/0x28d lib/dump_stack.c:52
>  kasan_report_error mm/kasan/report.c:291 [inline]
>  kasan_report.part.2+0x34a/0x480 mm/kasan/report.c:316
>  kasan_report+0x21/0x30 mm/kasan/report.c:303
>  check_memory_region_inline mm/kasan/kasan.c:326 [inline]
>  check_memory_region+0x137/0x190 mm/kasan/kasan.c:333
>  kasan_check_write+0x14/0x20 mm/kasan/kasan.c:344
>  atomic_inc include/asm-generic/atomic-instrumented.h:87 [inline]
>  __lock_acquire+0x21a/0x3a80 kernel/locking/lockdep.c:3239
>  lock_acquire+0x1ee/0x590 kernel/locking/lockdep.c:3762
>  __raw_write_lock include/linux/rwlock_api_smp.h:210 [inline]
>  _raw_write_lock+0x33/0x50 kernel/locking/spinlock.c:295
>  mpol_free_shared_policy+0x43/0xb0 mm/mempolicy.c:2536
>  hugetlbfs_destroy_inode+0xca/0x120 fs/hugetlbfs/inode.c:952
>  alloc_inode+0x10d/0x180 fs/inode.c:216
>  new_inode_pseudo+0x69/0x190 fs/inode.c:889
>  new_inode+0x1c/0x40 fs/inode.c:918
>  hugetlbfs_get_inode+0x40/0x420 fs/hugetlbfs/inode.c:734
>  hugetlb_file_setup+0x329/0x9f0 fs/hugetlbfs/inode.c:1282
>  newseg+0x422/0xd30 ipc/shm.c:575
>  ipcget_new ipc/util.c:285 [inline]
>  ipcget+0x21e/0x580 ipc/util.c:639
>  SYSC_shmget ipc/shm.c:673 [inline]
>  SyS_shmget+0x158/0x230 ipc/shm.c:657
>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> 
> Analysis provided by Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> v2: Remove now redundant initialization in hugetlbfs_get_root
> 
> Reported-by: Dmitry Vyukov <dvyu...@google.com>
> Signed-off-by: Mike Kravetz <mike.krav...@oracle.com>
> ---

Acked-by: Hillf Danton <hillf...@alibaba-inc.com>
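
Presumably the fix boils down to initializing the policy in
hugetlbfs_alloc_inode(), along these lines (sketch; the free-inode
accounting is omitted):

	static struct inode *hugetlbfs_alloc_inode(struct super_block *sb)
	{
		struct hugetlbfs_inode_info *p;

		p = kmem_cache_alloc(hugetlbfs_inode_cachep, GFP_KERNEL);
		if (unlikely(!p))
			return NULL;
		/*
		 * Initialize the shared policy at allocation time so an
		 * early (error path) destroy_inode() sees a valid structure.
		 */
		mpol_shared_policy_init(&p->policy, NULL);
		return &p->vfs_inode;
	}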



Re: [PATCH] mm/hugetlb: Don't call region_abort if region_chg fails

2017-03-24 Thread Hillf Danton
Reply again with Andrew's mail address corrected:)

-'Andrew Morton' <akpm@linux^Coundation.org>
+'Andrew Morton' <a...@linux-foundation.org>
> 
> On March 24, 2017 12:03 PM Mike Kravetz wrote:
> >
> > Changes to hugetlbfs reservation maps is a two step process.  The first
> > step is a call to region_chg to determine what needs to be changed, and
> > prepare that change.  This should be followed by a call to
> > region_add to commit the change, or region_abort to abort the change.
> >
> > The error path in hugetlb_reserve_pages called region_abort after a
> > failed call to region_chg.  As a result, the adds_in_progress counter
> > in the reservation map is off by 1.  This is caught by a VM_BUG_ON
> > in resv_map_release when the reservation map is freed.
> >
> > syzkaller fuzzer found this bug, that resulted in the following:
> >
> >  kernel BUG at mm/hugetlb.c:742!
> >  Call Trace:
> >   hugetlbfs_evict_inode+0x7b/0xa0 fs/hugetlbfs/inode.c:493
> >   evict+0x481/0x920 fs/inode.c:553
> >   iput_final fs/inode.c:1515 [inline]
> >   iput+0x62b/0xa20 fs/inode.c:1542
> >   hugetlb_file_setup+0x593/0x9f0 fs/hugetlbfs/inode.c:1306
> >   newseg+0x422/0xd30 ipc/shm.c:575
> >   ipcget_new ipc/util.c:285 [inline]
> >   ipcget+0x21e/0x580 ipc/util.c:639
> >   SYSC_shmget ipc/shm.c:673 [inline]
> >   SyS_shmget+0x158/0x230 ipc/shm.c:657
> >   entry_SYSCALL_64_fastpath+0x1f/0xc2
> >  RIP: resv_map_release+0x265/0x330 mm/hugetlb.c:742
> >
> > Reported-by: Dmitry Vyukov <dvyu...@google.com>
> > Signed-off-by: Mike Kravetz <mike.krav...@oracle.com>
> > ---
> 
> Acked-by: Hillf Danton <hillf...@alibaba-inc.com>
> 
> >  mm/hugetlb.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index c7025c1..c65d45c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4233,7 +4233,9 @@ int hugetlb_reserve_pages(struct inode *inode,
> > return 0;
> >  out_err:
> > if (!vma || vma->vm_flags & VM_MAYSHARE)
> > -   region_abort(resv_map, from, to);
> > +   /* Don't call region_abort if region_chg failed */
> > +   if (chg >= 0)
> > +   region_abort(resv_map, from, to);
> > if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
> > kref_put(_map->refs, resv_map_release);
> > return ret;
> > --
> > 2.7.4



Re: [PATCH] mm/hugetlb: Don't call region_abort if region_chg fails

2017-03-24 Thread Hillf Danton
On March 24, 2017 12:03 PM Mike Kravetz wrote:
> 
> Changes to hugetlbfs reservation maps is a two step process.  The first
> step is a call to region_chg to determine what needs to be changed, and
> prepare that change.  This should be followed by a call to
> region_add to commit the change, or region_abort to abort the change.
> 
> The error path in hugetlb_reserve_pages called region_abort after a
> failed call to region_chg.  As a result, the adds_in_progress counter
> in the reservation map is off by 1.  This is caught by a VM_BUG_ON
> in resv_map_release when the reservation map is freed.
> 
> syzkaller fuzzer found this bug, that resulted in the following:
> 
>  kernel BUG at mm/hugetlb.c:742!
>  Call Trace:
>   hugetlbfs_evict_inode+0x7b/0xa0 fs/hugetlbfs/inode.c:493
>   evict+0x481/0x920 fs/inode.c:553
>   iput_final fs/inode.c:1515 [inline]
>   iput+0x62b/0xa20 fs/inode.c:1542
>   hugetlb_file_setup+0x593/0x9f0 fs/hugetlbfs/inode.c:1306
>   newseg+0x422/0xd30 ipc/shm.c:575
>   ipcget_new ipc/util.c:285 [inline]
>   ipcget+0x21e/0x580 ipc/util.c:639
>   SYSC_shmget ipc/shm.c:673 [inline]
>   SyS_shmget+0x158/0x230 ipc/shm.c:657
>   entry_SYSCALL_64_fastpath+0x1f/0xc2
>  RIP: resv_map_release+0x265/0x330 mm/hugetlb.c:742
> 
> Reported-by: Dmitry Vyukov <dvyu...@google.com>
> Signed-off-by: Mike Kravetz <mike.krav...@oracle.com>
> ---

Acked-by: Hillf Danton <hillf...@alibaba-inc.com>

>  mm/hugetlb.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c7025c1..c65d45c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4233,7 +4233,9 @@ int hugetlb_reserve_pages(struct inode *inode,
>   return 0;
>  out_err:
>   if (!vma || vma->vm_flags & VM_MAYSHARE)
> - region_abort(resv_map, from, to);
> + /* Don't call region_abort if region_chg failed */
> + if (chg >= 0)
> + region_abort(resv_map, from, to);
>   if (vma && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
>   kref_put(_map->refs, resv_map_release);
>   return ret;
> --
> 2.7.4



Re: [PATCH v1] mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()

2017-03-21 Thread Hillf Danton



On March 22, 2017 10:32 AM Naoya Horiguchi wrote: 
> 
> I found the race condition which triggers the following bug when
> move_pages() and soft offline are called on a single hugetlb page
> concurrently.
> 
> [61163.578957] Soft offlining page 0x119400 at 0x7000
> [61163.580062] BUG: unable to handle kernel paging request at 
> ea0011943820
> [61163.580791] IP: follow_huge_pmd+0x143/0x190
> [61163.581203] PGD 7ffd2067
> [61163.581204] PUD 7ffd1067
> [61163.581471] PMD 0
> [61163.581723]
> [61163.582052] Oops:  [#1] SMP
> [61163.582349] Modules linked in: binfmt_misc ppdev virtio_balloon 
> parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq
> ip_tables xfs libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel 
> ata_piix serio_raw libata virtio_pci 8139cp
> virtio_ring virtio
> mii floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cap_check]
> [61163.585130] CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P 
>   OE   4.11.0-rc2-mm1+ #2
> [61163.586055] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> [61163.586627] task: 88007c951680 task.stack: c90004bd8000
> [61163.587181] RIP: 0010:follow_huge_pmd+0x143/0x190
> [61163.587622] RSP: 0018:c90004bdbcd0 EFLAGS: 00010202
> [61163.588096] RAX: 000465003e80 RBX: ea0004e34d30 RCX: 
> 3000
> [61163.588818] RDX: 11943800 RSI: 00080001 RDI: 
> 000465003e80
> [61163.589486] RBP: c90004bdbd18 R08:  R09: 
> 880138d34000
> [61163.590097] R10: ea000465 R11: 00c363b0 R12: 
> ea0011943800
> [61163.590751] R13: 8801b8d34000 R14: ea00 R15: 
> 77ff8000
> [61163.591375] FS:  7fc977710740() GS:88007dc0() 
> knlGS:
> [61163.592068] CS:  0010 DS:  ES:  CR0: 80050033
> [61163.592627] CR2: ea0011943820 CR3: 7a746000 CR4: 
> 001406f0
> [61163.593330] Call Trace:
> [61163.593556]  follow_page_mask+0x270/0x550
> [61163.593908]  SYSC_move_pages+0x4ea/0x8f0
> [61163.594253]  ? lru_cache_add_active_or_unevictable+0x4b/0xd0
> [61163.594798]  SyS_move_pages+0xe/0x10
> [61163.595113]  do_syscall_64+0x67/0x180
> [61163.595434]  entry_SYSCALL64_slow_path+0x25/0x25
> [61163.595837] RIP: 0033:0x7fc976e03949
> [61163.596148] RSP: 002b:7ffe72221d88 EFLAGS: 0246 ORIG_RAX: 
> 0117
> [61163.596940] RAX: ffda RBX:  RCX: 
> 7fc976e03949
> [61163.597567] RDX: 00c22390 RSI: 1400 RDI: 
> 5827
> [61163.598177] RBP: 7ffe72221e00 R08: 00c2c3a0 R09: 
> 0004
> [61163.598842] R10: 00c363b0 R11: 0246 R12: 
> 00400650
> [61163.599456] R13: 7ffe72221ee0 R14:  R15: 
> 
> [61163.600067] Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 
> 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49
> 01 d4 f6 45 bc
> 04 74 90 <49> 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
> [61163.601845] RIP: follow_huge_pmd+0x143/0x190 RSP: c90004bdbcd0
> [61163.602376] CR2: ea0011943820
> [61163.602767] ---[ end trace e4f81353a2d23232 ]---
> [61163.603236] Kernel panic - not syncing: Fatal exception
> [61163.603706] Kernel Offset: disabled
> 
> This bug is triggered when pmd_present() returns true for non-present
> hugetlb, so fixing the present check in follow_huge_pmd() prevents it.
> Using pmd_present() to determine present/non-present for hugetlb is
> not correct, because pmd_present() checks multiple bits (not only
> _PAGE_PRESENT) for historical reason and it can misjudge hugetlb state.
> 
> Fixes: e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()")
> Signed-off-by: Naoya Horiguchi <n-horigu...@ah.jp.nec.com>
> Cc: <sta...@vger.kernel.org>[4.0+]
> ---

Acked-by: Hillf Danton <hillf...@alibaba-inc.com>

>  mm/hugetlb.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git v4.11-rc2-mmotm-2017-03-17-15-26/mm/hugetlb.c 
> v4.11-rc2-mmotm-2017-03-17-15-26_patched/mm/hugetlb.c
> index 3d0aab9..f501f14 100644
> --- v4.11-rc2-mmotm-2017-03-17-15-26/mm/hugetlb.c
> +++ v4.11-rc2-mmotm-2017-03-17-15-26_patched/mm/hugetlb.c
> @@ -4651,6 +4651,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long 
> address,
>  {
>   struct page *page = NULL;
>   spinlock_t *ptl;
> + pte_t pte;
>  retry:
>   ptl = pmd_lockptr(mm, pmd);
>   spin_lock(ptl);
> @@ -4660,12 +4661,13 @@ follow_huge_pmd(struct mm_struct *mm, unsigned

Re: [PATCH] kcov: simplify interrupt check

2017-03-21 Thread Hillf Danton

On March 21, 2017 5:10 PM Dmitry Vyukov wrote: 
> 
> @@ -60,15 +60,8 @@ void notrace __sanitizer_cov_trace_pc(void)
>   /*
>* We are interested in code coverage as a function of a syscall inputs,
>* so we ignore code executed in interrupts.
> -  * The checks for whether we are in an interrupt are open-coded, because
> -  * 1. We can't use in_interrupt() here, since it also returns true
> -  *when we are inside local_bh_disable() section.
> -  * 2. We don't want to use (in_irq() | in_serving_softirq() | in_nmi()),
> -  *since that leads to slower generated code (three separate tests,
> -  *one for each of the flags).
>*/
> - if (!t || (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_OFFSET
> - | NMI_MASK)))
> + if (!t || !in_task())
>   return;

Nit: can we get the current task check cut off?




Re: [PATCH v2 2/5] mm: parallel free pages

2017-03-15 Thread Hillf Danton

On March 15, 2017 5:00 PM Aaron Lu wrote: 
>  void tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned 
> long end)
>  {
> + struct batch_free_struct *batch_free, *n;
> +
s/*n/*next/

>   tlb_flush_mmu(tlb);
> 
>   /* keep the page table cache within bounds */
>   check_pgt_cache();
> 
> + list_for_each_entry_safe(batch_free, n, &tlb->worker_list, list) {
> + flush_work(&batch_free->work);

Not sure, list_del before free?

> + kfree(batch_free);
> + }
> +
>   tlb_flush_mmu_free_batches(tlb->local.next, true);
>   tlb->local.next = NULL;
>  }
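
Putting the two review points above together, the teardown loop would look
roughly like this: rename n to next and unlink each entry before freeing it.
This is only a sketch against the quoted patch; the type and field names are
taken from the hunk above and may differ in the final version.

	struct batch_free_struct *batch_free, *next;

	list_for_each_entry_safe(batch_free, next, &tlb->worker_list, list) {
		flush_work(&batch_free->work);
		list_del(&batch_free->list);	/* unlink before freeing */
		kfree(batch_free);
	}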



Re: [PATCH v2] mm: fix condition for throttle_direct_reclaim

2017-03-13 Thread Hillf Danton

On March 14, 2017 6:19 AM Shakeel Butt wrote: 
> 
> Recently kswapd has been modified to give up after MAX_RECLAIM_RETRIES
> number of unsucessful iterations. Before going to sleep, kswapd thread
> will unconditionally wakeup all threads sleeping on pfmemalloc_wait.
> However the awoken threads will recheck the watermarks and wake the
> kswapd thread and sleep again on pfmemalloc_wait. There is a chance
> of continuous back and forth between kswapd and direct reclaiming
> threads if the kswapd keep failing and thus defeat the purpose of
> adding backoff mechanism to kswapd. So, add kswapd_failures check
> on the throttle_direct_reclaim condition.
> 
> Signed-off-by: Shakeel Butt <shake...@google.com>
> Suggested-by: Michal Hocko <mho...@suse.com>
> Suggested-by: Johannes Weiner <han...@cmpxchg.org>
> ---

Acked-by: Hillf Danton <hillf...@alibaba-inc.com>



Re: [PATCH v1 02/10] mm: remove SWAP_DIRTY in ttu

2017-03-13 Thread Hillf Danton

On March 13, 2017 8:36 AM Minchan Kim wrote: 
> 
> If we found lazyfree page is dirty, try_to_unmap_one can just
> SetPageSwapBakced in there like PG_mlocked page and just return
> with SWAP_FAIL which is very natural because the page is not
> swappable right now so that vmscan can activate it.
> There is no point to introduce new return value SWAP_DIRTY
> in ttu at the moment.
> 
> Acked-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> Signed-off-by: Minchan Kim <minc...@kernel.org>
> ---
Acked-by: Hillf Danton <hillf...@alibaba-inc.com>

>  include/linux/rmap.h | 1 -
>  mm/rmap.c| 6 +++---
>  mm/vmscan.c  | 3 ---
>  3 files changed, 3 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index fee10d7..b556eef 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -298,6 +298,5 @@ static inline int page_mkclean(struct page *page)
>  #define SWAP_AGAIN   1
>  #define SWAP_FAIL2
>  #define SWAP_MLOCK   3
> -#define SWAP_DIRTY   4
> 
>  #endif   /* _LINUX_RMAP_H */
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9dbfa6f..d47af09 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1414,7 +1414,7 @@ static int try_to_unmap_one(struct page *page, struct 
> vm_area_struct *vma,
>*/
>   if (unlikely(PageSwapBacked(page) != 
> PageSwapCache(page))) {
>   WARN_ON_ONCE(1);
> - ret = SWAP_FAIL;
> + ret = false;
Nit:
Hm looks like stray merge.
Not sure it's really needed. 

>   page_vma_mapped_walk_done(&pvmw);
>   break;
>   }
> @@ -1431,7 +1431,8 @@ static int try_to_unmap_one(struct page *page, struct 
> vm_area_struct *vma,
>* discarded. Remap the page to page table.
>*/
>   set_pte_at(mm, address, pvmw.pte, pteval);
> - ret = SWAP_DIRTY;
> + SetPageSwapBacked(page);
> + ret = SWAP_FAIL;
>   page_vma_mapped_walk_done(&pvmw);
>   break;
>   }
> @@ -1501,7 +1502,6 @@ static int page_mapcount_is_zero(struct page *page)
>   * SWAP_AGAIN- we missed a mapping, try again later
>   * SWAP_FAIL - the page is unswappable
>   * SWAP_MLOCK- page is mlocked.
> - * SWAP_DIRTY- page is dirty MADV_FREE page
>   */
>  int try_to_unmap(struct page *page, enum ttu_flags flags)
>  {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a3656f9..b8fd656 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1142,9 +1142,6 @@ static unsigned long shrink_page_list(struct list_head 
> *page_list,
>   if (page_mapped(page)) {
>   switch (ret = try_to_unmap(page,
>   ttu_flags | TTU_BATCH_FLUSH)) {
> - case SWAP_DIRTY:
> - SetPageSwapBacked(page);
> - /* fall through */
>   case SWAP_FAIL:
>   nr_unmap_fail++;
>   goto activate_locked;
> --
> 2.7.4



Re: [PATCH v1 01/10] mm: remove unncessary ret in page_referenced

2017-03-13 Thread Hillf Danton

On March 13, 2017 8:36 AM Minchan Kim wrote: 
> 
> Anyone doesn't use ret variable. Remove it.
> 
> Acked-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> Signed-off-by: Minchan Kim <minc...@kernel.org>
> ---
Acked-by: Hillf Danton <hillf...@alibaba-inc.com>


>  mm/rmap.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 7d24bb9..9dbfa6f 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -807,7 +807,6 @@ int page_referenced(struct page *page,
>   struct mem_cgroup *memcg,
>   unsigned long *vm_flags)
>  {
> - int ret;
>   int we_locked = 0;
>   struct page_referenced_arg pra = {
>   .mapcount = total_mapcount(page),
> @@ -841,7 +840,7 @@ int page_referenced(struct page *page,
>   rwc.invalid_vma = invalid_page_referenced_vma;
>   }
> 
> - ret = rmap_walk(page, &rwc);
> + rmap_walk(page, &rwc);
>   *vm_flags = pra.vm_flags;
> 
>   if (we_locked)
> --
> 2.7.4



Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

2017-03-06 Thread Hillf Danton

On March 07, 2017 12:24 AM Johannes Weiner wrote: 
> On Mon, Mar 06, 2017 at 10:37:40AM +0900, Minchan Kim wrote:
> > On Fri, Mar 03, 2017 at 08:59:54AM +0100, Michal Hocko wrote:
> > > On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> > > > On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > > > > @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int 
> > > > > order, int classzone_idx)
> > > > >   sc.priority--;
> > > > >   } while (sc.priority >= 1);
> > > > >
> > > > > + if (!sc.nr_reclaimed)
> > > > > + pgdat->kswapd_failures++;
> > > >
> > > > sc.nr_reclaimed is reset to zero in above big loop's beginning so most 
> > > > of time,
> > > > it pgdat->kswapd_failures is increased.
> 
> That wasn't intentional; I didn't see the sc.nr_reclaimed reset.
> 
> ---
> 
> From e126db716926ff353b35f3a6205bd5853e01877b Mon Sep 17 00:00:00 2001
> From: Johannes Weiner 
> Date: Mon, 6 Mar 2017 10:53:59 -0500
> Subject: [PATCH] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes fix
> 
> Check kswapd failure against the cumulative nr_reclaimed count, not
> against the count from the lowest priority iteration.
> 
> Suggested-by: Minchan Kim 
> Signed-off-by: Johannes Weiner 
> ---
>  mm/vmscan.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ddcff8a11c1e..b834b2dd4e19 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3179,9 +3179,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
> int classzone_idx)
>   count_vm_event(PAGEOUTRUN);
> 
>   do {
> + unsigned long nr_reclaimed = sc.nr_reclaimed;
>   bool raise_priority = true;
> 
> - sc.nr_reclaimed = 0;

This has another effect: we'll reclaim fewer pages than we currently
do when balancing for a high-order request. It looks worth including
that information in the changelog as well.

>   sc.reclaim_idx = classzone_idx;
> 
>   /*
> @@ -3271,7 +3271,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
> int classzone_idx)
>* Raise priority if scanning rate is too low or there was no
>* progress in reclaiming pages
>*/
> - if (raise_priority || !sc.nr_reclaimed)
> + nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> + if (raise_priority || !nr_reclaimed)
>   sc.priority--;
>   } while (sc.priority >= 1);
> 
> --
> 2.11.1
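
The point of the fix is easy to see in isolation: take a snapshot of the
cumulative counter at the top of each pass and test the delta, so the
counter itself keeps the running total for the final failure check.  A
self-contained sketch, not kernel code:

#include <stdio.h>

int main(void)
{
	long nr_reclaimed = 0;			/* plays the role of sc.nr_reclaimed */
	long per_pass[] = { 5, 0, 2, 0, 0 };
	int priority = 5;

	for (int i = 0; i < 5 && priority >= 1; i++) {
		long before = nr_reclaimed;	/* snapshot, not a reset */

		nr_reclaimed += per_pass[i];
		if (nr_reclaimed - before == 0)	/* no progress this pass */
			priority--;
	}
	/* nr_reclaimed still holds the cumulative total */
	printf("total %ld, priority %d\n", nr_reclaimed, priority);
	return 0;
}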



Re: [RFC PATCH 03/12] staging: android: ion: Duplicate sg_table

2017-03-03 Thread Hillf Danton

On March 03, 2017 5:45 AM Laura Abbott wrote: 
> 
> +static struct sg_table *dup_sg_table(struct sg_table *table)
> +{
> + struct sg_table *new_table;
> + int ret, i;
> + struct scatterlist *sg, *new_sg;
> +
> + new_table = kzalloc(sizeof(*new_table), GFP_KERNEL);
> + if (!new_table)
> + return ERR_PTR(-ENOMEM);
> +
> + ret = sg_alloc_table(new_table, table->nents, GFP_KERNEL);
> + if (ret) {
> + kfree(table);

Free new table?

> + return ERR_PTR(-ENOMEM);
> + }
> +
> + new_sg = new_table->sgl;
> + for_each_sg(table->sgl, sg, table->nents, i) {
> + memcpy(new_sg, sg, sizeof(*sg));
> + sg->dma_address = 0;
> + new_sg = sg_next(new_sg);
> + }
> +

Do we need a helper, sg_copy_table(dst_table, src_table)?

> + return new_table;
> +}
> +
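
For reference, the error path with the fix the review asks about frees the
table that was just allocated rather than the caller's table.  A sketch
against the quoted code, not the merged driver:

	ret = sg_alloc_table(new_table, table->nents, GFP_KERNEL);
	if (ret) {
		kfree(new_table);	/* free what we allocated, not 'table' */
		return ERR_PTR(-ENOMEM);
	}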



Re: [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race

2017-03-02 Thread Hillf Danton

On March 02, 2017 11:11 PM Kirill A. Shutemov wrote: 
> 
> Basically the same race as with numa balancing in change_huge_pmd(), but
> a bit simpler to mitigate: we don't need to preserve dirty/young flags
> here due to MADV_FREE functionality.
> 
> Signed-off-by: Kirill A. Shutemov 
> Cc: Minchan Kim 
> ---
>  mm/huge_memory.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bb2b3646bd78..324217c31ec9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, 
> struct vm_area_struct *vma,
>   deactivate_page(page);
> 
>   if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> - orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> - tlb->fullmm);
>   orig_pmd = pmd_mkold(orig_pmd);
>   orig_pmd = pmd_mkclean(orig_pmd);
> 
$ grep -n set_pmd_at  linux-4.10/arch/powerpc/mm/pgtable-book3s64.c

/*
 * set a new huge pmd. We should not be called for updating
 * an existing pmd entry. That should go via pmd_hugepage_update.
 */
void set_pmd_at(struct mm_struct *mm, unsigned long addr,



Re: [RFC 03/11] mm: remove SWAP_DIRTY in ttu

2017-03-01 Thread Hillf Danton

On March 02, 2017 2:39 PM Minchan Kim wrote: 
> @@ -1424,7 +1424,8 @@ static int try_to_unmap_one(struct page *page, struct 
> vm_area_struct *vma,
>   } else if (!PageSwapBacked(page)) {
>   /* dirty MADV_FREE page */

Nit: enrich the comment please.
>   set_pte_at(mm, address, pvmw.pte, pteval);
> - ret = SWAP_DIRTY;
> + SetPageSwapBacked(page);
> + ret = SWAP_FAIL;
>   page_vma_mapped_walk_done(&pvmw);
>   break;
>   }
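
One possible way to enrich that comment, based on the MADV_FREE semantics
discussed in this series; the surrounding lines are taken from the hunk
above and the wording is only a suggestion:

		} else if (!PageSwapBacked(page)) {
			/*
			 * Dirty MADV_FREE page: it was written to after
			 * madvise(MADV_FREE), so its contents can no longer
			 * be discarded.  Remap it and put it back on the
			 * anonymous LRU so it is reclaimed via swap like a
			 * normal anonymous page.
			 */
			set_pte_at(mm, address, pvmw.pte, pteval);
			SetPageSwapBacked(page);
			ret = SWAP_FAIL;
			page_vma_mapped_walk_done(&pvmw);
			break;
		}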



Re: [PATCH 8/9] Revert "mm, vmscan: account for skipped pages as a partial scan"

2017-03-01 Thread Hillf Danton
On March 01, 2017 5:40 AM Johannes Weiner wrote: 
>  
> This reverts commit d7f05528eedb047efe2288cff777676b028747b6.
> 
> Now that reclaimability of a node is no longer based on the ratio
> between pages scanned and theoretically reclaimable pages, we can
> remove accounting tricks for pages skipped due to zone constraints.
> 
> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
> ---
Acked-by: Hillf Danton <hillf...@alibaba-inc.com>




Re: [PATCH 9/9] mm: remove unnecessary back-off function when retrying page reclaim

2017-03-01 Thread Hillf Danton

On March 01, 2017 5:40 AM Johannes Weiner wrote:
> 
> The backoff mechanism is not needed. If we have MAX_RECLAIM_RETRIES
> loops without progress, we'll OOM anyway; backing off might cut one or
> two iterations off that in the rare OOM case. If we have intermittent
> success reclaiming a few pages, the backoff function gets reset also,
> and so is of little help in these scenarios.
> 
> We might want a backoff function for when there IS progress, but not
> enough to be satisfactory. But this isn't that. Remove it.
> 
> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
> ---
Acked-by: Hillf Danton <hillf...@alibaba-inc.com>




Re: [PATCH 4/9] mm: remove unnecessary reclaimability check from NUMA balancing target

2017-03-01 Thread Hillf Danton



On March 01, 2017 5:40 AM Johannes Weiner wrote: 
> 
> NUMA balancing already checks the watermarks of the target node to
> decide whether it's a suitable balancing target. Whether the node is
> reclaimable or not is irrelevant when we don't intend to reclaim.
> 
> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
> ---
Acked-by: Hillf Danton <hillf...@alibaba-inc.com>




Re: [PATCH 7/9] mm: delete NR_PAGES_SCANNED and pgdat_reclaimable()

2017-03-01 Thread Hillf Danton
On March 01, 2017 5:40 AM Johannes Weiner wrote: 
> 
> NR_PAGES_SCANNED counts number of pages scanned since the last page
> free event in the allocator. This was used primarily to measure the
> reclaimability of zones and nodes, and determine when reclaim should
> give up on them. In that role, it has been replaced in the preceeding
> patches by a different mechanism.
> 
> Being implemented as an efficient vmstat counter, it was automatically
> exported to userspace as well. It's however unlikely that anyone
> outside the kernel is using this counter in any meaningful way.
> 
> Remove the counter and the unused pgdat_reclaimable().
> 
> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
> ---
Acked-by: Hillf Danton <hillf...@alibaba-inc.com>




Re: [PATCH 5/9] mm: don't avoid high-priority reclaim on unreclaimable nodes

2017-03-01 Thread Hillf Danton

On March 01, 2017 5:40 AM Johannes Weiner wrote:
> 
> 246e87a93934 ("memcg: fix get_scan_count() for small targets") sought
> to avoid high reclaim priorities for kswapd by forcing it to scan a
> minimum amount of pages when lru_pages >> priority yielded nothing.
> 
> b95a2f2d486d ("mm: vmscan: convert global reclaim to per-memcg LRU
> lists"), due to switching global reclaim to a round-robin scheme over
> all cgroups, had to restrict this forceful behavior to unreclaimable
> zones in order to prevent massive overreclaim with many cgroups.
> 
> The latter patch effectively neutered the behavior completely for all
> but extreme memory pressure. But in those situations we might as well
> drop the reclaimers to lower priority levels. Remove the check.
> 
> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
> ---
>  mm/vmscan.c | 19 +++++------
>  1 file changed, 5 insertions(+), 14 deletions(-)
> 
Acked-by: Hillf Danton <hillf...@alibaba-inc.com>




Re: [PATCH 6/9] mm: don't avoid high-priority reclaim on memcg limit reclaim

2017-03-01 Thread Hillf Danton
On March 01, 2017 5:40 AM Johannes Weiner wrote: 
> 
> 246e87a93934 ("memcg: fix get_scan_count() for small targets") sought
> to avoid high reclaim priorities for memcg by forcing it to scan a
> minimum amount of pages when lru_pages >> priority yielded nothing.
> This was done at a time when reclaim decisions like dirty throttling
> were tied to the priority level.
> 
> Nowadays, the only meaningful thing still tied to priority dropping
> below DEF_PRIORITY - 2 is gating whether laptop_mode=1 is generally
> allowed to write. But that is from an era where direct reclaim was
> still allowed to call ->writepage, and kswapd nowadays avoids writes
> until it's scanned every clean page in the system. Potential changes
> to how quick sc->may_writepage could trigger are of little concern.
> 
> Remove the force_scan stuff, as well as the ugly multi-pass target
> calculation that it necessitated.
> 
> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
> ---
Acked-by: Hillf Danton <hillf...@alibaba-inc.com>




Re: [PATCH 3/9] mm: remove seemingly spurious reclaimability check from laptop_mode gating

2017-03-01 Thread Hillf Danton
On March 01, 2017 5:40 AM Johannes Weiner wrote: 
> 
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> allowed laptop_mode=1 to start writing not just when the priority
> drops to DEF_PRIORITY - 2 but also when the node is unreclaimable.
> That appears to be a spurious change in this patch as I doubt the
> series was tested with laptop_mode, and neither is that particular
> change mentioned in the changelog. Remove it, it's still recent.
> 
> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
> ---
Acked-by: Hillf Danton <hillf...@alibaba-inc.com>




Re: [PATCH 2/9] mm: fix check for reclaimable pages in PF_MEMALLOC reclaim throttling

2017-03-01 Thread Hillf Danton

On March 01, 2017 5:40 AM Johannes Weiner wrote: 
> 
> PF_MEMALLOC direct reclaimers get throttled on a node when the sum of
> all free pages in each zone fall below half the min watermark. During
> the summation, we want to exclude zones that don't have reclaimables.
> Checking the same pgdat over and over again doesn't make sense.
> 
> Fixes: 599d0c954f91 ("mm, vmscan: move LRU lists to node")
> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
> ---
Acked-by: Hillf Danton <hillf...@alibaba-inc.com>




Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes

2017-03-01 Thread Hillf Danton

On March 01, 2017 5:40 AM Johannes Weiner wrote:
> 
> Jia He reports a problem with kswapd spinning at 100% CPU when
> requesting more hugepages than memory available in the system:
> 
> $ echo 4000 >/proc/sys/vm/nr_hugepages
> 
> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 
> st
> KiB Mem:  31371520 total, 30915136 used,   456384 free,  320 buffers
> KiB Swap:  6284224 total,   115712 used,  6168512 free.48192 cached Mem
> 
>   PID USER  PR  NIVIRTRESSHR S  %CPU  %MEM TIME+ COMMAND
>76 root  20   0   0  0  0 R 100.0 0.000 217:17.29 kswapd3
> 
> At that time, there are no reclaimable pages left in the node, but as
> kswapd fails to restore the high watermarks it refuses to go to sleep.
> 
> Kswapd needs to back away from nodes that fail to balance. Up until
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> kswapd had such a mechanism. It considered zones whose theoretically
> reclaimable pages it had reclaimed six times over as unreclaimable and
> backed away from them. This guard was erroneously removed as the patch
> changed the definition of a balanced node.
> 
> However, simply restoring this code wouldn't help in the case reported
> here: there *are* no reclaimable pages that could be scanned until the
> threshold is met. Kswapd would stay awake anyway.
> 
> Introduce a new and much simpler way of backing off. If kswapd runs
> through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> page, make it back off from the node. This is the same number of shots
> direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> that node until a direct reclaimer manages to reclaim some pages, thus
> proving the node reclaimable again.
> 
> v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
> 
> Reported-by: Jia He <hejia...@gmail.com>
> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
> Tested-by: Jia He <hejia...@gmail.com>
> Acked-by: Michal Hocko <mho...@suse.com>
> ---

Acked-by: Hillf Danton <hillf...@alibaba-inc.com>
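
To make the backoff easy to reason about, here is a toy, self-contained
model of the counter the patch introduces: a node is given up on after
MAX_RECLAIM_RETRIES balance passes with zero progress, and the counter is
cleared again once reclaim succeeds.  Names mirror the changelog; this is
not the mm/vmscan.c code.

#include <stdbool.h>
#include <stdio.h>

#define MAX_RECLAIM_RETRIES 16

struct node_state {
	int kswapd_failures;
	long reclaimable_pages;
};

/* One balance pass over the node: returns pages reclaimed. */
static long balance_pass(struct node_state *node)
{
	long reclaimed = node->reclaimable_pages > 0 ? 1 : 0;

	node->reclaimable_pages -= reclaimed;
	if (!reclaimed)
		node->kswapd_failures++;	/* whole pass made no progress */
	else
		node->kswapd_failures = 0;
	return reclaimed;
}

static bool kswapd_may_run(const struct node_state *node)
{
	return node->kswapd_failures < MAX_RECLAIM_RETRIES;
}

int main(void)
{
	struct node_state node = { .reclaimable_pages = 3 };
	int cycles = 0;

	while (kswapd_may_run(&node)) {
		balance_pass(&node);
		cycles++;
	}
	printf("kswapd backed off after %d cycles\n", cycles);

	node.kswapd_failures = 0;	/* direct reclaim made progress again */
	printf("may run again: %d\n", kswapd_may_run(&node));
	return 0;
}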



Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps

2017-02-27 Thread Hillf Danton

On February 25, 2017 5:32 AM Shaohua Li wrote: 
> 
> show MADV_FREE pages info of each vma in smaps. The interface is for
> diganose or monitoring purpose, userspace could use it to understand
> what happens in the application. Since userspace could dirty MADV_FREE
> pages without notice from kernel, this interface is the only place we
> can get accurate accounting info about MADV_FREE pages.
> 
> Cc: Michal Hocko <mho...@suse.com>
> Cc: Hugh Dickins <hu...@google.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Mel Gorman <mgor...@techsingularity.net>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Acked-by: Johannes Weiner <han...@cmpxchg.org>
> Acked-by: Minchan Kim <minc...@kernel.org>
> Signed-off-by: Shaohua Li <s...@fb.com>
> ---

Acked-by: Hillf Danton <hillf...@alibaba-inc.com>
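
A quick way to look at the new counter from userspace is to scan a
process's smaps for the lazy-free field.  The "LazyFree:" field name
matches what eventually shipped in mainline; treat it as an assumption for
this particular revision of the series.

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/smaps",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "LazyFree:", 9))
			fputs(line, stdout);	/* per-VMA lazyfree amount */
	fclose(f);
	return 0;
}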




Re: [PATCH V5 5/6] mm: enable MADV_FREE for swapless system

2017-02-27 Thread Hillf Danton

On February 25, 2017 5:32 AM Shaohua Li wrote: 
> 
> Now MADV_FREE pages can be easily reclaimed even for swapless system. We
> can safely enable MADV_FREE for all systems.
> 
> Cc: Michal Hocko <mho...@suse.com>
> Cc: Minchan Kim <minc...@kernel.org>
> Cc: Hugh Dickins <hu...@google.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Mel Gorman <mgor...@techsingularity.net>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Acked-by: Johannes Weiner <han...@cmpxchg.org>
> Signed-off-by: Shaohua Li <s...@fb.com>
> ---

Acked-by: Hillf Danton <hillf...@alibaba-inc.com>




Re: [PATCH V5 3/6] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list

2017-02-27 Thread Hillf Danton
On February 25, 2017 5:32 AM Shaohua Li wrote: 
> 
> madv MADV_FREE indicate pages are 'lazyfree'. They are still anonymous
> pages, but they can be freed without pageout. To destinguish them
> against normal anonymous pages, we clear their SwapBacked flag.
> 
> MADV_FREE pages could be freed without pageout, so they pretty much like
> used once file pages. For such pages, we'd like to reclaim them once
> there is memory pressure. Also it might be unfair reclaiming MADV_FREE
> pages always before used once file pages and we definitively want to
> reclaim the pages before other anonymous and file pages.
> 
> To speed up MADV_FREE pages reclaim, we put the pages into
> LRU_INACTIVE_FILE list. The rationale is LRU_INACTIVE_FILE list is tiny
> nowadays and should be full of used once file pages. Reclaiming
> MADV_FREE pages will not have much interfere of anonymous and active
> file pages. And the inactive file pages and MADV_FREE pages will be
> reclaimed according to their age, so we don't reclaim too many MADV_FREE
> pages too. Putting the MADV_FREE pages into LRU_INACTIVE_FILE_LIST also
> means we can reclaim the pages without swap support. This idea is
> suggested by Johannes.
> 
> This patch doesn't move MADV_FREE pages to LRU_INACTIVE_FILE list yet to
> avoid bisect failure, next patch will do it.
> 
> The patch is based on Minchan's original patch.
> 
> Cc: Michal Hocko <mho...@suse.com>
> Cc: Minchan Kim <minc...@kernel.org>
> Cc: Hugh Dickins <hu...@google.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Mel Gorman <mgor...@techsingularity.net>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Suggested-by: Johannes Weiner <han...@cmpxchg.org>
> Signed-off-by: Shaohua Li <s...@fb.com>
> ---

Acked-by: Hillf Danton <hillf...@alibaba-inc.com>
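
From userspace the semantics described above boil down to a single
madvise() call: the pages stay mapped, the kernel may discard them under
memory pressure, and writing to a page again simply cancels the lazy free
for that page.  Minimal sketch; MADV_FREE is 8 in the asm-generic uapi
headers, and the fallback define is only for older libc headers.

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* asm-generic value */
#endif

int main(void)
{
	size_t len = 4 * 1024 * 1024;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	memset(buf, 0xaa, len);			/* dirty the anonymous pages */

	if (madvise(buf, len, MADV_FREE))	/* mark them lazily freeable */
		perror("madvise(MADV_FREE)");

	buf[0] = 1;	/* touching a page again keeps its contents */

	munmap(buf, len);
	return 0;
}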



Re: [PATCH V5 4/6] mm: reclaim MADV_FREE pages

2017-02-27 Thread Hillf Danton

On February 25, 2017 5:32 AM Shaohua Li wrote: 
> 
> When memory pressure is high, we free MADV_FREE pages. If the pages are
> not dirty in pte, the pages can be freed immediately. Otherwise we
> can't reclaim them immediately. We put the pages back to the anonymous LRU
> list (by setting the SwapBacked flag) and the pages will be reclaimed in the
> normal swapout way.
> 
> We use normal page reclaim policy. Since MADV_FREE pages are put into
> inactive file list, such pages and inactive file pages are reclaimed
> according to their age. This is expected, because we don't want to
> reclaim too many MADV_FREE pages before used once pages.
> 
> Based on Minchan's original patch
> 
> Cc: Michal Hocko <mho...@suse.com>
> Cc: Minchan Kim <minc...@kernel.org>
> Cc: Hugh Dickins <hu...@google.com>
> Cc: Johannes Weiner <han...@cmpxchg.org>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Mel Gorman <mgor...@techsingularity.net>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Signed-off-by: Shaohua Li <s...@fb.com>
> ---

Acked-by: Hillf Danton <hillf...@alibaba-inc.com>
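
The decision described above boils down to roughly the following; this is an
illustrative sketch only (the helper name is made up, and the real logic is
spread across the unmap and shrink paths in the actual patch):

/*
 * Sketch: how a lazyfree page is treated at reclaim time.
 * Clean in both the pte and the page -> it can simply be dropped.
 * Dirtied after MADV_FREE -> give it back to the anonymous LRU so it
 * goes through the normal swapout path.
 */
static bool lazyfree_can_discard(struct page *page, pte_t pteval)
{
	/* lazyfree: anonymous page with SwapBacked cleared */
	if (!PageAnon(page) || PageSwapBacked(page))
		return false;

	if (!pte_dirty(pteval) && !PageDirty(page))
		return true;		/* free it immediately */

	SetPageSwapBacked(page);	/* back to the anon LRU */
	return false;
}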




Re: [PATCH V5 2/6] mm: don't assume anonymous pages have SwapBacked flag

2017-02-26 Thread Hillf Danton

On February 25, 2017 5:32 AM Shaohua Li wrote: 
> 
> There are a few places where the code assumes anonymous pages should have
> the SwapBacked flag set. MADV_FREE pages are anonymous pages, but we are
> going to add them to the LRU_INACTIVE_FILE list and clear the SwapBacked flag
> for them. The assumption doesn't hold any more, so fix those places.
> 
> Cc: Michal Hocko <mho...@suse.com>
> Cc: Minchan Kim <minc...@kernel.org>
> Cc: Hugh Dickins <hu...@google.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Mel Gorman <mgor...@techsingularity.net>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Acked-by: Johannes Weiner <han...@cmpxchg.org>
> Signed-off-by: Shaohua Li <s...@fb.com>
> ---

Acked-by: Hillf Danton <hillf...@alibaba-inc.com>




Re: [PATCH V4 1/6] mm: delete unnecessary TTU_* flags

2017-02-23 Thread Hillf Danton
On February 23, 2017 2:51 AM Shaohua Li wrote: 
> 
> Johannes pointed out TTU_LZFREE is unnecessary. It's true because we
> always have the flag set if we want to do an unmap. For cases where we don't
> do an unmap, the TTU_LZFREE part of the code should never run.
> 
> Also, TTU_UNMAP is unnecessary. If no other flags are set (for
> example, TTU_MIGRATION), an unmap is implied.
> 
> Cc: Michal Hocko <mho...@suse.com>
> Cc: Minchan Kim <minc...@kernel.org>
> Cc: Hugh Dickins <hu...@google.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Mel Gorman <mgor...@techsingularity.net>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Suggested-by: Johannes Weiner <han...@cmpxchg.org>
> Signed-off-by: Shaohua Li <s...@fb.com>
> ---

Acked-by: Hillf Danton <hillf...@alibaba-inc.com>




Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx

2017-02-20 Thread Hillf Danton
On February 21, 2017 12:34 AM Vlastimil Babka wrote:
> On 02/16/2017 09:21 AM, Hillf Danton wrote:
> > Right, but the order-3 request can also come up while kswapd is active and
> > gives up order-5.
> 
> "Giving up on order-5" means it will set sc.order to 0, go to sleep (assuming
> order-0 watermarks are OK) and wakeup kcompactd for order-5. There's no way 
> how
> kswapd could help an order-3 allocation at that point - it's up to kcompactd.
> 
cpu0cpu1
give up order-5 
fall back to order-0
wake up kswapd for order-3 
wake up kswapd for order-5
fall in sleep
wake up kswapd for order-3
what order would
we try?

It is order-5 in the patch. 

Given that we wake to a fresh new world, with no ban on hiking the order,
after napping for one tenth of a second or for 3 minutes, we feel free IMHO
to select any order and go another round of reclaiming pages.

thanks
Hillf




Re: [PATCH] mm/thp/autonuma: Use TNF flag instead of vm fault.

2017-02-19 Thread Hillf Danton

On February 19, 2017 6:00 PM Aneesh Kumar K.V wrote: 
> 
> We are using the wrong flag value in the task_numa_fault function. This can
> result in wrong numa fault statistics updates, because we update
> num_pages_migrate and numa_fault_locality etc. based on the flag argument
> passed.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.ku...@linux.vnet.ibm.com>

Fixes: bae473a423 ("mm: introduce fault_env")
Acked-by: Hillf Danton <hillf...@alibaba-inc.com>

> ---
>  mm/huge_memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5f3ad65c85de..8f1d93257fb9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1333,7 +1333,7 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t 
> pmd)
> 
>   if (page_nid != -1)
>   task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR,
> - vmf->flags);
> + flags);
> 
>   return 0;
>  }
> --
> 2.7.4



Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx

2017-02-16 Thread Hillf Danton

On February 16, 2017 4:11 PM Mel Gorman wrote:
> On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote:
> > On February 15, 2017 5:23 PM Mel Gorman wrote:
> > >   */
> > >  static int kswapd(void *p)
> > >  {
> > > - unsigned int alloc_order, reclaim_order, classzone_idx;
> > > + unsigned int alloc_order, reclaim_order;
> > > + unsigned int classzone_idx = MAX_NR_ZONES - 1;
> > >   pg_data_t *pgdat = (pg_data_t*)p;
> > >   struct task_struct *tsk = current;
> > >
> > > @@ -3447,20 +3466,23 @@ static int kswapd(void *p)
> > >   tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> > >   set_freezable();
> > >
> > > - pgdat->kswapd_order = alloc_order = reclaim_order = 0;
> > > - pgdat->kswapd_classzone_idx = classzone_idx = 0;
> > > + pgdat->kswapd_order = 0;
> > > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> > >   for ( ; ; ) {
> > >   bool ret;
> > >
> > > + alloc_order = reclaim_order = pgdat->kswapd_order;
> > > + classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
> > > +
> > >  kswapd_try_sleep:
> > >   kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> > >   classzone_idx);
> > >
> > >   /* Read the new order and classzone_idx */
> > >   alloc_order = reclaim_order = pgdat->kswapd_order;
> > > - classzone_idx = pgdat->kswapd_classzone_idx;
> > > + classzone_idx = kswapd_classzone_idx(pgdat, 0);
> > >   pgdat->kswapd_order = 0;
> > > - pgdat->kswapd_classzone_idx = 0;
> > > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> > >
> > >   ret = try_to_freeze();
> > >   if (kthread_should_stop())
> > > @@ -3486,9 +3508,6 @@ static int kswapd(void *p)
> > >   reclaim_order = balance_pgdat(pgdat, alloc_order, 
> > > classzone_idx);
> > >   if (reclaim_order < alloc_order)
> > >   goto kswapd_try_sleep;
> >
> > If we fail an order-5 request, can we then give up order-5, and
> > try order-3 if requested, after napping?
> >
> 
> That has no bearing upon this patch. At this point, kswapd has stopped
> reclaiming at the requested order and is preparing to sleep. If there is
> a parallel request for order-3 while it's sleeping, it'll wake and start
> reclaiming at order-3 as requested.
> 
Right, but the order-3 request can also come up while kswapd is active and
gives up order-5.

thanks
Hillf
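
The kswapd_classzone_idx() helper referenced in the hunks above is not quoted
in this thread; presumably it treats MAX_NR_ZONES as a "no request recorded"
sentinel, along these lines (a sketch, not the exact patch code):

/*
 * Return the classzone index kswapd should work with: the caller's value
 * if no wakeup recorded one, otherwise the widest of the two, so a
 * concurrent wakeup can only widen the classzone, never narrow it.
 */
static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
					   enum zone_type classzone_idx)
{
	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
		return classzone_idx;

	return max(pgdat->kswapd_classzone_idx, classzone_idx);
}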



Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx

2017-02-15 Thread Hillf Danton
On February 15, 2017 5:23 PM Mel Gorman wrote: 
>   */
>  static int kswapd(void *p)
>  {
> - unsigned int alloc_order, reclaim_order, classzone_idx;
> + unsigned int alloc_order, reclaim_order;
> + unsigned int classzone_idx = MAX_NR_ZONES - 1;
>   pg_data_t *pgdat = (pg_data_t*)p;
>   struct task_struct *tsk = current;
> 
> @@ -3447,20 +3466,23 @@ static int kswapd(void *p)
>   tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>   set_freezable();
> 
> - pgdat->kswapd_order = alloc_order = reclaim_order = 0;
> - pgdat->kswapd_classzone_idx = classzone_idx = 0;
> + pgdat->kswapd_order = 0;
> + pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>   for ( ; ; ) {
>   bool ret;
> 
> + alloc_order = reclaim_order = pgdat->kswapd_order;
> + classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
> +
>  kswapd_try_sleep:
>   kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
>   classzone_idx);
> 
>   /* Read the new order and classzone_idx */
>   alloc_order = reclaim_order = pgdat->kswapd_order;
> - classzone_idx = pgdat->kswapd_classzone_idx;
> + classzone_idx = kswapd_classzone_idx(pgdat, 0);
>   pgdat->kswapd_order = 0;
> - pgdat->kswapd_classzone_idx = 0;
> + pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> 
>   ret = try_to_freeze();
>   if (kthread_should_stop())
> @@ -3486,9 +3508,6 @@ static int kswapd(void *p)
>   reclaim_order = balance_pgdat(pgdat, alloc_order, 
> classzone_idx);
>   if (reclaim_order < alloc_order)
>   goto kswapd_try_sleep;

If we fail an order-5 request, can we then give up order-5, and
try order-3 if requested, after napping?

> -
> - alloc_order = reclaim_order = pgdat->kswapd_order;
> - classzone_idx = pgdat->kswapd_classzone_idx;
>   }
> 




Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep

2017-02-15 Thread Hillf Danton
On February 15, 2017 5:23 PM Mel Gorman wrote: 
> 
> From: Shantanu Goel <sgoe...@yahoo.com>
> 
> The check in prepare_kswapd_sleep needs to match the one in balance_pgdat
> since the latter will return as soon as any one of the zones in the
> classzone is above the watermark.  This is especially important for higher
> order allocations since balance_pgdat will typically reset the order to
> zero relying on compaction to create the higher order pages.  Without this
> patch, prepare_kswapd_sleep fails to wake up kcompactd since the zone
> balance check fails.
> 
> On 4.9.7 kswapd is failing to wake up kcompactd due to a mismatch in the
> zone balance check between balance_pgdat() and prepare_kswapd_sleep().
> balance_pgdat() returns as soon as a single zone satisfies the allocation
> but prepare_kswapd_sleep() requires all zones to do the same.  This causes
> prepare_kswapd_sleep() to never succeed except in the order == 0 case and
> consequently, wakeup_kcompactd() is never called.  On my machine prior to
> applying this patch, the state of compaction from /proc/vmstat looked this
> way after a day and a half of uptime:
> 
> compact_migrate_scanned 240496
> compact_free_scanned 76238632
> compact_isolated 123472
> compact_stall 1791
> compact_fail 29
> compact_success 1762
> compact_daemon_wake 0
> 
> After applying the patch and about 10 hours of uptime the state looks
> like this:
> 
> compact_migrate_scanned 59927299
> compact_free_scanned 2021075136
> compact_isolated 640926
> compact_stall 4
> compact_fail 2
> compact_success 2
> compact_daemon_wake 5160
> 
> Further notes from Mel that motivated him to pick this patch up and
> resend it;
> 
> It was observed for the simoop workload (pressures the VM similar to HADOOP)
> that kswapd was failing to keep ahead of direct reclaim. The investigation
> noted that there was a need to rationalise kswapd decisions to reclaim
> with kswapd decisions to sleep. With this patch on a 2-socket box, there
> was a 43% reduction in direct reclaim scanning.
> 
> However, the impact otherwise is extremely negative. Kswapd reclaim
> efficiency dropped from 98% to 76%. simoop has three latency-related
> metrics for read, write and allocation (an anonymous mmap and fault).
> 
>                               4.10.0-rc7             4.10.0-rc7
>                           mmots-20170209            fixcheck-v1
> Amean    p50-Read        22325202.49 (  0.00%)  20026926.55 ( 10.29%)
> Amean    p95-Read        26102988.80 (  0.00%)  27023360.00 ( -3.53%)
> Amean    p99-Read        30935176.53 (  0.00%)  30994432.00 ( -0.19%)
> Amean    p50-Write           976.44 (  0.00%)      1905.28 (-95.12%)
> Amean    p95-Write         15471.29 (  0.00%)     36210.09 (-134.05%)
> Amean    p99-Write         35108.62 (  0.00%)    479494.96 (-1265.75%)
> Amean    p50-Allocation    76382.61 (  0.00%)     87603.20 (-14.69%)
> Amean    p95-Allocation       12.39 (  0.00%)    244491.38 (-91.34%)
> Amean    p99-Allocation   187937.39 (  0.00%)   1745237.33 (-828.63%)
> 
> There are also more allocation stalls. One of the largest impacts was due
> to pages written back from kswapd context rising from 0 pages to 4516642
> pages during the hour the workload ran for. By and large, the patch has very
> bad behaviour but easily missed as the impact on a UMA machine is negligible.
> 
> This patch is included with the data in case a bisection leads to this area.
> This patch is also a pre-requisite for the rest of the series.
> 
> Signed-off-by: Shantanu Goel <sgoe...@yahoo.com>
> Signed-off-by: Mel Gorman <mgor...@techsingularity.net>
> ---
Acked-by: Hillf Danton <hillf...@alibaba-inc.com>

>  mm/vmscan.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 26c3b405ef34..92fc66bd52bc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3140,11 +3140,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, 
> int order, int classzone_idx)
>   if (!managed_zone(zone))
>   continue;
> 
> - if (!zone_balanced(zone, order, classzone_idx))
> - return false;
> + if (zone_balanced(zone, order, classzone_idx))
> + return true;
>   }
> 
> - return true;
> + return false;
>  }
> 
>  /*
> --
> 2.11.0



Re: [PATCH V2 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list

2017-02-08 Thread Hillf Danton

On February 04, 2017 2:38 PM Hillf Danton wrote: 
> 
> On February 04, 2017 7:33 AM Shaohua Li wrote:
> > @@ -1404,6 +1401,8 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, 
> > struct vm_area_struct *vma,
> > set_pmd_at(mm, addr, pmd, orig_pmd);
> > tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> > }
> > +
> > +   mark_page_lazyfree(page);
> > ret = true;
> >  out:
> > spin_unlock(ptl);
> 
> 
> 
> > -void deactivate_page(struct page *page)
> > -{
> > -   if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> > -   struct pagevec *pvec = _cpu_var(lru_deactivate_pvecs);
> > +void mark_page_lazyfree(struct page *page)
> > + {
> > +   if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
> > +   !PageUnevictable(page)) {
> > +   struct pagevec *pvec = _cpu_var(lru_lazyfree_pvecs);
> >
> > get_page(page);
> > if (!pagevec_add(pvec, page) || PageCompound(page))
> > -   pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> > -   put_cpu_var(lru_deactivate_pvecs);
> > +   pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
> > +   put_cpu_var(lru_lazyfree_pvecs);
> > }
> >  }
> 
> You are not adding it, but would you please try to fix or avoid flipping
> the preempt count with the page table lock held?
> 
preempt_en/disable are embedded in spin_lock/unlock, so please
ignore my noise.

thanks
Hillf


