Re: [PATCH 08/13] zram: add error handling support for add_disk()

2021-10-25 Thread Minchan Kim
On Fri, Oct 15, 2021 at 04:52:14PM -0700, Luis Chamberlain wrote:
> We never checked for errors on add_disk() as this function
> returned void. Now that this is fixed, use the shiny new
> error handling.
> 
> Signed-off-by: Luis Chamberlain 
Acked-by: Minchan Kim 


Re: [PATCH 9/9] zsmalloc: remove the zsmalloc file system

2021-03-09 Thread Minchan Kim
On Tue, Mar 09, 2021 at 04:53:48PM +0100, Christoph Hellwig wrote:
> Just use the generic anon_inode file system.
> 
> Signed-off-by: Christoph Hellwig 
Acked-by: Minchan Kim 


Re: [PATCH 1/9] fs: rename alloc_anon_inode to alloc_anon_inode_sb

2021-03-09 Thread Minchan Kim
On Tue, Mar 09, 2021 at 04:53:40PM +0100, Christoph Hellwig wrote:
> Rename alloc_inode to free the name for a new variant that does not
> need boilerplate to create a super_block first.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/powerpc/platforms/pseries/cmm.c | 2 +-
>  drivers/dma-buf/dma-buf.c| 2 +-
>  drivers/gpu/drm/drm_drv.c| 2 +-
>  drivers/misc/cxl/api.c   | 2 +-
>  drivers/misc/vmw_balloon.c   | 2 +-
>  drivers/scsi/cxlflash/ocxl_hw.c  | 2 +-
>  drivers/virtio/virtio_balloon.c  | 2 +-
>  fs/aio.c | 2 +-
>  fs/anon_inodes.c | 4 ++--
>  fs/libfs.c   | 2 +-
>  include/linux/fs.h   | 2 +-
>  kernel/resource.c| 2 +-
>  mm/z3fold.c  | 2 +-
>  mm/zsmalloc.c| 2 +-
>  14 files changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/cmm.c 
> b/arch/powerpc/platforms/pseries/cmm.c
> index 45a3a3022a85c9..6d36b858b14df1 100644
> --- a/arch/powerpc/platforms/pseries/cmm.c
> +++ b/arch/powerpc/platforms/pseries/cmm.c
> @@ -580,7 +580,7 @@ static int cmm_balloon_compaction_init(void)
>   return rc;
>   }
>  
> - b_dev_info.inode = alloc_anon_inode(balloon_mnt->mnt_sb);
> + b_dev_info.inode = alloc_anon_inode_sb(balloon_mnt->mnt_sb);
>   if (IS_ERR(b_dev_info.inode)) {
>   rc = PTR_ERR(b_dev_info.inode);
>   b_dev_info.inode = NULL;
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index f264b70c383eb4..dedcc9483352dc 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -445,7 +445,7 @@ static inline int is_dma_buf_file(struct file *file)
>  static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags)
>  {
>   struct file *file;
> - struct inode *inode = alloc_anon_inode(dma_buf_mnt->mnt_sb);
> + struct inode *inode = alloc_anon_inode_sb(dma_buf_mnt->mnt_sb);
>  
>   if (IS_ERR(inode))
>   return ERR_CAST(inode);
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 20d22e41d7ce74..87e7214a8e3565 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -519,7 +519,7 @@ static struct inode *drm_fs_inode_new(void)
>   return ERR_PTR(r);
>   }
>  
> - inode = alloc_anon_inode(drm_fs_mnt->mnt_sb);
> + inode = alloc_anon_inode_sb(drm_fs_mnt->mnt_sb);
>   if (IS_ERR(inode))
>   simple_release_fs(&drm_fs_mnt, &drm_fs_cnt);
>  
> diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
> index b493de962153ba..2efbf6c98028ef 100644
> --- a/drivers/misc/cxl/api.c
> +++ b/drivers/misc/cxl/api.c
> @@ -73,7 +73,7 @@ static struct file *cxl_getfile(const char *name,
>   goto err_module;
>   }
>  
> - inode = alloc_anon_inode(cxl_vfs_mount->mnt_sb);
> + inode = alloc_anon_inode_sb(cxl_vfs_mount->mnt_sb);
>   if (IS_ERR(inode)) {
>   file = ERR_CAST(inode);
>   goto err_fs;
> diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
> index b837e7eba5f7dc..5d057a05ddbee8 100644
> --- a/drivers/misc/vmw_balloon.c
> +++ b/drivers/misc/vmw_balloon.c
> @@ -1900,7 +1900,7 @@ static __init int vmballoon_compaction_init(struct 
> vmballoon *b)
>   return PTR_ERR(vmballoon_mnt);
>  
>   b->b_dev_info.migratepage = vmballoon_migratepage;
> - b->b_dev_info.inode = alloc_anon_inode(vmballoon_mnt->mnt_sb);
> + b->b_dev_info.inode = alloc_anon_inode_sb(vmballoon_mnt->mnt_sb);
>  
>   if (IS_ERR(b->b_dev_info.inode))
>   return PTR_ERR(b->b_dev_info.inode);
> diff --git a/drivers/scsi/cxlflash/ocxl_hw.c b/drivers/scsi/cxlflash/ocxl_hw.c
> index 244fc27215dc79..40184ed926b557 100644
> --- a/drivers/scsi/cxlflash/ocxl_hw.c
> +++ b/drivers/scsi/cxlflash/ocxl_hw.c
> @@ -88,7 +88,7 @@ static struct file *ocxlflash_getfile(struct device *dev, 
> const char *name,
>   goto err2;
>   }
>  
> - inode = alloc_anon_inode(ocxlflash_vfs_mount->mnt_sb);
> + inode = alloc_anon_inode_sb(ocxlflash_vfs_mount->mnt_sb);
>   if (IS_ERR(inode)) {
>   rc = PTR_ERR(inode);
>   dev_err(dev, "%s: alloc_anon_inode failed rc=%d\n",
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 8985fc2cea8615..cae76ee5bdd688 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -916,7 +916,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
>   }
>  
>   vb->vb_dev_info.migratepage = virtballoon_migratepage;
> - vb->vb_dev_info.inode = alloc_anon_inode(balloon_mnt->mnt_sb);
> + vb->vb_dev_info.inode = alloc_anon_inode_sb(balloon_mnt->mnt_sb);
>   if (IS_ERR(vb->vb_dev_info.inode)) {
>   err = PTR_ERR(vb->vb_dev_info.inode);
>   goto 

Re: [PATCH 11/29] mm: only allow page table mappings for built-in zsmalloc

2020-04-16 Thread Minchan Kim
On Tue, Apr 14, 2020 at 03:13:30PM +0200, Christoph Hellwig wrote:
> This allows to unexport map_vm_area and unmap_kernel_range, which are
> rather deep internal and should not be available to modules, as they for
> example allow fine grained control of mapping permissions, and also
> allow splitting the setup of a vmalloc area and the actual mapping and
> thus expose vmalloc internals.
> 
> zsmalloc is typically built-in and continues to work (just like the
> percpu-vm code using a similar patter), while modular zsmalloc also
> continues to work, but must use copies.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Peter Zijlstra (Intel) 
Acked-by: Minchan Kim 

Thanks!


Re: [PATCH 10/28] mm: only allow page table mappings for built-in zsmalloc

2020-04-16 Thread Minchan Kim
Hi Christoph,


Sorry for the late reply.

On Sat, Apr 11, 2020 at 09:20:52AM +0200, Christoph Hellwig wrote:
> Hi Minchan,
> 
> On Fri, Apr 10, 2020 at 04:11:36PM -0700, Minchan Kim wrote:
> > It doesn't mean we couldn't use zsmalloc as module any longer. It means
> > we couldn't use zsmalloc as module with pgtable mapping whcih was little
> > bit faster on microbenchmark in some architecutre(However, I usually temped
> > to remove it since it had several problems). However, we could still use
> > zsmalloc as module as copy way instead of pgtable mapping. Thus, if someone
> > really want to rollback the feature, they should provide reasonable reason
> > why it doesn't work for them. "A little fast" wouldn't be enough to exports
> > deep internal to the module.
> 
> do you have any data how much faster it is on arm (and does that include
> arm64 as well)?  Besides the exports which were my prime concern,

https://github.com/sjenning/zsmapbench

I need to refresh my memory. IIRC, it was almost 30% faster on ARM at the
time, so it was not trivial. However, that was several years ago.

> zsmalloc with pgtable mappings also is the only user of map_kernel_range
> outside of vmalloc.c, if it really is another code base for tiny
> improvements we could mark map_kernel_range or in fact remove it entirely
> and open code it in the remaining callers.

I have also been tempted to remove it. Let me find some time to revisit it
as part of this work.

Thanks.


Re: [PATCH 10/28] mm: only allow page table mappings for built-in zsmalloc

2020-04-10 Thread Minchan Kim
Hi Sergey,

On Fri, Apr 10, 2020 at 11:38:45AM +0900, Sergey Senozhatsky wrote:
> On (20/04/09 10:08), Minchan Kim wrote:
> > > > Even though I don't know how many usecase we have using zsmalloc as
> > > > module(I heard only once by dumb reason), it could affect existing
> > > > users. Thus, please include concrete explanation in the patch to
> > > > justify when the complain occurs.
> > > 
> > > The justification is 'we can unexport functions that have no sane reason
> > > of being exported in the first place'.
> > > 
> > > The Changelog pretty much says that.
> > 
> > Okay, I hope there is no affected user since this patch.
> > If there are someone, they need to provide sane reason why they want
> > to have zsmalloc as module.
> 
> I'm one of those who use zsmalloc as a module - mainly because I use zram
> as a compressing general purpose block device, not as a swap device.
> I create zram0, mkfs, mount, checkout and compile code, once done -
> umount, rmmod. This reduces the number of writes to SSD. Some people use
> tmpfs, but zram device(-s) can be much larger in size. That's a niche use
> case and I'm not against the patch.

It doesn't mean we couldn't use zsmalloc as a module any longer. It means
we couldn't use zsmalloc as a module with pgtable mapping, which was a little
bit faster in microbenchmarks on some architectures (however, I have often
been tempted to remove it since it had several problems). We could still use
zsmalloc as a module with the copy-based mapping instead of pgtable mapping.
Thus, if someone really wants to roll back this change, they should provide a
reasonable explanation of why it doesn't work for them. "A little faster"
wouldn't be enough to export deep internals to a module.

Thanks.
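
For context, here is a minimal userspace sketch (illustrative only, not the
zsmalloc code) of what the copy-based object mapping does when an object
straddles two non-contiguous pages. The page-table (vmap-style) variant avoids
the two memcpy()s by mapping both pages virtually contiguously, which is what
needs the unexported vmalloc internals:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Copy-based mapping: works for a module, but costs two memcpy()s per
 * map/unmap of an object that spans two pages. */
static void *map_object_by_copy(const char *page_a, const char *page_b,
				size_t off, size_t len, char *bounce)
{
	size_t first = PAGE_SIZE - off;		/* bytes living in page_a */

	memcpy(bounce, page_a + off, first);	/* tail of the first page */
	memcpy(bounce + first, page_b, len - first); /* head of the second */
	return bounce;
}

int main(void)
{
	char *a = malloc(PAGE_SIZE), *b = malloc(PAGE_SIZE), bounce[128];

	memset(a, 'A', PAGE_SIZE);
	memset(b, 'B', PAGE_SIZE);

	/* A 64-byte object starting 32 bytes before the end of page 'a'. */
	char *obj = map_object_by_copy(a, b, PAGE_SIZE - 32, 64, bounce);
	printf("%.32s|%.32s\n", obj, obj + 32);	/* AAA...|BBB... */

	free(a);
	free(b);
	return 0;
}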


Re: [PATCH 10/28] mm: only allow page table mappings for built-in zsmalloc

2020-04-09 Thread Minchan Kim
On Thu, Apr 09, 2020 at 06:50:30PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 09, 2020 at 09:08:26AM -0700, Minchan Kim wrote:
> > On Wed, Apr 08, 2020 at 01:59:08PM +0200, Christoph Hellwig wrote:
> > > This allows to unexport map_vm_area and unmap_kernel_range, which are
> > > rather deep internal and should not be available to modules.
> > 
> > Even though I don't know how many usecase we have using zsmalloc as
> > module(I heard only once by dumb reason), it could affect existing
> > users. Thus, please include concrete explanation in the patch to
> > justify when the complain occurs.
> 
> The justification is 'we can unexport functions that have no sane reason
> of being exported in the first place'.
> 
> The Changelog pretty much says that.

Okay, I hope no user is affected by this patch.
If there is someone, they need to provide a sane reason why they want
to keep zsmalloc as a module.

Acked-by: Minchan Kim 


Re: [PATCH 10/28] mm: only allow page table mappings for built-in zsmalloc

2020-04-09 Thread Minchan Kim
On Wed, Apr 08, 2020 at 01:59:08PM +0200, Christoph Hellwig wrote:
> This allows to unexport map_vm_area and unmap_kernel_range, which are
> rather deep internal and should not be available to modules.

Even though I don't know how many use cases we have for zsmalloc as a
module (I have heard of only one, and for a dumb reason), it could affect
existing users. Thus, please include a concrete explanation in the patch to
justify it when a complaint occurs.

> 
> Signed-off-by: Christoph Hellwig 
> ---
>  mm/Kconfig   | 2 +-
>  mm/vmalloc.c | 2 --
>  2 files changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 36949a9425b8..614cc786b519 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -702,7 +702,7 @@ config ZSMALLOC
>  
>  config ZSMALLOC_PGTABLE_MAPPING
>   bool "Use page table mapping to access object in zsmalloc"
> - depends on ZSMALLOC
> + depends on ZSMALLOC=y
>   help
> By default, zsmalloc uses a copy-based object mapping method to
> access allocations that span two pages. However, if a particular
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 3375f9508ef6..9183fc0d365a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2046,7 +2046,6 @@ void unmap_kernel_range(unsigned long addr, unsigned 
> long size)
>   vunmap_page_range(addr, end);
>   flush_tlb_kernel_range(addr, end);
>  }
> -EXPORT_SYMBOL_GPL(unmap_kernel_range);
>  
>  int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages)
>  {
> @@ -2058,7 +2057,6 @@ int map_vm_area(struct vm_struct *area, pgprot_t prot, 
> struct page **pages)
>  
>   return err > 0 ? 0 : err;
>  }
> -EXPORT_SYMBOL_GPL(map_vm_area);
>  
>  static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
>   struct vmap_area *va, unsigned long flags, const void *caller)
> -- 
> 2.25.1
> 


Re: [PATCH 09/28] mm: rename CONFIG_PGTABLE_MAPPING to CONFIG_ZSMALLOC_PGTABLE_MAPPING

2020-04-09 Thread Minchan Kim
On Wed, Apr 08, 2020 at 01:59:07PM +0200, Christoph Hellwig wrote:
> Rename the Kconfig variable to clarify the scope.
> 
> Signed-off-by: Christoph Hellwig 
Acked-by: Minchan Kim 


Re: [PATCH v10 12/25] mm: cache some VMA fields in the vm_fault structure

2018-05-08 Thread Minchan Kim
On Fri, May 04, 2018 at 11:10:54AM +0200, Laurent Dufour wrote:
> On 03/05/2018 17:42, Minchan Kim wrote:
> > On Thu, May 03, 2018 at 02:25:18PM +0200, Laurent Dufour wrote:
> >> On 23/04/2018 09:42, Minchan Kim wrote:
> >>> On Tue, Apr 17, 2018 at 04:33:18PM +0200, Laurent Dufour wrote:
> >>>> When handling speculative page fault, the vma->vm_flags and
> >>>> vma->vm_page_prot fields are read once the page table lock is released. 
> >>>> So
> >>>> there is no more guarantee that these fields would not change in our back
> >>>> They will be saved in the vm_fault structure before the VMA is checked 
> >>>> for
> >>>> changes.
> >>>
> >>> Sorry. I cannot understand.
> >>> If it is changed under us, what happens? If it's critical, why cannot we
> >>> check with seqcounter?
> >>> Clearly, I'm not understanding the logic here. However, it's a global
> >>> change without CONFIG_SPF so I want to be more careful.
> >>> It would be better to describe why we need to sanpshot those values
> >>> into vm_fault rather than preventing the race.
> >>
> >> The idea is to go forward processing the page fault using the VMA's fields
> >> values saved in the vm_fault structure. Then once the pte are locked, the
> >> vma->sequence_counter is checked again and if something has changed in our 
> >> back
> >> the speculative page fault processing is aborted.
> > 
> > Sorry, still I don't understand why we should capture some fields to 
> > vm_fault.
> > If we found vma->seq_cnt is changed under pte lock, can't we just bail out 
> > and
> > fallback to classic fault handling?
> > 
> > Maybe, I'm missing something clear now. It would be really helpful to 
> > understand
> > if you give some exmaple.
> 
> I'd rather say that I was not clear enough ;)
> 
> Here is the point, when we deal with a speculative page fault, the mmap_sem is
> not taken, so parallel VMA's changes can occurred. When a VMA change is done
> which will impact the page fault processing, we assumed that the VMA sequence
> counter will be changed.
> 
> In the page fault processing, at the time the PTE is locked, we checked the 
> VMA
> sequence counter to detect changes done in our back. If no change is detected
> we can continue further. But this doesn't prevent the VMA to not be changed in
> our back while the PTE is locked. So VMA's fields which are used while the PTE
> is locked must be saved to ensure that we are using *static* values.
> This is important since the PTE changes will be made on regards to these VMA
> fields and they need to be consistent. This concerns the vma->vm_flags and
> vma->vm_page_prot VMA fields.
> 
> I hope I make this clear enough this time.

It's much clearer at this point. Please include such a nice explanation in
the description.
Now I am wondering how you synchronize those static values with the VMA's
seqcount. That must come later in the patchset. I hope to grab some time to
read it, asap.

Thanks.
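
A rough sketch of the pattern described above, for readers following the
thread (illustrative only: the helper name is hypothetical and this is not
the code from the series; it assumes the vm_sequence seqcount and the
vmf->vma_flags snapshot field introduced elsewhere in this patchset):

#include <linux/mm.h>
#include <linux/mm_types.h>
#include <linux/seqlock.h>

/* Speculative path: snapshot the VMA fields we will rely on, do the work,
 * then confirm the VMA was not modified behind our back.  If it was, the
 * caller falls back to the classic, mmap_sem-protected page fault. */
static bool spf_snapshot_and_check(struct vm_area_struct *vma,
				   struct vm_fault *vmf)
{
	unsigned int seq;

	seq = read_seqcount_begin(&vma->vm_sequence);

	/* Values used to build the PTE must stay consistent, hence the
	 * snapshot into vm_fault rather than re-reading vma->vm_flags
	 * later without any protection. */
	vmf->vma_flags = READ_ONCE(vma->vm_flags);

	/* ... speculative fault processing, PTE lock taken here ... */

	/* A changed sequence means vma_adjust()/unmap_page_range() ran
	 * concurrently, so the snapshot cannot be trusted. */
	return !read_seqcount_retry(&vma->vm_sequence, seq);
}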


Re: [PATCH v10 12/25] mm: cache some VMA fields in the vm_fault structure

2018-05-03 Thread Minchan Kim
On Thu, May 03, 2018 at 02:25:18PM +0200, Laurent Dufour wrote:
> On 23/04/2018 09:42, Minchan Kim wrote:
> > On Tue, Apr 17, 2018 at 04:33:18PM +0200, Laurent Dufour wrote:
> >> When handling speculative page fault, the vma->vm_flags and
> >> vma->vm_page_prot fields are read once the page table lock is released. So
> >> there is no more guarantee that these fields would not change in our back
> >> They will be saved in the vm_fault structure before the VMA is checked for
> >> changes.
> > 
> > Sorry. I cannot understand.
> > If it is changed under us, what happens? If it's critical, why cannot we
> > check with seqcounter?
> > Clearly, I'm not understanding the logic here. However, it's a global
> > change without CONFIG_SPF so I want to be more careful.
> > It would be better to describe why we need to sanpshot those values
> > into vm_fault rather than preventing the race.
> 
> The idea is to go forward processing the page fault using the VMA's fields
> values saved in the vm_fault structure. Then once the pte are locked, the
> vma->sequence_counter is checked again and if something has changed in our 
> back
> the speculative page fault processing is aborted.

Sorry, I still don't understand why we should capture some fields into
vm_fault. If we find vma->seq_cnt has changed under the pte lock, can't we
just bail out and fall back to classic fault handling?

Maybe I'm missing something obvious here. It would be really helpful for my
understanding if you could give some example.

Thanks.

> 
> Thanks,
> Laurent.
> 
> 
> > 
> > Thanks.
> > 
> >>
> >> This patch also set the fields in hugetlb_no_page() and
> >> __collapse_huge_page_swapin even if it is not need for the callee.
> >>
> >> Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
> >> ---
> >>  include/linux/mm.h | 10 --
> >>  mm/huge_memory.c   |  6 +++---
> >>  mm/hugetlb.c   |  2 ++
> >>  mm/khugepaged.c|  2 ++
> >>  mm/memory.c| 50 ++
> >>  mm/migrate.c   |  2 +-
> >>  6 files changed, 42 insertions(+), 30 deletions(-)
> >>
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index f6edd15563bc..c65205c8c558 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -367,6 +367,12 @@ struct vm_fault {
> >> * page table to avoid allocation from
> >> * atomic context.
> >> */
> >> +  /*
> >> +   * These entries are required when handling speculative page fault.
> >> +   * This way the page handling is done using consistent field values.
> >> +   */
> >> +  unsigned long vma_flags;
> >> +  pgprot_t vma_page_prot;
> >>  };
> >>  
> >>  /* page entry size for vm->huge_fault() */
> >> @@ -687,9 +693,9 @@ void free_compound_page(struct page *page);
> >>   * pte_mkwrite.  But get_user_pages can cause write faults for mappings
> >>   * that do not have writing enabled, when used by access_process_vm.
> >>   */
> >> -static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> >> +static inline pte_t maybe_mkwrite(pte_t pte, unsigned long vma_flags)
> >>  {
> >> -  if (likely(vma->vm_flags & VM_WRITE))
> >> +  if (likely(vma_flags & VM_WRITE))
> >>pte = pte_mkwrite(pte);
> >>return pte;
> >>  }
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index a3a1815f8e11..da2afda67e68 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -1194,8 +1194,8 @@ static int do_huge_pmd_wp_page_fallback(struct 
> >> vm_fault *vmf, pmd_t orig_pmd,
> >>  
> >>for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> >>pte_t entry;
> >> -  entry = mk_pte(pages[i], vma->vm_page_prot);
> >> -  entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> >> +  entry = mk_pte(pages[i], vmf->vma_page_prot);
> >> +  entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
> >>memcg = (void *)page_private(pages[i]);
> >>set_page_private(pages[i], 0);
> >>page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
> >> @@ -2168,7 +2168,7 @@ static void __split_huge_pmd_locked(struct 
> >> vm_area_str

Re: [PATCH v10 08/25] mm: VMA sequence count

2018-05-01 Thread Minchan Kim
On Mon, Apr 30, 2018 at 05:14:27PM +0200, Laurent Dufour wrote:
> 
> 
> On 23/04/2018 08:42, Minchan Kim wrote:
> > On Tue, Apr 17, 2018 at 04:33:14PM +0200, Laurent Dufour wrote:
> >> From: Peter Zijlstra <pet...@infradead.org>
> >>
> >> Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
> >> counts such that we can easily test if a VMA is changed.
> > 
> > So, seqcount is to protect modifying all attributes of vma?
> 
> The seqcount is used to protect fields that will be used during the 
> speculative
> page fault like boundaries, protections.

"a VMA is changed" was rather vague to me at this point.
If you could specify the detailed fields, or give some example of what the
seqcount aims to protect, it would help the review.

> 
> >>
> >> The unmap_page_range() one allows us to make assumptions about
> >> page-tables; when we find the seqcount hasn't changed we can assume
> >> page-tables are still valid.
> > 
> > Hmm, seqcount covers page-table, too.
> > Please describe what the seqcount want to protect.
> 
> The calls to vm_write_begin/end() in unmap_page_range() are used to detect 
> when
> a VMA is being unmap and thus that new page fault should not be satisfied for
> this VMA. This is protecting the VMA unmapping operation, not the page tables
> themselves.

Thanks for the detail. Yes, please include this phrasing instead of "page-tables
are still valid". The latter confused me.

> 
> >>
> >> The flip side is that we cannot distinguish between a vma_adjust() and
> >> the unmap_page_range() -- where with the former we could have
> >> re-checked the vma bounds against the address.
> >>
> >> Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
> >>
> >> [Port to 4.12 kernel]
> >> [Build depends on CONFIG_SPECULATIVE_PAGE_FAULT]
> >> [Introduce vm_write_* inline function depending on
> >>  CONFIG_SPECULATIVE_PAGE_FAULT]
> >> [Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence by
> >>  using vm_raw_write* functions]
> >> [Fix a lock dependency warning in mmap_region() when entering the error
> >>  path]
> >> [move sequence initialisation INIT_VMA()]
> >> Signed-off-by: Laurent Dufour <lduf...@linux.vnet.ibm.com>
> >> ---
> >>  include/linux/mm.h   | 44 
> >>  include/linux/mm_types.h |  3 +++
> >>  mm/memory.c  |  2 ++
> >>  mm/mmap.c| 31 +++
> >>  4 files changed, 80 insertions(+)
> >>
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index efc1248b82bd..988daf7030c9 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -1264,6 +1264,9 @@ struct zap_details {
> >>  static inline void INIT_VMA(struct vm_area_struct *vma)
> >>  {
> >>  INIT_LIST_HEAD(&vma->anon_vma_chain);
> >> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> >> +  seqcount_init(&vma->vm_sequence);
> >> +#endif
> >>  }
> >>  
> >>  struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long 
> >> addr,
> >> @@ -1386,6 +1389,47 @@ static inline void 
> >> unmap_shared_mapping_range(struct address_space *mapping,
> >>unmap_mapping_range(mapping, holebegin, holelen, 0);
> >>  }
> >>  
> >> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> >> +static inline void vm_write_begin(struct vm_area_struct *vma)
> >> +{
> >> +  write_seqcount_begin(&vma->vm_sequence);
> >> +}
> >> +static inline void vm_write_begin_nested(struct vm_area_struct *vma,
> >> +   int subclass)
> >> +{
> >> +  write_seqcount_begin_nested(&vma->vm_sequence, subclass);
> >> +}
> >> +static inline void vm_write_end(struct vm_area_struct *vma)
> >> +{
> >> +  write_seqcount_end(&vma->vm_sequence);
> >> +}
> >> +static inline void vm_raw_write_begin(struct vm_area_struct *vma)
> >> +{
> >> +  raw_write_seqcount_begin(&vma->vm_sequence);
> >> +}
> >> +static inline void vm_raw_write_end(struct vm_area_struct *vma)
> >> +{
> >> +  raw_write_seqcount_end(&vma->vm_sequence);
> >> +}
> >> +#else
> >> +static inline void vm_write_begin(struct vm_area_struct *vma)
> >> +{
> >> +}
> >> +static inline void vm_write_begin_nested(struct vm_area_struct *vma,
> >> +

Re: [PATCH v10 06/25] mm: make pte_unmap_same compatible with SPF

2018-05-01 Thread Minchan Kim
On Mon, Apr 30, 2018 at 04:07:30PM +0200, Laurent Dufour wrote:
> On 23/04/2018 08:31, Minchan Kim wrote:
> > On Tue, Apr 17, 2018 at 04:33:12PM +0200, Laurent Dufour wrote:
> >> pte_unmap_same() is making the assumption that the page table are still
> >> around because the mmap_sem is held.
> >> This is no more the case when running a speculative page fault and
> >> additional check must be made to ensure that the final page table are still
> >> there.
> >>
> >> This is now done by calling pte_spinlock() to check for the VMA's
> >> consistency while locking for the page tables.
> >>
> >> This is requiring passing a vm_fault structure to pte_unmap_same() which is
> >> containing all the needed parameters.
> >>
> >> As pte_spinlock() may fail in the case of a speculative page fault, if the
> >> VMA has been touched in our back, pte_unmap_same() should now return 3
> >> cases :
> >>1. pte are the same (0)
> >>2. pte are different (VM_FAULT_PTNOTSAME)
> >>3. a VMA's changes has been detected (VM_FAULT_RETRY)
> >>
> >> The case 2 is handled by the introduction of a new VM_FAULT flag named
> >> VM_FAULT_PTNOTSAME which is then trapped in cow_user_page().
> > 
> > I don't see such logic in this patch.
> > Maybe you introduces it later? If so, please comment on it.
> > Or just return 0 in case of 2 without introducing VM_FAULT_PTNOTSAME.
> 
> Late in the series, pte_spinlock() will check for the VMA's changes and may
> return 1. This will be then required to handle the 3 cases presented above.
> 
> I can move this handling later in the series, but I wondering if this will 
> make
> it more easier to read.

Just a nit:
While reviewing this patch, I was curious that you introduced a new thing
here but I couldn't find any site where it is used. That makes review hard. :(
That's why I asked you to please add a comment about it if you are going to
use the new thing later in this series.
If you think as-is is better for review, it would be better to mention that
explicitly.

Thanks.


Re: [PATCH v10 12/25] mm: cache some VMA fields in the vm_fault structure

2018-04-23 Thread Minchan Kim
On Tue, Apr 17, 2018 at 04:33:18PM +0200, Laurent Dufour wrote:
> When handling speculative page fault, the vma->vm_flags and
> vma->vm_page_prot fields are read once the page table lock is released. So
> there is no more guarantee that these fields would not change in our back.
> They will be saved in the vm_fault structure before the VMA is checked for
> changes.

Sorry, I cannot understand.
If it is changed under us, what happens? If it's critical, why can't we
check it with the seqcounter?
Clearly, I'm not understanding the logic here. However, it's a global
change even without CONFIG_SPF, so I want to be more careful.
It would be better to describe why we need to snapshot those values
into vm_fault rather than prevent the race.

Thanks.

> 
> This patch also set the fields in hugetlb_no_page() and
> __collapse_huge_page_swapin even if it is not need for the callee.
> 
> Signed-off-by: Laurent Dufour 
> ---
>  include/linux/mm.h | 10 --
>  mm/huge_memory.c   |  6 +++---
>  mm/hugetlb.c   |  2 ++
>  mm/khugepaged.c|  2 ++
>  mm/memory.c| 50 ++
>  mm/migrate.c   |  2 +-
>  6 files changed, 42 insertions(+), 30 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f6edd15563bc..c65205c8c558 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -367,6 +367,12 @@ struct vm_fault {
>* page table to avoid allocation from
>* atomic context.
>*/
> + /*
> +  * These entries are required when handling speculative page fault.
> +  * This way the page handling is done using consistent field values.
> +  */
> + unsigned long vma_flags;
> + pgprot_t vma_page_prot;
>  };
>  
>  /* page entry size for vm->huge_fault() */
> @@ -687,9 +693,9 @@ void free_compound_page(struct page *page);
>   * pte_mkwrite.  But get_user_pages can cause write faults for mappings
>   * that do not have writing enabled, when used by access_process_vm.
>   */
> -static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +static inline pte_t maybe_mkwrite(pte_t pte, unsigned long vma_flags)
>  {
> - if (likely(vma->vm_flags & VM_WRITE))
> + if (likely(vma_flags & VM_WRITE))
>   pte = pte_mkwrite(pte);
>   return pte;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a3a1815f8e11..da2afda67e68 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1194,8 +1194,8 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault 
> *vmf, pmd_t orig_pmd,
>  
>   for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
>   pte_t entry;
> - entry = mk_pte(pages[i], vma->vm_page_prot);
> - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + entry = mk_pte(pages[i], vmf->vma_page_prot);
> + entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
>   memcg = (void *)page_private(pages[i]);
>   set_page_private(pages[i], 0);
>   page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
> @@ -2168,7 +2168,7 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   entry = pte_swp_mksoft_dirty(entry);
>   } else {
>   entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
> - entry = maybe_mkwrite(entry, vma);
> + entry = maybe_mkwrite(entry, vma->vm_flags);
>   if (!write)
>   entry = pte_wrprotect(entry);
>   if (!young)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 218679138255..774864153407 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3718,6 +3718,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   .vma = vma,
>   .address = address,
>   .flags = flags,
> + .vma_flags = vma->vm_flags,
> + .vma_page_prot = vma->vm_page_prot,
>   /*
>* Hard to debug if it ends up being
>* used by a callee that assumes
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 0b28af4b950d..2b02a9f9589e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -887,6 +887,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct 
> *mm,
>   .flags = FAULT_FLAG_ALLOW_RETRY,
>   .pmd = pmd,
>   .pgoff = linear_page_index(vma, address),
> + .vma_flags = vma->vm_flags,
> + .vma_page_prot = vma->vm_page_prot,
>   };
>  
>   /* we only decide to swapin, if there is enough young ptes */
> diff --git 

Re: [PATCH v10 09/25] mm: protect VMA modifications using VMA sequence count

2018-04-23 Thread Minchan Kim
On Tue, Apr 17, 2018 at 04:33:15PM +0200, Laurent Dufour wrote:
> The VMA sequence count has been introduced to allow fast detection of
> VMA modification when running a page fault handler without holding
> the mmap_sem.
> 
> This patch provides protection against the VMA modification done in :
>   - madvise()
>   - mpol_rebind_policy()
>   - vma_replace_policy()
>   - change_prot_numa()
>   - mlock(), munlock()
>   - mprotect()
>   - mmap_region()
>   - collapse_huge_page()
>   - userfaultd registering services
> 
> In addition, VMA fields which will be read during the speculative fault
> path needs to be written using WRITE_ONCE to prevent write to be split
> and intermediate values to be pushed to other CPUs.
> 
> Signed-off-by: Laurent Dufour 
> ---
>  fs/proc/task_mmu.c |  5 -
>  fs/userfaultfd.c   | 17 +
>  mm/khugepaged.c|  3 +++
>  mm/madvise.c   |  6 +-
>  mm/mempolicy.c | 51 ++-
>  mm/mlock.c | 13 -
>  mm/mmap.c  | 22 +-
>  mm/mprotect.c  |  4 +++-
>  mm/swap_state.c|  8 ++--
>  9 files changed, 89 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index c486ad4b43f0..aeb417f28839 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1136,8 +1136,11 @@ static ssize_t clear_refs_write(struct file *file, 
> const char __user *buf,
>   goto out_mm;
>   }
>   for (vma = mm->mmap; vma; vma = vma->vm_next) {
> - vma->vm_flags &= ~VM_SOFTDIRTY;
> + vm_write_begin(vma);
> + WRITE_ONCE(vma->vm_flags,
> +vma->vm_flags & 
> ~VM_SOFTDIRTY);
>   vma_set_page_prot(vma);
> + vm_write_end(vma);

Trivial:

I think it's tricky to maintain the rule that VMA fields read during SPF must
be accessed with READ_ONCE/WRITE_ONCE. I think we need some accessors to
read/write them rather than raw accesses like in vma_set_page_prot. Maybe an
spf prefix would be helpful.

vma_spf_set_value(vma, vm_flags, val);

We could also add some markers to vm_area_struct's fields to indicate that
people shouldn't access those fields directly.

Just a thought.
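
A minimal sketch of what such an accessor could look like (purely
illustrative; the name just follows the suggestion above and is not part of
the posted series):

/* Hypothetical helper: hide the vm_write_begin()/WRITE_ONCE()/vm_write_end()
 * sequence so a caller cannot forget one of the three steps when updating a
 * VMA field that the speculative path reads. */
#define vma_spf_set_value(vma, field, val)		\
	do {						\
		vm_write_begin(vma);			\
		WRITE_ONCE((vma)->field, (val));	\
		vm_write_end(vma);			\
	} while (0)

Call sites that need to do more under the write section (such as the
vma_set_page_prot() call in the hunk above) would still use the open-coded
begin/end pair.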


>   }
>   downgrade_write(&mm->mmap_sem);


> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index fe079756bb18..8a8a402ed59f 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -575,6 +575,10 @@ static unsigned long swapin_nr_pages(unsigned long 
> offset)
>   * the readahead.
>   *
>   * Caller must hold down_read on the vma->vm_mm if vmf->vma is not NULL.
> + * This is needed to ensure the VMA will not be freed in our back. In the 
> case
> + * of the speculative page fault handler, this cannot happen, even if we 
> don't
> + * hold the mmap_sem. Callees are assumed to take care of reading VMA's 
> fields

I guess a reader would be curious about *why* this is safe with SPF.
A comment explaining the why would be helpful for reviewers.

> + * using READ_ONCE() to read consistent values.
>   */
>  struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
>   struct vm_fault *vmf)
> @@ -668,9 +672,9 @@ static inline void swap_ra_clamp_pfn(struct 
> vm_area_struct *vma,
>unsigned long *start,
>unsigned long *end)
>  {
> - *start = max3(lpfn, PFN_DOWN(vma->vm_start),
> + *start = max3(lpfn, PFN_DOWN(READ_ONCE(vma->vm_start)),
> PFN_DOWN(faddr & PMD_MASK));
> - *end = min3(rpfn, PFN_DOWN(vma->vm_end),
> + *end = min3(rpfn, PFN_DOWN(READ_ONCE(vma->vm_end)),
>   PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE));
>  }
>  
> -- 
> 2.7.4
> 


Re: [PATCH v10 08/25] mm: VMA sequence count

2018-04-23 Thread Minchan Kim
On Tue, Apr 17, 2018 at 04:33:14PM +0200, Laurent Dufour wrote:
> From: Peter Zijlstra 
> 
> Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
> counts such that we can easily test if a VMA is changed.

So, is the seqcount meant to protect modification of all attributes of the VMA?

> 
> The unmap_page_range() one allows us to make assumptions about
> page-tables; when we find the seqcount hasn't changed we can assume
> page-tables are still valid.

Hmm, so the seqcount covers the page tables, too.
Please describe what the seqcount is intended to protect.

> 
> The flip side is that we cannot distinguish between a vma_adjust() and
> the unmap_page_range() -- where with the former we could have
> re-checked the vma bounds against the address.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> 
> [Port to 4.12 kernel]
> [Build depends on CONFIG_SPECULATIVE_PAGE_FAULT]
> [Introduce vm_write_* inline function depending on
>  CONFIG_SPECULATIVE_PAGE_FAULT]
> [Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence by
>  using vm_raw_write* functions]
> [Fix a lock dependency warning in mmap_region() when entering the error
>  path]
> [move sequence initialisation INIT_VMA()]
> Signed-off-by: Laurent Dufour 
> ---
>  include/linux/mm.h   | 44 
>  include/linux/mm_types.h |  3 +++
>  mm/memory.c  |  2 ++
>  mm/mmap.c| 31 +++
>  4 files changed, 80 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index efc1248b82bd..988daf7030c9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1264,6 +1264,9 @@ struct zap_details {
>  static inline void INIT_VMA(struct vm_area_struct *vma)
>  {
>   INIT_LIST_HEAD(&vma->anon_vma_chain);
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + seqcount_init(&vma->vm_sequence);
> +#endif
>  }
>  
>  struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> @@ -1386,6 +1389,47 @@ static inline void unmap_shared_mapping_range(struct 
> address_space *mapping,
>   unmap_mapping_range(mapping, holebegin, holelen, 0);
>  }
>  
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +static inline void vm_write_begin(struct vm_area_struct *vma)
> +{
> + write_seqcount_begin(&vma->vm_sequence);
> +}
> +static inline void vm_write_begin_nested(struct vm_area_struct *vma,
> +  int subclass)
> +{
> + write_seqcount_begin_nested(&vma->vm_sequence, subclass);
> +}
> +static inline void vm_write_end(struct vm_area_struct *vma)
> +{
> + write_seqcount_end(&vma->vm_sequence);
> +}
> +static inline void vm_raw_write_begin(struct vm_area_struct *vma)
> +{
> + raw_write_seqcount_begin(&vma->vm_sequence);
> +}
> +static inline void vm_raw_write_end(struct vm_area_struct *vma)
> +{
> + raw_write_seqcount_end(&vma->vm_sequence);
> +}
> +#else
> +static inline void vm_write_begin(struct vm_area_struct *vma)
> +{
> +}
> +static inline void vm_write_begin_nested(struct vm_area_struct *vma,
> +  int subclass)
> +{
> +}
> +static inline void vm_write_end(struct vm_area_struct *vma)
> +{
> +}
> +static inline void vm_raw_write_begin(struct vm_area_struct *vma)
> +{
> +}
> +static inline void vm_raw_write_end(struct vm_area_struct *vma)
> +{
> +}
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> +
>  extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
>   void *buf, int len, unsigned int gup_flags);
>  extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 21612347d311..db5e9d630e7a 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -335,6 +335,9 @@ struct vm_area_struct {
>   struct mempolicy *vm_policy;/* NUMA policy for the VMA */
>  #endif
>   struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> + seqcount_t vm_sequence;
> +#endif
>  } __randomize_layout;
>  
>  struct core_thread {
> diff --git a/mm/memory.c b/mm/memory.c
> index f86efcb8e268..f7fed053df80 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1503,6 +1503,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>   unsigned long next;
>  
>   BUG_ON(addr >= end);

A comment saying that this aims for page-table stability would help here.

> + vm_write_begin(vma);
>   tlb_start_vma(tlb, vma);
>   pgd = pgd_offset(vma->vm_mm, addr);
>   do {
> @@ -1512,6 +1513,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>   next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
>   } while (pgd++, addr = next, addr != end);
>   tlb_end_vma(tlb, vma);
> + vm_write_end(vma);
>  }
>  
>  
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8bd9ae1dfacc..813e49589ea1 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -692,6 +692,30 @@ int __vma_adjust(struct 

Re: [PATCH v10 06/25] mm: make pte_unmap_same compatible with SPF

2018-04-23 Thread Minchan Kim
On Tue, Apr 17, 2018 at 04:33:12PM +0200, Laurent Dufour wrote:
> pte_unmap_same() is making the assumption that the page table are still
> around because the mmap_sem is held.
> This is no more the case when running a speculative page fault and
> additional check must be made to ensure that the final page table are still
> there.
> 
> This is now done by calling pte_spinlock() to check for the VMA's
> consistency while locking for the page tables.
> 
> This is requiring passing a vm_fault structure to pte_unmap_same() which is
> containing all the needed parameters.
> 
> As pte_spinlock() may fail in the case of a speculative page fault, if the
> VMA has been touched in our back, pte_unmap_same() should now return 3
> cases :
>   1. pte are the same (0)
>   2. pte are different (VM_FAULT_PTNOTSAME)
>   3. a VMA's changes has been detected (VM_FAULT_RETRY)
> 
> The case 2 is handled by the introduction of a new VM_FAULT flag named
> VM_FAULT_PTNOTSAME which is then trapped in cow_user_page().

I don't see such logic in this patch.
Maybe you introduce it later? If so, please add a comment about it.
Or just return 0 in case 2, without introducing VM_FAULT_PTNOTSAME.

> If VM_FAULT_RETRY is returned, it is passed up to the callers to retry the
> page fault while holding the mmap_sem.
> 
> Acked-by: David Rientjes 
> Signed-off-by: Laurent Dufour 
> ---
>  include/linux/mm.h |  1 +
>  mm/memory.c| 39 ---
>  2 files changed, 29 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4d1aff80669c..714da99d77a3 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1208,6 +1208,7 @@ static inline void clear_page_pfmemalloc(struct page 
> *page)
>  #define VM_FAULT_NEEDDSYNC  0x2000   /* ->fault did not modify page tables
>* and needs fsync() to complete (for
>* synchronous page faults in DAX) */
> +#define VM_FAULT_PTNOTSAME 0x4000/* Page table entries have changed */
>  
>  #define VM_FAULT_ERROR   (VM_FAULT_OOM | VM_FAULT_SIGBUS | 
> VM_FAULT_SIGSEGV | \
>VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
> diff --git a/mm/memory.c b/mm/memory.c
> index 0b9a51f80e0e..f86efcb8e268 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2309,21 +2309,29 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
>   * parts, do_swap_page must check under lock before unmapping the pte and
>   * proceeding (but do_wp_page is only called after already making such a 
> check;
>   * and do_anonymous_page can safely check later on).
> + *
> + * pte_unmap_same() returns:
> + *   0   if the PTE are the same
> + *   VM_FAULT_PTNOTSAME  if the PTE are different
> + *   VM_FAULT_RETRY  if the VMA has changed in our back during
> + *   a speculative page fault handling.
>   */
> -static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
> - pte_t *page_table, pte_t orig_pte)
> +static inline int pte_unmap_same(struct vm_fault *vmf)
>  {
> - int same = 1;
> + int ret = 0;
> +
>  #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
>   if (sizeof(pte_t) > sizeof(unsigned long)) {
> - spinlock_t *ptl = pte_lockptr(mm, pmd);
> - spin_lock(ptl);
> - same = pte_same(*page_table, orig_pte);
> - spin_unlock(ptl);
> + if (pte_spinlock(vmf)) {
> + if (!pte_same(*vmf->pte, vmf->orig_pte))
> + ret = VM_FAULT_PTNOTSAME;
> + spin_unlock(vmf->ptl);
> + } else
> + ret = VM_FAULT_RETRY;
>   }
>  #endif
> - pte_unmap(page_table);
> - return same;
> + pte_unmap(vmf->pte);
> + return ret;
>  }
>  
>  static inline void cow_user_page(struct page *dst, struct page *src, 
> unsigned long va, struct vm_area_struct *vma)
> @@ -2912,10 +2920,19 @@ int do_swap_page(struct vm_fault *vmf)
>   pte_t pte;
>   int locked;
>   int exclusive = 0;
> - int ret = 0;
> + int ret;
>  
> - if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
> + ret = pte_unmap_same(vmf);
> + if (ret) {
> + /*
> +  * If pte != orig_pte, this means another thread did the
> +  * swap operation in our back.
> +  * So nothing else to do.
> +  */
> + if (ret == VM_FAULT_PTNOTSAME)
> + ret = 0;
>   goto out;
> + }
>  
>   entry = pte_to_swp_entry(vmf->orig_pte);
>   if (unlikely(non_swap_entry(entry))) {
> -- 
> 2.7.4
> 


Re: [PATCH v10 01/25] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT

2018-04-22 Thread Minchan Kim
Hi Laurent,

I guess it's good timing for a review. I guess LSF/MM is happening, so things
might change a lot after that. :) Anyway, I grabbed some time to review.

On Tue, Apr 17, 2018 at 04:33:07PM +0200, Laurent Dufour wrote:
> This configuration variable will be used to build the code needed to
> handle speculative page fault.
> 
> By default it is turned off, and activated depending on architecture
> support, SMP and MMU.

Can we have a description here of why it depends on the architecture?

> 
> Suggested-by: Thomas Gleixner 
> Suggested-by: David Rientjes 
> Signed-off-by: Laurent Dufour 
> ---
>  mm/Kconfig | 22 ++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index d5004d82a1d6..5484dca11199 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -752,3 +752,25 @@ config GUP_BENCHMARK
> performance of get_user_pages_fast().
>  
> See tools/testing/selftests/vm/gup_benchmark.c
> +
> +config ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> +   def_bool n
> +
> +config SPECULATIVE_PAGE_FAULT
> +   bool "Speculative page faults"
> +   default y
> +   depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> +   depends on MMU && SMP
> +   help
> + Try to handle user space page faults without holding the mmap_sem.
> +
> +  This should allow better concurrency for massively threaded process
> +  since the page fault handler will not wait for other threads memory
> +  layout change to be done, assuming that this change is done in another
> +  part of the process's memory space. This type of page fault is named
> +  speculative page fault.
> +
> +  If the speculative page fault fails because of a concurrency is
> +  detected or because underlying PMD or PTE tables are not yet
> +  allocating, it is failing its processing and a classic page fault
> +  is then tried.
> -- 
> 2.7.4
> 


Re: clear_page, copy_page address align question?

2017-04-10 Thread Minchan Kim
On Tue, Apr 11, 2017 at 01:12:24PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2017-04-11 at 12:08 +0900, Minchan Kim wrote:
> > Hello,
> > 
> > When I tested zram in ppc64, I got random corruption.
> > With investigation, it seems clear_page corrupted the memory.
> > I passed 64K kmalloced(kmalloc(PAGE_SIZE)) address to clear_page
> > and turned on slub debug so address is not aligned with PAGE_SIZE.
> > Is it a valid usecase that non-PAGE_SIZE aligned address is
> > used for clear_page in ppc64?
> > 
> > As well, copy_page have same rule, too?
> > 
> > Anyway, when I changed clear_page to memset, it seems the problem
> > is gone.
> 
> Yes, both clear_page and copy_page assume a PAGE_SHIFT alignment and
> are highly optimize according to this.
> 
> I wouldn't be surprised of other architectures implementations are the
> same.
> 
> I don't think it's ever legit to call these functions for something
> that isn't a naturally aligned page.


If it's common to every architecture, it would be better to have a
description about that somewhere, or a WARN_ON. :(

Thanks for the confirmation!
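
A minimal sketch of the kind of check meant above (hypothetical debug
helper, not an existing kernel API):

#include <linux/bug.h>
#include <linux/kernel.h>
#include <linux/mm.h>

/* clear_page()/copy_page() assume a naturally aligned page, so a debug
 * build could catch callers passing an unaligned kmalloc() buffer like
 * the zram case above. */
static inline void debug_check_page_aligned(const void *addr)
{
	WARN_ON_ONCE(!IS_ALIGNED((unsigned long)addr, PAGE_SIZE));
}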


clear_page, copy_page address align question?

2017-04-10 Thread Minchan Kim
Hello,

When I tested zram on ppc64, I got random corruption.
After investigation, it seems clear_page corrupted the memory.
I passed a 64K kmalloc'ed (kmalloc(PAGE_SIZE)) address to clear_page
and had slub debug turned on, so the address is not aligned to PAGE_SIZE.
Is it a valid use case for a non-PAGE_SIZE-aligned address to be
passed to clear_page on ppc64?

Does copy_page have the same alignment rule, too?

Anyway, when I changed clear_page to memset, the problem seems to be gone.

Thanks.


Re: [PATCH v3 5/7] zram: Convert to using memset_l

2017-03-26 Thread Minchan Kim
On Fri, Mar 24, 2017 at 09:13:16AM -0700, Matthew Wilcox wrote:
> From: Matthew Wilcox <mawil...@microsoft.com>
> 
> zram was the motivation for creating memset_l().  Minchan Kim sees a 7%
> performance improvement on x86 with 100MB of non-zero deduplicatable
> data:
> 
> perf stat -r 10 dd if=/dev/zram0 of=/dev/null
> 
> vanilla:0.232050465 seconds time elapsed ( +-  0.51% )
> memset_l: 0.217219387 seconds time elapsed ( +-  0.07% )
> 
> Signed-off-by: Matthew Wilcox <mawil...@microsoft.com>
> Tested-by: Minchan Kim <minc...@kernel.org>
Acked-by: Minchan Kim <minc...@kernel.org>

Thanks!
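
For reference, a small userspace model of what the conversion buys: filling
a page with a repeating unsigned-long pattern in a single call instead of an
open-coded loop. memset_l() is re-implemented locally here only to show the
idea; it is not the kernel's optimized version:

#include <stddef.h>
#include <stdio.h>

#define PAGE_SIZE 4096

static void *memset_l(unsigned long *p, unsigned long v, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		p[i] = v;
	return p;
}

int main(void)
{
	unsigned long page[PAGE_SIZE / sizeof(unsigned long)];

	/* zram stores a "same element" page as a single value, so reading
	 * it back is just a pattern fill of the whole page. */
	memset_l(page, 0xdeadbeefUL, PAGE_SIZE / sizeof(unsigned long));
	printf("%lx %lx\n", page[0], page[PAGE_SIZE / sizeof(unsigned long) - 1]);
	return 0;
}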


Re: [PATCH] powerpc: remove newly added extra definition of pmd_dirty

2016-01-20 Thread Minchan Kim
On Thu, Jan 21, 2016 at 01:05:20PM +1100, Stephen Rothwell wrote:
> Commit d5d6a443b243 ("arch/powerpc/include/asm/pgtable-ppc64.h:
> add pmd_[dirty|mkclean] for THP") added a new identical definition
> of pdm_dirty.  Remove it again.
> 
> Cc: Minchan Kim <minc...@kernel.org>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Signed-off-by: Stephen Rothwell <s...@canb.auug.org.au>

Thanks for the fix!

[PATCH v17 4/7] powerpc: add pmd_[dirty|mkclean] for THP

2014-10-20 Thread Minchan Kim
MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent
overwrite of the contents since MADV_FREE syscall is called for
THP page.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Minchan Kim minc...@kernel.org
---
 arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index 889c6fa9ee01..7c07e5975871 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -468,9 +468,11 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 
 #define pmd_pfn(pmd)   pte_pfn(pmd_pte(pmd))
 #define pmd_young(pmd) pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
 #define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)   pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkclean(pmd)   pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)   pte_pmd(pte_mkyoung(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)   pte_pmd(pte_mkwrite(pmd_pte(pmd)))
 
-- 
2.0.0
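
Illustrative only (hypothetical helpers, not code from the MADV_FREE
series): the reason the two helpers are needed is that MADV_FREE clears the
dirty state of the THP mapping, and reclaim may later discard the huge page
only if it is still clean, i.e. it has not been overwritten since:

#include <linux/mm.h>
#include <asm/pgtable.h>

/* madvise(MADV_FREE) time: forget previous writes.  Real code would hold
 * the PMD lock and go through pmdp_invalidate() first. */
static void madv_free_mark_thp_clean(struct vm_area_struct *vma, pmd_t *pmdp,
				     unsigned long addr)
{
	pmd_t entry = *pmdp;

	entry = pmd_mkold(entry);
	entry = pmd_mkclean(entry);
	set_pmd_at(vma->vm_mm, addr, pmdp, entry);
}

/* Reclaim time: a dirty PMD means the THP was written after MADV_FREE and
 * its contents must be preserved rather than discarded. */
static bool madv_free_can_discard_thp(pmd_t pmd)
{
	return !pmd_dirty(pmd);
}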


[PATCH v16 4/7] powerpc: add pmd_[dirty|mkclean] for THP

2014-09-01 Thread Minchan Kim
MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent
overwrite of the contents since MADV_FREE syscall is called for
THP page.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Minchan Kim minc...@kernel.org
---
 arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index 7b3d54fae46f..fb89d8eb96c8 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -468,9 +468,11 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 
 #define pmd_pfn(pmd)   pte_pfn(pmd_pte(pmd))
 #define pmd_young(pmd) pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
 #define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)   pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkclean(pmd)   pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)   pte_pmd(pte_mkyoung(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)   pte_pmd(pte_mkwrite(pmd_pte(pmd)))
 
-- 
2.0.0


[PATCH v15 4/7] powerpc: add pmd_[dirty|mkclean] for THP

2014-08-17 Thread Minchan Kim
MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent
overwrite of the contents since MADV_FREE syscall is called for
THP page.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Minchan Kim minc...@kernel.org
---
 arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index eb9261024f51..c9a4bbe8e179 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -468,9 +468,11 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 
 #define pmd_pfn(pmd)   pte_pfn(pmd_pte(pmd))
 #define pmd_young(pmd) pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
 #define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)   pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkclean(pmd)   pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)   pte_pmd(pte_mkyoung(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)   pte_pmd(pte_mkwrite(pmd_pte(pmd)))
 
-- 
2.0.0


[PATCH v14 4/8] powerpc: add pmd_[dirty|mkclean] for THP

2014-08-13 Thread Minchan Kim
MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent
overwrite of the contents since MADV_FREE syscall is called for
THP page.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Minchan Kim minc...@kernel.org
---
 arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index eb9261024f51..c9a4bbe8e179 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -468,9 +468,11 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 
 #define pmd_pfn(pmd)   pte_pfn(pmd_pte(pmd))
 #define pmd_young(pmd) pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
 #define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)   pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkclean(pmd)   pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)   pte_pmd(pte_mkyoung(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)   pte_pmd(pte_mkwrite(pmd_pte(pmd)))
 
-- 
2.0.0


[PATCH v13 4/8] powerpc: add pmd_[dirty|mkclean] for THP

2014-07-18 Thread Minchan Kim
MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent
overwrite of the contents since MADV_FREE syscall is called for
THP page.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Minchan Kim minc...@kernel.org
---
 arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index eb9261024f51..c9a4bbe8e179 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -468,9 +468,11 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 
 #define pmd_pfn(pmd)   pte_pfn(pmd_pte(pmd))
 #define pmd_young(pmd) pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
 #define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)   pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkclean(pmd)   pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)   pte_pmd(pte_mkyoung(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)   pte_pmd(pte_mkwrite(pmd_pte(pmd)))
 
-- 
2.0.0


[PATCH v12 4/8] powerpc: add pmd_[dirty|mkclean] for THP

2014-07-09 Thread Minchan Kim
MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent
overwrite of the contents since MADV_FREE syscall is called for
THP page.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Minchan Kim minc...@kernel.org
---
 arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h 
b/arch/powerpc/include/asm/pgtable-ppc64.h
index eb9261024f51..c9a4bbe8e179 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -468,9 +468,11 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 
 #define pmd_pfn(pmd)   pte_pfn(pmd_pte(pmd))
 #define pmd_young(pmd) pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
 #define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)   pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkclean(pmd)   pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)   pte_pmd(pte_mkyoung(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)   pte_pmd(pte_mkwrite(pmd_pte(pmd)))
 
-- 
2.0.0


[PATCH v11 4/7] powerpc: add pmd_[dirty|mkclean] for THP

2014-07-08 Thread Minchan Kim
MADV_FREE needs pmd_dirty and pmd_mkclean so it can detect whether the
contents of a THP page have been overwritten since the MADV_FREE syscall
was called.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Minchan Kim minc...@kernel.org
---
 arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index eb9261024f51..c9a4bbe8e179 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -468,9 +468,11 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 
 #define pmd_pfn(pmd)   pte_pfn(pmd_pte(pmd))
 #define pmd_young(pmd) pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
 #define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)   pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkclean(pmd)   pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)   pte_pmd(pte_mkyoung(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)   pte_pmd(pte_mkwrite(pmd_pte(pmd)))
 
-- 
2.0.0

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v10 4/7] powerpc: add pmd_[dirty|mkclean] for THP

2014-07-06 Thread Minchan Kim
MADV_FREE needs pmd_dirty and pmd_mkclean so it can detect whether the
contents of a THP page have been overwritten since the MADV_FREE syscall
was called.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Minchan Kim minc...@kernel.org
---
 arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index eb9261024f51..c9a4bbe8e179 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -468,9 +468,11 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 
 #define pmd_pfn(pmd)   pte_pfn(pmd_pte(pmd))
 #define pmd_young(pmd) pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
 #define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)   pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkclean(pmd)   pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)   pte_pmd(pte_mkyoung(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)   pte_pmd(pte_mkwrite(pmd_pte(pmd)))
 
-- 
2.0.0

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 -next 1/9] DMA, CMA: fix possible memory leak

2014-06-16 Thread Minchan Kim
Hi, Joonsoo

On Mon, Jun 16, 2014 at 02:40:43PM +0900, Joonsoo Kim wrote:
 We should free memory for bitmap when we find zone mis-match,
 otherwise this memory will leak.
 
 Additionally, I copied the code comment from PPC KVM's CMA code to explain
 why we need to check for a zone mismatch.
 
 * Note
 Minchan suggested adding a tag for stable, but I don't do that here,
 because I found this possibility during code review and, IMO,
 this patch isn't suitable for the stable tree.

Nice idea to put the comment in here. Thanks Joonsoo.

It seems you are following the "It must fix a real bug that bothers people"
rule in Documentation/stable_kernel_rules.txt, but this is a really obvious
bug that is unlikely to ever be reported: the users are limited and such a
small memory leak is hard to detect.

In my experience, Andrew has preferred a stable tag for an obvious problem
with a simple fix like this, but I'm not sure, so let's leave the decision
to him, learn from it, and follow that decision from now on.

Thanks.

Acked-by: Minchan Kim minc...@kernel.org

 
 Acked-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
 Reviewed-by: Michal Nazarewicz min...@mina86.com
 Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
 Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
 index 83969f8..6467c91 100644
 --- a/drivers/base/dma-contiguous.c
 +++ b/drivers/base/dma-contiguous.c
 @@ -176,14 +176,24 @@ static int __init cma_activate_area(struct cma *cma)
   base_pfn = pfn;
   for (j = pageblock_nr_pages; j; --j, pfn++) {
   WARN_ON_ONCE(!pfn_valid(pfn));
 + /*
 +  * alloc_contig_range requires the pfn range
 +  * specified to be in the same zone. Make this
 +  * simple by forcing the entire CMA resv range
 +  * to be in the same zone.
 +  */
   if (page_zone(pfn_to_page(pfn)) != zone)
 - return -EINVAL;
 + goto err;
   }
   init_cma_reserved_pageblock(pfn_to_page(base_pfn));
   } while (--i);
  
   mutex_init(&cma->lock);
   return 0;
 +
 +err:
 + kfree(cma->bitmap);
 + return -EINVAL;
  }
  
  static struct cma cma_areas[MAX_CMA_AREAS];
 -- 
 1.7.9.5
 

-- 
Kind regards,
Minchan Kim
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2 05/10] DMA, CMA: support arbitrary bitmap granularity

2014-06-12 Thread Minchan Kim
,
  unsigned int align)
  {
 - unsigned long mask, pfn, pageno, start = 0;
 + unsigned long mask, pfn, start = 0;
 + unsigned long bitmap_maxno, bitmapno, nr_bits;
   struct page *page = NULL;
   int ret;
  
 @@ -358,18 +386,19 @@ static struct page *__dma_alloc_from_contiguous(struct cma *cma, int count,
   if (!count)
   return NULL;
  
 - mask = (1 << align) - 1;
 -
 + mask = cma_bitmap_aligned_mask(cma, align);
 + bitmap_maxno = cma_bitmap_maxno(cma);
 + nr_bits = cma_bitmap_pages_to_bits(cma, count);
  
   for (;;) {
   mutex_lock(&cma->lock);
 - pageno = bitmap_find_next_zero_area(cma->bitmap, cma->count,
 - start, count, mask);
 - if (pageno >= cma->count) {
 + bitmapno = bitmap_find_next_zero_area(cma->bitmap,
 + bitmap_maxno, start, nr_bits, mask);
 + if (bitmapno >= bitmap_maxno) {
   mutex_unlock(&cma->lock);
   break;
   }
 - bitmap_set(cma->bitmap, pageno, count);
 + bitmap_set(cma->bitmap, bitmapno, nr_bits);
   /*
* It's safe to drop the lock here. We've marked this region for
* our exclusive use. If the migration fails we will take the
 @@ -377,7 +406,7 @@ static struct page *__dma_alloc_from_contiguous(struct cma *cma, int count,
  */
   mutex_unlock(&cma->lock);
  
 - pfn = cma->base_pfn + pageno;
 + pfn = cma->base_pfn + (bitmapno << cma->order_per_bit);
   mutex_lock(&cma_mutex);
   ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA);
   mutex_unlock(&cma_mutex);
 @@ -392,7 +421,7 @@ static struct page *__dma_alloc_from_contiguous(struct cma *cma, int count,
   pr_debug("%s(): memory range at %p is busy, retrying\n",
__func__, pfn_to_page(pfn));
   /* try again with a bit different memory target */
 - start = pageno + mask + 1;
 + start = bitmapno + mask + 1;
   }
  
   pr_debug(%s(): returned %p\n, __func__, page);
 -- 
 1.7.9.5

-- 
Kind regards,
Minchan Kim
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2 05/10] DMA, CMA: support arbitrary bitmap granularity

2014-06-12 Thread Minchan Kim
On Thu, Jun 12, 2014 at 03:43:55PM +0900, Joonsoo Kim wrote:
 On Thu, Jun 12, 2014 at 03:06:10PM +0900, Minchan Kim wrote:
  On Thu, Jun 12, 2014 at 12:21:42PM +0900, Joonsoo Kim wrote:
   ppc kvm's cma region management requires arbitrary bitmap granularity,
   since it wants to reserve very large memory and manage this region with
   a bitmap in which one bit covers several pages, to reduce management
   overhead. So support arbitrary bitmap granularity for the following
   generalization.
   
   Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
   
   diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
   index bc4c171..9bc9340 100644
   --- a/drivers/base/dma-contiguous.c
   +++ b/drivers/base/dma-contiguous.c
   @@ -38,6 +38,7 @@ struct cma {
 unsigned long   base_pfn;
 unsigned long   count;
 unsigned long   *bitmap;
   + int order_per_bit; /* Order of pages represented by one bit */
  
  Hmm, I'm not sure it's good as a *general* interface even though it covers
  the existing use cases.
  
  It forces a cma area to be handled in units of the same size, right?
  That's a really important point for this patchset's motivation, so I will
  stop reviewing here and wait for other opinions.
 
 If you pass 0 as order_per_bit, you can manage the cma area at any
 size (page unit) you want. If you pass a certain number as order_per_bit,
 you can allocate and release the cma area only in multiples of that page order.
 
 I think this is a more general implementation than the previous versions.

Fair enough.
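
For readers skimming the thread, a small stand-alone example of the
granularity trade-off being agreed on here; the numbers are illustrative
(64 pages per bit matches the powerpc KVM case mentioned above, and 4 KiB
pages are assumed):

	#include <stdio.h>

	#define PAGE_SHIFT 12				/* assume 4 KiB pages */

	int main(void)
	{
		unsigned long area_bytes = 512UL << 20;	/* 512 MiB CMA area */
		unsigned long pages = area_bytes >> PAGE_SHIFT;
		int order_per_bit = 6;			/* one bit covers 2^6 = 64 pages */
		unsigned long bits = pages >> order_per_bit;

		printf("%lu pages tracked by %lu bitmap bits (%lu bytes)\n",
		       pages, bits, bits / 8);
		/* order_per_bit = 0 would need one bit per page: 131072 bits here. */
		return 0;
	}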

-- 
Kind regards,
Minchan Kim
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2 05/10] DMA, CMA: support arbitrary bitmap granularity

2014-06-12 Thread Minchan Kim
On Thu, Jun 12, 2014 at 12:21:42PM +0900, Joonsoo Kim wrote:
 ppc kvm's cma region management requires arbitrary bitmap granularity,
 since it wants to reserve very large memory and manage this region with
 a bitmap in which one bit covers several pages, to reduce management
 overhead. So support arbitrary bitmap granularity for the following
 generalization.
 
 Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
Acked-by: Minchan Kim minc...@kernel.org

Just a nit below.

 
 diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
 index bc4c171..9bc9340 100644
 --- a/drivers/base/dma-contiguous.c
 +++ b/drivers/base/dma-contiguous.c
 @@ -38,6 +38,7 @@ struct cma {
   unsigned long   base_pfn;
   unsigned long   count;
   unsigned long   *bitmap;
 + int order_per_bit; /* Order of pages represented by one bit */
   struct mutex    lock;
  };
  
 @@ -157,9 +158,38 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
  
  static DEFINE_MUTEX(cma_mutex);
  
 +static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int align_order)
 +{
 + return (1 << (align_order >> cma->order_per_bit)) - 1;
 +}
 +
 +static unsigned long cma_bitmap_maxno(struct cma *cma)
 +{
 + return cma->count >> cma->order_per_bit;
 +}
 +
 +static unsigned long cma_bitmap_pages_to_bits(struct cma *cma,
 + unsigned long pages)
 +{
 + return ALIGN(pages, 1 << cma->order_per_bit) >> cma->order_per_bit;
 +}
 +
 +static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
 +{
 + unsigned long bitmapno, nr_bits;
 +
 + bitmapno = (pfn - cma->base_pfn) >> cma->order_per_bit;
 + nr_bits = cma_bitmap_pages_to_bits(cma, count);
 +
 + mutex_lock(&cma->lock);
 + bitmap_clear(cma->bitmap, bitmapno, nr_bits);
 + mutex_unlock(&cma->lock);
 +}
 +
  static int __init cma_activate_area(struct cma *cma)
  {
 - int bitmap_size = BITS_TO_LONGS(cma->count) * sizeof(long);
 + int bitmap_maxno = cma_bitmap_maxno(cma);
 + int bitmap_size = BITS_TO_LONGS(bitmap_maxno) * sizeof(long);
   unsigned long base_pfn = cma->base_pfn, pfn = base_pfn;
   unsigned i = cma->count >> pageblock_order;
   struct zone *zone;
 @@ -221,6 +251,7 @@ core_initcall(cma_init_reserved_areas);
   * @base: Base address of the reserved area optional, use 0 for any
   * @limit: End address of the reserved memory (optional, 0 for any).
   * @alignment: Alignment for the contiguous memory area, should be power of 2
 + * @order_per_bit: Order of pages represented by one bit on bitmap.
   * @res_cma: Pointer to store the created cma region.
   * @fixed: hint about where to place the reserved area
   *
 @@ -235,7 +266,7 @@ core_initcall(cma_init_reserved_areas);
   */
  static int __init __dma_contiguous_reserve_area(phys_addr_t size,
   phys_addr_t base, phys_addr_t limit,
 - phys_addr_t alignment,
 + phys_addr_t alignment, int order_per_bit,
   struct cma **res_cma, bool fixed)
  {
   struct cma *cma = &cma_areas[cma_area_count];
 @@ -269,6 +300,8 @@ static int __init __dma_contiguous_reserve_area(phys_addr_t size,
   base = ALIGN(base, alignment);
   size = ALIGN(size, alignment);
   limit &= ~(alignment - 1);
 + /* size should be aligned with order_per_bit */
 + BUG_ON(!IS_ALIGNED(size >> PAGE_SHIFT, 1 << order_per_bit));
  
   /* Reserve memory */
   if (base && fixed) {
 @@ -294,6 +327,7 @@ static int __init __dma_contiguous_reserve_area(phys_addr_t size,
  */
   cma->base_pfn = PFN_DOWN(base);
   cma->count = size >> PAGE_SHIFT;
 + cma->order_per_bit = order_per_bit;
   *res_cma = cma;
   cma_area_count++;
  
 @@ -313,7 +347,7 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
  {
   int ret;
  
 - ret = __dma_contiguous_reserve_area(size, base, limit, 0,
 + ret = __dma_contiguous_reserve_area(size, base, limit, 0, 0,
   res_cma, fixed);
   if (ret)
   return ret;
 @@ -324,13 +358,6 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
   return 0;
  }
  
 -static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
 -{
 - mutex_lock(&cma->lock);
 - bitmap_clear(cma->bitmap, pfn - cma->base_pfn, count);
 - mutex_unlock(&cma->lock);
 -}
 -
  /**
   * dma_alloc_from_contiguous() - allocate pages from contiguous area
   * @dev:   Pointer to device for which the allocation is performed.
 @@ -345,7 +372,8 @@ static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
  static struct page *__dma_alloc_from_contiguous(struct cma *cma, int count,
  unsigned int align)
  {
 - unsigned long mask, pfn, pageno, start = 0;
 + unsigned long mask, pfn, start = 0;
 + unsigned long bitmap_maxno, bitmapno

Re: [PATCH v2 06/10] CMA: generalize CMA reserved area management functionality

2014-06-12 Thread Minchan Kim
On Thu, Jun 12, 2014 at 12:21:43PM +0900, Joonsoo Kim wrote:
 Currently, there are two users of the CMA functionality: one is the DMA
 subsystem and the other is kvm on powerpc. They each have their own code
 to manage the CMA reserved area, even though the code looks really similar.
 My guess is that this is caused by differing needs in bitmap management:
 the kvm side wants to maintain a bitmap not per page but at a coarser
 granularity; eventually it uses a bitmap where one bit represents 64 pages.
 
 When I implement CMA related patches, I have to change both places
 to apply my change, and that has been painful. I want to change
 this situation and reduce future code management overhead through
 this patch.
 
 This change could also help developers who want to use CMA in their
 new feature development, since they can use CMA easily without
 copying & pasting this reserved area management code.
 
 In previous patches, we have prepared some features to generalize
 CMA reserved area management and now it's time to do it. This patch
 moves the core functions to mm/cma.c and changes the DMA APIs to use
 these functions.
 
 There is no functional change in DMA APIs.
 
 v2: There is no big change from v1 in mm/cma.c. Mostly renaming.
 
 Acked-by: Michal Nazarewicz min...@mina86.com
 Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

Actually, I want to remove the bool return of cma_release(), but that's
out of the scope of this patchset.

Acked-by: Minchan Kim minc...@kernel.org

 
 diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
 index 00e13ce..4eac559 100644
 --- a/drivers/base/Kconfig
 +++ b/drivers/base/Kconfig
 @@ -283,16 +283,6 @@ config CMA_ALIGNMENT
  
 If unsure, leave the default value 8.
  
 -config CMA_AREAS
 - int "Maximum count of the CMA device-private areas"
 - default 7
 - help
 -   CMA allows to create CMA areas for particular devices. This parameter
 -   sets the maximum number of such device private CMA areas in the
 -   system.
 -
 -   If unsure, leave the default value 7.
 -
  endif
  
  endmenu
 diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
 index 9bc9340..f177f73 100644
 --- a/drivers/base/dma-contiguous.c
 +++ b/drivers/base/dma-contiguous.c
 @@ -24,25 +24,10 @@
  
  #include <linux/memblock.h>
  #include <linux/err.h>
 -#include <linux/mm.h>
 -#include <linux/mutex.h>
 -#include <linux/page-isolation.h>
  #include <linux/sizes.h>
 -#include <linux/slab.h>
 -#include <linux/swap.h>
 -#include <linux/mm_types.h>
  #include <linux/dma-contiguous.h>
  #include <linux/log2.h>

Should we keep log2.h here?

 -
 -struct cma {
 - unsigned long   base_pfn;
 - unsigned long   count;
 - unsigned long   *bitmap;
 - int order_per_bit; /* Order of pages represented by one bit */
 - struct mutex    lock;
 -};
 -
 -struct cma *dma_contiguous_default_area;
 +#include <linux/cma.h>
  
  #ifdef CONFIG_CMA_SIZE_MBYTES
  #define CMA_SIZE_MBYTES CONFIG_CMA_SIZE_MBYTES
 @@ -50,6 +35,8 @@ struct cma *dma_contiguous_default_area;
  #define CMA_SIZE_MBYTES 0
  #endif
  
 +struct cma *dma_contiguous_default_area;
 +
  /*
   * Default global CMA area size can be defined in kernel's .config.
   * This is useful mainly for distro maintainers to create a kernel
 @@ -156,199 +143,13 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
   }
  }
  
 -static DEFINE_MUTEX(cma_mutex);
 -
 -static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int align_order)
 -{
 - return (1 << (align_order >> cma->order_per_bit)) - 1;
 -}
 -
 -static unsigned long cma_bitmap_maxno(struct cma *cma)
 -{
 - return cma->count >> cma->order_per_bit;
 -}
 -
 -static unsigned long cma_bitmap_pages_to_bits(struct cma *cma,
 - unsigned long pages)
 -{
 - return ALIGN(pages, 1 << cma->order_per_bit) >> cma->order_per_bit;
 -}
 -
 -static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
 -{
 - unsigned long bitmapno, nr_bits;
 -
 - bitmapno = (pfn - cma->base_pfn) >> cma->order_per_bit;
 - nr_bits = cma_bitmap_pages_to_bits(cma, count);
 -
 - mutex_lock(&cma->lock);
 - bitmap_clear(cma->bitmap, bitmapno, nr_bits);
 - mutex_unlock(&cma->lock);
 -}
 -
 -static int __init cma_activate_area(struct cma *cma)
 -{
 - int bitmap_maxno = cma_bitmap_maxno(cma);
 - int bitmap_size = BITS_TO_LONGS(bitmap_maxno) * sizeof(long);
 - unsigned long base_pfn = cma->base_pfn, pfn = base_pfn;
 - unsigned i = cma->count >> pageblock_order;
 - struct zone *zone;
 -
 - pr_debug("%s()\n", __func__);
 -
 - cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);
 - if (!cma->bitmap)
 - return -ENOMEM;
 -
 - WARN_ON_ONCE(!pfn_valid(pfn));
 - zone = page_zone(pfn_to_page(pfn));
 -
 - do {
 - unsigned j;
 - base_pfn = pfn;
 - for (j = pageblock_nr_pages; j; --j, pfn++) {
 - WARN_ON_ONCE(!pfn_valid(pfn

Re: [PATCH v2 08/10] mm, cma: clean-up cma allocation error path

2014-06-12 Thread Minchan Kim
On Thu, Jun 12, 2014 at 12:21:45PM +0900, Joonsoo Kim wrote:
 We can remove one call site for clear_cma_bitmap() if we call it first,
 before checking the error number.
 
 Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
Acked-by: Minchan Kim minc...@kernel.org
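
For readers following the thread, a sketch of the restructuring being acked
(simplified and based on the quoted series, not a verbatim copy); clearing
the bitmap before inspecting the error code leaves a single
clear_cma_bitmap() call site:

	for (;;) {
		/* ... find and set a free bitmap range, compute pfn ... */
		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA);
		if (ret == 0) {
			page = pfn_to_page(pfn);
			break;		/* success: keep the bitmap bits set */
		}

		clear_cma_bitmap(cma, pfn, count);	/* single call site */
		if (ret != -EBUSY)
			break;				/* hard failure */

		/* -EBUSY: the range was busy, retry at a different offset */
		start = bitmapno + mask + 1;
	}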

-- 
Kind regards,
Minchan Kim
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2 09/10] mm, cma: move output param to the end of param list

2014-06-12 Thread Minchan Kim
On Thu, Jun 12, 2014 at 12:21:46PM +0900, Joonsoo Kim wrote:
 Conventionally, we put output param to the end of param list.
 cma_declare_contiguous() doesn't look like that, so change it.

If you say "conventionally", I'd like to suggest one more thing:
conventionally, we put 'base' ahead of 'size', but dma_contiguous_reserve_area
does the opposite.

 
 Additionally, move down cma_areas reference code to the position
 where it is really needed.
 
 Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c
 index 28ec226..97613ea 100644
 --- a/arch/powerpc/kvm/book3s_hv_builtin.c
 +++ b/arch/powerpc/kvm/book3s_hv_builtin.c
 @@ -184,7 +184,7 @@ void __init kvm_cma_reserve(void)
  
   align_size = max(kvm_rma_pages  PAGE_SHIFT, align_size);
   cma_declare_contiguous(selected_size, 0, 0, align_size,
 - KVM_CMA_CHUNK_ORDER - PAGE_SHIFT, &kvm_cma, false);
 + KVM_CMA_CHUNK_ORDER - PAGE_SHIFT, false, &kvm_cma);
   }
  }
  
 diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
 index f177f73..bfd4553 100644
 --- a/drivers/base/dma-contiguous.c
 +++ b/drivers/base/dma-contiguous.c
 @@ -149,7 +149,7 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
  {
   int ret;
  
 - ret = cma_declare_contiguous(size, base, limit, 0, 0, res_cma, fixed);
 + ret = cma_declare_contiguous(size, base, limit, 0, 0, fixed, res_cma);
   if (ret)
   return ret;
  
 diff --git a/include/linux/cma.h b/include/linux/cma.h
 index e38efe9..e53eead 100644
 --- a/include/linux/cma.h
 +++ b/include/linux/cma.h
 @@ -6,7 +6,7 @@ struct cma;
  extern int __init cma_declare_contiguous(phys_addr_t size,
   phys_addr_t base, phys_addr_t limit,
   phys_addr_t alignment, int order_per_bit,
 - struct cma **res_cma, bool fixed);
 + bool fixed, struct cma **res_cma);
  extern struct page *cma_alloc(struct cma *cma, int count, unsigned int 
 align);
  extern bool cma_release(struct cma *cma, struct page *pages, int count);
  #endif
 diff --git a/mm/cma.c b/mm/cma.c
 index 01a0713..22a5b23 100644
 --- a/mm/cma.c
 +++ b/mm/cma.c
 @@ -142,8 +142,8 @@ core_initcall(cma_init_reserved_areas);
   * @limit: End address of the reserved memory (optional, 0 for any).
   * @alignment: Alignment for the contiguous memory area, should be power of 2
   * @order_per_bit: Order of pages represented by one bit on bitmap.
 - * @res_cma: Pointer to store the created cma region.
   * @fixed: hint about where to place the reserved area
 + * @res_cma: Pointer to store the created cma region.
   *
   * This function reserves memory from early allocator. It should be
   * called by arch specific code once the early allocator (memblock or 
 bootmem)
 @@ -156,9 +156,9 @@ core_initcall(cma_init_reserved_areas);
  int __init cma_declare_contiguous(phys_addr_t size,
   phys_addr_t base, phys_addr_t limit,
   phys_addr_t alignment, int order_per_bit,
 - struct cma **res_cma, bool fixed)
 + bool fixed, struct cma **res_cma)
  {
 - struct cma *cma = &cma_areas[cma_area_count];
 + struct cma *cma;
   int ret = 0;
  
   pr_debug("%s(size %lx, base %08lx, limit %08lx alignment %08lx)\n",
 @@ -214,6 +214,7 @@ int __init cma_declare_contiguous(phys_addr_t size,
* Each reserved area must be initialised later, when more kernel
* subsystems (like slab allocator) are available.
*/
 + cma = &cma_areas[cma_area_count];
   cma->base_pfn = PFN_DOWN(base);
   cma->count = size >> PAGE_SHIFT;
   cma->order_per_bit = order_per_bit;
 -- 
 1.7.9.5

-- 
Kind regards,
Minchan Kim
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2 10/10] mm, cma: use spinlock instead of mutex

2014-06-12 Thread Minchan Kim
On Thu, Jun 12, 2014 at 12:21:47PM +0900, Joonsoo Kim wrote:
 Currently, we have to take the mutex to manipulate the bitmap.
 This job may be really simple and short, so we don't need to sleep
 if contended. So I change it to a spinlock.

I'm not sure it would always be good.
Maybe you remember we discussed a similar issue internally about bitmap
searching in the vmap code, which was really painful when the bitmap was
fragmented. So if you really want this, we need numbers at least, and I'd
like the numbers to come from an ARM machine, the most popular platform
for CMA at the moment.

 
 Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 diff --git a/mm/cma.c b/mm/cma.c
 index 22a5b23..3085e8c 100644
 --- a/mm/cma.c
 +++ b/mm/cma.c
 @@ -27,6 +27,7 @@
  #include <linux/memblock.h>
  #include <linux/err.h>
  #include <linux/mm.h>
 +#include <linux/spinlock.h>
  #include <linux/mutex.h>
  #include <linux/sizes.h>
  #include <linux/slab.h>
 @@ -36,7 +37,7 @@ struct cma {
   unsigned long   count;
   unsigned long   *bitmap;
   int order_per_bit; /* Order of pages represented by one bit */
 - struct mutex    lock;
 + spinlock_t  lock;
  };
  
  /*
 @@ -72,9 +73,9 @@ static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
   bitmapno = (pfn - cma->base_pfn) >> cma->order_per_bit;
   nr_bits = cma_bitmap_pages_to_bits(cma, count);
  
 - mutex_lock(&cma->lock);
 + spin_lock(&cma->lock);
   bitmap_clear(cma->bitmap, bitmapno, nr_bits);
 - mutex_unlock(&cma->lock);
 + spin_unlock(&cma->lock);
  }
  
  static int __init cma_activate_area(struct cma *cma)
 @@ -112,7 +113,7 @@ static int __init cma_activate_area(struct cma *cma)
   init_cma_reserved_pageblock(pfn_to_page(base_pfn));
   } while (--i);
  
 - mutex_init(&cma->lock);
 + spin_lock_init(&cma->lock);
   return 0;
  
  err:
 @@ -261,11 +262,11 @@ struct page *cma_alloc(struct cma *cma, int count, unsigned int align)
   nr_bits = cma_bitmap_pages_to_bits(cma, count);
  
   for (;;) {
 - mutex_lock(&cma->lock);
 + spin_lock(&cma->lock);
   bitmapno = bitmap_find_next_zero_area(cma->bitmap,
   bitmap_maxno, start, nr_bits, mask);
   if (bitmapno >= bitmap_maxno) {
 - mutex_unlock(&cma->lock);
 + spin_unlock(&cma->lock);
   break;
   }
   bitmap_set(cma->bitmap, bitmapno, nr_bits);
 @@ -274,7 +275,7 @@ struct page *cma_alloc(struct cma *cma, int count, unsigned int align)
* our exclusive use. If the migration fails we will take the
* lock again and unmark it.
*/
 - mutex_unlock(&cma->lock);
 + spin_unlock(&cma->lock);
  
   pfn = cma->base_pfn + (bitmapno << cma->order_per_bit);
   mutex_lock(&cma_mutex);
 -- 
 1.7.9.5
 

-- 
Kind regards,
Minchan Kim
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2 02/10] DMA, CMA: fix possible memory leak

2014-06-11 Thread Minchan Kim
On Thu, Jun 12, 2014 at 12:21:39PM +0900, Joonsoo Kim wrote:
 We should free memory for bitmap when we find zone mis-match,
 otherwise this memory will leak.

Then, -stable stuff?

 
 Additionally, I copied the code comment from ppc kvm's cma code to explain
 why we need to check for a zone mismatch.
 
 Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
 index bd0bb81..fb0cdce 100644
 --- a/drivers/base/dma-contiguous.c
 +++ b/drivers/base/dma-contiguous.c
 @@ -177,14 +177,24 @@ static int __init cma_activate_area(struct cma *cma)
   base_pfn = pfn;
   for (j = pageblock_nr_pages; j; --j, pfn++) {
   WARN_ON_ONCE(!pfn_valid(pfn));
 + /*
 +  * alloc_contig_range requires the pfn range
 +  * specified to be in the same zone. Make this
 +  * simple by forcing the entire CMA resv range
 +  * to be in the same zone.
 +  */
   if (page_zone(pfn_to_page(pfn)) != zone)
 - return -EINVAL;
 + goto err;

At first glance, I thought it would be better to handle such an error
before activation.
But when I looked at the registration code (ie, dma_contiguous_reserve_area),
I realized that is impossible because the zones are not set up yet at that
point. :(

If so, when we detect the failure here, wouldn't it be better to report a
more meaningful error message, e.g. which zone the area started in, which
zone the offending pfn belongs to, and the failing pfn number? (A sketch
follows the quoted patch below.)

   }
   init_cma_reserved_pageblock(pfn_to_page(base_pfn));
   } while (--i);
  
   mutex_init(&cma->lock);
   return 0;
 +
 +err:
 + kfree(cma->bitmap);
 + return -EINVAL;
  }
  
  static struct cma cma_areas[MAX_CMA_AREAS];
 -- 
 1.7.9.5
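
A sketch of the more informative failure report suggested in the comment
above; the message wording and the use of zone->name are assumptions, not
part of the patch:

	if (page_zone(pfn_to_page(pfn)) != zone) {
		pr_err("cma: pfn %lu is in zone %s, expected %s; "
		       "a CMA area must not span zones\n",
		       pfn, page_zone(pfn_to_page(pfn))->name, zone->name);
		goto err;
	}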

-- 
Kind regards,
Minchan Kim
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2 01/10] DMA, CMA: clean-up log message

2014-06-11 Thread Minchan Kim
Hi Joonsoo,

On Thu, Jun 12, 2014 at 12:21:38PM +0900, Joonsoo Kim wrote:
 We don't need explicit 'CMA:' prefix, since we already define prefix
 'cma:' in pr_fmt. So remove it.
 
 And, some logs print the function name and others don't. This looks
 bad to me, so I unify the log format to print the function name consistently.
 
 Lastly, I add one more debug log on cma_activate_area().

When I took a look, the new message just indicates whether cma_activate_area()
was called or not, without saying which range was activated successfully, so
I couldn't see the intention behind the new message. The description should
explain it so that everybody can agree with your claim.

 
 Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
 index 83969f8..bd0bb81 100644
 --- a/drivers/base/dma-contiguous.c
 +++ b/drivers/base/dma-contiguous.c
 @@ -144,7 +144,7 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
   }
  
   if (selected_size  !dma_contiguous_default_area) {
 - pr_debug("%s: reserving %ld MiB for global area\n", __func__,
 + pr_debug("%s(): reserving %ld MiB for global area\n", __func__,
 (unsigned long)selected_size / SZ_1M);
  
   dma_contiguous_reserve_area(selected_size, selected_base,
 @@ -163,8 +163,9 @@ static int __init cma_activate_area(struct cma *cma)
   unsigned i = cma->count >> pageblock_order;
   struct zone *zone;
  
 - cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);
 + pr_debug("%s()\n", __func__);
  
 + cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);
   if (!cma->bitmap)
   return -ENOMEM;
  
 @@ -234,7 +235,8 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
  
   /* Sanity checks */
   if (cma_area_count == ARRAY_SIZE(cma_areas)) {
 - pr_err("Not enough slots for CMA reserved regions!\n");
 + pr_err("%s(): Not enough slots for CMA reserved regions!\n",
 + __func__);
   return -ENOSPC;
   }
  
 @@ -274,14 +276,15 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
   *res_cma = cma;
   cma_area_count++;
  
 - pr_info("CMA: reserved %ld MiB at %08lx\n", (unsigned long)size / SZ_1M,
 - (unsigned long)base);
 + pr_info("%s(): reserved %ld MiB at %08lx\n",
 + __func__, (unsigned long)size / SZ_1M, (unsigned long)base);
  
   /* Architecture specific contiguous memory fixup. */
   dma_contiguous_early_fixup(base, size);
   return 0;
  err:
 - pr_err("CMA: failed to reserve %ld MiB\n", (unsigned long)size / SZ_1M);
 + pr_err("%s(): failed to reserve %ld MiB\n",
 + __func__, (unsigned long)size / SZ_1M);
   return ret;
  }
  
 -- 
 1.7.9.5

-- 
Kind regards,
Minchan Kim
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2 03/10] DMA, CMA: separate core cma management codes from DMA APIs

2014-06-11 Thread Minchan Kim

-- 
Kind regards,
Minchan Kim
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2 04/10] DMA, CMA: support alignment constraint on cma region

2014-06-11 Thread Minchan Kim
On Thu, Jun 12, 2014 at 12:21:41PM +0900, Joonsoo Kim wrote:
 ppc kvm's cma area management needs alignment constraint on
 cma region. So support it to prepare generalization of cma area
 management functionality.
 
 Additionally, add some comments which tell us why alignment
 constraint is needed on cma region.
 
 Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
 index 8a44c82..bc4c171 100644
 --- a/drivers/base/dma-contiguous.c
 +++ b/drivers/base/dma-contiguous.c
 @@ -32,6 +32,7 @@
  #include linux/swap.h
  #include linux/mm_types.h
  #include <linux/dma-contiguous.h>
 +#include <linux/log2.h>
  
  struct cma {
   unsigned long   base_pfn;
 @@ -219,6 +220,7 @@ core_initcall(cma_init_reserved_areas);
   * @size: Size of the reserved area (in bytes),
   * @base: Base address of the reserved area optional, use 0 for any
   * @limit: End address of the reserved memory (optional, 0 for any).
 + * @alignment: Alignment for the contiguous memory area, should be power of 2
   * @res_cma: Pointer to store the created cma region.
   * @fixed: hint about where to place the reserved area
   *

Please move all of the description to the new API function rather than the internal one.

 @@ -233,15 +235,15 @@ core_initcall(cma_init_reserved_areas);
   */
  static int __init __dma_contiguous_reserve_area(phys_addr_t size,
   phys_addr_t base, phys_addr_t limit,
 + phys_addr_t alignment,
   struct cma **res_cma, bool fixed)
  {
   struct cma *cma = cma_areas[cma_area_count];
 - phys_addr_t alignment;
   int ret = 0;
  
 - pr_debug("%s(size %lx, base %08lx, limit %08lx)\n", __func__,
 -  (unsigned long)size, (unsigned long)base,
 -  (unsigned long)limit);
 + pr_debug("%s(size %lx, base %08lx, limit %08lx align_order %08lx)\n",

Why is it called align_order here?

 + __func__, (unsigned long)size, (unsigned long)base,
 + (unsigned long)limit, (unsigned long)alignment);
  
   /* Sanity checks */
   if (cma_area_count == ARRAY_SIZE(cma_areas)) {
 @@ -253,8 +255,17 @@ static int __init 
 __dma_contiguous_reserve_area(phys_addr_t size,
   if (!size)
   return -EINVAL;
  
 - /* Sanitise input arguments */
 - alignment = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
 + if (alignment && !is_power_of_2(alignment))
 + return -EINVAL;
 +
 + /*
 +  * Sanitise input arguments.
 +  * CMA area should be at least MAX_ORDER - 1 aligned. Otherwise,
 +  * CMA area could be merged into other MIGRATE_TYPE by buddy mechanism

I'm not a native speaker, but here is an attempt at clearer documentation:

 Pages at both ends of the CMA area could be merged into adjacent
 unmovable-migratetype pageblocks by the page allocator's buddy algorithm.
 In that case, you couldn't get contiguous memory, which is not what we want.

 +  * and CMA property will be broken.
 +  */
 + alignment = max(alignment,
 + (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
   base = ALIGN(base, alignment);
   size = ALIGN(size, alignment);
   limit &= ~(alignment - 1);
 @@ -302,7 +313,8 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
  {
   int ret;
  
 - ret = __dma_contiguous_reserve_area(size, base, limit, res_cma, fixed);
 + ret = __dma_contiguous_reserve_area(size, base, limit, 0,
 + res_cma, fixed);
   if (ret)
   return ret;
  
 -- 
 1.7.9.5
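
As a worked example of the alignment floor discussed in the review above
(values assume 4 KiB pages, MAX_ORDER = 11 and pageblock_order = 10; both
are configuration dependent):

	#include <stdio.h>

	int main(void)
	{
		unsigned long page_size = 4096;		/* 4 KiB */
		int max_order = 11, pageblock_order = 10;
		int shift = (max_order - 1 > pageblock_order) ?
					max_order - 1 : pageblock_order;
		unsigned long min_align = page_size << shift;

		/* 4096 << 10 = 4 MiB: base and size get rounded up to this */
		printf("minimum CMA alignment: %lu MiB\n", min_align >> 20);
		return 0;
	}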

-- 
Kind regards,
Minchan Kim
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v3 36/52] zsmalloc: Fix CPU hotplug callback registration

2014-03-14 Thread Minchan Kim
On Tue, Mar 11, 2014 at 02:09:59AM +0530, Srivatsa S. Bhat wrote:
 Subsystems that want to register CPU hotplug callbacks, as well as perform
 initialization for the CPUs that are already online, often do it as shown
 below:
 
   get_online_cpus();
 
   for_each_online_cpu(cpu)
   init_cpu(cpu);
 
   register_cpu_notifier(foobar_cpu_notifier);
 
   put_online_cpus();
 
 This is wrong, since it is prone to ABBA deadlocks involving the
 cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
 with CPU hotplug operations).
 
 Instead, the correct and race-free way of performing the callback
 registration is:
 
   cpu_notifier_register_begin();
 
   for_each_online_cpu(cpu)
   init_cpu(cpu);
 
   /* Note the use of the double underscored version of the API */
   __register_cpu_notifier(foobar_cpu_notifier);
 
   cpu_notifier_register_done();
 
 
 Fix the zsmalloc code by using this latter form of callback registration.
 
 Cc: Minchan Kim minc...@kernel.org
 Cc: Nitin Gupta ngu...@vflare.org
 Cc: Ingo Molnar mi...@kernel.org
 Cc: linux...@kvack.org
 Signed-off-by: Srivatsa S. Bhat srivatsa.b...@linux.vnet.ibm.com

Acked-by: Minchan Kim minc...@kernel.org

Thanks.

-- 
Kind regards,
Minchan Kim
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC v8 PATCH 00/20] memory-hotplug: hot-remove physical memory

2012-09-12 Thread Minchan Kim
On Tue, Sep 11, 2012 at 01:18:24PM +0800, Jerry wrote:
 Hi Kim,
 
 Thank you for your kindness. Let me clarify this:
 
 On the ARM architecture there is a 32-bit physical address space. However,
 the address space is normally divided into 8 banks, and each bank is
 enabled/disabled by a chip-select signal. On my platform, bank0 connects to
 one DDR chip and bank1 connects to another DDR chip. Each DDR chip has a
 capacity of 512MB and is integrated onto the main board, so it cannot be
 removed by hand. We can disable/enable each bank through the peripheral
 device controller registers.
 
 When the system enters the suspend state, if all the allocated pages can be
 migrated to one bank, there is no valid data left in the other bank. At that
 point I can disable the free bank; it isn't necessary to provide power to
 that chip in the suspend state. When the system resumes, I just need to
 enable it again.
 

Yes, I already know about it and about other attempts at this from a few
years ago[1]. Back then I weighed the power-consumption benefit during
suspend against the start-up latency of resume and the power cost of the
migration itself (page migration and I/O writes for migration), and
concluded that the gain is normally not big. :)
The situation could have changed these days as workloads change, but I'm
still skeptical about that approach.

Anyway, it's my private thought so you don't need to care about that.
If you are ready to submit the patchset, please send out.

1. http://lwn.net/Articles/478049/

Thanks.

- 
Kind regards,
Minchan Kim
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [RFC v8 PATCH 00/20] memory-hotplug: hot-remove physical memory

2012-09-10 Thread Minchan Kim
Hi Jerry,

On Tue, Sep 11, 2012 at 08:27:40AM +0800, Jerry wrote:
 Hi Wen,
 
 I have been assigned a job related to memory hotplug on the ARM architecture,
 so maybe I know some of the new issues about memory hotplug on ARM. I just
 enabled it on ARM, and it works well on my Android tablet now. However, I
 have not sent out my patches; the real reason is that I don't know how to do
 it. Maybe I need to read Documentation/SubmittingPatches.
 
 Hi Andrew,
 This is my first time sending you an e-mail. I am nervous about whether I
 have made any mistakes.

Don't be afraid.
Making a mistake is very natural for a newcomer, and I am sure nobody will
blame you. :)
If you have a good patch, please send it out.

 
 Some people may think memory hotplug needs to be supported by special
 hardware, but maybe that only applies to physical memory hotplug. Sometimes
 we just need logical memory hotplug, which doesn't remove the memory
 physically. It is also useful for power saving on my platform, because I
 don't want the offlined memory to sit in *self-refresh* state.

Just out of curiosity: what's your scenario, and what gain do you expect?
AFAIK, there have been some efforts along these lines on the embedded side,
but IIRC the gain wasn't very big.

 
 Any comments are appreciated.
 
 Thanks,
 Jerry
 
 2012/9/10 Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
 
  Hi,
 
  On Mon, Sep 10, 2012 at 10:01:44AM +0800, Wen Congyang wrote:
   At 09/10/2012 09:46 AM, Yasuaki Ishimatsu Wrote:
Hi Wen,
   
2012/09/01 5:49, Andrew Morton wrote:
On Tue, 28 Aug 2012 18:00:07 +0800
we...@cn.fujitsu.com wrote:
   
This patch series aims to support physical memory hot-remove.
   
I doubt if many people have hardware which permits physical memory
removal?  How would you suggest that people with regular hardware can
test these changes?
   
How do you test the patch? As Andrew says, for hot-removing memory,
we need a particular hardware. I think so too. So many people may want
to know how to test the patch.
If we apply following patch to kvm guest, can we hot-remove memory on
kvm guest?
   
http://lists.gnu.org/archive/html/qemu-devel/2012-07/msg01389.html
  
   Yes, if we apply this patchset, we can test hot-remove memory on kvm
  guest.
   But that patchset doesn't implement _PS3, so there is some restriction.
 
  the following repos contain the patchset above, plus 2 more patches that
  add
  PS3 support to the dimm devices in qemu/seabios:
 
  https://github.com/vliaskov/seabios/commits/memhp-v2
  https://github.com/vliaskov/qemu-kvm/commits/memhp-v2
 
  I have not posted the PS3 patches yet in the qemu list, but will post them
  soon for v3 of the memory hotplug series. If you have issues testing, let
  me
  know.
 
  thanks,
 
  - Vasilis
 
 
 
 
 
 -- 
 I love linux!!!

-- 
Kind regards,
Minchan Kim
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev