RE: [PATCH v4 1/2] dma-direct: provide the ability to reserve per-numa CMA

2020-07-28 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Tuesday, July 28, 2020 11:53 PM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> w...@kernel.org; ganapatrao.kulka...@cavium.com;
> catalin.mari...@arm.com; io...@lists.linux-foundation.org; Linuxarm
> ; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; Zengtao (B) ;
> huangdaode ; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v4 1/2] dma-direct: provide the ability to reserve
> per-numa CMA
> 
> On Fri, Jul 24, 2020 at 01:13:43AM +1200, Barry Song wrote:
> > +config CMA_PERNUMA_SIZE_MBYTES
> > +   int "Size in Mega Bytes for per-numa CMA areas"
> > +   depends on NUMA
> > +   default 16 if ARM64
> > +   default 0
> > +   help
> > + Defines the size (in MiB) of the per-numa memory area for Contiguous
> > + Memory Allocator. Every numa node will get a separate CMA with this
> > + size. If the size of 0 is selected, per-numa CMA is disabled.
> 
> I'm still not a fan of the config option.  You can just hardcode the
> value in CONFIG_CMDLINE based on the kernel parameter.  Also I wonder

I am sorry I haven't got your point yet. Do you mean something like the below?

arch/arm64/Kconfig:
config CMDLINE
string "Default kernel command string"
-   default ""
+   default "pernuma_cma=16M"
help
  Provide a set of default command-line options at build time by
  entering them here. As a minimum, you should specify the
  root device (e.g. root=/dev/nfs).

Some background on the current code: Linux distributions usually build the kernel
directly from arch/arm64/configs/defconfig, so a command line hardcoded via
CONFIG_CMDLINE can easily be dropped when distributions generate their kernels.
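
For completeness, this is roughly what the suggested build-time hardcoding would
look like as a defconfig fragment (illustrative only, not an existing file):

# arch/arm64/configs/defconfig (illustrative fragment)
CONFIG_CMDLINE="pernuma_cma=16M"
# and, if the built-in value must win over a bootloader-provided command line:
# CONFIG_CMDLINE_FORCE=y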

> if a way to expose this in the device tree might be useful, but people
> more familiar with the device tree and the arm code will have to chime
> in on that.

I am not sure it is a useful use case for us: since this is an ARM64 server with
NUMA, we are using ACPI rather than device tree.

> 
> >  struct cma *dma_contiguous_default_area;
> > +static struct cma *dma_contiguous_pernuma_area[MAX_NUMNODES];
> >
> >  /*
> >   * Default global CMA area size can be defined in kernel's .config.
> > @@ -44,6 +51,8 @@ struct cma *dma_contiguous_default_area;
> >   */
> >  static const phys_addr_t size_bytes __initconst =
> > (phys_addr_t)CMA_SIZE_MBYTES * SZ_1M;
> > +static phys_addr_t pernuma_size_bytes __initdata =
> > +   (phys_addr_t)CMA_SIZE_PERNUMA_MBYTES * SZ_1M;
> >  static phys_addr_t  size_cmdline __initdata = -1;
> >  static phys_addr_t base_cmdline __initdata;
> >  static phys_addr_t limit_cmdline __initdata;
> > @@ -69,6 +78,13 @@ static int __init early_cma(char *p)
> >  }
> >  early_param("cma", early_cma);
> >
> > +static int __init early_pernuma_cma(char *p)
> > +{
> > +   pernuma_size_bytes = memparse(p, &p);
> > +   return 0;
> > +}
> > +early_param("pernuma_cma", early_pernuma_cma);
> > +
> >  #ifdef CONFIG_CMA_SIZE_PERCENTAGE
> >
> >  static phys_addr_t __init __maybe_unused
> cma_early_percent_memory(void)
> > @@ -96,6 +112,33 @@ static inline __maybe_unused phys_addr_t
> cma_early_percent_memory(void)
> >
> >  #endif
> >
> > +void __init dma_pernuma_cma_reserve(void)
> > +{
> > +   int nid;
> > +
> > +   if (!pernuma_size_bytes)
> > +   return;
> > +
> > +   for_each_node_state(nid, N_MEMORY) {
> > +   int ret;
> > +   char name[20];
> > +
> > +   snprintf(name, sizeof(name), "pernuma%d", nid);
> > +   ret = cma_declare_contiguous_nid(0, pernuma_size_bytes, 0, 0,
> > +0, false, name,
> > +&dma_contiguous_pernuma_area[nid],
> > +nid);
> 
> This adds a > 80 char line.

Will refine.
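
One possible way to keep that call under 80 columns (just a sketch; the local
variable is only a suggestion):

	struct cma **cma = &dma_contiguous_pernuma_area[nid];

	snprintf(name, sizeof(name), "pernuma%d", nid);
	ret = cma_declare_contiguous_nid(0, pernuma_size_bytes, 0, 0,
					 0, false, name, cma, nid);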

> 
> >  struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t
> gfp)
> >  {
> > +   int nid = dev_to_node(dev);
> > +
> > /* CMA can be used only in the context which permits sleeping */
> > if (!gfpflags_allow_blocking(gfp))
> > return NULL;
> > if (dev->cma_area)
> > return cma_alloc_aligned(dev->cma_area, size, gfp);
> > -   if (size <= PAGE_SIZE || !dma_contiguous_default_area)
> > +

RE: [PATCH v3 1/2] dma-direct: provide the ability to reserve per-numa CMA

2020-07-23 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Friday, July 24, 2020 12:01 AM
> To: Song Bao Hua (Barry Song) 
> Cc: Christoph Hellwig ; m.szyprow...@samsung.com;
> robin.mur...@arm.com; w...@kernel.org; ganapatrao.kulka...@cavium.com;
> catalin.mari...@arm.com; io...@lists.linux-foundation.org; Linuxarm
> ; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport ;
> Zengtao (B) ; huangdaode
> 
> Subject: Re: [PATCH v3 1/2] dma-direct: provide the ability to reserve
> per-numa CMA
> 
> On Wed, Jul 22, 2020 at 09:41:50PM +, Song Bao Hua (Barry Song)
> wrote:
> > I got a kernel robot warning which said dev should be checked before
> > being accessed when I did a similar change in v1. Probably it was an
> > invalid warning if dev should never be null.
> 
> That usually shows up if a function is inconsistent about sometimes checking it
> and sometimes not.
> 
> > Yes, it looks much better.
> 
> Below is a prep patch to rebase on top of:

Thanks for letting me know.

Will rebase on top of your patch.

> 
> ---
> From b81a5e1da65fce9750f0a8b66dbb6f842cbfdd4d Mon Sep 17 00:00:00
> 2001
> From: Christoph Hellwig 
> Date: Wed, 22 Jul 2020 16:33:43 +0200
> Subject: dma-contiguous: cleanup dma_alloc_contiguous
> 
> Split out a cma_alloc_aligned helper to deal with the "interesting"
> calling conventions for cma_alloc, which then allows to the main function to
> be written straight forward.  This also takes advantage of the fact that NULL
> dev arguments have been gone from the DMA API for a while.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  kernel/dma/contiguous.c | 31 ++-
>  1 file changed, 14 insertions(+), 17 deletions(-)
> 
> diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c index
> 15bc5026c485f2..cff7e60968b9e1 100644
> --- a/kernel/dma/contiguous.c
> +++ b/kernel/dma/contiguous.c
> @@ -215,6 +215,13 @@ bool dma_release_from_contiguous(struct device
> *dev, struct page *pages,
>   return cma_release(dev_get_cma_area(dev), pages, count);  }
> 
> +static struct page *cma_alloc_aligned(struct cma *cma, size_t size,
> +gfp_t gfp) {
> + unsigned int align = min(get_order(size), CONFIG_CMA_ALIGNMENT);
> +
> + return cma_alloc(cma, size >> PAGE_SHIFT, align, gfp & __GFP_NOWARN);
> +}
> +
>  /**
>   * dma_alloc_contiguous() - allocate contiguous pages
>   * @dev:   Pointer to device for which the allocation is performed.
> @@ -231,24 +238,14 @@ bool dma_release_from_contiguous(struct device
> *dev, struct page *pages,
>   */
>  struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> {
> - size_t count = size >> PAGE_SHIFT;
> - struct page *page = NULL;
> - struct cma *cma = NULL;
> -
> - if (dev && dev->cma_area)
> - cma = dev->cma_area;
> - else if (count > 1)
> - cma = dma_contiguous_default_area;
> -
>   /* CMA can be used only in the context which permits sleeping */
> - if (cma && gfpflags_allow_blocking(gfp)) {
> - size_t align = get_order(size);
> - size_t cma_align = min_t(size_t, align, CONFIG_CMA_ALIGNMENT);
> -
> - page = cma_alloc(cma, count, cma_align, gfp & __GFP_NOWARN);
> - }
> -
> - return page;
> + if (!gfpflags_allow_blocking(gfp))
> + return NULL;
> + if (dev->cma_area)
> + return cma_alloc_aligned(dev->cma_area, size, gfp);
> + if (size <= PAGE_SIZE || !dma_contiguous_default_area)
> + return NULL;
> + return cma_alloc_aligned(dma_contiguous_default_area, size, gfp);
>  }
> 
>  /**
> --
> 2.27.0

Thanks
Barry



RE: [RESEND PATCH v5] mm/zswap: move to use crypto_acomp API for hardware acceleration

2020-07-22 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Friday, July 17, 2020 1:51 AM
> To: a...@linux-foundation.org; herb...@gondor.apana.org.au;
> da...@davemloft.net
> Cc: linux-cry...@vger.kernel.org; linux...@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm ; Song Bao
> Hua (Barry Song) ; Luis Claudio R . Goncalves
> ; Sebastian Andrzej Siewior ;
> Mahipal Challa ; Seth Jennings
> ; Dan Streetman ; Vitaly Wool
> ; Wangzhou (B) ;
> fanghao (A) ; Colin Ian King
> 
> Subject: [RESEND PATCH v5] mm/zswap: move to use crypto_acomp API for
> hardware acceleration
> 
> Right now, all new ZIP drivers are adapted to crypto_acomp APIs rather than
> legacy crypto_comp APIs. Traditional ZIP drivers like lz4, lzo, etc. have also
> been wrapped into acomp via the scomp backend. But zswap.c is still using the old APIs.
> That means zswap won't be able to work on any new ZIP drivers in kernel.
> 
> This patch moves to the crypto_acomp APIs to fix the disconnected bridge
> between new ZIP drivers and zswap. It is probably the first real user to use
> acomp but perhaps not a good example to demonstrate how multiple acomp
> requests can be executed in parallel in one acomp instance.
> frontswap is doing page load and store page by page synchronously.
> swap_writepage() depends on the completion of frontswap_store() to decide if
> it should call __swap_writepage() to swap to disk.
> 
> However this patch creates multiple acomp instances, so multiple threads
> running on multiple different cpus can actually do (de)compression in parallel,
> leveraging the power of multiple ZIP hardware queues. This is also consistent
> with frontswap's page management model.
> 
> The old zswap code uses atomic context and avoids the race conditions while
> shared resources like zswap_dstmem are accessed. Here since acomp can sleep,
> per-cpu mutex is used to replace preemption-disable.
> 
> While it is possible to make mm/page_io.c and mm/frontswap.c support async
> (de)compression in some way, the entire design requires careful thinking and
> performance evaluation. For the first step, the base with fixed connection
> between ZIP drivers and zswap should be built.
> 
> Cc: Luis Claudio R. Goncalves 
> Cc: Sebastian Andrzej Siewior 
> Cc: Andrew Morton 
> Cc: Herbert Xu 
> Cc: David S. Miller 
> Cc: Mahipal Challa 
> Cc: Seth Jennings 
> Cc: Dan Streetman 
> Cc: Vitaly Wool 
> Cc: Zhou Wang 
> Cc: Hao Fang 
> Cc: Colin Ian King 
> Signed-off-by: Barry Song 
> ---
>  v5: address two comments from Sebastian Andrzej Siewior, thanks!
>1. use pointer rather than pointer's pointer for acomp_ctx;
>2. fix the race while multiple zpool exist while dynamically switching
>   compressor and zpool type

Hi All,
Any further comments?

> 
>  mm/zswap.c | 183 -
>  1 file changed, 138 insertions(+), 45 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index fbb782924ccc..8e9c18b6fdd9 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -24,8 +24,10 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
> +#include 
> 
>  #include 
>  #include 
> @@ -127,9 +129,17 @@ module_param_named(same_filled_pages_enabled,
> zswap_same_filled_pages_enabled,
>  * data structures
>  **/
> 
> +struct crypto_acomp_ctx {
> + struct crypto_acomp *acomp;
> + struct acomp_req *req;
> + struct crypto_wait wait;
> + u8 *dstmem;
> + struct mutex *mutex;
> +};
> +
>  struct zswap_pool {
>   struct zpool *zpool;
> - struct crypto_comp * __percpu *tfm;
> + struct crypto_acomp_ctx __percpu *acomp_ctx;
>   struct kref kref;
>   struct list_head list;
>   struct work_struct release_work;
> @@ -388,23 +398,43 @@ static struct zswap_entry
> *zswap_entry_find_get(struct rb_root *root,
>  * per-cpu code
>  **/
>  static DEFINE_PER_CPU(u8 *, zswap_dstmem);
> +/*
> + * If users dynamically change the zpool type and compressor at runtime, i.e.
> + * zswap is running, zswap can have more than one zpool on one cpu, but
> +they
> + * are sharing dtsmem. So we need this mutex to be per-cpu.
> + */
> +static DEFINE_PER_CPU(struct mutex *, zswap_mutex);
> 
>  static int zswap_dstmem_prepare(unsigned int cpu)  {
> + struct mutex *mutex;
>   u8 *dst;
> 
>   dst = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
>   if (!dst)
>   return -ENOMEM;
> 
> + mutex = kmalloc_node(sizeof(*mutex), GFP_KERNEL, cpu_to_node(cpu));
> + if (!mutex) {
> + kfree(dst);
> 

RE: [PATCH v3 1/2] dma-direct: provide the ability to reserve per-numa CMA

2020-07-22 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Thursday, July 23, 2020 2:30 AM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> w...@kernel.org; ganapatrao.kulka...@cavium.com;
> catalin.mari...@arm.com; io...@lists.linux-foundation.org; Linuxarm
> ; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v3 1/2] dma-direct: provide the ability to reserve
> per-numa CMA
> 
+cc Prime and Daode who are interested in this patchset.

> On Sun, Jun 28, 2020 at 11:12:50PM +1200, Barry Song wrote:
> >  struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t
> gfp)
> >  {
> > size_t count = size >> PAGE_SHIFT;
> > struct page *page = NULL;
> > struct cma *cma = NULL;
> > +   int nid = dev ? dev_to_node(dev) : NUMA_NO_NODE;
> > +   bool alloc_from_pernuma = false;
> > +
> > +   if ((count <= 1) && !(dev && dev->cma_area))
> > +   return NULL;
> >
> > if (dev && dev->cma_area)
> > cma = dev->cma_area;
> > -   else if (count > 1)
> > +   else if ((nid != NUMA_NO_NODE) &&
> dma_contiguous_pernuma_area[nid]
> > +   && !(gfp & (GFP_DMA | GFP_DMA32))) {
> > +   cma = dma_contiguous_pernuma_area[nid];
> > +   alloc_from_pernuma = true;
> > +   } else {
> > cma = dma_contiguous_default_area;
> > +   }
> 
> I find the function rather confusing now.  What about something
> like (this relies on the fact that dev should never be NULL in the
> DMA API)
> 
> struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> {
>   size_t cma_align = min_t(size_t, get_order(size),
> CONFIG_CMA_ALIGNMENT);
>   size_t count = size >> PAGE_SHIFT;
> 
>   if (gfpflags_allow_blocking(gfp))
>   return NULL;
>   gfp &= __GFP_NOWARN;
> 
>   if (dev->cma_area)

When I made a similar change in v1, I got a kernel test robot warning saying that
dev should be checked before being accessed. That warning is probably invalid if
dev can never be NULL here.

>   return cma_alloc(dev->cma_area, count, cma_align, gfp);
>   if (count <= 1)
>   return NULL;
> 
>   if (IS_ENABLED(CONFIG_PERNODE_CMA) && !(gfp & (GFP_DMA |
> GFP_DMA32)) {
>   int nid = dev_to_node(dev);
>   struct cma *cma = dma_contiguous_pernuma_area[nid];
>   struct page *page;
> 
>   if (cma) {
>   page = cma_alloc(cma, count, cma_align, gfp);
>   if (page)
>   return page;
>   }
>   }
> 
>   return cma_alloc(dma_contiguous_default_area, count, cma_align, gfp);
> }

Yes, it looks much better.

> 
> > +   /*
> > +* otherwise, page is from either per-numa cma or default cma
> > +*/
> > +   int nid = page_to_nid(page);
> > +
> > +   if (nid != NUMA_NO_NODE) {
> > +   if (cma_release(dma_contiguous_pernuma_area[nid], page,
> > +   PAGE_ALIGN(size) >> PAGE_SHIFT))
> > +   return;
> > +   }
> > +
> > +   if (cma_release(dma_contiguous_default_area, page,
> 
> How can page_to_nid ever return NUMA_NO_NODE?

I thought page_to_nid would return NUMA_NO_NODE if CONFIG_NUMA is
not enabled. Probably I was wrong. Will get it fixed in v4.
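
For reference: with !CONFIG_NUMA the node field in page->flags is zero bits wide,
so page_to_nid() always evaluates to 0 and can never return NUMA_NO_NODE (-1).
Roughly, simplified from include/linux/mm.h:

	static inline int page_to_nid(const struct page *page)
	{
		/* NODES_SHIFT is 0 without NUMA, so NODES_MASK is 0 and this is always 0 */
		return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
	}

So the NUMA_NO_NODE check in the free path is unnecessary either way.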

Thanks
Barry



RE: [PATCH v3 1/2] dma-direct: provide the ability to reserve per-numa CMA

2020-07-22 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Thursday, July 23, 2020 2:17 AM
> To: Song Bao Hua (Barry Song) 
> Cc: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> w...@kernel.org; ganapatrao.kulka...@cavium.com;
> catalin.mari...@arm.com; io...@lists.linux-foundation.org; Linuxarm
> ; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; Jonathan Cameron
> ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v3 1/2] dma-direct: provide the ability to reserve
> per-numa CMA
> 

+cc Prime and Daode who are interested in this patchset.

> On Sun, Jun 28, 2020 at 11:12:50PM +1200, Barry Song wrote:
> > This is useful for at least two scenarios:
> > 1. ARM64 smmu will get memory from local numa node, it can save its
> > command queues and page tables locally. Tests show it can decrease
> > dma_unmap latency at lot. For example, without this patch, smmu on
> > node2 will get memory from node0 by calling dma_alloc_coherent(),
> > typically, it has to wait for more than 560ns for the completion of
> > CMD_SYNC in an empty command queue; with this patch, it needs 240ns
> > only.
> > 2. when we set iommu passthrough, drivers will get memory from CMA,
> > local memory means much less latency.
> 
> I really don't like the config options.  With the boot parameters
> you can always hardcode that in CONFIG_CMDLINE anyway.

I understand your concern. Anyway, the primary purpose of this patchset is to
provide a general way for users like the IOMMU to get local coherent DMA buffers
in which to put their command queues and page tables. The first use case is what
really made me start preparing this patchset.

The second case is more of a positive side effect for users who care more about
performance than DMA isolation, and who may therefore bypass the IOMMU via:

iommu.passthrough=	[ARM64, X86] Configure DMA to bypass the IOMMU by default.
			Format: { "0" | "1" }
			0 - Use IOMMU translation for DMA.
			1 - Bypass the IOMMU for DMA.
			unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH.

In that case they can get local memory and better performance. However, that is
not the primary purpose of this patchset.
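
Just as an illustration (the values are examples, not recommendations), such a
user would boot with something like:

	iommu.passthrough=1 pernuma_cma=16M

so DMA bypasses the IOMMU while dma_alloc_coherent() still hands out node-local
CMA memory.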

Thanks
Barry



RE: [PATCH] iommu/arm-smmu-v3: remove the approach of MSI polling for CMD SYNC

2020-07-20 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Friday, July 17, 2020 9:06 PM
> To: 'Robin Murphy' ; w...@kernel.org;
> j...@8bytes.org
> Cc: linux-kernel@vger.kernel.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; io...@lists.linux-foundation.org;
> Zengtao (B) 
> Subject: RE: [PATCH] iommu/arm-smmu-v3: remove the approach of MSI
> polling for CMD SYNC
> 
> 
> 
> > -Original Message-
> > From: Robin Murphy [mailto:robin.mur...@arm.com]
> > Sent: Friday, July 17, 2020 8:55 PM
> > To: Song Bao Hua (Barry Song) ;
> w...@kernel.org;
> > j...@8bytes.org
> > Cc: linux-kernel@vger.kernel.org; Linuxarm ;
> > linux-arm-ker...@lists.infradead.org; io...@lists.linux-foundation.org;
> > Zengtao (B) 
> > Subject: Re: [PATCH] iommu/arm-smmu-v3: remove the approach of MSI
> > polling for CMD SYNC
> >
> > On 2020-07-17 00:07, Barry Song wrote:
> > > Before commit 587e6c10a7ce ("iommu/arm-smmu-v3: Reduce contention
> > during
> > > command-queue insertion"), msi polling perhaps performed better since
> > > it could run outside the spin_lock_irqsave() while the code polling cons
> > > reg was running in the lock.
> > >
> > > But after the great reorganization of smmu queue, neither of these two
> > > polling methods are running in a spinlock. And real tests show polling
> > > cons reg via sev means smaller latency. It is probably because polling
> > > by msi will ask hardware to write memory but sev polling depends on the
> > > update of register only.
> > >
> > > Using 16 threads to run netperf on hns3 100G NIC with UDP packet size
> > > in 32768bytes and set iommu to strict, TX throughput can improve from
> > > 25227.74Mbps to 27145.59Mbps by this patch. In this case, SMMU is
> super
> > > busy as hns3 sends map/unmap requests extremely frequently.
> >
> > How many different systems and SMMU implementations are those numbers
> > representative of? Given that we may have cases where the SMMU can use
> > MSIs but can't use SEV, so would have to fall back to inefficient
> > busy-polling, I'd be wary of removing this entirely. Allowing particular
> > platforms or SMMU implementations to suppress MSI functionality if they
> > know for sure it makes sense seems like a safer bet.
> >
> Hello Robin,
> 
> Thanks for taking a look. Actually I was really struggling with the good way 
> to
> make every platform happy.
> And I don't have other platforms to test and check if those platforms run
> better by sev polling. Even two
> platforms have completely same SMMU features, it is still possible they
> behave differently.
> So I simply sent this patch to get the discussion started to get opinions.
> 
> At the first beginning, I wanted to have a module parameter for users to 
> decide
> if msi polling should be disabled.
> But the module parameter might be totally ignored by linux distro.
> 
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -418,6 +418,11 @@ module_param_named(disable_bypass,
> disable_bypass, bool, S_IRUGO);  MODULE_PARM_DESC(disable_bypass,
>   "Disable bypass streams such that incoming transactions from devices
> that are not attached to an iommu domain will report an abort back to the
> device and will not be allowed to pass through the SMMU.");
> 
> +static bool disable_msipolling = 1;
> +module_param_named(disable_msipolling, disable_msipolling, bool,
> +S_IRUGO); MODULE_PARM_DESC(disable_msipolling,
> + "Don't use MSI to poll the completion of CMD_SYNC if it is slower than
> +SEV");
> +
>  enum pri_resp {
>   PRI_RESP_DENY = 0,
>   PRI_RESP_FAIL = 1,
> @@ -992,7 +997,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64
> *cmd, struct arm_smmu_device *smmu,
>* Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
>* payload, so the write will zero the entire command on that platform.
>*/
> - if (smmu->features & ARM_SMMU_FEAT_MSI &&
> + if (!disable_msipolling && smmu->features & ARM_SMMU_FEAT_MSI &&
>   smmu->features & ARM_SMMU_FEAT_COHERENCY) {
>   ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
>  q->ent_dwords * 8;
> @@ -1332,7 +1337,7 @@ static int
> __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu,
> static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
>struct arm_smmu_ll_queue *llq)
>  {
> - if (smmu->features & ARM_

RE: [PATCH] iommu/arm-smmu-v3: remove the approach of MSI polling for CMD SYNC

2020-07-17 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Friday, July 17, 2020 8:55 PM
> To: Song Bao Hua (Barry Song) ; w...@kernel.org;
> j...@8bytes.org
> Cc: linux-kernel@vger.kernel.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; io...@lists.linux-foundation.org;
> Zengtao (B) 
> Subject: Re: [PATCH] iommu/arm-smmu-v3: remove the approach of MSI
> polling for CMD SYNC
> 
> On 2020-07-17 00:07, Barry Song wrote:
> > Before commit 587e6c10a7ce ("iommu/arm-smmu-v3: Reduce contention
> during
> > command-queue insertion"), msi polling perhaps performed better since
> > it could run outside the spin_lock_irqsave() while the code polling cons
> > reg was running in the lock.
> >
> > But after the great reorganization of smmu queue, neither of these two
> > polling methods are running in a spinlock. And real tests show polling
> > cons reg via sev means smaller latency. It is probably because polling
> > by msi will ask hardware to write memory but sev polling depends on the
> > update of register only.
> >
> > Using 16 threads to run netperf on hns3 100G NIC with UDP packet size
> > in 32768bytes and set iommu to strict, TX throughput can improve from
> > 25227.74Mbps to 27145.59Mbps by this patch. In this case, SMMU is super
> > busy as hns3 sends map/unmap requests extremely frequently.
> 
> How many different systems and SMMU implementations are those numbers
> representative of? Given that we may have cases where the SMMU can use
> MSIs but can't use SEV, so would have to fall back to inefficient
> busy-polling, I'd be wary of removing this entirely. Allowing particular
> platforms or SMMU implementations to suppress MSI functionality if they
> know for sure it makes sense seems like a safer bet.
> 
Hello Robin,

Thanks for taking a look. Actually I was really struggling to find a good way to
make every platform happy, and I don't have other platforms on which to test
whether they would run better with SEV polling. Even if two platforms report
exactly the same SMMU features, they may still behave differently. So I simply
sent this patch to get the discussion started and collect opinions.

Initially I wanted to add a module parameter so users could decide whether MSI
polling should be disabled, but a module parameter might simply be ignored by
Linux distributions.

--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -418,6 +418,11 @@ module_param_named(disable_bypass, disable_bypass, bool, S_IRUGO);
 MODULE_PARM_DESC(disable_bypass,
	"Disable bypass streams such that incoming transactions from devices that are not attached to an iommu domain will report an abort back to the device and will not be allowed to pass through the SMMU.");
 
+static bool disable_msipolling = 1;
+module_param_named(disable_msipolling, disable_msipolling, bool, S_IRUGO);
+MODULE_PARM_DESC(disable_msipolling,
+	"Don't use MSI to poll the completion of CMD_SYNC if it is slower than SEV");
+
 enum pri_resp {
 	PRI_RESP_DENY = 0,
 	PRI_RESP_FAIL = 1,
@@ -992,7 +997,7 @@ static void arm_smmu_cmdq_build_sync_cmd(u64 *cmd, struct arm_smmu_device *smmu,
 	 * Beware that Hi16xx adds an extra 32 bits of goodness to its MSI
 	 * payload, so the write will zero the entire command on that platform.
 	 */
-	if (smmu->features & ARM_SMMU_FEAT_MSI &&
+	if (!disable_msipolling && smmu->features & ARM_SMMU_FEAT_MSI &&
 	    smmu->features & ARM_SMMU_FEAT_COHERENCY) {
 		ent.sync.msiaddr = q->base_dma + Q_IDX(&q->llq, prod) *
 				   q->ent_dwords * 8;
@@ -1332,7 +1337,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu,
 static int arm_smmu_cmdq_poll_until_sync(struct arm_smmu_device *smmu,
 					 struct arm_smmu_ll_queue *llq)
 {
-	if (smmu->features & ARM_SMMU_FEAT_MSI &&
+	if (!disable_msipolling && smmu->features & ARM_SMMU_FEAT_MSI &&
 	    smmu->features & ARM_SMMU_FEAT_COHERENCY)
 		return __arm_smmu_cmdq_poll_until_msi(smmu, llq);


Another option is to not use a module parameter at all; instead, we could check the
vendor/chip ID, and if a chip is known to perform better with SEV polling, set
disable_msipolling to true for it.
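
As a rough sketch of that idea (the helper and the notion of keying off the
implementation identity are hypothetical here, not existing code):

	/* hypothetical: true for implementations known to do better with SEV polling */
	static bool arm_smmu_impl_prefers_sev(struct arm_smmu_device *smmu)
	{
		/* e.g. match known vendor/part IDs, or a firmware-provided quirk */
		return false;
	}

	...
	if (arm_smmu_impl_prefers_sev(smmu))
		disable_msipolling = true;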

You are very welcome to give your suggestions.

Thanks
Barry


RE: [PATCH v3 0/2] make dma_alloc_coherent NUMA-aware by per-NUMA CMA

2020-07-12 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Sunday, June 28, 2020 11:13 PM
> To: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com;
> w...@kernel.org; ganapatrao.kulka...@cavium.com;
> catalin.mari...@arm.com
> Cc: io...@lists.linux-foundation.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org; Song Bao
> Hua (Barry Song) 
> Subject: [PATCH v3 0/2] make dma_alloc_coherent NUMA-aware by
> per-NUMA CMA
> 
> Ganapatrao Kulkarni has put some effort on making arm-smmu-v3 use local
> memory to save command queues[1]. I also did similar job in patch
> "iommu/arm-smmu-v3: allocate the memory of queues in local numa node"
> [2] while not realizing Ganapatrao has done that before.
> 
> But it seems it is much better to make dma_alloc_coherent() to be inherently
> NUMA-aware on NUMA-capable systems.
> 
> Right now, smmu is using dma_alloc_coherent() to get memory to save queues
> and tables. Typically, on ARM64 server, there is a default CMA located at
> node0, which could be far away from node2, node3 etc.
> Saving queues and tables remotely will increase the latency of ARM SMMU
> significantly. For example, when SMMU is at node2 and the default global
> CMA is at node0, after sending a CMD_SYNC in an empty command queue, we
> have to wait more than 550ns for the completion of the command
> CMD_SYNC.
> However, if we save them locally, we only need to wait for 240ns.
> 
> with per-numa CMA, smmu will get memory from local numa node to save
> command queues and page tables. that means dma_unmap latency will be
> shrunk much.
> 
> Meanwhile, when iommu.passthrough is on, device drivers which call dma_
> alloc_coherent() will also get local memory and avoid the travel between
> numa nodes.
> 
> [1]
> https://lists.linuxfoundation.org/pipermail/iommu/2017-October/024455.htm
> l
> [2] https://www.spinics.net/lists/iommu/msg44767.html
> 
> -v3:
>   * move to use page_to_nid() while freeing cma with respect to Robin's
>   comment, but this will only work after applying my below patch:
>   "mm/cma.c: use exact_nid true to fix possible per-numa cma leak"
>   https://marc.info/?l=linux-mm=159333034726647=2
> 
>   * handle the case count <= 1 more properly according to Robin's
>   comment;
> 
>   * add pernuma_cma parameter to support dynamic setting of per-numa
>   cma size;
>   ideally we can leverage the CMA_SIZE_MBYTES, CMA_SIZE_PERCENTAGE and
>   "cma=" kernel parameter and avoid a new paramter separately for per-
>   numa cma. Practically, it is really too complicated considering the
>   below problems:
>   (1) if we leverage the size of default numa for per-numa, we have to
>   avoid creating two cma with same size in node0 since default cma is
>   probably on node0.
>   (2) default cma can consider the address limitation for old devices
>   while per-numa cma doesn't support GFP_DMA and GFP_DMA32. all
>   allocations with limitation flags will fallback to default one.
>   (3) hard to apply CMA_SIZE_PERCENTAGE to per-numa. it is hard to
>   decide if the percentage should apply to the whole memory size
>   or only apply to the memory size of a specific numa node.
>   (4) default cma size has CMA_SIZE_SEL_MIN and CMA_SIZE_SEL_MAX, it
>   makes things even more complicated to per-numa cma.
> 
>   I haven't figured out a good way to leverage the size of default cma
>   for per-numa cma. it seems a separate parameter for per-numa could
>   make life easier.
> 
>   * move dma_pernuma_cma_reserve() after hugetlb_cma_reserve() to
>   reuse the comment before hugetlb_cma_reserve() with respect to
>   Robin's comment
> 
> -v2:
>   * fix some issues reported by kernel test robot
>   * fallback to default cma while allocation fails in per-numa cma
>  free memory properly
> 
> Barry Song (2):
>   dma-direct: provide the ability to reserve per-numa CMA
>   arm64: mm: reserve per-numa CMA to localize coherent dma buffers
> 
>  .../admin-guide/kernel-parameters.txt |  9 ++
>  arch/arm64/mm/init.c  |  2 +
>  include/linux/dma-contiguous.h|  4 +
>  kernel/dma/Kconfig| 10 ++
>  kernel/dma/contiguous.c   | 98
> +--
>  5 files changed, 114 insertions(+), 9 deletions(-)

Gentle ping :-)

Thanks
Barry



RE: [PATCH v2] mm/hugetlb: split hugetlb_cma in nodes with memory

2020-07-10 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Anshuman Khandual [mailto:anshuman.khand...@arm.com]
> Sent: Friday, July 10, 2020 8:59 PM
> To: Song Bao Hua (Barry Song) ;
> a...@linux-foundation.org
> Cc: x...@kernel.org; linux...@kvack.org; linux-kernel@vger.kernel.org;
> Linuxarm ; linux-arm-ker...@lists.infradead.org;
> Roman Gushchin ; Catalin Marinas
> ; Will Deacon ; Thomas Gleixner
> ; Ingo Molnar ; Borislav Petkov
> ; H . Peter Anvin ; Mike Kravetz
> ; Mike Rapoport ; Jonathan
> Cameron 
> Subject: Re: [PATCH v2] mm/hugetlb: split hugetlb_cma in nodes with memory
> 
> 
> 
> On 07/10/2020 09:20 AM, Barry Song wrote:
> > Rather than splitting huge_cma in online nodes, it is better to do it in
> > nodes with memory.
> > Without this patch, for an ARM64 server with four numa nodes and only
> > node0 has memory. If I set hugetlb_cma=4G in bootargs,
> >
> > without this patch, I got the below printk:
> > hugetlb_cma: reserve 4096 MiB, up to 1024 MiB per node
> > hugetlb_cma: reserved 1024 MiB on node 0
> > hugetlb_cma: reservation failed: err -12, node 1
> > hugetlb_cma: reservation failed: err -12, node 2
> > hugetlb_cma: reservation failed: err -12, node 3
> 
> Seems like you have not changed the commit message from the last time.
> This does not explain the problem as we had discussed before. This just
> states the effects of the problem not the cause which is really required
> when there is a proposal for change in behavior.

OK, I will think about better wording.
That said, the changelog clearly shows that we get only 1GB of CMA even though we
asked for a 4GB size, and you understood that from the very beginning when you
commented, which suggests it is clear enough for people to see what is going
on :-)

> 
> >
> > With this patch, I got the below printk:
> >
> > hugetlb_cma: reserve 4096 MiB, up to 4096 MiB per node
> > hugetlb_cma: reserved 4096 MiB on node 0
> 
> Before and after the patch behavior should be described after the problem
> and it's solution have been articulated.

Ok, can add some summary.

> 
> >
> > So this patch makes the hugetlb_cma size consistent with users' setting
> > on ARM64 platforms.
> 
> As mentioned before, there is nothing arm64 specific here. The problem just
> manifests on arm64 platform because of it's memory init process.
> 

Apologies if my wording caused a misunderstanding.

Here ARM64 is only used as an example of where the problem gets fixed; it has
never been claimed that the problem is specific to ARM64. Before the change we get
1GB of CMA; after the patch we get 4GB of CMA, matching the user's setting. The
size issue is not specific to ARM64.

And the changelog also said what you wanted to say:
 The problem is always there if N_ONLINE != N_MEMORY. In x86 case, it
 is just hidden because N_ONLINE happen to match N_MEMORY during the
 boot process when hugetlb_cma_reserve() gets called.

Will think about some better words.

> >
> > Jonathan Cameron tested this patch on x86 platform. Jonathan figured out
> > the boot code of x86 is much different with arm64. On arm64 all nodes are
> > marked online at the same time. On x86, only nodes with memory are
> > initially marked as online:
> > initmem_init()->x86_numa_init()->numa_init()->
> > numa_register_memblks()->alloc_node_data()->node_set_online()
> 
> Function call sequence should be described (if at all) cleanly with proper
> indentation.

Agreed

> 
> > So at time of the existing cma setup call only the memory containing nodes
> > are online. The other nodes are brought up much later.
> > Therefore, on x86 platform, hugetlb_cma size is actually consistent with
> > users' setting even though system has nodes without memory.
> >
> > The problem is always there if N_ONLINE != N_MEMORY. In x86 case, it
> > is just hidden because N_ONLINE happen to match N_MEMORY during the
> boot
> > process when hugetlb_cma_reserve() gets called.
> >
> > This patch documents this problem in the comment of
> hugetlb_cma_reserve()
> > and makes hugetlb_cma size optimal.
> 
> The commit message here is not proper for the change it's proposing. It needs
> to describe the problem, how it manifests on arm64 but remains hidden on
> x86,
> proposed solution i.e changes in generic and platform code, also if required
> before and after patch behavior on arm64 which had triggered this patch. Also
> please take care of the indentation, paragraph formatting and avoid writing
> in first person singular format.

Ok. will try one more time.

> 
> >
> > Cc: Roman Gushchin 
> > Cc: Catalin Marinas 
> > Cc: Will Deacon 
> > Cc: Thomas Gleixner 
> > Cc

RE: [PATCH v3] mm/hugetlb: avoid hardcoding while checking if cma is enable

2020-07-09 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Mike Kravetz [mailto:mike.krav...@oracle.com]
> Sent: Friday, July 10, 2020 6:58 AM
> To: Song Bao Hua (Barry Song) ; Roman
> Gushchin 
> Cc: Andrew Morton ; linux...@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm ; Jonathan
> Cameron 
> Subject: Re: [PATCH v3] mm/hugetlb: avoid hardcoding while checking if cma
> is enable
> 
> Looks like this produced a warning in linux-next.  I suspect it is due to the
> combination CONFIG_HUGETLB_PAGE && !CONFIG_CMA.
> 
> Instead of adding the routine hugetlb_cma_enabled() to scan the hugetlb_cma
> array, could we just use a boolean as follows?  It can simply be set in
> hugetlb_cma_reserve when we reserve CMA.

Maybe just use hugetlb_cma_size? If hugetlb_cma_size is not 0, someone is trying
to use CMA, so boot-time (bootmem) allocation of gigantic pages will be ignored
entirely, according to the discussion here:
https://lkml.org/lkml/2020/7/8/1288

If somebody sets a bogus hugetlb_cma_size so that no CMA actually gets reserved,
is that the user's fault? Then we would only need to document that hugetlb_cma
overrides boot-time reservations.
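
In other words, something like the sketch below, assuming hugetlb_cma_size is kept
visible after parsing instead of adding a new flag:

	/* in hugetlb_hstate_alloc_pages() */
	if (hstate_is_gigantic(h)) {
		if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma_size) {
			pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
			break;
		}
		...
	}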

> --
> Mike Kravetz
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index fab4485b9e52..92cb882cf287 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -46,6 +46,7 @@ unsigned int default_hstate_idx;
>  struct hstate hstates[HUGE_MAX_HSTATE];
> 
>  static struct cma *hugetlb_cma[MAX_NUMNODES];
> +static bool hugetlb_cma_enabled = false;
> 
>  /*
>   * Minimum page order among possible hugepage sizes, set to a proper value
> @@ -2571,7 +2572,7 @@ static void __init hugetlb_hstate_alloc_pages(struct
> hstate *h)
> 
>   for (i = 0; i < h->max_huge_pages; ++i) {
>   if (hstate_is_gigantic(h)) {
> - if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
> + if (hugetlb_cma_enabled) {
>   pr_warn_once("HugeTLB: hugetlb_cma is enabled, 
> skip
> boot time allocation\n");
>   break;
>   }
> @@ -5708,6 +5709,7 @@ void __init hugetlb_cma_reserve(int order)
>   reserved += size;
>   pr_info("hugetlb_cma: reserved %lu MiB on node %d\n",
>   size / SZ_1M, nid);
> + hugetlb_cma_enabled = true;
> 
>   if (reserved >= hugetlb_cma_size)
>   break;

Thanks
Barry



RE: [PATCH] arm64: mm: free unused memmap for sparse memory model that define VMEMMAP

2020-07-09 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: liwei (CM)
> Sent: Wednesday, July 8, 2020 7:52 PM
> To: Song Bao Hua (Barry Song) ;
> catalin.mari...@arm.com; w...@kernel.org
> Cc: fengbaopeng ; nsaenzjulie...@suse.de;
> steve.cap...@arm.com; r...@linux.ibm.com;
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org; sujunfei
> ; Xiaqing (A) ;
> Yaobaofeng (Yaobaofeng) 
> Subject: 答复: [PATCH] arm64: mm: free unused memmap for sparse memory
> model that define VMEMMAP
> 
> Hi, baohua
> 
> Thank you for your attention.
> 
> In my understanding this patch has no effect on MEMORY_HOTPLUG. The reason is
> that in sparse_add_one_section() the memmap is allocated from the slab once the
> kernel has finished booting, so that memory has nothing to do with the memblock
> allocations and frees done while the kernel starts.
> 
> You may have a look at vmemmap_alloc_block() to see this.
> 
> If my understanding is wrong, please feel free to point it out.

At first glance at this patch, I suspected that this bootmem might still be needed
by hot-added memory. If you confirm that cannot happen, please ignore my noise.

BTW, next time, please bear in mind that top-posting is not a good way to reply to
mail :-)

> 
> Thanks!
> 
> 
> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Wednesday, July 8, 2020 3:19 PM
> To: liwei (CM) ; catalin.mari...@arm.com;
> w...@kernel.org
> Cc: fengbaopeng ; nsaenzjulie...@suse.de;
> steve.cap...@arm.com; r...@linux.ibm.com;
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org; sujunfei
> 
> Subject: RE: [PATCH] arm64: mm: free unused memmap for sparse memory
> model that define VMEMMAP
> 
> 
> 
> > -Original Message-
> > From: liwei (CM)
> > Sent: Wednesday, July 8, 2020 1:56 PM
> > To: catalin.mari...@arm.com; w...@kernel.org
> > Cc: liwei (CM) ; fengbaopeng
> > ; nsaenzjulie...@suse.de;
> > steve.cap...@arm.com; r...@linux.ibm.com; Song Bao Hua (Barry Song)
> > ; linux-arm-ker...@lists.infradead.org;
> > linux-kernel@vger.kernel.org; sujunfei 
> > Subject: [PATCH] arm64: mm: free unused memmap for sparse memory
> model
> > that define VMEMMAP
> >
> > For the memory hole, sparse memory model that define
> SPARSEMEM_VMEMMAP
> > do not free the reserved memory for the page map, this patch do it.
> 
> Hello Wei,
> Just curious if this patch breaks MEMORY_HOTPLUG?
> 
> >
> > Signed-off-by: Wei Li 
> > Signed-off-by: Chen Feng 
> > Signed-off-by: Xia Qing 
> > ---
> >  arch/arm64/mm/init.c | 81
> > +---
> >  1 file changed, 71 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index
> > 1e93cfc7c47a..d1b56b47d5ba 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -441,7 +441,48 @@ void __init bootmem_init(void)
> > memblock_dump_all();
> >  }
> >
> 
> Thanks
> Barry



RE: [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration

2020-07-09 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: linux-crypto-ow...@vger.kernel.org
> [mailto:linux-crypto-ow...@vger.kernel.org] On Behalf Of Sebastian Andrzej
> Siewior
> Sent: Thursday, July 9, 2020 7:17 PM
> To: Song Bao Hua (Barry Song) 
> Cc: a...@linux-foundation.org; herb...@gondor.apana.org.au;
> da...@davemloft.net; linux-cry...@vger.kernel.org; linux...@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm ; Luis Claudio
> R . Goncalves ; Mahipal Challa
> ; Seth Jennings ;
> Dan Streetman ; Vitaly Wool
> ; Wangzhou (B) ;
> Colin Ian King 
> Subject: Re: [PATCH v4] mm/zswap: move to use crypto_acomp API for
> hardware acceleration
> 
> On 2020-07-08 21:45:47 [+], Song Bao Hua (Barry Song) wrote:
> > > On 2020-07-08 00:52:10 [+1200], Barry Song wrote:
> > > > @@ -127,9 +129,17 @@
> > > > +struct crypto_acomp_ctx {
> > > > +   struct crypto_acomp *acomp;
> > > > +   struct acomp_req *req;
> > > > +   struct crypto_wait wait;
> > > > +   u8 *dstmem;
> > > > +   struct mutex mutex;
> > > > +};
> > > …
> > > > @@ -1074,12 +1138,32 @@ static int zswap_frontswap_store(unsigned
> > > type, pgoff_t offset,
> > > > }
> > > >
> > > > /* compress */
> > > > -   dst = get_cpu_var(zswap_dstmem);
> > > > -   tfm = *get_cpu_ptr(entry->pool->tfm);
> > > > -   src = kmap_atomic(page);
> > > > -   ret = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &dlen);
> > > > -   kunmap_atomic(src);
> > > > -   put_cpu_ptr(entry->pool->tfm);
> > > > +   acomp_ctx = *this_cpu_ptr(entry->pool->acomp_ctx);
> > > > +
> > > > +   mutex_lock(&acomp_ctx->mutex);
> > > > +
> > > > +   src = kmap(page);
> > > > +   dst = acomp_ctx->dstmem;
> > >
> > > that mutex is per-CPU, per-context. The dstmem pointer is per-CPU.
> > > So if I read this right, you can get preempted after
> > > crypto_wait_req() and another context in this CPU writes its data to
> > > the same dstmem and then…
> > >
> >
> > This isn't true. Another thread in this cpu will be blocked by the mutex.
> > It is impossible for two threads to write the same dstmem.
> > If thread1 ran on cpu1, it held cpu1's mutex; if another thread wants to run
> on cpu1, it is blocked.
> > If thread1 ran on cpu1 first, it held cpu1's mutex, then it migrated to cpu2
> (with very rare chance)
> > a. if another thread wants to run on cpu1, it is blocked;
> 
> How it is blocked? That "struct crypto_acomp_ctx" is
> "this_cpu_ptr(entry->pool->acomp_ctx)" - which is per-CPU of a pool which
> you can have multiple of. But `dstmem' you have only one per-CPU no matter
> have many pools you have.
> So pool1 on CPU1 uses the same `dstmem' as pool2 on CPU1. But pool1 and
> pool2 on CPU1 use a different mutex for protection of this `dstmem'.

Good catch, Sebastian, thanks!
This is a corner case my testing has not hit yet: there is indeed a race if the
pool type is changed at runtime. Typically I set a group of initial parameters and
then had software write/read lots of anonymous pages to keep swapping as busy as
possible, but I never tried switching the compressor/pool type at runtime.

I will address this problem in v5 together with a cleanup of the acomp_ctx pointer
in zswap_pool. The idea is to create the acomp instances per CPU, not per
(pool * CPU).
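
Roughly the direction v5 takes (see the v5 patch quoted elsewhere in this thread):
the dstmem buffer and the mutex become genuinely per-CPU and are shared by every
pool's per-CPU context, so two pools on the same CPU can no longer race on dstmem.
A simplified sketch:

	static DEFINE_PER_CPU(u8 *, zswap_dstmem);
	static DEFINE_PER_CPU(struct mutex *, zswap_mutex);

	struct crypto_acomp_ctx {
		struct crypto_acomp *acomp;
		struct acomp_req *req;
		struct crypto_wait wait;
		u8 *dstmem;		/* this CPU's zswap_dstmem */
		struct mutex *mutex;	/* this CPU's zswap_mutex */
	};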

Thanks
Barry


RE: [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration

2020-07-09 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On
> Behalf Of Sebastian Andrzej Siewior
> Sent: Thursday, July 9, 2020 8:41 PM
> To: Song Bao Hua (Barry Song) 
> Cc: a...@linux-foundation.org; herb...@gondor.apana.org.au;
> da...@davemloft.net; linux-cry...@vger.kernel.org; linux...@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm ; Luis Claudio
> R . Goncalves ; Mahipal Challa
> ; Seth Jennings ;
> Dan Streetman ; Vitaly Wool
> ; Wangzhou (B) ;
> Colin Ian King 
> Subject: Re: [PATCH v4] mm/zswap: move to use crypto_acomp API for
> hardware acceleration
> 
> On 2020-07-09 07:55:22 [+], Song Bao Hua (Barry Song) wrote:
> > Hello Sebastian, thanks for your reply and careful review.
> Hi,
> 
> > I don't think we can simply "forward the result to the caller and let him
> decide".
> > Would you like to present some pseudo code?
> 
> I provided just some pseudo code to illustrate an example how the async
> interface should look like (more or less). The essential part is where
> you allow to feed multiple requests without blocking.

Sebastian, Do you mean the below code?

@@ -252,12 +252,15 @@ int swap_writepage(struct page *page, struct 
writeback_control *wbc)
unlock_page(page);
goto out;
}
-   if (frontswap_store(page) == 0) {
+   ret = frontswap_store(page);
+   if (ret == 0) {
set_page_writeback(page);
unlock_page(page);
end_page_writeback(page);
goto out;
}
+   if (ret == -EINPROGRESS)
+   goto out;
ret = __swap_writepage(page, wbc, end_swap_bio_write);
 out:
return ret;

I think this won't work. -EINPROGRESS alone cannot tell us whether we should go to
out: we can only go to out if the compression has completed without any error. The
error might come from the hardware (e.g. EIO), or the data might simply not be
compressible. We only know the result once the compression has really finished and
the completion callback has been called by the ZIP driver.

While the compression is still in progress, we don't yet know what will happen.
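
To make that concrete, an async store would need the decision to be taken in the
completion callback rather than in swap_writepage(). Everything below (the context
struct, keeping the wbc alive, the callback name) is hypothetical and only
illustrates why the caller cannot decide on -EINPROGRESS alone:

	/* hypothetical completion callback for an async zswap store */
	static void zswap_store_done(struct crypto_async_request *req, int err)
	{
		struct zswap_async_ctx *ctx = req->data;	/* hypothetical: page + wbc kept alive */

		if (!err) {
			/* stored in zswap: finish writeback without touching the disk */
			set_page_writeback(ctx->page);
			unlock_page(ctx->page);
			end_page_writeback(ctx->page);
		} else {
			/* HW error or incompressible data: fall back to the swap device */
			__swap_writepage(ctx->page, ctx->wbc, end_swap_bio_write);
		}
	}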

> I went up the call-chain and found one potential user which seem to have
> a list of pages which are processed. This looked like a nice example. I
> haven't looked at the details.
> 
> I have no opinion whether or not it makes sense to switch to the async
> interface in a sync way.

I always appreciate your comment and your opinion.

The real problem here is that all of the new ZIP drivers are adapted to the async
interface; the drivers mainlined in recent years have no support for the legacy
interface at all. zswap doesn't work with those new drivers because they simply
don't implement
crypto_comp_compress()
crypto_comp_decompress()
...

So the initial goal of this patch is to fix the disconnected bridge between the new
ZIP drivers and zswap.

Making frontswap async can probably happen later if we see a performance
improvement, but it is a big project, not that simple. On the other hand, hisi_zip
in drivers/crypto seems to be the only truly async driver so far (sorry if I am
missing one); the other drivers are adapted to the acomp API via the scomp backend.
For example:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lz4.c?id=8cd9330e0a
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lzo.c?id=ac9d2c4b3

So even if we made frontswap completely async, most ZIP drivers would still be sync
and we would not get the benefit. From my perspective, I am glad to explore making
frontswap async to leverage the power of ZIP hardware, but that can probably only
happen after we have a base that supports the acomp APIs.

Thanks
Barry


RE: [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration

2020-07-09 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: linux-crypto-ow...@vger.kernel.org
> [mailto:linux-crypto-ow...@vger.kernel.org] On Behalf Of Sebastian Andrzej
> Siewior
> Sent: Thursday, July 9, 2020 7:39 PM
> To: Song Bao Hua (Barry Song) 
> Cc: a...@linux-foundation.org; herb...@gondor.apana.org.au;
> da...@davemloft.net; linux-cry...@vger.kernel.org; linux...@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm ; Luis Claudio
> R . Goncalves ; Mahipal Challa
> ; Seth Jennings ;
> Dan Streetman ; Vitaly Wool
> ; Wangzhou (B) ;
> Colin Ian King 
> Subject: Re: [PATCH v4] mm/zswap: move to use crypto_acomp API for
> hardware acceleration
> 
> On 2020-07-09 01:32:38 [+], Song Bao Hua (Barry Song) wrote:
> > > This looks using the same synchronous mechanism around an
> asynchronous
> > > interface. It works as a PoC.
> > >
> > > As far as I remember the crypto async interface, the incoming skbs were 
> > > fed
> to
> > > the async interface and returned to the caller so the NIC could continue
> > > allocate new RX skbs and move on. Only if the queue of requests was
> getting
> > > to long the code started to throttle. Eventually the async crypto code
> > > completed the decryption operation in a different context and fed the
> > > decrypted packet(s) into the stack.
> > >
> > > From a quick view, you would have to return -EINPROGRESS here and have
> at
> > > the caller side something like that:
> > >
> > > iff --git a/mm/page_io.c b/mm/page_io.c
> > > index e8726f3e3820b..9d1baa46ec3ed 100644
> > > --- a/mm/page_io.c
> > > +++ b/mm/page_io.c
> > > @@ -252,12 +252,15 @@ int swap_writepage(struct page *page, struct
> > > writeback_control *wbc)
> > > unlock_page(page);
> > > goto out;
> > > }
> > > -   if (frontswap_store(page) == 0) {
> > > +   ret = frontswap_store(page);
> > > +   if (ret == 0) {
> > > set_page_writeback(page);
> > > unlock_page(page);
> > > end_page_writeback(page);
> > > goto out;
> > > }
> > > +   if (ret = -EINPROGRESS)
> > > +   goto out;
> > > ret = __swap_writepage(page, wbc, end_swap_bio_write);
> > >  out:
> > > return ret;
> > >
> > Unfortunately, this is not true and things are not that simple.
> >
> > We can't simply depend on -EINPROGRESS and go out.
> > We have to wait for the result of compression to decide if we should
> > do __swap_writepage(). As one page might be compressed into two
> > pages, in this case, we will still need to do _swap_writepage().
> > As I replied in the latest email, all of the async improvement to frontswap
> > needs very careful thinking and benchmark. It can only happen after
> > we build the base in this patch, fixing the broken connection between
> > zswap and those new zip drivers.
> 
> At the time the compression finishes you see what happens and based on
> the design you can either complete it immediately (the 0/error case from
> above) or forward the result to the caller and let him decide.

Hello Sebastian, thanks for your reply and careful review.

Right now, frontswap is pretty much a step that happens before __swap_writepage().
The whole design is built on the assumption that frontswap is synchronous: if
frontswap consumes a page without any error, the page never goes to
__swap_writepage() (which is async); if frontswap's store fails, the page has to be
swapped to disk.

int swap_writepage(struct page *page, struct writeback_control *wbc)
{
int ret = 0;

if (try_to_free_swap(page)) {
unlock_page(page);
goto out;
}
if (frontswap_store(page) == 0) {
set_page_writeback(page);
unlock_page(page);
end_page_writeback(page);
goto out;
}
ret = __swap_writepage(page, wbc, end_swap_bio_write);
out:
return ret;
}

I don't think we can simply "forward the result to the caller and let him decide".
Would you like to present some pseudo code?

Thanks
Barry



RE: [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration

2020-07-08 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: linux-crypto-ow...@vger.kernel.org
> [mailto:linux-crypto-ow...@vger.kernel.org] On Behalf Of Sebastian Andrzej
> Siewior
> Sent: Thursday, July 9, 2020 3:00 AM
> To: Song Bao Hua (Barry Song) 
> Cc: a...@linux-foundation.org; herb...@gondor.apana.org.au;
> da...@davemloft.net; linux-cry...@vger.kernel.org; linux...@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm ; Luis Claudio
> R . Goncalves ; Mahipal Challa
> ; Seth Jennings ;
> Dan Streetman ; Vitaly Wool
> ; Wangzhou (B) ;
> Colin Ian King 
> Subject: Re: [PATCH v4] mm/zswap: move to use crypto_acomp API for
> hardware acceleration
> 
> On 2020-07-08 00:52:10 [+1200], Barry Song wrote:
> …
> > @@ -127,9 +129,17 @@
> module_param_named(same_filled_pages_enabled,
> > zswap_same_filled_pages_enabled,
> >  * data structures
> >  **/
> >
> > +struct crypto_acomp_ctx {
> > +   struct crypto_acomp *acomp;
> > +   struct acomp_req *req;
> > +   struct crypto_wait wait;
> > +   u8 *dstmem;
> > +   struct mutex mutex;
> > +};
> …
> > @@ -561,8 +614,9 @@ static struct zswap_pool *zswap_pool_create(char
> *type, char *compressor)
> > pr_debug("using %s zpool\n", zpool_get_type(pool->zpool));
> >
> > strlcpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
> > -   pool->tfm = alloc_percpu(struct crypto_comp *);
> > -   if (!pool->tfm) {
> > +
> > +   pool->acomp_ctx = alloc_percpu(struct crypto_acomp_ctx *);
> 
> Can't you allocate the whole structure instead just a pointer to it? The
> structure looks just like bunch of pointers anyway. Less time for pointer
> chasing means more time for fun.
> 
> > @@ -1074,12 +1138,32 @@ static int zswap_frontswap_store(unsigned
> type, pgoff_t offset,
> > }
> >
> > /* compress */
> > -   dst = get_cpu_var(zswap_dstmem);
> > -   tfm = *get_cpu_ptr(entry->pool->tfm);
> > -   src = kmap_atomic(page);
> > -   ret = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &dlen);
> > -   kunmap_atomic(src);
> > -   put_cpu_ptr(entry->pool->tfm);
> > +   acomp_ctx = *this_cpu_ptr(entry->pool->acomp_ctx);
> > +
> > +   mutex_lock(&acomp_ctx->mutex);
> > +
> > +   src = kmap(page);
> > +   dst = acomp_ctx->dstmem;
> 
> that mutex is per-CPU, per-context. The dstmem pointer is per-CPU. So if I 
> read
> this right, you can get preempted after crypto_wait_req() and another context
> in this CPU writes its data to the same dstmem and then…
> 
> > +   sg_init_one(&input, src, PAGE_SIZE);
> > +   /* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
> > +   sg_init_one(&output, dst, PAGE_SIZE * 2);
> > +   acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
> > +   /*
> > +* it maybe looks a little bit silly that we send an asynchronous 
> > request,
> > +* then wait for its completion synchronously. This makes the process
> look
> > +* synchronous in fact.
> > +* Theoretically, acomp supports users send multiple acomp requests in
> one
> > +* acomp instance, then get those requests done simultaneously. but in
> this
> > +* case, frontswap actually does store and load page by page, there is 
> > no
> > +* existing method to send the second page before the first page is done
> > +* in one thread doing frontswap.
> > +* but in different threads running on different cpu, we have different
> > +* acomp instance, so multiple threads can do (de)compression in
> parallel.
> > +*/
> > +   ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
> > +   dlen = acomp_ctx->req->dlen;
> > +   kunmap(page);
> > +
> > if (ret) {
> > ret = -EINVAL;
> > goto put_dstmem;
> 
> This looks using the same synchronous mechanism around an asynchronous
> interface. It works as a PoC.
> 
> As far as I remember the crypto async interface, the incoming skbs were fed to
> the async interface and returned to the caller so the NIC could continue
> allocate new RX skbs and move on. Only if the queue of requests was getting
> to long the code started to throttle. Eventually the async crypto code
> completed the decryption operation in a different context and fed the
> decrypted packet(s) into the stack.
> 
> From a quick view, you would have to return -EINPROGRESS here and have at
> the caller side something like that:
> 
> iff --git a/mm/page_io.c b/mm/page_io.c
> index e8726f3e3820b..9d1baa46ec3e

RE: [PATCH v3] mm/hugetlb: avoid hardcoding while checking if cma is enable

2020-07-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Roman Gushchin [mailto:g...@fb.com]
> Sent: Thursday, July 9, 2020 6:46 AM
> To: Mike Kravetz 
> Cc: Andrew Morton ; Song Bao Hua (Barry Song)
> ; linux...@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm ; Jonathan
> Cameron 
> Subject: Re: [PATCH v3] mm/hugetlb: avoid hardcoding while checking if cma
> is enable
> 
> On Wed, Jul 08, 2020 at 10:45:16AM -0700, Mike Kravetz wrote:
> > On 7/7/20 12:56 PM, Andrew Morton wrote:
> > > On Tue, 7 Jul 2020 16:02:04 +1200 Barry Song
>  wrote:
> > >
> > >> hugetlb_cma[0] can be NULL due to various reasons, for example, node0
> has
> > >> no memory. so NULL hugetlb_cma[0] doesn't necessarily mean cma is not
> > >> enabled. gigantic pages might have been reserved on other nodes.
> > >
> > > I'm trying to figure out whether this should be backported into 5.7.1,
> > > but the changelog doesn't describe any known user-visible effects of
> > > the bug.  Are there any?
> >
> > Barry must have missed this email.  He reported the issue so I was hoping
> > he would reply.

Yep, it would be better to backport it into 5.7. It doesn't cause a serious
crash or failure, but it could cause a double reservation or a CMA leak.

> >
> > Based on the code changes, I believe the following could happen:
> > - Someone uses 'hugetlb_cma=' kernel command line parameter to reserve
> >   CMA for gigantic pages.
> > - The system topology is such that no memory is on node 0.  Therefore,
> >   no CMA can be reserved for gigantic pages on node 0.  CMA is reserved
> >   on other nodes.
> > - The user also specifies a number of gigantic pages to pre-allocate on
> >   the command line with hugepagesz= hugepages=
> > - The routine which allocates gigantic pages from the bootmem allocator
> >   will not detect CMA has been reserved as there is no memory on node 0.
> >   Therefore, pages will be pre-allocated from bootmem allocator as well
> >   as reserved in CMA.
> >
> > This double allocation (bootmem and CMA) is the worst case scenario.  Not
> > sure if this is what Barry saw, and I suspect this would rarely happen.
> >
> > After writing this, I started to think that perhaps command line parsing
> > should be changed.  If hugetlb_cma= is specified, it makes no sense to
> > pre-allocate gigantic pages.  Therefore, the hugepages= paramemter
> > should be ignored and flagged with a warning if  hugetlb_cma= is specified.
> > This could be checked at parsing time and there would be no need for such
> > a check in the allocation code (except for sanity cheching).
> >
> > Thoughts?  I just cleaned up the parsing code and could make such a
> change
> > quite easily.
> 
> I agree. Basically, if hugetlb_cma_size > 0, we should not pre-allocate
> gigantic pages. It would be much simpler and more reliable than the existing
> code.

I agree this is a better solution. If hugetlb_cma takes priority over boot-time
pre-allocation of gigantic pages, we should document that.
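
For reference, a minimal sketch of the parse-time check Mike describes could
look like the below (function and variable names are illustrative only, not
from an actual patch):

	/* called from hugetlb command-line handling once both options are known */
	static void __init hugetlb_cma_check_prealloc(struct hstate *h)
	{
		/* hugetlb_cma= takes priority: skip boot-time gigantic page allocation */
		if (hugetlb_cma_size && hstate_is_gigantic(h) && h->max_huge_pages) {
			pr_warn("HugeTLB: hugetlb_cma is specified, ignoring hugepages=%lu for gigantic pages\n",
				h->max_huge_pages);
			h->max_huge_pages = 0;
		}
	}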

> 
> Thank you!

Thanks
Barry



RE: [PATCH v4] mm/zswap: move to use crypto_acomp API for hardware acceleration

2020-07-08 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: linux-crypto-ow...@vger.kernel.org
> [mailto:linux-crypto-ow...@vger.kernel.org] On Behalf Of Sebastian Andrzej
> Siewior
> Sent: Thursday, July 9, 2020 3:00 AM
> To: Song Bao Hua (Barry Song) 
> Cc: a...@linux-foundation.org; herb...@gondor.apana.org.au;
> da...@davemloft.net; linux-cry...@vger.kernel.org; linux...@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm ; Luis Claudio
> R . Goncalves ; Mahipal Challa
> ; Seth Jennings ;
> Dan Streetman ; Vitaly Wool
> ; Wangzhou (B) ;
> Colin Ian King 
> Subject: Re: [PATCH v4] mm/zswap: move to use crypto_acomp API for
> hardware acceleration
> 
> On 2020-07-08 00:52:10 [+1200], Barry Song wrote:
> …
> > @@ -127,9 +129,17 @@
> module_param_named(same_filled_pages_enabled,
> zswap_same_filled_pages_enabled,
> >  * data structures
> >  **/
> >
> > +struct crypto_acomp_ctx {
> > +   struct crypto_acomp *acomp;
> > +   struct acomp_req *req;
> > +   struct crypto_wait wait;
> > +   u8 *dstmem;
> > +   struct mutex mutex;
> > +};
> …
> > @@ -561,8 +614,9 @@ static struct zswap_pool *zswap_pool_create(char
> *type, char *compressor)
> > pr_debug("using %s zpool\n", zpool_get_type(pool->zpool));
> >
> > strlcpy(pool->tfm_name, compressor, sizeof(pool->tfm_name));
> > -   pool->tfm = alloc_percpu(struct crypto_comp *);
> > -   if (!pool->tfm) {
> > +
> > +   pool->acomp_ctx = alloc_percpu(struct crypto_acomp_ctx *);
> 
> Can't you allocate the whole structure instead just a pointer to it? The
> structure looks just like bunch of pointers anyway. Less time for
> pointer chasing means more time for fun.
> 

Should be possible.
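
Something along these lines, I guess; just a rough sketch, assuming the per-CPU
member becomes a plain "struct crypto_acomp_ctx __percpu *acomp_ctx" instead of
a per-CPU pointer:

	pool->acomp_ctx = alloc_percpu(struct crypto_acomp_ctx);
	if (!pool->acomp_ctx) {
		pr_err("percpu alloc failed\n");
		goto error;
	}

	/* the store path then needs one less dereference: */
	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
	mutex_lock(&acomp_ctx->mutex);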

> > @@ -1074,12 +1138,32 @@ static int zswap_frontswap_store(unsigned
> type, pgoff_t offset,
> > }
> >
> > /* compress */
> > -   dst = get_cpu_var(zswap_dstmem);
> > -   tfm = *get_cpu_ptr(entry->pool->tfm);
> > -   src = kmap_atomic(page);
> > -   ret = crypto_comp_compress(tfm, src, PAGE_SIZE, dst, &dlen);
> > -   kunmap_atomic(src);
> > -   put_cpu_ptr(entry->pool->tfm);
> > +   acomp_ctx = *this_cpu_ptr(entry->pool->acomp_ctx);
> > +
> > +   mutex_lock(&acomp_ctx->mutex);
> > +
> > +   src = kmap(page);
> > +   dst = acomp_ctx->dstmem;
> 
> that mutex is per-CPU, per-context. The dstmem pointer is per-CPU. So if
> I read this right, you can get preempted after crypto_wait_req() and
> another context in this CPU writes its data to the same dstmem and then…
> 

This isn't true. Another thread on this CPU will be blocked by the mutex.
It is impossible for two threads to write to the same dstmem.
If thread1 runs on cpu1, it holds cpu1's mutex; any other thread that wants to
run on cpu1 is blocked.
If thread1 runs on cpu1 first, holds cpu1's mutex and then migrates to cpu2
(which should be very rare):
a. any other thread that wants to run on cpu1 is still blocked;
b. any other thread that wants to run on cpu2 is not blocked, but it will write
to cpu2's dstmem, not cpu1's.
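
Put differently, the invariant is that a given dstmem is only ever touched
while its own per-CPU context mutex is held, regardless of where the task ends
up running. A simplified sketch of the pattern (illustrative only):

	struct crypto_acomp_ctx *acomp_ctx = *this_cpu_ptr(pool->acomp_ctx);

	mutex_lock(&acomp_ctx->mutex);	/* pins this context, not the CPU */
	/*
	 * Even if the task migrates now, it keeps using acomp_ctx->dstmem,
	 * while any other task picking the same context blocks on the mutex.
	 */
	memcpy(acomp_ctx->dstmem, src, PAGE_SIZE);
	mutex_unlock(&acomp_ctx->mutex);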

> > +   sg_init_one(&input, src, PAGE_SIZE);
> > +   /* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
> > +   sg_init_one(&output, dst, PAGE_SIZE * 2);
> > +   acomp_request_set_params(acomp_ctx->req, &input, &output,
> > PAGE_SIZE, dlen);
> > +   /*
> > +* it maybe looks a little bit silly that we send an asynchronous 
> > request,
> > +* then wait for its completion synchronously. This makes the process
> look
> > +* synchronous in fact.
> > +* Theoretically, acomp supports users send multiple acomp requests in
> one
> > +* acomp instance, then get those requests done simultaneously. but in
> this
> > +* case, frontswap actually does store and load page by page, there is 
> > no
> > +* existing method to send the second page before the first page is done
> > +* in one thread doing frontswap.
> > +* but in different threads running on different cpu, we have different
> > +* acomp instance, so multiple threads can do (de)compression in
> parallel.
> > +*/
> > +   ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
> > &acomp_ctx->wait);
> > +   dlen = acomp_ctx->req->dlen;
> > +   kunmap(page);
> > +
> > if (ret) {
> > ret = -EINVAL;
> > goto put_dstmem;
> 
> This looks using the same synchronous mechanism around an asynchronous
> interface. It works as a PoC.
> 
> As far as I remember the crypto async interface, the incoming skbs were
> fed to the async interface and return

RE: [PATCH net] xsk: remove cheap_dma optimization

2020-07-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of Christoph Hellwig
> Sent: Wednesday, July 8, 2020 6:50 PM
> To: Robin Murphy 
> Cc: Björn Töpel ; Christoph Hellwig ;
> Daniel Borkmann ; maxi...@mellanox.com;
> konrad.w...@oracle.com; jonathan.le...@gmail.com;
> linux-kernel@vger.kernel.org; io...@lists.linux-foundation.org;
> net...@vger.kernel.org; b...@vger.kernel.org; da...@davemloft.net;
> magnus.karls...@intel.com
> Subject: Re: [PATCH net] xsk: remove cheap_dma optimization
> 
> On Mon, Jun 29, 2020 at 04:41:16PM +0100, Robin Murphy wrote:
> > On 2020-06-28 18:16, Björn Töpel wrote:
> >>
> >> On 2020-06-27 09:04, Christoph Hellwig wrote:
> >>> On Sat, Jun 27, 2020 at 01:00:19AM +0200, Daniel Borkmann wrote:
>  Given there is roughly a ~5 weeks window at max where this removal
> could
>  still be applied in the worst case, could we come up with a fix /
>  proposal
>  first that moves this into the DMA mapping core? If there is something
>  that
>  can be agreed upon by all parties, then we could avoid re-adding the 9%
>  slowdown. :/
> >>>
> >>> I'd rather turn it upside down - this abuse of the internals blocks work
> >>> that has basically just missed the previous window and I'm not going
> >>> to wait weeks to sort out the API misuse.  But we can add optimizations
> >>> back later if we find a sane way.
> >>>
> >>
> >> I'm not super excited about the performance loss, but I do get
> >> Christoph's frustration about gutting the DMA API making it harder for
> >> DMA people to get work done. Lets try to solve this properly using
> >> proper DMA APIs.
> >>
> >>
> >>> That being said I really can't see how this would make so much of a
> >>> difference.  What architecture and what dma_ops are you using for
> >>> those measurements?  What is the workload?
> >>>
> >>
> >> The 9% is for an AF_XDP (Fast raw Ethernet socket. Think AF_PACKET, but
> >> faster.) benchmark: receive the packet from the NIC, and drop it. The DMA
> >> syncs stand out in the perf top:
> >>
> >>    28.63%  [kernel]   [k] i40e_clean_rx_irq_zc
> >>    17.12%  [kernel]   [k] xp_alloc
> >>     8.80%  [kernel]   [k] __xsk_rcv_zc
> >>     7.69%  [kernel]   [k] xdp_do_redirect
> >>     5.35%  bpf_prog_992d9ddc835e5629  [k]
> bpf_prog_992d9ddc835e5629
> >>     4.77%  [kernel]   [k] xsk_rcv.part.0
> >>     4.07%  [kernel]   [k] __xsk_map_redirect
> >>     3.80%  [kernel]   [k]
> dma_direct_sync_single_for_cpu
> >>     3.03%  [kernel]   [k]
> dma_direct_sync_single_for_device
> >>     2.76%  [kernel]   [k]
> i40e_alloc_rx_buffers_zc
> >>     1.83%  [kernel]   [k] xsk_flush
> >> ...
> >>
> >> For this benchmark the dma_ops are NULL (dma_is_direct() == true), and
> >> the main issue is that SWIOTLB is now unconditionally enabled [1] for
> >> x86, and for each sync we have to check that if is_swiotlb_buffer()
> >> which involves a some costly indirection.
> >>
> >> That was pretty much what my hack avoided. Instead we did all the checks
> >> upfront, since AF_XDP has long-term DMA mappings, and just set a flag
> >> for that.
> >>
> >> Avoiding the whole "is this address swiotlb" in
> >> dma_direct_sync_single_for_{cpu, device]() per-packet
> >> would help a lot.
> >
> > I'm pretty sure that's one of the things we hope to achieve with the
> > generic bypass flag :)
> >
> >> Somewhat related to the DMA API; It would have performance benefits for
> >> AF_XDP if the DMA range of the mapped memory was linear, i.e. by IOMMU
> >> utilization. I've started hacking a thing a little bit, but it would be
> >> nice if such API was part of the mapping core.
> >>
> >> Input: array of pages Output: array of dma addrs (and obviously dev,
> >> flags and such)
> >>
> >> For non-IOMMU len(array of pages) == len(array of dma addrs)
> >> For best-case IOMMU len(array of dma addrs) == 1 (large linear space)
> >>
> >> But that's for later. :-)
> >
> > FWIW you will typically get that behaviour from IOMMU-based
> implementations
> > of dma_map_sg() right now, although it's not strictly guaranteed. If you
> > can weather some additional setup cost of calling
> > sg_alloc_table_from_pages() plus walking the list after mapping to test
> > whether you did get a contiguous result, you could start taking advantage
> > of it as some of the dma-buf code in DRM and v4l2 does already (although
> > those cases actually treat it as a strict dependency rather than an
> > optimisation).
> 
> Yikes.
> 
> > I'm inclined to agree that if we're going to see more of these cases, a new
> > API call that did formally guarantee a DMA-contiguous mapping (either via
> > IOMMU or bounce buffering) or failure might indeed be handy.
> 
> I was planning on adding a dma-level API to add more pages to an
> IOMMU batch, but was waiting for at 

RE: [PATCH] arm64: mm: free unused memmap for sparse memory model that define VMEMMAP

2020-07-08 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: liwei (CM)
> Sent: Wednesday, July 8, 2020 1:56 PM
> To: catalin.mari...@arm.com; w...@kernel.org
> Cc: liwei (CM) ; fengbaopeng
> ; nsaenzjulie...@suse.de;
> steve.cap...@arm.com; r...@linux.ibm.com; Song Bao Hua (Barry Song)
> ; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; sujunfei 
> Subject: [PATCH] arm64: mm: free unused memmap for sparse memory model
> that define VMEMMAP
> 
> For the memory hole, sparse memory model that define
> SPARSEMEM_VMEMMAP do not free the reserved memory for the page map,
> this patch do it.

Hello Wei,
Just curious: does this patch break MEMORY_HOTPLUG?

> 
> Signed-off-by: Wei Li 
> Signed-off-by: Chen Feng 
> Signed-off-by: Xia Qing 
> ---
>  arch/arm64/mm/init.c | 81
> +---
>  1 file changed, 71 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index
> 1e93cfc7c47a..d1b56b47d5ba 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -441,7 +441,48 @@ void __init bootmem_init(void)
>   memblock_dump_all();
>  }
> 

Thanks
Barry



RE: [PATCH] mm/hugetlb: split hugetlb_cma in nodes with memory

2020-07-07 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Anshuman Khandual [mailto:anshuman.khand...@arm.com]
> Sent: Wednesday, July 8, 2020 4:18 PM
> To: Song Bao Hua (Barry Song) ;
> a...@linux-foundation.org
> Cc: x...@kernel.org; linux...@kvack.org; linux-kernel@vger.kernel.org;
> Linuxarm ; linux-arm-ker...@lists.infradead.org;
> Roman Gushchin ; Catalin Marinas
> ; Will Deacon ; Thomas Gleixner
> ; Ingo Molnar ; Borislav Petkov
> ; H . Peter Anvin ; Mike Kravetz
> ; Mike Rapoport ; Jonathan
> Cameron 
> Subject: Re: [PATCH] mm/hugetlb: split hugetlb_cma in nodes with memory
> 
> Hello Barry,
> 
> On 07/08/2020 05:53 AM, Barry Song wrote:
> > Rather than splitting huge_cma in online nodes, it is better to do it in
> > nodes with memory.
> 
> Right, it makes sense to avoid nodes without memory, hence loosing portions
> of CMA reservation intended for HugeTLB. N_MEMORY is better than
> N_ONLINE
> and will help avoid this situation.

Thanks for taking a look, Anshuman.

> 
> > For an ARM64 server with four numa nodes and only node0 has memory. If I
> > set hugetlb_cma=4G in bootargs,
> >
> > without this patch, I got the below printk:
> > hugetlb_cma: reserve 4096 MiB, up to 1024 MiB per node
> > hugetlb_cma: reserved 1024 MiB on node 0
> > hugetlb_cma: reservation failed: err -12, node 1
> > hugetlb_cma: reservation failed: err -12, node 2
> > hugetlb_cma: reservation failed: err -12, node 3
> 
> As expected.
> 
> >
> > hugetlb_cma size is broken once the system has nodes without memory.
> 
> I would not say that it is 'broken'. It is just not optimal but still works
> as designed.
> 
> >
> > With this patch, I got the below printk:
> > hugetlb_cma: reserve 4096 MiB, up to 4096 MiB per node
> > hugetlb_cma: reserved 4096 MiB on node 0
> 
> As expected, the per node CMA reservation quota has changed from
> N_ONLINE
> to N_MEMORY.
> 
> >
> > So this patch fixes the broken hugetlb_cma size on arm64.
> 
> There is nothing arm64 specific here. A platform where N_ONLINE !=
> N_MEMORY
> i.e with some nodes without memory when CMA reservation gets called, will
> have this problem.

Agreed. One fact is that right now only x86 and arm64 call
hugetlb_cma_reserve(), so I don't know how eagerly other platforms need this
function.

> 
> >
> > Jonathan Cameron tested this patch on x86 platform. Jonathan figured out
> x86
> > is much different with arm64. hugetlb_cma size has never broken on x86.
> > On arm64 all nodes are marked online at the same time. On x86, only
> > nodes with memory are initially marked as online:
> > initmem_init()->x86_numa_init()->numa_init()->
> > numa_register_memblks()->alloc_node_data()->node_set_online()
> > So at time of the existing cma setup call only the memory containing nodes
> > are online. The other nodes are brought up much later.
> 
> The problem is always there if N_ONLINE != N_MEMORY but in this case, it
> is just hidden because N_ONLINE happen to match N_MEMORY during the
> boot
> process when hugetlb_cma_reserve() gets called.

Yes. Exactly.

> 
> >
> > Thus, the change is simply to fix ARM64.  A change is needed to x86 only
> > because the inherent assumptions in cma_hugetlb_reserve() have changed.
> 
> cma_hugetlb_reserve() will now scan over N_MEMORY and hence expects all
> platforms to have N_MEMORY initialized properly before calling it. This
> needs to be well documented for the hugetlb_cma_reserve() function along
> with it's call sites.
> 

Yep. will document this.
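
For example, something like the following comment near the function (the
wording below is just a draft):

	/*
	 * hugetlb_cma_reserve() iterates over nodes in the N_MEMORY state, so
	 * architectures must only call it after their NUMA init has populated
	 * node_states[N_MEMORY], i.e. after memory-containing nodes are known.
	 */
	void __init hugetlb_cma_reserve(int order)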

> >
> > Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages
> using cma")
> 
> I would not call this a "Fix". The current code still works, though in
> a sub optimal manner.

Do you think it is worth sending to linux-stable? For example, is it better for
this more optimal behaviour to be in 5.7 and 5.8, or should we only carry this
patch from 5.9-rc1 onwards?

Personally, I would prefer that 5.7 and 5.8 users can still get a hugetlb CMA
size that is consistent with the bootargs.

> 
> > Cc: Roman Gushchin 
> > Cc: Catalin Marinas 
> > Cc: Will Deacon 
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: Borislav Petkov 
> > Cc: H. Peter Anvin 
> > Cc: Mike Kravetz 
> > Cc: Mike Rapoport 
> > Cc: Andrew Morton 
> > Cc: Anshuman Khandual 
> > Cc: Jonathan Cameron 
> > Signed-off-by: Barry Song 
> > ---
> >  arch/arm64/mm/init.c| 18 +-
> >  arch/x86/kernel/setup.c | 13 ++---
> >  mm/hugetlb.c|  4 ++--
> >  3 files changed, 21 insertions(+), 14 deletions(-)
> >
> &g

RE: [PATCH v2] mm/hugetlb: avoid hardcoding while checking if cma is enable

2020-07-07 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Mike Rapoport [mailto:r...@kernel.org]
> Sent: Tuesday, July 7, 2020 7:38 PM
> To: Roman Gushchin 
> Cc: Song Bao Hua (Barry Song) ;
> a...@linux-foundation.org; linux...@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm ; Mike
> Kravetz ; Jonathan Cameron
> 
> Subject: Re: [PATCH v2] mm/hugetlb: avoid hardcoding while checking if cma
> is enable
> 
> On Mon, Jul 06, 2020 at 08:36:31PM -0700, Roman Gushchin wrote:
> > On Tue, Jul 07, 2020 at 03:11:56PM +1200, Barry Song wrote:
> > > hugetlb_cma[0] can be NULL due to various reasons, for example, node0
> has
> > > no memory. so NULL hugetlb_cma[0] doesn't necessarily mean cma is not
> > > enabled. gigantic pages might have been reserved on other nodes.
> > >
> > > Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages
> using cma")
> > > Cc: Roman Gushchin 
> > > Cc: Mike Kravetz 
> > > Cc: Jonathan Cameron 
> > > Signed-off-by: Barry Song 
> > > ---
> > >  -v2: add hugetlb_cma_enabled() helper to improve readability according
> to Roman
> > >
> > >  mm/hugetlb.c | 16 +++-
> > >  1 file changed, 15 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 57ece74e3aae..d5e98ed86bb9 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -2546,6 +2546,20 @@ static void __init
> gather_bootmem_prealloc(void)
> > >   }
> > >  }
> > >
> > > +bool __init hugetlb_cma_enabled(void)
> > > +{
> > > + if (IS_ENABLED(CONFIG_CMA)) {
> > > + int node;
> > > +
> > > + for_each_online_node(node) {
> > > + if (hugetlb_cma[node])
> > > + return true;
> > > + }
> > > + }
> > > +
> > > + return false;
> > > +}
> > > +
> >
> > Can you, please, change it to a more canonical
> >
> > #ifdef CONFIG_CMA
> > bool __init hugetlb_cma_enabled(void)
> > {
> > int node;
> >
> > for_each_online_node(node)
> > if (hugetlb_cma[node])
> > return true;
> >
> > return false;
> > }
> > #else
> > bool __init hugetlb_cma_enabled(void)
> > {
> > return false;
> > }
> > #endif
> >
> > or maybe just
> >
> > bool __init hugetlb_cma_enabled(void)
> > {
> > #ifdef CONFIG_CMA
> > int node;
> >
> > for_each_online_node(node)
> > if (hugetlb_cma[node])
> > return true;
> > #endif
> > return false;
> > }
> 
> This one please.

Yep. Patch v3 is just the one you prefer:
https://marc.info/?l=linux-mm=159409465228477=2

> 
> > ?
> >
> > Please, feel free to add
> > Acked-by: Roman Gushchin  after that.
> >
> > Thank you!
> >
> 
> --
> Sincerely yours,
> Mike.

Thanks
Barry



RE: [PATCH] mm/hugetlb: avoid hardcoding while checking if cma is reserved

2020-07-06 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Song Bao Hua (Barry Song)
> Sent: Tuesday, July 7, 2020 10:12 AM
> To: 'Roman Gushchin' 
> Cc: a...@linux-foundation.org; linux...@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm ; Mike
> Kravetz ; Jonathan Cameron
> 
> Subject: RE: [PATCH] mm/hugetlb: avoid hardcoding while checking if cma is
> reserved
> 
> 
> 
> > -Original Message-
> > From: Roman Gushchin [mailto:g...@fb.com]
> > Sent: Tuesday, July 7, 2020 9:48 AM
> > To: Song Bao Hua (Barry Song) 
> > Cc: a...@linux-foundation.org; linux...@kvack.org;
> > linux-kernel@vger.kernel.org; Linuxarm ; Mike
> > Kravetz ; Jonathan Cameron
> > 
> > Subject: Re: [PATCH] mm/hugetlb: avoid hardcoding while checking if
> > cma is reserved
> >
> > On Mon, Jul 06, 2020 at 08:44:05PM +1200, Barry Song wrote:
> >
> > Hello, Barry!
> >
> > > hugetlb_cma[0] can be NULL due to various reasons, for example,
> > > node0 has no memory. Thus, NULL hugetlb_cma[0] doesn't necessarily
> > > mean cma is not enabled. gigantic pages might have been reserved on
> other nodes.
> >
> > Just curious, is it a real-life problem you've seen? If so, I wonder
> > how you're using the hugetlb_cma option, and what's the outcome?
> 
> Yes. It is kind of stupid but I once got a board on which node0 has no DDR
> though node1 and node3 have memory.
> 
> I actually prefer we get cma size of per node by:
> cma size of one node = hugetlb_cma/ (nodes with memory) rather than:
> cma size of one node = hugetlb_cma/ (all online nodes)
> 
> but unfortunately, or the N_MEMORY infrastructures are not ready yet. I
> mean:
> 
> for_each_node_state(nid, N_MEMORY) {
>   int res;
> 
>   size = min(per_node, hugetlb_cma_size - reserved);
>   size = round_up(size, PAGE_SIZE << order);
> 
>   res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
>0, false, "hugetlb",
>&hugetlb_cma[nid], nid);
>   ...
>   }
> 

And for a server, there are many memory slots. The best configuration would be
for every node to have at least one DDR DIMM populated, but that isn't
necessarily the case; it is totally up to the users.

If we move hugetlb_cma_reserve() a bit later, we can probably make the
hugetlb_cma size completely consistent by splitting it across nodes with memory
rather than nodes which are online:

void __init bootmem_init(void)
{
...

arm64_numa_init();

/*
 * must be done after arm64_numa_init() which calls numa_init() to
 * initialize node_online_map that gets used in hugetlb_cma_reserve()
 * while allocating required CMA size across online nodes.
 */
- #ifdef CONFIG_ARM64_4K_PAGES
-   hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
- #endif

...

sparse_init();
zone_sizes_init(min, max);

+ #ifdef CONFIG_ARM64_4K_PAGES
+   hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
+ #endif
memblock_dump_all();
}

For x86, it could be done in a similar way. Do you think it is worth trying?

> >
> > >
> > > Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic
> > > hugepages
> > using cma")
> > > Cc: Roman Gushchin 
> > > Cc: Mike Kravetz 
> > > Cc: Jonathan Cameron 
> > > Signed-off-by: Barry Song 
> > > ---
> > >  mm/hugetlb.c | 18 +++---
> > >  1 file changed, 15 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c index
> > > 57ece74e3aae..603aa854aa89 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -2571,9 +2571,21 @@ static void __init
> > hugetlb_hstate_alloc_pages(struct hstate *h)
> > >
> > >   for (i = 0; i < h->max_huge_pages; ++i) {
> > >   if (hstate_is_gigantic(h)) {
> > > - if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
> > > - pr_warn_once("HugeTLB: hugetlb_cma is enabled, 
> > > skip
> > boot time allocation\n");
> > > - break;
> > > + if (IS_ENABLED(CONFIG_CMA)) {
> > > + int nid;
> > > + bool cma_reserved = false;
> > > +
> > > + for_each_node_state(nid, N_ONLINE) {
> > > + if (hugetlb_cma[nid]) {
> > > + pr_warn_once("HugeTLB: 
> > > hugetlb_cma

RE: [PATCH] mm/hugetlb: avoid hardcoding while checking if cma is reserved

2020-07-06 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Roman Gushchin [mailto:g...@fb.com]
> Sent: Tuesday, July 7, 2020 9:48 AM
> To: Song Bao Hua (Barry Song) 
> Cc: a...@linux-foundation.org; linux...@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm ; Mike
> Kravetz ; Jonathan Cameron
> 
> Subject: Re: [PATCH] mm/hugetlb: avoid hardcoding while checking if cma is
> reserved
> 
> On Mon, Jul 06, 2020 at 08:44:05PM +1200, Barry Song wrote:
> 
> Hello, Barry!
> 
> > hugetlb_cma[0] can be NULL due to various reasons, for example, node0 has
> > no memory. Thus, NULL hugetlb_cma[0] doesn't necessarily mean cma is not
> > enabled. gigantic pages might have been reserved on other nodes.
> 
> Just curious, is it a real-life problem you've seen? If so, I wonder how
> you're using the hugetlb_cma option, and what's the outcome?

Yes. It is kind of stupid but I once got a board on which node0 has no DDR
though node1 and node3 have memory.

I would actually prefer that we compute the per-node CMA size as:
cma size of one node = hugetlb_cma / (number of nodes with memory)
rather than:
cma size of one node = hugetlb_cma / (number of online nodes)

but unfortunately the N_MEMORY infrastructure is not ready yet at the point
where hugetlb_cma_reserve() is called. I mean:

for_each_node_state(nid, N_MEMORY) {
int res;

size = min(per_node, hugetlb_cma_size - reserved);
size = round_up(size, PAGE_SIZE << order);

res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
 0, false, "hugetlb",
 &hugetlb_cma[nid], nid);
...
}

> 
> >
> > Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages
> using cma")
> > Cc: Roman Gushchin 
> > Cc: Mike Kravetz 
> > Cc: Jonathan Cameron 
> > Signed-off-by: Barry Song 
> > ---
> >  mm/hugetlb.c | 18 +++---
> >  1 file changed, 15 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 57ece74e3aae..603aa854aa89 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2571,9 +2571,21 @@ static void __init
> hugetlb_hstate_alloc_pages(struct hstate *h)
> >
> > for (i = 0; i < h->max_huge_pages; ++i) {
> > if (hstate_is_gigantic(h)) {
> > -   if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[0]) {
> > -   pr_warn_once("HugeTLB: hugetlb_cma is enabled, 
> > skip
> boot time allocation\n");
> > -   break;
> > +   if (IS_ENABLED(CONFIG_CMA)) {
> > +   int nid;
> > +   bool cma_reserved = false;
> > +
> > +   for_each_node_state(nid, N_ONLINE) {
> > +   if (hugetlb_cma[nid]) {
> > +   pr_warn_once("HugeTLB: 
> > hugetlb_cma is
> reserved,"
> > +   "skip boot time 
> > allocation\n");
> > +   cma_reserved = true;
> > +   break;
> > +   }
> > +   }
> > +
> > +   if (cma_reserved)
> > +   break;
> 
> It's a valid problem, and I like to see it fixed. But I wonder if it would be 
> better
> to introduce a new helper bool hugetlb_cma_enabled()? And move both
> IS_ENABLED(CONFIG_CMA)
> and hugetlb_cma[nid] checks there?

Yep. that would be more readable.

> 
> Thank you!

Thanks
Barry



RE: [PATCH v3] driver core: platform: expose numa_node to users in sysfs

2020-07-02 Thread Song Bao Hua (Barry Song)

> 
> However, it is still much more clear and credible to users by exposing the 
> data
> directly from ACPI table.
> 

Besides the ARM64 IORT, numa_node is actually also applicable to x86 and other
architectures through the generic acpi_create_platform_device() API:

drivers/acpi/scan.c:
static void acpi_default_enumeration(struct acpi_device *device)
{
...
if (!device->flags.enumeration_by_parent) {
acpi_create_platform_device(device, NULL);
acpi_device_set_enumerated(device);
}
}

struct platform_device *acpi_create_platform_device(struct acpi_device *adev,
struct property_entry *properties)
{
...

pdev = platform_device_register_full(&pdevinfo);
if (IS_ERR(pdev))
dev_err(&adev->dev, "platform device creation failed: %ld\n",
PTR_ERR(pdev));
else {
set_dev_node(&pdev->dev, acpi_get_node(adev->handle));
dev_dbg(&adev->dev, "created platform device %s\n",
dev_name(&pdev->dev));
}

...

return pdev;
}

> >
> > Thanks,
> > John

Thanks
Barry




RE: [PATCH v2 1/3] crypto: permit users to specify numa node of acomp hardware

2020-07-02 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
> Sent: Friday, July 3, 2020 4:11 PM
> To: Song Bao Hua (Barry Song) 
> Cc: da...@davemloft.net; Wangzhou (B) ;
> Jonathan Cameron ;
> a...@linux-foundation.org; linux-cry...@vger.kernel.org;
> linux...@kvack.org; linux-kernel@vger.kernel.org; Linuxarm
> ; Seth Jennings ; Dan
> Streetman ; Vitaly Wool 
> Subject: Re: [PATCH v2 1/3] crypto: permit users to specify numa node of
> acomp hardware
> 
> On Tue, Jun 23, 2020 at 04:16:08PM +1200, Barry Song wrote:
> >
> > -void *crypto_create_tfm(struct crypto_alg *alg,
> > -   const struct crypto_type *frontend)
> > +void *crypto_create_tfm_node(struct crypto_alg *alg,
> > +   const struct crypto_type *frontend,
> > +   int node)
> >  {
> > char *mem;
> > struct crypto_tfm *tfm = NULL;
> > @@ -451,6 +452,7 @@ void *crypto_create_tfm(struct crypto_alg *alg,
> >
> > tfm = (struct crypto_tfm *)(mem + tfmsize);
> > tfm->__crt_alg = alg;
> > +   tfm->node = node;
> 
> Should the kzalloc also use node?

Yes, it would be nice, since the tfm will mainly be accessed by CPUs on that
specific node.
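
i.e. something like the sketch below in crypto_create_tfm_node() (a sketch
only, the final patch may differ):

	mem = kzalloc_node(total, GFP_KERNEL, node);
	if (mem == NULL)
		goto out_err;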

> 
> Thanks,
> --
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Thanks
Barry



RE: [PATCH v2 1/2] dma-direct: provide the ability to reserve per-numa CMA

2020-06-26 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Robin Murphy
> Sent: Thursday, June 25, 2020 11:11 PM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com
> Cc: io...@lists.linux-foundation.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org; Jonathan
> Cameron ; Nicolas Saenz Julienne
> ; Steve Capper ; Andrew
> Morton ; Mike Rapoport 
> Subject: Re: [PATCH v2 1/2] dma-direct: provide the ability to reserve
> per-numa CMA
> 
> On 2020-06-25 08:43, Barry Song wrote:
> > This is useful for at least two scenarios:
> > 1. ARM64 smmu will get memory from local numa node, it can save its
> > command queues and page tables locally. Tests show it can decrease
> > dma_unmap latency at lot. For example, without this patch, smmu on
> > node2 will get memory from node0 by calling dma_alloc_coherent(),
> > typically, it has to wait for more than 560ns for the completion of
> > CMD_SYNC in an empty command queue; with this patch, it needs 240ns
> > only.
> > 2. when we set iommu passthrough, drivers will get memory from CMA,
> > local memory means much less latency.
> >
> > Cc: Jonathan Cameron 
> > Cc: Christoph Hellwig 
> > Cc: Marek Szyprowski 
> > Cc: Will Deacon 
> > Cc: Robin Murphy 
> > Cc: Ganapatrao Kulkarni 
> > Cc: Catalin Marinas 
> > Cc: Nicolas Saenz Julienne 
> > Cc: Steve Capper 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Signed-off-by: Barry Song 
> > ---
> >   include/linux/dma-contiguous.h |  4 ++
> >   kernel/dma/Kconfig | 10 
> >   kernel/dma/contiguous.c| 99
> ++
> >   3 files changed, 104 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/dma-contiguous.h
> b/include/linux/dma-contiguous.h
> > index 03f8e98e3bcc..278a80a40456 100644
> > --- a/include/linux/dma-contiguous.h
> > +++ b/include/linux/dma-contiguous.h
> > @@ -79,6 +79,8 @@ static inline void dma_contiguous_set_default(struct
> cma *cma)
> >
> >   void dma_contiguous_reserve(phys_addr_t addr_limit);
> >
> > +void dma_pernuma_cma_reserve(void);
> > +
> >   int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t
> base,
> >phys_addr_t limit, struct cma **res_cma,
> >bool fixed);
> > @@ -128,6 +130,8 @@ static inline void dma_contiguous_set_default(struct
> cma *cma) { }
> >
> >   static inline void dma_contiguous_reserve(phys_addr_t limit) { }
> >
> > +static inline void dma_pernuma_cma_reserve(void) { }
> > +
> >   static inline int dma_contiguous_reserve_area(phys_addr_t size,
> phys_addr_t base,
> >phys_addr_t limit, struct cma **res_cma,
> >bool fixed)
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index d006668c0027..aeb976b1d21c 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -104,6 +104,16 @@ config DMA_CMA
> >   if  DMA_CMA
> >   comment "Default contiguous memory area size:"
> >
> > +config CMA_PERNUMA_SIZE_MBYTES
> > +   int "Size in Mega Bytes for per-numa CMA areas"
> > +   depends on NUMA
> > +   default 16 if ARM64
> > +   default 0
> > +   help
> > + Defines the size (in MiB) of the per-numa memory area for Contiguous
> > + Memory Allocator. Every numa node will get a separate CMA with this
> > + size. If the size of 0 is selected, per-numa CMA is disabled.
> > +
> 
> I think this needs to be cleverer than just a static config option.
> Pretty much everything else CMA-related is runtime-configurable to some
> degree, and doing any per-node setup when booting on a single-node
> system would be wasted effort.

I agree some dynamic configuration should be supported to set the size of the
CMA. It could be a new kernel command-line parameter, or we could leverage an
existing one.

For a system with NUMA enabled but only one node, or an actually non-NUMA
system, the current dma_pernuma_cma_reserve() won't do anything:

void __init dma_pernuma_cma_reserve(void)
{
int nid;

if (!pernuma_size_bytes || nr_online_nodes <= 1)
return;
}

> 
> Since this is conceptually very similar to the existing hugetlb_cma
> implementation I'm also wondering about inconsistency with respect to
> specifying per-node vs.

RE: [PATCH v3] mm/zswap: move to use crypto_acomp API for hardware acceleration

2020-06-26 Thread Song Bao Hua (Barry Song)
> > -Original Message-
> > From: linux-kernel-ow...@vger.kernel.org
> > [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Herbert Xu
> > Sent: Friday, June 26, 2020 7:20 PM
> > To: Song Bao Hua (Barry Song) 
> > Cc: a...@linux-foundation.org; linux...@kvack.org;
> > linux-kernel@vger.kernel.org; Linuxarm ; Luis
> > Claudio R . Goncalves ; Sebastian Andrzej Siewior
> > ; David S . Miller ;
> > Mahipal Challa ; Seth Jennings
> > ; Dan Streetman ; Vitaly Wool
> > ; Wangzhou (B) ;
> > Colin Ian King 
> > Subject: Re: [PATCH v3] mm/zswap: move to use crypto_acomp API for
> > hardware acceleration
> >
> > On Fri, Jun 26, 2020 at 07:09:03PM +1200, Barry Song wrote:
> > >
> > > + mutex_lock(&acomp_ctx->mutex);
> > > +
> > > + src = kmap(page);
> > > + dst = acomp_ctx->dstmem;
> > > + sg_init_one(&input, src, PAGE_SIZE);
> > > + /* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
> > > + sg_init_one(&output, dst, PAGE_SIZE * 2);
> > > + acomp_request_set_params(acomp_ctx->req, &input, &output,
> > PAGE_SIZE, dlen);
> > > + ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
> > &acomp_ctx->wait);
> > > + dlen = acomp_ctx->req->dlen;
> > > + kunmap(page);
> >
> > Waiting on an async request like this is just silly.  This defeats the
> > whole purpose of having a fallback.
> 
> For this zswap case and for this moment, it is probably not.
> As for this case, there are no two parallel (de)compressions which can be done
> in parallel in a single (de)compressor instance.
> The original zswap code is doing all compression/decompression in atomic
> context.
> Right now, to use acomp api, the patch has moved to sleep-able context.
> 
> However, compression/decompression can be done in parallel in different
> instances of acomp, also different cpus.
> 
> If we want to do multiple (de)compressions simultaneously in one acomp
> instance by callbacks, it will ask a large changes in zswap.c not only using 
> the
> new APIs. I think we can only achieve the ideal goal step by step.

On the other hand, I also don't think mm/frontswap.c supports an async store.
Calling the store callback of frontswap is pretty much a synchronous operation
for every single page:

int __frontswap_store(struct page *page)
{
...
/* Try to store in each implementation, until one succeeds. */
for_each_frontswap_ops(ops) {
ret = ops->store(type, offset, page);
if (!ret) /* successful store */
break;
}
...
if (frontswap_writethrough_enabled)
/* report failure so swap also writes to swap device */
ret = -1;
return ret;
}

If we don't want to execute a synchronous wait in zswap, a lot of things need
to change, not only zswap.

> 
> >
> > Cheers,
> > --
> > Email: Herbert Xu  Home Page:
> > http://gondor.apana.org.au/~herbert/
> > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> 
> -barry

Thanks
Barry



RE: [PATCH v3] mm/zswap: move to use crypto_acomp API for hardware acceleration

2020-06-26 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org
> [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Herbert Xu
> Sent: Friday, June 26, 2020 7:20 PM
> To: Song Bao Hua (Barry Song) 
> Cc: a...@linux-foundation.org; linux...@kvack.org;
> linux-kernel@vger.kernel.org; Linuxarm ; Luis Claudio
> R . Goncalves ; Sebastian Andrzej Siewior
> ; David S . Miller ; Mahipal
> Challa ; Seth Jennings
> ; Dan Streetman ; Vitaly Wool
> ; Wangzhou (B) ;
> Colin Ian King 
> Subject: Re: [PATCH v3] mm/zswap: move to use crypto_acomp API for
> hardware acceleration
> 
> On Fri, Jun 26, 2020 at 07:09:03PM +1200, Barry Song wrote:
> >
> > +   mutex_lock(&acomp_ctx->mutex);
> > +
> > +   src = kmap(page);
> > +   dst = acomp_ctx->dstmem;
> > +   sg_init_one(&input, src, PAGE_SIZE);
> > +   /* zswap_dstmem is of size (PAGE_SIZE * 2). Reflect same in sg_list */
> > +   sg_init_one(&output, dst, PAGE_SIZE * 2);
> > +   acomp_request_set_params(acomp_ctx->req, &input, &output,
> > PAGE_SIZE, dlen);
> > +   ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
> > &acomp_ctx->wait);
> > +   dlen = acomp_ctx->req->dlen;
> > +   kunmap(page);
> 
> Waiting on an async request like this is just silly.  This defeats
> the whole purpose of having a fallback.

For this zswap case and for this moment, it is probably not silly.
In this case, no two (de)compressions can be done in parallel within a single
(de)compressor instance.
The original zswap code does all compression/decompression in atomic context;
right now, to use the acomp API, the patch has moved to a sleepable context.

However, compression/decompression can still be done in parallel in different
acomp instances, i.e. on different CPUs.

If we want to do multiple (de)compressions simultaneously in one acomp instance
via callbacks, it will require large changes in zswap.c beyond just using the
new APIs. I think we can only reach the ideal goal step by step.

> 
> Cheers,
> --
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

-barry



RE: [PATCH v2 2/2] arm64: mm: reserve per-numa CMA after numa_init

2020-06-25 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Thursday, June 25, 2020 11:16 PM
> To: Song Bao Hua (Barry Song) ; h...@lst.de;
> m.szyprow...@samsung.com; w...@kernel.org;
> ganapatrao.kulka...@cavium.com; catalin.mari...@arm.com
> Cc: io...@lists.linux-foundation.org; Linuxarm ;
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org; Nicolas
> Saenz Julienne ; Steve Capper
> ; Andrew Morton ;
> Mike Rapoport 
> Subject: Re: [PATCH v2 2/2] arm64: mm: reserve per-numa CMA after
> numa_init
> 
> On 2020-06-25 08:43, Barry Song wrote:
> > Right now, smmu is using dma_alloc_coherent() to get memory to save
> queues
> > and tables. Typically, on ARM64 server, there is a default CMA located at
> > node0, which could be far away from node2, node3 etc.
> > with this patch, smmu will get memory from local numa node to save
> command
> > queues and page tables. that means dma_unmap latency will be shrunk
> much.
> > Meanwhile, when iommu.passthrough is on, device drivers which call dma_
> > alloc_coherent() will also get local memory and avoid the travel between
> > numa nodes.
> >
> > Cc: Christoph Hellwig 
> > Cc: Marek Szyprowski 
> > Cc: Will Deacon 
> > Cc: Robin Murphy 
> > Cc: Ganapatrao Kulkarni 
> > Cc: Catalin Marinas 
> > Cc: Nicolas Saenz Julienne 
> > Cc: Steve Capper 
> > Cc: Andrew Morton 
> > Cc: Mike Rapoport 
> > Signed-off-by: Barry Song 
> > ---
> >   arch/arm64/mm/init.c | 2 ++
> >   1 file changed, 2 insertions(+)
> >
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index 1e93cfc7c47a..07d4d1fe7983 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -420,6 +420,8 @@ void __init bootmem_init(void)
> >
> > arm64_numa_init();
> >
> > +   dma_pernuma_cma_reserve();
> > +
> 
> It might be worth putting this after the hugetlb_cma_reserve() call for
> clarity, since the comment below applies equally to this call too.

Yep, that looks even better, even though dma_pernuma_cma_reserve() is
self-documenting by name.

> 
> Robin.
> 
> > /*
> >  * must be done after arm64_numa_init() which calls numa_init() to
> >  * initialize node_online_map that gets used in hugetlb_cma_reserve()
> >
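
So the call site would end up roughly like the sketch below (illustrative only,
not the final patch):

	arm64_numa_init();

	/*
	 * must be done after arm64_numa_init() which calls numa_init() to
	 * initialize node_online_map that gets used while reserving the
	 * CMA areas below.
	 */
#ifdef CONFIG_ARM64_4K_PAGES
	hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
#endif
	dma_pernuma_cma_reserve();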
Thanks
Barry



RE: [PATCH] mm: cma: Return cma->name directly in cma_get_name

2020-06-22 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Nathan Chancellor [mailto:natechancel...@gmail.com]
> Sent: Tuesday, June 23, 2020 1:59 PM
> To: Andrew Morton 
> Cc: linux...@kvack.org; linux-kernel@vger.kernel.org;
> clang-built-li...@googlegroups.com; Song Bao Hua (Barry Song)
> ; Nathan Chancellor
> 
> Subject: [PATCH] mm: cma: Return cma->name directly in cma_get_name
> 
> clang warns:
> 
> mm/cma.c:55:14: warning: address of array 'cma->name' will always
> evaluate to 'true' [-Wpointer-bool-conversion]
> return cma->name ? cma->name : "(undefined)";
>~^~~~ ~
> 1 warning generated.
> 
> After commit e7f0557d7de9 ("mm: cma: fix the name of CMA areas"),
> cma->name will be an array with some value worth printing so we do not
> need to check for NULL.
> 
> Link: https://github.com/ClangBuiltLinux/linux/issues/1063
> Signed-off-by: Nathan Chancellor 

Thanks!

Acked-by: Barry Song 

> ---
>  mm/cma.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index 31a61d410c59..6cf08817bac6 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -52,7 +52,7 @@ unsigned long cma_get_size(const struct cma *cma)
> 
>  const char *cma_get_name(const struct cma *cma)
>  {
> - return cma->name ? cma->name : "(undefined)";
> + return cma->name;
>  }
> 
>  static unsigned long cma_bitmap_aligned_mask(const struct cma *cma,
> 
> base-commit: 27f11fea33608cbd321a97cbecfa2ef97dcc1821
> --
> 2.27.0



RE: [PATCH][next] mm/zswap: fix a couple of memory leaks and rework kzalloc failure check

2020-06-22 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> Sent: Tuesday, June 23, 2020 6:28 AM
> To: Colin King 
> Cc: Seth Jennings ; Dan Streetman
> ; Vitaly Wool ; Andrew
> Morton ; Song Bao Hua (Barry Song)
> ; Stephen Rothwell ;
> linux...@kvack.org; kernel-janit...@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH][next] mm/zswap: fix a couple of memory leaks and
> rework kzalloc failure check
> 
> On Mon, Jun 22, 2020 at 04:35:46PM +0100, Colin King wrote:
> > From: Colin Ian King 
> >
> > kzalloc failures return NULL on out of memory errors, so replace the
> > IS_ERR_OR_NULL check with the usual null pointer check.  Fix two memory
> > leaks with on acomp and acomp_ctx by ensuring these objects are free'd
> > on the error return path.
> >
> > Addresses-Coverity: ("Resource leak")
> > Fixes: d4f86abd6e35 ("mm/zswap: move to use crypto_acomp API for
> hardware acceleration")
> > Signed-off-by: Colin Ian King 


Colin, thanks for your patch. I am sorry, I sent a patch doing the same thing
as yours here:
https://lkml.org/lkml/2020/6/22/347


> > ---
> >  mm/zswap.c | 16 +++-
> >  1 file changed, 11 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 0d914ba6b4a0..14839cbac7ff 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -433,23 +433,23 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> > return 0;
> >
> > acomp_ctx = kzalloc(sizeof(*acomp_ctx), GFP_KERNEL);
> > -   if (IS_ERR_OR_NULL(acomp_ctx)) {
> > +   if (!acomp_ctx) {
> > pr_err("Could not initialize acomp_ctx\n");
> > return -ENOMEM;
> > }
> > acomp = crypto_alloc_acomp(pool->tfm_name, 0, 0);
> > -   if (IS_ERR_OR_NULL(acomp)) {
> > +   if (!acomp) {
> 
> This should be IS_ERR(acomp).  Please preserve the error code.
> 
> > pr_err("could not alloc crypto acomp %s : %ld\n",
> > pool->tfm_name, PTR_ERR(acomp));
> > -   return -ENOMEM;
> > +   goto free_acomp_ctx;
> > }
> > acomp_ctx->acomp = acomp;
> >
> > req = acomp_request_alloc(acomp_ctx->acomp);
> > -   if (IS_ERR_OR_NULL(req)) {
> > +   if (!req) {
> > pr_err("could not alloc crypto acomp %s : %ld\n",
> >pool->tfm_name, PTR_ERR(acomp));
> > -   return -ENOMEM;
> > +   goto free_acomp;
> > }
> > acomp_ctx->req = req;
> >
> > @@ -462,6 +462,12 @@ static int zswap_cpu_comp_prepare(unsigned int
> cpu, struct hlist_node *node)
> > *per_cpu_ptr(pool->acomp_ctx, cpu) = acomp_ctx;
> >
> > return 0;
> > +
> > +free_acomp:
> > +   kfree(acomp);
> 
> The kfree() isn't correct.  It needs to be:
> 
>   crypto_free_acomp(acomp);
> 
> > +free_acomp_ctx:
> > +   kfree(acomp_ctx);
> > +   return -ENOMEM;
> 
> regards,
> dan carpenter
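
Putting Dan's two comments together, the error path would roughly become
something like the sketch below (illustrative only; "ret" is a local added
just for the sketch):

	acomp = crypto_alloc_acomp(pool->tfm_name, 0, 0);
	if (IS_ERR(acomp)) {
		pr_err("could not alloc crypto acomp %s : %ld\n",
		       pool->tfm_name, PTR_ERR(acomp));
		ret = PTR_ERR(acomp);
		goto free_acomp_ctx;
	}
	acomp_ctx->acomp = acomp;

	req = acomp_request_alloc(acomp_ctx->acomp);
	if (!req) {
		pr_err("could not alloc crypto acomp_request %s\n",
		       pool->tfm_name);
		ret = -ENOMEM;
		goto free_acomp;
	}
	...
free_acomp:
	crypto_free_acomp(acomp);
free_acomp_ctx:
	kfree(acomp_ctx);
	return ret;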



RE: [PATCH v3] driver core: platform: expose numa_node to users in sysfs

2020-06-22 Thread Song Bao Hua (Barry Song)
> -Original Message-
> From: John Garry
> Sent: Monday, June 22, 2020 10:49 PM
> To: Song Bao Hua (Barry Song) ;
> gre...@linuxfoundation.org; raf...@kernel.org
> Cc: Robin Murphy ; linux-kernel@vger.kernel.org;
> Zengtao (B) ; Linuxarm 
> Subject: Re: [PATCH v3] driver core: platform: expose numa_node to users in
> sysfs
> 
> On 19/06/2020 04:00, Barry Song wrote:
> > Some platform devices like ARM SMMU are memory-mapped and populated
> by ACPI/IORT.
> > In this case, NUMA topology of those platform devices are exported by
> firmware as
> > well. Software might care about the numa_node of those devices in order to
> achieve
> > NUMA locality.
> 

Thanks for your review, John.

> Is it generally the case that the SMMU will be in the same NUMA node as
> the endpoint device (which you're driving)? If so, we can get this info


This could be true, but I am not sure if it has to be true :-)

On the other hand, drivers/acpi/arm64/iort.c has some code to set the NUMA node
for the SMMU. It doesn't assume the SMMU's numa_node is simply the same as that
of the PCI devices behind it.

static int  __init arm_smmu_v3_set_proximity(struct device *dev,
  struct acpi_iort_node *node)
{
struct acpi_iort_smmu_v3 *smmu;

smmu = (struct acpi_iort_smmu_v3 *)node->node_data;
if (smmu->flags & ACPI_IORT_SMMU_V3_PXM_VALID) {
int dev_node = acpi_map_pxm_to_node(smmu->pxm);

if (dev_node != NUMA_NO_NODE && !node_online(dev_node))
return -EINVAL;

set_dev_node(dev, dev_node);
pr_info("SMMU-v3[%llx] Mapped to Proximity domain %d\n",
smmu->base_address,
smmu->pxm);
}
return 0;
}

numa_node may also be extended to other platform devices once we provide a
common dev_set_proximity() callback for them; iort_add_platform_device() will
then set the node for them:

static int __init iort_add_platform_device(struct acpi_iort_node *node,
   const struct iort_dev_config *ops)
{
struct fwnode_handle *fwnode;
struct platform_device *pdev;
struct resource *r;
int ret, count;

pdev = platform_device_alloc(ops->name, PLATFORM_DEVID_AUTO);
if (!pdev)
return -ENOMEM;

if (ops->dev_set_proximity) {
ret = ops->dev_set_proximity(&pdev->dev, node);
if (ret)
goto dev_put;
}
...
}

It is probably worth making dev_set_proximity() common to all IORT devices.

> from sysfs already for the endpoint, and also have a link from the
> endpoint to the iommu for pci devices (which I assume you're interested in):
> 

> root@(none)$ ls -l /sys/devices/pci:74/:74:02.0/ | grep iommu
> lrwxrwxrwx 1 root  root 0 Jun 22 10:33 iommu ->
> ../../platform/arm-smmu-v3.2.auto/iommu/smmu3.0x00014000
> lrwxrwxrwx 1 root  root 0 Jun 22 10:33 iommu_group ->
> ../../../kernel/iommu_groups/0
> root@(none)$

Sure, there is an implicit way to figure out the NUMA node of the SMMU via the
various links between the SMMU and the devices which use it, if the SMMU and
those devices happen to be in the same NUMA node.

However, it is still much clearer and more credible to users to expose the data
directly from the ACPI table.
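
For reference, the shape of the change is essentially exposing dev_to_node()
via a sysfs attribute on platform devices; a simplified sketch of the idea is
below (the actual v3 patch may additionally hide the attribute when the node
is NUMA_NO_NODE):

	static ssize_t numa_node_show(struct device *dev,
				      struct device_attribute *attr, char *buf)
	{
		return sprintf(buf, "%d\n", dev_to_node(dev));
	}
	static DEVICE_ATTR_RO(numa_node);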

> 
> Thanks,
> John

Barry



RE: [PATCH 1/3] crypto: permit users to specify numa node of acomp hardware

2020-06-22 Thread Song Bao Hua (Barry Song)
> -Original Message-
> From: Jonathan Cameron
> Sent: Monday, June 22, 2020 9:59 PM
> To: Song Bao Hua (Barry Song) 
> Cc: herb...@gondor.apana.org.au; da...@davemloft.net; Seth Jennings
> ; Linuxarm ;
> linux-kernel@vger.kernel.org; linux...@kvack.org;
> linux-cry...@vger.kernel.org; a...@linux-foundation.org; Dan Streetman
> ; Vitaly Wool 
> Subject: Re: [PATCH 1/3] crypto: permit users to specify numa node of acomp
> hardware
> 
> On Mon, 22 Jun 2020 14:48:59 +1200
> Barry Song  wrote:
> 
> > For a Linux server with NUMA, there are possibly multiple (de)compressors
> > which are either local or remote to some NUMA node. Some drivers will
> > automatically use the (de)compressor near the CPU calling acomp_alloc().
> > However, it is not necessarily correct because users who send acomp_req
> > could be from different NUMA node with the CPU which allocates acomp.
> >
> > Just like kernel has kmalloc() and kmalloc_node(), here crypto can have
> > same support.
> >
> > Cc: Seth Jennings 
> > Cc: Dan Streetman 
> > Cc: Vitaly Wool 
> > Cc: Andrew Morton 
> > Signed-off-by: Barry Song 
> 
> Hi Barry,
> 
> Seems sensible to me.  A few trivial comments inline.

Thanks for your review, Jonathan.

> 
> Thanks,
> 
> Jonathan
> 
> > ---
> >  crypto/acompress.c |  8 
> >  crypto/api.c   | 22 ++
> >  crypto/internal.h  | 23 +++
> >  include/crypto/acompress.h |  7 +++
> >  include/linux/crypto.h |  3 ++-
> >  5 files changed, 50 insertions(+), 13 deletions(-)
> >
> > diff --git a/crypto/acompress.c b/crypto/acompress.c
> > index 84a76723e851..c32c72048a1c 100644
> > --- a/crypto/acompress.c
> > +++ b/crypto/acompress.c
> > @@ -109,6 +109,14 @@ struct crypto_acomp *crypto_alloc_acomp(const
> char *alg_name, u32 type,
> >  }
> >  EXPORT_SYMBOL_GPL(crypto_alloc_acomp);
> >
> > +struct crypto_acomp *crypto_alloc_acomp_node(const char *alg_name,
> u32 type,
> > +   u32 mask, int node)
> > +{
> > +   return crypto_alloc_tfm_node(alg_name, _acomp_type, type,
> mask,
> > +   node);
> > +}
> > +EXPORT_SYMBOL_GPL(crypto_alloc_acomp_node);
> > +
> >  struct acomp_req *acomp_request_alloc(struct crypto_acomp *acomp)
> >  {
> > struct crypto_tfm *tfm = crypto_acomp_tfm(acomp);
> > diff --git a/crypto/api.c b/crypto/api.c
> > index edcf690800d4..4ecf712286af 100644
> > --- a/crypto/api.c
> > +++ b/crypto/api.c
> > @@ -433,8 +433,9 @@ struct crypto_tfm *crypto_alloc_base(const char
> *alg_name, u32 type, u32 mask)
> >  }
> >  EXPORT_SYMBOL_GPL(crypto_alloc_base);
> >
> > -void *crypto_create_tfm(struct crypto_alg *alg,
> > -   const struct crypto_type *frontend)
> > +void *crypto_create_tfm_node(struct crypto_alg *alg,
> > +   const struct crypto_type *frontend,
> > +   int node)
> >  {
> > char *mem;
> > struct crypto_tfm *tfm = NULL;
> > @@ -451,6 +452,7 @@ void *crypto_create_tfm(struct crypto_alg *alg,
> >
> > tfm = (struct crypto_tfm *)(mem + tfmsize);
> > tfm->__crt_alg = alg;
> > +   tfm->node = node;
> >
> > err = frontend->init_tfm(tfm);
> > if (err)
> > @@ -472,7 +474,7 @@ void *crypto_create_tfm(struct crypto_alg *alg,
> >  out:
> > return mem;
> >  }
> > -EXPORT_SYMBOL_GPL(crypto_create_tfm);
> > +EXPORT_SYMBOL_GPL(crypto_create_tfm_node);
> >
> >  struct crypto_alg *crypto_find_alg(const char *alg_name,
> >const struct crypto_type *frontend,
> > @@ -490,11 +492,13 @@ struct crypto_alg *crypto_find_alg(const char
> *alg_name,
> >  EXPORT_SYMBOL_GPL(crypto_find_alg);
> >
> >  /*
> > - * crypto_alloc_tfm - Locate algorithm and allocate transform
> > + * crypto_alloc_tfm_node - Locate algorithm and allocate transform
> >   * @alg_name: Name of algorithm
> >   * @frontend: Frontend algorithm type
> >   * @type: Type of algorithm
> >   * @mask: Mask for type comparison
> > + * @node: NUMA node in which users desire to put requests, if node is
> > + * NUMA_NO_NODE, it means users have no special requirement.
> >   *
> >   * crypto_alloc_tfm() will first attempt to locate an already loaded
> >   * algorithm.  If that fails and the kernel supports dynamically
> loadable
> > @@ -509,8 +513,10 @@ EXPORT_SYMBOL_GPL(cryp

RE: [PATCH v2] mm/zswap: move to use crypto_acomp API for hardware acceleration

2020-06-21 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Vitaly Wool [mailto:vitaly.w...@konsulko.com]
> Sent: Sunday, June 21, 2020 11:16 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Andrew Morton ;
> herb...@gondor.apana.org.au; da...@davemloft.net;
> linux-cry...@vger.kernel.org; Linux-MM ; LKML
> ; Linuxarm ; Luis
> Claudio R . Goncalves ; Sebastian Andrzej Siewior
> ; Mahipal Challa ;
> Seth Jennings ; Dan Streetman ;
> Wangzhou (B) 
> Subject: Re: [PATCH v2] mm/zswap: move to use crypto_acomp API for
> hardware acceleration
> 
> On Sun, Jun 21, 2020 at 1:52 AM Barry Song 
> wrote:
> >
> > right now, all new ZIP drivers are using crypto_acomp APIs rather than
> > legacy crypto_comp APIs. But zswap.c is still using the old APIs. That
> > means zswap won't be able to use any new zip drivers in kernel.
> >
> > This patch moves to use cryto_acomp APIs to fix the problem. On the
> > other hand, tradiontal compressors like lz4,lzo etc have been wrapped
> > into acomp via scomp backend. So platforms without async compressors
> > can fallback to use acomp via scomp backend.
> >
> > Cc: Luis Claudio R. Goncalves 
> > Cc: Sebastian Andrzej Siewior 
> > Cc: Andrew Morton 
> > Cc: Herbert Xu 
> > Cc: David S. Miller 
> > Cc: Mahipal Challa 
> > Cc: Seth Jennings 
> > Cc: Dan Streetman 
> > Cc: Vitaly Wool 
> > Cc: Zhou Wang 
> > Signed-off-by: Barry Song 
> > ---
> >  -v2:
> >  rebase to 5.8-rc1;
> >  cleanup commit log;
> >  cleanup to improve the readability according to Sebastian's comment
> >
> >  mm/zswap.c | 153
> > ++---
> >  1 file changed, 110 insertions(+), 43 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index fbb782924ccc..0d914ba6b4a0 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -24,8 +24,10 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #include 
> >  #include 
> > @@ -127,9 +129,17 @@
> module_param_named(same_filled_pages_enabled,
> > zswap_same_filled_pages_enabled,
> >  * data structures
> >  **/
> >
> > +struct crypto_acomp_ctx {
> > +   struct crypto_acomp *acomp;
> > +   struct acomp_req *req;
> > +   struct crypto_wait wait;
> > +   u8 *dstmem;
> > +   struct mutex mutex;
> > +};
> > +
> >  struct zswap_pool {
> > struct zpool *zpool;
> > -   struct crypto_comp * __percpu *tfm;
> > +   struct crypto_acomp_ctx * __percpu *acomp_ctx;
> > struct kref kref;
> > struct list_head list;
> > struct work_struct release_work; @@ -415,30 +425,60 @@ static
> > int zswap_dstmem_dead(unsigned int cpu)  static int
> > zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)  {
> > struct zswap_pool *pool = hlist_entry(node, struct zswap_pool,
> node);
> > -   struct crypto_comp *tfm;
> > +   struct crypto_acomp *acomp;
> > +   struct acomp_req *req;
> > +   struct crypto_acomp_ctx *acomp_ctx;
> >
> > -   if (WARN_ON(*per_cpu_ptr(pool->tfm, cpu)))
> > +   if (WARN_ON(*per_cpu_ptr(pool->acomp_ctx, cpu)))
> > return 0;
> >
> > -   tfm = crypto_alloc_comp(pool->tfm_name, 0, 0);
> > -   if (IS_ERR_OR_NULL(tfm)) {
> > -   pr_err("could not alloc crypto comp %s : %ld\n",
> > -  pool->tfm_name, PTR_ERR(tfm));
> > +   acomp_ctx = kzalloc(sizeof(*acomp_ctx), GFP_KERNEL);
> > +   if (IS_ERR_OR_NULL(acomp_ctx)) {
> > +   pr_err("Could not initialize acomp_ctx\n");
> > +   return -ENOMEM;
> > +   }
> > +   acomp = crypto_alloc_acomp(pool->tfm_name, 0, 0);
> > +   if (IS_ERR_OR_NULL(acomp)) {
> > +   pr_err("could not alloc crypto acomp %s : %ld\n",
> > +   pool->tfm_name, PTR_ERR(acomp));
> > return -ENOMEM;
> > }
> 
> I bet you actually want to free acomp_ctx here. Overall, could you please
> provide more careful error path implementation or explain why it isn't
> necessary?

Oops. I could hardly believe my eyes. It is definitely necessary to free the
allocated data structure here; I will send an incremental patch to fix this.
Thanks for your comment.
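
For reference, a minimal sketch of what the corrected error path could look
like (my illustration only, not the exact incremental patch; the remaining
per-cpu setup for wait/mutex/dstmem is elided):

static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
{
	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
	struct crypto_acomp_ctx *acomp_ctx;
	struct crypto_acomp *acomp;
	struct acomp_req *req;

	if (WARN_ON(*per_cpu_ptr(pool->acomp_ctx, cpu)))
		return 0;

	/* kzalloc() returns NULL on failure, so a plain NULL check is enough */
	acomp_ctx = kzalloc(sizeof(*acomp_ctx), GFP_KERNEL);
	if (!acomp_ctx)
		return -ENOMEM;

	acomp = crypto_alloc_acomp(pool->tfm_name, 0, 0);
	if (IS_ERR(acomp)) {
		pr_err("could not alloc crypto acomp %s : %ld\n",
		       pool->tfm_name, PTR_ERR(acomp));
		kfree(acomp_ctx);		/* the leak Vitaly pointed out */
		return -ENOMEM;
	}
	acomp_ctx->acomp = acomp;

	req = acomp_request_alloc(acomp_ctx->acomp);
	if (!req) {
		pr_err("could not alloc crypto acomp_request %s\n",
		       pool->tfm_name);
		crypto_free_acomp(acomp_ctx->acomp);
		kfree(acomp_ctx);
		return -ENOMEM;
	}
	acomp_ctx->req = req;

	/* ... remaining per-cpu setup (wait, mutex, dstmem) unchanged ... */
	return 0;
}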

Best Regards,
Barry
> 
> Best regards,
> Vitaly



RE: [PATCH v2] arm64: mm: reserve hugetlb CMA after numa_init

2020-06-18 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Will Deacon [mailto:w...@kernel.org]
> Sent: Thursday, June 18, 2020 7:20 PM
> To: Song Bao Hua (Barry Song) 
> Cc: Roman Gushchin ; catalin.mari...@arm.com;
> nsaenzjulie...@suse.de; steve.cap...@arm.com; r...@linux.ibm.com;
> a...@linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; Linuxarm ; Matthias
> Brugger 
> Subject: Re: [PATCH v2] arm64: mm: reserve hugetlb CMA after numa_init
> 
> On Wed, Jun 17, 2020 at 09:43:51PM +, Song Bao Hua (Barry Song)
> wrote:
> > > From: Roman Gushchin [mailto:g...@fb.com]
> > > On Wed, Jun 17, 2020 at 11:38:03AM +, Song Bao Hua (Barry Song)
> > > > > From: Will Deacon [mailto:w...@kernel.org]
> > > > > On Wed, Jun 17, 2020 at 10:19:24AM +1200, Barry Song wrote:
> > > > > > hugetlb_cma_reserve() is called at the wrong place. numa_init has 
> > > > > > not
> > > > > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > > > > > index e631e6425165..41914b483d54 100644
> > > > > > --- a/arch/arm64/mm/init.c
> > > > > > +++ b/arch/arm64/mm/init.c
> > > > > > @@ -404,11 +404,6 @@ void __init arm64_memblock_init(void)
> > > > > > high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
> > > > > >
> > > > > > dma_contiguous_reserve(arm64_dma32_phys_limit);
> > > > > > -
> > > > > > -#ifdef CONFIG_ARM64_4K_PAGES
> > > > > > -   hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> > > > > > -#endif
> > > > >
> > > > > Why is this dependent on CONFIG_ARM64_4K_PAGES? We
> unconditionally
> > > > > select ARCH_HAS_GIGANTIC_PAGE so this seems unnecessary.
> > > >
> > > > Roman, would you like to answer this question? Have you found any
> > > problem if system
> > > > doesn't set 4K_PAGES?
> > >
> > > No, I was just following the code in arch/arm64/mm/hugetlbpage.c where
> all
> > > related to PUD-sized pages is guarded by CONFIG_ARM64_4K_PAGES.
> > > Actually I did all my testing on x86-64, I don't even have any arm
> hardware.
> > >
> > > I'm totally fine with removing this #ifdef if it's not needed.
> >
> > At this moment, I would suggest we should keep this "ifdef". Otherwise,
> hugetlb_cma_reserve() won't be really useful.
> >
> > For example, while setting PAGE size to 64KB. I got this error in
> hugetlb_cma_reserve():
> > hugetlb_cma: cma area should be at least 4194304 MiB
> > This is absolutely unreasonable.
> 
> Maybe one for RaspberryPi 5, huh? ;)
> 
> But ok, I'll take your patch as-is and add a comment about NUMA.

Have you seen v3 with the comment? I've already sent it.

> 
> Thanks,
> 
> Will


RE: [PATCH v2] arm64: mm: reserve hugetlb CMA after numa_init

2020-06-17 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Roman Gushchin [mailto:g...@fb.com]
> Sent: Thursday, June 18, 2020 6:20 AM
> To: Song Bao Hua (Barry Song) 
> Cc: Will Deacon ; catalin.mari...@arm.com;
> nsaenzjulie...@suse.de; steve.cap...@arm.com; r...@linux.ibm.com;
> a...@linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; Linuxarm ; Matthias
> Brugger 
> Subject: Re: [PATCH v2] arm64: mm: reserve hugetlb CMA after numa_init
> 
> On Wed, Jun 17, 2020 at 11:38:03AM +, Song Bao Hua (Barry Song)
> wrote:
> >
> >
> > > -Original Message-
> > > From: Will Deacon [mailto:w...@kernel.org]
> > > Sent: Wednesday, June 17, 2020 10:18 PM
> > > To: Song Bao Hua (Barry Song) 
> > > Cc: catalin.mari...@arm.com; nsaenzjulie...@suse.de;
> > > steve.cap...@arm.com; r...@linux.ibm.com;
> a...@linux-foundation.org;
> > > linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org;
> Linuxarm
> > > ; Matthias Brugger ;
> > > Roman Gushchin 
> > > Subject: Re: [PATCH v2] arm64: mm: reserve hugetlb CMA after numa_init
> > >
> > > On Wed, Jun 17, 2020 at 10:19:24AM +1200, Barry Song wrote:
> > > > hugetlb_cma_reserve() is called at the wrong place. numa_init has not
> been
> > > > done yet. so all reserved memory will be located at node0.
> > > >
> > > > Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic
> hugepages
> > > using cma")
> > >
> > > Damn, wasn't CC'd on that :/
> > >
> > > > Cc: Matthias Brugger 
> > > > Acked-by: Roman Gushchin 
> > > > Signed-off-by: Barry Song 
> > > > ---
> > > >  -v2: add Fixes tag according to Matthias Brugger's comment
> > > >
> > > >  arch/arm64/mm/init.c | 10 +-
> > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > > > index e631e6425165..41914b483d54 100644
> > > > --- a/arch/arm64/mm/init.c
> > > > +++ b/arch/arm64/mm/init.c
> > > > @@ -404,11 +404,6 @@ void __init arm64_memblock_init(void)
> > > > high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
> > > >
> > > > dma_contiguous_reserve(arm64_dma32_phys_limit);
> > > > -
> > > > -#ifdef CONFIG_ARM64_4K_PAGES
> > > > -   hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> > > > -#endif
> > >
> > > Why is this dependent on CONFIG_ARM64_4K_PAGES? We unconditionally
> > > select ARCH_HAS_GIGANTIC_PAGE so this seems unnecessary.
> >
> > Roman, would you like to answer this question? Have you found any
> problem if system
> > doesn't set 4K_PAGES?
> 
> No, I was just following the code in arch/arm64/mm/hugetlbpage.c where all
> related to PUD-sized pages is guarded by CONFIG_ARM64_4K_PAGES.
> Actually I did all my testing on x86-64, I don't even have any arm hardware.
> 
> I'm totally fine with removing this #ifdef if it's not needed.

At this moment, I would suggest we keep this "ifdef"; otherwise
hugetlb_cma_reserve() won't really be useful.

For example, with the page size set to 64KB I got this error from
hugetlb_cma_reserve():
hugetlb_cma: cma area should be at least 4194304 MiB
That is absolutely unreasonable.

Supporting hugetlb_cma_reserve() for page sizes other than 4K is a different
issue. It might be addressed in a separate patch later.
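
For reference, a back-of-the-envelope sketch of where the 4194304 MiB figure
comes from (my own illustration, assuming the usual 48-bit VA layouts):
hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT) requires the CMA area to hold at
least one PUD-sized gigantic page, and that size explodes once pages are 64KB.

#include <stdio.h>

int main(void)
{
	/* arm64 with 4K pages (4 levels):  PUD_SHIFT = 30 -> 1 GiB */
	/* arm64 with 64K pages (3 levels): PUD_SHIFT = 42 -> 4 TiB */
	const struct { const char *cfg; int pud_shift; } cfgs[] = {
		{ "4K pages",  30 },
		{ "64K pages", 42 },
	};

	for (int i = 0; i < 2; i++) {
		unsigned long long pud_bytes = 1ULL << cfgs[i].pud_shift;
		printf("%-9s: PUD-sized gigantic page = %llu MiB\n",
		       cfgs[i].cfg, pud_bytes >> 20);
	}
	return 0;
}

The 64KB line prints exactly the 4194304 MiB the error message above
complains about.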

> 
> Thanks!
> 
> >
> > >
> > > > -
> > > >  }
> > > >
> > > >  void __init bootmem_init(void)
> > > > @@ -424,6 +419,11 @@ void __init bootmem_init(void)
> > > > min_low_pfn = min;
> > > >
> > > > arm64_numa_init();
> > > > +
> > > > +#ifdef CONFIG_ARM64_4K_PAGES
> > > > +   hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> > > > +#endif
> > >
> > > A comment here wouldn't hurt, as it does look a lot more natural next
> > > to dma_contiguous_reserve().
> >
> > Will add some comment here.
> >
> > >
> > > Will
> >
> > barry


RE: [PATCH v2] arm64: mm: reserve hugetlb CMA after numa_init

2020-06-17 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Will Deacon [mailto:w...@kernel.org]
> Sent: Wednesday, June 17, 2020 10:18 PM
> To: Song Bao Hua (Barry Song) 
> Cc: catalin.mari...@arm.com; nsaenzjulie...@suse.de;
> steve.cap...@arm.com; r...@linux.ibm.com; a...@linux-foundation.org;
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org; Linuxarm
> ; Matthias Brugger ;
> Roman Gushchin 
> Subject: Re: [PATCH v2] arm64: mm: reserve hugetlb CMA after numa_init
> 
> On Wed, Jun 17, 2020 at 10:19:24AM +1200, Barry Song wrote:
> > hugetlb_cma_reserve() is called at the wrong place. numa_init has not been
> > done yet. so all reserved memory will be located at node0.
> >
> > Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages
> using cma")
> 
> Damn, wasn't CC'd on that :/
> 
> > Cc: Matthias Brugger 
> > Acked-by: Roman Gushchin 
> > Signed-off-by: Barry Song 
> > ---
> >  -v2: add Fixes tag according to Matthias Brugger's comment
> >
> >  arch/arm64/mm/init.c | 10 +-
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> > index e631e6425165..41914b483d54 100644
> > --- a/arch/arm64/mm/init.c
> > +++ b/arch/arm64/mm/init.c
> > @@ -404,11 +404,6 @@ void __init arm64_memblock_init(void)
> > high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
> >
> > dma_contiguous_reserve(arm64_dma32_phys_limit);
> > -
> > -#ifdef CONFIG_ARM64_4K_PAGES
> > -   hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> > -#endif
> 
> Why is this dependent on CONFIG_ARM64_4K_PAGES? We unconditionally
> select ARCH_HAS_GIGANTIC_PAGE so this seems unnecessary.

Roman, would you like to answer this question? Have you found any problem if
the system doesn't use 4K pages?

> 
> > -
> >  }
> >
> >  void __init bootmem_init(void)
> > @@ -424,6 +419,11 @@ void __init bootmem_init(void)
> > min_low_pfn = min;
> >
> > arm64_numa_init();
> > +
> > +#ifdef CONFIG_ARM64_4K_PAGES
> > +   hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> > +#endif
> 
> A comment here wouldn't hurt, as it does look a lot more natural next
> to dma_contiguous_reserve().

Will add some comment here.

> 
> Will

barry


RE: [PATCH 2/3] arm64: mm: reserve hugetlb CMA after numa_init

2020-06-07 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: Matthias Brugger [mailto:matthias@gmail.com]
> Sent: Monday, June 8, 2020 8:15 AM
> To: Roman Gushchin ; Song Bao Hua (Barry Song)
> 
> Cc: catalin.mari...@arm.com; John Garry ;
> linux-kernel@vger.kernel.org; Linuxarm ;
> io...@lists.linux-foundation.org; Zengtao (B) ;
> Jonathan Cameron ;
> robin.mur...@arm.com; h...@lst.de; linux-arm-ker...@lists.infradead.org;
> m.szyprow...@samsung.com
> Subject: Re: [PATCH 2/3] arm64: mm: reserve hugetlb CMA after numa_init
> 
> 
> 
> On 03/06/2020 05:22, Roman Gushchin wrote:
> > On Wed, Jun 03, 2020 at 02:42:30PM +1200, Barry Song wrote:
> >> hugetlb_cma_reserve() is called at the wrong place. numa_init has not been
> >> done yet. so all reserved memory will be located at node0.
> >>
> >> Cc: Roman Gushchin 
> >> Signed-off-by: Barry Song 
> >
> > Acked-by: Roman Gushchin 
> >
> 
> When did this break or was it broken since the beginning?
> In any case, could you provide a "Fixes" tag for it, so that it can easily be
> backported to older releases.

I guess it has been broken from the very beginning.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cf11e85fc08cc

Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages using cma")

Do you think it would be better for me to send a v2 of this patch separately
with this tag, and take it out of my original per-numa CMA patch set?
Please give your suggestion.

Best Regards
Barry

> 
> Regards,
> Matthias
> 
> > Thanks!
> >
> >> ---
> >>  arch/arm64/mm/init.c | 10 +-
> >>  1 file changed, 5 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> >> index e42727e3568e..8f0e70ebb49d 100644
> >> --- a/arch/arm64/mm/init.c
> >> +++ b/arch/arm64/mm/init.c
> >> @@ -458,11 +458,6 @@ void __init arm64_memblock_init(void)
> >>high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
> >>
> >>dma_contiguous_reserve(arm64_dma32_phys_limit);
> >> -
> >> -#ifdef CONFIG_ARM64_4K_PAGES
> >> -  hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> >> -#endif
> >> -
> >>  }
> >>
> >>  void __init bootmem_init(void)
> >> @@ -478,6 +473,11 @@ void __init bootmem_init(void)
> >>min_low_pfn = min;
> >>
> >>arm64_numa_init();
> >> +
> >> +#ifdef CONFIG_ARM64_4K_PAGES
> >> +  hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
> >> +#endif
> >> +
> >>/*
> >> * Sparsemem tries to allocate bootmem in memory_present(), so must
> be
> >> * done after the fixed reservations.
> >> --
> >> 2.23.0



RE: [kbuild-all] Re: [PATCH 1/3] dma-direct: provide the ability to reserve per-numa CMA

2020-06-06 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Philip Li [mailto:philip...@intel.com]
> Sent: Saturday, June 6, 2020 3:47 PM
> To: Dan Carpenter 
> Cc: Song Bao Hua (Barry Song) ;
> kbu...@lists.01.org; h...@lst.de; m.szyprow...@samsung.com;
> robin.mur...@arm.com; catalin.mari...@arm.com; l...@intel.com; Dan
> Carpenter ; kbuild-...@lists.01.org;
> io...@lists.linux-foundation.org; linux-arm-ker...@lists.infradead.org;
> linux-kernel@vger.kernel.org; Linuxarm ; Jonathan
> Cameron ; John Garry
> 
> Subject: Re: [kbuild-all] Re: [PATCH 1/3] dma-direct: provide the ability to
> reserve per-numa CMA
> 
> On Fri, Jun 05, 2020 at 11:57:51AM +0300, Dan Carpenter wrote:
> > On Fri, Jun 05, 2020 at 06:04:31AM +, Song Bao Hua (Barry Song)
> wrote:
> > >
> > >
> > > > -Original Message-
> > > > From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> > > > Sent: Thursday, June 4, 2020 11:37 PM
> > > > To: kbu...@lists.01.org; Song Bao Hua (Barry Song)
> > > > ; h...@lst.de;
> > > > m.szyprow...@samsung.com; robin.mur...@arm.com;
> > > > catalin.mari...@arm.com
> > > > Cc: l...@intel.com; Dan Carpenter ;
> > > > kbuild-...@lists.01.org; io...@lists.linux-foundation.org;
> > > > linux-arm-ker...@lists.infradead.org;
> > > > linux-kernel@vger.kernel.org; Linuxarm ;
> > > > Jonathan Cameron ; John Garry
> > > > 
> > > > Subject: Re: [PATCH 1/3] dma-direct: provide the ability to
> > > > reserve per-numa CMA
> > > >
> > > > Hi Barry,
> > > >
> > > > url:
> > > > https://github.com/0day-ci/linux/commits/Barry-Song/support-per-nu
> > > > ma-CM
> > > > A-for-ARM-server/20200603-104821
> > > > base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
> > > > for-next/core
> > > > config: x86_64-randconfig-m001-20200603 (attached as .config)
> > > > compiler: gcc-9 (Debian 9.3.0-13) 9.3.0
> > > >
> > > > If you fix the issue, kindly add following tag as appropriate
> > > > Reported-by: kernel test robot 
> > > > Reported-by: Dan Carpenter 
> > >
> > > Dan, thanks! Good catch!
> > > as this patch has not been in mainline yet, is it correct to add these
> "reported-by" in patch v2?
> Hi Barry, we provide the suggestion here, but you may decide whether to add it
> as appropriate for your situation. For a patch still under development, it is
> not that necessary to add, I think.

Hi Philip, Dan,
Thanks for the clarification.
> 
> >
> > These are just an automated email from the zero day bot.  I look over
> > the Smatch warnings and then forward them on.
> >
> > regards,
> > dan carpenter

Best regards
Barry



RE: [PATCH 1/3] dma-direct: provide the ability to reserve per-numa CMA

2020-06-05 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> Sent: Thursday, June 4, 2020 11:37 PM
> To: kbu...@lists.01.org; Song Bao Hua (Barry Song)
> ; h...@lst.de; m.szyprow...@samsung.com;
> robin.mur...@arm.com; catalin.mari...@arm.com
> Cc: l...@intel.com; Dan Carpenter ;
> kbuild-...@lists.01.org; io...@lists.linux-foundation.org;
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org; Linuxarm
> ; Jonathan Cameron
> ; John Garry 
> Subject: Re: [PATCH 1/3] dma-direct: provide the ability to reserve per-numa
> CMA
> 
> Hi Barry,
> 
> url:
> https://github.com/0day-ci/linux/commits/Barry-Song/support-per-numa-CMA-for-ARM-server/20200603-104821
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
> for-next/core
> config: x86_64-randconfig-m001-20200603 (attached as .config)
> compiler: gcc-9 (Debian 9.3.0-13) 9.3.0
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
> Reported-by: Dan Carpenter 

Dan, thanks! Good catch!
As this patch is not in mainline yet, is it correct to add these
"Reported-by" tags in patch v2?

Barry

> 
> smatch warnings:
> kernel/dma/contiguous.c:274 dma_alloc_contiguous() warn: variable
> dereferenced before check 'dev' (see line 272)
> 
> # https://github.com/0day-ci/linux/commit/adb919e972c1cac3d8b11905d5258d23d3aac6a4
> git remote add linux-review https://github.com/0day-ci/linux
> git remote update linux-review
> git checkout adb919e972c1cac3d8b11905d5258d23d3aac6a4
> vim +/dev +274 kernel/dma/contiguous.c
> 
> b1d2dc009dece4 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  267  struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
> b1d2dc009dece4 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  268  {
> 90ae409f9eb3bc kernel/dma/contiguous.c Christoph Hellwig 2019-08-20  269   size_t count = size >> PAGE_SHIFT;
> b1d2dc009dece4 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  270   struct page *page = NULL;
> bd2e75633c8012 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  271   struct cma *cma = NULL;
> adb919e972c1ca kernel/dma/contiguous.c Barry Song        2020-06-03 @272   int nid = dev_to_node(dev);
> 
>                                         ^^^ Dereferenced inside function.
> 
> bd2e75633c8012 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  273
> bd2e75633c8012 kernel/dma/contiguous.c Nicolin Chen      2019-05-23 @274   if (dev && dev->cma_area)
> 
>                                         ^^^ Too late.
> 
> bd2e75633c8012 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  275   cma = dev->cma_area;
> adb919e972c1ca kernel/dma/contiguous.c Barry Song        2020-06-03  276   else if ((nid != NUMA_NO_NODE) && dma_contiguous_pernuma_area[nid]
> adb919e972c1ca kernel/dma/contiguous.c Barry Song        2020-06-03  277   && !(gfp & (GFP_DMA | GFP_DMA32)))
> adb919e972c1ca kernel/dma/contiguous.c Barry Song        2020-06-03  278   cma = dma_contiguous_pernuma_area[nid];
> bd2e75633c8012 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  279   else if (count > 1)
> bd2e75633c8012 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  280   cma = dma_contiguous_default_area;
> b1d2dc009dece4 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  281
> b1d2dc009dece4 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  282   /* CMA can be used only in the context which permits sleeping */
> b1d2dc009dece4 kernel/dma/contiguous.c Nicolin Chen      2019-05-23  283   if (cma && gfpflags_allow_blocking(gfp)) {
> 
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
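
For reference, a minimal sketch of the kind of fix the report asks for (my
illustration, not necessarily the exact v2 hunk): only call dev_to_node()
once dev is known to be non-NULL, so the dereference can no longer happen
before the check.

struct page *dma_alloc_contiguous(struct device *dev, size_t size, gfp_t gfp)
{
	size_t count = size >> PAGE_SHIFT;
	struct page *page = NULL;
	struct cma *cma = NULL;
	int nid = dev ? dev_to_node(dev) : NUMA_NO_NODE;

	if (dev && dev->cma_area)
		cma = dev->cma_area;
	else if (nid != NUMA_NO_NODE && dma_contiguous_pernuma_area[nid] &&
		 !(gfp & (GFP_DMA | GFP_DMA32)))
		cma = dma_contiguous_pernuma_area[nid];
	else if (count > 1)
		cma = dma_contiguous_default_area;

	/* CMA can be used only in the context which permits sleeping */
	if (cma && gfpflags_allow_blocking(gfp)) {
		/* ... allocation from the selected CMA area, unchanged ... */
	}
	return page;
}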


RE: [PATCH] driver core: platform: expose numa_node to users in sysfs

2020-06-02 Thread Song Bao Hua (Barry Song)
> >
> > On Tue, Jun 02, 2020 at 05:09:57AM +0000, Song Bao Hua (Barry Song)
> wrote:
> > > > >
> > > > > Platform devices are NUMA?  That's crazy, and feels like a total
> > > > > abuse of platform devices and drivers that really should belong
> > > > > on a
> > "real"
> > > > > bus.
> > > >
> > > > I am not sure if it is an abuse of platform device. But smmu is a
> > > > platform device, drivers/iommu/arm-smmu-v3.c is a platform driver.
> > > > In a typical ARM server, there are maybe multiple SMMU devices
> > > > which can support IO virtual address and page tables for other
> > > > devices on PCI-like busses.
> > > > Each different SMMU device might be close to different NUMA node.
> > > > There is really a hardware topology.
> > > >
> > > > If you have multiple CPU packages in a NUMA server, some platform
> > > > devices might Belong to CPU0, some other might belong to CPU1.
> > >
> > > Those devices are populated by acpi_iort for an ARM server:
> > >
> > > drivers/acpi/arm64/iort.c:
> > >
> > > static const struct iort_dev_config iort_arm_smmu_v3_cfg __initconst = {
> > > .name = "arm-smmu-v3",
> > > .dev_dma_configure = arm_smmu_v3_dma_configure,
> > > .dev_count_resources = arm_smmu_v3_count_resources,
> > > .dev_init_resources = arm_smmu_v3_init_resources,
> > > .dev_set_proximity = arm_smmu_v3_set_proximity, };
> > >
> > > void __init acpi_iort_init(void)
> > > {
> > > acpi_status status;
> > >
> > > status = acpi_get_table(ACPI_SIG_IORT, 0, &iort_table);
> > > ...
> > > iort_check_id_count_workaround(iort_table);
> > > iort_init_platform_devices(); }
> > >
> > > static void __init iort_init_platform_devices(void) {
> > > ...
> > >
> > > for (i = 0; i < iort->node_count; i++) {
> > > if (iort_node >= iort_end) {
> > > pr_err("iort node pointer overflows, bad
> > table\n");
> > > return;
> > > }
> > >
> > > iort_enable_acs(iort_node);
> > >
> > > ops = iort_get_dev_cfg(iort_node);
> > > if (ops) {
> > > fwnode = acpi_alloc_fwnode_static();
> > > if (!fwnode)
> > > return;
> > >
> > > iort_set_fwnode(iort_node, fwnode);
> > >
> > > ret = iort_add_platform_device(iort_node,
> ops);
> > > if (ret) {
> > > iort_delete_fwnode(iort_node);
> > > acpi_free_fwnode_static(fwnode);
> > > return;
> > > }
> > > }
> > >
> > > ...
> > > }
> > > ...
> > > }
> > >
> > > NUMA node is got from ACPI:
> > >
> > > static int  __init arm_smmu_v3_set_proximity(struct device *dev,
> > >   struct
> acpi_iort_node
> > > *node) {
> > > struct acpi_iort_smmu_v3 *smmu;
> > >
> > > smmu = (struct acpi_iort_smmu_v3 *)node->node_data;
> > > if (smmu->flags & ACPI_IORT_SMMU_V3_PXM_VALID) {
> > > int dev_node = acpi_map_pxm_to_node(smmu->pxm);
> > >
> > > ...
> > >
> > > set_dev_node(dev, dev_node);
> > > ...
> > > }
> > > return 0;
> > > }
> > >
> > > Barry
> >
> > That's fine, but those are "real" devices, not platform devices, right?
> >
> 
> Most platform devices are "real" memory-mapped hardware devices. For an
> embedded system, almost all "simple-bus"
> devices are populated from device trees as platform devices. Only a part of
> platform devices are not "real" hardware.
> 
> Smmu is a memory-mapped device. It is totally like most other platform
> devices populated in a memory space mapped in cpu's local space. It uses
> ioremap to map registers

RE: [PATCH] driver core: platform: expose numa_node to users in sysfs

2020-06-02 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Greg KH [mailto:gre...@linuxfoundation.org]
> Sent: Tuesday, June 2, 2020 6:11 PM
> To: Song Bao Hua (Barry Song) 
> Cc: raf...@kernel.org; io...@lists.linux-foundation.org;
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org; Linuxarm
> ; Zengtao (B) ; Robin
> Murphy 
> Subject: Re: [PATCH] driver core: platform: expose numa_node to users in sysfs
> 
> On Tue, Jun 02, 2020 at 05:09:57AM +, Song Bao Hua (Barry Song) wrote:
> > > >
> > > > Platform devices are NUMA?  That's crazy, and feels like a total
> > > > abuse of platform devices and drivers that really should belong on a
> "real"
> > > > bus.
> > >
> > > I am not sure if it is an abuse of platform device. But smmu is a
> > > platform device, drivers/iommu/arm-smmu-v3.c is a platform driver.
> > > In a typical ARM server, there are maybe multiple SMMU devices which
> > > can support IO virtual address and page tables for other devices on
> > > PCI-like busses.
> > > Each different SMMU device might be close to different NUMA node.
> > > There is really a hardware topology.
> > >
> > > If you have multiple CPU packages in a NUMA server, some platform
> > > devices might Belong to CPU0, some other might belong to CPU1.
> >
> > Those devices are populated by acpi_iort for an ARM server:
> >
> > drivers/acpi/arm64/iort.c:
> >
> > static const struct iort_dev_config iort_arm_smmu_v3_cfg __initconst = {
> > .name = "arm-smmu-v3",
> > .dev_dma_configure = arm_smmu_v3_dma_configure,
> > .dev_count_resources = arm_smmu_v3_count_resources,
> > .dev_init_resources = arm_smmu_v3_init_resources,
> > .dev_set_proximity = arm_smmu_v3_set_proximity, };
> >
> > void __init acpi_iort_init(void)
> > {
> > acpi_status status;
> >
> > status = acpi_get_table(ACPI_SIG_IORT, 0, &iort_table);
> > ...
> > iort_check_id_count_workaround(iort_table);
> > iort_init_platform_devices();
> > }
> >
> > static void __init iort_init_platform_devices(void) {
> > ...
> >
> > for (i = 0; i < iort->node_count; i++) {
> > if (iort_node >= iort_end) {
> > pr_err("iort node pointer overflows, bad
> table\n");
> > return;
> > }
> >
> > iort_enable_acs(iort_node);
> >
> > ops = iort_get_dev_cfg(iort_node);
> > if (ops) {
> > fwnode = acpi_alloc_fwnode_static();
> > if (!fwnode)
> > return;
> >
> > iort_set_fwnode(iort_node, fwnode);
> >
> > ret = iort_add_platform_device(iort_node, ops);
> > if (ret) {
> > iort_delete_fwnode(iort_node);
> > acpi_free_fwnode_static(fwnode);
> > return;
> > }
> > }
> >
> > ...
> > }
> > ...
> > }
> >
> > NUMA node is got from ACPI:
> >
> > static int  __init arm_smmu_v3_set_proximity(struct device *dev,
> >   struct acpi_iort_node
> > *node) {
> > struct acpi_iort_smmu_v3 *smmu;
> >
> > smmu = (struct acpi_iort_smmu_v3 *)node->node_data;
> > if (smmu->flags & ACPI_IORT_SMMU_V3_PXM_VALID) {
> > int dev_node = acpi_map_pxm_to_node(smmu->pxm);
> >
> > ...
> >
> > set_dev_node(dev, dev_node);
> > ...
> > }
> > return 0;
> > }
> >
> > Barry
> 
> That's fine, but those are "real" devices, not platform devices, right?
> 

Most platform devices are "real" memory-mapped hardware devices. On an
embedded system, almost all "simple-bus" devices are populated from device
trees as platform devices. Only a minority of platform devices are not "real"
hardware.

The SMMU is a memory-mapped device. Just like most other platform devices, its
registers live in a memory region mapped into the CPU's address space. The
driver uses ioremap() to map the registers and readl()/writel() to access them
(see the sketch just below).
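
To make the memory-mapped point concrete, here is a hedged, made-up sketch of
the usual platform-driver pattern (nothing below is taken from the SMMU
driver; all names are invented): grab the MEM resource, ioremap it, then talk
to the registers with readl()/writel().

#include <linux/module.h>
#include <linux/platform_device.h>
#include <linux/io.h>

static int demo_probe(struct platform_device *pdev)
{
	struct resource *res;
	void __iomem *base;
	u32 id;

	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
	base = devm_ioremap_resource(&pdev->dev, res);
	if (IS_ERR(base))
		return PTR_ERR(base);

	id = readl(base);	/* hypothetical ID register at offset 0 */
	dev_info(&pdev->dev, "probed, id=0x%x, numa node %d\n",
		 id, dev_to_node(&pdev->dev));
	return 0;
}

static struct platform_driver demo_driver = {
	.probe = demo_probe,
	.driver = { .name = "demo-mmio" },
};
module_platform_driver(demo_driver);
MODULE_LICENSE("GPL");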

> What platform device has this issue?  What one will show up this way with
> the new patch?

If platform devices were not supposed to be real hardware, then no platform
device would have a hardware topology. But platform devices are "real"
hardware most of the time. The SMMU is a "real" device, yet it is a platform
device in Linux.

> 
> thanks,
> 
> greg k-h

-barry



RE: [PATCH] driver core: platform: expose numa_node to users in sysfs

2020-06-01 Thread Song Bao Hua (Barry Song)
> >
> > Platform devices are NUMA?  That's crazy, and feels like a total abuse
> > of platform devices and drivers that really should belong on a "real"
> > bus.
> 
> I am not sure if it is an abuse of platform device. But smmu is a platform
> device,
> drivers/iommu/arm-smmu-v3.c is a platform driver.
> In a typical ARM server, there are maybe multiple SMMU devices which can
> support
> IO virtual address and page tables for other devices on PCI-like busses.
> Each different SMMU device might be close to different NUMA node. There is
> really a hardware topology.
> 
> If you have multiple CPU packages in a NUMA server, some platform devices
> might
> Belong to CPU0, some other might belong to CPU1.

Those devices are populated by acpi_iort for an ARM server:

drivers/acpi/arm64/iort.c:

static const struct iort_dev_config iort_arm_smmu_v3_cfg __initconst = {
.name = "arm-smmu-v3",
.dev_dma_configure = arm_smmu_v3_dma_configure,
.dev_count_resources = arm_smmu_v3_count_resources,
.dev_init_resources = arm_smmu_v3_init_resources,
.dev_set_proximity = arm_smmu_v3_set_proximity,
};

void __init acpi_iort_init(void)
{
acpi_status status;

status = acpi_get_table(ACPI_SIG_IORT, 0, &iort_table);
...
iort_check_id_count_workaround(iort_table);
iort_init_platform_devices();
}

static void __init iort_init_platform_devices(void)
{
...

for (i = 0; i < iort->node_count; i++) {
if (iort_node >= iort_end) {
pr_err("iort node pointer overflows, bad table\n");
return;
}

iort_enable_acs(iort_node);

ops = iort_get_dev_cfg(iort_node);
if (ops) {
fwnode = acpi_alloc_fwnode_static();
if (!fwnode)
return;

iort_set_fwnode(iort_node, fwnode);

ret = iort_add_platform_device(iort_node, ops);
if (ret) {
iort_delete_fwnode(iort_node);
acpi_free_fwnode_static(fwnode);
return;
}
}

...
}
...
}

NUMA node is got from ACPI:

static int  __init arm_smmu_v3_set_proximity(struct device *dev,
  struct acpi_iort_node *node)
{
struct acpi_iort_smmu_v3 *smmu;

smmu = (struct acpi_iort_smmu_v3 *)node->node_data;
if (smmu->flags & ACPI_IORT_SMMU_V3_PXM_VALID) {
int dev_node = acpi_map_pxm_to_node(smmu->pxm);

...

set_dev_node(dev, dev_node);
...
}
return 0;
}

Barry




RE: [PATCH] driver core: platform: expose numa_node to users in sysfs

2020-06-01 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: Greg KH [mailto:gre...@linuxfoundation.org]
> Sent: Tuesday, June 2, 2020 4:24 PM
> To: Song Bao Hua (Barry Song) 
> Cc: raf...@kernel.org; io...@lists.linux-foundation.org;
> linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org; Linuxarm
> ; Zengtao (B) ; Robin
> Murphy 
> Subject: Re: [PATCH] driver core: platform: expose numa_node to users in sysfs
> 
> On Tue, Jun 02, 2020 at 03:01:39PM +1200, Barry Song wrote:
> > For some platform devices like iommu, particually ARM smmu, users may
> > care about the numa locality. for example, if threads and drivers run
> > near iommu, they may get much better performance on dma_unmap_sg.
> > For other platform devices, users may still want to know the hardware
> > topology.
> >
> > Cc: Prime Zeng 
> > Cc: Robin Murphy 
> > Signed-off-by: Barry Song 
> > ---
> >  drivers/base/platform.c | 26 +-
> >  1 file changed, 25 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> > index b27d0f6c18c9..7794b9a38d82 100644
> > --- a/drivers/base/platform.c
> > +++ b/drivers/base/platform.c
> > @@ -1062,13 +1062,37 @@ static ssize_t driver_override_show(struct
> device *dev,
> >  }
> >  static DEVICE_ATTR_RW(driver_override);
> >
> > +static ssize_t numa_node_show(struct device *dev,
> > +   struct device_attribute *attr, char *buf)
> > +{
> > +   return sprintf(buf, "%d\n", dev_to_node(dev));
> > +}
> > +static DEVICE_ATTR_RO(numa_node);
> > +
> > +static umode_t platform_dev_attrs_visible(struct kobject *kobj, struct
> attribute *a,
> > +   int n)
> > +{
> > +   struct device *dev = container_of(kobj, typeof(*dev), kobj);
> > +
> > +   if (a == &dev_attr_numa_node.attr &&
> > +   dev_to_node(dev) == NUMA_NO_NODE)
> > +   return 0;
> > +
> > +   return a->mode;
> > +}
> >
> >  static struct attribute *platform_dev_attrs[] = {
> > &dev_attr_modalias.attr,
> > +   &dev_attr_numa_node.attr,
> > &dev_attr_driver_override.attr,
> > NULL,
> >  };
> > -ATTRIBUTE_GROUPS(platform_dev);
> > +
> > +static struct attribute_group platform_dev_group = {
> > +   .attrs = platform_dev_attrs,
> > +   .is_visible = platform_dev_attrs_visible,
> > +};
> > +__ATTRIBUTE_GROUPS(platform_dev);
> >
> >  static int platform_uevent(struct device *dev, struct kobj_uevent_env *env)
> >  {
> 
> Platform devices are NUMA?  That's crazy, and feels like a total abuse
> of platform devices and drivers that really should belong on a "real"
> bus.

I am not sure it is an abuse of platform devices. The SMMU is a platform
device, and drivers/iommu/arm-smmu-v3.c is a platform driver.
In a typical ARM server there may be multiple SMMU devices, which provide
IO virtual addresses and page tables for other devices on PCI-like busses.
Each SMMU device might be close to a different NUMA node, so there really is
a hardware topology.

If you have multiple CPU packages in a NUMA server, some platform devices
might belong to CPU0 while others belong to CPU1.
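
As a usage note (my own sketch, not part of the patch): once the attribute is
exposed, userspace can read the locality straight from sysfs. The device name
below is hypothetical.

#include <stdio.h>

int main(void)
{
	/* hypothetical SMMU platform device name */
	const char *path =
		"/sys/bus/platform/devices/arm-smmu-v3.0.auto/numa_node";
	FILE *f = fopen(path, "r");
	int node;

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%d", &node) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("device sits on NUMA node %d\n", node);
	return 0;
}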

-barry

> 
> What drivers in the kernel today have this issue?
> 
> thanks,
> 
> greg k-h




RE: [PATCH 0/3] arm64: perf: Add support for Perf NMI interrupts

2020-05-20 Thread Song Bao Hua (Barry Song)


> -Original Message-
> From: linux-arm-kernel [mailto:linux-arm-kernel-boun...@lists.infradead.org]
> On Behalf Of Alexandru Elisei
> Sent: Wednesday, May 20, 2020 10:31 PM
> Hi,
> 
> On 5/18/20 12:17 PM, Alexandru Elisei wrote:
> > Hi,
> >
> > On 5/18/20 11:45 AM, Mark Rutland wrote:
> >> Hi all,
> >>
> >> On Mon, May 18, 2020 at 02:26:00PM +0800, Lecopzer Chen wrote:
> >>> Hi Sumit,
> >>>
> >>> Thanks for your information.
> >>>
> >>> I've already implemented IPI (same as you did [1], little difference
> >>> in detail), hardlockup detector and perf in last year(2019) for
> >>> debuggability.
> >>> And now we tend to upstream to reduce kernel maintaining effort.
> >>> I'm glad if someone in ARM can do this work :)
> >>>
> >>> Hi Julien,
> >>>
> >>> Does any Arm maintainers can proceed this action?
> >> Alexandru (Cc'd) has been rebasing and reworking Julien's patches,
> >> which is my preferred approach.
> >>
> >> I understand that's not quite ready for posting since he's
> >> investigating some of the nastier subtleties (e.g. mutual exclusion
> >> with the NMI), but maybe we can put the work-in-progress patches
> >> somewhere in the mean time.
> >>
> >> Alexandru, do you have an idea of what needs to be done, and/or when
> >> you expect you could post that?
> > I'm currently working on rebasing the patches on top of 5.7-rc5, when
> > I have something usable I'll post a link (should be a couple of days).
> > After that I will address the review comments, and I plan to do a
> > thorough testing because I'm not 100% confident that some of the
> > assumptions around the locks that were removed are correct. My guess is
> this will take a few weeks.
> 
> Pushed a WIP branch on linux-arm.org [1]:
> 
> git clone -b WIP-pmu-nmi git://linux-arm.org/linux-ae
> 
> Practically untested, I only did perf record on a defconfig kernel running on 
> the
> model.
> 
> [1]
> http://www.linux-arm.org/git?p=linux-ae.git;a=shortlog;h=refs/heads/WIP-pm
> u-nmi

Fortunately, it does work. I used this tree to perf annotate
arm_smmu_cmdq_issue_cmdlist(), which runs with IRQs completely disabled.
Luckily, it reports correct data. Before that, all the time was reported
against the code which re-enabled IRQs.


Barry

> 
> Thanks,
> Alex
> >
> > Thanks,
> > Alex
> >> Thanks,
> >> Mark.
> >>
> >>> This is really useful in debugging.
> >>> Thank you!!
> >>>
> >>>
> >>>
> >>> [1] https://lkml.org/lkml/2020/4/24/328
> >>>
> >>>
> >>> Lecopzer
> >>>
> >>> Sumit Garg  wrote on Monday, 18 May 2020 at 1:46 PM:
>  + Julien
> 
>  Hi Lecopzer,
> 
>  On Sat, 16 May 2020 at 18:20, Lecopzer Chen 
> wrote:
> > These series implement Perf NMI functionality and depend on
> > Pseudo NMI [1] which has been upstreamed.
> >
> > In arm64 with GICv3, Pseudo NMI was implemented for NMI-like
> interrupts.
> > That can be extended to Perf NMI which is the prerequisite for
> > hard-lockup detector which had already a standard interface inside
> Linux.
> >
> > Thus the first step we need to implement perf NMI interface and
> > make sure it works fine.
> >
>  This is something that is already implemented via Julien's
>  patch-set [1]. Its v4 has been floating since July, 2019 and I
>  couldn't find any major blocking comments but not sure why things
>  haven't progressed further.
> 
>  Maybe Julien or Arm maintainers can provide updates on existing
>  patch-set [1] and how we should proceed further with this
>  interesting feature.
> 
>  And regarding hard-lockup detection, I have been able to enable it
>  based on perf NMI events using Julien's perf patch-set [1]. Have a
>  look at the patch here [2].
> 
>  [1] https://patchwork.kernel.org/cover/11047407/
>  [2]
>  http://lists.infradead.org/pipermail/linux-arm-kernel/2020-May/7322
>  27.html
> 
>  -Sumit
> 
> > Perf NMI has been test by dd if=/dev/urandom of=/dev/null like the
> > link [2] did.
> >
> > [1] https://lkml.org/lkml/2019/1/31/535
> > [2] https://www.linaro.org/blog/debugging-arm-kernels-using-nmifiq
> >
> >
> > Lecopzer Chen (3):
> >   arm_pmu: Add support for perf NMI interrupts registration
> >   arm64: perf: Support NMI context for perf event ISR
> >   arm64: Kconfig: Add support for the Perf NMI
> >
> >  arch/arm64/Kconfig | 10 +++
> >  arch/arm64/kernel/perf_event.c | 36 ++--
> >  drivers/perf/arm_pmu.c | 51
> ++
> >  include/linux/perf/arm_pmu.h   |  6 
> >  4 files changed, 88 insertions(+), 15 deletions(-)
> >
> > --
> > 2.25.1


