from:"John"

Hi Geert,

On Wed, 2024-05-01 at 11:12 +0200, John Paul Adrian Glaubitz wrote:
> Hi Geert,
> 
> On Fri, 2024-03-01 at 22:02 +0100, Geert Uytterhoeven wrote:
> > dma_extend(), get_dma_info_by_name(), register_chan_caps(), and
> > request_dma_bycap() are unused.  Remove them, and all related code.
> > 
> > Signed-off-by: Geert Uytterhoeven 
> > ---
> >  arch/sh/drivers/dma/dma-api.c | 116 --
> >  arch/sh/include/asm/dma.h |   7 --
> >  2 files changed, 123 deletions(-)
> > 
> > diff --git a/arch/sh/drivers/dma/dma-api.c b/arch/sh/drivers/dma/dma-api.c
> > index f49097fa634c36d4..87e5a892887360f5 100644
> > --- a/arch/sh/drivers/dma/dma-api.c
> > +++ b/arch/sh/drivers/dma/dma-api.c
> > @@ -41,21 +41,6 @@ struct dma_info *get_dma_info(unsigned int chan)
> >  }
> >  EXPORT_SYMBOL(get_dma_info);
> >  
> > -struct dma_info *get_dma_info_by_name(const char *dmac_name)
> > -{
> > -   struct dma_info *info;
> > -
> > -   list_for_each_entry(info, _dmac_list, list) {
> > -   if (dmac_name && (strcmp(dmac_name, info->name) != 0))
> > -   continue;
> > -   else
> > -   return info;
> > -   }
> > -
> > -   return NULL;
> > -}
> > -EXPORT_SYMBOL(get_dma_info_by_name);
> > -
> >  static unsigned int get_nr_channels(void)
> >  {
> > struct dma_info *info;
> > @@ -101,66 +86,6 @@ int get_dma_residue(unsigned int chan)
> >  }
> >  EXPORT_SYMBOL(get_dma_residue);
> >  
> > -static int search_cap(const char **haystack, const char *needle)
> > -{
> > -   const char **p;
> > -
> > -   for (p = haystack; *p; p++)
> > -   if (strcmp(*p, needle) == 0)
> > -   return 1;
> > -
> > -   return 0;
> > -}
> > -
> > -/**
> > - * request_dma_bycap - Allocate a DMA channel based on its capabilities
> > - * @dmac: List of DMA controllers to search
> > - * @caps: List of capabilities
> > - *
> > - * Search all channels of all DMA controllers to find a channel which
> > - * matches the requested capabilities. The result is the channel
> > - * number if a match is found, or %-ENODEV if no match is found.
> > - *
> > - * Note that not all DMA controllers export capabilities, in which
> > - * case they can never be allocated using this API, and so
> > - * request_dma() must be used specifying the channel number.
> > - */
> > -int request_dma_bycap(const char **dmac, const char **caps, const char 
> > *dev_id)
> > -{
> > -   unsigned int found = 0;
> > -   struct dma_info *info;
> > -   const char **p;
> > -   int i;
> > -
> > -   BUG_ON(!dmac || !caps);
> > -
> > -   list_for_each_entry(info, _dmac_list, list)
> > -   if (strcmp(*dmac, info->name) == 0) {
> > -   found = 1;
> > -   break;
> > -   }
> > -
> > -   if (!found)
> > -   return -ENODEV;
> > -
> > -   for (i = 0; i < info->nr_channels; i++) {
> > -   struct dma_channel *channel = >channels[i];
> > -
> > -   if (unlikely(!channel->caps))
> > -   continue;
> > -
> > -   for (p = caps; *p; p++) {
> > -   if (!search_cap(channel->caps, *p))
> > -   break;
> > -   if (request_dma(channel->chan, dev_id) == 0)
> > -   return channel->chan;
> > -   }
> > -   }
> > -
> > -   return -EINVAL;
> > -}
> > -EXPORT_SYMBOL(request_dma_bycap);
> > -
> >  int request_dma(unsigned int chan, const char *dev_id)
> >  {
> > struct dma_channel *channel = { 0 };
> > @@ -213,35 +138,6 @@ void dma_wait_for_completion(unsigned int chan)
> >  }
> >  EXPORT_SYMBOL(dma_wait_for_completion);
> >  
> > -int register_chan_caps(const char *dmac, struct dma_chan_caps *caps)
> > -{
> > -   struct dma_info *info;
> > -   unsigned int found = 0;
> > -   int i;
> > -
> > -   list_for_each_entry(info, _dmac_list, list)
> > -   if (strcmp(dmac, info->name) == 0) {
> > -   found = 1;
> > -   break;
> > -   }
> > -
> > -   if (unlikely(!found))
> > -   return -ENODEV;
> > -
> > -   for (i = 0; i < info->nr_channels; i++, caps++) {
> > -   struct dma_channel *channel;
> > -
> >

Re: [PATCH 00/20] sh: Fix missing prototypes

Hi Geert,

On Fri, 2024-03-01 at 22:02 +0100, Geert Uytterhoeven wrote:
>   Hi all,
> 
> This patch series fixes several "no previous prototype for "
> warnings when building a kernel for SuperH.
> 
> Known issues:
>   - The various warnings about cache functions are not yet fixed, but
> I didn't want to hold off the rest of this series,
>   - sdk7786_defconfig needs "[PATCH/RFC] locking/spinlocks: Make __raw_*
> lock ops static" [1],
>   - Probably there are more warnings to fix, I didn't build all
> defconfigs.
> 
> This has been boot-tested on landisk and on qemu/rts7751r2d.
> 
> Thanks for your comments!
> 
> [1] 
> https://lore.kernel.org/linux-sh/c395b02613572131568bc1fd1bc456d20d1a5426.1709325647.git.geert+rene...@glider.be
> 
> Geert Uytterhoeven (20):
>   sh: pgtable: Fix missing prototypes
>   sh: fpu: Add missing forward declarations
>   sh: syscall: Add missing forward declaration for sys_cacheflush()
>   sh: tlb: Add missing forward declaration for handle_tlbmiss()
>   sh: return_address: Add missing #include 
>   sh: traps: Add missing #include 
>   sh: hw_breakpoint: Add missing forward declaration for
> arch_bp_generic_fields()
>   sh: boot: Add proper forward declarations
>   sh: ftrace: Fix missing prototypes
>   sh: nommu: Add missing #include 
>   sh: math-emu: Add missing #include 
>   sh: dma: Remove unused dmac_search_free_channel()
>   sh: sh2a: Add missing #include 
>   sh: sh7786: Remove unused sh7786_usb_use_exclock()
>   sh: smp: Fix missing prototypes
>   sh: kprobes: Merge arch_copy_kprobe() into arch_prepare_kprobe()
>   sh: kprobes: Make trampoline_probe_handler() static
>   sh: kprobes: Remove unneeded kprobe_opcode_t casts
>   sh: dwarf: Make dwarf_lookup_fde() static
>   [RFC] sh: dma: Remove unused functionality
> 
>  arch/sh/boot/compressed/cache.c |   3 +
>  arch/sh/boot/compressed/cache.h |  10 ++
>  arch/sh/boot/compressed/misc.c  |   8 +-
>  arch/sh/boot/compressed/misc.h  |   9 ++
>  arch/sh/drivers/dma/dma-api.c   | 143 
>  arch/sh/include/asm/dma.h   |   7 --
>  arch/sh/include/asm/fpu.h   |   3 +
>  arch/sh/include/asm/ftrace.h|  10 ++
>  arch/sh/include/asm/hw_breakpoint.h |   2 +
>  arch/sh/include/asm/syscalls.h  |   1 +
>  arch/sh/include/asm/tlb.h   |   4 +
>  arch/sh/kernel/cpu/sh2a/opcode_helper.c |   2 +
>  arch/sh/kernel/cpu/sh4a/setup-sh7786.c  |  14 ---
>  arch/sh/kernel/dwarf.c  |   2 +-
>  arch/sh/kernel/kprobes.c|  13 +--
>  arch/sh/kernel/return_address.c |   2 +
>  arch/sh/kernel/smp.c|   4 +-
>  arch/sh/kernel/traps.c  |  10 +-
>  arch/sh/kernel/traps_32.c   |   1 +
>  arch/sh/math-emu/math.c |   2 +
>  arch/sh/mm/nommu.c  |   2 +
>  arch/sh/mm/pgtable.c|   4 +-
>  arch/sh/mm/tlbex_32.c   |   1 +
>  23 files changed, 68 insertions(+), 189 deletions(-)
>  create mode 100644 arch/sh/boot/compressed/cache.h
>  create mode 100644 arch/sh/boot/compressed/misc.h
> 

For the whole series:

Reviewed-by: John Paul Adrian Glaubitz 

I would still like to get feedback from Yoshinori on patch #20 though, i.e.

"sh: dma: Remove unused functionality".

On the other hand, we could just merge this series and re-add the functions
later if we decide they're still needed.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

Re: [PATCH 20/20] [RFC] sh: dma: Remove unused functionality

han, unsigned long op, void *param)
> -{
> - struct dma_info *info = get_dma_info(chan);
> - struct dma_channel *channel = get_dma_channel(chan);
> -
> - if (info->ops->extend)
> - return info->ops->extend(channel, op, param);
> -
> - return -ENOSYS;
> -}
> -EXPORT_SYMBOL(dma_extend);
> -
>  static int dma_proc_show(struct seq_file *m, void *v)
>  {
>   struct dma_info *info = v;
> diff --git a/arch/sh/include/asm/dma.h b/arch/sh/include/asm/dma.h
> index c8bee3f985a29393..6b6d409956d17f09 100644
> --- a/arch/sh/include/asm/dma.h
> +++ b/arch/sh/include/asm/dma.h
> @@ -56,7 +56,6 @@ struct dma_ops {
>   int (*get_residue)(struct dma_channel *chan);
>   int (*xfer)(struct dma_channel *chan);
>   int (*configure)(struct dma_channel *chan, unsigned long flags);
> - int (*extend)(struct dma_channel *chan, unsigned long op, void *param);
>  };
>  
>  struct dma_channel {
> @@ -118,8 +117,6 @@ extern int dma_xfer(unsigned int chan, unsigned long from,
>  #define dma_read_page(chan, from, to)\
>   dma_read(chan, from, to, PAGE_SIZE)
>  
> -extern int request_dma_bycap(const char **dmac, const char **caps,
> -  const char *dev_id);
>  extern int get_dma_residue(unsigned int chan);
>  extern struct dma_info *get_dma_info(unsigned int chan);
>  extern struct dma_channel *get_dma_channel(unsigned int chan);
> @@ -128,10 +125,6 @@ extern void dma_configure_channel(unsigned int chan, 
> unsigned long flags);
>  
>  extern int register_dmac(struct dma_info *info);
>  extern void unregister_dmac(struct dma_info *info);
> -extern struct dma_info *get_dma_info_by_name(const char *dmac_name);
> -
> -extern int dma_extend(unsigned int chan, unsigned long op, void *param);
> -extern int register_chan_caps(const char *dmac, struct dma_chan_caps 
> *capslist);
>  
>  /* arch/sh/drivers/dma/dma-sysfs.c */
>  extern int dma_create_sysfs_files(struct dma_channel *, struct dma_info *);

I assume we could re-add these again in case we need them, but it would be good
if Yoshinori could comment on whether we should keep these functions or not.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

Re: [PATCH 12/20] sh: dma: Remove unused dmac_search_free_channel()

Hi Geert,

On Fri, 2024-03-01 at 22:02 +0100, Geert Uytterhoeven wrote:
> arch/sh/drivers/dma/dma-api.c:164:5: warning: no previous prototype for 
> 'dmac_search_free_channel' [-Wmissing-prototypes]
> 
> dmac_search_free_channel() never had a user in upstream, remove it.
> 
> Signed-off-by: Geert Uytterhoeven 
> ---
> dma_extend(), get_dma_info_by_name(), register_chan_caps(), and
> request_dma_bycap() are also unused, but don't trigger warnings
> ---

I assume the other functions didn't trigger a warning because their symbols
were exported. Correct me if I'm wrong.

Adrian

>  arch/sh/drivers/dma/dma-api.c | 27 ---
>  1 file changed, 27 deletions(-)
> 
> diff --git a/arch/sh/drivers/dma/dma-api.c b/arch/sh/drivers/dma/dma-api.c
> index 89cd4a3b4ccafbe2..f49097fa634c36d4 100644
> --- a/arch/sh/drivers/dma/dma-api.c
> +++ b/arch/sh/drivers/dma/dma-api.c
> @@ -161,33 +161,6 @@ int request_dma_bycap(const char **dmac, const char 
> **caps, const char *dev_id)
>  }
>  EXPORT_SYMBOL(request_dma_bycap);
>  
> -int dmac_search_free_channel(const char *dev_id)
> -{
> - struct dma_channel *channel = { 0 };
> - struct dma_info *info = get_dma_info(0);
> - int i;
> -
> - for (i = 0; i < info->nr_channels; i++) {
> - channel = >channels[i];
> - if (unlikely(!channel))
> - return -ENODEV;
> -
> - if (atomic_read(>busy) == 0)
> - break;
> - }
> -
> - if (info->ops->request) {
> - int result = info->ops->request(channel);
> - if (result)
> - return result;
> -
> - atomic_set(>busy, 1);
> - return channel->chan;
> - }
> -
> - return -ENOSYS;
> -}
> -
>  int request_dma(unsigned int chan, const char *dev_id)
>  {
>   struct dma_channel *channel = { 0 };

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

RE: [PATCH bpf-next] rethook: Remove warning messages printed for finding return address of a frame.

2024-04-03 Thread John Fastabend

Kui-Feng Lee wrote:
> rethook_find_ret_addr() prints a warning message and returns 0 when the
> target task is running and not the "current" task to prevent returning an
> incorrect return address. However, this check is incomplete as the target
> task can still transition to the running state when finding the return
> address, although it is safe with RCU.
> 
> The issue we encounter is that the kernel frequently prints warning
> messages when BPF profiling programs call to bpf_get_task_stack() on
> running tasks.
> 
> The callers should be aware and willing to take the risk of receiving an
> incorrect return address from a task that is currently running other than
> the "current" one. A warning is not needed here as the callers are intent
> on it.
> 
> Signed-off-by: Kui-Feng Lee 
> ---
>  kernel/trace/rethook.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> index fa03094e9e69..4297a132a7ae 100644
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
> @@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct 
> *tsk, unsigned long frame
>   if (WARN_ON_ONCE(!cur))
>   return 0;
>  
> - if (WARN_ON_ONCE(tsk != current && task_is_running(tsk)))
> +     if (tsk != current && task_is_running(tsk))
>   return 0;
>  
>   do {
> -- 
> 2.34.1
> 
> 

Acked-by: John Fastabend

Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint

2024-02-26 Thread John Fastabend

(for the same reason an H/W nic should prevent XDP from
> > > > > modifying its own buffer descriptor).
> > > >
> > > > Thank you for highlighting this concern. The header layout differs
> > > > slightly between small and mergeable mode. Taking 'mergeable mode' as
> > > > an example, after calling xdp_prepare_buff the layout of xdp_buff
> > > > would be as depicted in the diagram below,
> > > >
> > > >   buf
> > > >|
> > > >v
> > > > +--+--+-+
> > > > | xdp headroom | virtio header| packet  |
> > > > | (256 bytes)  | (20 bytes)   | content |
> > > > +--+--+-+
> > > > ^ ^
> > > > | |
> > > >  data_hard_startdata
> > > >   data_meta
> > > >
> > > > If 'bpf_xdp_adjust_meta' repositions the 'data_meta' pointer a little
> > > > towards 'data_hard_start', it would point to the inline header, thus
> > > > potentially allowing the XDP program to access the inline header.

Fairly late to the thread sorry. Given above layout does it make sense to
just delay extraction to the kfunc as suggested above? Sure the XDP program
could smash the entry in virtio header, but this is already the case for
anything else there. A program writing over the virtio header is likely
buggy anyways. Worse that might happen is bad rss values and mappings?

I like seeing more use cases for the hints though.

Thanks!
John

[PATCH] tracing: Use init_utsname()->release

2024-02-22 Thread John Garry

Instead of using UTS_RELEASE, use init_utsname()->release, which means that
we don't need to rebuild the code just for the git head commit changing.

Signed-off-by: John Garry 
---
I originally sent an RFC using new string uts_release, but that
string is not needed as we can use init_utsname()->release instead.

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8198bfc54b58..1416bf72f3f4 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -13,7 +13,7 @@
  *  Copyright (C) 2004 Nadia Yvette Chambers
  */
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -4368,7 +4368,7 @@ print_trace_header(struct seq_file *m, struct 
trace_iterator *iter)
get_total_entries(buf, , );
 
seq_printf(m, "# %s latency trace v1.1.5 on %s\n",
-  name, UTS_RELEASE);
+  name, init_utsname()->release);
seq_puts(m, "# ---"
 "-\n");
seq_printf(m, "# latency: %lu us, #%lu/%lu, CPU#%d |"
-- 
2.31.1

Re: [PATCH RFC 0/4] Introduce uts_release

2024-02-21 Thread John Garry


On 08/02/2024 10:08, John Garry wrote:

On 05/02/2024 23:10, Masahiro Yamada wrote:

I think what you can contribute are:

   - Explore the UTS_RELEASE users, and check if you can get rid of it.

Unfortunately I expect resistance for this. I also expect places like FW
loader it is necessary. And when this is used in sysfs, people will say
that it is part of the ABI now.

How about I send the patch to update to use init_uts_ns and mention also
that it would be better to not use at all, if possible? I can cc you.


OK.


As I mentioned in the previous reply, the replacement is safe
for builtin code.

When you touch modular code, please pay a little more care,
because UTS_RELEASE and init_utsname()->release
may differ when CONFIG_MODVERSIONS=y.



Are you saying that we may have a different release version kernel and 
module built with CONFIG_MODVERSIONS=y, and the module was using 
UTS_RELEASE for something? That something may be like setting some info 
in a sysfs file, like in this example:


static ssize_t target_core_item_version_show(struct config_item *item,
     char *page)
{
 return sprintf(page, "Target Engine Core ConfigFS Infrastructure %s"
     " on %s/%s on "UTS_RELEASE"\n", TARGET_CORE_VERSION,
     utsname()->sysname, utsname()->machine);
}

And the intention is to use the module codebase release version and not 
the kernel codebase release version. Hence utsname() is used for 
.sysname and .machine, but not .release .


Hi Masahiro,

Can you comment on whether I am right about CONFIG_MODVERSIONS, above?

Thanks,
John

Re: [PATCH RFC 0/4] Introduce uts_release

2024-02-08 Thread John Garry


On 05/02/2024 23:10, Masahiro Yamada wrote:

I think what you can contribute are:

   - Explore the UTS_RELEASE users, and check if you can get rid of it.

Unfortunately I expect resistance for this. I also expect places like FW
loader it is necessary. And when this is used in sysfs, people will say
that it is part of the ABI now.

How about I send the patch to update to use init_uts_ns and mention also
that it would be better to not use at all, if possible? I can cc you.


OK.


As I mentioned in the previous reply, the replacement is safe
for builtin code.

When you touch modular code, please pay a little more care,
because UTS_RELEASE and init_utsname()->release
may differ when CONFIG_MODVERSIONS=y.



Are you saying that we may have a different release version kernel and 
module built with CONFIG_MODVERSIONS=y, and the module was using 
UTS_RELEASE for something? That something may be like setting some info 
in a sysfs file, like in this example:


static ssize_t target_core_item_version_show(struct config_item *item,
char *page)
{
return sprintf(page, "Target Engine Core ConfigFS Infrastructure %s"
" on %s/%s on "UTS_RELEASE"\n", TARGET_CORE_VERSION,
utsname()->sysname, utsname()->machine);
}

And the intention is to use the module codebase release version and not 
the kernel codebase release version. Hence utsname() is used for 
.sysname and .machine, but not .release .


Thanks,
John

Re: [PATCH RFC 0/4] Introduce uts_release

2024-02-05 Thread John Garry


On 02/02/2024 15:01, Masahiro Yamada wrote:

--
2.35.3


As you see, several drivers store UTS_RELEASE in their driver data,
and even print it in debug print.


I do not see why it is useful.


I would tend to agree, and mentioned that earlier.


As you discussed in 3/4, if UTS_RELEASE is unneeded,
it is better to get rid of it.


Jakub replied about this.




If such version information is useful for drivers, the intention is
whether the version of the module, or the version of vmlinux.
That is a question.
They differ when CONFIG_MODVERSION.



I think often this information in UTS_RELEASE is shared as informative 
only, so the user can conveniently know the specific kernel git version.




When module developers intend to printk the git version
from which the module was compiled from,
presumably they want to use UTS_RELEASE, which
was expanded at the compile time of the module.

If you replace it with uts_release, it is the git version
of vmlinux.


Of course, the replacement is safe for always-builtin code.



Lastly, we can avoid using UTS_RELEASE without relying
on your patch.



For example, commit 3a3a11e6e5a2bc0595c7e36ae33c861c9e8c75b1
replaced  UTS_RELEASE with init_uts_ns.name.release


So, is your uts_release a shorthand of init_uts_ns.name.release?


Yes - well that both are strings containing UTS_RELEASE. Using a struct 
sub-member is bit ungainly, but I suppose that we should not be making 
life easy for people using this.


However we already have init_utsname in:

static inline struct new_utsname *init_utsname(void)
{
return _uts_ns.name;
}

So could use init_utsname()->release, which is a bit nicer.





I think what you can contribute are:

  - Explore the UTS_RELEASE users, and check if you can get rid of it.


Unfortunately I expect resistance for this. I also expect places like FW 
loader it is necessary. And when this is used in sysfs, people will say 
that it is part of the ABI now.


How about I send the patch to update to use init_uts_ns and mention also 
that it would be better to not use at all, if possible? I can cc you.




  - Where UTS_RELEASE is useful, consider if it is possible
to replace it with init_uts_ns.name.release


ok, but, as above, could use init_utsname()->release also

Thanks,
John

Re: [PATCH RFC 3/4] net: ethtool: Use uts_release

2024-02-01 Thread John Garry


On 01/02/2024 16:09, Jakub Kicinski wrote:

On Thu, 1 Feb 2024 14:20:23 +0100 Jiri Pirko wrote:

BTW, I assume that changes like this are also ok:

8<-

   net: team: Don't bother filling in ethtool driver version


Yup, just to be clear - you can send this independently from the series,


Sure, and I think rocker and staging/octeon also have this unnecessary 
code also.



tag is as

  [PATCH net-next]


ah, yes



we'll take it via the networking tree.


Thanks. I assume Greg - the staging maintainer - would take the octeon 
patch.



I'm not sure which tree the other
patches will go thru..


I think that the best thing to do is get a minimal set in for 6.9 and 
then merge the rest in the next cycle. I've got about 22 patches in 
total now, but I think that there will be more. We'll see who can pick 
up the first set when I send it officially.


Thanks,
John

Re: [PATCH RFC 3/4] net: ethtool: Use uts_release

2024-02-01 Thread John Garry


On 31/01/2024 19:24, Jakub Kicinski wrote:

On Wed, 31 Jan 2024 10:48:50 + John Garry wrote:

Instead of using UTS_RELEASE, use uts_release, which means that we don't
need to rebuild the code just for the git head commit changing.

Signed-off-by: John Garry

Yes, please!

Acked-by: Jakub Kicinski


Cheers

BTW, I assume that changes like this are also ok:

8<-

   net: team: Don't bother filling in ethtool driver version

   The version is same as the default, as don't bother.

   Signed-off-by: John Garry 

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index f575f225d417..0a44bbdcfb7b 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -25,7 +25,6 @@
#include 
#include 
#include 
-#include 
#include 

#define DRV_NAME "team"
@@ -2074,7 +2073,6 @@ static void team_ethtool_get_drvinfo(struct
net_device *dev,
struct ethtool_drvinfo *drvinfo)
{
   strscpy(drvinfo->driver, DRV_NAME, sizeof(drvinfo->driver));
-   strscpy(drvinfo->version, UTS_RELEASE, sizeof(drvinfo->version));
}

>8-

right?

John

Re: [PATCH RFC 0/4] Introduce uts_release


On 31/01/2024 16:22, Greg KH wrote:

before:
real0m53.591s
user1m1.842s
sys 0m9.161s

after:
real0m37.481s
user0m46.461s
sys 0m7.199s

Sending as an RFC as I need to test more of the conversions and I would
like to also convert more UTS_RELEASE users to prove this is proper
approach.

I like it, I also think that v4l2 includes this as well as all of those
drivers seem to rebuild when this changes, does that not happen for you
too?


I didn't see that. Were you were building for arm64? I can see some v4l2 
configs enabled there for the vanilla defconfig (but none for x86-64).




Anyway, if the firmware changes work, I'm all for this, thanks for
taking it on!


cheers. BTW, I just noticed that utsrelase.h might not be updated for my 
allmodconfig build when I change the head commit - I'll investigate why ...


John

[PATCH RFC 1/4] init: Add uts_release

Add a char [] for UTS_RELEASE so that we don't need to rebuild code which
references UTS_RELEASE.

Signed-off-by: John Garry 
---
 include/linux/utsname.h | 1 +
 init/version.c  | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index bf7613ba412b..15b0b1c9a9ee 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -88,5 +88,6 @@ static inline struct new_utsname *init_utsname(void)
 }
 
 extern struct rw_semaphore uts_sem;
+extern const char uts_release[];
 
 #endif /* _LINUX_UTSNAME_H */
diff --git a/init/version.c b/init/version.c
index 94c96f6fbfe6..87fecdd4fbfb 100644
--- a/init/version.c
+++ b/init/version.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static int __init early_hostname(char *arg)
 {
@@ -48,6 +49,8 @@ BUILD_LTO_INFO;
 
 struct uts_namespace init_uts_ns __weak;
 const char linux_banner[] __weak;
+const char uts_release[] = UTS_RELEASE;
+EXPORT_SYMBOL_GPL(uts_release);
 
 #include "version-timestamp.c"
 
-- 
2.35.3

[PATCH RFC 0/4] Introduce uts_release

When hacking it is a waste of time and compute energy that we need to
rebuild much kernel code just for changing the head git commit, like this:

> touch include/generated/utsrelease.h 
> time make  -j3
mkdir -p /home/john/mnt_sda4/john/kernel-dev2/tools/objtool && make 
O=/home/john/mnt_sda4/john/kernel-dev2 subdir=tools/objtool 
--no-print-directory -C objtool 
  INSTALL libsubcmd_headers
  CALLscripts/checksyscalls.sh
  CC  init/version.o
  AR  init/built-in.a
  CC  kernel/sys.o
  CC  kernel/module/main.o
  AR  kernel/module/built-in.a
  CC  drivers/base/firmware_loader/main.o
  CC  kernel/trace/trace.o
  AR  drivers/base/firmware_loader/built-in.a
  AR  drivers/base/built-in.a
  CC  net/ethtool/ioctl.o
  AR  kernel/trace/built-in.a
  AR  kernel/built-in.a
  AR  net/ethtool/built-in.a
  AR  net/built-in.a
  AR  drivers/built-in.a
  AR  built-in.a
  ...

Files like drivers/base/firmware_loader/main.c needs to be recompiled as
it includes generated/utsrelease.h for UTS_RELEASE macro, and utsrelease.h
is regenerated when the head commit changes.

Introduce global char uts_release[] in init/version.c, which this
mentioned code can use instead of UTS_RELEASE, meaning that we don't need
to rebuild for changing the head commit - only init/version.c needs to be
rebuilt. Whether all the references to UTS_RELEASE in the codebase are
proper is a different matter.

For an x86_64 defconfig build for this series on my old laptop, here is
before and after rebuild time:

before:
real0m53.591s
user1m1.842s
sys 0m9.161s

after:
real0m37.481s
user0m46.461s
sys 0m7.199s

Sending as an RFC as I need to test more of the conversions and I would
like to also convert more UTS_RELEASE users to prove this is proper
approach.

John Garry (4):
  init: Add uts_release
  tracing: Use uts_release
  net: ethtool: Use uts_release
  firmware_loader: Use uts_release

 drivers/base/firmware_loader/main.c | 39 +++--
 include/linux/utsname.h |  1 +
 init/version.c  |  3 +++
 kernel/trace/trace.c|  4 +--
 net/ethtool/ioctl.c |  4 +--
 5 files changed, 39 insertions(+), 12 deletions(-)

-- 
2.35.3

[PATCH RFC 4/4] firmware_loader: Use uts_release

Instead of using UTS_RELEASE, use uts_release, which means that we don't
need to rebuild the code just for the git head commit changing.

Since UTS_RELEASE was used for fw_path and this points to const data,
append uts_release dynamically to an intermediate string.

Signed-off-by: John Garry 
---
 drivers/base/firmware_loader/main.c | 39 +++--
 1 file changed, 31 insertions(+), 8 deletions(-)

diff --git a/drivers/base/firmware_loader/main.c 
b/drivers/base/firmware_loader/main.c
index ea28102d421e..87da7be61a29 100644
--- a/drivers/base/firmware_loader/main.c
+++ b/drivers/base/firmware_loader/main.c
@@ -38,7 +38,7 @@
 #include 
 #include 
 
-#include 
+#include 
 
 #include "../base.h"
 #include "firmware.h"
@@ -471,9 +471,9 @@ static int fw_decompress_xz(struct device *dev, struct 
fw_priv *fw_priv,
 static char fw_path_para[256];
 static const char * const fw_path[] = {
fw_path_para,
-   "/lib/firmware/updates/" UTS_RELEASE,
+   "/lib/firmware/updates/", /* UTS_RELEASE is appended later */
"/lib/firmware/updates",
-   "/lib/firmware/" UTS_RELEASE,
+   "/lib/firmware/", /* UTS_RELEASE is appended later */
"/lib/firmware"
 };
 
@@ -496,7 +496,7 @@ fw_get_filesystem_firmware(struct device *device, struct 
fw_priv *fw_priv,
size_t size;
int i, len, maxlen = 0;
int rc = -ENOENT;
-   char *path, *nt = NULL;
+   char *path, *fw_path_string, *nt = NULL;
size_t msize = INT_MAX;
void *buffer = NULL;
 
@@ -510,6 +510,12 @@ fw_get_filesystem_firmware(struct device *device, struct 
fw_priv *fw_priv,
if (!path)
return -ENOMEM;
 
+   fw_path_string = __getname();
+   if (!fw_path_string) {
+   __putname(path);
+   return -ENOMEM;
+   }
+
wait_for_initramfs();
for (i = 0; i < ARRAY_SIZE(fw_path); i++) {
size_t file_size = 0;
@@ -519,16 +525,32 @@ fw_get_filesystem_firmware(struct device *device, struct 
fw_priv *fw_priv,
if (!fw_path[i][0])
continue;
 
+   len = snprintf(fw_path_string, PATH_MAX, "%s", fw_path[i]);
+   if (len >= PATH_MAX) {
+   rc = -ENAMETOOLONG;
+   break;
+   }
+
+   /* Special handling to append UTS_RELEASE */
+   if (fw_path[i][len - 1] == '/') {
+   len = snprintf(fw_path_string, PATH_MAX, "%s%s",
+   fw_path[i], uts_release);
+   if (len >= PATH_MAX) {
+   rc = -ENAMETOOLONG;
+   break;
+   }
+   }
+
/* strip off \n from customized path */
-   maxlen = strlen(fw_path[i]);
+   maxlen = strlen(fw_path_string);
if (i == 0) {
-   nt = strchr(fw_path[i], '\n');
+   nt = strchr(fw_path_string, '\n');
if (nt)
-   maxlen = nt - fw_path[i];
+   maxlen = nt - fw_path_string;
}
 
len = snprintf(path, PATH_MAX, "%.*s/%s%s",
-  maxlen, fw_path[i],
+  maxlen, fw_path_string,
   fw_priv->fw_name, suffix);
if (len >= PATH_MAX) {
rc = -ENAMETOOLONG;
@@ -585,6 +607,7 @@ fw_get_filesystem_firmware(struct device *device, struct 
fw_priv *fw_priv,
break;
}
__putname(path);
+   __putname(fw_path_string);
 
return rc;
 }
-- 
2.35.3

[PATCH RFC 3/4] net: ethtool: Use uts_release

Instead of using UTS_RELEASE, use uts_release, which means that we don't
need to rebuild the code just for the git head commit changing.

Signed-off-by: John Garry 
---
 net/ethtool/ioctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ethtool/ioctl.c b/net/ethtool/ioctl.c
index 7519b0818b91..81d052505f67 100644
--- a/net/ethtool/ioctl.c
+++ b/net/ethtool/ioctl.c
@@ -31,7 +31,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include "common.h"
 
 /* State held across locks and calls for commands which have devlink fallback 
*/
@@ -713,7 +713,7 @@ ethtool_get_drvinfo(struct net_device *dev, struct 
ethtool_devlink_compat *rsp)
struct device *parent = dev->dev.parent;
 
rsp->info.cmd = ETHTOOL_GDRVINFO;
-   strscpy(rsp->info.version, UTS_RELEASE, sizeof(rsp->info.version));
+   strscpy(rsp->info.version, uts_release, sizeof(rsp->info.version));
if (ops->get_drvinfo) {
ops->get_drvinfo(dev, >info);
if (!rsp->info.bus_info[0] && parent)
-- 
2.35.3

[PATCH RFC 2/4] tracing: Use uts_release

Instead of using UTS_RELEASE, use uts_release, which means that we don't
need to rebuild the code just for the git head commit changing.

Signed-off-by: John Garry 
---
 kernel/trace/trace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2a7c6fd934e9..68513924beb4 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -13,7 +13,7 @@
  *  Copyright (C) 2004 Nadia Yvette Chambers
  */
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -4354,7 +4354,7 @@ print_trace_header(struct seq_file *m, struct 
trace_iterator *iter)
get_total_entries(buf, , );
 
seq_printf(m, "# %s latency trace v1.1.5 on %s\n",
-  name, UTS_RELEASE);
+  name, uts_release);
seq_puts(m, "# ---"
 "-\n");
seq_printf(m, "# latency: %lu us, #%lu/%lu, CPU#%d |"
-- 
2.35.3

Re: [PATCH] kmod: Add FIPS 202 SHA-3 support

2023-12-12 Thread Dimitri John Ledkov

On Wed, 6 Dec 2023 at 15:26, Lucas De Marchi  wrote:
>
> On Sun, Oct 22, 2023 at 07:09:28PM +0100, Dimitri John Ledkov wrote:
> >Add support for parsing FIPS 202 SHA-3 signature hashes. Separately,
> >it is not clear why explicit hashes are re-encoded here, instead of
> >trying to generically show any digest openssl supports.
> >
> >Signed-off-by: Dimitri John Ledkov 

NACK

> >---
> > libkmod/libkmod-signature.c | 12 
> > 1 file changed, 12 insertions(+)
> >
> >diff --git a/libkmod/libkmod-signature.c b/libkmod/libkmod-signature.c
> >index b749a818f9..a39059cd7c 100644
> >--- a/libkmod/libkmod-signature.c
> >+++ b/libkmod/libkmod-signature.c
> >@@ -57,6 +57,9 @@ enum pkey_hash_algo {
> >   PKEY_HASH_SHA512,
> >   PKEY_HASH_SHA224,
> >   PKEY_HASH_SM3,
> >+  PKEY_HASH_SHA3_256,
> >+  PKEY_HASH_SHA3_384,
> >+  PKEY_HASH_SHA3_512,
> >   PKEY_HASH__LAST
> > };
> >
> >@@ -70,6 +73,9 @@ const char *const pkey_hash_algo[PKEY_HASH__LAST] = {
> >   [PKEY_HASH_SHA512]  = "sha512",
> >   [PKEY_HASH_SHA224]  = "sha224",
> >   [PKEY_HASH_SM3] = "sm3",
> >+  [PKEY_HASH_SHA3_256]= "sha3-256",
> >+  [PKEY_HASH_SHA3_384]= "sha3-384",
> >+  [PKEY_HASH_SHA3_512]= "sha3-512",
> > };
> >
> > enum pkey_id_type {
> >@@ -167,6 +173,12 @@ static int obj_to_hash_algo(const ASN1_OBJECT *o)
> >   case NID_sm3:
> >   return PKEY_HASH_SM3;
> > # endif
> >+  case NID_sha3_256:
> >+  return PKEY_HASH_SHA3_256;
> >+  case NID_sha3_384:
> >+  return PKEY_HASH_SHA3_384;
> >+  case NID_sha3_512:
> >+  return PKEY_HASH_SHA3_512;
>
>
> with your other patch, libkmod: remove pkcs7 obj_to_hash_algo(), this
> hunk is not needed anymore. Do you want to send a new version of this
> patch?

This patch is no longer required, given that
https://lore.kernel.org/all/20231029010319.157390-1-dimitri.led...@canonical.com/
is applied. Upgrade kmod to the one that has at least that patch
applied, and then pkcs7 signatures are parsed correctly with
everything that a runtime OpenSSL supports. Thus if you want to see
SHA3 signatures, ensure your runtime libssl has SHA3 support.

>
> thanks
> Lucas De Marchi
>
> >   default:
> >   return -1;
> >   }
> >--
> >2.34.1
> >
> >

-- 
Dimitri

Sent from Ubuntu Pro
https://ubuntu.com/pro

[PATCH RFC 4/4] dev_dax_iomap: Add CONFIG_DEV_DAX_IOMAP kernel build parameter

From: John Groves 

Add CONFIG_DEV_DAXIOMAP kernel build parameter
---
 drivers/dax/Kconfig | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index a88744244149..b1ebcc77120b 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -78,4 +78,10 @@ config DEV_DAX_KMEM
 
  Say N if unsure.
 
+config DEV_DAX_IOMAP
+   depends on DEV_DAX && DAX
+   def_bool y
+   help
+ Support iomap mapping of devdax devices (for FS-DAX file
+ systems that reside on character /dev/dax devices)
 endif
-- 
2.40.1

[PATCH RFC 3/4] dev_dax_iomap: Add dax_operations to /dev/dax struct dax_device

From: John Groves 

This is the primary content of this rfc. Notes about this commit:

* These methods are based somewhat loosely on pmem_dax_ops from
  drivers/nvdimm/pmem.c

* dev_dax_direct_access() is physaddr based

* dev_dax_direct_access() works for mmap, but fails dax_copy_to_iter()
  on posix read

* dev_dax_recovery_write()  and dev_dax_zero_page_range() have not been
  tested yet. I'm looking for suggestions as to how to test those.

I'm hoping somebody (Dan?) can point the way to getting this working
with posix I/O. Does this need to go the memremap route?

Thanks,
John
---
 drivers/dax/bus.c | 105 ++
 1 file changed, 105 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1b55fd7aabaf..8f8c2991c7c2 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -10,6 +10,12 @@
 #include "dax-private.h"
 #include "bus.h"
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+#include 
+#include 
+#include 
+#endif
+
 static DEFINE_MUTEX(dax_bus_lock);
 
 #define DAX_NAME_LEN 30
@@ -1374,6 +1380,100 @@ phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, 
pgoff_t pgoff,
 }
 
 
+/* the phys address approach */
+long __dev_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+long nr_pages, enum dax_access_mode mode, void 
**kaddr,
+pfn_t *pfn)
+{
+   struct dev_dax *dev_dax = dax_get_private(dax_dev);
+   size_t size = nr_pages << PAGE_SHIFT;
+   size_t offset = pgoff << PAGE_SHIFT;
+   long range_remainder = 0;
+   phys_addr_t phys;
+   int i;
+
+   /*
+* pmem hides dax ranges by mapping  to a contiguous
+* pmem->virt_addr = devm_mremap_pages() (in pem_attach_disk()).
+* Is it legal to avoid the vmap overhead (and resource consumption) 
and just return
+* a (potentially partial) phys range? This function does this, 
returning the
+* phys_addr with the length truncated if necessary to the range 
remainder
+*/
+   phys = dax_pgoff_to_phys(dev_dax, pgoff, nr_pages << PAGE_SHIFT);
+
+   if (kaddr)
+   *kaddr = (void *)phys;
+
+   if (pfn)
+   *pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP); /* are flags 
correct? */
+
+   /*
+* If dax_pgoff_to_phys() also returned the range remainder (range_len 
- range_offset)
+* this loop would not be necessary
+*/
+   for (i = 0; i < dev_dax->nr_range; i++) {
+   size_t rlen = range_len(&(dev_dax->ranges[i].range));
+
+   if (offset < rlen) {
+   range_remainder = rlen - offset;
+   break;
+   }
+   offset -= rlen;
+   }
+
+   /*
+* Return length valid at phys. Hoping callers can deal with len < 
entire_dax_device
+* (or < npages). This returns the remaining length in the applicable 
dax region.
+*/
+   return PHYS_PFN(min_t(size_t, range_remainder, size));
+}
+
+static int dev_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
+   size_t nr_pages)
+{
+   long resid = nr_pages << PAGE_SHIFT;
+   long offset = pgoff << PAGE_SHIFT;
+
+   /* Break into one write per dax region */
+   while (resid > 0) {
+   void *kaddr;
+   pgoff_t poff = offset >> PAGE_SHIFT;
+   long len = __dev_dax_direct_access(dax_dev, poff,
+  nr_pages, DAX_ACCESS, 
, NULL);
+   len = min_t(long, len, PAGE_SIZE);
+   write_dax(kaddr, ZERO_PAGE(0), offset, len);
+
+   offset += len;
+   resid  -= len;
+   }
+   return 0;
+}
+
+static long dev_dax_direct_access(struct dax_device *dax_dev,
+   pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
+   void **kaddr, pfn_t *pfn)
+{
+   return __dev_dax_direct_access(dax_dev, pgoff, nr_pages, mode, kaddr, 
pfn);
+}
+
+static size_t dev_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
+   void *addr, size_t bytes, struct iov_iter *i)
+{
+   size_t len, off;
+
+   off = offset_in_page(addr);
+   len = PFN_PHYS(PFN_UP(off + bytes));
+
+   return _copy_from_iter_flushcache(addr, bytes, i);
+}
+
+static const struct dax_operations dev_dax_ops = {
+   .direct_access = dev_dax_direct_access,
+   .zero_page_range = dev_dax_zero_page_range,
+   .recovery_write = dev_dax_recovery_write,
+};
+#endif /* IS_ENABLED(CONFIG_DEV_DAX_IOMAP) */
+
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 {
struct dax_region *dax_region = data->dax_region;
@@ -1429,11 +1529,16 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data 
*data)
}
}
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+

[PATCH RFC 2/4] dev_dax_iomap: Temporary hacks due to linkage issues

From: John Groves 

These are functions that should be called from outside, but I had
linkage issues and did this hack instead. Will fix in the "real"
patches...
---
 drivers/dax/bus.c | 50 +++
 1 file changed, 50 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1659b787b65f..1b55fd7aabaf 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1324,6 +1324,56 @@ static const struct device_type dev_dax_type = {
.groups = dax_attribute_groups,
 };
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+
+/*
+ * This is write_pmem() from pmem.c
+ */
+static void write_dax(void *pmem_addr, struct page *page,
+   unsigned int off, unsigned int len)
+{
+   unsigned int chunk;
+   void *mem;
+
+   while (len) {
+   mem = kmap_atomic(page);
+   chunk = min_t(unsigned int, len, PAGE_SIZE - off);
+   memcpy_flushcache(pmem_addr, mem + off, chunk);
+   kunmap_atomic(mem);
+   len -= chunk;
+   off = 0;
+   page++;
+   pmem_addr += chunk;
+   }
+}
+
+/*
+ * This function is from drivers/dax/device.c
+ * For some reason EXPORT_SYMBOL(dax_pgoff_to_phys) didn't result in linkable 
code
+ */
+phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
+ unsigned long size)
+{
+   int i;
+
+   for (i = 0; i < dev_dax->nr_range; i++) {
+   struct dev_dax_range *dax_range = _dax->ranges[i];
+   struct range *range = _range->range;
+   unsigned long long pgoff_end;
+   phys_addr_t phys;
+
+   pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
+   if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
+   continue;
+   phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
+   if (phys + size - 1 <= range->end)
+   return phys;
+   break;
+   }
+   return -1;
+}
+
+
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 {
struct dax_region *dax_region = data->dax_region;
-- 
2.40.1

[PATCH RFC 1/4] dev_dax_iomap: Add add_dax_ops() func for fs-dax to provide dax holder_ops

From: John Groves 

This is clearly not the right way to set the holder_ops; where &
how should this be done?
---
 drivers/dax/super.c | 16 
 include/linux/dax.h |  5 +
 2 files changed, 21 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..3d4e205c1854 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -467,6 +467,22 @@ struct dax_device *alloc_dax(void *private, const struct 
dax_operations *ops)
 }
 EXPORT_SYMBOL_GPL(alloc_dax);
 
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+/* famfs calls this to add the holder_ops. There is probably a more elegant 
approach */
+int add_dax_ops(
+   struct dax_device   *dax_dev,
+   const struct dax_holder_operations *hops)
+{
+   /* dax_dev->ops should have been populated by devm_create_dev_dax() */
+   WARN_ON(!dax_dev->ops);
+
+   /* Use cmpxchg? */
+   dax_dev->holder_ops = hops;
+   return 0;
+}
+EXPORT_SYMBOL_GPL(add_dax_ops);
+#endif /* DEV_DAX_IOMAP */
+
 void put_dax(struct dax_device *dax_dev)
 {
if (!dax_dev)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..31b68667b3cb 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -57,6 +57,11 @@ struct dax_holder_operations {
 
 #if IS_ENABLED(CONFIG_DAX)
 struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
+
+#if IS_ENABLED(CONFIG_DEV_DAX_IOMAP)
+int add_dax_ops(struct dax_device *dax_dev,
+   const struct dax_holder_operations *hops);
+#endif
 void *dax_holder(struct dax_device *dax_dev);
 void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
-- 
2.40.1

[PATCH RFC 0/4] dev_dax_iomap: introduce iomap for /dev/dax

From: John Groves 

This patch set is not intended to be merged; I'm hoping to get some
clarification as to the correct approach (especialy from Dan). 

This work is related to famfs, which is a dax file system for shared
fabric-attached memory (FAM). Famfs is "coming soon" as an RFC, but
the concept and requirements were presented at LPC 2023. See
https://lpc.events/event/17/contributions/1455/ and
https://www.youtube.com/watch?v=aA_DgO95gLo. My expectation is that
a future (fully working) version of this patch will be folded into the
famfs
patches.

Unlike the current set of fs-dax file systems, famfs does not need a block
(pmem) device, and should really run on a /dev/dax character device since
that's how sharable fabric-attached cxl memory will surface. But
/dev/dax character devices are missing some functionality that is provided
by the block /dev/pmem driver - specifically struct dax_operations pointer
in struct dax_device.

This patch, when CONFIG_DEV_DAX_IOMAP=y, populates dax_dev->ops for
character dax devices. The added operations differ (currently) from
/dev/pmem's dev_dax->ops in that they don't use memremap but instead
provide a physical address in response to the dev_dax->direct_access()
method. 

The dax_operations are direct_access() (which resolves a dax dev offset
to an address), zero_page_range() and recovery_write(). I'm not sure yet
how to test the latter two, but the direct_access() method works in
conjunciton with famfs - but only for mmaped files.

But Posix reads fail. Specifically dax_iomap_iter() calls
dax_copy_to_iter(), which declines to copy the data for some reason in
one of the lower level copy_to_user variants. I've tried to isolate the
reason for the failure with a VM under gdb, but no luck there yet. I'm
wondering if there is some flag or attribute that needs to be applied to
these addresses/pages somehow to allow this to work.

The failing copy_to_user() differs from the same path with pmem fs-dax,
in that pmem does a memremap (which I think generates one contiguous
range, even if the device had more than one range - is this right, and
does this mean it's consuming some of the vmap/vmalloc range?)

I spent some time attempting a memremap, but I haven't figured out the
magic for that. However, I like the simplicity of resolving to phys if
that's not a non-starter for some reason.

I hope this is enough context for a meaningful review and suggestions as to
what a working dev_dax->dax_operations implementation should look like.

Thanks for any tips!


John Groves (4):
  Add add_dax_ops() func for fs-dax to provide dax holder_ops
  Temporary hacks due to linkage issues
  Add dax_operations to /dev/dax struct dax_device
  Add CONFIG_DEV_DAX_IOMAP kernel build parameter

 drivers/dax/Kconfig |   6 ++
 drivers/dax/bus.c   | 155 
 drivers/dax/super.c |  16 +
 include/linux/dax.h |   5 ++
 4 files changed, 182 insertions(+)

-- 
2.40.1

Re: [PATCH v2] locking/atomic: sh: Use generic_cmpxchg_local for arch_cmpxchg_local()

2023-11-02 Thread John Paul Adrian Glaubitz

On Thu, 2023-10-26 at 00:10 +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) 
> 
> Use __generic_cmpxchg_local() for arch_cmpxchg_local() implementation
> in SH architecture because it does not implement arch_cmpxchg_local().
> 
> Reported-by: kernel test robot 
> Closes: 
> https://lore.kernel.org/oe-kbuild-all/202310241310.ir5uukog-...@intel.com/
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  arch/sh/include/asm/cmpxchg.h |9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/sh/include/asm/cmpxchg.h b/arch/sh/include/asm/cmpxchg.h
> index 288f6f38d98f..5d617b3ef78f 100644
> --- a/arch/sh/include/asm/cmpxchg.h
> +++ b/arch/sh/include/asm/cmpxchg.h
> @@ -71,4 +71,13 @@ static inline unsigned long __cmpxchg(volatile void * ptr, 
> unsigned long old,
>   (unsigned long)_n_, sizeof(*(ptr))); \
>})
>  
> +#include 
> +
> +#define arch_cmpxchg_local(ptr, o, n) ({ \
> + (__typeof__(*ptr))__generic_cmpxchg_local((ptr),\
> +   (unsigned long)(o),   \
> +   (unsigned long)(n),   \
> +   sizeof(*(ptr)));  \
> +})
> +
>  #endif /* __ASM_SH_CMPXCHG_H */

Reviewed-by: John Paul Adrian Glaubitz 

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

[PATCH kmod] libkmod: remove pkcs7 obj_to_hash_algo()

2023-10-28 Thread Dimitri John Ledkov

Switch to using OBJ_obj2txt() to calculate and print the pkcs7
signature hash name. This eliminates the need to duplicate libcrypto
NID to name mapping, detect SM3 openssl compile-time support, and
enables using any hashes that openssl and kernel know about. For
example SHA3 are being added for v6.7 and with this patch are
automatically supported.

Signed-off-by: Dimitri John Ledkov 
---
 configure.ac|  7 -
 libkmod/libkmod-signature.c | 59 +
 2 files changed, 20 insertions(+), 46 deletions(-)

diff --git a/configure.ac b/configure.ac
index 7bf8d78ca7..a6b8fa0308 100644
--- a/configure.ac
+++ b/configure.ac
@@ -133,13 +133,6 @@ AC_ARG_WITH([openssl],
 AS_IF([test "x$with_openssl" != "xno"], [
PKG_CHECK_MODULES([libcrypto], [libcrypto >= 1.1.0], [LIBS="$LIBS 
$libcrypto_LIBS"])
AC_DEFINE([ENABLE_OPENSSL], [1], [Enable openssl for modinfo.])
-   AC_COMPILE_IFELSE([AC_LANG_SOURCE([[#include 
-   int nid = NID_sm3;]])], [
-   AC_MSG_NOTICE([openssl supports sm3])
-   ], [
-   AC_MSG_NOTICE([openssl sm3 support not detected])
-   CPPFLAGS="$CPPFLAGS -DOPENSSL_NO_SM3"
-   ])
module_signatures="PKCS7 $module_signatures"
 ], [
AC_MSG_NOTICE([openssl support not requested])
diff --git a/libkmod/libkmod-signature.c b/libkmod/libkmod-signature.c
index b749a818f9..80f6447bce 100644
--- a/libkmod/libkmod-signature.c
+++ b/libkmod/libkmod-signature.c
@@ -127,6 +127,7 @@ struct pkcs7_private {
PKCS7 *pkcs7;
unsigned char *key_id;
BIGNUM *sno;
+   char *hash_algo;
 };
 
 static void pkcs7_free(void *s)
@@ -137,42 +138,11 @@ static void pkcs7_free(void *s)
PKCS7_free(pvt->pkcs7);
BN_free(pvt->sno);
free(pvt->key_id);
+   free(pvt->hash_algo);
free(pvt);
si->private = NULL;
 }
 
-static int obj_to_hash_algo(const ASN1_OBJECT *o)
-{
-   int nid;
-
-   nid = OBJ_obj2nid(o);
-   switch (nid) {
-   case NID_md4:
-   return PKEY_HASH_MD4;
-   case NID_md5:
-   return PKEY_HASH_MD5;
-   case NID_sha1:
-   return PKEY_HASH_SHA1;
-   case NID_ripemd160:
-   return PKEY_HASH_RIPE_MD_160;
-   case NID_sha256:
-   return PKEY_HASH_SHA256;
-   case NID_sha384:
-   return PKEY_HASH_SHA384;
-   case NID_sha512:
-   return PKEY_HASH_SHA512;
-   case NID_sha224:
-   return PKEY_HASH_SHA224;
-# ifndef OPENSSL_NO_SM3
-   case NID_sm3:
-   return PKEY_HASH_SM3;
-# endif
-   default:
-   return -1;
-   }
-   return -1;
-}
-
 static const char *x509_name_to_str(X509_NAME *name)
 {
int i;
@@ -219,7 +189,8 @@ static bool fill_pkcs7(const char *mem, off_t size,
unsigned char *key_id_str;
struct pkcs7_private *pvt;
const char *issuer_str;
-   int hash_algo;
+   char *hash_algo;
+   int hash_algo_len;
 
size -= sig_len;
pkcs7_raw = mem + size;
@@ -278,27 +249,37 @@ static bool fill_pkcs7(const char *mem, off_t size,
 
X509_ALGOR_get0(, NULL, NULL, dig_alg);
 
-   hash_algo = obj_to_hash_algo(o);
-   if (hash_algo < 0)
+   // Use OBJ_obj2txt to calculate string length
+   hash_algo_len = OBJ_obj2txt(NULL, 0, o, 0);
+   if (hash_algo_len < 0)
goto err3;
-   sig_info->hash_algo = pkey_hash_algo[hash_algo];
-   // hash algo has not been recognized
-   if (sig_info->hash_algo == NULL)
+   hash_algo = malloc(hash_algo_len + 1);
+   if (hash_algo == NULL)
goto err3;
+   hash_algo_len = OBJ_obj2txt(hash_algo, hash_algo_len + 1, o, 0);
+   if (hash_algo_len < 0)
+   goto err4;
+
+   // Assign libcrypto hash algo string or number
+   sig_info->hash_algo = hash_algo;
+
sig_info->id_type = pkey_id_type[modsig->id_type];
 
pvt = malloc(sizeof(*pvt));
if (pvt == NULL)
-   goto err3;
+   goto err4;
 
pvt->pkcs7 = pkcs7;
pvt->key_id = key_id_str;
pvt->sno = sno_bn;
+   pvt->hash_algo = hash_algo;
sig_info->private = pvt;
 
sig_info->free = pkcs7_free;
 
return true;
+err4:
+   free(hash_algo);
 err3:
free(key_id_str);
 err2:
-- 
2.34.1

Re: [PATCH] locking/atomic: sh: Use generic_cmpxchg_local for arch_cmpxchg_local()

2023-10-25 Thread John Paul Adrian Glaubitz

On Wed, 2023-10-25 at 11:30 +0100, Mark Rutland wrote:
> On Wed, Oct 25, 2023 at 08:42:55AM +0900, Masami Hiramatsu wrote:
> > On Tue, 24 Oct 2023 16:08:12 +0100
> > Mark Rutland  wrote:
> > 
> > > On Tue, Oct 24, 2023 at 11:52:54PM +0900, Masami Hiramatsu (Google) wrote:
> > > > From: Masami Hiramatsu (Google) 
> > > > 
> > > > Use generic_cmpxchg_local() for arch_cmpxchg_local() implementation
> > > > in SH architecture because it does not implement arch_cmpxchg_local().
> > > 
> > > I do not think this is correct.
> > > 
> > > The implementation in  is UP-only (and it 
> > > only
> > > disables interrupts), whereas arch/sh can be built SMP. We should 
> > > probably add
> > > some guards into  for that as we have in
> > > .
> > 
> > Isn't cmpxchg_local for the data which only needs to ensure to do cmpxchg
> > on local CPU?
> > So I think it doesn't care about the other CPUs (IOW, it should not touched 
> > by
> > other CPUs), so it only considers UP case. E.g. on x86, 
> > arch_cmpxchg_local() is
> > defined as raw "cmpxchg" without lock prefix.
> > 
> > #define __cmpxchg_local(ptr, old, new, size)\
> > __raw_cmpxchg((ptr), (old), (new), (size), "")
> > 
> 
> Yes, you're right; sorry for the noise.
> 
> For your original patch:
> 
> Acked-by: Mark Rutland 
> 

Geert, what's your opinion on this?

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

Re: [PATCH] locking/atomic: sh: Use generic_cmpxchg_local for arch_cmpxchg_local()

2023-10-24 Thread John Paul Adrian Glaubitz

On Tue, 2023-10-24 at 23:52 +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) 
> 
> Use generic_cmpxchg_local() for arch_cmpxchg_local() implementation
> in SH architecture because it does not implement arch_cmpxchg_local().
> 
> Reported-by: kernel test robot 
> Closes: 
> https://lore.kernel.org/oe-kbuild-all/202310241310.ir5uukog-...@intel.com/
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  arch/sh/include/asm/cmpxchg.h |2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/sh/include/asm/cmpxchg.h b/arch/sh/include/asm/cmpxchg.h
> index 288f6f38d98f..e920e61fb817 100644
> --- a/arch/sh/include/asm/cmpxchg.h
> +++ b/arch/sh/include/asm/cmpxchg.h
> @@ -71,4 +71,6 @@ static inline unsigned long __cmpxchg(volatile void * ptr, 
> unsigned long old,
>   (unsigned long)_n_, sizeof(*(ptr))); \
>})
>  
> +#include 
> +
>  #endif /* __ASM_SH_CMPXCHG_H */

Reviewed-by: John Paul Adrian Glaubitz 

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

[PATCH] kmod: Add FIPS 202 SHA-3 support

2023-10-22 Thread Dimitri John Ledkov

Add support for parsing FIPS 202 SHA-3 signature hashes. Separately,
it is not clear why explicit hashes are re-encoded here, instead of
trying to generically show any digest openssl supports.

Signed-off-by: Dimitri John Ledkov 
---
 libkmod/libkmod-signature.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/libkmod/libkmod-signature.c b/libkmod/libkmod-signature.c
index b749a818f9..a39059cd7c 100644
--- a/libkmod/libkmod-signature.c
+++ b/libkmod/libkmod-signature.c
@@ -57,6 +57,9 @@ enum pkey_hash_algo {
PKEY_HASH_SHA512,
PKEY_HASH_SHA224,
PKEY_HASH_SM3,
+   PKEY_HASH_SHA3_256,
+   PKEY_HASH_SHA3_384,
+   PKEY_HASH_SHA3_512,
PKEY_HASH__LAST
 };
 
@@ -70,6 +73,9 @@ const char *const pkey_hash_algo[PKEY_HASH__LAST] = {
[PKEY_HASH_SHA512]  = "sha512",
[PKEY_HASH_SHA224]  = "sha224",
[PKEY_HASH_SM3] = "sm3",
+   [PKEY_HASH_SHA3_256]= "sha3-256",
+   [PKEY_HASH_SHA3_384]= "sha3-384",
+   [PKEY_HASH_SHA3_512]= "sha3-512",
 };
 
 enum pkey_id_type {
@@ -167,6 +173,12 @@ static int obj_to_hash_algo(const ASN1_OBJECT *o)
case NID_sm3:
return PKEY_HASH_SM3;
 # endif
+   case NID_sha3_256:
+   return PKEY_HASH_SHA3_256;
+   case NID_sha3_384:
+   return PKEY_HASH_SHA3_384;
+   case NID_sha3_512:
+   return PKEY_HASH_SHA3_512;
default:
return -1;
}
-- 
2.34.1

Re: [PATCH] printk: add cpu id information to printk() output

2023-09-15 Thread John Ogness

On 2023-09-15, Enlin Mu  wrote:
> Sometimes we want to print cpu id of printk() messages to consoles
>
> diff --git a/include/linux/threads.h b/include/linux/threads.h
> index c34173e6c5f1..6700bd9a174f 100644
> --- a/include/linux/threads.h
> +++ b/include/linux/threads.h
> @@ -34,6 +34,9 @@
>  #define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
>   (sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))
>  
> +#define CPU_ID_SHIFT 23
> +#define CPU_ID_MASK  0xff80

This only supports 256 CPUs. I think it doesn't make sense to try to
squish CPU and Task IDs into 32 bits.

What about introducing a caller_id option to always only print the CPU
ID? Or do you really need Task _and_ CPU?

John Ogness

Re: [PATCH v3 2/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU

2021-04-20 Thread John Garry


On 15/04/2021 13:48, Qi Liu wrote:

PCIe PMU Root Complex Integrated End Point(RCiEP) device is supported
to sample bandwidth, latency, buffer occupation etc.

Each PMU RCiEP device monitors multiple Root Ports, and each RCiEP is
registered as a PMU in /sys/bus/event_source/devices, so users can
select target PMU, and use filter to do further sets.

Filtering options contains:
event- select the event.
subevent - select the subevent.
port - select target Root Ports. Information of Root Ports
are shown under sysfs.
bdf  - select requester_id of target EP device.
trig_len - set trigger condition for starting event statistics.
trigger_mode - set trigger mode. 0 means starting to statistic when
bigger than trigger condition, and 1 means smaller.
thr_len  - set threshold for statistics.
thr_mode - set threshold mode. 0 means count when bigger than
threshold, and 1 means smaller.

Signed-off-by: Qi Liu 


Some minor items and nits with coding style below, but generally looks ok:

Reviewed-by: John Garry 


---
  MAINTAINERS|6 +
  drivers/perf/Kconfig   |2 +
  drivers/perf/Makefile  |1 +
  drivers/perf/pci/Kconfig   |   16 +
  drivers/perf/pci/Makefile  |2 +
  drivers/perf/pci/hisilicon/Makefile|3 +
  drivers/perf/pci/hisilicon/hisi_pcie_pmu.c | 1014 
  include/linux/cpuhotplug.h |1 +
  8 files changed, 1045 insertions(+)
  create mode 100644 drivers/perf/pci/Kconfig
  create mode 100644 drivers/perf/pci/Makefile
  create mode 100644 drivers/perf/pci/hisilicon/Makefile
  create mode 100644 drivers/perf/pci/hisilicon/hisi_pcie_pmu.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 7fdc513..efe06cd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8084,6 +8084,12 @@ W:   http://www.hisilicon.com
  F:Documentation/admin-guide/perf/hisi-pmu.rst
  F:drivers/perf/hisilicon
  
+HISILICON PCIE PMU DRIVER

+M: Qi Liu 
+S: Maintained
+F: Documentation/admin-guide/perf/hisi-pcie-pmu.rst
+F: drivers/perf/pci/hisilicon/hisi_pcie_pmu.c
+
  HISILICON QM AND ZIP Controller DRIVER
  M:Zhou Wang 
  L:linux-cry...@vger.kernel.org
diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 77522e5..ddd82fa 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -139,4 +139,6 @@ config ARM_DMC620_PMU
  
  source "drivers/perf/hisilicon/Kconfig"
  
+source "drivers/perf/pci/Kconfig"

+
  endmenu
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index 5260b11..1208c08 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -14,3 +14,4 @@ obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o
  obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
  obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
  obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o
+obj-y += pci/
diff --git a/drivers/perf/pci/Kconfig b/drivers/perf/pci/Kconfig
new file mode 100644
index 000..9f30291
--- /dev/null
+++ b/drivers/perf/pci/Kconfig
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# PCIe Performance Monitor Drivers
+#
+menu "PCIe Performance Monitor"
+
+config HISI_PCIE_PMU
+   tristate "HiSilicon PCIE PERF PMU"
+   depends on (ARM64 && PCI) || COMPILE_TEST
+   help
+ Provide support for HiSilicon PCIe performance monitoring unit (PMU)
+ RCiEP devices.
+ Adds the PCIe PMU into perf events system for monitoring latency,
+ bandwidth etc.
+
+endmenu
diff --git a/drivers/perf/pci/Makefile b/drivers/perf/pci/Makefile
new file mode 100644
index 000..a56b1a9
--- /dev/null
+++ b/drivers/perf/pci/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y += hisilicon/
diff --git a/drivers/perf/pci/hisilicon/Makefile 
b/drivers/perf/pci/hisilicon/Makefile
new file mode 100644
index 000..65b0bd7
--- /dev/null
+++ b/drivers/perf/pci/hisilicon/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+obj-$(CONFIG_HISI_PCIE_PMU) += hisi_pcie_pmu.o
diff --git a/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c 
b/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c
new file mode 100644
index 000..415bf39
--- /dev/null
+++ b/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c
@@ -0,0 +1,1014 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * This driver adds support for PCIe PMU RCiEP device. Related
+ * perf events are bandwidth, bandwidth utilization, latency
+ * etc.
+ *
+ * Copyright (C) 2021 HiSilicon Limited
+ * Author: Qi Liu
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Define registers */
+#define HISI_PCIE_GLOBAL_CTRL  0x00
+#define HISI_PCIE_EVENT_CTRL   0x010
+#define HISI_PCIE_CNT  0x090
+#defi

Re: [PATCH] scsi: core: Cap initial sdev queue depth at shost.can_queue

2021-04-20 Thread John Garry


On 20/04/2021 01:02, Ming Lei wrote:

On Tue, Apr 20, 2021 at 12:06:24AM +0800, John Garry wrote:

Function sdev_store_queue_depth() enforces that the sdev queue depth cannot
exceed shost.can_queue.

However, the LLDD may still set cmd_per_lun > can_queue, which leads to an
initial sdev queue depth greater than can_queue.

Stop this happened by capping initial sdev queue depth at can_queue.

Signed-off-by: John Garry 
---
Topic originally discussed at:
https://lore.kernel.org/linux-scsi/85dec8eb-8eab-c7d6-b0fb-5622747c5...@interlog.com/T/#m5663d0cac657d843b93d0c9a2374f98fc04384b9

Last idea there was to error/warn in scsi_add_host() for cmd_per_lun >




Hi Ming,


No, that isn't my suggestion.


Right, it was what I mentioned.




can_queue. However, such a shost driver could still configure the sdev
queue depth to be sound value at .slave_configure callback, so now thinking
the orig patch better.


As I mentioned last time, why can't we fix ->cmd_per_lun in
scsi_add_host() using .can_queue?



I would rather not change the values which are provided from the driver. 
I would rather take the original values and try to use them in a sane way.


I have not seen other places where driver shost config values are 
modified by the core code.


Thanks,
John

[PATCH] scsi: core: Cap initial sdev queue depth at shost.can_queue

2021-04-19 Thread John Garry

Function sdev_store_queue_depth() enforces that the sdev queue depth cannot
exceed shost.can_queue.

However, the LLDD may still set cmd_per_lun > can_queue, which leads to an
initial sdev queue depth greater than can_queue.

Stop this happened by capping initial sdev queue depth at can_queue.

Signed-off-by: John Garry 
---
Topic originally discussed at:
https://lore.kernel.org/linux-scsi/85dec8eb-8eab-c7d6-b0fb-5622747c5...@interlog.com/T/#m5663d0cac657d843b93d0c9a2374f98fc04384b9

Last idea there was to error/warn in scsi_add_host() for cmd_per_lun >
can_queue. However, such a shost driver could still configure the sdev
queue depth to be sound value at .slave_configure callback, so now thinking
the orig patch better.

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 9f1b7f3c650a..8de2f830bcdc 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -277,7 +277,11 @@ static struct scsi_device *scsi_alloc_sdev(struct 
scsi_target *starget,
WARN_ON_ONCE(!blk_get_queue(sdev->request_queue));
sdev->request_queue->queuedata = sdev;
 
-   depth = sdev->host->cmd_per_lun ?: 1;
+   if (sdev->host->cmd_per_lun)
+   depth = min_t(unsigned int, sdev->host->cmd_per_lun,
+ sdev->host->can_queue);
+   else
+   depth = 1;
 
/*
 * Use .can_queue as budget map's depth because we have to
-- 
2.26.2

Re: [PATCH 5/8] iommu: fix a couple of spelling mistakes

2021-04-16 Thread John Garry


On 26/03/2021 06:24, Zhen Lei wrote:

There are several spelling mistakes, as follows:
funcions ==> functions
distiguish ==> distinguish
detroyed ==> destroyed

Signed-off-by: Zhen Lei


I think that there should be a /s/appropriatley/appropriately/ in iommu.c

Thanks,
john

Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator

2021-04-14 Thread John Garry


On 06/04/2021 17:54, John Garry wrote:

Hi Robin,



Sorry if the phrasing was unclear there - the allusion to default 
domains is new, it just occurred to me that what we do there is in 
fact fairly close to what I've suggested previously for this. In that 
case, we have a global policy set by the command line, which *can* be 
overridden per-domain via sysfs at runtime, provided the user is 
willing to tear the whole thing down. Using a similar approach here 
would give a fair degree of flexibility but still mean that changes 
never have to be made dynamically to a live domain.


So are you saying that we can handle it similar to how we now can handle 
changing default domain for an IOMMU group via sysfs? If so, that just 
is not practical here. Reason being that this particular DMA engine 
provides the block device giving / mount point, so if we unbind the 
driver, we lose / mount point.


And I am not sure if the end user would even know how to set such a 
tunable. Or, in this case, why the end user would not want the optimized 
range configured always.


I'd still rather if the device driver could provide info which can be 
used to configure this before or during probing.


As a new solution, how about do both of these:
a. Add a per-IOMMU group sysfs file to set this tunable. Works same as 
how we change the default domain, and has all the same 
restrictions/steps. I think that this is what you are already suggesting.
b. Provide a DMA mapping API to set this value, similar to this current 
series. In the IOMMU backend for that API, we record a new range value 
and return -EPROBE_DEFER when successful. In the reprobe we reset the 
default domain for the devices' IOMMU group, with the IOVA domain rcache 
range configured as previously requested. Again, allocating the new 
default domain is similar to how we change default domain type today.
This means that we don't play with a live domain. Downside is that we 
need to defer the probe.


Thanks,
John

Re: [PATCH v2 1/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU

2021-04-13 Thread John Garry


On 13/04/2021 10:12, liuqi (BA) wrote:


I do wonder why we even need maintain pcie_pmu->cpumask

Can't we just use cpu_online_mask as appropiate instead?


?

Sorry, missed it yesterday.
It seems that cpumask is always same as cpu_online_mask, So do we need 
to reserve the cpumask sysfs interface?


I'm not saying that we don't require the cpumask sysfs interface. I am 
just asking why you maintain a separate cpumask, when, as I said, they 
seem the same.


Thanks,
John

perf arm64 --topdown support (was "perf arm64 metricgroup support")

2021-04-13 Thread John Garry


On 08/04/2021 13:06, Jiri Olsa wrote:

perf stat --topdown is not supported, as this requires the CPU PMU to
expose (alias) events for the TopDown L1 metrics from sysfs, which arm
does not do. To get that to work, we probably need to make perf use the
pmu-events cpumap to learn about those alias events.


Hi guys,

About supporting --topdown command for other archs apart from x86, it 
seems not possible today. Support there is based on kernel support for 
"topdown" CPU events used in the metric calculations. However, arm64, 
for example, does not support these "topdown" events.


It seems to me that we can change to use pmu-events framework + 
metricgroup support here, rather than hardcoded events - has anyone 
considered this approach previously? Seems a pretty big job, so thought 
I'd ask first ...


Thanks,
John

Re: [PATCH v2 1/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU

2021-04-12 Thread John Garry


On 12/04/2021 14:34, liuqi (BA) wrote:


Hi John,

Thanks for reviewing this.
On 2021/4/9 18:22, John Garry wrote:

On 09/04/2021 10:05, Qi Liu wrote:

PCIe PMU Root Complex Integrated End Point(RCiEP) device is supported
to sample bandwidth, latency, buffer occupation etc.

Each PMU RCiEP device monitors multiple Root Ports, and each RCiEP is
registered as a PMU in /sys/bus/event_source/devices, so users can
select target PMU, and use filter to do further sets.


side note: it would be good to mention what baseline the series is based 
on in the cover letter




Filtering options contains:
event    - select the event.
subevent - select the subevent.
port - select target Root Ports. Information of Root Ports
     are shown under sysfs.
bdf  - select requester_id of target EP device.
trig_len - set trigger condition for starting event statistics.
trigger_mode - set trigger mode. 0 means starting to statistic when
     bigger than trigger condition, and 1 means smaller.
thr_len  - set threshold for statistics.
thr_mode - set threshold mode. 0 means count when bigger than
     threshold, and 1 means smaller.

Signed-off-by: Qi Liu 
---
   MAINTAINERS    |    6 +
   drivers/perf/Kconfig   |    2 +
   drivers/perf/Makefile  |    1 +
   drivers/perf/pci/Kconfig   |   16 +
   drivers/perf/pci/Makefile  |    2 +
   drivers/perf/pci/hisilicon/Makefile    |    5 +
   drivers/perf/pci/hisilicon/hisi_pcie_pmu.c | 1011

   include/linux/cpuhotplug.h |    1 +
   8 files changed, 1044 insertions(+)
   create mode 100644 drivers/perf/pci/Kconfig
   create mode 100644 drivers/perf/pci/Makefile
   create mode 100644 drivers/perf/pci/hisilicon/Makefile
   create mode 100644 drivers/perf/pci/hisilicon/hisi_pcie_pmu.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 3353de0..46c7861 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8023,6 +8023,12 @@ W:    http://www.hisilicon.com
   F:    Documentation/admin-guide/perf/hisi-pmu.rst
   F:    drivers/perf/hisilicon
+HISILICON PCIE PMU DRIVER
+M:    Qi Liu 
+S:    Maintained
+F:    Documentation/admin-guide/perf/hisi-pcie-pmu.rst


nit: this does not exist yet...


thanks, I'll move this add-maintainer-part to the second patch.


that's why I advocate the documentation first :)


+F:    drivers/perf/pci/hisilicon/hisi_pcie_pmu.c
+
   HISILICON QM AND ZIP Controller DRIVER
   M:    Zhou Wang 
   L:    linux-cry...@vger.kernel.org
diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 3075cf1..99d4760 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -139,4 +139,6 @@ config ARM_DMC620_PMU
   source "drivers/perf/hisilicon/Kconfig"
+source "drivers/perf/pci/Kconfig"
+
   endmenu
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index 5260b11..1208c08 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -14,3 +14,4 @@ obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o
   obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
   obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
   obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o
+obj-y += pci/
diff --git a/drivers/perf/pci/Kconfig b/drivers/perf/pci/Kconfig
new file mode 100644
index 000..a119486
--- /dev/null
+++ b/drivers/perf/pci/Kconfig
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# PCIe Performance Monitor Drivers
+#
+menu "PCIe Performance Monitor"
+
+config HISI_PCIE_PMU
+    tristate "HiSilicon PCIE PERF PMU"
+    depends on ARM64 && PCI && HISI_PMU


What from HISI_PMU is needed? I couldn't find anything here

The event_sysfs_show() and format_sysfs_show() function of
hisi_uncore_pmu.h can be reused in hisi_pcie_pmu.c, So I add path in
Makefile and include "hisi_uncore_pmu.h".


Right, but it would be nice to be able to build this under COMPILE_TEST. 
CONFIG_HISI_PMU cannot be built under COMPILE_TEST, so nice to not 
depend on it.


So you could put hisi_event_sysfs_show() as a static inline in 
hisi_uncore_pmu.h, so the dependency can be removed


Having said that, there is nothing really hisi specific in those 
functions like hisi_event_sysfs_show().


Can't we just create generic functions here?

hisi_event_sysfs_show() == cci400_pmu_cycle_event_show() == 
xgene_pmu_event_show()







+    help
+  Provide support for HiSilicon PCIe performance monitoring unit
(PMU)
+  RCiEP devices.
+  Adds the PCIe PMU into perf events system for monitoring latency,
+  bandwidth etc.
+







+#define HISI_PCIE_CHECK_EVENTS(name, list) \


"check" in a function name is too vague, as it does not imply any
side-effect from calling this function.

And I think "build" or similar would be good to be included in the macro
name

Re: [PATCH] iommu/amd: Fix extended features logging

2021-04-11 Thread John Ogness

On 2021-04-11, Alexander Monakov  wrote:
>>> The second line is emitted via 'pr_cont', which causes it to have a
>>> different ('warn') loglevel compared to the previous line ('info').
>>> 
>>> Commit 9a295ff0ffc9 attempted to rectify this by removing the newline
>>> from the pci_info format string, but this doesn't work, as pci_info
>>> calls implicitly append a newline anyway.
>> 
>> Hmm, did I really screw that up during my testing? I am sorry about that.
>> 
>> I tried to wrap my head around, where the newline is implicitly appended, and
>> only found the definitions below.
>> 
>> include/linux/pci.h:#define pci_info(pdev, fmt, arg...)
>> dev_info(&(pdev)->dev, fmt, ##arg)
>> 
>> include/linux/dev_printk.h:#define dev_info(dev, fmt, ...)
>> \
>> include/linux/dev_printk.h: _dev_info(dev, dev_fmt(fmt),
>> ##__VA_ARGS__)
>> 
>> include/linux/dev_printk.h:__printf(2, 3) __cold
>> include/linux/dev_printk.h:void _dev_info(const struct device *dev, const
>> char *fmt, ...);
>> 
>> include/linux/compiler_attributes.h:#define __printf(a, b)
>> __attribute__((__format__(printf, a, b)))
>
> Yeah, it's not obvious: it happens in kernel/printk/printk.c:vprintk_store
> where it does
>
>   if (dev_info)
>   lflags |= LOG_NEWLINE;
>
> It doesn't seem to be documented; I think it prevents using pr_cont with
> "rich" printk facilities that go via _dev_info.
>
> I suspect it quietly changed in commit c313af145b9bc ("printk() - isolate
> KERN_CONT users from ordinary complete lines").

Yes, this behavior has been around for a while. I see no reason why it
should be that way. These days printk does not care if there is dev_info
included or not.

>> In the discussion *smpboot: CPU numbers printed as warning* [1] John wrote:
>> 
>>> It is supported to provide loglevels for CONT messages. The loglevel is
>>> then only used if the append fails:
>>> 
>>> pr_cont(KERN_INFO "message part");
>>> 
>>> I don't know if we want to go down that path. But it is supported.
>
> Yeah, I saw that, but decided to go with the 'pr_info("")' solution, because
> it is less magic, and already used in two other drivers.

Note that what I was suggesting was to fix a different issue: If the
pr_cont() caller is interrupted by another printk user, then the
following pr_cont() calls will use the default loglevel. By explicitly
specifying the loglevel in pr_cont(), you can be sure that those pieces
get the desired loglevel, even if those pieces get separated off because
of an interrupting printk user.

So even if we fix dev_info to allow pr_cont continuation, it still may
be desirable to specify the loglevel in the pr_cont pieces.

> pr_info("") will also prepend 'AMD-Vi:' to the feature list, which is
> nice.

I'd rather fix dev_info callers to allow pr_cont and then fix any code
that is using this workaround.

And if the print maintainers agree it is OK to encourage
pr_cont(LOGLEVEL "...") usage, then people should really start using
that if the loglevel on those pieces is important.

John Ogness

RE: [syzbot] WARNING: refcount bug in sk_psock_get

2021-04-09 Thread John Fastabend

syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:9c54130c Add linux-next specific files for 20210406
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=17d8d7aad0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=d125958c3995ddcd
> dashboard link: https://syzkaller.appspot.com/bug?extid=b54a1ce86ba4a623b7f0
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1729797ed0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1190f46ad0
> 
> The issue was bisected to:
> 
> commit 997acaf6b4b59c6a9c259740312a69ea549cc684
> Author: Mark Rutland 
> Date:   Mon Jan 11 15:37:07 2021 +
> 
> lockdep: report broken irq restoration
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=11a6cc96d0
> final oops: https://syzkaller.appspot.com/x/report.txt?x=13a6cc96d0
> console output: https://syzkaller.appspot.com/x/log.txt?x=15a6cc96d0
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+b54a1ce86ba4a623b...@syzkaller.appspotmail.com
> Fixes: 997acaf6b4b5 ("lockdep: report broken irq restoration")
> 
> [ cut here ]
> refcount_t: saturated; leaking memory.
> WARNING: CPU: 1 PID: 8414 at lib/refcount.c:19 
> refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
> Modules linked in:
> CPU: 1 PID: 8414 Comm: syz-executor793 Not tainted 
> 5.12.0-rc6-next-20210406-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> RIP: 0010:refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19
> Code: 1d 69 0c e6 09 31 ff 89 de e8 c8 b4 a6 fd 84 db 75 ab e8 0f ae a6 fd 48 
> c7 c7 e0 52 c2 89 c6 05 49 0c e6 09 01 e8 91 0f 00 05 <0f> 0b eb 8f e8 f3 ad 
> a6 fd 0f b6 1d 33 0c e6 09 31 ff 89 de e8 93
> RSP: 0018:c9eef388 EFLAGS: 00010282
> RAX:  RBX:  RCX: 
> RDX: 88801bbdd580 RSI: 815c2e05 RDI: f520001dde63
> RBP:  R08:  R09: 
> R10: 815bcc6e R11:  R12: 1920001dde74
> R13: 90200301 R14: 888026e0 R15: c9eef3c0
> FS:  01422300() GS:8880b9d0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 2000 CR3: 12b3b000 CR4: 001506e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  __refcount_add_not_zero include/linux/refcount.h:163 [inline]
>  __refcount_inc_not_zero include/linux/refcount.h:227 [inline]
>  refcount_inc_not_zero include/linux/refcount.h:245 [inline]
>  sk_psock_get+0x3b0/0x400 include/linux/skmsg.h:435
>  bpf_exec_tx_verdict+0x11e/0x11a0 net/tls/tls_sw.c:799
>  tls_sw_sendmsg+0xa41/0x1800 net/tls/tls_sw.c:1013
>  inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:821

[...]

This is likely a problem with latest round of sockmap patches I'll
tke a look.

Re: [PATCH v2 1/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU

2021-04-09 Thread John Garry


On 09/04/2021 10:05, Qi Liu wrote:

PCIe PMU Root Complex Integrated End Point(RCiEP) device is supported
to sample bandwidth, latency, buffer occupation etc.

Each PMU RCiEP device monitors multiple Root Ports, and each RCiEP is
registered as a PMU in /sys/bus/event_source/devices, so users can
select target PMU, and use filter to do further sets.

Filtering options contains:
event- select the event.
subevent - select the subevent.
port - select target Root Ports. Information of Root Ports
are shown under sysfs.
bdf  - select requester_id of target EP device.
trig_len - set trigger condition for starting event statistics.
trigger_mode - set trigger mode. 0 means starting to statistic when
bigger than trigger condition, and 1 means smaller.
thr_len  - set threshold for statistics.
thr_mode - set threshold mode. 0 means count when bigger than
threshold, and 1 means smaller.

Signed-off-by: Qi Liu 
---
  MAINTAINERS|6 +
  drivers/perf/Kconfig   |2 +
  drivers/perf/Makefile  |1 +
  drivers/perf/pci/Kconfig   |   16 +
  drivers/perf/pci/Makefile  |2 +
  drivers/perf/pci/hisilicon/Makefile|5 +
  drivers/perf/pci/hisilicon/hisi_pcie_pmu.c | 1011 
  include/linux/cpuhotplug.h |1 +
  8 files changed, 1044 insertions(+)
  create mode 100644 drivers/perf/pci/Kconfig
  create mode 100644 drivers/perf/pci/Makefile
  create mode 100644 drivers/perf/pci/hisilicon/Makefile
  create mode 100644 drivers/perf/pci/hisilicon/hisi_pcie_pmu.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 3353de0..46c7861 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8023,6 +8023,12 @@ W:   http://www.hisilicon.com
  F:Documentation/admin-guide/perf/hisi-pmu.rst
  F:drivers/perf/hisilicon
  
+HISILICON PCIE PMU DRIVER

+M: Qi Liu 
+S: Maintained
+F: Documentation/admin-guide/perf/hisi-pcie-pmu.rst


nit: this does not exist yet...


+F: drivers/perf/pci/hisilicon/hisi_pcie_pmu.c
+
  HISILICON QM AND ZIP Controller DRIVER
  M:Zhou Wang 
  L:linux-cry...@vger.kernel.org
diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 3075cf1..99d4760 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -139,4 +139,6 @@ config ARM_DMC620_PMU
  
  source "drivers/perf/hisilicon/Kconfig"
  
+source "drivers/perf/pci/Kconfig"

+
  endmenu
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index 5260b11..1208c08 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -14,3 +14,4 @@ obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o
  obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
  obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
  obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o
+obj-y += pci/
diff --git a/drivers/perf/pci/Kconfig b/drivers/perf/pci/Kconfig
new file mode 100644
index 000..a119486
--- /dev/null
+++ b/drivers/perf/pci/Kconfig
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# PCIe Performance Monitor Drivers
+#
+menu "PCIe Performance Monitor"
+
+config HISI_PCIE_PMU
+   tristate "HiSilicon PCIE PERF PMU"
+   depends on ARM64 && PCI && HISI_PMU


What from HISI_PMU is needed? I couldn't find anything here


+   help
+ Provide support for HiSilicon PCIe performance monitoring unit (PMU)
+ RCiEP devices.
+ Adds the PCIe PMU into perf events system for monitoring latency,
+ bandwidth etc.
+
+endmenu
diff --git a/drivers/perf/pci/Makefile b/drivers/perf/pci/Makefile
new file mode 100644
index 000..a56b1a9
--- /dev/null
+++ b/drivers/perf/pci/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y += hisilicon/
diff --git a/drivers/perf/pci/hisilicon/Makefile 
b/drivers/perf/pci/hisilicon/Makefile
new file mode 100644
index 000..6f99f52
--- /dev/null
+++ b/drivers/perf/pci/hisilicon/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+ccflags-y :=  -I $(srctree)/drivers/perf/hisilicon
+
+obj-$(CONFIG_HISI_PCIE_PMU) += hisi_pcie_pmu.o
diff --git a/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c 
b/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c
new file mode 100644
index 000..ac0e444
--- /dev/null
+++ b/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c
@@ -0,0 +1,1011 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * This driver adds support for PCIe PMU RCiEP device. Related
+ * perf events are bandwidth, bandwidth utilization, latency
+ * etc.
+ *
+ * Copyright (C) 2021 HiSilicon Limited
+ * Author: Qi Liu
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "hisi_uncore_pmu.h"
+
+/* Define registers */
+#define HISI_PCIE_GLOBAL_CTRL  0x00
+#define HISI_PCIE_EVENT_CTRL   0x010
+#define HISI_PCIE_CNT

Re: [PATCH 1/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU

2021-04-08 Thread John Garry


On 08/04/2021 10:01, Jonathan Cameron wrote:

On Wed, 7 Apr 2021 21:40:05 +0100
Will Deacon  wrote:


On Wed, Apr 07, 2021 at 05:49:02PM +0800, Qi Liu wrote:

PCIe PMU Root Complex Integrated End Point(RCiEP) device is supported
to sample bandwidth, latency, buffer occupation etc.

Each PMU RCiEP device monitors multiple root ports, and each RCiEP is
registered as a pmu in /sys/bus/event_source/devices, so users can
select target PMU, and use filter to do further sets.

Filtering options contains:
event- select the event.
subevent - select the subevent.
port - select target root ports. Information of root ports
are shown under sysfs.
bdf   - select requester_id of target EP device.
trig_len - set trigger condition for starting event statistics.
trigger_mode - set trigger mode. 0 means starting to statistic when
bigger than trigger condition, and 1 means smaller.
thr_len  - set threshold for statistics.
thr_mode - set threshold mode. 0 means count when bigger than
threshold, and 1 means smaller.

Reviewed-by: Jonathan Cameron 


Do you have a link to this review, please?


Internal review, so drop the tag.

Jonathan


Hi Will,

Are you implying that you would rather that any review for these drivers 
is done in public on the lists?


Cheers,
John

[PATCH v3 6/6] perf vendor events arm64: Add Hisi hip08 L3 metrics

Add L3 metrics.

Signed-off-by: John Garry 
Reviewed-by: Kajol Jain 
---
 .../arch/arm64/hisilicon/hip08/metrics.json   | 161 ++
 1 file changed, 161 insertions(+)

diff --git a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json 
b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
index dda898d23c2d..dda8e59149d2 100644
--- a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
+++ b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
@@ -69,4 +69,165 @@
 "MetricGroup": "TopDownL2",
 "MetricName": "memory_bound"
 },
+{
+"MetricExpr": "(((L2I_TLB - L2I_TLB_REFILL) * 15) + (L2I_TLB_REFILL * 
100)) / CPU_CYCLES",
+"PublicDescription": "Idle by itlb miss L3 topdown metric",
+"BriefDescription": "Idle by itlb miss L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "idle_by_itlb_miss"
+},
+{
+"MetricExpr": "(((L2I_CACHE - L2I_CACHE_REFILL) * 15) + 
(L2I_CACHE_REFILL * 100)) / CPU_CYCLES",
+"PublicDescription": "Idle by icache miss L3 topdown metric",
+"BriefDescription": "Idle by icache miss L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "idle_by_icache_miss"
+},
+{
+"MetricExpr": "(BR_MIS_PRED * 5) / CPU_CYCLES",
+"PublicDescription": "BP misp flush L3 topdown metric",
+"BriefDescription": "BP misp flush L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "bp_misp_flush"
+},
+{
+"MetricExpr": "(armv8_pmuv3_0@event\\=0x2013@ * 5) / CPU_CYCLES",
+"PublicDescription": "OOO flush L3 topdown metric",
+"BriefDescription": "OOO flush L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "ooo_flush"
+},
+{
+"MetricExpr": "(armv8_pmuv3_0@event\\=0x1001@ * 5) / CPU_CYCLES",
+"PublicDescription": "Static predictor flush L3 topdown metric",
+"BriefDescription": "Static predictor flush L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "sp_flush"
+},
+{
+"MetricExpr": "armv8_pmuv3_0@event\\=0x1010@ / BR_MIS_PRED",
+"PublicDescription": "Indirect branch L3 topdown metric",
+"BriefDescription": "Indirect branch L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "indirect_branch"
+},
+{
+"MetricExpr": "(armv8_pmuv3_0@event\\=0x1014@ + 
armv8_pmuv3_0@event\\=0x1018@) / BR_MIS_PRED",
+"PublicDescription": "Push branch L3 topdown metric",
+"BriefDescription": "Push branch L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "push_branch"
+},
+{
+"MetricExpr": "armv8_pmuv3_0@event\\=0x100c@ / BR_MIS_PRED",
+"PublicDescription": "Pop branch L3 topdown metric",
+"BriefDescription": "Pop branch L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "pop_branch"
+},
+{
+"MetricExpr": "(BR_MIS_PRED - armv8_pmuv3_0@event\\=0x1010@ - 
armv8_pmuv3_0@event\\=0x1014@ - armv8_pmuv3_0@event\\=0x1018@ - 
armv8_pmuv3_0@event\\=0x100c@) / BR_MIS_PRED",
+"PublicDescription": "Other branch L3 topdown metric",
+"BriefDescription": "Other branch L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "other_branch"
+},
+{
+"MetricExpr": "armv8_pmuv3_0@event\\=0x2012@ / 
armv8_pmuv3_0@event\\=0x2013@",
+"PublicDescription": "Nuke flush L3 topdown metric",
+"BriefDescription": "Nuke flush L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "nuke_flush"
+},
+{
+"MetricExpr": "1 - nuke_flush",
+"PublicDescription": "Other flush L3 topdown metric",
+"BriefDescription": &q

[PATCH v3 0/6] perf arm64 metricgroup support

This series contains support to get basic metricgroups working for
arm64 CPUs.

Initial support is added for HiSilicon hip08 platform.

Some sample usage on Huawei D06 board:

 $ ./perf list metric

List of pre-defined events (to be used in -e): 

Metrics: 

  bp_misp_flush
   [BP misp flush L3 topdown metric]
  branch_mispredicts
   [Branch mispredicts L2 topdown metric]
  core_bound
   [Core bound L2 topdown metric]
  divider
   [Divider L3 topdown metric]
  exe_ports_util
   [EXE ports util L3 topdown metric]
  fetch_bandwidth_bound
   [Fetch bandwidth bound L2 topdown metric]
  fetch_latency_bound
   [Fetch latency bound L2 topdown metric]
  fsu_stall
   [FSU stall L3 topdown metric]
  idle_by_icache_miss

$ sudo ./perf stat -v -M core_bound sleep 1
Using CPUID 0x480fd010
metric expr (exe_stall_cycle - (mem_stall_anyload + 
armv8_pmuv3_0@event\=0x7005@)) / cpu_cycles for core_bound
found event cpu_cycles
found event armv8_pmuv3_0/event=0x7005/
found event exe_stall_cycle
found event mem_stall_anyload
adding {cpu_cycles -> armv8_pmuv3_0/event=0x7001/
mem_stall_anyload -> armv8_pmuv3_0/event=0x7004/
Control descriptor is not initialized
cpu_cycles: 989433 385050 385050
armv8_pmuv3_0/event=0x7005/: 19207 385050 385050
exe_stall_cycle: 900825 385050 385050
mem_stall_anyload: 253516 385050 385050

Performance counter stats for 'sleep':

989,433  cpu_cycles  # 0.63 core_bound
  19,207  armv8_pmuv3_0/event=0x7005/
 900,825  exe_stall_cycle
 253,516  mem_stall_anyload

   0.000805809 seconds time elapsed

   0.000875000 seconds user
   0.0 seconds sys
   
perf stat --topdown is not supported, as this requires the CPU PMU to
expose (alias) events for the TopDown L1 metrics from sysfs, which arm 
does not do. To get that to work, we probably need to make perf use the
pmu-events cpumap to learn about those alias events.

Metric reuse support is added for pmu-events parse metric testcase.
This had been broken on power9 recently:
https://lore.kernel.org/lkml/20210324015418.gc8...@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com/
 

Differences to v2:
- Add TB and RB tags (Thanks!)
- Rename metricgroup__find_metric() from metricgroup_find_metric()
- Change resolve_metric_simple() to rescan after any insert

Differences to v1:
- Add pmu_events_map__find() as arm64-specific function
- Fix metric reuse for pmu-events parse metric testcase 

John Garry (6):
  perf metricgroup: Make find_metric() public with name change
  perf test: Handle metric reuse in pmu-events parsing test
  perf pmu: Add pmu_events_map__find()
  perf vendor events arm64: Add Hisi hip08 L1 metrics
  perf vendor events arm64: Add Hisi hip08 L2 metrics
  perf vendor events arm64: Add Hisi hip08 L3 metrics

 tools/perf/arch/arm64/util/Build  |   1 +
 tools/perf/arch/arm64/util/pmu.c  |  25 ++
 .../arch/arm64/hisilicon/hip08/metrics.json   | 233 ++
 tools/perf/tests/pmu-events.c |  83 ++-
 tools/perf/util/metricgroup.c |  12 +-
 tools/perf/util/metricgroup.h |   3 +-
 tools/perf/util/pmu.c |   5 +
 tools/perf/util/pmu.h |   1 +
 tools/perf/util/s390-sample-raw.c |   4 +-
 9 files changed, 356 insertions(+), 11 deletions(-)
 create mode 100644 tools/perf/arch/arm64/util/pmu.c
 create mode 100644 
tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json

-- 
2.26.2

[PATCH v3 5/6] perf vendor events arm64: Add Hisi hip08 L2 metrics

Add L2 metrics.

Signed-off-by: John Garry 
Reviewed-by: Kajol Jain 
---
 .../arch/arm64/hisilicon/hip08/metrics.json   | 42 +++
 1 file changed, 42 insertions(+)

diff --git a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json 
b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
index dc5ff3051639..dda898d23c2d 100644
--- a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
+++ b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
@@ -27,4 +27,46 @@
 "MetricGroup": "TopDownL1",
 "MetricName": "backend_bound"
 },
+{
+"MetricExpr": "armv8_pmuv3_0@event\\=0x201d@ / CPU_CYCLES",
+"PublicDescription": "Fetch latency bound L2 topdown metric",
+"BriefDescription": "Fetch latency bound L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "fetch_latency_bound"
+},
+{
+"MetricExpr": "frontend_bound - fetch_latency_bound",
+"PublicDescription": "Fetch bandwidth bound L2 topdown metric",
+"BriefDescription": "Fetch bandwidth bound L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "fetch_bandwidth_bound"
+},
+{
+"MetricExpr": "(bad_speculation * BR_MIS_PRED) / (BR_MIS_PRED + 
armv8_pmuv3_0@event\\=0x2013@)",
+"PublicDescription": "Branch mispredicts L2 topdown metric",
+"BriefDescription": "Branch mispredicts L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "branch_mispredicts"
+},
+{
+"MetricExpr": "bad_speculation - branch_mispredicts",
+"PublicDescription": "Machine clears L2 topdown metric",
+"BriefDescription": "Machine clears L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "machine_clears"
+},
+{
+"MetricExpr": "(EXE_STALL_CYCLE - (MEM_STALL_ANYLOAD + 
armv8_pmuv3_0@event\\=0x7005@)) / CPU_CYCLES",
+"PublicDescription": "Core bound L2 topdown metric",
+"BriefDescription": "Core bound L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "core_bound"
+},
+{
+"MetricExpr": "(MEM_STALL_ANYLOAD + armv8_pmuv3_0@event\\=0x7005@) / 
CPU_CYCLES",
+"PublicDescription": "Memory bound L2 topdown metric",
+"BriefDescription": "Memory bound L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "memory_bound"
+},
 ]
-- 
2.26.2

[PATCH v3 4/6] perf vendor events arm64: Add Hisi hip08 L1 metrics

Add L1 metrics. Formula is as consistent as possible with MAN pages
description for these metrics.

Signed-off-by: John Garry 
Reviewed-by: Kajol Jain 
---
 .../arch/arm64/hisilicon/hip08/metrics.json   | 30 +++
 1 file changed, 30 insertions(+)
 create mode 100644 
tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json

diff --git a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json 
b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
new file mode 100644
index ..dc5ff3051639
--- /dev/null
+++ b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
@@ -0,0 +1,30 @@
+[
+{
+"MetricExpr": "FETCH_BUBBLE / (4 * CPU_CYCLES)",
+"PublicDescription": "Frontend bound L1 topdown metric",
+"BriefDescription": "Frontend bound L1 topdown metric",
+"MetricGroup": "TopDownL1",
+"MetricName": "frontend_bound"
+},
+{
+"MetricExpr": "(INST_SPEC - INST_RETIRED) / (4 * CPU_CYCLES)",
+"PublicDescription": "Bad Speculation L1 topdown metric",
+"BriefDescription": "Bad Speculation L1 topdown metric",
+"MetricGroup": "TopDownL1",
+"MetricName": "bad_speculation"
+},
+{
+"MetricExpr": "INST_RETIRED / (CPU_CYCLES * 4)",
+"PublicDescription": "Retiring L1 topdown metric",
+"BriefDescription": "Retiring L1 topdown metric",
+"MetricGroup": "TopDownL1",
+"MetricName": "retiring"
+},
+{
+"MetricExpr": "1 - (frontend_bound + bad_speculation + retiring)",
+"PublicDescription": "Backend Bound L1 topdown metric",
+"BriefDescription": "Backend Bound L1 topdown metric",
+"MetricGroup": "TopDownL1",
+"MetricName": "backend_bound"
+},
+]
-- 
2.26.2

[PATCH v3 3/6] perf pmu: Add pmu_events_map__find()

Add a function to find the common PMU map for the system.

For arm64, a special variant is added. This is because arm64 supports
heterogeneous CPU systems. As such, it cannot be guaranteed that the cpumap
is same for all CPUs. So in case of heterogeneous systems, don't return
a cpumap.

Tested-by: Paul A. Clarke 
Signed-off-by: John Garry 
Reviewed-by: Kajol Jain 
---
 tools/perf/arch/arm64/util/Build  |  1 +
 tools/perf/arch/arm64/util/pmu.c  | 25 +
 tools/perf/tests/pmu-events.c |  2 +-
 tools/perf/util/metricgroup.c |  7 +++
 tools/perf/util/pmu.c |  5 +
 tools/perf/util/pmu.h |  1 +
 tools/perf/util/s390-sample-raw.c |  4 +---
 7 files changed, 37 insertions(+), 8 deletions(-)
 create mode 100644 tools/perf/arch/arm64/util/pmu.c

diff --git a/tools/perf/arch/arm64/util/Build b/tools/perf/arch/arm64/util/Build
index ead2f2275eee..9fcb4e68add9 100644
--- a/tools/perf/arch/arm64/util/Build
+++ b/tools/perf/arch/arm64/util/Build
@@ -2,6 +2,7 @@ perf-y += header.o
 perf-y += machine.o
 perf-y += perf_regs.o
 perf-y += tsc.o
+perf-y += pmu.o
 perf-y += kvm-stat.o
 perf-$(CONFIG_DWARF) += dwarf-regs.o
 perf-$(CONFIG_LOCAL_LIBUNWIND) += unwind-libunwind.o
diff --git a/tools/perf/arch/arm64/util/pmu.c b/tools/perf/arch/arm64/util/pmu.c
new file mode 100644
index ..d3259d61ca75
--- /dev/null
+++ b/tools/perf/arch/arm64/util/pmu.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "../../util/cpumap.h"
+#include "../../util/pmu.h"
+
+struct pmu_events_map *pmu_events_map__find(void)
+{
+   struct perf_pmu *pmu = NULL;
+
+   while ((pmu = perf_pmu__scan(pmu))) {
+   if (!is_pmu_core(pmu->name))
+   continue;
+
+   /*
+* The cpumap should cover all CPUs. Otherwise, some CPUs may
+* not support some events or have different event IDs.
+*/
+   if (pmu->cpus->nr != cpu__max_cpu())
+   return NULL;
+
+   return perf_pmu__find_map(pmu);
+   }
+
+   return NULL;
+}
diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c
index cb5b25d2fb27..b8aff8fb50d8 100644
--- a/tools/perf/tests/pmu-events.c
+++ b/tools/perf/tests/pmu-events.c
@@ -539,7 +539,7 @@ static int resolve_metric_simple(struct expr_parse_ctx 
*pctx,
 
 static int test_parsing(void)
 {
-   struct pmu_events_map *cpus_map = perf_pmu__find_map(NULL);
+   struct pmu_events_map *cpus_map = pmu_events_map__find();
struct pmu_events_map *map;
struct pmu_event *pe;
int i, j, k;
diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index 37fe34a5d93d..8336dd8e8098 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -618,7 +618,7 @@ static int metricgroup__print_sys_event_iter(struct 
pmu_event *pe, void *data)
 void metricgroup__print(bool metrics, bool metricgroups, char *filter,
bool raw, bool details)
 {
-   struct pmu_events_map *map = perf_pmu__find_map(NULL);
+   struct pmu_events_map *map = pmu_events_map__find();
struct pmu_event *pe;
int i;
struct rblist groups;
@@ -1254,8 +1254,7 @@ int metricgroup__parse_groups(const struct option *opt,
  struct rblist *metric_events)
 {
struct evlist *perf_evlist = *(struct evlist **)opt->value;
-   struct pmu_events_map *map = perf_pmu__find_map(NULL);
-
+   struct pmu_events_map *map = pmu_events_map__find();
 
return parse_groups(perf_evlist, str, metric_no_group,
metric_no_merge, NULL, metric_events, map);
@@ -1274,7 +1273,7 @@ int metricgroup__parse_groups_test(struct evlist *evlist,
 
 bool metricgroup__has_metric(const char *metric)
 {
-   struct pmu_events_map *map = perf_pmu__find_map(NULL);
+   struct pmu_events_map *map = pmu_events_map__find();
struct pmu_event *pe;
int i;
 
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 88da5cf6aee8..419ef6c4fbc0 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -717,6 +717,11 @@ struct pmu_events_map *perf_pmu__find_map(struct perf_pmu 
*pmu)
return map;
 }
 
+struct pmu_events_map *__weak pmu_events_map__find(void)
+{
+   return perf_pmu__find_map(NULL);
+}
+
 bool pmu_uncore_alias_match(const char *pmu_name, const char *name)
 {
char *tmp = NULL, *tok, *str;
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 8164388478c6..012317229488 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -114,6 +114,7 @@ void pmu_add_cpu_aliases_map(struct list_head *head, struct 
perf_pmu *pmu,
 struct pmu_events_map *map);
 
 struct pmu_events_map *perf_pmu__find_map(struct perf_pmu *pmu);
+struct pmu_events_map *pmu_events_map__find(void);
 bool p

[PATCH v3 1/6] perf metricgroup: Make find_metric() public with name change

Function find_metric() is required for the metric processing in the
pmu-events testcase, so make it public. Also change the name to include
"metricgroup".

Tested-by: Paul A. Clarke 
Signed-off-by: John Garry 
Reviewed-by: Kajol Jain 
---
 tools/perf/util/metricgroup.c | 5 +++--
 tools/perf/util/metricgroup.h | 3 ++-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index 6acb44ad439b..37fe34a5d93d 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -900,7 +900,8 @@ static int __add_metric(struct list_head *metric_list,
(match_metric(__pe->metric_group, __metric) ||  \
 match_metric(__pe->metric_name, __metric)))
 
-static struct pmu_event *find_metric(const char *metric, struct pmu_events_map 
*map)
+struct pmu_event *metricgroup__find_metric(const char *metric,
+  struct pmu_events_map *map)
 {
struct pmu_event *pe;
int i;
@@ -985,7 +986,7 @@ static int __resolve_metric(struct metric *m,
struct expr_id *parent;
struct pmu_event *pe;
 
-   pe = find_metric(cur->key, map);
+   pe = metricgroup__find_metric(cur->key, map);
if (!pe)
continue;
 
diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h
index ed1b9392e624..1424e7c1af77 100644
--- a/tools/perf/util/metricgroup.h
+++ b/tools/perf/util/metricgroup.h
@@ -44,7 +44,8 @@ int metricgroup__parse_groups(const struct option *opt,
  bool metric_no_group,
  bool metric_no_merge,
  struct rblist *metric_events);
-
+struct pmu_event *metricgroup__find_metric(const char *metric,
+  struct pmu_events_map *map);
 int metricgroup__parse_groups_test(struct evlist *evlist,
   struct pmu_events_map *map,
   const char *str,
-- 
2.26.2

[PATCH v3 2/6] perf test: Handle metric reuse in pmu-events parsing test

The pmu-events parsing test does not handle metric reuse at all.

Introduce some simple handling to resolve metrics who reference other
metrics.

Tested-by: Paul A. Clarke 
Signed-off-by: John Garry 
Reviewed-by: Kajol Jain 
---
 tools/perf/tests/pmu-events.c | 81 +++
 1 file changed, 81 insertions(+)

diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c
index 0ca6a5a53523..cb5b25d2fb27 100644
--- a/tools/perf/tests/pmu-events.c
+++ b/tools/perf/tests/pmu-events.c
@@ -12,6 +12,7 @@
 #include "util/evlist.h"
 #include "util/expr.h"
 #include "util/parse-events.h"
+#include "metricgroup.h"
 
 struct perf_pmu_test_event {
/* used for matching against events from generated pmu-events.c */
@@ -471,6 +472,71 @@ static void expr_failure(const char *msg,
pr_debug("On expression %s\n", pe->metric_expr);
 }
 
+struct metric {
+   struct list_head list;
+   struct metric_ref metric_ref;
+};
+
+static int resolve_metric_simple(struct expr_parse_ctx *pctx,
+struct list_head *compound_list,
+struct pmu_events_map *map,
+const char *metric_name)
+{
+   struct hashmap_entry *cur, *cur_tmp;
+   struct metric *metric, *tmp;
+   size_t bkt;
+   bool all;
+   int rc;
+
+   do {
+   all = true;
+   hashmap__for_each_entry_safe((>ids), cur, cur_tmp, bkt) {
+   struct metric_ref *ref;
+   struct pmu_event *pe;
+
+   pe = metricgroup__find_metric(cur->key, map);
+   if (!pe)
+   continue;
+
+   if (!strcmp(metric_name, (char *)cur->key)) {
+   pr_warning("Recursion detected for metric 
%s\n", metric_name);
+   rc = -1;
+   goto out_err;
+   }
+
+   all = false;
+
+   /* The metric key itself needs to go out.. */
+   expr__del_id(pctx, cur->key);
+
+   metric = malloc(sizeof(*metric));
+   if (!metric) {
+   rc = -ENOMEM;
+   goto out_err;
+   }
+
+   ref = >metric_ref;
+   ref->metric_name = pe->metric_name;
+   ref->metric_expr = pe->metric_expr;
+   list_add_tail(>list, compound_list);
+
+   rc = expr__find_other(pe->metric_expr, NULL, pctx, 0);
+   if (rc)
+   goto out_err;
+   break; /* The hashmap has been modified, so restart */
+   }
+   } while (!all);
+
+   return 0;
+
+out_err:
+   list_for_each_entry_safe(metric, tmp, compound_list, list)
+   free(metric);
+
+   return rc;
+
+}
+
 static int test_parsing(void)
 {
struct pmu_events_map *cpus_map = perf_pmu__find_map(NULL);
@@ -488,7 +554,9 @@ static int test_parsing(void)
break;
j = 0;
for (;;) {
+   struct metric *metric, *tmp;
struct hashmap_entry *cur;
+   LIST_HEAD(compound_list);
size_t bkt;
 
pe = >table[j++];
@@ -504,6 +572,13 @@ static int test_parsing(void)
continue;
}
 
+   if (resolve_metric_simple(, _list, map,
+ pe->metric_name)) {
+   expr_failure("Could not resolve metrics", map, 
pe);
+   ret++;
+   goto exit; /* Don't tolerate errors due to 
severity */
+   }
+
/*
 * Add all ids with a made up value. The value may
 * trigger divide by zero when subtracted and so try to
@@ -519,6 +594,11 @@ static int test_parsing(void)
ret++;
}
 
+   list_for_each_entry_safe(metric, tmp, _list, 
list) {
+   expr__add_ref(, >metric_ref);
+   free(metric);
+   }
+
if (expr__parse(, , pe->metric_expr, 0)) {
expr_failure("Parse failed", map, pe);
ret++;
@@ -527,6 +607,7 @@ static int test_parsing(void)
}
}
/* TODO: fail when not ok */
+exit:
return ret == 0 ? TEST_OK : TEST_SKIP;
 }
 
-- 
2.26.2

Re: [PATCH 0/3] iommu/iova: Add CPU hotplug handler to flush rcaches to core code


On 07/04/2021 09:04, Joerg Roedel wrote:

On Mon, Mar 01, 2021 at 08:12:18PM +0800, John Garry wrote:

The Intel IOMMU driver supports flushing the per-CPU rcaches when a CPU is
offlined.

Let's move it to core code, so everyone can take advantage.

Also correct a code comment.

Based on v5.12-rc1. Tested on arm64 only.

John Garry (3):
   iova: Add CPU hotplug handler to flush rcaches
   iommu/vt-d: Remove IOVA domain rcache flushing for CPU offlining
   iova: Correct comment for free_cpu_cached_iovas()

  drivers/iommu/intel/iommu.c | 31 ---
  drivers/iommu/iova.c| 32 ++--
  include/linux/cpuhotplug.h  |  2 +-
  include/linux/iova.h|  1 +
  4 files changed, 32 insertions(+), 34 deletions(-)


Applied, thanks.

.



Thanks, but there was a v2 on this series. Not sure which you applied.

https://lore.kernel.org/linux-iommu/9aad6e94-ecb7-ca34-7f7d-3df6e43e9...@huawei.com/T/#mbea81468782c75fa84744ad7a7801831a4c952e9

RE: [syzbot] WARNING: suspicious RCU usage in tcp_bpf_update_proto

2021-04-06 Thread John Fastabend

syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:514e1150 net: x25: Queue received packets in the drivers i..
> git tree:   net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=112a8831d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=7eff0f22b8563a5f
> dashboard link: https://syzkaller.appspot.com/bug?extid=320a3bc8d80f478c37e4
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1532d711d0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15f44c5ed0
> 
> The issue was bisected to:
> 
> commit 4dfe6bd94959222e18d512bdf15f6bf9edb9c27c
> Author: Rustam Kovhaev 
> Date:   Wed Feb 24 20:00:30 2021 +
> 
> ntfs: check for valid standard information attribute
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=16207a81d0
> final oops: https://syzkaller.appspot.com/x/report.txt?x=15207a81d0
> console output: https://syzkaller.appspot.com/x/log.txt?x=11207a81d0
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+320a3bc8d80f478c3...@syzkaller.appspotmail.com
> Fixes: 4dfe6bd94959 ("ntfs: check for valid standard information attribute")
> 
> =
> WARNING: suspicious RCU usage
> 5.12.0-rc4-syzkaller #0 Not tainted
> -
> include/linux/skmsg.h:286 suspicious rcu_dereference_check() usage!
> 
> other info that might help us debug this:
> 
> 
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by syz-executor383/8454:
>  #0: 888013a99b48 (clock-AF_INET){++..}-{2:2}, at: 
> sk_psock_drop+0x2c/0x460 net/core/skmsg.c:788
> 
> stack backtrace:
> CPU: 1 PID: 8454 Comm: syz-executor383 Not tainted 5.12.0-rc4-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x141/0x1d7 lib/dump_stack.c:120
>  sk_psock include/linux/skmsg.h:286 [inline]
>  tcp_bpf_update_proto+0x530/0x5f0 net/ipv4/tcp_bpf.c:504
>  sk_psock_restore_proto include/linux/skmsg.h:408 [inline]
>  sk_psock_drop+0xdf/0x460 net/core/skmsg.c:789
>  sk_psock_put include/linux/skmsg.h:446 [inline]
>  tcp_bpf_recvmsg+0x42d/0x480 net/ipv4/tcp_bpf.c:208


The bisection is bogus, but we will get a fix for this ASAP. The real commit
is,

commit 8a59f9d1e3d4340659fdfee8879dc09a6f2546e1
Author: Cong Wang 
Date:   Tue Mar 30 19:32:31 2021 -0700

sock: Introduce sk->sk_prot->psock_update_sk_prot()

Thanks,
John

Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator


So then we have the issue of how to dynamically increase this rcache
threshold. The problem is that we may have many devices associated with
the same domain. So, in theory, we can't assume that when we increase
the threshold that some other device will try to fast free an IOVA 
which

was allocated prior to the increase and was not rounded up.

I'm very open to better (or less bad) suggestions on how to do this ...

...but yes, regardless of exactly where it happens, rounding up or not
is the problem for rcaches in general. I've said several times that my
preferred approach is to not change it that dynamically at all, but
instead treat it more like we treat the default domain type.



Can you remind me of that idea? I don't remember you mentioning using 
default domain handling as a reference in any context.




Hi Robin,

Sorry if the phrasing was unclear there - the allusion to default 
domains is new, it just occurred to me that what we do there is in fact 
fairly close to what I've suggested previously for this. In that case, 
we have a global policy set by the command line, which *can* be 
overridden per-domain via sysfs at runtime, provided the user is willing 
to tear the whole thing down. Using a similar approach here would give a 
fair degree of flexibility but still mean that changes never have to be 
made dynamically to a live domain.


So are you saying that we can handle it similar to how we now can handle 
changing default domain for an IOMMU group via sysfs? If so, that just 
is not practical here. Reason being that this particular DMA engine 
provides the block device giving / mount point, so if we unbind the 
driver, we lose / mount point.


And I am not sure if the end user would even know how to set such a 
tunable. Or, in this case, why the end user would not want the optimized 
range configured always.


I'd still rather if the device driver could provide info which can be 
used to configure this before or during probing.


Cheers,
John

Re: [bug report] Memory leak from acpi_ev_install_space_handler()


On 06/04/2021 17:40, Rafael J. Wysocki wrote:

On Tue, Apr 6, 2021 at 5:51 PM John Garry  wrote:


Hi guys,

On next-20210406, I enabled CONFIG_DEBUG_KMEMLEAK and
CONFIG_DEBUG_TEST_DRIVER_REMOVE for my arm64 system, and see this:




Hi Rafael,


Why exactly do you think that acpi_ev_install_space_handler() leaks memory?



I don't think that acpi_ev_install_space_handler() itself leaks memory, 
but it seems that there is something missing in the code which is meant 
to undo/clean up after that on the uninstall path - I did make the point 
in writing "memory leak from", but maybe still not worded clearly.


Anyway, I have not analyzed the problem fully - I'm just reporting.

I don't mind looking further if requested.

Thanks,
John


root@debian:/home/john# more /sys/kernel/debug/kmemleak
unreferenced object 0x202803c11f00 (size 128):
comm "swapper/0", pid 1, jiffies 4294894325 (age 337.524s)
hex dump (first 32 bytes):
00 00 00 00 02 00 00 00 08 1f c1 03 28 20 ff ff( ..
08 1f c1 03 28 20 ff ff 00 00 00 00 00 00 00 00( ..
backtrace:
[<670a0938>] slab_post_alloc_hook+0x9c/0x2f8
[<a3f47b39>] kmem_cache_alloc+0x198/0x2a8
[<2bdba864>] acpi_os_create_semaphore+0x54/0xe0
[<bcd513fe>] acpi_ev_install_space_handler+0x24c/0x300
[<02e116e2>] acpi_install_address_space_handler+0x64/0xb0
[<ba00abc5>] i2c_acpi_install_space_handler+0xd4/0x138
[<8da42058>] i2c_register_adapter+0x368/0x910
[<c03f7142>] i2c_add_adapter+0x9c/0x100
[<00ba2fcf>] i2c_add_numbered_adapter+0x44/0x58
[<7df22d67>] i2c_dw_probe_master+0x68c/0x900
[<682dfc98>] dw_i2c_plat_probe+0x460/0x640
[<ad2dd3ee>] platform_probe+0x8c/0x108
[<dd183e3f>] really_probe+0x190/0x670
[<66017341>] driver_probe_device+0x8c/0xf8
[<c441e843>] device_driver_attach+0x9c/0xa8
[<f91dc709>] __driver_attach+0x88/0x138
unreferenced object 0x00280452c100 (size 128):
comm "swapper/0", pid 1, jiffies 4294894558 (age 336.604s)
hex dump (first 32 bytes):
00 00 00 00 02 00 00 00 08 c1 52 04 28 00 ff ff..R.(...
08 c1 52 04 28 00 ff ff 00 00 00 00 00 00 00 00..R.(...
backtrace:
[<670a0938>] slab_post_alloc_hook+0x9c/0x2f8
[<a3f47b39>] kmem_cache_alloc+0x198/0x2a8
[<2bdba864>] acpi_os_create_semaphore+0x54/0xe0
[<bcd513fe>] acpi_ev_install_space_handler+0x24c/0x300
[<02e116e2>] acpi_install_address_space_handler+0x64/0xb0
[<988d4f61>] acpi_gpiochip_add+0x20c/0x4a0
[<73d4faab>] gpiochip_add_data_with_key+0xd10/0x1680
[<1d50b98a>] devm_gpiochip_add_data_with_key+0x30/0x78
[<fc3e7eaf>] dwapb_gpio_probe+0x828/0xb28
[<ad2dd3ee>] platform_probe+0x8c/0x108
[<dd183e3f>] really_probe+0x190/0x670
[<66017341>] driver_probe_device+0x8c/0xf8
[<c441e843>] device_driver_attach+0x9c/0xa8
[<f91dc709>] __driver_attach+0x88/0x138
[<d330caed>] bus_for_each_dev+0xec/0x160
[<eebc5f04>] driver_attach+0x34/0x48
root@debian:/home/john#

Thanks,
John

.

[bug report] Memory leak from acpi_ev_install_space_handler()


Hi guys,

On next-20210406, I enabled CONFIG_DEBUG_KMEMLEAK and
CONFIG_DEBUG_TEST_DRIVER_REMOVE for my arm64 system, and see this:

root@debian:/home/john# more /sys/kernel/debug/kmemleak
unreferenced object 0x202803c11f00 (size 128):
comm "swapper/0", pid 1, jiffies 4294894325 (age 337.524s)
hex dump (first 32 bytes):
00 00 00 00 02 00 00 00 08 1f c1 03 28 20 ff ff( ..
08 1f c1 03 28 20 ff ff 00 00 00 00 00 00 00 00( ..
backtrace:
[<670a0938>] slab_post_alloc_hook+0x9c/0x2f8
[<a3f47b39>] kmem_cache_alloc+0x198/0x2a8
[<2bdba864>] acpi_os_create_semaphore+0x54/0xe0
[<bcd513fe>] acpi_ev_install_space_handler+0x24c/0x300
[<02e116e2>] acpi_install_address_space_handler+0x64/0xb0
[<ba00abc5>] i2c_acpi_install_space_handler+0xd4/0x138
[<8da42058>] i2c_register_adapter+0x368/0x910
[<c03f7142>] i2c_add_adapter+0x9c/0x100
[<00ba2fcf>] i2c_add_numbered_adapter+0x44/0x58
[<7df22d67>] i2c_dw_probe_master+0x68c/0x900
[<682dfc98>] dw_i2c_plat_probe+0x460/0x640
[<ad2dd3ee>] platform_probe+0x8c/0x108
[<dd183e3f>] really_probe+0x190/0x670
[<66017341>] driver_probe_device+0x8c/0xf8
[<c441e843>] device_driver_attach+0x9c/0xa8
[<f91dc709>] __driver_attach+0x88/0x138
unreferenced object 0x00280452c100 (size 128):
comm "swapper/0", pid 1, jiffies 4294894558 (age 336.604s)
hex dump (first 32 bytes):
00 00 00 00 02 00 00 00 08 c1 52 04 28 00 ff ff..R.(...
08 c1 52 04 28 00 ff ff 00 00 00 00 00 00 00 00..R.(...
backtrace:
[<670a0938>] slab_post_alloc_hook+0x9c/0x2f8
[<a3f47b39>] kmem_cache_alloc+0x198/0x2a8
[<2bdba864>] acpi_os_create_semaphore+0x54/0xe0
[<bcd513fe>] acpi_ev_install_space_handler+0x24c/0x300
[<02e116e2>] acpi_install_address_space_handler+0x64/0xb0
[<988d4f61>] acpi_gpiochip_add+0x20c/0x4a0
[<73d4faab>] gpiochip_add_data_with_key+0xd10/0x1680
[<1d50b98a>] devm_gpiochip_add_data_with_key+0x30/0x78
[<fc3e7eaf>] dwapb_gpio_probe+0x828/0xb28
[<ad2dd3ee>] platform_probe+0x8c/0x108
[<dd183e3f>] really_probe+0x190/0x670
[<66017341>] driver_probe_device+0x8c/0xf8
[<c441e843>] device_driver_attach+0x9c/0xa8
[<f91dc709>] __driver_attach+0x88/0x138
[<d330caed>] bus_for_each_dev+0xec/0x160
[<eebc5f04>] driver_attach+0x34/0x48
root@debian:/home/john#

Thanks,
John

Re: [PATCH v2 2/6] perf test: Handle metric reuse in pmu-events parsing test


On 06/04/2021 14:34, Jiri Olsa wrote:


}

So once we evaluate a pmu_event in pctx->ids in @pe, @all is set false, and
we would loop again in the do-while loop, regardless of what
expr__find_other() does (apart from erroring), and so call
hashmap__for_each_entry_safe(>ids, ) again.

ah ok, so it finishes the hash iteration first and
then restarts it.. ok, I missed that, then it's fine
 >> This is really what is done in __resolve_metric() - indeed, I would 

use that

function directly, but it looks hard to extract that from metricgroup.c .

yea, it's another world;-)  it's better to keep it separated


ok, so I'll still add the break statement, as suggested.

I'll also wait to see if Ian or you have a strong feeling about the 
function naming in patch 1/6.


Thanks,
John

Re: [PATCH v2 2/6] perf test: Handle metric reuse in pmu-events parsing test


On 06/04/2021 13:55, Jiri Olsa wrote:

So expr__find_other() may add a new item to pctx->ids, and we always iterate
again, and try to lookup any pmu_events, *, above. If none exist, then we

hm, I don't see that.. so, what you do is:

hashmap__for_each_entry_safe((>ids) ) {

rc = expr__find_other(pe->metric_expr, NULL, pctx, 0);
}

and what I think we need to do is:

hashmap__for_each_entry_safe((>ids) ) {

rc = expr__find_other(pe->metric_expr, NULL, pctx, 0);

break;  
}

each time you resolve another metric, you need to restart
the pctx->ids iteration, because there will be new items,
and we are in the middle of it

Sure, but we will restart anyway.

hum, where? you call expr__find_other and continue to next
pctx->ids item


We have:

resolve_metric_simple()
{
bool all;

do {
all = true;

hashmap__for_each_entry_safe(>ids, ...) {

pe = metricgroup_find_metric(cur->key, map);
if (!pe)
continue;

...
all = false;

expr_del_id(pctx, cur->key);

...
rc = expr__find_other(pe->metric_expr, pctx);
if (rc)
goto out_err;
}

} while (!all);

}

So once we evaluate a pmu_event in pctx->ids in @pe, @all is set false, 
and we would loop again in the do-while loop, regardless of what 
expr__find_other() does (apart from erroring), and so call 
hashmap__for_each_entry_safe(>ids, ) again.


This is really what is done in __resolve_metric() - indeed, I would use 
that function directly, but it looks hard to extract that from 
metricgroup.c .


Thanks,
John




Regardless of this, I don't think what I am doing is safe, i.e. adding new
items in the middle of the iter, so I will change in the way you suggest.

it'll always add items in the middle of the iteration

Re: [PATCH v2 2/6] perf test: Handle metric reuse in pmu-events parsing test


On 06/04/2021 13:17, Jiri Olsa wrote:

+   ref = >metric_ref;
+   ref->metric_name = pe->metric_name;
+   ref->metric_expr = pe->metric_expr;
+   list_add_tail(>list, compound_list);
+
+   rc = expr__find_other(pe->metric_expr, NULL, pctx, 0);

Hi Jirka,


so this might add new items to pctx->ids, I think you need
to restart the iteration as we do it in __resolve_metric
otherwise you could miss some new keys

I thought that I was doing this. Indeed, this code is very much like
__resolve_metric();)

So expr__find_other() may add a new item to pctx->ids, and we always iterate
again, and try to lookup any pmu_events, *, above. If none exist, then we

hm, I don't see that.. so, what you do is:

hashmap__for_each_entry_safe((>ids) ) {

rc = expr__find_other(pe->metric_expr, NULL, pctx, 0);
}

and what I think we need to do is:

hashmap__for_each_entry_safe((>ids) ) {

rc = expr__find_other(pe->metric_expr, NULL, pctx, 0);

break;  
}

each time you resolve another metric, you need to restart
the pctx->ids iteration, because there will be new items,
and we are in the middle of it


Sure, but we will restart anyway.

Regardless of this, I don't think what I am doing is safe, i.e. adding 
new items in the middle of the iter, so I will change in the way you 
suggest.


Thanks,
John

Re: [PATCH v2 0/6] perf arm64 metricgroup support

On 30/03/2021 07:41, kajoljain wrote:

On 3/30/21 2:37 AM, Paul A. Clarke wrote:

On Fri, Mar 26, 2021 at 10:57:40AM +, John Garry wrote:

On 25/03/2021 20:39, Paul A. Clarke wrote:

On Thu, Mar 25, 2021 at 06:33:12PM +0800, John Garry wrote:

Metric reuse support is added for pmu-events parse metric testcase.
This had been broken on power9 recentlty:
https://lore.kernel.org/lkml/20210324015418.gc8...@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com/

Much better. Before:
--
$ perf test -v 10 2>&1 | grep -i error | wc -l
112
--
After:
--
$ perf test -v 10 2>&1 | grep -i error | wc -l
17
--

And these seem like different types of issues:
--
$ perf test -v 10 2>&1 | grep -i error
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help
'(null)'
Error string 'Cannot find PMU `nest_powerbus0_imc'. Missing kernel support?'
help '(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help
'(null)'
--

This looks suspicious.

Firstly, does /sys/bus/event_source/devices/nest_mcs01_imc (or others,
above) exist on your system? I guess not.

Secondly, checking Documentation/powerpc/imc.rst, we have examples of:
nest_mcs01/PM_MCS01_64B_R...

So is the PMU name correct in the metric file for nest_mcs01_imc? Looking at
the kernel driver file, arch/powerpc/perf/imc-pmu.c, it seems to be correct.
Not sure.

I ran with a newer kernel, and the above errors disappeared, replaced with
about 10 of:
--
Error string 'Cannot find PMU `hv_24x7'. Missing kernel support?' help '(null)'
--

...but I was running without a hypervisor, so I tried the same kernel on a
PowerVM-virtualized system and the "hv_24x7" messages went away, but the
"nest" messages returned. This may all be expected behavior... I confess
I haven't followed these new perf capabilities closely.

Hi Paul/John,
This is something expected. For nest-imc we need bare-metal system and for
hv-24x7 we need VM environment. Since you are checking this test in VM machine,
there nest events are not supported and hence we are getting this error.

Thanks,
Kajol Jain

Cool, so I hope that tested-by or similar can be provided :) [obviously
pending any changes that come from reviews]

Thanks

It's extremely likely that none of these errors has anything to do with your
changes. :-

Re: [PATCH v2 2/6] perf test: Handle metric reuse in pmu-events parsing test


On 01/04/2021 14:49, Jiri Olsa wrote:

On Thu, Mar 25, 2021 at 06:33:14PM +0800, John Garry wrote:

SNIP


+struct metric {
+   struct list_head list;
+   struct metric_ref metric_ref;
+};
+
+static int resolve_metric_simple(struct expr_parse_ctx *pctx,
+struct list_head *compound_list,
+struct pmu_events_map *map,
+const char *metric_name)
+{
+   struct hashmap_entry *cur, *cur_tmp;
+   struct metric *metric, *tmp;
+   size_t bkt;
+   bool all;
+   int rc;
+
+   do {
+   all = true;
+   hashmap__for_each_entry_safe((>ids), cur, cur_tmp, bkt) {
+   struct metric_ref *ref;
+   struct pmu_event *pe;
+
+   pe = metrcgroup_find_metric(cur->key, map);


*


+   if (!pe)
+   continue;
+
+   if (!strcmp(metric_name, (char *)cur->key)) {
+   pr_warning("Recursion detected for metric 
%s\n", metric_name);
+   rc = -1;
+   goto out_err;
+   }
+
+   all = false;
+
+   /* The metric key itself needs to go out.. */
+   expr__del_id(pctx, cur->key);
+
+   metric = malloc(sizeof(*metric));
+   if (!metric) {
+   rc = -ENOMEM;
+   goto out_err;
+   }
+
+   ref = >metric_ref;
+   ref->metric_name = pe->metric_name;
+   ref->metric_expr = pe->metric_expr;
+   list_add_tail(>list, compound_list);
+
+   rc = expr__find_other(pe->metric_expr, NULL, pctx, 0);




Hi Jirka,


so this might add new items to pctx->ids, I think you need
to restart the iteration as we do it in __resolve_metric
otherwise you could miss some new keys


I thought that I was doing this. Indeed, this code is very much like 
__resolve_metric() ;)


So expr__find_other() may add a new item to pctx->ids, and we always 
iterate again, and try to lookup any pmu_events, *, above. If none 
exist, then we have broken down pctx into primitive events aliases and 
unresolvable metrics, and stop iterating. And then unresolvable metrics 
would be found in check_parse_cpu().


As an example, we can deal with metric test1, below, which references 2x 
other metrics:


{
"MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / (4 * (( ( 
CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE 
/ CPU_CLK_UNHALTED.REF_XCLK ) )))",

  "MetricName": "Frontend_Bound",
},
{
"MetricExpr": "( UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 
4 * INT_MISC.RECOVERY_CYCLES ) / (4 * cycles)",

"MetricName": "Bad_Speculation",
},
{
"MetricExpr": "Bad_Speculation + Frontend_Bound",
"MetricName": "test1",
},

Does that satisfy your concern, or have I missed something?

Thanks,
John



jirka


+   if (rc)
+   goto out_err;
+   }
+   } while (!all);
+
+   return 0;
+
+out_err:
+   list_for_each_entry_safe(metric, tmp, compound_list, list)
+   free(metric);
+
+   return rc;
+
+}


SNIP

.

Re: [PATCH printk v2 2/5] printk: remove safe buffers

2021-04-06 Thread John Ogness

On 2021-04-01, Petr Mladek  wrote:
>> Caller-id solves this problem and is easy to sort for anyone with
>> `grep'. Yes, it is a shame that `dmesg' does not show it, but
>> directly using any of the printk interfaces does show it (kmsg_dump,
>> /dev/kmsg, syslog, console).
>
> True but frankly, the current situation is _far_ from convenient:
>
>+ consoles do not show it by default
>+ none userspace tool (dmesg, journalctl, crash) is able to show it
>+ grep is a nightmare, especially if you have more than handful of CPUs
>
> Yes, everything is solvable but not easily.
>
>> > I get this with "echo l >/proc/sysrq-trigger" and this patchset:
>> 
>> Of course. Without caller-id, it is a mess. But this has nothing to do
>> with NMI. The same problem exists for WARN_ON() on multiple CPUs
>> simultaneously. If the user is not using caller-id, they are
>> lost. Caller-id is the current solution to the interlaced logs.
>
> Sure. But in reality, the risk of mixed WARN_ONs is small. While
> this patch makes backtraces from all CPUs always unusable without
> caller_id and non-trivial effort.

I would prefer we solve the situation for non-NMI as well, not just for
the sysrq "l" case.

>> For the long term, we should introduce a printk-context API that allows
>> callers to perfectly pack their multi-line output into a single
>> entry. We discussed [0][1] this back in August 2020.
>
> We need a "short" term solution. There are currently 3 solutions:
>
> 1. Keep nmi_safe() and all the hacks around.
>
> 2. Serialize nmi_cpu_backtrace() by a spin lock and later by
>the special lock used also by atomic consoles.
>
> 3. Tell complaining people how to sort the messed logs.

Or we look into the long term solution now. If caller-id's cannot not be
used as the solution (because nobody turns it on, nobody knows about it,
and/or distros do not enable it), then we should look at how to make at
least the backtraces contiguous. I have a few ideas here.

John Ogness

Re: [PATCH v2 1/6] perf metricgroup: Make find_metric() public with name change


On 02/04/2021 00:16, Ian Rogers wrote:

On Thu, Mar 25, 2021 at 3:38 AM John Garry  wrote:


Function find_metric() is required for the metric processing in the
pmu-events testcase, so make it public. Also change the name to include
"metricgroup".


Would it make more sense as "pmu_events_map__find_metric" ?



So all functions apart from one in metricgroup.h are named 
metricgroup__XXX, so I was trying to keep this style - apart from the 
double-underscore (which can be remedied).


Personally I don't think pmu_events_map__find_metric name fits with that 
convention.


Thanks,
John


Thanks,
Ian


Signed-off-by: John Garry 
---
  tools/perf/util/metricgroup.c | 5 +++--
  tools/perf/util/metricgroup.h | 3 ++-
  2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index 6acb44ad439b..71a13406e0bd 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -900,7 +900,8 @@ static int __add_metric(struct list_head *metric_list,
 (match_metric(__pe->metric_group, __metric) ||  \
  match_metric(__pe->metric_name, __metric)))

-static struct pmu_event *find_metric(const char *metric, struct pmu_events_map 
*map)
+struct pmu_event *metrcgroup_find_metric(const char *metric,
+struct pmu_events_map *map)
  {
 struct pmu_event *pe;
 int i;
@@ -985,7 +986,7 @@ static int __resolve_metric(struct metric *m,
 struct expr_id *parent;
 struct pmu_event *pe;

-   pe = find_metric(cur->key, map);
+   pe = metrcgroup_find_metric(cur->key, map);
 if (!pe)
 continue;

diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h
index ed1b9392e624..1674c6a36d74 100644
--- a/tools/perf/util/metricgroup.h
+++ b/tools/perf/util/metricgroup.h
@@ -44,7 +44,8 @@ int metricgroup__parse_groups(const struct option *opt,
   bool metric_no_group,
   bool metric_no_merge,
   struct rblist *metric_events);
-
+struct pmu_event *metrcgroup_find_metric(const char *metric,
+struct pmu_events_map *map);
  int metricgroup__parse_groups_test(struct evlist *evlist,
struct pmu_events_map *map,
const char *str,
--
2.26.2


.

Re: PCI MSI issue with reinserting a driver


On 03/02/2021 17:23, Marc Zyngier wrote:

On 2021-02-02 15:46, John Garry wrote:

On 02/02/2021 14:48, Marc Zyngier wrote:


Not sure. I also now notice an error for the SAS PCI driver on D06 
when nr_cpus < 16, which means number of MSI vectors allocated < 
32, so looks the same problem. There we try to allocate 16 + max(nr 
cpus, 16) MSI.


Anyway, let me have a look today to see what is going wrong.


Could this be the problem:

nr_cpus=11

In alloc path, we have:
its_alloc_device_irq(nvecs=27 = 16+11)
  bitmap_find_free_region(order = 5);
In free path, we have:
its_irq_domain_free(nvecs = 1) and free each 27 vecs
  bitmap_release_region(order = 0)

So we allocate 32 bits, but only free 27. And 2nd alloc for 32 fails.


[ ... ]





Hi Marc,

Just a friendly reminder on this issue. Let me know if anything required 
some our side.


Cheers,
John



But I'm not sure that we have any requirement for those map bits to be
consecutive.


We can't really do that. All the events must be contiguous,
and there is also a lot of assumptions in the ITS driver that
LPI allocations is also contiguous.

But there is also the fact that for Multi-MSI, we *must*
allocate 32 vectors. Any driver could assume that if we have
allocated 17 vectors, then there is another 15 available.

My question still stand: how was this working with the previous
behaviour?


Because previously in this scenario we would allocate 32 bits and free
32 bits in the map; but now we allocate 32 bits, yet only free 27 - so
leak 5 bits. And this comes from how irq_domain_free_irqs_hierarchy()
now frees per-interrupt, instead of all irqs per domain.

Before:
 In free path, we have:
 its_irq_domain_free(nvecs = 27)
   bitmap_release_region(count order = 5 == 32bits)

Current:
 In free path, we have:
 its_irq_domain_free(nvecs = 1) for free each 27 vecs
   bitmap_release_region(count order = 0 == 1bit)


Right. I was focusing on the patch and blindly ignored the explanation
at the top of the email. Apologies for that.

I'm not overly keen on handling this in the ITS though, and I'd rather
we try and do it in the generic code. How about this (compile tested
only)?

Thanks,

     M.

diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 6aacd342cd14..cfccad83c2df 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -1399,8 +1399,19 @@ static void irq_domain_free_irqs_hierarchy(struct 
irq_domain *domain,

  return;

  for (i = 0; i < nr_irqs; i++) {
-    if (irq_domain_get_irq_data(domain, irq_base + i))
-    domain->ops->free(domain, irq_base + i, 1);
+    int n ;
+
+    /* Find the largest possible span of IRQs to free in one go */
+    for (n = 0;
+ ((i + n) < nr_irqs &&
+  irq_domain_get_irq_data(domain, irq_base + i + n));
+ n++);
+
+    if (!n)
+    continue;
+
+    domain->ops->free(domain, irq_base + i, n);
+    i += n;
  }
  }

Re: [PATCH v2 0/4] iommu/iova: Add CPU hotplug handler to flush rcaches to core code


On 25/03/2021 17:53, Will Deacon wrote:

On Thu, Mar 25, 2021 at 08:29:57PM +0800, John Garry wrote:

The Intel IOMMU driver supports flushing the per-CPU rcaches when a CPU is
offlined.

Let's move it to core code, so everyone can take advantage.

Also throw in a patch to stop exporting free_iova_fast().

Differences to v1:
- Add RB tags (thanks!)
- Add patch to stop exporting free_iova_fast()
- Drop patch to correct comment
- Add patch to delete iommu_dma_free_cpu_cached_iovas() and associated
   changes

John Garry (4):
   iova: Add CPU hotplug handler to flush rcaches
   iommu/vt-d: Remove IOVA domain rcache flushing for CPU offlining
   iommu: Delete iommu_dma_free_cpu_cached_iovas()
   iommu: Stop exporting free_iova_fast()

Looks like this is all set for 5.13, so hopefully Joerg can stick it in
-next for a bit more exposure.




Hi Joerg,

Can you kindly consider picking this up now?

Thanks,
John

RE: [PATCH bpf-next 2/3] libbpf: selftests: refactor 'BPF_PERCPU_TYPE()' and 'bpf_percpu()' macros

2021-04-05 Thread John Fastabend

Pedro Tammela wrote:
> This macro was refactored out of the bpf selftests.
> 
> Since percpu values are rounded up to '8' in the kernel, a careless
> user in userspace might encounter unexpected values when parsing the
> output of the batched operations.
> 
> Now that both array and hash maps have support for batched ops in the
> percpu variant, let's provide a convenient macro to declare percpu map
> value types.
> 
> Updates the tests to a "reference" usage of the new macro.
> 
> Signed-off-by: Pedro Tammela 
> ---

Other than the initial patch needing a bit of description the series
looks good to me. Thanks.

RE: [PATCH bpf-next 1/3] bpf: add batched ops support for percpu array

2021-04-05 Thread John Fastabend

Pedro Tammela wrote:
> Suggested-by: Jamal Hadi Salim 
> Signed-off-by: Pedro Tammela 
> ---

A commit message describing some of the change details and a note it uses
the for-each cpu copies (same as normal syscall on percpu map) and not the
per-cpu ones would be nice. I at least had to go and check the generic_map*
batch operations.

Also something about why generic_map_delete_batch is omitted?

>  kernel/bpf/arraymap.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 463d25e1e67e..3c4105603f9d 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -698,6 +698,8 @@ const struct bpf_map_ops percpu_array_map_ops = {
>   .map_delete_elem = array_map_delete_elem,
>   .map_seq_show_elem = percpu_array_map_seq_show_elem,
>   .map_check_btf = array_map_check_btf,
> + .map_lookup_batch = generic_map_lookup_batch,
> + .map_update_batch = generic_map_update_batch,
>   .map_set_for_each_callback_args = map_set_for_each_callback_args,
>   .map_for_each_callback = bpf_for_each_array_elem,
>   .map_btf_name = "bpf_array",
> -- 
> 2.25.1
>

Re: [PATCH] drm/msm: Fix removal of valid error case when checking speed_bin

2021-04-05 Thread John Stultz

On Mon, Mar 29, 2021 at 6:34 PM John Stultz  wrote:
>
> Commit 7bf168c8fe8c  ("drm/msm: Fix speed-bin support not to
> access outside valid memory"), reworked the nvmem reading of
> "speed_bin", but in doing so dropped handling of the -ENOENT
> case which was previously documented as "fine".
>
> That change resulted in the db845c board display to fail to
> start, with the following error:
>
> adreno 500.gpu: [drm:a6xx_gpu_init] *ERROR* failed to read speed-bin 
> (-2). Some OPPs may not be supported by hardware
>
> Thus, this patch simply re-adds the ENOENT handling so the lack
> of the speed_bin entry isn't fatal for display, and gets things
> working on db845c.

Hey Folks,
  Just wanted to re-ping you on this, as it resolves a regression
introduced in 5.12-rc5 and I'm not yet seeing this in -next. Would be
nice to have this in place before 5.12 final.

thanks
-john

Re: [PATCH] ia64: module: fix symbolizer crash on fdescr

2021-04-04 Thread John Paul Adrian Glaubitz

Hi Sergei!

On 4/3/21 9:48 AM, Sergei Trofimovich wrote:
> Noticed failure as a crash on ia64 when tried to symbolize all
> backtraces collected by page_owner=on:
> 
> $ cat /sys/kernel/debug/page_owner
> 
> 
> CPU: 1 PID: 2074 Comm: cat Not tainted 5.12.0-rc4 #226
> Hardware name: hp server rx3600, BIOS 04.03 04/08/2008
> ip is at dereference_module_function_descriptor+0x41/0x100
> 
> Crash happens at dereference_module_function_descriptor() due to
> use-after-free when dereferencing ".opd" section header.
> 
> All section headers are already freed after module is laoded
> successfully.
> 
> To keep symbolizer working the change stores ".opd" address
> and size after module is relocated to a new place and before
> section headers are discarded.
> 
> To make similar errors less obscure module_finalize() now
> zeroes out all variables relevant to module loading only.

Typo: s/zeroes/zero/.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer - glaub...@debian.org
`. `'   Freie Universitaet Berlin - glaub...@physik.fu-berlin.de
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913

Re: [PATCH] apparmor: avoid -Wempty-body warning

2021-04-03 Thread John Johansen

On 3/22/21 4:00 AM, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> Building with 'make W=1' shows a warning for an empty macro:
> 
> security/apparmor/label.c: In function '__label_update':
> security/apparmor/label.c:2096:59: error: suggest braces around empty body in 
> an 'else' statement [-Werror=empty-body]
>  2096 | AA_BUG(labels_ns(label) != labels_ns(new));
> 
> Change the macro defintion to use no_printk(), which improves
> format string checking and avoids the warning.
> 
> Signed-off-by: Arnd Bergmann 

Aked-by: John Johansen 

I have pulled it into the apparmor tree

> ---
>  security/apparmor/include/lib.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/security/apparmor/include/lib.h b/security/apparmor/include/lib.h
> index 7d27db740bc2..67fbb81a11f3 100644
> --- a/security/apparmor/include/lib.h
> +++ b/security/apparmor/include/lib.h
> @@ -36,7 +36,7 @@
>  #define AA_BUG_FMT(X, fmt, args...)  \
>   WARN((X), "AppArmor WARN %s: (" #X "): " fmt, __func__, ##args)
>  #else
> -#define AA_BUG_FMT(X, fmt, args...)
> +#define AA_BUG_FMT(X, fmt, args...) no_printk(fmt, ##args)
>  #endif
>  
>  #define AA_ERROR(fmt, args...)   
> \
>

Re: [PATCH v2 7/7] ABI: sysfs-kernel-mm-cma: fix two cross-references

2021-04-01 Thread John Hubbard


On 3/25/21 3:38 AM, Mauro Carvalho Chehab wrote:

Change the text in order to generate cross-references for
alloc_pages_success and alloc_pages_fail symbols.

Signed-off-by: Mauro Carvalho Chehab 
---
  Documentation/ABI/testing/sysfs-kernel-mm-cma | 8 +++-
  1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-cma 
b/Documentation/ABI/testing/sysfs-kernel-mm-cma
index 02b2bb60c296..86e261185561 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-cma
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-cma
@@ -5,12 +5,10 @@ Description:
/sys/kernel/mm/cma/ contains a subdirectory for each CMA
heap name (also sometimes called CMA areas).
  
-		Each CMA heap subdirectory (that is, each

-   /sys/kernel/mm/cma/ directory) contains the
-   following items:
+   Each CMA heap subdirectory contains the following items:
  
-			alloc_pages_success

-   alloc_pages_fail
+   - /sys/kernel/mm/cma//alloc_pages_success
+   - /sys/kernel/mm/cma//alloc_pages_fail
  


I agree that this is clearer and easier on the reader, who can now see
directly what the full path to each item is.

As for calling it a "fix", that seems a bit much. It's an upgrade.
I'm not sure how many people realize that this sort of change causes
cross refs to magically start working. I certainly didn't until now.

But either way, this improvement is nice to have, so:

Reviewed-by: John Hubbard 

thanks,
--
John Hubbard
NVIDIA


  What: /sys/kernel/mm/cma//alloc_pages_success
  Date: Feb 2021

Re: [PATCH printk v2 2/5] printk: remove safe buffers

2021-04-01 Thread John Ogness

On 2021-04-01, Petr Mladek  wrote:
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -1142,24 +1128,37 @@ void __init setup_log_buf(int early)
>>   new_descs, ilog2(new_descs_count),
>>   new_infos);
>>  
>> -printk_safe_enter_irqsave(flags);
>> +local_irq_save(flags);
>
> IMHO, we actually do not have to disable IRQ here. We already copy
> messages that might appear in the small race window in NMI. It would
> work the same way also for IRQ context.

We do not have to, but why open up this window? We are still in early
boot and interrupts have always been disabled here. I am not happy that
this window even exists. I really prefer to keep it NMI-only.

>> --- a/lib/nmi_backtrace.c
>> +++ b/lib/nmi_backtrace.c
>> @@ -75,12 +75,6 @@ void nmi_trigger_cpumask_backtrace(const cpumask_t *mask,
>>  touch_softlockup_watchdog();
>>  }
>>  
>> -/*
>> - * Force flush any remote buffers that might be stuck in IRQ context
>> - * and therefore could not run their irq_work.
>> - */
>> -printk_safe_flush();
>
> Sigh, this reminds me that the nmi_safe buffers serialized backtraces
> from all CPUs.
>
> I am afraid that we have to put back the spinlock into
> nmi_cpu_backtrace().

Please no. That spinlock is a disaster. It can cause deadlocks with
other cpu-locks (such as in kdb) and it will cause a major problem for
atomic consoles. We need to be very careful about introducing locks
where NMIs are waiting on other CPUs.

> It has been repeatedly added and removed depending
> on whether the backtrace was printed into the main log buffer
> or into the per-CPU buffers. Last time it was removed by
> the commit 03fc7f9c99c1e7ae2925d ("printk/nmi: Prevent deadlock
> when accessing the main log buffer in NMI").
>
> It should be safe because there should not be any other locks in the
> code path. Note that only one backtrace might be triggered at the same
> time, see @backtrace_flag in nmi_trigger_cpumask_backtrace().

It is adding a lock around a lockless ringbuffer. For me that is a step
backwards.

> We _must_ serialize it somehow[*]. The lock in nmi_cpu_backtrace()
> looks less evil than the nmi_safe machinery. nmi_safe() shrinks
> too long backtraces, lose timestamps, needs to be explicitely
> flushed here and there, is a non-trivial code.
>
> [*] Non-serialized bactraces are real mess. Caller-id is visible
> only on consoles or via syslogd interface. And it is not much
> convenient.

Caller-id solves this problem and is easy to sort for anyone with
`grep'. Yes, it is a shame that `dmesg' does not show it, but directly
using any of the printk interfaces does show it (kmsg_dump, /dev/kmsg,
syslog, console).

> I get this with "echo l >/proc/sysrq-trigger" and this patchset:

Of course. Without caller-id, it is a mess. But this has nothing to do
with NMI. The same problem exists for WARN_ON() on multiple CPUs
simultaneously. If the user is not using caller-id, they are
lost. Caller-id is the current solution to the interlaced logs.

For the long term, we should introduce a printk-context API that allows
callers to perfectly pack their multi-line output into a single
entry. We discussed [0][1] this back in August 2020.

John Ogness

[0] 
https://lore.kernel.org/lkml/472f2e553805b52d9834d64e4056db965edee329.ca...@perches.com
[1] offlist message-id: 87d03k9ymz@jogness.linutronix.de

Re: [PATCH] crypto: ccp -A value assigned to a variable is never used.

2021-03-31 Thread John Allen

On Tue, Mar 30, 2021 at 06:10:29PM +0800, Jiapeng Chong wrote:
> Fix the following whitescan warning:
> 
> Assigning value "64" to "dst.address" here, but that stored value is
> overwritten before it can be used.
> 

Thanks for reporting.

Acked-by: John Allen 

> Reported-by: Abaci Robot 
> Signed-off-by: Jiapeng Chong 
> ---
>  drivers/crypto/ccp/ccp-ops.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/crypto/ccp/ccp-ops.c b/drivers/crypto/ccp/ccp-ops.c
> index d6a8f4e..bb88198 100644
> --- a/drivers/crypto/ccp/ccp-ops.c
> +++ b/drivers/crypto/ccp/ccp-ops.c
> @@ -2418,7 +2418,6 @@ static int ccp_run_ecc_pm_cmd(struct ccp_cmd_queue 
> *cmd_q, struct ccp_cmd *cmd)
>   dst.address += CCP_ECC_OUTPUT_SIZE;
>   ccp_reverse_get_dm_area(, 0, ecc->u.pm.result.y, 0,
>   CCP_ECC_MODULUS_BYTES);
> - dst.address += CCP_ECC_OUTPUT_SIZE;
>  
>   /* Restore the workarea address */
>   dst.address = save;
> -- 
> 1.8.3.1
>

Re: [PATCH printk v2 2/5] printk: remove safe buffers

2021-03-31 Thread John Ogness

On 2021-03-30, John Ogness  wrote:
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index e971c0a9ec9e..f090d6a1b39e 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1772,16 +1759,21 @@ static struct task_struct *console_owner;
>  static bool console_waiter;
>  
>  /**
> - * console_lock_spinning_enable - mark beginning of code where another
> + * console_lock_spinning_enable_irqsave - mark beginning of code where 
> another
>   *   thread might safely busy wait
>   *
>   * This basically converts console_lock into a spinlock. This marks
>   * the section where the console_lock owner can not sleep, because
>   * there may be a waiter spinning (like a spinlock). Also it must be
>   * ready to hand over the lock at the end of the section.
> + *
> + * This disables interrupts because the hand over to a waiter must not be
> + * interrupted until the hand over is completed (@console_waiter is cleared).
>   */
> -static void console_lock_spinning_enable(void)
> +static void console_lock_spinning_enable_irqsave(unsigned long *flags)

I missed the prototype change for the !CONFIG_PRINTK case, resulting in:

linux/kernel/printk/printk.c:2707:3: error: implicit declaration of function 
‘console_lock_spinning_enable_irqsave’; did you mean 
‘console_lock_spinning_enable’? [-Werror=implicit-function-declaration]
   console_lock_spinning_enable_irqsave();
   ^~~~
   console_lock_spinning_enable

Will be fixed for v3.

(I have now officially added !CONFIG_PRINTK to my CI tests.)

John Ogness

Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap


On 3/30/21 8:56 PM, John Hubbard wrote:

On 3/30/21 3:56 PM, Alistair Popple wrote:
...

+1 for renaming "munlock*" items to "mlock*", where applicable. good grief.


At least the situation was weird enough to prompt further investigation :)

Renaming to mlock* doesn't feel like the right solution to me either though. I
am not sure if you saw me responding to myself earlier but I am thinking
renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() ->
page_mlock_one() might be better. Thoughts?



Quite confused by this naming idea. Because: try_to_munlock() returns
void, so a boolean-style name such as "page_mlocked()" is already not a
good fit.

Even more important, though, is that try_to_munlock() is mlock-ing the
page, right? Is there some subtle point I'm missing? It really is doing
an mlock to the best of my knowledge here. Although the kerneldoc
comment for try_to_munlock() seems questionable too:

/**
* try_to_munlock - try to munlock a page
* @page: the page to be munlocked
*
* Called from munlock code.  Checks all of the VMAs mapping the page
* to make sure nobody else has this page mlocked. The page will be
* returned with PG_mlocked cleared if no other vmas have it mlocked.
*/

...because I don't see where, in *this* routine, it clears PG_mlocked!

Obviously we agree that a routine should be named based on what it does,
rather than on who calls it. So I think that still leads to:

     try_to_munlock() --> try_to_mlock()
     try_to_munlock_one() --> try_to_mlock_one()

Sorry if I'm missing something really obvious.


Actually, re-reading your and Jason's earlier points in the thread, I see
that I'm *not* missing anything, and we are actually in agreement about how
the code operates. OK, good!

Also, as you point out above, maybe the "try_" prefix is not really accurate
either, given how this works. So maybe we have arrived at something like:

try_to_munlock() --> page_mlock() // or mlock_page()...
try_to_munlock_one() --> page_mlock_one()



thanks,
--
John Hubbard
NVIDIA

Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap


On 3/30/21 3:56 PM, Alistair Popple wrote:
...

+1 for renaming "munlock*" items to "mlock*", where applicable. good grief.


At least the situation was weird enough to prompt further investigation :)

Renaming to mlock* doesn't feel like the right solution to me either though. I
am not sure if you saw me responding to myself earlier but I am thinking
renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() ->
page_mlock_one() might be better. Thoughts?



Quite confused by this naming idea. Because: try_to_munlock() returns
void, so a boolean-style name such as "page_mlocked()" is already not a
good fit.

Even more important, though, is that try_to_munlock() is mlock-ing the
page, right? Is there some subtle point I'm missing? It really is doing
an mlock to the best of my knowledge here. Although the kerneldoc
comment for try_to_munlock() seems questionable too:

/**
 * try_to_munlock - try to munlock a page
 * @page: the page to be munlocked
 *
 * Called from munlock code.  Checks all of the VMAs mapping the page
 * to make sure nobody else has this page mlocked. The page will be
 * returned with PG_mlocked cleared if no other vmas have it mlocked.
 */

...because I don't see where, in *this* routine, it clears PG_mlocked!

Obviously we agree that a routine should be named based on what it does,
rather than on who calls it. So I think that still leads to:

 try_to_munlock() --> try_to_mlock()
 try_to_munlock_one() --> try_to_mlock_one()

Sorry if I'm missing something really obvious.



This is actually inspired from a suggestion in Documentation/vm/unevictable-
lru.rst which warns about this problem:

try_to_munlock() Reverse Map Scan
-

.. warning::
[!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
page_referenced() reverse map walker.



This is actually rather bad advice! page_referenced() returns an
int-that-is-really-a-boolean, whereas try_to_munlock(), at least as it
stands now, returns void. Usually when I'm writing a TODO item, I'm in a
hurry, and I think that's what probably happened here, too. :)



Although, it seems reasonable to tack such renaming patches onto the tail

end

of this series. But whatever works.


Unless anyone objects strongly I will roll the rename into this patch as there
is only one caller of try_to_munlock.

  - Alistair



No objections here. :)

thanks,
--
John Hubbard
NVIDIA

Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap


On 3/30/21 3:24 PM, Jason Gunthorpe wrote:
...

As far as I can tell this has always been called try_to_munlock() even though
it appears to do the opposite.


Maybe we should change it then?


/**
  * try_to_munlock - try to munlock a page
  * @page: the page to be munlocked
  *
  * Called from munlock code.  Checks all of the VMAs mapping the page
  * to make sure nobody else has this page mlocked. The page will be
  * returned with PG_mlocked cleared if no other vmas have it mlocked.
  */


In other words it sets PG_mlocked if one or more vmas has it mlocked. So
try_to_mlock() might be a better name, except that seems to have the potential
for confusion as well because it's only called from the munlock code path and
never for mlock.


That explanation makes more sense.. This function looks like it is
'set PG_mlocked of the page if any vm->flags has VM_LOCKED'

Maybe call it check_vm_locked or something then and reword the above
comment?

(and why is it OK to read vm->flags for this without any locking?)


Something needs attention here..


I think the code is correct, but perhaps the naming could be better. Would be
interested hearing any thoughts on renaming try_to_munlock() to try_to_mlock()
as the current name appears based on the context it is called from (munlock)
rather than what it does (mlock).


The point of this patch is to make it clearer, after all, so I'd
change something and maybe slightly clarify the comment.



I'd add that, after looking around the calling code, this is a really unhappy
pre-existing situation. Anyone reading this has to remember at which point in 
the
call stack the naming transitions from "do the opposite of what the name says",
to "do what the name says".

+1 for renaming "munlock*" items to "mlock*", where applicable. good grief.

Although, it seems reasonable to tack such renaming patches onto the tail end
of this series. But whatever works.

thanks,
--
John Hubbard
NVIDIA

[PATCH printk v2 4/5] printk: convert @syslog_lock to mutex

@syslog_lock was a raw_spin_lock to simplify the transition of
removing @logbuf_lock and the safe buffers. With that transition
complete, and since all uses of @syslog_lock are within sleepable
contexts, @syslog_lock can become a mutex.

Note that until now register_console() would disable interrupts
using irqsave, which implies that it may be called with interrupts
disabled. And indeed, there is one possible call chain on parisc
where this happens:

handle_interruption(code=1) /* High-priority machine check (HPMC) */
  pdc_console_restart()
pdc_console_init_force()
  register_console()

However, register_console() calls console_lock(), which might sleep.
So it has never been allowed to call register_console() from an
atomic context and the above call chain is a bug.

Signed-off-by: John Ogness 
---
 Note: The removal of read_syslog_seq_irq() is technically a small
   step backwards. But the follow-up patch moves forward again
   and closes a window that existed with read_syslog_seq_irq()
   and @syslog_lock as a spin_lock.

 kernel/printk/printk.c | 49 +-
 1 file changed, 20 insertions(+), 29 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index f090d6a1b39e..b771aae46445 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -356,7 +356,7 @@ enum log_flags {
 };
 
 /* syslog_lock protects syslog_* variables and write access to clear_seq. */
-static DEFINE_RAW_SPINLOCK(syslog_lock);
+static DEFINE_MUTEX(syslog_lock);
 
 #ifdef CONFIG_PRINTK
 DECLARE_WAIT_QUEUE_HEAD(log_wait);
@@ -1497,9 +1497,9 @@ static int syslog_print(char __user *buf, int size)
size_t n;
size_t skip;
 
-   raw_spin_lock_irq(_lock);
+   mutex_lock(_lock);
if (!prb_read_valid(prb, syslog_seq, )) {
-   raw_spin_unlock_irq(_lock);
+   mutex_unlock(_lock);
break;
}
if (r.info->seq != syslog_seq) {
@@ -1528,7 +1528,7 @@ static int syslog_print(char __user *buf, int size)
syslog_partial += n;
} else
n = 0;
-   raw_spin_unlock_irq(_lock);
+   mutex_unlock(_lock);
 
if (!n)
break;
@@ -1592,9 +1592,9 @@ static int syslog_print_all(char __user *buf, int size, 
bool clear)
}
 
if (clear) {
-   raw_spin_lock_irq(_lock);
+   mutex_lock(_lock);
latched_seq_write(_seq, seq);
-   raw_spin_unlock_irq(_lock);
+   mutex_unlock(_lock);
}
 
kfree(text);
@@ -1603,21 +1603,9 @@ static int syslog_print_all(char __user *buf, int size, 
bool clear)
 
 static void syslog_clear(void)
 {
-   raw_spin_lock_irq(_lock);
+   mutex_lock(_lock);
latched_seq_write(_seq, prb_next_seq(prb));
-   raw_spin_unlock_irq(_lock);
-}
-
-/* Return a consistent copy of @syslog_seq. */
-static u64 read_syslog_seq_irq(void)
-{
-   u64 seq;
-
-   raw_spin_lock_irq(_lock);
-   seq = syslog_seq;
-   raw_spin_unlock_irq(_lock);
-
-   return seq;
+   mutex_unlock(_lock);
 }
 
 int do_syslog(int type, char __user *buf, int len, int source)
@@ -1626,6 +1614,7 @@ int do_syslog(int type, char __user *buf, int len, int 
source)
bool clear = false;
static int saved_console_loglevel = LOGLEVEL_DEFAULT;
int error;
+   u64 seq;
 
error = check_syslog_permissions(type, source);
if (error)
@@ -1644,8 +1633,12 @@ int do_syslog(int type, char __user *buf, int len, int 
source)
if (!access_ok(buf, len))
return -EFAULT;
 
-   error = wait_event_interruptible(log_wait,
-   prb_read_valid(prb, read_syslog_seq_irq(), 
NULL));
+   /* Get a consistent copy of @syslog_seq. */
+   mutex_lock(_lock);
+   seq = syslog_seq;
+   mutex_unlock(_lock);
+
+   error = wait_event_interruptible(log_wait, prb_read_valid(prb, 
seq, NULL));
if (error)
return error;
error = syslog_print(buf, len);
@@ -1693,10 +1686,10 @@ int do_syslog(int type, char __user *buf, int len, int 
source)
break;
/* Number of chars in the log buffer */
case SYSLOG_ACTION_SIZE_UNREAD:
-   raw_spin_lock_irq(_lock);
+   mutex_lock(_lock);
if (!prb_read_valid_info(prb, syslog_seq, , NULL)) {
/* No unread messages. */
-   raw_spin_unlock_irq(_lock);
+   mutex_unlock(_lock);
return 0;
}
if (info.seq != syslog_seq) {
@@ -1714,7 +1707,6 @@ int do_syslog(int type, char __user *buf, int len, int 
sou

[PATCH printk v2 5/5] printk: syslog: close window between wait and read

Syslog's SYSLOG_ACTION_READ is supposed to block until the next
syslog record can be read, and then it should read that record.
However, because @syslog_lock is not held between waking up and
reading the record, another reader could read the record first,
thus causing SYSLOG_ACTION_READ to return with a value of 0, never
having read _anything_.

By holding @syslog_lock between waking up and reading, it can be
guaranteed that SYSLOG_ACTION_READ blocks until it successfully
reads a syslog record (or a real error occurs).

Signed-off-by: John Ogness 
---
 kernel/printk/printk.c | 50 +++---
 1 file changed, 37 insertions(+), 13 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index b771aae46445..bd23f00ebc32 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1486,6 +1486,7 @@ static int syslog_print(char __user *buf, int size)
struct printk_record r;
char *text;
int len = 0;
+   u64 seq;
 
text = kmalloc(CONSOLE_LOG_MAX, GFP_KERNEL);
if (!text)
@@ -1493,11 +1494,38 @@ static int syslog_print(char __user *buf, int size)
 
prb_rec_init_rd(, , text, CONSOLE_LOG_MAX);
 
-   while (size > 0) {
+   /* Get a consistent copy of @syslog_seq. */
+   mutex_lock(_lock);
+   seq = syslog_seq;
+   mutex_unlock(_lock);
+
+   /* Wait for the @syslog_seq record to be available. */
+   for (;;) {
+   len = wait_event_interruptible(log_wait, prb_read_valid(prb, 
seq, NULL));
+   if (len)
+   goto out;
+
+   /*
+* @syslog_seq may have changed while waiting. If so, wait
+* for the new @syslog_seq record.
+*/
+
+   mutex_lock(_lock);
+   if (syslog_seq == seq)
+   break;
+   seq = syslog_seq;
+   mutex_unlock(_lock);
+   }
+
+   /*
+* @syslog_lock is held when entering the read loop to prevent
+* another reader from modifying @syslog_seq.
+*/
+
+   for (;;) {
size_t n;
size_t skip;
 
-   mutex_lock(_lock);
if (!prb_read_valid(prb, syslog_seq, )) {
mutex_unlock(_lock);
break;
@@ -1542,8 +1570,13 @@ static int syslog_print(char __user *buf, int size)
len += n;
size -= n;
buf += n;
-   }
 
+   if (!size)
+   break;
+
+   mutex_lock(_lock);
+   }
+out:
kfree(text);
return len;
 }
@@ -1614,7 +1647,6 @@ int do_syslog(int type, char __user *buf, int len, int 
source)
bool clear = false;
static int saved_console_loglevel = LOGLEVEL_DEFAULT;
int error;
-   u64 seq;
 
error = check_syslog_permissions(type, source);
if (error)
@@ -1632,15 +1664,6 @@ int do_syslog(int type, char __user *buf, int len, int 
source)
return 0;
if (!access_ok(buf, len))
return -EFAULT;
-
-   /* Get a consistent copy of @syslog_seq. */
-   mutex_lock(_lock);
-   seq = syslog_seq;
-   mutex_unlock(_lock);
-
-   error = wait_event_interruptible(log_wait, prb_read_valid(prb, 
seq, NULL));
-   if (error)
-   return error;
error = syslog_print(buf, len);
break;
/* Read/clear last kernel messages */
@@ -1707,6 +1730,7 @@ int do_syslog(int type, char __user *buf, int len, int 
source)
} else {
bool time = syslog_partial ? syslog_time : printk_time;
unsigned int line_count;
+   u64 seq;
 
prb_for_each_info(syslog_seq, prb, seq, ,
  _count) {
-- 
2.20.1

[PATCH printk v2 1/5] printk: track/limit recursion

Currently the printk safe buffers provide a form of recursion
protection by redirecting to the safe buffers whenever printk() is
recursively called.

In preparation for removal of the safe buffers, provide an alternate
explicit recursion protection. Recursion is limited to 3 levels
per-CPU and per-context.

Signed-off-by: John Ogness 
---
 kernel/printk/printk.c | 83 --
 1 file changed, 80 insertions(+), 3 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 421c35571797..e971c0a9ec9e 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1940,6 +1940,74 @@ static void call_console_drivers(const char *ext_text, 
size_t ext_len,
}
 }
 
+/*
+ * Recursion is tracked separately on each CPU. If NMIs are supported, an
+ * additional NMI context per CPU is also separately tracked. Until per-CPU
+ * is available, a separate "early tracking" is performed.
+ */
+static DEFINE_PER_CPU(u8, printk_count);
+static u8 printk_count_early;
+#ifdef CONFIG_HAVE_NMI
+static DEFINE_PER_CPU(u8, printk_count_nmi);
+static u8 printk_count_nmi_early;
+#endif
+
+/*
+ * Recursion is limited to keep the output sane. printk() should not require
+ * more than 1 level of recursion (allowing, for example, printk() to trigger
+ * a WARN), but a higher value is used in case some printk-internal errors
+ * exist, such as the ringbuffer validation checks failing.
+ */
+#define PRINTK_MAX_RECURSION 3
+
+/*
+ * Return a pointer to the dedicated counter for the CPU+context of the
+ * caller.
+ */
+static u8 *printk_recursion_counter(void)
+{
+#ifdef CONFIG_HAVE_NMI
+   if (in_nmi()) {
+   if (printk_percpu_data_ready())
+   return this_cpu_ptr(_count_nmi);
+   return _count_nmi_early;
+   }
+#endif
+   if (printk_percpu_data_ready())
+   return this_cpu_ptr(_count);
+   return _count_early;
+}
+
+/*
+ * Enter recursion tracking. Interrupts are disabled to simplify tracking.
+ * The caller must check the return value to see if the recursion is allowed.
+ * On failure, interrupts are not disabled.
+ */
+static bool printk_enter_irqsave(unsigned long *flags)
+{
+   u8 *count;
+
+   local_irq_save(*flags);
+   count = printk_recursion_counter();
+   if (*count > PRINTK_MAX_RECURSION) {
+   local_irq_restore(*flags);
+   return false;
+   }
+   (*count)++;
+
+   return true;
+}
+
+/* Exit recursion tracking, restoring interrupts. */
+static void printk_exit_irqrestore(unsigned long flags)
+{
+   u8 *count;
+
+   count = printk_recursion_counter();
+   (*count)--;
+   local_irq_restore(flags);
+}
+
 int printk_delay_msec __read_mostly;
 
 static inline void printk_delay(void)
@@ -2040,11 +2108,13 @@ int vprintk_store(int facility, int level,
struct prb_reserved_entry e;
enum log_flags lflags = 0;
struct printk_record r;
+   unsigned long irqflags;
u16 trunc_msg_len = 0;
char prefix_buf[8];
u16 reserve_size;
va_list args2;
u16 text_len;
+   int ret = 0;
u64 ts_nsec;
 
/*
@@ -2055,6 +2125,9 @@ int vprintk_store(int facility, int level,
 */
ts_nsec = local_clock();
 
+   if (!printk_enter_irqsave())
+   return 0;
+
/*
 * The sprintf needs to come first since the syslog prefix might be
 * passed in as a parameter. An extra byte must be reserved so that
@@ -2092,7 +2165,8 @@ int vprintk_store(int facility, int level,
prb_commit();
}
 
-   return text_len;
+   ret = text_len;
+   goto out;
}
}
 
@@ -2108,7 +2182,7 @@ int vprintk_store(int facility, int level,
 
prb_rec_init_wr(, reserve_size + trunc_msg_len);
if (!prb_reserve(, prb, ))
-   return 0;
+   goto out;
}
 
/* fill message */
@@ -2130,7 +2204,10 @@ int vprintk_store(int facility, int level,
else
prb_final_commit();
 
-   return (text_len + trunc_msg_len);
+   ret = text_len + trunc_msg_len;
+out:
+   printk_exit_irqrestore(irqflags);
+   return ret;
 }
 
 asmlinkage int vprintk_emit(int facility, int level,
-- 
2.20.1

[PATCH printk v2 0/5] printk: remove safe buffers

Hi,

Here is v2 of a series to remove the safe buffers. v1 can be
found here [0]. The safe buffers are no longer needed because
messages can be stored directly into the log buffer from any
context.

However, the safe buffers also provided a form of recursion
protection. For that reason, explicit recursion protection is
also implemented for this series.

And finally, with the removal of the safe buffers, there is no
need for extra NMI enter/exit tracking. So this is also removed
(which includes removing config option CONFIG_PRINTK_NMI).

This series is based on the printk-rework branch of
printk/linux.git:

commit acebb5597ff1 ("kernel/printk.c: Fixed mundane typos")

Changes since v1:

- remove the printk nmi enter/exit tracking

- remove CONFIG_PRINTK_NMI config option

- use in_nmi() to detect NMI context

- remove unused printk_safe_enter/exit macros

- after switching to the dynamic buffer, copy over NMI records
  that may have arrived during the switch window

- use local_irq_*() instead of printk_safe_*() for console
  spinning

- use separate variables rather than arrays for the per-cpu
  recursion tracking

- make @syslog_lock a mutex instead of a spin_lock

- close the wait-read window for SYSLOG_ACTION_READ

- adjust various comments and commit messages as requested

John Ogness

[0] 
https://lore.kernel.org/lkml/20210316233326.10778-1-john.ogn...@linutronix.de

John Ogness (5):
  printk: track/limit recursion
  printk: remove safe buffers
  printk: remove NMI tracking
  printk: convert @syslog_lock to mutex
  printk: syslog: close window between wait and read

 arch/arm/kernel/smp.c  |   2 -
 arch/powerpc/kernel/traps.c|   1 -
 arch/powerpc/kernel/watchdog.c |   5 -
 arch/powerpc/kexec/crash.c |   3 -
 include/linux/hardirq.h|   2 -
 include/linux/printk.h |  22 --
 init/Kconfig   |   5 -
 kernel/kexec_core.c|   1 -
 kernel/panic.c |   3 -
 kernel/printk/internal.h   |  23 ---
 kernel/printk/printk.c | 281 +++--
 kernel/printk/printk_safe.c| 362 +
 kernel/trace/trace.c   |   2 -
 lib/nmi_backtrace.c|   6 -
 14 files changed, 171 insertions(+), 547 deletions(-)

-- 
2.20.1

[PATCH printk v2 2/5] printk: remove safe buffers

With @logbuf_lock removed, the high level printk functions for
storing messages are lockless. Messages can be stored from any
context, so there is no need for the NMI and safe buffers anymore.
Remove the NMI and safe buffers.

Although the safe buffers are removed, the NMI and safe context
tracking is still in place. In these contexts, store the message
immediately but still use irq_work to defer the console printing.

Since printk recursion tracking is in place, safe context tracking
for most of printk is not needed. Remove it. Only safe context
tracking relating to the console lock is left in place. This is
because the console lock is needed for the actual printing.

Signed-off-by: John Ogness 
---
 Note: The follow-up patch removes the NMI tracking.

 arch/powerpc/kernel/traps.c|   1 -
 arch/powerpc/kernel/watchdog.c |   5 -
 include/linux/printk.h |  10 -
 kernel/kexec_core.c|   1 -
 kernel/panic.c |   3 -
 kernel/printk/internal.h   |  17 --
 kernel/printk/printk.c | 137 +-
 kernel/printk/printk_safe.c| 333 +
 lib/nmi_backtrace.c|   6 -
 9 files changed, 56 insertions(+), 457 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 3ec7b443fe6b..7d2b339afcb0 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -170,7 +170,6 @@ extern void panic_flush_kmsg_start(void)
 
 extern void panic_flush_kmsg_end(void)
 {
-   printk_safe_flush_on_panic();
kmsg_dump(KMSG_DUMP_PANIC);
bust_spinlocks(0);
debug_locks_off();
diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index af3c15a1d41e..8ae46c5945d0 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -181,11 +181,6 @@ static void watchdog_smp_panic(int cpu, u64 tb)
 
wd_smp_unlock();
 
-   printk_safe_flush();
-   /*
-* printk_safe_flush() seems to require another print
-* before anything actually goes out to console.
-*/
if (sysctl_hardlockup_all_cpu_backtrace)
trigger_allbutself_cpu_backtrace();
 
diff --git a/include/linux/printk.h b/include/linux/printk.h
index fe7eb2351610..2476796c1150 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -207,8 +207,6 @@ __printf(1, 2) void dump_stack_set_arch_desc(const char 
*fmt, ...);
 void dump_stack_print_info(const char *log_lvl);
 void show_regs_print_info(const char *log_lvl);
 extern asmlinkage void dump_stack(void) __cold;
-extern void printk_safe_flush(void);
-extern void printk_safe_flush_on_panic(void);
 #else
 static inline __printf(1, 0)
 int vprintk(const char *s, va_list args)
@@ -272,14 +270,6 @@ static inline void show_regs_print_info(const char 
*log_lvl)
 static inline void dump_stack(void)
 {
 }
-
-static inline void printk_safe_flush(void)
-{
-}
-
-static inline void printk_safe_flush_on_panic(void)
-{
-}
 #endif
 
 extern int kptr_restrict;
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index a0b6780740c8..480d5f77ef4f 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -977,7 +977,6 @@ void crash_kexec(struct pt_regs *regs)
old_cpu = atomic_cmpxchg(_cpu, PANIC_CPU_INVALID, this_cpu);
if (old_cpu == PANIC_CPU_INVALID) {
/* This is the 1st CPU which comes here, so go ahead. */
-   printk_safe_flush_on_panic();
__crash_kexec(regs);
 
/*
diff --git a/kernel/panic.c b/kernel/panic.c
index 332736a72a58..1f0df42f8d0c 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -247,7 +247,6 @@ void panic(const char *fmt, ...)
 * Bypass the panic_cpu check and call __crash_kexec directly.
 */
if (!_crash_kexec_post_notifiers) {
-   printk_safe_flush_on_panic();
__crash_kexec(NULL);
 
/*
@@ -271,8 +270,6 @@ void panic(const char *fmt, ...)
 */
atomic_notifier_call_chain(_notifier_list, 0, buf);
 
-   /* Call flush even twice. It tries harder with a single online CPU */
-   printk_safe_flush_on_panic();
kmsg_dump(KMSG_DUMP_PANIC);
 
/*
diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index 51615c909b2f..6cc35c5de890 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -22,7 +22,6 @@ __printf(1, 0) int vprintk_deferred(const char *fmt, va_list 
args);
 void __printk_safe_enter(void);
 void __printk_safe_exit(void);
 
-void printk_safe_init(void);
 bool printk_percpu_data_ready(void);
 
 #define printk_safe_enter_irqsave(flags)   \
@@ -37,18 +36,6 @@ bool printk_percpu_data_ready(void);
local_irq_restore(flags);   \
} while (0)
 
-#define printk_safe_enter_irq()\
-   do {\
-   local_irq_disable();\
-   __printk_safe_enter

[PATCH printk v2 3/5] printk: remove NMI tracking

All NMI contexts are handled the same as the safe context: store the
message and defer printing. There is no need to have special NMI
context tracking for this. Using in_nmi() is enough.

Signed-off-by: John Ogness 
---
 arch/arm/kernel/smp.c   |  2 --
 arch/powerpc/kexec/crash.c  |  3 ---
 include/linux/hardirq.h |  2 --
 include/linux/printk.h  | 12 
 init/Kconfig|  5 -
 kernel/printk/internal.h|  6 --
 kernel/printk/printk_safe.c | 37 +
 kernel/trace/trace.c|  2 --
 8 files changed, 1 insertion(+), 68 deletions(-)

diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 5c48eb4fd0e5..77a720c1f402 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -671,9 +671,7 @@ static void do_handle_IPI(int ipinr)
break;
 
case IPI_CPU_BACKTRACE:
-   printk_nmi_enter();
nmi_cpu_backtrace(get_irq_regs());
-   printk_nmi_exit();
break;
 
default:
diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
index c9a889880214..d488311efab1 100644
--- a/arch/powerpc/kexec/crash.c
+++ b/arch/powerpc/kexec/crash.c
@@ -311,9 +311,6 @@ void default_machine_crash_shutdown(struct pt_regs *regs)
unsigned int i;
int (*old_handler)(struct pt_regs *regs);
 
-   /* Avoid hardlocking with irresponsive CPU holding logbuf_lock */
-   printk_nmi_enter();
-
/*
 * This function is only called after the system
 * has panicked or is otherwise in a critical state.
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 7c9d6a2d7e90..0926e9ca4d85 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -115,7 +115,6 @@ extern void rcu_nmi_exit(void);
do {\
lockdep_off();  \
arch_nmi_enter();   \
-   printk_nmi_enter(); \
BUG_ON(in_nmi() == NMI_MASK);   \
__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);   \
} while (0)
@@ -134,7 +133,6 @@ extern void rcu_nmi_exit(void);
do {\
BUG_ON(!in_nmi());  \
__preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET);   \
-   printk_nmi_exit();  \
arch_nmi_exit();\
lockdep_on();   \
} while (0)
diff --git a/include/linux/printk.h b/include/linux/printk.h
index 2476796c1150..77f66625706e 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -149,18 +149,6 @@ static inline __printf(1, 2) __cold
 void early_printk(const char *s, ...) { }
 #endif
 
-#ifdef CONFIG_PRINTK_NMI
-extern void printk_nmi_enter(void);
-extern void printk_nmi_exit(void);
-extern void printk_nmi_direct_enter(void);
-extern void printk_nmi_direct_exit(void);
-#else
-static inline void printk_nmi_enter(void) { }
-static inline void printk_nmi_exit(void) { }
-static inline void printk_nmi_direct_enter(void) { }
-static inline void printk_nmi_direct_exit(void) { }
-#endif /* PRINTK_NMI */
-
 struct dev_printk_info;
 
 #ifdef CONFIG_PRINTK
diff --git a/init/Kconfig b/init/Kconfig
index 096e1af5c586..ea58c0d30a97 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1482,11 +1482,6 @@ config PRINTK
  very difficult to diagnose system problems, saying N here is
  strongly discouraged.
 
-config PRINTK_NMI
-   def_bool y
-   depends on PRINTK
-   depends on HAVE_NMI
-
 config BUG
bool "BUG() support" if EXPERT
default y
diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index 6cc35c5de890..ff890ae3ee6a 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -6,12 +6,6 @@
 
 #ifdef CONFIG_PRINTK
 
-#define PRINTK_SAFE_CONTEXT_MASK   0x007ff
-#define PRINTK_NMI_DIRECT_CONTEXT_MASK 0x00800
-#define PRINTK_NMI_CONTEXT_MASK0xff000
-
-#define PRINTK_NMI_CONTEXT_OFFSET  0x01000
-
 __printf(4, 0)
 int vprintk_store(int facility, int level,
  const struct dev_printk_info *dev_info,
diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
index 4b5df5c27334..4da1ab3332d6 100644
--- a/kernel/printk/printk_safe.c
+++ b/kernel/printk/printk_safe.c
@@ -4,12 +4,9 @@
  */
 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -17,35 +14,6 @@
 
 static DEFINE_PER_CPU(int, printk_context);
 
-#ifdef CONFIG_PRINTK_NMI
-void noinstr printk_nmi_enter(void)
-{
-   this_cpu_add(printk_context, PRINTK_NMI_CONTEXT_OFFSET);
-}
-
-void noinstr printk_nmi

Re: [PATCH] printk: rename vprintk_func to vprintk

On 2021-03-30, Petr Mladek  wrote:
> On Tue 2021-03-23 15:42:01, Rasmus Villemoes wrote:
>> The printk code is already hard enough to understand. Remove an
>> unnecessary indirection by renaming vprintk_func to vprintk (adding
>> the asmlinkage annotation), and removing the vprintk definition from
>> printk.c. That way, printk is implemented in terms of vprintk as one
>> would expect, and there's no "vprintk_func, what's that? Some function
>> pointer that gets set where?"
>> 
>> The declaration of vprintk in linux/printk.h already has the
>> __printf(1,0) attribute, there's no point repeating that with the
>> definition - it's for diagnostics in callers.
>> 
>> linux/printk.h already contains a static inline {return 0;} definition
>> of vprintk when !CONFIG_PRINTK.
>> 
>> Since the corresponding stub definition of vprintk_func was not marked
>> "static inline", any translation unit including internal.h would get a
>> definition of vprintk_func - it just so happens that for
>> !CONFIG_PRINTK, there is precisely one such TU, namely printk.c. Had
>> there been more, it would be a link error; now it's just a silly waste
>> of a few bytes of .text, which one must assume are rather precious to
>> anyone disabling PRINTK.
>> 
>> $ objdump -dr kernel/printk/printk.o
>> 0330 :
>>  330:   31 c0   xor%eax,%eax
>>  332:   c3  ret
>>  333:   8d b4 26 00 00 00 00    lea0x0(%esi,%eiz,1),%esi
>>  33a:   8d b6 00 00 00 00   lea0x0(%esi),%esi
>> 
>> Signed-off-by: Rasmus Villemoes 
>
> Nice clean up!
>
> Reviewed-by: Petr Mladek 
>
> John,
>
> it conflicts with the patchset removing printk safe buffers[1].
> Would you prefer to queue this into the patchset?
> Or should I push it into printk/linux.git, printk-rework and you would
> base v2 on top of it?

Please push it to printk-rework. I will base my v2 on top of it.

Thanks.

John

Re: [PATCH] kernel/printk.c: Fixed mundane typos

On 2021-03-30, Petr Mladek  wrote:
> On Sun 2021-03-28 10:09:32, Bhaskar Chowdhury wrote:
>> 
>> s/sempahore/semaphore/
>> s/exacly/exactly/
>> s/unregistred/unregistered/
>> s/interation/iteration/
>> 
>> 
>> Signed-off-by: Bhaskar Chowdhury 
>
> Reviewed-by: Petr Mladek 
>
> John,
>
> it conflicts with the patchset removing printk safe buffers[1].
> Would you prefer to queue this into the patchset?
> Or should I push it into printk/linux.git, printk-rework and you would
> base v2 on top of it?

Go ahead and push it to printk-rework. I'll base v2 on top of it.

Thanks.

John

Re: [PATCH] mm: gup: remove FOLL_SPLIT


On 3/29/21 11:24 PM, kernel test robot wrote:

Hi Yang,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-linux-mm/master]

url:
https://github.com/0day-ci/linux/commits/Yang-Shi/mm-gup-remove-FOLL_SPLIT/20210330-034042
base:   https://github.com/hnaz/linux-mm master
config: s390-randconfig-r032-20210330 (attached as .config)
compiler: s390-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
 wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
 chmod +x ~/bin/make.cross
 # 
https://github.com/0day-ci/linux/commit/c8563a636718f98af86a3965d94e25b8f2cf2354
 git remote add linux-review https://github.com/0day-ci/linux
 git fetch --no-tags linux-review 
Yang-Shi/mm-gup-remove-FOLL_SPLIT/20210330-034042
 git checkout c8563a636718f98af86a3965d94e25b8f2cf2354
 # save the attached .config to linux build tree
 COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross 
ARCH=s390

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

arch/s390/mm/gmap.c: In function 'thp_split_mm':

arch/s390/mm/gmap.c:2498:27: error: 'FOLL_SPLIT' undeclared (first use in this 
function); did you mean 'FOLL_PIN'?

 2498 |follow_page(vma, addr, FOLL_SPLIT);
  |   ^~
  |   FOLL_PIN
arch/s390/mm/gmap.c:2498:27: note: each undeclared identifier is reported 
only once for each function it appears in


vim +2498 arch/s390/mm/gmap.c


There appears to be an imperfection in this 0day testing system, because (just 
as the patch
says), commit ba925fa35057a062ac98c3e8138b013ce4ce351c ("s390/gmap: improve THP 
splitting"),
July 29, 2020, removes the above use of FOLL_SPLIT.

And "git grep", just to be sure, shows it is not there in today's linux.git. So 
I guess the
https://github.com/0day-ci/linux repo needs a better way to stay in sync?


thanks,
--
John Hubbard
NVIDIA

Re: [PATCH] mm: gup: remove FOLL_SPLIT