[PATCH] tracing: Use init_utsname()->release

2024-02-22 Thread John Garry
Instead of using UTS_RELEASE, use init_utsname()->release, which means that
we don't need to rebuild the code just for the git head commit changing.

Signed-off-by: John Garry 
---
I originally sent an RFC using new string uts_release, but that
string is not needed as we can use init_utsname()->release instead.

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8198bfc54b58..1416bf72f3f4 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -13,7 +13,7 @@
  *  Copyright (C) 2004 Nadia Yvette Chambers
  */
 #include 
-#include <generated/utsrelease.h>
+#include <linux/utsname.h>
 #include 
 #include 
 #include 
@@ -4368,7 +4368,7 @@ print_trace_header(struct seq_file *m, struct 
trace_iterator *iter)
get_total_entries(buf, &total, &entries);
 
seq_printf(m, "# %s latency trace v1.1.5 on %s\n",
-  name, UTS_RELEASE);
+  name, init_utsname()->release);
seq_puts(m, "# ---"
 "-\n");
seq_printf(m, "# latency: %lu us, #%lu/%lu, CPU#%d |"
-- 
2.31.1




Re: [PATCH RFC 0/4] Introduce uts_release

2024-02-21 Thread John Garry

On 08/02/2024 10:08, John Garry wrote:

On 05/02/2024 23:10, Masahiro Yamada wrote:

I think what you can contribute are:

   - Explore the UTS_RELEASE users, and check if you can get rid of it.

Unfortunately I expect resistance to this. I also expect that for places like the FW
loader it is necessary. And when this is used in sysfs, people will say
that it is part of the ABI now.

How about I send the patch to update these to use init_uts_ns, and also mention
that it would be better not to use it at all, if possible? I can cc you.


OK.


As I mentioned in the previous reply, the replacement is safe
for builtin code.

When you touch modular code, please pay a little more care,
because UTS_RELEASE and init_utsname()->release
may differ when CONFIG_MODVERSIONS=y.



Are you saying that, with CONFIG_MODVERSIONS=y, we may have a kernel and a
module built from different release versions, and the module is using
UTS_RELEASE for something? That something may be setting some info
in a sysfs file, like in this example:


static ssize_t target_core_item_version_show(struct config_item *item,
		char *page)
{
	return sprintf(page, "Target Engine Core ConfigFS Infrastructure %s"
		" on %s/%s on "UTS_RELEASE"\n", TARGET_CORE_VERSION,
		utsname()->sysname, utsname()->machine);
}

And the intention is to use the module codebase's release version and not
the kernel codebase's release version. Hence utsname() is used for
.sysname and .machine, but not for .release.


Hi Masahiro,

Can you comment on whether I am right about CONFIG_MODVERSIONS, above?

Thanks,
John



Re: [PATCH RFC 0/4] Introduce uts_release

2024-02-08 Thread John Garry

On 05/02/2024 23:10, Masahiro Yamada wrote:

I think what you can contribute are:

   - Explore the UTS_RELEASE users, and check if you can get rid of it.

Unfortunately I expect resistance to this. I also expect that for places like the FW
loader it is necessary. And when this is used in sysfs, people will say
that it is part of the ABI now.

How about I send the patch to update these to use init_uts_ns, and also mention
that it would be better not to use it at all, if possible? I can cc you.


OK.


As I mentioned in the previous reply, the replacement is safe
for builtin code.

When you touch modular code, please pay a little more care,
because UTS_RELEASE and init_utsname()->release
may differ when CONFIG_MODVERSIONS=y.



Are you saying that, with CONFIG_MODVERSIONS=y, we may have a kernel and a
module built from different release versions, and the module is using
UTS_RELEASE for something? That something may be setting some info
in a sysfs file, like in this example:


static ssize_t target_core_item_version_show(struct config_item *item,
		char *page)
{
	return sprintf(page, "Target Engine Core ConfigFS Infrastructure %s"
		" on %s/%s on "UTS_RELEASE"\n", TARGET_CORE_VERSION,
		utsname()->sysname, utsname()->machine);
}

And the intention is to use the module codebase's release version and not
the kernel codebase's release version. Hence utsname() is used for
.sysname and .machine, but not for .release.


Thanks,
John



Re: [PATCH RFC 0/4] Introduce uts_release

2024-02-05 Thread John Garry

On 02/02/2024 15:01, Masahiro Yamada wrote:

--
2.35.3


As you see, several drivers store UTS_RELEASE in their driver data,
and even print it in debug print.


I do not see why it is useful.


I would tend to agree, and mentioned that earlier.


As you discussed in 3/4, if UTS_RELEASE is unneeded,
it is better to get rid of it.


Jakub replied about this.




If such version information is useful for drivers, the question is whether
the intention is the version of the module or the version of vmlinux.
They differ when CONFIG_MODVERSIONS=y.



I think often this information in UTS_RELEASE is shared as informative 
only, so the user can conveniently know the specific kernel git version.




When module developers intend to printk the git version
from which the module was compiled from,
presumably they want to use UTS_RELEASE, which
was expanded at the compile time of the module.

If you replace it with uts_release, it is the git version
of vmlinux.


Of course, the replacement is safe for always-builtin code.



Lastly, we can avoid using UTS_RELEASE without relying
on your patch.



For example, commit 3a3a11e6e5a2bc0595c7e36ae33c861c9e8c75b1
replaced UTS_RELEASE with init_uts_ns.name.release.


So, is your uts_release a shorthand of init_uts_ns.name.release?


Yes - well, both are strings containing UTS_RELEASE. Using a struct
sub-member is a bit ungainly, but I suppose that we should not be making
life easy for people using this.


However, we already have init_utsname() in:

static inline struct new_utsname *init_utsname(void)
{
return &init_uts_ns.name;
}

So we could use init_utsname()->release, which is a bit nicer.
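
For example (a trivial sketch, assuming <linux/utsname.h> is already
included):

	/* read the release string from the running kernel rather than
	 * baking it in at compile time via UTS_RELEASE
	 */
	pr_info("running on kernel %s\n", init_utsname()->release);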





I think what you can contribute are:

  - Explore the UTS_RELEASE users, and check if you can get rid of it.


Unfortunately I expect resistance to this. I also expect that for places like the FW
loader it is necessary. And when this is used in sysfs, people will say
that it is part of the ABI now.


How about I send the patch to update these to use init_uts_ns, and also mention
that it would be better not to use it at all, if possible? I can cc you.




  - Where UTS_RELEASE is useful, consider if it is possible
to replace it with init_uts_ns.name.release


OK, but, as above, we could also use init_utsname()->release

Thanks,
John




Re: [PATCH RFC 3/4] net: ethtool: Use uts_release

2024-02-01 Thread John Garry

On 01/02/2024 16:09, Jakub Kicinski wrote:

On Thu, 1 Feb 2024 14:20:23 +0100 Jiri Pirko wrote:

BTW, I assume that changes like this are also ok:

8<-

   net: team: Don't bother filling in ethtool driver version


Yup, just to be clear - you can send this independently from the series,


Sure, and I think rocker and staging/octeon have this unnecessary
code as well.



tag is as

  [PATCH net-next]


ah, yes



we'll take it via the networking tree.


Thanks. I assume Greg - the staging maintainer - would take the octeon 
patch.



I'm not sure which tree the other
patches will go thru..


I think that the best thing to do is get a minimal set in for 6.9 and 
then merge the rest in the next cycle. I've got about 22 patches in 
total now, but I think that there will be more. We'll see who can pick 
up the first set when I send it officially.


Thanks,
John




Re: [PATCH RFC 3/4] net: ethtool: Use uts_release

2024-02-01 Thread John Garry

On 31/01/2024 19:24, Jakub Kicinski wrote:

On Wed, 31 Jan 2024 10:48:50 + John Garry wrote:

Instead of using UTS_RELEASE, use uts_release, which means that we don't
need to rebuild the code just for the git head commit changing.

Signed-off-by: John Garry

Yes, please!

Acked-by: Jakub Kicinski


Cheers

BTW, I assume that changes like this are also ok:

8<-

   net: team: Don't bother filling in ethtool driver version

   The version is the same as the default, so don't bother.

   Signed-off-by: John Garry 

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index f575f225d417..0a44bbdcfb7b 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -25,7 +25,6 @@
#include 
#include 
#include 
-#include <generated/utsrelease.h>
#include 

#define DRV_NAME "team"
@@ -2074,7 +2073,6 @@ static void team_ethtool_get_drvinfo(struct
net_device *dev,
struct ethtool_drvinfo *drvinfo)
{
   strscpy(drvinfo->driver, DRV_NAME, sizeof(drvinfo->driver));
-   strscpy(drvinfo->version, UTS_RELEASE, sizeof(drvinfo->version));
}

>8-

right?

John







Re: [PATCH RFC 0/4] Introduce uts_release

2024-01-31 Thread John Garry

On 31/01/2024 16:22, Greg KH wrote:

before:
real0m53.591s
user1m1.842s
sys 0m9.161s

after:
real0m37.481s
user0m46.461s
sys 0m7.199s

Sending as an RFC as I need to test more of the conversions and I would
like to also convert more UTS_RELEASE users to prove this is proper
approach.

I like it, I also think that v4l2 includes this as well as all of those
drivers seem to rebuild when this changes, does that not happen for you
too?


I didn't see that. Were you building for arm64? I can see some v4l2
configs enabled there for the vanilla defconfig (but none for x86-64).




Anyway, if the firmware changes work, I'm all for this, thanks for
taking it on!


Cheers. BTW, I just noticed that utsrelease.h might not be updated for my
allmodconfig build when I change the head commit - I'll investigate why ...


John



[PATCH RFC 1/4] init: Add uts_release

2024-01-31 Thread John Garry
Add a char [] for UTS_RELEASE so that we don't need to rebuild code which
references UTS_RELEASE.

Signed-off-by: John Garry 
---
 include/linux/utsname.h | 1 +
 init/version.c  | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index bf7613ba412b..15b0b1c9a9ee 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -88,5 +88,6 @@ static inline struct new_utsname *init_utsname(void)
 }
 
 extern struct rw_semaphore uts_sem;
+extern const char uts_release[];
 
 #endif /* _LINUX_UTSNAME_H */
diff --git a/init/version.c b/init/version.c
index 94c96f6fbfe6..87fecdd4fbfb 100644
--- a/init/version.c
+++ b/init/version.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include <generated/utsrelease.h>
 
 static int __init early_hostname(char *arg)
 {
@@ -48,6 +49,8 @@ BUILD_LTO_INFO;
 
 struct uts_namespace init_uts_ns __weak;
 const char linux_banner[] __weak;
+const char uts_release[] = UTS_RELEASE;
+EXPORT_SYMBOL_GPL(uts_release);
 
 #include "version-timestamp.c"
 
-- 
2.35.3




[PATCH RFC 0/4] Introduce uts_release

2024-01-31 Thread John Garry
When hacking, it is a waste of time and compute energy to have to rebuild
so much kernel code just for changing the head git commit, like this:

> touch include/generated/utsrelease.h 
> time make  -j3
mkdir -p /home/john/mnt_sda4/john/kernel-dev2/tools/objtool && make 
O=/home/john/mnt_sda4/john/kernel-dev2 subdir=tools/objtool 
--no-print-directory -C objtool 
  INSTALL libsubcmd_headers
  CALLscripts/checksyscalls.sh
  CC  init/version.o
  AR  init/built-in.a
  CC  kernel/sys.o
  CC  kernel/module/main.o
  AR  kernel/module/built-in.a
  CC  drivers/base/firmware_loader/main.o
  CC  kernel/trace/trace.o
  AR  drivers/base/firmware_loader/built-in.a
  AR  drivers/base/built-in.a
  CC  net/ethtool/ioctl.o
  AR  kernel/trace/built-in.a
  AR  kernel/built-in.a
  AR  net/ethtool/built-in.a
  AR  net/built-in.a
  AR  drivers/built-in.a
  AR  built-in.a
  ...

Files like drivers/base/firmware_loader/main.c need to be recompiled as
they include generated/utsrelease.h for the UTS_RELEASE macro, and utsrelease.h
is regenerated when the head commit changes.

Introduce a global char uts_release[] in init/version.c, which the code
mentioned above can use instead of UTS_RELEASE, meaning that we don't need
to rebuild it when the head commit changes - only init/version.c needs to be
rebuilt. Whether all the references to UTS_RELEASE in the codebase are
proper is a different matter.

For an x86_64 defconfig build for this series on my old laptop, here is
before and after rebuild time:

before:
real0m53.591s
user1m1.842s
sys 0m9.161s

after:
real0m37.481s
user0m46.461s
sys 0m7.199s

Sending as an RFC as I need to test more of the conversions and I would
like to also convert more UTS_RELEASE users to prove this is proper
approach.

John Garry (4):
  init: Add uts_release
  tracing: Use uts_release
  net: ethtool: Use uts_release
  firmware_loader: Use uts_release

 drivers/base/firmware_loader/main.c | 39 +++--
 include/linux/utsname.h |  1 +
 init/version.c  |  3 +++
 kernel/trace/trace.c|  4 +--
 net/ethtool/ioctl.c |  4 +--
 5 files changed, 39 insertions(+), 12 deletions(-)

-- 
2.35.3




[PATCH RFC 4/4] firmware_loader: Use uts_release

2024-01-31 Thread John Garry
Instead of using UTS_RELEASE, use uts_release, which means that we don't
need to rebuild the code just for the git head commit changing.

Since UTS_RELEASE was used in fw_path[], which points to const data,
append uts_release dynamically to an intermediate string instead.

Signed-off-by: John Garry 
---
 drivers/base/firmware_loader/main.c | 39 +++--
 1 file changed, 31 insertions(+), 8 deletions(-)

diff --git a/drivers/base/firmware_loader/main.c 
b/drivers/base/firmware_loader/main.c
index ea28102d421e..87da7be61a29 100644
--- a/drivers/base/firmware_loader/main.c
+++ b/drivers/base/firmware_loader/main.c
@@ -38,7 +38,7 @@
 #include 
 #include 
 
-#include <generated/utsrelease.h>
+#include <linux/utsname.h>
 
 #include "../base.h"
 #include "firmware.h"
@@ -471,9 +471,9 @@ static int fw_decompress_xz(struct device *dev, struct 
fw_priv *fw_priv,
 static char fw_path_para[256];
 static const char * const fw_path[] = {
fw_path_para,
-   "/lib/firmware/updates/" UTS_RELEASE,
+   "/lib/firmware/updates/", /* UTS_RELEASE is appended later */
"/lib/firmware/updates",
-   "/lib/firmware/" UTS_RELEASE,
+   "/lib/firmware/", /* UTS_RELEASE is appended later */
"/lib/firmware"
 };
 
@@ -496,7 +496,7 @@ fw_get_filesystem_firmware(struct device *device, struct 
fw_priv *fw_priv,
size_t size;
int i, len, maxlen = 0;
int rc = -ENOENT;
-   char *path, *nt = NULL;
+   char *path, *fw_path_string, *nt = NULL;
size_t msize = INT_MAX;
void *buffer = NULL;
 
@@ -510,6 +510,12 @@ fw_get_filesystem_firmware(struct device *device, struct 
fw_priv *fw_priv,
if (!path)
return -ENOMEM;
 
+   fw_path_string = __getname();
+   if (!fw_path_string) {
+   __putname(path);
+   return -ENOMEM;
+   }
+
wait_for_initramfs();
for (i = 0; i < ARRAY_SIZE(fw_path); i++) {
size_t file_size = 0;
@@ -519,16 +525,32 @@ fw_get_filesystem_firmware(struct device *device, struct 
fw_priv *fw_priv,
if (!fw_path[i][0])
continue;
 
+   len = snprintf(fw_path_string, PATH_MAX, "%s", fw_path[i]);
+   if (len >= PATH_MAX) {
+   rc = -ENAMETOOLONG;
+   break;
+   }
+
+   /* Special handling to append UTS_RELEASE */
+   if (fw_path[i][len - 1] == '/') {
+   len = snprintf(fw_path_string, PATH_MAX, "%s%s",
+   fw_path[i], uts_release);
+   if (len >= PATH_MAX) {
+   rc = -ENAMETOOLONG;
+   break;
+   }
+   }
+
/* strip off \n from customized path */
-   maxlen = strlen(fw_path[i]);
+   maxlen = strlen(fw_path_string);
if (i == 0) {
-   nt = strchr(fw_path[i], '\n');
+   nt = strchr(fw_path_string, '\n');
if (nt)
-   maxlen = nt - fw_path[i];
+   maxlen = nt - fw_path_string;
}
 
len = snprintf(path, PATH_MAX, "%.*s/%s%s",
-  maxlen, fw_path[i],
+  maxlen, fw_path_string,
   fw_priv->fw_name, suffix);
if (len >= PATH_MAX) {
rc = -ENAMETOOLONG;
@@ -585,6 +607,7 @@ fw_get_filesystem_firmware(struct device *device, struct 
fw_priv *fw_priv,
break;
}
__putname(path);
+   __putname(fw_path_string);
 
return rc;
 }
-- 
2.35.3




[PATCH RFC 3/4] net: ethtool: Use uts_release

2024-01-31 Thread John Garry
Instead of using UTS_RELEASE, use uts_release, which means that we don't
need to rebuild the code just for the git head commit changing.

Signed-off-by: John Garry 
---
 net/ethtool/ioctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ethtool/ioctl.c b/net/ethtool/ioctl.c
index 7519b0818b91..81d052505f67 100644
--- a/net/ethtool/ioctl.c
+++ b/net/ethtool/ioctl.c
@@ -31,7 +31,7 @@
 #include 
 #include 
 #include 
-#include <generated/utsrelease.h>
+#include <linux/utsname.h>
 #include "common.h"
 
 /* State held across locks and calls for commands which have devlink fallback 
*/
@@ -713,7 +713,7 @@ ethtool_get_drvinfo(struct net_device *dev, struct 
ethtool_devlink_compat *rsp)
struct device *parent = dev->dev.parent;
 
rsp->info.cmd = ETHTOOL_GDRVINFO;
-   strscpy(rsp->info.version, UTS_RELEASE, sizeof(rsp->info.version));
+   strscpy(rsp->info.version, uts_release, sizeof(rsp->info.version));
if (ops->get_drvinfo) {
ops->get_drvinfo(dev, &rsp->info);
if (!rsp->info.bus_info[0] && parent)
-- 
2.35.3




[PATCH RFC 2/4] tracing: Use uts_release

2024-01-31 Thread John Garry
Instead of using UTS_RELEASE, use uts_release, which means that we don't
need to rebuild the code just for the git head commit changing.

Signed-off-by: John Garry 
---
 kernel/trace/trace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2a7c6fd934e9..68513924beb4 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -13,7 +13,7 @@
  *  Copyright (C) 2004 Nadia Yvette Chambers
  */
 #include 
-#include <generated/utsrelease.h>
+#include <linux/utsname.h>
 #include 
 #include 
 #include 
@@ -4354,7 +4354,7 @@ print_trace_header(struct seq_file *m, struct 
trace_iterator *iter)
get_total_entries(buf, &total, &entries);
 
seq_printf(m, "# %s latency trace v1.1.5 on %s\n",
-  name, UTS_RELEASE);
+  name, uts_release);
seq_puts(m, "# ---"
 "-\n");
seq_printf(m, "# latency: %lu us, #%lu/%lu, CPU#%d |"
-- 
2.35.3




Re: [PATCH v3 2/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU

2021-04-20 Thread John Garry

On 15/04/2021 13:48, Qi Liu wrote:

PCIe PMU Root Complex Integrated End Point(RCiEP) device is supported
to sample bandwidth, latency, buffer occupation etc.

Each PMU RCiEP device monitors multiple Root Ports, and each RCiEP is
registered as a PMU in /sys/bus/event_source/devices, so users can
select target PMU, and use filter to do further sets.

Filtering options contains:
event- select the event.
subevent - select the subevent.
port - select target Root Ports. Information of Root Ports
are shown under sysfs.
bdf  - select requester_id of target EP device.
trig_len - set trigger condition for starting event statistics.
trigger_mode - set trigger mode. 0 means starting to statistic when
bigger than trigger condition, and 1 means smaller.
thr_len  - set threshold for statistics.
thr_mode - set threshold mode. 0 means count when bigger than
threshold, and 1 means smaller.

Signed-off-by: Qi Liu 


Some minor items and nits with coding style below, but generally looks ok:

Reviewed-by: John Garry 


---
  MAINTAINERS|6 +
  drivers/perf/Kconfig   |2 +
  drivers/perf/Makefile  |1 +
  drivers/perf/pci/Kconfig   |   16 +
  drivers/perf/pci/Makefile  |2 +
  drivers/perf/pci/hisilicon/Makefile|3 +
  drivers/perf/pci/hisilicon/hisi_pcie_pmu.c | 1014 
  include/linux/cpuhotplug.h |1 +
  8 files changed, 1045 insertions(+)
  create mode 100644 drivers/perf/pci/Kconfig
  create mode 100644 drivers/perf/pci/Makefile
  create mode 100644 drivers/perf/pci/hisilicon/Makefile
  create mode 100644 drivers/perf/pci/hisilicon/hisi_pcie_pmu.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 7fdc513..efe06cd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8084,6 +8084,12 @@ W:   http://www.hisilicon.com
  F:Documentation/admin-guide/perf/hisi-pmu.rst
  F:drivers/perf/hisilicon
  
+HISILICON PCIE PMU DRIVER

+M: Qi Liu 
+S: Maintained
+F: Documentation/admin-guide/perf/hisi-pcie-pmu.rst
+F: drivers/perf/pci/hisilicon/hisi_pcie_pmu.c
+
  HISILICON QM AND ZIP Controller DRIVER
  M:Zhou Wang 
  L:linux-cry...@vger.kernel.org
diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 77522e5..ddd82fa 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -139,4 +139,6 @@ config ARM_DMC620_PMU
  
  source "drivers/perf/hisilicon/Kconfig"
  
+source "drivers/perf/pci/Kconfig"

+
  endmenu
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index 5260b11..1208c08 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -14,3 +14,4 @@ obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o
  obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
  obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
  obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o
+obj-y += pci/
diff --git a/drivers/perf/pci/Kconfig b/drivers/perf/pci/Kconfig
new file mode 100644
index 000..9f30291
--- /dev/null
+++ b/drivers/perf/pci/Kconfig
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# PCIe Performance Monitor Drivers
+#
+menu "PCIe Performance Monitor"
+
+config HISI_PCIE_PMU
+   tristate "HiSilicon PCIE PERF PMU"
+   depends on (ARM64 && PCI) || COMPILE_TEST
+   help
+ Provide support for HiSilicon PCIe performance monitoring unit (PMU)
+ RCiEP devices.
+ Adds the PCIe PMU into perf events system for monitoring latency,
+ bandwidth etc.
+
+endmenu
diff --git a/drivers/perf/pci/Makefile b/drivers/perf/pci/Makefile
new file mode 100644
index 000..a56b1a9
--- /dev/null
+++ b/drivers/perf/pci/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y += hisilicon/
diff --git a/drivers/perf/pci/hisilicon/Makefile 
b/drivers/perf/pci/hisilicon/Makefile
new file mode 100644
index 000..65b0bd7
--- /dev/null
+++ b/drivers/perf/pci/hisilicon/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+obj-$(CONFIG_HISI_PCIE_PMU) += hisi_pcie_pmu.o
diff --git a/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c 
b/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c
new file mode 100644
index 000..415bf39
--- /dev/null
+++ b/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c
@@ -0,0 +1,1014 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * This driver adds support for PCIe PMU RCiEP device. Related
+ * perf events are bandwidth, bandwidth utilization, latency
+ * etc.
+ *
+ * Copyright (C) 2021 HiSilicon Limited
+ * Author: Qi Liu
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Define registers */
+#define HISI_PCIE_GLOBAL_CTRL  0x00
+#define HISI_PCIE_EVENT_CTRL   0x010
+#define HISI_PCIE_CNT  0x090
+#defi

Re: [PATCH] scsi: core: Cap initial sdev queue depth at shost.can_queue

2021-04-20 Thread John Garry

On 20/04/2021 01:02, Ming Lei wrote:

On Tue, Apr 20, 2021 at 12:06:24AM +0800, John Garry wrote:

Function sdev_store_queue_depth() enforces that the sdev queue depth cannot
exceed shost.can_queue.

However, the LLDD may still set cmd_per_lun > can_queue, which leads to an
initial sdev queue depth greater than can_queue.

Stop this from happening by capping the initial sdev queue depth at can_queue.

Signed-off-by: John Garry 
---
Topic originally discussed at:
https://lore.kernel.org/linux-scsi/85dec8eb-8eab-c7d6-b0fb-5622747c5...@interlog.com/T/#m5663d0cac657d843b93d0c9a2374f98fc04384b9

Last idea there was to error/warn in scsi_add_host() for cmd_per_lun >




Hi Ming,


No, that isn't my suggestion.


Right, it was what I mentioned.




can_queue. However, such a shost driver could still configure the sdev
queue depth to a sound value in the .slave_configure callback, so now I'm thinking
the orig patch is better.


As I mentioned last time, why can't we fix ->cmd_per_lun in
scsi_add_host() using .can_queue?



I would rather not change the values which are provided by the driver.
I would rather take the original values and try to use them in a sane way.


I have not seen other places where driver shost config values are 
modified by the core code.
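
For reference, that suggestion would amount to something roughly like this
in scsi_add_host() - just a sketch of the alternative being discussed, not
what this patch does:

	/* hypothetical placement in scsi_add_host(): clamp the
	 * driver-provided cmd_per_lun against can_queue
	 */
	if (shost->cmd_per_lun > shost->can_queue) {
		shost_printk(KERN_WARNING, shost,
			     "cmd_per_lun %d > can_queue %d, capping\n",
			     shost->cmd_per_lun, shost->can_queue);
		shost->cmd_per_lun = shost->can_queue;
	}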


Thanks,
John


[PATCH] scsi: core: Cap initial sdev queue depth at shost.can_queue

2021-04-19 Thread John Garry
Function sdev_store_queue_depth() enforces that the sdev queue depth cannot
exceed shost.can_queue.

However, the LLDD may still set cmd_per_lun > can_queue, which leads to an
initial sdev queue depth greater than can_queue.

Stop this from happening by capping the initial sdev queue depth at can_queue.

Signed-off-by: John Garry 
---
Topic originally discussed at:
https://lore.kernel.org/linux-scsi/85dec8eb-8eab-c7d6-b0fb-5622747c5...@interlog.com/T/#m5663d0cac657d843b93d0c9a2374f98fc04384b9

Last idea there was to error/warn in scsi_add_host() for cmd_per_lun >
can_queue. However, such a shost driver could still configure the sdev
queue depth to a sound value in the .slave_configure callback, so now I'm thinking
the orig patch is better.

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 9f1b7f3c650a..8de2f830bcdc 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -277,7 +277,11 @@ static struct scsi_device *scsi_alloc_sdev(struct 
scsi_target *starget,
WARN_ON_ONCE(!blk_get_queue(sdev->request_queue));
sdev->request_queue->queuedata = sdev;
 
-   depth = sdev->host->cmd_per_lun ?: 1;
+   if (sdev->host->cmd_per_lun)
+   depth = min_t(unsigned int, sdev->host->cmd_per_lun,
+ sdev->host->can_queue);
+   else
+   depth = 1;
 
/*
 * Use .can_queue as budget map's depth because we have to
-- 
2.26.2



Re: [PATCH 5/8] iommu: fix a couple of spelling mistakes

2021-04-16 Thread John Garry

On 26/03/2021 06:24, Zhen Lei wrote:

There are several spelling mistakes, as follows:
funcions ==> functions
distiguish ==> distinguish
detroyed ==> destroyed

Signed-off-by: Zhen Lei


I think that there should be a /s/appropriatley/appropriately/ in iommu.c

Thanks,
john


Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator

2021-04-14 Thread John Garry

On 06/04/2021 17:54, John Garry wrote:

Hi Robin,



Sorry if the phrasing was unclear there - the allusion to default 
domains is new, it just occurred to me that what we do there is in 
fact fairly close to what I've suggested previously for this. In that 
case, we have a global policy set by the command line, which *can* be 
overridden per-domain via sysfs at runtime, provided the user is 
willing to tear the whole thing down. Using a similar approach here 
would give a fair degree of flexibility but still mean that changes 
never have to be made dynamically to a live domain.


So are you saying that we can handle it similarly to how we can now handle
changing the default domain for an IOMMU group via sysfs? If so, that is
just not practical here. The reason is that this particular DMA engine
provides the block device backing the / mount point, so if we unbind the
driver, we lose the / mount point.


And I am not sure if the end user would even know how to set such a 
tunable. Or, in this case, why the end user would not want the optimized 
range configured always.


I'd still rather if the device driver could provide info which can be 
used to configure this before or during probing.


As a new solution, how about doing both of these:
a. Add a per-IOMMU-group sysfs file to set this tunable. This works the same
as how we change the default domain, and has all the same
restrictions/steps. I think that this is what you are already suggesting.
b. Provide a DMA mapping API to set this value, similar to this current 
series. In the IOMMU backend for that API, we record a new range value 
and return -EPROBE_DEFER when successful. In the reprobe we reset the 
default domain for the devices' IOMMU group, with the IOVA domain rcache 
range configured as previously requested. Again, allocating the new 
default domain is similar to how we change default domain type today.
This means that we don't play with a live domain. Downside is that we 
need to defer the probe.


Thanks,
John


Re: [PATCH v2 1/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU

2021-04-13 Thread John Garry

On 13/04/2021 10:12, liuqi (BA) wrote:


I do wonder why we even need to maintain pcie_pmu->cpumask

Can't we just use cpu_online_mask as appropriate instead?


?

Sorry, missed it yesterday.
It seems that cpumask is always the same as cpu_online_mask, so do we need
to preserve the cpumask sysfs interface?


I'm not saying that we don't require the cpumask sysfs interface. I am 
just asking why you maintain a separate cpumask, when, as I said, they 
seem the same.


Thanks,
John


perf arm64 --topdown support (was "perf arm64 metricgroup support")

2021-04-13 Thread John Garry

On 08/04/2021 13:06, Jiri Olsa wrote:

perf stat --topdown is not supported, as this requires the CPU PMU to
expose (alias) events for the TopDown L1 metrics from sysfs, which arm
does not do. To get that to work, we probably need to make perf use the
pmu-events cpumap to learn about those alias events.


Hi guys,

About supporting the --topdown command for archs apart from x86: it does
not seem possible today. Support there is based on kernel support for
"topdown" CPU events used in the metric calculations. However, arm64,
for example, does not support these "topdown" events.


It seems to me that we could change this to use the pmu-events framework +
metricgroup support, rather than hardcoded events - has anyone
considered this approach previously? It seems a pretty big job, so I thought
I'd ask first ...


Thanks,
John


Re: [PATCH v2 1/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU

2021-04-12 Thread John Garry

On 12/04/2021 14:34, liuqi (BA) wrote:


Hi John,

Thanks for reviewing this.
On 2021/4/9 18:22, John Garry wrote:

On 09/04/2021 10:05, Qi Liu wrote:

PCIe PMU Root Complex Integrated End Point(RCiEP) device is supported
to sample bandwidth, latency, buffer occupation etc.

Each PMU RCiEP device monitors multiple Root Ports, and each RCiEP is
registered as a PMU in /sys/bus/event_source/devices, so users can
select target PMU, and use filter to do further sets.


side note: it would be good to mention what baseline the series is based 
on in the cover letter




Filtering options contains:
event    - select the event.
subevent - select the subevent.
port - select target Root Ports. Information of Root Ports
     are shown under sysfs.
bdf  - select requester_id of target EP device.
trig_len - set trigger condition for starting event statistics.
trigger_mode - set trigger mode. 0 means starting to statistic when
     bigger than trigger condition, and 1 means smaller.
thr_len  - set threshold for statistics.
thr_mode - set threshold mode. 0 means count when bigger than
     threshold, and 1 means smaller.

Signed-off-by: Qi Liu 
---
   MAINTAINERS    |    6 +
   drivers/perf/Kconfig   |    2 +
   drivers/perf/Makefile  |    1 +
   drivers/perf/pci/Kconfig   |   16 +
   drivers/perf/pci/Makefile  |    2 +
   drivers/perf/pci/hisilicon/Makefile    |    5 +
   drivers/perf/pci/hisilicon/hisi_pcie_pmu.c | 1011

   include/linux/cpuhotplug.h |    1 +
   8 files changed, 1044 insertions(+)
   create mode 100644 drivers/perf/pci/Kconfig
   create mode 100644 drivers/perf/pci/Makefile
   create mode 100644 drivers/perf/pci/hisilicon/Makefile
   create mode 100644 drivers/perf/pci/hisilicon/hisi_pcie_pmu.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 3353de0..46c7861 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8023,6 +8023,12 @@ W:    http://www.hisilicon.com
   F:    Documentation/admin-guide/perf/hisi-pmu.rst
   F:    drivers/perf/hisilicon
+HISILICON PCIE PMU DRIVER
+M:    Qi Liu 
+S:    Maintained
+F:    Documentation/admin-guide/perf/hisi-pcie-pmu.rst


nit: this does not exist yet...


thanks, I'll move this add-maintainer-part to the second patch.


that's why I advocate the documentation first :)


+F:    drivers/perf/pci/hisilicon/hisi_pcie_pmu.c
+
   HISILICON QM AND ZIP Controller DRIVER
   M:    Zhou Wang 
   L:    linux-cry...@vger.kernel.org
diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 3075cf1..99d4760 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -139,4 +139,6 @@ config ARM_DMC620_PMU
   source "drivers/perf/hisilicon/Kconfig"
+source "drivers/perf/pci/Kconfig"
+
   endmenu
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index 5260b11..1208c08 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -14,3 +14,4 @@ obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o
   obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
   obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
   obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o
+obj-y += pci/
diff --git a/drivers/perf/pci/Kconfig b/drivers/perf/pci/Kconfig
new file mode 100644
index 000..a119486
--- /dev/null
+++ b/drivers/perf/pci/Kconfig
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# PCIe Performance Monitor Drivers
+#
+menu "PCIe Performance Monitor"
+
+config HISI_PCIE_PMU
+    tristate "HiSilicon PCIE PERF PMU"
+    depends on ARM64 && PCI && HISI_PMU


What from HISI_PMU is needed? I couldn't find anything here

The event_sysfs_show() and format_sysfs_show() functions of
hisi_uncore_pmu.h can be reused in hisi_pcie_pmu.c, so I add the path in the
Makefile and include "hisi_uncore_pmu.h".


Right, but it would be nice to be able to build this under COMPILE_TEST.
CONFIG_HISI_PMU cannot be built under COMPILE_TEST, so it would be nice
not to depend on it.


So you could put hisi_event_sysfs_show() as a static inline in 
hisi_uncore_pmu.h, so the dependency can be removed


Having said that, there is nothing really hisi specific in those 
functions like hisi_event_sysfs_show().


Can't we just create generic functions here?

hisi_event_sysfs_show() == cci400_pmu_cycle_event_show() == 
xgene_pmu_event_show()
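
Something like the following could then live in a shared header instead of
being duplicated per driver - a rough sketch with a made-up name, assuming
the event config is stashed in dev_ext_attribute.var as the existing
helpers do:

static inline ssize_t pmu_event_sysfs_show(struct device *dev,
					   struct device_attribute *attr,
					   char *buf)
{
	struct dev_ext_attribute *eattr =
		container_of(attr, struct dev_ext_attribute, attr);

	/* print the config value stored when the attribute was created */
	return sysfs_emit(buf, "config=0x%lx\n", (unsigned long)eattr->var);
}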







+    help
+  Provide support for HiSilicon PCIe performance monitoring unit
(PMU)
+  RCiEP devices.
+  Adds the PCIe PMU into perf events system for monitoring latency,
+  bandwidth etc.
+







+#define HISI_PCIE_CHECK_EVENTS(name, list) \


"check" in a function name is too vague, as it does not imply any
side-effect from calling this function.

And I think "build" or similar would be good to be included in the macro
name

Re: [PATCH v2 1/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU

2021-04-09 Thread John Garry

On 09/04/2021 10:05, Qi Liu wrote:

PCIe PMU Root Complex Integrated End Point(RCiEP) device is supported
to sample bandwidth, latency, buffer occupation etc.

Each PMU RCiEP device monitors multiple Root Ports, and each RCiEP is
registered as a PMU in /sys/bus/event_source/devices, so users can
select target PMU, and use filter to do further sets.

Filtering options contains:
event- select the event.
subevent - select the subevent.
port - select target Root Ports. Information of Root Ports
are shown under sysfs.
bdf  - select requester_id of target EP device.
trig_len - set trigger condition for starting event statistics.
trigger_mode - set trigger mode. 0 means starting to statistic when
bigger than trigger condition, and 1 means smaller.
thr_len  - set threshold for statistics.
thr_mode - set threshold mode. 0 means count when bigger than
threshold, and 1 means smaller.

Signed-off-by: Qi Liu 
---
  MAINTAINERS|6 +
  drivers/perf/Kconfig   |2 +
  drivers/perf/Makefile  |1 +
  drivers/perf/pci/Kconfig   |   16 +
  drivers/perf/pci/Makefile  |2 +
  drivers/perf/pci/hisilicon/Makefile|5 +
  drivers/perf/pci/hisilicon/hisi_pcie_pmu.c | 1011 
  include/linux/cpuhotplug.h |1 +
  8 files changed, 1044 insertions(+)
  create mode 100644 drivers/perf/pci/Kconfig
  create mode 100644 drivers/perf/pci/Makefile
  create mode 100644 drivers/perf/pci/hisilicon/Makefile
  create mode 100644 drivers/perf/pci/hisilicon/hisi_pcie_pmu.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 3353de0..46c7861 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8023,6 +8023,12 @@ W:   http://www.hisilicon.com
  F:Documentation/admin-guide/perf/hisi-pmu.rst
  F:drivers/perf/hisilicon
  
+HISILICON PCIE PMU DRIVER

+M: Qi Liu 
+S: Maintained
+F: Documentation/admin-guide/perf/hisi-pcie-pmu.rst


nit: this does not exist yet...


+F: drivers/perf/pci/hisilicon/hisi_pcie_pmu.c
+
  HISILICON QM AND ZIP Controller DRIVER
  M:Zhou Wang 
  L:linux-cry...@vger.kernel.org
diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 3075cf1..99d4760 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -139,4 +139,6 @@ config ARM_DMC620_PMU
  
  source "drivers/perf/hisilicon/Kconfig"
  
+source "drivers/perf/pci/Kconfig"

+
  endmenu
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index 5260b11..1208c08 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -14,3 +14,4 @@ obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o
  obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
  obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
  obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o
+obj-y += pci/
diff --git a/drivers/perf/pci/Kconfig b/drivers/perf/pci/Kconfig
new file mode 100644
index 000..a119486
--- /dev/null
+++ b/drivers/perf/pci/Kconfig
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# PCIe Performance Monitor Drivers
+#
+menu "PCIe Performance Monitor"
+
+config HISI_PCIE_PMU
+   tristate "HiSilicon PCIE PERF PMU"
+   depends on ARM64 && PCI && HISI_PMU


What from HISI_PMU is needed? I couldn't find anything here


+   help
+ Provide support for HiSilicon PCIe performance monitoring unit (PMU)
+ RCiEP devices.
+ Adds the PCIe PMU into perf events system for monitoring latency,
+ bandwidth etc.
+
+endmenu
diff --git a/drivers/perf/pci/Makefile b/drivers/perf/pci/Makefile
new file mode 100644
index 000..a56b1a9
--- /dev/null
+++ b/drivers/perf/pci/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y += hisilicon/
diff --git a/drivers/perf/pci/hisilicon/Makefile 
b/drivers/perf/pci/hisilicon/Makefile
new file mode 100644
index 000..6f99f52
--- /dev/null
+++ b/drivers/perf/pci/hisilicon/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+ccflags-y :=  -I $(srctree)/drivers/perf/hisilicon
+
+obj-$(CONFIG_HISI_PCIE_PMU) += hisi_pcie_pmu.o
diff --git a/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c 
b/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c
new file mode 100644
index 000..ac0e444
--- /dev/null
+++ b/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c
@@ -0,0 +1,1011 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * This driver adds support for PCIe PMU RCiEP device. Related
+ * perf events are bandwidth, bandwidth utilization, latency
+ * etc.
+ *
+ * Copyright (C) 2021 HiSilicon Limited
+ * Author: Qi Liu
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "hisi_uncore_pmu.h"
+
+/* Define registers */
+#define HISI_PCIE_GLOBAL_CTRL  0x00
+#define HISI_PCIE_EVENT_CTRL   0x010
+#define HISI_PCIE_CNT   

Re: [PATCH 1/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU

2021-04-08 Thread John Garry

On 08/04/2021 10:01, Jonathan Cameron wrote:

On Wed, 7 Apr 2021 21:40:05 +0100
Will Deacon  wrote:


On Wed, Apr 07, 2021 at 05:49:02PM +0800, Qi Liu wrote:

PCIe PMU Root Complex Integrated End Point(RCiEP) device is supported
to sample bandwidth, latency, buffer occupation etc.

Each PMU RCiEP device monitors multiple root ports, and each RCiEP is
registered as a pmu in /sys/bus/event_source/devices, so users can
select target PMU, and use filter to do further sets.

Filtering options contains:
event- select the event.
subevent - select the subevent.
port - select target root ports. Information of root ports
are shown under sysfs.
bdf   - select requester_id of target EP device.
trig_len - set trigger condition for starting event statistics.
trigger_mode - set trigger mode. 0 means starting to statistic when
bigger than trigger condition, and 1 means smaller.
thr_len  - set threshold for statistics.
thr_mode - set threshold mode. 0 means count when bigger than
threshold, and 1 means smaller.

Reviewed-by: Jonathan Cameron 


Do you have a link to this review, please?


Internal review, so drop the tag.

Jonathan


Hi Will,

Are you implying that you would rather that any review for these drivers 
is done in public on the lists?


Cheers,
John


[PATCH v3 6/6] perf vendor events arm64: Add Hisi hip08 L3 metrics

2021-04-07 Thread John Garry
Add L3 metrics.

Signed-off-by: John Garry 
Reviewed-by: Kajol Jain 
---
 .../arch/arm64/hisilicon/hip08/metrics.json   | 161 ++
 1 file changed, 161 insertions(+)

diff --git a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json 
b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
index dda898d23c2d..dda8e59149d2 100644
--- a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
+++ b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
@@ -69,4 +69,165 @@
 "MetricGroup": "TopDownL2",
 "MetricName": "memory_bound"
 },
+{
+"MetricExpr": "(((L2I_TLB - L2I_TLB_REFILL) * 15) + (L2I_TLB_REFILL * 
100)) / CPU_CYCLES",
+"PublicDescription": "Idle by itlb miss L3 topdown metric",
+"BriefDescription": "Idle by itlb miss L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "idle_by_itlb_miss"
+},
+{
+"MetricExpr": "(((L2I_CACHE - L2I_CACHE_REFILL) * 15) + 
(L2I_CACHE_REFILL * 100)) / CPU_CYCLES",
+"PublicDescription": "Idle by icache miss L3 topdown metric",
+"BriefDescription": "Idle by icache miss L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "idle_by_icache_miss"
+},
+{
+"MetricExpr": "(BR_MIS_PRED * 5) / CPU_CYCLES",
+"PublicDescription": "BP misp flush L3 topdown metric",
+"BriefDescription": "BP misp flush L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "bp_misp_flush"
+},
+{
+"MetricExpr": "(armv8_pmuv3_0@event\\=0x2013@ * 5) / CPU_CYCLES",
+"PublicDescription": "OOO flush L3 topdown metric",
+"BriefDescription": "OOO flush L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "ooo_flush"
+},
+{
+"MetricExpr": "(armv8_pmuv3_0@event\\=0x1001@ * 5) / CPU_CYCLES",
+"PublicDescription": "Static predictor flush L3 topdown metric",
+"BriefDescription": "Static predictor flush L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "sp_flush"
+},
+{
+"MetricExpr": "armv8_pmuv3_0@event\\=0x1010@ / BR_MIS_PRED",
+"PublicDescription": "Indirect branch L3 topdown metric",
+"BriefDescription": "Indirect branch L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "indirect_branch"
+},
+{
+"MetricExpr": "(armv8_pmuv3_0@event\\=0x1014@ + 
armv8_pmuv3_0@event\\=0x1018@) / BR_MIS_PRED",
+"PublicDescription": "Push branch L3 topdown metric",
+"BriefDescription": "Push branch L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "push_branch"
+},
+{
+"MetricExpr": "armv8_pmuv3_0@event\\=0x100c@ / BR_MIS_PRED",
+"PublicDescription": "Pop branch L3 topdown metric",
+"BriefDescription": "Pop branch L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "pop_branch"
+},
+{
+"MetricExpr": "(BR_MIS_PRED - armv8_pmuv3_0@event\\=0x1010@ - 
armv8_pmuv3_0@event\\=0x1014@ - armv8_pmuv3_0@event\\=0x1018@ - 
armv8_pmuv3_0@event\\=0x100c@) / BR_MIS_PRED",
+"PublicDescription": "Other branch L3 topdown metric",
+"BriefDescription": "Other branch L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "other_branch"
+},
+{
+"MetricExpr": "armv8_pmuv3_0@event\\=0x2012@ / 
armv8_pmuv3_0@event\\=0x2013@",
+"PublicDescription": "Nuke flush L3 topdown metric",
+"BriefDescription": "Nuke flush L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "nuke_flush"
+},
+{
+"MetricExpr": "1 - nuke_flush",
+"PublicDescription": "Other flush L3 topdown metric",
+"BriefDescription": &q

[PATCH v3 0/6] perf arm64 metricgroup support

2021-04-07 Thread John Garry
This series contains support to get basic metricgroups working for
arm64 CPUs.

Initial support is added for HiSilicon hip08 platform.

Some sample usage on Huawei D06 board:

 $ ./perf list metric

List of pre-defined events (to be used in -e): 

Metrics: 

  bp_misp_flush
   [BP misp flush L3 topdown metric]
  branch_mispredicts
   [Branch mispredicts L2 topdown metric]
  core_bound
   [Core bound L2 topdown metric]
  divider
   [Divider L3 topdown metric]
  exe_ports_util
   [EXE ports util L3 topdown metric]
  fetch_bandwidth_bound
   [Fetch bandwidth bound L2 topdown metric]
  fetch_latency_bound
   [Fetch latency bound L2 topdown metric]
  fsu_stall
   [FSU stall L3 topdown metric]
  idle_by_icache_miss

$ sudo ./perf stat -v -M core_bound sleep 1
Using CPUID 0x480fd010
metric expr (exe_stall_cycle - (mem_stall_anyload + 
armv8_pmuv3_0@event\=0x7005@)) / cpu_cycles for core_bound
found event cpu_cycles
found event armv8_pmuv3_0/event=0x7005/
found event exe_stall_cycle
found event mem_stall_anyload
adding {cpu_cycles -> armv8_pmuv3_0/event=0x7001/
mem_stall_anyload -> armv8_pmuv3_0/event=0x7004/
Control descriptor is not initialized
cpu_cycles: 989433 385050 385050
armv8_pmuv3_0/event=0x7005/: 19207 385050 385050
exe_stall_cycle: 900825 385050 385050
mem_stall_anyload: 253516 385050 385050

Performance counter stats for 'sleep':

989,433  cpu_cycles  # 0.63 core_bound
  19,207  armv8_pmuv3_0/event=0x7005/
 900,825  exe_stall_cycle
 253,516  mem_stall_anyload

   0.000805809 seconds time elapsed

   0.000875000 seconds user
   0.0 seconds sys
   
perf stat --topdown is not supported, as this requires the CPU PMU to
expose (alias) events for the TopDown L1 metrics from sysfs, which arm 
does not do. To get that to work, we probably need to make perf use the
pmu-events cpumap to learn about those alias events.

Metric reuse support is added for pmu-events parse metric testcase.
This had been broken on power9 recently:
https://lore.kernel.org/lkml/20210324015418.gc8...@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com/
 

Differences to v2:
- Add TB and RB tags (Thanks!)
- Rename metricgroup__find_metric() from metricgroup_find_metric()
- Change resolve_metric_simple() to rescan after any insert

Differences to v1:
- Add pmu_events_map__find() as arm64-specific function
- Fix metric reuse for pmu-events parse metric testcase 

John Garry (6):
  perf metricgroup: Make find_metric() public with name change
  perf test: Handle metric reuse in pmu-events parsing test
  perf pmu: Add pmu_events_map__find()
  perf vendor events arm64: Add Hisi hip08 L1 metrics
  perf vendor events arm64: Add Hisi hip08 L2 metrics
  perf vendor events arm64: Add Hisi hip08 L3 metrics

 tools/perf/arch/arm64/util/Build  |   1 +
 tools/perf/arch/arm64/util/pmu.c  |  25 ++
 .../arch/arm64/hisilicon/hip08/metrics.json   | 233 ++
 tools/perf/tests/pmu-events.c |  83 ++-
 tools/perf/util/metricgroup.c |  12 +-
 tools/perf/util/metricgroup.h |   3 +-
 tools/perf/util/pmu.c |   5 +
 tools/perf/util/pmu.h |   1 +
 tools/perf/util/s390-sample-raw.c |   4 +-
 9 files changed, 356 insertions(+), 11 deletions(-)
 create mode 100644 tools/perf/arch/arm64/util/pmu.c
 create mode 100644 
tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json

-- 
2.26.2



[PATCH v3 5/6] perf vendor events arm64: Add Hisi hip08 L2 metrics

2021-04-07 Thread John Garry
Add L2 metrics.

Signed-off-by: John Garry 
Reviewed-by: Kajol Jain 
---
 .../arch/arm64/hisilicon/hip08/metrics.json   | 42 +++
 1 file changed, 42 insertions(+)

diff --git a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json 
b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
index dc5ff3051639..dda898d23c2d 100644
--- a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
+++ b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
@@ -27,4 +27,46 @@
 "MetricGroup": "TopDownL1",
 "MetricName": "backend_bound"
 },
+{
+"MetricExpr": "armv8_pmuv3_0@event\\=0x201d@ / CPU_CYCLES",
+"PublicDescription": "Fetch latency bound L2 topdown metric",
+"BriefDescription": "Fetch latency bound L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "fetch_latency_bound"
+},
+{
+"MetricExpr": "frontend_bound - fetch_latency_bound",
+"PublicDescription": "Fetch bandwidth bound L2 topdown metric",
+"BriefDescription": "Fetch bandwidth bound L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "fetch_bandwidth_bound"
+},
+{
+"MetricExpr": "(bad_speculation * BR_MIS_PRED) / (BR_MIS_PRED + 
armv8_pmuv3_0@event\\=0x2013@)",
+"PublicDescription": "Branch mispredicts L2 topdown metric",
+"BriefDescription": "Branch mispredicts L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "branch_mispredicts"
+},
+{
+"MetricExpr": "bad_speculation - branch_mispredicts",
+"PublicDescription": "Machine clears L2 topdown metric",
+"BriefDescription": "Machine clears L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "machine_clears"
+},
+{
+"MetricExpr": "(EXE_STALL_CYCLE - (MEM_STALL_ANYLOAD + 
armv8_pmuv3_0@event\\=0x7005@)) / CPU_CYCLES",
+"PublicDescription": "Core bound L2 topdown metric",
+"BriefDescription": "Core bound L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "core_bound"
+},
+{
+"MetricExpr": "(MEM_STALL_ANYLOAD + armv8_pmuv3_0@event\\=0x7005@) / 
CPU_CYCLES",
+"PublicDescription": "Memory bound L2 topdown metric",
+"BriefDescription": "Memory bound L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "memory_bound"
+},
 ]
-- 
2.26.2



[PATCH v3 4/6] perf vendor events arm64: Add Hisi hip08 L1 metrics

2021-04-07 Thread John Garry
Add L1 metrics. The formulas are as consistent as possible with the man pages'
descriptions of these metrics.

Signed-off-by: John Garry 
Reviewed-by: Kajol Jain 
---
 .../arch/arm64/hisilicon/hip08/metrics.json   | 30 +++
 1 file changed, 30 insertions(+)
 create mode 100644 
tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json

diff --git a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json 
b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
new file mode 100644
index ..dc5ff3051639
--- /dev/null
+++ b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
@@ -0,0 +1,30 @@
+[
+{
+"MetricExpr": "FETCH_BUBBLE / (4 * CPU_CYCLES)",
+"PublicDescription": "Frontend bound L1 topdown metric",
+"BriefDescription": "Frontend bound L1 topdown metric",
+"MetricGroup": "TopDownL1",
+"MetricName": "frontend_bound"
+},
+{
+"MetricExpr": "(INST_SPEC - INST_RETIRED) / (4 * CPU_CYCLES)",
+"PublicDescription": "Bad Speculation L1 topdown metric",
+"BriefDescription": "Bad Speculation L1 topdown metric",
+"MetricGroup": "TopDownL1",
+"MetricName": "bad_speculation"
+},
+{
+"MetricExpr": "INST_RETIRED / (CPU_CYCLES * 4)",
+"PublicDescription": "Retiring L1 topdown metric",
+"BriefDescription": "Retiring L1 topdown metric",
+"MetricGroup": "TopDownL1",
+"MetricName": "retiring"
+},
+{
+"MetricExpr": "1 - (frontend_bound + bad_speculation + retiring)",
+"PublicDescription": "Backend Bound L1 topdown metric",
+"BriefDescription": "Backend Bound L1 topdown metric",
+"MetricGroup": "TopDownL1",
+"MetricName": "backend_bound"
+},
+]
-- 
2.26.2



[PATCH v3 3/6] perf pmu: Add pmu_events_map__find()

2021-04-07 Thread John Garry
Add a function to find the common PMU map for the system.

For arm64, a special variant is added. This is because arm64 supports
heterogeneous CPU systems. As such, it cannot be guaranteed that the cpumap
is the same for all CPUs. So in the case of heterogeneous systems, don't return
a cpumap.

Tested-by: Paul A. Clarke 
Signed-off-by: John Garry 
Reviewed-by: Kajol Jain 
---
 tools/perf/arch/arm64/util/Build  |  1 +
 tools/perf/arch/arm64/util/pmu.c  | 25 +
 tools/perf/tests/pmu-events.c |  2 +-
 tools/perf/util/metricgroup.c |  7 +++
 tools/perf/util/pmu.c |  5 +
 tools/perf/util/pmu.h |  1 +
 tools/perf/util/s390-sample-raw.c |  4 +---
 7 files changed, 37 insertions(+), 8 deletions(-)
 create mode 100644 tools/perf/arch/arm64/util/pmu.c

diff --git a/tools/perf/arch/arm64/util/Build b/tools/perf/arch/arm64/util/Build
index ead2f2275eee..9fcb4e68add9 100644
--- a/tools/perf/arch/arm64/util/Build
+++ b/tools/perf/arch/arm64/util/Build
@@ -2,6 +2,7 @@ perf-y += header.o
 perf-y += machine.o
 perf-y += perf_regs.o
 perf-y += tsc.o
+perf-y += pmu.o
 perf-y += kvm-stat.o
 perf-$(CONFIG_DWARF) += dwarf-regs.o
 perf-$(CONFIG_LOCAL_LIBUNWIND) += unwind-libunwind.o
diff --git a/tools/perf/arch/arm64/util/pmu.c b/tools/perf/arch/arm64/util/pmu.c
new file mode 100644
index ..d3259d61ca75
--- /dev/null
+++ b/tools/perf/arch/arm64/util/pmu.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "../../util/cpumap.h"
+#include "../../util/pmu.h"
+
+struct pmu_events_map *pmu_events_map__find(void)
+{
+   struct perf_pmu *pmu = NULL;
+
+   while ((pmu = perf_pmu__scan(pmu))) {
+   if (!is_pmu_core(pmu->name))
+   continue;
+
+   /*
+* The cpumap should cover all CPUs. Otherwise, some CPUs may
+* not support some events or have different event IDs.
+*/
+   if (pmu->cpus->nr != cpu__max_cpu())
+   return NULL;
+
+   return perf_pmu__find_map(pmu);
+   }
+
+   return NULL;
+}
diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c
index cb5b25d2fb27..b8aff8fb50d8 100644
--- a/tools/perf/tests/pmu-events.c
+++ b/tools/perf/tests/pmu-events.c
@@ -539,7 +539,7 @@ static int resolve_metric_simple(struct expr_parse_ctx 
*pctx,
 
 static int test_parsing(void)
 {
-   struct pmu_events_map *cpus_map = perf_pmu__find_map(NULL);
+   struct pmu_events_map *cpus_map = pmu_events_map__find();
struct pmu_events_map *map;
struct pmu_event *pe;
int i, j, k;
diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index 37fe34a5d93d..8336dd8e8098 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -618,7 +618,7 @@ static int metricgroup__print_sys_event_iter(struct 
pmu_event *pe, void *data)
 void metricgroup__print(bool metrics, bool metricgroups, char *filter,
bool raw, bool details)
 {
-   struct pmu_events_map *map = perf_pmu__find_map(NULL);
+   struct pmu_events_map *map = pmu_events_map__find();
struct pmu_event *pe;
int i;
struct rblist groups;
@@ -1254,8 +1254,7 @@ int metricgroup__parse_groups(const struct option *opt,
  struct rblist *metric_events)
 {
struct evlist *perf_evlist = *(struct evlist **)opt->value;
-   struct pmu_events_map *map = perf_pmu__find_map(NULL);
-
+   struct pmu_events_map *map = pmu_events_map__find();
 
return parse_groups(perf_evlist, str, metric_no_group,
metric_no_merge, NULL, metric_events, map);
@@ -1274,7 +1273,7 @@ int metricgroup__parse_groups_test(struct evlist *evlist,
 
 bool metricgroup__has_metric(const char *metric)
 {
-   struct pmu_events_map *map = perf_pmu__find_map(NULL);
+   struct pmu_events_map *map = pmu_events_map__find();
struct pmu_event *pe;
int i;
 
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 88da5cf6aee8..419ef6c4fbc0 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -717,6 +717,11 @@ struct pmu_events_map *perf_pmu__find_map(struct perf_pmu 
*pmu)
return map;
 }
 
+struct pmu_events_map *__weak pmu_events_map__find(void)
+{
+   return perf_pmu__find_map(NULL);
+}
+
 bool pmu_uncore_alias_match(const char *pmu_name, const char *name)
 {
char *tmp = NULL, *tok, *str;
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 8164388478c6..012317229488 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -114,6 +114,7 @@ void pmu_add_cpu_aliases_map(struct list_head *head, struct 
perf_pmu *pmu,
 struct pmu_events_map *map);
 
 struct pmu_events_map *perf_pmu__find_map(struct perf_pmu *pmu);
+struct pmu_events_map *pmu_events_map__find(void);
 bool p

[PATCH v3 1/6] perf metricgroup: Make find_metric() public with name change

2021-04-07 Thread John Garry
Function find_metric() is required for the metric processing in the
pmu-events testcase, so make it public. Also change the name to include
"metricgroup".

Tested-by: Paul A. Clarke 
Signed-off-by: John Garry 
Reviewed-by: Kajol Jain 
---
 tools/perf/util/metricgroup.c | 5 +++--
 tools/perf/util/metricgroup.h | 3 ++-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index 6acb44ad439b..37fe34a5d93d 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -900,7 +900,8 @@ static int __add_metric(struct list_head *metric_list,
(match_metric(__pe->metric_group, __metric) ||  \
 match_metric(__pe->metric_name, __metric)))
 
-static struct pmu_event *find_metric(const char *metric, struct pmu_events_map 
*map)
+struct pmu_event *metricgroup__find_metric(const char *metric,
+  struct pmu_events_map *map)
 {
struct pmu_event *pe;
int i;
@@ -985,7 +986,7 @@ static int __resolve_metric(struct metric *m,
struct expr_id *parent;
struct pmu_event *pe;
 
-   pe = find_metric(cur->key, map);
+   pe = metricgroup__find_metric(cur->key, map);
if (!pe)
continue;
 
diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h
index ed1b9392e624..1424e7c1af77 100644
--- a/tools/perf/util/metricgroup.h
+++ b/tools/perf/util/metricgroup.h
@@ -44,7 +44,8 @@ int metricgroup__parse_groups(const struct option *opt,
  bool metric_no_group,
  bool metric_no_merge,
  struct rblist *metric_events);
-
+struct pmu_event *metricgroup__find_metric(const char *metric,
+  struct pmu_events_map *map);
 int metricgroup__parse_groups_test(struct evlist *evlist,
   struct pmu_events_map *map,
   const char *str,
-- 
2.26.2



[PATCH v3 2/6] perf test: Handle metric reuse in pmu-events parsing test

2021-04-07 Thread John Garry
The pmu-events parsing test does not handle metric reuse at all.

Introduce some simple handling to resolve metrics who reference other
metrics.

Tested-by: Paul A. Clarke 
Signed-off-by: John Garry 
Reviewed-by: Kajol Jain 
---
 tools/perf/tests/pmu-events.c | 81 +++
 1 file changed, 81 insertions(+)

diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c
index 0ca6a5a53523..cb5b25d2fb27 100644
--- a/tools/perf/tests/pmu-events.c
+++ b/tools/perf/tests/pmu-events.c
@@ -12,6 +12,7 @@
 #include "util/evlist.h"
 #include "util/expr.h"
 #include "util/parse-events.h"
+#include "metricgroup.h"
 
 struct perf_pmu_test_event {
/* used for matching against events from generated pmu-events.c */
@@ -471,6 +472,71 @@ static void expr_failure(const char *msg,
pr_debug("On expression %s\n", pe->metric_expr);
 }
 
+struct metric {
+   struct list_head list;
+   struct metric_ref metric_ref;
+};
+
+static int resolve_metric_simple(struct expr_parse_ctx *pctx,
+struct list_head *compound_list,
+struct pmu_events_map *map,
+const char *metric_name)
+{
+   struct hashmap_entry *cur, *cur_tmp;
+   struct metric *metric, *tmp;
+   size_t bkt;
+   bool all;
+   int rc;
+
+   do {
+   all = true;
+   hashmap__for_each_entry_safe((&pctx->ids), cur, cur_tmp, bkt) {
+   struct metric_ref *ref;
+   struct pmu_event *pe;
+
+   pe = metricgroup__find_metric(cur->key, map);
+   if (!pe)
+   continue;
+
+   if (!strcmp(metric_name, (char *)cur->key)) {
+   pr_warning("Recursion detected for metric 
%s\n", metric_name);
+   rc = -1;
+   goto out_err;
+   }
+
+   all = false;
+
+   /* The metric key itself needs to go out.. */
+   expr__del_id(pctx, cur->key);
+
+   metric = malloc(sizeof(*metric));
+   if (!metric) {
+   rc = -ENOMEM;
+   goto out_err;
+   }
+
+   ref = &metric->metric_ref;
+   ref->metric_name = pe->metric_name;
+   ref->metric_expr = pe->metric_expr;
+   list_add_tail(&metric->list, compound_list);
+
+   rc = expr__find_other(pe->metric_expr, NULL, pctx, 0);
+   if (rc)
+   goto out_err;
+   break; /* The hashmap has been modified, so restart */
+   }
+   } while (!all);
+
+   return 0;
+
+out_err:
+   list_for_each_entry_safe(metric, tmp, compound_list, list)
+   free(metric);
+
+   return rc;
+
+}
+
 static int test_parsing(void)
 {
struct pmu_events_map *cpus_map = perf_pmu__find_map(NULL);
@@ -488,7 +554,9 @@ static int test_parsing(void)
break;
j = 0;
for (;;) {
+   struct metric *metric, *tmp;
struct hashmap_entry *cur;
+   LIST_HEAD(compound_list);
size_t bkt;
 
pe = &map->table[j++];
@@ -504,6 +572,13 @@ static int test_parsing(void)
continue;
}
 
+   if (resolve_metric_simple(&ctx, &compound_list, map,
+ pe->metric_name)) {
+   expr_failure("Could not resolve metrics", map, 
pe);
+   ret++;
+   goto exit; /* Don't tolerate errors due to 
severity */
+   }
+
/*
 * Add all ids with a made up value. The value may
 * trigger divide by zero when subtracted and so try to
@@ -519,6 +594,11 @@ static int test_parsing(void)
ret++;
}
 
+   list_for_each_entry_safe(metric, tmp, &compound_list, 
list) {
+   expr__add_ref(&ctx, &metric->metric_ref);
+   free(metric);
+   }
+
if (expr__parse(&result, &ctx, pe->metric_expr, 0)) {
expr_failure("Parse failed", map, pe);
ret++;
@@ -527,6 +607,7 @@ static int test_parsing(void)
}
}
/* TODO: fail when not ok */
+exit:
return ret == 0 ? TEST_OK : TEST_SKIP;
 }
 
-- 
2.26.2



Re: [PATCH 0/3] iommu/iova: Add CPU hotplug handler to flush rcaches to core code

2021-04-07 Thread John Garry

On 07/04/2021 09:04, Joerg Roedel wrote:

On Mon, Mar 01, 2021 at 08:12:18PM +0800, John Garry wrote:

The Intel IOMMU driver supports flushing the per-CPU rcaches when a CPU is
offlined.

Let's move it to core code, so everyone can take advantage.

Also correct a code comment.

Based on v5.12-rc1. Tested on arm64 only.

John Garry (3):
   iova: Add CPU hotplug handler to flush rcaches
   iommu/vt-d: Remove IOVA domain rcache flushing for CPU offlining
   iova: Correct comment for free_cpu_cached_iovas()

  drivers/iommu/intel/iommu.c | 31 ---
  drivers/iommu/iova.c| 32 ++--
  include/linux/cpuhotplug.h  |  2 +-
  include/linux/iova.h|  1 +
  4 files changed, 32 insertions(+), 34 deletions(-)


Applied, thanks.

.



Thanks, but there was a v2 on this series. Not sure which you applied.

https://lore.kernel.org/linux-iommu/9aad6e94-ecb7-ca34-7f7d-3df6e43e9...@huawei.com/T/#mbea81468782c75fa84744ad7a7801831a4c952e9



Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator

2021-04-06 Thread John Garry

So then we have the issue of how to dynamically increase this rcache
threshold. The problem is that we may have many devices associated with
the same domain. So, in theory, we can't assume that when we increase
the threshold that some other device will try to fast free an IOVA 
which

was allocated prior to the increase and was not rounded up.

I'm very open to better (or less bad) suggestions on how to do this ...

...but yes, regardless of exactly where it happens, rounding up or not
is the problem for rcaches in general. I've said several times that my
preferred approach is to not change it that dynamically at all, but
instead treat it more like we treat the default domain type.



Can you remind me of that idea? I don't remember you mentioning using 
default domain handling as a reference in any context.




Hi Robin,

Sorry if the phrasing was unclear there - the allusion to default 
domains is new, it just occurred to me that what we do there is in fact 
fairly close to what I've suggested previously for this. In that case, 
we have a global policy set by the command line, which *can* be 
overridden per-domain via sysfs at runtime, provided the user is willing 
to tear the whole thing down. Using a similar approach here would give a 
fair degree of flexibility but still mean that changes never have to be 
made dynamically to a live domain.


So are you saying that we can handle it similarly to how we can now handle 
changing the default domain for an IOMMU group via sysfs? If so, that just 
is not practical here. The reason is that this particular DMA engine 
provides the block device backing the / mount point, so if we unbind the 
driver, we lose the / mount point.


And I am not sure if the end user would even know how to set such a 
tunable. Or, in this case, why the end user would not want the optimized 
range configured always.


I'd still rather if the device driver could provide info which can be 
used to configure this before or during probing.


Cheers,
John


Re: [bug report] Memory leak from acpi_ev_install_space_handler()

2021-04-06 Thread John Garry

On 06/04/2021 17:40, Rafael J. Wysocki wrote:

On Tue, Apr 6, 2021 at 5:51 PM John Garry  wrote:


Hi guys,

On next-20210406, I enabled CONFIG_DEBUG_KMEMLEAK and
CONFIG_DEBUG_TEST_DRIVER_REMOVE for my arm64 system, and see this:




Hi Rafael,


Why exactly do you think that acpi_ev_install_space_handler() leaks memory?



I don't think that acpi_ev_install_space_handler() itself leaks memory, 
but it seems that there is something missing in the code which is meant 
to undo/clean up after that on the uninstall path - I tried to make that 
point by writing "memory leak from", but maybe it is still not worded clearly.


Anyway, I have not analyzed the problem fully - I'm just reporting.

I don't mind looking further if requested.

Thanks,
John


root@debian:/home/john# more /sys/kernel/debug/kmemleak
unreferenced object 0x202803c11f00 (size 128):
comm "swapper/0", pid 1, jiffies 4294894325 (age 337.524s)
hex dump (first 32 bytes):
00 00 00 00 02 00 00 00 08 1f c1 03 28 20 ff ff( ..
08 1f c1 03 28 20 ff ff 00 00 00 00 00 00 00 00( ..
backtrace:
[<670a0938>] slab_post_alloc_hook+0x9c/0x2f8
[<a3f47b39>] kmem_cache_alloc+0x198/0x2a8
[<2bdba864>] acpi_os_create_semaphore+0x54/0xe0
[<bcd513fe>] acpi_ev_install_space_handler+0x24c/0x300
[<02e116e2>] acpi_install_address_space_handler+0x64/0xb0
[<ba00abc5>] i2c_acpi_install_space_handler+0xd4/0x138
[<8da42058>] i2c_register_adapter+0x368/0x910
[<c03f7142>] i2c_add_adapter+0x9c/0x100
[<00ba2fcf>] i2c_add_numbered_adapter+0x44/0x58
[<7df22d67>] i2c_dw_probe_master+0x68c/0x900
[<682dfc98>] dw_i2c_plat_probe+0x460/0x640
[<ad2dd3ee>] platform_probe+0x8c/0x108
[<dd183e3f>] really_probe+0x190/0x670
[<66017341>] driver_probe_device+0x8c/0xf8
[<c441e843>] device_driver_attach+0x9c/0xa8
[<f91dc709>] __driver_attach+0x88/0x138
unreferenced object 0x00280452c100 (size 128):
comm "swapper/0", pid 1, jiffies 4294894558 (age 336.604s)
hex dump (first 32 bytes):
00 00 00 00 02 00 00 00 08 c1 52 04 28 00 ff ff..R.(...
08 c1 52 04 28 00 ff ff 00 00 00 00 00 00 00 00..R.(...
backtrace:
[<670a0938>] slab_post_alloc_hook+0x9c/0x2f8
[<a3f47b39>] kmem_cache_alloc+0x198/0x2a8
[<2bdba864>] acpi_os_create_semaphore+0x54/0xe0
[<bcd513fe>] acpi_ev_install_space_handler+0x24c/0x300
[<02e116e2>] acpi_install_address_space_handler+0x64/0xb0
[<988d4f61>] acpi_gpiochip_add+0x20c/0x4a0
[<73d4faab>] gpiochip_add_data_with_key+0xd10/0x1680
[<1d50b98a>] devm_gpiochip_add_data_with_key+0x30/0x78
[<fc3e7eaf>] dwapb_gpio_probe+0x828/0xb28
[<ad2dd3ee>] platform_probe+0x8c/0x108
[<dd183e3f>] really_probe+0x190/0x670
[<66017341>] driver_probe_device+0x8c/0xf8
[<c441e843>] device_driver_attach+0x9c/0xa8
[<f91dc709>] __driver_attach+0x88/0x138
[<d330caed>] bus_for_each_dev+0xec/0x160
[<eebc5f04>] driver_attach+0x34/0x48
root@debian:/home/john#

Thanks,
John

.





[bug report] Memory leak from acpi_ev_install_space_handler()

2021-04-06 Thread John Garry

Hi guys,

On next-20210406, I enabled CONFIG_DEBUG_KMEMLEAK and
CONFIG_DEBUG_TEST_DRIVER_REMOVE for my arm64 system, and see this:

root@debian:/home/john# more /sys/kernel/debug/kmemleak
unreferenced object 0x202803c11f00 (size 128):
comm "swapper/0", pid 1, jiffies 4294894325 (age 337.524s)
hex dump (first 32 bytes):
00 00 00 00 02 00 00 00 08 1f c1 03 28 20 ff ff( ..
08 1f c1 03 28 20 ff ff 00 00 00 00 00 00 00 00( ..
backtrace:
[<670a0938>] slab_post_alloc_hook+0x9c/0x2f8
[] kmem_cache_alloc+0x198/0x2a8
[<2bdba864>] acpi_os_create_semaphore+0x54/0xe0
[] acpi_ev_install_space_handler+0x24c/0x300
[<02e116e2>] acpi_install_address_space_handler+0x64/0xb0
[] i2c_acpi_install_space_handler+0xd4/0x138
[<8da42058>] i2c_register_adapter+0x368/0x910
[] i2c_add_adapter+0x9c/0x100
[<00ba2fcf>] i2c_add_numbered_adapter+0x44/0x58
[<7df22d67>] i2c_dw_probe_master+0x68c/0x900
[<682dfc98>] dw_i2c_plat_probe+0x460/0x640
[] platform_probe+0x8c/0x108
[] really_probe+0x190/0x670
[<66017341>] driver_probe_device+0x8c/0xf8
[] device_driver_attach+0x9c/0xa8
[] __driver_attach+0x88/0x138
unreferenced object 0x00280452c100 (size 128):
comm "swapper/0", pid 1, jiffies 4294894558 (age 336.604s)
hex dump (first 32 bytes):
00 00 00 00 02 00 00 00 08 c1 52 04 28 00 ff ff..R.(...
08 c1 52 04 28 00 ff ff 00 00 00 00 00 00 00 00..R.(...
backtrace:
[<670a0938>] slab_post_alloc_hook+0x9c/0x2f8
[] kmem_cache_alloc+0x198/0x2a8
[<2bdba864>] acpi_os_create_semaphore+0x54/0xe0
[] acpi_ev_install_space_handler+0x24c/0x300
[<02e116e2>] acpi_install_address_space_handler+0x64/0xb0
[<988d4f61>] acpi_gpiochip_add+0x20c/0x4a0
[<73d4faab>] gpiochip_add_data_with_key+0xd10/0x1680
[<1d50b98a>] devm_gpiochip_add_data_with_key+0x30/0x78
[] dwapb_gpio_probe+0x828/0xb28
[] platform_probe+0x8c/0x108
[] really_probe+0x190/0x670
[<66017341>] driver_probe_device+0x8c/0xf8
[] device_driver_attach+0x9c/0xa8
[] __driver_attach+0x88/0x138
[] bus_for_each_dev+0xec/0x160
[] driver_attach+0x34/0x48
root@debian:/home/john#

Thanks,
John


Re: [PATCH v2 2/6] perf test: Handle metric reuse in pmu-events parsing test

2021-04-06 Thread John Garry

On 06/04/2021 14:34, Jiri Olsa wrote:


}

So once we evaluate a pmu_event in pctx->ids in @pe, @all is set false, and
we would loop again in the do-while loop, regardless of what
expr__find_other() does (apart from erroring), and so call
hashmap__for_each_entry_safe(&pctx->ids, ) again.

ah ok, so it finishes the hash iteration first and
then restarts it.. ok, I missed that, then it's fine
 >> This is really what is done in __resolve_metric() - indeed, I would use that
function directly, but it looks hard to extract that from metricgroup.c .

yea, it's another world;-)  it's better to keep it separated


ok, so I'll still add the break statement, as suggested.

I'll also wait to see if Ian or you have a strong feeling about the 
function naming in patch 1/6.


Thanks,
John



Re: [PATCH v2 2/6] perf test: Handle metric reuse in pmu-events parsing test

2021-04-06 Thread John Garry

On 06/04/2021 13:55, Jiri Olsa wrote:

So expr__find_other() may add a new item to pctx->ids, and we always iterate
again, and try to lookup any pmu_events, *, above. If none exist, then we

hm, I don't see that.. so, what you do is:

hashmap__for_each_entry_safe((&pctx->ids) ) {

rc = expr__find_other(pe->metric_expr, NULL, pctx, 0);
}

and what I think we need to do is:

hashmap__for_each_entry_safe((&pctx->ids) ) {

rc = expr__find_other(pe->metric_expr, NULL, pctx, 0);

break;  
}

each time you resolve another metric, you need to restart
the pctx->ids iteration, because there will be new items,
and we are in the middle of it

Sure, but we will restart anyway.

hum, where? you call expr__find_other and continue to next
pctx->ids item


We have:

resolve_metric_simple()
{
bool all;

do {
all = true;

hashmap__for_each_entry_safe(&pctx->ids, ...) {

pe = metricgroup_find_metric(cur->key, map);
if (!pe)
continue;

...
all = false;

expr_del_id(pctx, cur->key);

...
rc = expr__find_other(pe->metric_expr, pctx);
if (rc)
goto out_err;
}

} while (!all);

}

So once we evaluate a pmu_event in pctx->ids in @pe, @all is set false, 
and we would loop again in the do-while loop, regardless of what 
expr__find_other() does (apart from erroring), and so call 
hashmap__for_each_entry_safe(&pctx->ids, ) again.


This is really what is done in __resolve_metric() - indeed, I would use 
that function directly, but it looks hard to extract that from 
metricgroup.c .


Thanks,
John




Regardless of this, I don't think what I am doing is safe, i.e. adding new
items in the middle of the iter, so I will change in the way you suggest.

it'll always add items in the middle of the iteration




Re: [PATCH v2 2/6] perf test: Handle metric reuse in pmu-events parsing test

2021-04-06 Thread John Garry

On 06/04/2021 13:17, Jiri Olsa wrote:

+   ref = &metric->metric_ref;
+   ref->metric_name = pe->metric_name;
+   ref->metric_expr = pe->metric_expr;
+   list_add_tail(&metric->list, compound_list);
+
+   rc = expr__find_other(pe->metric_expr, NULL, pctx, 0);

Hi Jirka,


so this might add new items to pctx->ids, I think you need
to restart the iteration as we do it in __resolve_metric
otherwise you could miss some new keys

I thought that I was doing this. Indeed, this code is very much like
__resolve_metric();)

So expr__find_other() may add a new item to pctx->ids, and we always iterate
again, and try to lookup any pmu_events, *, above. If none exist, then we

hm, I don't see that.. so, what you do is:

hashmap__for_each_entry_safe((&pctx->ids) ) {

rc = expr__find_other(pe->metric_expr, NULL, pctx, 0);
}

and what I think we need to do is:

hashmap__for_each_entry_safe((&pctx->ids) ) {

rc = expr__find_other(pe->metric_expr, NULL, pctx, 0);

break;  
}

each time you resolve another metric, you need to restart
the pctx->ids iteration, because there will be new items,
and we are in the middle of it


Sure, but we will restart anyway.

Regardless of this, I don't think what I am doing is safe, i.e. adding 
new items in the middle of the iter, so I will change in the way you 
suggest.


Thanks,
John


Re: [PATCH v2 0/6] perf arm64 metricgroup support

2021-04-06 Thread John Garry

On 30/03/2021 07:41, kajoljain wrote:



On 3/30/21 2:37 AM, Paul A. Clarke wrote:

On Fri, Mar 26, 2021 at 10:57:40AM +, John Garry wrote:

On 25/03/2021 20:39, Paul A. Clarke wrote:

On Thu, Mar 25, 2021 at 06:33:12PM +0800, John Garry wrote:

Metric reuse support is added for pmu-events parse metric testcase.
This had been broken on power9 recently:
https://lore.kernel.org/lkml/20210324015418.gc8...@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com/


Much better.  Before:
--
$ perf test -v 10 2>&1 | grep -i error | wc -l
112
--
After:
--
$ perf test -v 10 2>&1 | grep -i error | wc -l
17
--

And these seem like different types of issues:
--
$ perf test -v 10 2>&1 | grep -i error
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_powerbus0_imc'. Missing kernel support?' 
help '(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help 
'(null)'
--



This looks suspicious.

Firstly, does /sys/bus/event_source/devices/nest_mcs01_imc (or others,
above) exist on your system? I guess not.

Secondly, checking Documentation/powerpc/imc.rst, we have examples of:
nest_mcs01/PM_MCS01_64B_R...

So is the PMU name correct in the metric file for nest_mcs01_imc? Looking at
the kernel driver file, arch/powerpc/perf/imc-pmu.c, it seems to be correct.
Not sure.


I ran with a newer kernel, and the above errors disappeared, replaced with
about 10 of:
--
Error string 'Cannot find PMU `hv_24x7'. Missing kernel support?' help '(null)'
--

...but I was running without a hypervisor, so I tried the same kernel on a
PowerVM-virtualized system and the "hv_24x7" messages went away, but the
"nest" messages returned.  This may all be expected behavior... I confess
I haven't followed these new perf capabilities closely.



Hi Paul/John,
This is something expected. For nest-imc we need bare-metal system and for
hv-24x7 we need VM environment. Since you are checking this test in VM machine,
there nest events are not supported and hence we are getting this error.

Thanks,
Kajol Jain


Cool, so I hope that tested-by or similar can be provided :) [obviously 
pending any changes that come from reviews]


Thanks




It's extremely likely that none of these errors has anything to do with your
changes. :-

PC


.





Re: [PATCH v2 2/6] perf test: Handle metric reuse in pmu-events parsing test

2021-04-06 Thread John Garry

On 01/04/2021 14:49, Jiri Olsa wrote:

On Thu, Mar 25, 2021 at 06:33:14PM +0800, John Garry wrote:

SNIP


+struct metric {
+   struct list_head list;
+   struct metric_ref metric_ref;
+};
+
+static int resolve_metric_simple(struct expr_parse_ctx *pctx,
+struct list_head *compound_list,
+struct pmu_events_map *map,
+const char *metric_name)
+{
+   struct hashmap_entry *cur, *cur_tmp;
+   struct metric *metric, *tmp;
+   size_t bkt;
+   bool all;
+   int rc;
+
+   do {
+   all = true;
+   hashmap__for_each_entry_safe((&pctx->ids), cur, cur_tmp, bkt) {
+   struct metric_ref *ref;
+   struct pmu_event *pe;
+
+   pe = metrcgroup_find_metric(cur->key, map);


*


+   if (!pe)
+   continue;
+
+   if (!strcmp(metric_name, (char *)cur->key)) {
+   pr_warning("Recursion detected for metric 
%s\n", metric_name);
+   rc = -1;
+   goto out_err;
+   }
+
+   all = false;
+
+   /* The metric key itself needs to go out.. */
+   expr__del_id(pctx, cur->key);
+
+   metric = malloc(sizeof(*metric));
+   if (!metric) {
+   rc = -ENOMEM;
+   goto out_err;
+   }
+
+   ref = &metric->metric_ref;
+   ref->metric_name = pe->metric_name;
+   ref->metric_expr = pe->metric_expr;
+   list_add_tail(&metric->list, compound_list);
+
+   rc = expr__find_other(pe->metric_expr, NULL, pctx, 0);




Hi Jirka,


so this might add new items to pctx->ids, I think you need
to restart the iteration as we do it in __resolve_metric
otherwise you could miss some new keys


I thought that I was doing this. Indeed, this code is very much like 
__resolve_metric() ;)


So expr__find_other() may add a new item to pctx->ids, and we always 
iterate again, and try to lookup any pmu_events, *, above. If none 
exist, then we have broken down pctx into primitive events aliases and 
unresolvable metrics, and stop iterating. And then unresolvable metrics 
would be found in check_parse_cpu().


As an example, we can deal with metric test1, below, which references 2x 
other metrics:


{
"MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / (4 * (( ( 
CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE 
/ CPU_CLK_UNHALTED.REF_XCLK ) )))",

  "MetricName": "Frontend_Bound",
},
{
"MetricExpr": "( UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 
4 * INT_MISC.RECOVERY_CYCLES ) / (4 * cycles)",

"MetricName": "Bad_Speculation",
},
{
"MetricExpr": "Bad_Speculation + Frontend_Bound",
"MetricName": "test1",
},
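(To spell out how the loop above would walk this example - assuming only these 
three JSON entries are in the map: on the first pass over pctx->ids, the two 
ids of test1 match the Bad_Speculation and Frontend_Bound entries, so each key 
is removed from pctx, recorded as a metric_ref, and the ids of its own 
MetricExpr are added via expr__find_other(); on the second pass only primitive 
events such as UOPS_ISSUED.ANY and CPU_CLK_UNHALTED.THREAD remain, nothing 
matches a metric in the map, @all stays true and the do-while loop terminates.)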

Does that satisfy your concern, or have I missed something?

Thanks,
John



jirka


+   if (rc)
+   goto out_err;
+   }
+   } while (!all);
+
+   return 0;
+
+out_err:
+   list_for_each_entry_safe(metric, tmp, compound_list, list)
+   free(metric);
+
+   return rc;
+
+}


SNIP

.





Re: [PATCH v2 1/6] perf metricgroup: Make find_metric() public with name change

2021-04-06 Thread John Garry

On 02/04/2021 00:16, Ian Rogers wrote:

On Thu, Mar 25, 2021 at 3:38 AM John Garry  wrote:


Function find_metric() is required for the metric processing in the
pmu-events testcase, so make it public. Also change the name to include
"metricgroup".


Would it make more sense as "pmu_events_map__find_metric" ?



So all functions apart from one in metricgroup.h are named 
metricgroup__XXX, so I was trying to keep this style - apart from the 
double-underscore (which can be remedied).


Personally I don't think pmu_events_map__find_metric name fits with that 
convention.


Thanks,
John


Thanks,
Ian


Signed-off-by: John Garry 
---
  tools/perf/util/metricgroup.c | 5 +++--
  tools/perf/util/metricgroup.h | 3 ++-
  2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index 6acb44ad439b..71a13406e0bd 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -900,7 +900,8 @@ static int __add_metric(struct list_head *metric_list,
 (match_metric(__pe->metric_group, __metric) ||  \
  match_metric(__pe->metric_name, __metric)))

-static struct pmu_event *find_metric(const char *metric, struct pmu_events_map 
*map)
+struct pmu_event *metrcgroup_find_metric(const char *metric,
+struct pmu_events_map *map)
  {
 struct pmu_event *pe;
 int i;
@@ -985,7 +986,7 @@ static int __resolve_metric(struct metric *m,
 struct expr_id *parent;
 struct pmu_event *pe;

-   pe = find_metric(cur->key, map);
+   pe = metrcgroup_find_metric(cur->key, map);
 if (!pe)
 continue;

diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h
index ed1b9392e624..1674c6a36d74 100644
--- a/tools/perf/util/metricgroup.h
+++ b/tools/perf/util/metricgroup.h
@@ -44,7 +44,8 @@ int metricgroup__parse_groups(const struct option *opt,
   bool metric_no_group,
   bool metric_no_merge,
   struct rblist *metric_events);
-
+struct pmu_event *metrcgroup_find_metric(const char *metric,
+struct pmu_events_map *map);
  int metricgroup__parse_groups_test(struct evlist *evlist,
struct pmu_events_map *map,
const char *str,
--
2.26.2


.





Re: PCI MSI issue with reinserting a driver

2021-04-06 Thread John Garry

On 03/02/2021 17:23, Marc Zyngier wrote:

On 2021-02-02 15:46, John Garry wrote:

On 02/02/2021 14:48, Marc Zyngier wrote:


Not sure. I also now notice an error for the SAS PCI driver on D06 
when nr_cpus < 16, which means the number of MSI vectors allocated is < 
32, so it looks like the same problem. There we try to allocate 16 + max(nr 
cpus, 16) MSIs.


Anyway, let me have a look today to see what is going wrong.


Could this be the problem:

nr_cpus=11

In alloc path, we have:
its_alloc_device_irq(nvecs=27 = 16+11)
  bitmap_find_free_region(order = 5);
In free path, we have:
its_irq_domain_free(nvecs = 1) and free each 27 vecs
  bitmap_release_region(order = 0)

So we allocate 32 bits, but only free 27. And 2nd alloc for 32 fails.
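
To make that arithmetic concrete, here is a minimal userspace sketch (not 
kernel code - count_order() below is just a stand-in for the kernel's 
get_count_order()) of the rounding mismatch described above:

#include <stdio.h>

static int count_order(unsigned int n)
{
	int order = 0;

	/* smallest order such that (1 << order) >= n */
	while ((1u << order) < n)
		order++;
	return order;
}

int main(void)
{
	unsigned int nvecs = 27;	/* 16 + nr_cpus(11) */
	/* alloc path reserves a power-of-2 region: order 5 -> 32 bits */
	unsigned int allocated = 1u << count_order(nvecs);
	/* free path now releases one order-0 region per vector: 27 bits */
	unsigned int freed = nvecs;

	printf("allocated %u bits, freed %u bits, leaked %u bits\n",
	       allocated, freed, allocated - freed);
	return 0;
}

which prints "allocated 32 bits, freed 27 bits, leaked 5 bits" - the slack 
that makes the second 32-vector allocation fail above.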


[ ... ]





Hi Marc,

Just a friendly reminder on this issue. Let me know if anything is required 
from our side.


Cheers,
John



But I'm not sure that we have any requirement for those map bits to be
consecutive.


We can't really do that. All the events must be contiguous,
and there is also a lot of assumptions in the ITS driver that
LPI allocations is also contiguous.

But there is also the fact that for Multi-MSI, we *must*
allocate 32 vectors. Any driver could assume that if we have
allocated 17 vectors, then there is another 15 available.

My question still stand: how was this working with the previous
behaviour?


Because previously in this scenario we would allocate 32 bits and free
32 bits in the map; but now we allocate 32 bits, yet only free 27 - so
leak 5 bits. And this comes from how irq_domain_free_irqs_hierarchy()
now frees per-interrupt, instead of all irqs per domain.

Before:
 In free path, we have:
 its_irq_domain_free(nvecs = 27)
   bitmap_release_region(count order = 5 == 32bits)

Current:
 In free path, we have:
 its_irq_domain_free(nvecs = 1) for free each 27 vecs
   bitmap_release_region(count order = 0 == 1bit)


Right. I was focusing on the patch and blindly ignored the explanation
at the top of the email. Apologies for that.

I'm not overly keen on handling this in the ITS though, and I'd rather
we try and do it in the generic code. How about this (compile tested
only)?

Thanks,

     M.

diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 6aacd342cd14..cfccad83c2df 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -1399,8 +1399,19 @@ static void irq_domain_free_irqs_hierarchy(struct 
irq_domain *domain,

  return;

  for (i = 0; i < nr_irqs; i++) {
-    if (irq_domain_get_irq_data(domain, irq_base + i))
-    domain->ops->free(domain, irq_base + i, 1);
+    int n ;
+
+    /* Find the largest possible span of IRQs to free in one go */
+    for (n = 0;
+ ((i + n) < nr_irqs &&
+  irq_domain_get_irq_data(domain, irq_base + i + n));
+ n++);
+
+    if (!n)
+    continue;
+
+    domain->ops->free(domain, irq_base + i, n);
+    i += n;
  }
  }






Re: [PATCH v2 0/4] iommu/iova: Add CPU hotplug handler to flush rcaches to core code

2021-04-06 Thread John Garry

On 25/03/2021 17:53, Will Deacon wrote:

On Thu, Mar 25, 2021 at 08:29:57PM +0800, John Garry wrote:

The Intel IOMMU driver supports flushing the per-CPU rcaches when a CPU is
offlined.

Let's move it to core code, so everyone can take advantage.

Also throw in a patch to stop exporting free_iova_fast().

Differences to v1:
- Add RB tags (thanks!)
- Add patch to stop exporting free_iova_fast()
- Drop patch to correct comment
- Add patch to delete iommu_dma_free_cpu_cached_iovas() and associated
   changes

John Garry (4):
   iova: Add CPU hotplug handler to flush rcaches
   iommu/vt-d: Remove IOVA domain rcache flushing for CPU offlining
   iommu: Delete iommu_dma_free_cpu_cached_iovas()
   iommu: Stop exporting free_iova_fast()

Looks like this is all set for 5.13, so hopefully Joerg can stick it in
-next for a bit more exposure.




Hi Joerg,

Can you kindly consider picking this up now?

Thanks,
John



Re: [PATCH v2 0/6] perf arm64 metricgroup support

2021-03-26 Thread John Garry

On 25/03/2021 20:39, Paul A. Clarke wrote:

On Thu, Mar 25, 2021 at 06:33:12PM +0800, John Garry wrote:

Metric reuse support is added for pmu-events parse metric testcase.
This had been broken on power9 recently:
https://lore.kernel.org/lkml/20210324015418.gc8...@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com/


Much better.  Before:
--
$ perf test -v 10 2>&1 | grep -i error | wc -l
112
--
After:
--
$ perf test -v 10 2>&1 | grep -i error | wc -l
17
--

And these seem like different types of issues:
--
$ perf test -v 10 2>&1 | grep -i error
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_powerbus0_imc'. Missing kernel support?' 
help '(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help 
'(null)'
Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help 
'(null)'
--



This looks suspicious.

Firstly, does /sys/bus/event_source/devices/nest_mcs01_imc (or others, 
above) exist on your system? I guess not.


Secondly, checking Documentation/powerpc/imc.rst, we have examples of:
nest_mcs01/PM_MCS01_64B_R...

So is the PMU name correct in the metric file for nest_mcs01_imc? 
Looking at the kernel driver file, arch/powerpc/perf/imc-pmu.c, it seems 
to be correct. Not sure.


Thanks,
John




(Added Kajol Jain to CC)






[PATCH v2 2/4] iommu/vt-d: Remove IOVA domain rcache flushing for CPU offlining

2021-03-25 Thread John Garry
Now that the core code handles flushing per-IOVA domain CPU rcaches,
remove the handling here.

Reviewed-by: Lu Baolu 
Signed-off-by: John Garry 
---
 drivers/iommu/intel/iommu.c | 31 ---
 include/linux/cpuhotplug.h  |  1 -
 2 files changed, 32 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee0932307d64..d1e66e1b07b8 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4065,35 +4065,6 @@ static struct notifier_block intel_iommu_memory_nb = {
.priority = 0
 };
 
-static void free_all_cpu_cached_iovas(unsigned int cpu)
-{
-   int i;
-
-   for (i = 0; i < g_num_of_iommus; i++) {
-   struct intel_iommu *iommu = g_iommus[i];
-   struct dmar_domain *domain;
-   int did;
-
-   if (!iommu)
-   continue;
-
-   for (did = 0; did < cap_ndoms(iommu->cap); did++) {
-   domain = get_iommu_domain(iommu, (u16)did);
-
-   if (!domain || domain->domain.type != IOMMU_DOMAIN_DMA)
-   continue;
-
-   iommu_dma_free_cpu_cached_iovas(cpu, &domain->domain);
-   }
-   }
-}
-
-static int intel_iommu_cpu_dead(unsigned int cpu)
-{
-   free_all_cpu_cached_iovas(cpu);
-   return 0;
-}
-
 static void intel_disable_iommus(void)
 {
struct intel_iommu *iommu = NULL;
@@ -4388,8 +4359,6 @@ int __init intel_iommu_init(void)
bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
if (si_domain && !hw_pass_through)
register_memory_notifier(&intel_iommu_memory_nb);
-   cpuhp_setup_state(CPUHP_IOMMU_INTEL_DEAD, "iommu/intel:dead", NULL,
- intel_iommu_cpu_dead);
 
down_read(&dmar_global_lock);
if (probe_acpi_namespace_devices())
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index cedac9986557..85996494bec1 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -57,7 +57,6 @@ enum cpuhp_state {
CPUHP_PAGE_ALLOC_DEAD,
CPUHP_NET_DEV_DEAD,
CPUHP_PCI_XGENE_DEAD,
-   CPUHP_IOMMU_INTEL_DEAD,
CPUHP_IOMMU_IOVA_DEAD,
CPUHP_LUSTRE_CFS_DEAD,
CPUHP_AP_ARM_CACHE_B15_RAC_DEAD,
-- 
2.26.2



[PATCH v2 0/4] iommu/iova: Add CPU hotplug handler to flush rcaches to core code

2021-03-25 Thread John Garry
The Intel IOMMU driver supports flushing the per-CPU rcaches when a CPU is
offlined.

Let's move it to core code, so everyone can take advantage.

Also throw in a patch to stop exporting free_iova_fast().

Differences to v1:
- Add RB tags (thanks!)
- Add patch to stop exporting free_iova_fast()
- Drop patch to correct comment
- Add patch to delete iommu_dma_free_cpu_cached_iovas() and associated
  changes

John Garry (4):
  iova: Add CPU hotplug handler to flush rcaches
  iommu/vt-d: Remove IOVA domain rcache flushing for CPU offlining
  iommu: Delete iommu_dma_free_cpu_cached_iovas()
  iommu: Stop exporting free_iova_fast()

 drivers/iommu/dma-iommu.c   |  9 -
 drivers/iommu/intel/iommu.c | 31 ---
 drivers/iommu/iova.c| 34 +++---
 include/linux/cpuhotplug.h  |  2 +-
 include/linux/dma-iommu.h   |  8 
 include/linux/iova.h|  6 +-
 6 files changed, 33 insertions(+), 57 deletions(-)

-- 
2.26.2



[PATCH v2 4/4] iommu: Stop exporting free_iova_fast()

2021-03-25 Thread John Garry
Function free_iova_fast() is only referenced by dma-iommu.c, which can
only be in-built, so stop exporting it.

This was missed in an earlier tidy-up patch.

Signed-off-by: John Garry 
---
 drivers/iommu/iova.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 8a493ee92c79..e31e79a9f5c3 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -493,7 +493,6 @@ free_iova_fast(struct iova_domain *iovad, unsigned long 
pfn, unsigned long size)
 
free_iova(iovad, pfn);
 }
-EXPORT_SYMBOL_GPL(free_iova_fast);
 
 #define fq_ring_for_each(i, fq) \
for ((i) = (fq)->head; (i) != (fq)->tail; (i) = ((i) + 1) % 
IOVA_FQ_SIZE)
-- 
2.26.2



[PATCH v2 3/4] iommu: Delete iommu_dma_free_cpu_cached_iovas()

2021-03-25 Thread John Garry
Function iommu_dma_free_cpu_cached_iovas() no longer has any caller, so
delete it.

With that, function free_cpu_cached_iovas() may be made static.

Signed-off-by: John Garry 
---
 drivers/iommu/dma-iommu.c | 9 -
 drivers/iommu/iova.c  | 3 ++-
 include/linux/dma-iommu.h | 8 
 include/linux/iova.h  | 5 -
 4 files changed, 2 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index af765c813cc8..9da7e9901bec 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -53,15 +53,6 @@ struct iommu_dma_cookie {
 
 static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled);
 
-void iommu_dma_free_cpu_cached_iovas(unsigned int cpu,
-   struct iommu_domain *domain)
-{
-   struct iommu_dma_cookie *cookie = domain->iova_cookie;
-   struct iova_domain *iovad = &cookie->iovad;
-
-   free_cpu_cached_iovas(cpu, iovad);
-}
-
 static void iommu_dma_entry_dtor(unsigned long data)
 {
struct page *freelist = (struct page *)data;
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index c78312560425..8a493ee92c79 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -22,6 +22,7 @@ static unsigned long iova_rcache_get(struct iova_domain 
*iovad,
 unsigned long size,
 unsigned long limit_pfn);
 static void init_iova_rcaches(struct iova_domain *iovad);
+static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 static void fq_destroy_all_entries(struct iova_domain *iovad);
 static void fq_flush_timeout(struct timer_list *t);
@@ -998,7 +999,7 @@ static void free_iova_rcaches(struct iova_domain *iovad)
 /*
  * free all the IOVA ranges cached by a cpu (used when cpu is unplugged)
  */
-void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad)
+static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad)
 {
struct iova_cpu_rcache *cpu_rcache;
struct iova_rcache *rcache;
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 706b68d1359b..2112f21f73d8 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -37,9 +37,6 @@ void iommu_dma_compose_msi_msg(struct msi_desc *desc,
 
 void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list);
 
-void iommu_dma_free_cpu_cached_iovas(unsigned int cpu,
-   struct iommu_domain *domain);
-
 #else /* CONFIG_IOMMU_DMA */
 
 struct iommu_domain;
@@ -81,10 +78,5 @@ static inline void iommu_dma_get_resv_regions(struct device 
*dev, struct list_he
 {
 }
 
-static inline void iommu_dma_free_cpu_cached_iovas(unsigned int cpu,
-   struct iommu_domain *domain)
-{
-}
-
 #endif /* CONFIG_IOMMU_DMA */
 #endif /* __DMA_IOMMU_H */
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 4be6c0ab4997..71d8a2de6635 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -157,7 +157,6 @@ int init_iova_flush_queue(struct iova_domain *iovad,
  iova_flush_cb flush_cb, iova_entry_dtor entry_dtor);
 struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
 void put_iova_domain(struct iova_domain *iovad);
-void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 #else
 static inline int iova_cache_get(void)
 {
@@ -234,10 +233,6 @@ static inline void put_iova_domain(struct iova_domain 
*iovad)
 {
 }
 
-static inline void free_cpu_cached_iovas(unsigned int cpu,
-struct iova_domain *iovad)
-{
-}
 #endif
 
 #endif
-- 
2.26.2



[PATCH v2 1/4] iova: Add CPU hotplug handler to flush rcaches

2021-03-25 Thread John Garry
Like the Intel IOMMU driver already does, flush the per-IOVA domain
CPU rcache when a CPU goes offline - there's no point in keeping it.

Reviewed-by: Robin Murphy 
Signed-off-by: John Garry 
---
 drivers/iommu/iova.c   | 30 +-
 include/linux/cpuhotplug.h |  1 +
 include/linux/iova.h   |  1 +
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index e6e2fa85271c..c78312560425 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -25,6 +25,17 @@ static void init_iova_rcaches(struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 static void fq_destroy_all_entries(struct iova_domain *iovad);
 static void fq_flush_timeout(struct timer_list *t);
+
+static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
+{
+   struct iova_domain *iovad;
+
+   iovad = hlist_entry_safe(node, struct iova_domain, cpuhp_dead);
+
+   free_cpu_cached_iovas(cpu, iovad);
+   return 0;
+}
+
 static void free_global_cached_iovas(struct iova_domain *iovad);
 
 void
@@ -51,6 +62,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long 
granule,
iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
+   cpuhp_state_add_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, 
&iovad->cpuhp_dead);
init_iova_rcaches(iovad);
 }
 EXPORT_SYMBOL_GPL(init_iova_domain);
@@ -257,10 +269,21 @@ int iova_cache_get(void)
 {
mutex_lock(&iova_cache_mutex);
if (!iova_cache_users) {
+   int ret;
+
+   ret = cpuhp_setup_state_multi(CPUHP_IOMMU_IOVA_DEAD, 
"iommu/iova:dead", NULL,
+   iova_cpuhp_dead);
+   if (ret) {
+   mutex_unlock(&iova_cache_mutex);
+   pr_err("Couldn't register cpuhp handler\n");
+   return ret;
+   }
+
iova_cache = kmem_cache_create(
"iommu_iova", sizeof(struct iova), 0,
SLAB_HWCACHE_ALIGN, NULL);
if (!iova_cache) {
+   cpuhp_remove_multi_state(CPUHP_IOMMU_IOVA_DEAD);
mutex_unlock(&iova_cache_mutex);
pr_err("Couldn't create iova cache\n");
return -ENOMEM;
@@ -282,8 +305,10 @@ void iova_cache_put(void)
return;
}
iova_cache_users--;
-   if (!iova_cache_users)
+   if (!iova_cache_users) {
+   cpuhp_remove_multi_state(CPUHP_IOMMU_IOVA_DEAD);
kmem_cache_destroy(iova_cache);
+   }
mutex_unlock(&iova_cache_mutex);
 }
 EXPORT_SYMBOL_GPL(iova_cache_put);
@@ -606,6 +631,9 @@ void put_iova_domain(struct iova_domain *iovad)
 {
struct iova *iova, *tmp;
 
+   cpuhp_state_remove_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD,
+   &iovad->cpuhp_dead);
+
free_iova_flush_queue(iovad);
free_iova_rcaches(iovad);
rbtree_postorder_for_each_entry_safe(iova, tmp, &iovad->rbroot, node)
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index f14adb882338..cedac9986557 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -58,6 +58,7 @@ enum cpuhp_state {
CPUHP_NET_DEV_DEAD,
CPUHP_PCI_XGENE_DEAD,
CPUHP_IOMMU_INTEL_DEAD,
+   CPUHP_IOMMU_IOVA_DEAD,
CPUHP_LUSTRE_CFS_DEAD,
CPUHP_AP_ARM_CACHE_B15_RAC_DEAD,
CPUHP_PADATA_DEAD,
diff --git a/include/linux/iova.h b/include/linux/iova.h
index c834c01c0a5b..4be6c0ab4997 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -95,6 +95,7 @@ struct iova_domain {
   flush-queues */
atomic_t fq_timer_on;   /* 1 when timer is active, 0
   when not */
+   struct hlist_node   cpuhp_dead;
 };
 
 static inline unsigned long iova_size(struct iova *iova)
-- 
2.26.2



[PATCH v2 2/6] perf test: Handle metric reuse in pmu-events parsing test

2021-03-25 Thread John Garry
The pmu-events parsing test does not handle metric reuse at all.

Introduce some simple handling to resolve metrics who reference other
metrics.

Signed-off-by: John Garry 
---
 tools/perf/tests/pmu-events.c | 80 +++
 1 file changed, 80 insertions(+)

diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c
index 0ca6a5a53523..20b6bf14f7f7 100644
--- a/tools/perf/tests/pmu-events.c
+++ b/tools/perf/tests/pmu-events.c
@@ -12,6 +12,7 @@
 #include "util/evlist.h"
 #include "util/expr.h"
 #include "util/parse-events.h"
+#include "metricgroup.h"
 
 struct perf_pmu_test_event {
/* used for matching against events from generated pmu-events.c */
@@ -471,6 +472,70 @@ static void expr_failure(const char *msg,
pr_debug("On expression %s\n", pe->metric_expr);
 }
 
+struct metric {
+   struct list_head list;
+   struct metric_ref metric_ref;
+};
+
+static int resolve_metric_simple(struct expr_parse_ctx *pctx,
+struct list_head *compound_list,
+struct pmu_events_map *map,
+const char *metric_name)
+{
+   struct hashmap_entry *cur, *cur_tmp;
+   struct metric *metric, *tmp;
+   size_t bkt;
+   bool all;
+   int rc;
+
+   do {
+   all = true;
+   hashmap__for_each_entry_safe((&pctx->ids), cur, cur_tmp, bkt) {
+   struct metric_ref *ref;
+   struct pmu_event *pe;
+
+   pe = metrcgroup_find_metric(cur->key, map);
+   if (!pe)
+   continue;
+
+   if (!strcmp(metric_name, (char *)cur->key)) {
+   pr_warning("Recursion detected for metric 
%s\n", metric_name);
+   rc = -1;
+   goto out_err;
+   }
+
+   all = false;
+
+   /* The metric key itself needs to go out.. */
+   expr__del_id(pctx, cur->key);
+
+   metric = malloc(sizeof(*metric));
+   if (!metric) {
+   rc = -ENOMEM;
+   goto out_err;
+   }
+
+   ref = &metric->metric_ref;
+   ref->metric_name = pe->metric_name;
+   ref->metric_expr = pe->metric_expr;
+   list_add_tail(&metric->list, compound_list);
+
+   rc = expr__find_other(pe->metric_expr, NULL, pctx, 0);
+   if (rc)
+   goto out_err;
+   }
+   } while (!all);
+
+   return 0;
+
+out_err:
+   list_for_each_entry_safe(metric, tmp, compound_list, list)
+   free(metric);
+
+   return rc;
+
+}
+
 static int test_parsing(void)
 {
struct pmu_events_map *cpus_map = perf_pmu__find_map(NULL);
@@ -488,7 +553,9 @@ static int test_parsing(void)
break;
j = 0;
for (;;) {
+   struct metric *metric, *tmp;
struct hashmap_entry *cur;
+   LIST_HEAD(compound_list);
size_t bkt;
 
pe = &map->table[j++];
@@ -504,6 +571,13 @@ static int test_parsing(void)
continue;
}
 
+   if (resolve_metric_simple(&ctx, &compound_list, map,
+ pe->metric_name)) {
+   expr_failure("Could not resolve metrics", map, 
pe);
+   ret++;
+   goto exit; /* Don't tolerate errors due to 
severity */
+   }
+
/*
 * Add all ids with a made up value. The value may
 * trigger divide by zero when subtracted and so try to
@@ -519,6 +593,11 @@ static int test_parsing(void)
ret++;
}
 
+   list_for_each_entry_safe(metric, tmp, &compound_list, 
list) {
+   expr__add_ref(&ctx, &metric->metric_ref);
+   free(metric);
+   }
+
if (expr__parse(&result, &ctx, pe->metric_expr, 0)) {
expr_failure("Parse failed", map, pe);
ret++;
@@ -527,6 +606,7 @@ static int test_parsing(void)
}
}
/* TODO: fail when not ok */
+exit:
return ret == 0 ? TEST_OK : TEST_SKIP;
 }
 
-- 
2.26.2



[PATCH v2 0/6] perf arm64 metricgroup support

2021-03-25 Thread John Garry
This series contains support to get basic metricgroups working for
arm64 CPUs.

Initial support is added for HiSilicon hip08 platform.

Some sample usage on Huawei D06 board:

 $ ./perf list metric

List of pre-defined events (to be used in -e): 

Metrics: 

  bp_misp_flush
   [BP misp flush L3 topdown metric]
  branch_mispredicts
   [Branch mispredicts L2 topdown metric]
  core_bound
   [Core bound L2 topdown metric]
  divider
   [Divider L3 topdown metric]
  exe_ports_util
   [EXE ports util L3 topdown metric]
  fetch_bandwidth_bound
   [Fetch bandwidth bound L2 topdown metric]
  fetch_latency_bound
   [Fetch latency bound L2 topdown metric]
  fsu_stall
   [FSU stall L3 topdown metric]
  idle_by_icache_miss

$ sudo ./perf stat -v -M core_bound sleep 1
Using CPUID 0x480fd010
metric expr (exe_stall_cycle - (mem_stall_anyload + 
armv8_pmuv3_0@event\=0x7005@)) / cpu_cycles for core_bound
found event cpu_cycles
found event armv8_pmuv3_0/event=0x7005/
found event exe_stall_cycle
found event mem_stall_anyload
adding {cpu_cycles -> armv8_pmuv3_0/event=0x7001/
mem_stall_anyload -> armv8_pmuv3_0/event=0x7004/
Control descriptor is not initialized
cpu_cycles: 989433 385050 385050
armv8_pmuv3_0/event=0x7005/: 19207 385050 385050
exe_stall_cycle: 900825 385050 385050
mem_stall_anyload: 253516 385050 385050

Performance counter stats for 'sleep':

989,433  cpu_cycles  # 0.63 core_bound
  19,207  armv8_pmuv3_0/event=0x7005/
 900,825  exe_stall_cycle
 253,516  mem_stall_anyload

   0.000805809 seconds time elapsed

   0.000875000 seconds user
   0.0 seconds sys
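
(As a quick sanity check of that 0.63 figure using the counts above: 
core_bound = (900825 - (253516 + 19207)) / 989433 = 628102 / 989433 ~= 0.63, 
which matches the value printed against cpu_cycles.)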
   
perf stat --topdown is not supported, as this requires the CPU PMU to
expose (alias) events for the TopDown L1 metrics from sysfs, which arm 
does not do. To get that to work, we probably need to make perf use the
pmu-events cpumap to learn about those alias events.

Metric reuse support is added for pmu-events parse metric testcase.
This had been broken on power9 recently:
https://lore.kernel.org/lkml/20210324015418.gc8...@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com/
 

Differences to v1:
- Add pmu_events_map__find() as arm64-specific function
- Fix metric reuse for pmu-events parse metric testcase 

John Garry (6):
  perf metricgroup: Make find_metric() public with name change
  perf test: Handle metric reuse in pmu-events parsing test
  perf pmu: Add pmu_events_map__find()
  perf vendor events arm64: Add Hisi hip08 L1 metrics
  perf vendor events arm64: Add Hisi hip08 L2 metrics
  perf vendor events arm64: Add Hisi hip08 L3 metrics

 tools/perf/arch/arm64/util/Build  |   1 +
 tools/perf/arch/arm64/util/pmu.c  |  25 ++
 .../arch/arm64/hisilicon/hip08/metrics.json   | 233 ++
 tools/perf/tests/pmu-events.c |  82 +-
 tools/perf/util/metricgroup.c |  12 +-
 tools/perf/util/metricgroup.h |   3 +-
 tools/perf/util/pmu.c |   5 +
 tools/perf/util/pmu.h |   1 +
 tools/perf/util/s390-sample-raw.c |   4 +-
 9 files changed, 355 insertions(+), 11 deletions(-)
 create mode 100644 tools/perf/arch/arm64/util/pmu.c
 create mode 100644 
tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json

-- 
2.26.2



[PATCH v2 4/6] perf vendor events arm64: Add Hisi hip08 L1 metrics

2021-03-25 Thread John Garry
Add L1 metrics. The formulas are kept as consistent as possible with the MAN
pages' descriptions for these metrics.

Signed-off-by: John Garry 
---
 .../arch/arm64/hisilicon/hip08/metrics.json   | 30 +++
 1 file changed, 30 insertions(+)
 create mode 100644 
tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json

diff --git a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json 
b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
new file mode 100644
index ..dc5ff3051639
--- /dev/null
+++ b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
@@ -0,0 +1,30 @@
+[
+{
+"MetricExpr": "FETCH_BUBBLE / (4 * CPU_CYCLES)",
+"PublicDescription": "Frontend bound L1 topdown metric",
+"BriefDescription": "Frontend bound L1 topdown metric",
+"MetricGroup": "TopDownL1",
+"MetricName": "frontend_bound"
+},
+{
+"MetricExpr": "(INST_SPEC - INST_RETIRED) / (4 * CPU_CYCLES)",
+"PublicDescription": "Bad Speculation L1 topdown metric",
+"BriefDescription": "Bad Speculation L1 topdown metric",
+"MetricGroup": "TopDownL1",
+"MetricName": "bad_speculation"
+},
+{
+"MetricExpr": "INST_RETIRED / (CPU_CYCLES * 4)",
+"PublicDescription": "Retiring L1 topdown metric",
+"BriefDescription": "Retiring L1 topdown metric",
+"MetricGroup": "TopDownL1",
+"MetricName": "retiring"
+},
+{
+"MetricExpr": "1 - (frontend_bound + bad_speculation + retiring)",
+"PublicDescription": "Backend Bound L1 topdown metric",
+"BriefDescription": "Backend Bound L1 topdown metric",
+"MetricGroup": "TopDownL1",
+"MetricName": "backend_bound"
+},
+]
-- 
2.26.2



[PATCH v2 3/6] perf pmu: Add pmu_events_map__find()

2021-03-25 Thread John Garry
Add a function to find the common PMU map for the system.

For arm64, a special variant is added. This is because arm64 supports
heterogeneous CPU systems. As such, it cannot be guaranteed that the cpumap
is same for all CPUs. So in case of heterogeneous systems, don't return
a cpumap.

Signed-off-by: John Garry 
---
 tools/perf/arch/arm64/util/Build  |  1 +
 tools/perf/arch/arm64/util/pmu.c  | 25 +
 tools/perf/tests/pmu-events.c |  2 +-
 tools/perf/util/metricgroup.c |  7 +++
 tools/perf/util/pmu.c |  5 +
 tools/perf/util/pmu.h |  1 +
 tools/perf/util/s390-sample-raw.c |  4 +---
 7 files changed, 37 insertions(+), 8 deletions(-)
 create mode 100644 tools/perf/arch/arm64/util/pmu.c

diff --git a/tools/perf/arch/arm64/util/Build b/tools/perf/arch/arm64/util/Build
index ead2f2275eee..9fcb4e68add9 100644
--- a/tools/perf/arch/arm64/util/Build
+++ b/tools/perf/arch/arm64/util/Build
@@ -2,6 +2,7 @@ perf-y += header.o
 perf-y += machine.o
 perf-y += perf_regs.o
 perf-y += tsc.o
+perf-y += pmu.o
 perf-y += kvm-stat.o
 perf-$(CONFIG_DWARF) += dwarf-regs.o
 perf-$(CONFIG_LOCAL_LIBUNWIND) += unwind-libunwind.o
diff --git a/tools/perf/arch/arm64/util/pmu.c b/tools/perf/arch/arm64/util/pmu.c
new file mode 100644
index ..d3259d61ca75
--- /dev/null
+++ b/tools/perf/arch/arm64/util/pmu.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "../../util/cpumap.h"
+#include "../../util/pmu.h"
+
+struct pmu_events_map *pmu_events_map__find(void)
+{
+   struct perf_pmu *pmu = NULL;
+
+   while ((pmu = perf_pmu__scan(pmu))) {
+   if (!is_pmu_core(pmu->name))
+   continue;
+
+   /*
+* The cpumap should cover all CPUs. Otherwise, some CPUs may
+* not support some events or have different event IDs.
+*/
+   if (pmu->cpus->nr != cpu__max_cpu())
+   return NULL;
+
+   return perf_pmu__find_map(pmu);
+   }
+
+   return NULL;
+}
diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c
index 20b6bf14f7f7..9fe2455b355b 100644
--- a/tools/perf/tests/pmu-events.c
+++ b/tools/perf/tests/pmu-events.c
@@ -538,7 +538,7 @@ static int resolve_metric_simple(struct expr_parse_ctx 
*pctx,
 
 static int test_parsing(void)
 {
-   struct pmu_events_map *cpus_map = perf_pmu__find_map(NULL);
+   struct pmu_events_map *cpus_map = pmu_events_map__find();
struct pmu_events_map *map;
struct pmu_event *pe;
int i, j, k;
diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index 71a13406e0bd..e6e787e94f91 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -618,7 +618,7 @@ static int metricgroup__print_sys_event_iter(struct 
pmu_event *pe, void *data)
 void metricgroup__print(bool metrics, bool metricgroups, char *filter,
bool raw, bool details)
 {
-   struct pmu_events_map *map = perf_pmu__find_map(NULL);
+   struct pmu_events_map *map = pmu_events_map__find();
struct pmu_event *pe;
int i;
struct rblist groups;
@@ -1254,8 +1254,7 @@ int metricgroup__parse_groups(const struct option *opt,
  struct rblist *metric_events)
 {
struct evlist *perf_evlist = *(struct evlist **)opt->value;
-   struct pmu_events_map *map = perf_pmu__find_map(NULL);
-
+   struct pmu_events_map *map = pmu_events_map__find();
 
return parse_groups(perf_evlist, str, metric_no_group,
metric_no_merge, NULL, metric_events, map);
@@ -1274,7 +1273,7 @@ int metricgroup__parse_groups_test(struct evlist *evlist,
 
 bool metricgroup__has_metric(const char *metric)
 {
-   struct pmu_events_map *map = perf_pmu__find_map(NULL);
+   struct pmu_events_map *map = pmu_events_map__find();
struct pmu_event *pe;
int i;
 
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 88da5cf6aee8..419ef6c4fbc0 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -717,6 +717,11 @@ struct pmu_events_map *perf_pmu__find_map(struct perf_pmu 
*pmu)
return map;
 }
 
+struct pmu_events_map *__weak pmu_events_map__find(void)
+{
+   return perf_pmu__find_map(NULL);
+}
+
 bool pmu_uncore_alias_match(const char *pmu_name, const char *name)
 {
char *tmp = NULL, *tok, *str;
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 8164388478c6..012317229488 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -114,6 +114,7 @@ void pmu_add_cpu_aliases_map(struct list_head *head, struct 
perf_pmu *pmu,
 struct pmu_events_map *map);
 
 struct pmu_events_map *perf_pmu__find_map(struct perf_pmu *pmu);
+struct pmu_events_map *pmu_events_map__find(void);
bool pmu_uncore_alias_match(const char *pmu_name, const char *name);

[PATCH v2 5/6] perf vendor events arm64: Add Hisi hip08 L2 metrics

2021-03-25 Thread John Garry
Add L2 metrics.

Signed-off-by: John Garry 
---
 .../arch/arm64/hisilicon/hip08/metrics.json   | 42 +++
 1 file changed, 42 insertions(+)

diff --git a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json 
b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
index dc5ff3051639..dda898d23c2d 100644
--- a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
+++ b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
@@ -27,4 +27,46 @@
 "MetricGroup": "TopDownL1",
 "MetricName": "backend_bound"
 },
+{
+"MetricExpr": "armv8_pmuv3_0@event\\=0x201d@ / CPU_CYCLES",
+"PublicDescription": "Fetch latency bound L2 topdown metric",
+"BriefDescription": "Fetch latency bound L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "fetch_latency_bound"
+},
+{
+"MetricExpr": "frontend_bound - fetch_latency_bound",
+"PublicDescription": "Fetch bandwidth bound L2 topdown metric",
+"BriefDescription": "Fetch bandwidth bound L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "fetch_bandwidth_bound"
+},
+{
+"MetricExpr": "(bad_speculation * BR_MIS_PRED) / (BR_MIS_PRED + 
armv8_pmuv3_0@event\\=0x2013@)",
+"PublicDescription": "Branch mispredicts L2 topdown metric",
+"BriefDescription": "Branch mispredicts L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "branch_mispredicts"
+},
+{
+"MetricExpr": "bad_speculation - branch_mispredicts",
+"PublicDescription": "Machine clears L2 topdown metric",
+"BriefDescription": "Machine clears L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "machine_clears"
+},
+{
+"MetricExpr": "(EXE_STALL_CYCLE - (MEM_STALL_ANYLOAD + 
armv8_pmuv3_0@event\\=0x7005@)) / CPU_CYCLES",
+"PublicDescription": "Core bound L2 topdown metric",
+"BriefDescription": "Core bound L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "core_bound"
+},
+{
+"MetricExpr": "(MEM_STALL_ANYLOAD + armv8_pmuv3_0@event\\=0x7005@) / 
CPU_CYCLES",
+"PublicDescription": "Memory bound L2 topdown metric",
+"BriefDescription": "Memory bound L2 topdown metric",
+"MetricGroup": "TopDownL2",
+"MetricName": "memory_bound"
+},
 ]
-- 
2.26.2



[PATCH v2 6/6] perf vendor events arm64: Add Hisi hip08 L3 metrics

2021-03-25 Thread John Garry
Add L3 metrics.

Signed-off-by: John Garry 
---
 .../arch/arm64/hisilicon/hip08/metrics.json   | 161 ++
 1 file changed, 161 insertions(+)

diff --git a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json 
b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
index dda898d23c2d..dda8e59149d2 100644
--- a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
+++ b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json
@@ -69,4 +69,165 @@
 "MetricGroup": "TopDownL2",
 "MetricName": "memory_bound"
 },
+{
+"MetricExpr": "(((L2I_TLB - L2I_TLB_REFILL) * 15) + (L2I_TLB_REFILL * 
100)) / CPU_CYCLES",
+"PublicDescription": "Idle by itlb miss L3 topdown metric",
+"BriefDescription": "Idle by itlb miss L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "idle_by_itlb_miss"
+},
+{
+"MetricExpr": "(((L2I_CACHE - L2I_CACHE_REFILL) * 15) + 
(L2I_CACHE_REFILL * 100)) / CPU_CYCLES",
+"PublicDescription": "Idle by icache miss L3 topdown metric",
+"BriefDescription": "Idle by icache miss L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "idle_by_icache_miss"
+},
+{
+"MetricExpr": "(BR_MIS_PRED * 5) / CPU_CYCLES",
+"PublicDescription": "BP misp flush L3 topdown metric",
+"BriefDescription": "BP misp flush L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "bp_misp_flush"
+},
+{
+"MetricExpr": "(armv8_pmuv3_0@event\\=0x2013@ * 5) / CPU_CYCLES",
+"PublicDescription": "OOO flush L3 topdown metric",
+"BriefDescription": "OOO flush L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "ooo_flush"
+},
+{
+"MetricExpr": "(armv8_pmuv3_0@event\\=0x1001@ * 5) / CPU_CYCLES",
+"PublicDescription": "Static predictor flush L3 topdown metric",
+"BriefDescription": "Static predictor flush L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "sp_flush"
+},
+{
+"MetricExpr": "armv8_pmuv3_0@event\\=0x1010@ / BR_MIS_PRED",
+"PublicDescription": "Indirect branch L3 topdown metric",
+"BriefDescription": "Indirect branch L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "indirect_branch"
+},
+{
+"MetricExpr": "(armv8_pmuv3_0@event\\=0x1014@ + 
armv8_pmuv3_0@event\\=0x1018@) / BR_MIS_PRED",
+"PublicDescription": "Push branch L3 topdown metric",
+"BriefDescription": "Push branch L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "push_branch"
+},
+{
+"MetricExpr": "armv8_pmuv3_0@event\\=0x100c@ / BR_MIS_PRED",
+"PublicDescription": "Pop branch L3 topdown metric",
+"BriefDescription": "Pop branch L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "pop_branch"
+},
+{
+"MetricExpr": "(BR_MIS_PRED - armv8_pmuv3_0@event\\=0x1010@ - 
armv8_pmuv3_0@event\\=0x1014@ - armv8_pmuv3_0@event\\=0x1018@ - 
armv8_pmuv3_0@event\\=0x100c@) / BR_MIS_PRED",
+"PublicDescription": "Other branch L3 topdown metric",
+"BriefDescription": "Other branch L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "other_branch"
+},
+{
+"MetricExpr": "armv8_pmuv3_0@event\\=0x2012@ / 
armv8_pmuv3_0@event\\=0x2013@",
+"PublicDescription": "Nuke flush L3 topdown metric",
+"BriefDescription": "Nuke flush L3 topdown metric",
+"MetricGroup": "TopDownL3",
+"MetricName": "nuke_flush"
+},
+{
+"MetricExpr": "1 - nuke_flush",
+"PublicDescription": "Other flush L3 topdown metric",
+"BriefDescription": "Other flush L3 topdown metric",
+ 

[PATCH v2 1/6] perf metricgroup: Make find_metric() public with name change

2021-03-25 Thread John Garry
Function find_metric() is required for the metric processing in the
pmu-events testcase, so make it public. Also change the name to include
"metricgroup".

Signed-off-by: John Garry 
---
 tools/perf/util/metricgroup.c | 5 +++--
 tools/perf/util/metricgroup.h | 3 ++-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index 6acb44ad439b..71a13406e0bd 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -900,7 +900,8 @@ static int __add_metric(struct list_head *metric_list,
(match_metric(__pe->metric_group, __metric) ||  \
 match_metric(__pe->metric_name, __metric)))
 
-static struct pmu_event *find_metric(const char *metric, struct pmu_events_map 
*map)
+struct pmu_event *metrcgroup_find_metric(const char *metric,
+struct pmu_events_map *map)
 {
struct pmu_event *pe;
int i;
@@ -985,7 +986,7 @@ static int __resolve_metric(struct metric *m,
struct expr_id *parent;
struct pmu_event *pe;
 
-   pe = find_metric(cur->key, map);
+   pe = metrcgroup_find_metric(cur->key, map);
if (!pe)
continue;
 
diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h
index ed1b9392e624..1674c6a36d74 100644
--- a/tools/perf/util/metricgroup.h
+++ b/tools/perf/util/metricgroup.h
@@ -44,7 +44,8 @@ int metricgroup__parse_groups(const struct option *opt,
  bool metric_no_group,
  bool metric_no_merge,
  struct rblist *metric_events);
-
+struct pmu_event *metrcgroup_find_metric(const char *metric,
+struct pmu_events_map *map);
 int metricgroup__parse_groups_test(struct evlist *evlist,
   struct pmu_events_map *map,
   const char *str,
-- 
2.26.2



Re: [PATCHv4 00/19] perf metric: Add support to reuse metric

2021-03-24 Thread John Garry

On 24/03/2021 01:54, Paul A. Clarke wrote:

--

Since commit 8989f5f07605 ("perf stat: Update POWER9 metrics to utilize
other metrics"), power9 has reused metrics.

And I am finding that subtest 10.3 caused problems when I tried to introduce
metric reuse on arm64, so I was just asking you to check.

Now I am a bit confused...

I now understand your original request!:-)

The above test was run on a POWER8 system.

I do see failures on a POWER9 system:
--
$ ./perf test 10
10: PMU events  :
10.1: PMU event table sanity: Ok
10.2: PMU event map aliases : Ok
10.3: Parsing of PMU event table metrics: Skip 
(some metrics failed)
10.4: Parsing of PMU event table metrics with fake PMUs : Ok
$ ./perf test --verbose 10 2>&1 | grep -i '^Parse event failed' | wc -l
112


ok, thanks for confirming.

I'm looking at fixing it, and the solution doesn't look simple :(

John


Re: [PATCHv4 00/19] perf metric: Add support to reuse metric

2021-03-23 Thread John Garry

On 23/03/2021 15:06, Paul A. Clarke wrote:

On Mon, Mar 22, 2021 at 11:36:23AM +, John Garry wrote:

On 01/08/2020 12:40, Paul A. Clarke wrote:

v4 changes:
- removed acks from patch because it changed a bit
  with the last fixes:
perf metric: Collect referenced metrics in struct metric_ref_node
- fixed runtime metrics [Kajol Jain]
- increased recursion depth [Paul A. Clarke]
- changed patches due to dependencies:
perf metric: Collect referenced metrics in struct metric_ref_node
perf metric: Add recursion check when processing nested metrics
perf metric: Rename struct egroup to metric
perf metric: Rename group_list to metric_list

Also available in here:
git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
perf/metric

I built and ran from the above git branch, and things seem to work.
Indeed, I was able to apply my changes to exploit the new capabilities
via modifications to tools/perf/pmu-events/arch/powerpc/power9/metrics.json,
as I posted earlier (and will submit once this set gets merged).

I was just wondering: Does perf subtest 10.3 work ok for you with the metric
reuse?

That's "Parsing of PMU event table metrics" subtest.

I confess I'm not sure what you are asking. Using the latest mainline
(84196390620ac0e5070ae36af84c137c6216a7dc), perf subtest 10.3 does
pass for me:
--
$ ./perf test 10
10: PMU events  :
10.1: PMU event table sanity: Ok
10.2: PMU event map aliases : Ok
10.3: Parsing of PMU event table metrics: Ok
10.4: Parsing of PMU event table metrics with fake PMUs : Ok
--
Since commit 8989f5f07605 ("perf stat: Update POWER9 metrics to utilize 
other metrics"), power9 has reused metrics.


And I am finding that subtest 10.3 caused problems when I tried to 
introduce metric reuse on arm64, so I was just asking you to check.


Now I am a bit confused...

Thanks for checking,
john


Re: [PATCH 3/3] iova: Correct comment for free_cpu_cached_iovas()

2021-03-23 Thread John Garry

On 23/03/2021 13:05, Robin Murphy wrote:

On 2021-03-01 12:12, John Garry wrote:

Function free_cpu_cached_iovas() is not only called when a CPU is
hotplugged, so remove that part of the code comment.


FWIW I read it as clarifying why this is broken out into a separate 
function vs. a monolithic "free all cached IOVAs" routine that handles 
both the per-cpu and global caches 



it never said "*only* used..."


It seems to be implying that.

It's only a code comment, so I don't care too much either way and can 
drop this change.




As such I'd hesitate to call it incorrect, but it's certainly arguable 
whether it needs to be stated or not, especially once the hotplug 
callsite is now obvious in the same file - on which note the function 
itself also shouldn't need to be public any more, no?




Right, I actually missed deleting iommu_dma_free_cpu_cached_iovas(), so 
can fix that now.


Cheers,
John


Robin.


Signed-off-by: John Garry 
---
  drivers/iommu/iova.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index c78312560425..465b3b0eeeb0 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -996,7 +996,7 @@ static void free_iova_rcaches(struct iova_domain 
*iovad)

  }
  /*
- * free all the IOVA ranges cached by a cpu (used when cpu is unplugged)
+ * free all the IOVA ranges cached by a cpu
   */
  void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad)
  {


.




Re: [PATCH 0/3] iommu/iova: Add CPU hotplug handler to flush rcaches to core code

2021-03-22 Thread John Garry

On 01/03/2021 12:12, John Garry wrote:

The Intel IOMMU driver supports flushing the per-CPU rcaches when a CPU is
offlined.

Let's move it to core code, so everyone can take advantage.

Also correct a code comment.

Based on v5.12-rc1. Tested on arm64 only.


Hi guys,

Friendly reminder ...

Thanks
John



John Garry (3):
   iova: Add CPU hotplug handler to flush rcaches
   iommu/vt-d: Remove IOVA domain rcache flushing for CPU offlining
   iova: Correct comment for free_cpu_cached_iovas()

  drivers/iommu/intel/iommu.c | 31 ---
  drivers/iommu/iova.c| 32 ++--
  include/linux/cpuhotplug.h  |  2 +-
  include/linux/iova.h|  1 +
  4 files changed, 32 insertions(+), 34 deletions(-)





Re: arm64 syzbot instances

2021-03-22 Thread John Garry


There's apparently a bit in the PCI spec that reads:
 The host bus bridge, in PC compatible systems, must return all
 1's on a read transaction and discard data on a write transaction
 when terminated with Master-Abort.

which obviously applies only to "PC compatible systems".


Right. As far as I can tell, all ARMv8 and most ARMv7 based SoCs
do this to be more compatible with PC style operating systems like
Linux, but you are right that the specification here does not
mandate that, and the older ARMv5 SoCs seem to be compliant
as well based on this.


Linux has a driver for DPC, which apparently configures it to
cause an interrupt to log the event, but it does not hook up the
CPU exception handler to this. I don't see an implementation of DPC
in qemu, which I take as an indication that it should use the
default behavior and cause neither an interrupt nor a CPU exception.


Hmm, maybe. We should probably also implement -1/discard just because
we're not intending to have 'surprising' behaviour.

TBH I'm having difficulty seeing why the kernel should be doing
this at all, though. The device tree tells you you have a PCI
controller; PCI supports enumeration of devices; you know exactly
where everything is mapped because the BARs tell you that.
I don't see anything that justifies the kernel in randomly
dereferencing areas of the IO or memory windows where it hasn't
mapped anything.


BIOS has described a CPU-addressable PIO region in the PCI hostbridge, 
and the kernel has mapped it:


[3.974309][T1] pci-host-generic 401000.pcie:   IO
0x003eff..0x003eff -> 0x00

So I don't see why any accesses there should fault.


You shouldn't be probing for legacy ISA-port
devices unless you're on a system which might actually have them
(eg an x86 PC).


It only happened in this case because there is also a bug in
the 8250 serial port driver that is configured to assume four ports
exist at port zero. On real arm64 hardware, this is apparently
harmless because the driver has coped with this for 30 years ;-)

There are a few other drivers that assume hardware is accessible
at the legacy addresses, and applications can also still open /dev/ioport
(if that is enabled at compile time) for the same purpose. Examples
could be PC-style mouse/keyboard (emulated by a server BMC),
PATA/SATA controllers in pre-AHCI mode, VGA console, and a
couple of industrial I/O drivers that have ISA devices behind a
PCI bridge.

Most other actual ISA add-on card drivers can only be enabled
on kernels that support machines with real slots, so you could
get them on an i386 kernel running a virtualized x86_64 machine,
but not on ARMv6 or later kernels, and you can't run pre-ARMv7
kernels on ARMv8 hardware.
  There are also lots of hwmon drivers which use Super I/O, and probe 
fixed PIO addresses for HW detection. These may be enabled on any 
architecture (apart from PPC, which explicitly disabled them to avoid 
issues like this).


Thanks,
John


Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator

2021-03-22 Thread John Garry

On 19/03/2021 19:20, Robin Murphy wrote:

Hi Robin,


So then we have the issue of how to dynamically increase this rcache
threshold. The problem is that we may have many devices associated with
the same domain. So, in theory, we can't assume that when we increase
the threshold that some other device will try to fast free an IOVA which
was allocated prior to the increase and was not rounded up.

I'm very open to better (or less bad) suggestions on how to do this ...

...but yes, regardless of exactly where it happens, rounding up or not
is the problem for rcaches in general. I've said several times that my
preferred approach is to not change it that dynamically at all, but
instead treat it more like we treat the default domain type.



Can you remind me of that idea? I don't remember you mentioning using 
default domain handling as a reference in any context.


Thanks,
John


Re: [PATCHv4 00/19] perf metric: Add support to reuse metric

2021-03-22 Thread John Garry

On 01/08/2020 12:40, Paul A. Clarke wrote:

v4 changes:
   - removed acks from patch because it changed a bit
 with the last fixes:
   perf metric: Collect referenced metrics in struct metric_ref_node
   - fixed runtime metrics [Kajol Jain]
   - increased recursion depth [Paul A. Clarke]
   - changed patches due to dependencies:
   perf metric: Collect referenced metrics in struct metric_ref_node
   perf metric: Add recursion check when processing nested metrics
   perf metric: Rename struct egroup to metric
   perf metric: Rename group_list to metric_list

Also available in here:
   git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
   perf/metric

I built and ran from the above git branch, and things seem to work.
Indeed, I was able to apply my changes to exploit the new capabilities
via modifications to tools/perf/pmu-events/arch/powerpc/power9/metrics.json,
as I posted earlier (and will submit once this set gets merged).



Hi Paul,

I was just wondering: Does perf subtest 10.3 work ok for you with the 
metric reuse?


That's "Parsing of PMU event table metrics" subtest.

Hi Jirka,

If I add something like this:
{
"BriefDescription": "dummy test1",
"MetricExpr": "Bad_Speculation + Frontend_Bound",
"MetricGroup": "TopdownL1",
"MetricName": "dummytest",
"PublicDescription": "dummy test2"
},

I get "Parse event failed metric 'dummytest' id 'bad_speculation' expr 
'bad_speculation + frontend_bound'"


Thanks,
john


Tested-by: Paul A. Clarke

One thing I noted, but which also occurs without these patches, is that
the perf metrics are not computed unless run as root:
--
$ perf stat --metrics br_misprediction_percent command

  Performance counter stats for 'command':

  1,823,530,051  pm_br_pred:u
  2,662,705  pm_br_mpred_cmpl:u

$ /usr/bin/sudo perf stat --metrics br_misprediction_percent command

  Performance counter stats for 'command':

  1,824,655,269  pm_br_pred# 0.09 
br_misprediction_percent
  1,654,466  pm_br_mpred_cmpl
--




Re: [PATCH v2] scsi: libsas: Reset num_scatter if libata mark qc as NODATA

2021-03-20 Thread John Garry

On 19/03/2021 01:43, Jason Yan wrote:



On 2021/3/19 6:56, Jolly Shah wrote:

When the cache_type for the scsi device is changed, the scsi layer
issues a MODE_SELECT command. The caching mode details are communicated
via a request buffer associated with the scsi command with data
direction set as DMA_TO_DEVICE (scsi_mode_select). When this command
reaches the libata layer, as a part of generic initial setup, libata
layer sets up the scatterlist for the command using the scsi command
(ata_scsi_qc_new). This command is then translated by the libata layer
into ATA_CMD_SET_FEATURES (ata_scsi_mode_select_xlat). The libata layer
treats this as a non data command (ata_mselect_caching), since it only
needs an ata taskfile to pass the caching on/off information to the
device. It does not need the scatterlist that has been setup, so it does
not perform dma_map_sg on the scatterlist (ata_qc_issue). Unfortunately,
when this command reaches the libsas layer(sas_ata_qc_issue), libsas
layer sees it as a non data command with a scatterlist. It cannot
extract the correct dma length, since the scatterlist has not been
mapped with dma_map_sg for a DMA operation. When this partially
constructed SAS task reaches pm80xx LLDD, it results in below warning.

"pm80xx_chip_sata_req 6058: The sg list address
start_addr=0x data_len=0x0end_addr_high=0x
end_addr_low=0x has crossed 4G boundary"

This patch updates code to handle ata non data commands separately so
num_scatter and total_xfer_len remain 0.

Fixes: 53de092f47ff ("scsi: libsas: Set data_dir as DMA_NONE if libata 
marks qc as NODATA")

Signed-off-by: Jolly Shah 


Reviewed-by: John Garry 

@luojiaxing, can you please test this?


---
v2:
- reorganized code to avoid setting num_scatter twice

  drivers/scsi/libsas/sas_ata.c | 9 -
  1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c 
b/drivers/scsi/libsas/sas_ata.c

index 024e5a550759..8b9a39077dba 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -201,18 +201,17 @@ static unsigned int sas_ata_qc_issue(struct 
ata_queued_cmd *qc)

  memcpy(task->ata_task.atapi_packet, qc->cdb, qc->dev->cdb_len);
  task->total_xfer_len = qc->nbytes;
  task->num_scatter = qc->n_elem;
+    task->data_dir = qc->dma_dir;
+    } else if (qc->tf.protocol == ATA_PROT_NODATA) {
+    task->data_dir = DMA_NONE;


Hi Jolly & John,

We only set DMA_NONE for ATA_PROT_NODATA, I'm curious about why 
ATA_PROT_NCQ_NODATA and ATAPI_PROT_NODATA do not need to set DMA_NONE?


So we can see something like atapi_eh_tur() -> ata_exec_internal(), 
which is an ATAPI NODATA command and has DMA_NONE, so should be ok.


Other cases, like those using the xlate function on the qc for 
ATA_PROT_NCQ_NODATA, could be checked further.


For now, we're just trying to fix the fix.



Thanks,
Jason



  } else {
  for_each_sg(qc->sg, sg, qc->n_elem, si)
  xfer += sg_dma_len(sg);
  task->total_xfer_len = xfer;
  task->num_scatter = si;
-    }
-
-    if (qc->tf.protocol == ATA_PROT_NODATA)
-    task->data_dir = DMA_NONE;
-    else
  task->data_dir = qc->dma_dir;
+    }
  task->scatter = qc->sg;
  task->ata_task.retry_count = 1;
  task->task_state_flags = SAS_TASK_STATE_PENDING;


.




Re: [RFC PATCH v3 2/3] blk-mq: Freeze and quiesce all queues for tagset in elevator_exit()

2021-03-19 Thread John Garry

On 16/03/2021 19:59, Bart Van Assche wrote:

On 3/16/21 10:43 AM, John Garry wrote:

On 16/03/2021 17:00, Bart Van Assche wrote:

I agree that Jens asked at the end of 2018 not to touch the fast path
to fix this use-after-free (maybe that request has been repeated more
recently). If Jens or anyone else feels strongly about not clearing
hctx->tags->rqs[rq->tag] from the fast path then I will make that change.


Hi Bart,


Is that possible for this same approach? I need to check the code more..

If the fast path should not be modified, I'm considering to borrow patch
1/3 from your patch series


Fine


and to add an rcu_barrier() between the code
that clears the request pointers and that frees the scheduler requests.


And don't we still have the problem that some iter callbacks may
sleep/block, which is not allowed in an RCU read-side critical section?

Thanks for having brought this up. Since none of the functions that
iterate over requests should be called from the hot path of a block
driver, I think that we can use srcu_read_(un|)lock() inside bt_iter()
and bt_tags_iter() instead of rcu_read_(un|)lock().


OK, but TBH, I am not so familiar with srcu - were you going to try this?

Thanks,
John
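
(Editorial aside: a minimal sketch of the SRCU read-side pattern Bart is
proposing above - hypothetical and illustrative only, not the eventual
blk-mq implementation; the srcu_struct name is made up.)

/* SRCU read sections, unlike rcu_read_lock() sections, may sleep, so a
 * tag-iterator callback that blocks would be legal under this scheme. */
DEFINE_SRCU(tags_iter_srcu);

static void bt_iter_sketch(void)
{
	int idx = srcu_read_lock(&tags_iter_srcu);

	/* ... walk tags->rqs[] and call the (possibly sleeping) callback ... */

	srcu_read_unlock(&tags_iter_srcu, idx);
}

static void free_sched_requests_sketch(void)
{
	/* Writer side: wait for all SRCU readers before freeing requests. */
	synchronize_srcu(&tags_iter_srcu);
	/* ... now safe to free the scheduler requests ... */
}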


Re: [PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size()

2021-03-19 Thread John Garry

On 19/03/2021 17:00, Robin Murphy wrote:

On 2021-03-19 13:25, John Garry wrote:
Add a function to allow setting the max size for which we want to
optimise DMA mappings.


It seems neat in theory - particularly for packet-based interfaces that 
might have a known fixed size of data unit that they're working on at 
any given time - but aren't there going to be many cases where the 
driver has no idea because it depends on whatever size(s) of request 
userspace happens to throw at it? Even if it does know the absolute 
maximum size of thing it could ever transfer, that could be 
impractically large in areas like video/AI/etc., so it could still be 
hard to make a reasonable decision.


So if you consider the SCSI stack, which is my interest, we know the max 
segment size and we know the max number of segments per request, so we 
should know the theoretical upper limit of the actual IOVA length we can 
get.


Indeed, from my experiment on my SCSI host, max IOVA len is found to be 
507904, which is PAGE_SIZE * 124 (that is max sg ents there). 
Incidentally that means that we want RCACHE RANGE MAX of 8, not 6.
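
(As an aside, the arithmetic behind "8, not 6", spelled out - purely
illustrative userspace code, assuming 4K pages and the log_size check
that iova_rcache_insert() applies:)

#include <stdio.h>

int main(void)
{
	unsigned long len = 507904, page_size = 4096;
	unsigned long pages = len / page_size;	/* 124 */
	unsigned int log_size = 0;

	while ((1UL << log_size) < pages)	/* order_base_2(124) == 7 */
		log_size++;

	/* iova_rcache_insert() rejects log_size >= rcache max, so caching
	 * a 124-page IOVA needs a max of at least log_size + 1 == 8. */
	printf("pages=%lu log_size=%u needed rcache max=%u\n",
	       pages, log_size, log_size + 1);
	return 0;
}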




Being largely workload-dependent is why I still think this should be a 
command-line or sysfs tuneable - we could set the default based on how 
much total memory is available, but ultimately it's the end user who 
knows what the workload is going to be and what they care about 
optimising for.


If that hardware is only found in a server, then the extra memory cost 
would be trivial, so setting it to the max is the standard approach.




Another thought (which I'm almost reluctant to share) is that I would 
*love* to try implementing a self-tuning strategy that can detect high 
contention on particular allocation sizes and adjust the caches on the 
fly, but I can easily imagine that having enough inherent overhead to 
end up being an impractical (but fun) waste of time.




For now, I just want to recover the performance lost recently :)

Thanks,
John


Re: [PATCH 3/6] iova: Allow rcache range upper limit to be configurable

2021-03-19 Thread John Garry

On 19/03/2021 16:25, Robin Murphy wrote:

On 2021-03-19 13:25, John Garry wrote:

Some LLDs may request DMA mappings whose IOVA length exceeds that of the
current rcache upper limit.

This means that allocations for those IOVAs will never be cached, and
always must be allocated and freed from the RB tree per DMA mapping 
cycle.

This has a significant effect on performance, more so since commit
4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search
fails"), as discussed at [0].

Allow the range of cached IOVAs to be increased, by providing an API 
to set the upper limit, which is capped at IOVA_RANGE_CACHE_MAX_SIZE.

For simplicity, the full range of IOVA rcaches is allocated and 
initialized at IOVA domain init time.

Setting the range upper limit will fail if the domain is already live
(before the tree contains IOVA nodes). This must be done to ensure any
IOVAs cached comply with rule of size being a power-of-2.

[0] 
https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/ 



Signed-off-by: John Garry 
---
  drivers/iommu/iova.c | 37 +++--
  include/linux/iova.h | 11 ++-
  2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index cecc74fb8663..d4f2f7fbbd84 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -49,6 +49,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned 
long granule,

  iovad->flush_cb = NULL;
  iovad->fq = NULL;
  iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
+    iovad->rcache_max_size = IOVA_RANGE_CACHE_DEFAULT_SIZE;
  rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
  rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
  init_iova_rcaches(iovad);
@@ -194,7 +195,7 @@ static int __alloc_and_insert_iova_range(struct 
iova_domain *iovad,
   * rounding up anything cacheable to make sure that can't 
happen. The

   * order of the unadjusted size will still match upon freeing.
   */
-    if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+    if (fast && size < (1 << (iovad->rcache_max_size - 1)))
  size = roundup_pow_of_two(size);
  if (size_aligned)
@@ -901,7 +902,7 @@ static bool iova_rcache_insert(struct iova_domain 
*iovad, unsigned long pfn,

  {
  unsigned int log_size = order_base_2(size);
-    if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
+    if (log_size >= iovad->rcache_max_size)
  return false;
  return __iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn);
@@ -946,6 +947,38 @@ static unsigned long __iova_rcache_get(struct 
iova_rcache *rcache,

  return iova_pfn;
  }
+void iova_rcache_set_upper_limit(struct iova_domain *iovad,
+ unsigned long iova_len)
+{
+    unsigned int rcache_index = order_base_2(iova_len) + 1;
+    struct rb_node *rb_node = &iovad->anchor.node;
+    unsigned long flags;
+    int count = 0;
+
+    spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+    if (rcache_index <= iovad->rcache_max_size)
+    goto out;
+
+    while (1) {
+    rb_node = rb_prev(rb_node);
+    if (!rb_node)
+    break;
+    count++;
+    }
+
+    /*
+ * If there are already IOVA nodes present in the tree, then don't
+ * allow range upper limit to be set.
+ */
+    if (count != iovad->reserved_node_count)
+    goto out;
+
+    iovad->rcache_max_size = min_t(unsigned long, rcache_index,
+   IOVA_RANGE_CACHE_MAX_SIZE);
+out:
+    spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+}
+
  /*
   * Try to satisfy IOVA allocation range from rcache.  Fail if requested
   * size is too big or the DMA limit we are given isn't satisfied by the
diff --git a/include/linux/iova.h b/include/linux/iova.h
index fd3217a605b2..952b81b54ef7 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -25,7 +25,8 @@ struct iova {
  struct iova_magazine;
  struct iova_cpu_rcache;
-#define IOVA_RANGE_CACHE_MAX_SIZE 6    /* log of max cached IOVA 
range size (in pages) */

+#define IOVA_RANGE_CACHE_DEFAULT_SIZE 6
+#define IOVA_RANGE_CACHE_MAX_SIZE 10 /* log of max cached IOVA range 
size (in pages) */


No.

And why? If we're going to allocate massive caches anyway, whatever is 
the point of *not* using all of them?




I wanted to keep the same effective threshold for devices today, unless 
set otherwise.


The reason is that I didn't know if a blanket increase would cause 
regressions, and I was taking the super-safe road. Specifically some 
systems may be very IOVA space limited, and just work today by not 
caching large IOVAs.


And in the precursor thread you wrote "We can't arbitrarily *increase* 
the scope of caching once a domain is active due to the size-rounding-up 
requirement, which would be prohibitive to larger allocations if applied 
universally" (sorry for quotin

Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator

2021-03-19 Thread John Garry

On 19/03/2021 16:13, Robin Murphy wrote:

On 2021-03-19 13:25, John Garry wrote:

Move the IOVA size power-of-2 rcache roundup into the IOVA allocator.

This is to eventually make it possible to be able to configure the upper
limit of the IOVA rcache range.

Signed-off-by: John Garry 
---
  drivers/iommu/dma-iommu.c |  8 --
  drivers/iommu/iova.c  | 51 ++-
  2 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index af765c813cc8..15b7270a5c2a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -429,14 +429,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct 
iommu_domain *domain,

  shift = iova_shift(iovad);
  iova_len = size >> shift;
-    /*
- * Freeing non-power-of-two-sized allocations back into the IOVA 
caches
- * will come back to bite us badly, so we have to waste a bit of 
space
- * rounding up anything cacheable to make sure that can't happen. 
The

- * order of the unadjusted size will still match upon freeing.
- */
-    if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
-    iova_len = roundup_pow_of_two(iova_len);
  dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit);
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index e6e2fa85271c..e62e9e30b30c 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -179,7 +179,7 @@ iova_insert_rbtree(struct rb_root *root, struct 
iova *iova,

  static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
  unsigned long size, unsigned long limit_pfn,
-    struct iova *new, bool size_aligned)
+    struct iova *new, bool size_aligned, bool fast)
  {
  struct rb_node *curr, *prev;
  struct iova *curr_iova;
@@ -188,6 +188,15 @@ static int __alloc_and_insert_iova_range(struct 
iova_domain *iovad,

  unsigned long align_mask = ~0UL;
  unsigned long high_pfn = limit_pfn, low_pfn = iovad->start_pfn;
+    /*
+ * Freeing non-power-of-two-sized allocations back into the IOVA 
caches
+ * will come back to bite us badly, so we have to waste a bit of 
space
+ * rounding up anything cacheable to make sure that can't happen. 
The

+ * order of the unadjusted size will still match upon freeing.
+ */
+    if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+    size = roundup_pow_of_two(size);


If this transformation is only relevant to alloc_iova_fast(), and we 
have to add a special parameter here to tell whether we were called from 
alloc_iova_fast(), doesn't it seem more sensible to just do it in 
alloc_iova_fast() rather than here?


We have the restriction that anything we put in the rcache needs be a 
power-of-2.


So then we have the issue of how to dynamically increase this rcache 
threshold. The problem is that we may have many devices associated with 
the same domain. So, in theory, we can't assume that when we increase 
the threshold that some other device will try to fast free an IOVA which 
was allocated prior to the increase and was not rounded up.


I'm very open to better (or less bad) suggestions on how to do this ...

I could say that we only allow this for a group with a single device, so 
these sort of things don't have to be worried about, but even then the 
iommu_group internals are not readily accessible here.




But then the API itself has no strict requirement that a pfn passed to 
free_iova_fast() wasn't originally allocated with alloc_iova(), so 
arguably hiding the adjustment away makes it less clear that the 
responsibility is really on any caller of free_iova_fast() to make sure 
they don't get things wrong.




alloc_iova() doesn't roundup to pow-of-2, so wouldn't it be broken to do 
that?


Cheers,
John


Re: [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured

2021-03-19 Thread John Garry

On 19/03/2021 13:40, Christoph Hellwig wrote:

On Fri, Mar 19, 2021 at 09:25:42PM +0800, John Garry wrote:

For streaming DMA mappings involving an IOMMU and whose IOVA len regularly
exceeds the IOVA rcache upper limit (meaning that they are not cached),
performance can be reduced.

This is much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry
from last rb tree node if iova search fails"), as discussed at [0].

IOVAs which cannot be cached are highly involved in the IOVA aging issue,
as discussed at [1].


I'm confused.  If this a limit in the IOVA allocator, dma-iommu should
be able to just not grow the allocation so larger without help from
the driver.


This is not an issue with the IOVA allocator.

The issue is with how the IOVA code handles caching of IOVAs. 
Specifically, when we DMA unmap, for an IOVA whose length is above a 
fixed threshold, the IOVA is freed, rather than being cached. See 
free_iova_fast().
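
(For readers following along, roughly what that path does - a simplified
sketch of free_iova_fast()/iova_rcache_insert(), not the verbatim kernel
source:)

/* Simplified: an unmapped IOVA only goes back into the per-CPU cache if
 * its order is below the rcache ceiling; otherwise it is freed to the
 * rbtree, making the next allocation of that size expensive again. */
static void free_iova_fast_sketch(struct iova_domain *iovad,
				  unsigned long pfn, unsigned long size)
{
	if (order_base_2(size) < IOVA_RANGE_CACHE_MAX_SIZE) {
		iova_rcache_insert(iovad, pfn, size);	/* cached */
		return;
	}
	free_iova(iovad, pfn);				/* uncached */
}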


For performance reasons, I want that threshold increased for my driver 
to avail of the caching of all lengths of IOVA which we may see - 
currently we see IOVAs whose length exceeds that threshold. But it may 
not be good to increase that threshold for everyone.


> If contrary to the above description it is device-specific, the driver
> could simply use dma_get_max_seg_size().
> .
>

But that is for a single segment, right? Is there something equivalent 
to tell how many scatter-gather elements we may generate, like 
scsi_host_template.sg_tablesize?
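
(Roughly the bound being hinted at for a SCSI host - illustrative only,
assuming the Scsi_Host fields below describe the limits in question:)

	/* Worst-case single-request mapping length: per-segment limit
	 * multiplied by the number of scatter-gather entries. */
	size_t max_map_len = (size_t)shost->max_segment_size *
			     shost->sg_tablesize;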


Thanks,
John




[PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator

2021-03-19 Thread John Garry
Move the IOVA size power-of-2 rcache roundup into the IOVA allocator.

This is to eventually make it possible to be able to configure the upper
limit of the IOVA rcache range.

Signed-off-by: John Garry 
---
 drivers/iommu/dma-iommu.c |  8 --
 drivers/iommu/iova.c  | 51 ++-
 2 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index af765c813cc8..15b7270a5c2a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -429,14 +429,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain 
*domain,
 
shift = iova_shift(iovad);
iova_len = size >> shift;
-   /*
-* Freeing non-power-of-two-sized allocations back into the IOVA caches
-* will come back to bite us badly, so we have to waste a bit of space
-* rounding up anything cacheable to make sure that can't happen. The
-* order of the unadjusted size will still match upon freeing.
-*/
-   if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
-   iova_len = roundup_pow_of_two(iova_len);
 
dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit);
 
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index e6e2fa85271c..e62e9e30b30c 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -179,7 +179,7 @@ iova_insert_rbtree(struct rb_root *root, struct iova *iova,
 
 static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
unsigned long size, unsigned long limit_pfn,
-   struct iova *new, bool size_aligned)
+   struct iova *new, bool size_aligned, bool fast)
 {
struct rb_node *curr, *prev;
struct iova *curr_iova;
@@ -188,6 +188,15 @@ static int __alloc_and_insert_iova_range(struct 
iova_domain *iovad,
unsigned long align_mask = ~0UL;
unsigned long high_pfn = limit_pfn, low_pfn = iovad->start_pfn;
 
+   /*
+* Freeing non-power-of-two-sized allocations back into the IOVA caches
+* will come back to bite us badly, so we have to waste a bit of space
+* rounding up anything cacheable to make sure that can't happen. The
+* order of the unadjusted size will still match upon freeing.
+*/
+   if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+   size = roundup_pow_of_two(size);
+
if (size_aligned)
align_mask <<= fls_long(size - 1);
 
@@ -288,21 +297,10 @@ void iova_cache_put(void)
 }
 EXPORT_SYMBOL_GPL(iova_cache_put);
 
-/**
- * alloc_iova - allocates an iova
- * @iovad: - iova domain in question
- * @size: - size of page frames to allocate
- * @limit_pfn: - max limit address
- * @size_aligned: - set if size_aligned address range is required
- * This function allocates an iova in the range iovad->start_pfn to limit_pfn,
- * searching top-down from limit_pfn to iovad->start_pfn. If the size_aligned
- * flag is set then the allocated address iova->pfn_lo will be naturally
- * aligned on roundup_power_of_two(size).
- */
-struct iova *
-alloc_iova(struct iova_domain *iovad, unsigned long size,
+static struct iova *
+__alloc_iova(struct iova_domain *iovad, unsigned long size,
unsigned long limit_pfn,
-   bool size_aligned)
+   bool size_aligned, bool fast)
 {
struct iova *new_iova;
int ret;
@@ -312,7 +310,7 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
return NULL;
 
ret = __alloc_and_insert_iova_range(iovad, size, limit_pfn + 1,
-   new_iova, size_aligned);
+   new_iova, size_aligned, fast);
 
if (ret) {
free_iova_mem(new_iova);
@@ -321,6 +319,25 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
 
return new_iova;
 }
+
+/**
+ * alloc_iova - allocates an iova
+ * @iovad: - iova domain in question
+ * @size: - size of page frames to allocate
+ * @limit_pfn: - max limit address
+ * @size_aligned: - set if size_aligned address range is required
+ * This function allocates an iova in the range iovad->start_pfn to limit_pfn,
+ * searching top-down from limit_pfn to iovad->start_pfn. If the size_aligned
+ * flag is set then the allocated address iova->pfn_lo will be naturally
+ * aligned on roundup_power_of_two(size).
+ */
+struct iova *
+alloc_iova(struct iova_domain *iovad, unsigned long size,
+   unsigned long limit_pfn,
+   bool size_aligned)
+{
+   return __alloc_iova(iovad, size, limit_pfn, size_aligned, false);
+}
 EXPORT_SYMBOL_GPL(alloc_iova);
 
 static struct iova *
@@ -433,7 +450,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long 
size,
return iova_pfn;
 
 retry:
-   new_iova = alloc_iova(iovad, size, limit_pfn, true);
+   new_iova = __alloc_iova(iovad, size, limit_pfn, true, true);
if (!new_iova) {
unsigned int cpu;
 
-- 
2.26.2



[PATCH 2/6] iova: Add a per-domain count of reserved nodes

2021-03-19 Thread John Garry
To help learn if the domain has regular IOVA nodes, add a count of
reserved nodes, calculated at init time.

Signed-off-by: John Garry 
---
 drivers/iommu/iova.c | 2 ++
 include/linux/iova.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index e62e9e30b30c..cecc74fb8663 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -717,6 +717,8 @@ reserve_iova(struct iova_domain *iovad,
 * or need to insert remaining non overlap addr range
 */
iova = __insert_new_range(iovad, pfn_lo, pfn_hi);
+   if (iova)
+   iovad->reserved_node_count++;
 finish:
 
 	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
diff --git a/include/linux/iova.h b/include/linux/iova.h
index c834c01c0a5b..fd3217a605b2 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -95,6 +95,7 @@ struct iova_domain {
   flush-queues */
atomic_t fq_timer_on;   /* 1 when timer is active, 0
   when not */
+   int reserved_node_count;
 };
 
 static inline unsigned long iova_size(struct iova *iova)
-- 
2.26.2



[PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size()

2021-03-19 Thread John Garry
Add a function to allow setting the max size for which we want to optimise
DMA mappings.

Signed-off-by: John Garry 
---
 drivers/iommu/dma-iommu.c   |  2 +-
 include/linux/dma-map-ops.h |  1 +
 include/linux/dma-mapping.h |  5 +
 kernel/dma/mapping.c| 11 +++
 4 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index a5dfbd6c0496..d35881fcfb9c 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -447,7 +447,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain 
*domain,
return (dma_addr_t)iova << shift;
 }
 
-__maybe_unused
 static void iommu_dma_set_opt_size(struct device *dev, size_t size)
 {
struct iommu_domain *domain = iommu_get_dma_domain(dev);
@@ -1278,6 +1277,7 @@ static const struct dma_map_ops iommu_dma_ops = {
.map_resource   = iommu_dma_map_resource,
.unmap_resource = iommu_dma_unmap_resource,
.get_merge_boundary = iommu_dma_get_merge_boundary,
+   .set_max_opt_size   = iommu_dma_set_opt_size,
 };
 
 /*
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 51872e736e7b..fed7a183b3b9 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -64,6 +64,7 @@ struct dma_map_ops {
u64 (*get_required_mask)(struct device *dev);
size_t (*max_mapping_size)(struct device *dev);
unsigned long (*get_merge_boundary)(struct device *dev);
+   void (*set_max_opt_size)(struct device *dev, size_t size);
 };
 
 #ifdef CONFIG_DMA_OPS
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 2a984cb4d1e0..91fe770145d4 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -144,6 +144,7 @@ u64 dma_get_required_mask(struct device *dev);
 size_t dma_max_mapping_size(struct device *dev);
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
 unsigned long dma_get_merge_boundary(struct device *dev);
+void dma_set_max_opt_size(struct device *dev, size_t size);
 #else /* CONFIG_HAS_DMA */
 static inline dma_addr_t dma_map_page_attrs(struct device *dev,
struct page *page, size_t offset, size_t size,
@@ -257,6 +258,10 @@ static inline unsigned long dma_get_merge_boundary(struct 
device *dev)
 {
return 0;
 }
+static inline void dma_set_max_opt_size(struct device *dev, size_t size)
+{
+}
+
 #endif /* CONFIG_HAS_DMA */
 
 struct page *dma_alloc_pages(struct device *dev, size_t size,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b6a633679933..59e6acb1c471 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -608,3 +608,14 @@ unsigned long dma_get_merge_boundary(struct device *dev)
return ops->get_merge_boundary(dev);
 }
 EXPORT_SYMBOL_GPL(dma_get_merge_boundary);
+
+void dma_set_max_opt_size(struct device *dev, size_t size)
+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+
+   if (!ops || !ops->set_max_opt_size)
+   return;
+
+   ops->set_max_opt_size(dev, size);
+}
+EXPORT_SYMBOL_GPL(dma_set_max_opt_size);
-- 
2.26.2



[PATCH 4/6] iommu: Add iommu_dma_set_opt_size()

2021-03-19 Thread John Garry
Add a function which allows the max optimised IOMMU DMA size to be set.

Signed-off-by: John Garry 
---
 drivers/iommu/dma-iommu.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 15b7270a5c2a..a5dfbd6c0496 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -447,6 +447,21 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain 
*domain,
return (dma_addr_t)iova << shift;
 }
 
+__maybe_unused
+static void iommu_dma_set_opt_size(struct device *dev, size_t size)
+{
+   struct iommu_domain *domain = iommu_get_dma_domain(dev);
+   struct iommu_dma_cookie *cookie = domain->iova_cookie;
+   struct iova_domain *iovad = &cookie->iovad;
+   unsigned long shift, iova_len;
+
+   shift = iova_shift(iovad);
+   iova_len = size >> shift;
+   iova_len = roundup_pow_of_two(iova_len);
+
+   iova_rcache_set_upper_limit(iovad, iova_len);
+}
+
 static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
dma_addr_t iova, size_t size, struct page *freelist)
 {
-- 
2.26.2



[PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured

2021-03-19 Thread John Garry
For streaming DMA mappings involving an IOMMU and whose IOVA len regularly
exceeds the IOVA rcache upper limit (meaning that they are not cached),
performance can be reduced. 

This is much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry
from last rb tree node if iova search fails"), as discussed at [0].

IOVAs which cannot be cached are highly involved in the IOVA aging issue,
as discussed at [1].

This series attempts to allow the device driver hint what upper limit its
DMA mapping IOVA lengths would be, so that the caching range may be
increased.

Some figures on storage scenario:
v5.12-rc3 baseline: 600K IOPS
With series:1300K IOPS
With reverting 4e89dce72521:1250K IOPS

All above are for IOMMU strict mode. Non-strict mode gives ~1750K IOPS in
all scenarios.

I will say that APIs and their semantics are a bit ropey - any better
ideas welcome...

[0] 
https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] 
https://lore.kernel.org/linux-iommu/1607538189-237944-1-git-send-email-john.ga...@huawei.com/

John Garry (6):
  iommu: Move IOVA power-of-2 roundup into allocator
  iova: Add a per-domain count of reserved nodes
  iova: Allow rcache range upper limit to be configurable
  iommu: Add iommu_dma_set_opt_size()
  dma-mapping/iommu: Add dma_set_max_opt_size()
  scsi: hisi_sas: Set max optimal DMA size for v3 hw

 drivers/iommu/dma-iommu.c  | 23 ---
 drivers/iommu/iova.c   | 88 --
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c |  2 +
 include/linux/dma-map-ops.h|  1 +
 include/linux/dma-mapping.h|  5 ++
 include/linux/iova.h   | 12 +++-
 kernel/dma/mapping.c   | 11 
 7 files changed, 115 insertions(+), 27 deletions(-)

-- 
2.26.2



[PATCH 3/6] iova: Allow rcache range upper limit to be configurable

2021-03-19 Thread John Garry
Some LLDs may request DMA mappings whose IOVA length exceeds that of the
current rcache upper limit.

This means that allocations for those IOVAs will never be cached, and
always must be allocated and freed from the RB tree per DMA mapping cycle.
This has a significant effect on performance, more so since commit
4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search
fails"), as discussed at [0].

Allow the range of cached IOVAs to be increased, by providing an API to set
the upper limit, which is capped at IOVA_RANGE_CACHE_MAX_SIZE.

For simplicity, the full range of IOVA rcaches is allocated and initialized
at IOVA domain init time.

Setting the range upper limit will fail if the domain is already live
(before the tree contains IOVA nodes). This must be done to ensure any
IOVAs cached comply with rule of size being a power-of-2.

[0] 
https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/

Signed-off-by: John Garry 
---
 drivers/iommu/iova.c | 37 +++--
 include/linux/iova.h | 11 ++-
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index cecc74fb8663..d4f2f7fbbd84 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -49,6 +49,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long 
granule,
iovad->flush_cb = NULL;
iovad->fq = NULL;
iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
+   iovad->rcache_max_size = IOVA_RANGE_CACHE_DEFAULT_SIZE;
	rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
	rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
init_iova_rcaches(iovad);
@@ -194,7 +195,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain 
*iovad,
 * rounding up anything cacheable to make sure that can't happen. The
 * order of the unadjusted size will still match upon freeing.
 */
-   if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+   if (fast && size < (1 << (iovad->rcache_max_size - 1)))
size = roundup_pow_of_two(size);
 
if (size_aligned)
@@ -901,7 +902,7 @@ static bool iova_rcache_insert(struct iova_domain *iovad, 
unsigned long pfn,
 {
unsigned int log_size = order_base_2(size);
 
-   if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
+   if (log_size >= iovad->rcache_max_size)
return false;
 
	return __iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn);
@@ -946,6 +947,38 @@ static unsigned long __iova_rcache_get(struct iova_rcache 
*rcache,
return iova_pfn;
 }
 
+void iova_rcache_set_upper_limit(struct iova_domain *iovad,
+unsigned long iova_len)
+{
+   unsigned int rcache_index = order_base_2(iova_len) + 1;
+   struct rb_node *rb_node = &iovad->anchor.node;
+   unsigned long flags;
+   int count = 0;
+
+   spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+   if (rcache_index <= iovad->rcache_max_size)
+   goto out;
+
+   while (1) {
+   rb_node = rb_prev(rb_node);
+   if (!rb_node)
+   break;
+   count++;
+   }
+
+   /*
+* If there are already IOVA nodes present in the tree, then don't
+* allow range upper limit to be set.
+*/
+   if (count != iovad->reserved_node_count)
+   goto out;
+
+   iovad->rcache_max_size = min_t(unsigned long, rcache_index,
+  IOVA_RANGE_CACHE_MAX_SIZE);
+out:
+   spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+}
+
 /*
  * Try to satisfy IOVA allocation range from rcache.  Fail if requested
  * size is too big or the DMA limit we are given isn't satisfied by the
diff --git a/include/linux/iova.h b/include/linux/iova.h
index fd3217a605b2..952b81b54ef7 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -25,7 +25,8 @@ struct iova {
 struct iova_magazine;
 struct iova_cpu_rcache;
 
-#define IOVA_RANGE_CACHE_MAX_SIZE 6/* log of max cached IOVA range size 
(in pages) */
+#define IOVA_RANGE_CACHE_DEFAULT_SIZE 6
+#define IOVA_RANGE_CACHE_MAX_SIZE 10 /* log of max cached IOVA range size (in 
pages) */
 #define MAX_GLOBAL_MAGS 32 /* magazines per bin */
 
 struct iova_rcache {
@@ -74,6 +75,7 @@ struct iova_domain {
unsigned long   start_pfn;  /* Lower limit for this domain */
unsigned long   dma_32bit_pfn;
unsigned long   max32_alloc_size; /* Size of last failed allocation */
+   unsigned long   rcache_max_size; /* Upper limit of cached IOVA RANGE */
struct iova_fq __percpu *fq;/* Flush Queue */
 
atomic64_t  fq_flush_start_cnt; /* Number of TLB flushes that
@@ -158,6 +160,8 @@ int init_iova_flush_queue(struct iova_domain *iovad,
 struct iova *find_

[PATCH 6/6] scsi: hisi_sas: Set max optimal DMA size for v3 hw

2021-03-19 Thread John Garry
For IOMMU strict mode, this more than doubles throughput in some scenarios.

Signed-off-by: John Garry 
---
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c 
b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index 4580e081e489..2f77b418bbeb 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -4684,6 +4684,8 @@ hisi_sas_v3_probe(struct pci_dev *pdev, const struct 
pci_device_id *id)
goto err_out_regions;
}
 
+   dma_set_max_opt_size(dev, PAGE_SIZE * HISI_SAS_SGE_PAGE_CNT);
+
shost = hisi_sas_shost_alloc_pci(pdev);
if (!shost) {
rc = -ENOMEM;
-- 
2.26.2



Re: [PATCH] scsi: libsas: Reset num_scatter if libata mark qc as NODATA

2021-03-18 Thread John Garry

On 18/03/2021 00:24, Jolly Shah wrote:

Hi John,

Thanks for the review.


On Wed, Mar 17, 2021 at 4:44 AM John Garry  wrote:


On 16/03/2021 19:39, Jolly Shah wrote:

When the cache_type for the scsi device is changed, the scsi layer
issues a MODE_SELECT command. The caching mode details are communicated
via a request buffer associated with the scsi command with data
direction set as DMA_TO_DEVICE (scsi_mode_select). When this command
reaches the libata layer, as a part of generic initial setup, libata
layer sets up the scatterlist for the command using the scsi command
(ata_scsi_qc_new). This command is then translated by the libata layer
into ATA_CMD_SET_FEATURES (ata_scsi_mode_select_xlat). The libata layer
treats this as a non data command (ata_mselect_caching), since it only
needs an ata taskfile to pass the caching on/off information to the
device. It does not need the scatterlist that has been setup, so it does
not perform dma_map_sg on the scatterlist (ata_qc_issue).


So if we don't perform the dma_map_sg() on the sgl at this point, then
it seems to me that we should not perform for_each_sg() on it either,
right? That is still what happens in sas_ata_qc_issue() in this case.



Hi Jolly Shah,



Yes, that's right. To avoid that, I can add an else-if block for
ATA_PROT_NODATA before for_each_sg() is performed. Currently there's an
existing code block for ATA_PROT_NODATA after for_each_sg() is
performed, and I reused that to reset num_scatter. Please suggest.



How about just combine the 2x if-else statements into 1x if-elif-else 
statement, like:



if (ata_is_atapi(qc->tf.protocol)) {
memcpy(task->ata_task.atapi_packet, qc->cdb, qc->dev->cdb_len);
task->total_xfer_len = qc->nbytes;
task->num_scatter = qc->n_elem;
task->data_dir = qc->dma_dir;
} else if (qc->tf.protocol == ATA_PROT_NODATA) {
task->data_dir = DMA_NONE;
} else {
for_each_sg(qc->sg, sg, qc->n_elem, si)
xfer += sg_dma_len(sg);

task->total_xfer_len = xfer;
task->num_scatter = si;
task->data_dir = qc->dma_dir;
}


Unfortunately,
when this command reaches the libsas layer(sas_ata_qc_issue), libsas
layer sees it as a non data command with a scatterlist. It cannot
extract the correct dma length, since the scatterlist has not been
mapped with dma_map_sg for a DMA operation. When this partially
constructed SAS task reaches pm80xx LLDD, it results in below warning.

"pm80xx_chip_sata_req 6058: The sg list address
start_addr=0x data_len=0x0end_addr_high=0x
end_addr_low=0x has crossed 4G boundary"

This patch assigns an appropriate value to num_scatter for ATA non-data
commands.

Signed-off-by: Jolly Shah 
---
   drivers/scsi/libsas/sas_ata.c | 6 --
   1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index 024e5a550759..94ec08cebbaa 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -209,10 +209,12 @@ static unsigned int sas_ata_qc_issue(struct 
ata_queued_cmd *qc)
   task->num_scatter = si;
   }

- if (qc->tf.protocol == ATA_PROT_NODATA)
+ if (qc->tf.protocol == ATA_PROT_NODATA) {
   task->data_dir = DMA_NONE;
- else
+ task->num_scatter = 0;


task->num_scatter has already been set in this function. Best not set it
twice.



Sure. Please suggest if I should update patch to above suggested
approach. That will avoid setting num_scatter twice.



Thanks,
John

BTW, could we add a fixes tag?


Re: [PATCH 2/2] iommu/iova: Improve restart logic

2021-03-18 Thread John Garry


Well yeah, in your particular case you're allocating from a heavily 
over-contended address space, so much of the time it is genuinely full. 
Plus you're primarily churning one or two sizes of IOVA, so there's a 
high chance that you will either allocate immediately from the cached 
node (after a previous free), or search the whole space and fail. In 
case it was missed, searching only some arbitrary subset of the space 
before giving up is not a good behaviour for an allocator to have in 
general.


So since the retry means that we search through the complete pfn 
range most of the time (due to poor success rate), we should be able 
to do a better job at maintaining an accurate max alloc size, by 
calculating it from the range search, and not relying on max alloc 
failed or resetting it frequently. Hopefully that would mean that 
we're smarter about not trying the allocation.


So I tried that out and we seem to be able to scrape back an 
appreciable amount of performance. Maybe 80% of original, with 
another change, below.


TBH if you really want to make allocation more efficient I think there 
are more radical changes that would be worth experimenting with, like 
using some form of augmented rbtree to also encode the amount of free 
space under each branch, or representing the free space in its own 
parallel tree, or whether some other structure entirely might be a 
better bet these days.
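
(An editorial aside on the augmented-rbtree idea just mentioned - a
hypothetical sketch with made-up names, not something proposed in this
thread:)

/* Each node additionally records the largest free gap anywhere in its
 * subtree, so an allocation walk can skip subtrees that cannot fit. */
struct iova_aug {
	struct rb_node	node;
	unsigned long	pfn_lo, pfn_hi;
	unsigned long	max_gap;	/* maintained by rbtree augment callbacks */
};

static bool subtree_can_fit(struct rb_node *rb, unsigned long size)
{
	return rb && rb_entry(rb, struct iova_aug, node)->max_gap >= size;
}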


And if you just want to make your thing acceptably fast, now I'm going 
to say stick a quirk somewhere to force the "forcedac" option on your 
platform ;)




Easier said than done :)

But still, I'd like to just be able to cache all IOVA sizes for my DMA 
engine, so we should not have to go near the RB tree often.


I have put together a series to allow the upper limit of the rcache range 
to be increased per domain. So naturally that gives better performance than 
we originally had.


I don't want to prejudice the solution by saying what I think of it now, 
so will send it out...




[...]
@@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct 
iova_domain *iovad,

  if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
  high_pfn = limit_pfn;
  low_pfn = retry_pfn;
-    curr = &iovad->anchor.node;
+    curr = iova_find_limit(iovad, limit_pfn);



I see that it is now applied. However, alternatively could we just add 
a zero-length 32b boundary marker node for the 32b pfn restart point?


That would need special cases all over the place to prevent the marker 
getting merged into reservations or hit by lookups, and at worst break 
the ordering of the tree if a legitimate node straddles the boundary. I 
did consider having the insert/delete routines keep track of yet another 
cached node for whatever's currently the first thing above the 32-bit 
boundary, but I was worried that might be a bit too invasive.


Yeah, I did think of that. I don't think that it would have too much 
overhead.




FWIW I'm currently planning to come back to this again when I have a bit 
more time, since the optimum thing to do (modulo replacing the entire 
algorithm...) is actually to make the second part of the search 
*upwards* from the cached node to the limit. Furthermore, to revive my 
arch/arm conversion I think we're realistically going to need a 
compatibility option for bottom-up allocation to avoid too many nasty 
surprises, so I'd like to generalise things to tackle both concerns at 
once.




Thanks,
John



Re: [PATCH 2/2] iommu/iova: Improve restart logic

2021-03-18 Thread John Garry

On 10/03/2021 17:47, John Garry wrote:

On 09/03/2021 15:55, John Garry wrote:

On 05/03/2021 16:35, Robin Murphy wrote:

Hi Robin,


When restarting after searching below the cached node fails, resetting
the start point to the anchor node is often overly pessimistic. If
allocations are made with mixed limits - particularly in the case of the
opportunistic 32-bit allocation for PCI devices - this could mean
significant time wasted walking through the whole populated upper range
just to reach the initial limit. 


Right


We can improve on that by implementing
a proper tree traversal to find the first node above the relevant limit,
and set the exact start point.


Thanks for this. However, as mentioned in the other thread, this does 
not help our performance regression Re: commit 4e89dce72521.


And mentioning this "retry" approach again, in case it was missed, 
from my experiment on the affected HW, it also has generally a 
dreadfully low success rate - less than 1% for the 32b range retry. 
Retry rate is about 20%. So I am saying from this 20%, less than 1% of 
those succeed.




So since the retry means that we search through the complete pfn range 
most of the time (due to poor success rate), we should be able to do a 
better job at maintaining an accurate max alloc size, by calculating it 
from the range search, and not relying on max alloc failed or resetting 
it frequently. Hopefully that would mean that we're smarter about not 
trying the allocation.


So I tried that out and we seem to be able to claw back an appreciable 
amount of performance. Maybe 80% of original, with another change, 
below.


Thanks,
John








Signed-off-by: Robin Murphy 
---
  drivers/iommu/iova.c | 39 ++-
  1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index c28003e1d2ee..471c48dd71e7 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -154,6 +154,43 @@ __cached_rbnode_delete_update(struct iova_domain 
*iovad, struct iova *free)

  iovad->cached_node = rb_next(&free->node);
  }
+static struct rb_node *iova_find_limit(struct iova_domain *iovad, 
unsigned long limit_pfn)

+{
+    struct rb_node *node, *next;
+    /*
+ * Ideally what we'd like to judge here is whether limit_pfn is 
close
+ * enough to the highest-allocated IOVA that starting the 
allocation
+ * walk from the anchor node will be quicker than this initial 
work to
+ * find an exact starting point (especially if that ends up 
being the
+ * anchor node anyway). This is an incredibly crude 
approximation which
+ * only really helps the most likely case, but is at least 
trivially easy.

+ */
+    if (limit_pfn > iovad->dma_32bit_pfn)
+    return &iovad->anchor.node;
+
+    node = iovad->rbroot.rb_node;
+    while (to_iova(node)->pfn_hi < limit_pfn)
+    node = node->rb_right;
+
+search_left:
+    while (node->rb_left && to_iova(node->rb_left)->pfn_lo >= 
limit_pfn)

+    node = node->rb_left;
+
+    if (!node->rb_left)
+    return node;
+
+    next = node->rb_left;
+    while (next->rb_right) {
+    next = next->rb_right;
+    if (to_iova(next)->pfn_lo >= limit_pfn) {
+    node = next;
+    goto search_left;
+    }
+    }
+
+    return node;
+}
+
  /* Insert the iova into domain rbtree by holding writer lock */
  static void
  iova_insert_rbtree(struct rb_root *root, struct iova *iova,
@@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct 
iova_domain *iovad,

  if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
  high_pfn = limit_pfn;
  low_pfn = retry_pfn;
-    curr = &iovad->anchor.node;
+    curr = iova_find_limit(iovad, limit_pfn);



I see that it is now applied. However, alternatively could we just add a 
zero-length 32b boundary marker node for the 32b pfn restart point?



  curr_iova = to_iova(curr);
  goto retry;
  }









Re: [PATCH] scsi: libsas: Reset num_scatter if libata mark qc as NODATA

2021-03-17 Thread John Garry

On 16/03/2021 19:39, Jolly Shah wrote:

When the cache_type for the scsi device is changed, the scsi layer
issues a MODE_SELECT command. The caching mode details are communicated
via a request buffer associated with the scsi command with data
direction set as DMA_TO_DEVICE (scsi_mode_select). When this command
reaches the libata layer, as a part of generic initial setup, libata
layer sets up the scatterlist for the command using the scsi command
(ata_scsi_qc_new). This command is then translated by the libata layer
into ATA_CMD_SET_FEATURES (ata_scsi_mode_select_xlat). The libata layer
treats this as a non data command (ata_mselect_caching), since it only
needs an ata taskfile to pass the caching on/off information to the
device. It does not need the scatterlist that has been setup, so it does
not perform dma_map_sg on the scatterlist (ata_qc_issue). 


So if we don't perform the dma_map_sg() on the sgl at this point, then 
it seems to me that we should not perform for_each_sg() on it either, 
right? That is still what happens in sas_ata_qc_issue() in this case.
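
Something along these lines would avoid it - a rough sketch only, 
slotting in after the existing ata_is_atapi() branch and relying on 
sas_alloc_task() having zeroed the task:

	} else if (qc->tf.protocol == ATA_PROT_NODATA) {
		/* non-data: never derive num_scatter/total_xfer_len
		 * from the unmapped sgl */
		task->data_dir = DMA_NONE;
	} else {
		for_each_sg(qc->sg, sg, qc->n_elem, si)
			xfer += sg_dma_len(sg);

		task->total_xfer_len = xfer;
		task->num_scatter = si;
		task->data_dir = qc->dma_dir;
	}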



Unfortunately,
when this command reaches the libsas layer(sas_ata_qc_issue), libsas
layer sees it as a non data command with a scatterlist. It cannot
extract the correct dma length, since the scatterlist has not been
mapped with dma_map_sg for a DMA operation. When this partially
constructed SAS task reaches pm80xx LLDD, it results in below warning.

"pm80xx_chip_sata_req 6058: The sg list address
start_addr=0x data_len=0x0end_addr_high=0x
end_addr_low=0x has crossed 4G boundary"

This patch assigns appropriate value to  num_sectors for ata non data
commands.

Signed-off-by: Jolly Shah 
---
  drivers/scsi/libsas/sas_ata.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index 024e5a550759..94ec08cebbaa 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -209,10 +209,12 @@ static unsigned int sas_ata_qc_issue(struct 
ata_queued_cmd *qc)
task->num_scatter = si;
}
  
-	if (qc->tf.protocol == ATA_PROT_NODATA)

+   if (qc->tf.protocol == ATA_PROT_NODATA) {
task->data_dir = DMA_NONE;
-   else
+   task->num_scatter = 0;


task->num_scatter has already been set in this function. Best not set it 
twice.


Thanks,
John


+   } else {
task->data_dir = qc->dma_dir;
+   }
task->scatter = qc->sg;
task->ata_task.retry_count = 1;
task->task_state_flags = SAS_TASK_STATE_PENDING;





Re: [RFC PATCH v3 2/3] blk-mq: Freeze and quiesce all queues for tagset in elevator_exit()

2021-03-16 Thread John Garry

On 16/03/2021 17:00, Bart Van Assche wrote:

On 3/16/21 9:15 AM, John Garry wrote:

I'll have a look at this ASAP -  a bit busy.

But a quick scan and I notice this:

 > @@ -226,6 +226,7 @@ static inline void 
__blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,

 >  struct request *rq)
 >   {
 >   blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
 > +    rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);

Wasn't a requirement to not touch the fastpath at all, including even 
if only NULLifying a pointer?


IIRC, Kashyap some time ago had a patch like above (but without RCU 
usage), but the request from Jens was to not touch the fastpath.


Maybe I'm mistaken - I will try to dig up the thread.




Hi Bart,



I agree that Jens asked at the end of 2018 not to touch the fast path to 
fix this use-after-free (maybe that request has been repeated more 
recently). If Jens or anyone else feels strongly about not clearing 
hctx->tags->rqs[rq->tag] from the fast path then I will make that 
change. 


Is that possible for this same approach? I need to check the code more..

And don't we still have the problem that some iter callbacks may 
sleep/block, which is not allowed in an RCU read-side critical section?


My motivation for clearing these pointers from the fast path is 
as follows:

- This results in code that is easier to read and easier to maintain.
- Every modern CPU pipelines store instructions so the performance 
impact of adding an additional store should be small.
- Since the block layer has a tendency to reuse tags that have been 
freed recently, it is likely that hctx->tags->rqs[rq->tag] will be used 
for a next request and hence that it will have to be loaded into the CPU 
cache anyway.




Those points make sense to me, but obviously it's the maintainers call.

Thanks,
john



Re: [RFC PATCH v3 2/3] blk-mq: Freeze and quiesce all queues for tagset in elevator_exit()

2021-03-16 Thread John Garry

Hi Bart,

I'll have a look at this ASAP -  a bit busy.

But a quick scan and I notice this:

> @@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct 
blk_mq_hw_ctx *hctx,

>   struct request *rq)
>   {
>blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
> +  rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);

Wasn't a requirement to not touch the fastpath at all, including even if 
only NULLifying a pointer?


IIRC, Kashyap some time ago had a patch like above (but without RCU 
usage), but the request from Jens was to not touch the fastpath.


Maybe I'm mistaken - I will try to dig up the thread.

Thanks,
John


How about replacing the entire patch series with the patch below? The
patch below has the following advantages:
- Instead of making the race window smaller, the race is fixed
   completely.
- No new atomic instructions are added to the block layer code.
- No early return is inserted in blk_mq_tagset_busy_iter().

Thanks,

Bart.

 From a0e534012a766bd6e53cdd466eec0a811164c12a Mon Sep 17 00:00:00 2001
From: Bart Van Assche
Date: Wed, 10 Mar 2021 19:11:47 -0800
Subject: [PATCH] blk-mq: Fix races between iterating over requests and freeing
  requests

Multiple users have reported use-after-free complaints similar to the
following (see 
also https://lore.kernel.org/linux-block/1545261885.185366.488.ca...@acm.org/):

BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
Read of size 8 at addr 88803b335240 by task fio/21412

CPU: 0 PID: 21412 Comm: fio Tainted: GW 4.20.0-rc6-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
  dump_stack+0x86/0xca
  print_address_description+0x71/0x239
  kasan_report.cold.5+0x242/0x301
  __asan_load8+0x54/0x90
  bt_iter+0x86/0xf0
  blk_mq_queue_tag_busy_iter+0x373/0x5e0
  blk_mq_in_flight+0x96/0xb0
  part_in_flight+0x40/0x140
  part_round_stats+0x18e/0x370
  blk_account_io_start+0x3d7/0x670
  blk_mq_bio_to_request+0x19c/0x3a0
  blk_mq_make_request+0x7a9/0xcb0
  generic_make_request+0x41d/0x960
  submit_bio+0x9b/0x250
  do_blockdev_direct_IO+0x435c/0x4c70
  __blockdev_direct_IO+0x79/0x88
  ext4_direct_IO+0x46c/0xc00
  generic_file_direct_write+0x119/0x210
  __generic_file_write_iter+0x11c/0x280
  ext4_file_write_iter+0x1b8/0x6f0
  aio_write+0x204/0x310
  io_submit_one+0x9d3/0xe80
  __x64_sys_io_submit+0x115/0x340
  do_syscall_64+0x71/0x210

When multiple request queues share a tag set and when switching the I/O
scheduler for one of the request queues that uses this tag set, the
following race can happen:
* blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns
   a pointer to a scheduler request to the local variable 'rq'.
* blk_mq_sched_free_requests() is called to free hctx->sched_tags.
* blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free.

Fix this race as follows:
* Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[].
* Protect hctx->tags->rqs[] reads with an RCU read-side lock.
* No new rcu_barrier() call has been added because clearing the request
   pointer in hctx->tags->rqs[] happens before blk_queue_exit() and the
   blk_freeze_queue() call in blk_cleanup_queue() triggers an RCU barrier
   after all scheduler request pointers associated with a request queue
   have been removed from hctx->tags->rqs[] and before these scheduler
   requests are freed.

Signed-off-by: Bart Van Assche
---
  block/blk-mq-tag.c | 27 +--
  block/blk-mq-tag.h |  2 +-
  block/blk-mq.c | 10 ++
  block/blk-mq.h |  1 +
  4 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9c92053e704d..8351c3f2fe2d 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -206,18 +206,23 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int 
bitnr, void *data)
struct blk_mq_tags *tags = hctx->tags;
bool reserved = iter_data->reserved;
struct request *rq;
+   bool res = true;

if (!reserved)
bitnr += tags->nr_reserved_tags;
-   rq = tags->rqs[bitnr];
+
+   rcu_read_lock();
+   rq = rcu_dereference(tags->rqs[bitnr]);

/*
 * We can hit rq == NULL here, because the tagging functions
 * test and set the bit before assigning ->rqs[].
 */
if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
-   return iter_data->fn(hctx, rq, iter_data->data, reserved);
-   return true;
+   res = iter_data->fn(hctx, rq, iter_data->data, reserved);
+   rcu_read_unlock();
+
+   return res;
  }

  /**
@@ -264,10 +269,12 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned 
int bitnr, void *data)
struct blk_mq_tags *tags = iter_data->tags;
bool reserved = iter_data->flags & BT_TAG_ITER_RESERVED;
struct request *rq;
+   bool res = true;

if (!reserved)

Re: [PATCH] lib/logic_pio: Fix overlap check for pio registery

2021-03-15 Thread John Garry

On 18/01/2021 01:27, Jiahui Cen wrote:

+

Hi Arnd, xu wei,

Any chance we could pick up this patch via arm-soc tree?

The series which I originally included in, below, is stalled a bit.

Thanks,
John

https://lore.kernel.org/lkml/1610729929-188490-1-git-send-email-john.ga...@huawei.com/


Hi John,

On 2021/1/15 18:10, John Garry wrote:

On 21/12/2020 13:04, Jiahui Cen wrote:

On 21/12/2020 03:24, Jiahui Cen wrote:

Hi John,

On 2020/12/18 18:40, John Garry wrote:

On 18/12/2020 06:23, Jiahui Cen wrote:

Since the [start, end) is a half-open interval, a range with the end equal
to the start of another range should not be considered as overlapped.

Signed-off-by: Jiahui Cen
---
     lib/logic_pio.c | 2 +-
     1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/logic_pio.c b/lib/logic_pio.c
index f32fe481b492..445d611f1dc1 100644
--- a/lib/logic_pio.c
+++ b/lib/logic_pio.c
@@ -57,7 +57,7 @@ int logic_pio_register_range(struct logic_pio_hwaddr 
*new_range)
     new_range->flags == LOGIC_PIO_CPU_MMIO) {
     /* for MMIO ranges we need to check for overlap */
     if (start >= range->hw_start + range->size ||
-    end < range->hw_start) {
+    end <= range->hw_start) {
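
(For illustration, with half-open [start, end) windows, two adjacent 
ranges share only the boundary value and must not count as overlapping - 
a standalone sketch of the predicate, values made up:)

/* half-open windows [start, start + size) */
static bool mmio_ranges_overlap(resource_size_t a_start, resource_size_t a_size,
				resource_size_t b_start, resource_size_t b_size)
{
	return a_start < b_start + b_size && b_start < a_start + a_size;
}

/* e.g. [0x0, 0x4000) and [0x4000, 0x8000): not overlapping, so the second
 * window can now be registered even though it starts exactly where the
 * first one ends */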

It looks like your change is correct, but should not really have an impact in 
practice since:
a: BIOSes generally list ascending IO port CPU addresses
b. there is space between IO port CPU address regions

Have you seen a problem here?


No serious problem. I found it just when I was working on adding support of
pci expander bridge for Arm in QEMU. I found the IO window of some extended
root bus could not be registered when I inserted the extended buses' _CRS
info into DSDT table in the x86 way, which does not sort the buses.

Though root buses should be sorted in QEMU, would it be better to accept
those non-ascending IO windows?


ok, so it seems that you have seen a real problem, and this issue is not just 
detected by code analysis.


BTW, for b, it seems to be no space between IO windows of different root buses
generated by EDK2. Or maybe I missed something obvious.

I don't know about that. Anyway, your change looks ok.

Reviewed-by: John Garry

BTW, for your virt env, will there be requirement to unregister PCI MMIO 
ranges? Currently we don't see that in non-virt world.


Thanks for your review.

And currently there is no such a requirement in my virt env.



I am not sure what happened to this patch, but I plan on sending some patches 
in this area soon - do you want me to include this one?



Sorry for replying late. It's appreciated if you can include this patch.

Thanks!
Jiahui


Thanks,
John

.

.





Re: arm64 syzbot instances

2021-03-15 Thread John Garry

On 15/03/2021 10:01, Dmitry Vyukov wrote:

On Mon, Mar 15, 2021 at 10:45 AM John Garry  wrote:

It does not happen too often on syzbot so far, so let's try to do the
right thing first.
I've filed:https://bugs.launchpad.net/qemu/+bug/1918917
with a link to this thread. To be fair, I don't fully understand what
I am talking about, I hope I proxied your description properly.

Thanks, looks good. I provided a little more detail in a comment there.

  Arnd
.


  From looking at the bug report, my impression is that this is a qemu
issue, as the logical IO space is mapped to the PCI host bridge IO
space, and qemu does not handle accesses to that CPU addressable region
at all. As Arnd said.

However, we really should not be accessing logical IO ports 0 or 0x2f8
at all via ttyS3 if not enumerated from PCI device at that logical IO
port. That is what I think anyway, as who knows what device - if any -
really exists at that location. That is why I had this patch to just
stop accesses to legacy IO port regions on arm64:

https://lore.kernel.org/lkml/1610729929-188490-2-git-send-email-john.ga...@huawei.com/

Hi John,

Thanks for the info.

The patch is from January, but it's not merged yet, right?
It will fix the crash we see, right?
.


It's not merged, and it probably would solve this issue. But following 
discussion with Arnd when it was originally posted, I still need to do 
some analysis of whether it is the proper thing to do.


However, as mentioned, the fundamental issue looks like qemu IO port 
access, so it would be good to check that first.


On a related topic, I will cc colleague Jiahui Cen, who I think was 
doing some work on arm qemu support in a related area, so may share some 
experience here.


Jiahui Cen did have a patch to fix logical PIO code from this work [0], 
which is not merged, but I don't think it would help here. I will cc you on it.


Thanks,
John

[0] 
https://lore.kernel.org/lkml/006ad6ce-d6b2-59cb-8209-aca3f6e53...@huawei.com/


Re: arm64 syzbot instances

2021-03-15 Thread John Garry

On 12/03/2021 10:52, Arnd Bergmann wrote:

On Fri, Mar 12, 2021 at 11:38 AM Dmitry Vyukov  wrote:

On Fri, Mar 12, 2021 at 11:11 AM Arnd Bergmann  wrote:

It does not happen too often on syzbot so far, so let's try to do the
right thing first.
I've filed: https://bugs.launchpad.net/qemu/+bug/1918917
with a link to this thread. To be fair, I don't fully understand what
I am talking about, I hope I proxied your description properly.


Thanks, looks good. I provided a little more detail in a comment there.

 Arnd
.



From looking at the bug report, my impression is that this is a qemu 
issue, as the logical IO space is mapped to the PCI host bridge IO 
space, and qemu does not handle accesses to that CPU addressable region 
at all. As Arnd said.


However, we really should not be accessing logical IO ports 0 or 0x2f8 
at all via ttyS3 if not enumerated from PCI device at that logical IO 
port. That is what I think anyway, as who knows what device - if any - 
really exists at that location. That is why I had this patch to just 
stop accesses to legacy IO port regions on arm64:


https://lore.kernel.org/lkml/1610729929-188490-2-git-send-email-john.ga...@huawei.com/

Thanks,
John


Re: [PATCH] iommu/dma: Resurrect the "forcedac" option

2021-03-11 Thread John Garry

On 05/03/2021 16:32, Robin Murphy wrote:

In converting intel-iommu over to the common IOMMU DMA ops, it quietly
lost the functionality of its "forcedac" option. Since this is a handy
thing both for testing and for performance optimisation on certain
platforms, reimplement it under the common IOMMU parameter namespace.

For the sake of fixing the inadvertent breakage of the Intel-specific
parameter, remove the dmar_forcedac remnants and hook it up as an alias
while documenting the transition to the new common parameter.

Fixes: c588072bba6b ("iommu/vt-d: Convert intel iommu driver to the iommu ops")
Signed-off-by: Robin Murphy 


If it's worth anything:
Reviewed-by: John Garry 


---
  Documentation/admin-guide/kernel-parameters.txt | 15 ---
  drivers/iommu/dma-iommu.c   | 13 -
  drivers/iommu/intel/iommu.c |  5 ++---
  include/linux/dma-iommu.h   |  2 ++
  4 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..835f810f2f26 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1869,13 +1869,6 @@
bypassed by not enabling DMAR with this option. In
this case, gfx device will use physical address for
DMA.
-   forcedac [X86-64]
-   With this option iommu will not optimize to look
-   for io virtual address below 32-bit forcing dual
-   address cycle on pci bus for cards supporting greater
-   than 32-bit addressing. The default is to look
-   for translation below 32-bit and if not available
-   then look in the higher range.
strict [Default Off]
With this option on every unmap_single operation will
result in a hardware IOTLB flush operation as opposed
@@ -1964,6 +1957,14 @@
nobypass[PPC/POWERNV]
Disable IOMMU bypass, using IOMMU for PCI devices.
  
+	iommu.forcedac=	[ARM64, X86] Control IOVA allocation for PCI devices.

+   Format: { "0" | "1" }
+   0 - Try to allocate a 32-bit DMA address first, before
+ falling back to the full range if needed.
+   1 - Allocate directly from the full usable range,
+ forcing Dual Address Cycle for PCI cards supporting
+ greater than 32-bit addressing.
+
iommu.strict=   [ARM64] Configure TLB invalidation behaviour
Format: { "0" | "1" }
0 - Lazy mode.
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 9ab6ee22c110..260bf3de1992 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -52,6 +52,17 @@ struct iommu_dma_cookie {
  };
  
  static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled);

+bool iommu_dma_forcedac __read_mostly;
+
+static int __init iommu_dma_forcedac_setup(char *str)
+{
+   int ret = kstrtobool(str, &iommu_dma_forcedac);
+
+   if (!ret && iommu_dma_forcedac)
+   pr_info("Forcing DAC for PCI devices\n");


I seem to remember some disagreement on this sort of print some other 
time :)



+   return ret;
+}
+early_param("iommu.forcedac", iommu_dma_forcedac_setup);
  
  void iommu_dma_free_cpu_cached_iovas(unsigned int cpu,

struct iommu_domain *domain)
@@ -438,7 +449,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain 
*domain,
dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end);


Re: [PATCH 1/5] perf metricgroup: Support printing metrics for arm64

2021-03-11 Thread John Garry

On 06/03/2021 19:34, Jiri Olsa wrote:

On Fri, Mar 05, 2021 at 11:06:58AM +, John Garry wrote:

Hi Jirka,


-   struct pmu_events_map *map = perf_pmu__find_map(NULL);
+   struct pmu_events_map *map = find_cpumap();

so this is just for arm at the moment right?


Yes - but to be more accurate, arm64.

At the moment, from the archs which use pmu-events, only arm64 and nds32
have versions of get_cpuid_str() which require a non-NULL pmu argument.

But then apparently nds32 only supports a single CPU, so this issue of
heterogeneous CPUs should not be a concern there:)


could we rather make this arch specific code, so we don't need
to do the scanning on archs where this is not needed?

like marking perf_pmu__find_map as __weak and add arm specific
version?

Well I was thinking that this code should not be in metricgroup.c anyway.

So there is code which is common in current perf_pmu__find_map() for all
archs.

I could factor that out into a common function, below. Just a bit worried
about perf_pmu__find_map() and perf_pmu__find_pmu_map() being confused.

right, so perf_pmu__find_map does not take perf_pmu as argument
anymore, so the prefix does not fit, how about pmu_events_map__find ?


I just noticed this series:
https://lore.kernel.org/lkml/1612797946-18784-1-git-send-email-kan.li...@linux.intel.com/

Seems that this has metricgroup support for heterogeneous system config, 
while this series is metricgroup support for homogeneous system config 
for arch which supports heterogeneous system config. I need to check 
further for any conflicts.


@Kan Liang, it would be great if you could cc me on that series. I don't 
subscribe to the general list.


Thanks,
John


Re: [RFC PATCH v3 2/3] blk-mq: Freeze and quiesce all queues for tagset in elevator_exit()

2021-03-11 Thread John Garry

On 11/03/2021 00:58, Ming Lei wrote:

Indeed, blk_mq_queue_tag_busy_iter() already does take a reference to its
queue usage counter when called, and the queue cannot be frozen to switch
IO scheduler until all refs are dropped. This ensures no stale references
to IO scheduler requests will be seen by blk_mq_queue_tag_busy_iter().

However, there is nothing to stop blk_mq_queue_tag_busy_iter() being
run for another queue associated with the same tagset, and it seeing
a stale IO scheduler request from the other queue after they are freed.

To stop this happening, freeze and quiesce all queues associated with the
tagset as the elevator is exited.

I think this way can't be accepted since switching one queue's scheduler
is nothing to do with other request queues attached to same HBA.

This patch will cause performance regression because userspace may
switch scheduler according to medium or workloads, at that time other
LUNs will be affected by this patch.


Hmmm..that was my concern also. Do you think that it may cause a big 
impact? Depends totally on the workload, I suppose.


FWIW, it is useful though for solving both iterator problems.

Thanks,
John


Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"

2021-03-10 Thread John Garry

On 08/03/2021 16:22, John Garry wrote:




While max32_alloc_size indirectly tracks the largest *contiguous* 
available space, one of the ideas from which it grew was to simply keep

count of the total number of free PFNs. If you're really spending
significant time determining that the tree is full, as opposed to just
taking longer to eventually succeed, then it might be relatively
innocuous to tack on that semi-redundant extra accounting as a
self-contained quick fix for that worst case.



...


So since the retry means that we search through the complete pfn range 
most of the time (due to poor success rate quoted), we should be able to 
do a better job at maintaining an accurate max alloc size, by 
calculating it during the failed search, and not relying on max alloc 
failed or resetting it frequently. Hopefully that would mean that we're 
smarter about quickly failing the allocation.
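
Roughly along these lines - a sketch only, with a hypothetical helper; in 
practice the bookkeeping would be folded into the failed walk itself 
rather than being a separate pass:

static unsigned long iova_largest_gap_below(struct iova_domain *iovad,
					    unsigned long limit_pfn)
{
	struct rb_node *node;
	unsigned long prev_hi = iovad->start_pfn - 1;
	unsigned long largest = 0;

	for (node = rb_first(&iovad->rbroot); node; node = rb_next(node)) {
		struct iova *iova = rb_entry(node, struct iova, node);

		if (iova->pfn_lo > limit_pfn)
			break;
		largest = max(largest, iova->pfn_lo - prev_hi - 1);
		prev_hi = iova->pfn_hi;
	}

	return largest;
}

A failed 32b allocation would then set max32_alloc_size from the returned 
value, rather than relying on the alloc-failed/reset heuristics.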


I'll further look at that.

Thanks,
John


Re: [RFC PATCH v3 3/3] blk-mq: Lockout tagset iterator when exiting elevator

2021-03-10 Thread John Garry

On 10/03/2021 16:00, Bart Van Assche wrote:
So I can incorporate any changes and suggestions so far and send a 
non-RFC version - that may get more attention if none extra comes.


As mentioned on the cover letter, if patch 2+3/3 are accepted, then 
patch 1/3 could be simplified. But I plan to leave as is.


BTW, any issue with putting your suggested-by on patch 2/3?




Hi Bart,



I have added my Reviewed-by to patch 2/3.



OK, thanks.

Please note that I still want to check further whether some of Ming's 
series "blk-mq: implement queue quiesce via percpu_ref for 
BLK_MQ_F_BLOCKING" can be used.


Regarding the other two patches in this series: I do not agree with 
patch 3/3. As I have explained, I am concerned that that patch breaks 
existing block drivers.


Understood. I need to check your concern further to allay any fears.

So I could probably change that patch to drop the early return.

Instead we just need to ensure that we complete any existing calls to 
blk_mq_tagset_busy_iter() prior to freeing the IO scheduler requests. 
Then we don't need to return early and can iter as before - but, as I 
said previously, there should be no active tags to iter.




Are patches 1/3 and 3/3 necessary? Or in other words, is patch 2/3 
sufficient to fix the use-after-free?


No, we need them all in some form.

So far, reports are that 1/3 solves the most common seen UAF. It is 
pretty easy to trigger.


But the scenarios associated with 2/3 and 3/3 are much harder to 
trigger, and I needed to add delays in the code just to trigger them.


Thanks,
John


Re: [RFC PATCH v3 3/3] blk-mq: Lockout tagset iterator when exiting elevator

2021-03-10 Thread John Garry

On 09/03/2021 19:21, Bart Van Assche wrote:

On 3/9/21 9:47 AM, John Garry wrote:

This does fall over if some tags are allocated without associated
request queue, which I do not know exists.




Hi Bart,


The only tag allocation mechanism I know of is blk_mq_get_tag(). The
only blk_mq_get_tag() callers I know of are __blk_mq_alloc_request() and
blk_mq_alloc_request_hctx(). So I think all allocated tags are
associated with a request queue.



ok, good.


Regarding this patch series, I have shared the feedback I wanted to
share so I would appreciate it if someone else could also take a look.



So I can incorporate any changes and suggestions so far and send a 
non-RFC version - that may get more attention if none extra comes.


As mentioned on the cover letter, if patch 2+3/3 are accepted, then 
patch 1/3 could be simplified. But I plan to leave as is.


BTW, any issue with putting your suggested-by on patch 2/3?

Thanks,
John



Re: [RFC PATCH v3 3/3] blk-mq: Lockout tagset iterator when exiting elevator

2021-03-09 Thread John Garry

On 08/03/2021 19:59, Bart Van Assche wrote:

This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
call and if that causes all mtip_abort_cmd() calls to be skipped?


I'm not sure that I understand this problem you describe. So if 
blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) is called, 
either can happen:
a. normal operation, iter_usage_counter initially holds >= 1, and then 
iter_usage_counter is incremented in blk_mq_tagset_busy_iter() and we 
iter the busy tags. Any parallel call to blk_mq_tagset_busy_iter() 
will also increase iter_usage_counter.
b. we're switching IO scheduler. In this scenario, first we quiesce 
all queues. After that, there should be no active requests. At that 
point, we ensure any calls to blk_mq_tagset_busy_iter() are finished 
and block (or discard may be a better term) any more calls. Blocking 
any more calls should be safe as there are no requests to iter. 
atomic_cmpxchg() is used to set iter_usage_counter to 0, blocking any 
more calls.





Hi Bart,

My concern is about the insertion of the early return statement in 
blk_mq_tagset_busy_iter(). 


So I take this approach as I don't see any way to use a mutual exclusion 
waiting mechanism to block calls to blk_mq_tagset_busy_iter() while the 
IO scheduler is being switched.


The reason is that blk_mq_tagset_busy_iter() can be called from any 
context, including hardirq.


Although most blk_mq_tagset_busy_iter() 
callers can handle skipping certain blk_mq_tagset_busy_iter() calls 
(e.g. when gathering statistics), I'm not sure this is safe for all 
blk_mq_tagset_busy_iter() callers. The example I cited is an example of 
a blk_mq_tagset_busy_iter() call with side effects.


I don't like to think that we're skipping it, which may imply that there 
are some active requests to iter and we're just ignoring them.


It's more like: we know that there are no requests active, so don't 
bother trying to iterate.




The mtip driver allocates one tag set per request queue so quiescing 
queues should be sufficient to address my concern for the mtip driver.


The NVMe core and SCSI core however share a single tag set across 
multiple namespaces / LUNs. In the error path of nvme_rdma_setup_ctrl()
I found a call to nvme_cancel_tagset(). nvme_cancel_tagset() calls 
blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl). I'm 
not sure it is safe to skip the nvme_cancel_request() calls if the I/O 
scheduler for another NVMe namespace is being modified.


Again, I would be relying on all request_queues associated with that 
tagset to be queisced when switching IO scheduler at the point 
blk_mq_tagset_busy_iter() is called and returns early.


Now if there were active requests, I am relying on the request queue 
quiescing to flush them. So blk_mq_tagset_busy_iter() could be called 
during that quiescing period, and would continue to iter the requests.


This does fall over if some tags are allocated without associated 
request queue, which I do not know exists.


Thanks,
John



Re: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue

2021-03-09 Thread John Garry

On 09/03/2021 15:57, Michael Kelley wrote:

From: John Garry  Sent: Tuesday, March 9, 2021 2:10 AM


On 08/03/2021 17:56, Melanie Plageman wrote:

On Mon, Mar 08, 2021 at 02:37:40PM +, Michael Kelley wrote:

From: Melanie Plageman (Microsoft)  Sent: Friday,

March 5, 2021 3:22 PM


The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during
allocation.

Cap cmd_per_lun at can_queue to avoid dispatch errors.

Signed-off-by: Melanie Plageman (Microsoft) 
---
   drivers/scsi/storvsc_drv.c | 2 ++
   1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 6bc5453cea8a..d7953a6e00e6 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,
(max_sub_channels + 1) *
(100 - ring_avail_percent_lowater) / 100;

+   scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun,

scsi_driver.can_queue);

+


I'm not sure what you mean by "avoid dispatch errors".  Can you elaborate?


The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set
Scsi_Host->cmd_per_lun in storvsc_probe().

In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is
called and sets the scsi_device->queue_depth to the Scsi_Host's
cmd_per_lun with this code:

scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?
  sdev->host->cmd_per_lun : 1);

During dispatch, the scsi_device->queue_depth is used in
scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine
whether or not the device can queue another command.

On some machines, with the 2048 value of cmd_per_lun that was used to
set the initial scsi_device->queue_depth, commands can be queued that
are later not able to be dispatched after running out of space in the
ringbuffer.

On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD
(running an fio workload that I can provide if needed), storvsc_do_io()
ends up often returning SCSI_MLQUEUE_DEVICE_BUSY.

This is the call stack:

hv_get_bytes_to_write
hv_ringbuffer_write
vmbus_send_packet
storvsc_do_io
storvsc_queuecommand
scsi_dispatch_cmd
scsi_queue_rq
dispatch_rq_list


Be aware that the calculation of "can_queue" in this driver is somewhat
flawed -- it should not be based on the size of the ring buffer, but instead on
the maximum number of requests Hyper-V will queue.  And even then,
can_queue doesn't provide the cap you might expect because the blk-mq layer
allocates can_queue tags for each HW queue, not as a total.



The docs for scsi_mid_low_api document Scsi_Host can_queue this way:

can_queue
- must be greater than 0; do not send more than can_queue
  commands to the adapter.

I did notice that in scsi_host.h, the comment for can_queue does say
can_queue is the "maximum number of simultaneous commands a single hw
queue in HBA will accept." However, I don't see it being used this way
in the code.



JFYI, the block layer ensures that no more than can_queue requests are
sent to the host. See scsi_mq_setup_tags(), and how the tagset queue
depth is set to shost->can_queue.

Thanks,
John


Agree on what's in scsi_mq_setup_tags().  But scsi_mq_setup_tags() calls
blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(),
which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag
set queue_depth as needed until it succeeds.

The key thing is that __blk_mq_alloc_rq_maps() iterates over the
number of HW queues calling __blk_mq_alloc_map_and_request().
The latter function allocates the map and the requests with a count
of the tag set's queue_depth.   There's no logic to apportion the
can_queue value across multiple HW queues. So each HW queue gets
can_queue tags allocated, and the SCSI host driver may see up to
(can_queue * # HW queues) simultaneous requests.

I'm certainly not an expert in this area, but that's what I see in the
code.  We've run live experiments, and can see the number
simultaneous requests sent to the storvsc driver be greater than
can_queue when the # of HW queues is greater than 1, which seems
to be consistent with the code.


ah, ok. I assumed that # of HW queues = 1 here. So you're describing a 
problem similar to 
https://lore.kernel.org/linux-scsi/b3e4e597-779b-7c1e-0d3c-07bc3dab1...@huawei.com/


So if you check the nr_hw_queues comment in include/scsi/scsi_host.h, it reads: 
the total queue depth per host is nr_hw_queues * can_queue. However, 
when host_tagset is set, the total queue depth is can_queue.


Setting .host_tagset will ensure at most can_queue requests will be sent 
over all HW queues at any given time. A few SCSI MQ drivers set this now.
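
For example, in a driver's probe (a sketch, values made up):

	/* shared, host-wide tag space: blk-mq sizes one set of can_queue
	 * tags, rather than can_queue tags per HW queue */
	shost->nr_hw_queues = num_possible_cpus();
	shost->can_queue = 4096;
	shost->host_tagset = 1;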


Thanks,
John



Michael





During dispatch, in scsi_target_queue_ready(), there is this code:

  if (busy >= starget->can_queue)
  goto starved;

Re: [PATCH 2/2] iommu/iova: Improve restart logic

2021-03-09 Thread John Garry

On 05/03/2021 16:35, Robin Murphy wrote:

Hi Robin,


When restarting after searching below the cached node fails, resetting
the start point to the anchor node is often overly pessimistic. If
allocations are made with mixed limits - particularly in the case of the
opportunistic 32-bit allocation for PCI devices - this could mean
significant time wasted walking through the whole populated upper range
just to reach the initial limit. 


Right


We can improve on that by implementing
a proper tree traversal to find the first node above the relevant limit,
and set the exact start point.


Thanks for this. However, as mentioned in the other thread, this does 
not help our performance regression Re: commit 4e89dce72521.


And mentioning this "retry" approach again, in case it was missed, from 
my experiment on the affected HW, it also has generally a dreadfully low 
success rate - less than 1% for the 32b range retry. Retry rate is about 
20%. So I am saying from this 20%, less than 1% of those succeed.


Failing faster sounds key.

Thanks,
John



Signed-off-by: Robin Murphy 
---
  drivers/iommu/iova.c | 39 ++-
  1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index c28003e1d2ee..471c48dd71e7 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -154,6 +154,43 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, 
struct iova *free)
	iovad->cached_node = rb_next(&free->node);
  }
  
+static struct rb_node *iova_find_limit(struct iova_domain *iovad, unsigned long limit_pfn)

+{
+   struct rb_node *node, *next;
+   /*
+* Ideally what we'd like to judge here is whether limit_pfn is close
+* enough to the highest-allocated IOVA that starting the allocation
+* walk from the anchor node will be quicker than this initial work to
+* find an exact starting point (especially if that ends up being the
+* anchor node anyway). This is an incredibly crude approximation which
+* only really helps the most likely case, but is at least trivially 
easy.
+*/
+   if (limit_pfn > iovad->dma_32bit_pfn)
+   return &iovad->anchor.node;
+
+   node = iovad->rbroot.rb_node;
+   while (to_iova(node)->pfn_hi < limit_pfn)
+   node = node->rb_right;
+
+search_left:
+   while (node->rb_left && to_iova(node->rb_left)->pfn_lo >= limit_pfn)
+   node = node->rb_left;
+
+   if (!node->rb_left)
+   return node;
+
+   next = node->rb_left;
+   while (next->rb_right) {
+   next = next->rb_right;
+   if (to_iova(next)->pfn_lo >= limit_pfn) {
+   node = next;
+   goto search_left;
+   }
+   }
+
+   return node;
+}
+
  /* Insert the iova into domain rbtree by holding writer lock */
  static void
  iova_insert_rbtree(struct rb_root *root, struct iova *iova,
@@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain 
*iovad,
if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
high_pfn = limit_pfn;
low_pfn = retry_pfn;
-   curr = &iovad->anchor.node;
+   curr = iova_find_limit(iovad, limit_pfn);
curr_iova = to_iova(curr);
goto retry;
}





Re: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue

2021-03-09 Thread John Garry

On 08/03/2021 17:56, Melanie Plageman wrote:

On Mon, Mar 08, 2021 at 02:37:40PM +, Michael Kelley wrote:

From: Melanie Plageman (Microsoft)  Sent: Friday, 
March 5, 2021 3:22 PM


The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during
allocation.

Cap cmd_per_lun at can_queue to avoid dispatch errors.

Signed-off-by: Melanie Plageman (Microsoft) 
---
  drivers/scsi/storvsc_drv.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 6bc5453cea8a..d7953a6e00e6 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device,
(max_sub_channels + 1) *
(100 - ring_avail_percent_lowater) / 100;

+   scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, 
scsi_driver.can_queue);
+


I'm not sure what you mean by "avoid dispatch errors".  Can you elaborate?


The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set
Scsi_Host->cmd_per_lun in storvsc_probe().

In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is
called and sets the scsi_device->queue_depth to the Scsi_Host's
cmd_per_lun with this code:
 
scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ?

 sdev->host->cmd_per_lun : 1);

During dispatch, the scsi_device->queue_depth is used in
scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine
whether or not the device can queue another command.

On some machines, with the 2048 value of cmd_per_lun that was used to
set the initial scsi_device->queue_depth, commands can be queued that
are later not able to be dispatched after running out of space in the
ringbuffer.

On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD
(running an fio workload that I can provide if needed), storvsc_do_io()
ends up often returning SCSI_MLQUEUE_DEVICE_BUSY.
   
This is the call stack:
 
hv_get_bytes_to_write

hv_ringbuffer_write
vmbus_send_packet
storvsc_do_io
storvsc_queuecommand
scsi_dispatch_cmd
scsi_queue_rq
dispatch_rq_list


Be aware that the calculation of "can_queue" in this driver is somewhat
flawed -- it should not be based on the size of the ring buffer, but instead on
the maximum number of requests Hyper-V will queue.  And even then,
can_queue doesn't provide the cap you might expect because the blk-mq layer
allocates can_queue tags for each HW queue, not as a total.



The docs for scsi_mid_low_api document Scsi_Host can_queue this way:

   can_queue
   - must be greater than 0; do not send more than can_queue
 commands to the adapter.
 
I did notice that in scsi_host.h, the comment for can_queue does say

can_queue is the "maximum number of simultaneous commands a single hw
queue in HBA will accept." However, I don't see it being used this way
in the code.
   


JFYI, the block layer ensures that no more than can_queue requests are 
sent to the host. See scsi_mq_setup_tags(), and how the tagset queue 
depth is set to shost->can_queue.


Thanks,
John



During dispatch, in scsi_target_queue_ready(), there is this code:

 if (busy >= starget->can_queue)
 goto starved;

And the scsi_target->can_queue value should be coming from Scsi_host as
mentioned in the scsi_target definition in scsi_device.h
 /*
   * LLDs should set this in the slave_alloc host template callout.
   * If set to zero then there is not limit.
   */
 unsigned intcan_queue;

So, I don't really see how this would be per hardware queue.



I agree that the cmd_per_lun setting is also too big, but we should fix that in
the context of getting all of these different settings working together 
correctly,
and not piecemeal.



Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe
will also prevent the LUN queue_depth from being set to a value that is
higher than it can ever be set to again by the user when
storvsc_change_queue_depth() is invoked.

Also in scsi_sysfs sdev_store_queue_depth() there is this check:

   if (depth < 1 || depth > sdev->host->can_queue)
 return -EINVAL;

I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun
is set to the min of the configured cmd_per_lun and
Scsi_Host->can_queue:

 shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);

Best,
Melanie
.





Re: [PATCH 1/5] perf metricgroup: Support printing metrics for arm64

2021-03-08 Thread John Garry

On 06/03/2021 19:34, Jiri Olsa wrote:

On Fri, Mar 05, 2021 at 11:06:58AM +, John Garry wrote:

Hi Jirka,


-   struct pmu_events_map *map = perf_pmu__find_map(NULL);
+   struct pmu_events_map *map = find_cpumap();

so this is just for arm at the moment right?


Yes - but to be more accurate, arm64.

At the moment, from the archs which use pmu-events, only arm64 and nds32
have versions of get_cpuid_str() which require a non-NULL pmu argument.

But then apparently nds32 only supports a single CPU, so this issue of
heterogeneous CPUs should not be a concern there:)


could we rather make this arch specific code, so we don't need
to do the scanning on archs where this is not needed?

like marking perf_pmu__find_map as __weak and add arm specific
version?

Well I was thinking that this code should not be in metricgroup.c anyway.

So there is code which is common in current perf_pmu__find_map() for all
archs.

I could factor that out into a common function, below. Just a bit worried
about perf_pmu__find_map() and perf_pmu__find_pmu_map() being confused.

right, so perf_pmu__find_map does not take perf_pmu as argument
anymore, so the prefix does not fit, how about pmu_events_map__find ?


I think it could be ok.

But now I am slightly concerned that we don't put anything like this in 
arch/arm64, based on this earlier discussion on close topic:


https://lore.kernel.org/lkml/20190719075450.xcm4i4a5sfaxlfap@willie-the-truck/

Hi Will, Mark,

Do you have any objection to add arm64 specific code here?

So what I had originally in this patch was to iterate PMUs  in common 
code and find the CPU PMU and use that to match CPU metrics, as long as 
it's not a heterogeneous system.


Now the suggestion was to move that into arch specific code, as it's not 
needed for all archs.
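
As a sketch of what that could look like - assuming the common helper is 
renamed as discussed and a core-PMU test like is_pmu_core() is usable 
from arch code:

/* common code: default keeps today's behaviour */
struct pmu_events_map * __weak pmu_events_map__find(void)
{
	return perf_pmu__find_map(NULL);
}

/* arm64 override: only hand back a map on a homogeneous system */
struct pmu_events_map *pmu_events_map__find(void)
{
	struct perf_pmu *pmu = NULL;

	while ((pmu = perf_pmu__scan(pmu))) {
		if (!is_pmu_core(pmu->name))
			continue;

		/* the CPU map should cover all CPUs */
		if (pmu->cpus->nr != cpu__max_cpu())
			return NULL;

		return perf_pmu__find_map(pmu);
	}

	return NULL;
}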


Thanks,
John


Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"

2021-03-08 Thread John Garry

On 08/03/2021 15:15, Robin Murphy wrote:
I figure that you're talking about 4e89dce72521 now. I would have 
liked to know which real-life problem it solved in practice.


 From what I remember, the problem reported was basically the one 
illustrated in that commit and the one I alluded to above - namely that 
certain allocation patterns with a broad mix of sizes and relative 
lifetimes end up pushing the cached PFN down to the bottom of the 
address space such that allocations start failing despite there still 
being sufficient free space overall, which was breaking some media 
workload. What was originally proposed was an overcomplicated palaver 
with DMA attributes and a whole extra allocation algorithm rather than 
just fixing the clearly unintended and broken behaviour.


ok, fine. I just wondered if this was a theoretical problem only.



While max32_alloc_size indirectly tracks the largest *contiguous* 
available space, one of the ideas from which it grew was to simply keep

count of the total number of free PFNs. If you're really spending
significant time determining that the tree is full, as opposed to just
taking longer to eventually succeed, then it might be relatively
innocuous to tack on that semi-redundant extra accounting as a
self-contained quick fix for that worst case.



...



Even if it were configurable, wouldn't it make sense to have it 
configurable per IOVA domain?


Perhaps, but I don't see that being at all easy to implement. We can't 
arbitrarily *increase* the scope of caching once a domain is active due 
to the size-rounding-up requirement, which would be prohibitive to 
larger allocations if applied universally.




Agreed.

But having that (all IOVA sizes being cacheable) available could be 
really great, though, for some situations.


Furthermore, as mentioned above, I still want to solve this IOVA aging 
issue, and this fixed RCACHE RANGE size seems to be at the center 
of that problem.




As for 4e89dce72521, so even if it's proper to retry for a failed 
alloc,

it is not always necessary. I mean, if we're limiting ourselves to 32b
subspace for this SAC trick and we fail the alloc, then we can try the
space above 32b first (if usable). If that fails, then retry there. I
don't see a need to retry the 32b subspace if we're not limited to it.
How about it? We tried that idea and it looks to just about restore
performance.

The thing is, if you do have an actual PCI device where DAC might mean a
33% throughput loss and you're mapping a long-lived buffer, or you're on
one of these systems where firmware fails to document address limits and
using the full IOMMU address width quietly breaks things, then you
almost certainly *do* want the allocator to actually do a proper job of
trying to satisfy the given request.


If those conditions were true, then it seems quite a tenuous position, 
so trying to help that scenario in general terms will have limited 
efficacy.


Still, I'd be curious to see if making the restart a bit cleverer offers 
a noticeable improvement. IIRC I suggested it at the time, but in the 
end the push was just to get *something* merged.


Sorry to say, I just tested that ("iommu/iova: Improve restart logic") 
and there is no obvious improvement.


I'll have a further look at what might be going on.

Thanks very much,
John


Re: [PATCH] iommu/dma: Resurrect the "forcedac" option

2021-03-08 Thread John Garry

On 08/03/2021 13:08, Robin Murphy wrote:

On 2021-03-05 17:41, John Garry wrote:

On 05/03/2021 16:32, Robin Murphy wrote:

In converting intel-iommu over to the common IOMMU DMA ops, it quietly
lost the functionality of its "forcedac" option. Since this is a handy
thing both for testing and for performance optimisation on certain
platforms, reimplement it under the common IOMMU parameter namespace.

For the sake of fixing the inadvertent breakage of the Intel-specific
parameter, remove the dmar_forcedac remnants and hook it up as an alias
while documenting the transition to the new common parameter.



Do you think that having a kconfig option to control the default for 
this can help identify the broken platforms which rely on forcedac=0? 
But seems a bit trivial for that, though.


I think it's still a sizeable can of worms - unlike, say, 
ARM_SMMU_DISABLE_BYPASS_BY_DEFAULT, we can't actually tell when things 
have gone awry and explicitly call it out. While I was getting the 
dma-ranges right on my Juno, everything broke differently - the SATA 
controller fails gracefully; the ethernet controller got the kernel tied 
up somewhere (to the point that the USB keyboard died) once it tried to 
brink up the link, but was at least spewing regular timeout backtraces 
that implicated the networking layer; having an (unused) NVMe plugged in 
simply wedged the boot process early on with no hint whatsoever of why.


TBH I'm not really sure what the best way forward is in terms of trying 
to weed out platforms that would need quirking.


I was more thinking of an unstable TEST config, like 
DEBUG_TEST_DRIVER_REMOVE. So we know that this particular config breaks 
many platforms. But at least those in the know can turn it on locally 
and detect and fix issues, and strive towards having a platform for 
which it works.


But then it does become a little harder to justify such a config when we 
can enable it via the command line.


Our discussion just 
reminded me of this option and that it had gone AWOL, so bringing it 
back to be potentially *some* use to everyone seems justifiable on its own.


Of course.

Cheers,
John


Re: [RFC PATCH v3 1/3] blk-mq: Clean up references to old requests when freeing rqs

2021-03-08 Thread John Garry

On 06/03/2021 02:52, Khazhy Kumykov wrote:

On Fri, Mar 5, 2021 at 7:20 AM John Garry  wrote:


It has been reported many times that a use-after-free can be intermittently
found when iterating busy requests:

- 
https://lore.kernel.org/linux-block/8376443a-ec1b-0cef-8244-ed584b96f...@huawei.com/
- 
https://lore.kernel.org/linux-block/5c3ac5af-ed81-11e4-fee3-f92175f14...@acm.org/T/#m6c1ac11540522716f645d004e2a5a13c9f218908
- 
https://lore.kernel.org/linux-block/04e2f9e8-79fa-f1cb-ab23-4a15bf3f6...@kernel.dk/

The issue is that when we switch scheduler or change queue depth, there may
be references in the driver tagset to the stale requests.

As a solution, clean up any references to those requests in the driver
tagset. This is done with a cmpxchg to make safe any race with setting the
driver tagset request from another queue.


I noticed this crash recently when running blktests on a "debug"
config on a 4.15 based kernel (it would always crash), and backporting
this change fixes it. (testing on linus's latest tree also confirmed
the fix, with the same config). I realize I'm late to the
conversation, but appreciate the investigation and fixes :)


Good to know. I'll explicitly cc you on further versions.

Thanks,
John



Re: [RFC PATCH v3 3/3] blk-mq: Lockout tagset iterator when exiting elevator

2021-03-08 Thread John Garry

On 06/03/2021 04:43, Bart Van Assche wrote:

On 3/5/21 7:14 AM, John Garry wrote:

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 7ff1b20d58e7..5950fee490e8 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -358,11 +358,16 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set 
*tagset,
  {
int i;
  
+	if (!atomic_inc_not_zero(&tagset->iter_usage_counter))

+   return;
+
for (i = 0; i < tagset->nr_hw_queues; i++) {
if (tagset->tags && tagset->tags[i])
__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
  BT_TAG_ITER_STARTED);
}
+
+   atomic_dec(&tagset->iter_usage_counter);
  }
  EXPORT_SYMBOL(blk_mq_tagset_busy_iter);


Hi Bart,


This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
call and if that causes all mtip_abort_cmd() calls to be skipped?


I'm not sure that I understand this problem you describe. So if 
blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) is called, either 
can happen:
a. normal operation, iter_usage_counter initially holds >= 1, and then 
iter_usage_counter is incremented in blk_mq_tagset_busy_iter() and we 
iter the busy tags. Any parallel call to blk_mq_tagset_busy_iter() will 
also increase iter_usage_counter.
b. we're switching IO scheduler. In this scenario, first we quiesce all 
queues. After that, there should be no active requests. At that point, 
we ensure any calls to blk_mq_tagset_busy_iter() are finished and block 
(or discard may be a better term) any more calls. Blocking any more 
calls should be safe as there are no requests to iter. atomic_cmpxchg() 
is used to set iter_usage_counter to 0, blocking any more calls.





+   while (atomic_cmpxchg(&set->iter_usage_counter, 1, 0) != 1);

Isn't it recommended to call cpu_relax() inside busy-waiting loops?


Maybe, but I am considering changing this patch to use percpu_ref - 
I need to check it further.
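
A rough sketch of that direction, assuming a percpu_ref member 
(hypothetical name iter_ref) added to struct blk_mq_tag_set:

void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
		busy_tag_iter_fn *fn, void *priv)
{
	int i;

	if (!percpu_ref_tryget(&tagset->iter_ref))
		return;

	for (i = 0; i < tagset->nr_hw_queues; i++) {
		if (tagset->tags && tagset->tags[i])
			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
					      BT_TAG_ITER_STARTED);
	}

	percpu_ref_put(&tagset->iter_ref);
}

The elevator-switch side would then percpu_ref_kill() the ref, wait for 
it to drain, free the scheduler requests, and percpu_ref_resurrect() it 
afterwards, instead of spinning on atomic_cmpxchg().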





blk_mq_sched_free_requests(q);
__elevator_exit(q, e);
  
+	atomic_set(&set->iter_usage_counter, 1);

Can it happen that the above atomic_set() call happens while a
blk_mq_tagset_busy_iter() call is in progress?


No. At this point it is ensured that iter_usage_counter holds 0, set by the 
atomic_cmpxchg(), so there should be no active processes in the 
blk_mq_tagset_busy_iter() sensitive region. Calls to 
blk_mq_tagset_busy_iter() are blocked while iter_usage_counter holds 0.



Should that atomic_set()
call perhaps be changed into an atomic_inc() call?


They have the same effect in practice, but we use atomic_set() in 
blk_mq_alloc_tag_set(), so this is at least consistent.


Thanks,
John


Re: [RFC PATCH v3 2/3] blk-mq: Freeze and quiesce all queues for tagset in elevator_exit()

2021-03-08 Thread John Garry

On 06/03/2021 04:32, Bart Van Assche wrote:

On 3/5/21 7:14 AM, John Garry wrote:

diff --git a/block/blk.h b/block/blk.h
index 3b53e44b967e..1a948bfd91e4 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -201,10 +201,29 @@ void elv_unregister_queue(struct request_queue *q);
 static inline void elevator_exit(struct request_queue *q,
 		struct elevator_queue *e)
 {
+	struct blk_mq_tag_set *set = q->tag_set;
+	struct request_queue *tmp;
+
 	lockdep_assert_held(&q->sysfs_lock);
 
+	mutex_lock(&set->tag_list_lock);
+	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
+		if (tmp == q)
+			continue;
+		blk_mq_freeze_queue(tmp);
+		blk_mq_quiesce_queue(tmp);
+	}
+
 	blk_mq_sched_free_requests(q);
 	__elevator_exit(q, e);
+
+	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
+		if (tmp == q)
+			continue;
+		blk_mq_unquiesce_queue(tmp);
+		blk_mq_unfreeze_queue(tmp);
+	}
+	mutex_unlock(&set->tag_list_lock);
 }


Hi Bart,


This patch introduces nesting of tag_list_lock inside sysfs_lock. The
latter is per request queue while the former can be shared across
multiple request queues. Has it been analyzed whether this is safe?


Firstly, ignoring implementation details for a moment, this patch is meant to 
check that the concept is consistent with your suggestion and to establish 
whether it is sound.


As for nested locking, I can analyze more, but I did assume that we 
don't care about locking-out sysfs intervention during this time. And it 
seems pretty difficult to avoid nesting the locks.


And further to this, I see that 
https://lore.kernel.org/linux-block/3aa5407c-0800-2482-597b-4264781a7...@grimberg.me/T/#mc3e3175642660578c0ae2a6c32185b1e34ec4b8a 
has a new interface for tagset quiesce, which could make this process 
more efficient.
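
To illustrate, if an interface along those lines were available - 
blk_mq_quiesce_tagset()/blk_mq_unquiesce_tagset() as I read them from that 
series, so not mainline at this point - the per-queue quiesce loops above 
could collapse to something like the following. The freeze/unfreeze pairs 
would still need handling per queue, and whether the tag_list locking stays 
in the caller or moves into the helper depends on that interface:

	/* instead of walking set->tag_list and quiescing each queue: */
	blk_mq_quiesce_tagset(set);

	blk_mq_sched_free_requests(q);
	__elevator_exit(q, e);

	blk_mq_unquiesce_tagset(set);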


Please let me know further thoughts.

Thanks,
John


Re: [RFC PATCH v3 1/3] blk-mq: Clean up references to old requests when freeing rqs

2021-03-08 Thread John Garry

On 06/03/2021 18:13, Bart Van Assche wrote:

On 3/5/21 7:14 AM, John Garry wrote:

@@ -2296,10 +2296,14 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 
 		for (i = 0; i < tags->nr_tags; i++) {
 			struct request *rq = tags->static_rqs[i];
+			int j;
 
 			if (!rq)
 				continue;
 			set->ops->exit_request(set, rq, hctx_idx);
+			/* clean up any references which occur in @ref_tags */
+			for (j = 0; ref_tags && j < ref_tags->nr_tags; j++)
+				cmpxchg(&ref_tags->rqs[j], rq, 0);
 			tags->static_rqs[i] = NULL;
 		}
 	}


Hi Bart,


What prevents blk_mq_tagset_busy_iter() from reading hctx->tags[...]
before the cmpxchg() call and dereferencing it after blk_mq_free_rqs()
has called __free_pages()?



So there is nothing in this patch to stop that. But it's pretty unlikely, as 
the window between reading hctx->tags[...] and actually dereferencing it is 
generally very narrow. However, something like that should be made safe by 
patch 2/3.
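
For anyone following along, the window in question is roughly this 
(simplified from the tag iterator; the stale pointer is cleared by the 
cmpxchg() in this patch, but a reader that loaded it just beforehand can 
still dereference it):

	/* iterator side, simplified from bt_tags_iter() */
	rq = tags->rqs[bitnr];		/* may load a soon-to-be-freed request */
	if (!rq)
		return true;
	/*
	 * <-- blk_mq_free_rqs() can run here: the cmpxchg() clears
	 *     tags->rqs[bitnr], but this CPU already holds the old pointer,
	 *     and __free_pages() then releases the memory backing it.
	 */
	return fn(rq, data, reserved);	/* use after free */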


Thanks,
John



