[PATCH] tracing: Use init_utsname()->release
Instead of using UTS_RELEASE, use init_utsname()->release, which means that we don't need to rebuild the code just for the git head commit changing.

Signed-off-by: John Garry

---
I originally sent an RFC using a new string, uts_release, but that string is not needed as we can use init_utsname()->release instead.

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8198bfc54b58..1416bf72f3f4 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -13,7 +13,7 @@
  * Copyright (C) 2004 Nadia Yvette Chambers
  */
 #include
-#include <generated/utsrelease.h>
+#include <linux/utsname.h>
 #include
 #include
 #include
@@ -4368,7 +4368,7 @@ print_trace_header(struct seq_file *m, struct trace_iterator *iter)
 	get_total_entries(buf, &total, &entries);
 
 	seq_printf(m, "# %s latency trace v1.1.5 on %s\n",
-		   name, UTS_RELEASE);
+		   name, init_utsname()->release);
 	seq_puts(m, "# ---"
 		 "-\n");
 	seq_printf(m, "# latency: %lu us, #%lu/%lu, CPU#%d |"
--
2.31.1
Re: [PATCH RFC 0/4] Introduce uts_release
On 08/02/2024 10:08, John Garry wrote:
> On 05/02/2024 23:10, Masahiro Yamada wrote:
>>>> I think what you can contribute are:
>>>> - Explore the UTS_RELEASE users, and check if you can get rid of it.
>>>
>>> Unfortunately I expect resistance for this. I also expect places like
>>> the FW loader where it is necessary. And when this is used in sysfs,
>>> people will say that it is part of the ABI now.
>>>
>>> How about I send the patch to update to use init_uts_ns and mention
>>> also that it would be better to not use it at all, if possible? I can
>>> cc you.
>>
>> OK.
>>
>> As I mentioned in the previous reply, the replacement is safe for
>> builtin code. When you touch modular code, please pay a little more
>> care, because UTS_RELEASE and init_utsname()->release may differ when
>> CONFIG_MODVERSIONS=y.
>
> Are you saying that we may have a different release version for the
> kernel and a module built with CONFIG_MODVERSIONS=y, and the module was
> using UTS_RELEASE for something?
>
> That something may be like setting some info in a sysfs file, like in
> this example:
>
> static ssize_t target_core_item_version_show(struct config_item *item,
> 		char *page)
> {
> 	return sprintf(page, "Target Engine Core ConfigFS Infrastructure %s"
> 		" on %s/%s on "UTS_RELEASE"\n", TARGET_CORE_VERSION,
> 		utsname()->sysname, utsname()->machine);
> }
>
> And the intention is to use the module codebase release version and not
> the kernel codebase release version. Hence utsname() is used for
> .sysname and .machine, but not .release.

Hi Masahiro,

Can you comment on whether I am right about CONFIG_MODVERSIONS, above?

Thanks,
John
Re: [PATCH RFC 0/4] Introduce uts_release
On 05/02/2024 23:10, Masahiro Yamada wrote:
>>> I think what you can contribute are:
>>> - Explore the UTS_RELEASE users, and check if you can get rid of it.
>>
>> Unfortunately I expect resistance for this. I also expect places like
>> the FW loader where it is necessary. And when this is used in sysfs,
>> people will say that it is part of the ABI now.
>>
>> How about I send the patch to update to use init_uts_ns and mention
>> also that it would be better to not use it at all, if possible? I can
>> cc you.
>
> OK.
>
> As I mentioned in the previous reply, the replacement is safe for
> builtin code. When you touch modular code, please pay a little more
> care, because UTS_RELEASE and init_utsname()->release may differ when
> CONFIG_MODVERSIONS=y.

Are you saying that we may have a different release version for the
kernel and a module built with CONFIG_MODVERSIONS=y, and the module was
using UTS_RELEASE for something?

That something may be like setting some info in a sysfs file, like in
this example:

static ssize_t target_core_item_version_show(struct config_item *item,
		char *page)
{
	return sprintf(page, "Target Engine Core ConfigFS Infrastructure %s"
		" on %s/%s on "UTS_RELEASE"\n", TARGET_CORE_VERSION,
		utsname()->sysname, utsname()->machine);
}

And the intention is to use the module codebase release version and not
the kernel codebase release version. Hence utsname() is used for
.sysname and .machine, but not .release.

Thanks,
John
Re: [PATCH RFC 0/4] Introduce uts_release
On 02/02/2024 15:01, Masahiro Yamada wrote:
> As you see, several drivers store UTS_RELEASE in their driver data,
> and even print it in debug print. I do not see why it is useful.

I would tend to agree, and mentioned that earlier.

> As you discussed in 3/4, if UTS_RELEASE is unneeded, it is better to
> get rid of it.

Jakub replied about this.

> If such version information is useful for drivers, the intention is
> whether the version of the module, or the version of vmlinux. That is
> a question. They differ when CONFIG_MODVERSIONS.
>
> When module developers intend to printk the git version from which the
> module was compiled, presumably they want to use UTS_RELEASE, which
> was expanded at the compile time of the module. If you replace it with
> uts_release, it is the git version of vmlinux.

I think often this information in UTS_RELEASE is shared as informative
only, so the user can conveniently know the specific kernel git version.

> Of course, the replacement is safe for always-builtin code.
>
> Lastly, we can avoid using UTS_RELEASE without relying on your patch.
> For example, commit 3a3a11e6e5a2bc0595c7e36ae33c861c9e8c75b1 replaced
> UTS_RELEASE with init_uts_ns.name.release
>
> So, is your uts_release a shorthand of init_uts_ns.name.release?

Yes - well, both are strings containing UTS_RELEASE.

Using a struct sub-member is a bit ungainly, but I suppose that we
should not be making life easy for people using this. However, we
already have init_utsname() in:

static inline struct new_utsname *init_utsname(void)
{
	return &init_uts_ns.name;
}

So we could use init_utsname()->release, which is a bit nicer.

> I think what you can contribute are:
> - Explore the UTS_RELEASE users, and check if you can get rid of it.

Unfortunately I expect resistance for this. I also expect places like
the FW loader where it is necessary. And when this is used in sysfs,
people will say that it is part of the ABI now.

How about I send the patch to update to use init_uts_ns and mention
also that it would be better to not use it at all, if possible? I can
cc you.

> - Where UTS_RELEASE is useful, consider if it is possible to replace
>   it with init_uts_ns.name.release

ok, but, as above, we could use init_utsname()->release also

Thanks,
John
Re: [PATCH RFC 3/4] net: ethtool: Use uts_release
On 01/02/2024 16:09, Jakub Kicinski wrote:
> On Thu, 1 Feb 2024 14:20:23 +0100 Jiri Pirko wrote:
>>> BTW, I assume that changes like this are also ok:
>>>
>>> 8<-
>>>
>>> net: team: Don't bother filling in ethtool driver version
>
> Yup, just to be clear - you can send this independently from the
> series,

Sure, and I think rocker and staging/octeon have this unnecessary code
too.

> tag is as [PATCH net-next]

ah, yes

> we'll take it via the networking tree.

Thanks. I assume Greg - the staging maintainer - would take the octeon
patch.

> I'm not sure which tree the other patches will go thru..

I think that the best thing to do is get a minimal set in for 6.9 and
then merge the rest in the next cycle. I've got about 22 patches in
total now, but I think that there will be more. We'll see who can pick
up the first set when I send it officially.

Thanks,
John
Re: [PATCH RFC 3/4] net: ethtool: Use uts_release
On 31/01/2024 19:24, Jakub Kicinski wrote:
> On Wed, 31 Jan 2024 10:48:50 + John Garry wrote:
>> Instead of using UTS_RELEASE, use uts_release, which means that we
>> don't need to rebuild the code just for the git head commit changing.
>>
>> Signed-off-by: John Garry
>
> Yes, please!
>
> Acked-by: Jakub Kicinski

Cheers

BTW, I assume that changes like this are also ok:

8<-

net: team: Don't bother filling in ethtool driver version

The version is the same as the default, so don't bother.

Signed-off-by: John Garry

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index f575f225d417..0a44bbdcfb7b 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -25,7 +25,6 @@
 #include
 #include
 #include
-#include <generated/utsrelease.h>
 #include
 #include
 
 #define DRV_NAME "team"
@@ -2074,7 +2073,6 @@ static void team_ethtool_get_drvinfo(struct net_device *dev,
 				     struct ethtool_drvinfo *drvinfo)
 {
 	strscpy(drvinfo->driver, DRV_NAME, sizeof(drvinfo->driver));
-	strscpy(drvinfo->version, UTS_RELEASE, sizeof(drvinfo->version));
 }

>8-

right?

John
Re: [PATCH RFC 0/4] Introduce uts_release
On 31/01/2024 16:22, Greg KH wrote:
>> before:
>> real    0m53.591s
>> user    1m1.842s
>> sys     0m9.161s
>>
>> after:
>> real    0m37.481s
>> user    0m46.461s
>> sys     0m7.199s
>>
>> Sending as an RFC as I need to test more of the conversions and I
>> would like to also convert more UTS_RELEASE users to prove this is
>> the proper approach.
>
> I like it, I also think that v4l2 includes this as well, as all of
> those drivers seem to rebuild when this changes, does that not happen
> for you too?

I didn't see that. Were you building for arm64? I can see some v4l2
configs enabled there for the vanilla defconfig (but none for x86-64).

> Anyway, if the firmware changes work, I'm all for this, thanks for
> taking it on!

cheers. BTW, I just noticed that utsrelease.h might not be updated for
my allmodconfig build when I change the head commit - I'll investigate
why ...

John
[PATCH RFC 1/4] init: Add uts_release
Add a char [] for UTS_RELEASE so that we don't need to rebuild code which references UTS_RELEASE.

Signed-off-by: John Garry

---
 include/linux/utsname.h | 1 +
 init/version.c          | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/utsname.h b/include/linux/utsname.h
index bf7613ba412b..15b0b1c9a9ee 100644
--- a/include/linux/utsname.h
+++ b/include/linux/utsname.h
@@ -88,5 +88,6 @@ static inline struct new_utsname *init_utsname(void)
 }
 
 extern struct rw_semaphore uts_sem;
+extern const char uts_release[];
 
 #endif /* _LINUX_UTSNAME_H */
diff --git a/init/version.c b/init/version.c
index 94c96f6fbfe6..87fecdd4fbfb 100644
--- a/init/version.c
+++ b/init/version.c
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include <generated/utsrelease.h>
 
 static int __init early_hostname(char *arg)
 {
@@ -48,6 +49,8 @@ BUILD_LTO_INFO;
 
 struct uts_namespace init_uts_ns __weak;
 const char linux_banner[] __weak;
+const char uts_release[] = UTS_RELEASE;
+EXPORT_SYMBOL_GPL(uts_release);
 
 #include "version-timestamp.c"
--
2.35.3
[PATCH RFC 0/4] Introduce uts_release
When hacking it is a waste of time and compute energy that we need to rebuild much kernel code just for changing the head git commit, like this:

> touch include/generated/utsrelease.h
> time make -j3
  mkdir -p /home/john/mnt_sda4/john/kernel-dev2/tools/objtool && make O=/home/john/mnt_sda4/john/kernel-dev2 subdir=tools/objtool --no-print-directory -C objtool
  INSTALL libsubcmd_headers
  CALL    scripts/checksyscalls.sh
  CC      init/version.o
  AR      init/built-in.a
  CC      kernel/sys.o
  CC      kernel/module/main.o
  AR      kernel/module/built-in.a
  CC      drivers/base/firmware_loader/main.o
  CC      kernel/trace/trace.o
  AR      drivers/base/firmware_loader/built-in.a
  AR      drivers/base/built-in.a
  CC      net/ethtool/ioctl.o
  AR      kernel/trace/built-in.a
  AR      kernel/built-in.a
  AR      net/ethtool/built-in.a
  AR      net/built-in.a
  AR      drivers/built-in.a
  AR      built-in.a
  ...

Files like drivers/base/firmware_loader/main.c need to be recompiled as they include generated/utsrelease.h for the UTS_RELEASE macro, and utsrelease.h is regenerated when the head commit changes.

Introduce a global char uts_release[] in init/version.c, which the mentioned code can use instead of UTS_RELEASE, meaning that we don't need to rebuild for changing the head commit - only init/version.c needs to be rebuilt. Whether all the references to UTS_RELEASE in the codebase are proper is a different matter.

For an x86_64 defconfig build of this series on my old laptop, here is the rebuild time before and after:

before:
real    0m53.591s
user    1m1.842s
sys     0m9.161s

after:
real    0m37.481s
user    0m46.461s
sys     0m7.199s

Sending as an RFC as I need to test more of the conversions and I would like to also convert more UTS_RELEASE users to prove this is the proper approach.
John Garry (4):
  init: Add uts_release
  tracing: Use uts_release
  net: ethtool: Use uts_release
  firmware_loader: Use uts_release

 drivers/base/firmware_loader/main.c | 39 +++++++++++++++++++++--------
 include/linux/utsname.h             |  1 +
 init/version.c                      |  3 +++
 kernel/trace/trace.c                |  4 +--
 net/ethtool/ioctl.c                 |  4 +--
 5 files changed, 39 insertions(+), 12 deletions(-)

--
2.35.3
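Enumerating the remaining UTS_RELEASE users mentioned in the cover letter is a one-liner; the sketch below demonstrates the grep on a tiny throwaway fixture tree so it is self-contained (in a real kernel tree you would run the same grep from the source root):

```shell
# Files including the generated header rebuild whenever the head commit
# changes; this grep shape finds them.
tmp=$(mktemp -d)
printf '#include <generated/utsrelease.h>\n' > "$tmp/a.c"
printf 'int x;\n' > "$tmp/b.c"
grep -rl 'generated/utsrelease.h' --include='*.c' "$tmp"
rm -rf "$tmp"
```

Only `a.c` is printed, since only it includes the generated header; run against a kernel tree, the list is exactly the set of files this series is trying to shrink.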
[PATCH RFC 4/4] firmware_loader: Use uts_release
Instead of using UTS_RELEASE, use uts_release, which means that we don't need to rebuild the code just for the git head commit changing.

Since UTS_RELEASE was used for fw_path and this points to const data, append uts_release dynamically to an intermediate string.

Signed-off-by: John Garry

---
 drivers/base/firmware_loader/main.c | 39 +++++++++++++++++++++++------
 1 file changed, 31 insertions(+), 8 deletions(-)

diff --git a/drivers/base/firmware_loader/main.c b/drivers/base/firmware_loader/main.c
index ea28102d421e..87da7be61a29 100644
--- a/drivers/base/firmware_loader/main.c
+++ b/drivers/base/firmware_loader/main.c
@@ -38,7 +38,7 @@
 #include
 #include
-#include <generated/utsrelease.h>
+#include <linux/utsname.h>
 
 #include "../base.h"
 #include "firmware.h"
@@ -471,9 +471,9 @@ static int fw_decompress_xz(struct device *dev, struct fw_priv *fw_priv,
 static char fw_path_para[256];
 static const char * const fw_path[] = {
 	fw_path_para,
-	"/lib/firmware/updates/" UTS_RELEASE,
+	"/lib/firmware/updates/", /* UTS_RELEASE is appended later */
 	"/lib/firmware/updates",
-	"/lib/firmware/" UTS_RELEASE,
+	"/lib/firmware/", /* UTS_RELEASE is appended later */
 	"/lib/firmware"
 };
@@ -496,7 +496,7 @@ fw_get_filesystem_firmware(struct device *device, struct fw_priv *fw_priv,
 	size_t size;
 	int i, len, maxlen = 0;
 	int rc = -ENOENT;
-	char *path, *nt = NULL;
+	char *path, *fw_path_string, *nt = NULL;
 	size_t msize = INT_MAX;
 	void *buffer = NULL;
@@ -510,6 +510,12 @@ fw_get_filesystem_firmware(struct device *device, struct fw_priv *fw_priv,
 	if (!path)
 		return -ENOMEM;
 
+	fw_path_string = __getname();
+	if (!fw_path_string) {
+		__putname(path);
+		return -ENOMEM;
+	}
+
 	wait_for_initramfs();
 	for (i = 0; i < ARRAY_SIZE(fw_path); i++) {
 		size_t file_size = 0;
@@ -519,16 +525,32 @@ fw_get_filesystem_firmware(struct device *device, struct fw_priv *fw_priv,
 		if (!fw_path[i][0])
 			continue;
 
+		len = snprintf(fw_path_string, PATH_MAX, "%s", fw_path[i]);
+		if (len >= PATH_MAX) {
+			rc = -ENAMETOOLONG;
+			break;
+		}
+
+		/* Special handling to append UTS_RELEASE */
+		if (fw_path[i][len - 1] == '/') {
+			len = snprintf(fw_path_string, PATH_MAX, "%s%s",
+				       fw_path[i], uts_release);
+			if (len >= PATH_MAX) {
+				rc = -ENAMETOOLONG;
+				break;
+			}
+		}
+
 		/* strip off \n from customized path */
-		maxlen = strlen(fw_path[i]);
+		maxlen = strlen(fw_path_string);
 		if (i == 0) {
-			nt = strchr(fw_path[i], '\n');
+			nt = strchr(fw_path_string, '\n');
 			if (nt)
-				maxlen = nt - fw_path[i];
+				maxlen = nt - fw_path_string;
 		}
 
 		len = snprintf(path, PATH_MAX, "%.*s/%s%s",
-			       maxlen, fw_path[i],
+			       maxlen, fw_path_string,
 			       fw_priv->fw_name, suffix);
 		if (len >= PATH_MAX) {
 			rc = -ENAMETOOLONG;
@@ -585,6 +607,7 @@ fw_get_filesystem_firmware(struct device *device, struct fw_priv *fw_priv,
 		break;
 	}
 	__putname(path);
+	__putname(fw_path_string);
 
 	return rc;
 }
--
2.35.3
[PATCH RFC 3/4] net: ethtool: Use uts_release
Instead of using UTS_RELEASE, use uts_release, which means that we don't need to rebuild the code just for the git head commit changing.

Signed-off-by: John Garry

---
 net/ethtool/ioctl.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ethtool/ioctl.c b/net/ethtool/ioctl.c
index 7519b0818b91..81d052505f67 100644
--- a/net/ethtool/ioctl.c
+++ b/net/ethtool/ioctl.c
@@ -31,7 +31,7 @@
 #include
 #include
 #include
-#include <generated/utsrelease.h>
+#include <linux/utsname.h>
 #include "common.h"
 
 /* State held across locks and calls for commands which have devlink fallback */
@@ -713,7 +713,7 @@ ethtool_get_drvinfo(struct net_device *dev, struct ethtool_devlink_compat *rsp)
 	struct device *parent = dev->dev.parent;
 
 	rsp->info.cmd = ETHTOOL_GDRVINFO;
-	strscpy(rsp->info.version, UTS_RELEASE, sizeof(rsp->info.version));
+	strscpy(rsp->info.version, uts_release, sizeof(rsp->info.version));
 	if (ops->get_drvinfo) {
 		ops->get_drvinfo(dev, &rsp->info);
 		if (!rsp->info.bus_info[0] && parent)
--
2.35.3
[PATCH RFC 2/4] tracing: Use uts_release
Instead of using UTS_RELEASE, use uts_release, which means that we don't need to rebuild the code just for the git head commit changing.

Signed-off-by: John Garry

---
 kernel/trace/trace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2a7c6fd934e9..68513924beb4 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -13,7 +13,7 @@
  * Copyright (C) 2004 Nadia Yvette Chambers
 */
 #include
-#include <generated/utsrelease.h>
+#include <linux/utsname.h>
 #include
 #include
 #include
@@ -4354,7 +4354,7 @@ print_trace_header(struct seq_file *m, struct trace_iterator *iter)
 	get_total_entries(buf, &total, &entries);
 
 	seq_printf(m, "# %s latency trace v1.1.5 on %s\n",
-		   name, UTS_RELEASE);
+		   name, uts_release);
 	seq_puts(m, "# ---"
 		 "-\n");
 	seq_printf(m, "# latency: %lu us, #%lu/%lu, CPU#%d |"
--
2.35.3
Re: [PATCH v3 2/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU
On 15/04/2021 13:48, Qi Liu wrote: PCIe PMU Root Complex Integrated End Point(RCiEP) device is supported to sample bandwidth, latency, buffer occupation etc. Each PMU RCiEP device monitors multiple Root Ports, and each RCiEP is registered as a PMU in /sys/bus/event_source/devices, so users can select target PMU, and use filter to do further sets. Filtering options contains: event- select the event. subevent - select the subevent. port - select target Root Ports. Information of Root Ports are shown under sysfs. bdf - select requester_id of target EP device. trig_len - set trigger condition for starting event statistics. trigger_mode - set trigger mode. 0 means starting to statistic when bigger than trigger condition, and 1 means smaller. thr_len - set threshold for statistics. thr_mode - set threshold mode. 0 means count when bigger than threshold, and 1 means smaller. Signed-off-by: Qi Liu Some minor items and nits with coding style below, but generally looks ok: Reviewed-by: John Garry --- MAINTAINERS|6 + drivers/perf/Kconfig |2 + drivers/perf/Makefile |1 + drivers/perf/pci/Kconfig | 16 + drivers/perf/pci/Makefile |2 + drivers/perf/pci/hisilicon/Makefile|3 + drivers/perf/pci/hisilicon/hisi_pcie_pmu.c | 1014 include/linux/cpuhotplug.h |1 + 8 files changed, 1045 insertions(+) create mode 100644 drivers/perf/pci/Kconfig create mode 100644 drivers/perf/pci/Makefile create mode 100644 drivers/perf/pci/hisilicon/Makefile create mode 100644 drivers/perf/pci/hisilicon/hisi_pcie_pmu.c diff --git a/MAINTAINERS b/MAINTAINERS index 7fdc513..efe06cd 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8084,6 +8084,12 @@ W: http://www.hisilicon.com F:Documentation/admin-guide/perf/hisi-pmu.rst F:drivers/perf/hisilicon +HISILICON PCIE PMU DRIVER +M: Qi Liu +S: Maintained +F: Documentation/admin-guide/perf/hisi-pcie-pmu.rst +F: drivers/perf/pci/hisilicon/hisi_pcie_pmu.c + HISILICON QM AND ZIP Controller DRIVER M:Zhou Wang L:linux-cry...@vger.kernel.org diff --git a/drivers/perf/Kconfig 
b/drivers/perf/Kconfig index 77522e5..ddd82fa 100644 --- a/drivers/perf/Kconfig +++ b/drivers/perf/Kconfig @@ -139,4 +139,6 @@ config ARM_DMC620_PMU source "drivers/perf/hisilicon/Kconfig" +source "drivers/perf/pci/Kconfig" + endmenu diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile index 5260b11..1208c08 100644 --- a/drivers/perf/Makefile +++ b/drivers/perf/Makefile @@ -14,3 +14,4 @@ obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o +obj-y += pci/ diff --git a/drivers/perf/pci/Kconfig b/drivers/perf/pci/Kconfig new file mode 100644 index 000..9f30291 --- /dev/null +++ b/drivers/perf/pci/Kconfig @@ -0,0 +1,16 @@ +# SPDX-License-Identifier: GPL-2.0-only +# +# PCIe Performance Monitor Drivers +# +menu "PCIe Performance Monitor" + +config HISI_PCIE_PMU + tristate "HiSilicon PCIE PERF PMU" + depends on (ARM64 && PCI) || COMPILE_TEST + help + Provide support for HiSilicon PCIe performance monitoring unit (PMU) + RCiEP devices. + Adds the PCIe PMU into perf events system for monitoring latency, + bandwidth etc. 
+ +endmenu diff --git a/drivers/perf/pci/Makefile b/drivers/perf/pci/Makefile new file mode 100644 index 000..a56b1a9 --- /dev/null +++ b/drivers/perf/pci/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +obj-y += hisilicon/ diff --git a/drivers/perf/pci/hisilicon/Makefile b/drivers/perf/pci/hisilicon/Makefile new file mode 100644 index 000..65b0bd7 --- /dev/null +++ b/drivers/perf/pci/hisilicon/Makefile @@ -0,0 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0-only + +obj-$(CONFIG_HISI_PCIE_PMU) += hisi_pcie_pmu.o diff --git a/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c b/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c new file mode 100644 index 000..415bf39 --- /dev/null +++ b/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c @@ -0,0 +1,1014 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * This driver adds support for PCIe PMU RCiEP device. Related + * perf events are bandwidth, bandwidth utilization, latency + * etc. + * + * Copyright (C) 2021 HiSilicon Limited + * Author: Qi Liu + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* Define registers */ +#define HISI_PCIE_GLOBAL_CTRL 0x00 +#define HISI_PCIE_EVENT_CTRL 0x010 +#define HISI_PCIE_CNT 0x090 +#defi
Re: [PATCH] scsi: core: Cap initial sdev queue depth at shost.can_queue
On 20/04/2021 01:02, Ming Lei wrote:
> On Tue, Apr 20, 2021 at 12:06:24AM +0800, John Garry wrote:
>> Function sdev_store_queue_depth() enforces that the sdev queue depth
>> cannot exceed shost.can_queue. However, the LLDD may still set
>> cmd_per_lun > can_queue, which leads to an initial sdev queue depth
>> greater than can_queue.
>>
>> Stop this happening by capping the initial sdev queue depth at
>> can_queue.
>>
>> Signed-off-by: John Garry
>> ---
>> Topic originally discussed at:
>> https://lore.kernel.org/linux-scsi/85dec8eb-8eab-c7d6-b0fb-5622747c5...@interlog.com/T/#m5663d0cac657d843b93d0c9a2374f98fc04384b9
>>
>> Last idea there was to error/warn in scsi_add_host() for cmd_per_lun >

Hi Ming,

> No, that isn't my suggestion.

Right, it was what I mentioned.

>> can_queue. However, such a shost driver could still configure the sdev
>> queue depth to be a sound value at the .slave_configure callback, so
>> now thinking the orig patch better.
>
> As I mentioned last time, why can't we fix ->cmd_per_lun in
> scsi_add_host() using .can_queue?

I would rather not change the values which are provided from the driver.
I would rather take the original values and try to use them in a sane
way. I have not seen other places where driver shost config values are
modified by the core code.

Thanks,
John
[PATCH] scsi: core: Cap initial sdev queue depth at shost.can_queue
Function sdev_store_queue_depth() enforces that the sdev queue depth cannot exceed shost.can_queue. However, the LLDD may still set cmd_per_lun > can_queue, which leads to an initial sdev queue depth greater than can_queue.

Stop this happening by capping the initial sdev queue depth at can_queue.

Signed-off-by: John Garry

---
Topic originally discussed at:
https://lore.kernel.org/linux-scsi/85dec8eb-8eab-c7d6-b0fb-5622747c5...@interlog.com/T/#m5663d0cac657d843b93d0c9a2374f98fc04384b9

Last idea there was to error/warn in scsi_add_host() for cmd_per_lun > can_queue. However, such a shost driver could still configure the sdev queue depth to be a sound value at the .slave_configure callback, so now thinking the orig patch better.

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 9f1b7f3c650a..8de2f830bcdc 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -277,7 +277,11 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	WARN_ON_ONCE(!blk_get_queue(sdev->request_queue));
 	sdev->request_queue->queuedata = sdev;
 
-	depth = sdev->host->cmd_per_lun ?: 1;
+	if (sdev->host->cmd_per_lun)
+		depth = min_t(unsigned int, sdev->host->cmd_per_lun,
+			      sdev->host->can_queue);
+	else
+		depth = 1;
 
 	/*
 	 * Use .can_queue as budget map's depth because we have to
--
2.26.2
Re: [PATCH 5/8] iommu: fix a couple of spelling mistakes
On 26/03/2021 06:24, Zhen Lei wrote:
> There are several spelling mistakes, as follows:
> funcions ==> functions
> distiguish ==> distinguish
> detroyed ==> destroyed
>
> Signed-off-by: Zhen Lei

I think that there should be a /s/appropriatley/appropriately/ in iommu.c

Thanks,
John
Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator
On 06/04/2021 17:54, John Garry wrote:

Hi Robin,

> Sorry if the phrasing was unclear there - the allusion to default
> domains is new, it just occurred to me that what we do there is in
> fact fairly close to what I've suggested previously for this. In that
> case, we have a global policy set by the command line, which *can* be
> overridden per-domain via sysfs at runtime, provided the user is
> willing to tear the whole thing down. Using a similar approach here
> would give a fair degree of flexibility but still mean that changes
> never have to be made dynamically to a live domain.

So are you saying that we can handle it similar to how we can now
change the default domain for an IOMMU group via sysfs? If so, that
just is not practical here. The reason being that this particular DMA
engine provides the block device giving the / mount point, so if we
unbind the driver, we lose the / mount point.

And I am not sure if the end user would even know how to set such a
tunable. Or, in this case, why the end user would not want the
optimized range configured always.

I'd still rather the device driver could provide info which can be used
to configure this before or during probing.

As a new solution, how about doing both of these:

a. Add a per-IOMMU group sysfs file to set this tunable. Works the same
   as how we change the default domain, and has all the same
   restrictions/steps. I think that this is what you are already
   suggesting.

b. Provide a DMA mapping API to set this value, similar to this current
   series. In the IOMMU backend for that API, we record a new range
   value and return -EPROBE_DEFER when successful. In the reprobe we
   reset the default domain for the device's IOMMU group, with the IOVA
   domain rcache range configured as previously requested. Again,
   allocating the new default domain is similar to how we change the
   default domain type today. This means that we don't play with a live
   domain. The downside is that we need to defer the probe.

Thanks,
John
Re: [PATCH v2 1/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU
On 13/04/2021 10:12, liuqi (BA) wrote:
>>> I do wonder why we even need maintain pcie_pmu->cpumask. Can't we
>>> just use cpu_online_mask as appropriate instead?
>
> Sorry, missed it yesterday. It seems that cpumask is always the same
> as cpu_online_mask, so do we need to reserve the cpumask sysfs
> interface?

I'm not saying that we don't require the cpumask sysfs interface. I am
just asking why you maintain a separate cpumask, when, as I said, they
seem the same.

Thanks,
John
perf arm64 --topdown support (was "perf arm64 metricgroup support")
On 08/04/2021 13:06, Jiri Olsa wrote:
> perf stat --topdown is not supported, as this requires the CPU PMU to
> expose (alias) events for the TopDown L1 metrics from sysfs, which arm
> does not do. To get that to work, we probably need to make perf use
> the pmu-events cpumap to learn about those alias events.

Hi guys,

About supporting the --topdown command for archs apart from x86, it
does not seem possible today. Support there is based on kernel support
for "topdown" CPU events used in the metric calculations. However,
arm64, for example, does not support these "topdown" events.

It seems to me that we could change to use the pmu-events framework +
metricgroup support here, rather than hardcoded events - has anyone
considered this approach previously? It seems a pretty big job, so I
thought I'd ask first ...

Thanks,
John
Re: [PATCH v2 1/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU
On 12/04/2021 14:34, liuqi (BA) wrote: Hi John, Thanks for reviewing this. On 2021/4/9 18:22, John Garry wrote: On 09/04/2021 10:05, Qi Liu wrote: PCIe PMU Root Complex Integrated End Point(RCiEP) device is supported to sample bandwidth, latency, buffer occupation etc. Each PMU RCiEP device monitors multiple Root Ports, and each RCiEP is registered as a PMU in /sys/bus/event_source/devices, so users can select target PMU, and use filter to do further sets. side note: it would be good to mention what baseline the series is based on in the cover letter Filtering options contains: event - select the event. subevent - select the subevent. port - select target Root Ports. Information of Root Ports are shown under sysfs. bdf - select requester_id of target EP device. trig_len - set trigger condition for starting event statistics. trigger_mode - set trigger mode. 0 means starting to statistic when bigger than trigger condition, and 1 means smaller. thr_len - set threshold for statistics. thr_mode - set threshold mode. 0 means count when bigger than threshold, and 1 means smaller. Signed-off-by: Qi Liu --- MAINTAINERS | 6 + drivers/perf/Kconfig | 2 + drivers/perf/Makefile | 1 + drivers/perf/pci/Kconfig | 16 + drivers/perf/pci/Makefile | 2 + drivers/perf/pci/hisilicon/Makefile | 5 + drivers/perf/pci/hisilicon/hisi_pcie_pmu.c | 1011 include/linux/cpuhotplug.h | 1 + 8 files changed, 1044 insertions(+) create mode 100644 drivers/perf/pci/Kconfig create mode 100644 drivers/perf/pci/Makefile create mode 100644 drivers/perf/pci/hisilicon/Makefile create mode 100644 drivers/perf/pci/hisilicon/hisi_pcie_pmu.c diff --git a/MAINTAINERS b/MAINTAINERS index 3353de0..46c7861 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8023,6 +8023,12 @@ W: http://www.hisilicon.com F: Documentation/admin-guide/perf/hisi-pmu.rst F: drivers/perf/hisilicon +HISILICON PCIE PMU DRIVER +M: Qi Liu +S: Maintained +F: Documentation/admin-guide/perf/hisi-pcie-pmu.rst nit: this does not exist yet... 
thanks, I'll move this add-maintainer-part to the second patch. that's why I advocate the documentation first :) +F: drivers/perf/pci/hisilicon/hisi_pcie_pmu.c + HISILICON QM AND ZIP Controller DRIVER M: Zhou Wang L: linux-cry...@vger.kernel.org diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig index 3075cf1..99d4760 100644 --- a/drivers/perf/Kconfig +++ b/drivers/perf/Kconfig @@ -139,4 +139,6 @@ config ARM_DMC620_PMU source "drivers/perf/hisilicon/Kconfig" +source "drivers/perf/pci/Kconfig" + endmenu diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile index 5260b11..1208c08 100644 --- a/drivers/perf/Makefile +++ b/drivers/perf/Makefile @@ -14,3 +14,4 @@ obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o +obj-y += pci/ diff --git a/drivers/perf/pci/Kconfig b/drivers/perf/pci/Kconfig new file mode 100644 index 000..a119486 --- /dev/null +++ b/drivers/perf/pci/Kconfig @@ -0,0 +1,16 @@ +# SPDX-License-Identifier: GPL-2.0-only +# +# PCIe Performance Monitor Drivers +# +menu "PCIe Performance Monitor" + +config HISI_PCIE_PMU + tristate "HiSilicon PCIE PERF PMU" + depends on ARM64 && PCI && HISI_PMU What from HISI_PMU is needed? I couldn't find anything here The event_sysfs_show() and format_sysfs_show() function of hisi_uncore_pmu.h can be reused in hisi_pcie_pmu.c, So I add path in Makefile and include "hisi_uncore_pmu.h". Right, but it would be nice to be able to build this under COMPILE_TEST. CONFIG_HISI_PMU cannot be built under COMPILE_TEST, so nice to not depend on it. So you could put hisi_event_sysfs_show() as a static inline in hisi_uncore_pmu.h, so the dependency can be removed Having said that, there is nothing really hisi specific in those functions like hisi_event_sysfs_show(). Can't we just create generic functions here? 
hisi_event_sysfs_show() == cci400_pmu_cycle_event_show() == xgene_pmu_event_show() + help + Provide support for HiSilicon PCIe performance monitoring unit (PMU) + RCiEP devices. + Adds the PCIe PMU into perf events system for monitoring latency, + bandwidth etc. + +#define HISI_PCIE_CHECK_EVENTS(name, list) \ "check" in a function name is too vague, as it does not imply any side-effect from calling this function. And I think "build" or similar would be good to be included in the macro name
Re: [PATCH v2 1/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU
On 09/04/2021 10:05, Qi Liu wrote: PCIe PMU Root Complex Integrated End Point(RCiEP) device is supported to sample bandwidth, latency, buffer occupation etc. Each PMU RCiEP device monitors multiple Root Ports, and each RCiEP is registered as a PMU in /sys/bus/event_source/devices, so users can select target PMU, and use filter to do further sets. Filtering options contains: event- select the event. subevent - select the subevent. port - select target Root Ports. Information of Root Ports are shown under sysfs. bdf - select requester_id of target EP device. trig_len - set trigger condition for starting event statistics. trigger_mode - set trigger mode. 0 means starting to statistic when bigger than trigger condition, and 1 means smaller. thr_len - set threshold for statistics. thr_mode - set threshold mode. 0 means count when bigger than threshold, and 1 means smaller. Signed-off-by: Qi Liu --- MAINTAINERS|6 + drivers/perf/Kconfig |2 + drivers/perf/Makefile |1 + drivers/perf/pci/Kconfig | 16 + drivers/perf/pci/Makefile |2 + drivers/perf/pci/hisilicon/Makefile|5 + drivers/perf/pci/hisilicon/hisi_pcie_pmu.c | 1011 include/linux/cpuhotplug.h |1 + 8 files changed, 1044 insertions(+) create mode 100644 drivers/perf/pci/Kconfig create mode 100644 drivers/perf/pci/Makefile create mode 100644 drivers/perf/pci/hisilicon/Makefile create mode 100644 drivers/perf/pci/hisilicon/hisi_pcie_pmu.c diff --git a/MAINTAINERS b/MAINTAINERS index 3353de0..46c7861 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8023,6 +8023,12 @@ W: http://www.hisilicon.com F:Documentation/admin-guide/perf/hisi-pmu.rst F:drivers/perf/hisilicon +HISILICON PCIE PMU DRIVER +M: Qi Liu +S: Maintained +F: Documentation/admin-guide/perf/hisi-pcie-pmu.rst nit: this does not exist yet... 
+F: drivers/perf/pci/hisilicon/hisi_pcie_pmu.c + HISILICON QM AND ZIP Controller DRIVER M:Zhou Wang L:linux-cry...@vger.kernel.org diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig index 3075cf1..99d4760 100644 --- a/drivers/perf/Kconfig +++ b/drivers/perf/Kconfig @@ -139,4 +139,6 @@ config ARM_DMC620_PMU source "drivers/perf/hisilicon/Kconfig" +source "drivers/perf/pci/Kconfig" + endmenu diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile index 5260b11..1208c08 100644 --- a/drivers/perf/Makefile +++ b/drivers/perf/Makefile @@ -14,3 +14,4 @@ obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o +obj-y += pci/ diff --git a/drivers/perf/pci/Kconfig b/drivers/perf/pci/Kconfig new file mode 100644 index 000..a119486 --- /dev/null +++ b/drivers/perf/pci/Kconfig @@ -0,0 +1,16 @@ +# SPDX-License-Identifier: GPL-2.0-only +# +# PCIe Performance Monitor Drivers +# +menu "PCIe Performance Monitor" + +config HISI_PCIE_PMU + tristate "HiSilicon PCIE PERF PMU" + depends on ARM64 && PCI && HISI_PMU What from HISI_PMU is needed? I couldn't find anything here + help + Provide support for HiSilicon PCIe performance monitoring unit (PMU) + RCiEP devices. + Adds the PCIe PMU into perf events system for monitoring latency, + bandwidth etc. 
+ +endmenu diff --git a/drivers/perf/pci/Makefile b/drivers/perf/pci/Makefile new file mode 100644 index 000..a56b1a9 --- /dev/null +++ b/drivers/perf/pci/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +obj-y += hisilicon/ diff --git a/drivers/perf/pci/hisilicon/Makefile b/drivers/perf/pci/hisilicon/Makefile new file mode 100644 index 000..6f99f52 --- /dev/null +++ b/drivers/perf/pci/hisilicon/Makefile @@ -0,0 +1,5 @@ +# SPDX-License-Identifier: GPL-2.0-only + +ccflags-y := -I $(srctree)/drivers/perf/hisilicon + +obj-$(CONFIG_HISI_PCIE_PMU) += hisi_pcie_pmu.o diff --git a/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c b/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c new file mode 100644 index 000..ac0e444 --- /dev/null +++ b/drivers/perf/pci/hisilicon/hisi_pcie_pmu.c @@ -0,0 +1,1011 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * This driver adds support for PCIe PMU RCiEP device. Related + * perf events are bandwidth, bandwidth utilization, latency + * etc. + * + * Copyright (C) 2021 HiSilicon Limited + * Author: Qi Liu + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "hisi_uncore_pmu.h" + +/* Define registers */ +#define HISI_PCIE_GLOBAL_CTRL 0x00 +#define HISI_PCIE_EVENT_CTRL 0x010 +#define HISI_PCIE_CNT
Re: [PATCH 1/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU
On 08/04/2021 10:01, Jonathan Cameron wrote: On Wed, 7 Apr 2021 21:40:05 +0100 Will Deacon wrote: On Wed, Apr 07, 2021 at 05:49:02PM +0800, Qi Liu wrote: PCIe PMU Root Complex Integrated End Point(RCiEP) device is supported to sample bandwidth, latency, buffer occupation etc. Each PMU RCiEP device monitors multiple root ports, and each RCiEP is registered as a pmu in /sys/bus/event_source/devices, so users can select target PMU, and use filter to do further sets. Filtering options contains: event- select the event. subevent - select the subevent. port - select target root ports. Information of root ports are shown under sysfs. bdf - select requester_id of target EP device. trig_len - set trigger condition for starting event statistics. trigger_mode - set trigger mode. 0 means starting to statistic when bigger than trigger condition, and 1 means smaller. thr_len - set threshold for statistics. thr_mode - set threshold mode. 0 means count when bigger than threshold, and 1 means smaller. Reviewed-by: Jonathan Cameron Do you have a link to this review, please? Internal review, so drop the tag. Jonathan Hi Will, Are you implying that you would rather that any review for these drivers is done in public on the lists? Cheers, John
[PATCH v3 6/6] perf vendor events arm64: Add Hisi hip08 L3 metrics
Add L3 metrics. Signed-off-by: John Garry Reviewed-by: Kajol Jain --- .../arch/arm64/hisilicon/hip08/metrics.json | 161 ++ 1 file changed, 161 insertions(+) diff --git a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json index dda898d23c2d..dda8e59149d2 100644 --- a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json +++ b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json @@ -69,4 +69,165 @@ "MetricGroup": "TopDownL2", "MetricName": "memory_bound" }, +{ +"MetricExpr": "(((L2I_TLB - L2I_TLB_REFILL) * 15) + (L2I_TLB_REFILL * 100)) / CPU_CYCLES", +"PublicDescription": "Idle by itlb miss L3 topdown metric", +"BriefDescription": "Idle by itlb miss L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "idle_by_itlb_miss" +}, +{ +"MetricExpr": "(((L2I_CACHE - L2I_CACHE_REFILL) * 15) + (L2I_CACHE_REFILL * 100)) / CPU_CYCLES", +"PublicDescription": "Idle by icache miss L3 topdown metric", +"BriefDescription": "Idle by icache miss L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "idle_by_icache_miss" +}, +{ +"MetricExpr": "(BR_MIS_PRED * 5) / CPU_CYCLES", +"PublicDescription": "BP misp flush L3 topdown metric", +"BriefDescription": "BP misp flush L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "bp_misp_flush" +}, +{ +"MetricExpr": "(armv8_pmuv3_0@event\\=0x2013@ * 5) / CPU_CYCLES", +"PublicDescription": "OOO flush L3 topdown metric", +"BriefDescription": "OOO flush L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "ooo_flush" +}, +{ +"MetricExpr": "(armv8_pmuv3_0@event\\=0x1001@ * 5) / CPU_CYCLES", +"PublicDescription": "Static predictor flush L3 topdown metric", +"BriefDescription": "Static predictor flush L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "sp_flush" +}, +{ +"MetricExpr": "armv8_pmuv3_0@event\\=0x1010@ / BR_MIS_PRED", +"PublicDescription": "Indirect branch L3 topdown metric", 
+"BriefDescription": "Indirect branch L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "indirect_branch" +}, +{ +"MetricExpr": "(armv8_pmuv3_0@event\\=0x1014@ + armv8_pmuv3_0@event\\=0x1018@) / BR_MIS_PRED", +"PublicDescription": "Push branch L3 topdown metric", +"BriefDescription": "Push branch L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "push_branch" +}, +{ +"MetricExpr": "armv8_pmuv3_0@event\\=0x100c@ / BR_MIS_PRED", +"PublicDescription": "Pop branch L3 topdown metric", +"BriefDescription": "Pop branch L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "pop_branch" +}, +{ +"MetricExpr": "(BR_MIS_PRED - armv8_pmuv3_0@event\\=0x1010@ - armv8_pmuv3_0@event\\=0x1014@ - armv8_pmuv3_0@event\\=0x1018@ - armv8_pmuv3_0@event\\=0x100c@) / BR_MIS_PRED", +"PublicDescription": "Other branch L3 topdown metric", +"BriefDescription": "Other branch L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "other_branch" +}, +{ +"MetricExpr": "armv8_pmuv3_0@event\\=0x2012@ / armv8_pmuv3_0@event\\=0x2013@", +"PublicDescription": "Nuke flush L3 topdown metric", +"BriefDescription": "Nuke flush L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "nuke_flush" +}, +{ +"MetricExpr": "1 - nuke_flush", +"PublicDescription": "Other flush L3 topdown metric", +"BriefDescription": &q
[PATCH v3 0/6] perf arm64 metricgroup support
This series contains support to get basic metricgroups working for arm64 CPUs. Initial support is added for HiSilicon hip08 platform. Some sample usage on Huawei D06 board: $ ./perf list metric List of pre-defined events (to be used in -e): Metrics: bp_misp_flush [BP misp flush L3 topdown metric] branch_mispredicts [Branch mispredicts L2 topdown metric] core_bound [Core bound L2 topdown metric] divider [Divider L3 topdown metric] exe_ports_util [EXE ports util L3 topdown metric] fetch_bandwidth_bound [Fetch bandwidth bound L2 topdown metric] fetch_latency_bound [Fetch latency bound L2 topdown metric] fsu_stall [FSU stall L3 topdown metric] idle_by_icache_miss $ sudo ./perf stat -v -M core_bound sleep 1 Using CPUID 0x480fd010 metric expr (exe_stall_cycle - (mem_stall_anyload + armv8_pmuv3_0@event\=0x7005@)) / cpu_cycles for core_bound found event cpu_cycles found event armv8_pmuv3_0/event=0x7005/ found event exe_stall_cycle found event mem_stall_anyload adding {cpu_cycles -> armv8_pmuv3_0/event=0x7001/ mem_stall_anyload -> armv8_pmuv3_0/event=0x7004/ Control descriptor is not initialized cpu_cycles: 989433 385050 385050 armv8_pmuv3_0/event=0x7005/: 19207 385050 385050 exe_stall_cycle: 900825 385050 385050 mem_stall_anyload: 253516 385050 385050 Performance counter stats for 'sleep': 989,433 cpu_cycles # 0.63 core_bound 19,207 armv8_pmuv3_0/event=0x7005/ 900,825 exe_stall_cycle 253,516 mem_stall_anyload 0.000805809 seconds time elapsed 0.000875000 seconds user 0.0 seconds sys perf stat --topdown is not supported, as this requires the CPU PMU to expose (alias) events for the TopDown L1 metrics from sysfs, which arm does not do. To get that to work, we probably need to make perf use the pmu-events cpumap to learn about those alias events. Metric reuse support is added for pmu-events parse metric testcase. 
This had been broken on power9 recently: https://lore.kernel.org/lkml/20210324015418.gc8...@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com/ Differences to v2: - Add TB and RB tags (Thanks!) - Rename metricgroup__find_metric() from metricgroup_find_metric() - Change resolve_metric_simple() to rescan after any insert Differences to v1: - Add pmu_events_map__find() as arm64-specific function - Fix metric reuse for pmu-events parse metric testcase John Garry (6): perf metricgroup: Make find_metric() public with name change perf test: Handle metric reuse in pmu-events parsing test perf pmu: Add pmu_events_map__find() perf vendor events arm64: Add Hisi hip08 L1 metrics perf vendor events arm64: Add Hisi hip08 L2 metrics perf vendor events arm64: Add Hisi hip08 L3 metrics tools/perf/arch/arm64/util/Build | 1 + tools/perf/arch/arm64/util/pmu.c | 25 ++ .../arch/arm64/hisilicon/hip08/metrics.json | 233 ++ tools/perf/tests/pmu-events.c | 83 ++- tools/perf/util/metricgroup.c | 12 +- tools/perf/util/metricgroup.h | 3 +- tools/perf/util/pmu.c | 5 + tools/perf/util/pmu.h | 1 + tools/perf/util/s390-sample-raw.c | 4 +- 9 files changed, 356 insertions(+), 11 deletions(-) create mode 100644 tools/perf/arch/arm64/util/pmu.c create mode 100644 tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json -- 2.26.2
[PATCH v3 5/6] perf vendor events arm64: Add Hisi hip08 L2 metrics
Add L2 metrics. Signed-off-by: John Garry Reviewed-by: Kajol Jain --- .../arch/arm64/hisilicon/hip08/metrics.json | 42 +++ 1 file changed, 42 insertions(+) diff --git a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json index dc5ff3051639..dda898d23c2d 100644 --- a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json +++ b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json @@ -27,4 +27,46 @@ "MetricGroup": "TopDownL1", "MetricName": "backend_bound" }, +{ +"MetricExpr": "armv8_pmuv3_0@event\\=0x201d@ / CPU_CYCLES", +"PublicDescription": "Fetch latency bound L2 topdown metric", +"BriefDescription": "Fetch latency bound L2 topdown metric", +"MetricGroup": "TopDownL2", +"MetricName": "fetch_latency_bound" +}, +{ +"MetricExpr": "frontend_bound - fetch_latency_bound", +"PublicDescription": "Fetch bandwidth bound L2 topdown metric", +"BriefDescription": "Fetch bandwidth bound L2 topdown metric", +"MetricGroup": "TopDownL2", +"MetricName": "fetch_bandwidth_bound" +}, +{ +"MetricExpr": "(bad_speculation * BR_MIS_PRED) / (BR_MIS_PRED + armv8_pmuv3_0@event\\=0x2013@)", +"PublicDescription": "Branch mispredicts L2 topdown metric", +"BriefDescription": "Branch mispredicts L2 topdown metric", +"MetricGroup": "TopDownL2", +"MetricName": "branch_mispredicts" +}, +{ +"MetricExpr": "bad_speculation - branch_mispredicts", +"PublicDescription": "Machine clears L2 topdown metric", +"BriefDescription": "Machine clears L2 topdown metric", +"MetricGroup": "TopDownL2", +"MetricName": "machine_clears" +}, +{ +"MetricExpr": "(EXE_STALL_CYCLE - (MEM_STALL_ANYLOAD + armv8_pmuv3_0@event\\=0x7005@)) / CPU_CYCLES", +"PublicDescription": "Core bound L2 topdown metric", +"BriefDescription": "Core bound L2 topdown metric", +"MetricGroup": "TopDownL2", +"MetricName": "core_bound" +}, +{ +"MetricExpr": "(MEM_STALL_ANYLOAD + armv8_pmuv3_0@event\\=0x7005@) / CPU_CYCLES", +"PublicDescription": "Memory bound L2 
topdown metric", +"BriefDescription": "Memory bound L2 topdown metric", +"MetricGroup": "TopDownL2", +"MetricName": "memory_bound" +}, ] -- 2.26.2
[PATCH v3 4/6] perf vendor events arm64: Add Hisi hip08 L1 metrics
Add L1 metrics. Formula is as consistent as possible with MAN pages description for these metrics. Signed-off-by: John Garry Reviewed-by: Kajol Jain --- .../arch/arm64/hisilicon/hip08/metrics.json | 30 +++ 1 file changed, 30 insertions(+) create mode 100644 tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json diff --git a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json new file mode 100644 index ..dc5ff3051639 --- /dev/null +++ b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json @@ -0,0 +1,30 @@ +[ +{ +"MetricExpr": "FETCH_BUBBLE / (4 * CPU_CYCLES)", +"PublicDescription": "Frontend bound L1 topdown metric", +"BriefDescription": "Frontend bound L1 topdown metric", +"MetricGroup": "TopDownL1", +"MetricName": "frontend_bound" +}, +{ +"MetricExpr": "(INST_SPEC - INST_RETIRED) / (4 * CPU_CYCLES)", +"PublicDescription": "Bad Speculation L1 topdown metric", +"BriefDescription": "Bad Speculation L1 topdown metric", +"MetricGroup": "TopDownL1", +"MetricName": "bad_speculation" +}, +{ +"MetricExpr": "INST_RETIRED / (CPU_CYCLES * 4)", +"PublicDescription": "Retiring L1 topdown metric", +"BriefDescription": "Retiring L1 topdown metric", +"MetricGroup": "TopDownL1", +"MetricName": "retiring" +}, +{ +"MetricExpr": "1 - (frontend_bound + bad_speculation + retiring)", +"PublicDescription": "Backend Bound L1 topdown metric", +"BriefDescription": "Backend Bound L1 topdown metric", +"MetricGroup": "TopDownL1", +"MetricName": "backend_bound" +}, +] -- 2.26.2
[PATCH v3 3/6] perf pmu: Add pmu_events_map__find()
Add a function to find the common PMU map for the system. For arm64, a special variant is added. This is because arm64 supports heterogeneous CPU systems. As such, it cannot be guaranteed that the cpumap is same for all CPUs. So in case of heterogeneous systems, don't return a cpumap. Tested-by: Paul A. Clarke Signed-off-by: John Garry Reviewed-by: Kajol Jain --- tools/perf/arch/arm64/util/Build | 1 + tools/perf/arch/arm64/util/pmu.c | 25 + tools/perf/tests/pmu-events.c | 2 +- tools/perf/util/metricgroup.c | 7 +++ tools/perf/util/pmu.c | 5 + tools/perf/util/pmu.h | 1 + tools/perf/util/s390-sample-raw.c | 4 +--- 7 files changed, 37 insertions(+), 8 deletions(-) create mode 100644 tools/perf/arch/arm64/util/pmu.c diff --git a/tools/perf/arch/arm64/util/Build b/tools/perf/arch/arm64/util/Build index ead2f2275eee..9fcb4e68add9 100644 --- a/tools/perf/arch/arm64/util/Build +++ b/tools/perf/arch/arm64/util/Build @@ -2,6 +2,7 @@ perf-y += header.o perf-y += machine.o perf-y += perf_regs.o perf-y += tsc.o +perf-y += pmu.o perf-y += kvm-stat.o perf-$(CONFIG_DWARF) += dwarf-regs.o perf-$(CONFIG_LOCAL_LIBUNWIND) += unwind-libunwind.o diff --git a/tools/perf/arch/arm64/util/pmu.c b/tools/perf/arch/arm64/util/pmu.c new file mode 100644 index ..d3259d61ca75 --- /dev/null +++ b/tools/perf/arch/arm64/util/pmu.c @@ -0,0 +1,25 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include "../../util/cpumap.h" +#include "../../util/pmu.h" + +struct pmu_events_map *pmu_events_map__find(void) +{ + struct perf_pmu *pmu = NULL; + + while ((pmu = perf_pmu__scan(pmu))) { + if (!is_pmu_core(pmu->name)) + continue; + + /* +* The cpumap should cover all CPUs. Otherwise, some CPUs may +* not support some events or have different event IDs. 
+*/ + if (pmu->cpus->nr != cpu__max_cpu()) + return NULL; + + return perf_pmu__find_map(pmu); + } + + return NULL; +} diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c index cb5b25d2fb27..b8aff8fb50d8 100644 --- a/tools/perf/tests/pmu-events.c +++ b/tools/perf/tests/pmu-events.c @@ -539,7 +539,7 @@ static int resolve_metric_simple(struct expr_parse_ctx *pctx, static int test_parsing(void) { - struct pmu_events_map *cpus_map = perf_pmu__find_map(NULL); + struct pmu_events_map *cpus_map = pmu_events_map__find(); struct pmu_events_map *map; struct pmu_event *pe; int i, j, k; diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c index 37fe34a5d93d..8336dd8e8098 100644 --- a/tools/perf/util/metricgroup.c +++ b/tools/perf/util/metricgroup.c @@ -618,7 +618,7 @@ static int metricgroup__print_sys_event_iter(struct pmu_event *pe, void *data) void metricgroup__print(bool metrics, bool metricgroups, char *filter, bool raw, bool details) { - struct pmu_events_map *map = perf_pmu__find_map(NULL); + struct pmu_events_map *map = pmu_events_map__find(); struct pmu_event *pe; int i; struct rblist groups; @@ -1254,8 +1254,7 @@ int metricgroup__parse_groups(const struct option *opt, struct rblist *metric_events) { struct evlist *perf_evlist = *(struct evlist **)opt->value; - struct pmu_events_map *map = perf_pmu__find_map(NULL); - + struct pmu_events_map *map = pmu_events_map__find(); return parse_groups(perf_evlist, str, metric_no_group, metric_no_merge, NULL, metric_events, map); @@ -1274,7 +1273,7 @@ int metricgroup__parse_groups_test(struct evlist *evlist, bool metricgroup__has_metric(const char *metric) { - struct pmu_events_map *map = perf_pmu__find_map(NULL); + struct pmu_events_map *map = pmu_events_map__find(); struct pmu_event *pe; int i; diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c index 88da5cf6aee8..419ef6c4fbc0 100644 --- a/tools/perf/util/pmu.c +++ b/tools/perf/util/pmu.c @@ -717,6 +717,11 @@ struct 
pmu_events_map *perf_pmu__find_map(struct perf_pmu *pmu) return map; } +struct pmu_events_map *__weak pmu_events_map__find(void) +{ + return perf_pmu__find_map(NULL); +} + bool pmu_uncore_alias_match(const char *pmu_name, const char *name) { char *tmp = NULL, *tok, *str; diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h index 8164388478c6..012317229488 100644 --- a/tools/perf/util/pmu.h +++ b/tools/perf/util/pmu.h @@ -114,6 +114,7 @@ void pmu_add_cpu_aliases_map(struct list_head *head, struct perf_pmu *pmu, struct pmu_events_map *map); struct pmu_events_map *perf_pmu__find_map(struct perf_pmu *pmu); +struct pmu_events_map *pmu_events_map__find(void); bool p
[PATCH v3 1/6] perf metricgroup: Make find_metric() public with name change
Function find_metric() is required for the metric processing in the pmu-events testcase, so make it public. Also change the name to include "metricgroup". Tested-by: Paul A. Clarke Signed-off-by: John Garry Reviewed-by: Kajol Jain --- tools/perf/util/metricgroup.c | 5 +++-- tools/perf/util/metricgroup.h | 3 ++- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c index 6acb44ad439b..37fe34a5d93d 100644 --- a/tools/perf/util/metricgroup.c +++ b/tools/perf/util/metricgroup.c @@ -900,7 +900,8 @@ static int __add_metric(struct list_head *metric_list, (match_metric(__pe->metric_group, __metric) || \ match_metric(__pe->metric_name, __metric))) -static struct pmu_event *find_metric(const char *metric, struct pmu_events_map *map) +struct pmu_event *metricgroup__find_metric(const char *metric, + struct pmu_events_map *map) { struct pmu_event *pe; int i; @@ -985,7 +986,7 @@ static int __resolve_metric(struct metric *m, struct expr_id *parent; struct pmu_event *pe; - pe = find_metric(cur->key, map); + pe = metricgroup__find_metric(cur->key, map); if (!pe) continue; diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h index ed1b9392e624..1424e7c1af77 100644 --- a/tools/perf/util/metricgroup.h +++ b/tools/perf/util/metricgroup.h @@ -44,7 +44,8 @@ int metricgroup__parse_groups(const struct option *opt, bool metric_no_group, bool metric_no_merge, struct rblist *metric_events); - +struct pmu_event *metricgroup__find_metric(const char *metric, + struct pmu_events_map *map); int metricgroup__parse_groups_test(struct evlist *evlist, struct pmu_events_map *map, const char *str, -- 2.26.2
[PATCH v3 2/6] perf test: Handle metric reuse in pmu-events parsing test
The pmu-events parsing test does not handle metric reuse at all. Introduce some simple handling to resolve metrics who reference other metrics.

Tested-by: Paul A. Clarke
Signed-off-by: John Garry
Reviewed-by: Kajol Jain
---
 tools/perf/tests/pmu-events.c | 81 +++
 1 file changed, 81 insertions(+)

diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c
index 0ca6a5a53523..cb5b25d2fb27 100644
--- a/tools/perf/tests/pmu-events.c
+++ b/tools/perf/tests/pmu-events.c
@@ -12,6 +12,7 @@
 #include "util/evlist.h"
 #include "util/expr.h"
 #include "util/parse-events.h"
+#include "metricgroup.h"

 struct perf_pmu_test_event {
 	/* used for matching against events from generated pmu-events.c */
@@ -471,6 +472,71 @@ static void expr_failure(const char *msg,
 	pr_debug("On expression %s\n", pe->metric_expr);
 }

+struct metric {
+	struct list_head list;
+	struct metric_ref metric_ref;
+};
+
+static int resolve_metric_simple(struct expr_parse_ctx *pctx,
+				 struct list_head *compound_list,
+				 struct pmu_events_map *map,
+				 const char *metric_name)
+{
+	struct hashmap_entry *cur, *cur_tmp;
+	struct metric *metric, *tmp;
+	size_t bkt;
+	bool all;
+	int rc;
+
+	do {
+		all = true;
+		hashmap__for_each_entry_safe((&pctx->ids), cur, cur_tmp, bkt) {
+			struct metric_ref *ref;
+			struct pmu_event *pe;
+
+			pe = metricgroup__find_metric(cur->key, map);
+			if (!pe)
+				continue;
+
+			if (!strcmp(metric_name, (char *)cur->key)) {
+				pr_warning("Recursion detected for metric %s\n", metric_name);
+				rc = -1;
+				goto out_err;
+			}
+
+			all = false;
+
+			/* The metric key itself needs to go out.. */
+			expr__del_id(pctx, cur->key);
+
+			metric = malloc(sizeof(*metric));
+			if (!metric) {
+				rc = -ENOMEM;
+				goto out_err;
+			}
+
+			ref = &metric->metric_ref;
+			ref->metric_name = pe->metric_name;
+			ref->metric_expr = pe->metric_expr;
+			list_add_tail(&metric->list, compound_list);
+
+			rc = expr__find_other(pe->metric_expr, NULL, pctx, 0);
+			if (rc)
+				goto out_err;
+			break; /* The hashmap has been modified, so restart */
+		}
+	} while (!all);
+
+	return 0;
+
+out_err:
+	list_for_each_entry_safe(metric, tmp, compound_list, list)
+		free(metric);
+
+	return rc;
+
+}
+
 static int test_parsing(void)
 {
 	struct pmu_events_map *cpus_map = perf_pmu__find_map(NULL);
@@ -488,7 +554,9 @@ static int test_parsing(void)
 			break;
 		j = 0;
 		for (;;) {
+			struct metric *metric, *tmp;
 			struct hashmap_entry *cur;
+			LIST_HEAD(compound_list);
 			size_t bkt;

 			pe = &map->table[j++];
@@ -504,6 +572,13 @@ static int test_parsing(void)
 				continue;
 			}

+			if (resolve_metric_simple(&ctx, &compound_list, map,
+						  pe->metric_name)) {
+				expr_failure("Could not resolve metrics", map, pe);
+				ret++;
+				goto exit; /* Don't tolerate errors due to severity */
+			}
+
 			/*
 			 * Add all ids with a made up value. The value may
 			 * trigger divide by zero when subtracted and so try to
@@ -519,6 +594,11 @@ static int test_parsing(void)
 				ret++;
 			}

+			list_for_each_entry_safe(metric, tmp, &compound_list, list) {
+				expr__add_ref(&ctx, &metric->metric_ref);
+				free(metric);
+			}
+
 			if (expr__parse(&result, &ctx, pe->metric_expr, 0)) {
 				expr_failure("Parse failed", map, pe);
 				ret++;
@@ -527,6 +607,7 @@ static int test_parsing(void)
 		}
 	}
 	/* TODO: fail when not ok */
+exit:
 	return ret == 0 ? TEST_OK : TEST_SKIP;
 }
--
2.26.2
Re: [PATCH 0/3] iommu/iova: Add CPU hotplug handler to flush rcaches to core code
On 07/04/2021 09:04, Joerg Roedel wrote:

On Mon, Mar 01, 2021 at 08:12:18PM +0800, John Garry wrote:

The Intel IOMMU driver supports flushing the per-CPU rcaches when a CPU is offlined. Let's move it to core code, so everyone can take advantage. Also correct a code comment.

Based on v5.12-rc1. Tested on arm64 only.

John Garry (3):
  iova: Add CPU hotplug handler to flush rcaches
  iommu/vt-d: Remove IOVA domain rcache flushing for CPU offlining
  iova: Correct comment for free_cpu_cached_iovas()

 drivers/iommu/intel/iommu.c | 31 ---
 drivers/iommu/iova.c        | 32 ++--
 include/linux/cpuhotplug.h  |  2 +-
 include/linux/iova.h        |  1 +
 4 files changed, 32 insertions(+), 34 deletions(-)

Applied, thanks.

Thanks, but there was a v2 on this series. Not sure which you applied.

https://lore.kernel.org/linux-iommu/9aad6e94-ecb7-ca34-7f7d-3df6e43e9...@huawei.com/T/#mbea81468782c75fa84744ad7a7801831a4c952e9
Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator
So then we have the issue of how to dynamically increase this rcache threshold. The problem is that we may have many devices associated with the same domain. So, in theory, we can't assume that when we increase the threshold that some other device will try to fast free an IOVA which was allocated prior to the increase and was not rounded up. I'm very open to better (or less bad) suggestions on how to do this ... ...but yes, regardless of exactly where it happens, rounding up or not is the problem for rcaches in general. I've said several times that my preferred approach is to not change it that dynamically at all, but instead treat it more like we treat the default domain type. Can you remind me of that idea? I don't remember you mentioning using default domain handling as a reference in any context. Hi Robin, Sorry if the phrasing was unclear there - the allusion to default domains is new, it just occurred to me that what we do there is in fact fairly close to what I've suggested previously for this. In that case, we have a global policy set by the command line, which *can* be overridden per-domain via sysfs at runtime, provided the user is willing to tear the whole thing down. Using a similar approach here would give a fair degree of flexibility but still mean that changes never have to be made dynamically to a live domain. So are you saying that we can handle it similar to how we now can handle changing default domain for an IOMMU group via sysfs? If so, that just is not practical here. Reason being that this particular DMA engine provides the block device giving / mount point, so if we unbind the driver, we lose / mount point. And I am not sure if the end user would even know how to set such a tunable. Or, in this case, why the end user would not want the optimized range configured always. I'd still rather if the device driver could provide info which can be used to configure this before or during probing. Cheers, John
Re: [bug report] Memory leak from acpi_ev_install_space_handler()
On 06/04/2021 17:40, Rafael J. Wysocki wrote: On Tue, Apr 6, 2021 at 5:51 PM John Garry wrote: Hi guys, On next-20210406, I enabled CONFIG_DEBUG_KMEMLEAK and CONFIG_DEBUG_TEST_DRIVER_REMOVE for my arm64 system, and see this: Hi Rafael, Why exactly do you think that acpi_ev_install_space_handler() leaks memory? I don't think that acpi_ev_install_space_handler() itself leaks memory, but it seems that there is something missing in the code which is meant to undo/clean up after that on the uninstall path - I did make the point in writing "memory leak from", but maybe still not worded clearly. Anyway, I have not analyzed the problem fully - I'm just reporting. I don't mind looking further if requested. Thanks, John root@debian:/home/john# more /sys/kernel/debug/kmemleak unreferenced object 0x202803c11f00 (size 128): comm "swapper/0", pid 1, jiffies 4294894325 (age 337.524s) hex dump (first 32 bytes): 00 00 00 00 02 00 00 00 08 1f c1 03 28 20 ff ff( .. 08 1f c1 03 28 20 ff ff 00 00 00 00 00 00 00 00( .. backtrace: [<670a0938>] slab_post_alloc_hook+0x9c/0x2f8 [<a3f47b39>] kmem_cache_alloc+0x198/0x2a8 [<2bdba864>] acpi_os_create_semaphore+0x54/0xe0 [<bcd513fe>] acpi_ev_install_space_handler+0x24c/0x300 [<02e116e2>] acpi_install_address_space_handler+0x64/0xb0 [<ba00abc5>] i2c_acpi_install_space_handler+0xd4/0x138 [<8da42058>] i2c_register_adapter+0x368/0x910 [<c03f7142>] i2c_add_adapter+0x9c/0x100 [<00ba2fcf>] i2c_add_numbered_adapter+0x44/0x58 [<7df22d67>] i2c_dw_probe_master+0x68c/0x900 [<682dfc98>] dw_i2c_plat_probe+0x460/0x640 [<ad2dd3ee>] platform_probe+0x8c/0x108 [<dd183e3f>] really_probe+0x190/0x670 [<66017341>] driver_probe_device+0x8c/0xf8 [<c441e843>] device_driver_attach+0x9c/0xa8 [<f91dc709>] __driver_attach+0x88/0x138 unreferenced object 0x00280452c100 (size 128): comm "swapper/0", pid 1, jiffies 4294894558 (age 336.604s) hex dump (first 32 bytes): 00 00 00 00 02 00 00 00 08 c1 52 04 28 00 ff ff..R.(... 08 c1 52 04 28 00 ff ff 00 00 00 00 00 00 00 00..R.(... 
backtrace: [<670a0938>] slab_post_alloc_hook+0x9c/0x2f8 [<a3f47b39>] kmem_cache_alloc+0x198/0x2a8 [<2bdba864>] acpi_os_create_semaphore+0x54/0xe0 [<bcd513fe>] acpi_ev_install_space_handler+0x24c/0x300 [<02e116e2>] acpi_install_address_space_handler+0x64/0xb0 [<988d4f61>] acpi_gpiochip_add+0x20c/0x4a0 [<73d4faab>] gpiochip_add_data_with_key+0xd10/0x1680 [<1d50b98a>] devm_gpiochip_add_data_with_key+0x30/0x78 [<fc3e7eaf>] dwapb_gpio_probe+0x828/0xb28 [<ad2dd3ee>] platform_probe+0x8c/0x108 [<dd183e3f>] really_probe+0x190/0x670 [<66017341>] driver_probe_device+0x8c/0xf8 [<c441e843>] device_driver_attach+0x9c/0xa8 [<f91dc709>] __driver_attach+0x88/0x138 [<d330caed>] bus_for_each_dev+0xec/0x160 [<eebc5f04>] driver_attach+0x34/0x48 root@debian:/home/john# Thanks, John .
[bug report] Memory leak from acpi_ev_install_space_handler()
Hi guys,

On next-20210406, I enabled CONFIG_DEBUG_KMEMLEAK and CONFIG_DEBUG_TEST_DRIVER_REMOVE for my arm64 system, and see this:

root@debian:/home/john# more /sys/kernel/debug/kmemleak
unreferenced object 0x202803c11f00 (size 128):
  comm "swapper/0", pid 1, jiffies 4294894325 (age 337.524s)
  hex dump (first 32 bytes):
    00 00 00 00 02 00 00 00 08 1f c1 03 28 20 ff ff  ( ..
    08 1f c1 03 28 20 ff ff 00 00 00 00 00 00 00 00  ( ..
  backtrace:
    [<670a0938>] slab_post_alloc_hook+0x9c/0x2f8
    [] kmem_cache_alloc+0x198/0x2a8
    [<2bdba864>] acpi_os_create_semaphore+0x54/0xe0
    [ ] acpi_ev_install_space_handler+0x24c/0x300
    [<02e116e2>] acpi_install_address_space_handler+0x64/0xb0
    [ ] i2c_acpi_install_space_handler+0xd4/0x138
    [<8da42058>] i2c_register_adapter+0x368/0x910
    [ ] i2c_add_adapter+0x9c/0x100
    [<00ba2fcf>] i2c_add_numbered_adapter+0x44/0x58
    [<7df22d67>] i2c_dw_probe_master+0x68c/0x900
    [<682dfc98>] dw_i2c_plat_probe+0x460/0x640
    [ ] platform_probe+0x8c/0x108
    [ ] really_probe+0x190/0x670
    [<66017341>] driver_probe_device+0x8c/0xf8
    [ ] device_driver_attach+0x9c/0xa8
    [ ] __driver_attach+0x88/0x138
unreferenced object 0x00280452c100 (size 128):
  comm "swapper/0", pid 1, jiffies 4294894558 (age 336.604s)
  hex dump (first 32 bytes):
    00 00 00 00 02 00 00 00 08 c1 52 04 28 00 ff ff  ..R.(...
    08 c1 52 04 28 00 ff ff 00 00 00 00 00 00 00 00  ..R.(...
  backtrace:
    [<670a0938>] slab_post_alloc_hook+0x9c/0x2f8
    [ ] kmem_cache_alloc+0x198/0x2a8
    [<2bdba864>] acpi_os_create_semaphore+0x54/0xe0
    [ ] acpi_ev_install_space_handler+0x24c/0x300
    [<02e116e2>] acpi_install_address_space_handler+0x64/0xb0
    [<988d4f61>] acpi_gpiochip_add+0x20c/0x4a0
    [<73d4faab>] gpiochip_add_data_with_key+0xd10/0x1680
    [<1d50b98a>] devm_gpiochip_add_data_with_key+0x30/0x78
    [ ] dwapb_gpio_probe+0x828/0xb28
    [ ] platform_probe+0x8c/0x108
    [ ] really_probe+0x190/0x670
    [<66017341>] driver_probe_device+0x8c/0xf8
    [ ] device_driver_attach+0x9c/0xa8
    [ ] __driver_attach+0x88/0x138
    [ ] bus_for_each_dev+0xec/0x160
    [ ] driver_attach+0x34/0x48
root@debian:/home/john#

Thanks,
John
Re: [PATCH v2 2/6] perf test: Handle metric reuse in pmu-events parsing test
On 06/04/2021 14:34, Jiri Olsa wrote: } So once we evaluate a pmu_event in pctx->ids in @pe, @all is set false, and we would loop again in the do-while loop, regardless of what expr__find_other() does (apart from erroring), and so call hashmap__for_each_entry_safe(>ids, ) again. ah ok, so it finishes the hash iteration first and then restarts it.. ok, I missed that, then it's fine >> This is really what is done in __resolve_metric() - indeed, I would use that function directly, but it looks hard to extract that from metricgroup.c . yea, it's another world;-) it's better to keep it separated ok, so I'll still add the break statement, as suggested. I'll also wait to see if Ian or you have a strong feeling about the function naming in patch 1/6. Thanks, John
Re: [PATCH v2 2/6] perf test: Handle metric reuse in pmu-events parsing test
On 06/04/2021 13:55, Jiri Olsa wrote: So expr__find_other() may add a new item to pctx->ids, and we always iterate again, and try to lookup any pmu_events, *, above. If none exist, then we hm, I don't see that.. so, what you do is: hashmap__for_each_entry_safe((>ids) ) { rc = expr__find_other(pe->metric_expr, NULL, pctx, 0); } and what I think we need to do is: hashmap__for_each_entry_safe((>ids) ) { rc = expr__find_other(pe->metric_expr, NULL, pctx, 0); break; } each time you resolve another metric, you need to restart the pctx->ids iteration, because there will be new items, and we are in the middle of it Sure, but we will restart anyway. hum, where? you call expr__find_other and continue to next pctx->ids item We have: resolve_metric_simple() { bool all; do { all = true; hashmap__for_each_entry_safe(>ids, ...) { pe = metricgroup_find_metric(cur->key, map); if (!pe) continue; ... all = false; expr_del_id(pctx, cur->key); ... rc = expr__find_other(pe->metric_expr, pctx); if (rc) goto out_err; } } while (!all); } So once we evaluate a pmu_event in pctx->ids in @pe, @all is set false, and we would loop again in the do-while loop, regardless of what expr__find_other() does (apart from erroring), and so call hashmap__for_each_entry_safe(>ids, ) again. This is really what is done in __resolve_metric() - indeed, I would use that function directly, but it looks hard to extract that from metricgroup.c . Thanks, John Regardless of this, I don't think what I am doing is safe, i.e. adding new items in the middle of the iter, so I will change in the way you suggest. it'll always add items in the middle of the iteration
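[Editor's note] The restart semantics being discussed above can be sketched outside perf. The toy below uses made-up types and names (not the real expr_parse_ctx/hashmap code): it resolves metric references down to primitive events, and whenever the id set is mutated mid-scan it breaks out and lets the do-while restart the scan, which is the break-plus-restart shape suggested in the thread. Unlike the real patch, it has no recursion detection.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_IDS 16

struct ctx {
	const char *ids[MAX_IDS];
	int nr;
};

struct metric_def {
	const char *name;
	const char *refs[4];	/* ids referenced by the metric's expression */
};

/* Toy metric table (hypothetical names): test1 reuses two other metrics. */
static const struct metric_def metrics[] = {
	{ "Frontend_Bound", { "IDQ_UOPS_NOT_DELIVERED.CORE", "CPU_CLK_UNHALTED.THREAD", NULL } },
	{ "Bad_Speculation", { "UOPS_ISSUED.ANY", "UOPS_RETIRED.RETIRE_SLOTS", NULL } },
	{ "test1", { "Bad_Speculation", "Frontend_Bound", NULL } },
};

static const struct metric_def *find_metric(const char *id)
{
	for (unsigned int i = 0; i < sizeof(metrics) / sizeof(metrics[0]); i++)
		if (!strcmp(metrics[i].name, id))
			return &metrics[i];
	return NULL;
}

static bool ctx_has(const struct ctx *c, const char *id)
{
	for (int i = 0; i < c->nr; i++)
		if (!strcmp(c->ids[i], id))
			return true;
	return false;
}

static void ctx_add(struct ctx *c, const char *id)
{
	if (c->nr < MAX_IDS && !ctx_has(c, id))
		c->ids[c->nr++] = id;
}

/*
 * Replace metric references with the ids from their expressions until only
 * primitive events remain. Deleting and adding entries invalidates the scan
 * in progress, so break out and let the do-while restart it from scratch.
 */
static void resolve(struct ctx *c)
{
	bool all;

	do {
		all = true;
		for (int i = 0; i < c->nr; i++) {
			const struct metric_def *m = find_metric(c->ids[i]);

			if (!m)
				continue;
			all = false;
			c->ids[i] = c->ids[--c->nr];	/* delete the metric id */
			for (int j = 0; m->refs[j]; j++)
				ctx_add(c, m->refs[j]);
			break;	/* restart: the set changed under us */
		}
	} while (!all);
}
```

Starting from just "test1", resolve() terminates with the four primitive event ids and no metric names left, mirroring what resolve_metric_simple() is expected to leave in pctx->ids.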
Re: [PATCH v2 2/6] perf test: Handle metric reuse in pmu-events parsing test
On 06/04/2021 13:17, Jiri Olsa wrote: + ref = >metric_ref; + ref->metric_name = pe->metric_name; + ref->metric_expr = pe->metric_expr; + list_add_tail(>list, compound_list); + + rc = expr__find_other(pe->metric_expr, NULL, pctx, 0); Hi Jirka, so this might add new items to pctx->ids, I think you need to restart the iteration as we do it in __resolve_metric otherwise you could miss some new keys I thought that I was doing this. Indeed, this code is very much like __resolve_metric();) So expr__find_other() may add a new item to pctx->ids, and we always iterate again, and try to lookup any pmu_events, *, above. If none exist, then we hm, I don't see that.. so, what you do is: hashmap__for_each_entry_safe((>ids) ) { rc = expr__find_other(pe->metric_expr, NULL, pctx, 0); } and what I think we need to do is: hashmap__for_each_entry_safe((>ids) ) { rc = expr__find_other(pe->metric_expr, NULL, pctx, 0); break; } each time you resolve another metric, you need to restart the pctx->ids iteration, because there will be new items, and we are in the middle of it Sure, but we will restart anyway. Regardless of this, I don't think what I am doing is safe, i.e. adding new items in the middle of the iter, so I will change in the way you suggest. Thanks, John
Re: [PATCH v2 0/6] perf arm64 metricgroup support
On 30/03/2021 07:41, kajoljain wrote: On 3/30/21 2:37 AM, Paul A. Clarke wrote: On Fri, Mar 26, 2021 at 10:57:40AM +, John Garry wrote: On 25/03/2021 20:39, Paul A. Clarke wrote: On Thu, Mar 25, 2021 at 06:33:12PM +0800, John Garry wrote: Metric reuse support is added for pmu-events parse metric testcase. This had been broken on power9 recentlty: https://lore.kernel.org/lkml/20210324015418.gc8...@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com/ Much better. Before: -- $ perf test -v 10 2>&1 | grep -i error | wc -l 112 -- After: -- $ perf test -v 10 2>&1 | grep -i error | wc -l 17 -- And these seem like different types of issues: -- $ perf test -v 10 2>&1 | grep -i error Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_powerbus0_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' 
help '(null)' Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help '(null)' -- This looks suspicious. Firstly, does /sys/bus/event_source/devices/nest_mcs01_imc (or others, above) exist on your system? I guess not. Secondly, checking Documentation/powerpc/imc.rst, we have examples of: nest_mcs01/PM_MCS01_64B_R... So is the PMU name correct in the metric file for nest_mcs01_imc? Looking at the kernel driver file, arch/powerpc/perf/imc-pmu.c, it seems to be correct. Not sure. I ran with a newer kernel, and the above errors disappeared, replaced with about 10 of: -- Error string 'Cannot find PMU `hv_24x7'. Missing kernel support?' help '(null)' -- ...but I was running without a hypervisor, so I tried the same kernel on a PowerVM-virtualized system and the "hv_24x7" messages went away, but the "nest" messages returned. This may all be expected behavior... I confess I haven't followed these new perf capabilities closely. Hi Paul/John, This is something expected. For nest-imc we need bare-metal system and for hv-24x7 we need VM environment. Since you are checking this test in VM machine, there nest events are not supported and hence we are getting this error. Thanks, Kajol Jain Cool, so I hope that tested-by or similar can be provided :) [obviously pending any changes that come from reviews] Thanks It's extremely likely that none of these errors has anything to do with your changes. :- PC .
Re: [PATCH v2 2/6] perf test: Handle metric reuse in pmu-events parsing test
On 01/04/2021 14:49, Jiri Olsa wrote: On Thu, Mar 25, 2021 at 06:33:14PM +0800, John Garry wrote: SNIP +struct metric { + struct list_head list; + struct metric_ref metric_ref; +}; + +static int resolve_metric_simple(struct expr_parse_ctx *pctx, +struct list_head *compound_list, +struct pmu_events_map *map, +const char *metric_name) +{ + struct hashmap_entry *cur, *cur_tmp; + struct metric *metric, *tmp; + size_t bkt; + bool all; + int rc; + + do { + all = true; + hashmap__for_each_entry_safe((>ids), cur, cur_tmp, bkt) { + struct metric_ref *ref; + struct pmu_event *pe; + + pe = metrcgroup_find_metric(cur->key, map); * + if (!pe) + continue; + + if (!strcmp(metric_name, (char *)cur->key)) { + pr_warning("Recursion detected for metric %s\n", metric_name); + rc = -1; + goto out_err; + } + + all = false; + + /* The metric key itself needs to go out.. */ + expr__del_id(pctx, cur->key); + + metric = malloc(sizeof(*metric)); + if (!metric) { + rc = -ENOMEM; + goto out_err; + } + + ref = >metric_ref; + ref->metric_name = pe->metric_name; + ref->metric_expr = pe->metric_expr; + list_add_tail(>list, compound_list); + + rc = expr__find_other(pe->metric_expr, NULL, pctx, 0); Hi Jirka, so this might add new items to pctx->ids, I think you need to restart the iteration as we do it in __resolve_metric otherwise you could miss some new keys I thought that I was doing this. Indeed, this code is very much like __resolve_metric() ;) So expr__find_other() may add a new item to pctx->ids, and we always iterate again, and try to lookup any pmu_events, *, above. If none exist, then we have broken down pctx into primitive events aliases and unresolvable metrics, and stop iterating. And then unresolvable metrics would be found in check_parse_cpu(). 
As an example, we can deal with metric test1, below, which references 2x other metrics: { "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / (4 * (( ( CPU_CLK_UNHALTED.THREAD / 2 ) * ( 1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK ) )))", "MetricName": "Frontend_Bound", }, { "MetricExpr": "( UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES ) / (4 * cycles)", "MetricName": "Bad_Speculation", }, { "MetricExpr": "Bad_Speculation + Frontend_Bound", "MetricName": "test1", }, Does that satisfy your concern, or have I missed something? Thanks, John jirka + if (rc) + goto out_err; + } + } while (!all); + + return 0; + +out_err: + list_for_each_entry_safe(metric, tmp, compound_list, list) + free(metric); + + return rc; + +} SNIP .
Re: [PATCH v2 1/6] perf metricgroup: Make find_metric() public with name change
On 02/04/2021 00:16, Ian Rogers wrote: On Thu, Mar 25, 2021 at 3:38 AM John Garry wrote: Function find_metric() is required for the metric processing in the pmu-events testcase, so make it public. Also change the name to include "metricgroup". Would it make more sense as "pmu_events_map__find_metric" ? So all functions apart from one in metricgroup.h are named metricgroup__XXX, so I was trying to keep this style - apart from the double-underscore (which can be remedied). Personally I don't think pmu_events_map__find_metric name fits with that convention. Thanks, John Thanks, Ian Signed-off-by: John Garry --- tools/perf/util/metricgroup.c | 5 +++-- tools/perf/util/metricgroup.h | 3 ++- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c index 6acb44ad439b..71a13406e0bd 100644 --- a/tools/perf/util/metricgroup.c +++ b/tools/perf/util/metricgroup.c @@ -900,7 +900,8 @@ static int __add_metric(struct list_head *metric_list, (match_metric(__pe->metric_group, __metric) || \ match_metric(__pe->metric_name, __metric))) -static struct pmu_event *find_metric(const char *metric, struct pmu_events_map *map) +struct pmu_event *metrcgroup_find_metric(const char *metric, +struct pmu_events_map *map) { struct pmu_event *pe; int i; @@ -985,7 +986,7 @@ static int __resolve_metric(struct metric *m, struct expr_id *parent; struct pmu_event *pe; - pe = find_metric(cur->key, map); + pe = metrcgroup_find_metric(cur->key, map); if (!pe) continue; diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h index ed1b9392e624..1674c6a36d74 100644 --- a/tools/perf/util/metricgroup.h +++ b/tools/perf/util/metricgroup.h @@ -44,7 +44,8 @@ int metricgroup__parse_groups(const struct option *opt, bool metric_no_group, bool metric_no_merge, struct rblist *metric_events); - +struct pmu_event *metrcgroup_find_metric(const char *metric, +struct pmu_events_map *map); int metricgroup__parse_groups_test(struct evlist 
*evlist, struct pmu_events_map *map, const char *str, -- 2.26.2 .
Re: PCI MSI issue with reinserting a driver
On 03/02/2021 17:23, Marc Zyngier wrote:
On 2021-02-02 15:46, John Garry wrote:
On 02/02/2021 14:48, Marc Zyngier wrote:

Not sure. I also now notice an error for the SAS PCI driver on D06 when nr_cpus < 16, which means the number of MSI vectors allocated < 32, so it looks like the same problem. There we try to allocate 16 + max(nr cpus, 16) MSI.

Anyway, let me have a look today to see what is going wrong.

Could this be the problem: nr_cpus=11

In the alloc path, we have:
its_alloc_device_irq(nvecs=27 = 16+11)
  bitmap_find_free_region(order = 5);

In the free path, we have:
its_irq_domain_free(nvecs = 1) for each of the 27 vecs
  bitmap_release_region(order = 0)

So we allocate 32 bits, but only free 27. And the 2nd alloc for 32 fails.

[ ... ]

Hi Marc,

Just a friendly reminder on this issue. Let me know if anything is required on our side.

Cheers,
John

But I'm not sure that we have any requirement for those map bits to be consecutive.

We can't really do that. All the events must be contiguous, and there are also a lot of assumptions in the ITS driver that LPI allocation is also contiguous. But there is also the fact that for Multi-MSI, we *must* allocate 32 vectors. Any driver could assume that if we have allocated 17 vectors, then there are another 15 available.

My question still stands: how was this working with the previous behaviour?

Because previously in this scenario we would allocate 32 bits and free 32 bits in the map; but now we allocate 32 bits, yet only free 27 - so we leak 5 bits. And this comes from how irq_domain_free_irqs_hierarchy() now frees per-interrupt, instead of all irqs per domain.

Before, in the free path we had:
its_irq_domain_free(nvecs = 27)
  bitmap_release_region(count order = 5 == 32 bits)

Currently, in the free path we have:
its_irq_domain_free(nvecs = 1) for each of the 27 vecs
  bitmap_release_region(count order = 0 == 1 bit)

Right. I was focusing on the patch and blindly ignored the explanation at the top of the email. Apologies for that.
I'm not overly keen on handling this in the ITS though, and I'd rather we try and do it in the generic code. How about this (compile tested only)?

Thanks,

M.

diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 6aacd342cd14..cfccad83c2df 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -1399,8 +1399,19 @@ static void irq_domain_free_irqs_hierarchy(struct irq_domain *domain,
 		return;
 
 	for (i = 0; i < nr_irqs; i++) {
-		if (irq_domain_get_irq_data(domain, irq_base + i))
-			domain->ops->free(domain, irq_base + i, 1);
+		int n;
+
+		/* Find the largest possible span of IRQs to free in one go */
+		for (n = 0;
+		     ((i + n) < nr_irqs &&
+		      irq_domain_get_irq_data(domain, irq_base + i + n));
+		     n++);
+
+		if (!n)
+			continue;
+
+		domain->ops->free(domain, irq_base + i, n);
+		i += n;
 	}
 }
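[Editor's note] The arithmetic John describes is easy to reproduce with a toy bitmap model. The helpers below are made up, loosely mimicking bitmap_find_free_region()/bitmap_release_region() on a 64-bit map (not the real ITS code): allocating 27 vectors rounds up to an order-5 (32-bit) region, and freeing them one bit at a time leaves 5 bits stale, so later order-5 allocations run out even though the map is not really full.

```c
#include <stdint.h>

#define MAP_BITS 64		/* toy LPI map size */

static uint64_t used;		/* bit set == event in use */

/* smallest order such that 2^order >= nvecs, like get_count_order() */
static int count_order(int nvecs)
{
	int order = 0;

	while ((1 << order) < nvecs)
		order++;
	return order;
}

/* find and claim a naturally aligned run of 2^order bits (order < 6 here) */
static int alloc_region(int order)
{
	int size = 1 << order;

	for (int base = 0; base + size <= MAP_BITS; base += size) {
		uint64_t mask = ((1ULL << size) - 1) << base;

		if (!(used & mask)) {
			used |= mask;
			return base;
		}
	}
	return -1;
}

/* clear a 2^order run starting at base */
static void free_region(int base, int order)
{
	int size = 1 << order;

	used &= ~(((1ULL << size) - 1) << base);
}
```

Allocating with count_order(27) = 5 but freeing with 27 order-0 calls leaves bits 27..31 set forever, which is exactly the leak that makes the 2nd and later order-5 allocations fail.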
Re: [PATCH v2 0/4] iommu/iova: Add CPU hotplug handler to flush rcaches to core code
On 25/03/2021 17:53, Will Deacon wrote: On Thu, Mar 25, 2021 at 08:29:57PM +0800, John Garry wrote: The Intel IOMMU driver supports flushing the per-CPU rcaches when a CPU is offlined. Let's move it to core code, so everyone can take advantage. Also throw in a patch to stop exporting free_iova_fast(). Differences to v1: - Add RB tags (thanks!) - Add patch to stop exporting free_iova_fast() - Drop patch to correct comment - Add patch to delete iommu_dma_free_cpu_cached_iovas() and associated changes John Garry (4): iova: Add CPU hotplug handler to flush rcaches iommu/vt-d: Remove IOVA domain rcache flushing for CPU offlining iommu: Delete iommu_dma_free_cpu_cached_iovas() iommu: Stop exporting free_iova_fast() Looks like this is all set for 5.13, so hopefully Joerg can stick it in -next for a bit more exposure. Hi Joerg, Can you kindly consider picking this up now? Thanks, John
Re: [PATCH v2 0/6] perf arm64 metricgroup support
On 25/03/2021 20:39, Paul A. Clarke wrote: On Thu, Mar 25, 2021 at 06:33:12PM +0800, John Garry wrote: Metric reuse support is added for pmu-events parse metric testcase. This had been broken on power9 recentlty: https://lore.kernel.org/lkml/20210324015418.gc8...@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com/ Much better. Before: -- $ perf test -v 10 2>&1 | grep -i error | wc -l 112 -- After: -- $ perf test -v 10 2>&1 | grep -i error | wc -l 17 -- And these seem like different types of issues: -- $ perf test -v 10 2>&1 | grep -i error Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_powerbus0_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs01_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs23_imc'. Missing kernel support?' help '(null)' Error string 'Cannot find PMU `nest_mcs23_imc'. 
Missing kernel support?' help '(null)' -- This looks suspicious. Firstly, does /sys/bus/event_source/devices/nest_mcs01_imc (or others, above) exist on your system? I guess not. Secondly, checking Documentation/powerpc/imc.rst, we have examples of: nest_mcs01/PM_MCS01_64B_R... So is the PMU name correct in the metric file for nest_mcs01_imc? Looking at the kernel driver file, arch/powerpc/perf/imc-pmu.c, it seems to be correct. Not sure. Thanks, John (Added Kajol Jain to CC)
[PATCH v2 2/4] iommu/vt-d: Remove IOVA domain rcache flushing for CPU offlining
Now that the core code handles flushing per-IOVA domain CPU rcaches, remove the handling here. Reviewed-by: Lu Baolu Signed-off-by: John Garry --- drivers/iommu/intel/iommu.c | 31 --- include/linux/cpuhotplug.h | 1 - 2 files changed, 32 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index ee0932307d64..d1e66e1b07b8 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -4065,35 +4065,6 @@ static struct notifier_block intel_iommu_memory_nb = { .priority = 0 }; -static void free_all_cpu_cached_iovas(unsigned int cpu) -{ - int i; - - for (i = 0; i < g_num_of_iommus; i++) { - struct intel_iommu *iommu = g_iommus[i]; - struct dmar_domain *domain; - int did; - - if (!iommu) - continue; - - for (did = 0; did < cap_ndoms(iommu->cap); did++) { - domain = get_iommu_domain(iommu, (u16)did); - - if (!domain || domain->domain.type != IOMMU_DOMAIN_DMA) - continue; - - iommu_dma_free_cpu_cached_iovas(cpu, >domain); - } - } -} - -static int intel_iommu_cpu_dead(unsigned int cpu) -{ - free_all_cpu_cached_iovas(cpu); - return 0; -} - static void intel_disable_iommus(void) { struct intel_iommu *iommu = NULL; @@ -4388,8 +4359,6 @@ int __init intel_iommu_init(void) bus_set_iommu(_bus_type, _iommu_ops); if (si_domain && !hw_pass_through) register_memory_notifier(_iommu_memory_nb); - cpuhp_setup_state(CPUHP_IOMMU_INTEL_DEAD, "iommu/intel:dead", NULL, - intel_iommu_cpu_dead); down_read(_global_lock); if (probe_acpi_namespace_devices()) diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index cedac9986557..85996494bec1 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -57,7 +57,6 @@ enum cpuhp_state { CPUHP_PAGE_ALLOC_DEAD, CPUHP_NET_DEV_DEAD, CPUHP_PCI_XGENE_DEAD, - CPUHP_IOMMU_INTEL_DEAD, CPUHP_IOMMU_IOVA_DEAD, CPUHP_LUSTRE_CFS_DEAD, CPUHP_AP_ARM_CACHE_B15_RAC_DEAD, -- 2.26.2
[PATCH v2 0/4] iommu/iova: Add CPU hotplug handler to flush rcaches to core code
The Intel IOMMU driver supports flushing the per-CPU rcaches when a CPU is offlined. Let's move it to core code, so everyone can take advantage. Also throw in a patch to stop exporting free_iova_fast(). Differences to v1: - Add RB tags (thanks!) - Add patch to stop exporting free_iova_fast() - Drop patch to correct comment - Add patch to delete iommu_dma_free_cpu_cached_iovas() and associated changes John Garry (4): iova: Add CPU hotplug handler to flush rcaches iommu/vt-d: Remove IOVA domain rcache flushing for CPU offlining iommu: Delete iommu_dma_free_cpu_cached_iovas() iommu: Stop exporting free_iova_fast() drivers/iommu/dma-iommu.c | 9 - drivers/iommu/intel/iommu.c | 31 --- drivers/iommu/iova.c| 34 +++--- include/linux/cpuhotplug.h | 2 +- include/linux/dma-iommu.h | 8 include/linux/iova.h| 6 +- 6 files changed, 33 insertions(+), 57 deletions(-) -- 2.26.2
[PATCH v2 4/4] iommu: Stop exporting free_iova_fast()
Function free_iova_fast() is only referenced by dma-iommu.c, which can only be in-built, so stop exporting it. This was missed in an earlier tidy-up patch. Signed-off-by: John Garry --- drivers/iommu/iova.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 8a493ee92c79..e31e79a9f5c3 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -493,7 +493,6 @@ free_iova_fast(struct iova_domain *iovad, unsigned long pfn, unsigned long size) free_iova(iovad, pfn); } -EXPORT_SYMBOL_GPL(free_iova_fast); #define fq_ring_for_each(i, fq) \ for ((i) = (fq)->head; (i) != (fq)->tail; (i) = ((i) + 1) % IOVA_FQ_SIZE) -- 2.26.2
[PATCH v2 3/4] iommu: Delete iommu_dma_free_cpu_cached_iovas()
Function iommu_dma_free_cpu_cached_iovas() no longer has any caller, so delete it. With that, function free_cpu_cached_iovas() may be made static. Signed-off-by: John Garry --- drivers/iommu/dma-iommu.c | 9 - drivers/iommu/iova.c | 3 ++- include/linux/dma-iommu.h | 8 include/linux/iova.h | 5 - 4 files changed, 2 insertions(+), 23 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index af765c813cc8..9da7e9901bec 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -53,15 +53,6 @@ struct iommu_dma_cookie { static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled); -void iommu_dma_free_cpu_cached_iovas(unsigned int cpu, - struct iommu_domain *domain) -{ - struct iommu_dma_cookie *cookie = domain->iova_cookie; - struct iova_domain *iovad = >iovad; - - free_cpu_cached_iovas(cpu, iovad); -} - static void iommu_dma_entry_dtor(unsigned long data) { struct page *freelist = (struct page *)data; diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index c78312560425..8a493ee92c79 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -22,6 +22,7 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad, unsigned long size, unsigned long limit_pfn); static void init_iova_rcaches(struct iova_domain *iovad); +static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad); static void free_iova_rcaches(struct iova_domain *iovad); static void fq_destroy_all_entries(struct iova_domain *iovad); static void fq_flush_timeout(struct timer_list *t); @@ -998,7 +999,7 @@ static void free_iova_rcaches(struct iova_domain *iovad) /* * free all the IOVA ranges cached by a cpu (used when cpu is unplugged) */ -void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad) +static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad) { struct iova_cpu_rcache *cpu_rcache; struct iova_rcache *rcache; diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h index 
706b68d1359b..2112f21f73d8 100644 --- a/include/linux/dma-iommu.h +++ b/include/linux/dma-iommu.h @@ -37,9 +37,6 @@ void iommu_dma_compose_msi_msg(struct msi_desc *desc, void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list); -void iommu_dma_free_cpu_cached_iovas(unsigned int cpu, - struct iommu_domain *domain); - #else /* CONFIG_IOMMU_DMA */ struct iommu_domain; @@ -81,10 +78,5 @@ static inline void iommu_dma_get_resv_regions(struct device *dev, struct list_he { } -static inline void iommu_dma_free_cpu_cached_iovas(unsigned int cpu, - struct iommu_domain *domain) -{ -} - #endif /* CONFIG_IOMMU_DMA */ #endif /* __DMA_IOMMU_H */ diff --git a/include/linux/iova.h b/include/linux/iova.h index 4be6c0ab4997..71d8a2de6635 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -157,7 +157,6 @@ int init_iova_flush_queue(struct iova_domain *iovad, iova_flush_cb flush_cb, iova_entry_dtor entry_dtor); struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn); void put_iova_domain(struct iova_domain *iovad); -void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad); #else static inline int iova_cache_get(void) { @@ -234,10 +233,6 @@ static inline void put_iova_domain(struct iova_domain *iovad) { } -static inline void free_cpu_cached_iovas(unsigned int cpu, -struct iova_domain *iovad) -{ -} #endif #endif -- 2.26.2
[PATCH v2 1/4] iova: Add CPU hotplug handler to flush rcaches
Like the Intel IOMMU driver already does, flush the per-IOVA domain CPU rcache when a CPU goes offline - there's no point in keeping it. Reviewed-by: Robin Murphy Signed-off-by: John Garry --- drivers/iommu/iova.c | 30 +- include/linux/cpuhotplug.h | 1 + include/linux/iova.h | 1 + 3 files changed, 31 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index e6e2fa85271c..c78312560425 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -25,6 +25,17 @@ static void init_iova_rcaches(struct iova_domain *iovad); static void free_iova_rcaches(struct iova_domain *iovad); static void fq_destroy_all_entries(struct iova_domain *iovad); static void fq_flush_timeout(struct timer_list *t); + +static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node) +{ + struct iova_domain *iovad; + + iovad = hlist_entry_safe(node, struct iova_domain, cpuhp_dead); + + free_cpu_cached_iovas(cpu, iovad); + return 0; +} + static void free_global_cached_iovas(struct iova_domain *iovad); void @@ -51,6 +62,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule, iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR; rb_link_node(>anchor.node, NULL, >rbroot.rb_node); rb_insert_color(>anchor.node, >rbroot); + cpuhp_state_add_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, >cpuhp_dead); init_iova_rcaches(iovad); } EXPORT_SYMBOL_GPL(init_iova_domain); @@ -257,10 +269,21 @@ int iova_cache_get(void) { mutex_lock(_cache_mutex); if (!iova_cache_users) { + int ret; + + ret = cpuhp_setup_state_multi(CPUHP_IOMMU_IOVA_DEAD, "iommu/iova:dead", NULL, + iova_cpuhp_dead); + if (ret) { + mutex_unlock(_cache_mutex); + pr_err("Couldn't register cpuhp handler\n"); + return ret; + } + iova_cache = kmem_cache_create( "iommu_iova", sizeof(struct iova), 0, SLAB_HWCACHE_ALIGN, NULL); if (!iova_cache) { + cpuhp_remove_multi_state(CPUHP_IOMMU_IOVA_DEAD); mutex_unlock(_cache_mutex); pr_err("Couldn't create iova cache\n"); return -ENOMEM; @@ -282,8 +305,10 @@ 
void iova_cache_put(void) return; } iova_cache_users--; - if (!iova_cache_users) + if (!iova_cache_users) { + cpuhp_remove_multi_state(CPUHP_IOMMU_IOVA_DEAD); kmem_cache_destroy(iova_cache); + } mutex_unlock(_cache_mutex); } EXPORT_SYMBOL_GPL(iova_cache_put); @@ -606,6 +631,9 @@ void put_iova_domain(struct iova_domain *iovad) { struct iova *iova, *tmp; + cpuhp_state_remove_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, + >cpuhp_dead); + free_iova_flush_queue(iovad); free_iova_rcaches(iovad); rbtree_postorder_for_each_entry_safe(iova, tmp, >rbroot, node) diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index f14adb882338..cedac9986557 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -58,6 +58,7 @@ enum cpuhp_state { CPUHP_NET_DEV_DEAD, CPUHP_PCI_XGENE_DEAD, CPUHP_IOMMU_INTEL_DEAD, + CPUHP_IOMMU_IOVA_DEAD, CPUHP_LUSTRE_CFS_DEAD, CPUHP_AP_ARM_CACHE_B15_RAC_DEAD, CPUHP_PADATA_DEAD, diff --git a/include/linux/iova.h b/include/linux/iova.h index c834c01c0a5b..4be6c0ab4997 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -95,6 +95,7 @@ struct iova_domain { flush-queues */ atomic_t fq_timer_on; /* 1 when timer is active, 0 when not */ + struct hlist_node cpuhp_dead; }; static inline unsigned long iova_size(struct iova *iova) -- 2.26.2
[PATCH v2 2/6] perf test: Handle metric reuse in pmu-events parsing test
The pmu-events parsing test does not handle metric reuse at all. Introduce some simple handling to resolve metrics who reference other metrics. Signed-off-by: John Garry --- tools/perf/tests/pmu-events.c | 80 +++ 1 file changed, 80 insertions(+) diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c index 0ca6a5a53523..20b6bf14f7f7 100644 --- a/tools/perf/tests/pmu-events.c +++ b/tools/perf/tests/pmu-events.c @@ -12,6 +12,7 @@ #include "util/evlist.h" #include "util/expr.h" #include "util/parse-events.h" +#include "metricgroup.h" struct perf_pmu_test_event { /* used for matching against events from generated pmu-events.c */ @@ -471,6 +472,70 @@ static void expr_failure(const char *msg, pr_debug("On expression %s\n", pe->metric_expr); } +struct metric { + struct list_head list; + struct metric_ref metric_ref; +}; + +static int resolve_metric_simple(struct expr_parse_ctx *pctx, +struct list_head *compound_list, +struct pmu_events_map *map, +const char *metric_name) +{ + struct hashmap_entry *cur, *cur_tmp; + struct metric *metric, *tmp; + size_t bkt; + bool all; + int rc; + + do { + all = true; + hashmap__for_each_entry_safe((>ids), cur, cur_tmp, bkt) { + struct metric_ref *ref; + struct pmu_event *pe; + + pe = metrcgroup_find_metric(cur->key, map); + if (!pe) + continue; + + if (!strcmp(metric_name, (char *)cur->key)) { + pr_warning("Recursion detected for metric %s\n", metric_name); + rc = -1; + goto out_err; + } + + all = false; + + /* The metric key itself needs to go out.. 
*/ + expr__del_id(pctx, cur->key); + + metric = malloc(sizeof(*metric)); + if (!metric) { + rc = -ENOMEM; + goto out_err; + } + + ref = >metric_ref; + ref->metric_name = pe->metric_name; + ref->metric_expr = pe->metric_expr; + list_add_tail(>list, compound_list); + + rc = expr__find_other(pe->metric_expr, NULL, pctx, 0); + if (rc) + goto out_err; + } + } while (!all); + + return 0; + +out_err: + list_for_each_entry_safe(metric, tmp, compound_list, list) + free(metric); + + return rc; + +} + static int test_parsing(void) { struct pmu_events_map *cpus_map = perf_pmu__find_map(NULL); @@ -488,7 +553,9 @@ static int test_parsing(void) break; j = 0; for (;;) { + struct metric *metric, *tmp; struct hashmap_entry *cur; + LIST_HEAD(compound_list); size_t bkt; pe = >table[j++]; @@ -504,6 +571,13 @@ static int test_parsing(void) continue; } + if (resolve_metric_simple(, _list, map, + pe->metric_name)) { + expr_failure("Could not resolve metrics", map, pe); + ret++; + goto exit; /* Don't tolerate errors due to severity */ + } + /* * Add all ids with a made up value. The value may * trigger divide by zero when subtracted and so try to @@ -519,6 +593,11 @@ static int test_parsing(void) ret++; } + list_for_each_entry_safe(metric, tmp, _list, list) { + expr__add_ref(, >metric_ref); + free(metric); + } + if (expr__parse(, , pe->metric_expr, 0)) { expr_failure("Parse failed", map, pe); ret++; @@ -527,6 +606,7 @@ static int test_parsing(void) } } /* TODO: fail when not ok */ +exit: return ret == 0 ? TEST_OK : TEST_SKIP; } -- 2.26.2
[PATCH v2 0/6] perf arm64 metricgroup support
This series contains support to get basic metricgroups working for arm64 CPUs. Initial support is added for HiSilicon hip08 platform. Some sample usage on Huawei D06 board: $ ./perf list metric List of pre-defined events (to be used in -e): Metrics: bp_misp_flush [BP misp flush L3 topdown metric] branch_mispredicts [Branch mispredicts L2 topdown metric] core_bound [Core bound L2 topdown metric] divider [Divider L3 topdown metric] exe_ports_util [EXE ports util L3 topdown metric] fetch_bandwidth_bound [Fetch bandwidth bound L2 topdown metric] fetch_latency_bound [Fetch latency bound L2 topdown metric] fsu_stall [FSU stall L3 topdown metric] idle_by_icache_miss $ sudo ./perf stat -v -M core_bound sleep 1 Using CPUID 0x480fd010 metric expr (exe_stall_cycle - (mem_stall_anyload + armv8_pmuv3_0@event\=0x7005@)) / cpu_cycles for core_bound found event cpu_cycles found event armv8_pmuv3_0/event=0x7005/ found event exe_stall_cycle found event mem_stall_anyload adding {cpu_cycles -> armv8_pmuv3_0/event=0x7001/ mem_stall_anyload -> armv8_pmuv3_0/event=0x7004/ Control descriptor is not initialized cpu_cycles: 989433 385050 385050 armv8_pmuv3_0/event=0x7005/: 19207 385050 385050 exe_stall_cycle: 900825 385050 385050 mem_stall_anyload: 253516 385050 385050 Performance counter stats for 'sleep': 989,433 cpu_cycles # 0.63 core_bound 19,207 armv8_pmuv3_0/event=0x7005/ 900,825 exe_stall_cycle 253,516 mem_stall_anyload 0.000805809 seconds time elapsed 0.000875000 seconds user 0.0 seconds sys perf stat --topdown is not supported, as this requires the CPU PMU to expose (alias) events for the TopDown L1 metrics from sysfs, which arm does not do. To get that to work, we probably need to make perf use the pmu-events cpumap to learn about those alias events. Metric reuse support is added for pmu-events parse metric testcase. 
This had been broken on power9 recently: https://lore.kernel.org/lkml/20210324015418.gc8...@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com/ Differences to v1: - Add pmu_events_map__find() as arm64-specific function - Fix metric reuse for pmu-events parse metric testcase John Garry (6): perf metricgroup: Make find_metric() public with name change perf test: Handle metric reuse in pmu-events parsing test perf pmu: Add pmu_events_map__find() perf vendor events arm64: Add Hisi hip08 L1 metrics perf vendor events arm64: Add Hisi hip08 L2 metrics perf vendor events arm64: Add Hisi hip08 L3 metrics tools/perf/arch/arm64/util/Build | 1 + tools/perf/arch/arm64/util/pmu.c | 25 ++ .../arch/arm64/hisilicon/hip08/metrics.json | 233 ++ tools/perf/tests/pmu-events.c | 82 +- tools/perf/util/metricgroup.c | 12 +- tools/perf/util/metricgroup.h | 3 +- tools/perf/util/pmu.c | 5 + tools/perf/util/pmu.h | 1 + tools/perf/util/s390-sample-raw.c | 4 +- 9 files changed, 355 insertions(+), 11 deletions(-) create mode 100644 tools/perf/arch/arm64/util/pmu.c create mode 100644 tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json -- 2.26.2
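As a sanity check, the core_bound figure in the `perf stat -v` log above can be reproduced by hand from the four counter values: (900825 - (253516 + 19207)) / 989433 ≈ 0.63, matching the printed "0.63 core_bound". A minimal sketch of that arithmetic (counter values copied from the log):

```c
#include <assert.h>

/*
 * Mirrors the metric expr printed in the -v output:
 * (exe_stall_cycle - (mem_stall_anyload + event_0x7005)) / cpu_cycles
 */
static double core_bound(double cpu_cycles, double exe_stall_cycle,
			 double mem_stall_anyload, double event_7005)
{
	return (exe_stall_cycle - (mem_stall_anyload + event_7005)) / cpu_cycles;
}
```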
[PATCH v2 4/6] perf vendor events arm64: Add Hisi hip08 L1 metrics
Add L1 metrics. Formula is as consistent as possible with MAN pages description for these metrics. Signed-off-by: John Garry --- .../arch/arm64/hisilicon/hip08/metrics.json | 30 +++ 1 file changed, 30 insertions(+) create mode 100644 tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json diff --git a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json new file mode 100644 index ..dc5ff3051639 --- /dev/null +++ b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json @@ -0,0 +1,30 @@ +[ +{ +"MetricExpr": "FETCH_BUBBLE / (4 * CPU_CYCLES)", +"PublicDescription": "Frontend bound L1 topdown metric", +"BriefDescription": "Frontend bound L1 topdown metric", +"MetricGroup": "TopDownL1", +"MetricName": "frontend_bound" +}, +{ +"MetricExpr": "(INST_SPEC - INST_RETIRED) / (4 * CPU_CYCLES)", +"PublicDescription": "Bad Speculation L1 topdown metric", +"BriefDescription": "Bad Speculation L1 topdown metric", +"MetricGroup": "TopDownL1", +"MetricName": "bad_speculation" +}, +{ +"MetricExpr": "INST_RETIRED / (CPU_CYCLES * 4)", +"PublicDescription": "Retiring L1 topdown metric", +"BriefDescription": "Retiring L1 topdown metric", +"MetricGroup": "TopDownL1", +"MetricName": "retiring" +}, +{ +"MetricExpr": "1 - (frontend_bound + bad_speculation + retiring)", +"PublicDescription": "Backend Bound L1 topdown metric", +"BriefDescription": "Backend Bound L1 topdown metric", +"MetricGroup": "TopDownL1", +"MetricName": "backend_bound" +}, +] -- 2.26.2
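A property worth noting about these four L1 metrics: backend_bound is defined as the remainder, so the four fractions always sum to exactly 1 by construction. The same formulas in a small C sketch (the counter values used in the test are made up):

```c
#include <assert.h>

/* Hypothetical counter sample; formulas mirror the JSON MetricExprs above. */
struct l1_counters { double fetch_bubble, cpu_cycles, inst_spec, inst_retired; };

static double frontend_bound(const struct l1_counters *c)
{ return c->fetch_bubble / (4.0 * c->cpu_cycles); }

static double bad_speculation(const struct l1_counters *c)
{ return (c->inst_spec - c->inst_retired) / (4.0 * c->cpu_cycles); }

static double retiring(const struct l1_counters *c)
{ return c->inst_retired / (c->cpu_cycles * 4.0); }

/* Defined as the remainder, so the four fractions always sum to 1. */
static double backend_bound(const struct l1_counters *c)
{ return 1.0 - (frontend_bound(c) + bad_speculation(c) + retiring(c)); }
```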
[PATCH v2 3/6] perf pmu: Add pmu_events_map__find()
Add a function to find the common PMU map for the system. For arm64, a special variant is added. This is because arm64 supports heterogeneous CPU systems. As such, it cannot be guaranteed that the cpumap is the same for all CPUs. So in case of heterogeneous systems, don't return a cpumap. Signed-off-by: John Garry --- tools/perf/arch/arm64/util/Build | 1 + tools/perf/arch/arm64/util/pmu.c | 25 + tools/perf/tests/pmu-events.c | 2 +- tools/perf/util/metricgroup.c | 7 +++ tools/perf/util/pmu.c | 5 + tools/perf/util/pmu.h | 1 + tools/perf/util/s390-sample-raw.c | 4 +--- 7 files changed, 37 insertions(+), 8 deletions(-) create mode 100644 tools/perf/arch/arm64/util/pmu.c diff --git a/tools/perf/arch/arm64/util/Build b/tools/perf/arch/arm64/util/Build index ead2f2275eee..9fcb4e68add9 100644 --- a/tools/perf/arch/arm64/util/Build +++ b/tools/perf/arch/arm64/util/Build @@ -2,6 +2,7 @@ perf-y += header.o perf-y += machine.o perf-y += perf_regs.o perf-y += tsc.o +perf-y += pmu.o perf-y += kvm-stat.o perf-$(CONFIG_DWARF) += dwarf-regs.o perf-$(CONFIG_LOCAL_LIBUNWIND) += unwind-libunwind.o diff --git a/tools/perf/arch/arm64/util/pmu.c b/tools/perf/arch/arm64/util/pmu.c new file mode 100644 index ..d3259d61ca75 --- /dev/null +++ b/tools/perf/arch/arm64/util/pmu.c @@ -0,0 +1,25 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include "../../util/cpumap.h" +#include "../../util/pmu.h" + +struct pmu_events_map *pmu_events_map__find(void) +{ + struct perf_pmu *pmu = NULL; + + while ((pmu = perf_pmu__scan(pmu))) { + if (!is_pmu_core(pmu->name)) + continue; + + /* +* The cpumap should cover all CPUs. Otherwise, some CPUs may +* not support some events or have different event IDs. 
+*/ + if (pmu->cpus->nr != cpu__max_cpu()) + return NULL; + + return perf_pmu__find_map(pmu); + } + + return NULL; +} diff --git a/tools/perf/tests/pmu-events.c b/tools/perf/tests/pmu-events.c index 20b6bf14f7f7..9fe2455b355b 100644 --- a/tools/perf/tests/pmu-events.c +++ b/tools/perf/tests/pmu-events.c @@ -538,7 +538,7 @@ static int resolve_metric_simple(struct expr_parse_ctx *pctx, static int test_parsing(void) { - struct pmu_events_map *cpus_map = perf_pmu__find_map(NULL); + struct pmu_events_map *cpus_map = pmu_events_map__find(); struct pmu_events_map *map; struct pmu_event *pe; int i, j, k; diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c index 71a13406e0bd..e6e787e94f91 100644 --- a/tools/perf/util/metricgroup.c +++ b/tools/perf/util/metricgroup.c @@ -618,7 +618,7 @@ static int metricgroup__print_sys_event_iter(struct pmu_event *pe, void *data) void metricgroup__print(bool metrics, bool metricgroups, char *filter, bool raw, bool details) { - struct pmu_events_map *map = perf_pmu__find_map(NULL); + struct pmu_events_map *map = pmu_events_map__find(); struct pmu_event *pe; int i; struct rblist groups; @@ -1254,8 +1254,7 @@ int metricgroup__parse_groups(const struct option *opt, struct rblist *metric_events) { struct evlist *perf_evlist = *(struct evlist **)opt->value; - struct pmu_events_map *map = perf_pmu__find_map(NULL); - + struct pmu_events_map *map = pmu_events_map__find(); return parse_groups(perf_evlist, str, metric_no_group, metric_no_merge, NULL, metric_events, map); @@ -1274,7 +1273,7 @@ int metricgroup__parse_groups_test(struct evlist *evlist, bool metricgroup__has_metric(const char *metric) { - struct pmu_events_map *map = perf_pmu__find_map(NULL); + struct pmu_events_map *map = pmu_events_map__find(); struct pmu_event *pe; int i; diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c index 88da5cf6aee8..419ef6c4fbc0 100644 --- a/tools/perf/util/pmu.c +++ b/tools/perf/util/pmu.c @@ -717,6 +717,11 @@ struct 
pmu_events_map *perf_pmu__find_map(struct perf_pmu *pmu) return map; } +struct pmu_events_map *__weak pmu_events_map__find(void) +{ + return perf_pmu__find_map(NULL); +} + bool pmu_uncore_alias_match(const char *pmu_name, const char *name) { char *tmp = NULL, *tok, *str; diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h index 8164388478c6..012317229488 100644 --- a/tools/perf/util/pmu.h +++ b/tools/perf/util/pmu.h @@ -114,6 +114,7 @@ void pmu_add_cpu_aliases_map(struct list_head *head, struct perf_pmu *pmu, struct pmu_events_map *map); struct pmu_events_map *perf_pmu__find_map(struct perf_pmu *pmu); +struct pmu_events_map *pmu_events_map__find(void); bool pmu_uncore_alias_match(const char *pm
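The __weak mechanism this patch relies on is worth spelling out: util/pmu.c supplies a weak generic definition of pmu_events_map__find(), and the arm64 object file's ordinary (strong) definition replaces it at link time, with no #ifdefs. A toy version of the fallback side is below — the demo_* names are invented, and on its own (with no strong override linked in) the weak definition is the one that runs:

```c
#include <assert.h>
#include <stddef.h>

struct pmu_events_map; /* opaque for this sketch */

/*
 * Generic fallback, in the style of util/pmu.c: an arch file (e.g.
 * arch/arm64/util/pmu.c) may provide a strong definition of the same
 * symbol, which the linker then picks over this weak one. The fallback
 * here just returns NULL as a stand-in for perf_pmu__find_map(NULL).
 */
struct pmu_events_map * __attribute__((weak)) demo_events_map__find(void)
{
	return NULL;
}
```

The same pattern keeps the common code arch-agnostic while letting arm64 veto the map on heterogeneous systems.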
[PATCH v2 5/6] perf vendor events arm64: Add Hisi hip08 L2 metrics
Add L2 metrics. Signed-off-by: John Garry --- .../arch/arm64/hisilicon/hip08/metrics.json | 42 +++ 1 file changed, 42 insertions(+) diff --git a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json index dc5ff3051639..dda898d23c2d 100644 --- a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json +++ b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json @@ -27,4 +27,46 @@ "MetricGroup": "TopDownL1", "MetricName": "backend_bound" }, +{ +"MetricExpr": "armv8_pmuv3_0@event\\=0x201d@ / CPU_CYCLES", +"PublicDescription": "Fetch latency bound L2 topdown metric", +"BriefDescription": "Fetch latency bound L2 topdown metric", +"MetricGroup": "TopDownL2", +"MetricName": "fetch_latency_bound" +}, +{ +"MetricExpr": "frontend_bound - fetch_latency_bound", +"PublicDescription": "Fetch bandwidth bound L2 topdown metric", +"BriefDescription": "Fetch bandwidth bound L2 topdown metric", +"MetricGroup": "TopDownL2", +"MetricName": "fetch_bandwidth_bound" +}, +{ +"MetricExpr": "(bad_speculation * BR_MIS_PRED) / (BR_MIS_PRED + armv8_pmuv3_0@event\\=0x2013@)", +"PublicDescription": "Branch mispredicts L2 topdown metric", +"BriefDescription": "Branch mispredicts L2 topdown metric", +"MetricGroup": "TopDownL2", +"MetricName": "branch_mispredicts" +}, +{ +"MetricExpr": "bad_speculation - branch_mispredicts", +"PublicDescription": "Machine clears L2 topdown metric", +"BriefDescription": "Machine clears L2 topdown metric", +"MetricGroup": "TopDownL2", +"MetricName": "machine_clears" +}, +{ +"MetricExpr": "(EXE_STALL_CYCLE - (MEM_STALL_ANYLOAD + armv8_pmuv3_0@event\\=0x7005@)) / CPU_CYCLES", +"PublicDescription": "Core bound L2 topdown metric", +"BriefDescription": "Core bound L2 topdown metric", +"MetricGroup": "TopDownL2", +"MetricName": "core_bound" +}, +{ +"MetricExpr": "(MEM_STALL_ANYLOAD + armv8_pmuv3_0@event\\=0x7005@) / CPU_CYCLES", +"PublicDescription": "Memory bound L2 topdown metric", 
+"BriefDescription": "Memory bound L2 topdown metric", +"MetricGroup": "TopDownL2", +"MetricName": "memory_bound" +}, ] -- 2.26.2
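Note how the L2 level partitions its parent: machine_clears is defined as bad_speculation minus branch_mispredicts, so the two L2 children always sum back to the L1 metric. A small sketch of that split (input numbers in the test are made up):

```c
#include <assert.h>

/* Mirrors the TopDownL2 split of bad_speculation in the JSON above:
 * (bad_spec * BR_MIS_PRED) / (BR_MIS_PRED + flush_event_0x2013) */
static double branch_mispredicts(double bad_spec, double br_mis_pred,
				 double flush_e2013)
{
	return (bad_spec * br_mis_pred) / (br_mis_pred + flush_e2013);
}

/* Defined as the remainder, so the two children always sum to bad_spec. */
static double machine_clears(double bad_spec, double br_mis_pred,
			     double flush_e2013)
{
	return bad_spec - branch_mispredicts(bad_spec, br_mis_pred, flush_e2013);
}
```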
[PATCH v2 6/6] perf vendor events arm64: Add Hisi hip08 L3 metrics
Add L3 metrics. Signed-off-by: John Garry --- .../arch/arm64/hisilicon/hip08/metrics.json | 161 ++ 1 file changed, 161 insertions(+) diff --git a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json index dda898d23c2d..dda8e59149d2 100644 --- a/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json +++ b/tools/perf/pmu-events/arch/arm64/hisilicon/hip08/metrics.json @@ -69,4 +69,165 @@ "MetricGroup": "TopDownL2", "MetricName": "memory_bound" }, +{ +"MetricExpr": "(((L2I_TLB - L2I_TLB_REFILL) * 15) + (L2I_TLB_REFILL * 100)) / CPU_CYCLES", +"PublicDescription": "Idle by itlb miss L3 topdown metric", +"BriefDescription": "Idle by itlb miss L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "idle_by_itlb_miss" +}, +{ +"MetricExpr": "(((L2I_CACHE - L2I_CACHE_REFILL) * 15) + (L2I_CACHE_REFILL * 100)) / CPU_CYCLES", +"PublicDescription": "Idle by icache miss L3 topdown metric", +"BriefDescription": "Idle by icache miss L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "idle_by_icache_miss" +}, +{ +"MetricExpr": "(BR_MIS_PRED * 5) / CPU_CYCLES", +"PublicDescription": "BP misp flush L3 topdown metric", +"BriefDescription": "BP misp flush L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "bp_misp_flush" +}, +{ +"MetricExpr": "(armv8_pmuv3_0@event\\=0x2013@ * 5) / CPU_CYCLES", +"PublicDescription": "OOO flush L3 topdown metric", +"BriefDescription": "OOO flush L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "ooo_flush" +}, +{ +"MetricExpr": "(armv8_pmuv3_0@event\\=0x1001@ * 5) / CPU_CYCLES", +"PublicDescription": "Static predictor flush L3 topdown metric", +"BriefDescription": "Static predictor flush L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "sp_flush" +}, +{ +"MetricExpr": "armv8_pmuv3_0@event\\=0x1010@ / BR_MIS_PRED", +"PublicDescription": "Indirect branch L3 topdown metric", +"BriefDescription": "Indirect branch L3 
topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "indirect_branch" +}, +{ +"MetricExpr": "(armv8_pmuv3_0@event\\=0x1014@ + armv8_pmuv3_0@event\\=0x1018@) / BR_MIS_PRED", +"PublicDescription": "Push branch L3 topdown metric", +"BriefDescription": "Push branch L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "push_branch" +}, +{ +"MetricExpr": "armv8_pmuv3_0@event\\=0x100c@ / BR_MIS_PRED", +"PublicDescription": "Pop branch L3 topdown metric", +"BriefDescription": "Pop branch L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "pop_branch" +}, +{ +"MetricExpr": "(BR_MIS_PRED - armv8_pmuv3_0@event\\=0x1010@ - armv8_pmuv3_0@event\\=0x1014@ - armv8_pmuv3_0@event\\=0x1018@ - armv8_pmuv3_0@event\\=0x100c@) / BR_MIS_PRED", +"PublicDescription": "Other branch L3 topdown metric", +"BriefDescription": "Other branch L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "other_branch" +}, +{ +"MetricExpr": "armv8_pmuv3_0@event\\=0x2012@ / armv8_pmuv3_0@event\\=0x2013@", +"PublicDescription": "Nuke flush L3 topdown metric", +"BriefDescription": "Nuke flush L3 topdown metric", +"MetricGroup": "TopDownL3", +"MetricName": "nuke_flush" +}, +{ +"MetricExpr": "1 - nuke_flush", +"PublicDescription": "Other flush L3 topdown metric", +"BriefDescription": "Other flush L3 topdown metric", +
[PATCH v2 1/6] perf metricgroup: Make find_metric() public with name change
Function find_metric() is required for the metric processing in the pmu-events testcase, so make it public. Also change the name to include "metricgroup". Signed-off-by: John Garry --- tools/perf/util/metricgroup.c | 5 +++-- tools/perf/util/metricgroup.h | 3 ++- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c index 6acb44ad439b..71a13406e0bd 100644 --- a/tools/perf/util/metricgroup.c +++ b/tools/perf/util/metricgroup.c @@ -900,7 +900,8 @@ static int __add_metric(struct list_head *metric_list, (match_metric(__pe->metric_group, __metric) || \ match_metric(__pe->metric_name, __metric))) -static struct pmu_event *find_metric(const char *metric, struct pmu_events_map *map) +struct pmu_event *metrcgroup_find_metric(const char *metric, +struct pmu_events_map *map) { struct pmu_event *pe; int i; @@ -985,7 +986,7 @@ static int __resolve_metric(struct metric *m, struct expr_id *parent; struct pmu_event *pe; - pe = find_metric(cur->key, map); + pe = metrcgroup_find_metric(cur->key, map); if (!pe) continue; diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h index ed1b9392e624..1674c6a36d74 100644 --- a/tools/perf/util/metricgroup.h +++ b/tools/perf/util/metricgroup.h @@ -44,7 +44,8 @@ int metricgroup__parse_groups(const struct option *opt, bool metric_no_group, bool metric_no_merge, struct rblist *metric_events); - +struct pmu_event *metrcgroup_find_metric(const char *metric, +struct pmu_events_map *map); int metricgroup__parse_groups_test(struct evlist *evlist, struct pmu_events_map *map, const char *str, -- 2.26.2
Re: [PATCHv4 00/19] perf metric: Add support to reuse metric
On 24/03/2021 01:54, Paul A. Clarke wrote: -- Since commit 8989f5f07605 ("perf stat: Update POWER9 metrics to utilize other metrics"), power9 has reused metrics. And I am finding that subtest 10.3 caused problems when I tried to introduce metric reuse on arm64, so I was just asking you to check. Now I am a bit confused... I now understand your original request!:-) The above test was run on a POWER8 system. I do see failures on a POWER9 system: -- $ ./perf test 10 10: PMU events : 10.1: PMU event table sanity: Ok 10.2: PMU event map aliases : Ok 10.3: Parsing of PMU event table metrics: Skip (some metrics failed) 10.4: Parsing of PMU event table metrics with fake PMUs : Ok $ ./perf test --verbose 10 2>&1 | grep -i '^Parse event failed' | wc -l 112 ok, thanks for confirming. I'm looking at fixing it, and the solution doesn't look simple :( John
Re: [PATCHv4 00/19] perf metric: Add support to reuse metric
On 23/03/2021 15:06, Paul A. Clarke wrote: On Mon, Mar 22, 2021 at 11:36:23AM +, John Garry wrote: On 01/08/2020 12:40, Paul A. Clarke wrote: v4 changes: - removed acks from patch because it changed a bit with the last fixes: perf metric: Collect referenced metrics in struct metric_ref_node - fixed runtime metrics [Kajol Jain] - increased recursion depth [Paul A. Clarke] - changed patches due to dependencies: perf metric: Collect referenced metrics in struct metric_ref_node perf metric: Add recursion check when processing nested metrics perf metric: Rename struct egroup to metric perf metric: Rename group_list to metric_list Also available in here: git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git perf/metric I built and ran from the above git branch, and things seem to work. Indeed, I was able to apply my changes to exploit the new capabilities via modifications to tools/perf/pmu-events/arch/powerpc/power9/metrics.json, as I posted earlier (and will submit once this set gets merged). I was just wondering: Does perf subtest 10.3 work ok for you with the metric reuse? That's "Parsing of PMU event table metrics" subtest. I confess I'm not sure what you are asking. Using the latest mainline (84196390620ac0e5070ae36af84c137c6216a7dc), perf subtest 10.3 does pass for me: -- $ ./perf test 10 10: PMU events : 10.1: PMU event table sanity: Ok 10.2: PMU event map aliases : Ok 10.3: Parsing of PMU event table metrics: Ok 10.4: Parsing of PMU event table metrics with fake PMUs : Ok -- Since commit 8989f5f07605 ("perf stat: Update POWER9 metrics to utilize other metrics"), power9 has reused metrics. And I am finding that subtest 10.3 caused problems when I tried to introduce metric reuse on arm64, so I was just asking you to check. Now I am a bit confused... Thanks for checking, john
Re: [PATCH 3/3] iova: Correct comment for free_cpu_cached_iovas()
On 23/03/2021 13:05, Robin Murphy wrote: On 2021-03-01 12:12, John Garry wrote: Function free_cpu_cached_iovas() is not only called when a CPU is hotplugged, so remove that part of the code comment. FWIW I read it as clarifying why this is broken out into a separate function vs. a monolithic "free all cached IOVAs" routine that handles both the per-cpu and global caches it never said "*only* used..." It seems to be implying that. It's only a code comment, so I don't care too much either way and can drop this change. As such I'd hesitate to call it incorrect, but it's certainly arguable whether it needs to be stated or not, especially once the hotplug callsite is now obvious in the same file - on which note the function itself also shouldn't need to be public any more, no? Right, I actually missed deleting iommu_dma_free_cpu_cached_iovas(), so can fix that now. Cheers, John Robin. Signed-off-by: John Garry --- drivers/iommu/iova.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index c78312560425..465b3b0eeeb0 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -996,7 +996,7 @@ static void free_iova_rcaches(struct iova_domain *iovad) } /* - * free all the IOVA ranges cached by a cpu (used when cpu is unplugged) + * free all the IOVA ranges cached by a cpu */ void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad) { .
Re: [PATCH 0/3] iommu/iova: Add CPU hotplug handler to flush rcaches to core code
On 01/03/2021 12:12, John Garry wrote: The Intel IOMMU driver supports flushing the per-CPU rcaches when a CPU is offlined. Let's move it to core code, so everyone can take advantage. Also correct a code comment. Based on v5.12-rc1. Tested on arm64 only. Hi guys, Friendly reminder ... Thanks John John Garry (3): iova: Add CPU hotplug handler to flush rcaches iommu/vt-d: Remove IOVA domain rcache flushing for CPU offlining iova: Correct comment for free_cpu_cached_iovas() drivers/iommu/intel/iommu.c | 31 --- drivers/iommu/iova.c| 32 ++-- include/linux/cpuhotplug.h | 2 +- include/linux/iova.h| 1 + 4 files changed, 32 insertions(+), 34 deletions(-)
Re: arm64 syzbot instances
There's apparently a bit in the PCI spec that reads: The host bus bridge, in PC compatible systems, must return all 1's on a read transaction and discard data on a write transaction when terminated with Master-Abort. which obviously applies only to "PC compatible systems". Right. As far as I can tell, all ARMv8 and most ARMv7 based SoCs do this to be more compatible with PC style operating systems like Linux, but you are right that the specification here does not mandate that, and the older ARMv5 SoCs seem to be compliant as well based on this. Linux has a driver for DPC, which apparently configures it to cause an interrupt to log the event, but it does not hook up the CPU exception handler to this. I don't see an implementation of DPC in qemu, which I take as an indication that it should use the default behavior and cause neither an interrupt nor a CPU exception. Hmm, maybe. We should probably also implement -1/discard just because we're not intending to have 'surprising' behaviour. TBH I'm having difficulty seeing why the kernel should be doing this at all, though. The device tree tells you you have a PCI controller; PCI supports enumeration of devices; you know exactly where everything is mapped because the BARs tell you that. I don't see anything that justifies the kernel in randomly dereferencing areas of the IO or memory windows where it hasn't mapped anything. BIOS has described a CPU-addressable PIO region in the PCI hostbridge, and the kernel has mapped it: [3.974309][T1] pci-host-generic 401000.pcie: IO 0x003eff..0x003eff -> 0x00 So I don't see why any accesses there should fault. You shouldn't be probing for legacy ISA-port devices unless you're on a system which might actually have them (eg an x86 PC). It only happened in this case because there is also a bug in the 8250 serial port driver that is configured to assume four ports exist at port zero. 
On real arm64 hardware, this is apparently harmless because the driver has coped with this for 30 years ;-) There are a few other drivers that assume hardware is accessible at the legacy addresses, and applications can also still open /dev/ioport (if that is enabled at compile time) for the same purpose. Examples could be PC-style mouse/keyboard (emulated by a server BMC), PATA/SATA controllers in pre-AHCI mode, VGA console, and a couple of industrial I/O drivers that have ISA devices behind a PCI bridge. Most other actual ISA add-on card drivers can only be enabled on kernels that support machines with real slots, so you could get them on an i386 kernel running a virtualized x86_64 machine, but not on ARMv6 or later kernels, and you can't run pre-ARMv7 kernels on ARMv8 hardware. There are also lots of the hwmon drivers which use super IO, and probe a fixed PIO addresses for HW detection. These may be enabled on any architecture (apart from PPC, who explicitly disabled them to avoid issues like this). Thanks, John
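That "all 1's on Master-Abort" behaviour quoted from the PCI spec is exactly what legacy probe routines depend on: a read that comes back with every bit set is treated as "nothing there". A sketch of the idiom (the function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/*
 * On PC-compatible host bridges, a read terminated with Master-Abort
 * returns all 1s rather than faulting, so probe code can safely poke a
 * legacy address and interpret 0xffffffff as "no device present".
 */
static int device_present(uint32_t readback)
{
	return readback != 0xffffffffu;
}
```

The thread's point is that this only works when the bridge implements the PC-compatible behaviour; a bridge that faults instead (as qemu's did here) breaks every driver written against this idiom.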
Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator
On 19/03/2021 19:20, Robin Murphy wrote: Hi Robin, So then we have the issue of how to dynamically increase this rcache threshold. The problem is that we may have many devices associated with the same domain. So, in theory, we can't assume that when we increase the threshold that some other device will try to fast free an IOVA which was allocated prior to the increase and was not rounded up. I'm very open to better (or less bad) suggestions on how to do this ... ...but yes, regardless of exactly where it happens, rounding up or not is the problem for rcaches in general. I've said several times that my preferred approach is to not change it that dynamically at all, but instead treat it more like we treat the default domain type. Can you remind me of that idea? I don't remember you mentioning using default domain handling as a reference in any context. Thanks, John
Re: [PATCHv4 00/19] perf metric: Add support to reuse metric
On 01/08/2020 12:40, Paul A. Clarke wrote: v4 changes: - removed acks from patch because it changed a bit with the last fixes: perf metric: Collect referenced metrics in struct metric_ref_node - fixed runtime metrics [Kajol Jain] - increased recursion depth [Paul A. Clarke] - changed patches due to dependencies: perf metric: Collect referenced metrics in struct metric_ref_node perf metric: Add recursion check when processing nested metrics perf metric: Rename struct egroup to metric perf metric: Rename group_list to metric_list Also available in here: git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git perf/metric I built and ran from the above git branch, and things seem to work. Indeed, I was able to apply my changes to exploit the new capabilities via modifications to tools/perf/pmu-events/arch/powerpc/power9/metrics.json, as I posted earlier (and will submit once this set gets merged). Hi Paul, I was just wondering: Does perf subtest 10.3 work ok for you with the metric reuse? That's "Parsing of PMU event table metrics" subtest. Hi Jirka, If I add something like this: { "BriefDescription": "dummy test1", "MetricExpr": "Bad_Speculation + Frontend_Bound", "MetricGroup": "TopdownL1", "MetricName": "dummytest", "PublicDescription": "dummy test2" }, I get "Parse event failed metric 'dummytest' id 'bad_speculation' expr 'bad_speculation + frontend_bound'" Thanks, john Tested-by: Paul A. Clarke One thing I noted, but which also occurs without these patches, is that the perf metrics are not computed unless run as root: -- $ perf stat --metrics br_misprediction_percent command Performance counter stats for 'command': 1,823,530,051 pm_br_pred:u 2,662,705 pm_br_mpred_cmpl:u $ /usr/bin/sudo perf stat --metrics br_misprediction_percent command Performance counter stats for 'command': 1,824,655,269 pm_br_pred# 0.09 br_misprediction_percent 1,654,466 pm_br_mpred_cmpl --
Re: [PATCH v2] scsi: libsas: Reset num_scatter if libata mark qc as NODATA
On 19/03/2021 01:43, Jason Yan wrote: 在 2021/3/19 6:56, Jolly Shah 写道: When the cache_type for the scsi device is changed, the scsi layer issues a MODE_SELECT command. The caching mode details are communicated via a request buffer associated with the scsi command with data direction set as DMA_TO_DEVICE (scsi_mode_select). When this command reaches the libata layer, as a part of generic initial setup, libata layer sets up the scatterlist for the command using the scsi command (ata_scsi_qc_new). This command is then translated by the libata layer into ATA_CMD_SET_FEATURES (ata_scsi_mode_select_xlat). The libata layer treats this as a non data command (ata_mselect_caching), since it only needs an ata taskfile to pass the caching on/off information to the device. It does not need the scatterlist that has been setup, so it does not perform dma_map_sg on the scatterlist (ata_qc_issue). Unfortunately, when this command reaches the libsas layer(sas_ata_qc_issue), libsas layer sees it as a non data command with a scatterlist. It cannot extract the correct dma length, since the scatterlist has not been mapped with dma_map_sg for a DMA operation. When this partially constructed SAS task reaches pm80xx LLDD, it results in below warning. "pm80xx_chip_sata_req 6058: The sg list address start_addr=0x data_len=0x0end_addr_high=0x end_addr_low=0x has crossed 4G boundary" This patch updates code to handle ata non data commands separately so num_scatter and total_xfer_len remain 0. Fixes: 53de092f47ff ("scsi: libsas: Set data_dir as DMA_NONE if libata marks qc as NODATA") Signed-off-by: Jolly Shah Reviewed-by: John Garry @luojiaxing, can you please test this? 
--- v2: - reorganized code to avoid setting num_scatter twice drivers/scsi/libsas/sas_ata.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c index 024e5a550759..8b9a39077dba 100644 --- a/drivers/scsi/libsas/sas_ata.c +++ b/drivers/scsi/libsas/sas_ata.c @@ -201,18 +201,17 @@ static unsigned int sas_ata_qc_issue(struct ata_queued_cmd *qc) memcpy(task->ata_task.atapi_packet, qc->cdb, qc->dev->cdb_len); task->total_xfer_len = qc->nbytes; task->num_scatter = qc->n_elem; + task->data_dir = qc->dma_dir; + } else if (qc->tf.protocol == ATA_PROT_NODATA) { + task->data_dir = DMA_NONE; Hi Jolly & John, We only set DMA_NONE for ATA_PROT_NODATA, I'm curious about why ATA_PROT_NCQ_NODATA and ATAPI_PROT_NODATA do not need to set DMA_NONE? So we can see something like atapi_eh_tur() -> ata_exec_internal(), which is a ATAPI NONDATA and has DMA_NONE, so should be ok. Other cases, like those using the xlate function on the qc for ATA_PROT_NCQ_NODATA, could be checked further. For now, we're just trying to fix the fix. Thanks, Jason } else { for_each_sg(qc->sg, sg, qc->n_elem, si) xfer += sg_dma_len(sg); task->total_xfer_len = xfer; task->num_scatter = si; - } - - if (qc->tf.protocol == ATA_PROT_NODATA) - task->data_dir = DMA_NONE; - else task->data_dir = qc->dma_dir; + } task->scatter = qc->sg; task->ata_task.retry_count = 1; task->task_state_flags = SAS_TASK_STATE_PENDING; .
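For reference, the control flow after this fix can be condensed to the following toy version — the enum/struct names are invented, and the point is simply that the NODATA branch sets DMA_NONE while leaving num_scatter and total_xfer_len at zero, instead of copying a scatterlist that was never DMA-mapped:

```c
#include <assert.h>

enum demo_prot { PROT_ATAPI, PROT_NODATA, PROT_DMA };
enum demo_dir  { DIR_NONE, DIR_TO_DEVICE, DIR_FROM_DEVICE };

struct demo_task { enum demo_dir data_dir; unsigned int num_scatter; };

/* Simplified mirror of the reorganised branches in sas_ata_qc_issue(). */
static void fill_task(struct demo_task *t, enum demo_prot prot,
		      enum demo_dir dma_dir, unsigned int n_elem)
{
	if (prot == PROT_ATAPI) {
		t->num_scatter = n_elem;
		t->data_dir = dma_dir;
	} else if (prot == PROT_NODATA) {
		/* non-data: no scatterlist, direction forced to NONE */
		t->num_scatter = 0;
		t->data_dir = DIR_NONE;
	} else {
		t->num_scatter = n_elem;
		t->data_dir = dma_dir;
	}
}
```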
Re: [RFC PATCH v3 2/3] blk-mq: Freeze and quiesce all queues for tagset in elevator_exit()
On 16/03/2021 19:59, Bart Van Assche wrote: On 3/16/21 10:43 AM, John Garry wrote: On 16/03/2021 17:00, Bart Van Assche wrote: I agree that Jens asked at the end of 2018 not to touch the fast path to fix this use-after-free (maybe that request has been repeated more recently). If Jens or anyone else feels strongly about not clearing hctx->tags->rqs[rq->tag] from the fast path then I will make that change. Hi Bart, Is that possible for this same approach? I need to check the code more.. If the fast path should not be modified, I'm considering to borrow patch 1/3 from your patch series Fine and to add an rcu_barrier() between the code that clears the request pointers and that frees the scheduler requests. And don't we still have the problem that some iter callbacks may sleep/block, which is not allowed in an RCU read-side critical section? Thanks for having brought this up. Since none of the functions that iterate over requests should be called from the hot path of a block driver, I think that we can use srcu_read_(un|)lock() inside bt_iter() and bt_tags_iter() instead of rcu_read_(un|)lock(). OK, but TBH, I am not so familiar with srcu - where you going to try this? Thanks, John
Re: [PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size()
On 19/03/2021 17:00, Robin Murphy wrote: On 2021-03-19 13:25, John Garry wrote: Add a function to allow the max size which we want to optimise DMA mappings for. It seems neat in theory - particularly for packet-based interfaces that might have a known fixed size of data unit that they're working on at any given time - but aren't there going to be many cases where the driver has no idea because it depends on whatever size(s) of request userspace happens to throw at it? Even if it does know the absolute maximum size of thing it could ever transfer, that could be impractically large in areas like video/AI/etc., so it could still be hard to make a reasonable decision. So if you consider the SCSI stack, which is my interest, we know the max segment size and we know the max number of segments per request, so we should know the theoretical upper limit of the actual IOVA length we can get. Indeed, from my experiment on my SCSI host, max IOVA len is found to be 507904, which is PAGE_SIZE * 124 (that is max sg ents there). Incidentally that means that we want RCACHE RANGE MAX of 8, not 6. Being largely workload-dependent is why I still think this should be a command-line or sysfs tuneable - we could set the default based on how much total memory is available, but ultimately it's the end user who knows what the workload is going to be and what they care about optimising for. If that hardware is only found in a server, then the extra memory cost would be trivial, so setting to max is standard approach. Another thought (which I'm almost reluctant to share) is that I would *love* to try implementing a self-tuning strategy that can detect high contention on particular allocation sizes and adjust the caches on the fly, but I can easily imagine that having enough inherent overhead to end up being an impractical (but fun) waste of time. For now, I just want to recover the performance lost recently :) Thanks, John
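The 507904-byte figure translates to rcache terms as follows: 507904 / 4096 = 124 pages, and order_base_2(124) = 7, which is only cacheable if the range limit is at least 8 — hence "8, not 6" above. A sketch of the arithmetic, with order_base_2() reimplemented here to keep the example standalone:

```c
#include <assert.h>

/* ceil(log2(n)), in the spirit of the kernel's order_base_2() */
static unsigned int order_base_2(unsigned long n)
{
	unsigned int order = 0;

	while ((1UL << order) < n)
		order++;
	return order;
}

/*
 * An allocation of `pages` pages is cacheable iff its order is below the
 * rcache range limit (mirrors the log_size check in iova_rcache_insert()).
 * 124 pages (507904 bytes at 4K) needs order 7, so a limit of 8 is needed
 * rather than the default 6.
 */
static int iova_is_cacheable(unsigned long pages, unsigned int range_max)
{
	return order_base_2(pages) < range_max;
}
```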
Re: [PATCH 3/6] iova: Allow rcache range upper limit to be configurable
On 19/03/2021 16:25, Robin Murphy wrote:

On 2021-03-19 13:25, John Garry wrote:

Some LLDs may request DMA mappings whose IOVA length exceeds that of the current rcache upper limit. This means that allocations for those IOVAs will never be cached, and always must be allocated and freed from the RB tree per DMA mapping cycle. This has a significant effect on performance, more so since commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0].

Allow the range of cached IOVAs to be increased, by providing an API to set the upper limit, which is capped at IOVA_RANGE_CACHE_MAX_SIZE.

For simplicity, the full range of IOVA rcaches is allocated and initialized at IOVA domain init time. Setting the range upper limit will fail if the domain is already live (before the tree contains IOVA nodes). This must be done to ensure any IOVAs cached comply with rule of size being a power-of-2.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/

Signed-off-by: John Garry
---
 drivers/iommu/iova.c | 37 +++--
 include/linux/iova.h | 11 ++-
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index cecc74fb8663..d4f2f7fbbd84 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -49,6 +49,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 	iovad->flush_cb = NULL;
 	iovad->fq = NULL;
 	iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
+	iovad->rcache_max_size = IOVA_RANGE_CACHE_DEFAULT_SIZE;
 	rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
 	rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
 	init_iova_rcaches(iovad);
@@ -194,7 +195,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	 * rounding up anything cacheable to make sure that can't happen. The
 	 * order of the unadjusted size will still match upon freeing.
 	 */
-	if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+	if (fast && size < (1 << (iovad->rcache_max_size - 1)))
 		size = roundup_pow_of_two(size);

 	if (size_aligned)
@@ -901,7 +902,7 @@ static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn,
 {
 	unsigned int log_size = order_base_2(size);

-	if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
+	if (log_size >= iovad->rcache_max_size)
 		return false;

 	return __iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn);
@@ -946,6 +947,38 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache,
 	return iova_pfn;
 }

+void iova_rcache_set_upper_limit(struct iova_domain *iovad,
+				 unsigned long iova_len)
+{
+	unsigned int rcache_index = order_base_2(iova_len) + 1;
+	struct rb_node *rb_node = &iovad->anchor.node;
+	unsigned long flags;
+	int count = 0;
+
+	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	if (rcache_index <= iovad->rcache_max_size)
+		goto out;
+
+	while (1) {
+		rb_node = rb_prev(rb_node);
+		if (!rb_node)
+			break;
+		count++;
+	}
+
+	/*
+	 * If there are already IOVA nodes present in the tree, then don't
+	 * allow range upper limit to be set.
+	 */
+	if (count != iovad->reserved_node_count)
+		goto out;
+
+	iovad->rcache_max_size = min_t(unsigned long, rcache_index,
+				       IOVA_RANGE_CACHE_MAX_SIZE);
+out:
+	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+}
+
 /*
  * Try to satisfy IOVA allocation range from rcache.  Fail if requested
  * size is too big or the DMA limit we are given isn't satisfied by the

diff --git a/include/linux/iova.h b/include/linux/iova.h
index fd3217a605b2..952b81b54ef7 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -25,7 +25,8 @@ struct iova {
 struct iova_magazine;
 struct iova_cpu_rcache;

-#define IOVA_RANGE_CACHE_MAX_SIZE 6	/* log of max cached IOVA range size (in pages) */
+#define IOVA_RANGE_CACHE_DEFAULT_SIZE 6
+#define IOVA_RANGE_CACHE_MAX_SIZE 10	/* log of max cached IOVA range size (in pages) */

No. And why?
If we're going to allocate massive caches anyway, whatever is the point of *not* using all of them?

I wanted to keep the same effective threshold for devices today, unless set otherwise. The reason is that I didn't know whether a blanket increase would cause regressions, and I was taking the super-safe road. Specifically, some systems may be very IOVA-space limited, and just work today by not caching large IOVAs.

And in the precursor thread you wrote "We can't arbitrarily *increase* the scope of caching once a domain is active due to the size-rounding-up requirement, which would be prohibitive to larger allocations if applied universally" (sorry for quoting
Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator
On 19/03/2021 16:13, Robin Murphy wrote: On 2021-03-19 13:25, John Garry wrote: Move the IOVA size power-of-2 rcache roundup into the IOVA allocator. This is to eventually make it possible to be able to configure the upper limit of the IOVA rcache range. Signed-off-by: John Garry --- drivers/iommu/dma-iommu.c | 8 -- drivers/iommu/iova.c | 51 ++- 2 files changed, 34 insertions(+), 25 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index af765c813cc8..15b7270a5c2a 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -429,14 +429,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain, shift = iova_shift(iovad); iova_len = size >> shift; - /* - * Freeing non-power-of-two-sized allocations back into the IOVA caches - * will come back to bite us badly, so we have to waste a bit of space - * rounding up anything cacheable to make sure that can't happen. The - * order of the unadjusted size will still match upon freeing. - */ - if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1))) - iova_len = roundup_pow_of_two(iova_len); dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit); diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index e6e2fa85271c..e62e9e30b30c 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -179,7 +179,7 @@ iova_insert_rbtree(struct rb_root *root, struct iova *iova, static int __alloc_and_insert_iova_range(struct iova_domain *iovad, unsigned long size, unsigned long limit_pfn, - struct iova *new, bool size_aligned) + struct iova *new, bool size_aligned, bool fast) { struct rb_node *curr, *prev; struct iova *curr_iova; @@ -188,6 +188,15 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad, unsigned long align_mask = ~0UL; unsigned long high_pfn = limit_pfn, low_pfn = iovad->start_pfn; + /* + * Freeing non-power-of-two-sized allocations back into the IOVA caches + * will come back to bite us badly, so we have to waste a bit of space + * rounding up 
anything cacheable to make sure that can't happen. The
+	 * order of the unadjusted size will still match upon freeing.
+	 */
+	if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+		size = roundup_pow_of_two(size);

If this transformation is only relevant to alloc_iova_fast(), and we have to add a special parameter here to tell whether we were called from alloc_iova_fast(), doesn't it seem more sensible to just do it in alloc_iova_fast() rather than here?

We have the restriction that anything we put in the rcache needs to be a power-of-2.

So then we have the issue of how to dynamically increase this rcache threshold. The problem is that we may have many devices associated with the same domain. So, in theory, we can't assume that when we increase the threshold some other device will not try to fast free an IOVA which was allocated prior to the increase and was not rounded up.

I'm very open to better (or less bad) suggestions on how to do this ...

I could say that we only allow this for a group with a single device, so these sorts of things don't have to be worried about, but even then the iommu_group internals are not readily accessible here.

But then the API itself has no strict requirement that a pfn passed to free_iova_fast() wasn't originally allocated with alloc_iova(), so arguably hiding the adjustment away makes it less clear that the responsibility is really on any caller of free_iova_fast() to make sure they don't get things wrong.

alloc_iova() doesn't round up to a power-of-2, so wouldn't it be broken to do that?

Cheers,
John
Re: [PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured
On 19/03/2021 13:40, Christoph Hellwig wrote:

On Fri, Mar 19, 2021 at 09:25:42PM +0800, John Garry wrote:

For streaming DMA mappings involving an IOMMU and whose IOVA len regularly exceeds the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. This is much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0]. IOVAs which cannot be cached are highly involved in the IOVA aging issue, as discussed at [1].

I'm confused. If this is a limit in the IOVA allocator, dma-iommu should be able to just not grow the allocation so large without help from the driver.

This is not an issue with the IOVA allocator. The issue is with how the IOVA code handles caching of IOVAs. Specifically, when we DMA unmap, for an IOVA whose length is above a fixed threshold, the IOVA is freed, rather than being cached. See free_iova_fast().

For performance reasons, I want that threshold increased for my driver to avail of the caching of all lengths of IOVA which we may see - currently we see IOVAs whose length exceeds that threshold. But it may not be good to increase that threshold for everyone.

> If contrary to the above description it is device-specific, the driver
> could simply use dma_get_max_seg_size().

But that is for a single segment, right? Is there something equivalent to tell how many scatter-gather elements we may generate, like scsi_host_template.sg_tablesize?

Thanks,
John
[PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator
Move the IOVA size power-of-2 rcache roundup into the IOVA allocator. This is to eventually make it possible to be able to configure the upper limit of the IOVA rcache range. Signed-off-by: John Garry --- drivers/iommu/dma-iommu.c | 8 -- drivers/iommu/iova.c | 51 ++- 2 files changed, 34 insertions(+), 25 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index af765c813cc8..15b7270a5c2a 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -429,14 +429,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain, shift = iova_shift(iovad); iova_len = size >> shift; - /* -* Freeing non-power-of-two-sized allocations back into the IOVA caches -* will come back to bite us badly, so we have to waste a bit of space -* rounding up anything cacheable to make sure that can't happen. The -* order of the unadjusted size will still match upon freeing. -*/ - if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1))) - iova_len = roundup_pow_of_two(iova_len); dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit); diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index e6e2fa85271c..e62e9e30b30c 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -179,7 +179,7 @@ iova_insert_rbtree(struct rb_root *root, struct iova *iova, static int __alloc_and_insert_iova_range(struct iova_domain *iovad, unsigned long size, unsigned long limit_pfn, - struct iova *new, bool size_aligned) + struct iova *new, bool size_aligned, bool fast) { struct rb_node *curr, *prev; struct iova *curr_iova; @@ -188,6 +188,15 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad, unsigned long align_mask = ~0UL; unsigned long high_pfn = limit_pfn, low_pfn = iovad->start_pfn; + /* +* Freeing non-power-of-two-sized allocations back into the IOVA caches +* will come back to bite us badly, so we have to waste a bit of space +* rounding up anything cacheable to make sure that can't happen. 
The +* order of the unadjusted size will still match upon freeing. +*/ + if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1))) + size = roundup_pow_of_two(size); + if (size_aligned) align_mask <<= fls_long(size - 1); @@ -288,21 +297,10 @@ void iova_cache_put(void) } EXPORT_SYMBOL_GPL(iova_cache_put); -/** - * alloc_iova - allocates an iova - * @iovad: - iova domain in question - * @size: - size of page frames to allocate - * @limit_pfn: - max limit address - * @size_aligned: - set if size_aligned address range is required - * This function allocates an iova in the range iovad->start_pfn to limit_pfn, - * searching top-down from limit_pfn to iovad->start_pfn. If the size_aligned - * flag is set then the allocated address iova->pfn_lo will be naturally - * aligned on roundup_power_of_two(size). - */ -struct iova * -alloc_iova(struct iova_domain *iovad, unsigned long size, +static struct iova * +__alloc_iova(struct iova_domain *iovad, unsigned long size, unsigned long limit_pfn, - bool size_aligned) + bool size_aligned, bool fast) { struct iova *new_iova; int ret; @@ -312,7 +310,7 @@ alloc_iova(struct iova_domain *iovad, unsigned long size, return NULL; ret = __alloc_and_insert_iova_range(iovad, size, limit_pfn + 1, - new_iova, size_aligned); + new_iova, size_aligned, fast); if (ret) { free_iova_mem(new_iova); @@ -321,6 +319,25 @@ alloc_iova(struct iova_domain *iovad, unsigned long size, return new_iova; } + +/** + * alloc_iova - allocates an iova + * @iovad: - iova domain in question + * @size: - size of page frames to allocate + * @limit_pfn: - max limit address + * @size_aligned: - set if size_aligned address range is required + * This function allocates an iova in the range iovad->start_pfn to limit_pfn, + * searching top-down from limit_pfn to iovad->start_pfn. If the size_aligned + * flag is set then the allocated address iova->pfn_lo will be naturally + * aligned on roundup_power_of_two(size). 
+ */ +struct iova * +alloc_iova(struct iova_domain *iovad, unsigned long size, + unsigned long limit_pfn, + bool size_aligned) +{ + return __alloc_iova(iovad, size, limit_pfn, size_aligned, false); +} EXPORT_SYMBOL_GPL(alloc_iova); static struct iova * @@ -433,7 +450,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size, return iova_pfn; retry: - new_iova = alloc_iova(iovad, size, limit_pfn, true); + new_iova = __alloc_iova(iovad, size, limit_pfn, true, true); if (!new_iova) { unsigned int cpu; -- 2.26.2
[PATCH 2/6] iova: Add a per-domain count of reserved nodes
To help learn if the domain has regular IOVA nodes, add a count of reserved nodes, calculated at init time.

Signed-off-by: John Garry
---
 drivers/iommu/iova.c | 2 ++
 include/linux/iova.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index e62e9e30b30c..cecc74fb8663 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -717,6 +717,8 @@ reserve_iova(struct iova_domain *iovad,
 	 * or need to insert remaining non overlap addr range
 	 */
 	iova = __insert_new_range(iovad, pfn_lo, pfn_hi);
+	if (iova)
+		iovad->reserved_node_count++;

 finish:
 	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);

diff --git a/include/linux/iova.h b/include/linux/iova.h
index c834c01c0a5b..fd3217a605b2 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -95,6 +95,7 @@ struct iova_domain {
 						   flush-queues */
 	atomic_t fq_timer_on;			/* 1 when timer is active, 0 when not */
+	int reserved_node_count;
 };

 static inline unsigned long iova_size(struct iova *iova)
-- 
2.26.2
[PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size()
Add a function to allow the max size which we want to optimise DMA mappings for. Signed-off-by: John Garry --- drivers/iommu/dma-iommu.c | 2 +- include/linux/dma-map-ops.h | 1 + include/linux/dma-mapping.h | 5 + kernel/dma/mapping.c| 11 +++ 4 files changed, 18 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index a5dfbd6c0496..d35881fcfb9c 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -447,7 +447,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain, return (dma_addr_t)iova << shift; } -__maybe_unused static void iommu_dma_set_opt_size(struct device *dev, size_t size) { struct iommu_domain *domain = iommu_get_dma_domain(dev); @@ -1278,6 +1277,7 @@ static const struct dma_map_ops iommu_dma_ops = { .map_resource = iommu_dma_map_resource, .unmap_resource = iommu_dma_unmap_resource, .get_merge_boundary = iommu_dma_get_merge_boundary, + .set_max_opt_size = iommu_dma_set_opt_size, }; /* diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h index 51872e736e7b..fed7a183b3b9 100644 --- a/include/linux/dma-map-ops.h +++ b/include/linux/dma-map-ops.h @@ -64,6 +64,7 @@ struct dma_map_ops { u64 (*get_required_mask)(struct device *dev); size_t (*max_mapping_size)(struct device *dev); unsigned long (*get_merge_boundary)(struct device *dev); + void (*set_max_opt_size)(struct device *dev, size_t size); }; #ifdef CONFIG_DMA_OPS diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 2a984cb4d1e0..91fe770145d4 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -144,6 +144,7 @@ u64 dma_get_required_mask(struct device *dev); size_t dma_max_mapping_size(struct device *dev); bool dma_need_sync(struct device *dev, dma_addr_t dma_addr); unsigned long dma_get_merge_boundary(struct device *dev); +void dma_set_max_opt_size(struct device *dev, size_t size); #else /* CONFIG_HAS_DMA */ static inline dma_addr_t dma_map_page_attrs(struct 
device *dev, struct page *page, size_t offset, size_t size, @@ -257,6 +258,10 @@ static inline unsigned long dma_get_merge_boundary(struct device *dev) { return 0; } +static inline void dma_set_max_opt_size(struct device *dev, size_t size) +{ +} + #endif /* CONFIG_HAS_DMA */ struct page *dma_alloc_pages(struct device *dev, size_t size, diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index b6a633679933..59e6acb1c471 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -608,3 +608,14 @@ unsigned long dma_get_merge_boundary(struct device *dev) return ops->get_merge_boundary(dev); } EXPORT_SYMBOL_GPL(dma_get_merge_boundary); + +void dma_set_max_opt_size(struct device *dev, size_t size) +{ + const struct dma_map_ops *ops = get_dma_ops(dev); + + if (!ops || !ops->set_max_opt_size) + return; + + ops->set_max_opt_size(dev, size); +} +EXPORT_SYMBOL_GPL(dma_set_max_opt_size); -- 2.26.2
[PATCH 4/6] iommu: Add iommu_dma_set_opt_size()
Add a function which allows the max optimised IOMMU DMA size to be set.

Signed-off-by: John Garry
---
 drivers/iommu/dma-iommu.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 15b7270a5c2a..a5dfbd6c0496 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -447,6 +447,21 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
 	return (dma_addr_t)iova << shift;
 }

+__maybe_unused
+static void iommu_dma_set_opt_size(struct device *dev, size_t size)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	unsigned long shift, iova_len;
+
+	shift = iova_shift(iovad);
+	iova_len = size >> shift;
+	iova_len = roundup_pow_of_two(iova_len);
+
+	iova_rcache_set_upper_limit(iovad, iova_len);
+}
+
 static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
 		dma_addr_t iova, size_t size, struct page *freelist)
 {
-- 
2.26.2
[PATCH 0/6] dma mapping/iommu: Allow IOMMU IOVA rcache range to be configured
For streaming DMA mappings involving an IOMMU and whose IOVA len regularly exceeds the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. This is much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0]. IOVAs which cannot be cached are highly involved in the IOVA aging issue, as discussed at [1].

This series attempts to allow the device driver to hint at what the upper limit of its DMA mapping IOVA lengths would be, so that the caching range may be increased.

Some figures for a storage scenario:
v5.12-rc3 baseline:           600K IOPS
With series:                 1300K IOPS
With 4e89dce72521 reverted:  1250K IOPS

All of the above are for IOMMU strict mode. Non-strict mode gives ~1750K IOPS in all scenarios.

I will say that the APIs and their semantics are a bit ropey - any better ideas welcome...

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] https://lore.kernel.org/linux-iommu/1607538189-237944-1-git-send-email-john.ga...@huawei.com/

John Garry (6):
  iommu: Move IOVA power-of-2 roundup into allocator
  iova: Add a per-domain count of reserved nodes
  iova: Allow rcache range upper limit to be configurable
  iommu: Add iommu_dma_set_opt_size()
  dma-mapping/iommu: Add dma_set_max_opt_size()
  scsi: hisi_sas: Set max optimal DMA size for v3 hw

 drivers/iommu/dma-iommu.c              | 23 ---
 drivers/iommu/iova.c                   | 88 --
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c |  2 +
 include/linux/dma-map-ops.h            |  1 +
 include/linux/dma-mapping.h            |  5 ++
 include/linux/iova.h                   | 12 +++-
 kernel/dma/mapping.c                   | 11 
 7 files changed, 115 insertions(+), 27 deletions(-)

-- 
2.26.2
[PATCH 3/6] iova: Allow rcache range upper limit to be configurable
Some LLDs may request DMA mappings whose IOVA length exceeds that of the current rcache upper limit. This means that allocations for those IOVAs will never be cached, and always must be allocated and freed from the RB tree per DMA mapping cycle. This has a significant effect on performance, more so since commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0].

Allow the range of cached IOVAs to be increased, by providing an API to set the upper limit, which is capped at IOVA_RANGE_CACHE_MAX_SIZE.

For simplicity, the full range of IOVA rcaches is allocated and initialized at IOVA domain init time. Setting the range upper limit will fail if the domain is already live (before the tree contains IOVA nodes). This must be done to ensure any IOVAs cached comply with rule of size being a power-of-2.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/

Signed-off-by: John Garry
---
 drivers/iommu/iova.c | 37 +++--
 include/linux/iova.h | 11 ++-
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index cecc74fb8663..d4f2f7fbbd84 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -49,6 +49,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule,
 	iovad->flush_cb = NULL;
 	iovad->fq = NULL;
 	iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
+	iovad->rcache_max_size = IOVA_RANGE_CACHE_DEFAULT_SIZE;
 	rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
 	rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
 	init_iova_rcaches(iovad);
@@ -194,7 +195,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 	 * rounding up anything cacheable to make sure that can't happen. The
 	 * order of the unadjusted size will still match upon freeing.
 	 */
-	if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+	if (fast && size < (1 << (iovad->rcache_max_size - 1)))
 		size = roundup_pow_of_two(size);

 	if (size_aligned)
@@ -901,7 +902,7 @@ static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn,
 {
 	unsigned int log_size = order_base_2(size);

-	if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
+	if (log_size >= iovad->rcache_max_size)
 		return false;

 	return __iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn);
@@ -946,6 +947,38 @@ static unsigned long __iova_rcache_get(struct iova_rcache *rcache,
 	return iova_pfn;
 }

+void iova_rcache_set_upper_limit(struct iova_domain *iovad,
+				 unsigned long iova_len)
+{
+	unsigned int rcache_index = order_base_2(iova_len) + 1;
+	struct rb_node *rb_node = &iovad->anchor.node;
+	unsigned long flags;
+	int count = 0;
+
+	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	if (rcache_index <= iovad->rcache_max_size)
+		goto out;
+
+	while (1) {
+		rb_node = rb_prev(rb_node);
+		if (!rb_node)
+			break;
+		count++;
+	}
+
+	/*
+	 * If there are already IOVA nodes present in the tree, then don't
+	 * allow range upper limit to be set.
+	 */
+	if (count != iovad->reserved_node_count)
+		goto out;
+
+	iovad->rcache_max_size = min_t(unsigned long, rcache_index,
+				       IOVA_RANGE_CACHE_MAX_SIZE);
+out:
+	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+}
+
 /*
  * Try to satisfy IOVA allocation range from rcache.  Fail if requested
  * size is too big or the DMA limit we are given isn't satisfied by the

diff --git a/include/linux/iova.h b/include/linux/iova.h
index fd3217a605b2..952b81b54ef7 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -25,7 +25,8 @@ struct iova {
 struct iova_magazine;
 struct iova_cpu_rcache;

-#define IOVA_RANGE_CACHE_MAX_SIZE 6	/* log of max cached IOVA range size (in pages) */
+#define IOVA_RANGE_CACHE_DEFAULT_SIZE 6
+#define IOVA_RANGE_CACHE_MAX_SIZE 10	/* log of max cached IOVA range size (in pages) */
 #define MAX_GLOBAL_MAGS 32	/* magazines per bin */

 struct iova_rcache {
@@ -74,6 +75,7 @@ struct iova_domain {
 	unsigned long	start_pfn;	/* Lower limit for this domain */
 	unsigned long	dma_32bit_pfn;
 	unsigned long	max32_alloc_size; /* Size of last failed allocation */
+	unsigned long	rcache_max_size; /* Upper limit of cached IOVA RANGE */
 	struct iova_fq __percpu *fq;	/* Flush Queue */
 	atomic64_t fq_flush_start_cnt;	/* Number of TLB flushes that
@@ -158,6 +160,8 @@ int init_iova_flush_queue(struct iova_domain *iovad,
struct iova *find_
[PATCH 6/6] scsi: hisi_sas: Set max optimal DMA size for v3 hw
For IOMMU strict mode, this more than doubles throughput in some scenarios.

Signed-off-by: John Garry
---
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index 4580e081e489..2f77b418bbeb 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -4684,6 +4684,8 @@ hisi_sas_v3_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto err_out_regions;
 	}

+	dma_set_max_opt_size(dev, PAGE_SIZE * HISI_SAS_SGE_PAGE_CNT);
+
 	shost = hisi_sas_shost_alloc_pci(pdev);
 	if (!shost) {
 		rc = -ENOMEM;
-- 
2.26.2
Re: [PATCH] scsi: libsas: Reset num_scatter if libata mark qc as NODATA
On 18/03/2021 00:24, Jolly Shah wrote:

Hi John,

Thanks for the review.

On Wed, Mar 17, 2021 at 4:44 AM John Garry wrote:

On 16/03/2021 19:39, Jolly Shah wrote:

When the cache_type for the scsi device is changed, the scsi layer issues a MODE_SELECT command. The caching mode details are communicated via a request buffer associated with the scsi command with data direction set as DMA_TO_DEVICE (scsi_mode_select). When this command reaches the libata layer, as a part of generic initial setup, the libata layer sets up the scatterlist for the command using the scsi command (ata_scsi_qc_new). This command is then translated by the libata layer into ATA_CMD_SET_FEATURES (ata_scsi_mode_select_xlat).

The libata layer treats this as a non data command (ata_mselect_caching), since it only needs an ata taskfile to pass the caching on/off information to the device. It does not need the scatterlist that has been setup, so it does not perform dma_map_sg on the scatterlist (ata_qc_issue).

So if we don't perform the dma_map_sg() on the sgl at this point, then it seems to me that we should not perform for_each_sg() on it either, right? That is still what happens in sas_ata_qc_issue() in this case.

Hi Jolly Shah,

Yes that's right. To avoid that, I can add an else-if block for ATA_PROT_NODATA before for_each_sg() is performed. Currently there's an existing code block for ATA_PROT_NODATA after for_each_sg() is performed, which I reused to reset num_scatter. Please suggest.

How about just combining the 2x if-else statements into 1x if-elif-else statement, like:

	if (ata_is_atapi(qc->tf.protocol)) {
		memcpy(task->ata_task.atapi_packet, qc->cdb, qc->dev->cdb_len);
		task->total_xfer_len = qc->nbytes;
		task->num_scatter = qc->n_elem;
		task->data_dir = qc->dma_dir;
	} else if (qc->tf.protocol == ATA_PROT_NODATA) {
		task->data_dir = DMA_NONE;
	} else {
		for_each_sg(qc->sg, sg, qc->n_elem, si)
			xfer += sg_dma_len(sg);
		task->total_xfer_len = xfer;
		task->num_scatter = si;
		task->data_dir = qc->dma_dir;
	}

Unfortunately, when this command reaches the libsas layer (sas_ata_qc_issue), the libsas layer sees it as a non data command with a scatterlist. It cannot extract the correct dma length, since the scatterlist has not been mapped with dma_map_sg for a DMA operation. When this partially constructed SAS task reaches the pm80xx LLDD, it results in the below warning:

"pm80xx_chip_sata_req 6058: The sg list address start_addr=0x data_len=0x0end_addr_high=0x end_addr_low=0x has crossed 4G boundary"

This patch assigns an appropriate value to num_scatter for ata non data commands.

Signed-off-by: Jolly Shah
---
 drivers/scsi/libsas/sas_ata.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index 024e5a550759..94ec08cebbaa 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -209,10 +209,12 @@ static unsigned int sas_ata_qc_issue(struct ata_queued_cmd *qc)
 		task->num_scatter = si;
 	}

-	if (qc->tf.protocol == ATA_PROT_NODATA)
+	if (qc->tf.protocol == ATA_PROT_NODATA) {
 		task->data_dir = DMA_NONE;
-	else
+		task->num_scatter = 0;

task->num_scatter has already been set in this function. Best not set it twice.

Sure. Please suggest if I should update the patch to the above suggested approach. That will avoid setting num_scatter twice.

Thanks,
John

BTW, could we add a Fixes: tag?
Re: [PATCH 2/2] iommu/iova: Improve restart logic
Well yeah, in your particular case you're allocating from a heavily over-contended address space, so much of the time it is genuinely full. Plus you're primarily churning one or two sizes of IOVA, so there's a high chance that you will either allocate immediately from the cached node (after a previous free), or search the whole space and fail. In case it was missed, searching only some arbitrary subset of the space before giving up is not a good behaviour for an allocator to have in general.

So since the retry means that we search through the complete pfn range most of the time (due to poor success rate), we should be able to do a better job at maintaining an accurate max alloc size, by calculating it from the range search, and not relying on max alloc failed or resetting it frequently. Hopefully that would mean that we're smarter about not trying the allocation. So I tried that out and we seem to be able to scrap back an appreciable amount of performance. Maybe 80% of original, with another change, below.

TBH if you really want to make allocation more efficient I think there are more radical changes that would be worth experimenting with, like using some form of augmented rbtree to also encode the amount of free space under each branch, or representing the free space in its own parallel tree, or whether some other structure entirely might be a better bet these days. And if you just want to make your thing acceptably fast, now I'm going to say stick a quirk somewhere to force the "forcedac" option on your platform ;)

Easier said than done :) But still, I'd like to just be able to cache all IOVA sizes for my DMA engine, so we should not have to go near the RB tree often.

I have put together a series to allow the upper limit of the rcache range to be increased per domain. So naturally that gives better performance than we originally had. I don't want to prejudice the solution by saying what I think of it now, so will send it out...

[...]
@@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
 		if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
 			high_pfn = limit_pfn;
 			low_pfn = retry_pfn;
-			curr = &iovad->anchor.node;
+			curr = iova_find_limit(iovad, limit_pfn);

I see that it is now applied. However, alternatively could we just add a zero-length 32b boundary marker node for the 32b pfn restart point?

That would need special cases all over the place to prevent the marker getting merged into reservations or hit by lookups, and at worst break the ordering of the tree if a legitimate node straddles the boundary. I did consider having the insert/delete routines keep track of yet another cached node for whatever's currently the first thing above the 32-bit boundary, but I was worried that might be a bit too invasive.

Yeah, I did think of that. I don't think that it would have too much overhead.

FWIW I'm currently planning to come back to this again when I have a bit more time, since the optimum thing to do (modulo replacing the entire algorithm...) is actually to make the second part of the search *upwards* from the cached node to the limit. Furthermore, to revive my arch/arm conversion I think we're realistically going to need a compatibility option for bottom-up allocation to avoid too many nasty surprises, so I'd like to generalise things to tackle both concerns at once.

Thanks,
John
Re: [PATCH 2/2] iommu/iova: Improve restart logic
On 10/03/2021 17:47, John Garry wrote: On 09/03/2021 15:55, John Garry wrote: On 05/03/2021 16:35, Robin Murphy wrote: Hi Robin, When restarting after searching below the cached node fails, resetting the start point to the anchor node is often overly pessimistic. If allocations are made with mixed limits - particularly in the case of the opportunistic 32-bit allocation for PCI devices - this could mean significant time wasted walking through the whole populated upper range just to reach the initial limit. Right We can improve on that by implementing a proper tree traversal to find the first node above the relevant limit, and set the exact start point. Thanks for this. However, as mentioned in the other thread, this does not help our performance regression from commit 4e89dce72521. And mentioning this "retry" approach again, in case it was missed, from my experiment on the affected HW, it also has generally a dreadfully low success rate - less than 1% for the 32b range retry. Retry rate is about 20%. So I am saying from this 20%, less than 1% of those succeed. So since the retry means that we search through the complete pfn range most of the time (due to poor success rate), we should be able to do a better job at maintaining an accurate max alloc size, by calculating it from the range search, and not relying on max alloc failed or resetting it frequently. Hopefully that would mean that we're smarter about not trying the allocation. So I tried that out and we seem to be able to scrap back an appreciable amount of performance. Maybe 80% of original, with another change, below. 
Thanks, John Signed-off-by: Robin Murphy --- drivers/iommu/iova.c | 39 ++- 1 file changed, 38 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index c28003e1d2ee..471c48dd71e7 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -154,6 +154,43 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free) iovad->cached_node = rb_next(&free->node); } +static struct rb_node *iova_find_limit(struct iova_domain *iovad, unsigned long limit_pfn) +{ + struct rb_node *node, *next; + /* + * Ideally what we'd like to judge here is whether limit_pfn is close + * enough to the highest-allocated IOVA that starting the allocation + * walk from the anchor node will be quicker than this initial work to + * find an exact starting point (especially if that ends up being the + * anchor node anyway). This is an incredibly crude approximation which + * only really helps the most likely case, but is at least trivially easy. + */ + if (limit_pfn > iovad->dma_32bit_pfn) + return &iovad->anchor.node; + + node = iovad->rbroot.rb_node; + while (to_iova(node)->pfn_hi < limit_pfn) + node = node->rb_right; + +search_left: + while (node->rb_left && to_iova(node->rb_left)->pfn_lo >= limit_pfn) + node = node->rb_left; + + if (!node->rb_left) + return node; + + next = node->rb_left; + while (next->rb_right) { + next = next->rb_right; + if (to_iova(next)->pfn_lo >= limit_pfn) { + node = next; + goto search_left; + } + } + + return node; +} + /* Insert the iova into domain rbtree by holding writer lock */ static void iova_insert_rbtree(struct rb_root *root, struct iova *iova, @@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad, if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) { high_pfn = limit_pfn; low_pfn = retry_pfn; - curr = &iovad->anchor.node; + curr = iova_find_limit(iovad, limit_pfn); I see that it is now applied. 
However, alternatively could we just add a zero-length 32b boundary marker node for the 32b pfn restart point? curr_iova = to_iova(curr); goto retry; }
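For illustration, the query that the iova_find_limit() rb-tree walk answers - the first allocated range whose start pfn is at or above a given limit - can be modelled on a flat sorted list (a toy sketch, not the kernel structures; the kernel does the same ordered search directly on the rb-tree):

```python
import bisect

def find_limit(iovas, limit_pfn):
    """iovas: list of (pfn_lo, pfn_hi) allocations sorted by pfn_lo.
    Return the index of the first range whose pfn_lo is at or above
    limit_pfn -- the point from which the retry walk should resume."""
    starts = [lo for lo, _hi in iovas]
    return bisect.bisect_left(starts, limit_pfn)
```

With allocations [(0, 5), (10, 20), (32, 40)], a limit of 8 resumes at index 1, i.e. the (10, 20) range, instead of walking from the anchor.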
Re: [PATCH] scsi: libsas: Reset num_scatter if libata mark qc as NODATA
On 16/03/2021 19:39, Jolly Shah wrote: When the cache_type for the scsi device is changed, the scsi layer issues a MODE_SELECT command. The caching mode details are communicated via a request buffer associated with the scsi command with data direction set as DMA_TO_DEVICE (scsi_mode_select). When this command reaches the libata layer, as a part of generic initial setup, libata layer sets up the scatterlist for the command using the scsi command (ata_scsi_qc_new). This command is then translated by the libata layer into ATA_CMD_SET_FEATURES (ata_scsi_mode_select_xlat). The libata layer treats this as a non data command (ata_mselect_caching), since it only needs an ata taskfile to pass the caching on/off information to the device. It does not need the scatterlist that has been setup, so it does not perform dma_map_sg on the scatterlist (ata_qc_issue). So if we don't perform the dma_map_sg() on the sgl at this point, then it seems to me that we should not perform for_each_sg() on it either, right? That is still what happens in sas_ata_qc_issue() in this case. Unfortunately, when this command reaches the libsas layer (sas_ata_qc_issue), libsas layer sees it as a non data command with a scatterlist. It cannot extract the correct dma length, since the scatterlist has not been mapped with dma_map_sg for a DMA operation. When this partially constructed SAS task reaches pm80xx LLDD, it results in below warning. "pm80xx_chip_sata_req 6058: The sg list address start_addr=0x data_len=0x0end_addr_high=0x end_addr_low=0x has crossed 4G boundary" This patch assigns appropriate value to num_scatter for ata non data commands. 
Signed-off-by: Jolly Shah --- drivers/scsi/libsas/sas_ata.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c index 024e5a550759..94ec08cebbaa 100644 --- a/drivers/scsi/libsas/sas_ata.c +++ b/drivers/scsi/libsas/sas_ata.c @@ -209,10 +209,12 @@ static unsigned int sas_ata_qc_issue(struct ata_queued_cmd *qc) task->num_scatter = si; } - if (qc->tf.protocol == ATA_PROT_NODATA) + if (qc->tf.protocol == ATA_PROT_NODATA) { task->data_dir = DMA_NONE; - else + task->num_scatter = 0; task->num_scatter has already been set in this function. Best not set it twice. Thanks, John + } else { task->data_dir = qc->dma_dir; + } task->scatter = qc->sg; task->ata_task.retry_count = 1; task->task_state_flags = SAS_TASK_STATE_PENDING;
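As a toy model of the control flow above (plain Python, not the real libsas structures): a NODATA command keeps DMA_NONE and a zero scatter count even though an unmapped scatterlist may still be attached, so the sgl is never walked:

```python
def build_task(protocol, sg, dma_dir):
    """Sketch of sas_ata_qc_issue() filling a sas_task; 'sg' is a list of
    (addr, length) tuples standing in for an (unmapped) scatterlist."""
    task = {'data_dir': None, 'num_scatter': 0, 'total_len': 0}
    if protocol == 'NODATA':
        # No dma_map_sg() was done, so do not walk the sgl here.
        task['data_dir'] = 'DMA_NONE'
        task['num_scatter'] = 0
    else:
        task['data_dir'] = dma_dir
        task['num_scatter'] = len(sg)
        task['total_len'] = sum(length for _addr, length in sg)
    return task
```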
Re: [RFC PATCH v3 2/3] blk-mq: Freeze and quiesce all queues for tagset in elevator_exit()
On 16/03/2021 17:00, Bart Van Assche wrote: On 3/16/21 9:15 AM, John Garry wrote: I'll have a look at this ASAP - a bit busy. But a quick scan and I notice this: > @@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx, > struct request *rq) > { > blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag); > + rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL); Wasn't a requirement to not touch the fastpath at all, including even if only NULLifying a pointer? IIRC, Kashyap some time ago had a patch like above (but without RCU usage), but the request from Jens was to not touch the fastpath. Maybe I'm mistaken - I will try to dig up the thread. Hi Bart, I agree that Jens asked at the end of 2018 not to touch the fast path to fix this use-after-free (maybe that request has been repeated more recently). If Jens or anyone else feels strongly about not clearing hctx->tags->rqs[rq->tag] from the fast path then I will make that change. Is that possible for this same approach? I need to check the code more. And don't we still have the problem that some iter callbacks may sleep/block, which is not allowed in an RCU read-side critical section? My motivation for clearing these pointers from the fast path is as follows: - This results in code that is easier to read and easier to maintain. - Every modern CPU pipelines store instructions so the performance impact of adding an additional store should be small. - Since the block layer has a tendency to reuse tags that have been freed recently, it is likely that hctx->tags->rqs[rq->tag] will be used for a next request and hence that it will have to be loaded into the CPU cache anyway. Those points make sense to me, but obviously it's the maintainers' call. Thanks, John
Re: [RFC PATCH v3 2/3] blk-mq: Freeze and quiesce all queues for tagset in elevator_exit()
Hi Bart, I'll have a look at this ASAP - a bit busy. But a quick scan and I notice this: > @@ -226,6 +226,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx, > struct request *rq) > { > blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag); > + rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL); Wasn't a requirement to not touch the fastpath at all, including even if only NULLifying a pointer? IIRC, Kashyap some time ago had a patch like above (but without RCU usage), but the request from Jens was to not touch the fastpath. Maybe I'm mistaken - I will try to dig up the thread. Thanks, John How about replacing the entire patch series with the patch below? The patch below has the following advantages: - Instead of making the race window smaller, the race is fixed completely. - No new atomic instructions are added to the block layer code. - No early return is inserted in blk_mq_tagset_busy_iter(). Thanks, Bart. From a0e534012a766bd6e53cdd466eec0a811164c12a Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Wed, 10 Mar 2021 19:11:47 -0800 Subject: [PATCH] blk-mq: Fix races between iterating over requests and freeing requests Multiple users have reported use-after-free complaints similar to the following (see also https://lore.kernel.org/linux-block/1545261885.185366.488.ca...@acm.org/): BUG: KASAN: use-after-free in bt_iter+0x86/0xf0 Read of size 8 at addr 88803b335240 by task fio/21412 CPU: 0 PID: 21412 Comm: fio Tainted: GW 4.20.0-rc6-dbg+ #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: dump_stack+0x86/0xca print_address_description+0x71/0x239 kasan_report.cold.5+0x242/0x301 __asan_load8+0x54/0x90 bt_iter+0x86/0xf0 blk_mq_queue_tag_busy_iter+0x373/0x5e0 blk_mq_in_flight+0x96/0xb0 part_in_flight+0x40/0x140 part_round_stats+0x18e/0x370 blk_account_io_start+0x3d7/0x670 blk_mq_bio_to_request+0x19c/0x3a0 blk_mq_make_request+0x7a9/0xcb0 generic_make_request+0x41d/0x960 submit_bio+0x9b/0x250 
do_blockdev_direct_IO+0x435c/0x4c70 __blockdev_direct_IO+0x79/0x88 ext4_direct_IO+0x46c/0xc00 generic_file_direct_write+0x119/0x210 __generic_file_write_iter+0x11c/0x280 ext4_file_write_iter+0x1b8/0x6f0 aio_write+0x204/0x310 io_submit_one+0x9d3/0xe80 __x64_sys_io_submit+0x115/0x340 do_syscall_64+0x71/0x210 When multiple request queues share a tag set and when switching the I/O scheduler for one of the request queues that uses this tag set, the following race can happen: * blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns a pointer to a scheduler request to the local variable 'rq'. * blk_mq_sched_free_requests() is called to free hctx->sched_tags. * blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free. Fix this race as follows: * Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[]. * Protect hctx->tags->rqs[] reads with an RCU read-side lock. * No new rcu_barrier() call has been added because clearing the request pointer in hctx->tags->rqs[] happens before blk_queue_exit() and the blk_freeze_queue() call in blk_cleanup_queue() triggers an RCU barrier after all scheduler request pointers associated with a request queue have been removed from hctx->tags->rqs[] and before these scheduler requests are freed. 
Signed-off-by: Bart Van Assche --- block/blk-mq-tag.c | 27 +-- block/blk-mq-tag.h | 2 +- block/blk-mq.c | 10 ++ block/blk-mq.h | 1 + 4 files changed, 25 insertions(+), 15 deletions(-) diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 9c92053e704d..8351c3f2fe2d 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -206,18 +206,23 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) struct blk_mq_tags *tags = hctx->tags; bool reserved = iter_data->reserved; struct request *rq; + bool res = true; if (!reserved) bitnr += tags->nr_reserved_tags; - rq = tags->rqs[bitnr]; + + rcu_read_lock(); + rq = rcu_dereference(tags->rqs[bitnr]); /* * We can hit rq == NULL here, because the tagging functions * test and set the bit before assigning ->rqs[]. */ if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx) - return iter_data->fn(hctx, rq, iter_data->data, reserved); - return true; + res = iter_data->fn(hctx, rq, iter_data->data, reserved); + rcu_read_unlock(); + + return res; } /** @@ -264,10 +269,12 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data) struct blk_mq_tags *tags = iter_data->tags; bool reserved = iter_data->flags & BT_TAG_ITER_RESERVED; struct request *rq; + bool res = true; if (!reserved)
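The lifetime rule the patch relies on - clear the table entry before the request can be freed, and have the iterator skip cleared entries - can be modelled outside the kernel (a single-threaded sketch with hypothetical names; the real code additionally needs RCU so concurrent readers get a grace period):

```python
class TagTable:
    """Toy model of blk_mq_tags: a request pointer table that is cleared
    on free so a later iteration never sees a stale entry."""
    def __init__(self, depth):
        self.rqs = [None] * depth

    def start_request(self, tag, rq):
        self.rqs[tag] = rq              # rcu_assign_pointer() in the kernel

    def put_tag(self, tag):
        self.rqs[tag] = None            # clear before the request is freed

    def busy_iter(self, fn):
        for tag, rq in enumerate(self.rqs):
            if rq is not None:          # rcu_dereference() + NULL check
                fn(tag, rq)
```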
Re: [PATCH] lib/logic_pio: Fix overlap check for pio registery
On 18/01/2021 01:27, Jiahui Cen wrote: + Hi Arnd, xu wei, Any chance we could pick up this patch via arm-soc tree? The series which I originally included in, below, is stalled a bit. Thanks, John https://lore.kernel.org/lkml/1610729929-188490-1-git-send-email-john.ga...@huawei.com/ Hi John, On 2021/1/15 18:10, John Garry wrote: On 21/12/2020 13:04, Jiahui Cen wrote: On 21/12/2020 03:24, Jiahui Cen wrote: Hi John, On 2020/12/18 18:40, John Garry wrote: On 18/12/2020 06:23, Jiahui Cen wrote: Since the [start, end) is a half-open interval, a range with the end equal to the start of another range should not be considered as overlapped. Signed-off-by: Jiahui Cen --- lib/logic_pio.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/logic_pio.c b/lib/logic_pio.c index f32fe481b492..445d611f1dc1 100644 --- a/lib/logic_pio.c +++ b/lib/logic_pio.c @@ -57,7 +57,7 @@ int logic_pio_register_range(struct logic_pio_hwaddr *new_range) new_range->flags == LOGIC_PIO_CPU_MMIO) { /* for MMIO ranges we need to check for overlap */ if (start >= range->hw_start + range->size || - end < range->hw_start) { + end <= range->hw_start) { It looks like your change is correct, but should not really have an impact in practice since: a: BIOSes generally list ascending IO port CPU addresses b. there is space between IO port CPU address regions Have you seen a problem here? No serious problem. I found it just when I was working on adding support of pci expander bridge for Arm in QEMU. I found the IO window of some extended root bus could not be registered when I inserted the extended buses' _CRS info into DSDT table in the x86 way, which does not sort the buses. Though root buses should be sorted in QEMU, would it be better to accept those non-ascending IO windows? ok, so it seems that you have seen a real problem, and this issue is not just detected by code analysis. BTW, for b, it seems to be no space between IO windows of different root buses generated by EDK2. 
Or maybe I missed something obvious. I don't know about that. Anyway, your change looks ok. Reviewed-by: John Garry BTW, for your virt env, will there be requirement to unregister PCI MMIO ranges? Currently we don't see that in non-virt world. Thanks for your review. And currently there is no such a requirement in my virt env. I am not sure what happened to this patch, but I plan on sending some patches in this area soon - do you want me to include this one? Sorry for replying late. It's appreciated if you can include this patch. Thanks! Jiahui Thanks, John
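For reference, the half-open overlap rule being fixed above reduces to the usual interval test (a standalone sketch): with [start, end) ranges, a range whose end equals another range's start only touches it and must not count as overlapping.

```python
def ranges_overlap(start_a, end_a, start_b, end_b):
    """Half-open [start, end) ranges overlap iff each one starts
    before the other one ends."""
    return start_a < end_b and start_b < end_a
```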
Re: arm64 syzbot instances
On 15/03/2021 10:01, Dmitry Vyukov wrote: On Mon, Mar 15, 2021 at 10:45 AM John Garry wrote: It does not happen too often on syzbot so far, so let's try to do the right thing first. I've filed: https://bugs.launchpad.net/qemu/+bug/1918917 with a link to this thread. To be fair, I don't fully understand what I am talking about, I hope I proxied your description properly. Thanks, looks good. I provided a little more detail in a comment there. Arnd From looking at the bug report, my impression is that this is a qemu issue, as the logical IO space is mapped to the PCI host bridge IO space, and qemu does not handle accesses to that CPU addressable region at all. As Arnd said. However, we really should not be accessing logical IO ports 0 or 0x2f8 at all via ttyS3 if not enumerated from PCI device at that logical IO port. That is what I think anyway, as who knows what device - if any - really exists at that location. That is why I had this patch to just stop accesses to legacy IO port regions on arm64: https://lore.kernel.org/lkml/1610729929-188490-2-git-send-email-john.ga...@huawei.com/ Hi John, Thanks for the info. The patch is from January, but it's not merged yet, right? It will fix the crash we see, right? It's not merged, and it probably would solve this issue. But following discussion with Arnd when it was originally posted, I still need to do some analysis whether it is the proper thing to do. However, as mentioned, the fundamental issue looks like qemu IO port access, so it would be good to check that first. On a related topic, I will cc colleague Jiahui Cen, who I think was doing some work on arm qemu support in a related area, so may share some experience here. Jiahui Cen did have a patch to fix logical PIO code from this work [0], which is not merged, but I don't think would help here. I will cc you on it. Thanks, John [0] https://lore.kernel.org/lkml/006ad6ce-d6b2-59cb-8209-aca3f6e53...@huawei.com/
Re: arm64 syzbot instances
On 12/03/2021 10:52, Arnd Bergmann wrote: On Fri, Mar 12, 2021 at 11:38 AM Dmitry Vyukov wrote: On Fri, Mar 12, 2021 at 11:11 AM Arnd Bergmann wrote: It does not happen too often on syzbot so far, so let's try to do the right thing first. I've filed: https://bugs.launchpad.net/qemu/+bug/1918917 with a link to this thread. To be fair, I don't fully understand what I am talking about, I hope I proxied your description properly. Thanks, looks good. I provided a little more detail in a comment there. Arnd . From looking at the bug report, my impression is that this is a qemu issue, as the logical IO space is mapped to the PCI host bridge IO space, and qemu does not handle accesses to that CPU addressable region at all. As Arnd said. However, we really should not be accessing logical IO ports 0 or 0x2f8 at all via ttyS3 if not enumerated from PCI device at that logical IO port. That is what I think anyway, as who knows what device - if any - really exists at that location. That is why I had this patch to just stop accesses to legacy IO port regions on arm64: https://lore.kernel.org/lkml/1610729929-188490-2-git-send-email-john.ga...@huawei.com/ Thanks, John
Re: [PATCH] iommu/dma: Resurrect the "forcedac" option
On 05/03/2021 16:32, Robin Murphy wrote: In converting intel-iommu over to the common IOMMU DMA ops, it quietly lost the functionality of its "forcedac" option. Since this is a handy thing both for testing and for performance optimisation on certain platforms, reimplement it under the common IOMMU parameter namespace. For the sake of fixing the inadvertent breakage of the Intel-specific parameter, remove the dmar_forcedac remnants and hook it up as an alias while documenting the transition to the new common parameter. Fixes: c588072bba6b ("iommu/vt-d: Convert intel iommu driver to the iommu ops") Signed-off-by: Robin Murphy If it's worth anything: Reviewed-by: John Garry --- Documentation/admin-guide/kernel-parameters.txt | 15 --- drivers/iommu/dma-iommu.c | 13 - drivers/iommu/intel/iommu.c | 5 ++--- include/linux/dma-iommu.h | 2 ++ 4 files changed, 24 insertions(+), 11 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 04545725f187..835f810f2f26 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1869,13 +1869,6 @@ bypassed by not enabling DMAR with this option. In this case, gfx device will use physical address for DMA. - forcedac [X86-64] - With this option iommu will not optimize to look - for io virtual address below 32-bit forcing dual - address cycle on pci bus for cards supporting greater - than 32-bit addressing. The default is to look - for translation below 32-bit and if not available - then look in the higher range. strict [Default Off] With this option on every unmap_single operation will result in a hardware IOTLB flush operation as opposed @@ -1964,6 +1957,14 @@ nobypass[PPC/POWERNV] Disable IOMMU bypass, using IOMMU for PCI devices. + iommu.forcedac= [ARM64, X86] Control IOVA allocation for PCI devices. 
+ Format: { "0" | "1" } + 0 - Try to allocate a 32-bit DMA address first, before + falling back to the full range if needed. + 1 - Allocate directly from the full usable range, + forcing Dual Address Cycle for PCI cards supporting + greater than 32-bit addressing. + iommu.strict= [ARM64] Configure TLB invalidation behaviour Format: { "0" | "1" } 0 - Lazy mode. diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 9ab6ee22c110..260bf3de1992 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -52,6 +52,17 @@ struct iommu_dma_cookie { }; static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled); +bool iommu_dma_forcedac __read_mostly; + +static int __init iommu_dma_forcedac_setup(char *str) +{ + int ret = kstrtobool(str, &iommu_dma_forcedac); + + if (!ret && iommu_dma_forcedac) + pr_info("Forcing DAC for PCI devices\n"); I seem to remember some disagreement on this sort of print some other time :) + return ret; +} +early_param("iommu.forcedac", iommu_dma_forcedac_setup); void iommu_dma_free_cpu_cached_iovas(unsigned int cpu, struct iommu_domain *domain) @@ -438,7 +449,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain, dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end);
Re: [PATCH 1/5] perf metricgroup: Support printing metrics for arm64
On 06/03/2021 19:34, Jiri Olsa wrote: On Fri, Mar 05, 2021 at 11:06:58AM +, John Garry wrote: Hi Jirka, - struct pmu_events_map *map = perf_pmu__find_map(NULL); + struct pmu_events_map *map = find_cpumap(); so this is just for arm at the moment right? Yes - but to be more accurate, arm64. At the moment, from the archs which use pmu-events, only arm64 and nds32 have versions of get_cpuid_str() which require a non-NULL pmu argument. But then apparently nds32 only supports a single CPU, so this issue of heterogeneous CPUs should not be a concern there:) could we rather make this arch specific code, so we don't need to do the scanning on archs where this is not needed? like marking perf_pmu__find_map as __weak and add arm specific version? Well I was thinking that this code should not be in metricgroup.c anyway. So there is code which is common in current perf_pmu__find_map() for all archs. I could factor that out into a common function, below. Just a bit worried about perf_pmu__find_map() and perf_pmu__find_pmu_map() being confused. right, so perf_pmu__find_map does not take perf_pmu as argument anymore, so the prefix does not fit, how about pmu_events_map__find ? I just noticed this series: https://lore.kernel.org/lkml/1612797946-18784-1-git-send-email-kan.li...@linux.intel.com/ Seems that this has metricgroup support for heterogeneous system config, while this series is metricgroup support for homogeneous system config for arch which supports heterogeneous system config. I need to check further for any conflicts. @Kan Liang, it would be great if you could cc me on that series. I don't subscribe to the general list. Thanks, John
Re: [RFC PATCH v3 2/3] blk-mq: Freeze and quiesce all queues for tagset in elevator_exit()
On 11/03/2021 00:58, Ming Lei wrote: Indeed, blk_mq_queue_tag_busy_iter() already does take a reference to its queue usage counter when called, and the queue cannot be frozen to switch IO scheduler until all refs are dropped. This ensures no stale references to IO scheduler requests will be seen by blk_mq_queue_tag_busy_iter(). However, there is nothing to stop blk_mq_queue_tag_busy_iter() being run for another queue associated with the same tagset, and it seeing a stale IO scheduler request from the other queue after they are freed. To stop this happening, freeze and quiesce all queues associated with the tagset as the elevator is exited. I think this way can't be accepted since switching one queue's scheduler is nothing to do with other request queues attached to same HBA. This patch will cause performance regression because userspace may switch scheduler according to medium or workloads, at that time other LUNs will be affected by this patch. Hmmm... that was my concern also. Do you think that it may cause a big impact? Depends totally on the workload, I suppose. FWIW, it is useful though for solving both iterator problems. Thanks, John
Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"
On 08/03/2021 16:22, John Garry wrote: While max32_alloc_size indirectly tracks the largest*contiguous* available space, one of the ideas from which it grew was to simply keep count of the total number of free PFNs. If you're really spending significant time determining that the tree is full, as opposed to just taking longer to eventually succeed, then it might be relatively innocuous to tack on that semi-redundant extra accounting as a self-contained quick fix for that worst case. ... So since the retry means that we search through the complete pfn range most of the time (due to poor success rate quoted), we should be able to do a better job at maintaining an accurate max alloc size, by calculating it during the failed search, and not relying on max alloc failed or resetting it frequently. Hopefully that would mean that we're smarter about quickly failing the allocation. I'll further look at that. Thanks, John
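A minimal sketch of that idea (first-fit over a flat free map, not the kernel rb-tree): a failed walk has already visited every gap, so the largest one can be recorded for free, letting later oversized requests fail immediately without another search.

```python
def alloc_first_fit(allocated, size, limit):
    """allocated: sorted (lo, hi) half-open busy ranges within [0, limit).
    Return (start, None) on success, or (None, max_free) on failure,
    where max_free is the largest gap observed during the failed walk."""
    prev_end, max_free = 0, 0
    for lo, hi in allocated + [(limit, limit)]:   # sentinel covers the tail gap
        gap = lo - prev_end
        if gap >= size:
            return prev_end, None                 # first fit found
        max_free = max(max_free, gap)
        prev_end = hi
    return None, max_free
```

A caller can then compare future request sizes against max_free and skip the walk entirely when it cannot possibly succeed.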
Re: [RFC PATCH v3 3/3] blk-mq: Lockout tagset iterator when exiting elevator
On 10/03/2021 16:00, Bart Van Assche wrote: So I can incorporate any changes and suggestions so far and send a non-RFC version - that may get more attention if none extra comes. As mentioned on the cover letter, if patch 2+3/3 are accepted, then patch 1/3 could be simplified. But I plan to leave as is. BTW, any issue with putting your suggested-by on patch 2/3? Hi Bart, I have added my Reviewed-by to patch 2/3. OK, thanks. Please note that I still want to check further whether some of Ming's series "blk-mq: implement queue quiesce via percpu_ref for BLK_MQ_F_BLOCKING" can be used. Regarding the other two patches in this series: I do not agree with patch 3/3. As I have explained, I am concerned that that patch breaks existing block drivers. Understood. I need to check your concern further to allay any fears. So I could probably change that patch to drop the early return. Instead we just need to ensure that we complete any existing calls to blk_mq_tagset_busy_iter() prior to freeing the IO scheduler requests. Then we don't need to return early and can iter as before - but, as I said previously, there should be no active tags to iter. Are patches 1/3 and 3/3 necessary? Or in other words, is patch 2/3 sufficient to fix the use-after-free? No, we need them all in some form. So far, reports are that 1/3 solves the most common seen UAF. It is pretty easy to trigger. But the scenarios associated with 2/3 and 3/3 are much harder to trigger, and I needed to add delays in the code just to trigger them. Thanks, John
Re: [RFC PATCH v3 3/3] blk-mq: Lockout tagset iterator when exiting elevator
On 09/03/2021 19:21, Bart Van Assche wrote: On 3/9/21 9:47 AM, John Garry wrote: This does fall over if some tags are allocated without associated request queue, which I do not know exists. Hi Bart, The only tag allocation mechanism I know of is blk_mq_get_tag(). The only blk_mq_get_tag() callers I know of are __blk_mq_alloc_request() and blk_mq_alloc_request_hctx(). So I think all allocated tags are associated with a request queue. ok, good. Regarding this patch series, I have shared the feedback I wanted to share so I would appreciate it if someone else could also take a look. So I can incorporate any changes and suggestions so far and send a non-RFC version - that may get more attention if none extra comes. As mentioned on the cover letter, if patch 2+3/3 are accepted, then patch 1/3 could be simplified. But I plan to leave as is. BTW, any issue with putting your suggested-by on patch 2/3? Thanks, John
Re: [RFC PATCH v3 3/3] blk-mq: Lockout tagset iterator when exiting elevator
On 08/03/2021 19:59, Bart Van Assche wrote: This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g. happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter() call and if that causes all mtip_abort_cmd() calls to be skipped? I'm not sure that I understand this problem you describe. So if blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) is called, either can happen: a. normal operation, iter_usage_counter initially holds >= 1, and then iter_usage_counter is incremented in blk_mq_tagset_busy_iter() and we iter the busy tags. Any parallel call to blk_mq_tagset_busy_iter() will also increase iter_usage_counter. b. we're switching IO scheduler. In this scenario, first we quiesce all queues. After that, there should be no active requests. At that point, we ensure any calls to blk_mq_tagset_busy_iter() are finished and block (or discard may be a better term) any more calls. Blocking any more calls should be safe as there are no requests to iter. atomic_cmpxchg() is used to set iter_usage_counter to 0, blocking any more calls. Hi Bart, My concern is about the insertion of the early return statement in blk_mq_tagset_busy_iter(). So I take this approach as I don't see any way to use a mutual exclusion waiting mechanism to block calls to blk_mq_tagset_busy_iter() while the IO scheduler is being switched. The reason is that blk_mq_tagset_busy_iter() can be called from any context, including hardirq. Although most blk_mq_tagset_busy_iter() callers can handle skipping certain blk_mq_tagset_busy_iter() calls (e.g. when gathering statistics), I'm not sure this is safe for all blk_mq_tagset_busy_iter() callers. The example I cited is an example of a blk_mq_tagset_busy_iter() call with side effects. I don't like to think that we're skipping it, which may imply that there are some active requests to iter and we're just ignoring them. 
It's more like: we know that there are no requests active, so don't bother trying to iterate. The mtip driver allocates one tag set per request queue so quiescing queues should be sufficient to address my concern for the mtip driver. The NVMe core and SCSI core however share a single tag set across multiple namespaces / LUNs. In the error path of nvme_rdma_setup_ctrl() I found a call to nvme_cancel_tagset(). nvme_cancel_tagset() calls blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl). I'm not sure it is safe to skip the nvme_cancel_request() calls if the I/O scheduler for another NVMe namespace is being modified. Again, I would be relying on all request_queues associated with that tagset to be quiesced when switching IO scheduler at the point blk_mq_tagset_busy_iter() is called and returns early. Now if there were active requests, I am relying on the request queue quiescing to flush them. So blk_mq_tagset_busy_iter() could be called during that quiescing period, and would continue to iter the requests. This does fall over if some tags are allocated without an associated request queue, which I do not know exists. Thanks, John
Re: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue
On 09/03/2021 15:57, Michael Kelley wrote: From: John Garry Sent: Tuesday, March 9, 2021 2:10 AM On 08/03/2021 17:56, Melanie Plageman wrote: On Mon, Mar 08, 2021 at 02:37:40PM +, Michael Kelley wrote: From: Melanie Plageman (Microsoft) Sent: Friday, March 5, 2021 3:22 PM The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during allocation. Cap cmd_per_lun at can_queue to avoid dispatch errors. Signed-off-by: Melanie Plageman (Microsoft) --- drivers/scsi/storvsc_drv.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index 6bc5453cea8a..d7953a6e00e6 100644 --- a/drivers/scsi/storvsc_drv.c +++ b/drivers/scsi/storvsc_drv.c @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, (max_sub_channels + 1) * (100 - ring_avail_percent_lowater) / 100; + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, scsi_driver.can_queue); + I'm not sure what you mean by "avoid dispatch errors". Can you elaborate? The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set Scsi_Host->cmd_per_lun in storvsc_probe(). In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is called and sets the scsi_device->queue_depth to the Scsi_Host's cmd_per_lun with this code: scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ? sdev->host->cmd_per_lun : 1); During dispatch, the scsi_device->queue_depth is used in scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine whether or not the device can queue another command. On some machines, with the 2048 value of cmd_per_lun that was used to set the initial scsi_device->queue_depth, commands can be queued that are later not able to be dispatched after running out of space in the ringbuffer. On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD (running an fio workload that I can provide if needed), storvsc_do_io() ends up often returning SCSI_MLQUEUE_DEVICE_BUSY. 
This is the call stack: hv_get_bytes_to_write hv_ringbuffer_write vmbus_send_packet storvsc_do_io storvsc_queuecommand scsi_dispatch_cmd scsi_queue_rq dispatch_rq_list Be aware that the calculation of "can_queue" in this driver is somewhat flawed -- it should not be based on the size of the ring buffer, but instead on the maximum number of requests Hyper-V will queue. And even then, can_queue doesn't provide the cap you might expect because the blk-mq layer allocates can_queue tags for each HW queue, not as a total. The docs for scsi_mid_low_api document Scsi_Host can_queue this way: can_queue - must be greater than 0; do not send more than can_queue commands to the adapter. I did notice that in scsi_host.h, the comment for can_queue does say can_queue is the "maximum number of simultaneous commands a single hw queue in HBA will accept." However, I don't see it being used this way in the code. JFYI, the block layer ensures that no more than can_queue requests are sent to the host. See scsi_mq_setup_tags(), and how the tagset queue depth is set to shost->can_queue. Thanks, John Agree on what's in scsi_mq_setup_tags(). But scsi_mq_setup_tags() calls blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(), which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag set queue_depth as needed until it succeeds. The key thing is that __blk_mq_alloc_rq_maps() iterates over the number of HW queues calling __blk_mq_alloc_map_and_request(). The latter function allocates the map and the requests with a count of the tag set's queue_depth. There's no logic to apportion the can_queue value across multiple HW queues. So each HW queue gets can_queue tags allocated, and the SCSI host driver may see up to (can_queue * # HW queues) simultaneous requests. I'm certainly not an expert in this area, but that's what I see in the code. 
We've run live experiments, and can see the number of simultaneous requests sent to the storvsc driver be greater than can_queue when the # of HW queues is greater than 1, which seems to be consistent with the code. ah, ok. I assumed that # of HW queues = 1 here. So you're describing a problem similar to https://lore.kernel.org/linux-scsi/b3e4e597-779b-7c1e-0d3c-07bc3dab1...@huawei.com/ So if you check nr_hw_queues comment in include/scsi/scsi_host.h, it reads: the total queue depth per host is nr_hw_queues * can_queue. However, for when host_tagset is set, the total queue depth is can_queue. Setting .host_tagset will ensure at most can_queue requests will be sent over all HW queues at any given time. A few SCSI MQ drivers set this now. Thanks, John Michael During dispatch, In scsi_target_queue_ready(), there is this code: if (busy >= starget->can_queue) goto starved;
Re: [PATCH 2/2] iommu/iova: Improve restart logic
On 05/03/2021 16:35, Robin Murphy wrote: Hi Robin, When restarting after searching below the cached node fails, resetting the start point to the anchor node is often overly pessimistic. If allocations are made with mixed limits - particularly in the case of the opportunistic 32-bit allocation for PCI devices - this could mean significant time wasted walking through the whole populated upper range just to reach the initial limit. Right We can improve on that by implementing a proper tree traversal to find the first node above the relevant limit, and set the exact start point. Thanks for this. However, as mentioned in the other thread, this does not help our performance regression Re: commit 4e89dce72521. And mentioning this "retry" approach again, in case it was missed, from my experiment on the affected HW, it also has generally a dreadfully low success rate - less than 1% for the 32b range retry. Retry rate is about 20%. So I am saying from this 20%, less than 1% of those succeed. Failing faster sounds key. Thanks, John Signed-off-by: Robin Murphy --- drivers/iommu/iova.c | 39 ++- 1 file changed, 38 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index c28003e1d2ee..471c48dd71e7 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -154,6 +154,43 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free) iovad->cached_node = rb_next(&free->node); } +static struct rb_node *iova_find_limit(struct iova_domain *iovad, unsigned long limit_pfn) +{ + struct rb_node *node, *next; + /* +* Ideally what we'd like to judge here is whether limit_pfn is close +* enough to the highest-allocated IOVA that starting the allocation +* walk from the anchor node will be quicker than this initial work to +* find an exact starting point (especially if that ends up being the +* anchor node anyway). This is an incredibly crude approximation which +* only really helps the most likely case, but is at least trivially easy. 
+*/ + if (limit_pfn > iovad->dma_32bit_pfn) + return &iovad->anchor.node; + + node = iovad->rbroot.rb_node; + while (to_iova(node)->pfn_hi < limit_pfn) + node = node->rb_right; + +search_left: + while (node->rb_left && to_iova(node->rb_left)->pfn_lo >= limit_pfn) + node = node->rb_left; + + if (!node->rb_left) + return node; + + next = node->rb_left; + while (next->rb_right) { + next = next->rb_right; + if (to_iova(next)->pfn_lo >= limit_pfn) { + node = next; + goto search_left; + } + } + + return node; +} + /* Insert the iova into domain rbtree by holding writer lock */ static void iova_insert_rbtree(struct rb_root *root, struct iova *iova, @@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad, if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) { high_pfn = limit_pfn; low_pfn = retry_pfn; - curr = &iovad->anchor.node; + curr = iova_find_limit(iovad, limit_pfn); curr_iova = to_iova(curr); goto retry; }
Re: [PATCH v1] scsi: storvsc: Cap cmd_per_lun at can_queue
On 08/03/2021 17:56, Melanie Plageman wrote: On Mon, Mar 08, 2021 at 02:37:40PM +, Michael Kelley wrote: From: Melanie Plageman (Microsoft) Sent: Friday, March 5, 2021 3:22 PM The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during allocation. Cap cmd_per_lun at can_queue to avoid dispatch errors. Signed-off-by: Melanie Plageman (Microsoft) --- drivers/scsi/storvsc_drv.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index 6bc5453cea8a..d7953a6e00e6 100644 --- a/drivers/scsi/storvsc_drv.c +++ b/drivers/scsi/storvsc_drv.c @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, (max_sub_channels + 1) * (100 - ring_avail_percent_lowater) / 100; + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, scsi_driver.can_queue); + I'm not sure what you mean by "avoid dispatch errors". Can you elaborate? The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set Scsi_Host->cmd_per_lun in storvsc_probe(). In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is called and sets the scsi_device->queue_depth to the Scsi_Host's cmd_per_lun with this code: scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ? sdev->host->cmd_per_lun : 1); During dispatch, the scsi_device->queue_depth is used in scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine whether or not the device can queue another command. On some machines, with the 2048 value of cmd_per_lun that was used to set the initial scsi_device->queue_depth, commands can be queued that are later not able to be dispatched after running out of space in the ringbuffer. On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD (running an fio workload that I can provide if needed), storvsc_do_io() ends up often returning SCSI_MLQUEUE_DEVICE_BUSY. 
This is the call stack: hv_get_bytes_to_write hv_ringbuffer_write vmbus_send_packet storvsc_do_io storvsc_queuecommand scsi_dispatch_cmd scsi_queue_rq dispatch_rq_list Be aware that the calculation of "can_queue" in this driver is somewhat flawed -- it should not be based on the size of the ring buffer, but instead on the maximum number of requests Hyper-V will queue. And even then, can_queue doesn't provide the cap you might expect because the blk-mq layer allocates can_queue tags for each HW queue, not as a total. The docs for scsi_mid_low_api document Scsi_Host can_queue this way: can_queue - must be greater than 0; do not send more than can_queue commands to the adapter. I did notice that in scsi_host.h, the comment for can_queue does say can_queue is the "maximum number of simultaneous commands a single hw queue in HBA will accept." However, I don't see it being used this way in the code. JFYI, the block layer ensures that no more than can_queue requests are sent to the host. See scsi_mq_setup_tags(), and how the tagset queue depth is set to shost->can_queue. Thanks, John During dispatch, In scsi_target_queue_ready(), there is this code: if (busy >= starget->can_queue) goto starved; And the scsi_target->can_queue value should be coming from Scsi_host as mentioned in the scsi_target definition in scsi_device.h /* * LLDs should set this in the slave_alloc host template callout. * If set to zero then there is not limit. */ unsigned int can_queue; So, I don't really see how this would be per hardware queue. I agree that the cmd_per_lun setting is also too big, but we should fix that in the context of getting all of these different settings working together correctly, and not piecemeal. Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe will also prevent the LUN queue_depth from being set to a value that is higher than it can ever be set to again by the user when storvsc_change_queue_depth() is invoked. 
Also in scsi_sysfs sdev_store_queue_depth() there is this check: if (depth < 1 || depth > sdev->host->can_queue) return -EINVAL; I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun is set to the min of the configured cmd_per_lun and Scsi_Host->can_queue: shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue); Best, Melanie
Re: [PATCH 1/5] perf metricgroup: Support printing metrics for arm64
On 06/03/2021 19:34, Jiri Olsa wrote: On Fri, Mar 05, 2021 at 11:06:58AM +, John Garry wrote: Hi Jirka, - struct pmu_events_map *map = perf_pmu__find_map(NULL); + struct pmu_events_map *map = find_cpumap(); so this is just for arm at the moment right? Yes - but to be more accurate, arm64. At the moment, from the archs which use pmu-events, only arm64 and nds32 have versions of get_cpuid_str() which require a non-NULL pmu argument. But then apparently nds32 only supports a single CPU, so this issue of heterogeneous CPUs should not be a concern there:) could we rather make this arch specific code, so we don't need to do the scanning on archs where this is not needed? like marking perf_pmu__find_map as __weak and add arm specific version? Well I was thinking that this code should not be in metricgroup.c anyway. So there is code which is common in current perf_pmu__find_map() for all archs. I could factor that out into a common function, below. Just a bit worried about perf_pmu__find_map() and perf_pmu__find_pmu_map() being confused. right, so perf_pmu__find_map does not take perf_pmu as argument anymore, so the prefix does not fit, how about pmu_events_map__find ? I think it could be ok. But now I am slightly concerned that we don't put anything like this in arch/arm64, based on this earlier discussion on close topic: https://lore.kernel.org/lkml/20190719075450.xcm4i4a5sfaxlfap@willie-the-truck/ Hi Will, Mark, Do you have any objection to add arm64 specific code here? So what I had originally in this patch was to iterate PMUs in common code and find the CPU PMU and use that to match CPU metrics, as long as it's not a heterogeneous system. Now the suggestion was to move that into arch specific code, as it's not needed for all archs. Thanks, John
Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"
On 08/03/2021 15:15, Robin Murphy wrote: I figure that you're talking about 4e89dce72521 now. I would have liked to know which real-life problem it solved in practice. From what I remember, the problem reported was basically the one illustrated in that commit and the one I alluded to above - namely that certain allocation patterns with a broad mix of sizes and relative lifetimes end up pushing the cached PFN down to the bottom of the address space such that allocations start failing despite there still being sufficient free space overall, which was breaking some media workload. What was originally proposed was an overcomplicated palaver with DMA attributes and a whole extra allocation algorithm rather than just fixing the clearly unintended and broken behaviour. ok, fine. I just wondered if this was a theoretical problem only. While max32_alloc_size indirectly tracks the largest *contiguous* available space, one of the ideas from which it grew was to simply keep count of the total number of free PFNs. If you're really spending significant time determining that the tree is full, as opposed to just taking longer to eventually succeed, then it might be relatively innocuous to tack on that semi-redundant extra accounting as a self-contained quick fix for that worst case. ... Even if it is were configurable, wouldn't it make sense to have it configurable per IOVA domain? Perhaps, but I don't see that being at all easy to implement. We can't arbitrarily *increase* the scope of caching once a domain is active due to the size-rounding-up requirement, which would be prohibitive to larger allocations if applied universally. Agreed. But having that (all IOVAs sizes being cacheable) available could be really great, though, for some situations. Furthermore, as mentioned above, I still want to solve this IOVA aging issue, and this fixed RCACHE RANGE size seems to be at the center of that problem. 
As for 4e89dce72521, so even if it's proper to retry for a failed alloc, it is not always necessary. I mean, if we're limiting ourselves to 32b subspace for this SAC trick and we fail the alloc, then we can try the space above 32b first (if usable). If that fails, then retry there. I don't see a need to retry the 32b subspace if we're not limited to it. How about it? We tried that idea and it looks to just about restore performance. The thing is, if you do have an actual PCI device where DAC might mean a 33% throughput loss and you're mapping a long-lived buffer, or you're on one of these systems where firmware fails to document address limits and using the full IOMMU address width quietly breaks things, then you almost certainly *do* want the allocator to actually do a proper job of trying to satisfy the given request. If those conditions were true, then it seems quite a tenuous position, so trying to help that scenario in general terms will have limited efficacy. Still, I'd be curious to see if making the restart a bit cleverer offers a noticeable improvement. IIRC I suggested it at the time, but in the end the push was just to get *something* merged. Sorry to say, I just tested that ("iommu/iova: Improve restart logic") and there is no obvious improvement. I'll have a further look at what might be going on. Thanks very much, John
Re: [PATCH] iommu/dma: Resurrect the "forcedac" option
On 08/03/2021 13:08, Robin Murphy wrote: On 2021-03-05 17:41, John Garry wrote: On 05/03/2021 16:32, Robin Murphy wrote: In converting intel-iommu over to the common IOMMU DMA ops, it quietly lost the functionality of its "forcedac" option. Since this is a handy thing both for testing and for performance optimisation on certain platforms, reimplement it under the common IOMMU parameter namespace. For the sake of fixing the inadvertent breakage of the Intel-specific parameter, remove the dmar_forcedac remnants and hook it up as an alias while documenting the transition to the new common parameter. Do you think that having a kconfig option to control the default for this can help identify the broken platforms which rely on forcedac=0? But seems a bit trivial for that, though. I think it's still a sizeable can of worms - unlike, say, ARM_SMMU_DISABLE_BYPASS_BY_DEFAULT, we can't actually tell when things have gone awry and explicitly call it out. While I was getting the dma-ranges right on my Juno, everything broke differently - the SATA controller fails gracefully; the ethernet controller got the kernel tied up somewhere (to the point that the USB keyboard died) once it tried to bring up the link, but was at least spewing regular timeout backtraces that implicated the networking layer; having an (unused) NVMe plugged in simply wedged the boot process early on with no hint whatsoever of why. TBH I'm not really sure what the best way forward is in terms of trying to weed out platforms that would need quirking. I was more thinking of an unstable TEST config, like DEBUG_TEST_DRIVER_REMOVE. So we know that this particular config breaks many platforms. But at least those in the know can turn it on locally and detect and fix issues, and strive towards having a platform for which it works. But then it does become a little harder to justify such a config when we can enable via command line. 
Our discussion just reminded me of this option and that it had gone AWOL, so bringing it back to be potentially *some* use to everyone seems justifiable on its own. Of course. Cheers, John
Re: [RFC PATCH v3 1/3] blk-mq: Clean up references to old requests when freeing rqs
On 06/03/2021 02:52, Khazhy Kumykov wrote: On Fri, Mar 5, 2021 at 7:20 AM John Garry wrote: It has been reported many times that a use-after-free can be intermittently found when iterating busy requests: - https://lore.kernel.org/linux-block/8376443a-ec1b-0cef-8244-ed584b96f...@huawei.com/ - https://lore.kernel.org/linux-block/5c3ac5af-ed81-11e4-fee3-f92175f14...@acm.org/T/#m6c1ac11540522716f645d004e2a5a13c9f218908 - https://lore.kernel.org/linux-block/04e2f9e8-79fa-f1cb-ab23-4a15bf3f6...@kernel.dk/ The issue is that when we switch scheduler or change queue depth, there may be references in the driver tagset to the stale requests. As a solution, clean up any references to those requests in the driver tagset. This is done with a cmpxchg to make safe any race with setting the driver tagset request from another queue. I noticed this crash recently when running blktests on a "debug" config on a 4.15 based kernel (it would always crash), and backporting this change fixes it. (testing on linus's latest tree also confirmed the fix, with the same config). I realize I'm late to the conversation, but appreciate the investigation and fixes :) Good to know. I'll explicitly cc you on further versions. Thanks, John
Re: [RFC PATCH v3 3/3] blk-mq: Lockout tagset iterator when exiting elevator
On 06/03/2021 04:43, Bart Van Assche wrote: On 3/5/21 7:14 AM, John Garry wrote: diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 7ff1b20d58e7..5950fee490e8 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -358,11 +358,16 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset, { int i; + if (!atomic_inc_not_zero(&tagset->iter_usage_counter)) + return; + for (i = 0; i < tagset->nr_hw_queues; i++) { if (tagset->tags && tagset->tags[i]) __blk_mq_all_tag_iter(tagset->tags[i], fn, priv, BT_TAG_ITER_STARTED); } + + atomic_dec(&tagset->iter_usage_counter); } EXPORT_SYMBOL(blk_mq_tagset_busy_iter); Hi Bart, This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g. happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter() call and if that causes all mtip_abort_cmd() calls to be skipped? I'm not sure that I understand this problem you describe. So if blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) is called, either of these can happen: a. normal operation, iter_usage_counter initially holds >= 1, and then iter_usage_counter is incremented in blk_mq_tagset_busy_iter() and we iter the busy tags. Any parallel call to blk_mq_tagset_busy_iter() will also increase iter_usage_counter. b. we're switching IO scheduler. In this scenario, first we quiesce all queues. After that, there should be no active requests. At that point, we ensure any calls to blk_mq_tagset_busy_iter() are finished and block (or discard may be a better term) any more calls. Blocking any more calls should be safe as there are no requests to iter. atomic_cmpxchg() is used to set iter_usage_counter to 0, blocking any more calls. + while (atomic_cmpxchg(&set->iter_usage_counter, 1, 0) != 1); Isn't it recommended to call cpu_relax() inside busy-waiting loops? Maybe, but I am considering changing this patch to use percpu_refcnt() - I need to check it further. 
blk_mq_sched_free_requests(q); __elevator_exit(q, e); + atomic_set(&set->iter_usage_counter, 1); Can it happen that the above atomic_set() call happens while a blk_mq_tagset_busy_iter() call is in progress? No, as at this point it should be ensured that iter_usage_counter holds 0 from atomic_cmpxchg(), so there should be no active processes in blk_mq_tagset_busy_iter() sensitive region. Calls to blk_mq_tagset_busy_iter() are blocked when iter_usage_counter holds 0. Should that atomic_set() call perhaps be changed into an atomic_inc() call? They have the same effect in practice, but we use atomic_set() in blk_mq_alloc_tag_set(), so at least consistent. Thanks, John
Re: [RFC PATCH v3 2/3] blk-mq: Freeze and quiesce all queues for tagset in elevator_exit()
On 06/03/2021 04:32, Bart Van Assche wrote: On 3/5/21 7:14 AM, John Garry wrote: diff --git a/block/blk.h b/block/blk.h index 3b53e44b967e..1a948bfd91e4 100644 --- a/block/blk.h +++ b/block/blk.h @@ -201,10 +201,29 @@ void elv_unregister_queue(struct request_queue *q); static inline void elevator_exit(struct request_queue *q, struct elevator_queue *e) { + struct blk_mq_tag_set *set = q->tag_set; + struct request_queue *tmp; + lockdep_assert_held(&q->sysfs_lock); + mutex_lock(&set->tag_list_lock); + list_for_each_entry(tmp, &set->tag_list, tag_set_list) { + if (tmp == q) + continue; + blk_mq_freeze_queue(tmp); + blk_mq_quiesce_queue(tmp); + } + blk_mq_sched_free_requests(q); __elevator_exit(q, e); + + list_for_each_entry(tmp, &set->tag_list, tag_set_list) { + if (tmp == q) + continue; + blk_mq_unquiesce_queue(tmp); + blk_mq_unfreeze_queue(tmp); + } + mutex_unlock(&set->tag_list_lock); } Hi Bart, This patch introduces nesting of tag_list_lock inside sysfs_lock. The latter is per request queue while the former can be shared across multiple request queues. Has it been analyzed whether this is safe? Firstly - ignoring implementation details for a moment - this patch is to ensure that the concept is consistent with your suggestion and whether it is sound. As for nested locking, I can analyze more, but I did assume that we don't care about locking-out sysfs intervention during this time. And it seems pretty difficult to avoid nesting the locks. And further to this, I see that https://lore.kernel.org/linux-block/3aa5407c-0800-2482-597b-4264781a7...@grimberg.me/T/#mc3e3175642660578c0ae2a6c32185b1e34ec4b8a has a new interface for tagset quiesce, which could make this process more efficient. Please let me know further thoughts. Thanks, John
Re: [RFC PATCH v3 1/3] blk-mq: Clean up references to old requests when freeing rqs
On 06/03/2021 18:13, Bart Van Assche wrote: On 3/5/21 7:14 AM, John Garry wrote: @@ -2296,10 +2296,14 @@ void blk_mq_free_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, for (i = 0; i < tags->nr_tags; i++) { struct request *rq = tags->static_rqs[i]; + int j; if (!rq) continue; set->ops->exit_request(set, rq, hctx_idx); + /* clean up any references which occur in @ref_tags */ + for (j = 0; ref_tags && j < ref_tags->nr_tags; j++) + cmpxchg(&ref_tags->rqs[j], rq, 0); tags->static_rqs[i] = NULL; } } Hi Bart, What prevents blk_mq_tagset_busy_iter() from reading hctx->tags[...] before the cmpxcg() call and dereferencing it after blk_mq_free_rqs() has called __free_pages()? So there is nothing in this patch to stop that. But it's pretty unlikely, as the window is very narrow generally between reading hctx->tags[...] and actually dereferencing it. However, something like that should be made safe in patch 2/3. Thanks, John