Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On 30/12/2017 at 07:58, Matthew Wilcox wrote:
> On Wed, Dec 27, 2017 at 10:10:34AM +0100, Brice Goglin wrote:
>>> Perhaps we can enlist /proc/iomem or a similar enumeration interface
>>> to tell userspace the NUMA node and whether the kernel thinks it has
>>> better or worse performance characteristics relative to base
>>> system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
>>> publishing absolute numbers in sysfs userspace will default to looking
>>> for specific magic numbers in sysfs vs asking the kernel for memory
>>> that has performance characteristics relative to base "System RAM". In
>>> other words the absolute performance information that the HMAT
>>> publishes is useful to the kernel, but it's not clear that userspace
>>> needs that vs a relative indicator for making NUMA node preference
>>> decisions.
>>
>> Some HPC users will benchmark the machine to discover actual
>> performance numbers anyway.
>> However, most users won't do this. They will want to know relative
>> performance of different nodes. If you normalize HMAT values by dividing
>> them by system-RAM values, that's likely OK. If you just say "that
>> node is faster than system RAM", it's not precise enough.
>
> So "this memory has 800% bandwidth of normal" and "this memory has 70%
> bandwidth of normal"?

I guess that would work.

Brice
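
For illustration only, the normalization discussed in this message could be
sketched as below. This is not code from the patch series: the helper name
relative_percent() is hypothetical, and it assumes both values are given in
the same unit and are small enough that multiplying by 100 does not overflow.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of any patch in this thread): express a
 * node's bandwidth as a percentage of the base "System RAM" bandwidth.
 * The unit cancels out, so any consistent unit works.
 * Returns 0 when the baseline is 0, because no ratio can be formed.
 */
static uint64_t relative_percent(uint64_t node_value, uint64_t base_value)
{
	if (base_value == 0)
		return 0;
	/* Integer division: the result is truncated to a whole percent. */
	return (node_value * 100) / base_value;
}
```

With a made-up baseline of 10000, a node reporting 80000 comes out as 800 and
a node reporting 7000 comes out as 70, which matches the "800%" and "70%"
wording above.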
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On 22/12/2017 at 23:53, Dan Williams wrote:
> On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin <brice.gog...@gmail.com> wrote:
>> On 20/12/2017 at 23:41, Ross Zwisler wrote:
> [..]
>> Hello
>>
>> I can confirm that HPC runtimes are going to use these patches (at least
>> all runtimes that use hwloc for topology discovery, but that's the vast
>> majority of HPC anyway).
>>
>> We really didn't like KNL exposing a hacky SLIT table [1]. We had to
>> explicitly detect that specific crazy table to find out which NUMA nodes
>> were local to which cores, and to find out which NUMA nodes were
>> HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
>> application because the reported latencies didn't match reality. Quite
>> annoying.
>>
>> With Ross' patches, we can easily get what we need:
>> * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
>> can only report a single local node per CPU (doesn't work for KNL and
>> upcoming architectures with HBM+DDR+...)
>> * which NUMA nodes are slow/fast (for both bandwidth and latency)
>> And we can still look at SLIT under /sys/devices/system/node if really
>> needed.
>>
>> And of course having this in sysfs is much better than parsing ACPI
>> tables that are only accessible to root :)
> On this point, it's not clear to me that we should allow these sysfs
> entries to be world readable. Given /proc/iomem now hides physical
> address information from non-root we at least need to be careful not
> to undo that with new sysfs HMAT attributes. Once you need to be root
> for this info, is parsing binary HMAT vs sysfs a blocker for the HPC
> use case?

I don't think it would be a blocker.

> Perhaps we can enlist /proc/iomem or a similar enumeration interface
> to tell userspace the NUMA node and whether the kernel thinks it has
> better or worse performance characteristics relative to base
> system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
> publishing absolute numbers in sysfs userspace will default to looking
> for specific magic numbers in sysfs vs asking the kernel for memory
> that has performance characteristics relative to base "System RAM". In
> other words the absolute performance information that the HMAT
> publishes is useful to the kernel, but it's not clear that userspace
> needs that vs a relative indicator for making NUMA node preference
> decisions.

Some HPC users will benchmark the machine to discover actual
performance numbers anyway.
However, most users won't do this. They will want to know relative
performance of different nodes. If you normalize HMAT values by dividing
them by system-RAM values, that's likely OK. If you just say "that
node is faster than system RAM", it's not precise enough.

Brice
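
For reference, the per-node distance file under /sys/devices/system/node
mentioned in this message holds one line of space-separated decimal
integers. A minimal sketch of parsing such a line follows; it is not hwloc
code, and parse_distances() is a hypothetical name. Reading the file itself
(fopen/fgets) is left out so that the parser stays self-contained.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: parse one line in the format of
 * /sys/devices/system/node/nodeX/distance, e.g. "10 21 31\n".
 * Stores at most max values in out[] and returns how many were parsed.
 */
static int parse_distances(const char *line, int *out, int max)
{
	int count = 0;
	char *end;

	while (count < max) {
		/* strtol() skips leading whitespace, including the newline. */
		long v = strtol(line, &end, 10);

		if (end == line)	/* nothing converted: end of data */
			break;
		out[count++] = (int)v;
		line = end;
	}
	return count;
}
```

Entry i of the parsed array is the distance from node X to node i, so a
caller would compare entries to rank nodes by reported distance. As noted
above, on KNL those values did not match reality, so they are only as good
as the firmware table behind them.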
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On 20/12/2017 at 23:41, Ross Zwisler wrote:
> On Wed, Dec 20, 2017 at 02:29:56PM -0800, Dan Williams wrote:
>> On Wed, Dec 20, 2017 at 1:24 PM, Ross Zwisler wrote:
>>> On Wed, Dec 20, 2017 at 01:16:49PM -0800, Matthew Wilcox wrote:
>>>> On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote:
>>>>> On 12/20/2017 10:19 AM, Matthew Wilcox wrote:
>>>>>> I don't know what the right interface is, but my laptop has a set of
>>>>>> /sys/devices/system/memory/memoryN/ directories. Perhaps this is the
>>>>>> right place to expose write_bw (etc).
>>>>> Those directories are already too redundant and wasteful. I think we'd
>>>>> really rather not add to them. In addition, it's technically possible
>>>>> to have a memory section span NUMA nodes and have different performance
>>>>> properties, which make it impossible to represent there.
>>>>>
>>>>> In any case, ACPI PXM's (Proximity Domains) are guaranteed to have
>>>>> uniform performance properties in the HMAT, and we just so happen to
>>>>> always create one NUMA node per PXM. So, NUMA nodes really are a good
>>>>> fit.
>>>> I think you're missing my larger point which is that I don't think this
>>>> should be exposed to userspace as an ACPI feature. Because if you do,
>>>> then it'll also be exposed to userspace as an openfirmware feature.
>>>> And sooner or later a devicetree feature. And then writing a portable
>>>> program becomes an exercise in suffering.
>>>>
>>>> So, what's the right place in sysfs that isn't tied to ACPI? A new
>>>> directory or set of directories under /sys/devices/system/memory/ ?
>>> Oh, the current location isn't at all tied to acpi except that it happens to
>>> be named 'hmat'. When it was all named 'hmem' it was just:
>>>
>>> /sys/devices/system/hmem
>>>
>>> Which has no ACPI-isms at all. I'm happy to move it under
>>> /sys/devices/system/memory/hmat if that's helpful, but I think we still have
>>> the issue that the data represented therein is still pulled right from the
>>> HMAT, and I don't know how to abstract it into something more platform
>>> agnostic until I know what data is provided by those other platforms.
>>>
>>> For example, the HMAT provides latency information and bandwidth information
>>> for both reads and writes. Will the devicetree/openfirmware/etc version have
>>> this same info, or will it be just different enough that it won't translate
>>> into whatever I choose to stick in sysfs?
>> For the initial implementation do we need to have a representation of
>> all the performance data? Given that
>> /sys/devices/system/node/nodeX/distance is the only generic
>> performance attribute published by the kernel today it is already the
>> case that applications that need to target specific memories need to
>> go parse information that is not provided by the kernel by default.
>> The question is can those specialized applications stay special and go
>> parse the platform specific data sources, like raw HMAT, directly, or
>> do we expect general purpose applications to make use of this data? I
>> think a firmware-id to numa-node translation facility
>> (/sys/devices/system/node/nodeX/fwid) is a simple start that we can
>> build on with more information as specific use cases arise.
> We don't represent all the performance data, we only represent the data for
> local initiator/target pairs. I do think that this is useful to have in sysfs
> because it provides a way to easily answer the most commonly asked questions
> (or at least what I'm guessing will be the most commonly asked questions),
> i.e. "given a CPU, what are the speeds of the various types of memory attached
> to it", and "given a chunk of memory, how fast is it and to which CPU is it
> local"? By providing this base level of information I'm hoping to prevent
> most applications from having to parse the HMAT directly.
>
> The question of whether or not to include this local performance information
> was one of the main questions of the initial RFC patch series, and I did get
> feedback (albeit off-list) that the local performance information was
> valuable to at least some users. I did intentionally structure my (now very
> short) set so that the performance information was added as a separate patch,
> so we can get to the place you're talking about where we only provide firmware
> id <=> proximity domain mappings by just leaving off the last patch in the
> series.
>

Hello

I can confirm that HPC runtimes are going to use these patches (at least
all runtimes that use hwloc for topology discovery, but that's the vast
majority of HPC anyway).

We really didn't like KNL exposing a hacky SLIT table [1]. We had to
explicitly detect that specific crazy table to find out which NUMA nodes
were local to which cores, and to find out which NUMA nodes were
HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
application because the reported latencies didn't match reality. Quite
annoying.

With Ross'
Re: Topology updates and NUMA-level sched domains
On 07/04/2015 21:41, Peter Zijlstra wrote:
> No, that's very much not the same. Even if it were dealing with hotplug
> it would still assume the cpu to return to the same node.
>
> But mostly people do not even bother to handle hotplug.

You said userspace assumes the cpu-node relation is a boot-time fixed
one, and hotplug breaks this. How do you expect userspace to handle
hotplug? Is there a convenient way to be notified when a CPU (or memory)
is unplugged?

thanks
Brice

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: powerpc allmodconfig
Dan Williams wrote:
> On Wed, 2008-10-15 at 22:02 -0700, David Miller wrote:
>>> drivers/dma/ioat_dca.c: In function 'dca_enabled_in_bios':
>>> drivers/dma/ioat_dca.c:81: error: implicit declaration of function 'cpuid_eax'
>>> drivers/dma/ioat_dca.c: In function 'system_has_dca_enabled':
>>> drivers/dma/ioat_dca.c:91: error: implicit declaration of function 'boot_cpu_has'
>>> drivers/dma/ioat_dca.c:91: error: 'X86_FEATURE_DCA' undeclared (first use in this function)
>>> drivers/dma/ioat_dca.c:91: error: (Each undeclared identifier is reported only once
>>> drivers/dma/ioat_dca.c:91: error: for each function it appears in.)
>>> drivers/dma/ioat_dca.c: In function 'ioat_dca_get_tag':
>>> drivers/dma/ioat_dca.c:190: error: implicit declaration of function 'cpu_physical_id'
>>
>> Known issue. I tried to ping Jeff Garzik about doing a driver bug
>> fix run in order to fix this, but he hasn't shown any signs of life.
>>
>> So I'll do it myself later tonight. :-/
>
> The following seems to fix this up...
>
> ---snip---
> ixgbe, myri10ge: INTEL_IOATDMA can only be selected when X86=y

There's already a completely different fix queued in netdev patchwork
(for myri10ge only right now, to be duplicated for Intel drivers). The
idea is to stop having almost-unrelated drivers select each other
directly, let people select which drivers they really want, and have
Kconfig handle modules/builtin-stuff correctly.
See http://patchwork.ozlabs.org/patch/4506/

Brice

___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [BUILD_FAILURE] 2.6.27-git2 - allyesconfig on powerpc selects CONFIG_INTEL_IOATDMA=y
Adrian Bunk wrote:
> But considering that igb is in a similar situation it would be nice if
> all 3 drivers would handle it the same way.

Jesse,

What do you think of the below patch? I am not very familiar with
Kconfig, but it seems to solve the problem. If a Kconfig guru could
double-check...

Brice


myri10ge: Add MYRI10GE_DCA instead of selecting INTEL_IOATDMA

Add a bool MYRI10GE_DCA that is set to y when both MYRI10GE and DCA are
enabled, except when MYRI10GE=y while DCA=m. This removes the need to
select INTEL_IOATDMA when MYRI10GE is enabled, so that non-x86
architectures can build myri10ge.

Signed-off-by: Brice Goglin [EMAIL PROTECTED]

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index e9d5294..0162d55 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -2462,7 +2462,6 @@ config MYRI10GE
 	select FW_LOADER
 	select CRC32
 	select INET_LRO
-	select INTEL_IOATDMA
 	---help---
 	  This driver supports Myricom Myri-10G Dual Protocol interface in
 	  Ethernet mode. If the eeprom on your board is not recent enough,
@@ -2474,6 +2473,11 @@ config MYRI10GE
 	  To compile this driver as a module, choose M here. The module
 	  will be called myri10ge.
 
+config MYRI10GE_DCA
+	bool
+	default y
+	depends on MYRI10GE && DCA && !(MYRI10GE=y && DCA=m)
+
 config NETXEN_NIC
 	tristate "NetXen Multi port (1/10) Gigabit Ethernet NIC"
 	depends on PCI
diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
index 6dce901..a9aebad 100644
--- a/drivers/net/myri10ge/myri10ge.c
+++ b/drivers/net/myri10ge/myri10ge.c
@@ -188,7 +188,7 @@ struct myri10ge_slice_state {
 	dma_addr_t fw_stats_bus;
 	int watchdog_tx_done;
 	int watchdog_tx_req;
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
 	int cached_dca_tag;
 	int cpu;
 	__be32 __iomem *dca_tag;
@@ -220,7 +220,7 @@ struct myri10ge_priv {
 	int msi_enabled;
 	int msix_enabled;
 	struct msix_entry *msix_vectors;
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
 	int dca_enabled;
 #endif
 	u32 link_state;
@@ -902,7 +902,7 @@ static int myri10ge_reset(struct myri10ge_priv *mgp)
 	struct myri10ge_slice_state *ss;
 	int i, status;
 	size_t bytes;
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
 	unsigned long dca_tag_off;
 #endif
 
@@ -1012,7 +1012,7 @@ static int myri10ge_reset(struct myri10ge_priv *mgp)
 	}
 	put_be32(htonl(mgp->intr_coal_delay), mgp->intr_coal_delay_ptr);
 
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
 	status = myri10ge_send_cmd(mgp, MXGEFW_CMD_GET_DCA_OFFSET, &cmd, 0);
 	dca_tag_off = cmd.data0;
 	for (i = 0; i < mgp->num_slices; i++) {
@@ -1051,7 +1051,7 @@ static int myri10ge_reset(struct myri10ge_priv *mgp)
 	return status;
 }
 
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
 static void
 myri10ge_write_dca(struct myri10ge_slice_state *ss, int cpu, int tag)
 {
@@ -1505,7 +1505,7 @@ static int myri10ge_poll(struct napi_struct *napi, int budget)
 	struct net_device *netdev = ss->mgp->dev;
 	int work_done;
 
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
 	if (ss->mgp->dca_enabled)
 		myri10ge_update_dca(ss);
 #endif
@@ -1736,7 +1736,7 @@ static const char myri10ge_gstrings_main_stats[][ETH_GSTRING_LEN] = {
 	"tx_boundary", "WC", "irq", "MSI", "MSIX",
 	"read_dma_bw_MBs", "write_dma_bw_MBs", "read_write_dma_bw_MBs",
 	"serial_number", "watchdog_resets",
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
 	"dca_capable_firmware", "dca_device_present",
 #endif
 	"link_changes", "link_up", "dropped_link_overflow",
@@ -1815,7 +1815,7 @@ myri10ge_get_ethtool_stats(struct net_device *netdev,
 	data[i++] = (unsigned int)mgp->read_write_dma;
 	data[i++] = (unsigned int)mgp->serial_number;
 	data[i++] = (unsigned int)mgp->watchdog_resets;
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
 	data[i++] = (unsigned int)(mgp->ss[0].dca_tag != NULL);
 	data[i++] = (unsigned int)(mgp->dca_enabled);
 #endif
@@ -3844,7 +3844,7 @@ static int myri10ge_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		dev_err(&pdev->dev, "failed reset\n");
 		goto abort_with_slices;
 	}
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
 	myri10ge_setup_dca(mgp);
 #endif
 	pci_set_drvdata(pdev, mgp);
@@ -3948,7 +3948,7 @@ static void myri10ge_remove(struct pci_dev *pdev)
 	netdev = mgp->dev;
 	unregister_netdev(netdev);
 
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
Re: [BUILD_FAILURE] 2.6.27-git2 - allyesconfig on powerpc selects CONFIG_INTEL_IOATDMA=y
Adrian Bunk wrote:
> On Mon, Oct 13, 2008 at 03:45:59PM +0530, Kamalesh Babulal wrote:
>> Hi,
>>
>> 2.6.27-git2 kernel build fails, while building the kernel with
>> allyesconfig option. The allyesconfig selects CONFIG_INTEL_IOATDMA=y
>>
>> CC drivers/dma/ioat_dca.o
>> drivers/dma/ioat_dca.c: In function 'dca_enabled_in_bios':
>> drivers/dma/ioat_dca.c:81: error: implicit declaration of function 'cpuid_eax'
>> drivers/dma/ioat_dca.c: In function 'system_has_dca_enabled':
>> drivers/dma/ioat_dca.c:91: error: implicit declaration of function 'boot_cpu_has'
>> drivers/dma/ioat_dca.c:91: error: 'X86_FEATURE_DCA' undeclared (first use in this function)
>> drivers/dma/ioat_dca.c:91: error: (Each undeclared identifier is reported only once
>> drivers/dma/ioat_dca.c:91: error: for each function it appears in.)
>> drivers/dma/ioat_dca.c: In function 'ioat_dca_get_tag':
>> drivers/dma/ioat_dca.c:190: error: implicit declaration of function 'cpu_physical_id'
>> make[2]: *** [drivers/dma/ioat_dca.o] Error 1
>> make[1]: *** [drivers/dma] Error 2
>> make: *** [drivers] Error 2
>> ...
>
> Thanks for the report, the MYRI10GE and IXGBE commits that introduced
> the select's are really broken.
>
> For fixing it I need to know the intended semantics.
>
> Brian, Jesse, is it OK to limit the drivers to m with
> CONFIG_INTEL_IOATDMA=m ?

I think I would rather drop DCA from myri10ge if IOATDMA=m while
myri10ge=y. What's the simplest way to do so?

When Jesse told me to commit this in myri10ge, I thought it would be
nice to have DCA work the same way NETDMA/DMAengine does: you can have
NETDMA enabled without IOATDMA (either not built at all, or just not
loaded). You just don't get any DMA channel when you ask for one. Why
not do the same for DCA? There could be some generic DCA layer that can
be built all the time and returns DCA resources only if IOATDMA is
loaded/built ?

Brice
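
To make the "generic DCA layer" idea in this message concrete, here is a
rough userspace sketch. It is an assumption-laden illustration, not the
kernel's DCA API: every name below (dca_provider_sketch, sketch_get_tag,
example_provider, ...) is made up, and real code would need locking and
module reference handling.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical provider interface; all names here are illustrative only. */
struct dca_provider_sketch {
	int (*get_tag)(int cpu);
};

/* The always-built layer only remembers which provider, if any, is loaded. */
static const struct dca_provider_sketch *current_provider;

/* Would be called by a provider (an IOAT-like driver) when it loads. */
static void sketch_register_provider(const struct dca_provider_sketch *p)
{
	current_provider = p;
}

/* Would be called by the provider when it unloads. */
static void sketch_unregister_provider(void)
{
	current_provider = NULL;
}

/*
 * Always available to NIC drivers. Returns -ENODEV when no provider is
 * present, in the same spirit as "you just don't get any DMA channel".
 */
static int sketch_get_tag(int cpu)
{
	if (!current_provider)
		return -ENODEV;
	return current_provider->get_tag(cpu);
}

/* Dummy provider used only to exercise the sketch. */
static int example_get_tag(int cpu)
{
	return cpu + 1;
}

static const struct dca_provider_sketch example_provider = {
	.get_tag = example_get_tag,
};
```

The point of the design is that the NIC driver calls sketch_get_tag()
unconditionally and treats an error as "no DCA", so it no longer needs a
Kconfig select on the provider.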
Re: [BUILD_FAILURE] 2.6.27-git2 - allyesconfig on powerpc selects CONFIG_INTEL_IOATDMA=y
Brandeburg, Jesse wrote:
> What we want, is myri10ge and ixgbe drivers that can build whether or
> not CONFIG_INTEL_IOATDMA is enabled. IF CONFIG_INTEL_IOATDMA *is*
> enabled (which it should not be on PPC) then there are several cases we
> want to work:
>
> CONFIG_INTEL_IOATDMA=m ---> CONFIG_IXGBE=[m|n]
> CONFIG_INTEL_IOATDMA=y ---> CONFIG_IXGBE=[m|y|n]
> CONFIG_INTEL_IOATDMA=n ---> CONFIG_IXGBE=[m|y|n]
>
> CONFIG_INTEL_IOATDMA depends on X86

I am not sure I want to prevent myri10ge=y just because ioatdma=m.

I would vote for adding some Kconfig stuff to define CONFIG_MYRI10GE_DCA
as boolean set to yes if (IOATDMA=y and MYRI10GE=y/m) or (IOATDMA=m and
MYRI10GE=m). And then use #ifdef CONFIG_MYRI10GE_DCA in the driver
source.

Brice
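
The rule proposed in this message can be written down as a small truth
function. The sketch below only models the Kconfig tristate logic in plain
C to make the cases explicit; the enum and the function name
myri10ge_dca_wanted() are hypothetical and do not exist in the kernel.

```c
#include <stdbool.h>

/* Kconfig tristate values, modeled as a plain enum for illustration. */
enum tristate { TRI_N = 0, TRI_M = 1, TRI_Y = 2 };

/*
 * Hypothetical model of the proposed CONFIG_MYRI10GE_DCA rule:
 * yes if (IOATDMA=y and MYRI10GE=y/m) or (IOATDMA=m and MYRI10GE=m).
 */
static bool myri10ge_dca_wanted(enum tristate ioatdma, enum tristate myri10ge)
{
	if (ioatdma == TRI_Y)
		return myri10ge != TRI_N;	/* built-in provider: y or m works */
	if (ioatdma == TRI_M)
		return myri10ge == TRI_M;	/* modular provider: driver must be m */
	return false;				/* no provider: no DCA support */
}
```

The case that matters is IOATDMA=m with MYRI10GE=y: the function returns
false, so the driver still builds as y and simply loses DCA support instead
of being forced down to m.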