Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-30 Thread Brice Goglin


On 30/12/2017 at 07:58, Matthew Wilcox wrote:
> On Wed, Dec 27, 2017 at 10:10:34AM +0100, Brice Goglin wrote:
>>> Perhaps we can enlist /proc/iomem or a similar enumeration interface
>>> to tell userspace the NUMA node and whether the kernel thinks it has
>>> better or worse performance characteristics relative to base
>>> system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
>>> publishing absolute numbers in sysfs userspace will default to looking
>>> for specific magic numbers in sysfs vs asking the kernel for memory
>>> that has performance characteristics relative to base "System RAM". In
>>> other words the absolute performance information that the HMAT
>>> publishes is useful to the kernel, but it's not clear that userspace
>>> needs that vs a relative indicator for making NUMA node preference
>>> decisions.
>> Some HPC users will benchmark the machine to discover actual
>> performance numbers anyway.
>> However, most users won't do this. They will want to know relative
>> performance of different nodes. If you normalize HMAT values by dividing
>> them by system-RAM values, that's likely OK. If you just say "that
>> node is faster than system RAM", it's not precise enough.
> So "this memory has 800% bandwidth of normal" and "this memory has 70%
> bandwidth of normal"?

I guess that would work.
Brice
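
A minimal sketch of the normalization discussed above, assuming the kernel (or a userspace tool) holds one absolute figure per node plus the figure for base system RAM. The function name and the sample numbers are made up for illustration; nothing here is taken from the HMAT patches themselves.

```c
#include <assert.h>

/*
 * Express a node's absolute figure (e.g. bandwidth) as an integer
 * percentage of the system-RAM figure, rounded to the nearest percent.
 * For bandwidth a larger result means "faster than system RAM"; for
 * latency the same ratio reads the other way around (smaller is better).
 */
unsigned int relative_percent(unsigned long long node_value,
                              unsigned long long system_ram_value)
{
    /* A zero baseline would make the ratio meaningless. */
    assert(system_ram_value != 0);
    return (unsigned int)((node_value * 100 + system_ram_value / 2)
                          / system_ram_value);
}
```

With a hypothetical system-RAM bandwidth of 90000 units, a node at 720000 units would be reported as 800 and a node at 63000 units as 70, which matches the "800% of normal" and "70% of normal" wording above.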



Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-27 Thread Brice Goglin
On 22/12/2017 at 23:53, Dan Williams wrote:
> On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin <brice.gog...@gmail.com> wrote:
>> On 20/12/2017 at 23:41, Ross Zwisler wrote:
> [..]
>> Hello
>>
>> I can confirm that HPC runtimes are going to use these patches (at least
>> all runtimes that use hwloc for topology discovery, but that's the vast
>> majority of HPC anyway).
>>
>> We really didn't like KNL exposing a hacky SLIT table [1]. We had to
>> explicitly detect that specific crazy table to find out which NUMA nodes
>> were local to which cores, and to find out which NUMA nodes were
>> HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
>> application because the reported latencies didn't match reality. Quite
>> annoying.
>>
>> With Ross' patches, we can easily get what we need:
>> * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
>> can only report a single local node per CPU (doesn't work for KNL and
>> upcoming architectures with HBM+DDR+...)
>> * which NUMA nodes are slow/fast (for both bandwidth and latency)
>> And we can still look at SLIT under /sys/devices/system/node if really
>> needed.
>>
>> And of course having this in sysfs is much better than parsing ACPI
>> tables that are only accessible to root :)
> On this point, it's not clear to me that we should allow these sysfs
> entries to be world readable. Given /proc/iomem now hides physical
> address information from non-root we at least need to be careful not
> to undo that with new sysfs HMAT attributes. Once you need to be root
> for this info, is parsing binary HMAT vs sysfs a blocker for the HPC
> use case?

I don't think it would be a blocker.

> Perhaps we can enlist /proc/iomem or a similar enumeration interface
> to tell userspace the NUMA node and whether the kernel thinks it has
> better or worse performance characteristics relative to base
> system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
> publishing absolute numbers in sysfs userspace will default to looking
> for specific magic numbers in sysfs vs asking the kernel for memory
> that has performance characteristics relative to base "System RAM". In
> other words the absolute performance information that the HMAT
> publishes is useful to the kernel, but it's not clear that userspace
> needs that vs a relative indicator for making NUMA node preference
> decisions.

Some HPC users will benchmark the machine to discover actual
performance numbers anyway.
However, most users won't do this. They will want to know relative
performance of different nodes. If you normalize HMAT values by dividing
them by system-RAM values, that's likely OK. If you just say "that
node is faster than system RAM", it's not precise enough.

Brice
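
For reference, the SLIT values mentioned in the quoted text are exposed as one line of space-separated integers in /sys/devices/system/node/nodeX/distance. The sketch below only parses such a line once it has been read into a string; the function name is hypothetical and the sample values in the note are invented.

```c
#include <stdlib.h>

/*
 * Parse a line such as "10 21\n" into out[].
 * Returns the number of distances found, or -1 if the line holds
 * more than max entries.
 */
int parse_node_distances(const char *line, int *out, int max)
{
    const char *p = line;
    int n = 0;

    for (;;) {
        char *end;
        long v = strtol(p, &end, 10);

        if (end == p)       /* no further number on this line */
            break;
        if (n == max)       /* caller's buffer is too small */
            return -1;
        out[n++] = (int)v;
        p = end;
    }
    return n;
}
```

Given the line "10 21\n" read from node0, this yields two entries: the distance from node 0 to itself and the distance from node 0 to node 1.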



Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-21 Thread Brice Goglin
On 20/12/2017 at 23:41, Ross Zwisler wrote:
> On Wed, Dec 20, 2017 at 02:29:56PM -0800, Dan Williams wrote:
>> On Wed, Dec 20, 2017 at 1:24 PM, Ross Zwisler
>>  wrote:
>>> On Wed, Dec 20, 2017 at 01:16:49PM -0800, Matthew Wilcox wrote:
>>>> On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote:
>>>>> On 12/20/2017 10:19 AM, Matthew Wilcox wrote:
>>>>>> I don't know what the right interface is, but my laptop has a set of
>>>>>> /sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
>>>>>> right place to expose write_bw (etc).
>>>>> Those directories are already too redundant and wasteful.  I think we'd
>>>>> really rather not add to them.  In addition, it's technically possible
>>>>> to have a memory section span NUMA nodes and have different performance
>>>>> properties, which make it impossible to represent there.
>>>>>
>>>>> In any case, ACPI PXM's (Proximity Domains) are guaranteed to have
>>>>> uniform performance properties in the HMAT, and we just so happen to
>>>>> always create one NUMA node per PXM.  So, NUMA nodes really are a good
>>>>> fit.
>>>> I think you're missing my larger point which is that I don't think this
>>>> should be exposed to userspace as an ACPI feature.  Because if you do,
>>>> then it'll also be exposed to userspace as an openfirmware feature.
>>>> And sooner or later a devicetree feature.  And then writing a portable
>>>> program becomes an exercise in suffering.
>>>>
>>>> So, what's the right place in sysfs that isn't tied to ACPI?  A new
>>>> directory or set of directories under /sys/devices/system/memory/ ?
>>> Oh, the current location isn't at all tied to acpi except that it happens to
>>> be named 'hmat'.  When it was all named 'hmem' it was just:
>>>
>>> /sys/devices/system/hmem
>>>
>>> Which has no ACPI-isms at all.  I'm happy to move it under
>>> /sys/devices/system/memory/hmat if that's helpful, but I think we still have
>>> the issue that the data represented therein is still pulled right from the
>>> HMAT, and I don't know how to abstract it into something more platform
>>> agnostic until I know what data is provided by those other platforms.
>>>
>>> For example, the HMAT provides latency information and bandwidth information
>>> for both reads and writes.  Will the devicetree/openfirmware/etc version have
>>> this same info, or will it be just different enough that it won't translate
>>> into whatever I choose to stick in sysfs?
>> For the initial implementation do we need to have a representation of
>> all the performance data? Given that
>> /sys/devices/system/node/nodeX/distance is the only generic
>> performance attribute published by the kernel today it is already the
>> case that applications that need to target specific memories need to
>> go parse information that is not provided by the kernel by default.
>> The question is can those specialized applications stay special and go
>> parse the platform specific data sources, like raw HMAT, directly, or
>> do we expect general purpose applications to make use of this data? I
>> think a firmware-id to numa-node translation facility
>> (/sys/devices/system/node/nodeX/fwid) is a simple start that we can
>> build on with more information as specific use cases arise.
> We don't represent all the performance data, we only represent the data for
> local initiator/target pairs.  I do think that this is useful to have in sysfs
> because it provides a way to easily answer the most commonly asked questions
> (or at least what I'm guessing will be the most commonly asked questions),
> i.e. "given a CPU, what are the speeds of the various types of memory attached
> to it", and "given a chunk of memory, how fast is it and to which CPU is it
> local"?  By providing this base level of information I'm hoping to prevent
> most applications from having to parse the HMAT directly.
>
> The question of whether or not to include this local performance information
> was one of the main questions of the initial RFC patch series, and I did get
> feedback (albeit off-list) that the local performance information was
> valuable to at least some users.  I did intentionally structure my (now very
> short) set so that the performance information was added as a separate patch,
> so we can get to the place you're talking about where we only provide firmware
> id <=> proximity domain mappings by just leaving off the last patch in the
> series.
>

Hello

I can confirm that HPC runtimes are going to use these patches (at least
all runtimes that use hwloc for topology discovery, but that's the vast
majority of HPC anyway).

We really didn't like KNL exposing a hacky SLIT table [1]. We had to
explicitly detect that specific crazy table to find out which NUMA nodes
were local to which cores, and to find out which NUMA nodes were
HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
application because the reported latencies didn't match reality. Quite
annoying.

With Ross' 

Re: Topology updates and NUMA-level sched domains

2015-04-08 Thread Brice Goglin
On 07/04/2015 21:41, Peter Zijlstra wrote:
> No, that's very much not the same. Even if it were dealing with hotplug
> it would still assume the cpu to return to the same node.
>
> But mostly people do not even bother to handle hotplug.


You said userspace assumes the cpu-node relation is a boot-time fixed
one, and hotplug breaks this. How do you expect userspace to handle
hotplug? Is there a convenient way to be notified when a CPU (or memory)
is unplugged?

thanks
Brice
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: powerpc allmodconfig

2008-10-16 Thread Brice Goglin
Dan Williams wrote:
 On Wed, 2008-10-15 at 22:02 -0700, David Miller wrote:
   
 drivers/dma/ioat_dca.c: In function 'dca_enabled_in_bios':
 drivers/dma/ioat_dca.c:81: error: implicit declaration of function 
 'cpuid_eax'
 drivers/dma/ioat_dca.c: In function 'system_has_dca_enabled':
 drivers/dma/ioat_dca.c:91: error: implicit declaration of function 
 'boot_cpu_has'
 drivers/dma/ioat_dca.c:91: error: 'X86_FEATURE_DCA' undeclared (first use 
 in this function)
 drivers/dma/ioat_dca.c:91: error: (Each undeclared identifier is reported 
 only once
 drivers/dma/ioat_dca.c:91: error: for each function it appears in.)
 drivers/dma/ioat_dca.c: In function 'ioat_dca_get_tag':
 drivers/dma/ioat_dca.c:190: error: implicit declaration of function 
 'cpu_physical_id'
   
 Known issue.  I tried to ping Jeff Garzik about doing a driver bug fix run in
 order to fix this, but he hasn't shown any signs of life.

 So I'll do it myself later tonight. :-/

 
 The following seems to fix this up...

 ---snip---
 ixgbe, myri10ge: INTEL_IOATDMA can only be selected when X86=y
   

There's already a completely different fix queued in netdev patchworks
(for myri10ge only right now, to be duplicated for Intel drivers). The
idea is to stop having almost-unrelated drivers select each other
directly, let people select which drivers they really want, and have
Kconfig handle modules/builtin-stuff correctly. See
http://patchwork.ozlabs.org/patch/4506/

Brice



Re: [BUILD_FAILURE] 2.6.27-git2 - allyesconfig on powerpc selects CONFIG_INTEL_IOATDMA=y

2008-10-14 Thread Brice Goglin
Adrian Bunk wrote:
 But considering that igb is in a similar situation it would be nice if 
 all 3 drivers would handle it the same way.
   

Jesse,
What do you think of the below patch?
I am not very familiar with Kconfig, but it seems to solve the problem.
If a Kconfig guru could double-check...
Brice


myri10ge: Add MYRI10GE_DCA instead of selecting INTEL_IOATDMA

Add a bool MYRI10GE_DCA that is set to y when both MYRI10GE and DCA are
enabled, except when MYRI10GE=y while DCA=m. This removes the need to select
INTEL_IOATDMA when MYRI10GE is enabled, so that non-x86 architectures can build
myri10ge.

Signed-off-by: Brice Goglin [EMAIL PROTECTED]

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index e9d5294..0162d55 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -2462,7 +2462,6 @@ config MYRI10GE
select FW_LOADER
select CRC32
select INET_LRO
-   select INTEL_IOATDMA
---help---
  This driver supports Myricom Myri-10G Dual Protocol interface in
  Ethernet mode. If the eeprom on your board is not recent enough,
@@ -2474,6 +2473,11 @@ config MYRI10GE
  To compile this driver as a module, choose M here. The module
  will be called myri10ge.
 
+config MYRI10GE_DCA
+   bool
+   default y
+   depends on MYRI10GE && DCA && !(MYRI10GE=y && DCA=m)
+
 config NETXEN_NIC
tristate "NetXen Multi port (1/10) Gigabit Ethernet NIC"
depends on PCI
diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
index 6dce901..a9aebad 100644
--- a/drivers/net/myri10ge/myri10ge.c
+++ b/drivers/net/myri10ge/myri10ge.c
@@ -188,7 +188,7 @@ struct myri10ge_slice_state {
dma_addr_t fw_stats_bus;
int watchdog_tx_done;
int watchdog_tx_req;
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
int cached_dca_tag;
int cpu;
__be32 __iomem *dca_tag;
@@ -220,7 +220,7 @@ struct myri10ge_priv {
int msi_enabled;
int msix_enabled;
struct msix_entry *msix_vectors;
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
int dca_enabled;
 #endif
u32 link_state;
@@ -902,7 +902,7 @@ static int myri10ge_reset(struct myri10ge_priv *mgp)
struct myri10ge_slice_state *ss;
int i, status;
size_t bytes;
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
unsigned long dca_tag_off;
 #endif
 
@@ -1012,7 +1012,7 @@ static int myri10ge_reset(struct myri10ge_priv *mgp)
}
put_be32(htonl(mgp->intr_coal_delay), mgp->intr_coal_delay_ptr);
 
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
status = myri10ge_send_cmd(mgp, MXGEFW_CMD_GET_DCA_OFFSET, cmd, 0);
dca_tag_off = cmd.data0;
for (i = 0; i < mgp->num_slices; i++) {
@@ -1051,7 +1051,7 @@ static int myri10ge_reset(struct myri10ge_priv *mgp)
return status;
 }
 
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
 static void
 myri10ge_write_dca(struct myri10ge_slice_state *ss, int cpu, int tag)
 {
@@ -1505,7 +1505,7 @@ static int myri10ge_poll(struct napi_struct *napi, int budget)
struct net_device *netdev = ss->mgp->dev;
int work_done;
 
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
if (ss->mgp->dca_enabled)
myri10ge_update_dca(ss);
 #endif
@@ -1736,7 +1736,7 @@ static const char myri10ge_gstrings_main_stats[][ETH_GSTRING_LEN] = {
"tx_boundary", "WC", "irq", "MSI", "MSIX",
"read_dma_bw_MBs", "write_dma_bw_MBs", "read_write_dma_bw_MBs",
"serial_number", "watchdog_resets",
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
"dca_capable_firmware", "dca_device_present",
 #endif
"link_changes", "link_up", "dropped_link_overflow",
@@ -1815,7 +1815,7 @@ myri10ge_get_ethtool_stats(struct net_device *netdev,
data[i++] = (unsigned int)mgp->read_write_dma;
data[i++] = (unsigned int)mgp->serial_number;
data[i++] = (unsigned int)mgp->watchdog_resets;
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
data[i++] = (unsigned int)(mgp->ss[0].dca_tag != NULL);
data[i++] = (unsigned int)(mgp->dca_enabled);
 #endif
@@ -3844,7 +3844,7 @@ static int myri10ge_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
dev_err(&pdev->dev, "failed reset\n");
goto abort_with_slices;
}
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA
myri10ge_setup_dca(mgp);
 #endif
pci_set_drvdata(pdev, mgp);
@@ -3948,7 +3948,7 @@ static void myri10ge_remove(struct pci_dev *pdev)
netdev = mgp->dev;
unregister_netdev(netdev);
 
-#if (defined CONFIG_DCA) || (defined CONFIG_DCA_MODULE)
+#ifdef CONFIG_MYRI10GE_DCA

Re: [BUILD_FAILURE] 2.6.27-git2 - allyesconfig on powerpc selects CONFIG_INTEL_IOATDMA=y

2008-10-13 Thread Brice Goglin
Adrian Bunk wrote:
 On Mon, Oct 13, 2008 at 03:45:59PM +0530, Kamalesh Babulal wrote:
   
 Hi,

2.6.27-git2 kernel build fails, while building the kernel with
 allyesconfig option. The allyesconfig selects CONFIG_INTEL_IOATDMA=y

 CC   drivers/dma/ioat_dca.o
 drivers/dma/ioat_dca.c: In function ‘dca_enabled_in_bios’:
 drivers/dma/ioat_dca.c:81: error: implicit declaration of function 
 ‘cpuid_eax’
 drivers/dma/ioat_dca.c: In function ‘system_has_dca_enabled’:
 drivers/dma/ioat_dca.c:91: error: implicit declaration of function 
 ‘boot_cpu_has’
 drivers/dma/ioat_dca.c:91: error: ‘X86_FEATURE_DCA’ undeclared (first 
 use in this function)
 drivers/dma/ioat_dca.c:91: error: (Each undeclared identifier is reported 
 only once
 drivers/dma/ioat_dca.c:91: error: for each function it appears in.)
 drivers/dma/ioat_dca.c: In function ‘ioat_dca_get_tag’:
 drivers/dma/ioat_dca.c:190: error: implicit declaration of function 
 ‘cpu_physical_id’
 make[2]: *** [drivers/dma/ioat_dca.o] Error 1
 make[1]: *** [drivers/dma] Error 2
 make: *** [drivers] Error 2
 ...
 

 Thanks for the report, the MYRI10GE and IXGBE commits that introduced 
 the select's are really broken.

 For fixing it I need to know the intended semantics.

 Brian, Jesse, is it OK to limit the drivers to m with 
 CONFIG_INTEL_IOATDMA=m ?
   

I think I would rather drop DCA from myri10ge if IOATDMA=m while
myri10ge=y. What's the simplest way to do so?

When Jesse told me to commit this in myri10ge, I thought it would be
nice to have DCA work the same way as NETDMA/DMAengine does: you can have
NETDMA enabled without IOATDMA (either not built at all, or just not
loaded). You just don't get any DMA channel when you ask for one. Why
not do the same for DCA? There could be some generic DCA layer that can
be built all the time and returns DCA resources only if IOATDMA is
loaded/built ?

Brice
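
A rough userspace model of the "generic DCA layer" idea above, assuming a thin layer that is always built and simply reports failure while no provider has registered. All names here (dca_model_*, demo_provider) are hypothetical and are not the kernel's DCA API.

```c
#include <stddef.h>

/* Hypothetical provider interface: something like IOATDMA would fill it in. */
struct dca_provider_model {
    int (*get_tag)(int cpu);
};

/* The generic layer always exists; the provider pointer may stay NULL. */
static const struct dca_provider_model *current_provider;

void dca_model_register(const struct dca_provider_model *p)
{
    current_provider = p;
}

void dca_model_unregister(void)
{
    current_provider = NULL;
}

/*
 * Returns 0 and stores a tag when a provider is present, -1 otherwise.
 * A driver built against this layer keeps working without the provider;
 * it just never receives a DCA resource.
 */
int dca_model_get_tag(int cpu, int *tag)
{
    if (current_provider == NULL || current_provider->get_tag == NULL)
        return -1;
    *tag = current_provider->get_tag(cpu);
    return 0;
}

/* Stand-in provider used only to exercise the model. */
int demo_get_tag(int cpu)
{
    return cpu + 100;
}

const struct dca_provider_model demo_provider = { demo_get_tag };
```

The point of the sketch is the failure path: a requester sees -1 both before the provider loads and after it unloads, much as a NETDMA user gets no DMA channel when IOATDMA is absent.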


Re: [BUILD_FAILURE] 2.6.27-git2 - allyesconfig on powerpc selects CONFIG_INTEL_IOATDMA=y

2008-10-13 Thread Brice Goglin
Brandeburg, Jesse wrote:
 What we want, is myri10ge and ixgbe drivers that can build whether or not 
 CONFIG_INTEL_IOATDMA is enabled.  IF CONFIG_INTEL_IOATDMA *is* enabled (which 
 it should not be on PPC) then there are several cases we want to work:
 CONFIG_INTEL_IOATDMA=m  --- CONFIG_IXGBE=[m|n]
 CONFIG_INTEL_IOATDMA=y  --- CONFIG_IXGBE=[m|y|n]
 CONFIG_INTEL_IOATDMA=n  --- CONFIG_IXGBE=[m|y|n]
 CONFIG_INTEL_IOATDMA depends on X86
   

I am not sure I want to prevent myri10ge=y just because ioatdma=m.

I would vote for adding some Kconfig stuff to define CONFIG_MYRI10GE_DCA
as boolean set to yes if (IOATDMA=y and MYRI10GE=y/m) or (IOATDMA=m and
MYRI10GE=m). And then use #ifdef CONFIG_MYRI10GE_DCA in the driver source.

Brice
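
The rule proposed above can be checked against Kconfig's tristate arithmetic. The sketch below assumes the usual Kconfig semantics (n < m < y, "&&" is the minimum, "!" maps y to n and n to y, "=" yields y or n) and models a dependency of the form `DRV && PROV && !(DRV=y && PROV=m)`; the C names are illustrative only.

```c
enum tristate { TRI_N = 0, TRI_M = 1, TRI_Y = 2 };

static enum tristate tri_and(enum tristate a, enum tristate b)
{
    return a < b ? a : b;               /* Kconfig "&&" is min() */
}

static enum tristate tri_not(enum tristate a)
{
    return (enum tristate)(TRI_Y - a);  /* Kconfig "!" is 2 - value */
}

static enum tristate tri_eq(enum tristate a, enum tristate b)
{
    return a == b ? TRI_Y : TRI_N;      /* "SYM=val" is y or n */
}

/* The rule as worded in the message: (prov=y and drv=y/m) or (prov=m and drv=m). */
int rule_from_message(enum tristate drv, enum tristate prov)
{
    return (prov == TRI_Y && drv != TRI_N) ||
           (prov == TRI_M && drv == TRI_M);
}

/* The same rule as a dependency expression; a bool symbol is on when it is not n. */
int rule_as_expression(enum tristate drv, enum tristate prov)
{
    enum tristate builtin_vs_module =
        tri_and(tri_eq(drv, TRI_Y), tri_eq(prov, TRI_M));
    enum tristate dep =
        tri_and(tri_and(drv, prov), tri_not(builtin_vs_module));

    return dep != TRI_N;
}
```

Both functions agree on all nine combinations; the only case with both symbols enabled that turns the option off is a built-in driver with a modular provider.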
