Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-07 Thread Jerome Glisse
On Fri, Dec 07, 2018 at 03:06:36PM +0000, Jonathan Cameron wrote:
> On Thu, 6 Dec 2018 19:20:45 -0500
> Jerome Glisse  wrote:
> 
> > On Thu, Dec 06, 2018 at 04:48:57PM -0700, Logan Gunthorpe wrote:
> > > 
> > > 
> > > On 2018-12-06 4:38 p.m., Dave Hansen wrote:  
> > > > On 12/6/18 3:28 PM, Logan Gunthorpe wrote:  
> > > >> I didn't think this was meant to describe actual real world performance
> > > >> between all of the links. If that's the case all of this seems like a
> > > >> pipe dream to me.  
> > > > 
> > > > The HMAT discussions (that I was a part of at least) settled on just
> > > > trying to describe what we called "sticker speed".  Nobody had an
> > > > expectation that you *really* had to measure everything.
> > > > 
> > > > The best we can do for any of these approaches is approximate things.  
> > > 
> > > Yes, though there's a lot of caveats in this assumption alone.
> > > Specifically with PCI: the bus may run at however many GB/s but P2P
> > > through a CPU's root complexes can slow down significantly (like down to
> > > MB/s).
> > > 
> > > I've seen similar things across QPI: I can sometimes do P2P from
> > > PCI->QPI->PCI but the performance doesn't even come close to the sticker
> > > speed of any of those buses.
> > > 
> > > I'm not sure how anyone is going to deal with those issues, but it does
> > > firmly place us in world view #2 instead of #1. But, yes, I agree
> > > exposing information like in #2 full out to userspace, especially
> > > through sysfs, seems like a nightmare and I don't see anything in HMS to
> > > help with that. Providing an API to ask for memory (or another resource)
> > > that's accessible by a set of initiators and with a set of requirements
> > > for capabilities seems more manageable.  
> > 
> > Note that in #1 you have bridges that fully allow expressing those path
> > limitations. So what you just described can be fully reported to userspace.
> > 
> > I explained and gave examples of how programs adapt their computation to
> > the system topology; this exists today, and people are even developing new
> > programming languages with some of those ideas baked in.
> > 
> > So there are people out there who already rely on such information; they
> > just do not get it from the kernel but from a mix of various device-specific
> > APIs, and they have to stitch everything together themselves and develop a
> > database of quirks and gotchas. My proposal is to provide a coherent kernel
> > API where we can sanitize that information and report it to userspace in a
> > single and coherent description.
> > 
> > Cheers,
> > Jérôme
> 
> I know it doesn't work everywhere, but I think it's worth enumerating what
> cases we can get some of these numbers for and where the complexity lies.
> I.e. What can the really determined user space library do today?

I gave an example in an email in this thread:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1821872.html

Is that the kind of example you are looking for? :)

> 
> So one open question is how close we can get in a userspace-only prototype.
> At the end of the day userspace can often read the HMAT directly if it wants
> to, from /sys/firmware/acpi/tables/HMAT.  Obviously that gets us only the
> end-to-end view (world view #2).  I dislike the limitations of that as much
> as the next person. It is slowly improving, with the word "Auditable" being
> kicked around - btw, anyone interested in ACPI who works for a UEFI member,
> there are efforts going on and more viewpoints would be great.  Expect some
> baby steps shortly.
> 
> For devices on PCIe (and protocols on top of it e.g. CCIX), a lot of
> this is discoverable to some degree. 
> * Link speed,
> * Number of Lanes,
> * Full topology.

Yes, for discoverable buses like PCIe and all their derivatives (CCIX,
OpenCAPI, ...) userspace will have a way to find the topology. The issue
lies with the orthogonal topology of extra buses that are not necessarily
enumerated, or that do not have a device driver today, and especially with
how they interact with each other (can you cross them? ...).
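
For instance, a determined userspace library can already pull the PCIe part
of this out of sysfs today. A minimal sketch (plain C, nothing HMS-specific;
the BDF 0000:01:00.0 is only a placeholder for whatever device you care
about, and the attributes read are the standard PCI sysfs ones):

#include <stdio.h>

static void read_attr(const char *bdf, const char *attr)
{
	char path[256], buf[64];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/%s", bdf, attr);
	f = fopen(path, "r");
	if (!f || !fgets(buf, sizeof(buf), f)) {
		printf("%s: <unavailable>\n", attr);
		if (f)
			fclose(f);
		return;
	}
	printf("%s: %s", attr, buf);
	fclose(f);
}

int main(void)
{
	const char *bdf = "0000:01:00.0";	/* placeholder device */

	read_attr(bdf, "current_link_speed");	/* e.g. "8.0 GT/s" */
	read_attr(bdf, "current_link_width");	/* e.g. "16" */
	read_attr(bdf, "max_link_speed");
	read_attr(bdf, "max_link_width");
	return 0;
}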

> 
> What isn't there (I think):
> * In-component latency / bandwidth limitations (some activity going
>   on to improve that long term)
> * Effect of credit allocations etc. on effective bandwidth - interconnect
>   performance is a whole load of black magic.
> 
> Presumably there is some information available from NVLink etc?

From my point of view we want to give the best-case sticker value to
userspace, i.e. the bandwidth that the engineers who designed the bus swore
their hardware delivers :)

I believe it is the best approximation we can deliver.

> 
> So whilst I really like the proposal in some ways, I wonder how much
> exploration could be done of the usefulness of the data without touching
> the kernel at all.
> 
> The other aspect that is needed to actually make this 'dynamically' useful is
> to be able to map whatever Performance Counters are available to the relevant
> 'links', bridges etc.  Sticker numbers are not all that useful, unfortunately,
> except for small amounts of data on lightly loaded buses.

Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-07 Thread Jonathan Cameron
On Thu, 6 Dec 2018 19:20:45 -0500
Jerome Glisse  wrote:

> On Thu, Dec 06, 2018 at 04:48:57PM -0700, Logan Gunthorpe wrote:
> > 
> > 
> > On 2018-12-06 4:38 p.m., Dave Hansen wrote:  
> > > On 12/6/18 3:28 PM, Logan Gunthorpe wrote:  
> > >> I didn't think this was meant to describe actual real world performance
> > >> between all of the links. If that's the case all of this seems like a
> > >> pipe dream to me.  
> > > 
> > > The HMAT discussions (that I was a part of at least) settled on just
> > > trying to describe what we called "sticker speed".  Nobody had an
> > > expectation that you *really* had to measure everything.
> > > 
> > > The best we can do for any of these approaches is approximate things.  
> > 
> > Yes, though there's a lot of caveats in this assumption alone.
> > Specifically with PCI: the bus may run at however many GB/s but P2P
> > through a CPU's root complexes can slow down significantly (like down to
> > MB/s).
> > 
> > I've seen similar things across QPI: I can sometimes do P2P from
> > PCI->QPI->PCI but the performance doesn't even come close to the sticker
> > speed of any of those buses.
> > 
> > I'm not sure how anyone is going to deal with those issues, but it does
> > firmly place us in world view #2 instead of #1. But, yes, I agree
> > exposing information like in #2 full out to userspace, especially
> > through sysfs, seems like a nightmare and I don't see anything in HMS to
> > help with that. Providing an API to ask for memory (or another resource)
> > that's accessible by a set of initiators and with a set of requirements
> > for capabilities seems more manageable.  
> 
> Note that in #1 you have bridges that fully allow expressing those path
> limitations. So what you just described can be fully reported to userspace.
> 
> I explained and gave examples of how programs adapt their computation to
> the system topology; this exists today, and people are even developing new
> programming languages with some of those ideas baked in.
> 
> So there are people out there who already rely on such information; they
> just do not get it from the kernel but from a mix of various device-specific
> APIs, and they have to stitch everything together themselves and develop a
> database of quirks and gotchas. My proposal is to provide a coherent kernel
> API where we can sanitize that information and report it to userspace in a
> single and coherent description.
> 
> Cheers,
> Jérôme

I know it doesn't work everywhere, but I think it's worth enumerating what
cases we can get some of these numbers for and where the complexity lies.
I.e. What can the really determined user space library do today?

So one open question is how close we can get in a userspace-only prototype.
At the end of the day userspace can often read the HMAT directly if it wants
to, from /sys/firmware/acpi/tables/HMAT.  Obviously that gets us only the
end-to-end view (world view #2).  I dislike the limitations of that as much
as the next person. It is slowly improving, with the word "Auditable" being
kicked around - btw, anyone interested in ACPI who works for a UEFI member,
there are efforts going on and more viewpoints would be great.  Expect some
baby steps shortly.
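
As a concrete starting point, a minimal sketch of that userspace-only
prototype idea: pull the raw table out of sysfs and decode the fixed
36-byte ACPI header (root is normally required; parsing the proximity
domain entries that follow the header is left out here):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
	unsigned char hdr[36];			/* standard ACPI table header */
	char sig[5] = { 0 };
	uint32_t len;
	FILE *f = fopen("/sys/firmware/acpi/tables/HMAT", "rb");

	if (!f || fread(hdr, 1, sizeof(hdr), f) != sizeof(hdr)) {
		perror("HMAT");
		return 1;
	}
	memcpy(sig, hdr, 4);			/* should read "HMAT" */
	memcpy(&len, hdr + 4, sizeof(len));	/* total table length (LE) */
	printf("signature=%s length=%u revision=%u\n", sig, len, hdr[8]);
	fclose(f);
	return 0;
}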

For devices on PCIe (and protocols on top of it e.g. CCIX), a lot of
this is discoverable to some degree. 
* Link speed,
* Number of Lanes,
* Full topology.

What isn't there (I think):
* In-component latency / bandwidth limitations (some activity going
  on to improve that long term)
* Effect of credit allocations etc. on effective bandwidth - interconnect
  performance is a whole load of black magic.

Presumably there is some information available from NVLink etc?

So whilst I really like the proposal in some ways, I wonder how much exploration
could be done of the usefulness of the data without touching the kernel at all.

The other aspect that is needed to actually make this 'dynamically' useful is
to be able to map whatever Performance Counters are available to the relevant
'links', bridges etc.  Sticker numbers are not all that useful, unfortunately,
except for small amounts of data on lightly loaded buses.

The kernel ultimately only needs to have a model of this topology if:
1) It's going to use it itself.
2) It's going to do something automatic with it.
3) It needs to fix garbage info or supplement it with things only the kernel knows.

Jonathan



Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-06 Thread Jerome Glisse
On Thu, Dec 06, 2018 at 04:48:57PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-06 4:38 p.m., Dave Hansen wrote:
> > On 12/6/18 3:28 PM, Logan Gunthorpe wrote:
> >> I didn't think this was meant to describe actual real world performance
> >> between all of the links. If that's the case all of this seems like a
> >> pipe dream to me.
> > 
> > The HMAT discussions (that I was a part of at least) settled on just
> > trying to describe what we called "sticker speed".  Nobody had an
> > expectation that you *really* had to measure everything.
> > 
> > The best we can do for any of these approaches is approximate things.
> 
> Yes, though there's a lot of caveats in this assumption alone.
> Specifically with PCI: the bus may run at however many GB/s but P2P
> through a CPU's root complexes can slow down significantly (like down to
> MB/s).
> 
> I've seen similar things across QPI: I can sometimes do P2P from
> PCI->QPI->PCI but the performance doesn't even come close to the sticker
> speed of any of those buses.
> 
> I'm not sure how anyone is going to deal with those issues, but it does
> firmly place us in world view #2 instead of #1. But, yes, I agree
> exposing information like in #2 full out to userspace, especially
> through sysfs, seems like a nightmare and I don't see anything in HMS to
> help with that. Providing an API to ask for memory (or another resource)
> that's accessible by a set of initiators and with a set of requirements
> for capabilities seems more manageable.

Note that in #1 you have bridges that fully allow expressing those path
limitations. So what you just described can be fully reported to userspace.

I explained and gave examples of how programs adapt their computation to
the system topology; this exists today, and people are even developing new
programming languages with some of those ideas baked in.

So there are people out there who already rely on such information; they
just do not get it from the kernel but from a mix of various device-specific
APIs, and they have to stitch everything together themselves and develop a
database of quirks and gotchas. My proposal is to provide a coherent kernel
API where we can sanitize that information and report it to userspace in a
single and coherent description.

Cheers,
Jérôme


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-06 Thread Jerome Glisse
On Thu, Dec 06, 2018 at 03:09:21PM -0800, Dave Hansen wrote:
> On 12/6/18 2:39 PM, Jerome Glisse wrote:
> > Now, if the 4 sockets are connected in a ring fashion, i.e.:
> > Socket0 - Socket1
> >    |         |
> > Socket3 - Socket2
> > 
> > then you have 4 links:
> > link0: socket0 socket1
> > link1: socket1 socket2
> > link3: socket2 socket3
> > link4: socket3 socket0
> > 
> > I do not see how there can be an explosion of link directories; the worst
> > case is as many link directories as there are buses for a CPU/device/
> > target.
> 
> This looks great.  But, we don't _have_ this kind of information for any
> system that I know about or any system available in the near future.

We do not have it in any standard way; it is out there in either a
device driver database, an application database, or a special platform
OEM blob buried somewhere in the firmware ...

I want to solve the kernel side of the problem, i.e. how to expose
this to userspace. How the kernel gets that information is an
orthogonal problem. For now my intention is to have device drivers
register and create the links and bridges that are not enumerated
by standard firmware.
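
Purely as an illustration of that idea (the names below are invented for
this mail and are not the API in these patches), the shape of what a driver
would hand to the core is roughly:

#include <stdio.h>

struct hms_link_example {
	const char *a, *b;	/* the two connected nodes */
	unsigned bandwidth;	/* sticker bandwidth, GB/s */
	unsigned latency;	/* sticker latency, ns (placeholder value below) */
	int cache_coherent;
};

static struct hms_link_example links[16];
static int nr_links;

/* what a driver would call for a bus that firmware does not enumerate */
static int register_link_example(const char *a, const char *b,
				 unsigned bw, unsigned lat, int coherent)
{
	if (nr_links >= 16)
		return -1;
	links[nr_links++] = (struct hms_link_example){ a, b, bw, lat, coherent };
	return 0;
}

int main(void)
{
	/* e.g. a GPU driver registering a fast proprietary GPU<->GPU link */
	register_link_example("gpu0", "gpu1", 400, 500, 1);

	for (int i = 0; i < nr_links; i++)
		printf("link%d: %s <-> %s %uGB/s %uns coherent=%d\n",
		       i, links[i].a, links[i].b, links[i].bandwidth,
		       links[i].latency, links[i].cache_coherent);
	return 0;
}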

> 
> We basically have two different world views:
> 1. The system is described point-to-point.  A connects to B @
>100GB/s.  B connects to C at 50GB/s.  Thus, C->A should be
>50GB/s.
>* Less information to convey
>* Potentially less precise if the properties are not perfectly
>  additive.  If A->B=10ns and B->C=20ns, A->C might be >30ns.
>* Costs must be calculated instead of being explicitly specified
> 2. The system is described endpoint-to-endpoint.  A->B @ 100GB/s
>B->C @ 50GB/s, A->C @ 50GB/s.
>* A *lot* more information to convey O(N^2)?
>* Potentially more precise.
>* Costs are explicitly specified, not calculated
> 
> These patches are really tied to world view #1.  But, the HMAT is really
> tied to world view #1.
  ^#2

Note that there are also bridge objects in my proposal. So in my
proposal, for #1 you have:
link0: A <-> B with 100GB/s and 10ns latency
link1: B <-> C with 50GB/s and 20ns latency

Now if A can reach C through B then you have bridges (bridges are uni-
directional, unlike links, which are bi-directional; though that finer
point can be discussed, this is what allows any kind of directed graph to
be represented):
bridge2: link0 -> link1
bridge3: link1 -> link0

You can also associate properties with a bridge (but it is not mandatory).
So you can say that bridge2 and bridge3 have a latency of 50ns; if simply
adding up the link latencies is enough then you do not specify it on the
bridge. The rule is that a path's latency is the sum of its individual link
latencies. For bandwidth it is the minimum bandwidth, i.e. whatever is the
bottleneck for the path.
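
Put differently, the two rules above in a few lines of C (this is just the
stated rule turned into code, using the link0/link1 numbers from this
example; an explicit bridge latency, when given, would simply override the
sum):

#include <stdio.h>

struct link { unsigned bandwidth; unsigned latency; };	/* GB/s, ns */

static void path_cost(const struct link *path, int n,
		      unsigned *bw, unsigned *lat)
{
	*bw = path[0].bandwidth;
	*lat = 0;
	for (int i = 0; i < n; i++) {
		if (path[i].bandwidth < *bw)
			*bw = path[i].bandwidth;	/* bottleneck bandwidth */
		*lat += path[i].latency;		/* latencies add up */
	}
}

int main(void)
{
	struct link link0 = { 100, 10 };		/* A <-> B */
	struct link link1 = {  50, 20 };		/* B <-> C */
	struct link a_to_c[] = { link0, link1 };	/* A -> C through bridge2 */
	unsigned bw, lat;

	path_cost(a_to_c, 2, &bw, &lat);
	printf("A->C: %u GB/s, %u ns\n", bw, lat);	/* 50 GB/s, 30 ns */
	return 0;
}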


> I know you're not a fan of the HMAT.  But it is the firmware reality
> that we are stuck with, until something better shows up.  I just don't
> see a way to convert it into what you have described here.

Like I said, I am not targeting HMAT systems; I am targeting systems that
rely today on a database spread between driver and application. I want to
move that knowledge into the drivers first so that they can teach the core
kernel and register things in the core. Providing a standard firmware
way to convey this information is a different problem (there are some
loose standards on non-ACPI platforms AFAIK).

> I'm starting to think that, no matter if the HMAT or some other approach
> gets adopted, we shouldn't be exposing this level of gunk to userspace
> at *all* since it requires adopting one of the world views.

I do not see these as exclusive. Yes, there are HMAT systems "soon" to arrive,
but we already have the more extended view, which is just buried under a
pile of different pieces. I do not see any exclusion between the two. If
HMAT is good enough for a whole class of systems, fine, but there is also
a whole class of systems and users that do not fit in that paradigm, hence
my proposal.

Cheers,
Jérôme


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-06 Thread Logan Gunthorpe



On 2018-12-06 4:38 p.m., Dave Hansen wrote:
> On 12/6/18 3:28 PM, Logan Gunthorpe wrote:
>> I didn't think this was meant to describe actual real world performance
>> between all of the links. If that's the case all of this seems like a
>> pipe dream to me.
> 
> The HMAT discussions (that I was a part of at least) settled on just
> trying to describe what we called "sticker speed".  Nobody had an
> expectation that you *really* had to measure everything.
> 
> The best we can do for any of these approaches is approximate things.

Yes, though there's a lot of caveats in this assumption alone.
Specifically with PCI: the bus may run at however many GB/s but P2P
through a CPU's root complexes can slow down significantly (like down to
MB/s).

I've seen similar things across QPI: I can sometimes do P2P from
PCI->QPI->PCI but the performance doesn't even come close to the sticker
speed of any of those buses.

I'm not sure how anyone is going to deal with those issues, but it does
firmly place us in world view #2 instead of #1. But, yes, I agree
exposing information like in #2 full out to userspace, especially
through sysfs, seems like a nightmare and I don't see anything in HMS to
help with that. Providing an API to ask for memory (or another resource)
that's accessible by a set of initiators and with a set of requirements
for capabilities seems more manageable.

Logan


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-06 Thread Dave Hansen
On 12/6/18 3:28 PM, Logan Gunthorpe wrote:
> I didn't think this was meant to describe actual real world performance
> between all of the links. If that's the case all of this seems like a
> pipe dream to me.

The HMAT discussions (that I was a part of at least) settled on just
trying to describe what we called "sticker speed".  Nobody had an
expectation that you *really* had to measure everything.

The best we can do for any of these approaches is approximate things.

> You're not *really* going to know bandwidth or latency for any of this
> unless you actually measure it on the system in question.

Yeah, agreed.


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-06 Thread Dave Hansen
On 12/6/18 3:28 PM, Logan Gunthorpe wrote:
> These patches are really tied to world view #1.  But, the HMAT is really
> tied to world view #1.

Whoops, should have been "the HMAT is really tied to world view #2"


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-06 Thread Logan Gunthorpe



On 2018-12-06 4:09 p.m., Dave Hansen wrote:
> This looks great.  But, we don't _have_ this kind of information for any
> system that I know about or any system available in the near future.
> 
> We basically have two different world views:
> 1. The system is described point-to-point.  A connects to B @
>100GB/s.  B connects to C at 50GB/s.  Thus, C->A should be
>50GB/s.
>* Less information to convey
>* Potentially less precise if the properties are not perfectly
>  additive.  If A->B=10ns and B->C=20ns, A->C might be >30ns.
>* Costs must be calculated instead of being explicitly specified
> 2. The system is described endpoint-to-endpoint.  A->B @ 100GB/s
>B->C @ 50GB/s, A->C @ 50GB/s.
>* A *lot* more information to convey O(N^2)?
>* Potentially more precise.
>* Costs are explicitly specified, not calculated
> 
> These patches are really tied to world view #1.  But, the HMAT is really
> tied to world view #1.

I didn't think this was meant to describe actual real world performance
between all of the links. If that's the case all of this seems like a
pipe dream to me.

Attributes like cache coherency, atomics, etc should fit well in world
view #1... and, at best, some kind of flag saying whether or not to use
a particular link if you care about transfer speed. -- But we don't need
special "link" directories to describe the properties of existing buses.

You're not *really* going to know bandwidth or latency for any of this
unless you actually measure it on the system in question.

Logan


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-06 Thread Dave Hansen
On 12/6/18 2:39 PM, Jerome Glisse wrote:
> Now, if the 4 sockets are connected in a ring fashion, i.e.:
> Socket0 - Socket1
>    |         |
> Socket3 - Socket2
> 
> then you have 4 links:
> link0: socket0 socket1
> link1: socket1 socket2
> link3: socket2 socket3
> link4: socket3 socket0
> 
> I do not see how there can be an explosion of link directories; the worst
> case is as many link directories as there are buses for a CPU/device/
> target.

This looks great.  But, we don't _have_ this kind of information for any
system that I know about or any system available in the near future.

We basically have two different world views:
1. The system is described point-to-point.  A connects to B @
   100GB/s.  B connects to C at 50GB/s.  Thus, C->A should be
   50GB/s.
   * Less information to convey
   * Potentially less precise if the properties are not perfectly
 additive.  If A->B=10ns and B->C=20ns, A->C might be >30ns.
   * Costs must be calculated instead of being explicitly specified
2. The system is described endpoint-to-endpoint.  A->B @ 100GB/s
   B->C @ 50GB/s, A->C @ 50GB/s.
   * A *lot* more information to convey O(N^2)?
   * Potentially more precise.
   * Costs are explicitly specified, not calculated
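
To put rough numbers on the difference: for the 4-socket ring quoted above,
view #1 needs one entry per physical link while view #2 needs one entry per
initiator/target pair. A trivial count-only sketch (nothing HMAT-specific):

#include <stdio.h>

int main(void)
{
	int nodes = 4;				/* socket0..socket3 in a ring */
	int links = 4;				/* one per ring inter-connect */
	int pairs = nodes * (nodes - 1);	/* directed endpoint pairs */

	printf("world view #1: %d link entries\n", links);
	printf("world view #2: %d endpoint-to-endpoint entries\n", pairs);
	return 0;
}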

These patches are really tied to world view #1.  But, the HMAT is really
tied to world view #1.

I know you're not a fan of the HMAT.  But it is the firmware reality
that we are stuck with, until something better shows up.  I just don't
see a way to convert it into what you have described here.

I'm starting to think that, no matter if the HMAT or some other approach
gets adopted, we shouldn't be exposing this level of gunk to userspace
at *all* since it requires adopting one of the world views.


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-06 Thread Jerome Glisse
On Thu, Dec 06, 2018 at 02:04:46PM -0800, Dave Hansen wrote:
> On 12/6/18 12:11 PM, Logan Gunthorpe wrote:
> >> My concern is that having folks do per-program parsing, *and* having a huge
> >> amount of data to parse, makes it unusable.  The largest systems will
> >> literally have hundreds of thousands of objects in /sysfs, even in a
> >> single directory.  That makes readdir() basically impossible, and makes
> >> even open() (if you already know the path you want somehow) hard to do fast.
> > Is this actually realistic? I find it hard to imagine an actual hardware
> > bus that can have even thousands of devices under a single node, let
> > alone hundreds of thousands.
> 
> Jerome's proposal, as I understand it, would have generic "links".
> They're not an instance of bus, but characterize a class of "link".  For
> instance, a "link" might characterize the characteristics of the QPI bus
> between two CPU sockets. The link directory would enumerate the list of
> all *instances* of that link
> 
> So, a "link" directory for QPI would say Socket0<->Socket1,
> Socket1<->Socket2, Socket1<->Socket2, Socket2<->PCIe-1.2.3.4 etc...  It
> would have to enumerate the connections between every entity that shared
> those link properties.
> 
> While there might not be millions of buses, there could be millions of
> *paths* across all those buses, and that's what the HMAT describes, at
> least: the net result of all those paths.

Sorry if I again mis-explained things. Links are arrows between nodes
(CPU or device or memory). An arrow/link has properties associated
with it: bandwidth, latency, cache coherence, ...

So if in your system you have 4 sockets and each socket is connected to
every other (mesh), and all inter-connects in the mesh have the same
properties, then you only have 1 link directory with the 4 sockets in it.

Now, if the 4 sockets are connected in a ring fashion, i.e.:
Socket0 - Socket1
   |         |
Socket3 - Socket2

then you have 4 links:
link0: socket0 socket1
link1: socket1 socket2
link3: socket2 socket3
link4: socket3 socket0

I do not see how there can be an explosion of link directories; the worst
case is as many link directories as there are buses for a CPU/device/
target. So, worst case, if you have N devices and each device is
connected to 2 buses (PCIe, and QPI to go to the other socket, for
instance) then you have 2*N link directories (again, this is a worst case).

There is a lot of commonality that will remain, so I expect that quite
a few link directories will have many symlinks, i.e. you won't get close
to the worst case.


In the end it really is easier to think in terms of the physical topology,
where a link corresponds to an inter-connect between two devices or CPUs.
In all the systems I have seen, even in the craziest roadmaps, I have only
seen things like 128/256 inter-connects (4 sockets, 32/64 devices per
socket), many of which can be grouped under a common link directory. Here
the worst case is 4 connections per device/CPU/target, so a worst case of
128/256 * 4 = 512/1024 link directories, and that's a lot. Given the
regularity I have seen described on slides, I expect it would need
something like 30 link directories and 20 bridge directories.

On today's systems (8 GPUs per socket, with a GPU link between each GPU,
plus PCIe, all this with 4 sockets) it comes down to 20 link directories.

In any case, each device/CPU/target has a limit on the number of
buses/inter-connects it is connected to. I doubt anyone is designing
a device that will have much more than 4 external bus connections.

So it is not a link per pair. It is a link for a group of devices/CPUs/
targets. Is that any clearer?
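
A tiny model of that last point (placeholder bandwidth numbers; link names
follow the ring example above): a link directory is one shared inter-connect
plus the set of things attached to it, not one entry per pair of endpoints.

#include <stdio.h>

struct link_dir {
	const char *name;
	unsigned bandwidth;		/* GB/s, sticker value (placeholder) */
	const char *members[4];		/* nodes sharing this inter-connect */
};

static void dump(const char *what, const struct link_dir *d, int n)
{
	printf("%s: %d link director%s\n", what, n, n == 1 ? "y" : "ies");
	for (int i = 0; i < n; i++) {
		printf("  %s (%u GB/s):", d[i].name, d[i].bandwidth);
		for (int j = 0; j < 4 && d[i].members[j]; j++)
			printf(" %s", d[i].members[j]);
		printf("\n");
	}
}

int main(void)
{
	/* fully connected mesh with identical properties: one directory */
	struct link_dir mesh[] = {
		{ "link0", 100, { "socket0", "socket1", "socket2", "socket3" } },
	};
	/* ring: four directories, one per physical inter-connect */
	struct link_dir ring[] = {
		{ "link0", 100, { "socket0", "socket1" } },
		{ "link1", 100, { "socket1", "socket2" } },
		{ "link3", 100, { "socket2", "socket3" } },
		{ "link4", 100, { "socket3", "socket0" } },
	};

	dump("mesh", mesh, 1);
	dump("ring", ring, 4);
	return 0;
}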

Cheers,
Jérôme


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-06 Thread Dave Hansen
On 12/6/18 12:11 PM, Logan Gunthorpe wrote:
>> My concern is that having folks do per-program parsing, *and* having a huge
>> amount of data to parse, makes it unusable.  The largest systems will
>> literally have hundreds of thousands of objects in /sysfs, even in a
>> single directory.  That makes readdir() basically impossible, and makes
>> even open() (if you already know the path you want somehow) hard to do fast.
> Is this actually realistic? I find it hard to imagine an actual hardware
> bus that can have even thousands of devices under a single node, let
> alone hundreds of thousands.

Jerome's proposal, as I understand it, would have generic "links".
They're not an instance of bus, but characterize a class of "link".  For
instance, a "link" might characterize the characteristics of the QPI bus
between two CPU sockets. The link directory would enumerate the list of
all *instances* of that link

So, a "link" directory for QPI would say Socket0<->Socket1,
Socket1<->Socket2, Socket1<->Socket2, Socket2<->PCIe-1.2.3.4 etc...  It
would have to enumerate the connections between every entity that shared
those link properties.

While there might not be millions of buses, there could be millions of
*paths* across all those buses, and that's what the HMAT describes, at
least: the net result of all those paths.


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-06 Thread Jerome Glisse
On Thu, Dec 06, 2018 at 03:27:06PM -0500, Jerome Glisse wrote:
> On Thu, Dec 06, 2018 at 11:31:21AM -0800, Dave Hansen wrote:
> > On 12/6/18 11:20 AM, Jerome Glisse wrote:
> > >>> For case 1 you can pre-parse stuff but this can be done by a helper library
> > >> How would that work?  Would each user/container/whatever do this once?
> > >> Where would they keep the pre-parsed stuff?  How do they manage their
> > >> cache if the topology changes?
> > > Short answer i don't expect a cache, i expect that each program will have
> > > a init function that query the topology and update the application codes
> > > accordingly.
> > 
> > My concern is that having folks do per-program parsing, *and* having a huge
> > amount of data to parse, makes it unusable.  The largest systems will
> > literally have hundreds of thousands of objects in /sysfs, even in a
> > single directory.  That makes readdir() basically impossible, and makes
> > even open() (if you already know the path you want somehow) hard to do fast.
> > 
> > I just don't think sysfs (or any filesystem, really) can scale to
> > express large, complicated topologies in a way that any normal program
> > can practically parse it.
> > 
> > My suspicion is that we're going to need to have the kernel parse and
> > cache these things.  We *might* have the data available in sysfs, but we
> > can't reasonably expect anyone to go parsing it.
> 
> What I am failing to explain is that the kernel cannot do the parsing,
> because the kernel does not know what the application cares about, and
> every single application will make different choices and thus select
> different devices and memory.
> 
> It is not even going to be a thing like "class A of applications will do X
> and class B will do Y". Every single application in class A might do
> something different, because some care about the little details.
> 
> So any kind of pre-parsing in the kernel is defeated by the fact that the
> kernel does not know what the application is looking for.
> 
> I do not see any way to express the application logic as some kind of
> automaton or regular expression. The application can literally introspect
> itself and the topology to partition its workload. The topology and device
> selection is expected to be thousands of lines of code in the most advanced
> applications.
> 
> Even worse, inside one and the same application there might be different
> device partitions and memory selections for different functions in the
> application.
> 
> 
> I am not scared about the amount of data to parse, really; even on a big
> node it is going to be a few dozen links and bridges, and a few dozen
> devices. So we are talking about a hundred directories to parse and read.
> 
> 
> Maybe an example will help. Let's say we have an application with the
> following pipeline:
> 
> inA -> functionA -> outA -> functionB -> outB -> functionC -> result
> 
> - inA: 8 gigabytes
> - outA: 8 gigabytes
> - outB: one dword
> - result: something small
> - functionA does heavy computation on inA (several thousands of
>   instructions for each dword in inA).
> - functionB does heavy computation for each dword in outA (again
>   thousands of instructions for each dword) and it is looking for a
>   specific result that it knows will be unique among all the dword
>   computations, i.e. it outputs only one dword in outB
> - functionC is something well suited for the CPU that takes outB and
>   turns it into the final result
> 
> Now let's see a few different systems and their topologies:
> [T1] 1 GPU with 16GB of memory and a handful of CPU cores
> [T2] 1 GPU with 8GB of memory and a handful of CPU cores
> [T3] 2 GPUs with 8GB of memory and a handful of CPU cores
> [T4] 2 GPUs with 8GB of memory and a handful of CPU cores;
>  the 2 GPUs have a very fast link between each other
>  (400GBytes/s)
> 
> Now let's see how the program will partition itself for each topology:
> [T1] Application partitions its computation in 3 phases:
> P1: - migrate inA to GPU memory
> P2: - execute functionA on inA producing outA
> P3: - execute functionB on outA producing outB
> - run functionC and see if functionB has found the
>   thing and written it to outB; if so then kill all
>   GPU threads and return the result, we are done
> 
> [T2] Application partitions its computation in 5 phases:
> P1: - migrate the first 4GB of inA to GPU memory
> P2: - execute functionA for the 4GB and write the 4GB
>   outA result to the GPU memory
> P3: - execute functionB for the first 4GB of outA
> - while functionB is running, DMA in the background
>   the second 4GB of inA to the GPU memory
> - once one of the millions of threads running functionB
>   finds the result it is looking for, it writes it to
> 


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-06 Thread Jerome Glisse
On Thu, Dec 06, 2018 at 11:31:21AM -0800, Dave Hansen wrote:
> On 12/6/18 11:20 AM, Jerome Glisse wrote:
> >>> For case 1 you can pre-parse stuff but this can be done by helper library
> >> How would that work?  Would each user/container/whatever do this once?
> >> Where would they keep the pre-parsed stuff?  How do they manage their
> >> cache if the topology changes?
> > Short answer i don't expect a cache, i expect that each program will have
> > a init function that query the topology and update the application codes
> > accordingly.
> 
> My concern with having folks do per-program parsing, *and* having a huge
> amount of data to parse makes it unusable.  The largest systems will
> literally have hundreds of thousands of objects in /sysfs, even in a
> single directory.  That makes readdir() basically impossible, and makes
> even open() (if you already know the path you want somehow) hard to do fast.
> 
> I just don't think sysfs (or any filesystem, really) can scale to
> express large, complicated topologies in a way that any normal program
> can practically parse it.
> 
> My suspicion is that we're going to need to have the kernel parse and
> cache these things.  We *might* have the data available in sysfs, but we
> can't reasonably expect anyone to go parsing it.

What I am failing to explain is that the kernel cannot do this parsing
because the kernel does not know what the application cares about, and
every single application will make different choices and thus select
different devices and memory.

It is not even going to be a case of "applications of class A do X and
applications of class B do Y". Every single application in class A might
do something different because some care about the little details.

So any kind of pre-parsing in the kernel is defeated by the fact that the
kernel does not know what the application is looking for.

I do not see any way to express the application logic as some kind of
automaton or regular expression. The application can literally introspect
itself and the topology to partition its workload. The topology and
device selection is expected to be thousands of lines of code in the most
advanced applications.

Even worse, inside one and the same application there might be different
device partitioning and memory selection for different functions of the
application.


I am not scared about the amount of data to parse, really; even on a big
node it is going to be a few dozen links and bridges, and a few dozen
devices. So we are talking about a hundred directories to parse and read.
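
To make the scale concrete, here is a minimal sketch (not from the
thread; the /sys/bus/hms/devices path and the -link/-bridge/-target/
-initiator name suffixes are assumptions based on the layout discussed
in this RFC) of the kind of one-time walk such an init function would do:

#include <stdio.h>
#include <string.h>
#include <dirent.h>

int main(void)
{
    /* hypothetical flat HMS directory proposed by this patch series */
    const char *dir = "/sys/bus/hms/devices";
    unsigned links = 0, bridges = 0, targets = 0, initiators = 0;
    struct dirent *de;
    DIR *d = opendir(dir);

    if (!d)
        return 1;
    while ((de = readdir(d)) != NULL) {
        /* entries look like v0-0-link, v0-1-target, v0-2-initiator */
        if (strstr(de->d_name, "-link"))
            links++;
        else if (strstr(de->d_name, "-bridge"))
            bridges++;
        else if (strstr(de->d_name, "-target"))
            targets++;
        else if (strstr(de->d_name, "-initiator"))
            initiators++;
    }
    closedir(d);
    printf("%u links, %u bridges, %u targets, %u initiators\n",
           links, bridges, targets, initiators);
    return 0;
}

Even on a large node this loop only touches the few hundred directory
entries mentioned above, so doing it once at startup is cheap.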


Maybe an example will help. Let's say we have an application with the
following pipeline:

inA -> functionA -> outA -> functionB -> outB -> functionC -> result

- inA is 8 gigabytes
- outA is 8 gigabytes
- outB is one dword
- result is something small
- functionA does heavy computation on inA (several thousand
  instructions for each dword in inA)
- functionB does heavy computation for each dword in outA (again
  thousands of instructions for each dword) and it is looking for a
  specific result that it knows will be unique among all the dword
  computations, i.e. it outputs only one dword in outB
- functionC is something well suited for the CPU that takes outB and
  turns it into the final result

Now let's look at a few different systems and their topologies:
[T1] 1 GPU with 16GB of memory and a handful of CPU cores
[T2] 1 GPU with 8GB of memory and a handful of CPU cores
[T3] 2 GPUs with 8GB of memory and a handful of CPU cores
[T4] 2 GPUs with 8GB of memory and a handful of CPU cores;
     the 2 GPUs have a very fast link between each other
     (400GBytes/s)

Now let's see how the program will partition itself for each topology:
[T1] The application partitions its computation in 3 phases:
P1: - migrate inA to GPU memory
P2: - execute functionA on inA, producing outA
P3: - execute functionB on outA, producing outB
    - run functionC and see if functionB has found the
      thing and written it to outB; if so then kill all
      GPU threads and return the result, we are done

[T2] The application partitions its computation in 5 phases:
P1: - migrate the first 4GB of inA to GPU memory
P2: - execute functionA for those 4GB and write the 4GB
      outA result to the GPU memory
P3: - execute functionB for the first 4GB of outA
    - while functionB is running, DMA in the background
      the second 4GB of inA to the GPU memory
    - once one of the millions of threads running functionB
      finds the result it is looking for, it writes it to
      outB, which is in main memory
    - run functionC and see if functionB has found the
      thing and written it to outB; if so then kill all
      GPU threads and DMA and return the result, we are
      done
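
A minimal sketch (all names hypothetical, not from the thread) of the
init-time decision described above: pick the whole-buffer plan when the
GPU has room for inA plus outA, otherwise fall back to the chunked plan
that overlaps compute with the DMA of the next chunk:

#include <stddef.h>

enum plan { PLAN_WHOLE_BUFFER /* [T1]-style, 3 phases */,
            PLAN_CHUNKED      /* [T2]-style, 5 phases */ };

struct pipeline {
    size_t inA_size;    /* 8GB in the example above */
    size_t outA_size;   /* 8GB in the example above */
};

static enum plan pick_plan(const struct pipeline *p, size_t gpu_mem_bytes)
{
    /* Whole inA and outA fit in GPU memory: migrate everything once. */
    if (p->inA_size + p->outA_size <= gpu_mem_bytes)
        return PLAN_WHOLE_BUFFER;
    /* Otherwise stream inA in chunks and overlap compute with DMA. */
    return PLAN_CHUNKED;
}

The gpu_mem_bytes value itself would come from the topology query done
once at startup.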
   


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-06 Thread Logan Gunthorpe



On 2018-12-06 12:31 p.m., Dave Hansen wrote:
> On 12/6/18 11:20 AM, Jerome Glisse wrote:
 For case 1 you can pre-parse stuff but this can be done by helper library
>>> How would that work?  Would each user/container/whatever do this once?
>>> Where would they keep the pre-parsed stuff?  How do they manage their
>>> cache if the topology changes?
>> Short answer i don't expect a cache, i expect that each program will have
>> a init function that query the topology and update the application codes
>> accordingly.
> 
> My concern with having folks do per-program parsing, *and* having a huge
> amount of data to parse makes it unusable.  The largest systems will
> literally have hundreds of thousands of objects in /sysfs, even in a
> single directory.  That makes readdir() basically impossible, and makes
> even open() (if you already know the path you want somehow) hard to do fast.

Is this actually realistic? I find it hard to imagine an actual hardware
bus that can have even thousands of devices under a single node, let
alone hundreds of thousands. At some point the laws of physics apply.
For example, in present hardware, the most ports a single PCI switch can
have is under one hundred. I'd imagine any such large system would have
a hierarchy of devices (i.e. layers of switch-like devices), which
implies the existing sysfs bus/devices tree should have a path through
it without navigating a directory with that unreasonable a number of
objects in it. HMS, on the other hand, has all possible initiators,
targets, etc. under a single directory.

The caveat to this is that, to find an initial starting point in the bus
hierarchy, you might have to go through /sys/dev/{block|char} or
/sys/class, which may have directories with a large number of objects.
Though, such a system would necessarily have a similarly large number of
objects in /dev, which means you will probably never get around the
readdir/open bottleneck you mention... and, thus, this doesn't seem
overly realistic to me.

Logan
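
As a concrete illustration of the starting-point lookup described above
(a minimal sketch, not from the thread; it relies only on the standard
/sys/dev/char symlinks), a program can jump straight from a device node
to its place in the sysfs device hierarchy without scanning any large
directory:

#include <stdio.h>
#include <unistd.h>
#include <limits.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(int argc, char **argv)
{
    struct stat st;
    char link[64], path[PATH_MAX];
    ssize_t n;

    if (argc < 2 || stat(argv[1], &st) != 0 || !S_ISCHR(st.st_mode))
        return 1;

    /* /sys/dev/char/MAJOR:MINOR is a symlink into /sys/devices/... */
    snprintf(link, sizeof(link), "/sys/dev/char/%u:%u",
             major(st.st_rdev), minor(st.st_rdev));
    n = readlink(link, path, sizeof(path) - 1);
    if (n < 0)
        return 1;
    path[n] = '\0';

    /* e.g. ../../devices/pci0000:00/0000:00:02.0/drm/card0 */
    printf("%s -> %s\n", argv[1], path);
    return 0;
}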


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-06 Thread Dave Hansen
On 12/6/18 11:20 AM, Jerome Glisse wrote:
>>> For case 1 you can pre-parse stuff but this can be done by helper library
>> How would that work?  Would each user/container/whatever do this once?
>> Where would they keep the pre-parsed stuff?  How do they manage their
>> cache if the topology changes?
> Short answer i don't expect a cache, i expect that each program will have
> a init function that query the topology and update the application codes
> accordingly.

My concern is that having folks do per-program parsing, *and* having a
huge amount of data to parse, makes it unusable.  The largest systems
will literally have hundreds of thousands of objects in /sysfs, even in
a single directory.  That makes readdir() basically impossible, and
makes even open() (if you already know the path you want somehow) hard
to do fast.

I just don't think sysfs (or any filesystem, really) can scale to
express large, complicated topologies in a way that any normal program
can practically parse.

My suspicion is that we're going to need to have the kernel parse and
cache these things.  We *might* have the data available in sysfs, but we
can't reasonably expect anyone to go parsing it.


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-06 Thread Jerome Glisse
On Thu, Dec 06, 2018 at 10:25:08AM -0800, Dave Hansen wrote:
> On 12/5/18 9:53 AM, Jerome Glisse wrote:
> > No so there is 2 kinds of applications:
> > 1) average one: i am using device {1, 3, 9} give me best memory for
> >those devices
> ...
> > 
> > For case 1 you can pre-parse stuff but this can be done by helper library
> 
> How would that work?  Would each user/container/whatever do this once?
> Where would they keep the pre-parsed stuff?  How do they manage their
> cache if the topology changes?

Short answer: I don't expect a cache; I expect that each program will
have an init function that queries the topology and updates the
application code accordingly. This is what people do today: query all
available devices, decide which ones to use and how, create a context
for each selected one, and define a memory migration job/memory policy
for each part of the program so that memory is migrated/has the proper
policy in place when the code that runs on some device is executed.


Long answer:

I cannot dictate how user folks write their programs, sadly :) I expect
that many applications will do it once during startup. Then you will
have all those container folks or VM folks that will get pressure to
react to hot-plug. For instance, if you upgrade your instance with your
cloud provider to have more GPUs or more TPUs ... it is likely to appear
as a hotplug from the VM/container point of view and thus as a hotplug
from the application point of view. So far the demonstrations I have
seen do that by relaunching the application ... More on that through the
live re-patching issues below.

Oh, and I expect applications will crash if you hot-unplug anything they
are using (this is what happens now, I believe, in most APIs). Again, I
expect that some pressure from cloud users and providers will force
programmers to be a bit more reactive to this kind of event.


Live re-patching application code can be difficult, I am told. Let's say
you have:

void compute_serious0_stuff(accelerator_t *accelerator, void *inputA,
size_t sinputA, void *inputB, size_t sinputB,
void *outputA, size_t soutputA)
{
...

// Migrate the inputA to the accelerator memory
api_migrate_memory_to_accelerator(accelerator, inputA, sinputA);

// The inputB buffer is fine in its default placement

// The output is assumed to be an empty vma, i.e. no pages allocated
// yet, so set a policy to direct all allocations due to page faults
// to use the accelerator memory
api_set_memory_policy_to_accelerator(accelerator, outputA, soutputA);

...
for_parallel (i = 0; i < THEYAREAMILLIONSITEMS; ++i) {
// Do something serious
}
...
}

void serious0_orchestrator(topology topology, void *inputA,
   void *inputB, void *outputA)
{
static accelerator_t **selected = NULL;
static serious0_job_partition *partition;
...
if (selected == NULL) {
serious0_select_and_partition(topology, &selected, &partition,
  inputA, inputB, outputA);
}
...
for (i = 0; i < nselected; ++i) {
...
compute_serious0_stuff(selected[i],
   inputA + partition[i].inputA_offset,
   partition[i].inputA_size,
   inputB + partition[i].inputB_offset,
   partition[i].inputB_size,
   outputA + partition[i].outputA_offset,
   partition[i].outputA_size);
...
}
...
for (i = 0; i < nselected; ++i) {
accelerator_wait_finish(selected[i]);
}
...
// outputA is ready to be used by the next function in the program
}

If you start without a GPU/TPU, your for_parallel will run on the CPU
with the code the compiler emitted at build time. For GPU/TPU, at build
time you compile your for_parallel loop to some intermediate
representation (a virtual ISA); then at runtime, during the application
initialization, that intermediate representation gets lowered down to
all the available GPU/TPU on your system and each for_parallel loop
is patched to turn into a call to:

void dispatch_accelerator_function(accelerator_t *accelerator,
   void *function, ...)
{
}

So in the above example the for_parallel loop becomes:
dispatch_accelerator_function(accelerator, i_compute_serious_stuff,
  inputA, inputB, outputA);

This hot patching of code is easy to do when no CPU thread is running
the code. However, when CPU threads are running it can be problematic.
I am sure you can do trickery like delaying the patching to the next
time the function gets called, by doing clever things at build time like
prepending each for_parallel section with enough nops to allow you to
replace it with a call to the dispatch function and a jump over the
normal CPU code.
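
A simpler alternative that avoids patching instructions at all (a
minimal sketch, not from the thread; serious0_* and accelerator_t are
hypothetical names reusing the example above) is to call the parallel
section through a function pointer that the init code re-targets once
an accelerator has been selected:

#include <stdio.h>

typedef struct accelerator accelerator_t;   /* opaque, hypothetical */

/* The code the compiler emitted for the for_parallel loop on the CPU. */
static void serious0_cpu_body(void *inputA, void *inputB, void *outputA)
{
    (void)inputA; (void)inputB; (void)outputA;
    puts("running for_parallel on the CPU");
}

/* Stand-in for the path that hands the lowered kernel to the
 * accelerator, e.g. through dispatch_accelerator_function(). */
static void serious0_gpu_body(void *inputA, void *inputB, void *outputA)
{
    (void)inputA; (void)inputB; (void)outputA;
    puts("dispatching for_parallel to the accelerator");
}

/* One indirection instead of rewriting instructions in place. */
static void (*serious0_body)(void *, void *, void *) = serious0_cpu_body;

void serious0_init(accelerator_t *accelerator)
{
    if (accelerator)    /* the topology query found a usable GPU/TPU */
        serious0_body = serious0_gpu_body;
}

The trade-off is an extra indirect call per invocation, which is usually
noise next to a loop over millions of items.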


I think compiler people want to solve the static case 

Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-06 Thread Dave Hansen
On 12/5/18 9:53 AM, Jerome Glisse wrote:
> No so there is 2 kinds of applications:
> 1) average one: i am using device {1, 3, 9} give me best memory for
>those devices
...
> 
> For case 1 you can pre-parse stuff but this can be done by helper library

How would that work?  Would each user/container/whatever do this once?
Where would they keep the pre-parsed stuff?  How do they manage their
cache if the topology changes?


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-05 Thread Jerome Glisse
On Wed, Dec 05, 2018 at 09:27:09AM -0800, Dave Hansen wrote:
> On 12/4/18 6:13 PM, Jerome Glisse wrote:
> > On Tue, Dec 04, 2018 at 05:06:49PM -0800, Dave Hansen wrote:
> >> OK, but there are 1024*1024 matrix cells on a systems with 1024
> >> proximity domains (ACPI term for NUMA node).  So it sounds like you are
> >> proposing a million-directory approach.
> > 
> > No, pseudo code:
> > struct list links;
> > 
> > for (unsigned r = 0; r < nrows; r++) {
> > for (unsigned c = 0; c < ncolumns; c++) {
> > if (!link_find(links, hmat[r][c].bandwidth,
> >hmat[r][c].latency)) {
> > link = link_new(hmat[r][c].bandwidth,
> > hmat[r][c].latency);
> > // add initiator and target correspond to that row
> > // and columns to this new link
> > list_add(&link, &links);
> > }
> > }
> > }
> > 
> > So all cells that have same property are under the same link. 
> 
> OK, so the "link" here is like a cable.  It's like saying, "we have a
> network and everything is connected with an ethernet cable that can do
> 1gbit/sec".
> 
> But, what actually connects an initiator to a target?  I assume we still
> need to know which link is used for each target/initiator pair.  Where
> is that enumerated?

ls /sys/bus/hms/devices/v0-0-link/
node0   power   subsystem   uevent
uid bandwidth   latency v0-1-target
v0-15-initiator v0-21-targetv0-4-initiator  v0-7-initiator
v0-10-initiator v0-13-initiator v0-16-initiator v0-2-initiator
v0-11-initiator v0-14-initiator v0-17-initiator v0-3-initiator
v0-5-initiator  v0-8-initiator  v0-6-initiator  v0-9-initiator
v0-12-initiator v0-10-initiator

So above are 16 CPUs (initiators) and 2 targets, all connected
through a common link. This means that all the initiators
connected to this link can access all the targets connected to
this link. The bandwidth and latency are the best-case scenario,
for instance when only one initiator is accessing the target.

Initiators can only access targets they share a link with, or
reach through an extended path across a bridge. So if you have an
initiator connected to link0 and a target connected to link1, and
there is a bridge from link0 to link1, then the initiator can
access the target memory in link1, but the bandwidth and latency
will be:
min(link0.bandwidth, link1.bandwidth, bridge.bandwidth)
min(link0.latency, link1.latency, bridge.latency)
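
A minimal sketch (types and names hypothetical) of that composition rule
for an extended path, taking the minimum of the per-hop values of every
link and bridge crossed, for both properties:

struct hop {                    /* a link or a bridge on the path */
    unsigned long bandwidth;    /* best case, e.g. MBytes/s */
    unsigned long latency;      /* best case, e.g. ns */
};

/* Best-case properties of a path made of nhops links/bridges
 * (assumes nhops >= 1). */
static struct hop path_properties(const struct hop *hops, unsigned nhops)
{
    struct hop best = hops[0];

    for (unsigned i = 1; i < nhops; i++) {
        if (hops[i].bandwidth < best.bandwidth)
            best.bandwidth = hops[i].bandwidth;
        if (hops[i].latency < best.latency)
            best.latency = hops[i].latency;
    }
    return best;
}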

You can pretty much match a link one-to-one with a bus in your
system. For instance with PCIe, if you only have 16-lane PCIe
devices, you only define one link directory for all your PCIe
devices (ignoring the PCIe peer-to-peer scenario here). You add a
bridge between your PCIe link and your NUMA node link (the node to
which this PCIe root complex belongs); this means that a PCIe
device can access the local node memory with the given bandwidth
and latency (best case).


> 
> I think this just means we need a million symlinks to a "link" instead
> of a million link directories.  Still not great.
> 
> > Note that userspace can parse all this once during its initialization
> > and create pools of target to use.
> 
> It sounds like you're agreeing that there is too much data in this
> interface for applications to _regularly_ parse it.  We need some
> central thing that parses it all and caches the results.

No, there are 2 kinds of applications:
1) the average one: "I am using devices {1, 3, 9}, give me the best
   memory for those devices"
2) the advanced one: "what is the topology of this system?" It parses
   the topology and partitions its workload accordingly.

For case 1 you can pre-parse stuff, but this can be done by a helper
library. For case 2 there is no amount of pre-parsing you can do in the
kernel; only the application knows its own architecture and thus only
the application knows what matters in the topology. Is the application
looking for a big chunk of memory even if it is slow? Is it also looking
for fast memory close to X and Y? ...

Each application will care about different things and there is no
telling what it is going to be.

So what I am saying is that this information is likely to be parsed once
by the application during startup, i.e. the sysfs is not something that
is continuously read and parsed by the application (unless the
application also cares about hotplug, and then we are talking about the
1% of the 1%).

Cheers,
Jérôme


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-05 Thread Dave Hansen
On 12/4/18 6:13 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 05:06:49PM -0800, Dave Hansen wrote:
>> OK, but there are 1024*1024 matrix cells on a systems with 1024
>> proximity domains (ACPI term for NUMA node).  So it sounds like you are
>> proposing a million-directory approach.
> 
> No, pseudo code:
> struct list links;
> 
> for (unsigned r = 0; r < nrows; r++) {
> for (unsigned c = 0; c < ncolumns; c++) {
> if (!link_find(links, hmat[r][c].bandwidth,
>hmat[r][c].latency)) {
> link = link_new(hmat[r][c].bandwidth,
> hmat[r][c].latency);
> // add initiator and target correspond to that row
> // and columns to this new link
> list_add(&link, &links);
> }
> }
> }
> 
> So all cells that have same property are under the same link. 

OK, so the "link" here is like a cable.  It's like saying, "we have a
network and everything is connected with an ethernet cable that can do
1gbit/sec".

But, what actually connects an initiator to a target?  I assume we still
need to know which link is used for each target/initiator pair.  Where
is that enumerated?

I think this just means we need a million symlinks to a "link" instead
of a million link directories.  Still not great.

> Note that userspace can parse all this once during its initialization
> and create pools of target to use.

It sounds like you're agreeing that there is too much data in this
interface for applications to _regularly_ parse it.  We need some
central thing that parses it all and caches the results.


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-05 Thread Jerome Glisse
On Wed, Dec 05, 2018 at 04:57:17PM +0530, Aneesh Kumar K.V wrote:
> On 12/5/18 12:19 AM, Jerome Glisse wrote:
> 
> > Above example is for migrate. Here is an example for how the
> > topology is use today:
> > 
> >  Application knows that the platform is running on have 16
> >  GPU split into 2 group of 8 GPUs each. GPU in each group can
> >  access each other memory with dedicated mesh links between
> >  each others. Full speed no traffic bottleneck.
> > 
> >  Application splits its GPU computation in 2 so that each
> >  partition runs on a group of interconnected GPU allowing
> >  them to share the dataset.
> > 
> > With HMS:
> >  Application can query the kernel to discover the topology of
> >  system it is running on and use it to partition and balance
> >  its workload accordingly. Same application should now be able
> >  to run on new platform without having to adapt it to it.
> > 
> 
> Will the kernel be ever involved in decision making here? Like the scheduler
> will we ever want to control how there computation units get scheduled onto
> GPU groups or GPU?

I don't think you will ever see fine-grained control in software because
it would go against what GPUs fundamentally are. GPUs have thousands of
cores and usually 10 times more threads in flight than cores (it depends
on the number of registers used by the program or the size of its thread
local storage). By having many more threads in flight, the GPU always
has some threads that are not waiting for memory access and thus always
has something to schedule next on the cores. This scheduling is all
done in real time and I do not see it as a good fit for any kernel
CPU code.

That being said, higher-level and more coarse directives can be given
to the GPU hardware scheduler, like giving priorities to groups of
threads so that they always get scheduled first if ready. There is
a cgroup proposal that goes in the direction of exposing high-level
control over GPU resources like that. I think that is a better
venue to discuss such topics.

> 
> > This is kind of naive i expect topology to be hard to use but maybe
> > it is just me being pesimistics. In any case today we have a chicken
> > and egg problem. We do not have a standard way to expose topology so
> > program that can leverage topology are only done for HPC where the
> > platform is standard for few years. If we had a standard way to expose
> > the topology then maybe we would see more program using it. At very
> > least we could convert existing user.
> > 
> > 
> 
> I am wondering whether we should consider HMAT as a subset of the ideas
> mentioned in this thread and see whether we can first achieve HMAT
> representation with your patch series?

I do not want to block HMAT on that. What I am trying to do really
does not fit in the existing NUMA node; this is what I have been trying
to show, even if not everyone is convinced by that. Some bullet points
on why:
- the memory I care about is not accessible by everyone (a baked-in
  assumption of NUMA nodes)
- the memory I care about might not be cache coherent (again a baked-in
  assumption of NUMA nodes)
- topology matters, so that userspace knows which inter-connects are
  shared and which have dedicated links to memory
- there can be multiple paths between one device and one target
  memory, and each path has different properties (bandwidth, latency,
  ...); again this does not fit with the NUMA distance thing
- the memory is not managed by the core kernel, for reasons I have
  explained
- ...

The HMAT proposal does not deal with such memory; it is much closer
to what the current model can describe.

Cheers,
Jérôme


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-05 Thread Aneesh Kumar K.V

On 12/5/18 12:19 AM, Jerome Glisse wrote:


Above example is for migrate. Here is an example for how the
topology is use today:

 Application knows that the platform is running on have 16
 GPU split into 2 group of 8 GPUs each. GPU in each group can
 access each other memory with dedicated mesh links between
 each others. Full speed no traffic bottleneck.

 Application splits its GPU computation in 2 so that each
 partition runs on a group of interconnected GPU allowing
 them to share the dataset.

With HMS:
 Application can query the kernel to discover the topology of
 system it is running on and use it to partition and balance
 its workload accordingly. Same application should now be able
 to run on new platform without having to adapt it to it.



Will the kernel ever be involved in decision making here? Like the
scheduler, will we ever want to control how these computation units get
scheduled onto GPU groups or GPUs?



This is kind of naive i expect topology to be hard to use but maybe
it is just me being pesimistics. In any case today we have a chicken
and egg problem. We do not have a standard way to expose topology so
program that can leverage topology are only done for HPC where the
platform is standard for few years. If we had a standard way to expose
the topology then maybe we would see more program using it. At very
least we could convert existing user.




I am wondering whether we should consider HMAT as a subset of the ideas
mentioned in this thread and see whether we can first achieve HMAT 
representation with your patch series?


-aneesh



Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-04 Thread Jerome Glisse
On Tue, Dec 04, 2018 at 05:06:49PM -0800, Dave Hansen wrote:
> On 12/4/18 4:15 PM, Jerome Glisse wrote:
> > On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
> >> Basically, is sysfs the right place to even expose this much data?
> > 
> > I definitly want to avoid the memoryX mistake. So i do not want to
> > see one link directory per device. Taking my simple laptop as an
> > example with 4 CPUs, a wifi and 2 GPU (the integrated one and a
> > discret one):
> > 
> > link0: cpu0 cpu1 cpu2 cpu3
> > link1: wifi (2 pcie lane)
> > link2: gpu0 (unknown number of lane but i believe it has higher
> >  bandwidth to main memory)
> > link3: gpu1 (16 pcie lane)
> > link4: gpu1 and gpu memory
> > 
> > So one link directory per number of pcie lane your device have
> > so that you can differentiate on bandwidth. The main memory is
> > symlinked inside all the link directory except link4. The GPU
> > discret memory is only in link4 directory as it is only
> > accessible by the GPU (we could add it under link3 too with the
> > non cache coherent property attach to it).
> 
> I'm actually really interested in how this proposal scales.  It's quite
> easy to represent a laptop, but can this scale to the largest systems
> that we expect to encounter over the next 20 years that this ABI will live?
> 
> > The issue then becomes how to convert down the HMAT over verbose
> > information to populate some reasonable layout for HMS. For that
> > i would say that create a link directory for each different
> > matrix cell. As an example let say that each entry in the matrix
> > has bandwidth and latency then we create a link directory for
> > each combination of bandwidth and latency. On simple system that
> > should boils down to a handfull of combination roughly speaking
> > mirroring the example above of one link directory per number of
> > PCIE lane for instance.
> 
> OK, but there are 1024*1024 matrix cells on a systems with 1024
> proximity domains (ACPI term for NUMA node).  So it sounds like you are
> proposing a million-directory approach.

No, pseudo code:

LIST_HEAD(links);

for (unsigned r = 0; r < nrows; r++) {
    for (unsigned c = 0; c < ncolumns; c++) {
        if (!link_find(&links, hmat[r][c].bandwidth,
                       hmat[r][c].latency)) {
            struct link *link = link_new(hmat[r][c].bandwidth,
                                         hmat[r][c].latency);
            /* add the initiator and target corresponding to this
             * row and column to the new link */
            list_add(&link->list, &links);
        }
    }
}

So all cells that have the same properties end up under the same link. Do
you expect all the cells to always have different properties? On today's
platforms that should not be the case. I expect we will keep seeing many
initiator/target pairs that share the same properties as other pairs.

But yes, if you have a system where no two initiator/target pairs share
the same properties, then you are in the worst case you are describing.
Then again, that is the hardware you have :)

Note that userspace can parse all of this once during its initialization
and create pools of targets to use.
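
To make that last point concrete, here is a minimal userspace sketch of
such a one-time parse, assuming the /sys/bus/hms/devices layout described
in this series and a per-link "bandwidth" attribute file (the exact
attribute name here is illustrative, not lifted from the patches):

#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Walk /sys/bus/hms/devices once at startup and print every link
 * directory together with its advertised bandwidth, so that the
 * application can group targets into pools by link properties. */
int main(void)
{
    DIR *d = opendir("/sys/bus/hms/devices");
    struct dirent *e;

    if (!d)
        return 1;

    while ((e = readdir(d)) != NULL) {
        char path[512];
        unsigned long bw;
        FILE *f;

        /* link directories are named v%version-%id-link */
        if (!strstr(e->d_name, "-link"))
            continue;

        snprintf(path, sizeof(path),
                 "/sys/bus/hms/devices/%s/bandwidth", e->d_name);
        f = fopen(path, "r");
        if (f && fscanf(f, "%lu", &bw) == 1)
            printf("%s: bandwidth %lu\n", e->d_name, bw);
        if (f)
            fclose(f);
    }
    closedir(d);
    return 0;
}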


> We also can't simply say that two CPUs with the same connection to two
> other CPUs (think a 4-socket QPI-connected system) share the same "link"
> because they share the same combination of bandwidth and latency.  We
> need to know that *each* has its own, unique link and do not share link
> resources.

That is the purpose of the bridge object: to inter-connect links.
To be more exact, a link is like saying you have two arrows with the
same properties between every pair of nodes listed in the link, while
a bridge allows you to define an arrow in just one direction. Maybe I
should define arrow and node instead of trying to match some of the
ACPI terminology. That might be easier for people to follow than
first having to understand the terminology.
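
Purely to illustrate that arrow/node wording, here is how it could be
modelled as plain C data structures; the type and field names are mine,
not taken from the patches:

#include <stdint.h>

/* A node is any initiator (CPU, GPU, wifi, ...) or memory target. */
struct hms_node {
    unsigned id;
    const char *name;
};

/* One directed arrow with its properties. A "link" in this proposal
 * stands for arrows in both directions, with identical properties,
 * between every node it lists; a "bridge" contributes a single
 * directed arrow, which is how asymmetric paths get described. */
struct hms_arrow {
    struct hms_node *from;
    struct hms_node *to;
    uint64_t bandwidth;
    uint64_t latency;
};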

The fear I have with culling HMAT data is that HMAT does not carry the
information needed to do such culling safely.

> > I don't think i have a system with an HMAT table if you have one
> > HMAT table to provide i could show up the end result.
> 
> It is new enough (ACPI 6.2) that no publicly-available hardware that
> exists that implements one (that I know of).  Keith Busch can probably
> extract one and send it to you or show you how we're faking them with QEMU.
> 
> > Note i believe the ACPI HMAT matrix is a bad design for that
> > reasons ie there is lot of commonality in many of the matrix
> > entry and many entry also do not make sense (ie initiator not
> > being able to access all the targets). I feel that link/bridge
> > is much more compact and allow to represent any directed graph
> > with multiple arrows from one node to another same node.
> 
> I don't disagree.  But, folks are building systems with them and we need
> to either deal with it, or make its data manageable.  You saw our
> approach: we cull the data and only expose the bare minimum in sysfs.

Yeah, and I intend to cull data too.

Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-04 Thread Kuehling, Felix

On 2018-12-04 4:57 p.m., Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 01:37:56PM -0800, Dave Hansen wrote:
>> Yeah, our NUMA mechanisms are for managing memory that the kernel itself
>> manages in the "normal" allocator and supports a full feature set on.
>> That has a bunch of implications, like that the memory is cache coherent
>> and accessible from everywhere.
>>
>> The HMAT patches only comprehend this "normal" memory, which is why
>> we're extending the existing /sys/devices/system/node infrastructure.
>>
>> This series has a much more aggressive goal, which is comprehending the
>> connections of every memory-target to every memory-initiator, no matter
>> who is managing the memory, who can access it, or what it can be used for.
>>
>> Theoretically, HMS could be used for everything that we're doing with
>> /sys/devices/system/node, as long as it's tied back into the existing
>> NUMA infrastructure _somehow_.
>>
>> Right?
> Fully correct mind if i steal that perfect summary description next time
> i post ? I am so bad at explaining thing :)
>
> Intention is to allow program to do everything they do with mbind() today
> and tomorrow with the HMAT patchset and on top of that to also be able to
> do what they do today through API like OpenCL, ROCm, CUDA ... So it is one
> kernel API to rule them all ;)

As for ROCm, I'm looking forward to using hbind in our own APIs. It will
save us some time and trouble not having to implement all the low-level
policy and tracking of virtual address ranges in our device driver.
Going forward, having a common API to manage the topology and memory
affinity would also enable sane ways of having accelerators and memory
devices from different vendors interact under control of a
topology-aware application.

Disclaimer: I haven't had a chance to review the patches in detail yet.
Got caught up in the documentation and discussion ...

Regards,
  Felix


>
> Also at first i intend to special case vma alloc page when they are HMS
> policy, long term i would like to merge code path inside the kernel. But
> i do not want to disrupt existing code path today, i rather grow to that
> organicaly. Step by step. The mbind() would still work un-affected in
> the end just the plumbing would be slightly different.
>
> Cheers,
> Jérôme


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-04 Thread Dave Hansen
On 12/4/18 4:15 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
>> Basically, is sysfs the right place to even expose this much data?
> 
> I definitly want to avoid the memoryX mistake. So i do not want to
> see one link directory per device. Taking my simple laptop as an
> example with 4 CPUs, a wifi and 2 GPU (the integrated one and a
> discret one):
> 
> link0: cpu0 cpu1 cpu2 cpu3
> link1: wifi (2 pcie lane)
> link2: gpu0 (unknown number of lane but i believe it has higher
>  bandwidth to main memory)
> link3: gpu1 (16 pcie lane)
> link4: gpu1 and gpu memory
> 
> So one link directory per number of pcie lane your device have
> so that you can differentiate on bandwidth. The main memory is
> symlinked inside all the link directory except link4. The GPU
> discret memory is only in link4 directory as it is only
> accessible by the GPU (we could add it under link3 too with the
> non cache coherent property attach to it).

I'm actually really interested in how this proposal scales.  It's quite
easy to represent a laptop, but can this scale to the largest systems
that we expect to encounter over the next 20 years that this ABI will live?

> The issue then becomes how to convert down the HMAT over verbose
> information to populate some reasonable layout for HMS. For that
> i would say that create a link directory for each different
> matrix cell. As an example let say that each entry in the matrix
> has bandwidth and latency then we create a link directory for
> each combination of bandwidth and latency. On simple system that
> should boils down to a handfull of combination roughly speaking
> mirroring the example above of one link directory per number of
> PCIE lane for instance.

OK, but there are 1024*1024 matrix cells on a system with 1024
proximity domains (ACPI term for NUMA node).  So it sounds like you are
proposing a million-directory approach.

We also can't simply say that two CPUs with the same connection to two
other CPUs (think a 4-socket QPI-connected system) share the same "link"
because they share the same combination of bandwidth and latency.  We
need to know that *each* has its own, unique link and does not share
link resources.

> I don't think i have a system with an HMAT table if you have one
> HMAT table to provide i could show up the end result.

It is new enough (ACPI 6.2) that no publicly-available hardware exists
that implements one (that I know of).  Keith Busch can probably extract
one and send it to you or show you how we're faking them with QEMU.

> Note i believe the ACPI HMAT matrix is a bad design for that
> reasons ie there is lot of commonality in many of the matrix
> entry and many entry also do not make sense (ie initiator not
> being able to access all the targets). I feel that link/bridge
> is much more compact and allow to represent any directed graph
> with multiple arrows from one node to another same node.

I don't disagree.  But, folks are building systems with them and we need
to either deal with it, or make its data manageable.  You saw our
approach: we cull the data and only expose the bare minimum in sysfs.


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-04 Thread Jerome Glisse
On Tue, Dec 04, 2018 at 03:58:23PM -0800, Dave Hansen wrote:
> On 12/4/18 1:57 PM, Jerome Glisse wrote:
> > Fully correct mind if i steal that perfect summary description next time
> > i post ? I am so bad at explaining thing :)
> 
> Go for it!
> 
> > Intention is to allow program to do everything they do with mbind() today
> > and tomorrow with the HMAT patchset and on top of that to also be able to
> > do what they do today through API like OpenCL, ROCm, CUDA ... So it is one
> > kernel API to rule them all ;)
> 
> While I appreciate the exhaustive scope of such a project, I'm really
> worried that if we decided to use this for our "HMAT" use cases, we'll
> be bottlenecked behind this project while *it* goes through 25 revisions
> over 4 or 5 years like HMM did.
> 
> So, should we just "park" the enhancements to the existing NUMA
> interfaces and infrastructure (think /sys/devices/system/node) and wait
> for this to go in?  Do we try to develop them in parallel and make them
> consistent?  Or, do we just ignore each other and make Andrew sort it
> out in a few years? :)

Let's have a battle with giant foam q-tips at the next LSF/MM and see who wins ;)

More seriously, I think you should go ahead with Keith's HMAT patchset and
make progress there. In the HMAT case you can grow and evolve the NUMA node
infrastructure to address your needs, and I believe you are doing it in
a sensible way. But I do not see a path for what I am trying to achieve
in that framework. If anyone has a good idea I would welcome it.

In the meantime I hope I can make progress with my proposal here under
staging. Once I get enough working in userspace and convince some guinea
pigs (I need to find a better name for the poor people I will coerce
into testing this ;)), I will have hard evidence of which parts of my
proposal are useful on concrete cases with an open source stack from
top to bottom. That might mean stripping down what I am proposing today
to what turns out to be useful. Then we can start a discussion about
merging the underlying kernel code paths into one (while preserving all
existing APIs) and getting out of staging with real syscalls we will
have to live with forever.

I know that at the very least the hbind() and hpolicy() syscalls would
be successful, as the HPC folks have been dreaming of this. The topology
part is harder to call: there are some users today, but I cannot say how
much more interest it can spark outside of the very small HPC community.

Cheers,
Jérôme


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-04 Thread Jerome Glisse
On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
> On 12/3/18 3:34 PM, jgli...@redhat.com wrote:
> > This patchset use the above scheme to expose system topology through
> > sysfs under /sys/bus/hms/ with:
> > - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
> >   each has a UID and you can usual value in that folder (node id,
> >   size, ...)
> > 
> > - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
> >   (CPU or device), each has a HMS UID but also a CPU id for CPU
> >   (which match CPU id in (/sys/bus/cpu/). For device you have a
> >   path that can be PCIE BUS ID for instance)
> > 
> > - /sys/bus/hms/devices/v%version-%id-link : an link, each has a
> >   UID and a file per property (bandwidth, latency, ...) you also
> >   find a symlink to every target and initiator connected to that
> >   link.
> > 
> > - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
> >   a UID and a file per property (bandwidth, latency, ...) you
> >   also find a symlink to all initiators that can use that bridge.
> 
> We support 1024 NUMA nodes on x86.  The ACPI HMAT expresses the
> connections between each node.  Let's suppose that each node has some
> CPUs and some memory.
> 
> That means we'll have 1024 target directories in sysfs, 1024 initiator
> directories in sysfs, and 1024*1024 link directories.  Or, would the
> kernel be responsible for "compiling" the firmware-provided information
> down into a more manageable number of links?
> 
> Some idiot made the mistake of having one sysfs directory per 128MB of
> memory way back when, and now we have hundreds of thousands of
> /sys/devices/system/memory/memoryX directories.  That sucks to manage.
> Isn't this potentially repeating that mistake?
> 
> Basically, is sysfs the right place to even expose this much data?

I definitely want to avoid the memoryX mistake, so I do not want to
see one link directory per device. Taking my simple laptop as an
example, with 4 CPUs, a wifi device and 2 GPUs (an integrated one and
a discrete one):

link0: cpu0 cpu1 cpu2 cpu3
link1: wifi (2 PCIe lanes)
link2: gpu0 (unknown number of lanes, but I believe it has higher
 bandwidth to main memory)
link3: gpu1 (16 PCIe lanes)
link4: gpu1 and GPU memory

So there is one link directory per number of PCIe lanes your devices
have, so that you can differentiate on bandwidth. Main memory is
symlinked inside all the link directories except link4. The discrete
GPU memory is only in the link4 directory, as it is only accessible
by the GPU (we could add it under link3 too, with the non-cache-coherent
property attached to it).


The issue then becomes how to convert the overly verbose HMAT
information down into some reasonable layout for HMS. For that
I would create a link directory for each distinct matrix cell.
As an example, say that each entry in the matrix has bandwidth
and latency; then we create a link directory for each combination
of bandwidth and latency. On a simple system that should boil down
to a handful of combinations, roughly speaking mirroring the example
above of one link directory per number of PCIe lanes.

I don't think I have a system with an HMAT table; if you have an
HMAT table to provide, I could show the end result.

Note that I believe the ACPI HMAT matrix is a bad design for those
reasons, i.e. there is a lot of commonality across many of the matrix
entries, and many entries also do not make sense (e.g. an initiator
not being able to access all the targets). I feel that link/bridge
is much more compact and allows representing any directed graph,
including multiple arrows between the same pair of nodes.

Cheers,
Jérôme


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-04 Thread Dave Hansen
On 12/4/18 1:57 PM, Jerome Glisse wrote:
> Fully correct mind if i steal that perfect summary description next time
> i post ? I am so bad at explaining thing :)

Go for it!

> Intention is to allow program to do everything they do with mbind() today
> and tomorrow with the HMAT patchset and on top of that to also be able to
> do what they do today through API like OpenCL, ROCm, CUDA ... So it is one
> kernel API to rule them all ;)

While I appreciate the exhaustive scope of such a project, I'm really
worried that if we decided to use this for our "HMAT" use cases, we'll
be bottlenecked behind this project while *it* goes through 25 revisions
over 4 or 5 years like HMM did.

So, should we just "park" the enhancements to the existing NUMA
interfaces and infrastructure (think /sys/devices/system/node) and wait
for this to go in?  Do we try to develop them in parallel and make them
consistent?  Or, do we just ignore each other and make Andrew sort it
out in a few years? :)


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-04 Thread Dave Hansen
On 12/3/18 3:34 PM, jgli...@redhat.com wrote:
> This patchset use the above scheme to expose system topology through
> sysfs under /sys/bus/hms/ with:
> - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
>   each has a UID and you can usual value in that folder (node id,
>   size, ...)
> 
> - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
>   (CPU or device), each has a HMS UID but also a CPU id for CPU
>   (which match CPU id in (/sys/bus/cpu/). For device you have a
>   path that can be PCIE BUS ID for instance)
> 
> - /sys/bus/hms/devices/v%version-%id-link : an link, each has a
>   UID and a file per property (bandwidth, latency, ...) you also
>   find a symlink to every target and initiator connected to that
>   link.
> 
> - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
>   a UID and a file per property (bandwidth, latency, ...) you
>   also find a symlink to all initiators that can use that bridge.

We support 1024 NUMA nodes on x86.  The ACPI HMAT expresses the
connections between each node.  Let's suppose that each node has some
CPUs and some memory.

That means we'll have 1024 target directories in sysfs, 1024 initiator
directories in sysfs, and 1024*1024 link directories.  Or, would the
kernel be responsible for "compiling" the firmware-provided information
down into a more manageable number of links?

Some idiot made the mistake of having one sysfs directory per 128MB of
memory way back when, and now we have hundreds of thousands of
/sys/devices/system/memory/memoryX directories.  That sucks to manage.
Isn't this potentially repeating that mistake?

Basically, is sysfs the right place to even expose this much data?


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-04 Thread Jerome Glisse
On Tue, Dec 04, 2018 at 01:37:56PM -0800, Dave Hansen wrote:
> On 12/4/18 10:49 AM, Jerome Glisse wrote:
> >> Also, could you add a simple, example program for how someone might use
> >> this?  I got lost in all the new sysfs and ioctl gunk.  Can you
> >> characterize how this would work with the *exiting* NUMA interfaces that
> >> we have?
> > That is the issue i can not expose device memory as NUMA node as
> > device memory is not cache coherent on AMD and Intel platform today.
> > 
> > More over in some case that memory is not visible at all by the CPU
> > which is not something you can express in the current NUMA node.
> 
> Yeah, our NUMA mechanisms are for managing memory that the kernel itself
> manages in the "normal" allocator and supports a full feature set on.
> That has a bunch of implications, like that the memory is cache coherent
> and accessible from everywhere.
> 
> The HMAT patches only comprehend this "normal" memory, which is why
> we're extending the existing /sys/devices/system/node infrastructure.
> 
> This series has a much more aggressive goal, which is comprehending the
> connections of every memory-target to every memory-initiator, no matter
> who is managing the memory, who can access it, or what it can be used for.
> 
> Theoretically, HMS could be used for everything that we're doing with
> /sys/devices/system/node, as long as it's tied back into the existing
> NUMA infrastructure _somehow_.
> 
> Right?

Fully correct. Mind if I steal that perfect summary description next time
I post? I am so bad at explaining things :)

The intention is to allow programs to do everything they do with mbind()
today (and tomorrow with the HMAT patchset), and on top of that to also
be able to do what they do today through APIs like OpenCL, ROCm, CUDA ...
So it is one kernel API to rule them all ;)

Also, at first I intend to special-case VMA page allocation when an HMS
policy is set; long term I would like to merge the code paths inside the
kernel. But I do not want to disrupt the existing code paths today; I
would rather grow to that organically, step by step. mbind() would still
work unaffected; in the end only the plumbing would be slightly different.

Cheers,
Jérôme


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-04 Thread Dave Hansen
On 12/4/18 10:49 AM, Jerome Glisse wrote:
>> Also, could you add a simple, example program for how someone might use
>> this?  I got lost in all the new sysfs and ioctl gunk.  Can you
>> characterize how this would work with the *exiting* NUMA interfaces that
>> we have?
> That is the issue i can not expose device memory as NUMA node as
> device memory is not cache coherent on AMD and Intel platform today.
> 
> More over in some case that memory is not visible at all by the CPU
> which is not something you can express in the current NUMA node.

Yeah, our NUMA mechanisms are for managing memory that the kernel itself
manages in the "normal" allocator and supports a full feature set on.
That has a bunch of implications, like that the memory is cache coherent
and accessible from everywhere.

The HMAT patches only comprehend this "normal" memory, which is why
we're extending the existing /sys/devices/system/node infrastructure.

This series has a much more aggressive goal, which is comprehending the
connections of every memory-target to every memory-initiator, no matter
who is managing the memory, who can access it, or what it can be used for.

Theoretically, HMS could be used for everything that we're doing with
/sys/devices/system/node, as long as it's tied back into the existing
NUMA infrastructure _somehow_.

Right?


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-04 Thread Jerome Glisse
On Tue, Dec 04, 2018 at 10:54:10AM -0800, Dave Hansen wrote:
> On 12/4/18 10:49 AM, Jerome Glisse wrote:
> > Policy is same kind of story, this email is long enough now :) But
> > i can write one down if you want.
> 
> Yes, please.  I'd love to see the code.
> 
> We'll do the same on the "HMAT" side and we can compare notes.

Example use cases? Example uses are (see the sketch below):
The application creates a range of virtual addresses with mmap() for
the input dataset. The application knows it will use the GPU on it
directly, so it calls hbind() to set a policy on the range so that
any new allocation for the range uses GPU memory.

The application then streams the dataset directly into GPU memory
through the virtual address range, thanks to the policy.


The application creates a range of virtual addresses with mmap() to
store the output of the GPU jobs it is about to launch. It binds the
range of virtual addresses to GPU memory so that allocations for the
range use GPU memory.


The application can also use policy binding as a slow migration path,
i.e. set a policy with a new target memory so that new allocations
are directed to this new target.
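
For the first flow, here is a minimal, hypothetical sketch; the hbind()
wrapper, its flags and the target identifier below are illustrative
placeholders, not the actual syscall signature from this series:

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical wrapper around the proposed hbind() syscall: bind
 * [addr, addr + len) to the HMS target "target_uid" so that new
 * allocations in the range come from that memory. The real
 * signature may well differ. */
extern int hbind(void *addr, size_t len, uint32_t target_uid, int flags);

int stream_dataset_to_gpu(size_t len, uint32_t gpu_target_uid)
{
    /* reserve virtual addresses for the input dataset */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    /* ask the kernel to back this range with GPU memory */
    if (hbind(buf, len, gpu_target_uid, 0))
        return -1;

    /* ... stream the dataset into buf; new pages are allocated
     * from GPU memory because of the policy ... */
    return 0;
}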

Or do you want an example userspace program like the one in the last
patch of this series?

Cheers,
Jérôme


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-04 Thread Dave Hansen
On 12/4/18 10:49 AM, Jerome Glisse wrote:
> Policy is same kind of story, this email is long enough now :) But
> i can write one down if you want.

Yes, please.  I'd love to see the code.

We'll do the same on the "HMAT" side and we can compare notes.


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-04 Thread Jerome Glisse
On Tue, Dec 04, 2018 at 10:02:55AM -0800, Dave Hansen wrote:
> On 12/3/18 3:34 PM, jgli...@redhat.com wrote:
> > This means that it is no longer sufficient to consider a flat view
> > for each node in a system but for maximum performance we need to
> > account for all of this new memory but also for system topology.
> > This is why this proposal is unlike the HMAT proposal [1] which
> > tries to extend the existing NUMA for new type of memory. Here we
> > are tackling a much more profound change that depart from NUMA.
> 
> The HMAT and its implications exist, in firmware, whether or not we do
> *anything* in Linux to support it or not.  Any system with an HMAT
> inherently reflects the new topology, via proximity domains, whether or
> not we parse the HMAT table in Linux or not.
> 
> Basically, *ACPI* has decided to extend NUMA.  Linux can either fight
> that or embrace it.  Keith's HMAT patches are embracing it.  These
> patches are appearing to fight it.  Agree?  Disagree?

Disagree; sorry if it felt that way, that was not my intention. The
ACPI HMAT information can be used to populate the HMS filesystem
representation. My intention is not to fight Keith's HMAT patches;
they are useful on their own. But I do not see how to evolve NUMA
to support device memory, so while Keith is taking a step in the
direction I want, I do not see how to cross to the place I need to
be. More on that below.

> 
> Also, could you add a simple, example program for how someone might use
> this?  I got lost in all the new sysfs and ioctl gunk.  Can you
> characterize how this would work with the *exiting* NUMA interfaces that
> we have?

That is the issue: I cannot expose device memory as a NUMA node, as
device memory is not cache coherent on AMD and Intel platforms today.

Moreover, in some cases that memory is not visible at all to the CPU,
which is not something you can express in the current NUMA node.
Here is an abbreviated list of features I need to support:
- device private memory (not accessible by the CPU or anybody else)
- non-coherent memory (PCIe is not cache coherent for CPU access)
- multiple paths to access the same memory, either:
  - multiple _different_ physical addresses aliasing the same memory
  - device blocks that can select which path they take to access some
    memory (it is not in the page table but in how you program the
    device block)
- complex topology that is not a tree, where device links can have
  better characteristics than the CPU inter-connect between the
  nodes. There are existing users today who use topology information
  to partition their workload (HPC folks who have a fixed platform).
- device memory needs to stay under device driver control, as some
  existing APIs (OpenGL, Vulkan) have a different memory model, and
  if we want the device to be usable for those too then we need to
  keep the device driver in control of device memory allocation


There is an example userspace program with the last patch in the series.
But here is a high level overview of how one application looks today:

1) Application gets some dataset from some source (disk, network,
   sensors, ...)
2) Application allocates memory on device A and copies over the dataset
3) Application runs some CPU code to format the copy of the dataset
   inside device A memory (rebuild pointers inside the dataset;
   this can represent millions and millions of operations)
4) Application runs code on device A that uses the dataset
5) Application allocates memory on device B and copies over the result
   from device A
6) Application runs some CPU code to format the copy of the dataset
   inside device B (rebuild pointers inside the dataset;
   this can represent millions and millions of operations)
7) Application runs code on device B that uses the dataset
8) Application copies the result over from device B and keeps on doing
   its thing

How it looks with HMS:
1) Application gets some dataset from some source (disk, network,
   sensors, ...)
2-3) Application calls HMS to migrate to device A memory
4) Application runs code on device A that uses the dataset
5-6) Application calls HMS to migrate to device B memory
7) Application runs code on device B that uses the dataset
8) Application calls HMS to migrate the result to main memory

So we now avoid the explicit copies and having to rebuild the data
structures inside each device address space.
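
A rough sketch of the HMS version of that pipeline, assuming a
hypothetical hms_migrate() helper on top of the proposed API (the
helper names and signatures are mine, not from the patches):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: migrate the pages backing [addr, addr + len)
 * to the given HMS target. A placeholder for whatever userspace
 * library ends up wrapping the HMS interface. */
extern int hms_migrate(void *addr, size_t len, uint32_t target_uid);

/* Placeholders for launching compute on each device (e.g. through
 * CUDA/OpenCL/ROCm); not part of HMS itself. */
extern void run_on_device_a(void *addr, size_t len);
extern void run_on_device_b(void *addr, size_t len);

void pipeline(void *dataset, size_t len,
              uint32_t device_a_mem, uint32_t device_b_mem,
              uint32_t main_mem)
{
    hms_migrate(dataset, len, device_a_mem);  /* steps 2-3 */
    run_on_device_a(dataset, len);            /* step 4 */
    hms_migrate(dataset, len, device_b_mem);  /* steps 5-6 */
    run_on_device_b(dataset, len);            /* step 7 */
    hms_migrate(dataset, len, main_mem);      /* step 8 */
}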


The above example is for migration. Here is an example of how the
topology is used today:

The application knows that the platform it is running on has 16
GPUs split into 2 groups of 8 GPUs each. GPUs in each group can
access each other's memory over dedicated mesh links between
each other: full speed, no traffic bottleneck.

The application splits its GPU computation in 2 so that each
partition runs on a group of interconnected GPUs, allowing
them to share the dataset.

With HMS:
The application can query the kernel to discover the topology of
the system it is running on and use it to partition and balance
its workload accordingly. The same application should then be able
to run on a new platform without having to be adapted to it.

Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-04 Thread Jerome Glisse
On Tue, Dec 04, 2018 at 10:02:55AM -0800, Dave Hansen wrote:
> On 12/3/18 3:34 PM, jgli...@redhat.com wrote:
> > This means that it is no longer sufficient to consider a flat view
> > for each node in a system but for maximum performance we need to
> > account for all of this new memory but also for system topology.
> > This is why this proposal is unlike the HMAT proposal [1] which
> > tries to extend the existing NUMA for new type of memory. Here we
> > are tackling a much more profound change that depart from NUMA.
> 
> The HMAT and its implications exist, in firmware, whether or not we do
> *anything* in Linux to support it or not.  Any system with an HMAT
> inherently reflects the new topology, via proximity domains, whether or
> not we parse the HMAT table in Linux or not.
> 
> Basically, *ACPI* has decided to extend NUMA.  Linux can either fight
> that or embrace it.  Keith's HMAT patches are embracing it.  These
> patches are appearing to fight it.  Agree?  Disagree?

Disagree, sorry if it felt that way that was not my intention. The
ACPI HMAT information can be use to populate the HMS file system
representation. My intention is not to fight Keith's HMAT patches
they are useful on their own. But i do not see how to evolve NUMA
to support device memory, so while Keith is taking a step into the
direction i want, i do not see how to cross to the place i need to
be. More on that below.

> 
> Also, could you add a simple, example program for how someone might use
> this?  I got lost in all the new sysfs and ioctl gunk.  Can you
> characterize how this would work with the *exiting* NUMA interfaces that
> we have?

That is the issue i can not expose device memory as NUMA node as
device memory is not cache coherent on AMD and Intel platform today.

More over in some case that memory is not visible at all by the CPU
which is not something you can express in the current NUMA node.
Here is an abreviated list of feature i need to support:
- device private memory (not accessible by CPU or anybody else)
- non-coherent memory (PCIE is not cache coherent for CPU access)
- multiple path to access same memory either:
- multiple _different_ physical address alias to same memory
- device block can select which path they take to access some
  memory (it is not inside the page table but in how you program
  the device block)
- complex topology that is not a tree where device link can have
  better characteristics than the CPU inter-connect between the
  nodes. They are existing today user that use topology information
  to partition their workload (HPC folks who have a fix platform).
- device memory needs to stay under device driver control as some
  existing API (OpenGL, Vulkan) have different memory model and if
  we want the device to be use for those too then we need to keep
  the device driver in control of the device memory allocation


There is an example userspace program with the last patch in the serie.
But here is a high level overview of how one application looks today:

1) Application get some dataset from some source (disk, network,
   sensors, ...)
2) Application allocate memory on device A and copy over the dataset
3) Application run some CPU code to format the copy of the dataset
   inside device A memory (rebuild pointers inside the dataset,
   this can represent millions and millions of operations)
4) Application run code on device A that use the dataset
5) Application allocate memory on device B and copy over result
   from device A
6) Application run some CPU code to format the copy of the dataset
   inside device B (rebuild pointers inside the dataset,
   this can represent millions and millions of operations)
7) Application run code on device B that use the dataset
8) Application copy result over from device B and keep on doing its
   thing

How it looks with HMS:
1) Application get some dataset from some source (disk, network,
   sensors, ...)
2-3) Application calls HMS to migrate to device A memory
4) Application run code on device A that use the dataset
5-6) Application calls HMS to migrate to device B memory
7) Application run code on device B that use the dataset
8) Application calls HMS to migrate result to main memory

So we now avoid explicit copy and having to rebuild data structure
inside each device address space.


Above example is for migrate. Here is an example for how the
topology is use today:

Application knows that the platform is running on have 16
GPU split into 2 group of 8 GPUs each. GPU in each group can
access each other memory with dedicated mesh links between
each others. Full speed no traffic bottleneck.

Application splits its GPU computation in 2 so that each
partition runs on a group of interconnected GPU allowing
them to share the dataset.

With HMS:
Application can query 

Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-04 Thread Dave Hansen
On 12/3/18 3:34 PM, jgli...@redhat.com wrote:
> This means that it is no longer sufficient to consider a flat view
> for each node in a system; for maximum performance we need to
> account for all of this new memory but also for system topology.
> This is why this proposal is unlike the HMAT proposal [1] which
> tries to extend the existing NUMA model to new types of memory. Here we
> are tackling a much more profound change that departs from NUMA.

The HMAT and its implications exist, in firmware, whether or not we do
*anything* in Linux to support it.  Any system with an HMAT inherently
reflects the new topology, via proximity domains, whether or not we
parse the HMAT table in Linux.

Basically, *ACPI* has decided to extend NUMA.  Linux can either fight
that or embrace it.  Keith's HMAT patches are embracing it.  These
patches appear to fight it.  Agree?  Disagree?

Also, could you add a simple example program showing how someone might use
this?  I got lost in all the new sysfs and ioctl gunk.  Can you
characterize how this would work with the *existing* NUMA interfaces that
we have?


Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-04 Thread Jerome Glisse
On Tue, Dec 04, 2018 at 01:14:14PM +0530, Aneesh Kumar K.V wrote:
> On 12/4/18 5:04 AM, jgli...@redhat.com wrote:
> > From: Jérôme Glisse 

[...]

> > This patchset uses the above scheme to expose system topology through
> > sysfs under /sys/bus/hms/ with:
> >  - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
> >    each has a UID and you find the usual values in that folder
> >    (node id, size, ...)
> > 
> >  - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
> >    (CPU or device), each has a HMS UID but also a CPU id for CPUs
> >    (which matches the CPU id in /sys/bus/cpu/); for a device you
> >    have a path that can be the PCIE BUS ID for instance
> > 
> >  - /sys/bus/hms/devices/v%version-%id-link : a link, each has a
> >    UID and a file per property (bandwidth, latency, ...); you also
> >    find a symlink to every target and initiator connected to that
> >    link.
> > 
> >  - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
> >    a UID and a file per property (bandwidth, latency, ...); you
> >    also find a symlink to all initiators that can use that bridge.
> 
> is that version tagging really needed? What changes do you envision with
> versions?

I kind of dislike it myself but this is really to keep userspace from
inadvertently using some kind of memory/initiator/link/bridge that it
should not be using if it does not understand what the implications are.

If it were a file inside the directory there is a big chance that user-
space would overlook it. So an old program on a new platform with a new
kind of weird memory, like non-coherent memory, might start using it and
get weird results. Putting the version in the directory name forces
userspace to only look at the memory/initiator/link/bridge entries it
does understand and can use safely.

So I am doing this in the hope that it will protect applications when
new types of things pop up. We have too many examples where we can not
evolve something because existing applications have baked-in assumptions
about it.
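
As a minimal sketch of that rule on the userspace side (HMS_KNOWN_VERSION
and the helper name are made up; the directory name format is the
v%version-%id-<type> scheme quoted above):

  #include <stdio.h>

  #define HMS_KNOWN_VERSION 1

  /* Return 1 only for /sys/bus/hms/devices entries we understand. */
  static int hms_entry_usable(const char *name, unsigned *id, char type[64])
  {
      unsigned version;

      if (sscanf(name, "v%u-%u-%63s", &version, id, type) != 3)
          return 0;   /* not an HMS entry at all */
      return version <= HMS_KNOWN_VERSION;
  }

Anything with a newer version prefix is simply skipped instead of being
misused.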


[...]

> > 3) Tracking and applying heterogeneous memory policies
> > --
> > 
> > The current memory policy infrastructure is node oriented; instead of
> > changing that and risking breakage and regressions this patchset adds
> > a new heterogeneous policy tracking infrastructure. The expectation is
> > that existing applications can keep using mbind() and all existing
> > infrastructure undisturbed and unaffected, while new applications
> > will use the new API and should avoid mixing and matching both (as
> > they can achieve the same thing with the new API).
> > 
> > Also the policy is not directly tied to the vma structure for a few
> > reasons:
> >  - avoid having to split a vma for a policy that does not cover the
> >    full vma
> >  - avoid changing too much vma code
> >  - avoid growing the vma structure with an extra pointer
> > So instead this patchset uses the mmu_notifier API to track vma
> > liveness (munmap(), mremap(), ...).
> > 
> > This patchset is not tied to process memory allocation either (as said
> > at the beginning this is not an end to end patchset but a starting
> > point). It does however demonstrate how migration to device memory can
> > work under this scheme (using nouveau as a demonstration vehicle).
> > 
> > The overall design is simple: on an hbind() call a hms policy structure
> > is created for the supplied range and hms uses the callback associated
> > with the target memory. This callback is provided by the device driver
> > for device memory or by core HMS for regular main memory. The callback
> > can decide to migrate the range to the target memories or do nothing
> > (this can be influenced by flags provided to hbind() too).
> > 
> > 
> > Later patches can tie page faults to HMS policy to direct memory
> > allocation to the right target. For now I would rather postpone that
> > discussion until a consensus is reached on how to move forward on all
> > the topics presented in this email. Start small, grow big ;)
> > 
> > 
> 
> I liked the simplicity of keeping it outside all the existing memory
> management policy code. But that is also the drawback, isn't it?
> We now have multiple entities tracking cpu and memory. (This reminded me
> of how we started with memcg in the early days.)

This is a hard choice; the rationale is that an application uses either
this new API or the old one, so the expectation is that both should not
co-exist in a process. Eventually both can be consolidated into one
inside the kernel while maintaining the different userspace APIs. But I
feel that it is better to get to that point slowly while we experiment
with the new API. We need to gain some experience with the new API on
real workloads to convince ourselves that it is something we can live
with. If we reach that point then we can work on consolidating the
kernel code into one. In the meantime this experiment 

Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()

2018-12-03 Thread Aneesh Kumar K.V

On 12/4/18 5:04 AM, jgli...@redhat.com wrote:

From: Jérôme Glisse 

Heterogeneous memory systems are becoming more and more the norm; in
those systems there is not only the main system memory for each node,
but also device memory and/or a memory hierarchy to consider. Device
memory can come from a device like a GPU, FPGA, ... or from a memory
only device (persistent memory, or a high density memory device).

A memory hierarchy is when you not only have the main memory but also
other types of memory like HBM (High Bandwidth Memory, often stacked
on the CPU die or GPU die), persistent memory or high density memory
(ie something slower than regular DDR DIMMs but much bigger).

On top of this diversity of memories you also have to account for the
system bus topology, ie how all CPUs and devices are connected to each
other. Userspace does not care about the exact physical topology but
cares about topology from a behavior point of view, ie what are all the
paths between an initiator (anything that can initiate memory access
like a CPU, GPU, FPGA, network controller ...) and a target memory and
what are all the properties of each of those paths (bandwidth, latency,
granularity, ...).
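
Put differently, what userspace ultimately wants from a topology query
is, for each initiator/target pair, the list of paths and their
properties; conceptually something like the sketch below (not a
structure from this patchset, purely an illustration):

  /* Conceptual only: the shape of the answer userspace cares about. */
  struct hms_path_properties {
      unsigned long bandwidth;      /* e.g. MB/s */
      unsigned long latency;        /* e.g. nanoseconds */
      unsigned long granularity;    /* smallest efficient access, in bytes */
  };

  struct hms_path {
      unsigned int initiator_uid;   /* CPU, GPU, FPGA, NIC, ... */
      unsigned int target_uid;      /* the memory being reached */
      struct hms_path_properties prop;
  };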

This means that it is no longer sufficient to consider a flat view
for each node in a system; for maximum performance we need to
account for all of this new memory but also for system topology.
This is why this proposal is unlike the HMAT proposal [1] which
tries to extend the existing NUMA model to new types of memory. Here we
are tackling a much more profound change that departs from NUMA.


One of the reasons for a radical change is that the advance of
accelerators like GPUs or FPGAs means that the CPU is no longer the
only piece where computation happens. It is becoming more and more
common for an application to mix and match different accelerators to
perform its computation. So we can no longer satisfy ourselves with
a CPU centric and flat view of a system like NUMA and NUMA distance.


This patchset is a proposal to tackle these problems through three
aspects:
 1 - Expose complex system topology and the various kinds of memory
 to user space so that applications have a standard way and a
 single place to get all the information they care about.
 2 - A new API for user space to bind/provide hints to the kernel on
 which memory to use for a range of virtual addresses (a new
 mbind()-like syscall).
 3 - Kernel side changes for vm policy to handle these changes

This patchset is not an end to end solution but it provides enough
pieces to be useful against nouveau (the upstream open source driver
for NVidia GPUs). It is intended as a starting point for discussion so
that we can figure out what to do. To avoid having too many topics
to discuss I am not considering memory cgroups for now but it is
definitely something we will want to integrate with.

The rest of this email is split into 3 sections. The first section
talks about complex system topology: what it is, how it is used today
and how to describe it tomorrow. The second section talks about the
new API to bind/provide hints to the kernel for a range of virtual
addresses. The third section talks about the new mechanism to track
bind/hints provided by user space or device drivers inside the kernel.


1) Complex system topology and how to represent it


Inside a node you can have a complex topology of memory; for instance
you can have multiple HBM memories in a node, each HBM memory tied to
a set of CPUs (all of which are in the same node). This means that you
have a hierarchy of memory for the CPUs: the local fast HBM, which is
expected to be relatively small compared to main memory, and then the
main memory itself. New memory technologies might also deepen this
hierarchy with another level of yet slower memory but gigantic in size
(some persistent memory technologies might fall into that category).
Another example is device memory, and devices themselves can have a
hierarchy like HBM on top of the device cores plus main device memory.

On top of that you can have multiple paths to access each memory and
each path can have different properties (latency, bandwidth, ...).
Also there is not always symmetry, ie some memory might only be
accessible by some devices or CPUs, ie not accessible by everyone.

So a flat hierarchy for each node is not capable of representing this
kind of complexity. To simplify the discussion, and because we do not
want to single out CPUs from devices, from here on out we will use
"initiator" to refer to either a CPU or a device. An initiator is any
kind of CPU or device that can access memory (ie initiate memory
access).

At this point an example of such a system might help:
 - 2 nodes and for each node:
 - 1 CPU per node with 2 complexes of CPU cores per CPU
 - one HBM memory for each complex of CPU cores (200GB/s)
 - CPU core complexes are linked to each other (100GB/s)
 - main memory (90GB/s)
 - 4 GPUs each with:
 - HBM memory for 
