Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Fri, Dec 07, 2018 at 03:06:36PM +, Jonathan Cameron wrote: > On Thu, 6 Dec 2018 19:20:45 -0500 > Jerome Glisse wrote: > > > On Thu, Dec 06, 2018 at 04:48:57PM -0700, Logan Gunthorpe wrote: > > > > > > > > > On 2018-12-06 4:38 p.m., Dave Hansen wrote: > > > > On 12/6/18 3:28 PM, Logan Gunthorpe wrote: > > > >> I didn't think this was meant to describe actual real world performance > > > >> between all of the links. If that's the case all of this seems like a > > > >> pipe dream to me. > > > > > > > > The HMAT discussions (that I was a part of at least) settled on just > > > > trying to describe what we called "sticker speed". Nobody had an > > > > expectation that you *really* had to measure everything. > > > > > > > > The best we can do for any of these approaches is approximate things. > > > > > > Yes, though there's a lot of caveats in this assumption alone. > > > Specifically with PCI: the bus may run at however many GB/s but P2P > > > through a CPU's root complexes can slow down significantly (like down to > > > MB/s). > > > > > > I've seen similar things across QPI: I can sometimes do P2P from > > > PCI->QPI->PCI but the performance doesn't even come close to the sticker > > > speed of any of those buses. > > > > > > I'm not sure how anyone is going to deal with those issues, but it does > > > firmly place us in world view #2 instead of #1. But, yes, I agree > > > exposing information like in #2 full out to userspace, especially > > > through sysfs, seems like a nightmare and I don't see anything in HMS to > > > help with that. Providing an API to ask for memory (or another resource) > > > that's accessible by a set of initiators and with a set of requirements > > > for capabilities seems more manageable. > > > > Note that in #1 you have bridge that fully allow to express those path > > limitation. So what you just describe can be fully reported to userspace. 
> >
> > I explained and given examples on how program adapt their computation to
> > the system topology it does exist today and people are even developing new
> > programming langage with some of those idea baked in.
> >
> > So they are people out there that already rely on such information they
> > just do not get it from the kernel but from a mix of various device specific
> > API and they have to stich everything themself and develop a database of
> > quirk and gotcha. My proposal is to provide a coherent kernel API where
> > we can sanitize that informations and report it to userspace in a single
> > and coherent description.
> >
> > Cheers,
> > Jérôme
>
> I know it doesn't work everywhere, but I think it's worth enumerating what
> cases we can get some of these numbers for and where the complexity lies.
> I.e. What can the really determined user space library do today?

I gave an example in an email in this thread:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1821872.html

Is this the kind of example you are looking for? :)

>
> So one open question is how close can we get in a userspace only prototype.
> At the end of the day userspace can often read HMAT directly if it wants to
> /sys/firmware/acpi/tables/HMAT. Obviously that gets us only the end to
> end view (world 2). I dislike the limitations of that as much as the next
> person. It is slowly improving with the word "Auditable" being
> kicked around - btw anyone interested in ACPI who works for a UEFI
> member, there are efforts going on and more viewpoints would be great.
> Expect some baby steps shortly.
>
> For devices on PCIe (and protocols on top of it e.g. CCIX), a lot of
> this is discoverable to some degree.
> * Link speed,
> * Number of Lanes,
> * Full topology.

Yes, for discoverable buses like PCIe and all their derivatives (CCIX,
OpenCAPI, ...) userspace will have a way to find the topology.
The issue lies with the orthogonal topology of extra buses that are not
necessarily enumerated or do not have a device driver at present, and
especially with how they interact with each other (can you cross them? ...).

>
> What isn't there (I think)
> * In component latency / bandwidth limitations (some activity going
>   on to improve that long term)
> * Effect of credit allocations etc on effectively bandwidth - interconnect
>   performance is a whole load of black magic.
>
> Presumably there is some information available from NVLink etc?

From my point of view we want to give the best-case sticker value to
userspace, i.e. the bandwidth the engineers who designed the bus swore
their hardware delivers :) I believe it is the best approximation we can
deliver.

>
> So whilst I really like the proposal in some ways, I wonder how much
> exploration could be done of the usefulness of the data without touching
> the kernel at all.
>
> The other aspect that is needed to actually make this 'dynamically' useful is
> to be able to map whatever Performance Counters are available to the relevant
> 'links', bridges etc. Ticket numbers are not all that useful
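For anyone wanting to try the userspace-only approach, the per-device PCIe sticker numbers mentioned above are readable from sysfs today. A minimal sketch, assuming the standard Linux PCI sysfs attributes (`current_link_speed`, `current_link_width`, etc.) are present for the device in question:

```python
import os

def pcie_link_info(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read sticker link speed/width for a PCI device, e.g. "0000:01:00.0".

    Returns a dict of whatever attributes are present; attributes missing
    on older kernels or virtual devices are simply skipped.
    """
    dev = os.path.join(sysfs_root, bdf)
    info = {}
    for attr in ("current_link_speed", "max_link_speed",
                 "current_link_width", "max_link_width"):
        path = os.path.join(dev, attr)
        try:
            with open(path) as f:
                info[attr] = f.read().strip()
        except OSError:
            pass  # attribute not exposed for this device
    return info
```

This only gives the negotiated sticker values, of course; none of the in-component or credit-allocation effects discussed above show up here.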
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Thu, 6 Dec 2018 19:20:45 -0500 Jerome Glisse wrote: > On Thu, Dec 06, 2018 at 04:48:57PM -0700, Logan Gunthorpe wrote: > > > > > > On 2018-12-06 4:38 p.m., Dave Hansen wrote: > > > On 12/6/18 3:28 PM, Logan Gunthorpe wrote: > > >> I didn't think this was meant to describe actual real world performance > > >> between all of the links. If that's the case all of this seems like a > > >> pipe dream to me. > > > > > > The HMAT discussions (that I was a part of at least) settled on just > > > trying to describe what we called "sticker speed". Nobody had an > > > expectation that you *really* had to measure everything. > > > > > > The best we can do for any of these approaches is approximate things. > > > > Yes, though there's a lot of caveats in this assumption alone. > > Specifically with PCI: the bus may run at however many GB/s but P2P > > through a CPU's root complexes can slow down significantly (like down to > > MB/s). > > > > I've seen similar things across QPI: I can sometimes do P2P from > > PCI->QPI->PCI but the performance doesn't even come close to the sticker > > speed of any of those buses. > > > > I'm not sure how anyone is going to deal with those issues, but it does > > firmly place us in world view #2 instead of #1. But, yes, I agree > > exposing information like in #2 full out to userspace, especially > > through sysfs, seems like a nightmare and I don't see anything in HMS to > > help with that. Providing an API to ask for memory (or another resource) > > that's accessible by a set of initiators and with a set of requirements > > for capabilities seems more manageable. > > Note that in #1 you have bridge that fully allow to express those path > limitation. So what you just describe can be fully reported to userspace. > > I explained and given examples on how program adapt their computation to > the system topology it does exist today and people are even developing new > programming langage with some of those idea baked in. 
>
> So they are people out there that already rely on such information they
> just do not get it from the kernel but from a mix of various device specific
> API and they have to stich everything themself and develop a database of
> quirk and gotcha. My proposal is to provide a coherent kernel API where
> we can sanitize that informations and report it to userspace in a single
> and coherent description.
>
> Cheers,
> Jérôme

I know it doesn't work everywhere, but I think it's worth enumerating what
cases we can get some of these numbers for and where the complexity lies.
I.e. what can the really determined user space library do today?

So one open question is how close we can get in a userspace-only prototype.
At the end of the day userspace can often read the HMAT directly if it wants
to, from /sys/firmware/acpi/tables/HMAT. Obviously that gets us only the
end-to-end view (world 2). I dislike the limitations of that as much as the
next person. It is slowly improving, with the word "Auditable" being kicked
around - btw, anyone interested in ACPI who works for a UEFI member, there
are efforts going on and more viewpoints would be great. Expect some baby
steps shortly.

For devices on PCIe (and protocols on top of it, e.g. CCIX), a lot of this
is discoverable to some degree:
* Link speed,
* Number of lanes,
* Full topology.

What isn't there (I think):
* In-component latency / bandwidth limitations (some activity going on to
  improve that long term),
* Effect of credit allocations etc. on effective bandwidth - interconnect
  performance is a whole load of black magic.

Presumably there is some information available from NVLink etc.?

So whilst I really like the proposal in some ways, I wonder how much
exploration could be done of the usefulness of the data without touching the
kernel at all.

The other aspect that is needed to actually make this 'dynamically' useful
is to be able to map whatever performance counters are available to the
relevant 'links', bridges, etc.
Sticker numbers are unfortunately not all that useful except for small
amounts of data on lightly loaded buses.

The kernel ultimately only needs to have a model of this topology if:
1) It's going to use it itself,
2) It's going to do something automatic with it, or
3) It needs to fix garbage info or supplement it with things only the
   kernel knows.

Jonathan
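Reading the HMAT from userspace as suggested above is straightforward to prototype. A minimal sketch, assuming only the standard 36-byte common ACPI table header layout (reading the real table needs root, and the HMAT may simply be absent on a given machine; decoding the HMAT-specific body beyond the header is left out):

```python
import struct

# Common ACPI table header, per the ACPI specification (36 bytes):
# signature, length, revision, checksum, OEM ID, OEM table ID,
# OEM revision, creator ID, creator revision.
ACPI_HEADER = struct.Struct("<4sIBB6s8sI4sI")

def parse_acpi_header(raw):
    """Parse the common ACPI table header from raw table bytes."""
    sig, length, rev, _csum, oem_id, _oem_tbl, _oem_rev, _cre, _cre_rev = \
        ACPI_HEADER.unpack_from(raw)
    return {
        "signature": sig.decode("ascii"),
        "length": length,
        "revision": rev,
        "oem_id": oem_id.decode("ascii", "replace").strip(),
    }

def read_hmat(path="/sys/firmware/acpi/tables/HMAT"):
    """Read and sanity-check the raw HMAT; returns (header, body bytes)."""
    with open(path, "rb") as f:
        raw = f.read()
    hdr = parse_acpi_header(raw)
    assert hdr["signature"] == "HMAT" and hdr["length"] == len(raw)
    return hdr, raw[ACPI_HEADER.size:]
```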
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Thu, Dec 06, 2018 at 04:48:57PM -0700, Logan Gunthorpe wrote:
>
>
> On 2018-12-06 4:38 p.m., Dave Hansen wrote:
> > On 12/6/18 3:28 PM, Logan Gunthorpe wrote:
> >> I didn't think this was meant to describe actual real world performance
> >> between all of the links. If that's the case all of this seems like a
> >> pipe dream to me.
> >
> > The HMAT discussions (that I was a part of at least) settled on just
> > trying to describe what we called "sticker speed". Nobody had an
> > expectation that you *really* had to measure everything.
> >
> > The best we can do for any of these approaches is approximate things.
>
> Yes, though there's a lot of caveats in this assumption alone.
> Specifically with PCI: the bus may run at however many GB/s but P2P
> through a CPU's root complexes can slow down significantly (like down to
> MB/s).
>
> I've seen similar things across QPI: I can sometimes do P2P from
> PCI->QPI->PCI but the performance doesn't even come close to the sticker
> speed of any of those buses.
>
> I'm not sure how anyone is going to deal with those issues, but it does
> firmly place us in world view #2 instead of #1. But, yes, I agree
> exposing information like in #2 full out to userspace, especially
> through sysfs, seems like a nightmare and I don't see anything in HMS to
> help with that. Providing an API to ask for memory (or another resource)
> that's accessible by a set of initiators and with a set of requirements
> for capabilities seems more manageable.

Note that in #1 you have bridges that fully allow expressing those path
limitations. So what you just described can be fully reported to userspace.

I explained and gave examples of how programs adapt their computation to
the system topology; this does exist today, and people are even developing
new programming languages with some of those ideas baked in.
So there are people out there who already rely on such information; they
just do not get it from the kernel but from a mix of various device-specific
APIs, and they have to stitch everything together themselves and develop a
database of quirks and gotchas. My proposal is to provide a coherent kernel
API where we can sanitize that information and report it to userspace in a
single and coherent description.

Cheers,
Jérôme
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Thu, Dec 06, 2018 at 03:09:21PM -0800, Dave Hansen wrote:
> On 12/6/18 2:39 PM, Jerome Glisse wrote:
> > No if the 4 sockets are connect in a ring fashion ie:
> >     Socket0 - Socket1
> >        |         |
> >     Socket3 - Socket2
> >
> > Then you have 4 links:
> >     link0: socket0 socket1
> >     link1: socket1 socket2
> >     link3: socket2 socket3
> >     link4: socket3 socket0
> >
> > I do not see how their can be an explosion of link directory, worse
> > case is as many link directories as they are bus for a CPU/device/
> > target.
>
> This looks great. But, we don't _have_ this kind of information for any
> system that I know about or any system available in the near future.

We do not have it in any standard way; it is out there in either device
driver databases, application databases, a special platform OEM blob buried
somewhere in the firmware ... I want to solve the kernel side of the
problem, i.e. how to expose this to userspace. How the kernel gets that
information is an orthogonal problem. For now my intention is to have
device drivers register and create the links and bridges that are not
enumerated by standard firmware.

>
> We basically have two different world views:
> 1. The system is described point-to-point. A connects to B @
>    100GB/s. B connects to C at 50GB/s. Thus, C->A should be
>    50GB/s.
>    * Less information to convey
>    * Potentially less precise if the properties are not perfectly
>      additive. If A->B=10ns and B->C=20ns, A->C might be
>      30ns.
>    * Costs must be calculated instead of being explicitly specified
> 2. The system is described endpoint-to-endpoint. A->B @ 100GB/s
>    B->C @ 50GB/s, A->C @ 50GB/s.
>    * A *lot* more information to convey O(N^2)?
>    * Potentially more precise.
>    * Costs are explicitly specified, not calculated
>
> These patches are really tied to world view #1. But, the HMAT is really
> tied to world view #1.
                     ^#2

Note that there are also the bridge objects in my proposal.
So in my proposal, for #1 you have:

    link0: A <-> B with 100GB/s and 10ns latency
    link1: B <-> C with 50GB/s and 20ns latency

Now if A can reach C through B then you have bridges (bridges are
uni-directional, unlike links which are bi-directional, though that finer
point can be discussed; this is what allows any kind of directed graph to
be represented):

    bridge2: link0 -> link1
    bridge3: link1 -> link0

You can also associate properties with a bridge (but it is not mandatory).
So you can say that bridge2 and bridge3 have a latency of 50ns; if the
addition of the link latencies is enough then you do not specify it in the
bridge. It is a rule that a path's latency is the sum of its individual
link latencies. For bandwidth it is the minimum bandwidth, i.e. whatever is
the bottleneck for the path.

> I know you're not a fan of the HMAT. But it is the firmware reality
> that we are stuck with, until something better shows up. I just don't
> see a way to convert it into what you have described here.

Like I said, I am not targeting HMAT systems; I am targeting systems that
today rely on a database spread between drivers and applications. I want to
move that knowledge into drivers first so that they can teach the core
kernel and register things in the core. Providing a standard firmware way
to provide this information is a different problem (there are some loose
standards on non-ACPI platforms AFAIK).

> I'm starting to think that, no matter if the HMAT or some other approach
> gets adopted, we shouldn't be exposing this level of gunk to userspace
> at *all* since it requires adopting one of the world views.

I do not see these as exclusive. Yes, there are HMAT systems "soon" to
arrive, but we already have the more extended view, which is just buried
under a pile of different pieces. I do not see any exclusion between the
two. If HMAT is good enough for a whole class of systems, fine, but there
is also a whole class of systems and users that do not fit in that
paradigm, hence my proposal.

Cheers,
Jérôme
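The path rules described here (latency is the sum over the links, bandwidth is the minimum, and bridges make the graph directed) can be sketched as follows; all names (`Topology`, `add_link`, ...) are made up for illustration and are not part of the proposed kernel interface:

```python
from collections import defaultdict

class Topology:
    """Sketch of the link/bridge model: links are bi-directional edges with
    bandwidth (GB/s) and latency (ns); bridges make the graph directed by
    declaring which link-to-link crossings are allowed."""

    def __init__(self):
        self.links = {}               # name -> (end_a, end_b, bw, lat)
        self.neighbors = defaultdict(list)  # node -> [(peer, link_name)]
        self.bridges = set()          # (from_link, to_link) allowed crossings

    def add_link(self, name, a, b, bw, lat):
        self.links[name] = (a, b, bw, lat)
        self.neighbors[a].append((b, name))
        self.neighbors[b].append((a, name))

    def add_bridge(self, from_link, to_link):
        self.bridges.add((from_link, to_link))

    def path_properties(self, path_links):
        """Bandwidth = min over links, latency = sum over links; the path
        is valid only if every link-to-link crossing has a bridge."""
        for prev, nxt in zip(path_links, path_links[1:]):
            if (prev, nxt) not in self.bridges:
                return None  # no bridge allows crossing from prev to nxt
        bw = min(self.links[l][2] for l in path_links)
        lat = sum(self.links[l][3] for l in path_links)
        return bw, lat

# The example from the email above:
t = Topology()
t.add_link("link0", "A", "B", bw=100, lat=10)
t.add_link("link1", "B", "C", bw=50, lat=20)
t.add_bridge("link0", "link1")  # bridge2
t.add_bridge("link1", "link0")  # bridge3
t.path_properties(["link0", "link1"])  # -> (50, 30): min bandwidth, summed latency
```

An explicit bridge latency, as described for bridge2/bridge3, would simply be an optional extra term added on each crossing rather than relying on the sum rule alone.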
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 2018-12-06 4:38 p.m., Dave Hansen wrote: > On 12/6/18 3:28 PM, Logan Gunthorpe wrote: >> I didn't think this was meant to describe actual real world performance >> between all of the links. If that's the case all of this seems like a >> pipe dream to me. > > The HMAT discussions (that I was a part of at least) settled on just > trying to describe what we called "sticker speed". Nobody had an > expectation that you *really* had to measure everything. > > The best we can do for any of these approaches is approximate things. Yes, though there's a lot of caveats in this assumption alone. Specifically with PCI: the bus may run at however many GB/s but P2P through a CPU's root complexes can slow down significantly (like down to MB/s). I've seen similar things across QPI: I can sometimes do P2P from PCI->QPI->PCI but the performance doesn't even come close to the sticker speed of any of those buses. I'm not sure how anyone is going to deal with those issues, but it does firmly place us in world view #2 instead of #1. But, yes, I agree exposing information like in #2 full out to userspace, especially through sysfs, seems like a nightmare and I don't see anything in HMS to help with that. Providing an API to ask for memory (or another resource) that's accessible by a set of initiators and with a set of requirements for capabilities seems more manageable. Logan
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 12/6/18 3:28 PM, Logan Gunthorpe wrote:
> I didn't think this was meant to describe actual real world performance
> between all of the links. If that's the case all of this seems like a
> pipe dream to me.

The HMAT discussions (that I was a part of at least) settled on just trying to describe what we called "sticker speed". Nobody had an expectation that you *really* had to measure everything.

The best we can do for any of these approaches is approximate things.

> You're not *really* going to know bandwidth or latency for any of this
> unless you actually measure it on the system in question.

Yeah, agreed.
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 12/6/18 3:28 PM, Logan Gunthorpe wrote:
> These patches are really tied to world view #1. But, the HMAT is really
> tied to world view #1.

Whoops, should have been "the HMAT is really tied to world view #2".
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 2018-12-06 4:09 p.m., Dave Hansen wrote:
> This looks great. But, we don't _have_ this kind of information for any
> system that I know about or any system available in the near future.
>
> We basically have two different world views:
> 1. The system is described point-to-point. A connects to B @
>    100GB/s. B connects to C at 50GB/s. Thus, C->A should be
>    50GB/s.
>    * Less information to convey
>    * Potentially less precise if the properties are not perfectly
>      additive. If A->B=10ns and B->C=20ns, A->C might be >30ns.
>    * Costs must be calculated instead of being explicitly specified
> 2. The system is described endpoint-to-endpoint. A->B @ 100GB/s,
>    B->C @ 50GB/s, A->C @ 50GB/s.
>    * A *lot* more information to convey O(N^2)?
>    * Potentially more precise.
>    * Costs are explicitly specified, not calculated
>
> These patches are really tied to world view #1. But, the HMAT is really
> tied to world view #1.

I didn't think this was meant to describe actual real world performance between all of the links. If that's the case all of this seems like a pipe dream to me.

Attributes like cache coherency, atomics, etc should fit well in world view #1... and, at best, some kind of flag saying whether or not to use a particular link if you care about transfer speed. -- But we don't need special "link" directories to describe the properties of existing buses.

You're not *really* going to know bandwidth or latency for any of this unless you actually measure it on the system in question.

Logan
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 12/6/18 2:39 PM, Jerome Glisse wrote:
> No if the 4 sockets are connect in a ring fashion ie:
>     Socket0 - Socket1
>        |         |
>     Socket3 - Socket2
>
> Then you have 4 links:
> link0: socket0 socket1
> link1: socket1 socket2
> link3: socket2 socket3
> link4: socket3 socket0
>
> I do not see how their can be an explosion of link directory, worse
> case is as many link directories as they are bus for a CPU/device/
> target.

This looks great. But, we don't _have_ this kind of information for any system that I know about or any system available in the near future.

We basically have two different world views:
1. The system is described point-to-point. A connects to B @ 100GB/s.
   B connects to C at 50GB/s. Thus, C->A should be 50GB/s.
   * Less information to convey
   * Potentially less precise if the properties are not perfectly
     additive. If A->B=10ns and B->C=20ns, A->C might be >30ns.
   * Costs must be calculated instead of being explicitly specified
2. The system is described endpoint-to-endpoint. A->B @ 100GB/s,
   B->C @ 50GB/s, A->C @ 50GB/s.
   * A *lot* more information to convey O(N^2)?
   * Potentially more precise.
   * Costs are explicitly specified, not calculated

These patches are really tied to world view #1. But, the HMAT is really tied to world view #1.

I know you're not a fan of the HMAT. But it is the firmware reality that we are stuck with, until something better shows up. I just don't see a way to convert it into what you have described here.

I'm starting to think that, no matter if the HMAT or some other approach gets adopted, we shouldn't be exposing this level of gunk to userspace at *all* since it requires adopting one of the world views.
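[Editor's note] The trade-off between the two world views can be shown with a toy computation (all topology and numbers hypothetical): under view #1 only the direct links are stored and pairwise costs are calculated, e.g. with a widest-path pass, whereas view #2 would store every one of the O(N^2) endpoint pairs explicitly.

```python
# World view #1: store only the direct links (a 4-node ring here) and
# calculate pairwise costs; world view #2 would list all O(N^2) pairs.
nodes = ["A", "B", "C", "D"]
direct_bw = {("A", "B"): 100, ("B", "C"): 50,
             ("C", "D"): 50, ("D", "A"): 100}   # GB/s, hypothetical

# Derive the full endpoint-to-endpoint matrix (widest path, i.e. the
# best achievable bottleneck bandwidth via Floyd-Warshall) -- this is
# the "costs must be calculated" part of view #1, and exactly where
# imprecision creeps in if real paths behave worse than their links.
bw = {(i, j): 0 for i in nodes for j in nodes}
for (a, b), g in direct_bw.items():
    bw[(a, b)] = bw[(b, a)] = g
for k in nodes:
    for i in nodes:
        for j in nodes:
            bw[(i, j)] = max(bw[(i, j)], min(bw[(i, k)], bw[(k, j)]))

print(len(direct_bw))   # view #1 stores 4 links...
print(bw[("A", "C")])   # ...and derives A<->C as a 50 GB/s bottleneck
```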
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Thu, Dec 06, 2018 at 02:04:46PM -0800, Dave Hansen wrote:
> On 12/6/18 12:11 PM, Logan Gunthorpe wrote:
> >> My concern with having folks do per-program parsing, *and* having a huge
> >> amount of data to parse makes it unusable. The largest systems will
> >> literally have hundreds of thousands of objects in /sysfs, even in a
> >> single directory. That makes readdir() basically impossible, and makes
> >> even open() (if you already know the path you want somehow) hard to do
> >> fast.
> > Is this actually realistic? I find it hard to imagine an actual hardware
> > bus that can have even thousands of devices under a single node, let
> > alone hundreds of thousands.
>
> Jerome's proposal, as I understand it, would have generic "links".
> They're not an instance of bus, but characterize a class of "link". For
> instance, a "link" might characterize the characteristics of the QPI bus
> between two CPU sockets. The link directory would enumerate the list of
> all *instances* of that link
>
> So, a "link" directory for QPI would say Socket0<->Socket1,
> Socket1<->Socket2, Socket1<->Socket2, Socket2<->PCIe-1.2.3.4 etc... It
> would have to enumerate the connections between every entity that shared
> those link properties.
>
> While there might not be millions of buses, there could be millions of
> *paths* across all those buses, and that's what the HMAT describes, at
> least: the net result of all those paths.

Sorry if I again mis-explained things. Links are arrows between nodes (CPU, device, or memory). An arrow/link has properties associated with it: bandwidth, latency, cache coherence, ...

So if your system has 4 sockets, each socket is connected to every other (a mesh), and all interconnects in the mesh have the same properties, then you have only 1 link directory with the 4 sockets in it.

Now, if the 4 sockets are instead connected in a ring fashion, ie:
    Socket0 - Socket1
       |         |
    Socket3 - Socket2

then you have 4 links:
link0: socket0 socket1
link1: socket1 socket2
link2: socket2 socket3
link3: socket3 socket0

I do not see how there can be an explosion of link directories; the worst case is as many link directories as there are buses for a CPU/device/target.

So worst case, if you have N devices and each device is connected to 2 buses (PCIe, plus QPI to reach the other socket, for instance), then you have 2*N link directories (again, this is a worst case). There is a lot of commonality that will remain, so I expect quite a few link directories will have many symlinks, ie you won't get close to the worst case.

In the end it really is easier to think from the physical topology, where a link corresponds to an interconnect between two devices or CPUs. In all the systems I have seen, even on the craziest roadmaps, I have only seen something like 128/256 interconnects (4 sockets with 32/64 devices per socket), and many of these can be grouped under a common link directory. Here the worst case is 4 connections per device/CPU/target, so a worst case of 128/256 * 4 = 512/1024 link directories, and that's a lot. Given the regularity I have seen described on slides, I expect it would need something like 30 link directories and 20 bridge directories. On today's systems (8 GPUs per socket with a GPU link between each GPU, plus PCIe, all this with 4 sockets) it comes down to 20 link directories.

In any case, each device/CPU/target has a limit on the number of buses/interconnects it is connected to. I doubt anyone is designing a device that will have much more than 4 external bus connections. So it is not a link per pair; it is a link for a group of devices/CPUs/targets.

Is it any clearer?

Cheers,
Jérôme
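[Editor's note] The counting argument above can be made concrete with a toy model (purely illustrative; this is not the actual sysfs layout): a link directory stands for one interconnect plus the set of endpoints sharing it, so the directory count tracks the number of buses, not the number of endpoint pairs.

```python
# Hypothetical model: each link directory is the set of endpoints that
# share one interconnect (with common properties). A 4-socket mesh with
# uniform properties collapses to a single directory; a ring needs one
# directory per point-to-point bus.
def count_link_dirs(topology):
    # One directory per interconnect, regardless of how many endpoints
    # (symlinks) it contains.
    return len(topology)

mesh = [{"socket0", "socket1", "socket2", "socket3"}]
ring = [{"socket0", "socket1"}, {"socket1", "socket2"},
        {"socket2", "socket3"}, {"socket3", "socket0"}]

# Worst case from the mail: each of N endpoints on at most 4 buses
# gives O(4 * N) link directories, never O(N^2) pairs.
print(count_link_dirs(mesh), count_link_dirs(ring))  # 1 4
```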
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 12/6/18 12:11 PM, Logan Gunthorpe wrote:
>> My concern with having folks do per-program parsing, *and* having a huge
>> amount of data to parse makes it unusable. The largest systems will
>> literally have hundreds of thousands of objects in /sysfs, even in a
>> single directory. That makes readdir() basically impossible, and makes
>> even open() (if you already know the path you want somehow) hard to do fast.
> Is this actually realistic? I find it hard to imagine an actual hardware
> bus that can have even thousands of devices under a single node, let
> alone hundreds of thousands.

Jerome's proposal, as I understand it, would have generic "links". They're not an instance of a bus, but characterize a class of "link". For instance, a "link" might characterize the characteristics of the QPI bus between two CPU sockets. The link directory would enumerate the list of all *instances* of that link.

So, a "link" directory for QPI would say Socket0<->Socket1, Socket1<->Socket2, Socket2<->Socket3, Socket2<->PCIe-1.2.3.4, etc... It would have to enumerate the connections between every entity that shared those link properties.

While there might not be millions of buses, there could be millions of *paths* across all those buses, and that's what the HMAT describes, at least: the net result of all those paths.
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Thu, Dec 06, 2018 at 03:27:06PM -0500, Jerome Glisse wrote:
> On Thu, Dec 06, 2018 at 11:31:21AM -0800, Dave Hansen wrote:
> > On 12/6/18 11:20 AM, Jerome Glisse wrote:
> > >>> For case 1 you can pre-parse stuff but this can be done by helper
> > >>> library
> > >> How would that work? Would each user/container/whatever do this once?
> > >> Where would they keep the pre-parsed stuff? How do they manage their
> > >> cache if the topology changes?
> > > Short answer i don't expect a cache, i expect that each program will have
> > > a init function that query the topology and update the application codes
> > > accordingly.
> >
> > My concern with having folks do per-program parsing, *and* having a huge
> > amount of data to parse makes it unusable. The largest systems will
> > literally have hundreds of thousands of objects in /sysfs, even in a
> > single directory. That makes readdir() basically impossible, and makes
> > even open() (if you already know the path you want somehow) hard to do
> > fast.
> >
> > I just don't think sysfs (or any filesystem, really) can scale to
> > express large, complicated topologies in a way that any normal program
> > can practically parse it.
> >
> > My suspicion is that we're going to need to have the kernel parse and
> > cache these things. We *might* have the data available in sysfs, but we
> > can't reasonably expect anyone to go parsing it.
>
> What i am failing to explain is that kernel can not parse because kernel
> does not know what the application cares about and every single
> applications will make different choices and thus select differents
> devices and memory.
>
> It is not even gonna a thing like class A of application will do X and
> class B will do Y. Every single application in class A might do something
> different because somes care about the little details.
>
> So any kind of pre-parsing in the kernel is defeated by the fact that the
> kernel does not know what the application is looking for.
>
> I do not see anyway to express the application logic in something that
> can be some kind of automaton or regular expression. The application can
> litteraly intro-inspect itself and the topology to partition its workload.
> The topology and device selection is expected to be thousands of line of
> code in the most advance application.
>
> Even worse inside one same application, they might be different device
> partition and memory selection for different function in the application.
>
> I am not scare about the anount of data to parse really, even on big node
> it is gonna be few dozens of links and bridges, and few dozens of devices.
> So we are talking hundred directories to parse and read.
>
> Maybe an example will help. Let say we have an application with the
> following pipeline:
>
> inA -> functionA -> outA -> functionB -> outB -> functionC -> result
>
> - inA 8 gigabytes
> - outA 8 gigabytes
> - outB one dword
> - result something small
> - functionA is doing heavy computation on inA (several thousands of
>   instructions for each dword in inA).
> - functionB is doing heavy computation for each dword in outA (again
>   thousand of instruction for each dword) and it is looking for a
>   specific result that it knows will be unique among all the dword
>   computation ie it is output only one dword in outB
> - functionC is something well suited for CPU that take outB and turns
>   it into the final result
>
> Now let see few different system and their topologies:
> [T2] 1 GPU with 16GB of memory and a handfull of CPU cores
> [T1] 1 GPU with 8GB of memory and a handfull of CPU cores
> [T3] 2 GPU with 8GB of memory and a handfull of CPU core
> [T4] 2 GPU with 8GB of memory and a handfull of CPU core
>      the 2 GPU have a very fast link between each others
>      (400GBytes/s)
>
> Now let see how the program will partition itself for each topology:
> [T1] Application partition its computation in 3 phases:
>      P1: - migrate inA to GPU memory
>      P2: - execute functionA on inA producing outA
>      P3: - execute functionB on outA producing outB
>          - run functionC and see if functionB have found the
>            thing and written it to outB if so then kill all
>            GPU threads and return the result we are done
>
> [T2] Application partition its computation in 5 phases:
>      P1: - migrate first 4GB of inA to GPU memory
>      P2: - execute functionA for the 4GB and write the 4GB
>            outA result to the GPU memory
>      P3: - execute functionB for the first 4GB of outA
>          - while functionB is running DMA in the background
>            the the second 4GB of inA to the GPU memory
>          - once one of the millions of thread running functionB
>            find the result it is looking for it writes it to
>
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Thu, Dec 06, 2018 at 11:31:21AM -0800, Dave Hansen wrote: > On 12/6/18 11:20 AM, Jerome Glisse wrote: > >>> For case 1 you can pre-parse stuff but this can be done by helper library > >> How would that work? Would each user/container/whatever do this once? > >> Where would they keep the pre-parsed stuff? How do they manage their > >> cache if the topology changes? > > Short answer i don't expect a cache, i expect that each program will have > > a init function that query the topology and update the application codes > > accordingly. > > My concern with having folks do per-program parsing, *and* having a huge > amount of data to parse makes it unusable. The largest systems will > literally have hundreds of thousands of objects in /sysfs, even in a > single directory. That makes readdir() basically impossible, and makes > even open() (if you already know the path you want somehow) hard to do fast. > > I just don't think sysfs (or any filesystem, really) can scale to > express large, complicated topologies in a way that any normal program > can practically parse it. > > My suspicion is that we're going to need to have the kernel parse and > cache these things. We *might* have the data available in sysfs, but we > can't reasonably expect anyone to go parsing it. What i am failing to explain is that kernel can not parse because kernel does not know what the application cares about and every single applications will make different choices and thus select differents devices and memory. It is not even gonna a thing like class A of application will do X and class B will do Y. Every single application in class A might do something different because somes care about the little details. So any kind of pre-parsing in the kernel is defeated by the fact that the kernel does not know what the application is looking for. I do not see anyway to express the application logic in something that can be some kind of automaton or regular expression. 
The application can litteraly intro-inspect itself and the topology to partition its workload. The topology and device selection is expected to be thousands of line of code in the most advance application. Even worse inside one same application, they might be different device partition and memory selection for different function in the application. I am not scare about the anount of data to parse really, even on big node it is gonna be few dozens of links and bridges, and few dozens of devices. So we are talking hundred directories to parse and read. Maybe an example will help. Let say we have an application with the following pipeline: inA -> functionA -> outA -> functionB -> outB -> functionC -> result - inA 8 gigabytes - outA 8 gigabytes - outB one dword - result something small - functionA is doing heavy computation on inA (several thousands of instructions for each dword in inA). - functionB is doing heavy computation for each dword in outA (again thousand of instruction for each dword) and it is looking for a specific result that it knows will be unique among all the dword computation ie it is output only one dword in outB - functionC is something well suited for CPU that take outB and turns it into the final result Now let see few different system and their topologies: [T2] 1 GPU with 16GB of memory and a handfull of CPU cores [T1] 1 GPU with 8GB of memory and a handfull of CPU cores [T3] 2 GPU with 8GB of memory and a handfull of CPU core [T4] 2 GPU with 8GB of memory and a handfull of CPU core the 2 GPU have a very fast link between each others (400GBytes/s) Now let see how the program will partition itself for each topology: [T1] Application partition its computation in 3 phases: P1: - migrate inA to GPU memory P2: - execute functionA on inA producing outA P3 - execute functionB on outA producing outB - run functionC and see if functionB have found the thing and written it to outB if so then kill all GPU threads and return the result we are done [T2] 
The application partitions its computation in 5 phases:
    P1:
      - migrate the first 4GB of inA to GPU memory
    P2:
      - execute functionA for those 4GB and write the 4GB outA result
        to the GPU memory
    P3:
      - execute functionB for the first 4GB of outA
      - while functionB is running, DMA the second 4GB of inA to the
        GPU memory in the background
      - once one of the millions of threads running functionB finds the
        result it is looking for, it writes it to outB, which is in
        main memory
      - run functionC and see if functionB has found the thing and
        written it to outB; if so, kill all GPU threads and DMA and
        return the result, we are done
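The decision behind the two plans above (whole dataset at once versus 4GB halves) can be sketched as a tiny sizing calculation. This is a hypothetical illustration, not a real API: pick a power-of-two chunk count so that one input chunk plus its output chunk fit in GPU memory together.

```c
#include <stddef.h>

#define GIB (1024ULL * 1024 * 1024)

/* Split input_bytes into chunks so that one input chunk and its
 * matching output chunk fit in GPU memory at the same time.
 * Illustrative sketch only. */
static size_t plan_chunks(unsigned long long input_bytes,
                          unsigned long long output_bytes,
                          unsigned long long gpu_mem_bytes)
{
    size_t chunks = 1;

    /* Double the chunk count until one (input, output) chunk pair fits. */
    while (input_bytes / chunks + output_bytes / chunks > gpu_mem_bytes)
        chunks *= 2;
    return chunks;
}
```

With a 16GB GPU, the 8GB inA and 8GB outA fit at once (one chunk, the 3-phase plan); with an 8GB GPU they get split into two 4GB halves (the 5-phase plan).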
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 2018-12-06 12:31 p.m., Dave Hansen wrote:
> On 12/6/18 11:20 AM, Jerome Glisse wrote:
> >>> For case 1 you can pre-parse stuff but this can be done by helper library
> >> How would that work? Would each user/container/whatever do this once?
> >> Where would they keep the pre-parsed stuff? How do they manage their
> >> cache if the topology changes?
> > Short answer i don't expect a cache, i expect that each program will have
> > a init function that query the topology and update the application codes
> > accordingly.
>
> My concern with having folks do per-program parsing, *and* having a huge
> amount of data to parse makes it unusable. The largest systems will
> literally have hundreds of thousands of objects in /sysfs, even in a
> single directory. That makes readdir() basically impossible, and makes
> even open() (if you already know the path you want somehow) hard to do fast.

Is this actually realistic? I find it hard to imagine an actual hardware bus that can have even thousands of devices under a single node, let alone hundreds of thousands. At some point the laws of physics apply. For example, in present hardware, the most ports a single PCI switch can have is under one hundred.

I'd imagine any such large system would have a hierarchy of devices (ie. layers of switch-like devices), which implies the existing sysfs bus/devices tree should have a path through it without navigating a directory with that unreasonable a number of objects in it. HMS, on the other hand, has all possible initiators (etc.) under a single directory.

The caveat to this is that, to find an initial starting point in the bus hierarchy, you might have to go through /sys/dev/{block|char} or /sys/class, which may have directories with a large number of objects. Though such a system would necessarily have a similarly large number of objects in /dev, which means you will probably never get around the readdir/open bottleneck you mention... and, thus, this doesn't seem overly realistic to me.
Logan
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 12/6/18 11:20 AM, Jerome Glisse wrote:
>>> For case 1 you can pre-parse stuff but this can be done by helper library
>> How would that work? Would each user/container/whatever do this once?
>> Where would they keep the pre-parsed stuff? How do they manage their
>> cache if the topology changes?
> Short answer i don't expect a cache, i expect that each program will have
> a init function that query the topology and update the application codes
> accordingly.

My concern with having folks do per-program parsing, *and* having a huge amount of data to parse, is that it makes it unusable. The largest systems will literally have hundreds of thousands of objects in /sysfs, even in a single directory. That makes readdir() basically impossible, and makes even open() (if you already know the path you want somehow) hard to do fast.

I just don't think sysfs (or any filesystem, really) can scale to express large, complicated topologies in a way that any normal program can practically parse.

My suspicion is that we're going to need to have the kernel parse and cache these things. We *might* have the data available in sysfs, but we can't reasonably expect anyone to go parsing it.
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Thu, Dec 06, 2018 at 10:25:08AM -0800, Dave Hansen wrote:
> On 12/5/18 9:53 AM, Jerome Glisse wrote:
> > No so there is 2 kinds of applications:
> > 1) average one: i am using device {1, 3, 9} give me best memory for
> >    those devices
> ...
> >
> > For case 1 you can pre-parse stuff but this can be done by helper library
>
> How would that work? Would each user/container/whatever do this once?
> Where would they keep the pre-parsed stuff? How do they manage their
> cache if the topology changes?

Short answer: I don't expect a cache. I expect that each program will have an init function that queries the topology and updates the application code accordingly. This is what people do today: query all available devices, decide which ones to use and how, create a context for each selected one, and define a memory migration job/memory policy for each part of the program, so that memory is migrated/has the proper policy in place when the code that runs on some device is executed.

Long answer: I cannot dictate how user folks write their programs, sadly :) I expect that many applications will do it once during startup. Then you will have all those container folks or VM folks that will get pressure to react to hotplug. For instance, if you upgrade your instance with your cloud provider to have more GPUs or more TPUs, it is likely to appear as a hotplug from the VM/container point of view, and thus as a hotplug from the application point of view. So far the demonstrations I have seen handle that by relaunching the application... More on that through the live re-patching issues below.

Oh, and I expect applications will crash if you hot-unplug anything they are using (this is what happens now, I believe, in most APIs). Again, I expect that some pressure from cloud users and providers will force programmers to be a bit more reactive to this kind of event. Live re-patching application code can be difficult, I am told.
Let's say you have:

    void compute_serious0_stuff(accelerator_t *accelerator,
                                void *inputA, size_t sinputA,
                                void *inputB, size_t sinputB,
                                void *outputA, size_t soutputA)
    {
        ...
        // Migrate the inputA to the accelerator memory
        api_migrate_memory_to_accelerator(accelerator, inputA, sinputA);
        // The inputB buffer is fine in its default placement
        // The output is assumed to be an empty vma ie no page allocated
        // yet, so set a policy to direct all allocations due to page
        // faults to use the accelerator memory
        api_set_memory_policy_to_accelerator(accelerator, outputA, soutputA);
        ...
        for_parallel (i = 0; i < THEYAREAMILLIONSITEMS; ++i) {
            // Do something serious
        }
        ...
    }

    void serious0_orchestrator(topology topology, void *inputA,
                               void *inputB, void *outputA)
    {
        static accelerator_t **selected = NULL;
        static serious0_job_partition *partition;
        ...
        if (selected == NULL) {
            serious0_select_and_partition(topology, , ,
                                          inputA, inputB, outputA)
        }
        ...
        for (i = 0; i < nselected; ++i) {
            ...
            compute_serious0_stuff(selected[i],
                                   inputA + partition[i].inputA_offset,
                                   partition[i].inputA_size,
                                   inputB + partition[i].inputB_offset,
                                   partition[i].inputB_size,
                                   outputA + partition[i].outputB_offset,
                                   partition[i].outputA_size);
            ...
        }
        ...
        for (i = 0; i < nselected; ++i) {
            accelerator_wait_finish(selected[i]);
        }
        ...
        // outputA is ready to be used by the next function in the program
    }

If you start without a GPU/TPU, your for_parallel will use the CPU, with the code the compiler emitted at build time. For GPU/TPU, at build time you compile your for_parallel loop to some intermediate representation (a virtual ISA); then at runtime, during application initialization, that intermediate representation gets lowered down to all the available GPU/TPUs on your system, and each for_parallel loop is patched to be turned into a call to:

    void dispatch_accelerator_function(accelerator_t *accelerator,
                                       void *function, ...)
    {
    }

So in the above example the for_parallel loop becomes:

    dispatch_accelerator_function(accelerator, i_compute_serious_stuff,
                                  inputA, inputB, outputA);

This hot patching of code is easy to do when no CPU thread is running the code. However, when CPU threads are running it can be problematic. I am sure you can do trickery like delaying the patching until the next time the function gets called, by doing clever things at build time like prepending each for_parallel section with enough nops to allow you to replace it with a call to the dispatch function and a jump over the normal CPU code. I think compiler people want to solve the static case.
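One way to sidestep the live-patching problem entirely is to route each for_parallel site through a function pointer chosen once at init. This is only a sketch of that alternative, with hypothetical names; it is not the mechanism described above, just a simpler variant that avoids rewriting code while CPU threads run.

```c
/* Each for_parallel site calls through a pointer that init code aims
 * either at the build-time CPU fallback or at an accelerator
 * dispatcher.  All names here are hypothetical. */
typedef void (*parallel_kernel_t)(int *data, int n);

/* The CPU version of the for_parallel body the compiler emitted. */
static void cpu_kernel(int *data, int n)
{
    for (int i = 0; i < n; i++)
        data[i] *= 2;   /* stand-in for "do something serious" */
}

/* Selected once during application initialization, so nothing has to
 * be patched while CPU threads are executing the function. */
static parallel_kernel_t dispatch_kernel = cpu_kernel;

static void init_dispatch(int have_accelerator, parallel_kernel_t accel)
{
    dispatch_kernel = (have_accelerator && accel) ? accel : cpu_kernel;
}
```

The cost is one indirect call per site, which is why compilers prefer the static patching approach when they can get it.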
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 12/5/18 9:53 AM, Jerome Glisse wrote:
> No so there is 2 kinds of applications:
> 1) average one: i am using device {1, 3, 9} give me best memory for
>    those devices
...
> For case 1 you can pre-parse stuff but this can be done by helper library

How would that work? Would each user/container/whatever do this once? Where would they keep the pre-parsed stuff? How do they manage their cache if the topology changes?
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Wed, Dec 05, 2018 at 09:27:09AM -0800, Dave Hansen wrote:
> On 12/4/18 6:13 PM, Jerome Glisse wrote:
> > On Tue, Dec 04, 2018 at 05:06:49PM -0800, Dave Hansen wrote:
> >> OK, but there are 1024*1024 matrix cells on a systems with 1024
> >> proximity domains (ACPI term for NUMA node). So it sounds like you are
> >> proposing a million-directory approach.
> >
> > No, pseudo code:
> >     struct list links;
> >
> >     for (unsigned r = 0; r < nrows; r++) {
> >         for (unsigned c = 0; c < ncolumns; c++) {
> >             if (!link_find(links, hmat[r][c].bandwidth,
> >                            hmat[r][c].latency)) {
> >                 link = link_new(hmat[r][c].bandwidth,
> >                                 hmat[r][c].latency);
> >                 // add initiator and target corresponding to that row
> >                 // and column to this new link
> >                 list_add(&link, links);
> >             }
> >         }
> >     }
> >
> > So all cells that have the same properties are under the same link.
>
> OK, so the "link" here is like a cable. It's like saying, "we have a
> network and everything is connected with an ethernet cable that can do
> 1gbit/sec".
>
> But, what actually connects an initiator to a target? I assume we still
> need to know which link is used for each target/initiator pair. Where
> is that enumerated?

    ls /sys/bus/hms/devices/v0-0-link/
    node0            power            subsystem        uevent
    uid              bandwidth        latency
    v0-1-target      v0-21-target
    v0-2-initiator   v0-3-initiator   v0-4-initiator   v0-5-initiator
    v0-6-initiator   v0-7-initiator   v0-8-initiator   v0-9-initiator
    v0-10-initiator  v0-11-initiator  v0-12-initiator  v0-13-initiator
    v0-14-initiator  v0-15-initiator  v0-16-initiator  v0-17-initiator

So above are 16 CPUs (initiators) and 2 targets, all connected through a common link. This means that all the initiators connected to this link can access all the targets connected to this link. The bandwidth and latency are the best-case scenario, for instance when only one initiator is accessing the target. An initiator can only access targets it shares a link with, or reach them through an extended path via a bridge.
So if you have an initiator connected to link0 and a target connected to link1, and there is a bridge from link0 to link1, then the initiator can access the target memory in link1, but the bandwidth and latency will be:

    min(link0.bandwidth, link1.bandwidth, bridge.bandwidth)
    min(link0.latency, link1.latency, bridge.latency)

You can really match a link one-to-one with a bus in your system. For instance with PCIE, if you only have 16-lane PCIE devices, you only define one link directory for all your PCIE devices (ignoring the PCIE peer-to-peer scenario here). You add a bridge between your PCIE link and your NUMA node link (the node to which this PCIE root complex belongs); this means that PCIE devices can access the local node memory with a given bandwidth and latency (best case).

> I think this just means we need a million symlinks to a "link" instead
> of a million link directories. Still not great.
>
> > Note that userspace can parse all this once during its initialization
> > and create pools of target to use.
>
> It sounds like you're agreeing that there is too much data in this
> interface for applications to _regularly_ parse it. We need some
> central thing that parses it all and caches the results.

No. There are 2 kinds of applications:
  1) average ones: "I am using devices {1, 3, 9}, give me the best memory
     for those devices"
  2) advanced ones: "what is the topology of this system?" Parse the
     topology and partition the workload accordingly.

For case 1 you can pre-parse stuff, but this can be done by a helper library. For case 2 there is no amount of pre-parsing you can do in the kernel; only the application knows its own architecture, and thus only the application knows what matters in the topology. Is the application looking for a big chunk of memory even if it is slow? Is it also looking for fast memory close to X and Y? ... Each application will care about different things, and there is no telling what those are going to be.
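The extended-path rule above can be written out as a small helper. This is an illustrative sketch only (the structure and function names are invented, not part of HMS); it applies the quoted min() rule uniformly over every hop of a path such as link0 -> bridge -> link1.

```c
/* Best-case properties of a multi-hop initiator-to-target path,
 * following the min() rule described above.  Illustrative sketch. */
struct hop_props {
    unsigned bandwidth;  /* e.g. MBytes/s */
    unsigned latency;    /* e.g. ns */
};

static struct hop_props path_props(const struct hop_props *hops, int nhops)
{
    struct hop_props p = hops[0];

    for (int i = 1; i < nhops; i++) {
        if (hops[i].bandwidth < p.bandwidth)
            p.bandwidth = hops[i].bandwidth;
        if (hops[i].latency < p.latency)
            p.latency = hops[i].latency;
    }
    return p;
}
```

Userspace doing its one-time topology parse would compute this per candidate path and keep only the pairs it considers usable.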
So what I am saying is that this information is likely to be parsed once by the application during startup, i.e. the sysfs is not something that is continuously read and parsed by the application (unless the application also cares about hotplug, and then we are talking about the 1% of the 1%).

Cheers,
Jérôme
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 12/4/18 6:13 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 05:06:49PM -0800, Dave Hansen wrote:
>> OK, but there are 1024*1024 matrix cells on a systems with 1024
>> proximity domains (ACPI term for NUMA node). So it sounds like you are
>> proposing a million-directory approach.
>
> No, pseudo code:
>     struct list links;
>
>     for (unsigned r = 0; r < nrows; r++) {
>         for (unsigned c = 0; c < ncolumns; c++) {
>             if (!link_find(links, hmat[r][c].bandwidth,
>                            hmat[r][c].latency)) {
>                 link = link_new(hmat[r][c].bandwidth,
>                                 hmat[r][c].latency);
>                 // add initiator and target corresponding to that row
>                 // and column to this new link
>                 list_add(&link, links);
>             }
>         }
>     }
>
> So all cells that have same property are under the same link.

OK, so the "link" here is like a cable. It's like saying, "we have a network and everything is connected with an ethernet cable that can do 1gbit/sec".

But, what actually connects an initiator to a target? I assume we still need to know which link is used for each target/initiator pair. Where is that enumerated?

I think this just means we need a million symlinks to a "link" instead of a million link directories. Still not great.

> Note that userspace can parse all this once during its initialization
> and create pools of target to use.

It sounds like you're agreeing that there is too much data in this interface for applications to _regularly_ parse it. We need some central thing that parses it all and caches the results.
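For reference, the deduplication loop quoted above can be made runnable in a few lines. This is a self-contained sketch, not the patch set's implementation: each distinct (bandwidth, latency) pair found in the HMAT-style matrix becomes one "link", so the number of link directories is bounded by the number of distinct property pairs rather than by the nrows * ncolumns cell count.

```c
#include <stddef.h>

/* Illustrative stand-in for one matrix cell / one link object. */
struct hms_link {
    unsigned bandwidth;
    unsigned latency;
};

/* Collapse the matrix cells into unique (bandwidth, latency) links.
 * Returns the number of distinct links found (capped at max_links). */
static size_t dedup_links(const struct hms_link *cells, size_t ncells,
                          struct hms_link *links, size_t max_links)
{
    size_t nlinks = 0;

    for (size_t i = 0; i < ncells; i++) {
        size_t j;

        for (j = 0; j < nlinks; j++)
            if (links[j].bandwidth == cells[i].bandwidth &&
                links[j].latency == cells[i].latency)
                break;          /* this property pair already has a link */
        if (j == nlinks && nlinks < max_links)
            links[nlinks++] = cells[i];
    }
    return nlinks;
}
```

On a 1024x1024 matrix where most cells share a handful of property pairs, this yields a handful of link directories instead of a million.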
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Wed, Dec 05, 2018 at 04:57:17PM +0530, Aneesh Kumar K.V wrote:
> On 12/5/18 12:19 AM, Jerome Glisse wrote:
>
>> Above example is for migrate. Here is an example of how the topology
>> is used today:
>>
>> The application knows that the platform it is running on has 16 GPUs
>> split into 2 groups of 8 GPUs each. GPUs in each group can access
>> each other's memory through dedicated mesh links between each other.
>> Full speed, no traffic bottleneck.
>>
>> The application splits its GPU computation in 2 so that each
>> partition runs on a group of interconnected GPUs, allowing them to
>> share the dataset.
>>
>> With HMS:
>> The application can query the kernel to discover the topology of the
>> system it is running on and use it to partition and balance its
>> workload accordingly. The same application should now be able to run
>> on a new platform without having to be adapted to it.
>
> Will the kernel ever be involved in decision making here? Like the
> scheduler, will we ever want to control how these computation units
> get scheduled onto GPU groups or GPUs?

I don't think you will ever see fine-grained control in software because
it would go against what GPUs fundamentally are. GPUs have 1000s of cores
and usually 10 times more threads in flight than cores (it depends on the
number of registers used by the program, or the size of its thread local
storage). By having many more threads in flight the GPU always has some
threads that are not waiting on memory access and thus always has
something to schedule next on the cores. This scheduling is all done in
real time and I do not see that as a good fit for any kernel CPU code.

That being said, higher-level and more coarse directives can be given to
the GPU hardware scheduler, like giving priorities to groups of threads so
that they always get scheduled first if ready. There is a cgroup proposal
that goes in the direction of exposing high-level control over GPU
resources like that. I think that is a better venue to discuss such
topics.

>> This is kind of naive; I expect topology to be hard to use, but maybe
>> it is just me being pessimistic. In any case, today we have a chicken
>> and egg problem. We do not have a standard way to expose topology, so
>> programs that can leverage topology are only written for HPC, where
>> the platform is standard for a few years. If we had a standard way to
>> expose the topology then maybe we would see more programs using it.
>> At the very least we could convert existing users.
>
> I am wondering whether we should consider HMAT as a subset of the ideas
> mentioned in this thread and see whether we can first achieve HMAT
> representation with your patch series?

I do not want to block HMAT on that. What I am trying to do really does
not fit in the existing NUMA node; this is what I have been trying to
show, even if not everyone is convinced by it. Some bullet points on why:
    - the memory I care about is not accessible by everyone (a baked-in
      assumption of NUMA nodes)
    - the memory I care about might not be cache coherent (again a
      baked-in assumption of NUMA nodes)
    - topology matters, so that userspace knows which inter-connects are
      shared and which have dedicated links to memory
    - there can be multiple paths between one device and one target
      memory, and each path can have a different NUMA distance (or rather
      properties like bandwidth, latency, ...); again this does not fit
      the NUMA distance model
    - the memory is not managed by the core kernel, for reasons I have
      explained
    - ...

The HMAT proposal does not deal with such memory; it is much closer to
what the current model can describe.

Cheers,
Jérôme
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 12/5/18 12:19 AM, Jerome Glisse wrote:
> Above example is for migrate. Here is an example of how the topology
> is used today:
>
> The application knows that the platform it is running on has 16 GPUs
> split into 2 groups of 8 GPUs each. GPUs in each group can access each
> other's memory through dedicated mesh links between each other. Full
> speed, no traffic bottleneck.
>
> The application splits its GPU computation in 2 so that each partition
> runs on a group of interconnected GPUs, allowing them to share the
> dataset.
>
> With HMS:
> The application can query the kernel to discover the topology of the
> system it is running on and use it to partition and balance its
> workload accordingly. The same application should now be able to run
> on a new platform without having to be adapted to it.

Will the kernel ever be involved in decision making here? Like the
scheduler, will we ever want to control how these computation units get
scheduled onto GPU groups or GPUs?

> This is kind of naive; I expect topology to be hard to use, but maybe
> it is just me being pessimistic. In any case, today we have a chicken
> and egg problem. We do not have a standard way to expose topology, so
> programs that can leverage topology are only written for HPC, where
> the platform is standard for a few years. If we had a standard way to
> expose the topology then maybe we would see more programs using it. At
> the very least we could convert existing users.

I am wondering whether we should consider HMAT as a subset of the ideas
mentioned in this thread and see whether we can first achieve HMAT
representation with your patch series?

-aneesh
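[Editor's note] The application-side partitioning Jerome describes above can be sketched as follows. The `gpu_group` table stands in for a hypothetical topology query (HMS was never merged, so no such kernel API exists), and every name here is invented for illustration:

```c
#include <assert.h>

#define NGPU 16

/*
 * Stand-in for a hypothetical topology query: the interconnect group
 * each GPU belongs to (two meshes of 8 fully-connected GPUs, as in the
 * example above). With a kernel API like HMS this table would be
 * discovered at runtime instead of being hard-coded per platform.
 */
static const int gpu_group[NGPU] = { 0, 0, 0, 0, 0, 0, 0, 0,
                                     1, 1, 1, 1, 1, 1, 1, 1 };

/*
 * Collect the GPUs of one group so a workload partition can be pinned
 * to a set of GPUs that share dedicated mesh links (and thus can share
 * a dataset at full speed).
 */
static int gpus_in_group(int group, int out[NGPU])
{
    int n = 0;

    for (int i = 0; i < NGPU; i++)
        if (gpu_group[i] == group)
            out[n++] = i;
    return n;
}
```

The point of a standard interface is that only the table-building step is platform-specific; the partitioning logic stays the same across machines.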
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Tue, Dec 04, 2018 at 05:06:49PM -0800, Dave Hansen wrote:
> On 12/4/18 4:15 PM, Jerome Glisse wrote:
>> On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
>>> Basically, is sysfs the right place to even expose this much data?
>>
>> I definitely want to avoid the memoryX mistake, so I do not want to
>> see one link directory per device. Taking my simple laptop as an
>> example, with 4 CPUs, a wifi and 2 GPUs (one integrated, one
>> discrete):
>>
>>     link0: cpu0 cpu1 cpu2 cpu3
>>     link1: wifi (2 PCIe lanes)
>>     link2: gpu0 (unknown number of lanes, but I believe it has
>>            higher bandwidth to main memory)
>>     link3: gpu1 (16 PCIe lanes)
>>     link4: gpu1 and GPU memory
>>
>> So one link directory per number of PCIe lanes your devices have, so
>> that you can differentiate on bandwidth. The main memory is symlinked
>> inside all the link directories except link4. The GPU's discrete
>> memory is only in the link4 directory, as it is only accessible by
>> the GPU (we could add it under link3 too, with the non-cache-coherent
>> property attached to it).
>
> I'm actually really interested in how this proposal scales. It's quite
> easy to represent a laptop, but can this scale to the largest systems
> that we expect to encounter over the next 20 years that this ABI will
> live?
>
>> The issue then becomes how to convert down the overly verbose HMAT
>> information to populate some reasonable layout for HMS. For that I
>> would say: create a link directory for each different matrix cell. As
>> an example, let's say that each entry in the matrix has bandwidth and
>> latency; then we create a link directory for each combination of
>> bandwidth and latency. On a simple system that should boil down to a
>> handful of combinations, roughly speaking mirroring the example above
>> of one link directory per number of PCIe lanes, for instance.
>
> OK, but there are 1024*1024 matrix cells on a system with 1024
> proximity domains (ACPI term for NUMA node). So it sounds like you are
> proposing a million-directory approach.

No, pseudo code:

    struct list links;

    for (unsigned r = 0; r < nrows; r++) {
        for (unsigned c = 0; c < ncolumns; c++) {
            if (!link_find(links, hmat[r][c].bandwidth,
                           hmat[r][c].latency)) {
                link = link_new(hmat[r][c].bandwidth,
                                hmat[r][c].latency);
                // add initiator and target corresponding to that
                // row and column to this new link
                list_add(link, links);
            }
        }
    }

So all cells that have the same properties are under the same link. Do
you expect all the cells to always have different properties? On today's
platforms that should not be the case. I do expect we will keep seeing
many initiator/target pairs that share the same properties as other
pairs. But yes, if you have a system where no initiator/target pairs have
the same properties then you are in the worst case you are describing.
But hey, that is the hardware you have then :)

Note that userspace can parse all this once during its initialization and
create pools of targets to use.

> We also can't simply say that two CPUs with the same connection to two
> other CPUs (think a 4-socket QPI-connected system) share the same
> "link" because they share the same combination of bandwidth and
> latency. We need to know that *each* has its own, unique link and do
> not share link resources.

That is the purpose of the bridge object: to inter-connect links. To be
more exact, a link is like saying you have 2 arrows with the same
properties between every node listed in the link, while a bridge allows
you to define an arrow in just one direction.

Maybe I should define arrow and node instead of trying to match some of
the ACPI terminology. This might be easier for people to follow than
first having to understand the terminology.

The fear I have with HMAT culling is that HMAT does not have the
information needed to avoid such culling.

>> I don't think I have a system with an HMAT table; if you have an HMAT
>> table to provide, I could show the end result.
>
> It is new enough (ACPI 6.2) that no publicly-available hardware exists
> that implements one (that I know of). Keith Busch can probably extract
> one and send it to you or show you how we're faking them with QEMU.
>
>> Note I believe the ACPI HMAT matrix is a bad design for those
>> reasons, i.e. there is a lot of commonality in many of the matrix
>> entries, and many entries also do not make sense (i.e. an initiator
>> not being able to access all the targets). I feel that link/bridge is
>> much more compact and allows representing any directed graph with
>> multiple arrows from one node to the same other node.
>
> I don't disagree. But, folks are building systems with them and we need
> to either deal with it, or make its data manageable. You saw our
> approach: we cull the data and only expose the bare minimum in sysfs.

Yeah, and I intend to cull data too.
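[Editor's note] The "arrow and node" framing Jerome proposes is just a small directed multigraph: a link contributes an arrow in both directions between its nodes, a bridge contributes a single arrow. A toy sketch (all names invented here, assuming arrows carry the link properties):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A directed edge of the topology graph. A "link" contributes two
 * arrows (one in each direction) between every pair of nodes it names;
 * a "bridge" contributes a single arrow, which is how a one-way path
 * is expressed.
 */
struct arrow {
    int src, dst;               /* node ids (initiators or targets) */
    unsigned bandwidth, latency;
};

/* Is there a direct arrow from `src` to `dst`? */
static bool reachable(const struct arrow *g, int n, int src, int dst)
{
    for (int i = 0; i < n; i++)
        if (g[i].src == src && g[i].dst == dst)
            return true;
    return false;
}
```

With node 0 = CPU, 1 = main memory, 2 = GPU, a link between CPU and memory plus a bridge from GPU to memory makes memory reachable from the GPU but not the reverse, which is something a symmetric distance matrix cannot express.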
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 2018-12-04 4:57 p.m., Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 01:37:56PM -0800, Dave Hansen wrote:
>> Yeah, our NUMA mechanisms are for managing memory that the kernel
>> itself manages in the "normal" allocator and supports a full feature
>> set on. That has a bunch of implications, like that the memory is
>> cache coherent and accessible from everywhere.
>>
>> The HMAT patches only comprehend this "normal" memory, which is why
>> we're extending the existing /sys/devices/system/node infrastructure.
>>
>> This series has a much more aggressive goal, which is comprehending
>> the connections of every memory-target to every memory-initiator, no
>> matter who is managing the memory, who can access it, or what it can
>> be used for.
>>
>> Theoretically, HMS could be used for everything that we're doing with
>> /sys/devices/system/node, as long as it's tied back into the existing
>> NUMA infrastructure _somehow_.
>>
>> Right?
>
> Fully correct. Mind if I steal that perfect summary description next
> time I post? I am so bad at explaining things :)
>
> The intention is to allow programs to do everything they do with
> mbind() today, and tomorrow with the HMAT patchset, and on top of that
> to also be able to do what they do today through APIs like OpenCL,
> ROCm, CUDA ... So it is one kernel API to rule them all ;)

As for ROCm, I'm looking forward to using hbind in our own APIs. It will
save us some time and trouble not having to implement all the low-level
policy and tracking of virtual address ranges in our device driver.

Going forward, having a common API to manage the topology and memory
affinity would also enable sane ways of having accelerators and memory
devices from different vendors interact under the control of a
topology-aware application.

Disclaimer: I haven't had a chance to review the patches in detail yet.
Got caught up in the documentation and discussion ...

Regards,
  Felix

> Also, at first I intend to special-case vma page allocation when there
> is an HMS policy; long term I would like to merge the code paths
> inside the kernel. But I do not want to disrupt existing code paths
> today; I would rather grow to that organically, step by step. mbind()
> would still work unaffected in the end; just the plumbing would be
> slightly different.
>
> Cheers,
> Jérôme
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 12/4/18 4:15 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
>> Basically, is sysfs the right place to even expose this much data?
>
> I definitely want to avoid the memoryX mistake, so I do not want to
> see one link directory per device. Taking my simple laptop as an
> example, with 4 CPUs, a wifi and 2 GPUs (one integrated, one
> discrete):
>
>     link0: cpu0 cpu1 cpu2 cpu3
>     link1: wifi (2 PCIe lanes)
>     link2: gpu0 (unknown number of lanes, but I believe it has higher
>            bandwidth to main memory)
>     link3: gpu1 (16 PCIe lanes)
>     link4: gpu1 and GPU memory
>
> So one link directory per number of PCIe lanes your devices have, so
> that you can differentiate on bandwidth. The main memory is symlinked
> inside all the link directories except link4. The GPU's discrete
> memory is only in the link4 directory, as it is only accessible by the
> GPU (we could add it under link3 too, with the non-cache-coherent
> property attached to it).

I'm actually really interested in how this proposal scales. It's quite
easy to represent a laptop, but can this scale to the largest systems
that we expect to encounter over the next 20 years that this ABI will
live?

> The issue then becomes how to convert down the overly verbose HMAT
> information to populate some reasonable layout for HMS. For that I
> would say: create a link directory for each different matrix cell. As
> an example, let's say that each entry in the matrix has bandwidth and
> latency; then we create a link directory for each combination of
> bandwidth and latency. On a simple system that should boil down to a
> handful of combinations, roughly speaking mirroring the example above
> of one link directory per number of PCIe lanes, for instance.

OK, but there are 1024*1024 matrix cells on a system with 1024 proximity
domains (ACPI term for NUMA node). So it sounds like you are proposing a
million-directory approach.

We also can't simply say that two CPUs with the same connection to two
other CPUs (think a 4-socket QPI-connected system) share the same "link"
because they share the same combination of bandwidth and latency. We need
to know that *each* has its own, unique link and do not share link
resources.

> I don't think I have a system with an HMAT table; if you have an HMAT
> table to provide, I could show the end result.

It is new enough (ACPI 6.2) that no publicly-available hardware exists
that implements one (that I know of). Keith Busch can probably extract
one and send it to you or show you how we're faking them with QEMU.

> Note I believe the ACPI HMAT matrix is a bad design for those reasons,
> i.e. there is a lot of commonality in many of the matrix entries, and
> many entries also do not make sense (i.e. an initiator not being able
> to access all the targets). I feel that link/bridge is much more
> compact and allows representing any directed graph with multiple
> arrows from one node to the same other node.

I don't disagree. But, folks are building systems with them and we need
to either deal with it, or make its data manageable. You saw our
approach: we cull the data and only expose the bare minimum in sysfs.
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Tue, Dec 04, 2018 at 03:58:23PM -0800, Dave Hansen wrote:
> On 12/4/18 1:57 PM, Jerome Glisse wrote:
>> Fully correct. Mind if I steal that perfect summary description next
>> time I post? I am so bad at explaining things :)
>
> Go for it!
>
>> The intention is to allow programs to do everything they do with
>> mbind() today, and tomorrow with the HMAT patchset, and on top of
>> that to also be able to do what they do today through APIs like
>> OpenCL, ROCm, CUDA ... So it is one kernel API to rule them all ;)
>
> While I appreciate the exhaustive scope of such a project, I'm really
> worried that if we decided to use this for our "HMAT" use cases, we'll
> be bottlenecked behind this project while *it* goes through 25
> revisions over 4 or 5 years like HMM did.
>
> So, should we just "park" the enhancements to the existing NUMA
> interfaces and infrastructure (think /sys/devices/system/node) and
> wait for this to go in? Do we try to develop them in parallel and make
> them consistent? Or, do we just ignore each other and make Andrew sort
> it out in a few years? :)

Let's have a battle with giant foam q-tips at the next LSF/MM and see who
wins ;)

More seriously, I think you should go ahead with Keith's HMAT patchset
and make progress there. In the HMAT case you can grow and evolve the
NUMA node infrastructure to address your needs, and I believe you are
doing it in a sensible way. But I do not see a path for what I am trying
to achieve in that framework. If anyone has a good idea I would welcome
it.

In the meantime I hope I can make progress with my proposal here under
staging. Once I get enough stuff working in userspace and convince guinea
pigs (I need to find a better name for those poor people I will coerce
into testing this ;)), then I can have some hard evidence of which things
in my proposal are useful in concrete cases, with an open source stack
from top to bottom. It might mean stripping down what I am proposing
today to what turns out to be useful.

Then we can start a discussion about merging the underlying kernel code
into one (while preserving all existing APIs) and getting out of staging
with real syscalls we will have to die with. I know that at the very
least the hbind() and hpolicy() syscalls would be successful, as the HPC
folks have been dreaming of this. The topology thing is harder to know;
there are some users today, but I cannot say how much more interest it
can spark outside of this very small community that is HPC.

Cheers,
Jérôme
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
> On 12/3/18 3:34 PM, jgli...@redhat.com wrote:
> > This patchset use the above scheme to expose system topology through
> > sysfs under /sys/bus/hms/ with:
> > - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
> >   each has a UID and you can usual value in that folder (node id,
> >   size, ...)
> >
> > - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
> >   (CPU or device), each has a HMS UID but also a CPU id for CPU
> >   (which match CPU id in (/sys/bus/cpu/). For device you have a
> >   path that can be PCIE BUS ID for instance)
> >
> > - /sys/bus/hms/devices/v%version-%id-link : an link, each has a
> >   UID and a file per property (bandwidth, latency, ...) you also
> >   find a symlink to every target and initiator connected to that
> >   link.
> >
> > - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
> >   a UID and a file per property (bandwidth, latency, ...) you
> >   also find a symlink to all initiators that can use that bridge.
>
> We support 1024 NUMA nodes on x86. The ACPI HMAT expresses the
> connections between each node. Let's suppose that each node has some
> CPUs and some memory.
>
> That means we'll have 1024 target directories in sysfs, 1024 initiator
> directories in sysfs, and 1024*1024 link directories. Or, would the
> kernel be responsible for "compiling" the firmware-provided information
> down into a more manageable number of links?
>
> Some idiot made the mistake of having one sysfs directory per 128MB of
> memory way back when, and now we have hundreds of thousands of
> /sys/devices/system/memory/memoryX directories. That sucks to manage.
> Isn't this potentially repeating that mistake?
>
> Basically, is sysfs the right place to even expose this much data?

I definitely want to avoid the memoryX mistake, so I do not want to see one link directory per device.

Taking my simple laptop as an example, with 4 CPUs, a wifi card and 2 GPUs (one integrated and one discrete):

link0: cpu0 cpu1 cpu2 cpu3
link1: wifi (2 PCIe lanes)
link2: gpu0 (unknown number of lanes, but I believe it has higher bandwidth to main memory)
link3: gpu1 (16 PCIe lanes)
link4: gpu1 and GPU memory

So there is one link directory per number of PCIe lanes your devices have, so that you can differentiate on bandwidth. The main memory is symlinked inside every link directory except link4. The discrete GPU memory is only in the link4 directory, as it is only accessible by the GPU (we could add it under link3 too, with a non-cache-coherent property attached to it).

The issue then becomes how to boil the overly verbose HMAT information down to some reasonable layout for HMS. For that I would create a link directory for each distinct matrix cell. As an example, say each entry in the matrix has a bandwidth and a latency: then we create a link directory for each combination of bandwidth and latency. On a simple system that should boil down to a handful of combinations, roughly mirroring the example above of one link directory per number of PCIe lanes. I don't think I have a system with an HMAT table; if you have an HMAT table to provide, I could show the end result.

Note that I believe the ACPI HMAT matrix is a bad design for those reasons, i.e. there is a lot of commonality between many of the matrix entries, and many entries also do not make sense (i.e. an initiator not being able to access all the targets). I feel that link/bridge is much more compact and allows representing any directed graph, with multiple arrows from one node to another.

Cheers,
Jérôme
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 12/4/18 1:57 PM, Jerome Glisse wrote: > Fully correct mind if i steal that perfect summary description next time > i post ? I am so bad at explaining thing :) Go for it! > Intention is to allow program to do everything they do with mbind() today > and tomorrow with the HMAT patchset and on top of that to also be able to > do what they do today through API like OpenCL, ROCm, CUDA ... So it is one > kernel API to rule them all ;) While I appreciate the exhaustive scope of such a project, I'm really worried that if we decided to use this for our "HMAT" use cases, we'll be bottlenecked behind this project while *it* goes through 25 revisions over 4 or 5 years like HMM did. So, should we just "park" the enhancements to the existing NUMA interfaces and infrastructure (think /sys/devices/system/node) and wait for this to go in? Do we try to develop them in parallel and make them consistent? Or, do we just ignore each other and make Andrew sort it out in a few years? :)
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 12/3/18 3:34 PM, jgli...@redhat.com wrote:
> This patchset use the above scheme to expose system topology through
> sysfs under /sys/bus/hms/ with:
> - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
>   each has a UID and you can usual value in that folder (node id,
>   size, ...)
>
> - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
>   (CPU or device), each has a HMS UID but also a CPU id for CPU
>   (which match CPU id in (/sys/bus/cpu/). For device you have a
>   path that can be PCIE BUS ID for instance)
>
> - /sys/bus/hms/devices/v%version-%id-link : an link, each has a
>   UID and a file per property (bandwidth, latency, ...) you also
>   find a symlink to every target and initiator connected to that
>   link.
>
> - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
>   a UID and a file per property (bandwidth, latency, ...) you
>   also find a symlink to all initiators that can use that bridge.

We support 1024 NUMA nodes on x86. The ACPI HMAT expresses the connections between each node. Let's suppose that each node has some CPUs and some memory.

That means we'll have 1024 target directories in sysfs, 1024 initiator directories in sysfs, and 1024*1024 link directories. Or, would the kernel be responsible for "compiling" the firmware-provided information down into a more manageable number of links?

Some idiot made the mistake of having one sysfs directory per 128MB of memory way back when, and now we have hundreds of thousands of /sys/devices/system/memory/memoryX directories. That sucks to manage. Isn't this potentially repeating that mistake?

Basically, is sysfs the right place to even expose this much data?
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Tue, Dec 04, 2018 at 01:37:56PM -0800, Dave Hansen wrote:
> On 12/4/18 10:49 AM, Jerome Glisse wrote:
> >> Also, could you add a simple, example program for how someone might use
> >> this? I got lost in all the new sysfs and ioctl gunk. Can you
> >> characterize how this would work with the *exiting* NUMA interfaces that
> >> we have?
> > That is the issue i can not expose device memory as NUMA node as
> > device memory is not cache coherent on AMD and Intel platform today.
> >
> > More over in some case that memory is not visible at all by the CPU
> > which is not something you can express in the current NUMA node.
>
> Yeah, our NUMA mechanisms are for managing memory that the kernel itself
> manages in the "normal" allocator and supports a full feature set on.
> That has a bunch of implications, like that the memory is cache coherent
> and accessible from everywhere.
>
> The HMAT patches only comprehend this "normal" memory, which is why
> we're extending the existing /sys/devices/system/node infrastructure.
>
> This series has a much more aggressive goal, which is comprehending the
> connections of every memory-target to every memory-initiator, no matter
> who is managing the memory, who can access it, or what it can be used for.
>
> Theoretically, HMS could be used for everything that we're doing with
> /sys/devices/system/node, as long as it's tied back into the existing
> NUMA infrastructure _somehow_.
>
> Right?

Fully correct. Mind if I steal that perfect summary description next time I post? I am so bad at explaining things :)

The intention is to allow programs to do everything they do with mbind() today (and tomorrow with the HMAT patchset), and on top of that also be able to do what they do today through APIs like OpenCL, ROCm, CUDA... So it is one kernel API to rule them all ;)

Also, at first I intend to special case vma page allocation when there is an HMS policy; long term I would like to merge the code paths inside the kernel. But I do not want to disrupt the existing code paths today; I would rather grow to that organically, step by step. mbind() would still work unaffected; in the end, only the plumbing would be slightly different.

Cheers,
Jérôme
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 12/4/18 10:49 AM, Jerome Glisse wrote:
>> Also, could you add a simple, example program for how someone might use
>> this? I got lost in all the new sysfs and ioctl gunk. Can you
>> characterize how this would work with the *exiting* NUMA interfaces that
>> we have?
> That is the issue i can not expose device memory as NUMA node as
> device memory is not cache coherent on AMD and Intel platform today.
>
> More over in some case that memory is not visible at all by the CPU
> which is not something you can express in the current NUMA node.

Yeah, our NUMA mechanisms are for managing memory that the kernel itself manages in the "normal" allocator and supports a full feature set on. That has a bunch of implications, like that the memory is cache coherent and accessible from everywhere.

The HMAT patches only comprehend this "normal" memory, which is why we're extending the existing /sys/devices/system/node infrastructure.

This series has a much more aggressive goal, which is comprehending the connections of every memory-target to every memory-initiator, no matter who is managing the memory, who can access it, or what it can be used for.

Theoretically, HMS could be used for everything that we're doing with /sys/devices/system/node, as long as it's tied back into the existing NUMA infrastructure _somehow_.

Right?
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Tue, Dec 04, 2018 at 10:54:10AM -0800, Dave Hansen wrote:
> On 12/4/18 10:49 AM, Jerome Glisse wrote:
> > Policy is same kind of story, this email is long enough now :) But
> > i can write one down if you want.
>
> Yes, please. I'd love to see the code.
>
> We'll do the same on the "HMAT" side and we can compare notes.

Example use cases? Example uses are:

An application creates a range of virtual addresses with mmap() for an input dataset. The application knows it will use the GPU on it directly, so it calls hbind() to set a policy on the range so that any new allocation for the range uses GPU memory. The application then streams the dataset directly into GPU memory through the virtual address range, thanks to the policy.

An application creates a range of virtual addresses with mmap() to store the output of GPU jobs it is about to launch. It binds the range of virtual addresses to GPU memory so that allocations for the range use GPU memory.

An application can also use policy binding as a slow migration path, i.e. set a policy to a new target memory so that new allocations are directed to this new target.

Or do you want an example userspace program like the one in the last patch of this series?

Cheers,
Jérôme
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 12/4/18 10:49 AM, Jerome Glisse wrote:
> Policy is same kind of story, this email is long enough now :) But
> i can write one down if you want.

Yes, please. I'd love to see the code.

We'll do the same on the "HMAT" side and we can compare notes.
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Tue, Dec 04, 2018 at 10:02:55AM -0800, Dave Hansen wrote:
> On 12/3/18 3:34 PM, jgli...@redhat.com wrote:
> > This means that it is no longer sufficient to consider a flat view
> > for each node in a system but for maximum performance we need to
> > account for all of this new memory but also for system topology.
> > This is why this proposal is unlike the HMAT proposal [1] which
> > tries to extend the existing NUMA for new type of memory. Here we
> > are tackling a much more profound change that depart from NUMA.
>
> The HMAT and its implications exist, in firmware, whether or not we do
> *anything* in Linux to support it or not. Any system with an HMAT
> inherently reflects the new topology, via proximity domains, whether or
> not we parse the HMAT table in Linux or not.
>
> Basically, *ACPI* has decided to extend NUMA. Linux can either fight
> that or embrace it. Keith's HMAT patches are embracing it. These
> patches are appearing to fight it. Agree? Disagree?

Disagree. Sorry if it felt that way, that was not my intention. The ACPI HMAT information can be used to populate the HMS file system representation. My intention is not to fight Keith's HMAT patches; they are useful on their own. But I do not see how to evolve NUMA to support device memory, so while Keith is taking a step in the direction I want, I do not see how to cross to the place I need to be. More on that below.

> Also, could you add a simple, example program for how someone might use
> this? I got lost in all the new sysfs and ioctl gunk. Can you
> characterize how this would work with the *exiting* NUMA interfaces that
> we have?

That is the issue: I cannot expose device memory as a NUMA node, as device memory is not cache coherent on AMD and Intel platforms today. Moreover, in some cases that memory is not visible at all to the CPU, which is not something you can express in the current NUMA node.

Here is an abbreviated list of the features I need to support:
- device private memory (not accessible by the CPU or anybody else)
- non-coherent memory (PCIe is not cache coherent for CPU access)
- multiple paths to access the same memory, either:
  - multiple _different_ physical addresses aliasing the same memory
  - device blocks that can select which path they take to access some memory (it is not inside the page table but in how you program the device block)
- complex topology that is not a tree, where a device link can have better characteristics than the CPU interconnect between the nodes. There are existing users today who use topology information to partition their workloads (HPC folks who have a fixed platform).
- device memory needs to stay under device driver control, as some existing APIs (OpenGL, Vulkan) have a different memory model, and if we want the device to be usable for those too then we need to keep the device driver in control of device memory allocation

There is an example userspace program with the last patch in the series. But here is a high level overview of how one application looks today:
1) Application gets some dataset from some source (disk, network, sensors, ...)
2) Application allocates memory on device A and copies over the dataset
3) Application runs some CPU code to format the copy of the dataset inside device A memory (rebuilding pointers inside the dataset; this can represent millions and millions of operations)
4) Application runs code on device A that uses the dataset
5) Application allocates memory on device B and copies over the result from device A
6) Application runs some CPU code to format the copy of the dataset inside device B (rebuilding pointers inside the dataset; this can represent millions and millions of operations)
7) Application runs code on device B that uses the dataset
8) Application copies the result over from device B and keeps on doing its thing

How it looks with HMS:
1) Application gets some dataset from some source (disk, network, sensors, ...)
2-3) Application calls HMS to migrate to device A memory
4) Application runs code on device A that uses the dataset
5-6) Application calls HMS to migrate to device B memory
7) Application runs code on device B that uses the dataset
8) Application calls HMS to migrate the result to main memory

So we now avoid the explicit copies and having to rebuild data structures inside each device address space.

The example above is for migration. Here is an example of how the topology is used today: the application knows that the platform it is running on has 16 GPUs split into 2 groups of 8 GPUs each. GPUs in each group can access each other's memory over dedicated mesh links between them, at full speed with no traffic bottleneck. The application splits its GPU computation in 2 so that each partition runs on a group of interconnected GPUs, allowing them to share the dataset. With HMS: Application can query
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Tue, Dec 04, 2018 at 10:02:55AM -0800, Dave Hansen wrote: > On 12/3/18 3:34 PM, jgli...@redhat.com wrote: > > This means that it is no longer sufficient to consider a flat view > > for each node in a system but for maximum performance we need to > > account for all of this new memory but also for system topology. > > This is why this proposal is unlike the HMAT proposal [1] which > > tries to extend the existing NUMA for new type of memory. Here we > > are tackling a much more profound change that depart from NUMA. > > The HMAT and its implications exist, in firmware, whether or not we do > *anything* in Linux to support it or not. Any system with an HMAT > inherently reflects the new topology, via proximity domains, whether or > not we parse the HMAT table in Linux or not. > > Basically, *ACPI* has decided to extend NUMA. Linux can either fight > that or embrace it. Keith's HMAT patches are embracing it. These > patches are appearing to fight it. Agree? Disagree? Disagree, sorry if it felt that way that was not my intention. The ACPI HMAT information can be use to populate the HMS file system representation. My intention is not to fight Keith's HMAT patches they are useful on their own. But i do not see how to evolve NUMA to support device memory, so while Keith is taking a step into the direction i want, i do not see how to cross to the place i need to be. More on that below. > > Also, could you add a simple, example program for how someone might use > this? I got lost in all the new sysfs and ioctl gunk. Can you > characterize how this would work with the *exiting* NUMA interfaces that > we have? That is the issue i can not expose device memory as NUMA node as device memory is not cache coherent on AMD and Intel platform today. More over in some case that memory is not visible at all by the CPU which is not something you can express in the current NUMA node. 
Here is an abbreviated list of features I need to support:
- device private memory (not accessible by the CPU or anybody else)
- non-coherent memory (PCIe is not cache coherent for CPU access)
- multiple paths to access the same memory, either:
  - multiple _different_ physical address aliases to the same memory
  - device blocks that can select which path they take to access some
    memory (it is not inside the page table but in how you program the
    device block)
- complex topology that is not a tree, where a device link can have
  better characteristics than the CPU interconnect between the nodes.
  There are existing users today that use topology information to
  partition their workload (HPC folks who have a fixed platform).
- device memory needs to stay under device driver control, as some
  existing APIs (OpenGL, Vulkan) have a different memory model, and if
  we want the device to be usable for those too then we need to keep
  the device driver in control of device memory allocation

There is an example userspace program with the last patch in the series. But here is a high level overview of how one application looks today:
1) Application gets some dataset from some source (disk, network, sensors, ...)
2) Application allocates memory on device A and copies over the dataset
3) Application runs some CPU code to format the copy of the dataset inside device A memory (rebuild pointers inside the dataset; this can represent millions and millions of operations)
4) Application runs code on device A that uses the dataset
5) Application allocates memory on device B and copies over the result from device A
6) Application runs some CPU code to format the copy of the dataset inside device B (rebuild pointers inside the dataset; this can represent millions and millions of operations)
7) Application runs code on device B that uses the dataset
8) Application copies the result over from device B and keeps on doing its thing

How it looks with HMS:
1) Application gets some dataset from some source (disk, network, sensors, ...)
2-3) Application calls HMS to migrate to device A memory
4) Application runs code on device A that uses the dataset
5-6) Application calls HMS to migrate to device B memory
7) Application runs code on device B that uses the dataset
8) Application calls HMS to migrate the result to main memory

So we now avoid the explicit copies and having to rebuild data structures inside each device address space.

The above example is for migration. Here is an example of how the topology is used today: the application knows that the platform it is running on has 16 GPUs split into 2 groups of 8 GPUs each. GPUs in each group can access each other's memory through dedicated mesh links between them, at full speed with no traffic bottleneck. The application splits its GPU computation in 2 so that each partition runs on a group of interconnected GPUs, allowing them to share the dataset. With HMS: Application can query
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 12/3/18 3:34 PM, jgli...@redhat.com wrote:
> This means that it is no longer sufficient to consider a flat view
> for each node in a system but for maximum performance we need to
> account for all of this new memory but also for system topology.
> This is why this proposal is unlike the HMAT proposal [1] which
> tries to extend the existing NUMA for new type of memory. Here we
> are tackling a much more profound change that depart from NUMA.

The HMAT and its implications exist, in firmware, whether or not we do *anything* in Linux to support it or not. Any system with an HMAT inherently reflects the new topology, via proximity domains, whether or not we parse the HMAT table in Linux or not.

Basically, *ACPI* has decided to extend NUMA. Linux can either fight that or embrace it. Keith's HMAT patches are embracing it. These patches are appearing to fight it. Agree? Disagree?

Also, could you add a simple, example program for how someone might use this? I got lost in all the new sysfs and ioctl gunk. Can you characterize how this would work with the *existing* NUMA interfaces that we have?
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On Tue, Dec 04, 2018 at 01:14:14PM +0530, Aneesh Kumar K.V wrote:
> On 12/4/18 5:04 AM, jgli...@redhat.com wrote:
> > From: Jérôme Glisse
[...]
> > This patchset uses the above scheme to expose system topology through
> > sysfs under /sys/bus/hms/ with:
> > - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
> >   each has a UID and you can find the usual values in that folder
> >   (node id, size, ...)
> >
> > - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
> >   (CPU or device), each has a HMS UID but also a CPU id for CPUs
> >   (which matches the CPU id in /sys/bus/cpu/; for a device you have
> >   a path that can be the PCIe bus ID for instance)
> >
> > - /sys/bus/hms/devices/v%version-%id-link : a link, each has a
> >   UID and a file per property (bandwidth, latency, ...); you also
> >   find a symlink to every target and initiator connected to that
> >   link.
> >
> > - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
> >   a UID and a file per property (bandwidth, latency, ...); you
> >   also find a symlink to all initiators that can use that bridge.
>
> is that version tagging really needed? What changes do you envision with
> versions?

I kind of dislike it myself, but this is really to keep userspace from inadvertently using some kind of memory/initiator/link/bridge that it should not be using if it does not understand the implications. If it were a file inside the directory, there is a big chance that userspace would overlook it. So an old program on a new platform with a new kind of weird memory, like non-coherent memory, might start using it and get all kinds of weird results. If the version is in the directory name, it kind of forces userspace to only look at the memory/initiator/link/bridge entries it does understand and can use safely. So I am doing this in the hope that it will protect applications when new types of things pop up. We have too many examples where we can not evolve something because existing applications have baked-in assumptions about it.

[...]
> > 3) Tracking and applying heterogeneous memory policies
> > --
> >
> > The current memory policy infrastructure is node oriented; instead of
> > changing that and risking breakage and regression, this patchset adds
> > a new heterogeneous policy tracking infrastructure. The expectation
> > is that existing applications can keep using mbind() and all existing
> > infrastructure undisturbed and unaffected, while new applications
> > will use the new API and should avoid mixing and matching both (as
> > they can achieve the same thing with the new API).
> >
> > Also the policy is not directly tied to the vma structure for a few
> > reasons:
> > - avoid having to split vma for policy that does not cover the full vma
> > - avoid changing too much vma code
> > - avoid growing the vma structure with an extra pointer
> > So instead this patchset uses the mmu_notifier API to track vma
> > liveness (munmap(), mremap(), ...).
> >
> > This patchset is not tied to process memory allocation either (like
> > said at the beginning, this is not an end to end patchset but a
> > starting point). It does however demonstrate how migration to device
> > memory can work under this scheme (using nouveau as a demonstration
> > vehicle).
> >
> > The overall design is simple: on an hbind() call a hms policy
> > structure is created for the supplied range and hms uses the callback
> > associated with the target memory. This callback is provided by the
> > device driver for device memory or by core HMS for regular main
> > memory. The callback can decide to migrate the range to the target
> > memories or do nothing (this can be influenced by flags provided to
> > hbind() too).
> >
> > Later patches can tie page faults with HMS policy to direct memory
> > allocation to the right target. For now I would rather postpone that
> > discussion until a consensus is reached on how to move forward on all
> > the topics presented in this email.
Start small, grow big ;)

> I liked the simplicity of keeping it outside all the existing memory
> management policy code. But that is also the drawback, isn't it?
> We now have multiple entities tracking cpu and memory. (This reminded me of
> how we started with memcg in the early days).

This is a hard choice; the rationale is that an application uses either this new API or the old one. So the expectation is that both should not co-exist in a process. Eventually both can be consolidated into one inside the kernel while maintaining the different userspace APIs. But I feel that it is better to get to that point slowly while we experiment with the new API. I feel that we need to gain some experience with the new API on real workloads to convince ourselves that it is something we can live with. If we reach that point then we can work on consolidating the kernel code into one. In the meantime this experiment
Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
On 12/4/18 5:04 AM, jgli...@redhat.com wrote:

From: Jérôme Glisse

Heterogeneous memory systems are becoming more and more the norm; in those systems there is not only the main system memory for each node, but also device memory and/or a memory hierarchy to consider. Device memory can come from a device like a GPU, FPGA, ... or from a memory-only device (persistent memory, or a high density memory device). A memory hierarchy is when you not only have the main memory but also other types of memory like HBM (High Bandwidth Memory, often stacked up on the CPU die or GPU die), persistent memory or high density memory (ie something slower than regular DDR DIMMs but much bigger).

On top of this diversity of memories you also have to account for the system bus topology, ie how all CPUs and devices are connected to each other. Userspace does not care about the exact physical topology but cares about topology from a behavior point of view, ie what are all the paths between an initiator (anything that can initiate memory access like a CPU, GPU, FPGA, network controller, ...) and a target memory, and what are all the properties of each of those paths (bandwidth, latency, granularity, ...).

This means that it is no longer sufficient to consider a flat view for each node in a system; for maximum performance we need to account for all of this new memory but also for system topology. This is why this proposal is unlike the HMAT proposal [1] which tries to extend the existing NUMA for new types of memory. Here we are tackling a much more profound change that departs from NUMA.

One of the reasons for radical change is that the advance of accelerators like GPUs or FPGAs means that the CPU is no longer the only place where computation happens. It is becoming more and more common for an application to use a mix and match of different accelerators to perform its computation. So we can no longer satisfy ourselves with a CPU-centric and flat view of a system like NUMA and NUMA distance.
This patchset is a proposal to tackle these problems through three aspects:
1 - Expose complex system topology and various kinds of memory to user space so that applications have a standard way and a single place to get all the information they care about.
2 - A new API for user space to bind/provide hints to the kernel on which memory to use for a range of virtual addresses (a new mbind() syscall).
3 - Kernel side changes for vm policy to handle these changes

This patchset is not an end to end solution but it provides enough pieces to be useful against nouveau (the upstream open source driver for NVidia GPUs). It is intended as a starting point for discussion so that we can figure out what to do. To avoid having too many topics to discuss I am not considering memory cgroups for now but that is definitely something we will want to integrate with.

The rest of this email is split into 3 sections. The first section talks about complex system topology: what it is, how it is used today and how to describe it tomorrow. The second section talks about the new API to bind/provide hints to the kernel for a range of virtual addresses. The third section talks about the new mechanism to track bind/hint requests provided by user space or device drivers inside the kernel.

1) Complex system topology and representing them

Inside a node you can have a complex topology of memory; for instance you can have multiple HBM memories in a node, each HBM memory tied to a set of CPUs (all of which are in the same node). This means that you have a hierarchy of memory for CPUs: the local fast HBM, which is expected to be relatively small compared to main memory, and then the main memory. New memory technology might also deepen this hierarchy with another level of yet slower memory but gigantic in size (some persistent memory technology might fall into that category). Another example is device memory, and devices themselves can have a hierarchy like HBM on top of the device core and main device memory.
On top of that you can have multiple paths to access each memory, and each path can have different properties (latency, bandwidth, ...). Also there is not always symmetry, ie some memory might only be accessible by some devices or CPUs, ie not accessible by everyone. So a flat hierarchy for each node is not capable of representing this kind of complexity.

To simplify discussion, and because we do not want to single out CPUs from devices, from here on out we will use initiator to refer to either a CPU or a device. An initiator is any kind of CPU or device that can access memory (ie initiate memory access).

At this point an example of such a system might help:
- 2 nodes and for each node:
  - 1 CPU per node with 2 complexes of CPU cores per CPU
  - one HBM memory for each complex of CPU cores (200GB/s)
  - CPU core complexes are linked to each other (100GB/s)
  - main memory is (90GB/s)
  - 4 GPUs each with:
    - HBM memory for