Re: Interacting with coherent memory on external devices
On Thu, May 14, 2015 at 05:51:19PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2015-05-14 at 09:39 +0200, Vlastimil Babka wrote:
> > On 05/14/2015 01:38 AM, Benjamin Herrenschmidt wrote:
> > > On Wed, 2015-05-13 at 16:10 +0200, Vlastimil Babka wrote:
> > >> Sorry for reviving oldish thread...
> > >
> > > Well, that's actually appreciated since this is constructive discussion
> > > of the kind I was hoping to trigger initially :-) I'll look at
> >
> > I hoped so :)
> >
> > > ZONE_MOVABLE, I wasn't aware of its existence.
> > >
> > > Don't we still have the problem that ZONEs must be somewhat contiguous
> > > chunks? Ie, my "CAPI memory" will be interleaved in the physical
> > > address space somewhat.. This is due to the address space on some of
> > > those systems where you'll basically have something along the lines of:
> > >
> > > [ node 0 mem ] [ node 0 CAPI dev ] [ node 1 mem ] [ node 1 CAPI dev ] ...
> >
> > Oh, I see. The VM code should cope with that, but some operations would
> > be inefficiently looping over the holes in the CAPI zone by 2MB
> > pageblock per iteration. This would include compaction scanning, which
> > would suck if you need those large contiguous allocations as you said.
> > Interleaving works better if it's done with a smaller granularity.
> >
> > But I guess you could just represent the CAPI as multiple NUMA nodes,
> > each with a single ZONE_MOVABLE zone. Especially if "node 0 CAPI dev" and
> > "node 1 CAPI dev" differ in other characteristics than just using a
> > different range of PFNs... otherwise what's the point of this split anyway?
>
> Correct, I think we want the CAPI devs to look like CPU-less NUMA nodes
> anyway. This is the right way to target an allocation at one of them and
> it conveys the distance properly, so it makes sense.
>
> I'll add ZONE_MOVABLE to the list of things to investigate on our
> side, thanks for the pointer!
Any thoughts on CONFIG_MOVABLE_NODE and the corresponding "movable_node"
boot parameter? It looks like it is designed to make an entire NUMA node's
memory hotpluggable, which seems consistent with what we are trying to do
here. This feature is currently x86_64-only, so it would need to be enabled
on other architectures. It looks like it is intended to be used by booting
normally, but keeping the CAPI nodes' memory offline, setting movable_node,
then onlining their memory.

Thoughts?

							Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
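[Editorial note: the flow Paul outlines might look roughly like this on a running system, assuming the standard memory-hotplug sysfs interface; the memory block number (memory42) is made up for illustration.]

```shell
# 1. Boot with movable_node on the kernel command line so that
#    hot-added / late-onlined node memory can become ZONE_MOVABLE:
#      ... movable_node ...

# 2. With the CAPI nodes' memory left offline at boot, online each of
#    their memory blocks explicitly as movable:
echo online_movable > /sys/devices/system/memory/memory42/state

# 3. Sanity-check which zone the block is (or can be) placed in:
cat /sys/devices/system/memory/memory42/valid_zones
```

This is an administrative sketch only; whether the blocks can go movable depends on the architecture supporting the feature, as Paul notes.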
Re: Interacting with coherent memory on external devices
On Thu, 2015-05-14 at 09:39 +0200, Vlastimil Babka wrote:
> On 05/14/2015 01:38 AM, Benjamin Herrenschmidt wrote:
> > On Wed, 2015-05-13 at 16:10 +0200, Vlastimil Babka wrote:
> >> Sorry for reviving oldish thread...
> >
> > Well, that's actually appreciated since this is constructive discussion
> > of the kind I was hoping to trigger initially :-) I'll look at
>
> I hoped so :)
>
> > ZONE_MOVABLE, I wasn't aware of its existence.
> >
> > Don't we still have the problem that ZONEs must be somewhat contiguous
> > chunks? Ie, my "CAPI memory" will be interleaved in the physical
> > address space somewhat.. This is due to the address space on some of
> > those systems where you'll basically have something along the lines of:
> >
> > [ node 0 mem ] [ node 0 CAPI dev ] [ node 1 mem ] [ node 1 CAPI dev ] ...
>
> Oh, I see. The VM code should cope with that, but some operations would
> be inefficiently looping over the holes in the CAPI zone by 2MB
> pageblock per iteration. This would include compaction scanning, which
> would suck if you need those large contiguous allocations as you said.
> Interleaving works better if it's done with a smaller granularity.
>
> But I guess you could just represent the CAPI as multiple NUMA nodes,
> each with a single ZONE_MOVABLE zone. Especially if "node 0 CAPI dev" and
> "node 1 CAPI dev" differ in other characteristics than just using a
> different range of PFNs... otherwise what's the point of this split anyway?

Correct, I think we want the CAPI devs to look like CPU-less NUMA nodes
anyway. This is the right way to target an allocation at one of them and
it conveys the distance properly, so it makes sense.

I'll add ZONE_MOVABLE to the list of things to investigate on our
side, thanks for the pointer!

Cheers,
Ben.
Re: Interacting with coherent memory on external devices
On 05/14/2015 01:38 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2015-05-13 at 16:10 +0200, Vlastimil Babka wrote:
>> Sorry for reviving oldish thread...
>
> Well, that's actually appreciated since this is constructive discussion
> of the kind I was hoping to trigger initially :-) I'll look at

I hoped so :)

> ZONE_MOVABLE, I wasn't aware of its existence.
>
> Don't we still have the problem that ZONEs must be somewhat contiguous
> chunks? Ie, my "CAPI memory" will be interleaved in the physical
> address space somewhat.. This is due to the address space on some of
> those systems where you'll basically have something along the lines of:
>
> [ node 0 mem ] [ node 0 CAPI dev ] [ node 1 mem ] [ node 1 CAPI dev ] ...

Oh, I see. The VM code should cope with that, but some operations would
be inefficiently looping over the holes in the CAPI zone by 2MB
pageblock per iteration. This would include compaction scanning, which
would suck if you need those large contiguous allocations as you said.
Interleaving works better if it's done with a smaller granularity.

But I guess you could just represent the CAPI as multiple NUMA nodes,
each with a single ZONE_MOVABLE zone. Especially if "node 0 CAPI dev" and
"node 1 CAPI dev" differ in other characteristics than just using a
different range of PFNs... otherwise what's the point of this split anyway?
Re: Interacting with coherent memory on external devices
On Wed, 2015-05-13 at 16:10 +0200, Vlastimil Babka wrote:
> Sorry for reviving oldish thread...

Well, that's actually appreciated since this is constructive discussion
of the kind I was hoping to trigger initially :-) I'll look at
ZONE_MOVABLE, I wasn't aware of its existence.

Don't we still have the problem that ZONEs must be somewhat contiguous
chunks? Ie, my "CAPI memory" will be interleaved in the physical
address space somewhat.. This is due to the address space on some of
those systems where you'll basically have something along the lines of:

[ node 0 mem ] [ node 0 CAPI dev ] [ node 1 mem ] [ node 1 CAPI dev ] ...

> On 04/28/2015 01:54 AM, Benjamin Herrenschmidt wrote:
> > On Mon, 2015-04-27 at 11:48 -0500, Christoph Lameter wrote:
> >> On Mon, 27 Apr 2015, Rik van Riel wrote:
> >>
> >>> Why would we want to avoid the sane approach that makes this thing
> >>> work with the fewest required changes to core code?
> >>
> >> Because new ZONEs are a pretty invasive change to the memory management
> >> and because there are other ways to handle references to device specific
> >> memory.
> >
> > ZONEs is just one option we put on the table.
> >
> > I think we can mostly agree on the fundamentals that a good model of
> > such a co-processor is a NUMA node, possibly with a higher distance
> > than other nodes (but even that can be debated).
> >
> > That gives us a lot of the basics we need such as struct page, the
> > ability to use the existing migration infrastructure, and is actually a
> > reasonable representation at high level as well.
> >
> > The question is how do we additionally get the random stuff we don't
> > care about out of the way. The large distance will not help that much
> > under memory pressure for example.
> >
> > Covering the entire device memory with a CMA goes a long way toward that
> > goal. It will avoid your ordinary kernel allocations.
>
> I think ZONE_MOVABLE should be sufficient for this. CMA is basically for
> marking parts of zones as MOVABLE-only. You shouldn't need that for the
> whole zone. Although it might happen that CMA will be a special zone one
> day.
>
> > It also provides just what we need to be able to do large contiguous
> > "explicit" allocations for use by workloads that don't want the
> > transparent migration and by the driver for the device which might also
> > need such special allocations for its own internal management data
> > structures.
>
> Plain zone compaction + reclaim should work as well in a ZONE_MOVABLE
> zone. CMA allocations might IIRC additionally migrate across zones, e.g.
> from the device to system memory (unlike plain compaction), which might
> be what you want, or not.
>
> > We still have the risk of pages in the CMA being pinned by something
> > like gup however; that's where the ZONE idea comes in, to ensure the
> > various kernel allocators will *never* allocate in that zone unless
> > explicitly specified, but that could possibly be implemented differently.
>
> Kernel allocations should ignore the ZONE_MOVABLE zone as they are not
> typically movable. Then it depends on how much control you want for
> userspace allocations.
>
> > Maybe a concept of an "exclusive" NUMA node, where allocations never
> > fall back to that node unless explicitly asked to go there.
>
> I guess that could be doable on the zonelist level, where the device
> memory node/zone wouldn't be part of the "normal" zonelists, so memory
> pressure calculations should be also fine. But surely there will be some
> corner cases :)
>
> > Of course that would have an impact on memory pressure calculations;
> > nothing comes completely for free, but at this stage, this is the goal
> > of this thread, ie, to swap ideas around and see what's most likely to
> > work in the long run before we even start implementing something.
> >
> > Cheers,
> > Ben.
Re: Interacting with coherent memory on external devices
Sorry for reviving oldish thread...

On 04/28/2015 01:54 AM, Benjamin Herrenschmidt wrote:
> On Mon, 2015-04-27 at 11:48 -0500, Christoph Lameter wrote:
>> On Mon, 27 Apr 2015, Rik van Riel wrote:
>>
>>> Why would we want to avoid the sane approach that makes this thing
>>> work with the fewest required changes to core code?
>>
>> Because new ZONEs are a pretty invasive change to the memory management
>> and because there are other ways to handle references to device specific
>> memory.
>
> ZONEs is just one option we put on the table.
>
> I think we can mostly agree on the fundamentals that a good model of
> such a co-processor is a NUMA node, possibly with a higher distance
> than other nodes (but even that can be debated).
>
> That gives us a lot of the basics we need such as struct page, the
> ability to use the existing migration infrastructure, and is actually a
> reasonable representation at high level as well.
>
> The question is how do we additionally get the random stuff we don't
> care about out of the way. The large distance will not help that much
> under memory pressure for example.
>
> Covering the entire device memory with a CMA goes a long way toward that
> goal. It will avoid your ordinary kernel allocations.

I think ZONE_MOVABLE should be sufficient for this. CMA is basically for
marking parts of zones as MOVABLE-only. You shouldn't need that for the
whole zone. Although it might happen that CMA will be a special zone one
day.

> It also provides just what we need to be able to do large contiguous
> "explicit" allocations for use by workloads that don't want the
> transparent migration and by the driver for the device which might also
> need such special allocations for its own internal management data
> structures.

Plain zone compaction + reclaim should work as well in a ZONE_MOVABLE
zone. CMA allocations might IIRC additionally migrate across zones, e.g.
from the device to system memory (unlike plain compaction), which might
be what you want, or not.

> We still have the risk of pages in the CMA being pinned by something
> like gup however; that's where the ZONE idea comes in, to ensure the
> various kernel allocators will *never* allocate in that zone unless
> explicitly specified, but that could possibly be implemented differently.

Kernel allocations should ignore the ZONE_MOVABLE zone as they are not
typically movable. Then it depends on how much control you want for
userspace allocations.

> Maybe a concept of an "exclusive" NUMA node, where allocations never
> fall back to that node unless explicitly asked to go there.

I guess that could be doable on the zonelist level, where the device
memory node/zone wouldn't be part of the "normal" zonelists, so memory
pressure calculations should be also fine. But surely there will be some
corner cases :)

> Of course that would have an impact on memory pressure calculations;
> nothing comes completely for free, but at this stage, this is the goal
> of this thread, ie, to swap ideas around and see what's most likely to
> work in the long run before we even start implementing something.
>
> Cheers,
> Ben.
Re: Interacting with coherent memory on external devices
On Tue, Apr 28, 2015 at 09:18:55AM -0500, Christoph Lameter wrote: > On Mon, 27 Apr 2015, Jerome Glisse wrote: > > > > is the mechanism that DAX relies on in the VM. > > > > Which would require fare more changes than you seem to think. First using > > MIXED|PFNMAP means we loose any kind of memory accounting and forget about > > memcg too. Seconds it means we would need to set those flags on all vma, > > which kind of point out that something must be wrong here. You will also > > need to have vm_ops for all those vma (including for anonymous private vma > > which sounds like it will break quite few place that test for that). Then > > you have to think about vma that already have vm_ops but you would need > > to override it to handle case where its device memory and then forward > > other case to the existing vm_ops, extra layering, extra complexity. > > These vmas would only be used for those section of memory that use > memory in the coprocessor. Special memory accounting etc can be done at > the device driver layer. Multiple processes would be able to use different > GPU contexts (or devices) which provides proper isolations. > > memcg is about accouting for regular memory and this is not regular > memory. It ooks like one would need a lot of special casing in > the VM if one wanted to handle f.e. GPU memory as regular memory under > Linux. Well i shoed this does not need much changes refer to : http://lwn.net/Articles/597289/ More specifically : http://thread.gmane.org/gmane.linux.kernel.mm/116584 http://thread.gmane.org/gmane.linux.kernel.mm/116584 http://thread.gmane.org/gmane.linux.kernel.mm/116584 Idea here is that even if device memory is speciak kind of memory we still want to account it properly against process ie an anonymous page that is on the device memory would still be accounted as regular anonymous page for memcg (same apply to file backed pages). With that existing memcg keeps working as intended and process memory use are properly accounted. 
This does not prevent the device driver to perform its own accounting of device memory and to allow or block migration for a given process. At this point we do not think it is meaningfull to move such accounting to a common layer. Bottom line is, we want to keep existing memcg accounting intact and we want to reflect remote memory as regular memory. Note that the memcg changes would be even smaller now that Johannes cleaned up and simplified memcg. I have not rebase that part of HMM yet. > > > I think at this point there is nothing more to discuss here. It is pretty > > clear to me that any solution using block device/MIXEDMAP would be far > > more complex and far more intrusive. I do not mind being prove wrong but > > i will certainly not waste my time trying to implement such solution. > > The device driver method is the current solution used by the GPUS and > that would be the natural starting point for development. And they do not > currently add code to the core vm. I think we first need to figure out if > we cannot do what you want through that method. We do need a different solution, i have been working on that for last 2 years for a reason. Requirement: _no_ special allocator in userspace so that all kind of memory (anonymous, share, file backed) can be use and migrated to device memory in a transparent fashion for the application. No special allocator imply no special vma so no special vm_ops. So we need either to hook up in few places inside mm code with minor change to deal with special CPU pte entry of migrated memory (on page fault, fork, write back). For all those place it's just about adding : if(new_special_pte) new_helper_function() Other solution would have been to introduce yet another vm_ops that would superceed the existing vm_ops. This work for page fault but require more changes for page fault and fork, and major changes for write back. Hence, why first solution was favor. 
I explored many different paths before going down the road I am on, and all you are doing is hand-waving some ideas without even considering any of the objections I formulated. I explained why your idea cannot work, or would require excessive and more complex changes than the solution we are proposing. Cheers, Jérôme -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Interacting with coherent memory on external devices
On Mon, 27 Apr 2015, Jerome Glisse wrote: > > is the mechanism that DAX relies on in the VM. > > Which would require far more changes than you seem to think. First, using > MIXED|PFNMAP means we lose any kind of memory accounting, and forget about > memcg too. Second, it means we would need to set those flags on all vmas, > which kind of points out that something must be wrong here. You would also > need to have vm_ops for all those vmas (including for anonymous private vmas, > which sounds like it will break quite a few places that test for that). Then > you have to think about vmas that already have vm_ops: you would need > to override them to handle the device-memory case and then forward > other cases to the existing vm_ops; extra layering, extra complexity. These vmas would only be used for those sections of memory that use memory in the coprocessor. Special memory accounting etc. can be done at the device driver layer. Multiple processes would be able to use different GPU contexts (or devices), which provides proper isolation. memcg is about accounting for regular memory and this is not regular memory. It looks like one would need a lot of special casing in the VM if one wanted to handle e.g. GPU memory as regular memory under Linux. > I think at this point there is nothing more to discuss here. It is pretty > clear to me that any solution using block device/MIXEDMAP would be far > more complex and far more intrusive. I do not mind being proven wrong but > I will certainly not waste my time trying to implement such a solution. The device driver method is the current solution used by the GPUs and that would be the natural starting point for development. And they do not currently add code to the core VM. I think we first need to figure out if we cannot do what you want through that method. 
Re: Interacting with coherent memory on external devices
On Mon, 2015-04-27 at 11:48 -0500, Christoph Lameter wrote: > On Mon, 27 Apr 2015, Rik van Riel wrote: > > > Why would we want to avoid the sane approach that makes this thing > > work with the fewest required changes to core code? > > Because new ZONEs are a pretty invasive change to memory management and > because there are other ways to handle references to device-specific > memory. ZONEs are just one option we put on the table. I think we can mostly agree on the fundamental that a good model of such a co-processor is a NUMA node, possibly with a higher distance than other nodes (but even that can be debated). That gives us a lot of the basics we need, such as struct page and the ability to use the existing migration infrastructure, and it is actually a reasonable representation at a high level as well. The question is how we additionally get the random stuff we don't care about out of the way. The large distance will not help that much under memory pressure, for example. Covering the entire device memory with a CMA goes a long way toward that goal. It will avoid your ordinary kernel allocations. It also provides just what we need to be able to do large contiguous "explicit" allocations for use by workloads that don't want the transparent migration, and by the driver for the device, which might also need such special allocations for its own internal management data structures. We still have the risk of pages in the CMA being pinned by something like gup; however, that's where the ZONE idea comes in: to ensure the various kernel allocators will *never* allocate in that zone unless explicitly specified. But that could possibly be implemented differently, maybe as a concept of an "exclusive" NUMA node, where allocations never fall back to that node unless explicitly asked to go there. 
Of course that would have an impact on memory pressure calculations; nothing comes completely for free. But at this stage, that is the goal of this thread, i.e. to swap ideas around and see what's most likely to work in the long run before we even start implementing something. Cheers, Ben. 
Re: Interacting with coherent memory on external devices
On Mon, Apr 27, 2015 at 02:26:04PM -0500, Christoph Lameter wrote: > On Mon, 27 Apr 2015, Jerome Glisse wrote: > > > > We can drop the DAX name and just talk about mapping to external memory if > > > that confuses the issue. > > > > DAX is for direct access to the block layer (the X is for the cool name factor); > > there is zero code inside DAX that would be useful to us, because it > > is all about filesystems and short-circuiting the pagecache. So DAX is > > _not_ about providing rw mappings to non-regular memory; it is about > > allowing a _filesystem backing store_ to be directly mapped into a process. > > It's about directly mapping memory outside of regular kernel > management via a block device into user space. That you can put a > filesystem on top is one possible use case. You can provide a block > device to map the memory of the coprocessor and then configure the memory > space to have the same layout on the coprocessor as well as in the Linux > process. A _block device_ is not what we want; the block device API does not match anything remotely useful for our use case. Most of the block device API deals with disks and scheduling I/O on them, none of which is interesting to us. So we would need to carefully create various noop functions and insert ourselves as some kind of fake block device, while also making sure no userspace could actually use us as a regular block device. We would be pretending to be something we are not. > > > Moreover DAX is not about managing that persistent memory; all the > > management is done inside the fs (ext4, xfs, ...) in the same way as > > for non-persistent memory. While in our case we want to manage the > > memory as a runtime resource that is allocated to processes the same > > way regular system memory is managed. > > I repeatedly said that. So you would have a block device that would be > used to mmap portions of the special memory into a process. 
> > > So the current DAX code has nothing of value for our use case, nor will what we > > propose have any value for DAX. Unless they decide to go down the > > struct page road for persistent memory (which, from the last discussion I > > heard, was not their plan; I am pretty sure they entirely dismissed > > that idea for now). > > DAX is about directly accessing memory. It is made for the purpose of > serving as a block device for a filesystem right now, but it can easily be > used as a way to map any external memory into a process's space using the > abstraction of a block device. But then you can do that with any device > driver using VM_PFNMAP or VM_MIXEDMAP. Maybe we better use that term > instead. Guess I have repeated myself 6 times or so now? I am stopping > with this one. > > > My point is that these are two different, non-overlapping problems, and > > thus mandate two different solutions. > > Well, confusion abounds since so much other stuff has been attached to DAX > devices. > > Let's drop the DAX term and use VM_PFNMAP or VM_MIXEDMAP instead. MIXEDMAP > is the mechanism that DAX relies on in the VM. Which would require far more changes than you seem to think. First, using MIXED|PFNMAP means we lose any kind of memory accounting, and forget about memcg too. Second, it means we would need to set those flags on all vmas, which kind of points out that something must be wrong here. You would also need to have vm_ops for all those vmas (including for anonymous private vmas, which sounds like it will break quite a few places that test for that). Then you have to think about vmas that already have vm_ops: you would need to override them to handle the device-memory case and then forward other cases to the existing vm_ops; extra layering, extra complexity. All in all, this leads me to believe that any such approach would be vastly more complex, would involve changing many places, and would try to shoehorn something into the block device model that is clearly not a block device. 
Paul's solution or mine are far smaller; I think Paul can even get away without adding/changing a ZONE by putting the device pages onto a different list that is not used by the kernel memory allocator. Only a few places in the code would need a new if() (when freeing a page and when initializing the device memory struct pages; you could keep the lru code intact here). I think at this point there is nothing more to discuss here. It is pretty clear to me that any solution using block device/MIXEDMAP would be far more complex and far more intrusive. I do not mind being proven wrong but I will certainly not waste my time trying to implement such a solution. Btw, as a data point, if you ignore my patches to mmu_notifier (which are mostly about passing down more context information to the callback), I touch fewer than 50 lines of common mm code. Everything else is helpers that are only used by the device driver. Cheers, Jérôme 
Re: Interacting with coherent memory on external devices
On 04/27/2015 03:26 PM, Christoph Lameter wrote: > DAX is about directly accessing memory. It is made for the purpose of > serving as a block device for a filesystem right now, but it can easily be > used as a way to map any external memory into a process's space using the > abstraction of a block device. But then you can do that with any device > driver using VM_PFNMAP or VM_MIXEDMAP. Maybe we better use that term > instead. Guess I have repeated myself 6 times or so now? I am stopping > with this one. Yeah, please stop. If after 6 times you have still not grasped that having the application manage which memory goes onto the device and which goes in RAM is the exact opposite of the use model that Paul and Jerome are trying to enable (transparent moving around of memory, by e.g. GPU calculation libraries), you are clearly not paying enough attention. -- All rights reversed 
Re: Interacting with coherent memory on external devices
On Mon, 27 Apr 2015, Jerome Glisse wrote: > > We can drop the DAX name and just talk about mapping to external memory if > > that confuses the issue. > > DAX is for direct access to the block layer (the X is for the cool name factor); > there is zero code inside DAX that would be useful to us, because it > is all about filesystems and short-circuiting the pagecache. So DAX is > _not_ about providing rw mappings to non-regular memory; it is about > allowing a _filesystem backing store_ to be directly mapped into a process. It's about directly mapping memory outside of regular kernel management via a block device into user space. That you can put a filesystem on top is one possible use case. You can provide a block device to map the memory of the coprocessor and then configure the memory space to have the same layout on the coprocessor as well as in the Linux process. > Moreover DAX is not about managing that persistent memory; all the > management is done inside the fs (ext4, xfs, ...) in the same way as > for non-persistent memory. While in our case we want to manage the > memory as a runtime resource that is allocated to processes the same > way regular system memory is managed. I repeatedly said that. So you would have a block device that would be used to mmap portions of the special memory into a process. > So the current DAX code has nothing of value for our use case, nor will what we > propose have any value for DAX. Unless they decide to go down the > struct page road for persistent memory (which, from the last discussion I > heard, was not their plan; I am pretty sure they entirely dismissed > that idea for now). DAX is about directly accessing memory. It is made for the purpose of serving as a block device for a filesystem right now, but it can easily be used as a way to map any external memory into a process's space using the abstraction of a block device. But then you can do that with any device driver using VM_PFNMAP or VM_MIXEDMAP. Maybe we better use that term instead. 
Guess I have repeated myself 6 times or so now? I am stopping with this one. > My point is that these are two different, non-overlapping problems, and > thus mandate two different solutions. Well, confusion abounds since so much other stuff has been attached to DAX devices. Let's drop the DAX term and use VM_PFNMAP or VM_MIXEDMAP instead. MIXEDMAP is the mechanism that DAX relies on in the VM. 
Re: Interacting with coherent memory on external devices
On Mon, Apr 27, 2015 at 11:51:51AM -0500, Christoph Lameter wrote: > On Mon, 27 Apr 2015, Jerome Glisse wrote: > > > > Well, let's avoid that. Access to device memory comparable to what the > > > drivers do today by establishing page table mappings, or a generalization > > > of DAX approaches, would be the most straightforward way of implementing it > > > and would build on existing functionality. Page migration currently > > > does not work with driver mappings or DAX because there is no struct page > > > that would allow the lockdown of the page. That may require either > > > continued work on the DAX-with-page-structs approach or new developments > > > in the page migration logic, comparable to the get_user_page() alternative > > > of simply creating a scatter-gather table to submit a couple of > > > memory ranges to the I/O subsystem, thereby avoiding page structs. > > > > What you refuse to see is that DAX is geared toward filesystems and as such > > relies on special mappings. There is a reason why dax.c is in fs/ and not mm/, > > and I keep pointing out that we do not want our mechanism to be perceived as a fs > > from the userspace point of view. We want to be below the fs, at the mm level, > > where we can really do things transparently no matter what kind of memory > > we are talking about (anonymous, file-mapped, shared). > > Ok, that is why I mentioned the device memory mappings that are currently > used for this purpose. You could generalize the DAX approach (which I > understand as providing rw mappings to memory outside of the memory > managed by the kernel, and not as a fs-specific thing). > > We can drop the DAX name and just talk about mapping to external memory if > that confuses the issue. DAX is for direct access to the block layer (the X is for the cool name factor); there is zero code inside DAX that would be useful to us, because it is all about filesystems and short-circuiting the pagecache. 
So DAX is _not_ about providing rw mappings to non-regular memory; it is about allowing a _filesystem backing store_ to be directly mapped into a process. Moreover, DAX is not about managing that persistent memory; all the management is done inside the fs (ext4, xfs, ...) in the same way as for non-persistent memory. While in our case we want to manage the memory as a runtime resource that is allocated to processes the same way regular system memory is managed. So the current DAX code has nothing of value for our use case, nor will what we propose have any value for DAX. Unless they decide to go down the struct page road for persistent memory (which, from the last discussion I heard, was not their plan; I am pretty sure they entirely dismissed that idea for now). My point is that these are two different, non-overlapping problems, and thus mandate two different solutions. Cheers, Jérôme 
Re: Interacting with coherent memory on external devices
On Mon, 27 Apr 2015, Jerome Glisse wrote: > > Well, let's avoid that. Access to device memory comparable to what the > > drivers do today by establishing page table mappings, or a generalization > > of DAX approaches, would be the most straightforward way of implementing it > > and would build on existing functionality. Page migration currently > > does not work with driver mappings or DAX because there is no struct page > > that would allow the lockdown of the page. That may require either > > continued work on the DAX-with-page-structs approach or new developments > > in the page migration logic, comparable to the get_user_page() alternative > > of simply creating a scatter-gather table to submit a couple of > > memory ranges to the I/O subsystem, thereby avoiding page structs. > > What you refuse to see is that DAX is geared toward filesystems and as such > relies on special mappings. There is a reason why dax.c is in fs/ and not mm/, > and I keep pointing out that we do not want our mechanism to be perceived as a fs > from the userspace point of view. We want to be below the fs, at the mm level, > where we can really do things transparently no matter what kind of memory > we are talking about (anonymous, file-mapped, shared). Ok, that is why I mentioned the device memory mappings that are currently used for this purpose. You could generalize the DAX approach (which I understand as providing rw mappings to memory outside of the memory managed by the kernel, and not as a fs-specific thing). We can drop the DAX name and just talk about mapping to external memory if that confuses the issue. 
Re: Interacting with coherent memory on external devices
On Mon, 27 Apr 2015, Rik van Riel wrote: > Why would we want to avoid the sane approach that makes this thing > work with the fewest required changes to core code? Because new ZONEs are a pretty invasive change to memory management and because there are other ways to handle references to device-specific memory. 
Re: Interacting with coherent memory on external devices
On Mon, Apr 27, 2015 at 11:17:43AM -0500, Christoph Lameter wrote: > On Mon, 27 Apr 2015, Jerome Glisse wrote: > > > > Improvements to the general code would be preferred instead of > > > having specialized solutions for a particular hardware alone. If the > > > general code can then handle the special coprocessor situation then we > > > avoid a lot of code development. > > > > I think Paul's only big change would be the memory ZONE changes: having a > > way to add the device memory as struct page while blocking the kernel > > allocation from using this memory. Besides that, I think the autonuma changes > > he would need would really be specific to his use case but would still > > reuse all of the low-level logic. > > Well, let's avoid that. Access to device memory comparable to what the > drivers do today by establishing page table mappings, or a generalization > of DAX approaches, would be the most straightforward way of implementing it > and would build on existing functionality. Page migration currently > does not work with driver mappings or DAX because there is no struct page > that would allow the lockdown of the page. That may require either > continued work on the DAX-with-page-structs approach or new developments > in the page migration logic, comparable to the get_user_page() alternative > of simply creating a scatter-gather table to submit a couple of > memory ranges to the I/O subsystem, thereby avoiding page structs. What you refuse to see is that DAX is geared toward filesystems and as such relies on special mappings. There is a reason why dax.c is in fs/ and not mm/, and I keep pointing out that we do not want our mechanism to be perceived as a fs from the userspace point of view. We want to be below the fs, at the mm level, where we can really do things transparently no matter what kind of memory we are talking about (anonymous, file-mapped, shared). 
The fact is that DAX is about persistent storage, but the people that develop persistent storage think it would be nice to expose it as some kind of special memory. I am all for direct mapping of this kind of memory, but it is still used as the backing store for a filesystem. While in our case we are talking about "usual" _volatile_ memory that should not be used or exposed as a filesystem. I can't understand why you are so hell-bent on the DAX paradigm; it does not suit us in any way. We are not a filesystem, we are regular memory; our realm is mm/, not fs/. Cheers, Jérôme 
Re: Interacting with coherent memory on external devices
On Mon, 27 Apr 2015, Paul E. McKenney wrote: > I would instead look on this as a way to try out use of hardware migration > hints, which could lead to hardware vendors providing similar hints for > node-to-node migrations. At that time, the benefits could be provided to > all the functionality relying on such migrations. Ok, that sounds good. These "hints" could allow for the optimization of the page migration logic. > > Well yes that works with read-only mappings. Maybe we can special case > > that in the page migration code? We do not need migration entries if > > access is read-only actually. > > So you are talking about the situation only during the migration itself, > then? If there is no migration in progress, then of course there is > no problem with concurrent writes because the cache-coherence protocol > takes care of things. During migration of a given page, I agree that > marking that page read-only on both sides makes sense. This is sort of what happens in the current migration scheme. In the page tables the regular entries are replaced by migration ptes and the page is therefore inaccessible. Any access is then trapped until the page contents have been moved to the new location. Then the migration pte is replaced by a real pte again that allows full access to the page. At that point the processes that were put to sleep because they attempted an access to that page are woken up. The current scheme may be improved upon by allowing read access to the page while migration is in progress. If we changed the migration entries to allow read access, then readers would not have to be put to sleep; only writers would have to sleep until the migration is complete. > And I agree that latency-sensitive applications might not tolerate > the page being read-only, and thus would want to avoid migration. > Such applications would of course instead rely on placing the memory. 
That's why we have the ability to switch off these automatisms, and that is why we are trying to keep the OS away from certain processors. But this is not the only concern here. The other thing is to make this fit into existing functionality as cleanly as possible. So I think we would be looking at gradual improvements in the page migration logic as well as in the support for mapping external memory via driver mmap calls, DAX, and/or RDMA subsystem functionality. Those two areas of functionality need to work together better in order to provide a solution for your use cases. 
Re: Interacting with coherent memory on external devices
On 04/27/2015 12:17 PM, Christoph Lameter wrote: > On Mon, 27 Apr 2015, Jerome Glisse wrote: > >>> Improvements to the general code would be preferred instead of >>> having specialized solutions for a particular hardware alone. If the >>> general code can then handle the special coprocessor situation then we >>> avoid a lot of code development. >> >> I think Paul's only big change would be the memory ZONE changes: having a >> way to add the device memory as struct page while blocking the kernel >> allocation from using this memory. Besides that, I think the autonuma changes >> he would need would really be specific to his use case but would still >> reuse all of the low-level logic. > > Well, let's avoid that. Why would we want to avoid the sane approach that makes this thing work with the fewest required changes to core code? Just because your workload is different from the workload they are trying to enable? -- All rights reversed 
Re: Interacting with coherent memory on external devices
On Mon, 27 Apr 2015, Jerome Glisse wrote: > > Improvements to the general code would be preferred instead of > > having specialized solutions for a particular hardware alone. If the > > general code can then handle the special coprocessor situation then we > > avoid a lot of code development. > > I think Paul's only big change would be the memory ZONE changes. Having a > way to add the device memory as struct page while blocking the kernel > allocation from using this memory. Besides that I think the autonuma changes > he would need would really be specific to his use case but would still > reuse all of the low level logic. Well let's avoid that. Access to device memory comparable to what the drivers do today by establishing page table mappings, or a generalization of DAX approaches, would be the most straightforward way of implementing it and would build on existing functionality. Page migration currently does not work with driver mappings or DAX because there is no struct page that would allow the lockdown of the page. That may require either continued work on the DAX-with-page-structs approach or new developments in the page migration logic comparable to the get_user_pages() alternative of simply creating a scatter-gather table to just submit a couple of memory ranges to the I/O subsystem, thereby avoiding page structs.
Re: Interacting with coherent memory on external devices
On Mon, Apr 27, 2015 at 10:08:29AM -0500, Christoph Lameter wrote: > On Sat, 25 Apr 2015, Paul E. McKenney wrote: > > > Would you have a URL or other pointer to this code? > > linux/mm/migrate.c Ah, I thought you were calling out something not yet in mainline. > > > > Without modifying a single line of mm code, the only way to do this is > > > > to > > > > either unmap from the cpu page table the range being migrated or to > > > > mprotect > > > > it in some way. In both cases the cpu access will trigger some kind of > > > > fault. > > > > > > Yes that is how Linux migration works. If you can fix that then how about > > > improving page migration in Linux between NUMA nodes first? > > > > In principle, that also would be a good thing. But why do that first? > > Because it would benefit a lot of functionality that today relies on page > migration to have a faster, more reliable way of moving pages around. I would instead look on this as a way to try out use of hardware migration hints, which could lead to hardware vendors providing similar hints for node-to-node migrations. At that time, the benefits could be provided to all the functionality relying on such migrations. > > > > This is not the behavior we want. What we want is same address space > > > > while > > > > being able to migrate system memory to device memory (who makes that > > > > decision > > > > should not be part of that discussion) while still gracefully handling > > > > any > > > > CPU access. > > > > > > Well then there could be a situation where you have concurrent write > > > access. How do you reconcile that then? Somehow you need to stall one or > > > the other until the transaction is complete. > > > > Or have store buffers on one or both sides. > > Well if those store buffers end up with divergent contents then you have > the problem of not being able to decide which version should survive. But > from Jerome's response I deduce that this is avoided by only allowing > read-only access during migration.
That is actually similar to what page > migration does. Fair enough. > > > > This means if CPU access it we want to migrate memory back to system > > > > memory. > > > > To achieve this there is no way around adding a couple of ifs inside the mm > > > > page fault code path. Now do you want each driver to add its own if > > > > branch > > > > or do you want a common infrastructure to do just that ? > > > > > > If you can improve the page migration in general then we certainly would > > > love that. Having faultless migration is certainly a good thing for a lot of > > > functionality that depends on page migration. > > > > We do have to start somewhere, though. If we insist on perfection for > > all situations before we agree to make a change, we won't be making very > > many changes, now will we? > > Improvements to the general code would be preferred instead of > having specialized solutions for a particular hardware alone. If the > general code can then handle the special coprocessor situation then we > avoid a lot of code development. All else being equal, I agree that generality is preferred. But here, as is often the case, all else is not necessarily equal. > > As I understand it, the trick (if you can call it that) is having the > > device have the same memory-mapping capabilities as the CPUs. > > Well yes that works with read-only mappings. Maybe we can special case > that in the page migration code? We do not need migration entries if > access is read-only actually. So you are talking about the situation only during the migration itself, then? If there is no migration in progress, then of course there is no problem with concurrent writes because the cache-coherence protocol takes care of things. During migration of a given page, I agree that marking that page read-only on both sides makes sense. And I agree that latency-sensitive applications might not tolerate the page being read-only, and thus would want to avoid migration.
Such applications would of course instead rely on placing the memory. Thanx, Paul
Re: Interacting with coherent memory on external devices
On Mon, Apr 27, 2015 at 10:08:29AM -0500, Christoph Lameter wrote: > On Sat, 25 Apr 2015, Paul E. McKenney wrote: > > > Would you have a URL or other pointer to this code? > > linux/mm/migrate.c > > > > Without modifying a single line of mm code, the only way to do this is > > > > to > > > > either unmap from the cpu page table the range being migrated or to > > > > mprotect > > > > it in some way. In both cases the cpu access will trigger some kind of > > > > fault. > > > > > > Yes that is how Linux migration works. If you can fix that then how about > > > improving page migration in Linux between NUMA nodes first? > > > > In principle, that also would be a good thing. But why do that first? > > Because it would benefit a lot of functionality that today relies on page > migration to have a faster, more reliable way of moving pages around. I do not think in the CAPI case there is any way to improve on the current low-level page migration. I am talking about: - write protect & tlb flush - copy - update page table & tlb flush The upper level that has the logic for the migration would however need some change. Like Paul said, some kind of new metric and also a new way to gather statistics from the device instead of from the CPU. I think the device can provide better information than the current logic, where pages are unmapped and the kernel looks at which CPU faults on a page first. Also a way to feed hints provided by userspace through the device driver into the NUMA decision process. So I do not think that anything in this work would benefit any workload other than the one Paul is interested in. Still, I am sure Paul wants to build on top of existing infrastructure. > > > > > This is not the behavior we want. What we want is same address space > > > > while > > > > being able to migrate system memory to device memory (who makes that > > > > decision > > > > should not be part of that discussion) while still gracefully handling > > > > any > > > > CPU access.
> > > > > > Well then there could be a situation where you have concurrent write > > > access. How do you reconcile that then? Somehow you need to stall one or > > > the other until the transaction is complete. > > > > Or have store buffers on one or both sides. > > Well if those store buffers end up with divergent contents then you have > the problem of not being able to decide which version should survive. But > from Jerome's response I deduce that this is avoided by only allowing > read-only access during migration. That is actually similar to what page > migration does. Yes, as said above, no change to the logic there; we do not want divergent content at all. The thing is, autonuma is a better fit for Paul because, his platform being more advanced, he can allocate struct page for the device memory, while in my case it would be pointless as the memory is not CPU accessible. This is why the HMM patchset does not build on top of autonuma and the current page migration but still uses the same kind of logic. > > > > > This means if CPU access it we want to migrate memory back to system > > > > memory. > > > > To achieve this there is no way around adding a couple of ifs inside the mm > > > > page fault code path. Now do you want each driver to add its own if > > > > branch > > > > or do you want a common infrastructure to do just that ? > > > > > > If you can improve the page migration in general then we certainly would > > > love that. Having faultless migration is certainly a good thing for a lot of > > > functionality that depends on page migration. > > > > We do have to start somewhere, though. If we insist on perfection for > > all situations before we agree to make a change, we won't be making very > > many changes, now will we? > > Improvements to the general code would be preferred instead of > having specialized solutions for a particular hardware alone. If the > general code can then handle the special coprocessor situation then we > avoid a lot of code development.
I think Paul's only big change would be the memory ZONE changes. Having a way to add the device memory as struct page while blocking the kernel allocation from using this memory. Besides that I think the autonuma changes he would need would really be specific to his use case but would still reuse all of the low level logic. > > > As I understand it, the trick (if you can call it that) is having the > > device have the same memory-mapping capabilities as the CPUs. > > Well yes that works with read-only mappings. Maybe we can special case > that in the page migration code? We do not need migration entries if > access is read-only actually. Duplicating read-only memory on the device is really an optimization that is not critical to the whole. The common use case remains the migration of read & write memory to device memory when the memory is mostly/only accessed by the device. Cheers, Jérôme
Re: Interacting with coherent memory on external devices
On Sat, 25 Apr 2015, Paul E. McKenney wrote: > Would you have a URL or other pointer to this code? linux/mm/migrate.c > > > Without modifying a single line of mm code, the only way to do this is to > > > either unmap from the cpu page table the range being migrated or to > > > mprotect > > > it in some way. In both cases the cpu access will trigger some kind of > > > fault. > > > > Yes that is how Linux migration works. If you can fix that then how about > > improving page migration in Linux between NUMA nodes first? > > In principle, that also would be a good thing. But why do that first? Because it would benefit a lot of functionality that today relies on page migration to have a faster, more reliable way of moving pages around. > > > This is not the behavior we want. What we want is same address space while > > > being able to migrate system memory to device memory (who makes that > > > decision > > > should not be part of that discussion) while still gracefully handling any > > > CPU access. > > > > Well then there could be a situation where you have concurrent write > > access. How do you reconcile that then? Somehow you need to stall one or > > the other until the transaction is complete. > > Or have store buffers on one or both sides. Well if those store buffers end up with divergent contents then you have the problem of not being able to decide which version should survive. But from Jerome's response I deduce that this is avoided by only allowing read-only access during migration. That is actually similar to what page migration does. > > > This means if CPU access it we want to migrate memory back to system > > > memory. > > > To achieve this there is no way around adding a couple of ifs inside the mm > > > page fault code path. Now do you want each driver to add its own if branch > > > or do you want a common infrastructure to do just that ? > > > > If you can improve the page migration in general then we certainly would > > love that.
Having faultless migration is certainly a good thing for a lot of > > functionality that depends on page migration. > > We do have to start somewhere, though. If we insist on perfection for > all situations before we agree to make a change, we won't be making very > many changes, now will we? Improvements to the general code would be preferred instead of having specialized solutions for a particular hardware alone. If the general code can then handle the special coprocessor situation then we avoid a lot of code development. > As I understand it, the trick (if you can call it that) is having the > device have the same memory-mapping capabilities as the CPUs. Well yes that works with read-only mappings. Maybe we can special case that in the page migration code? We do not need migration entries if access is read-only actually.
Re: Interacting with coherent memory on external devices
On 04/27/2015 03:26 PM, Christoph Lameter wrote: DAX is about directly accessing memory. It is made for the purpose of serving as a block device for a filesystem right now but it can easily be used as a way to map any external memory into a process's space using the abstraction of a block device. But then you can do that with any device driver using VM_PFNMAP or VM_MIXEDMAP. Maybe we better use that term instead. Guess I have repeated myself 6 times or so now? I am stopping with this one. Yeah, please stop. If after 6 times you have still not grasped that having the application manage which memory goes onto the device and which goes in RAM is the exact opposite of the use model that Paul and Jerome are trying to enable (transparent moving around of memory, by e.g. GPU calculation libraries), you are clearly not paying enough attention. -- All rights reversed
Re: Interacting with coherent memory on external devices
On Mon, Apr 27, 2015 at 02:26:04PM -0500, Christoph Lameter wrote: On Mon, 27 Apr 2015, Jerome Glisse wrote: We can drop the DAX name and just talk about mapping to external memory if that confuses the issue. DAX is for the direct access block layer (X is for the cool name factor); there is zero code inside DAX that would be useful to us, because it is all about filesystems and short-circuiting the page cache. So DAX is _not_ about providing rw mappings to non-regular memory, it is about allowing to directly map _filesystem backing storage_ into a process. It's about directly mapping memory outside of regular kernel management via a block device into user space. That you can put a filesystem on top is one possible use case. You can provide a block device to map the memory of the coprocessor and then configure the memory space to have the same layout on the coprocessor as well as the Linux process. A _block device_ is not what we want; the API of a block device does not match anything remotely useful for our use case. Most of the block device API deals with disks and scheduling I/O on them, none of which is interesting to us. So we would need to carefully create various noop functions and insert ourselves as some kind of fake block device while also making sure no userspace could actually use us as a regular block device. So we would be pretending to be something we are not. Moreover DAX is not about managing that persistent memory; all the management is done inside the fs (ext4, xfs, ...) in the same way as for non-persistent memory. While in our case we want to manage the memory as a runtime resource that is allocated to processes the same way regular system memory is managed. I repeatedly said that. So you would have a block device that would be used to mmap portions of the special memory into a process. So current DAX code has nothing of value for our use case, nor will what we propose have any value for DAX.
Unless they decide to go down the struct page road for persistent memory (which from the last discussion I heard was not their plan; I am pretty sure they entirely dismissed that idea for now). DAX is about directly accessing memory. It is made for the purpose of serving as a block device for a filesystem right now but it can easily be used as a way to map any external memory into a process's space using the abstraction of a block device. But then you can do that with any device driver using VM_PFNMAP or VM_MIXEDMAP. Maybe we better use that term instead. Guess I have repeated myself 6 times or so now? I am stopping with this one. My point is that these are 2 different, non-overlapping problems, and thus they mandate 2 different solutions. Well confusion abounds since so much other stuff has been attached to DAX devices. Let's drop the DAX term and use VM_PFNMAP or VM_MIXEDMAP instead. MIXEDMAP is the mechanism that DAX relies on in the VM. Which would require far more changes than you seem to think. First, using MIXED|PFNMAP means we lose any kind of memory accounting, and forget about memcg too. Second, it means we would need to set those flags on all vmas, which kind of points out that something must be wrong here. You will also need to have vm_ops for all those vmas (including for anonymous private vmas, which sounds like it will break quite a few places that test for that). Then you have to think about vmas that already have vm_ops: you would need to override them to handle the case where it is device memory and then forward other cases to the existing vm_ops; extra layering, extra complexity. All in all, this points me to believe that any such approach would be vastly more complex, involve changing many places, and try to shoehorn something into the block device model that is clearly not a block device.
Paul's solution, or mine, is far smaller; I think Paul can even get away without adding/changing a ZONE by putting the device pages onto a different list that is not used by the kernel memory allocator. Only a few code places would need a new if() (when freeing a page and when initializing the device memory struct pages; you could keep the lru code intact here). I think at this point there is nothing more to discuss here. It is pretty clear to me that any solution using a block device/MIXEDMAP would be far more complex and far more intrusive. I do not mind being proven wrong but I will certainly not waste my time trying to implement such a solution. Btw as a data point, if you ignore my patches to mmu_notifier (which are mostly about passing down more context information to the callback), I touch fewer than 50 lines of mm common code. Everything else is helpers that are only used by the device driver. Cheers, Jérôme
Re: Interacting with coherent memory on external devices
On Mon, 2015-04-27 at 11:48 -0500, Christoph Lameter wrote: On Mon, 27 Apr 2015, Rik van Riel wrote: Why would we want to avoid the sane approach that makes this thing work with the fewest required changes to core code? Because new ZONEs are a pretty invasive change to the memory management and because there are other ways to handle references to device specific memory. ZONEs are just one option we put on the table. I think we can mostly agree on the fundamentals that a good model of such a co-processor is a NUMA node, possibly with a higher distance than other nodes (but even that can be debated). That gives us a lot of the basics we need, such as struct page and the ability to use the existing migration infrastructure, and it is actually a reasonable representation at a high level as well. The question is how do we additionally get the random stuff we don't care about out of the way. The large distance will not help that much under memory pressure for example. Covering the entire device memory with a CMA goes a long way toward that goal. It will keep your ordinary kernel allocations out. It also provides just what we need to be able to do large contiguous explicit allocations for use by workloads that don't want the transparent migration, and by the driver for the device, which might also need such special allocations for its own internal management data structures. We still have the risk of pages in the CMA being pinned by something like gup, however; that's where the ZONE idea comes in, to ensure the various kernel allocators will *never* allocate in that zone unless explicitly specified, but that could possibly be implemented differently. Maybe a concept of an exclusive NUMA node, where allocations never fall back to that node unless explicitly asked to go there.
Of course that would have an impact on memory pressure calculations; nothing comes completely for free. But at this stage, that is the goal of this thread, i.e., to swap ideas around and see what's most likely to work in the long run before we even start implementing something. Cheers, Ben.
Re: Interacting with coherent memory on external devices
On Mon, 27 Apr 2015, Paul E. McKenney wrote: I would instead look on this as a way to try out use of hardware migration hints, which could lead to hardware vendors providing similar hints for node-to-node migrations. At that time, the benefits could be provided all the functionality relying on such migrations. Ok that sounds good. These hints could allow for the optimization of the page migration logic. Well yes that works with read-only mappings. Maybe we can special case that in the page migration code? We do not need migration entries if access is read-only actually. So you are talking about the situation only during the migration itself, then? If there is no migration in progress, then of course there is no problem with concurrent writes because the cache-coherence protocol takes care of things. During migration of a given page, I agree that marking that page read-only on both sides makes sense. This is sortof what happens in the current migration scheme. In the page tables the regular entries are replaced by migration ptes and the page is therefore inaccessible. Any access is then trapped until the page contentshave been moved to the new location. Then the migration pte is replaced by a real pte again that allows full access to the page. At that point the processes that have been put to sleep because they attempted an access to that page are woken up. The current scheme may be improvied on by allowing read access to the page while migration is in process. If we would change the migration entries to allow read access then the readers would not have to be put to sleep. Only writers would have to be put to sleep until the migration is complete. And I agree that latency-sensitive applications might not tolerate the page being read-only, and thus would want to avoid migration. Such applications would of course instead rely on placing the memory. 
That's why we have the ability to switch off these automatisms, and that is why we are trying to keep the OS away from certain processors. But this is not the only concern here. The other thing is to make this fit into existing functionality as cleanly as possible. So I think we would be looking at gradual improvements in the page migration logic as well as in the support for mapping external memory via driver mmap calls, DAX and/or RDMA subsystem functionality. Those two areas of functionality need to work together better in order to provide a solution for your use cases.
Re: Interacting with coherent memory on external devices
On Mon, 27 Apr 2015, Jerome Glisse wrote: Well let's avoid that. Access to device memory comparable to what the drivers do today by establishing page table mappings or a generalization of DAX approaches would be the most straightforward way of implementing it and would build on existing functionality. Page migration currently does not work with driver mappings or DAX because there is no struct page that would allow the lockdown of the page. That may require either continued work on the DAX-with-page-structs approach or new developments in the page migration logic comparable to the get_user_pages() alternative of simply creating a scatter-gather table to just submit a couple of memory ranges to the I/O subsystem, thereby avoiding page structs. What you refuse to see is that DAX is geared toward filesystems and as such relies on special mappings. There is a reason why dax.c is in fs/ and not mm/, and I keep pointing out that we do not want our mechanism to be perceived as a fs from the userspace point of view. We want to be below the fs, at the mm level, where we could really do things transparently no matter what kind of memory we are talking about (anonymous, file-mapped, shared). Ok that is why I mentioned the device memory mappings that are currently used for this purpose. You could generalize the DAX approach (which I understand as providing rw mappings to memory outside of the memory managed by the kernel and not as a fs-specific thing). We can drop the DAX name and just talk about mapping to external memory if that confuses the issue.
Re: Interacting with coherent memory on external devices
On Mon, Apr 27, 2015 at 10:08:29AM -0500, Christoph Lameter wrote: On Sat, 25 Apr 2015, Paul E. McKenney wrote: Would you have a URL or other pointer to this code? linux/mm/migrate.c Ah, I thought you were calling out something not yet in mainline. Without modifying a single line of mm code, the only way to do this is to either unmap from the cpu page table the range being migrated or to mprotect it in some way. In both cases the cpu access will trigger some kind of fault. Yes that is how Linux migration works. If you can fix that then how about improving page migration in Linux between NUMA nodes first? In principle, that also would be a good thing. But why do that first? Because it would benefit a lot of functionality that today relies on page migration to have a faster, more reliable way of moving pages around. I would instead look on this as a way to try out use of hardware migration hints, which could lead to hardware vendors providing similar hints for node-to-node migrations. At that time, the benefits could be provided to all the functionality relying on such migrations. This is not the behavior we want. What we want is the same address space while being able to migrate system memory to device memory (who makes that decision should not be part of this discussion) while still gracefully handling any CPU access. Well then there could be a situation where you have concurrent write access. How do you reconcile that then? Somehow you need to stall one or the other until the transaction is complete. Or have store buffers on one or both sides. Well if those store buffers end up with divergent contents then you have the problem of not being able to decide which version should survive. But from Jerome's response I deduce that this is avoided by allowing only read-only access during migration. That is actually similar to what page migration does. Fair enough. This means if the CPU accesses it we want to migrate the memory back to system memory. 
To achieve this there is no way around adding a couple of ifs inside the mm page-fault code path. Now do you want each driver to add its own if branch or do you want a common infrastructure to do just that? If you can improve the page migration in general then we certainly would love that. Having faultless migration is certainly a good thing for a lot of functionality that depends on page migration. We do have to start somewhere, though. If we insist on perfection for all situations before we agree to make a change, we won't be making very many changes, now will we? Improvements to the general code would be preferred instead of having specialized solutions for a particular hardware alone. If the general code can then handle the special coprocessor situation then we avoid a lot of code development. All else being equal, I agree that generality is preferred. But here, as is often the case, all else is not necessarily equal. As I understand it, the trick (if you can call it that) is having the device have the same memory-mapping capabilities as the CPUs. Well yes that works with read-only mappings. Maybe we can special-case that in the page migration code? We do not need migration entries if access is read-only, actually. So you are talking about the situation only during the migration itself, then? If there is no migration in progress, then of course there is no problem with concurrent writes because the cache-coherence protocol takes care of things. During migration of a given page, I agree that marking that page read-only on both sides makes sense. And I agree that latency-sensitive applications might not tolerate the page being read-only, and thus would want to avoid migration. Such applications would of course instead rely on explicitly placing the memory. 
Thanx, Paul
Re: Interacting with coherent memory on external devices
On 04/27/2015 12:17 PM, Christoph Lameter wrote: On Mon, 27 Apr 2015, Jerome Glisse wrote: Improvements to the general code would be preferred instead of having specialized solutions for a particular hardware alone. If the general code can then handle the special coprocessor situation then we avoid a lot of code development. I think Paul's only big change would be the memory ZONE changes. Having a way to add the device memory as struct page while blocking the kernel allocator from using this memory. Besides that I think the autonuma changes he would need would really be specific to his use case but would still reuse all of the low-level logic. Well let's avoid that. Why would we want to avoid the sane approach that makes this thing work with the fewest required changes to core code? Just because your workload is different from the workload they are trying to enable? -- All rights reversed
Re: Interacting with coherent memory on external devices
On Mon, 27 Apr 2015, Rik van Riel wrote: Why would we want to avoid the sane approach that makes this thing work with the fewest required changes to core code? Because new ZONEs are a pretty invasive change to the memory management and because there are other ways to handle references to device-specific memory.
Re: Interacting with coherent memory on external devices
On Mon, 27 Apr 2015, Jerome Glisse wrote: Improvements to the general code would be preferred instead of having specialized solutions for a particular hardware alone. If the general code can then handle the special coprocessor situation then we avoid a lot of code development. I think Paul's only big change would be the memory ZONE changes. Having a way to add the device memory as struct page while blocking the kernel allocator from using this memory. Besides that I think the autonuma changes he would need would really be specific to his use case but would still reuse all of the low-level logic. Well let's avoid that. Access to device memory comparable to what the drivers do today by establishing page table mappings or a generalization of DAX approaches would be the most straightforward way of implementing it and would build on existing functionality. Page migration currently does not work with driver mappings or DAX because there is no struct page that would allow the lockdown of the page. That may require either continued work on the DAX-with-page-structs approach or new developments in the page migration logic comparable to the get_user_pages() alternative of simply creating a scatter-gather table to just submit a couple of memory ranges to the I/O subsystem, thereby avoiding page structs.
Re: Interacting with coherent memory on external devices
On Mon, Apr 27, 2015 at 11:17:43AM -0500, Christoph Lameter wrote: On Mon, 27 Apr 2015, Jerome Glisse wrote: Improvements to the general code would be preferred instead of having specialized solutions for a particular hardware alone. If the general code can then handle the special coprocessor situation then we avoid a lot of code development. I think Paul's only big change would be the memory ZONE changes. Having a way to add the device memory as struct page while blocking the kernel allocator from using this memory. Besides that I think the autonuma changes he would need would really be specific to his use case but would still reuse all of the low-level logic. Well let's avoid that. Access to device memory comparable to what the drivers do today by establishing page table mappings or a generalization of DAX approaches would be the most straightforward way of implementing it and would build on existing functionality. Page migration currently does not work with driver mappings or DAX because there is no struct page that would allow the lockdown of the page. That may require either continued work on the DAX-with-page-structs approach or new developments in the page migration logic comparable to the get_user_pages() alternative of simply creating a scatter-gather table to just submit a couple of memory ranges to the I/O subsystem, thereby avoiding page structs. What you refuse to see is that DAX is geared toward filesystems and as such relies on special mappings. There is a reason why dax.c is in fs/ and not mm/, and I keep pointing out that we do not want our mechanism to be perceived as a fs from the userspace point of view. We want to be below the fs, at the mm level, where we could really do things transparently no matter what kind of memory we are talking about (anonymous, file-mapped, shared). The fact is that DAX is about persistent storage, but the people that develop the persistent storage think it would be nice to expose it as some kind of special memory. 
I am all for the direct mapping of this kind of memory, but still it is used as a backing store for a filesystem. While in our case we are talking about usual _volatile_ memory that should not be used or exposed as a filesystem. I can't understand why you are so hellbent on the DAX paradigm, but it does not suit us in any way. We are not a filesystem, we are regular memory; our realm is mm/ not fs/ Cheers, Jérôme
Re: Interacting with coherent memory on external devices
On Mon, Apr 27, 2015 at 11:51:51AM -0500, Christoph Lameter wrote: On Mon, 27 Apr 2015, Jerome Glisse wrote: Well let's avoid that. Access to device memory comparable to what the drivers do today by establishing page table mappings or a generalization of DAX approaches would be the most straightforward way of implementing it and would build on existing functionality. Page migration currently does not work with driver mappings or DAX because there is no struct page that would allow the lockdown of the page. That may require either continued work on the DAX-with-page-structs approach or new developments in the page migration logic comparable to the get_user_pages() alternative of simply creating a scatter-gather table to just submit a couple of memory ranges to the I/O subsystem, thereby avoiding page structs. What you refuse to see is that DAX is geared toward filesystems and as such relies on special mappings. There is a reason why dax.c is in fs/ and not mm/, and I keep pointing out that we do not want our mechanism to be perceived as a fs from the userspace point of view. We want to be below the fs, at the mm level, where we could really do things transparently no matter what kind of memory we are talking about (anonymous, file-mapped, shared). Ok that is why I mentioned the device memory mappings that are currently used for this purpose. You could generalize the DAX approach (which I understand as providing rw mappings to memory outside of the memory managed by the kernel and not as a fs-specific thing). We can drop the DAX name and just talk about mapping to external memory if that confuses the issue. DAX is for the direct-access block layer (the X is for the cool-name factor); there is zero code inside DAX that would be useful to us, because it is all about filesystems and short-circuiting the page cache. So DAX is _not_ about providing rw mappings to non-regular memory; it is about allowing _filesystem backing storage_ to be directly mapped into a process. 
Moreover DAX is not about managing that persistent memory; all the management is done inside the fs (ext4, xfs, ...) in the same way as for non-persistent memory. While in our case we want to manage the memory as a runtime resource that is allocated to processes the same way regular system memory is managed. So the current DAX code has nothing of value for our use case, nor will what we propose have any value for DAX. Unless they decide to go down the struct page road for persistent memory (which, from the last discussion I heard, was not their plan; I am pretty sure they entirely dismissed that idea for now). My point is that these are 2 different non-overlapping problems, and thus mandate 2 different solutions. Cheers, Jérôme
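To make the "DAX sits behind a filesystem" point concrete, this is roughly how DAX is consumed today (a hedged sketch; the pmem device name and mount point are assumptions, and ext4/xfs are the filesystems with DAX support at the time of this thread):

```shell
# Hypothetical persistent-memory block device /dev/pmem0.
mkfs.ext4 /dev/pmem0
mount -o dax /dev/pmem0 /mnt/pmem
# Applications then mmap() files under /mnt/pmem and get direct
# loads/stores to the backing memory, bypassing the page cache.
```

Everything here goes through a block device and a filesystem, which is exactly Jérôme's objection: allocation and naming of the memory is the fs's job, whereas the device-memory use case wants the mm allocator to hand the memory to processes like anonymous memory.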
Re: Interacting with coherent memory on external devices
On Mon, 27 Apr 2015, Jerome Glisse wrote: We can drop the DAX name and just talk about mapping to external memory if that confuses the issue. DAX is for the direct-access block layer (the X is for the cool-name factor); there is zero code inside DAX that would be useful to us, because it is all about filesystems and short-circuiting the page cache. So DAX is _not_ about providing rw mappings to non-regular memory; it is about allowing _filesystem backing storage_ to be directly mapped into a process. It's about directly mapping memory outside of regular kernel management via a block device into user space. That you can put a filesystem on top is one possible use case. You can provide a block device to map the memory of the coprocessor and then configure the memory space to have the same layout on the coprocessor as well as in the Linux process. Moreover DAX is not about managing that persistent memory; all the management is done inside the fs (ext4, xfs, ...) in the same way as for non-persistent memory. While in our case we want to manage the memory as a runtime resource that is allocated to processes the same way regular system memory is managed. I repeatedly said that. So you would have a block device that would be used to mmap portions of the special memory into a process. So the current DAX code has nothing of value for our use case, nor will what we propose have any value for DAX. Unless they decide to go down the struct page road for persistent memory (which, from the last discussion I heard, was not their plan; I am pretty sure they entirely dismissed that idea for now). DAX is about directly accessing memory. It is made for the purpose of serving as a block device for a filesystem right now, but it can easily be used as a way to map any external memory into a process's space using the abstraction of a block device. But then you can do that with any device driver using VM_PFNMAP or VM_MIXEDMAP. Maybe we better use that term instead. Guess I have repeated myself 6 times or so now? 
I am stopping with this one. My point is that these are 2 different non-overlapping problems, and thus mandate 2 different solutions. Well, confusion abounds since so much other stuff has been attached to DAX devices. Let's drop the DAX term and use VM_PFNMAP or VM_MIXEDMAP instead. MIXEDMAP is the mechanism that DAX relies on in the VM.
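For reference, the VM_PFNMAP mappings being discussed are what a driver creates when its mmap handler calls remap_pfn_range(). A minimal sketch (the device name and the mydev_base_phys() helper are hypothetical; this is illustration, not code from the thread):

```c
/* Sketch of a driver mmap handler that maps device memory into a
 * process with VM_PFNMAP semantics, i.e. with no struct page behind
 * the PTEs. mydev_base_phys() is a hypothetical helper returning the
 * physical base address of the device memory. */
static int mydev_mmap(struct file *filp, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;
	unsigned long pfn = mydev_base_phys(filp) >> PAGE_SHIFT;

	/* remap_pfn_range() marks the vma VM_PFNMAP | VM_IO (plus
	 * VM_DONTEXPAND | VM_DONTDUMP). Such mappings are exactly the
	 * ones the current page-migration code cannot handle, since
	 * there is no struct page to lock down during migration. */
	return remap_pfn_range(vma, vma->vm_start, pfn, size,
			       vma->vm_page_prot);
}
```

This illustrates both sides of the argument: the mechanism already exists and needs no core-mm changes (Christoph's point), but precisely because the PTEs carry bare PFNs, none of the migration machinery the CAPI/GPU use case depends on can operate on them (Jérôme's point).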
Re: Interacting with coherent memory on external devices
On Sat, Apr 25, 2015 at 01:32:39PM +1000, Benjamin Herrenschmidt wrote: > On Fri, 2015-04-24 at 22:32 -0400, Rik van Riel wrote: > > > The result would be that the kernel would allocate only > > migratable > > > pages within the CCAD device's memory, and even then only if > > > memory was otherwise exhausted. > > > > Does it make sense to allocate the device's page tables in memory > > belonging to the device? > > > > Is this a necessary thing with some devices? Jerome's HMM comes > > to mind... > > In our case, the device's MMU shares the host page tables (which is why > we can't use HMM, i.e. we can't have a page with different permissions on > CPU vs. device which HMM does). > > However the device has a pretty fast path to system memory, the best > thing we can do is pin the workload to the same chip the device is > connected to so those page tables aren't too far away. And another update, diffs then full document. Among other things, this version explicitly calls out the goal of gaining substantial performance without changing user applications, which should hopefully help. Thanx, Paul diff --git a/DeviceMem.txt b/DeviceMem.txt index 15d0a8b5d360..3de70c4b9922 100644 --- a/DeviceMem.txt +++ b/DeviceMem.txt @@ -40,10 +40,13 @@ workloads will have less-predictable access patterns, and these workloads can benefit from automatic migration of data between device memory and system memory as access patterns change. - Furthermore, some devices will provide special hardware that - collects access statistics that can be used to determine whether - or not a given page of memory should be migrated, and if so, - to where. + In this latter case, the goal is not optimal performance, + but rather a significant increase in performance compared to + what the CPUs alone can provide without needing to recompile + any of the applications making up the workload. 
Furthermore, + some devices will provide special hardware that collects access + statistics that can be used to determine whether or not a given + page of memory should be migrated, and if so, to where. The purpose of this document is to explore how this access and migration can be provided for within the Linux kernel. @@ -146,6 +149,32 @@ REQUIREMENTS required for low-latency applications that are sensitive to OS jitter. + 6. It must be possible to cause an application to use a + CCAD device simply by switching dynamically linked + libraries, but without recompiling that application. + This implies the following requirements: + + a. Address spaces must be synchronized for a given + application on the CPUs and the CCAD. In other + words, a given virtual address must access the same + physical memory from the CCAD device and from + the CPUs. + + b. Code running on the CCAD device must be able to + access the running application's memory, + regardless of how that memory was allocated, + including statically allocated at compile time. + + c. Use of the CCAD device must not interfere with + memory allocations that are never used by the + CCAD device. For example, if a CCAD device + has 16GB of memory, that should not prevent an + application using that device from allocating + more than 16GB of memory. For another example, + memory that is never accessed by a given CCAD + device should preferably remain outside of that + CCAD device's memory. + POTENTIAL IDEAS @@ -178,12 +207,11 @@ POTENTIAL IDEAS physical address ranges of normal system memory would be interleaved with those of device memory. - This would also require some sort of - migration infrastructure to be added, as autonuma would - not apply. However, this approach has the advantage - of preventing allocations in these regions, at least - unless those allocations have been explicitly flagged - to go there. 
+ This would also require some sort of migration + infrastructure to be added, as autonuma would not apply. + However, this approach has the advantage of preventing + allocations in these regions, at least unless those + allocations have been explicitly flagged to go there. 4.
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 10:49:28AM -0500, Christoph Lameter wrote: > On Fri, 24 Apr 2015, Paul E. McKenney wrote: > > > can deliver, but where the cost of full-fledged hand tuning cannot be > > justified. > > > > You seem to believe that this latter category is the empty set, which > > I must confess does greatly surprise me. > > If there are already compromises being made then why would you want to > modify the kernel for this? Some user space coding and device drivers > should be sufficient. The goal is to gain substantial performance improvement without any user-space changes. Thanx, Paul
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 03:00:18PM -0500, Christoph Lameter wrote: > On Fri, 24 Apr 2015, Jerome Glisse wrote: > > > > Still no answer as to why is that not possible with the current scheme? > > > You keep on talking about pointers and I keep on responding that this is a > > > matter of making the address space compatible on both sides. > > > > So if we do that in a naive way, how can we migrate a chunk of memory to video > > memory while still handling properly the case where the CPU tries to access that > > same memory while it is migrated to the GPU memory. > > Well that's the same issue that the migration code is handling, which I > submitted a long time ago to the kernel. Would you have a URL or other pointer to this code? > > Without modifying a single line of mm code, the only way to do this is to > > either unmap from the cpu page table the range being migrated or to mprotect > > it in some way. In both cases the cpu access will trigger some kind of fault. > > Yes that is how Linux migration works. If you can fix that then how about > improving page migration in Linux between NUMA nodes first? In principle, that also would be a good thing. But why do that first? > > This is not the behavior we want. What we want is the same address space while > > being able to migrate system memory to device memory (who makes that decision > > should not be part of this discussion) while still gracefully handling any > > CPU access. > > Well then there could be a situation where you have concurrent write > access. How do you reconcile that then? Somehow you need to stall one or > the other until the transaction is complete. Or have store buffers on one or both sides. > > This means if the CPU accesses it we want to migrate the memory back to system memory. > > To achieve this there is no way around adding a couple of ifs inside the mm > > page fault code path. Now do you want each driver to add its own if branch > > or do you want a common infrastructure to do just that? 
> > If you can improve the page migration in general then we certainly would > love that. Having faultless migration is certainly a good thing for a lot of > functionality that depends on page migration. We do have to start somewhere, though. If we insist on perfection for all situations before we agree to make a change, we won't be making very many changes, now will we? > > As I keep saying, the solution you propose is what we have today: today we > > have a fake shared address space through the trick of remapping system memory > > at the same address inside the GPU address space and also enforcing the use of > > a special memory allocator that goes behind the back of mm code. > > Hmmm... I'd like to know more details about that. As I understand it, the trick (if you can call it that) is having the device have the same memory-mapping capabilities as the CPUs. > > As you pointed out, not using GPU memory is a waste and we want to be able > > to use it. Now Paul has more sophisticated hardware that offers opportunities > > to do things in a more transparent and efficient way. > > Does this also work between NUMA nodes in a Power8 system? Heh! At the rate we are going with this discussion, Power8 will be obsolete before we have this in. ;-) Thanx, Paul
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 11:09:36AM -0400, Jerome Glisse wrote: > On Fri, Apr 24, 2015 at 07:57:38AM -0700, Paul E. McKenney wrote: > > On Fri, Apr 24, 2015 at 09:12:07AM -0500, Christoph Lameter wrote: > > > On Thu, 23 Apr 2015, Paul E. McKenney wrote: > > > > > > > > > > > DAX > > > > > > > > DAX is a mechanism for providing direct-memory access to > > > > high-speed non-volatile (AKA "persistent") memory. Good > > > > introductions to DAX may be found in the following LWN > > > > articles: > > > > > > DAX is a mechanism to access memory not managed by the kernel and is the > > > successor to XIP. It just happens to be needed for persistent memory. > > > Fundamentally any driver can provide an MMAPPed interface to allow access > > > to a device's memory. > > > > I will take another look, but others in this thread have called out > > difficulties with DAX's filesystem nature. > > Do not waste your time on that; this is not what we want. Christoph here > is more than stubborn and fails to see the world. Well, we do need to make sure that we are correctly representing DAX's capabilities. It is a hot topic, and others will probably also suggest that it be used. That said, at the moment, I don't see how it would help, given the need to migrate memory. Perhaps Boaz Harrosh's patch set to allow struct pages to be associated might help? But from what I can see, a fair amount of other functionality would still be required either way. I am updating the DAX section a bit, but I don't claim that it is complete. Thanx, Paul
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 11:09:36AM -0400, Jerome Glisse wrote: On Fri, Apr 24, 2015 at 07:57:38AM -0700, Paul E. McKenney wrote: On Fri, Apr 24, 2015 at 09:12:07AM -0500, Christoph Lameter wrote: On Thu, 23 Apr 2015, Paul E. McKenney wrote: DAX DAX is a mechanism for providing direct-memory access to high-speed non-volatile (AKA persistent) memory. Good introductions to DAX may be found in the following LWN articles: DAX is a mechanism to access memory not managed by the kernel and is the successor to XIP. It just happens to be needed for persistent memory. Fundamentally any driver can provide an MMAPPed interface to allow access to a devices memory. I will take another look, but others in this thread have called out difficulties with DAX's filesystem nature. Do not waste your time on that this is not what we want. Christoph here is more than stuborn and fails to see the world. Well, we do need to make sure that we are correctly representing DAX's capabilities. It is a hot topic, and others will probably also suggest that it be used. That said, at the moment, I don't see how it would help, given the need to migrate memory. Perhaps Boas Harrosh's patch set to allow struct pages to be associated might help? But from what I can see, a fair amount of other functionality would still be required either way. I am updating the DAX section a bit, but I don't claim that it is complete. Thanx, Paul -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 05:51:19PM -0700, Paul E. McKenney wrote... in reply to:

On Fri, Apr 24, 2015 at 03:00:18PM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
> > > Still no answer as to why is that not possible with the current scheme?
> > > You keep on talking about pointers and I keep on responding that this
> > > is a matter of making the address space compatible on both sides.
> >
> > So if we do that in a naive way, how can we migrate a chunk of memory to
> > video memory while still properly handling the case where the CPU tries
> > to access that same memory while it is migrated to the GPU memory?
>
> Well, that is the same issue that the migration code handles, which I
> submitted a long time ago to the kernel.

Would you have a URL or other pointer to this code?

> > Without modifying a single line of mm code, the only way to do this is
> > to either unmap the range being migrated from the CPU page table or to
> > mprotect it in some way.  In both cases the CPU access will trigger some
> > kind of fault.
>
> Yes, that is how Linux migration works.  If you can fix that then how
> about improving page migration in Linux between NUMA nodes first?

In principle, that also would be a good thing.  But why do that first?

> > This is not the behavior we want.  What we want is the same address
> > space while being able to migrate system memory to device memory (who
> > makes that decision should not be part of this discussion) while still
> > gracefully handling any CPU access.
>
> Well then there could be a situation where you have concurrent write
> access.  How do you reconcile that then?  Somehow you need to stall one
> or the other until the transaction is complete.

Or have store buffers on one or both sides.

> > This means that if the CPU accesses it, we want to migrate the memory
> > back to system memory.  To achieve this there is no way around adding a
> > couple of ifs inside the mm page-fault code path.  Now, do you want each
> > driver to add its own if branch, or do you want a common infrastructure
> > to do just that?
>
> If you can improve the page migration in general then we certainly would
> love that.  Having faultless migration is certainly a good thing for a
> lot of functionality that depends on page migration.

We do have to start somewhere, though.  If we insist on perfection for all situations before we agree to make a change, we won't be making very many changes, now will we?

> > As I keep saying, the solution you propose is what we have today.
> > Today we have a fake shared address space through the trick of remapping
> > system memory at the same address inside the GPU address space, and also
> > enforcing the use of a special memory allocator that goes behind the
> > back of the mm code.
>
> Hmmm...  I'd like to know more details about that.

As I understand it, the trick (if you can call it that) is having the device have the same memory-mapping capabilities as the CPUs.

> > As you pointed out, not using GPU memory is a waste and we want to be
> > able to use it.  Now Paul has more sophisticated hardware that offers
> > opportunities to do things in a more transparent and efficient way.
>
> Does this also work between NUMA nodes in a Power8 system?

Heh!  At the rate we are going with this discussion, Power8 will be obsolete before we have this in.  ;-)

Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 10:49:28AM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Paul E. McKenney wrote:
> > can deliver, but where the cost of full-fledged hand tuning cannot be
> > justified.  You seem to believe that this latter category is the empty
> > set, which I must confess does greatly surprise me.
>
> If there are already compromises being made, then why would you want to
> modify the kernel for this?  Some user-space coding and device drivers
> should be sufficient.

The goal is to gain a substantial performance improvement without any user-space changes.

Thanx, Paul
Re: Interacting with coherent memory on external devices
On Sat, Apr 25, 2015 at 01:32:39PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2015-04-24 at 22:32 -0400, Rik van Riel wrote:
> > > The result would be that the kernel would allocate only migratable
> > > pages within the CCAD device's memory, and even then only if
> > > memory was otherwise exhausted.
> >
> > Does it make sense to allocate the device's page tables in memory
> > belonging to the device?  Is this a necessary thing with some
> > devices?  Jerome's HMM comes to mind...
>
> In our case, the device's MMU shares the host page tables (which is why
> we can't use HMM, ie we can't have a page with different permissions on
> CPU vs. device, which HMM does).
>
> However, the device has a pretty fast path to system memory, so the best
> thing we can do is pin the workload to the same chip the device is
> connected to so those page tables aren't too far away.

And another update, diffs then full document.  Among other things, this version explicitly calls out the goal of gaining substantial performance without changing user applications, which should hopefully help.

Thanx, Paul

diff --git a/DeviceMem.txt b/DeviceMem.txt
index 15d0a8b5d360..3de70c4b9922 100644
--- a/DeviceMem.txt
+++ b/DeviceMem.txt
@@ -40,10 +40,13 @@
 	workloads will have less-predictable access patterns, and these
 	workloads can benefit from automatic migration of data between
 	device memory and system memory as access patterns change.
-	Furthermore, some devices will provide special hardware that
-	collects access statistics that can be used to determine whether
-	or not a given page of memory should be migrated, and if so,
-	to where.
+	In this latter case, the goal is not optimal performance,
+	but rather a significant increase in performance compared to
+	what the CPUs alone can provide without needing to recompile
+	any of the applications making up the workload.  Furthermore,
+	some devices will provide special hardware that collects access
+	statistics that can be used to determine whether or not a given
+	page of memory should be migrated, and if so, to where.

 	The purpose of this document is to explore how this access
 	and migration can be provided for within the Linux kernel.
@@ -146,6 +149,32 @@ REQUIREMENTS
 	required for low-latency applications that are sensitive
 	to OS jitter.

+	6.	It must be possible to cause an application to use a
+		CCAD device simply by switching dynamically linked
+		libraries, but without recompiling that application.
+		This implies the following requirements:
+
+		a.	Address spaces must be synchronized for a given
+			application on the CPUs and the CCAD.  In other
+			words, a given virtual address must access the same
+			physical memory from the CCAD device and from
+			the CPUs.
+
+		b.	Code running on the CCAD device must be able to
+			access the running application's memory,
+			regardless of how that memory was allocated,
+			including statically allocated at compile time.
+
+		c.	Use of the CCAD device must not interfere with
+			memory allocations that are never used by the
+			CCAD device.  For example, if a CCAD device
+			has 16GB of memory, that should not prevent an
+			application using that device from allocating
+			more than 16GB of memory.  For another example,
+			memory that is never accessed by a given CCAD
+			device should preferably remain outside of that
+			CCAD device's memory.
+

 POTENTIAL IDEAS
@@ -178,12 +207,11 @@ POTENTIAL IDEAS
 		physical address ranges of normal system memory would
 		be interleaved with those of device memory.

-		This would also require some sort of
-		migration infrastructure to be added, as autonuma would
-		not apply.  However, this approach has the advantage
-		of preventing allocations in these regions, at least
-		unless those allocations have been explicitly flagged
-		to go there.
+		This would also require some sort of migration
+		infrastructure to be added, as autonuma would not apply.
+		However, this approach has the advantage of preventing
+		allocations in these regions, at least unless those
+		allocations have been explicitly flagged to go there.

 	4.	Your idea here!

@@ -274,21
Re: Interacting with coherent memory on external devices
On Fri, 2015-04-24 at 22:32 -0400, Rik van Riel wrote:
> > The result would be that the kernel would allocate only migratable
> > pages within the CCAD device's memory, and even then only if
> > memory was otherwise exhausted.
>
> Does it make sense to allocate the device's page tables in memory
> belonging to the device?
>
> Is this a necessary thing with some devices?  Jerome's HMM comes
> to mind...

In our case, the device's MMU shares the host page tables (which is why we can't use HMM, ie we can't have a page with different permissions on CPU vs. device, which HMM does).

However, the device has a pretty fast path to system memory, so the best thing we can do is pin the workload to the same chip the device is connected to so those page tables aren't too far away.

Cheers, Ben.
Re: Interacting with coherent memory on external devices
On 04/21/2015 05:44 PM, Paul E. McKenney wrote:
> AUTONUMA
>
> 	The Linux kernel's autonuma facility supports migrating both
> 	memory and processes to promote NUMA memory locality.  It was
> 	accepted into 3.13 and is available in RHEL 7.0 and SLES 12.
> 	It is enabled by the Kconfig variable CONFIG_NUMA_BALANCING.
>
> 	This approach uses a kernel thread "knuma_scand" that periodically
> 	marks pages inaccessible.  The page-fault handler notes any
> 	mismatches between the NUMA node that the process is running on
> 	and the NUMA node on which the page resides.

Minor nit: marking pages inaccessible is done from task_work nowadays, there no longer is a kernel thread.

> 	The result would be that the kernel would allocate only migratable
> 	pages within the CCAD device's memory, and even then only if
> 	memory was otherwise exhausted.

Does it make sense to allocate the device's page tables in memory belonging to the device?

Is this a necessary thing with some devices?  Jerome's HMM comes to mind...

-- All rights reversed
Re: Interacting with coherent memory on external devices
On Fri, 2015-04-24 at 11:58 -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
> > > What exactly is the more advanced version's benefit?  What are the
> > > features that the other platforms do not provide?
> >
> > Transparent access to device memory from the CPU: you can map any of the
> > GPU memory inside the CPU and have the whole cache coherency, including
> > proper atomic memory operations.  CAPI is not some mumbo jumbo marketing
> > name; there is real hardware behind it.
>
> Got the hardware here but I am getting pretty sobered given what I heard
> here.  The IBM mumbo jumbo marketing comes down to "not much" now.

Ugh... First, nothing we propose precludes using it with explicit memory management the way you want, so I don't know why you have a problem here.  We are trying to cover a *different* usage model than yours, obviously.  But they aren't exclusive.

Secondly, none of what we are discussing here is supported by *existing* hardware, so whatever you have is not affected.  There is no CAPI-based coprocessor today that provides cachable memory to the system (though CAPI as a technology supports it), and no GPU doing that either *yet*.  Today's CAPI adapters can own host cache lines but don't expose large swaths of cachable local memory.

Finally, this discussion is not even specifically about CAPI or its performance.  It's about the *general* case of a coherent coprocessor sharing the MMU, whether via CAPI or whatever other technology that allows that sort of thing, which we may or may not be able to mention at this point.  CAPI is just an example because architecturally it allows that too.

Ben.
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 03:00:18PM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
> > > Still no answer as to why is that not possible with the current scheme?
> > > You keep on talking about pointers and I keep on responding that this
> > > is a matter of making the address space compatible on both sides.
> >
> > So if we do that in a naive way, how can we migrate a chunk of memory to
> > video memory while still properly handling the case where the CPU tries
> > to access that same memory while it is migrated to the GPU memory?
>
> Well, that is the same issue that the migration code handles, which I
> submitted a long time ago to the kernel.

Yes, so you had to modify the kernel for that!  So do we, and no, page migration as it exists is not sufficient and does not cover all the use cases we have.

> > Without modifying a single line of mm code, the only way to do this is
> > to either unmap the range being migrated from the CPU page table or to
> > mprotect it in some way.  In both cases the CPU access will trigger some
> > kind of fault.
>
> Yes, that is how Linux migration works.  If you can fix that then how
> about improving page migration in Linux between NUMA nodes first?

In my case I cannot use the page migration because there is nowhere to hook in to explain how to migrate things back and forth with a device.  The page migration code runs entirely on the CPU and enjoys the benefit of being able to do things atomically; I do not have that luxury.  Moreover, the core mm code assumes that a CPU PTE migration entry is a short-lived state.  In the case of migration to device memory we are talking about time spans of several minutes.  So obviously the existing page migration is not what we want; we want something similar but with different properties.  That is exactly what my HMM patchset provides.  What Paul wants to do, however, should be able to leverage the page migration that does exist.  But again, he has a far more advanced platform.

> > This is not the behavior we want.  What we want is the same address
> > space while being able to migrate system memory to device memory (who
> > makes that decision should not be part of this discussion) while still
> > gracefully handling any CPU access.
>
> Well then there could be a situation where you have concurrent write
> access.  How do you reconcile that then?  Somehow you need to stall one
> or the other until the transaction is complete.

No, it is exactly like threads on a CPU: if you have two threads that write to the same address without any kind of synchronization between them, you cannot predict what the end result will be.  The same happens here; either the GPU write goes last or the CPU one does.  Anyway, this is not the use case we have in mind.  We are thinking about concurrent access to the same page, but in a non-conflicting way.  Any conflicting access is a software bug, just as it is in the case of CPU threads.

> > This means that if the CPU accesses it, we want to migrate the memory
> > back to system memory.  To achieve this there is no way around adding a
> > couple of ifs inside the mm page-fault code path.  Now, do you want each
> > driver to add its own if branch, or do you want a common infrastructure
> > to do just that?
>
> If you can improve the page migration in general then we certainly would
> love that.  Having faultless migration is certainly a good thing for a
> lot of functionality that depends on page migration.

The faultless migration I am talking about is only on the GPU side, but this is just an extra feature where you keep something mapped read-only while migrating it to device memory, updating the GPU page table once done.  So the GPU keeps accessing system memory without interruption; this assumes read-only access.  Otherwise you need a faulting migration, though you can cooperate with the thread scheduler to schedule other threads while the migration is ongoing.
> > As I keep saying, the solution you propose is what we have today.
> > Today we have a fake shared address space through the trick of remapping
> > system memory at the same address inside the GPU address space, and also
> > enforcing the use of a special memory allocator that goes behind the
> > back of the mm code.
>
> Hmmm...  I'd like to know more details about that.

Well, there is no open-source OpenCL 2.0 stack for discrete GPUs.  But the idea is that you need a special allocator because the GPU driver needs to know about all the possible pages that might be used, ie there is no page fault, so all objects need to be mapped and thus all pages are pinned down.  Well, this is a little more complex, as the special allocator keeps track of each allocation, creating an object for each of them and trying to pin only the objects that are used by the current shader.  Anyway, the bottom line is that it needs a special allocator; you cannot use a mmaped file directly, or shared memory directly, or anonymous memory allocated outside the special allocator.  It requires pinning memory.  It cannot migrate memory to device memory.  We want to fix all that.

> > As you pointed out, not using
Re: Interacting with coherent memory on external devices
On Fri, 24 Apr 2015, Jerome Glisse wrote:
> > Still no answer as to why is that not possible with the current scheme?
> > You keep on talking about pointers and I keep on responding that this is
> > a matter of making the address space compatible on both sides.
>
> So if we do that in a naive way, how can we migrate a chunk of memory to
> video memory while still properly handling the case where the CPU tries
> to access that same memory while it is migrated to the GPU memory?

Well, that is the same issue that the migration code handles, which I submitted a long time ago to the kernel.

> Without modifying a single line of mm code, the only way to do this is to
> either unmap the range being migrated from the CPU page table or to
> mprotect it in some way.  In both cases the CPU access will trigger some
> kind of fault.

Yes, that is how Linux migration works.  If you can fix that then how about improving page migration in Linux between NUMA nodes first?

> This is not the behavior we want.  What we want is the same address space
> while being able to migrate system memory to device memory (who makes
> that decision should not be part of this discussion) while still
> gracefully handling any CPU access.

Well then there could be a situation where you have concurrent write access.  How do you reconcile that then?  Somehow you need to stall one or the other until the transaction is complete.

> This means that if the CPU accesses it, we want to migrate the memory
> back to system memory.  To achieve this there is no way around adding a
> couple of ifs inside the mm page-fault code path.  Now, do you want each
> driver to add its own if branch, or do you want a common infrastructure
> to do just that?

If you can improve the page migration in general then we certainly would love that.  Having faultless migration is certainly a good thing for a lot of functionality that depends on page migration.

> As I keep saying, the solution you propose is what we have today.  Today
> we have a fake shared address space through the trick of remapping system
> memory at the same address inside the GPU address space, and also
> enforcing the use of a special memory allocator that goes behind the back
> of the mm code.

Hmmm...  I'd like to know more details about that.

> As you pointed out, not using GPU memory is a waste and we want to be
> able to use it.  Now Paul has more sophisticated hardware that offers
> opportunities to do things in a more transparent and efficient way.

Does this also work between NUMA nodes in a Power8 system?
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 01:56:45PM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
> > > Right, this is how things work and you could improve on that.  Stay
> > > with the scheme.  Why would that not work if you map things the same
> > > way in both environments, if both accelerator and host processor can
> > > access each other's memory?
> >
> > Again and again, shared address space: having a pointer means the same
> > thing for the GPU as it means for the CPU, ie having a random pointer
> > point to the same memory whether it is accessed by the GPU or the CPU,
> > while also keeping the properties of the backing memory.  It can be
> > shared memory from another process, a file mmaped from disk or simply
> > anonymous memory, and thus we have no control whatsoever over how such
> > memory is allocated.
>
> Still no answer as to why is that not possible with the current scheme?
> You keep on talking about pointers and I keep on responding that this is
> a matter of making the address space compatible on both sides.

So if we do that in a naive way, how can we migrate a chunk of memory to video memory while still properly handling the case where the CPU tries to access that same memory while it is migrated to the GPU memory?

Without modifying a single line of mm code, the only way to do this is to either unmap the range being migrated from the CPU page table or to mprotect it in some way.  In both cases the CPU access will trigger some kind of fault.

This is not the behavior we want.  What we want is the same address space while being able to migrate system memory to device memory (who makes that decision should not be part of this discussion) while still gracefully handling any CPU access.  This means that if the CPU accesses it, we want to migrate the memory back to system memory.  To achieve this there is no way around adding a couple of ifs inside the mm page-fault code path.  Now, do you want each driver to add its own if branch, or do you want a common infrastructure to do just that?
As I keep saying, the solution you propose is what we have today: a fake shared address space through the trick of remapping system memory at the same address inside the GPU address space, plus enforcing the use of a special memory allocator that goes behind the back of the mm code.  But this limits you to only using system memory; you cannot use video memory transparently through such a scheme.  One trick used today is to copy memory to device memory and not bother with CPU access, pretending it cannot happen, and as such the GPU and CPU can diverge in what they see for the same address.  We want to avoid tricks like this that just lead to weird and unexpected behavior.  As you pointed out, not using GPU memory is a waste and we want to be able to use it.  Now Paul has more sophisticated hardware that offers opportunities to do things in a more transparent and efficient way.

> > Then you add transparent migration (transparent in the sense that we
> > can handle CPU page faults on migrated memory) and you will see that
> > you need to modify the kernel to become aware of this and provide
> > common code to deal with all this.
>
> If the GPU works like a CPU (which I keep hearing) then you should also
> be able to run a Linux kernel on it and make it a regular NUMA node.
> Hey, why don't we make the host CPU a GPU (hello Xeon Phi).

I am not saying it works like a CPU; I am saying it should face the same kind of pattern when it comes to page faults, ie page faults are not the end of the world for the GPU, and you should not assume that all GPU threads will wait on a page fault, because that is not the common case on the CPU either.  Yes, we prefer when page faults never happen; so does the CPU.

No, you cannot run the Linux kernel on the GPU unless you are willing to let the kernel run on a heterogeneous architecture with different instruction sets, not even going into the problems of ring level/system level.  We might one day go down that road, but I see no compelling reason today.

Cheers, Jérôme
Re: Interacting with coherent memory on external devices
On Fri, 24 Apr 2015, Jerome Glisse wrote:
> > Right, this is how things work and you could improve on that.  Stay
> > with the scheme.  Why would that not work if you map things the same
> > way in both environments, if both accelerator and host processor can
> > access each other's memory?
>
> Again and again, shared address space: having a pointer means the same
> thing for the GPU as it means for the CPU, ie having a random pointer
> point to the same memory whether it is accessed by the GPU or the CPU,
> while also keeping the properties of the backing memory.  It can be
> shared memory from another process, a file mmaped from disk or simply
> anonymous memory, and thus we have no control whatsoever over how such
> memory is allocated.

Still no answer as to why that is not possible with the current scheme?  You keep on talking about pointers and I keep on responding that this is a matter of making the address space compatible on both sides.

> Then you add transparent migration (transparent in the sense that we can
> handle CPU page faults on migrated memory) and you will see that you
> need to modify the kernel to become aware of this and provide common
> code to deal with all this.

If the GPU works like a CPU (which I keep hearing) then you should also be able to run a Linux kernel on it and make it a regular NUMA node.  Hey, why don't we make the host CPU a GPU (hello Xeon Phi).
Re: Interacting with coherent memory on external devices
On 04/23/2015 07:22 PM, Jerome Glisse wrote:
> On Thu, Apr 23, 2015 at 09:20:55AM -0500, Christoph Lameter wrote:
> > On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
> > > > There are hooks in glibc where you can replace the memory
> > > > management of the apps if you want that.
> > >
> > > We don't control the app.  Let's say we are doing a plugin for
> > > libfoo which accelerates "foo" using GPUs.
> >
> > There are numerous examples of malloc implementations that can be used
> > for apps without modifying the app.
>
> What about shared memory passed between processes?  Or a mmaped file?
> Or a library that is loaded through dlopen and thus has no way to
> control any allocation that happened before it became active?
>
> > > Now some other app we have no control over uses libfoo.  So pointers
> > > already allocated/mapped, possibly a long time ago, will hit libfoo
> > > (or the plugin) and we need GPUs to churn on the data.
> >
> > If the GPU would need to suspend one of its computation threads to
> > wait on a mapping to be established on demand or so, then it looks
> > like the performance of the parallel threads on a GPU will be
> > significantly compromised.  You would want to do the transfer
> > explicitly in some fashion that meshes with the concurrent calculation
> > in the GPU.  You do not want stalls while GPU number crunching is
> > ongoing.
>
> You do not understand how GPUs work.  GPUs have a pool of threads, and
> they always try to have the pool as big as possible, so that when a
> group of threads is waiting for some memory access, there are other
> threads ready to perform some operation.  GPUs are about hiding memory
> latency; that's what they are good at.  But they only achieve that when
> they have more threads in flight than compute units.  The whole thread
> scheduling is done by hardware and barely controlled by the device
> driver.
>
> So no, having the GPU wait for a page fault is not as dramatic as you
> think.  If you use GPUs as they are intended to be used, you might even
> never notice the page fault and still reach close to the theoretical
> throughput of the GPU.
> > > The point I'm making is you are arguing against a usage model which
> > > has been repeatedly asked for by large numbers of customers (after
> > > all, that's also why HMM exists).
> >
> > I am still not clear on what the use case for this would be.  Who is
> > asking for this?
>
> Everyone but you?  OpenCL 2.0 specifically requests it and has several
> levels of support for transparent address spaces.  The lowest one is the
> one implemented today, in which applications need to use a special
> memory allocator.  The most advanced one implies integration with the
> kernel, in which any memory (mmaped file, shared memory or anonymous
> memory) can be used by the GPU and does not need to come from a special
> allocator.
>
> Everyone in the industry is moving toward the most advanced one.  That
> is the raison d'être of HMM: to provide this functionality on hw
> platforms that do not have things such as CAPI, which is x86/arm.  So
> the use case is all applications using OpenCL or CUDA; pretty much
> everyone doing GPGPU wants this.  I dunno how you can't see that.  A
> shared address space is so much easier.  Believe it or not, most coders
> do not have deep knowledge of how things work, and if you can remove the
> complexity of different memory allocations and different address spaces
> from them, they will be happy.
>
> Cheers, Jérôme

I second what Jerome said, and add that one of the key features of HSA is the ptr-is-a-ptr scheme, where applications do *not* need to handle different address spaces.  Instead, all the memory is seen as a unified address space.  See slide 6 of the following presentation: http://www.slideshare.net/hsafoundation/hsa-overview

Thanks, Oded

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org.  For more info on Linux MM, see: http://www.linux-mm.org/ .
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 11:58:39AM -0500, Christoph Lameter wrote: > On Fri, 24 Apr 2015, Jerome Glisse wrote: > > > > What exactly is the more advanced version's benefit? What are the features > > > that the other platforms do not provide? > > > > Transparent access to device memory from the CPU, you can map any of the GPU > > memory inside the CPU and have the whole cache coherency including proper > > atomic memory operation. CAPI is not some mumbo jumbo marketing name there > > is real hardware behind it. > > Got the hardware here but I am getting pretty sobered given what I heard > here. The IBM mumbo jumpo marketing comes down to "not much" now. > > > On x86 you have to take into account the PCI bar size, you also have to take > > into account that PCIE transaction are really bad when it comes to sharing > > memory with CPU. CAPI really improve things here. > > Ok that would be interesting for the general device driver case. Can you > show a real performance benefit here of CAPI transactions vs. PCI-E > transactions? I am sure IBM will show benchmark here when they have everything in place. I am not working on CAPI personnaly, i just went through some of the specification for it. > > So on x86 even if you could map all the GPU memory it would still be a bad > > solution and thing like atomic memory operation might not even work > > properly. > > That is solvable and doable in many other ways if needed. Actually I'd > prefer a Xeon Phi in that case because then we also have the same > instruction set. Having locks work right with different instruction sets > and different coherency schemes. Ewww... > Well then go the Xeon Phi solution way and let people that want to provide a different simpler (from programmer point of view) solution work on it. > > > > Then you have the problem of fast memory access and you are proposing to > > > complicate that access path on the GPU. 
> > > > No, i am proposing to have a solution where people doing such kind of work > > load can leverage the GPU, yes it will not be as fast as people hand tuning > > and rewritting their application for the GPU but it will still be faster > > by a significant factor than only using the CPU. > > Well the general purpose processors also also gaining more floating point > capabilities which increases the pressure on accellerators to become more > specialized. > > > Moreover i am saying that this can happen without even touching a single > > line of code of many many applications, because many of them rely on library > > and those are the only one that would need to know about GPU. > > Yea. We have heard this numerous times in parallel computing and it never > really worked right. Because you had split userspace, a pointer value was not pointing to the same thing on the GPU as on the CPU so porting library or application is hard and troublesome. AMD is already working on porting general application or library to leverage the brave new world of share address space (libreoffice, gimp, ...). Other people keep presuring for same address space, again this is the corner stone of OpenCL 2.0. I can not predict if it will work this time, if all meaning full and usefull library will start leveraging GPU. All i am trying to do is solve the split address space problem. Problem that you seem to ignore completely because you are happy the way things are. Other people are not happy. > > > Finaly i am saying that having a unified address space btw the GPU and CPU > > is a primordial prerequisite for this to happen in a transparent fashion > > and thus DAX solution is non-sense and does not provide transparent address > > space sharing. 
DAX solution is not even something new, this is how today > > stack is working, no need for DAX, userspace just mmap the device driver > > file and that's how they access the GPU accessible memory (which in most > > case is just system memory mapped through the device file to the user > > application). > > Right this is how things work and you could improve on that. Stay with the > scheme. Why would that not work if you map things the same way in both > environments if both accellerator and host processor can acceess each > others memory? Again and again: a shared address space means that a pointer means the same thing for the GPU as it does for the CPU, i.e. a random pointer points to the same memory whether it is accessed by the GPU or the CPU, while also keeping the properties of the backing memory. It can be memory shared with another process, a file mmaped from disk, or simply anonymous memory, so we have no control whatsoever over how such memory is allocated. Then add transparent migration (transparent in the sense that we can handle CPU page faults on migrated memory) and you will see that you need to modify the kernel to be aware of this and provide common code to deal with it all. Cheers, Jérôme -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the
Re: Interacting with coherent memory on external devices
On Fri, 24 Apr 2015, Jerome Glisse wrote: > > What exactly is the more advanced version's benefit? What are the features > > that the other platforms do not provide? > > Transparent access to device memory from the CPU, you can map any of the GPU > memory inside the CPU and have the whole cache coherency including proper > atomic memory operation. CAPI is not some mumbo jumbo marketing name there > is real hardware behind it. Got the hardware here but I am getting pretty sobered given what I heard here. The IBM mumbo jumbo marketing comes down to "not much" now. > On x86 you have to take into account the PCI bar size, you also have to take > into account that PCIE transaction are really bad when it comes to sharing > memory with CPU. CAPI really improve things here. Ok, that would be interesting for the general device driver case. Can you show a real performance benefit here of CAPI transactions vs. PCI-E transactions? > So on x86 even if you could map all the GPU memory it would still be a bad > solution and thing like atomic memory operation might not even work properly. That is solvable and doable in many other ways if needed. Actually I'd prefer a Xeon Phi in that case, because then we also have the same instruction set. Having locks work right with different instruction sets and different coherency schemes? Ewww... > > Then you have the problem of fast memory access and you are proposing to > > complicate that access path on the GPU. > > No, i am proposing to have a solution where people doing such kind of work > load can leverage the GPU, yes it will not be as fast as people hand tuning > and rewritting their application for the GPU but it will still be faster > by a significant factor than only using the CPU. Well, general purpose processors are also gaining more floating point capabilities, which increases the pressure on accelerators to become more specialized. 
> Moreover i am saying that this can happen without even touching a single > line of code of many many applications, because many of them rely on library > and those are the only one that would need to know about GPU. Yea. We have heard this numerous times in parallel computing and it never really worked right. > Finaly i am saying that having a unified address space btw the GPU and CPU > is a primordial prerequisite for this to happen in a transparent fashion > and thus DAX solution is non-sense and does not provide transparent address > space sharing. DAX solution is not even something new, this is how today > stack is working, no need for DAX, userspace just mmap the device driver > file and that's how they access the GPU accessible memory (which in most > case is just system memory mapped through the device file to the user > application). Right, this is how things work, and you could improve on that. Stay with the scheme. Why would that not work if you map things the same way in both environments, given that both accelerator and host processor can access each other's memory?
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 11:03:52AM -0500, Christoph Lameter wrote: > On Fri, 24 Apr 2015, Jerome Glisse wrote: > > > On Fri, Apr 24, 2015 at 09:29:12AM -0500, Christoph Lameter wrote: > > > On Thu, 23 Apr 2015, Jerome Glisse wrote: > > > > > > > No this not have been solve properly. Today solution is doing an > > > > explicit > > > > copy and again and again when complex data struct are involve (list, > > > > tree, > > > > ...) this is extremly tedious and hard to debug. So today solution often > > > > restrict themself to easy thing like matrix multiplication. But if you > > > > provide a unified address space then you make things a lot easiers for a > > > > lot more usecase. That's a fact, and again OpenCL 2.0 which is an > > > > industry > > > > standard is a proof that unified address space is one of the most > > > > important > > > > feature requested by user of GPGPU. You might not care but the rest of > > > > the > > > > world does. > > > > > > You could use page tables on the kernel side to transfer data on demand > > > from the GPU. And you can use a device driver to establish mappings to the > > > GPUs memory. > > > > > > There is no copy needed with these approaches. > > > > So you are telling me to do get_user_page() ? If so you aware that this pins > > memory ? So what happens when the GPU wants to access a range of 32GB of > > memory ? I pin everything ? > > Use either a device driver to create PTEs pointing to the data or do > something similar like what DAX does. Pinning can be avoided if you use > mmu_notifiers. Those will give you a callback before the OS removes the > data and thus you can operate without pinning. So you are actually telling me to do what I am already doing inside the HMM patchset? Because what you describe here is exactly what the HMM patchset does. So you are acknowledging that we need work inside the kernel? 
That being said, Paul has the luck of a more advanced platform, where what I am doing would actually underuse the platform's capabilities. So he needs a different solution. > > > Overall the throughput of the GPU will stay close to its theoritical maximum > > if you have enough other thread that can progress and this is very common. > > GPUs operate on groups of threads not single ones. If you stall > then there will be a stall of a whole group of them. We are dealing with > accellerators here that are different for performance reasons. They are > not to be treated like regular processor, nor is memory like > operating like host mmemory. Again, I know how GPUs work: they operate on groups of threads, I am well aware of that, and the group size is often 32 or 64 threads. But the hardware keeps a large pool of thread groups, something like 2^11 or 2^12 thread groups in flight for 2^4 or 2^5 units capable of working on a thread group (in thread count this is 2^15/2^16 threads in flight for 2^9/2^10 cores). So, as on the CPU, we do not expect all of the 2^11/2^12 thread groups to hit a page fault, and I am saying that as long as only a small number of groups hit one, say 2^3 groups (2^8/2^9 threads), you still have a large number of thread groups that can make progress without being impacted whatsoever. And you can bet that GPU designers are also improving this by allowing faulting threads to be swapped out and runnable ones swapped in, so the overall 2^16 threads in flight might be a lot bigger in future hardware, giving even more chances to hide page faults. A GPU can operate on host memory, and you can still saturate the GPU from host memory as long as the workloads you are running are not bandwidth starved. I know this is unlikely for a GPU, but again, think of several _different_ applications; some of those applications might already have their dataset in GPU memory and can thus run alongside slower threads that are limited by system memory bandwidth. 
But still you can saturate your GPU that way. > > > But IBM here want to go further and to provide a more advance solution, > > so their need are specific to there platform and we can not know if AMD, > > ARM or Intel will want to go down the same road, they do not seem to be > > interested. Does it means we should not support IBM ? I think it would be > > wrong. > > What exactly is the more advanced version's benefit? What are the features > that the other platforms do not provide? Transparent access to device memory from the CPU: you can map any of the GPU memory into the CPU and have full cache coherency, including proper atomic memory operations. CAPI is not some mumbo jumbo marketing name; there is real hardware behind it. On x86 you have to take into account the PCI BAR size, and you also have to take into account that PCIe transactions are really bad when it comes to sharing memory with the CPU. CAPI really improves things here. So on x86, even if you could map all the GPU memory, it would still be a bad solution, and things like atomic memory operations might not even work properly. > > > > This sounds more like a case for a
Re: Interacting with coherent memory on external devices
On 04/24/2015 10:30 AM, Christoph Lameter wrote: > On Thu, 23 Apr 2015, Paul E. McKenney wrote: > >> If by "entire industry" you mean everyone who might want to use hardware >> acceleration, for example, including mechanical computer-aided design, >> I am skeptical. > > The industry designs GPUs with super fast special ram and accellerators > with special ram designed to do fast searches and you think you can demand > page > that stuff in from the main processor? DRAM access latencies are a few hundred CPU cycles, but somehow CPUs can still do computations at a fast speed, and we do not require gigabytes of L2-cache-speed memory in the system. It turns out the vast majority of programs have working sets, and data access patterns where prefetching works satisfactorily. With GPU calculations done transparently by libraries, and largely hidden from programs, why would this be any different? -- All rights reversed
Re: Interacting with coherent memory on external devices
On 04/24/2015 11:49 AM, Christoph Lameter wrote: > On Fri, 24 Apr 2015, Paul E. McKenney wrote: > >> can deliver, but where the cost of full-fledge hand tuning cannot be >> justified. >> >> You seem to believe that this latter category is the empty set, which >> I must confess does greatly surprise me. > > If there are already compromises are being made then why would you want to > modify the kernel for this? Some user space coding and device drivers > should be sufficient. You assume only one program at a time would get to use the GPU for accelerated computations, and the GPU would get dedicated to that program. That will not be the case when you have libraries using the GPU for computations. There could be dozens of programs in the system using that library, with no knowledge of how many GPU resources are used by the other programs. There is a very clear cut case for having the OS manage the GPU resources transparently, just like it does for all the other resources in the system. -- All rights reversed
Re: Interacting with coherent memory on external devices
On Fri, 24 Apr 2015, Jerome Glisse wrote: > On Fri, Apr 24, 2015 at 09:29:12AM -0500, Christoph Lameter wrote: > > On Thu, 23 Apr 2015, Jerome Glisse wrote: > > > > > No this not have been solve properly. Today solution is doing an explicit > > > copy and again and again when complex data struct are involve (list, tree, > > > ...) this is extremly tedious and hard to debug. So today solution often > > > restrict themself to easy thing like matrix multiplication. But if you > > > provide a unified address space then you make things a lot easiers for a > > > lot more usecase. That's a fact, and again OpenCL 2.0 which is an industry > > > standard is a proof that unified address space is one of the most > > > important > > > feature requested by user of GPGPU. You might not care but the rest of the > > > world does. > > > > You could use page tables on the kernel side to transfer data on demand > > from the GPU. And you can use a device driver to establish mappings to the > > GPUs memory. > > > > There is no copy needed with these approaches. > > So you are telling me to do get_user_page() ? If so you aware that this pins > memory ? So what happens when the GPU wants to access a range of 32GB of > memory ? I pin everything ? Use either a device driver to create PTEs pointing to the data, or do something similar to what DAX does. Pinning can be avoided if you use mmu_notifiers. Those will give you a callback before the OS removes the data, and thus you can operate without pinning. > Overall the throughput of the GPU will stay close to its theoritical maximum > if you have enough other thread that can progress and this is very common. GPUs operate on groups of threads, not single ones. If you stall, then a whole group of them stalls. We are dealing with accelerators here that are different for performance reasons. They are not to be treated like regular processors, nor does their memory operate like host memory. 
> But IBM here want to go further and to provide a more advance solution, > so their need are specific to there platform and we can not know if AMD, > ARM or Intel will want to go down the same road, they do not seem to be > interested. Does it means we should not support IBM ? I think it would be > wrong. What exactly is the more advanced version's benefit? What are the features that the other platforms do not provide? > > This sounds more like a case for a general purpose processor. If it is a > > special device then it will typically also have special memory to allow > > fast searches. > > No this kind of thing can be fast on a GPU, with GPU you easily have x500 > more cores than CPU cores, so you can slice the dataset even more and have > each of the GPU core perform the search. Note that i am not only thinking > of stupid memcmp here it can be something more complex like searching a > pattern that allow variation and that require a whole program to decide if > a chunk falls under the variation rules or not. Then you have the problem of fast memory access and you are proposing to complicate that access path on the GPU.
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 09:30:40AM -0500, Christoph Lameter wrote: > On Thu, 23 Apr 2015, Paul E. McKenney wrote: > > > If by "entire industry" you mean everyone who might want to use hardware > > acceleration, for example, including mechanical computer-aided design, > > I am skeptical. > > The industry designs GPUs with super fast special ram and accellerators > with special ram designed to do fast searches and you think you can demand > page > that stuff in from the main processor? > Why do you think AMD and NVidia are adding page fault support to their GPUs in the first place? They are not doing this on a whim; they have thought carefully about it. Are you saying you know better than the two biggest GPU designers on the planet? And who do you think is pushing for such a thing in the kernel? Do you think we are working on this on a whim? Because we woke up one day and thought it would be cool and should be done this way? Yes, if all your GPU does is page fault, it will be disastrous, but is that the usual thing we see on the CPU? No! Are people complaining about the numerous page faults that happen over a day? No, the vast majority of users are completely oblivious to page faults. This is how it works on the CPU, and yes, this can work for the GPU too. What happens on the CPU? Well, the CPU can switch to work on a different thread or a different application altogether. The same thing will happen on the GPU. If you have enough jobs, your GPU will be busy and you will never worry about page faults, because overall your GPU will deliver the same kind of throughput as if there were no page faults. It can very well be buried in the overall noise if the ratio of runnable threads to page-faulting threads is high enough, which is most of the time the case for the CPU; why would the same assumption not hold for the GPU? Note that I am not dismissing the low latency folks. I know they exist, I know they hate page faults, and in no way will what we propose make things worse for them. 
They will be able to keep the same kind of control they cherish, but this does not mean you should go on a holy crusade pretending that other people's workloads do not exist. They do exist. Page faults are not evil; they have proved useful to the whole computer industry on the CPU side. To be sure you are not misinterpreting what we propose: in no way are we saying we are going to migrate things on page fault for everyone. We are saying that, first, the device driver decides where things need to be (system memory or local memory); the device driver can get hints/requests from userspace for this (as it does today). So no change whatsoever here: people who hand tune things will still be able to do so. Now we want to add the case where the device driver does not get any kind of directive or hint from userspace. So, like autonuma, we simply collect information from the GPU about what is accessed often and then migrate it transparently (yes, this can happen without interruption to the GPU). So you are migrating from a memory that has 16GB/s or 32GB/s of bandwidth to device memory that has 500GB/s. This is a valid use case; there are many people out there who do not want to learn about hand tuning their applications for the GPU but could nonetheless benefit from it. Cheers, Jérôme
Re: Interacting with coherent memory on external devices
On 04/24/2015 10:01 AM, Christoph Lameter wrote: > On Thu, 23 Apr 2015, Paul E. McKenney wrote: > >>> As far as I know Jerome is talkeing about HPC loads and high performance >>> GPU processing. This is the same use case. >> >> The difference is sensitivity to latency. You have latency-sensitive >> HPC workloads, and Jerome is talking about HPC workloads that need >> high throughput, but are insensitive to latency. > > Those are correlated. > >>> What you are proposing for High Performacne Computing is reducing the >>> performance these guys trying to get. You cannot sell someone a Volkswagen >>> if he needs the Ferrari. >> >> You do need the low-latency Ferrari. But others are best served by a >> high-throughput freight train. > > The problem is that they want to run 2000 trains at the same time > and they all must arrive at the destination before they can be send on > their next trip. 1999 trains will be sitting idle because they need > to wait of the one train that was delayed. This reduces the troughput. > People really would like all 2000 trains to arrive on schedule so that > they get more performance. So you run 4000 or even 6000 trains, and have some subset of them run at full steam, while others are waiting on memory accesses. In reality the overcommit factor is likely much smaller, because the GPU threads run and block on memory in smaller, more manageable numbers, say a few dozen at a time. -- All rights reversed
Re: Interacting with coherent memory on external devices
On Fri, 24 Apr 2015, Paul E. McKenney wrote: > > DAX is a mechanism to access memory not managed by the kernel and is the > > successor to XIP. It just happens to be needed for persistent memory. > > Fundamentally any driver can provide an MMAPPed interface to allow access > > to a devices memory. > > I will take another look, but others in this thread have called out > difficulties with DAX's filesystem nature. Right, so you do not need the filesystem structure. Simply writing a device driver that mmaps data as needed from the coprocessor will also do the trick.
Re: Interacting with coherent memory on external devices
On Fri, 24 Apr 2015, Paul E. McKenney wrote: > can deliver, but where the cost of full-fledge hand tuning cannot be > justified. > > You seem to believe that this latter category is the empty set, which > I must confess does greatly surprise me. If compromises are already being made, then why would you want to modify the kernel for this? Some user space coding and device drivers should be sufficient.
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 07:57:38AM -0700, Paul E. McKenney wrote: > On Fri, Apr 24, 2015 at 09:12:07AM -0500, Christoph Lameter wrote: > > On Thu, 23 Apr 2015, Paul E. McKenney wrote: > > > > > > > > DAX > > > > > > DAX is a mechanism for providing direct-memory access to > > > high-speed non-volatile (AKA "persistent") memory. Good > > > introductions to DAX may be found in the following LWN > > > articles: > > > > DAX is a mechanism to access memory not managed by the kernel and is the > > successor to XIP. It just happens to be needed for persistent memory. > > Fundamentally any driver can provide an MMAPPed interface to allow access > > to a devices memory. > > I will take another look, but others in this thread have called out > difficulties with DAX's filesystem nature. Do not waste your time on that; it is not what we want. Christoph here is more than stubborn and fails to see the wider picture. Cheers, Jérôme
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 09:29:12AM -0500, Christoph Lameter wrote: > On Thu, 23 Apr 2015, Jerome Glisse wrote: > > > No this not have been solve properly. Today solution is doing an explicit > > copy and again and again when complex data struct are involve (list, tree, > > ...) this is extremly tedious and hard to debug. So today solution often > > restrict themself to easy thing like matrix multiplication. But if you > > provide a unified address space then you make things a lot easiers for a > > lot more usecase. That's a fact, and again OpenCL 2.0 which is an industry > > standard is a proof that unified address space is one of the most important > > feature requested by user of GPGPU. You might not care but the rest of the > > world does. > > You could use page tables on the kernel side to transfer data on demand > from the GPU. And you can use a device driver to establish mappings to the > GPUs memory. > > There is no copy needed with these approaches. So you are telling me to do get_user_page()? If so, are you aware that this pins memory? So what happens when the GPU wants to access a range of 32GB of memory? I pin everything? I am not talking only about transfers from the GPU to system memory; I am talking about applications that have: dataset = mmap(dataset, 32<<30); // ... dl_open(superlibrary) superlibrary.dosomething(dataset); So the application here has no clue about the GPU, and we do not want to change that. Yes, this is a valid use case, and countless users ask for it. How can the superlibrary give the GPU access to the dataset? Does it have to call get_user_page() on every single page, effectively pinning memory? Should it allocate GPU memory through a special API and memcpy? What HMM does is allow the process page table to be shared with the GPU, so the GPU can transparently access the dataset (no pinning whatsoever). Will there be page faults? 
They can happen, and if they do, the assumption is that you have more threads that do not page fault than ones that do, so the GPU stays saturated (i.e. all its units are fed with something to do) while the page faults are resolved. For some workloads, yes, you will see the penalty of the page fault, i.e. you will have a group of threads that finishes late, but the thing you seem to fail to get is that all the other GPU threads can make progress and finish even before the page fault is resolved. It all depends on the application. Moreover, if you have several applications, the GPU can switch to a different application and make progress on it too. Overall the throughput of the GPU will stay close to its theoretical maximum if you have enough other threads that can progress, and this is very common. > > > > I think these two things need to be separated. The shift-the-memory-back- > > > and-forth approach should be separate and if someone wants to use the > > > thing then it should also work on other platforms like ARM and Intel. > > > > What IBM does with there platform is there choice, they can not force ARM > > or Intel or AMD to do the same. Each of those might have different view > > on what is their most important target. For instance i highly doubt ARM > > cares about any of this. > > Well but the kernel code submitted should allow for easy use on other > platform. I.e. Intel processors should be able to implement the > "transparent" memory by establishing device mappings to PCI-E space > and/or transferring data from the GPU and signaling the GPU to establish > such a mapping. HMM does that; it only requires the GPU to have a certain set of features, and the only requirement for the platform is to offer a bus which allows cache coherent system memory access, such as PCIe. 
But IBM here wants to go further and provide a more advanced solution, so their needs are specific to their platform, and we cannot know whether AMD, ARM or Intel will want to go down the same road; they do not seem to be interested. Does that mean we should not support IBM? I think that would be wrong. > > > Only time critical application care about latency, everyone else cares > > about throughput, where the applications can runs for days, weeks, months > > before producing any useable/meaningfull results. Many of which do not > > care a tiny bit about latency because they can perform independant > > computation. > > Computationally intensive high performance application care about > random latency introduced to computational threads because that is > delaying the data exchange and thus slows everything down. And that is the > typical case of a GPUI. You assume that all HPC applications have heavy data exchange; I gave you examples of applications with zero data exchange between threads whatsoever. Those use cases exist, and we want to support them too. Yes, for threads where there is data exchange, page faults stall jobs, but again we are talking about HPC where several _different_ applications run in parallel and share resources, so while a page fault can block part of an application, other applications can make progress.
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 09:12:07AM -0500, Christoph Lameter wrote: > On Thu, 23 Apr 2015, Paul E. McKenney wrote: > > > > > DAX > > > > DAX is a mechanism for providing direct-memory access to > > high-speed non-volatile (AKA "persistent") memory. Good > > introductions to DAX may be found in the following LWN > > articles: > > DAX is a mechanism to access memory not managed by the kernel and is the > successor to XIP. It just happens to be needed for persistent memory. > Fundamentally any driver can provide an MMAPPed interface to allow access > to a devices memory. I will take another look, but others in this thread have called out difficulties with DAX's filesystem nature. Thanx, Paul
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 09:30:40AM -0500, Christoph Lameter wrote: > On Thu, 23 Apr 2015, Paul E. McKenney wrote: > > > If by "entire industry" you mean everyone who might want to use hardware > > acceleration, for example, including mechanical computer-aided design, > > I am skeptical. > > The industry designs GPUs with super fast special ram and accellerators > with special ram designed to do fast searches and you think you can demand > page > that stuff in from the main processor? The demand paging is indeed a drawback for the option of using autonuma to handle the migration. And again, this is not intended to replace the careful hand-tuning that is required to get the last drop of performance out of the system. It is instead intended to handle the cases where the application needs substantially more performance than the CPUs alone can deliver, but where the cost of full-fledged hand tuning cannot be justified. You seem to believe that this latter category is the empty set, which I must confess does greatly surprise me. Thanx, Paul
Re: Interacting with coherent memory on external devices
On Thu, 23 Apr 2015, Paul E. McKenney wrote: > If by "entire industry" you mean everyone who might want to use hardware > acceleration, for example, including mechanical computer-aided design, > I am skeptical. The industry designs GPUs with super fast special ram and accellerators with special ram designed to do fast searches and you think you can demand page that stuff in from the main processor?
Re: Interacting with coherent memory on external devices
On Thu, 23 Apr 2015, Jerome Glisse wrote: > No this not have been solve properly. Today solution is doing an explicit > copy and again and again when complex data struct are involve (list, tree, > ...) this is extremly tedious and hard to debug. So today solution often > restrict themself to easy thing like matrix multiplication. But if you > provide a unified address space then you make things a lot easiers for a > lot more usecase. That's a fact, and again OpenCL 2.0 which is an industry > standard is a proof that unified address space is one of the most important > feature requested by user of GPGPU. You might not care but the rest of the > world does. You could use page tables on the kernel side to transfer data on demand from the GPU. And you can use a device driver to establish mappings to the GPU's memory. There is no copy needed with these approaches. > > I think these two things need to be separated. The shift-the-memory-back- > > and-forth approach should be separate and if someone wants to use the > > thing then it should also work on other platforms like ARM and Intel. > > What IBM does with there platform is there choice, they can not force ARM > or Intel or AMD to do the same. Each of those might have different view > on what is their most important target. For instance i highly doubt ARM > cares about any of this. Well, but the kernel code submitted should allow for easy use on other platforms. I.e. Intel processors should be able to implement the "transparent" memory by establishing device mappings to PCI-E space and/or transferring data from the GPU and signaling the GPU to establish such a mapping. > Only time critical application care about latency, everyone else cares > about throughput, where the applications can runs for days, weeks, months > before producing any useable/meaningfull results. Many of which do not > care a tiny bit about latency because they can perform independant > computation. 
Computationally intensive high-performance applications care about random latency introduced to computational threads, because that delays the data exchange and thus slows everything down. And that is the typical case for a GPU.

> Take a company rendering a movie, for instance: they want to render the
> millions of frames as fast as possible, but each frame can be rendered
> independently. The only data they share is the input geometry, textures and
> lighting, but these are constant; the rendering of one frame does not
> depend on the rendering of the previous one (leaving post-processing like
> motion blur aside).

The rendering would be done by the GPU, and this will involve concurrent, rapid access to data. Performance is certainly impacted if the GPU cannot use its own RAM, which is designed for the proper feeding of its processing units. And if you add a paging layer and swivel stuff below it then this will be very bad. At minimum you need to shovel blocks of data into the GPU to allow it to operate undisturbed for a while on the data and do its job.

> The same applies if you do some data mining. You might want to find all
> occurrences of a specific sequence in a large data pool. You can slice
> your data pool, have an independent job per slice, and only aggregate
> the results of each job at the end (or as they finish).

This sounds more like a case for a general-purpose processor. If it is a special device then it will typically also have special memory to allow fast searches.

-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 09:01:47AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> > > As far as I know Jerome is talking about HPC loads and high-performance
> > > GPU processing. This is the same use case.
> >
> > The difference is sensitivity to latency. You have latency-sensitive
> > HPC workloads, and Jerome is talking about HPC workloads that need
> > high throughput, but are insensitive to latency.
>
> Those are correlated.

In some cases, yes. But are you -really- claiming that -all- HPC workloads are highly sensitive to latency? That would be quite a claim!

> > > What you are proposing for High Performance Computing is reducing the
> > > performance these guys are trying to get. You cannot sell someone a
> > > Volkswagen if he needs the Ferrari.
> >
> > You do need the low-latency Ferrari. But others are best served by a
> > high-throughput freight train.
>
> The problem is that they want to run 2000 trains at the same time,
> and they all must arrive at the destination before they can be sent on
> their next trip. 1999 trains will be sitting idle because they need
> to wait for the one train that was delayed. This reduces the throughput.
> People really would like all 2000 trains to arrive on schedule so that
> they get more performance.

Yes, there is some portion of the market that needs both high throughput and highly predictable latencies. You are claiming that the -entire- HPC market has this sort of requirement? Again, this would be quite a claim!

Thanx, Paul
Re: Interacting with coherent memory on external devices
On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> > DAX
>
> DAX is a mechanism for providing direct-memory access to
> high-speed non-volatile (AKA "persistent") memory. Good
> introductions to DAX may be found in the following LWN
> articles:

DAX is a mechanism to access memory not managed by the kernel and is the successor to XIP. It just happens to be needed for persistent memory. Fundamentally, any driver can provide an mmapped interface to allow access to a device's memory.
Re: Interacting with coherent memory on external devices
On Thu, 23 Apr 2015, Jerome Glisse wrote:
> The NUMA code we have today for the CPU case exists because it does make
> a difference, but you keep trying to restrict GPU users to a workload
> that is specific. Go talk to people doing physics, biology, data
> mining, CAD; most of them do not care about latency. They have no
> hard deadline to meet with their computation. They just want things
> to compute as fast as possible and programming to be as easy as it
> can get.

I started working on the latency issues a long time ago because the performance of those labs was restricted by OS processing. A noted problem was SLAB's scanning of its objects every 2 seconds, which caused pretty significant performance regressions due to the delay of the computation in individual threads.
Re: Interacting with coherent memory on external devices
On Thu, 23 Apr 2015, Austin S Hemmelgarn wrote:
> Looking at this whole conversation, all I see is two different views on
> how to present the asymmetric multiprocessing arrangements that have
> become commonplace in today's systems to userspace. Your model favors
> performance, while CAPI favors simplicity for userspace.

Oww. No performance, just simplicity? Really? The simplification of memory registration for Infiniband etc. is certainly useful, and I hope to see contributions on that going into the kernel.
Re: Interacting with coherent memory on external devices
On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> > As far as I know Jerome is talking about HPC loads and high-performance
> > GPU processing. This is the same use case.
>
> The difference is sensitivity to latency. You have latency-sensitive
> HPC workloads, and Jerome is talking about HPC workloads that need
> high throughput, but are insensitive to latency.

Those are correlated.

> > What you are proposing for High Performance Computing is reducing the
> > performance these guys are trying to get. You cannot sell someone a
> > Volkswagen if he needs the Ferrari.
>
> You do need the low-latency Ferrari. But others are best served by a
> high-throughput freight train.

The problem is that they want to run 2000 trains at the same time, and they all must arrive at the destination before they can be sent on their next trip. 1999 trains will be sitting idle because they need to wait for the one train that was delayed. This reduces the throughput. People really would like all 2000 trains to arrive on schedule so that they get more performance.
Re: Interacting with coherent memory on external devices
On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> If by entire industry you mean everyone who might want to use hardware
> acceleration, for example, including mechanical computer-aided design,
> I am skeptical.

The industry designs GPUs with super-fast special RAM, and accelerators with special RAM designed to do fast searches, and you think you can demand-page that stuff in from the main processor?
Re: Interacting with coherent memory on external devices
On 04/23/2015 07:22 PM, Jerome Glisse wrote:
> On Thu, Apr 23, 2015 at 09:20:55AM -0500, Christoph Lameter wrote:
> > On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
> > > > There are hooks in glibc where you can replace the memory
> > > > management of the apps if you want that.
> > >
> > > We don't control the app. Let's say we are doing a plugin for libfoo
> > > which accelerates "foo" using GPUs.
> >
> > There are numerous examples of malloc implementations that can be used
> > for apps without modifying the app.
>
> What about shared memory passed between processes? Or mmapped files? Or a
> library that is loaded through dlopen and thus had no way to control any
> allocation that happened before it became active?
>
> > > Now some other app we have no control on uses libfoo. So pointers
> > > already allocated/mapped, possibly a long time ago, will hit libfoo
> > > (or the plugin) and we need GPUs to churn on the data.
> >
> > If the GPU would need to suspend one of its computation threads to
> > wait on a mapping to be established on demand or so, then it looks
> > like the performance of the parallel threads on a GPU will be
> > significantly compromised. You would want to do the transfer
> > explicitly in some fashion that meshes with the concurrent calculation
> > in the GPU. You do not want stalls while GPU number crunching is
> > ongoing.
>
> You do not understand how GPUs work. A GPU has a pool of threads, and it
> always tries to have the pool as big as possible, so that when a group of
> threads is waiting for some memory access, there are other threads ready
> to perform some operation. GPUs are about hiding memory latency; that's
> what they are good at. But they only achieve that when they have more
> threads in flight than compute units. The whole thread scheduling is done
> by hardware and barely controlled by the device driver. So no, having the
> GPU wait for a page fault is not as dramatic as you think. If you use
> GPUs as they are intended to be used, you might even never notice the
> page fault and reach close to the theoretical throughput of the GPU
> nonetheless.
> The point I'm making is that you are arguing against a usage model which
> has been repeatedly asked for by large numbers of customers (after all,
> that's also why HMM exists).
>
> > I am still not clear what the use case for this would be. Who is
> > asking for this?
>
> Everyone but you? OpenCL 2.0 specifically requests it and has several
> levels of support for a transparent address space. The lowest one is the
> one implemented today, in which applications need to use a special memory
> allocator. The most advanced one implies integration with the kernel, in
> which any memory (mmapped file, shared memory or anonymous memory) can be
> used by the GPU and does not need to come from a special allocator.
> Everyone in the industry is moving toward the most advanced one. That is
> the raison d'être of HMM: to provide this functionality on hw platforms
> that do not have things such as CAPI. Which is x86/ARM. So the use case
> is all applications using OpenCL or CUDA. So pretty much everyone doing
> GPGPU wants this. I don't know how you can't see that. A shared address
> space is so much easier. Believe it or not, most coders do not have deep
> knowledge of how things work, and if you can remove the complexity of
> different memory allocations and different address spaces from them, they
> will be happy.
>
> Cheers, Jérôme

I second what Jerome said, and add that one of the key features of HSA is the ptr-is-a-ptr scheme, where applications do *not* need to handle different address spaces. Instead, all the memory is seen as a unified address space. See slide 6 of the following presentation: http://www.slideshare.net/hsafoundation/hsa-overview

Thanks, Oded

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majord...@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 09:30:40AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> > If by entire industry you mean everyone who might want to use hardware
> > acceleration, for example, including mechanical computer-aided design,
> > I am skeptical.
>
> The industry designs GPUs with super-fast special RAM, and accelerators
> with special RAM designed to do fast searches, and you think you can
> demand-page that stuff in from the main processor?

The demand paging is indeed a drawback for the option of using autonuma to handle the migration. And again, this is not intended to replace the careful hand-tuning that is required to get the last drop of performance out of the system. It is instead intended to handle the cases where the application needs substantially more performance than the CPUs alone can deliver, but where the cost of full-fledged hand-tuning cannot be justified. You seem to believe that this latter category is the empty set, which I must confess does greatly surprise me.

Thanx, Paul
Re: Interacting with coherent memory on external devices
On Fri, 2015-04-24 at 11:58 -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
> > > What exactly is the more advanced version's benefit? What are the
> > > features that the other platforms do not provide?
> >
> > Transparent access to device memory from the CPU: you can map any of
> > the GPU memory into the CPU and have full cache coherency, including
> > proper atomic memory operations. CAPI is not some mumbo-jumbo
> > marketing name; there is real hardware behind it.
>
> Got the hardware here, but I am getting pretty sobered given what I
> heard here. The IBM mumbo-jumbo marketing comes down to not much now.

Ugh... first, nothing we propose precludes using it with explicit memory management the way you want, so I don't know why you have a problem here. We are trying to cover a *different* usage model than yours, obviously. But they aren't exclusive.

Secondly, none of what we are discussing here is supported by *existing* hardware, so whatever you have is not concerned. There is no CAPI-based coprocessor today that provides cachable memory to the system (though CAPI as a technology supports it), and no GPU doing that either, *yet*. Today's CAPI adapters can own host cache lines but don't expose large swaths of cachable local memory.

Finally, this discussion is not even specifically about CAPI or its performance. It's about the *general* case of a coherent coprocessor sharing the MMU, whether via CAPI or whatever other technology that allows that sort of thing, which we may or may not be able to mention at this point. CAPI is just an example because architecturally it allows that too.

Ben.
Re: Interacting with coherent memory on external devices
On Fri, 24 Apr 2015, Paul E. McKenney wrote:
> > DAX is a mechanism to access memory not managed by the kernel and is
> > the successor to XIP. It just happens to be needed for persistent
> > memory. Fundamentally, any driver can provide an mmapped interface to
> > allow access to a device's memory.
>
> I will take another look, but others in this thread have called out
> difficulties with DAX's filesystem nature.

Right, so you do not need the filesystem structure. Simply writing a device driver that mmaps data as needed from the coprocessor will also do the trick.
Re: Interacting with coherent memory on external devices
On Fri, 24 Apr 2015, Jerome Glisse wrote:
> On Fri, Apr 24, 2015 at 09:29:12AM -0500, Christoph Lameter wrote:
> > On Thu, 23 Apr 2015, Jerome Glisse wrote:
> > > No, this has not been solved properly. Today's solution is doing an
> > > explicit copy, again and again, and when complex data structures are
> > > involved (lists, trees, ...) this is extremely tedious and hard to
> > > debug. So today's solutions often restrict themselves to easy things
> > > like matrix multiplication. But if you provide a unified address
> > > space then you make things a lot easier for a lot more use cases.
> > > That's a fact, and again OpenCL 2.0, which is an industry standard,
> > > is proof that a unified address space is one of the most important
> > > features requested by users of GPGPU. You might not care but the
> > > rest of the world does.
> >
> > You could use page tables on the kernel side to transfer data on
> > demand from the GPU. And you can use a device driver to establish
> > mappings to the GPU's memory. There is no copy needed with these
> > approaches.
>
> So you are telling me to do get_user_pages()? If so, are you aware that
> this pins memory? So what happens when the GPU wants to access a range
> of 32GB of memory? I pin everything?

Use either a device driver to create PTEs pointing to the data, or do something similar to what DAX does. Pinning can be avoided if you use mmu_notifiers. Those will give you a callback before the OS removes the data, and thus you can operate without pinning.

> Overall the throughput of the GPU will stay close to its theoretical
> maximum if you have enough other threads that can progress, and this is
> very common.

GPUs operate on groups of threads, not single ones. If you stall then there will be a stall of a whole group of them. We are dealing with accelerators here that are different for performance reasons. They are not to be treated like a regular processor, nor does their memory operate like host memory.
> But IBM here wants to go further and provide a more advanced solution,
> so their needs are specific to their platform, and we cannot know if
> AMD, ARM or Intel will want to go down the same road; they do not seem
> to be interested. Does that mean we should not support IBM? I think
> that would be wrong.

What exactly is the more advanced version's benefit? What are the features that the other platforms do not provide?

> > This sounds more like a case for a general-purpose processor. If it
> > is a special device then it will typically also have special memory
> > to allow fast searches.
>
> No, this kind of thing can be fast on a GPU; with a GPU you easily have
> 500x more cores than CPU cores, so you can slice the dataset even more
> and have each of the GPU cores perform the search. Note that I am not
> only thinking of a stupid memcmp here; it can be something more complex,
> like searching for a pattern that allows variations, where a whole
> program is needed to decide if a chunk falls under the variation rules
> or not.

Then you have the problem of fast memory access, and you are proposing to complicate that access path on the GPU.
Re: Interacting with coherent memory on external devices
On 04/24/2015 11:49 AM, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Paul E. McKenney wrote:
> > can deliver, but where the cost of full-fledged hand-tuning cannot be
> > justified. You seem to believe that this latter category is the empty
> > set, which I must confess does greatly surprise me.
>
> If compromises are already being made, then why would you want to modify
> the kernel for this? Some user-space coding and device drivers should be
> sufficient.

You assume only one program at a time would get to use the GPU for accelerated computations, with the GPU dedicated to that program. That will not be the case when you have libraries using the GPU for computations. There could be dozens of programs in the system using that library, with no knowledge of how many GPU resources are used by the other programs.

There is a very clear-cut case for having the OS manage the GPU resources transparently, just like it does for all the other resources in the system.

-- All rights reversed
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 09:30:40AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> > If by entire industry you mean everyone who might want to use hardware
> > acceleration, for example, including mechanical computer-aided design,
> > I am skeptical.
>
> The industry designs GPUs with super-fast special RAM, and accelerators
> with special RAM designed to do fast searches, and you think you can
> demand-page that stuff in from the main processor?

Why do you think AMD and NVIDIA are adding page fault support to their GPUs in the first place? They are not doing this on a whim; they have thought carefully about it. Are you saying you know better than the two biggest GPU designers on the planet? And who do you think is pushing for such a thing in the kernel? Do you think we are working on this on a whim? Because we woke up one day and thought it would be cool and that it should be done this way?

Yes, if all your GPU does is page-fault it will be disastrous, but is that the usual thing we see on the CPU? No! Are people complaining about the numerous page faults that happen over a day? No, the vast majority of users are completely oblivious to page faults. This is how it works on the CPU, and yes, this can work for the GPU too. What happens on the CPU? Well, the CPU can switch to work on a different thread or a different application altogether. The same thing will happen on the GPU. If you have enough jobs, your GPU will be busy and you will never worry about page faults, because overall your GPU will deliver the same kind of throughput as if there were no page faults. It can very well be buried in the overall noise if the ratio of available runnable threads versus page-faulting threads is high enough, which is most of the time the case for the CPU; why would the same assumption not work on the GPU?

Note that I am not dismissing the low-latency folks; I know they exist, I know they hate page faults, and in no way will what we propose make it worse for them.
They will be able to keep the same kind of control they cherish, but
that does not mean you should go on a holy crusade pretending other
people's workloads do not exist. They do exist. Page faults are not
evil; they have proven useful to the whole computer industry on the CPU.

To be sure you are not misinterpreting what we propose: in no way are we
saying we are going to migrate things on page fault for everyone. We are
saying that, first, the device driver decides where things need to be
(system memory or local memory), and the device driver can get
hints/requests from userspace for this (as they do today). So no change
whatsoever here; people who hand-tune things will still be able to do
so.

Now we want to add the case where the device driver gets no directive or
hint at all from userspace. So, as autonuma does, we simply collect
information from the GPU on what is accessed often and then migrate
those pages transparently (yes, this can happen without interrupting the
GPU). You are then migrating from memory with 16GB/s or 32GB/s of
bandwidth to device memory with 500GB/s. This is a valid use case; there
are many people out there who do not want to learn how to hand-tune
their application for the GPU but could nonetheless benefit from it.

Cheers,
Jérôme

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 09:29:12AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Jerome Glisse wrote:
> > No, this has not been solved properly. Today's solution is doing an
> > explicit copy, again and again, and when complex data structures are
> > involved (lists, trees, ...) this is extremely tedious and hard to
> > debug. So today's solutions often restrict themselves to easy things
> > like matrix multiplication. But if you provide a unified address
> > space then you make things a lot easier for a lot more use cases.
> > That's a fact, and again OpenCL 2.0, which is an industry standard,
> > is proof that a unified address space is one of the most important
> > features requested by users of GPGPU. You might not care, but the
> > rest of the world does.
>
> You could use page tables on the kernel side to transfer data on
> demand from the GPU. And you can use a device driver to establish
> mappings to the GPU's memory. There is no copy needed with these
> approaches.

So you are telling me to do get_user_page()? If so, are you aware that
this pins memory? So what happens when the GPU wants to access a range
of 32GB of memory? I pin everything? And I am not talking only about
transfers from the GPU to system memory; I am talking about applications
that do:

  dataset = mmap(dataset_file, dataset_size);
  // ...
  dl_open(superlibrary);
  superlibrary.dosomething(dataset);

The application here has no clue about the GPU and we do not want to
change that; yes, this is a valid use case, and countless users ask for
it. How can the superlibrary give the GPU access to the dataset? Does it
have to call get_user_page() on every single page, effectively pinning
memory? Should it allocate GPU memory through a special API and memcpy?
What HMM does is allow the process page table to be shared with the GPU,
so the GPU can transparently access the dataset (no pinning whatsoever).
Will there be page faults?
They can happen, and if they do, the assumption is that you have more
threads that do not fault than threads that do, so the GPU stays
saturated (i.e., all its units are fed with something to do) while the
page faults are resolved. For some workloads, yes, you will see the
fault penalty: you will have a group of threads that finishes late. But
the thing you seem to miss is that all the other GPU threads can make
progress, and even finish, before the fault is resolved. It all depends
on the application. Moreover, if you have several applications, the GPU
can switch to a different application and make progress on it too.
Overall, GPU throughput will stay close to its theoretical maximum as
long as enough other threads can make progress, and that is very common.

> I think these two things need to be separated. The
> shift-the-memory-back-and-forth approach should be separate, and if
> someone wants to use the thing then it should also work on other
> platforms like ARM and Intel.

What IBM does with their platform is their choice; they cannot force
ARM, Intel, or AMD to do the same. Each of those might have a different
view on what their most important target is. For instance, I highly
doubt ARM cares about any of this.

> Well, but the kernel code submitted should allow for easy use on other
> platforms. I.e. Intel processors should be able to implement the
> transparent memory by establishing device mappings to PCI-E space
> and/or transferring data from the GPU and signaling the GPU to
> establish such a mapping.

HMM does that. It only requires the GPU to have a certain set of
features, and the only requirement on the platform is a bus that allows
cache-coherent access to system memory, such as PCIe. But IBM wants to
go further and provide a more advanced solution, so their needs are
specific to their platform, and we cannot know whether AMD, ARM, or
Intel will want to go down the same road; they do not seem interested.
Does that mean we should not support IBM? I think that would be wrong.

> > Only time-critical applications care about latency; everyone else
> > cares about throughput, where the applications can run for days,
> > weeks, months before producing any usable/meaningful results. Many
> > of them do not care one bit about latency because they perform
> > independent computations.
>
> Computationally intensive high performance applications care about
> random latency introduced to computational threads because that delays
> the data exchange and thus slows everything down. And that is the
> typical case of a GPU.

You assume that all HPC applications have heavy data exchange; I gave
you examples of applications with zero data exchange between threads
whatsoever. Those use cases exist and we want to support them too. Yes,
for threads that do exchange data, a page fault stalls the job, but
again, we are talking about HPC where several _different_ applications
run in parallel and share resources, so while a page fault can block
part of one application, the others can still make progress because the
GPU can switch to work on them. Moreover the
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 07:57:38AM -0700, Paul E. McKenney wrote:
> On Fri, Apr 24, 2015 at 09:12:07AM -0500, Christoph Lameter wrote:
> > On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> > > DAX
> > >
> > > DAX is a mechanism for providing direct-memory access to
> > > high-speed non-volatile (AKA persistent) memory. Good
> > > introductions to DAX may be found in the following LWN articles:
> >
> > DAX is a mechanism to access memory not managed by the kernel and is
> > the successor to XIP. It just happens to be needed for persistent
> > memory. Fundamentally any driver can provide an mmapped interface to
> > allow access to a device's memory.
>
> I will take another look, but others in this thread have called out
> difficulties with DAX's filesystem nature.

Do not waste your time on that; it is not what we want. Christoph here
is more than stubborn and fails to see the world.

Cheers,
Jérôme
Re: Interacting with coherent memory on external devices
On Fri, Apr 24, 2015 at 09:12:07AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> > DAX
> >
> > DAX is a mechanism for providing direct-memory access to high-speed
> > non-volatile (AKA persistent) memory. Good introductions to DAX may
> > be found in the following LWN articles:
>
> DAX is a mechanism to access memory not managed by the kernel and is
> the successor to XIP. It just happens to be needed for persistent
> memory. Fundamentally any driver can provide an mmapped interface to
> allow access to a device's memory.

I will take another look, but others in this thread have called out
difficulties with DAX's filesystem nature.

Thanx, Paul
Re: Interacting with coherent memory on external devices
On 04/24/2015 10:01 AM, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
>
> As far as I know Jerome is talking about HPC loads and high
> performance GPU processing. This is the same use case.

The difference is sensitivity to latency. You have latency-sensitive
HPC workloads, and Jerome is talking about HPC workloads that need high
throughput, but are insensitive to latency.

> Those are correlated. What you are proposing for High Performance
> Computing is reducing the performance these guys are trying to get.
> You cannot sell someone a Volkswagen if he needs the Ferrari.

You do need the low-latency Ferrari. But others are best served by a
high-throughput freight train.

> The problem is that they want to run 2000 trains at the same time and
> they all must arrive at the destination before they can be sent on
> their next trip. 1999 trains will be sitting idle because they need to
> wait for the one train that was delayed. This reduces the throughput.
> People really would like all 2000 trains to arrive on schedule so that
> they get more performance.

So you run 4000 or even 6000 trains, and have some subset of them run
at full steam, while others are waiting on memory accesses.

In reality the overcommit factor is likely much smaller, because the
GPU threads run and block on memory in smaller, more manageable
numbers, say a few dozen at a time.

--
All rights reversed
Re: Interacting with coherent memory on external devices
On Fri, 24 Apr 2015, Paul E. McKenney wrote:
> can deliver, but where the cost of full-fledged hand tuning cannot be
> justified. You seem to believe that this latter category is the empty
> set, which I must confess does greatly surprise me.

If compromises are already being made, then why would you want to
modify the kernel for this? Some user-space coding and device drivers
should be sufficient.
Re: Interacting with coherent memory on external devices
On 04/21/2015 05:44 PM, Paul E. McKenney wrote:
> AUTONUMA
>
> The Linux kernel's autonuma facility supports migrating both memory
> and processes to promote NUMA memory locality. It was accepted into
> 3.13 and is available in RHEL 7.0 and SLES 12. It is enabled by the
> Kconfig variable CONFIG_NUMA_BALANCING.
>
> This approach uses a kernel thread knuma_scand that periodically marks
> pages inaccessible. The page-fault handler notes any mismatches
> between the NUMA node that the process is running on and the NUMA node
> on which the page resides.

Minor nit: marking pages inaccessible is done from task_work nowadays;
there no longer is a kernel thread.

> The result would be that the kernel would allocate only migratable
> pages within the CCAD device's memory, and even then only if memory
> was otherwise exhausted.

Does it make sense to allocate the device's page tables in memory
belonging to the device? Is this a necessary thing with some devices?
Jerome's HMM comes to mind...
Re: Interacting with coherent memory on external devices
On Fri, 2015-04-24 at 22:32 -0400, Rik van Riel wrote:
> > The result would be that the kernel would allocate only migratable
> > pages within the CCAD device's memory, and even then only if memory
> > was otherwise exhausted.
>
> Does it make sense to allocate the device's page tables in memory
> belonging to the device? Is this a necessary thing with some devices?
> Jerome's HMM comes to mind...

In our case, the device's MMU shares the host page tables (which is why
we can't use HMM, i.e., we can't have a page with different permissions
on the CPU vs. the device, which is what HMM does). However, the device
has a pretty fast path to system memory, so the best thing we can do is
pin the workload to the same chip the device is connected to, so those
page tables aren't too far away.

Cheers,
Ben.