Re: Interacting with coherent memory on external devices

2015-05-28 Thread Paul E. McKenney
On Thu, May 14, 2015 at 05:51:19PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2015-05-14 at 09:39 +0200, Vlastimil Babka wrote:
> > On 05/14/2015 01:38 AM, Benjamin Herrenschmidt wrote:
> > > On Wed, 2015-05-13 at 16:10 +0200, Vlastimil Babka wrote:
> > >> Sorry for reviving oldish thread...
> > >
> > > Well, that's actually appreciated since this is constructive discussion
> > > of the kind I was hoping to trigger initially :-) I'll look at
> > 
> > I hoped so :)
> > 
> > > ZONE_MOVABLE, I wasn't aware of its existence.
> > >
> > > Don't we still have the problem that ZONEs must be somewhat contiguous
> > > chunks ? Ie, my "CAPI memory" will be interleaved in the physical
> > > address space somewhat.. This is due to the address space on some of
> > > those systems where you'll basically have something along the lines of:
> > >
> > > [ node 0 mem ] [ node 0 CAPI dev ]  [ node 1 mem] [ node 1 CAPI dev] 
> > > ...
> > 
> > Oh, I see. The VM code should cope with that, but some operations would 
> > be inefficiently looping over the holes in the CAPI zone by 2MB 
> > pageblock per iteration. This would include compaction scanning, which 
> > would suck if you need those large contiguous allocations as you said. 
> > Interleaving works better if it's done with a smaller granularity.
> > 
> > But I guess you could just represent the CAPI as multiple NUMA nodes, 
> > each with single ZONE_MOVABLE zone. Especially if "node 0 CAPI dev" and 
> > "node 1 CAPI dev" differs in other characteristics than just using a 
> > different range of PFNs... otherwise what's the point of this split anyway?
> 
> Correct, I think we want the CAPI devs to look like CPU-less NUMA nodes
> anyway. This is the right way to target an allocation at one of them and
> it conveys the distance properly, so it makes sense.
> 
> I'll add the ZONE_MOVABLE to the list of things to investigate on our
> side, thanks for the pointer !

Any thoughts on CONFIG_MOVABLE_NODE and the corresponding "movable_node"
boot parameter?  It looks like it is designed to make an entire NUMA
node's memory hotpluggable, which seems consistent with what we are
trying to do here.  This feature is currently x86_64-only, so would need
to be enabled on other architectures.

It looks like this is intended to be used by booting normally, but
keeping the CAPI nodes' memory offline, setting movable_node, then
onlining their memory.
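
For illustration only (not tested on such a system), the onlining step from
user space would presumably go through the standard memory-hotplug sysfs
interface, roughly like the sketch below; the block number is hypothetical.

#include <stdio.h>

/*
 * Illustrative sketch only: online one (hypothetical) memory block of a
 * CAPI node as movable via the memory-hotplug sysfs interface.
 */
int online_block_movable(int block)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/memory/memory%d/state", block);
	f = fopen(path, "w");
	if (!f)
		return -1;
	/* "online_movable" places the block into ZONE_MOVABLE */
	fputs("online_movable", f);
	return fclose(f);
}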

Thoughts?

Thanx, Paul


Re: Interacting with coherent memory on external devices

2015-05-14 Thread Benjamin Herrenschmidt
On Thu, 2015-05-14 at 09:39 +0200, Vlastimil Babka wrote:
> On 05/14/2015 01:38 AM, Benjamin Herrenschmidt wrote:
> > On Wed, 2015-05-13 at 16:10 +0200, Vlastimil Babka wrote:
> >> Sorry for reviving oldish thread...
> >
> > Well, that's actually appreciated since this is constructive discussion
> > of the kind I was hoping to trigger initially :-) I'll look at
> 
> I hoped so :)
> 
> > ZONE_MOVABLE, I wasn't aware of its existence.
> >
> > Don't we still have the problem that ZONEs must be somewhat contiguous
> > chunks ? Ie, my "CAPI memory" will be interleaved in the physical
> > address space somewhat.. This is due to the address space on some of
> > those systems where you'll basically have something along the lines of:
> >
> > [ node 0 mem ] [ node 0 CAPI dev ]  [ node 1 mem] [ node 1 CAPI dev] ...
> 
> Oh, I see. The VM code should cope with that, but some operations would 
> be inefficiently looping over the holes in the CAPI zone by 2MB 
> pageblock per iteration. This would include compaction scanning, which 
> would suck if you need those large contiguous allocations as you said. 
> Interleaving works better if it's done with a smaller granularity.
> 
> But I guess you could just represent the CAPI as multiple NUMA nodes, 
> each with single ZONE_MOVABLE zone. Especially if "node 0 CAPI dev" and 
> "node 1 CAPI dev" differs in other characteristics than just using a 
> different range of PFNs... otherwise what's the point of this split anyway?

Correct, I think we want the CAPI devs to look like CPU-less NUMA nodes
anyway. This is the right way to target an allocation at one of them and
it conveys the distance properly, so it makes sense.

I'll add ZONE_MOVABLE to the list of things to investigate on our
side, thanks for the pointer!
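
Just to sketch the direction (illustrative only, not actual code from either
side; capi_nid is a made-up node id), explicitly targeting such a node from
the kernel could look roughly like:

#include <linux/gfp.h>

/*
 * Movable, user-style allocation aimed at the device node only;
 * __GFP_THISNODE prevents falling back to other nodes.
 */
static struct page *alloc_capi_page(int capi_nid)
{
	return alloc_pages_node(capi_nid,
				GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
}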

Cheers,
Ben.




Re: Interacting with coherent memory on external devices

2015-05-14 Thread Vlastimil Babka

On 05/14/2015 01:38 AM, Benjamin Herrenschmidt wrote:

> On Wed, 2015-05-13 at 16:10 +0200, Vlastimil Babka wrote:
>> Sorry for reviving oldish thread...
>
> Well, that's actually appreciated since this is constructive discussion
> of the kind I was hoping to trigger initially :-) I'll look at


I hoped so :)


> ZONE_MOVABLE, I wasn't aware of its existence.
>
> Don't we still have the problem that ZONEs must be somewhat contiguous
> chunks ? Ie, my "CAPI memory" will be interleaved in the physical
> address space somewhat.. This is due to the address space on some of
> those systems where you'll basically have something along the lines of:
>
> [ node 0 mem ] [ node 0 CAPI dev ]  [ node 1 mem] [ node 1 CAPI dev] ...


Oh, I see. The VM code should cope with that, but some operations would 
be inefficiently looping over the holes in the CAPI zone one 2MB 
pageblock per iteration. This would include compaction scanning, which 
would suck if you need those large contiguous allocations as you said. 
Interleaving works better if it's done with a smaller granularity.


But I guess you could just represent the CAPI as multiple NUMA nodes, 
each with a single ZONE_MOVABLE zone. Especially if "node 0 CAPI dev" and 
"node 1 CAPI dev" differ in other characteristics than just using a 
different range of PFNs... otherwise what's the point of this split anyway?


Re: Interacting with coherent memory on external devices

2015-05-13 Thread Benjamin Herrenschmidt
On Wed, 2015-05-13 at 16:10 +0200, Vlastimil Babka wrote:
> Sorry for reviving oldish thread...

Well, that's actually appreciated since this is constructive discussion
of the kind I was hoping to trigger initially :-) I'll look at
ZONE_MOVABLE, I wasn't aware of its existence.

Don't we still have the problem that ZONEs must be somewhat contiguous
chunks? I.e., my "CAPI memory" will be interleaved in the physical
address space somewhat. This is due to the address space on some of
those systems where you'll basically have something along the lines of:

[ node 0 mem ] [ node 0 CAPI dev ]  [ node 1 mem] [ node 1 CAPI dev] ...

> On 04/28/2015 01:54 AM, Benjamin Herrenschmidt wrote:
> > On Mon, 2015-04-27 at 11:48 -0500, Christoph Lameter wrote:
> >> On Mon, 27 Apr 2015, Rik van Riel wrote:
> >>
> >>> Why would we want to avoid the sane approach that makes this thing
> >>> work with the fewest required changes to core code?
> >>
> >> Becaus new ZONEs are a pretty invasive change to the memory management and
> >> because there are  other ways to handle references to device specific
> >> memory.
> >
> > ZONEs is just one option we put on the table.
> >
> > I think we can mostly agree on the fundamentals that a good model of
> > such a co-processor is a NUMA node, possibly with a higher distance
> > than other nodes (but even that can be debated).
> >
> > That gives us a lot of the basics we need such as struct page, ability
> > to use existing migration infrastructure, and is actually a reasonably
> > representation at high level as well.
> >
> > The question is how do we additionally get the random stuff we don't
> > care about out of the way. The large distance will not help that much
> > under memory pressure for example.
> >
> > Covering the entire device memory with a CMA goes a long way toward that
> > goal. It will avoid your ordinary kernel allocations.
> 
> I think ZONE_MOVABLE should be sufficient for this. CMA is basically for 
> marking parts of zones as MOVABLE-only. You shouldn't need that for the 
> whole zone. Although it might happen that CMA will be a special zone one 
> day.
> 
> > It also provides just what we need to be able to do large contiguous
> > "explicit" allocations for use by workloads that don't want the
> > transparent migration and by the driver for the device which might also
> > need such special allocations for its own internal management data
> > structures.
> 
> Plain zone compaction + reclaim should work as well in a ZONE_MOVABLE 
> zone. CMA allocations might IIRC additionally migrate across zones, e.g. 
> from the device to system memory (unlike plain compaction), which might 
> be what you want, or not.
> 
> > We still have the risk of pages in the CMA being pinned by something
> > like gup however, that's where the ZONE idea comes in, to ensure the
> > various kernel allocators will *never* allocate in that zone unless
> > explicitly specified, but that could possibly implemented differently.
> 
> Kernel allocations should ignore the ZONE_MOVABLE zone as they are not 
> typically movable. Then it depends on how much control you want for 
> userspace allocations.
> 
> > Maybe a concept of "exclusive" NUMA node, where allocations never
> > fallback to that node unless explicitly asked to go there.
> 
> I guess that could be doable on the zonelist level, where the device 
> memory node/zone wouldn't be part of the "normal" zonelists, so memory 
> pressure calculations should be also fine. But sure there will be some 
> corner cases :)
> 
> > Of course that would have an impact on memory pressure calculations,
> > nothign comes completely for free, but at this stage, this is the goal
> > of this thread, ie, to swap ideas around and see what's most likely to
> > work in the long run before we even start implementing something.
> >
> > Cheers,
> > Ben.
> >
> >


Re: Interacting with coherent memory on external devices

2015-05-13 Thread Vlastimil Babka

Sorry for reviving oldish thread...

On 04/28/2015 01:54 AM, Benjamin Herrenschmidt wrote:

> On Mon, 2015-04-27 at 11:48 -0500, Christoph Lameter wrote:
>> On Mon, 27 Apr 2015, Rik van Riel wrote:
>>
>>> Why would we want to avoid the sane approach that makes this thing
>>> work with the fewest required changes to core code?
>>
>> Becaus new ZONEs are a pretty invasive change to the memory management and
>> because there are  other ways to handle references to device specific
>> memory.


> ZONEs is just one option we put on the table.
>
> I think we can mostly agree on the fundamentals that a good model of
> such a co-processor is a NUMA node, possibly with a higher distance
> than other nodes (but even that can be debated).
>
> That gives us a lot of the basics we need such as struct page, ability
> to use existing migration infrastructure, and is actually a reasonably
> representation at high level as well.
>
> The question is how do we additionally get the random stuff we don't
> care about out of the way. The large distance will not help that much
> under memory pressure for example.
>
> Covering the entire device memory with a CMA goes a long way toward that
> goal. It will avoid your ordinary kernel allocations.


I think ZONE_MOVABLE should be sufficient for this. CMA is basically for 
marking parts of zones as MOVABLE-only. You shouldn't need that for the 
whole zone. Although it might happen that CMA will be a special zone one 
day.



> It also provides just what we need to be able to do large contiguous
> "explicit" allocations for use by workloads that don't want the
> transparent migration and by the driver for the device which might also
> need such special allocations for its own internal management data
> structures.


Plain zone compaction + reclaim should work as well in a ZONE_MOVABLE 
zone. CMA allocations might IIRC additionally migrate across zones, e.g. 
from the device to system memory (unlike plain compaction), which might 
be what you want, or not.



> We still have the risk of pages in the CMA being pinned by something
> like gup however, that's where the ZONE idea comes in, to ensure the
> various kernel allocators will *never* allocate in that zone unless
> explicitly specified, but that could possibly implemented differently.


Kernel allocations should ignore the ZONE_MOVABLE zone as they are not 
typically movable. Then it depends on how much control you want for 
userspace allocations.
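
To illustrate (sketch only, not code from this thread): which zones an
allocation may use is derived from its GFP flags, and only allocations that
pass __GFP_MOVABLE are eligible for ZONE_MOVABLE.

#include <linux/gfp.h>

/* Illustrative sketch: the GFP flags decide whether ZONE_MOVABLE is usable. */
static void gfp_zone_example(void)
{
	struct page *k, *u;

	k = alloc_pages(GFP_KERNEL, 0);           /* never ZONE_MOVABLE */
	u = alloc_pages(GFP_HIGHUSER_MOVABLE, 0); /* may use ZONE_MOVABLE */

	if (k)
		__free_pages(k, 0);
	if (u)
		__free_pages(u, 0);
}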



> Maybe a concept of "exclusive" NUMA node, where allocations never
> fallback to that node unless explicitly asked to go there.


I guess that could be doable on the zonelist level, where the device 
memory node/zone wouldn't be part of the "normal" zonelists, so memory 
pressure calculations should also be fine. But sure there will be some 
corner cases :)



> Of course that would have an impact on memory pressure calculations,
> nothign comes completely for free, but at this stage, this is the goal
> of this thread, ie, to swap ideas around and see what's most likely to
> work in the long run before we even start implementing something.
>
> Cheers,
> Ben.



Re: Interacting with coherent memory on external devices

2015-04-28 Thread Jerome Glisse
On Tue, Apr 28, 2015 at 09:18:55AM -0500, Christoph Lameter wrote:
> On Mon, 27 Apr 2015, Jerome Glisse wrote:
> 
> > > is the mechanism that DAX relies on in the VM.
> >
> > Which would require fare more changes than you seem to think. First using
> > MIXED|PFNMAP means we loose any kind of memory accounting and forget about
> > memcg too. Seconds it means we would need to set those flags on all vma,
> > which kind of point out that something must be wrong here. You will also
> > need to have vm_ops for all those vma (including for anonymous private vma
> > which sounds like it will break quite few place that test for that). Then
> > you have to think about vma that already have vm_ops but you would need
> > to override it to handle case where its device memory and then forward
> > other case to the existing vm_ops, extra layering, extra complexity.
> 
> These vmas would only be used for those section of memory that use
> memory in the coprocessor. Special memory accounting etc can be done at
> the device driver layer. Multiple processes would be able to use different
> GPU contexts (or devices) which provides proper isolations.
> 
> memcg is about accouting for regular memory and this is not regular
> memory. It ooks like one would need a lot of special casing in
> the VM if one wanted to handle f.e. GPU memory as regular memory under
> Linux.

Well, I showed this does not need many changes; refer to:
http://lwn.net/Articles/597289/
More specifically :
http://thread.gmane.org/gmane.linux.kernel.mm/116584

The idea here is that even if device memory is a special kind of memory we still
want to account it properly against the process, i.e. an anonymous page that is
in device memory would still be accounted as a regular anonymous page for
memcg (the same applies to file-backed pages). With that, existing memcg keeps
working as intended and process memory use is properly accounted.

This does not prevent the device driver from performing its own accounting of
device memory and allowing or blocking migration for a given process. At this
point we do not think it is meaningful to move such accounting to a common
layer.

Bottom line is, we want to keep existing memcg accounting intact and we
want to reflect remote memory as regular memory. Note that the memcg changes
would be even smaller now that Johannes has cleaned up and simplified memcg. I
have not rebased that part of HMM yet.


> 
> > I think at this point there is nothing more to discuss here. It is pretty
> > clear to me that any solution using block device/MIXEDMAP would be far
> > more complex and far more intrusive. I do not mind being prove wrong but
> > i will certainly not waste my time trying to implement such solution.
> 
> The device driver method is the current solution used by the GPUS and
> that would be the natural starting point for development. And they do not
> currently add code to the core vm. I think we first need to figure out if
> we cannot do what you want through that method.

We do need a different solution; I have been working on this for the last 2 years
for a reason.

Requirement: _no_ special allocator in userspace so that all kinds of memory
(anonymous, shared, file backed) can be used and migrated to device memory in
a transparent fashion for the application.

No special allocator implies no special vma and thus no special vm_ops. So we
need either to hook into a few places inside the mm code with minor changes to
deal with the special CPU pte entry of migrated memory (on page fault, fork,
write back). For all those places it's just about adding:

	if (new_special_pte)
		new_helper_function();
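
To make that a bit more concrete, here is a purely illustrative sketch of such
a hook; is_device_migrated_pte() and migrate_back_from_device() are placeholder
names, not existing kernel symbols:

#include <linux/mm.h>

/* Illustrative placeholder sketch of the per-hook-point check. */
static int handle_special_pte(struct mm_struct *mm, unsigned long addr,
			      pte_t pte)
{
	if (!is_device_migrated_pte(pte))
		return 0;	/* not ours, let the normal path continue */

	/* Fault the data back from device memory, then retry the access. */
	return migrate_back_from_device(mm, addr, pte);
}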

The other solution would have been to introduce yet another vm_ops that would
supersede the existing vm_ops. This works for page fault but requires more
changes for fork, and major changes for write back. Hence why the first
solution was favored.

I explored many different paths before going down the road I am on, and all
you are doing is hand-waving some idea without even considering any of
the objections I formulated. I explained why your idea cannot work or would
require excessive and more complex changes than the solution we are proposing.

Cheers,
Jérôme


Re: Interacting with coherent memory on external devices

2015-04-28 Thread Christoph Lameter
On Mon, 27 Apr 2015, Jerome Glisse wrote:

> > is the mechanism that DAX relies on in the VM.
>
> Which would require fare more changes than you seem to think. First using
> MIXED|PFNMAP means we loose any kind of memory accounting and forget about
> memcg too. Seconds it means we would need to set those flags on all vma,
> which kind of point out that something must be wrong here. You will also
> need to have vm_ops for all those vma (including for anonymous private vma
> which sounds like it will break quite few place that test for that). Then
> you have to think about vma that already have vm_ops but you would need
> to override it to handle case where its device memory and then forward
> other case to the existing vm_ops, extra layering, extra complexity.

These vmas would only be used for those sections of memory that use
memory in the coprocessor. Special memory accounting etc. can be done at
the device driver layer. Multiple processes would be able to use different
GPU contexts (or devices), which provides proper isolation.

memcg is about accounting for regular memory and this is not regular
memory. It looks like one would need a lot of special casing in
the VM if one wanted to handle e.g. GPU memory as regular memory under
Linux.

> I think at this point there is nothing more to discuss here. It is pretty
> clear to me that any solution using block device/MIXEDMAP would be far
> more complex and far more intrusive. I do not mind being prove wrong but
> i will certainly not waste my time trying to implement such solution.

The device driver method is the current solution used by the GPUs and
that would be the natural starting point for development. And they do not
currently add code to the core vm. I think we first need to figure out if
we cannot do what you want through that method.
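
For context, a simplified sketch of what that existing driver-level approach
typically looks like (not taken from any particular driver; dev_mem_pfn and
dev_mem_size are stand-ins for the driver's knowledge of its own memory):

#include <linux/fs.h>
#include <linux/mm.h>

static unsigned long dev_mem_pfn;	/* hypothetical: device memory base PFN */
static unsigned long dev_mem_size;	/* hypothetical: device memory size */

/* mmap handler mapping device memory into the process (a VM_PFNMAP mapping). */
static int devmem_mmap(struct file *filp, struct vm_area_struct *vma)
{
	unsigned long len = vma->vm_end - vma->vm_start;

	if (len > dev_mem_size)
		return -EINVAL;

	return remap_pfn_range(vma, vma->vm_start, dev_mem_pfn, len,
			       vma->vm_page_prot);
}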

Re: Interacting with coherent memory on external devices

2015-04-27 Thread Benjamin Herrenschmidt
On Mon, 2015-04-27 at 11:48 -0500, Christoph Lameter wrote:
> On Mon, 27 Apr 2015, Rik van Riel wrote:
> 
> > Why would we want to avoid the sane approach that makes this thing
> > work with the fewest required changes to core code?
> 
> Becaus new ZONEs are a pretty invasive change to the memory management and
> because there are  other ways to handle references to device specific
> memory.

ZONEs is just one option we put on the table.

I think we can mostly agree on the fundamentals that a good model of
such a co-processor is a NUMA node, possibly with a higher distance
than other nodes (but even that can be debated).

That gives us a lot of the basics we need such as struct page, the ability
to use existing migration infrastructure, and is actually a reasonable
representation at a high level as well.

The question is how do we additionally get the random stuff we don't
care about out of the way. The large distance will not help that much
under memory pressure for example.

Covering the entire device memory with a CMA goes a long way toward that
goal. It will avoid your ordinary kernel allocations.

It also provides just what we need to be able to do large contiguous
"explicit" allocations for use by workloads that don't want the
transparent migration and by the driver for the device which might also
need such special allocations for its own internal management data
structures. 

We still have the risk of pages in the CMA being pinned by something
like gup however; that's where the ZONE idea comes in, to ensure the
various kernel allocators will *never* allocate in that zone unless
explicitly specified, but that could possibly be implemented differently.

Maybe a concept of "exclusive" NUMA node, where allocations never
fall back to that node unless explicitly asked to go there.
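
From user space, "explicitly asked to go there" could be expressed with the
existing NUMA policy API, roughly like the sketch below (illustrative only;
links against libnuma for mbind(), and the device node number is an
assumption):

#include <numaif.h>
#include <sys/mman.h>

/*
 * Illustrative sketch: bind a mapping to a single (device) node so its
 * pages never fall back to other nodes.
 */
static void *alloc_on_node(size_t len, int node)
{
	unsigned long nodemask = 1UL << node;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	if (mbind(p, len, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, 0)) {
		munmap(p, len);
		return NULL;
	}
	return p;	/* pages fault in on the chosen node */
}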

Of course that would have an impact on memory pressure calculations,
nothing comes completely for free, but at this stage, this is the goal
of this thread, i.e., to swap ideas around and see what's most likely to
work in the long run before we even start implementing something.

Cheers,
Ben.




Re: Interacting with coherent memory on external devices

2015-04-27 Thread Jerome Glisse
On Mon, Apr 27, 2015 at 02:26:04PM -0500, Christoph Lameter wrote:
> On Mon, 27 Apr 2015, Jerome Glisse wrote:
> 
> > > We can drop the DAX name and just talk about mapping to external memory if
> > > that confuses the issue.
> >
> > DAX is for direct access block layer (X is for the cool name factor)
> > there is zero code inside DAX that would be usefull to us. Because it
> > is all about filesystem and short circuiting the pagecache. So DAX is
> > _not_ about providing rw mappings to non regular memory, it is about
> > allowing to directly map _filesystem backing storage_ into a process.
> 
> Its about directly mapping memory outside of regular kernel
> management via a block device into user space. That you can put a
> filesystem on top is one possible use case. You can provide a block
> device to map the memory of the coprocessor and then configure the memory
> space to have the same layout on the coprocessor as well as the linux
> process.

A _block device_ is not what we want; the block device API does not match
anything remotely useful for our use case. Most of the block device API
deals with disks and scheduling I/O on them, none of which is interesting
to us. So we would need to carefully create various no-op functions and
insert ourselves as some kind of fake block device, while also making sure
no userspace could actually use us as a regular block device. So
we would be pretending to be something we are not.

> 
> > Moreover DAX is not about managing that persistent memory, all the
> > management is done inside the fs (ext4, xfs, ...) in the same way as
> > for non persistent memory. While in our case we want to manage the
> > memory as a runtime resources that is allocated to process the same
> > way regular system memory is managed.
> 
> I repeatedly said that. So you would have a block device that would be
> used to mmap portions of the special memory into a process.
> 
> > So current DAX code have nothing of value for our usecase nor what we
> > propose will have anyvalue for DAX. Unless they decide to go down the
> > struct page road for persistent memory (which from last discussion i
> > heard was not there plan, i am pretty sure they entirely dismissed
> > that idea for now).
> 
> DAX is about directly accessing memory. It is made for the purpose of
> serving as a block device for a filesystem right now but it can easily be
> used as a way to map any external memory into a processes space using the
> abstraction of a block device. But then you can do that with any device
> driver using VM_PFNMAP or VM_MIXEDMAP. Maybe we better use that term
> instead. Guess I have repeated myself 6 times or so now? I am stopping
> with this one.
> 
> > My point is that this is 2 differents non overlapping problems, and
> > thus mandate 2 differents solution.
> 
> Well confusion abounds since so much other stuff has ben attached to DAX
> devices.
> 
> Lets drop the DAX term and use VM_PFNMAP or VM_MIXEDMAP instead. MIXEDMAP
> is the mechanism that DAX relies on in the VM.

Which would require far more changes than you seem to think. First, using
MIXED|PFNMAP means we lose any kind of memory accounting, and forget about
memcg too. Second, it means we would need to set those flags on all vmas,
which kind of points out that something must be wrong here. You will also
need to have vm_ops for all those vmas (including for anonymous private vmas,
which sounds like it will break quite a few places that test for that). Then
you have to think about vmas that already have vm_ops: you would need
to override them to handle the case where it's device memory and then forward
other cases to the existing vm_ops; extra layering, extra complexity.

All in all, this points me to believe that any such approach would be
vastly more complex, involve changing many places, and amount to shoehorning
something into the block device model that is clearly not a
block device.

Paul's solution or mine are far smaller. I think Paul can even get away
without adding/changing a ZONE by putting the device pages onto a different
list that is not used by the kernel memory allocator. Only a few code places
would need a new if() (when freeing a page and when initializing the device
memory struct pages; you could keep the lru code intact here).
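
As a purely illustrative sketch of that kind of if() in the freeing path
(is_device_page() and device_free_page() are placeholders, not existing
kernel code):

#include <linux/mm.h>

/*
 * Illustrative placeholder sketch: divert device pages back to the
 * driver's pool instead of the buddy allocator when they are freed.
 */
static bool maybe_free_device_page(struct page *page)
{
	if (!is_device_page(page))
		return false;		/* normal page, free as usual */

	device_free_page(page);
	return true;
}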

I think at this point there is nothing more to discuss here. It is pretty
clear to me that any solution using block device/MIXEDMAP would be far
more complex and far more intrusive. I do not mind being proven wrong but
I will certainly not waste my time trying to implement such a solution.

Btw, as a data point, if you ignore my patches to mmu_notifier (which are
mostly about passing down more context information to the callback),
I touch fewer than 50 lines of common mm code. Everything else is helpers
that are only used by the device driver.

Cheers,
Jérôme

Re: Interacting with coherent memory on external devices

2015-04-27 Thread Rik van Riel
On 04/27/2015 03:26 PM, Christoph Lameter wrote:

> DAX is about directly accessing memory. It is made for the purpose of
> serving as a block device for a filesystem right now but it can easily be
> used as a way to map any external memory into a processes space using the
> abstraction of a block device. But then you can do that with any device
> driver using VM_PFNMAP or VM_MIXEDMAP. Maybe we better use that term
> instead. Guess I have repeated myself 6 times or so now? I am stopping
> with this one.

Yeah, please stop.

If after 6 times you have still not grasped that having the
application manage which memory goes onto the device and
which goes in RAM is the exact opposite of the use model
that Paul and Jerome are trying to enable (transparent moving
around of memory, by eg. GPU calculation libraries), you are
clearly not paying enough attention.

-- 
All rights reversed


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Christoph Lameter
On Mon, 27 Apr 2015, Jerome Glisse wrote:

> > We can drop the DAX name and just talk about mapping to external memory if
> > that confuses the issue.
>
> DAX is for direct access block layer (X is for the cool name factor)
> there is zero code inside DAX that would be usefull to us. Because it
> is all about filesystem and short circuiting the pagecache. So DAX is
> _not_ about providing rw mappings to non regular memory, it is about
> allowing to directly map _filesystem backing storage_ into a process.

It's about directly mapping memory outside of regular kernel
management via a block device into user space. That you can put a
filesystem on top is one possible use case. You can provide a block
device to map the memory of the coprocessor and then configure the memory
space to have the same layout on the coprocessor as well as in the Linux
process.

> Moreover DAX is not about managing that persistent memory, all the
> management is done inside the fs (ext4, xfs, ...) in the same way as
> for non persistent memory. While in our case we want to manage the
> memory as a runtime resources that is allocated to process the same
> way regular system memory is managed.

I repeatedly said that. So you would have a block device that would be
used to mmap portions of the special memory into a process.

> So current DAX code have nothing of value for our usecase nor what we
> propose will have anyvalue for DAX. Unless they decide to go down the
> struct page road for persistent memory (which from last discussion i
> heard was not there plan, i am pretty sure they entirely dismissed
> that idea for now).

DAX is about directly accessing memory. It is made for the purpose of
serving as a block device for a filesystem right now, but it can easily be
used as a way to map any external memory into a process's address space using
the abstraction of a block device. But then you can do that with any device
driver using VM_PFNMAP or VM_MIXEDMAP. Maybe we had better use that term
instead. Guess I have repeated myself 6 times or so now? I am stopping
with this one.

> My point is that this is 2 differents non overlapping problems, and
> thus mandate 2 differents solution.

Well, confusion abounds since so much other stuff has been attached to DAX
devices.

Let's drop the DAX term and use VM_PFNMAP or VM_MIXEDMAP instead. MIXEDMAP
is the mechanism that DAX relies on in the VM.






Re: Interacting with coherent memory on external devices

2015-04-27 Thread Jerome Glisse
On Mon, Apr 27, 2015 at 11:51:51AM -0500, Christoph Lameter wrote:
> On Mon, 27 Apr 2015, Jerome Glisse wrote:
> 
> > > Well lets avoid that. Access to device memory comparable to what the
> > > drivers do today by establishing page table mappings or a generalization
> > > of DAX approaches would be the most straightforward way of implementing it
> > > and would build based on existing functionality. Page migration currently
> > > does not work with driver mappings or DAX because there is no struct page
> > > that would allow the lockdown of the page. That may require either
> > > continued work on the DAX with page structs approach or new developments
> > > in the page migration logic comparable to the get_user_page() alternative
> > > of simply creating a scatter gather table to just submit a couple of
> > > memory ranges to the I/O subsystem thereby avoiding page structs.
> >
> > What you refuse to see is that DAX is geared toward filesystem and as such
> > rely on special mapping. There is a reason why dax.c is in fs/ and not mm/
> > and i keep pointing out we do not want our mecanism to be perceive as fs
> > from userspace point of view. We want to be below the fs, at the mm level
> > where we could really do thing transparently no matter what kind of memory
> > we are talking about (anonymous, file mapped, share).
> 
> Ok that is why I mentioned the device memory mappings that are currently
> used for this purpose. You could generalize the DAX approach (which I
> understand as providing rw mappings to memory outside of the memory
> managed by the kernel and not as a fs specific thing).
> 
> We can drop the DAX name and just talk about mapping to external memory if
> that confuses the issue.

DAX is for the direct access block layer (the X is for the cool name factor);
there is zero code inside DAX that would be useful to us, because it
is all about filesystems and short-circuiting the page cache. So DAX is
_not_ about providing rw mappings to non-regular memory, it is about
allowing _filesystem backing storage_ to be directly mapped into a process.
Moreover, DAX is not about managing that persistent memory; all the
management is done inside the fs (ext4, xfs, ...) in the same way as
for non-persistent memory. In our case, by contrast, we want to manage the
memory as a runtime resource that is allocated to processes the same
way regular system memory is managed.

So the current DAX code has nothing of value for our use case, nor will what
we propose have any value for DAX. Unless they decide to go down the
struct page road for persistent memory (which, from the last discussion I
heard, was not their plan; I am pretty sure they entirely dismissed
that idea for now).

My point is that this is 2 differents non overlapping problems, and
thus mandate 2 differents solution.

Cheers,
Jérôme
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Christoph Lameter
On Mon, 27 Apr 2015, Jerome Glisse wrote:

> > Well lets avoid that. Access to device memory comparable to what the
> > drivers do today by establishing page table mappings or a generalization
> > of DAX approaches would be the most straightforward way of implementing it
> > and would build based on existing functionality. Page migration currently
> > does not work with driver mappings or DAX because there is no struct page
> > that would allow the lockdown of the page. That may require either
> > continued work on the DAX with page structs approach or new developments
> > in the page migration logic comparable to the get_user_page() alternative
> > of simply creating a scatter gather table to just submit a couple of
> > memory ranges to the I/O subsystem thereby avoiding page structs.
>
> What you refuse to see is that DAX is geared toward filesystem and as such
> rely on special mapping. There is a reason why dax.c is in fs/ and not mm/
> and i keep pointing out we do not want our mecanism to be perceive as fs
> from userspace point of view. We want to be below the fs, at the mm level
> where we could really do thing transparently no matter what kind of memory
> we are talking about (anonymous, file mapped, share).

Ok that is why I mentioned the device memory mappings that are currently
used for this purpose. You could generalize the DAX approach (which I
understand as providing rw mappings to memory outside of the memory
managed by the kernel and not as a fs specific thing).

We can drop the DAX name and just talk about mapping to external memory if
that confuses the issue.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Christoph Lameter
On Mon, 27 Apr 2015, Rik van Riel wrote:

> Why would we want to avoid the sane approach that makes this thing
> work with the fewest required changes to core code?

Because new ZONEs are a pretty invasive change to the memory management,
and because there are other ways to handle references to device-specific
memory.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Jerome Glisse
On Mon, Apr 27, 2015 at 11:17:43AM -0500, Christoph Lameter wrote:
> On Mon, 27 Apr 2015, Jerome Glisse wrote:
> 
> > > Improvements to the general code would be preferred instead of
> > > having specialized solutions for a particular hardware alone.  If the
> > > general code can then handle the special coprocessor situation then we
> > > avoid a lot of code development.
> >
> > I think Paul only big change would be the memory ZONE changes. Having a
> > way to add the device memory as struct page while blocking the kernel
> > allocation from using this memory. Beside that i think the autonuma changes
> > he would need would really be specific to his usecase but would still
> > reuse all of the low level logic.
> 
> Well lets avoid that. Access to device memory comparable to what the
> drivers do today by establishing page table mappings or a generalization
> of DAX approaches would be the most straightforward way of implementing it
> and would build based on existing functionality. Page migration currently
> does not work with driver mappings or DAX because there is no struct page
> that would allow the lockdown of the page. That may require either
> continued work on the DAX with page structs approach or new developments
> in the page migration logic comparable to the get_user_page() alternative
> of simply creating a scatter gather table to just submit a couple of
> memory ranges to the I/O subsystem thereby avoiding page structs.

What you refuse to see is that DAX is geared toward filesystems and as
such relies on special mappings. There is a reason why dax.c is in fs/ and
not mm/, and I keep pointing out that we do not want our mechanism to be
perceived as a filesystem from the userspace point of view. We want to be
below the fs, at the mm level, where we can really do things transparently
no matter what kind of memory we are talking about (anonymous, file-mapped,
shared).

The fact is that DAX is about persistent storage, but the people who
develop that persistent storage think it would be nice to expose it as
some kind of special memory. I am all for direct mapping of this kind of
memory, but it is still used as a backing store for a filesystem.

In our case, by contrast, we are talking about "usual" _volatile_ memory
that should not be used or exposed as a filesystem.

I can't understand why you are so hellbent on the DAX paradigm, but it
does not suit us in any way. We are not a filesystem, we are regular
memory; our realm is mm/, not fs/.

Cheers,
Jérôme
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Christoph Lameter
On Mon, 27 Apr 2015, Paul E. McKenney wrote:

> I would instead look on this as a way to try out use of hardware migration
> hints, which could lead to hardware vendors providing similar hints for
> node-to-node migrations.  At that time, the benefits could be provided
> all the functionality relying on such migrations.

Ok that sounds good. These "hints" could allow for the optimization of the
page migration logic.

> > Well yes that works with read-only mappings. Maybe we can special case
> > that in the page migration code? We do not need migration entries if
> > access is read-only actually.
>
> So you are talking about the situation only during the migration itself,
> then?  If there is no migration in progress, then of course there is
> no problem with concurrent writes because the cache-coherence protocol
> takes care of things.  During migration of a given page, I agree that
> marking that page read-only on both sides makes sense.

This is sort of what happens in the current migration scheme. In the page
tables the regular entries are replaced by migration ptes and the page is
therefore inaccessible. Any access is then trapped until the page contents
have been moved to the new location. Then the migration pte is replaced by
a real pte again that allows full access to the page. At that point the
processes that have been put to sleep because they attempted an access to
that page are woken up.

The current scheme may be improved on by allowing read access to the page
while migration is in progress. If we changed the migration entries to
allow read access then the readers would not have to be put to sleep. Only
writers would have to be put to sleep until the migration is complete.
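To make that concrete, the fault-side handling today is roughly the
following (a simplified sketch modeled on do_swap_page() in mm/memory.c,
not a verbatim copy):

#include <linux/swapops.h>

/*
 * Sketch: what the fault path does when it finds a migration entry.
 * The faulting task simply sleeps until migration finishes and a real
 * pte has been installed again, then the access is retried.
 */
static int wait_if_under_migration(struct mm_struct *mm, pmd_t *pmd,
                                   unsigned long address, pte_t orig_pte)
{
        swp_entry_t entry = pte_to_swp_entry(orig_pte);

        if (is_migration_entry(entry)) {
                migration_entry_wait(mm, pmd, address);
                return 1;       /* caller returns and the access is retried */
        }
        return 0;
}

The read-access idea above would amount to installing a pte that is
present but write-protected instead of a migration entry, so only the
write fault path would have to wait.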

> And I agree that latency-sensitive applications might not tolerate
> the page being read-only, and thus would want to avoid migration.
> Such applications would of course instead rely on placing the memory.

That's why we have the ability to switch off these automatisms, and that
is why we are trying to keep the OS away from certain processors.

But this is not the only concern here. The other thing is to make this fit
into existing functionality as cleanly as possible. So I think we would be
looking at gradual improvements in the page migration logic as well as
in the support for mapping external memory via driver mmap calls, DAX
and/or RDMA subsystem functionality. Those two areas of functionality need
to work together better in order to provide a solution for your use cases.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Rik van Riel
On 04/27/2015 12:17 PM, Christoph Lameter wrote:
> On Mon, 27 Apr 2015, Jerome Glisse wrote:
> 
>>> Improvements to the general code would be preferred instead of
>>> having specialized solutions for a particular hardware alone.  If the
>>> general code can then handle the special coprocessor situation then we
>>> avoid a lot of code development.
>>
>> I think Paul only big change would be the memory ZONE changes. Having a
>> way to add the device memory as struct page while blocking the kernel
>> allocation from using this memory. Beside that i think the autonuma changes
>> he would need would really be specific to his usecase but would still
>> reuse all of the low level logic.
> 
> Well lets avoid that. 

Why would we want to avoid the sane approach that makes this thing
work with the fewest required changes to core code?

Just because your workload is different from the workload they are
trying to enable?

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Christoph Lameter
On Mon, 27 Apr 2015, Jerome Glisse wrote:

> > Improvements to the general code would be preferred instead of
> > having specialized solutions for a particular hardware alone.  If the
> > general code can then handle the special coprocessor situation then we
> > avoid a lot of code development.
>
> I think Paul only big change would be the memory ZONE changes. Having a
> way to add the device memory as struct page while blocking the kernel
> allocation from using this memory. Beside that i think the autonuma changes
> he would need would really be specific to his usecase but would still
> reuse all of the low level logic.

Well, let's avoid that. Access to device memory comparable to what the
drivers do today, by establishing page table mappings or a generalization
of DAX approaches, would be the most straightforward way of implementing
it and would build on existing functionality. Page migration currently
does not work with driver mappings or DAX because there is no struct page
that would allow the lockdown of the page. That may require either
continued work on the DAX-with-page-structs approach or new developments
in the page migration logic, comparable to the get_user_page() alternative
of simply creating a scatter-gather table to submit a couple of memory
ranges to the I/O subsystem, thereby avoiding page structs.
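For reference, the struct-page-based path that such an alternative would
sidestep looks roughly like this today (a sketch only, error handling and
unpinning trimmed):

#include <linux/mm.h>
#include <linux/scatterlist.h>

/*
 * Sketch: pin a user buffer and build a scatter-gather table for it.
 * This relies on struct page; the alternative discussed above would
 * submit raw memory ranges instead, avoiding the page array entirely.
 */
static int pin_and_map_user_buffer(unsigned long start, size_t len,
                                   struct page **pages, struct sg_table *sgt)
{
        unsigned long offset = offset_in_page(start);
        unsigned int nr_pages = DIV_ROUND_UP(offset + len, PAGE_SIZE);
        int pinned;

        pinned = get_user_pages_fast(start, nr_pages, 1 /* write */, pages);
        if (pinned < 0)
                return pinned;

        /* Collapse contiguous pages into as few scatterlist entries as possible. */
        return sg_alloc_table_from_pages(sgt, pages, pinned, offset, len,
                                         GFP_KERNEL);
}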

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Paul E. McKenney
On Mon, Apr 27, 2015 at 10:08:29AM -0500, Christoph Lameter wrote:
> On Sat, 25 Apr 2015, Paul E. McKenney wrote:
> 
> > Would you have a URL or other pointer to this code?
> 
> linux/mm/migrate.c

Ah, I thought you were calling out something not yet in mainline.

> > > > Without modifying a single line of mm code, the only way to do this is to
> > > > either unmap from the cpu page table the range being migrated or to mprotect
> > > > it in some way. In both case the cpu access will trigger some kind of fault.
> > >
> > > Yes that is how Linux migration works. If you can fix that then how about
> > > improving page migration in Linux between NUMA nodes first?
> >
> > In principle, that also would be a good thing.  But why do that first?
> 
> Because it would benefit a lot of functionality that today relies on page
> migration to have a faster more reliable way of moving pages around.

I would instead look on this as a way to try out use of hardware migration
hints, which could lead to hardware vendors providing similar hints for
node-to-node migrations.  At that time, the benefits could be provided to
all the functionality relying on such migrations.

> > > > This is not the behavior we want. What we want is same address space while
> > > > being able to migrate system memory to device memory (who make that decision
> > > > should not be part of that discussion) while still gracefully handling any
> > > > CPU access.
> > >
> > > Well then there could be a situation where you have concurrent write
> > > access. How do you reconcile that then? Somehow you need to stall one or
> > > the other until the transaction is complete.
> >
> > Or have store buffers on one or both sides.
> 
> Well if those store buffers end up with divergent contents then you have
> the problem of not being able to decide which version should survive. But
> from Jerome's response I deduce that this is avoided by only allow
> read-only access during migration. That is actually similar to what page
> migration does.

Fair enough.

> > > > This means if CPU access it we want to migrate memory back to system memory.
> > > > To achieve this there is no way around adding couple of if inside the mm
> > > > page fault code path. Now do you want each driver to add its own if branch
> > > > or do you want a common infrastructure to do just that ?
> > >
> > > If you can improve the page migration in general then we certainly would
> > > love that. Having faultless migration is certain a good thing for a lot of
> > > functionality that depends on page migration.
> >
> > We do have to start somewhere, though.  If we insist on perfection for
> > all situations before we agree to make a change, we won't be making very
> > many changes, now will we?
> 
> Improvements to the general code would be preferred instead of
> having specialized solutions for a particular hardware alone.  If the
> general code can then handle the special coprocessor situation then we
> avoid a lot of code development.

All else being equal, I agree that generality is preferred.  But here,
as is often the case, all else is not necessarily equal.

> > As I understand it, the trick (if you can call it that) is having the
> > device have the same memory-mapping capabilities as the CPUs.
> 
> Well yes that works with read-only mappings. Maybe we can special case
> that in the page migration code? We do not need migration entries if
> access is read-only actually.

So you are talking about the situation only during the migration itself,
then?  If there is no migration in progress, then of course there is
no problem with concurrent writes because the cache-coherence protocol
takes care of things.  During migration of a given page, I agree that
marking that page read-only on both sides makes sense.

And I agree that latency-sensitive applications might not tolerate
the page being read-only, and thus would want to avoid migration.
Such applications would of course instead rely on placing the memory.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Jerome Glisse
On Mon, Apr 27, 2015 at 10:08:29AM -0500, Christoph Lameter wrote:
> On Sat, 25 Apr 2015, Paul E. McKenney wrote:
> 
> > Would you have a URL or other pointer to this code?
> 
> linux/mm/migrate.c
> 
> > > > Without modifying a single line of mm code, the only way to do this is to
> > > > either unmap from the cpu page table the range being migrated or to mprotect
> > > > it in some way. In both case the cpu access will trigger some kind of fault.
> > >
> > > Yes that is how Linux migration works. If you can fix that then how about
> > > improving page migration in Linux between NUMA nodes first?
> >
> > In principle, that also would be a good thing.  But why do that first?
> 
> Because it would benefit a lot of functionality that today relies on page
> migration to have a faster more reliable way of moving pages around.

I do not think that in the CAPI case there is any way to improve on the
current low-level page migration. I am talking about:
  - write protect & TLB flush
  - copy
  - update page table, TLB flush
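Expressed as (very simplified) kernel C, that per-pte sequence is roughly
the sketch below; the helpers are the real primitives, but all the
locking, rmap walking and error handling is elided, so take it as an
illustration of the steps rather than working code:

#include <linux/highmem.h>
#include <linux/mm.h>
#include <asm/tlbflush.h>

/* Sketch of the three low-level steps above for one mapping of a page. */
static void migrate_one_mapping(struct vm_area_struct *vma,
                                unsigned long addr, pte_t *ptep,
                                struct page *old, struct page *new)
{
        /* 1. write protect & TLB flush */
        ptep_set_wrprotect(vma->vm_mm, addr, ptep);
        flush_tlb_page(vma, addr);

        /* 2. copy */
        copy_highpage(new, old);

        /* 3. update page table, TLB flush */
        set_pte_at(vma->vm_mm, addr, ptep, mk_pte(new, vma->vm_page_prot));
        flush_tlb_page(vma, addr);
}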

The upper level that has the logic for the migration would however need
some changes. Like Paul said, some kind of new metric, and also a new way
to gather statistics from the device instead of from the CPU. I think the
device can provide better information than the current logic, where pages
are unmapped and the kernel looks at which CPU faults on a page first.
Also a way to feed hints provided by userspace through the device driver
into the NUMA decision process.

So I do not think that anything in this work would benefit any other
workload than the one Paul is interested in. Still, I am sure Paul wants
to build on top of existing infrastructure.


> 
> > > > This is not the behavior we want. What we want is same address space while
> > > > being able to migrate system memory to device memory (who make that decision
> > > > should not be part of that discussion) while still gracefully handling any
> > > > CPU access.
> > >
> > > Well then there could be a situation where you have concurrent write
> > > access. How do you reconcile that then? Somehow you need to stall one or
> > > the other until the transaction is complete.
> >
> > Or have store buffers on one or both sides.
> 
> Well if those store buffers end up with divergent contents then you have
> the problem of not being able to decide which version should survive. But
> from Jerome's response I deduce that this is avoided by only allow
> read-only access during migration. That is actually similar to what page
> migration does.

Yes, as said above, no change to the logic there; we do not want divergent
content at all. The thing is, autonuma is a better fit for Paul because
his platform is more advanced and he can allocate struct pages for the
device memory, while in my case that would be pointless as the memory is
not CPU accessible. This is why the HMM patchset does not build on top of
autonuma and the current page migration, but still uses the same kind of
logic.

> 
> > > > This means if CPU access it we want to migrate memory back to system memory.
> > > > To achieve this there is no way around adding couple of if inside the mm
> > > > page fault code path. Now do you want each driver to add its own if branch
> > > > or do you want a common infrastructure to do just that ?
> > >
> > > If you can improve the page migration in general then we certainly would
> > > love that. Having faultless migration is certain a good thing for a lot of
> > > functionality that depends on page migration.
> >
> > We do have to start somewhere, though.  If we insist on perfection for
> > all situations before we agree to make a change, we won't be making very
> > many changes, now will we?
> 
> Improvements to the general code would be preferred instead of
> having specialized solutions for a particular hardware alone.  If the
> general code can then handle the special coprocessor situation then we
> avoid a lot of code development.

I think Paul's only big change would be the memory ZONE changes: having a
way to add the device memory as struct pages while blocking the kernel
allocator from using this memory. Beside that, I think the autonuma
changes he would need would really be specific to his use case, but would
still reuse all of the low-level logic.
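As a rough illustration of what that could look like with today's
interfaces (only a sketch, assuming the driver knows the physical range
and a CPU-less node id for the device; this is not an endorsement of one
particular mechanism):

#include <linux/memory_hotplug.h>

/*
 * Sketch: give the coherent device memory struct pages by hotplugging
 * it into a (CPU-less) node. Onlining it afterwards as ZONE_MOVABLE,
 * e.g. via "echo online_movable > /sys/devices/system/memory/memoryN/state",
 * keeps ordinary kernel allocations away from it and keeps the pages
 * migratable.
 */
static int register_device_memory(int nid, u64 start, u64 size)
{
        return add_memory(nid, start, size);    /* creates struct pages */
}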

> 
> > As I understand it, the trick (if you can call it that) is having the
> > device have the same memory-mapping capabilities as the CPUs.
> 
> Well yes that works with read-only mappings. Maybe we can special case
> that in the page migration code? We do not need migration entries if
> access is read-only actually.

Duplicating read-only memory on the device is really an optimization that
is not critical to the whole. The common use case remains the migration of
read & write memory to device memory when that memory is mostly or only
accessed by the device.

Cheers,
Jérôme
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Interacting with coherent memory on external devices

2015-04-27 Thread Christoph Lameter
On Sat, 25 Apr 2015, Paul E. McKenney wrote:

> Would you have a URL or other pointer to this code?

linux/mm/migrate.c

> > > Without modifying a single line of mm code, the only way to do this is to
> > > either unmap from the cpu page table the range being migrated or to mprotect
> > > it in some way. In both case the cpu access will trigger some kind of fault.
> >
> > Yes that is how Linux migration works. If you can fix that then how about
> > improving page migration in Linux between NUMA nodes first?
>
> In principle, that also would be a good thing.  But why do that first?

Because it would benefit a lot of functionality that today relies on page
migration to have a faster more reliable way of moving pages around.

> > > This is not the behavior we want. What we want is same address space while
> > > being able to migrate system memory to device memory (who make that decision
> > > should not be part of that discussion) while still gracefully handling any
> > > CPU access.
> >
> > Well then there could be a situation where you have concurrent write
> > access. How do you reconcile that then? Somehow you need to stall one or
> > the other until the transaction is complete.
>
> Or have store buffers on one or both sides.

Well, if those store buffers end up with divergent contents then you have
the problem of not being able to decide which version should survive. But
from Jerome's response I deduce that this is avoided by only allowing
read-only access during migration. That is actually similar to what page
migration does.

> > > This means if CPU access it we want to migrate memory back to system memory.
> > > To achieve this there is no way around adding couple of if inside the mm
> > > page fault code path. Now do you want each driver to add its own if branch
> > > or do you want a common infrastructure to do just that ?
> >
> > If you can improve the page migration in general then we certainly would
> > love that. Having faultless migration is certain a good thing for a lot of
> > functionality that depends on page migration.
>
> We do have to start somewhere, though.  If we insist on perfection for
> all situations before we agree to make a change, we won't be making very
> many changes, now will we?

Improvements to the general code would be preferred instead of
having specialized solutions for a particular hardware alone.  If the
general code can then handle the special coprocessor situation then we
avoid a lot of code development.

> As I understand it, the trick (if you can call it that) is having the
> device have the same memory-mapping capabilities as the CPUs.

Well yes that works with read-only mappings. Maybe we can special case
that in the page migration code? We do not need migration entries if
access is read-only actually.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Rik van Riel
On 04/27/2015 03:26 PM, Christoph Lameter wrote:

> DAX is about directly accessing memory. It is made for the purpose of
> serving as a block device for a filesystem right now but it can easily be
> used as a way to map any external memory into a processes space using the
> abstraction of a block device. But then you can do that with any device
> driver using VM_PFNMAP or VM_MIXEDMAP. Maybe we better use that term
> instead. Guess I have repeated myself 6 times or so now? I am stopping
> with this one.

Yeah, please stop.

If after 6 times you have still not grasped that having the
application manage which memory goes onto the device and
which goes in RAM is the exact opposite of the use model
that Paul and Jerome are trying to enable (transparent moving
around of memory, by eg. GPU calculation libraries), you are
clearly not paying enough attention.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Jerome Glisse
On Mon, Apr 27, 2015 at 02:26:04PM -0500, Christoph Lameter wrote:
> On Mon, 27 Apr 2015, Jerome Glisse wrote:
> 
> > > We can drop the DAX name and just talk about mapping to external memory if
> > > that confuses the issue.
> > 
> > DAX is for direct access block layer (X is for the cool name factor)
> > there is zero code inside DAX that would be usefull to us. Because it
> > is all about filesystem and short circuiting the pagecache. So DAX is
> > _not_ about providing rw mappings to non regular memory, it is about
> > allowing to directly map _filesystem backing storage_ into a process.
> 
> Its about directly mapping memory outside of regular kernel
> management via a block device into user space. That you can put a
> filesystem on top is one possible use case. You can provide a block
> device to map the memory of the coprocessor and then configure the memory
> space to have the same layout on the coprocessor as well as the linux
> process.

A _block device_ is not what we want; the block device API does not match
anything remotely useful for our use case. Most of the block device API
deals with disks and scheduling I/O on them, none of which is interesting
to us. So we would need to carefully create various no-op functions and
insert ourselves as some kind of fake block device, while also making sure
no userspace could actually use us as a regular block device. We would be
pretending to be something we are not.

 
> > Moreover DAX is not about managing that persistent memory, all the
> > management is done inside the fs (ext4, xfs, ...) in the same way as
> > for non persistent memory. While in our case we want to manage the
> > memory as a runtime resources that is allocated to process the same
> > way regular system memory is managed.
> 
> I repeatedly said that. So you would have a block device that would be
> used to mmap portions of the special memory into a process.
> 
> > So current DAX code have nothing of value for our usecase nor what we
> > propose will have anyvalue for DAX. Unless they decide to go down the
> > struct page road for persistent memory (which from last discussion i
> > heard was not there plan, i am pretty sure they entirely dismissed
> > that idea for now).
> 
> DAX is about directly accessing memory. It is made for the purpose of
> serving as a block device for a filesystem right now but it can easily be
> used as a way to map any external memory into a processes space using the
> abstraction of a block device. But then you can do that with any device
> driver using VM_PFNMAP or VM_MIXEDMAP. Maybe we better use that term
> instead. Guess I have repeated myself 6 times or so now? I am stopping
> with this one.
> 
> > My point is that this is 2 differents non overlapping problems, and
> > thus mandate 2 differents solution.
> 
> Well confusion abounds since so much other stuff has ben attached to DAX
> devices.
> 
> Lets drop the DAX term and use VM_PFNMAP or VM_MIXEDMAP instead. MIXEDMAP
> is the mechanism that DAX relies on in the VM.

That would require far more changes than you seem to think. First, using
MIXEDMAP|PFNMAP means we lose any kind of memory accounting, and forget
about memcg too. Second, it means we would need to set those flags on all
vmas, which kind of points out that something must be wrong here. You
would also need vm_ops for all those vmas (including anonymous private
vmas, which sounds like it will break quite a few places that test for
that). Then you have to think about vmas that already have vm_ops: you
would need to override them to handle the case where it is device memory
and forward the other cases to the existing vm_ops. Extra layering, extra
complexity.

All in all, this leads me to believe that any such approach would be
vastly more complex, would involve changing many places, and would try to
shoehorn something into the block device model that is clearly not a
block device.

Paul's solution, or mine, is far smaller. I think Paul can even get away
without adding or changing a ZONE by putting the device pages onto a
different list that is not used by the kernel memory allocator. Only a few
code places would need a new if() (when freeing a page and when
initializing the device memory struct pages; the lru code could be kept
intact here).
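Something along these lines is what I picture for the free path;
is_device_page() and device_free_page() are hypothetical helpers, shown
only to indicate how small the hook could be:

#include <linux/mm.h>

/*
 * Hypothetical sketch only: is_device_page() and device_free_page() do
 * not exist today; the point is that the hook in the generic free path
 * could be this small.
 */
static inline void free_one_page_hook(struct page *page, unsigned int order)
{
        if (is_device_page(page)) {
                /* hand the page back to the driver's private free list */
                device_free_page(page, order);
                return;
        }
        __free_pages(page, order);      /* normal buddy allocator path */
}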

I think at this point there is nothing more to discuss here. It is pretty
clear to me that any solution using a block device/MIXEDMAP would be far
more complex and far more intrusive. I do not mind being proven wrong, but
I will certainly not waste my time trying to implement such a solution.

Btw, as a data point: if you ignore my patches to mmu_notifier (which are
mostly about passing down more context information to the callback), I
touch fewer than 50 lines of common mm code. Everything else is helpers
that are only used by the device driver.

Cheers,
Jérôme
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Benjamin Herrenschmidt
On Mon, 2015-04-27 at 11:48 -0500, Christoph Lameter wrote:
> On Mon, 27 Apr 2015, Rik van Riel wrote:
> 
> > Why would we want to avoid the sane approach that makes this thing
> > work with the fewest required changes to core code?
> 
> Becaus new ZONEs are a pretty invasive change to the memory management and
> because there are  other ways to handle references to device specific
> memory.

ZONEs are just one option we put on the table.

I think we can mostly agree on the fundamentals that a good model of
such a co-processor is a NUMA node, possibly with a higher distance
than other nodes (but even that can be debated).

That gives us a lot of the basics we need, such as struct page and the
ability to use the existing migration infrastructure, and it is actually a
reasonable representation at a high level as well.

The question is how do we additionally get the random stuff we don't
care about out of the way. The large distance will not help that much
under memory pressure for example.

Covering the entire device memory with a CMA goes a long way toward that
goal. It will avoid your ordinary kernel allocations.

It also provides just what we need to be able to do large contiguous
explicit allocations for use by workloads that don't want the
transparent migration and by the driver for the device which might also
need such special allocations for its own internal management data
structures. 
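As a sketch of that direction (the sizes and the cma handle are made up,
and the exact cma_declare_contiguous() parameters vary a bit between
kernel versions, so treat this as illustrative only):

#include <linux/cma.h>

static struct cma *devmem_cma;

/* Reserve the whole device memory range as a CMA area at boot. */
static int __init devmem_cma_reserve(phys_addr_t base, phys_addr_t size)
{
        /* fixed placement at 'base'; default alignment and granularity */
        return cma_declare_contiguous(base, size, 0, 0, 0, true, &devmem_cma);
}

/* Later, hand out large physically contiguous chunks to the workload. */
static struct page *devmem_alloc_chunk(unsigned long nr_pages)
{
        return cma_alloc(devmem_cma, nr_pages, 0 /* no extra alignment */);
}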

We still have the risk of pages in the CMA being pinned by something
like gup, however; that's where the ZONE idea comes in, to ensure the
various kernel allocators will *never* allocate in that zone unless
explicitly asked to, but that could possibly be implemented differently.

Maybe a concept of an exclusive NUMA node, where allocations never fall
back to that node unless explicitly asked to go there.
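The explicit half of that already exists today: a caller who wants memory
on the device node (dev_nid here being the device's node id, which is an
assumption of this sketch) can say so and refuse any fallback:

#include <linux/gfp.h>

/*
 * Sketch: explicitly allocate movable pages from the device's node and
 * refuse to fall back anywhere else. The "exclusive" half -- keeping
 * ordinary allocations from ever falling back *into* this node -- is the
 * part that would need new infrastructure.
 */
static struct page *alloc_on_device_node(int dev_nid, unsigned int order)
{
        return alloc_pages_node(dev_nid,
                                GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
                                order);
}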

Of course that would have an impact on memory pressure calculations;
nothing comes completely for free. But at this stage, that is the goal of
this thread, i.e., to swap ideas around and see what's most likely to
work in the long run before we even start implementing something.

Cheers,
Ben.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Christoph Lameter
On Sat, 25 Apr 2015, Paul E. McKenney wrote:

 Would you have a URL or other pointer to this code?

linux/mm/migrate.c

   Without modifying a single line of mm code, the only way to do this is to
   either unmap from the cpu page table the range being migrated or to 
   mprotect
   it in some way. In both case the cpu access will trigger some kind of 
   fault.
 
  Yes that is how Linux migration works. If you can fix that then how about
  improving page migration in Linux between NUMA nodes first?

 In principle, that also would be a good thing.  But why do that first?

Because it would benefit a lot of functionality that today relies on page
migration to have a faster more reliable way of moving pages around.

   This is not the behavior we want. What we want is same address space while
   being able to migrate system memory to device memory (who make that 
   decision
   should not be part of that discussion) while still gracefully handling any
   CPU access.
 
  Well then there could be a situation where you have concurrent write
  access. How do you reconcile that then? Somehow you need to stall one or
  the other until the transaction is complete.

 Or have store buffers on one or both sides.

Well if those store buffers end up with divergent contents then you have
the problem of not being able to decide which version should survive. But
from Jerome's response I deduce that this is avoided by only allow
read-only access during migration. That is actually similar to what page
migration does.

   This means if CPU access it we want to migrate memory back to system 
   memory.
   To achieve this there is no way around adding couple of if inside the mm
   page fault code path. Now do you want each driver to add its own if branch
   or do you want a common infrastructure to do just that ?
 
  If you can improve the page migration in general then we certainly would
  love that. Having faultless migration is certain a good thing for a lot of
  functionality that depends on page migration.

 We do have to start somewhere, though.  If we insist on perfection for
 all situations before we agree to make a change, we won't be making very
 many changes, now will we?

Improvements to the general code would be preferred instead of
having specialized solutions for a particular hardware alone.  If the
general code can then handle the special coprocessor situation then we
avoid a lot of code development.

 As I understand it, the trick (if you can call it that) is having the
 device have the same memory-mapping capabilities as the CPUs.

Well yes that works with read-only mappings. Maybe we can special case
that in the page migration code? We do not need migration entries if
access is read-only actually.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Jerome Glisse
On Mon, Apr 27, 2015 at 10:08:29AM -0500, Christoph Lameter wrote:
 On Sat, 25 Apr 2015, Paul E. McKenney wrote:
 
  Would you have a URL or other pointer to this code?
 
 linux/mm/migrate.c
 
Without modifying a single line of mm code, the only way to do this is 
to
either unmap from the cpu page table the range being migrated or to 
mprotect
it in some way. In both case the cpu access will trigger some kind of 
fault.
  
   Yes that is how Linux migration works. If you can fix that then how about
   improving page migration in Linux between NUMA nodes first?
 
  In principle, that also would be a good thing.  But why do that first?
 
 Because it would benefit a lot of functionality that today relies on page
 migration to have a faster more reliable way of moving pages around.

I do no think in the CAPI case there is anyway to improve on current low
leve page migration. I am talking about :
  - write protect  tlb flush
  - copy
  - update page table tlb flush

The upper level that have the logic for the migration would however need
some change. Like Paul said some kind of new metric and also new way to
gather statistics from device instead from CPU. I think the device can
provide better informations that the actual logic where page are unmap
and the kernel look which CPU fault on page first. Also a way to allow
hint provide by userspace through the device driver into the numa
decision process.

So i do not think that anything in this work would benefit any other work
load then the one Paul is interested in. Still i am sure Paul want to
build on top of existing infrastructure.


 
This is not the behavior we want. What we want is same address space 
while
being able to migrate system memory to device memory (who make that 
decision
should not be part of that discussion) while still gracefully handling 
any
CPU access.
  
   Well then there could be a situation where you have concurrent write
   access. How do you reconcile that then? Somehow you need to stall one or
   the other until the transaction is complete.
 
  Or have store buffers on one or both sides.
 
 Well if those store buffers end up with divergent contents then you have
 the problem of not being able to decide which version should survive. But
 from Jerome's response I deduce that this is avoided by only allow
 read-only access during migration. That is actually similar to what page
 migration does.

Yes, as said above no change to the logic there, we do not want divergent
content at all. The thing is, autonuma is a better fit for Paul because
Paul platform being more advance he can allocate struct page for the device
memory. While in my case it would be pointless as the memory is not CPU
accessible. This is why the HMM patchset do not build on top of autonuma
and current page migration but still use the same kind of logic.

 
This means if CPU access it we want to migrate memory back to system 
memory.
To achieve this there is no way around adding couple of if inside the mm
page fault code path. Now do you want each driver to add its own if 
branch
or do you want a common infrastructure to do just that ?
  
   If you can improve the page migration in general then we certainly would
   love that. Having faultless migration is certain a good thing for a lot of
   functionality that depends on page migration.
 
  We do have to start somewhere, though.  If we insist on perfection for
  all situations before we agree to make a change, we won't be making very
  many changes, now will we?
 
 Improvements to the general code would be preferred instead of
 having specialized solutions for a particular hardware alone.  If the
 general code can then handle the special coprocessor situation then we
 avoid a lot of code development.

I think Paul only big change would be the memory ZONE changes. Having a
way to add the device memory as struct page while blocking the kernel
allocation from using this memory. Beside that i think the autonuma changes
he would need would really be specific to his usecase but would still
reuse all of the low level logic.

 
  As I understand it, the trick (if you can call it that) is having the
  device have the same memory-mapping capabilities as the CPUs.
 
 Well yes that works with read-only mappings. Maybe we can special case
 that in the page migration code? We do not need migration entries if
 access is read-only actually.

The duplicate read only memory on device, is really an optimization that
is not critical to the whole. The common use case remain the migration of
read  write memory to device memory when the memory is mostly/only
accessed by the device.

Cheers,
Jérôme
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Christoph Lameter
On Mon, 27 Apr 2015, Paul E. McKenney wrote:

 I would instead look on this as a way to try out use of hardware migration
 hints, which could lead to hardware vendors providing similar hints for
 node-to-node migrations.  At that time, the benefits could be provided
 all the functionality relying on such migrations.

Ok that sounds good. These hints could allow for the optimization of the
page migration logic.

  Well yes that works with read-only mappings. Maybe we can special case
  that in the page migration code? We do not need migration entries if
  access is read-only actually.

 So you are talking about the situation only during the migration itself,
 then?  If there is no migration in progress, then of course there is
 no problem with concurrent writes because the cache-coherence protocol
 takes care of things.  During migration of a given page, I agree that
 marking that page read-only on both sides makes sense.

This is sortof what happens in the current migration scheme. In the page
tables the regular entries are replaced by migration ptes and the page is
therefore inaccessible. Any access is then trapped until the page
contentshave been moved to the new location. Then the migration pte is
replaced by a real pte again that allows full access to the page. At that
point the processes that have been put to sleep because they attempted an
access to that page are woken up.

The current scheme may be improvied on by allowing read access to the page
while migration is in process. If we would change the migration entries to
allow read access then the readers would not have to be put to sleep. Only
writers would have to be put to sleep until the migration is complete.

  And I agree that latency-sensitive applications might not tolerate
 the page being read-only, and thus would want to avoid migration.
 Such applications would of course instead rely on placing the memory.

Thats why we have the ability to switch off these automatism and that is
why we are trying to keep the OS away from certain processors.

But this is not the only concern here. The other thing is to make this fit
into existing functionaly as cleanly as possible. So I think we would be
looking at gradual improvements in the page migration logic as well as
in the support for mapping external memory via driver mmap calls, DAX
and/or RDMA subsystem functionality. Those two areas of functionality need
to work together better in order to provide a solution for your use cases.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Christoph Lameter
On Mon, 27 Apr 2015, Jerome Glisse wrote:

  Well lets avoid that. Access to device memory comparable to what the
  drivers do today by establishing page table mappings or a generalization
  of DAX approaches would be the most straightforward way of implementing it
  and would build based on existing functionality. Page migration currently
  does not work with driver mappings or DAX because there is no struct page
  that would allow the lockdown of the page. That may require either
  continued work on the DAX with page structs approach or new developments
  in the page migration logic comparable to the get_user_page() alternative
  of simply creating a scatter gather table to just submit a couple of
  memory ranges to the I/O subsystem thereby avoiding page structs.

 What you refuse to see is that DAX is geared toward filesystem and as such
 rely on special mapping. There is a reason why dax.c is in fs/ and not mm/
 and i keep pointing out we do not want our mecanism to be perceive as fs
 from userspace point of view. We want to be below the fs, at the mm level
 where we could really do thing transparently no matter what kind of memory
 we are talking about (anonymous, file mapped, share).

Ok that is why I mentioned the device memory mappings that are currently
used for this purpose. You could generalize the DAX approach (which I
understand as providing rw mappings to memory outside of the memory
managed by the kernel and not as a fs specific thing).

We can drop the DAX name and just talk about mapping to external memory if
that confuses the issue.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Paul E. McKenney
On Mon, Apr 27, 2015 at 10:08:29AM -0500, Christoph Lameter wrote:
 On Sat, 25 Apr 2015, Paul E. McKenney wrote:
 
  Would you have a URL or other pointer to this code?
 
 linux/mm/migrate.c

Ah, I thought you were calling out something not yet in mainline.

Without modifying a single line of mm code, the only way to do this is 
to
either unmap from the cpu page table the range being migrated or to 
mprotect
it in some way. In both case the cpu access will trigger some kind of 
fault.
  
   Yes that is how Linux migration works. If you can fix that then how about
   improving page migration in Linux between NUMA nodes first?
 
  In principle, that also would be a good thing.  But why do that first?
 
 Because it would benefit a lot of functionality that today relies on page
 migration to have a faster more reliable way of moving pages around.

I would instead look on this as a way to try out use of hardware migration
hints, which could lead to hardware vendors providing similar hints for
node-to-node migrations.  At that time, the benefits could be provided
all the functionality relying on such migrations.

This is not the behavior we want. What we want is same address space 
while
being able to migrate system memory to device memory (who make that 
decision
should not be part of that discussion) while still gracefully handling 
any
CPU access.
  
   Well then there could be a situation where you have concurrent write
   access. How do you reconcile that then? Somehow you need to stall one or
   the other until the transaction is complete.
 
  Or have store buffers on one or both sides.
 
 Well if those store buffers end up with divergent contents then you have
 the problem of not being able to decide which version should survive. But
 from Jerome's response I deduce that this is avoided by only allow
 read-only access during migration. That is actually similar to what page
 migration does.

Fair enough.

This means if CPU access it we want to migrate memory back to system 
memory.
To achieve this there is no way around adding couple of if inside the mm
page fault code path. Now do you want each driver to add its own if 
branch
or do you want a common infrastructure to do just that ?
  
   If you can improve the page migration in general then we certainly would
   love that. Having faultless migration is certain a good thing for a lot of
   functionality that depends on page migration.
 
  We do have to start somewhere, though.  If we insist on perfection for
  all situations before we agree to make a change, we won't be making very
  many changes, now will we?
 
 Improvements to the general code would be preferred instead of
 having specialized solutions for a particular hardware alone.  If the
 general code can then handle the special coprocessor situation then we
 avoid a lot of code development.

All else being equal, I agree that generality is preferred.  But here,
as is often the case, all else is not necessarily equal.

  As I understand it, the trick (if you can call it that) is having the
  device have the same memory-mapping capabilities as the CPUs.
 
 Well yes that works with read-only mappings. Maybe we can special case
 that in the page migration code? We do not need migration entries if
 access is read-only actually.

So you are talking about the situation only during the migration itself,
then?  If there is no migration in progress, then of course there is
no problem with concurrent writes because the cache-coherence protocol
takes care of things.  During migration of a given page, I agree that
marking that page read-only on both sides makes sense.

And I agree that latency-sensitive applications might not tolerate
the page being read-only, and thus would want to avoid migration.
Such applications would of course instead rely on placing the memory.

Thanx, Paul

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Rik van Riel
On 04/27/2015 12:17 PM, Christoph Lameter wrote:
 On Mon, 27 Apr 2015, Jerome Glisse wrote:
 
 Improvements to the general code would be preferred instead of
 having specialized solutions for a particular hardware alone.  If the
 general code can then handle the special coprocessor situation then we
 avoid a lot of code development.

 I think Paul only big change would be the memory ZONE changes. Having a
 way to add the device memory as struct page while blocking the kernel
 allocation from using this memory. Beside that i think the autonuma changes
 he would need would really be specific to his usecase but would still
 reuse all of the low level logic.
 
 Well lets avoid that. 

Why would we want to avoid the sane approach that makes this thing
work with the fewest required changes to core code?

Just because your workload is different from the workload they are
trying to enable?

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Christoph Lameter
On Mon, 27 Apr 2015, Rik van Riel wrote:

 Why would we want to avoid the sane approach that makes this thing
 work with the fewest required changes to core code?

Becaus new ZONEs are a pretty invasive change to the memory management and
because there are  other ways to handle references to device specific
memory.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Christoph Lameter
On Mon, 27 Apr 2015, Jerome Glisse wrote:

  Improvements to the general code would be preferred instead of
  having specialized solutions for a particular hardware alone.  If the
  general code can then handle the special coprocessor situation then we
  avoid a lot of code development.

 I think Paul only big change would be the memory ZONE changes. Having a
 way to add the device memory as struct page while blocking the kernel
 allocation from using this memory. Beside that i think the autonuma changes
 he would need would really be specific to his usecase but would still
 reuse all of the low level logic.

Well lets avoid that. Access to device memory comparable to what the
drivers do today by establishing page table mappings or a generalization
of DAX approaches would be the most straightforward way of implementing it
and would build based on existing functionality. Page migration currently
does not work with driver mappings or DAX because there is no struct page
that would allow the lockdown of the page. That may require either
continued work on the DAX with page structs approach or new developments
in the page migration logic comparable to the get_user_page() alternative
of simply creating a scatter gather table to just submit a couple of
memory ranges to the I/O subsystem thereby avoiding page structs.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Jerome Glisse
On Mon, Apr 27, 2015 at 11:17:43AM -0500, Christoph Lameter wrote:
 On Mon, 27 Apr 2015, Jerome Glisse wrote:
 
   Improvements to the general code would be preferred instead of
   having specialized solutions for a particular hardware alone.  If the
   general code can then handle the special coprocessor situation then we
   avoid a lot of code development.
 
  I think Paul only big change would be the memory ZONE changes. Having a
  way to add the device memory as struct page while blocking the kernel
  allocation from using this memory. Beside that i think the autonuma changes
  he would need would really be specific to his usecase but would still
  reuse all of the low level logic.
 
 Well lets avoid that. Access to device memory comparable to what the
 drivers do today by establishing page table mappings or a generalization
 of DAX approaches would be the most straightforward way of implementing it
 and would build based on existing functionality. Page migration currently
 does not work with driver mappings or DAX because there is no struct page
 that would allow the lockdown of the page. That may require either
 continued work on the DAX with page structs approach or new developments
 in the page migration logic comparable to the get_user_page() alternative
 of simply creating a scatter gather table to just submit a couple of
 memory ranges to the I/O subsystem thereby avoiding page structs.

What you refuse to see is that DAX is geared toward filesystem and as such
rely on special mapping. There is a reason why dax.c is in fs/ and not mm/
and i keep pointing out we do not want our mecanism to be perceive as fs
from userspace point of view. We want to be below the fs, at the mm level
where we could really do thing transparently no matter what kind of memory
we are talking about (anonymous, file mapped, share).

The fact is that DAX is about persistant storage but the people that
develop the persitant storage think it would be nice to expose it as some
kind of special memory. I am all for the direct mapping of this kind of
memory but still it is use as a backing store for a filesystem.

While in our case we are talking about usual _volatile_ memory that
should be use or expose as a filesystem.

I can't understand why you are so hellbent on the DAX paradigm, but it is
not what suit us in no way. We are not filesystem, we are regular memory,
our realm is mm/ not fs/

Cheers,
Jérôme
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Jerome Glisse
On Mon, Apr 27, 2015 at 11:51:51AM -0500, Christoph Lameter wrote:
> On Mon, 27 Apr 2015, Jerome Glisse wrote:
> 
> > > Well lets avoid that. Access to device memory comparable to what the
> > > drivers do today by establishing page table mappings or a generalization
> > > of DAX approaches would be the most straightforward way of implementing it
> > > and would build based on existing functionality. Page migration currently
> > > does not work with driver mappings or DAX because there is no struct page
> > > that would allow the lockdown of the page. That may require either
> > > continued work on the DAX with page structs approach or new developments
> > > in the page migration logic comparable to the get_user_page() alternative
> > > of simply creating a scatter gather table to just submit a couple of
> > > memory ranges to the I/O subsystem thereby avoiding page structs.
> > 
> > What you refuse to see is that DAX is geared toward filesystems and as such
> > relies on special mappings. There is a reason why dax.c is in fs/ and not mm/,
> > and I keep pointing out we do not want our mechanism to be perceived as a fs
> > from the userspace point of view. We want to be below the fs, at the mm level,
> > where we can really do things transparently no matter what kind of memory
> > we are talking about (anonymous, file mapped, shared).
> 
> Ok that is why I mentioned the device memory mappings that are currently
> used for this purpose. You could generalize the DAX approach (which I
> understand as providing rw mappings to memory outside of the memory
> managed by the kernel and not as a fs specific thing).
> 
> We can drop the DAX name and just talk about mapping to external memory if
> that confuses the issue.

DAX is for the direct access block layer (the X is for the cool name
factor); there is zero code inside DAX that would be useful to us, because
it is all about filesystems and short-circuiting the page cache. So DAX is
_not_ about providing rw mappings to non-regular memory; it is about
allowing _filesystem backing storage_ to be mapped directly into a process.
Moreover, DAX is not about managing that persistent memory; all the
management is done inside the fs (ext4, xfs, ...) in the same way as
for non-persistent memory. While in our case we want to manage the
memory as a runtime resource that is allocated to processes the same
way regular system memory is managed.

So the current DAX code has nothing of value for our usecase, nor will what
we propose have any value for DAX. Unless they decide to go down the
struct page road for persistent memory (which from the last discussion I
heard was not their plan; I am pretty sure they entirely dismissed
that idea for now).

My point is that these are 2 different, non-overlapping problems, and
thus they mandate 2 different solutions.

Cheers,
Jérôme


Re: Interacting with coherent memory on external devices

2015-04-27 Thread Christoph Lameter
On Mon, 27 Apr 2015, Jerome Glisse wrote:

> > We can drop the DAX name and just talk about mapping to external memory if
> > that confuses the issue.
> 
> DAX is for the direct access block layer (the X is for the cool name
> factor); there is zero code inside DAX that would be useful to us, because
> it is all about filesystems and short-circuiting the page cache. So DAX is
> _not_ about providing rw mappings to non-regular memory; it is about
> allowing _filesystem backing storage_ to be mapped directly into a process.

It's about directly mapping memory outside of regular kernel
management via a block device into user space. That you can put a
filesystem on top is one possible use case. You can provide a block
device to map the memory of the coprocessor and then configure the memory
space to have the same layout on the coprocessor as well as in the linux
process.

> Moreover, DAX is not about managing that persistent memory; all the
> management is done inside the fs (ext4, xfs, ...) in the same way as
> for non-persistent memory. While in our case we want to manage the
> memory as a runtime resource that is allocated to processes the same
> way regular system memory is managed.

I repeatedly said that. So you would have a block device that would be
used to mmap portions of the special memory into a process.

> So the current DAX code has nothing of value for our usecase, nor will what
> we propose have any value for DAX. Unless they decide to go down the
> struct page road for persistent memory (which from the last discussion I
> heard was not their plan; I am pretty sure they entirely dismissed
> that idea for now).

DAX is about directly accessing memory. It is made for the purpose of
serving as a block device for a filesystem right now, but it can easily be
used as a way to map any external memory into a process's space using the
abstraction of a block device. But then you can do that with any device
driver using VM_PFNMAP or VM_MIXEDMAP. Maybe we had better use that term
instead. Guess I have repeated myself 6 times or so now? I am stopping
with this one.

> My point is that these are 2 different, non-overlapping problems, and
> thus they mandate 2 different solutions.

Well, confusion abounds since so much other stuff has been attached to DAX
devices.

Let's drop the DAX term and use VM_PFNMAP or VM_MIXEDMAP instead. MIXEDMAP
is the mechanism that DAX relies on in the VM.
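
For reference, the VM_PFNMAP route mentioned here is what a driver gets
today by implementing its mmap() handler with remap_pfn_range(); a minimal
sketch (not from the thread: the aperture base and size are hypothetical
placeholders, and error handling is trimmed):

/* Sketch: a character-device .mmap handler exposing a device-memory
 * aperture to user space as a VM_PFNMAP mapping via remap_pfn_range().
 * DEVMEM_PHYS_BASE/DEVMEM_SIZE are hypothetical placeholders. */
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>

#define DEVMEM_PHYS_BASE 0x100000000ULL	/* hypothetical aperture base */
#define DEVMEM_SIZE      (256UL << 20)	/* hypothetical aperture size  */

static int devmem_mmap(struct file *filp, struct vm_area_struct *vma)
{
	unsigned long len = vma->vm_end - vma->vm_start;

	if (len > DEVMEM_SIZE)
		return -EINVAL;

	/* No struct pages back this range, so the core mm (and page
	 * migration) will not touch it, which is exactly the limitation
	 * being discussed in this thread. */
	return remap_pfn_range(vma, vma->vm_start,
			       DEVMEM_PHYS_BASE >> PAGE_SHIFT,
			       len, vma->vm_page_prot);
}

static const struct file_operations devmem_fops = {
	.owner = THIS_MODULE,
	.mmap  = devmem_mmap,
};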






Re: Interacting with coherent memory on external devices

2015-04-25 Thread Paul E. McKenney
On Sat, Apr 25, 2015 at 01:32:39PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2015-04-24 at 22:32 -0400, Rik van Riel wrote:
> > >   The result would be that the kernel would allocate only
> > migratable
> > >   pages within the CCAD device's memory, and even then only if
> > >   memory was otherwise exhausted.
> > 
> > Does it make sense to allocate the device's page tables in memory
> > belonging to the device?
> > 
> > Is this a necessary thing with some devices? Jerome's HMM comes
> > to mind...
> 
> In our case, the device's MMU shares the host page tables (which is why
> we can't use HMM, ie we can't have a page with different permissions on
> CPU vs. device which HMM does).
> 
> However the device has a pretty fast path to system memory, the best
> thing we can do is pin the workload to the same chip the device is
> connected to so those page tables arent' too far away.

And another update, diffs then full document.  Among other things, this
version explicitly calls out the goal of gaining substantial performance
without changing user applications, which should hopefully help.

Thanx, Paul



diff --git a/DeviceMem.txt b/DeviceMem.txt
index 15d0a8b5d360..3de70c4b9922 100644
--- a/DeviceMem.txt
+++ b/DeviceMem.txt
@@ -40,10 +40,13 @@
workloads will have less-predictable access patterns, and these
workloads can benefit from automatic migration of data between
device memory and system memory as access patterns change.
-   Furthermore, some devices will provide special hardware that
-   collects access statistics that can be used to determine whether
-   or not a given page of memory should be migrated, and if so,
-   to where.
+   In this latter case, the goal is not optimal performance,
+   but rather a significant increase in performance compared to
+   what the CPUs alone can provide without needing to recompile
+   any of the applications making up the workload.  Furthermore,
+   some devices will provide special hardware that collects access
+   statistics that can be used to determine whether or not a given
+   page of memory should be migrated, and if so, to where.
 
The purpose of this document is to explore how this access
and migration can be provided for within the Linux kernel.
@@ -146,6 +149,32 @@ REQUIREMENTS
required for low-latency applications that are sensitive
to OS jitter.
 
+   6.  It must be possible to cause an application to use a
+   CCAD device simply by switching dynamically linked
+   libraries, but without recompiling that application.
+   This implies the following requirements:
+
+   a.  Address spaces must be synchronized for a given
+   application on the CPUs and the CCAD.  In other
+   words, a given virtual address must access the same
+   physical memory from the CCAD device and from
+   the CPUs.
+
+   b.  Code running on the CCAD device must be able to
+   access the running application's memory,
+   regardless of how that memory was allocated,
+   including statically allocated at compile time.
+
+   c.  Use of the CCAD device must not interfere with
+   memory allocations that are never used by the
+   CCAD device.  For example, if a CCAD device
+   has 16GB of memory, that should not prevent an
+   application using that device from allocating
+   more than 16GB of memory.  For another example,
+   memory that is never accessed by a given CCAD
+   device should preferably remain outside of that
+   CCAD device's memory.
+
 
 POTENTIAL IDEAS
 
@@ -178,12 +207,11 @@ POTENTIAL IDEAS
physical address ranges of normal system memory would
be interleaved with those of device memory.
 
-   This would also require some sort of
-   migration infrastructure to be added, as autonuma would
-   not apply.  However, this approach has the advantage
-   of preventing allocations in these regions, at least
-   unless those allocations have been explicitly flagged
-   to go there.
+   This would also require some sort of migration
+   infrastructure to be added, as autonuma would not apply.
+   However, this approach has the advantage of preventing
+   allocations in these regions, at least unless those
+   allocations have been explicitly flagged to go there.
 
	4.	Your idea here!

@@ -274,21 

Re: Interacting with coherent memory on external devices

2015-04-25 Thread Paul E. McKenney
On Fri, Apr 24, 2015 at 10:49:28AM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Paul E. McKenney wrote:
> 
> > can deliver, but where the cost of full-fledge hand tuning cannot be
> > justified.
> >
> > You seem to believe that this latter category is the empty set, which
> > I must confess does greatly surprise me.
> 
> If there are already compromises are being made then why would you want to
> modify the kernel for this? Some user space coding and device drivers
> should be sufficient.

The goal is to gain substantial performance improvement without any
user-space changes.

Thanx, Paul



Re: Interacting with coherent memory on external devices

2015-04-25 Thread Paul E. McKenney
On Fri, Apr 24, 2015 at 03:00:18PM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
> 
> > > Still no answer as to why is that not possible with the current scheme?
> > > You keep on talking about pointers and I keep on responding that this is a
> > > matter of making the address space compatible on both sides.
> >
> > So if do that in a naive way, how can we migrate a chunk of memory to video
> > memory while still handling properly the case where CPU try to access that
> > same memory while it is migrated to the GPU memory.
> 
> Well that the same issue that the migration code is handling which I
> submitted a long time ago to the kernel.

Would you have a URL or other pointer to this code?

> > Without modifying a single line of mm code, the only way to do this is to
> > either unmap from the cpu page table the range being migrated or to mprotect
> > it in some way. In both case the cpu access will trigger some kind of fault.
> 
> Yes that is how Linux migration works. If you can fix that then how about
> improving page migration in Linux between NUMA nodes first?

In principle, that also would be a good thing.  But why do that first?

> > This is not the behavior we want. What we want is same address space while
> > being able to migrate system memory to device memory (who make that decision
> > should not be part of that discussion) while still gracefully handling any
> > CPU access.
> 
> Well then there could be a situation where you have concurrent write
> access. How do you reconcile that then? Somehow you need to stall one or
> the other until the transaction is complete.

Or have store buffers on one or both sides.

> > This means if CPU access it we want to migrate memory back to system memory.
> > To achieve this there is no way around adding couple of if inside the mm
> > page fault code path. Now do you want each driver to add its own if branch
> > or do you want a common infrastructure to do just that ?
> 
> If you can improve the page migration in general then we certainly would
> love that. Having faultless migration is certain a good thing for a lot of
> functionality that depends on page migration.

We do have to start somewhere, though.  If we insist on perfection for
all situations before we agree to make a change, we won't be making very
many changes, now will we?

> > As i keep saying the solution you propose is what we have today, today we
> > have fake share address space through the trick of remapping system memory
> > at same address inside the GPU address space and also enforcing the use of
> > a special memory allocator that goes behind the back of mm code.
> 
> Hmmm... I'd like to know more details about that.

As I understand it, the trick (if you can call it that) is having the
device have the same memory-mapping capabilities as the CPUs.

> > As you pointed out, not using GPU memory is a waste and we want to be able
> > to use it. Now Paul have more sofisticated hardware that offer oportunities
> > to do thing in a more transparent and efficient way.
> 
> Does this also work between NUMA nodes in a Power8 system?

Heh!  At the rate we are going with this discussion, Power8 will be
obsolete before we have this in.  ;-)

Thanx, Paul



Re: Interacting with coherent memory on external devices

2015-04-25 Thread Paul E. McKenney
On Fri, Apr 24, 2015 at 11:09:36AM -0400, Jerome Glisse wrote:
> On Fri, Apr 24, 2015 at 07:57:38AM -0700, Paul E. McKenney wrote:
> > On Fri, Apr 24, 2015 at 09:12:07AM -0500, Christoph Lameter wrote:
> > > On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> > > 
> > > >
> > > > DAX
> > > >
> > > > DAX is a mechanism for providing direct-memory access to
> > > > high-speed non-volatile (AKA "persistent") memory.  Good
> > > > introductions to DAX may be found in the following LWN
> > > > articles:
> > > 
> > > DAX is a mechanism to access memory not managed by the kernel and is the
> > > successor to XIP. It just happens to be needed for persistent memory.
> > > Fundamentally any driver can provide an MMAPPed interface to allow access
> > > to a devices memory.
> > 
> > I will take another look, but others in this thread have called out
> > difficulties with DAX's filesystem nature.
> 
> Do not waste your time on that this is not what we want. Christoph here
> is more than stuborn and fails to see the world.

Well, we do need to make sure that we are correctly representing DAX's
capabilities.  It is a hot topic, and others will probably also suggest
that it be used.  That said, at the moment, I don't see how it would help,
given the need to migrate memory.  Perhaps Boaz Harrosh's patch set to
allow struct pages to be associated might help?  But from what I can see,
a fair amount of other functionality would still be required either way.

I am updating the DAX section a bit, but I don't claim that it is complete.

Thanx, Paul


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Benjamin Herrenschmidt
On Fri, 2015-04-24 at 22:32 -0400, Rik van Riel wrote:
> >   The result would be that the kernel would allocate only
> migratable
> >   pages within the CCAD device's memory, and even then only if
> >   memory was otherwise exhausted.
> 
> Does it make sense to allocate the device's page tables in memory
> belonging to the device?
> 
> Is this a necessary thing with some devices? Jerome's HMM comes
> to mind...

In our case, the device's MMU shares the host page tables (which is why
we can't use HMM, ie we can't have a page with different permissions on
CPU vs. device which HMM does).

However the device has a pretty fast path to system memory, the best
thing we can do is pin the workload to the same chip the device is
connected to so those page tables arent' too far away.

Cheers,
Ben.
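
As a concrete user-space illustration of the pinning Ben describes, a
process can be tied to the NUMA node its device hangs off using libnuma;
a sketch only, with the node number as a placeholder (link with -lnuma):

/* Sketch: keep the workload's CPUs and allocations on the node the
 * device is attached to (node number is a placeholder). */
#include <numa.h>
#include <stdio.h>

int pin_near_device(int device_node)
{
	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return -1;
	}
	if (numa_run_on_node(device_node))	/* CPU affinity */
		return -1;
	numa_set_preferred(device_node);	/* prefer local allocations */
	return 0;
}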




Re: Interacting with coherent memory on external devices

2015-04-24 Thread Rik van Riel
On 04/21/2015 05:44 PM, Paul E. McKenney wrote:

> AUTONUMA
> 
>   The Linux kernel's autonuma facility supports migrating both
>   memory and processes to promote NUMA memory locality.  It was
>   accepted into 3.13 and is available in RHEL 7.0 and SLES 12.
>   It is enabled by the Kconfig variable CONFIG_NUMA_BALANCING.
> 
>   This approach uses a kernel thread "knuma_scand" that periodically
>   marks pages inaccessible.  The page-fault handler notes any
>   mismatches between the NUMA node that the process is running on
>   and the NUMA node on which the page resides.

Minor nit: marking pages inaccessible is done from task_work
nowadays, there no longer is a kernel thread.

>   The result would be that the kernel would allocate only migratable
>   pages within the CCAD device's memory, and even then only if
>   memory was otherwise exhausted.

Does it make sense to allocate the device's page tables in memory
belonging to the device?

Is this a necessary thing with some devices? Jerome's HMM comes
to mind...

-- 
All rights reversed
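
As a side note to the autonuma description above, the facility is
runtime-controllable; a tiny sketch that checks whether automatic NUMA
balancing is currently enabled (assuming the kernel.numa_balancing sysctl
is present on the running kernel):

/* Tiny sketch: report whether automatic NUMA balancing is enabled.
 * Assumes /proc/sys/kernel/numa_balancing exists. */
#include <stdio.h>

int numa_balancing_enabled(void)
{
	FILE *f = fopen("/proc/sys/kernel/numa_balancing", "r");
	int val = -1;	/* -1: unknown (file absent or unreadable) */

	if (f) {
		if (fscanf(f, "%d", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}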


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Benjamin Herrenschmidt
On Fri, 2015-04-24 at 11:58 -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
> 
> > > What exactly is the more advanced version's benefit? What are the features
> > > that the other platforms do not provide?
> >
> > Transparent access to device memory from the CPU, you can map any of the GPU
> > memory inside the CPU and have the whole cache coherency including proper
> > atomic memory operation. CAPI is not some mumbo jumbo marketing name there
> > is real hardware behind it.
> 
> Got the hardware here but I am getting pretty sobered given what I heard
> here. The IBM mumbo jumpo marketing comes down to "not much" now.

Ugh ... first nothing we propose precludes using it with explicit memory
management the way you want. So I don't know why you have a problem
here. We are trying to cover a *different* usage model than yours
obviously. But they aren't exclusive.

Secondly, none of what we are discussing here is supported by *existing*
hardware, so whatever you have is not concerned. There is no CAPI based
coprocessor today that provides cachable memory to the system (though
CAPI as a technology supports it), and no GPU doing that either *yet*.
Today's CAPI adapters can own host cache lines but don't expose large
swaths of cacheable local memory.

Finally, this discussion is not even specifically about CAPI or its
performances. It's about the *general* case of a coherent coprocessor
sharing the MMU. Whether it's using CAPI or whatever other technology
that allows that sort of thing that we may or may not be able to mention
at this point.

CAPI is just an example because architecturally it allows that too.

Ben.





Re: Interacting with coherent memory on external devices

2015-04-24 Thread Jerome Glisse
On Fri, Apr 24, 2015 at 03:00:18PM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
> 
> > > Still no answer as to why is that not possible with the current scheme?
> > > You keep on talking about pointers and I keep on responding that this is a
> > > matter of making the address space compatible on both sides.
> >
> > So if do that in a naive way, how can we migrate a chunk of memory to video
> > memory while still handling properly the case where CPU try to access that
> > same memory while it is migrated to the GPU memory.
> 
> Well that the same issue that the migration code is handling which I
> submitted a long time ago to the kernel.

Yes, so you had to modify the kernel for that! So do we, and no, page migration
as it exists is not sufficient and does not cover all the use cases we have.

> 
> > Without modifying a single line of mm code, the only way to do this is to
> > either unmap from the cpu page table the range being migrated or to mprotect
> > it in some way. In both case the cpu access will trigger some kind of fault.
> 
> Yes that is how Linux migration works. If you can fix that then how about
> improving page migration in Linux between NUMA nodes first?

In my case I can not use the page migration because there is nowhere to hook
in to explain how to migrate things back and forth with a device. The page
migration code is all on the CPU and enjoys the benefit of being able to do
things atomically; I do not have such luxury.

Moreover, the core mm code assumes that a cpu pte migration entry is a
short-lived state. In the case of migration to device memory we are talking
about time spans of several minutes. So obviously the page migration is not
what we want; we want something similar but with different properties. That
is exactly what my HMM patchset does provide.

What Paul wants to do, however, should be able to leverage the page migration
that does exist. But again he has a far more advanced platform.

> 
> > This is not the behavior we want. What we want is same address space while
> > being able to migrate system memory to device memory (who make that decision
> > should not be part of that discussion) while still gracefully handling any
> > CPU access.
> 
> Well then there could be a situation where you have concurrent write
> access. How do you reconcile that then? Somehow you need to stall one or
> the other until the transaction is complete.

No, it is exactly like threads on a CPU: if you have 2 threads that write to
the same address without any kind of synchronization between them, you can not
predict what the end result will be. The same will happen here: either the GPU
write goes last or the CPU one. Anyway, this is not the use case we have in
mind. We are thinking about concurrent access to the same page but in a
non-conflicting way. Any conflicting access is a software bug, like it is in
the case of CPU threads.
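
The CPU-thread analogy can be made concrete with a few lines of pthread
code (illustrative only): two unsynchronized writers racing on one
location, where the outcome is simply whichever store lands last.

/* Illustration of the analogy above: two writers, no synchronization,
 * so the final value is whichever store lands last; a software bug,
 * not something the memory system has to resolve. */
#include <pthread.h>
#include <stdio.h>

static int shared;	/* deliberately unsynchronized */

static void *writer(void *arg)
{
	shared = (int)(long)arg;
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, writer, (void *)1L);
	pthread_create(&b, NULL, writer, (void *)2L);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("shared = %d\n", shared);	/* 1 or 2, nondeterministically */
	return 0;
}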

> 
> > This means if CPU access it we want to migrate memory back to system memory.
> > To achieve this there is no way around adding couple of if inside the mm
> > page fault code path. Now do you want each driver to add its own if branch
> > or do you want a common infrastructure to do just that ?
> 
> If you can improve the page migration in general then we certainly would
> love that. Having faultless migration is certain a good thing for a lot of
> functionality that depends on page migration.

The faultless migration I am talking about is only on the GPU side, and it is
just an extra feature where you keep something mapped read-only while
migrating it to device memory and update the GPU page table once done. So the
GPU will keep accessing system memory without interruption; this assumes
read-only access. Otherwise you need a faulting migration, though you can
cooperate with the thread scheduler to schedule other threads while the
migration is ongoing.

> 
> > As i keep saying the solution you propose is what we have today, today we
> > have fake share address space through the trick of remapping system memory
> > at same address inside the GPU address space and also enforcing the use of
> > a special memory allocator that goes behind the back of mm code.
> 
> Hmmm... I'd like to know more details about that.

Well, there is no open source OpenCL 2.0 stack for discrete GPUs. But the idea
is that you need a special allocator because the GPU driver needs to know about
all the possible pages that might be used, i.e. there is no page fault, so all
objects need to be mapped and thus all pages are pinned down. Well, it is a
little more complex than that, as the special allocator keeps track of each
allocation, creating an object for each of them and trying to only pin objects
that are used by the current shader.

Anyway, the bottom line is that it needs a special allocator: you can not use
an mmaped file directly, or shared memory directly, or anonymous memory
allocated outside the special allocator. It requires pinning memory. It can
not migrate memory to device memory. We want to fix all that.
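
For concreteness, this is roughly what the "special allocator" model looks
like from user space with OpenCL 2.0 shared virtual memory; a sketch only:
context, queue and kernel setup are elided, error handling is omitted, and
it assumes a device with fine-grained SVM buffer support.

/* Sketch of the "special allocator" model described above: memory the GPU
 * may touch comes from clSVMAlloc(); a pointer from plain malloc(), an
 * mmaped file or shared memory cannot simply be handed to the device. */
#include <CL/cl.h>
#include <string.h>

void run_on_gpu(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
		const char *src, size_t n)
{
	char *buf = clSVMAlloc(ctx,
			       CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
			       n, 0);

	memcpy(buf, src, n);	/* data in ordinary memory must be copied in */

	clSetKernelArgSVMPointer(kernel, 0, buf);
	clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
	clFinish(queue);

	clSVMFree(ctx, buf);
}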

> 
> > As you pointed out, not using 

Re: Interacting with coherent memory on external devices

2015-04-24 Thread Christoph Lameter
On Fri, 24 Apr 2015, Jerome Glisse wrote:

> > Still no answer as to why is that not possible with the current scheme?
> > You keep on talking about pointers and I keep on responding that this is a
> > matter of making the address space compatible on both sides.
>
> So if do that in a naive way, how can we migrate a chunk of memory to video
> memory while still handling properly the case where CPU try to access that
> same memory while it is migrated to the GPU memory.

Well that the same issue that the migration code is handling which I
submitted a long time ago to the kernel.

> Without modifying a single line of mm code, the only way to do this is to
> either unmap from the cpu page table the range being migrated or to mprotect
> it in some way. In both case the cpu access will trigger some kind of fault.

Yes that is how Linux migration works. If you can fix that then how about
improving page migration in Linux between NUMA nodes first?

> This is not the behavior we want. What we want is same address space while
> being able to migrate system memory to device memory (who make that decision
> should not be part of that discussion) while still gracefully handling any
> CPU access.

Well then there could be a situation where you have concurrent write
access. How do you reconcile that then? Somehow you need to stall one or
the other until the transaction is complete.

> This means if CPU access it we want to migrate memory back to system memory.
> To achieve this there is no way around adding couple of if inside the mm
> page fault code path. Now do you want each driver to add its own if branch
> or do you want a common infrastructure to do just that ?

If you can improve the page migration in general then we certainly would
love that. Having faultless migration is certain a good thing for a lot of
functionality that depends on page migration.

> As i keep saying the solution you propose is what we have today, today we
> have fake share address space through the trick of remapping system memory
> at same address inside the GPU address space and also enforcing the use of
> a special memory allocator that goes behind the back of mm code.

Hmmm... I'd like to know more details about that.

> As you pointed out, not using GPU memory is a waste and we want to be able
> to use it. Now Paul have more sofisticated hardware that offer oportunities
> to do thing in a more transparent and efficient way.

Does this also work between NUMA nodes in a Power8 system?
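
The NUMA page migration Christoph refers to above is already visible from
user space; a small sketch using the move_pages(2) wrapper from libnuma
(assumes a node 1 exists; link with -lnuma, error handling kept minimal):

/* Sketch: migrate one page of this process to NUMA node 1 using the
 * existing kernel page migration exposed via move_pages(2). */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	void *buf = aligned_alloc(pagesize, pagesize);
	void *pages[1] = { buf };
	int nodes[1] = { 1 };	/* destination node (assumes node 1 exists) */
	int status[1];

	memset(buf, 0, pagesize);	/* fault the page in first */

	if (move_pages(0 /* self */, 1, pages, nodes, status, MPOL_MF_MOVE) != 0)
		perror("move_pages");
	else
		printf("page now on node %d\n", status[0]);
	return 0;
}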



Re: Interacting with coherent memory on external devices

2015-04-24 Thread Jerome Glisse
On Fri, Apr 24, 2015 at 01:56:45PM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
> 
> > > Right this is how things work and you could improve on that. Stay with the
> > > scheme. Why would that not work if you map things the same way in both
> > > environments if both accellerator and host processor can acceess each
> > > others memory?
> >
> > Again and again share address space, having a pointer means the same thing
> > for the GPU than it means for the CPU ie having a random pointer point to
> > the same memory whether it is accessed by the GPU or the CPU. While also
> > keeping the property of the backing memory. It can be share memory from
> > other process, a file mmaped from disk or simply anonymous memory and
> > thus we have no control whatsoever on how such memory is allocated.
> 
> Still no answer as to why is that not possible with the current scheme?
> You keep on talking about pointers and I keep on responding that this is a
> matter of making the address space compatible on both sides.

So if we do that in a naive way, how can we migrate a chunk of memory to video
memory while still properly handling the case where the CPU tries to access
that same memory while it is migrated to the GPU memory?

Without modifying a single line of mm code, the only way to do this is to
either unmap the range being migrated from the cpu page table or to mprotect
it in some way. In both cases the cpu access will trigger some kind of fault.

This is not the behavior we want. What we want is the same address space while
being able to migrate system memory to device memory (who makes that decision
should not be part of this discussion) while still gracefully handling any
CPU access.

This means that if the CPU accesses it, we want to migrate the memory back to
system memory. To achieve this there is no way around adding a couple of ifs
inside the mm page fault code path. Now, do you want each driver to add its
own if branch, or do you want a common infrastructure to do just that?

As I keep saying, the solution you propose is what we have today: today we
have a fake shared address space through the trick of remapping system memory
at the same address inside the GPU address space, and also enforcing the use
of a special memory allocator that goes behind the back of the mm code.

But this limits us to only using system memory; you can not use video memory
transparently through such a scheme. One trick used today is to copy memory
to device memory and, to not bother with CPU access, pretend it can not
happen, and as such the GPU and CPU can diverge in what they see for the same
address. We want to avoid tricks like this that just lead to weird and
unexpected behavior.

As you pointed out, not using GPU memory is a waste and we want to be able
to use it. Now, Paul has more sophisticated hardware that offers opportunities
to do things in a more transparent and efficient way.

> 
> > Then you had transparent migration (transparent in the sense that we can
> > handle CPU page fault on migrated memory) and you will see that you need
> > to modify the kernel to become aware of this and provide a common code
> > to deal with all this.
> 
> If the GPU works like a CPU (which I keep hearing) then you should also be
> able to run a linu8x kernel on it and make it a regular NUMA node. Hey why
> dont we make the host cpu a GPU (hello Xeon Phi).

I am not saying it works like a CPU; I am saying it should face the same kind
of pattern when it comes to page faults, i.e. page faults are not the end of
the world for the GPU, and you should not assume that all GPU threads will
wait for a page fault, because this is not the common case on a CPU either.
Yes, we prefer when page faults never happen; so does the CPU.

No, you can not run the linux kernel on the GPU unless you are willing to
allow the kernel to run on a heterogeneous architecture with different
instruction sets. That is not even going into the problems of ring
level/system level. We might one day go down that road, but I see no
compelling reason today.

Cheers,
Jérôme


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Christoph Lameter
On Fri, 24 Apr 2015, Jerome Glisse wrote:

> > Right this is how things work and you could improve on that. Stay with the
> > scheme. Why would that not work if you map things the same way in both
> > environments if both accellerator and host processor can acceess each
> > others memory?
>
> Again and again share address space, having a pointer means the same thing
> for the GPU than it means for the CPU ie having a random pointer point to
> the same memory whether it is accessed by the GPU or the CPU. While also
> keeping the property of the backing memory. It can be share memory from
> other process, a file mmaped from disk or simply anonymous memory and
> thus we have no control whatsoever on how such memory is allocated.

Still no answer as to why is that not possible with the current scheme?
You keep on talking about pointers and I keep on responding that this is a
matter of making the address space compatible on both sides.

> Then you had transparent migration (transparent in the sense that we can
> handle CPU page fault on migrated memory) and you will see that you need
> to modify the kernel to become aware of this and provide a common code
> to deal with all this.

If the GPU works like a CPU (which I keep hearing) then you should also be
able to run a linux kernel on it and make it a regular NUMA node. Hey, why
don't we make the host cpu a GPU (hello Xeon Phi).




Re: Interacting with coherent memory on external devices

2015-04-24 Thread Oded Gabbay



On 04/23/2015 07:22 PM, Jerome Glisse wrote:
> On Thu, Apr 23, 2015 at 09:20:55AM -0500, Christoph Lameter wrote:
> > On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
> > 
> > > > There are hooks in glibc where you can replace the memory
> > > > management of the apps if you want that.
> > > 
> > > We don't control the app. Let's say we are doing a plugin for libfoo
> > > which accelerates "foo" using GPUs.
> > 
> > There are numerous examples of malloc implementation that can be used for
> > apps without modifying the app.
> 
> What about share memory pass btw process ? Or mmaped file ? Or
> a library that is loaded through dlopen and thus had no way to
> control any allocation that happen before it became active ?
> 
> > > Now some other app we have no control on uses libfoo. So pointers
> > > already allocated/mapped, possibly a long time ago, will hit libfoo (or
> > > the plugin) and we need GPUs to churn on the data.
> > 
> > IF the GPU would need to suspend one of its computation thread to wait on
> > a mapping to be established on demand or so then it looks like the
> > performance of the parallel threads on a GPU will be significantly
> > compromised. You would want to do the transfer explicitly in some fashion
> > that meshes with the concurrent calculation in the GPU. You do not want
> > stalls while GPU number crunching is ongoing.
> 
> You do not understand how GPU works. GPU have a pools of thread, and they
> always try to have the pool as big as possible so that when a group of
> thread is waiting for some memory access, there are others thread ready
> to perform some operation. GPU are about hidding memory latency that's
> what they are good at. But they only achieve that when they have more
> thread in flight than compute unit. The whole thread scheduling is done
> by hardware and barely control by the device driver.
> 
> So no having the GPU wait for a page fault is not as dramatic as you
> think. If you use GPU as they are intended to use you might even never
> notice the pagefault and reach close to the theoritical throughput of
> the GPU nonetheless.
> 
> > > The point I'm making is you are arguing against a usage model which has
> > > been repeatedly asked for by large amounts of customer (after all that's
> > > also why HMM exists).
> > 
> > I am still not clear what is the use case for this would be. Who is asking
> > for this?
> 
> Everyone but you ? OpenCL 2.0 specific request it and have several level
> of support about transparent address space. The lowest one is the one
> implemented today in which application needs to use a special memory
> allocator.
> 
> The most advance one imply integration with the kernel in which any
> memory (mmaped file, share memory or anonymous memory) can be use by
> the GPU and does not need to come from a special allocator.
> 
> Everyone in the industry is moving toward the most advance one. That
> is the raison d'être of HMM, to provide this functionality on hw
> platform that do not have things such as CAPI. Which is x86/arm.
> 
> So use case is all application using OpenCL or Cuda. So pretty much
> everyone doing GPGPU wants this. I dunno how you can't see that.
> Share address space is so much easier. Believe it or not most coders
> do not have deep knowledge of how things work and if you can remove
> the complexity of different memory allocation and different address
> space from them they will be happy.
> 
> Cheers,
> Jérôme
I second what Jerome said, and add that one of the key features of HSA 
is the ptr-is-a-ptr scheme, where the applications do *not* need to 
handle different address spaces. Instead, all the memory is seen as a 
unified address space.


See slide 6 on the following presentation:
http://www.slideshare.net/hsafoundation/hsa-overview

Thanks,
Oded




Re: Interacting with coherent memory on external devices

2015-04-24 Thread Jerome Glisse
On Fri, Apr 24, 2015 at 11:58:39AM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
> 
> > > What exactly is the more advanced version's benefit? What are the features
> > > that the other platforms do not provide?
> >
> > Transparent access to device memory from the CPU, you can map any of the GPU
> > memory inside the CPU and have the whole cache coherency including proper
> > atomic memory operation. CAPI is not some mumbo jumbo marketing name there
> > is real hardware behind it.
> 
> Got the hardware here but I am getting pretty sobered given what I heard
> here. The IBM mumbo jumpo marketing comes down to "not much" now.
> 
> > On x86 you have to take into account the PCI bar size, you also have to take
> > into account that PCIE transaction are really bad when it comes to sharing
> > memory with CPU. CAPI really improve things here.
> 
> Ok that would be interesting for the general device driver case.  Can you
> show a real performance benefit here of CAPI transactions vs. PCI-E
> transactions?

I am sure IBM will show benchmarks here when they have everything in place. I
am not working on CAPI personally; I just went through some of the
specifications for it.

> > So on x86 even if you could map all the GPU memory it would still be a bad
> > solution and thing like atomic memory operation might not even work 
> > properly.
> 
> That is solvable and doable in many other ways if needed. Actually I'd
> prefer a Xeon Phi in that case because then we also have the same
> instruction set. Having locks work right with different instruction sets
> and different coherency schemes. Ewww...
> 

Well, then go the Xeon Phi solution way and let the people that want to provide
a different, simpler (from the programmer's point of view) solution work on it.

> 
> > > Then you have the problem of fast memory access and you are proposing to
> > > complicate that access path on the GPU.
> >
> > No, i am proposing to have a solution where people doing such kind of work
> > load can leverage the GPU, yes it will not be as fast as people hand tuning
> > and rewritting their application for the GPU but it will still be faster
> > by a significant factor than only using the CPU.
> 
> Well the general purpose processors also also gaining more floating point
> capabilities which increases the pressure on accellerators to become more
> specialized.
> 
> > Moreover i am saying that this can happen without even touching a single
> > line of code of many many applications, because many of them rely on library
> > and those are the only one that would need to know about GPU.
> 
> Yea. We have heard this numerous times in parallel computing and it never
> really worked right.

Because you had a split user space: a pointer value was not pointing to the
same thing on the GPU as on the CPU, so porting a library or application is
hard and troublesome. AMD is already working on porting general applications
and libraries to leverage the brave new world of the shared address space
(libreoffice, gimp, ...).

Other people keep pressuring for the same address space; again, this is the
cornerstone of OpenCL 2.0.

I can not predict whether it will work this time, whether all meaningful and
useful libraries will start leveraging the GPU. All I am trying to do is solve
the split address space problem, a problem that you seem to ignore completely
because you are happy with the way things are. Other people are not happy.


> 
> > Finaly i am saying that having a unified address space btw the GPU and CPU
> > is a primordial prerequisite for this to happen in a transparent fashion
> > and thus DAX solution is non-sense and does not provide transparent address
> > space sharing. DAX solution is not even something new, this is how today
> > stack is working, no need for DAX, userspace just mmap the device driver
> > file and that's how they access the GPU accessible memory (which in most
> > case is just system memory mapped through the device file to the user
> > application).
> 
> Right this is how things work and you could improve on that. Stay with the
> scheme. Why would that not work if you map things the same way in both
> environments if both accellerator and host processor can acceess each
> others memory?

Again and again: a shared address space, where having a pointer means the same
thing for the GPU as it means for the CPU, i.e. a random pointer points to
the same memory whether it is accessed by the GPU or the CPU, while also
keeping the properties of the backing memory. It can be memory shared with
another process, a file mmaped from disk, or simply anonymous memory, and
thus we have no control whatsoever over how such memory is allocated.

Then add transparent migration (transparent in the sense that we can
handle CPU page faults on migrated memory) and you will see that you need
to modify the kernel to become aware of this and to provide common code
to deal with all this.

Cheers,
Jérôme

Re: Interacting with coherent memory on external devices

2015-04-24 Thread Christoph Lameter
On Fri, 24 Apr 2015, Jerome Glisse wrote:

> > What exactly is the more advanced version's benefit? What are the features
> > that the other platforms do not provide?
>
> Transparent access to device memory from the CPU, you can map any of the GPU
> memory inside the CPU and have the whole cache coherency including proper
> atomic memory operation. CAPI is not some mumbo jumbo marketing name there
> is real hardware behind it.

Got the hardware here but I am getting pretty sobered given what I heard
here. The IBM mumbo jumbo marketing comes down to "not much" now.

> On x86 you have to take into account the PCI bar size, you also have to take
> into account that PCIE transaction are really bad when it comes to sharing
> memory with CPU. CAPI really improve things here.

Ok that would be interesting for the general device driver case.  Can you
show a real performance benefit here of CAPI transactions vs. PCI-E
transactions?

> So on x86 even if you could map all the GPU memory it would still be a bad
> solution and thing like atomic memory operation might not even work properly.

That is solvable and doable in many other ways if needed. Actually I'd
prefer a Xeon Phi in that case because then we also have the same
instruction set. Having locks work right with different instruction sets
and different coherency schemes. Ewww...


> > Then you have the problem of fast memory access and you are proposing to
> > complicate that access path on the GPU.
>
> No, i am proposing to have a solution where people doing such kind of work
> load can leverage the GPU, yes it will not be as fast as people hand tuning
> and rewritting their application for the GPU but it will still be faster
> by a significant factor than only using the CPU.

Well the general purpose processors are also gaining more floating point
capabilities which increases the pressure on accellerators to become more
specialized.

> Moreover i am saying that this can happen without even touching a single
> line of code of many many applications, because many of them rely on library
> and those are the only one that would need to know about GPU.

Yea. We have heard this numerous times in parallel computing and it never
really worked right.

> Finally, I am saying that having a unified address space between the GPU and
> the CPU is an essential prerequisite for this to happen in a transparent
> fashion, and thus the DAX solution is nonsense: it does not provide
> transparent address space sharing. The DAX solution is not even something
> new; this is how today's stack already works. No need for DAX: userspace just
> mmaps the device driver file, and that is how it accesses the GPU-accessible
> memory (which in most cases is just system memory mapped through the device
> file to the user application).

Right, this is how things work, and you could improve on that. Stay with the
scheme. Why would that not work, if you map things the same way in both
environments, when both the accelerator and the host processor can access
each other's memory?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Jerome Glisse
On Fri, Apr 24, 2015 at 11:03:52AM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
> 
> > On Fri, Apr 24, 2015 at 09:29:12AM -0500, Christoph Lameter wrote:
> > > On Thu, 23 Apr 2015, Jerome Glisse wrote:
> > >
> > > > No this not have been solve properly. Today solution is doing an 
> > > > explicit
> > > > copy and again and again when complex data struct are involve (list, 
> > > > tree,
> > > > ...) this is extremly tedious and hard to debug. So today solution often
> > > > restrict themself to easy thing like matrix multiplication. But if you
> > > > provide a unified address space then you make things a lot easiers for a
> > > > lot more usecase. That's a fact, and again OpenCL 2.0 which is an 
> > > > industry
> > > > standard is a proof that unified address space is one of the most 
> > > > important
> > > > feature requested by user of GPGPU. You might not care but the rest of 
> > > > the
> > > > world does.
> > >
> > > You could use page tables on the kernel side to transfer data on demand
> > > from the GPU. And you can use a device driver to establish mappings to the
> > > GPUs memory.
> > >
> > > There is no copy needed with these approaches.
> >
> > So you are telling me to do get_user_page() ? If so you aware that this pins
> > memory ? So what happens when the GPU wants to access a range of 32GB of
> > memory ? I pin everything ?
> 
> Use either a device driver to create PTEs pointing to the data or do
> something similar like what DAX does. Pinning can be avoided if you use
> mmu_notifiers. Those will give you a callback before the OS removes the
> data and thus you can operate without pinning.

So you are actually telling me to do what I am already doing inside the HMM
patchset? Because what you describe here is exactly what the HMM patchset does.
So you are acknowledging that we need work inside the kernel?

That being said, Paul has the chance to work with a more advanced platform,
where what I am doing would actually be under-using the capabilities of the
platform. So he needs a different solution.

> 
> > Overall the throughput of the GPU will stay close to its theoritical maximum
> > if you have enough other thread that can progress and this is very common.
> 
> GPUs operate on groups of threads, not single ones. If you stall, then a
> whole group of them stalls. We are dealing with accelerators here that are
> different for performance reasons. They are not to be treated like regular
> processors, nor does their memory operate like host memory.

Again, I know how GPUs work: they work on groups of threads, I am well aware
of that, and the group size is often 32 or 64 threads. But the hardware keeps
a large pool of thread groups, something like 2^11 or 2^12 thread groups in
flight for 2^4 or 2^5 units capable of working on a thread group (in thread
count this is 2^15/2^16 threads in flight for 2^9/2^10 cores). So, just as on
the CPU, we do not expect the whole 2^11/2^12 groups of threads to hit a page
fault at once. I am saying that as long as only a small number of groups hit
one, let us say 2^3 groups (2^8/2^9 threads), that is 2^3 out of 2^11 groups,
about 0.4% of the resident work, and you still have a large number of thread
groups that can make progress without being impacted whatsoever.

And you can bet that GPU designers are also improving this by allowing
faulting thread groups to be swapped out and runnable ones swapped in, so the
overall 2^16 threads in flight might be a lot bigger on future hardware,
giving even more opportunity to hide page faults.

A GPU can operate on host memory, and you can still saturate the GPU from
host memory as long as the workloads you are running are not bandwidth
starved. I know this is unlikely for a GPU, but again think of several
_different_ applications: some of those applications might already have their
dataset in GPU memory and can thus run alongside slower threads that are
limited by the system memory bandwidth. You can still saturate your GPU that
way.

> 
> > But IBM here want to go further and to provide a more advance solution,
> > so their need are specific to there platform and we can not know if AMD,
> > ARM or Intel will want to go down the same road, they do not seem to be
> > interested. Does it means we should not support IBM ? I think it would be
> > wrong.
> 
> What exactly is the more advanced version's benefit? What are the features
> that the other platforms do not provide?

Transparent access to device memory from the CPU: you can map any of the GPU
memory into the CPU address space and get full cache coherency, including
proper atomic memory operations. CAPI is not some mumbo jumbo marketing name;
there is real hardware behind it.

On x86 you have to take the PCI BAR size into account, and you also have to
take into account that PCIe transactions are really bad when it comes to
sharing memory with the CPU. CAPI really improves things here.

So on x86, even if you could map all the GPU memory, it would still be a bad
solution, and things like atomic memory operations might not even work properly.

> 
> > > This sounds more like a case for a 

Re: Interacting with coherent memory on external devices

2015-04-24 Thread Rik van Riel
On 04/24/2015 10:30 AM, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> 
>> If by "entire industry" you mean everyone who might want to use hardware
>> acceleration, for example, including mechanical computer-aided design,
>> I am skeptical.
> 
> The industry designs GPUs with super fast special ram and accellerators
> with special ram designed to do fast searches and you think you can demand 
> page
> that stuff in from the main processor?

DRAM access latencies are a few hundred CPU cycles, but somehow
CPUs can still do computations at a fast speed, and we do not
require gigabytes of L2-cache-speed memory in the system.

It turns out the vast majority of programs have working sets,
and data access patterns where prefetching works satisfactorily.

With GPU calculations done transparently by libraries, and
largely hidden from programs, why would this be any different?

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Rik van Riel
On 04/24/2015 11:49 AM, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Paul E. McKenney wrote:
> 
>> can deliver, but where the cost of full-fledge hand tuning cannot be
>> justified.
>>
>> You seem to believe that this latter category is the empty set, which
>> I must confess does greatly surprise me.
> 
> If there are already compromises are being made then why would you want to
> modify the kernel for this? Some user space coding and device drivers
> should be sufficient.

You assume only one program at a time would get to use the GPU
for accelerated computations, and the GPU would get dedicated
to that program.

That will not be the case when you have libraries using the GPU
for computations. There could be dozens of programs in the system
using that library, with no knowledge of how many GPU resources
are used by the other programs.

There is a very clear cut case for having the OS manage the
GPU resources transparently, just like it does for all the
other resources in the system.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Christoph Lameter
On Fri, 24 Apr 2015, Jerome Glisse wrote:

> On Fri, Apr 24, 2015 at 09:29:12AM -0500, Christoph Lameter wrote:
> > On Thu, 23 Apr 2015, Jerome Glisse wrote:
> >
> > > No this not have been solve properly. Today solution is doing an explicit
> > > copy and again and again when complex data struct are involve (list, tree,
> > > ...) this is extremly tedious and hard to debug. So today solution often
> > > restrict themself to easy thing like matrix multiplication. But if you
> > > provide a unified address space then you make things a lot easiers for a
> > > lot more usecase. That's a fact, and again OpenCL 2.0 which is an industry
> > > standard is a proof that unified address space is one of the most 
> > > important
> > > feature requested by user of GPGPU. You might not care but the rest of the
> > > world does.
> >
> > You could use page tables on the kernel side to transfer data on demand
> > from the GPU. And you can use a device driver to establish mappings to the
> > GPUs memory.
> >
> > There is no copy needed with these approaches.
>
> So you are telling me to do get_user_page() ? If so you aware that this pins
> memory ? So what happens when the GPU wants to access a range of 32GB of
> memory ? I pin everything ?

Use either a device driver to create PTEs pointing to the data or do
something similar to what DAX does. Pinning can be avoided if you use
mmu_notifiers. Those will give you a callback before the OS removes the
data and thus you can operate without pinning.
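
For illustration, a minimal sketch of the mmu_notifier approach described
above. The "foo" names are hypothetical driver glue, not code from HMM or
from any existing driver; the sketch only shows how a driver can mirror CPU
page tables without pinning, using the invalidate callback mentioned here:

    /* Sketch only: struct foo_device and foo_device_unmap_range() are
     * hypothetical placeholders for real driver state and a real command
     * that tears down the device-side mappings for a virtual range. */
    #include <linux/mmu_notifier.h>
    #include <linux/mm.h>

    struct foo_mirror {
            struct mmu_notifier     mn;
            struct foo_device       *dev;
    };

    /* Called before the core mm unmaps or migrates pages in [start, end):
     * drop the device's copies of those translations so nothing stays pinned. */
    static void foo_invalidate_range_start(struct mmu_notifier *mn,
                                           struct mm_struct *mm,
                                           unsigned long start,
                                           unsigned long end)
    {
            struct foo_mirror *mirror = container_of(mn, struct foo_mirror, mn);

            foo_device_unmap_range(mirror->dev, start, end);
    }

    static const struct mmu_notifier_ops foo_mn_ops = {
            .invalidate_range_start = foo_invalidate_range_start,
    };

    static int foo_mirror_register(struct foo_mirror *mirror,
                                   struct mm_struct *mm)
    {
            mirror->mn.ops = &foo_mn_ops;
            return mmu_notifier_register(&mirror->mn, mm);
    }

The device then repopulates its own page tables lazily from the CPU page
table on device faults; the notifier keeps the two views from going stale.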

> Overall the throughput of the GPU will stay close to its theoritical maximum
> if you have enough other thread that can progress and this is very common.

GPUs operate on groups of threads not single ones. If you stall
then there will be a stall of a whole group of them. We are dealing with
accellerators here that are different for performance reasons. They are
not to be treated like regular processor, nor is memory like
operating like host mmemory.

> But IBM here want to go further and to provide a more advance solution,
> so their need are specific to there platform and we can not know if AMD,
> ARM or Intel will want to go down the same road, they do not seem to be
> interested. Does it means we should not support IBM ? I think it would be
> wrong.

What exactly is the more advanced version's benefit? What are the features
that the other platforms do not provide?

> > This sounds more like a case for a general purpose processor. If it is a
> > special device then it will typically also have special memory to allow
> > fast searches.
>
> No this kind of thing can be fast on a GPU, with GPU you easily have x500
> more cores than CPU cores, so you can slice the dataset even more and have
> each of the GPU core perform the search. Note that i am not only thinking
> of stupid memcmp here it can be something more complex like searching a
> pattern that allow variation and that require a whole program to decide if
> a chunk falls under the variation rules or not.

Then you have the problem of fast memory access and you are proposing to
complicate that access path on the GPU.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Jerome Glisse
On Fri, Apr 24, 2015 at 09:30:40AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> 
> > If by "entire industry" you mean everyone who might want to use hardware
> > acceleration, for example, including mechanical computer-aided design,
> > I am skeptical.
> 
> The industry designs GPUs with super fast special ram and accellerators
> with special ram designed to do fast searches and you think you can demand 
> page
> that stuff in from the main processor?
> 

Why do you think AMD and NVidia are adding page fault support to their GPUs
in the first place? They are not doing this on a whim; they have carefully
thought about it.

Are you saying you know better than the two biggest GPU designers on the
planet? And who do you think is pushing for such a thing in the kernel? Do
you think we are working on this on a whim? Because we woke up one day and
thought it would be cool and that it should be done this way?


Yes, if all your GPU does is page fault it will be disastrous, but is that
the usual thing we see on the CPU? No! Are people complaining about the
numerous page faults that happen over a day? No, the vast majority of users
are completely oblivious to page faults. This is how it works on the CPU,
and yes, this can work for the GPU too. What happens on the CPU? Well, the
CPU can switch to work on a different thread or a different application
altogether. The same thing will happen on the GPU. If you have enough jobs,
your GPU will be busy and you will never worry about page faults, because
overall your GPU will deliver the same kind of throughput as if there were
no page faults. It can very well be buried in the overall noise if the ratio
of runnable threads to page-faulting threads is high enough, which is most
of the time the case for the CPU. Why would the same assumption not work on
the GPU?

Note that I am not dismissing the low latency folks. I know they exist, I
know they hate page faults, and in no way will what we propose make it worse
for them. They will be able to keep the same kind of control they cherish,
but this does not mean you should go on a holy crusade pretending that other
people's workloads do not exist. They do exist. Page faults are not evil;
they have proven useful to the whole computer industry for the CPU.


To be sure you are not misinterpreting what we propose: in no way are we
saying we are going to migrate things on page fault for everyone. We are
saying that, first, the device driver decides where things need to be
(system memory or local memory), and the device driver can get hints or
requests from userspace for this (as they do today). So no change whatsoever
here; people who hand tune things will keep being able to do so (a sketch of
that explicit path is below).
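
For illustration only: the explicit path that exists today, if the device
memory were exposed to the kernel as a CPU-less NUMA node (say node 1, which
is an assumption made for the sketch, not something shipping hardware does),
could be driven from userspace with move_pages(2). Error handling is trimmed
and the node number is a placeholder:

    /* Sketch: explicitly migrate an already-allocated buffer to a NUMA node.
     * Build with -lnuma; assumes the buffer is page aligned. */
    #include <numaif.h>
    #include <unistd.h>
    #include <stdlib.h>

    static void migrate_to_node(void *buf, size_t len, int node)
    {
            long psz = sysconf(_SC_PAGESIZE);
            unsigned long npages = (len + psz - 1) / psz;
            void **pages = malloc(npages * sizeof(*pages));
            int *nodes = malloc(npages * sizeof(*nodes));
            int *status = malloc(npages * sizeof(*status));
            unsigned long i;

            for (i = 0; i < npages; i++) {
                    pages[i] = (char *)buf + i * psz;
                    nodes[i] = node;
            }
            /* pid 0 means operate on the calling process */
            move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);

            free(pages);
            free(nodes);
            free(status);
    }

The point of the paragraph below is precisely that this kind of explicit
placement stays possible, while an automatic, access-counter-driven
migration handles the applications that never call anything like this.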

Now we want to add the case where the device driver does not get any kind of
directive or hint from userspace. This is what the autonuma-style approach
is: simply collect information from the GPU about what is accessed often and
then migrate it transparently (yes, this can happen without interrupting the
GPU). So you are migrating from memory that has 16GB/s or 32GB/s of bandwidth
to device memory that has 500GB/s.

This is a valid use case. There are many people out there who do not want to
learn how to hand tune their application for the GPU but who could
nonetheless benefit from it.

Cheers,
Jérôme
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Rik van Riel
On 04/24/2015 10:01 AM, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> 
>>> As far as I know Jerome is talkeing about HPC loads and high performance
>>> GPU processing. This is the same use case.
>>
>> The difference is sensitivity to latency.  You have latency-sensitive
>> HPC workloads, and Jerome is talking about HPC workloads that need
>> high throughput, but are insensitive to latency.
> 
> Those are correlated.
> 
>>> What you are proposing for High Performacne Computing is reducing the
>>> performance these guys trying to get. You cannot sell someone a Volkswagen
>>> if he needs the Ferrari.
>>
>> You do need the low-latency Ferrari.  But others are best served by a
>> high-throughput freight train.
> 
> The problem is that they want to run 2000 trains at the same time
> and they all must arrive at the destination before they can be send on
> their next trip. 1999 trains will be sitting idle because they need
> to wait of the one train that was delayed. This reduces the troughput.
> People really would like all 2000 trains to arrive on schedule so that
> they get more performance.

So you run 4000 or even 6000 trains, and have some subset of them
run at full steam, while others are waiting on memory accesses.

In reality the overcommit factor is likely much smaller, because
the GPU threads run and block on memory in smaller, more manageable
numbers, say a few dozen at a time.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Christoph Lameter
On Fri, 24 Apr 2015, Paul E. McKenney wrote:

> > DAX is a mechanism to access memory not managed by the kernel and is the
> > successor to XIP. It just happens to be needed for persistent memory.
> > Fundamentally any driver can provide an MMAPPed interface to allow access
> > to a devices memory.
>
> I will take another look, but others in this thread have called out
> difficulties with DAX's filesystem nature.

Right so you do not need the filesystem structure. Just simply writing a
device driver that mmaps data as needed from the coprocessor will also do
the trick.
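
For what it is worth, the "just write a driver that mmaps the device memory"
model looks roughly like the sketch below. The "foo" names and the flat BAR
layout are hypothetical placeholders; the point is only that userspace then
mmap()s the device file, exactly the model Jerome describes elsewhere in the
thread, without getting shared page tables or pointer-is-a-pointer semantics:

    /* Sketch of a char-device mmap handler exposing device memory. */
    #include <linux/module.h>
    #include <linux/fs.h>
    #include <linux/mm.h>

    struct foo_dev {
            phys_addr_t     mem_base;       /* physical base of device memory */
            resource_size_t mem_size;
    };

    static int foo_mmap(struct file *file, struct vm_area_struct *vma)
    {
            struct foo_dev *fdev = file->private_data;
            unsigned long len = vma->vm_end - vma->vm_start;

            if (len > fdev->mem_size)
                    return -EINVAL;

            vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
            return remap_pfn_range(vma, vma->vm_start,
                                   fdev->mem_base >> PAGE_SHIFT,
                                   len, vma->vm_page_prot);
    }

    static const struct file_operations foo_fops = {
            .owner  = THIS_MODULE,
            .mmap   = foo_mmap,
    };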

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Christoph Lameter
On Fri, 24 Apr 2015, Paul E. McKenney wrote:

> can deliver, but where the cost of full-fledge hand tuning cannot be
> justified.
>
> You seem to believe that this latter category is the empty set, which
> I must confess does greatly surprise me.

If there are already compromises are being made then why would you want to
modify the kernel for this? Some user space coding and device drivers
should be sufficient.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Jerome Glisse
On Fri, Apr 24, 2015 at 07:57:38AM -0700, Paul E. McKenney wrote:
> On Fri, Apr 24, 2015 at 09:12:07AM -0500, Christoph Lameter wrote:
> > On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> > 
> > >
> > > DAX
> > >
> > >   DAX is a mechanism for providing direct-memory access to
> > >   high-speed non-volatile (AKA "persistent") memory.  Good
> > >   introductions to DAX may be found in the following LWN
> > >   articles:
> > 
> > DAX is a mechanism to access memory not managed by the kernel and is the
> > successor to XIP. It just happens to be needed for persistent memory.
> > Fundamentally any driver can provide an MMAPPed interface to allow access
> > to a devices memory.
> 
> I will take another look, but others in this thread have called out
> difficulties with DAX's filesystem nature.

Do not waste your time on that; this is not what we want. Christoph here is
being more than stubborn and fails to see the wider picture.

Cheers,
Jérôme
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Jerome Glisse
On Fri, Apr 24, 2015 at 09:29:12AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Jerome Glisse wrote:
> 
> > No this not have been solve properly. Today solution is doing an explicit
> > copy and again and again when complex data struct are involve (list, tree,
> > ...) this is extremly tedious and hard to debug. So today solution often
> > restrict themself to easy thing like matrix multiplication. But if you
> > provide a unified address space then you make things a lot easiers for a
> > lot more usecase. That's a fact, and again OpenCL 2.0 which is an industry
> > standard is a proof that unified address space is one of the most important
> > feature requested by user of GPGPU. You might not care but the rest of the
> > world does.
> 
> You could use page tables on the kernel side to transfer data on demand
> from the GPU. And you can use a device driver to establish mappings to the
> GPUs memory.
> 
> There is no copy needed with these approaches.

So you are telling me to do get_user_page()? If so, are you aware that this
pins memory? So what happens when the GPU wants to access a range of 32GB of
memory? Do I pin everything?

I am not talking only about transfers from the GPU to system memory; I am
talking about applications that do:

   dataset = mmap(NULL, 32UL << 30, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
   // ...
   dlopen("superlibrary.so", RTLD_NOW);   // library name illustrative
   superlibrary.dosomething(dataset);

So the application here has no clue about the GPU, and we do not want to
change that. Yes, this is a valid use case, and countless users ask for it.

How can the superlibrary give the GPU access to the dataset? Does it have to
do get_user_page() on every single page, effectively pinning the memory?
Should it allocate GPU memory through a special API and memcpy?
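
To make the "special API and memcpy" alternative concrete: today, with no
shared address space, a library has to do roughly the following (an OpenCL
1.x style sketch; error handling and the kernel launch are omitted, and the
context/queue come from elsewhere; this is illustrative, not code from any
particular library):

    /* Explicit-copy model: the library cannot hand the GPU the caller's
     * pointer; it must allocate a device buffer and copy the data over the bus. */
    #include <CL/cl.h>

    static cl_mem upload_dataset(cl_context ctx, cl_command_queue queue,
                                 const void *dataset, size_t size)
    {
            cl_int err;
            cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, size, NULL, &err);

            clEnqueueWriteBuffer(queue, buf, CL_TRUE /* blocking */, 0, size,
                                 dataset, 0, NULL, NULL);
            return buf;
    }

With the shared address space being argued for here (OpenCL 2.0 fine-grained
system SVM on top of something like HMM or CAPI), the library would instead
pass the caller's pointer straight to the kernel, for example via
clSetKernelArgSVMPointer(), and let faults and migration do the rest.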


What HMM does is allow the process page table to be shared with the GPU, so
the GPU can transparently access the dataset (no pinning whatsoever). Will
there be page faults? They can happen, and if they do, the assumption is that
you have more threads that do not take a page fault than threads that do, so
the GPU stays saturated (i.e. all its units are fed with something to do)
while the page faults are resolved. For some workloads, yes, you will see the
penalty of the page fault, i.e. you will have a group of threads that
finishes late, but the thing you seem to fail to get is that all the other
GPU threads can make progress and finish even before the page fault is
resolved. It all depends on the application. Moreover, if you have several
applications, the GPU can switch to a different application and make progress
on it too.

Overall, the throughput of the GPU will stay close to its theoretical maximum
if you have enough other threads that can progress, and this is very common.

> 
> > > I think these two things need to be separated. The shift-the-memory-back-
> > > and-forth approach should be separate and if someone wants to use the
> > > thing then it should also work on other platforms like ARM and Intel.
> >
> > What IBM does with there platform is there choice, they can not force ARM
> > or Intel or AMD to do the same. Each of those might have different view
> > on what is their most important target. For instance i highly doubt ARM
> > cares about any of this.
> 
> Well but the kernel code submitted should allow for easy use on other
> platform. I.e. Intel processors should be able to implement the
> "transparent" memory by establishing device mappings to PCI-E space
> and/or transferring data from the GPU and signaling the GPU to establish
> such a mapping.

HMM does that. It only requires the GPU to have a certain set of features,
and the only requirement for the platform is to offer a bus which allows
cache coherent system memory access, such as PCIe.

But IBM here wants to go further and provide a more advanced solution, so
their needs are specific to their platform, and we cannot know whether AMD,
ARM or Intel will want to go down the same road; they do not seem to be
interested. Does that mean we should not support IBM? I think it would be
wrong.

> 
> > Only time critical application care about latency, everyone else cares
> > about throughput, where the applications can runs for days, weeks, months
> > before producing any useable/meaningfull results. Many of which do not
> > care a tiny bit about latency because they can perform independant
> > computation.
> 
> Computationally intensive high performance application care about
> random latency introduced to computational threads because that is
> delaying the data exchange and thus slows everything down. And that is the
> typical case of a GPUI.

You assume that all HPC applications have strong data exchange. I gave you
examples of applications where there is zero data exchange between threads
whatsoever. Those use cases exist and we want to support them too.

Yes, for threads where there is data exchange, page faults stall jobs, but
again we are talking about HPC where several _different_ applications run in
parallel and share resources, so while page faults can block part of
an application, other applications can 

Re: Interacting with coherent memory on external devices

2015-04-24 Thread Paul E. McKenney
On Fri, Apr 24, 2015 at 09:12:07AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> 
> >
> > DAX
> >
> > DAX is a mechanism for providing direct-memory access to
> > high-speed non-volatile (AKA "persistent") memory.  Good
> > introductions to DAX may be found in the following LWN
> > articles:
> 
> DAX is a mechanism to access memory not managed by the kernel and is the
> successor to XIP. It just happens to be needed for persistent memory.
> Fundamentally any driver can provide an MMAPPed interface to allow access
> to a devices memory.

I will take another look, but others in this thread have called out
difficulties with DAX's filesystem nature.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Paul E. McKenney
On Fri, Apr 24, 2015 at 09:30:40AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> 
> > If by "entire industry" you mean everyone who might want to use hardware
> > acceleration, for example, including mechanical computer-aided design,
> > I am skeptical.
> 
> The industry designs GPUs with super fast special ram and accellerators
> with special ram designed to do fast searches and you think you can demand 
> page
> that stuff in from the main processor?

The demand paging is indeed a drawback for the option of using autonuma
to handle the migration.  And again, this is not intended to replace the
careful hand-tuning that is required to get the last drop of performance
out of the system.  It is instead intended to handle the cases where
the application needs substantially more performance than the CPUs alone
can deliver, but where the cost of full-fledge hand tuning cannot be
justified.

You seem to believe that this latter category is the empty set, which
I must confess does greatly surprise me.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Christoph Lameter
On Thu, 23 Apr 2015, Paul E. McKenney wrote:

> If by "entire industry" you mean everyone who might want to use hardware
> acceleration, for example, including mechanical computer-aided design,
> I am skeptical.

The industry designs GPUs with super fast special RAM, and accelerators with
special RAM designed to do fast searches, and you think you can demand page
that stuff in from the main processor?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Christoph Lameter
On Thu, 23 Apr 2015, Jerome Glisse wrote:

> No this not have been solve properly. Today solution is doing an explicit
> copy and again and again when complex data struct are involve (list, tree,
> ...) this is extremly tedious and hard to debug. So today solution often
> restrict themself to easy thing like matrix multiplication. But if you
> provide a unified address space then you make things a lot easiers for a
> lot more usecase. That's a fact, and again OpenCL 2.0 which is an industry
> standard is a proof that unified address space is one of the most important
> feature requested by user of GPGPU. You might not care but the rest of the
> world does.

You could use page tables on the kernel side to transfer data on demand
from the GPU. And you can use a device driver to establish mappings to the
GPUs memory.

There is no copy needed with these approaches.

> > I think these two things need to be separated. The shift-the-memory-back-
> > and-forth approach should be separate and if someone wants to use the
> > thing then it should also work on other platforms like ARM and Intel.
>
> What IBM does with there platform is there choice, they can not force ARM
> or Intel or AMD to do the same. Each of those might have different view
> on what is their most important target. For instance i highly doubt ARM
> cares about any of this.

Well but the kernel code submitted should allow for easy use on other
platform. I.e. Intel processors should be able to implement the
"transparent" memory by establishing device mappings to PCI-E space
and/or transferring data from the GPU and signaling the GPU to establish
such a mapping.

> Only time critical application care about latency, everyone else cares
> about throughput, where the applications can runs for days, weeks, months
> before producing any useable/meaningfull results. Many of which do not
> care a tiny bit about latency because they can perform independant
> computation.

Computationally intensive high performance applications care about random
latency introduced to computational threads, because that delays the data
exchange and thus slows everything down. And that is the typical case for a
GPU.

> Take a company rendering a movie for instance, they want to render the
> millions of frame as fast as possible but each frame can be rendered
> independently, they only share data is the input geometry, textures and
> lighting but this are constant, the rendering of one frame does not
> depend on the rendering of the previous (leaving post processing like
> motion blur aside).

The rendering would be done by the GPU, and this will involve concurrent
threads rapidly accessing data. Performance is certainly impacted if the GPU
cannot use its own RAM, which is designed for the proper feeding of its
processing units. And if you add a paging layer and shuffle stuff around
underneath, then this will be very bad.

At minimum you need to shovel blocks of data into the GPU to allow it to
operate undisturbed for a while on the data and do its job.

> Same apply if you do some data mining. You want might want to find all
> occurence of a specific sequence in a large data pool. You can slice
> your data pool and have an independant job per slice and only aggregate
> the result of each jobs at the end (or as they finish).

This sounds more like a case for a general purpose processor. If it is a
special device then it will typically also have special memory to allow
fast searches.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Paul E. McKenney
On Fri, Apr 24, 2015 at 09:01:47AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> 
> > > As far as I know Jerome is talkeing about HPC loads and high performance
> > > GPU processing. This is the same use case.
> >
> > The difference is sensitivity to latency.  You have latency-sensitive
> > HPC workloads, and Jerome is talking about HPC workloads that need
> > high throughput, but are insensitive to latency.
> 
> Those are correlated.

In some cases, yes.  But are you -really- claiming that -all- HPC
workloads are highly sensitive to latency?  That would be quite a claim!

> > > What you are proposing for High Performacne Computing is reducing the
> > > performance these guys trying to get. You cannot sell someone a Volkswagen
> > > if he needs the Ferrari.
> >
> > You do need the low-latency Ferrari.  But others are best served by a
> > high-throughput freight train.
> 
> The problem is that they want to run 2000 trains at the same time
> and they all must arrive at the destination before they can be send on
> their next trip. 1999 trains will be sitting idle because they need
> to wait of the one train that was delayed. This reduces the troughput.
> People really would like all 2000 trains to arrive on schedule so that
> they get more performance.

Yes, there is some portion of the market that needs both high throughput
and highly predictable latencies.  You are claiming that the -entire- HPC
market has this sort of requirement?  Again, this would be quite a claim!

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Christoph Lameter
On Thu, 23 Apr 2015, Paul E. McKenney wrote:

>
> DAX
>
>   DAX is a mechanism for providing direct-memory access to
>   high-speed non-volatile (AKA "persistent") memory.  Good
>   introductions to DAX may be found in the following LWN
>   articles:

DAX is a mechanism to access memory not managed by the kernel and is the
successor to XIP. It just happens to be needed for persistent memory.
Fundamentally any driver can provide an MMAPPed interface to allow access
to a devices memory.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Christoph Lameter
On Thu, 23 Apr 2015, Jerome Glisse wrote:

> The numa code we have today for CPU case exist because it does make
> a difference but you keep trying to restrict GPU user to a workload
> that is specific. Go talk to people doing physic, biology, data
> mining, CAD most of them do not care about latency. They have not
> hard deadline to meet with their computation. They just want things
> to compute as fast as possible and programming to be as easy as it
> can get.

I started working on the latency issues a long time ago because the
performance of those labs was restricted by OS processing. A noted problem
was SLAB's scanning of its objects every 2 seconds, which caused pretty
significant performance regressions due to the delay of the computation in
individual threads.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Christoph Lameter

On Thu, 23 Apr 2015, Austin S Hemmelgarn wrote:


> Looking at this whole conversation, all I see is two different views on how to
> present the asymmetric multiprocessing arrangements that have become
> commonplace in today's systems to userspace.  Your model favors performance,
> while CAPI favors simplicity for userspace.


Oww. No performance just simplicity? Really?

The simplification of the memory registration for Infiniband etc is
certainly useful and I hope to see contributions on that going into the
kernel.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Christoph Lameter
On Thu, 23 Apr 2015, Paul E. McKenney wrote:

> > As far as I know Jerome is talkeing about HPC loads and high performance
> > GPU processing. This is the same use case.
>
> The difference is sensitivity to latency.  You have latency-sensitive
> HPC workloads, and Jerome is talking about HPC workloads that need
> high throughput, but are insensitive to latency.

Those are correlated.

> > What you are proposing for High Performacne Computing is reducing the
> > performance these guys trying to get. You cannot sell someone a Volkswagen
> > if he needs the Ferrari.
>
> You do need the low-latency Ferrari.  But others are best served by a
> high-throughput freight train.

The problem is that they want to run 2000 trains at the same time, and they
all must arrive at the destination before they can be sent on their next
trip. 1999 trains will be sitting idle because they need to wait for the one
train that was delayed. This reduces the throughput. People really would like
all 2000 trains to arrive on schedule so that they get more performance.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Oded Gabbay



On 04/23/2015 07:22 PM, Jerome Glisse wrote:
> On Thu, Apr 23, 2015 at 09:20:55AM -0500, Christoph Lameter wrote:
> > On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
> >
> > > > There are hooks in glibc where you can replace the memory
> > > > management of the apps if you want that.
> > >
> > > We don't control the app. Let's say we are doing a plugin for libfoo
> > > which accelerates foo using GPUs.
> >
> > There are numerous examples of malloc implementation that can be used for
> > apps without modifying the app.
>
> What about share memory pass btw process ? Or mmaped file ? Or
> a library that is loaded through dlopen and thus had no way to
> control any allocation that happen before it became active ?
>
> > > Now some other app we have no control on uses libfoo. So pointers
> > > already allocated/mapped, possibly a long time ago, will hit libfoo (or
> > > the plugin) and we need GPUs to churn on the data.
> >
> > IF the GPU would need to suspend one of its computation thread to wait on
> > a mapping to be established on demand or so then it looks like the
> > performance of the parallel threads on a GPU will be significantly
> > compromised. You would want to do the transfer explicitly in some fashion
> > that meshes with the concurrent calculation in the GPU. You do not want
> > stalls while GPU number crunching is ongoing.
>
> You do not understand how GPU works. GPU have a pools of thread, and they
> always try to have the pool as big as possible so that when a group of
> thread is waiting for some memory access, there are others thread ready
> to perform some operation. GPU are about hidding memory latency that's
> what they are good at. But they only achieve that when they have more
> thread in flight than compute unit. The whole thread scheduling is done
> by hardware and barely control by the device driver.
>
> So no having the GPU wait for a page fault is not as dramatic as you
> think. If you use GPU as they are intended to use you might even never
> notice the pagefault and reach close to the theoritical throughput of
> the GPU nonetheless.
>
> > > The point I'm making is you are arguing against a usage model which has
> > > been repeatedly asked for by large amounts of customer (after all that's
> > > also why HMM exists).
> >
> > I am still not clear what is the use case for this would be. Who is asking
> > for this?
>
> Everyone but you ? OpenCL 2.0 specific request it and have several level
> of support about transparent address space. The lowest one is the one
> implemented today in which application needs to use a special memory
> allocator.
>
> The most advance one imply integration with the kernel in which any
> memory (mmaped file, share memory or anonymous memory) can be use by
> the GPU and does not need to come from a special allocator.
>
> Everyone in the industry is moving toward the most advance one. That
> is the raison d'être of HMM, to provide this functionality on hw
> platform that do not have things such as CAPI. Which is x86/arm.
>
> So use case is all application using OpenCL or Cuda. So pretty much
> everyone doing GPGPU wants this. I dunno how you can't see that.
> Share address space is so much easier. Believe it or not most coders
> do not have deep knowledge of how things work and if you can remove
> the complexity of different memory allocation and different address
> space from them they will be happy.
>
> Cheers,
> Jérôme

I second what Jerome said, and add that one of the key features of HSA
is the ptr-is-a-ptr scheme, where the applications do *not* need to
handle different address spaces. Instead, all the memory is seen as a
unified address space.

See slide 6 on the following presentation:
http://www.slideshare.net/hsafoundation/hsa-overview

Thanks,
Oded


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Benjamin Herrenschmidt
On Fri, 2015-04-24 at 11:58 -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
>
> > > What exactly is the more advanced version's benefit? What are the features
> > > that the other platforms do not provide?
> >
> > Transparent access to device memory from the CPU, you can map any of the GPU
> > memory inside the CPU and have the whole cache coherency including proper
> > atomic memory operation. CAPI is not some mumbo jumbo marketing name there
> > is real hardware behind it.
>
> Got the hardware here but I am getting pretty sobered given what I heard
> here. The IBM mumbo jumbo marketing comes down to "not much" now.

Ugh ... first nothing we propose precludes using it with explicit memory
management the way you want. So I don't know why you have a problem
here. We are trying to cover a *different* usage model than yours
obviously. But they aren't exclusive.

Secondly, none of what we are discussing here is supported by *existing*
hardware, so whatever you have is not concerned. There is no CAPI based
coprocessor today that provides cachable memory to the system (though
CAPI as a technology supports it), and no GPU doing that either *yet*.
Today CAPI adapters can own host cache lines but don't expose large
swath of cachable local memory.

Finally, this discussion is not even specifically about CAPI or its
performances. It's about the *general* case of a coherent coprocessor
sharing the MMU. Whether it's using CAPI or whatever other technology
that allows that sort of thing that we may or may not be able to mention
at this point.

CAPI is just an example because architecturally it allows that too.

Ben.



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Christoph Lameter
On Fri, 24 Apr 2015, Paul E. McKenney wrote:

  DAX is a mechanism to access memory not managed by the kernel and is the
  successor to XIP. It just happens to be needed for persistent memory.
  Fundamentally any driver can provide an MMAPPed interface to allow access
  to a devices memory.

 I will take another look, but others in this thread have called out
 difficulties with DAX's filesystem nature.

Right, so you do not need the filesystem structure. Simply writing a
device driver that mmaps data as needed from the coprocessor will also do
the trick.
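
Roughly along these lines: an untested sketch of the mmap path such a
driver could expose (the names and the aperture base/size below are made
up for illustration):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>

#define DEV_MEM_BASE	0x100000000ULL		/* hypothetical bus address of device memory */
#define DEV_MEM_SIZE	(1ULL << 30)		/* hypothetical 1GB window */

static int devmem_mmap(struct file *filp, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	/* Sketch only: a real driver would also validate vm_pgoff. */
	if (size > DEV_MEM_SIZE)
		return -EINVAL;

	/* Map the device pages straight into the process; no copies. */
	return remap_pfn_range(vma, vma->vm_start,
			       (DEV_MEM_BASE >> PAGE_SHIFT) + vma->vm_pgoff,
			       size, vma->vm_page_prot);
}

static const struct file_operations devmem_fops = {
	.owner	= THIS_MODULE,
	.mmap	= devmem_mmap,
};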



Re: Interacting with coherent memory on external devices

2015-04-24 Thread Christoph Lameter
On Fri, 24 Apr 2015, Jerome Glisse wrote:

 On Fri, Apr 24, 2015 at 09:29:12AM -0500, Christoph Lameter wrote:
  On Thu, 23 Apr 2015, Jerome Glisse wrote:
 
   No, this has not been solved properly. Today's solution is doing an explicit
   copy again and again, and when complex data structures are involved (lists,
   trees, ...) this is extremely tedious and hard to debug. So today's solutions
   often restrict themselves to easy things like matrix multiplication. But if
   you provide a unified address space then you make things a lot easier for a
   lot more use cases. That's a fact, and again OpenCL 2.0, which is an industry
   standard, is proof that a unified address space is one of the most important
   features requested by users of GPGPU. You might not care but the rest of the
   world does.
 
  You could use page tables on the kernel side to transfer data on demand
  from the GPU. And you can use a device driver to establish mappings to the
  GPU's memory.
 
  There is no copy needed with these approaches.

 So you are telling me to do get_user_page()? If so, are you aware that this
 pins memory? So what happens when the GPU wants to access a range of 32GB of
 memory? Do I pin everything?

Use either a device driver to create PTEs pointing to the data or do
something similar to what DAX does. Pinning can be avoided if you use
mmu_notifiers. Those will give you a callback before the OS removes the
data, and thus you can operate without pinning.
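
Something like this untested sketch is what I mean (the devmirror structure
and the devmirror_shootdown() hook are invented for illustration): the
notifier fires before the kernel unmaps or migrates a range, so the driver
invalidates the device's translations instead of holding page references.

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/* Hypothetical per-process state for the device's mirror of the address space. */
struct devmirror {
	struct mmu_notifier mn;
};

/* Hypothetical driver hook: tell the device to drop its translations
 * for [start, end).
 */
static void devmirror_shootdown(struct devmirror *dm,
				unsigned long start, unsigned long end)
{
	/* ... issue the device-specific TLB/ATC invalidation here ... */
}

/* Called before the kernel tears down mappings in [start, end), so no
 * get_user_pages() pinning is needed.
 */
static void devmirror_invalidate_range_start(struct mmu_notifier *mn,
					     struct mm_struct *mm,
					     unsigned long start,
					     unsigned long end)
{
	devmirror_shootdown(container_of(mn, struct devmirror, mn), start, end);
}

static const struct mmu_notifier_ops devmirror_mn_ops = {
	.invalidate_range_start	= devmirror_invalidate_range_start,
};

/* Attach the mirror to a process's address space. */
static int devmirror_attach(struct devmirror *dm, struct mm_struct *mm)
{
	dm->mn.ops = &devmirror_mn_ops;
	return mmu_notifier_register(&dm->mn, mm);
}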

 Overall, the throughput of the GPU will stay close to its theoretical maximum
 if you have enough other threads that can progress, and this is very common.

GPUs operate on groups of threads, not single ones. If you stall,
then a whole group of them stalls. We are dealing with
accelerators here that are different for performance reasons. They are
not to be treated like regular processors, nor does their memory
operate like host memory.

 But IBM here wants to go further and provide a more advanced solution,
 so their needs are specific to their platform, and we cannot know if AMD,
 ARM or Intel will want to go down the same road; they do not seem to be
 interested. Does that mean we should not support IBM? I think it would be
 wrong.

What exactly is the more advanced version's benefit? What are the features
that the other platforms do not provide?

  This sounds more like a case for a general purpose processor. If it is a
  special device then it will typically also have special memory to allow
  fast searches.

 No, this kind of thing can be fast on a GPU. With a GPU you easily have 500x
 more cores than CPU cores, so you can slice the dataset even more and have
 each of the GPU cores perform the search. Note that I am not only thinking
 of a stupid memcmp here; it can be something more complex, like searching for
 a pattern that allows variation and that requires a whole program to decide
 if a chunk falls under the variation rules or not.

Then you have the problem of fast memory access and you are proposing to
complicate that access path on the GPU.


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Rik van Riel
On 04/24/2015 11:49 AM, Christoph Lameter wrote:
 On Fri, 24 Apr 2015, Paul E. McKenney wrote:
 
 can deliver, but where the cost of full-fledged hand tuning cannot be
 justified.

 You seem to believe that this latter category is the empty set, which
 I must confess does greatly surprise me.
 
 If compromises are already being made, then why would you want to
 modify the kernel for this? Some user space coding and device drivers
 should be sufficient.

You assume only one program at a time would get to use the GPU
for accelerated computations, and the GPU would get dedicated
to that program.

That will not be the case when you have libraries using the GPU
for computations. There could be dozens of programs in the system
using that library, with no knowledge of how many GPU resources
are used by the other programs.

There is a very clear cut case for having the OS manage the
GPU resources transparently, just like it does for all the
other resources in the system.

-- 
All rights reversed


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Jerome Glisse
On Fri, Apr 24, 2015 at 09:30:40AM -0500, Christoph Lameter wrote:
 On Thu, 23 Apr 2015, Paul E. McKenney wrote:
 
  If by "entire industry" you mean everyone who might want to use hardware
  acceleration, for example, including mechanical computer-aided design,
  I am skeptical.
 
 The industry designs GPUs with super fast special RAM and accelerators
 with special RAM designed to do fast searches, and you think you can
 demand page that stuff in from the main processor?
 

Why do you think AMD and NVidia are adding page fault support to their GPUs
in the first place? They are not doing this on a whim; they have carefully
thought about it.

Are you saying you know better than the two biggest GPU designers on the planet?
And who do you think is pushing for such a thing in the kernel? Do you think
we are working on this on a whim? Because we woke up one day and thought that
it would be cool and that it should be done this way?


Yes, if all your GPU does is page fault it will be disastrous, but is this the
usual thing we see on the CPU? No! Are people complaining about the numerous
page faults that happen over a day? No, the vast majority of users are
completely oblivious to page faults. This is how it works on the CPU, and yes,
it can work for the GPU too. What happens on the CPU? Well, the CPU can switch
to work on a different thread or a different application altogether. The same
thing will happen on the GPU. If you have enough jobs, your GPU will be busy
and you will never worry about page faults, because overall your GPU will
deliver the same kind of throughput as if there were no page faults. The cost
can very well be buried in the overall noise if the ratio of available runnable
threads to page-faulting threads is high enough. That is most of the time the
case for the CPU, so why would the same assumption not work on the GPU?

Note that I am not dismissing the low-latency folks; I know they exist, I know
they hate page faults, and in no way will what we propose make things worse for
them. They will be able to keep the same kind of control they cherish, but
this does not mean you should go on a holy crusade to pretend that other
people's workloads do not exist. They do exist. Page faults are not evil, and
they have proved useful to the whole computer industry for the CPU.


To be sure you are not misinterpreting what we propose: in no way do we say
we are going to migrate things on page fault for everyone. We are saying that,
first, the device driver decides where things need to be (system memory or
local memory); the device driver can get hints/requests from userspace for this
(as they do today). So no change whatsoever here; people that hand-tune things
will keep being able to do so.

Now we want to add the case where the device driver does not get any kind of
directive or hint from userspace. So the autonuma-like approach is simply to
collect information from the GPU on what is accessed often, and then migrate
it transparently (yes, this can happen without interrupting the GPU). So you
are migrating from memory that has 16GB/s or 32GB/s of bandwidth to device
memory that has 500GB/s.
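
To make the intent concrete, here is a rough userspace illustration, assuming
the device memory shows up as a CPU-less NUMA node (node 2 below is made up),
of what migrating one hot page to the device boils down to, using the existing
move_pages(2) machinery; the proposal is that the kernel would do the
equivalent transparently:

#include <numaif.h>		/* move_pages(); link with -lnuma */
#include <stdio.h>

#define DEVICE_NODE	2	/* hypothetical CPU-less node backed by device memory */

/* Migrate the page containing addr to the device's node.  The kernel copies
 * the data and updates the page tables; the application keeps using the
 * same virtual address.
 */
static int migrate_to_device(void *addr)
{
	void *pages[1] = { addr };
	int nodes[1] = { DEVICE_NODE };
	int status[1];

	if (move_pages(0 /* self */, 1, pages, nodes, status, MPOL_MF_MOVE))
		return -1;

	printf("page %p now on node %d\n", addr, status[0]);
	return 0;
}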

This is a valid use case; there are many people out there who do not want
to learn about hand-tuning their applications for the GPU but who could
nonetheless benefit from it.

Cheers,
Jérôme


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Jerome Glisse
On Fri, Apr 24, 2015 at 09:29:12AM -0500, Christoph Lameter wrote:
 On Thu, 23 Apr 2015, Jerome Glisse wrote:
 
  No, this has not been solved properly. Today's solution is doing an explicit
  copy again and again, and when complex data structures are involved (lists,
  trees, ...) this is extremely tedious and hard to debug. So today's solutions
  often restrict themselves to easy things like matrix multiplication. But if
  you provide a unified address space then you make things a lot easier for a
  lot more use cases. That's a fact, and again OpenCL 2.0, which is an industry
  standard, is proof that a unified address space is one of the most important
  features requested by users of GPGPU. You might not care but the rest of the
  world does.
 
 You could use page tables on the kernel side to transfer data on demand
 from the GPU. And you can use a device driver to establish mappings to the
 GPU's memory.
 
 There is no copy needed with these approaches.

So you are telling me to do get_user_page()? If so, are you aware that this
pins memory? So what happens when the GPU wants to access a range of 32GB of
memory? Do I pin everything?

I am not talking only about transfers from the GPU to system memory; I am
talking about applications that do:
   dataset = mmap(dataset, 32 << 30);
   // ...
   dlopen(superlibrary);
   superlibrary.dosomething(dataset);

So the application here has no clue about the GPU, and we do not want to change
that. Yes, this is a valid use case, and countless users ask for it.

How can the superlibrary give the GPU access to the dataset? Does it
have to call get_user_page() on every single page, effectively pinning memory?
Should it allocate GPU memory through a special API and memcpy?


What HMM does is allow the process page table to be shared with the GPU, so the
GPU can transparently access the dataset (no pinning whatsoever). Will there be
page faults? They can happen, and if they do, the assumption is that you have
more threads that do not take a page fault than threads that do, so the GPU
stays saturated (i.e. all its units are fed with something to do) while the
page faults are resolved. For some workloads, yes, you will see the penalty of
the page fault, i.e. you will have a group of threads that finish late, but the
thing you seem to fail to get is that all the other GPU threads can make
progress and finish even before the page fault is resolved. It all depends on
the application. Moreover, if you have several applications, then the GPU can
switch to a different application and make progress on it too.

Overall, the throughput of the GPU will stay close to its theoretical maximum
if you have enough other threads that can progress, and this is very common.

 
   I think these two things need to be separated. The shift-the-memory-back-
   and-forth approach should be separate and if someone wants to use the
   thing then it should also work on other platforms like ARM and Intel.
 
  What IBM does with their platform is their choice; they cannot force ARM
  or Intel or AMD to do the same. Each of those might have a different view
  on what their most important target is. For instance, I highly doubt ARM
  cares about any of this.
 
 Well, but the kernel code submitted should allow for easy use on other
 platforms. I.e. Intel processors should be able to implement the
 transparent memory by establishing device mappings to PCI-E space
 and/or transferring data from the GPU and signaling the GPU to establish
 such a mapping.

HMM does that; it only requires the GPU to have a certain set of features,
and the only requirement for the platform is to offer a bus which allows
cache-coherent system memory access, such as PCIe.

But IBM here wants to go further and provide a more advanced solution,
so their needs are specific to their platform, and we cannot know if AMD,
ARM or Intel will want to go down the same road; they do not seem to be
interested. Does that mean we should not support IBM? I think it would be
wrong.

 
  Only time-critical applications care about latency; everyone else cares
  about throughput, where the applications can run for days, weeks, months
  before producing any usable/meaningful results. Many of them do not
  care a tiny bit about latency because they can perform independent
  computations.
 
 Computationally intensive high performance applications care about
 random latency introduced to computational threads because it
 delays the data exchange and thus slows everything down. And that is the
 typical case for a GPU.

You assume that all HPC applications have strong data exchange; I gave
you examples of applications where there is zero data exchange between threads
whatsoever. Those use cases exist and we want to support them too.

Yes, for threads where there is data exchange, page faults stall jobs, but
again we are talking about HPC where several _different_ applications
run in parallel and share resources, so while a page fault can block part of
an application, other applications can still make progress as the GPU can
switch to work on them.

Moreover the 

Re: Interacting with coherent memory on external devices

2015-04-24 Thread Jerome Glisse
On Fri, Apr 24, 2015 at 07:57:38AM -0700, Paul E. McKenney wrote:
 On Fri, Apr 24, 2015 at 09:12:07AM -0500, Christoph Lameter wrote:
  On Thu, 23 Apr 2015, Paul E. McKenney wrote:
  
  
   DAX
  
 DAX is a mechanism for providing direct-memory access to
 high-speed non-volatile (AKA persistent) memory.  Good
 introductions to DAX may be found in the following LWN
 articles:
  
  DAX is a mechanism to access memory not managed by the kernel and is the
  successor to XIP. It just happens to be needed for persistent memory.
  Fundamentally, any driver can provide an mmapped interface to allow access
  to a device's memory.
 
 I will take another look, but others in this thread have called out
 difficulties with DAX's filesystem nature.

Do not waste your time on that; this is not what we want. Christoph here
is more than stubborn and fails to see the world.

Cheers,
Jérôme


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Paul E. McKenney
On Fri, Apr 24, 2015 at 09:12:07AM -0500, Christoph Lameter wrote:
 On Thu, 23 Apr 2015, Paul E. McKenney wrote:
 
 
  DAX
 
  DAX is a mechanism for providing direct-memory access to
  high-speed non-volatile (AKA persistent) memory.  Good
  introductions to DAX may be found in the following LWN
  articles:
 
 DAX is a mechanism to access memory not managed by the kernel and is the
 successor to XIP. It just happens to be needed for persistent memory.
 Fundamentally, any driver can provide an mmapped interface to allow access
 to a device's memory.

I will take another look, but others in this thread have called out
difficulties with DAX's filesystem nature.

Thanx, Paul



Re: Interacting with coherent memory on external devices

2015-04-24 Thread Rik van Riel
On 04/24/2015 10:01 AM, Christoph Lameter wrote:
 On Thu, 23 Apr 2015, Paul E. McKenney wrote:
 
 As far as I know, Jerome is talking about HPC loads and high performance
 GPU processing. This is the same use case.

 The difference is sensitivity to latency.  You have latency-sensitive
 HPC workloads, and Jerome is talking about HPC workloads that need
 high throughput, but are insensitive to latency.
 
 Those are correlated.
 
 What you are proposing for High Performance Computing is reducing the
 performance these guys are trying to get. You cannot sell someone a Volkswagen
 if he needs the Ferrari.

 You do need the low-latency Ferrari.  But others are best served by a
 high-throughput freight train.
 
 The problem is that they want to run 2000 trains at the same time
 and they all must arrive at the destination before they can be sent on
 their next trip. 1999 trains will be sitting idle because they need
 to wait for the one train that was delayed. This reduces the throughput.
 People really would like all 2000 trains to arrive on schedule so that
 they get more performance.

So you run 4000 or even 6000 trains, and have some subset of them
run at full steam, while others are waiting on memory accesses.

In reality the overcommit factor is likely much smaller, because
the GPU threads run and block on memory in smaller, more manageable
numbers, say a few dozen at a time.

-- 
All rights reversed


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Christoph Lameter
On Fri, 24 Apr 2015, Paul E. McKenney wrote:

 can deliver, but where the cost of full-fledged hand tuning cannot be
 justified.

 You seem to believe that this latter category is the empty set, which
 I must confess does greatly surprise me.

If compromises are already being made, then why would you want to
modify the kernel for this? Some user space coding and device drivers
should be sufficient.


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Rik van Riel
On 04/21/2015 05:44 PM, Paul E. McKenney wrote:

 AUTONUMA
 
   The Linux kernel's autonuma facility supports migrating both
   memory and processes to promote NUMA memory locality.  It was
   accepted into 3.13 and is available in RHEL 7.0 and SLES 12.
   It is enabled by the Kconfig variable CONFIG_NUMA_BALANCING.
 
   This approach uses a kernel thread knuma_scand that periodically
   marks pages inaccessible.  The page-fault handler notes any
   mismatches between the NUMA node that the process is running on
   and the NUMA node on which the page resides.

Minor nit: marking pages inaccessible is done from task_work nowadays;
there is no longer a kernel thread.

   The result would be that the kernel would allocate only migratable
   pages within the CCAD device's memory, and even then only if
   memory was otherwise exhausted.

Does it make sense to allocate the device's page tables in memory
belonging to the device?

Is this a necessary thing with some devices? Jerome's HMM comes
to mind...

-- 
All rights reversed


Re: Interacting with coherent memory on external devices

2015-04-24 Thread Benjamin Herrenschmidt
On Fri, 2015-04-24 at 22:32 -0400, Rik van Riel wrote:
  The result would be that the kernel would allocate only migratable
  pages within the CCAD device's memory, and even then only if
  memory was otherwise exhausted.
 
 Does it make sense to allocate the device's page tables in memory
 belonging to the device?
 
 Is this a necessary thing with some devices? Jerome's HMM comes
 to mind...

In our case, the device's MMU shares the host page tables (which is why
we can't use HMM, i.e. we can't have a page with different permissions on
the CPU vs. the device, which HMM allows).

However, the device has a pretty fast path to system memory, so the best
thing we can do is pin the workload to the same chip the device is
connected to, so that those page tables aren't too far away.

Cheers,
Ben.



