Re: Enabling peer to peer device transactions for PCIe devices

2016-11-22 Thread Sagalovitch, Serguei
> I don't think we should be using numa distance to reverse engineer a
> certain allocation behavior.  The latency data should be truthful, but
> you're right we'll need a mechanism to keep general purpose
> allocations out of that range by default. 

Just to clarify: are you proposing to use the NUMA API for
such (VRAM) allocations?




___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-22 Thread Daniel Vetter
On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
 wrote:
>
> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>
>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams 
>> wrote:
>>>
>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>>>  wrote:

 I personally like the "device-DAX" idea, but my concerns are:

 -  How well will it co-exist with the DRM infrastructure /
 implementations, in the parts dealing with CPU pointers?
>>>
>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>>> not migratable. To be honest I do not know how well that co-exists
>>> with drm infrastructure.
>>>
 -  How well will we be able to handle the case when we need to
 "move"/"evict" memory/data to a new location, so that the CPU
 pointer points to the new physical address
  (which may not be in PCI device memory at all)?
>>>
>>> So, device-DAX deliberately avoids support for in-kernel migration or
>>> overcommit. Those cases are left to the core mm or drm. The device-dax
>>> interface is for cases where all that is needed is a direct-mapping to
>>> a statically-allocated physical-address range, be it persistent memory
>>> or some other special reserved memory range.
>>
>> For some of the fancy use-cases (e.g. to be comparable to what HMM can
>> pull off) I think we want all the magic in core mm, i.e. migration and
>> overcommit. At least that seems to be the very strong drive in all
>> general-purpose gpu abstractions and implementations, where memory is
>> allocated with malloc, and then mapped/moved into vram/gpu address
>> space through some magic,
>
> It is possible that it goes the other way around: memory is allocated
> and should be kept in vram for performance reasons, but due to a
> possible overcommit we may need, at least temporarily, to "move" such
> an allocation to system memory.

With migration I meant migrating both ways of course. And with stuff
like numactl we can also influence where exactly the malloc'ed memory
is allocated originally, at least if we'd expose the vram range as a
very special numa node that happens to be far away and not hold any
cpu cores.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-22 Thread Serguei Sagalovitch



On 2016-11-22 03:10 PM, Daniel Vetter wrote:
> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams  wrote:
>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
>>  wrote:
>>> I personally like the "device-DAX" idea, but my concerns are:
>>>
>>> -  How well will it co-exist with the DRM infrastructure /
>>> implementations, in the parts dealing with CPU pointers?
>>
>> Inside the kernel a device-DAX range is "just memory" in the sense
>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>> not migratable. To be honest I do not know how well that co-exists
>> with drm infrastructure.
>>
>>> -  How well will we be able to handle the case when we need to
>>> "move"/"evict" memory/data to a new location, so that the CPU
>>> pointer points to the new physical address
>>> (which may not be in PCI device memory at all)?
>>
>> So, device-DAX deliberately avoids support for in-kernel migration or
>> overcommit. Those cases are left to the core mm or drm. The device-dax
>> interface is for cases where all that is needed is a direct-mapping to
>> a statically-allocated physical-address range, be it persistent memory
>> or some other special reserved memory range.
>
> For some of the fancy use-cases (e.g. to be comparable to what HMM can
> pull off) I think we want all the magic in core mm, i.e. migration and
> overcommit. At least that seems to be the very strong drive in all
> general-purpose gpu abstractions and implementations, where memory is
> allocated with malloc, and then mapped/moved into vram/gpu address
> space through some magic,

It is possible that it goes the other way around: memory is allocated
and should be kept in vram for performance reasons, but due to a
possible overcommit we may need, at least temporarily, to "move" such
an allocation to system memory.

> but still visible on both the cpu and gpu
> side in some form. Special device to allocate memory, and not being
> able to migrate stuff around sound like misfeatures from that pov.
> -Daniel




Re: [PATCH] x86: fix kaslr and memmap collision

2016-11-22 Thread Dan Williams
On Tue, Nov 22, 2016 at 10:54 AM, Kees Cook  wrote:
> On Tue, Nov 22, 2016 at 9:26 AM, Dan Williams  
> wrote:
>> [ replying for Dave since he's offline today and tomorrow ]
>>
>> On Tue, Nov 22, 2016 at 12:47 AM, Ingo Molnar  wrote:
>>>
>>> * Dave Jiang  wrote:
>>>
 CONFIG_RANDOMIZE_BASE relocates the kernel to a random base address.
 However it does not take into account the memmap= parameter passed in from
 the kernel commandline.
>>>
>>> memmap= parameters are often used as a list.
>>>
 [...] This results in the kernel sometimes being put in the middle of the 
 user
 memmap. [...]
>>>
>>> What does this mean? If memmap= is used to re-define the memory map then the
>>> kernel getting in the middle of a RAM area is what we want, isn't it? What 
>>> we
>>> don't want is for the kernel to get into reserved areas, right?
>>
>> Right, this is about teaching kaslr to not land the kernel in newly
>> defined reserved regions that were not marked reserved in the initial
>> e820 map from platform firmware.
>>
 [...] Check has been added in the kaslr in order to avoid the region 
 marked by
 memmap.
>>>
>>> What does this mean?
>>
>> Is this clearer? "Update the set of 'mem_avoid' entries to exclude
>> 'memmap=' defined reserved regions from the set of valid address ranges
>> to land the kernel image."
>>
>>>
 Signed-off-by: Dave Jiang 
 ---
  arch/x86/boot/boot.h |2 ++
  arch/x86/boot/compressed/kaslr.c |   45 
 ++
  arch/x86/boot/string.c   |   25 +
  3 files changed, 72 insertions(+)

 diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
 index e5612f3..0d5fe5b 100644
 --- a/arch/x86/boot/boot.h
 +++ b/arch/x86/boot/boot.h
 @@ -332,6 +332,8 @@ int strncmp(const char *cs, const char *ct, size_t 
 count);
  size_t strnlen(const char *s, size_t maxlen);
  unsigned int atou(const char *s);
  unsigned long long simple_strtoull(const char *cp, char **endp, unsigned 
 int base);
 +unsigned long simple_strtoul(const char *cp, char **endp, unsigned int 
 base);
 +long simple_strtol(const char *cp, char **endp, unsigned int base);
  size_t strlen(const char *s);

  /* tty.c */
 diff --git a/arch/x86/boot/compressed/kaslr.c 
 b/arch/x86/boot/compressed/kaslr.c
 index a66854d..6fb8f1ec 100644
 --- a/arch/x86/boot/compressed/kaslr.c
 +++ b/arch/x86/boot/compressed/kaslr.c
 @@ -11,6 +11,7 @@
   */
  #include "misc.h"
  #include "error.h"
 +#include "../boot.h"

  #include 
  #include 
 @@ -61,6 +62,7 @@ enum mem_avoid_index {
   MEM_AVOID_INITRD,
   MEM_AVOID_CMDLINE,
   MEM_AVOID_BOOTPARAMS,
 + MEM_AVOID_MEMMAP,
   MEM_AVOID_MAX,
  };

 @@ -77,6 +79,37 @@ static bool mem_overlaps(struct mem_vector *one, struct 
 mem_vector *two)
   return true;
  }

 +#include "../../../../lib/cmdline.c"
 +
 +static int
 +parse_memmap(char *p, unsigned long long *start, unsigned long long *size)
 +{
 + char *oldp;
 +
 + if (!p)
 + return -EINVAL;
 +
 + /* we don't care about this option here */
 + if (!strncmp(p, "exactmap", 8))
 + return -EINVAL;
 +
 + oldp = p;
 + *size = memparse(p, &p);
 + if (p == oldp)
 + return -EINVAL;
 +
 + switch (*p) {
 + case '@':
 + case '#':
 + case '$':
 + case '!':
 + *start = memparse(p+1, &p);
 + return 0;
 + }
 +
 + return -EINVAL;
 +}
 +
  /*
   * In theory, KASLR can put the kernel anywhere in the range of [16M, 
 64T).
   * The mem_avoid array is used to store the ranges that need to be avoided
 @@ -158,6 +191,8 @@ static void mem_avoid_init(unsigned long input, 
 unsigned long input_size,
   u64 initrd_start, initrd_size;
   u64 cmd_line, cmd_line_size;
   char *ptr;
 + char arg[38];
>>>
>>> Where does the magic '38' come from?
>>>
 + unsigned long long memmap_start, memmap_size;

   /*
* Avoid the region that is unsafe to overlap during
 @@ -195,6 +230,16 @@ static void mem_avoid_init(unsigned long input, 
 unsigned long input_size,
   add_identity_map(mem_avoid[MEM_AVOID_BOOTPARAMS].start,
mem_avoid[MEM_AVOID_BOOTPARAMS].size);

 + /* see if we have any memmap areas */
 + if (cmdline_find_option("memmap", arg, sizeof(arg)) > 0) {
 + int rc = parse_memmap(arg, &memmap_start, &memmap_size);
 +
 + if (!rc) {
 + 

Re: Enabling peer to peer device transactions for PCIe devices

2016-11-22 Thread Dan Williams
On Mon, Nov 21, 2016 at 12:36 PM, Deucher, Alexander
 wrote:
> This is certainly not the first time this has been brought up, but I'd like 
> to try and get some consensus on the best way to move this forward.  Allowing 
> devices to talk directly improves performance and reduces latency by avoiding 
> the use of staging buffers in system memory.  Also in cases where both 
> devices are behind a switch, it avoids the CPU entirely.  Most current APIs 
> (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based.  
> Ideally we'd be able to take a CPU virtual address and be able to get to a 
> physical address taking into account IOMMUs, etc.  Having struct pages for 
> the memory would allow it to work more generally and wouldn't require as much 
> explicit support in drivers that wanted to use it.
>
> Some use cases:
> 1. Storage devices streaming directly to GPU device memory
> 2. GPU device memory to GPU device memory streaming
> 3. DVB/V4L/SDI devices streaming directly to GPU device memory
> 4. DVB/V4L/SDI devices streaming directly to storage devices
>
> Here is a relatively simple example of how this could work for testing.  This 
> is obviously not a complete solution.
> - Device memory will be registered with the Linux memory sub-system by creating 
> corresponding struct page structures for device memory
> - get_user_pages_fast() will  return corresponding struct pages when CPU 
> address points to the device memory
> - put_page() will deal with struct pages for device memory
>
[..]
> 4. iopmem
> iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)

The change I suggest for this particular approach is to switch to
"device-DAX" [1]. I.e. a character device for establishing DAX
mappings rather than a block device plus a DAX filesystem. The pro of
this approach is standard user pointers and struct pages rather than a
new construct. The con is that this is done via an interface separate
from the existing gpu and storage device. For example it would require
a /dev/dax instance alongside a /dev/nvme interface, but I don't see
that as a significant blocking concern.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-October/007496.html
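To illustrate the user-space shape of the /dev/dax proposal: a device-dax mapping is consumed with a plain open() + mmap(MAP_SHARED) of the character device, and the range is then accessed with loads and stores rather than read()/write(). Since a real /dev/dax0.0 instance needs a reserved memory range, this runnable sketch substitutes a temporary file; only the call pattern is the point.

```c
/* Sketch: the mmap-based access pattern a /dev/daxX.Y consumer would
 * use. The mkstemp file below stands in for the dax character device,
 * which on real hardware would be opened as e.g. "/dev/dax0.0". */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 2 << 20;	/* 2 MiB, a typical dax alignment */

	/* stand-in for open("/dev/dax0.0", O_RDWR) */
	char path[] = "/tmp/daxsimXXXXXX";
	int fd = mkstemp(path);
	if (fd < 0 || ftruncate(fd, len) < 0) {
		perror("setup");
		return 1;
	}

	/* device-dax only supports MAP_SHARED mappings of the statically
	 * reserved range; all access goes through the pointer */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	strcpy(p, "direct mapping");
	printf("%s\n", p);		/* prints: direct mapping */

	munmap(p, len);
	close(fd);
	unlink(path);
	return 0;
}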


Re: [PATCH] x86: fix kaslr and memmap collision

2016-11-22 Thread Ingo Molnar

* Dave Jiang  wrote:

> CONFIG_RANDOMIZE_BASE relocates the kernel to a random base address.
> However it does not take into account the memmap= parameter passed in from
> the kernel commandline.

memmap= parameters are often used as a list.

> [...] This results in the kernel sometimes being put in the middle of the 
> user 
> memmap. [...]

What does this mean? If memmap= is used to re-define the memory map then the 
kernel getting in the middle of a RAM area is what we want, isn't it? What we 
don't want is for the kernel to get into reserved areas, right?

> [...] Check has been added in the kaslr in order to avoid the region marked 
> by 
> memmap.

What does this mean?

> Signed-off-by: Dave Jiang 
> ---
>  arch/x86/boot/boot.h |2 ++
>  arch/x86/boot/compressed/kaslr.c |   45 
> ++
>  arch/x86/boot/string.c   |   25 +
>  3 files changed, 72 insertions(+)
> 
> diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
> index e5612f3..0d5fe5b 100644
> --- a/arch/x86/boot/boot.h
> +++ b/arch/x86/boot/boot.h
> @@ -332,6 +332,8 @@ int strncmp(const char *cs, const char *ct, size_t count);
>  size_t strnlen(const char *s, size_t maxlen);
>  unsigned int atou(const char *s);
>  unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int 
> base);
> +unsigned long simple_strtoul(const char *cp, char **endp, unsigned int base);
> +long simple_strtol(const char *cp, char **endp, unsigned int base);
>  size_t strlen(const char *s);
>  
>  /* tty.c */
> diff --git a/arch/x86/boot/compressed/kaslr.c 
> b/arch/x86/boot/compressed/kaslr.c
> index a66854d..6fb8f1ec 100644
> --- a/arch/x86/boot/compressed/kaslr.c
> +++ b/arch/x86/boot/compressed/kaslr.c
> @@ -11,6 +11,7 @@
>   */
>  #include "misc.h"
>  #include "error.h"
> +#include "../boot.h"
>  
>  #include 
>  #include 
> @@ -61,6 +62,7 @@ enum mem_avoid_index {
>   MEM_AVOID_INITRD,
>   MEM_AVOID_CMDLINE,
>   MEM_AVOID_BOOTPARAMS,
> + MEM_AVOID_MEMMAP,
>   MEM_AVOID_MAX,
>  };
>  
> @@ -77,6 +79,37 @@ static bool mem_overlaps(struct mem_vector *one, struct 
> mem_vector *two)
>   return true;
>  }
>  
> +#include "../../../../lib/cmdline.c"
> +
> +static int
> +parse_memmap(char *p, unsigned long long *start, unsigned long long *size)
> +{
> + char *oldp;
> +
> + if (!p)
> + return -EINVAL;
> +
> + /* we don't care about this option here */
> + if (!strncmp(p, "exactmap", 8))
> + return -EINVAL;
> +
> + oldp = p;
> + *size = memparse(p, &p);
> + if (p == oldp)
> + return -EINVAL;
> +
> + switch (*p) {
> + case '@':
> + case '#':
> + case '$':
> + case '!':
> + *start = memparse(p+1, &p);
> + return 0;
> + }
> +
> + return -EINVAL;
> +}
> +
>  /*
>   * In theory, KASLR can put the kernel anywhere in the range of [16M, 64T).
>   * The mem_avoid array is used to store the ranges that need to be avoided
> @@ -158,6 +191,8 @@ static void mem_avoid_init(unsigned long input, unsigned 
> long input_size,
>   u64 initrd_start, initrd_size;
>   u64 cmd_line, cmd_line_size;
>   char *ptr;
> + char arg[38];

Where does the magic '38' come from?

> + unsigned long long memmap_start, memmap_size;
>  
>   /*
>* Avoid the region that is unsafe to overlap during
> @@ -195,6 +230,16 @@ static void mem_avoid_init(unsigned long input, unsigned 
> long input_size,
>   add_identity_map(mem_avoid[MEM_AVOID_BOOTPARAMS].start,
>mem_avoid[MEM_AVOID_BOOTPARAMS].size);
>  
> + /* see if we have any memmap areas */
> + if (cmdline_find_option("memmap", arg, sizeof(arg)) > 0) {
> + int rc = parse_memmap(arg, &memmap_start, &memmap_size);
> +
> + if (!rc) {
> + mem_avoid[MEM_AVOID_MEMMAP].start = memmap_start;
> + mem_avoid[MEM_AVOID_MEMMAP].size = memmap_size;
> + }
> + }
> +

This only handles a single (first) memmap argument, is that sufficient?

Thanks,

Ingo