Re: Enabling peer to peer device transactions for PCIe devices
> I don't think we should be using numa distance to reverse engineer a
> certain allocation behavior. The latency data should be truthful, but
> you're right we'll need a mechanism to keep general purpose
> allocations out of that range by default.

Just to clarify: are you proposing to use the NUMA API for such (VRAM)
allocations?

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: Enabling peer to peer device transactions for PCIe devices
On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch wrote:
>
> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>
>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams wrote:
>>>
>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch wrote:
>>>> I personally like the "device-DAX" idea, but my concerns are:
>>>> - How well will it co-exist with the DRM infrastructure /
>>>>   implementations in the part dealing with CPU pointers?
>>>
>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>>> not migratable. To be honest I do not know how well that co-exists
>>> with drm infrastructure.
>>>
>>>> - How well will we be able to handle the case when we need to
>>>>   "move"/"evict" memory/data to a new location, so the CPU pointer
>>>>   should point to the new physical location/address (which may not be
>>>>   in PCI device memory at all)?
>>>
>>> So, device-DAX deliberately avoids support for in-kernel migration or
>>> overcommit. Those cases are left to the core mm or drm. The device-dax
>>> interface is for cases where all that is needed is a direct-mapping to
>>> a statically-allocated physical-address range, be it persistent memory
>>> or some other special reserved memory range.
>>
>> For some of the fancy use-cases (e.g. to be comparable to what HMM can
>> pull off) I think we want all the magic in core mm, i.e. migration and
>> overcommit. At least that seems to be the very strong drive in all
>> general-purpose gpu abstractions and implementations, where memory is
>> allocated with malloc, and then mapped/moved into vram/gpu address
>> space through some magic,
>
> It is possible that it is the other way around: memory is requested to be
> allocated and should be kept in vram for performance reasons, but due to a
> possible overcommit case we need, at least temporarily, to "move" such an
> allocation to system memory.

With migration I meant migrating both ways, of course.
And with stuff like numactl we can also influence where exactly the
malloc'ed memory is allocated originally, at least if we'd expose the
vram range as a very special numa node that happens to be far away and
not hold any cpu cores.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-22 03:10 PM, Daniel Vetter wrote:
> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams wrote:
>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch wrote:
>>> I personally like the "device-DAX" idea, but my concerns are:
>>> - How well will it co-exist with the DRM infrastructure /
>>>   implementations in the part dealing with CPU pointers?
>> Inside the kernel a device-DAX range is "just memory" in the sense
>> that you can perform pfn_to_page() on it and issue I/O, but the vma is
>> not migratable. To be honest I do not know how well that co-exists
>> with drm infrastructure.
>>> - How well will we be able to handle the case when we need to
>>>   "move"/"evict" memory/data to a new location, so the CPU pointer
>>>   should point to the new physical location/address (which may not be
>>>   in PCI device memory at all)?
>> So, device-DAX deliberately avoids support for in-kernel migration or
>> overcommit. Those cases are left to the core mm or drm. The device-dax
>> interface is for cases where all that is needed is a direct-mapping to
>> a statically-allocated physical-address range, be it persistent memory
>> or some other special reserved memory range.
> For some of the fancy use-cases (e.g. to be comparable to what HMM can
> pull off) I think we want all the magic in core mm, i.e. migration and
> overcommit. At least that seems to be the very strong drive in all
> general-purpose gpu abstractions and implementations, where memory is
> allocated with malloc, and then mapped/moved into vram/gpu address
> space through some magic,

It is possible that it is the other way around: memory is requested to be
allocated and should be kept in vram for performance reasons, but due to a
possible overcommit case we need, at least temporarily, to "move" such an
allocation to system memory.

> but still visible on both the cpu and gpu side in some form. Special device
> to allocate memory, and not being able to migrate stuff around sound like
> misfeatures from that pov.
> -Daniel
Re: [PATCH] x86: fix kaslr and memmap collision
On Tue, Nov 22, 2016 at 10:54 AM, Kees Cook wrote:
> On Tue, Nov 22, 2016 at 9:26 AM, Dan Williams wrote:
>> [ replying for Dave since he's offline today and tomorrow ]
>>
>> On Tue, Nov 22, 2016 at 12:47 AM, Ingo Molnar wrote:
>>>
>>> * Dave Jiang wrote:
>>>
>>>> CONFIG_RANDOMIZE_BASE relocates the kernel to a random base address.
>>>> However it does not take into account the memmap= parameter passed in
>>>> from the kernel commandline.
>>>
>>> memmap= parameters are often used as a list.
>>>
>>>> [...] This results in the kernel sometimes being put in the middle of
>>>> the user memmap. [...]
>>>
>>> What does this mean? If memmap= is used to re-define the memory map then
>>> the kernel getting in the middle of a RAM area is what we want, isn't it?
>>> What we don't want is for the kernel to get into reserved areas, right?
>>
>> Right, this is about teaching kaslr to not land the kernel in newly
>> defined reserved regions that were not marked reserved in the initial
>> e820 map from platform firmware.
>>
>>>> [...] Check has been added in the kaslr in order to avoid the region
>>>> marked by memmap.
>>>
>>> What does this mean?
>>
>> Is this clearer? "Update the set of 'mem_avoid' entries to exclude
>> 'memmap=' defined reserved regions from the set of valid address ranges
>> in which to land the kernel image."
>>
>>>> Signed-off-by: Dave Jiang
>>>> ---
>>>>  arch/x86/boot/boot.h             |  2 ++
>>>>  arch/x86/boot/compressed/kaslr.c | 45 ++
>>>>  arch/x86/boot/string.c           | 25 +
>>>>  3 files changed, 72 insertions(+)
>>>>
>>>> diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
>>>> index e5612f3..0d5fe5b 100644
>>>> --- a/arch/x86/boot/boot.h
>>>> +++ b/arch/x86/boot/boot.h
>>>> @@ -332,6 +332,8 @@ int strncmp(const char *cs, const char *ct, size_t count);
>>>>  size_t strnlen(const char *s, size_t maxlen);
>>>>  unsigned int atou(const char *s);
>>>>  unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int base);
>>>> +unsigned long simple_strtoul(const char *cp, char **endp, unsigned int base);
>>>> +long simple_strtol(const char *cp, char **endp, unsigned int base);
>>>>  size_t strlen(const char *s);
>>>>
>>>>  /* tty.c */
>>>> diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
>>>> index a66854d..6fb8f1ec 100644
>>>> --- a/arch/x86/boot/compressed/kaslr.c
>>>> +++ b/arch/x86/boot/compressed/kaslr.c
>>>> @@ -11,6 +11,7 @@
>>>>   */
>>>>  #include "misc.h"
>>>>  #include "error.h"
>>>> +#include "../boot.h"
>>>>
>>>>  #include
>>>>  #include
>>>> @@ -61,6 +62,7 @@ enum mem_avoid_index {
>>>>  	MEM_AVOID_INITRD,
>>>>  	MEM_AVOID_CMDLINE,
>>>>  	MEM_AVOID_BOOTPARAMS,
>>>> +	MEM_AVOID_MEMMAP,
>>>>  	MEM_AVOID_MAX,
>>>>  };
>>>>
>>>> @@ -77,6 +79,37 @@ static bool mem_overlaps(struct mem_vector *one, struct mem_vector *two)
>>>>  	return true;
>>>>  }
>>>>
>>>> +#include "../../../../lib/cmdline.c"
>>>> +
>>>> +static int
>>>> +parse_memmap(char *p, unsigned long long *start, unsigned long long *size)
>>>> +{
>>>> +	char *oldp;
>>>> +
>>>> +	if (!p)
>>>> +		return -EINVAL;
>>>> +
>>>> +	/* we don't care about this option here */
>>>> +	if (!strncmp(p, "exactmap", 8))
>>>> +		return -EINVAL;
>>>> +
>>>> +	oldp = p;
>>>> +	*size = memparse(p, &p);
>>>> +	if (p == oldp)
>>>> +		return -EINVAL;
>>>> +
>>>> +	switch (*p) {
>>>> +	case '@':
>>>> +	case '#':
>>>> +	case '$':
>>>> +	case '!':
>>>> +		*start = memparse(p + 1, &p);
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	return -EINVAL;
>>>> +}
>>>> +
>>>>  /*
>>>>   * In theory, KASLR can put the kernel anywhere in the range of [16M, 64T).
>>>>   * The mem_avoid array is used to store the ranges that need to be avoided
>>>> @@ -158,6 +191,8 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
>>>>  	u64 initrd_start, initrd_size;
>>>>  	u64 cmd_line, cmd_line_size;
>>>>  	char *ptr;
>>>> +	char arg[38];
>>>
>>> Where does the magic '38' come from?
>>>
>>>> +	unsigned long long memmap_start, memmap_size;
>>>>
>>>>  	/*
>>>>  	 * Avoid the region that is unsafe to overlap during
>>>> @@ -195,6 +230,16 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
>>>>  	add_identity_map(mem_avoid[MEM_AVOID_BOOTPARAMS].start,
>>>>  			 mem_avoid[MEM_AVOID_BOOTPARAMS].size);
>>>>
>>>> +	/* see if we have any memmap areas */
>>>> +	if (cmdline_find_option("memmap", arg, sizeof(arg)) > 0) {
>>>> +		int rc = parse_memmap(arg, &memmap_start, &memmap_size);
>>>> +
>>>> +		if (!rc) {
Re: Enabling peer to peer device transactions for PCIe devices
On Mon, Nov 21, 2016 at 12:36 PM, Deucher, Alexander wrote:
> This is certainly not the first time this has been brought up, but I'd like
> to try and get some consensus on the best way to move this forward. Allowing
> devices to talk directly improves performance and reduces latency by avoiding
> the use of staging buffers in system memory. Also in cases where both
> devices are behind a switch, it avoids the CPU entirely. Most current APIs
> (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based.
> Ideally we'd be able to take a CPU virtual address and be able to get to a
> physical address taking into account IOMMUs, etc. Having struct pages for
> the memory would allow it to work more generally and wouldn't require as much
> explicit support in drivers that wanted to use it.
>
> Some use cases:
> 1. Storage devices streaming directly to GPU device memory
> 2. GPU device memory to GPU device memory streaming
> 3. DVB/V4L/SDI devices streaming directly to GPU device memory
> 4. DVB/V4L/SDI devices streaming directly to storage devices
>
> Here is a relatively simple example of how this could work for testing. This
> is obviously not a complete solution.
> - Device memory will be registered with the Linux memory sub-system by
>   creating corresponding struct page structures for device memory
> - get_user_pages_fast() will return corresponding struct pages when a CPU
>   address points to the device memory
> - put_page() will deal with struct pages for device memory
> [..]
> 4. iopmem
> iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)

The change I suggest for this particular approach is to switch to
"device-DAX" [1]. I.e. a character device for establishing DAX mappings
rather than a block device plus a DAX filesystem. The pro of this approach
is standard user pointers and struct pages rather than a new construct. The
con is that this is done via an interface separate from the existing gpu
and storage devices.
For example it would require a /dev/dax instance alongside a /dev/nvme
interface, but I don't see that as a significant blocking concern.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-October/007496.html
Re: [PATCH] x86: fix kaslr and memmap collision
* Dave Jiang wrote:

> CONFIG_RANDOMIZE_BASE relocates the kernel to a random base address.
> However it does not take into account the memmap= parameter passed in from
> the kernel commandline.

memmap= parameters are often used as a list.

> [...] This results in the kernel sometimes being put in the middle of the
> user memmap. [...]

What does this mean? If memmap= is used to re-define the memory map then the
kernel getting in the middle of a RAM area is what we want, isn't it? What we
don't want is for the kernel to get into reserved areas, right?

> [...] Check has been added in the kaslr in order to avoid the region marked
> by memmap.

What does this mean?

> Signed-off-by: Dave Jiang
> ---
>  arch/x86/boot/boot.h             |  2 ++
>  arch/x86/boot/compressed/kaslr.c | 45 ++
>  arch/x86/boot/string.c           | 25 +
>  3 files changed, 72 insertions(+)
>
> diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
> index e5612f3..0d5fe5b 100644
> --- a/arch/x86/boot/boot.h
> +++ b/arch/x86/boot/boot.h
> @@ -332,6 +332,8 @@ int strncmp(const char *cs, const char *ct, size_t count);
>  size_t strnlen(const char *s, size_t maxlen);
>  unsigned int atou(const char *s);
>  unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int base);
> +unsigned long simple_strtoul(const char *cp, char **endp, unsigned int base);
> +long simple_strtol(const char *cp, char **endp, unsigned int base);
>  size_t strlen(const char *s);
>
>  /* tty.c */
> diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
> index a66854d..6fb8f1ec 100644
> --- a/arch/x86/boot/compressed/kaslr.c
> +++ b/arch/x86/boot/compressed/kaslr.c
> @@ -11,6 +11,7 @@
>   */
>  #include "misc.h"
>  #include "error.h"
> +#include "../boot.h"
>
>  #include
>  #include
> @@ -61,6 +62,7 @@ enum mem_avoid_index {
>  	MEM_AVOID_INITRD,
>  	MEM_AVOID_CMDLINE,
>  	MEM_AVOID_BOOTPARAMS,
> +	MEM_AVOID_MEMMAP,
>  	MEM_AVOID_MAX,
>  };
>
> @@ -77,6 +79,37 @@ static bool mem_overlaps(struct mem_vector *one, struct mem_vector *two)
>  	return true;
>  }
>
> +#include "../../../../lib/cmdline.c"
> +
> +static int
> +parse_memmap(char *p, unsigned long long *start, unsigned long long *size)
> +{
> +	char *oldp;
> +
> +	if (!p)
> +		return -EINVAL;
> +
> +	/* we don't care about this option here */
> +	if (!strncmp(p, "exactmap", 8))
> +		return -EINVAL;
> +
> +	oldp = p;
> +	*size = memparse(p, &p);
> +	if (p == oldp)
> +		return -EINVAL;
> +
> +	switch (*p) {
> +	case '@':
> +	case '#':
> +	case '$':
> +	case '!':
> +		*start = memparse(p + 1, &p);
> +		return 0;
> +	}
> +
> +	return -EINVAL;
> +}
> +
>  /*
>   * In theory, KASLR can put the kernel anywhere in the range of [16M, 64T).
>   * The mem_avoid array is used to store the ranges that need to be avoided
> @@ -158,6 +191,8 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
>  	u64 initrd_start, initrd_size;
>  	u64 cmd_line, cmd_line_size;
>  	char *ptr;
> +	char arg[38];

Where does the magic '38' come from?

> +	unsigned long long memmap_start, memmap_size;
>
>  	/*
>  	 * Avoid the region that is unsafe to overlap during
> @@ -195,6 +230,16 @@ static void mem_avoid_init(unsigned long input, unsigned long input_size,
>  	add_identity_map(mem_avoid[MEM_AVOID_BOOTPARAMS].start,
>  			 mem_avoid[MEM_AVOID_BOOTPARAMS].size);
>
> +	/* see if we have any memmap areas */
> +	if (cmdline_find_option("memmap", arg, sizeof(arg)) > 0) {
> +		int rc = parse_memmap(arg, &memmap_start, &memmap_size);
> +
> +		if (!rc) {
> +			mem_avoid[MEM_AVOID_MEMMAP].start = memmap_start;
> +			mem_avoid[MEM_AVOID_MEMMAP].size = memmap_size;
> +		}
> +	}
> +

This only handles a single (first) memmap argument, is that sufficient?

Thanks,

	Ingo