Re: Enabling peer to peer device transactions for PCIe devices
On Fri, January 6, 2017 4:10 pm, Logan Gunthorpe wrote:
> On 06/01/17 11:26 AM, Jason Gunthorpe wrote:
>> Make a generic API for all of this and you'd have my vote..
>>
>> IMHO, you must support basic pinning semantics - that is necessary to
>> support generic short lived DMA (eg filesystem, etc). That hardware can
>> clearly do that if it can support ODP.
>
> I agree completely.
>
> What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
> (ie. at least those backed with ZONE_DEVICE memory). Then
> GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
> (using whatever interface is most appropriate) and userspace can do what
> it pleases with them. This makes _so_ much sense and actually largely
> already works today (as demonstrated by iopmem).

+1 for iopmem ;-)

I feel like we are going around and around on this topic. I would like to
see something upstream that enables P2P, even if it is only the minimum
viable useful functionality to begin with. Aiming for the moon (which is
what HMM and things like it are) is simply going to take more time, if it
ever gets there.

There is a use case for in-kernel P2P PCIe transfers between two NVMe
devices, and between an NVMe device and an RDMA NIC (using NVMe CMBs or
BARs on the NIC). I am even seeing users who now want to move data P2P
between FPGAs and NVMe SSDs, and the upstream kernel should be able to
support these users or they will look elsewhere.

The iopmem patchset addressed all of the use cases above, and while it is
not an in-kernel API it could have been modified into one reasonably
easily. As Logan states, the driver can then choose to pass the VMAs to
user-space in a manner that makes sense.

Earlier in the thread someone mentioned LSF/MM. There is already a
proposal to discuss this topic, so if you are interested please respond to
the email letting the committee know this topic is of interest to you [1].
Also earlier in the thread someone discussed the issues around the IOMMU.
Given the known issues around P2P transfers in certain CPU root complexes
[2], it might just be a case of only allowing P2P when a PCIe switch
connects the two EPs. Another option is to use CONFIG_EXPERT and make sure
people are aware of the pitfalls if they invoke the P2P option.

Finally, as Jason noted, we could all just wait until
CAPI/OpenCAPI/CCIX/GenZ comes along. However, given that these interfaces
are the remit of the CPU vendors, I think it behooves us to solve this
problem before then. Also, some of the above mentioned protocols are not
even switchable and may not be amenable to a P2P topology...

Stephen

[1] http://marc.info/?l=linux-mm&m=148156541804940&w=2
[2] https://community.mellanox.com/docs/DOC-1119

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
[PATCH v7] x86: fix kaslr and memmap collision
CONFIG_RANDOMIZE_BASE relocates the kernel to a random base address.
However, it does not take into account the memmap= parameter passed in from
the kernel command line. This results in the kernel sometimes being placed
in the middle of a memmap region. Teach kaslr not to insert the kernel in
memmap-defined regions. Up to 4 memmap regions are supported; any
additional regions will cause kaslr to be disabled.

The mem_avoid set has been augmented to add up to 4 unusable regions of
memmaps provided by the user, to exclude those regions from the set of
valid address ranges in which to place the uncompressed kernel image.
nn@ss ranges will be skipped by the mem_avoid set since they indicate
usable memory.

Signed-off-by: Dave Jiang
Acked-by: Kees Cook
Acked-by: Baoquan He
---
 arch/x86/boot/boot.h             |   1
 arch/x86/boot/compressed/kaslr.c | 140 +-
 arch/x86/boot/string.c           |  13
 3 files changed, 151 insertions(+), 3 deletions(-)

v2: Addressed comments from Ingo.
    - Handle entire list of memmaps
v3: Fix 32bit build issue
v4: Addressed comments from Baoquan
    - Do not exclude nn@ss ranges
v5: Addressed additional comments from Baoquan
    - Update commit header and various coding style changes
v6: Addressed comments from Kees
    - Only fail for physical address randomization
v7: Addressed comments from Thomas
    - Dropped unused functions
    - Made address and size in memmap_avoid unsigned long long
    - Style fixes

diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index e5612f3..9b42b6d 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -333,6 +333,7 @@ size_t strnlen(const char *s, size_t maxlen);
 unsigned int atou(const char *s);
 unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int base);
 size_t strlen(const char *s);
+char *strchr(const char *s, int c);

 /* tty.c */
 void puts(const char *);
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index a66854d..8b7c9e7 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -11,6 +11,7 @@
  */
 #include "misc.h"
 #include "error.h"
+#include "../boot.h"

 #include
 #include
@@ -52,15 +53,22 @@ static unsigned long get_boot_seed(void)
 #include "../../lib/kaslr.c"

 struct mem_vector {
-	unsigned long start;
-	unsigned long size;
+	unsigned long long start;
+	unsigned long long size;
 };

+/* Only supporting at most 4 unusable memmap regions with kaslr */
+#define MAX_MEMMAP_REGIONS	4
+
+static bool memmap_too_large;
+
 enum mem_avoid_index {
 	MEM_AVOID_ZO_RANGE = 0,
 	MEM_AVOID_INITRD,
 	MEM_AVOID_CMDLINE,
 	MEM_AVOID_BOOTPARAMS,
+	MEM_AVOID_MEMMAP_BEGIN,
+	MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
 	MEM_AVOID_MAX,
 };

@@ -77,6 +85,123 @@ static bool mem_overlaps(struct mem_vector *one, struct mem_vector *two)
 	return true;
 }

+/**
+ * _memparse - Parse a string with mem suffixes into a number
+ * @ptr: Where parse begins
+ * @retptr: (output) Optional pointer to next char after parse completes
+ *
+ * Parses a string into a number. The number stored at @ptr is
+ * potentially suffixed with K, M, G, T, P, E.
+ */
+static unsigned long long _memparse(const char *ptr, char **retptr)
+{
+	char *endptr;	/* Local pointer to end of parsed string */
+
+	unsigned long long ret = simple_strtoull(ptr, &endptr, 0);
+
+	switch (*endptr) {
+	case 'E':
+	case 'e':
+		ret <<= 10;
+	case 'P':
+	case 'p':
+		ret <<= 10;
+	case 'T':
+	case 't':
+		ret <<= 10;
+	case 'G':
+	case 'g':
+		ret <<= 10;
+	case 'M':
+	case 'm':
+		ret <<= 10;
+	case 'K':
+	case 'k':
+		ret <<= 10;
+		endptr++;
+	default:
+		break;
+	}
+
+	if (retptr)
+		*retptr = endptr;
+
+	return ret;
+}
+
+static int
+parse_memmap(char *p, unsigned long long *start, unsigned long long *size)
+{
+	char *oldp;
+
+	if (!p)
+		return -EINVAL;
+
+	/* We don't care about this option here */
+	if (!strncmp(p, "exactmap", 8))
+		return -EINVAL;
+
+	oldp = p;
+	*size = _memparse(p, &p);
+	if (p == oldp)
+		return -EINVAL;
+
+	switch (*p) {
+	case '@':
+		/* Skip this region, usable */
+		*start = 0;
+		*size = 0;
+		return 0;
+	case '#':
+	case '$':
+	case '!':
+		*start = _memparse(p + 1, &p);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static void mem_avoid_memmap(void)
+{
+	char arg[128];
+	int rc;
+	int i;
+	char *str;
+
+	/* See if we have any
Re: [PATCH v6] x86: fix kaslr and memmap collision
On 01/11/2017 05:00 AM, Thomas Gleixner wrote:
> On Tue, 10 Jan 2017, Dave Jiang wrote:
>> +unsigned long simple_strtoul(const char *cp, char **endp, unsigned int base);
>> +long simple_strtol(const char *cp, char **endp, unsigned int base);
>
> What are those functions for? They are not used in that patch at all.

My mistake. Those were used for something in earlier versions of the patch
and I forgot to take them out now that they aren't being used. Will remove.

>> +static void mem_avoid_memmap(void)
>> +{
>> +	char arg[128];
>> +	int rc = 0;
>
> What's the point of defining this variable here and zero initializing it?

It was necessary when the function was returning int instead of void. I
will move it.

>> +	/* see if we have any memmap areas */
>
> Sentences start with an upper case letter. Everywhere!

Will fix.

>> +	if (cmdline_find_option("memmap", arg, sizeof(arg)) > 0) {
>
> You can spare an indentation level by just returning when the search fails.

Will update.

>> +		int i = 0;
>> +		char *str = arg;
>> +
>> +		while (str && (i < MAX_MEMMAP_REGIONS)) {
>
>	for (i = 0; str && (i < MAX_MEMMAP_REGIONS); i++) {
>
> Please.

The reason it's a while loop is that when we hit an instance of
memmap=nn@ss we can conditionally skip it without having to increment the
counter 'i'.

>> +			unsigned long long start, size;
>> +			char *k = strchr(str, ',');
>> +
>> +			if (k)
>> +				*k++ = 0;
>> +
>> +			rc = parse_memmap(str, &start, &size);
>> +			if (rc < 0)
>> +				break;
>> +			str = k;
>> +			/* a usable region that should not be skipped */
>> +			if (size == 0)
>> +				continue;
>> +
>> +			mem_avoid[MEM_AVOID_MEMMAP_BEGIN + i].start = start;
>> +			mem_avoid[MEM_AVOID_MEMMAP_BEGIN + i].size = size;
>
> So this makes no sense. You parse start/size as unsigned long long and
> then store it in an unsigned long. Works on 64bit, but on 32bit this is
> just wrong:
>
> Assume a memmap above 4G, which then will create an avoid region below 4G
> due to truncation to unsigned long.
Ok, so it looks like this was an issue for 32bit in the original code? I'm
presuming none of the previous mem_avoid regions are over 4G, so that
wasn't an issue? I will update the data structure to use unsigned long
long. Thanks for the review!

> Thanks,
>
> 	tglx
Re: [PATCH v4] DAX: enable iostat for read/write
On Tue, 2017-01-10 at 20:36 -0800, Joe Perches wrote:
> On Tue, 2017-01-10 at 17:11 -0700, Toshi Kani wrote:
>> DAX IO path does not support iostat, but its metadata IO path does.
>> Therefore, iostat shows metadata IO statistics only, which has been
>> confusing to users.
> []
>> diff --git a/fs/dax.c b/fs/dax.c
> []
>> @@ -1058,12 +1058,22 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
> []
>> +	if (blk_queue_io_stat(disk->queue)) {
>> +		int sec = iov_iter_count(iter) >> 9;
>> +
>> +		start = jiffies;
>> +		generic_start_io_acct(iov_iter_rw(iter),
>> +				      (!sec) ? 1 : sec,
>> +				      &disk->part0);
>> +	}
>
> There is a signed/unsigned conversion of sec.
> It may be better to use something like:
>
>	size_t sec = iov_iter_count(iter) >> 9;
>	[...]
>	generic_start_io_acct(iov_iter_rw(iter),
>			      min_t(unsigned long, 1, sec),
>			      &disk->part0);

Good catch. I will change as you suggested, and use 'sector_t' for 'sec'.

Thanks,
-Toshi
Re: [PATCH 2/2] dax: assert that i_rwsem is held exclusive for writes
On Tue 10-01-17 16:48:08, Christoph Hellwig wrote:
> Make sure all callers follow the same locking protocol, given that DAX
> transparently replaced the normal buffered I/O path.
>
> Signed-off-by: Christoph Hellwig

Looks good. You can add:

Reviewed-by: Jan Kara

Probably also for Ted, since it depends on the ext4 fix...

								Honza

> ---
>  fs/dax.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 5c74f60..04734da 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1061,8 +1061,12 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	loff_t pos = iocb->ki_pos, ret = 0, done = 0;
>  	unsigned flags = 0;
>
> -	if (iov_iter_rw(iter) == WRITE)
> +	if (iov_iter_rw(iter) == WRITE) {
> +		lockdep_assert_held_exclusive(&inode->i_rwsem);
>  		flags |= IOMAP_WRITE;
> +	} else {
> +		lockdep_assert_held(&inode->i_rwsem);
> +	}
>
>  	while (iov_iter_count(iter)) {
>  		ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
> --
> 2.1.4

--
Jan Kara
SUSE Labs, CR
Re: [PATCH] dax: fix build warnings with FS_DAX and !FS_IOMAP
Looks fine:

Reviewed-by: Christoph Hellwig