Re: Enabling peer to peer device transactions for PCIe devices

2017-01-11 Thread Stephen Bates
On Fri, January 6, 2017 4:10 pm, Logan Gunthorpe wrote:
>
>
> On 06/01/17 11:26 AM, Jason Gunthorpe wrote:
>
>
>> Make a generic API for all of this and you'd have my vote..
>>
>>
>> IMHO, you must support basic pinning semantics - that is necessary to
>> support generic short lived DMA (eg filesystem, etc). That hardware can
>> clearly do that if it can support ODP.
>
> I agree completely.
>
>
> What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
> (ie. at least those backed with ZONE_DEVICE memory). Then
> GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
> (using whatever interface is most appropriate) and userspace can do what
> it pleases with them. This makes _so_ much sense and actually largely
> already works today (as demonstrated by iopmem).

+1 for iopmem ;-)

I feel like we are going around and around on this topic. I would like to
see something that is upstream that enables P2P even if it is only the
minimum viable useful functionality to begin. I think aiming for the moon
(which is what HMM and things like it are) are simply going to take more
time if they ever get there.

There is a use case for in-kernel P2P PCIe transfers between two NVMe
devices and between an NVMe device and an RDMA NIC (using NVMe CMBs or
BARs on the NIC). I am even seeing users who now want to move data P2P
between FPGAs and NVMe SSDs and the upstream kernel should be able to
support these users or they will look elsewhere.

The iopmem patchset addressed all the use cases above and while it is not
an in kernel API it could have been modified to be one reasonably easily.
As Logan states the driver can then choose to pass the VMAs to user-space
in a manner that makes sense.

Earlier in the thread someone mentioned LSF/MM. There is already a
proposal to discuss this topic so if you are interested please respond to
the email letting the committee know this topic is of interest to you [1].

Also earlier in the thread someone discussed the issues around the IOMMU.
Given the known issues around P2P transfers in certain CPU root complexes
[2] it might just be a case of only allowing P2P when a PCIe switch
connects the two EPs. Another option is just to use CONFIG_EXPERT and make
sure people are aware of the pitfalls if they invoke the P2P option.

Finally, as Jason noted, we could all just wait until
CAPI/OpenCAPI/CCIX/GenZ comes along. However given that these interfaces
are the remit of the CPU vendors I think it behooves us to solve this
problem before then. Also some of the above mentioned protocols are not
even switchable and may not be amenable to a P2P topology...

Stephen

[1] http://marc.info/?l=linux-mm=148156541804940=2
[2] https://community.mellanox.com/docs/DOC-1119

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH v7] x86: fix kaslr and memmap collision

2017-01-11 Thread Dave Jiang
CONFIG_RANDOMIZE_BASE relocates the kernel to a random base address.
However it does not take into account the memmap= parameter passed in from
the kernel cmdline. This results in the kernel sometimes being put in
the middle of memmap. Teaching kaslr to not insert the kernel in
memmap defined regions. We will support up to 4 memmap regions. Any
additional regions will cause kaslr to disable. The mem_avoid set has
been augmented to add up to 4 unusable regions of memmaps provided by the
user to exclude those regions from the set of valid address range to insert
the uncompressed kernel image. The nn@ss ranges will be skipped by the
mem_avoid set since it indicates memory useable.

Signed-off-by: Dave Jiang 
Acked-by: Kees Cook 
Acked-by: Baoquan He 
---
 arch/x86/boot/boot.h |1 
 arch/x86/boot/compressed/kaslr.c |  140 +-
 arch/x86/boot/string.c   |   13 
 3 files changed, 151 insertions(+), 3 deletions(-)

v2:
Addressed comments from Ingo.
- Handle entire list of memmaps
v3:
Fix 32bit build issue
v4:
Addressed comments from Baoquan
- Not exclude nn@ss ranges
v5:
Addressed additional comments from Baoquan
- Update commit header and various coding style changes
v6:
Addressed comments from Kees
- Only fail for physical address randomization
v7:
Addressed comments from Thomas
- Dropped unused functions
- Made address and size in memmap_avoid unsigned long long
- Style fixes

diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index e5612f3..9b42b6d 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -333,6 +333,7 @@ size_t strnlen(const char *s, size_t maxlen);
 unsigned int atou(const char *s);
 unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int 
base);
 size_t strlen(const char *s);
+char *strchr(const char *s, int c);
 
 /* tty.c */
 void puts(const char *);
diff --git a/arch/x86/boot/compressed/kaslr.c b/arch/x86/boot/compressed/kaslr.c
index a66854d..8b7c9e7 100644
--- a/arch/x86/boot/compressed/kaslr.c
+++ b/arch/x86/boot/compressed/kaslr.c
@@ -11,6 +11,7 @@
  */
 #include "misc.h"
 #include "error.h"
+#include "../boot.h"
 
 #include 
 #include 
@@ -52,15 +53,22 @@ static unsigned long get_boot_seed(void)
 #include "../../lib/kaslr.c"
 
 struct mem_vector {
-   unsigned long start;
-   unsigned long size;
+   unsigned long long start;
+   unsigned long long size;
 };
 
+/* Only supporting at most 4 unusable memmap regions with kaslr */
+#define MAX_MEMMAP_REGIONS 4
+
+static bool memmap_too_large;
+
 enum mem_avoid_index {
MEM_AVOID_ZO_RANGE = 0,
MEM_AVOID_INITRD,
MEM_AVOID_CMDLINE,
MEM_AVOID_BOOTPARAMS,
+   MEM_AVOID_MEMMAP_BEGIN,
+   MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
MEM_AVOID_MAX,
 };
 
@@ -77,6 +85,123 @@ static bool mem_overlaps(struct mem_vector *one, struct 
mem_vector *two)
return true;
 }
 
+/**
+ * _memparse - Parse a string with mem suffixes into a number
+ * @ptr: Where parse begins
+ * @retptr: (output) Optional pointer to next char after parse completes
+ *
+ * Parses a string into a number.  The number stored at @ptr is
+ * potentially suffixed with K, M, G, T, P, E.
+ */
+static unsigned long long _memparse(const char *ptr, char **retptr)
+{
+   char *endptr;   /* Local pointer to end of parsed string */
+
+   unsigned long long ret = simple_strtoull(ptr, , 0);
+
+   switch (*endptr) {
+   case 'E':
+   case 'e':
+   ret <<= 10;
+   case 'P':
+   case 'p':
+   ret <<= 10;
+   case 'T':
+   case 't':
+   ret <<= 10;
+   case 'G':
+   case 'g':
+   ret <<= 10;
+   case 'M':
+   case 'm':
+   ret <<= 10;
+   case 'K':
+   case 'k':
+   ret <<= 10;
+   endptr++;
+   default:
+   break;
+   }
+
+   if (retptr)
+   *retptr = endptr;
+
+   return ret;
+}
+
+static int
+parse_memmap(char *p, unsigned long long *start, unsigned long long *size)
+{
+   char *oldp;
+
+   if (!p)
+   return -EINVAL;
+
+   /* We don't care about this option here */
+   if (!strncmp(p, "exactmap", 8))
+   return -EINVAL;
+
+   oldp = p;
+   *size = _memparse(p, );
+   if (p == oldp)
+   return -EINVAL;
+
+   switch (*p) {
+   case '@':
+   /* Skip this region, usable */
+   *start = 0;
+   *size = 0;
+   return 0;
+   case '#':
+   case '$':
+   case '!':
+   *start = _memparse(p + 1, );
+   return 0;
+   }
+
+   return -EINVAL;
+}
+
+static void mem_avoid_memmap(void)
+{
+   char arg[128];
+   int rc;
+   int i;
+   char *str;
+
+   /* See if we have any 

Re: [PATCH v6] x86: fix kaslr and memmap collision

2017-01-11 Thread Dave Jiang


On 01/11/2017 05:00 AM, Thomas Gleixner wrote:
> On Tue, 10 Jan 2017, Dave Jiang wrote:
>> +unsigned long simple_strtoul(const char *cp, char **endp, unsigned int 
>> base);
>> +long simple_strtol(const char *cp, char **endp, unsigned int base);
> 
> What are those functions for? They are not used in that patch at all.

My mistake. Those were used for something in earlier versions of the
patch and I forgot to take them out now they aren't being used. Will remove.

> 
>> +static void mem_avoid_memmap(void)
>> +{
>> +char arg[128];
>> +int rc = 0;
> 
> What's the point of defining this variable here and zero initializing it?

It was necessary when the function was returning int instead of void. I
will move.

> 
>> +/* see if we have any memmap areas */
> 
> Sentences start with an upper case letter. Everywhere!

Will fix.

> 
>> +if (cmdline_find_option("memmap", arg, sizeof(arg)) > 0) {
> 
> You can spare an indentation level by just returning when the search fails.

Will update.

> 
>> +int i = 0;
>> +char *str = arg;
>> +
>> +while (str && (i < MAX_MEMMAP_REGIONS)) {
> 
>   for (i = 0; str && (i < MAX_MEMMAP_REGIONS); i++) {
> 
> Please.

The reason it's a while is so that when we hit an instance of
memmap=nn@ss we can conditionally skip it without having to increment
the counter 'i'.

> 
>> +unsigned long long start, size;
>> +char *k = strchr(str, ',');
>> +
>> +if (k)
>> +*k++ = 0;
>> +
>> +rc = parse_memmap(str, , );
>> +if (rc < 0)
>> +break;
>> +str = k;
>> +/* a usable region that should not be skipped */
>> +if (size == 0)
>> +continue;
>> +
>> +mem_avoid[MEM_AVOID_MEMMAP_BEGIN + i].start = start;
>> +mem_avoid[MEM_AVOID_MEMMAP_BEGIN + i].size = size;
> 
> So this makes no sense. You parse start/size as unsigned long long and
> then store it in an unsigned long. Works on 64bit, but on 32bit this is
> just wrong:
> 
> Assume a memmap above 4G, which then will create a avoid region below 4G
> due to truncation to unsigned long.

Ok so it looks like an issue for 32bit on the original code? I'm
presuming none of the previous mem avoid regions are over 4G so that
wasn't an issue? I will update the data structure to use unsigned long
long.

Thanks for the review!

> 
> Thanks,
> 
>   tglx
> 
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v4] DAX: enable iostat for read/write

2017-01-11 Thread Kani, Toshimitsu
On Tue, 2017-01-10 at 20:36 -0800, Joe Perches wrote:
> > 
On Tue, 2017-01-10 at 17:11 -0700, Toshi Kani wrote:
> > DAX IO path does not support iostat, but its metadata IO path does.
> > Therefore, iostat shows metadata IO statistics only, which has been
> > confusing to users.
> 
> []
> > diff --git a/fs/dax.c b/fs/dax.c
> 
> []
> > @@ -1058,12 +1058,22 @@ dax_iomap_rw(struct kiocb *iocb, struct
> > iov_iter *iter,
> 
> []
> > +   if (blk_queue_io_stat(disk->queue)) {
> > +   int sec = iov_iter_count(iter) >> 9;
> > +
> > +   start = jiffies;
> > +   generic_start_io_acct(iov_iter_rw(iter),
> > +     (!sec) ? 1 : sec, 
> > >part0);
> > +   }
> 
> There is a signed/unsigned conversion of sec
> It may be better to use something like:
> 
>   size_t sec  = iov_iter_count(iter) >> 9;
>   [...]
>   generic_start_io_acct(iov_iter_rw(iter),
>     min_t(unsigned long, 1, sec),
>     >part0);

Good catch. I will change as you suggested, and use 'sector_t' for
'sec'. 

Thanks,
-Toshi
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 2/2] dax: assert that i_rwsem is held exclusive for writes

2017-01-11 Thread Jan Kara
On Tue 10-01-17 16:48:08, Christoph Hellwig wrote:
> Make sure all callers follow the same locking protocol, given that DAX
> transparantly replaced the normal buffered I/O path.
> 
> Signed-off-by: Christoph Hellwig 

Looks good. You can add:

Reviewed-by: Jan Kara 

Probably also for Ted since it depends on the ext4 fix...

Honza
> ---
>  fs/dax.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 5c74f60..04734da 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1061,8 +1061,12 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
>   loff_t pos = iocb->ki_pos, ret = 0, done = 0;
>   unsigned flags = 0;
>  
> - if (iov_iter_rw(iter) == WRITE)
> + if (iov_iter_rw(iter) == WRITE) {
> + lockdep_assert_held_exclusive(>i_rwsem);
>   flags |= IOMAP_WRITE;
> + } else {
> + lockdep_assert_held(>i_rwsem);
> + }
>  
>   while (iov_iter_count(iter)) {
>   ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
> -- 
> 2.1.4
> 
-- 
Jan Kara 
SUSE Labs, CR
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH] dax: fix build warnings with FS_DAX and !FS_IOMAP

2017-01-11 Thread Christoph Hellwig
Looks fine:

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm