Re: [PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing

2016-11-28 Thread Dave Chinner
On Mon, Nov 28, 2016 at 03:46:51PM -0700, Ross Zwisler wrote:
> On Fri, Nov 25, 2016 at 02:00:59PM +1100, Dave Chinner wrote:
> > On Wed, Nov 23, 2016 at 11:44:19AM -0700, Ross Zwisler wrote:
> > > Tracepoints are the standard way to capture debugging and tracing
> > > information in many parts of the kernel, including the XFS and ext4
> > > filesystems.  Create a tracepoint header for FS DAX and add the first DAX
> > > tracepoints to the PMD fault handler.  This allows the tracing for DAX to
> > > be done in the same way as the filesystem tracing so that developers can
> > > look at them together and get a coherent idea of what the system is doing.
> > > 
> > > I added both an entry and exit tracepoint because future patches will add
> > > tracepoints to child functions of dax_iomap_pmd_fault() like
> > > dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages to
> > > be wrapped by the parent function tracepoints so the code flow is more
> > > easily understood.  Having entry and exit tracepoints for faults also
> > > allows us to easily see what filesystem functions were called during the
> > > fault.  These filesystem functions get executed via iomap_begin() and
> > > iomap_end() calls, for example, and will have their own tracepoints.
> > > 
> > > For PMD faults we primarily want to understand the faulting address and
> > > whether it fell back to 4k faults.  If it fell back to 4k faults the
> > > tracepoints should let us understand why.
> > > 
> > > I named the new tracepoint header file "fs_dax.h" to allow for device DAX
> > > to have its own separate tracing header in the same directory at some
> > > point.
> > > 
> > > Here is an example output for these events from a successful PMD fault:
> > > 
> > > big-2057  [000]    136.396855: dax_pmd_fault: shared mapping write
> > > address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff 0x200
> > > max_pgoff 0x1400
> > > 
> > > big-2057  [000]    136.397943: dax_pmd_fault_done: shared mapping write
> > > address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff 0x200
> > > max_pgoff 0x1400 NOPAGE
> > 
> > Can we make the output use the same format as most of the filesystem
> > code? i.e. the output starts with backing device + inode number like
> > so:
> > 
> > xfs_ilock:dev 8:96 ino 0x493 flags ILOCK_EXCL
> > 
> > This way we can filter the output easily across both dax and
> > filesystem tracepoints with 'grep "ino 0x493"'...
> 
> I think I can include the inode number, which I have via mapping->host.  Am I
> correct in assuming "struct inode.i_ino" will always be the same as
> "struct xfs_inode.i_ino"?

Yes - just use inode.i_ino.

> Unfortunately I don't have access to the major/minor (the dev_t) until I call
> iomap_begin(). 

In general, filesystem tracing uses inode->sb->s_dev as the
identifier. NFS, gfs2, XFS, ext4 and others all use this.
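
For reference, a minimal sketch of a tracepoint that records these
identifiers might look like the following (the event name, arguments and
print format are illustrative only, not the actual fs_dax.h patch, and the
usual TRACE_SYSTEM/include boilerplate of a trace header is omitted):

TRACE_EVENT(dax_pmd_fault,
	TP_PROTO(struct inode *inode, unsigned long address),
	TP_ARGS(inode, address),
	TP_STRUCT__entry(
		__field(dev_t, dev)
		__field(unsigned long, ino)
		__field(unsigned long, address)
	),
	TP_fast_assign(
		__entry->dev = inode->i_sb->s_dev;	/* same id XFS/ext4 report */
		__entry->ino = inode->i_ino;
		__entry->address = address;
	),
	TP_printk("dev %d:%d ino %#lx address %#lx",
		MAJOR(__entry->dev), MINOR(__entry->dev),
		__entry->ino, __entry->address)
);

With that, 'grep "ino 0x493"' matches both the DAX and the filesystem
tracepoints for the same inode.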

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] libnvdimm: use consistent naming for request_mem_region()

2016-11-28 Thread Verma, Vishal L
On Mon, 2016-11-28 at 11:25 -0800, Dan Williams wrote:
> Here is an example /proc/iomem listing for a system with 2
> namespaces,
> one in "sector" mode and one in "memory" mode:
> 
>   1fc00-2fbff : Persistent Memory (legacy)
> 1fc00-2fbff : namespace1.0
>   34000-34fff : Persistent Memory
> 34000-34fff : btt0.1
> 
> Here is the corresponding ndctl listing:
> 
>   # ndctl list
>   [
> {
>   "dev":"namespace1.0",
>   "mode":"memory",
>   "size":4294967296,
>   "blockdev":"pmem1"
> },
> {
>   "dev":"namespace0.0",
>   "mode":"sector",
>   "size":267091968,
>   "uuid":"f7594f86-badb-4592-875f-ded577da2eaf",
>   "sector_size":4096,
>   "blockdev":"pmem0s"
> }
>   ]
> 
> Notice that the ndctl listing is purely in terms of namespace devices,
> while the iomem listing leaks the internal "btt0.1" implementation
> detail. Given that ndctl requires the namespace device name to change
> the mode, for example:
> 
>   # ndctl create-namespace --reconfig=namespace0.0 --mode=raw --force
> 
> ...use the namespace name in the iomem listing to keep the claiming
> device name consistent across different mode settings.
> 
> Cc: Vishal Verma 
> Signed-off-by: Dan Williams 
> ---
>  drivers/dax/pmem.c |3 ++-
>  drivers/nvdimm/claim.c |2 +-
>  drivers/nvdimm/pmem.c  |2 +-
>  3 files changed, 4 insertions(+), 3 deletions(-)

Looks good!

Reviewed-by: Vishal Verma 

> 
> diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
> index 9630d8837ba9..3ff84784249a 100644
> --- a/drivers/dax/pmem.c
> +++ b/drivers/dax/pmem.c
> @@ -87,7 +87,8 @@ static int dax_pmem_probe(struct device *dev)
>   pfn_sb = nd_pfn->pfn_sb;
>  
>   if (!devm_request_mem_region(dev, nsio->res.start,
> - resource_size(&nsio->res), dev_name(dev))) {
> + resource_size(&nsio->res),
> + dev_name(&ndns->dev))) {
>   dev_warn(dev, "could not reserve region %pR\n", &nsio->res);
>   return -EBUSY;
>   }
> diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
> index 8d66fbb779ed..4638b9ea5229 100644
> --- a/drivers/nvdimm/claim.c
> +++ b/drivers/nvdimm/claim.c
> @@ -275,7 +275,7 @@ int devm_nsio_enable(struct device *dev, struct nd_namespace_io *nsio)
>  
>   nsio->size = resource_size(res);
>   if (!devm_request_mem_region(dev, res->start, resource_size(res),
> - dev_name(dev))) {
> + dev_name(&ndns->dev))) {
>   dev_warn(dev, "could not reserve region %pR\n", res);
>   return -EBUSY;
>   }
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 42b3a8217073..34f16a17c07b 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -266,7 +266,7 @@ static int pmem_attach_disk(struct device *dev,
>   dev_warn(dev, "unable to guarantee persistence of writes\n");
>  
>   if (!devm_request_mem_region(dev, res->start, resource_size(res),
> - dev_name(dev))) {
> + dev_name(&ndns->dev))) {
>   dev_warn(dev, "could not reserve region %pR\n", res);
>   return -EBUSY;
>   }
> 


Re: [PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing

2016-11-28 Thread Ross Zwisler
On Fri, Nov 25, 2016 at 02:00:59PM +1100, Dave Chinner wrote:
> On Wed, Nov 23, 2016 at 11:44:19AM -0700, Ross Zwisler wrote:
> > Tracepoints are the standard way to capture debugging and tracing
> > information in many parts of the kernel, including the XFS and ext4
> > filesystems.  Create a tracepoint header for FS DAX and add the first DAX
> > tracepoints to the PMD fault handler.  This allows the tracing for DAX to
> > be done in the same way as the filesystem tracing so that developers can
> > look at them together and get a coherent idea of what the system is doing.
> > 
> > I added both an entry and exit tracepoint because future patches will add
> > tracepoints to child functions of dax_iomap_pmd_fault() like
> > dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages to
> > be wrapped by the parent function tracepoints so the code flow is more
> > easily understood.  Having entry and exit tracepoints for faults also
> > allows us to easily see what filesystem functions were called during the
> > fault.  These filesystem functions get executed via iomap_begin() and
> > iomap_end() calls, for example, and will have their own tracepoints.
> > 
> > For PMD faults we primarily want to understand the faulting address and
> > whether it fell back to 4k faults.  If it fell back to 4k faults the
> > tracepoints should let us understand why.
> > 
> > I named the new tracepoint header file "fs_dax.h" to allow for device DAX
> > to have its own separate tracing header in the same directory at some
> > point.
> > 
> > Here is an example output for these events from a successful PMD fault:
> > 
> > big-2057  [000]    136.396855: dax_pmd_fault: shared mapping write
> > address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff 0x200
> > max_pgoff 0x1400
> > 
> > big-2057  [000]    136.397943: dax_pmd_fault_done: shared mapping write
> > address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff 0x200
> > max_pgoff 0x1400 NOPAGE
> 
> Can we make the output use the same format as most of the filesystem
> code? i.e. the output starts with backing device + inode number like
> so:
> 
>   xfs_ilock:dev 8:96 ino 0x493 flags ILOCK_EXCL
> 
> This way we can filter the output easily across both dax and
> filesystem tracepoints with 'grep "ino 0x493"'...

I think I can include the inode number, which I have via mapping->host.  Am I
correct in assuming "struct inode.i_ino" will always be the same as
"struct xfs_inode.i_ino"?

Unfortunately I don't have access to the major/minor (the dev_t) until I call
iomap_begin().  Currently we call iomap_begin() only after we've done most of
our sanity checking that would cause us to fall back to PTEs.

I don't think we want to reorder things so that we start with an iomap_begin()
- that would mean that we would begin by asking the filesystem for a block
allocation, when in many cases we would then do an alignment check or
something similar and then fall back to PTEs.

I'll add "ino" to the output so it looks something like this:

big-2057  [000]    136.397943: dax_pmd_fault_done: ino 0x493 shared
mapping write address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff
0x200 max_pgoff 0x1400 NOPAGE


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Jason Gunthorpe
On Mon, Nov 28, 2016 at 04:55:23PM -0500, Serguei Sagalovitch wrote:

> >We haven't touched this in a long time and perhaps it changed, but there
> >definitely was a call back in the PeerDirect API to allow the GPU to
> >invalidate the mapping. That's what we don't want.

> I assume that you are talking about the "invalidate_peer_memory()" callback?
> I was told that it is the "last resort" because HCA (and driver) is not
> able to handle it in a safe manner, so it is basically "abort" everything.

If it is a last resort to save system stability then kill the impacted
process, that will release the MRs.

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Serguei Sagalovitch


On 2016-11-28 04:36 PM, Logan Gunthorpe wrote:
> On 28/11/16 12:35 PM, Serguei Sagalovitch wrote:
>> As soon as PeerDirect mapping is called then the GPU must not "move"
>> such memory.  It is by PeerDirect design. It is similar to how it works
>> with system memory and RDMA MR: when "get_user_pages" is called then the
>> memory is pinned.
>
> We haven't touched this in a long time and perhaps it changed, but there
> definitely was a call back in the PeerDirect API to allow the GPU to
> invalidate the mapping. That's what we don't want.

I assume that you are talking about the "invalidate_peer_memory()" callback?
I was told that it is the "last resort" because HCA (and driver) is not
able to handle it in a safe manner, so it is basically "abort" everything.



Re: [PATCH 1/1] libnvdimm, namespace: fix the type of name variable

2016-11-28 Thread Ross Zwisler
On Sat, Nov 26, 2016 at 08:18:04PM +0100, Nicolas Iooss wrote:
> In create_namespace_blk(), the local variable "name" is defined as an
> array of NSLABEL_NAME_LEN pointers:
> 
> char *name[NSLABEL_NAME_LEN];
> 
> This variable is then used in calls to memcpy() and kmemdup() as if it
> were char[NSLABEL_NAME_LEN]. Remove the star in the variable definition
> to make it look right.
> 
> Signed-off-by: Nicolas Iooss 

Yep, nice catch.

Reviewed-by: Ross Zwisler 


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Serguei Sagalovitch

On 2016-11-28 01:20 PM, Logan Gunthorpe wrote:
> On 28/11/16 09:57 AM, Jason Gunthorpe wrote:
>>> On PeerDirect, we have some kind of a middle-ground solution for pinning
>>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
>>> user-space and the GPU not to migrate it. If they do, the MR gets
>>> destroyed immediately.
>> That sounds horrible. How can that possibly work? What if the MR is
>> being used when the GPU decides to migrate? I would not support that
>> upstream without a lot more explanation..
>
> Yup, this was our experience when playing around with PeerDirect. There
> was nothing we could do if the GPU decided to invalidate the P2P
> mapping.

As soon as PeerDirect mapping is called then the GPU must not "move"
such memory.  It is by PeerDirect design. It is similar to how it works
with system memory and RDMA MR: when "get_user_pages" is called then the
memory is pinned.



[PATCH] libnvdimm: use consistent naming for request_mem_region()

2016-11-28 Thread Dan Williams
Here is an example /proc/iomem listing for a system with 2 namespaces,
one in "sector" mode and one in "memory" mode:

  1fc00-2fbff : Persistent Memory (legacy)
1fc00-2fbff : namespace1.0
  34000-34fff : Persistent Memory
34000-34fff : btt0.1

Here is the corresponding ndctl listing:

  # ndctl list
  [
{
  "dev":"namespace1.0",
  "mode":"memory",
  "size":4294967296,
  "blockdev":"pmem1"
},
{
  "dev":"namespace0.0",
  "mode":"sector",
  "size":267091968,
  "uuid":"f7594f86-badb-4592-875f-ded577da2eaf",
  "sector_size":4096,
  "blockdev":"pmem0s"
}
  ]

Notice that the ndctl listing is purely in terms of namespace devices,
while the iomem listing leaks the internal "btt0.1" implementation
detail. Given that ndctl requires the namespace device name to change
the mode, for example:

  # ndctl create-namespace --reconfig=namespace0.0 --mode=raw --force

...use the namespace name in the iomem listing to keep the claiming
device name consistent across different mode settings.

Cc: Vishal Verma 
Signed-off-by: Dan Williams 
---
 drivers/dax/pmem.c |3 ++-
 drivers/nvdimm/claim.c |2 +-
 drivers/nvdimm/pmem.c  |2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 9630d8837ba9..3ff84784249a 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -87,7 +87,8 @@ static int dax_pmem_probe(struct device *dev)
pfn_sb = nd_pfn->pfn_sb;
 
if (!devm_request_mem_region(dev, nsio->res.start,
-   resource_size(&nsio->res), dev_name(dev))) {
+   resource_size(&nsio->res),
+   dev_name(&ndns->dev))) {
	dev_warn(dev, "could not reserve region %pR\n", &nsio->res);
return -EBUSY;
}
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 8d66fbb779ed..4638b9ea5229 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -275,7 +275,7 @@ int devm_nsio_enable(struct device *dev, struct nd_namespace_io *nsio)
 
nsio->size = resource_size(res);
if (!devm_request_mem_region(dev, res->start, resource_size(res),
-   dev_name(dev))) {
+   dev_name(&ndns->dev))) {
dev_warn(dev, "could not reserve region %pR\n", res);
return -EBUSY;
}
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 42b3a8217073..34f16a17c07b 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -266,7 +266,7 @@ static int pmem_attach_disk(struct device *dev,
dev_warn(dev, "unable to guarantee persistence of writes\n");
 
if (!devm_request_mem_region(dev, res->start, resource_size(res),
-   dev_name(dev))) {
+   dev_name(&ndns->dev))) {
dev_warn(dev, "could not reserve region %pR\n", res);
return -EBUSY;
}



Re: [PATCH 2/6] dax: remove leading space from labels

2016-11-28 Thread Ross Zwisler
On Thu, Nov 24, 2016 at 11:42:38AM -0800, Dan Williams wrote:
> On Thu, Nov 24, 2016 at 1:11 AM, Jan Kara  wrote:
> > On Wed 23-11-16 11:44:18, Ross Zwisler wrote:
> >> No functional change.
> >>
> >> As of this commit:
> >>
> >> commit 218dd85887da (".gitattributes: set git diff driver for C source code
> >> files")
> >>
> >> git-diff and git-format-patch both generate diffs whose hunks are correctly
> >> prefixed by function names instead of labels, even if those labels aren't
> >> indented with spaces.
> >
> > Fine by me. I just have some 4 remaining DAX patches (will send them out
> > today) and they will clash with this. So I'd prefer if this happened after
> > they are merged...
> 
> Let's just leave them alone, it's not like this thrash buys us
> anything at this point.  We can just stop including spaces in new
> code.

Honestly I'm not sure which is better.  I understand your argument about not
introducing "thrash" for cleanup like this, but at the same time knowingly
leaving inconsistencies in the code style just because seems...gross?

In any case, sure Jan, if this patch happens let's do it after your remaining
DAX patches.


Re: [PATCH 1/6] dax: fix build breakage with ext4, dax and !iomap

2016-11-28 Thread Ross Zwisler
On Thu, Nov 24, 2016 at 10:02:39AM +0100, Jan Kara wrote:
> On Wed 23-11-16 11:44:17, Ross Zwisler wrote:
> > With the current Kconfig setup it is possible to have the following:
> > 
> > CONFIG_EXT4_FS=y
> > CONFIG_FS_DAX=y
> > CONFIG_FS_IOMAP=n   # this is in fs/Kconfig & isn't user accessible
> > 
> > With this config we get build failures in ext4_dax_fault() because the
> > iomap functions in fs/dax.c are missing:
> > 
> > fs/built-in.o: In function `ext4_dax_fault':
> > file.c:(.text+0x7f3ac): undefined reference to `dax_iomap_fault'
> > file.c:(.text+0x7f404): undefined reference to `dax_iomap_fault'
> > fs/built-in.o: In function `ext4_file_read_iter':
> > file.c:(.text+0x7fc54): undefined reference to `dax_iomap_rw'
> > fs/built-in.o: In function `ext4_file_write_iter':
> > file.c:(.text+0x7fe9a): undefined reference to `dax_iomap_rw'
> > file.c:(.text+0x7feed): undefined reference to `dax_iomap_rw'
> > fs/built-in.o: In function `ext4_block_zero_page_range':
> > inode.c:(.text+0x85c0d): undefined reference to `iomap_zero_range'
> > 
> > Now that the struct buffer_head based DAX fault paths and I/O path have
> > been removed we really depend on iomap support being present for DAX.  Make
> > this explicit by selecting FS_IOMAP if we compile in DAX support.
> > 
> > Signed-off-by: Ross Zwisler 
> 
> I've sent the same patch to Ted yesterday and he will probably queue it on
> top of ext4 iomap patches. If it doesn't happen for some reason, feel free
> to add:
> 
> Reviewed-by: Jan Kara 

Cool, looks like Ted has pulled in your patch.

I think we still eventually want this patch because it cleans up our handling
of FS_IOMAP.  With your patch we select it separately in both ext4 & ext2
based on whether we include DAX, and we still have #ifdefs in fs/dax.c for
FS_IOMAP.
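
Roughly, the two approaches look like this in Kconfig terms (the entries are
abbreviated for illustration; they are not the exact hunks from either
patch):

config FS_DAX
	bool "Direct Access (DAX) support"
	depends on MMU
	select FS_IOMAP			# this patch: DAX always pulls in iomap

config EXT4_FS
	tristate "The Extended 4 (ext4) filesystem"
	select FS_IOMAP if FS_DAX	# per-filesystem select, roughly what Jan's patch does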

I'll pull your most recent patch into my baseline & rework this patch.


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Jason Gunthorpe
On Mon, Nov 28, 2016 at 06:19:40PM +, Haggai Eran wrote:
> > > GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> > > user-space and the GPU not to migrate it. If they do, the MR gets
> > > destroyed immediately.
> > That sounds horrible. How can that possibly work? What if the MR is
> > being used when the GPU decides to migrate? 
> Naturally this doesn't support migration. The GPU is expected to pin
> these pages as long as the MR lives. The MR invalidation is done only as
> a last resort to keep system correctness.

That just forces applications to handle horrible unexpected
failures. If this sort of thing is needed for correctness then OOM
kill the offending process, don't corrupt its operation.

> I think it is similar to how non-ODP MRs rely on user-space today to
> keep them correct. If you do something like madvise(MADV_DONTNEED) on a
> non-ODP MR's pages, you can still get yourself into a data corruption
> situation (HCA sees one page and the process sees another for the same
> virtual address). The pinning that we use only guarantees the HCA's page
> won't be reused.

That is not really data corruption - the data still goes where it was
originally destined. That is an application violating the
requirements of a MR. An application cannot munmap/mremap a VMA
while a non ODP MR points to it and then keep using the MR.

That is totally different from a GPU driver wanting to mess with
translation to physical pages.

> > From what I understand we are not really talking about kernel p2p,
> > everything proposed so far is being mediated by a userspace VMA, so
> > I'd focus on making that work.

> Fair enough, although we will need both eventually, and I hope the
> infrastructure can be shared to some degree.

What use case do you see for in kernel?

Presumably in-kernel could use a vmap or something and the same basic
flow?

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Haggai Eran
On Mon, 2016-11-28 at 09:48 -0500, Serguei Sagalovitch wrote:
> On 2016-11-27 09:02 AM, Haggai Eran wrote:
> > 
> > On PeerDirect, we have some kind of a middle-ground solution for pinning
> > GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> > user-space and the GPU not to migrate it. If they do, the MR gets
> > destroyed immediately. This should work on legacy devices without ODP
> > support, and allows the system to safely terminate a process that
> > misbehaves. The downside of course is that it cannot transparently
> > migrate memory but I think for user-space RDMA doing that transparently
> > requires hardware support for paging, via something like HMM.
> > 
> > ...
> May be I am wrong but my understanding is that PeerDirect logic basically
> follow "RDMA register MR" logic 
Yes. The only difference from regular MRs is the invalidation process I
mentioned, and the fact that we get the addresses not from
get_user_pages but from a peer driver.

> so basically nothing prevent to "terminate"
> process for "MMU notifier" case when we are very low on memory
> not making it similar (not worse) then PeerDirect case.
I'm not sure I understand. I don't think any solution prevents
terminating an application. The paragraph above is just trying to
explain how a non-ODP device/MR can handle an invalidation.

> > > I'm hearing most people say ZONE_DEVICE is the way to handle this,
> > > which means the missing remaing piece for RDMA is some kind of DMA
> > > core support for p2p address translation..
> > Yes, this is definitely something we need. I think Will Davis's
> > patches
> > are a good start.
> > 
> > Another thing I think is that while HMM is good for user-space
> > applications, for kernel p2p use there is no need for that.
> About HMM: I do not think that in the current form HMM would  fit in
> requirement for generic P2P transfer case. My understanding is that at
> the current stage HMM is good for "caching" system memory
> in device memory for fast GPU access but in RDMA MR non-ODP case
> it will not work because  the location of memory should not be
> changed so memory should be allocated directly in PCIe memory.
The way I see it there are two ways to handle non-ODP MRs. Either you
prevent the GPU from migrating / reusing the MR's VRAM pages for as long
as the MR is alive (if I understand correctly you didn't like this
solution), or you allow the GPU to somehow notify the HCA to invalidate
the MR. If you do that, you can use mmu notifiers or HMM or something
else, but HMM provides a nice framework to facilitate that notification.
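
As an illustration of that second option, the notification could be wired up
with a plain mmu_notifier on the HCA side: when the GPU driver migrates the
VRAM backing a userspace VMA it has to update the CPU page tables, which
fires the notifier, and the callback tears the MR down.  This is only a
sketch -- peer_mr and rdma_invalidate_mr() are hypothetical names, not an
existing API:

#include <linux/mmu_notifier.h>

void rdma_invalidate_mr(void *mr);	/* hypothetical last-resort teardown */

struct peer_mr {
	struct mmu_notifier	mn;
	unsigned long		start, end;	/* VA range covered by the MR */
	void			*mr;		/* HCA driver's MR handle */
};

static void peer_mr_invalidate_range_start(struct mmu_notifier *mn,
		struct mm_struct *mm, unsigned long start, unsigned long end)
{
	struct peer_mr *pmr = container_of(mn, struct peer_mr, mn);

	/* Only react if the invalidated range overlaps the MR. */
	if (start < pmr->end && end > pmr->start)
		rdma_invalidate_mr(pmr->mr);
}

static const struct mmu_notifier_ops peer_mr_mn_ops = {
	.invalidate_range_start	= peer_mr_invalidate_range_start,
};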

> > 
> > Using ZONE_DEVICE with or without something like DMA-BUF to pin and
> > unpin
> > pages for the short duration as you wrote above could work fine for
> > kernel uses in which we can guarantee they are short.
> Potentially there is another issue related to pin/unpin. If memory could
> be used a lot of time then there is no sense to rebuild and program
> s/g tables each time if location of memory was not changed.
Is this about the kernel use or user-space? In user-space I think the MR
concept captures a long-lived s/g table so you don't need to rebuild it
(unless the mapping changes).

Haggai


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Haggai Eran
On Mon, 2016-11-28 at 09:57 -0700, Jason Gunthorpe wrote:
> On Sun, Nov 27, 2016 at 04:02:16PM +0200, Haggai Eran wrote:
> > I think blocking mmu notifiers against something that is basically
> > controlled by user-space can be problematic. This can block things like
> > memory reclaim. If you have user-space access to the device's queues,
> > user-space can block the mmu notifier forever.
> Right, I mentioned that..
Sorry, I must have missed it.

> > On PeerDirect, we have some kind of a middle-ground solution for pinning
> > GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> > user-space and the GPU not to migrate it. If they do, the MR gets
> > destroyed immediately.
> That sounds horrible. How can that possibly work? What if the MR is
> being used when the GPU decides to migrate? 
Naturally this doesn't support migration. The GPU is expected to pin
these pages as long as the MR lives. The MR invalidation is done only as
a last resort to keep system correctness.

I think it is similar to how non-ODP MRs rely on user-space today to
keep them correct. If you do something like madvise(MADV_DONTNEED) on a
non-ODP MR's pages, you can still get yourself into a data corruption
situation (HCA sees one page and the process sees another for the same
virtual address). The pinning that we use only guarantees the HCA's page
won't be reused.

> I would not support that
> upstream without a lot more explanation..
> 
> I know people don't like requiring new hardware, but in this case we
> really do need ODP hardware to get all the semantics people want..
> 
> > 
> > Another thing I think is that while HMM is good for user-space
> > applications, for kernel p2p use there is no need for that. Using
> From what I understand we are not really talking about kernel p2p,
> everything proposed so far is being mediated by a userspace VMA, so
> I'd focus on making that work.
Fair enough, although we will need both eventually, and I hope the
infrastructure can be shared to some degree.


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Jason Gunthorpe
On Sun, Nov 27, 2016 at 04:02:16PM +0200, Haggai Eran wrote:

> > Like in ODP, MMU notifiers/HMM are used to monitor for translation
> > changes. If a change comes in the GPU driver checks if an executing
> > command is touching those pages and blocks the MMU notifier until the
> > command flushes, then unfaults the page (blocking future commands) and
> > unblocks the mmu notifier.

> I think blocking mmu notifiers against something that is basically
> controlled by user-space can be problematic. This can block things like
> memory reclaim. If you have user-space access to the device's queues,
> user-space can block the mmu notifier forever.

Right, I mentioned that..

> On PeerDirect, we have some kind of a middle-ground solution for pinning
> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> user-space and the GPU not to migrate it. If they do, the MR gets
> destroyed immediately.

That sounds horrible. How can that possibly work? What if the MR is
being used when the GPU decides to migrate? I would not support that
upstream without a lot more explanation..

I know people don't like requiring new hardware, but in this case we
really do need ODP hardware to get all the semantics people want..

> Another thing I think is that while HMM is good for user-space
> applications, for kernel p2p use there is no need for that. Using

From what I understand we are not really talking about kernel p2p,
everything proposed so far is being mediated by a userspace VMA, so
I'd focus on making that work.

Jason


Re: [PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing

2016-11-28 Thread Dave Chinner
On Sun, Nov 27, 2016 at 04:58:43PM -0800, Linus Torvalds wrote:
> On Sun, Nov 27, 2016 at 2:42 PM, Dave Chinner  wrote:
> >
> > And that's exactly why we need a method of marking tracepoints as
> > stable. How else are we going to know whether a specific tracepoint
> > is stable if the kernel code doesn't document that it's stable?
> 
> You are living in some unrealistic dream-world where you think you can
> get the right tracepoint on the first try.

No, I'm under no illusions that we'd get stable tracepoints right
the first go. I don't care about how we stabilise stable
tracepoints, because nothing I maintain will use stable tracepoints.
However, I will point out that we have /already solved these ABI
development problems/.

$ ls Documentation/ABI/
obsolete  README  removed  stable  testing
$

Expand this to include stable tracepoints, not just sysfs files.
New stable stuff gets classified as "testing", meaning it is supposed
to be stable but may change before being declared officially stable.
"stable" is obvious, as are "obsolete" and "removed".


> So there is no way in hell I would ever mark any tracepoint "stable"
> until it has had a fair amount of use, and there are useful tools that
> actually make use of it, and it has shown itself to be the right
> trace-point.
> 
> And once that actually happens, what's the advantage of marking it
> stable? None. It's a catch-22. Before it has uses and has been tested
> and found to be good, it's not stable. And after, it's pointless.

And that "catch-22" is *precisely the problem we need to solve*.

Pointing out that there's a catch-22 doesn't help anyone - all the
developers that are telling you that they really need a way to mark
stable tracepoints already understand this catch-22 and they want a
way to avoid it.  Being able to either say "this is stable and we'll
support it forever" or "this will never be stable so use at your own
risk" is a simple way of avoiding the catch-22. If an unstable
tracepoint is useful to applications *and it can be implemented in a
maintainable, stable form* then it can go through the process of
being made stable and documented in Documentation/ABI/stable.

Problem is solved, catch-22 is gone.

All we want is some method of making a clear, unambiguous statement
about the nature of a specific tracepoint and a process for
transitioning a tracepoint to a stable, maintainable form. We do it
for other ABI interfaces, so why can't we do this for tracepoints?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing

2016-11-28 Thread Jan Kara
On Fri 25-11-16 16:48:40, Ted Tso wrote:
> On Fri, Nov 25, 2016 at 11:51:26AM -0800, Linus Torvalds wrote:
> > We do have filesystem code that is just disgusting. As an example:
> > fs/afs/ tends to have these crazy "_enter()/_exit()" macros in every
> > single function. If you want that, use the function tracer. That seems
> > to be just debugging code that has been left around for others to
> > stumble over. I do *not* believe that we should encourage that kind of
> > "machine gun spray" use of tracepoints.
> 
> There is a reason why people want to be able to do that, and that's
> because kprobes doesn't give you access to the arguments and return
> codes to the functions.  Maybe there could be a way to do this more
> easily using DWARF information and EBPF magic, perhaps?  It won't help
> for inlined functions, of course, but most of the functions where
> people want to do this aren't generally functions which are going to
> be inlined, but rather things like write_begin, writepages, which are
> called via a struct ops table and so will never be inlined to begin
> with.

Actually, you can print register & stack contents from a kprobe and you can
get a function return value from a kretprobe (see
Documentation/trace/kprobetrace.txt). Since calling convention is fixed
(arg 1 in RDI, arg 2 in RSI...) you can relatively easily dump function
arguments on entry and dump return value on return for arbitrary function
of your choice. I was already debugging issues like that several times (in
VFS actually because of missing trace points ;)). You can even create a
kprobe to dump register contents in the middle of the function (although
there it takes more effort reading the disassembly to see what you are
interested in).
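
For example (the probed function and event names here are just an
illustration), dumping the first two arguments and the return value of
ext4_writepages via kprobe events looks roughly like:

  # echo 'p:ext4_wp ext4_writepages mapping=%di wbc=%si' > /sys/kernel/debug/tracing/kprobe_events
  # echo 'r:ext4_wp_ret ext4_writepages ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events
  # echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
  # cat /sys/kernel/debug/tracing/trace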

Honza
-- 
Jan Kara 
SUSE Labs, CR