RE: [ndctl PATCH 2/2] ndctl, monitor: in daemon mode, exit successfully if no DIMMs are found

2018-10-04 Thread Qi, Fuli
> -Original Message-
> From: Vishal Verma [mailto:vishal.l.ve...@intel.com]
> Sent: Friday, October 5, 2018 9:01 AM
> To: linux-nvdimm@lists.01.org
> Cc: Qi, Fuli/斉 福利 ; Andreas Hasenack
> ; Vishal Verma ; Dan
> Williams 
> Subject: [ndctl PATCH 2/2] ndctl, monitor: in daemon mode, exit successfully if no DIMMs are found
> 
> When we are running as a daemon, it is preferred to exit successfully when
> no DIMMs are found. When running in the foreground, retain an error return
> so that the user can be notified that nothing was found to monitor.
> 
> In the longer term, replace this with a libudev uevent monitor that will
> look for DIMM addition/removal events, and update the running monitor(s)
> if the DIMMs match the active filter spec.
> 
> Link: https://bugs.launchpad.net/ubuntu/+source/ndctl/+bug/1781268
> Reported-by: Andreas Hasenack 
> Cc: Dan Williams 
> Cc: QI Fuli 
> Signed-off-by: Vishal Verma 
> ---
>  ndctl/monitor.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/ndctl/monitor.c b/ndctl/monitor.c
> index b44f946..7bf2ba7 100644
> --- a/ndctl/monitor.c
> +++ b/ndctl/monitor.c
> @@ -681,8 +681,9 @@ int cmd_monitor(int argc, const char **argv, void *ctx)
>   goto out;
> 
>   if (!mfa.num_dimm) {
> - err((struct ndctl_ctx *)ctx, "no dimms to monitor\n");
> - rc = -ENXIO;
> + dbg((struct ndctl_ctx *)ctx, "no dimms to monitor\n");
> + if (!monitor.daemon)
> + rc = -ENXIO;
>   goto out;
>   }
> 
> --
> 2.17.1
> 

Looks good to me.

Thank you very much,
QI Fuli

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


RE: [ndctl PATCH 1/2] ndctl, monitor: fix the severity of the "daemon started" message

2018-10-04 Thread Qi, Fuli
> -Original Message-
> From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of Dan
> Williams
> Sent: Friday, October 5, 2018 1:24 PM
> To: Gotou, Yasunori/五島 康文 
> Cc: andr...@canonical.com; linux-nvdimm 
> Subject: Re: [ndctl PATCH 1/2] ndctl, monitor: fix the severity of the "daemon
> started" message
> 
> On Thu, Oct 4, 2018 at 9:09 PM Yasunori Goto  wrote:
> >
> > Hi, Vishal-san,
> >
> > > The above message was printed as an error, but it is just an
> > > informational message. Change it to dbg().
> >
> > Hmmm.
> >
> > When I was an engineer troubleshooting customers' Linux systems,
> > the starting time and the ending time of any daemon were very helpful
> > for investigating their trouble.
> >
> > So, I think it is not only for developers; it is also essential for
> > trouble-shooters doing customer support.
> > (The same may apply to system operators.)
> >
> >
> > >
> > > Cc: QI Fuli 
> > > Cc: Dan Williams 
> > > Signed-off-by: Vishal Verma 
> > > ---
> > >  ndctl/monitor.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/ndctl/monitor.c b/ndctl/monitor.c
> > > index d29e378..b44f946 100644
> > > --- a/ndctl/monitor.c
> > > +++ b/ndctl/monitor.c
> > > @@ -660,7 +660,7 @@ int cmd_monitor(int argc, const char **argv, void *ctx)
> > >   err((struct ndctl_ctx *)ctx, "daemon start failed\n");
> > >   goto out;
> > >   }
> > > - err((struct ndctl_ctx *)ctx, "ndctl monitor daemon started\n");
> > > + dbg((struct ndctl_ctx *)ctx, "ndctl monitor daemon started\n");
> >
> > Though I agree this message is certainly not an "error", I would like to
> > keep it as a normal status-level message.
> >
> 
> Sounds good to me.

In my original idea, I just wanted to log the "daemon started" message.
After thinking more about it, I agree that the message is not at "error" level.
However, considering this message will be helpful for some users like system
operators, how about changing this message level to "info" and meanwhile
changing the default log level to "LOG_INFO" as follows?

diff --git a/ndctl/monitor.c b/ndctl/monitor.c
index d29e378..10866b6 100644
--- a/ndctl/monitor.c
+++ b/ndctl/monitor.c
@@ -628,7 +628,7 @@ int cmd_monitor(int argc, const char **argv, void *ctx)
if (monitor.verbose)
ndctl_set_log_priority((struct ndctl_ctx *)ctx, LOG_DEBUG);
else
-   ndctl_set_log_priority((struct ndctl_ctx *)ctx, LOG_NOTICE);
+   ndctl_set_log_priority((struct ndctl_ctx *)ctx, LOG_INFO);

rc = read_config_file((struct ndctl_ctx *)ctx, , );
if (rc)
@@ -657,7 +657,7 @@ int cmd_monitor(int argc, const char **argv, void *ctx)
if (!monitor.log || strncmp(monitor.log, "./", 2) == 0)
ndctl_set_log_fn((struct ndctl_ctx *)ctx, log_syslog);
if (daemon(0, 0) != 0) {
err((struct ndctl_ctx *)ctx, "daemon start failed\n");
goto out;
}
-   err((struct ndctl_ctx *)ctx, "ndctl monitor daemon started\n");
+   info((struct ndctl_ctx *)ctx, "ndctl monitor daemon started\n");

Thank you very much,
QI Fuli




Re: [PATCH] dax: Fix deadlock in dax_lock_mapping_entry()

2018-10-04 Thread Dan Williams
On Thu, Oct 4, 2018 at 9:01 PM Dan Williams  wrote:
>
> On Thu, Oct 4, 2018 at 7:52 PM Matthew Wilcox  wrote:
> >
> > On Thu, Oct 04, 2018 at 06:57:52PM -0700, Dan Williams wrote:
> > > On Thu, Oct 4, 2018 at 9:27 AM Jan Kara  wrote:
> > > >
> > > > On Thu 27-09-18 11:22:22, Dan Williams wrote:
> > > > > On Thu, Sep 27, 2018 at 6:41 AM Jan Kara  wrote:
> > > > > >
> > > > > > On Thu 27-09-18 06:28:43, Matthew Wilcox wrote:
> > > > > > > On Thu, Sep 27, 2018 at 01:23:32PM +0200, Jan Kara wrote:
> > > > > > > > When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
> > > > > > > > fail to unlock mapping->i_pages spinlock and thus immediately deadlock
> > > > > > > > against itself when retrying to grab the entry lock again. Fix the
> > > > > > > > problem by unlocking mapping->i_pages before retrying.
> > > > > > >
> > > > > > > It seems weird that xfstests doesn't provoke this ...
> > > > > >
> > > > > > The function currently gets called only from mm/memory-failure.c. And yes,
> > > > > > we are lacking DAX hwpoison error tests in fstests...
> > > > >
> > > > > I have an item on my backlog to port the ndctl unit test that does
> > > > > memory_failure() injection vs ext4 over to fstests. That said I've
> > > > > been investigating a deadlock on ext4 caused by this test. When I saw
> > > > > this patch I hoped it was root cause, but the test is still failing
> > > > > for me. Vishal is able to pass the test on his system, so the failure
> > > > > mode is timing dependent. I'm running this patch on top of -rc5 and
> > > > > still seeing the following deadlock.
> > > >
> > > > I went through the code but I don't see where the problem could be. How can
> > > > I run that test? Is KVM enough or do I need hardware with AEP dimms?
> > >
> > > KVM is enough... however, I have found a hack that makes the test pass:
> > >
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 52517f28e6f4..d7f035b1846e 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -1668,6 +1668,9 @@ unsigned find_get_entries(struct address_space *mapping,
> > > goto repeat;
> > > }
> > >  export:
> > > +   if (iter.index < start)
> > > +   continue;
> > > +
> > > indices[ret] = iter.index;
> > > entries[ret] = page;
> > > if (++ret == nr_entries)
> > >
> > > Is this a radix bug? I would never expect:
> > >
> > > radix_tree_for_each_slot(slot, &mapping->i_pages, &iter, start)
> > >
> > > ...to return entries with an index < start. Without that change above
> > > we loop forever because dax_layout_busy_page() can't make forward
> > > progress. I'll dig into the radix code tomorrow, but maybe someone
> > > else will beat me to it overnight.
> >
> > If 'start' is within a 2MB entry, iter.index can absolutely be less
> > than start.  I forget exactly what the radix tree code does, but I think
> > it returns index set to the canonical/base index of the entry.
>
> Ok, that makes sense. Then the bug is in dax_layout_busy_page() which
> needs to increment 'index' by the entry size. This might also explain
> why not every run sees it because you may get lucky and have a 4K
> entry.

Hmm, no 2MB entry here.

We go through the first find_get_entries and export:

export start: 0x0 index: 0x0 page: 0x822000a
export start: 0x0 index: 0x200 page: 0xcc3801a

Then dax_layout_busy_page sets 'start' to 0x201, and find_get_entries returns:

export start: 0x201 index: 0x200 page: 0xcc3801a

...forevermore.


Re: [ndctl PATCH 1/2] ndctl, monitor: fix the severity of the "daemon started" message

2018-10-04 Thread Dan Williams
On Thu, Oct 4, 2018 at 9:09 PM Yasunori Goto  wrote:
>
> Hi, Vishal-san,
>
> > The above message was printed as an error, but it is just an
> > informational message. Change it to dbg().
>
> Hmmm.
>
> When I was an engineer troubleshooting customers' Linux systems,
> the starting time and the ending time of any daemon were very helpful
> for investigating their trouble.
>
> So, I think it is not only for developers; it is also essential for
> trouble-shooters doing customer support.
> (The same may apply to system operators.)
>
>
> >
> > Cc: QI Fuli 
> > Cc: Dan Williams 
> > Signed-off-by: Vishal Verma 
> > ---
> >  ndctl/monitor.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/ndctl/monitor.c b/ndctl/monitor.c
> > index d29e378..b44f946 100644
> > --- a/ndctl/monitor.c
> > +++ b/ndctl/monitor.c
> > @@ -660,7 +660,7 @@ int cmd_monitor(int argc, const char **argv, void *ctx)
> >   err((struct ndctl_ctx *)ctx, "daemon start failed\n");
> >   goto out;
> >   }
> > - err((struct ndctl_ctx *)ctx, "ndctl monitor daemon started\n");
> > + dbg((struct ndctl_ctx *)ctx, "ndctl monitor daemon started\n");
>
> Though I agree this message is certainly not an "error",
> I would like to keep it as a normal status-level message.
>

Sounds good to me.


Re: [ndctl PATCH 1/2] ndctl, monitor: fix the severity of the "daemon started" message

2018-10-04 Thread Yasunori Goto
Hi, Vishal-san,

> The above message was printed as an error, but it is just an
> informational message. Change it to dbg().

Hmmm.

When I was an engineer troubleshooting customers' Linux systems,
the starting time and the ending time of any daemon were very helpful
for investigating their trouble.

So, I think it is not only for developers; it is also essential for
trouble-shooters doing customer support.
(The same may apply to system operators.)


> 
> Cc: QI Fuli 
> Cc: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  ndctl/monitor.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/ndctl/monitor.c b/ndctl/monitor.c
> index d29e378..b44f946 100644
> --- a/ndctl/monitor.c
> +++ b/ndctl/monitor.c
> @@ -660,7 +660,7 @@ int cmd_monitor(int argc, const char **argv, void *ctx)
>   err((struct ndctl_ctx *)ctx, "daemon start failed\n");
>   goto out;
>   }
> - err((struct ndctl_ctx *)ctx, "ndctl monitor daemon started\n");
> + dbg((struct ndctl_ctx *)ctx, "ndctl monitor daemon started\n");

Though I agree this message is certainly not an "error",
I would like to keep it as a normal status-level message.

Thanks,
---
Yasunori Goto





Re: [PATCH] dax: Fix deadlock in dax_lock_mapping_entry()

2018-10-04 Thread Dan Williams
On Thu, Oct 4, 2018 at 7:52 PM Matthew Wilcox  wrote:
>
> On Thu, Oct 04, 2018 at 06:57:52PM -0700, Dan Williams wrote:
> > On Thu, Oct 4, 2018 at 9:27 AM Jan Kara  wrote:
> > >
> > > On Thu 27-09-18 11:22:22, Dan Williams wrote:
> > > > On Thu, Sep 27, 2018 at 6:41 AM Jan Kara  wrote:
> > > > >
> > > > > On Thu 27-09-18 06:28:43, Matthew Wilcox wrote:
> > > > > > On Thu, Sep 27, 2018 at 01:23:32PM +0200, Jan Kara wrote:
> > > > > > > When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
> > > > > > > fail to unlock mapping->i_pages spinlock and thus immediately deadlock
> > > > > > > against itself when retrying to grab the entry lock again. Fix the
> > > > > > > problem by unlocking mapping->i_pages before retrying.
> > > > > >
> > > > > > It seems weird that xfstests doesn't provoke this ...
> > > > >
> > > > > The function currently gets called only from mm/memory-failure.c. And yes,
> > > > > we are lacking DAX hwpoison error tests in fstests...
> > > >
> > > > I have an item on my backlog to port the ndctl unit test that does
> > > > memory_failure() injection vs ext4 over to fstests. That said I've
> > > > been investigating a deadlock on ext4 caused by this test. When I saw
> > > > this patch I hoped it was root cause, but the test is still failing
> > > > for me. Vishal is able to pass the test on his system, so the failure
> > > > mode is timing dependent. I'm running this patch on top of -rc5 and
> > > > still seeing the following deadlock.
> > >
> > > I went through the code but I don't see where the problem could be. How can
> > > I run that test? Is KVM enough or do I need hardware with AEP dimms?
> >
> > KVM is enough... however, I have found a hack that makes the test pass:
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 52517f28e6f4..d7f035b1846e 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1668,6 +1668,9 @@ unsigned find_get_entries(struct address_space *mapping,
> > goto repeat;
> > }
> >  export:
> > +   if (iter.index < start)
> > +   continue;
> > +
> > indices[ret] = iter.index;
> > entries[ret] = page;
> > if (++ret == nr_entries)
> >
> > Is this a radix bug? I would never expect:
> >
> > radix_tree_for_each_slot(slot, &mapping->i_pages, &iter, start)
> >
> > ...to return entries with an index < start. Without that change above
> > we loop forever because dax_layout_busy_page() can't make forward
> > progress. I'll dig into the radix code tomorrow, but maybe someone
> > else will beat me to it overnight.
>
> If 'start' is within a 2MB entry, iter.index can absolutely be less
> than start.  I forget exactly what the radix tree code does, but I think
> it returns index set to the canonical/base index of the entry.

Ok, that makes sense. Then the bug is in dax_layout_busy_page() which
needs to increment 'index' by the entry size. This might also explain
why not every run sees it because you may get lucky and have a 4K
entry.


Re: [PATCH] dax: Fix deadlock in dax_lock_mapping_entry()

2018-10-04 Thread Matthew Wilcox
On Thu, Oct 04, 2018 at 06:57:52PM -0700, Dan Williams wrote:
> On Thu, Oct 4, 2018 at 9:27 AM Jan Kara  wrote:
> >
> > On Thu 27-09-18 11:22:22, Dan Williams wrote:
> > > On Thu, Sep 27, 2018 at 6:41 AM Jan Kara  wrote:
> > > >
> > > > On Thu 27-09-18 06:28:43, Matthew Wilcox wrote:
> > > > > On Thu, Sep 27, 2018 at 01:23:32PM +0200, Jan Kara wrote:
> > > > > > When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
> > > > > > fail to unlock mapping->i_pages spinlock and thus immediately deadlock
> > > > > > against itself when retrying to grab the entry lock again. Fix the
> > > > > > problem by unlocking mapping->i_pages before retrying.
> > > > >
> > > > > It seems weird that xfstests doesn't provoke this ...
> > > >
> > > > The function currently gets called only from mm/memory-failure.c. And yes,
> > > > we are lacking DAX hwpoison error tests in fstests...
> > >
> > > I have an item on my backlog to port the ndctl unit test that does
> > > memory_failure() injection vs ext4 over to fstests. That said I've
> > > been investigating a deadlock on ext4 caused by this test. When I saw
> > > this patch I hoped it was root cause, but the test is still failing
> > > for me. Vishal is able to pass the test on his system, so the failure
> > > mode is timing dependent. I'm running this patch on top of -rc5 and
> > > still seeing the following deadlock.
> >
> > I went through the code but I don't see where the problem could be. How can
> > I run that test? Is KVM enough or do I need hardware with AEP dimms?
> 
> KVM is enough... however, I have found a hack that makes the test pass:
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 52517f28e6f4..d7f035b1846e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1668,6 +1668,9 @@ unsigned find_get_entries(struct address_space *mapping,
> goto repeat;
> }
>  export:
> +   if (iter.index < start)
> +   continue;
> +
> indices[ret] = iter.index;
> entries[ret] = page;
> if (++ret == nr_entries)
> 
> Is this a radix bug? I would never expect:
> 
> radix_tree_for_each_slot(slot, &mapping->i_pages, &iter, start)
> 
> ...to return entries with an index < start. Without that change above
> we loop forever because dax_layout_busy_page() can't make forward
> progress. I'll dig into the radix code tomorrow, but maybe someone
> else will beat me to it overnight.

If 'start' is within a 2MB entry, iter.index can absolutely be less
than start.  I forget exactly what the radix tree code does, but I think
it returns index set to the canonical/base index of the entry.


Re: [PATCH] dax: Fix deadlock in dax_lock_mapping_entry()

2018-10-04 Thread Dan Williams
On Thu, Oct 4, 2018 at 9:27 AM Jan Kara  wrote:
>
> On Thu 27-09-18 11:22:22, Dan Williams wrote:
> > On Thu, Sep 27, 2018 at 6:41 AM Jan Kara  wrote:
> > >
> > > On Thu 27-09-18 06:28:43, Matthew Wilcox wrote:
> > > > On Thu, Sep 27, 2018 at 01:23:32PM +0200, Jan Kara wrote:
> > > > > When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
> > > > > fail to unlock mapping->i_pages spinlock and thus immediately deadlock
> > > > > against itself when retrying to grab the entry lock again. Fix the
> > > > > problem by unlocking mapping->i_pages before retrying.
> > > >
> > > > It seems weird that xfstests doesn't provoke this ...
> > >
> > > The function currently gets called only from mm/memory-failure.c. And yes,
> > > we are lacking DAX hwpoison error tests in fstests...
> >
> > I have an item on my backlog to port the ndctl unit test that does
> > memory_failure() injection vs ext4 over to fstests. That said I've
> > been investigating a deadlock on ext4 caused by this test. When I saw
> > this patch I hoped it was root cause, but the test is still failing
> > for me. Vishal is able to pass the test on his system, so the failure
> > mode is timing dependent. I'm running this patch on top of -rc5 and
> > still seeing the following deadlock.
>
> I went through the code but I don't see where the problem could be. How can
> I run that test? Is KVM enough or do I need hardware with AEP dimms?

KVM is enough... however, I have found a hack that makes the test pass:

diff --git a/mm/filemap.c b/mm/filemap.c
index 52517f28e6f4..d7f035b1846e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1668,6 +1668,9 @@ unsigned find_get_entries(struct address_space *mapping,
goto repeat;
}
 export:
+   if (iter.index < start)
+   continue;
+
indices[ret] = iter.index;
entries[ret] = page;
if (++ret == nr_entries)

Is this a radix bug? I would never expect:

radix_tree_for_each_slot(slot, &mapping->i_pages, &iter, start)

...to return entries with an index < start. Without that change above
we loop forever because dax_layout_busy_page() can't make forward
progress. I'll dig into the radix code tomorrow, but maybe someone
else will beat me to it overnight.


Re: [PATCH] libnvdimm, pmem: Fix badblocks population for 'raw' namespaces

2018-10-04 Thread Verma, Vishal L


On Thu, 2018-10-04 at 16:56 -0700, Dan Williams wrote:
> The driver is only initializing bb_res in the devm_memremap_pages()
> paths, but the raw namespace case is passing an uninitialized bb_res to
> nvdimm_badblocks_populate().
> 
> Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
> Cc: 
> Cc: Christoph Hellwig 
> Reported-by: Jacek Zloch 
> Reported-by: Krzysztof Rusocki 
> Signed-off-by: Dan Williams 
> ---
>  drivers/nvdimm/pmem.c |4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 6071e2942053..2082ae01b9c8 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -421,9 +421,11 @@ static int pmem_attach_disk(struct device *dev,
>   addr = devm_memremap_pages(dev, &pmem->pgmap);
>   pmem->pfn_flags |= PFN_MAP;
>   memcpy(&bb_res, &pmem->pgmap.res, sizeof(bb_res));
> - } else {
> + } else {
>   addr = devm_memremap(dev, pmem->phys_addr,
>   pmem->size, ARCH_MEMREMAP_PMEM);
> + memcpy(&bb_res, &nsio->res, sizeof(bb_res));
> + }


Good find!
Reviewed-by: Vishal Verma 


[PATCH] libnvdimm, pmem: Fix badblocks population for 'raw' namespaces

2018-10-04 Thread Dan Williams
The driver is only initializing bb_res in the devm_memremap_pages()
paths, but the raw namespace case is passing an uninitialized bb_res to
nvdimm_badblocks_populate().

Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
Cc: 
Cc: Christoph Hellwig 
Reported-by: Jacek Zloch 
Reported-by: Krzysztof Rusocki 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/pmem.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 6071e2942053..2082ae01b9c8 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -421,9 +421,11 @@ static int pmem_attach_disk(struct device *dev,
addr = devm_memremap_pages(dev, &pmem->pgmap);
pmem->pfn_flags |= PFN_MAP;
memcpy(&bb_res, &pmem->pgmap.res, sizeof(bb_res));
-   } else
+   } else {
addr = devm_memremap(dev, pmem->phys_addr,
pmem->size, ARCH_MEMREMAP_PMEM);
+   memcpy(&bb_res, &nsio->res, sizeof(bb_res));
+   }
 
/*
 * At release time the queue must be frozen before



Re: [ndctl PATCH 1/2] ndctl, monitor: fix the severity of the "daemon started" message

2018-10-04 Thread Williams, Dan J
On Thu, 2018-10-04 at 18:00 -0600, Vishal Verma wrote:
> The above message was printed as an error, but it is just an
> informational message. Change it to dbg().
> 
> Cc: QI Fuli 
> Cc: Dan Williams 

Reviewed-by: Dan Williams 



Re: [ndctl PATCH 2/2] ndctl, monitor: in daemon mode, exit successfully if no DIMMs are found

2018-10-04 Thread Williams, Dan J
On Thu, 2018-10-04 at 18:00 -0600, Vishal Verma wrote:
> When we are running as a daemon, it is preferred to exit successfully
> when no DIMMs are found. When running in the foreground, retain an error
> return so that the user can be notified that nothing was found to
> monitor.
> 
> In the longer term, replace this with a libudev uevent monitor that will
> look for DIMM addition/removal events, and update the running monitor(s)
> if the DIMMs match the active filter spec.
> 
> Link: https://bugs.launchpad.net/ubuntu/+source/ndctl/+bug/1781268
> Reported-by: Andreas Hasenack 
> Cc: Dan Williams 

Looks good to me; if no one is sticking around to reap the error, we
might as well claim success.

Reviewed-by: Dan Williams 



[ndctl PATCH 1/2] ndctl, monitor: fix the severity of the "daemon started" message

2018-10-04 Thread Vishal Verma
The above message was printed as an error, but it is just an
informational message. Change it to dbg().

Cc: QI Fuli 
Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 ndctl/monitor.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ndctl/monitor.c b/ndctl/monitor.c
index d29e378..b44f946 100644
--- a/ndctl/monitor.c
+++ b/ndctl/monitor.c
@@ -660,7 +660,7 @@ int cmd_monitor(int argc, const char **argv, void *ctx)
err((struct ndctl_ctx *)ctx, "daemon start failed\n");
goto out;
}
-   err((struct ndctl_ctx *)ctx, "ndctl monitor daemon started\n");
+   dbg((struct ndctl_ctx *)ctx, "ndctl monitor daemon started\n");
}
 
if (parse_monitor_event(, (struct ndctl_ctx *)ctx))
-- 
2.17.1



[ndctl PATCH 2/2] ndctl, monitor: in daemon mode, exit successfully if no DIMMs are found

2018-10-04 Thread Vishal Verma
When we are running as a daemon, it is preferred to exit successfully
when no DIMMs are found. When running in the foreground, retain an error
return so that the user can be notified that nothing was found to
monitor.

In the longer term, replace this with a libudev uevent monitor that will
look for DIMM addition/removal events, and update the running monitor(s)
if the DIMMs match the active filter spec.

Link: https://bugs.launchpad.net/ubuntu/+source/ndctl/+bug/1781268
Reported-by: Andreas Hasenack 
Cc: Dan Williams 
Cc: QI Fuli 
Signed-off-by: Vishal Verma 
---
 ndctl/monitor.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/ndctl/monitor.c b/ndctl/monitor.c
index b44f946..7bf2ba7 100644
--- a/ndctl/monitor.c
+++ b/ndctl/monitor.c
@@ -681,8 +681,9 @@ int cmd_monitor(int argc, const char **argv, void *ctx)
goto out;
 
if (!mfa.num_dimm) {
-   err((struct ndctl_ctx *)ctx, "no dimms to monitor\n");
-   rc = -ENXIO;
+   dbg((struct ndctl_ctx *)ctx, "no dimms to monitor\n");
+   if (!monitor.daemon)
+   rc = -ENXIO;
goto out;
}
 
-- 
2.17.1



Re: [ndctl PATCH v2] libndctl: set errno for routines that don't return an error status

2018-10-04 Thread Williams, Dan J
On Thu, 2018-10-04 at 17:17 -0600, Vishal Verma wrote:
> For routines that return a UINT_MAX or UL{L}ONG_MAX, there isn't a way
> to get any information as to what went wrong. Set errno in such routines
> so that the callers can get some additional context about the error.
> 
> Reported-by: Lukasz Dorau 
> Cc: Dan Williams 

Reviewed-by: Dan Williams 



[ndctl PATCH v2] libndctl: set errno for routines that don't return an error status

2018-10-04 Thread Vishal Verma
For routines that return a UINT_MAX or UL{L}ONG_MAX, there isn't a way
to get any information as to what went wrong. Set errno in such routines
so that the callers can get some additional context about the error.

Reported-by: Lukasz Dorau 
Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 ndctl/lib/dimm.c |  8 +++--
 ndctl/lib/hpe1.c | 79 +---
 ndctl/lib/inject.c   |  2 ++
 ndctl/lib/intel.c| 45 +
 ndctl/lib/libndctl.c | 54 ++
 ndctl/lib/msft.c | 27 ---
 6 files changed, 177 insertions(+), 38 deletions(-)

v2: s/EOVERFLOW/ENOMEM/  (Dan)

diff --git a/ndctl/lib/dimm.c b/ndctl/lib/dimm.c
index b3e032e..5e41734 100644
--- a/ndctl/lib/dimm.c
+++ b/ndctl/lib/dimm.c
@@ -565,17 +565,21 @@ NDCTL_EXPORT unsigned long ndctl_dimm_get_available_labels(
 {
struct ndctl_ctx *ctx = ndctl_dimm_get_ctx(dimm);
char *path = dimm->dimm_buf;
-   int len = dimm->buf_len;
+   int rc, len = dimm->buf_len;
char buf[20];
 
if (snprintf(path, len, "%s/available_slots", dimm->dimm_path) >= len) {
err(ctx, "%s: buffer too small!\n",
ndctl_dimm_get_devname(dimm));
+   errno = ENOMEM;
return ULONG_MAX;
}
 
-   if (sysfs_read_attr(ctx, path, buf) < 0)
+   rc = sysfs_read_attr(ctx, path, buf);
+   if (rc < 0) {
+   errno = -rc;
return ULONG_MAX;
+   }
 
return strtoul(buf, NULL, 0);
 }
diff --git a/ndctl/lib/hpe1.c b/ndctl/lib/hpe1.c
index dbc1ff0..b26120e 100644
--- a/ndctl/lib/hpe1.c
+++ b/ndctl/lib/hpe1.c
@@ -90,9 +90,13 @@ static unsigned int hpe1_cmd_smart_get_flags(struct ndctl_cmd *cmd)
 {
unsigned int hpe1flags;
unsigned int flags;
+   int rc;
 
-   if (hpe1_smart_valid(cmd) < 0)
+   rc = hpe1_smart_valid(cmd);
+   if (rc < 0) {
+   errno = -rc;
return UINT_MAX;
+   }
 
hpe1flags = CMD_HPE1_SMART(cmd)->out_valid_flags;
flags = 0;
@@ -118,9 +122,13 @@ static unsigned int hpe1_cmd_smart_get_health(struct ndctl_cmd *cmd)
 {
unsigned char hpe1health;
unsigned int health;
+   int rc;
 
-   if (hpe1_smart_valid(cmd) < 0)
+   rc = hpe1_smart_valid(cmd);
+   if (rc < 0) {
+   errno = -rc;
return UINT_MAX;
+   }
 
hpe1health = CMD_HPE1_SMART(cmd)->stat_summary;
health = 0;
@@ -136,16 +144,26 @@ static unsigned int hpe1_cmd_smart_get_health(struct ndctl_cmd *cmd)
 
 static unsigned int hpe1_cmd_smart_get_media_temperature(struct ndctl_cmd *cmd)
 {
-   if (hpe1_smart_valid(cmd) < 0)
+   int rc;
+
+   rc = hpe1_smart_valid(cmd);
+   if (rc < 0) {
+   errno = -rc;
return UINT_MAX;
+   }
 
return CMD_HPE1_SMART(cmd)->curr_temp;
 }
 
 static unsigned int hpe1_cmd_smart_get_spares(struct ndctl_cmd *cmd)
 {
-   if (hpe1_smart_valid(cmd) < 0)
+   int rc;
+
+   rc = hpe1_smart_valid(cmd);
+   if (rc < 0) {
+   errno = -rc;
return UINT_MAX;
+   }
 
return CMD_HPE1_SMART(cmd)->spare_blocks;
 }
@@ -154,9 +172,13 @@ static unsigned int hpe1_cmd_smart_get_alarm_flags(struct ndctl_cmd *cmd)
 {
unsigned int hpe1flags;
unsigned int flags;
+   int rc;
 
-   if (hpe1_smart_valid(cmd) < 0)
+   rc = hpe1_smart_valid(cmd);
+   if (rc < 0) {
+   errno = -rc;
return UINT_MAX;
+   }
 
hpe1flags = CMD_HPE1_SMART(cmd)->alarm_trips;
flags = 0;
@@ -170,8 +192,13 @@ static unsigned int hpe1_cmd_smart_get_alarm_flags(struct ndctl_cmd *cmd)
 
 static unsigned int hpe1_cmd_smart_get_life_used(struct ndctl_cmd *cmd)
 {
-   if (hpe1_smart_valid(cmd) < 0)
+   int rc;
+
+   rc = hpe1_smart_valid(cmd);
+   if (rc < 0) {
+   errno = -rc;
return UINT_MAX;
+   }
 
return CMD_HPE1_SMART(cmd)->device_life;
 }
@@ -179,9 +206,13 @@ static unsigned int hpe1_cmd_smart_get_life_used(struct ndctl_cmd *cmd)
 static unsigned int hpe1_cmd_smart_get_shutdown_state(struct ndctl_cmd *cmd)
 {
unsigned int shutdown;
+   int rc;
 
-   if (hpe1_smart_valid(cmd) < 0)
+   rc = hpe1_smart_valid(cmd);
+   if (rc < 0) {
+   errno = -rc;
return UINT_MAX;
+   }
 
shutdown = CMD_HPE1_SMART(cmd)->last_shutdown_stat;
if (shutdown == NDN_HPE1_SMART_LASTSAVEGOOD)
@@ -192,16 +223,26 @@ static unsigned int hpe1_cmd_smart_get_shutdown_state(struct ndctl_cmd *cmd)
 
 static unsigned int hpe1_cmd_smart_get_vendor_size(struct ndctl_cmd *cmd)
 {
-   if (hpe1_smart_valid(cmd) < 0)
+   int rc;
+
+   rc = hpe1_smart_valid(cmd);
+   if (rc < 0) {
+   errno = -rc;
return UINT_MAX;
+   }
 

Re: [ndctl PATCH] libndctl: set errno for routines that don't return an error status

2018-10-04 Thread Verma, Vishal L


On Thu, 2018-10-04 at 16:11 -0700, Williams, Dan J wrote:
> On Thu, 2018-10-04 at 16:01 -0700, Verma, Vishal L wrote:
> > On Thu, 2018-10-04 at 15:54 -0700, Williams, Dan J wrote:
> > > On Thu, 2018-10-04 at 16:34 -0600, Vishal Verma wrote:
> > > > For routines that return a UINT_MAX or UL{L}ONG_MAX, there isn't a way
> > > > to get any information as to what went wrong. Set errno in such
> > > > routines so that the callers can get some additional context about
> > > > the error.
> > > 
> > > Looks ok, but why EOVERFLOW and not ENOMEM for the out of resource
> > > conditions?
> > 
> > I debated between that and also ENOSPC, but nothing seemed like an
> > exact fit for a buffer too small. Mainly not ENOMEM because we aren't
> > actually trying to allocate memory as a part of this?
> 
> That's true, but the effect is the same, couldn't get enough resource
> to complete the operation. I would expect EOVERFLOW for trying to store
> a 64-bit quantity in a 32-bit field.
> 

Good point, I'll change to ENOMEM.

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [ndctl PATCH] libndctl: set errno for routines that don't return an error status

2018-10-04 Thread Williams, Dan J
On Thu, 2018-10-04 at 16:01 -0700, Verma, Vishal L wrote:
> On Thu, 2018-10-04 at 15:54 -0700, Williams, Dan J wrote:
> > On Thu, 2018-10-04 at 16:34 -0600, Vishal Verma wrote:
> > > For routines that return a UINT_MAX or UL{L}ONG_MAX, there isn't
> > > a
> > > way
> > > to get any information as to what went wrong. Set errno in such
> > > routines
> > > so that the callers can get some additional context about the
> > > error.
> > 
> > Looks ok, but why EOVERFLOW and not ENOMEM for the out of resource
> > conditions?
> 
> I debated between that and also ENOSPC, but nothing seemed like an
> exact fit for a buffer too small.. Mainly not ENOMEM because we
> aren't
> actually trying to allocate memory as a part of this?

That's true, but the effect is the same, couldn't get enough resource
to complete the operation. I would expect EOVERFLOW for trying to store
a 64-bit quantity in a 32-bit field.



Re: [ndctl PATCH] libndctl: set errno for routines that don't return an error status

2018-10-04 Thread Verma, Vishal L
On Thu, 2018-10-04 at 15:54 -0700, Williams, Dan J wrote:
> On Thu, 2018-10-04 at 16:34 -0600, Vishal Verma wrote:
> > For routines that return a UINT_MAX or UL{L}ONG_MAX, there isn't a
> > way
> > to get any information as to what went wrong. Set errno in such
> > routines
> > so that the callers can get some additional context about the
> > error.
> 
> Looks ok, but why EOVERFLOW and not ENOMEM for the out of resource
> conditions?

I debated between that and also ENOSPC, but nothing seemed like an
exact fit for a buffer too small.. Mainly not ENOMEM because we aren't
actually trying to allocate memory as a part of this?




Re: [ndctl PATCH] libndctl: set errno for routines that don't return an error status

2018-10-04 Thread Williams, Dan J
On Thu, 2018-10-04 at 16:34 -0600, Vishal Verma wrote:
> For routines that return a UINT_MAX or UL{L}ONG_MAX, there isn't a
> way
> to get any information as to what went wrong. Set errno in such
> routines
> so that the callers can get some additional context about the error.

Looks ok, but why EOVERFLOW and not ENOMEM for the out of resource
conditions?


[ndctl PATCH] libndctl: set errno for routines that don't return an error status

2018-10-04 Thread Vishal Verma
For routines that return a UINT_MAX or UL{L}ONG_MAX, there isn't a way
to get any information as to what went wrong. Set errno in such routines
so that the callers can get some additional context about the error.

Reported-by: Lukasz Dorau 
Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 ndctl/lib/dimm.c |  8 +++--
 ndctl/lib/hpe1.c | 79 +---
 ndctl/lib/inject.c   |  2 ++
 ndctl/lib/intel.c| 45 +
 ndctl/lib/libndctl.c | 54 ++
 ndctl/lib/msft.c | 27 ---
 6 files changed, 177 insertions(+), 38 deletions(-)

diff --git a/ndctl/lib/dimm.c b/ndctl/lib/dimm.c
index b3e032e..0299b41 100644
--- a/ndctl/lib/dimm.c
+++ b/ndctl/lib/dimm.c
@@ -565,17 +565,21 @@ NDCTL_EXPORT unsigned long 
ndctl_dimm_get_available_labels(
 {
struct ndctl_ctx *ctx = ndctl_dimm_get_ctx(dimm);
char *path = dimm->dimm_buf;
-   int len = dimm->buf_len;
+   int rc, len = dimm->buf_len;
char buf[20];
 
if (snprintf(path, len, "%s/available_slots", dimm->dimm_path) >= len) {
err(ctx, "%s: buffer too small!\n",
ndctl_dimm_get_devname(dimm));
+   errno = EOVERFLOW;
return ULONG_MAX;
}
 
-   if (sysfs_read_attr(ctx, path, buf) < 0)
+   rc = sysfs_read_attr(ctx, path, buf);
+   if (rc < 0) {
+   errno = -rc;
return ULONG_MAX;
+   }
 
return strtoul(buf, NULL, 0);
 }
diff --git a/ndctl/lib/hpe1.c b/ndctl/lib/hpe1.c
index dbc1ff0..b26120e 100644
--- a/ndctl/lib/hpe1.c
+++ b/ndctl/lib/hpe1.c
@@ -90,9 +90,13 @@ static unsigned int hpe1_cmd_smart_get_flags(struct 
ndctl_cmd *cmd)
 {
unsigned int hpe1flags;
unsigned int flags;
+   int rc;
 
-   if (hpe1_smart_valid(cmd) < 0)
+   rc = hpe1_smart_valid(cmd);
+   if (rc < 0) {
+   errno = -rc;
return UINT_MAX;
+   }
 
hpe1flags = CMD_HPE1_SMART(cmd)->out_valid_flags;
flags = 0;
@@ -118,9 +122,13 @@ static unsigned int hpe1_cmd_smart_get_health(struct 
ndctl_cmd *cmd)
 {
unsigned char hpe1health;
unsigned int health;
+   int rc;
 
-   if (hpe1_smart_valid(cmd) < 0)
+   rc = hpe1_smart_valid(cmd);
+   if (rc < 0) {
+   errno = -rc;
return UINT_MAX;
+   }
 
hpe1health = CMD_HPE1_SMART(cmd)->stat_summary;
health = 0;
@@ -136,16 +144,26 @@ static unsigned int hpe1_cmd_smart_get_health(struct 
ndctl_cmd *cmd)
 
 static unsigned int hpe1_cmd_smart_get_media_temperature(struct ndctl_cmd *cmd)
 {
-   if (hpe1_smart_valid(cmd) < 0)
+   int rc;
+
+   rc = hpe1_smart_valid(cmd);
+   if (rc < 0) {
+   errno = -rc;
return UINT_MAX;
+   }
 
return CMD_HPE1_SMART(cmd)->curr_temp;
 }
 
 static unsigned int hpe1_cmd_smart_get_spares(struct ndctl_cmd *cmd)
 {
-   if (hpe1_smart_valid(cmd) < 0)
+   int rc;
+
+   rc = hpe1_smart_valid(cmd);
+   if (rc < 0) {
+   errno = -rc;
return UINT_MAX;
+   }
 
return CMD_HPE1_SMART(cmd)->spare_blocks;
 }
@@ -154,9 +172,13 @@ static unsigned int hpe1_cmd_smart_get_alarm_flags(struct 
ndctl_cmd *cmd)
 {
unsigned int hpe1flags;
unsigned int flags;
+   int rc;
 
-   if (hpe1_smart_valid(cmd) < 0)
+   rc = hpe1_smart_valid(cmd);
+   if (rc < 0) {
+   errno = -rc;
return UINT_MAX;
+   }
 
hpe1flags = CMD_HPE1_SMART(cmd)->alarm_trips;
flags = 0;
@@ -170,8 +192,13 @@ static unsigned int hpe1_cmd_smart_get_alarm_flags(struct 
ndctl_cmd *cmd)
 
 static unsigned int hpe1_cmd_smart_get_life_used(struct ndctl_cmd *cmd)
 {
-   if (hpe1_smart_valid(cmd) < 0)
+   int rc;
+
+   rc = hpe1_smart_valid(cmd);
+   if (rc < 0) {
+   errno = -rc;
return UINT_MAX;
+   }
 
return CMD_HPE1_SMART(cmd)->device_life;
 }
@@ -179,9 +206,13 @@ static unsigned int hpe1_cmd_smart_get_life_used(struct 
ndctl_cmd *cmd)
 static unsigned int hpe1_cmd_smart_get_shutdown_state(struct ndctl_cmd *cmd)
 {
unsigned int shutdown;
+   int rc;
 
-   if (hpe1_smart_valid(cmd) < 0)
+   rc = hpe1_smart_valid(cmd);
+   if (rc < 0) {
+   errno = -rc;
return UINT_MAX;
+   }
 
shutdown = CMD_HPE1_SMART(cmd)->last_shutdown_stat;
if (shutdown == NDN_HPE1_SMART_LASTSAVEGOOD)
@@ -192,16 +223,26 @@ static unsigned int 
hpe1_cmd_smart_get_shutdown_state(struct ndctl_cmd *cmd)
 
 static unsigned int hpe1_cmd_smart_get_vendor_size(struct ndctl_cmd *cmd)
 {
-   if (hpe1_smart_valid(cmd) < 0)
+   int rc;
+
+   rc = hpe1_smart_valid(cmd);
+   if (rc < 0) {
+   errno = -rc;
return UINT_MAX;
+   }
 
return 

Re: [PATCH v9 13/13] nvmet: Optionally use PCI P2P memory

2018-10-04 Thread Logan Gunthorpe



On 2018-10-04 4:20 p.m., Sagi Grimberg wrote:
>> +static inline struct device *nvmet_ns_dev(struct nvmet_ns *ns)
>> +{
>> +return disk_to_dev(ns->bdev->bd_disk);
>> +}
> 
> This needs to handle non bdev namespaces.

As it's coded now the helper never gets called unless ns->bdev is not
null. But in general, yes you are right, we should probably return NULL
if ns->bdev is NULL.

> And this?
> --
> diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c
> index ef286b72d958..3d12f5f4568d 100644
> --- a/drivers/nvme/target/fc.c
> +++ b/drivers/nvme/target/fc.c
> @@ -2280,6 +2280,7 @@ nvmet_fc_handle_fcp_rqst(struct nvmet_fc_tgtport 
> *tgtport,
>  fod->req.cmd = &fod->cmdiubuf.sqe;
>  fod->req.rsp = &fod->rspiubuf.cqe;
>  fod->req.port = tgtport->pe->port;
> +   fod->req.p2p_client = tgtport->dev;
> 
>  /* clear any response payload */
> memset(&fod->rspiubuf, 0, sizeof(fod->rspiubuf));
> --

Sure, I guess that makes sense. I've never tried it with fc hardware but
I assume there's no reason it wouldn't work.

I'll queue these changes up for a v10.

Logan


Re: [PATCH v9 13/13] nvmet: Optionally use PCI P2P memory

2018-10-04 Thread Sagi Grimberg




diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index e7b7406c4e22..4333e2c5b4f5 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -26,6 +26,7 @@
  #include 
  #include 
  #include 
+#include <linux/pci-p2pdma.h>
  
  #define NVMET_ASYNC_EVENTS		4

  #define NVMET_ERROR_LOG_SLOTS 128
@@ -77,6 +78,9 @@ struct nvmet_ns {
struct completion   disable_done;
mempool_t   *bvec_pool;
struct kmem_cache   *bvec_cache;
+
+   int use_p2pmem;
+   struct pci_dev  *p2p_dev;
  };
  
  static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item)

@@ -84,6 +88,11 @@ static inline struct nvmet_ns *to_nvmet_ns(struct 
config_item *item)
return container_of(to_config_group(item), struct nvmet_ns, group);
  }
  
+static inline struct device *nvmet_ns_dev(struct nvmet_ns *ns)

+{
+   return disk_to_dev(ns->bdev->bd_disk);
+}


This needs to handle non bdev namespaces.


+
  struct nvmet_cq {
u16 qid;
u16 size;
@@ -184,6 +193,9 @@ struct nvmet_ctrl {
  
  	char			subsysnqn[NVMF_NQN_FIELD_LEN];

	char			hostnqn[NVMF_NQN_FIELD_LEN];
+
+   struct device *p2p_client;
+   struct radix_tree_root p2p_ns_map;
  };
  
  struct nvmet_subsys {

@@ -294,6 +306,9 @@ struct nvmet_req {
  
  	void (*execute)(struct nvmet_req *req);

const struct nvmet_fabrics_ops *ops;
+
+   struct pci_dev *p2p_dev;
+   struct device *p2p_client;
  };
  
  extern struct workqueue_struct *buffered_io_wq;

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 9e091e78a2f0..3f7971d3706d 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -749,6 +749,8 @@ static void nvmet_rdma_handle_command(struct 
nvmet_rdma_queue *queue,
cmd->send_sge.addr, cmd->send_sge.length,
DMA_TO_DEVICE);
  
+	cmd->req.p2p_client = &queue->dev->device->dev;

+
	if (!nvmet_req_init(&cmd->req, &queue->nvme_cq,
			&queue->nvme_sq, &nvmet_rdma_ops))
return;


And this?
--
diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c
index ef286b72d958..3d12f5f4568d 100644
--- a/drivers/nvme/target/fc.c
+++ b/drivers/nvme/target/fc.c
@@ -2280,6 +2280,7 @@ nvmet_fc_handle_fcp_rqst(struct nvmet_fc_tgtport 
*tgtport,

fod->req.cmd = &fod->cmdiubuf.sqe;
fod->req.rsp = &fod->rspiubuf.cqe;
fod->req.port = tgtport->pe->port;
+   fod->req.p2p_client = tgtport->dev;

/* clear any response payload */
memset(&fod->rspiubuf, 0, sizeof(fod->rspiubuf));
--

Other than that this looks good!


[PATCH v9 04/13] PCI/P2PDMA: Introduce configfs/sysfs enable attribute helpers

2018-10-04 Thread Logan Gunthorpe
Users of the P2PDMA infrastructure will typically need a way for
the user to tell the kernel to use P2P resources. Typically
this will be a simple on/off boolean operation but sometimes
it may be desirable for the user to specify the exact device to
use for the P2P operation.

Add new helpers for attributes which take a boolean or a PCI device.
Any boolean as accepted by strtobool() turns P2P on or off (such as 'y', 'n',
'1', '0', etc.). Specifying a full PCI device name/BDF will select the
specific device.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
---
 drivers/pci/p2pdma.c   | 82 ++
 include/linux/pci-p2pdma.h | 15 +++
 2 files changed, 97 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index e914f04ea1ec..b18b3703ab79 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -8,6 +8,8 @@
  * Copyright (c) 2018, Eideticom Inc.
  */
 
+#define pr_fmt(fmt) "pci-p2pdma: " fmt
+#include <linux/ctype.h>
 #include 
 #include 
 #include 
@@ -725,3 +727,83 @@ int pci_p2pdma_map_sg(struct device *dev, struct 
scatterlist *sg, int nents,
return nents;
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg);
+
+/**
+ * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
+ * to enable p2pdma
+ * @page: contents of the value to be stored
+ * @p2p_dev: returns the PCI device that was selected to be used
+ * (if one was specified in the stored value)
+ * @use_p2pdma: returns whether to enable p2pdma or not
+ *
+ * Parses an attribute value to decide whether to enable p2pdma.
+ * The value can select a PCI device (using it's full BDF device
+ * name) or a boolean (in any format strtobool() accepts). A false
+ * value disables p2pdma, a true value expects the caller
+ * to automatically find a compatible device and specifying a PCI device
+ * expects the caller to use the specific provider.
+ *
+ * pci_p2pdma_enable_show() should be used as the show operation for
+ * the attribute.
+ *
+ * Returns 0 on success
+ */
+int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
+   bool *use_p2pdma)
+{
+   struct device *dev;
+
+   dev = bus_find_device_by_name(&pci_bus_type, NULL, page);
+   if (dev) {
+   *use_p2pdma = true;
+   *p2p_dev = to_pci_dev(dev);
+
+   if (!pci_has_p2pmem(*p2p_dev)) {
+   pci_err(*p2p_dev,
+   "PCI device has no peer-to-peer memory: %s\n",
+   page);
+   pci_dev_put(*p2p_dev);
+   return -ENODEV;
+   }
+
+   return 0;
+   } else if ((page[0] == '0' || page[0] == '1') && !iscntrl(page[1])) {
+   /*
+* If the user enters a PCI device that doesn't exist
+* like "0000:01:00.1", we don't want strtobool to think
+* it's a '0' when it's clearly not what the user wanted.
+* So we require 0's and 1's to be exactly one character.
+*/
+   } else if (!strtobool(page, use_p2pdma)) {
+   return 0;
+   }
+
+   pr_err("No such PCI device: %.*s\n", (int)strcspn(page, "\n"), page);
+   return -ENODEV;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_enable_store);
+
+/**
+ * pci_p2pdma_enable_show - show a configfs/sysfs attribute indicating
+ * whether p2pdma is enabled
+ * @page: contents of the stored value
+ * @p2p_dev: the selected p2p device (NULL if no device is selected)
+ * @use_p2pdma: whether p2pdma has been enabled
+ *
+ * Attributes that use pci_p2pdma_enable_store() should use this function
+ * to show the value of the attribute.
+ *
+ * Returns 0 on success
+ */
+ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
+  bool use_p2pdma)
+{
+   if (!use_p2pdma)
+   return sprintf(page, "0\n");
+
+   if (!p2p_dev)
+   return sprintf(page, "1\n");
+
+   return sprintf(page, "%s\n", pci_name(p2p_dev));
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index b6dfb6dc2e53..bca9bc3e5be7 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -32,6 +32,10 @@ void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct 
scatterlist *sgl);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
 int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
  enum dma_data_direction dir);
+int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
+   bool *use_p2pdma);
+ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
+  bool use_p2pdma);
 #else /* CONFIG_PCI_P2PDMA */
 static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
@@ -82,6 +86,17 

[PATCH v9 00/13] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-10-04 Thread Logan Gunthorpe
This is v9 for the p2pdma patch set. It has some substantial changes
from the previous version. Essentially, the last patch has changed
based on information from Sagi that the P2P memory can be assigned
per namespace instead of globally. To that end, I've added a
radix tree that stores the p2p devices to use for each namespace and
a lookup will be done on that before allocating the memory.

This has some knock on effects of simplifying the requirements
from the p2pdma code and so we drop all the client list code seeing
we don't need to worry about selecting a P2P device that works
with every namespace at once; only each namespace independently.
Thus, Patch 1, 6 and 13 have significant changes. However,
I think the removal of the client list code will be generally
appreciated based on the feedback I've gotten from Christian.

As usual, a git repo of the branch is available here:

https://github.com/sbates130272/linux-p2pmem pci-p2p-v9

Thanks,

Logan

--

Changes in v9:
* Rebased on v4.19-rc6
* Droped pci_p2pdma_add_client(), pci_p2pdma_remove_client() and
  pci_p2pdma_client_list_free(), and reworked pci_p2pdma_distance() and
  pci_p2pmem_find() to take simple arrays instead of list_heads.
* Updated the documentation to match
* Reworked the nvmet code to have a configfs attribute per namespace
  instead of per port. Then each namespace has its own P2P device
  stored in a radix-tree in the controller (it can't be stored in
  the nvme_ns structure seeing it needs to be unique per controller).
* Dropped the 'sq' argument in nvmet_req_alloc_sgl() seeing I've noticed
  it is now available in the nvmet_req structure.
* Collected Keith's reviewed-by tags (however, I have not applied his
  tag to the last patch seeing I think it has changed too much since
  his review).

Changes in v8:

* Added a couple of comments to address Bart's feedback and
  removed the bogus call to percpu_ref_switch_to_atomic_sync()

Changes in v7:

* Rebased on v4.19-rc5

* Fixed up commit message of patch 7 that was no longer accurate. (as
  pointed out by Jens)
* Change the configfs to not use "auto" or "none" and instead just
  use a 0/1/<pci-dev> (or boolean). This matches the existing
  nvme-target configfs booleans. (Per Bjorn)
* A handful of other minor changes and edits that were noticed by Bjorn
* Collected Acks from Bjorn

Changes in v6:

* Rebased on v4.19-rc3

* Remove the changes to the common submit_bio() path and instead
  set REQ_NOMERGE in the NVME target driver, when appropriate.
  Per discussions with Jens and Christoph.

* Some minor grammar changes in the documentation as spotted by Randy.

Changes in v5:

* Rebased on v4.19-rc1

* Drop changing ACS settings in this patchset. Now, the code
  will only allow P2P transactions between devices whose
  downstream ports do not restrict P2P TLPs.

* Drop the REQ_PCI_P2PDMA block flag and instead use
  is_pci_p2pdma_page() to tell if a request is P2P or not. In that
  case we check for queue support and enforce using REQ_NOMERGE.
  Per feedback from Christoph.

* Drop the pci_p2pdma_unmap_sg() function as it was empty and only
  there for symmetry and compatibility with dma_unmap_sg. Per feedback
  from Christoph.

* Split off the logic to handle enabling P2P in NVMe fabrics' configfs
  into specific helpers in the p2pdma code. Per feedback from Christoph.

* A number of other minor cleanups and fixes as pointed out by
  Christoph and others.

Changes in v4:

* Change the original upstream_bridges_match() function to
  upstream_bridge_distance() which calculates the distance between two
  devices as long as they are behind the same root port. This should
  address Bjorn's concerns that the code was to focused on
  being behind a single switch.

* The disable ACS function now disables ACS for all bridge ports instead
  of switch ports (ie. those that had two upstream_bridge ports).

* Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl()
  API to be more like sgl_alloc() in that the alloc function returns
  the allocated scatterlist and nents is not required by the free
  function.

* Moved the new documentation into the driver-api tree as requested
  by Jonathan

* Add SGL alloc and free helpers in the nvmet code so that the
  individual drivers can share the code that allocates P2P memory.
  As requested by Christoph.

* Cleanup the nvmet_p2pmem_store() function as Christoph
  thought my first attempt was ugly.

* Numerous commit message and comment fix-ups

Changes in v3:

* Many more fixes and minor cleanups that were spotted by Bjorn

* Additional explanation of the ACS change in both the commit message
  and Kconfig doc. Also, the code that disables the ACS bits is surrounded
  explicitly by an #ifdef

* Removed the flag we added to rdma_rw_ctx() in favour of using
  is_pci_p2pdma_page(), as suggested by Sagi.

* Adjust pci_p2pmem_find() so that it prefers P2P providers that
  are closest to (or the same as) the clients using them. In cases
  of ties, the provider is 

[PATCH v9 10/13] nvme-pci: Add support for P2P memory in requests

2018-10-04 Thread Logan Gunthorpe
For P2P requests, we must use the pci_p2pmem_map_sg() function
instead of the dma_map_sg functions.

With that, we can then indicate PCI_P2P support in the request queue.
For this, we create an NVME_F_PCI_P2P flag which tells the core to
set QUEUE_FLAG_PCI_P2P in the request queue.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Sagi Grimberg 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Keith Busch 
---
 drivers/nvme/host/core.c |  4 
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  | 17 +
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index dd8ec1dd9219..6033ce2fd3e9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3051,7 +3051,11 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
unsigned nsid)
ns->queue = blk_mq_init_queue(ctrl->tagset);
if (IS_ERR(ns->queue))
goto out_free_ns;
+
blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
+   if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+   blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
+
ns->queue->queuedata = ns;
ns->ctrl = ctrl;
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index bb4a2003c097..4030743c90aa 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -343,6 +343,7 @@ struct nvme_ctrl_ops {
unsigned int flags;
 #define NVME_F_FABRICS (1 << 0)
 #define NVME_F_METADATA_SUPPORTED  (1 << 1)
+#define NVME_F_PCI_P2PDMA  (1 << 2)
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f434706a04e8..0d6c41bc2b35 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -745,8 +745,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, 
struct request *req,
goto out;
 
ret = BLK_STS_RESOURCE;
-   nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir,
-   DMA_ATTR_NO_WARN);
+
+   if (is_pci_p2pdma_page(sg_page(iod->sg)))
+   nr_mapped = pci_p2pdma_map_sg(dev->dev, iod->sg, iod->nents,
+ dma_dir);
+   else
+   nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
+dma_dir,  DMA_ATTR_NO_WARN);
if (!nr_mapped)
goto out;
 
@@ -788,7 +793,10 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
request *req)
DMA_TO_DEVICE : DMA_FROM_DEVICE;
 
if (iod->nents) {
-   dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+   /* P2PDMA requests do not need to be unmapped */
+   if (!is_pci_p2pdma_page(sg_page(iod->sg)))
+   dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
+
if (blk_integrity_rq(req))
dma_unmap_sg(dev->dev, >meta_sg, 1, dma_dir);
}
@@ -2400,7 +2408,8 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, 
char *buf, int size)
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name   = "pcie",
.module = THIS_MODULE,
-   .flags  = NVME_F_METADATA_SUPPORTED,
+   .flags  = NVME_F_METADATA_SUPPORTED |
+ NVME_F_PCI_P2PDMA,
.reg_read32 = nvme_pci_reg_read32,
.reg_write32= nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,
-- 
2.19.0



[PATCH v9 05/13] docs-rst: Add a new directory for PCI documentation

2018-10-04 Thread Logan Gunthorpe
Add a new directory in the driver API guide for PCI specific
documentation.

This is in preparation for adding a new PCI P2P DMA driver writers
guide which will go in this directory.

Signed-off-by: Logan Gunthorpe 
Cc: Jonathan Corbet 
Cc: Mauro Carvalho Chehab 
Cc: Greg Kroah-Hartman 
Cc: Vinod Koul 
Cc: Linus Walleij 
Cc: Logan Gunthorpe 
Cc: Thierry Reding 
Cc: Sanyog Kale 
Cc: Sagar Dharia 
---
 Documentation/driver-api/index.rst |  2 +-
 Documentation/driver-api/pci/index.rst | 20 
 Documentation/driver-api/{ => pci}/pci.rst |  0
 3 files changed, 21 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/driver-api/pci/index.rst
 rename Documentation/driver-api/{ => pci}/pci.rst (100%)

diff --git a/Documentation/driver-api/index.rst 
b/Documentation/driver-api/index.rst
index 6d9f2f9fe20e..e9e7d24169cf 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -29,7 +29,7 @@ available subsections can be seen below.
iio/index
input
usb/index
-   pci
+   pci/index
spi
i2c
hsi
diff --git a/Documentation/driver-api/pci/index.rst 
b/Documentation/driver-api/pci/index.rst
new file mode 100644
index ..eaf20b24bf7d
--- /dev/null
+++ b/Documentation/driver-api/pci/index.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+The Linux PCI driver implementer's API guide
+
+
+.. class:: toc-title
+
+  Table of contents
+
+.. toctree::
+   :maxdepth: 2
+
+   pci
+
+.. only::  subproject and html
+
+   Indices
+   ===
+
+   * :ref:`genindex`
diff --git a/Documentation/driver-api/pci.rst 
b/Documentation/driver-api/pci/pci.rst
similarity index 100%
rename from Documentation/driver-api/pci.rst
rename to Documentation/driver-api/pci/pci.rst
-- 
2.19.0



[PATCH v9 02/13] PCI/P2PDMA: Add sysfs group to display p2pmem stats

2018-10-04 Thread Logan Gunthorpe
Add a sysfs group to display statistics about P2P memory that is
registered in each PCI device.

Attributes in the group display the total amount of P2P memory, the
amount available and whether it is published or not.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
---
 Documentation/ABI/testing/sysfs-bus-pci | 24 +++
 drivers/pci/p2pdma.c| 54 +
 2 files changed, 78 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci 
b/Documentation/ABI/testing/sysfs-bus-pci
index 44d4b2be92fd..8bfee557e50e 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -323,3 +323,27 @@ Description:
 
This is similar to /sys/bus/pci/drivers_autoprobe, but
affects only the VFs associated with a specific PF.
+
+What:  /sys/bus/pci/devices/.../p2pmem/size
+Date:  November 2017
+Contact:   Logan Gunthorpe 
+Description:
+   If the device has any Peer-to-Peer memory registered, this
+   file contains the total amount of memory that the device
+   provides (in decimal).
+
+What:  /sys/bus/pci/devices/.../p2pmem/available
+Date:  November 2017
+Contact:   Logan Gunthorpe 
+Description:
+   If the device has any Peer-to-Peer memory registered, this
+   file contains the amount of memory that has not been
+   allocated (in decimal).
+
+What:  /sys/bus/pci/devices/.../p2pmem/published
+Date:  November 2017
+Contact:   Logan Gunthorpe 
+Description:
+   If the device has any Peer-to-Peer memory registered, this
+   file contains a '1' if the memory has been published for
+   use outside the driver that owns the device.
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index c32cab20dd73..526c638579d1 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -24,6 +24,54 @@ struct pci_p2pdma {
bool p2pmem_published;
 };
 
+static ssize_t size_show(struct device *dev, struct device_attribute *attr,
+char *buf)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+   size_t size = 0;
+
+   if (pdev->p2pdma->pool)
+   size = gen_pool_size(pdev->p2pdma->pool);
+
+   return snprintf(buf, PAGE_SIZE, "%zd\n", size);
+}
+static DEVICE_ATTR_RO(size);
+
+static ssize_t available_show(struct device *dev, struct device_attribute 
*attr,
+ char *buf)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+   size_t avail = 0;
+
+   if (pdev->p2pdma->pool)
+   avail = gen_pool_avail(pdev->p2pdma->pool);
+
+   return snprintf(buf, PAGE_SIZE, "%zd\n", avail);
+}
+static DEVICE_ATTR_RO(available);
+
+static ssize_t published_show(struct device *dev, struct device_attribute 
*attr,
+ char *buf)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+
+   return snprintf(buf, PAGE_SIZE, "%d\n",
+   pdev->p2pdma->p2pmem_published);
+}
+static DEVICE_ATTR_RO(published);
+
+static struct attribute *p2pmem_attrs[] = {
+   &dev_attr_size.attr,
+   &dev_attr_available.attr,
+   &dev_attr_published.attr,
+   NULL,
+};
+
+static const struct attribute_group p2pmem_group = {
+   .attrs = p2pmem_attrs,
+   .name = "p2pmem",
+};
+
 static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
 {
struct pci_p2pdma *p2p =
@@ -59,6 +107,7 @@ static void pci_p2pdma_release(void *data)
	percpu_ref_exit(&pdev->p2pdma->devmap_ref);
 
gen_pool_destroy(pdev->p2pdma->pool);
+   sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
pdev->p2pdma = NULL;
 }
 
@@ -87,9 +136,14 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
 
pdev->p2pdma = p2p;
 
+   error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
+   if (error)
+   goto out_pool_destroy;
+
return 0;
 
 out_pool_destroy:
+   pdev->p2pdma = NULL;
gen_pool_destroy(p2p->pool);
 out:
	devm_kfree(&pdev->dev, p2p);
-- 
2.19.0



[PATCH v9 09/13] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-10-04 Thread Logan Gunthorpe
Register the CMB buffer as p2pmem and use the appropriate allocation
functions to create and destroy the IO submission queues.

If the CMB supports WDS and RDS, publish it for use as P2P memory
by other devices.

Kernels without CONFIG_PCI_P2PDMA will also no longer support NVMe CMB.
However, seeing the main use-case for the CMB is P2P operations,
this seems like a reasonable dependency.

We drop the __iomem safety on the buffer seeing that, by convention, it's
safe to directly access memory mapped by memremap()/devm_memremap_pages().
Architectures where this is not safe will not be supported by memremap()
and therefore will not support PCI P2P and will have no support for CMB.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Keith Busch 
---
 drivers/nvme/host/pci.c | 80 +++--
 1 file changed, 45 insertions(+), 35 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d668682f91df..f434706a04e8 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include <linux/pci-p2pdma.h>
 
 #include "nvme.h"
 
@@ -99,9 +100,8 @@ struct nvme_dev {
struct work_struct remove_work;
struct mutex shutdown_lock;
bool subsystem;
-   void __iomem *cmb;
-   pci_bus_addr_t cmb_bus_addr;
u64 cmb_size;
+   bool cmb_use_sqes;
u32 cmbsz;
u32 cmbloc;
struct nvme_ctrl ctrl;
@@ -158,7 +158,7 @@ struct nvme_queue {
struct nvme_dev *dev;
spinlock_t sq_lock;
struct nvme_command *sq_cmds;
-   struct nvme_command __iomem *sq_cmds_io;
+   bool sq_cmds_is_io;
spinlock_t cq_lock cacheline_aligned_in_smp;
volatile struct nvme_completion *cqes;
struct blk_mq_tags **tags;
@@ -447,11 +447,8 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 static void nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd)
 {
	spin_lock(&nvmeq->sq_lock);
-   if (nvmeq->sq_cmds_io)
-   memcpy_toio(&nvmeq->sq_cmds_io[nvmeq->sq_tail], cmd,
-   sizeof(*cmd));
-   else
-   memcpy(&nvmeq->sq_cmds[nvmeq->sq_tail], cmd, sizeof(*cmd));
+
+   memcpy(&nvmeq->sq_cmds[nvmeq->sq_tail], cmd, sizeof(*cmd));
 
if (++nvmeq->sq_tail == nvmeq->q_depth)
nvmeq->sq_tail = 0;
@@ -1232,9 +1229,18 @@ static void nvme_free_queue(struct nvme_queue *nvmeq)
 {
dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth),
(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
-   if (nvmeq->sq_cmds)
-   dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth),
-   nvmeq->sq_cmds, nvmeq->sq_dma_addr);
+
+   if (nvmeq->sq_cmds) {
+   if (nvmeq->sq_cmds_is_io)
+   pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev),
+   nvmeq->sq_cmds,
+   SQ_SIZE(nvmeq->q_depth));
+   else
+   dma_free_coherent(nvmeq->q_dmadev,
+ SQ_SIZE(nvmeq->q_depth),
+ nvmeq->sq_cmds,
+ nvmeq->sq_dma_addr);
+   }
 }
 
 static void nvme_free_queues(struct nvme_dev *dev, int lowest)
@@ -1323,12 +1329,21 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues,
 static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
int qid, int depth)
 {
-   /* CMB SQEs will be mapped before creation */
-   if (qid && dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS))
-   return 0;
+   struct pci_dev *pdev = to_pci_dev(dev->dev);
+
+   if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
+   nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
+   nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
+   nvmeq->sq_cmds);
+   nvmeq->sq_cmds_is_io = true;
+   }
+
+   if (!nvmeq->sq_cmds) {
+   nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
+   &nvmeq->sq_dma_addr, GFP_KERNEL);
+   nvmeq->sq_cmds_is_io = false;
+   }
 
-   nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
-   &nvmeq->sq_dma_addr, GFP_KERNEL);
if (!nvmeq->sq_cmds)
return -ENOMEM;
return 0;
@@ -1405,13 +1420,6 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
int result;
s16 vector;
 
-   if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
-   unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
- dev->ctrl.page_size);
-   nvmeq->sq_dma_addr = 

[PATCH v9 06/13] PCI/P2PDMA: Add P2P DMA driver writer's documentation

2018-10-04 Thread Logan Gunthorpe
Add a restructured text file describing how to write drivers
with support for P2P DMA transactions. The document describes
how to use the APIs that were added in the previous few
commits.

Also adds an index for the PCI documentation tree even though this
is the only PCI document that has been converted to restructured text
at this time.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
Cc: Jonathan Corbet 
---
 Documentation/driver-api/pci/index.rst  |   1 +
 Documentation/driver-api/pci/p2pdma.rst | 145 
 2 files changed, 146 insertions(+)
 create mode 100644 Documentation/driver-api/pci/p2pdma.rst

diff --git a/Documentation/driver-api/pci/index.rst b/Documentation/driver-api/pci/index.rst
index eaf20b24bf7d..ecc7416c523b 100644
--- a/Documentation/driver-api/pci/index.rst
+++ b/Documentation/driver-api/pci/index.rst
@@ -11,6 +11,7 @@ The Linux PCI driver implementer's API guide
:maxdepth: 2
 
pci
+   p2pdma
 
 .. only::  subproject and html
 
diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst
new file mode 100644
index ..0c83a0619007
--- /dev/null
+++ b/Documentation/driver-api/pci/p2pdma.rst
@@ -0,0 +1,145 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+
+PCI Peer-to-Peer DMA Support
+
+
+The PCI bus has pretty decent support for performing DMA transfers
+between two devices on the bus. This type of transaction is henceforth
+called Peer-to-Peer (or P2P). However, there are a number of issues that
+make P2P transactions tricky to do in a perfectly safe way.
+
+One of the biggest issues is that PCI doesn't require forwarding
+transactions between hierarchy domains, and in PCIe, each Root Port
+defines a separate hierarchy domain. To make things worse, there is no
+simple way to determine if a given Root Complex supports this or not.
+(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
+only supports doing P2P when the endpoints involved are all behind the
+same PCI bridge, as such devices are all in the same PCI hierarchy
+domain, and the spec guarantees that all transactions within the
+hierarchy will be routable, but it does not require routing
+between hierarchies.
+
+The second issue is that to make use of existing interfaces in Linux,
+memory that is used for P2P transactions needs to be backed by struct
+pages. However, PCI BARs are not typically cache coherent so there are
+a few corner case gotchas with these pages so developers need to
+be careful about what they do with them.
+
+
+Driver Writer's Guide
+=
+
+In a given P2P implementation there may be three or more different
+types of kernel drivers in play:
+
+* Provider - A driver which provides or publishes P2P resources like
+  memory or doorbell registers to other drivers.
+* Client - A driver which makes use of a resource by setting up a
+  DMA transaction to or from it.
+* Orchestrator - A driver which orchestrates the flow of data between
+  clients and providers.
+
+In many cases there could be overlap between these three types (i.e.,
+it may be typical for a driver to be both a provider and a client).
+
+For example, in the NVMe Target Copy Offload implementation:
+
+* The NVMe PCI driver is a client, provider and orchestrator
+  in that it exposes any CMB (Controller Memory Buffer) as a P2P memory
+  resource (provider), it accepts P2P memory pages as buffers in requests
+  to be used directly (client) and it can also make use of the CMB as
+  submission queue entries (orchestrator).
+* The RDMA driver is a client in this arrangement so that an RNIC
+  can DMA directly to the memory exposed by the NVMe device.
+* The NVMe Target driver (nvmet) can orchestrate the data from the RNIC
+  to the P2P memory (CMB) and then to the NVMe device (and vice versa).
+
+This is currently the only arrangement supported by the kernel but
+one could imagine slight tweaks to this that would allow for the same
+functionality. For example, if a specific RNIC added a BAR with some
+memory behind it, its driver could add support as a P2P provider and
+then the NVMe Target could use the RNIC's memory instead of the CMB
+in cases where the NVMe cards in use do not have CMB support.
+
+
+Provider Drivers
+
+
+A provider simply needs to register a BAR (or a portion of a BAR)
+as a P2P DMA resource using :c:func:`pci_p2pdma_add_resource()`.
+This will register struct pages for all the specified memory.
+
+After that it may optionally publish all of its resources as
+P2P memory using :c:func:`pci_p2pmem_publish()`. This will allow
+any orchestrator drivers to find and use the memory. When marked in
+this way, the resource must be regular memory with no side effects.
+
+For the time being this is fairly rudimentary in that all resources
+are typically going to be P2P memory. Future work will likely expand
+this to include other types of resources like 

[PATCH v9 12/13] nvmet: Introduce helper functions to allocate and free request SGLs

2018-10-04 Thread Logan Gunthorpe
Add helpers to allocate and free the SGL in a struct nvmet_req:

int nvmet_req_alloc_sgl(struct nvmet_req *req)
void nvmet_req_free_sgl(struct nvmet_req *req)

This will be expanded in a future patch to implement peer-to-peer
memory DMAs and should be common with all target drivers.

The new helpers are used in nvmet-rdma. Since we use req.transfer_len
as the length of the SGL, it is set earlier and cleared on any error.
It also seems unnecessary to accumulate the length, as the map_sgl
functions should only ever be called once per request.
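As a toy illustration of that contract (plain userspace C, not the kernel structs; the toy_* names and the fail_alloc hook are invented for the example), the set-early/clear-on-error handling of transfer_len looks like:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of the pattern the patch adopts: record transfer_len up
 * front, and clear it again on any mapping error so a failed request
 * never reports a stale length. */
struct toy_req {
	void   *sg;
	size_t  transfer_len;
};

static int fail_alloc;	/* test hook to simulate allocation failure */

static int toy_alloc_sgl(struct toy_req *req)
{
	req->sg = fail_alloc ? NULL : malloc(req->transfer_len);
	return req->sg ? 0 : -1;
}

static int toy_map_sgl(struct toy_req *req, size_t len)
{
	req->transfer_len = len;	/* set early, as in the patch */

	if (!req->transfer_len)
		return 0;		/* no-data command */

	if (toy_alloc_sgl(req) < 0) {
		req->transfer_len = 0;	/* error path: clear the length */
		return -1;
	}
	return 0;
}
```

Callers then only ever see transfer_len paired with a valid SGL, which is what lets the length be set before the allocation without risk.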

Signed-off-by: Logan Gunthorpe 
Acked-by: Sagi Grimberg 
Cc: Christoph Hellwig 
---
 drivers/nvme/target/core.c  | 18 ++
 drivers/nvme/target/nvmet.h |  2 ++
 drivers/nvme/target/rdma.c  | 20 
 3 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index b5ec96abd048..310b9fb54f6a 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -725,6 +725,24 @@ void nvmet_req_execute(struct nvmet_req *req)
 }
 EXPORT_SYMBOL_GPL(nvmet_req_execute);
 
+int nvmet_req_alloc_sgl(struct nvmet_req *req)
+{
+   req->sg = sgl_alloc(req->transfer_len, GFP_KERNEL, &req->sg_cnt);
+   if (!req->sg)
+   return -ENOMEM;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(nvmet_req_alloc_sgl);
+
+void nvmet_req_free_sgl(struct nvmet_req *req)
+{
+   sgl_free(req->sg);
+   req->sg = NULL;
+   req->sg_cnt = 0;
+}
+EXPORT_SYMBOL_GPL(nvmet_req_free_sgl);
+
 static inline bool nvmet_cc_en(u32 cc)
 {
return (cc >> NVME_CC_EN_SHIFT) & 0x1;
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index ec9af4ee03b6..e7b7406c4e22 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -336,6 +336,8 @@ bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq 
*cq,
 void nvmet_req_uninit(struct nvmet_req *req);
 void nvmet_req_execute(struct nvmet_req *req);
 void nvmet_req_complete(struct nvmet_req *req, u16 status);
+int nvmet_req_alloc_sgl(struct nvmet_req *req);
+void nvmet_req_free_sgl(struct nvmet_req *req);
 
 void nvmet_cq_setup(struct nvmet_ctrl *ctrl, struct nvmet_cq *cq, u16 qid,
u16 size);
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index bfc4da660bb4..9e091e78a2f0 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -503,7 +503,7 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
}
 
if (rsp->req.sg != rsp->cmd->inline_sg)
-   sgl_free(rsp->req.sg);
+   nvmet_req_free_sgl(&rsp->req);
 
if (unlikely(!list_empty_careful(>rsp_wr_wait_list)))
nvmet_rdma_process_wr_wait_list(queue);
@@ -652,24 +652,24 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
 {
struct rdma_cm_id *cm_id = rsp->queue->cm_id;
u64 addr = le64_to_cpu(sgl->addr);
-   u32 len = get_unaligned_le24(sgl->length);
u32 key = get_unaligned_le32(sgl->key);
int ret;
 
+   rsp->req.transfer_len = get_unaligned_le24(sgl->length);
+
/* no data command? */
-   if (!len)
+   if (!rsp->req.transfer_len)
return 0;
 
-   rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt);
-   if (!rsp->req.sg)
-   return NVME_SC_INTERNAL;
+   ret = nvmet_req_alloc_sgl(&rsp->req);
+   if (ret < 0)
+   goto error_out;
 
ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
nvmet_data_dir(>req));
if (ret < 0)
-   return NVME_SC_INTERNAL;
-   rsp->req.transfer_len += len;
+   goto error_out;
rsp->n_rdma += ret;
 
if (invalidate) {
@@ -678,6 +678,10 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
}
 
return 0;
+
+error_out:
+   rsp->req.transfer_len = 0;
+   return NVME_SC_INTERNAL;
 }
 
 static u16 nvmet_rdma_map_sgl(struct nvmet_rdma_rsp *rsp)
-- 
2.19.0

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH v9 01/13] PCI/P2PDMA: Support peer-to-peer memory

2018-10-04 Thread Logan Gunthorpe
Some PCI devices may have memory mapped in a BAR space that's
intended for use in peer-to-peer transactions. In order to enable
such transactions the memory must be registered with ZONE_DEVICE pages
so it can be used by DMA interfaces in existing drivers.

Add an interface for other subsystems to find and allocate chunks of P2P
memory as necessary to facilitate transfers between two PCI peers:

struct pci_dev *pci_p2pmem_find[_many]();
int pci_p2pdma_distance[_many]();
void *pci_alloc_p2pmem();

The new interface requires a driver to collect a list of client devices
involved in the transaction then call pci_p2pmem_find() to obtain any
suitable P2P memory. Alternatively, if the caller knows a device which
provides P2P memory, they can use pci_p2pdma_distance() to determine
if it is usable. With a suitable p2pmem device, memory can then be
allocated with pci_alloc_p2pmem() for use in DMA transactions.
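A rough sketch of that flow in an orchestrator (kernel-style pseudocode, not a compilable driver; pci_p2pmem_find(), pci_alloc_p2pmem(), pci_free_p2pmem() and pci_dev_put() are from this series, while client_dev, buf and the fallback handling are illustrative):

```
/* one client device that will DMA to/from the buffer */
struct pci_dev *p2p_dev = pci_p2pmem_find(client_dev);

if (p2p_dev) {
        buf = pci_alloc_p2pmem(p2p_dev, size);
        if (!buf)
                p2p_dev = NULL;          /* fall back below */
}
if (!p2p_dev)
        buf = kmalloc(size, GFP_KERNEL); /* regular memory fallback */

/* ... set up the DMA transaction using buf ... */

if (p2p_dev) {
        pci_free_p2pmem(p2p_dev, buf, size);
        pci_dev_put(p2p_dev);            /* find took a reference */
} else {
        kfree(buf);
}
```

The fallback branch mirrors what the nvmet patch later in the series does when no suitable p2pmem device is found or a P2P allocation fails.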

Depending on hardware, using peer-to-peer memory may reduce the bandwidth
of the transfer but can significantly reduce pressure on system memory.
This may be desirable in many cases: for example a system could be designed
with a small CPU connected to a PCIe switch by a small number of lanes
which would maximize the number of lanes available to connect to NVMe
devices.

The code is designed to only utilize the p2pmem device if all the devices
involved in a transfer are behind the same PCI bridge. This is because we
have no way of knowing whether peer-to-peer routing between PCIe Root Ports
is supported (PCIe r4.0, sec 1.3.1). Additionally, the benefits of P2P
transfers that go through the RC are limited to reducing DRAM usage
and, in some cases, coding convenience. The PCI-SIG may be exploring
adding a new capability bit to advertise whether this is possible for
future hardware.

This commit includes significant rework and feedback from Christoph
Hellwig.

Signed-off-by: Christoph Hellwig 
Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas   # PCI pieces
---
 drivers/pci/Kconfig|  17 +
 drivers/pci/Makefile   |   1 +
 drivers/pci/p2pdma.c   | 630 +
 include/linux/memremap.h   |   5 +
 include/linux/mm.h |  18 ++
 include/linux/pci-p2pdma.h |  92 ++
 include/linux/pci.h|   4 +
 7 files changed, 767 insertions(+)
 create mode 100644 drivers/pci/p2pdma.c
 create mode 100644 include/linux/pci-p2pdma.h

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 56ff8f6d31fc..deb68be4fdac 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -132,6 +132,23 @@ config PCI_PASID
 
  If unsure, say N.
 
+config PCI_P2PDMA
+   bool "PCI peer-to-peer transfer support"
+   depends on PCI && ZONE_DEVICE
+   select GENERIC_ALLOCATOR
+   help
+ Enables drivers to do PCI peer-to-peer transactions to and from
+ BARs that are exposed in other devices that are part of
+ the hierarchy where peer-to-peer DMA is guaranteed by the PCI
+ specification to work (i.e. anything below a single PCI bridge).
+
+ Many PCIe root complexes do not support P2P transactions and
+ it's hard to tell which ones support it at all, so at this time,
+ P2P DMA transactions must be between devices behind the same root
+ port.
+
+ If unsure, say N.
+
 config PCI_LABEL
def_bool y if (DMI || ACPI)
depends on PCI
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 1b2cfe51e8d7..85f4a703b2be 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_PCI_SYSCALL) += syscall.o
 obj-$(CONFIG_PCI_STUB) += pci-stub.o
 obj-$(CONFIG_PCI_PF_STUB)  += pci-pf-stub.o
 obj-$(CONFIG_PCI_ECAM) += ecam.o
+obj-$(CONFIG_PCI_P2PDMA)   += p2pdma.o
 obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
 
 # Endpoint library must be initialized before its users
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
new file mode 100644
index ..c32cab20dd73
--- /dev/null
+++ b/drivers/pci/p2pdma.c
@@ -0,0 +1,630 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI Peer 2 Peer DMA support.
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct pci_p2pdma {
+   struct percpu_ref devmap_ref;
+   struct completion devmap_ref_done;
+   struct gen_pool *pool;
+   bool p2pmem_published;
+};
+
+static void pci_p2pdma_percpu_release(struct percpu_ref *ref)
+{
+   struct pci_p2pdma *p2p =
+   container_of(ref, struct pci_p2pdma, devmap_ref);
+
+   complete_all(&p2p->devmap_ref_done);
+}
+
+static void pci_p2pdma_percpu_kill(void *data)
+{
+   struct percpu_ref *ref = data;
+
+   /*
+* pci_p2pdma_add_resource() may be called multiple times
+ 

[PATCH v9 07/13] block: Add PCI P2P flag for request queue and check support for requests

2018-10-04 Thread Logan Gunthorpe
QUEUE_FLAG_PCI_P2PDMA is introduced, meaning a driver's request queue
supports targeting P2P memory. This will be used by P2P providers and
orchestrators (in subsequent patches) to ensure block devices can
support P2P memory before submitting P2P-backed pages to submit_bio().

Signed-off-by: Logan Gunthorpe 
---
 include/linux/blkdev.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6980014357d4..87fb1963b721 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -699,6 +699,7 @@ struct request_queue {
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED28  /* queue has been quiesced */
#define QUEUE_FLAG_PREEMPT_ONLY 29 /* only process REQ_PREEMPT requests */
+#define QUEUE_FLAG_PCI_P2PDMA  30  /* device supports pci p2p requests */
 
 #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) |\
 (1 << QUEUE_FLAG_SAME_COMP)|   \
@@ -731,6 +732,8 @@ bool blk_queue_flag_test_and_clear(unsigned int flag, struct request_queue *q);
 #define blk_queue_dax(q)   test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
 #define blk_queue_scsi_passthrough(q)  \
test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
+#define blk_queue_pci_p2pdma(q)\
+   test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
-- 
2.19.0



[PATCH v9 11/13] nvme-pci: Add a quirk for a pseudo CMB

2018-10-04 Thread Logan Gunthorpe
Introduce a quirk to use CMB-like memory on older devices that have
an exposed BAR but do not advertise support for using CMBLOC and
CMBSIZE.

We'd like to use some of these older cards to test P2P memory.
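For reference, the register value the quirk fabricates can be checked with ordinary arithmetic: a fixed 16MB size unit (SZU encoding 3) and the BAR length expressed in 16MB multiples. A small userspace sketch follows; the CMBSZ_* constants mirror the kernel's NVME_CMBSZ_* definitions and are reproduced here as assumptions:

```c
#include <assert.h>
#include <stdint.h>

/* Field layout assumed to mirror NVME_CMBSZ_* in include/linux/nvme.h. */
#define CMBSZ_RDS       (1u << 3)
#define CMBSZ_WDS       (1u << 4)
#define CMBSZ_SZU_SHIFT 8
#define CMBSZ_SZU_MASK  0xfu
#define CMBSZ_SZ_SHIFT  12

#define SZ_16M          (16u * 1024 * 1024)

/* Encode a fake CMBSZ for a BAR of bar_len bytes, as nvme_pseudo_cmbsz()
 * does: WDS|RDS set, 16MB size unit, size in 16MB multiples. */
static uint32_t pseudo_cmbsz(uint64_t bar_len)
{
	/* ilog2(SZ_16M) == 24, so (24 - 12) / 4 == 3 -> 16MB units */
	uint32_t szu = (24 - 12) / 4;

	return CMBSZ_WDS | CMBSZ_RDS |
	       (szu << CMBSZ_SZU_SHIFT) |
	       ((uint32_t)(bar_len / SZ_16M) << CMBSZ_SZ_SHIFT);
}

/* Decode back, as nvme_cmb_size_unit()/nvme_cmb_size() would. */
static uint64_t cmb_size_bytes(uint32_t cmbsz)
{
	uint32_t szu = (cmbsz >> CMBSZ_SZU_SHIFT) & CMBSZ_SZU_MASK;
	uint64_t unit = 1ull << (12 + 4 * szu);	/* SZU=3 -> 16MB */

	return unit * (cmbsz >> CMBSZ_SZ_SHIFT);
}
```

Round-tripping a BAR length through encode and decode should return the original size, which is what makes the pseudo register indistinguishable from a real CMBSZ to the existing mapping code.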

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Sagi Grimberg 
Reviewed-by: Keith Busch 
---
 drivers/nvme/host/nvme.h |  7 +++
 drivers/nvme/host/pci.c  | 24 
 2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 4030743c90aa..8e6f3bcfe956 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -90,6 +90,13 @@ enum nvme_quirks {
 * Set MEDIUM priority on SQ creation
 */
NVME_QUIRK_MEDIUM_PRIO_SQ   = (1 << 7),
+
+   /*
+* Pseudo CMB Support on BAR 4. For adapters like the Microsemi
+* NVRAM that have CMB-like memory on a BAR but do not set
+* CMBLOC or CMBSZ.
+*/
+   NVME_QUIRK_PSEUDO_CMB_BAR4  = (1 << 8),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 0d6c41bc2b35..db862ee6e53e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1644,6 +1644,13 @@ static ssize_t nvme_cmb_show(struct device *dev,
 }
 static DEVICE_ATTR(cmb, S_IRUGO, nvme_cmb_show, NULL);
 
+static u32 nvme_pseudo_cmbsz(struct pci_dev *pdev, int bar)
+{
+   return NVME_CMBSZ_WDS | NVME_CMBSZ_RDS |
+   (((ilog2(SZ_16M) - 12) / 4) << NVME_CMBSZ_SZU_SHIFT) |
+   ((pci_resource_len(pdev, bar) / SZ_16M) << NVME_CMBSZ_SZ_SHIFT);
+}
+
 static u64 nvme_cmb_size_unit(struct nvme_dev *dev)
 {
u8 szu = (dev->cmbsz >> NVME_CMBSZ_SZU_SHIFT) & NVME_CMBSZ_SZU_MASK;
@@ -1663,10 +1670,15 @@ static void nvme_map_cmb(struct nvme_dev *dev)
struct pci_dev *pdev = to_pci_dev(dev->dev);
int bar;
 
-   dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
-   if (!dev->cmbsz)
-   return;
-   dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
+   if (dev->ctrl.quirks & NVME_QUIRK_PSEUDO_CMB_BAR4) {
+   dev->cmbsz = nvme_pseudo_cmbsz(pdev, 4);
+   dev->cmbloc = 4;
+   } else {
+   dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ);
+   if (!dev->cmbsz)
+   return;
+   dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC);
+   }
 
size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev);
offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc);
@@ -2715,6 +2727,10 @@ static const struct pci_device_id nvme_id_table[] = {
.driver_data = NVME_QUIRK_LIGHTNVM, },
{ PCI_DEVICE(0x1d1d, 0x2601),   /* CNEX Granby */
.driver_data = NVME_QUIRK_LIGHTNVM, },
+   { PCI_DEVICE(0x11f8, 0xf117),   /* Microsemi NVRAM adaptor */
+   .driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4, },
+   { PCI_DEVICE(0x1db1, 0x0002),   /* Everspin nvNitro adaptor */
+   .driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4,  },
{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xff) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
-- 
2.19.0



[PATCH v9 13/13] nvmet: Optionally use PCI P2P memory

2018-10-04 Thread Logan Gunthorpe
We create a configfs attribute in each nvme-fabrics namespace to
enable P2P memory use. The attribute may be enabled (with a boolean)
or a specific P2P device may be given (with the device's PCI name).

When enabled, the namespace will ensure the underlying block device
supports P2P and that it is compatible with any specified P2P device.
If no device was specified it will ensure there is compatible P2P memory
somewhere in the system. Enabling a namespace with P2P memory will fail
with EINVAL (and an appropriate dmesg error) if any of these conditions
are not met.

Once a controller is set up on a specific port, the P2P device to use
for each namespace will be found and stored in a radix tree by
namespace ID. When memory is allocated for a request, the tree is used
to look up the P2P device to allocate memory against. If no device is in
the tree (because no appropriate device was found), or if allocation of
P2P memory fails, the system will fall back to using regular memory.

Signed-off-by: Stephen Bates 
Signed-off-by: Steve Wise 
[hch: partial rewrite of the initial code]
Signed-off-by: Christoph Hellwig 
Signed-off-by: Logan Gunthorpe 
---
 drivers/nvme/target/configfs.c|  45 
 drivers/nvme/target/core.c| 164 +-
 drivers/nvme/target/io-cmd-bdev.c |   3 +
 drivers/nvme/target/nvmet.h   |  15 +++
 drivers/nvme/target/rdma.c|   2 +
 5 files changed, 228 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index b37a8e3e3f80..72e7e356fed2 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -17,6 +17,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "nvmet.h"
 
@@ -340,6 +342,46 @@ static ssize_t nvmet_ns_device_path_store(struct config_item *item,
 
 CONFIGFS_ATTR(nvmet_ns_, device_path);
 
+#ifdef CONFIG_PCI_P2PDMA
+static ssize_t nvmet_ns_p2pmem_show(struct config_item *item, char *page)
+{
+   struct nvmet_ns *ns = to_nvmet_ns(item);
+
+   return pci_p2pdma_enable_show(page, ns->p2p_dev, ns->use_p2pmem);
+}
+
+static ssize_t nvmet_ns_p2pmem_store(struct config_item *item,
+   const char *page, size_t count)
+{
+   struct nvmet_ns *ns = to_nvmet_ns(item);
+   struct pci_dev *p2p_dev = NULL;
+   bool use_p2pmem;
+   int error;
+   int ret = count;
+
+   mutex_lock(&ns->subsys->lock);
+   if (ns->enabled) {
+   ret = -EBUSY;
+   goto out_unlock;
+   }
+
+   error = pci_p2pdma_enable_store(page, &p2p_dev, &use_p2pmem);
+   if (error)
+   return error;
+
+   ns->use_p2pmem = use_p2pmem;
+   pci_dev_put(ns->p2p_dev);
+   ns->p2p_dev = p2p_dev;
+
+out_unlock:
+   mutex_unlock(&ns->subsys->lock);
+
+   return ret;
+}
+
+CONFIGFS_ATTR(nvmet_ns_, p2pmem);
+#endif /* CONFIG_PCI_P2PDMA */
+
 static ssize_t nvmet_ns_device_uuid_show(struct config_item *item, char *page)
 {
return sprintf(page, "%pUb\n", &to_nvmet_ns(item)->uuid);
@@ -509,6 +551,9 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
&nvmet_ns_attr_ana_grpid,
&nvmet_ns_attr_enable,
&nvmet_ns_attr_buffered_io,
+#ifdef CONFIG_PCI_P2PDMA
+   &nvmet_ns_attr_p2pmem,
+#endif
NULL,
 };
 
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 310b9fb54f6a..d9e273ada36d 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "nvmet.h"
 
@@ -365,9 +366,93 @@ static void nvmet_ns_dev_disable(struct nvmet_ns *ns)
nvmet_file_ns_disable(ns);
 }
 
+static int nvmet_p2pmem_ns_enable(struct nvmet_ns *ns)
+{
+   int ret;
+   struct pci_dev *p2p_dev;
+
+   if (!ns->use_p2pmem)
+   return 0;
+
+   if (!ns->bdev) {
+   pr_err("peer-to-peer DMA is not supported by non-block device namespaces\n");
+   return -EINVAL;
+   }
+
+   if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
+   pr_err("peer-to-peer DMA is not supported by the driver of %s\n",
+  ns->device_path);
+   return -EINVAL;
+   }
+
+   if (ns->p2p_dev) {
+   ret = pci_p2pdma_distance(ns->p2p_dev, nvmet_ns_dev(ns), true);
+   if (ret < 0)
+   return -EINVAL;
+   } else {
+   /*
+* Right now we just check that there is p2pmem available so
+* we can report an error to the user right away if there
+* is not. We'll find the actual device to use once we
+* setup the controller when the port's device is available.
+*/
+
+   p2p_dev = pci_p2pmem_find(nvmet_ns_dev(ns));
+   if (!p2p_dev) {
+   pr_err("no peer-to-peer memory is available for %s\n",
+  ns->device_path);
+   return 

[PATCH v9 03/13] PCI/P2PDMA: Add PCI p2pmem DMA mappings to adjust the bus offset

2018-10-04 Thread Logan Gunthorpe
The DMA address used when mapping PCI P2P memory must be the PCI bus
address. Thus, introduce pci_p2pmem_map_sg() to map the correct
addresses when using P2P memory.

Memory mapped in this way does not need to be unmapped and thus if we
provided pci_p2pmem_unmap_sg() it would be empty. This breaks the
expected balance between map/unmap but was left out as an empty
function doesn't really provide any benefit. In the future, if
this call becomes necessary it can be added without much difficulty.

For this, we assume that an SGL passed to these functions contains
all P2P memory or no P2P memory.
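The underlying translation is plain address arithmetic: a CPU physical address inside the BAR is rebased from the resource start to the BAR's address on the PCI bus, with the constant difference precomputed as the pagemap's bus offset. A small userspace sketch (struct and function names here are illustrative, not kernel API):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of one P2P BAR: the CPU physical range it occupies
 * and the address the same range has on the PCI bus. */
struct p2p_bar {
	uint64_t res_start;	/* as from pci_resource_start() */
	uint64_t bus_start;	/* as from pci_bus_address() */
	uint64_t len;
};

/* Rebase a CPU physical address inside the BAR onto the PCI bus.
 * This is the per-scatterlist-entry translation pci_p2pdma_map_sg()
 * performs, with (bus_start - res_start) folded into a stored offset. */
static uint64_t p2p_phys_to_bus(const struct p2p_bar *bar, uint64_t paddr)
{
	assert(paddr >= bar->res_start &&
	       paddr < bar->res_start + bar->len);
	return bar->bus_start + (paddr - bar->res_start);
}
```

Because the offset is constant per BAR, the mapping needs no IOMMU state and nothing to tear down, which is why no unmap counterpart is provided.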

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
---
 drivers/pci/p2pdma.c   | 43 ++
 include/linux/memremap.h   |  1 +
 include/linux/pci-p2pdma.h |  7 +++
 3 files changed, 51 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 526c638579d1..e914f04ea1ec 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -194,6 +194,8 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
pgmap->res.flags = pci_resource_flags(pdev, bar);
pgmap->ref = &pdev->p2pdma->devmap_ref;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+   pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) -
+   pci_resource_start(pdev, bar);
 
addr = devm_memremap_pages(>dev, pgmap);
if (IS_ERR(addr)) {
@@ -682,3 +684,44 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
pdev->p2pdma->p2pmem_published = publish;
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
+
+/**
+ * pci_p2pdma_map_sg - map a PCI peer-to-peer scatterlist for DMA
+ * @dev: device doing the DMA request
+ * @sg: scatter list to map
+ * @nents: elements in the scatterlist
+ * @dir: DMA direction
+ *
+ * Scatterlists mapped with this function should not be unmapped in any way.
+ *
+ * Returns the number of SG entries mapped or 0 on error.
+ */
+int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir)
+{
+   struct dev_pagemap *pgmap;
+   struct scatterlist *s;
+   phys_addr_t paddr;
+   int i;
+
+   /*
+* p2pdma mappings are not compatible with devices that use
+* dma_virt_ops. If the upper layers do the right thing
+* this should never happen because it will be prevented
+* by the check in pci_p2pdma_add_client()
+*/
+   if (WARN_ON_ONCE(IS_ENABLED(CONFIG_DMA_VIRT_OPS) &&
+dev->dma_ops == _virt_ops))
+   return 0;
+
+   for_each_sg(sg, s, nents, i) {
+   pgmap = sg_page(s)->pgmap;
+   paddr = sg_phys(s);
+
+   s->dma_address = paddr - pgmap->pci_p2pdma_bus_offset;
+   sg_dma_len(s) = s->length;
+   }
+
+   return nents;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9553370ebdad..0ac69ddf5fc4 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -125,6 +125,7 @@ struct dev_pagemap {
struct device *dev;
void *data;
enum memory_type type;
+   u64 pci_p2pdma_bus_offset;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 7bdaacfd5892..b6dfb6dc2e53 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -30,6 +30,8 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
 unsigned int *nents, u32 length);
 void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir);
 #else /* CONFIG_PCI_P2PDMA */
 static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
@@ -75,6 +77,11 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
 static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 {
 }
+static inline int pci_p2pdma_map_sg(struct device *dev,
+   struct scatterlist *sg, int nents, enum dma_data_direction dir)
+{
+   return 0;
+}
 #endif /* CONFIG_PCI_P2PDMA */
 
 
-- 
2.19.0



[PATCH v9 08/13] IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]()

2018-10-04 Thread Logan Gunthorpe
In order to use PCI P2P memory the pci_p2pmem_map_sg() function must be
called to map the correct PCI bus address.

To do this, check the first page in the scatter list to see if it is P2P
memory or not. At the moment, scatter lists that contain P2P memory must
be homogeneous so if the first page is P2P the entire SGL should be P2P.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Sagi Grimberg 
---
 drivers/infiniband/core/rw.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 683e6d11a564..d22c4a2ebac6 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -12,6 +12,7 @@
  */
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -280,7 +281,11 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
struct ib_device *dev = qp->pd->device;
int ret;
 
-   ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+   if (is_pci_p2pdma_page(sg_page(sg)))
+   ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
+   else
+   ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
+
if (!ret)
return -ENOMEM;
sg_cnt = ret;
@@ -602,7 +607,9 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
break;
}
 
-   ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+   /* P2PDMA contexts do not need to be unmapped */
+   if (!is_pci_p2pdma_page(sg_page(sg)))
+   ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy);
 
-- 
2.19.0



Re: [PATCH] dax: Fix deadlock in dax_lock_mapping_entry()

2018-10-04 Thread Jan Kara
On Thu 27-09-18 11:22:22, Dan Williams wrote:
> On Thu, Sep 27, 2018 at 6:41 AM Jan Kara  wrote:
> >
> > On Thu 27-09-18 06:28:43, Matthew Wilcox wrote:
> > > On Thu, Sep 27, 2018 at 01:23:32PM +0200, Jan Kara wrote:
> > > > When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
> > > > fail to unlock mapping->i_pages spinlock and thus immediately deadlock
> > > > against itself when retrying to grab the entry lock again. Fix the
> > > > problem by unlocking mapping->i_pages before retrying.
> > >
> > > It seems weird that xfstests doesn't provoke this ...
> >
> > The function currently gets called only from mm/memory-failure.c. And yes,
> > we are lacking DAX hwpoison error tests in fstests...
> 
> I have an item on my backlog to port the ndctl unit test that does
> memory_failure() injection vs ext4 over to fstests. That said I've
> been investigating a deadlock on ext4 caused by this test. When I saw
> this patch I hoped it was root cause, but the test is still failing
> for me. Vishal is able to pass the test on his system, so the failure
> mode is timing dependent. I'm running this patch on top of -rc5 and
> still seeing the following deadlock.

I went through the code but I don't see where the problem could be. How can
I run that test? Is KVM enough or do I need hardware with AEP dimms?

Honza

> 
> EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
> EXT4-fs (pmem0): mounted filesystem with ordered data mode. Opts: dax
> Injecting memory failure for pfn 0x208900 at process virtual
> address 0x7f587290
> Memory failure: 0x208900: Killing dax-pmd:7095 due to hardware
> memory corruption
> Memory failure: 0x208900: recovery action for dax page: Recovered
> watchdog: BUG: soft lockup - CPU#35 stuck for 22s! [dax-pmd:7095]
> [..]
> irq event stamp: 121911146
> hardirqs last  enabled at (121911145): [] _raw_spin_unlock_irq+0x29/0x40
> hardirqs last disabled at (121911146): [] trace_hardirqs_off_thunk+0x1a/0x1c
> softirqs last  enabled at (78238674): []
> __do_softirq+0x32e/0x428
> softirqs last disabled at (78238627): []
> irq_exit+0xf6/0x100
> CPU: 35 PID: 7095 Comm: dax-pmd Tainted: G   OE
> 4.19.0-rc5+ #2394
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
> RIP: 0010:lock_release+0x134/0x2a0
> [..]
> Call Trace:
>  find_get_entries+0x299/0x3c0
>  pagevec_lookup_entries+0x1a/0x30
>  dax_layout_busy_page+0x9c/0x280
>  ? __lock_acquire+0x12fa/0x1310
>  ext4_break_layouts+0x48/0x100
>  ? ext4_punch_hole+0x108/0x5a0
>  ext4_punch_hole+0x110/0x5a0
>  ext4_fallocate+0x189/0xa40
>  ? rcu_read_lock_sched_held+0x6b/0x80
>  ? rcu_sync_lockdep_assert+0x2e/0x60
>  vfs_fallocate+0x13f/0x270
> 
> The same test against xfs is not failing for me. I have been seeking
> some focus time to dig in on this.
-- 
Jan Kara 
SUSE Labs, CR


Re: Problems with VM_MIXEDMAP removal from /proc//smaps

2018-10-04 Thread Johannes Thumshirn
On Tue, Oct 02, 2018 at 08:06:34AM -0700, Christoph Hellwig wrote:
> There is no promise, sorry.

Well, there have been lots of articles, for instance on lwn.net [1] [2]
[3], describing how to avoid the "overhead" of the page cache when
running on persistent memory.

So if I were a database developer, I'd look into them and see how
I could exploit this for my needs.

So even if we don't want to call it a promise, it was at least an
advertisement and people are now taking our word for it.

[1] https://lwn.net/Articles/610174/
[2] https://lwn.net/Articles/717953/
[3] https://lwn.net/Articles/684828/ 

Byte,
  Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: Problems with VM_MIXEDMAP removal from /proc//smaps

2018-10-04 Thread Johannes Thumshirn
On Wed, Oct 03, 2018 at 06:44:07PM +0200, Jan Kara wrote:
> On Wed 03-10-18 08:13:37, Dan Williams wrote:
> > On Wed, Oct 3, 2018 at 8:07 AM Jan Kara  wrote:
> > > WRT per-inode DAX property, AFAIU that inode flag is just going to be
> > > advisory thing - i.e., use DAX if possible. If you mount a filesystem with
> > > these inode flags set in a configuration which does not allow DAX to be
> > > used, you will still be able to access such inodes but the access will use
> > > page cache instead. And querying these flags should better show real
> > > on-disk status and not just whether DAX is used as that would result in an
> > > even bigger mess. So this feature seems to be somewhat orthogonal to the
> > > API I'm looking for.
> > 
> > True, I imagine once we have that flag we will be able to distinguish
> > the "saved" property and the "effective / live" property of DAX...
> > Also it's really not DAX that applications care about as much as "is
> > there page-cache indirection / overhead for this mapping?". That seems
> > to be a narrower guarantee that we can make than what "DAX" might
> > imply.
> 
> Right. So what do people think about my suggestion earlier in the thread to
> use madvise(MADV_DIRECT_ACCESS) for this? Currently it would return success
> when DAX is in use, failure otherwise. Later we could extend it to be also
> used as a hint for caching policy for the inode...

Hmm apart from Dan's objection that it can't really be used for a
query, isn't madvise(2) for mmap(2)?

But AFAIU (from looking at the xfs code, so please correct me if I'm
wrong), DAX can be used for the traditional read(2)/write(2) interface
as well.

There is at least:

xfs_file_read_iter()
`-> if (IS_DAX(inode))
`-> xfs_file_dax_read()
`->dax_iomap_rw()

So IMHO something on an inode granularity would make more sense to me.

Byte,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

