[PATCH] dax: Fix last_page check in __bdev_dax_supported()

2019-05-15 Thread Vaibhav Jain
Presently __bdev_dax_supported() checks whether the first sector of the
last page ('last_page') on the block device is aligned to a page
boundary. However, the code that computes 'last_page' hard-codes 8
sectors per page, which is only true for a 4K page size.

That assumption breaks on architectures that use a different page size,
notably PPC64 where the page size is 64K. Hence a warning is seen while
trying to mount an xfs/ext4 file system with dax enabled:

$ sudo mount -o dax /dev/pmem0 /mnt/pmem
XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
XFS (pmem0): DAX unsupported by block device. Turning off DAX.

The patch fixes this issue by computing 'last_page' from the actual
number of sectors per page instead of assuming it to be 8.

Fixes: ad428cdb525a ("dax: Check the end of the block-device capacity with dax_direct_access()")
Signed-off-by: Chandan Rajendra 
Signed-off-by: Vaibhav Jain 
---
 drivers/dax/super.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index bbd57ca0634a..6b50ed3673d3 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -116,7 +116,9 @@ bool __bdev_dax_supported(struct block_device *bdev, int blocksize)
return false;
}
 
-   last_page = PFN_DOWN(i_size_read(bdev->bd_inode) - 1) * 8;
+   /* Calculate the first sector of last page on the block device */
+   last_page = PFN_DOWN(i_size_read(bdev->bd_inode) - 1) *
+   (PAGE_SIZE >> SECTOR_SHIFT);
err = bdev_dax_pgoff(bdev, last_page, PAGE_SIZE, &pgoff_end);
if (err) {
pr_debug("%s: error: unaligned partition for dax\n",
-- 
2.21.0
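
For illustration (not part of the patch), the effect of the divisor can be
seen with a small standalone C program; the device size below is an
arbitrary example value:

#include <stdio.h>

#define SECTOR_SHIFT    9                       /* 512-byte sectors */
#define SECTOR_SIZE     (1UL << SECTOR_SHIFT)

/* First sector of the last page, the value __bdev_dax_supported() needs */
static unsigned long first_sector_of_last_page(unsigned long dev_size,
                                               unsigned long page_size)
{
        unsigned long last_pfn = (dev_size - 1) / page_size;    /* PFN_DOWN() */

        return last_pfn * (page_size / SECTOR_SIZE);
}

int main(void)
{
        unsigned long dev_size = 16UL << 20;    /* example: 16 MiB device */

        printf("4K pages:  sector %lu\n",
               first_sector_of_last_page(dev_size, 4096));
        printf("64K pages: sector %lu\n",
               first_sector_of_last_page(dev_size, 65536));
        /*
         * Hard-coding "* 8" with 64K pages would give 255 * 8 = 2040,
         * nowhere near the last page, hence the bogus alignment failure.
         */
        return 0;
}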



Re: [GIT PULL] libnvdimm fixes for v5.2-rc1

2019-05-15 Thread pr-tracker-bot
The pull request you sent on Wed, 15 May 2019 17:05:58 -0700:

> git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-fixes-5.2-rc1

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/83f3ef3de625a5766de2382f9e077d4daafd5bac

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker


Re: [v5 0/3] "Hotremove" persistent memory

2019-05-15 Thread Dan Williams
On Wed, May 15, 2019 at 11:12 AM Pavel Tatashin
 wrote:
>
> > Hi Pavel,
> >
> > I am working on adding this sort of a workflow into a new daxctl command
> > (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> > dax device to kmem, online the resulting memory, and with your patches,
> > also attempt to offline the memory, and change back to device-dax.
> >
> > In running with these patches, and testing the offlining part, I ran
> > into the following lockdep below.
> >
> > This is with just these three patches on top of -rc7.
> >
> >
> > [  +0.004886] ==
> > [  +0.001576] WARNING: possible circular locking dependency detected
> > [  +0.001506] 5.1.0-rc7+ #13 Tainted: G   O
> > [  +0.000929] --
> > [  +0.000708] daxctl/22950 is trying to acquire lock:
> > [  +0.000548] f4d397f7 (kn->count#424){}, at: 
> > kernfs_remove_by_name_ns+0x40/0x80
> > [  +0.000922]
> >   but task is already holding lock:
> > [  +0.000657] 2aa52a9f (mem_sysfs_mutex){+.+.}, at: 
> > unregister_memory_section+0x22/0xa0
>
> I have studied this issue, and now have a clear understanding why it
> happens, I am not yet sure how to fix it, so suggestions are welcomed
> :)

I would think that ACPI hotplug would have a similar problem, but it does this:

acpi_unbind_memory_blocks(info);
__remove_memory(nid, info->start_addr, info->length);

I wonder if that ordering prevents going too deep into the
device_unregister() call stack that you highlighted below.


>
> Here is the problem:
>
> When we offline pages we have the following call stack:
>
> # echo offline > /sys/devices/system/memory/memory8/state
> ksys_write
>  vfs_write
>   __vfs_write
>kernfs_fop_write
> kernfs_get_active
>  lock_acquire   kn->count#122 (lock for
> "memory8/state" kn)
> sysfs_kf_write
>  dev_attr_store
>   state_store
>device_offline
> memory_subsys_offline
>  memory_block_action
>   offline_pages
>__offline_pages
> percpu_down_write
>  down_write
>   lock_acquire  mem_hotplug_lock.rw_sem
>
> When we unbind dax0.0 we have the following  stack:
> # echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
> drv_attr_store
>  unbind_store
>   device_driver_detach
>device_release_driver_internal
> dev_dax_kmem_remove
>  remove_memory  device_hotplug_lock
>   try_remove_memory mem_hotplug_lock.rw_sem
>arch_remove_memory
> __remove_pages
>  __remove_section
>   unregister_memory_section
>    remove_memory_section   mem_sysfs_mutex
> unregister_memory
>  device_unregister
>   device_del
>device_remove_attrs
> sysfs_remove_groups
>  sysfs_remove_group
>   remove_files
>kernfs_remove_by_name
> kernfs_remove_by_name_ns
>  __kernfs_remove   kn->count#122
>
> So, lockdep found the ordering issue with the above two stacks:
>
> 1. kn->count#122 -> mem_hotplug_lock.rw_sem
> 2. mem_hotplug_lock.rw_sem -> kn->count#122
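
For readers less used to lockdep reports, this is the classic ABBA
inversion. A rough user-space analogy, with pthread mutexes standing in
for kn->count and mem_hotplug_lock, looks like this:

#include <pthread.h>

static pthread_mutex_t kn_count = PTHREAD_MUTEX_INITIALIZER; /* "kn->count" */
static pthread_mutex_t hotplug  = PTHREAD_MUTEX_INITIALIZER; /* "mem_hotplug_lock" */

/* Path 1: echo offline > .../memory8/state (kernfs ref, then hotplug lock) */
static void *offline_via_sysfs(void *arg)
{
        pthread_mutex_lock(&kn_count);          /* kernfs_get_active() */
        pthread_mutex_lock(&hotplug);           /* __offline_pages() -> down_write() */
        pthread_mutex_unlock(&hotplug);
        pthread_mutex_unlock(&kn_count);
        return NULL;
}

/* Path 2: echo dax0.0 > .../kmem/unbind (hotplug lock, then kernfs removal) */
static void *unbind_kmem(void *arg)
{
        pthread_mutex_lock(&hotplug);           /* try_remove_memory() */
        pthread_mutex_lock(&kn_count);          /* __kernfs_remove() */
        pthread_mutex_unlock(&kn_count);
        pthread_mutex_unlock(&hotplug);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, offline_via_sysfs, NULL);
        pthread_create(&t2, NULL, unbind_kmem, NULL);
        pthread_join(t1, NULL); /* may never return if the two paths interleave */
        pthread_join(t2, NULL);
        return 0;
}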


Re: [RESEND PATCH] nvdimm: fix some compilation warnings

2019-05-15 Thread Qian Cai


> On May 15, 2019, at 7:25 PM, Dan Williams  wrote:
> 
> On Tue, May 14, 2019 at 8:08 AM Qian Cai  wrote:
>> 
>> Several places (dimm_devs.c, core.c etc) include label.h but only
>> label.c uses NSINDEX_SIGNATURE, so move its definition to label.c
>> instead.
>> In file included from drivers/nvdimm/dimm_devs.c:23:
>> drivers/nvdimm/label.h:41:19: warning: 'NSINDEX_SIGNATURE' defined but
>> not used [-Wunused-const-variable=]
>> 
>> The commit d9b83c756953 ("libnvdimm, btt: rework error clearing") left
>> an unused variable.
>> drivers/nvdimm/btt.c: In function 'btt_read_pg':
>> drivers/nvdimm/btt.c:1272:8: warning: variable 'rc' set but not used
>> [-Wunused-but-set-variable]
>> 
>> Last, some places abuse "/**" which is only reserved for the kernel-doc.
>> drivers/nvdimm/bus.c:648: warning: cannot understand function prototype:
>> 'struct attribute_group nd_device_attribute_group = '
>> drivers/nvdimm/bus.c:677: warning: cannot understand function prototype:
>> 'struct attribute_group nd_numa_attribute_group = '
> 
> Can you include the compiler where these errors start appearing, since
> I don't see these warnings with gcc-8.3.1

This can be reproduced by enabling the extra compiler checks, i.e. "make W=n".
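
A minimal standalone reproducer for the -Wunused-const-variable case is
below (the file name and the signature string are only illustrative);
build it with something like "gcc -c -Wunused-const-variable demo.c".
The kernel only enables this class of warning at the extra W= levels,
which is why a default build stays quiet.

/* demo.c: a file-scope static const that nothing in this unit references */
static const char NSINDEX_SIGNATURE[] = "NAMESPACE_INDEX\0";

int demo(void)
{
        return 0;       /* NSINDEX_SIGNATURE unused, so the warning fires */
}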


Re: [RESEND PATCH] nvdimm: fix some compilation warnings

2019-05-15 Thread Qian Cai

>>}
>> 
>> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
>> index 7ff684159f29..2eb6a6cfe9e4 100644
>> --- a/drivers/nvdimm/bus.c
>> +++ b/drivers/nvdimm/bus.c
>> @@ -642,7 +642,7 @@ static struct attribute *nd_device_attributes[] = {
>>NULL,
>> };
>> 
>> -/**
>> +/*
>>  * nd_device_attribute_group - generic attributes for all devices on an nd 
>> bus
>>  */
>> struct attribute_group nd_device_attribute_group = {
>> @@ -671,7 +671,7 @@ static umode_t nd_numa_attr_visible(struct kobject 
>> *kobj, struct attribute *a,
>>return a->mode;
>> }
>> 
>> -/**
>> +/*
>>  * nd_numa_attribute_group - NUMA attributes for all devices on an nd bus
>>  */
> 
> Lets just fix this to be a valid kernel-doc format for a struct.
> 
> @@ -672,7 +672,7 @@ static umode_t nd_numa_attr_visible(struct kobject
> *kobj, struct attribute *a,
> }
> 
> /**
> - * nd_numa_attribute_group - NUMA attributes for all devices on an nd bus
> + * struct nd_numa_attribute_group - NUMA attributes for all devices
> on an nd bus
>  */
> struct attribute_group nd_numa_attribute_group = {
>.attrs = nd_numa_attributes,

This won't work because kernel-doc is meant to document a struct definition,
but this is just an assignment. The "struct attribute_group" kernel-doc is in
include/linux/sysfs.h.


Re: [RESEND PATCH] nvdimm: fix some compilation warnings

2019-05-15 Thread Verma, Vishal L


On Wed, 2019-05-15 at 17:26 -0700, Dan Williams wrote:
> On Wed, May 15, 2019 at 5:25 PM Verma, Vishal L
>  wrote:
> > On Wed, 2019-05-15 at 16:25 -0700, Dan Williams wrote:
> > > > diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
> > > > index 4671776f5623..9f02a99cfac0 100644
> > > > --- a/drivers/nvdimm/btt.c
> > > > +++ b/drivers/nvdimm/btt.c
> > > > @@ -1269,11 +1269,9 @@ static int btt_read_pg(struct btt *btt,
> > > > struct bio_integrity_payload *bip,
> > > > 
> > > > ret = btt_data_read(arena, page, off, postmap,
> > > > cur_len);
> > > > if (ret) {
> > > > -   int rc;
> > > > -
> > > > /* Media error - set the e_flag */
> > > > -   rc = btt_map_write(arena, premap,
> > > > postmap, 0, 1,
> > > > -   NVDIMM_IO_ATOMIC);
> > > > +   btt_map_write(arena, premap, postmap, 0,
> > > > 1,
> > > > + NVDIMM_IO_ATOMIC);
> > > > goto out_rtt;
> > > 
> > > This doesn't look correct to me, shouldn't we at least be logging
> > > that
> > > the bad-block failed to be persistently tracked?
> > 
> > Yes logging it sounds good to me. Qian, can you include this in your
> > respin or shall I send a fix for it separately (since we were always
> > ignoring the failure here regardless of this patch)?
> 
> I think a separate fix for this makes more sense. Likely also needs to
> be a ratelimited message in case a storm of errors is encountered.

Yes, good point on rate limiting. I was thinking WARN_ONCE, but that
might mask errors for distinct blocks, so a rate-limited printk should
work best. I'll prepare a patch.
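
A sketch of what such a rate-limited message could look like at the
btt_map_write() call site quoted above; dev_warn_ratelimited() is the
stock helper, while the to_dev() accessor and the message text are
assumptions here, not the final patch:

                ret = btt_data_read(arena, page, off, postmap, cur_len);
                if (ret) {
                        int rc;

                        /* Media error - set the e_flag */
                        rc = btt_map_write(arena, premap, postmap, 0, 1,
                                        NVDIMM_IO_ATOMIC);
                        if (rc)
                                dev_warn_ratelimited(to_dev(arena),
                                        "failed to persist e_flag for lba %u\n",
                                        premap);
                        goto out_rtt;
                }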



Re: [RESEND PATCH] nvdimm: fix some compilation warnings

2019-05-15 Thread Dan Williams
On Wed, May 15, 2019 at 5:25 PM Verma, Vishal L
 wrote:
>
> On Wed, 2019-05-15 at 16:25 -0700, Dan Williams wrote:
> >
> > > diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
> > > index 4671776f5623..9f02a99cfac0 100644
> > > --- a/drivers/nvdimm/btt.c
> > > +++ b/drivers/nvdimm/btt.c
> > > @@ -1269,11 +1269,9 @@ static int btt_read_pg(struct btt *btt, struct 
> > > bio_integrity_payload *bip,
> > >
> > > ret = btt_data_read(arena, page, off, postmap, cur_len);
> > > if (ret) {
> > > -   int rc;
> > > -
> > > /* Media error - set the e_flag */
> > > -   rc = btt_map_write(arena, premap, postmap, 0, 1,
> > > -   NVDIMM_IO_ATOMIC);
> > > +   btt_map_write(arena, premap, postmap, 0, 1,
> > > + NVDIMM_IO_ATOMIC);
> > > goto out_rtt;
> >
> > This doesn't look correct to me, shouldn't we at least be logging that
> > the bad-block failed to be persistently tracked?
>
> Yes logging it sounds good to me. Qian, can you include this in your
> respin or shall I send a fix for it separately (since we were always
> ignoring the failure here regardless of this patch)?

I think a separate fix for this makes more sense. Likely also needs to
be a ratelimited message in case a storm of errors is encountered.


Re: [RESEND PATCH] nvdimm: fix some compilation warnings

2019-05-15 Thread Verma, Vishal L
On Wed, 2019-05-15 at 16:25 -0700, Dan Williams wrote:
> 
> > diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
> > index 4671776f5623..9f02a99cfac0 100644
> > --- a/drivers/nvdimm/btt.c
> > +++ b/drivers/nvdimm/btt.c
> > @@ -1269,11 +1269,9 @@ static int btt_read_pg(struct btt *btt, struct 
> > bio_integrity_payload *bip,
> > 
> > ret = btt_data_read(arena, page, off, postmap, cur_len);
> > if (ret) {
> > -   int rc;
> > -
> > /* Media error - set the e_flag */
> > -   rc = btt_map_write(arena, premap, postmap, 0, 1,
> > -   NVDIMM_IO_ATOMIC);
> > +   btt_map_write(arena, premap, postmap, 0, 1,
> > + NVDIMM_IO_ATOMIC);
> > goto out_rtt;
> 
> This doesn't look correct to me, shouldn't we at least be logging that
> the bad-block failed to be persistently tracked?

Yes logging it sounds good to me. Qian, can you include this in your
respin or shall I send a fix for it separately (since we were always
ignoring the failure here regardless of this patch)?




Re: [PATCH v2 12/30] dax: remove block device dependencies

2019-05-15 Thread Dan Williams
On Wed, May 15, 2019 at 12:28 PM Vivek Goyal  wrote:
>
> From: Stefan Hajnoczi 
>
> Although struct dax_device itself is not tied to a block device, some
> DAX code assumes there is a block device.  Make block devices optional
> by allowing bdev to be NULL in commonly used DAX APIs.
>
> When there is no block device:
>  * Skip the partition offset calculation in bdev_dax_pgoff()
>  * Skip the blkdev_issue_zeroout() optimization
>
> Note that more block device assumptions remain but I haven't reached those
> code paths yet.
>

Is there a generic object that non-block-based filesystems reference
for physical storage as a bdev stand-in? I assume "sector_t" is still
the common type for addressing filesystem capacity?

It just seems to me that we should stop pretending that the
filesystem-dax facility requires block devices and try to move this
functionality to generically use a dax device across all interfaces.


Re: [PATCH v9 1/7] libnvdimm: nd_region flush callback support

2019-05-15 Thread Dan Williams
On Tue, May 14, 2019 at 7:55 AM Pankaj Gupta  wrote:
>
> This patch adds functionality to perform flush from guest
> to host over VIRTIO. We are registering a callback based
> on 'nd_region' type. virtio_pmem driver requires this special
> flush function. For rest of the region types we are registering
> existing flush function. Report error returned by host fsync
> failure to userspace.
>
> Signed-off-by: Pankaj Gupta 
> ---
>  drivers/acpi/nfit/core.c |  4 ++--
>  drivers/nvdimm/claim.c   |  6 --
>  drivers/nvdimm/nd.h  |  1 +
>  drivers/nvdimm/pmem.c| 13 -
>  drivers/nvdimm/region_devs.c | 26 --
>  include/linux/libnvdimm.h|  8 +++-
>  6 files changed, 46 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index 5a389a4f4f65..08dde76cf459 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -2434,7 +2434,7 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, 
> unsigned int bw,
> offset = to_interleave_offset(offset, mmio);
>
> writeq(cmd, mmio->addr.base + offset);
> -   nvdimm_flush(nfit_blk->nd_region);
> +   nvdimm_flush(nfit_blk->nd_region, NULL);
>
> if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH)
> readq(mmio->addr.base + offset);
> @@ -2483,7 +2483,7 @@ static int acpi_nfit_blk_single_io(struct nfit_blk 
> *nfit_blk,
> }
>
> if (rw)
> -   nvdimm_flush(nfit_blk->nd_region);
> +   nvdimm_flush(nfit_blk->nd_region, NULL);
>
> rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
> return rc;
> diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
> index fb667bf469c7..13510bae1e6f 100644
> --- a/drivers/nvdimm/claim.c
> +++ b/drivers/nvdimm/claim.c
> @@ -263,7 +263,7 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
> struct nd_namespace_io *nsio = to_nd_namespace_io(>dev);
> unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512);
> sector_t sector = offset >> 9;
> -   int rc = 0;
> +   int rc = 0, ret = 0;
>
> if (unlikely(!size))
> return 0;
> @@ -301,7 +301,9 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
> }
>
> memcpy_flushcache(nsio->addr + offset, buf, size);
> -   nvdimm_flush(to_nd_region(ndns->dev.parent));
> +   ret = nvdimm_flush(to_nd_region(ndns->dev.parent), NULL);
> +   if (ret)
> +   rc = ret;
>
> return rc;
>  }
> diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
> index a5ac3b240293..0c74d2428bd7 100644
> --- a/drivers/nvdimm/nd.h
> +++ b/drivers/nvdimm/nd.h
> @@ -159,6 +159,7 @@ struct nd_region {
> struct badblocks bb;
> struct nd_interleave_set *nd_set;
> struct nd_percpu_lane __percpu *lane;
> +   int (*flush)(struct nd_region *nd_region, struct bio *bio);

So this triggers:

In file included from drivers/nvdimm/e820.c:7:
./include/linux/libnvdimm.h:140:51: warning: ‘struct bio’ declared
inside parameter list will not be visible outside of this definition
or declaration
  int (*flush)(struct nd_region *nd_region, struct bio *bio);
   ^~~
I was already feeling uneasy about trying to squeeze this into v5.2,
but this warning and the continued drip of comments lead me to
conclude that this driver would do well to wait one more development
cycle. Let's close out the final fixups and let this driver soak in
-next. Then for the v5.3 cycle I'll redouble my efforts towards the
goal of closing patch acceptance at the -rc6 / -rc7 development
milestone.
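
A minimal standalone illustration of that warning class and the usual
fix, namely giving the header a forward declaration instead of pulling in
the full definition (the struct name 'example_desc' is made up):

/*
 * Without these forward declarations, the function-pointer member below
 * triggers "'struct bio' declared inside parameter list will not be
 * visible outside of this definition or declaration".
 */
struct bio;
struct nd_region;

struct example_desc {
        int (*flush)(struct nd_region *nd_region, struct bio *bio);
};

int example_has_flush(const struct example_desc *d)
{
        return d && d->flush;
}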


Re: [PATCH v9 2/7] virtio-pmem: Add virtio pmem driver

2019-05-15 Thread David Hildenbrand
On 15.05.19 22:46, David Hildenbrand wrote:
>> +vpmem->vdev = vdev;
>> +vdev->priv = vpmem;
>> +err = init_vq(vpmem);
>> +if (err) {
>> +dev_err(>dev, "failed to initialize virtio pmem vq's\n");
>> +goto out_err;
>> +}
>> +
>> +virtio_cread(vpmem->vdev, struct virtio_pmem_config,
>> +start, >start);
>> +virtio_cread(vpmem->vdev, struct virtio_pmem_config,
>> +size, >size);
>> +
>> +res.start = vpmem->start;
>> +res.end   = vpmem->start + vpmem->size-1;
> 
> nit: " - 1;"
> 
>> +vpmem->nd_desc.provider_name = "virtio-pmem";
>> +vpmem->nd_desc.module = THIS_MODULE;
>> +
>> +vpmem->nvdimm_bus = nvdimm_bus_register(>dev,
>> +>nd_desc);
>> +if (!vpmem->nvdimm_bus) {
>> +dev_err(>dev, "failed to register device with 
>> nvdimm_bus\n");
>> +err = -ENXIO;
>> +goto out_vq;
>> +}
>> +
>> +dev_set_drvdata(>dev, vpmem->nvdimm_bus);
>> +
>> +ndr_desc.res = 
>> +ndr_desc.numa_node = nid;
>> +ndr_desc.flush = async_pmem_flush;
>> +set_bit(ND_REGION_PAGEMAP, _desc.flags);
>> +set_bit(ND_REGION_ASYNC, _desc.flags);
>> +nd_region = nvdimm_pmem_region_create(vpmem->nvdimm_bus, _desc);
>> +if (!nd_region) {
>> +dev_err(>dev, "failed to create nvdimm region\n");
>> +err = -ENXIO;
>> +goto out_nd;
>> +}
>> +nd_region->provider_data = dev_to_virtio(nd_region->dev.parent->parent);
>> +return 0;
>> +out_nd:
>> +nvdimm_bus_unregister(vpmem->nvdimm_bus);
>> +out_vq:
>> +vdev->config->del_vqs(vdev);
>> +out_err:
>> +return err;
>> +}
>> +
>> +static void virtio_pmem_remove(struct virtio_device *vdev)
>> +{
>> +struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(>dev);
>> +
>> +nvdimm_bus_unregister(nvdimm_bus);
>> +vdev->config->del_vqs(vdev);
>> +vdev->config->reset(vdev);
>> +}
>> +
>> +static struct virtio_driver virtio_pmem_driver = {
>> +.driver.name= KBUILD_MODNAME,
>> +.driver.owner   = THIS_MODULE,
>> +.id_table   = id_table,
>> +.probe  = virtio_pmem_probe,
>> +.remove = virtio_pmem_remove,
>> +};
>> +
>> +module_virtio_driver(virtio_pmem_driver);
>> +MODULE_DEVICE_TABLE(virtio, id_table);
>> +MODULE_DESCRIPTION("Virtio pmem driver");
>> +MODULE_LICENSE("GPL");
>> diff --git a/drivers/nvdimm/virtio_pmem.h b/drivers/nvdimm/virtio_pmem.h
>> new file mode 100644
>> index ..ab1da877575d
>> --- /dev/null
>> +++ b/drivers/nvdimm/virtio_pmem.h
>> @@ -0,0 +1,60 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * virtio_pmem.h: virtio pmem Driver
>> + *
>> + * Discovers persistent memory range information
>> + * from host and provides a virtio based flushing
>> + * interface.
>> + **/
>> +
>> +#ifndef _LINUX_VIRTIO_PMEM_H
>> +#define _LINUX_VIRTIO_PMEM_H
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +struct virtio_pmem_request {
>> +/* Host return status corresponding to flush request */
>> +int ret;
>> +
>> +/* command name*/
>> +char name[16];
> 
> So ... why are we sending string commands and expect native-endianess
> integers and don't define a proper request/response structure + request
> types in include/uapi/linux/virtio_pmem.h like
> 
> struct virtio_pmem_resp {
>   __virtio32 ret;
> }

FWIW, I wonder if we should even properly translate return values and
define types like

VIRTIO_PMEM_RESP_TYPE_OK0
VIRTIO_PMEM_RESP_TYPE_EIO   1

..

> 
> #define VIRTIO_PMEM_REQ_TYPE_FLUSH1
> struct virtio_pmem_req {
>   __virtio16 type;
> }
> 
> ... and this way we also define a proper endianess format for exchange
> and keep it extensible
> 
> @MST, what's your take on this?
> 
> 


-- 

Thanks,

David / dhildenb
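
A sketch of the kind of fixed-layout uapi definitions being proposed
above; the names and values are illustrative, not an agreed-upon ABI:

/* include/uapi/linux/virtio_pmem.h (sketch) */
#include <linux/types.h>
#include <linux/virtio_types.h>

/* Guest request types */
#define VIRTIO_PMEM_REQ_TYPE_FLUSH      1

/* Host return codes */
#define VIRTIO_PMEM_RESP_TYPE_OK        0
#define VIRTIO_PMEM_RESP_TYPE_EIO       1

struct virtio_pmem_req {
        __virtio16 type;        /* e.g. VIRTIO_PMEM_REQ_TYPE_FLUSH */
};

struct virtio_pmem_resp {
        __virtio32 ret;         /* one of VIRTIO_PMEM_RESP_TYPE_* */
};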


Re: [Qemu-devel] [PATCH v9 2/7] virtio-pmem: Add virtio pmem driver

2019-05-15 Thread Dan Williams
On Tue, May 14, 2019 at 8:25 AM Pankaj Gupta  wrote:
>
>
> > On 5/14/19 7:54 AM, Pankaj Gupta wrote:
> > > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > > index 35897649c24f..94bad084ebab 100644
> > > --- a/drivers/virtio/Kconfig
> > > +++ b/drivers/virtio/Kconfig
> > > @@ -42,6 +42,17 @@ config VIRTIO_PCI_LEGACY
> > >
> > >   If unsure, say Y.
> > >
> > > +config VIRTIO_PMEM
> > > +   tristate "Support for virtio pmem driver"
> > > +   depends on VIRTIO
> > > +   depends on LIBNVDIMM
> > > +   help
> > > +   This driver provides access to virtio-pmem devices, storage devices
> > > +   that are mapped into the physical address space - similar to NVDIMMs
> > > +- with a virtio-based flushing interface.
> > > +
> > > +   If unsure, say M.
> >
> > 
> > from Documentation/process/coding-style.rst:
> > "Lines under a ``config`` definition
> > are indented with one tab, while help text is indented an additional two
> > spaces."
>
> ah... I changed help text and 'checkpatch' did not say anything :( .
>
> Will wait for Dan, If its possible to add two spaces to help text while 
> applying
> the series.

I'm inclined to handle this with a fixup appended to the end of the
series just so the patchwork-bot does not get confused by the content
changing from what was sent to the list.


Re: [PATCH v9 2/7] virtio-pmem: Add virtio pmem driver

2019-05-15 Thread David Hildenbrand
> + vpmem->vdev = vdev;
> + vdev->priv = vpmem;
> + err = init_vq(vpmem);
> + if (err) {
> + dev_err(>dev, "failed to initialize virtio pmem vq's\n");
> + goto out_err;
> + }
> +
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + start, >start);
> + virtio_cread(vpmem->vdev, struct virtio_pmem_config,
> + size, >size);
> +
> + res.start = vpmem->start;
> + res.end   = vpmem->start + vpmem->size-1;

nit: " - 1;"

> + vpmem->nd_desc.provider_name = "virtio-pmem";
> + vpmem->nd_desc.module = THIS_MODULE;
> +
> + vpmem->nvdimm_bus = nvdimm_bus_register(>dev,
> + >nd_desc);
> + if (!vpmem->nvdimm_bus) {
> + dev_err(>dev, "failed to register device with 
> nvdimm_bus\n");
> + err = -ENXIO;
> + goto out_vq;
> + }
> +
> + dev_set_drvdata(>dev, vpmem->nvdimm_bus);
> +
> + ndr_desc.res = 
> + ndr_desc.numa_node = nid;
> + ndr_desc.flush = async_pmem_flush;
> + set_bit(ND_REGION_PAGEMAP, _desc.flags);
> + set_bit(ND_REGION_ASYNC, _desc.flags);
> + nd_region = nvdimm_pmem_region_create(vpmem->nvdimm_bus, _desc);
> + if (!nd_region) {
> + dev_err(>dev, "failed to create nvdimm region\n");
> + err = -ENXIO;
> + goto out_nd;
> + }
> + nd_region->provider_data = dev_to_virtio(nd_region->dev.parent->parent);
> + return 0;
> +out_nd:
> + nvdimm_bus_unregister(vpmem->nvdimm_bus);
> +out_vq:
> + vdev->config->del_vqs(vdev);
> +out_err:
> + return err;
> +}
> +
> +static void virtio_pmem_remove(struct virtio_device *vdev)
> +{
> + struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(>dev);
> +
> + nvdimm_bus_unregister(nvdimm_bus);
> + vdev->config->del_vqs(vdev);
> + vdev->config->reset(vdev);
> +}
> +
> +static struct virtio_driver virtio_pmem_driver = {
> + .driver.name= KBUILD_MODNAME,
> + .driver.owner   = THIS_MODULE,
> + .id_table   = id_table,
> + .probe  = virtio_pmem_probe,
> + .remove = virtio_pmem_remove,
> +};
> +
> +module_virtio_driver(virtio_pmem_driver);
> +MODULE_DEVICE_TABLE(virtio, id_table);
> +MODULE_DESCRIPTION("Virtio pmem driver");
> +MODULE_LICENSE("GPL");
> diff --git a/drivers/nvdimm/virtio_pmem.h b/drivers/nvdimm/virtio_pmem.h
> new file mode 100644
> index ..ab1da877575d
> --- /dev/null
> +++ b/drivers/nvdimm/virtio_pmem.h
> @@ -0,0 +1,60 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * virtio_pmem.h: virtio pmem Driver
> + *
> + * Discovers persistent memory range information
> + * from host and provides a virtio based flushing
> + * interface.
> + **/
> +
> +#ifndef _LINUX_VIRTIO_PMEM_H
> +#define _LINUX_VIRTIO_PMEM_H
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct virtio_pmem_request {
> + /* Host return status corresponding to flush request */
> + int ret;
> +
> + /* command name*/
> + char name[16];

So ... why are we sending string commands and expecting native-endianness
integers, instead of defining a proper request/response structure + request
types in include/uapi/linux/virtio_pmem.h, like

struct virtio_pmem_resp {
__virtio32 ret;
}

#define VIRTIO_PMEM_REQ_TYPE_FLUSH  1
struct virtio_pmem_req {
__virtio16 type;
}

... and this way we also define a proper endianness format for exchange
and keep it extensible

@MST, what's your take on this?


-- 

Thanks,

David / dhildenb


Re: [GIT PULL] device-dax for 5.1: PMEM as RAM

2019-05-15 Thread Dan Williams
On Mon, Mar 11, 2019 at 5:08 PM Linus Torvalds
 wrote:
>
> On Mon, Mar 11, 2019 at 8:37 AM Dan Williams  wrote:
> >
> > Another feature the userspace tooling can support for the PMEM as RAM
> > case is the ability to complete an Address Range Scrub of the range
> > before it is added to the core-mm. I.e at least ensure that previously
> > encountered poison is eliminated.
>
> Ok, so this at least makes sense as an argument to me.
>
> In the "PMEM as filesystem" part, the errors have long-term history,
> while in "PMEM as RAM" the memory may be physically the same thing,
> but it doesn't have the history and as such may not be prone to
> long-term errors the same way.
>
> So that validly argues that yes, when used as RAM, the likelihood for
> errors is much lower because they don't accumulate the same way.

In case anyone is looking for the above mentioned tooling for use with
the v5.1 kernel, Vishal has released ndctl-v65 with the new
"clear-errors" command [1].

[1]: https://pmem.io/ndctl/ndctl-clear-errors.html


[PATCH v2 12/30] dax: remove block device dependencies

2019-05-15 Thread Vivek Goyal
From: Stefan Hajnoczi 

Although struct dax_device itself is not tied to a block device, some
DAX code assumes there is a block device.  Make block devices optional
by allowing bdev to be NULL in commonly used DAX APIs.

When there is no block device:
 * Skip the partition offset calculation in bdev_dax_pgoff()
 * Skip the blkdev_issue_zeroout() optimization

Note that more block device assumptions remain but I haven't reached those
code paths yet.

Signed-off-by: Stefan Hajnoczi 
---
 drivers/dax/super.c | 3 ++-
 fs/dax.c| 7 ++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0a339b85133e..cb44ec663991 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -53,7 +53,8 @@ EXPORT_SYMBOL_GPL(dax_read_unlock);
 int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
pgoff_t *pgoff)
 {
-   phys_addr_t phys_off = (get_start_sect(bdev) + sector) * 512;
+   sector_t start_sect = bdev ? get_start_sect(bdev) : 0;
+   phys_addr_t phys_off = (start_sect + sector) * 512;
 
if (pgoff)
*pgoff = PHYS_PFN(phys_off);
diff --git a/fs/dax.c b/fs/dax.c
index e5e54da1715f..815bc32fd967 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1042,7 +1042,12 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
 static bool dax_range_is_aligned(struct block_device *bdev,
 unsigned int offset, unsigned int length)
 {
-   unsigned short sector_size = bdev_logical_block_size(bdev);
+   unsigned short sector_size;
+
+   if (!bdev)
+   return false;
+
+   sector_size = bdev_logical_block_size(bdev);
 
if (!IS_ALIGNED(offset, sector_size))
return false;
-- 
2.20.1



[PATCH v2 11/30] virtio_fs: add skeleton virtio_fs.ko module

2019-05-15 Thread Vivek Goyal
From: Stefan Hajnoczi 

Add a basic file system module for virtio-fs.

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Vivek Goyal 
---
 fs/fuse/Kconfig |  11 +
 fs/fuse/Makefile|   1 +
 fs/fuse/fuse_i.h|  13 +
 fs/fuse/inode.c |  15 +-
 fs/fuse/virtio_fs.c | 956 
 include/uapi/linux/virtio_fs.h  |  41 ++
 include/uapi/linux/virtio_ids.h |   1 +
 7 files changed, 1035 insertions(+), 3 deletions(-)
 create mode 100644 fs/fuse/virtio_fs.c
 create mode 100644 include/uapi/linux/virtio_fs.h

diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index 76f09ce7e5b2..46e9a8ff9f7a 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -26,3 +26,14 @@ config CUSE
 
  If you want to develop or use a userspace character device
  based on CUSE, answer Y or M.
+
+config VIRTIO_FS
+   tristate "Virtio Filesystem"
+   depends on FUSE_FS
+   select VIRTIO
+   help
+ The Virtio Filesystem allows guests to mount file systems from the
+  host.
+
+ If you want to share files between guests or with the host, answer Y
+  or M.
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index f7b807bc1027..47b78fac5809 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -4,5 +4,6 @@
 
 obj-$(CONFIG_FUSE_FS) += fuse.o
 obj-$(CONFIG_CUSE) += cuse.o
+obj-$(CONFIG_VIRTIO_FS) += virtio_fs.o
 
 fuse-objs := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 4008ed65a48d..f5cb4d40b83f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -59,15 +59,18 @@ extern unsigned max_user_congthresh;
 /** Mount options */
 struct fuse_mount_data {
int fd;
+   const char *tag; /* lifetime: .fill_super() data argument */
unsigned rootmode;
kuid_t user_id;
kgid_t group_id;
unsigned fd_present:1;
+   unsigned tag_present:1;
unsigned rootmode_present:1;
unsigned user_id_present:1;
unsigned group_id_present:1;
unsigned default_permissions:1;
unsigned allow_other:1;
+   unsigned destroy:1;
unsigned max_read;
unsigned blksize;
 
@@ -465,6 +468,9 @@ struct fuse_req {
 
/** Request is stolen from fuse_file->reserved_req */
struct file *stolen_file;
+
+   /** virtio-fs's physically contiguous buffer for in and out args */
+   void *argbuf;
 };
 
 struct fuse_iqueue;
@@ -1070,6 +1076,13 @@ int parse_fuse_opt(char *opt, struct fuse_mount_data *d, 
int is_bdev,
 int fuse_fill_super_common(struct super_block *sb,
   struct fuse_mount_data *mount_data);
 
+/**
+ * Disassociate fuse connection from superblock and kill the superblock
+ *
+ * Calls kill_anon_super(); do not use with bdev mounts.
+ */
+void fuse_kill_sb_anon(struct super_block *sb);
+
 /**
  * Add connection to control filesystem
  */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 9b0114437a14..731a8a74d032 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -434,6 +434,7 @@ static int fuse_statfs(struct dentry *dentry, struct 
kstatfs *buf)
 
 enum {
OPT_FD,
+   OPT_TAG,
OPT_ROOTMODE,
OPT_USER_ID,
OPT_GROUP_ID,
@@ -446,6 +447,7 @@ enum {
 
 static const match_table_t tokens = {
{OPT_FD,"fd=%u"},
+   {OPT_TAG,   "tag=%s"},
{OPT_ROOTMODE,  "rootmode=%o"},
{OPT_USER_ID,   "user_id=%u"},
{OPT_GROUP_ID,  "group_id=%u"},
@@ -492,6 +494,11 @@ int parse_fuse_opt(char *opt, struct fuse_mount_data *d, 
int is_bdev,
d->fd_present = 1;
break;
 
+   case OPT_TAG:
+   d->tag = args[0].from;
+   d->tag_present = 1;
+   break;
+
case OPT_ROOTMODE:
if (match_octal([0], ))
return 0;
@@ -1170,7 +1177,7 @@ int fuse_fill_super_common(struct super_block *sb,
/* Root dentry doesn't have .d_revalidate */
sb->s_d_op = _dentry_operations;
 
-   if (is_bdev) {
+   if (mount_data->destroy) {
fc->destroy_req = fuse_request_alloc(0);
if (!fc->destroy_req)
goto err_put_root;
@@ -1216,7 +1223,7 @@ static int fuse_fill_super(struct super_block *sb, void 
*data, int silent)
err = -EINVAL;
if (!parse_fuse_opt(data, , is_bdev, sb->s_user_ns))
goto err;
-   if (!d.fd_present)
+   if (!d.fd_present || d.tag_present)
goto err;
 
file = fget(d.fd);
@@ -1239,6 +1246,7 @@ static int fuse_fill_super(struct super_block *sb, void 
*data, int silent)
d.fiq_ops = _dev_fiq_ops;
d.fiq_priv = NULL;
d.fudptr = >private_data;
+   d.destroy = 

[PATCH v2 26/30] fuse: Add logic to free up a memory range

2019-05-15 Thread Vivek Goyal
Add logic to free up a busy memory range. Freed memory range will be
returned to free pool. Add a worker which can be started to select
and free some busy memory ranges.

In certain cases (the write path), a process can steal one of its own busy
dax ranges if no free range is available.

If no free range is available and nothing can be stolen from the same
inode, the caller waits on a waitq for a free range to become available.

For reclaiming a range, as of now we need to hold following locks in
specified order.

inode_trylock(inode);
down_write(>i_mmap_sem);
down_write(>i_dmap_sem);

This means one cannot wait for a range to become free while in the fault
path, because that can lead to a deadlock in the following two situations.

- The worker thread that frees memory might block on fuse_inode->i_mmap_sem as well.
- This inode is holding all the memory and no more memory can be freed.

In both cases a deadlock will ensue. So return -ENOSPC from iomap_begin()
in the fault path if memory can't be allocated, drop fuse_inode->i_mmap_sem,
and wait for a free range to become available, then retry.

The read path can't do direct reclaim either, because it holds the inode
lock shared while reclaim assumes the inode lock is held exclusively. Due to
the shared lock, it might happen that one reader is still reading from a range
while another reader reclaims that range, leading to problems. So the read path
also returns -ENOSPC and higher layers retry (like the fault path).

The write path is a different story. We hold the inode lock and lock ordering
allows us to grab fuse_inode->i_mmap_sem, if needed. That means we can do
direct reclaim in that path. But if there is no memory allocated to this inode,
then direct reclaim will not work and we need to wait for a memory range
to become free. So try the following order.

A. Try to get a free range.
B. If not, try direct reclaim.
C. If not, wait for a memory range to become free.

Here sleeping with locks held should be fine because in step B we made
sure this inode is not holding any ranges. That means other inodes are
holding ranges and somebody should be able to free memory. Also, the worker
thread does a trylock() on the inode lock. That means the worker thread will
not wait on this inode and will move on to the next memory range. Hence the
above sequence should be deadlock free.
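
Roughly, the A/B/C order above for the write path looks like the sketch
below; alloc_dax_mapping() and alloc_dax_mapping_reclaim() are helpers in
this patch, while the wait helper name is invented for the illustration:

static struct fuse_dax_mapping *
get_dax_range_for_write(struct fuse_conn *fc, struct inode *inode)
{
        struct fuse_dax_mapping *dmap;

        /* A. Try to get a free range. */
        dmap = alloc_dax_mapping(fc);
        if (dmap)
                return dmap;

        /*
         * B. Try direct reclaim: the inode lock is held exclusively
         * here, so lock ordering allows taking fi->i_mmap_sem.
         */
        dmap = alloc_dax_mapping_reclaim(fc, inode);
        if (dmap)
                return dmap;

        /*
         * C. This inode holds no ranges, so another inode must, and the
         * reclaim worker only trylocks inodes; sleeping here with our
         * locks held cannot block it. (Invented helper name.)
         */
        return wait_for_free_dax_range(fc);
}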

Signed-off-by: Vivek Goyal 
---
 fs/fuse/file.c   | 357 +--
 fs/fuse/fuse_i.h |  22 +++
 fs/fuse/inode.c  |   4 +
 3 files changed, 374 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3f0f7a387341..87fc2b5e0a3a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -25,6 +25,8 @@
 INTERVAL_TREE_DEFINE(struct fuse_dax_mapping, rb, __u64, __subtree_last,
  START, LAST, static inline, fuse_dax_interval_tree);
 
+static struct fuse_dax_mapping *alloc_dax_mapping_reclaim(struct fuse_conn *fc,
+   struct inode *inode);
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
  int opcode, struct fuse_open_out *outargp)
 {
@@ -179,6 +181,7 @@ static void fuse_link_write_file(struct file *file)
 
 static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
 {
+   unsigned long free_threshold;
struct fuse_dax_mapping *dmap = NULL;
 
spin_lock(>lock);
@@ -186,7 +189,7 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct 
fuse_conn *fc)
/* TODO: Add logic to try to free up memory if wait is allowed */
if (fc->nr_free_ranges <= 0) {
spin_unlock(>lock);
-   return NULL;
+   goto out_kick;
}
 
WARN_ON(list_empty(>free_ranges));
@@ -197,15 +200,43 @@ static struct fuse_dax_mapping *alloc_dax_mapping(struct 
fuse_conn *fc)
list_del_init(>list);
fc->nr_free_ranges--;
spin_unlock(>lock);
+
+out_kick:
+   /* If number of free ranges are below threshold, start reclaim */
+   free_threshold = max((fc->nr_ranges * FUSE_DAX_RECLAIM_THRESHOLD)/100,
+   (unsigned long)1);
+   if (fc->nr_free_ranges < free_threshold) {
+   pr_debug("fuse: Kicking dax memory reclaim worker. 
nr_free_ranges=0x%ld nr_total_ranges=%ld\n", fc->nr_free_ranges, fc->nr_ranges);
+   queue_delayed_work(system_long_wq, >dax_free_work, 0);
+   }
return dmap;
 }
 
+/* This assumes fc->lock is held */
+static void __dmap_remove_busy_list(struct fuse_conn *fc,
+   struct fuse_dax_mapping *dmap)
+{
+   list_del_init(>busy_list);
+   WARN_ON(fc->nr_busy_ranges == 0);
+   fc->nr_busy_ranges--;
+}
+
+static void dmap_remove_busy_list(struct fuse_conn *fc,
+   struct fuse_dax_mapping *dmap)
+{
+   spin_lock(>lock);
+   __dmap_remove_busy_list(fc, dmap);
+   spin_unlock(>lock);
+}
+
 /* This assumes fc->lock is held */
 static void __free_dax_mapping(struct fuse_conn *fc,
  

[PATCH v2 07/30] fuse: export fuse_get_unique()

2019-05-15 Thread Vivek Goyal
From: Stefan Hajnoczi 

virtio-fs will need unique IDs for FORGET requests from outside
fs/fuse/dev.c.  Make the symbol visible.

Signed-off-by: Stefan Hajnoczi 
---
 fs/fuse/dev.c| 3 ++-
 fs/fuse/fuse_i.h | 5 +
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 40eb827caa10..42fd3b576686 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -363,11 +363,12 @@ unsigned fuse_len_args(unsigned numargs, struct fuse_arg 
*args)
 }
 EXPORT_SYMBOL_GPL(fuse_len_args);
 
-static u64 fuse_get_unique(struct fuse_iqueue *fiq)
+u64 fuse_get_unique(struct fuse_iqueue *fiq)
 {
fiq->reqctr += FUSE_REQ_ID_STEP;
return fiq->reqctr;
 }
+EXPORT_SYMBOL_GPL(fuse_get_unique);
 
 static unsigned int fuse_req_hash(u64 unique)
 {
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 16f238d7f624..38a572ba650d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1097,4 +1097,9 @@ int fuse_readdir(struct file *file, struct dir_context 
*ctx);
  */
 unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args);
 
+/**
+ * Get the next unique ID for a request
+ */
+u64 fuse_get_unique(struct fuse_iqueue *fiq);
+
 #endif /* _FS_FUSE_I_H */
-- 
2.20.1



[PATCH v2 08/30] fuse: extract fuse_fill_super_common()

2019-05-15 Thread Vivek Goyal
From: Stefan Hajnoczi 

fuse_fill_super() includes code to process the fd= option and link the
struct fuse_dev to the fd's struct file.  In virtio-fs there is no file
descriptor because /dev/fuse is not used.

This patch extracts fuse_fill_super_common() so that both classic fuse
and virtio-fs can share the code to initialize a mount.

parse_fuse_opt() is also extracted so that the fuse_fill_super_common()
caller has access to the mount options.  This allows classic fuse to
handle the fd= option outside fuse_fill_super_common().

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Miklos Szeredi 
---
 fs/fuse/fuse_i.h |  33 
 fs/fuse/inode.c  | 137 ---
 2 files changed, 103 insertions(+), 67 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 38a572ba650d..84f094e4ac36 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -56,6 +56,25 @@ extern struct mutex fuse_mutex;
 extern unsigned max_user_bgreq;
 extern unsigned max_user_congthresh;
 
+/** Mount options */
+struct fuse_mount_data {
+   int fd;
+   unsigned rootmode;
+   kuid_t user_id;
+   kgid_t group_id;
+   unsigned fd_present:1;
+   unsigned rootmode_present:1;
+   unsigned user_id_present:1;
+   unsigned group_id_present:1;
+   unsigned default_permissions:1;
+   unsigned allow_other:1;
+   unsigned max_read;
+   unsigned blksize;
+
+   /* fuse_dev pointer to fill in, should contain NULL on entry */
+   void **fudptr;
+};
+
 /* One forget request */
 struct fuse_forget_link {
struct fuse_forget_one forget_one;
@@ -989,6 +1008,20 @@ struct fuse_dev *fuse_dev_alloc(struct fuse_conn *fc);
 void fuse_dev_free(struct fuse_dev *fud);
 void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req);
 
+/**
+ * Parse a mount options string
+ */
+int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+   struct user_namespace *user_ns);
+
+/**
+ * Fill in superblock and initialize fuse connection
+ * @sb: partially-initialized superblock to fill in
+ * @mount_data: mount parameters
+ */
+int fuse_fill_super_common(struct super_block *sb,
+  struct fuse_mount_data *mount_data);
+
 /**
  * Add connection to control filesystem
  */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index f02291469518..baf2966a753a 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -59,21 +59,6 @@ MODULE_PARM_DESC(max_user_congthresh,
 /** Congestion starts at 75% of maximum */
 #define FUSE_DEFAULT_CONGESTION_THRESHOLD (FUSE_DEFAULT_MAX_BACKGROUND * 3 / 4)
 
-struct fuse_mount_data {
-   int fd;
-   unsigned rootmode;
-   kuid_t user_id;
-   kgid_t group_id;
-   unsigned fd_present:1;
-   unsigned rootmode_present:1;
-   unsigned user_id_present:1;
-   unsigned group_id_present:1;
-   unsigned default_permissions:1;
-   unsigned allow_other:1;
-   unsigned max_read;
-   unsigned blksize;
-};
-
 struct fuse_forget_link *fuse_alloc_forget(void)
 {
return kzalloc(sizeof(struct fuse_forget_link), GFP_KERNEL);
@@ -482,7 +467,7 @@ static int fuse_match_uint(substring_t *s, unsigned int 
*res)
return err;
 }
 
-static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
+int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev,
  struct user_namespace *user_ns)
 {
char *p;
@@ -559,12 +544,13 @@ static int parse_fuse_opt(char *opt, struct 
fuse_mount_data *d, int is_bdev,
}
}
 
-   if (!d->fd_present || !d->rootmode_present ||
-   !d->user_id_present || !d->group_id_present)
+   if (!d->rootmode_present || !d->user_id_present ||
+   !d->group_id_present)
return 0;
 
return 1;
 }
+EXPORT_SYMBOL_GPL(parse_fuse_opt);
 
 static int fuse_show_options(struct seq_file *m, struct dentry *root)
 {
@@ -1079,15 +1065,13 @@ void fuse_dev_free(struct fuse_dev *fud)
 }
 EXPORT_SYMBOL_GPL(fuse_dev_free);
 
-static int fuse_fill_super(struct super_block *sb, void *data, int silent)
+int fuse_fill_super_common(struct super_block *sb,
+  struct fuse_mount_data *mount_data)
 {
struct fuse_dev *fud;
struct fuse_conn *fc;
struct inode *root;
-   struct fuse_mount_data d;
-   struct file *file;
struct dentry *root_dentry;
-   struct fuse_req *init_req;
int err;
int is_bdev = sb->s_bdev != NULL;
 
@@ -1097,13 +1081,10 @@ static int fuse_fill_super(struct super_block *sb, void 
*data, int silent)
 
sb->s_flags &= ~(SB_NOSEC | SB_I_VERSION);
 
-   if (!parse_fuse_opt(data, , is_bdev, sb->s_user_ns))
-   goto err;
-
if (is_bdev) {
 #ifdef CONFIG_BLOCK
err = -EINVAL;
-   if (!sb_set_blocksize(sb, d.blksize))
+   if (!sb_set_blocksize(sb, mount_data->blksize))
  

[PATCH v2 02/30] fuse: Clear setuid bit even in cache=never path

2019-05-15 Thread Vivek Goyal
If the fuse daemon is started with cache=never, fuse falls back to direct IO.
In that write path we don't call file_remove_privs(), which means the setuid
bit is not cleared if an unprivileged user writes to a file with the setuid
bit set.

pjdfstest chmod test 12.t tests this and fails.

Fix this by calling file_remove_privs() in the direct I/O path as well.

I tested this as follows.

- Run the fuse example passthrough fs.

  $ passthrough_ll /mnt/pasthrough-mnt -o 
default_permissions,allow_other,cache=never
  $ mkdir /mnt/pasthrough-mnt/testdir
  $ cd /mnt/pasthrough-mnt/testdir
  $ prove -rv pjdfstests/tests/chmod/12.t

Signed-off-by: Vivek Goyal 
---
 fs/fuse/file.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 06096b60f1df..5baf07fd2876 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1456,14 +1456,18 @@ static ssize_t fuse_direct_write_iter(struct kiocb 
*iocb, struct iov_iter *from)
/* Don't allow parallel writes to the same file */
inode_lock(inode);
res = generic_write_checks(iocb, from);
-   if (res > 0) {
-   if (!is_sync_kiocb(iocb) && iocb->ki_flags & IOCB_DIRECT) {
-   res = fuse_direct_IO(iocb, from);
-   } else {
-   res = fuse_direct_io(, from, >ki_pos,
-FUSE_DIO_WRITE);
-   }
+   if (res <= 0)
+   goto out;
+
+   res = file_remove_privs(iocb->ki_filp);
+   if (res)
+   goto out;
+   if (!is_sync_kiocb(iocb) && iocb->ki_flags & IOCB_DIRECT) {
+   res = fuse_direct_IO(iocb, from);
+   } else {
+   res = fuse_direct_io(, from, >ki_pos, FUSE_DIO_WRITE);
}
+out:
fuse_invalidate_attr(inode);
if (res > 0)
fuse_write_update_size(inode, iocb->ki_pos);
-- 
2.20.1



[PATCH v2 30/30] virtio-fs: Do not provide abort interface in fusectl

2019-05-15 Thread Vivek Goyal
virtio-fs does not support aborting requests which are being processed, that
is, requests which have already been sent to the fuse daemon on the host.

So do not provide the "abort" interface for virtio-fs in fusectl.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/control.c   | 4 ++--
 fs/fuse/fuse_i.h| 4 
 fs/fuse/inode.c | 1 +
 fs/fuse/virtio_fs.c | 1 +
 4 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index fe80bea4ad89..c1423f2ebc5e 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -278,8 +278,8 @@ int fuse_ctl_add_conn(struct fuse_conn *fc)
 
if (!fuse_ctl_add_dentry(parent, fc, "waiting", S_IFREG | 0400, 1,
 NULL, _ctl_waiting_ops) ||
-   !fuse_ctl_add_dentry(parent, fc, "abort", S_IFREG | 0200, 1,
-NULL, _ctl_abort_ops) ||
+   (!fc->no_abort && !fuse_ctl_add_dentry(parent, fc, "abort",
+   S_IFREG | 0200, 1, NULL, _ctl_abort_ops)) ||
!fuse_ctl_add_dentry(parent, fc, "max_background", S_IFREG | 0600,
 1, NULL, _conn_max_background_ops) ||
!fuse_ctl_add_dentry(parent, fc, "congestion_threshold",
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index b4a5728444bb..7ac7f9a0b81b 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -86,6 +86,7 @@ struct fuse_mount_data {
unsigned allow_other:1;
unsigned dax:1;
unsigned destroy:1;
+   unsigned no_abort:1;
unsigned max_read;
unsigned blksize;
 
@@ -847,6 +848,9 @@ struct fuse_conn {
/** Does the filesystem support copy_file_range? */
unsigned no_copy_file_range:1;
 
+   /** Do not create abort file in fuse control fs */
+   unsigned no_abort:1;
+
/** The number of requests waiting for completion */
atomic_t num_waiting;
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 8af7f31c6e19..302f7e04b645 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1272,6 +1272,7 @@ int fuse_fill_super_common(struct super_block *sb,
fc->user_id = mount_data->user_id;
fc->group_id = mount_data->group_id;
fc->max_read = max_t(unsigned, 4096, mount_data->max_read);
+   fc->no_abort = mount_data->no_abort;
 
/* Used by get_root_inode() */
sb->s_fs_info = fc;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 76c46edcc8ac..18fc0dca0abc 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1042,6 +1042,7 @@ static int virtio_fs_fill_super(struct super_block *sb, 
void *data,
d.fiq_priv = fs;
d.fudptr = (void **)>vqs[VQ_REQUEST].fud;
d.destroy = true; /* Send destroy request on unmount */
+   d.no_abort = 1;
err = fuse_fill_super_common(sb, );
if (err < 0)
goto err_free_init_req;
-- 
2.20.1



[PATCH v2 09/30] fuse: add fuse_iqueue_ops callbacks

2019-05-15 Thread Vivek Goyal
From: Stefan Hajnoczi 

The /dev/fuse device uses fiq->waitq and fasync to signal that requests
are available.  These mechanisms do not apply to virtio-fs.  This patch
introduces callbacks so alternative behavior can be used.

Note that queue_interrupt() changes along these lines:

  spin_lock(>waitq.lock);
  wake_up_locked(>waitq);
+ kill_fasync(>fasync, SIGIO, POLL_IN);
  spin_unlock(>waitq.lock);
- kill_fasync(>fasync, SIGIO, POLL_IN);

Since queue_request() and queue_forget() also call kill_fasync() inside
the spinlock this should be safe.
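
A sketch of how an alternative transport could plug into these callbacks
instead of waking /dev/fuse readers; only the ops structure and the
"called with fiq->waitq.lock held, must drop it" contract come from this
patch, the notification body is invented:

static void example_transport_wake_and_unlock(struct fuse_iqueue *fiq)
__releases(fiq->waitq.lock)
{
        /*
         * Requests already sit on the fiq lists; kick transport-specific
         * machinery instead of fiq->waitq, then drop the lock as the
         * contract requires.
         */
        spin_unlock(&fiq->waitq.lock);
        /* schedule_work(&example_transport_work);   (invented) */
}

static const struct fuse_iqueue_ops example_transport_fiq_ops = {
        .wake_forget_and_unlock         = example_transport_wake_and_unlock,
        .wake_interrupt_and_unlock      = example_transport_wake_and_unlock,
        .wake_pending_and_unlock        = example_transport_wake_and_unlock,
};

fuse_conn_init() then takes a pointer to such an ops table plus a private
pointer, as the cuse.c hunk further down shows.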

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Miklos Szeredi 
---
 fs/fuse/cuse.c   |  2 +-
 fs/fuse/dev.c| 50 
 fs/fuse/fuse_i.h | 48 +-
 fs/fuse/inode.c  | 16 
 4 files changed, 94 insertions(+), 22 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index 55a26f351467..a6ed7a036b50 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -504,7 +504,7 @@ static int cuse_channel_open(struct inode *inode, struct 
file *file)
 * Limit the cuse channel to requests that can
 * be represented in file->f_cred->user_ns.
 */
-   fuse_conn_init(>fc, file->f_cred->user_ns);
+   fuse_conn_init(>fc, file->f_cred->user_ns, _dev_fiq_ops, NULL);
 
fud = fuse_dev_alloc(>fc);
if (!fud) {
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 42fd3b576686..ef489beadf58 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -375,13 +375,33 @@ static unsigned int fuse_req_hash(u64 unique)
return hash_long(unique & ~FUSE_INT_REQ_BIT, FUSE_PQ_HASH_BITS);
 }
 
-static void queue_request(struct fuse_iqueue *fiq, struct fuse_req *req)
+/**
+ * A new request is available, wake fiq->waitq
+ */
+static void fuse_dev_wake_and_unlock(struct fuse_iqueue *fiq)
+__releases(fiq->waitq.lock)
 {
-   req->in.h.len = sizeof(struct fuse_in_header) +
-   fuse_len_args(req->in.numargs, (struct fuse_arg *) 
req->in.args);
-   list_add_tail(>list, >pending);
wake_up_locked(>waitq);
kill_fasync(>fasync, SIGIO, POLL_IN);
+   spin_unlock(>waitq.lock);
+}
+
+const struct fuse_iqueue_ops fuse_dev_fiq_ops = {
+   .wake_forget_and_unlock = fuse_dev_wake_and_unlock,
+   .wake_interrupt_and_unlock  = fuse_dev_wake_and_unlock,
+   .wake_pending_and_unlock= fuse_dev_wake_and_unlock,
+};
+EXPORT_SYMBOL_GPL(fuse_dev_fiq_ops);
+
+static void queue_request_and_unlock(struct fuse_iqueue *fiq,
+struct fuse_req *req)
+__releases(fiq->waitq.lock)
+{
+   req->in.h.len = sizeof(struct fuse_in_header) +
+   fuse_len_args(req->in.numargs,
+ (struct fuse_arg *) req->in.args);
+   list_add_tail(>list, >pending);
+   fiq->ops->wake_pending_and_unlock(fiq);
 }
 
 void fuse_queue_forget(struct fuse_conn *fc, struct fuse_forget_link *forget,
@@ -396,12 +416,11 @@ void fuse_queue_forget(struct fuse_conn *fc, struct 
fuse_forget_link *forget,
if (fiq->connected) {
fiq->forget_list_tail->next = forget;
fiq->forget_list_tail = forget;
-   wake_up_locked(>waitq);
-   kill_fasync(>fasync, SIGIO, POLL_IN);
+   fiq->ops->wake_forget_and_unlock(fiq);
} else {
kfree(forget);
+   spin_unlock(>waitq.lock);
}
-   spin_unlock(>waitq.lock);
 }
 
 static void flush_bg_queue(struct fuse_conn *fc)
@@ -417,8 +436,7 @@ static void flush_bg_queue(struct fuse_conn *fc)
fc->active_background++;
spin_lock(>waitq.lock);
req->in.h.unique = fuse_get_unique(fiq);
-   queue_request(fiq, req);
-   spin_unlock(>waitq.lock);
+   queue_request_and_unlock(fiq, req);
}
 }
 
@@ -506,10 +524,10 @@ static int queue_interrupt(struct fuse_iqueue *fiq, 
struct fuse_req *req)
spin_unlock(>waitq.lock);
return 0;
}
-   wake_up_locked(>waitq);
-   kill_fasync(>fasync, SIGIO, POLL_IN);
+   fiq->ops->wake_interrupt_and_unlock(fiq);
+   } else {
+   spin_unlock(>waitq.lock);
}
-   spin_unlock(>waitq.lock);
return 0;
 }
 
@@ -569,11 +587,10 @@ static void __fuse_request_send(struct fuse_conn *fc, 
struct fuse_req *req)
req->out.h.error = -ENOTCONN;
} else {
req->in.h.unique = fuse_get_unique(fiq);
-   queue_request(fiq, req);
/* acquire extra reference, since request is still needed
   after fuse_request_end() */
__fuse_get_request(req);
-   spin_unlock(>waitq.lock);
+   queue_request_and_unlock(fiq, req);
 
request_wait_answer(fc, req);
/* Pairs with smp_wmb() in 

[PATCH v2 19/30] fuse: Keep a list of free dax memory ranges

2019-05-15 Thread Vivek Goyal
Divide the dax memory range into fixed-size ranges (2MB for now) and put
them in a list that tracks free ranges. Once an inode requires a free range,
we take one from here and put it in the interval tree of ranges assigned to
the inode.

Signed-off-by: Vivek Goyal 
Signed-off-by: Peng Tao 
---
 fs/fuse/fuse_i.h| 23 
 fs/fuse/inode.c | 86 +
 fs/fuse/virtio_fs.c |  2 ++
 3 files changed, 111 insertions(+)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 840c88af711c..5439e4628362 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -46,6 +46,10 @@
 /** Number of page pointers embedded in fuse_req */
 #define FUSE_REQ_INLINE_PAGES 1
 
+/* Default memory range size, 2MB */
+#define FUSE_DAX_MEM_RANGE_SZ  (2*1024*1024)
+#define FUSE_DAX_MEM_RANGE_PAGES   (FUSE_DAX_MEM_RANGE_SZ/PAGE_SIZE)
+
 /** List of active connections */
 extern struct list_head fuse_conn_list;
 
@@ -94,6 +98,18 @@ struct fuse_forget_link {
struct fuse_forget_link *next;
 };
 
+/** Translation information for file offsets to DAX window offsets */
+struct fuse_dax_mapping {
+   /* Will connect in fc->free_ranges to keep track of free memory */
+   struct list_head list;
+
+   /** Position in DAX window */
+   u64 window_offset;
+
+   /** Length of mapping, in bytes */
+   loff_t length;
+};
+
 /** FUSE inode */
 struct fuse_inode {
/** Inode data */
@@ -829,6 +845,13 @@ struct fuse_conn {
 
/** DAX device, non-NULL if DAX is supported */
struct dax_device *dax_dev;
+
+   /*
+* DAX Window Free Ranges. TODO: This might not be best place to store
+* this free list
+*/
+   long nr_free_ranges;
+   struct list_head free_ranges;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 97d218a7daa8..8a3dd72f9843 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -22,6 +22,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 MODULE_AUTHOR("Miklos Szeredi ");
 MODULE_DESCRIPTION("Filesystem in Userspace");
@@ -610,6 +612,76 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
fpq->connected = 1;
 }
 
+static void fuse_free_dax_mem_ranges(struct list_head *mem_list)
+{
+   struct fuse_dax_mapping *range, *temp;
+
+   /* Free All allocated elements */
+   list_for_each_entry_safe(range, temp, mem_list, list) {
+   list_del(>list);
+   kfree(range);
+   }
+}
+
+#ifdef CONFIG_FS_DAX
+static int fuse_dax_mem_range_init(struct fuse_conn *fc,
+  struct dax_device *dax_dev)
+{
+   long nr_pages, nr_ranges;
+   void *kaddr;
+   pfn_t pfn;
+   struct fuse_dax_mapping *range;
+   LIST_HEAD(mem_ranges);
+   phys_addr_t phys_addr;
+   int ret = 0, id;
+   size_t dax_size = -1;
+   unsigned long i;
+
+   id = dax_read_lock();
+   nr_pages = dax_direct_access(dax_dev, 0, PHYS_PFN(dax_size), ,
+   );
+   dax_read_unlock(id);
+   if (nr_pages < 0) {
+   pr_debug("dax_direct_access() returned %ld\n", nr_pages);
+   return nr_pages;
+   }
+
+   phys_addr = pfn_t_to_phys(pfn);
+   nr_ranges = nr_pages/FUSE_DAX_MEM_RANGE_PAGES;
+   printk("fuse_dax_mem_range_init(): dax mapped %ld pages. 
nr_ranges=%ld\n", nr_pages, nr_ranges);
+
+   for (i = 0; i < nr_ranges; i++) {
+   range = kzalloc(sizeof(struct fuse_dax_mapping), GFP_KERNEL);
+   if (!range) {
+   pr_debug("memory allocation for mem_range failed.\n");
+   ret = -ENOMEM;
+   goto out_err;
+   }
+   /* TODO: This offset only works if virtio-fs driver is not
+* having some memory hidden at the beginning. This needs
+* better handling
+*/
+   range->window_offset = i * FUSE_DAX_MEM_RANGE_SZ;
+   range->length = FUSE_DAX_MEM_RANGE_SZ;
+   list_add_tail(>list, _ranges);
+   }
+
+   list_replace_init(_ranges, >free_ranges);
+   fc->nr_free_ranges = nr_ranges;
+   return 0;
+out_err:
+   /* Free All allocated elements */
+   fuse_free_dax_mem_ranges(_ranges);
+   return ret;
+}
+#else /* !CONFIG_FS_DAX */
+static inline int fuse_dax_mem_range_init(struct fuse_conn *fc,
+ struct dax_device *dax_dev)
+{
+   return 0;
+}
+#endif /* CONFIG_FS_DAX */
+
 void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
struct dax_device *dax_dev,
const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
@@ -640,6 +712,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct 
user_namespace *user_ns,
fc->dax_dev = dax_dev;
fc->user_ns 

[PATCH v2 18/30] virtio_fs, dax: Set up virtio_fs dax_device

2019-05-15 Thread Vivek Goyal
From: Stefan Hajnoczi 

Setup a dax device.

Use the shm capability to find the cache entry and map it.

The DAX window is accessed by the fs/dax.c infrastructure and must have
struct pages (at least on x86).  Use devm_memremap_pages() to map the
DAX window PCI BAR and allocate struct page.

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
Signed-off-by: Sebastien Boeuf 
Signed-off-by: Liu Bo 
---
 fs/fuse/fuse_i.h   |   1 +
 fs/fuse/inode.c|   8 ++
 fs/fuse/virtio_fs.c| 173 -
 include/uapi/linux/virtio_fs.h |   3 +
 4 files changed, 183 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 46fc1a454084..840c88af711c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -70,6 +70,7 @@ struct fuse_mount_data {
unsigned group_id_present:1;
unsigned default_permissions:1;
unsigned allow_other:1;
+   unsigned dax:1;
unsigned destroy:1;
unsigned max_read;
unsigned blksize;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 42f3ac5b7521..97d218a7daa8 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -442,6 +442,7 @@ enum {
OPT_ALLOW_OTHER,
OPT_MAX_READ,
OPT_BLKSIZE,
+   OPT_DAX,
OPT_ERR
 };
 
@@ -455,6 +456,7 @@ static const match_table_t tokens = {
{OPT_ALLOW_OTHER,   "allow_other"},
{OPT_MAX_READ,  "max_read=%u"},
{OPT_BLKSIZE,   "blksize=%u"},
+   {OPT_DAX,   "dax"},
{OPT_ERR,   NULL}
 };
 
@@ -546,6 +548,10 @@ int parse_fuse_opt(char *opt, struct fuse_mount_data *d, 
int is_bdev,
d->blksize = value;
break;
 
+   case OPT_DAX:
+   d->dax = 1;
+   break;
+
default:
return 0;
}
@@ -574,6 +580,8 @@ static int fuse_show_options(struct seq_file *m, struct 
dentry *root)
seq_printf(m, ",max_read=%u", fc->max_read);
if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+   if (fc->dax_dev)
+   seq_printf(m, ",dax");
return 0;
 }
 
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index a23a1fb67217..2b790865dc21 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -5,6 +5,9 @@
  */
 
 #include 
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -31,6 +34,18 @@ struct virtio_fs_vq {
char name[24];
 } cacheline_aligned_in_smp;
 
+/* State needed for devm_memremap_pages().  This API is called on the
+ * underlying pci_dev instead of struct virtio_fs (layering violation).  Since
+ * the memremap release function only gets called when the pci_dev is released,
+ * keep the associated state separate from struct virtio_fs (it has a different
+ * lifecycle from pci_dev).
+ */
+struct virtio_fs_memremap_info {
+   struct dev_pagemap pgmap;
+   struct percpu_ref ref;
+   struct completion completion;
+};
+
 /* A virtio-fs device instance */
 struct virtio_fs {
struct list_head list;/* on virtio_fs_instances */
@@ -38,6 +53,12 @@ struct virtio_fs {
struct virtio_fs_vq *vqs;
unsigned nvqs;/* number of virtqueues */
unsigned num_queues;  /* number of request queues */
+   struct dax_device *dax_dev;
+
+   /* DAX memory window where file contents are mapped */
+   void *window_kaddr;
+   phys_addr_t window_phys_addr;
+   size_t window_len;
 };
 
 struct virtio_fs_forget {
@@ -421,6 +442,151 @@ static void virtio_fs_cleanup_vqs(struct virtio_device 
*vdev,
vdev->config->del_vqs(vdev);
 }
 
+/* Map a window offset to a page frame number.  The window offset will have
+ * been produced by .iomap_begin(), which maps a file offset to a window
+ * offset.
+ */
+static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
+   long nr_pages, void **kaddr, pfn_t *pfn)
+{
+   struct virtio_fs *fs = dax_get_private(dax_dev);
+   phys_addr_t offset = PFN_PHYS(pgoff);
+   size_t max_nr_pages = fs->window_len/PAGE_SIZE - pgoff;
+
+   if (kaddr)
+   *kaddr = fs->window_kaddr + offset;
+   if (pfn)
+   *pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
+   PFN_DEV | PFN_MAP);
+   return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
+}
+
+static size_t virtio_fs_copy_from_iter(struct dax_device *dax_dev,
+  pgoff_t pgoff, void *addr,
+  size_t bytes, struct iov_iter *i)
+{
+   return copy_from_iter(addr, bytes, i);
+}
+
+static size_t virtio_fs_copy_to_iter(struct 

[PATCH v2 28/30] fuse: Reschedule dax free work if too many EAGAIN attempts

2019-05-15 Thread Vivek Goyal
fuse_dax_free_memory() can be very cpu intensive in corner cases. For example,
if one inode has consumed all the memory and a setupmapping request is
pending, the inode lock is held by that request and the worker thread will
not get the lock for a while. And given there is only one inode consuming all
the dax ranges, all the attempts to acquire the lock will fail.

So if there are too many inode lock failures (-EAGAIN), reschedule the
worker with a 10ms delay.
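
The pattern is easier to see outside the kernel. Below is a small
user-space sketch of "give up after a bounded number of contended attempts
and requeue with a delay"; trylock_inode() and requeue_after_ms() are
invented stand-ins, while the threshold of 20 and the 10ms delay are the
values used by this patch.

#include <errno.h>
#include <stdio.h>

static int trylock_inode(int i)
{
        (void)i;
        return -EAGAIN;         /* pretend the lock is always contended */
}

static void requeue_after_ms(int ms)
{
        printf("requeueing reclaim work in %d ms\n", ms);
}

static void free_memory(int nr_to_free)
{
        int nr_freed = 0, nr_eagain = 0, i = 0;

        while (nr_freed < nr_to_free) {
                if (nr_eagain > 20) {           /* too many failed attempts */
                        requeue_after_ms(10);   /* come back later */
                        return;
                }
                if (trylock_inode(i++) == -EAGAIN) {
                        nr_eagain++;            /* try the next candidate */
                        continue;
                }
                nr_freed++;                     /* reclaim one range */
        }
}

int main(void)
{
        free_memory(5);
        return 0;
}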

Signed-off-by: Vivek Goyal 
---
 fs/fuse/file.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index b0293a308b5e..9b82d9b4ebc3 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -4047,7 +4047,7 @@ int fuse_dax_free_one_mapping(struct fuse_conn *fc, 
struct inode *inode,
 int fuse_dax_free_memory(struct fuse_conn *fc, unsigned long nr_to_free)
 {
struct fuse_dax_mapping *dmap, *pos, *temp;
-   int ret, nr_freed = 0;
+   int ret, nr_freed = 0, nr_eagain = 0;
u64 dmap_start = 0, window_offset = 0;
struct inode *inode = NULL;
 
@@ -4056,6 +4056,12 @@ int fuse_dax_free_memory(struct fuse_conn *fc, unsigned 
long nr_to_free)
if (nr_freed >= nr_to_free)
break;
 
+   if (nr_eagain > 20) {
+   queue_delayed_work(system_long_wq, >dax_free_work,
+   msecs_to_jiffies(10));
+   return 0;
+   }
+
dmap = NULL;
spin_lock(>lock);
 
@@ -4093,8 +4099,10 @@ int fuse_dax_free_memory(struct fuse_conn *fc, unsigned 
long nr_to_free)
}
 
/* Could not get inode lock. Try next element */
-   if (ret == -EAGAIN)
+   if (ret == -EAGAIN) {
+   nr_eagain++;
continue;
+   }
nr_freed++;
}
return 0;
-- 
2.20.1



[PATCH v2 20/30] fuse: Introduce setupmapping/removemapping commands

2019-05-15 Thread Vivek Goyal
Introduce two new fuse commands to set up and remove memory mappings. These
will be used to set up and tear down file mappings in the dax window.
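
For illustration, a sketch of how a client might fill the setupmapping
arguments added below. This is plain user-space C with a local mirror of
the structure; the handle, offsets and lengths are made-up values.

#include <stdint.h>
#include <stdio.h>

#define SETUPMAPPING_FLAG_WRITE (1ULL << 0)     /* mirrors FUSE_SETUPMAPPING_FLAG_WRITE */

struct setupmapping_in {                        /* mirrors struct fuse_setupmapping_in */
        uint64_t fh;                            /* an already open file handle */
        uint64_t foffset;                       /* file offset to start the mapping */
        uint64_t len;                           /* length of mapping required */
        uint64_t flags;                         /* SETUPMAPPING_FLAG_* */
        uint64_t moffset;                       /* offset in the memory window */
};

int main(void)
{
        struct setupmapping_in in = {
                .fh      = 3,                   /* handle from a prior OPEN */
                .foffset = 4 * 2097152ULL,      /* 2MB aligned file offset */
                .len     = 2097152ULL,          /* one 2MB range */
                .flags   = SETUPMAPPING_FLAG_WRITE,
                .moffset = 0,                   /* first slot in the window */
        };

        printf("map file@0x%llx -> window@0x%llx (%llu bytes)\n",
               (unsigned long long)in.foffset,
               (unsigned long long)in.moffset,
               (unsigned long long)in.len);
        return 0;
}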

Signed-off-by: Vivek Goyal 
---
 include/uapi/linux/fuse.h | 33 +
 1 file changed, 33 insertions(+)

diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 2ac598614a8f..9eb313220549 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -399,6 +399,8 @@ enum fuse_opcode {
FUSE_RENAME2= 45,
FUSE_LSEEK  = 46,
FUSE_COPY_FILE_RANGE= 47,
+   FUSE_SETUPMAPPING   = 48,
+   FUSE_REMOVEMAPPING  = 49,
 
/* CUSE specific operations */
CUSE_INIT   = 4096,
@@ -822,4 +824,35 @@ struct fuse_copy_file_range_in {
uint64_tflags;
 };
 
+#define FUSE_SETUPMAPPING_ENTRIES 8
+#define FUSE_SETUPMAPPING_FLAG_WRITE (1ull << 0)
+struct fuse_setupmapping_in {
+   /* An already open handle */
+   uint64_tfh;
+   /* Offset into the file to start the mapping */
+   uint64_tfoffset;
+   /* Length of mapping required */
+   uint64_tlen;
+   /* Flags, FUSE_SETUPMAPPING_FLAG_* */
+   uint64_tflags;
+   /* Offset in Memory Window */
+   uint64_tmoffset;
+};
+
+struct fuse_setupmapping_out {
+   /* Offsets into the cache of mappings */
+   uint64_tcoffset[FUSE_SETUPMAPPING_ENTRIES];
+/* Lengths of each mapping */
+uint64_t   len[FUSE_SETUPMAPPING_ENTRIES];
+};
+
+struct fuse_removemapping_in {
+/* An already open handle */
+uint64_t   fh;
+   /* Offset into the dax window start the unmapping */
+   uint64_tmoffset;
+/* Length of mapping required */
+uint64_t   len;
+};
+
 #endif /* _LINUX_FUSE_H */
-- 
2.20.1



[PATCH v2 23/30] fuse: Define dax address space operations

2019-05-15 Thread Vivek Goyal
This is done along the lines of ext4 and xfs. I primarily wanted the
->writepages hook at this time so that I could call into
dax_writeback_mapping_range(). This in turn will decide which pfns need
to be written back.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/file.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a053bcb9498d..2777355bc245 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2402,6 +2402,17 @@ static int fuse_writepages_fill(struct page *page,
return err;
 }
 
+static int fuse_dax_writepages(struct address_space *mapping,
+   struct writeback_control *wbc)
+{
+
+   struct inode *inode = mapping->host;
+   struct fuse_conn *fc = get_fuse_conn(inode);
+
+   return dax_writeback_mapping_range(mapping,
+   NULL, fc->dax_dev, wbc);
+}
+
 static int fuse_writepages(struct address_space *mapping,
   struct writeback_control *wbc)
 {
@@ -3707,6 +3718,13 @@ static const struct address_space_operations 
fuse_file_aops  = {
.write_end  = fuse_write_end,
 };
 
+static const struct address_space_operations fuse_dax_file_aops  = {
+   .writepages = fuse_dax_writepages,
+   .direct_IO  = noop_direct_IO,
+   .set_page_dirty = noop_set_page_dirty,
+   .invalidatepage = noop_invalidatepage,
+};
+
 void fuse_init_file_inode(struct inode *inode)
 {
struct fuse_inode *fi = get_fuse_inode(inode);
@@ -3724,5 +3742,6 @@ void fuse_init_file_inode(struct inode *inode)
 
if (fc->dax_dev) {
inode->i_flags |= S_DAX;
+   inode->i_data.a_ops = _dax_file_aops;
}
 }
-- 
2.20.1



[PATCH v2 25/30] fuse: Maintain a list of busy elements

2019-05-15 Thread Vivek Goyal
This list will be used for selecting a fuse_dax_mapping to free when the
number of free mappings drops below a threshold.
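
The intent, roughly: mapped ranges go to the tail of the busy list, so
the head holds the longest-mapped range and is a natural reclaim victim.
A user-space sketch of that accounting follows; the FIFO array, the
threshold and the range count are invented for illustration.

#include <stdio.h>

#define NR_RANGES       8
#define FREE_THRESHOLD  2

static int busy[NR_RANGES];     /* FIFO of busy range ids */
static int busy_head, busy_tail, nr_busy;
static int nr_free = NR_RANGES;

static void mark_busy(int id)
{
        busy[busy_tail] = id;
        busy_tail = (busy_tail + 1) % NR_RANGES;
        nr_busy++;
        nr_free--;
}

static int pick_victim(void)
{
        int id = busy[busy_head];       /* oldest mapping */

        busy_head = (busy_head + 1) % NR_RANGES;
        nr_busy--;
        nr_free++;
        return id;
}

int main(void)
{
        int i;

        for (i = 0; i < 7; i++)
                mark_busy(i);

        while (nr_free < FREE_THRESHOLD)
                printf("reclaiming oldest busy range %d\n", pick_victim());

        return 0;
}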

Signed-off-by: Vivek Goyal 
---
 fs/fuse/file.c   | 8 
 fs/fuse/fuse_i.h | 7 +++
 fs/fuse/inode.c  | 4 
 3 files changed, 19 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index e536a04aaa06..3f0f7a387341 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -273,6 +273,10 @@ static int fuse_setup_one_mapping(struct inode *inode,
/* Protected by fi->i_dmap_sem */
fuse_dax_interval_tree_insert(dmap, >dmap_tree);
fi->nr_dmaps++;
+   spin_lock(>lock);
+   list_add_tail(>busy_list, >busy_ranges);
+   fc->nr_busy_ranges++;
+   spin_unlock(>lock);
return 0;
 }
 
@@ -317,6 +321,10 @@ void fuse_removemapping(struct inode *inode)
if (dmap) {
fuse_dax_interval_tree_remove(dmap, >dmap_tree);
fi->nr_dmaps--;
+   spin_lock(>lock);
+   list_del_init(>busy_list);
+   fc->nr_busy_ranges--;
+   spin_unlock(>lock);
}
 
if (!dmap)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a234cf30538d..c93e9155b723 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -114,6 +114,9 @@ struct fuse_dax_mapping {
__u64 end;
__u64 __subtree_last;
 
+   /* Will connect in fc->busy_ranges to keep track busy memory */
+   struct list_head busy_list;
+
/** Position in DAX window */
u64 window_offset;
 
@@ -873,6 +876,10 @@ struct fuse_conn {
/** DAX device, non-NULL if DAX is supported */
struct dax_device *dax_dev;
 
+   /* List of memory ranges which are busy */
+   unsigned long nr_busy_ranges;
+   struct list_head busy_ranges;
+
/*
 * DAX Window Free Ranges. TODO: This might not be best place to store
 * this free list
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 713c5f32ab35..f57f7ce02acc 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -626,6 +626,8 @@ static void fuse_free_dax_mem_ranges(struct list_head 
*mem_list)
/* Free All allocated elements */
list_for_each_entry_safe(range, temp, mem_list, list) {
list_del(>list);
+   if (!list_empty(>busy_list))
+   list_del(>busy_list);
kfree(range);
}
 }
@@ -670,6 +672,7 @@ static int fuse_dax_mem_range_init(struct fuse_conn *fc,
 */
range->window_offset = i * FUSE_DAX_MEM_RANGE_SZ;
range->length = FUSE_DAX_MEM_RANGE_SZ;
+   INIT_LIST_HEAD(>busy_list);
list_add_tail(>list, _ranges);
}
 
@@ -720,6 +723,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct 
user_namespace *user_ns,
fc->user_ns = get_user_ns(user_ns);
fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
INIT_LIST_HEAD(>free_ranges);
+   INIT_LIST_HEAD(>busy_ranges);
 }
 EXPORT_SYMBOL_GPL(fuse_conn_init);
 
-- 
2.20.1



[PATCH v2 16/30] virtio: Implement get_shm_region for MMIO transport

2019-05-15 Thread Vivek Goyal
From: Sebastien Boeuf 

On MMIO a new set of registers is defined for finding SHM
regions.  Add their definitions and use them to find the region.
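
A user-space model of the access pattern may help: select a region id,
read the 64-bit length as two 32-bit halves, and treat an all-ones length
as "region not present". The register offsets are the ones added to
virtio_mmio.h below; the "device" here is just an array.

#include <stdint.h>
#include <stdio.h>

#define SHM_SEL         0x0ac
#define SHM_LEN_LOW     0x0b0
#define SHM_LEN_HIGH    0x0b4

static uint32_t regs[0x100];    /* fake MMIO space, indexed by offset/4 */

static uint32_t rd(unsigned int off)             { return regs[off / 4]; }
static void     wr(uint32_t v, unsigned int off) { regs[off / 4] = v; }

static int get_shm_len(uint8_t id, uint64_t *len)
{
        uint64_t l;

        wr(id, SHM_SEL);                        /* select the region */
        l  = (uint64_t)rd(SHM_LEN_LOW);
        l |= (uint64_t)rd(SHM_LEN_HIGH) << 32;

        if (l == ~(uint64_t)0)                  /* sentinel: no such region */
                return 0;
        *len = l;
        return 1;
}

int main(void)
{
        uint64_t len;

        /* pretend the device exposes an 8GB region with id 0 */
        wr(0x00000000, SHM_LEN_LOW);
        wr(0x00000002, SHM_LEN_HIGH);

        if (get_shm_len(0, &len))
                printf("shm region: %llu bytes\n", (unsigned long long)len);
        return 0;
}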

Signed-off-by: Sebastien Boeuf 
---
 drivers/virtio/virtio_mmio.c | 32 
 include/uapi/linux/virtio_mmio.h | 11 +++
 2 files changed, 43 insertions(+)

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
index d9dd0f789279..ac42520abd86 100644
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -499,6 +499,37 @@ static const char *vm_bus_name(struct virtio_device *vdev)
return vm_dev->pdev->name;
 }
 
+static bool vm_get_shm_region(struct virtio_device *vdev,
+ struct virtio_shm_region *region, u8 id)
+{
+   struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
+   u64 len, addr;
+
+   /* Select the region we're interested in */
+   writel(id, vm_dev->base + VIRTIO_MMIO_SHM_SEL);
+
+   /* Read the region size */
+   len = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_LOW);
+   len |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_LEN_HIGH) << 32;
+
+   region->len = len;
+
+   /* Check if region length is -1. If that's the case, the shared memory
+* region does not exist and there is no need to proceed further.
+*/
+   if (len == ~(u64)0) {
+   return false;
+   }
+
+   /* Read the region base address */
+   addr = (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_LOW);
+   addr |= (u64) readl(vm_dev->base + VIRTIO_MMIO_SHM_BASE_HIGH) << 32;
+
+   region->addr = addr;
+
+   return true;
+}
+
 static const struct virtio_config_ops virtio_mmio_config_ops = {
.get= vm_get,
.set= vm_set,
@@ -511,6 +542,7 @@ static const struct virtio_config_ops 
virtio_mmio_config_ops = {
.get_features   = vm_get_features,
.finalize_features = vm_finalize_features,
.bus_name   = vm_bus_name,
+   .get_shm_region = vm_get_shm_region,
 };
 
 
diff --git a/include/uapi/linux/virtio_mmio.h b/include/uapi/linux/virtio_mmio.h
index c4b09689ab64..0650f91bea6c 100644
--- a/include/uapi/linux/virtio_mmio.h
+++ b/include/uapi/linux/virtio_mmio.h
@@ -122,6 +122,17 @@
 #define VIRTIO_MMIO_QUEUE_USED_LOW 0x0a0
 #define VIRTIO_MMIO_QUEUE_USED_HIGH0x0a4
 
+/* Shared memory region id */
+#define VIRTIO_MMIO_SHM_SEL 0x0ac
+
+/* Shared memory region length, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_LEN_LOW 0x0b0
+#define VIRTIO_MMIO_SHM_LEN_HIGH0x0b4
+
+/* Shared memory region base address, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_BASE_LOW0x0b8
+#define VIRTIO_MMIO_SHM_BASE_HIGH   0x0bc
+
 /* Configuration atomicity value */
 #define VIRTIO_MMIO_CONFIG_GENERATION  0x0fc
 
-- 
2.20.1



[PATCH v2 06/30] fuse: Export fuse_send_init_request()

2019-05-15 Thread Vivek Goyal
This will be used by virtio-fs to send the init request to the fuse server
after initialization of the virtqueues.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/dev.c| 1 +
 fs/fuse/fuse_i.h | 1 +
 fs/fuse/inode.c  | 3 ++-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index d8054b1a45f5..40eb827caa10 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -139,6 +139,7 @@ void fuse_request_free(struct fuse_req *req)
fuse_req_pages_free(req);
kmem_cache_free(fuse_req_cachep, req);
 }
+EXPORT_SYMBOL_GPL(fuse_request_free);
 
 void __fuse_get_request(struct fuse_req *req)
 {
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 3a235386d667..16f238d7f624 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -987,6 +987,7 @@ void fuse_conn_put(struct fuse_conn *fc);
 
 struct fuse_dev *fuse_dev_alloc(struct fuse_conn *fc);
 void fuse_dev_free(struct fuse_dev *fud);
+void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req);
 
 /**
  * Add connection to control filesystem
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index ec5d9953dfb6..f02291469518 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -958,7 +958,7 @@ static void process_init_reply(struct fuse_conn *fc, struct 
fuse_req *req)
wake_up_all(>blocked_waitq);
 }
 
-static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
+void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
 {
struct fuse_init_in *arg = >misc.init_in;
 
@@ -988,6 +988,7 @@ static void fuse_send_init(struct fuse_conn *fc, struct 
fuse_req *req)
req->end = process_init_reply;
fuse_request_send_background(fc, req);
 }
+EXPORT_SYMBOL_GPL(fuse_send_init);
 
 static void fuse_free_conn(struct fuse_conn *fc)
 {
-- 
2.20.1



[PATCH v2 21/30] fuse, dax: Implement dax read/write operations

2019-05-15 Thread Vivek Goyal
This patch implements basic DAX support. mmap() is not implemented
yet and will come in later patches. This patch looks into implementing
read/write.

We make use of an interval tree to keep track of per-inode dax mappings.

Do not use dax for file-extending writes; instead just send a WRITE message
to the daemon (like we do for the direct I/O path). This keeps the write and
the i_size change atomic w.r.t. a crash.
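
The extending-write rule amounts to a simple check against i_size before
picking a path. A user-space sketch of that decision follows; dax_write()
and send_write_request() are stand-ins and the sizes are made up.

#include <stdint.h>
#include <stdio.h>

static uint64_t i_size = 4096;          /* pretend current file size */

static long dax_write(uint64_t pos, uint64_t len)
{
        printf("DAX copy:   pos=%llu len=%llu\n",
               (unsigned long long)pos, (unsigned long long)len);
        return (long)len;
}

static long send_write_request(uint64_t pos, uint64_t len)
{
        printf("FUSE WRITE: pos=%llu len=%llu (may extend i_size)\n",
               (unsigned long long)pos, (unsigned long long)len);
        if (pos + len > i_size)
                i_size = pos + len;     /* size update comes with the reply */
        return (long)len;
}

static long do_write(uint64_t pos, uint64_t len)
{
        if (pos + len > i_size)         /* extending write: skip the DAX window */
                return send_write_request(pos, len);
        return dax_write(pos, len);
}

int main(void)
{
        do_write(0, 1024);              /* within i_size -> DAX */
        do_write(4000, 1024);           /* crosses i_size -> regular WRITE */
        return 0;
}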

Signed-off-by: Stefan Hajnoczi 
Signed-off-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
Signed-off-by: Miklos Szeredi 
Signed-off-by: Liu Bo 
---
 fs/fuse/file.c| 454 +-
 fs/fuse/fuse_i.h  |  21 ++
 fs/fuse/inode.c   |   6 +
 include/uapi/linux/fuse.h |   1 +
 4 files changed, 476 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index e9a7aa97c539..edbb11ca735e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -18,6 +18,12 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+
+INTERVAL_TREE_DEFINE(struct fuse_dax_mapping, rb, __u64, __subtree_last,
+ START, LAST, static inline, fuse_dax_interval_tree);
 
 static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
  int opcode, struct fuse_open_out *outargp)
@@ -171,6 +177,173 @@ static void fuse_link_write_file(struct file *file)
spin_unlock(>lock);
 }
 
+static struct fuse_dax_mapping *alloc_dax_mapping(struct fuse_conn *fc)
+{
+   struct fuse_dax_mapping *dmap = NULL;
+
+   spin_lock(>lock);
+
+   /* TODO: Add logic to try to free up memory if wait is allowed */
+   if (fc->nr_free_ranges <= 0) {
+   spin_unlock(>lock);
+   return NULL;
+   }
+
+   WARN_ON(list_empty(>free_ranges));
+
+   /* Take a free range */
+   dmap = list_first_entry(>free_ranges, struct fuse_dax_mapping,
+   list);
+   list_del_init(>list);
+   fc->nr_free_ranges--;
+   spin_unlock(>lock);
+   return dmap;
+}
+
+/* This assumes fc->lock is held */
+static void __free_dax_mapping(struct fuse_conn *fc,
+   struct fuse_dax_mapping *dmap)
+{
+   list_add_tail(>list, >free_ranges);
+   fc->nr_free_ranges++;
+}
+
+static void free_dax_mapping(struct fuse_conn *fc,
+   struct fuse_dax_mapping *dmap)
+{
+   /* Return fuse_dax_mapping to free list */
+   spin_lock(>lock);
+   __free_dax_mapping(fc, dmap);
+   spin_unlock(>lock);
+}
+
+/* offset passed in should be aligned to FUSE_DAX_MEM_RANGE_SZ */
+static int fuse_setup_one_mapping(struct inode *inode,
+   struct file *file, loff_t offset,
+   struct fuse_dax_mapping *dmap)
+{
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   struct fuse_file *ff = NULL;
+   struct fuse_setupmapping_in inarg;
+   FUSE_ARGS(args);
+   ssize_t err;
+
+   if (file)
+   ff = file->private_data;
+
+   WARN_ON(offset % FUSE_DAX_MEM_RANGE_SZ);
+   WARN_ON(fc->nr_free_ranges < 0);
+
+   /* Ask fuse daemon to setup mapping */
+   memset(, 0, sizeof(inarg));
+   inarg.foffset = offset;
+   if (ff)
+   inarg.fh = ff->fh;
+   else
+   inarg.fh = -1;
+   inarg.moffset = dmap->window_offset;
+   inarg.len = FUSE_DAX_MEM_RANGE_SZ;
+   if (file) {
+   inarg.flags |= (file->f_mode & FMODE_WRITE) ?
+   FUSE_SETUPMAPPING_FLAG_WRITE : 0;
+   inarg.flags |= (file->f_mode & FMODE_READ) ?
+   FUSE_SETUPMAPPING_FLAG_READ : 0;
+   } else {
+   inarg.flags |= FUSE_SETUPMAPPING_FLAG_READ;
+   inarg.flags |= FUSE_SETUPMAPPING_FLAG_WRITE;
+   }
+   args.in.h.opcode = FUSE_SETUPMAPPING;
+   args.in.h.nodeid = fi->nodeid;
+   args.in.numargs = 1;
+   args.in.args[0].size = sizeof(inarg);
+   args.in.args[0].value = 
+   err = fuse_simple_request(fc, );
+   if (err < 0) {
+   printk(KERN_ERR "%s request failed at mem_offset=0x%llx %zd\n",
+__func__, dmap->window_offset, err);
+   return err;
+   }
+
+   pr_debug("fuse_setup_one_mapping() succeeded. offset=0x%llx err=%zd\n", 
offset, err);
+
+   /* TODO: What locking is required here. For now, using fc->lock */
+   dmap->start = offset;
+   dmap->end = offset + FUSE_DAX_MEM_RANGE_SZ - 1;
+   /* Protected by fi->i_dmap_sem */
+   fuse_dax_interval_tree_insert(dmap, >dmap_tree);
+   fi->nr_dmaps++;
+   return 0;
+}
+
+static int fuse_removemapping_one(struct inode *inode,
+   struct fuse_dax_mapping *dmap)
+{
+   struct fuse_inode *fi = get_fuse_inode(inode);
+   struct fuse_conn *fc = 

[PATCH v2 14/30] virtio: Add get_shm_region method

2019-05-15 Thread Vivek Goyal
From: Sebastien Boeuf 

Virtio defines 'shared memory regions' that provide a continuously
shared region between the host and guest.

Provide a method to find a particular region on a device.
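
The wrapper added here treats the method as optional and bails out when a
transport does not implement it. A user-space sketch of that pattern is
below; the types and the PCI-like backend are invented for illustration.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct shm_region {
        uint64_t addr;
        uint64_t len;
};

struct config_ops {
        /* optional: may be NULL on transports without shm support */
        bool (*get_shm_region)(struct shm_region *region, uint8_t id);
};

struct device {
        const struct config_ops *ops;
};

static bool get_shm_region(struct device *dev, struct shm_region *r, uint8_t id)
{
        if (!dev->ops->get_shm_region)
                return false;           /* transport has no shm regions */
        return dev->ops->get_shm_region(r, id);
}

static bool pci_like_get_shm_region(struct shm_region *r, uint8_t id)
{
        (void)id;
        r->addr = 0x100000000ULL;       /* made-up BAR address */
        r->len  = 1ULL << 30;
        return true;
}

static const struct config_ops pci_like_ops = {
        .get_shm_region = pci_like_get_shm_region,
};

int main(void)
{
        struct device dev = { .ops = &pci_like_ops };
        struct shm_region r;

        if (get_shm_region(&dev, &r, 0))
                printf("region 0: addr=0x%llx len=%llu\n",
                       (unsigned long long)r.addr, (unsigned long long)r.len);
        return 0;
}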

Signed-off-by: Sebastien Boeuf 
Signed-off-by: Dr. David Alan Gilbert 
---
 include/linux/virtio_config.h | 17 +
 1 file changed, 17 insertions(+)

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index bb4cc4910750..c859f000a751 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -10,6 +10,11 @@
 
 struct irq_affinity;
 
+struct virtio_shm_region {
+   u64 addr;
+   u64 len;
+};
+
 /**
  * virtio_config_ops - operations for configuring a virtio device
  * Note: Do not assume that a transport implements all of the operations
@@ -65,6 +70,7 @@ struct irq_affinity;
  *  the caller can then copy.
  * @set_vq_affinity: set the affinity for a virtqueue (optional).
  * @get_vq_affinity: get the affinity for a virtqueue (optional).
+ * @get_shm_region: get a shared memory region based on the index.
  */
 typedef void vq_callback_t(struct virtqueue *);
 struct virtio_config_ops {
@@ -88,6 +94,8 @@ struct virtio_config_ops {
   const struct cpumask *cpu_mask);
const struct cpumask *(*get_vq_affinity)(struct virtio_device *vdev,
int index);
+   bool (*get_shm_region)(struct virtio_device *vdev,
+  struct virtio_shm_region *region, u8 id);
 };
 
 /* If driver didn't advertise the feature, it will never appear. */
@@ -250,6 +258,15 @@ int virtqueue_set_affinity(struct virtqueue *vq, const 
struct cpumask *cpu_mask)
return 0;
 }
 
+static inline
+bool virtio_get_shm_region(struct virtio_device *vdev,
+ struct virtio_shm_region *region, u8 id)
+{
+   if (!vdev->config->get_shm_region)
+   return false;
+   return vdev->config->get_shm_region(vdev, region, id);
+}
+
 static inline bool virtio_is_little_endian(struct virtio_device *vdev)
 {
return virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
-- 
2.20.1



[PATCH v2 17/30] fuse, dax: add fuse_conn->dax_dev field

2019-05-15 Thread Vivek Goyal
From: Stefan Hajnoczi 

A struct dax_device instance is a prerequisite for the DAX filesystem
APIs.  Let virtio_fs associate a dax_device with a fuse_conn.  Classic
FUSE and CUSE set the pointer to NULL, disabling DAX.

Signed-off-by: Stefan Hajnoczi 
---
 fs/fuse/cuse.c  | 3 ++-
 fs/fuse/fuse_i.h| 9 -
 fs/fuse/inode.c | 9 ++---
 fs/fuse/virtio_fs.c | 1 +
 4 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index a509747153a7..417448f11f9f 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -504,7 +504,8 @@ static int cuse_channel_open(struct inode *inode, struct 
file *file)
 * Limit the cuse channel to requests that can
 * be represented in file->f_cred->user_ns.
 */
-   fuse_conn_init(>fc, file->f_cred->user_ns, _dev_fiq_ops, NULL);
+   fuse_conn_init(>fc, file->f_cred->user_ns, NULL, _dev_fiq_ops,
+   NULL);
 
fud = fuse_dev_alloc_install(>fc);
if (!fud) {
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f5cb4d40b83f..46fc1a454084 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -74,6 +74,9 @@ struct fuse_mount_data {
unsigned max_read;
unsigned blksize;
 
+   /* DAX device, may be NULL */
+   struct dax_device *dax_dev;
+
/* fuse input queue operations */
const struct fuse_iqueue_ops *fiq_ops;
 
@@ -822,6 +825,9 @@ struct fuse_conn {
 
/** List of device instances belonging to this connection */
struct list_head devices;
+
+   /** DAX device, non-NULL if DAX is supported */
+   struct dax_device *dax_dev;
 };
 
 static inline struct fuse_conn *get_fuse_conn_super(struct super_block *sb)
@@ -1049,7 +1055,8 @@ struct fuse_conn *fuse_conn_get(struct fuse_conn *fc);
  * Initialize fuse_conn
  */
 void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
-   const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv);
+   struct dax_device *dax_dev,
+   const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv);
 
 /**
  * Release reference to fuse_conn
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 731a8a74d032..42f3ac5b7521 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -603,7 +603,8 @@ static void fuse_pqueue_init(struct fuse_pqueue *fpq)
 }
 
 void fuse_conn_init(struct fuse_conn *fc, struct user_namespace *user_ns,
-   const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
+   struct dax_device *dax_dev,
+   const struct fuse_iqueue_ops *fiq_ops, void *fiq_priv)
 {
memset(fc, 0, sizeof(*fc));
spin_lock_init(>lock);
@@ -628,6 +629,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct 
user_namespace *user_ns,
atomic64_set(>attr_version, 1);
get_random_bytes(>scramble_key, sizeof(fc->scramble_key));
fc->pid_ns = get_pid_ns(task_active_pid_ns(current));
+   fc->dax_dev = dax_dev;
fc->user_ns = get_user_ns(user_ns);
fc->max_pages = FUSE_DEFAULT_MAX_PAGES_PER_REQ;
 }
@@ -1140,8 +1142,8 @@ int fuse_fill_super_common(struct super_block *sb,
if (!fc)
goto err;
 
-   fuse_conn_init(fc, sb->s_user_ns, mount_data->fiq_ops,
-  mount_data->fiq_priv);
+   fuse_conn_init(fc, sb->s_user_ns, mount_data->dax_dev,
+  mount_data->fiq_ops, mount_data->fiq_priv);
fc->release = fuse_free_conn;
 
fud = fuse_dev_alloc_install(fc);
@@ -1243,6 +1245,7 @@ static int fuse_fill_super(struct super_block *sb, void 
*data, int silent)
goto err_fput;
__set_bit(FR_BACKGROUND, _req->flags);
 
+   d.dax_dev = NULL;
d.fiq_ops = _dev_fiq_ops;
d.fiq_priv = NULL;
d.fudptr = >private_data;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index e76e0f5dce40..a23a1fb67217 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -866,6 +866,7 @@ static int virtio_fs_fill_super(struct super_block *sb, 
void *data,
goto err_free_fuse_devs;
__set_bit(FR_BACKGROUND, _req->flags);
 
+   d.dax_dev = NULL;
d.fiq_ops = _fs_fiq_ops;
d.fiq_priv = fs;
d.fudptr = (void **)>vqs[VQ_REQUEST].fud;
-- 
2.20.1



[PATCH v2 22/30] fuse, dax: add DAX mmap support

2019-05-15 Thread Vivek Goyal
From: Stefan Hajnoczi 

Add DAX mmap() support.

Signed-off-by: Stefan Hajnoczi 
---
 fs/fuse/file.c | 64 +-
 1 file changed, 63 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index edbb11ca735e..a053bcb9498d 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2576,10 +2576,15 @@ static const struct vm_operations_struct 
fuse_file_vm_ops = {
.page_mkwrite   = fuse_page_mkwrite,
 };
 
+static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma);
 static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
struct fuse_file *ff = file->private_data;
 
+   /* DAX mmap is superior to direct_io mmap */
+   if (IS_DAX(file_inode(file)))
+   return fuse_dax_mmap(file, vma);
+
if (ff->open_flags & FOPEN_DIRECT_IO) {
/* Can't provide the coherency needed for MAP_SHARED */
if (vma->vm_flags & VM_MAYSHARE)
@@ -2611,9 +2616,65 @@ static ssize_t fuse_file_splice_read(struct file *in, 
loff_t *ppos,
 
 }
 
+static int __fuse_dax_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
+   bool write)
+{
+   vm_fault_t ret;
+   struct inode *inode = file_inode(vmf->vma->vm_file);
+   struct super_block *sb = inode->i_sb;
+   pfn_t pfn;
+
+   if (write)
+   sb_start_pagefault(sb);
+
+   /* TODO inode semaphore to protect faults vs truncate */
+
+   ret = dax_iomap_fault(vmf, pe_size, , NULL, _iomap_ops);
+
+   if (ret & VM_FAULT_NEEDDSYNC)
+   ret = dax_finish_sync_fault(vmf, pe_size, pfn);
+
+   if (write)
+   sb_end_pagefault(sb);
+
+   return ret;
+}
+
+static vm_fault_t fuse_dax_fault(struct vm_fault *vmf)
+{
+   return __fuse_dax_fault(vmf, PE_SIZE_PTE,
+   vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static vm_fault_t fuse_dax_huge_fault(struct vm_fault *vmf,
+  enum page_entry_size pe_size)
+{
+   return __fuse_dax_fault(vmf, pe_size, vmf->flags & FAULT_FLAG_WRITE);
+}
+
+static vm_fault_t fuse_dax_page_mkwrite(struct vm_fault *vmf)
+{
+   return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static vm_fault_t fuse_dax_pfn_mkwrite(struct vm_fault *vmf)
+{
+   return __fuse_dax_fault(vmf, PE_SIZE_PTE, true);
+}
+
+static const struct vm_operations_struct fuse_dax_vm_ops = {
+   .fault  = fuse_dax_fault,
+   .huge_fault = fuse_dax_huge_fault,
+   .page_mkwrite   = fuse_dax_page_mkwrite,
+   .pfn_mkwrite= fuse_dax_pfn_mkwrite,
+};
+
 static int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
 {
-   return -EINVAL; /* TODO */
+   file_accessed(file);
+   vma->vm_ops = _dax_vm_ops;
+   vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+   return 0;
 }
 
 static int convert_fuse_file_lock(struct fuse_conn *fc,
@@ -3622,6 +3683,7 @@ static const struct file_operations fuse_file_operations 
= {
.release= fuse_release,
.fsync  = fuse_fsync,
.lock   = fuse_file_lock,
+   .get_unmapped_area = thp_get_unmapped_area,
.flock  = fuse_file_flock,
.splice_read= fuse_file_splice_read,
.splice_write   = iter_file_splice_write,
-- 
2.20.1



[PATCH v2 27/30] fuse: Release file in process context

2019-05-15 Thread Vivek Goyal
fuse_file_put(sync) can be called with sync=true/false. If sync=true,
it waits for the release request response and then calls iput() in the
caller's context. If sync=false, it does not wait for the release request
response, frees the fuse_file struct immediately, and the req->end function
does the iput().

iput() can be a problem with DAX if called in req->end context. If this
is the last reference to the inode (VFS has already let go of its reference),
then iput() will clean up DAX mappings as well, send REMOVEMAPPING requests
and wait for their completion, all in the context of the worker thread which
is processing fuse replies from the daemon on the host.

That means it blocks the worker thread, which stops processing further
replies, and the system deadlocks.

So for now, force sync release of file in case of DAX inodes.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/file.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 87fc2b5e0a3a..b0293a308b5e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -475,6 +475,7 @@ void fuse_release_common(struct file *file, bool isdir)
struct fuse_file *ff = file->private_data;
struct fuse_req *req = ff->reserved_req;
int opcode = isdir ? FUSE_RELEASEDIR : FUSE_RELEASE;
+   bool sync = false;
 
fuse_prepare_release(fi, ff, file->f_flags, opcode);
 
@@ -495,8 +496,20 @@ void fuse_release_common(struct file *file, bool isdir)
 * Make the release synchronous if this is a fuseblk mount,
 * synchronous RELEASE is allowed (and desirable) in this case
 * because the server can be trusted not to screw up.
+*
+* For DAX, fuse server is trusted. So it should be fine to
+* do a sync file put. Doing async file put is creating
+* problems right now because when request finish, iput()
+* can lead to freeing of inode. That means it tears down
+* mappings backing DAX memory and sends REMOVEMAPPING message
+* to server and blocks for completion. Currently, waiting
+* in req->end context deadlocks the system as same worker thread
+* can't process REMOVEMAPPING reply it is waiting for.
 */
-   fuse_file_put(ff, ff->fc->destroy_req != NULL, isdir);
+   if (IS_DAX(req->misc.release.inode) || ff->fc->destroy_req != NULL)
+   sync = true;
+
+   fuse_file_put(ff, sync, isdir);
 }
 
 static int fuse_open(struct inode *inode, struct file *file)
-- 
2.20.1



[PATCH v2 03/30] fuse: Use default_file_splice_read for direct IO

2019-05-15 Thread Vivek Goyal
From: Miklos Szeredi 

---
 fs/fuse/file.c | 15 ++-
 fs/splice.c|  3 ++-
 include/linux/fs.h |  2 ++
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 5baf07fd2876..e9a7aa97c539 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2167,6 +2167,19 @@ static int fuse_file_mmap(struct file *file, struct 
vm_area_struct *vma)
return 0;
 }
 
+static ssize_t fuse_file_splice_read(struct file *in, loff_t *ppos,
+struct pipe_inode_info *pipe, size_t len,
+unsigned int flags)
+{
+   struct fuse_file *ff = in->private_data;
+
+   if (ff->open_flags & FOPEN_DIRECT_IO)
+   return default_file_splice_read(in, ppos, pipe, len, flags);
+   else
+   return generic_file_splice_read(in, ppos, pipe, len, flags);
+
+}
+
 static int convert_fuse_file_lock(struct fuse_conn *fc,
  const struct fuse_file_lock *ffl,
  struct file_lock *fl)
@@ -3174,7 +3187,7 @@ static const struct file_operations fuse_file_operations 
= {
.fsync  = fuse_fsync,
.lock   = fuse_file_lock,
.flock  = fuse_file_flock,
-   .splice_read= generic_file_splice_read,
+   .splice_read= fuse_file_splice_read,
.splice_write   = iter_file_splice_write,
.unlocked_ioctl = fuse_file_ioctl,
.compat_ioctl   = fuse_file_compat_ioctl,
diff --git a/fs/splice.c b/fs/splice.c
index 25212dcca2df..e2e881e34935 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -361,7 +361,7 @@ static ssize_t kernel_readv(struct file *file, const struct 
kvec *vec,
return res;
 }
 
-static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
+ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
 struct pipe_inode_info *pipe, size_t len,
 unsigned int flags)
 {
@@ -425,6 +425,7 @@ static ssize_t default_file_splice_read(struct file *in, 
loff_t *ppos,
iov_iter_advance(, copied);  /* truncates and discards */
return res;
 }
+EXPORT_SYMBOL(default_file_splice_read);
 
 /*
  * Send 'sd->len' bytes to socket from 'sd->file' at position 'sd->pos'
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd28e7679089..6804aecf7e30 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3055,6 +3055,8 @@ extern void block_sync_page(struct page *page);
 /* fs/splice.c */
 extern ssize_t generic_file_splice_read(struct file *, loff_t *,
struct pipe_inode_info *, size_t, unsigned int);
+extern ssize_t default_file_splice_read(struct file *, loff_t *,
+   struct pipe_inode_info *, size_t, unsigned int);
 extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
struct file *, loff_t *, size_t, unsigned int);
 extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
-- 
2.20.1



[PATCH v2 04/30] fuse: export fuse_end_request()

2019-05-15 Thread Vivek Goyal
From: Stefan Hajnoczi 

virtio-fs will need to complete requests from outside fs/fuse/dev.c.
Make the symbol visible.

Signed-off-by: Stefan Hajnoczi 
---
 fs/fuse/dev.c| 19 ++-
 fs/fuse/fuse_i.h |  5 +
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 9971a35cf1ef..46d1aecd7506 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -427,7 +427,7 @@ static void flush_bg_queue(struct fuse_conn *fc)
  * the 'end' callback is called if given, else the reference to the
  * request is released
  */
-static void request_end(struct fuse_conn *fc, struct fuse_req *req)
+void fuse_request_end(struct fuse_conn *fc, struct fuse_req *req)
 {
struct fuse_iqueue *fiq = >iq;
 
@@ -480,6 +480,7 @@ static void request_end(struct fuse_conn *fc, struct 
fuse_req *req)
 put_request:
fuse_put_request(fc, req);
 }
+EXPORT_SYMBOL_GPL(fuse_request_end);
 
 static int queue_interrupt(struct fuse_iqueue *fiq, struct fuse_req *req)
 {
@@ -567,12 +568,12 @@ static void __fuse_request_send(struct fuse_conn *fc, 
struct fuse_req *req)
req->in.h.unique = fuse_get_unique(fiq);
queue_request(fiq, req);
/* acquire extra reference, since request is still needed
-  after request_end() */
+  after fuse_request_end() */
__fuse_get_request(req);
spin_unlock(>waitq.lock);
 
request_wait_answer(fc, req);
-   /* Pairs with smp_wmb() in request_end() */
+   /* Pairs with smp_wmb() in fuse_request_end() */
smp_rmb();
}
 }
@@ -1302,7 +1303,7 @@ __releases(fiq->waitq.lock)
  * the pending list and copies request data to userspace buffer.  If
  * no reply is needed (FORGET) or request has been aborted or there
  * was an error during the copying then it's finished by calling
- * request_end().  Otherwise add it to the processing list, and set
+ * fuse_request_end().  Otherwise add it to the processing list, and set
  * the 'sent' flag.
  */
 static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
@@ -1362,7 +1363,7 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, 
struct file *file,
/* SETXATTR is special, since it may contain too large data */
if (in->h.opcode == FUSE_SETXATTR)
req->out.h.error = -E2BIG;
-   request_end(fc, req);
+   fuse_request_end(fc, req);
goto restart;
}
spin_lock(>lock);
@@ -1405,7 +1406,7 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, 
struct file *file,
if (!test_bit(FR_PRIVATE, >flags))
list_del_init(>list);
spin_unlock(>lock);
-   request_end(fc, req);
+   fuse_request_end(fc, req);
return err;
 
  err_unlock:
@@ -1913,7 +1914,7 @@ static int copy_out_args(struct fuse_copy_state *cs, 
struct fuse_out *out,
  * the write buffer.  The request is then searched on the processing
  * list by the unique ID found in the header.  If found, then remove
  * it from the list and copy the rest of the buffer to the request.
- * The request is finished by calling request_end()
+ * The request is finished by calling fuse_request_end().
  */
 static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
 struct fuse_copy_state *cs, size_t nbytes)
@@ -2000,7 +2001,7 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
list_del_init(>list);
spin_unlock(>lock);
 
-   request_end(fc, req);
+   fuse_request_end(fc, req);
 out:
return err ? err : nbytes;
 
@@ -2140,7 +2141,7 @@ static void end_requests(struct fuse_conn *fc, struct 
list_head *head)
req->out.h.error = -ECONNABORTED;
clear_bit(FR_SENT, >flags);
list_del_init(>list);
-   request_end(fc, req);
+   fuse_request_end(fc, req);
}
 }
 
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 0920c0c032a0..c4584c873b87 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -949,6 +949,11 @@ ssize_t fuse_simple_request(struct fuse_conn *fc, struct 
fuse_args *args);
 void fuse_request_send_background(struct fuse_conn *fc, struct fuse_req *req);
 bool fuse_request_queue_background(struct fuse_conn *fc, struct fuse_req *req);
 
+/**
+ * End a finished request
+ */
+void fuse_request_end(struct fuse_conn *fc, struct fuse_req *req);
+
 /* Abort all requests */
 void fuse_abort_conn(struct fuse_conn *fc);
 void fuse_wait_aborted(struct fuse_conn *fc);
-- 
2.20.1



[PATCH v2 10/30] fuse: Separate fuse device allocation and installation in fuse_conn

2019-05-15 Thread Vivek Goyal
As of now fuse_dev_alloc() both allocates a fuse device and installs it
in the fuse_conn list. fuse_dev_alloc() can fail if the fuse_device
allocation fails.

virtio-fs needs to initialize multiple fuse devices (one per virtio
queue). It initializes one fuse device as part of the call to
fuse_fill_super_common() and the rest of the devices are allocated and
installed after that.

But we can't afford to fail after calling fuse_fill_super_common(), as
we don't have a way to undo all the actions done by fuse_fill_super_common().
So to avoid failures after the call to fuse_fill_super_common(),
pre-allocate all fuse devices early and install them into the fuse
connection later.

This patch provides two separate helpers for fuse device allocation and
fuse device installation in fuse_conn.
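
The shape of the split, outside the kernel: everything that can fail is
done up front, and the steps after the point of no return cannot fail.
The sketch below is plain user-space C with invented types.

#include <stdio.h>
#include <stdlib.h>

struct dev  { int id; };
struct conn { struct dev *devs[4]; int nr; };

static struct dev *dev_alloc(int id)                    /* may fail */
{
        struct dev *d = malloc(sizeof(*d));

        if (d)
                d->id = id;
        return d;
}

static void dev_install(struct conn *c, struct dev *d)  /* cannot fail */
{
        c->devs[c->nr++] = d;
}

int main(void)
{
        struct dev *pre[4];
        struct conn c = { .nr = 0 };
        int i;

        /* phase 1: pre-allocate everything; bail out cleanly on failure */
        for (i = 0; i < 4; i++) {
                pre[i] = dev_alloc(i);
                if (!pre[i]) {
                        while (i--)
                                free(pre[i]);
                        return 1;
                }
        }

        /* phase 2: past the point of no return, only infallible steps */
        for (i = 0; i < 4; i++)
                dev_install(&c, pre[i]);

        printf("installed %d devices\n", c.nr);
        return 0;
}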

Signed-off-by: Vivek Goyal 
---
 fs/fuse/cuse.c   |  2 +-
 fs/fuse/dev.c|  2 +-
 fs/fuse/fuse_i.h |  4 +++-
 fs/fuse/inode.c  | 25 -
 4 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index a6ed7a036b50..a509747153a7 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -506,7 +506,7 @@ static int cuse_channel_open(struct inode *inode, struct 
file *file)
 */
fuse_conn_init(>fc, file->f_cred->user_ns, _dev_fiq_ops, NULL);
 
-   fud = fuse_dev_alloc(>fc);
+   fud = fuse_dev_alloc_install(>fc);
if (!fud) {
kfree(cc);
return -ENOMEM;
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index ef489beadf58..ee9dd38bc0f0 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2318,7 +2318,7 @@ static int fuse_device_clone(struct fuse_conn *fc, struct 
file *new)
if (new->private_data)
return -EINVAL;
 
-   fud = fuse_dev_alloc(fc);
+   fud = fuse_dev_alloc_install(fc);
if (!fud)
return -ENOMEM;
 
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 0b578e07156d..4008ed65a48d 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1050,7 +1050,9 @@ void fuse_conn_init(struct fuse_conn *fc, struct 
user_namespace *user_ns,
  */
 void fuse_conn_put(struct fuse_conn *fc);
 
-struct fuse_dev *fuse_dev_alloc(struct fuse_conn *fc);
+struct fuse_dev *fuse_dev_alloc_install(struct fuse_conn *fc);
+struct fuse_dev *fuse_dev_alloc(void);
+void fuse_dev_install(struct fuse_dev *fud, struct fuse_conn *fc);
 void fuse_dev_free(struct fuse_dev *fud);
 void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req);
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 126e77854dac..9b0114437a14 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1027,8 +1027,7 @@ static int fuse_bdi_init(struct fuse_conn *fc, struct 
super_block *sb)
return 0;
 }
 
-struct fuse_dev *fuse_dev_alloc(struct fuse_conn *fc)
-{
+struct fuse_dev *fuse_dev_alloc(void) {
struct fuse_dev *fud;
struct list_head *pq;
 
@@ -1043,16 +1042,32 @@ struct fuse_dev *fuse_dev_alloc(struct fuse_conn *fc)
}
 
fud->pq.processing = pq;
-   fud->fc = fuse_conn_get(fc);
fuse_pqueue_init(>pq);
 
+   return fud;
+}
+EXPORT_SYMBOL_GPL(fuse_dev_alloc);
+
+void fuse_dev_install(struct fuse_dev *fud, struct fuse_conn *fc) {
+   fud->fc = fuse_conn_get(fc);
spin_lock(>lock);
list_add_tail(>entry, >devices);
spin_unlock(>lock);
+}
+EXPORT_SYMBOL_GPL(fuse_dev_install);
 
+struct fuse_dev *fuse_dev_alloc_install(struct fuse_conn *fc)
+{
+   struct fuse_dev *fud;
+
+   fud = fuse_dev_alloc();
+   if (!fud)
+   return NULL;
+
+   fuse_dev_install(fud, fc);
return fud;
 }
-EXPORT_SYMBOL_GPL(fuse_dev_alloc);
+EXPORT_SYMBOL_GPL(fuse_dev_alloc_install);
 
 void fuse_dev_free(struct fuse_dev *fud)
 {
@@ -1122,7 +1137,7 @@ int fuse_fill_super_common(struct super_block *sb,
   mount_data->fiq_priv);
fc->release = fuse_free_conn;
 
-   fud = fuse_dev_alloc(fc);
+   fud = fuse_dev_alloc_install(fc);
if (!fud)
goto err_put_conn;
 
-- 
2.20.1



[PATCH v2 29/30] fuse: Take inode lock for dax inode truncation

2019-05-15 Thread Vivek Goyal
When a file is opened with O_TRUNC, we need to make sure that any other
DAX operation is not in progress. DAX expects i_size to be stable.

In fuse_iomap_begin() we check for i_size at multiple places and we expect
i_size to not change.

Another problem is that if we set up a mapping in fuse_iomap_begin(), and
the file then gets truncated and a dax read/write happens, KVM currently
hangs. It tries to fault in a page which does not exist on the host (the
file got truncated). This probably requires fixing in KVM.

So for now, take the inode lock. Once KVM is fixed, we might have to
have a look at it again.

Signed-off-by: Vivek Goyal 
---
 fs/fuse/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9b82d9b4ebc3..d0979dc32f08 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -420,7 +420,7 @@ int fuse_open_common(struct inode *inode, struct file 
*file, bool isdir)
int err;
bool lock_inode = (file->f_flags & O_TRUNC) &&
  fc->atomic_o_trunc &&
- fc->writeback_cache;
+ (fc->writeback_cache || IS_DAX(inode));
 
err = generic_file_open(inode, file);
if (err)
-- 
2.20.1



[PATCH v2 00/30] [RFC] virtio-fs: shared file system for virtual machines

2019-05-15 Thread Vivek Goyal
Hi,

Here are the RFC patches for V2 of virtio-fs. These patches apply on top
of 5.1 kernel. These patches are also available here.

https://github.com/rhvgoyal/linux/commits/virtio-fs-dev-5.1
  
Patches for V1 were posted here.
  
https://lwn.net/ml/linux-fsdevel/20181210171318.16998-1-vgo...@redhat.com/

This is still work in progress. As of now one can pass through a host
directory into a guest and it works reasonably well. The pjdfstests test
suite passes and blogbench runs. But this directory can't be shared
between guests, and the host can't modify files in the directory yet.
That's still TBD.
  
Posting another version to gather feedback and comments on progress so far.
  
More information about the project can be found here.
  
https://virtio-fs.gitlab.io/

Changes from V1
===
- Various bug fixes
- virtio-fs dax huge page size working, leading to improved performance.
- Fixed kernel automated tests warnings.
- Better handling of shared cache region reporting by virtio device.

Description from V1 posting
---
Problem Description
===
We want to be able to take a directory tree on the host and share it with
guest[s]. Our goal is to be able to do it in a fast, consistent and secure
manner. Our primary use case is kata containers, but it should be usable in
other scenarios as well.

Containers may rely on local file system semantics for shared volumes,
read-write mounts that multiple containers access simultaneously.  File
system changes must be visible to other containers with the same consistency
expected of a local file system, including mmap MAP_SHARED.

Existing Solutions
==
We looked at existing solutions, and virtio-9p already provides basic shared
file system functionality, although it does not offer local file system
semantics, causing some workloads and test suites to fail. In addition,
virtio-9p performance has been an issue for Kata Containers and we believe
this cannot be alleviated without major changes that do not fit into the 9P
protocol.

Design Overview
===
With the goal of designing something with better performance and local file
system semantics, a bunch of ideas were proposed.

- Use fuse protocol (instead of 9p) for communication between guest
  and host. Guest kernel will be fuse client and a fuse server will
  run on host to serve the requests. Benchmark results are encouraging and
  show this approach performs well (2x to 8x improvement depending on test
  being run).

- For data access inside guest, mmap portion of file in QEMU address
  space and guest accesses this memory using dax. That way guest page
  cache is bypassed and there is only one copy of data (on host). This
  will also enable mmap(MAP_SHARED) between guests.

- For metadata coherency, there is a shared memory region which contains
  version number associated with metadata and any guest changing metadata
  updates version number and other guests refresh metadata on next
  access. This is yet to be implemented.
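
The metadata coherency idea in the last point boils down to a version
check before trusting cached metadata. A user-space sketch of that
pattern (the "shared region" is modelled as a plain variable and the
attribute values are made up; none of this is implemented in this series
yet):

#include <stdint.h>
#include <stdio.h>

static volatile uint64_t shared_version = 1;    /* lives in the shared region */

struct cached_meta {
        uint64_t version;       /* version the cache was filled at */
        uint64_t size;          /* some cached attribute */
};

static void refresh_from_host(struct cached_meta *m)
{
        m->size = 12345;                        /* pretend we re-fetched attributes */
        m->version = shared_version;
}

static uint64_t get_size(struct cached_meta *m)
{
        if (m->version != shared_version)       /* someone changed metadata */
                refresh_from_host(m);
        return m->size;
}

int main(void)
{
        struct cached_meta m = { 0 };

        printf("size=%llu\n", (unsigned long long)get_size(&m));
        shared_version++;                       /* another guest made a change */
        printf("size=%llu (refreshed)\n", (unsigned long long)get_size(&m));
        return 0;
}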

How virtio-fs differs from existing approaches
==
The unique idea behind virtio-fs is to take advantage of the co-location
of the virtual machine and hypervisor to avoid communication (vmexits).

DAX allows file contents to be accessed without communication with the
hypervisor. The shared memory region for metadata avoids communication in
the common case where metadata is unchanged.

By replacing expensive communication with cheaper shared memory accesses,
we expect to achieve better performance than approaches based on network
file system protocols. In addition, this also makes it easier to achieve
local file system semantics (coherency).

These techniques are not applicable to network file system protocols since
the communications channel is bypassed by taking advantage of shared memory
on a local machine. This is why we decided to build virtio-fs rather than
focus on 9P or NFS.

HOWTO
==
We have put instructions on how to use it here.

https://virtio-fs.gitlab.io/

Caching Modes
=
Like virtio-9p, different caching modes are supported which determine the
coherency level as well. The “cache=FOO” and “writeback” options control the
level of coherence between the guest and host filesystems. The “shared” option
only has an effect on coherence between virtio-fs filesystem instances
running inside different guests.

- cache=none
  metadata, data and pathname lookup are not cached in guest. They are always
  fetched from host and any changes are immediately pushed to host.

- cache=always
  metadata, data and pathname lookup are cached in guest and never expire.

- cache=auto
  metadata and pathname lookup cache expires after a configured amount of time
  (default is 1 second). Data is cached while the file is open (close to open
  consistency).

- writeback/no_writeback
  These options control the writeback strategy.  If writeback is disabled,
  then normal writes will immediately be 

[PATCH v2 15/30] virtio: Implement get_shm_region for PCI transport

2019-05-15 Thread Vivek Goyal
From: Sebastien Boeuf 

On PCI the shm regions are found using capability entries;
find a region by searching for the capability.
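
A rough user-space model of the lookup: walk a capability chain in config
space and match on the shared-memory type and the requested id. Config
space is a byte array here and the record layout is a simplified stand-in
for the real virtio_pci_shm_cap.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CAP_LIST        0x34    /* offset of the first capability pointer */
#define SHM_CAP_TYPE    8       /* stands in for VIRTIO_PCI_CAP_SHARED_MEMORY_CFG */

static uint8_t cfg[256];

/* simplified capability record: next pointer, type, id */
static void add_cap(uint8_t off, uint8_t next, uint8_t type, uint8_t id)
{
        cfg[off]     = next;
        cfg[off + 1] = type;
        cfg[off + 2] = id;
}

static int find_shm_cap(uint8_t wanted_id)
{
        uint8_t pos = cfg[CAP_LIST];

        while (pos) {
                if (cfg[pos + 1] == SHM_CAP_TYPE && cfg[pos + 2] == wanted_id)
                        return pos;     /* caller reads bar/offset/len from here */
                pos = cfg[pos];         /* follow the chain */
        }
        return 0;
}

int main(void)
{
        memset(cfg, 0, sizeof(cfg));
        cfg[CAP_LIST] = 0x40;
        add_cap(0x40, 0x50, 1, 0);              /* some other capability */
        add_cap(0x50, 0x00, SHM_CAP_TYPE, 0);   /* the shm region we want */

        printf("shm cap for id 0 at config offset 0x%x\n", find_shm_cap(0));
        return 0;
}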

Signed-off-by: Sebastien Boeuf 
Signed-off-by: Dr. David Alan Gilbert 
---
 drivers/virtio/virtio_pci_modern.c | 108 +
 include/uapi/linux/virtio_pci.h|  10 +++
 2 files changed, 118 insertions(+)

diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index 07571daccfec..51c9e6eca5ac 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -446,6 +446,112 @@ static void del_vq(struct virtio_pci_vq_info *info)
vring_del_virtqueue(vq);
 }
 
+static int virtio_pci_find_shm_cap(struct pci_dev *dev,
+   u8 required_id,
+   u8 *bar, u64 *offset, u64 *len)
+{
+   int pos;
+
+for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR);
+ pos > 0;
+ pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) {
+   u8 type, cap_len, id;
+u32 tmp32;
+u64 res_offset, res_length;
+
+   pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ cfg_type),
+ );
+if (type != VIRTIO_PCI_CAP_SHARED_MEMORY_CFG)
+continue;
+
+   pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ cap_len),
+ _len);
+if (cap_len != sizeof(struct virtio_pci_shm_cap)) {
+   printk(KERN_ERR "%s: shm cap with bad size offset: %d 
size: %d\n",
+   __func__, pos, cap_len);
+continue;
+};
+
+   pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_shm_cap,
+ id),
+ );
+if (id != required_id)
+continue;
+
+/* Type, and ID match, looks good */
+pci_read_config_byte(dev, pos + offsetof(struct virtio_pci_cap,
+ bar),
+ bar);
+
+/* Read the lower 32bit of length and offset */
+pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, offset),
+  );
+res_offset = tmp32;
+pci_read_config_dword(dev, pos + offsetof(struct virtio_pci_cap, length),
+  );
+res_length = tmp32;
+
+/* and now the top half */
+pci_read_config_dword(dev,
+  pos + offsetof(struct virtio_pci_shm_cap,
+ offset_hi),
+  );
+res_offset |= ((u64)tmp32) << 32;
+pci_read_config_dword(dev,
+  pos + offsetof(struct virtio_pci_shm_cap,
+ length_hi),
+  );
+res_length |= ((u64)tmp32) << 32;
+
+*offset = res_offset;
+*len = res_length;
+
+return pos;
+}
+return 0;
+}
+
+static bool vp_get_shm_region(struct virtio_device *vdev,
+ struct virtio_shm_region *region, u8 id)
+{
+   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+   struct pci_dev *pci_dev = vp_dev->pci_dev;
+   u8 bar;
+   u64 offset, len;
+   phys_addr_t phys_addr;
+   size_t bar_len;
+   char *bar_name;
+   int ret;
+
+   if (!virtio_pci_find_shm_cap(pci_dev, id, , , )) {
+   return false;
+   }
+
+   ret = pci_request_region(pci_dev, bar, "virtio-pci-shm");
+   if (ret < 0) {
+   dev_err(_dev->dev, "%s: failed to request BAR\n",
+   __func__);
+   return false;
+   }
+
+   phys_addr = pci_resource_start(pci_dev, bar);
+   bar_len = pci_resource_len(pci_dev, bar);
+
+if (offset + len > bar_len) {
+dev_err(_dev->dev,
+"%s: bar shorter than cap offset+len\n",
+__func__);
+return false;
+}
+
+   region->len = len;
+   region->addr = (u64) phys_addr + offset;
+
+   return true;
+}
+
 static const struct virtio_config_ops virtio_pci_config_nodev_ops = {
.get= NULL,
.set= NULL,
@@ -460,6 +566,7 @@ static const struct virtio_config_ops 
virtio_pci_config_nodev_ops = {
.bus_name   = 

[PATCH v2 13/30] dax: Pass dax_dev to dax_writeback_mapping_range()

2019-05-15 Thread Vivek Goyal
Right now dax_writeback_mapping_range() is passed a bdev, and the dax_dev
is looked up from that bdev's name.

virtio-fs does not have a bdev. So pass in dax_dev also to
dax_writeback_mapping_range(). If dax_dev is passed in, bdev is not
used; otherwise dax_dev is looked up using the bdev.
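
The calling convention, sketched in user space: if the caller already has
a dax_dev it is used directly, otherwise it is looked up from the bdev
name and released again at the end. All names below are stand-ins.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct dax_handle { char name[32]; };

static struct dax_handle *get_by_host(const char *name)
{
        struct dax_handle *d = malloc(sizeof(*d));

        if (d)
                snprintf(d->name, sizeof(d->name), "%s", name);
        return d;
}

static void put_handle(struct dax_handle *d)
{
        free(d);
}

static int writeback_range(const char *bdev_name, struct dax_handle *dax)
{
        int looked_up = 0;

        if (bdev_name) {                        /* block-device based filesystem */
                dax = get_by_host(bdev_name);
                if (!dax)
                        return -1;
                looked_up = 1;
        }
        /* ... flush dirty pages through dax ... */
        printf("writeback via %s\n", dax->name);

        if (looked_up)                          /* only drop what we took */
                put_handle(dax);
        return 0;
}

int main(void)
{
        struct dax_handle virtiofs_dax = { "virtiofs-dax" };

        writeback_range("pmem0", NULL);         /* ext4/xfs style: look up by bdev */
        writeback_range(NULL, &virtiofs_dax);   /* virtio-fs style: pass dax_dev */
        return 0;
}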

Signed-off-by: Vivek Goyal 
---
 fs/dax.c| 16 ++--
 fs/ext2/inode.c |  2 +-
 fs/ext4/inode.c |  2 +-
 fs/xfs/xfs_aops.c   |  2 +-
 include/linux/dax.h |  6 --
 5 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 815bc32fd967..c944c1efc78f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -932,12 +932,12 @@ static int dax_writeback_one(struct xa_state *xas, struct 
dax_device *dax_dev,
  * on persistent storage prior to completion of the operation.
  */
 int dax_writeback_mapping_range(struct address_space *mapping,
-   struct block_device *bdev, struct writeback_control *wbc)
+   struct block_device *bdev, struct dax_device *dax_dev,
+   struct writeback_control *wbc)
 {
XA_STATE(xas, >i_pages, wbc->range_start >> PAGE_SHIFT);
struct inode *inode = mapping->host;
pgoff_t end_index = wbc->range_end >> PAGE_SHIFT;
-   struct dax_device *dax_dev;
void *entry;
int ret = 0;
unsigned int scanned = 0;
@@ -948,9 +948,12 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
if (!mapping->nrexceptional || wbc->sync_mode != WB_SYNC_ALL)
return 0;
 
-   dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
-   if (!dax_dev)
-   return -EIO;
+   if (bdev) {
+   WARN_ON(dax_dev);
+   dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+   if (!dax_dev)
+   return -EIO;
+   }
 
trace_dax_writeback_range(inode, xas.xa_index, end_index);
 
@@ -972,7 +975,8 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
xas_lock_irq();
}
xas_unlock_irq();
-   put_dax(dax_dev);
+   if (bdev)
+   put_dax(dax_dev);
trace_dax_writeback_range_done(inode, xas.xa_index, end_index);
return ret;
 }
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c27c27300d95..9b0131c53429 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -956,7 +956,7 @@ static int
 ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
return dax_writeback_mapping_range(mapping,
-   mapping->host->i_sb->s_bdev, wbc);
+   mapping->host->i_sb->s_bdev, NULL, wbc);
 }
 
 const struct address_space_operations ext2_aops = {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b32a57bc5d5d..cb8cf5eddd9b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2972,7 +2972,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
	percpu_down_read(&sbi->s_journal_flag_rwsem);
trace_ext4_writepages(inode, wbc);
 
-   ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, wbc);
+   ret = dax_writeback_mapping_range(mapping, inode->i_sb->s_bdev, NULL, wbc);
trace_ext4_writepages_result(inode, wbc, ret,
 nr_to_write - wbc->nr_to_write);
	percpu_up_read(&sbi->s_journal_flag_rwsem);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3619e9e8d359..27f71ff55096 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -994,7 +994,7 @@ xfs_dax_writepages(
 {
xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
return dax_writeback_mapping_range(mapping,
-   xfs_find_bdev_for_inode(mapping->host), wbc);
+   xfs_find_bdev_for_inode(mapping->host), NULL, wbc);
 }
 
 STATIC int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0dd316a74a29..bf3b00b5f0bf 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -87,7 +87,8 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
 int dax_writeback_mapping_range(struct address_space *mapping,
-   struct block_device *bdev, struct writeback_control *wbc);
+   struct block_device *bdev, struct dax_device *dax_dev,
+   struct writeback_control *wbc);
 
 struct page *dax_layout_busy_page(struct address_space *mapping);
 dax_entry_t dax_lock_page(struct page *page);
@@ -119,7 +120,8 @@ static inline struct page *dax_layout_busy_page(struct address_space *mapping)
 }
 
 static inline int dax_writeback_mapping_range(struct address_space *mapping,
-   struct block_device *bdev, struct writeback_control *wbc)
+   struct block_device *bdev, struct dax_device *dax_dev,
+   struct writeback_control *wbc)
 {
return -EOPNOTSUPP;
 }
-- 
2.20.1

___
Linux-nvdimm mailing list

[PATCH v2 01/30] fuse: delete dentry if timeout is zero

2019-05-15 Thread Vivek Goyal
From: Miklos Szeredi 

Don't hold onto a dentry in the lru list if it will need to be looked up
again at the next access anyway.

A more advanced version of this patch would periodically flush dentries
that have gone stale out of the lru.
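
For clarity, the hook this adds boils down to the following sketch (it
restates the diff below rather than adding anything new); a non-zero
return from ->d_delete on the final dput() tells the VFS to free the
dentry right away instead of caching it on the LRU:

	static int fuse_dentry_delete(const struct dentry *dentry)
	{
		/* An expired dentry would have to be looked up again anyway. */
		return time_before64(fuse_dentry_time(dentry), get_jiffies_64());
	}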

Signed-off-by: Miklos Szeredi 
---
 fs/fuse/dir.c | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index dd0f64f7bc06..fd8636e67ae9 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -29,12 +29,26 @@ union fuse_dentry {
struct rcu_head rcu;
 };
 
-static inline void fuse_dentry_settime(struct dentry *entry, u64 time)
+static void fuse_dentry_settime(struct dentry *dentry, u64 time)
 {
-   ((union fuse_dentry *) entry->d_fsdata)->time = time;
+   /*
+* Mess with DCACHE_OP_DELETE because dput() will be faster without it.
+*  Don't care about races, either way it's just an optimization
+*/
+   if ((time && (dentry->d_flags & DCACHE_OP_DELETE)) ||
+   (!time && !(dentry->d_flags & DCACHE_OP_DELETE))) {
+   spin_lock(&dentry->d_lock);
+   if (time)
+   dentry->d_flags &= ~DCACHE_OP_DELETE;
+   else
+   dentry->d_flags |= DCACHE_OP_DELETE;
+   spin_unlock(&dentry->d_lock);
+   }
+
+   ((union fuse_dentry *) dentry->d_fsdata)->time = time;
 }
 
-static inline u64 fuse_dentry_time(struct dentry *entry)
+static inline u64 fuse_dentry_time(const struct dentry *entry)
 {
return ((union fuse_dentry *) entry->d_fsdata)->time;
 }
@@ -255,8 +269,14 @@ static void fuse_dentry_release(struct dentry *dentry)
kfree_rcu(fd, rcu);
 }
 
+static int fuse_dentry_delete(const struct dentry *dentry)
+{
+   return time_before64(fuse_dentry_time(dentry), get_jiffies_64());
+}
+
 const struct dentry_operations fuse_dentry_operations = {
.d_revalidate   = fuse_dentry_revalidate,
+   .d_delete   = fuse_dentry_delete,
.d_init = fuse_dentry_init,
.d_release  = fuse_dentry_release,
 };
-- 
2.20.1

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH v2 05/30] fuse: export fuse_len_args()

2019-05-15 Thread Vivek Goyal
From: Stefan Hajnoczi 

virtio-fs will need to query the length of fuse_arg lists.  Make the
symbol visible.
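
As an illustration of the intended use, here is a sketch of the
calculation an external transport would need (it mirrors queue_request()
in the diff below; the function name is made up for the example):

	/* Sketch: total size of a request's input, header plus arguments. */
	static size_t example_req_in_bytes(struct fuse_req *req)
	{
		return sizeof(struct fuse_in_header) +
		       fuse_len_args(req->in.numargs,
				     (struct fuse_arg *) req->in.args);
	}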

Signed-off-by: Stefan Hajnoczi 
---
 fs/fuse/dev.c| 7 ---
 fs/fuse/fuse_i.h | 5 +
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 46d1aecd7506..d8054b1a45f5 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -350,7 +350,7 @@ void fuse_put_request(struct fuse_conn *fc, struct fuse_req *req)
 }
 EXPORT_SYMBOL_GPL(fuse_put_request);
 
-static unsigned len_args(unsigned numargs, struct fuse_arg *args)
+unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args)
 {
unsigned nbytes = 0;
unsigned i;
@@ -360,6 +360,7 @@ static unsigned len_args(unsigned numargs, struct fuse_arg *args)
 
return nbytes;
 }
+EXPORT_SYMBOL_GPL(fuse_len_args);
 
 static u64 fuse_get_unique(struct fuse_iqueue *fiq)
 {
@@ -375,7 +376,7 @@ static unsigned int fuse_req_hash(u64 unique)
 static void queue_request(struct fuse_iqueue *fiq, struct fuse_req *req)
 {
req->in.h.len = sizeof(struct fuse_in_header) +
-   len_args(req->in.numargs, (struct fuse_arg *) req->in.args);
+   fuse_len_args(req->in.numargs, (struct fuse_arg *) req->in.args);
	list_add_tail(&req->list, &fiq->pending);
	wake_up_locked(&fiq->waitq);
	kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
@@ -1894,7 +1895,7 @@ static int copy_out_args(struct fuse_copy_state *cs, struct fuse_out *out,
if (out->h.error)
return nbytes != reqsize ? -EINVAL : 0;
 
-   reqsize += len_args(out->numargs, out->args);
+   reqsize += fuse_len_args(out->numargs, out->args);
 
if (reqsize < nbytes || (reqsize > nbytes && !out->argvar))
return -EINVAL;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c4584c873b87..3a235386d667 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -1091,4 +1091,9 @@ int fuse_set_acl(struct inode *inode, struct posix_acl *acl, int type);
 /* readdir.c */
 int fuse_readdir(struct file *file, struct dir_context *ctx);
 
+/**
+ * Return the number of bytes in an arguments list
+ */
+unsigned fuse_len_args(unsigned numargs, struct fuse_arg *args);
+
 #endif /* _FS_FUSE_I_H */
-- 
2.20.1

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [v5 0/3] "Hotremove" persistent memory

2019-05-15 Thread Pavel Tatashin
> Hi Pavel,
>
> I am working on adding this sort of a workflow into a new daxctl command
> (daxctl-reconfigure-device)- this will allow changing the 'mode' of a
> dax device to kmem, online the resulting memory, and with your patches,
> also attempt to offline the memory, and change back to device-dax.
>
> In running with these patches, and testing the offlining part, I ran
> into the following lockdep below.
>
> This is with just these three patches on top of -rc7.
>
>
> [  +0.004886] ==
> [  +0.001576] WARNING: possible circular locking dependency detected
> [  +0.001506] 5.1.0-rc7+ #13 Tainted: G   O
> [  +0.000929] --
> [  +0.000708] daxctl/22950 is trying to acquire lock:
> [  +0.000548] f4d397f7 (kn->count#424){}, at: 
> kernfs_remove_by_name_ns+0x40/0x80
> [  +0.000922]
>   but task is already holding lock:
> [  +0.000657] 2aa52a9f (mem_sysfs_mutex){+.+.}, at: 
> unregister_memory_section+0x22/0xa0

I have studied this issue and now have a clear understanding of why it
happens. I am not yet sure how to fix it, so suggestions are welcome. :)

Here is the problem:

When we offline pages we have the following call stack:

# echo offline > /sys/devices/system/memory/memory8/state
ksys_write
 vfs_write
  __vfs_write
   kernfs_fop_write
kernfs_get_active
 lock_acquire   kn->count#122 (lock for "memory8/state" kn)
sysfs_kf_write
 dev_attr_store
  state_store
   device_offline
memory_subsys_offline
 memory_block_action
  offline_pages
   __offline_pages
percpu_down_write
 down_write
  lock_acquire  mem_hotplug_lock.rw_sem

When we unbind dax0.0 we have the following  stack:
# echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
drv_attr_store
 unbind_store
  device_driver_detach
   device_release_driver_internal
dev_dax_kmem_remove
 remove_memory  device_hotplug_lock
  try_remove_memory mem_hotplug_lock.rw_sem
   arch_remove_memory
__remove_pages
 __remove_section
  unregister_memory_section
   remove_memory_section   mem_sysfs_mutex
unregister_memory
 device_unregister
  device_del
   device_remove_attrs
sysfs_remove_groups
 sysfs_remove_group
  remove_files
   kernfs_remove_by_name
kernfs_remove_by_name_ns
 __kernfs_remove   kn->count#122

So, lockdep found the ordering issue with the above two stacks:

1. kn->count#122 -> mem_hotplug_lock.rw_sem
2. mem_hotplug_lock.rw_sem -> kn->count#122
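
To spell out the shape of the problem, here is a deliberately generic
user-space sketch of the same circular ordering; the mutexes stand in for
kn->count and mem_hotplug_lock only by analogy (the real primitives are a
kernfs active reference and a percpu rwsem), and nothing here is kernel
code:

	/* abba.c - two paths taking the same two locks in opposite order.
	 * Build with: gcc -pthread abba.c -o abba
	 * May deadlock when each thread wins its first lock, which is the
	 * dependency lockdep reports above.
	 */
	#include <pthread.h>
	#include <stddef.h>

	static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* "kn->count" */
	static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* "mem_hotplug_lock" */

	static void *offline_path(void *arg)	/* takes A, then B */
	{
		pthread_mutex_lock(&lock_a);
		pthread_mutex_lock(&lock_b);
		pthread_mutex_unlock(&lock_b);
		pthread_mutex_unlock(&lock_a);
		return NULL;
	}

	static void *unbind_path(void *arg)	/* takes B, then A */
	{
		pthread_mutex_lock(&lock_b);
		pthread_mutex_lock(&lock_a);
		pthread_mutex_unlock(&lock_a);
		pthread_mutex_unlock(&lock_b);
		return NULL;
	}

	int main(void)
	{
		pthread_t t1, t2;

		pthread_create(&t1, NULL, offline_path, NULL);
		pthread_create(&t2, NULL, unbind_path, NULL);
		pthread_join(t1, NULL);
		pthread_join(t2, NULL);
		return 0;
	}
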
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v3 15/18] Documentation: kunit: add documentation for KUnit

2019-05-15 Thread Jonathan Corbet
On Tue, 14 May 2019 16:19:02 -0700
Brendan Higgins  wrote:

> Hmmm...probably premature to bring this up, but Documentation/dev-tools/
> is kind of thrown together.

Wait a minute, man... *I* created that directory, are you impugning my
work? :)

But yes, "kind of thrown together" is a good description of much of
Documentation/.  A number of people have been working for years to make
that better, with some success, but there is a long way to go yet.  The
dev-tools directory is an improvement over having that stuff scattered all
over the place — at least it's actually thrown together — but it's not the
end point.

> It would be nice to provide a coherent overview, maybe provide some
> basic grouping as well.
> 
> It would be nice if there was kind of a gentle introduction to the
> tools, which ones you should be looking at, when, why, etc.

Total agreement.  All we need is somebody to write it!  :)

Thanks,

jon
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH] dax: Arrange for dax_supported check to span multiple devices

2019-05-15 Thread Pankaj Gupta



> 
> Pankaj reports that starting with commit ad428cdb525a "dax: Check the
> end of the block-device capacity with dax_direct_access()" device-mapper
> no longer allows dax operation. This results from the stricter checks in
> __bdev_dax_supported() that validate that the start and end of a
> block-device map to the same 'pagemap' instance.
> 
> Teach the dax-core and device-mapper to validate the 'pagemap' on a
> per-target basis. This is accomplished by refactoring the
> bdev_dax_supported() internals into generic_fsdax_supported() which
> takes a sector range to validate. Consequently generic_fsdax_supported()
> is suitable to be used in a device-mapper ->iterate_devices() callback.
> A new ->dax_supported() operation is added to allow composite devices to
> split and route upper-level bdev_dax_supported() requests.
> 
> Fixes: ad428cdb525a ("dax: Check the end of the block-device...")
> Cc: 
> Cc: Jan Kara 
> Cc: Ira Weiny 
> Cc: Dave Jiang 
> Cc: Mike Snitzer 
> Cc: Keith Busch 
> Cc: Matthew Wilcox 
> Cc: Vishal Verma 
> Cc: Heiko Carstens 
> Cc: Martin Schwidefsky 
> Reported-by: Pankaj Gupta 
> Signed-off-by: Dan Williams 

Thank you for the patch. Looks good to me. 
I also tested the patch and it works well.

Reviewed-and-Tested-by: Pankaj Gupta  

Best regards,
Pankaj


___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH] dax: Arrange for dax_supported check to span multiple devices

2019-05-15 Thread Jan Kara
On Tue 14-05-19 20:48:49, Dan Williams wrote:
> Pankaj reports that starting with commit ad428cdb525a "dax: Check the
> end of the block-device capacity with dax_direct_access()" device-mapper
> no longer allows dax operation. This results from the stricter checks in
> __bdev_dax_supported() that validate that the start and end of a
> block-device map to the same 'pagemap' instance.
> 
> Teach the dax-core and device-mapper to validate the 'pagemap' on a
> per-target basis. This is accomplished by refactoring the
> bdev_dax_supported() internals into generic_fsdax_supported() which
> takes a sector range to validate. Consequently generic_fsdax_supported()
> is suitable to be used in a device-mapper ->iterate_devices() callback.
> A new ->dax_supported() operation is added to allow composite devices to
> split and route upper-level bdev_dax_supported() requests.
> 
> Fixes: ad428cdb525a ("dax: Check the end of the block-device...")
> Cc: 
> Cc: Jan Kara 
> Cc: Ira Weiny 
> Cc: Dave Jiang 
> Cc: Mike Snitzer 
> Cc: Keith Busch 
> Cc: Matthew Wilcox 
> Cc: Vishal Verma 
> Cc: Heiko Carstens 
> Cc: Martin Schwidefsky 
> Reported-by: Pankaj Gupta 
> Signed-off-by: Dan Williams 

Thanks for the fix. The patch looks good to me so feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
> Hi Mike,
> 
> Another day another new dax operation to allow device-mapper to better
> scope dax operations.
> 
> Let me know if the device-mapper changes look sane. This passes a new
> unit test that indeed fails on current mainline.
> 
> https://github.com/pmem/ndctl/blob/device-mapper-pending/test/dm.sh
> 
>  drivers/dax/super.c  |   88 +++---
>  drivers/md/dm-table.c|   17 +---
>  drivers/md/dm.c  |   20 ++
>  drivers/md/dm.h  |1 
>  drivers/nvdimm/pmem.c|1 
>  drivers/s390/block/dcssblk.c |1 
>  include/linux/dax.h  |   19 +
>  7 files changed, 110 insertions(+), 37 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 0a339b85133e..ec2f2262e3a9 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -73,22 +73,12 @@ struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
>  EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
>  #endif
>  
> -/**
> - * __bdev_dax_supported() - Check if the device supports dax for filesystem
> - * @bdev: block device to check
> - * @blocksize: The block size of the device
> - *
> - * This is a library function for filesystems to check if the block device
> - * can be mounted with dax option.
> - *
> - * Return: true if supported, false if unsupported
> - */
> -bool __bdev_dax_supported(struct block_device *bdev, int blocksize)
> +bool generic_fsdax_supported(struct dax_device *dax_dev,
> + struct block_device *bdev, int blocksize, sector_t start,
> + sector_t sectors)
>  {
> - struct dax_device *dax_dev;
>   bool dax_enabled = false;
>   pgoff_t pgoff, pgoff_end;
> - struct request_queue *q;
>   char buf[BDEVNAME_SIZE];
>   void *kaddr, *end_kaddr;
>   pfn_t pfn, end_pfn;
> @@ -102,21 +92,14 @@ bool __bdev_dax_supported(struct block_device *bdev, int blocksize)
>   return false;
>   }
>  
> - q = bdev_get_queue(bdev);
> - if (!q || !blk_queue_dax(q)) {
> - pr_debug("%s: error: request queue doesn't support dax\n",
> - bdevname(bdev, buf));
> - return false;
> - }
> -
> - err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
> + err = bdev_dax_pgoff(bdev, start, PAGE_SIZE, &pgoff);
>   if (err) {
>   pr_debug("%s: error: unaligned partition for dax\n",
>   bdevname(bdev, buf));
>   return false;
>   }
>  
> - last_page = PFN_DOWN(i_size_read(bdev->bd_inode) - 1) * 8;
> + last_page = PFN_DOWN((start + sectors - 1) * 512) * PAGE_SIZE / 512;
>   err = bdev_dax_pgoff(bdev, last_page, PAGE_SIZE, _end);
>   err = bdev_dax_pgoff(bdev, last_page, PAGE_SIZE, &pgoff_end);
>   pr_debug("%s: error: unaligned partition for dax\n",
> @@ -124,20 +107,11 @@ bool __bdev_dax_supported(struct block_device *bdev, int blocksize)
>   return false;
>   }
>  
> - dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
> - if (!dax_dev) {
> - pr_debug("%s: error: device does not support dax\n",
> - bdevname(bdev, buf));
> - return false;
> - }
> -
>   id = dax_read_lock();
>   len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
>   len2 = dax_direct_access(dax_dev, pgoff_end, 1, &end_kaddr, &end_pfn);
>   dax_read_unlock(id);
>  
> - put_dax(dax_dev);
> -
>   if (len < 1 || len2 < 1) {
>   pr_debug("%s: error: dax access failed (%ld)\n",
>   bdevname(bdev, buf), len < 1 ? len : len2);
> @@ -178,6